Unicode and Glyph Names
[ Document version 1.1. Last updated 17 December 1998 ]
This document describes Adobe(R)'s PostScript(R) glyph naming conventions in the context of Unicode. The purpose of these conventions is to attach standardized semantics to glyph names, including glyphs that represent characters that don't have standard Unicode values (UVs) like certain ligatures or glyphic variants. Two perspectives are presented: that of the font developer, when deciding what to name the glyphs in a font; and that of any process that needs to extract Unicode semantics from glyph names, such as a Type 1-to-OpenType converter when creating a Unicode 'cmap', or the search facility in an application that does not use OpenType layout tables. The 3 data files referred to in this document are:
The Unicode Standard 2.1 has been used in this document and related data files, except for the 4 characters mentioned in the header of the AGL data file.

Font developers should follow these guidelines in all cases where glyph names are needed: Type 1 fonts, OpenType fonts with non-CID-keyed CFF data, and TrueType fonts and OpenType fonts with TrueType data that contain 'post' tables with implicit or explicit glyph names.

2.a. Maximum name length and permissible characters

A glyph name may be up to 31 characters in length and must consist entirely of characters from the following set:
A-Z a-z 0-9 . (period) _ (underscore)

and must not start with a digit or period. The only exception is the special glyph name ".notdef". "twocents", "a1", and "_" are valid glyph names. "2cents" and ".twocents" are not.

2.b. The uni<CODE> glyph naming convention

In certain situations described in this document, a glyph needs to be named according to the uni<CODE> convention. This means that the glyph name should be of the format "uni<CODE>", where <CODE> is the glyph's UV represented as a 4-digit uppercase hexadecimal number. The "uni" component must be lowercase. For example, the uni<CODE> glyph name for U+054A, ARMENIAN CAPITAL LETTER PEH, should be "uni054A".

Each step below should be considered, in order, until a name is assigned to a particular glyph. The process should be repeated for every glyph. To verify the choice of glyph name, the result of applying the algorithm in section 3.a to the glyph name can be compared with the intended meaning. Note that font developers can implement a glyph aliasing mechanism in their production tools that could provide more descriptive glyph name aliases for uni<CODE> or any other glyph names, as long as the glyph names in the final font follow the guidelines below.

If the character that the glyph represents has a standard UV, i.e. a UV assigned by the Unicode Standard, then assign its name as follows:

If the UV is in the Adobe Glyph List, use the glyph name associated with it in AGL. For example, the glyph name for U+00C1, LATIN CAPITAL LETTER A WITH ACUTE, should be "Aacute". If the UV is one of a double-mapping in AGL, and if separate designs are desired for each UV, then one glyph should be given the AGL name and the other should be given a uni<CODE> name according to the table in section 4.c.

If the UV is not in AGL, use a uni<CODE> glyph name. For example, the glyph name for U+0AB8, GUJARATI LETTER SA, should be "uni0AB8".
If the glyph represents a Unicode surrogate character, name the glyph "uni<CODE1><CODE2>", where <CODE1> is the high-surrogate UV, and <CODE2> is the low-surrogate UV. No surrogate characters have been assigned by the Unicode Standard as of Version 2.1.

If the character that the glyph represents is in the data file "Unicode's Corporate Use Subarea as used by Adobe," then assign its name as follows:

If the CUS UV is in AGL, use the glyph name associated with it in AGL. For example, the glyph name for CUS U+F761, LATIN SMALL CAPITAL LETTER A, should be "Asmall".

If the CUS UV is not in AGL, the glyph should be named according to the uni<CODE> convention. For example, the glyph name for CUS U+F66D, LATIN SMALL CAPITAL LETTER A WITH BREVE, should be "uniF66D".

(If this point is reached, the character that the glyph represents has neither a standard nor a CUS UV.) If the character that the glyph represents is a ligature, or otherwise decomposes into standard Unicode or CUS characters, then two formats are available for its name:
2.c.iv. Non-Unicode glyphic variant

If the glyph is a glyphic variant of a character in category (i), (ii), or (iii) above, the glyph name is of the form:
Note the period after the base glyph name. An optional variant descriptor can follow the period. <base glyph name> is:
Any process which determines semantics from glyph names, such as the one described in section 3.a, will ignore the descriptor, if present. The descriptor may contain periods or any other permitted characters; only the first period in the glyph name is relevant. For example, a variant of the "T h" ligature can be named "T_h.swash". "T.swash_h" would be incorrect since this will be interpreted as a glyphic variant of "T". Some of Adobe's internal conventions for variant descriptors are listed below; other developers may use or ignore these additional conventions as they see fit.
If this step is reached, then the glyph has no useful semantic value. Examples of such glyphs include ornaments. Any glyph name may be assigned as long as it cannot be interpreted as a glyph name from (i) to (iv) above. Adobe's current internal practice is to name ornaments "orn.001", "orn.002" and so on; developers may choose other naming conventions.

Do not use a uni<CODE> name for a glyph that is in AGL (except for one of a double-mapping), since this might produce undesirable results in pieces of software that use only the AGL glyph name to test for the presence of a particular UV in a font. For example, the Adobe PostScript driver for Windows re-encodes a Type 1 (PFB/PFM) font to a particular Windows code page on the basis of AGL glyph names; it does not recognize uni<CODE> glyph names. When re-encoding to Windows ANSI (code page 1252) and printing code point 0xE9 (U+00E9, LATIN SMALL LETTER E WITH ACUTE), for instance, it will print the glyph "eacute" if present and ".notdef" otherwise, even if the glyph "uni00E9" is present. Ideally, such an application would first check to see whether the font had a uni<CODE> name for a particular UV before using the AGL name (including double-mapped AGL glyph names). However, this might not be acceptable to the application since the re-encoding array could change depending on the font.

Note that Adobe Type Manager(R) for Windows NT(R) (ATM(R)/NT) tests whether a Type 1 font supports a particular Windows code page by the presence of the following UVs. If there are two UVs indicated for a code page then the presence of either one is sufficient for the code page to be considered supported.
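The name-syntax rule of section 2.a and the uni<CODE> convention of section 2.b can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, `AGL_EXCERPT` is a three-entry stand-in for the real Adobe Glyph List, and uppercase letters are assumed to be permitted because the document's own examples (such as "Aacute") require them.

```python
import re

# Section 2.a: at most 31 characters from A-Z a-z 0-9 . _ and no leading
# digit or period. (Uppercase letters are an assumption; see the lead-in.)
_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9._]{0,30}")

# Hypothetical excerpt of AGL (UV -> glyph name), for illustration only.
AGL_EXCERPT = {0x00C1: "Aacute", 0x00E9: "eacute", 0xF761: "Asmall"}


def is_valid_glyph_name(name):
    """Check the syntax rule of section 2.a; ".notdef" is the one exception."""
    return name == ".notdef" or _NAME_RE.fullmatch(name) is not None


def uni_name(uv):
    """Build a uni<CODE> name: lowercase "uni" plus 4 uppercase hex digits."""
    if not 0 <= uv <= 0xFFFF:
        raise ValueError("a single uni<CODE> name holds one 16-bit UV")
    return "uni%04X" % uv


def name_for_uv(uv, agl=AGL_EXCERPT):
    """Prefer the AGL name when the UV is in AGL, else fall back to uni<CODE>."""
    return agl.get(uv) or uni_name(uv)
```

With the excerpt above, `name_for_uv(0x00C1)` gives "Aacute" and `name_for_uv(0x0AB8)` gives "uni0AB8", matching the examples in section 2.c.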
The guidelines in 3.a should be followed by any application that needs to determine the meaning of a glyph from its glyph name. Sections 3.b and 3.c give examples of such applications. The following pseudocode should be implemented for all fonts except Zapf Dingbats (PostScript FontName: ZapfDingbats); this has its own separate Unicode lookup table in the data file "Zapf Dingbats Glyph Names and UVs." The pseudocode does not determine the validity of a glyph name. ".notdef" is special: it is used when a glyph name in a PostScript encoding does not exist in a font. It does not have a UV and does not appear in AGL.
    getGlyphNameSemantics()

    Input:  glyphName g
    Output: UV+       # One or more UVs
            isDecomp  # A boolean, indicating that UV+ is a decomposition,
                      # as opposed to 1 or 2 UVs (2 for an AGL double-mapping)
            isVar     # A boolean, indicating that g is a glyphic variant of UV+
    {
        isDecomp = false;
        isVar = false;

        If g contains a period:                               # Sec. I
            isVar = true;
            g = everything before the first period in g;
            If g is empty:
                Return (UNRECOGNIZED, -, -);

        If g is in AGL:                                       # Sec. II
            Return (AGL UVs, isDecomp, isVar);

        If g is of the form uni<CODE>:                        # Sec. III
            Return (<CODE>, isDecomp, isVar);

        isDecomp = true;                                      # Sec. IV
        If g contains an underscore:
            Split g by underscores;
            If each component yields a UV by sections II or III above:
                Return (<UV1><UV2><UV3>..., isDecomp, isVar);
            Return (UNRECOGNIZED, -, -);

        If g is of the form "uni" with 2 or more <CODE>s following it:   # Sec. V
            Return (<UV1><UV2><UV3>..., isDecomp, isVar);

        Return (UNRECOGNIZED, -, -);                          # Sec. VI
    }
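For readers who prefer runnable code, the pseudocode above might be transcribed into Python roughly as follows. This is a sketch under stated assumptions, not Adobe's implementation: `AGL_EXCERPT` is a hypothetical handful of AGL entries, `None` stands in for UNRECOGNIZED, and a double-mapped component inside a decomposition contributes only its first UV (the pseudocode does not say how that case should be handled).

```python
import re

# Hypothetical excerpt of AGL (glyph name -> UVs); the real list is far larger.
AGL_EXCERPT = {
    "Aacute": [0x00C1],
    "T": [0x0054],
    "h": [0x0068],
    "mu": [0x00B5, 0x03BC],  # an AGL double-mapping
    "ampersand": [0x0026],
    "zerooldstyle": [0xF730],
}

_UNI_ONE = re.compile(r"uni([0-9A-F]{4})")
_UNI_MANY = re.compile(r"uni((?:[0-9A-F]{4}){2,})")


def _single(name, agl):
    """Sections II and III: AGL lookup, then a single uni<CODE> name."""
    if name in agl:
        return list(agl[name])
    m = _UNI_ONE.fullmatch(name)
    return [int(m.group(1), 16)] if m else None


def get_glyph_name_semantics(g, agl=AGL_EXCERPT):
    """Return (uvs, is_decomp, is_var), or None when g is unrecognized."""
    is_var = False
    if "." in g:                                  # Sec. I
        is_var = True
        g = g.split(".", 1)[0]                    # keep text before first period
        if not g:
            return None
    uvs = _single(g, agl)                         # Sec. II and III
    if uvs is not None:
        return uvs, False, is_var
    if "_" in g:                                  # Sec. IV
        decomposition = []
        for part in g.split("_"):
            part_uvs = _single(part, agl)
            if part_uvs is None:
                return None
            decomposition.append(part_uvs[0])     # assumption: first UV only
        return decomposition, True, is_var
    m = _UNI_MANY.fullmatch(g)                    # Sec. V
    if m:
        digits = m.group(1)
        codes = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        return codes, True, is_var
    return None                                   # Sec. VI
```

With this excerpt, "T_h.swash" is reported as a glyphic variant of the decomposition U+0054 U+0068, while "T.swash_h" is reported as a variant of U+0054 alone, which is the behaviour section 2.c.iv warns about.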
When two UVs are returned by a double-mapped glyph, and only one can be
accepted, then the UV that corresponds to the first of each pair in the
table in section 4.c should be used. Some sample inputs and outputs of this function are:
Note that the UV or UVs that this function will produce might be CUS UVs. Each such UV can be decomposed into its standard Unicode values by consulting the decompositions in the data file "Unicode's CUS as used by Adobe." In the example of "zerooldstyle" in the table above, CUS U+F730 decomposes into "<osf> 0030", a glyphic variant of DIGIT ZERO (U+0030).

Surrogate character names (described in section 2.c.i) can be easily distinguished from two-component non-Unicode ligatures (described in section 2.c.iii Format 2): the high-surrogate UV will be in the range U+D800 through U+DBFF, and the low-surrogate UV will be in the range U+DC00 through U+DFFF, as defined by the Unicode Standard.

3.b. Populating a Unicode space

If an application is interested only in extracting standard or CUS UVs (categories (i) and (ii) in section 2.c), it can modify the algorithm to simply delete sections I, IV, and V. Examples of such applications include a Type 1-to-OpenType converter, when creating the Unicode 'cmap'; and ATM/NT, when loading Type 1 fonts, since Windows NT represents all characters in terms of Unicode.

If the uni<CODE> as well as the AGL glyph name for a particular UV are present, then the uni<CODE> glyph should take precedence at that UV.

If an application wants to encode unrecognized glyphs in Unicode, it should do so in the End User subarea, by sorting the unrecognized glyph names in the font by increasing ASCII order, and assigning them to a contiguous run of UVs starting at U+E000, the lower end of the Private Use Area. If this run of UVs overlaps with the UV assigned to a glyph with a uni<CODE> name, the results are undefined. The Unicode Standard makes no provision for avoiding a "stack-heap collision" between the End User subarea and CUS. Furthermore, Microsoft(R) will treat the range U+F000 through U+F0FF as the definition of its symbol code page.

3.c. Search and copy/paste facilities

If a font's glyphs have been properly named, search facilities can accurately locate all glyphic variants of the search string's characters. For example, if the user types in the letter "t", then glyphs "t", "Tsmall", "t.swash", "t.begin", and "t.end" can all be matched. The same principle applies to copy/paste facilities. For example, glyph "ampersand.alt01", when copied and pasted from one application into another, would be known to be a glyphic variant of AMPERSAND (U+0026), and the regular ampersand could be used to display the character as a fallback strategy.

Note that applications that have access to a font's OpenType layout tables can also glean this information from the various features in the glyph substitution ('GSUB') table. This would be the only recourse to identify non-Unicode glyphic variants for fonts that do not have glyph names, such as OpenType fonts with CID-keyed CFF data.

The Adobe Glyph List includes the complete character complements from:
The glyphs in Zapf Dingbats are in a separate table and are recognized by the UV-assigning algorithm as a special case (see section 3.a). The Unicode Standard states that character assignments in the CUS could be completely internal, hidden from end users, and used only for vendor-specific application support, or could be published as vendor-specific character assignments available to applications and end users. The CUS characters in AGL fall into the latter category in that they are available to end users; however, several of them, such as the Cyrillic glyphic variants, are not vendor-specific, and would be useful to several vendors. In fact, we envision the CUS as a collaborative effort among vendors, wherein each vendor ensures that new assignments do not overlap with existing ones. This shared approach ensures optimal use of the limited UVs available. It also avoids the obvious problems that applications would have in identifying the vendor of a font in order to determine which vendor's CUS assignments were in effect. In addition, we regard CUS assignments as useful until OpenType features become widely available in fonts and supported by applications. Apple has published CUS assignments in the range U+F800 through U+F8FF. Adobe uses CUS assignments in the range U+F600 through U+F7FF, as well as the same assignments for some characters in Symbol and Zapf Dingbats from the Apple-defined range. Microsoft is treating the range U+F000 through U+F0FF as the definition of its symbol code page. Glyph names of future CUS characters should follow the uni<CODE> naming convention. See the data file "Unicode's CUS as used by Adobe" for a description of the CUS assignments and their Unicode-style character decompositions. AGL contains certain double-mappings, i.e. glyphs that are mapped to two UVs for compatibility with legacy fonts. 
If a developer wishes to provide separate designs for a double-mapping (see section 2.c.i), then one of the UVs may have a uni<CODE> glyph name. AGL 1.2 contains the following double-mapped glyphs:
For example, if different designs are desired for MICRO SIGN (U+00B5) and GREEK SMALL LETTER MU (U+03BC), then glyph "mu" should be designed as MICRO SIGN and glyph "uni03BC" should be designed as GREEK SMALL LETTER MU.

Developers should be aware, however, that the Adobe PostScript driver for Windows re-encodes a Type 1 (PFB/PFM) font to a particular Windows code page on the basis of AGL glyph names; it does not recognize uni<CODE> glyph names. For example, when the driver needs to re-encode to Windows Greek, it will use glyph "mu" for both Windows Greek code point 0xB5 (MICRO SIGN) and code point 0xEC (GREEK SMALL LETTER MU), even if the glyph "uni03BC" is present. This limitation is not present for GDI printing with OpenType or TrueType fonts. Adobe's current plans are to provide separate designs for some double-mappings in OpenType fonts. If developers produce PFB fonts that have separate designs for double-mappings, ensuring that the advance widths of both glyphs in each pair of designs are the same will prevent line rewrap in the problem situation.
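The interaction between double-mappings and the cmap-population rules of section 3.b can be illustrated with a short Python sketch. It is not a converter implementation: `build_cmap` and `AGL_EXCERPT` are hypothetical names, the excerpt holds only two AGL entries, ".notdef" is skipped because it has no UV, and raising an error on a Private Use Area collision is merely one possible reaction to a case the document leaves undefined.

```python
import re

# Hypothetical AGL excerpt (glyph name -> UVs); "mu" is the double-mapping
# used as the example in section 4.c.
AGL_EXCERPT = {"mu": [0x00B5, 0x03BC], "eacute": [0x00E9]}

_UNI = re.compile(r"uni([0-9A-F]{4})")


def build_cmap(glyph_names, agl=AGL_EXCERPT, pua_start=0xE000):
    """Map UVs to glyph names along the lines of section 3.b (sketch only)."""
    cmap = {}
    unrecognized = []
    for name in glyph_names:
        m = _UNI.fullmatch(name)
        if m:
            # A uni<CODE> glyph takes precedence at its UV.
            cmap[int(m.group(1), 16)] = name
        elif name in agl:
            for uv in agl[name]:
                # setdefault never displaces a uni<CODE> glyph already mapped.
                cmap.setdefault(uv, name)
        elif name != ".notdef":
            unrecognized.append(name)
    # Unrecognized glyphs: increasing ASCII order, contiguous from U+E000.
    for offset, name in enumerate(sorted(unrecognized)):
        uv = pua_start + offset
        if uv in cmap:
            raise ValueError("PUA run collides with %s" % cmap[uv])
        cmap[uv] = name
    return cmap
```

For a font containing "mu" only, both U+00B5 and U+03BC map to "mu"; once "uni03BC" is added, U+03BC maps to "uni03BC" and "mu" keeps U+00B5, regardless of the order in which the glyph names are visited.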
Copyright © 1999 Adobe Systems Incorporated.