Internationalization & Localization



The XML 1.0 2nd Edition

Spec 2.2 Characters

A parsed entity contains text, a sequence of characters, which may represent markup or character data. A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000]). Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.
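The paragraph above can be made concrete with a small check. The sketch below assumes the code point ranges of the Char production in Section 2.2 of the XML 1.0 specification; the helper names is_xml_char and first_illegal_char are hypothetical and chosen only for this illustration.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if the single character ch is a legal XML 1.0 character."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF          # BMP below the surrogate block
        or 0xE000 <= cp <= 0xFFFD        # rest of the BMP, minus U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF     # supplementary planes
    )


def first_illegal_char(text: str):
    """Return the index of the first illegal character, or None if all are legal."""
    for index, ch in enumerate(text):
        if not is_xml_char(ch):
            return index
    return None


if __name__ == "__main__":
    # A vertical tab (U+000B) is a control character that XML 1.0 does not allow.
    print(first_illegal_char("ab\x0bcd"))
```

A processor would typically report such a character as a well-formedness error rather than silently dropping it.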

... All XML processors must accept the UTF-8 and UTF-16 encodings of 10646; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, ...

Spec 4.3.3 Character Encoding in Entities

Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
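As a rough illustration of using the Byte Order Mark as an encoding signature, the sketch below inspects the first bytes of an entity. It is a simplification, not a full implementation of the specification's detection rules: it ignores UCS-4 and the encoding declaration, and it assumes UTF-8 when no BOM is present. The function names sniff_bom and decode_entity are hypothetical.

```python
import codecs


def sniff_bom(data: bytes):
    """Guess (encoding name, BOM length in bytes) from a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", 2
    return "utf-8", 0  # assumption for this sketch: no BOM means UTF-8


def decode_entity(data: bytes) -> str:
    """Decode an entity, dropping the BOM because it is not part of the document."""
    encoding, bom_length = sniff_bom(data)
    return data[bom_length:].decode(encoding)


if __name__ == "__main__":
    raw = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
    print(sniff_bom(raw), decode_entity(raw))
```

The BOM is stripped before decoding because, as the text says, it is a signature and not part of the markup or the character data.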

In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings should use names starting with an "x-" prefix. XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).
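The case-insensitive matching described above can be sketched with Python's codec registry. This is only an approximation: the set of names Python knows is not the IANA registry, and resolve_encoding is a hypothetical helper that treats any name Python cannot find as unknown.

```python
import codecs


def resolve_encoding(declared_name: str):
    """Return a canonical codec name for a declared encoding, or None if unknown."""
    try:
        # codecs.lookup ignores case, so "UTF-8" and "utf-8" find the same codec.
        return codecs.lookup(declared_name).name
    except LookupError:
        return None


if __name__ == "__main__":
    for name in ("UTF-8", "utf-8", "Shift_JIS", "EUC-JP", "x-no-such-encoding"):
        print(name, "->", resolve_encoding(name))
```

Returning None rather than guessing mirrors the rule that a processor should either honor the registered meaning of a name or treat the name as unknown.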

ISO/IEC 10646 and Unicode

Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16 bits were not sufficient for the user community. Out of this arose UTF-16.

A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode scalar value to a unique byte sequence. UTF-16 is a transformation format that makes 16 additional planes available beyond the Basic Multilingual Plane (BMP) while remaining compatible with UCS-2. UCS-2 had the advantage of representing every character in a fixed length of 2 bytes. In UTF-16, characters of Plane 0x00 (the BMP) are represented as 2 bytes, and characters of Planes 0x01 to 0x10 are represented as 4 bytes. UTF-16 therefore allows access to about 63K characters as single 16-bit units and, in addition, to about 1M characters through a mechanism known as surrogate pairs.

Two ranges of Unicode code values are reserved for the high (first) and low (second) values of these pairs. The first 2 bytes of a 4 byte character are selected from high-half zone codes, from 0xD800 to 0xDBFF, and the second 2 bytes of a 4 byte character are selected from low-half zone codes, from 0xDC00 to 0xDFFF. In the additional planes, Planes 0x0F and 0x10 are defined as Private Use Planes (PUPs).
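The surrogate mechanism described above amounts to a little arithmetic: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The sketch below shows that calculation; the function names are hypothetical and exist only for this example.

```python
def to_surrogate_pair(code_point: int):
    """Split a supplementary code point into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only Planes 0x01 to 0x10 need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits, 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits, 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(hex(high), hex(low))
    print(hex(from_surrogate_pair(high, low)))
```

The two ranges hold 1,024 values each, which is where the figure of roughly 1M (1,024 x 1,024) additional characters comes from.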

The ISO 10646 standard uses the term "UCS (Universal Character Set) transformation format" for UTF; the two terms are merely synonyms for the same concept.

Unicode 3.0 has the same character repertoire as ISO/IEC 10646-1:2000

Unicode 2.1 has the same character repertoire as ISO/IEC 10646-1:1993

Although the character codes are synchronized between Unicode and ISO/IEC 10646, the Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. To this end, it supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646. For example, in Unicode 3.0, there are no assigned surrogate pairs. Since the most common characters have already been encoded in the first 64K values, the characters requiring surrogate pairs will be relatively rare.

Character Sets and the HTML Standard

The standard character set for computers has traditionally been US-ASCII (American Standard Code for Information Interchange). (The "US" is to distinguish it from non-US ASCIIs.) This group of characters is numbered from 0 through 127 and comprises the upper- and lowercase letters, numbers, and punctuation, as well as some control characters such as tabs and linefeeds. But no foreign characters or specialized symbols are defined. Hence, Windows, the Mac, and the IBM PC's text mode have each made their own sets of extended characters. Since these differ widely from one another, the only "safe" characters to transmit and count on the user receiving correctly have traditionally been the "7-bit" characters 0-127.

Originally, the Web adopted an extended character set, ISO 8859-1 (sometimes referred to as ISO Latin-1), as its standard. (Or at least it did before HTML 4.0; the new standard, out of international political correctness, has no default character set in order not to favor one language over another, and requires an explicit "charset" parameter in the HTTP content-type header; despite this, the World Wide Web Consortium site has contradictory information on its own character set page.) This set includes various foreign and special characters. Thus, you can use these characters if your editing program saves them in the standardized ISO codes, but not if system-specific character codes are used. (The Windows character set is mostly the same as the ISO character set, but the Macintosh set is completely different.)

HTML 4.0 adopted Unicode as the official document character set. As a result, numeric character references are always interpreted with regard to Unicode, not with regard to the character encoding, which is the character set used to transmit the characters over the network (and possibly also to store the Web pages on the server's file system, though not necessarily, since the server might transform the characters as it transmits them). No standard encoding is defined; it is supposed to be specified in the HTTP content-type header. Numeric character references, however, should be unaffected by the encoding chosen for a document.
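The distinction between the document character set and the character encoding can be illustrated as follows. The sketch uses html.unescape from Python's standard library as a stand-in for a browser's reference expansion (an assumption made for illustration; a real HTML parser does more), and expand_reference is a hypothetical wrapper.

```python
import html


def expand_reference(reference: str) -> str:
    """Expand a numeric character reference such as '&#233;' to its character."""
    return html.unescape(reference)


e_acute = expand_reference("&#233;")    # decimal reference to code point 233
euro = expand_reference("&#x20AC;")     # hexadecimal reference to code point 0x20AC

if __name__ == "__main__":
    # The reference names a Unicode code point, whatever the page encoding is.
    print(hex(ord(e_acute)), hex(ord(euro)))
    # The bytes sent over the network do depend on the encoding.
    for encoding in ("iso-8859-1", "utf-8", "utf-16-be"):
        print(encoding, e_acute.encode(encoding).hex())
```

The same reference yields the same character in every case; only the byte representation of that character changes with the encoding.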

The first 256 characters of Unicode (#0-#255) are equivalent to the ISO Latin-1 standard, which in turn has its first 128 characters equivalent to the older US-ASCII (with the minor exception that Unicode has chosen not to define the functions of the control characters #0-#31 and #128-#159, leaving them entirely system-specific), so existing Web documents will work the same as always. But the additional characters #256-#65535 are also available, covering the scripts of many other languages, mathematical symbols, and more, including curly quotes.
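The equivalence of the first 256 Unicode code points with ISO Latin-1, and of the first 128 with US-ASCII, can be checked mechanically. The sketch below relies on Python's built-in "iso-8859-1" and "ascii" codecs; the function names are hypothetical.

```python
def latin1_matches_unicode() -> bool:
    """Check that every ISO 8859-1 byte decodes to the code point of the same value."""
    return all(bytes([value]).decode("iso-8859-1") == chr(value) for value in range(256))


def ascii_matches_latin1() -> bool:
    """Check that the first 128 byte values mean the same in US-ASCII and ISO 8859-1."""
    return all(
        bytes([value]).decode("ascii") == bytes([value]).decode("iso-8859-1")
        for value in range(128)
    )


if __name__ == "__main__":
    print(latin1_matches_unicode(), ascii_matches_latin1())
```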

If UCS is used on UNIX systems, it is recommended that UTF-8 be used as the multibyte character encoding (the contents of the char data type in the C language and the contents of text files) and that UCS-4 (not UCS-2 or UTF-16), together with the Private Use Planes (PUPs), be used as the wide character encoding. Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units. On most UNIX systems, the "wide character" data type of the C programming language (wchar_t) is defined as a 32-bit integer. If Private Use Planes are used for user-defined characters (UDCs), such UDCs are converted from UCS-4 to UTF-8 and used in the multibyte encoding.
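The endianness point can be seen by encoding the same text several ways. In the sketch below, Python's "utf-32-be" and "utf-32-le" codecs stand in for UCS-4 in the two byte orders (an assumption for illustration), and U+F0000 serves as an example of a code point in a Private Use Plane.

```python
text = "A\u20ac"  # LATIN CAPITAL LETTER A followed by EURO SIGN

utf8 = text.encode("utf-8")           # one byte sequence, no byte-order variants
utf16_be = text.encode("utf-16-be")   # 16-bit units, most significant byte first
utf16_le = text.encode("utf-16-le")   # 16-bit units, least significant byte first
ucs4_be = text.encode("utf-32-be")    # 32-bit units, most significant byte first
ucs4_le = text.encode("utf-32-le")    # 32-bit units, least significant byte first

# A user-defined character in Private Use Plane 0x0F, converted to UTF-8.
udc_utf8 = chr(0xF0000).encode("utf-8")

if __name__ == "__main__":
    for label, data in (
        ("UTF-8", utf8),
        ("UTF-16BE", utf16_be),
        ("UTF-16LE", utf16_le),
        ("UCS-4 BE", ucs4_be),
        ("UCS-4 LE", ucs4_le),
        ("U+F0000 in UTF-8", udc_utf8),
    ):
        print(label, data.hex())
```

The 16-bit and 32-bit forms each come in two byte orders, while UTF-8 has only one serialization.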

Standard UTF-8 does not interoperate well on EBCDIC systems because the control codes are arranged differently. UTF-EBCDIC is intended for interoperability on EBCDIC systems.
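To show how differently EBCDIC lays out its code positions, the sketch below compares UTF-8 bytes with those of one EBCDIC code page. The choice of Python's "cp037" codec (a US/Canada EBCDIC code page) is an assumption made for illustration; other EBCDIC code pages differ in detail, and this is not an implementation of UTF-EBCDIC.

```python
def compare_encodings(text: str):
    """Return (character, UTF-8 byte, EBCDIC cp037 byte) for each ASCII character."""
    rows = []
    for ch in text:
        rows.append((ch, ch.encode("utf-8")[0], ch.encode("cp037")[0]))
    return rows


if __name__ == "__main__":
    # In UTF-8 the letters I and J are adjacent; in cp037 there is a gap between them.
    for ch, utf8_byte, ebcdic_byte in compare_encodings("AIJRSZ"):
        print(ch, hex(utf8_byte), hex(ebcdic_byte))
```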

UTF-8 is the most common encoding on the Web.

UTF-16, UTF-16LE, and UTF-16BE are used by Java and Windows.

UTF-32, UTF-32LE, and UTF-32BE are used by various UNIX systems.

    Reference:
  1. W3C XML 1.0 2nd edition
  2. ISO Working Group
  3. Unicode Consortium FAQs
  4. Dan's Web Tips: Characters and Fonts
  5. Problems and Solutions for Unicode and User/Vendor Defined Characters

Unicode Glossary

ANSI. (1) The American National Standards Institute. (2) The Microsoft collective name for all Windows code pages. Sometimes used specifically for code page 1252, which is a superset of ISO/IEC 8859-1.

ASCII. Acronym for American Standard Code for Information Interchange, a 7-bit code that is the U.S. national variant of ISO/IEC 646. Formally, the U.S. standard ANSI X3.4.

BMP. Abbreviation for Basic Multilingual Plane.

BMP Code Point. A Unicode code point between U+0000 and U+FFFF. See supplementary code point.

BMP Character. A Unicode encoded character having a BMP code point. See supplementary character.

BOM. Acronym for byte order mark.

Control Codes. The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F. Also known as control characters.

EBCDIC. Acronym for Extended Binary-Coded Decimal Interchange Code. A group of coded character sets used on mainframes that consist of 8-bit coded characters. EBCDIC coded character sets reserve the first 64 code positions (x00 to x3F) for control codes, and reserve the range x41 to xFE for graphic characters. The English alphabetic characters are in discontinuous segments with uppercase at xC1 to xC9, xD1 to xD9, xE2 to xE9, and lowercase at x81 to x89, x91 to x99, xA2 to xA9.

Encoded Character. An abstract character together with its associated Unicode scalar value (code point). By itself, an abstract character has no numerical value, but the process of "encoding a character" associates a particular Unicode scalar value with a particular abstract character, thereby resulting in an "encoded character."

Fancy Text. Also known as rich text. The result of adding additional information to plain text. Examples of information that can be added include font data, color, formatting information, phonetic annotations, interlinear text, and so on. The Unicode Standard does not address the representation of fancy text. It is expected that systems and applications will implement proprietary forms of fancy text. Some public forms of fancy text are available (for example, ODA, HTML, and SGML). When everything but primary content is removed from fancy text, only plain text should remain.

Plane. A range of 65,536 (0x10000) contiguous Unicode code points, where the first code point is an integer multiple of 65,536 (0x10000). Planes are numbered from 0 to 16, with the number being the first code point of the plane divided by 65,536. Thus Plane 0 is U+0000..U+FFFF, Plane 1 is U+10000..U+1FFFF, ..., and Plane 16 (0x10) is U+100000..U+10FFFF. (Note that ISO/IEC 10646 uses hexadecimal notation for the plane numbers, e.g. Plane B instead of Plane 11.) See Basic Multilingual Plane and Supplementary Planes.

SGML. Standard Generalized Markup Language. A standard framework for defining particular text markup languages. The SGML framework allows for mixing structural tags that describe format with the plain text content of documents, so that fancy text can be fully described in a plain text stream of data. (See also HTML, XML, and fancy text.)

Surrogate Code Point. A Unicode code point in the range U+D800 through U+DFFF. Reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) "stand in" for a supplementary code point.

Surrogate Character. A misnomer. It would be an encoded character having a surrogate code point, which is impossible. Do not use this term.

Surrogate Pair. A coded character representation for a single abstract character that consists of a sequence of two code units, where the first unit of the pair is a high-surrogate and the second is a low-surrogate. (See Definition D27 in Section 3.7, Surrogates.)

Transcoding. Conversion of character data between different character sets.

UCS. Abbreviation for Universal Character Set, which is specified by International Standard ISO/IEC 10646.

UCS-2. ISO/IEC 10646 encoding form: Universal Character Set coded in 2 octets. (See Appendix C, Relationship to ISO/IEC 10646.)

UCS-4. ISO/IEC 10646 encoding form: Universal Character Set coded in 4 octets. (See Appendix C, Relationship to ISO/IEC 10646.)

Unicode Signature. An implicit marker to identify a file as containing Unicode text in a particular encoding form. An initial byte order mark (BOM) may be used as a Unicode signature.

Unicode (or UCS) Transformation Format. (See Definition D29 in Section 3.8, Transformations; see also Section C.3, UCS Transformation Formats.)

UTF. Abbreviation for Unicode (or UCS) Transformation Format.

UTF-2. Obsolete name for UTF-8.

UTF-7. Unicode (or UCS) Transformation Format, 7-bit encoding form, specified by RFC-2152.

UTF-8. Unicode (or UCS) Transformation Format, 8-bit encoding form. UTF-8 is the Unicode Transformation Format that serializes a Unicode scalar value (code point) as a sequence of one to four bytes, as specified in Table 3-1, UTF-8 Bit Distribution. (See Definition D36 in Section 3.8, Transformations.)

UTF-16. Unicode (or UCS) Transformation Format, 16-bit encoding form. UTF-16 is the Unicode Transformation Format that serializes a Unicode scalar value (code point) as a sequence of two bytes, or of four bytes (a surrogate pair) for a supplementary code point, in either big-endian or little-endian format. (See Definition D35 in Section 3.8, Transformations.)

UTF-16BE. The Unicode Transformation Format that serializes a Unicode scalar value (code point) as a sequence of two bytes, or of four bytes (a surrogate pair) for a supplementary code point, in big-endian format. An initial sequence corresponding to U+FEFF is interpreted as a ZERO WIDTH NO-BREAK SPACE. (See Definition D33 in Section 3.8, Transformations.)

UTF-16LE. The Unicode Transformation Format that serializes a Unicode scalar value (code point) as a sequence of two bytes, or of four bytes (a surrogate pair) for a supplementary code point, in little-endian format. An initial sequence corresponding to U+FEFF is interpreted as a ZERO WIDTH NO-BREAK SPACE. (See Definition D34 in Section 3.8, Transformations.)

XML. eXtensible Markup Language. A subset of SGML constituting a particular text markup language for interchange of structured data. The Unicode Standard is the reference character set for XML content. (See also SGML and fancy text.) XML is a trademark of the World Wide Web Consortium.
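Several of the entries above (Plane, BMP Code Point, Surrogate Code Point) reduce to simple range tests on a code point. The sketch below shows one way to express them; the function names are hypothetical and used only for this illustration.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16  # same as integer division by 65,536


def is_bmp_code_point(code_point: int) -> bool:
    """True for code points U+0000..U+FFFF, i.e. Plane 0."""
    return plane_of(code_point) == 0


def is_surrogate_code_point(code_point: int) -> bool:
    """True for code points U+D800..U+DFFF, which are reserved for UTF-16."""
    return 0xD800 <= code_point <= 0xDFFF


if __name__ == "__main__":
    for cp in (0x0041, 0xD800, 0x10000, 0xB0000, 0x10FFFF):
        print(f"U+{cp:04X}", plane_of(cp), is_bmp_code_point(cp), is_surrogate_code_point(cp))
```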

    Reference:
  1. The official Web site of Unicode Glossary
  2. This glossary is originally based on The Unicode Standard, Version 3.0