Internationalization & Localization
A parsed entity contains text, a sequence of characters, which may represent markup or character data. A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000]). Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.
... All XML processors must accept the UTF-8 and UTF-16 encodings of 10646; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, ...
Spec 4.3.3 Character Encoding in Entities
Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
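The signature check described above can be sketched in a few lines of Python. This is only an illustration, not how any particular XML processor works: the function name sniff_unicode_encoding is hypothetical, and the fallback to UTF-8 when no Byte Order Mark is present is an assumption of this sketch (a real processor would also consult the encoding declaration and handle other encodings).

```python
import codecs


def sniff_unicode_encoding(data: bytes) -> str:
    """Guess between UTF-8 and UTF-16 from the leading bytes only."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    # No UTF-16 signature found. Assumption of this sketch: treat the
    # entity as UTF-8 and ignore every other possible encoding.
    return "utf-8"


big_endian_doc = codecs.BOM_UTF16_BE + "<a/>".encode("utf-16-be")
print(sniff_unicode_encoding(big_endian_doc))
print(sniff_unicode_encoding(b"<a/>"))
```

The sketch deliberately looks only at the first two bytes, because #xFEFF is a signature and not part of the document content; the caller would strip it before parsing.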
In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings should use names starting with an "x-" prefix. XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).
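As a rough illustration of case-insensitive name matching, the sketch below leans on Python's codecs registry, which already ignores case when looking up an encoding name. The helper resolve_encoding is a hypothetical name, and returning None for an unknown name is this sketch's way of modelling "treat it as unknown"; note that Python's registry is not the IANA registry, so it only approximates the behavior the spec describes.

```python
import codecs
from typing import Optional


def resolve_encoding(declared: str) -> Optional[str]:
    """Return Python's canonical codec name, or None if the name is unknown."""
    try:
        return codecs.lookup(declared).name
    except LookupError:
        # Unknown to this runtime: report it as unsupported, do not guess.
        return None


# Different spellings of the same declared name resolve to the same codec.
print(resolve_encoding("UTF-8") == resolve_encoding("utf-8"))
print(resolve_encoding("Shift_JIS") == resolve_encoding("SHIFT_JIS"))
print(resolve_encoding("x-no-such-encoding"))
```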
Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16 bits were not sufficient for the user community. Out of this arose UTF-16.
A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode scalar value to a unique byte sequence. UTF-16 is a transformation format that enables the use of 16 planes in addition to the Basic Multilingual Plane (BMP) while keeping compatibility with UCS-2. UCS-2 had the advantage of being able to handle all characters in a fixed length of 2 bytes. In UTF-16, characters of Plane 0x00 (the BMP) are represented as 2 bytes, and characters of Planes 0x01 to 0x10 are represented as 4 bytes. Therefore, UTF-16 allows access to about 63K characters as single 16-bit code units. It can, in addition, access about 1M characters by a mechanism known as surrogate pairs.
Two ranges of Unicode code values are reserved for the high (first) and low (second) values of these pairs. The first 2 bytes of a 4 byte character are selected from high-half zone codes, from 0xD800 to 0xDBFF, and the second 2 bytes of a 4 byte character are selected from low-half zone codes, from 0xDC00 to 0xDFFF. In the additional planes, Planes 0x0F and 0x10 are defined as Private Use Planes (PUPs).
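The arithmetic behind surrogate pairs can be sketched as follows. The function name to_surrogate_pair is hypothetical; the calculation itself follows the ranges given above: subtract 0x10000 to get a 20-bit offset, put the upper 10 bits in the high-half zone and the lower 10 bits in the low-half zone.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point from Planes 0x01-0x10 into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only Planes 0x01 to 0x10 need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits -> 0xDC00..0xDFFF
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
print(chr(0x1F600).encode("utf-16-be").hex())  # d83dde00
print(len("A".encode("utf-16-be")))            # 2 bytes for a BMP character
```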
The ISO 10646 standard uses the term "UCS (Universal Character Set) transformation format" for UTF; the two terms are merely synonyms for the same concept.
Unicode 3.0 has the same character repertoire as ISO/IEC 10646-1:2000
Unicode 2.1 has the same character repertoire as ISO/IEC 10646-1:1993
Although the character codes are synchronized between Unicode and ISO/IEC 10646, the Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. To this end, it supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646. For example, in Unicode 3.0, there are no assigned surrogate pairs. Since the most common characters have already been encoded in the first 64K values, the characters requiring surrogate pairs will be relatively rare.
The standard character set for computers has traditionally been US-ASCII (American Standard Code for Information Interchange). (The "US" is to distinguish it from non-US ASCIIs.) This group of characters is numbered from 0 through 127, and comprises the upper and lower case letters, numbers, and punctuation, as well as some control characters such as tabs and linefeeds. But it defines no foreign characters or specialized symbols. Hence, Windows, the Mac, and the IBM PC's text mode have each made their own sets of extended characters. Since these sets differ widely from each other, the only "safe" characters to transmit and count on the user receiving correctly have traditionally been the "7-bit" characters 0-127.
Originally, the Web adopted an extended character set, ISO 8859-1 (sometimes referred to as ISO Latin-1), as its standard. (Or at least it did before HTML 4.0; the new standard, out of international political correctness, has no default character set in order not to favor one language over another, and requires an explicit "charset" parameter in the HTTP content-type header; despite this, the World Wide Web Consortium site has contradictory information in its own character set page.) This set includes various foreign and special characters. Thus, you can use these characters if your editing program saves them in the standardized ISO codes, but not if system-specific character codes are used. (The Windows character set is mostly the same as the ISO character set, but the Macintosh set is completely different.)
The HTML 4.0 standard adopted Unicode as the official document character set. As a result, numeric character references are always interpreted with regard to Unicode, not with regard to the character encoding. The character encoding is the character set used to transmit the characters over the network (and possibly also to store the Web pages on the server's file system, though not necessarily, since the server might transform the characters as it transmits them). No default encoding is defined; it is supposed to be specified in the HTTP content-type header. Numeric character references should be unaffected by the encoding chosen for a document.
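A small sketch can make the separation between encoding and numeric references concrete. It uses Python's html.unescape as a stand-in for an HTML parser; the helper resolve_references and the sample string are hypothetical. Be aware that html.unescape follows HTML5 rules, which remap a few references in the #128-#159 range, so that range is avoided here.

```python
import html


def resolve_references(raw: bytes, transport_encoding: str) -> str:
    """Decode bytes with the transport encoding, then expand references."""
    text = raw.decode(transport_encoding)
    return html.unescape(text)


source = "caf&#233; &#8364;5"              # references only, plain ASCII
as_latin1 = source.encode("iso-8859-1")    # one possible transport encoding
as_utf16 = source.encode("utf-16")         # a very different byte stream

# The bytes differ, but both references resolve to the same Unicode
# characters: U+00E9 and U+20AC.
print(as_latin1 != as_utf16)
print(resolve_references(as_latin1, "iso-8859-1"))
print(resolve_references(as_utf16, "utf-16"))
```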
The first 256 characters of Unicode (#0-#255) are equivalent to the ISO Latin-1 standard, which in turn has its first 128 characters equivalent to the older US-ASCII, so existing Web documents will work the same as always. (The minor exception is that Unicode has chosen not to give any definition as to the functions of the control characters from #0-#31 and #128-#159, leaving them entirely system-specific.) Additional characters #256-#65535 are also available, covering the characters of many other languages, mathematical symbols, and more, including curly quotes.
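The numbering relationship is easy to check with Python's built-in codecs. This is only a demonstration of the code point alignment, with hypothetical variable names; it says nothing about how the control characters behave.

```python
# Every ISO 8859-1 byte value decodes to the Unicode character with the
# same number.
latin1_text = bytes(range(256)).decode("iso-8859-1")
latin1_matches = [ord(ch) for ch in latin1_text] == list(range(256))

# The first 128 of those are exactly the US-ASCII characters.
ascii_text = bytes(range(128)).decode("ascii")
ascii_matches = ascii_text == latin1_text[:128]

# A curly quote lies above #255, so ISO 8859-1 cannot represent it.
curly_quote = "\u201c"  # LEFT DOUBLE QUOTATION MARK, #8220
try:
    curly_quote.encode("iso-8859-1")
    curly_fits_latin1 = True
except UnicodeEncodeError:
    curly_fits_latin1 = False

print(latin1_matches, ascii_matches, ord(curly_quote), curly_fits_latin1)
```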
If UCS is used on UNIX systems, it is recommended that UTF-8 be used as the multibyte character encoding (contents of the char data type in the C language and contents of text files) and that UCS-4 (not UCS-2 or UTF-16), together with the Private Use Planes (PUPs), be used as the wide character encoding. Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units. On most UNIX systems, the "wide character" data type of the C programming language (wchar_t) is defined as a 32-bit integer. If Private Use Planes are used for user-defined characters (UDCs), such UDCs are converted from UCS-4 to UTF-8 and used as multibyte encoding.
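The endian point can be seen by encoding one character several ways. The sketch below uses Python's codecs and treats UTF-32 as a stand-in for UCS-4 (for valid code points the two give the same 32-bit values); the variable names and the sample user-defined character U+100000 are chosen only for illustration.

```python
euro = "\u20ac"  # EURO SIGN

# UTF-8 is a byte sequence: there is only one possible order.
utf8 = euro.encode("utf-8")

# 16-bit and 32-bit code units come in two byte orders.
utf16_be = euro.encode("utf-16-be")
utf16_le = euro.encode("utf-16-le")
utf32_be = euro.encode("utf-32-be")
utf32_le = euro.encode("utf-32-le")

# A character from a Private Use Plane (Plane 0x10) fits in one 32-bit
# unit and converts to a 4-byte UTF-8 sequence.
udc = chr(0x100000)
udc_utf8 = udc.encode("utf-8")

print(utf8.hex())                      # e282ac
print(utf16_be.hex(), utf16_le.hex())  # 20ac ac20
print(utf32_be.hex(), utf32_le.hex())  # 000020ac ac200000
print(udc_utf8.hex())                  # f4808080
```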
Standard UTF-8 won't interoperate well in an EBCDIC system because of the different arrangements of control codes. UTF-EBCDIC is intended for interoperating in EBCDIC systems.
UTF-8 is the most common encoding on the Web.
UTF-16, UTF-16LE, and UTF-16BE are used by Java and Windows.
UTF-32, UTF-32LE, and UTF-32BE are used by various Unix systems.