UTF-8
UTF-8 (8-bit Unicode Transformation Format) is a lossless, variable-length character encoding for Unicode created by Ken Thompson and Rob Pike. It uses sequences of bytes to represent Unicode characters, covering the alphabets of many of the world's languages. UTF-8 is especially useful for transmission over 8-bit electronic mail systems.
Advantages and disadvantages
- General
- Advantages
- Sorting of UTF-8 strings using standard byte-oriented sorting routines will produce the same results as sorting them based on Unicode code points, but this is unlikely to be considered a culturally acceptable sort order.
- UTF-8 and UTF-16 are the standard encodings for XML documents. All other encodings must be specified explicitly either externally or through a text declaration. http://www.w3.org/TR/REC-xml/#charencoding
- Disadvantages
- A badly written UTF-8 parser (one that does not comply with current versions of the standard) could accept a number of different pseudo-UTF-8 representations and convert them to the same Unicode output. This provides a way for information to leak past validation routines designed to process data in its eight-bit representation. See the W3 FAQ: Multilingual Forms for a Perl regular expression to validate a UTF-8 string.
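The sorting property and the overlong-form risk described in this group can both be checked in a few lines. The following is a minimal Python sketch, not part of the original article: the sample strings and the helper name `is_valid_utf8` are assumptions made for illustration, and the check relies on Python's built-in strict UTF-8 decoder rather than on the Perl regular expression mentioned above.

```python
# Sample strings chosen for illustration only.
words = ["z", "é", "a", "日本", "\U0001F600", "Z"]

# Sorting by code point values and sorting by raw UTF-8 bytes
# should give the same order.
by_code_point = sorted(words, key=lambda s: [ord(c) for c in s])
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
same_order = by_code_point == by_utf8_bytes


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# 0xC0 0xAF is an overlong two-byte form of "/" (U+002F).
# A compliant decoder must reject it instead of mapping it to "/".
overlong_slash = b"\xc0\xaf"
print(same_order, is_valid_utf8(b"/"), is_valid_utf8(overlong_slash))
```

Validating input with a strict decoder before any security check means a filter that looks for "/" in the byte stream cannot be bypassed by an alternative encoding of the same character.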
- Compared to legacy encodings
- Advantages
- UTF-8 can encode any Unicode character.
- UTF-8 guarantees that a decoder can resynchronise after a sequence has been cut in the middle, losing only the character that was cut. Many legacy multibyte encodings are much harder to resynchronise.
- A byte sequence for one character never occurs as part of a longer sequence for another character as it did in older variable-length encodings like Shift-JIS (see the previous section on this).
- The first byte of a multibyte sequence is enough to determine the length of the multibyte sequence (just count the number of leading set bits). This makes it extremely simple to extract a substring from a given string without elaborate parsing. This was often not the case in legacy multibyte encodings.
- Disadvantages
- UTF-8 is generally larger than the appropriate legacy encoding for everything except diacritic-free, Latin-alphabet text. Most alphabetic scripts had only a single byte per character in legacy encodings but their letters take at least two bytes in UTF-8. Ideographic scripts generally had two bytes per character in their legacy encodings yet take three bytes per character in UTF-8.
- Legacy encodings for almost all non-ideographic scripts use a single byte per character, making string cutting and joining easy.
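The lead-byte and resynchronisation properties listed above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names `sequence_length` and `resync` are hypothetical, and the code only reads the length prefix of a lead byte without validating the rest of the sequence.

```python
def sequence_length(lead: int) -> int:
    """Count the leading set bits of a lead byte to get the sequence length.

    Returns 1 for an ASCII byte, 2 to 4 for a multibyte lead byte, and 0
    for a continuation byte (10xxxxxx) or a byte with more than four
    leading set bits. No further validation is attempted here.
    """
    if lead & 0x80 == 0:
        return 1
    ones = 0
    mask = 0x80
    while lead & mask:
        ones += 1
        mask >>= 1
    return ones if 2 <= ones <= 4 else 0


def resync(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that is not a
    continuation byte, i.e. where the next character starts."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "日本".encode("utf-8")   # two three-byte characters
start = resync(data, 1)         # begin reading in the middle of the first one
print(sequence_length(data[0]), start, data[start:].decode("utf-8"))
```

Because continuation bytes always have the form 10xxxxxx, a reader that starts in the middle of a character skips at most a few bytes before reaching the next lead byte.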
- Compared to UTF-7
- Advantages
- UTF-8 uses significantly fewer bytes per character for all non-ASCII characters.
- UTF-8 encodes "+" as itself, whereas UTF-7 encodes it as "+-".
- Disadvantages
- UTF-8 requires the transmission system to be eight-bit clean. For e-mail sent over systems that are not eight-bit clean, this means it has to be further encoded using quoted printable or base64. This extra stage of encoding carries a significant size penalty. For base64, the overhead is 33⅓%, while for quoted printable the overhead varies depending on how ASCII-heavy the text is: for French the overhead is about 14%, and for non-Roman scripts containing no ASCII characters the overhead is 200%. In most cases UTF-7 will be smaller than the combination of UTF-8 with either quoted printable or base64 (even for ASCII-heavy languages such as French, UTF-7 is about 4% smaller than UTF-8 quoted printable).
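The overhead figures quoted above can be reproduced approximately with Python's standard `base64` and `quopri` modules and its built-in `utf-7` codec. This is a rough sketch under stated assumptions: the helper names `overhead` and `transfer_overheads` and the sample text are hypothetical, and real mail software also adds line breaks and headers that are not counted here.

```python
import base64
import quopri


def overhead(raw: bytes, encoded: bytes) -> float:
    """Relative size increase of encoded over raw (0.5 means 50% larger)."""
    return (len(encoded) - len(raw)) / len(raw)


def transfer_overheads(text: str) -> dict:
    """Compare 7-bit-safe forms of text against its raw UTF-8 size."""
    raw = text.encode("utf-8")
    return {
        "base64": overhead(raw, base64.b64encode(raw)),
        "quoted-printable": overhead(raw, quopri.encodestring(raw)),
        # UTF-7 is 7-bit safe by itself; measured against raw UTF-8 as well.
        "utf-7": overhead(raw, text.encode("utf-7")),
    }


# Short sample with no ASCII characters (an assumption for illustration).
for name, value in transfer_overheads("日本語").items():
    print(f"{name}: {value:.0%}")
```

Each non-ASCII byte becomes a three-character `=XX` escape in quoted printable, which is where the 200% figure for text without ASCII characters comes from.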
- Compared to UTF-16
- Advantages
- Unicode code points in the ASCII range U+0000 to U+007F, including the basic Latin alphabet and the space character, can be represented in a single byte. Therefore text consisting mostly of diacritic-free Latin letters will be around half the size in UTF-8 that it would be in UTF-16, and text in many other alphabets will be slightly smaller in UTF-8 than it would be in UTF-16 because of the presence of spaces.
- Most existing computer programs (including operating systems) were not written with Unicode in mind, and using UTF-16 with them would create major compatibility issues as it is not a superset of ASCII. UTF-8 allows programs to treat ASCII as they always did, and changes behaviour only for non-ASCII characters that were different by location anyway.
- UTF-16 needs a Byte Order Mark (U+FEFF) in the beginning of the stream to identify the byte order. This is not necessary in UTF-8, as the sequences always start with the most significant byte on every platform.
- Disadvantages
- UTF-8 is variable-length; that is, different characters take sequences of different lengths to encode. The impact of this can be reduced, however, by creating an abstract interface for working with UTF-8 strings and making the encoding transparent to the user. While UTF-16 is technically also variable-length, many people do not know this or simply do not care about the rarely used code points outside the BMP.
- Chinese, Japanese, and Korean (CJK) ideographs use three bytes in UTF-8, but only two in UTF-16. So CJK text takes up more space when represented in UTF-8. There are a few other less-well-known groups of code points that this also applies to.
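The size trade-offs in this comparison can be checked directly by encoding the same text both ways. The following is a minimal Python sketch; the helper name `encoded_sizes` and the sample strings are assumptions made for illustration, and `utf-16-le` is used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 size, UTF-16 size) in bytes, excluding any BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {
    "ASCII": "hello world",    # one byte per character in UTF-8
    "Cyrillic": "Привет мир",  # two bytes per letter, one for the space
    "CJK": "日本語",            # three bytes per character in UTF-8
}

for label, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{label}: UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

The Cyrillic sample is smaller in UTF-8 only because of its single space, which matches the observation above about text in other alphabets.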
and are licensed under the GNU Free Documentation License.
