UTF-8
UTF-8 (8-bit Unicode Transformation Format) is a lossless, variable-length character encoding for Unicode created by Ken Thompson and Rob Pike. It represents each Unicode character, covering the alphabets of many of the world's languages, as a sequence of one or more bytes. UTF-8 is especially useful for transmission over 8-bit electronic mail systems.
Rationale behind UTF-8's mechanics
As a consequence of the exact mechanics of UTF-8, the following properties of multi-byte sequences hold:
~ ~ ~ ~ ~ ~ ~ ~ ~ ~
- The most significant bit of a single-byte character is always 0.
- The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences, 1110 for three-byte sequences, and so on.
- The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
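The three properties above can be sketched as a small byte classifier. This is an illustration only: the function name classify_byte and its labels are hypothetical, and the sketch assumes the current four-byte limit, so any longer lead pattern is reported as invalid.

```python
def classify_byte(b: int) -> str:
    """Classify one byte (0-255) by its most significant bits."""
    if (b & 0b1000_0000) == 0b0000_0000:
        return "single"        # 0xxxxxxx: a single-byte character
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xxxxxx: a non-first byte of a sequence
    if (b & 0b1110_0000) == 0b1100_0000:
        return "lead-2"        # 110xxxxx: first byte of a two-byte sequence
    if (b & 0b1111_0000) == 0b1110_0000:
        return "lead-3"        # 1110xxxx: first byte of a three-byte sequence
    if (b & 0b1111_1000) == 0b1111_0000:
        return "lead-4"        # 11110xxx: first byte of a four-byte sequence
    return "invalid"           # assumption: longer lead patterns are rejected


# "é" encodes as C3 A9 and "€" as E2 82 AC.
encoded = "é€".encode("utf-8")
kinds = [classify_byte(b) for b in encoded]
print(kinds)
```

Because the three classes of byte never overlap, a decoder can tell from any single byte whether it starts a character or continues one.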
UTF-8 was designed to satisfy these properties in order to guarantee that no byte sequence of one character is contained within a longer byte sequence of another character. This ensures that byte-wise sub-string matching can be applied to search for words or phrases within a text; some older variable-length 8-bit encodings (such as Shift-JIS) did not have this property and thus made string-matching algorithms rather complicated. Although this property adds redundancy to UTF-8–encoded text, the advantages outweigh this concern; besides, data compression is not one of Unicode's aims and must be considered independently. This also means that if one or more complete bytes are lost due to error or corruption, one can resynchronize at the beginning of the next character and thus limit the damage.
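Both points in the paragraph above, safe byte-wise searching and resynchronization after lost bytes, can be tried out with a short sketch. The helper name next_char_start is hypothetical, and the katakana character "ソ" is chosen only because its Shift-JIS encoding happens to end in the byte 0x5C, the ASCII backslash.

```python
def next_char_start(data: bytes, pos: int) -> int:
    """Return the first index at or after pos that is not a continuation byte."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# Byte-wise search: a backslash is "found" inside the Shift-JIS bytes of
# another character, but never inside the UTF-8 bytes.
so_sjis = "ソ".encode("shift_jis")  # 83 5C
so_utf8 = "ソ".encode("utf-8")      # E3 82 BD
false_hit_sjis = b"\\" in so_sjis
false_hit_utf8 = b"\\" in so_utf8

# Resynchronization: drop the first three bytes, which cuts "ï" (C3 AF) in half.
damaged = "naïve café".encode("utf-8")[3:]
start = next_char_start(damaged, 0)
recovered = damaged[start:].decode("utf-8")
print(false_hit_sjis, false_hit_utf8, recovered)
```

Only the character that was cut is lost; everything from the next lead byte onward decodes normally.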
~ ~ ~ ~ ~ ~ ~ ~ ~ ~
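Also due to the design of the byte sequences, if a sequence of bytes that is supposed to represent text validates as UTF-8, it is fairly safe to assume that it is UTF-8. Text in a legacy 8-bit encoding would pass only if every byte with the high bit set happened to form a correct lead byte followed by the right number of continuation bytes (10xxxxxx). The ISO-8859 series uses 100xxxxx, half of the continuation-byte range, only for extremely rare control codes. Some other legacy encodings do make more use of bytes in this range, but generally only for rare characters.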
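A minimal sketch of this heuristic, using Python's built-in strict UTF-8 decoder as the validator. The function name is_valid_utf8 is hypothetical, and ISO-8859-1 (Python's "latin-1" codec) stands in for the legacy encodings.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "café".encode("utf-8")      # 63 61 66 C3 A9
latin1_bytes = "café".encode("latin-1")  # 63 61 66 E9

# E9 looks like the lead byte of a three-byte sequence, but nothing follows it.
print(is_valid_utf8(utf8_bytes), is_valid_utf8(latin1_bytes))

# "Fairly safe" is not "certain": the ISO-8859-1 text "Ã©" is the byte pair
# C3 A9, which is also the valid UTF-8 encoding of "é".
ambiguous = "Ã©".encode("latin-1")
print(is_valid_utf8(ambiguous))
```

The last case shows why the test is a heuristic: such coincidences are possible, but they require unusual character pairs in the legacy text.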
~ ~ ~ ~ ~ ~ ~ ~ ~ ~
You can use the bit patterns to identify UTF-8 characters at a glance. If a byte's first hex digit is 0 through 7, it is an ASCII character. If it is C or D, the byte starts a two-byte sequence carrying up to 11 bits. If it is E, the byte starts a three-byte sequence carrying up to 16 bits, and if it is F, a four-byte sequence carrying up to 21 bits. A byte whose first hex digit is 8 through B cannot start a character, but every byte after the first in a sequence must have a first hex digit of 8 through B. Thus you can tell at a glance that "0xA9" on its own is not a valid UTF-8 character, but that "0x54" and "0xE3 0xB4 0xB1" are. This quick test checks structure only: a few lead bytes that pass it, such as 0xC0 and 0xC1, which could only begin overlong forms, are still invalid.
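The at-a-glance rule can be written down as a short sketch that looks only at the first hex digit of each byte. The names expected_length and looks_like_one_char are hypothetical. The check is deliberately as coarse as the rule itself, so it does not reject overlong forms or other structurally plausible but invalid sequences; a strict decoder is needed for real validation.

```python
def expected_length(first_byte: int) -> int:
    """Sequence length implied by the first hex digit, or 0 if it cannot start a character."""
    high = first_byte >> 4
    if high <= 0x7:
        return 1  # 0-7: ASCII
    if high in (0xC, 0xD):
        return 2  # C or D: two-byte sequence
    if high == 0xE:
        return 3  # E: three-byte sequence
    if high == 0xF:
        return 4  # F: four-byte sequence
    return 0      # 8-B: only allowed after the first byte


def looks_like_one_char(data: bytes) -> bool:
    """Apply the first-hex-digit rule to a candidate single character."""
    if not data:
        return False
    n = expected_length(data[0])
    if n == 0 or len(data) != n:
        return False
    # Every following byte must have a first hex digit between 8 and B.
    return all(0x8 <= (b >> 4) <= 0xB for b in data[1:])


for candidate in (b"\xa9", b"\x54", b"\xe3\xb4\xb1"):
    print(candidate.hex(" "), looks_like_one_char(candidate))
```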
and are licensed under the GNU Free Documentation License.
