UTF-8
UTF-8 (8-bit Unicode Transformation Format) is a lossless, variable-length character encoding for Unicode created by Ken Thompson and Rob Pike. It represents each Unicode character as a sequence of one to four bytes, and can therefore encode the alphabets of many of the world's languages. UTF-8 is especially useful for transmission over 8-bit electronic mail systems.
Modified UTF-8
The Java programming language, which uses UTF-16 for its internal text representation, supports a non-standard modification of UTF-8 for string serialization. This encoding is called modified UTF-8.
There are two differences between modified and standard UTF-8. The first difference is that the null character (U+0000) is encoded with two bytes instead of one, specifically as 11000000 10000000 (hexadecimal C0 80). This ensures that the encoded string contains no embedded null bytes, perhaps to address the concern that a language such as C, where a null byte signifies the end of a string, would otherwise truncate the string at the embedded null.
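The two-byte null can be observed with Java's standard DataOutputStream.writeUTF method, which writes modified UTF-8 preceded by a two-byte length. The following is a minimal sketch, not a reference implementation; the class name NullEncodingDemo and the helper names modifiedUtf8 and hex are chosen here for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class NullEncodingDemo {
    // Hypothetical helper: returns the modified UTF-8 bytes of s,
    // with the two-byte length prefix written by writeUTF stripped off.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A\u0000B"; // 'A', the null character, 'B'
        // Standard UTF-8 keeps the null as a single zero byte.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // 41 00 42
        // Modified UTF-8 replaces it with the two-byte form C0 80.
        System.out.println(hex(modifiedUtf8(s)));                    // 41 C0 80 42
    }
}
```

Note that the byte pair C0 80 is an overlong form that a strict standard UTF-8 decoder rejects, so the two encodings are not interchangeable for strings containing U+0000.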
The second difference is in the way characters outside the BMP are encoded. In standard UTF-8 these characters are encoded using the four-byte format above. In modified UTF-8 these characters are first represented as surrogate pairs (as in UTF-16), and then the surrogate pairs are encoded individually in sequence as in CESU-8. The reason for this modification is more subtle. In Java a character is 16 bits long; therefore some Unicode characters require two Java characters in order to be represented. This aspect of the language predates the supplementary planes of Unicode; however, it is important for performance as well as backwards compatibility, and is unlikely to change. The modified encoding ensures that an encoded string can be decoded one Java character at a time, rather than one Unicode character at a time. Unfortunately, this also means that characters requiring four bytes in UTF-8 require six bytes in modified UTF-8.
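The six-byte result can be illustrated by encoding a string one 16-bit Java character at a time. The sketch below is a hand-written illustration under stated assumptions, not the encoder used by the Java runtime; the class name SurrogateEncodingDemo and the method names encodeModifiedUtf8 and hex are hypothetical, and the sample character U+1D11E (musical symbol G clef) is chosen only as an example of a character outside the BMP.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class SurrogateEncodingDemo {
    // Illustrative encoder: treats every 16-bit Java char independently,
    // so each half of a surrogate pair becomes its own three-byte sequence.
    static byte[] encodeModifiedUtf8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                out.write(c);                          // one byte
            } else if (c <= 0x07FF) {
                out.write(0xC0 | (c >> 6));            // two bytes; also covers U+0000
                out.write(0x80 | (c & 0x3F));
            } else {
                out.write(0xE0 | (c >> 12));           // three bytes, surrogates included
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D11E is stored in a Java String as the surrogate pair D834 DD1E.
        String clef = new String(Character.toChars(0x1D11E));
        // Standard UTF-8: one four-byte sequence.
        System.out.println(hex(clef.getBytes(StandardCharsets.UTF_8))); // F0 9D 84 9E
        // Modified UTF-8: two three-byte sequences, one per surrogate.
        System.out.println(hex(encodeModifiedUtf8(clef)));              // ED A0 B4 ED B4 9E
    }
}
```

Because the loop never looks ahead to pair up surrogates, a decoder can likewise emit one Java character per byte sequence, which is the property the modification is meant to preserve.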
