UTF-8


 

UTF-8 (8-bit Unicode Transformation Format) is a lossless, variable-length character encoding for Unicode created by Ken Thompson and Rob Pike. It represents each Unicode code point as a sequence of one or more bytes, and so can encode the alphabets of many of the world's languages. UTF-8 is especially useful for transmission over 8-bit electronic mail systems.

Description

UTF-8 is currently standardized as RFC 3629 (UTF-8, a transformation format of ISO 10646).

~ ~ ~ ~ ~ ~ ~ ~ ~ ~

In summary, the bits of a Unicode code point are split into several groups, which are then distributed among the lower bit positions of the UTF-8 bytes.

~ ~ ~ ~ ~ ~ ~ ~ ~ ~

A character whose code point is below U+0080 is encoded with a single byte that contains its code point: these correspond exactly to the 128 characters of 7-bit ASCII.
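The ASCII equivalence can be checked directly. The following is a minimal sketch that relies on Python's built-in UTF-8 codec; the helper name utf8_length is an illustrative assumption, not part of any standard.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes Python's built-in codec uses to encode ch as UTF-8."""
    return len(ch.encode("utf-8"))

# Every code point below U+0080 becomes one byte equal to the code point.
for code_point in range(0x80):
    assert chr(code_point).encode("utf-8") == bytes([code_point])

# Pure ASCII text is therefore byte-for-byte identical in both encodings.
text = "Hello, world!"
assert text.encode("utf-8") == text.encode("ascii")

print(utf8_length("A"), utf8_length("\u00e9"))  # prints: 1 2
```

The last line contrasts an ASCII letter with U+00E9, the first kind of character that no longer fits in one byte.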

~ ~ ~ ~ ~ ~ ~ ~ ~ ~

All other code points require two to four bytes. The most significant bit of each of these bytes is 1, which prevents confusion with 7-bit ASCII characters, in particular those with code points below U+0020, traditionally called control characters (for example, carriage return).
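To illustrate the high-bit rule, the sketch below encodes a few non-ASCII characters with Python's built-in codec and confirms that every resulting byte has its top bit set, so none can be mistaken for carriage return. The sample characters and the helper name all_high_bits_set are chosen here for illustration only.

```python
def all_high_bits_set(ch: str) -> bool:
    """True if every UTF-8 byte of ch has its most significant bit set."""
    return all(byte & 0x80 for byte in ch.encode("utf-8"))

CARRIAGE_RETURN = 0x0D

# One sample each for two-, two-, three- and four-byte sequences.
samples = ["\u00e9", "\u05d0", "\u20ac", "\U0001F600"]
for ch in samples:
    encoded = ch.encode("utf-8")
    bits = " ".join(format(byte, "08b") for byte in encoded)
    print(f"U+{ord(ch):04X}: {bits}")
    assert all_high_bits_set(ch)
    assert CARRIAGE_RETURN not in encoded
```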

~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Code range        UTF-16                                 UTF-8 (binary)                         Notes
(hexadecimal)

000000–00007F     00000000 0xxxxxxx                      0xxxxxxx                               ASCII equivalence range; byte begins with zero
                  (seven x)                              (seven x)

000080–0007FF     00000xxx xxxxxxxx                      110xxxxx 10xxxxxx                      first byte begins with 110, the following byte begins with 10
                  (three x, eight x)                     (five x, six x)

000800–00FFFF     xxxxxxxx xxxxxxxx                      1110xxxx 10xxxxxx 10xxxxxx             first byte begins with 1110, the following bytes begin with 10
                  (eight x, eight x)                     (four x, six x, six x)

010000–10FFFF     110110xx xxxxxxxx 110111xx xxxxxxxx    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    UTF-16 requires surrogates; an offset of 0x10000 is subtracted,
                  (two x, eight x, two x, eight x)       (three x, six x, six x, six x)         so the bit pattern is not identical with UTF-8

~ ~ ~ ~ ~ ~ ~ ~ ~ ~
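The note on the last range can be made concrete. The following sketch computes a UTF-16 surrogate pair by subtracting 0x10000 and splitting the remaining 20 bits, then compares the result with Python's built-in codecs. The function name utf16_surrogates and the sample code point U+1F600 are assumptions made for this illustration.

```python
def utf16_surrogates(code_point: int):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the range 010000-10FFFF")
    offset = code_point - 0x10000       # 20 bits remain after the subtraction
    high = 0xD800 | (offset >> 10)      # 110110xx xxxxxxxx (upper ten bits)
    low = 0xDC00 | (offset & 0x3FF)     # 110111xx xxxxxxxx (lower ten bits)
    return high, low

high, low = utf16_surrogates(0x1F600)
print(f"UTF-16: {high:04X} {low:04X}")  # prints: UTF-16: D83D DE00

# The built-in codec produces the same two 16-bit units (big-endian here).
assert chr(0x1F600).encode("utf-16-be") == bytes.fromhex("D83DDE00")

# UTF-8 stores the code point's own 21 bits, with no offset subtracted.
print(chr(0x1F600).encode("utf-8").hex(" "))  # prints: f0 9f 98 80
```

Because UTF-8 keeps the unmodified code point bits while UTF-16 stores the offset value, the "x" positions in the two columns hold different bit patterns for the same character.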

For example, the character alef (א), which is Unicode U+05D0, is encoded into UTF-8 as follows:

  • It falls into the range U+0080 to U+07FF. The table shows that it will be encoded using two bytes, 110xxxxx 10xxxxxx.
  • Hexadecimal 0x05D0 is equivalent to binary 101 1101 0000.
  • These eleven bits are placed, in order, into the positions marked "x": 11010111 10010000.
  • The final result is two bytes, more conveniently expressed in hexadecimal as 0xD7 0x90. That is the letter alef in UTF-8.

So the first 128 characters need one byte. The next 1920 characters need two bytes to encode. This includes Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the BMP characters use three bytes, and additional characters are encoded in four bytes.
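The bit placement described above can be sketched as a small hand-written encoder. This is an illustrative sketch rather than a reference implementation: the function name encode_utf8 is an assumption, and the rejection of the surrogate range U+D800 to U+DFFF follows RFC 3629 rather than anything stated in the text above.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using the bit layouts above."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point is outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption taken from RFC 3629: surrogates are not encodable.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# The alef example: U+05D0 becomes the two bytes 0xD7 0x90.
assert encode_utf8(0x05D0) == bytes([0xD7, 0x90])

# Byte counts at the range boundaries: 1, 1, 2, 2, 3, 3, 4, 4.
for cp in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X} -> {len(encode_utf8(cp))} byte(s)")
```

Each branch shifts the code point right to pick out one group of bits, masks continuation groups to six bits with 0x3F, and then ORs in the fixed prefix for that byte position.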

    ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

By continuing the pattern given above, it is possible to encode much larger numbers. The original specification allowed sequences of up to six bytes, covering the whole range U+0000 to U+7FFFFFFF (31 bits). In November 2003, however, RFC 3629 restricted UTF-8 to the range covered by the formal Unicode definition, U+0000 to U+10FFFF. Before this, only the bytes 0xFE and 0xFF never occurred in UTF-8 encoded text. With the restriction in place, the number of byte values that never occur in a UTF-8 stream rose to 13: 0xC0, 0xC1, and 0xF5 to 0xFF.
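The list of 13 unused byte values lends itself to a quick sanity check on incoming data. The sketch below is only a partial check under stated assumptions: finding none of these bytes does not prove that the data is valid UTF-8, and the names NEVER_VALID and find_never_valid_bytes are hypothetical.

```python
# The 13 byte values that cannot occur in UTF-8 as restricted by RFC 3629.
NEVER_VALID = frozenset({0xC0, 0xC1} | set(range(0xF5, 0x100)))
assert len(NEVER_VALID) == 13

def find_never_valid_bytes(data: bytes) -> list:
    """Return (offset, byte) pairs for bytes that cannot occur in UTF-8."""
    return [(i, b) for i, b in enumerate(data) if b in NEVER_VALID]

# Sampled check: encodings produced by Python's codec never contain them.
for cp in range(0, 0x110000, 101):
    if 0xD800 <= cp <= 0xDFFF:   # surrogates are not encodable, skip them
        continue
    assert not find_never_valid_bytes(chr(cp).encode("utf-8"))

print(find_never_valid_bytes(b"ok \xc0\x80 \xff"))  # prints: [(3, 192), (6, 255)]
```

A full validator would also have to check continuation bytes, overlong forms and surrogates; Python's own strict decoder, bytes.decode("utf-8"), does so and raises UnicodeDecodeError on malformed input.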

    ~ ~ ~ ~ ~ ~ ~ ~ ~ ~