Character encoding


 

A character encoding is a code that pairs each member of a set of characters (representations of graphemes or grapheme-like units, such as might appear in an alphabet or syllabary used to write a natural language) with something else, such as a number or a pattern of electrical pulses, in order to facilitate the storage of text in computers and the transmission of text through telecommunication networks. Common examples include Morse code, which encodes letters of the Latin alphabet as series of long and short depressions of a telegraph key, and ASCII, which encodes letters, numerals, and other symbols both as integers and as 7-bit binary versions of those integers.
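
As a minimal sketch of the ASCII case, the Python snippet below lists the integer code and the 7-bit binary form for each character of a short string. The helper name ascii_codes is hypothetical and chosen only for this illustration; ord and format are Python built-ins.

```python
def ascii_codes(text):
    """Return (character, integer code, 7-bit binary string) for each character.

    Hypothetical helper for illustration; rejects anything outside 7-bit ASCII.
    """
    rows = []
    for ch in text:
        code = ord(ch)  # integer assigned to the character
        if code > 127:
            raise ValueError(f"{ch!r} is outside the 7-bit ASCII range")
        rows.append((ch, code, format(code, "07b")))  # zero-padded to 7 bits
    return rows


for ch, code, bits in ascii_codes("Hi!"):
    print(ch, code, bits)
# H 72 1001000
# i 105 1101001
# ! 33 0100001
```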

Encoding forms and encoding schemes

Computer scientists sometimes overload the term character encoding to mean also how a specific sequence of bits represents characters. This involves an encoding form, which specifies the conversion of a character's integer code into a series of integer code values that facilitate storage in a system that uses fixed bit widths. For example, integers greater than 65535 (hex FFFF) will not fit in 16 bits, so the UTF-16 encoding form mandates representation of these integers as a surrogate pair of integers, each less than 65536 and not assigned to characters (for example, hex 10000 becomes the pair D800 DC00). An encoding scheme then converts code values to bit sequences, with attention given to things like platform-dependent byte order (for example, D800 DC00 might become 00 D8 00 DC on an Intel x86 architecture, which stores the least significant byte first). A character set, character map, or code page shortcuts this process by directly mapping abstract characters to specific bit patterns. Unicode Technical Report #17 explains this terminology in depth and provides further examples.
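
The two steps described above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function names to_utf16_code_units and serialize are hypothetical, and the first function uses the standard UTF-16 surrogate arithmetic (subtract hex 10000, then split the remaining 20 bits into two 10-bit halves).

```python
def to_utf16_code_units(code_point):
    """Encoding form: map one integer code to one or two 16-bit code values."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not assigned to characters")
    if code_point <= 0xFFFF:
        return [code_point]              # fits in 16 bits as-is
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]


def serialize(code_units, byteorder):
    """Encoding scheme: turn 16-bit code values into bytes in a given byte order."""
    return b"".join(unit.to_bytes(2, byteorder) for unit in code_units)


units = to_utf16_code_units(0x10000)
print([f"{u:04X}" for u in units])          # ['D800', 'DC00']
print(serialize(units, "little").hex(" "))  # 00 d8 00 dc
print(serialize(units, "big").hex(" "))     # d8 00 dc 00
```

The little-endian result matches the x86 example in the text; the big-endian line shows the same code values serialized in the opposite byte order.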

~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Since most applications use only a small subset of Unicode, encoding schemes (like UTF-8 and UTF-16) and character maps (like ASCII) provide efficient ways to represent Unicode characters in computer storage or communications by using short binary words for the characters that occur most often. Some encodings also use data compression techniques to represent a large repertoire with a smaller number of codes.
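
To make the storage trade-off concrete, the sketch below counts how many bytes Python's built-in UTF-8 and UTF-16 codecs need for a given string. The helper name encoded_sizes and the sample strings are assumptions made for this illustration; the little-endian UTF-16 codec is used so that no byte order mark is added to the count.

```python
def encoded_sizes(text):
    """Return the number of characters and the byte length under each scheme.

    Hypothetical helper for illustration only.
    """
    return {
        "characters": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # no byte order mark
    }


print(encoded_sizes("hello"))        # {'characters': 5, 'utf-8': 5, 'utf-16': 10}
print(encoded_sizes("\u00e9"))       # {'characters': 1, 'utf-8': 2, 'utf-16': 2}
print(encoded_sizes("\U00010000"))   # {'characters': 1, 'utf-8': 4, 'utf-16': 4}
```

Text drawn from the ASCII range takes one byte per character in UTF-8, while characters beyond hex FFFF take four bytes in both schemes.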

~ ~ ~ ~ ~ ~ ~ ~ ~ ~