UTF-8


 

UTF-8 (8-bit Unicode Transformation Format) is a lossless, variable-length character encoding for Unicode created by Ken Thompson and Rob Pike. It represents each Unicode character as a sequence of one or more 8-bit bytes, which lets it cover the alphabets of many of the world's languages while leaving plain ASCII text unchanged. UTF-8 is especially useful for transmission over 8-bit electronic mail systems.
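
As a rough illustration of the variable-length property described above, the following Python sketch encodes a few sample characters and prints the bytes UTF-8 uses for each. The sample characters and the helper name utf8_hex are chosen here for illustration only and are not part of any specification.

```python
# Sample characters chosen for illustration: ASCII, Latin, currency sign, emoji.
samples = ["A", "é", "€", "😀"]


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hex (hypothetical helper)."""
    return ch.encode("utf-8").hex(" ")


for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)} ({len(encoded)} byte(s))")
    # Lossless: decoding the bytes gives back the original character.
    assert encoded.decode("utf-8") == ch
```

The ASCII letter occupies a single byte, while the other characters take two, three, and four bytes respectively.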

Overlong forms, invalid input, and security considerations

How a decoder should respond to invalid input is largely undefined. A decoder may behave in any of the following ways:

  • Insert a replacement character (e.g. '?', '�')
  • Skip the character
  • Interpret the character as being from another charset (often Latin-1)
  • Interpret the whole text as being from another charset (often Latin-1 or the local charset); this is common in applications such as IRC, where the use of UTF-8 is not yet universal and there is no built-in mechanism for specifying the character set
  • Fail to notice the error and decode the bytes as if they were a similar, valid UTF-8 sequence
  • Report an error

Decoders may, of course, behave in different ways for different types of invalid input.
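
Several of the behaviors listed above can be observed with Python's built-in decoder, which is used here only as one convenient example of a decoder. The sample bytes and the helper decode_with_fallback are assumptions made for this sketch; the Latin-1 fallback in particular is one possible policy, not a recommendation.

```python
# Latin-1 bytes for "café au lait"; 0xE9 followed by a space is not valid UTF-8.
bad = b"caf\xe9 au lait"

# Report an error: Python's default error handler is "strict".
try:
    bad.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError as exc:
    strict_result = f"error at byte offset {exc.start}"

# Insert a replacement character (U+FFFD) for the invalid byte.
replaced = bad.decode("utf-8", errors="replace")

# Skip the invalid byte entirely.
skipped = bad.decode("utf-8", errors="ignore")


def decode_with_fallback(data: bytes) -> str:
    """Hypothetical helper: try UTF-8, else treat the whole text as Latin-1."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")


print(strict_result)
print(replaced)
print(skipped)
print(decode_with_fallback(bad))
```

Each handler yields a different string from the same bytes, which is why two programs can disagree about what a malformed message says.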

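All of these possibilities have advantages and disadvantages, but care must be taken to avoid security issues when input validation is performed on the raw bytes before they are converted from UTF-8, because an invalid sequence can hide a character that the validation was meant to catch.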

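Overlong forms (where a character is encoded in more bytes than needed, while still following the normal byte patterns of the encoding) are one of the most troublesome types of invalid data. For example, the slash character "/" (U+002F) is correctly encoded as the single byte 0x2F, but the two bytes 0xC0 0xAF follow the two-byte pattern and carry the same value. The current RFC says overlong forms must not be decoded, but older specifications for UTF-8 only gave a warning, and many simpler decoders will happily decode them. Overlong forms have been used to bypass security validations in high-profile products, including Microsoft's IIS web server.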
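
The difference between a simple decoder and a strict one can be sketched in a few lines of Python. Both functions below are hypothetical and handle only two-byte sequences; they are meant to show where the overlong check fits, not to serve as a complete decoder.

```python
def naive_decode_two_byte(b0: int, b1: int) -> str:
    """Decode a 110xxxxx 10xxxxxx pair WITHOUT checking for overlong forms."""
    if (b0 & 0xE0) != 0xC0 or (b1 & 0xC0) != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Combine the 5 payload bits of the lead byte with the 6 of the trail byte.
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))


def strict_decode_two_byte(b0: int, b1: int) -> str:
    """Same as above, but reject values that fit in a single byte."""
    ch = naive_decode_two_byte(b0, b1)
    if ord(ch) < 0x80:
        raise ValueError("overlong encoding")
    return ch


# The naive decoder turns the overlong pair C0 AF into an ordinary slash.
print(repr(naive_decode_two_byte(0xC0, 0xAF)))

# The strict variant refuses it, as does Python's built-in decoder.
for attempt in (lambda: strict_decode_two_byte(0xC0, 0xAF),
                lambda: b"\xc0\xaf".decode("utf-8")):
    try:
        attempt()
    except ValueError as exc:  # UnicodeDecodeError is a subclass of ValueError
        print("rejected:", type(exc).__name__)
```

A filter that scans the raw bytes for 0x2F would see nothing suspicious in 0xC0 0xAF, yet the naive decoder produces a slash from it afterwards.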

To maintain security in the case of invalid input, there are two options. The first is to decode the UTF-8 before doing any input validation checks, so that the checks see the same characters the application will later use. The second is to use a strict decoder that, in the event of invalid input, returns either an error or text that the application considers to be harmless.
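
A minimal sketch combining both options might look like the following. The function name safe_path_component and its particular validation rules are assumptions made for illustration; a real application would apply whatever checks suit its own input.

```python
def safe_path_component(raw: bytes) -> str:
    """Hypothetical validator: strict decode first, then check the text."""
    try:
        # Strict decoding: invalid or overlong sequences raise an error.
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("invalid UTF-8 in input") from exc
    # Validation runs on decoded characters, not on the raw bytes.
    if "/" in text or "\\" in text or text in ("", ".", ".."):
        raise ValueError("unsafe path component")
    return text


for candidate in (b"report.txt", b"../etc", b"..\xc0\xafetc"):
    try:
        print(candidate, "->", safe_path_component(candidate))
    except ValueError as exc:
        print(candidate, "-> rejected:", exc)
```

Because decoding happens first and is strict, the overlong slash in the last candidate is rejected as invalid UTF-8 rather than slipping past the check for "/".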
