Characters and encoding

It does not make sense to have a string without knowing what encoding it uses.

Character Set

  • A list of distinct characters; each character is mapped to an integer (its code-point)
  • Typical character sets: ASCII, Unicode, ISO-8859-1

ASCII

  • Covers unaccented English characters
  • Uses 7 bits, enough for 128 values (0-127)
  • 32 -> space, 65 -> "A" (see the IRB sketch below)
  • The spare 8th bit was left for programs to use for their own purposes
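
These mappings can be checked directly in Ruby; a minimal IRB sketch using only core String and Integer methods:

    >> 'A'.ord
    => 65
    >> ' '.ord
    => 32
    >> 65.chr
    => "A"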

OEM-Trouble

  • Multiple manufacturers created their own charsets by assigning the values above 127 (the 8th bit) and, in some cases, changing characters in the lower range for their own purposes

ANSI-Standard

  • The idea was to standardize the charsets created by the different OEMs
  • Agreement was reached on the lower 7 bits (almost identical to ASCII)
  • No agreement could be reached on the characters assigned to the 8th bit
  • Code pages were created that specify the 8th-bit characters for different regions
  • MS-DOS shipped with multiple code pages to display the different ANSI-based charsets, from English to Icelandic
  • Almost no multilingual code pages were created -> no support for showing characters from different code pages, e.g. Greek and Hebrew, at the same time (see the Ruby sketch below)
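
To make the code-page problem concrete: the same byte decodes to different characters depending on the code page. A minimal Ruby sketch (the byte value 0xE9 and the choice of Windows-1252 vs. the Greek code page ISO-8859-7 are illustrative, not taken from the original notes):

    >> "\xE9".force_encoding('Windows-1252').encode('UTF-8')   # Western European code page
    => "é"
    >> "\xE9".force_encoding('ISO-8859-7').encode('UTF-8')     # Greek code page
    => "ι"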

Unicode

  • The Internet required a larger charset to display characters from multiple languages at the same time
  • Unicode is an effort to include all existing characters in a single charset

Code-point examples

  • Display Unicode code-points

    >> 'a'.codepoints.to_a
    => [97]
    >> '€'.codepoints.to_a
    => [8364]
  • Display ISO-8859-15 code-points

    >> 'a'.encode('iso-8859-15').codepoints.to_a
    => [97]
    >> '€'.encode('iso-8859-15').codepoints.to_a
    => [164]

Code-Page

  • A special code page exists for each of the different charsets based on the ANSI standard

Encoding

  • Mechanism to map between characters and a byte-based representation -> turns characters into bytes
  • Describes how the integer code-points are represented in memory
  • Typical encodings are: UTF-8, Windows-1252, ISO-8859-1 (Latin1)
  • UTF-8 maps the Unicode Character Set into bytes

Byte examples

  • Display UTF-8 encoding into bytes:

    >> 'a'.bytes.to_a
    => [97]
    >> '€'.bytes.to_a
    => [226, 130, 172]
  • Display ISO-8859-15 encoding into bytes:

    >> 'a'.encode('iso-8859-15').bytes.to_a
    => [97]
    >> '€'.encode('iso-8859-15').bytes.to_a
    => [164]

UCS-2 / UTF-16

  • Idea: just use two bytes (16 bits) to store each Unicode character

  • How is the word Hello stored in Unicode?

    • Hello is represented by the Unicode code-points U+0048 U+0065 U+006C U+006C U+006F

    • The word can be stored in either byte order (see the IRB sketch below):

      big-endian: 00 48 00 65 00 6C 00 6C 00 6F
      little-endian: 48 00 65 00 6C 00 6C 00 6F 00
    • Problem: a lot of zero bytes, especially for plain English text
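
Both byte orders can be reproduced in Ruby via the UTF-16BE and UTF-16LE encodings; a minimal IRB sketch:

    >> 'Hello'.encode('UTF-16BE').bytes.map { |b| format('%02X', b) }
    => ["00", "48", "00", "65", "00", "6C", "00", "6C", "00", "6F"]
    >> 'Hello'.encode('UTF-16LE').bytes.map { |b| format('%02X', b) }
    => ["48", "00", "65", "00", "6C", "00", "6C", "00", "6F", "00"]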

BOM (Byte-Order-Mark)

  • Defines whether the document is stored big- or little-endian by placing the bytes FE FF (or FF FE) at the beginning of the Unicode string, as shown below
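
The BOM is simply the code point U+FEFF encoded at the start of the stream; its byte order reveals the endianness. A minimal IRB sketch:

    >> "\uFEFF".encode('UTF-16BE').bytes.map { |b| format('%02X', b) }
    => ["FE", "FF"]
    >> "\uFEFF".encode('UTF-16LE').bytes.map { |b| format('%02X', b) }
    => ["FF", "FE"]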

UTF-8

  • The idea was to avoid the doubled memory demand for Unicode text where possible

  • Code-points from 0-127 are still stored in one byte

  • Code-points from 128 upward are stored using 2 to 6 bytes

  • The word Hello (U+0048 U+0065 U+006C U+006C U+006F) can now be stored without zero bytes (see the IRB sketch below)

    48 65 6C 6C 6F
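
This can be checked in Ruby (assuming the strings are UTF-8, the default in current Ruby); a minimal IRB sketch in which the accented 'é' is an illustrative addition showing the multi-byte case:

    >> 'Hello'.bytes.map { |b| format('%02X', b) }
    => ["48", "65", "6C", "6C", "6F"]
    >> 'Héllo'.bytes.map { |b| format('%02X', b) }
    => ["48", "C3", "A9", "6C", "6C", "6F"]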

References