Information encoding plays a crucial role in information technology, serving as a vital element in representing the low-level mapping of the processed information.
The encoding process, which is frequently observable by end users, takes place whenever an application is tasked with handling data. This includes web applications, which, like numerous other applications, consistently manage vast amounts of information, even for basic requests.
Gaining insight into the encoding schemes employed by a web application can provide a significant edge when identifying and exploiting vulnerabilities.
Every day, internet users request billions of pages through their web browsers, with all the content of these pages being displayed based on a charset. But what exactly is a charset? As the term implies, it consists of a set of characters, representing the collection of symbols visible to end users in their browser windows. In technical terms, a charset comprises pairs of symbols and code points. The symbol is what the user reads on the screen, while the code point is a numeric index distinguishing the symbol within the charset. Symbols can only be shown if they exist in the charset. Examples of charsets include ASCII, Unicode, Latin-1, and others.
The ASCII (American Standard Code for Information Interchange) charset contains a limited set of symbols, originally 128 and now typically defined by its extended version with a total of 255. It is an older charset designed to support only US symbols and cannot display characters such as Chinese symbols, σ, and Σ. See complete list
{% embed url="https://www.ascii-code.com/" %}
Examples:
Code | Hex | Symbol |
---|---|---|
65 | 41 | A |
66 | 42 | B |
67 | 43 | C |
68 | 44 | D |
... | ... | ... |
Unicode (Universal Character Set) is a character encoding standard created to enable global computer usage in any language, supporting all writing systems worldwide. The entire Unicode charset can be found here.
{% embed url="https://symbl.cc/en/unicode/table/#0032" %}
Character encoding, or encoding, represents, in bytes, the symbols of a charset, mapping symbols to ordered bytes. A symbol can be represented by one or more bytes.
Unicode has three main types of character encoding: UTF-8, UTF-16, and UTF-32, where UTF stands for Unicode Transformation Format. The numbers 8, 16, and 32 indicate the number of bits used to represent code points.
Example:
Symbol | Unicode | UTF-8 | UTF-16 | UTF-32 |
---|---|---|---|---|
! | U+0021 | 21 | 00 21 | 00 00 00 21 |
W | U+0057 | 57 | 00 57 | 00 00 00 57 |
⮀ | U+2B80 | E2 AE 80 | 2B 80 | 00 00 2B 80 |
⌗ | U+2317 | E2 8C 97 | 23 17 | 00 00 23 17 |
In HTML, it's crucial to consider the integrity of URLs to ensure correct data display. Two main issues include informing the user agent of the character encoding and preserving the real meaning of certain characters.
According to HTTP 1.1 RFC, the Content-Type HTTP header can specify the character encoding. Examples for HTML4 and HTML5:
HTML4: <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">
HTML5: <meta charset="UTF-8">
Setting an incorrect charset or omitting it may lead to unexpected behavior. HTML entities, such as <
for <
, can be used to represent symbols in HTML without being interpreted as language elements.
As per RFC 3986, URLs must contain characters within the US-ASCII code character set. Unsafe characters must be encoded, limiting URL characters to specific subsets: Unreserved Chars [a-zA-Z0-9-._~] and Reserved Chars [:/?#[]@!$&"()*+,;=%]. Other characters are encoded using percent encoding (%), followed by two hexadecimal digits.
URL-encoding, while not a security feature, helps limit the attack surface. Web browsers and server-side script engines often perform URL-encoding automatically.
Base64 is a binary-to-text encoding scheme used to convert binary files for internet transmission. It includes digits [0-9], Latin letters [a-zA-Z], and characters + and /, totaling 62 values. Different implementations may use variations for the last two characters and padding (=). Base64 encoding is employed in HTML to include resources, like embedding binary content of an image directly into a page.
A very interesting tool to joke with information encoding, decoding, conversion, etc is CyberChef.
It is a simple, intuitive web app for carrying out all manner of "cyber" operations within a web browser. These operations include simple encoding like XOR and Base64, more complex encryption like AES, DES and Blowfish, creating binary and hexdumps, compression and decompression of data, calculating hashes and checksums, IPv6 and X.509 parsing, changing character encodings, and much more.
{% embed url="https://gchq.github.io/CyberChef/" %}
{% embed url="https://ray.run/tools" %}