Understanding CBOR

Introduction

This document is a general introduction to CBOR. The full specification is RFC 7049, and this document is no substitute for it.

The genome browser uses CBOR so that it can efficiently store and parse big datasets without the CPU overhead of parsing JSON. That overhead was a real problem, seen in practice before the switch from JSON. CBOR was chosen over other formats because it is an open standard and because it's relatively easy to make big messages out of cached message fragments which have already been serialised into CBOR (composability). This allows us to cache bits of responses and make a big response out of the bits, even if the combination requested is unique, without the overhead of re-serialising the components.

Underneath, CBOR (or at least the subset of it we use) maps pretty cleanly to JSON. So you can see this as just a binary version of JSON. The other fun things which CBOR can do are not used by the genome browser. So basically we've got:

strings, numbers, bools
arrays
maps

CBOR is a byte-oriented format. If you're forced to read it, it's best to use a hexdumper. Unix has hexdump -C, whose output format is great for understanding CBOR. After the offset of the line in the file (in hex), it shows the bytes in hex in the main part of the screen and in ASCII (for when you're looking at strings) on the right.

00000000  68 65 6c 6c 6f 20 77 6f  72 6c 64 2c 20 68 65 72  |hello world, her|
00000010  65 20 69 73 20 63 68 61  72 61 63 74 65 72 20 30  |e is character 0|
00000020  30 37 20 66 6f 72 20 65  78 61 6d 70 6c 65 20 07  |07 for example .|
00000030

"Major Types"

The most significant three bits of the first byte of each item are its major type, which tells you what's next in the file. The values are

0 : 0x00 - 0x1f -- positive integer
1 : 0x20 - 0x3f -- negative integer
2 : 0x40 - 0x5f -- byte string
3 : 0x60 - 0x7f -- normal string
4 : 0x80 - 0x9f -- arrays
5 : 0xa0 - 0xbf -- maps
6 : 0xc0 - 0xdf -- tags (we don't use these)
7 : 0xe0 - 0xff -- floats and misc markers

Tip: when just scanning a CBOR file, looking at the top digit of the hex is super-helpful for working out what's going on.

The major type combines with the "value" stored in the remaining five bits of the first byte.

For example the value "6" in the first byte for major type 5 would be the byte 0xa6, etc. So you get 32 values in that first byte for each major type.
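
The split described above can be sketched in a couple of lines of Python. This is only an illustration; the function names are made up for this document and are not part of any CBOR library.

```python
from typing import Tuple


def split_first_byte(byte: int) -> Tuple[int, int]:
    """Split a first byte into (major type, value)."""
    major = byte >> 5      # top three bits
    value = byte & 0x1F    # bottom five bits, so 0..31
    return major, value


def join_first_byte(major: int, value: int) -> int:
    """Combine a major type (0..7) and a value (0..31) into one byte."""
    return (major << 5) | value


# Value 6 in major type 5 (maps) is the byte 0xa6.
print(hex(join_first_byte(5, 6)))
print(split_first_byte(0xA6))
```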

Numbers

Positive and negative numbers are represented by 0x00-0x1f (+ve) and 0x20-0x3f (-ve). Negative numbers are represented as if positive, just with a different major type, and one less (as we don't have negative zero). For example, -1 is represented as 0 but with major type 1; -2 is represented as 1 but with major type 1; -3 is represented as 2 but with major type 1, etc.

Numbers less than 24 are just stored in this byte itself. For example 3 is actually just 0x03, 7 is 0x07, -3 is 0x22.
Numbers that will fit in one extra byte are stored with the representation of "24" in that first byte, then one more byte with the actual value in. 24 is 0x18 in hex. So the number 128, which is 0x80 in hex, is represented as 0x18 0x80. The number 170, which is 0xAA in hex, is represented as 0x18 0xAA, etc.
If your number won't fit into one extra byte but will fit into two, you can put the representation of "25" (0x19) in that first byte and then you get two more for your number, MSB first. For example, the number 0x1234 can be encoded as 0x19 0x12 0x34.
You can get four additional bytes by using value "26" (0x1A) in the first byte. eg 0x12345678 is 0x1A 0x12 0x34 0x56 0x78.
For eight bytes use "27" (0x1B) eg 0x0123456789ABCDEF is 0x1B 0x01 0x23 0x45 0x67 0x89 0xAB 0xCD 0xEF
The other potentially available values in the first byte, 28-31 (0x1C-0x1F), are not used for numbers.

Most usefully, small positive integers are represented with just their value, and these are the ones you encounter most.
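
The number rules above can be sketched as a small Python encoder. It is a minimal illustration rather than the genome browser's own code; encode_head and encode_int are hypothetical names, and the sketch always picks the shortest form that fits.

```python
def encode_head(major: int, n: int) -> bytes:
    """Encode the non-negative number n under the given major type."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < 24:
        # Small values live in the first byte itself.
        return bytes([(major << 5) | n])
    if n <= 0xFF:
        return bytes([(major << 5) | 24]) + n.to_bytes(1, "big")
    if n <= 0xFFFF:
        return bytes([(major << 5) | 25]) + n.to_bytes(2, "big")
    if n <= 0xFFFFFFFF:
        return bytes([(major << 5) | 26]) + n.to_bytes(4, "big")
    if n <= 0xFFFFFFFFFFFFFFFF:
        return bytes([(major << 5) | 27]) + n.to_bytes(8, "big")
    raise ValueError("n does not fit in eight bytes")


def encode_int(n: int) -> bytes:
    """Encode a Python int as major type 0 (+ve) or major type 1 (-ve)."""
    if n >= 0:
        return encode_head(0, n)
    # -1 is stored as 0, -2 as 1, and so on.
    return encode_head(1, -1 - n)


for number in (3, -3, 128, 0x1234, 0x12345678):
    print(number, encode_int(number).hex())
```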

Strings and Byte Strings

These start with a representation of the length, which works just like the numbers above but with a different major type. Once the length has been specified in this way, that many bytes of string follow.

"hello" is 5 bytes long and 0x65 represents 5 in major type 3, so 0x65 h e l l o
"" is 0 bytes long, so 0x60
a 65535 byte long string (0xFFFF bytes long) would be 0x79 0xFF 0xFF then 65535 bytes.
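
As a rough Python sketch of the above (all names here are hypothetical), a string is just a length written like a number, followed by the bytes. Note that the length counts bytes, not characters; CBOR text strings (major type 3) are UTF-8.

```python
def head(major: int, n: int) -> bytes:
    """Write n the way numbers are written, but under the given major type."""
    if n < 24:
        return bytes([(major << 5) | n])
    for extra, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * size):
            return bytes([(major << 5) | extra]) + n.to_bytes(size, "big")
    raise ValueError("length does not fit in eight bytes")


def encode_text(s: str) -> bytes:
    """Normal string: major type 3, length in bytes, then the UTF-8 bytes."""
    raw = s.encode("utf-8")
    return head(3, len(raw)) + raw


def encode_bytes(raw: bytes) -> bytes:
    """Byte string: major type 2, length, then the bytes unchanged."""
    return head(2, len(raw)) + raw


print(encode_text("hello").hex())
print(encode_text("").hex())
```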

Normal (definite) Arrays and hashes

Arrays and hashes work like strings. They start with the number of items represented as a number (but with the relevant major type for arrays and hashes), and then there's that many CBOR items afterwards.

An array containing two zeros would be 0x82 0x00 0x00. 0x82 is "2" in major type 4, then there are two CBOR items.
An empty array would be 0x80
An array containing the strings "hello" and "world" would be 0x82 0x65 h e l l o 0x65 w o r l d

Hashes work like arrays except the number is the number of key/value pairs.
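
Because an array or hash is only a count followed by complete CBOR items, it can be assembled from fragments which are already serialised, which is the composability mentioned in the introduction. The Python below is a minimal sketch of that idea with made-up helper names; it is not the genome browser's implementation.

```python
from typing import Iterable, List, Tuple


def head(major: int, n: int) -> bytes:
    """Write n the way numbers are written, but under the given major type."""
    if n < 24:
        return bytes([(major << 5) | n])
    for extra, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * size):
            return bytes([(major << 5) | extra]) + n.to_bytes(size, "big")
    raise ValueError("count does not fit in eight bytes")


def array_of(encoded_items: List[bytes]) -> bytes:
    """Definite array: item count in major type 4, then the items as they are."""
    return head(4, len(encoded_items)) + b"".join(encoded_items)


def hash_of(encoded_pairs: Iterable[Tuple[bytes, bytes]]) -> bytes:
    """Definite hash: pair count in major type 5, then key, value, key, value..."""
    pairs = list(encoded_pairs)
    body = b"".join(key + value for key, value in pairs)
    return head(5, len(pairs)) + body


# Fragments which were serialised earlier (for example, taken from a cache).
cached_hello = b"\x65hello"
cached_world = b"\x65world"
print(array_of([cached_hello, cached_world]).hex())
```

Nothing inside the cached fragments is parsed or re-serialised; only the few bytes of the count are new.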

Indefinite arrays and hashes

These represent the same thing as definite arrays and hashes (the difference is not visible to code using the decoded data) but are serialized differently so that you don't need to know the length in advance. Use code "31" in the first byte (which is never valid as a number) and then just start adding entries. When done, put byte 0xFF (break).

Another representation of an array containing the strings "hello" and "world" is 0x9F 0x65 h e l l o 0x65 w o r l d 0xFF
Another representation of an empty array is 0x9F 0xFF
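
To show that both serialisations give the same result, here is a small decoder sketch in Python. It is deliberately incomplete: it only understands integers, normal strings and arrays, and the function names are invented for this example.

```python
def read_head(data: bytes, pos: int):
    """Return (major, info, number, next_pos); number is None for info 28-31."""
    first = data[pos]
    major, info = first >> 5, first & 0x1F
    pos += 1
    if info < 24:
        return major, info, info, pos
    if info <= 27:
        size = 1 << (info - 24)  # 1, 2, 4 or 8 extra bytes
        number = int.from_bytes(data[pos:pos + size], "big")
        return major, info, number, pos + size
    return major, info, None, pos


def decode(data: bytes, pos: int = 0):
    """Decode one item starting at pos and return (value, next_pos)."""
    major, info, number, pos = read_head(data, pos)
    if major == 4 and info == 31:
        # Indefinite array: keep reading items until the break byte.
        items = []
        while data[pos] != 0xFF:
            item, pos = decode(data, pos)
            items.append(item)
        return items, pos + 1  # step over the break
    if number is None:
        raise ValueError("unsupported first byte in this sketch")
    if major == 0:
        return number, pos
    if major == 1:
        return -1 - number, pos
    if major == 3:
        return data[pos:pos + number].decode("utf-8"), pos + number
    if major == 4:
        items = []
        for _ in range(number):
            item, pos = decode(data, pos)
            items.append(item)
        return items, pos
    raise ValueError("major type not handled in this sketch")


definite = b"\x82\x65hello\x65world"
indefinite = b"\x9f\x65hello\x65world\xff"
print(decode(definite)[0] == decode(indefinite)[0])
```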

Misc (major type 7)

These are for odds and ends. The values in the first byte are

0-19 unused (0xE0 - 0xF3)
20 (0xF4) false
21 (0xF5) true
22 (0xF6) null
23 (0xF7) undefined (ie it is defined in the CBOR standard to be the value undefined)
24 (0xF8) one extra byte but all values for it unassigned (future expansion)
25 (0xF9) 2-byte float follows
26 (0xFA) 4-byte float follows
27 (0xFB) 8-byte float follows
28-30 (0xFC-0xFE) unused
31 (0xFF) "break" for indefinite arrays and hashes (see above)
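
The table above translates fairly directly into code. The Python sketch below decodes a single major type 7 item. The names are hypothetical, and since Python has no built-in "undefined" value, a stand-in object called UNDEFINED is assumed here.

```python
import struct

UNDEFINED = object()  # stand-in for CBOR's "undefined" (an assumption of this sketch)

# The struct formats are big-endian half, single and double precision floats.
FLOAT_FORMATS = {25: (">e", 2), 26: (">f", 4), 27: (">d", 8)}


def decode_misc(data: bytes):
    """Decode one major type 7 item from the start of data."""
    first = data[0]
    if first >> 5 != 7:
        raise ValueError("not major type 7")
    value = first & 0x1F
    if value == 20:
        return False
    if value == 21:
        return True
    if value == 22:
        return None
    if value == 23:
        return UNDEFINED
    if value in FLOAT_FORMATS:
        fmt, size = FLOAT_FORMATS[value]
        return struct.unpack(fmt, data[1:1 + size])[0]
    if value == 31:
        raise ValueError("break is only meaningful inside an indefinite item")
    raise ValueError("unused value in major type 7")


print(decode_misc(bytes.fromhex("f5")))
print(decode_misc(bytes.fromhex("fb3ff8000000000000")))
```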
