KWZ Format
The KWZ format is used to store Flipnote animations. The body of the file is structured into sections, each of which begins with an 8-byte header. The last 256 bytes of a KWZ file are an RSA-2048 signature (using SHA-256) over the rest of the file.
Variants of the format are also used for folder icons and comments.
Type | Details |
---|---|
char[4] | Section magic |
uint32 | Section size (not including header) |
The first 3 chars of the section magic identify the section type, and the last char seems to be used for flags of some kind but the meaning of these isn't known.
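As a rough illustration of the section layout, the sketch below walks the sections of a file held in memory. It assumes little-endian integers (not stated above) and that the trailing 256-byte signature sits outside any section; `iter_sections` is a hypothetical helper name, not something taken from an existing tool.

```python
import struct

SIGNATURE_SIZE = 256  # trailing signature, per the description above


def iter_sections(data: bytes):
    """Yield (section_type, flag_byte, body) for each section in a KWZ buffer."""
    offset = 0
    end = len(data) - SIGNATURE_SIZE
    while offset + 8 <= end:
        # 8-byte header: 4-char magic, then uint32 size (little-endian assumed)
        magic, size = struct.unpack_from("<4sI", data, offset)
        body = data[offset + 8 : offset + 8 + size]
        # first 3 chars identify the section, the 4th is the unknown flag byte
        yield magic[:3].decode("ascii"), magic[3], body
        offset += 8 + size
```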
Offset | Type | Details |
---|---|---|
0x0 | uint32 | CRC32 checksum |
0x4 | uint32 | Creation timestamp |
0x8 | uint32 | Last edit timestamp |
0xC | uint32 | App version? - seen as 0, 1 or 3 so far |
0x10 | hex[10] | Root author ID |
0x1A | hex[10] | Parent author ID |
0x24 | hex[10] | Current author ID |
0x2E | wchar[11] | Root author name |
0x44 | wchar[11] | Parent author name |
0x5A | wchar[11] | Current author name |
0x70 | char[28] | Root filename |
0x8C | char[28] | Parent filename |
0xA8 | char[28] | Current filename |
0xC4 | uint16 | Frame count |
0xC6 | uint16 | Thumbnail frame index |
0xC8 | uint16 | Flags |
0xCA | uint8 | Frame speed |
0xCB | uint8 | Layer visibility flags |
Timestamps are stored as the number of seconds since midnight 1 Jan 2000.
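A minimal sketch of the timestamp conversion follows. The time zone is not stated above, so the result is left as a naive `datetime`; `kwz_timestamp_to_datetime` is an illustrative name.

```python
from datetime import datetime, timedelta

KWZ_EPOCH = datetime(2000, 1, 1)  # midnight, 1 Jan 2000


def kwz_timestamp_to_datetime(seconds: int) -> datetime:
    """Convert seconds since the KWZ epoch into a naive datetime."""
    return KWZ_EPOCH + timedelta(seconds=seconds)
```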
Author names are null-padded UTF-16 LE strings. Author IDs are usually displayed with dashes, like `xxxx-xxxx-xxxx-xxxxxx` (using lowercase hex).
Filenames are base32-encoded using a custom character sequence of `cwmfjordvegbalksnthpyxquiz012345`. The decoded filename can be unpacked to get the author ID, creation timestamp and modified timestamp.
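One possible way to decode a filename is to map the custom alphabet onto the standard one and reuse Python's base32 decoder. This sketch assumes the bits are packed as in RFC 4648 base32, which is not confirmed above, and it stops at the raw bytes because the field layout of the decoded filename is not given here. `decode_filename` is a hypothetical name.

```python
import base64

KWZ_ALPHABET = "cwmfjordvegbalksnthpyxquiz012345"
RFC4648_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
_TO_STANDARD = str.maketrans(KWZ_ALPHABET, RFC4648_ALPHABET)


def decode_filename(name: str) -> bytes:
    """Decode a KWZ filename to raw bytes (assumes RFC 4648 bit packing)."""
    standard = name.translate(_TO_STANDARD)
    # pad to a multiple of 8 characters, as the standard decoder requires;
    # a 28-character name needs 4 '=' and yields 17 bytes
    return base64.b32decode(standard + "=" * (-len(standard) % 8))
```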
Bit index (0 = lowest) | Details |
---|---|
0 | Layer A invisible |
1 | Layer B invisible |
2 | Layer C invisible |
Bit index (0 = lowest) | Details |
---|---|
0 | Lock flag |
1 | Loop playback flag |
4 | Toolset flag |
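The two bit tables above can be read with plain bit tests. The sketch assumes the first table belongs to the layer visibility byte at 0xCB and the second to the uint16 flags at 0xC8; the function and key names are illustrative.

```python
def decode_layer_visibility(value: int) -> dict:
    """Bits 0-2 mark layers A-C as invisible; return True where a layer is visible."""
    return {name: ((value >> bit) & 1) == 0 for bit, name in enumerate("ABC")}


def decode_header_flags(value: int) -> dict:
    """Pick out the known bits of the header flags field."""
    return {
        "locked": bool(value & (1 << 0)),
        "loop": bool(value & (1 << 1)),
        "toolset": bool(value & (1 << 4)),
    }
```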
Value | Frames / Seconds |
---|---|
0 | 1 / 5 |
1 | 1 / 2 |
2 | 1 / 1 |
3 | 2 / 1 |
4 | 4 / 1 |
5 | 6 / 1 |
6 | 8 / 1 |
7 | 12 / 1 |
8 | 20 / 1 |
9 | 24 / 1 |
10 | 30 / 1 |
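The speed table lends itself to a small lookup. This is only a sketch; `FRAME_SPEEDS` and `frames_per_second` are made-up names, and `Fraction` is used so that the slow speeds stay exact.

```python
from fractions import Fraction

# (frames, seconds) pairs from the table above, indexed by the frame speed value
FRAME_SPEEDS = [
    (1, 5), (1, 2), (1, 1), (2, 1), (4, 1), (6, 1),
    (8, 1), (12, 1), (20, 1), (24, 1), (30, 1),
]


def frames_per_second(speed: int) -> Fraction:
    """Return the playback rate for a frame speed value as an exact fraction."""
    frames, seconds = FRAME_SPEEDS[speed]
    return Fraction(frames, seconds)
```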
This section starts with a CRC32 checksum followed by the Flipnote's thumbnail stored as JPEG image data. Thumbnails are 80px x 64px, with a black line at the bottom which is normally cropped when displayed in the app.
This section starts with a CRC32 checksum followed by compressed frame data. Frames are stored in playback sequence, and split into separate images for layer A, layer B, and layer C.
The frame size is 320 x 240 pixels, although under normal circumstances the outer edges can't be drawn on since the app puts a border around the bottom screen.
Layers are divided into 8x8 tiles; 1200 in total. Every horizontal line in the tile references a line table index. The line table contains every possible combination of pixels for a line.
Bitpacking is used to achieve minimal data size.
8x8 tiles (shown in red) are grouped into larger tiles (shown in blue) which are 128x128 unless they fall off the edge of the frame. Tiles are stored in sequence from left-to-right, top to bottom.
The following pseudocode shows how we currently deal with this:
for large_tile_y = 0; large_tile_y < 240; large_tile_y += 128:
  for large_tile_x = 0; large_tile_x < 320; large_tile_x += 128:
    for tile_y = 0; tile_y < 128; tile_y += 8:
      y = large_tile_y + tile_y
      # if the tile falls off the bottom of the frame, jump to the next large tile
      if y >= 240: break
      for tile_x = 0; tile_x < 128; tile_x += 8:
        x = large_tile_x + tile_x
        # if the tile falls off the right of the frame, jump to the next small tile row
        if x >= 320: break
        # ... decode tile -- (x, y) is the position of the tile's top-left corner
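The same traversal can be written as a Python generator, which makes it easy to check that it visits each of the 1200 tiles exactly once. `iter_tile_positions` is an illustrative name, and the default sizes are the ones quoted above.

```python
def iter_tile_positions(width=320, height=240, large=128, small=8):
    """Yield the top-left (x, y) of every 8x8 tile in storage order."""
    for large_y in range(0, height, large):
        for large_x in range(0, width, large):
            for tile_y in range(0, large, small):
                y = large_y + tile_y
                if y >= height:
                    break  # rest of this large tile is below the frame
                for tile_x in range(0, large, small):
                    x = large_x + tile_x
                    if x >= width:
                        break  # rest of this row is right of the frame
                    yield x, y
```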
Each tile starts with a 3-bit value which gives the type of compression it uses:
Type 0: All lines are the same and use one of the commonly occurring line indexes defined in table 1. A single 5-bit value provides an index for table 1, which in turn gives the linetable index.
Pseudocode:
line_index = table1[read_bits(5)]
line = linetable[line_index]
tile = [
line,
line,
line,
line,
line,
line,
line,
line,
]
Type 1: All lines are the same, as in type 0, but a 13-bit value gives the line index directly.
Pseudocode:
line_index = read_bits(13)
line = linetable[line_index]
tile = [
line,
line,
line,
line,
line,
line,
line,
line,
]
Type 2: Like type 0, all lines use a commonly occurring line index, given by a single 5-bit value. However, every other line is rotated one pixel to the left, so table 1 is used to get the index for odd lines and table 2 is used for even ones.
This tile type is most commonly used for pixel patterns created with the paintbrush tool.
Pseudocode:
index = read_bits(5)
line_index_a = table1[index]
line_index_b = table2[index]
a = linetable[line_index_a]
b = linetable[line_index_b]
tile = [a, b, a, b, a, b, a, b]
Type 3: Same as type 2, except a 13-bit value gives the line index for odd lines directly. Indexes for even lines are translated using table 3.
Pseudocode:
line_index_a = read_bits(13)
line_index_b = table3[line_index_a]
a = linetable[line_index_a]
b = linetable[line_index_b]
tile = [a, b, a, b, a, b, a, b]
Type 4: Each line can either be a 5-bit common line index from table 1, or a 13-bit line index. The tile starts with an 8-bit mask which indicates which to use for each line.
Pseudocode:
mask = read_bits(8)
for i in range(8):
  if mask & (1 << i):
    line_index = table1[read_bits(5)]
  else:
    line_index = read_bits(13)
  tile[i] = linetable[line_index]
Type 5: This indicates that one or more tiles have not changed since the previous frame, so they can be skipped. A 5-bit value gives the number of tiles to skip after the current one.
Type 6: Not used.
Type 7: The lines in this tile are arranged in a pattern of two components (A and B). A 2-bit value describes the pattern type (detailed below), followed by a 1-bit value.
If the 1-bit value is set to `1`, then A and B should be read as 5-bit table 1 indexes, and the pattern type should also be incremented by 1 (wrapping from 3 back to 0).
If the 1-bit value is `0`, then A and B should be read as 13-bit line indexes.
Pattern Type | Pattern |
---|---|
0 | A B A B A B A B |
1 | A A B A A B A A |
2 | A B A A B A A B |
3 | A B B A B B A B |
Pseudocode:
pattern = read_bits(2)
use_table = read_bits(1)
if use_table:
  line_index_a = table1[read_bits(5)]
  line_index_b = table1[read_bits(5)]
  pattern = (pattern + 1) % 4
else:
  line_index_a = read_bits(13)
  line_index_b = read_bits(13)
a = linetable[line_index_a]
b = linetable[line_index_b]
if pattern == 0: tile = [a, b, a, b, a, b, a, b]
if pattern == 1: tile = [a, a, b, a, a, b, a, a]
if pattern == 2: tile = [a, b, a, a, b, a, a, b]
if pattern == 3: tile = [a, b, b, a, b, b, a, b]
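The pattern table can also be expressed as data rather than a chain of `if` statements. The sketch below covers only the arrangement step, so the two lines are passed in already decoded; `expand_pattern` is an illustrative name.

```python
# one string per pattern type, copied from the pattern table
PATTERNS = ("ABABABAB", "AABAABAA", "ABAABAAB", "ABBABBAB")


def expand_pattern(pattern_type: int, use_table: bool, a, b) -> list:
    """Arrange lines a and b into 8 rows for the A/B pattern tile type."""
    if use_table:
        # table-indexed tiles bump the pattern type by one, wrapping at 4
        pattern_type = (pattern_type + 1) % 4
    return [a if slot == "A" else b for slot in PATTERNS[pattern_type]]
```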
Table 1 holds the line table indexes for commonly occurring lines:
0x0000, 0x0CD0, 0x19A0, 0x02D9, 0x088B, 0x0051, 0x00F3, 0x0009,
0x001B, 0x0001, 0x0003, 0x05B2, 0x1116, 0x00A2, 0x01E6, 0x0012,
0x0036, 0x0002, 0x0006, 0x0B64, 0x08DC, 0x0144, 0x00FC, 0x0024,
0x001C, 0x0004, 0x0334, 0x099C, 0x0668, 0x1338, 0x1004, 0x166C
Table 2 represents the same common lines as table 1, but their pixels are rotated one place to the left, so they have different offsets:
0x0000, 0x0CD0, 0x19A0, 0x0003, 0x02D9, 0x088B, 0x0051, 0x00F3,
0x0009, 0x001B, 0x0001, 0x0006, 0x05B2, 0x1116, 0x00A2, 0x01E6,
0x0012, 0x0036, 0x0002, 0x02DC, 0x0B64, 0x08DC, 0x0144, 0x00FC,
0x0024, 0x001C, 0x099C, 0x0334, 0x1338, 0x0668, 0x166C, 0x1004
Table 3 represents indexes for all possible lines, but where the line's pixels are first rotated one place to the left. This table is 6561 items long, but fortunately it is possible to generate it:
table3 = Array(6561)
index = 0
for a = 0; a < 2187; a += 729:
  for b = 0; b < 729; b += 243:
    for c = 0; c < 243; c += 81:
      for d = 0; d < 81; d += 27:
        for e = 0; e < 27; e += 9:
          for f = 0; f < 9; f += 3:
            for g = 0; g < 3; g += 1:
              for h = 0; h < 6561; h += 2187:
                table3[index] = a + b + c + d + e + f + g + h
                index += 1
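Read as arithmetic, the nested loops move the lowest base-3 digit of the index to the top and shift the other seven digits down one place. The sketch below builds the table both ways so the two can be compared; the function names are illustrative.

```python
from itertools import product


def build_table3_nested() -> list:
    """Direct transcription of the nested loops (h is the innermost loop)."""
    table = []
    for a, b, c, d, e, f, g, h in product(range(3), repeat=8):
        table.append(a * 729 + b * 243 + c * 81 + d * 27 + e * 9 + f * 3 + g + h * 2187)
    return table


def build_table3_closed_form() -> list:
    """Same table: drop the lowest base-3 digit and re-insert it as the highest."""
    return [index // 3 + (index % 3) * 2187 for index in range(6561)]
```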
The line table contains every possible combination of pixels for an 8-pixel line. This can be generated too -- our method creates the linetable as an array of 6561 items, where each item represents a line of 8 pixels. Pixel values are `0` for transparent, `1` for layer color 1 and `2` for layer color 2:
# for this example, the line table is a 2d array of size [6561][8]
# linetable should be of type uint8[][]
linetable = Array(6561, 8)
index = 0
for a = 0; a < 3; a += 1:
  for b = 0; b < 3; b += 1:
    for c = 0; c < 3; c += 1:
      for d = 0; d < 3; d += 1:
        for e = 0; e < 3; e += 1:
          for f = 0; f < 3; f += 1:
            for g = 0; g < 3; g += 1:
              for h = 0; h < 3; h += 1:
                linetable[index] = [b, a, d, c, f, e, h, g]
                index += 1
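With the line table built this way, tables 1 and 2 can be cross-checked against each other: every table 2 entry should be the matching table 1 line rotated one pixel to the left. The following Python sketch does that; `build_linetable` and `rotate_left` are illustrative names, and the two tables are copied from the listings earlier in this document.

```python
from itertools import product

TABLE1 = [
    0x0000, 0x0CD0, 0x19A0, 0x02D9, 0x088B, 0x0051, 0x00F3, 0x0009,
    0x001B, 0x0001, 0x0003, 0x05B2, 0x1116, 0x00A2, 0x01E6, 0x0012,
    0x0036, 0x0002, 0x0006, 0x0B64, 0x08DC, 0x0144, 0x00FC, 0x0024,
    0x001C, 0x0004, 0x0334, 0x099C, 0x0668, 0x1338, 0x1004, 0x166C,
]
TABLE2 = [
    0x0000, 0x0CD0, 0x19A0, 0x0003, 0x02D9, 0x088B, 0x0051, 0x00F3,
    0x0009, 0x001B, 0x0001, 0x0006, 0x05B2, 0x1116, 0x00A2, 0x01E6,
    0x0012, 0x0036, 0x0002, 0x02DC, 0x0B64, 0x08DC, 0x0144, 0x00FC,
    0x0024, 0x001C, 0x099C, 0x0334, 0x1338, 0x0668, 0x166C, 0x1004,
]


def build_linetable() -> list:
    """6561 lines of 8 pixels, using the [b, a, d, c, f, e, h, g] order above."""
    return [
        [b, a, d, c, f, e, h, g]
        for a, b, c, d, e, f, g, h in product(range(3), repeat=8)
    ]


def rotate_left(line: list) -> list:
    """Rotate a line one pixel to the left, wrapping the first pixel to the end."""
    return line[1:] + line[:1]


linetable = build_linetable()
# indexes where table 2 is NOT the left-rotated table 1 line (expected: none)
mismatches = [
    i for i in range(32)
    if linetable[TABLE2[i]] != rotate_left(linetable[TABLE1[i]])
]
```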
Just for reference, the linetable Nintendo uses for the app is 4 bits per pixel, where the packed pixel order is `2, 1, 4, 3, 6, 5, 8, 7` and pixel values are `0x0` for transparent, `0x1` for layer color 1 and `0xF` for layer color 2. This can be generated like so:
linetable = Array(6561)
values = [0, 1, 0xF, 0x10, 0x11, 0x1F, 0xF0, 0xF1, 0xFF]
index = 0
for a = 0; a < 9; a += 1:
  for b = 0; b < 9; b += 1:
    for c = 0; c < 9; c += 1:
      for d = 0; d < 9; d += 1:
        # each loop variable selects one byte (two packed pixels) of the 32-bit line
        linetable[index] = (values[d] << 24) | (values[c] << 16) | (values[b] << 8) | values[a]
        index += 1
We find that generating the linetable our way is easier to work with though, since pixel values and order are a bit more workable.
This section contains a table of metadata for each frame. Each entry in the table is 28 bytes long:
Offset | Type | Details |
---|---|---|
0x0 | uint32 | Flags |
0x4 | uint16 | Layer A size |
0x6 | uint16 | Layer B size |
0x8 | uint16 | Layer C size |
0xA | hex[10] | Frame author ID |
0x14 | uint8 | Layer A depth |
0x15 | uint8 | Layer B depth |
0x16 | uint8 | Layer C depth |
0x17 | uint8 | Frame sound effect flags |
0x18 | uint32 | Camera flag |
Layer depths range from `0` (nearest) to `6` (furthest).
The camera flag is usually `0`, but will be `0x0007` if the frame uses a photo.
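A 28-byte entry maps directly onto a `struct` format string. The sketch below assumes little-endian fields, that the byte at 0x16 is the layer C depth, that the uint32 at 0x18 is the camera flag mentioned above, and that the section body passed in is just the table; `FrameMeta` and `parse_frame_meta` are made-up names.

```python
import struct
from collections import namedtuple

# flags, 3 layer sizes, 10-byte author ID, 3 depths, sound flags, uint32 at 0x18
FRAME_META_FORMAT = "<IHHH10sBBBBI"

FrameMeta = namedtuple(
    "FrameMeta",
    "flags layer_a_size layer_b_size layer_c_size author_id "
    "layer_a_depth layer_b_depth layer_c_depth sound_flags camera_flag",
)


def parse_frame_meta(table: bytes, frame_index: int) -> FrameMeta:
    """Unpack the metadata entry for one frame from the table bytes."""
    offset = frame_index * struct.calcsize(FRAME_META_FORMAT)
    return FrameMeta._make(struct.unpack_from(FRAME_META_FORMAT, table, offset))
```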
Mask | Details |
---|---|
flags & 0xF | Paper color index |
(flags >> 4) & 0x1 | Layer A diffing flag |
(flags >> 5) & 0x1 | Layer B diffing flag |
(flags >> 6) & 0x1 | Layer C diffing flag |
(flags >> 7) & 0x1 | Is frame based on prev frame |
(flags >> 8) & 0xF | Layer A first color index |
(flags >> 12) & 0xF | Layer A second color index |
(flags >> 16) & 0xF | Layer B first color index |
(flags >> 20) & 0xF | Layer B second color index |
(flags >> 24) & 0xF | Layer C first color index |
(flags >> 28) & 0xF | Layer C second color index |
Diffing flags are stored in the order of layer A, layer B, layer C starting from the lowest bit. The bit will be set to `0` if the layer is based on the same layer from the previous frame.
Each color is stored as a palette index.
Index | Name | HEX color |
---|---|---|
0 | white | #ffffff |
1 | black | #141414 |
2 | red | #ff1717 |
3 | yellow | #ffe600 |
4 | green | #008232 |
5 | blue | #06aeff |
6 | transparent (paper only) | - |
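Putting the mask table and the palette together, a frame's flags field could be unpacked roughly like this. It is a sketch only; `unpack_frame_flags` and the dictionary keys are illustrative names.

```python
# palette index -> (name, hex color); index 6 has no color of its own
PALETTE = {
    0: ("white", "#ffffff"),
    1: ("black", "#141414"),
    2: ("red", "#ff1717"),
    3: ("yellow", "#ffe600"),
    4: ("green", "#008232"),
    5: ("blue", "#06aeff"),
    6: ("transparent", None),
}


def unpack_frame_flags(flags: int) -> dict:
    """Split the uint32 frame flags into the fields from the mask table."""
    return {
        "paper_color": flags & 0xF,
        "layer_a_diffing": (flags >> 4) & 0x1,
        "layer_b_diffing": (flags >> 5) & 0x1,
        "layer_c_diffing": (flags >> 6) & 0x1,
        "based_on_prev_frame": (flags >> 7) & 0x1,
        # (first color index, second color index) per layer
        "layer_a_colors": ((flags >> 8) & 0xF, (flags >> 12) & 0xF),
        "layer_b_colors": ((flags >> 16) & 0xF, (flags >> 20) & 0xF),
        "layer_c_colors": ((flags >> 24) & 0xF, (flags >> 28) & 0xF),
    }
```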
Mask | Details |
---|---|
(soundFlags & 0x1) !== 0 | SE1 |
(soundFlags & 0x2) !== 0 | SE2 |
(soundFlags & 0x4) !== 0 | SE3 |
(soundFlags & 0x8) !== 0 | SE4 |
Type | Description |
---|---|
uint32 | Flipnote speed when recorded |
uint32 | BGM size |
uint32 | SE1 (A) size |
uint32 | SE2 (X) size |
uint32 | SE3 (Y) size |
uint32 | SE4 (up) size |
uint32 | CRC32 checksum of the audio tracks |
After the header, the audio tracks are stored in the order of BGM, SE1, SE2, SE3, and SE4.
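Since the header gives every track size, the tracks can be sliced out in one pass. This sketch assumes little-endian fields and that the bytes passed in begin at the sound header; `split_sound_section` is a hypothetical helper name.

```python
import struct

TRACK_NAMES = ("BGM", "SE1", "SE2", "SE3", "SE4")
SOUND_HEADER_FORMAT = "<7I"  # speed, 5 track sizes, CRC32 (little-endian assumed)


def split_sound_section(body: bytes):
    """Return (recorded_speed, crc32, {track name: track bytes})."""
    speed, *sizes, crc32 = struct.unpack_from(SOUND_HEADER_FORMAT, body, 0)
    offset = struct.calcsize(SOUND_HEADER_FORMAT)
    tracks = {}
    for name, size in zip(TRACK_NAMES, sizes):
        tracks[name] = body[offset : offset + size]
        offset += size
    return speed, crc32, tracks
```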
Sound data is mono-channel IMA ADPCM sampled at 16364Hz. That said, Nintendo's implementation differs from the norm ever so slightly (of course!), and these differences need to be accounted for in order to accurately decode audio.
Typically 4-bit IMA ADPCM data is used, however in order to save space, the audio may switch into a 2-bit sample mode in places where the audio signal is relatively flat. The decoder will read the next sample as a 2-bit value if the step index left by the previous sample is below 18, or if it is only possible to read 2 bits from the current byte (conveniently, the audio encoder avoids 4-bit samples that overlap byte boundaries).
In addition, there are a couple of small divergences from the IMA ADPCM standard:
- The decoder uses the standard IMA ADPCM step table, but with an extra `0` at the end. The reason for this isn't exactly clear.
- The step index is clamped between `0` and `79`, compared to the standard `0` to `88`.
- The diff is clamped to between `-2047` and `2047`, compared to the standard `-32767` to `32767`.
- The diff value is also multiplied by 16 after being clamped.
- The initial decoder state is `diff = 0, step index = 40`.
The full step table is:
7, 8, 9, 10, 11, 12, 13, 14, 16, 17,
19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
50, 55, 60, 66, 73, 80, 88, 97, 107, 118,
130, 143, 157, 173, 190, 209, 230, 253, 279, 307,
337, 371, 408, 449, 494, 544, 598, 658, 724, 796,
876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358,
5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767, 0
Since this is fairly complex, here is pseudocode to convert the sound data to 16-bit signed PCM:
# track_length is assumed to be the size of the audio track, in bytes
# track_buffer is assumed to be the audio data, as an array of bytes
# index table for 2-bit samples
index_table_2 = [
  -1, 2,
  -1, 2,
]
# index table for 4-bit samples
index_table_4 = [
  -1, -1, -1, -1, 2, 4, 6, 8,
  -1, -1, -1, -1, 2, 4, 6, 8,
]
# we don't know how long the unpacked audio is,
# so create an output buffer with enough space for 60 seconds of audio at 16364 Hz
# output_buffer should be of type int16[]
output_buffer = Array(16364 * 60)
output_offset = 0
# initial decoder state:
prev_diff = 0
prev_step_index = 40
for track_offset = 0; track_offset < track_length; track_offset += 1:
  byte = track_buffer[track_offset]
  bit_pos = 0
  while bit_pos < 8:
    if prev_step_index < 18 or bit_pos == 6:
      # read 2-bit sample
      sample = (byte >> bit_pos) & 0x3
      # get diff
      step = step_table[prev_step_index]
      diff = step >> 3
      if sample & 1: diff += step
      if sample & 2: diff = -diff
      diff = prev_diff + diff
      # get step index
      step_index = prev_step_index + index_table_2[sample]
      bit_pos += 2
    else:
      # read 4-bit sample
      sample = (byte >> bit_pos) & 0xF
      # get diff
      step = step_table[prev_step_index]
      diff = step >> 3
      if sample & 4: diff += step
      if sample & 2: diff += step >> 1
      if sample & 1: diff += step >> 2
      if sample & 8: diff = -diff
      diff = prev_diff + diff
      # get step index
      step_index = prev_step_index + index_table_4[sample]
      bit_pos += 4
    # clamp step index and diff
    step_index = max(0, min(step_index, 79))
    diff = max(-2047, min(diff, 2047))
    # write the sample, multiplied by 16 to fill the 16-bit range
    output_buffer[output_offset] = diff * 16
    output_offset += 1
    # set prev decoder state (the clamped diff, before the multiply)
    prev_step_index = step_index
    prev_diff = diff
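For anyone who wants something runnable, here is a Python sketch of the same decoder. It keeps the clamped value as decoder state and scales only the output, which is how the clamp range and the multiply-by-16 rule are read here; `decode_track` is an illustrative name, and the result is a plain list of ints rather than an int16 buffer.

```python
STEP_TABLE = [
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17,
    19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
    50, 55, 60, 66, 73, 80, 88, 97, 107, 118,
    130, 143, 157, 173, 190, 209, 230, 253, 279, 307,
    337, 371, 408, 449, 494, 544, 598, 658, 724, 796,
    876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
    2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358,
    5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767, 0,
]
INDEX_TABLE_2BIT = [-1, 2, -1, 2]
INDEX_TABLE_4BIT = [-1, -1, -1, -1, 2, 4, 6, 8, -1, -1, -1, -1, 2, 4, 6, 8]


def decode_track(track: bytes) -> list:
    """Decode one audio track into a list of 16-bit signed PCM sample values."""
    samples = []
    predictor = 0    # clamped running value ("diff" in the pseudocode)
    step_index = 40  # initial decoder state
    for byte in track:
        bit_pos = 0
        while bit_pos < 8:
            step = STEP_TABLE[step_index]
            diff = step >> 3
            if step_index < 18 or bit_pos == 6:
                # 2-bit sample: bit 0 adds a step, bit 1 is the sign
                code = (byte >> bit_pos) & 0x3
                if code & 1:
                    diff += step
                if code & 2:
                    diff = -diff
                step_index += INDEX_TABLE_2BIT[code]
                bit_pos += 2
            else:
                # 4-bit sample: bits 0-2 are the magnitude, bit 3 is the sign
                code = (byte >> bit_pos) & 0xF
                if code & 4:
                    diff += step
                if code & 2:
                    diff += step >> 1
                if code & 1:
                    diff += step >> 2
                if code & 8:
                    diff = -diff
                step_index += INDEX_TABLE_4BIT[code]
                bit_pos += 4
            step_index = max(0, min(step_index, 79))
            predictor = max(-2047, min(predictor + diff, 2047))
            samples.append(predictor * 16)
    return samples
```

Because the sample width can change mid-stream, the number of output samples is only known once the whole track has been decoded, so a growable list is used here instead of a preallocated buffer.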
Flipnote Gallery World comments use a variant of the KWZ format with the extension `.kwc`. The only difference is that comments do not have KTN or KSN sections, and can only ever have 1 frame. Additionally, the diffing flag in the frame's meta entry is always `0x7`.
Icons used for SD card folders are also a variant of the KWZ format, which only use the KMC and KMI sections.
The final 256 bytes of a KWZ file should consist of an SHA-256 RSA-2048 signature over the rest of the file.
The DER format private key for signing a KWZ file can be found as plaintext in memory. It will begin with the bytes `30 82 04` and end with the bytes `E4 07 50`, resulting in a total of 1,218 bytes overall. Its SHA-256 checksum should match `E6892FF794E8A768C9ECC76152C4E72823514366B3A206298F5CB603D5EB797A`.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.