Document dictionary page position (#177)
gszadovszky authored Jun 24, 2021
1 parent 473a3a7 commit 43c891a
Showing 2 changed files with 12 additions and 0 deletions.
7 changes: 7 additions & 0 deletions README.md
@@ -191,6 +191,13 @@ header and readers can skip over pages they are not interested in. The data for
page follows the header and can be compressed and/or encoded. The compression and
encoding is specified in the page metadata.

A column chunk might be partly or completely dictionary encoded. It means that
dictionary indexes are saved in the data pages instead of the actual values. The
actual values are stored in the dictionary page. See details in
[Encodings.md](https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8).
The dictionary page must be placed at the first position of the column chunk. At
most one dictionary page can be placed in a column chunk.

Additionally, files can contain an optional column index to allow readers to
skip pages more efficiently. See [PageIndex.md](PageIndex.md) for details and
the reasoning behind adding these to the format.
5 changes: 5 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -534,6 +534,11 @@ struct IndexPageHeader {
// TODO
}

/**
* The dictionary page must be placed at the first position of the column chunk
* if it is partly or completely dictionary encoded. At most one dictionary page
* can be placed in a column chunk.
**/
struct DictionaryPageHeader {
/** Number of values in the dictionary **/
1: required i32 num_values;
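
As a rough illustration of the rule documented in this commit, the Python sketch below checks the order of page types within one column chunk. The `PageType` values mirror the enum of the same name in parquet.thrift. `is_valid_page_order` is a hypothetical helper, not part of any Parquet library, and it assumes the caller has already read the page headers of the chunk in file order.

```python
from enum import Enum
from typing import Sequence


class PageType(Enum):
    # Values mirror the PageType enum in parquet.thrift.
    DATA_PAGE = 0
    INDEX_PAGE = 1
    DICTIONARY_PAGE = 2
    DATA_PAGE_V2 = 3


def is_valid_page_order(pages: Sequence[PageType]) -> bool:
    """Hypothetical check: return True if the chunk obeys the dictionary page rule.

    The rule: at most one dictionary page per column chunk, and if one is
    present it must be the first page of the chunk.
    """
    positions = [i for i, page in enumerate(pages) if page is PageType.DICTIONARY_PAGE]
    if len(positions) > 1:
        return False  # more than one dictionary page
    if positions and positions[0] != 0:
        return False  # dictionary page is not at the first position
    return True
```

A chunk with no dictionary page passes the check, since the rule only constrains where a dictionary page may appear, not whether one exists.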