PARQUET-758: Add files with Float16 column #40

Merged (3 commits) on Nov 9, 2023
Changes from 2 commits
100 changes: 98 additions & 2 deletions data/README.md
@@ -45,6 +45,8 @@
| plain-dict-uncompressed-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
| rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
| large_string_map.brotli.parquet | MAP(STRING, INT32) with a string column chunk of more than 2GB. See [note](#large-string-map) below |
| float16_nonzeros_and_nans.parquet | Float16 (logical type) column with NaNs and nonzero finite min/max values |
| float16_zeros_and_nans.parquet | Float16 (logical type) column with NaNs and zeros as min/max values |
Review comment (Member):

Can you perhaps show how the files were generated, or what the data looks like, in the same spirit as was done for "NaN in stats" below?

pitrou marked this conversation as resolved.

TODO: Document what each file is in the table above.

@@ -94,7 +96,7 @@ The schema for the `datapage_v1-*-checksum.parquet` test files is:
message m {
required int32 a;
required int32 b;
}
```

The detailed structure for these files is as follows:
@@ -182,7 +184,7 @@ metadata = pq.read_metadata("nan_in_stats.parquet")
metadata.row_group(0).column(0)
# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f28539e58f0>
# file_offset: 88
# file_path:
# type: DOUBLE
# num_values: 2
# path_in_schema: x
@@ -223,3 +225,97 @@ pq.write_table(tab, "test.parquet", compression='BROTLI')
It is meant to exercise reading of structured data where each value
is smaller than 2GB but the combined uncompressed column chunk size
is greater than 2GB.

## Float16 Files

The files `float16_zeros_and_nans.parquet` and `float16_nonzeros_and_nans.parquet`
exercise several test cases for `Float16` columns (a logical type stored as
2-byte `FIXED_LEN_BYTE_ARRAY` values), including:
* Basic binary representations of standard values, +/- zeros, and NaN
* Comparisons between finite values
* Exclusion of NaNs from statistics min/max
* Normalizing min/max values when only zeros are present (i.e. `min` is always -0 and `max` is always +0)
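
The statistics rules in the list above can be sketched in plain Python. This is a minimal illustration, not the writer implementation used to produce the files: `float16_min_max` is a hypothetical helper, and it assumes that statistics are serialized as 2-byte little-endian IEEE 754 half-precision values. It normalizes whenever the min or max compares equal to zero, which covers the only-zeros case.

```python
import math
import struct


def float16_min_max(values):
    """Return (min_bytes, max_bytes) as 2-byte little-endian float16, or None.

    Hypothetical helper for illustration only. Nulls (None) and NaNs are
    excluded; a zero min is written as -0.0 and a zero max as +0.0.
    """
    comparable = [v for v in values if v is not None and not math.isnan(v)]
    if not comparable:
        # Nothing left to compare, so no min/max can be reported.
        return None
    lo, hi = min(comparable), max(comparable)
    if lo == 0.0:  # true for both -0.0 and +0.0
        lo = -0.0
    if hi == 0.0:
        hi = 0.0
    return struct.pack("<e", lo), struct.pack("<e", hi)


nan = float("nan")
print(float16_min_max([None, 0.0, nan]))
print(float16_min_max([None, 1.0, -2.0, nan, 0.0, -1.0, -0.0, 2.0]))
```

The two calls use the same values as the two test files, so the printed byte strings can be compared with the `min` and `max` fields in the metadata dump further down.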

The aforementioned files were generated with:

```python
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

# Only a null, a zero, and a NaN: min/max should normalize to -0 / +0.
t1 = pa.Table.from_arrays(
    [pa.array([None,
               np.float16(0.0),
               np.float16(np.nan)], type=pa.float16())],
    names=["x"])
# Mixed finite values, both zeros, a null, and a NaN: min/max should be -2 / 2.
t2 = pa.Table.from_arrays(
    [pa.array([None,
               np.float16(1.0),
               np.float16(-2.0),
               np.float16(np.nan),
               np.float16(0.0),
               np.float16(-1.0),
               np.float16(-0.0),
               np.float16(2.0)],
              type=pa.float16())],
    names=["x"])

pq.write_table(t1, "float16_zeros_and_nans.parquet")
pq.write_table(t2, "float16_nonzeros_and_nans.parquet")

m1 = pq.read_metadata("float16_zeros_and_nans.parquet")
m2 = pq.read_metadata("float16_nonzeros_and_nans.parquet")

print(m1.row_group(0).column(0))
print(m2.row_group(0).column(0))
# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f24d48c4d60>
# file_offset: 72
# file_path:
# physical_type: FIXED_LEN_BYTE_ARRAY
# num_values: 3
# path_in_schema: x
# is_stats_set: True
# statistics:
# <pyarrow._parquet.Statistics object at 0x7f24d48c4ea0>
# has_min_max: True
# min: b'\x00\x80'
# max: b'\x00\x00'
# null_count: 1
# distinct_count: None
# num_values: 2
# physical_type: FIXED_LEN_BYTE_ARRAY
# logical_type: Float16
# converted_type (legacy): NONE
# compression: SNAPPY
# encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
# has_dictionary_page: True
# dictionary_page_offset: 4
# data_page_offset: 24
# total_compressed_size: 68
# total_uncompressed_size: 64
# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f24d48c4d60>
# file_offset: 84
# file_path:
# physical_type: FIXED_LEN_BYTE_ARRAY
# num_values: 8
# path_in_schema: x
# is_stats_set: True
# statistics:
# <pyarrow._parquet.Statistics object at 0x7f24d48c4e50>
# has_min_max: True
# min: b'\x00\xc0'
# max: b'\x00@'
# null_count: 1
# distinct_count: None
# num_values: 7
# physical_type: FIXED_LEN_BYTE_ARRAY
# logical_type: Float16
# converted_type (legacy): NONE
# compression: SNAPPY
# encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
# has_dictionary_page: True
# dictionary_page_offset: 4
# data_page_offset: 34
# total_compressed_size: 80
# total_uncompressed_size: 76
```
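
The `min` and `max` fields above are raw 2-byte values rather than decoded floats. As a rough aid for reading them, the sketch below decodes them with the standard library, assuming the bytes are little-endian IEEE 754 half-precision; `decode_float16` is a hypothetical helper name, not part of pyarrow.

```python
import struct


def decode_float16(raw: bytes) -> float:
    """Decode a 2-byte little-endian IEEE 754 half-precision value."""
    if len(raw) != 2:
        raise ValueError(f"expected 2 bytes, got {len(raw)}")
    return struct.unpack("<e", raw)[0]


# Byte strings copied from the statistics printed above.
for label, raw in [
    ("float16_zeros_and_nans min", b"\x00\x80"),
    ("float16_zeros_and_nans max", b"\x00\x00"),
    ("float16_nonzeros_and_nans min", b"\x00\xc0"),
    ("float16_nonzeros_and_nans max", b"\x00@"),
]:
    print(label, decode_float16(raw))
# float16_zeros_and_nans min -0.0
# float16_zeros_and_nans max 0.0
# float16_nonzeros_and_nans min -2.0
# float16_nonzeros_and_nans max 2.0
```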
Binary file added data/float16_nonzeros_and_nans.parquet
Binary file not shown.
Binary file added data/float16_zeros_and_nans.parquet
Binary file not shown.