TFRecords File is too big! 10X the size of parquet #47

kart2k15 · 2022-03-09T18:31:13Z

See similar GitHub issues here:
tensorflow/ecosystem#61 (comment)
tensorflow/ecosystem#61
tensorflow/ecosystem#106

This is how I'm writing a PySpark DataFrame as TFRecords to an S3 bucket:

s3_path = "s3://Shuks/dataframe_tf_records"   
df.write.mode("overwrite").format("tfrecord").option("recordType", "Example").save(s3_path)

This creates a new key ("directory") on S3 at the following path: s3://Shuks/dataframe_tf_records/
All of the TFRecord files are written under this directory.

How do I specify a compression type when writing the TFRecords?

junshi15 · 2022-04-08T05:31:46Z

Try this:
option("codec","org.apache.hadoop.io.compress.GzipCodec")
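For context, here is a minimal sketch of how that option might be chained onto the write call from the original post. The helper name `write_tfrecords` and the constant `GZIP_CODEC` are hypothetical, introduced only for illustration. The "tfrecord" format and the "recordType" option are copied from the post above. Whether the "codec" option is honored is assumed to depend on the connector and version in use.

```python
# Hypothetical helper: wraps the write call from the original post and
# adds the suggested "codec" option. `df` is assumed to be a PySpark DataFrame.
GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec"


def write_tfrecords(df, path, codec=GZIP_CODEC):
    """Write `df` as TFRecord files under `path`, optionally compressed."""
    writer = (
        df.write.mode("overwrite")
        .format("tfrecord")
        .option("recordType", "Example")
    )
    if codec:
        # Only set the codec when one is requested; pass codec=None to
        # reproduce the uncompressed write from the original post.
        writer = writer.option("codec", codec)
    writer.save(path)
```

Usage would look like `write_tfrecords(df, "s3://Shuks/dataframe_tf_records")`, which is equivalent to appending `.option("codec", "org.apache.hadoop.io.compress.GzipCodec")` before `.save(s3_path)` in the original snippet.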

sosixyz · 2024-05-22T06:35:51Z

try this: option("codec","org.apache.hadoop.io.compress.GzipCodec")
I used this approach: data.repartition(50).write.mode("overwrite").format('tfrecords').option("codec", "org.apache.hadoop.io.compress.GzipCodec").save(path). However, the output files do not seem to be any smaller, so the option does not appear to have taken effect.
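One way to check whether the codec was applied, rather than judging by file size alone, is to look at the first two bytes of an output part file: gzip streams start with the magic bytes 0x1f 0x8b. The sketch below assumes a part file has already been copied to local disk. The helper name `is_gzip_file` is hypothetical.

```python
# Hypothetical helper: reports whether a local file starts with the
# gzip magic number, i.e. whether it looks gzip-compressed.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip_file(path):
    """Return True if the file at `path` begins with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

If this returns False for the written part files, the output was most likely written uncompressed, which would match the observation that the file size did not change.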
