[Feature Request] Add option to batch data when using SequenceExample #51

Open
utkarshgupta137 opened this issue May 19, 2022 · 5 comments

Comments

@utkarshgupta137

It would be great if this library could automatically create batches & save them using SequenceExample. I tried to batch the data myself, but I ran into memory issues when doing so. I think if it were handled properly at the partition level, it would be both faster & easier to use.

@junshi15
Contributor

junshi15 commented Jun 3, 2022

I am curious why batching cannot be done on the user side. I don't see the benefit of doing it inside the converter. Assuming you will feed the examples to training/test/eval, won't TF handle batching automatically?
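
For example, something like this on the TF side, parsing plain Example records and batching them at read time (the file name, feature name, and feature spec below are just placeholders, not anything spark-tfrecord prescribes):

```python
import tensorflow as tf

# Placeholder schema: one float scalar feature per Example.
feature_spec = {"x": tf.io.FixedLenFeature([], tf.float32)}

def parse(serialized):
    return tf.io.parse_single_example(serialized, feature_spec)

dataset = (
    tf.data.TFRecordDataset(["part-r-00000.tfrecord"])  # files written by spark-tfrecord
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(1024)          # batching happens here, at read time
    .prefetch(tf.data.AUTOTUNE)
)
```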

@utkarshgupta137
Author

The difference in file size between, say, 1000 Example records and a single SequenceExample holding 1000 rows is very high (unbatched data is ~50% larger in my case). Thus it takes longer to read/write the files and increases memory/disk space requirements.
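
For illustration, packing a batch of rows into one SequenceExample looks roughly like this (the field name "x" and the list-of-dicts input are stand-ins for my actual schema, not the library's API):

```python
import tensorflow as tf

def rows_to_sequence_example(rows):
    """Pack a batch of rows into a single SequenceExample.

    `rows` is assumed to be a list of dicts with a float field "x".
    """
    feature_list = tf.train.FeatureList(
        feature=[
            tf.train.Feature(float_list=tf.train.FloatList(value=[row["x"]]))
            for row in rows
        ]
    )
    return tf.train.SequenceExample(
        feature_lists=tf.train.FeatureLists(feature_list={"x": feature_list})
    )

# One record per batch instead of one record per row,
# which amortizes the per-record framing overhead.
batch = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
serialized = rows_to_sequence_example(batch).SerializeToString()
```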

@junshi15
Contributor

junshi15 commented Jun 5, 2022

Which Spark operation does batching correspond to? GroupBy?
Spark-TFRecord is implemented as a Spark data source (similar to Avro, Parquet, CSV), so it supports most data source options. I don't see batching in Spark's data source API.
TFRecordReader does batching; why is that not an option for you?

@utkarshgupta137
Author

Batching can be implemented by adding an index to all the rows & then assigning a batch to each row using batch = index % batch_size. A sketch is below.
Yes, TFRecordReader supports batching, but the whole point of doing it in Spark is explained in my previous comment.
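
Rough PySpark sketch of the idea (the column names and batch_size are placeholders; I've used floor(index / batch_size) here so each group holds at most batch_size consecutive rows, rather than the modulo form above):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
batch_size = 1000

# Stand-in DataFrame; in practice this would be the data to be written out.
df = spark.range(10_000).withColumnRenamed("id", "row_id")

# Assign a global row index, then derive a batch id from it.
w = Window.orderBy("row_id")
indexed = df.withColumn("index", F.row_number().over(w) - 1)
batched = indexed.withColumn("batch", F.floor(F.col("index") / batch_size))

# Rows sharing a batch id can then be grouped so that each group becomes
# one SequenceExample (e.g. collect_list over the feature columns).
grouped = batched.groupBy("batch").agg(F.collect_list("row_id").alias("row_ids"))

# Caveat: a global window with no partitionBy pulls every row into a single
# partition, which is probably where my memory issues came from.
```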

@junshi15
Contributor

junshi15 commented Jun 5, 2022

It's not clear to me how to implement that logic in a Spark data source, which is basically a format converter.
Contributions are welcome.
