[Feature Request] Add option to batch data when using SequenceExample #51

Open
utkarshgupta137 opened this issue May 19, 2022 · 5 comments

Comments

@utkarshgupta137

It would be great if this library could automatically create batches & save them using SequenceExample. I tried to batch the data myself, but I ran into memory issues when doing so. I think if it were handled properly at the partition level, it would be both faster & easier to use.

@junshi15
Contributor

junshi15 commented Jun 3, 2022

I am curious why batching cannot be done on the user side. I don't see the benefit of doing it inside the converter. Assuming you will feed the examples to training/test/eval, won't TF handle batching automatically?
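
For example, something like this on the TF side, parsing plain Example records and batching them at read time (the file name, feature name, and feature spec below are just placeholders, not anything spark-tfrecord prescribes):

```python
import tensorflow as tf

# Placeholder schema: one float scalar feature per Example.
feature_spec = {"x": tf.io.FixedLenFeature([], tf.float32)}

def parse(serialized):
    return tf.io.parse_single_example(serialized, feature_spec)

dataset = (
    tf.data.TFRecordDataset(["part-r-00000.tfrecord"])  # files written by spark-tfrecord
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(1024)          # batching happens here, at read time
    .prefetch(tf.data.AUTOTUNE)
)
```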

@utkarshgupta137
Author

The difference in file size between, say, 1000 Example records and a single SequenceExample holding 1000 rows is very high (unbatched data is ~50% larger in my case). Thus it takes longer to read/write the files and increases memory/disk space requirements.
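
For illustration, packing a batch of rows into one SequenceExample looks roughly like this (the field name "x" and the list-of-dicts input are stand-ins for my actual schema, not the library's API):

```python
import tensorflow as tf

def rows_to_sequence_example(rows):
    """Pack a batch of rows into a single SequenceExample.

    `rows` is assumed to be a list of dicts with a float field "x".
    """
    feature_list = tf.train.FeatureList(
        feature=[
            tf.train.Feature(float_list=tf.train.FloatList(value=[row["x"]]))
            for row in rows
        ]
    )
    return tf.train.SequenceExample(
        feature_lists=tf.train.FeatureLists(feature_list={"x": feature_list})
    )

# One record per batch instead of one record per row,
# which amortizes the per-record framing overhead.
batch = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
serialized = rows_to_sequence_example(batch).SerializeToString()
```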

@junshi15
Contributor

junshi15 commented Jun 5, 2022

Which Spark operation does batching correspond to? GroupBy?
Spark-TFRecord is implemented as a Spark data source (similar to Avro, Parquet, CSV), so it supports most data source options. I don't see batching in Spark's data source API.
TFRecordReader does batching; why is that not an option for you?

@utkarshgupta137
Author

Batching can be implemented by adding an index to all the rows & then assigning a batch to each row using batch = index % batch_size. A sketch is below.
Yes, TFRecordReader supports batching, but the whole point of doing it in Spark is explained in my previous comment.
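
Rough PySpark sketch of the idea (the column names and batch_size are placeholders; I've used floor(index / batch_size) here so each group holds at most batch_size consecutive rows, rather than the modulo form above):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
batch_size = 1000

# Stand-in DataFrame; in practice this would be the data to be written out.
df = spark.range(10_000).withColumnRenamed("id", "row_id")

# Assign a global row index, then derive a batch id from it.
w = Window.orderBy("row_id")
indexed = df.withColumn("index", F.row_number().over(w) - 1)
batched = indexed.withColumn("batch", F.floor(F.col("index") / batch_size))

# Rows sharing a batch id can then be grouped so that each group becomes
# one SequenceExample (e.g. collect_list over the feature columns).
grouped = batched.groupBy("batch").agg(F.collect_list("row_id").alias("row_ids"))

# Caveat: a global window with no partitionBy pulls every row into a single
# partition, which is probably where my memory issues came from.
```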

@junshi15
Contributor

junshi15 commented Jun 5, 2022

It's not clear to me how to implement that logic in a Spark data source, which is basically a format converter.
Contributions are welcome.
