[Feature Request] Add option to batch data when using SequenceExample #51
I am curious why batching cannot be done on the user side. I don't see the benefit of doing it inside the converter. Assuming you will feed the examples to training/test/eval, won't TF handle batching automatically?
The difference in file size between, say, 1000 individual `Example` records and a single `SequenceExample` of 1000 rows is very high (the unbatched data is ~50% larger in my case). So it takes longer to read/write the files and also increases memory/disk space requirements.
Which Spark operation does batching correspond to? `groupBy`?
Batching can be implemented by adding an index to all the rows and then assigning a batch id to each row based on that index.
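The index-based idea above can be sketched in plain Python: assign each row a batch id via integer division of its index by the batch size (the `batch_size` value and row contents here are illustrative, not from the thread; in Spark the index could come from something like `monotonically_increasing_id`):

```python
def assign_batches(rows, batch_size):
    """Assign a batch id to each row: batch_id = row_index // batch_size."""
    return [(index // batch_size, row) for index, row in enumerate(rows)]

rows = ["r0", "r1", "r2", "r3", "r4"]
batched = assign_batches(rows, batch_size=2)
# Rows 0-1 land in batch 0, rows 2-3 in batch 1, and row 4 in batch 2.
```

Grouping on the resulting batch id would then collect each batch's rows into one `SequenceExample`.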
It's not clear to me how to implement that logic in a Spark data source, which is basically a format converter.
It would be great if this library could automatically create batches and save them using `SequenceExample`. I tried to create the batches myself, but I ran into memory issues when doing so. I think if it was handled properly at the partition level, it would be both faster and easier to use.
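The partition-level approach suggested above can be sketched without Spark: stream each partition's rows into fixed-size chunks so only one batch is materialized at a time, which sidesteps the out-of-memory issue. This mirrors what a `mapPartitions`-style operation could do; the function name and sizes are hypothetical:

```python
from itertools import islice

def batch_partition(rows_iter, batch_size):
    """Lazily yield lists of up to batch_size rows from a partition iterator.

    Only one batch is held in memory at a time, instead of collecting
    the whole partition before splitting it.
    """
    iterator = iter(rows_iter)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Each yielded batch could then be serialized as one SequenceExample.
batches = list(batch_partition(range(7), batch_size=3))
# -> [[0, 1, 2], [3, 4, 5], [6]]
```

Note the last batch may be smaller than `batch_size`, which the reader code would need to tolerate.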