Avoid the use of temporary files whenever possible #50
Yes, this is because the dictionary is located first in the archive and its exact size is unknown until the source has been scanned and compressed. While the dictionary is being built, chunks are hashed and compressed to a temporary file. Then the dictionary is written to the output file and the temporary chunk file is appended onto it.

Other solutions would be to either keep all chunks in memory, which obviously could be very memory-consuming, or to do two passes over the source file. The first pass would find all unique chunks in the source file and allocate space for a worst-case sized dictionary in the output archive. The second pass would read the unique chunks from the source, compress them and append them to the archive after the allocated dictionary. After all chunks have been appended, the real dictionary is written to the archive. Some space will be wasted between the dictionary and the chunks in the archive, but that shouldn't really matter. I think this two-pass approach should be doable and would solve this problem, right?
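For reference, a minimal sketch of that two-pass idea using only `std::io`. The fixed-size chunking, the identity "compression", the made-up worst-case bound and the placeholder dictionary encoding are all stand-ins for bita's real logic, not its actual API:

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom, Write};

// Stand-in chunker: fixed-size 64 KiB chunks. bita's real content-defined
// chunking, hashing and deduplication are out of scope for this sketch.
fn find_unique_chunks(source: &File) -> io::Result<Vec<(u64, usize)>> {
    let len = source.metadata()?.len();
    let chunk_size = 64 * 1024u64;
    let mut chunks = Vec::new();
    let mut offset = 0u64;
    while offset < len {
        chunks.push((offset, (len - offset).min(chunk_size) as usize));
        offset += chunk_size;
    }
    Ok(chunks)
}

// Stand-in for per-chunk compression (identity here).
fn compress_chunk(data: &[u8]) -> Vec<u8> {
    data.to_vec()
}

fn build_two_pass(source_path: &str, archive_path: &str) -> io::Result<()> {
    let mut source = File::open(source_path)?;
    let mut archive = File::create(archive_path)?;

    // Pass 1: find the unique chunks, then reserve worst-case space for the
    // dictionary at the start of the archive (the bound below is made up).
    let chunks = find_unique_chunks(&source)?;
    let worst_case_dictionary_size = 1024 + chunks.len() as u64 * 64;
    archive.seek(SeekFrom::Start(worst_case_dictionary_size))?;

    // Pass 2: re-read each unique chunk from the source, compress it and
    // append it after the reserved region, remembering where it ended up.
    let mut placements = Vec::new();
    for &(offset, size) in &chunks {
        source.seek(SeekFrom::Start(offset))?;
        let mut buf = vec![0u8; size];
        source.read_exact(&mut buf)?;
        let compressed = compress_chunk(&buf);
        placements.push((archive.stream_position()?, compressed.len()));
        archive.write_all(&compressed)?;
    }

    // Finally seek back and write the real dictionary into the reserved
    // space; any unused bytes between it and the chunks are simply wasted.
    let dictionary = format!("{:?}", placements); // placeholder encoding
    debug_assert!(dictionary.len() as u64 <= worst_case_dictionary_size);
    archive.seek(SeekFrom::Start(0))?;
    archive.write_all(dictionary.as_bytes())?;
    Ok(())
}
```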
What about storing the dictionary at the end, like zip?
I remember considering that initially, but for some reason I decided not to. Right now I can't remember why.
I've been thinking a bit more about this, and it does seem like a pretty good idea to have the dictionary at the end of the archive. As already said, this should simplify the build process and make the awkward temporary file redundant. Looking through the code I realize that the temporary file was also there to support building an archive from a stream (like stdin), where we can't know the size of the source in advance. Hence the method I previously mentioned ("...allocate space for a worst-case sized dictionary in the output archive") wouldn't work, since we can't restart a stream once it ends. The other method of keeping all compressed chunks in memory while building would of course work, but it feels ugly, especially when we don't know the size of the source beforehand. With the dictionary located after the chunks, on the other hand, building from a stream would be no problem.

Anyway... I'm a bit curious to try this out. My idea, and proposal if the implementation is successful, would be to implement it as a v2 of the file format, where bita would be able to create and decode both v1 and v2. Over a release, or a few, swap from building v1 to v2 by default and put v1 behind a feature flag. An old version of bita will of course not be able to decode the new format, so users wanting to upgrade and use the new format would probably have to provide two versions of the same archive for a period. Not optimal, but no one is forced to update either.

And I still can't remember why I ruled out placing the dictionary at the end before. It might just have been a feeling of it being wrong, or I might rediscover the reason soon. I'm also writing this to see if anyone has any objections or further thoughts on this. Nothing is done yet and I may, or may not, proceed 🙂
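A minimal sketch of why an end-of-archive dictionary plays well with a non-seekable source such as stdin. The fixed-size chunking, the `ChunkEntry` type and the text-based dictionary encoding are placeholders for illustration only, not the proposed v2 format:

```rust
use std::io::{self, Read, Write};

// Placeholder dictionary entry; a real v2 dictionary would carry chunk
// hashes, compression info, etc.
struct ChunkEntry {
    archive_offset: u64,
    size: usize,
}

// Build an archive from any reader (e.g. stdin) into any writer, without a
// temporary file: chunks stream straight through, and the dictionary, which
// is only fully known at the end, is simply appended last.
fn build_from_stream<R: Read, W: Write>(mut source: R, mut archive: W) -> io::Result<()> {
    let mut entries = Vec::new();
    let mut written: u64 = 0;
    let mut buf = vec![0u8; 64 * 1024]; // stand-in fixed-size chunking

    loop {
        let n = source.read(&mut buf)?;
        if n == 0 {
            break;
        }
        // Compression is omitted; write the chunk and remember where it went.
        archive.write_all(&buf[..n])?;
        entries.push(ChunkEntry { archive_offset: written, size: n });
        written += n as u64;
    }

    // All chunk metadata is now known, so the dictionary goes at the end
    // (placeholder text encoding instead of a real dictionary format).
    for e in &entries {
        writeln!(archive, "chunk @ {} ({} bytes)", e.archive_offset, e.size)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    build_from_stream(io::stdin(), io::stdout())
}
```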
I think the end is the right place for it, and there is precedent (zip does the same thing for similar reasons). Your migration plan also sounds fine. If I remember correctly there are some magic bytes at the beginning of a bita archive, so perhaps the version/magic bytes at the beginning can change. That way the bita CLI v1 just won't even try to parse a v2 archive. Another benefit of putting it at the end is that arbitrary data can be put at the beginning of the archive. For example, it's common practice to store a zip at the end of an executable, yet it can still be opened like a zip because the header is at the end and specifies all offsets relative to the header. My only question is: how do we know where the metadata at the end starts, since we don't know the offsets yet when we start writing the file? I guess we need some kind of fixed-size payload at the very end, which contains the info about where the variable-sized header starts?
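Roughly what that zip-style variant could look like on the read side, assuming an invented 24-byte trailer (magic + dictionary offset + dictionary size); none of this is bita's actual format:

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

// Invented trailer layout: 8 bytes magic + 8 bytes dictionary offset +
// 8 bytes dictionary size, always the very last bytes of the file.
const TRAILER_SIZE: i64 = 24;

fn read_dictionary(path: &str) -> io::Result<Vec<u8>> {
    let mut archive = File::open(path)?;

    // The trailer can always be found relative to the end of the file,
    // without knowing anything else about the archive.
    archive.seek(SeekFrom::End(-TRAILER_SIZE))?;
    let mut trailer = [0u8; TRAILER_SIZE as usize];
    archive.read_exact(&mut trailer)?;
    // A real implementation would verify the magic in trailer[0..8] here.

    let offset = u64::from_le_bytes(trailer[8..16].try_into().unwrap());
    let size = u64::from_le_bytes(trailer[16..24].try_into().unwrap());

    // A second read fetches the variable-sized dictionary it points at.
    archive.seek(SeekFrom::Start(offset))?;
    let mut dictionary = vec![0u8; size as usize];
    archive.read_exact(&mut dictionary)?;
    Ok(dictionary)
}
```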
I think the header would be of static size, containing magic, offset and size of the dictionary/container. This space would be reserved first; then, after all chunks and the dictionary have been written, the header can be filled in as the last thing. When reading the archive, the first request would just read this N-byte header, which points to the correct offset and size, and then read the whole dictionary with another request.
When all is written, just seek back to the beginning of the file and fill in the details, or did I overlook something here?
And just to clarify: when I say header, I mean the header located as the first thing in the archive.
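A sketch of that scheme on the write side, with an invented fixed-size header layout (magic, dictionary offset, dictionary size) and `write_archive` as a hypothetical function; the real format details are out of scope:

```rust
use std::fs::File;
use std::io::{self, Seek, SeekFrom, Write};

// Invented fixed-size header: 8 bytes magic + 8 bytes dictionary offset +
// 8 bytes dictionary size. Not the real bita layout.
const MAGIC: &[u8; 8] = b"BITA0002";
const HEADER_SIZE: u64 = 24;

fn write_archive(path: &str, compressed_chunks: &[Vec<u8>], dictionary: &[u8]) -> io::Result<()> {
    let mut archive = File::create(path)?;

    // Reserve the fixed-size header first; its contents are not known yet.
    archive.seek(SeekFrom::Start(HEADER_SIZE))?;

    // Append all compressed chunks, then the dictionary after them.
    for chunk in compressed_chunks {
        archive.write_all(chunk)?;
    }
    let dictionary_offset = archive.stream_position()?;
    archive.write_all(dictionary)?;

    // Last thing: seek back to the start and fill in the header, now that
    // the dictionary's offset and size are known.
    archive.seek(SeekFrom::Start(0))?;
    archive.write_all(MAGIC)?;
    archive.write_all(&dictionary_offset.to_le_bytes())?;
    archive.write_all(&(dictionary.len() as u64).to_le_bytes())?;
    Ok(())
}
```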
In the `compress` command, it looks like the chunks are all written to a temporary file, then the bita header is written to the output file, and finally the chunks are copied to the output file. This introduces significant points of failure (especially on Windows), and we should avoid temp files whenever possible. Ideally, we would open the output file once and not close it until we are done. So we open the file, write the header first, and then give the same output file handle to the chunker.
If you are in overwrite/force mode, this does introduce a slightly increased possibility that we pave over a perfectly good file and leave junk in its wake, but I think that is going to be a less common scenario than compression failing because Windows Defender or some other junk has decided to lock the output file between writing the header and copying the chunks.
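A minimal sketch of the open-once flow being suggested here, where `write_header` and `chunk_and_compress_into` are hypothetical placeholders rather than bita's actual functions:

```rust
use std::fs::File;
use std::io::{self, Read, Write};

// Hypothetical placeholder for whatever writes the bita header; not the
// real header layout.
fn write_header(out: &mut File) -> io::Result<()> {
    out.write_all(b"bita header placeholder")
}

// Hypothetical placeholder for the chunker. The point is only that it is
// handed the already-open output handle instead of a temporary file.
fn chunk_and_compress_into(source: &mut impl Read, out: &mut File) -> io::Result<()> {
    io::copy(source, out).map(|_| ())
}

fn compress(source: &mut impl Read, output_path: &str) -> io::Result<()> {
    // Open the output exactly once and keep the handle until we are done,
    // so nothing (e.g. a virus scanner) can lock the file between steps.
    let mut output = File::create(output_path)?;
    write_header(&mut output)?;
    chunk_and_compress_into(source, &mut output)?;
    output.sync_all() // flush to disk before the handle is dropped
}
```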