is pixz tar-append possible? #83
Comments
Something like that ought to be possible, but it's a bit complicated. You'd need to truncate the file, append the data, and then rewrite the index.
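To make the truncate / append / rewrite-index sequence concrete, here is a minimal sketch on a toy container. It is not the pixz format: the layout (raw payloads, then a JSON index, then an 8-byte length footer) and every name in it (`create`, `append`, `read_index`, `read`) are assumptions made up for illustration, and there is no compression involved.

```python
import io
import json
import struct

# Toy layout (hypothetical, NOT pixz): [payloads...][JSON index][8-byte index length]
FOOTER = struct.Struct("<Q")


def _write_index(f, index):
    """Write the index followed by the footer at the current position."""
    blob = json.dumps(index).encode("utf-8")
    f.write(blob)
    f.write(FOOTER.pack(len(blob)))


def create(f):
    """Initialise an empty container: no payloads, an empty index, a footer."""
    _write_index(f, {})


def read_index(f):
    """Return (index dict, offset at which the index starts)."""
    f.seek(-FOOTER.size, io.SEEK_END)
    footer_pos = f.tell()
    (index_len,) = FOOTER.unpack(f.read(FOOTER.size))
    index_start = footer_pos - index_len
    f.seek(index_start)
    return json.loads(f.read(index_len)), index_start


def append(f, name, payload):
    """Append one payload by moving the index out of the way."""
    index, index_start = read_index(f)
    f.seek(index_start)
    f.truncate()                    # 1. truncate: drop the old index and footer
    f.write(payload)                # 2. append the data where the index used to be
    index[name] = [index_start, len(payload)]
    _write_index(f, index)          # 3. rewrite the index after the new data


def read(f, name):
    """Fetch one payload using only the trailing index."""
    index, _ = read_index(f)
    offset, length = index[name]
    f.seek(offset)
    return f.read(length)
```

In this toy case the cost of an append is proportional to the index size plus the new payload, not to the archive size. A real .tpxz file is harder because the data in front of the index is compressed in blocks.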
Hi! I'm curious whether it's possible to place the index at the end of the file. This is not a request, just a general discussion about formats. I wrote a storage system for keeping tons of telemetry and media from industrial equipment, and an index placed at the end worked very well there. I understand that was a special case, but it seems to fit most of pixz's use cases too.
Oh, the index does go near the end! A pixz file with 3 data blocks looks like this:
See the XZ file format info for more details on the XZ wrapper. The problem is that to append to a pixz file, you need to move the pixz file index. So it's not trivial.
Ah sorry, you're right. Then I would like to understand in detail why adding a file to the archive is so costly. @eichin How big is your index? In my experience, yes, the index can grow to an unacceptable size (compared to the chunk of data being added). I used a 2-level index: several 1st-level indexes spread through the file, and a 2nd-level index (an "index of indexes") at the end of the file. It works OK if data is mostly added sequentially, but that may not be acceptable for a general-purpose archiver.
What is the format of the index? I am also curious how easy it is to remove a file from |
As mentioned above, there are actually two indexes in a pixz file: the XZ index, and the pixz tar file index. You can read about the XZ index in the link above. The pixz tar file index isn't documented, but if you read the code, you can see it's basically a bunch of filename/offset pairs. It sounds very difficult to remove a file from a .tpxz archive. You'd have to recompress the partial blocks on each end, and rewrite both the pixz tar file index and the XZ index. If you find yourself frequently wanting to remove files from compressed archives, there are probably better archive formats out there!
From the spec, the XZ index is just a list of pairs. I cannot read C code as freely as a text spec. If I knew the binary structure, I could try to experiment with that algorithm. The problem is more generic than it seems: facebook/zstd#2396
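For experimenting without reading the C code, the XZ-level index (not the undocumented pixz tar file index) can be read straight from the end of a stream. The sketch below follows my reading of the XZ file format spec: a 12-byte stream footer ending in the magic `YZ`, a stored Backward Size giving the index length as `(stored + 1) * 4`, and index records made of two variable-length integers each. It assumes a single stream with no trailing stream padding and skips CRC checks; `read_varint` and `xz_index_records` are names invented here.

```python
import lzma
import struct


def read_varint(buf, pos):
    """Decode an XZ multibyte integer: 7 bits per byte, low bits first."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:          # high bit clear marks the last byte
            return value, pos
        shift += 7


def xz_index_records(stream):
    """Return [(unpadded_size, uncompressed_size), ...], one pair per block.

    Assumes `stream` is one complete .xz stream with no stream padding.
    """
    footer = stream[-12:]
    if footer[10:12] != b"YZ":
        raise ValueError("no XZ stream footer found")
    # Footer layout: CRC32 (4) | Backward Size (4) | Stream Flags (2) | "YZ"
    (stored,) = struct.unpack("<I", footer[4:8])
    index_size = (stored + 1) * 4
    index_start = len(stream) - 12 - index_size
    if stream[index_start] != 0x00:
        raise ValueError("index indicator byte not found")
    count, pos = read_varint(stream, index_start + 1)
    records = []
    for _ in range(count):
        unpadded, pos = read_varint(stream, pos)
        uncompressed, pos = read_varint(stream, pos)
        records.append((unpadded, uncompressed))
    return records


if __name__ == "__main__":
    data = b"example payload " * 1000
    print(xz_index_records(lzma.compress(data)))
```

Appending a block means this whole index, and the footer that points back at it, has to be rewritten after the new block.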
pixz's tar mode uses (abuses?) the fact that tar files always end with a couple of blocks full of zeros, and tar ignores anything after them. So pixz's index can just go after the end-of-file blocks, and it preserves compatibility with tar, even if you're using plain old xz to decompress the archive.
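The "tar ignores anything after the zero blocks" behaviour is easy to check on an uncompressed archive. This is a small sketch using Python's standard `tarfile` module rather than pixz itself; the trailer bytes are a stand-in for an index, and `make_tar` is a helper invented for the example.

```python
import io
import tarfile


def make_tar(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


archive = make_tar({"hello.txt": b"hello\n"})

# Pretend this is an index placed after tar's end-of-archive zero blocks.
with_trailer = archive + b"PRETEND-INDEX" * 10

# A reader that stops at the zero blocks never looks at the trailer.
with tarfile.open(fileobj=io.BytesIO(with_trailer), mode="r:") as tar:
    names = tar.getnames()
    content = tar.extractfile("hello.txt").read()

print(names, content)
```

The same trick is what makes appending awkward: the new file's data has to go where the zero blocks and the trailing index currently sit.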
I'm not really interested in that use case, but I can share my thoughts anyways, it may help someone who is. When you write to a plain tar file in append mode, here is what tar does:
Let me illustrate how much work is needed to do tar-append in tpxz format, with the example:
Here is what you would need to do:
In other words, you cannot just “insert” an XZ block; you would need to modify existing ones and overwrite the bytes that come after. And also, you will face the issue that
I agree, that would be quite a bit of work. |
Some naive experiments didn't work, and I was wondering if this makes sense structurally - I'd like to add a small file to a (large) pixz-compressed tarfile, and was wondering if updating the index was possibly cheaper than rebuilding the whole archive.