Compressed (.gz) Writers need to be manually closed #146
Can you provide a minimal reproducer and a `pip list` output?
I assume you also did not use a `with` statement. I had problems in early Cutadapt versions when I was not so diligent about closing each output file, that is, when I was relying on the destructor being called automatically. Since you are working with many files, either use your fix of manually closing each file using `close()`, or use a `with` statement so the files are closed for you.
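As an illustration of the many-files case (a sketch only; the file naming and barcode lookup below are made up, not taken from the original script), `contextlib.ExitStack` lets you open an arbitrary number of writers and guarantees they are all closed, even if an exception occurs:

```python
import contextlib
import dnaio

def demultiplex(records, barcodes):
    with contextlib.ExitStack() as stack:
        # One writer per barcode; every writer is registered with the stack.
        writers = {
            bc: stack.enter_context(dnaio.open(f"demux_{bc}.fastq.gz", mode="w"))
            for bc in barcodes
        }
        for record in records:
            bc = record.name.split("_")[-1]  # hypothetical barcode lookup
            if bc in writers:
                writers[bc].write(record)
    # On leaving the with-block, every writer has been flushed and closed.
```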
@marcelm, thanks for the info (I did not know this), however I'm not sure it's relevant here. In this specific instance, I was calling the writers through a function, so (as I understand it) the destructors would be called when the function exits. I would also expect the writing to finish before the script exits (granted, buffering can be confusing, but surely these should be flushed on program exit?).
@rhpvorderman, I've made an example. It looks like this behaviour only occurs when the writing happens inside a function. You can reproduce it with a single writer, which, when compressed, truncates its output. This doesn't occur with uncompressed files.

```python
import dnaio
import random

# generate some data
DNA = ["A", "C", "G", "T"]

def random_dna(n):
    return dnaio.SequenceRecord("read", "".join(random.choices(DNA, k=n)))

reads = [random_dna(500) for _ in range(1000)]

# wrap writing inside a function
def write_demultiplexed(seqs, file):
    fa_file = dnaio.open(file, mode='w')
    for seq in seqs:
        fa_file.write(seq)
    # fix
    # fa_file.close()

write_demultiplexed(reads, "reads.fa.gz")
write_demultiplexed(reads, "reads.fa")
```

The results on my machine: the uncompressed `reads.fa` is complete, while `reads.fa.gz` is truncated.
To my eyes, it looks like the compressed writing happens in a separate thread which is silently dropped when the function exits, rather than the function waiting for it to finish writing (which is what I would expect in Python). Thanks for taking a look.

edit: added `pip list` output from a fresh install.
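For reference, a minimal sketch of the fixed version of the reproducer's function, relying on the context-manager support of `dnaio` writers so the file is flushed and closed when the block exits:

```python
import dnaio

def write_demultiplexed(seqs, file):
    # close() is guaranteed to run when the with-block exits, even on exceptions
    with dnaio.open(file, mode="w") as fa_file:
        for seq in seqs:
            fa_file.write(seq)
```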
There's no guarantee that `__del__` runs as soon as an object goes out of scope. Many file-like objects implement a `__del__` that closes the file, but you should not rely on it; call `close()` explicitly or use a `with` block.
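A generic illustration of why destructor timing cannot be relied on (nothing dnaio-specific here): an object caught in a reference cycle is not finalized when it goes out of scope, only when the cyclic garbage collector eventually runs.

```python
import gc

class NeedsClosing:
    def __del__(self):
        print("__del__ called")

def make_one():
    obj = NeedsClosing()
    obj.self_ref = obj  # reference cycle keeps the refcount above zero
    # obj goes out of scope here, but __del__ does not run yet

make_one()
print("function returned, no __del__ yet")
gc.collect()  # only now is the cycle collected and __del__ invoked
```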
I've tracked this down to behaviour in `xopen` (which `dnaio` uses for compressed files): the threaded compression writer is not flushed or closed when it is garbage collected. I think this makes sense in a context where you pass in a file object, and Python seems to clean this up properly (i.e. close the file and flush the buffer), at least if the file object is in the main scope. In the context of passing a filename, though, the writer is created internally and never gets closed unless the caller does it. I haven't tried to implement this, but I think if the writer were changed to flush and close itself when it is collected, this problem would go away.
This is done to simulate the behavior of `gzip.GzipFile`. The GzipFile should work properly when you close it. You can't rely on the garbage collector to do this reliably.
That is because I built the threaded gzip writer and I made it reproducible by default. Threaded gzip writing was deemed "too specialistic" for CPython, and as a result python-zlib-ng and python-isal have a little more freedom in defining the threading conventions.
Thanks, that all makes sense. Nice work on the threaded implementations; this is off-topic, but is that going to be materially faster in most use cases? An alternative quick fix would be to add a `__del__` method that closes the writer.
It is better to properly close files rather than to rely on the CPython behavior of the garbage collector.
So adding a `__del__` that closes the file is not something I would recommend. See also this stackoverflow question and answer: https://stackoverflow.com/questions/7395542/is-explicitly-closing-files-important
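As an aside, when a `with` block is awkward (for example when a writer's lifetime spans several functions), explicit closing can still be made exception-safe with `try`/`finally`; a small sketch using the same `dnaio.open` call as in the reproducer above:

```python
import dnaio

def write_records(records, path):
    writer = dnaio.open(path, mode="w")
    try:
        for record in records:
            writer.write(record)
    finally:
        # Runs on normal exit and on exceptions, so the compressed stream
        # is always flushed and the file handle released.
        writer.close()
```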
Yes. Compression is a heavy workload. Offloading it to another thread frees up the Python thread (Python is effectively single-threaded, as you know) to do actual interesting work. Using more than one offloading thread only matters if a compression level higher than 1 or 2 is chosen. For any bioinformatics tool I recommend setting the compression level to 1.
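For anyone wanting to tune this from their own code, here is a hedged sketch using `xopen` (the library `dnaio` uses for compressed files); `compresslevel` and `threads` are documented keyword arguments of `xopen.xopen`, but verify them against the version you have installed:

```python
from xopen import xopen

# Write gzip output at compression level 1 with one compression thread.
with xopen("example.txt.gz", mode="wt", compresslevel=1, threads=1) as f:
    for i in range(1000):
        f.write(f"record {i}\n")
```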
I don't disagree, and thanks for taking the time to help with this. My suggestion was just to help others avoid what I would consider surprising behaviour in the future; if you don't feel it's an issue (or that it's better solved with docs/examples), that's fair, feel free to close.
I'm not entirely opposed to this. I agree with Ruben that it's risky because users should really use `close()` or a `with` statement. I run my tests with warnings turned into errors, which helps to catch unclosed files early.
If we decide to implement a `__del__`, it would only act as a fallback that closes the file. An alternative is to force users to use a context manager, so that `close()` is always called.
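To make the trade-off concrete, here is a hedged sketch (the class name and internals are invented for illustration; this is not dnaio's or xopen's actual implementation) of a writer that supports the context-manager protocol and, as a safety net, emits a `ResourceWarning` and closes itself if it is finalized without an explicit `close()`:

```python
import warnings

class RecordWriter:
    """Illustrative writer wrapper; not the real dnaio/xopen class."""

    def __init__(self, path):
        self._path = path
        self._file = open(path, "wb")  # stand-in for the threaded gzip writer
        self._closed = False

    def write(self, data: bytes):
        self._file.write(data)

    def close(self):
        if not self._closed:
            self._file.close()
            self._closed = True

    # Context-manager support: guarantees close() at the end of a with-block.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()

    # Safety net: warn loudly if the user forgot to close.
    def __del__(self):
        if not getattr(self, "_closed", True):
            warnings.warn(
                f"unclosed writer for {self._path!r}; call close() or use 'with'",
                ResourceWarning,
                stacklevel=2,
            )
            self.close()
```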
I've encountered a tricky bug where writing uncompressed fasta is fine, but writing compressed fasta produces an empty file in some cases. This was inside a demultiplexing script (my own, not using cutadapt) where I was writing multiple (~100) files simultaneously.
The fix was manually closing the files (i.e. calling `writer.close()` for each one), but I would expect this to be called automatically when the `Writer` object goes out of scope (as it appears to be for non-compressed files). My best guess would be that the process responsible for writing the compressed files is getting silently dropped; however, I don't really speak Cython, so I haven't been able to look at the source. If the fix for this isn't obvious, I can try to generate a minimal example.
Thanks.