C++ preprocess (m6A) creates a large number of temporary files #43

Open
olliecheng opened this issue Sep 29, 2024 · 5 comments

@olliecheng

Hi all,

Not sure if this is a duplicate of #38.

I've been trying to get CHEUI up and running, but have hit an issue with the preprocessing step. I have been running the C++ preprocessing script (compiled with GCC v11.3 on RHEL 9.4, commit 7b422f7808a3c2ffff56a9ead33a199824753b4e) on an 858 GB nanopolish eventalign file, generated from a ~2 GB .fastq of reads.

In this instance, I noticed that the preprocessing script was producing an absurd number of temporary files - over 7 million before the HPC file quota ran out and the script was killed. (This was much to the chagrin of my university's HPC admin, and I promptly received a very strongly worded email advising me not to generate so many temporary files! 😅)

I've attached a small selection of ~1000 of these temporary files for debugging purposes, in case it's of interest. Each file seems to be very small - a few lines at most, based on my n = 10 sample.
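
For what it's worth, a quick way to reproduce the count and the line-length check would be something like the following (the output directory and filename pattern here are just guesses based on the name of the attached archive):

# Hypothetical output directory and filename pattern -- adjust to your own run.
# Count how many intermediate files have been produced so far:
find "$OUTPUT_DIR" -type f -name '*signals+IDs*' | wc -l

# Check the line counts of a small sample of them:
find "$OUTPUT_DIR" -type f -name '*signals+IDs*' | head -n 10 | xargs wc -l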

The aligned events file was generated using:

nanopolish eventalign -t {threads} \
    --reads {input.reads} \
    --bam {input.bam} \
    --genome {REF} \
    --scale-events --signal-index --samples --print-read-names > {output}

and I was calling preprocess using:

# must first be in this path, or else the program crashes
cd $PATH_TO_CHEUI_PREPROCESS_DIR

./CHEUI -i $INPUT -m ../../kmer_models/model_kmer.csv -n {threads} --m6A -o $OUTPUT

See the attached sample of temporary files below: out_A_signals+IDs.zip

Let me know if there's anything else that you need.
Ollie

@olliecheng
Author

olliecheng commented Sep 29, 2024

> maybe this will help
>
> redacted mega.co.nz link
>
> Password: changeme I put the necessary dlls in the archive

Sorry, I'm confused by this comment. I'm not sure how this is applicable, and I also don't want to run precompiled code with no source, especially on a shared HPC cluster. I'm also using RHEL v9.4 on the cluster, not Windows. Furthermore, VirusTotal flags x86_64-w64-ranlib.exe as malware. Are you affiliated with the Eyras computational RNA biology group?

@comprna comprna deleted a comment Sep 29, 2024
@EduEyras
Member

EduEyras commented Sep 29, 2024 via email

Did you check the possible solution given by @pre-mRNA?

> We apologise for the issue. CHEUI currently generates a large number of intermediate files. You can process POD5 files sequentially to predict model 1, e.g.:
>
> POD5 1 -> eventalign -> preprocess_m6a -> CHEUI model 1
>
> You can then delete the eventalign and preprocess files, and keep only the model 1 output.
>
> Before predicting model 2, you will need to merge and sort all the model 1 files.
>
> Hope this helps.
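
As a rough sketch, the per-chunk loop could look something like the following. The chunk layout, output file names, and the final sort key are assumptions rather than CHEUI's documented interface; the nanopolish and CHEUI preprocess flags mirror the invocations earlier in this thread.

# Sketch only: process one chunk of reads at a time and keep just the model 1 output.
WORKDIR="$PWD"

for READS in "$WORKDIR"/chunks/chunk_*.fastq; do
    CHUNK=$(basename "$READS" .fastq)

    # 1. eventalign for this chunk only
    nanopolish eventalign -t "$THREADS" \
        --reads "$READS" \
        --bam "$WORKDIR/aligned/${CHUNK}.bam" \
        --genome "$REF" \
        --scale-events --signal-index --samples --print-read-names \
        > "$WORKDIR/${CHUNK}.eventalign.txt"

    # 2. CHEUI preprocess, run from its own directory as noted above
    (cd "$PATH_TO_CHEUI_PREPROCESS_DIR" && \
        ./CHEUI -i "$WORKDIR/${CHUNK}.eventalign.txt" \
                -m ../../kmer_models/model_kmer.csv \
                -n "$THREADS" --m6A \
                -o "$WORKDIR/${CHUNK}_preprocess")

    # 3. ... run CHEUI model 1 on ${CHUNK}_preprocess here, writing ${CHUNK}_model1.txt ...

    # 4. delete the intermediates once the model 1 output for this chunk exists
    rm -f  "$WORKDIR/${CHUNK}.eventalign.txt"
    rm -rf "$WORKDIR/${CHUNK}_preprocess"
done

# Merge and sort all per-chunk model 1 outputs before running model 2
# (the sort key here is a placeholder -- check the order required by model 2).
sort -k1,1 -k2,2n "$WORKDIR"/chunk_*_model1.txt > "$WORKDIR/model1_sorted.txt"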

@olliecheng
Author

olliecheng commented Sep 30, 2024

Hi,

Thanks for your reply.

I have not tried it yet, as I am unsure how many intermediate files it would still generate. I also noticed that the solution above is for the Python preprocess script, while I am using the C++ one. I just wanted to first confirm whether the very small size of each temporary file (only a few lines each) is expected behaviour.
