Module A eukaryote script is identical to prokaryote script #6

Open
martaolmu opened this issue Feb 2, 2024 · 2 comments

Comments

@martaolmu

Hi!

I would like to run module A for a eukaryotic organism (Mus musculus). So far the analysis has been running for a month (and I think it will take a lot longer).
I've been looking at both scripts (putative_orfs.sh and putative_orfs_eukaryotes.sh) and I cannot see any differences. Is it possible that there's an error and the eukaryote script is not separating the chromosomes in the FASTA file, and that's why it is taking so long? Would you have any recommendations?

Thanks so much!

Marta

@AlexanderBartholomaeus
Owner

Hi Marta,

the problem might be the ORF detection when you apply it to a complete eukaryotic genome, because there are simply too many START/STOP codons. E. coli (5 million bp genome) has ~1.2 million potential START codons, and for the mouse genome I would estimate ~600 million. This will likely cause the script to use a lot of RAM and become very slow.
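The estimate above can be reproduced with a simple linear scaling by genome size. This is only a back-of-the-envelope sketch: the E. coli figures come from the comment, while the mouse genome size of ~2.7 billion bp and the function name `estimate_start_codons` are assumptions added for illustration.

```python
def estimate_start_codons(ref_starts, ref_genome_bp, target_genome_bp):
    """Scale a START-codon count linearly with genome size."""
    density = ref_starts / ref_genome_bp  # potential START codons per bp
    return density * target_genome_bp


ECOLI_STARTS = 1.2e6  # potential START codons in E. coli (from the comment)
ECOLI_BP = 5e6        # E. coli genome size in bp (from the comment)
MOUSE_BP = 2.7e9      # assumption: approximate mouse genome size in bp

mouse_estimate = estimate_start_codons(ECOLI_STARTS, ECOLI_BP, MOUSE_BP)
print(f"~{mouse_estimate / 1e6:.0f} million potential START codons")
```

Under these assumptions the result is in the same range as the ~600 million quoted above, that is, more than 500 times the E. coli search space.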

Did you use the whole genome as input?

Please consider the following thoughts: For eukaryotes the smORF analysis might be a bit more complicated, because you might need to run the pipeline multiple times, depending on what you want to find. When you use the genome, you will detect small ORFs that span exon-intron boundaries. These are very likely to be false positives if the intron is spliced out. You could think about using only mRNA sequences as input. This would also reduce the search space for possible ORFs a lot. However, you would not be able to detect new ORFs that lie on possibly short, not yet detected (m)RNAs.
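To illustrate why mRNA input is simpler, here is a minimal sketch of a forward-strand small-ORF scan over a single transcript. It is not the pipeline's detection code. The function name `find_small_orfs`, the choice of ATG as the only START codon, and the `max_codons` cutoff of 100 are all assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_small_orfs(seq, start_codons=("ATG",), max_codons=100):
    """Return (start, end) pairs of ORFs on the forward strand.

    Coordinates are 0-based, end-exclusive, and include the STOP codon.
    For each STOP, only the first in-frame START upstream is reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        open_start = None  # position of the currently open START, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if open_start is None and codon in start_codons:
                open_start = i
            elif open_start is not None and codon in STOP_CODONS:
                n_codons = (i + 3 - open_start) // 3
                if n_codons <= max_codons:
                    orfs.append((open_start, i + 3))
                open_start = None
    return orfs


transcript = "CCATGAAATAGGG"  # toy sequence, not real data
for start, end in find_small_orfs(transcript):
    print(start, end, transcript[start:end])
```

Because an mRNA is already spliced and stranded, only three reading frames need to be scanned per transcript, instead of six frames across whole chromosomes.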

Currently, I am trying to write a more efficient ORF detection procedure, but I am not sure how long this will take (or whether it will actually be faster). I will also check the code regarding the multi-chromosome issue.
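Until that is checked, one workaround is to feed the ORF detection one sequence at a time. The sketch below streams records from a multi-FASTA input so that only one chromosome is held in memory at once. Whether putative_orfs_eukaryotes.sh already does something similar is not confirmed in this thread, and `iter_fasta_records` is a hypothetical helper, not part of the pipeline.

```python
def iter_fasta_records(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines.

    The name is the first whitespace-delimited token after '>'.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy in-memory example; with a real file, pass the open file handle instead.
demo = [">chr1 test record", "ACGT", "AC", "", ">chr2", "GG"]
for name, seq in iter_fasta_records(demo):
    print(name, len(seq))
```

Each record could then be written to its own file or passed to the ORF scan separately, which also makes it possible to process chromosomes in parallel.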

Best,
Alex

@martaolmu
Author

Hi Alex,

yes, I did use the whole genome as input. I will consider using only mRNA sequences as input, as you suggest.
Thanks so much for the info and your time!

Marta
