Protein FASTA file not found, exiting - happens when --name has a particular format #1089

Open
JWDebler opened this issue Jan 7, 2025 · 5 comments

Comments

@JWDebler

JWDebler commented Jan 7, 2025

Are you using the latest release?
1.8.17 conda

Describe the bug
The annotate module exits with a "Protein FASTA file not found, exiting" error for some of my assemblies. What these all have in common is that the assembly name, which I pass to the --name flag, has the format Al-Tr-xx, where xx is a number.

Sample names that worked fine:

3C-1
Al-16-NDUS
AL-84

Sample names that failed:

Al-Tr-19-2
Al-Tr-26

There appears to be something about the --name format that trips the annotate module up.
Any idea what it could be?

Cheers.

What command did you issue?

for file in *.fasta 
do
ID="${file%%.fasta}" && \
funannotate predict --cpus $(nproc) -i $file -o funannotate_$ID -s "Ascochyta lentis" --augustus_species Alentis --name $ID --force && \
funannotate iprscan --cpus $(nproc) -i funannotate_$ID -m local && \
funannotate annotate --cpus $(nproc) -i funannotate_$ID --isolate $ID
done

Logfiles


[Jan 06 07:35 AM]: OS: Ubuntu 24.04, 32 cores, ~ 66 GB RAM. Python: 3.9.19
[Jan 06 07:35 AM]: Running funannotate v1.8.17
[Jan 06 07:35 AM]: Skipping CodingQuarry as no --rna_bam passed
[Jan 06 07:35 AM]: Parsed training data, run ab-initio gene predictors as follows:
  Program      Training-Method
  augustus     pretrained
  genemark     selftraining
  glimmerhmm   busco
  snap         busco
[Jan 06 07:36 AM]: Loading genome assembly and parsing soft-masked repetitive sequences
[Jan 06 07:36 AM]: Genome loaded: 24 scaffolds; 41,969,303 bp; 0.00% repeats masked
/data/mamba_envs/envs/funannotate/lib/python3.9/site-packages/funannotate/aux_scripts/funannotate-p2g.py:14: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  from pkg_resources import parse_version
[Jan 06 07:36 AM]: Mapping 559,318 proteins to genome using diamond and exonerate
[Jan 06 07:37 AM]: Found 302,613 preliminary alignments with diamond in 0:01:09 --> generated FASTA files for exonerate in 0:00:37
     Progress: 302613 complete, 0 failed, 0 remaining
[Jan 06 07:57 AM]: Exonerate finished in 0:19:01: found 1,491 alignments
[Jan 06 07:58 AM]: Running GeneMark-ES on assembly
[Jan 06 08:10 AM]: 12,034 predictions from GeneMark
[Jan 06 08:10 AM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[Jan 06 08:16 AM]: 1,277 valid BUSCO predictions found, validating protein sequences
[Jan 06 08:17 AM]: 799 BUSCO predictions validated
[Jan 06 08:17 AM]: Running Augustus gene prediction using Alentis parameters
     Progress: 95 complete, 0 failed, 0 remaining
[Jan 06 08:19 AM]: 10,035 predictions from Augustus
[Jan 06 08:19 AM]: Pulling out high quality Augustus predictions
[Jan 06 08:19 AM]: Found 57 high quality predictions from Augustus (>90% exon evidence)
[Jan 06 08:19 AM]: Running SNAP gene prediction, using training data: funannotate_Al-Tr-12/predict_misc/busco.final.gff3
[Jan 06 08:21 AM]: 0 predictions from SNAP
[Jan 06 08:21 AM]: SNAP prediction failed, moving on without result
[Jan 06 08:21 AM]: Running GlimmerHMM gene prediction, using training data: funannotate_Al-Tr-12/predict_misc/busco.final.gff3
[Jan 06 08:30 AM]: 8,683 predictions from GlimmerHMM
[Jan 06 08:30 AM]: Summary of gene models passed to EVM (weights):
  Source         Weight   Count
  Augustus       1        9978
  Augustus HiQ   2        57
  GeneMark       1        12034
  GlimmerHMM     1        8683
  Total          -        30752
[Jan 06 08:30 AM]: EVM: partitioning input to ~ 35 genes per partition using min 1500 bp interval
     Progress: 306 complete, 0 failed, 0 remaining
[Jan 06 08:33 AM]: Converting to GFF3 and collecting all EVM results
[Jan 06 08:33 AM]: 11,128 total gene models from EVM
[Jan 06 08:33 AM]: Generating protein fasta files from 11,128 EVM models
[Jan 06 08:33 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[Jan 06 08:34 AM]: Found 56 gene models to remove: 1 too short; 0 span gaps; 55 transposable elements
[Jan 06 08:34 AM]: 11,072 gene models remaining
[Jan 06 08:34 AM]: Predicting tRNAs
[Jan 06 08:35 AM]: 132 tRNAscan models are valid (non-overlapping)
[Jan 06 08:35 AM]: Generating GenBank tbl annotation file
[Jan 06 08:35 AM]: Collecting final annotation files for 11,204 total gene models
[Jan 06 08:35 AM]: Converting to final Genbank format
[Jan 06 08:36 AM]: Funannotate predict is finished, output files are in the funannotate_Al-Tr-12/predict_results folder
[Jan 06 08:36 AM]: Your next step might be functional annotation, suggested commands:
-------------------------------------------------------
Run InterProScan (manual install):
funannotate iprscan -i funannotate_Al-Tr-12 -c 32

Run antiSMASH (optional):
funannotate remote -i funannotate_Al-Tr-12 -m antismash -e [email protected]

Annotate Genome:
funannotate annotate -i funannotate_Al-Tr-12 --cpus 32 --sbt yourSBTfile.txt
-------------------------------------------------------

[Jan 06 08:36 AM]: Training parameters file saved: funannotate_Al-Tr-12/predict_results/Alentis.parameters.json
[Jan 06 08:36 AM]: Add species parameters to database:

  funannotate species -s Alentis -a funannotate_Al-Tr-12/predict_results/Alentis.parameters.json

Running InterProScan5 on 11072 proteins
Important: you need to manually configure your interproscan.properties file for embedded workers.
Will try to launch 32 interproscan processes, adjust -c,--cpus for your system
     Progress: 11 complete, 0 failed, 0 remaining
InterProScan5 search has completed successfully!
Results are here: funannotate_Al-Tr-12/annotate_misc/iprscan.xml
-------------------------------------------------------
[Jan 06 09:53 AM]: OS: Ubuntu 24.04, 32 cores, ~ 66 GB RAM. Python: 3.9.19
[Jan 06 09:53 AM]: Running 1.8.17
[Jan 06 09:53 AM]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[Jan 06 09:53 AM]: Parsing input files
[Jan 06 09:53 AM]: Existing tbl found: funannotate_Al-Tr-12/predict_results/Ascochyta_lentis.tbl
[Jan 06 09:53 AM]: Protein FASTA file not found, exiting

[01/06/25 23:04:50]: /data/mamba_envs/envs/funannotate/bin/funannotate annotate --cpus 32 -i funannotate_Al-Tr-25 --isolate Al-Tr-25

[01/06/25 23:04:51]: OS: Ubuntu 24.04, 32 cores, ~ 66 GB RAM. Python: 3.9.19
[01/06/25 23:04:51]: Running 1.8.17
[01/06/25 23:04:52]: hmmscan version=HMMER 3.4 (Aug 2023) path=/data/mamba_envs/envs/funannotate/bin/hmmscan
[01/06/25 23:04:52]: hmmsearch version=HMMER 3.4 (Aug 2023) path=/data/mamba_envs/envs/funannotate/bin/hmmsearch
[01/06/25 23:04:52]: diamond version=2.1.8 path=/data/mamba_envs/envs/funannotate/bin/diamond
[01/06/25 23:04:55]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[01/06/25 23:04:55]: Parsing input files
[01/06/25 23:04:55]: Existing tbl found: funannotate_Al-Tr-25/predict_results/Ascochyta_lentis.tbl
[01/06/25 23:05:04]: TBL file: funannotate_Al-Tr-25/annotate_misc/genome.tbl
[01/06/25 23:05:04]: GFF3 file: funannotate_Al-Tr-25/predict_results/Ascochyta_lentis.gff3
[01/06/25 23:05:04]: Protein FASTA file not found, exiting

OS/Install Information

-------------------------------------------------------
Checking dependencies for 1.8.17
-------------------------------------------------------
You are running Python v 3.9.19. Now checking python packages...
biopython: 1.79
goatools: 1.4.12
matplotlib: 3.9.2
natsort: 8.4.0
numpy: 1.26.4
pandas: 2.2.3
psutil: 6.0.0
requests: 2.32.3
scikit-learn: 1.5.2
scipy: 1.13.1
seaborn: 0.13.2
All 11 python packages installed


You are running Perl v b'5.032001'. Now checking perl modules...
Carp: 1.50
Clone: 0.46
DBD::SQLite: 1.72
DBD::mysql: 4.050
DBI: 1.643
DB_File: 1.858
Data::Dumper: 2.183
File::Basename: 2.85
File::Which: 1.24
Getopt::Long: 2.58
Hash::Merge: 0.302
JSON: 4.10
LWP::UserAgent: 6.67
Logger::Simple: 2.0
POSIX: 1.94
Parallel::ForkManager: 2.03
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.14
Tie::File: 1.06
URI::Escape: 5.17
YAML: 1.30
local::lib: 2.000029
threads: 2.25
threads::shared: 1.61
All 27 Perl modules installed


Checking Environmental Variables...
$FUNANNOTATE_DB=/data/databases/
$PASAHOME=/data/mamba_envs/envs/funannotate/opt/pasa-2.5.3
$TRINITY_HOME=/data/mamba_envs/envs/funannotate/opt/trinity-2.15.2
$EVM_HOME=/data/mamba_envs/envs/funannotate/opt/evidencemodeler-2.1.0
$AUGUSTUS_CONFIG_PATH=/data/mamba_envs/envs/funannotate/config/
$GENEMARK_PATH=/opt/genemark/current/
All 6 environmental variables are set
-------------------------------------------------------
Checking external dependencies...
PASA: 2.5.3
CodingQuarry: 2.0
Trinity: 2.15.2
augustus: 3.5.0
bamtools: bamtools 2.5.2
bedtools: bedtools v2.31.1
blat: BLAT v39x1
diamond: 2.1.8
emapper.py: 2.1.12
ete3: 3.1.3
exonerate: exonerate 2.4.0
fasta: 36.3.8g
glimmerhmm: 3.0.4
gmap: 2024-09-18
gmes_petap.pl: 4.71_lic
hisat2: 2.2.1
hmmscan: HMMER 3.4 (Aug 2023)
hmmsearch: HMMER 3.4 (Aug 2023)
java: 18.0.2.1
kallisto: 0.46.1
mafft: v7.526 (2024/Apr/26)
makeblastdb: makeblastdb 2.16.0+
minimap2: 2.28-r1209
pigz: 2.8
proteinortho: 6.3.2
pslCDnaFilter: no way to determine
salmon: salmon 1.10.3
samtools: samtools 1.21
signalp: 6.0
snap: 2006-07-28
stringtie: 2.2.3
tRNAscan-SE: 2.0.12 (Nov 2022)
tantan: tantan 50
tbl2asn: 25.8
tblastn: tblastn 2.16.0+
trimal: trimAl v1.5.rev0 build[2024-05-27]
trimmomatic: 0.39
All 37 external dependencies are installed
@nextgenusfs
Owner

Seems like it must be related to tbl2asn and how it parses the isolate field. You can try to run that command separately and see if you get any errors.

@JWDebler JWDebler changed the title Protein FASTA file not found, exiting - happens when --isolate has a particular format Protein FASTA file not found, exiting - happens when --name has a particular format Jan 7, 2025
@JWDebler
Author

JWDebler commented Jan 7, 2025

Sorry, I can't extract that command info from the funannotate repo. Can you tell me which command to try running?

This is what it managed to create before stopping:

image

The few entries in the genome.transcripts.fasta file are garbage.

@hyphaltip
Collaborator

hyphaltip commented Jan 7, 2025

The --name argument determines how the genes are named. The format is typically XXX_NNNN, where XXX is an NCBI-assigned prefix and NNNN is a sequential number that funannotate assigns when numbering the genes. I don't think using '-' in the name is a good idea. You can pass in something simple, e.g. ABC, or leave it empty. I think you might be conflating the name of the strain with the name of the genes?

Since you may be predicting many isolates of the same species, it would be good to pass the strain info via --strain STRAINID in the predict and annotate steps as well.

If you go to predict_misc/tbl2asn, it will have the results from tbl2asn, and you can look at the log/error files there for starters.

I would look at the size and content of the files in funannotate_Al-Tr-12/predict_results to make sure the prediction process completed fully. I don't think the isolate is the issue as much as the locus prefix.
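
A quick way to compare the size and content of the FASTA files between a sample that worked and one that failed could look like the sketch below. The helper name fasta_summary is hypothetical and not part of funannotate; it only reports the byte size and the number of records (lines starting with '>') for each file it is given.

```shell
# Hypothetical helper (not part of funannotate): for each file given as an
# argument, print its size in bytes and the number of FASTA records,
# counted as lines starting with '>'.
fasta_summary() {
    for f in "$@"
    do
        if [ -s "$f" ]; then
            # wc -c pads its output on some systems, so strip the spaces
            size="$(wc -c < "$f" | tr -d ' ')"
            records="$(grep -c '^>' "$f")"
            printf '%s\t%s bytes\t%s records\n' "$f" "$size" "$records"
        else
            # -s is false for files that do not exist or have zero length
            printf '%s\tmissing or empty\n' "$f"
        fi
    done
}
```

It could be called on whatever FASTA files are present in the folder, for example `fasta_summary funannotate_Al-Tr-12/predict_results/*.fa`, assuming the files there end in .fa; adjust the glob to the names actually listed in the folder.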

@JWDebler
Author

JWDebler commented Jan 8, 2025

No, I need the gene names to be unique, which is why I am using the sample ID for that instead of the default FUN and in the absence of an NCBI prefix. It works fine for most samples, but seems to break down for these particular ones.
I will try running the pipeline with some random --name string later to see if that helps.

The tbl2asn folder does not exist for the samples that failed, so I can't check the logs unfortunately. The files in the screenshot above are the only things created before the pipeline aborts.
The problem seems to be generating the genome.proteins.fasta file, which is strange, because the respective file exists in the predict_results folder.

Here's a screenshot of the predict_results content of a sample that correctly finished annotate versus one that failed. Everything looks similar enough to me.
image

@hyphaltip
Collaborator

My suggestion not to put '-' in the locus prefix still stands. Write a simple function to remove the dashes and see. It looks like your 'failed' samples are succeeding in that they have protein files.

I would look in the annotate_misc folder to try to make sense of what is being generated or used in the failed isolate.

I still provide the --species and --strain/isolate fields when I run annotate myself, so you may want to include those as well:

funannotate annotate -i $OUTDIR/${name} --cpus $CPUS \
    --species "$SPECIES" --strain $STRAIN --sbt $SBTTEMPLATE \
    --busco_db $BUSCODB --rename $LOCUSTAG

I would also add '--isolate "$ISOLATE"' to your command line for predict so that the isolate/strain becomes part of the final filenames. It seems like you aren't doing that in the predict step, as the files are all named only genus_species.
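
The "simple function to remove the dashes" suggested above could be sketched as follows. The helper name sanitize_prefix is hypothetical, and whether a dash-free prefix actually resolves the error has not been confirmed in this thread. The loop only prints the mapping from sample ID to prefix; it does not run funannotate.

```shell
# Hypothetical helper (not part of funannotate): drop every character that
# is not a letter or digit, so "Al-Tr-26" becomes "AlTr26".
sanitize_prefix() {
    printf '%s' "$1" | tr -cd '[:alnum:]'
}

for file in *.fasta
do
    [ -e "$file" ] || continue           # skip if the glob matched nothing
    ID="${file%%.fasta}"                 # original sample ID, e.g. Al-Tr-26
    PREFIX="$(sanitize_prefix "$ID")"    # dash-free locus prefix
    printf '%s -> %s\n' "$ID" "$PREFIX"  # dry run: show the mapping only
done
```

In the original loop, "$PREFIX" would then replace $ID only in the --name argument, while -o and --isolate keep the original ID. Note that two different IDs can collapse to the same prefix (Al-Tr-19-2 and Al-Tr-192 would both give AlTr192), so it is worth checking the printed mapping for duplicates before a full run.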
