[GAMBITDB] - Create new utility for fungal databases #1
base: main
… name to just binomial, report species ID instead of lower ranks
…ved output logging; updated max_contigs parameter in CLI; added directory creation for database output.
…gnore for new directories
… with gambitdb-fungi script details and usage instructions
…ng scripts for improved maintainability
…d clean up intermediate summary files
…ables and update filtering logic for overlaps
… SplitSpecies classes and adjust expected output values
…gic for overlaps; update .gitignore to include new Aspergillus directory
…ies; modify Diameters class logging for clarity; add linkage method argument to gambitdb-curate script
…eTable for improved readability
…iled species data, including name, position, assemblies, diameter, and genome count.
…IT database; update .gitignore to include fixed_final directory
…ion during db contruction
… taxa ranks for subspecies
…rt and genus fix scripts
output_logfile.txt
should these be removed?
@@ -586,6 +586,157 @@ options:
--verbose, -v Turn on verbose output (default: False)
```

## gambitdb-update-taxa-report
### Description
The `gambitdb-update-taxa-report` script updates report flags and taxonomic rankgins in a GAMBIT database. The script specifically handles subspeces designations and report flag settings for different taxonomic rankings. For our use cases, we set subspecies report to 0, and species & genus to 1.
subspeces -> subspecies
self.logger.debug(f"Fetching next page for taxon {taxon_id}")
genome_data = self.api_client.get_json(url, params={**self.genome_report_params, 'page_token': page_token})

except Exception as e:
This may stop populating genome_list as soon as any failure occurs; for example, if there is an error on the first genome, no genomes would be returned even if more remain to be attempted. Is this the desired behavior, or should exception handling occur on a report-by-report basis?
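One way the report-by-report handling asked about above could look, as a minimal sketch rather than the PR's actual code. `fetch_page` and `parse_report` are hypothetical stand-ins (not names from `Fungi.py`), and the `reports` / `next_page_token` keys are assumed for illustration only.

```python
import logging

logger = logging.getLogger("fungi_sketch")


def parse_report(report):
    """Hypothetical parser: raises KeyError when a report lacks an accession."""
    return report["accession"]


def fetch_all_pages(fetch_page, max_pages=100):
    """Collect genomes page by page, keeping earlier results when something fails.

    fetch_page is a hypothetical callable taking a page token (None for the
    first page) and returning a dict with 'reports' and, optionally,
    'next_page_token'.
    """
    genome_list = []
    page_token = None
    for _ in range(max_pages):
        try:
            page = fetch_page(page_token)
        except Exception as exc:
            # Paging cannot continue without the next token, but the genomes
            # gathered so far are still returned instead of being discarded.
            logger.warning("Page fetch failed, keeping %d genomes: %s",
                           len(genome_list), exc)
            break
        for report in page.get("reports", []):
            try:
                genome_list.append(parse_report(report))
            except Exception as exc:
                # One malformed report does not abort the rest of the page.
                logger.warning("Skipping malformed report: %s", exc)
        page_token = page.get("next_page_token")
        if not page_token:
            break
    return genome_list
```

With this shape, a failure on one report costs only that report, and a failed page request costs only the pages after it.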
# Process valid genomes for species
if len(valid_genomes) >= self.minimum_genomes_per_species:
self.logger.debug(
f"Taxon {taxon_id} has {len(valid_genomes)} valid genomes after filtering"
Other debug messages for the minimum genome threshold explicitly state the threshold; consider including it here as well, e.g.
(minimum required: {self.minimum_genomes_per_species})
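A small sketch of how both branches could share one message that always states the threshold. `threshold_message` is a hypothetical helper, not something in the PR; only the message wording comes from the diff and the suggestion above.

```python
def threshold_message(taxon_id, n_valid, minimum):
    """Build the debug text for either branch, always stating the threshold."""
    # "has" when the threshold is met, "only has" when it is not,
    # mirroring the two messages in the diff.
    status = "has" if n_valid >= minimum else "only has"
    return (
        f"Taxon {taxon_id} {status} {n_valid} valid genomes after filtering "
        f"(minimum required: {minimum})"
    )
```

Each branch would then pass the result to `self.logger.debug(...)`.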
self.species_genome_counts[taxon_id] = len(valid_genomes)
else:
self.logger.debug(
f"Taxon {taxon_id} only has {len(valid_genomes)} valid genomes after filtering"
see above comment
GAMBIT Database Fungi Processing Updates
Added features for downloading fungal data from NCBI-Datasets
FungiParser - Fungi.py
This is a class for processing fungal genomic data from RefSeq. It handles downloading, filtering, and organizing fungal genome data based on quality criteria set by defaults or user input.
Takes in an assembly summary from RefSeq:
Key columns used by GAMBIT database processing are:
The `FungiParser` class does the following:
gambitdb-fungi
Command-line interface script for filtering and downloading fungal genome data using the `FungiParser` class.
`gambitdb-fungi` has the following optional flags:
These two additions are the major entry points for downloading fungal data from RefSeq and processing genome metadata / downloading genomes for gambitdb.
run_gambitdb_fungi.sh
This is the utility script that runs the download and gambitdb creation steps.
In the future, GNU Parallel could be added.
Helper scripts added
- `gambitdb-update-taxa-report`: Updates report flags and taxonomic rankings
- `gambitdb-merge-duplicates`: Identifies and merges duplicate taxa entries
- `gambitdb-fungi-fix-genera`: Fixes issues with genus diameters set to 0 due to subspeciation
- `gambitdb-fungi-analyze`: New tool for post-processing analysis of species distances and overlaps; reports overlaps in a txt file and creates a dendrogram for each overlap.
Bug Fixes and Improvements
- `Diameters.py`, function `calculate_diameters`: a change that allows us to track genomes, species, and diameters 1:1 so there is no index shift. The long-term fix would be to update the data structures in the code base so that nothing shifts and everything is explicitly tracked: rely less on looping through a pandas Series and more on exact matching of species to their genomes and to all data generated throughout the process.
- When the `--accessions_to_remove` flag was used in the creation process, it did not actually remove the assemblies from the pairwise (PW) calculations but instead used everything in the assembly directory. I think what is happening here is that the `continue` only skips to the next iteration of the inner `for accession` loop, but after the inner loop finishes the code still proceeds to write the assembly list because it needs to exit the outer assembly loop: https://github.com/gambit-suite/gambitdb/blob/im-fungal-db/gambitdb/PairwiseTable.py#L98-L124. This was updated and tested with Aspergillus exclusions.
Documentation
Added comprehensive documentation for new scripts in README.md, including usage examples and parameter descriptions for `gambitdb-fungi`, `gambitdb-update-taxa-report`, `gambitdb-merge-duplicates`, `gambitdb-fungi-fix-genera`, and `gambitdb-fungi-analyze`.
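For reviewers, the `continue` behavior described under Bug Fixes and Improvements for `--accessions_to_remove` can be reproduced in isolation. This is a simplified sketch of the general pattern, not the code in `PairwiseTable.py`; the function names and the substring match between accession and assembly filename are assumptions made for illustration.

```python
def assemblies_buggy(assemblies, accessions_to_remove):
    """Pattern with the bug: `continue` only affects the inner loop."""
    kept = []
    for assembly in assemblies:
        for accession in accessions_to_remove:
            if accession in assembly:
                continue  # moves to the next accession, not the next assembly
        kept.append(assembly)  # still reached for excluded assemblies
    return kept


def assemblies_fixed(assemblies, accessions_to_remove):
    """One possible fix: decide on exclusion before touching the output list."""
    kept = []
    for assembly in assemblies:
        if any(accession in assembly for accession in accessions_to_remove):
            continue  # now skips the rest of the outer iteration
        kept.append(assembly)
    return kept
```

Given two assembly filenames and one accession to exclude, the first function keeps both files while the second keeps only the one that was not excluded.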