Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GAMBITDB] - Create new utility for fungal databases #1

Open
wants to merge 42 commits into
base: main
Choose a base branch
from

Conversation

cimendes
Copy link
Member

@cimendes cimendes commented Jan 8, 2025

GAMBIT Database Fungi Processing Updates

Added features for downloading fungal data from NCBI-Datasets

FungiParser - Fungi.py

This is a class for processing fungal genomic data from RefSeq. Handles downloading, filtering, and organizing fungal genome data based on various quality criteria set by defualts or user input.

Takes in an assembly summary from refseq:

Column Description Example
assembly_accession Unique identifier GCF_000002945.2
bioproject BioProject accession PRJNA127
species_taxid Species taxonomic ID 4896
organism_name Organism name Schizosaccharomyces pombe
assembly_level Assembly completion level Chromosome
release_type Release significance Major
genome_rep Genome representation Full
seq_rel_date Sequence release date 2019/07/23
asm_name Assembly name ASM294v3
ftp_path Download path https://ftp.ncbi.nlm.nih.gov/all/...
genome_size Total genome size 12571820
gc_percent GC content percentage 36.0
scaffold_count Number of scaffolds 3
contig_count Number of contigs 3
annotation_provider Source of annotation PomBase
annotation_date Date of annotation 2024-08-16
total_gene_count Total genes 12645
protein_coding_gene_count Protein coding genes 5124
non_coding_gene_count Non-coding genes 7489

Key columns used by GAMBIT database processing are:

  • assembly_accession
  • species_taxid
  • organism_name
  • contig_count

The FungiParser class does the following:

  • Parse Input: Reads and parses RefSeq assembly summary TSV metadata file
  • Process Species: Group genomes by species ID and get parent taxonomy from NCBI API; filter out genomes based on quality criteria (contigs, 'sp.' designation, min genomes per species)
  • Download Genomes: Download filtered genomes in batches of 100 using NCBI datasets API, extract FASTA files and record failures
  • Apply Preferences: Prioritize RefSeq assemblies over GenBank when duplicates exist
  • Generate Outputs: Write CSV files for assembly metadata, taxonomy relationships and filtered genome reports to be used in gambit generation

gambitdb-fungi

Command-line interface script for filtering and downloading fungal genome data using the FungiParser class.

gambitdb-fungi has the following optional flags:

# Processing
--max_contigs | Maximum contigs (default: 100000) 
--minimum_genomes | Min genomes per species (default: 2)
--exclude_atypical | Exclude atypical genomes
--is_metagenome_derived | MAG metadata handling (defualt: 'metagenome_derived_exclude')
--parent_taxonomy | Parent level

# Output
-g | Genome metadata output 
-a | Species data output
-o | FASTA output directory

These two additions are the major entry points into downloading fungal data from refseq and processing genome metadata / downloading genomes for gambitdb.

run_gambitdb_fungi.sh

This is the utility script to run the download and gambitdb creation steps

  • Split Input: Take large RefSeq assembly metadata and split into smaller chunks of 100
  • Process Chunks: Runs gambitdb-fungi sequentially on each split
  • Merges Results: Combines all spits into a single sets of CSV files
  • Make Database: Uses merged results and builds signatures and final GAMBIT DB

In future could add gnu parellel

Helper scripts added

  • Added scripts to handle fungi-specific database processing and analysis:
    • gambitdb-update-taxa-report: Updates report flags and taxonomic rankings
    • gambitdb-merge-duplicates: Identifies and merges duplicate taxa entries
    • gambitdb-fungi-fix-genera: Fixes issues with genus diameters set to 0 due to subspeciation
    • gambitdb-fungi-analyze: New tool for post-processing analysis of species distances and overlaps, reports overlaps in in txt file and creates dendrogram for each overlap.

Bug Fixes and Improvements

  • We noticed that after a few rounds of creation and looking at diameters, that the diameters did not match what should have been with the actual species. This was caused due to a lack of tracking genomes + species index and lead to slight misalignment with what diameter was being reported for what species. This was patched by creating a dictionary in Diameters.py function calculate_diameters that allows us to track genomes + species + diameters 1:1 so there is no index shift. The long term fix for this would be to update the data structures in the code base to make sure there are no shifts and everything is explicity being tracked. Rely less on looping through a pandas series and more on exact matching of species to their genomes and all data that is generated throughout the process.
  • Another issue is that when the —-accessions_to_remove flag was used in the creation process, it did not actually remove the assemblies from the PW calculations but instead used everything in the assembly directory. I think what is happening here is that the continue only skips for the next iteration of the inner for accession loop, but after the inner loop finnishes the code still procceeds to write the assembly list because it needs to exit the outer assembly loop: https://github.com/gambit-suite/gambitdb/blob/im-fungal-db/gambitdb/PairwiseTable.py#L98-L124. This was updated and tested with Aspergillus exclusions.

Documentation

Added comprehensive documentation for new scripts in README.md including usage examples and parameter descriptions for gambitdb-fungi, gambitdb-update-taxa-report, gambitdb-merge-duplicates, gambitdb-fungi-fix-genera, gambitdb-fungi-analyze.

cimendes and others added 30 commits October 22, 2024 13:27
… name to just binomial, report species ID instead of lower ranks
…ved output logging; updated max_contigs parameter in CLI; added directory creation for database output.
… with gambitdb-fungi script details and usage instructions
…ables and update filtering logic for overlaps
… SplitSpecies classes and adjust expected output values
…gic for overlaps; update .gitignore to include new Aspergillus directory
output_logfile.txt
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should these be removed?

@@ -586,6 +586,157 @@ options:
--verbose, -v Turn on verbose output (default: False)
```

## gambitdb-update-taxa-report
### Description
The `gambitdb-update-taxa-report` script updates report flags and taxonomic rankgins in a GAMBIT database. The script specifically handles subspeces designations and report flag settings for different taxonomic rankings. For our use cases, we set subspecies report to 0, and species & genus to 1.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subspeces -> subspecies

self.logger.debug(f"Fetching next page for taxon {taxon_id}")
genome_data = self.api_client.get_json(url, params={**self.genome_report_params, 'page_token': page_token})

except Exception as e:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may terminate populating genome_list regardless of when a failure is noted; i.e. if there is an error in the first genome, then no genomes would be returned even if there are more genomes to be attempted. Is this the desired behavior, or should it be implemented so Exception handling occurs on a report-by-report basis?

# Process valid genomes for species
if len(valid_genomes) >= self.minimum_genomes_per_species:
self.logger.debug(
f"Taxon {taxon_id} has {len(valid_genomes)} valid genomes after filtering"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other debugs for minimum genome threshold explicitly announce the threshold; consider including here as well, e.g.

(minimum required: {self.minimum_genomes_per_species})

self.species_genome_counts[taxon_id] = len(valid_genomes)
else:
self.logger.debug(
f"Taxon {taxon_id} only has {len(valid_genomes)} valid genomes after filtering"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants