[GAMBITDB] - Create new utility for fungal databases #1

cimendes · 2025-01-08T13:58:32Z · edited by Michal-Babins

GAMBIT Database Fungi Processing Updates

Added features for downloading fungal data from NCBI-Datasets

FungiParser - Fungi.py

A class for processing fungal genomic data from RefSeq. It handles downloading, filtering, and organizing fungal genome data based on quality criteria set by defaults or user input.

It takes in an assembly summary from RefSeq:

Column	Description	Example
assembly_accession	Unique identifier	GCF_000002945.2
bioproject	BioProject accession	PRJNA127
species_taxid	Species taxonomic ID	4896
organism_name	Organism name	Schizosaccharomyces pombe
assembly_level	Assembly completion level	Chromosome
release_type	Release significance	Major
genome_rep	Genome representation	Full
seq_rel_date	Sequence release date	2019/07/23
asm_name	Assembly name	ASM294v3
ftp_path	Download path	https://ftp.ncbi.nlm.nih.gov/all/...
genome_size	Total genome size	12571820
gc_percent	GC content percentage	36.0
scaffold_count	Number of scaffolds	3
contig_count	Number of contigs	3
annotation_provider	Source of annotation	PomBase
annotation_date	Date of annotation	2024-08-16
total_gene_count	Total genes	12645
protein_coding_gene_count	Protein coding genes	5124
non_coding_gene_count	Non-coding genes	7489

Key columns used by GAMBIT database processing are:

assembly_accession
species_taxid
organism_name
contig_count

The FungiParser class does the following:

Parse Input: Reads and parses the RefSeq assembly summary TSV metadata file
Process Species: Groups genomes by species ID and gets the parent taxonomy from the NCBI API; filters out genomes based on quality criteria (contigs, 'sp.' designation, min genomes per species)
Download Genomes: Downloads filtered genomes in batches of 100 using the NCBI datasets API, extracts FASTA files and records failures
Apply Preferences: Prioritizes RefSeq assemblies over GenBank when duplicates exist
Generate Outputs: Writes CSV files for assembly metadata, taxonomy relationships and filtered genome reports, for use in GAMBIT database generation
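The Process Species filtering above can be sketched roughly as follows. This is a minimal illustration, not the FungiParser implementation: the function name `filter_assemblies`, the `example_rows` data and the accession placeholders are hypothetical, and only the column names and the default thresholds (100000 contigs, 2 genomes per species) are taken from this description.

```python
from collections import defaultdict


def filter_assemblies(rows, max_contigs=100000, minimum_genomes=2):
    """Group assembly rows by species_taxid and apply the quality filters.

    `rows` is an iterable of dicts keyed by assembly summary column names.
    Returns {species_taxid: [assembly_accession, ...]} for species that pass.
    (Hypothetical helper, for illustration only.)
    """
    by_species = defaultdict(list)
    for row in rows:
        # Drop assemblies that are too fragmented.
        if int(row["contig_count"]) > max_contigs:
            continue
        # Drop unnamed species such as "Candida sp. XYZ" or "Aspergillus sp.".
        if " sp. " in f" {row['organism_name']} ":
            continue
        by_species[row["species_taxid"]].append(row["assembly_accession"])

    # Keep only species that still have enough genomes after filtering.
    return {
        taxid: accessions
        for taxid, accessions in by_species.items()
        if len(accessions) >= minimum_genomes
    }


# Made-up rows; the accessions are placeholders, not real assemblies.
example_rows = [
    {"assembly_accession": "ACC_A", "species_taxid": "4896",
     "organism_name": "Schizosaccharomyces pombe", "contig_count": "3"},
    {"assembly_accession": "ACC_B", "species_taxid": "4896",
     "organism_name": "Schizosaccharomyces pombe", "contig_count": "12"},
    {"assembly_accession": "ACC_C", "species_taxid": "1111",
     "organism_name": "Candida sp. XYZ", "contig_count": "40"},
    {"assembly_accession": "ACC_D", "species_taxid": "2222",
     "organism_name": "Aspergillus niger", "contig_count": "20"},
]

passing = filter_assemblies(example_rows)
```

In this toy input only taxon 4896 survives: the 'sp.' entry is dropped by name, and the single Aspergillus niger genome falls below the minimum of two genomes per species.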

gambitdb-fungi

Command-line interface script for filtering and downloading fungal genome data using the FungiParser class.

gambitdb-fungi has the following optional flags:

# Processing
--max_contigs | Maximum contigs (default: 100000) 
--minimum_genomes | Min genomes per species (default: 2)
--exclude_atypical | Exclude atypical genomes
--is_metagenome_derived | MAG metadata handling (default: 'metagenome_derived_exclude')
--parent_taxonomy | Parent level

# Output
-g | Genome metadata output 
-a | Species data output
-o | FASTA output directory

These two additions are the main entry points for downloading fungal genomes from RefSeq and processing their metadata for gambitdb.

run_gambitdb_fungi.sh

This is the utility script that runs the download and gambitdb creation steps:

Split Input: Takes the large RefSeq assembly metadata file and splits it into smaller chunks of 100
Process Chunks: Runs gambitdb-fungi sequentially on each split
Merge Results: Combines all splits into a single set of CSV files
Make Database: Uses the merged results to build signatures and the final GAMBIT DB

In the future, GNU parallel could be added to process the chunks concurrently.
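The split and merge steps can be illustrated with a short sketch. The real utility is a shell script; this Python version only shows the idea, and the helper names `split_rows` and `merge_csv_texts` are hypothetical.

```python
import csv
import io


def split_rows(rows, chunk_size=100):
    """Yield successive chunks of at most `chunk_size` rows."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def merge_csv_texts(csv_texts):
    """Concatenate CSV documents that share a header, writing the header once."""
    merged = io.StringIO()
    writer = None
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        if reader.fieldnames is None:
            continue  # empty chunk output, nothing to merge
        if writer is None:
            writer = csv.DictWriter(
                merged, fieldnames=reader.fieldnames, lineterminator="\n"
            )
            writer.writeheader()
        for row in reader:
            writer.writerow(row)
    return merged.getvalue()


# 250 metadata rows become chunks of 100, 100 and 50.
chunk_sizes = [len(chunk) for chunk in split_rows(list(range(250)))]

# Two per-chunk CSV outputs with the same header merge into one table.
merged_text = merge_csv_texts(["a,b\n1,2\n", "a,b\n3,4\n"])
```

Because each chunk is processed independently before the merge, the per-chunk step is the natural place to introduce parallelism later.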

Helper scripts added

Added scripts to handle fungi-specific database processing and analysis:
- gambitdb-update-taxa-report: Updates report flags and taxonomic rankings
- gambitdb-merge-duplicates: Identifies and merges duplicate taxa entries
- gambitdb-fungi-fix-genera: Fixes issues with genus diameters set to 0 due to subspeciation
- gambitdb-fungi-analyze: New tool for post-processing analysis of species distances and overlaps; reports overlaps in a txt file and creates a dendrogram for each overlap.

Bug Fixes and Improvements

After a few rounds of database creation and inspecting diameters, we noticed that the reported diameters did not match the expected values for the actual species. The cause was that genomes and species indices were not tracked together, which led to a slight misalignment between each reported diameter and its species. This was patched by creating a dictionary in the calculate_diameters function in Diameters.py that tracks genomes, species and diameters 1:1 so there is no index shift. The long-term fix would be to update the data structures in the code base so that nothing can shift and everything is explicitly tracked: rely less on looping through a pandas Series and more on exact matching of species to their genomes and to all data generated throughout the process.
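As a rough sketch of the keyed-tracking idea (not the code in Diameters.py), diameters can be stored under the species they belong to instead of by position. The function name `diameters_by_species`, the toy distances and the genome labels are hypothetical, and the diameter is assumed here to be the maximum pairwise distance among a species' genomes.

```python
def diameters_by_species(species_to_genomes, distance):
    """Return {species: diameter}, keyed by species rather than by position.

    Assumption for this sketch: diameter is the maximum pairwise distance
    among the genomes of a species, and 0.0 when fewer than two genomes exist.
    """
    diameters = {}
    for species, genomes in species_to_genomes.items():
        diameters[species] = max(
            (distance(a, b)
             for i, a in enumerate(genomes)
             for b in genomes[i + 1:]),
            default=0.0,
        )
    return diameters


# Toy symmetric distances between made-up genome labels.
toy_distances = {
    frozenset(("g1", "g2")): 0.2,
    frozenset(("g1", "g3")): 0.5,
    frozenset(("g2", "g3")): 0.3,
    frozenset(("g4", "g5")): 0.1,
}


def toy_distance(a, b):
    return toy_distances[frozenset((a, b))]


species_to_genomes = {
    "species_A": ["g1", "g2", "g3"],
    "species_B": ["g4", "g5"],
    "species_C": ["g6"],
}

diameters = diameters_by_species(species_to_genomes, toy_distance)
```

Since every value is looked up by species key, dropping or reordering a species cannot shift another species' diameter, which is the failure mode described above.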
Another issue was that when the --accessions_to_remove flag was used in the creation process, it did not actually remove the assemblies from the pairwise (PW) calculations; everything in the assembly directory was used instead. I think what is happening here is that the continue only skips to the next iteration of the inner accession loop, so after the inner loop finishes the code still proceeds to write the assembly to the list, when it needs to skip the rest of the outer assembly loop iteration: https://github.com/gambit-suite/gambitdb/blob/im-fungal-db/gambitdb/PairwiseTable.py#L98-L124. This was updated and tested with Aspergillus exclusions.
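The loop behaviour described above can be reproduced in isolation. The following is a reconstruction of the pattern only, not the code in PairwiseTable.py; the function names and file names are hypothetical.

```python
def keep_assemblies_buggy(assembly_files, accessions_to_remove):
    """Pattern that fails to exclude anything."""
    kept = []
    for assembly in assembly_files:
        for accession in accessions_to_remove:
            if accession in assembly:
                continue  # only advances the inner accession loop
        kept.append(assembly)  # still reached for every assembly
    return kept


def keep_assemblies_fixed(assembly_files, accessions_to_remove):
    """Decide per assembly first, then skip the outer iteration."""
    kept = []
    for assembly in assembly_files:
        if any(accession in assembly for accession in accessions_to_remove):
            continue  # skips the append for this assembly
        kept.append(assembly)
    return kept


# Made-up file names and accession labels.
files = ["asm_A.fna", "asm_B.fna", "asm_C.fna"]
to_remove = ["asm_B"]

buggy_result = keep_assemblies_buggy(files, to_remove)
fixed_result = keep_assemblies_fixed(files, to_remove)
```

The buggy variant returns all three files, while the fixed variant drops `asm_B.fna`.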

Documentation

Added comprehensive documentation for new scripts in README.md including usage examples and parameter descriptions for gambitdb-fungi, gambitdb-update-taxa-report, gambitdb-merge-duplicates, gambitdb-fungi-fix-genera, gambitdb-fungi-analyze.


xonq · 2025-02-11T19:39:41Z

.gitignore

output_logfile.txt

should these be removed?

xonq · 2025-02-11T19:41:47Z

README.md

@@ -586,6 +586,157 @@ options:
  --verbose, -v         Turn on verbose output (default: False)
 ```

+## gambitdb-update-taxa-report
+### Description
+The `gambitdb-update-taxa-report` script updates report flags and taxonomic rankgins in a GAMBIT database. The script specifically handles subspeces designations and report flag settings for different taxonomic rankings. For our use cases, we set subspecies report to 0, and species & genus to 1. 


subspeces -> subspecies

xonq · 2025-02-11T19:54:10Z

gambitdb/Fungi.py

+                self.logger.debug(f"Fetching next page for taxon {taxon_id}")
+                genome_data = self.api_client.get_json(url, params={**self.genome_report_params, 'page_token': page_token})
+
+        except Exception as e:


This may terminate populating genome_list regardless of when a failure is noted; i.e. if there is an error in the first genome, then no genomes would be returned even if there are more genomes to be attempted. Is this the desired behavior, or should it be implemented so Exception handling occurs on a report-by-report basis?
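One possible shape for report-by-report handling, offered only as a hedged sketch: the function names, the `fetch_page` callable and the response keys `reports`, `next_page_token` and `accession` are assumptions for illustration and are not taken from Fungi.py.

```python
import logging

logger = logging.getLogger(__name__)


def parse_report(report):
    """Extract the fields of interest; raises KeyError on malformed reports."""
    return report["accession"]


def collect_reports(fetch_page):
    """Collect parsed reports across pages, skipping individual bad reports.

    `fetch_page(page_token)` is a hypothetical callable returning a dict with
    a 'reports' list and an optional 'next_page_token'.
    """
    genome_list = []
    page_token = None
    while True:
        try:
            page = fetch_page(page_token)
        except Exception as exc:
            # Without the next token pagination cannot continue,
            # but everything collected so far is kept.
            logger.warning("Page fetch failed: %s", exc)
            break
        for report in page.get("reports", []):
            try:
                genome_list.append(parse_report(report))
            except (KeyError, TypeError) as exc:
                # A single malformed report no longer discards the rest.
                logger.warning("Skipping malformed report: %s", exc)
        page_token = page.get("next_page_token")
        if not page_token:
            break
    return genome_list


# Fake two-page response with one malformed report on the first page.
fake_pages = {
    None: {"reports": [{"accession": "A"}, {}, {"accession": "B"}],
           "next_page_token": "t1"},
    "t1": {"reports": [{"accession": "C"}]},
}

collected = collect_reports(lambda token: fake_pages[token])
```

With this structure a malformed report costs one entry and a failed page fetch costs the remaining pages, rather than either one emptying the whole genome list.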

xonq · 2025-02-11T19:57:42Z

gambitdb/Fungi.py

+            # Process valid genomes for species
+            if len(valid_genomes) >= self.minimum_genomes_per_species:
+                self.logger.debug(
+                    f"Taxon {taxon_id} has {len(valid_genomes)} valid genomes after filtering"


other debugs for minimum genome threshold explicitly announce the threshold; consider including here as well, e.g.

(minimum required: {self.minimum_genomes_per_species})

xonq · 2025-02-11T19:57:52Z

gambitdb/Fungi.py

+                self.species_genome_counts[taxon_id] = len(valid_genomes)
+            else:
+                self.logger.debug(
+                    f"Taxon {taxon_id} only has {len(valid_genomes)} valid genomes after filtering"


see above comment

cimendes and others added 30 commits October 22, 2024 13:27

first brain dump!

6a75ddb

Added FungiParser capabilities

0077075

make logger log (at least for me)

53aa0fb

Added formatting for species & assembly outputs

a0dcf13

Added flags for ncbi datasets api

bb55e3c

get rank and parent_taxon from dataset_report endpoint; limit species name to just binomial, report species ID instead of lower ranks

aa89e4d

add check that sp is not in organism name

f6cbd9d

trying new things - non functional

f6be2e1

Added assembly downloads via ncbi

92e9451

Added bulk download by species

ccd286d

Updates for more robust filtering and error capture

2734331

Added filtering based on database source

a4a29a2

corrected accession to assembly_accession

c0d5250

Added species names to assembly metadata

4b52fd2

Added Class to extend capabilities arund datasets api

ca6a8fe

Refactored for max retries and ensuring parent_taxid is present

03ff009

Refactored FungiParser to handle duplicate genome filtering and improved output logging; updated max_contigs parameter in CLI; added directory creation for database output.

347919d

Add script to split input files and process fungal data; update .gitignore for new directories

d8575f5

Update CompressClusters test for sample removal count; enhance README with gambitdb-fungi script details and usage instructions

7cf2666

Remove process_fungi.sh script; consolidate functionality into existing scripts for improved maintainability

4532754

Add tqdm dependency to env.yaml and setup.py for progress tracking

35b7638

Add functionality to concatenate filtered out genome summary files and clean up intermediate summary files

aaadd5e

Add fungi analysis script and update .gitignore for output files

d046dc9

Redid FungiAnalyze completely to fit correct filtering for overlaps

b42c933

Add linkage method parameter to Curate, GambitDb, and SplitSpecies classes

b57da4f

Enhance FungiAnalyzer and Diameters classes to output species taxon tables and update filtering logic for overlaps

62de373

Update test cases to use 'average' parameter in Curate, GambitDb, and SplitSpecies classes and adjust expected output values

48b384e

Implement genome-level overlap detection in FungiAnalyzer and save results to CSV

9398740

Update .gitignore to refine fungaldb entries and add new Aspergillus directories

a09b0b9

Enhance FungiAnalyzer with parent species extraction and filtering logic for overlaps; update .gitignore to include new Aspergillus directory

098600e

Michal-Babins and others added 11 commits December 4, 2024 19:45

Remove debug file comment for distance calculations in FungiAnalyzer

766d130

Update .gitignore to include new directories and remove obsolete entries; modify Diameters class logging for clarity; add linkage method argument to gambitdb-curate script

6255f74

Bump version to 0.0.2; refactor accessions filtering logic in PairwiseTable for improved readability

580443f

Update Diameters_test to capture species data in calculate_thresholds method

01cb06b

Enhance calculate_thresholds method in Diameters class to return detailed species data, including name, position, assemblies, diameter, and genome count.

83190c3

Fix typo in comment for database creation in gambitdb-create script

90f0529

Add gambitdb-update-taxa-report script to update report flags in GAMBIT database; update .gitignore to include fixed_final directory

c5e9b71

add script to fix genera where diameter is set to 0 due to subspeciation during db contruction

5f0b04d

Add script to merge duplicate taxa entries in GAMBIT database; update taxa ranks for subspecies

6948c65

Update .gitignore to include final_fungal_db_merged_dupes directory

fa3ca2c

Update README and scripts to correct usage instructions for taxa report and genus fix scripts

8b84536

Michal-Babins requested a review from andrewjpage January 14, 2025 13:36

Remove commented-out code for calculating genome overlaps in FungiAnalyze.py

7200f46
