Merging counts tables with taxonomy classifications

This walkthrough goes through merging your counts table with your taxonomy classifications.

What you'll end up with at the end of this is either a .tsv.gz with your counts tables merged to your taxonomy classifications.

It is assumed you've completed either the end-to-end metagenomics or the recovering viruses from metatranscriptomics and either the read mapping/counts tables or taxonomic profiling walkthroughs.

Steps:

Merge taxonomy from different organism types
Merge counts tables
Merge counts table with taxonomy classifications

Conda Environment: conda activate VEBA. Use this for intermediate scripts.

1. Merge taxonomy from different organism types

merge_taxonomy_classifications.py -i veba_output/classify/ -o veba_output/classify/

This command outputs the following tables:

veba_output/classify/taxonomy_classifications.tsv - Table with [id_genome][id_organism_type][classification]
veba_output/classify/taxonomy_classifications.clusters.tsv - Table with [id_genome][id_organism_type][classification][more fields with clustering information]

2. Merge counts table

Now let's merge the counts tables. For counts tables from the mapping module, if you haven't done so already you can merge them with this script:

# Get mapping directory
MAPPING_DIRECTORY=veba_output/mapping/global

# Set output directory (this is default)
OUT_DIR=veba_output/counts
SCAFFOLDS_TO_MAGS=veba_output/cluster/output/global/scaffolds_to_mags.tsv
SCAFFOLDS_TO_SLCS=veba_output/cluster/output/global/scaffolds_to_slcs.tsv

merge_contig_mapping.py -m ${MAPPING_DIRECTORY} -c ${MAGS_TO_SLCS}  -i ${SCAFFOLDS_TO_MAGS} -o ${OUT_DIR}

You'll get the following output files:

X_mags.tsv.gz - Counts tables gzipped and tab-delimited (samples, MAGs)
X_slcs.tsv.gz - Counts tables gzipped and tab-delimited (samples, SLCs)

If you have taxonomic profiling abundances instead:

merge_generalized_mapping.py -o veba_output/profiling/taxonomy/merged.taxonomic_abundance.tsv.gz veba_output/profiling/taxonomy/*/output/taxonomic_abundance.tsv.gz

merge_generalized_mapping.py -o veba_output/profiling/taxonomy/merged.taxonomic_abundance.clusters.tsv.gz veba_output/profiling/taxonomy/*/output/taxonomic_abundance.clusters.tsv.gz

You'll get the following output files:

merged.taxonomic_abundance.tsv.gz - Merged genome-level taxonomic abundance matrix
merged.taxonomic_abundance.clusters.tsv.gz - Merged SLC-level taxonomic abundance matrix

3. Merge counts table with taxonomy classifications

Now that the taxonomy is merged from different organism types and counts tables are merged, we can join them.

If you have mapping module output from above or similar you can use this:

merge_counts_with_taxonomy.py -X veba_output/counts/X_mags.tsv.gz -t veba_output/classify/taxonomy_classifications.tsv -o veba_output/counts/X_mags.with_taxonomy.tsv.gz

merge_counts_with_taxonomy.py -X veba_output/counts/X_slcs.tsv.gz -t veba_output/classify/taxonomy_classifications.clusters.tsv -o veba_output/counts/X_slcs.with_taxonomy.tsv.gz

You'll get the following output which will contain transposed counts (rows=features) with organism type and classifications prepended:

veba_output/counts/X_mags.with_taxonomy.tsv.gz
veba_output/counts/X_slcs.with_taxonomy.tsv.gz

If you have profile-taxonomy module output from above or similar you can use this:

merge_counts_with_taxonomy.py -X veba_output/profiling/taxonomy/merged.taxonomic_abundance.tsv.gz -t veba_output/classify/taxonomy_classifications.tsv -o veba_output/profiling/taxonomy/merged.taxonomic_abundance.with_taxonomy.tsv.gz

merge_counts_with_taxonomy.py -X veba_output/profiling/taxonomy/merged.taxonomic_abundance.clusters.tsv.gz -t veba_output/classify/taxonomy_classifications.clusters.tsv -o veba_output/profiling/taxonomy/merged.taxonomic_abundance.clusters.with_taxonomy.tsv.gz

You'll get the following output which will contain transposed counts (rows=features) with organism type and classifications prepended:

merged.taxonomic_abundance.with_taxonomy.tsv.gz
merged.taxonomic_abundance.clusters.with_taxonomy.tsv.gz

If you want to have genome-level counts but also include the genome-clusters:

MAGS_TO_SLCS=veba_output/cluster/output/global/mags_to_slcs.tsv

# Bowtie2 Mapping
merge_counts_with_taxonomy.py -X veba_output/counts/X_mags.tsv.gz -t veba_output/classify/taxonomy_classifications.tsv -o veba_output/counts/X_mags.with_taxonomy.with_slcs.tsv.gz -c ${MAGS_TO_SLCS}

# Sylph profiling
merge_counts_with_taxonomy.py -X veba_output/profiling/taxonomy/merged.taxonomic_abundance.tsv.gz -t veba_output/classify/taxonomy_classifications.tsv -o veba_output/counts/X_mags.with_taxonomy.with_slcs.tsv.gz -c ${MAGS_TO_SLCS}

You'll get the following output which will contain transposed counts (rows=features) with organism type and classifications prepended which now includes the genome cluster column:

veba_output/counts/X_mags.with_taxonomy.with_slcs.tsv.gz

Next steps:

Now it's time to analyze the data (recommend using compositional data analysis (CoDA) approach).

If you have a pickle format, then you probably know what to do already.

If you have a biom format look into the QIIME2 ecosystem or BIRDMAn for CoDA-aware bayesian differential abundance.

If you have an anndata format, look into the Scanpy ecosystem. Though Scanpy does not nativly support CoDA-aware methodologies, they are easy to adapt.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

merging_counts_with_taxonomy.md

merging_counts_with_taxonomy.md

Merging counts tables with taxonomy classifications

Steps:

1. Merge taxonomy from different organism types

2. Merge counts table

3. Merge counts table with taxonomy classifications

Next steps:

Recommended reading for CoDA:

Files

merging_counts_with_taxonomy.md

Latest commit

History

merging_counts_with_taxonomy.md

File metadata and controls

Merging counts tables with taxonomy classifications

Steps:

1. Merge taxonomy from different organism types

2. Merge counts table

3. Merge counts table with taxonomy classifications

Next steps:

Recommended reading for CoDA: