This repository contains multiple Bash, Python and R scripts used for genome assembly gap filling and annotation. The gap filling leveraged PacBio HiFi long reads to improve genome completeness. The protein-coding gene annotation was built using multi-stage RNA-seq libraries.
In 1_1_Assembly_gapfilling.ipynb are the files and scripts used to:
- Run the Samba tool to scaffold the genome assembly using PacBio HiFi long reads (
- Run the TGS-GapCloser tool to fill gaps in scaffolds using PacBio HiFi long reads (
- Compute statistics, such as gap prevalence, before and after the gap-filling step
- Run BUSCO to assess the completeness of the assembly FASTA file
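The gap-prevalence statistic mentioned above can be illustrated with a minimal Python sketch. It assumes that gaps are encoded as runs of `N` in the assembly FASTA; the function names (`read_fasta`, `find_gaps`, `gap_summary`) and the toy sequences are hypothetical and are not taken from the notebook.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the scaffold name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_gaps(seq: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return 0-based, end-exclusive coordinates of N runs (assumed gaps)."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", seq)
        if m.end() - m.start() >= min_len
    ]


def gap_summary(lines: Iterable[str], min_len: int = 1) -> Dict[str, Dict[str, int]]:
    """Summarise scaffold length, gap count and gap bases per scaffold."""
    summary = {}
    for name, seq in read_fasta(lines):
        gaps = find_gaps(seq, min_len)
        summary[name] = {
            "length": len(seq),
            "n_gaps": len(gaps),
            "gap_bases": sum(end - start for start, end in gaps),
        }
    return summary


# Toy input (made up for illustration only).
toy_assembly = [">scaffold_1", "ACGTNNNNACGT", "NNACGT", ">scaffold_2", "ACGTACGT"]

if __name__ == "__main__":
    for scaffold, stats in gap_summary(toy_assembly).items():
        print(scaffold, stats)
```

Running the same summary on the assembly before and after gap filling, and comparing the `n_gaps` and `gap_bases` values, would give a before/after view of gap prevalence.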
In 1_2_GenomeAssembly_analysis.ipynb are the files and scripts to:
- Identify gaps in genome assembly files using .
- Run assembly statistics, such as N50 and N content, using assembly-stats (
- Run BUSCO (
- Assess gap prevalence per scaffold before and after gap filling
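For readers unfamiliar with the metrics listed above, the following Python sketch shows how N50 and N content can be computed from sequence lengths and sequences. It is an illustration of the definitions only, not the assembly-stats implementation; the function names `n50` and `n_fraction` are hypothetical.

```python
from typing import Iterable, Sequence


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return ordered[-1]


def n_fraction(sequences: Sequence[str]) -> float:
    """Return the fraction of bases that are N (upper or lower case)."""
    total = sum(len(seq) for seq in sequences)
    n_bases = sum(seq.upper().count("N") for seq in sequences)
    return n_bases / total if total else 0.0


if __name__ == "__main__":
    # Toy scaffold lengths and sequences (made up for illustration only).
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
    print(n_fraction(["ACGN", "NNAA"]))
```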
In 2_1_Genome_annotation_TE.ipynb are the files and scripts to:
- Identify and annotate repetitive elements in the genome assembly. This includes running tools like RepeatModeler and RepeatMasker (
- Analyse statistics like TE representation per family
In 2_2_TEannotation_analysis.ipynb are the files and scripts to:
- Analyse and plot the representation of TE families
- Analyse and graph the prevalence of TE families within the genome
- Analyse and graph TE distribution in the genome
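As a rough illustration of how TE representation per family could be tallied, the sketch below sums annotated base pairs per class/family from RepeatMasker `.out` style lines. It assumes the standard whitespace-separated `.out` layout (query begin and end in columns 6 and 7, class/family in column 11); the function names, the toy records and the genome size are hypothetical. Overlapping annotations are counted twice in this simple version.

```python
from collections import defaultdict
from typing import Dict, Iterable


def te_bp_per_family(out_lines: Iterable[str]) -> Dict[str, int]:
    """Sum annotated base pairs per repeat class/family."""
    totals: Dict[str, int] = defaultdict(int)
    for line in out_lines:
        fields = line.split()
        # Skip header and blank lines: data rows start with an integer score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        totals[fields[10]] += end - begin + 1
    return dict(totals)


def te_fraction(totals: Dict[str, int], genome_size: int) -> Dict[str, float]:
    """Convert base-pair totals into fractions of the genome."""
    return {family: bp / genome_size for family, bp in totals.items()}


# Toy records in RepeatMasker .out layout (values made up for illustration).
toy_out = [
    "   SW   perc perc perc  query     position in query    matching  repeat       position in repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat    class/family begin  end  (left)  ID",
    "",
    " 1320 15.6  6.2  0.0 scaffold_1  101  300  (700) + TE_a  DNA/hAT  1  200  (0)  1",
    "  900 10.0  0.0  0.0 scaffold_1  501  600  (400) C TE_b  LINE/L2  (0)  100  1  2",
    "  450 12.0  0.0  0.0 scaffold_2   11  110  (890) + TE_a  DNA/hAT  1  100  (100)  3",
]

if __name__ == "__main__":
    totals = te_bp_per_family(toy_out)
    print(totals)
    print(te_fraction(totals, genome_size=2000))
```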
In 2_3_Genome_annotation_CodingGenes.ipynb are the steps to run the transcriptome assembly based on 79 PolyA RNA-seq libraries. Briefly, we integrated 79 RNA-seq samples and the current Parhyale reference annotation into a StringTie2 annotation pipeline. The analyses include:
- Use GMAP to map the reference transcriptome to the current assembly
- Quality check of RNA-seq libraries using FastQC (
- Read trimming using Trimmomatic (
- Map RNA-seq libraries to the reference genome using HISAT2 (
- Use StringTie2 to generate the transcriptome. First, the RNA-seq reads were assembled into transcripts for each sample. Then, all sample-specific annotations were merged into one final annotation
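The two-stage StringTie2 procedure described above (per-sample assembly, then a merge) can be sketched as a small command planner. This is an assumption-laden illustration: it only builds command lines and does not run them, the file names are hypothetical, and the exact options used in the notebook may differ. It uses `stringtie <bam> -G <gtf> -o <out> -p <threads>` for assembly and `stringtie --merge` for merging.

```python
from pathlib import Path
from typing import List


def stringtie_assemble_cmd(bam: str, ref_gtf: str, out_gtf: str, threads: int = 4) -> List[str]:
    """Command to assemble transcripts for one sorted BAM, guided by a reference GTF."""
    return ["stringtie", bam, "-G", ref_gtf, "-o", out_gtf, "-p", str(threads)]


def stringtie_merge_cmd(sample_gtfs: List[str], ref_gtf: str, merged_gtf: str) -> List[str]:
    """Command to merge sample-specific GTFs into one final annotation."""
    return ["stringtie", "--merge", "-G", ref_gtf, "-o", merged_gtf, *sample_gtfs]


def plan_stringtie(bams: List[str], ref_gtf: str, out_dir: str) -> List[List[str]]:
    """Return per-sample assembly commands followed by a single merge command."""
    commands, sample_gtfs = [], []
    for bam in bams:
        gtf = str(Path(out_dir) / (Path(bam).stem + ".gtf"))
        commands.append(stringtie_assemble_cmd(bam, ref_gtf, gtf))
        sample_gtfs.append(gtf)
    merged = str(Path(out_dir) / "merged.gtf")
    commands.append(stringtie_merge_cmd(sample_gtfs, ref_gtf, merged))
    return commands


if __name__ == "__main__":
    # Hypothetical inputs; pass each command to subprocess.run(cmd, check=True) to execute.
    for cmd in plan_stringtie(["sample_a.bam", "sample_b.bam"], "reference.gtf", "stringtie_out"):
        print(" ".join(cmd))
```

Keeping the command construction separate from execution makes it easy to inspect the plan for all samples before launching any jobs.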
In 2_4_CodingGenes_Annotation_analysis.ipynb are the steps to run the identification of protein-coding genes. The analyses run include:
- Reduce redundancy of annotated transcripts using CD-HIT (
- ORF identification with TransDecoder
- Pfam and BlastP searches to enable homology-based coding region identification
- Run Pfam search on longest ORFs
- Run BlastP search on longest ORFs
- Final coding region predictions
- Select representative transcript per gene
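The last step, selecting one representative transcript per gene, can be sketched as follows. The selection criterion here (longest ORF, ties broken by transcript ID) is an assumption made for illustration; the notebook may use a different rule. The function name `pick_representatives` and the toy records are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def pick_representatives(records: Iterable[Tuple[str, str, int]]) -> Dict[str, str]:
    """Pick one transcript per gene from (gene_id, transcript_id, orf_length) records.

    Assumed rule: keep the transcript with the longest ORF; on ties, keep the
    transcript whose ID sorts first, so the result is deterministic.
    """
    best: Dict[str, Tuple[int, str]] = {}
    for gene, transcript, orf_length in records:
        candidate = (-orf_length, transcript)
        if gene not in best or candidate < best[gene]:
            best[gene] = candidate
    return {gene: transcript for gene, (_, transcript) in best.items()}


if __name__ == "__main__":
    # Toy records (made up for illustration only).
    toy_records = [
        ("gene1", "gene1.t1", 300),
        ("gene1", "gene1.t2", 450),
        ("gene2", "gene2.t1", 120),
        ("gene2", "gene2.t0", 120),
    ]
    print(pick_representatives(toy_records))
```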
In 2_5_BLAST_and_Orthofinder.ipynb are the steps to run the annotation of protein-coding genes using BLAST and OrthoFinder. The analyses run include:
- Download proteomes from UniProt
- BLAST Parhyale proteins against peptides from multiple species (
- Run OrthoFinder analysis (
In 2_8_Process_MappingStats.ipynb are the steps to summarise RNA-seq mapping metrics
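As one example of what such a summary could involve, the sketch below extracts the total read count and overall alignment rate from a HISAT2 alignment summary. It assumes the mapping metrics come from HISAT2's default summary text; the function name and the toy summary are hypothetical, and the notebook may collect other metrics.

```python
import re
from typing import Dict, Union


def parse_hisat2_summary(text: str) -> Dict[str, Union[int, float]]:
    """Extract total reads and overall alignment rate from a HISAT2 summary."""
    total = re.search(r"^(\d+) reads; of these:", text, flags=re.MULTILINE)
    rate = re.search(r"([\d.]+)% overall alignment rate", text)
    if total is None or rate is None:
        raise ValueError("text does not look like a HISAT2 alignment summary")
    return {
        "total_reads": int(total.group(1)),
        "overall_alignment_rate": float(rate.group(1)),
    }


# Toy summary (numbers made up for illustration only).
toy_summary = """1000 reads; of these:
  1000 (100.00%) were paired; of these:
    80 (8.00%) aligned concordantly 0 times
    900 (90.00%) aligned concordantly exactly 1 time
    20 (2.00%) aligned concordantly >1 times
95.50% overall alignment rate
"""

if __name__ == "__main__":
    print(parse_hisat2_summary(toy_summary))
```

Applying the parser to one log per library and collecting the results in a table would give a per-sample overview of mapping quality.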
Please email [email protected] if there is any problem, thanks! (Manuel)
This is a tool to view the html files.