This repository contains bash, Python and R scripts used for genome assembly gap filling and annotation. The gap filling leveraged PacBio HiFi long reads to improve genome completeness. The protein-coding gene annotation was built using multi-stage RNA-seq libraries.
In 1_1_Assembly_gapfilling.ipynb are the files and scripts used:
- To run the Samba tool to scaffold the genome assembly using PacBio HiFi long reads (https://github.com/alekseyzimin/masurca/releases)
- To run the TGSGapCloser tool to fill gaps in scaffolds using PacBio HiFi long reads (https://github.com/BGI-Qingdao/TGS-GapCloser)
- Run statistics like gap prevalence before and after the gap-filling step
- Run BUSCO to assess completeness of assembly fasta file
In 1_2_GenomeAssembly_analysis.ipynb are the files and scripts to:
- Identify gaps in genome assembly files using scaffoldgap2bed.py
- To run assembly stats, like N50 and N_content, using assembly-stats (https://github.com/sanger-pathogens/assembly-stats)
- To run BUSCO (https://busco.ezlab.org/)
- To assess gap prevalence per scaffold before and after gap-filling
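
The gap and contiguity checks listed above can be illustrated with a minimal Python sketch. It is not the code of scaffoldgap2bed.py or assembly-stats; it assumes gaps are runs of `N` in the scaffold sequence and reports them as 0-based, half-open BED intervals. The function names and the toy scaffolds are hypothetical.

```python
import re

def find_gaps(seq):
    """Return 0-based, half-open (start, end) intervals for every run of N/n."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", seq)]

def gaps_to_bed(scaffolds):
    """Yield BED-style lines (scaffold, start, end) for a dict of name -> sequence."""
    for name, seq in scaffolds.items():
        for start, end in find_gaps(seq):
            yield f"{name}\t{start}\t{end}"

def n50(lengths):
    """Largest length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

# Toy scaffolds (hypothetical data, for illustration only).
toy = {"scaf1": "ACGTNNNNACGT", "scaf2": "ACNNAC"}
bed_lines = list(gaps_to_bed(toy))
toy_n50 = n50([len(s) for s in toy.values()])
```

Running the same functions on the assembly before and after gap filling gives a per-scaffold view of how many gaps were closed.
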
In 2_1_Genome_annotation_TE.ipynb are the files and scripts to:
- Run identification and annotation of repetitive elements in the genome assembly. This includes running tools like RepeatModeler and RepeatMasker (https://darencard.net/blog/2022-10-13-install-repeat-modeler-masker/)
- Analyse statistics like TE representation per family
In 2_2_TEannotation_analysis.ipynb are the files and scripts to:
- Analyse and plot representation of TE families
- Analyse and graph TE families prevalence within the genome
- Analyse and graph TE distribution in the genome
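
As a rough illustration of the per-family TE summary, the sketch below sums annotated base pairs per repeat class/family from RepeatMasker `.out` rows. It assumes the standard whitespace-separated `.out` layout (query begin and end in columns 6 and 7, class/family in column 11) and does not correct for overlapping annotations, so totals may be slightly inflated. The function name and the sample rows are hypothetical and not taken from the notebooks.

```python
from collections import Counter

def te_bp_per_family(out_lines):
    """Sum annotated bp per repeat class/family from RepeatMasker .out lines."""
    counts = Counter()
    for line in out_lines:
        fields = line.split()
        # Data rows start with an integer Smith-Waterman score; header and blank lines do not.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        counts[fields[10]] += end - begin + 1
    return counts

# Hypothetical rows in .out layout, for illustration only.
sample = [
    "   SW   perc perc perc  query  position in query  matching  repeat  position in repeat",
    "score   div. del. ins.  sequence  begin end (left)  repeat  class/family  begin end (left) ID",
    "",
    " 1320 15.6 6.2 0.0 scaf1 100 199 (5000) + L1_example LINE/L1 1 100 (0) 1",
    "  800 10.0 0.0 0.0 scaf1 300 349 (4850) C hAT_example DNA/hAT (0) 50 1 2",
    "  900 12.0 1.0 0.0 scaf2 1 100 (900) + L1_example LINE/L1 1 100 (0) 3",
]
summary = te_bp_per_family(sample)
```

Dividing each total by the assembly length gives the fraction of the genome attributed to each family.
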
In 2_3_Genome_annotation_CodingGenes.ipynb are the steps to run the transcriptome assembly based on 79 PolyA RNA-seq libraries. Briefly, we integrated 79 RNA-seq samples and current Parhyale reference annotation into a StringTie2 annotation pipeline. Some analyses include:
- To use GMAP to map reference transcriptome to current assembly
- Quality check of RNA-seq libraries using FASTQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- Reads trimming using Trimmomatic software (https://bioinformatics-core-shared-training.github.io/Bulk_RNAseq_Course_Nov22/Bulk_RNAseq_Course_Base/Markdowns/S3_Trimming_Reads.html)
- Map RNA-seq libraries to the reference genome using HISAT2 (http://daehwankimlab.github.io/hisat2/)
- To use StringTie2 to generate the transcriptome. First, the RNA-seq reads were assembled into transcripts for each sample. Then, all sample-specific annotations were merged into one final annotation
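
The two-stage StringTie2 step (per-sample assembly, then merge) can be sketched as a small Python helper that builds the command lines. This is an assumption about how the calls might look, not the notebook's actual commands: it assumes coordinate-sorted BAM files from HISAT2 and that the reference annotation is passed with `-G`. The file names and the helper name are hypothetical.

```python
def stringtie_commands(bams, ref_gtf, merged_gtf="merged.gtf", threads=4):
    """Build per-sample StringTie assembly commands and one final merge command."""
    per_sample = []
    sample_gtfs = []
    for bam in bams:
        # Hypothetical naming scheme: sample.bam -> sample.gtf
        out_gtf = bam.rsplit(".", 1)[0] + ".gtf"
        sample_gtfs.append(out_gtf)
        per_sample.append(
            ["stringtie", bam, "-G", ref_gtf, "-o", out_gtf, "-p", str(threads)]
        )
    # Merge all sample-specific annotations into one final annotation.
    merge = ["stringtie", "--merge", "-G", ref_gtf, "-o", merged_gtf] + sample_gtfs
    return per_sample, merge

per_sample_cmds, merge_cmd = stringtie_commands(
    ["stageA.bam", "stageB.bam"], "reference.gtf"
)
```

Each command list can then be passed to `subprocess.run(cmd, check=True)`, running all per-sample commands before the merge.
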
In 2_4_CodingGenes_Annotation_analysis.ipynb are the steps to run the identification of protein-coding genes. The analyses run include:
- Reduce redundancy of annotated transcripts using CD-HIT (https://www.bioinformatics.org/cd-hit/)
- ORF identification with TransDecoder
- Pfam and BlastP searches to enable homology-based coding region identification
- Run Pfam search on longest ORFs
- Run BlastP search on longest ORFs
- Final coding region predictions
- Select representative transcript per gene
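
One common way to select a representative transcript per gene is to keep the transcript with the longest predicted ORF. The sketch below illustrates that rule only; the criterion actually used in the notebook is not stated here, so treat the choice as an assumption. The record layout (gene ID, transcript ID, ORF length) and all identifiers are hypothetical.

```python
def pick_representatives(records):
    """Map each gene ID to the transcript with the longest ORF.

    records: iterable of (gene_id, transcript_id, orf_length) tuples.
    """
    best = {}
    for gene_id, transcript_id, length in records:
        current = best.get(gene_id)
        # Keep the longer ORF; on ties the first transcript seen is kept.
        if current is None or length > current[1]:
            best[gene_id] = (transcript_id, length)
    return {gene: tx for gene, (tx, _) in best.items()}

# Hypothetical records, for illustration only.
records = [
    ("gene1", "gene1.t1", 300),
    ("gene1", "gene1.t2", 450),
    ("gene2", "gene2.t1", 120),
    ("gene1", "gene1.t3", 450),
]
representatives = pick_representatives(records)
```
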
In 2_5_BLAST_and_Orthofinder.ipynb are the steps to run the annotation of protein-coding genes using BLAST and ORTHOFINDER. The analyses run include:
- Download proteomes from UNIPROT
- Blast Parhyale proteins to multiple species peptides (https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html)
- Run OrthoFinder analysis (https://github.com/davidemms/OrthoFinder)
In 2_8_Process_MappingStats.ipynb are the steps to summarise RNA-seq mapping metrics
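
As an illustration of summarising mapping metrics, the sketch below extracts the total read count and overall alignment rate from a HISAT2 alignment summary (the text HISAT2 writes to stderr). Which metrics the notebook actually collects is not stated here, so the choice of fields is an assumption. The function name and the example summary are hypothetical.

```python
import re

def parse_hisat2_summary(text):
    """Extract total reads and overall alignment rate (%) from a HISAT2 summary."""
    total = re.search(r"^(\d+) reads; of these:", text, re.MULTILINE)
    rate = re.search(r"([\d.]+)% overall alignment rate", text)
    return {
        "total_reads": int(total.group(1)) if total else None,
        "overall_alignment_rate": float(rate.group(1)) if rate else None,
    }

# Hypothetical summary text, for illustration only.
example = """10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
96.70% overall alignment rate
"""
metrics = parse_hisat2_summary(example)
```

Applying the parser to one summary file per library yields a table of metrics across all RNA-seq samples.
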
Please email [email protected] if there is any problem, thanks! (Manuel)
This is a tool to view the html files.