Codes_GenomeAnnotation

This repository contains Bash, Python and R scripts used for genome assembly gap filling and annotation. Gap filling leveraged PacBio HiFi long reads to improve genome completeness. The protein-coding gene annotation was built using multi-stage RNA-seq libraries.

In 1_1_Assembly_gapfilling.ipynb are the files and scripts used to:

Run the SAMBA tool to scaffold the genome assembly using PacBio HiFi long reads (https://github.com/alekseyzimin/masurca/releases)
Run the TGS-GapCloser tool to fill gaps in scaffolds using PacBio HiFi long reads (https://github.com/BGI-Qingdao/TGS-GapCloser)
Compute statistics such as gap prevalence before and after the gap-filling step
Run BUSCO to assess the completeness of the assembly FASTA file

In 1_2_GenomeAssembly_analysis.ipynb are the files and scripts to:

Identify gaps in genome assembly files using scaffoldgap2bed.py
Run assembly statistics, such as N50 and N content, using assembly-stats (https://github.com/sanger-pathogens/assembly-stats)
Run BUSCO (https://busco.ezlab.org/)
Assess gap prevalence per scaffold before and after gap filling
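
As a rough illustration of the gap and N50 statistics listed above, here is a minimal Python sketch. It assumes that a gap is any run of N (or n) characters in a scaffold and that gaps are reported as 0-based, half-open BED intervals; the actual scaffoldgap2bed.py and assembly-stats may differ in both respects. The function names `read_fasta`, `gaps_to_bed` and `n50` are hypothetical and do not come from this repository.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the scaffold name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gaps_to_bed(name: str, seq: str) -> List[Tuple[str, int, int]]:
    """Return BED-style (name, start, end) intervals for each run of N/n.

    Assumption: a gap is any run of one or more N characters.
    """
    return [(name, m.start(), m.end()) for m in re.finditer(r"[Nn]+", seq)]


def n50(lengths: Iterable[int]) -> int:
    """Return the length L such that sequences of length >= L cover at
    least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return 0
```

Counting the intervals returned by `gaps_to_bed` for each scaffold, before and after gap filling, gives a per-scaffold gap prevalence that can be compared between the two assemblies.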

In 2_1_Genome_annotation_TE.ipynb are the files and scripts to:

Identify and annotate repetitive elements in the genome assembly. This includes running tools like RepeatModeler and RepeatMasker (https://darencard.net/blog/2022-10-13-install-repeat-modeler-masker/)
Analyse statistics like TE representation per family
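
A per-family summary like the one mentioned above could be derived from RepeatMasker's tabular `.out` file. The sketch below is only an illustration and is not taken from the notebooks: it assumes the standard whitespace-separated `.out` layout (query begin and end in columns 6 and 7, repeat class/family in column 11) and simply sums annotated base pairs per class/family, so overlapping annotations are counted more than once. The function name `te_bp_per_family` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def te_bp_per_family(lines: Iterable[str]) -> Dict[str, int]:
    """Sum annotated base pairs per repeat class/family.

    Assumes RepeatMasker .out columns: score, %div, %del, %ins, query,
    query begin, query end, (left), strand, repeat name, class/family, ...
    """
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        fields = line.split()
        # Skip header and blank lines: data rows start with an integer score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        # Coordinates are 1-based and inclusive.
        totals[fields[10]] += end - begin + 1
    return dict(totals)
```

Dividing each total by the assembly length would give the fraction of the genome covered by each family, subject to the overlap caveat above.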

In 2_2_TEannotation_analysis.ipynb are the files and scripts to:

Analyse and plot the representation of TE families
Analyse and graph TE family prevalence within the genome
Analyse and graph TE distribution in the genome

In 2_3_Genome_annotation_CodingGenes.ipynb are the steps to run the transcriptome assembly based on 79 PolyA RNA-seq libraries. Briefly, we integrated 79 RNA-seq samples and the current Parhyale reference annotation into a StringTie2 annotation pipeline. The analyses include:

Use GMAP to map the reference transcriptome to the current assembly
Quality check of RNA-seq libraries using FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
Read trimming using Trimmomatic (https://bioinformatics-core-shared-training.github.io/Bulk_RNAseq_Course_Nov22/Bulk_RNAseq_Course_Base/Markdowns/S3_Trimming_Reads.html)
Map RNA-seq libraries to the reference genome using HISAT2 (http://daehwankimlab.github.io/hisat2/)
Use StringTie2 to generate the transcriptome. First, the RNA-seq reads were assembled into transcripts for each sample. Then, all sample-specific annotations were merged into one final annotation
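
The two-stage StringTie2 procedure described above (per-sample assembly, then a merge) can be sketched as a pair of command builders. This is an illustrative outline, not the notebook's code: the file names are placeholders, the helper names `assemble_cmd` and `merge_cmd` are hypothetical, and the exact options used in the notebook may differ. Only the widely documented `stringtie` options `-G`, `-o`, `-p` and `--merge` are used.

```python
from typing import List, Optional, Sequence


def assemble_cmd(bam: str, out_gtf: str,
                 ref_gtf: Optional[str] = None,
                 threads: int = 1) -> List[str]:
    """Build the per-sample StringTie assembly command."""
    cmd = ["stringtie", bam, "-o", out_gtf, "-p", str(threads)]
    if ref_gtf:
        # Guide the assembly with the reference annotation.
        cmd += ["-G", ref_gtf]
    return cmd


def merge_cmd(sample_gtfs: Sequence[str], merged_gtf: str,
              ref_gtf: Optional[str] = None) -> List[str]:
    """Build the command that merges per-sample GTFs into one annotation."""
    cmd = ["stringtie", "--merge", "-o", merged_gtf]
    if ref_gtf:
        cmd += ["-G", ref_gtf]
    return cmd + list(sample_gtfs)
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the input BAM files (sorted HISAT2 alignments) exist.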

In 2_4_CodingGenes_Annotation_analysis.ipynb are the steps to run the identification of protein-coding genes. The analyses run include:

Reduce redundancy of annotated transcripts using CD-HIT (https://www.bioinformatics.org/cd-hit/)
ORF identification with TransDecoder
Pfam and BlastP searches to enable homology-based coding region identification
Run Pfam search on longest ORFs
Run BlastP search on longest ORFs
Final coding region predictions
Select representative transcript per gene
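
The README does not state how the representative transcript is chosen. As one common convention, the sketch below keeps the transcript with the longest predicted ORF for each gene and breaks ties by transcript ID. Both the criterion and the function name `pick_representatives` are assumptions made for illustration, and the input tuples are a hypothetical intermediate format rather than a TransDecoder output.

```python
from typing import Dict, Iterable, Tuple


def pick_representatives(
    records: Iterable[Tuple[str, str, int]]
) -> Dict[str, str]:
    """Map each gene ID to one representative transcript ID.

    records: (gene_id, transcript_id, orf_length) tuples.
    Assumed rule: longest ORF wins; ties go to the smaller transcript ID.
    """
    best: Dict[str, Tuple[int, str]] = {}
    for gene_id, tx_id, orf_len in records:
        # Sorting key: longer ORFs first, then transcript ID alphabetically.
        key = (-orf_len, tx_id)
        if gene_id not in best or key < best[gene_id]:
            best[gene_id] = key
    return {gene: key[1] for gene, key in best.items()}
```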

In 2_5_BLAST_and_Orthofinder.ipynb are the steps to run the annotation of protein-coding genes using BLAST and OrthoFinder. The analyses run include:

Download proteomes from UniProt
BLAST Parhyale proteins against peptides from multiple species (https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html)
Run OrthoFinder analysis (https://github.com/davidemms/OrthoFinder)

In 2_8_Process_MappingStats.ipynb are the steps to summarise RNA-seq mapping metrics
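
As an example of the kind of summary involved, the sketch below extracts the total read count and the overall alignment rate from the text summary that HISAT2 prints after mapping. It assumes the default HISAT2 summary format and that the summaries were saved as log files; the notebook may collect different or additional metrics. The function name `parse_hisat2_summary` is hypothetical.

```python
import re
from typing import Dict, Union


def parse_hisat2_summary(text: str) -> Dict[str, Union[int, float]]:
    """Extract total reads and overall alignment rate (%) from a
    HISAT2 alignment summary in its default text format."""
    total = re.search(r"^(\d+) reads; of these:", text, re.MULTILINE)
    rate = re.search(r"^([\d.]+)% overall alignment rate", text, re.MULTILINE)
    if total is None or rate is None:
        raise ValueError("text does not look like a HISAT2 summary")
    return {
        "total_reads": int(total.group(1)),
        "overall_alignment_rate": float(rate.group(1)),
    }
```

Applying this to one log per library yields a table of mapping rates across the RNA-seq samples.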

Please email [email protected] if there is any problem, thanks! (Manuel)

This is a tool to view the html files.
