Skip to content

Latest commit

 

History

History
148 lines (133 loc) · 6.43 KB

README.md

File metadata and controls

148 lines (133 loc) · 6.43 KB

Willow/poplar SDR gene identification and annotation based on Arabidopsis orthologs

This pipeline contains 4 steps. Step 1 and 2 only need to be done once for each genome; and you may get these results from other rescources. I will provide you all files needed for step 3 (identify SDR genes) and 4 (identify genes with high Fst). So you can jumpt directly to step 3.

1. Ortholog identification

Identify target genome's (S. purpurea pacbio annotaion v5.1/P. trichocharpa pacbio annotation v1.1) best bastp hit in Arabidopsis database (Araport11 annotation). This is a "quick-and-dirty" way to identify "ortholog". But in our case I think it works for the purpose.

$ makeblastdb -in Athaliana_447_Araport11.protein_primaryTranscriptOnly.fa -dbtype prot -out Araport11.db

$ blastp -db Araport11.db -query Spurpurea94006v5.1.primaryTrs.pep.fa -out Spur_v5.1_to_Ara11.txt -evalue 1e-5 -outfmt 6 -max_target_seqs 1

$ blastp -db Araport11.db -query Ptrichocarpavar145v1.1b.primaryTrs.pep.fa -out Ptri_v1.1_to_Ara11.txt -evalue 1e-5 -outfmt 6 -max_target_seqs 1

2. Find Arabidopsis Annotation

I find these information from "Araport11_genes.201606.pep.fasta" from Tair website. Two things needs to be done: 1) extract the primary transcripts; 2) extract their annotation. There might be better ways to find these information. You may find a file contain these information already. Then you can skip this step.

perl extract_At_anno.pl

Code of the extract_At_anno.pl

#!/usr/bin/perl

%hash = ();
@seqs = ();

open FH, "<Athaliana_447_Araport11.protein_primaryTranscriptOnly.fa";
while (<FH>)
{
    if (/^>(\S+)/)
    {
        $name = $1;
        $hash{$name} = 0;
    }
}
close FH;

open FH, "<Araport11_genes.201606.pep.fasta";
open OUT, ">Araport11_gene_anno.txt";
while (<FH>)
{
    $seq = $_;
    chomp $seq;
    if (/^>(\S+)/)
    {
        $gene = $1;
        if (exists ($hash{$gene}))
        {
            @seqs = split(/\|/, $seq);
            $seqs[0] =~ s/>//;
            print OUT "$seqs[0]\t$seqs[1]\n";
        }
    }
}
close FH;
close OUT;

3. Identify SDR genes (S. nigra example)

Four input files needed:

  1. SDR file in the format of a single line of "CHROMOSOME START_POSITION END_POSITION".
    eg. Snig_SDR.txt <-- generated by SDR mapping pipeline
Chr07	4880000	6860000
  1. GFF file of the target genome.
    eg. Spurpurea94006v5.1.gene.gff3 <-- download from genome annotation
##gff-version 3
##annot-version v5.1
##species Salix purpurea 94006
Chr01	JGI	gene	8313	12196	.	+	.	ID=Sa940v51000005m.g;Name=Sa940v51000005m.g
Chr01	JGI	mRNA	8328	10828	.	+	.	ID=PAC4GC:37960894;Name=Sa940v51000005m;longest=1;Parent=Sa940v51000005m.g
Chr01	JGI	five_prime_UTR	8328	8481	.	+	.	ID=PAC4GC:37960894.five_prime_UTR.1;Parent=PAC4GC:37960894
Chr01	JGI	CDS	8482	8511	.	+	0	ID=PAC4GC:37960894.CDS.1;Parent=PAC4GC:37960894
Chr01	JGI	CDS	8705	8771	.	+	0	ID=PAC4GC:37960894.CDS.2;Parent=PAC4GC:37960894
Chr01	JGI	CDS	9974	10069	.	+	2	ID=PAC4GC:37960894.CDS.3;Parent=PAC4GC:37960894
Chr01	JGI	CDS	10164	10279	.	+	2	ID=PAC4GC:37960894.CDS.4;Parent=PAC4GC:37960894
  1. Ortholog file in the format of for each line the first two columns are "Target_Genome_Gene_Name Arabidopsis_Ortholog_Gene_Name".
    eg. Spur_b5.1_to_Ara11.txt <-- generated by step 1
Sa940v51059915m	AT2G17930.1	84.934	677	102	0	1	677	1341	2017	0.0	1152
Sa940v51061191m	AT3G24880.1	68.182	110	35	0	3	112	535	644	5.62e-49	169
Sa940v51061190m	AT3G24880.1	58.515	229	94	1	1	229	712	939	2.74e-81	265
Sa940v51061192m	AT3G60860.1	77.802	464	103	0	15	478	398	861	0.0	751
Sa940v51061189m	AT3G24870.1	64.474	76	27	0	2	77	990	1065	5.95e-26	99.4
Sa940v51061138m	AT5G19820.1	47.619	84	44	0	22	105	988	1071	5.18e-22	89.7
  1. Arabidopsis gene annotation in the format of each line start with the gene name followed by description.
    eg. Araport11_gene_anno.txt <-- generated by step 2
AT1G01010.1 	 NAC domain containing protein 1 
AT1G01020.1 	 ARV1 family protein 
AT1G01030.1 	 AP2/B3-like transcriptional factor family protein 
AT1G01040.2 	 dicer-like 1 
AT1G01050.1 	 pyrophosphorylase 1 

To run the job:

perl extract_gene_SDG_anno_willow.pl Snig_SDR.txt

Output file "Willow_SDR_gene_anno_output.txt" contains the following:

Locus	Gene	At_ortholog	Annotation
Chr07:4885351_4889304	Sa940v51022368m	AT4G36050.2 	 endonuclease/exonuclease/phosphatase family protein 
Chr07:4898456_4905398	Sa940v51022369m	AT5G66210.1 	 calcium-dependent protein kinase 28 
Chr07:4933756_4962623	Sa940v51022373m	AT2G17930.1 	 Phosphatidylinositol 3- and 4-kinase family protein with FAT domain-containing protein 
Chr07:4970230_4970625	Sa940v51022374m	
Chr07:4978885_4985505	Sa940v51022375m	AT4G10890.2 	 DDE family endonuclease 
Chr07:4999765_5001204	Sa940v51022376m	AT5G65590.1 	 Dof-type zinc finger DNA-binding family protein 

4. High Fst located gene (S. nigra example)

Four input files needed (All the same expect as step3 except instead of SDR file, it need FST file):

  1. FST file in the format of a single line of "CHROMOSOME POSITION".
    eg. Snig_Fst_test.txt <-- generated by SDR mapping pipeline
Chr07	4885391
Chr07	4898856
Chr07	4953756
Chr07	4970230
Chr07	4985505
Chr01	8313

Below the same as step 3: 2. GFF file of the target genome. <-- download from genome annotation
3. Ortholog file in the format of for each line the first two columns are "Target_Genome_Gene_Name Arabidopsis_Ortholog_Gene_Name". <-- generated by step 1
4. Arabidopsis gene annotation in the format of each line start with the gene name followed by description. <-- generated by step 2

To run the code:

perl extract_gene_Fst_anno_willow.pl Snig_Fst_test.txt

Output file "Willow_Fst_gene_anno_output.txt" contains the following:

Locus	Gene	At_ortholog	Annotation	Fst
Chr01:8313_12196	Sa940v51000005m	AT2G30942.1 	 transmembrane protein%2C putative (DUF3317) 	8313
Chr07:4885351_4889304	Sa940v51022368m	AT4G36050.2 	 endonuclease/exonuclease/phosphatase family protein 	4885391
Chr07:4898456_4905398	Sa940v51022369m	AT5G66210.1 	 calcium-dependent protein kinase 28 	4898856
Chr07:4933756_4962623	Sa940v51022373m	AT2G17930.1 	 Phosphatidylinositol 3- and 4-kinase family protein with FAT domain-containing protein 	4953756
Chr07:4970230_4970625	Sa940v51022374m	NA	NA	4970230
Chr07:4978885_4985505	Sa940v51022375m	AT4G10890.2 	 DDE family endonuclease 	4985505