by Jonggeol Jeff Kim and Hui Liu
Part of the effort by Hui Liu and Jonggeol Jeffrey Kim to confirm candidate PD-risk gene SLC25A46 based on this paper: https://www.ncbi.nlm.nih.gov/pubmed/32259769
Software used:
- PLINK v2.00a2LM/1.9
- BCFTools
- Annovar
- RVTESTS
Steps:
- Subset and convert data
- Annotation
- Allele frequency + count
- Burden analysis
- Score test
- Multiple test correction
- Compound Heterozygous Fisher Test
The data has already been cleaned (e.g. genotype quality, ancestry etc.) and subsetted into a PLINK binary format.
Subsetting log can be read here:
# kernel: bash
cat SLC25A46_WGS_AMP_PD.log
PLINK v1.90b6.16 64-bit (19 Feb 2020)
Options in effect:
--bfile sample_filtered_geno05_euro
--chr 5
--from-bp 110738145
--make-bed
--out SLC25A46_WGS_AMP_PD
--to-bp 110765157
...
243 out of 28195229 variants loaded from .bim file.
2927 people (0 males, 0 females, 2927 ambiguous) loaded from .fam.
...
Total genotyping rate is 0.999947.
243 variants and 2927 people pass filters and QC.
Note: No phenotypes present.
...
Filter out participants with sex information missing, as well as variants 0 allele count.
module load plink/2.3-alpha
plink2 --bfile SLC25A46_WGS_AMP_PD \
--update-sex data/AMP_PD_sex.txt \
--remove-nosex \
--mac 1 \
--make-bed \
--out data/SLC25A46.filtered
[+] Loading plink 2.3-alpha
PLINK v2.00a2.3LM 64-bit Intel (24 Jan 2020) www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to data/SLC25A46.filtered.log.
Options in effect:
--bfile /data/LNG/saraB/AMP_PD/genesfortrainees_2ndpart/SLC25A46_WGS_AMP_PD
--mac 1
--make-bed
--out data/SLC25A46.filtered
--remove-nosex
--update-sex data/AMP_PD_sex.txt
...
2927 samples (0 females, 0 males, 2927 ambiguous; 2927 founders) loaded from
/data/LNG/saraB/AMP_PD/genesfortrainees_2ndpart/SLC25A46_WGS_AMP_PD.fam.
243 variants loaded from
/data/LNG/saraB/AMP_PD/genesfortrainees_2ndpart/SLC25A46_WGS_AMP_PD.bim.
Note: No phenotype data present.
--update-sex: 2697 samples updated.
230 samples removed due to sex filter(s).
2697 samples (1188 females, 1509 males; 2697 founders) remaining after main
filters.
Calculating allele frequencies... done.
59 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
184 variants remaining after main filters.
...
module load plink
# use plink 1.9 for this
plink --bfile data/SLC25A46.filtered \
--recode vcf-fid bgz \
--out data/vcf/SLC25A46.filtered.fid
[-] Unloading plink 2.3-alpha
[+] Loading plink 1.9.0-beta4.4 on cn0927
The following have been reloaded with a version change:
1) plink/2.3-alpha => plink/1.9.0-beta4.4
PLINK v1.90b4.4 64-bit (21 May 2017) www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to data/vcf/SLC25A46.filtered.fid.log.
Options in effect:
--bfile data/SLC25A46.filtered
--out data/vcf/SLC25A46.filtered.fid
--recode vcf-fid bgz
...
184 variants loaded from .bim file.
2697 people (1509 males, 1188 females) loaded from .fam.
...
Total genotyping rate is 0.99994.
184 variants and 2697 people pass filters and QC.
Note: No phenotypes present.
...
cd data/vcf
module load samtools
# tabulate
tabix -f -p vcf SLC25A46.filtered.fid.vcf.gz
cd ../..
[+] Loading samtools 1.10 ...
# bring back plink2
module load plink/2.3-alpha
plink2 --bfile data/SLC25A46.filtered \
--freq \
--out result/freq/SLC25A46.filtered.af
plink2 --bfile data/SLC25A46.filtered \
--geno-counts \
--out result/freq/SLC25A46.filtered.ac
[+] Loading plink 2.3-alpha
PLINK v2.00a2.3LM 64-bit Intel (24 Jan 2020) www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to result/freq/SLC25A46.filtered.af.log.
Options in effect:
--bfile data/SLC25A46.filtered
--freq
--out result/freq/SLC25A46.filtered.af
...
2697 samples (1188 females, 1509 males; 2697 founders) loaded from
data/SLC25A46.filtered.fam.
184 variants loaded from data/SLC25A46.filtered.bim.
Note: No phenotype data present.
Calculating allele frequencies... done.
--freq: Allele frequencies (founders only) written to222232324252526262727282829293030313232333334343535363637383839394040414142424344444545464647474848495050515152525353545455555657575858595960606161626363646465656666676768696970707171727273737475757676777778787979808081828283838484858586868788888989909091919292939494959596969797989899%
result/freq/SLC25A46.filtered.af.afreq .
...
PLINK v2.00a2.3LM 64-bit Intel (24 Jan 2020) www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to result/freq/SLC25A46.filtered.ac.log.
Options in effect:
--bfile data/SLC25A46.filtered
--geno-counts
--out result/freq/SLC25A46.filtered.ac
...
2697 samples (1188 females, 1509 males; 2697 founders) loaded from
data/SLC25A46.filtered.fam.
184 variants loaded from data/SLC25A46.filtered.bim.
Note: No phenotype data present.
Calculating allele frequencies... done.
--geno-counts: Genotype counts written to1717181919202021212222232324252526262727282829293030313232333334343535363637383839394040414142424344444545464647474848495050515152525353545455555657575858595960606161626363646465656666676768696970707171727273737475757676777778787979808081828283838484858586868788888989909091919292939494959596969797989899%
result/freq/SLC25A46.filtered.ac.gcount .
...
table_annovar.pl data/vcf/SLC25A46.filtered.fid.vcf.gz \
$ANNOVAR_DATA/hg38 --thread 20 -buildver hg38 \
-out result/anno/SLC25A46.filtered.fid.annovar \
-arg '-splicing 15',,, \
-remove -protocol refGene,ljb26_all,gnomad211_genome,clinvar_20190305 \
-operation g,f,f,f -nastring . -vcfinput -polish
# remove unnecessary VCF file
rm result/anno/SLC25A46.filtered.fid.annovar.hg38_multianno.vcf
# truncate unnecessay genotype information
cd result/anno
head -1 SLC25A46.filtered.fid.annovar.hg38_multianno.txt > header.txt
colct="$(wc -w header.txt| cut -f1 -d' ')"
cut -f1-$colct SLC25A46.filtered.fid.annovar.hg38_multianno.txt > SLC25A46.filtered.fid.annovar.hg38_multianno.trimmed.txt
rm header.txt
cd ../..
# makes rvtests more manageable
grep SLC25A46 /data/LNG/KIM/burden_resources/refFlat_hg38.txt > data/SLC25A46.hg38.genefile
module load rvtests
rvtest --inVcf data/vcf/SLC25A46.filtered.fid.vcf.gz \
--pheno data/covariateInfo_AMPPD.txt \
--covar data/covariateInfo_AMPPD.txt \
--covar-name SEX,AGE,EDUCATION,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,STUDY_BioFIND,STUDY_PDBP,STUDY_PPMI \
--kernel skat,skato \
--burden cmc,zeggini,mb,fp,cmcWald \
--geneFile data/SLC25A46.hg38.genefile \
--freqUpper 0.03 \
--out result/burden/SLC25A46.filtered.burden.maf03
rvtest --inVcf data/vcf/SLC25A46.filtered.fid.vcf.gz \
--pheno data/covariateInfo_AMPPD.txt \
--covar data/covariateInfo_AMPPD.txt \
--covar-name SEX,AGE,EDUCATION,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,STUDY_BioFIND,STUDY_PDBP,STUDY_PPMI \
--kernel skat,skato \
--burden cmc,zeggini,mb,fp,cmcWald \
--geneFile data/SLC25A46.hg38.genefile \
--freqUpper 0.01 \
--out result/burden/SLC25A46.filtered.burden.maf01
[+] Loading rvtests 2.1.0 on cn0929
Thank you for using rvtests (version: 20190205, git: c86e589efef15382603300dc7f4c3394c82d69b8)
For documentations, refer to http://zhanxw.github.io/rvtests/
For questions and comments, plase send to Xiaowei Zhan <[email protected]>
For bugs and feature requests, please submit at: https://github.com/zhanxw/rvtests/issues
...
[INFO] Program version: 20190205
[INFO] Analysis started at: Thu Sep 3 18:55:49 2020
[INFO] Loaded [ 2697 ] samples from genotype files
[INFO] Loaded [ 2697 ] sample phenotypes
[INFO] Begin to read covariate file
[INFO] Loaded 1509 male, 1188 female and 0 sex-unknown samples from data/covariateInfo_AMPPD.txt
[INFO] Loaded 1647 cases, 1050 controls, and 0 missing phenotypes
[WARN] -- Enabling binary phenotype mode --
[INFO] Analysis begins with [ 2697 ] samples...
[INFO] MadsonBrowning test significance will be evaluated using 10000 permutations at alpha = 0.05
[INFO] SKAT test significance will be evaluated using 10000 permutations at alpha = 0.05 weight = Beta[beta1 = 1.00, beta2 = 25.00]
[INFO] SKAT-O test significance will be evaluated using weight = Beta[beta1 = 1.00, beta2 = 25.00]
[INFO] Loaded [ 1 ] genes.
[INFO] Impute missing genotype to mean (by default)
[INFO] Set upper minor allele frequency limit to 0.03
[INFO] Analysis started
[INFO] Analyzed [ 132 ] variants from [ 1 ] genes/regions
[INFO] Analysis ends at: Thu Sep 3 18:55:53 2020
[INFO] Analysis took 4 seconds
RVTESTS finished successfully
...
[INFO] Program version: 20190205
[INFO] Analysis started at: Thu Sep 3 18:55:53 2020
[INFO] Loaded [ 2697 ] samples from genotype files
[INFO] Loaded [ 2697 ] sample phenotypes
[INFO] Begin to read covariate file
[INFO] Loaded 1509 male, 1188 female and 0 sex-unknown samples from data/covariateInfo_AMPPD.txt
[INFO] Loaded 1647 cases, 1050 controls, and 0 missing phenotypes
[WARN] -- Enabling binary phenotype mode --
[INFO] Analysis begins with [ 2697 ] samples...
[INFO] MadsonBrowning test significance will be evaluated using 10000 permutations at alpha = 0.05
[INFO] SKAT test significance will be evaluated using 10000 permutations at alpha = 0.05 weight = Beta[beta1 = 1.00, beta2 = 25.00]
[INFO] SKAT-O test significance will be evaluated using weight = Beta[beta1 = 1.00, beta2 = 25.00]
[INFO] Loaded [ 1 ] genes.
[INFO] Impute missing genotype to mean (by default)
[INFO] Set upper minor allele frequency limit to 0.01
[INFO] Analysis started
[INFO] Analyzed [ 115 ] variants from [ 1 ] genes/regions
[INFO] Analysis ends at: Thu Sep 3 18:55:56 2020
[INFO] Analysis took 3 seconds
RVTESTS finished successfully
rvtest --inVcf data/vcf/SLC25A46.filtered.fid.vcf.gz \
--pheno data/covariateInfo_AMPPD.txt \
--covar data/covariateInfo_AMPPD.txt \
--covar-name SEX,AGE,EDUCATION,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,STUDY_BioFIND,STUDY_PDBP,STUDY_PPMI \
--single score \
--geneFile data/SLC25A46.hg38.genefile \
--out result/score/SLC25A46.filtered.score
Thank you for using rvtests (version: 20190205, git: c86e589efef15382603300dc7f4c3394c82d69b8)
For documentations, refer to http://zhanxw.github.io/rvtests/
For questions and comments, plase send to Xiaowei Zhan <[email protected]>
For bugs and feature requests, please submit at: https://github.com/zhanxw/rvtests/issues
...
[INFO] Program version: 20190205
[INFO] Analysis started at: Fri Sep 4 17:27:10 2020
[INFO] Loaded [ 2697 ] samples from genotype files
[INFO] Loaded [ 2697 ] sample phenotypes
[INFO] Begin to read covariate file
[INFO] Loaded 1509 male, 1188 female and 0 sex-unknown samples from data/covariateInfo_AMPPD.txt
[INFO] Loaded 1647 cases, 1050 controls, and 0 missing phenotypes
[WARN] -- Enabling binary phenotype mode --
[INFO] Analysis begins with [ 2697 ] samples...
[INFO] Loaded [ 1 ] genes.
[INFO] Impute missing genotype to mean (by default)
[INFO] Analysis started
[INFO] Analyzed [ 718 ] variants from [ 1 ] genes/regions
[INFO] Analysis ends at: Fri Sep 4 17:27:14 2020
[INFO] Analysis took 4 seconds
RVTESTS finished successfully
We will use False Discovery Rate (FDR)
# kernel: python 3.7
import os
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.stats as sm
from patsy import dmatrices
os.chdir("/data/LNG/KIM/burdenAnalysis/SLC25A46-replication/")
score = pd.read_csv("result/score/SLC25A46.filtered.score.SingleScore.assoc", sep = "\t")
score
Gene | CHROM | POS | REF | ALT | N_INFORMATIVE | AF | U | V | STAT | DIRECTION | EFFECT | SE | PVALUE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SLC25A46 | 5 | 110738249 | T | C | 2697 | 0.000927 | 1.069550 | 0.908378 | 1.259320 | + | 1.177430 | 1.049220 | 0.261780 |
1 | SLC25A46 | 5 | 110738407 | C | T | 2697 | 0.000371 | -0.045776 | 0.388462 | 0.005394 | - | -0.117838 | 1.604450 | 0.941452 |
2 | SLC25A46 | 5 | 110738538 | A | C | 2697 | 0.002039 | 0.600238 | 2.072550 | 0.173837 | + | 0.289614 | 0.694621 | 0.676724 |
3 | SLC25A46 | 5 | 110738643 | G | A | 2697 | 0.000185 | -0.109522 | 0.097021 | 0.123633 | - | -1.128850 | 3.210460 | 0.725127 |
4 | SLC25A46 | 5 | 110738696 | T | C | 2697 | 0.021505 | -0.053915 | 20.393400 | 0.000143 | - | -0.002644 | 0.221440 | 0.990474 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
713 | SLC25A46 | 5 | 110764313 | T | C | 2697 | 0.001112 | -1.321120 | 1.396020 | 1.250240 | - | -0.946349 | 0.846359 | 0.263506 |
714 | SLC25A46 | 5 | 110764363 | C | T | 2697 | 0.117723 | -3.186630 | 119.149000 | 0.085226 | - | -0.026745 | 0.091612 | 0.770336 |
715 | SLC25A46 | 5 | 110764618 | G | A | 2697 | 0.099926 | -12.843500 | 81.840200 | 2.015590 | - | -0.156934 | 0.110539 | 0.155691 |
716 | SLC25A46 | 5 | 110764674 | G | GA | 2697 | 0.000927 | 0.547030 | 0.464448 | 0.644296 | + | 1.177810 | 1.467340 | 0.422159 |
717 | SLC25A46 | 5 | 110764969 | G | T | 2697 | 0.243789 | -14.176400 | 139.849000 | 1.437040 | - | -0.101369 | 0.084561 | 0.230618 |
718 rows × 14 columns
score = score.drop_duplicates()
score
Gene | CHROM | POS | REF | ALT | N_INFORMATIVE | AF | U | V | STAT | DIRECTION | EFFECT | SE | PVALUE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SLC25A46 | 5 | 110738249 | T | C | 2697 | 0.000927 | 1.069550 | 0.908378 | 1.259320 | + | 1.177430 | 1.049220 | 0.261780 |
1 | SLC25A46 | 5 | 110738407 | C | T | 2697 | 0.000371 | -0.045776 | 0.388462 | 0.005394 | - | -0.117838 | 1.604450 | 0.941452 |
2 | SLC25A46 | 5 | 110738538 | A | C | 2697 | 0.002039 | 0.600238 | 2.072550 | 0.173837 | + | 0.289614 | 0.694621 | 0.676724 |
3 | SLC25A46 | 5 | 110738643 | G | A | 2697 | 0.000185 | -0.109522 | 0.097021 | 0.123633 | - | -1.128850 | 3.210460 | 0.725127 |
4 | SLC25A46 | 5 | 110738696 | T | C | 2697 | 0.021505 | -0.053915 | 20.393400 | 0.000143 | - | -0.002644 | 0.221440 | 0.990474 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
179 | SLC25A46 | 5 | 110764313 | T | C | 2697 | 0.001112 | -1.321120 | 1.396020 | 1.250240 | - | -0.946349 | 0.846359 | 0.263506 |
180 | SLC25A46 | 5 | 110764363 | C | T | 2697 | 0.117723 | -3.186630 | 119.149000 | 0.085226 | - | -0.026745 | 0.091612 | 0.770336 |
181 | SLC25A46 | 5 | 110764618 | G | A | 2697 | 0.099926 | -12.843500 | 81.840200 | 2.015590 | - | -0.156934 | 0.110539 | 0.155691 |
182 | SLC25A46 | 5 | 110764674 | G | GA | 2697 | 0.000927 | 0.547030 | 0.464448 | 0.644296 | + | 1.177810 | 1.467340 | 0.422159 |
183 | SLC25A46 | 5 | 110764969 | G | T | 2697 | 0.243789 | -14.176400 | 139.849000 | 1.437040 | - | -0.101369 | 0.084561 | 0.230618 |
184 rows × 14 columns
score['FDR-Corrected'] = sm.multitest.multipletests(score['PVALUE'])[1]
# alpha = 0.05
/usr/local/Anaconda/envs/py3.7/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
score['FDR-Corrected']
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
...
179 1.0
180 1.0
181 1.0
182 1.0
183 1.0
Name: FDR-Corrected, Length: 184, dtype: float64
score.head()
Gene | CHROM | POS | REF | ALT | N_INFORMATIVE | AF | U | V | STAT | DIRECTION | EFFECT | SE | PVALUE | FDR-Corrected | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SLC25A46 | 5 | 110738249 | T | C | 2697 | 0.000927 | 1.069550 | 0.908378 | 1.259320 | + | 1.177430 | 1.049220 | 0.261780 | 1.0 |
1 | SLC25A46 | 5 | 110738407 | C | T | 2697 | 0.000371 | -0.045776 | 0.388462 | 0.005394 | - | -0.117838 | 1.604450 | 0.941452 | 1.0 |
2 | SLC25A46 | 5 | 110738538 | A | C | 2697 | 0.002039 | 0.600238 | 2.072550 | 0.173837 | + | 0.289614 | 0.694621 | 0.676724 | 1.0 |
3 | SLC25A46 | 5 | 110738643 | G | A | 2697 | 0.000185 | -0.109522 | 0.097021 | 0.123633 | - | -1.128850 | 3.210460 | 0.725127 | 1.0 |
4 | SLC25A46 | 5 | 110738696 | T | C | 2697 | 0.021505 | -0.053915 | 20.393400 | 0.000143 | - | -0.002644 | 0.221440 | 0.990474 | 1.0 |
score.to_csv("result/score/SLC25A46.filtered.score.SingleScore.FDR.dup_cleaned.assoc", sep="\t", index=False)
Let's also filter out the non-synonymous variants of interest. They are:
- 5:110739267
- 5:110739354
- 5:110746300
- 5:110756712
- 5:110761217
- 5:110761271
- 5:110761292
- 5:110761301
- 5:110761662
anno = pd.read_csv("result/anno/SLC25A46.filtered.fid.annovar.hg38_multianno.trimmed.txt", sep = "\t")
anno.head()
Chr | Start | End | Ref | Alt | Func.refGene | Gene.refGene | GeneDetail.refGene | ExonicFunc.refGene | AAChange.refGene | ... | non_topmed_AF_popmax | non_neuro_AF_popmax | non_cancer_AF_popmax | controls_AF_popmax | CLNALLELEID | CLNDN | CLNDISDB | CLNREVSTAT | CLNSIG | Otherinfo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | 110738249 | 110738249 | T | C | splicing | SLC25A46 | NM_001303250:exon1:c.10+2T>C | . | . | ... | 0.0006 | 0.0005 | . | 0.0008 | . | . | . | . | . | 0.000927 |
1 | 5 | 110738407 | 110738407 | C | T | intronic | SLC25A46 | . | . | . | ... | 0.0267 | 0.0269 | . | 0.0292 | . | . | . | . | . | 0.000371 |
2 | 5 | 110738538 | 110738538 | A | C | intronic | SLC25A46 | . | . | . | ... | 0.0037 | 0.0036 | . | 0.0081 | . | . | . | . | . | 0.002039 |
3 | 5 | 110738643 | 110738643 | G | A | intronic | SLC25A46 | . | . | . | ... | . | . | . | . | . | . | . | . | . | 0.000185 |
4 | 5 | 110738696 | 110738696 | T | C | intronic | SLC25A46 | . | . | . | ... | 0.0302 | 0.0264 | . | 0.0285 | . | . | . | . | . | 0.021510 |
5 rows × 58 columns
is_nonsyn = anno['ExonicFunc.refGene']=="nonsynonymous SNV"
anno_nonsyn = anno[is_nonsyn]
anno_nonsyn
Chr | Start | End | Ref | Alt | Func.refGene | Gene.refGene | GeneDetail.refGene | ExonicFunc.refGene | AAChange.refGene | ... | non_topmed_AF_popmax | non_neuro_AF_popmax | non_cancer_AF_popmax | controls_AF_popmax | CLNALLELEID | CLNDN | CLNDISDB | CLNREVSTAT | CLNSIG | Otherinfo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 5 | 110739267 | 110739267 | C | A | exonic | SLC25A46 | . | nonsynonymous SNV | SLC25A46:NM_001303249:exon1:c.C148A:p.P50T,SLC... | ... | 0.0012 | 0.0018 | . | 0.0004 | 520610 | NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... | MedGen:C4225302,OMIM:616505 | criteria_provided,_single_submitter | Uncertain_significance | 0.000742 |
11 | 5 | 110739354 | 110739354 | G | A | exonic | SLC25A46 | . | nonsynonymous SNV | SLC25A46:NM_001303249:exon1:c.G235A:p.E79K,SLC... | ... | 0.0014 | 0.0015 | . | 0.0011 | 520315 | NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... | MedGen:C4225302,OMIM:616505 | criteria_provided,_single_submitter | Likely_benign | 0.002225 |
61 | 5 | 110746300 | 110746300 | C | A | exonic | SLC25A46 | . | nonsynonymous SNV | SLC25A46:NM_001303249:exon4:c.C416A:p.T139N,SL... | ... | 0.0011 | 0.0010 | . | 0.0011 | 453904 | NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... | MedGen:C4225302,OMIM:616505 | criteria_provided,_single_submitter | Likely_benign | 0.001112 |
129 | 5 | 110756712 | 110756712 | G | A | exonic | SLC25A46 | . | nonsynonymous SNV | SLC25A46:NM_001303249:exon7:c.G631A:p.V211M,SL... | ... | 0.0035 | 0.0030 | . | 0.0035 | 453906 | NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... | MedGen:C4225302,OMIM:616505 | criteria_provided,_single_submitter | Likely_benign | 0.003337 |
156 | 5 | 110761217 | 110761217 | G | A | exonic | SLC25A46 | . | nonsynonymous SNV | SLC25A46:NM_001303250:exon8:c.G419A:p.R140Q,SL... | ... | . | . | . | . | . | . | . | . | . | 0.000185 |
158 | 5 | 110761271 | 110761271 | G | A | exonic | SLC25A46 | . | nonsynonymous SNV | SLC25A46:NM_001303250:exon8:c.G473A:p.G158D,SL... | ... | 0.0004 | 0.0002 | . | 0.0004 | 359143 | NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... | MedGen:C4225302,OMIM:616505 | no_assertion_criteria_provided | Pathogenic | 0.000185 |
159 | 5 | 110761292 | 110761292 | A | G | exonic | SLC25A46 | . | nonsynonymous SNV | SLC25A46:NM_001303250:exon8:c.A494G:p.K165R,SL... | ... | 0.0041 | 0.0037 | . | 0.0053 | 454455 | NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... | MedGen:C4225302,OMIM:616505 | criteria_provided,_single_submitter | Likely_benign | 0.002225 |
160 | 5 | 110761301 | 110761301 | T | G | exonic | SLC25A46 | . | nonsynonymous SNV | SLC25A46:NM_001303250:exon8:c.T503G:p.L168R,SL... | ... | . | 7.348e-05 | . | . | 454461 | NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... | MedGen:C4225302,OMIM:616505 | criteria_provided,_single_submitter | Uncertain_significance | 0.000185 |
161 | 5 | 110761662 | 110761662 | G | T | exonic | SLC25A46 | . | nonsynonymous SNV | SLC25A46:NM_001303249:exon8:c.G894T:p.E298D,SL... | ... | 0.1177 | 0.1124 | . | 0.1157 | 454465 | NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... | MedGen:C4225302,OMIM:616505 | criteria_provided,_single_submitter | Benign | 0.000371 |
9 rows × 58 columns
nonsym_POS = anno_nonsyn.rename(columns={"Start": "POS"})[['POS']]
score_nonsyn = pd.merge(nonsym_POS, score, on="POS")
score_nonsyn
POS | Gene | CHROM | REF | ALT | N_INFORMATIVE | AF | U | V | STAT | DIRECTION | EFFECT | SE | PVALUE | FDR-Corrected | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 110739267 | SLC25A46 | 5 | C | A | 2697 | 0.000742 | 0.823836 | 0.599113 | 1.132850 | + | 1.375090 | 1.291950 | 0.287168 | 1.0 |
1 | 110739354 | SLC25A46 | 5 | G | A | 2697 | 0.002225 | -0.172238 | 1.856260 | 0.015982 | - | -0.092788 | 0.733974 | 0.899401 | 1.0 |
2 | 110746300 | SLC25A46 | 5 | C | A | 2697 | 0.001112 | 0.048595 | 0.935050 | 0.002526 | + | 0.051971 | 1.034150 | 0.959920 | 1.0 |
3 | 110756712 | SLC25A46 | 5 | G | A | 2697 | 0.003337 | 2.870400 | 3.130230 | 2.632130 | + | 0.916992 | 0.565213 | 0.104721 | 1.0 |
4 | 110761217 | SLC25A46 | 5 | G | A | 2697 | 0.000185 | -0.048355 | 0.045875 | 0.050969 | - | -1.054060 | 4.668870 | 0.821385 | 1.0 |
5 | 110761271 | SLC25A46 | 5 | G | A | 2697 | 0.000185 | 0.259504 | 0.191297 | 0.352031 | + | 1.356550 | 2.286360 | 0.552966 | 1.0 |
6 | 110761292 | SLC25A46 | 5 | A | G | 2697 | 0.002225 | -2.064850 | 2.313330 | 1.843060 | - | -0.892586 | 0.657478 | 0.174593 | 1.0 |
7 | 110761301 | SLC25A46 | 5 | T | G | 2697 | 0.000185 | -0.141477 | 0.120936 | 0.165506 | - | -1.169850 | 2.875560 | 0.684137 | 1.0 |
8 | 110761662 | SLC25A46 | 5 | G | T | 2697 | 0.000371 | -0.008878 | 0.130762 | 0.000603 | - | -0.067895 | 2.765410 | 0.980413 | 1.0 |
# reorder column
score_nonsyn_cols = list(score_nonsyn.columns)
score_nonsyn_new_cols = ["Gene", "CHROM", "POS"] + score_nonsyn_cols[3:]
score_nonsyn[score_nonsyn_new_cols]
Gene | CHROM | POS | REF | ALT | N_INFORMATIVE | AF | U | V | STAT | DIRECTION | EFFECT | SE | PVALUE | FDR-Corrected | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SLC25A46 | 5 | 110739267 | C | A | 2697 | 0.000742 | 0.823836 | 0.599113 | 1.132850 | + | 1.375090 | 1.291950 | 0.287168 | 1.0 |
1 | SLC25A46 | 5 | 110739354 | G | A | 2697 | 0.002225 | -0.172238 | 1.856260 | 0.015982 | - | -0.092788 | 0.733974 | 0.899401 | 1.0 |
2 | SLC25A46 | 5 | 110746300 | C | A | 2697 | 0.001112 | 0.048595 | 0.935050 | 0.002526 | + | 0.051971 | 1.034150 | 0.959920 | 1.0 |
3 | SLC25A46 | 5 | 110756712 | G | A | 2697 | 0.003337 | 2.870400 | 3.130230 | 2.632130 | + | 0.916992 | 0.565213 | 0.104721 | 1.0 |
4 | SLC25A46 | 5 | 110761217 | G | A | 2697 | 0.000185 | -0.048355 | 0.045875 | 0.050969 | - | -1.054060 | 4.668870 | 0.821385 | 1.0 |
5 | SLC25A46 | 5 | 110761271 | G | A | 2697 | 0.000185 | 0.259504 | 0.191297 | 0.352031 | + | 1.356550 | 2.286360 | 0.552966 | 1.0 |
6 | SLC25A46 | 5 | 110761292 | A | G | 2697 | 0.002225 | -2.064850 | 2.313330 | 1.843060 | - | -0.892586 | 0.657478 | 0.174593 | 1.0 |
7 | SLC25A46 | 5 | 110761301 | T | G | 2697 | 0.000185 | -0.141477 | 0.120936 | 0.165506 | - | -1.169850 | 2.875560 | 0.684137 | 1.0 |
8 | SLC25A46 | 5 | 110761662 | G | T | 2697 | 0.000371 | -0.008878 | 0.130762 | 0.000603 | - | -0.067895 | 2.765410 | 0.980413 | 1.0 |
score_nonsyn[score_nonsyn_new_cols].to_csv("result/score/SLC25A46.filtered.score.SingleScore.FDR.dup_cleaned.nonsyn.assoc", sep="\t", index=False)
1 control and 1 case was found with p.E79K/p.V211M compound heterozygosity (no homozygous found in either groups).
Generate 2x2 matrix
# kernel: R 3.6
data <- matrix(c(1,1049,1,1046), ncol=2)
data
[,1] [,2]
[1,] 1 1
[2,] 1049 1046
# run fisher test
fisher.test(data)
Fisher's Exact Test for Count Data
data: data
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.01269833 78.30123556
sample estimates:
odds ratio
0.9971414