Skip to content

Latest commit

 

History

History
1884 lines (1724 loc) · 42.5 KB

SLC25A46.md

File metadata and controls

1884 lines (1724 loc) · 42.5 KB

Candidate PD-risk gene SLC25A46 Evaluation using AMP-PD data

by Jonggeol Jeff Kim and Hui Liu


Part of the effort by Hui Liu and Jonggeol Jeffrey Kim to confirm candidate PD-risk gene SLC25A46 based on this paper: https://www.ncbi.nlm.nih.gov/pubmed/32259769


Software used:

  1. PLINK v2.00a2LM/1.9
  2. BCFTools
  3. Annovar
  4. RVTESTS

Steps:

  1. Subset and convert data
  2. Annotation
  3. Allele frequency + count
  4. Burden analysis
  5. Score test
  6. Multiple test correction
  7. Compound Heterozygous Fisher Test

1. Subset and convert data

The data has already been cleaned (e.g. genotype quality, ancestry etc.) and subsetted into a PLINK binary format.

Subsetting log can be read here:

# kernel: bash
cat SLC25A46_WGS_AMP_PD.log
PLINK v1.90b6.16 64-bit (19 Feb 2020)
Options in effect:
  --bfile sample_filtered_geno05_euro
  --chr 5
  --from-bp 110738145
  --make-bed
  --out SLC25A46_WGS_AMP_PD
  --to-bp 110765157
...
243 out of 28195229 variants loaded from .bim file.
2927 people (0 males, 0 females, 2927 ambiguous) loaded from .fam.
...
Total genotyping rate is 0.999947.
243 variants and 2927 people pass filters and QC.
Note: No phenotypes present.
...

Filter data

Filter out participants with sex information missing, as well as variants 0 allele count.

Pull out columns 1, 2, and 5 (FID, IID, SEX) without header.

module load plink/2.3-alpha
plink2 --bfile SLC25A46_WGS_AMP_PD \
       --update-sex data/AMP_PD_sex.txt \
       --remove-nosex \
       --mac 1 \
       --make-bed \
       --out data/SLC25A46.filtered
[+] Loading plink  2.3-alpha 

PLINK v2.00a2.3LM 64-bit Intel (24 Jan 2020)   www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to data/SLC25A46.filtered.log.
Options in effect:
  --bfile /data/LNG/saraB/AMP_PD/genesfortrainees_2ndpart/SLC25A46_WGS_AMP_PD
  --mac 1
  --make-bed
  --out data/SLC25A46.filtered
  --remove-nosex
  --update-sex data/AMP_PD_sex.txt

...
2927 samples (0 females, 0 males, 2927 ambiguous; 2927 founders) loaded from
/data/LNG/saraB/AMP_PD/genesfortrainees_2ndpart/SLC25A46_WGS_AMP_PD.fam.
243 variants loaded from
/data/LNG/saraB/AMP_PD/genesfortrainees_2ndpart/SLC25A46_WGS_AMP_PD.bim.
Note: No phenotype data present.
--update-sex: 2697 samples updated.
230 samples removed due to sex filter(s).
2697 samples (1188 females, 1509 males; 2697 founders) remaining after main
filters.
Calculating allele frequencies... done.
59 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
184 variants remaining after main filters.
...

Create VCFs

module load plink
# use plink 1.9 for this
plink --bfile data/SLC25A46.filtered \
      --recode vcf-fid bgz \
      --out data/vcf/SLC25A46.filtered.fid
[-] Unloading plink  2.3-alpha 
[+] Loading plink  1.9.0-beta4.4  on cn0927 

The following have been reloaded with a version change:
  1) plink/2.3-alpha => plink/1.9.0-beta4.4

PLINK v1.90b4.4 64-bit (21 May 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to data/vcf/SLC25A46.filtered.fid.log.
Options in effect:
  --bfile data/SLC25A46.filtered
  --out data/vcf/SLC25A46.filtered.fid
  --recode vcf-fid bgz

...
184 variants loaded from .bim file.
2697 people (1509 males, 1188 females) loaded from .fam.
...
Total genotyping rate is 0.99994.
184 variants and 2697 people pass filters and QC.
Note: No phenotypes present.
...
cd data/vcf

module load samtools
# tabulate
tabix -f -p vcf SLC25A46.filtered.fid.vcf.gz

cd ../..
[+] Loading samtools 1.10  ... 

2. Determine Allele Frequency + Count

# bring back plink2
module load plink/2.3-alpha

plink2 --bfile data/SLC25A46.filtered \
       --freq \
       --out result/freq/SLC25A46.filtered.af
plink2 --bfile data/SLC25A46.filtered \
       --geno-counts \
       --out result/freq/SLC25A46.filtered.ac
[+] Loading plink  2.3-alpha 
PLINK v2.00a2.3LM 64-bit Intel (24 Jan 2020)   www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to result/freq/SLC25A46.filtered.af.log.
Options in effect:
  --bfile data/SLC25A46.filtered
  --freq
  --out result/freq/SLC25A46.filtered.af

...
2697 samples (1188 females, 1509 males; 2697 founders) loaded from
data/SLC25A46.filtered.fam.
184 variants loaded from data/SLC25A46.filtered.bim.
Note: No phenotype data present.
Calculating allele frequencies... done.
--freq: Allele frequencies (founders only) written to222232324252526262727282829293030313232333334343535363637383839394040414142424344444545464647474848495050515152525353545455555657575858595960606161626363646465656666676768696970707171727273737475757676777778787979808081828283838484858586868788888989909091919292939494959596969797989899%
result/freq/SLC25A46.filtered.af.afreq .
...
PLINK v2.00a2.3LM 64-bit Intel (24 Jan 2020)   www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to result/freq/SLC25A46.filtered.ac.log.
Options in effect:
  --bfile data/SLC25A46.filtered
  --geno-counts
  --out result/freq/SLC25A46.filtered.ac

...
2697 samples (1188 females, 1509 males; 2697 founders) loaded from
data/SLC25A46.filtered.fam.
184 variants loaded from data/SLC25A46.filtered.bim.
Note: No phenotype data present.
Calculating allele frequencies... done.
--geno-counts: Genotype counts written to1717181919202021212222232324252526262727282829293030313232333334343535363637383839394040414142424344444545464647474848495050515152525353545455555657575858595960606161626363646465656666676768696970707171727273737475757676777778787979808081828283838484858586868788888989909091919292939494959596969797989899%
result/freq/SLC25A46.filtered.ac.gcount .
...

3. Annotation

table_annovar.pl data/vcf/SLC25A46.filtered.fid.vcf.gz \
                 $ANNOVAR_DATA/hg38 --thread 20 -buildver hg38 \
                 -out result/anno/SLC25A46.filtered.fid.annovar \
                 -arg '-splicing 15',,, \
                 -remove -protocol refGene,ljb26_all,gnomad211_genome,clinvar_20190305 \
                 -operation g,f,f,f -nastring . -vcfinput -polish

Trim results

# remove unnecessary VCF file
rm result/anno/SLC25A46.filtered.fid.annovar.hg38_multianno.vcf
# truncate unnecessay genotype information
cd result/anno

head -1 SLC25A46.filtered.fid.annovar.hg38_multianno.txt > header.txt
colct="$(wc -w header.txt| cut -f1 -d' ')"
cut -f1-$colct SLC25A46.filtered.fid.annovar.hg38_multianno.txt > SLC25A46.filtered.fid.annovar.hg38_multianno.trimmed.txt
rm header.txt

cd ../..

4. Gene-Level Burden Analyses

# makes rvtests more manageable
grep SLC25A46 /data/LNG/KIM/burden_resources/refFlat_hg38.txt > data/SLC25A46.hg38.genefile
module load rvtests

rvtest --inVcf data/vcf/SLC25A46.filtered.fid.vcf.gz \
       --pheno data/covariateInfo_AMPPD.txt \
       --covar data/covariateInfo_AMPPD.txt \
       --covar-name SEX,AGE,EDUCATION,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,STUDY_BioFIND,STUDY_PDBP,STUDY_PPMI \
       --kernel skat,skato \
       --burden cmc,zeggini,mb,fp,cmcWald \
       --geneFile data/SLC25A46.hg38.genefile \
       --freqUpper 0.03 \
       --out result/burden/SLC25A46.filtered.burden.maf03


rvtest --inVcf data/vcf/SLC25A46.filtered.fid.vcf.gz \
       --pheno data/covariateInfo_AMPPD.txt \
       --covar data/covariateInfo_AMPPD.txt \
       --covar-name SEX,AGE,EDUCATION,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,STUDY_BioFIND,STUDY_PDBP,STUDY_PPMI \
       --kernel skat,skato \
       --burden cmc,zeggini,mb,fp,cmcWald \
       --geneFile data/SLC25A46.hg38.genefile \
       --freqUpper 0.01 \
       --out result/burden/SLC25A46.filtered.burden.maf01
[+] Loading rvtests  2.1.0  on cn0929 
Thank you for using rvtests (version: 20190205, git: c86e589efef15382603300dc7f4c3394c82d69b8)
  For documentations, refer to http://zhanxw.github.io/rvtests/
  For questions and comments, plase send to Xiaowei Zhan <[email protected]>
  For bugs and feature requests, please submit at: https://github.com/zhanxw/rvtests/issues
...
[INFO]	Program version: 20190205
[INFO]	Analysis started at: Thu Sep  3 18:55:49 2020
[INFO]	Loaded [ 2697 ] samples from genotype files
[INFO]	Loaded [ 2697 ] sample phenotypes
[INFO]	Begin to read covariate file
[INFO]	Loaded 1509 male, 1188 female and 0 sex-unknown samples from data/covariateInfo_AMPPD.txt
[INFO]	Loaded 1647 cases, 1050 controls, and 0 missing phenotypes
[WARN]	-- Enabling binary phenotype mode -- 
[INFO]	Analysis begins with [ 2697 ] samples...
[INFO]	MadsonBrowning test significance will be evaluated using 10000 permutations at alpha = 0.05
[INFO]	SKAT test significance will be evaluated using 10000 permutations at alpha = 0.05 weight = Beta[beta1 = 1.00, beta2 = 25.00]
[INFO]	SKAT-O test significance will be evaluated using weight = Beta[beta1 = 1.00, beta2 = 25.00]
[INFO]	Loaded [ 1 ] genes.
[INFO]	Impute missing genotype to mean (by default)
[INFO]	Set upper minor allele frequency limit to 0.03
[INFO]	Analysis started
[INFO]	Analyzed [ 132 ] variants from [ 1 ] genes/regions
[INFO]	Analysis ends at: Thu Sep  3 18:55:53 2020
[INFO]	Analysis took 4 seconds
RVTESTS finished successfully
...
[INFO]	Program version: 20190205
[INFO]	Analysis started at: Thu Sep  3 18:55:53 2020
[INFO]	Loaded [ 2697 ] samples from genotype files
[INFO]	Loaded [ 2697 ] sample phenotypes
[INFO]	Begin to read covariate file
[INFO]	Loaded 1509 male, 1188 female and 0 sex-unknown samples from data/covariateInfo_AMPPD.txt
[INFO]	Loaded 1647 cases, 1050 controls, and 0 missing phenotypes
[WARN]	-- Enabling binary phenotype mode -- 
[INFO]	Analysis begins with [ 2697 ] samples...
[INFO]	MadsonBrowning test significance will be evaluated using 10000 permutations at alpha = 0.05
[INFO]	SKAT test significance will be evaluated using 10000 permutations at alpha = 0.05 weight = Beta[beta1 = 1.00, beta2 = 25.00]
[INFO]	SKAT-O test significance will be evaluated using weight = Beta[beta1 = 1.00, beta2 = 25.00]
[INFO]	Loaded [ 1 ] genes.
[INFO]	Impute missing genotype to mean (by default)
[INFO]	Set upper minor allele frequency limit to 0.01
[INFO]	Analysis started
[INFO]	Analyzed [ 115 ] variants from [ 1 ] genes/regions
[INFO]	Analysis ends at: Thu Sep  3 18:55:56 2020
[INFO]	Analysis took 3 seconds
RVTESTS finished successfully

5. Score Test

rvtest --inVcf data/vcf/SLC25A46.filtered.fid.vcf.gz \
       --pheno data/covariateInfo_AMPPD.txt \
       --covar data/covariateInfo_AMPPD.txt \
       --covar-name SEX,AGE,EDUCATION,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,STUDY_BioFIND,STUDY_PDBP,STUDY_PPMI \
       --single score \
       --geneFile data/SLC25A46.hg38.genefile \
       --out result/score/SLC25A46.filtered.score
Thank you for using rvtests (version: 20190205, git: c86e589efef15382603300dc7f4c3394c82d69b8)
  For documentations, refer to http://zhanxw.github.io/rvtests/
  For questions and comments, plase send to Xiaowei Zhan <[email protected]>
  For bugs and feature requests, please submit at: https://github.com/zhanxw/rvtests/issues
...
[INFO]	Program version: 20190205
[INFO]	Analysis started at: Fri Sep  4 17:27:10 2020
[INFO]	Loaded [ 2697 ] samples from genotype files
[INFO]	Loaded [ 2697 ] sample phenotypes
[INFO]	Begin to read covariate file
[INFO]	Loaded 1509 male, 1188 female and 0 sex-unknown samples from data/covariateInfo_AMPPD.txt
[INFO]	Loaded 1647 cases, 1050 controls, and 0 missing phenotypes
[WARN]	-- Enabling binary phenotype mode -- 
[INFO]	Analysis begins with [ 2697 ] samples...
[INFO]	Loaded [ 1 ] genes.
[INFO]	Impute missing genotype to mean (by default)
[INFO]	Analysis started
[INFO]	Analyzed [ 718 ] variants from [ 1 ] genes/regions
[INFO]	Analysis ends at: Fri Sep  4 17:27:14 2020
[INFO]	Analysis took 4 seconds
RVTESTS finished successfully

6. Add multiple-test corrections to score results

We will use False Discovery Rate (FDR)

# kernel: python 3.7
import os
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.stats as sm
from patsy import dmatrices

os.chdir("/data/LNG/KIM/burdenAnalysis/SLC25A46-replication/")
score = pd.read_csv("result/score/SLC25A46.filtered.score.SingleScore.assoc", sep = "\t")
score
Gene CHROM POS REF ALT N_INFORMATIVE AF U V STAT DIRECTION EFFECT SE PVALUE
0 SLC25A46 5 110738249 T C 2697 0.000927 1.069550 0.908378 1.259320 + 1.177430 1.049220 0.261780
1 SLC25A46 5 110738407 C T 2697 0.000371 -0.045776 0.388462 0.005394 - -0.117838 1.604450 0.941452
2 SLC25A46 5 110738538 A C 2697 0.002039 0.600238 2.072550 0.173837 + 0.289614 0.694621 0.676724
3 SLC25A46 5 110738643 G A 2697 0.000185 -0.109522 0.097021 0.123633 - -1.128850 3.210460 0.725127
4 SLC25A46 5 110738696 T C 2697 0.021505 -0.053915 20.393400 0.000143 - -0.002644 0.221440 0.990474
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
713 SLC25A46 5 110764313 T C 2697 0.001112 -1.321120 1.396020 1.250240 - -0.946349 0.846359 0.263506
714 SLC25A46 5 110764363 C T 2697 0.117723 -3.186630 119.149000 0.085226 - -0.026745 0.091612 0.770336
715 SLC25A46 5 110764618 G A 2697 0.099926 -12.843500 81.840200 2.015590 - -0.156934 0.110539 0.155691
716 SLC25A46 5 110764674 G GA 2697 0.000927 0.547030 0.464448 0.644296 + 1.177810 1.467340 0.422159
717 SLC25A46 5 110764969 G T 2697 0.243789 -14.176400 139.849000 1.437040 - -0.101369 0.084561 0.230618

718 rows × 14 columns

score = score.drop_duplicates()
score
Gene CHROM POS REF ALT N_INFORMATIVE AF U V STAT DIRECTION EFFECT SE PVALUE
0 SLC25A46 5 110738249 T C 2697 0.000927 1.069550 0.908378 1.259320 + 1.177430 1.049220 0.261780
1 SLC25A46 5 110738407 C T 2697 0.000371 -0.045776 0.388462 0.005394 - -0.117838 1.604450 0.941452
2 SLC25A46 5 110738538 A C 2697 0.002039 0.600238 2.072550 0.173837 + 0.289614 0.694621 0.676724
3 SLC25A46 5 110738643 G A 2697 0.000185 -0.109522 0.097021 0.123633 - -1.128850 3.210460 0.725127
4 SLC25A46 5 110738696 T C 2697 0.021505 -0.053915 20.393400 0.000143 - -0.002644 0.221440 0.990474
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
179 SLC25A46 5 110764313 T C 2697 0.001112 -1.321120 1.396020 1.250240 - -0.946349 0.846359 0.263506
180 SLC25A46 5 110764363 C T 2697 0.117723 -3.186630 119.149000 0.085226 - -0.026745 0.091612 0.770336
181 SLC25A46 5 110764618 G A 2697 0.099926 -12.843500 81.840200 2.015590 - -0.156934 0.110539 0.155691
182 SLC25A46 5 110764674 G GA 2697 0.000927 0.547030 0.464448 0.644296 + 1.177810 1.467340 0.422159
183 SLC25A46 5 110764969 G T 2697 0.243789 -14.176400 139.849000 1.437040 - -0.101369 0.084561 0.230618

184 rows × 14 columns

score['FDR-Corrected'] = sm.multitest.multipletests(score['PVALUE'])[1]
# alpha = 0.05
/usr/local/Anaconda/envs/py3.7/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
score['FDR-Corrected']
0      1.0
1      1.0
2      1.0
3      1.0
4      1.0
      ... 
179    1.0
180    1.0
181    1.0
182    1.0
183    1.0
Name: FDR-Corrected, Length: 184, dtype: float64
score.head()
Gene CHROM POS REF ALT N_INFORMATIVE AF U V STAT DIRECTION EFFECT SE PVALUE FDR-Corrected
0 SLC25A46 5 110738249 T C 2697 0.000927 1.069550 0.908378 1.259320 + 1.177430 1.049220 0.261780 1.0
1 SLC25A46 5 110738407 C T 2697 0.000371 -0.045776 0.388462 0.005394 - -0.117838 1.604450 0.941452 1.0
2 SLC25A46 5 110738538 A C 2697 0.002039 0.600238 2.072550 0.173837 + 0.289614 0.694621 0.676724 1.0
3 SLC25A46 5 110738643 G A 2697 0.000185 -0.109522 0.097021 0.123633 - -1.128850 3.210460 0.725127 1.0
4 SLC25A46 5 110738696 T C 2697 0.021505 -0.053915 20.393400 0.000143 - -0.002644 0.221440 0.990474 1.0
score.to_csv("result/score/SLC25A46.filtered.score.SingleScore.FDR.dup_cleaned.assoc", sep="\t", index=False)

Let's also filter out the non-synonymous variants of interest. They are:

  • 5:110739267
  • 5:110739354
  • 5:110746300
  • 5:110756712
  • 5:110761217
  • 5:110761271
  • 5:110761292
  • 5:110761301
  • 5:110761662
anno = pd.read_csv("result/anno/SLC25A46.filtered.fid.annovar.hg38_multianno.trimmed.txt", sep = "\t")
anno.head()
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene ... non_topmed_AF_popmax non_neuro_AF_popmax non_cancer_AF_popmax controls_AF_popmax CLNALLELEID CLNDN CLNDISDB CLNREVSTAT CLNSIG Otherinfo
0 5 110738249 110738249 T C splicing SLC25A46 NM_001303250:exon1:c.10+2T>C . . ... 0.0006 0.0005 . 0.0008 . . . . . 0.000927
1 5 110738407 110738407 C T intronic SLC25A46 . . . ... 0.0267 0.0269 . 0.0292 . . . . . 0.000371
2 5 110738538 110738538 A C intronic SLC25A46 . . . ... 0.0037 0.0036 . 0.0081 . . . . . 0.002039
3 5 110738643 110738643 G A intronic SLC25A46 . . . ... . . . . . . . . . 0.000185
4 5 110738696 110738696 T C intronic SLC25A46 . . . ... 0.0302 0.0264 . 0.0285 . . . . . 0.021510

5 rows × 58 columns

is_nonsyn = anno['ExonicFunc.refGene']=="nonsynonymous SNV"
anno_nonsyn = anno[is_nonsyn]
anno_nonsyn
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene ... non_topmed_AF_popmax non_neuro_AF_popmax non_cancer_AF_popmax controls_AF_popmax CLNALLELEID CLNDN CLNDISDB CLNREVSTAT CLNSIG Otherinfo
10 5 110739267 110739267 C A exonic SLC25A46 . nonsynonymous SNV SLC25A46:NM_001303249:exon1:c.C148A:p.P50T,SLC... ... 0.0012 0.0018 . 0.0004 520610 NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... MedGen:C4225302,OMIM:616505 criteria_provided,_single_submitter Uncertain_significance 0.000742
11 5 110739354 110739354 G A exonic SLC25A46 . nonsynonymous SNV SLC25A46:NM_001303249:exon1:c.G235A:p.E79K,SLC... ... 0.0014 0.0015 . 0.0011 520315 NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... MedGen:C4225302,OMIM:616505 criteria_provided,_single_submitter Likely_benign 0.002225
61 5 110746300 110746300 C A exonic SLC25A46 . nonsynonymous SNV SLC25A46:NM_001303249:exon4:c.C416A:p.T139N,SL... ... 0.0011 0.0010 . 0.0011 453904 NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... MedGen:C4225302,OMIM:616505 criteria_provided,_single_submitter Likely_benign 0.001112
129 5 110756712 110756712 G A exonic SLC25A46 . nonsynonymous SNV SLC25A46:NM_001303249:exon7:c.G631A:p.V211M,SL... ... 0.0035 0.0030 . 0.0035 453906 NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... MedGen:C4225302,OMIM:616505 criteria_provided,_single_submitter Likely_benign 0.003337
156 5 110761217 110761217 G A exonic SLC25A46 . nonsynonymous SNV SLC25A46:NM_001303250:exon8:c.G419A:p.R140Q,SL... ... . . . . . . . . . 0.000185
158 5 110761271 110761271 G A exonic SLC25A46 . nonsynonymous SNV SLC25A46:NM_001303250:exon8:c.G473A:p.G158D,SL... ... 0.0004 0.0002 . 0.0004 359143 NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... MedGen:C4225302,OMIM:616505 no_assertion_criteria_provided Pathogenic 0.000185
159 5 110761292 110761292 A G exonic SLC25A46 . nonsynonymous SNV SLC25A46:NM_001303250:exon8:c.A494G:p.K165R,SL... ... 0.0041 0.0037 . 0.0053 454455 NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... MedGen:C4225302,OMIM:616505 criteria_provided,_single_submitter Likely_benign 0.002225
160 5 110761301 110761301 T G exonic SLC25A46 . nonsynonymous SNV SLC25A46:NM_001303250:exon8:c.T503G:p.L168R,SL... ... . 7.348e-05 . . 454461 NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... MedGen:C4225302,OMIM:616505 criteria_provided,_single_submitter Uncertain_significance 0.000185
161 5 110761662 110761662 G T exonic SLC25A46 . nonsynonymous SNV SLC25A46:NM_001303249:exon8:c.G894T:p.E298D,SL... ... 0.1177 0.1124 . 0.1157 454465 NEUROPATHY,_HEREDITARY_MOTOR_AND_SENSORY,_TYPE... MedGen:C4225302,OMIM:616505 criteria_provided,_single_submitter Benign 0.000371

9 rows × 58 columns

nonsym_POS = anno_nonsyn.rename(columns={"Start": "POS"})[['POS']]
score_nonsyn = pd.merge(nonsym_POS, score, on="POS")
score_nonsyn
POS Gene CHROM REF ALT N_INFORMATIVE AF U V STAT DIRECTION EFFECT SE PVALUE FDR-Corrected
0 110739267 SLC25A46 5 C A 2697 0.000742 0.823836 0.599113 1.132850 + 1.375090 1.291950 0.287168 1.0
1 110739354 SLC25A46 5 G A 2697 0.002225 -0.172238 1.856260 0.015982 - -0.092788 0.733974 0.899401 1.0
2 110746300 SLC25A46 5 C A 2697 0.001112 0.048595 0.935050 0.002526 + 0.051971 1.034150 0.959920 1.0
3 110756712 SLC25A46 5 G A 2697 0.003337 2.870400 3.130230 2.632130 + 0.916992 0.565213 0.104721 1.0
4 110761217 SLC25A46 5 G A 2697 0.000185 -0.048355 0.045875 0.050969 - -1.054060 4.668870 0.821385 1.0
5 110761271 SLC25A46 5 G A 2697 0.000185 0.259504 0.191297 0.352031 + 1.356550 2.286360 0.552966 1.0
6 110761292 SLC25A46 5 A G 2697 0.002225 -2.064850 2.313330 1.843060 - -0.892586 0.657478 0.174593 1.0
7 110761301 SLC25A46 5 T G 2697 0.000185 -0.141477 0.120936 0.165506 - -1.169850 2.875560 0.684137 1.0
8 110761662 SLC25A46 5 G T 2697 0.000371 -0.008878 0.130762 0.000603 - -0.067895 2.765410 0.980413 1.0
# reorder column
score_nonsyn_cols = list(score_nonsyn.columns)
score_nonsyn_new_cols = ["Gene", "CHROM", "POS"] + score_nonsyn_cols[3:]
score_nonsyn[score_nonsyn_new_cols]
Gene CHROM POS REF ALT N_INFORMATIVE AF U V STAT DIRECTION EFFECT SE PVALUE FDR-Corrected
0 SLC25A46 5 110739267 C A 2697 0.000742 0.823836 0.599113 1.132850 + 1.375090 1.291950 0.287168 1.0
1 SLC25A46 5 110739354 G A 2697 0.002225 -0.172238 1.856260 0.015982 - -0.092788 0.733974 0.899401 1.0
2 SLC25A46 5 110746300 C A 2697 0.001112 0.048595 0.935050 0.002526 + 0.051971 1.034150 0.959920 1.0
3 SLC25A46 5 110756712 G A 2697 0.003337 2.870400 3.130230 2.632130 + 0.916992 0.565213 0.104721 1.0
4 SLC25A46 5 110761217 G A 2697 0.000185 -0.048355 0.045875 0.050969 - -1.054060 4.668870 0.821385 1.0
5 SLC25A46 5 110761271 G A 2697 0.000185 0.259504 0.191297 0.352031 + 1.356550 2.286360 0.552966 1.0
6 SLC25A46 5 110761292 A G 2697 0.002225 -2.064850 2.313330 1.843060 - -0.892586 0.657478 0.174593 1.0
7 SLC25A46 5 110761301 T G 2697 0.000185 -0.141477 0.120936 0.165506 - -1.169850 2.875560 0.684137 1.0
8 SLC25A46 5 110761662 G T 2697 0.000371 -0.008878 0.130762 0.000603 - -0.067895 2.765410 0.980413 1.0
score_nonsyn[score_nonsyn_new_cols].to_csv("result/score/SLC25A46.filtered.score.SingleScore.FDR.dup_cleaned.nonsyn.assoc", sep="\t", index=False)

7. Compound Heterozygous Fisher Test

1 control and 1 case was found with p.E79K/p.V211M compound heterozygosity (no homozygous found in either groups).

Generate 2x2 matrix

# kernel: R 3.6
data <- matrix(c(1,1049,1,1046), ncol=2)
data
     [,1] [,2]
[1,]    1    1
[2,] 1049 1046
# run fisher test
fisher.test(data)
	Fisher's Exact Test for Count Data

data:  data
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  0.01269833 78.30123556
sample estimates:
odds ratio 
 0.9971414