diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index fa2e2c52..b13f79e6 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.9.4","generation_timestamp":"2024-09-09T02:55:17","documenter_version":"1.7.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.9.4","generation_timestamp":"2024-09-10T02:52:25","documenter_version":"1.7.0"}} \ No newline at end of file diff --git a/dev/index.html b/dev/index.html index 50b88c4a..e4cc3907 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -Home · GhostKnockoffGWAS

GWAS summary statistics analysis via Knockoff-filter

This is a package for performing knockoff-based analysis of GWAS summary statistics. The knockoff-filter finds conditionally independent discoveries while controlling the FDR (false discovery rate) at any specified level.

cover

Warning

This package currently only works on Linux platforms with aarch64 or x86_64 CPUs. We plan to support macOS (x86_64 and aarch64) in the near future, but there are no plans to support Windows.

+Home · GhostKnockoffGWAS

GWAS summary statistics analysis via Knockoff-filter

This is a package for performing knockoff-based analysis of GWAS summary statistics. The knockoff-filter finds conditionally independent discoveries while controlling the FDR (false discovery rate) at any specified level.

cover

Warning

This package currently only works on Linux platforms with aarch64 or x86_64 CPUs. We plan to support macOS (x86_64 and aarch64) in the near future, but there are no plans to support Windows.

diff --git a/dev/man/FAQ/index.html b/dev/man/FAQ/index.html index efcb109f..523f2c83 100644 --- a/dev/man/FAQ/index.html +++ b/dev/man/FAQ/index.html @@ -1,2 +1,2 @@ -FAQ · GhostKnockoffGWAS

Common questions and Answers

Here is a collection of common questions & answers. If you have a question not listed here, do not hesitate to open a new issue on Github.

How do I obtain Z-scores from p-values, effect sizes, odds-ratios...etc?

See the Notes on computing Z-scores section of this blog post.

Is the result trustworthy?

The knockoff filter's FDR guarantees require that the correlation matrices used in the analysis approximate the LD structure of the original GWAS study. Their consistency is measured by the mean_LD_shrinkage parameter in the summary output. This value lies between 0 and 1. Values close to 0 indicate good agreement. Larger values (e.g. >0.1) indicate deviation. Very large values (e.g. >0.25) will cause the program to halt, in which case users should download a different set of precomputed knockoff data instead. See equation 24 of the SuSiE paper for details.

Expected run time?

On roughly 0.6 million Z-scores, our software completed a GhostKnockoff analysis in roughly 15 minutes on a single 2.3GHz CPU. If your analysis is taking much longer than this, please see Q&A on software unexpectedly slow.

Memory requirement?

Our software requires ~9.1GB of RAM for an Alzheimer's Disease analysis with ~0.6 million SNPs.

Software unexpectedly slow?

Because the knockoff pipeline requires reading the pre-computed knockoff statistics sequentially into memory, both the downloaded software and data should be stored on a high-speed (e.g. Lustre) file system. On most HPC clusters, one should store the data in the SCRATCH directory to run GhostKnockoffGWAS.

To check whether I/O is the bottleneck, one can check the CPU usage while GhostKnockoffGWAS is running. For example, one can examine CPU usage via the top or htop command. CPU usage should almost always be at 99% or above.

For undiagnosed performance issues, please file a new issue.

Sex chromosome support?

We currently do not support X/Y/M chromosome analysis.

When will non-European LD files be available?

We will release more pre-processed LD files for download once we have tested and verified the methodology on suitable datasets. For now, users with non-EUR samples must build their own LD files using the solveblock executable; see Customizing LD files.

Admixed samples?

If your study subjects are somewhat admixed, one can try GhostKnockoffGWAS with the most suitable LD files, then check how much deviation there is by examining the LD_shrinkage parameter in the output of GhostKnockoffGWAS; see Is the result trustworthy?.

If your study subjects are extremely admixed, then it is unlikely that GhostKnockoffGWAS will return good results. The main difficulty in enabling analysis for admixed cohorts lies in pre-computing good LD files for admixed subjects. Computing required quantities on the fly is too computationally intensive.

I want to build my own LD files. How do I determine start_bp and end_bp?

In our papers, we defined each start and end position by adapting the quasi-independent regions of ldetect. Given individual-level data, one can compute approximately independent LD blocks directly; see the reference and R software.

+FAQ · GhostKnockoffGWAS

Common questions and Answers

Here is a collection of common questions & answers. If you have a question not listed here, do not hesitate to open a new issue on Github.

How do I obtain Z-scores from p-values, effect sizes, odds-ratios...etc?

See the Notes on computing Z-scores section of this blog post.
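
For instance, if your GWAS output reports two-sided p-values together with signed effect sizes, the conversion described there amounts to the following minimal Julia sketch (the function name pval_to_z is ours, purely for illustration):

    using Distributions    # provides Normal() and quantile()

    # Convert a two-sided p-value and the sign of its effect size into a Z-score.
    # Under the null the resulting Z is N(0, 1), which is what GhostKnockoffGWAS expects.
    pval_to_z(p, beta) = sign(beta) * quantile(Normal(), 1 - p / 2)

    # Example: p = 5e-8 with a positive effect estimate gives Z ≈ 5.45
    z = pval_to_z(5e-8, 0.13)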

Is the result trustworthy?

The knockoff filter's FDR guarantees require that the correlation matrices used in the analysis approximate the LD structure of the original GWAS study. Their consistency is measured by the mean_LD_shrinkage parameter in the summary output. This value lies between 0 and 1. Values close to 0 indicate good agreement. Larger values (e.g. >0.1) indicate deviation. Very large values (e.g. >0.25) will cause the program to halt, in which case users should download a different set of precomputed knockoff data instead. See equation 24 of the SuSiE paper for details.

Expected run time?

On roughly 0.6 million Z-scores, our software completed a GhostKnockoff analysis in roughly 15 minutes on a single 2.3GHz CPU. If your analysis is taking much longer than this, please see Q&A on software unexpectedly slow.

Memory requirement?

Our software requires ~9.1GB of RAM for an Alzheimer's Disease analysis with ~0.6 million SNPs.

Software unexpectedly slow?

Because the knockoff pipeline requires reading the pre-computed knockoff statistics sequentially into memory, both the downloaded software and data should be stored on a high-speed (e.g. Lustre) file system. On most HPC clusters, one should store the data in the SCRATCH directory to run GhostKnockoffGWAS.

To check whether I/O is the bottleneck, one can check the CPU usage while GhostKnockoffGWAS is running. For example, one can examine CPU usage via the top or htop command. CPU usage should almost always be at 99% or above.

For undiagnosed performance issues, please file a new issue.

Sex chromosome support?

We currently do not support X/Y/M chromosome analysis.

When will non-European LD files be available?

We will release more pre-processed LD files for download once we have tested and verified the methodology on suitable datasets. For now, users with non-EUR samples must build their own LD files using the solveblock executable; see Customizing LD files.

Admixed samples?

If your study subjects are somewhat admixed, one can try GhostKnockoffGWAS with the most suitable LD files, then check how much deviation there is by examining the LD_shrinkage parameter in the output of GhostKnockoffGWAS; see Is the result trustworthy?.

If your study subjects are extremely admixed, then it is unlikely that GhostKnockoffGWAS will return good results. The main difficulty in enabling analysis for admixed cohorts lies in pre-computing good LD files for admixed subjects. Computing required quantities on the fly is too computationally intensive.

I want to build my own LD files. How do I determine start_bp and end_bp?

In our papers, we defined each start and end position by adapting the quasi-independent regions of ldetect. Given individual-level data, one can compute approximately independent LD blocks directly; see the reference and R software.

diff --git a/dev/man/developer/index.html b/dev/man/developer/index.html index 4ee5bb69..14c52ade 100644 --- a/dev/man/developer/index.html +++ b/dev/man/developer/index.html @@ -54,4 +54,4 @@ force=true, precompile_execution_file=precompile_script, executables=["GhostKnockoffGWAS"=>"julia_main", "solveblock"=>"julia_solveblock"] -)

The last step takes >15 minutes.

+)

The last step takes >15 minutes.

diff --git a/dev/man/documentation/index.html b/dev/man/documentation/index.html index 5539736f..6768fa9c 100644 --- a/dev/man/documentation/index.html +++ b/dev/man/documentation/index.html @@ -1,2 +1,2 @@ -Documentation · GhostKnockoffGWAS

Command-line documentation and usage of GhostKnockoffGWAS

Usage

Simple run

GhostKnockoffGWAS --zfile example_zfile.txt --LD-files EUR --N 506200 --genome-build 38 --out example_output

Required inputs

Option nameArgumentDescription
--zfileStringInput file containing Z-scores as well as CHR/POS/REF/ALT. See Acceptable Z-score files for detailed requirements on this file.
--LD-filesStringInput directory containing the pre-processed LD files. Most users download this from the Downloads Page
--NIntSample size for target (original) study
--genome-buildIntThe human genome build used for SNP positions in zfile (this value must be 19 or 38)
--outStringOutput file name (without extensions)

Optional inputs

Option nameArgumentDescription
--CHRIntThe column in zfile that will be read as chromosome number (note this must be an integer, e.g. chr22, X, chrX, ...etc are NOT acceptable). [If not specified, we will search for a column with header CHR]
--POSIntThe column in zfile that will be read as SNP position. [If not specified, we will search for a column with header POS]
--REFIntThe column in zfile that will be read as REF (non-effective) allele. [If not specified, we will search for a column with header REF]
--ALTIntThe column in zfile that will be read as ALT (effective allele). [If not specified, we will search for a column with header ALT]
--ZIntThe column in zfile that will be read as Z-scores. [If not specified, we will search for a column with header Z]
--seedIntSets the random seed [If not specified, defaults to 2023]
--verboseBoolWhether to print intermediate messages [If not specified, defaults to true]
--random-shuffleBoolWhether to randomly permute the order of Z-scores and their knockoffs to adjust for potential ordering bias. The main purpose of this option is to guard against potential ordering bias in Lasso solvers. However, in our simulations we never observed such biases, so we turn this off by default. [If not specified, defaults to false]
--skip-shrinkage-checkBoolWhether to allow Knockoff analysis to proceed even with large (>0.25) LD shrinkages [If not specified, defaults to false]

Output format

  1. A summary file, e.g. example_output_summary.txt. This file contains a broad summary of the analysis.
  2. A comma-separated file, e.g. example_output.txt. This file contains the full GhostKnockoffGWAS output, one SNP in each row.
  3. (optional) Manhattan plots, which can be generated by following step 5 of detailed example.

For a more detailed explanation of these 2 files, see the Tutorial.

+Documentation · GhostKnockoffGWAS

Command-line documentation and usage of GhostKnockoffGWAS

Usage

Simple run

GhostKnockoffGWAS --zfile example_zfile.txt --LD-files EUR --N 506200 --genome-build 38 --out example_output

Required inputs

Option nameArgumentDescription
--zfileStringInput file containing Z-scores as well as CHR/POS/REF/ALT. See Acceptable Z-score files for detailed requirements on this file.
--LD-filesStringInput directory containing the pre-processed LD files. Most users download this from the Downloads Page
--NIntSample size for target (original) study
--genome-buildIntThe human genome build used for SNP positions in zfile (this value must be 19 or 38)
--outStringOutput file name (without extensions)

Optional inputs

Option nameArgumentDescription
--CHRIntThe column in zfile that will be read as chromosome number (note this must be an integer, e.g. chr22, X, chrX, ...etc are NOT acceptable). [If not specified, we will search for a column with header CHR]
--POSIntThe column in zfile that will be read as SNP position. [If not specified, we will search for a column with header POS]
--REFIntThe column in zfile that will be read as REF (non-effective) allele. [If not specified, we will search for a column with header REF]
--ALTIntThe column in zfile that will be read as ALT (effective allele). [If not specified, we will search for a column with header ALT]
--ZIntThe column in zfile that will be read as Z-scores. [If not specified, we will search for a column with header Z]
--seedIntSets the random seed [If not specified, defaults to 2023]
--verboseBoolWhether to print intermediate messages [If not specified, defaults to true]
--random-shuffleBoolWhether to randomly permute the order of Z-scores and their knockoffs to adjust for potential ordering bias. The main purpose of this option is to guard against potential ordering bias in Lasso solvers. However, in our simulations we never observed such biases, so we turn this off by default. [If not specified, defaults to false]
--skip-shrinkage-checkBoolWhether to allow Knockoff analysis to proceed even with large (>0.25) LD shrinkages [If not specified, defaults to false]
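
For example, if the Z-score file does not use the default headers, one can point the parser at the relevant columns explicitly and fix the seed (the column numbers 1-5 below are purely illustrative; only documented flags are used):

GhostKnockoffGWAS --zfile example_zfile.txt --LD-files EUR --N 506200 --genome-build 38 --out example_output --CHR 1 --POS 2 --REF 3 --ALT 4 --Z 5 --seed 2024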

Output format

  1. A summary file, e.g. example_output_summary.txt. This file contains a broad summary of the analysis.
  2. A comma-separated file, e.g. example_output.txt. This file contains the full GhostKnockoffGWAS output, one SNP in each row.
  3. (optional) Manhattan plots, which can be generated by following step 5 of detailed example.

For a more detailed explanation of these 2 files, see the Tutorial.

diff --git a/dev/man/download/index.html b/dev/man/download/index.html index f584f3fc..19e8bbf5 100644 --- a/dev/man/download/index.html +++ b/dev/man/download/index.html @@ -1,2 +1,2 @@ -Downloads · GhostKnockoffGWAS

Downloads page

Here is the main downloads page. New software and pre-processed knockoff data will be released here.

Software

Operating Systemv0.2.2 (June 28th, 2024)
Linux 64-bitDownload

After unzipping, the executable will be located inside bin/GhostKnockoffGWAS. We recommend adding the folder containing the GhostKnockoffGWAS executable to PATH for easier access.

Pre-processed LD files

PopulationLinkNumber of SNPsDescription
EUR (Europeans)download (7.5GB)650826See Note 1
ASN (East Asians)TBD
AFR (Africans)TBD
AMR (Admixed Americans)TBD
  • Note 1: This file contains pre-processed LD files generated from the typed SNPs of the EUR cohort of the Pan-UKB panel. The quasi-independent regions were obtained by directly adapting the output of ldetect.
+Downloads · GhostKnockoffGWAS

Downloads page

Here is the main downloads page. New software and pre-processed knockoff data will be released here.

Software

Operating Systemv0.2.2 (June 28th, 2024)
Linux 64-bitDownload

After unzipping, the executable will be located inside bin/GhostKnockoffGWAS. We recommend adding the folder containing the GhostKnockoffGWAS executable to PATH for easier access.

Pre-processed LD files

PopulationLinkNumber of SNPsDescription
EUR (Europeans)download (7.5GB)650826See Note 1
ASN (East Asians)TBD
AFR (Africans)TBD
AMR (Admixed Americans)TBD
  • Note 1: This file contains pre-processed LD files generated from the typed SNPs of the EUR cohort of the Pan-UKB panel. The quasi-independent regions were obtained by directly adapting the output of ldetect.
diff --git a/dev/man/examples/index.html b/dev/man/examples/index.html index df850a24..953ea434 100644 --- a/dev/man/examples/index.html +++ b/dev/man/examples/index.html @@ -64,4 +64,4 @@ rs4535687,0.15927,7,G,C,41892,chr7_start16161_end972751_group1_0,-1.17940334810126,0.0,0,0.0,0.0,1.0,0.23823760256835697,0,0,0,0 rs62429406,0.031058,7,T,G,43748,chr7_start16161_end972751_group2_0,0.636126444862832,0.0,0,0.0,0.0,1.0,0.5246940103826294,0,0,0,0 rs117163387,0.034958,7,C,T,43961,chr7_start16161_end972751_group3_0,-0.548757491205702,0.0,0,0.0,0.0,1.0,0.5831718861307663,0,0,0,0 -rs4247525,0.040199,7,T,C,44167,chr7_start16161_end972751_group4_0,0.463442453535633,0.0,0,0.0,0.0,1.0,0.6430472544316368,0,0,0,0

The first row is a header row. Each subsequent row corresponds to a SNP that was used in the analysis.

Note

Sometimes it is useful to determine the number of conditionally independent discoveries according to the knockoff procedure. In this case, one should count the number of unique groups that contain the discovered SNPs. In this example, when the target FDR is $10\%$, there are 15 SNPs with knockoff q-values less than 0.1, and they reside in 11 unique groups. Thus, the knockoff procedure claims there are at least 11 distinct (conditionally independent) causal variables.

Step 5: Generating Manhattan plots

We can generate Manhattan plots by running this R script in the terminal (this requires the R packages data.table, plyr, dplyr, CMplot). Usage:

$ Rscript --vanilla manhattan.R arg1 arg2 arg3 arg4

For example,

$ Rscript --vanilla manhattan.R example_output.txt . example_plot 0.1

This produced the following plots

knockoff_manhattan marginal_manhattan

Explanation:

+rs4247525,0.040199,7,T,C,44167,chr7_start16161_end972751_group4_0,0.463442453535633,0.0,0,0.0,0.0,1.0,0.6430472544316368,0,0,0,0

The first row is a header row. Each subsequent row corresponds to a SNP that was used in the analysis.

Note

Sometimes it is useful to determine the number of conditionally independent discoveries according to the knockoff procedure. In this case, one should count the number of unique groups that contain the discovered SNPs. In this example, when the target FDR is $10\%$, there are 15 SNPs with knockoff q-values less than 0.1, and they reside in 11 unique groups. Thus, the knockoff procedure claims there are at least 11 distinct (conditionally independent) causal variables.
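
A minimal Julia sketch of this counting, reading the comma-separated output file, is shown here for illustration (the column names qvals and group are assumptions; check the header row of your own output file):

    using CSV, DataFrames

    df = CSV.read("example_output.txt", DataFrame)      # full GhostKnockoffGWAS output, one SNP per row

    # keep SNPs whose knockoff q-value passes the target FDR, then count the distinct groups they fall into
    target_fdr  = 0.1
    discoveries = df[df.qvals .< target_fdr, :]          # `qvals` is an assumed column name
    n_groups    = length(unique(discoveries.group))      # `group` is an assumed column name
    println(nrow(discoveries), " SNPs pass FDR ", target_fdr, " in ", n_groups, " unique groups")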

Step 5: Generating Manhattan plots

We can generate Manhattan plots by running this R script in the terminal (this requires the R packages data.table, plyr, dplyr, CMplot). Usage:

$ Rscript --vanilla manhattan.R arg1 arg2 arg3 arg4

For example,

$ Rscript --vanilla manhattan.R example_output.txt . example_plot 0.1

This produced the following plots

knockoff_manhattan marginal_manhattan

Explanation:

diff --git a/dev/man/gallery/index.html b/dev/man/gallery/index.html index 6a7da660..209fc904 100644 --- a/dev/man/gallery/index.html +++ b/dev/man/gallery/index.html @@ -1,2 +1,2 @@ -Gallery · GhostKnockoffGWAS

Gallery

We applied GhostKnockoffGWAS to 400+ phenotypes curated by Mike Gloudemans, available here. Below we showcase some of these results, limiting the study population to EUR ancestry.

...coming soon

+Gallery · GhostKnockoffGWAS

Gallery

We applied GhostKnockoffGWAS to 400+ phenotypes curated by Mike Gloudemans, available here. Below we showcase some of these results, limiting the study population to EUR ancestry.

...coming soon

diff --git a/dev/man/intro/index.html b/dev/man/intro/index.html index e0758546..f3b10be4 100644 --- a/dev/man/intro/index.html +++ b/dev/man/intro/index.html @@ -1,4 +1,4 @@ Introduction · GhostKnockoffGWAS

Introduction

This package conducts knockoff-based inference to perform genome-wide conditional independence tests based on GWAS summary statistics. The methodology is described in the following papers:

Chen Z, He Z, Chu BB, Gu J, Morrison T, Sabatti C, Candes C. "Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression", arXiv preprint arXiv:2402.12724 (2024); doi: https://doi.org/10.48550/arXiv.2402.12724

Chu BB, Gu J, Chen Z, Morrison T, Candes E, He Z, Sabatti C. (2023). Second-order group knockoffs with applications to GWAS. arXiv preprint arXiv:2310.15069; doi: https://doi.org/10.48550/arXiv.2310.15069

He Z, Chu BB, Yang J, Gu J, Chen Z, Liu L, Morrison T, Bellow M, Qi X, Hejazi N, Mathur M, Le Guen Y, Tang H, Hastie T, Ionita-laza I, Sabatti C, Candes C. "In silico identification of putative causal genetic variants", bioRxiv, 2024.02.28.582621; doi: https://doi.org/10.1101/2024.02.28.582621

The main working assumption is that we do not have access to individual-level genotype or phenotype data. Rather, for each SNP, we have its Z-score with respect to some phenotype from a GWAS, and access to LD (linkage disequilibrium) data. The user is expected to supply the Z-scores, while we supply the LD data along with some pre-computed knockoff data.

Q: When should I use GhostKnockoffGWAS?

Answer: If you have already conducted a GWAS, have an output file that includes Z-scores (or equivalent) for each SNP, and there exist pre-processed LD files on the downloads page whose listed population matches the ethnicities of your original GWAS study.

  • If your original study had few (e.g. <5) discoveries, then GhostKnockoffGWAS may not give better results. The methodology works better for more polygenic traits.
  • If your study subjects are somewhat admixed, one can try using the most suitable LD files, and check how much deviation there is from the LD files by examining the LD_shrinkage parameter in the output of GhostKnockoffGWAS; see this FAQ.
  • If instead you have individual level genotypes, you should run a GWAS using standard tools (e.g. PLINK, BOLT, GCTA, SAIGE, GEMMA, ...etc) before running GhostKnockoffGWAS.

Quick Start

Most users are expected to follow this workflow. Detailed explanations for each step are available in the Tutorial.

  1. Go to Download Page and download (1) the software and (2) the pre-processed LD files. For example,

     wget https://github.com/biona001/GhostKnockoffGWAS/releases/download/v0.2.2/app_linux_x86.tar.gz
      wget https://zenodo.org/records/10433663/files/EUR.zip
  2. Unzip them both:

     tar -xvzf app_linux_x86.tar.gz
    - unzip EUR.zip  # decompresses to ~8.7GB
  3. Prepare your input Z score file into accepted format, see Acceptable Z-scores. A toy example can be downloaded by:

     wget https://github.com/biona001/GhostKnockoffGWAS/raw/main/data/example_zfile.txt
  4. Run the executable

     app_linux_x86/bin/GhostKnockoffGWAS --zfile example_zfile.txt --LD-files EUR --N 506200 --genome-build 38 --out example_output
  5. Make Manhattan plot with this R script. See step 5 in Tutorial for more details.

Those familiar with the Julia programming language can use GhostKnockoffGWAS as a regular julia package, see usage within Julia.

More general knockoff constructions

If you are interested in the broader knockoff methodology, not necessarily based on GWAS summary statistics, see for example

+ unzip EUR.zip # decompresses to ~8.7GB
  • Prepare your input Z score file into accepted format, see Acceptable Z-scores. A toy example can be downloaded by:

     wget https://github.com/biona001/GhostKnockoffGWAS/raw/main/data/example_zfile.txt
  • Run the executable

     app_linux_x86/bin/GhostKnockoffGWAS --zfile example_zfile.txt --LD-files EUR --N 506200 --genome-build 38 --out example_output
  • Make Manhattan plot with this R script. See step 5 in Tutorial for more details.

  • Those familiar with the Julia programming language can use GhostKnockoffGWAS as a regular julia package, see usage within Julia.

    More general knockoff constructions

    If you are interested in the broader knockoff methodology, not necessarily based on GWAS summary statistics, see for example

    diff --git a/dev/man/julia/index.html b/dev/man/julia/index.html index ce430197..27e1d9a3 100644 --- a/dev/man/julia/index.html +++ b/dev/man/julia/index.html @@ -30,4 +30,4 @@ [min_hwe=0.0], [force_block_diag=true], [method::String = "maxent"], [linkage::String="average"], [force_contiguous::Bool=false], [group_cor_cutoff::Float64=0.5], - [group_rep_cutoff::Float64=0.5], [verbose=true])

    Solves the group knockoff optimization problem on provided individual-level data and outputs the result into outdir. All variants that reside on chromosome chr with position between start_bp and end_bp (inclusive) will be included.

    Note on large VCF files

    Currently, reading/parsing a VCF file is a single-threaded operation (even if it is indexed). Thus, we strongly recommend converting to binary PLINK format.

    Inputs

    Optional inputs (for group knockoff optimization)

    output

    Calling solve_blocks will create 3 files in the directory outdir/chr:

    source + [group_rep_cutoff::Float64=0.5], [verbose=true])

    Solves the group knockoff optimization problem on provided individual-level data and outputs the result into outdir. All variants that reside on chromosome chr with position between start_bp and end_bp (inclusive) will be included.

    Note on large VCF files

    Currently, reading/parsing a VCF file is a single-threaded operation (even if it is indexed). Thus, we strongly recommend converting to binary PLINK format.

    Inputs

    Optional inputs (for group knockoff optimization)

    output

    Calling solve_blocks will create 3 files in the directory outdir/chr:

    source diff --git a/dev/man/solveblocks/index.html b/dev/man/solveblocks/index.html index 4a93ee12..5e230fa0 100644 --- a/dev/man/solveblocks/index.html +++ b/dev/man/solveblocks/index.html @@ -3,4 +3,4 @@ solveblock --file test.vcf.gz --chr 1 --start_bp 10583 --end_bp 1892607 --outdir ./test_LD_files --genome-build 19 # PLINK format -solveblock --file test.bed --chr 1 --start_bp 10583 --end_bp 1892607 --outdir ./test_LD_files --genome-build 19

    Required inputs

    Option nameArgumentDescription
    --fileStringA VCF or binary PLINK file storing individual level genotypes. Must end in .vcf, .vcf.gz, or .bed. If a VCF file is used, the ALT field for each record must be unique, i.e. multiallelic records must be split first. Missing genotypes will be imputed by column mean.
    --chrIntTarget chromosome. This MUST be an integer and it must match the CHROM field in your VCF/PLINK file. For example, if your VCF file has a CHROM field like chr1, CHR1, or CHROM1, it must be renamed to 1.
    --start_bpIntstarting basepair (position)
    --end_bpIntending basepair (position)
    --outdirStringDirectory that the output will be stored in (must exist)
    --genome-buildInthuman genome build for position of each SNP, must be 19 (hg19) or 38 (hg38)

    Optional inputs

    Option nameArgumentDescription
    --tolFloat64Convergence tolerance for group knockoff coordinate descent optimization (default 0.0001)
    --min_mafFloat64Minimum minor allele frequency for a variable to be considered (default 0.01)
    --min_hweIntCutoff for Hardy-Weinberg equilibrium p-values. Only SNPs with p-value >= min_hwe will be included (default 0.0)
    --methodStringgroup knockoff optimization algorithm, choices include maxent (default), mvr, sdp, or equi. See sec 2 of https://arxiv.org/abs/2310.15069
    --linkageStringLinkage function to use for defining group membership. It defines how the distances between features are aggregated into the distances between groups. Valid choices include average (default), single, complete, ward, and ward_presquared. Note if force_contiguous=true, linkage must be :single
    --force_contiguousBoolwhether to force groups to be contiguous (default false). Note if force_contiguous=true, linkage must be :single
    --group_cor_cutoffFloat64correlation cutoff value for defining groups (default 0.5). Value should be between 0 and 1, where larger values correspond to larger groups.
    --group_rep_cutoffFloat64cutoff value for selecting group-representatives (default 0.5). Value should be between 0 and 1, where larger values correspond to more representatives per group.
    --verboseBoolWhether to print intermediate messages (default true)

    Output format

    Calling solveblock will create 3 files in the directory outdir/chr (the chr directory will be created if it doesn't exist, but outdir must exist):

    1. XXX.h5: This contains the following data for region XXX
    2. Info_XXX.csv: This includes information for each variant (chr/pos/etc) present in the corresponding .h5 file.
    3. summary_XXX.csv: Summary file for the knockoff optimization problem

    Determining start_bp and end_bp

    A note on run-time

    Because VCF files are plain text files, they are inherently slow to read even when indexed. Thus, we recommend converting VCFs to binary PLINK format via PLINK 1.9:

    $plink_exe --vcf $vcffile --double-id --keep-allele-order --real-ref-alleles --make-bed --out $plinkprefix

    Note that --keep-allele-order is crucial to prevent PLINK from automatically recoding the minor allele as A1.

    +solveblock --file test.bed --chr 1 --start_bp 10583 --end_bp 1892607 --outdir ./test_LD_files --genome-build 19

    Required inputs

    Option nameArgumentDescription
    --fileStringA VCF or binary PLINK file storing individual level genotypes. Must end in .vcf, .vcf.gz, or .bed. If a VCF file is used, the ALT field for each record must be unique, i.e. multiallelic records must be split first. Missing genotypes will be imputed by column mean.
    --chrIntTarget chromosome. This MUST be an integer and it must match the CHROM field in your VCF/PLINK file. For example, if your VCF file has a CHROM field like chr1, CHR1, or CHROM1, it must be renamed to 1.
    --start_bpIntstarting basepair (position)
    --end_bpIntending basepair (position)
    --outdirStringDirectory that the output will be stored in (must exist)
    --genome-buildInthuman genome build for position of each SNP, must be 19 (hg19) or 38 (hg38)

    Optional inputs

    Option nameArgumentDescription
    --tolFloat64Convergence tolerance for group knockoff coordinate descent optimization (default 0.0001)
    --min_mafFloat64Minimum minor allele frequency for a variable to be considered (default 0.01)
    --min_hweIntCutoff for Hardy-Weinberg equilibrium p-values. Only SNPs with p-value >= min_hwe will be included (default 0.0)
    --methodStringgroup knockoff optimization algorithm, choices include maxent (default), mvr, sdp, or equi. See sec 2 of https://arxiv.org/abs/2310.15069
    --linkageStringLinkage function to use for defining group membership. It defines how the distances between features are aggregated into the distances between groups. Valid choices include average (default), single, complete, ward, and ward_presquared. Note if force_contiguous=true, linkage must be :single
    --force_contiguousBoolwhether to force groups to be contiguous (default false). Note if force_contiguous=true, linkage must be :single
    --group_cor_cutoffFloat64correlation cutoff value for defining groups (default 0.5). Value should be between 0 and 1, where larger values correspond to larger groups.
    --group_rep_cutoffFloat64cutoff value for selecting group-representatives (default 0.5). Value should be between 0 and 1, where larger values correspond to more representatives per group.
    --verboseBoolWhether to print intermediate messages (default true)

    Output format

    Calling solveblock will create 3 files in the directory outdir/chr (the chr directory will be created if it doesn't exist, but outdir must exist):

    1. XXX.h5: This contains the following data for region XXX
    2. Info_XXX.csv: This includes information for each variant (chr/pos/etc) present in the corresponding .h5 file.
    3. summary_XXX.csv: Summary file for the knockoff optimization problem

    Determining start_bp and end_bp

    A note on run-time

    Because VCF files are plain text files, they are inherently slow to read even when indexed. Thus, we recommend converting VCFs to binary PLINK format via PLINK 1.9:

    $plink_exe --vcf $vcffile --double-id --keep-allele-order --real-ref-alleles --make-bed --out $plinkprefix

    Note that --keep-allele-order is crucial to prevent PLINK from automatically recoding the minor allele as A1.

    diff --git a/dev/man/video/index.html b/dev/man/video/index.html index 701029c2..ef6341f7 100644 --- a/dev/man/video/index.html +++ b/dev/man/video/index.html @@ -1,2 +1,2 @@ -Video Tutorials · GhostKnockoffGWAS
    +Video Tutorials · GhostKnockoffGWAS
    diff --git a/dev/man/zfile/index.html b/dev/man/zfile/index.html index 615d13b8..bb5644bf 100644 --- a/dev/man/zfile/index.html +++ b/dev/man/zfile/index.html @@ -8,4 +8,4 @@ 17 152104 G A -0.28387322965385 17 152248 G A 0.901618600934489 17 152427 G A 1.10987516000804 -17 152771 A G 0.708492545266136

    A toy example is example_zfile.txt (17MB).

    Tip

    Missing Z scores can be specified as NaN or as an empty cell. If you do not want a SNP to be considered in the analysis, you can change its Z-score to NaN. The CHR/POS/REF/ALT fields cannot have missing values.

    Requirements on the input Z-scores

    In our papers, Z-scores are defined by $z = \frac{1}{\sqrt{N}}X^Ty$ where $X$ is the $N \times P$ standardized genotype matrix with $N$ samples and $P$ SNPs, $y$ is the normalized $N \times 1$ phenotype vector, and these Z-scores have a $N(0, 1)$ distribution under the null.

    In practice, this paper shows that other association test statistics that are $N(0, 1)$ under the null also result in FDR control. This includes commonly used tests in genetic association studies such as:

    If you have p-values, effect sizes, odds ratios, etc., converting them into Z-scores might be possible, for example by following the Notes on computing Z-scores section of this blog post.

    +17 152771 A G 0.708492545266136

    A toy example is example_zfile.txt (17MB).

    Tip

    Missing Z scores can be specified as NaN or as an empty cell. If you do not want a SNP to be considered in the analysis, you can change its Z-score to NaN. The CHR/POS/REF/ALT fields cannot have missing values.
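
    As an illustration only, the following Julia sketch masks the Z-scores of SNPs one wishes to exclude; it assumes a whitespace-delimited zfile with headers CHR/POS/REF/ALT/Z as in the toy example, and the exclusion list itself is hypothetical:

        using CSV, DataFrames

        zfile = CSV.read("example_zfile.txt", DataFrame; delim=' ', ignorerepeated=true)

        # hypothetical chromosome-17 positions that should not enter the analysis
        exclude_pos = Set([152104, 152248])
        mask = (zfile.CHR .== 17) .& in.(zfile.POS, Ref(exclude_pos))
        zfile.Z[mask] .= NaN                  # excluded SNPs keep their CHR/POS/REF/ALT values

        CSV.write("example_zfile_masked.txt", zfile; delim=' ')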

    Requirements on the input Z-scores

    In our papers, Z-scores are defined by $z = \frac{1}{\sqrt{N}}X^Ty$ where $X$ is the $N \times P$ standardized genotype matrix with $N$ samples and $P$ SNPs, $y$ is the normalized $N \times 1$ phenotype vector, and these Z-scores have a $N(0, 1)$ distribution under the null.
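
    For reference only (this is not part of GhostKnockoffGWAS), a minimal Julia sketch of this definition, given an individual-level genotype matrix and phenotype vector, is:

        using Statistics

        # X: N × P genotype matrix, y: length-N phenotype vector
        function summary_zscores(X, y)
            N  = size(X, 1)
            Xs = (X .- mean(X, dims=1)) ./ std(X, dims=1)   # column-standardize genotypes
            ys = (y .- mean(y)) ./ std(y)                    # normalize phenotype
            return Xs' * ys ./ sqrt(N)                       # z = Xᵀy / √N, approximately N(0,1) under the null
        end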

    In practice, this paper shows that other association test statistics that are $N(0, 1)$ under the null also result in FDR control. This includes commonly used tests in genetic association studies such as:

    If you have p-values, effect sizes, odds ratios, etc., converting them into Z-scores might be possible, for example by following the Notes on computing Z-scores section of this blog post.