Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

colheaders.tsv format #12

Open
jjfarrell opened this issue Jun 2, 2024 · 6 comments
Open

colheaders.tsv format #12

jjfarrell opened this issue Jun 2, 2024 · 6 comments

Comments

@jjfarrell
Copy link

For the munge -C colheaders.tsv option, what is the format of the colheaders.tsv file? The -c options does not include ourput from GENESIS (https://github.com/UW-GAC/GENESIS).

@freeseek
Copy link
Owner

freeseek commented Jun 3, 2024

Did you check the Column Headers Mappings section in the documentation? It shows how to create that file. If it does not have the column headers from GENESIS let me know

@jjfarrell
Copy link
Author

Thanks! I found that section. The column headers for GENESIS were not there. Their columns had a lot of small caps. So I made a genesis.colheaders.tsv. It looks like this:

rsid SNP
P P
n.obs N
chr CHR
pos BP
Est BETA
alt A1
ref A2
Est.SE SE
eff.frq FRQ
Z Z
mac AC

I assigned the `mac column (minor allele count) to AC which is probably not ideal. A lot of our users and TOPMed researchers use GENESIS so it would be useful to have a -c genesis option.

@jjfarrell
Copy link
Author

jjfarrell commented Jun 3, 2024

@freeseek I ran metal on the 4 summary files (3 with SAIGE output and 1 with this GENESIS output). With just the 3 SAIGE results, I got this output for the APOE variant:
chr19 44908684 rs429358 T C . . . NS:NC:ES:SE:LP:AF:AC:NE:I2:CQ:ED 725785:81165:1.04249:0.0155521:977.642:0.176641:223431:268674:98.6529:32.2391:+++

When I added the GENESIS summary data, I ended up with 3 lines for some reason.

chr19 44908684 rs429358;chr19:44908684:T:C T C . . . NS:ES:SE:LP:AF:AC:I2:CQ:ED 729318:1.03617:0.0150746:1027.89:0.178736:224924:98.0159:31.8383:++++
chr19 44908684 chr19:44908684:T:C T C . . . NS:ES:SE:LP:AF:AC:I2:CQ:ED 3533:0.937911:0.0613045:52.1114:0.211294:1493:.:.:???+
chr19 44908684 chr19:44908684:T:C T C . . . NS:ES:SE:LP:AF:AC:I2:CQ:ED 3533:0.937911:0.0613045:52.1114:0.211294:1493:.:.:???+

Any ideas or suggestions on this? The first line looks correct, not sure why the other 2 lines are in there.

Here is another line.

chr19   44909080        chr19:44909080:G:A      G       A       .       .       .       NS:ES:SE:LP:AF:AC:I2:CQ:ED      3540:-0.459873:0.373812:0.660325:0.00423729:30:.:.:???-
chr19   44909080        chr19:44909080:G:A      G       A       .       .       .       NS:ES:SE:LP:AF:AC:I2:CQ:ED      3540:-0.459873:0.373812:0.660325:0.00423729:30:.:.:???-
chr19   44909080        chr19:44909080:G:A      G       A       .       .       .       NS:ES:SE:LP:AF:AC:I2:CQ:ED      3540:-0.459873:0.373812:0.660325:0.00423729:30:.:.:???-

I can clean them out with grep -v '???-'.

Further checking shows that this is also happening with munging the 3 datasets.

@freeseek
Copy link
Owner

Can you give me a reference for the GENESIS bindings and can you give me a way to reproduce the METAL issue?

@jjfarrell
Copy link
Author

@freeseek I will try go create some smaller gwas_vcfs to try to reproduce the issue with the APOE variant and get back to you. It may have to do with the number of threads being used which I will test out.

For the GENESIS bindings:

https://github.com/UW-GAC/GENESIS/blob/devel/man/assocTestSingle.Rd

\item{variant.id}{The variant ID}
\item{chr}{The chromosome value}
\item{pos}{The base pair position}
\item{allele.index}{The index of the alternate allele. For biallelic variants, this will always be 1.}
\item{n.obs}{The number of samples with non-missing genotypes}
\item{freq}{The estimated effect allele frequency}
\item{MAC}{The minor allele count. For multiallelic variants, "minor" is determined by comparing the count of the allele specified by \code{allele.index} with the sum of all other alleles.}
If \code{geno.coding} is \code{"recessive"}:
\item{n.hom.eff}{The number of samples homozygous for the effect allele.}
If \code{geno.coding} is \code{"dominant"}:
\item{n.any.eff}{The number of samples with any copies of the effect allele.}
If \code{test} is \code{"Score"}:
\item{Score}{The value of the score function}
\item{Score.SE}{The estimated standard error of the Score}
\item{Score.Stat}{The score Z test statistic}
\item{Score.pval}{The score p-value}
\item{Est}{An approximation of the effect size estimate for each additional copy of the effect allele}
\item{Est.SE}{An approximation of the standard error of the effect size estimate}
\item{PVE}{An approximation of the proportion of phenotype variance explained}
% If \code{test} is \code{"Wald"} and \code{GxE} is \code{NULL}:
% \item{Est}{The effect size estimate for each additional copy of the effect allele}
% \item{Est.SE}{The estimated standard error of the effect size estimate}
% \item{Wald.Stat}{The Wald Z test statistic}
% \item{Wald.pval}{The Wald p-value}
If \code{test} is \code{"Score.SPA"}:
\item{SPA.pval}{The score p-value after applying the saddle point approximation (SPA)}
\item{SPA.converged}{logical indiactor of whether the SPA converged; \code{NA} indicates that the SPA was not applied and the original Score.pval was returned}
If \code{GxE} is not \code{NULL}:
\item{Est.G}{The effect size estimate for the genotype term}
\item{Est.G:env}{The effect size estimate for the genotype*env interaction term. There will be as many of these terms as there are interaction variables, and "env" will be replaced with the variable name.}
\item{SE.G}{The estimated standard error of the genotype term effect size estimate}
\item{SE.G:env}{The estimated standard error of the genotype*env effect size estimate. There will be as many of these terms as there are interaction variables, and "env" will be replaced with the variable name.}
\item{GxE.Stat}{The Wald Z test statistic for the test of all genotype interaction terms.  When there is only one genotype interaction term, this is the test statistic for that term.}
\item{GxE.pval}{The Wald p-value for the test of all genotype interaction terms; i.e. the test of any genotype interaction effect}
\item{Joint.Stat}{The Wald Z test statistic for the joint test of the genotype term and all of the genotype interaction terms}
\item{Joint.pval}{The Wald p-value for the joint test of the genotype term and all of the genotype interaction terms; i.e. the test of any genotype effect}
If \code{test} is \code{"BinomiRare" or "CMP"}:
\item{n.carrier}{Number of individuals with at least one copy of the effect allele}
\item{n.D.carrier}{Number of cases with at least one copy of the effect allele}
\item{pval}{p-value}
\item{mid.pval}{mid-p-value}

@freeseek
Copy link
Owner

I don't see rsid, alt, ref, eff.frq, or Z among the GENESIS bindings. Maybe it would be simply better to add some of the bindings you are working with in colheaders.tsv rather than add a -c option for a non-conventional summary statistics format

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants