Error while running munge_sumstats.py

남심현만

Jun 29, 2018, 11:31:29 AM
to ldsc_users
Hello, sorry to post a question on a Friday afternoon.

While running munge_sumstats.py, I got an error about the median value of beta.

The error message is as follows.


Interpreting column names as follows:
a1:     Allele 1, interpreted as ref allele for signed sumstat.
pval:   p-Value
beta:   [linear/logistic] regression coefficient (0 --> no effect; above 0 --> A1 is trait/risk increasing)
snpid:  Variant ID (e.g., rs number)
a2:     Allele 2, interpreted as non-ref allele for signed sumstat.

Reading list of SNPs for allele merge from YY.assoc1.snp.snplist
Read 25361 SNPs for allele merge.
Reading sumstats from YY.assoc1.ld.txt into memory 5000000 SNPs at a time.
. done
Read 25361 SNPs from --sumstats file.
Removed 0 SNPs not in --merge-alleles.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 247 variants that were not SNPs or were strand-ambiguous.
25114 SNPs remain.
Removed 0 SNPs with duplicated rs numbers (25114 SNPs remain).
Using N = 25361.0

ERROR converting summary statistics:

Traceback (most recent call last):
  File "C:/Users/inha/ldsc/munge_sumstats.py", line 701, in munge_sumstats
    check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
  File "C:/Users/inha/ldsc/munge_sumstats.py", line 373, in check_median
    raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of beta is 0.27 (should be close to 0). This column may be mislabeled.


Conversion finished at Fri Jun 29 15:08:42 2018
Total time elapsed: 0.24s
Traceback (most recent call last):
  File "C:/Users/inha/ldsc/munge_sumstats.py", line 746, in <module>
    munge_sumstats(parser.parse_args(), p=True)
  File "C:/Users/inha/ldsc/munge_sumstats.py", line 701, in munge_sumstats
    check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
  File "C:/Users/inha/ldsc/munge_sumstats.py", line 373, in check_median
    raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of beta is 0.27 (should be close to 0). This column may be mislabeled.


I don't understand why ldsc calculates the median value of beta and produces this error.

I hope someone can explain why this error occurs and suggest a solution.

Raymond Walters

Jul 12, 2018, 7:12:54 PM
to 남심현만, ldsc_users
Hi,

The median beta is computed largely as a sanity check that the file has been parsed as expected. As indicated in the log under "Interpreting column names..." munge_sumstats.py is checking that the column labelled "beta" has a median value near its expected null value of 0. This is a relatively reasonable assumption if the betas are randomly distributed around zero (as is assumed by the ldsc model).
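To make the check concrete, here is a minimal sketch of the kind of test the traceback points to. It is not the actual ldsc source: the function name check_median_sketch and the exact comparison are assumptions, inferred only from the call check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname) and the error message in the log above.

```python
import statistics


def check_median_sketch(values, expected_median, tolerance, name):
    """Raise ValueError if the median of `values` is far from `expected_median`.

    Hypothetical stand-in for the check in munge_sumstats.py; the comparison
    used here (absolute difference greater than tolerance) is an assumption.
    """
    m = statistics.median(values)
    if abs(m - expected_median) > tolerance:
        raise ValueError(
            "median value of {} is {} (should be close to {})".format(
                name, round(m, 2), expected_median
            )
        )
    return m


# Betas scattered around zero pass the check.
print(check_median_sketch([-0.2, 0.0, 0.2], 0, 0.1, "beta"))

# Betas that are all positive, as in the log above, fail it.
try:
    check_median_sketch([0.2, 0.27, 0.3], 0, 0.1, "beta")
except ValueError as err:
    print(err)
```

With a tolerance of 0.1 and an expected null of 0, a median of 0.27 falls outside the accepted range, which matches the error reported in the log.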

A few things you might look at to verify whether something problematic is actually happening:

1) Check whether your GWAS results file has been set up to always report a1 as the trait-increasing allele. If that's the case we no longer expect the median to be zero, and you'll want to specify the "--a1-inc" flag for munge_sumstats.py.

2) Manually compute the median beta of your GWAS and verify you get a similar result. This is intended to make sure that the beta=0.27 being observed is an accurate description of your file and not an indication that it's being parsed incorrectly (for example if it's reading minor allele frequencies rather than the betas).

3) Verify your full genome-wide data is getting input to ldsc. The log reports only 25,361 input SNPs, which is very small. (If not for the median beta error, you would have ended up with a warning that the number of SNPs is smaller than expected.) Normally we'd anticipate a few million variants in a standard GWAS with imputed data, and ~1 million remaining after standard --merge-alleles. It's possible that the small number of SNPs is directly contributing to the median beta by allowing more sampling variation in the median (though this is unlikely to be the full explanation). Either way, I'd suggest checking some of the other threads on this group about the 200K SNP warning for more information about why GWAS with a limited number of SNPs may be problematic.

4) Evaluate the scaling of your betas. If your betas are for an unstandardized phenotype with a large variance, then 0.27 may not be meaningfully different from zero. That is, a beta of 0.27 means very different things when the phenotype has a standard deviation of 1 than when it has a standard deviation of 1000. The munge_sumstats.py script doesn't currently infer this scaling, and just checks whether the median is within 0.1 of the expected null value. If your betas are on a much more variable scale, then you might consider either standardizing the betas for the phenotypic variance before inputting to ldsc or editing munge_sumstats.py (line 701, in the call to check_median(), where the 0.1 tolerance is passed) to loosen the tolerance on the check.
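For the manual check in point 2, something like the following sketch may be enough. It assumes the sumstats file is whitespace-delimited with a header row containing a column named "beta"; the helper name median_of_column and the toy rows are made up for illustration, so adjust the parsing if your file uses a different layout.

```python
import io
import statistics


def median_of_column(lines, column):
    """Return the median of a numeric column in whitespace-delimited text.

    `lines` is any iterable of text lines (for example an open file);
    the first non-empty line is treated as the header.
    """
    rows = (line.split() for line in lines if line.strip())
    header = [name.lower() for name in next(rows)]
    idx = header.index(column.lower())
    values = []
    for fields in rows:
        try:
            values.append(float(fields[idx]))
        except (ValueError, IndexError):
            continue  # skip missing or non-numeric entries such as "NA"
    return statistics.median(values)


# Toy data only; these rows are invented for the example.
toy = io.StringIO(
    "snpid a1 a2 beta pval\n"
    "rs1 A G 0.30 0.01\n"
    "rs2 C T 0.25 0.20\n"
    "rs3 G A NA 0.50\n"
    "rs4 T C 0.27 0.03\n"
)
m = median_of_column(toy, "beta")
print(m, abs(m) <= 0.1)
```

To run it on a real file, open the file and pass the handle, e.g. `with open("YY.assoc1.ld.txt") as fh: print(median_of_column(fh, "beta"))`. If the result is close to the 0.27 in the log, the column is being parsed as expected and the question becomes why the betas are shifted (points 1, 3 and 4); if it is very different, the column mapping is the more likely culprit.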

Hope that helps!

Cheers,
Raymond


