munge_sumstats.py - ERROR converting summary statistics

Paul Hook

Jun 15, 2018, 5:58:26 PM

to ldsc_users

Hello,

I am currently trying to run munge_sumstats.py on the "49 EUR samples" sumstats from the 2014 Sz GWAS from the PGC consortium (http://www.med.unc.edu/pgc/results-and-downloads) with the file "daner_PGC_SCZ49.sh2_mds10_1000G-frq_2.gz".

When I run munge_sumstats.py I get the following error:

*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.0
* (C) 2014-2015 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call:
./munge_sumstats.py \
--daner \
--out munged/test \
--merge-alleles ../w_hm3.snplist \
--sumstats daner_PGC_SCZ49.sh2_mds10_1000G-frq_2.gz

Inferred that N_cas = 33640.0, N_con = 43456.0 from the FRQ_[A/U] columns.
Interpreting column names as follows:
INFO: INFO score (imputation quality; higher --> better imputation)
A1: Allele 1, interpreted as ref allele for signed sumstat.
P: p-Value
A2: Allele 2, interpreted as non-ref allele for signed sumstat.
SNP: Variant ID (e.g., rs number)
FRQ_U_43456: Allele frequency
OR: Odds ratio (1 --> no effect; above 1 --> A1 is risk increasing)
Reading list of SNPs for allele merge from ../w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from daner_PGC_SCZ49.sh2_mds10_1000G-frq_2.gz into memory 5000000 SNPs at a time.

ERROR converting summary statistics:
Traceback (most recent call last):
File "/home-3/pho...@jhu.edu/my-python-modules/ldsc/munge_sumstats.py", line 686, in munge_sumstats
dat = parse_dat(dat_gen, cname_translation, merge_alleles, log, args)
File "/home-3/pho...@jhu.edu/my-python-modules/ldsc/munge_sumstats.py", line 238, in parse_dat
for block_num, dat in enumerate(dat_gen):
File "/cm/shared/apps/anaconda2/4.4.0/lib/python2.7/site-packages/pandas/io/common.py", line 93, in <lambda>
BaseIterator.next = lambda self: self.__next__()
File "/cm/shared/apps/anaconda2/4.4.0/lib/python2.7/site-packages/pandas/io/parsers.py", line 959, in __next__
return self.get_chunk()
File "/cm/shared/apps/anaconda2/4.4.0/lib/python2.7/site-packages/pandas/io/parsers.py", line 1019, in get_chunk
return self.read(nrows=size)
File "/cm/shared/apps/anaconda2/4.4.0/lib/python2.7/site-packages/pandas/io/parsers.py", line 982, in read
ret = self._engine.read(nrows)
File "/cm/shared/apps/anaconda2/4.4.0/lib/python2.7/site-packages/pandas/io/parsers.py", line 1719, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)
File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11343)
File "pandas/_libs/parsers.pyx", line 989, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:12175)
File "pandas/_libs/parsers.pyx", line 1117, in pandas._libs.parsers.TextReader._convert_column_data (pandas/_libs/parsers.c:14136)
File "pandas/_libs/parsers.pyx", line 1192, in pandas._libs.parsers.TextReader._convert_tokens (pandas/_libs/parsers.c:15475)
ValueError: cannot safely convert passed user dtype of float64 for object dtyped data in column 8

I haven't seen this error mentioned before, but "column 8" in this file is the "INFO" column. I have run this exact setup on files from the PGC before with no issues.

Any idea what this error is pointing to? Do I need to use a previous version of pandas? Is there something wrong with the file?
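One way to narrow this down is to scan the suspect column for values that are not plain, moderately sized numbers. The sketch below is a standalone Python 3 helper, not part of LDSC; the function names `find_unparseable` and `scan_gz` and the 1e300 default limit are assumptions for illustration. As far as I know, pandas counts columns from 0 in this error message, so under the usual daner layout (an assumption about this file) "column 8" would be the ninth field, OR, rather than INFO; it may be worth checking both.

```python
import gzip
import math


def find_unparseable(lines, column_index, limit=1e300):
    """Yield (line_number, raw_value) for data rows whose value in the
    0-based column `column_index` is missing, non-numeric, non-finite,
    or has magnitude at or above `limit`. Line 1 is treated as the header."""
    for line_number, line in enumerate(lines, start=1):
        if line_number == 1:
            continue  # skip the header row
        fields = line.split()
        if column_index >= len(fields):
            yield line_number, ""  # row is too short to have this column
            continue
        raw = fields[column_index]
        try:
            value = float(raw)
        except ValueError:
            yield line_number, raw  # e.g. "NA" or other text
            continue
        if not math.isfinite(value) or abs(value) >= limit:
            yield line_number, raw


def scan_gz(path, column_index, limit=1e300):
    """Hypothetical convenience wrapper: scan a gzipped, whitespace-delimited file."""
    with gzip.open(path, "rt") as handle:
        return list(find_unparseable(handle, column_index, limit))
```

For example, `scan_gz("daner_PGC_SCZ49.sh2_mds10_1000G-frq_2.gz", 8)` would list the offending rows for 0-based column 8, and passing 7 instead would check the eighth field.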

Thanks

Paul

Mike Nalls

Jun 20, 2018, 12:06:51 PM

to ldsc_users

Ran into pretty much the exact same error yesterday after a sysadmin update here at NIH.

Try to roll back pandas...

sudo pip uninstall pandas
sudo pip install pandas==0.18.1

...as any version of pandas after that seems to die.

Paul Hook

Jun 20, 2018, 3:51:53 PM

to ldsc_users

Hi Mike,

Thanks for the response! As with other errors that come up on here, changing the pandas version seemed like a good move. However, when I set up an alternative LDSC environment with an older version of pandas, the error still occurred (though I am not sure I went back far enough).

So, after digging through the forums, I found an explanation of why this is happening and how to fix it here:

https://groups.google.com/forum/#!searchin/ldsc_users/dtype%7Csort:date/ldsc_users/JcrWYLRJm_s/D-jWSF7oBwAJ

As the above post states, pandas/Python cannot read in values >1e300, and some of the odds ratios in my summary statistics file were larger than that.
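As a point of reference for the 1e300 figure, here is a small standard-library check of what a plain Python float (an IEEE 754 double, the same representation as float64) can hold. The helper name `parses_to_finite_float` is made up for this sketch. It only shows the limits of the float type itself; whether the pandas parser in a given version gives up earlier than that is something I am taking from the linked post, not something this snippet demonstrates.

```python
import math
import sys


def parses_to_finite_float(text):
    """Return True if `text` parses as a float that is neither inf nor nan."""
    try:
        return math.isfinite(float(text))
    except ValueError:
        return False


# The largest finite double is roughly 1.8e308, so 1e305 still fits,
# while 1e400 overflows to infinity when parsed by Python itself.
print(sys.float_info.max)
print(parses_to_finite_float("1e305"))
print(parses_to_finite_float("1e400"))
```

So values between 1e300 and about 1.8e308 are representable in principle, which suggests the cutoff is a property of the parser rather than of float64.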

This seems like a ridiculous value for an odds ratio, and I'm not sure why summary statistics would contain ORs that high, so I followed the solution in the above post and removed the SNPs with ORs above that threshold (~10 in the file I was working with, none significant). The code to do that is below:

zcat daner_PGC_SCZ49.sh2_mds10_1000G-frq_2.gz | awk 'NR==1 || $9 < 1e300' - > OR-FILTER_daner_PGC_SCZ49.sh2_mds10_1000G-frq_2.sumstats
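For anyone who would rather not depend on the OR column being field 9, the same filter can be sketched in Python 3 by looking the column up by name in the header. This is a rough equivalent of the awk one-liner, not something from LDSC; the names `filter_large_values` and `filter_gz` are hypothetical, and it assumes a whitespace-delimited file with a header row containing an `OR` column. Unlike the awk version, it writes gzipped output, and it drops rows whose OR value is missing or not numeric.

```python
import gzip


def filter_large_values(lines, column_name="OR", limit=1e300):
    """Yield the header line, then every row whose `column_name` value
    parses as a float strictly below `limit`. Rows with a missing or
    non-numeric value in that column are dropped."""
    iterator = iter(lines)
    header = next(iterator, None)
    if header is None:
        return  # empty input: nothing to yield
    column_index = header.split().index(column_name)
    yield header
    for line in iterator:
        fields = line.split()
        try:
            keep = float(fields[column_index]) < limit
        except (IndexError, ValueError):
            keep = False
        if keep:
            yield line


def filter_gz(in_path, out_path, column_name="OR", limit=1e300):
    """Read a gzipped sumstats file and write the filtered rows, gzipped."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(filter_large_values(src, column_name, limit))
```

A call such as `filter_gz("daner_PGC_SCZ49.sh2_mds10_1000G-frq_2.gz", "OR-FILTER_daner_PGC_SCZ49.sh2_mds10_1000G-frq_2.sumstats.gz")` would produce a file that can then be passed to `--sumstats`; the output filename here is only an example.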