Sumstats QC filtering

76 views
Skip to first unread message

Daniel Rosoff

unread,
Apr 18, 2023, 6:47:52 PM4/18/23
to Genomic SEM Users
Another question that has just come up.  In the sumstats log prior to multivariate common factor analysis, the numbers don't seem to add up - but maybe there are QC deletions not mentioned or some mentioned are duplicated.  I am mostly concerned about number of SNPs remaining in final multivariate file exceeding SNPs remaining in one of the GWAS.  Actually, I see that  my GSEM common factor projects done earlier than last 2 months actually tally correctly, but not those done recently: 

Preparing summary statistics for file: logTGeas.txt
Interpreting the SNP column as the SNP column.
Interpreting the A1 column as the A1 column.
Interpreting the A2 column as the A2 column.
Interpreting the BETA column as the effect column.
Interpreting the P column as the P column.
Interpreting the N column as the N column.
Interpreting the MAF column as the MAF column.
Interpreting the SE column as the SE column.
73890 rows were removed from the logTGeas.txt summary statistics file due to entries that were duplicated across rsID, A1, and A2.
Merging file: logTGeas.txt with the reference file: ref.1000G.EAS.0.005.nox.txt
15181672 rows present in the full logTGeas.txt summary statistics file.
6832347 rows were removed from the logTGeas.txt summary statistics file as the rsIDs for these SNPs were not present in the reference file.
The effect column was determined NOT to be coded as an odds ratio (OR) for the logTGeas.txt summary statistics file based on the median of the effect column being close to 0.
352107 row(s) were removed from thelogTGeas.txt summary statistics file due to the effect allele (A1) column not matching A1 or A2 in the reference file.
435334 row(s) were removed from the logTGeas.txt summary statistics file due to the other allele (A2) column not matching A1 or A2 in the reference file.
No INFO column, cannot filter on INFO, which may influence results
7558217 SNPs are left in the summary statistics file logTGeas.txt after QC and merging with the reference file
.

As for the GWAS with fewer left than in multivariate file:

...
7113128 SNPs are left in the summary statistics file Feasrevised.txt after QC and merging with the reference file.

But
After merging across all summary statistics using listwise deletion, performing QC, and merging with the reference file, there are 7147787 SNPs left in the final multivariate summary statistics file

Thanks for your help!   Dan

agro...@gmail.com

unread,
Apr 19, 2023, 3:12:56 PM4/19/23
to Genomic SEM Users
Do you have any duplicate rsIDs in the final sumstats output file? It could be that one of your input files has multi-allelic variants as sumstats currently only prunes out rows that are duplicated across rsID, A2, and A2 that is then causing the total number of rows to increase when merged by rsID. 

If that turns out to be the problem I'll put in an update, if not I can think more on this. 

Daniel Rosoff

unread,
Apr 20, 2023, 12:38:42 PM4/20/23
to Genomic SEM Users
Yes, there were duplicates in the final sumstats output file.  As regards EAS populations, I used the 1000G East Asian reference panel from FUMA, but unlike the EUR, there are duplicated variants, same SNPid, different alleles, different MAFs.  which should come out in the wash but don't because both may also in the East Asian GWASs.  So here is my final sumstats (one variant), with duplicate problem compounded over 3 factors:

             SNP CHR       BP       MAF A1 A2 beta.FIeasRev se.FIeasRev beta.logTGeasRev
1168 rs141222684  21 15209936 0.2519841  A  C  -0.005849399  0.01606635      0.004889387
1169 rs141222684  21 15209936 0.2519841  A  C  -0.005849399  0.01606635      0.004889387
1170 rs141222684  21 15209936 0.2519841  A  C  -0.005849399  0.01606635     -0.021940527
1171 rs141222684  21 15209936 0.2519841  A  C  -0.005849399  0.01606635     -0.021940527
1172 rs141222684  21 15209936 0.2212302  T  C  -0.005849399  0.01606635      0.004889387
1173 rs141222684  21 15209936 0.2212302  T  C  -0.005849399  0.01606635      0.004889387
1174 rs141222684  21 15209936 0.2212302  T  C  -0.005849399  0.01606635     -0.021940527
1175 rs141222684  21 15209936 0.2212302  T  C  -0.005849399  0.01606635     -0.021940527
     se.logTGeasRev beta.HDLReasRev se.HDLReasRev
1168    0.006970045    -0.001302549   0.006934257
1169    0.006970045     0.008682679   0.011268806
1170    0.011115659    -0.001302549   0.006934257
1171    0.011115659     0.008682679   0.011268806
1172    0.006970045    -0.001302549   0.006934257
1173    0.006970045     0.008682679   0.011268806
1174    0.011115659    -0.001302549   0.006934257
1175    0.011115659     0.008682679   0.011268806

That is one problem - 2 variants in 1000G East Asian reference, found also in the 3 East Asian GWASs filtered on the 1000G East Asian. Not sure what to do with this -  only 2 rows are correct and I don't know which 2 - if final sumstats file is constructed by merging QCed input files on just SNP, then I get the 8 lines for 2 SNPs, as opposed to merging QCed input files on SNP, A1 and A2.  I am not sure what to do here, because there are 44000+ rows of duplicated SNPs.  (All 8 lines then are carried forward to the commonfactor function.)

Going to the EUR populations, and this may be trivial, I still get the wrong tallies in the sumstats log.  

Thanks again, 
Dan


From EUR input file:
Preparing summary statistics for file: hdl.reverse.txt

Interpreting the SNP column as the SNP column.
Interpreting the A1 column as the A1 column.
Interpreting the A2 column as the A2 column.
Interpreting the BETA column as the effect column.
Interpreting the P column as the P column.
Interpreting the N column as the N column.
Interpreting the MAF column as the MAF column.
Interpreting the SE column as the SE column.
1356048 rows were removed from the hdl.reverse.txt summary statistics file due to entries that were duplicated across rsID, A1, and A2.
Merging file: hdl.reverse.txt with the reference file: reference.1000G.maf.0.005.txt
44794860 rows present in the full hdl.reverse.txt summary statistics file.
36596310 rows were removed from the hdl.reverse.txt summary statistics file as the rsIDs for these SNPs were not present in the reference file.
The effect column was determined NOT to be coded as an odds ratio (OR) for the hdl.reverse.txt summary statistics file based on the median of the effect column being close to 0.
11 row(s) were removed from thehdl.reverse.txt summary statistics file due to the effect allele (A1) column not matching A1 or A2 in the reference file.
1476 row(s) were removed from the hdl.reverse.txt summary statistics file due to the other allele (A2) column not matching A1 or A2 in the reference file.

No INFO column, cannot filter on INFO, which may influence results
8193600 SNPs are left in the summary statistics file hdl.reverse.txt after QC and merging with the reference file.


  

agro...@gmail.com

unread,
May 4, 2023, 11:45:51 AM5/4/23
to Genomic SEM Users
Hi Dan, 

Apologies for dropping off on this one! The end of the semester caught up with me but here's what I would say about the behavior you are observing: 

1. The reference file we provide does not have duplicates so this issue shouldn't come up when using that ref file. Even if your GWAS files have duplicates, sumstats removes any SNPs that don't match A1 and A2 in the ref file, and in my experience even with duplicate rsIDs you will still have different A1/A2 combinations such that only one SNP should be retained. 

2. With that said, my sense is duplicate SNPs often reflect multi-allelic variants, for which the expectation for the SNP variance is going to be different than 2pq. Since we use 2pq to convert the GWAS betas to covariances, and this may not work for these SNPs, I've just updated the sumstats code to remove all instances of duplicates from the GWAS files. If you redownload the package, restart R, and rerun your code I would hope your final SNP numbers should be in line with expectation now. 

3. If your reference file you have has duplicate rsIDs I would also recommend removing these prior to running sumstats for the same reasons outlined above. 

Best, 
  Andrew

Thanh Le

unread,
Jul 16, 2024, 7:15:25 AM (yesterday) Jul 16
to Genomic SEM Users
Hi Andrew,

Thank you for the clarification. Could you explain how we use 2pq to convert the GWAS betas to covariances?

Best regards,
Thanh Le

Elliot Tucker-Drob

unread,
Jul 16, 2024, 10:18:47 AM (yesterday) Jul 16
to Thanh Le, Genomic SEM Users
This is detailed in the Grotzinger et al. 2019 publication in Nature Human Behaviour. See "Standardization and scaling of summary statistics for multivariate GWASs"

--
You received this message because you are subscribed to the Google Groups "Genomic SEM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genomic-sem-us...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/genomic-sem-users/1a6f8049-770f-4649-bd91-8bdab7719517n%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages