Strange p-value distribution after merge of two datasets

Nicolas Rosewick

Mar 15, 2021, 4:28:25 AM
to plink2-users
Hi,

I have two datasets: dataset_A contains my cases and dataset_B contains my controls. To perform a case/control logistic regression, I merged the two plink datasets as follows. Both datasets were already pre-processed to remove rare SNPs and SNPs with a low genotyping rate ( --maf 0.01 --geno 0.05 ).
# keep only the SNPs present in both datasets
# snp_in_common.txt contains the SNP intersection between dataset_A and dataset_B
plink --keep-allele-order --bfile dataset_A --extract snp_in_common.txt --make-bed --out dataset_A_common

plink --keep-allele-order --bfile dataset_B --extract snp_in_common.txt --make-bed --out dataset_B_common
echo dataset_A_common > merge.txt
echo dataset_B_common >> merge.txt 

# merge datasets 
plink --merge-list merge.txt --make-bed --out dataset_merge

# filter out SNPs with low frequency, low genotyping rate, or strong deviation from HWE
plink --maf 0.01 --geno 0.05 --hwe 0.00001 --bfile dataset_merge --make-bed --out dataset_merge
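
The post does not say how snp_in_common.txt was built. One possible way, shown here only as a sketch, is to intersect the variant IDs (column 2) of the two .bim files. The function name `common_snps` is hypothetical, and matching on ID alone is an assumption: it does not check that positions or alleles agree between the two datasets.

```shell
# Hypothetical helper: print variant IDs (column 2 of a .bim file)
# that appear in both files given as $1 and $2.
common_snps() {
    # First pass stores the IDs of file 1; second pass prints the IDs
    # of file 2 that were already seen.
    awk 'NR == FNR { seen[$2] = 1; next } ($2 in seen) { print $2 }' "$1" "$2"
}

# Usage sketch (needs the real .bim files, so it is left commented out):
# common_snps dataset_A.bim dataset_B.bim > snp_in_common.txt
```

The output order follows the second file, which is enough for --extract since it only needs a list of IDs.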

I performed a PCA (after pruning the merged dataset):

# pruning 
plink --bfile dataset_merge --exclude high-ld-regions.txt --range --indep-pairwise 50 5 0.2 --out dataset_merge
plink --bfile dataset_merge --extract dataset_merge.prune.in --make-bed --out dataset_merge_pruned

# pca 
plink --pca --bfile dataset_merge_pruned --out dataset_merge_pruned

The PCA plot clearly shows a strong batch effect between the two datasets:

Capture du 2021-03-12 16-42-06.png

I continued the analysis by running a logistic regression:

plink --bfile dataset_merge --covar pca_file.txt --covar-name PC1,PC2 --logistic --out dataset_merge

Looking at the Manhattan plot and the p-value histogram, something is clearly wrong: most p-values are close to 1.

Capture du 2021-03-12 16-51-35.png

Capture du 2021-03-12 16-52-25.png

Is there any explanation for this strange behaviour? Is my merging procedure correct?

Thank you

P.S.: I already posted this on Biostars. Sorry for the cross-post.

Christopher Chang

Mar 15, 2021, 11:14:51 AM
to plink2-users
Sorry to be the bearer of bad news, but the Biostars responder is correct: you're doomed because there's an obvious batch effect that swamps the case/control difference you're trying to detect.  Next time, make sure you don't sequence/genotype all your cases in one batch and all your controls in another; mix them up.
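
To see whether batch and case/control status are confounded in this way, one rough check is to count cases and controls per batch. The sketch below treats each .fam file as one batch and reads the phenotype from column 6 (1 = control, 2 = case). The function name `batch_by_status` is hypothetical.

```shell
# Hypothetical helper: for each .fam file (treated as one batch),
# count controls (phenotype 1) and cases (phenotype 2) from column 6.
batch_by_status() {
    awk '{ n[FILENAME, $6]++; files[FILENAME] = 1 }
         END {
             for (f in files)
                 printf "%s\tcontrols=%d\tcases=%d\n", f, n[f, 1], n[f, 2]
         }' "$@"
}

# Usage sketch (needs the real .fam files, so it is left commented out):
# batch_by_status dataset_A_common.fam dataset_B_common.fam
```

If one batch contains only cases and the other only controls, any batch difference is indistinguishable from a case/control difference.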

Sal

Sep 15, 2022, 12:28:38 PM
to plink2-users
Hi,
I have a similar issue. In my case, the batches are not split by case/control status; they simply come from different sources. After running PCA on the merged file, I get a plot very similar to the one shown above (in my case PC1 separates the batches and PC2 overlaps; PC2 vs PC3 also overlaps nicely). Would including PC1 and PC2 as covariates solve the issue when performing the GWAS?

Esoh

Sep 15, 2022, 1:34:51 PM
to plink2-users
Hi,
A similar issue was resolved for me by redoing QC from scratch on the merged dataset (my two batches of samples were from the same population).
The critical step that seemed to remove the batch effect was testing for differential missingness by batch, after testing for differential missingness by case/control status.
To do this, you can code batch 1 and batch 2 as case and control respectively, and then test for differential missingness using plink --test-missing.
In my data, a substantial number of SNPs with significant differential missingness (p-value < 0.001) seemed to be driving the batch effect.

Not sure if that will work for you. 
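
A minimal sketch of that filtering step, assuming the usual layout of the plink .missing report (columns CHR, SNP, F_MISS_A, F_MISS_U, P). The function name `flag_batch_snps` and the file names batch_pheno.txt, batch_miss and dataset_merge_nobatch are hypothetical. batch_pheno.txt would hold FID, IID and a third column coded 1 for one batch and 2 for the other.

```shell
# Hypothetical helper: list SNPs whose differential-missingness
# p-value (column 5 of a .missing report) is below a threshold.
flag_batch_snps() {
    # $1 = path to the .missing file, $2 = p-value threshold
    awk -v thr="$2" 'NR > 1 && $5 != "NA" && $5 + 0 < thr + 0 { print $2 }' "$1"
}

# Usage sketch (needs plink and real data, so it is left commented out):
# plink --bfile dataset_merge --pheno batch_pheno.txt --test-missing --out batch_miss
# flag_batch_snps batch_miss.missing 0.001 > batch_snps.txt
# plink --bfile dataset_merge --exclude batch_snps.txt --make-bed --out dataset_merge_nobatch
```

The 0.001 threshold is the one quoted above. It is a judgment call and may need adjusting for other datasets.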