Hi,
I have two datasets: dataset_A contains my cases and dataset_B contains my controls. Both datasets were already pre-processed to remove rare SNPs and SNPs with a low genotyping rate ( --maf 0.01 --geno 0.05 ). To perform a case/control logistic regression, I merged the two PLINK datasets as follows:
# keep only the SNPs present in both datasets
# snp_in_common.txt contains the intersection of the SNPs in dataset_A and dataset_B
plink --keep-allele-order --bfile dataset_A --extract snp_in_common.txt --make-bed --out dataset_A_common
plink --keep-allele-order --bfile dataset_B --extract snp_in_common.txt --make-bed --out dataset_B_common
echo dataset_A_common > merge.txt
echo dataset_B_common >> merge.txt
# merge datasets
plink --merge-list merge.txt --make-bed --out dataset_merge
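# filter out SNPs with low frequency, low genotyping rate, or strong deviation from HWE
# (--make-bed is needed so that the filtered fileset is actually written)
plink --maf 0.01 --geno 0.05 --hwe 0.00001 --bfile dataset_merge --make-bed --out dataset_merge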
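For reference, a minimal sketch of how a file like snp_in_common.txt can be produced, assuming SNPs are matched only by the variant ID in column 2 of each .bim file. The function name common_snps is hypothetical, and matching by ID alone does not check that positions or alleles agree between the two datasets.

```shell
# Hypothetical helper: write the SNP IDs shared by two PLINK .bim files.
# Column 2 of a .bim file holds the variant ID.
common_snps() {
    # $1, $2: input .bim files; $3: output file, one shared SNP ID per line
    awk 'NR == FNR { seen[$2] = 1; next }   # first file: remember its IDs
         ($2 in seen) { print $2 }          # second file: keep IDs already seen
        ' "$1" "$2" | sort -u > "$3"
}

# Example call, using the file names from the commands above:
# common_snps dataset_A.bim dataset_B.bim snp_in_common.txt
```

SNPs that share an ID but are reported on opposite strands would still pass this filter, so the merge log is worth checking for allele mismatch warnings.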
I then performed a PCA, after pruning the merged dataset:
# pruning
plink --bfile dataset_merge --exclude high-ld-regions.txt --range --indep-pairwise 50 5 0.2 --out dataset_merge
plink --bfile dataset_merge --extract dataset_merge.prune.in --make-bed --out dataset_merge_pruned
# pca
plink --pca --bfile dataset_merge_pruned --out dataset_merge_pruned
When I plot the PCA, it clearly shows a strong batch effect between the two datasets.
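To put a number on that separation, here is a rough sketch that averages PC1 per phenotype group. It assumes the phenotype column of dataset_merge.fam is coded 1 for controls and 2 for cases, and that the .eigenvec file has no header line (the PLINK 1.9 default, as far as I know). The helper name pc1_by_status is hypothetical.

```shell
# Hypothetical helper: mean PC1 per phenotype code.
# Assumes .fam columns: FID IID PAT MAT SEX PHENO
# and .eigenvec columns: FID IID PC1 PC2 ... (no header line).
pc1_by_status() {
    # $1: .fam file; $2: .eigenvec file
    # Output: one line per phenotype code: "<code> <n samples> <mean PC1>"
    awk 'NR == FNR { pheno[$1, $2] = $6; next }   # first file: phenotype per sample
         (($1, $2) in pheno) {                    # second file: accumulate PC1
             p = pheno[$1, $2]; sum[p] += $3; n[p]++
         }
         END { for (p in n) printf "%s %d %.4f\n", p, n[p], sum[p] / n[p] }
        ' "$1" "$2" | sort
}

# Example call, using the file names from the commands above:
# pc1_by_status dataset_merge.fam dataset_merge_pruned.eigenvec
```

If the two means sit far apart, PC1 is largely tracking dataset membership, which in this design is the same thing as case/control status.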

I continued the analysis by running a logistic regression:
plink --bfile dataset_merge --covar pca_file.txt --covar-name PC1,PC2 --logistic --out dataset_merge
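Since --covar-name selects covariates by column name, the covariate file needs a header line. A minimal sketch of building such a file, assuming pca_file.txt is derived from dataset_merge_pruned.eigenvec and that this file was written without a header; make_covar_file is a hypothetical helper name.

```shell
# Hypothetical helper: turn a headerless PLINK .eigenvec file into a
# covariate file with a header line, keeping only the first two PCs.
make_covar_file() {
    # $1: .eigenvec file (FID IID PC1 PC2 ...); $2: output covariate file
    awk 'BEGIN { print "FID", "IID", "PC1", "PC2" }
         { print $1, $2, $3, $4 }' "$1" > "$2"
}

# Example call, using the file names from the commands above:
# make_covar_file dataset_merge_pruned.eigenvec pca_file.txt
```

As far as I know, PLINK 1.9 can also write the header itself when --pca is given the header modifier, in which case this step is not needed.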
Looking at the Manhattan plot and the p-value histogram, something is clearly wrong: most p-values are close to 1.
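One thing worth ruling out before plotting: with covariates, the .assoc.logistic output contains one row per SNP for the additive genotype term (TEST = ADD) and one row per covariate. A small sketch that keeps only the ADD rows, assuming the usual header with SNP, TEST and P columns; additive_pvalues is a hypothetical helper name.

```shell
# Hypothetical helper: print "SNP P" for the additive genotype test only,
# dropping covariate rows (PC1, PC2, ...) and rows without a p-value.
additive_pvalues() {
    # $1: PLINK .assoc.logistic file with a header line
    awk 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map names to columns
         $(col["TEST"]) == "ADD" && $(col["P"]) != "NA" {
             print $(col["SNP"]), $(col["P"])
         }' "$1"
}

# Example call, using the output prefix from the command above:
# additive_pvalues dataset_merge.assoc.logistic > additive_pvalues.txt
```

I believe the hide-covar modifier of --logistic gives the same result directly, but the filter above also works on output that has already been generated.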


Is there any explanation for this strange behaviour? Is my merging procedure correct?
Thank you
P.S.: I already posted this on Biostars. Sorry for the cross-post.