QC UKB

Maria Di Biase

Jul 7, 2021, 8:55:45 PM

to plink2-users

Dear Christopher,

I would like to filter out SNPs as follows:

1. with a minor allele frequency (MAF) < 0.01,

2. with a Hardy-Weinberg equilibrium (HWE) test p-value < 10^-7,

3. with a proportion of missingness (Pm) > 0.05,

4. with an imputation information score < 0.8,

5. that are duplicated SNPs.

I am using UK Biobank data and am new to PLINK. Can I achieve steps 1-3 as follows?

plink2 --bgen ukb_imp_chr1_v3.bgen ref-first --sample ukb22828_c1_b0_v3_s487253.sample --maf .01 --hwe 1e-7 --mind .05 --make-bpgen --out plink_outputs/chr1

How can I achieve steps 4 and 5? Can this be done in the same command or is a separate command needed?

Thank you kindly in advance,

Maria

Christopher Chang

Jul 7, 2021, 11:02:17 PM

to plink2-users

Yes, that command should handle steps 1-3.

--mach-r2-filter can be used to filter on the MaCH-r2 imputation statistic in the same command; alternatively, you can use --extract-col-cond to filter on another imputation statistic in a text file.

--rm-dup can be used to filter duplicate SNPs. I recommend performing this step separately, so you can first look at what sort of duplicate SNPs you have and then decide how you want to handle them.
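As a rough sketch of the two-run workflow described above (not a tested recipe): the first run applies the frequency, HWE, missingness and MaCH-r2 filters, and the second runs --rm-dup in its default mode on the result. Several things here are assumptions rather than part of the thread. --geno 0.05 is swapped in for the --mind .05 of the original command, because --mind applies its threshold per sample while --geno applies it per variant, which is what step 3 describes. The 0.8 is passed as the minimum for --mach-r2-filter. The _nodup output prefix and the use of --write-snplist as the output command are illustrative choices. The script prints each command and only executes it if plink2 and the BGEN file are actually present.

```shell
#!/bin/sh
# Sketch only: file names are taken from the thread, output names are illustrative.
BGEN=ukb_imp_chr1_v3.bgen
SAMPLE=ukb22828_c1_b0_v3_s487253.sample
OUT=plink_outputs/chr1

# Run 1: MAF, HWE, per-variant missingness and MaCH-r2 filters in one pass.
# Assumption: --geno (per-variant) replaces the --mind (per-sample) flag
# used in the original command.
FILTER_CMD="plink2 --bgen $BGEN ref-first --sample $SAMPLE \
--maf 0.01 --hwe 1e-7 --geno 0.05 --mach-r2-filter 0.8 \
--make-bpgen --out $OUT"

# Run 2: duplicate check on the filtered fileset, default ("error") mode.
# Assumption: --write-snplist is used only so that the run has an output command.
DUP_CMD="plink2 --bpfile $OUT --rm-dup --write-snplist --out ${OUT}_nodup"

for cmd in "$FILTER_CMD" "$DUP_CMD"; do
  echo "$cmd"
  # Execute only when plink2 and the input data are available.
  if command -v plink2 >/dev/null 2>&1 && [ -f "$BGEN" ]; then
    mkdir -p "$(dirname "$OUT")"
    $cmd || break
  fi
done
```

Keeping the duplicate check as its own run means a failure there does not discard the filtered fileset from the first run.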

Maria Di Biase

Jul 8, 2021, 8:09:57 AM

to plink2-users

Thank you very much for the quick turnaround.

Re the imputation score, I would like to use column 7 in a file called "ukb_mfi_chr1_v3.txt" to remove SNPs with an info score < 0.8. Column 1 of this file specifies the marker ID (the marker order is not necessarily the same as the BGEN files). Is the following use correct?

--extract-col-cond-min .8 ukb_mfi_chr1_v3.txt 7 1

Also, is there a command to characterize the sort of duplicate SNPs that are present in the data?

Thanks again,

Maria

Christopher Chang

Jul 8, 2021, 11:38:09 AM

to plink2-users

No, "--extract-col-cond ukb_mfi_chr1_v3.txt 7 1 --extract-col-cond-min 0.8" is probably what you want.
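To preview which markers such a condition would keep before running plink2, the same test can be applied to the text file directly. This is a minimal sketch with awk. It assumes whitespace-delimited columns, with the score in column 7 and the marker ID in column 1 as stated in the question, and an inclusive threshold. The three rows are made-up placeholders, not real UK Biobank values, and the column numbers are worth checking against the actual file.

```shell
#!/bin/sh
# Toy stand-in for ukb_mfi_chr1_v3.txt (hypothetical IDs and scores).
mfi=$(mktemp)
printf '%s\n' \
  'var1 a b c d e 0.95' \
  'var2 a b c d e 0.42' \
  'var3 a b c d e 0.80' > "$mfi"

# Print the ID (column 1) of every row whose score (column 7) is >= 0.8.
kept=$(awk -v min=0.8 '$7 + 0 >= min { print $1 }' "$mfi")
echo "$kept"

rm -f "$mfi"
```

Counting the lines of this output on the real file gives a quick sanity check on how many markers should survive the info-score filter.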

--rm-dup's default mode ("error") checks groups of duplicate-ID SNPs for equality; if any mismatches are found, it errors out and reports SNPs with mismatches. You can then decide what to do about the mismatching same-ID SNPs.
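Independently of --rm-dup, duplicated IDs can also be listed from the variant file itself. This is a minimal sketch that assumes a .bim file (variant ID in column 2), such as the one written by --make-bpgen. The five rows are made-up placeholders. The sketch compares only the fields in the .bim file, so it is a first look rather than a substitute for the equality check that --rm-dup performs.

```shell
#!/bin/sh
# Toy stand-in for a .bim file: chrom, ID, cM, position, allele 1, allele 2.
bim=$(mktemp)
printf '%s\n' \
  '1 rsA 0 1000 A G' \
  '1 rsB 0 2000 C T' \
  '1 rsB 0 2000 C T' \
  '1 rsC 0 3000 G A' \
  '1 rsC 0 3500 G T' > "$bim"

# IDs that occur on more than one row.
dup_ids=$(awk '{ print $2 }' "$bim" | sort | uniq -d)
echo "duplicated IDs:"
echo "$dup_ids"

# IDs that still occur more than once after identical rows are collapsed,
# i.e. same-ID rows that differ in position or alleles.
mismatch_ids=$(sort -u "$bim" | awk '{ print $2 }' | sort | uniq -d)
echo "duplicated IDs with differing rows:"
echo "$mismatch_ids"

rm -f "$bim"
```

In this toy input, rsB is an exact repeat while rsC appears with two different positions and allele pairs, which is the kind of same-ID mismatch that needs a manual decision.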
