could not get subset of snps from .bgen file

刘莹

Jul 14, 2017, 4:56:43 PM
to plink2-users
I just started working on a UK Biobank (interim data) project and I am not PLINK-savvy, so excuse me if I ask basic questions (I will definitely google first).

Since the UKB imputed data (.bgen files) are large and my research interest is just a list of SNPs, I would love to extract a small subset of the .bgen file. I know the traditional way is to use qctool, but I kept receiving an error message, and I have raised the question on their mailing list as well.

When I tried to do the same task using PLINK2, I ran "plink2 --bgen chr19impv1.bgen --sample impv1.sample --extract rs6857.txt --out rs6857.bgen". It failed and told me to rerun with --make-bed:
Options in effect:
  --bgen chr19impv1.bgen
  --extract rs6857.txt
  --out rs6857.bgen
  --sample impv1.sample

96736 MB RAM detected; reserving 48368 MB for main workspace.
Error: Basic file conversions do not support regular filtering operations.
Rerun your command with --make-bed.

I really want to keep a .bgen or .gen file for further analyses, so I just want to ask whether there is any way to accomplish this. Also, if I want to use PLINK to deal with .bgen files, what steps do I have to go through to get a subset and then merge the multiple files (one per SNP of interest, or one per list of SNPs from one chromosome) into one?

Thanks so much!! 

Christopher Chang

Jul 14, 2017, 5:17:58 PM
to plink2-users
There are three issues here.

1. You generally want to use PLINK 2.0 instead of 1.9 when working with .bgen files.  PLINK 1.9 is unable to track any of the genotype probability information in the .bgen; it rounds numbers to the nearest integer and replaces those too far from an integer with missing calls.

2. If you want to generate a .bgen file with only the variants you want, replace "--out rs6857.bgen" with "--export bgen-1.1 --out rs6857".  --out just specifies an output filename prefix; including ".bgen" in the --out parameter does not tell PLINK 2.0 to export a .bgen file.

3. This will still be a lossy process; PLINK 2.0 keeps track of dosages, but not genotype probabilities.  When sample x has P(AA) = 0.8, P(AC) = 0.2, and P(CC) = 0 and sample y has P(AA) = 0.85, P(AC) = 0.1, and P(CC) = 0.05, PLINK 2.0 sees both samples as dosage(A) = 1.8, dosage(C) = 0.2, and it will make both entries look like sample x when asked to export a .bgen.  Sometimes this is okay, but if you aren't sure, you should stick to qctool/bgenix for .bgen data management because of this information loss.
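
To make the arithmetic in point 3 concrete, here is a minimal Python sketch. It is not PLINK code: the function names are hypothetical, and the rule used to rebuild probabilities from a dosage (put all the mass on the two genotypes adjacent to the dosage) is only an assumption chosen to match the "both entries look like sample x" behavior described above.

```python
from fractions import Fraction


def allele_dosages(p_aa, p_ac, p_cc):
    """Return (dosage(A), dosage(C)) for a biallelic A/C variant."""
    dosage_a = 2 * p_aa + p_ac
    dosage_c = p_ac + 2 * p_cc
    return dosage_a, dosage_c


def probs_from_dosage(dosage_c):
    """Rebuild (P(AA), P(AC), P(CC)) from dosage(C) alone.

    Assumption for illustration: all probability mass goes on the two
    genotypes adjacent to the dosage. This is not taken from PLINK's source.
    """
    if dosage_c <= 1:
        return 1 - dosage_c, dosage_c, Fraction(0)
    return Fraction(0), 2 - dosage_c, dosage_c - 1


# Genotype probabilities from the example above, as exact fractions.
sample_x = (Fraction("0.8"), Fraction("0.2"), Fraction("0"))
sample_y = (Fraction("0.85"), Fraction("0.1"), Fraction("0.05"))

dos_x = allele_dosages(*sample_x)
dos_y = allele_dosages(*sample_y)

# Only the dosage survives, so sample y is rebuilt from dosage(C) alone.
rebuilt_y = probs_from_dosage(dos_y[1])

print("same dosages:", dos_x == dos_y)
print("rebuilt y equals x:", rebuilt_y == sample_x)
print("rebuilt y equals original y:", rebuilt_y == sample_y)
```

Both samples reduce to the same pair of dosages, so once only the dosage is stored there is no way to tell them apart, and the original probabilities for sample y cannot be recovered on export.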