missing SNPs when using PLINK2 on UKB bgen genotype

1,010 views
Skip to first unread message

Mere H

unread,
Aug 9, 2022, 1:50:03 PM8/9/22
to plink2-users
Hi Dr. Chang and PLINK2 community,
I am using PLINK2 to filter UK Biobank (UKB) bgen genotype by giving a list of snp IDs. The list consists of a mixture of rsIDs, SNPIDs, and Affx IDs, which are provided by ukb_mfi_chr*_V3.txt. (See an example of 20 snps in the figure below). The number of snps remained after filtering is far less than what is expected. I am wondering if anyone has encountered the same issue and had a solution for it other than just replacing every alternate_id into rsID?

This is the code I used to extract snps:
plink2 --bgen ukb_imp_chr21_v3.bgen ref-first --allow-extra-chr --extract mylist.txt --sample ukb34697_imp_chr21_v3_s487317.sample --threads 16 --memory 32000 --export vcf 'vcf-dosage=HDS-force' --out myOutputFile

The code was executed without any error messages. However, only 1/10 of snps were retained in the output vcf. To give you a quick example, in the list of 20 snps below, only 11 snps remained (remained ones are highlighted in green).

Picture2.png

The ones omitted by PLINK were either Affx IDs or insertions at Allele 2. For the test list as small as 20 snps, I can certainly use rsID to replace them and then can extract all 20 snps successfully. But the problem is, my real list has 334,016 snps, and nearly 20,000 of them do not have rsID in the original ukb_mfi_chr21_V3.txt provided by UKB so I have to manually add their rsIDs, and rsID is not a unique ID for some triplet snps, which may cause additional issues. Therefore I try to stick to the alternate_ids, which is a unique ID. As a result, only 27,109 snps were retained from my 334,016 snps list. 

Is there a way to work around this losing snp issue, other than converting all alternate_ids into rsID? I have not heard of someone doing QC on UKB genotype who has to massively (and manually) replace alternate_ids by rsIDs, so I want to ask for your experience to make sure I am heading to the right direction.  

Thank you!

Mey

Christopher Chang

unread,
Aug 9, 2022, 9:48:16 PM8/9/22
to plink2-users
Your --extract file must use the same variant ID scheme as the UKB file.  You can't just use your own non-UKB IDs and expect them to be translated automatically.

--set-all-var-ids is frequently used to solve this type of variant-ID-mismatch problem (by converting all IDs in both files to be based on position+alleles).

Mere H

unread,
Aug 10, 2022, 4:11:11 PM8/10/22
to plink2-users
Hi Dr. Chang,
To clarify any misunderstanding on my question, my --extract file does match with UKB's bgen file. I did not use my own non-UKB IDs. Those alternate-ids you see in the Excel screenshot were provided by UKB. Therefore, the missing SNPs issue is not due to the ID-mismatch between txt and bgen.

Below is an example of what UKB-provided IDs look like:
Screenshot 2022-08-10 154855.jpg
The 1st column is the IDs I used in my --extract txt in my original email. Again, they match with how they appear in bgen. 

As you see, there are some ID formats provided in UKB file that are not recognizable by PLINK, such as insertion at Allele 2 or Affx IDs , which are missing when I extract them (not in green are the missing ones). Is there a way to improve my code (see original post) to make PLINK not miss these SNPs? Or do you think finding out rsID for all snps and using rsID in the --extract txt will be the best solution?

Thank you.

Mey

Christopher Chang

unread,
Aug 10, 2022, 4:16:42 PM8/10/22
to plink2-users
Nope.  Recheck the correspondence between your --extract file and the .bgen.

Christopher Chang

unread,
Aug 10, 2022, 5:00:46 PM8/10/22
to plink2-users
Ok, I think I figured out the source of the confusion.

From PLINK2's .bgen import documentation:
"With .bgen input, use the 'snpid-chr' modifier to specify that chromosome codes should be read from the "SNP ID" field. (The "SNP ID" field is usually ignored.)"

Emphasis mine.  .bgen files have two variant ID fields.  VCF and PLINK files have only one.  The standard practice is to not use the .bgen "SNP ID" field. 

Mere H

unread,
Aug 10, 2022, 6:20:40 PM8/10/22
to plink2-users
Hi Dr. Chang,
Thanks! Got your idea! I have attempted adding 'snpid-chr' modifier.  Then I got an error message: "Error: Invalid chromosome code '21:9411239_G_A' in --bgen file. (Use --allow-extra-chr to force it to be accepted.)”. If I added --allow-extra-chr (see the first code below), I got another error message: "Error: Too many distinct nonstandard chromosome/contig names."

plink2 --bgen ukb_imp_chr21_v3.bgen ref-first 'snpid-chr' --allow-extra-chr --extract  mylist.txt  --sample ukb34697_imp_chr21_v3_s487317.sample --threads 16 --memory 32000 --export vcf 'vcf-dosage=HDS-force' --out myOutputFile

In the code above, mylist.txt is a single column of alternate_id, as shown in Excel screenshot in my first post. 

It seems that making mylist.txt using the column of rsID (i.e. column 2 in the Excel of my second post) instead of alternate_id (column 1), and deleting 'snpid-chr' in my code should be the solution? Like below:

plink2 --bgen ukb_imp_chr21_v3.bgen ref-first --allow-extra-chr --extract  mylist_RsID.txt  --sample ukb34697_imp_chr21_v3_s487317.sample --threads 16 --memory 32000 --export vcf 'vcf-dosage=HDS-force' --out myOutputFile

I am still having "--allow-extra-chr" in place, because not all SNPs were provided with rsID in column 2. For example, on my list of 334,016 snps, nearly 20,000 of them do not have rsID in the original txt provided by UKB. For these snps, their column 2 is filled with alternate_id  as a repetition of their column 1. So I am adding "--allow-extra-chr" to tolerate some nonstandard format of ID. Does this sound like a reasonable solution?

Thanks,

Mey

omayma amri

unread,
Aug 10, 2022, 6:28:37 PM8/10/22
to plink2-users
Bonjour
Je suis une nouvelle dans le groupe, je vous remercie d'avance pour l'adhésion.
En effet je suis en train de calculer les 10 PCA (principal component analysis)
Jai utilisé cette commande :
--bfile BQC_forAnalysesfinal_2022_07_28
--pca
--out Top_PCA0_BQC_2022_07_28
J'ai eu ce message d'erreur : 
Extracting eigenvalues and eigenvectors... Error: Failed to extract eigenvector(s) from GRM.
sachant que je travaille avec des données de séquençage de nouvelle génération pour faire une GWAS et j'ai nettoyé mes données avec les commandes suivantes :
Etape 1
--bfile BQC_allFiles_BiAll2_sex_2022_07_28
  --hwe 0.00001
  --maf 0.005
  --make-bed
  --out BQC_allFiles_BiAll2_filt_2022_07_28
Etape 2
  --bfile BQC_allFiles_BiAll2_filt_2022_07_28
  --chr 1-22
  --make-bed
  --out BQC_allFiles_BiAll2_filt2_2022_07_28
Etape 3
--bfile BQC_allFiles_BiAll2_filt2_2022_07_28
 --set-missing-var-ids @:#[b38]
--make-bed
--out BQC_ allFiles_BiAll2_IDname_2022_07_28
Je vous remercie d´avance pour toute information utile :) 

--
You received this message because you are subscribed to the Google Groups "plink2-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to plink2-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/plink2-users/80388476-afd7-44ca-a86a-2ed0e5a3d156n%40googlegroups.com.

Christopher Chang

unread,
Aug 10, 2022, 6:29:15 PM8/10/22
to plink2-users
1. The .bgen doesn't have chromosome codes in the "SNP ID" field, so yes, you should remove the 'snpid-chr' modifier from your command line.
2. After you do this, you should no longer need to use --allow-extra-chr.
3. Yes, once you use the second instead of the first column of the Excel file to produce your --extract input, you should be fine.

Mere H

unread,
Aug 10, 2022, 9:23:43 PM8/10/22
to plink2-users
Hi Dr. Chang,
Thank you for the timely diagnosis of my issue! To give you feedback of my attempt on the new code -- I was finally able to extract all the SNPs after using Column 2 of the Excel to produce my --extract input as you advised. However, the output gets more variants than my --extract input. Before extraction, my --extract input txt has 334,016 variants; after extraction, the output file has 335,093 variants. How can I end up having more variants than I requested?

Here is the code I used to extract:
plink2 --bgen ukb_imp_chr21_v3.bgen ref-first --extract Mylist_rsID.txt --sample ukb34697_imp_chr21_v3_s487317.sample --threads 16 --memory 32000 --export bgen-1.2 --out MyOutputFile

To look into why this is happening, I used bgenix tool to list the variants in my output bgen file, and compared it with Mylist_rsID.txt. The result is, all the variant IDs are the same between the input and output files, but 1077 more variants are duplicated in the output file. 

I used the following functions in R to make this comparison:

setdiff(output$rsid, input$rsid)
# Got character(0)
setdiff(input$rsid, output$rsid)
# Also got character(0)
# This means all rsids are the same between input and output files, which is what I expect. 

sum(duplicated(output$rsid))
#1561
sum(duplicated(input$rsid))
#484
# output file has 1077 more snps that have duplicates. Not sure why this is happening. 

I have not identified the reason that causes this problem. I guess it might have something to do with the swap of Ref/Alt mode? Of my 334,016 input SNPs, most of them have Ref first, but 30,003 has Alt first, which is just a problem of mixed Ref/Alt mode in the original bgen file provided by UKB.

To help me find a proper solution, could you provide any possible explanation on why I am getting more variants than I requested? 

Thank you!

Mey

Christopher Chang

unread,
Aug 10, 2022, 9:28:16 PM8/10/22
to plink2-users
Looks like your --extract file doesn't quite reflect all the duplicate IDs in the .bgen.

Christopher Chang

unread,
Aug 10, 2022, 9:29:26 PM8/10/22
to plink2-users
To spell this out: suppose a rsID appears twice in the --extract file, and twice in the .bgen.  Then it will appear twice in the output file.

Now suppose a rsID appears once in the --extract file, and twice in the .bgen.  It will still appear twice in the output file.

Mere H

unread,
Aug 11, 2022, 5:11:46 PM8/11/22
to plink2-users
Hi Dr. Chang,
Got it! Thanks to your critical hint, I examined my --extract txt and found the reason why it did not reflect all the duplicate IDs in bgen. The reason is that some duplicate copies were removed as they did not meet my filtering criteria (i.e. MAF and Imputation Score cutoff). Exactly as you said, some rsID appears once in the --extract file but twice in the .bgen, like the picture below.

Screenshot 2022-08-11 162842.jpg
The Excel in the screenshot reflects all the snps in bgen. In the example shown, only one copy of rs10482896 and rs10084570  respectively were retained in my --extract txt, because the other copy (with red circles) does not meet my filtering criteria (MAF >= 0.0005, Imputation Score > 0.5). If we only rely on rsID to extract snps, both copies will be extracted from bgen, but I only want the one copy that meets my selection criteria. The two copies are not fully duplicating each other if we look at other columns besides rsID. It looks like we need at least one more column as the identifier besides rsID to differentiate the two copies. Is there a way that PLINK can do so when it reads bgen?

I have tried on --extract-col-cond to see if I can filter the bgen based on rsID (col 2)and Imputation Score (col 8) at the same time. The codes failed immediately with error "Error: Variant ID 'rs575676233' appears multiple times in --extract-col-cond file." It seems that --extract-col-cond is not the right tool to fulfill the above purpose. Even if the code was able to run successfully,   --extract-col-cond would not accommodate if I want to add Col 6 on top of Col 8 as my filtering criteria, right?

Here is the code I used when trying out --extract-col-cond above:

plink2 --bgen ukb_imp_chr21_v3.bgen ref-first --extract-col-cond ukb_mfi_chr21_v3.txt 8 2 --extract-col-cond-min 0.5 --sample ukb34697_imp_chr21_v3_s487317.sample --threads 16 --memory 32000 --export bgen-1.2 --out MyOutFile

Dr. Chang, is there a way that I can selectively retain some but not all copies from the duplicates, using PLINK? Or is this something I'll have to use other tools like QCTOOL?

Thank you!

Mey 

Christopher Chang

unread,
Aug 11, 2022, 5:26:31 PM8/11/22
to plink2-users
This is one application of --set-all-var-ids.  There shouldn't be any duplicates in the newly created IDs, and it should be straightforward to translate your original variant list to the new ID scheme.

Mere H

unread,
Aug 11, 2022, 6:08:01 PM8/11/22
to plink2-users
Hi Dr. Chang,
Sounds good! For --set-all-var-ids @:#[b37]\$r,\$a , this can only be applied to pvar file, not bgen file, is that correct?

Therefore, my planned pipeline is:
1. Convert UKB bgen into pfiles.
2. Use  --set-all-var-ids @:#[b37]\$r,\$a to create new IDs for each variant in the pvar file. 
3. Manipulate my --extract txt in R to make its ID in the same format as 'chr*:position[b37]allele1,allele2'
4. Extract my variant list from pfiles. 

Does this sound right?

With your help, I almost see the light at the end of the tunnel. Thank you!

Mey

Reply all
Reply to author
Forward
0 new messages