plink1.9 chr23 extraction error


Briley Park

Nov 4, 2023, 4:17:35 PM
to plink2-users
Hi all,
I am currently using the 1240K+HO dataset from https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data. I converted the .geno, .snp, and .ind files into public1.bed, public1.bim, and public1.fam. I am trying to extract the data for chromosome 23 only, and I am having trouble. I made a new text file that contains only the IDs of the SNPs on chromosome 23.

head cleaned_chr23.txt
rs144338695
rs139653651
rs141853178
rs112502110
rs149208716
rs7473034
rs5983012
rs7889235
rs2124011

PLINK v1.90b7.1 64-bit (18 Oct 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to public1_filtered.log.
Options in effect:
  --allow-extra-chr
  --bfile public1
  --extract cleaned_chr.txt
  --make-bed
  --out public1_filtered

257780 MB RAM detected; reserving 128890 MB for main workspace.
Allocated 7257 MB successfully, after larger attempt(s) failed.

Error: Too many distinct nonstandard chromosome/contig names.
How should I solve this? I keep getting errors and I am not sure how to proceed from here.

Briley Park

Nov 4, 2023, 5:20:55 PM
to plink2-users
Just a quick update: I used PLINK 2.0, which got past the public1.bim file, but now there is a problem with the public1.fam file.

PLINK v2.00a6LM 64-bit Intel (30 Oct 2023)     www.cog-genomics.org/plink/2.0/

(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to public1_filtered.log.
Options in effect:
  --bfile public1
  --extract snp_ids_only.txt
  --make-bed
  --out public1_filtered

Start time: Sat Nov  4 17:16:48 2023
257780 MiB RAM detected, ~253563 available; reserving 128890 MiB for main
workspace.
Allocated 7257 MiB successfully, after larger attempt(s) failed.
Using up to 32 threads (change this with --threads).
Error: Line 1 of public1.fam has fewer tokens than expected.
End time: Sat Nov  4 17:16:48 2023

The public1.fam file was converted from the public1.ind file, which has only three columns: individual ID, sex (M, F), and family ID.

I001.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO
I002.HO M Ignore_Iran_Zoroastrian_PCA_outlier.HO
IREJ-T006.HO M Iran_Fars.HO
IREJ-T009.HO M Iran_Fars.HO
IREJ-T022.HO M Iran_Fars.HO
IREJ-T023.HO M Iran_Fars.HO
IREJ-T026.HO M Iran_Fars.HO
IREJ-T027.HO M Iran_Fars.HO
IREJ-T037.HO M Iran_Fars.HO
IREJ-T040.HO M Iran_Fars.HO

But the expected .fam file should have 6 columns: family ID, individual ID, paternal ID, maternal ID, sex, and phenotype.
If my .fam file does not include all of these columns, is there a way to tell PLINK to ignore the missing ones, or do I need to use a different data file?
Thank you. 
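
For anyone with the same three-column file, the gap between the .ind layout and the six-column .fam layout described above can also be closed by rewriting the file. What follows is a minimal Python sketch, not a tested converter. It assumes the .ind columns are exactly as listed in this post (individual ID, sex, and a third label reused as the family ID). The function names, the `0` placeholders for unknown parents, and `-9` for a missing phenotype are illustrative choices.

```python
def ind_to_fam_line(ind_line):
    """Turn one 3-column .ind line into one 6-column .fam line.

    Assumed input columns: individual ID, sex (M/F/other), label.
    Output columns: family ID, individual ID, father, mother, sex, phenotype.
    """
    iid, sex, label = ind_line.split()
    # .fam sex codes: 1 = male, 2 = female, 0 = unknown
    sex_code = {"M": "1", "F": "2"}.get(sex.upper(), "0")
    # Assumption: parents unknown (0) and phenotype missing (-9)
    return " ".join([label, iid, "0", "0", sex_code, "-9"])


def ind_to_fam(ind_path, fam_path):
    """Convert a whole .ind file, skipping blank lines."""
    with open(ind_path) as src, open(fam_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(ind_to_fam_line(line) + "\n")
```

Sample order is left unchanged, which matters because the genotype file is matched to samples by position.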

On Saturday, November 4, 2023 at 4:17:35 PM UTC-4, Briley Park wrote:

Christopher Chang

Nov 5, 2023, 12:47:57 PM
to plink2-users
--no-fid, --no-parents, and --no-sex can be used to tell plink2 that the corresponding column(s) are missing from the .fam file.
In the meantime, it sounds like your .bim file was not generated correctly.  See e.g. https://www.cog-genomics.org/plink/2.0/formats#bim , and compare to the first few lines of your .bim file; are chromosome codes in your first column?  Because the plink 1.9 error message implies that something else is there.
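
One quick way to run the check suggested above is to list every first-column value in the .bim file that does not look like a chromosome code. The Python sketch below assumes a whitespace-delimited .bim. Its `STANDARD` set (0 to 26, X, Y, XY, M, MT, PAR1, PAR2, with an optional `chr` prefix) is only an approximation of what PLINK accepts for human data, and `nonstandard_codes` is a hypothetical helper name.

```python
# Approximate set of human chromosome codes; an assumption, not PLINK's own list.
STANDARD = {str(n) for n in range(0, 27)} | {"X", "Y", "XY", "M", "MT", "PAR1", "PAR2"}


def nonstandard_codes(bim_lines):
    """Return the distinct first-column values that are not standard codes."""
    found = set()
    for line in bim_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        code = fields[0]
        if code.lower().startswith("chr"):
            code = code[3:]  # tolerate a "chr" prefix
        if code.upper() not in STANDARD:
            found.add(fields[0])
    return found


def check_bim(path):
    """Print how many distinct nonstandard codes appear in column 1."""
    with open(path) as fh:
        bad = nonstandard_codes(fh)
    print(f"{len(bad)} distinct nonstandard first-column values")
    print("examples:", sorted(bad)[:5])
```

If the examples printed are rs IDs rather than contig names, the columns are in the wrong order, and each variant ID is being counted as its own "chromosome".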

Briley Park

Nov 5, 2023, 2:27:09 PM
to plink2-users
Hello Chris, I reran the command with the flags you suggested and got past the .fam file. However, the error now says I need to raise the contig limit to ~980k and recompile. I don't know where plink2_common.h is located. Could you help me with this? Thank you so much.

plink2 --bfile public1 \
  --extract snp_ids_only.txt \
  --make-bed \
  --out public1_filtered \
  --no-fid \
  --no-parents \
  --no-sex

PLINK v2.00a6LM 64-bit Intel (30 Oct 2023)     www.cog-genomics.org/plink/2.0/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to public1_filtered.log.
Options in effect:
  --bfile public1
  --extract snp_ids_only.txt
  --make-bed
  --no-fid
  --no-parents
  --no-sex
  --out public1_filtered

Start time: Sun Nov  5 14:16:53 2023
257667 MiB RAM detected, ~247695 available; reserving 128833 MiB for main
workspace.
Using up to 16 threads (change this with --threads).
20503 samples (0 females, 0 males, 20503 ambiguous; 20503 founders) loaded from
public1.fam.

Error: Too many distinct nonstandard chromosome/contig names.
The usual limit is about 65k.  You can raise this to ~980k by uncommenting
"#define HIGH_CONTIG_BUILD" at the top of plink2_common.h and recompiling.  If
that still isn't enough, we should be able to help you if you make a post in
the plink2-users Google group.

End time: Sun Nov  5 14:16:54 2023

On Sunday, November 5, 2023 at 12:47:57 PM UTC-5, chrch...@gmail.com wrote:

Christopher Chang

Nov 5, 2023, 2:28:57 PM
to plink2-users
You're asking the wrong question.

Please reread the later part of my previous response.

Briley Park

Nov 5, 2023, 2:31:43 PM
to plink2-users
I've sent you a message about what my .bim file looks like compared to what it is supposed to look like.
In case you didn't see it, my .bim file includes the SNP ID, chromosome number, physical/genetic location, and reference/variant alleles (where the reference allele matches hg19).
The chromosome number is in the second column. Should I change the order of the columns to make it work?
Thank you so much. 
On Sunday, November 5, 2023 at 2:28:57 PM UTC-5, chrch...@gmail.com wrote:

Christopher Chang

Nov 5, 2023, 2:35:07 PM
to plink2-users
Yes, that should be clear from both the content of my first response and the https://www.cog-genomics.org/plink/2.0/formats#bim link.  The method you used to generate your .bim file was incorrect.
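
To make the column-order point concrete, here is a small Python sketch that moves the chromosome code into the first column. It assumes the input has six whitespace-separated fields in the order SNP ID, chromosome, genetic position, physical position, allele 1, allele 2 (which matches the "chromosome in the second column" description in this thread), and that the .bim order is chromosome, variant ID, genetic position, base-pair position, allele 1, allele 2 as documented at the link above. `snp_line_to_bim_line` is a hypothetical helper name.

```python
def snp_line_to_bim_line(snp_line):
    """Swap the first two fields so the chromosome code comes first.

    Assumed input order: SNP ID, chromosome, genetic pos, physical pos, a1, a2.
    Output order (.bim): chromosome, variant ID, genetic pos, bp pos, a1, a2.
    """
    snp_id, chrom, gen_pos, phys_pos, a1, a2 = snp_line.split()
    return "\t".join([chrom, snp_id, gen_pos, phys_pos, a1, a2])


def rewrite_snp_as_bim(snp_path, bim_path):
    """Apply the swap to every non-blank line of a file."""
    with open(snp_path) as src, open(bim_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(snp_line_to_bim_line(line) + "\n")
```

This only reorders text columns. It does not convert genotype data, so a .bed file produced by the same faulty method would still need to be regenerated separately.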

Briley Park

Nov 5, 2023, 3:01:52 PM
to plink2-users
Hello Chris, I regenerated the .bim file, but this time there is an error with the .bed file. What is the magic number that the error is talking about?
Thank you, and I apologize for the inconvenience.

PLINK v2.00a6LM 64-bit Intel (30 Oct 2023)     www.cog-genomics.org/plink/2.0/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to output_vcf.log.
Options in effect:
  --bfile public1
  --export vcf
  --no-fid
  --no-parents
  --no-sex
  --out output_vcf

Start time: Sun Nov  5 14:53:03 2023
257667 MiB RAM detected, ~247699 available; reserving 128833 MiB for main
workspace.
Using up to 16 threads (change this with --threads).
20503 samples (0 females, 0 males, 20503 ambiguous; 20503 founders) loaded from
public1.fam.
597573 variants loaded from public1.bim.
Error: public1.bed is not a .pgen file (first two bytes don't match the magic number).

On Sunday, November 5, 2023 at 2:35:07 PM UTC-5, chrch...@gmail.com wrote:

Christopher Chang

Nov 5, 2023, 3:03:07 PM
to plink2-users
This means you also generated the .bed file incorrectly, which is not surprising given that you generated the .bim file incorrectly.
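
For context on the "magic number" in that error: a PLINK 1 binary .bed file starts with the three bytes 0x6c 0x1b 0x01, followed by one block per variant that packs four samples into each byte. The Python sketch below checks both the header and the file size that layout implies. The function names are illustrative, and the sample and variant counts must come from the matching .fam and .bim files.

```python
import os

# First three bytes of a variant-major PLINK 1 .bed file
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])


def has_bed_magic(header):
    """True if the given bytes start with the .bed magic number."""
    return header[:3] == BED_MAGIC


def expected_bed_size(n_samples, n_variants):
    """3 header bytes plus ceil(n_samples / 4) bytes per variant."""
    bytes_per_variant = (n_samples + 3) // 4
    return 3 + n_variants * bytes_per_variant


def check_bed(path, n_samples, n_variants):
    """Return (magic_ok, size_ok) for the file at `path`."""
    with open(path, "rb") as fh:
        header = fh.read(3)
    size_ok = os.path.getsize(path) == expected_bed_size(n_samples, n_variants)
    return has_bed_magic(header), size_ok
```

A file that fails the header check is not binary PLINK genotype data at all, for example a text or otherwise encoded genotype file that was only renamed to .bed.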

Briley Park

Nov 5, 2023, 4:21:14 PM
to plink2-users
Thank you so much for your help! I regenerated the files and it is working now. 

On Sunday, November 5, 2023 at 3:03:07 PM UTC-5, chrch...@gmail.com wrote: