Single sample .bed and .bim file sizes

Adam Willett

May 9, 2024, 9:00:11 AM
to plink2-users
Hello

I'm fairly new to all of this. I was following a tutorial whose scripts converted my raw AncestryDNA file to PLINK format (first converting it to 23andMe format). I noticed that the resulting .bed file was only 680 KB, while the .bim was 19 MB.

This seemed unusual, since in all of the other PLINK filesets I had seen, the .bed file was usually larger than the .bim file.

Is this file size ratio normal for a single sample converted to PLINK format, or does it have something to do with this error message I kept getting?

Error: 23754 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
  Adam_ancestry_Adam_ancestry_1001802Test-merge.missnp.
  (Warning: if the subsequent merge seems to work, strand errors involving SNPs
  with A/T or C/G alleles probably remain in your data.  If LD between nearby
  SNPs is high, --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
  that subset of the data to VCF (via e.g. '--recode vcf'), merging with
  another tool/script, and then importing the result; PLINK is not yet suited
  to handling them.


Christopher Chang

May 9, 2024, 9:50:15 AM
to plink2-users
Please post a full .log file when asking for troubleshooting help, rather than just an error message.

Adam Willett

May 10, 2024, 10:58:57 AM
to plink2-users
Attached is the .log file that was generated by the process where the error message came up. 
TestScript.log

Christopher Chang

May 10, 2024, 12:12:50 PM
to plink2-users
Sorry, that .log doesn't contain the error message, so I can't help you with that.

Meanwhile, there is nothing unusual about a single-sample .bim file being much larger than the corresponding .bed.  A .bed file has a size of 3 + (<# of samples / 4, rounded up> * <# of variants>) bytes, which is <# of variants> + 3 in this case.  A .bim file contains chromosome, ID, position, and allele information for each variant; your value of ~28 bytes/variant is typical.
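
As a quick check of that arithmetic, here is a minimal shell sketch. It assumes the PLINK 1 binary layout described above (3 header bytes, then each variant's genotypes packed four samples per byte); the function name `bed_expected_size` is made up for illustration.

```shell
# Expected PLINK 1 .bed size in bytes:
# 3 header bytes + ceil(samples / 4) bytes for each variant.
# (bed_expected_size is a hypothetical helper name, not a PLINK command.)
bed_expected_size() {
  local samples="$1" variants="$2"
  # (samples + 3) / 4 is integer division, i.e. samples / 4 rounded up
  echo $(( 3 + ( (samples + 3) / 4 ) * variants ))
}

# Single sample, variant count taken from the --23file log in this thread
bed_expected_size 1 677857
```

For one sample and 677857 variants this prints 677860, which matches a .bed file of roughly 680 KB. The actual size on disk can be compared with `wc -c < INDIVPLINKFILES/Adam_ancestry.bed`.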

Adam Willett

May 11, 2024, 10:27:00 AM
to plink2-users
Thanks for the clarification of the .bed vs .bim size issue.
Here are the PLINK logs from the steps I took.

[First I created a subset of reference populations]

PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to REFERENCEFILES/TestScript.log.
Options in effect:
  --bfile REFERENCEFILES/Est1000HGDP
  --keep keep.keep
  --make-bed
  --out REFERENCEFILES/TestScript

8192 MB RAM detected; reserving 4096 MB for main workspace.
135056 variants loaded from .bim file.
4899 people (1737 males, 654 females, 2508 ambiguous) loaded from .fam.
Ambiguous sex IDs written to REFERENCEFILES/TestScript.nosex .
--keep: 149 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 149 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate in remaining samples is exactly 1.
135056 variants and 149 people pass filters and QC.
Note: No phenotypes present.
--make-bed to REFERENCEFILES/TestScript.bed + REFERENCEFILES/TestScript.bim +
REFERENCEFILES/TestScript.fam ... done.

[Next I ran the main shell script to convert my raw file to PLINK format and run the admixture analysis]

PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to INDIVPLINKFILES/Adam_ancestry.log.
Options in effect:
  --23file RAWINPUT/Adam_ancestry.txt Adam_ancestry Adam_ancestry
  --make-bed
  --out INDIVPLINKFILES/Adam_ancestry

8192 MB RAM detected; reserving 4096 MB for main workspace.
--23file: INDIVPLINKFILES/Adam_ancestry-temporary.bed +
INDIVPLINKFILES/Adam_ancestry-temporary.bim +
INDIVPLINKFILES/Adam_ancestry-temporary.fam written.
8835 variants with indel calls present.  '--snps-only no-DI' or
--list-23-indels may be useful here.
Inferred sex: male.
677857 variants loaded from .bim file.
1 person (1 male, 0 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 1 founder and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 6 het. haploid genotypes present (see INDIVPLINKFILES/Adam_ancestry.hh
); many commands treat these as missing.
Total genotyping rate is exactly 1.
677857 variants and 1 person pass filters and QC.
Note: No phenotypes present.
--make-bed to INDIVPLINKFILES/Adam_ancestry.bed +
INDIVPLINKFILES/Adam_ancestry.bim + INDIVPLINKFILES/Adam_ancestry.fam ... done.
Adam_ancestry Adam_ancestry 0 0 1 -9


Error: 23754 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
  Adam_ancestry_Adam_ancestry_1007651Test-merge.missnp.

  (Warning: if the subsequent merge seems to work, strand errors involving SNPs
  with A/T or C/G alleles probably remain in your data.  If LD between nearby
  SNPs is high, --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
  that subset of the data to VCF (via e.g. '--recode vcf'), merging with
  another tool/script, and then importing the result; PLINK is not yet suited
  to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.
PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to Adam_ancestry_Adam_ancestry_1007651.log.
Options in effect:
  --bfile Adam_ancestry_Adam_ancestry_1007651
  --flip flip
  --make-bed
  --out Adam_ancestry_Adam_ancestry_1007651

8192 MB RAM detected; reserving 4096 MB for main workspace.
Warning: --make-bed input and output filenames match.  Appending '~' to input
filenames.
124327 variants loaded from .bim file.
1 person (1 male, 0 females) loaded from .fam.
--flip: 23754 SNPs flipped.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 1 founder and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is exactly 1.
124327 variants and 1 person pass filters and QC.
Note: No phenotypes present.
--make-bed to Adam_ancestry_Adam_ancestry_1007651.bed +
Adam_ancestry_Adam_ancestry_1007651.bim +
Adam_ancestry_Adam_ancestry_1007651.fam ... done.

Christopher Chang

May 13, 2024, 4:47:34 AM
to plink2-users
This is *still* missing the .log information for the failed plink run in the middle.  You may need to figure out which script step that is, and rerun it in a different way that doesn't suppress the log information.
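
One possible way to rerun a script step without losing any output, sketched under the assumption of a bash-compatible shell (this also works in Terminal on a Mac). The helper name `run_logged` and the log filename are hypothetical; `tee` and the `2>&1` redirection are standard shell features.

```shell
# Run any command, show its output on screen, and save a copy to a file.
# stderr is merged into stdout so error messages land in the log too.
# (run_logged is a hypothetical helper name.)
run_logged() {
  local logfile="$1"
  shift
  "$@" 2>&1 | tee "$logfile"
}

# Example with a harmless command that writes to both streams
run_logged example_run.log sh -c 'echo "normal output"; echo "error output" >&2'
```

Wrapping the whole pipeline, e.g. `run_logged full_run.log bash -x rawFile_To_Supervised_Results.sh TestScript`, would additionally echo each command the script executes (`bash -x`), which helps identify the step that produced a given error.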

Adam Willett

May 13, 2024, 12:03:49 PM
to plink2-users
Is there a way to generate a .log file that captures everything that was run, including shell scripts? (I use Terminal on a Mac.)
For now, I've re-pasted the complete output, this time with all my commands included (highlighted in blue), and attached the relevant scripts in a zip. It looks like the multiallelic variants get flipped by the script soon after the error message appears. (I've also done a separate run using my raw file pre-converted to 23andMe format, and it still comes up with exactly 23754 multiallelic variants.)

adamwillett@Adams-MacBook-Pro ~ % cd ancestry_supervised/REFERENCEFILES
adamwillett@Adams-MacBook-Pro REFERENCEFILES % grep "GreatBritain\|Orcadian\|French \|German\|Sweden" Est1000HGDP.fam > ../keep.keep
adamwillett@Adams-MacBook-Pro REFERENCEFILES % cd ../
adamwillett@Adams-MacBook-Pro ancestry_supervised % ./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out REFERENCEFILES/TestScript
PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to REFERENCEFILES/TestScript.log.
Options in effect:
  --bfile REFERENCEFILES/Est1000HGDP
  --keep keep.keep
  --make-bed
  --out REFERENCEFILES/TestScript

8192 MB RAM detected; reserving 4096 MB for main workspace.
135056 variants loaded from .bim file.
4899 people (1737 males, 654 females, 2508 ambiguous) loaded from .fam.
Ambiguous sex IDs written to REFERENCEFILES/TestScript.nosex .
--keep: 165 people remaining.

Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 165 founders and 0 nonfounders present.

Calculating allele frequencies... done.
Total genotyping rate in remaining samples is exactly 1.
135056 variants and 165 people pass filters and QC.

Note: No phenotypes present.
--make-bed to REFERENCEFILES/TestScript.bed + REFERENCEFILES/TestScript.bim +
REFERENCEFILES/TestScript.fam ... done.
adamwillett@Adams-MacBook-Pro ancestry_supervised % bash rawFile_To_Supervised_Results.sh TestScript
RAWINPUT/Adam_ancestry.txt
Adam_ancestry

PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to INDIVPLINKFILES/Adam_ancestry.log.
Options in effect:
  --23file RAWINPUT/Adam_ancestry.txt Adam_ancestry Adam_ancestry
  --make-bed
  --out INDIVPLINKFILES/Adam_ancestry

8192 MB RAM detected; reserving 4096 MB for main workspace.
--23file: INDIVPLINKFILES/Adam_ancestry-temporary.bed +
INDIVPLINKFILES/Adam_ancestry-temporary.bim +
INDIVPLINKFILES/Adam_ancestry-temporary.fam written.
8835 variants with indel calls present.  '--snps-only no-DI' or
--list-23-indels may be useful here.
Inferred sex: male.
677854 variants loaded from .bim file.

1 person (1 male, 0 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 1 founder and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 6 het. haploid genotypes present (see INDIVPLINKFILES/Adam_ancestry.hh
); many commands treat these as missing.
Total genotyping rate is exactly 1.
677854 variants and 1 person pass filters and QC.

Note: No phenotypes present.
--make-bed to INDIVPLINKFILES/Adam_ancestry.bed +
INDIVPLINKFILES/Adam_ancestry.bim + INDIVPLINKFILES/Adam_ancestry.fam ... done.
Adam_ancestry Adam_ancestry 0 0 1 -9

Error: 23754 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
  Adam_ancestry_Adam_ancestry_1002279Test-merge.missnp.

  (Warning: if the subsequent merge seems to work, strand errors involving SNPs
  with A/T or C/G alleles probably remain in your data.  If LD between nearby
  SNPs is high, --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
  that subset of the data to VCF (via e.g. '--recode vcf'), merging with
  another tool/script, and then importing the result; PLINK is not yet suited
  to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.
PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to Adam_ancestry_Adam_ancestry_1002279.log.
Options in effect:
  --bfile Adam_ancestry_Adam_ancestry_1002279
  --flip flip
  --make-bed
  --out Adam_ancestry_Adam_ancestry_1002279


8192 MB RAM detected; reserving 4096 MB for main workspace.
Warning: --make-bed input and output filenames match.  Appending '~' to input
filenames.
124327 variants loaded from .bim file.
1 person (1 male, 0 females) loaded from .fam.
--flip: 23754 SNPs flipped.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 1 founder and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is exactly 1.
124327 variants and 1 person pass filters and QC.
Note: No phenotypes present.
--make-bed to Adam_ancestry_Adam_ancestry_1002279.bed +
Adam_ancestry_Adam_ancestry_1002279.bim +
Adam_ancestry_Adam_ancestry_1002279.fam ... done.
./admixture -j8 -s time --supervised Adam_ancestry_Adam_ancestry_1002279.bed 5
adamwillett@Adams-MacBook-Pro ancestry_supervised % 
scripts.zip

Christopher Chang

May 14, 2024, 7:56:02 PM
to plink2-users
Oh, that's right, if the script is already designed to handle failure of the first merge attempt, then you don't need to worry about said failure.  Sorry about any confusion I introduced earlier.
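
For readers wondering what "handling failure of the first merge attempt" can look like, here is a rough sketch of the general merge, flip, and retry pattern with PLINK 1.9. This is not the tutorial's script (its contents are only in the attached zip); the function name `merge_with_flip_retry`, the `_flipped` suffix, and the `PLINK` variable are assumptions made for illustration. The flags `--bfile`, `--bmerge`, `--flip`, `--make-bed`, and `--out` are real PLINK 1.9 options, and the `-merge.missnp` suffix is the one shown in the error message above.

```shell
# Path to the PLINK 1.9 binary; override with e.g. PLINK=./plink
PLINK="${PLINK:-plink}"

# Hypothetical helper: merge an individual's fileset into a reference
# fileset, flipping strand for flagged variants if the first try fails.
merge_with_flip_retry() {
  local ref="$1" indiv="$2" out="$3"

  # First attempt: PLINK exits with an error if variants have 3+ alleles
  if "$PLINK" --bfile "$ref" --bmerge "$indiv" --make-bed --out "$out"; then
    return 0
  fi

  # On that error, PLINK 1.9 lists the offending variants here
  local missnp="${out}-merge.missnp"
  [ -f "$missnp" ] || return 1

  # Flip strand for just those variants, writing a new fileset
  "$PLINK" --bfile "$indiv" --flip "$missnp" --make-bed \
    --out "${indiv}_flipped" || return 1

  # Second attempt with the flipped fileset
  "$PLINK" --bfile "$ref" --bmerge "${indiv}_flipped" --make-bed --out "$out"
}
```

As the PLINK warning notes, a retry that succeeds does not rule out remaining strand errors at A/T or C/G SNPs, since flipping those never produces a third allele.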
