Loading bcf files with --bcf

807 views
Skip to first unread message

freeseek

unread,
Mar 25, 2014, 10:40:33 AM3/25/14
to plink2...@googlegroups.com
I have the following issue with bcf files. It might be unrelated to plink, so just in case.

If I create a test VCF file and load it with plink with the following commands:

# (echo "##fileformat=VCFv4.1"
# echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
# echo "##contig=<ID=1,length=249212577>"
# echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA"
# echo -e "1\t69428\trs140739101\tT\tG\t.\t.\t.\tGT\t0/0") > test.vcf
# plink --vcf test.vcf

Everything works and I get the following output:

PLINK v1.90a 64-bit (17 Mar 2014)          https://www.cog-genomics.org/plink2
(C) 2005-2014 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink.log.
7762 MB RAM detected; reserving 3881 MB for main workspace.
--vcf: plink.bed + .bim + .fam written.

If then I convert the file to bcf format using bcftools and try to load it into plink with the following commands:

# git clone --branch=bcftools+calling git://github.com/samtools/htslib.git
# cd bcftools; make; cd ..
# bcftools/bcftools view -Ob -o test.bcf test.vcf
# plink --bcf test.bcf

I get the following error:

PLINK v1.90a 64-bit (17 Mar 2014)          https://www.cog-genomics.org/plink2
(C) 2005-2014 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink.log.
7762 MB RAM detected; reserving 3881 MB for main workspace.
Error: Unrecognized GT field format in .bcf file.

However, the bcf file does not seem malformed, and in fact unpacking and loading it seems to work just fine:

# bcftools/bcftools view test.bcf | plink --vcf /dev/stdin
PLINK v1.90a 64-bit (17 Mar 2014)          https://www.cog-genomics.org/plink2
(C) 2005-2014 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink.log.
7762 MB RAM detected; reserving 3881 MB for main workspace.
--vcf: plink.bed + .bim + .fam written.

Christopher Chang

unread,
Mar 25, 2014, 11:06:24 AM3/25/14
to plink2...@googlegroups.com
When I tried this, the .bcf file generated by bcftools did not put quotes around "Genotype" in the FORMAT string; that's why PLINK is complaining.

I'm not sure whether that's considered acceptable by the standard, but given that bcftools is a standard program, I'll go ahead and post a build that permits plain Description=Genotype anyway.

freeseek

unread,
Mar 25, 2014, 11:09:53 AM3/25/14
to plink2...@googlegroups.com
When I generated the bcf file, I do see the quotes around "Genotype" (both with bcftools view and with zless). I don't think this is the issue.

Christopher Chang

unread,
Mar 25, 2014, 11:22:12 AM3/25/14
to plink2...@googlegroups.com
Hmm, with quotes, I'm not getting any problems with plink --bcf.  I'll post a new stable build in about 15 minutes; try that out, and if you still have problems I'll investigate tomorrow (send me the .bcf file in that case since it might differ from what my bcftools installation is generating).

freeseek

unread,
Mar 25, 2014, 11:27:01 AM3/25/14
to plink2...@googlegroups.com
Here you can find the files I have generated with the above commands: http://bit.ly/1l0Wqj9

Christopher Chang

unread,
Mar 25, 2014, 11:29:15 AM3/25/14
to plink2...@googlegroups.com
Ah, it's the "IDX=1" at the end that was confusing PLINK.  That'll be fixed in today's build.

freeseek

unread,
Mar 25, 2014, 11:33:41 AM3/25/14
to plink2...@googlegroups.com
I was just thinking the same, but after I added the IDX=1 to the VCF file, it was still loaded correctly with --vcf but not with --bcf. Anyhow, I am happy to test the new build. Thanks a lot Christopher.

Christopher Chang

unread,
Mar 25, 2014, 11:49:57 AM3/25/14
to plink2...@googlegroups.com
New build posted.  (Yes, there are a few gratuitous differences between the --vcf and --bcf code that I should stamp out...)

freeseek

unread,
Mar 25, 2014, 12:10:34 PM3/25/14
to plink2...@googlegroups.com
Yes, it works fine now!

freeseek

unread,
Mar 25, 2014, 3:46:00 PM3/25/14
to plink2...@googlegroups.com
Also, likely unrelated to this post, it is really great that plink does read bcf files, though it cannot output them.

Maybe this is out of the scope of plink, but it would be convenient if something like the following was possible:

plink --bfile plink --recode vcf-fid --out /dev/stdout | bcftools view -Ob -o plink.bcf

Christopher Chang

unread,
Mar 25, 2014, 8:03:38 PM3/25/14
to plink2...@googlegroups.com
Yeah, that's out of scope.  Direct BCF and/or gzipped VCF output might be added in the future, though.

freeseek

unread,
Mar 25, 2014, 8:12:26 PM3/25/14
to plink2...@googlegroups.com
One potential suggestion would be to add an option to output to standard output for those formats in --recode that only yield one file. I actually write a lot of scripts where I use syntax like "--vcf /dev/stdin", "--extract /dev/stdin", and "--keep /dev/stdin" to avoid writing intermediate files. But maybe it is just me. Just a suggestion. Thanks again for your prompt responses.

freeseek

unread,
Mar 26, 2014, 9:27:02 PM3/26/14
to plink2...@googlegroups.com
Sorry to keep bothering! But I think I have found another bug. Here three scenarios:

(echo "##fileformat=VCFv4.1"
echo "##FILTER=<ID=PASS,Description=\"All filters passed\">"
echo "##FILTER=<ID=X,Description=\"X\">"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo "##contig=<ID=1,length=249212577>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA"
echo -e "1\t1\trs1\tT\tG\t.\t.\t.\tGT\t0/1") | bcftools view -Ob | plink --bcf /dev/stdin

yields: "Error: No variants in .bcf file."

(echo "##fileformat=VCFv4.1"
echo "##FILTER=<ID=PASS,Description=\"All filters passed\">"
echo "##FILTER=<ID=X,Description=\"X\">"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo "##contig=<ID=1,length=249212577>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA"
echo -e "1\t1\trs1\tT\tG\t.\t.\t.\tGT\t0/1") | plink --vcf /dev/stdin

works just fine: "--vcf: plink.bed + .bim + .fam written."

(echo "##fileformat=VCFv4.1"
echo "##FILTER=<ID=PASS,Description=\"All filters passed\">"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo "##FILTER=<ID=X,Description=\"X\">"
echo "##contig=<ID=1,length=249212577>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA"
echo -e "1\t1\trs1\tT\tG\t.\t.\t.\tGT\t0/1") | bcftools view -Ob | plink --bcf /dev/stdin

works just fine: "--vcf: plink.bed + .bim + .fam written."

There must be a problem related to the order of the headers when inputing a bcf file rather than a bcf file.

Christopher Chang

unread,
Mar 26, 2014, 11:47:47 PM3/26/14
to plink2...@googlegroups.com
Okay, this was because I didn't realize FILTER and FORMAT IDs went into the same string dictionary.  Will post a fix later today.

freeseek

unread,
Mar 27, 2014, 10:45:54 AM3/27/14
to plink2...@googlegroups.com
It is still me! :-( This must be an unrelated bug. I cannot find any good guess.

(echo "##fileformat=VCFv4.1"
echo "##FILTER=<ID=PASS,Description=\"All filters passed\">"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"GT\">"
echo "##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"AD\">"
echo "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"DP\">"
echo "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"GQ\">"
echo "##FORMAT=<ID=PL,Number=G,Type=Integer,Description=\"PL\">"
echo "##contig=<ID=1,length=249250621>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA\tB\tC\tD\tE"
echo -e "1\t1\trs1\tC\tG\t0\tPASS\t.\tGT:AD:DP:GQ:PL\t./.:.:0:.:.\t./.:.:0:.:.\t./.:.:0:.:.\t./.:.:0:.:.\t./.:.:0:.:.") | plink --vcf /dev/stdin

yields "--vcf: plink.bed + .bim + .fam written."

(echo "##fileformat=VCFv4.1"
echo "##FILTER=<ID=PASS,Description=\"All filters passed\">"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"GT\">"
echo "##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"AD\">"
echo "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"DP\">"
echo "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"GQ\">"
echo "##FORMAT=<ID=PL,Number=G,Type=Integer,Description=\"PL\">"
echo "##contig=<ID=1,length=249250621>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA\tB\tC\tD\tE"
echo -e "1\t1\trs1\tC\tG\t0\tPASS\t.\tGT:AD:DP:GQ:PL\t./.:.:0:.:.\t./.:.:0:.:.\t./.:.:0:.:.\t./.:.:0:.:.\t./.:.:0:.:.") | bcftools view -Ob | plink --bcf /dev/stdin

yields "Error: Improperly formatted .bcf file."

This comes from a VCF file generated by GATK and further handled by bcftools.

Christopher Chang

unread,
Mar 27, 2014, 8:51:25 PM3/27/14
to plink2...@googlegroups.com
Er, that's because --bcf forgot to skip over the other FORMAT fields when there was anything past GT...

Fix will be posted later today.

freeseek

unread,
Apr 7, 2014, 10:45:58 AM4/7/14
to plink2...@googlegroups.com
The program now processes a 60GB bcf file in less than half an hour. The speed is really impressive! The gain compared to compressed VCF files is also several fold.

freeseek

unread,
Apr 15, 2014, 11:08:49 AM4/15/14
to plink2...@googlegroups.com
Just noticed another glitch with version PLINK v1.90b1p 64-bit (13 Apr 2014):
(echo "##fileformat=VCFv4.1"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo "##contig=<ID=1,length=249212577>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA"
echo -e "1\t69428\trs140739101\tT\tG\t.\t.\t.\tGT\t0/0") > plink.vcf

Now:
plink --vcf plink.vcf && cat plink.bim
Yields:
1 rs140739101 0 69428 G T

While:
bcftools view -Ob plink.vcf | plink --bcf /dev/stdin && cat plink.bim
Yields:
1 rs140739101 0 69427 G T

Somehow *all* coordinates are shifted by 1 when using --bcf rather than --vcf.

Christopher Chang

unread,
Apr 15, 2014, 11:20:42 AM4/15/14
to plink2...@googlegroups.com
...VCF coordinates are defined to be 1-based while BCF is 0-based?!

Okay, I guess I'll add a +1 to the conversion routine right now...

Christopher Chang

unread,
Apr 15, 2014, 11:51:50 AM4/15/14
to plink2...@googlegroups.com
Fix is now posted.


On Tuesday, April 15, 2014 11:08:49 PM UTC+8, freeseek wrote:

freeseek

unread,
Jan 25, 2015, 5:23:28 PM1/25/15
to plink2...@googlegroups.com
With bcftools you can output files with the following option:
-O,   --output-type <b|u|z|v>       b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]

The -Oz/-Ov output types are well recognized by "plink --vcf" but while the -Ob is recognized by "plink --bcf" the -Ou is not accepted.

While it might seem like a useless output type, the -Ou is the best option to pass a VCF/BCF files in between two UNIX commands, as it does not require compression/decompression. In particular, it would be nice to be able to run something like "bcftools view -Ou | plink --bcf /dev/stdin". I don't know how complicated this would be, but my guess is that plink would just need to understand that there is no need to decompress the input file, the same way the --vcf command currently does.

Also, somehow bcftools automatically guesses the correct format between -Ob/-Ou/-Oz/-Ov without having to specify if it is text/binary or compressed/decompressed. Would it make sense for plink also to have a unified --vcf option which guesses automatically which one of these four formats is passed?

Christopher Chang

unread,
Jan 25, 2015, 5:33:33 PM1/25/15
to plink2...@googlegroups.com
Hrm, that's strange; I would have expected the zlib I/O functions I'm using to handle uncompressed files properly.  Could you send me small uncompressed and uncompressed BCFs representing the same data, so I can investigate this?

freeseek

unread,
Jan 25, 2015, 6:10:13 PM1/25/15
to plink2...@googlegroups.com
(echo "##fileformat=VCFv4.1"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo "##contig=<ID=1,length=249212577>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA"
echo -e "1\t69428\trs140739101\tT\tG\t.\t.\t.\tGT\t0/0") > test.vcf
bcftools view -Ou test.vcf | plink --bcf /dev/stdin

freeseek

unread,
Jan 25, 2015, 6:14:16 PM1/25/15
to plink2...@googlegroups.com
I think the problem only happens when the uncompressed binary VCF file is passed to plink as a stream rather than a file. Passing as a stream with -Ob/-Oz/-Ov though works without problem for plink. A little unfortunate as the -Ou option is mainly useful when passing as a stream.

Christopher Chang

unread,
Jan 25, 2015, 6:45:29 PM1/25/15
to plink2...@googlegroups.com
Okay, that would make a lot more sense.  No promises, but I'll keep my eye out for a painless way to make that work.

If the issue is not restricted to streams, please let me know.

freeseek

unread,
Feb 3, 2015, 3:09:39 PM2/3/15
to plink2...@googlegroups.com
I have noticed another problem. In the new 1000 Genomes project release they have a new of encoding for genotypes in the X chromosome nonPAR region. Try the following command to see:

The genotypes for males in this region are now haploid. If I try to convert this format to plink using the --vcf command:
bcftools view --samples HG00096 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/bcf_files/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.bcf X:2699934-2700027 | plink --vcf /dev/stdin --recode && cat plink.ped

Everything works fine. The haploid genotypes are converted to homozygous diploid genotypes. But if I try using the --bcf command:
bcftools view -Ob --samples HG00096 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/bcf_files/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.bcf X:2699934-2700027 | plink --bcf /dev/stdin --allow-extra-chr --recode && cat plink.ped

Every haploid genotype is converted to a diploid missing genotype, which clearly is not the desired behavior.

Christopher Chang

unread,
Feb 3, 2015, 3:19:57 PM2/3/15
to plink2...@googlegroups.com
I believe this was fixed in yesterday's builds?  Let me know if you see the same problem with them.

freeseek

unread,
Feb 3, 2015, 3:25:07 PM2/3/15
to plink2...@googlegroups.com
Yes, it is actually fixed now. Ooops.
Reply all
Reply to author
Forward
0 new messages