Has a problem with bcftools merge .vcf.gz


MAOMAO

Apr 3, 2020, 11:29:26 AM
to plink2-users

Hi all,

I downloaded all the chr*.vcf.gz and chr*.vcf.gz.tbi files into my working directory. I then used bcftools merge to combine the 1kG genotype data from chr1 to chr22, but I ran into an issue. Any suggestions would be appreciated.


$ bcftools merge ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -Oz -o merged.vcf
[W::bcf_sr_add_reader] No BGZF EOF marker; file 'ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz' may be truncated
[E::bgzf_uncompress] Inflate operation failed: invalid distance too far back
[E::bgzf_read] Read block operation failed with error 1 after 6418 of 118752 bytes
[E::hts_idx_load3] Could not load local index file 'ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi'
Failed to open ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz: could not load index

Esoh

Apr 3, 2020, 1:13:48 PM
to plink2-users
Hi,

I'm not sure if you meant to use bcftools concat, since you are combining files for different chromosomes that contain the same samples.
The 'No BGZF EOF marker' and '... may be truncated' messages suggest that your downloads did not complete or that something went wrong during the download.
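
If it helps, here is a rough way to test each file for an incomplete download before running bcftools again. It is only a sketch that assumes standard gzip, tail and od are available; the name check_bgzf is made up for illustration. It first tests the whole gzip stream, then checks that the file ends with the 28-byte empty BGZF block that marks end-of-file.

```shell
# check_bgzf FILE: hypothetical helper, not part of bcftools or htslib.
# Prints "OK", "CORRUPT" (the gzip stream fails to decompress) or
# "NO_EOF" (the file does not end with the 28-byte BGZF end-of-file block).
BGZF_EOF_HEX=1f8b08040000000000ff0600424302001b0003000000000000000000

check_bgzf() {
    # gzip -t decompresses the whole file and verifies each CRC32.
    if ! gzip -t "$1" 2>/dev/null; then
        echo "CORRUPT"
        return 1
    fi
    # Hex-dump the last 28 bytes and compare them with the EOF block.
    last28=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \t\n')
    if [ "$last28" != "$BGZF_EOF_HEX" ]; then
        echo "NO_EOF"
        return 1
    fi
    echo "OK"
}
```

Something like `for f in ALL.chr*.vcf.gz; do printf '%s\t' "$f"; check_bgzf "$f"; done` would print one status per file. Note that gzip -t reads the entire file, so it can take a while on the larger chromosomes.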

My advice: try bcftools concat, and if the errors persist, consider downloading the files again.

bcftools concat [ options ] -Oz -o merge.vcf.gz ALL.chr{1..22}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
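
Since bcftools concat expects every input file to have the same sample columns in the same order, it may be worth confirming that first. The sketch below uses only gzip, sed and cut; sample_columns and same_samples are made-up helper names. (If bcftools is working, `bcftools query -l file.vcf.gz` also lists the samples.)

```shell
# sample_columns FILE: print the sample names from the #CHROM header line
# (columns 10 onward), tab-separated. Hypothetical helper.
sample_columns() {
    # sed prints the first line starting with #CHROM and then quits,
    # so the rest of the file is not read.
    gzip -dc "$1" | sed -n '/^#CHROM/{p;q;}' | cut -f 10-
}

# same_samples FILE_A FILE_B: hypothetical helper; succeeds only when both
# files list identical samples in identical order.
same_samples() {
    [ "$(sample_columns "$1")" = "$(sample_columns "$2")" ]
}
```

For example, `same_samples chr1.vcf.gz chr2.vcf.gz && echo same` (file names shortened here) prints "same" only when the two sample lists match.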

Esoh

Ying Zhao

Apr 3, 2020, 2:58:39 PM
to Esoh, plink2-users
Thanks so much Esoh. 

I downloaded the files again and ran the command below, but got another error message. Is there any way to check and fix this? Thanks a lot.

bcftools concat -Oz -o merge.vcf.gz ALL.chr{1..22}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
 

Checking the headers and starting positions of 22 files
[E::bgzf_uncompress] CRC32 checksum mismatch
[E::bgzf_read] Read block operation failed with error 33 after 14 of 16 bytes
Concatenating ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
[E::bgzf_uncompress] CRC32 checksum mismatch
[E::bgzf_read] Read block operation failed with error 33 after 14 of 16 bytes



Kevin Esoh

Apr 3, 2020, 3:29:00 PM
to Ying Zhao, plink2-users
Hi Ying,

I'm not too familiar with this error. However, a 'checksum mismatch' error would mean the data was corrupted during transmission (the download).
I found this post (https://groups.google.com/forum/#!topic/platypus-users/I-00Lb-58Ik) describing a similar error, although it involved BAM files.
Some lessons from it:
- Consider re-indexing the VCF files

 bcftools index -f -t file.vcf.gz
 or
 tabix -p vcf -f file.vcf.gz

- Consider removing any other index files you don't need from the directory.
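
To see which files might need re-indexing after a fresh download, something along these lines could help. It is a sketch only: index_status is a made-up name, and it assumes the index sits next to the data file as file.vcf.gz.tbi. The -ot (older than) test is not strictly POSIX but is available in bash, dash and ksh.

```shell
# index_status FILE.vcf.gz: hypothetical helper. Reports whether the
# matching .tbi index is absent or empty, older than the data file, or fine.
index_status() {
    tbi="$1.tbi"
    if [ ! -s "$tbi" ]; then
        echo "MISSING"      # no index, or a zero-byte index
    elif [ "$tbi" -ot "$1" ]; then
        echo "STALE"        # index predates the (re-downloaded) data file
    else
        echo "OK"
    fi
}
```

Any file reported as MISSING or STALE is a candidate for one of the re-indexing commands above.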
 
Kind regards,
Esoh

--
Kevin K. Esoh
Bioinformatics and Molecular Biology
Department of Biochemistry
Jomo Kenyatta University of Agriculture and Technology
JKUAT - Kenya

Ying Zhao

Apr 3, 2020, 3:29:41 PM
to Esoh, plink2-users

Here is my header. It is very different from the sample .vcf file. Is that normal? I don't know how to deal with it. Any suggestions would be appreciated. There is a lot of information, then the $CHROM POS ID ... line (highlighted in red), and then HG00103, HG00105 ...
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20150218
##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
##source=1000GenomesPhase3Pipeline
##contig=<ID=1,assembly=b37,length=249250621>
##contig=<ID=2,assembly=b37,length=243199373>
##contig=<ID=3,assembly=b37,length=198022430>
##contig=<ID=4,assembly=b37,length=191154276>
##contig=<ID=5,assembly=b37,length=180915260>
##contig=<ID=6,assembly=b37,length=171115067>
##contig=<ID=7,assembly=b37,length=159138663>
##contig=<ID=8,assembly=b37,length=146364022>
##contig=<ID=9,assembly=b37,length=141213431>
##contig=<ID=10,assembly=b37,length=135534747>
##contig=<ID=11,assembly=b37,length=135006516>
##contig=<ID=12,assembly=b37,length=133851895>
##contig=<ID=13,assembly=b37,length=115169878>
##contig=<ID=14,assembly=b37,length=107349540>
##contig=<ID=15,assembly=b37,length=102531392>
##contig=<ID=16,assembly=b37,length=90354753>
##contig=<ID=17,assembly=b37,length=81195210>
##contig=<ID=18,assembly=b37,length=78077248>
##contig=<ID=19,assembly=b37,length=59128983>
##contig=<ID=20,assembly=b37,length=63025520>
##contig=<ID=21,assembly=b37,length=48129895>
##contig=<ID=22,assembly=b37,length=51304566>
##contig=<ID=GL000191.1,assembly=b37,length=106433>
##contig=<ID=GL000192.1,assembly=b37,length=547496>
##contig=<ID=GL000193.1,assembly=b37,length=189789>
##contig=<ID=GL000194.1,assembly=b37,length=191469>
##contig=<ID=GL000195.1,assembly=b37,length=182896>
##contig=<ID=GL000196.1,assembly=b37,length=38914>
##ALT=<ID=CN118,Description="Copy number allele: 118 copies">
##ALT=<ID=CN119,Description="Copy number allele: 119 copies">
##ALT=<ID=CN120,Description="Copy number allele: 120 copies">
##ALT=<ID=CN121,Description="Copy number allele: 121 copies">
##ALT=<ID=CN122,Description="Copy number allele: 122 copies">
##ALT=<ID=CN123,Description="Copy number allele: 123 copies">
##ALT=<ID=CN124,Description="Copy number allele: 124 copies">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=CS,Number=1,Type=String,Description="Source call set.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate of this variant">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=MC,Number=.,Type=String,Description="Merged calls.">
##INFO=<ID=MEINFO,Number=4,Type=String,Description="Mobile element info of the form NAME,START,END<POLARITY; If there is only 5' OR 3' support for this call, will be NULL NULL for START and END">
##INFO=<ID=MEND,Number=1,Type=Integer,Description="Mitochondrial end coordinate of inserted sequence">
##INFO=<ID=MLEN,Number=1,Type=Integer,Description="Estimated length of mitochondrial insert">
##INFO=<ID=MSTART,Number=1,Type=Integer,Description="Mitochondrial start coordinate of inserted sequence">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="SV length. It is only calculated for structural variation MEIs. For other types of SVs; one may calculate the SV length by INFO:END-START+1, or by finding the difference between lengthes of REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=TSD,Number=1,Type=String,Description="Precise Target Site Duplication for bases, if unknown, value will be NULL">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1)">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=EAS_AF,Number=A,Type=Float,Description="Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=EUR_AF,Number=A,Type=Float,Description="Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AMR_AF,Number=A,Type=Float,Description="Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=SAS_AF,Number=A,Type=Float,Description="Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth; only low coverage data were counted towards th
##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
##INFO=<ID=MULTI_ALLELIC,Number=0,Type=Flag,Description="indicates whether a site is multi-allelic">
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00105 HG00106 HG00107 HG00108 HG00109 HG00110 HG00111 HG00112 HG00113 HG00114 HG00115 HG00116 HG00117 HG00118 HG00119 HG00120 HG00121 HG00122 HG00123 HG00125 HG00126 HG00127 HG00128 HG00129 HG00130 HG00131 HG00132 HG00133 HG00136 HG00137 HG00138 HG00139 HG00140 HG00141 HG00142 HG00143 HG00145 HG00146 HG00148 HG00149 HG00150 HG001#51 HG00154 HG00155 HG00157 HG00158 HG00159 HG00160 HG00171 HG00173 HG00174 HG00176 HG00177 HG00178 HG00179 HG00180 HG00181 HG00182 HG00183 HG00185 HG00186 HG00187 HG00188 HG00189 HG00190 HG00231 HG00232 HG00233 HG00234 HG00235 HG00236 HG00237 HG00238 HG00239 HG00240 HG00242 HG00243 HG00244 HG00245 HG00246 HG00250 HG00251 HG00252 HG00253 HG00254 HG00255 HG00256 HG00257 HG00258 HG00259 HG00260 HG00261 HG00262 HG00263 HG00264 HG00265 HG0
2 10179 rs567117114 TA T 100 PASS AC=15;AF=0.00299521;AN=5008;NS=2504;DP=1589;EAS_AF=0.002;AMR_AF=0.0029;AFR_AF=0.0076;EUR_AF=0;SAS_AF=0.001;AA=|||unknown(NO_COVERAGE);VT=INDEL GT 0|0 0|00|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|00|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|00|0 0|0 0|0 0|0 0|0 0|0 0|0

Kevin Esoh

Apr 3, 2020, 3:36:37 PM
to Ying Zhao, plink2-users
Hi Ying,

The header looks OK, except for the CHROM line, which is supposed to begin with a single '#'.
The information before the CHROM line is necessary.

Here's a useful resource on the VCF file format (https://en.wikipedia.org/wiki/Variant_Call_Format)
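
To check the column header in the file itself (a pasted or wrapped terminal view may drop characters), a small sketch like this could be used. It assumes only gzip, grep, head and cut; column_header_status is a made-up name.

```shell
# column_header_status FILE.vcf.gz: hypothetical helper. Looks at the first
# line that is not a "##" meta line and checks that it starts with "#CHROM".
column_header_status() {
    # Keep only the first six characters of the first non-meta line.
    start=$(gzip -dc "$1" | grep -v '^##' | head -n 1 | cut -c 1-6)
    if [ "$start" = "#CHROM" ]; then
        echo "OK"
    else
        echo "BAD"
    fi
}
```

If this prints OK for a file, its column header line already starts with '#CHROM' as the format requires.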

Kind regards,
Esoh

Ying Zhao

Apr 3, 2020, 3:51:46 PM
to Kevin Esoh, plink2-users
Thanks so much for all your help, Kevin. I will try that.

