Problem in comparing two VCFs with eval module


Luis Willian Pacheco Arge

May 17, 2022, 4:22:52 PM
to RTG Users
Hi everyone

I'm trying to compare two VCFs: the first comes from the 1000 Genomes Project, and the second comes from my own pipeline.
The error I'm facing is the following:

Command:
for i in $(ls /media/biomehub/HD-4Tb/MIOSeq/CRAM/GATK-1000G-pipeline/GATK_results/*-snps-indels-predicted-effect-snpEff.vcf | sed -E 's/\/media.*GATK_results\/|-.*//g'); do
    rm -rf /media/biomehub/HD-4Tb/MIOSeq/CRAM/GATK-1000G-pipeline/GATK_results/${i}
    java -jar ~/Softwares/rtg-tools/build/rtg-tools.jar vcfeval \
       --all-records \
       --decompose \
       -b /media/biomehub/HD-4Tb/MIOSeq/CRAM/GATK-1000G-pipeline/GATK_results/${i}-snps-indels-predicted-effect-snpEff.vcf.gz \
       -c /media/biomehub/HD-4Tb/MIOSeq/BAM/Variants-original/DMD-1kG-${i}.vcf.gz \
       -t /media/biomehub/HD-4Tb/MIOSeq/hsa-genome/GRCh38_full_analysis_set_plus_decoy_hla.sdf \
       -f QUAL \
       -o /media/biomehub/HD-4Tb/MIOSeq/CRAM/GATK-1000G-pipeline/GATK_results/${i} \
       --bed-regions=/media/biomehub/HD-4Tb/MIOSeq/hsa-genome/DMD-SMN-seq/DMD-GRCh38.bed  \
       --output-mode=combine
done

Error:
Error: SAM input has an irrecoverable problem. Invalid GZIP header
Error: An IO problem occurred: "/media/biomehub/HD-4Tb/MIOSeq/CRAM/GATK-1000G-pipeline/GATK_results/HG02586-snps-indels-predicted-effect-snpEff.vcf.gz has invalid uncompressedLength: -1135658362"
Error: An IO problem occurred: "/media/biomehub/HD-4Tb/MIOSeq/CRAM/GATK-1000G-pipeline/GATK_results/HG02594-snps-indels-predicted-effect-snpEff.vcf.gz has invalid uncompressedLength: -396169846"
Error: SAM input has an irrecoverable problem. Invalid GZIP header
Error: An IO problem occurred: "/media/biomehub/HD-4Tb/MIOSeq/CRAM/GATK-1000G-pipeline/GATK_results/HG02620-snps-indels-predicted-effect-snpEff.vcf.gz has invalid uncompressedLength: -200145969"

One observation: when I compare either file against a dbSNP subset (VCF format), there is no error.

All files are in vcf.gz and indexed with tabix.

Cheers, 
Luis Arge

Len Trigg

May 17, 2022, 6:46:51 PM
to Luis Willian Pacheco Arge, RTG Users
Hi Luis,

Those types of errors indicate a problem with the VCF or the tabix index. We have seen similar errors when the tabix indexes are out of date compared to the compressed VCF, and sometimes when the file is mounted via an unreliable network mount. Try creating a reproducible single example outside of your loop and then check whether the problematic files are correctly block compressed and tabixed. If you still have the error, send us a small zip containing the files that demonstrate the problem.

Cheers,
Len.
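
The checks suggested above can be sketched as a small shell helper to run on a single file, outside the loop. This is only a sketch under a few assumptions: the function names `index_is_stale` and `check_vcf` are made up for illustration, the index is assumed to sit next to the VCF as `<file>.vcf.gz.tbi`, and "out of date" is approximated by comparing file modification times. `gzip -t` detects truncated or corrupt gzip data, but it cannot tell bgzip block compression from plain gzip, so it is only a first check.

```shell
# Sketch: sanity-check one compressed VCF and its tabix index before vcfeval.
# Function names are hypothetical; the index is assumed to be "<vcf>.tbi".

# Succeeds (returns 0) if the index is missing or older than the VCF.
index_is_stale() {
    local vcf="$1"
    local tbi="${vcf}.tbi"
    [ ! -e "$tbi" ] || [ "$vcf" -nt "$tbi" ]
}

# Tests gzip integrity, then rebuilds the index only if it looks stale.
check_vcf() {
    local vcf="$1"
    if ! gzip -t "$vcf" 2>/dev/null; then
        echo "CORRUPT: $vcf" >&2
        return 1
    fi
    if index_is_stale "$vcf"; then
        echo "STALE INDEX: $vcf"
        # -f overwrites an existing index; -p vcf selects the VCF preset.
        tabix -f -p vcf "$vcf"
    else
        echo "OK: $vcf"
    fi
}

# Example: check_vcf /path/to/sample.vcf.gz
```

If a file is reported as corrupt, or tabix refuses to index it, the usual fix is to regenerate it from the uncompressed VCF with `bgzip` and then index it again with `tabix -p vcf`.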



Luis Willian Pacheco Arge

May 18, 2022, 7:34:44 AM
to RTG Users, Len Trigg, Luis Willian Pacheco Arge
Hi Len

I re-ran all the previous steps to regenerate the final VCF files and their indexes, and then rtg vcfeval ran without error.
So the problem was just that the files were out of date.

Thanks for clarifying this issue.

Luis Arge
