Reference sequence MD5 mismatch for cram file


Sabine van Schie

Jan 18, 2022, 12:14:18 PM

to igv-help

Hi all,

I would like to view cram files in IGV. The samples are whole-genome sequencing data from S. cerevisiae.

We have downloaded the cram files from our collaborator and I have simply used the default samtools index command to index the cram file.

For some files I'm having problems viewing certain parts of the genome. However, for other files (from the same batch and using the same ref genome and alignment method), I can view the exact same region just fine. The error in IGV is as follows:

Error - possible sequence mismatch (wrong reference for this file): htsjdk.samtools.cram.CRAMException: Reference sequence MD5 mismatch for slice: sequence id 10, start 1, span 666454, expected MD5 c831338ca9079d76bebd0b0a5eb102ef

We are using IGV 2.4.10, using the reference genome SacCer2 (I tried a newer version with the default ref genome SacCer3, but this did not work as the reference genome used for alignment was likely SacCer2).

I have attached one cram index file that gives this error when looking at e.g. YKR002W on Chr XI. I haven't attached the corresponding cram file because it is rather large (200 MB), but if needed I can try to send this or other info that is missing.

I'm a bit puzzled that not all samples give this error, and since I'm new to samtools/IGV and the cram format, I'm not sure where the issue is. Is it simply that I don't have the right reference in IGV (and if so, why does it work for the others), or is the cram file corrupted, or did an error slip in when using samtools?

I would be very happy with any troubleshooting tips!

Thanks, Sabine

42363_4#33.cram.crai

igv-help

Jan 19, 2022, 1:36:14 PM

to igv-help

That most likely means the fasta sequence (reference genome) you are viewing the cram against differs from the reference sequence it was aligned to in the regions yielding that error. The difference can be subtle, e.g. repeat regions padded with "N"s. The way to know for sure is to get the reference (fasta) used to create the CRAM file and use this as the IGV reference. With recent versions you can just load the fasta from the "Genomes > Load from File..." menu; IGV will index it automatically. Version 2.4.10 is quite old, and to be honest I don't remember whether that option existed then, but you could try it.
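
One way to test a candidate fasta before loading it is to compute the MD5 of each sequence and compare it with the "expected MD5" in the error message. The sketch below is illustrative only: the function name `fasta_md5s` is made up for this example, and it assumes the checksum is taken over the upper-cased bases with line breaks removed (the convention used for the M5 tag in SAM/CRAM headers). It also only matches a slice checksum when the slice spans the entire sequence, which is an assumption here rather than something confirmed for this file.

```python
import hashlib


def fasta_md5s(fasta_text):
    """Return {sequence_name: md5_hex} for every record in FASTA-formatted text.

    Assumption: the checksum covers the upper-cased bases with all line
    breaks removed, as for the M5 tag in SAM/CRAM headers.
    """
    sequences = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence name is the first word after '>'.
            name = line[1:].split()[0]
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line)
    return {
        seq_name: hashlib.md5("".join(parts).upper().encode("ascii")).hexdigest()
        for seq_name, parts in sequences.items()
    }


if __name__ == "__main__":
    # Two toy references that differ only by one base replaced with "N".
    original = ">chrToy\nacgtacgt\nACGT\n"
    padded = ">chrToy\nacgtacgt\nACGN\n"
    print(fasta_md5s(original)["chrToy"] == fasta_md5s(padded)["chrToy"])
```

For a real genome, the file contents would be read with something like `open("reference.fa").read()` (the filename is a placeholder). Even a single changed or masked base gives a different checksum, which is why two builds that look nearly identical can still trigger the error.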

I do not know why some samples would work and others wouldn't, unless they were aligned to different versions of the reference. That information should be in the CRAM header, which you can access with samtools.
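
To compare what two CRAM files recorded about their reference, the header of each can be dumped with `samtools view -H sample.cram > sample.header.txt` and the `@SQ` lines compared. The sketch below is a rough illustration, not an official tool: the function names and the example header text are invented, and it assumes the headers carry the optional `M5` (sequence MD5) tag, which a given file may or may not include.

```python
def parse_sq_lines(header_text):
    """Return one dict of tag -> value per @SQ line in SAM/CRAM header text."""
    records = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = {}
        for field in line.split("\t")[1:]:
            # Split on the first ':' only, since values such as UR may contain ':'.
            tag, _, value = field.partition(":")
            fields[tag] = value
        records.append(fields)
    return records


def differing_sequences(header_a, header_b):
    """Names of sequences present in both headers whose M5 values differ."""
    md5_a = {r["SN"]: r.get("M5") for r in parse_sq_lines(header_a) if "SN" in r}
    md5_b = {r["SN"]: r.get("M5") for r in parse_sq_lines(header_b) if "SN" in r}
    shared = md5_a.keys() & md5_b.keys()
    return sorted(name for name in shared if md5_a[name] != md5_b[name])


if __name__ == "__main__":
    # Invented header text for illustration; real input would come from
    # `samtools view -H`.
    working = "@HD\tVN:1.6\n@SQ\tSN:chrX\tLN:8\tM5:aaaa\n@SQ\tSN:chrXI\tLN:12\tM5:bbbb\n"
    failing = "@HD\tVN:1.6\n@SQ\tSN:chrX\tLN:8\tM5:aaaa\n@SQ\tSN:chrXI\tLN:12\tM5:cccc\n"
    print(differing_sequences(working, failing))
```

If a sample that loads and a sample that fails list different `M5` or `LN` values for the same chromosome, they were most likely aligned to different versions of the reference.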
