BAM file display error "twoBitReadSeqFrag in chrY end >= seqSize"


Open Genomes

Apr 27, 2015, 3:33:41 PM
to UCSC Genome Browser List
I am trying to display a human hg19 BAM file and it throws the following error:

  • twoBitReadSeqFrag in chrY end (169915806) >= seqSize (59373566)
This can be seen in the following UCSC Genome Browser public session:

Here is a screenshot of the same session in case anyone is unable to access it in the UCSC Browser:

The cause of this error is that the file contains paired-end reads where one end of the pair is aligned to one chromosome and the other end to an entirely different chromosome.
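As a rough way to confirm this outside a viewer, the sketch below scans SAM-format text and reports paired reads whose mate is on a different reference. It relies only on the standard SAM columns (RNAME in column 3, RNEXT in column 7, PNEXT in column 8). The function name, the `DiscordantMate` type, and the two example records are made up for illustration and are not taken from the actual BAM file; only the chrY length comes from the error message above.

```python
from typing import Iterable, Iterator, NamedTuple


class DiscordantMate(NamedTuple):
    read_name: str
    chrom: str
    pos: int
    mate_chrom: str
    mate_pos: int


def find_cross_chrom_mates(sam_lines: Iterable[str]) -> Iterator[DiscordantMate]:
    """Yield paired reads whose mate is mapped to a different reference."""
    for line in sam_lines:
        # Skip blank lines and SAM header lines (which start with "@").
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory columns
            continue
        qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        rnext, pnext = fields[6], int(fields[7])
        is_paired = bool(flag & 0x1)
        # RNEXT "=" means "same reference as RNAME"; "*" means unavailable.
        if is_paired and rname != "*" and rnext not in ("=", "*", rname):
            yield DiscordantMate(qname, rname, pos, rnext, pnext)


CHRY_SIZE = 59373566  # hg19 chrY length, as reported in the error message

# Illustrative records only (hypothetical read names and positions).
example = [
    "@SQ\tSN:chrY\tLN:59373566",
    "readA\t97\tchrY\t2783700\t60\t100M\tchr4\t169915700\t0\t*\t*",
    "readB\t99\tchrY\t2783650\t60\t100M\t=\t2783800\t250\t*\t*",
]

hits = list(find_cross_chrom_mates(example))
for hit in hits:
    beyond = hit.mate_pos > CHRY_SIZE
    print(f"{hit.read_name}: {hit.chrom}:{hit.pos} mate on "
          f"{hit.mate_chrom}:{hit.mate_pos} (beyond chrY end: {beyond})")
```

The input lines could come from `samtools view -h` run on the BAM file. A mate coordinate on chr4 that is larger than the length of chrY would explain how an "end" value past the chrY sequence size can arise if the mate position is interpreted on the wrong chromosome.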

Here is a screenshot of IGV at the same location in the same BAM file as above, with the reads colored according to chromosome. The mate of the yellow-colored read on chrY starts on chr4.

For purposes of debugging, the entire BAM file is directly available here:

These BAM files are being produced by Family Tree DNA / Gene By Gene Ltd. for their "Big Y" Y chromosome capture array sequencing product. 

It is of course physically impossible for paired-end reads, which must come from the same DNA fragment, to lie on different chromosomes, but such alignments are apparently allowed by the current BAM standard. Understandably, the UCSC Browser throws an error, which is passed up to the user interface.

The error prevents the display of any track in the current session and gives no clue as to which track is causing it. The track controls are not displayed either. This makes the problem extremely difficult to debug, because each individual BAM track must be hidden (or displayed as "Dense") to figure out which BAM file is generating the error. (A second version of the session must be opened at a different location, the tracks in that session must be hidden one at a time, and the "problematic" position must then be redisplayed without opening a new session. IGV is also needed to debug the problem, since more than one BAM track may have misaligned paired-end reads.) In addition, the error gives very little information about the read causing it, such as the other chromosome of the pair, or about which track or tracks have paired-end reads on different chromosomes.

I can understand that the UCSC Browser was not designed to handle what one would rightly consider "invalid" BAM files. However, these files were produced by Gene By Gene Ltd. for their "Big Y" Y capture array sequencing product using their custom Arpeggi Inc. proprietary alignment and variant calling software. Paired-end read pairs aligned to different chromosomes have appeared in every Gene By Gene "Big Y" BAM file checked so far. Hundreds of these BAM files have already been delivered to customers. It defies comprehension how what can accurately be described as invalid BAM files, which contain physically impossible alignments, can be delivered as part of a commercial product.

Moreover, Gene By Gene Ltd.'s Big Y was used in a new study of the phylogeography of Y-DNA haplogroup G1, just published in PLoS ONE:


The specifications for the product are given in the paper, and two employees of Gene By Gene Ltd. are listed as co-authors. It seems that the peer reviewers (and perhaps the main authors) were unable to catch these invalid alignments because they had no direct way of displaying the BAM files and individually checking the discovered SNPs in the larger phylogenetic context of other Y-DNA haplogroup G and G1 BAM files. As we see above, the Big Y BAM files contain numerous "heterozygous" Y chromosome reads, and since the Y chromosome is haploid, two values at one physical position are of course impossible. Again, this is the result of invalid alignments against the Build 37 Reference Sequence due to X homology or Y chromosome CNVs. All the authors did was eliminate common "SNPs" (many of which have the ancestral and derived states reversed), but with the seeming heterozygosity, there are cases where it can appear that one sample is unambiguously derived while the other is "heterozygous" and under the reportable threshold. The feature where the UCSC Browser can display grey shading based on base quality scores and overall read alignment quality is also very useful in visually eliminating low-quality and misaligned reads. The fact that this product was "not ready for release" (let alone for use in academic studies) would have been immediately apparent if the researchers from the Vavilov Institute had been able to compare the 20 files side-by-side in the UCSC Browser. (Obviously, they were unable to do so until UCSC, quite quickly, adapted the browser to handle the unused "optional" CIGAR operators generated by Gene By Gene. No doubt this happened after the manuscript was submitted for publication.)

Given that Gene By Gene Ltd. did nothing to correct the earlier issue with non-standard CIGAR operators for all of their customers, it seems very unlikely that they will correct these current problems with their BAM files either. Even if they just limit reads to the uniparental markers (Y and mtDNA) which are in fact captured by the Y capture array, this would not solve the problem of newly unpaired reads or any future whole genome sequencing aligned using the same software. Unfortunately, it's a situation that UCSC (and Gene By Gene's customers) will have to deal with as long as these products continue to be sold. 
 
A quote from Gareth Highnam, of Gene By Gene Ltd.:
"I am a member of the science team working on BigY, and I am sorry that you have been running into difficulty with the BAM files. The UCSC Genome Browser is a fantastic tool, except with respect to BAM file viewing it is slightly outdated. This is why in your screenshot you see the "update me" error message - our BAM files use the most recent recommended settings but the UCSC system is one version behind. I highly recommend the Integrated Genome Viewer (IGV http://www.broadinstitute.org/igv/), it is a very pretty and very functional visualizer similar in functionality to UCSC from the Broad Institute and we have been it using frequently here at the company."

As I've said before, Open Genomes' mission is to make all kinds of human genomic data public in the INSD, and freely available to everyone, researchers and the general public alike. Having considered all the available alternatives, the UCSC Browser is by far the best platform to display complex genomic data from different sources, and integrate it with a wide variety of preexisting and custom analysis tracks which make interpretation much easier. As part of our membership in the Global Alliance for Genomics and Health along with UCSC, our intention is to use the UCSC Browser as a front-end for many kinds of human genomic data in the INSD. Here, then, we have a somewhat popular and expensive commercial next-generation sequencing product whose results, because of what can best be described as corruption due to inadequately tested software, cannot be validated either by the average not-very-technical consumer, who cannot install Java on their tablet or smartphone, or by academic researchers.

Hopefully there will be some sort of fix that will allow the other reads to be displayed while the misaligned pair is somehow flagged, similar to how IGV does it but in a way that would indicate that there is indeed an error here. That way, such reads could be discounted in any sort of visual verification of variants, and they would also be a sign of a region with general alignment problems, as we see above. There's also no telling directly from the browser whether one or the other of the pair was the one from the target regions if a capture array was used. I don't think this would be too difficult to implement; I think it's more a UI design decision about how to indicate this than a technical issue of catching the error. Perhaps there could also be a way of indicating unpaired reads, if this is useful.
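To make the suggestion concrete, here is a minimal sketch of "flag instead of fail" logic. It is not how the UCSC Browser is implemented; the `plan_pair` function, the `PairPlan` type, and the coordinates in the example are hypothetical. The idea is simply that a pair is only joined into one span when both ends are on the displayed chromosome and the span fits within it; otherwise the read is drawn alone and marked.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class PairPlan(NamedTuple):
    span: Tuple[int, int]   # region to draw on the displayed chromosome
    flagged: bool           # True if the pair should be marked as problematic
    note: Optional[str]     # reason shown to the user, if any


def plan_pair(chrom: str, start: int, end: int,
              mate_chrom: str, mate_start: int, mate_end: int,
              chrom_sizes: Dict[str, int]) -> PairPlan:
    """Decide what to draw for one read pair without raising an error."""
    size = chrom_sizes[chrom]
    if mate_chrom != chrom:
        # Never build a span from coordinates on two different chromosomes.
        return PairPlan((start, min(end, size)), True,
                        f"mate on {mate_chrom}:{mate_start}")
    lo, hi = min(start, mate_start), max(end, mate_end)
    if hi > size:
        # Same chromosome name, but the span runs past the sequence end.
        return PairPlan((start, min(end, size)), True,
                        f"pair end {hi} exceeds {chrom} size {size}")
    return PairPlan((lo, hi), False, None)


# chrY size taken from the error message; read coordinates are illustrative.
sizes = {"chrY": 59373566}
bad = plan_pair("chrY", 2783700, 2783800, "chr4", 169915706, 169915806, sizes)
good = plan_pair("chrY", 2783650, 2783750, "chrY", 2783800, 2783900, sizes)
print(bad)
print(good)
```

With this kind of check, the remaining reads in the track can still be drawn, and the note attached to a flagged read could carry the mate's chromosome and position for display.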

Thank you again,
Ted Kandell

Brian Lee

Apr 27, 2015, 7:53:25 PM
to Open Genomes, UCSC Genome Browser List

Dear Ted Kandell,

Thank you for using the UCSC Genome Browser and your detailed message about the error message seen with bam files that have a misaligned pair.

Our engineers created a fix that can be tested on our genome-test server to allow the display of these bam files.

Here is a quick link loading the following custom text in our Custom Tracks page:

browser position chrY:2783737-2783738
track type="bam" name="567 Kandell G2b1-M377" description="567 Kandell G2b1-M377" pairEndsByName="."  visibility=full bigDataUrl="http://www.open-genomes.org/genomes/567%20Kandell/GRC001070.bam" 

http://genome-test.soe.ucsc.edu/cgi-bin/hgTracks?db=hg19&hgt.customText=http://hgwdev.cse.ucsc.edu/~brianlee/customTracks/rm15240.remote
(fix)

http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&hgt.customText=http://hgwdev.cse.ucsc.edu/~brianlee/customTracks/rm15240.remote
(error)

This quick fix allows the display of BAMs meeting this scenario. The changes will be released in our v316, scheduled for 5/19, in three weeks. There is a window of time for more changes, so please feel free to test this fix on our genome-test site, knowing things can change there, and provide any further specific feedback.

Thank you again for using the UCSC Genome Browser and highlighting this issue. If you have any further questions, please feel free to continue with a reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genome Bioinformatics Group

