readmzXML, tandem, and comet fail with files larger than 2 GB

Daniel Hyduke

Oct 9, 2015, 2:09:23 PM

to spctools-discuss

I've recently noticed some failures when using comet and xtandem! from TPP to search some centroided DDA files generated by qtofpeakpicker. These failures were all associated with files > 2GB. I was able to reduce the file size by increasing the threshold, after which readmzXML, tandem, and comet actually read the files and searched them. I'm guessing that there's a place in the TPP mzXML code that uses an int (which is 32-bit) and that this is causing the problem.

I've used TPP 4.8.0 on Windows 7 (installed via the exe provided on sourceforge) and built the tpp from the svn on GNU/Linux (centos 7) and encountered the same problem.

I was wondering if anybody had any thoughts on where I should start sifting through the code?

Jimmy Eng

Oct 9, 2015, 4:08:42 PM

to spctools...@googlegroups.com

Daniel,

The issue you're seeing might be due to the TPP Windows programs being compiled as 32-bit binaries. The first thing I'd try is grabbing a 64-bit binary of one of the tools to see if that fixes things. You can grab a 64-bit Comet binary from its SourceForge download site if you want to test this. I wish I could tell you definitively that the 64-bit Comet binary will work for you, but I just don't have access to files >2GB to test with.

--
You received this message because you are subscribed to the Google Groups "spctools-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spctools-discu...@googlegroups.com.
To post to this group, send email to spctools...@googlegroups.com.
Visit this group at http://groups.google.com/group/spctools-discuss.
For more options, visit https://groups.google.com/d/optout.

Daniel Hyduke

Oct 12, 2015, 1:55:50 PM

to spctools...@googlegroups.com

Thanks for the response Jimmy. I had compiled comet and the whole tpp pipeline in a 64-bit environment (CENTOS GNU/Linux 7 both with gcc 4.8.3 and gcc 6.0.0) and the problem persisted. Just to verify, I downloaded the recent binaries and source for comet (2015021) and tried them, but the problem persisted.

Basically, it feels like something in the mzXML parsing portion checks the file size (or an index or something) and uses int instead of int64. GNU/Linux is LP64 and MS Windows is LLP64, both of which use a 32-bit representation for int (http://www.unix.org/whitepapers/64bit.html https://en.wikipedia.org/wiki/64-bit_computing). So even if you use a 64-bit compiler on a 64-bit system, your int will still be 32-bit (unless I misunderstand something), and as far as I know there's no way to tell gcc to substitute int64 for int.
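
A minimal sketch of the suspected failure mode, in Python with ctypes standing in for the C types. The helper names are made up for illustration and nothing here is taken from the TPP code; it only shows what happens to a byte offset when it is stored in a 32-bit int once a file crosses 2 GiB (2**31 bytes):

```python
import ctypes


def stored_as_c_int(offset):
    """Value a C `int` holds after being assigned `offset` (ctypes does no overflow check)."""
    return ctypes.c_int(offset).value


def stored_as_int64(offset):
    """Value a 64-bit signed integer holds after being assigned `offset`."""
    return ctypes.c_int64(offset).value


if __name__ == "__main__":
    two_gib = 2**31
    # On both LP64 (Linux) and LLP64 (Windows) this reports 4 bytes.
    print("sizeof(int) in bytes:", ctypes.sizeof(ctypes.c_int))
    for offset in (two_gib - 1, two_gib, two_gib + 100):
        print(offset,
              "-> int:", stored_as_c_int(offset),
              "| int64:", stored_as_int64(offset))
```

The last offset that fits in a 32-bit signed int is 2**31 - 1; anything past that wraps to a negative number, while the 64-bit type keeps the real value.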

If there aren't any ideas on where to start in the codebase, I'll start digging in with gdb.

Jimmy Eng

Oct 12, 2015, 5:28:23 PM

to spctools...@googlegroups.com

I'll reply on-list to give some closure to this thread for anyone interested in the problem.

I asked Daniel to send me one of his >2GB mzXML files to take a look at myself. If it wasn't a Windows 32-bit binary issue, I suspected that the problem wasn't with Tandem, Comet or the TPP tools (as I recall we dealt with large files years ago) but rather it was with the conversion program itself.

Running "tail t1.mzXML" returned:

</index>

</mzXML>

The first thing that jumps out is that the indexOffset value is completely wrong, which would prevent every tool from reading this file. A quick fix is to run the TPP's "indexmzXML" tool on the file; re-indexing it also generates a correct index offset value:

indexmzXML t1.mzXML

After running this command and renaming the generated "t1.mzXML.new" file, I was able to read the mzXML file using both readmzXML and Comet. Anyway, something needs to be fixed in qtofpeakpicker so that it writes correct >2GB mzXML files. Minimally it needs to be a 64-bit binary. A feasible but poor workaround is to simply run indexmzXML on each file.
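
If the workaround has to be applied to many files, it could be scripted along these lines. This is only a sketch: it assumes indexmzXML is on the PATH and writes its output next to the input as "<name>.new", as it did for t1.mzXML here. The function name and the ".orig" backup suffix are my own choices, not part of the TPP.

```python
import os
import subprocess


def reindex_mzxml(path, run=subprocess.run):
    """Run indexmzXML on `path` and swap the re-indexed copy into place.

    `run` is injectable so the function can be exercised without the TPP installed.
    """
    run(["indexmzXML", path], check=True)
    new_path = path + ".new"  # output name observed in this thread
    if not os.path.exists(new_path):
        raise FileNotFoundError(new_path)
    # Keep the original file; the ".orig" suffix is an arbitrary choice.
    os.replace(path, path + ".orig")
    os.replace(new_path, path)
    return path
```

Calling `reindex_mzxml("t1.mzXML")` would then leave the re-indexed file under the original name and the untouched original as "t1.mzXML.orig".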

- Jimmy

Jimmy Eng

Oct 12, 2015, 5:42:12 PM

to spctools...@googlegroups.com

Actually not only was the indexOffset off but the scan offsets were off too. For this particular file, the offsets go bad at scan 38348, likely where the file hits the 2GB size.
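
For anyone who wants to check their own files, here is a rough sketch of that kind of check. It assumes the usual mzXML trailer layout (an `<index name="scan">` element holding `<offset id="...">` entries, followed by `<indexOffset>`), and that the whole index fits in the last 16 MiB of the file. The function name and return shape are hypothetical; this is not how readmzXML or indexmzXML do it.

```python
import os
import re

OFFSET_RE = re.compile(rb'<offset\s+id="(\d+)"\s*>\s*(-?\d+)\s*</offset>')
INDEX_OFFSET_RE = re.compile(rb'<indexOffset>\s*(-?\d+)\s*</indexOffset>')


def check_mzxml_index(path, tail_bytes=16 * 1024 * 1024):
    """Return (index_offset_ok, first_bad_scan_id or None) for an indexed mzXML file."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        tail_start = max(0, size - tail_bytes)
        fh.seek(tail_start)
        tail = fh.read()

        # Locate the real start of the index ("<index " does not match "<indexOffset>").
        index_pos = tail.rfind(b"<index ")
        if index_pos < 0:
            raise ValueError("no <index> element found in the file tail")
        actual_index_offset = tail_start + index_pos

        m = INDEX_OFFSET_RE.search(tail, index_pos)
        stated_index_offset = int(m.group(1)) if m else None

        # Every stated scan offset should land exactly on a "<scan" tag.
        first_bad_scan = None
        for entry in OFFSET_RE.finditer(tail, index_pos):
            scan_id, offset = int(entry.group(1)), int(entry.group(2))
            ok = 0 <= offset < size
            if ok:
                fh.seek(offset)
                ok = fh.read(5) == b"<scan"
            if not ok:
                first_bad_scan = scan_id
                break

    return stated_index_offset == actual_index_offset, first_bad_scan
```

A healthy file should give `(True, None)`; a file with a wrong indexOffset and broken scan offsets should give `False` plus the id of the first scan whose offset does not point at a `<scan` tag.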

Daniel Hyduke

Oct 12, 2015, 6:26:45 PM

to spctools...@googlegroups.com

Thanks, Jimmy!
