tide setup questions

Roger

Jan 26, 2012, 10:39:54 AM
to crux-users
First of all, thanks for all the work you've put into this project!

I'm trying out tide with an eye to replacing or at least supplementing
our Proteome Discoverer 1.3 installation. I've successfully (I think!)
put together a workflow from Orbi .raw as input through pep.xml
or .sqt as output, but I've run across a few things I'm hoping to get
some help with.

1) when running tide-search with --results=protobuf (for use with
tide-results), is there a switch to set the output filename? I get the
default results.tideres filename and can't see how to change the
behavior in tide-search.

2) I'm going from Orbi .raw to tide .spectrumrecords. In order to do
so I'm using msconvert on a Windows system to output .ms2 and then
using your customized msconvert on my Linux system to get
to .spectrumrecords. I also tried outputting .mzML and .mgf on the
Windows system. All seemed to work, although the scan number was
missing in .spectrumrecords with one of the input formats (I believe
it was mzML). Is there a preferred workflow from Thermo .raw
to .spectrumrecords?

3) not to be an ingrate, but I'm not seeing the improvement in speed
that I was expecting relative to PD 1.3 and I'm wondering if I'm doing
something that's slowing tide down. I don't have an exact apples to
apples comparison since tide and PD are running on separate hardware,
but I've tried to eliminate as many variables as possible. I'm running
the same data file, same fasta, same search options (i.e. full
tryptic, 1 missed, C+57, 1M+16, 1STY+80) and a pretty minimal PD
Sequest workflow. Under those conditions, tide's run time is 110 sec
and PD's reported run time is 520 sec. I believe the PD run time
includes creating the protein groups, so the actual Sequest search may
be 30-40 seconds quicker. Based on your J Prot Res paper, I was
expecting considerably better than a 10x improvement with tide
whereas I'm actually seeing something less than 5x.

I think the tide and PD systems are pretty comparable for these
purposes (tide: RHEL x64, 32-core 2.6 GHz Opteron 192 GB; PD: Win7
Enterprise x64 12-core 2.5 GHz Xeon 24 GB) so I wouldn't think that
was skewing things. I realize that tide is single threaded, but so is
PD so the difference in the number of cores should be irrelevant. That
does, however, bring me to my last question...

4) Do you have any plans to develop a multi-threaded version of tide?
It would be very useful on our hardware, as I've noticed that tide
maxes out one core on our 32-core system while the rest sit completely
idle. Same thing with tide-index. Seems like a waste :) If you aren't
planning on adding multi-threading, or even in the interim if you are,
do you have any suggestions on a strategy for parallelizing tide by
splitting the input file and running multiple instances? I'm assuming
that it would be more straightforward to split the data while it was
in .ms2 format, since I don't have any tools for manipulating
the .spectrumrecords format. Similarly, I assume it would be most
straightforward to output to .sqt or .pep.xml and parse those files in
order to combine them back into a single output. I should probably
mention that the ultimate destinations for this data will be ProteoIQ,
Scaffold and the TPP.

Thanks again,
Roger

Roger

Jan 26, 2012, 11:23:59 AM
to crux-users
In going over the tide-search help again I realized that I'd left off
the --mass-window switch and was searching at 3 Da compared to 10 ppm
with PD. Changing it to 0.02 Da decreased the tide search time from
110 sec to 27 sec, which is 20x faster than PD and is right in the
range I was expecting. If anyone has any other speed tips I'd be happy
to hear them but otherwise I guess that part of my question is
resolved. I'd still be glad to get feedback on the other questions.

Thanks
Roger

Benjamin Diament

Feb 6, 2012, 12:07:15 PM
to crux-users
Hi Roger,
  Thank you for writing. Please excuse the long delay in my response.
Also, be aware that Tide was an academic research project that I did
as a graduate student. I have since graduated, so I can support Tide
only on my own time -- please bear with me!
  I'll try to respond to your questions inline:

> 1) when running tide-search with --results=protobuf (for use with
> tide-results), is there a switch to set the output filename? I get the
> default results.tideres filename and can't see how to change the
> behavior in tide-search.


Yes, use --results_filename=<xxx>.


> 2) I'm going from Orbi .raw to tide .spectrumrecords. In order to do
> so I'm using msconvert on a Windows system to output .ms2 and then
> using your customized msconvert on my Linux system to get
> to .spectrumrecords. I also tried outputting .mzML and .mgf on the
> Windows system. All seemed to work, although the scan number was
> missing in .spectrumrecords with one of the input formats (I believe
> it was mzML). Is there a preferred workflow from Thermo .raw
> to .spectrumrecords?


Good question, but no, I don't think there's a preferred way to do
this. I used the Unix version of proteowizard's msconvert as the basis
for Tide's msconvert. If a file format works well in proteowizard, it
can work in Tide, but not all formats use a spectrum number, as you
point out. Please continue to use the method you are using now. I'll
make a note that spectrum numbering/naming is an issue for Tide
because of the variety of upstream spectrum formats. If I can make
time, I'll try to generalize Tide's provisions for numbering or naming
spectra.


> 3) not to be an ingrate, but I'm not seeing the improvement in speed
> that I was expecting relative to PD 1.3 and I'm wondering if I'm doing
> something that's slowing tide down. I don't have an exact apples to
> apples comparison since tide and PD are running on separate hardware,
> but I've tried to eliminate as many variables as possible. I'm running
> the same data file, same fasta, same search options (i.e. full
> tryptic, 1 missed, C+57, 1M+16, 1STY+80) and a pretty minimal PD
> Sequest workflow. Under those conditions, tide's run time is 110 sec
> and PD's reported run time is 520 sec. I believe the PD run time
> includes creating the protein groups, so the actual Sequest search may
> be 30-40 seconds quicker. Based on your J Prot Res paper, I was
> expecting considerably better than a 10x improvement with tide
> whereas I'm actually seeing something less than 5x.


Thanks for answering this one yourself in your follow-up mail (not
quoted here). I will make a note that ppm-based tolerance windows are
important to many (if not most) users. In the meantime, your method of
specifying a small precursor tolerance window (using the --mass_window
flag) is a good idea.
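
For reference, a ppm tolerance and a fixed window in daltons are
related by window_Da = mass_Da * ppm / 1e6. A minimal Python sketch of
that arithmetic follows. The function names are illustrative and not
part of Tide, and the sketch makes no assumption about whether
--mass_window is read as a half-width or a full width.

```python
def ppm_to_da(mass_da: float, ppm: float) -> float:
    """Width in daltons of a ppm tolerance at the given precursor mass."""
    return mass_da * ppm / 1e6


def da_to_ppm(mass_da: float, window_da: float) -> float:
    """Tolerance in ppm that a fixed dalton window represents at a given mass."""
    return window_da / mass_da * 1e6


if __name__ == "__main__":
    # Compare a fixed 0.02 Da window with a 10 ppm tolerance
    # across a few example precursor masses.
    for mass in (1000.0, 2000.0, 4000.0):
        print(mass, ppm_to_da(mass, 10.0), da_to_ppm(mass, 0.02))
```

A 0.02 Da window equals 10 ppm only at a precursor mass of 2000 Da. It
is looser in ppm terms for lighter precursors (20 ppm at 1000 Da) and
tighter for heavier ones, so a fixed dalton window only approximates a
ppm-based search.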


> 4) Do you have any plans to develop a multi-threaded version of tide?
> It would be very useful on our hardware, as I've noticed that tide
> maxes out one core on our 32-core system while the rest sit completely
> idle. Same thing with tide-index. Seems like a waste :) If you aren't
> planning on adding multi-threading, or even in the interim if you are,
> do you have any suggestions on a strategy for parallelizing tide by
> splitting the input file and running multiple instances? I'm assuming
> that it would be more straightforward to split the data while it was
> in .ms2 format, since I don't have any tools for manipulating
> the .spectrumrecords format. Similarly, I assume it would be most
> straightforward to output to .sqt or .pep.xml and parse those files in
> order to combine them back into a single output. I should probably
> mention that the ultimate destinations for this data will be ProteoIQ,
> Scaffold and the TPP.


For now, the right way to deploy multiple CPUs or cores is indeed to
run multiple instances of Tide at the same time. Certainly some
important efficiencies (most importantly file I/O speed for the index
file) would be available with a built-in threading implementation.
However, this is not an immediate priority for me as the method you
mention of "parallelizing tide by splitting the input file and running
multiple instances" is exactly the right thing to do for now. If your
data aren't already naturally
divided into multiple input files, then you should split the input
file (at the ms2 stage) and recombine after the search is done, as you
mention. I wasn't planning to produce tools for this, but if you need
some advice/help on this, please post again.
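
As a rough illustration of the split-and-recombine approach, here is a
minimal Python sketch. It is not a Tide tool. It assumes the usual MS2
layout (header lines starting with "H", each spectrum starting with an
"S" line followed by its Z/I/D and peak lines) and the usual SQT layout
(header lines starting with "H", then S/M/L records). The function
names are hypothetical, and whether a given downstream tool accepts the
merged file is untested.

```python
import math


def split_ms2_lines(lines, n_parts):
    """Split MS2 lines into up to n_parts contiguous chunks of whole spectra.

    Every chunk gets a copy of the file header. Scan order is preserved,
    so per-chunk results can later be concatenated in chunk order.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    header, spectra = [], []
    for line in lines:
        if line.startswith("S"):
            spectra.append([line])        # an "S" line opens a new spectrum
        elif spectra:
            spectra[-1].append(line)      # Z/I/D and peak lines of that spectrum
        elif line.startswith("H"):
            header.append(line)           # header lines precede the first spectrum
    chunk = max(1, math.ceil(len(spectra) / n_parts))
    parts = []
    for start in range(0, len(spectra), chunk):
        part = list(header)
        for spectrum in spectra[start:start + chunk]:
            part.extend(spectrum)
        parts.append(part)
    return parts                          # fewer than n_parts if spectra are scarce


def split_ms2_file(path, n_parts):
    """Write the chunks next to the input file and return the new paths."""
    with open(path) as handle:
        parts = split_ms2_lines(handle.readlines(), n_parts)
    out_paths = []
    for i, part in enumerate(parts):
        out_path = f"{path}.part{i}.ms2"
        with open(out_path, "w") as handle:
            handle.writelines(part)
        out_paths.append(out_path)
    return out_paths


def merge_sqt_lines(files):
    """Concatenate SQT line lists, keeping the header of the first file only."""
    merged = []
    for i, lines in enumerate(files):
        for line in lines:
            if i > 0 and line.startswith("H"):
                continue                  # drop repeated headers
            merged.append(line)
    return merged
```

Each chunk would then be converted and searched by its own Tide
instance, one per core, and the per-chunk .sqt files merged in chunk
order. Contiguous chunks are used rather than round-robin assignment so
that the merged output keeps the original scan order.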

Thanks again for your questions.

Regards,
Benjamin

Roger

Feb 7, 2012, 11:29:13 AM
to crux-users
Hi Benjamin,

Thanks for all the information. I appreciate you taking the time to
support tide despite having graduated. The ability to set the output
filename for protobuf output is very helpful. It would also be great
to be able to set the mass window in ppm, although now that I
recognize the problem I can probably make do with the current option
in daltons.

Best,
Roger

On Feb 6, 12:07 pm, Benjamin Diament <bdiam...@cs.washington.edu>
wrote: