ProteinProphet sticking in findDegenGroups3


Emily Kawaler

Oct 6, 2020, 9:32:52 PM
to spctools-discuss
Hello! I've been running ProteinProphet as part of the Philosopher pipeline for a while now with no problems. However, one of my datasets seems to be getting stuck in the middle of this function. It doesn't throw an error or anything - just stops advancing (the last line of the output is "Computing degenerate peptides for 69919 proteins: 0%...10%...20%...30%...40%...50%"). Has anyone run into this problem before?

Luis Mendoza

Oct 9, 2020, 2:24:37 PM
to spctools...@googlegroups.com
Hello Emily,

This is not a problem that we have seen much of.  Do you know which version of ProteinProphet / TPP you are using?

One potential issue is the large number of proteins (and peptides) that it is trying to process -- can you either monitor the memory usage of the machine when you run this dataset, and/or try on one with more memory?

Hope this helps,
--Luis



Emily Kawaler

Oct 13, 2020, 2:15:51 PM
to spctools-discuss
Hello, and thank you for your response! It doesn't look like the process is using too much memory (I've allocated 300 GB and it's maxing out around 10), and I've kicked up the minprob parameter - it's still getting stuck, unfortunately.
Emily

David Shteynberg

Oct 13, 2020, 3:08:44 PM
to spctools-discuss
Hello Emily,

If you are able to share the dataset including the pepXML file and the database I can try to replicate the issue here and try to troubleshoot the sticking point.

Thanks,
-David

Emily Kawaler

Oct 17, 2020, 12:04:21 AM
to spctools-discuss
Thank you! I'm working on getting it transferred to Drive, so it might take a little while, but I'll be in touch!

Emily Kawaler

Oct 17, 2020, 3:54:47 PM
to spctools-discuss
I've uploaded the pepXML files, the parameters I used, and the database here.
Please let me know if I should be uploading anything else! Thank you!

David Shteynberg

Oct 20, 2020, 1:30:44 AM
to spctools-discuss
Hi Emily,

I got the data and now I am trying to understand how you are running the analysis.  Can you please describe those steps?

Thank you,
-David

Emily Kawaler

Oct 20, 2020, 4:42:42 PM
to spctools-discuss
Sure! The spectra are from the CPTAC2 ovarian prospective dataset, though I removed all scans that matched to a standard reference database (I don't think the scan removal is the issue, since I'm also having this problem on a different dataset without removing any scans; I also checked with xmllint and it looks like the mzML and pepXML files are valid). I've been running it with the philosopher pipeline, so the pepXML files were generated with MSFragger as part of that pipeline. The database is a customized variant database with contaminants and decoys added by philosopher's database tool. Are there any other specifics you'd like? I can upload my full philosopher.yml file if that would be helpful.

David Shteynberg

Oct 22, 2020, 3:09:29 PM
to spctools-discuss
Hi Emily,

I analyzed the search results that you sent and I am seeing some strange things in at least one of the files you gave me.  This may be causing some of the problems you saw.
In file 03CPTAC_OVprospective_W_PNNL_20161212_B1S3_f13.pepXML on line 171821 there are some strange characters (possibly binary) that are tripping up the TPP.  I think these might be caused by a bug in an analysis tool upstream of the TPP.  Not sure if there are other mistakes of this sort.  Also I found some 'U' amino acids in the database which the TPP complains about having a mass of 0.

I hope this helps you somewhat.  Let me know what you find on your end.

Cheers,
-David
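
For anyone who wants to check a pepXML file for stray binary characters like the ones described above, here is a minimal sketch in Python. It is not part of the TPP; the function name `find_suspect_lines` is made up for illustration, and it only flags control bytes that XML 1.0 does not allow in text, so it would not catch other kinds of corruption.

```python
import re

# Control bytes that XML 1.0 does not allow in text.
# Tab (0x09), line feed (0x0A) and carriage return (0x0D) are permitted.
_DISALLOWED = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_suspect_lines(path):
    """Return (line_number, offending_bytes) for each line with disallowed bytes."""
    hits = []
    with open(path, "rb") as handle:  # binary mode, so bad bytes cannot break decoding
        for number, line in enumerate(handle, start=1):
            found = _DISALLOWED.findall(line)
            if found:
                hits.append((number, b"".join(found)))
    return hits
```

Running it on a freshly extracted copy and on the original copy of the same file would show whether the odd characters come from the file itself or from decompression.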

Emily Kawaler

Oct 22, 2020, 3:31:46 PM
to spctools-discuss
Hello,
Thanks so much for taking a look! I think the selenocysteines ("U") are likely not the problem, since I've got those in all of my databases, including the ones that run correctly. I'm looking at 03CPTAC_OVprospective_W_PNNL_20161212_B1S3_f13.pepXML and I don't see anything odd in line 171821 ("<search_score name="nextscore" value="11.532"/>"), so I think our line numberings might not match up - what does your problematic line contain?

When I try to run it on my end, it always sticks somewhere in the 10CPTAC_OV files. Right now I'm running a working set of spectra with a database that didn't work and vice versa, so hopefully that'll help me pin down whether it's a problem with my spectra or my database - will let you know how that turns out!

Emily

David Shteynberg

Oct 22, 2020, 3:49:18 PM
to spctools-discuss
I just re-extracted that file and I don't see the issue anymore.  Perhaps this was a decompression issue.

Thanks for checking.

-David

On Thu, Oct 22, 2020 at 12:19 PM Emily Kawaler <e.ka...@gmail.com> wrote:
Hello,
Thanks so much for taking a look! I think the selenocysteines ("U") are likely not the problem, since I've got those in all of my databases, including the ones that run correctly. I'm looking at 03CPTAC_OVprospective_W_PNNL_20161212_B1S3_f13.pepXML and I don't see anything odd in line 171821 ("</modification_info>"), so I think our line numberings might not match up - what does your problematic line contain?

When I try to run it on my end, it always sticks somewhere in the 10CPTAC_OV files. Right now I'm running a working set of spectra with a database that didn't work and vice versa, so hopefully that'll help me pin down whether it's a problem with my spectra or my database - will let you know how that turns out!

Emily

Emily Kawaler

Oct 23, 2020, 1:56:18 AM
to spctools-discuss
While those tests are still running, I pulled out all 185 of the proteins that are in the 10OV pepXMLs but not in 01-09OV, figuring that maybe one of those is causing the error. I've uploaded that to the same folder everything else is in (it's called 10OV_uniq.fasta) - I don't see anything that jumps out immediately. (There are no individual characters unique to either the headers or the sequences in 10OV, so I don't think there's an individual character messing things up.)
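
The character check described above can be sketched roughly as follows. This is only an illustration of the idea, not the script that was actually used; `read_fasta` and `unique_characters` are hypothetical helper names, and the FASTA parsing assumes plain ">" headers followed by sequence lines.

```python
def read_fasta(lines):
    """Parse FASTA lines into a {header: sequence} dict."""
    records, header = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None and line:
            records[header] += line
    return records


def unique_characters(suspect_texts, reference_texts):
    """Characters present in the suspect strings but in none of the reference strings."""
    return set("".join(suspect_texts)) - set("".join(reference_texts))
```

Calling `unique_characters` once with the headers (the dict keys) and once with the sequences (the dict values) covers both cases; an empty set in both means no single character is specific to the suspect set.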

Emily Kawaler

Oct 23, 2020, 3:45:08 PM
to spctools-discuss
Okay - When I ran the working set of spectra with the database that failed, it seems to have failed; when I ran the set of spectra that failed with a database that worked, it ran to completion. I think we can probably narrow the problem down to something in the database.

David Shteynberg

Nov 6, 2020, 8:25:56 PM
to spctools-discuss
Hello again Emily,

Apologies for the delay but I needed a bit more time to look into this.   You are absolutely right about the titins causing this issue.  The problem is the significant overlap in peptides in this very large titin group.   Your database contains 343 variations of titin with different SAAPs, which share large subsets of the same peptides.  Calculating this enormous protein group is certainly stressing the ProteinProphet algorithm, forcing it into a higher-order polynomial time complexity problem.  I was looking into the code to see if there was a simple way to speed it up, but unfortunately this doesn't seem to be the case.  Is there any way you can reduce the number of titin entries in your database?  Have you considered using PEFF?

Thanks,
-David
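
To get a feel for how much peptide overlap a set of variant entries creates, a rough sketch like the one below may help. It is not ProteinProphet's code and does not reproduce its grouping algorithm; it only does a simple in-silico tryptic digest (cleave after K or R unless followed by P, no missed cleavages) and counts how many proteins contain each peptide. The function names and the `min_length=7` cutoff are assumptions chosen for illustration.

```python
from collections import Counter


def tryptic_peptides(sequence, min_length=7):
    """Fully tryptic peptides: cleave after K/R unless the next residue is P."""
    peptides, start = set(), 0
    for i, residue in enumerate(sequence):
        at_end = i == len(sequence) - 1
        if at_end or (residue in "KR" and sequence[i + 1] != "P"):
            peptide = sequence[start:i + 1]
            if len(peptide) >= min_length:
                peptides.add(peptide)
            start = i + 1
    return peptides


def peptide_sharing(proteins, min_length=7):
    """Map each peptide to the number of proteins whose digest contains it."""
    counts = Counter()
    for sequence in proteins.values():
        counts.update(tryptic_peptides(sequence, min_length))
    return counts


# Toy example: two variants that differ by one residue in the middle peptide.
toy = {
    "variant_1": "AAAAAAAKCCCCCCCRDDDDDDDK",
    "variant_2": "AAAAAAAKCCCECCCRDDDDDDDK",
}
sharing = peptide_sharing(toy)
```

In the toy example, the two flanking peptides are shared by both variants and only the peptide carrying the substitution is unique to each. With hundreds of variants of a very long protein, nearly every peptide would be shared by nearly every entry.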

On Sat, Oct 24, 2020 at 10:48 PM Emily Kawaler <e.ka...@gmail.com> wrote:
Another update: I've pinpointed a much smaller database that reproduces the error when run with just 10OV - uploaded to the same folder as above, named "titins_revs.fasta" (it contains a bunch of titins and some reverse decoy sequences). Something in the titins is causing this error, I think (I've run this set of titins with a few different sets of reverse decoys so I don't think it's caused by the decoys). I also think there are a couple of other sequences in the database that may have the same effect, but if we can figure out what's doing it in this set, it should be easier to know what to look for. Any thoughts?

Emily Kawaler

Nov 6, 2020, 8:30:59 PM
to spctools-discuss
Not a problem! In the end I decided just to remove all of the titins from my database - it shouldn't have a huge effect on my results - and I was indeed able to run all of my datasets to completion. Thanks for all of your help!
Emily
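
For reference, removing a family of entries from a FASTA database can be sketched as below. This is a hypothetical helper, not the exact procedure used here: it assumes the unwanted entries can be recognised by a keyword in the header line, and the keyword "titin" and the example headers are made up for illustration.

```python
def drop_entries(fasta_lines, keyword):
    """Yield FASTA lines, skipping every record whose header contains keyword.

    The match is case-insensitive and is applied to the header line only.
    """
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            keep = keyword.lower() not in line.lower()
        if keep:
            yield line


# Hypothetical headers, for illustration only.
example = [
    ">sp|X|TITIN_HUMAN Titin",
    "MKK",
    ">sp|Y|ALBU_HUMAN Albumin",
    "DAH",
    ">rev_sp|X|TITIN_HUMAN Titin",
    "KKM",
]
filtered = list(drop_entries(example, "titin"))
```

Because the match is on the header text, decoy entries derived from the removed proteins are dropped too if their headers keep the same name. Filtering the target entries first and regenerating decoys afterwards is probably the cleaner route.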