First tests with PeptideShaker.

Pratik Jagtap

Sep 3, 2014, 3:41:12 PM

to Galaxy for Proteomics, Ira Cooke, Björn Grüning, harald....@gmail.com, Marc Vaudel, Timothy Griffin, Kevin Murray

Hello Ira, Bjoern, Harald and Marc,

Please see the attached Excel sheet, which lists the steps and the issues we encountered with SearchGUI-PeptideShaker.

Please see a link to the history for this test: http://galaxyp-dev.msi.umn.edu:8081/u/pratik/h/p4-control-peptideshaker-test-september-2014

The good news: basic searches with OMSSA, X!Tandem and MS-GF+ against the UniProt database worked with PeptideShaker (items 18-20 in the history).

Error 1: MS-GF+-only searches gave empty outputs (items 15-17 in the history).

Error 2: basic searches with OMSSA, X!Tandem and MS-GF+ against the UniProt database combined with the HOMD and 3-frame translated cDNA databases did not work with PeptideShaker (items 22-24 in the history).

Please take a look and suggest what needs to be done to resolve these issues. Please let me know if you need more information.

Regards,

Pratik

Pratik Jagtap,

Managing Director,

Center for Mass Spectrometry and Proteomics,

43 Gortner Laboratory
1479 Gortner Avenue
St. Paul, MN 55108

Phone: 612-624-9275

PeptideShaker Test.xls

Pratik Jagtap

Sep 3, 2014, 9:38:25 PM

to Harald Barsnes, Galaxy for Proteomics, Ira Cooke, Björn Grüning, Marc Vaudel, Timothy Griffin, Kevin Murray, Lennart Martens

Hello Harald,

Please see my answers here:

> Where do all the non-UniProt headers come from?

The non-UniProt headers come from a microbial database (item #10) and a 3-frame translated cDNA database (item #11).

> Do all of the non-UniProt headers have this general formatting?

The microbial database has this format:

>cper_c_35_33 Centipeda periodontii DSM 2778 [8450 - 10108] hypothetical protein

LRLFLFQRVQNFLRALRHLRGQSCETCHVDAIGLVGCAGDNAPQECNLFPLLADGNTVVANALVFHVGQLVVVRCKERLCMDSRVDVFHDGARDAHAVERARAAPDLIEDDEAVTRRLAQDLRDLVHLHHKRALPAHEIIARTDARKDAIDGRNLRCARGDKAADLRHQREECNLPHVGGLARHVWSRNEHRFRPAVTNVGIVRNEHLTLQDALDERVTSVPYEDRTARRHPRPHVAITCRRLRETLENIDARNRPRNVLDGDELRLDLLTQLREDMIFERRDLFLCTEDFRLEDLQLLRNEALRVRECLFSDVVVRNLIPSAVAHLDVVAEDLVVADLERLDPRACALALLHLLQQGFAVMGKCVQFVERSVVTRANDAALAHGKPRLIDDGSLERIAQLVKCLNRRVHAREKIALCRAERFTYGRECRKGDTQSDQIARVRRLCFNARQKALKVIDRAQVVAQRSTQRGFFDQLRHGIQTSINTVRYEQRMLEPRFQEARTHRRAREVEHVKERVLLAAVTQAARDLEIAERVGVKLHRIRCIEKGEVIQL

>dmic_c_1_469 Dialister micraerophilus DSM 19965 [161699 - 160872] aspartate-semialdehyde dehydrogenase

LLFFAGGSISKTLAPEAVKRGAIVIDNSSAFRMESDVPLVIPEINADDIKMHKGIIANPNCSTIIMLVALNPIHKLSPIKRIIVSTYQAVSGAGKDAVEELILETKEYFKGEEYQAKILPYASAEKHYTIAFNLIPHIDKFMDNAYTKEEMKMVNETHKILHDDSIAITATTVRVPVFRSHSESIYIETQDIVDIEKIKEMMDDSPGVELRDDISEMIYPMPIEATNFYDVSVGRIRKDLYNERGINLWISGDQLLKGAALNALQIAEYIIENNLI

>dmic_c_3_436 Dialister micraerophilus DSM 19965 [64090 - 61280] isoleucine--tRNA ligase

MDYGKTLQLPQTEFPMRGNLPKREPEFVKFWQENNLYEKRIEKRSSENAPLFVLHDGPPYANGKIHIGHALNKTLKDIIVRYKFMEGNNVNFVPGWDTHGLPIEYAVLKESGEDRANMTPIELRRKCLEYAKKWIAIQKEDFIRMGVIGDWNNPYVTYDKHLEARQIEVFGEMAKRGYIYKGKKTVYWCANCETALAEAEIEYKESKSPSIYVKFPVRDVKNLLPEGVSKEKAFAVIWTTTPWTIPANQHICVNPKFDYVWVHNKDADEYYLMSKELAPKALEECKVENYEFAGRVMKGEELEMIEFSHPLVTDRVVHVLEGDHVTLDAGTGCVHTAPGHGSDDFNIHLKYVHAGKIDAEVICPVDAKGRMTKEAGEELEGLLVWDAQGKEISLLAHAGRLLGKKSMRHQYPHCWRCKNPVIYRATEQWFASINDYRDAALKVVNDTKFIPSWGHDRLYNMVRDRQDWCISRQRSWGVPIPVFYCEDTGKPIITDETIASVKAWIEKEGSDAWWTHSAKELLPEGFKSPYSGKSNFRKETDIMDVWFDSGSTWNGVVKQPREEWKGMSFPCDLYLEGSDQHRGWFQSSLLTSVAVNGKAPYKALLTHGFTVDGEGRKMSKSVGNVVAPQTIINRYGADVMRLWIASSDYQGDVRLSDKIIKQMSDVYRKIRNTYRYLLSNLYDFDVEKDAIPYNEMEEIDKWALLRLEQVRDQVTKAYKNYEFHVMYHVIHNFCTVDMSAIYLDILKDKLYEEVPNSKERRSAQTAIHEILVTLTKLMAPVLAFTTEEVWQAMKHTSKDEKSVHLENWPEAKPEYLDEKLEEKWNKRLALRGEITKALEEKRKLKEIGHSLDADVTVYAKGEAYDTLVEMEKELADFCIVSTIKVEKSDDEELKVDVKISTREKCERCWKHLPSVGNDEKHPTVCARCSRVLKEMGL

The 3-frame cDNA database has this format:

>ENST00000591655_6 [141 - 263] cdna:known chromosome:GRCh37:HG311_PATCH:118380352:118404649:1 gene:ENSG00000266200 gene_biotype:polymorphic_pseudogene transcript_biotype:nonsense_mediated_decay
CCPLGPSAFSCWPQSEEKRSATDNLAAFLMKNHGQEPFSDL
>ENST00000567207_4 [81 - 371] cdna:known chromosome:GRCh37:16:90160431:90162566:1 gene:ENSG00000261812 gene_biotype:polymorphic_pseudogene transcript_biotype:processed_transcript
TCHRLRWHLPRGQPPAAGARQRAPPRGQRWQVRASRCARGSGAGHHGLRALGALRAGLQARQLHFPSVWGRKQLGQGTLHRRRGADGVSDGRCQKGG
>ENST00000390539_12 [1 - 1020] cdna:known chromosome:GRCh37:14:106053226:106054732:-1 gene:ENSG00000211890 gene_biotype:IG_C_gene transcript_biotype:IG_C_gene
ASPTSPKVFPLSLDSTPQDGNVVVACLVQGFFPQEPLSVTWSESGQNVTARNFPPSQDASGDLYTTSSQLTLPATQCPDGKSVTCHVKHYTNSSQDVTVPCRVPPPPPCCHPRLSLHRPALEDLLLGSEANLTCTLTGLRDASGATFTWTPSSGKSAVQGPPERDLCGCYSVSSVLPGCAQPWNHGETFTCTAAHPELKTPLTANITKSGNTFRPEVHLLPPPSEELALNELVTLTCLARGFSPKDVLVRWLQGSQELPREKYLTWASRQEPSQGTTTYAVTSILRVAAEDWKKGETFSCMVGHEALPLAFTQKTIDRMAGKPTHINVSVVMAEADGTCY

The UniProt database is in the format:

>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ

> And can everything up to the first white space always be used as the accession number?

Yes, that is the case for all the non-UniProt databases.

> What we need is a generic Java regular expression that can be used to pick up these headers and parse them correctly.

If needed we can incorporate this in the Galaxy workflow if we know what regular expression should be used.
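As a minimal sketch of the "everything up to the first white space is the accession" rule discussed above, the following Java snippet splits a header line into accession and description. The class and method names are hypothetical and are not part of SearchGUI, PeptideShaker or compomics-utilities; the pattern is only one possible way to express the rule and is meant for the non-UniProt headers shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits a FASTA header at the first whitespace. */
final class FastaHeaderSplit {

    // '>' followed by the accession (a run of non-whitespace), then an optional description.
    private static final Pattern HEADER = Pattern.compile("^>(\\S+)\\s*(.*)$");

    /** Returns {accession, description}, or null if the line is not a FASTA header. */
    static String[] split(String headerLine) {
        Matcher m = HEADER.matcher(headerLine);
        if (!m.matches()) {
            return null; // sequence line or empty header
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        String[] parts = split(">dmic_c_1_469 Dialister micraerophilus DSM 19965 "
                + "[161699 - 160872] aspartate-semialdehyde dehydrogenase");
        System.out.println("accession:   " + parts[0]);
        System.out.println("description: " + parts[1]);
    }
}
```

Applied to a UniProt header, the same rule would return the whole `sp|...|..._HUMAN` token as the accession, so this sketch only covers the microbial and cDNA databases.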

> While we're looking into this, perhaps you can redo the tests on a database without the non-UniProt headers?

I will try and keep you updated.

> I don't get the part about MS-GF+ resulting in empty outputs though (15-17). As far as I can see there is an mzid output and

> peptide and protein reports?

The mzid, peptide and protein outputs are each shown as empty in the history.

Please let me know if you need more information.

Regards,

Pratik

On Wed, Sep 3, 2014 at 4:36 PM, Harald Barsnes <Harald....@biomed.uib.no> wrote:

Hi Pratik,

As far as I can see the problem is with the parsing of the database headers.

Where do all the non-UniProt headers come from? If these are not in a standard format we cannot parse them.

At the moment this database cannot be indexed by our code:
Galaxy21-[Merged_and_Filtered_FASTA_from_data_10,_data_11,_and_others].fasta

What we need is a generic Java regular expression that can be used to pick up these headers and parse them correctly. Do all of the non-UniProt headers have this general formatting?

dmic_c_1_469 Dialister micraerophilus DSM 19965 [161699 - 160872] aspartate-semialdehyde dehydrogenase Database

And can everything up to the first white space always be used as the accession number?

While we're looking into this, perhaps you can redo the tests on a database without the non-UniProt headers?

I don't get the part about MS-GF+ resulting in empty outputs though (15-17). As far as I can see there is an mzid output and peptide and protein reports?

Best regards,
Harald

Pratik Jagtap

Sep 3, 2014, 9:53:21 PM

to Galaxy for Proteomics

---------- Forwarded message ----------
From: Pratik Jagtap <pja...@umn.edu>
Date: Wed, Sep 3, 2014 at 8:52 PM
Subject: Re: First tests with PeptideShaker.
To: Harald Barsnes <Harald....@biomed.uib.no>
Cc: Ira Cooke <irac...@gmail.com>, Björn Grüning <bjoern....@gmail.com>, Marc Vaudel <mva...@gmail.com>, Timothy Griffin <tgri...@umn.edu>, Kevin Murray <murr...@umn.edu>, Lennart Martens <lennart...@ugent.be>

Thanks Harald,

Will this take care of the microbial database as well?

Also, will this need to be "Galaxy-wrapped" by Bjoern, Ira or JJ?

Thanks and Regards,

Pratik

On Wed, Sep 3, 2014 at 7:59 PM, Harald Barsnes <Harald....@biomed.uib.no> wrote:

Hi again,

I've now added a regular expression to parse the cDNA FASTA headers.

New beta versions are available here:
https://www.dropbox.com/s/h4msbpxj9um3j8t/SearchGUI-1.20.8-beta-mac_and_linux.tar.gz?dl=0
https://www.dropbox.com/s/0nseai5d9nejqt1/PeptideShaker-0.33.6-beta.zip?dl=0

This should solve the database issues.

Best regards,
Harald

Pratik Jagtap

Sep 3, 2014, 10:44:23 PM

to Harald Barsnes, Ira Cooke, Björn Grüning, Marc Vaudel, Timothy Griffin, Kevin Murray, Lennart Martens, Galaxy for Proteomics

Hello Harald,

> I downloaded the files for 15-17 via the link you sent in the first e-mail, and these were far from empty. So I'm not sure if we're

> talking about the same files..?

When I last looked, these items were shown as empty in the history, so I did not try to download them. I cannot access the history right now (I am waiting for the admin to restore my access). Kevin, can you access the history and check whether items 15-17 can be downloaded?

> Also, there are no error messages for 15-17?

Correct, there were no error messages. The history just showed the items as empty.

Regards,

Pratik

On Wed, Sep 3, 2014 at 9:09 PM, Harald Barsnes <Harald....@biomed.uib.no> wrote:

On 2014-09-04 03:38, Pratik Jagtap wrote:

>> I don't get the part about MS-GF+ resulting in empty outputs though (15-17). As far as I can see there is an mzid output and peptide and protein reports?
>
> The mzid output, peptide and protein output mentions that it is empty.
>
> Please let me know if you need more information.

I downloaded the files for 15-17 via the link you sent in the first e-mail, and these were far from empty. So I'm not sure if we're talking about the same files..?

Also, there are no error messages for 15-17?

Harald

Pratik Jagtap

Sep 3, 2014, 10:48:36 PM

to Harald Barsnes, Ira Cooke, Björn Grüning, Marc Vaudel, Timothy Griffin, Kevin Murray, Lennart Martens, Galaxy for Proteomics

> But if you come across a database that doesn't work, just make it available to us and we'll see what we can do.

Sure.

> You have to do the same as usual when updating to new versions. There is nothing special about these versions in that regard.

Thanks! Bjoern, Ira or JJ - can you please let us know when the new beta version has been Galaxy-wrapped and placed in the toolshed for Trevor to upload onto galaxyp-dev?

Thanks and Regards,

Pratik

On Wed, Sep 3, 2014 at 9:17 PM, Harald Barsnes <Harald....@biomed.uib.no> wrote:

On 2014-09-04 03:52, Pratik Jagtap wrote:

> Will this take care of the microbial database as well?

Yes, it should. As long as the formatting is the same. At least I was able to load both databases and add decoys in the new beta version of SearchGUI.

But if you come across a database that doesn't work, just make it available to us and we'll see what we can do.

> Also will this need to be "Galaxy-wrapped" by Bjoern, Ira or JJ?

Not sure what you mean. You have to do the same as usual when updating to new versions. There is nothing special about these versions in that regard.

Harald

Pratik Jagtap

Sep 3, 2014, 11:09:38 PM

to Harald Barsnes, Ira Cooke, Björn Grüning, Marc Vaudel, Timothy Griffin, Kevin Murray, Lennart Martens, Galaxy for Proteomics

> Kevin - can you access the history and see if you can download items 15-17?

I have access now and I can indeed download these "empty" files.

I have reported this observation to our tool administrator.

Thanks,

Pratik

Björn Grüning

Sep 4, 2014, 7:44:59 AM

to Pratik Jagtap, Galaxy for Proteomics, Ira Cooke, harald....@gmail.com, Marc Vaudel, Timothy Griffin, Kevin Murray

Hi all!

Sorry for being so late to the discussion and thanks Harald for fixing
this so fast.

Just a few notes from my side.
The Galaxy wrapper has not been updated recently, so it uses an 'old'
version of PeptideShaker and SearchGUI. I'm a little bit busy right now
and have not found time for this. Also, I think the update will take
more time, because we agreed to split the wrapper into two parts, didn't we?

@Harald: What is the FASTA header used for internally? I'm very
sceptical about adding a new regex to PeptideShaker every time a new
FASTA header is encountered. FASTA headers are not really standardised,
or at least the standards are not widely used. I think we should
reformat FASTA files in Galaxy to fit your input standard; otherwise you
will end up supporting an endless list of file formats.
@Pratik: Can you try to reformat the headers in Galaxy to fit the UniProt
norm? That would solve your problem.

Regarding the question of whether we need to update the wrappers:
probably not. We only need to update the wrapper if new features were
added or command-line options changed. (We do need to do that, because
we are lagging behind quite a bit.) What we need to update is the
installation routine, to point to the new tarball. I would argue this is
only necessary for stable releases. In the meantime, you can at any point
just replace your old jar file with the new one; you need to rename the
new one. This is hackish, but for testing purposes it should be enough.

Cheers,
Bjoern

Pratik Jagtap

Sep 4, 2014, 8:37:56 AM

to Harald Barsnes, Ira Cooke, Björn Grüning, Marc Vaudel, Timothy Griffin, Kevin Murray, Lennart Martens, Galaxy for Proteomics

Hello Harald,

Thanks for your answers.

> But perhaps most importantly it adds support for MyriMatch and includes various zip export options in SearchGUI making it easier to link SearchGUI and PeptideShaker.

That's great to know. These 5 search algorithms make a great combination.

Regards,

Pratik

On Thu, Sep 4, 2014 at 5:22 AM, Harald Barsnes <Harald....@biomed.uib.no> wrote:

On 2014-09-04 04:48, Pratik Jagtap wrote:

>> You have to do the same as usual when updating to new versions. There is nothing special about these versions in that regard.
>
> Thanks! Bjoern, Ira or JJ - can you please let us know when the new
> beta version has been Galaxy-wrapped and placed in toolshed for Trevor
> to upload onto galaxyp-dev?

I noticed in the mzid export that you're using PeptideShaker 0.31.4?

If this is correct, updating to the latest versions of both SearchGUI and PeptideShaker will fix a couple of bugs and also bring improvements to the protein inference.

But perhaps most importantly it adds support for MyriMatch and includes various zip export options in SearchGUI making it easier to link SearchGUI and PeptideShaker. For details see "Optional output compression parameters" at http://code.google.com/p/searchgui/wiki/SearchCLI. The zip file from SearchGUI can be given directly as input to PeptideShaker, as detailed in a previous e-mail from Marc.

Also note the new species_update parameter added to the PeptideShakerCLI (http://code.google.com/p/peptide-shaker/wiki/PeptideShakerCLI) to control the download and update of the gene and GO mappings.

And as always, if you have any questions about the changes or come across any issues, please let us know.

Best regards,
Harald

Pratik Jagtap

Sep 4, 2014, 9:00:30 AM

to Björn Grüning, Harald Barsnes, Ira Cooke, harald....@gmail.com, Marc Vaudel, Timothy Griffin, Kevin Murray, Lennart Martens, Galaxy for Proteomics, Jim Johnson

Hello Harald and Bjoern,

It appears that the short-term fix would be to convert the FASTA file to a SearchGUI-standard FASTA file within Galaxy. Is that right?

We are currently migrating our resources to the latest build of Galaxy. We should definitely plan to update to the SearchGUI and PeptideShaker beta versions to address the outdated version, the database parsing issues, the SearchGUI-PeptideShaker coupling and the use of MyriMatch. We will do this after the migration.

Regards,

Pratik

On Thu, Sep 4, 2014 at 7:31 AM, Björn Grüning <bjoern....@gmail.com> wrote:

On 04.09.2014 at 14:24, Harald Barsnes wrote:

> On 2014-09-04 13:44, Björn Grüning wrote:
>
>> @Harald: What is the FASTA header used for internally? I'm very
>> sceptical about adding a new regex to PeptideShaker every time a new
>> FASTA header is encountered. FASTA headers are not really standardised,
>> or at least the standards are not widely used. I think we should
>> reformat FASTA files in Galaxy to fit your input standard; otherwise
>> you will end up supporting an endless list of file formats.
>
> The FASTA headers are used as a way of referring to a specific protein
> in the FASTA file via our index file. Thus we have to be able to parse
> the headers to extract a unique accession number. This is indeed not
> trivial. For our current FASTA header parsing see:
> http://code.google.com/p/compomics-utilities/source/browse/trunk/src/main/java/com/compomics/util/protein/Header.java
>
> Note that if none of our patterns kick in, we end up trying to use the
> whole header as the accession number. So even if a header is not
> recognized it should (in most cases) not break the parsing.
>
> Reformatting the headers in Galaxy would make things simpler for
> SearchGUI/PeptideShaker, but I think it would just move the problem, as
> the issue of how to handle non-standard headers will be the same, just
> at a different location in the pathway.
>
> However, we do have a recommended format for such non-standard headers:
> http://code.google.com/p/searchgui/wiki/DatabaseHelp#Non_Standard_FASTA

Great!

> But not sure how easy it would be to convert all possible FASTA file
> headers to this format?

I'm not talking about converting every format, but Galaxy is flexible enough to easily create a pipeline that can convert header X to Non_Standard_FASTA. We just need to inform our users and give advice on how to do so if they have a FASTA file with non-standard headers.

> We could of course agree on a different common format. The challenge is
> that it has to be unique and not clash with the other supported header
> formats.
>
>> @Pratik: Can you try to reformat the headers in Galaxy to fit the
>> UniProt norm? That would solve your problem.
>
> I would strongly advise against this, as the header would then be picked
> up as a UniProt header and annotated as such. So while the header could
> then be parsed, the code would also attempt to link it to UniProt which
> of course wouldn't work. So I'd rather go for our own unique header
> formatting rather than trying to mimic an existing format.

Yes, that makes sense. Actually, this is what I meant: convert your FASTA file to a SearchGUI-standard FASTA file (whatever this is). I was just assuming that this is UniProt-based.

Thanks,
Bjoern

> Harald
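The header conversion discussed in this exchange could be sketched roughly as follows. This is an illustration only: the `generic_<tag>|accession|description` layout is assumed from the `generic_HOMD|` and `generic_EnSEMBL|` examples that appear elsewhere in this thread, the class and method names are hypothetical, and the SearchGUI wiki page linked above remains the authority on what the tools actually accept.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical converter from ">ACC description" to ">generic_TAG|ACC|description". */
final class GenericFastaHeaders {

    /** Rewrites a header line; sequence lines are returned unchanged. */
    static String rewriteLine(String line, String tag) {
        if (!line.startsWith(">")) {
            return line; // sequence line
        }
        String body = line.substring(1).trim();
        int ws = indexOfWhitespace(body);
        // Everything up to the first whitespace is treated as the accession.
        String accession = ws < 0 ? body : body.substring(0, ws);
        String description = ws < 0 ? "" : body.substring(ws).trim();
        return ">generic_" + tag + "|" + accession + "|" + description;
    }

    private static int indexOfWhitespace(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    /** Applies rewriteLine to every line of a FASTA file held in memory. */
    static List<String> rewriteAll(List<String> lines, String tag) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(rewriteLine(line, tag));
        }
        return out;
    }
}
```

For example, `rewriteLine(">cper_c_35_33 Centipeda periodontii DSM 2778 [8450 - 10108] hypothetical protein", "HOMD")` yields a header starting with `>generic_HOMD|cper_c_35_33|`. In Galaxy itself the same transformation would more likely be done with a find-and-replace tool than with custom code.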

Björn Grüning

Sep 4, 2014, 9:11:18 AM

to Pratik Jagtap, Harald Barsnes, Ira Cooke, harald....@gmail.com, Marc Vaudel, Timothy Griffin, Kevin Murray, Lennart Martens, Galaxy for Proteomics, Jim Johnson

Salve!

On 04.09.2014 at 15:00, Pratik Jagtap wrote:
> Hello Harald and Bjoern,
>
> It appears that the short term fix would be to convert the FASTA file to a
> searchgui standard fasta file within Galaxy. Is that right?

You can try to just replace the jar file.
I'm not sure, when I will have time to rework the wrapper and split it
up into two ... maybe next week.

Cheers,
Bjoern

Pratik Jagtap

Sep 4, 2014, 9:45:51 PM

to Björn Grüning, Harald Barsnes, Ira Cooke, harald....@gmail.com, Marc Vaudel, Timothy Griffin, Kevin Murray, Lennart Martens, Galaxy for Proteomics, Jim Johnson, Thomas McGowan

Hello Everyone,

I tried a new search after applying the regex commands (thanks to JJ), and SearchGUI-PeptideShaker gave an error with the non-UniProt databases.

Please see the attached Excel sheet, which lists the steps and the issues we encountered with SearchGUI-PeptideShaker.

Please see a link to the history for this test: http://galaxyp-dev.msi.umn.edu:8081/u/pratik/h/p4-control-peptideshaker-test-september-2014-part-two

Error: Picked up _JAVA_OPTIONS: -Xmx6291456k

So I think updating to the SearchGUI and PeptideShaker beta versions remains the only option, and this can be done after the migration.

Regards,

Pratik

PeptideShaker Test PartTwo.xls

Ira Cooke

Sep 5, 2014, 12:55:57 AM

to Pratik Jagtap, Björn Grüning, Harald Barsnes, harald....@gmail.com, Marc Vaudel, Timothy Griffin, Kevin Murray, Lennart Martens, Galaxy for Proteomics, Jim Johnson, Thomas McGowan

Hi Pratik / All,

I’m coming to this quite late (sorry). First of all, I think it’s great that SearchGUI-PeptideShaker actually documents its FASTA parsing (thank you guys), as most other programs just leave this to the end user to figure out. I also agree with Bjoern that the only feasible way to deal with this is simply to have a documented way to format “non-standard” FASTA headers and ask users to comply. I think the formats already documented on the SearchGUI wiki should be fine.

I am on leave from today until Wednesday next week, but I should have time to contribute to updating the toolshed wrappers then.

Cheers

Ira

Pratik Jagtap

Sep 10, 2014, 7:13:25 PM

to Galaxy for Proteomics

For Your Information...

On Wed, Sep 10, 2014 at 6:12 PM, Pratik Jagtap <pja...@umn.edu> wrote:

Hello Everyone,

Here is a history that goes from RAW files to PeptideShaker outputs: http://galaxyp-dev.msi.umn.edu:8081/u/pratik/h/control-p4-raw-to-second-step-search
Here is the workflow that makes it possible: http://galaxyp-dev.msi.umn.edu:8081/u/pratik/w/copy-of-p4-mzml-to-second-step-search

The workflow takes a dataset collection as input and uses msconvert, MGF formatter, Protein database Downloader, Regex Find and Replace, FASTA Merge Files and PeptideShaker as tools.

I will keep you updated as we make more progress.

Regards,

Pratik

On Sat, Sep 6, 2014 at 8:37 PM, Pratik Jagtap <pja...@umn.edu> wrote:

Hello Everyone,

Here is a cleaner history and its workflow:

History: http://galaxyp-dev.msi.umn.edu:8081/u/pratik/h/p4-mzml-to-second-step-search-1

Workflow: http://galaxyp-dev.msi.umn.edu:8081/u/pratik/w/copy-of-p4-raw-to-second-step-search

I will keep everyone updated on any issues / successes.

Regards,

Pratik

On Fri, Sep 5, 2014 at 10:20 AM, Pratik Jagtap <pja...@umn.edu> wrote:
My bad - I was looking at another history.

I will look at this closer and see if the outputs can be used for further processing.

Thanks again !

On Fri, Sep 5, 2014 at 10:17 AM, Jim Johnson <john...@umn.edu> wrote:

Pratik,

Are we still looking at: http://galaxyp-dev.msi.umn.edu:8081/u/pratik/h/p4-control-peptideshaker-test-september-2014-part-two
It appears to me that the job completed successfully. The 3 output datasets all seem to contain data.
Should there be any additional outputs?

The stdout/stderr does contain info about java VM settings and logging bindings.
But I didn't see anything that would indicate that the job failed.
Is there something I'm missing?

JJ

On 9/5/14, 9:57 AM, Pratik Jagtap wrote:

Hello Everyone,

Items 37, 38 and 39 failed with the following error: Fatal error: Java Exception Picked up _JAVA_OPTIONS: -Xmx6291456k

I will wait for JJ or Tom to answer about the "Picked up _JAVA_OPTIONS: -Xmx6291456k" issue and Bjoern's question about wrapper.

Regards,

Pratik

On Fri, Sep 5, 2014 at 9:13 AM, Pratik Jagtap <pja...@umn.edu> wrote:

Hello Harald, Bjoern and JJ,

Thanks JJ - I am testing JJ's suggestion now (see items 37, 38 and 39) in http://galaxyp-dev.msi.umn.edu:8081/u/pratik/h/p4-control-peptideshaker-test-september-2014-part-two

@Harald - the "generic_HOMD|sp|" with "sp|" and "generic_HOMD|tr|" with "tr|" issue was only in item# 25 database. Item #31 which picked up the Java error did not have those entries.

I will ask JJ, Trevor or Tom to look at the Java issue.

@Bjoern -

@Pratik: can you check you have the following lines in your wrapper?

<stdio>
    <exit_code range="1:" level="fatal" description="Job Failed" />
    <regex match="Error" level="fatal" description="Error encounterd!"/>
</stdio>

This should filter junk from stderr and only fail if there are real errors, indicated by "error" or a real Unix error code.

I will request JJ, Trevor or Tom to look at this. Thanks.

Thanks and Regards,

Pratik

On Fri, Sep 5, 2014 at 8:32 AM, Jim Johnson <john...@umn.edu> wrote:

I'm not sure if my previous email, which suggested a different Ensembl regex, actually got sent.
Find Regex:
>(ENST\S*) \[(\d+) - (\d+)]\s*(.*)
Replacement:
>generic_EnSEMBL|\1_:\2:\3|\4

Here's the stack trace:
java.util.regex.PatternSyntaxException: Illegal character range near index 38
AA_coverage_ccs_ENST00000261769_44_[3-2837]_cus_ENST00000562836_44_[183-2717]_cus_H3BNC6_cus_H3BVI7_cus_P12830
                                      ^
    at java.util.regex.Pattern.error(Pattern.java:1924)
    at java.util.regex.Pattern.range(Pattern.java:2594)
    at java.util.regex.Pattern.clazz(Pattern.java:2507)
    at java.util.regex.Pattern.sequence(Pattern.java:2030)
    at java.util.regex.Pattern.expr(Pattern.java:1964)
    at java.util.regex.Pattern.compile(Pattern.java:1665)
    at java.util.regex.Pattern.<init>(Pattern.java:1337)
    at java.util.regex.Pattern.compile(Pattern.java:1022)
    at java.lang.String.split(String.java:2313)
    at java.lang.String.split(String.java:2355)
    at eu.isas.peptideshaker.utils.IdentificationFeaturesCache.getObjectKey(IdentificationFeaturesCache.java:644)
    at eu.isas.peptideshaker.utils.IdentificationFeaturesCache.addObject(IdentificationFeaturesCache.java:242)
    at eu.isas.peptideshaker.utils.IdentificationFeaturesGenerator.getAACoverage(IdentificationFeaturesGenerator.java:143)
    at eu.isas.peptideshaker.utils.IdentificationFeaturesGenerator.estimateSequenceCoverage(IdentificationFeaturesGenerator.java:203)
    at eu.isas.peptideshaker.utils.IdentificationFeaturesGenerator.getSequenceCoverage(IdentificationFeaturesGenerator.java:438)
    at eu.isas.peptideshaker.export.sections.ProteinSection.getFeature(ProteinSection.java:345)
    at eu.isas.peptideshaker.export.sections.ProteinSection.writeSection(ProteinSection.java:186)
    at eu.isas.peptideshaker.export.PSExportFactory.writeExport(PSExportFactory.java:296)
    at eu.isas.peptideshaker.cmd.CLIMethods.exportReport(CLIMethods.java:249)
    at eu.isas.peptideshaker.cmd.ReportCLI.call(ReportCLI.java:135)
    at eu.isas.peptideshaker.cmd.ReportCLI.main(ReportCLI.java:256)

On 9/5/14, 8:23 AM, Jim Johnson wrote:

The fasta IDs are used to construct a key for caching Features, and the key is used as a regex:

eu.isas.peptideshaker.utils.IdentificationFeaturesCache.getObjectKey(IdentificationFeaturesCache.java:644)

    /**
     * Convenience method returning the object key based on the cache key.
     *
     * @param cacheKey the cache key
     * @return the object key
     */
    private String getObjectKey(String cacheKey) {
        StringBuilder buf = new StringBuilder();
        String[] splittedKey = cacheKey.split(cacheKey);
        for (int i = 1; i < splittedKey.length; i++) {
            buf.append(splittedKey[i]);
        }
        return buf.toString();
    }
So the generated "generic" fasta header lines should not have characters that have characters that specify regex constructs: [ ] ( ) \

The range designation in the Ensembl headers needs to be something other than "[2653-3087]":
>generic_EnSEMBL|ENST00000460658_48_[2653-3087]|cdna:known chromosome:GRCh37:22:31484088:31497769:1 gene:ENSG00000183963 gene_biotype:protein_coding transcript_biotype:retained_intron
FLPESIKPFPHSIPCQVMAVPSPQLLLERPLLPVSFMFLTSHPPPRLVCPMHLCICAVWVLVALLRMHGASPAQTSGTRSGNGGCRRHGAGQGRGAATQPLRPPRGTASGQLMALLSALLPRLSGSSTPMMAHGRPAPPQWSRVS
Would using ":2653:3087" work, or does some other application rely on the "[2653-3087]" construct?
>generic_EnSEMBL|ENST00000460658_48_:2653:3087|cdna:known chromosome:GRCh37:22:31484088:31497769:1 gene:ENSG00000183963 gene_biotype:protein_coding

Find Regex:
>(ENST\S*) \[(\d+) - (\d+)]\s*(.*)
Replacement:
>generic_EnSEMBL|\1_:\2:\3|\4
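
For anyone who wants to try the same substitution outside Galaxy, here is a rough Java equivalent. It is only a sketch: the class name EnsemblHeaderRewrite and the input header are hypothetical (the input format is inferred from the find pattern), and Java writes back-references in the replacement as $1 rather than \1.

```java
import java.util.regex.Pattern;

public class EnsemblHeaderRewrite {

    // Same find pattern as above; the closing ']' is escaped here for clarity.
    private static final Pattern FIND =
            Pattern.compile("^>(ENST\\S*) \\[(\\d+) - (\\d+)\\]\\s*(.*)$");

    // Java uses $1..$4 for captured groups in a replacement string.
    private static final String REPLACEMENT = ">generic_EnSEMBL|$1_:$2:$3|$4";

    /** Rewrites one header line; lines that do not match are returned unchanged. */
    static String rewrite(String line) {
        return FIND.matcher(line).replaceFirst(REPLACEMENT);
    }

    public static void main(String[] args) {
        // Hypothetical input header, shaped to fit the find pattern.
        String in = ">ENST00000460658_48 [2653 - 3087] cdna:known chromosome:GRCh37:22:31484088:31497769:1";
        System.out.println(rewrite(in));
    }
}
```

Sequence lines and headers from other databases do not match the pattern, so they pass through untouched.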

Demonstrated in Step#5 of history:
    http://galaxyp-dev.msi.umn.edu:8081/u/jjohnson/h/fasta-id-conversions

--
James E. Johnson Minnesota Supercomputing Institute University of Minnesota


Pratik Jagtap


Sep 11, 2014, 11:40:35 AM

to Björn Grüning, Jim Johnson, Harald Barsnes, Ira Cooke, harald....@gmail.com, Marc Vaudel, Timothy Griffin, Kevin Murray, Lennart Martens, Thomas McGowan, Galaxy for Proteomics

Thanks Bjoern,

That would be awesome. There might soon be interest in using OpenSWATH within GalaxyP. Please keep us updated regarding the OpenMS tools.

Thanks,

Pratik

Pratik Jagtap,

Managing Director,

Center for Mass Spectrometry and Proteomics,

43 Gortner Laboratory
1479 Gortner Avenue
St. Paul, MN 55108

Phone: 612-624-9275

On Thu, Sep 11, 2014 at 9:05 AM, Björn Grüning <bjoern....@gmail.com> wrote:

Pratik, thanks for sharing! Soon there will be a new version of OpenMS with hopefully all tools!

Cheers,
Bjoern

On 11.09.2014 at 01:12, Pratik Jagtap wrote:

Hello Everyone,

Here is a history that goes from RAW files to PeptideShaker outputs:
http://galaxyp-dev.msi.umn.edu:8081/u/pratik/h/control-p4-raw-to-second-step-search

Here is the workflow that makes it possible:
http://galaxyp-dev.msi.umn.edu:8081/u/pratik/w/copy-of-p4-mzml-to-second-step-search

The workflow takes a dataset collection as input and uses the following
tools: msconvert, MGF formatter, Protein database Downloader, Regex Find
and Replace, FASTA Merge Files and PeptideShaker.

I will keep you updated as we make more progress.

Regards,

Pratik

Pratik Jagtap,
Managing Director,
Center for Mass Spectrometry and Proteomics,
43 Gortner Laboratory
1479 Gortner Avenue
St. Paul, MN 55108
Phone: 612-624-9275

On Sat, Sep 6, 2014 at 8:37 PM, Pratik Jagtap <pja...@umn.edu> wrote:

Hello Everyone,

Here is a cleaner history and its workflow:

History:
http://galaxyp-dev.msi.umn.edu:8081/u/pratik/h/p4-mzml-to-second-step-search-1

Workflow:
http://galaxyp-dev.msi.umn.edu:8081/u/pratik/w/copy-of-p4-raw-to-second-step-search

I will keep everyone updated on any issues / successes.

Regards,

Pratik

Pratik Jagtap,
Managing Director,
Center for Mass Spectrometry and Proteomics,
43 Gortner Laboratory
1479 Gortner Avenue
St. Paul, MN 55108
Phone: 612-624-9275

On Fri, Sep 5, 2014 at 10:20 AM, Pratik Jagtap <pja...@umn.edu> wrote:

My bad - I was looking at another history.

I will look at this more closely and see if the outputs can be used for
further processing.

Thanks again !

Pratik Jagtap,
Managing Director,
Center for Mass Spectrometry and Proteomics,
43 Gortner Laboratory
1479 Gortner Avenue
St. Paul, MN 55108
Phone: 612-624-9275

On Fri, Sep 5, 2014 at 10:17 AM, Jim Johnson <john...@umn.edu> wrote:

Pratik,

Are we still looking at:
http://galaxyp-dev.msi.umn.edu:8081/u/pratik/h/p4-control-peptideshaker-test-september-2014-part-two
It appears to me that the job completed successfully. The 3 output
datasets all seem to contain data.
Should there be any additional outputs?

The stdout/stderr does contain info about java VM settings and logging
bindings.
But I didn't see anything that would indicate that the job failed.
Is there something I'm missing?

JJ

On 9/5/14, 9:57 AM, Pratik Jagtap wrote:

Hello Everyone,

Items 37, 38 and 39 failed with the following error:

Fatal error: Java Exception
Picked up _JAVA_OPTIONS: -Xmx6291456k

I will wait for JJ or Tom to answer about the "Picked up
_JAVA_OPTIONS: -Xmx6291456k" issue and Bjoern's question about the wrapper.

Regards,

Pratik

Pratik Jagtap,
Managing Director,
Center for Mass Spectrometry and Proteomics,
43 Gortner Laboratory
1479 Gortner Avenue
St. Paul, MN 55108
Phone: 612-624-9275

On Fri, Sep 5, 2014 at 9:13 AM, Pratik Jagtap <pja...@umn.edu> wrote:

Hello Harald, Bjoern and JJ,

Thanks JJ - I am testing JJ's suggestion now (see items 37, 38 and
39) in
http://galaxyp-dev.msi.umn.edu:8081/u/pratik/h/p4-control-peptideshaker-test-september-2014-part-two

@Harald - the "generic_HOMD|sp|" with "sp|" and "generic_HOMD|tr|"
with "tr|" issue was only in item# 25 database. Item #31 which picked up
the Java error did not have those entries.

I will ask JJ, Trevor or Tom to look at the Java issue.

@Bjoern -

@Pratik: can you check you have the following lines in your wrapper?

<stdio>
    <exit_code range="1:" level="fatal" description="Job Failed" />
    <regex match="Error" level="fatal" description="Error encountered!" />
</stdio>

This should filter junk from stderr and only fail if there are real
errors, indicated by "error" in the output or a real unix error code.
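
As a rough illustration of what these two rules amount to, here is a small Java sketch. It is not Galaxy's implementation: the class name StdioRuleSketch is made up, and it assumes the regex is searched case-insensitively in both stdout and stderr.

```java
import java.util.regex.Pattern;

public class StdioRuleSketch {

    // Assumption: "Error" is searched case-insensitively anywhere in the output.
    private static final Pattern ERROR =
            Pattern.compile("Error", Pattern.CASE_INSENSITIVE);

    /** Returns true if either rule would mark the job as failed. */
    static boolean isFatal(int exitCode, String stdout, String stderr) {
        boolean badExit = exitCode >= 1;                   // exit_code range="1:"
        boolean errorText = ERROR.matcher(stdout).find()
                || ERROR.matcher(stderr).find();           // regex match="Error"
        return badExit || errorText;
    }

    public static void main(String[] args) {
        // The JVM notice alone, with exit code 0, does not trip either rule.
        System.out.println(isFatal(0, "", "Picked up _JAVA_OPTIONS: -Xmx6291456k"));
        // A non-zero exit code does.
        System.out.println(isFatal(1, "", ""));
    }
}
```

Under these assumptions, the "Picked up _JAVA_OPTIONS" line on stderr would no longer fail a job that otherwise exits cleanly.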

I will request JJ, Trevor or Tom to look at this. Thanks.

Thanks and Regards,

Pratik

Pratik Jagtap,
Managing Director,
Center for Mass Spectrometry and Proteomics,
43 Gortner Laboratory
1479 Gortner Avenue
St. Paul, MN 55108
Phone: 612-624-9275


Björn Grüning


Sep 11, 2014, 11:53:22 AM

to Pratik Jagtap, Jim Johnson, Harald Barsnes, Ira Cooke, harald....@gmail.com, Marc Vaudel, Timothy Griffin, Kevin Murray, Lennart Martens, Thomas McGowan, Galaxy for Proteomics

Sure, will do ... the idea is to generate them automatically ... we will
see how successful we are :)
