A tool that parses out proteomic databases from UniProt with species as an input?

Pratik Jagtap

Nov 15, 2014, 4:48:56 PM
to Galaxy for Proteomics, Timothy Griffin, Jim Johnson, Ira Cooke, Björn Grüning, Lennart Martens, Candace Guerrero
Hello Everyone,

There is a concept that we have been discussing for addition to Galaxy-P.

Given that most metagenomics studies are based on 16S rRNA-based taxonomic identification, I was wondering whether there is a tool that can take in species names and retrieve the corresponding proteomes (if available) from the UniProt website. We could then use the merge FASTA function within GalaxyP to generate a merged FASTA file for subsequent searches. In our discussions with researchers working in the field of metaproteomics, this came up as a useful tool. Any ideas on the effort that would be required to build it?

Any input or discussion on this will be greatly appreciated.

Regards,

Pratik


Pratik Jagtap,
Managing Director,
Center for Mass Spectrometry and Proteomics,
43 Gortner Laboratory
1479 Gortner Avenue
St. Paul, MN 55108

Ira Cooke

Nov 15, 2014, 5:00:10 PM
to Pratik Jagtap, Galaxy for Proteomics, Timothy Griffin, Jim Johnson, Björn Grüning, Lennart Martens, Candace Guerrero
Hi Pratik, 


UniProt has a great API, so if you know the species identifier (or a list of them) you can get a customized database directly from UniProt by downloading from a special URL that contains all the taxonomic identifiers. This negates the need for a merge step.

This is an example (Dog and Mouse)


I guess the trick would be to go from species names to taxon IDs, since this is inherently fuzzy (a species might be listed under a different name from the one you expect). For my purposes I just do this by hand using UniProt via the NCBI Taxonomy database, but if you have a bulk list of species names I am not sure how to do it in an automated way (unless every species name had a perfect match in the database).
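
The URL-based download described above might be sketched as follows. This is only an illustration: the endpoint and the `organism_id` query field are assumptions based on the current UniProt REST service (the URL scheme available in 2014 was different), and `build_fasta_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: the current UniProt REST streaming endpoint, not the 2014 URL scheme.
UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def build_fasta_url(taxon_ids):
    """Build a URL requesting all UniProtKB entries for the given NCBI taxon IDs as FASTA."""
    if not taxon_ids:
        raise ValueError("at least one taxon ID is required")
    # int() rejects anything that is not a plain numeric taxon ID.
    query = " OR ".join(f"organism_id:{int(t)}" for t in taxon_ids)
    return f"{UNIPROT_STREAM}?{urlencode({'format': 'fasta', 'query': query})}"


# Dog (9615) and mouse (10090) in a single request, so no merge step is needed.
url = build_fasta_url([9615, 10090])
print(url)
```

The resulting URL could be fetched with any HTTP client; the sketch only builds the string so that it runs without network access.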

Cheers
Ira

Lennart Martens

Nov 16, 2014, 9:05:25 AM
to gal...@umn.edu, Pratik Jagtap, Timothy Griffin, Jim Johnson, Ira Cooke, Björn Grüning, Candace Guerrero

Hi Pratik,

I will repeat myself :). DBToolkit can do this from the local, complete UniProt file (in .dat format) for species as well as for entire taxons, specified as either the text string ('homo sapiens') or the TaxIDs (9606). As stated above, it does require a local version of the file, however.
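
The local-file approach can be illustrated in a few lines. This is not DBToolkit's code, only a minimal Python sketch that assumes the UniProtKB flat-file layout (ID, AC, OX and SQ lines, entries terminated by `//`). It filters by TaxID only, not by species text string; the function name and the toy entries are made up for the example.

```python
import re

TAXID_RE = re.compile(r"NCBI_TaxID=(\d+)")


def filter_dat_to_fasta(lines, wanted_taxids):
    """Yield (header, sequence) for entries whose OX line carries a wanted taxon ID."""
    wanted = {str(t) for t in wanted_taxids}
    entry_id, accession, taxid, seq, in_seq = None, None, None, [], False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):  # end of entry: emit it if it matches, then reset
            if taxid in wanted and seq:
                yield f">{accession}|{entry_id}", "".join(seq)
            entry_id, accession, taxid, seq, in_seq = None, None, None, [], False
        elif in_seq:  # sequence lines are space-separated blocks of residues
            seq.append(line.replace(" ", ""))
        elif line.startswith("ID   "):
            entry_id = line.split()[1]
        elif line.startswith("AC   ") and accession is None:
            accession = line.split()[1].rstrip(";")  # keep the primary accession only
        elif line.startswith("OX   "):
            match = TAXID_RE.search(line)
            taxid = match.group(1) if match else None
        elif line.startswith("SQ   "):
            in_seq = True


# Toy entries in flat-file layout; the accessions and sequences are invented.
sample = """\
ID   TEST1_HUMAN             Reviewed;          10 AA.
AC   X00001;
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606;
SQ   SEQUENCE   10 AA;  0 MW;  0 CRC64;
     MALWMRLLPL
//
ID   TEST2_MOUSE             Reviewed;           5 AA.
AC   X00002;
OS   Mus musculus (Mouse).
OX   NCBI_TaxID=10090;
SQ   SEQUENCE   5 AA;  0 MW;  0 CRC64;
     MKTAY
//
""".splitlines()

records = list(filter_dat_to_fasta(sample, [9606]))
for header, sequence in records:
    print(header)
    print(sequence)
```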

Cheers,

lnnrt.




Björn Grüning

Nov 16, 2014, 9:21:10 AM
to Lennart Martens, gal...@umn.edu, Pratik Jagtap, Timothy Griffin, Jim Johnson, Ira Cooke, Candace Guerrero
Hi Lennart,

can you point me to the most recent code? google-code is dead, right?
Requiring a local db is good for reproducibility and we could write a
data_manager for an optimal Galaxy integration.

But as pointed out by Ira, being able to query UniProt directly would
also be nice and should not be too hard. Ideally, UniProt would offer a
nice "Download to Galaxy" button, as UCSC and SRA do.

Lennart, can DBToolkit be used to get the mapping from "Human" to some taxa?

Pratik, could you please create a Trello card here:
https://trello.com/b/vEfiz617/proteomics

so that we can organise our efforts?

Thanks,
Bjoern

Lennart Martens

Nov 16, 2014, 2:02:32 PM
to gal...@umn.edu, Pratik Jagtap, Björn Grüning, Timothy Griffin, Jim Johnson, Ira Cooke, Candace Guerrero

Hi Bjoern,

Google code isn't dead :). DBToolkit is just very stable (the tool is 12 years old), and relies in a clever way on compomics-utilities for most of the variable bits.

I'm not sure what you mean by the mapping of 'human' to other taxa.

Cheers,

lnnrt.




Pratik Jagtap

Nov 16, 2014, 7:41:02 PM
to Lennart Martens, Galaxy for Proteomics, Björn Grüning, Timothy Griffin, Jim Johnson, Ira Cooke, Candace Guerrero, Thomas McGowan
Hello Ira, Bjoern and Lennart,

Thanks for the inputs.

I think it would be a good idea to use DBToolkit or its many features within GalaxyP. 

As a start, do you think that we could approach someone at UniProt, have them install DBToolkit and discuss the 'Download to Galaxy' option? This would save us the effort of downloading and updating the UniProt .dat file regularly.

Bjoern - I will have it placed on Trello once our plans are fixed. I think first we will need to have a stable version of SearchGUI and PeptideShaker wrapped and ready for GalaxyP installation so that we can test it. Candace is leading this effort and I will help her in testing as we proceed.

Regards,

Pratik



Ira Cooke

Nov 16, 2014, 9:13:00 PM
to Pratik Jagtap, Lennart Martens, Galaxy for Proteomics, Björn Grüning, Timothy Griffin, Jim Johnson, Candace Guerrero, Thomas McGowan
Hi Pratik, 

As far as I can tell, the UniProt API provides all that is needed. As per Bjoern’s suggestion, this could perhaps be implemented as a Galaxy data manager, or just as a simple tool that creates the file in your history via a download.

I believe this is the best option, as it is simple (just a Galaxy tool), it doesn’t require storing data locally, and it will always give the latest data. It is also precise, as there is no reliance on parsing names.

One missing piece is the “Species -> TaxonID” tool, but that could be done using a local download of the NCBI Taxonomy data (or a web API; I haven’t looked, but UniProt might even provide this too). I’d actually say that you’re better off getting away from species names if possible: to be precise, you need the taxon IDs at some point anyway.
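
A "Species -> TaxonID" lookup from a local NCBI Taxonomy download might look roughly like this. It is a sketch that assumes the `names.dmp` layout from the NCBI taxdump archive (fields separated by tab, pipe, tab); `load_name_index` and `resolve` are hypothetical names, and the sample lines stand in for the real file.

```python
from collections import defaultdict


def load_name_index(lines):
    """Map each lower-cased name (scientific, common, synonym) to its set of taxon IDs."""
    index = defaultdict(set)
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        tax_id, name = fields[0], fields[1]
        index[name.lower()].add(int(tax_id))
    return index


def resolve(name, index):
    """Return the taxon ID for an exact, case-insensitive match, or None if absent or ambiguous."""
    key = " ".join(name.lower().split())  # collapse stray whitespace
    hits = index.get(key, set())
    return next(iter(hits)) if len(hits) == 1 else None


# Sample lines in names.dmp layout: tax_id | name | unique name | name class |
sample = [
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|",
    "10090\t|\tMus musculus\t|\t\t|\tscientific name\t|",
    "10090\t|\thouse mouse\t|\t\t|\tgenbank common name\t|",
]
index = load_name_index(sample)
for query in ["homo  sapiens", "Human", "Canis lupus"]:
    print(query, "->", resolve(query, index))
```

Names without a unique exact match come back as `None`, which keeps the fuzzy cases visible for manual curation instead of guessing.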

Cheers
Ira

Björn Grüning

Nov 17, 2014, 12:40:40 PM
to Lennart Martens, gal...@umn.edu, Pratik Jagtap, Timothy Griffin, Jim Johnson, Ira Cooke, Candace Guerrero


On 16.11.2014 at 20:02, Lennart Martens wrote:
> Hi Bjoern,
>
>
> Google code isn't dead :). DBToolkit is just very stable (the tool is 12 years old), and relies in a clever way on compomics-utilities for most of the variable bits.

Oh ok, I thought Google had closed google-code in some summer clean-up.
Sorry for the confusion!

> I'm not sure what you mean with the mapping of 'human' to other taxa.

Is it able to map the words "homo sapiens" to the corresponding tax-id,
as Ira mentioned? This functionality is also important and could be an
extra wrapper.

Ciao,
Bjoern

Lennart Martens

Nov 17, 2014, 2:16:46 PM
to Björn Grüning, Lennart Martens, gal...@umn.edu, Pratik Jagtap, Timothy Griffin, Jim Johnson, Ira Cooke, Candace Guerrero
Hi Bjoern,


Not sure whether this is still relevant. Ira has a point that the UniProt
API could be a better way forward. DBToolkit does essentially the same job
as the UniProt API should do, after all.

Translations of taxonomy are not included per se, although DBToolkit has
the intelligence to match the species to the right part of the UniProt
.dat file, meaning that only sequences explicitly assigned to 'Homo
sapiens' would show up.

Several ontology managers (e.g. OLS,
http://www.ebi.ac.uk/ontology-lookup/; has a REST service as well) allow
easy access to the taxonomy as a CV.


Cheers,

lnnrt.

Pratik Jagtap

Nov 17, 2014, 2:25:15 PM
to compomics ugent, Björn Grüning, Lennart Martens, Galaxy for Proteomics, Timothy Griffin, Jim Johnson, Ira Cooke, Candace Guerrero
Hello Everyone,

These are excellent points and we will discuss this soon to decide on the best way forward.

Most 16S rRNA studies offer lists of identified species (and strains). It would be a good idea to a) take this list and b) either convert it into taxonomy identifiers or c) submit it as species names through d) the UniProt API or e) some features from DBToolkit to f) generate a FASTA file of available proteomes.
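
Steps a) to c) of this workflow can be sketched as a small triage function: entries that resolve to a taxonomy identifier go one way, the rest stay as names. The function name, the lookup table and the species list are illustrative assumptions, not part of any existing GalaxyP tool.

```python
def plan_lookups(species_list, name_to_taxid):
    """Split a species list into taxon IDs (resolved) and names (unresolved), dropping duplicates."""
    by_taxid, by_name, seen = [], [], set()
    for raw in species_list:
        name = " ".join(raw.split())  # normalise whitespace
        key = name.lower()
        if not key or key in seen:
            continue
        seen.add(key)
        taxid = name_to_taxid.get(key)
        if taxid is not None:
            by_taxid.append(taxid)
        else:
            by_name.append(name)  # left for a name-based query or manual curation
    return by_taxid, by_name


# Hypothetical 16S result list and a small lookup table keyed by lower-cased name.
species = ["Escherichia coli", "escherichia  coli", "Bacteroides fragilis", "Unknown bacterium X"]
table = {"escherichia coli": 562, "bacteroides fragilis": 817}

by_taxid, by_name = plan_lookups(species, table)
print(by_taxid)
print(by_name)
```

The resolved identifiers could then feed a download request or a local filter, while the unresolved names show how much of the list still needs attention.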

I will place this on Trello at some point, but it's good to know that this is indeed possible.

Thanks so much,

Pratik



harald.barsnes

Nov 17, 2014, 3:04:00 PM
to gal...@umn.edu, lennart...@ugent.be, bjoern....@gmail.com, lennart...@gmail.com, tgri...@umn.edu, j...@umn.edu, irac...@gmail.com, cgue...@umn.edu
 
Hi all,
 
On a related note it might be worth checking out Galaxy Integrated Omics (http://gio.sbcs.qmul.ac.uk), and in particular their way of supporting MS Convert via virtual machines.
 
Would be great to be able to add this to our setup as well. Any idea if that is possible?
 
Best regards,
Harald

Pratik Jagtap

Nov 17, 2014, 4:15:16 PM
to harald.barsnes, Galaxy for Proteomics, compomics ugent, Björn Grüning, Lennart Martens, Timothy Griffin, Jim Johnson, Ira Cooke, Candace Guerrero
Hello Harald,

Thanks for sharing this link. We met Conrad Bessant at ASMS this year. It would be nice if we could have some of his tools (not sure whether they are in the Tool Shed).

For msconvert, we use LWR (or Pulsar) to convert RAW files into mzML (or MGF).

Regards,

Pratik




Pratik Jagtap

Nov 17, 2014, 4:16:24 PM
to harald.barsnes, Galaxy for Proteomics, compomics ugent, Björn Grüning, Lennart Martens, Timothy Griffin, Jim Johnson, Ira Cooke, Candace Guerrero
For msconvert, we use LWR (or Pulsar) to convert RAW files into mzML (or MGF).

JJ, Ira or Bjoern might be able to confirm / elaborate on this.

Regards,

Pratik



Björn Grüning

Nov 17, 2014, 5:19:56 PM
to lennart...@ugent.be, Lennart Martens, gal...@umn.edu, Pratik Jagtap, Timothy Griffin, Jim Johnson, Ira Cooke, Candace Guerrero
Hi Lennart,

Ira has for sure a valid point and we should definitely have such
first-class UniProt integration. Your DBToolkit is interesting from
another point of view: it does not require any internet connection (many
clusters do not have access to the internet) and it gives the same
results over a long period of time (if you store the *.dat file). So I
think both options should be considered.

Ciao,
Bjoern

Björn Grüning

Nov 17, 2014, 5:41:42 PM
to Pratik Jagtap, harald.barsnes, Galaxy for Proteomics, compomics ugent, Lennart Martens, Timothy Griffin, Jim Johnson, Ira Cooke, Candace Guerrero
Hi,


>> For msconvert, we do use a LWR (or Pulsar) to convert RAW files into mzml
>> (or MGF).
>
> JJ, Ira or Bjoern might be able to confirm / elaborate on this.

Yes, it is possible, and we here in Freiburg do conversions with Pulsar
on a Windows node. This works pretty well!

Cheers,
Bjoern