I'm working on the pepXML parser in OpenMS. I've been confronted with
a type of pepXML file I hadn't seen before, where search results from
different search engines - but for the same experiment - were
collected in one file (with one "msms_run_summary" per search engine).
I've added (maybe prematurely) support for this to the OpenMS parser,
and then wanted to construct a simple pepXML file for testing
purposes.
In doing so, I've now come across a constraint in the pepXML schema
(at least from v1.8 on) that says values of the "base_name" attribute
(supposed to contain the full path to the searched mzXML file) in the
"search_summary" element have to be unique within the document.
What is the rationale behind this constraint? Is it supposed to
prevent the above case, where different searches of the same
experiment end up in one file? Why would that be desirable/necessary?
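To make the case concrete, here is a minimal sketch in Python (standard library only) of the situation in question. It is only an illustration under stated assumptions: the toy document omits the pepXML namespace and most required attributes, the paths and engine labels are made up, and the helper name duplicate_search_base_names is hypothetical. It reports "base_name" values that occur in more than one "search_summary", which is what the uniqueness constraint would reject.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy document (hypothetical paths, no XML namespace): two search engines
# were run on the same mzXML file, so both search_summary elements carry
# the same base_name.
SAMPLE = """
<msms_pipeline_analysis>
  <msms_run_summary base_name="/data/exp1/run01">
    <search_summary base_name="/data/exp1/run01" search_engine="MASCOT"/>
  </msms_run_summary>
  <msms_run_summary base_name="/data/exp1/run01">
    <search_summary base_name="/data/exp1/run01" search_engine="X! Tandem"/>
  </msms_run_summary>
</msms_pipeline_analysis>
"""


def duplicate_search_base_names(xml_text):
    """Return search_summary base_name values that occur more than once."""
    root = ET.fromstring(xml_text)
    counts = Counter(s.get("base_name") for s in root.iter("search_summary"))
    return sorted(name for name, n in counts.items() if n > 1)


print(duplicate_search_base_names(SAMPLE))
```

A real file would need namespace-aware tag names; the sketch only shows where the collision comes from.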
(Also note that I can construct a valid and parseable pepXML file from
two different search runs of the same file if I change the path in
"base_name"...)
In an earlier discussion
(http://groups.google.com/group/spctools-discuss/msg/7760dcda02877922?hl=en),
it was mentioned that
"base_name"s in "msms_run_summary" elements had to be unique in the
document - however, as per the schema, that's not true. Also, the
"base_name" of an "msms_run_summary" is not tied to the "base_name" in
subordinate "search_summary"s. If there were such a constraint, it
would be impossible to have more than one "search_summary" under an
"msms_run_summary" - however, this is allowed in the schema.
When does it make sense to have different "base_name"s in an
"msms_run_summary" and its subordinate "search_summary"(s)? Judging
from the schema documentation and the files I've seen, it seems that
the values should be the same. On the other hand, why have the
attribute in both elements then?
All this adds to my confusion about the appropriate use of
"base_name"...
I would be happy if someone could clear things up for me.
Best regards
Hendrik
Hi Hendrik, I think we need to get an authoritative answer from David on
this one. And he is currently traveling in the Land of the Finns. We will
let/ask him to answer when he is next able.
Regards,
Eric
> From: spctools-discuss@googlegroups.com [mailto:spctools-
> discuss@googlegroups.com] On Behalf Of Hendrik Weisser
Hi Hendrik,
The element msms_pipeline_analysis/msms_run_summary has an attribute
base_name to specify the path to the data file. In case the searched
file is different from the original data file, there is another
base_name entry in the element
msms_pipeline_analysis/msms_run_summary/search_summary.
As far as I know, there is nothing in the schema that requires these
to be unique in the pepXML file. Can you point me to where this
constraint is specified in the schema? I checked version 1.8.
-David
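As a rough illustration of the two base_name attributes described above, the following Python sketch lists every "search_summary" whose base_name differs from that of its parent "msms_run_summary". The toy document, its paths, and the helper name base_name_mismatches are assumptions made for this example; the document has no namespace and is far from a complete pepXML file.

```python
import xml.etree.ElementTree as ET

# Toy document (hypothetical paths, no XML namespace). In the first run the
# searched file lives in a different place than the original data file.
SAMPLE = """
<msms_pipeline_analysis>
  <msms_run_summary base_name="/data/exp1/run01">
    <search_summary base_name="/data/exp1/mascot/run01" search_engine="MASCOT"/>
  </msms_run_summary>
  <msms_run_summary base_name="/data/exp1/run02">
    <search_summary base_name="/data/exp1/run02" search_engine="SEQUEST"/>
  </msms_run_summary>
</msms_pipeline_analysis>
"""


def base_name_mismatches(xml_text):
    """List (run base_name, search base_name) pairs that differ."""
    root = ET.fromstring(xml_text)
    mismatches = []
    for run in root.iter("msms_run_summary"):
        # Only direct children: the search_summary elements of this run.
        for search in run.findall("search_summary"):
            if search.get("base_name") != run.get("base_name"):
                mismatches.append((run.get("base_name"), search.get("base_name")))
    return mismatches


print(base_name_mismatches(SAMPLE))
```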
> On Fri, Nov 13, 2009 at 12:31 AM, Eric Deutsch
> <edeut...@systemsbiology.org> wrote:
I will try to reason this through by thinking from the original
authors' point of view. The idea is that different searches of the
same data would happen in separate directories, so the base_name
(full path to the data file) would identify one search of an mzXML
file representing one msms_run, and more than one search would never
happen in one directory on the same file. Also, it is natural to keep
all results from one search engine run on an mzXML file in one place
in the pepXML file. However, you could have more than one search that
references the same data, and these don't necessarily have to be
placed together in the pepXML file. The problem with this is that you
could have different paths to the same data, and these would all be
listed. In the iProphet tool (which combines results from multiple
searches of the same data), I don't look at either base_name but
rather at the spectrum names themselves, in combination with
experiment_label, a user-specified parameter that identifies data
from the same experiment. The idea is that the combination of
experiment_label and spectrum name will uniquely identify a searched
spectrum. I hope this is helpful. Let us know if you have other
questions.
-David
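The keying idea described above can be sketched roughly as follows. This is not iProphet's code, just a Python illustration under assumptions: experiment_label is taken to be an attribute on each "spectrum_query" (check where your files actually carry it), the toy document has no namespace, and the paths, peptides, and helper name hits_by_spectrum are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy document (hypothetical values, no XML namespace): the same spectrum
# searched by two engines, stored under two different base_name paths.
SAMPLE = """
<msms_pipeline_analysis>
  <msms_run_summary base_name="/data/mascot/run01">
    <spectrum_query spectrum="run01.00010.00010.2" experiment_label="exp1">
      <search_result><search_hit hit_rank="1" peptide="PEPTIDEK"/></search_result>
    </spectrum_query>
  </msms_run_summary>
  <msms_run_summary base_name="/data/tandem/run01">
    <spectrum_query spectrum="run01.00010.00010.2" experiment_label="exp1">
      <search_result><search_hit hit_rank="1" peptide="PEPTLDEK"/></search_result>
    </spectrum_query>
  </msms_run_summary>
</msms_pipeline_analysis>
"""


def hits_by_spectrum(xml_text):
    """Group peptide hits by (experiment_label, spectrum), ignoring base_name."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for query in root.iter("spectrum_query"):
        key = (query.get("experiment_label"), query.get("spectrum"))
        for hit in query.iter("search_hit"):
            grouped[key].append(hit.get("peptide"))
    return dict(grouped)


print(hits_by_spectrum(SAMPLE))
```

Both hits end up under one key even though the two runs list different paths, which is the point of not relying on base_name.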
On Fri, Nov 20, 2009 at 10:30 AM, David Shteynberg
<dshteynb...@systemsbiology.org> wrote:
> OK I take that back. I see where the unique constraint is listed. I
> will have to consider your questions further.
> -David
> On Fri, Nov 20, 2009 at 10:27 AM, David Shteynberg
> <dshteynb...@systemsbiology.org> wrote:
David, just a quick reply to part of your message. Normally, I make a
directory for an experiment and process the Mascot, Sequest, and
possibly X!Tandem data from each mzXML file in that same directory. I
append the name of the search to the TPP files, so I can determine which
search engine the data came from. These data are sometimes combined (not
always with iProphet, since I only recently implemented it), so I would
violate your rule of a separate directory for each search engine. I've
never seen any documentation saying that either way is required, so I
doubt I am the only one processing data this way.
Greg
On Fri, Nov 20, 2009 at 12:51 PM, David Shteynberg
<dshteynb...@systemsbiology.org> wrote:
> The idea is that different searches
> of the same data would happen in separate directories and the
> base_name (full path to the data file) would identify one search of an
> mzXML file representing one msms_run, and more than one search would
> never happen in one directory on the same file.
I still don't see the sense in having the "base_name" constraint,
though. The basic question for me is, what would break without the
constraint? Because as far as I can tell now, it just prevents a
sensible use case - searching the same mzXML file with different
tools, and aggregating the results in one pepXML file.
> David Shteynberg <dshteynb...@systemsbiology.org> wrote:
> > The idea is that different searches
> > of the same data would happen in separate directories and the
> > base_name (full path to the data file) would identify one search of an
> > mzXML file representing one msms_run, and more than one search would
> > never happen in one directory on the same file.
I understand that you might keep the results from different search
engines in different directories, but surely you wouldn't copy (or
link, if you're smart) an mzXML file to three different directories to
search it with Mascot, Sequest and X!Tandem, would you?
> > In the iProphet tool (which combines results from
> > multiple searches of the same data), I don't look at either base_name
> > but rather the spectrum names themselves, with the combination of
> > experiment_label, which is a user specified parameter that identifies
> > data from the same experiment. The idea is that the combination of
> > experiment_label and spectrum name will uniquely identify a spectrum
> > searched.
This is exactly the problem that I have: Wouldn't it be far easier to
use the "base_name" attribute to make the connection to the mzXML
file, rather than a custom label? I think it would be - only you
couldn't do it with the current schema because of that strange
uniqueness constraint.
I'm working with iProphet results and I have to do quite a bit of
post-processing to find out which results belong to which mzXML file,
so that I can annotate features derived from the mzXML with peptide
sequences. (To be fair, the main reason is that only for X!Tandem
"base_name" really contains the path to the original mzXML...)
From the perspective of a third-party developer, I would much
rather see a clean-up of the existing pepXML than the addition of new
elements.
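For comparison, this is roughly what the mapping to mzXML files could look like if "base_name" in "msms_run_summary" reliably pointed at the original mzXML and could repeat across search engines. It is a hypothetical Python sketch, not OpenMS code: the toy document has no namespace, and the paths, peptides, and helper name peptides_by_base_name are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy document (hypothetical values, no XML namespace): two engines searched
# run01 and one engine searched run02; base_name names the mzXML each time.
SAMPLE = """
<msms_pipeline_analysis>
  <msms_run_summary base_name="/data/exp1/run01">
    <spectrum_query spectrum="run01.00010.00010.2">
      <search_result><search_hit hit_rank="1" peptide="PEPTIDEK"/></search_result>
    </spectrum_query>
  </msms_run_summary>
  <msms_run_summary base_name="/data/exp1/run01">
    <spectrum_query spectrum="run01.00042.00042.3">
      <search_result><search_hit hit_rank="1" peptide="SAMPLER"/></search_result>
    </spectrum_query>
  </msms_run_summary>
  <msms_run_summary base_name="/data/exp1/run02">
    <spectrum_query spectrum="run02.00007.00007.2">
      <search_result><search_hit hit_rank="1" peptide="ANQTHERK"/></search_result>
    </spectrum_query>
  </msms_run_summary>
</msms_pipeline_analysis>
"""


def peptides_by_base_name(xml_text):
    """Collect peptide sequences per msms_run_summary base_name."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(set)
    for run in root.iter("msms_run_summary"):
        for hit in run.iter("search_hit"):
            grouped[run.get("base_name")].add(hit.get("peptide"))
    # Sort for a stable, readable result.
    return {name: sorted(peps) for name, peps in grouped.items()}


print(peptides_by_base_name(SAMPLE))
```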