Issue with pepXML generation


Simon Michnowicz

Nov 11, 2009, 12:43:22 AM
to spctools...@googlegroups.com

Dear Group,

I would like to flag a possible bug in a TPP tool. (Sorry in advance if this is the wrong forum to report bugs.)

One of our users has reported issues with a TPP pepXML tool (he was using Mascot, so I assume he was using Mascot2XML.exe).

Our FASTA database has protein entries with special characters in them, e.g.

IFN-<alpha>2

&

V<beta>14 

This generated a pepXML file that was not valid XML, because the angle brackets in the protein name were not escaped:




<alternative_protein protein="tr|Q9UMA4|IFN-<alpha>2" num_tol_term="2" peptide_prev_aa="-" peptide_next_aa="S"/>




regards




Simon Michnowicz
Duty Programmer
Australian Proteomics Computation Facility
Ludwig Institute For Cancer Research
Royal Melbourne Hospital,
Victoria
Tel: (+61 3) 9341 3155
Fax: (+61 3) 9341 3104

Brian Pratt

Nov 11, 2009, 2:20:17 PM
to spctools...@googlegroups.com
Granted, this is a defect - but that's still an unfortunate choice of characters.  Even with the correction I can imagine this tripping up other software downstream since the properly escaped XML would no longer match the FASTA on a literal basis.  I don't suppose your users could be induced to use { and } or [ and ] or ( and ) instead of < and > ?
 
Brian

Matthew Chambers

Nov 11, 2009, 2:25:08 PM
to spctools...@googlegroups.com
What about the other reserved characters in XML that are valid in FASTA?
"
'
&

Not escaping could also break downstream software - especially with &
which should always begin an escape sequence. :(

-Matt
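
The five reserved characters can all be escaped before a protein name is written into an attribute. The sketch below is illustrative only: it uses Python's standard `xml.sax.saxutils.escape`, not whatever call the TPP converters use, and the helper name `escape_attr` is hypothetical.

```python
from xml.sax.saxutils import escape

# escape() always handles &, < and >; the two quote characters
# are passed as extra entities so attribute values stay safe.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_attr(value: str) -> str:
    """Escape all five XML reserved characters in an attribute value."""
    return escape(value, QUOTE_ENTITIES)


protein = "tr|Q9UMA4|IFN-<alpha>2"
element = (
    f'<alternative_protein protein="{escape_attr(protein)}" '
    'num_tol_term="2" peptide_prev_aa="-" peptide_next_aa="S"/>'
)
print(element)
```

Escaping `&` first matters, since otherwise the ampersands introduced by the other entities would be escaped a second time; `escape()` takes care of that ordering.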



Brian Pratt

Nov 11, 2009, 2:52:41 PM
to spctools...@googlegroups.com
Yes, one would want to escape everything properly - happily there's a library call for that.  And certainly it's only right to emit valid XML.
 
But I do think that it might be wisest to sidestep the whole mess - it's valid FASTA but also unconventional (based on many years of TPP not bumping into this), and even converted to valid XML I suspect it may cause other problems downstream since it no longer exactly matches the FASTA.  I suspect you're damned if you do and damned if you don't. 
 
Brian
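
The literal-match concern can be made concrete. In the sketch below (Python's standard `xml.etree.ElementTree`, used purely for illustration), a tool that parses the escaped pepXML gets the original FASTA name back, while a tool that searches the raw XML text for that name does not find it.

```python
import xml.etree.ElementTree as ET

fasta_name = "tr|Q9UMA4|IFN-<alpha>2"

# The same element as in the original report, with the name properly escaped.
escaped_line = (
    '<alternative_protein protein="tr|Q9UMA4|IFN-&lt;alpha&gt;2" '
    'num_tol_term="2" peptide_prev_aa="-" peptide_next_aa="S"/>'
)

# A conforming parser unescapes the attribute value on read.
parsed_name = ET.fromstring(escaped_line).get("protein")

# A plain substring search over the raw text sees the escaped form instead.
literal_match = fasta_name in escaped_line

print(parsed_name == fasta_name)  # True
print(literal_match)              # False
```

So the risk lies with downstream tools that read pepXML with string matching rather than an XML parser.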

dctrud

Nov 11, 2009, 3:17:12 PM
to spctools-discuss
Unfortunately the offending entries are present in commonly used
public DBs. We recently bumped into exactly this problem, as there are
4 entries containing <xxxx> in the IPI human v3.66 fasta file:

IPI00465120 Gene_Symbol=- 3<beta>-HSD <psi>1 protein
IPI00816409 Gene_Symbol=- V<gamma>1 protein (Fragment)
IPI00816761 Gene_Symbol=CREB1 <alpha>CREB-1 protein (Fragment)
IPI00930475 Gene_Symbol=GUSB F<lambda>8 protein (Fragment)

After hundreds of searches, a particular experiment happened to ID one
of these proteins, causing problems with the tools. In the event I
manually removed the problematic IDs as they were irrelevant for the
experiment. We already re-write IPI headers after download of the
FASTA, so will implement a substitution there if it crops up again.
Should a substitution be added to the IPI retrieval utility scripts in
the TPP distribution so that the problem doesn't show its face if
they are being used?

Interestingly, if you search on the EBI IPI site for these proteins
the < > are substituted with [ ] , but the problematic characters are
in the FASTA.

http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-id+657mP1a3t7q+-e+[IPI:%27IPI00465120.3%27]+-qnum+1+-enum+1

Cheers,

DT
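
A header substitution of this kind could look like the following. This is a minimal sketch, not the TPP getdb scripts or our in-house rewrite: the function names are hypothetical, and it assumes the only place `<` and `>` need replacing is inside header lines.

```python
from typing import Iterable, List


def bracket_header(line: str) -> str:
    """Replace < and > with [ and ] in a FASTA header line.

    The leading '>' that marks the header is kept; sequence
    lines are returned unchanged.
    """
    if line.startswith(">"):
        return ">" + line[1:].replace("<", "[").replace(">", "]")
    return line


def rewrite_fasta(lines: Iterable[str]) -> List[str]:
    """Apply bracket_header to every line of a FASTA file."""
    return [bracket_header(line) for line in lines]


example = [
    ">IPI00816409 Gene_Symbol=- V<gamma>1 protein (Fragment)\n",
    "MKTLLV\n",  # placeholder sequence, not the real entry
]
for line in rewrite_fasta(example):
    print(line, end="")
```

Doing the substitution before the database reaches the search engine keeps the FASTA and every downstream file in agreement.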


Brian Pratt

Nov 11, 2009, 3:27:49 PM
to spctools...@googlegroups.com
Well, I'll go ahead and modify the mascot converter to emit proper XML for proteins with reserved XML characters, but it does sound like folks would do well to make that <> / [] substitution upstream from the search engines.  The fact that the EBI IPI site does the substitution confirms my suspicion that a number of tools might get munged up by this.
 
Brian

Jimmy Eng

Nov 11, 2009, 3:31:04 PM
to spctools...@googlegroups.com
I'll add the substitutions to the getdb.* scripts in the TPP src/util directory.

Simon Michnowicz

Nov 11, 2009, 6:21:31 PM
to spctools-discuss

Unfortunately we have no control over what goes in the FASTA
databases! Matrix Science's pepXML generation code escapes the XML:

    if ($thisScript->param($urlParams{'prot_desc'})) {
        $prot_desc = &noXmlTag(&mustGetProteinDescription($protein_list[0], \%fastaTitles));
    }

where noXmlTag() strips XML tags from strings.

Thanks
Simon Michnowicz.

Eileen Yue

Nov 11, 2009, 6:00:19 PM
to spctools...@googlegroups.com

 

Dear all:

May I ask a question? After I run a peptide analysis with TPP, I export the results to Excel. The columns list all the information shown on the web page, but I also noticed some extra information, such as “adjusted ratio mean”. What does “adjusted ratio mean” represent?

Also, for an identified protein, does the percentage of “shared of spectrum ids” refer to the spectra that are also assigned to other proteins?

Thanks

 

Brian Pratt

Nov 11, 2009, 7:19:40 PM
to spctools...@googlegroups.com
No worries, a corrected Mascot2XML will be in the next TPP release.
 
Brian
