Simon Michnowicz Duty Programmer Australian Proteomics Computation Facility Ludwig Institute For Cancer Research Royal Melbourne Hospital, Victoria Tel: (+61 3) 9341 3155 Fax: (+61 3) 9341 3104
Granted, this is a defect - but that's still an unfortunate choice of
characters. Even with the correction I can imagine this tripping up other
software downstream since the properly escaped XML would no longer match the
FASTA on a literal basis. I don't suppose your users could be induced to
use { and } or [ and ] or ( and ) instead of < and > ?
Brian
> On Tue, Nov 10, 2009 at 9:43 PM, Simon Michnowicz <simon.michnow...@gmail.com> wrote:
> Dear Group,
> I would like to flag a possible bug in a TPP tool. (Sorry in advance if this is the wrong forum to report bugs.)
> One of our users has reported issues with a TPP pepXML tool (he was using Mascot, so I assume he was using Mascot2XML.exe).
> Our FASTA database has protein entries with special characters in them, e.g.
> *IFN-<alpha>2*
> *&*
> *V<beta>14 *
> This generated a pepXML file that was not valid XML, as these characters were not escaped properly.
Yes, one would want to escape everything properly - happily there's a
library call for that. And certainly it's only right to emit valid XML.
But I do think that it might be wisest to sidestep the whole mess - it's
valid FASTA but also unconventional (based on many years of TPP not bumping
into this), and even converted to valid XML I suspect it may cause other
problems downstream since it no longer exactly matches the FASTA. I suspect
you're damned if you do and damned if you don't.
Brian
On Wed, Nov 11, 2009 at 11:25 AM, Matthew Chambers <
> What about the other reserved characters in XML that are valid in FASTA?
> "
> '
> &
> Not escaping could also break downstream software - especially with &
> which should always begin an escape sequence. :(
> -Matt
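
A minimal Python sketch of the escaping discussed here, covering all five reserved characters (& < > " '). It is not the TPP converter's code; the element name search_hit and attribute name protein_descr are used only for illustration, and escape_description and round_trip are hypothetical helper names. It also shows that an XML parser hands back the original FASTA text, so the mismatch is only in the raw bytes of the file.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; quotes need explicit entities
# so the result is also safe inside an attribute value.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_description(text):
    """Escape a FASTA description for use in XML text or attributes."""
    return escape(text, QUOTE_ENTITIES)


def round_trip(description):
    """Write the description into an attribute, parse it, and read it back."""
    # Element and attribute names are illustrative assumptions.
    xml_text = '<search_hit protein_descr="{}"/>'.format(
        escape_description(description)
    )
    return ET.fromstring(xml_text).get("protein_descr")


description = "IFN-<alpha>2 & V<beta>14"
print(escape_description(description))
print(round_trip(description) == description)
```

Software that reads the pepXML with a real XML parser sees the unescaped description; only tools that compare raw file text against the FASTA would notice a difference.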
Unfortunately the offending entries are present in commonly used
public DBs. We recently bumped into exactly this problem, as there are
4 entries containing <xxxx> in the IPI human v3.66 fasta file:
IPI00465120 Gene_Symbol=- 3<beta>-HSD <psi>1 protein
IPI00816409 Gene_Symbol=- V<gamma>1 protein (Fragment)
IPI00816761 Gene_Symbol=CREB1 <alpha>CREB-1 protein (Fragment)
IPI00930475 Gene_Symbol=GUSB F<lambda>8 protein (Fragment)
After hundreds of searches, a particular experiment happened to ID one
of these proteins, causing problems with the tools. In the event I
manually removed the problematic IDs as they were irrelevant for the
experiment. We already re-write IPI headers after download of the
FASTA, so will implement a substitution there if it crops up again.
Should a substitution be added to the IPI retrieval utility scripts in
the TPP distribution so that the problem doesn't show its face if
they are being used?
Interestingly, if you search on the EBI IPI site for these proteins
the < > are substituted with [ ] , but the problematic characters are
in the FASTA.
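
A rough Python sketch of the header substitution described above, assuming the same [ ] replacement that the EBI IPI site shows. It is not one of the TPP retrieval scripts; sanitize_fasta_header and sanitize_fasta are hypothetical names, and the example header is a simplified form of the first entry listed above. The one catch is that a FASTA header line itself starts with ">", so that first character has to be kept.

```python
# Map < to [ and > to ] in one pass.
BRACKETS = str.maketrans("<>", "[]")


def sanitize_fasta_header(line):
    """Replace < and > in a FASTA header line, keeping the leading '>'."""
    if not line.startswith(">"):
        return line  # sequence line: leave untouched
    return ">" + line[1:].translate(BRACKETS)


def sanitize_fasta(lines):
    """Yield FASTA lines with header descriptions sanitized."""
    for line in lines:
        yield sanitize_fasta_header(line)


example = [
    ">IPI00465120 Gene_Symbol=- 3<beta>-HSD <psi>1 protein\n",
    "MTGWSCLVTGAGGFLGQRIV\n",  # made-up sequence, for illustration only
]
for out in sanitize_fasta(example):
    print(out, end="")
```

Running this before the database is handed to the search engine means the search results and the FASTA still match literally, which is the point of doing the substitution upstream.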
Well, I'll go ahead and modify the mascot converter to emit proper XML for
proteins with reserved XML characters, but it does sound like folks would do
well to make that <> / [] substitution upstream from the search engines.
The fact that the EBI IPI site does the substitution confirms my suspicion
that a number of tools might get munged up by this.
On Wed, Nov 11, 2009 at 12:17 PM, dctrud <dct...@ccmp.ox.ac.uk> wrote:
> Should a substitution be added to the IPI retrieval utility scripts in the TPP distribution so that the problem doesn't show its face if they are being used?
Unfortunately we have no control over what goes in the FASTA
databases! Matrix Science's pepXML generation code escapes the XML:
    if ($thisScript->param($urlParams{'prot_desc'})) {
        $prot_desc = &noXmlTag(&mustGetProteinDescription($protein_list[0], \%fastaTitles));
    }
where noXmlTag() strips XML tags from strings.
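
For comparison, a hypothetical Python sketch of what tag stripping of this kind might look like. This is a guess at the general approach, not the actual noXmlTag() implementation, and strip_tags is an invented name. It makes one consequence visible: anything written between < and > is dropped, and a bare & is left as it is.

```python
import re

# Anything that looks like <...> is treated as a tag (an assumption).
TAG_PATTERN = re.compile(r"<[^<>]*>")


def strip_tags(text):
    """Remove <...> spans from a protein description."""
    return TAG_PATTERN.sub("", text)


print(strip_tags("3<beta>-HSD <psi>1 protein"))
print(strip_tags("IFN-<alpha>2"))
```

With this approach the Greek-letter part of the name is lost entirely, which is a different trade-off from escaping or from substituting [ ] for < >.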
Dear all,
May I ask a question? After analyzing peptides with TPP, I export the results to Excel. The columns list all the information shown on the web page, but I also noticed some extra information, such as “adjusted ratio mean”. What does “adjusted ratio mean” represent?
Also, for an identified protein, does the percentage of “shared of spectrum ids” refer to the spectra that also appear in other proteins?
Thanks