tpp2mzid: error 4: Syntax error parsing XML


Malcolm Cook

Jun 10, 2022, 10:35:48 AM
to spctools-discuss

I am getting an error with tpp2mzid parsing ProteinProphet output (here is the full file).

Reviewing the "ISB/SPC Trans Proteomic Pipeline :: Jobs" logs, I see that our ProteinProphet run completed successfully:

Finished. Results written to: /data/tpp/tests/QuickYeastUPS1/interact.prot.xml

However, trying to process it with tpp2mzid gives an error:

Reading: /data/tpp/tests/QuickYeastUPS1/interact.prot.xml
/data/tpp/tests/QuickYeastUPS1/interact.prot.xml(148) : error 4: Syntax error parsing XML.
Failed to read protXML: /data/tpp/tests/QuickYeastUPS1/interact.prot.xml

Manually inspecting this file, it looks like a well-formed big ball of XML to my eyes:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="http://hd1991356yb/tests/QuickYeastUPS1/interact.prot.xsl"?>
<protein_summary xmlns="http://regis-web.systemsbiology.net/protXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://sashimi.sourceforge.net/schema_revision/protXML/protXML_v9.xsd" summary_xml="/data/tpp/tests/QuickYeastUPS1/interact.prot.xml">
<protein_summary_header reference_database="/data/tpp/tests/QuickYeastUPS1/Yeast_UPS_cRAP.fasta" residue_substitution_list="I -> L" source_files="/data/tpp/tests/QuickYeastUPS1/interact.ipro.pep.xml" source_files_alt="/data/tpp/tests/QuickYeastUPS1/interact.ipro.pep.xml" min_peptide_probability="0.20" min_peptide_weight="0.50" num_predicted_correct_prots="98.6" num_input_1_spectra="0" num_input_2_spectra="348" num_input_3_spectra="70" num_input_4_spectra="0" num_input_5_spectra="0" initial_min_peptide_prob="0.05" total_no_spectrum_ids="357.6" sample_enzyme="trypsin">
<program_details analysis="proteinprophet" time="2022-06-09T09:15:57" version=" Insilicos_LabKey_C++ (TPP v6.1.0 Parhelion, Build 202206081159-exported (Linux-x86_64))">
<proteinprophet_details occam_flag="Y" groups_flag="Y" degen_flag="Y" nsp_flag="Y" fpkm_flag="N" initial_peptide_wt_iters="2" nsp_distribution_iters="3" final_peptide_wt_iters="0" run_options="IPROPHET">
<nsp_information neighboring_bin_smoothing="Y">
<yadda_yadda_yadda>
...
</yadda_yadda_yadda>
</protein_group>
</protein_summary>

Trying to run tpp2mzid from the command line sheds little light:

/usr/local/tpp/bin/tpp2mzid /data/tpp/tests/QuickYeastUPS1/interact.prot.xml
tpp2mzid v0.9.11 (May 19 2022), copyright Mike Hoopmann, Institute for Systems Biology.
Built using mzIMLTools: 1.2.6    Mar 21 2022
Built using NeoPepXMLParser: 1.0.2    2022 MAY 17
Built using NeoProtXMLParser: 1.0.0    2022 MAY 19
Reading: /data/tpp/tests/QuickYeastUPS1/interact.prot.xml
/data/tpp/tests/QuickYeastUPS1/interact.prot.xml(148) : error 4: Syntax error parsing XML.
Failed to read protXML: /data/tpp/tests/QuickYeastUPS1/interact.prot.xml

How can I diagnose, debug, or fix this?

Thanks!

David Shteynberg

Jun 10, 2022, 10:50:05 AM
to spctools-discuss
Hi Malcolm,

The message is complaining about line 148 of the protXML file, which looks like this:

<annotation protein_description="40S ribosomal protein S14-B OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=RPS14B PE"/>

Apparently this protein entry in your database has a non-ASCII character in its description. Can you please share the database so I can examine this entry, or can you check it in the FASTA file? I think the problem might be somewhere upstream of the error, but I would need to do a bit more digging to confirm. Also, if you are able, please compress and share the data from the whole analysis so I can trace where the failure occurs.

Thanks!
-David
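
For anyone hitting the same error, a quick way to confirm this diagnosis is to scan the file for bytes outside the ASCII range and print where they occur. The following is only a minimal Python sketch, not part of TPP: the function names `find_non_ascii` and `report` are made up for illustration, and it assumes the file is small enough to read into memory.

```python
from pathlib import Path


def find_non_ascii(data: bytes):
    """Return (line_number, column, byte_value) for every byte above 0x7F."""
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte > 0x7F:
                hits.append((line_no, col, byte))
    return hits


def report(path: str) -> None:
    """Print one line per non-ASCII byte found in the file at `path`."""
    for line_no, col, byte in find_non_ascii(Path(path).read_bytes()):
        print(f"{path}({line_no}) col {col}: byte 0x{byte:02X}")


# Hypothetical usage, with the path taken from the thread:
# report("/data/tpp/tests/QuickYeastUPS1/interact.prot.xml")
```

Line numbers are 1-based, so a hit on line 148 would correspond to the position tpp2mzid reports. Working on raw bytes rather than decoded text means the scan still works when the file is not valid UTF-8.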


David Shteynberg

Jun 10, 2022, 1:03:08 PM
to spctools-discuss
Hello again Malcolm,

There are several of these non-ASCII characters in the protXML file that you sent. After I removed them, I was able to open the file in ProtXMLViewer. The syntax errors from tpp2mzid also went away, although I wasn't able to convert your file without the associated pepXML data. Please let us know if you are able to resolve this issue on your end.

Cheers,
-David
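
The thread does not say how the characters were removed. One possible approach, sketched here in Python under that caveat, is to drop every byte above 0x7F and then check that the result parses. The helper names `strip_non_ascii` and `is_well_formed` are hypothetical, and stripping bytes this way would also delete legitimate non-ASCII text, so it is a workaround for inspection rather than a fix for the cause.

```python
import xml.etree.ElementTree as ET


def strip_non_ascii(data: bytes) -> bytes:
    """Return a copy of `data` with every byte above 0x7F removed."""
    return bytes(b for b in data if b < 0x80)


def is_well_formed(data: bytes) -> bool:
    """Return True if `data` parses as XML with Python's built-in parser."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


# A tiny stand-in for the damaged annotation line (not the real file).
damaged = b'<annotation protein_description="PE\xa0"/>'
cleaned = strip_non_ascii(damaged)
print(cleaned.decode("ascii"))
print(is_well_formed(cleaned))
```

Writing the cleaned bytes to a new file, instead of overwriting the original, keeps the damaged output available for tracing where the characters came from.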


Malcolm Cook

Jun 10, 2022, 2:14:08 PM
to spctools-discuss
Thank you. I concur: deleting those characters resolves the issue.

My colleague Gabriel actually was running the tutorial test.  I will ask him to chime in if he did anything other than follow the steps given.

Malcolm Cook

Jun 10, 2022, 4:37:27 PM
to spctools-discuss
I discovered the issue.  I believe you have a bug.

In my hands, when the FASTA deflines in Yeast_UPS_cRAP.fasta are longer than 120 characters, they get truncated and a non-breaking space is appended.

The demo pipeline runs to completion if I first fix this problem with:

sudo -u apache bash -c 'perl -pi -e "s|^(>.{119,119})(.*)\$|\$1|"  /data/tpp/tests/QuickYeastUPS1/Yeast_UPS_cRAP.fasta'
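
For readers who prefer not to edit the FASTA file in place with a regex, the same truncation can be sketched in Python. This is an illustrative equivalent rather than anything shipped with TPP: `truncate_deflines` and `MAX_DEFLINE` are hypothetical names, and the 120-character limit simply mirrors the `>` plus 119 characters kept by the Perl pattern above.

```python
MAX_DEFLINE = 120  # '>' plus 119 characters, as in the Perl one-liner


def truncate_deflines(lines, max_len=MAX_DEFLINE):
    """Shorten FASTA deflines to `max_len` characters; leave other lines alone."""
    out = []
    for line in lines:
        if line.startswith(">"):
            body = line.rstrip("\r\n")
            ending = line[len(body):]  # preserve the original line ending
            line = body[:max_len] + ending
        out.append(line)
    return out


# Small made-up example: one long defline, one sequence line, one short defline.
sample = [">" + "A" * 200 + "\n", "MKV\n", ">short description\n"]
for line in truncate_deflines(sample):
    print(len(line.rstrip("\n")), line[:20].rstrip("\n"))
```

Only lines starting with `>` are touched, so sequence data passes through unchanged.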

Malcolm Cook

Jun 10, 2022, 4:39:13 PM
to spctools-discuss
(Note: my data location is not the one suggested in the install guide. I used `/data/tpp` rather than `/data`.)

David Shteynberg

Jun 10, 2022, 4:47:16 PM
to spctools-discuss
Which specific tutorial are you attempting so I can try to reproduce the issue?

Thanks,
-David



Malcolm Cook

Jun 10, 2022, 4:47:47 PM
to spctools-discuss

David Shteynberg

Jun 10, 2022, 9:27:16 PM
to spctools-discuss
I am unable to reproduce this error when I download and run the tutorial on my Windows or Linux (Ubuntu) machines with this tutorial data. Perhaps you can confirm that your FASTA file is not getting corrupted somehow during download or processing. Thanks.

-David  

Malcolm Cook

Jun 11, 2022, 3:32:11 AM
to spctools-discuss
Thanks for trying this in your environment.

I have determined that ProteinProphet generates these malformed files when run with LANG set to a UTF-8 locale such as en_US.UTF-8.

When it is run with LANG=C, the resulting .prot.xml files are well-formed.

I have updated my notes on Installing Trans Proteomic Pipeline (TPP) on CentOS 7 accordingly.

Thanks for TPP!

~Malcolm
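
When launching a tool from a script, the LANG=C setting described above can be applied to just that one process. The Python sketch below shows the idea with a hypothetical helper, `run_with_c_locale`. It uses a stand-in child process instead of a real ProteinProphet command line, which would need to be substituted in.

```python
import os
import subprocess
import sys


def run_with_c_locale(argv):
    """Run `argv` with LANG=C, leaving the rest of the environment untouched."""
    # Note: if LC_ALL or LC_CTYPE is set, it takes precedence over LANG.
    env = dict(os.environ, LANG="C")
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# Stand-in child process that just echoes the LANG value it was given.
result = run_with_c_locale(
    [sys.executable, "-c", "import os; print(os.environ['LANG'])"]
)
print(result.stdout.strip())
```

Setting the variable per process avoids changing the locale for the whole shell session or web server.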


Malcolm Cook

Jun 11, 2022, 3:37:09 AM
to spctools-discuss
Apologies, my notes are published at "Installing Trans Proteomic Pipeline (TPP) on CentOS 7".

tl;dr:   do this:

    sudo perl -p -i -e 's/^AddDefaultCharset/#AddDefaultCharset/' /etc/httpd/conf/httpd.conf
