Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

python and parsing an xml file

93 views
Skip to first unread message

Matt Funk

unread,
Feb 21, 2011, 12:30:29 PM2/21/11
to pytho...@python.org
Hi,
I was wondering if someone had some advice:
I want to create a set of xml input files to my code that look as follows:
<?xml version="1.0" encoding="UTF-8"?>

<!-- Settings for the algorithm to be performed
-->
<Algorithm>

<!-- The algorithm type.
-->
<!-- The supported options are:
-->
<!-- - Alg0
-->
<!-- - Alg1
-->
<Type>Alg1</Type>

<!-- the location/path of the input file for this algorithm
-->
<path>./Alg1.in</path>

</Algorithm>


<!-- Relevant information during the processing will be written to a
logfile -->
<Logfile>

<!-- the location/path of the logfile (i.e. where to put the
logfile) -->
<path>c:\tmp</path>

<!-- verbosity level (i.e. how much to print)
-->
<!-- The supported options are:
-->
<!-- - 0 (nothing printed)
-->
<!-- - 1 (print on error)
-->
<verbosity>1</verbosity>

</Logfile>


So there are comments, whitespace etc ... in it.
I would like to be able to put everything into some sort of structure
such that i can access it as:
structure['Algorithm']['Type'] == Alg1
I was wondering if there is something out there that does this.
I found and tried a few things:
1) http://code.activestate.com/recipes/534109-xml-to-python-data-structure/
It simply doesn't work. I get the following error:
raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:2: not well-formed
(invalid token)
But i removed everything from the file except: <?xml version="1.0"
encoding="UTF-8"?>
and i still got the error.

Anyway, i looked at ElementTree, but that error out with:
xml.parsers.expat.ExpatError: junk after document element: line 19, column 0


Anyway, if anyone can give me advice of point me somewhere i'd greatly
appreciate it.

thanks
matt

Stefan Behnel

unread,
Feb 21, 2011, 12:43:16 PM2/21/11
to pytho...@python.org
Matt Funk, 21.02.2011 18:30:

That's not XML. XML documents have exactly one root element, i.e. you need
an enclosing element around these two tags.


> So there are comments, whitespace etc ... in it.
> I would like to be able to put everything into some sort of structure

Including the comments or without them? Note that ElementTree will ignore
comments.


> such that i can access it as:
> structure['Algorithm']['Type'] == Alg1

Have a look at lxml.objectify. It allows you to write

alg_type = root.Algorithm.Type.text

and a couple of other niceties.

http://lxml.de/objectify.html


> I was wondering if there is something out there that does this.
> I found and tried a few things:
> 1) http://code.activestate.com/recipes/534109-xml-to-python-data-structure/
> It simply doesn't work. I get the following error:
> raise exception
> xml.sax._exceptions.SAXParseException:<unknown>:1:2: not well-formed
> (invalid token)

"not well formed" == "not XML".


> But i removed everything from the file except:<?xml version="1.0"
> encoding="UTF-8"?>
> and i still got the error.

That's not XML, either.


> Anyway, i looked at ElementTree, but that error out with:
> xml.parsers.expat.ExpatError: junk after document element: line 19, column 0

In any case, ElementTree is preferable over a SAX based solution, both for
performance and maintainability reasons.

Stefan

Terry Reedy

unread,
Feb 21, 2011, 1:22:11 PM2/21/11
to pytho...@python.org
On 2/21/2011 12:30 PM, Matt Funk wrote:
> Hi,
> I was wondering if someone had some advice:
> I want to create a set of xml input files to my code that look as follows:

Why?

...


> So there are comments, whitespace etc ... in it.
> I would like to be able to put everything into some sort of structure

> such that i can access it as:
> structure['Algorithm']['Type'] == Alg1

Unless I have a very good reason otherwise, I would just put everything
in nested dicts in a .py file to begin with. Or if I needed cross
language portability, a JSON file, which is close to the same thing.

--
Terry Jan Reedy

Stefan Behnel

unread,
Feb 21, 2011, 2:03:49 PM2/21/11
to pytho...@python.org
Terry Reedy, 21.02.2011 19:22:

Hmm, right, good call. There's also the configparser module in the stdlib
which may provide a suitable format here.

Stefan

Matt Funk

unread,
Feb 21, 2011, 5:08:07 PM2/21/11
to pytho...@python.org
Hi Terry,

On 2/21/2011 11:22 AM, Terry Reedy wrote:
> On 2/21/2011 12:30 PM, Matt Funk wrote:
>> Hi,
>> I was wondering if someone had some advice:
>> I want to create a set of xml input files to my code that look as
>> follows:
>
> Why?

mmmh. not sure how to answer this question exactly. I guess it's a
design decision. I am not saying that it is best one, but it seemed
suitable to me. I am certainly open to suggestions. But here are some
requirements:
1) My boss needs to be able to read the input and make sense out of it.
XML seems fairly self explanatory, at least when you choose suitable
names for the properties/tags etc ...
2) I want reproducability of a given run without changes to the code.
I.e. all the inputs need to be stored external to the code such that the
state of the run is captured from the input files entirely.

>
>
> ...
>> So there are comments, whitespace etc ... in it.
>> I would like to be able to put everything into some sort of structure
>> such that i can access it as:
>> structure['Algorithm']['Type'] == Alg1
>
> Unless I have a very good reason otherwise, I would just put
> everything in nested dicts in a .py file to begin with.

That is certainly a possibility and it should work. However, eventually
the input files might be written by a webinterface (likely php). Even
though what you are suggesting is still possible it just feels a little
awkward (i.e. to use php to write a python file).


> Or if I needed cross language portability, a JSON file, which is close
> to the same thing.
>

again, i am certainly not infallible and open to suggestions of there is
a better solution.


thanks
matt

Matt Funk

unread,
Feb 21, 2011, 5:07:52 PM2/21/11
to pytho...@python.org
HI Stefan,
thank you for your advice.
I am running into an issue though (which is likely a newbie problem):

My xml file looks like (which i got from the internet):
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
</catalog>

Then i try to access it as:

from lxml import etree
from lxml import objectify

inputfile="../input/books.xml"
parser = etree.XMLParser(ns_clean=True)
parser = etree.XMLParser(remove_comments=True)
root = objectify.parse(inputfile,parser)

I try to access elements by (for example):
print root.catalog.book.author.text

But it errors out with:
AttributeError: 'lxml.etree._ElementTree' object has no attribute 'catalog'

So i guess i don't understand why this is.
Also, how can i print all the attributes of the root for examples or
obtain keys?

I really appreciate your help
thanks
matt


On 2/21/2011 10:43 AM, Stefan Behnel wrote:
> Matt Funk, 21.02.2011 18:30:

>> I want to create a set of xml input files to my code that look as
>> follows:

>> So there are comments, whitespace etc ... in it.
>> I would like to be able to put everything into some sort of structure
>

> Including the comments or without them? Note that ElementTree will
> ignore comments.
>
>

>> such that i can access it as:
>> structure['Algorithm']['Type'] == Alg1
>

Stefan Behnel

unread,
Feb 21, 2011, 5:28:23 PM2/21/11
to pytho...@python.org
Matt Funk, 21.02.2011 23:07:

> thank you for your advice.
> I am running into an issue though (which is likely a newbie problem):
>
> My xml file looks like (which i got from the internet):
> <?xml version="1.0"?>
> <catalog>
> <book id="bk101">
> <author>Gambardella, Matthew</author>
> <title>XML Developer's Guide</title>
> <genre>Computer</genre>
> <price>44.95</price>
> <publish_date>2000-10-01</publish_date>
> <description>An in-depth look at creating applications
> with XML.</description>
> </book>
> </catalog>
>
> Then i try to access it as:
>
> from lxml import etree
> from lxml import objectify
>
> inputfile="../input/books.xml"
> parser = etree.XMLParser(ns_clean=True)
> parser = etree.XMLParser(remove_comments=True)

Change that to

parser = objectify.makeparser(ns_clean=True, remove_comments=True)

> root = objectify.parse(inputfile,parser)

Change that to

root = objectify.parse(inputfile,parser).getroot()

Stefan

Matt Funk

unread,
Feb 21, 2011, 5:40:59 PM2/21/11
to pytho...@python.org
Hi Stefan,
i don't mean to be annoying so sorry if i am.
According to your instructions i do:

parser = objectify.makeparser(ns_clean=True, remove_comments=True)
root = objectify.parse(inputfile,parser).getroot()
print root.catalog.book.author.text

which still gives the following error:
AttributeError: no such child: catalog

matt

Stefan Behnel

unread,
Feb 22, 2011, 2:04:53 AM2/22/11
to pytho...@python.org
Matt Funk, 21.02.2011 23:40:
> According to your instructions i do:
>
> parser = objectify.makeparser(ns_clean=True, remove_comments=True)
> root = objectify.parse(inputfile,parser).getroot()
> print root.catalog.book.author.text

root == <catalog>

Try "print root.tag", and read the tutorial.

Stefan

Paul Anton Letnes

unread,
Feb 22, 2011, 4:03:58 AM2/22/11
to
Den 21.02.11 18.30, skrev Matt Funk:

How about skipping the whole xml thing? You can dynamically import any
python module, even if it does not have a python filename. I show an
example from my own code, slightly modified. I just hand the function a
filename, and it tries to import the file. If the input file now
contains variables like
algorithm = 'fast'
I can access the variables with
input = getinput('f.txt')
print input.algorithm.

Good luck,
Paul

+++++++++++++

import imp
def getinput(inputfilename):
"""Parse inputs to program from the given file."""
try:
# http://docs.python.org/library/imp.html#imp.load_source
parameters = imp.load_source("parameters", inputfilename)
except IOError as err:
print >>sys.stderr, '%s: %s' % (str(err), inputfilename)
print >>sys.stderr, 'The specified input file was not found -
exiting'
sys.exit(IO_ERROR)
# Verify presence of all required input parameters, see input.py
example file.
required_parameter_names = ['setting1', 'setting2', 'setting3']
msg = 'Required parameter name not found in input file: {0}'
for parameter_name in required_parameter_names:
assert hasattr(parameters, parameter_name),
msg.format(parameter_name)
return parameters


Stefan Behnel

unread,
Feb 22, 2011, 4:13:00 AM2/22/11
to pytho...@python.org
Matt Funk, 21.02.2011 23:08:

> On 2/21/2011 11:22 AM, Terry Reedy wrote:
>> On 2/21/2011 12:30 PM, Matt Funk wrote:
>>> Hi,
>>> I was wondering if someone had some advice:
>>> I want to create a set of xml input files to my code that look as
>>> follows:
>>
>> Why?
>>
> mmmh. not sure how to answer this question exactly. I guess it's a
> design decision. I am not saying that it is best one, but it seemed
> suitable to me. I am certainly open to suggestions. But here are some
> requirements:
> 1) My boss needs to be able to read the input and make sense out of it.
> XML seems fairly self explanatory, at least when you choose suitable
> names for the properties/tags etc ...
> 2) I want reproducability of a given run without changes to the code.
> I.e. all the inputs need to be stored external to the code such that the
> state of the run is captured from the input files entirely.

That sounds like ConfigParser actually suits your needs.

Stefan

Ian

unread,
Feb 22, 2011, 6:01:42 AM2/22/11
to pytho...@python.org
On 21/02/2011 22:08, Matt Funk wrote:
>> Why?
> mmmh. not sure how to answer this question exactly. I guess it's a
> design decision. I am not saying that it is best one, but it seemed
> suitable to me. I am certainly open to suggestions. But here are some
> requirements:
> 1) My boss needs to be able to read the input and make sense out of it.
> XML seems fairly self explanatory, at least when you choose suitable
> names for the properties/tags etc ...
> 2) I want reproducability of a given run without changes to the code.
> I.e. all the inputs need to be stored external to the code such that the
> state of the run is captured from the input files entirely.
>
>
Hi Mark,

Having tried XML for something similar, I would strongly advise against
it. It has been nothing but a nightmare.

XML is acceptable for machine to machine communication where the two
sides cannot agree a common
language in advance or they can't coordinate format changes. Even then
it is slow and verbose.

Use the config module if the configuration is simple to moderately complex.

Consider JSON or Python (source) if your requirements are really
complicated.

Regards

Ian

pyt...@bdurham.com

unread,
Feb 22, 2011, 7:29:04 AM2/22/11
to Paul Anton Letnes, pytho...@python.org
Paul,

> How about skipping the whole xml thing? You can dynamically import any python module, even if it does not have a python filename.

Great example!

Can you do the same with a cStringIO based file that exists in memory
vs. on disk? Your example requires a physical file on disk. Is it
possible to accomplish the same using a string vs. a file or temporary
file?

Thank you,
Malcolm (not the OP)

Paul Anton Letnes

unread,
Feb 22, 2011, 7:33:33 AM2/22/11
to
Den 22.02.11 13.29, skrev pyt...@bdurham.com:

Malcolm,

I never had the interest, so I did not try. Why don't you just try it?
Just insert your StringIO file into the function and see if it works, or
if it crashes and burns. Let me know what you find out!

Anyway, you could always just write a temporary file. It won't consume a
lot of resources, I imagine.

Cheers,
Paul

Matt Funk

unread,
Feb 22, 2011, 2:09:54 PM2/22/11
to pytho...@python.org
Hi,
first of all thanks everyone for the (at least to me) valuable
discussion about xml and its usage domain.
Also thanks for all the hints and suggestions.
In terms of my problems, from what i can tell right now the ConfigObj4
(see:
http://www.voidspace.org.uk/python/configobj.html#reading-a-config-file)
will suit my needs.

Again, thanks for the time and effort you put in to answer my questions
(and, in Stefan's case for writing tools and making them available to
everyone) and pointing me in the better direction.
matt

Matt Funk

unread,
Feb 22, 2011, 2:11:53 PM2/22/11
to pytho...@python.org
One thing i forgot,
in case anyone is at this point:
the reason i chose ConfigObj over ConfigParser is that it allows
subsections.

John Nagle

unread,
Feb 22, 2011, 5:56:54 PM2/22/11
to
On 2/21/2011 2:08 PM, Matt Funk wrote:
> Hi Terry,
>
> On 2/21/2011 11:22 AM, Terry Reedy wrote:
>> On 2/21/2011 12:30 PM, Matt Funk wrote:
>>> Hi,
>>> I was wondering if someone had some advice:
>>> I want to create a set of xml input files to my code that look as
>>> follows:
>>
>> Why?
> mmmh. not sure how to answer this question exactly. I guess it's a
> design decision. I am not saying that it is best one, but it seemed
> suitable to me. I am certainly open to suggestions. But here are some
> requirements:
> 1) My boss needs to be able to read the input and make sense out of it.
> XML seems fairly self explanatory, at least when you choose suitable
> names for the properties/tags etc ...

For that, you can provide an XSL stylesheet for the XML file which
will cause it to be displayed in a browser in a nice format.

John Nagle

Stefan Behnel

unread,
Feb 23, 2011, 1:48:20 AM2/23/11
to pytho...@python.org
John Nagle, 22.02.2011 23:56:

Or rather CSS, which you will likely need anyway, even if you decide to use
an XSLT before hand. CSS is for layout (headings, tables, positioning
etc.), XSLT is only needed when you require major structural changes or
format conversions, which likely won't be necessary here.

Stefan

0 new messages