stripping fields from xml file into a csv

Hal Styli

unread,

Feb 27, 2010, 3:50:38 PM2/27/10

to

Hello,

Can someone please help.
I have a sed solution to the problems below but would like to rewrite
in python...

I need to strip out some data from a quirky xml file into a csv:

from something like this

< ..... cust="dick" .... product="eggs" ... quantity="12" .... >
< .... cust="tom" .... product="milk" ... quantity="2" ...>
< .... cust="harry" .... product="bread" ... quantity="1" ...>
< .... cust="tom" .... product="eggs" ... quantity="6" ...>
< ..... cust="dick" .... product="eggs" ... quantity="6" .... >

to this

dick,eggs,12
tom,milk,2
harry,bread,1
tom,eggs,6
dick,eggs,6

I am new to python and xml and it would be great to see some slick
ways of achieving the above by using python's XML capabilities to
parse the original file or python's regex to achive what I did using
sed.

Thanks for any constructive help given.

Hal

John Bokma

unread,

Feb 27, 2010, 3:56:22 PM2/27/10

to

Hal Styli <sill...@yahoo.com> writes:

It's not clear how your XML actually looks, but (especially) if those
are all attributes of one element I probably would just use xml.sax

I strongly suggest to not use regex to parse XML.

--
John Bokma j3b

Hacking & Hiking in Mexico - http://johnbokma.com/
http://castleamber.com/ - Perl & Python Development

Zaphod

unread,

Feb 27, 2010, 8:52:15 PM2/27/10

to

Start here:

http://pyxml.sourceforge.net/topics/

Hai Vu

unread,

Feb 27, 2010, 11:15:43 PM2/27/10

to

Here is a sample XML file (I named it data.xml):
--------------------------
<orders>
<order customer="john" product="eggs" quantity="12" />
<order customer="cindy" product="bread" quantity="1" />
<order customer="larry" product="tea bags" quantity="100" />
<order customer="john" product="butter" quantity="1" />
<order product="chicken" quantity="2" customer="derek" />
</orders>
--------------------------

Code:
--------------------------
import csv
import xml.sax

# Handle the XML file with the following structure:
# <orders>
# <order attributes... /> ...
# </orders>
class OrdersHandler(xml.sax.handler.ContentHandler):
def __init__(self, csvfile):
# Open a csv file for output
self.csvWriter = csv.writer(open(csvfile, 'w'))

def startElement(self, name, attributes):
# Only process the <order ... > element
if name == 'order':
# Construct a sorted list of attribute names in order to
# guarantee rows are written in the same order. We assume
# the XML elements contain the same attributes
attributeNames = attributes.getNames()
attributeNames.sort()

# Construct a row and write it to the csv file
row = []
for name in attributeNames:
row.append(attributes.getValue(name))
self.csvWriter.writerow(row)

def endDocument(self):
# Destroy the csv writer object to close the file
self.csvWriter = None

# Main
datafile = 'data.xml'
csvfile = 'data.csv'
ordersHandler = OrdersHandler(csvfile)
xml.sax.parse(datafile, ordersHandler)
--------------------------

To solve your problem, it is easier to use SAX than DOM. Basically,
use SAX to scan the XML file, if you encounter the element you like
(in this case <order ...>) then you process its attributes. In this
case, you sort the attributes, then write to a csv file.

--------------------------

References:

SAX Parser:
http://docs.python.org/library/xml.sax.html

SAX Content Handler:
http://docs.python.org/library/xml.sax.handler.html

Attributes Object:
http://docs.python.org/library/xml.sax.reader.html#attributes-objects

Stefan Behnel

unread,

Feb 28, 2010, 3:05:11 AM2/28/10

to pytho...@python.org

Hal Styli, 27.02.2010 21:50:

> I have a sed solution to the problems below but would like to rewrite
> in python...

Note that sed (or any other line based or text based tool) is not a
sensible way to handle XML. If you want to read XML, use an XML parser.
They are designed to do exactly what you want in a standard compliant way,
and they can deal with all sorts of XML formatting and encoding, for example.

> I need to strip out some data from a quirky xml file into a csv:
>
> from something like this
>
> < ..... cust="dick" .... product="eggs" ... quantity="12" .... >
> < .... cust="tom" .... product="milk" ... quantity="2" ...>
> < .... cust="harry" .... product="bread" ... quantity="1" ...>
> < .... cust="tom" .... product="eggs" ... quantity="6" ...>
> < ..... cust="dick" .... product="eggs" ... quantity="6" .... >

As others have noted, this doesn't tell much about your XML. A more
complete example would be helpful.

> to this
>
> dick,eggs,12
> tom,milk,2
> harry,bread,1
> tom,eggs,6
> dick,eggs,6
>
> I am new to python and xml and it would be great to see some slick
> ways of achieving the above by using python's XML capabilities to
> parse the original file or python's regex to achive what I did using
> sed.

It's funny how often people still think that SAX is a good way to solve XML
problems. Here's an untested solution that uses xml.etree.ElementTree:

from xml.etree import ElementTree as ET

csv_field_order = ['cust', 'product', 'quantity']

clean_up_used_elements = None
for event, element in ET.iterparse("thefile.xml", events=['start']):
# you may want to select a specific element.tag here

# format and print the CSV line to the standard output
print(','.join(element.attrib.get(title, '')
for title in csv_field_order))

# safe some memory (in case the XML file is very large)
if clean_up_used_elements is None:
# this assigns the clear() method of the root (first) element
clean_up_used_elements = element.clear
clean_up_used_elements()

You can strip everything dealing with 'clean_up_used_elements' (basically
the last section) if your XML file is small enough to fit into memory (a
couple of MB is usually fine).

Stefan

Stefan Behnel

unread,

Feb 28, 2010, 7:15:09 AM2/28/10

to pytho...@python.org

Roland Mueller, 28.02.2010 13:01:
> The stylesheet is test.xsl and the insput data test.xml. The following
> Python code the applies the stylesheet on the input data and puts the output
> into foo.
>
> Python code:
> #!/usr/bin/python
> import sys
> import libxml2
> import libxslt
>
> styledoc = libxml2.parseFile("test.xsl")
> style = libxslt.parseStylesheetDoc(styledoc)
> doc = libxml2.parseFile("test.xml")
> result = style.applyStylesheet(doc, None)
> style.saveResultToFilename("foo", result, 0)
>
> BR,
> Roland
>
> *Example run in Linux:*
> roland@komputer:~/Desktop/XML/XSLT$ ./xslt_test.py

Note that the shorthand for the above is

$ xsltproc test.xsl test.xml > foo

Stefan

Hai Vu

unread,

Feb 28, 2010, 11:41:21 AM2/28/10

to

This solution is so beautiful and elegant. Thank you. Now I am off to
learn ElementTree.

By the way, Stefan, I am using Python 2.6. Do you know the differences
between ElementTree and cElementTree?

Stefan Behnel

unread,

Feb 28, 2010, 2:20:21 PM2/28/10

to pytho...@python.org

Hai Vu, 28.02.2010 17:41:

> By the way, Stefan, I am using Python 2.6. Do you know the differences
> between ElementTree and cElementTree?

Use cElementTree, it's implemented in C and a lot faster and more memory
friendly.

http://effbot.org/zone/celementtree.htm#benchmarks
http://codespeak.net/lxml/performance.html

Stefan

Hal Styli

unread,

Feb 28, 2010, 6:15:51 PM2/28/10

to

Thanks for the responses so far, most enlightening.

Stefan, I was happy to see such concise code.
Your python worked with only very minor modifications.

Hai's test xml data *without* the first and last line is close enough
to the data I am using:

... quirky.

I get a large file given to me in this format. I believe it is
created by something like:
grep 'customer=' *.xml, where there are a large number of xml files.

I had to edit the data to include the first and last lines, <orders>
and </orders>,
to get the python code to work. It's not an arduous task(!), but can
you recommend a way to get it to work without
manually editing the data?

One other thing, what's the Roland Mueller post above about (I'm
viewing htis in google groups)? What would the test.xsl file look
like?

Thanks again

Hal.

Stefan Behnel

unread,

Mar 1, 2010, 2:46:04 AM3/1/10

to pytho...@python.org

Hal Styli, 01.03.2010 00:15:

> Stefan, I was happy to see such concise code.
> Your python worked with only very minor modifications.
>
> Hai's test xml data *without* the first and last line is close enough
> to the data I am using:
>
> <order customer="john" product="eggs" quantity="12" />
> <order customer="cindy" product="bread" quantity="1" />
> <order customer="larry" product="tea bags" quantity="100" />
> <order customer="john" product="butter" quantity="1" />
> <order product="chicken" quantity="2" customer="derek" />
>
> ... quirky.
>
> I get a large file given to me in this format. I believe it is
> created by something like:
> grep 'customer=' *.xml, where there are a large number of xml files.

Try to get this fixed at the source. Exporting non-XML that looks like XML
is not a good idea in general, and it means that everyone who wants to read
the data has to adapt, instead of fixing the source once and for all.

> I had to edit the data to include the first and last lines, <orders>
> and </orders>,
> to get the python code to work. It's not an arduous task(!), but can
> you recommend a way to get it to work without
> manually editing the data?

Iff this cannot be fixed at the source, you can write a file-like wrapper
around a file that simply returns the boundary tags before and after
reading from the file itself. All you need is a .read(n) method, see the
documentation of the file type.

> One other thing, what's the Roland Mueller post above about (I'm
> viewing htis in google groups)? What would the test.xsl file look
> like?

This is the XSLT script he posted:

============================
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
version="1.0">

<xsl:output method="text"/>

<xsl:strip-space elements="*"/>

<xsl:template match="order">
<xsl:value-of select="@customer"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="@product"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="@quantity"/>
<xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>
============================

Stefan