Can someone please help.
I have a sed solution to the problems below but would like to rewrite
in python...
I need to strip out some data from a quirky xml file into a csv:
from something like this
< ..... cust="dick" .... product="eggs" ... quantity="12" .... >
< .... cust="tom" .... product="milk" ... quantity="2" ...>
< .... cust="harry" .... product="bread" ... quantity="1" ...>
< .... cust="tom" .... product="eggs" ... quantity="6" ...>
< ..... cust="dick" .... product="eggs" ... quantity="6" .... >
to this
dick,eggs,12
tom,milk,2
harry,bread,1
tom,eggs,6
dick,eggs,6
I am new to python and xml and it would be great to see some slick
ways of achieving the above by using python's XML capabilities to
parse the original file or python's regex to achive what I did using
sed.
Thanks for any constructive help given.
Hal
It's not clear how your XML actually looks, but (especially) if those
are all attributes of one element I probably would just use xml.sax
I strongly suggest to not use regex to parse XML.
--
John Bokma j3b
Hacking & Hiking in Mexico - http://johnbokma.com/
http://castleamber.com/ - Perl & Python Development
Here is a sample XML file (I named it data.xml):
--------------------------
<orders>
<order customer="john" product="eggs" quantity="12" />
<order customer="cindy" product="bread" quantity="1" />
<order customer="larry" product="tea bags" quantity="100" />
<order customer="john" product="butter" quantity="1" />
<order product="chicken" quantity="2" customer="derek" />
</orders>
--------------------------
Code:
--------------------------
import csv
import xml.sax
# Handle the XML file with the following structure:
# <orders>
# <order attributes... /> ...
# </orders>
class OrdersHandler(xml.sax.handler.ContentHandler):
def __init__(self, csvfile):
# Open a csv file for output
self.csvWriter = csv.writer(open(csvfile, 'w'))
def startElement(self, name, attributes):
# Only process the <order ... > element
if name == 'order':
# Construct a sorted list of attribute names in order to
# guarantee rows are written in the same order. We assume
# the XML elements contain the same attributes
attributeNames = attributes.getNames()
attributeNames.sort()
# Construct a row and write it to the csv file
row = []
for name in attributeNames:
row.append(attributes.getValue(name))
self.csvWriter.writerow(row)
def endDocument(self):
# Destroy the csv writer object to close the file
self.csvWriter = None
# Main
datafile = 'data.xml'
csvfile = 'data.csv'
ordersHandler = OrdersHandler(csvfile)
xml.sax.parse(datafile, ordersHandler)
--------------------------
To solve your problem, it is easier to use SAX than DOM. Basically,
use SAX to scan the XML file, if you encounter the element you like
(in this case <order ...>) then you process its attributes. In this
case, you sort the attributes, then write to a csv file.
--------------------------
References:
SAX Parser:
http://docs.python.org/library/xml.sax.html
SAX Content Handler:
http://docs.python.org/library/xml.sax.handler.html
Attributes Object:
http://docs.python.org/library/xml.sax.reader.html#attributes-objects
Note that sed (or any other line based or text based tool) is not a
sensible way to handle XML. If you want to read XML, use an XML parser.
They are designed to do exactly what you want in a standard compliant way,
and they can deal with all sorts of XML formatting and encoding, for example.
> I need to strip out some data from a quirky xml file into a csv:
>
> from something like this
>
> < ..... cust="dick" .... product="eggs" ... quantity="12" .... >
> < .... cust="tom" .... product="milk" ... quantity="2" ...>
> < .... cust="harry" .... product="bread" ... quantity="1" ...>
> < .... cust="tom" .... product="eggs" ... quantity="6" ...>
> < ..... cust="dick" .... product="eggs" ... quantity="6" .... >
As others have noted, this doesn't tell much about your XML. A more
complete example would be helpful.
> to this
>
> dick,eggs,12
> tom,milk,2
> harry,bread,1
> tom,eggs,6
> dick,eggs,6
>
> I am new to python and xml and it would be great to see some slick
> ways of achieving the above by using python's XML capabilities to
> parse the original file or python's regex to achive what I did using
> sed.
It's funny how often people still think that SAX is a good way to solve XML
problems. Here's an untested solution that uses xml.etree.ElementTree:
from xml.etree import ElementTree as ET
csv_field_order = ['cust', 'product', 'quantity']
clean_up_used_elements = None
for event, element in ET.iterparse("thefile.xml", events=['start']):
# you may want to select a specific element.tag here
# format and print the CSV line to the standard output
print(','.join(element.attrib.get(title, '')
for title in csv_field_order))
# safe some memory (in case the XML file is very large)
if clean_up_used_elements is None:
# this assigns the clear() method of the root (first) element
clean_up_used_elements = element.clear
clean_up_used_elements()
You can strip everything dealing with 'clean_up_used_elements' (basically
the last section) if your XML file is small enough to fit into memory (a
couple of MB is usually fine).
Stefan
Note that the shorthand for the above is
$ xsltproc test.xsl test.xml > foo
Stefan
This solution is so beautiful and elegant. Thank you. Now I am off to
learn ElementTree.
By the way, Stefan, I am using Python 2.6. Do you know the differences
between ElementTree and cElementTree?
Use cElementTree, it's implemented in C and a lot faster and more memory
friendly.
http://effbot.org/zone/celementtree.htm#benchmarks
http://codespeak.net/lxml/performance.html
Stefan
Thanks for the responses so far, most enlightening.
Stefan, I was happy to see such concise code.
Your python worked with only very minor modifications.
Hai's test xml data *without* the first and last line is close enough
to the data I am using:
<order customer="john" product="eggs" quantity="12" />
<order customer="cindy" product="bread" quantity="1" />
<order customer="larry" product="tea bags" quantity="100" />
<order customer="john" product="butter" quantity="1" />
<order product="chicken" quantity="2" customer="derek" />
... quirky.
I get a large file given to me in this format. I believe it is
created by something like:
grep 'customer=' *.xml, where there are a large number of xml files.
I had to edit the data to include the first and last lines, <orders>
and </orders>,
to get the python code to work. It's not an arduous task(!), but can
you recommend a way to get it to work without
manually editing the data?
One other thing, what's the Roland Mueller post above about (I'm
viewing htis in google groups)? What would the test.xsl file look
like?
Thanks again
Hal.
Try to get this fixed at the source. Exporting non-XML that looks like XML
is not a good idea in general, and it means that everyone who wants to read
the data has to adapt, instead of fixing the source once and for all.
> I had to edit the data to include the first and last lines, <orders>
> and </orders>,
> to get the python code to work. It's not an arduous task(!), but can
> you recommend a way to get it to work without
> manually editing the data?
Iff this cannot be fixed at the source, you can write a file-like wrapper
around a file that simply returns the boundary tags before and after
reading from the file itself. All you need is a .read(n) method, see the
documentation of the file type.
> One other thing, what's the Roland Mueller post above about (I'm
> viewing htis in google groups)? What would the test.xsl file look
> like?
This is the XSLT script he posted:
============================
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
version="1.0">
<!-- text output because we want to have an CSV file -->
<xsl:output method="text"/>
<!-- remove all whitespace coming with input XML -->
<xsl:strip-space elements="*"/>
<!-- matches any <order> element and extracts the customer,product&quantity
attributes -->
<xsl:template match="order">
<xsl:value-of select="@customer"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="@product"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="@quantity"/>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
============================
Stefan