On Nov 9, 5:54 pm, Artie Ziff <artie.z...@gmail.com> wrote:
> Hello,
> I want to process XML-like data like this:
<snipped>
> Edits were substituting '/' for '\' on the end tags, and adding the
> following structure:
If thats all you want, you can try the following:
# obviously this should come from a file
input= """<testname=ltpacpi.sh>
<description>
ACPI (Advanced Control Power & Integration) testscript
for 2.5 kernels.
> Is there a name for the format above (perhaps xhtml)?
The only word I can think of is "broken." xml and html and xhtml all
use forward slashes.
> I'd like to find a python module that can translate it to proper xml.
> Does one exist? etree?
I think you've already figured it out. Just take your description and
turn it into Python. in other words, replace all "<\" with "</" and
perhaps " \>" with " /", although your example doesn't happen to have
any of these. Tack a xml header on, and try to parse it with etree. If you can't, then let someone manually fix it.
Or better, fix the program upstream that's creating this mess. There
isn't a reliable way to "fix" all the possible broken xml it might be
creating, without reverse engineering it.
> On Nov 9, 5:54 pm, Artie Ziff <artie.z...@gmail.com> wrote:
> # submit correctedinput to etree
I was very grateful to get the "leg up" on getting started down that right path with my coding. Many thanks to you, rusi. I took your excellent advices and have this working.
### main
# input to script is directory:
# sanitize trailing slash
testPkgDir = sys.argv[1].rstrip('/')
# Within each test package directory is doc/testcase
tcDocDir = "doc/testcases"
# set input dir, containing broken files
tcTxtDir = os.path.join(testPkgDir, tcDocDir)
# set output dir, to write proper XML files
tcXmlDir = os.path.join(testPkgDir, tcDocDir + "_XML")
if not os.path.exists(tcXmlDir):
os.makedirs(tcXmlDir)
# iterate through files in input dir
for filename in os.listdir(tcTxtDir):
# set filepaths
filepathTXT = os.path.join(tcTxtDir, filename)
base = os.path.splitext(filename)[0]
fileXML = base + ".xml"
filepathXML = os.path.join(tcXmlDir, fileXML)
# read broken file, convert to proper XML
with open(filepathTXT) as f:
c = Converter(f.read())
xmlFO = open(filepathXML, 'w') # xmlFileObject
xmlFO.write(c.dataXML)
xmlFO.close()
###
Writing XML files so to see whats happening. My plan is to
keep xml data in memory and parse with xml.etree.ElementTree.
Unfortunately, xml parsing fails due to angle brackets inside
description tags. In particular, xml.etree.ElementTree.parse()
aborts on '<' inside xml data such as the following:
<testname name="cron_test.sh">
<description>
This testcase tests if crontab <filename> installs the cronjob
and cron schedules the job correctly.
<\description>
##
What is right way to handle the extra angle brackets?
Substitute on line-by-line basis, if that works?
Or learn to write a simple stack-style parser, or
recursive descent, it may be called?
I am open to comments to improve my code more to be more readable,
pythonic, or better.
> Unfortunately, xml parsing fails due to angle brackets inside
> description tags. In particular, xml.etree.ElementTree.parse()
> aborts on '<' inside xml data such as the following:
> <testname name="cron_test.sh">
> <description>
> This testcase tests if crontab <filename> installs the cronjob
> and cron schedules the job correctly.
> <\description>
> ##
> What is right way to handle the extra angle brackets?
> Substitute on line-by-line basis, if that works?
> Or learn to write a simple stack-style parser, or
> recursive descent, it may be called?
> I am open to comments to improve my code more to be more readable,
> pythonic, or better.
> On 11/9/12 5:50 AM, rusi wrote:
> > On Nov 9, 5:54 pm, Artie Ziff <artie.z...@gmail.com> wrote:
> > # submit correctedinput to etree
> I was very grateful to get the "leg up" on getting started down that
> right path with my coding. Many thanks to you, rusi. I took your
> excellent advices and have this working.
> ### main
> # input to script is directory:
> # sanitize trailing slash
> testPkgDir = sys.argv[1].rstrip('/')
> # Within each test package directory is doc/testcase
> tcDocDir = "doc/testcases"
> # set input dir, containing broken files
> tcTxtDir = os.path.join(testPkgDir, tcDocDir)
> # set output dir, to write proper XML files
> tcXmlDir = os.path.join(testPkgDir, tcDocDir + "_XML")
> if not os.path.exists(tcXmlDir):
> os.makedirs(tcXmlDir)
> # iterate through files in input dir
> for filename in os.listdir(tcTxtDir):
> # set filepaths
> filepathTXT = os.path.join(tcTxtDir, filename)
> base = os.path.splitext(filename)[0]
> fileXML = base + ".xml"
> filepathXML = os.path.join(tcXmlDir, fileXML)
> # read broken file, convert to proper XML
> with open(filepathTXT) as f:
> c = Converter(f.read())
> xmlFO = open(filepathXML, 'w') # xmlFileObject
> xmlFO.write(c.dataXML)
> xmlFO.close()
> ###
> Writing XML files so to see whats happening. My plan is to
> keep xml data in memory and parse with xml.etree.ElementTree.
> Unfortunately, xml parsing fails due to angle brackets inside
> description tags. In particular, xml.etree.ElementTree.parse()
> aborts on '<' inside xml data such as the following:
> <testname name="cron_test.sh">
> <description>
> This testcase tests if crontab <filename> installs the cronjob
> and cron schedules the job correctly.
> <\description>
> ##
> What is right way to handle the extra angle brackets?
> Substitute on line-by-line basis, if that works?
> Or learn to write a simple stack-style parser, or
> recursive descent, it may be called?
This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.
> Artie Ziff wrote:
>> Writing XML files so to see whats happening. My plan is to
>> keep xml data in memory and parse with xml.etree.ElementTree.
>> Unfortunately, xml parsing fails due to angle brackets inside
>> description tags. In particular, xml.etree.ElementTree.parse()
>> aborts on '<' inside xml data such as the following:
>> <testname name="cron_test.sh">
>> <description>
>> This testcase tests if crontab <filename> installs the cronjob
>> and cron schedules the job correctly.
>> <\description>
>> ##
>> What is right way to handle the extra angle brackets?
>> Substitute on line-by-line basis, if that works?
>> Or learn to write a simple stack-style parser, or
>> recursive descent, it may be called?
Ah, don't bother with CDATA. Just make sure the data gets properly escaped,
any XML serialiser will do that for you. Just generate the XML using
ElementTree and you'll be fine. Generating XML as literal text is not a
good idea.