Hi, I've got lots of XHTML pages that need to be fed to MS HTML Help Workshop to create CHM files. That application really hates XHTML, so I need to convert self-closing tags (e.g. <br />) to plain HTML (e.g. <br>).
Seems simple enough, but I'm having some trouble with it. Regexps trip up because I also have to take into account 'img', 'meta', and 'link' tags, not just the simple 'br' and 'hr' tags. Maybe there's a simple way to do that with regexps, but my simple-minded <img[^(/>)]+/> doesn't work, and I'm not enough of a regexp pro to figure out that lookahead stuff.
I'm not sure where to start now; I looked at BeautifulSoup and BeautifulStoneSoup, but I can't see how to modify the actual tag.
I don't know whether you can find an application that does what you want, but at the very least I can say this much.
You should not be reading and parsing the text yourself! XHTML is valid XML, and there are lots of ways to read and parse XML with Python. (ElementTree is what I use, but other choices exist.) Once you use an existing package to read your files into an internal tree structure, it should be a relatively easy job to traverse the tree and emit the tags and text you want.
Gary Herron
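A minimal sketch of the parse-then-emit idea, assuming a current Python 3 and its standard xml.etree.ElementTree module. The function name xhtml_to_html and the sample markup are made up for illustration. Two caveats: ElementTree discards comments by default, and it rejects named entities such as &nbsp; unless they are declared, so real pages may need more care than this.

```python
import xml.etree.ElementTree as ET


def xhtml_to_html(xhtml_text):
    """Parse well-formed XHTML and serialise it with HTML-style empty tags."""
    root = ET.fromstring(xhtml_text)
    # Strip the namespace part ("{uri}tag" -> "tag"); otherwise the
    # serialiser writes prefixed names such as "html:br".
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    # method="html" writes <br> and <hr> without a closing slash.
    return ET.tostring(root, encoding="unicode", method="html")


# Hypothetical sample document.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p>one<br />two</p><hr /></body></html>'
)
print(xhtml_to_html(sample))
```

Stripping the namespace is a design choice for this sketch: the target here is plain HTML, where element names carry no prefix.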
Hi, I'm not sure if this is very helpful but the following works on the very simple example below.
I agree, and I'd really rather not parse it myself. However, ET cleans up the file as it reads it, and in my case that drops some comments that are required as metadata, so that won't work as is. Oh, I could get ET to read it and write a new parser--I see what you mean. I think I'd need to subclass so that ET honors those comments too. That's one way to go; I was just hoping for something easier.
thanks,
--Tim
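On the comment problem: in Python 3.8 and later, the standard library's TreeBuilder accepts an insert_comments flag, so subclassing may not be needed there. This is a sketch under that assumption, not something available when this thread was written. The helper name and the "topic-id" comment are hypothetical, and comments that sit outside the root element may still be lost.

```python
import xml.etree.ElementTree as ET


def parse_keeping_comments(xml_text):
    """Parse XML so that comments inside the root element stay in the tree."""
    # insert_comments=True requires Python 3.8 or later.
    builder = ET.TreeBuilder(insert_comments=True)
    parser = ET.XMLParser(target=builder)
    return ET.fromstring(xml_text, parser=parser)


# Hypothetical metadata comment, standing in for the real ones.
root = parse_keeping_comments(
    "<div><!-- topic-id: 42 --><p>text<br /></p></div>"
)
output = ET.tostring(root, encoding="unicode", method="html")
print(output)
```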
Thanks for that. It is helpful--I guess I had a brain malfunction. I'm pretty sure your example will work for me, except in cases where the IMG alt text contains a greater-than sign (>). I'm not sure that's even possible, so maybe this will do the job.
thanks,
--Tim
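The code from the earlier reply is not reproduced in this copy of the thread, so the following is only an illustrative sketch of a regexp approach, not that reply's code. The original attempt, <img[^(/>)]+/>, fails because parentheses are literal inside square brackets: the class excludes the four characters (, /, > and ) one by one, so a slash in a src path is enough to stop the match. The sketch below excludes only > instead. XML does allow a literal > inside an attribute value, and the second call shows that such a tag is left unconverted.

```python
import re

# Tag names are limited to the empty elements mentioned in this thread.
SELF_CLOSED = re.compile(r"<(br|hr|img|meta|link)\b([^>]*?)\s*/>")


def unclose_tags(text):
    """Rewrite <br /> as <br>, <img ... /> as <img ...>, and so on."""
    return SELF_CLOSED.sub(r"<\1\2>", text)


print(unclose_tags('<p>one<br />two<img src="pics/a.png" alt="x" /></p>'))
# A literal ">" inside an attribute value defeats the pattern,
# so this tag comes back unchanged:
print(unclose_tags('<img src="a.png" alt="a > b" />'))
```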
> -----Original Message-----
> From: python-list-bounces+jkrukoff=ltgc....@python.org [mailto:python-list-bounces+jkrukoff=ltgc....@python.org] On Behalf Of Tim Arnold
> Sent: Thursday, April 24, 2008 9:34 AM
> To: python-l...@python.org
> Subject: convert xhtml back to html
One method that wouldn't require much Python code would be to run the XHTML through a simple identity XSL transform with the output method set to HTML. The benefit is that you wouldn't have to worry about any of the specifics of the transformation, though you would need an external dependency.
As far as I know, both 4suite and lxml (my personal favorite: http://codespeak.net/lxml/) support XSLT in Python.
It might work out fine for you, but mixing regexps and XML always seems to work out badly in the end for me.
---------
John Krukoff
jkruk...@ltgc.com
You could filter the XHTML through mxTidy and set the hide_endtags to 1:
I'll second the recommendation to use XSLT with the output method set to html.
The code for an XSLT stylesheet to do it would be basically:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html" />
  <xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
</xsl:stylesheet>
You would probably want to do other stuff than just copy it out, but that's another case.
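For anyone wanting to run that stylesheet from Python, here is a rough sketch that assumes lxml is installed. The sample input is hypothetical and deliberately carries no namespace declaration; how the html output method treats elements in the XHTML namespace is worth checking on real pages before relying on this.

```python
from lxml import etree

# Identity stylesheet from the post above; only the output method matters.
STYLESHEET = b"""\
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html"/>
  <xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
</xsl:stylesheet>
"""

transform = etree.XSLT(etree.XML(STYLESHEET))

# Hypothetical sample input without a namespace declaration.
doc = etree.XML(b"<html><body><p>one<br />two</p><hr /></body></html>")

# str() serialises the result according to the stylesheet's xsl:output.
html_text = str(transform(doc))
print(html_text)
```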
Also, from my recollection, the way to make XHTML br elements behave correctly in CHM was to write <br /> as opposed to <br/>. At any rate, I've done projects generating CHM, and my output markup was well-formed XML on all occasions.
Cheers,
Bryan Rasmussen
This should do the job in lxml 2.x:
from lxml import etree

tree = etree.parse("thefile.xhtml")
tree.write("thefile.html", method="html")
On Thu, Apr 24, 2008 at 9:55 PM, Stefan Behnel <stefan...@behnel.de> wrote:
> This should do the job in lxml 2.x:
>
> from lxml import etree
>
> tree = etree.parse("thefile.xhtml")
> tree.write("thefile.html", method="html")
But the exact numbers depend on your data. lxml holds the XML tree in memory, which is a lot bigger than the serialised data. So, for example, if you have 2GB of RAM and want to parse a serialised 1GB XML file full of little one-element integers into an in-memory tree, get prepared for lunch. With a lot of long text string content instead, it might still fit.
However, lxml also has a couple of step-by-step and stream parsing APIs:
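The specific APIs are not listed in this copy of the thread. As a rough illustration of the stream-parsing idea, here is a sketch using iterparse from the standard library's xml.etree.ElementTree as a stand-in (lxml.etree provides an iterparse function with a similar interface). The function name, the "value" tag, and the sample data are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def sum_values(source, tag="value"):
    """Add up the integer text of every <value> element, one at a time."""
    total = 0
    # "end" events fire once an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            total += int(elem.text)
            # Discard the element's contents once used; the emptied
            # shell stays attached to its parent.
            elem.clear()
    return total


# Hypothetical in-memory document standing in for a large file.
data = io.BytesIO(b"<root><value>1</value><value>2</value><value>3</value></root>")
total = sum_values(data)
print(total)
```

Processing each element as soon as it is complete keeps the working set small compared with building the whole tree first.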
If you are operating with huge XML files (say, larger than available RAM) repeatedly, an XML database may also be a good option.
My current favorite in this realm is Sedna (free, Apache 2.0 license). Among other features, it has facilities for indexing within documents and collections (faster queries) and transactional sub-document updates (safely modify parts of a document without rewriting the entire document). I have been working on a Python interface to it recently (zif.sedna, on PyPI).
Regarding RAM consumption, a Sedna database uses approximately 100 MB of RAM by default, and that does not change much, no matter how much (or how little) data is actually stored.
For a quick idea of Sedna's capabilities, the Sedna folks have put up an on-line demo serving and xquerying an extract from Wikipedia (in the range of 20 GB of data) using a Sedna server, at http://wikidb.dyndns.org/ . Along with the on-line demo, they provide instructions for deploying the technology locally.
Thanks Bryan, Walter, John, Marc, and Stefan. I finally went with the XSLT transform, which works very well and is simple. Regexps would work, but they just scare me somehow.
Bryan, my tags were formatted as <br />, but the help compiler issued a warning on each one, resulting in log files with thousands of warnings. The compile did finish, but the warnings made the logs too painful to read.
Stefan, I *really* look forward to being able to use lxml when I move to RH Linux next month. I've been using hp10.20 and never could get the requisite libraries to compile. Once I make that move, maybe I won't have as many markup-related questions here!
Thanks again to all for the great suggestions.
--Tim Arnold