convert xhtml back to html
Tim Arnold  
Newsgroups: comp.lang.python
From: "Tim Arnold" <tim.arn...@sas.com>
Date: Thu, 24 Apr 2008 11:34:11 -0400
Local: Thurs, Apr 24 2008 11:34 am
Subject: convert xhtml back to html
hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
create  CHM files. That application really hates xhtml, so I need to convert
self-ending tags (e.g. <br />) to plain html (e.g. <br>).

Seems simple enough, but I'm having some trouble with it. regexps trip up
because I also have to take into account 'img', 'meta', 'link' tags, not
just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
enough of a regexp pro to figure out that lookahead stuff.

I'm not sure where to start now; I looked at BeautifulSoup and
BeautifulStoneSoup, but I can't see how to modify the actual tag.

thanks,
--Tim Arnold
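
A small sketch of why the pattern in the question fails, using only the standard `re` module. Inside a character class the parentheses are literal, so `[^(/>)]` excludes `(`, `/`, `>` and `)` one character at a time, and any attribute value containing a slash stops the match. The sample tags are made up for illustration.

```python
import re

# The pattern from the question. Inside [...] the parentheses are literal
# characters, so the class rejects "(", "/", ">" and ")" individually.
original = re.compile(r'<img[^(/>)]+/>')

# Hypothetical sample tags: one without a slash in its attributes, one with.
no_slash = '<img src="img.png" alt="logo" />'
with_slash = '<img src="/img.png" alt="logo" />'

matches_no_slash = original.search(no_slash) is not None
matches_with_slash = original.search(with_slash) is not None

print(matches_no_slash, matches_with_slash)
```

The second tag is missed because the slash in the `src` path ends the character-class run before the closing `/>` is reached.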


 
Gary Herron  
Newsgroups: comp.lang.python
From: Gary Herron <gher...@islandtraining.com>
Date: Thu, 24 Apr 2008 09:11:50 -0700
Local: Thurs, Apr 24 2008 12:11 pm
Subject: Re: convert xhtml back to html

Whether or not you can find an application that does what you want, I
don't know, but at the very least I can say this much.

You should not be reading and parsing the text yourself!  XHTML is valid
XML, and there are lots of ways to read and parse XML with Python.
(ElementTree is what I use, but other choices exist.)  Once you use an
existing package to read your files into an internal tree structure
representation, it should be a relatively easy job to traverse the tree
to emit the tags and text you want.

Gary Herron
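
A minimal sketch of the tree-based approach described above, assuming the standard library's `xml.etree.ElementTree` and its `method="html"` serializer, which omits end tags for empty HTML elements such as `br` and `img`. The function name `xhtml_to_html` is made up for illustration. The XHTML namespace is stripped first, because the serializer only recognises un-namespaced tag names.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

def xhtml_to_html(xhtml_text):
    """Parse XHTML text and serialize it with HTML rules (illustrative helper)."""
    root = ET.fromstring(xhtml_text)
    for elem in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(elem.tag, str) and elem.tag.startswith(XHTML_NS):
            elem.tag = elem.tag[len(XHTML_NS):]
    return ET.tostring(root, encoding="unicode", method="html")

sample = ('<p xmlns="http://www.w3.org/1999/xhtml">'
          'hello <img src="/img.png"/> spam <br/> bye </p>')
print(xhtml_to_html(sample))
```

Note that the default parser drops comments, which matters if the pages carry comments that must survive the conversion.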


 
Arnaud Delobelle  
Newsgroups: comp.lang.python
From: Arnaud Delobelle <arno...@googlemail.com>
Date: Thu, 24 Apr 2008 17:23:22 +0100
Local: Thurs, Apr 24 2008 12:23 pm
Subject: Re: convert xhtml back to html

"Tim Arnold" <tim.arn...@sas.com> writes:
> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
> create  CHM files. That application really hates xhtml, so I need to convert
> self-ending tags (e.g. <br />) to plain html (e.g. <br>).

> Seems simple enough, but I'm having some trouble with it. regexps trip up
> because I also have to take into account 'img', 'meta', 'link' tags, not
> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
> that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
> enough of a regexp pro to figure out that lookahead stuff.

Hi, I'm not sure if this is very helpful but the following works on
the very simple example below.

>>> import re
>>> xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'
>>> xtag = re.compile(r'<([^>]*?)/>')
>>> xtag.sub(r'<\1>', xhtml)
'<p>hello <img src="/img.png"> spam <br> bye </p>'

--
Arnaud


 
Tim Arnold  
Newsgroups: comp.lang.python
From: "Tim Arnold" <tim.arn...@sas.com>
Date: Thu, 24 Apr 2008 12:46:17 -0400
Local: Thurs, Apr 24 2008 12:46 pm
Subject: Re: convert xhtml back to html
"Gary Herron" <gher...@islandtraining.com> wrote in message

news:mailman.130.1209053543.12834.python-list@python.org...

I agree, and I'd really rather not parse it myself. However, ET will clean up
the file, dropping some comments that in my case are required as metadata, so
that won't work. Oh, I could get ET to read it and write a new parser--I see
what you mean. I think I need to subclass so I could get ET to honor those
comments too.
That's one way to go; I was just hoping for something easier.
thanks,
--Tim
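
For what it is worth, later versions of the standard library make the comment problem easier. The sketch below assumes Python 3.8 or newer, where `TreeBuilder` accepts `insert_comments=True`; the helper name and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET

def parse_keeping_comments(xml_text):
    """Parse XML text, keeping comment nodes inside the tree (needs Python 3.8+)."""
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    return ET.fromstring(xml_text, parser=parser)

# Hypothetical fragment with a metadata comment that must survive.
sample = '<div><!-- meta: keep me --><br/></div>'
root = parse_keeping_comments(sample)
html_text = ET.tostring(root, encoding="unicode", method="html")
print(html_text)
```

As far as I can tell, comments that sit outside the root element are still not kept by this approach, so it only helps for comments inside the document body.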

 
Tim Arnold  
Newsgroups: comp.lang.python
From: "Tim Arnold" <tim.arn...@sas.com>
Date: Thu, 24 Apr 2008 12:48:18 -0400
Local: Thurs, Apr 24 2008 12:48 pm
Subject: Re: convert xhtml back to html
"Arnaud Delobelle" <arno...@googlemail.com> wrote in message

news:m28wz3cjd1.fsf@googlemail.com...

Thanks for that. It is helpful--I guess I had a brain malfunction. Your
example will work for me I'm pretty sure, except in some cases where the IMG
alt text contains a gt sign. I'm not sure that's even possible, so maybe
this will do the job.
thanks,
--Tim
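
A hedged variation on the regexp above that tolerates a `>` inside a quoted attribute value, and only touches elements that HTML defines as empty. This is a sketch, not a full parser: the list of tag names and the helper name `unclose` are assumptions for illustration, and it will not cope with malformed quoting.

```python
import re

# HTML elements that never take an end tag (assumed list for this sketch).
VOID = "area|base|br|col|embed|hr|img|input|link|meta|param|source|track|wbr"

# After the tag name, consume either ordinary characters or whole quoted
# strings, so a ">" inside quotes does not end the tag early.
XTAG = re.compile(
    r'''<((?:%s)\b(?:[^>"']|"[^"]*"|'[^']*')*?)\s*/>''' % VOID,
    re.IGNORECASE,
)

def unclose(xhtml):
    """Rewrite self-closed empty elements such as <br /> as <br>."""
    return XTAG.sub(r'<\1>', xhtml)

sample = '<p>hello <img alt="a > b" src="/img.png" /> spam <br/> bye </p>'
print(unclose(sample))
```

Restricting the match to empty elements avoids turning something like `<div/>` into an unclosed `<div>`.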

 
Walter Dörwald  
Newsgroups: comp.lang.python
From: Walter Dörwald <wal...@livinglogic.de>
Date: Thu, 24 Apr 2008 18:46:15 +0200
Local: Thurs, Apr 24 2008 12:46 pm
Subject: Re: convert xhtml back to html

You might try XIST (http://www.livinglogic.de/Python/xist):

Code looks like this:

from ll.xist import parsers
from ll.xist.ns import html

xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'

doc = parsers.parsestring(xhtml)
print doc.bytes(xhtml=0)

This outputs:

<p>hello <img src="/img.png"> spam <br> bye </p>

(and a warning that the alt attribute is missing in the img ;))

Servus,
    Walter


 
John Krukoff  
Newsgroups: comp.lang.python
From: "John Krukoff" <jkruk...@ltgc.com>
Date: Thu, 24 Apr 2008 11:16:35 -0600
Local: Thurs, Apr 24 2008 1:16 pm
Subject: RE: convert xhtml back to html

One method that wouldn't require much Python code would be to run the
XHTML through a simple identity XSL transform with the output method set to
HTML. It would have the benefit that you wouldn't have to worry about any of
the specifics of the transformation, though you would need an external
dependency.

As far as I know, both 4suite and lxml (my personal favorite:
http://codespeak.net/lxml/) support XSLT in Python.

It might work out fine for you, but mixing regexps and XML always seems to
work out badly in the end for me.
---------
John Krukoff
jkruk...@ltgc.com


 
M.-A. Lemburg  
Newsgroups: comp.lang.python
From: "M.-A. Lemburg" <m...@egenix.com>
Date: Thu, 24 Apr 2008 19:41:43 +0200
Local: Thurs, Apr 24 2008 1:41 pm
Subject: Re: convert xhtml back to html
On 2008-04-24 19:16, John Krukoff wrote:

You could filter the XHTML through mxTidy and set the hide_endtags to 1:

http://www.egenix.com/products/python/mxExperimental/mxTidy/

--
Marc-Andre Lemburg
eGenix.com



 
bryan rasmussen  
Newsgroups: comp.lang.python
From: "bryan rasmussen" <rasmussen.br...@gmail.com>
Date: Thu, 24 Apr 2008 20:08:50 +0200
Local: Thurs, Apr 24 2008 2:08 pm
Subject: Re: convert xhtml back to html
I'll second the recommendation to use XSLT with the output method set to html.

The code for an XSLT stylesheet to do it would be basically:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="html" />
    <xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
</xsl:stylesheet>

You would probably want to do more than just copy it out, but
that's another case.

Also, from my recollection, the way to make XHTML br elements behave
correctly in CHM was to write <br /> as opposed to <br/>. At any rate,
I've done projects generating CHM, and my output markup was well-formed
XML on all occasions.

Cheers,
Bryan Rasmussen
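
A minimal sketch of applying the stylesheet above from Python, assuming lxml is installed (a third-party package, not part of the standard library). The function name `xhtml_to_html` and the sample fragment are made up for illustration, and the fragment carries no XHTML namespace. Because `xsl:copy-of` copies comment nodes as well, comments inside the document should come through unchanged.

```python
from lxml import etree

# The identity-style stylesheet from the post above, with HTML output.
STYLESHEET = b"""\
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html"/>
  <xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
</xsl:stylesheet>
"""

transform = etree.XSLT(etree.XML(STYLESHEET))

def xhtml_to_html(xhtml_text):
    """Run the stylesheet over the parsed document and return HTML text."""
    doc = etree.fromstring(xhtml_text)
    return str(transform(doc))

sample = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'
print(xhtml_to_html(sample))
```

Compiling the stylesheet once and reusing `transform` for every page keeps the per-file cost down when converting many files.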


 
Stefan Behnel  
Newsgroups: comp.lang.python
From: Stefan Behnel <stefan...@behnel.de>
Date: Thu, 24 Apr 2008 21:55:56 +0200
Local: Thurs, Apr 24 2008 3:55 pm
Subject: Re: convert xhtml back to html

Tim Arnold wrote:
> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
> create  CHM files. That application really hates xhtml, so I need to convert
> self-ending tags (e.g. <br />) to plain html (e.g. <br>).

This should do the job in lxml 2.x:

    from lxml import etree

    tree = etree.parse("thefile.xhtml")
    tree.write("thefile.html", method="html")

http://codespeak.net/lxml

Stefan


 
bryan rasmussen  
Newsgroups: comp.lang.python
From: "bryan rasmussen" <rasmussen.br...@gmail.com>
Date: Thu, 24 Apr 2008 23:11:57 +0200
Local: Thurs, Apr 24 2008 5:11 pm
Subject: Re: convert xhtml back to html
wow, that's pretty nice there.

 Just to know: what's the performance like on XML instances of 1 GB?

Cheers,
Bryan Rasmussen


 
Stefan Behnel  
Newsgroups: comp.lang.python
From: Stefan Behnel <stefan...@behnel.de>
Date: Fri, 25 Apr 2008 08:16:57 +0200
Local: Fri, Apr 25 2008 2:16 am
Subject: Re: convert xhtml back to html
bryan rasmussen top-posted:

> On Thu, Apr 24, 2008 at 9:55 PM, Stefan Behnel <stefan...@behnel.de> wrote:
>>     from lxml import etree

>>     tree = etree.parse("thefile.xhtml")
>>     tree.write("thefile.html", method="html")

>>  http://codespeak.net/lxml

> wow, that's pretty nice there.

>  Just to know: what's the performance like on XML instances of 1 GB?

That's a pretty big file, although you didn't mention what kind of XML
language you want to handle and what you want to do with it.

lxml is pretty conservative in terms of memory:

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

But the exact numbers depend on your data. lxml holds the XML tree in memory,
which is a lot bigger than the serialised data. So, for example, if you have
2GB of RAM and want to parse a serialised 1GB XML file full of little
one-element integers into an in-memory tree, get prepared for lunch. With a
lot of long text string content instead, it might still fit.

However, lxml also has a couple of step-by-step and stream parsing APIs:

http://codespeak.net/lxml/parsing.html#the-target-parser-interface
http://codespeak.net/lxml/parsing.html#the-feed-parser-interface
http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk

They might do what you want.

Stefan
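
A rough sketch of the incremental style of parsing mentioned above. It uses `iterparse` from the standard library's `xml.etree.ElementTree` rather than lxml, on the assumption that the basic usage is the same for this purpose; the helper name `count_tags` and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def count_tags(source, tag):
    """Count elements named `tag` without keeping their content in memory."""
    count = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        # Drop text, attributes and children once the element is complete.
        elem.clear()
    return count

# A tiny in-memory stand-in for a large file on disk.
sample = io.BytesIO(b"<root><item>1</item><item>2</item><other/></root>")
print(count_tags(sample, "item"))
```

Cleared elements stay attached to their parent as empty shells, so for truly huge inputs memory can still grow slowly unless they are also removed from the parent.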


 
Jim Washington  
Newsgroups: comp.lang.python
From: Jim Washington <jwas...@vt.edu>
Date: Fri, 25 Apr 2008 08:46:53 -0400
Local: Fri, Apr 25 2008 8:46 am
Subject: Re: convert xhtml back to html

If you are operating with huge XML files (say, larger than available
RAM) repeatedly, an XML database may also be a good option.

My current favorite in this realm is Sedna (free, Apache 2.0 license).  
Among other features, it has facilities for indexing within documents
and collections (faster queries) and transactional sub-document updates
(safely modify parts of a document without rewriting the entire
document).  I have been working on a python interface to it recently
(zif.sedna, in pypi).

Regarding RAM consumption, a Sedna database uses approximately 100 MB of
RAM by default, and that does not change much, no matter how much (or
how little) data is actually stored.

For a quick idea of Sedna's capabilities, the Sedna folks have put up an
on-line demo serving and xquerying an extract from Wikipedia (in the
range of 20 GB of data) using a Sedna server, at
http://wikidb.dyndns.org/ .  Along with the on-line demo, they provide
instructions for deploying the technology locally.

- Jim Washington


 
Tim Arnold  
Newsgroups: comp.lang.python
From: "Tim Arnold" <tim.arn...@sas.com>
Date: Fri, 25 Apr 2008 13:29:54 -0400
Local: Fri, Apr 25 2008 1:29 pm
Subject: Re: convert xhtml back to html
"bryan rasmussen" <rasmussen.br...@gmail.com> wrote in message

news:mailman.138.1209061180.12834.python-list@python.org...

Thanks Bryan, Walter, John, Marc, and Stefan. I finally went with the XSLT
transform, which works very well and is simple.  Regexps would work, but they
just scare me somehow. Bryan, my tags were formatted as <br />, but the help
compiler would issue a warning on each one, resulting in log files with
thousands of warnings. It did finish the compile, but it made
understanding the logs too painful.

Stefan, I *really* look forward to being able to use lxml when I move to RH
linux next month. I've been using hp10.20 and never could get the requisite
libraries to compile. Once I make that move, maybe I won't have as many
markup related questions here!

thanks again to all for the great suggestions.
--Tim Arnold


 