
Parsing html with Beautifulsoup


Johann Spies

Dec 10, 2009, 4:15:19 AM
to Python poslys
I am trying to get CSV output from an HTML file.

With this code I had a little success:
=========================
from BeautifulSoup import BeautifulSoup
from string import replace, join
import re

f = open("configuration.html","r")
g = open("configuration.csv",'w')
soup = BeautifulSoup(f)
t = soup.findAll('table')
for table in t:
    rows = table.findAll('tr')
    for th in rows[0]:
        t = th.find(text=True)
        g.write(t)
        g.write(',')
    # print(','.join(t))

    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            try:
                t = td.find(text=True).replace(' ','')
                g.write(t)
            except:
                g.write ('')
            g.write(",")
        g.write("\n")
===============================

producing output like this:

RULE,SOURCE,DESTINATION,SERVICES,ACTION,TRACK,TIME,INSTALL ON,COMMENTS,
1,,,,drop,Log,Any,,,
2,All Users@Any,,Any,clientencrypt,Log,Any,,,
3,Any,Any,,drop,None,Any,,,
4,,,,drop,None,Any,,,
...

It left out all the non-plaintext parts of <td></td>

I then tried using

t.renderContents and then got something like this (one line broken into
many for the sake of this email):

1,<img src=icons/group.png>&nbsp;<a href=#OBJ_sunetint>
sunetint</A><BR>,
<img src=icons/gateway_cluster.png>&nbsp;<a href=#OBJ_Rainwall_Cluster
>Rainwall_Cluster</A> <BR>,
<img src=icons/udp.png>&nbsp;<a href=#SVC_IKE >IKE</a><br>,
<img src=icons/drop.png>&nbsp;drop,
<img src=icons/log.png>&nbsp;Log&nbsp;,
<img src=icons/any.png>&nbsp;Any<br>&nbsp;,
<img src=icons/gateway_cluster.png>&nbsp;<a href=#OBJ_Rainwall_Cluster
>Rainwall_Cluster</A> <BR>&nbsp;,&nbsp;

How do I get Beautifulsoup to render (taking the above line as
example)

sunetint for <img src=icons/group.png>&nbsp;<a
href=#OBJ_sunetint>sunetint</A><BR>

and still provide the text-parts in the <td>'s with plain text?

I have experimented a little bit with regular expressions, but could
so far not find a solution.
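
For a single, simple fragment like the one above, a regex-only approach can be sketched as follows. This is an illustration in present-day Python 3, not something taken from the thread: `TAG_RE` and `fragment_to_text` are made-up names, and the pattern naively assumes that no `>` appears inside an attribute value, so it is likely to break on less regular HTML.

```python
import html
import re

# Naive tag pattern: assumes no ">" inside attribute values (an assumption
# that holds for the fragments shown above, but not for HTML in general).
TAG_RE = re.compile(r"<[^>]*>")


def fragment_to_text(fragment):
    """Strip tags, decode entities, and collapse whitespace in one cell."""
    without_tags = TAG_RE.sub(" ", fragment)
    decoded = html.unescape(without_tags)  # "&nbsp;" becomes "\xa0"
    # str.split() with no arguments also splits on "\xa0" and newlines.
    return " ".join(decoded.split())


print(fragment_to_text(
    "<img src=icons/group.png>&nbsp;<a href=#OBJ_sunetint>sunetint</A><BR>"
))  # prints: sunetint
```

A real parser is usually the safer choice once cells are unclosed or nested, which is the case in the table discussed here.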

Regards
Johann
--
Johann Spies Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch

"Lo, children are an heritage of the LORD: and the
fruit of the womb is his reward." Psalms 127:3

Gabriel Genellina

Dec 10, 2009, 10:23:19 PM
to pytho...@python.org
On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies <jsp...@sun.ac.za>
wrote:

> How do I get Beautifulsoup to render (taking the above line as
> example)
>
> sunetint for <img src=icons/group.png>&nbsp;<a
> href=#OBJ_sunetint>sunetint</A><BR>
>
> and still provide the text-parts in the <td>'s with plain text?

Hard to tell if we don't see what's inside those <td>'s - please provide
at least a few rows of the original HTML table.

--
Gabriel Genellina

Johann Spies

Dec 11, 2009, 2:04:38 AM
to pytho...@python.org
Gabriel Genellina wrote:

> On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies <jsp...@sun.ac.za>
> wrote:

>
>> How do I get Beautifulsoup to render (taking the above line as
>> example)
>>
>> sunetint for <img src=icons/group.png>&nbsp;<a
>> href=#OBJ_sunetint>sunetint</A><BR>
>>
>> and still provide the text-parts in the <td>'s with plain text?
>
> Hard to tell if we don't see what's inside those <td>'s - please
> provide at least a few rows of the original HTML table.
>
Thanks for your reply.

Here are a few lines:

<!------- Rule 1 ------->
<tr style="background-color: #ffffff"><td class=normal>2</td><td><img
src=icons/usrgroup.png>&nbsp;All Users@Any<br><td><im$
</td><td><img src=icons/any.png>&nbsp;Any<br></td><td><img
src=icons/clientencrypt.png>&nbsp;clientencrypt</td><td><img src$
&nbsp;</td><td>&nbsp;</td></tr>

<!------- Rule 2 ------->
<tr style="background-color: #eeeeee"><td class=normal>3</td><td><img
src=icons/any.png>&nbsp;Any<br><td><img src=icons/any$
&nbsp;</td><td>&nbsp;</td></tr>

<!------- Rule 3 ------->
<tr style="background-color: #ffffff"><td class=normal>4</td><td><img
src=icons/group.png>&nbsp;<a href=#OBJ_Rainwall_Group$
<td><img src=icons/group.png>&nbsp;<a href=#OBJ_Rainwall_Group
>Rainwall_Group</A> <BR>
</td><td><img src=icons/udp.png>&nbsp;<a href=#SVC_RainWall_Stop
>RainWall_Stop</a><br></td><td><img src=icons/drop.png>&nb$
&nbsp;</td><td>&nbsp;</td></tr>

<!------- Rule 4 ------->
<tr style="background-color: #eeeeee"><td class=normal>5</td><td><img
src=icons/host.png>&nbsp;<a href=#OBJ_Rainwall_Broadc$
<img src=icons/group.png>&nbsp;<a href=#OBJ_Rainwall_Group
>Rainwall_Group</A> <BR>
<td><img src=icons/group.png>&nbsp;<a href=#OBJ_Rainwall_Group
>Rainwall_Group</A> <BR>
<img src=icons/host.png>&nbsp;<a href=#OBJ_Rainwall_Broadcast
>Rainwall_Broadcast</A> <BR>
</td><td><img src=icons/udp.png>&nbsp;<a href=#SVC_RainWall_Daemon
>RainWall_Daemon</a><br></td><td><img src=icons/accept.p$
&nbsp;</td><td>&nbsp;</td></tr>

Gabriel Genellina

Dec 13, 2009, 5:58:55 AM
to pytho...@python.org
On Fri, 11 Dec 2009 04:04:38 -0300, Johann Spies <jsp...@sun.ac.za>
wrote:

> Gabriel Genellina wrote:
>> On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies <jsp...@sun.ac.za>
>> wrote:


>>
>>> How do I get Beautifulsoup to render (taking the above line as
>>> example)
>>>
>>> sunetint for <img src=icons/group.png>&nbsp;<a
>>> href=#OBJ_sunetint>sunetint</A><BR>
>>>
>>> and still provide the text-parts in the <td>'s with plain text?
>>
>> Hard to tell if we don't see what's inside those <td>'s - please
>> provide at least a few rows of the original HTML table.
>>
> Thanks for your reply. Here are a few lines:
>
> <!------- Rule 1 ------->
> <tr style="background-color: #ffffff"><td class=normal>2</td><td><img
> src=icons/usrgroup.png>&nbsp;All Users@Any<br><td><im$
> </td><td><img src=icons/any.png>&nbsp;Any<br></td><td><img
> src=icons/clientencrypt.png>&nbsp;clientencrypt</td><td><img src$
> &nbsp;</td><td>&nbsp;</td></tr>

I *think* I finally understand what you want (your previous example above
confused me).
If you want Rule 1 to generate a line like this:

2,All Users@Any,<im$,Any,clientencrypt,,

this code should serve as a starting point:

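from BeautifulSoup import BeautifulSoup
import csv

# html: the page source (a string or an open file); not defined in this snippet
lines = []
soup = BeautifulSoup(html)
for table in soup.findAll("table"):
    for row in table.findAll("tr"):
        line = []
        for cell in row.findAll("td"):
            # join every text node in the cell, not just the first one
            text = ' '.join(
                s.replace('\n',' ').replace('&nbsp;',' ')
                for s in cell.findAll(text=True)).strip()
            line.append(text)
        lines.append(line)

# binary mode, as the Python 2 csv module expects
with open("output.csv","wb") as f:
    writer = csv.writer(f)
    writer.writerows(lines)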

cell.findAll(text=True) returns a list of all text nodes inside a <td>
cell; I replace every \n and &nbsp; in each text node with a space, and
join them all. lines is a list of lists (one list per row, one entry per
cell), as expected by the csv module used to write the output file.
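
For readers on Python 3, the same idea (collect every text node in a cell, normalize whitespace, hand a list of rows to the csv module) can be sketched with the standard library alone. Note that this swaps BeautifulSoup for `html.parser`, so it is not the code above ported one-to-one; `TableTextParser`, `extract_rows`, `rows_to_csv` and the `SAMPLE` markup are made-up names and data for illustration, and the handling of unclosed cells is only a guess at what the real file needs.

```python
import csv
import io
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of every <td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # "&nbsp;" arrives as "\xa0"
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read
        self._cell = None  # text nodes of the cell being read

    def _close_cell(self):
        if self._cell is not None:
            # split() treats "\xa0" and "\n" as whitespace, so this
            # collapses both and trims the ends.
            self._row.append(" ".join(" ".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:  # rows without any <td> (e.g. header rows) are skipped
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()
            self._row = []
        elif tag == "td" and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag == "tr":
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()    # flush buffered text
        self._close_row()  # keep a final row even if </tr> is missing


def extract_rows(html_text):
    parser = TableTextParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Return the rows as CSV text; for a file use open(path, "w", newline="")."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Hypothetical markup modelled on the rows quoted in this thread.
SAMPLE = """
<table>
<tr><td class=normal>1</td><td><img src=icons/group.png>&nbsp;<a href=#OBJ_sunetint>sunetint</A> <BR></td><td><img src=icons/drop.png>&nbsp;drop</td><td>&nbsp;</td></tr>
<tr><td class=normal>2</td><td><img src=icons/usrgroup.png>&nbsp;All Users@Any<br>
<td><img src=icons/any.png>&nbsp;Any<br></td><td>&nbsp;</td></tr>
</table>
"""

print(rows_to_csv(extract_rows(SAMPLE)))
```

In Python 3 the csv module wants a text-mode file opened with newline="" rather than the "wb" mode used above, which is why the sketch builds the output in a text buffer.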

--
Gabriel Genellina

Johann Spies

Dec 14, 2009, 1:58:34 AM
to pytho...@python.org
On Sun, Dec 13, 2009 at 07:58:55AM -0300, Gabriel Genellina wrote:

> this code should serve as a starting point:

Thank you very much!

> cell.findAll(text=True) returns a list of all text nodes inside a
> <td> cell; I preprocess all \n and &nbsp; in each text node, and
> join them all. lines is a list of lists (each entry one cell), as
> expected by the csv module used to write the output file.

I struggled a bit to find the documentation for (text=True).
Most of the BeautifulSoup documentation I saw consisted of
examples without explaining what the options do. Thanks for your
explanation.

As far as I can see there was no documentation installed with the
debian package.

Regards
Johann
--
Johann Spies Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch

"But I will hope continually, and will yet praise thee
more and more." Psalms 71:14

Gabriel Genellina

Dec 14, 2009, 5:39:35 PM
to pytho...@python.org
On Mon, 14 Dec 2009 03:58:34 -0300, Johann Spies <jsp...@sun.ac.za>
wrote:

> On Sun, Dec 13, 2009 at 07:58:55AM -0300, Gabriel Genellina wrote:

>> cell.findAll(text=True) returns a list of all text nodes inside a
>> <td> cell; I preprocess all \n and &nbsp; in each text node, and
>> join them all. lines is a list of lists (each entry one cell), as
>> expected by the csv module used to write the output file.
>
> I struggled a bit to find the documentation for (text=True).
> Most of the BeautifulSoup documentation I saw consisted of
> examples without explaining what the options do. Thanks for your
> explanation.

See
http://www.crummy.com/software/BeautifulSoup/documentation.html#arg-text

> As far as I can see there was no documentation installed with the
> debian package.

BeautifulSoup is very small - a single .py file, no dependencies. The
whole documentation is contained in the above linked page.

--
Gabriel Genellina
