regex that remove html tags in a node


scr...@asia.com

Jul 26, 2010, 6:45:15 PM
to scrapy...@googlegroups.com

Hi,

I've got a problem with extracting content:

item['content'] = hxs.select('//span[@id="desc"]').extract()

I would like to get rid of all the HTML tags in the node. How can I do that?

In my output XML file I get this kind of thing:
<field name="content">&lt;b&gt;&lt;/b&gt;&lt;b&gt;&lt;/b&gt;&lt;b&gt;&lt;span&gt;&lt;font

Rolando Espinoza La Fuente

Jul 26, 2010, 7:04:07 PM
to scrapy...@googlegroups.com

You can select the text nodes with XPath's text():
item['content'] = hxs.select('//span[@id="desc"]//text()').extract()

Or check the scrapy.utils.markup module, which has some functions
to clean up HTML.

~Rolando
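
For illustration, here is a minimal sketch of what selecting descendant text nodes gives you. It uses the standard library's xml.etree.ElementTree as a stand-in for the Scrapy selector, so it only works on well-formed markup, and the sample snippet is made up. The point is that the result is a list of text fragments, including those inside a nested <div>, which you usually want to join into one string.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup: a span with inline tags and a nested <div>.
snippet = (
    '<span id="desc"><b>Intro</b> text '
    '<div>inside <i>a div</i></div> tail</span>'
)

node = ET.fromstring(snippet)

# itertext() yields every descendant text node in document order,
# roughly what //span[@id="desc"]//text() selects in XPath.
pieces = list(node.itertext())

# The fragments come back as a list, so join them and collapse
# runs of whitespace to get one clean string.
content = " ".join("".join(pieces).split())

print(pieces)
print(content)
```

With a Scrapy selector the same joining step would presumably apply to the list returned by extract().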

scr...@asia.com

Jul 27, 2010, 5:31:02 AM
to scrapy...@googlegroups.com
That's what I was using, but if there are additional <div> elements inside the span, it doesn't work: it returns empty content.

So that's why I'm trying to remove all HTML tags with a regex.


item['content'] = hxs.select('//span[@id="desc"]//text()').extract()
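
A rough sketch of the regex approach, using only the Python standard library. The pattern is deliberately naive: it assumes no ">" appears inside attribute values, comments or scripts, and it replaces each tag with a space, which would split a word like wo<b>rd</b>. The function name and sample string are made up for the example.

```python
import re
from html import unescape

# Naive pattern: anything from "<" up to the next ">".
# Assumption: no ">" inside attribute values, comments, or scripts.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Replace tags with spaces, decode entities, collapse whitespace."""
    without_tags = TAG_RE.sub(" ", fragment)
    return " ".join(unescape(without_tags).split())


# Hypothetical input resembling the markup described in this thread.
raw = (
    '<span id="desc"><b></b><b></b><b><span><font>Some &amp; text'
    '</font></span></b><div>more</div></span>'
)
clean = strip_tags(raw)
print(clean)
```

A regex like this is fine for quick cleanup of simple fragments, but a real HTML parser is more robust on messy pages.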





Alex

Jul 27, 2010, 3:43:26 AM
to scrapy-users
Try

def remove_tags(text, which_ones=(), keep=(), encoding=None):
    """ Remove HTML Tags only.

    which_ones and keep are both tuples, there are four cases:

    which_ones, keep (1 - not empty, 0 - empty)
    1, 0 - remove all tags in which_ones
    0, 1 - remove all tags except the ones in keep
    0, 0 - remove all tags
    1, 1 - not allowed
    """

This is from Scrapy's markup module (scrapy.utils.markup).
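
To make the four cases concrete, here is a small standard-library sketch that mirrors them. It is not Scrapy's implementation: strip_tags_sketch is a hypothetical name, the regex is a simplification, and the encoding parameter is left out.

```python
import re

# Matches an opening, closing, or self-closing tag and captures its name.
# Assumption: good enough for simple markup, not a full HTML parser.
_TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def strip_tags_sketch(text, which_ones=(), keep=()):
    """Hypothetical stand-in mirroring the four which_ones/keep cases."""
    if which_ones and keep:
        # Case 1, 1: not allowed.
        raise ValueError("pass which_ones or keep, not both")
    which_ones = {name.lower() for name in which_ones}
    keep = {name.lower() for name in keep}

    def replace(match):
        name = match.group(1).lower()
        if which_ones:
            # Case 1, 0: remove only the listed tags.
            remove = name in which_ones
        else:
            # Cases 0, 1 and 0, 0: remove everything not in keep.
            remove = name not in keep
        return "" if remove else match.group(0)

    return _TAG_RE.sub(replace, text)


html = '<span id="desc"><b>Bold</b> and <div>nested <i>text</i></div></span>'
all_removed = strip_tags_sketch(html)
only_b_removed = strip_tags_sketch(html, which_ones=("b",))
div_kept = strip_tags_sketch(html, keep=("div",))

print(all_removed)
print(only_b_removed)
print(div_kept)
```

Calling it with no extra arguments corresponds to the "0, 0" case, which is the one that matters for stripping every tag from the extracted node.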

scr...@asia.com

Jul 27, 2010, 10:06:48 AM
to scrapy...@googlegroups.com
I think the end of the function is missing, no?



