Regex that removes HTML tags in a node

scr...@asia.com

Jul 26, 2010, 6:45:15 PM

to scrapy...@googlegroups.com

Hi,

I've got a problem with extracting content:

item['content'] = hxs.select('//span[@id="desc"]').extract()

I would like to get rid of all the HTML tags in the node... how?

In my output XML file I get this kind of thing:
<field name="content"><b></b><b></b><b><span><font

Rolando Espinoza La Fuente

Jul 26, 2010, 7:04:07 PM

to scrapy...@googlegroups.com

You can select the text nodes with XPath's text():
item['content'] = hxs.select('//span[@id="desc"]//text()').extract()

Or check the module scrapy.utils.markup, which has some functions
to clean up HTML.

~Rolando
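
As a rough illustration of what the //text() step selects, here is a sketch that uses Python's standard-library xml.etree.ElementTree in place of Scrapy's selector (an assumption, made only so the snippet runs on its own; the sample markup is invented). itertext() walks every descendant text node, including text inside nested tags, much as //text() does, and joining the pieces gives one string:

```python
import xml.etree.ElementTree as ET

# Invented sample markup; it must be well-formed for ElementTree to parse it.
SAMPLE = (
    '<html><body>'
    '<span id="desc"><b>Big</b> <div>red <font color="red">car</font></div></span>'
    '</body></html>'
)

root = ET.fromstring(SAMPLE)

# ElementTree supports a limited XPath subset, including attribute predicates.
node = root.find(".//span[@id='desc']")

# itertext() yields the text of the node and of all its descendants in
# document order, roughly what //span[@id="desc"]//text() returns.
pieces = list(node.itertext())

# Join the pieces and collapse runs of whitespace into single spaces.
content = " ".join("".join(pieces).split())
print(content)
```

With Scrapy itself, the list returned by .extract() can be joined in the same way.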

scr...@asia.com

Jul 27, 2010, 5:31:02 AM

to scrapy...@googlegroups.com

That's what I was using, but if there are additional <div> elements inside the span, it doesn't work! It returns empty content.

So that's why I'm trying to remove all HTML tags with a regex

item['content'] = hxs.select('//span[@id="desc"]//text()').extract()
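
For the regex route, a minimal sketch might look like the following. It assumes Python 3 and that the markup is already available as a string (for example, one element of the list returned by .extract()); the helper name strip_tags and the sample string are made up for illustration. A pattern like this is only an approximation: it mishandles a ">" inside an attribute value, comments, and script or style content.

```python
import html
import re

# Matches anything that looks like an opening, closing, or self-closing tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup):
    """Remove tag-like substrings, then decode entities such as &amp;."""
    text = TAG_RE.sub("", markup)
    # Decode entities after stripping, so an escaped "&lt;b&gt;" ends up as
    # the literal text "<b>" instead of being removed as a tag.
    text = html.unescape(text)
    # Collapse whitespace left behind by the removed tags.
    return " ".join(text.split())


raw = (
    '<span id="desc"><div><b>Big</b> red</div> '
    '<font color="red">car &amp; trailer</font></span>'
)
print(strip_tags(raw))
```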

Alex

Jul 27, 2010, 3:43:26 AM

to scrapy-users

Try

def remove_tags(text, which_ones=(), keep=(), encoding=None):
    """ Remove HTML Tags only.

    which_ones and keep are both tuples, there are four cases:

    which_ones, keep (1 - not empty, 0 - empty)
    1, 0 - remove all tags in which_ones
    0, 1 - remove all tags except the ones in keep
    0, 0 - remove all tags
    1, 1 - not allowed
    """

from the Scrapy markup module (scrapy.utils.markup)
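
Only the signature and docstring are quoted here; the function body lives in Scrapy's markup module. To give a rough idea of how the four cases might be handled with a regex, here is a standalone sketch. It is not Scrapy's actual code: the name remove_tags_sketch and the sample string are invented, and the encoding parameter is left out.

```python
import re

# Captures the tag name of anything that looks like an opening or closing tag.
TAG_RE = re.compile(r"</?([A-Za-z][\w:-]*)[^>]*>")


def remove_tags_sketch(text, which_ones=(), keep=()):
    """Strip tags from text, following the four cases in the docstring."""
    if which_ones and keep:
        # Case "1, 1" is not allowed.
        raise ValueError("pass which_ones or keep, not both")

    which_ones = {name.lower() for name in which_ones}
    keep = {name.lower() for name in keep}

    def replace(match):
        name = match.group(1).lower()
        if which_ones:
            # Case "1, 0": drop only the listed tags.
            drop = name in which_ones
        else:
            # Cases "0, 1" and "0, 0": drop everything not in keep.
            drop = name not in keep
        return "" if drop else match.group(0)

    return TAG_RE.sub(replace, text)


sample = '<div><b>Big</b> <a href="/x">red</a> car</div>'
print(remove_tags_sketch(sample))
print(remove_tags_sketch(sample, which_ones=("b",)))
print(remove_tags_sketch(sample, keep=("a",)))
```

Only the tags are removed; the text between them is left in place.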

scr...@asia.com

Jul 27, 2010, 10:06:48 AM

to scrapy...@googlegroups.com

I think the end of the function is missing, no?
