You can use XPath's text() function:
item['content'] = hxs.select('//span[@id="desc"]//text()').extract()
Or check the scrapy.utils.markup module, which has some functions
to clean up HTML.
~Rolando
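
To illustrate what selecting the text nodes under the span gives you, here is a minimal sketch that uses only the Python standard library as a stand-in for the Scrapy selector. The sample markup and the helper name extract_text_nodes are hypothetical, and ElementTree needs well-formed input, so real-world HTML may need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, not taken from the page in the original question.
SAMPLE = '<div><span id="desc"><b>Hello</b> <font color="red">world</font></span></div>'


def extract_text_nodes(markup):
    """Return the text nodes found under <span id="desc">, in document order."""
    root = ET.fromstring(markup)
    # Roughly the same target as //span[@id="desc"] in the XPath above.
    span = root.find(".//span[@id='desc']")
    if span is None:
        return []
    # itertext() walks the span and all its descendants, yielding only text.
    return list(span.itertext())


nodes = extract_text_nodes(SAMPLE)
print(nodes)           # a list with one entry per text node, tags dropped
print("".join(nodes))  # joined into a single string
```

As with extract() in the reply above, the result is a list with one entry per text node, so join the pieces if the item field should hold a single string.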
On Mon, Jul 26, 2010 at 6:45 PM, <scr...@asia.com> wrote:
>
> Hi,
>
> I've got a problem with extracting content:
>
> item['content'] = hxs.select('//span[@id="desc"]').extrac()
>
> I would like to get rid of all the html tags in the node... how?
>
> In my output XML file I get this kind of stuff:
> <field
> name="content"><b></b><b></b><b><span><font
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.
I think the end of the function is missing, no?
> <field name="content"><b></b><b></b><b><span><font