How can I get only the text from the body?


krzys...@gmail.com

Mar 3, 2014, 10:47:31 AM
to scrapy...@googlegroups.com
This is my Scrapy spider:


from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from play.items import PlayItem

class PlaySpider(CrawlSpider):
    name = 'play'
    allowed_domains = ['lo.lesko.pl']
    start_urls = ['http://www.lo.lesko.pl/']
    rules = [Rule(SgmlLinkExtractor(allow=[]), follow=True, callback='parse_play')]

    def parse_play(self, response):
        sel = Selector(response)
        play = PlayItem()
        play['url'] = response.url.strip()  # response.url is a string; [0] would keep only its first character
        # play['title'] = sel.xpath("//title/text()").extract()
        play['body'] = sel.select("//body").extract()[0].strip()
        return play


I use the strip() function because I would like the text without HTML tags, but I must be doing something wrong, because there are still HTML tags in my XML output file.

Svyatoslav Sydorenko

Mar 3, 2014, 6:11:29 PM
to scrapy...@googlegroups.com
strip() only removes leading and trailing whitespace from the string; it does not remove HTML tags.
I advise you to use BeautifulSoup4 (maybe this will help). It will meet your needs and simplify working with the HTML DOM.
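
To illustrate the point about strip(), here is a minimal sketch using a made-up HTML string (the `raw` value is an assumption for illustration, not output from the spider above). Only the surrounding whitespace goes away; the tags stay:

```python
# Hypothetical sample string standing in for an extracted <body> element.
raw = "  <body><p>Hello</p></body>\n"

# str.strip() trims leading and trailing whitespace only.
cleaned = raw.strip()

print(repr(cleaned))  # '<body><p>Hello</p></body>'
```

So the markup has to be removed by some other step, such as selecting text nodes or using a tag-stripping helper.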

On Monday, March 3, 2014 at 17:47:31 UTC+2, krzys...@gmail.com wrote:

Paul Tremberth

Mar 3, 2014, 6:46:13 PM
to scrapy...@googlegroups.com
Hi,

You have at least a couple of options here:

- Select the descendant text nodes of the body element and join the resulting list of strings with u"" (or with a newline character, u"\n"):

play['body'] = u''.join(sel.xpath('//body//text()').extract()).strip()

If you want to exclude text nodes inside <script> elements (JavaScript code that you probably don't want), you can use:

play['body'] = u''.join(sel.xpath('//body/descendant-or-self::*[not(self::script)]/text()').extract()).strip()

- Alternatively, if you don't want to deal with XPath expressions, use remove_tags from w3lib (http://w3lib.readthedocs.org/en/latest/w3lib.html#w3lib.html.remove_tags):

import w3lib.html
...
play['body'] = w3lib.html.remove_tags(sel.xpath('//body').extract()[0])

To drop the text inside <script> as well, first remove the <script> elements altogether (tags and content), and then strip the remaining tags, keeping their text content:

play['body'] = w3lib.html.remove_tags(
    w3lib.html.remove_tags_with_content(
        sel.xpath('//body').extract()[0],
        which_ones=('script',)
    )
)
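
For anyone who wants to try the idea behind the options above without running a crawl, here is a rough standard-library sketch. It does not use Scrapy selectors or w3lib; it swaps in Python's html.parser to do the same thing in spirit: collect the text nodes under <body>, skip anything inside <script>, and join the pieces. The class name `BodyTextExtractor`, the helper `body_text`, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text found inside <body>, ignoring text inside <script>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.script_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "script":
            self.script_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "script" and self.script_depth:
            self.script_depth -= 1

    def handle_data(self, data):
        # Keep only text that is inside <body> and outside any <script>.
        if self.in_body and not self.script_depth:
            self.chunks.append(data)


def body_text(html, sep=""):
    """Return the body text joined with `sep`, then stripped of outer whitespace."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return sep.join(parser.chunks).strip()


# Hypothetical sample page.
sample = (
    "<html><head><title>Ignored</title></head>"
    "<body><h1>Hello</h1><script>var x = 1;</script><p>World</p></body></html>"
)

print(body_text(sample))                  # HelloWorld
print(repr(body_text(sample, sep="\n")))  # 'Hello\nWorld'
```

The two calls show the difference between joining with an empty string and joining with a newline: with "" the text of adjacent elements runs together, so a separator is usually the more readable choice.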


Hope this helps

/Paul.