Removing HTML tags from the crawl


Yves S. Garret

Jun 25, 2013, 9:44:59 AM
to scrapy...@googlegroups.com
Hello, I have an interesting problem.  As I crawl through the different pages,
I'd like to strip out the HTML tags and keep just the text associated with
each page.

This is my mspider_spiders.py in the spiders directory:
from mspider.html_strip import MLStripper

.............

  def parse_pages(self, response):
    hxs = HtmlXPathSelector(response)
    html_strip = MLStripper()
    item = MspiderItem()

    html_strip.feed(hxs.select("//a/@href").extract())
    item['links']     = html_strip.get_data()
    html_strip.feed(hxs.select("//p").extract())
    item['paragraph'] = html_strip.get_data()
    html_strip.feed(hxs.select("//div").extract())
    item['div']       = html_strip.get_data()
    html_strip.feed(hxs.select("//span").extract())
    item['span']      = html_strip.get_data()

    #item['links']     = item.feed(hxs.select("//a/@href").extract()).get_data()
    #item['paragraph'] = item.feed(hxs.select("//p").extract()).get_data()
    #item['div']       = item.feed(hxs.select("//div").extract()).get_data()
    #item['span']      = item.feed(hxs.select("//span").extract()).get_data()

    return item

And this is my MLStripper class, stored as html_strip.py in the parent directory:
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
  def __init__(self):
    self.reset()
    self.fed = []

  def handle_data(self, d):
    self.fed.append(d)

  def get_data(self):
    return ''.join(self.fed)

#def strip_tags(html):
#    s = MLStripper()
#    s.feed(html)
#    return s.get_data()

Now, I'm getting an error while running this code (shown at the bottom of this e-mail). I also suspect that this problem has been solved many times before, so I'd like to know how others have approached it.

....................................

2013-06-25 13:37:04+0000 [mspider] ERROR: Spider error processing <GET http://www.xbox.com:80/en-US/>
    Traceback (most recent call last):
      File "/usr/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/base.py", line 1201, in mainLoop
        self.runUntilCurrent()
      File "/usr/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 380, in callback
        self._startRunCallbacks(result)
      File "/usr/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 488, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/usr/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 575, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/azureuser/scrapy_projects/mspider/mspider/spiders/mspider_spiders.py", line 60, in parse_pages
        html_strip.feed(hxs.select("//a/@href").extract())
      File "/usr/lib64/python2.6/HTMLParser.py", line 107, in feed
        self.rawdata = self.rawdata + data
    exceptions.TypeError: cannot concatenate 'str' and 'list' objects
   
2013-06-25 13:37:04+0000 [mspider] INFO: Closing spider (finished)
2013-06-25 13:37:04+0000 [mspider] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 13750,
     'downloader/request_count': 52,
     'downloader/request_method_count/GET': 52,
     'downloader/response_bytes': 1655760,
     'downloader/response_count': 52,
     'downloader/response_status_count/200': 49,
     'downloader/response_status_count/302': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 6, 25, 13, 37, 4, 821748),
     'log_count/DEBUG': 60,
     'log_count/ERROR': 49,
     'log_count/INFO': 4,
     'response_received_count': 49,
     'scheduler/dequeued': 52,
     'scheduler/dequeued/memory': 52,
     'scheduler/enqueued': 52,
     'scheduler/enqueued/memory': 52,
     'spider_exceptions/TypeError': 49,
     'start_time': datetime.datetime(2013, 6, 25, 13, 36, 58, 933191)}
2013-06-25 13:37:04+0000 [mspider] INFO: Spider closed (finished)

Grigoriy Petukhov

Jun 25, 2013, 10:14:32 AM
to scrapy...@googlegroups.com
Try this code: https://github.com/lorien/grab/blob/master/grab/tools/lxml_tools.py#L11
It produces different results depending on the value of the `smart` option.
In any case, you can get some ideas from the implementation of this function and write your own version.
The `node` argument of the `get_node_text` function is an ElementTree node, which you can get with::

    from lxml.html import fromstring
    node = fromstring(HTML_CONTENT)
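
For example, a rough sketch of how it might be wired together (the `grab.tools.lxml_tools` import path and the exact effect of `smart` are my assumptions based on the linked file)::

    from lxml.html import fromstring
    from grab.tools.lxml_tools import get_node_text  # assumed import path, per the URL above

    node = fromstring(HTML_CONTENT)  # HTML_CONTENT is the raw page source
    # the result differs depending on the `smart` option, as noted above
    text = get_node_text(node, smart=True)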


--
WBR, Grigoriy Petukhov (http://datalab.io)

Yves S. Garret

Jun 25, 2013, 10:34:26 AM
to scrapy...@googlegroups.com

Ok.  But when I looked at that code, I didn't want to turn the HTML into an
ElementTree; I just want to remove the tags and leave the text. Or did I
misunderstand something?

Grigoriy Petukhov

Jun 25, 2013, 11:16:19 AM
to scrapy...@googlegroups.com
In your first message of this thread you provided source code that uses HtmlXPathSelector, which (under the hood) builds an lxml ElementTree object or something similar (if libxml2 is used instead of lxml). So... why are you worried about building an ElementTree object?

How do Scrapy selectors work? The selector engine takes some HTML and converts it into a DOM tree, a special object that represents all HTML elements and the connections between them. If the HTML is invalid, some repair operations take place before the DOM tree is built. Once the DOM tree is ready, you can extract elements from it with XPath queries.

I am just trying to tell you that a Scrapy selector is an additional layer on top of another technology, and in your case it is better to look at the lower level: the lxml library.
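
For instance, lxml can strip the tags by itself. A minimal sketch (`text_content()` is lxml's own method for exactly this; I assume the raw HTML is available as `response.body`)::

    from lxml.html import fromstring

    tree = fromstring(response.body)  # parse the raw HTML into an lxml tree
    text = tree.text_content()        # all the text, with every tag stripped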

Also, you can always get a quick and rough result with regular expressions::

    import re
    re_tag = re.compile(r'<[^>]+>')
    text = re_tag.sub(' ', html)

Paul Tremberth

Jun 25, 2013, 11:18:58 AM
to scrapy...@googlegroups.com
Did you try appending 'text()' to the end of your XPath expressions for elements? (not for @attributes, that is)
See http://doc.scrapy.org/en/latest/topics/selectors.html#htmlxpathselector-objects for reference

With what Scrapy selectors provide, you should not need MLStripper (I don't know what that does, by the way).
Things like
    hxs.select("//p/text()").extract()
ought to give you what you want directly.
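
For example, your parse_pages could shrink to something like this (an untested sketch reusing your field names; I use //text() instead of /text() so that text inside nested tags is picked up too, and note that .extract() returns a list of strings, which is also why MLStripper.feed() raised the TypeError in your traceback):

    def parse_pages(self, response):
      hxs = HtmlXPathSelector(response)
      item = MspiderItem()

      # text() selects the text nodes directly, so no tag stripping is needed
      item['links']     = hxs.select("//a/@href").extract()
      item['paragraph'] = hxs.select("//p//text()").extract()
      item['div']       = hxs.select("//div//text()").extract()
      item['span']      = hxs.select("//span//text()").extract()
      return item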