not getting same results as shown in tutorial


Malik Rumi

Aug 8, 2015, 2:36:28 PM
to scrapy-users
Here is my code:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

Here is the code from the tutorial:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

I can't see any difference here, but the result shown in the tutorial is:

[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
      'link': [u'http://gnosis.cx/TPiP/'],
      'title': [u'Text Processing in Python']}
[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
      'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
      'title': [u'XML Processing with Python']}

But my result looks like this:

2015-08-08 13:14:55 [scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\r\n\t\r\n                                ',
          u' \r\n\t\t\t\r\n                                - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\r\n                                \r\n                                ',
          u'\r\n                                '],
 'link': [u'http://gnosis.cx/TPiP/'],
 'title': [u'Text Processing in Python']}
2015-08-08 13:14:55 [scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\r\n\t\r\n                                ',
          u' \r\n\t\t\t\r\n                                - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\r\n                                \r\n                                ',
          u'\r\n                                '],
 'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
 'title': [u'XML Processing with Python']}



Actually, my result is worse than this; I just gave you a snippet to match what is in the tutorial. In reality I got the whole dmoz page, with LOTS and LOTS of newlines, whitespace, and so on.

The tutorial does not say anything about running strip() or anything like it, so how did they get their result while I got what I got? Further, the tutorial says:

After inspecting the page source, you’ll find that the web site’s information is inside a <ul> element, in fact the second <ul> element.


When I look at the source, the information is in the fourth <ul> element. Maybe I can't count, maybe the writers of the tutorial can't count, or maybe the page has changed, but I can't see how the change from 2nd to 4th alone would account for all this whitespace.

I tried indexing to see if that would narrow the result:

        for sel in response.xpath('//ul[4]/li'):

but that and [3] got me no data. [2] got me the same data as no index reference at all.

So if someone can help me understand why I got all this whitespace, \t, \n, and \r, and how to eliminate them, I would be very happy.


Travis Leleu

Aug 8, 2015, 3:00:14 PM
to scrapy-users
It's possible DMOZ updated the layout of their HTML since the tutorial was written.  It's also possible that the underlying libraries changed how they process or remove text.  Most likely, the tutorial was written for scrapy v0.2x, and you're probably using a more recent version.

Really, though, you're going to need to adopt a more resilient attitude in order to succeed at data scraping.  Trust your judgement: you have strings with whitespace, so call str.strip().  You have \r\n to remove?  Call str.replace('\r\n', '').

The scrapy docs are of varying quality.  The development group has been pushing to get 1.0 out the door, so some things have changed and they haven't yet had the chance to update the documentation.

I completely agree that it's very frustrating when you're trying to learn something and the documentation doesn't match what you see.  A big part of learning is the feedback loop between trying something and comparing against an established, documented result.

This is a great opportunity for you to contribute back to a project that you derive value from.  I see many posts on this list asking how to get involved -- perhaps this can be a call to action for anyone interested.  Updating the tutorial is probably the single most important thing for the entire project, because the tutorial is where most users dip their toes in the water to test it out.

A bad first experience likely discourages many first time users.


Malik Rumi

Aug 8, 2015, 5:33:38 PM
to scrapy...@googlegroups.com
Because it is so rare, I always try to take the time to point out when someone actually answers the question I ask, and thank them for it. So thank you for not only answering but providing reasonable explanations for the difficulties I encountered. PLUS it was really fast!  ☺

I do have a minor quibble with your description of my resilience, because I am self-taught in coding and have come quite a long way through a lot of challenges. But there's no way for you to know that, and anyway it's not a huge deal. Then why bring it up? To honor the fact that I have been sticking this out. In other words, for me, not for you or anyone else reading this.

You pose an interesting challenge to get involved, and that's both good and valid. More people should issue such challenges, and more should take them up.


Travis Leleu

Aug 8, 2015, 6:05:29 PM
to scrapy-users
Hi Malik,

Apologies if it came off a bit rough.  I didn't mean to direct that at you in particular -- I have no idea who you are or what your history is, and it would be presumptuous of me to assume.

I meant it more in a general sense.  Scraping is very rewarding, and can be quite lucrative, but it really requires a lot of patience.  Essentially you're reverse engineering a system that might not want you reverse engineering it.  And with all the js frontend technologies so often misapplied, breaking the web's semantics, sometimes you have to reverse engineer something quite ugly.

So my opinion is that perseverance is probably the most important mindset a scraper / data engineer can have.  You need a lot of tricks in your toolbelt -- countering blocking, rate limiting so as not to impact sites, proxifying requests, js scraping, ajax reverse engineering, plus a billion others.

It's why I love the topic -- there's so much to know, and every situation requires a slightly different application of techniques.  But once you have the right combination, it sure is fun when the data starts to fly in!