Tutorial: error in data extraction


Anne Schumann

Oct 19, 2016, 5:56:32 AM
to scrapy-users
Hi there,

I am trying to reproduce the tutorial. Things seem fine, but I run into a problem in the data extraction step. I do:

scrapy shell 'http://www.zeit.de/index'

This gives me an error here:

2016-10-19 11:36:30 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-19 11:36:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-19 11:36:30 [scrapy] INFO: Spider opened
2016-10-19 11:36:30 [scrapy] DEBUG: Retrying <GET http://'http:/robots.txt> (failed 1 times): DNS lookup failed: address "'http:" not found: [Errno 11001] getaddrinfo failed.
2016-10-19 11:36:30 [scrapy] DEBUG: Retrying <GET http://'http:/robots.txt> (failed 2 times): DNS lookup failed: address "'http:" not found: [Errno 11001] getaddrinfo failed.
2016-10-19 11:36:30 [scrapy] DEBUG: Gave up retrying <GET http://'http:/robots.txt> (failed 3 times): DNS lookup failed: address "'http:" not found: [Errno 11001] getaddrinfo failed.
2016-10-19 11:36:30 [scrapy] ERROR: Error downloading <GET http://'http:/robots.txt>: DNS lookup failed: address "'http:" not found: [Errno 11001] getaddrinfo failed.
DNSLookupError: DNS lookup failed: address "'http:" not found: [Errno 11001] getaddrinfo failed.
2016-10-19 11:36:30 [scrapy] DEBUG: Retrying <GET http://'http://www.zeit.de/index'> (failed 1 times): DNS lookup failed: address "'http:" not found: [Errno 11001] getaddrinfo failed.
2016-10-19 11:36:30 [scrapy] DEBUG: Retrying <GET http://'http://www.zeit.de/index'> (failed 2 times): DNS lookup failed: address "'http:" not found: [Errno 11001] getaddrinfo failed.
2016-10-19 11:36:30 [scrapy] DEBUG: Gave up retrying <GET http://'http://www.zeit.de/index'> (failed 3 times): DNS lookup failed: address "'http:" not found: [Errno 11001] getaddrinfo failed.

What follows is a Python traceback. I also tried the tutorial URLs; the result was the same.

I actually think the problem is here:

2016-10-19 11:36:30 [scrapy] DEBUG: Retrying <GET http://'http:/robots.txt>

This URL is obviously wrong. Any hints on how this can be fixed?

Cheers,
Anne

Paul Tremberth

Oct 19, 2016, 6:01:48 AM
to scrapy-users
Hello Anne,

Are you using Windows perhaps?
There was a bug report the other day: https://github.com/scrapy/scrapy/issues/2325

It appears that single and double quotes on Windows do not behave the same as on Linux and macOS.
So you need to use double quotes instead:

scrapy shell "http://www.zeit.de/index"

The documentation will be fixed soon enough (see https://github.com/scrapy/scrapy/pull/2339)
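For anyone curious why the log shows <GET http://'http://www.zeit.de/index'>: cmd.exe passes single quotes through literally, so Scrapy receives an argument that still contains them and, seeing no recognizable scheme at the start, prepends http://. A minimal sketch of that failure mode (the normalize helper below is hypothetical, not Scrapy's actual code):

```python
# On cmd.exe, single quotes are not stripped, so the argument keeps them;
# on bash/zsh, the shell removes them before the program sees the string.
arg_on_windows = "'http://www.zeit.de/index'"
arg_on_bash = "http://www.zeit.de/index"

def normalize(url):
    """Hypothetical sketch of scheme-guessing: prepend http:// if missing."""
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    return url

print(normalize(arg_on_windows))  # http://'http://www.zeit.de/index'
print(normalize(arg_on_bash))     # http://www.zeit.de/index
```

The quoted string fails the scheme check, so "http://" gets prepended and the leading `'http:` ends up being treated as the hostname, which is exactly the DNS lookup failure in the log above.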

Best,
Paul.

Anne-Kathrin Schumann

Oct 19, 2016, 6:18:46 AM
to scrapy...@googlegroups.com

Hi Paul,

yes, I use a Windows machine. And the problem was with the quotes! Solved! Thank you!

Best,
a

