How to use TOR?

Peter Chatzilampros

May 18, 2012, 3:51:57 AM
to scrapy...@googlegroups.com
Hello,
Does anybody know how to use TOR with scrapy on Ubuntu?
I don't know much about TOR.
Do I have to make a TOR account and install some special packages?
Do I have to write a middleware?
Is there any sample code?


Максим Горковский

May 18, 2012, 4:06:54 AM
to scrapy...@googlegroups.com
You would have to install tor and polipo from packages and write a simple middleware which forces tor to change its route when scrapy retries a page:

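# Written for the Python 2 / Scrapy 0.x releases current at the time.
import telnetlib
import time

from scrapy import log
from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware

class RetryChangeProxyMiddleware(RetryMiddleware):
    def _retry(self, request, reason, spider):
        log.msg('Changing proxy')
        # Connect to tor's control port (ControlPort in torrc), not the SOCKS port.
        tn = telnetlib.Telnet('127.0.0.1', 9051)
        # The password must match the one configured for your own tor.
        tn.write('AUTHENTICATE "267765"\r\n')
        tn.read_until("250 OK", 2)
        # NEWNYM asks tor to use fresh circuits for new connections.
        tn.write("signal NEWNYM\r\n")
        tn.read_until("250 OK", 2)
        tn.write("quit\r\n")
        tn.close()
        # Give tor a moment to build the new circuit before retrying.
        time.sleep(3)
        log.msg('Proxy changed')
        return RetryMiddleware._retry(self, request, reason, spider)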

then use it in settings.py:

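DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in retry middleware that this class replaces.
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': None,
    'spider.middlewares.RetryChangeProxyMiddleware': 600,
}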

and then you just need to send requests through the local tor proxy (polipo), which can be done with:
tsocks scrapy crawl spider

I did this long ago, but the solution seems OK and, most importantly, it works.
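
An alternative to wrapping the crawl in tsocks, sketched here as an assumption rather than something tested in this thread: Scrapy reads the proxy for a request from `request.meta['proxy']`, so a small downloader middleware could point every request at polipo's HTTP port. The class name `PolipoProxyMiddleware`, the stand-in `FakeRequest`, and the address `http://127.0.0.1:8123` (polipo's default port) are illustrative. Polipo itself would still need to forward to tor, for example with `socksParentProxy = "localhost:9050"` and `socksProxyType = socks5` in its config.

```python
class PolipoProxyMiddleware(object):
    """Route every request through a local HTTP proxy (hypothetical name)."""

    def __init__(self, proxy_url="http://127.0.0.1:8123"):
        # Assumed address: polipo listening on its default port.
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Keep a proxy that was already chosen for this request.
        request.meta.setdefault("proxy", self.proxy_url)
        return None  # let Scrapy continue processing the request


class FakeRequest(object):
    """Stand-in for scrapy.Request, only for trying the class without Scrapy."""

    def __init__(self):
        self.meta = {}


demo = FakeRequest()
PolipoProxyMiddleware().process_request(demo, spider=None)
print(demo.meta["proxy"])  # http://127.0.0.1:8123
```

To try it in a project, the class would be registered in DOWNLOADER_MIDDLEWARES like the retry middleware above.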


2012/5/18 Peter Chatzilampros <phatzi...@yahoo.gr>

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To post to this group, send email to scrapy...@googlegroups.com.
To unsubscribe from this group, send email to scrapy-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/scrapy-users?hl=en.



--
Best regards,
Максим Горковский

Максим Горковский

May 18, 2012, 4:08:53 AM
to scrapy...@googlegroups.com
Also, you should configure tsocks to send requests to the local tor proxy. Usually it is localhost:9050 (tor's default SOCKS port), depending on whether or not you changed the default tor settings.

2012/5/18 Максим Горковский <ragzo...@gmail.com>

Pablo Hoffman

May 21, 2012, 1:23:43 PM
to scrapy...@googlegroups.com, Максим Горковский
Thanks Kazimir, I've added a link to your message in the wiki:
https://github.com/scrapy/scrapy/wiki

Feel free to add any other links with useful recipes there.

Максим Горковский

Jul 23, 2012, 1:07:47 AM
to scrapy...@googlegroups.com
It doesn't. The IP changes only when a request is unsuccessful and gets retried.

2012/7/23 embedded <asi...@gmail.com>

I'm looking into this snippet.

Can anyone confirm that this snippet works?
Does it change the IP on each request?

Waiting for replies.

thanks
2012/5/18 Peter Chatzilampros <phatzi...@yahoo.gr>

--
Best regards,
Максим Горковский

embedded

Jul 23, 2012, 2:34:58 AM
to scrapy...@googlegroups.com
Is it possible to change the IP on a regular basis, for example every X requests?
How can I see which IP each request uses?

Thanks


On Monday, July 23, 2012 8:07:47 AM UTC+3, Kazimir wrote:
it doesn't. ip changes when requests are unsuccessful

Максим Горковский

Jul 23, 2012, 2:49:46 AM
to scrapy...@googlegroups.com
Make a middleware, move the IP-changing code into it, and log the IP of each request.
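
A minimal sketch of that idea, under stated assumptions: the middleware counts outgoing requests and calls a `renew` function every `every` requests. The names `RotateIdentityMiddleware`, `renew` and `every` are made up for illustration. `renew` stands for whatever performs the NEWNYM exchange shown earlier in the thread, and wiring it up inside a real project (for example through `from_crawler`) is left out. Note that tor may rate-limit NEWNYM, and a new identity only affects connections opened afterwards.

```python
import logging

logger = logging.getLogger(__name__)


class RotateIdentityMiddleware(object):
    """Ask for a new tor identity every `every` requests (hypothetical name)."""

    def __init__(self, renew, every=50):
        self.renew = renew    # callable that performs the NEWNYM exchange
        self.every = every    # how many requests to send between renewals
        self.seen = 0         # requests counted so far

    def process_request(self, request, spider):
        self.seen += 1
        if self.seen % self.every == 0:
            logger.info("Request %d: asking tor for a new identity", self.seen)
            self.renew()
        return None  # continue normal processing of the request


# Try it without Scrapy or tor: record each renewal instead of doing one.
calls = []
middleware = RotateIdentityMiddleware(renew=lambda: calls.append("newnym"), every=3)
for _ in range(7):
    middleware.process_request(request=None, spider=None)
print(len(calls))  # 2 (after the 3rd and the 6th request)
```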


2012/7/23 embedded <asi...@gmail.com>

embedded

Jul 23, 2012, 3:15:53 AM
to scrapy...@googlegroups.com
Could you please share the IP-changing code?
I'm not familiar with TOR and could not find details about it on the web.

Thanks!


On Monday, July 23, 2012 9:49:46 AM UTC+3, Kazimir wrote:
make a middleware and move ip-changing code into it and log ip of each request.

embedded

Jul 23, 2012, 3:42:26 AM
to scrapy...@googlegroups.com, Peter Chatzilampros
Hi there,

I need to use TOR for my scrapy project under Ubuntu.
Did you manage to get it working?

I would be more than happy to receive a code snippet and instructions on how to make it happen.

Thanks

embedded

Jul 23, 2012, 3:26:30 PM
to scrapy...@googlegroups.com
Hi Kazimir,

I tried your code snippet and got the following error:
(I changed the telnet port to 9050, since 9051 did not work.)

Jul 23 22:23:41.364 [warn] Socks version 65 not recognized. (Tor is not an http proxy.)
Jul 23 22:23:41.365 [warn] Fetching socks handshake failed. Closing.

Do you know how to fix this?

10x


On Monday, July 23, 2012 9:49:46 AM UTC+3, Kazimir wrote:
make a middleware and move ip-changing code into it and log ip of each request.

Максим Горковский

Jul 23, 2012, 8:31:16 PM
to scrapy...@googlegroups.com
It seems you have forgotten about tsocks. That part is covered above.
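
A likely explanation for the warning, offered as a reading of the log rather than something confirmed in this thread: 9050 is tor's SOCKS port, and a SOCKS listener treats the first byte it receives as the protocol version. If the control command from the snippet is sent there, that byte is the `A` of `AUTHENTICATE`:

```python
# First byte of the control command, as a SOCKS listener would see it.
command = b'AUTHENTICATE "267765"\r\n'
first_byte = command[0]
print(first_byte)  # 65, the ASCII code of "A"

# SOCKS4 and SOCKS5 handshakes start with the byte 4 or 5 respectively.
print(first_byte in (4, 5))  # False
```

If that reading is right, the telnet connection belongs on the control port (9051 in the snippet), while tsocks points at 9050.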

2012/7/24 embedded <asi...@gmail.com>

Tsouras

Sep 25, 2012, 7:45:33 AM
to scrapy...@googlegroups.com
On Ubuntu you have to install the packages tor, polipo and tsocks:
$ sudo apt-get install tor polipo tsocks

How to configure tsocks on Ubuntu 12.04:
Edit the file /etc/tsocks.conf (you have to be root) and change the following settings:
server=127.0.0.1
server_type=5
server_port=9050

Don't forget to change telnet port into 9050 as Kazimir wrote ("tn = telnetlib.Telnet('127.0.0.1', 9050)")

Tsouras

Sep 25, 2012, 3:49:48 PM
to scrapy...@googlegroups.com
In addition, you have to edit the file /etc/tor/torrc: uncomment "ControlPort 9051" and uncomment "CookieAuthentication 0" (it is 0 by default).
Then you have to restart tor using "sudo /etc/init.d/tor restart".


"""Don't forget to change telnet port into 9050 as Kazimir wrote ("tn = telnetlib.Telnet('127.0.0.1', 9050)")"""
This is wrong, I am sorry for the confusion. You have to leave it as it was: "tn = telnetlib.Telnet('127.0.0.1', 9051)"

Finally, you have to change the password in Kazimir's code because it applies only to his configuration.
So you have to change "tn.write('AUTHENTICATE "267765"\r\n')" into "tn.write('AUTHENTICATE ""\r\n')"
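
For newer Python versions, here is a hedged sketch of the same control-port exchange using only the `socket` module (`telnetlib` is no longer in the standard library as of Python 3.13). The function names `control_commands` and `renew_identity` are made up for illustration, and the defaults assume the setup described above: ControlPort 9051 and an empty password. If a HashedControlPassword is set in torrc (a hash can be generated with `tor --hash-password`), the plain password would go into AUTHENTICATE instead.

```python
import socket


def control_commands(password=""):
    """Return the control-port lines that authenticate and request NEWNYM."""
    # Backslashes and double quotes inside the quoted string are escaped.
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    return [
        'AUTHENTICATE "%s"\r\n' % escaped,
        "SIGNAL NEWNYM\r\n",
        "QUIT\r\n",
    ]


def renew_identity(password="", host="127.0.0.1", port=9051, timeout=5.0):
    """Send the commands; return True only if every reply starts with 250."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        with conn.makefile("rb") as reader:
            for line in control_commands(password):
                conn.sendall(line.encode("utf-8"))
                reply = reader.readline().decode("utf-8", "replace")
                if not reply.startswith("250"):
                    return False  # e.g. authentication was rejected
    return True


print(control_commands()[0].strip())  # AUTHENTICATE ""
```

Calling `renew_identity()` needs a running tor with its control port enabled; without one it raises a connection error.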

Максим Горковский

Oct 2, 2012, 11:53:02 PM
to scrapy...@googlegroups.com
This library looks really powerful and useful. Maybe I'll check it out later, thanks.

2012/10/3 Łukasz Kurowski <crac...@gmail.com>
To use TOR as a proxy I'm using Privoxy; for changing identity, try https://gitweb.torproject.org/pytorctl.git


Pablo Hoffman

Aug 5, 2013, 6:19:32 PM
to scrapy-users
from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware
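
The path above matches the Scrapy releases of that time. As far as I know, Scrapy 1.0 and later expose the class from `scrapy.downloadermiddlewares.retry` instead, so a version-tolerant lookup might look like the sketch below. `find_retry_middleware` is a made-up helper name, and it returns `None` when Scrapy is not installed.

```python
import importlib

# Candidate module paths, newest first. The second one is the path quoted
# in this thread; the first is the location assumed for Scrapy 1.0 and later.
CANDIDATES = (
    "scrapy.downloadermiddlewares.retry",
    "scrapy.contrib.downloadermiddleware.retry",
)


def find_retry_middleware():
    """Return the RetryMiddleware class, or None if no candidate imports."""
    for module_path in CANDIDATES:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue  # path not available in this installation
        return getattr(module, "RetryMiddleware", None)
    return None


RetryMiddleware = find_retry_middleware()
```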


On Tue, Jul 30, 2013 at 3:04 AM, Ajai kumar k <ajai....@gmail.com> wrote:
"NameError: name 'RetryMiddleware' is not defined"

From where can I import RetryMiddleware ?

Mayank Chutani

May 25, 2015, 5:23:12 AM
to scrapy...@googlegroups.com, ragzo...@gmail.com
I tried your snippet, seems like scrapy is hanging:


2015-05-25 13:32:46+0530 [scrapy] INFO: Scrapy 0.24.4 started (bot: proxy)
2015-05-25 13:32:46+0530 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-05-25 13:32:46+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'proxy.spiders', 'SPIDER_MODULES': ['proxy.spiders'], 'BOT_NAME': 'proxy'}
2015-05-25 13:32:46+0530 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-25 13:32:46+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-25 13:32:46+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-25 13:32:46+0530 [scrapy] INFO: Enabled item pipelines: 
2015-05-25 13:32:46+0530 [proxy] INFO: Spider opened
2015-05-25 13:32:46+0530 [proxy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-25 13:32:46+0530 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-25 13:32:46+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-25 13:33:46+0530 [proxy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-25 13:34:46+0530 [proxy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-25 13:34:54+0530 [proxy] DEBUG: Retrying <GET https://check.torproject.org/> (failed 1 times): TCP connection timed out: 110: Connection timed out.
2015-05-25 13:34:54+0530 [proxy] DEBUG: Retrying <GET http://my-ip.heroku.com> (failed 1 times): TCP connection timed out: 110: Connection timed out.

Any ideas??
Also, please describe the authentication and how you're using it.

On Friday, May 18, 2012 at 1:36:54 PM UTC+5:30, Максим Горковский wrote: