How to crawl search results pages?

Maik Derstappen

Jun 29, 2012, 4:55:02 PM
to crawle...@googlegroups.com
Hi, I'm wondering how to crawl the results pages after sending a POST request to search for data.

I tried post_urls in the crawler like this, but without success:

class someCrawler(BaseCrawler):
  start_urls = ["http://somewebsite.com"]
  post_urls = [
    ("http://somewebsite.com/search", {'searchText':'django'})
  ]

There is a redirect after submitting the search form (for session handling). Could that be the problem, or am I doing something wrong?

Any ideas?

Thanks, Maik

David Litvak

Jun 30, 2012, 4:53:41 PM
to crawle...@googlegroups.com


Hey Maik,

What is it that you need: the contents of the redirect, or something in the POST response?

If you need the contents of the redirect, the response object that you get contains the redirects, so you can access them and scrape the data from there.

If you need anything else, just ask.

Cheers

--
नारायण हरि ओम
सत्य नारायण हरि ओम
गोविन्द कृष्ण हरि ओम
गोपाल कृष्ण हरि ओम

David Litvak

Technical High School Diploma in Music Production - ORT
Systems Engineering student - UTN

http://vizualize.me/david.litvak
http://about.me/david.litvak
(011)15-6686-6714

Maik Derstappen

Jul 2, 2012, 9:24:33 AM
to crawle...@googlegroups.com
Hi David,

Thank you for your quick answer.
I need the detail pages that are listed on the search (POST) response page.
So I have to send several POST search requests with different zip codes and follow the links to the detail pages from the batched (paginated) result pages.
Then I'll scrape the detail pages.
But the website I'm trying to crawl does a redirect for session handling the first time I visit it.
There is a session parameter in the BaseCrawler class; should I do something there to handle the website's session?
And where should I create the different POST requests before crawling the result pages?

Thank you for your advice,
Maik
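
For reference, the "several POST searches with different zip codes" step can be sketched with the same (url, data) tuple shape that post_urls uses earlier in this thread. This is only a sketch under assumptions: the form field name "zip", the helper name build_post_urls, and the sample zip codes are hypothetical, and the thread does not confirm how crawley processes more than one post_urls entry.

```python
def build_post_urls(search_url, zip_codes, field_name="zip"):
    """Return one (url, form_data) tuple per zip code.

    field_name is a hypothetical form field; use the name the real
    search form actually submits.
    """
    return [(search_url, {field_name: str(code)}) for code in zip_codes]


ZIP_CODES = ["04229", "10115", "80331"]  # example values only

POST_URLS = build_post_urls("http://somewebsite.com/search", ZIP_CODES)

# In a crawler, the list would be assigned as a class attribute, in the
# same form as the post_urls example above (not verified against crawley):
#
#   class someCrawler(BaseCrawler):
#       start_urls = ["http://somewebsite.com"]
#       post_urls = POST_URLS
```

Building the list up front keeps the zip codes in one place, so the crawler class itself does not change when the list grows.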


David Litvak

Jul 2, 2012, 9:36:11 AM
to crawle...@googlegroups.com
Maik,

From the details you have given, my understanding is that you just need to follow the redirect and get the information from it.
If that is not enough, you can tell the crawler not to follow redirects, use the response from that request, and then continue to the redirect target yourself. Maybe Juan can give you more details about this (I don't remember how to do it right now).

Most likely you just need the data from the page after the redirect (are you trying to replicate the user experience?), since the data before the redirect shouldn't be accessible to users.
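
Outside crawley, both options (follow the redirect, or stop at it and inspect the first response) can be sketched with the Python 3 standard library, where urllib.request is the successor to urllib2. This is not crawley's API; NoRedirect, make_opener, and post_form are hypothetical names, and the cookie jar is an assumption about how the site tracks its session.

```python
import http.cookiejar
import urllib.error
import urllib.parse
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow 3xx responses."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib surface the 3xx as an HTTPError.
        return None


def make_opener(follow_redirects=True):
    """Return (opener, cookie_jar); cookies persist across requests."""
    jar = http.cookiejar.CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(jar)]
    if not follow_redirects:
        handlers.append(NoRedirect())
    return urllib.request.build_opener(*handlers), jar


def post_form(opener, url, form_data):
    """POST form_data; return (status, final_or_redirect_url, body)."""
    body = urllib.parse.urlencode(form_data).encode("utf-8")
    try:
        with opener.open(url, data=body) as resp:
            # Redirects (if followed) have already happened here.
            return resp.status, resp.geturl(), resp.read()
    except urllib.error.HTTPError as err:
        location = err.headers.get("Location")
        if 300 <= err.code < 400 and location:
            # Not followed: report where the server wanted to send us.
            return err.code, urllib.parse.urljoin(url, location), err.read()
        raise
```

Reusing one opener for the first visit and for the later search POSTs keeps any session cookie set during the initial redirect, for example `opener, jar = make_opener()` followed by `post_form(opener, "http://somewebsite.com/search", {"searchText": "django"})`.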

Maik Derstappen

Jul 2, 2012, 10:47:03 AM
to crawle...@googlegroups.com
Hi David,

Hm, but I still don't understand how to do this with crawley.
Is there a way to debug crawley without parallelizing, so that I can use pdb to understand what is going on here?

Thanks

  

Juan Manuel Garcia

Jul 2, 2012, 11:09:18 AM
to crawle...@googlegroups.com
Hi Maik,

Yeah, you can.
Just set the max_concurrency_level to 1 on your crawler.
It would look like:

class someCrawler(BaseCrawler):

  max_concurrency_level = 1

  start_urls = ["http://somewebsite.com"]
  post_urls = [
    ("http://somewebsite.com/search", {'searchText':'django'})
  ]


Then you can insert a pdb breakpoint in the code to debug crawley without parallelism.
Let us know whether you manage to work out what is happening.
--
Juan Manuel García
Software Developer

Maik Derstappen

Jul 2, 2012, 2:59:19 PM
to crawle...@googlegroups.com
Hi Juan,

I've done this, but when I put a pdb breakpoint in the on_start method, it doesn't stop there:


class someCrawler(BaseCrawler):
    max_concurrency_level = 1

    def on_start(self):
        # Expected to drop into the debugger when the crawl starts.
        import pdb; pdb.set_trace()


Running:
crawley run

results in:

[...]
    response = self.request_manager.make_request(url, data, self.extractor)
  File "/usr/lib/python2.7/bdb.py", line 48, in trace_dispatch
    return self.dispatch_line(frame)
  File "/usr/lib/python2.7/bdb.py", line 67, in dispatch_line
    if self.quitting: raise BdbQuit
BdbQuit


I also tried setting POOL = 'threads' in settings.py to see if that would work, but it does not.
I have figured out how to crawl the website with urllib2 and will now try to integrate this with crawley.

I have to submit the search form many times, once for each of my zip codes.
Then I'll crawl the batched results.

In the end I want to have one long list of items (scraped data).
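
The plan above (one search POST per zip code, then following the links on each batched result page) needs a step that pulls the detail-page links out of the result HTML. A minimal sketch using only the Python 3 standard library follows; the "/detail/" marker, the helper names, and the sample markup in the usage note are assumptions for illustration, not taken from the real site or from crawley.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DetailLinkParser(HTMLParser):
    """Collect absolute URLs of <a> links whose href contains a marker."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            # Resolve relative links against the result page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_detail_links(html, base_url, marker="/detail/"):
    """Return the detail links found in one result page."""
    parser = DetailLinkParser(base_url, marker)
    parser.feed(html)
    return parser.links


def collect_detail_links(result_pages, base_url, marker="/detail/"):
    """Merge links from many result pages, dropping duplicates, keeping order."""
    seen = set()
    ordered = []
    for html in result_pages:
        for link in extract_detail_links(html, base_url, marker):
            if link not in seen:
                seen.add(link)
                ordered.append(link)
    return ordered
```

Given two result pages such as `'<a href="/detail/1">A</a>'` and `'<a href="/detail/1">A</a><a href="/detail/2">B</a>'`, collect_detail_links returns each detail URL once, which matters when neighbouring zip codes return overlapping results.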

Thank you for your advice, Maik
-- 
Maik Derstappen
Managing Director

Inqbus GmbH & Co. KG
Software Development, Consulting & Hosting
Karl-Heine-Straße 99 | 04229 Leipzig | Germany

Phone: +49 341 989758-52
Fax: +49 341 989758-72
Email: maik.de...@inqbus.de
Web: http://inqbus.de/

General partner: Inqbus Management GmbH (Amtsgericht Leipzig, HRB 27350)
Authorized managing directors: Maik Derstappen, Dr. Volker Jaenisch, Thomas Massmann, Markus Zapke-Gründemann

Court of registration: Amtsgericht Leipzig
Registration number: HRA 16424

VAT identification number: DE278744671

Juan Manuel Garcia

Jul 5, 2012, 11:47:04 PM
to crawle...@googlegroups.com
Hi Maik,

Thanks for reporting the issue.

This is a weird problem that happened to me on Python 2.7 but not on Python 2.6.
It could be an issue with the eventlet library on Python 2.7.
I have to research this more, because it is very disturbing: the pdb module just fails
with the eventlet greenlets.

 
> I also tried setting POOL = 'threads' in settings.py to see if that would work, but it does not.

Do you have the same problem (pdb fails) using 'threads' with the code in the master branch on GitHub?
 

Thank you for helping to debug crawley!

Maik Derstappen

Jul 6, 2012, 7:15:24 AM
to crawle...@googlegroups.com
Hi Juan,
I had the same problem on both Python 2.6 and Python 2.7, with greenlets as well as threads.
BTW, the master branch now requires PyQt4, even if I only run:
crawley run

I don't think this is good, because this requirement is not easy to install and may prevent people from using crawley.



 
Thanks, Maik