2012/6/29 Maik Derstappen <maik.de...@googlemail.com>
Hi, I'm wondering how to crawl the results pages after sending a POST request to search for data.
I tried post_urls in the crawler like this, but without success:
class someCrawler(BaseCrawler):
    start_urls = ["http://somewebsite.com"]
    post_urls = [
        ("http://somewebsite.com/search", {'searchText': 'django'}),
    ]
There is a redirect after submitting the search form (for session handling). Could that be the problem, or am I doing something wrong?
Any ideas?
Thanks, Maik
Hey Maik,
What is it that you need: the contents of the redirect, or something in the POST response?
If you need the contents of the redirect, the response object that you get back holds the redirects, so you can access them and scrape the data from there.
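To make the POST-then-redirect flow concrete, here is a minimal sketch that does not use crawley at all. It relies only on the Python 3 standard library (urllib.request, the successor of urllib2), and the tiny local server, the cookie name and the /results path are invented stand-ins for the real site. It shows that an opener with a cookie jar keeps the session cookie set by the POST response and follows the redirect to the results page.

```python
import http.cookiejar
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the real site."""

    def do_POST(self):
        # Read the submitted form, remember the term in a cookie, redirect.
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode())
        term = form.get("searchText", [""])[0]
        self.send_response(303)
        self.send_header("Set-Cookie", "term=" + term)
        self.send_header("Location", "/results")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        # The results page depends on the cookie set during the POST.
        body = ("results for " + self.headers.get("Cookie", "")).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def search(base_url, term):
    """POST the search form; return (final URL, body) after the redirect."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    data = urllib.parse.urlencode({"searchText": term}).encode()
    with opener.open(base_url + "/search", data) as response:
        return response.geturl(), response.read().decode()


def demo():
    """Start the local stand-in server, run one search, shut down again."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return search("http://127.0.0.1:%d" % server.server_port, "django")
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() returns the URL of the results page and its body. Against a real site only search() would be needed, with the real base URL and form field names.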
If you need something else just ask,
Hi Maik,
Yeah, you can. Just set max_concurrency_level to 1 on your crawler. It would look like:
class someCrawler(BaseCrawler):
    max_concurrency_level = 1
    start_urls = ["http://somewebsite.com"]
    post_urls = [
        ("http://somewebsite.com/search", {'searchText': 'django'}),
    ]
Then you can put a pdb breakpoint in the code to debug crawley without parallelization. Let us know whether that helps you understand what is happening here.
-- Maik Derstappen Geschäftsführer Inqbus GmbH & Co. KG Softwareentwicklung, Consulting & Hosting Karl-Heine-Straße 99 | 04229 Leipzig | Deutschland Telefon: +49 341 989758-52 Fax: +49 341 989758-72 E-Mail: maik.de...@inqbus.de Web: http://inqbus.de/ Persönlich haftende Gesellschafterin: Inqbus Management GmbH (Amtsgericht Leipzig, HRB 27350) Vertretungsberechtigte Geschäftsführer: Maik Derstappen, Dr. Volker Jaenisch, Thomas Massmann, Markus Zapke-Gründemann Registergericht: Amtsgericht Leipzig Registernummer: HRA 16424 Umsatzsteuer-Identifikationsnummer: DE278744671
I also tried setting POOL = 'threads' in settings.py to see if that would work, but it does not.
I have figured out how to crawl the website with urllib2 and will now try to integrate this with crawley.
I have to submit the search form many times for all my zip codes.
Then I'll crawl the batched results.
In the end I want to have one long list of items (scraped data).
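The plan above could be sketched roughly as follows, using Python 3's urllib.request rather than urllib2 and no crawley integration. The names collect_items and post_form, the parse callback, and the reuse of the searchText field for zip codes are all assumptions made for illustration; the real form fields and the parsing logic depend on the site.

```python
import urllib.parse
import urllib.request


def post_form(url, fields):
    """POST form fields and return the decoded response body."""
    data = urllib.parse.urlencode(fields).encode()
    with urllib.request.urlopen(url, data) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_items(zip_codes, search_url, parse, fetch=post_form):
    """Submit the search form once per zip code and merge all parsed items.

    parse is a caller-supplied function that turns one response body into
    a list of items; fetch can be swapped out, e.g. for testing offline.
    """
    items = []
    for zip_code in zip_codes:
        # Assumption: the form takes the zip code in a 'searchText' field.
        body = fetch(search_url, {"searchText": zip_code})
        items.extend(parse(body))
    return items
```

Keeping fetch as a parameter means the same loop can later be pointed at an opener with a cookie jar, or at whatever crawley provides, without touching the collecting logic.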
Thank you for your advice, Maik
Do you have the same problem (pdb fails) using 'threads' with the code in the master branch on GitHub?