Re: Using Scrapy to scrape Google News

Pablo Hoffman

Nov 4, 2012, 5:21:31 PM

to scrapy...@googlegroups.com

You should find out which AJAX requests your browser makes (using the Chrome inspector or Firebug) and replicate those in Scrapy. One of those requests will probably carry the date range in its arguments, and that is the one you'll want to modify in your spider.
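As a rough illustration of "replicate and modify", here is a minimal sketch that takes a URL copied from the browser's network panel and swaps in a different date range. The host `example.com` and the parameter names `date_min` and `date_max` are made up for the example; the real names have to be read off the captured request.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_params(url, **overrides):
    """Return `url` with the given query parameters replaced or added."""
    parts = urlsplit(url)
    # Keep the existing parameters (including blank ones) in their original order.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update(overrides)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical URL as it might be copied from the browser inspector.
captured = "https://example.com/search?q=scrapy&date_min=2012-01-01&date_max=2012-12-31"

# Same request, but with the date range the spider should ask for.
url = with_params(captured, date_min="2010-01-01", date_max="2011-12-31")
print(url)
```

In a spider, the resulting string would be the URL you build the request from, for example by yielding `scrapy.Request(url, callback=self.parse)`.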

On Sat, Nov 3, 2012 at 2:35 PM, naaboo <malte.sp...@gmail.com> wrote:

heya!

I would like to scrape the news archive of Google News, given a certain keyword.

I have set up Scrapy on an Amazon EC2 instance running Ubuntu (just the CLI), and it's all working fine: Scrapy collects the data and saves it into a MySQL table.

Now my problem is that I would like to use the Google News filter to, e.g., scrape all news articles published between 2010 and 2011.
Google News has this option (if you type in a keyword, there is a little gear button in the input field, or you can use the options on the left-hand side).

My problem now is that Scrapy does not execute JavaScript, and the Google filter can only be applied through JavaScript.
If I configure everything in my browser and use the resulting URL in Scrapy, I get a non-JavaScript version of Google News with no filters applied.

So far I have been trying to understand how WebKit or Selenium work, but all I understood is that they use a real browser which is opened in the OS, and then use an API to access this browser.
Since I am using the CLI only, I don't have the option to run a browser.

Can you guys help me out and point me in the right direction of how I could use Scrapy and JavaScript without having to use a "real" browser?

Thank you very much

Best
naaboo

naaboo

Nov 11, 2012, 8:35:18 AM

to scrapy...@googlegroups.com

Thanks for your answer!

I have looked at the request that Google makes when the time range is changed:

If I then just paste this URL into a different browser (no sessions, etc.), it works without problems.

But if I try to enter this URL into a browser with JavaScript deactivated, then I get redirected to a different page

(the same happens if I use this URL in Scrapy and save the response to a file)

All I can make of it is that the response to the request is again parsed by JavaScript, and Google's Live Search kicks in and gets the results.

Or am I missing something basic?

Thanks for your help!

naaboo
