Scraping dynamic content of a website


Binit Singh

May 6, 2013, 8:01:48 AM
to scrap...@googlegroups.com
Hi,

I am writing a scraper on ScraperWiki to collect the information for every mobile phone listed by an online store (http://www.snapdeal.com/products/mobiles-mobile-phones?q=Price%3A999%2C56599). I am using urllib2.urlopen() to fetch the page.
But it only returns the first 20 of the 1000 available products, because each further batch of 20 is loaded dynamically when you scroll to the bottom of the page.

I think the page uses AJAX or jQuery to pull in the data dynamically. Someone suggested that I use Windmill. Would that help?

What is the best way to scrape a dynamic website where most of the content appears to be loaded by AJAX requests?

Svavar Kjarrval

May 6, 2013, 10:56:45 AM
to scrap...@googlegroups.com
Check the AJAX requests and make a scraper that executes the same or
similar requests.

- Svavar Kjarrval

Binit Singh

May 20, 2013, 8:03:26 AM
to scrap...@googlegroups.com
How would that be possible? I need an example.

Zarino Zappia

May 20, 2013, 10:34:18 AM
to scrap...@googlegroups.com
This is not going to be easy. I don't suggest you try this unless you have programming experience and some understanding of how web requests work.

Rough steps:

  1. Use Google Chrome.
  2. Find the web page you're interested in.
  3. Open the Developer Tools (F12 or Ctrl-Shift-J on a PC, ⌥-⌘-J on a Mac)
  4. Switch to the Network tab in the Developer Tools window.
  5. Reload your page. As each resource on the page is loaded, it will show up on the network timeline.
  6. Scroll down, click buttons, whatever you normally do to make the target site load more content.
  7. The AJAX requests should be pretty obvious in the network timeline. If they're not, click the almost invisible "XHR" filter at the bottom of the window, to show only AJAX requests in the timeline.
  8. When you've found your AJAX request, click its Name to see more information, such as the headers your browser sent (including, importantly, the Request URL and the request parameters).
  9. Try to recreate the requests in your Python scraper.
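
To make step 9 a bit more concrete, here is a rough sketch in Python 3 (where urllib2 became urllib.request). Everything site-specific in it is a made-up assumption: the endpoint URL, the `start` and `count` parameter names, and the idea that the server replies with a JSON list. Substitute whatever the Network tab actually shows for your site. Many sites return HTML fragments instead, in which case you would parse each response as HTML rather than JSON.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and parameter names: replace them with the Request URL
# and request parameters you actually see in the Network tab.
AJAX_URL = "http://www.example.com/ajax/products"
PAGE_SIZE = 20


def build_url(base_url, start, count):
    """Return the URL for one batch of results."""
    query = urllib.parse.urlencode({"start": start, "count": count})
    return base_url + "?" + query


def fetch_json(url):
    """Fetch one batch, sending headers similar to a browser's AJAX call."""
    request = urllib.request.Request(url, headers={
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0",
    })
    with urllib.request.urlopen(request) as response:
        # Assumption: the body is a JSON list of products.
        return json.loads(response.read().decode("utf-8"))


def scrape_all(fetch=fetch_json, base_url=AJAX_URL,
               page_size=PAGE_SIZE, max_items=1000):
    """Request batch after batch until an empty one comes back."""
    items = []
    start = 0
    while start < max_items:
        batch = fetch(build_url(base_url, start, page_size))
        if not batch:
            break  # the server has nothing more to send
        items.extend(batch)
        start += page_size
    return items
```

Calling `scrape_all()` would then walk through the batches one by one. The `fetch` argument is only there so the paging loop can be tried against a fake function before pointing it at a real site, and `max_items` is a safety cap so the loop always ends.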

Z



