Web scraping in a Chrome extension


methos o

Nov 24, 2012, 10:01:09 PM
to chromium-...@chromium.org
I am writing a Chrome extension that downloads a page for the user, parses its contents, and shows the useful bits.

I first tried $.get(), but some pages (e.g. gmail.com) did not send back any usable content. The response only contained a message that "your browser does not have javascript enabled".

Then I tried loading the page in an iframe, but that did not work either:
1. If I use iframes in content scripts, some sites (e.g. stackoverflow) do not allow iframes in their pages at all.
2. If I use iframes in the background page, the page is not displayed for sites that send X-Frame-Options (e.g. google.com).
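
For the second case, one way to avoid creating an iframe that will never render is to look at the response headers first. This is only a minimal sketch: `blocksFraming` is a hypothetical helper name, and it assumes the headers have already been collected into a plain object (for example from `XMLHttpRequest.getResponseHeader`). It checks X-Frame-Options only and ignores any other mechanism a site might use to prevent framing.

```javascript
// Hypothetical helper: report whether a response's X-Frame-Options header
// would stop the page from rendering inside an iframe on another origin.
function blocksFraming(headers) {
  // Header names are case-insensitive, so normalise names and values first.
  const normalised = {};
  for (const [name, value] of Object.entries(headers)) {
    normalised[name.toLowerCase()] = String(value).trim().toLowerCase();
  }

  const xfo = normalised["x-frame-options"];

  // DENY forbids all framing; SAMEORIGIN only allows the site to frame itself,
  // so an extension page on a different origin is blocked as well.
  return xfo === "deny" || xfo === "sameorigin";
}

// Example: a site that only allows same-origin framing.
console.log(blocksFraming({ "X-Frame-Options": "SAMEORIGIN" })); // true
```

If this returns true, loading that URL in an iframe is unlikely to work and another approach is needed.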

So how can I download web pages in the background and parse their contents?

Matt Kruse

Nov 25, 2012, 9:43:49 PM
to chromium-...@chromium.org
Well, there is no simple answer. The results you are seeing are what make it difficult. Since so many sites are ajax- and javascript-driven, simply loading a page via ajax will not give you the content you want, and trying to run it in an iframe will cause other issues like the ones you've seen.

The answer really depends on exactly what kind of content you want to scrape. In some cases, you can hit the mobile version of a site and it will give you the same content in a much more scrape-friendly format. In other cases, you can find a different URL that gives you a cleaner, stand-alone version of the data you want. And in still other cases, there are web services you can call directly to get the raw data.
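
As a rough sketch of how those options could be combined, the code below tries a list of candidate URLs in order and keeps the first response that looks like real content. Everything named here is an assumption for illustration: `firstUsableSource` and `hasRealContent` are hypothetical helpers, `fetchText` is whatever function you use to download a URL as text (it could wrap XMLHttpRequest or $.get), and the phrases in the check are a crude guess at what a "please enable javascript" page contains. Which candidate URLs exist for a given site is something you have to find out by hand.

```javascript
// Crude check (assumption): reject empty bodies and pages that only tell
// the user to turn JavaScript on.
function hasRealContent(body) {
  const asksForJavaScript = /javascript enabled|enable javascript/i.test(body);
  return body.trim().length > 0 && !asksForJavaScript;
}

// Try each candidate source in order and return the first usable one.
// `fetchText(url)` must return a Promise that resolves to the response text.
async function firstUsableSource(candidates, fetchText, isUsable) {
  for (const url of candidates) {
    try {
      const body = await fetchText(url);
      if (isUsable(body)) {
        return { url, body };
      }
    } catch (err) {
      // Network or permission error: move on to the next candidate.
    }
  }
  return null; // None of the candidates returned usable content.
}

// Usage (hypothetical URLs, most scrape-friendly source first):
// firstUsableSource(
//   ["https://api.example.com/items.json", "https://m.example.com/items"],
//   myFetchText,
//   hasRealContent
// ).then((result) => { /* parse result.body, or handle null */ });
```

Ordering the list from the cleanest source (a web service) to the messiest (the full page) means the parsing code usually sees the simplest format available.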

It all depends.

Matt Kruse