Scraping website with GWT


Fermin

Aug 10, 2010, 8:48:58 AM
to Google Web Toolkit
Hi,

I haven't found any reference on how to do scraping with GWT. Is it
possible? Something like cURL in PHP?

Thanks to all

lineman78

Aug 10, 2010, 12:09:34 PM
to Google Web Toolkit
First of all, GWT is executed client-side, so the browser's same-origin
policy should prevent you from scraping another site directly. However,
you can do scraping quite easily with server-side Java. PHP is also a
server-executed language, so anything you would usually do in PHP you
can do in server-side Java with GWT. There are a few different ways you
can scrape a page in Java.

1) External Libraries (JScrape, XQuery)
2) Parse the HTML as XML (DOM or SAX)
3) Regex
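
As a rough sketch of options 2 and 3, the class below pulls link targets out of a page that is already held in a string. The names LinkExtractor, linksByDom and linksByRegex are made up for illustration. The DOM variant only works when the markup is well-formed XML (much real-world HTML is not), and the regex variant assumes double-quoted href attributes, so treat both as starting points rather than a robust scraper.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts href values from a page held in a string.
public class LinkExtractor {

    // Assumption: href values are double-quoted; other quoting styles are missed.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*\\bhref\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Option 2: parse the page as XML and walk the <a> elements.
    // Throws if the markup is not well-formed XML.
    public static List<String> linksByDom(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NodeList anchors = doc.getElementsByTagName("a"); // case-sensitive match
        List<String> links = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                links.add(anchor.getAttribute("href"));
            }
        }
        return links;
    }

    // Option 3: scan the raw text with a regular expression.
    public static List<String> linksByRegex(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }
}
```

The regex route tolerates sloppy markup but is easy to fool; the DOM route is stricter but gives you real elements to navigate.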

These all require you to get the HTML page as a string, which is rather
easy (see URL.openConnection).
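
A minimal sketch of that fetch step on the server side, using URL.openConnection from the standard library. The class name PageFetcher, the 10-second timeouts and the UTF-8 decoding are assumptions for illustration; a real page may declare a different charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: downloads a page and returns its body as a string.
public class PageFetcher {

    public static String fetch(String address) throws IOException {
        URLConnection connection = URI.create(address).toURL().openConnection();
        // Assumed timeouts so a slow site cannot hang the server thread.
        connection.setConnectTimeout(10_000);
        connection.setReadTimeout(10_000);

        StringBuilder html = new StringBuilder();
        // Assumption: the page is UTF-8 encoded.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}
```

Something like String html = PageFetcher.fetch("http://example.com/"); would then hand the page to whichever parsing approach you pick.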

Henrique Viecili

Aug 12, 2010, 8:35:41 AM
to Google Web Toolkit
Hmm... you could use an IFRAME to load the page and some JSNI to get
the HTML from the IFRAME (you might get a security warning or even be
blocked). Once you have the HTML, you can use GWT's DOM support to do
the rest.

But it should be much easier if you use a server-side language to do
that for you.

cokol

Aug 12, 2010, 9:12:32 AM
to Google Web Toolkit
Nope, that's not possible: you cannot access the JS namespace of an
iframe, so server-side is the only way. You can still bring the results
into the client, though.

Henrique Viecili

Aug 13, 2010, 5:25:43 AM
to Google Web Toolkit
It is possible, but you will be blocked or get a security warning if
you access a URL outside your site. You might wonder, 'why would I
scrape my own site?' Well, ask my boss, who wants to index all pages
from our intranet.

You can use this code to get the content:
public native String getIFRAMEBodyContent(String iframeId) /*-{
  // Only works when the iframe's page is on the same origin as the host page.
  return $doc.getElementById(iframeId).contentWindow.document.body.innerHTML;
}-*/;

Once you have the HTML content, you can wrap it in an HTML
object:
HTML html = new HTML(getIFRAMEBodyContent("myIframe"));
Element rootElement = html.getElement();
// be happy

In case you must scrape pages outside your domain and this must be done
in the browser, you can use a signed Java applet (it would be a great
exercise of your Java knowledge).

Anyway, the easiest way would be with server side code as *lineman78*
and *cokol* said.

Cheers,
Henrique Viecili
--
Think outside the box, limitations are self imposed!