Are there any screen-scraping packages for OCaml?
I'm looking for something that would let me analyze the contents of a
web page and extract, for example, all the image tags.
I'm using Ruby for this at work and something like hpricot [1] is
very neat but also somewhat slow.
Thanks, Joel
[1] http://code.whytheluckystiff.net/hpricot/
_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs
I don't think of this as screen scraping. Spidering might be a better word.
I've done a good bit of this in OCaml. I use the curl package for
downloading web pages and the netstring package for parsing them.
I'm going to attach a couple of files that I use for this sort of stuff.
The file htmltreeutils.ml has a bunch of functions for working with
the results of a nethtml parse tree.
So your program would look something like this (untested):
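open Htmltreeutils

let () =
  let result = Buffer.create 2000 in
  let connection = Curl.init () in
  Curl.set_httpget connection true;
  Curl.set_url connection "http://www.yahoo.com/randompage.html";
  (* ocurl callbacks return the number of bytes they consumed *)
  Curl.set_writefunction connection (fun s ->
    Buffer.add_string result s; String.length s);
  (* ignore the response headers *)
  Curl.set_headerfunction connection (fun s -> String.length s);
  Curl.perform connection;
  Curl.cleanup connection;
  (* get_parsed_html_from_string and list_tags come from htmltreeutils.ml;
     the parser is assumed to take a string, as its name suggests *)
  let dom = get_parsed_html_from_string (Buffer.contents result) in
  let img_tags = list_tags "img" dom in
  (* ... do something with img_tags here, like pull out their src
     attributes *)
  ignore img_tags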
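
For readers without htmltreeutils.ml to hand, here is a rough sketch of what a helper like list_tags could look like, plus one way to pull out the src attributes. It is a guess at the idea, not the actual helper file. The document type below only mirrors the shape of Nethtml.document so that the sketch runs with the standard library alone; attr_values and sample are names made up for illustration.

```ocaml
(* Local mirror of the shape of Nethtml.document (ocamlnet's netstring).
   With ocamlnet installed, drop this type and use Nethtml.document. *)
type document =
  | Element of string * (string * string) list * document list
  | Data of string

(* Hypothetical stand-in for list_tags: collect every element named [tag],
   in document order, descending into nested elements as well. *)
let rec list_tags tag docs =
  List.concat
    (List.map
       (function
         | Data _ -> []
         | Element (name, _, children) as node ->
             let nested = list_tags tag children in
             if String.lowercase_ascii name = tag then node :: nested
             else nested)
       docs)

(* Hypothetical helper: the value of attribute [attr] for each element
   that carries it; elements without the attribute are skipped. *)
let attr_values attr nodes =
  List.filter_map
    (function
      | Element (_, attrs, _) -> List.assoc_opt attr attrs
      | Data _ -> None)
    nodes

(* A small hand-built tree standing in for a parsed page. *)
let sample =
  [ Element ("html", [], [
      Element ("body", [], [
        Element ("img", [ ("src", "a.png") ], []);
        Element ("p", [], [
          Data "hello";
          Element ("img", [ ("src", "b.png"); ("alt", "b") ], []) ]);
        Element ("img", [], []) ]) ]) ]

let img_srcs = attr_values "src" (list_tags "img" sample)

let () = List.iter print_endline img_srcs
```

With ocamlnet itself, the tree would presumably come from something like Nethtml.parse (new Netchannels.input_string html) rather than being built by hand; the traversal stays the same because the constructors have the same shape.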
Here are the two helper files:
We did some web scraping using WWW::Mechanize + perl4caml. As a
result, perl4caml contains pretty complete bindings for the
WWW::Mechanize library.
http://merjis.com/developers/perl4caml
http://resources.merjis.com/developers/perl4caml/Pl_WWW_Mechanize.www_mechanize.html
Rich.
--
Richard Jones, CTO Merjis Ltd.
Merjis - web marketing and technology - http://merjis.com
Team Notepad - intranets and extranets for business - http://team-notepad.com