Are there any screen-scraping packages for OCaml?
I'm looking for something that would let me analyze the contents of a
web page and extract, for example, all the image tags.
I'm using Ruby for this at work, and something like Hpricot is
very neat but also somewhat slow.
I don't think of this as screen scraping. Spidering might be a better word.
I've done a good bit of this in OCaml. I use the curl package for
downloading web pages and the netstring package for parsing them.
I'm going to attach a couple of files that I use for this sort of stuff.
The file htmltreeutils.ml has a bunch of functions for working with
the results of a nethtml parse tree.
So your program would look something like this (this hasn't been tested):
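let result = Buffer.create 2000 in
let connection = Curl.init () in
Curl.set_httpget connection true;
Curl.set_url connection "http://www.yahoo.com/randompage.html";
(* Accumulate the body; recent ocurl versions expect the callback to
   return the number of bytes it consumed. *)
Curl.set_writefunction connection (fun s ->
  Buffer.add_string result s; String.length s);
(* Ignore the response headers. *)
Curl.set_headerfunction connection (fun s -> String.length s);
(* Run the transfer, then release the handle. *)
Curl.perform connection;
Curl.cleanup connection;
(* get_parsed_html_from_string and list_tags come from the helper files. *)
let dom = get_parsed_html_from_string (Buffer.contents result) in
let img_tags = list_tags "img" dom in
(* ... do something with img_tags here, like pull out their src *)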
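For the "pull out their src" step, here is a rough sketch of what that could look like. It is not the code from the helper files: list_tags below is my own guess at such a helper, and attribute, image_sources and sample are hypothetical names made up for illustration. The document type is a local stand-in that mirrors the shape of Nethtml.document, so the sketch compiles without ocamlnet installed.

```ocaml
(* Local stand-in mirroring the shape of Nethtml.document (assumption:
   with ocamlnet installed, drop this type and use Nethtml's instead). *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Collect every element named [tag], in document order.
   This is a guess at what a list_tags helper might do. *)
let rec list_tags tag docs =
  List.concat
    (List.map
       (function
         | Data _ -> []
         | Element (name, _, children) as node ->
           let nested = list_tags tag children in
           if String.lowercase_ascii name = tag then node :: nested
           else nested)
       docs)

(* Look up one attribute on an element; text nodes have none. *)
let attribute key = function
  | Element (_, attrs, _) -> List.assoc_opt key attrs
  | Data _ -> None

(* The src of every img element that actually has one. *)
let image_sources docs =
  List.fold_right
    (fun node acc ->
       match attribute "src" node with
       | Some src -> src :: acc
       | None -> acc)
    (list_tags "img" docs) []

(* A small hand-built tree standing in for a parsed page. *)
let sample =
  [ Element ("html", [],
      [ Element ("body", [],
          [ Element ("img", [ ("src", "a.png") ], []);
            Element ("p", [],
              [ Data "hello";
                Element ("img", [ ("src", "b.gif"); ("alt", "b") ], []) ]);
            Element ("img", [], []) ]) ]) ]

let () = List.iter print_endline (image_sources sample)
```

With ocamlnet available, a real tree should be obtainable from the downloaded page with something like Nethtml.parse_document (Lexing.from_string html), which returns a document list that can be passed to image_sources directly.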
Here are the two helper files:
We did some web scraping using WWW::Mechanize + perl4caml. As a
result, perl4caml contains pretty complete bindings for the
Richard Jones, CTO Merjis Ltd.
Merjis - web marketing and technology - http://merjis.com
Team Notepad - intranets and extranets for business - http://team-notepad.com