[Caml-list] Web page scraping packages

Joel Reymont

Jul 31, 2006, 8:09:01 PM

to caml-list

Folks,

Are there any screen-scraping packages for OCaml?

I'm looking for something that would let me analyze the contents of a
web page and extract, for example, all the image tags.

I'm using Ruby for this at work and something like hpricot [1] is
very neat but also somewhat slow.

Thanks, Joel

[1] http://code.whytheluckystiff.net/hpricot/

--
http://wagerlabs.com/

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Karl Zilles

Jul 31, 2006, 8:43:51 PM

to Joel Reymont, caml-list

Joel Reymont wrote:
> Are there any screen-scraping packages for OCaml?
>
> I'm looking for something that would let me analyze the contents of a
> web page and extract, for example, all the image tags.

I don't think of this as screen scraping. Spidering might be a better word.

I've done a good bit of this in OCaml. I use the curl package for
downloading web pages and the netstring package for parsing them.

I'm going to attach a couple of files that I use for this sort of stuff.
The file htmltreeutils.ml has a bunch of functions for working with
the results of a nethtml parse tree.

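So your program would look something like this (this hasn't been
tested):

open Htmltreeutils

let () =
  let result = Buffer.create 2000 in
  let connection = Curl.init () in
  Curl.set_httpget connection true;
  Curl.set_url connection "http://www.yahoo.com/randompage.html";
  (* Accumulate the body. Recent ocurl releases expect the callback to
     return the number of bytes it handled. *)
  Curl.set_writefunction connection (fun s ->
    Buffer.add_string result s; String.length s);
  (* Ignore the response headers. *)
  Curl.set_headerfunction connection (fun s -> String.length s);
  Curl.perform connection;
  Curl.cleanup connection;

  (* Assumes get_parsed_html_from_string takes a string, as its name
     suggests, so the buffer contents are passed. *)
  let dom = get_parsed_html_from_string (Buffer.contents result) in
  let img_tags = list_tags "img" dom in
  (* ... do something with img_tags here, like pull out their src
     attributes *)
  ignore img_tags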

Here are the two helper files:

htmltreeutils.ml

utility.ml
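
Only the names of the two helper files appear here, so the following is a rough sketch of what a function like list_tags might do, not the original code. It assumes the parse tree has the shape of netstring's Nethtml.document type (elements carrying a tag name, an attribute list and children, plus text nodes) and mirrors that type locally so the sketch runs without ocamlnet installed. The names attribute, img_srcs and sample are hypothetical, introduced only for illustration.

```ocaml
(* Local mirror of the tree shape used by Nethtml (an assumption, so the
   sketch has no dependency on ocamlnet). *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Collect every element named [name], depth-first, in document order.
   Hypothetical stand-in for list_tags; the original may differ. *)
let rec list_tags name docs =
  List.concat
    (List.map
       (function
         | Data _ -> []
         | Element (tag, _, children) as el ->
           let below = list_tags name children in
           if String.lowercase_ascii tag = String.lowercase_ascii name
           then el :: below
           else below)
       docs)

(* Value of attribute [attr] on an element, if the attribute is present. *)
let attribute attr = function
  | Element (_, attrs, _) -> List.assoc_opt attr attrs
  | Data _ -> None

(* The src attribute of every img element that has one. *)
let img_srcs docs =
  List.filter_map (attribute "src") (list_tags "img" docs)

(* A small hand-built tree standing in for a parsed page. *)
let sample =
  [ Element ("html", [],
      [ Element ("body", [],
          [ Element ("img", [ ("src", "a.png") ], []);
            Element ("p", [],
              [ Data "text";
                Element ("img", [ ("src", "b.gif"); ("alt", "b") ], []) ]);
            Element ("img", [], []) ]) ]) ]

let () = List.iter print_endline (img_srcs sample)
```

With netstring itself, the tree would come from the Nethtml parser rather than being built by hand; the traversal stays the same.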

Richard Jones

Aug 1, 2006, 5:46:46 AM

to Joel Reymont, caml-list

On Tue, Aug 01, 2006 at 01:06:52AM +0100, Joel Reymont wrote:
> Are there any screen-scraping packages for OCaml?
>
> I'm looking for something that would let me analyze the contents of a
> web page and extract, for example, all the image tags.

We did some web scraping using WWW::Mechanize + perl4caml. As a
result, perl4caml contains pretty complete bindings for the
WWW::Mechanize library.

http://merjis.com/developers/perl4caml
http://resources.merjis.com/developers/perl4caml/Pl_WWW_Mechanize.www_mechanize.html

Rich.

--
Richard Jones, CTO Merjis Ltd.
Merjis - web marketing and technology - http://merjis.com
Team Notepad - intranets and extranets for business - http://team-notepad.com
