[I'm not sure whether announcements of Enlive-based libraries are on-topic here; please moderate this post out if not.]
I'm happy to announce the availability of release 0.1.1 of Skyscraper, an
Enlive-based library for "structural scraping" -- extracting information
from whole sites in a structural way.

What is structural scraping? Think of Enlive. It allows you to parse
arbitrary HTML and extract various bits of information out of it:
subtrees or parts of subtrees determined by selectors. You can then
convert this information to some other format, easier for machine
consumption, or process it in whatever other way you wish. This is
called scraping.
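
To make this concrete, here is a minimal sketch of single-page scraping with plain Enlive. It assumes Enlive is on the classpath; the HTML fragment and the :title/:url keys are made up for illustration and do not come from any real site.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; A made-up HTML fragment, parsed into Enlive's node representation.
(def page
  (html/html-snippet
   "<ul><li><a href=\"/a\">First</a></li><li><a href=\"/b\">Second</a></li></ul>"))

;; Select every <a> nested inside an <li> and turn each one into a map.
(def links
  (for [a (html/select page [:li :a])]
    {:title (html/text a)
     :url   (get-in a [:attrs :href])}))
```

Evaluating links should yield one map per anchor, ready to be converted to whatever output format you need.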

Now imagine that you have to parse a lot of HTML documents. They all come
from the same site, so most of them are structured in the same way and
can be scraped using the same sets of selectors. But not all of them.
There’s an index page, which has a different layout and needs to be
treated in its own peculiar way, with pagination and all. There are
pages that group together individual pages in categories. And so on.
Treating single pages is easy, but with whole collections of pages, you
quickly find yourself writing a lot of boilerplate code.

In particular, you realize that you can’t just wget -r the whole thing and
then parse each page in turn. Rather, you want to simulate the workflow
of a user who tries to “click through” the website to obtain the
information she’s interested in. Sites have a tree-like structure, and you
want to keep track of this structure as you traverse the site and
reflect it in your output. I call this “structural scraping”.
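
As a rough illustration of the idea (this is not Skyscraper's API, just a sketch in plain Clojure), suppose the site is already available as an in-memory map from URL to page data. The site map, the scrape function, and the :category/:name keys are all hypothetical; a real scraper would download and parse each page instead of looking it up.

```clojure
;; Hypothetical stand-in for a website: URL -> what we extracted from the page.
;; :links lists the child pages a user could click through to.
(def site
  {"/"       {:links ["/cats" "/dogs"]}
   "/cats"   {:category "Cats" :links ["/cats/1" "/cats/2"]}
   "/dogs"   {:category "Dogs" :links ["/dogs/1"]}
   "/cats/1" {:name "Tom"}
   "/cats/2" {:name "Felix"}
   "/dogs/1" {:name "Rex"}})

(defn scrape
  "Visits url, merges the page's data into the context inherited from its
  ancestors, and recurses into linked pages. Leaf pages yield one record."
  [context url]
  (let [{:keys [links] :as page} (site url)
        context (merge context (dissoc page :links))]
    (if (seq links)
      (mapcat #(scrape context %) links)
      [context])))

(scrape {} "/")
;; => ({:category "Cats", :name "Tom"}
;;     {:category "Cats", :name "Felix"}
;;     {:category "Dogs", :name "Rex"})
```

Each output record carries the information gathered on the way down from the index page, which is exactly the structure that a flat wget -r dump would lose.
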
This is where Skyscraper comes in.