[ANN] Skyscraper 0.1.1, an Enlive-based library for scraping entire websites

Daniel Janus

Aug 24, 2015, 8:12:41 PM

to Enlive

Hi,

[I'm not sure whether announcements of Enlive-based libraries are on-topic here; please moderate this post out if not.]

I'm happy to announce the availability of release 0.1.1 of Skyscraper, an Enlive-based library for "structural scraping" -- extracting information from whole sites in a structural way.

Homepage / GitHub: https://github.com/nathell/skyscraper
Leiningen: [skyscraper "0.1.1"]

Clojars: https://clojars.org/skyscraper

From the README:

What is structural scraping? Think of Enlive. It allows you to parse arbitrary HTML and extract various bits of information out of it: subtrees or parts of subtrees determined by selectors. You can then convert this information to some other format, easier for machine consumption, or process it in whatever other way you wish. This is called scraping.
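
As a rough illustration of the kind of extraction described above, here is a minimal Enlive sketch. The HTML fragment, the `li.item` class and the `extract-links` helper are made up for this example; only `html-snippet`, `select` and `text` come from Enlive.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; A made-up HTML fragment standing in for a downloaded page.
(def page
  (html/html-snippet
   "<ul>
      <li class=\"item\"><a href=\"/a\">First</a></li>
      <li class=\"item\"><a href=\"/b\">Second</a></li>
    </ul>"))

;; Hypothetical helper: turn every link inside an li.item into a map
;; holding the link text and its href attribute.
(defn extract-links [nodes]
  (for [a (html/select nodes [:li.item :a])]
    {:title (html/text a)
     :url   (get-in a [:attrs :href])}))

(extract-links page)
```

The result is a sequence of plain maps, which is the "easier for machine consumption" format the paragraph refers to.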

Now imagine that you have to parse a lot of HTML documents. They all come from the same site, so most of them are structured in the same way and can be scraped using the same sets of selectors. But not all of them. There’s an index page, which has a different layout and needs to be treated in its own peculiar way, with pagination and all. There are pages that group together individual pages in categories. And so on. Treating single pages is easy, but with whole collections of pages, you quickly find yourself writing a lot of boilerplate code.

In particular, you realize that you can’t just wget -r the whole thing and then parse each page in turn. Rather, you want to simulate the workflow of a user who tries to “click through” the website to obtain the information she’s interested in. Sites have tree-like structure, and you want to keep track of this structure as you traverse the site, and reflect it in your output. I call it “structural scraping”.
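
To make the idea of keeping track of structure more concrete, here is a small sketch in plain Clojure. It is not Skyscraper's API or implementation: the `site` map stands in for already-parsed pages, and `walk-site` is a hypothetical helper that follows links and carries what it learned along the way into each result.

```clojure
;; Hypothetical in-memory "site": each URL maps to the data found on that page.
(def site
  {"/"           {:links [{:category "Books" :url "/books"}
                          {:category "Music" :url "/music"}]}
   "/books"      {:links [{:title "SICP" :url "/books/sicp"}]}
   "/music"      {:links [{:title "Kind of Blue" :url "/music/kob"}]}
   "/books/sicp" {:price 30}
   "/music/kob"  {:price 12}})

(defn walk-site
  "Follows links depth-first starting at url. Everything learned on the
  way down (the context) is merged into each leaf record, so the output
  reflects the path taken through the site. Returns a sequence of maps."
  [context url]
  (let [{:keys [links] :as page} (get site url)]
    (if (seq links)
      ;; Inner page: descend into each link, extending the context.
      (mapcat (fn [link]
                (walk-site (merge context (dissoc link :url)) (:url link)))
              links)
      ;; Leaf page: emit one record combining context and page data.
      [(merge context page)])))

(walk-site {} "/")
```

Each leaf record ends up with the category and title collected on the way down in addition to its own price, which a flat crawl over saved pages would not give you.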

This is where Skyscraper comes in.

New in this release:

- Processors (process-fn functions) can now access the current context.
- Skyscraper now uses clj-http to issue HTTP GET requests.
  - Skyscraper can now auto-detect page encoding thanks to clj-http’s decode-body-headers feature.
  - scrape now supports an http-options argument to override HTTP options (e.g., timeouts).
- Skyscraper’s output is now fully lazy (i.e., guaranteed to be non-chunking).
- Fixed a bug where relative URLs were incorrectly resolved in certain circumstances.
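
For readers unfamiliar with the "non-chunking" remark above: Clojure's lazy sequences over chunked sources compute elements a chunk at a time, so asking for one element can trigger work for many. The sketch below shows the difference in plain Clojure. It does not show how Skyscraper achieves this internally; `count-realized` and `unchunk` are hypothetical helpers written for this illustration.

```clojure
;; Hypothetical helper: maps a counting function over source, takes only
;; the first element, and reports how many elements were actually computed.
(defn count-realized [source]
  (let [realized (atom 0)]
    (first (map (fn [x] (swap! realized inc) x) source))
    @realized))

;; Hypothetical helper: re-wrap a seq one element at a time so that
;; consumers such as map can no longer process it in chunks.
(defn unchunk [s]
  (lazy-seq
    (when-let [[x & more] (seq s)]
      (cons x (unchunk more)))))

;; A vector is a chunked source: more than one element gets computed.
(count-realized (vec (range 100)))

;; The unchunked version computes exactly the one element that was requested.
(count-realized (unchunk (vec (range 100))))
```

When each element costs an HTTP request, the second behaviour is the one you want, since consuming one result should not trigger downloads for a whole chunk of pages.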