[ANN] Itsy 0.1.0 released, a threaded web spider written in Clojure


Lee Hinman

May 31, 2012, 11:49:18 PM
to clo...@googlegroups.com

Hi all,
I'm pleased to announce the initial 0.1.0 release of Itsy. Itsy is a
threaded web spider written in Clojure. A list of some of the Itsy
features:

- Multithreaded, with the ability to add and remove workers as needed
- No global state; run multiple crawlers with multiple threads at once
- Pre-written handlers for writing to text files and ElasticSearch
- Skips URLs that have been seen before
- Domain limiting, to crawl only pages belonging to a certain domain

You should be able to use it from Clojars[1] with the following:

[itsy "0.1.0"]
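
Basic usage looks roughly like this (a sketch; see the readme for the
authoritative option names and handler signature):

(require '[itsy.core :refer [crawl]])

;; A handler is called with a map describing each crawled page; here
;; we assume it carries :url and :body keys.
(defn my-handler [{:keys [url body]}]
  (println url "returned" (count body) "characters"))

;; Start a crawl from a seed URL; the returned handle is what you use
;; to add and remove workers later.
(def c (crawl {:url "http://example.com" :handler my-handler}))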

Please give it a try and open an issue on the GitHub repo[2] for any
problems you find. Check out the readme for full usage information.

thanks,
Lee Hinman

[1]: https://clojars.org/itsy
[2]: https://github.com/dakrone/itsy

László Török

Jun 1, 2012, 7:24:17 AM
to clo...@googlegroups.com
Hi,

Interesting project. I was wondering, though: how do you make sure two crawlers do not crawl the same URL twice if there is no global state? :)

If I read it correctly, you're going to have to spawn a lot of threads to keep at least a few busy with extraction at any point in time, as most of them will be blocked most of the time waiting for pages to be retrieved.

You may also consider using the sitemap as a source of URLs per domain, although this depends on the crawling policy.
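
As a rough sketch of what I mean, using only clojure.xml (this assumes
the conventional /sitemap.xml location and a flat <urlset> of
<url><loc> entries; sitemap index files and gzipped sitemaps would need
extra handling):

(require '[clojure.xml :as xml])

;; Fetch a domain's sitemap and pull out the <loc> URLs.
(defn sitemap-urls [domain]
  (let [doc (xml/parse (str "http://" domain "/sitemap.xml"))]
    (for [url-el (:content doc)          ;; each <url> element
          child  (:content url-el)       ;; its <loc>, <lastmod>, ...
          :when  (= :loc (:tag child))]
      (first (:content child)))))

;; e.g. (sitemap-urls "example.com")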

Regards,

Laszlo

2012/6/1 Lee Hinman <matthew...@gmail.com>




--
László Török

Michael Klishin

Jun 1, 2012, 3:53:37 PM
to clo...@googlegroups.com
László Török:

> I was wondering, though: how do you make sure two crawlers do not
> crawl the same URL twice if there is no global state? :)

By adding shared state within a single app instance, typically an atom. As for separate instances,
it is not uncommon to hash seed URLs (or domains) in such a way that two instances simply won't
crawl the same site in parallel.
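
A minimal sketch of both ideas (illustrative only, not Itsy's actual
internals):

;; Per-crawler seen-URL set held in an atom.
(def seen-urls (atom #{}))

(defn mark-seen!
  "Atomically add url to the set held in `seen`; returns true if it
  was new, false if some worker already claimed it."
  [seen url]
  (loop []
    (let [s @seen]
      (cond
        (contains? s url)                      false
        (compare-and-set! seen s (conj s url)) true
        :else                                  (recur)))))

;; Hash-partition domains across n crawler instances so that no two
;; instances crawl the same site (mod keeps the slot non-negative).
(defn mine? [domain n instance-id]
  (= instance-id (mod (hash domain) n)))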


> You may also consider using the sitemap as a source of URLs per
> domain, although this depends on the crawling policy.

That does not work in practice. One reason is that sitemaps are often incomplete, out of date, or
missing completely. Another is that, for most news websites and blogs, you will discover site
structure a lot faster by frequently (within reason, of course) recrawling either first-level pages
or a seed of known "section" pages.

There is a really good Web-mining workshop video from Strata Santa Clara 2012 that highlights two
dozen of the more common problems you face when designing Web crawlers:

http://my.safaribooksonline.com/video/-/9781449336172

Highly recommended for people who are interested or work in this area. (I think it can be purchased separately; O'Reilly Safari subscribers have access to the entire video set.)

I am by no means an expert (or even very experienced) in this area, but Itsy has features that solve
several very common problems out of the box in 0.1.0. Good job.

MK

László Török

Jun 1, 2012, 4:22:17 PM
to clo...@googlegroups.com
Hi,

I don't want to turn this into a lengthy discussion about crawling, but I'm happy to continue off list. ;)

Sitemaps work surprisingly well in certain domains (web shops powered by standard web-shop software, large e-commerce sites) and, in our experience, can make life easier.

Another point: a nice addition would be polite crawling (e.g. not retrieving more than one page per second from a given domain); we once got banned due to excessive traffic from a single IP.
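
Something along these lines would be enough for a start (a sketch, not
Itsy code): remember the last fetch time per domain and sleep off the
remainder of the interval before fetching again.

;; domain -> time of last fetch, in millis
(def last-fetch (atom {}))

(defn polite-wait!
  "Block until at least min-interval-ms has passed since the last
  fetch from domain, then record the new fetch time. The check-then-
  sleep here is not race-free when many workers hit the same domain;
  a per-domain queue or lock would make it strict."
  [domain min-interval-ms]
  (let [now  (System/currentTimeMillis)
        prev (get @last-fetch domain 0)
        wait (- (+ prev min-interval-ms) now)]
    (when (pos? wait)
      (Thread/sleep wait))
    (swap! last-fetch assoc domain (System/currentTimeMillis))))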

Anyway, thanks for sharing. I hope to find some hacking time to implement one of our extractors as a handler in Clojure and take Itsy out for a spin. :)

Las

2012/6/1 Michael Klishin <michael....@gmail.com>



--
László Török
