Hello everyone,
I'm very happy to announce the release of the BlogForever crawler! Our work
is entirely based on Scrapy, and we want to thank you for the amazing work
you did on this framework, without which we could not have accomplished a
fraction of what we did during the last 6 months.
The crawler targets blogs and can automatically extract blog post
articles, titles, authors and comments. It's open source and comes with tests.
We have also written and submitted a paper for the WIMS14 conference, in which
we present our algorithm for content extraction and a high-level overview of
the crawler
architecture, available at
I believe that the following parts of our project might be useful for other
applications:
- The content extractor interface is similar to that of Scrapely
to include it in our evaluation). It's very fast and robust: we reached a 93%
success rate on blog article extraction over 2300 blog posts.
- To crawl blogs mixed up with other resources (such as a wiki or a forum), we
use a simple machine-learning-based priority predictor to favor crawling
URLs with links to blog posts. This lets us get the most out of a limited
number of page downloads; otherwise the crawl might get stuck in irrelevant
portions of the blog.
- We use PhantomJS to do JavaScript rendering, take screenshots and fake some
user interaction to deal with Disqus comments. We have a pool of reusable
browsers, which lets us take full advantage of the available processors (with
JavaScript rendering on, the crawling bottleneck is CPU).
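To make the priority predictor idea more concrete, here is a minimal sketch of a crawl frontier that always hands out the most promising URL first. It is not the BlogForever code: the URL features, the hand-set weights (standing in for a trained model) and the names `features`, `predict_priority` and `Frontier` are all assumptions made up for illustration.

```python
import heapq
import itertools
import math

# Hypothetical hand-set weights standing in for a trained model.
WEIGHTS = {"has_date": 2.0, "has_forum": -3.0, "has_wiki": -3.0, "depth": -0.3}
BIAS = 0.0


def features(url):
    """Extract a few illustrative (assumed) features from a URL."""
    path = url.split("://", 1)[-1]
    segments = [s for s in path.split("/")[1:] if s]
    has_year = any(s.isdigit() and len(s) == 4 for s in segments)
    return {
        "has_date": 1.0 if has_year else 0.0,
        "has_forum": 1.0 if "forum" in path else 0.0,
        "has_wiki": 1.0 if "wiki" in path else 0.0,
        "depth": float(len(segments)),
    }


def predict_priority(url):
    """Logistic score in (0, 1): higher means more likely to lead to posts."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features(url).items())
    return 1.0 / (1.0 + math.exp(-z))


class Frontier:
    """URL queue that pops the highest-scoring unseen URL first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order
        self._seen = set()

    def push(self, url):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-predict_priority(url), next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
for link in [
    "http://example.com/forum/thread/12",
    "http://example.com/2014/03/my-post",
    "http://example.com/wiki/Home",
]:
    frontier.push(link)

first = frontier.pop()  # the dated, post-like URL is crawled first
```

With a fixed download budget, popping from such a frontier spends the budget on the post-like URLs before the forum or wiki pages. In a Scrapy spider, a similar score could presumably be mapped to the integer `priority` argument of `Request` instead of a hand-written queue.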
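The pool of reusable browsers can be sketched in the same spirit. The example below is only an illustration of the pooling pattern: `StubBrowser` is a hypothetical stand-in for a PhantomJS-driven browser, and `BrowserPool` and `render_all` are names invented here, not part of the crawler or of PhantomJS.

```python
import queue
import threading
from contextlib import contextmanager


class StubBrowser:
    """Hypothetical stand-in for a PhantomJS-driven browser instance."""

    def __init__(self, ident):
        self.ident = ident
        self.pages_rendered = 0

    def render(self, url):
        self.pages_rendered += 1
        return "<html><!-- %s rendered by browser %d --></html>" % (url, self.ident)


class BrowserPool:
    """Fixed-size pool: browsers are created once and reused across pages."""

    def __init__(self, size, factory=StubBrowser):
        self._idle = queue.Queue()
        for ident in range(size):
            self._idle.put(factory(ident))

    @contextmanager
    def browser(self):
        instance = self._idle.get()  # blocks until a browser is free
        try:
            yield instance
        finally:
            self._idle.put(instance)  # hand it back for the next page


def render_all(pool, urls):
    """Render every URL, with at most one page per pooled browser at a time."""
    results = {}
    lock = threading.Lock()

    def worker(url):
        with pool.browser() as instance:
            html = instance.render(url)
        with lock:
            results[url] = html

    threads = [threading.Thread(target=worker, args=(url,)) for url in urls]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


pool = BrowserPool(size=2)
pages = render_all(pool, ["http://example.com/post/%d" % i for i in range(5)])
```

The pool size caps how many pages are rendered at once. If each pooled browser were a separate external process, as one would expect with PhantomJS, a size close to the number of cores would be a natural starting point, though that is an assumption rather than a measured result.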
If you take the time to read the paper (8 pages) or the code, don't hesitate
to send comments or feedback!
Regards,
Olivier Blanvillain