Announcing the BlogForever crawler


olivierbl...@gmail.com

Feb 1, 2014, 11:52:06 AM
to scrapy...@googlegroups.com
Hello everyone,

I'm very happy to announce the release of the BlogForever crawler! Our work
is entirely based on Scrapy, and we wanted to thank you for the amazing work
you did on this framework, without which we could not have accomplished a
fraction of what we did during the last 6 months.

The crawler targets web blogs, and is able to automatically extract blog post
articles, title, authors and comments. It's open source and comes with tests
and documentation: <https://github.com/BlogForever/crawler>. We've also
written and submitted a paper for the WIMS14 conference where we present our
algorithm for content extraction and a high level overview of the crawler
architecture, available at

I believe that the following parts of our project might be useful for other
applications:

- The content extractor interface is similar to that of Scrapely
  <https://github.com/scrapy/scrapely> (sadly, we discovered Scrapely too late
  to include it in our evaluation). It's very fast and robust: we reached a 93%
  success rate on blog article extraction over 2300 blog posts.

- To crawl blogs mixed up with other resources (such as a wiki or a forum), we
  use a simple machine-learning based priority predictor to favor crawling
  URLs with links to blog posts. This lets us get the most out of a limited
  number of page downloads, since the crawl might otherwise get stuck in
  irrelevant portions of the blog.

- We use PhantomJS to do JavaScript rendering, take screenshots and fake some
  user interaction to deal with Disqus comments. We have a pool of reusable
  browsers, which lets us take full advantage of multiple processors (with
  JavaScript rendering on, the crawling bottleneck is CPU).
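
To give a rough idea of the priority-based crawling mentioned above, here is a
minimal sketch of a URL frontier that always hands out the most promising URL
first. It is not the BlogForever code: `blog_post_score` is a hypothetical
keyword heuristic standing in for the trained predictor, and the `Frontier`
and `crawl_order` names and the example URLs are made up for illustration.

```python
import heapq
import itertools


def blog_post_score(url):
    """Hypothetical stand-in for a trained priority predictor.

    Counts URL fragments that hint at blog posts and subtracts fragments
    that hint at other resources. A real predictor would be learned.
    """
    hints = ("/20", "/post", "/blog", "/archive")
    penalties = ("/wiki", "/forum", "/login")
    return sum(h in url for h in hints) - sum(p in url for p in penalties)


class Frontier:
    """Priority queue of URLs: the highest score is popped first."""

    def __init__(self, score):
        self._score = score
        self._heap = []
        self._seen = set()
        # Tie-breaker so URLs with equal scores come out in insertion order.
        self._counter = itertools.count()

    def push(self, url):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-self._score(url), next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def crawl_order(urls, budget):
    """Return the URLs that would be fetched with a limited page budget."""
    frontier = Frontier(blog_post_score)
    for url in urls:
        frontier.push(url)
    order = []
    while frontier and len(order) < budget:
        order.append(frontier.pop())
    return order


if __name__ == "__main__":
    discovered = [
        "http://example.org/wiki/Main",
        "http://example.org/blog/2014/01/hello",
        "http://example.org/forum/t/1",
        "http://example.org/about",
    ]
    print(crawl_order(discovered, budget=2))
```

With a budget of two pages, the blog post URL is fetched first and the wiki
and forum URLs are never reached. In Scrapy itself, a score like this could
presumably be passed as the `priority` argument of a `Request` instead of
maintaining a separate queue.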

If you take the time to read the paper (8 pages) or the code, don't hesitate
to send comments or feedback!

Regards,
Olivier Blanvillain

Shane Evans

Feb 3, 2014, 5:51:56 PM
to scrapy...@googlegroups.com
Interesting project!

It's nice to see the bits on Scrapy in your paper - thanks! We're delighted it was so useful for the BlogForever crawler. It's great to see your crawler released as open source too.

I thought Scrapely could have been a nice comparison. Your approach takes better advantage of the fact that you have many examples (from the feeds), whereas Scrapely is designed to work with very little example data, so I expect your approach would compare favorably. I see you favor using id and class attributes, something we are considering for Scrapely too, as it currently relies exclusively on HTML structure. Do you plan to release the testing / evaluation part?
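
As a loose illustration of the idea discussed above (using text from a feed as
an example, then describing the matching element by its id and class
attributes), here is a small sketch using only the Python standard library. It
is neither the BlogForever algorithm nor Scrapely's: the `best_rule` function,
the `TextByElement` class and the sample HTML are assumptions made up for this
example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag and so never contain text.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextByElement(HTMLParser):
    """Collect the text inside each element, keyed by (tag, id, class)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, innermost last
        self.texts = {}  # (tag, id, class) -> list of text chunks

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attributes = dict(attrs)
        key = (tag, attributes.get("id"), attributes.get("class"))
        self.stack.append(key)
        self.texts.setdefault(key, [])

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for key in self.stack:
            self.texts[key].append(data)


def best_rule(html, feed_text):
    """Return the (tag, id, class) whose text is closest to the feed text.

    Only elements carrying an id or a class are candidates, since the
    resulting rule has to be expressed through those attributes.
    """
    parser = TextByElement()
    parser.feed(html)
    parser.close()

    def similarity(key):
        text = " ".join("".join(parser.texts[key]).split())
        return SequenceMatcher(None, text, feed_text).ratio()

    candidates = [k for k in parser.texts if k[1] or k[2]]
    return max(candidates, key=similarity) if candidates else None


if __name__ == "__main__":
    page = (
        '<html><body><div id="sidebar">Archives and links</div>'
        '<div class="post-body"><p>Scrapy made our crawler possible.</p></div>'
        "</body></html>"
    )
    print(best_rule(page, "Scrapy made our crawler possible."))
```

A rule keyed on id and class can keep working when the surrounding markup
changes, whereas a purely structural path depends on the element's exact
position in the tree.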

Should we put a link to BlogForever on the companies page?

Good luck with the conference submission!

Shane



--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users...@googlegroups.com.
To post to this group, send email to scrapy...@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/groups/opt_out.

olivierbl...@gmail.com

Feb 4, 2014, 10:58:43 AM
to scrapy...@googlegroups.com
> Do you plan to release the testing / evaluation part?

The GitHub repository of the paper [1] contains scripts and instructions to
reproduce the results we present in our evaluation. I could not include the
dataset we used because it's not publicly available, but it should be
reasonably easy to request it or manually build a small one. (I've not yet
included a license because I'm not really sure how it works with the text of
the paper, but everything else will be MIT)


> Should we put a link to BlogForever on the companies page?

At the moment the BlogForever web site is not really up-to-date, and we still
have a bit of work to do (mostly the connection to our back-end) before putting
the crawler in production. The first instance will likely be hosted on CERN
servers to monitor high-energy physics related blogs. I suggest waiting for
this one to be up and running before adding a link; we will get back to you
when it is!

Cheers,
Olivier

Atrijit Dasgupta

Jul 3, 2015, 4:41:28 AM
to scrapy...@googlegroups.com
We are evaluating the BlogForever crawler for a project, and we cannot get the source code from the following page:


- the http://invenio-software.org/repo/blogforever/ resource link gives an error.

However, a copy of the source code is available at https://github.com/BlogForever/crawler, uploaded there by Mr. Olivier Blanvillain.

Is the GitHub source code the final version? We downloaded and ran it, and while it works perfectly with the example sites provided, when we try to crawl other blogs it seems to get into infinite loops and does not produce any output.

And if we cannot get the source code from the blogforever.eu page, is there another location from which we can get it?

Thanks

Atrijit Dasgupta

Olivier Blanvillain

Jul 3, 2015, 4:46:01 AM
to scrapy...@googlegroups.com
Hi, as far as I know nobody continued the development after me, so
what's on GitHub should be the latest version.

Atrijit Dasgupta

Jul 3, 2015, 10:31:03 AM
to scrapy...@googlegroups.com
Thanks Olivier ...