Announcing the BlogForever crawler

olivierbl...@gmail.com

Feb 1, 2014, 11:52:06 AM

to scrapy...@googlegroups.com

Hello everyone,

I'm very happy to announce the release of the BlogForever crawler! Our work
is entirely based on Scrapy, and we want to thank you for the amazing work
you did on this framework, without which we could not have accomplished a
fraction of what we did during the last 6 months.

The crawler targets web blogs and can automatically extract blog post
articles, titles, authors and comments. It's open source and comes with tests
and documentation: <https://github.com/BlogForever/crawler>. We've also
written and submitted a paper to the WIMS14 conference, in which we present
our algorithm for content extraction and a high-level overview of the crawler
architecture, available at
<https://github.com/OlivierBlanvillain/blogforever-crawler-publication/blob/master/tex/main.pdf>.

I believe that the following parts of our project might be useful for other
applications:

- The content extractor's interface is similar to that of Scrapely
  <https://github.com/scrapy/scrapely> (sadly, we discovered Scrapely too late
  to include it in our evaluation). It's very fast and robust: we reached a
  93% success rate on blog article extraction over 2300 blog posts.

- To crawl blogs mixed up with other resources (such as a wiki or a forum), we
  use a simple machine-learning-based priority predictor to favor crawling
  URLs with links to blog posts. This makes the most of a limited number of
  page downloads; otherwise the crawl might get stuck in irrelevant portions
  of the blog.

- We use PhantomJS to do JavaScript rendering, take screenshots and fake some
  user interaction to deal with Disqus comments. We have a pool of reusable
  browsers, which allows us to take full advantage of the processors (with
  JavaScript rendering on, the crawling bottleneck is CPU).
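
As a rough illustration of the priority-driven crawl described in the second
point above, here is a minimal Python sketch. It is not the BlogForever
implementation: the `SITE` link graph, `predict_priority` and `crawl` are
hypothetical names made up for this example, and the simple substring rule
only stands in for the machine-learning predictor. The sketch runs on an
in-memory link graph rather than the network.

```python
import heapq
from typing import Callable, Dict, List

# Toy in-memory "site": URL -> outgoing links. Purely illustrative data.
SITE: Dict[str, List[str]] = {
    "/": ["/wiki", "/forum", "/blog"],
    "/wiki": ["/wiki/a", "/wiki/b"],
    "/forum": ["/forum/t1", "/forum/t2"],
    "/blog": ["/blog/2014/01/post-1", "/blog/2014/02/post-2"],
}


def predict_priority(url: str) -> float:
    """Hypothetical stand-in for a learned predictor.

    Higher means "more likely to lead to blog posts". A real predictor
    would be trained on URL or page features instead of this fixed rule.
    """
    return 1.0 if "/blog" in url else 0.1


def crawl(start: str, budget: int,
          score: Callable[[str], float] = predict_priority) -> List[str]:
    """Visit at most `budget` pages, always expanding the best-scored URL."""
    # heapq is a min-heap, so scores are negated; the counter breaks ties
    # in insertion order.
    frontier = [(-score(start), 0, start)]
    seen = {start}
    visited: List[str] = []
    counter = 1
    while frontier and len(visited) < budget:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)  # a real crawler would download the page here
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), counter, link))
                counter += 1
    return visited


if __name__ == "__main__":
    # With a budget of 4 pages, the scored crawl spends it on blog URLs.
    print(crawl("/", budget=4))
    # With a constant score, the same budget is spent breadth-first.
    print(crawl("/", budget=4, score=lambda url: 0.0))
```

With the same budget, a constant score degrades to a breadth-first visit that
spends downloads on the wiki and forum sections, which is the situation the
predictor is meant to avoid.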

If you take the time to read the paper (8 pages) or the code, don't hesitate

to send comments or feedback!

Regards,

Olivier Blanvillain

Shane Evans

Feb 3, 2014, 5:51:56 PM

to scrapy...@googlegroups.com

Interesting project!

It's nice to see the bits on Scrapy in your paper - thanks! We're delighted it was so useful for the BlogForever crawler. It's great to see your crawler released as open source too.

I thought Scrapely could have been a nice comparison. Your approach takes better advantage of the fact that you have many examples (from the feeds), whereas Scrapely is designed to work with very little example data, so I expect your approach would compare favorably. I see you favor using id and class attributes, something we are considering for Scrapely too, as it currently relies exclusively on HTML structure. Do you plan to release the testing / evaluation part?

Should we put a link to BlogForever on the companies page?

Good luck with the conference submission!

Shane

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users...@googlegroups.com.
To post to this group, send email to scrapy...@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/groups/opt_out.

olivierbl...@gmail.com

Feb 4, 2014, 10:58:43 AM

to scrapy...@googlegroups.com

> Do you plan to release the testing / evaluation part?

The GitHub repository of the paper [1] contains scripts and instructions to
reproduce the results we present in our evaluation. I could not include the
dataset we used because it's not publicly available, but it should be
reasonably easy to request it or to build a small one manually. (I've not yet
included a license because I'm not really sure how licensing works for the
text of the paper, but everything else will be MIT.)

> Should we put a link to BlogForever on the companies page?

At the moment the BlogForever web site is not really up to date, and we still
have a bit of work to do (mostly the connection to our back-end) before
putting the crawler in production. The first instance will likely be hosted
on CERN servers to monitor blogs related to high-energy physics. I suggest
waiting until that one is up and running before adding a link; we will get
back to you when it is!

Cheers,

Olivier

[1]: https://github.com/OlivierBlanvillain/blogforever-crawler-publication

Atrijit Dasgupta

Jul 3, 2015, 4:41:28 AM

to scrapy...@googlegroups.com

We are evaluating the BlogForever crawler for a project, and we cannot get the source code from the following page:

http://blogforever.eu/blog/2013/10/08/blogforever-platform-released/

- the http://invenio-software.org/repo/blogforever/ resource link gives an error.

However, a copy of the source code is available at https://github.com/BlogForever/crawler, uploaded there by Mr. Olivier Blanvillain.

Is the GitHub source code the final version? We downloaded and ran it, and while it works perfectly with the example sites provided, when we try to crawl other blogs it seems to get into infinite loops and does not produce any output.

And if we cannot get the source code from the blogforever.eu page, is there another location from which we can get it?

Thanks

Atrijit Dasgupta

Olivier Blanvillain

Jul 3, 2015, 4:46:01 AM

to scrapy...@googlegroups.com

Hi, as far as I know nobody continued the development after me, so
what's on GitHub should be the latest version.

Atrijit Dasgupta

Jul 3, 2015, 10:31:03 AM

to scrapy...@googlegroups.com

Thanks Olivier ...