ChatNoir: First Public Search Engine for the Common Crawl

Martin Potthast

Mar 27, 2018, 11:01:20 AM

to common...@googlegroups.com

Dear everyone,

I would like to announce the first public search engine to index the Common Crawl: ChatNoir.

ChatNoir is a research search engine developed by Webis, a network of researchers from the Universities of Weimar, Halle, and Leipzig.

The public search interface can be reached at www.chatnoir.eu, where you will also find the documentation of ChatNoir's API.

ChatNoir is based on Elasticsearch and runs on the 130-node Betaweb cluster of the Digital Bauhaus Lab at Bauhaus-Universität Weimar, ensuring search results in milliseconds. More details about the search engine can be found in the demo paper published at ECIR: http://www.uni-weimar.de/medien/webis/publications/papers/stein_2018a.pdf

Just like the Common Crawl Foundation, we are committed to openness:

- ChatNoir will be maintained as a free and open search engine

- ChatNoir's API is available free of charge for research purposes
(a lot cheaper compared to commercial APIs ;-)

- ChatNoir's source code is open source: www.github.com/chatnoir-eu

- We strive to update the Common Crawl index periodically

Feel free to give ChatNoir a spin!

All the best,

Martin Potthast

on behalf of ChatNoir's developers

--

Jun-Prof. Dr. Martin Potthast
Leipzig University

Germany

leipzig.webis.de

Sebastian Nagel

Mar 27, 2018, 11:08:18 AM

to common...@googlegroups.com

Hi Martin,

Thanks for the announcement. Great project, and it's amazing to see the past
frozen in time in search results. I tried:
https://www.chatnoir.eu/?q=common+crawl&index=cc1511
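For anyone who wants to build such links programmatically, here is a minimal sketch in Python. It only reproduces the query string of the URL above: the parameter names `q` and `index` are taken from that URL, while the helper name `chatnoir_search_url` is made up for illustration and is not part of ChatNoir's documented API.

```python
from urllib.parse import urlencode

# Base address of the public search interface, as given in the announcement.
BASE_URL = "https://www.chatnoir.eu/"


def chatnoir_search_url(query: str, index: str) -> str:
    """Build a link to the web interface (hypothetical helper).

    The parameter names "q" and "index" are copied from the example URL;
    any other parameters the site may accept are not covered here.
    """
    # urlencode escapes spaces as "+", matching the example link.
    return BASE_URL + "?" + urlencode({"q": query, "index": index})


print(chatnoir_search_url("common crawl", "cc1511"))
# prints https://www.chatnoir.eu/?q=common+crawl&index=cc1511
```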

> - We strive to update the Common Crawl index periodically

Let us know if we can support you in any way!

Thanks,
Sebastian

Tom Morris

Aug 26, 2018, 8:36:29 PM

to common...@googlegroups.com

That's cool, but until I read the paper today, I didn't realize that Sebastian's "frozen in time" comment was due to the fact that the February 2015 crawl was the one being used.

On Tue, Mar 27, 2018 at 11:01 AM Martin Potthast <martin....@uni-leipzig.de> wrote:

> - We strive to update the Common Crawl index periodically

Is there any information on when these updates will happen? Since the purpose of including Common Crawl was to improve recency over the ClueWeb09 and ClueWeb12 corpora, it seems like a 2018 crawl might be more appropriate. Better yet, taking the most recent deduped pages from the union of a few 2018 crawls would substantially increase coverage.
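To make the "most recent deduped pages from the union of a few crawls" idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the `Page` record, its fields, and the `latest_pages` helper are hypothetical, and it dedupes on exact URL only. A real pipeline would read WARC records and would likely need URL canonicalization and near-duplicate detection as well.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Page:
    """Hypothetical stand-in for one crawled page."""
    url: str
    crawl_date: str  # ISO 8601 date, so string comparison is chronological
    body: str


def latest_pages(crawls: Iterable[Iterable[Page]]) -> Dict[str, Page]:
    """Union several crawls, keeping only the newest copy of each URL."""
    latest: Dict[str, Page] = {}
    for crawl in crawls:
        for page in crawl:
            seen = latest.get(page.url)
            # Keep the page if the URL is new or this copy is more recent.
            if seen is None or page.crawl_date > seen.crawl_date:
                latest[page.url] = page
    return latest


# Toy data: two crawls that overlap on one URL.
january = [Page("http://example.com/a", "2018-01-20", "old a"),
           Page("http://example.com/b", "2018-01-21", "b")]
march = [Page("http://example.com/a", "2018-03-18", "new a"),
         Page("http://example.com/c", "2018-03-19", "c")]

merged = latest_pages([january, march])
print(len(merged))                           # prints 3
print(merged["http://example.com/a"].body)   # prints new a
```

The union covers three URLs where either crawl alone covers two, which is the coverage gain I mean.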

Tom

Shameel Abdulla

Sep 3, 2018, 2:14:25 AM

to Common Crawl

@Martin, tremendous job. This will help researchers and analysts a lot. We are thinking of having our team contribute to the project as well.

I would like to understand the following:

1) The crawl currently used is from 2015. How can we update to the latest crawl?

2) What are your thoughts on filtering results by source type (news, blogs, forums, social media)?

3) What are your thoughts on filtering results by country of origin?

4) What are your thoughts on filtering results by date?

Keep up the great work. Looking forward to seeing how I can contribute.
