We are making publicly available our largest dataset to date, Freebase Annotations of the ClueWeb Corpora (FACC1), which contains 11 billion entity annotations in 800 million documents.
We annotated English-language Web pages from two corpora, ClueWeb09 (http://lemurproject.org/clueweb09/) and ClueWeb12 (http://lemurproject.org/clueweb12/). The annotation process was automatic, and for each entity we recognized with high confidence, we provide the beginning and end byte offsets of the entity mention in the input text, its Freebase identifier (mid), and two confidence levels.
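To illustrate how byte offsets of this kind might be used, here is a minimal Python sketch that slices a mention out of a raw document. The `Annotation` record, its field names, the example page, and the mid value are all hypothetical and are not the FACC1 file format. The sketch also assumes that the end offset is exclusive and that the page is UTF-8 encoded, neither of which is stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One entity annotation. Field names are illustrative, not the FACC1 schema."""
    begin: int           # byte offset where the mention starts
    end: int             # byte offset where the mention ends (assumed exclusive)
    mid: str             # Freebase identifier
    confidence_a: float  # first confidence level (hypothetical name)
    confidence_b: float  # second confidence level (hypothetical name)


def mention_text(document: bytes, ann: Annotation, encoding: str = "utf-8") -> str:
    """Return the mention by slicing the raw bytes, then decoding the slice."""
    if not 0 <= ann.begin <= ann.end <= len(document):
        raise ValueError("annotation offsets fall outside the document")
    return document[ann.begin:ann.end].decode(encoding, errors="replace")


# Made-up page and annotation, for illustration only.
page = "Visit Zürich in spring.".encode("utf-8")
ann = Annotation(begin=6, end=13, mid="/m/hypothetical",
                 confidence_a=0.9, confidence_b=0.8)
print(mention_text(page, ann))  # Zürich
```

The slice is taken on the bytes before decoding because the offsets count bytes, not characters. In this example "ü" occupies two bytes, so applying the same offsets to the decoded string would return "Zürich " with a trailing space.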
There are 340,451,982 documents in ClueWeb09 and 456,498,584 documents in ClueWeb12 with at least one entity annotated. On average, ClueWeb09 documents have 15 entity mentions annotated, and ClueWeb12 documents have 13 mentions annotated.
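As a quick sanity check, the per-corpus figures can be combined to recover the headline numbers. Because the per-document averages are rounded, the annotation total below is only an estimate.

```python
# Document counts and rounded per-document averages quoted above.
docs = {"ClueWeb09": 340_451_982, "ClueWeb12": 456_498_584}
avg_mentions = {"ClueWeb09": 15, "ClueWeb12": 13}

total_docs = sum(docs.values())
approx_annotations = sum(docs[c] * avg_mentions[c] for c in docs)

print(f"{total_docs:,} documents")                                   # 796,950,566 documents
print(f"about {approx_annotations / 1e9:.1f} billion annotations")   # about 11.0 billion annotations
```

Both results agree with the rounded totals of 800 million documents and 11 billion annotations.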
The annotation data will be distributed by CMU. More details about the data are available at http://lemurproject.org/clueweb09/FACC1/ and http://lemurproject.org/clueweb12/FACC1/ (these pages will soon be updated with instructions on how to apply to get a copy of the dataset). We are thankful to Juan Caicedo Carvajal for his help in preparing the data. We are also thankful to Jamie Callan for his help and advice throughout the annotation project and for hosting the annotated data at CMU.
Evgeniy.
P.S. You might want to subscribe to our mailing list (http://goo.gl/MJb3A) to get timely notifications of future data releases. The list archives are open, so you're welcome to browse them to learn more about our data releases to date.