Re: ANN: Corpus of 147 million quasi-relational Web tables released for public download

Tom Morris

Mar 6, 2014, 1:51:47 PM

to web-data...@googlegroups.com

That looks very cool.

On Thu, Mar 6, 2014 at 10:57 AM, Robert Meusel <robert...@gmail.com> wrote:

More information about the corpus, its application domains as well as information about how to download the corpus is found at http://webdatacommons.org/webtables/

A few early questions based on reading that page:

- what Amazon instance was used? "x1.large" isn't a current instance type

- how was topic identification done? The last paragraph of 3.3 talks about looking up property names in DBpedia and 3.4 has labels broken down by "topic", but I think I'm missing something in the middle. In particular, for example, how are the values "Mississippi", "Lena", "Don" and "McKenzie" identified as the names of rivers as opposed to a state, two given names, and a surname?

- is the schema for the JSON file documented anywhere? Most of it seems self-explanatory, but one bit which isn't obvious is "hasRelevantTables". What makes a table relevant?

Thanks for a great resource!

Tom

Petar Ristoski

Mar 6, 2014, 7:31:33 PM

to web-data...@googlegroups.com

Hi Tom,

Thank you for your interest and questions.

- what Amazon instance was used? "x1.large" isn't a current instance type

We were using c1.xlarge machines. Thanks for pointing that out; we have already updated the information on our Web page.

- how was topic identification done? The last paragraph of 3.3 talks about looking up property names in DBpedia and 3.4 has labels broken down by "topic", but I think I'm missing something in the middle. In particular, for example, how are the values "Mississippi", "Lena", "Don" and "McKenzie" identified as the names of rivers as opposed to a state, two given names, and a surname?

We are not yet doing topic identification over the corpus. In section 3.3, we try to give users a better understanding of the topics of the tables, based on the column headers. We believe that the complete distribution of headers may help some users check whether a given topic is covered in the corpus. Some headers are too generic to reveal a topic, e.g. "name", but others are quite specific: "area" and "population" are likely column headers in tables that contain data about geographical areas, and "isbn" is clearly a column header in tables that contain data about books and/or authors. Additionally, we thought that linking DBpedia properties with column headers might be useful to the LOD community for the same purpose.

In section 3.4, we want to show what kinds of values are covered in the corpus, and how well they are covered. For each value in the key columns (the key column as in a relational table), we count the number of tables in which the value appears. We should have clarified that we are not using any entity disambiguation approach; we use simple string matching to count the occurrences of each value over the whole corpus. E.g. "mississippi" can be found in 87367 tables, but we don't disambiguate whether the value refers to the state of "Mississippi" or the river "Mississippi". Again, the main idea of compiling the value distribution is to give a high-level overview of the topics for which we can expect to find data in the corpus, without putting too much thought into value disambiguation.
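To make the counting concrete, here is a minimal sketch of that kind of per-table string matching. It is only an illustration under assumptions: the function name `count_tables_per_value`, the lowercasing and whitespace stripping, and the toy tables are hypothetical and not taken from the actual extraction code.

```python
from collections import Counter


def count_tables_per_value(key_columns):
    """Count, for each key-column value, the number of tables it appears in.

    key_columns: iterable of lists, one list of key-column strings per table.
    Matching is plain string equality after lowercasing and stripping
    (an assumed normalization); no entity disambiguation is attempted.
    """
    counts = Counter()
    for key_column in key_columns:
        # Use a set so a value is counted at most once per table.
        normalized = {v.strip().lower() for v in key_column if v.strip()}
        counts.update(normalized)
    return counts


# Toy data (hypothetical): three tables with different topics.
tables = [
    ["Mississippi", "Lena", "Don"],   # rivers
    ["Mississippi", "Alabama"],       # US states
    ["Don", "Lena", "lena"],          # given names
]
counts = count_tables_per_value(tables)
print(counts["mississippi"], counts["lena"], counts["alabama"])
```

Note that "mississippi" is counted for both the river table and the state table, which is exactly the ambiguity described above.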

- is the schema for the JSON file documented anywhere? Most of it seems self-explanatory, but one bit which isn't obvious is "hasRelevantTables". What makes a table relevant?

The documented schema for the JSON file can be found here. The field "hasRelevantTables" was used during the extraction and should have been removed from the file. It no longer contains any useful information and should not be used.
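For anyone reading the files in the meantime, a small sketch of dropping that field on load could look like the following. The helper name `load_table` and the `relation` field in the sample record are assumptions made up for illustration; only "hasRelevantTables" comes from this thread.

```python
import json


def load_table(raw_json):
    """Parse one table record and discard the leftover extraction field."""
    record = json.loads(raw_json)
    # "hasRelevantTables" carries no useful information; remove it if present.
    record.pop("hasRelevantTables", None)
    return record


# Hypothetical sample record; "relation" is an assumed field name.
raw = '{"hasRelevantTables": false, "relation": [["river", "Mississippi"]]}'
table = load_table(raw)
print(sorted(table.keys()))
```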

Cheers,

Petar

Tom Morris

Mar 7, 2014, 1:58:11 AM

to web-data...@googlegroups.com

Thanks for the quick response and the clarifications, Petar.

On Thu, Mar 6, 2014 at 7:31 PM, Petar Ristoski <petar.ri...@gmail.com> wrote:

In section 3.4, we want to show what kinds of values are covered in the corpus, and how well they are covered. For each value in the key columns (the key column as in a relational table), we count the number of tables in which the value appears. We should have clarified that we are not using any entity disambiguation approach; we use simple string matching to count the occurrences of each value over the whole corpus. E.g. "mississippi" can be found in 87367 tables, but we don't disambiguate whether the value refers to the state of "Mississippi" or the river "Mississippi". Again, the main idea of compiling the value distribution is to give a high-level overview of the topics for which we can expect to find data in the corpus, without putting too much thought into value disambiguation.

This might be worth clarifying in the documentation. It's not clear, or at least it wasn't to me, that the words in the columns are just words which can, in certain contexts, also be the names of rivers, rather than actually being names of rivers in the context in which they were found in the corpus. I suspect that the values in the footballers column, for example, are much more representative than those in the rivers column.

Tom
