We were using c1.xlarge machines. Thanks for pointing this out; we have already updated the information on our web page.
-how was topic identification done? The last paragraph of 3.3 talks about looking up property names in DBpedia and 3.4 has labels broken down by "topic" but I think I'm missing something in the middle. In particular, for example, how are the values "Mississipi" "Lena" "Don" and "McKenzie" identified as the names of rivers as opposed to a state, two given names, and a surname?
We are not doing topic identification over the corpus yet. In section 3.3, we are trying to give users a better understanding of the topics of the tables, based on the column headers. We believe that the complete distribution of headers might help some users check whether a given topic is covered in the corpus. Some headers are too generic for us to infer a topic from them, e.g. "name", but others are quite specific to a topic: "area" and "population" are likely to be column headers in tables that contain data about geographical areas, and "isbn" is clearly a column header in tables that contain data about books and/or authors. Additionally, we thought that linking DBpedia properties with column headers might be useful for the LOD community, for the same purpose.
In section 3.4, we want to show what kinds of values are covered in the corpus, and how well they are covered. For each value in the key columns (analogous to a key column in a relational table), we count the number of tables in which the value appears. We should have clarified that we do not use any entity disambiguation approach; we use simple string matching to count value occurrences over the whole corpus. E.g. "mississippi" can be found in 87367 tables, but we don't disambiguate whether the value refers to the state of Mississippi or the Mississippi river. Again, the main idea of compiling the value distribution is to give a high-level overview of the topics for which we can expect to find data in the corpus, without putting too much thought into value disambiguation.
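To make the counting concrete, here is a minimal sketch of how such a per-value table count could be computed. It is not our extraction code: the function name, the input layout (one list of key-column cells per table), and the trim-and-lowercase normalisation are assumptions made for illustration.

```python
from collections import Counter


def count_tables_per_value(key_columns):
    """Count, for each key-column value, how many tables contain it.

    key_columns: iterable of lists of strings, one list per table
    (hypothetical input layout, chosen for this sketch).
    """
    counts = Counter()
    for cells in key_columns:
        # Assumed normalisation: trim whitespace and lowercase.
        # Using a set means a table counts once per value, even if
        # the value is repeated inside that table.
        distinct = {c.strip().lower() for c in cells if c and c.strip()}
        counts.update(distinct)
    return counts


# Toy data: the same string is counted regardless of what it refers to.
key_columns = [
    ["Mississippi", "Lena", "Don"],   # a table about rivers
    ["Mississippi", "Alabama"],       # a table about states
    ["mississippi ", "Mississippi"],  # repeated value in one table
]
counts = count_tables_per_value(key_columns)
```

In this toy input, "mississippi" is counted once for each of the three tables, with no attempt to tell the river from the state, which is the behaviour described above.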
The documented schema for the JSON file can be found here. The field "hasRelevantTables" was used during the extraction and should have been removed from the file. It no longer contains any useful information, so it should not be used.
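If you want to make sure the leftover field does not leak into your own processing, one option is to drop it when loading each table. The sketch below assumes one JSON object per table; the helper name and the "relation" field in the toy input are illustrative only, so please check the documented schema for the actual field names.

```python
import json


def load_table(raw_json):
    """Parse one table's JSON and discard the leftover extraction field."""
    table = json.loads(raw_json)
    # "hasRelevantTables" carries no useful information; remove it if present.
    table.pop("hasRelevantTables", None)
    return table


# Toy input; "relation" is a hypothetical field used only for this example.
example = '{"hasRelevantTables": false, "relation": [["name", "Mississippi"]]}'
table = load_table(example)
```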
Cheers,
Petar