Hi kw:
This is a good question, so I checked in with our nGrams team about this dataset. It is a bit confusing because of the field names and the many nulls in its columns.
Take, for example, this query:
SELECT ngram, cell.value, cell.volume_count, cell.volume_fraction, cell.page_count, cell.match_count FROM [publicdata:samples.trigrams] WHERE ngram CONTAINS "dinosaur";
You'll get data that looks something like this:
Row | ngram | cell_value | cell_volume_count | cell_volume_fraction | cell_page_count | cell_match_count
1 | of these dinosaurs | 1888 | 1 | 1.6196954972465177E-4 | 1 | 1
2 | of these dinosaurs | 1890 | 1 | 1.6196954972465177E-4 | 1 | 1
The "cell" prefix doesn't mean anything. Each row refers to an nGram appearance in a particular "volume" (book).
value = the year of the particular volume containing this nGram
volume fraction = the fraction of books in the corpus that contain this nGram
page count = the number of pages in the volume on which this nGram appears
match count = the number of times this nGram appears in the volume
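To make the field meanings concrete, here is a minimal Python sketch that totals the match count per year for the two sample rows shown above. The tuple layout and the function name matches_per_year are my own assumptions for illustration, and I am treating the year in "value" as an integer, which may not match how the table actually stores it.

```python
from collections import defaultdict

# The two sample rows from the query output, as tuples of
# (ngram, value, volume_count, volume_fraction, page_count, match_count).
# This layout is an assumption made for illustration only.
rows = [
    ("of these dinosaurs", 1888, 1, 1.6196954972465177e-4, 1, 1),
    ("of these dinosaurs", 1890, 1, 1.6196954972465177e-4, 1, 1),
]


def matches_per_year(rows):
    """Sum match_count for each year (the 'value' field)."""
    totals = defaultdict(int)
    for _ngram, year, _vol_count, _vol_fraction, _pages, matches in rows:
        totals[year] += matches
    return dict(totals)


print(matches_per_year(rows))
```

Grouping on "value" like this is just one way to read the rows as a per-year time series for a given nGram.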
As I mentioned above, this dataset currently has some null columns: the fourth and fifth grams are not there, nor are the book titles and authors. For example, running a query such as:
SELECT COUNT(*) as row_count FROM [publicdata:samples.trigrams] WHERE cell.sample.title != "" OR cell.sample.title IS NOT NULL;
... returns 0.
Note that these publicdata:samples datasets are really useful for testing BigQuery's syntax and for spinning up demo apps, but they are best used as test datasets and are not necessarily a good choice for scientific research. For that purpose, I would construct my own controlled corpora.
- Michael