publicdata:samples.trigrams explained


kw

Mar 28, 2012, 9:30:49 PM
to bigquery...@googlegroups.com
Hi guys,

Can you explain a little about the structure of the publicdata:samples.trigrams dataset?

What are the cell.* and other columns for?

Thanks in advance.

kw

Apr 5, 2012, 12:35:52 AM
to bigquery...@googlegroups.com
Hi,
Any updates here, or am I asking something stupid, inappropriate, or universally unknown?

Kind regards.

Michael Manoochehri

Apr 5, 2012, 1:33:07 AM
to bigquery...@googlegroups.com
Hi kw:

This is a good question, so I checked in with our nGrams team about this dataset. I think it is a bit confusing because of the field names and the many nulls in the columns.

Take, for example, this query:
SELECT ngram, cell.value, cell.volume_count, cell.volume_fraction, cell.page_count, cell.match_count FROM [publicdata:samples.trigrams] WHERE ngram CONTAINS "dinosaur";

You'll get data that looks something like this:
Row  ngram               cell_value  cell_volume_count  cell_volume_fraction   cell_page_count  cell_match_count
1    of these dinosaurs  1888        1                  1.6196954972465177E-4  1                1
2    of these dinosaurs  1890        1                  1.6196954972465177E-4  1                1

The "cell" prefix doesn't mean anything. Each row refers to an nGram appearance in a particular "volume" (book).

value = the year of the particular volume containing this nGram
volume fraction = the fraction of books in the corpus that contain this nGram
page count = the number of pages in the volume on which this nGram appears
match count = the number of times this nGram appears in the volume
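
As a rough illustration of how these fields read, here is a minimal Python sketch. The `TrigramCell` class and the `matches_by_year` helper are hypothetical names made up for this example and are not part of BigQuery; the field types are assumptions, and the two rows are the ones from the result above, with one cell per row.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TrigramCell:
    """One flattened result row (hypothetical model, not a BigQuery type)."""
    ngram: str
    value: int              # year of the volume containing the nGram (cell.value)
    volume_count: int       # cell.volume_count
    volume_fraction: float  # fraction of books in the corpus containing the nGram
    page_count: int         # pages in the volume on which the nGram appears
    match_count: int        # times the nGram appears in the volume


def matches_by_year(rows):
    """Sum match_count per year over the given rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[row.value] += row.match_count
    return dict(totals)


rows = [
    TrigramCell("of these dinosaurs", 1888, 1, 1.6196954972465177e-4, 1, 1),
    TrigramCell("of these dinosaurs", 1890, 1, 1.6196954972465177e-4, 1, 1),
]

print(matches_by_year(rows))  # {1888: 1, 1890: 1}
```

Each row contributes its match count to the year stored in `value`, which is why the two sample rows end up as one match in 1888 and one in 1890.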

As I mentioned above, this dataset currently has some null columns: the fourth and fifth grams are not there, nor are the book titles and authors. For example, running a query such as:
SELECT COUNT(*) as row_count FROM [publicdata:samples.trigrams] WHERE cell.sample.title != "" OR cell.sample.title IS NOT NULL;
... returns 0.
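
The same check can be sketched outside of SQL. This is only an illustration in Python, assuming each row is a dict in which a null title is represented as `None`; `count_rows_with_title` is a hypothetical helper, and the sample rows are made up to mirror the all-null situation described above.

```python
def count_rows_with_title(rows):
    """Count rows matching: title != "" OR title IS NOT NULL.

    A comparison against NULL is never true in SQL, so the first
    branch is guarded with an explicit None check.
    """
    count = 0
    for row in rows:
        title = row.get("title")
        not_empty = title is not None and title != ""
        not_null = title is not None
        if not_empty or not_null:
            count += 1
    return count


# Made-up rows in which every title is null, as in the sample dataset.
sample_rows = [
    {"ngram": "of these dinosaurs", "title": None},
    {"ngram": "of these dinosaurs", "title": None},
]

print(count_rows_with_title(sample_rows))  # 0
```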

Note that these publicdata:samples datasets are really useful for testing BigQuery's syntax and for spinning up demo apps, but they are best used as test datasets and are not necessarily a good choice for scientific research. For that purpose, I would construct my own controlled corpora.

- Michael

Keiw Kw

Apr 5, 2012, 4:28:57 AM
to bigquery...@googlegroups.com
Thank you for the quick response and the clear description, and also for pointing to the source of this sample data and for the note about usage!

Appreciate your help.

-- 
Kind regards,
Alexander.
