publicdata:samples.trigrams explained

kw

Mar 28, 2012, 9:30:49 PM

to bigquery...@googlegroups.com

Hi guys,

Can you explain a little about the structure of the publicdata:samples.trigrams dataset?

It is not listed at https://developers.google.com/bigquery/docs/sample-tables

What are cell.* and other columns for?

Thanks in advance.

kw

Apr 5, 2012, 12:35:52 AM

to bigquery...@googlegroups.com

Hi,

Any updates here? Or am I asking something stupid, inappropriate, or universally unknown?

Kind regards.

Michael Manoochehri

Apr 5, 2012, 1:33:07 AM

to bigquery...@googlegroups.com

Hi kw:

This is a good question, so I checked in with our nGrams team about this dataset. I think it is a bit confusing because of the field names and the many nulls in the columns.

Take, for example, this query:

SELECT ngram, cell.value, cell.volume_count, cell.volume_fraction, cell.page_count, cell.match_count FROM [publicdata:samples.trigrams] WHERE ngram CONTAINS "dinosaur";

You'll get data that looks something like this:

Row	ngram	cell_value	cell_volume_count	cell_volume_fraction	cell_page_count	cell_match_count
1	of these dinosaurs	1888	1	1.6196954972465177E-4	1	1
2	of these dinosaurs	1890	1	1.6196954972465177E-4	1	1

The "cell" prefix doesn't mean anything. Each row refers to an nGram appearance in a particular "volume" (book).

cell.value = the year of the particular volume containing this nGram

cell.volume_fraction = the fraction of books in this corpus that contain this nGram

cell.page_count = the number of pages in the volume on which this nGram appears

cell.match_count = the number of times this nGram appears in the volume
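
To make the field meanings concrete, here is a minimal Python sketch that sums match counts per nGram and year. It is only an illustration, not part of BigQuery: it assumes the query result was saved as tab-separated text with the column names shown in the output above, and matches_per_year is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# The two sample rows from the result above, tab-separated as displayed.
# Assumption: the result was exported as TSV with these header names.
SAMPLE = (
    "ngram\tcell_value\tcell_volume_count\tcell_volume_fraction"
    "\tcell_page_count\tcell_match_count\n"
    "of these dinosaurs\t1888\t1\t1.6196954972465177E-4\t1\t1\n"
    "of these dinosaurs\t1890\t1\t1.6196954972465177E-4\t1\t1\n"
)


def matches_per_year(tsv_text):
    """Sum cell_match_count per (ngram, year); cell_value holds the year."""
    totals = defaultdict(int)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        year = int(row["cell_value"])  # "value" is the volume's year
        totals[(row["ngram"], year)] += int(row["cell_match_count"])
    return dict(totals)


totals = matches_per_year(SAMPLE)
for (ngram, year), count in sorted(totals.items()):
    print(ngram, year, count)
```

Keying on (ngram, year) keeps each year separate, which matches how the rows above repeat the same nGram once per year.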

As I mentioned above, this dataset currently has some null columns: the fourth and fifth grams are not there, nor are the book titles and authors. For example, running a query such as:

SELECT COUNT(*) as row_count FROM [publicdata:samples.trigrams] WHERE cell.sample.title != "" OR cell.sample.title IS NOT NULL;

... returns 0.

Note that these publicdata:samples datasets are really useful for testing BigQuery's syntax and for spinning up demo apps, but they are best used as test datasets and are not necessarily a good choice for scientific research. I would construct my own controlled corpora for that purpose.

- Michael

Keiw Kw

Apr 5, 2012, 4:28:57 AM

to bigquery...@googlegroups.com

Thank you for the quick response and the clear description, and also for pointing to the source of this sample data and for the note about its usage!

I appreciate your help.

--

Kind regards,

Alexander.
