BigQuery bug

157 views
Skip to first unread message

Василий Барсуков

unread,
Apr 18, 2012, 2:39:01 PM4/18/12
to bigquery...@googlegroups.com
Hi, guys! We have found an annoying bug in calculatiing  unique field values count. We created a test case for you. There is a single field in a table
[
 { "name": "id",        "type":"integer",       "mode":"REQUIRED" }
]
It was created and populated by the data in attachement. All the data were successfully uploaded into table named ads_test.unique_test.
lines in file

vaska@vaska-laptop:~$ gzip -cd test.dt.gz | wc -l
385011

records in the table
select count(*) from [ads_test.unique_test];
Row f0_  
1 385011  

but calculating the count of unique id values shows less unique values then it is:

vaska@vaska-laptop:~$ gzip -cd test.dt.gz | sort -n -u | wc -l
304323

SELECT count(distinct id) FROM [ads_test.unique_test];

Row f0_  
1 290157  

We use such queries in calculating unique visitors of out system, and discrepancy in 5% is sessms to be huge

BR - Vasily Barsukov

test.dt.gz

Sanjath Shringeri

unread,
Apr 18, 2012, 2:42:33 PM4/18/12
to bigquery...@googlegroups.com
count distinct is an statistical approximation. As per the documentation, if you need accurate count, you have to feed the approximate count as the second parameter. 

"
Returns the total number of non-NULL values in the scope of the function. If you use the DISTINCT keyword, the function returns the number of unique values in the specified field. Note that the returned value for DISTINCT is a statistical estimate, not necessarily the exact value.

When you use the DISTINCT keyword, you can also specify a value for n (n > 0). If the count is greater than n, the function returns a statistical approximation. If the count is less than n, the function returns an exact result. The default value for n is 1000. Large values for n might substentially increase query execution time.

In general, you should use COUNT([DISTINCT] field [, n]) instead of COUNT(*) because specifying the field removes confusion over what you are counting when repeated fields are present.

We also stumbled across the same issue, and I got a clarification with the above info. 

HTH.

- Sanjath

--
___________________________________
Sanjath Shringeri | VP, Engineering | claritics | Social. Intelligence. Now. 
Come visit us at GDC 2012 in San Francisco, booth & session
408.796.1287 |

Blinded

unread,
Apr 23, 2012, 10:51:52 AM4/23/12
to bigquery...@googlegroups.com
Hi! That'me again! I followed by docs, but I have got 
Query Failed
Error: Response too large to return.
I think this means I'll never get the exact count. Then the question is how can i evaluate your estimation precision. I suppose I have about 700 * 10^6    unique values in my query 

среда, 18 апреля 2012 г., 22:42:33 UTC+4 пользователь Sanjath Shringeri написал:
Reply all
Reply to author
Forward
0 new messages