uniq vs uniqExact

1,219 views
Skip to first unread message

kriticar

unread,
Oct 16, 2019, 6:32:26 AM10/16/19
to ClickHouse
Hi,



uniq

Calculates the approximate number of different values of the argument.
uniq(x[, ...])


and


uniqExact

Calculates the exact number of different argument values.
uniqExact(x[, ...])


Use the uniqExact function if you absolutely need an exact result. Otherwise use the uniq function.

The uniqExact function uses more memory than uniq, because the size of the state has unbounded growth as the number of different values increases.


What does it mean approximate and exact in this case?
Who needs approximate result?

Is uniq reliable?

Regards.

Denis Zhuravlev

unread,
Oct 16, 2019, 9:26:37 AM10/16/19
to ClickHouse
>What does it mean approximate and exact in this case?
It means that result will be accurate with +/- 0.00% precision.

>Who needs approximate result?
everybody who needs speed.

>Is uniq reliable?
yes


create table u(I Int64) Engine=MergeTree() order by tuple();
insert into u select cityHash64(number) from numbers(100000000);

select uniq(I) from u;

┌───uniq(I)─┐
│ 100107766 │
└───────────┘

1 rows in set. Elapsed: 0.310 sec. Processed 100.00 million rows


select uniqExact(I) from u;

┌─uniqExact(I)─┐
│    100000000 │
└──────────────┘

1 rows in set. Elapsed: 8.840 sec. Processed 100.00 million rows


As you can see uniq is 20 times faster. But brought 0.1% error.

There is more functions:

SELECT uniqCombined(20, I) FROM u

┌─uniqCombined(20, I)─┐
│           100001110 │
└─────────────────────┘


error is 0.001%

kriticar

unread,
Oct 17, 2019, 2:33:18 AM10/17/19
to ClickHouse
Thanks for clarification Denis.

Regards.

kriticar

unread,
Oct 25, 2019, 3:58:55 AM10/25/19
to ClickHouse
I have just executed two successive queries:

select startOf15Min, uniqExactMerge(uniqExactCusts) uniqExactCusts
from interval_sums_agg
where startOf15MinDay = today()
group by startOf15Min
order by startOf15Min

lasted 27 seconds

select startOf15Min, uniqExact(customerId)
from indicators_15_min
where startOf15MinDay = today()
group by startOf15Min
order by startOf15Min

lasted 3 seconds

Average difference between results is 1%.
It is way higher than +/- 0.00% precision.

For example, uniq shows
2.673.280,00
vs uniqExact
2.682.861,00

The difference is
9.581,00

it is about
1,003584%
difference.

Looks like uniq is not so precise.

Regards.

On Wednesday, 16 October 2019 15:26:37 UTC+2, Denis Zhuravlev wrote:

Denis Zhuravlev

unread,
Oct 25, 2019, 10:46:52 AM10/25/19
to ClickHouse

>it is about
>1,003584%
>difference.
>Looks like uniq is not so precise.

You don't understand.
If one needs response time less 100ms, one use different tricks: calculation by 1% of data (sampling), uniq and so on. In these cases always will be some inaccuracy but it does not matter. 

kriticar

unread,
Oct 28, 2019, 6:03:37 AM10/28/19
to ClickHouse
OK, in that case, it is not for me, because I have to provide exact numbers.
Reply all
Reply to author
Forward
0 new messages