Cosine Similarity

79 views
Skip to first unread message

Amanali Bekbolat

unread,
Dec 21, 2021, 7:39:31 AM12/21/21
to ClickHouse
hi, we are currently using clickhouse+kafka to store machine learning results, i.e. rows with columns containing vectors. We have descriptors containing floats and event_id of each descriptor, we have to compute cosine similarity of these event_id. I need to choose random pairs of event_id and find cosine similarity of each pair, it must be over 0.8
Here is the example, but its working only when I specify the index of each event_id and descriptor: 
WITH two_events AS (
SELECT groupArray(event_id)[1] as aid, groupArray(event_id)[2] as bid, groupArray(descriptor)[1] as a, groupArray(descriptor)[2] as b FROM ( SELECT descriptor, max(event_id) as event_id FROM teye.fr_descr_array GROUP BY descriptor LIMIT 2 ) ) SELECT aid, bid, 1 - arraySum((a_i, b_i) -> (a_i * b_i), a, b) / sqrt(arraySum(x -> (x * x), a) * arraySum(x -> (x * x), b)) AS cos_similarity FROM two_events

Alexey Milovidov

unread,
Dec 21, 2021, 7:58:32 AM12/21/21
to Amanali Bekbolat, ClickHouse
Hi! Since ClickHouse version 21.11 we have dedicated functions for
cosine similarity:
https://presentations.clickhouse.com/release_21.11/#20

These functions will simplify your queries.
> --
> You received this message because you are subscribed to the Google Groups "ClickHouse" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to clickhouse+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/clickhouse/e82f9168-e106-4075-931e-addb9a190ca7n%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages