Hi, we are currently using ClickHouse + Kafka to store machine learning results, i.e. rows with columns containing vectors. Each row has a descriptor (an array of floats) and the event_id that descriptor belongs to. I need to pick random pairs of event_id, compute the cosine similarity of their descriptors, and keep only the pairs whose similarity is over 0.8.
Here is an example, but it only works when I explicitly specify the array index of each event_id and descriptor:
WITH two_events AS (
    SELECT
        groupArray(event_id)[1] AS aid,
        groupArray(event_id)[2] AS bid,
        groupArray(descriptor)[1] AS a,
        groupArray(descriptor)[2] AS b
    FROM (
        -- one row per distinct descriptor, keeping its latest event_id
        SELECT descriptor, max(event_id) AS event_id
        FROM teye.fr_descr_array
        GROUP BY descriptor
        LIMIT 2
    )
)
SELECT
    aid,
    bid,
    -- cosine similarity = dot(a, b) / (|a| * |b|)
    arraySum((a_i, b_i) -> (a_i * b_i), a, b) / sqrt(arraySum(x -> (x * x), a) * arraySum(x -> (x * x), b)) AS cos_similarity
FROM two_events
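
What I am after is roughly the following, written here as an untested sketch rather than a working solution. It assumes that every descriptor array has the same length, that no descriptor is all zeros, and that a random sample of rows is small enough to cross join. The sample size of 100 is a made-up placeholder, and unlike the query above it does not deduplicate descriptors first.

```sql
-- Sketch only: pair up two random samples of rows and keep similar pairs.
SELECT aid, bid, cos_similarity
FROM (
    SELECT
        l.event_id AS aid,
        r.event_id AS bid,
        -- cosine similarity = dot(a, b) / (|a| * |b|)
        arraySum((x, y) -> x * y, l.descriptor, r.descriptor)
            / sqrt(arraySum(x -> x * x, l.descriptor)
                 * arraySum(x -> x * x, r.descriptor)) AS cos_similarity
    FROM (
        SELECT event_id, descriptor
        FROM teye.fr_descr_array
        ORDER BY rand()
        LIMIT 100  -- hypothetical sample size
    ) AS l
    CROSS JOIN (
        SELECT event_id, descriptor
        FROM teye.fr_descr_array
        ORDER BY rand()
        LIMIT 100  -- hypothetical sample size
    ) AS r
    -- drop self-pairs and mirrored duplicates such as (a, b) and (b, a)
    WHERE l.event_id < r.event_id
)
WHERE cos_similarity > 0.8
```

The two subqueries are sampled independently, so the pairs come from two different random sets of rows. The cross join produces up to 100 x 100 candidate pairs, so the sample size is what keeps the cost under control.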