Does Clickhouse support something like fulltext index?

瀏覽次數:2,618 次
跳到第一則未讀訊息

Filip Podstavec

未讀,
2021年6月23日 下午1:13:572021/6/23
收件者:ClickHouse
Hello,
 I'm trying to make a simple fulltext search in Clickhouse with dozens and sometimes hundreds of milions keywords but I'm not sure if I'm doing it in a good way. 

Currently I have a table:
CREATE TABLE default.fulltext
(
    `keyword` String,
    INDEX fulltext_index (keyword) TYPE ngrambf_v1(3, 256, 2, 0) GRANULARITY 8000
  
)
ENGINE = MergeTree()
ORDER BY (keyword);

Filled with 100 mil. random keywords. To search in it I'm using LIKE operator on query like:
select count() from default.fulltext where keyword like '%fil%'

But even with index on keyword column it still gets around 1-2 seconds to get results. What is stranger - same select on column without index get also around 1-3 seconds (so have similar performance).

Probably I'm doing something wrong but I haven't found any articles dedicated to fulltext search in Clickhouse (in our case it's not fulltext search but we need to find any rows that contains keyword anywhere).

Does Clickhouse support something like MySQL fulltext index or a way how to perform LIKE '%%' or multiSearchAny fast?

Below the line:
- We want to find string anywhere inside keyword (so solution like LIKE 'fil%' is faster but doesn't make sense for us)

Denis Zhuravlev

未讀,
2021年6月23日 下午1:35:072021/6/23
收件者:ClickHouse
>GRANULARITY 8000

It should be GRANULARITY 3 or GRANULARITY 2.
It's number of PrimaryINDEX granules. Basically GRANULARITY 8000 means = 8000*8129 rows -- that is nonsense.

Filip Podstavec

未讀,
2021年6月23日 下午1:51:292021/6/23
收件者:ClickHouse
Omg my bad...sorry I though GRANULARITY 8000 limits it to 8000 rows not 8000*8129 rows. Thank you!

Dne středa 23. června 2021 v 19:35:07 UTC+2 uživatel denis.z...@gmail.com napsal:

Filip Podstavec

未讀,
2021年6月23日 下午3:40:492021/6/23
收件者:ClickHouse
I tried it with GRANULARITY set to 2 but it still seems almost the same in term of performance. Currently around 70 mil. rows per second and it still has to go through all database. Is there something I miss?

Dne středa 23. června 2021 v 19:51:29 UTC+2 uživatel Filip Podstavec napsal:

Christophe Kalenzaga

未讀,
2021年6月23日 下午4:15:522021/6/23
收件者:Filip Podstavec、ClickHouse
Assuming your primary index granularity is the default 8k your skip index is per 16k rows.
If at least one line per for a 16k rows batch contains the ngram, clickhouse will fetch the 16k rows then it will apply the LIKE for the 16k rows.
Maybe your query is not selective enough (your ngram is present in most of the 16k batchs). Another possibility is the configuration of ngrambf. For example, the 2nd argument is the size of the bloomfilter that stores the ngrams present in the 16k rows. If you have too many different ngrams, the 256 bytes you put might be too small and the bloomfilter will produce many false positives (saying that a ngram is present in the 16k rows whereas it's not the case).

--
You received this message because you are subscribed to the Google Groups "ClickHouse" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clickhouse+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/clickhouse/44546af4-e060-4c71-b449-3d227c6e9d7cn%40googlegroups.com.

Denis Zhuravlev

未讀,
2021年6月23日 下午4:40:092021/6/23
收件者:ClickHouse
What do you store in keyword ?

It could be because of the nature of bloom_filters

CREATE TABLE fulltext
(
    `keyword` String,
    INDEX fulltext_index (keyword) TYPE ngrambf_v1(3, 1024, 2, 0) GRANULARITY 1
)
ENGINE = MergeTree()
ORDER BY (keyword);

insert into fulltext select arrayStringConcat(arrayMap( i -> toString(rand(i)%44545445), range(10)), ' ') from numbers(10000000)

SET send_logs_level = 'debug'

select count() from fulltext where keyword like '%aaa%';
0 rows in set. Elapsed: 0.002 sec.
Index `fulltext_index` has dropped 1221/1221 granules.


select count() from fulltext where keyword like '%33333333333%';
0 rows in set. Elapsed: 0.188 sec. Processed 10.00 million rows
Index `fulltext_index` has dropped 0/1221 granules

Filip Podstavec

未讀,
2021年6月24日 凌晨2:33:272021/6/24
收件者:ClickHouse
Denis: I have a table filled with 100 mil. random strings from 2 to 10 characters.

Christopher were right. I had to change index_granularity to 500 and use GRANULARITY 1 and with such a small granularity it works much faster. 

With table:
CREATE TABLE default.fulltext
(
    `keyword` String,
    INDEX fulltext_index (keyword) TYPE ngrambf_v1(3, 256, 2, 0) GRANULARITY 2
  
)
ENGINE = MergeTree()
ORDER BY (keyword);

SELECT:
select count() from default.fulltext where keyword like '%market%'

- Processed 100.00 mil. rows
- Around 1,5 seconds to process (approximately in 10 different executes)

With updated table:
CREATE TABLE default.fulltext
(
    `keyword` String,
    INDEX fulltext_index (keyword) TYPE ngrambf_v1(2, 1024, 2, 0) GRANULARITY 1
  
)
ENGINE = MergeTree()
ORDER BY (keyword)
SETTINGS index_granularity=500;
 
The same SELECT:
- Processed 2,19 mil. rows
- Around 0.15 seconds to process

I'll try to play with index_granularity settings and the size of bloomfilter and maybe make it event faster but thanks for you help! :-)

Dne středa 23. června 2021 v 22:40:09 UTC+2 uživatel denis.z...@gmail.com napsal:

Filip Podstavec

未讀,
2021年10月21日 凌晨4:15:312021/10/21
收件者:ClickHouse
Skip indexes work great (according to above answers) but now I have another problem - I need to search words in columns and it brings two problems:
1) The word has to be alone in the keyword (for keyword "know" I don't want to see in the results "knowledge" etc.)
2) The search should not respect accent (for keyword "pracka" I want to see in results also "pračka")

I tried to find some function to deal with that on:

But the only thing I found is to make multiple LIKE operations to deal with first problem, but it doesn't deal with the second problem. Example:
WHERE (keyword LIKE 'word %' OR keyword LIKE '% word %' OR keyword like '% word')

This at least find the keyword "alone" in the column.

I also tried to use MATCH function but nothing deals with the second problem with accent. My question is:
Does Clickhouse support something like fulltext search (find words alone and ignore accent)? And if not is it possible to create materialized view with keyword without accent (some function right from Clickhouse to remove accent from string)? I tried to find something in documentation but without success.

Thank you!

Dne čtvrtek 24. června 2021 v 8:33:27 UTC+2 uživatel Filip Podstavec napsal:
回覆所有人
回覆作者
轉寄
0 則新訊息