Does Clickhouse support something like fulltext index?


Filip Podstavec

Jun 23, 2021, 1:13:57 PM
to ClickHouse
Hello,
I'm trying to build a simple fulltext search in ClickHouse over dozens and sometimes hundreds of millions of keywords, but I'm not sure I'm doing it the right way.

Currently I have a table:
CREATE TABLE default.fulltext
(
    `keyword` String,
    INDEX fulltext_index (keyword) TYPE ngrambf_v1(3, 256, 2, 0) GRANULARITY 8000
)
ENGINE = MergeTree()
ORDER BY (keyword);

Filled with 100 million random keywords. To search it I'm using the LIKE operator in a query like:
select count() from default.fulltext where keyword like '%fil%'

But even with the index on the keyword column it still takes around 1-2 seconds to get results. Stranger still, the same select on a column without the index also takes around 1-3 seconds, so the performance is similar.

I'm probably doing something wrong, but I haven't found any articles dedicated to fulltext search in ClickHouse (in our case it's not true fulltext search; we need to find all rows that contain the keyword anywhere).

Does ClickHouse support something like a MySQL fulltext index, or some way to make LIKE '%...%' or multiSearchAny fast?

Bottom line:
- We want to find the string anywhere inside keyword (a prefix search like LIKE 'fil%' is faster, but doesn't work for our case)

Denis Zhuravlev

Jun 23, 2021, 1:35:07 PM
to ClickHouse
>GRANULARITY 8000

It should be GRANULARITY 3 or GRANULARITY 2.
It's the number of primary-index granules one skip-index block covers. With the default index_granularity of 8192, GRANULARITY 8000 means 8000*8192 rows per block -- that is nonsense.
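To put numbers on this (a quick sketch, assuming the default index_granularity of 8192 rows per primary-index granule):

```python
# One skip-index block covers GRANULARITY * index_granularity rows
# (index_granularity defaults to 8192 for MergeTree tables).
index_granularity = 8192

rows_per_block_8000 = 8000 * index_granularity  # 65,536,000 rows per block
rows_per_block_2 = 2 * index_granularity        # 16,384 rows per block

print(rows_per_block_8000)
print(rows_per_block_2)
```

With GRANULARITY 8000, a single skip-index entry spans well over half of a 100-million-row table, so it can almost never rule anything out.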

Filip Podstavec

Jun 23, 2021, 1:51:29 PM
to ClickHouse
Omg, my bad... sorry, I thought GRANULARITY 8000 limits it to 8000 rows, not 8000*8192 rows. Thank you!

On Wednesday, June 23, 2021 at 19:35:07 UTC+2, denis.z...@gmail.com wrote:

Filip Podstavec

Jun 23, 2021, 3:40:49 PM
to ClickHouse
I tried it with GRANULARITY set to 2, but performance seems almost the same. It currently scans around 70 million rows per second and still has to go through the whole table. Is there something I'm missing?

On Wednesday, June 23, 2021 at 19:51:29 UTC+2, Filip Podstavec wrote:

Christophe Kalenzaga

Jun 23, 2021, 4:15:52 PM
to Filip Podstavec, ClickHouse
Assuming your primary index granularity is the default 8k, your skip index works per 16k rows.
If at least one row in a 16k-row block contains the ngram, ClickHouse will fetch those 16k rows and then apply the LIKE to all of them.

Maybe your query is not selective enough (your ngram is present in most of the 16k-row blocks). Another possibility is the configuration of ngrambf_v1: the 2nd argument is the size of the Bloom filter that stores the ngrams present in the 16k rows. If you have too many distinct ngrams, the 256 bytes you set may be too small, and the Bloom filter will produce many false positives (saying an ngram is present in the 16k rows when it is not).
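The filter-size point can be quantified with the standard Bloom-filter false-positive estimate, p = (1 - e^(-k*n/m))^k. A rough sketch; the figure of 5000 distinct ngrams per block is an assumption, the real count depends on the data:

```python
import math

def bloom_fp_rate(n_items: int, size_bytes: int, n_hashes: int) -> float:
    """Approximate false-positive probability of a Bloom filter
    with m = size_bytes * 8 bits, k = n_hashes, n = n_items inserted."""
    m = size_bytes * 8
    k = n_hashes
    return (1.0 - math.exp(-k * n_items / m)) ** k

# ngrambf_v1(3, 256, 2, 0): a 256-byte filter with 2 hash functions.
# Assume ~5000 distinct 3-grams per 16k-row block:
print(bloom_fp_rate(5000, 256, 2))   # close to 1.0: almost every block is a false positive
print(bloom_fp_rate(5000, 1024, 2))  # noticeably lower with a 1024-byte filter
```

With a false-positive rate near 1, the skip index drops almost no granules, which matches the "same performance with or without the index" observation earlier in the thread.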


Denis Zhuravlev

Jun 23, 2021, 4:40:09 PM
to ClickHouse
What do you store in keyword?

It could be because of the nature of Bloom filters:

CREATE TABLE fulltext
(
    `keyword` String,
    INDEX fulltext_index (keyword) TYPE ngrambf_v1(3, 1024, 2, 0) GRANULARITY 1
)
ENGINE = MergeTree()
ORDER BY (keyword);

insert into fulltext select arrayStringConcat(arrayMap( i -> toString(rand(i)%44545445), range(10)), ' ') from numbers(10000000)

SET send_logs_level = 'debug'

select count() from fulltext where keyword like '%aaa%';
0 rows in set. Elapsed: 0.002 sec.
Index `fulltext_index` has dropped 1221/1221 granules.


select count() from fulltext where keyword like '%33333333333%';
0 rows in set. Elapsed: 0.188 sec. Processed 10.00 million rows
Index `fulltext_index` has dropped 0/1221 granules
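The two queries behave so differently because of which 3-grams the patterns contain: the filter can skip a granule only when at least one of the pattern's ngrams is absent from it. A rough sketch of the ngram extraction (illustrative only, not ClickHouse's actual implementation):

```python
def ngrams(s: str, n: int = 3) -> set:
    """Distinct character n-grams of s (what ngrambf_v1 hashes into the filter)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

print(ngrams("aaa"))          # {'aaa'}: never occurs in digit-only data, so every granule is skipped
print(ngrams("33333333333"))  # {'333'}: occurs in nearly every granule, so nothing is skipped
```

The long pattern '%33333333333%' collapses to the single 3-gram '333', which is common in random numeric strings, so the Bloom filter cannot rule out any granule.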

Filip Podstavec

Jun 24, 2021, 2:33:27 AM
to ClickHouse
Denis: I have a table filled with 100 million random strings of 2 to 10 characters.

Christophe was right. I changed index_granularity to 500 and used GRANULARITY 1, and with such a small granularity it works much faster.

With table:
CREATE TABLE default.fulltext
(
    `keyword` String,
    INDEX fulltext_index (keyword) TYPE ngrambf_v1(3, 256, 2, 0) GRANULARITY 2
)
ENGINE = MergeTree()
ORDER BY (keyword);

SELECT:
select count() from default.fulltext where keyword like '%market%'

- Processed 100.00 million rows
- Around 1.5 seconds to process (averaged over roughly 10 runs)

With updated table:
CREATE TABLE default.fulltext
(
    `keyword` String,
    INDEX fulltext_index (keyword) TYPE ngrambf_v1(2, 1024, 2, 0) GRANULARITY 1
)
ENGINE = MergeTree()
ORDER BY (keyword)
SETTINGS index_granularity=500;
 
The same SELECT:
- Processed 2.19 million rows
- Around 0.15 seconds to process
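For scale, a quick back-of-the-envelope on the numbers above: each skip-index block now covers 500 rows instead of 16384, and the index lets ClickHouse read only a small slice of the table.

```python
rows_total = 100_000_000   # rows in the table
rows_read = 2_190_000      # rows processed with GRANULARITY 1, index_granularity 500

fraction_read = rows_read / rows_total
print(f"{fraction_read:.1%} of the table read")  # 2.2% of the table read
```

So the reworked index skips roughly 98% of the data, which accounts for the drop from ~1.5 s to ~0.15 s.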

I'll keep playing with the index_granularity setting and the size of the Bloom filter and maybe make it even faster, but thanks for your help! :-)

On Wednesday, June 23, 2021 at 22:40:09 UTC+2, denis.z...@gmail.com wrote: