Benchmarks - is more detailed data available?

Peter Zaitsev

Jul 16, 2016, 1:14:30 PM
to ClickHouse
Hello,

I'm looking at the benchmark results and they are simply excellent.


One question, though: is there a more complete description of the tests anywhere, i.e. what hardware and what settings were used?
Is either the data the tests were run on, or a generator for it, available anywhere?

A second interesting question about the tests: it looks like all of them run only against the hits table. Were any tests done using JOIN?

man...@gmail.com

Jul 16, 2016, 6:29:59 PM
to ClickHouse
Hello.


The hardware is as follows:

Dual-socket Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
128 GiB RAM
8 x 6 TB HDD SATA 7200 RPM in RAID-5 (md)

For simplicity, the benchmarks were run in a single-server configuration.

(Except for the results marked Vertica x3 and x6, which were run on three and six servers, respectively. Those tests were run on different hardware (more or less comparable, also dual-socket servers) by other people, who set everything up themselves.)

For some systems there is a detailed description of how they were installed and how the data was loaded; for others this information was not kept.
What exists is here: https://github.com/yandex/ClickHouse/tree/master/dbms/benchmark


About the data.

As it happens, the data for these tests is a slice of real Metrica data, and it cannot be disclosed.
That is not good, of course. So we would like to do at least one of two things:

1. Write a generator of pseudo-random data with the same structure. This is not very easy, because all the probability distributions have to be preserved. For string fields, for example: the distribution of identical strings, the distribution of string lengths, and the compression ratio. But with some effort it can be done.

2. Move the tests to a different dataset. We have some groundwork for this, see here: https://github.com/yandex/ClickHouse/tree/master/doc/example_datasets
For example, the ontime test is taken from your blog :)
By following the instructions you can get results in about 30 minutes.
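
To make option 1 (the generator) a bit more concrete, here is a minimal Python sketch for a single string column. It is only an illustration under stated assumptions, not the generator the ClickHouse team has in mind: the function names `fit_string_column` and `generate_string_column` and the sample values are hypothetical, and it preserves only the distribution of identical strings and the distribution of string lengths. It makes no attempt to preserve the compression ratio.

```python
import random
from collections import Counter

def fit_string_column(values):
    """Return one (repeat_count, length) pair per distinct string in the sample."""
    return [(count, len(value)) for value, count in Counter(values).items()]

def generate_string_column(profile, seed=0, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Produce random strings with the same repeat counts and lengths as the profile."""
    rng = random.Random(seed)
    used = set()
    column = []
    for count, length in profile:
        # Draw a fresh surrogate so that distinct originals stay distinct.
        for _ in range(1000):
            surrogate = "".join(rng.choice(alphabet) for _ in range(length))
            if surrogate not in used:
                break
        else:
            raise ValueError("alphabet too small for this many distinct strings")
        used.add(surrogate)
        column.extend([surrogate] * count)
    rng.shuffle(column)  # row order is NOT preserved in this sketch
    return column

# Hypothetical sample standing in for a real (undisclosed) column.
sample = ["yandex.ru", "yandex.ru", "yandex.ru",
          "example.com", "example.com", "clickhouse.test"]
synthetic = generate_string_column(fit_string_column(sample))
```

The synthetic column has the same number of rows, the same number of distinct values, and the same repeat-count and length profile as the sample, but none of the original strings. A real generator would also need to handle correlations between columns, which this sketch ignores.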


About the tables.

Yes, all tests are run on a single table, hits.
The benchmark queries were chosen in autumn 2013, and at that time ClickHouse did not support JOIN.

The test consists of 43 queries: 36 are full-scan queries, and the remaining 7 are queries over a primary key range.
For the most part, the benchmark measures the performance of reading, filtering, and aggregating data.
It covers various combinations of field types, condition selectivity, and aggregation key cardinality.

Timur Shenkao

Jul 18, 2016, 4:04:26 AM
to ClickHouse
Hi guys!

We've run several tests on a multi-server ClickHouse configuration. Unfortunately, I can't disclose some details, but:
-- the volume was at least 1 TB
-- the data was real, dirty, and skewed in some tables
-- several configurations were tested: 1 shard x 1 replica, 2 shards x 1 replica, 2 shards x 2 replicas, etc.

In our tests we don't consider differences of milliseconds, since users don't notice them and network latency swamps such negligible differences.

1) The compression is much better than in Vertica 7.1 / 7.2. Raw CSV shrinks 5-10 times.
2) Queries like "SELECT count(*) FROM ..." aren't fast at all: they scan the whole table. But if you add a WHERE clause, it gets better. ClickHouse and Vertica gave the same results.
3) ClickHouse is extremely fast at simple SELECTs without joins, much faster than Vertica. We also got the feeling that ClickHouse has something like a cache: the same query repeated consecutively ran up to 2 times faster.
4) Problem: JOINs don't work at all, at least in the usual sense. One has to rewrite queries with joins according to the documentation and the examples given in this Google group. Then they become much faster than in Vertica.

Alexander Zaitsev

Jul 18, 2016, 8:39:04 AM
to ClickHouse
Well, we also ran ClickHouse vs. Vertica tests using 3-node clusters and real-life queries from data analytics applications and user reports. Something is weird with your setup; Vertica cannot be that much slower.

In our tests:
  • Compression is similar with zero configuration, but Vertica wins when proper encoding is used. Since Vertica gives full control of compression (encoding) on a per-column basis, it should theoretically be better than CH in this regard.
  • Vertica is 2 times slower when no joins are present, a simple predicate on a high-selectivity column is used, and the column is not in the projection sort order on Vertica.
  • However, Vertica is significantly faster (2+ times) when the column is the projection sort column (the exact ratio depends on column selectivity).
  • Every join (converted to a subquery if the joined table is filtered) dramatically affects ClickHouse performance: a query with 1 join is 3-4 times slower than without the join, and a query with two joins is slower still.
Yes, ClickHouse seems to cache the data, so when a query is repeated multiple times it executes much faster. This is probably useless in a real-life scenario, where data is constantly being loaded into the database and the cache expires often.

man...@gmail.com

Jul 18, 2016, 4:26:09 PM
to ClickHouse
ClickHouse doesn't cache data itself; it relies on the OS page cache.
(There is an 'uncompressed_cache', which is mostly useless and disabled by default.)

ClickHouse has very limited and rigid support for JOINs. JOINs need to be written as "subquery with subquery". Even the left table needs to be represented as a subquery, because currently there is no optimization of the order of the JOIN operation relative to filtering; otherwise the JOIN is done before WHERE, which is obviously less efficient if the WHERE is not on the JOINed fields.

Better JOIN support is in our near-term plans.
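
The cost of running the JOIN before the WHERE can be illustrated with a small Python sketch. This is a toy in-memory hash join over hypothetical `hits` and `users` rows, not ClickHouse's implementation; it only counts how many left-side rows are pushed through the join under each plan.

```python
def hash_join(left_rows, right_rows, key):
    """Inner hash join of two lists of dicts; returns (joined_rows, probe_count)."""
    table = {}
    for row in right_rows:
        table.setdefault(row[key], []).append(row)
    joined, probes = [], 0
    for row in left_rows:
        probes += 1  # one hash-table lookup per left-side row
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined, probes

def wanted(row):
    """The WHERE condition; it touches only a left-table column."""
    return row["url_len"] < 50

# Hypothetical data: 1000 hits spread over 10 users.
hits = [{"user_id": i % 10, "url_len": i} for i in range(1000)]
users = [{"user_id": u, "region": u % 3} for u in range(10)]

# Plan A: JOIN first, WHERE afterwards.
joined_a, probes_a = hash_join(hits, users, "user_id")
result_a = [row for row in joined_a if wanted(row)]

# Plan B: filter the left side first (the "subquery"), then JOIN.
joined_b, probes_b = hash_join([row for row in hits if wanted(row)], users, "user_id")

print(probes_a, probes_b)  # 1000 50
```

Both plans return the same rows, but plan B sends only the 50 filtered rows through the join instead of all 1000, which is the effect that wrapping the left table in a subquery is meant to achieve.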

man...@gmail.com

Jul 18, 2016, 4:27:45 PM
to ClickHouse
By the way, "projection sort order" in Vertica and "primary key" in ClickHouse are exactly the same concept.

Alexander Zaitsev

Jul 19, 2016, 5:50:28 AM
to ClickHouse
OK, Vertica also relies on the OS cache, but the difference between the first and subsequent runs is usually not so significant.

For JOINs -- certainly we tried them using subqueries, otherwise the query time jumped from several seconds to several minutes :)

Regarding projection sort order and the general concept of a primary key -- there are a few differences.
- Projection sort order does not need to be unique, and normally it is not, while a primary key is unique (not sure if that holds true for CH, though).
- If multiple columns are used in the projection sort order, even filtering on a non-first sort column works better than on an unsorted table. That is usually not the case for a primary key, which works for leftmost columns only.
- And last but not least: Vertica supports multiple projections per table, which allows unlimited flexibility.

Just to be clear: I am not advocating for Vertica here, just highlighting the differences.

man...@gmail.com

Jul 19, 2016, 6:36:14 AM
to ClickHouse
Some clarifications, just in case:
 
> Regarding projection sort order and the general concept of a primary key -- there are a few differences.
> - Projection sort order does not need to be unique, and normally it is not, while a primary key is unique (not sure if that holds true for CH, though).

In ClickHouse, the "primary key" is also not unique.
 
> - If multiple columns are used in the projection sort order, even filtering on a non-first sort column works better than on an unsorted table. That is usually not the case for a primary key, which works for leftmost columns only.

In ClickHouse, the primary key works to some extent even when filtering by a non-first column, see: https://groups.google.com/forum/#!searchin/clickhouse/primary$20key/clickhouse/eUDrOLxV-lE/r2jCgYJWBgAJ
 

Alexander Zaitsev

Jul 19, 2016, 7:09:36 AM
to ClickHouse
Good to know, thanks. It definitely resembles Vertica's projection sort order then. Though the 'primary key' naming is kind of misleading here, since it usually means a different thing in databases.

On Tuesday, July 19, 2016 at 13:36:14 UTC+3, man...@gmail.com wrote:
> In ClickHouse, "primary key" is also not unique.
 

Burak Emre Kabakcı

Jul 21, 2016, 6:07:45 PM
to ClickHouse
+1. It's confusing if you're coming from the RDBMS world. Maybe "partition" would be a better choice?

man...@gmail.com

Jul 21, 2016, 8:55:02 PM
to ClickHouse
Sort order is the right term.
But the term "primary key" makes sense too.

It's a "key" because it works as an index, though a non-unique one.
And it is "primary" because it is the main key in the table.

We need more explicit wording in the documentation.
Send a pull request if you have time.
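
To illustrate how a key can be non-unique and still work as an index, here is a minimal Python sketch. It assumes a hypothetical table kept in sort order of `(counter_id, date)` and uses plain binary search from the standard library; it is not how ClickHouse stores or searches its data, only a demonstration of the "sort order as index" idea discussed above.

```python
from bisect import bisect_left, bisect_right

# Hypothetical table stored in sort order of (counter_id, date); the key repeats.
rows = sorted([
    (1, "2016-07-01", "a"), (1, "2016-07-02", "b"),
    (2, "2016-07-01", "c"), (2, "2016-07-01", "d"), (2, "2016-07-03", "e"),
    (3, "2016-07-02", "f"),
])
keys = [(counter_id, date) for counter_id, date, _ in rows]

def key_range(lo, hi):
    """Rows whose key satisfies lo <= key <= hi, located by binary search."""
    start = bisect_left(keys, lo)
    stop = bisect_right(keys, hi)
    return rows[start:stop]

# Exact key lookup: the key (2, "2016-07-01") matches two rows.
matches = key_range((2, "2016-07-01"), (2, "2016-07-01"))

# Lookup by the leading column only: every row with counter_id == 2.
prefix_matches = key_range((2, ""), (2, "\uffff"))
```

Here `matches` contains the rows tagged "c" and "d", and `prefix_matches` contains "c", "d", and "e". In both cases the search reads a contiguous slice of the sorted data without scanning the rest, which is why the sort key acts as an index even though it is not unique.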
