Double-distributed IN/JOIN subqueries is denied (distributed_product_mode = 'deny')

3,426 views
Skip to first unread message

Олег Облеухов

unread,
Aug 6, 2016, 9:59:53 AM8/6/16
to ClickHouse
Hello!
I want to implement highestMax function in DB with 2 nested queries:
On local tables it works perfectly fine

:) select count(*) from default.graphite where Path in (select Path FROM default.graphite WHERE Path like 'network.%.sFlow.all_pktsIn' GROUP BY Path order by max(Value) desc limit 5) AND Date >= toDate(toDateTime(1470346140)) AND Date <= toDate(toDateTime(1470432540)) limit 10;

SELECT count(*)
FROM default.graphite
WHERE (Path IN
(
    SELECT Path
    FROM default.graphite
    WHERE Path LIKE 'network.%.sFlow.all_pktsIn'
    GROUP BY Path
    ORDER BY max(Value) DESC
    LIMIT 5
)) AND (Date >= toDate(toDateTime(1470346140))) AND (Date <= toDate(toDateTime(1470432540)))
LIMIT 10

┌─count()─┐
│      59 │
└─────────┘

Then I go for distributed tables

:) select count(*) from default.graphite_d where Path in (select Path FROM default.graphite_d WHERE Path like 'network.%.sFlow.all_pktsIn' GROUP BY Path order by max(Value) desc limit 5) AND Date >= toDate(toDateTime(1470346140)) AND Date <= toDate(toDateTime(1470432540)) limit 10;

SELECT count(*)
FROM default.graphite_d
WHERE (Path IN
(
    SELECT Path
    FROM default.graphite_d
    WHERE Path LIKE 'network.%.sFlow.all_pktsIn'
    GROUP BY Path
    ORDER BY max(Value) DESC
    LIMIT 5
)) AND (Date >= toDate(toDateTime(1470346140))) AND (Date <= toDate(toDateTime(1470432540)))
LIMIT 10

Received exception from server:
Code: 288. DB::Exception: Received from localhost:9000, 127.0.0.1. DB::Exception: Double-distributed IN/JOIN subqueries is denied (distributed_product_mode = 'deny'). You may rewrite query to use local tables in subqueries, or use GLOBAL keyword, or set distributed_product_mode to suitable value..

Googling by exception show only source code. Also nothing about it in documentation.
Just trying different stuff I discovered:
SET distributed_product_mode = 'allow'


Is this a correct way to go or I can set it in config. And how bad it is to use?

Thanks,
Oleg.

Олег Облеухов

unread,
Aug 6, 2016, 10:03:05 AM8/6/16
to ClickHouse
Please ignore Limit 10 at the end of query. Result of experiments

суббота, 6 августа 2016 г., 15:59:53 UTC+2 пользователь Олег Облеухов написал:

man...@gmail.com

unread,
Aug 6, 2016, 8:29:56 PM8/6/16
to ClickHouse
Hello.

Depending on your sharding configuration, you should use one of following:

1.
select count(*) from default.graphite_d where Path in (select Path FROM default.graphite WHERE Path like 'network.%.sFlow.all_pktsIn' GROUP BY Path order by max(Value) desc limit 5) AND Date >= toDate(toDateTime(1470346140)) AND Date <= toDate(toDateTime(1470432540)) limit 10;

Or SET
distributed_product_mode = 'local' and leave query as is, it will be rewritten in same way automatically.

2.
select count(*) from default.graphite_d where Path global in (select Path FROM default.graphite_d WHERE Path like 'network.%.sFlow.all_pktsIn' GROUP BY Path order by max(Value) desc limit 5) AND Date >= toDate(toDateTime(1470346140)) AND Date <= toDate(toDateTime(1470432540)) limit 10;

Or SET distributed_product_mode = 'global' and leave query as is, it will be rewritten in same way automatically.


First variant is suitable, if corresponding data is co-located on same nodes. In your case - if all data for same Path are in one node: when data is sharded by Path. It works as follows: on each server, data will be filtered by result of subquery, that run over local table.

Second variant is suitable in all cases, but more expensive. It works as follows: first, execute distributed subquery, get result on initiating server. Then, broadcast this result to every server, into temporary table in memory. Then execute outer query distributed, using temporary table on each server. So, every server will see full result of subquery.

Also third variant is possible: on each server, process subquery distributed way. For example, if you have 10 shards, then query is send to 10 servers, and on each of them, to execute subquery, this subquery is sent to each of 10 servers again, resulting in 100 elementary queries. It is too bad, and we denied it by default, and introduced 'distributed_product_mode' to control this behaviour and to provide possibility to select one of two better options ('local' and 'global') without need of query rewrite.

All methods of processing distributed subqueries described in documentation:
https://clickhouse.yandex/reference_en.html#Distributed%20subqueries
(But it does not cover 'distributed_product_mode' setting.)

Reply all
Reply to author
Forward
0 new messages