Hi Folks,
I've been playing with data annotation using ClickHouse. That is, I have a *very* large dataset (>10 TB) that I would like to annotate using the clickhouse-client command-line client. For small subsets of this data it works, but either the buffers are too big or ClickHouse waits for the input to finish; either way, it does not work on the full stream.
Here is an example command:
while true; do echo $((RANDOM % 2)); done | pv -lc -N "INPUT" | clickhouse-client -q "SELECT input, Name FROM (SELECT input FROM STDIN) ANY LEFT JOIN (SELECT 0 AS input, 'zero' AS Name UNION ALL SELECT 1 AS input, 'one' AS Name) USING input" --external --file=- --name=STDIN --structure='input UInt8' | pv -lc -N "OUTPUT" > /dev/null
Basically, I feed in 0s and 1s at random from the command line, and annotate each one with a static table that turns the number into a string. The output looks like this:
1 one
1 one
0 zero
1 one
0 zero
1 one
1 one
0 zero
1 one
0 zero
which is exactly what I am looking for. (Well, not exactly this; my actual query is more complicated, but as an example this should do.)
When running this indefinitely, ClickHouse seems to use a huge internal buffer for the results. The memory of the client does not grow, but the server just eats more and more memory, even though the result is clear as soon as each input line is complete. (After feeding in more than 10 GB of 0s, there is still not a single line of output.)
Is there a way to stop the client/server from buffering the result, and have it flushed back to STDOUT right away?
NOTE: pv is only used here to show the input/output rates. It can be removed and the command will still run.
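For context, the only workaround I can think of so far is to chunk the input into bounded batches, so each query invocation only ever sees a small external table. A minimal sketch of the idea (assumes GNU split with --filter; awk stands in for the clickhouse-client call so the example is self-contained):

```shell
# Chunk stdin into batches of lines; split's --filter runs the given command
# once per chunk, with that chunk on its stdin, so the "annotator" never
# holds more than one batch in memory. awk here is just a stand-in that
# maps 0 -> "zero" and 1 -> "one", like the static join table above.
printf '1\n0\n1\n0\n' \
  | split -l 2 --filter="awk '{ print \$1, (\$1 ? \"one\" : \"zero\") }'"
# For the real pipeline, the filter would be the clickhouse-client command
# from the example above, with something like -l 100000 per batch.
```

This bounds server memory, but it pays query startup cost per batch, so a way to make the single streaming query flush as it goes would still be much nicer.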