Query latest N records and paging

Kfir Dadosh

unread,

Sep 6, 2017, 12:58:42 PM9/6/17

to ClickHouse

Hi,

Can I query the latest N records (according to their insert time)

Basically running a simple "select ... from ... limit N", only that usually this query return the first inserts, and not the last.

Alternatively, is there an efficient way to do the sorting, without running on all the data (time is a primary key).

In addition, is there a way to do "paging", by getting a cursor to a query result, and be able to get more results from the same point:

select x from y where z limit 100

and then continue the same query, similar to running with

limit 100,100

data is inserted all the time so running another query will give different results.

Thanks,

Dmitry Berezhnov

unread,

Sep 7, 2017, 5:58:36 AM9/7/17

to ClickHouse

Hi,
It depends on how do you want to paging you data.
You can use this way (paging with round by date):
1. Get list of date (page)
SELECT
toDate(time) AS date
FROM
y
GROUP BY date
ORDER BY
date

2017-01-02
2017-01-03
2017-01-05
2017-01-08
...

2. Find which date is on you page. For example - 3 page, so you get 3th date from step 1.
3. Filter data by date '2017-01-05'
SELECT

...
FROM y
WHERE toDate(time) = '2017-01-05'

Николай Кочетов

unread,

Sep 7, 2017, 9:19:51 AM9/7/17

to ClickHouse

For the latest N records you can use:
ORDER BY ... DESC LIMIT N
And you can't do the sorting without reading all the data because primary keys are sorted only inside part.

Now you can't achieve paging functionality without an effect from newly inserted parts. However, you can run clickhouse-client -q "..." and periodically read from output stream as many rows as you need. Not sure that it's what you need.

среда, 6 сентября 2017 г., 19:58:42 UTC+3 пользователь Kfir Dadosh написал:

Kfir Dadosh

unread,

Sep 7, 2017, 10:55:43 AM9/7/17

to ClickHouse

Thanks for you help,

Since data is always appended to the end of the table, I thought it would be possible to query the latest elements:

Suggested syntax for that might be LIMIT -N, which would return the last N records that were inserted.

1. Do you know if the ClickHouse architecture allows implementation something like this?

2. Assuming I do a sort, is there a guaranty for the internal sub order

assuming each tuple is a row ((a,a),(a,b),(a,c),(b,a)), sorted by the first column. would running sort again might return differently ((a,b)(a,c)(a,a),(b,a))?

3. Does the internal ClickHouse implementation has an internal unique ID for each row, so it would be able to tell Clickhouse to start the scan from that row?

Kfir Dadosh

Николай Кочетов

unread,

Sep 7, 2017, 11:46:30 AM9/7/17

to ClickHouse

1. It's possible to do it faster than sorting, but rather difficult and only under some conditions. At first, primary key is sorted as tuple, so target column should be first in primary key. At second, N shouldn't be very big. Finally, it won't work like LIMIT, because LIMIT is applied to result set of rows (after GROUP BY, for example). Also, such setting will be specific for MergeTree family tables. Another possible problem that MergeTree reads intervals from part in ascending order, which, probably, not very difficult to change, but I'm not sure, that it isn't used somewhere.

2. Yes, it would. Parts can be read in any order. Also, during merge of parts the order of rows would be changed.

3. No, it hasn't. Also it may be not trivial task to add it.

2.

четверг, 7 сентября 2017 г., 17:55:43 UTC+3 пользователь Kfir Dadosh написал:

Reply all

Reply to author

Forward