Query latest N records and paging

1,942 views
Skip to first unread message

Kfir Dadosh

unread,
Sep 6, 2017, 12:58:42 PM9/6/17
to ClickHouse
Hi,

Can I query the latest N records (according to their insert time)
Basically running a simple "select ... from ... limit N", only that usually this query return the first inserts, and not the last.
Alternatively, is there an efficient way to do the sorting, without running on all the data (time is a primary key).

In addition, is there a way to do "paging", by getting a cursor to a query result, and be able to get more results from the same point:
select x from y where z limit 100
and then continue the same query, similar to running with
limit 100,100
data is inserted all the time so running another query will give different results.

Thanks,

Dmitry Berezhnov

unread,
Sep 7, 2017, 5:58:36 AM9/7/17
to ClickHouse
Hi,
It depends on how do you want to paging you data.
You can use this way (paging with round by date):
1. Get list of date (page)
SELECT
 toDate(time) AS date
FROM
 y
GROUP BY date
ORDER BY
 date

2017-01-02
2017-01-03
2017-01-05
2017-01-08
...

2. Find which date is on you page. For example - 3 page, so you get 3th date from step 1.
3. Filter data by date '2017-01-05'
SELECT
 ...
FROM y
WHERE toDate(time) = '2017-01-05'

Николай Кочетов

unread,
Sep 7, 2017, 9:19:51 AM9/7/17
to ClickHouse
For the latest N records you can use:
ORDER BY ... DESC LIMIT N
And you can't do the sorting without reading all the data because primary keys are sorted only inside part.

Now you can't achieve paging functionality without an effect from newly inserted parts. However, you can run clickhouse-client -q "..." and periodically read from output stream as many rows as you need. Not sure that it's what you need.
 
среда, 6 сентября 2017 г., 19:58:42 UTC+3 пользователь Kfir Dadosh написал:

Kfir Dadosh

unread,
Sep 7, 2017, 10:55:43 AM9/7/17
to ClickHouse
Thanks for you help,

Since data is always appended to the end of the table, I thought it would be possible to query the latest elements:
Suggested syntax for that might be LIMIT -N, which would return the last N records that were inserted.
1. Do you know if the ClickHouse architecture allows implementation something like this?
2. Assuming I do a sort, is there a guaranty for the internal sub order
assuming each tuple is a row ((a,a),(a,b),(a,c),(b,a)), sorted by the first column. would running sort again might return differently ((a,b)(a,c)(a,a),(b,a))?
3. Does the internal ClickHouse implementation has an internal unique ID for each row, so it would be able to tell Clickhouse to start the scan from that row? 

Kfir Dadosh

Николай Кочетов

unread,
Sep 7, 2017, 11:46:30 AM9/7/17
to ClickHouse
1. It's possible to do it faster than sorting, but rather difficult and only under some conditions. At first, primary key is sorted as tuple, so target column should be first in primary key. At second, N shouldn't be very big. Finally, it won't work like LIMIT, because LIMIT is applied to result set of rows (after GROUP BY, for example). Also, such setting will be specific for MergeTree family tables. Another possible problem that MergeTree reads intervals from part in ascending order, which, probably, not very difficult to change, but I'm not sure, that it isn't used somewhere.

2. Yes, it would. Parts can be read in any order. Also, during merge of parts the order of rows would be changed.

3. No, it hasn't. Also it may be not trivial task to add it.

2.

четверг, 7 сентября 2017 г., 17:55:43 UTC+3 пользователь Kfir Dadosh написал:
Reply all
Reply to author
Forward
0 new messages