Duplicate records when order by date?

172 views
Skip to first unread message

Peter Borissow

unread,
Apr 12, 2024, 1:16:31 PMApr 12
to H2 Database
I think I might have found a bug sorting by date but I could use some help confirming.

I have attached a csv file with an “id” and “start_date” field which you can load into a into a table like this:
CREATE TABLE TEST (
    ID BIGSERIAL NOT NULL,
    START_DATE TIMESTAMP with time zone,
    CONSTRAINT PK_TEST PRIMARY KEY (ID)
);

If I run the following 2 queries, I often see the last row from the first query appear as the first row in the second query:
SELECT * FROM test ORDER BY start_date DESC LIMIT 50 OFFSET 100
SELECT * FROM test ORDER BY start_date DESC LIMIT 50 OFFSET 150

The behavior is inconsistent. Sometimes I see 217 at the end of the first query and 218 at the start of the second (good), sometimes I see 218 at the end of the first query and 217 at the start of the second (also valid), sometimes I see 217 at the end of the first query and 217 at the start of the second (bad), sometimes I see 218 at the end of the first query and 218 at the start of the second (bad).

I don’t see the same issue when I order by id.

Again, this behavior is spurious and you may need to run the same queries multiple times to see different results.

Please let me know if anyone else can replicate.

Thanks in advance,
Peter


My environment:
Java 21.0.1
Windows 11
H2 2.2.224 (PostgreSQL mode)

test.csv

Evgenij Ryazanov

unread,
Apr 12, 2024, 11:50:41 PMApr 12
to H2 Database
Hello!

Try the following query: SELECT RANK() OVER (ORDER BY start_date DESC NULLS FIRST) R, * FROM test ORDER BY 1; Some rows have the same rank:
149 130 2019-01-11 00:00:00-05
150
217 2019-01-02 00:00:00-05

150
218 2019-01-02 00:00:00-05

152
164 2019-01-01 00:00:00-05

152
166 2019-01-01 00:00:00-05

154
165 2018-11-16 00:00:00-05

That means rows with ID 217 and 218 may be returned every time in any order. It isn't a bug, it is a valid standard-compliant behavior.

Offset pagination a bad thing, but if you really need to use it, there are two common solutions:
1. You can add a unique non-null column to the ORDER BY clause, such as primary key column(s):
SELECT * FROM test ORDER BY start_date DESC NULLS FIRST, id OFFSET 100 ROWS FETCH NEXT 50 ROWS ONLY;
2. You can use standard WITH TIES clause:
SELECT * FROM test ORDER BY start_date DESC NULLS FIRST OFFSET 100 ROWS FETCH NEXT 50 ROWS WITH TIES;
This query returns 51 rows, because after 50th row there is an additional row with the same rank. These possible additional returned rows inform your application about this situation and it can handle it somehow:
217 2019-01-02 00:00:00-05
218 2019-01-02 00:00:00-05

For example, it can process all of them and run next query with OFFSET 151 ROWS FETCH NEXT 50 ROWS WITH TIES.

Peter Borissow

unread,
Apr 13, 2024, 11:34:54 AMApr 13
to H2 Database
Thanks Evgeni, as always, for your prompt response!

So you don't think this is a bug? Pagination using offset and limit is not uncommon. 

Also, I should have mentioned this earlier but I don't see this behavior in PostgreSQL. Not sure how other RDBMS behave but it would be interesting to compare.

Peter

Andrei Tokar

unread,
Apr 13, 2024, 10:58:04 PMApr 13
to H2 Database
Hi Peter,
IMHO, Evgenij is 100% correct. This is NOT a bug.

If something (like offset pagination) is not uncommon, it does not make it the best practice.
If PostgressSQL always yields some deterministic behaviour, where non-determinism is allowed by SQL standard (and common sence),
 it does not mean that any other RDBMS should behave similarly.

I would follow Evgenij's advice and organize pagination on some unique attribute(s), and without OFFSET, but with adjustable WHERE condition, instead.
It would also cover cases of concurrent inserts / deletes (messing up all offsets), during your pages traversal.

Peter Borissow

unread,
Apr 16, 2024, 4:19:38 AMApr 16
to H2 Database
I don't know guys. It looks like a bug to me.

The problem is twofold:

(1) Whenever I see a "duplicate" record, the database is actually skipping a record. In other words, when I see 218 at the end of the first query and 218 at the start of the second, record 217 is missing. That is bad.

(2) The query response is inconsistent. I see 4 different responses coming back from the same query on a static table. I don't think that's right.

I'll update my queries to try to circumvent the issue(s) for now but I think this issue should be documented and fixed at some point.

Respectfully,
Peter


Noel Grandin

unread,
Apr 16, 2024, 5:02:03 AMApr 16
to h2-da...@googlegroups.com


On 4/16/2024 10:19 AM, 'Peter Borissow' via H2 Database wrote:
> (2) The query response is inconsistent. I see 4 different responses coming back from the same query on a static table. I
> don't think that's right.
>

SQL databases implement the relational model. The relational model works with sets, not lists.

Sets do not have ordering, so, if you want repeatable results and ordering, then you have to impose that ordering using
an ORDER BY clause, and that ORDER BY clause needs to impose a total order, not a partial order.

Andreas Reichel

unread,
Apr 16, 2024, 5:04:17 AMApr 16
to h2-da...@googlegroups.com
Peter,

from what I can see, your records are NOT DISTINCT regarding the ORDER criteria? So what exactly do you expect to happen when there is no formal contract on the order and sorting?
Thought experiment: insert all your records but with exactly the same Date/Timestamp -- in this case, your queries don't order anything but imply grab any 50 records first.

Cheers

Peter Borissow

unread,
Apr 16, 2024, 5:14:18 AMApr 16
to h2-da...@googlegroups.com
I expect to see both rows 217 and 218 at the break point like this:

Inline image


I don't care if 218 appears before 217 or if 217 appears before 218 as long as they both appear. 

Unfortunately, sometimes I get "duplicate" 217 at the break point, with no 218. Other times, I get "duplicate" 218 at the break point, with no 217 like this:

Inline image


That's point #1.

Point #2 regarding consistency is that if I run the same query over and over I get 4 different results. Bear in mind that the table and data is 100% static.


Evgenij Ryazanov

unread,
Apr 16, 2024, 5:18:41 AMApr 16
to H2 Database
These rows are called pears in the SQL Standard. The SQL Standard doesn't require any exact ordering of peers and doesn't require any stable ordering rules for them. These things are implementation-dependent.

If some particular DBMS doesn't have own documented ordering rules for pears (I never saw such rules anywhere), your queries rely on undefined behavior.

PostgreSQL may and can return peers in any order:

# create table test(a int, b int);
# create index test_b on test(b);
# insert into test values (1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (0, 4), (-1, 4);
# select * from test order by a offset 0 rows;
a  | b
---+---
-1 | 4
 0 | 4
 1 | 3
 1 | 2
 2 | 4
 2 | 3
 3 | 4
(7 rows)

# select * from test order by a offset 0 rows fetch first 3 rows only;
a  | b
---+---
-1 | 4
 0 | 4
 1 | 2
(3 rows)

# select * from test order by a offset 3 rows fetch first 3 rows only;
a | b
--+---
1 | 2
2 | 4
2 | 3
(3 rows)

Row (1, 3) is missing.

Results in H2 are slightly different:

# select * from test order by a offset 0 rows fetch first 3 rows only;
A  | B
-- | -
-1 | 4
 0 | 4
 1 | 3
(3 rows)

# select * from test order by a offset 3 rows fetch first 3 rows only;
A | B
- | -
1 | 3
2 | 4
2 | 3

In H2 row (1, 2) is missing.

But there is no bug in PostgreSQL or H2.

Andreas Reichel

unread,
Apr 16, 2024, 5:43:04 AMApr 16
to h2-da...@googlegroups.com
On Tue, 2024-04-16 at 09:14 +0000, 'Peter Borissow' via H2 Database wrote:
I don't care if 218 appears before 217 or if 217 appears before 218 as long as they both appear. 


Peter,

please consider that 

1) the second query call knows nothing about the first query call ( and this fact alone makes pagination unreliable unless you can guarantee that there is a distinct order and no change to the data in between the calls).

2) both query calls grab ANY records out of the duplicates

The observed behaviour is perfectly expected and explainable, unless you order by a distinct criteria -- which was "ID" in your example only.

Cheers
Andreas

Evgenij Ryazanov

unread,
Apr 16, 2024, 5:43:18 AMApr 16
to H2 Database
On Tuesday 16 April 2024 at 17:18:41 UTC+8 Evgenij Ryazanov wrote:
These rows are called pears in the SQL Standard.
I meant peers, of course.

Peter Borissow

unread,
Apr 16, 2024, 5:46:09 AMApr 16
to H2 Database
I see what you are saying and I can replicate what you're seeing. However the two test cases are slightly different.

One difference between my test and yours is that I have a unique primary key. Are they really peers if there's a unique primary key?

Try this:

create table test(a int, b int, constraint a_key primary key (a));
insert into test values (1, 2), (2, 3), (3, 3), (4, 4), (5, 4), (6, 4), (7, 4);

select * from test order by b offset 0 limit 3;
select * from test order by b offset 3 limit 3;
select * from test order by b offset 6 limit 3;


Note the sorting is on b.


Inline image


A finer break point looking at just 2 records at a time looks like this:

Inline image


Does having a unique primary key make any difference? Or should it?

Andreas Reichel

unread,
Apr 16, 2024, 5:46:37 AMApr 16
to h2-da...@googlegroups.com
In my understanding, you could create a temporary table from your query (without pagination) and then use the special column `_row_id_` for the pagination.
--
You received this message because you are subscribed to the Google Groups "H2 Database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to h2-database...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/h2-database/1fda6aa6c870c759abf48010bec19b56a643a2b6.camel%40manticore-projects.com.

Peter Borissow

unread,
Apr 16, 2024, 5:54:48 AMApr 16
to h2-da...@googlegroups.com
Hmm... I kinda have a row id already via the unique primary key. What advantage would a temp table give in my case?

Andreas Reichel

unread,
Apr 16, 2024, 5:58:26 AMApr 16
to h2-da...@googlegroups.com
On Tue, 2024-04-16 at 09:54 +0000, 'Peter Borissow' via H2 Database wrote:
Hmm... I kinda have a row id already via the unique primary key. What advantage would a temp table give in my case?

The advantage is, that special column _rowid_ is the physical row of the data frame and it would not change unless you shuffle the data.
It's comparable to Oracle's rownum.

Complete example:
drop table test if exists cascade;

create table test(a int, b int, constraint a_key primary key (a));
insert into test values (1, 2), (2, 3), (3, 3), (4, 4), (5, 4), (6, 4), (7, 4);

create table tmp_test
as select * from test order by b;

select * from tmp_test order by _rowid_ offset 0 rows fetch next 3 row only;


select * from tmp_test order by _rowid_ offset 3 rows fetch next 3 row only;


select * from tmp_test order by _rowid_ offset 6 rows fetch next 3 row only;

Peter Borissow

unread,
Apr 16, 2024, 6:24:37 AMApr 16
to h2-da...@googlegroups.com
I understand your suggested workaround. The workaround involves copying all the records into a temp table via query with an order by and then paginating through the temp table using the _rowid_ as the order by. Gotcha. 

Evgenij also suggest something a little more simpler that works for me - simply adding the id to the order by like this:

SELECT * FROM award_test ORDER BY start_date desc, id LIMIT 50 OFFSET 100;
SELECT * FROM award_test ORDER BY start_date desc, id LIMIT 50 OFFSET 150;


...which is what I'll probably end up doing. 

But I still think that these are just workarounds. I don't think we should be seeing "duplicates" like I reported - esp when there's a unique primary key. But I'll let you guys have the last word on that.

Best,
Peter

Evgenij Ryazanov

unread,
Apr 16, 2024, 6:28:41 AMApr 16
to H2 Database
> Are they really peers if there's a unique primary key?

Yes, they are if this primary key isn't included into sort specification. If rows aren't distinct with respect to the sort specification, they are peers of each other. Columns aren't included into sort specification don't matter here.

Database systems don't shuffle rows intentionally, but ordering of peers depends on index construction, query execution plan (it can be changed even for the same query with the same parameters in some cases), sometimes on row insertion order and other data modification operations. There are too many unknown variables. You can't rely on them.

Peter Borissow

unread,
Apr 16, 2024, 6:31:52 AMApr 16
to H2 Database
Understood. Thanks!
Reply all
Reply to author
Forward
0 new messages