Huge performance problem when using OR. Much faster(500* +) when using two SELECT and UNION

213 views
Skip to first unread message

Thomas Egense

unread,
Apr 25, 2012, 3:33:41 AM4/25/12
to H2 Database
I am using the latest version of H2: 1.3.166
Maybe this issue is known, but for most other DB-products it is
counterintuitive.

The table has about 6M rows. Table has 4
Columns(left,right,relation,count) and each have their own index.
SQL1 and SQL2 below are result-set identical except the order is
slightly different for same value of COUNT.
And normally for performance you would use SQL1.

SQL1: Takes 1 second+. Sometimes several seconds.
SQL2: Takes around 2 milis.

SQL1:
SELECT * FROM TRIPPLETCOUNT
WHERE RELATION = 'REQUESTED'
AND ( LEFT = 'sb_1909322' OR RIGHT= 'sb_1909322')
ORDER BY COUNT DESC

SQL2:
SELECT * FROM TRIPPLETCOUNT
WHERE RELATION = 'REQUESTED'
AND LEFT = 'sb_1909322'
UNION
SELECT * FROM TRIPPLETCOUNT
WHERE RELATION = 'REQUESTED'
AND RIGHT = 'sb_1909322'
ORDER BY COUNT DESC

Noel Grandin

unread,
Apr 25, 2012, 8:59:58 AM4/25/12
to h2-da...@googlegroups.com, Thomas Egense
yeah, we don't have a very clever query planner.
In particular, it doesn't do anything to optimise UNION
Patches welcome :-)

Thomas Egense

unread,
Apr 25, 2012, 9:03:34 AM4/25/12
to Noel Grandin, h2-da...@googlegroups.com
Union is the fastest way. It is the OR that needs to be optimised.
But I guess you meant that ?

From,
Thomas Egense

Noel Grandin

unread,
Apr 25, 2012, 10:10:39 AM4/25/12
to Thomas Egense, h2-da...@googlegroups.com
Ah yes, you are correct.
Either way, patches welcome :-)

Steve McLeod

unread,
Apr 26, 2012, 5:45:53 AM4/26/12
to h2-da...@googlegroups.com
My understanding is that H2 chooses at most one index to use per SELECT statement. With your OR example, this will therefore require a complete table scan to satisfy one side of the OR statement.

I wonder how other database engines deal with this. Do they use multiple indexes per SELECT? Or do they rewrite the SQL statement to become a UNION statement?

Noel Grandin

unread,
Apr 26, 2012, 5:49:59 AM4/26/12
to h2-da...@googlegroups.com, Steve McLeod
What we should be doing is using an index scan, which will be just as fast.
Not sure why we're going with a table-scan.
Probably a bug somewhere in the planner logic.

Thomas Egense

unread,
Apr 26, 2012, 6:29:44 AM4/26/12
to H2 Database
I tested that the query time with the OR-query matches a full table
scan and this is the simplest explanation also.

Steve McLeod

unread,
Apr 26, 2012, 8:30:52 AM4/26/12
to h2-da...@googlegroups.com
Hi Thomas,

It is useful to put EXPLAIN ANALYZE before your query to see what's going on. It will show you which index is picked, and whether a table scan is necessary:

explain analyze SELECT * FROM TRIPPLETCOUNT 

  WHERE RELATION = 'REQUESTED' 
  AND ( LEFT = 'sb_1909322' OR RIGHT= 'sb_1909322') 
  ORDER BY COUNT DESC;
PLAN  
SELECT
    TRIPPLETCOUNT.LEFT,
    TRIPPLETCOUNT.RIGHT,
    TRIPPLETCOUNT.RELATION,
    TRIPPLETCOUNT.COUNT
FROM PUBLIC.TRIPPLETCOUNT
    /* PUBLIC.TRIPPLETCOUNT.tableScan */
    /* scanCount: 1 */
WHERE (RELATION = 'REQUESTED')
    AND ((LEFT = 'sb_1909322')
    OR (RIGHT = 'sb_1909322'))
ORDER BY 4 DESC
(1 row, 0 ms)


explain analyze SELECT * FROM TRIPPLETCOUNT 

  WHERE RELATION = 'REQUESTED' 
  AND LEFT = 'sb_1909322' 
UNION 
SELECT * FROM TRIPPLETCOUNT 
  WHERE RELATION = 'REQUESTED' 
  AND RIGHT = 'sb_1909322' 
ORDER BY COUNT DESC;
PLAN  
(SELECT DISTINCT
    TRIPPLETCOUNT.LEFT,
    TRIPPLETCOUNT.RIGHT,
    TRIPPLETCOUNT.RELATION,
    TRIPPLETCOUNT.COUNT
FROM PUBLIC.TRIPPLETCOUNT
    /* PUBLIC.INDEX_LEFT: LEFT = 'sb_1909322' */
    /* scanCount: 1 */
WHERE (RELATION = 'REQUESTED')
    AND (LEFT = 'sb_1909322'))
UNION
(SELECT DISTINCT
    TRIPPLETCOUNT.LEFT,
    TRIPPLETCOUNT.RIGHT,
    TRIPPLETCOUNT.RELATION,
    TRIPPLETCOUNT.COUNT
FROM PUBLIC.TRIPPLETCOUNT
    /* PUBLIC.INDEX_RIGHT: RIGHT = 'sb_1909322' */
    /* scanCount: 1 */
WHERE (RELATION = 'REQUESTED')
    AND (RIGHT = 'sb_1909322'))
ORDER BY 4 DESC

Frederico

unread,
Apr 26, 2012, 5:41:26 PM4/26/12
to h2-da...@googlegroups.com
Hi. This is because H2 uses only one index per logical table. Take a look on using multiples indexes here: http://www.h2database.com/html/performance.html#explain_plan

Att,

Fred

Enviado via iPad
--
You received this message because you are subscribed to the Google Groups "H2 Database" group.
To view this discussion on the web visit https://groups.google.com/d/msg/h2-database/-/yl-Th51GZAsJ.
To post to this group, send email to h2-da...@googlegroups.com.
To unsubscribe from this group, send email to h2-database...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/h2-database?hl=en.

Fred&Dani&Pandora

unread,
Apr 27, 2012, 10:05:23 AM4/27/12
to h2-da...@googlegroups.com
Hi. I'm not able to reproduce your problem, can you post the creation table script?

Att,

Fred

2012/4/26 Frederico <zep...@gmail.com>

Thomas Egense

unread,
May 1, 2012, 1:50:21 AM5/1/12
to H2 Database
Okie, here are all details:

-------------------
DDL:
CREATE TABLE TRIPPLETCOUNT (
LEFT VARCHAR(512),
RIGHT VARCHAR(512),
RELATION VARCHAR(128),
COUNT INT
);

CREATE INDEX LEFT_IN ON TRIPPLETCOUNT(LEFT);
CREATE INDEX RIGHT_IN ON TRIPPLETCOUNT(RIGHT);
CREATE INDEX RELATION_IN ON TRIPPLETCOUNT(RELATION);
CREATE INDEX COUNT_IN ON TRIPPLETCOUNT(COUNT);
--------------------------
The table has about 8 M entries.

Below are the two SQL's and to make it easier they do not find any
rows, but this does not matter to the problem at all. It is the same
scenario if I use an id that will find
many rows.

----------------------
SQL 1: takes 1+ second
explain analyze SELECT * FROM TRIPPLETCOUNT WHERE RELATION =
'REQUESTED' AND ( LEFT = 'sb_2601322' OR RIGHT= 'sb_2601322') ORDER BY
COUNT DESC

SELECT
TRIPPLETCOUNT.LEFT,
TRIPPLETCOUNT.RIGHT,
TRIPPLETCOUNT.RELATION,
TRIPPLETCOUNT.COUNT
FROM PUBLIC.TRIPPLETCOUNT
/* PUBLIC.RELATION_IN: RELATION = 'REQUESTED' */
/* scanCount: 1190147 */
WHERE (RELATION = 'REQUESTED')
AND ((LEFT = 'sb_2909322')
OR (RIGHT = 'sb_2909322'))
ORDER BY 4 DESC
/*
total: 67483
TRIPPLETCOUNT.RELATION_IN read: 17919 (26%)
TRIPPLETCOUNT.TRIPPLETCOUNT_DATA read: 49564 (73%)
*/

----------
SQL 2: takes 0.02 second
explain analyze
SELECT * FROM TRIPPLETCOUNT WHERE RELATION = 'REQUESTED'
AND LEFT = 'sb_2909322'
UNION
SELECT * FROM TRIPPLETCOUNT
WHERE RELATION = 'REQUESTED'
AND RIGHT = 'sb_2909322'
ORDER BY COUNT DESC

SELECT
TRIPPLETCOUNT.LEFT,
TRIPPLETCOUNT.RIGHT,
TRIPPLETCOUNT.RELATION,
TRIPPLETCOUNT.COUNT
FROM PUBLIC.TRIPPLETCOUNT
/* PUBLIC.RELATION_IN: RELATION = 'REQUESTED' */
/* scanCount: 1190147 */
WHERE (RELATION = 'REQUESTED')
AND ((LEFT = 'sb_2601322')
OR (RIGHT = 'sb_2601322'))
ORDER BY 4 DESC
/*
total: 67483
TRIPPLETCOUNT.RELATION_IN read: 17919 (26%)
TRIPPLETCOUNT.TRIPPLETCOUNT_DATA read: 49564 (73%)
*/


Thomas Mueller

unread,
May 1, 2012, 4:31:36 AM5/1/12
to h2-da...@googlegroups.com
Hi,

It is true, H2 doesn't use two indexes for the single query (the one
without union), as described in the documentation at
http://h2database.com/html/performance.html#storage_and_indexes

I just tested that Apache Derby doesn't use two indexes either (the same as H2).

I didn't test, but I guess other databases with a more advanced
optimizer (such as PostgreSQL) do use two indexes for this case. But
I'm not sure. I guess some databases can use two indexes, and some can
not.

Regards,
Thomas

Knut Wannheden

unread,
May 1, 2012, 4:56:12 AM5/1/12
to h2-da...@googlegroups.com
FWIW I can confirm that Oracle will use two indexes in a situation
like this. Plan for OR query:

Filter Predicates Access Predicates Plan
SELECT STATEMENT ALL_ROWS
5 5 5 CONCATENATION
2 2 2 TABLE ACCESS BY INDEX ROWID TABLE X.FOO
1 1 "LEFT"<1234 1 INDEX RANGE SCAN INDEX X.FOO#LEFT
4 LNNVL("LEFT"<1234) 4 4 TABLE ACCESS BY INDEX ROWID TABLE X.FOO
3 3 "RIGHT"<1234 3 INDEX RANGE SCAN INDEX X.FOO#RIGHT

Plan for corresponding UNION query:

Filter Predicates Access Predicates Plan
SELECT STATEMENT ALL_ROWS
6 6 6 SORT UNIQUE
5 5 5 UNION-ALL
2 2 2 TABLE ACCESS BY INDEX ROWID TABLE X.FOO
1 1 "LEFT"<1234 1 INDEX RANGE SCAN INDEX X.FOO#LEFT
4 4 4 TABLE ACCESS BY INDEX ROWID TABLE X.FOO
3 3 "RIGHT"<1234 3 INDEX RANGE SCAN INDEX X.FOO#RIGHT

Regards,

--knut
> --
> You received this message because you are subscribed to the Google Groups "H2 Database" group.

Thomas Egense

unread,
May 1, 2012, 6:37:28 AM5/1/12
to H2 Database
From the documentation: "If the query condition does not contain the
row id (and if no other index can be used), then all rows of the table
are scanned."
I guess that is a major important information when you trying to
optimize your SQL, so it should be stated more visible and clearly.
I am used to DB2/Oracle that performs this optimization, so I just
took it for granted :)

H2 performance is really quite excellent, you just have to know to use
the UNION optimization sometimes.

Thomas.


Thomas Egense

unread,
May 1, 2012, 6:50:08 AM5/1/12
to H2 Database
I see no measurable improvements creating another index over both
columns (LEFT,RIGHT) btw. I already need index for LEFT and RIGHT
individually. Also you have to
consider the column order when creating and index that spawns over
several columns, depending on the data. Ie. have to most restrictive
first etc.
More indexes increases insert-time.

If my table had 1000M instead of 10M rows, maybe the index over
several columns will be faster.

Noel Grandin

unread,
May 2, 2012, 4:58:32 AM5/2/12
to h2-da...@googlegroups.com, Thomas Egense
Thomas, your test case cannot be correct - notice that they are both
reading the same amount of data. The only reason the second is faster is
because the data is cache-hot after you have run the first query.

Thomas Egense

unread,
May 3, 2012, 11:57:17 AM5/3/12
to H2 Database
Any news from dissecting the databasefile I gave you?

From,
Thomas

Steve McLeod

unread,
May 4, 2012, 6:13:24 AM5/4/12
to h2-da...@googlegroups.com
Hi Thomas,

Even with your  database file, the performance using OR can't be improved. You've hit a limit of H2 - a maximum of 1 index per SELECT. An index that contains multiple fields still won't help, because of the way H2 indexes work. Let me explain:

If your index is on (field1, field2, field3) then this index can be applied if the where clause refers to either:
* just field1
* field1 + field2
* field1 + field2 + field3

The index can't be used if you only refer to field2, or field3, or a combination of field2 and field3. This is the case only when the WHERE clause is using AND, rather than OR

The OR in your where clause complicates this further. You should consider that your where clause is effectively two where clauses, one for field1 and one for field2.

I looked back at the query in your original post. Your where clause is:

WHERE RELATION = 'REQUESTED' 

  AND ( LEFT = 'sb_1909322' OR RIGHT= 'sb_1909322') 

Performance tip 1: If only a small percentage of rows have RELATION = 'REQUESTED' then you could have an index on
(RELATION, LEFT). This would undoubtedly lead to better performance than you have now.

Performance tip 2: If possible, can you use an INTEGER instead of a VARCHAR for RELATION? This will make your row data smaller and therefore marginally faster to read I'd imagine.

Thomas Egense

unread,
May 7, 2012, 4:09:57 AM5/7/12
to H2 Database
Thanks for your time analyzing the DB file.

I understand your performance tips and index over multiple fields.

But they are minor compared to the *100+ performance factor I
experience using the OR instead of using two SELECTS and UNION between
them.
Can you confirm you experienced this too? Even with the smaller DB
file I gave you it still still a factor *100+ in difference.

From,
Thomas



Thomas Mueller

unread,
May 7, 2012, 3:46:55 PM5/7/12
to h2-da...@googlegroups.com
Hi,

Nobody needs to run any additional tests, that would be just a waste
of time. It is clear that the H2 database, as well as at least Apache
Derby, don't use two indexes for the given query. Except if you use
UNION. This is documented already, under
http://h2database.com/html/performance.html#storage_and_indexes

Therefore, I suggest to use UNION in this case. This will not only
help H2, but also other databases.

Regards,
Thomas
> --
> You received this message because you are subscribed to the Google Groups "H2 Database" group.

Noel Grandin

unread,
May 12, 2012, 6:47:19 AM5/12/12
to h2-da...@googlegroups.com
The problem here is that H2 was not using any index at all for the
query, it was doing a full table scan.

I'm sorry but I've become really busy so I haven't been able to look into this.

Steve McLeod

unread,
May 13, 2012, 8:41:35 AM5/13/12
to h2-da...@googlegroups.com
Noel, with H2's current approach (one index per query), it is unavoidable to do a full table scan with an OR statement on two different fields.
Reply all
Reply to author
Forward
0 new messages