Simple CTE (not recursive) with a custom function

285 views
Skip to first unread message

Harshad RJ

unread,
Apr 3, 2015, 3:24:59 PM4/3/15
to h2-da...@googlegroups.com
Hi,

I have created a custom function called SIMILARITY to measure similarity between two strings. It works fine when I use it like this:

SELECT name, SIMILARITY(name, 'alex') AS siml
  FROM
Person
  ORDER BY siml DESC
  LIMIT
10;



However, when querying a large table, this is slow. To speed it up, I want to filter away names whose similarity is very low (so that result set would be smaller and sorting would be faster). I am not allowed to use the siml alias in the WHERE clause, so I tried using a CTE:

WITH cte(name, siml) AS (
  SELECT name
, SIMILARITY(name, 'alex') AS siml
      FROM
Person
)
SELECT
* FROM cte
  WHERE cte
.siml > 0.1
  ORDER BY cte
.siml DESC
  LIMIT
10;


This syntax works fine in Postgresql. But in H2, it gives this error:
Syntax error in SQL statement "recursive queries without UNION ALL"; expected {1};

From the past messages in this group, it seems like WITH statement is not yet fully supported. But this seems to be a simple variant. Just wondering if there is a quick solution possible.

Thanks,
Harshad

Harshad RJ

unread,
Apr 4, 2015, 1:29:34 AM4/4/15
to h2-da...@googlegroups.com
Hi,

I did some more analysis and have a better understanding, so let me rephrase and simplify my question.

My query was crossing the default result set limit of 40,000 rows, due to which h2 is storing the result set to disk, which was the cause for the slow down.

Increasing the limit (h2.maxMemoryRows) helped the performance a lot. But this can't scale when the database grows.

On the other hand, adding a WHERE clause helps limit the number of rows in the result, which prevents h2 from writing the result set to disk in a scalable way. But, currently my custom function has to be invoked twice (once in the SELECT column list and once in the WHERE clause). Since this is an expensive function, it would be better to avoid repeating the call.

Using the below CTE seems to be one way to avoid the duplicate call. It works fine in Postgresql and hsqldb.


On Saturday, 4 April 2015 00:54:59 UTC+5:30, Harshad RJ wrote:

WITH cte(name, siml) AS (
  SELECT name
, SIMILARITY(name, 'alex') AS siml
      FROM
Person
)
SELECT
* FROM cte
  WHERE cte
.siml > 0.1
  ORDER BY cte
.siml DESC
  LIMIT
10;


Is support for this simple CTE possible in the near future? It would help tremendously for the kind of queries I am encountering.

Thanks,
Harshad

Noel Grandin

unread,
Apr 4, 2015, 5:55:00 AM4/4/15
to h2-da...@googlegroups.com
Why can you not use it in a where clause?
--
You received this message because you are subscribed to the Google Groups "H2 Database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to h2-database...@googlegroups.com.
To post to this group, send email to h2-da...@googlegroups.com.
Visit this group at http://groups.google.com/group/h2-database.
For more options, visit https://groups.google.com/d/optout.

Harshad RJ

unread,
Apr 4, 2015, 6:47:37 AM4/4/15
to h2-database

On Sat, Apr 4, 2015 at 3:24 PM, Noel Grandin <noelg...@gmail.com> wrote:
Why can you not use it in a where clause?

It's not supported by the SQL standard. None of the databases I tried support that.

Just to clarify, I can't use the alias in the WHERE clause​, like this:

select distance(name, 'xyz') as dist
   from person
   where dist > 0.1;

I can repeat the expression in the WHERE clause like this:

select distance(name, 'xyz') as dist
   from person
   where distance(name, 'xyz') > 0.1;

And that works, but takes more time.

--

Noel Grandin

unread,
Apr 4, 2015, 12:47:58 PM4/4/15
to h2-da...@googlegroups.com
Putting it into a CTE isn't going to make it any faster. 
--

Harshad RJ

unread,
Apr 4, 2015, 2:52:06 PM4/4/15
to h2-database

On Sat, Apr 4, 2015 at 10:17 PM, Noel Grandin <noelg...@gmail.com> wrote:
Putting it into a CTE isn't going to make it any faster. 

​Yup; I just realized this. The CTE will be inlined into a nested SELECT right?

Ok, I have another question. I put up a counter inside my custom function to see how many times it gets called. And the count is very strange.

The table has 90,000 rows. When I execute the SELECT statement without the WHERE clause, the function is called 90,000 times as expected.

However, when I add the WHERE clause, which repeats the call to the custom function, I get a count of 150,000 and sometimes it is 160,000!

If it was 90,000 or 180,000 it would have made sense; former number means there is Common Subexpression Elimination; latter means it is not (the sub-expression is evaluated twice).

But, why 150,000? Does some optimisation kick in after a certain number of rows have been processed?

​thanks,
--

Harshad RJ

unread,
Apr 4, 2015, 3:01:51 PM4/4/15
to h2-database

On Sun, Apr 5, 2015 at 12:21 AM, Harshad RJ <harsh...@gmail.com> wrote:
However, when I add the WHERE clause, which repeats the call to the custom function, I get a count of 150,000 and sometimes it is 160,000!

​I forgot to mention: it is an atomic counter. There is no race condition there.​



--

Noel Grandin

unread,
Apr 4, 2015, 3:24:32 PM4/4/15
to h2-da...@googlegroups.com
If you declare the function as deterministic we have a small cache of input-to-output values for each user function. 


On Saturday, 4 April 2015, Harshad RJ <harsh...@gmail.com> wrote:
--

Harshad RJ

unread,
Apr 4, 2015, 4:14:44 PM4/4/15
to h2-database

On Sun, Apr 5, 2015 at 12:54 AM, Noel Grandin <noelg...@gmail.com> wrote:
If you declare the function as deterministic we have a small cache of input-to-output values for each user function. 

​Ah yes; that would explain it. Moreover, if the WHERE clause ​evaluates to false, some rows would get eliminated, avoiding the duplicate evaluation in the SELECT column list.

I think the summary of this thread is a feature request: if possible, please implement Common Sub-expression Elimination in H2.

​I sent too many emails before understanding the root cause; I am sorry for the noise.​

--

Noel Grandin

unread,
Apr 5, 2015, 2:07:59 PM4/5/15
to h2-da...@googlegroups.com
Use a nested select. Do the function in the inner select and the where and group by in the outer select. 


On Saturday, 4 April 2015, Harshad RJ <harsh...@gmail.com> wrote:
--

Harshad RJ

unread,
Apr 7, 2015, 1:38:32 PM4/7/15
to h2-database

On Sun, Apr 5, 2015 at 11:37 PM, Noel Grandin <noelg...@gmail.com> wrote:
Use a nested select. Do the function in the inner select and the where and group by in the outer select. 

​That doesn't help. It calls the function twice. Here's my query:

  SELECT name, dist FROM (
    SELECT name, DISTANCE(name, 'alex') as dist
       FROM Person
    )
    WHERE dist > 0.05
    ORDER BY dist DESC
    LIMIT 5;

​This evaluates the `DISTANCE` function twice!

I can confirm that DISTANCE() is being called in the WHERE clause, not in the ORDER clause (by simply deleting that clause and observing the count).

​thanks for looking,​
--

Thomas Mueller

unread,
Apr 10, 2015, 1:41:59 AM4/10/15
to h2-da...@googlegroups.com
Hi,

Yes, it calls the method twice, once to evaluate the condition "dist > 0.05", and once to get the row for result ("select name, dist"). To avoid that, you would need to use a small cache. There currently is no cache in H2, as a cache would slow down performance unnecessarily for simple methods, and is also not necessary for slow methods if the method is only used once (for example, just in the condition, or just in the result).

A workaround is to change the query slightly, to do the caching yourself, for example:

SELECT name, @dist as dist FROM (
  SELECT name, DISTANCE(name, 'alex') as dist
     FROM Person
  )
  WHERE (@dist := dist) > 0.05
  ORDER BY dist DESC
  LIMIT 5;

Regards,
Thomas

Harshad RJ

unread,
Jun 25, 2015, 1:39:51 PM6/25/15
to h2-database
​Sorry for the late reply.​

On Fri, Apr 10, 2015 at 11:11 AM, Thomas Mueller <thomas.to...@gmail.com> wrote:
A workaround is to change the query slightly, to do the caching yourself, for example:

​​
SELECT name, @dist as dist FROM (
  SELECT name, DISTANCE(name, 'alex') as dist
     FROM Person
  )
  WHERE (@dist := dist) > 0.05
  ORDER BY dist DESC
  LIMIT 5;

​Your suggestion seemed very convoluted at first, but then I realized I could simplify it thus:


​​SELECT name, @dist as dist
  FROM Person
  WHERE (@dist := DISTANCE(name, 'alex')) > 0.05
  ORDER BY dist DESC
  LIMIT 5;

​Seems to work fine. The nested query doesn't seem to be required.

​If there is some subtle difference between the two, please let me know.

​thanks,​
--
Reply all
Reply to author
Forward
0 new messages