Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Performance of a UDF with datetime parameters

5 views
Skip to first unread message

Yarik

unread,
Jan 3, 2008, 11:29:27 PM1/3/08
to
Hi,

I'm using SQL Server 2000 and observing some very counter-intuitive
behaviour...

There are a few inline table-valued UDFs having the following
signatures:

alter function dbo.Function1(@From datetime, @To datetime, .../*
other params */) returns table
alter function dbo.Function2(@From datetime, @To datetime, .../*
other params */) returns table
alter function dbo.Function3(@From datetime, @To datetime, .../*
other params */) returns table

These UDFs actually call each other passing the two temporal
parameters along, like this:

alter function dbo.Function1(@From datetime, @To datetime, .../*
other params */) returns table
return
(
select ... from ... Function2(@From, @To, ...) ...
)

alter function dbo.Function2(@From datetime, @To datetime, .../*
other params */) returns table
return
(
select ... from ... Function3(@From, @To, ...) ...
)

Now, the puzzle is: the performance of this call stack turns out to
greatly depend on how the values of the two temporal parameters are
being passed to the "topmost" UDF. Specifically, when Function1
receives its parameter values as variables of type datetime, like this

declare @From as datetime; set @From = '2007-01-01'
declare @To as datetime; set @To = '2007-12-31'

select ... from ... Function1(@From, @To, ...)

the query takes, for example, 14-16 seconds. However, when Function1
receives its parameter values as string literals, like this

select ... from ... Function1('2006-01-01', '2007-12-31', ...)

the query takes only 8-10 seconds. It's almost 50% performance
improvement.

This looks very counterintuitive, because I would expect that
performance in the second case would not be better (if only because
SQL Server would not have to do extra work to convert string literals
into datetime values).

The abovementioned difference in performance seems to decrease if I
decrease the number of calls in stack (for example, when the same
experiment is set up for Function2, the response times are 900 ms vs
700 ms; when the same experiment is set up for the "leaf" function -
the function that does not pass temporal parameters anywhere else -
the response times are practically the same). So it looks like the
answer is in the mechanics of how SQL Server passes parameter values
to UDFs, not in what my UDFs are actually doing with those parameters.

Can anybody explain the phenomenon I'm observing?

I have one theory: maybe SQL Server actually converts all the UDF
parameters to strings (internally), so whenever I use other types to
pass parameter values I actually impose additional load on SQL Server
forcing it to do some type conversions while passing parameters on
each call. On one hand, this theory looks crazy for a person whose
main programming experience lies with C/C++/Java and other strongly
typed non-declarative languages. On the other hand, my modest
experience with UDFs suggests that no theory in this area should be
considered crazy until proven otherwise. :-)

I'd really appreciate any feedback on this problem.

Thank you,
Yarik.


Itzik Ben-Gan

unread,
Jan 5, 2008, 10:16:37 AM1/5/08
to
Hi Yarik,

This may have to do with the fact that the initial compilation unit is the
whole batch (as opposed to the statement).
There's an important difference between the two scenarios you are describing
in terms of the knowledge that the optimizer has/doesn't have regarding the
values used in the query predicates.

In scenario 1, you have the following logic in a single batch:

1. Declare local variables
2. Set values to the variables
3. Issue a query referring to the variables in the query filter

The key point here is that since the initial compilation unit is the whole
batch the expanded query in item 3 is optimized before the assignments in
item 2 were processed. Therefore the optimizer has no knowledge of the
values in the variables, and it ends up using a hard coded guess (which
happens to be 30% selectivity for a range filter).

In scenario 2, you have the following logic in the batch:

1. Issue a query referring to the specified constants in the query filter

SQL Server expands the definitions of of the UDFs, substituting the
parameter values with the specified constants, and the optimizer will
optimize the expanded query with knowledge of the values in the filter.

The end result is that in scenario 2 the optimizer can make better/more
educated decisions when it comes to choices that depend on selectivity;
e.g., which indexes to use and how.

To make this explanation more tangible, I'll use a simplified example, with
only one parameter, and only one UDF:

use northwind;
go

create function dbo.f1(@d as datetime) returns table
as

return
select orderid, orderdate, customerid, employeeid
from dbo.orders
where orderdate >= @d;
go

To demonstrate scenario 1, here's code that declares a variable, sets it
with a value, and queries the UDF:

declare @myd as datetime;
set @myd = '19980506';

select orderid, orderdate, customerid, employeeid from dbo.f1(@myd) as t;

If you examine the actual execution plan of the query, you will see that the
predicate identified by the optimizer is:

orderdate >= @myd

As far as the optimizer is concerned, it needs to optimize a query where the
value in the predicate is unknown at optimization time:

select orderid, orderdate, customerid, employeeid
from dbo.orders
where orderdate >= ?;

Since this is a range filter, it uses a hard coded guess of 30% selectivity.
With such selectivity, the optimizer chooses to use a full table scan
(clustered index scan). If you examine the number of rows flowing out of the
clustered index scan operator, you will find an estimate of 249 rows (while
actual is 4), amounting to 249/830 = 30% selectivity.

To demonstrate scenario 2, here's code that uses a constant:

select orderid, orderdate, customerid, employeeid from dbo.f1('19980506') as
t;

If you examine the actual execution plan of the query, you will see that the
predicate identified by the optimizer is:

orderdate >= '19980506'

This time, the optimizer knows what's the value in the predicate, and can
optimize accordingly based on a selectivity estimate that is close to
reality:

select orderid, orderdate, customerid, employeeid
from dbo.orders
where orderdate >= '19980506';

That is, it can examine the histogram on the orderdate column, calculate the
selectivity of the predicate, realize that it's very high, and opt for using
the index on the orderdate column.

As for how to resolve the problem you're experiencing, there are different
options, and for different cases a different solution would be most
appropriate.
This has to do with the selectivity ranges requested by the queries against
the function, the index design of the underlying objects, and desired
behavior of reuse/nonreuse of cached execution plans.
The subject is a bit too involved to be covered in this post, but just to
give you a sense...

For example, say that a reuse of a previously cached plan is undesired
(e.g., different requests have varying selectivity ranges). In SQL Server
2005 the initial compilation unit is still the whole batch, but the
recompilation unit is the statement. This enabled adding a special option
that you can specify at the query level to request a recompile of the
statement. What changes here compared to scenario 1 is that the statement in
item 3 is optimized after items 1 and 2 were already processed, therefore
the optimizer has knowledge of the values in the variables. Here's how the
code looks like:

declare @myd as datetime;
set @myd = '19980506';

select orderid, orderdate, customerid, employeeid from dbo.f1(@myd) as t
option(recompile);

Here, the query won't reuse a previously cached plan; or more accurately,
SQL Server won't even bother to cache the plan. But if you do not want to
reuse it, that's the behavior you will get. In terms of the efficiency of
the plan, it is based on knowledge of the values in the query filter.

If, on the other hand, all queries against the function use similar
selectivity ranges, and the same plan would be adequate for all invocations,
you would probably want to both get an efficient plan and be able to reuse
it. Probably the best way to achieve this is to use a stored procedure:

create proc dbo.proc1 @d as datetime
as

select orderid, orderdate, customerid, employeeid from dbo.f1(@d) as t;
go

The stored procedure's batch gets optimized upon first invocation AFTER the
value was passed through the parameter. In other words, the optimizer knows
what the value of the variable is when it optimizes the batch. This solution
would generate an optimal plan for the query based on the value provided in
the first invocation, and would reuse the plan in subsequent invocations of
the stored procedure.

declare @myd as datetime;
set @myd = '19980506';

exec dbo.proc1 @myd;

If you examine the execution plan you will see that the optimizer's
selectivity estimate was pretty accurate, and the index on orderdate was
used.

If for your own reasons you prefer not to use a stored procedure, rather
prefer to use ad-hoc code, sp_executesql will provide similar behavior in
terms of knowledge of the value in the parameter:

declare @myd as datetime;
set @myd = '19980506';

exec sp_executesql
@stmt = N'select orderid, orderdate, customerid, employeeid from
dbo.f1(@d) as t;',
@params = N'@d as datetime',
@d = @myd;

There's much more to consider here before choosing the right solution to
your problem, but this was just to give you some sense...

--
BG, SQL Server MVP
www.SolidQ.com
www.InsideTSQL.com


"Yarik" <ya...@garlic.com> wrote in message
news:5b535612-dac3-4e4a...@s8g2000prg.googlegroups.com...

Yarik

unread,
Jan 19, 2008, 9:16:16 PM1/19/08
to
Some sense? Well, it makes a lot of sense to me. :-)

In fact, that was one of the best answers I've ever received in
newsgroups!

Thank you very, very much for your time and wisdom!!

0 new messages