
Why does doing the update in chunks work faster than doing the whole thing?


Sydney Lotterby

Feb 16, 2006, 1:00:26 AM
SQL2K w/all sps; Win2K3 Enterprise clustered server; 3 GHz dual CPUs; 4 gb
RAM; 200+gb HD free; low traffic;

Tables below: GoFwdImages ~ 490,000 rows; rectype ~ 20 rows; instloc =
1.6M rows
All these tables have the necessary indexes on them to make the JOINs
efficient.

The script below ran for 1.5 hrs and I suspected it might keep going for
another 10+ hrs so I killed it.

UPDATE gfi
set gfi.InstrlocID = i.instrlocid
FROM __GoFwdImages gfi
JOIN instloc i on i.volume = gfi.volume and i.page = gfi.page
JOIN rectype rt on gfi.rt = rt.shortdesc
JOIN rectype rt2 on rt2.rectype = i.rectype


The script below is on target to complete in ~ 30 mins!
I check progress with: select top 10 * from _tcaProgress order by id desc

It is a pain to have to break this into chunks rather than letting SQL
Server handle the efficiencies for me. Can anyone tell me why SQL Server is
laboring over this update?


declare @i integer
declare @c integer
declare @chunk integer

set @chunk = 10000

set @i = 1
set @c = 1

while (@c > 0)
begin
UPDATE gfi
set gfi.InstrlocID = i.instrlocid
FROM __GoFwdImages gfi
JOIN instloc i on i.volume = gfi.volume and i.page = gfi.page
JOIN rectype rt on gfi.rt = rt.shortdesc
JOIN rectype rt2 on rt2.rectype = i.rectype
where gfi.InstrlocID is null
and gfi.id >= @i and gfi.id < @i + @chunk

set @c = @@rowcount

--print 'Updated ' + cast((@i-1) as varchar(10)) + ' rows.'
insert into _tcaProgress(operation, progress)
values ('update gfi', @i)

set @i = @i + @chunk
end
GO


Daniel Crichton

Feb 16, 2006, 7:35:34 AM
Your second update is not the same as the first. In the second you are
limiting the update to only those rows where __GoFwdImages.InstrlocID is
null; in the first you are updating every row in __GoFwdImages even if
InstrlocID is already set. The first thing I'd do is adjust your first
query to match the second and compare again.
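
For example, the adjusted first statement would be roughly this (same
tables and columns as in your post, just with the extra filter):

UPDATE gfi
set gfi.InstrlocID = i.instrlocid
FROM __GoFwdImages gfi
JOIN instloc i on i.volume = gfi.volume and i.page = gfi.page
JOIN rectype rt on gfi.rt = rt.shortdesc
JOIN rectype rt2 on rt2.rectype = i.rectype
where gfi.InstrlocID is null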

If it's still taking a very long time, it's possible that it's due to the
number of rows being updated in one transaction. In the first query you are
updating every row in one go. In the second you are breaking it up into
smaller transactions of 10000 rows each. How big is your log file? Is it big
enough to handle the entire table update in one go, if not and autogrow is
enabled then there will be a performance issue due to the log file having to
be resized during the update.

There are lots of possible reasons for the difference in performance.
However, I'd suspect that the lack of a WHERE clause in your first query has
a lot to do with it.

Dan

JT

Feb 16, 2006, 8:56:23 AM
By breaking the update into smaller batches, you reduce the size of the
active transaction log and also the number of locks. The programming
required to perform the loop and conditional processing is fairly derivative
and can be easily carried over to another update / delete statement. If this
is an issue, then perhaps you could try re-designing your workflow or data
model in such a way that it doesn't require frequent mass updates on large
tables.
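
For instance, the same loop skeleton carried over to a hypothetical mass
delete (the table name and purge condition below are made up, just to show
the shape; SET ROWCOUNT is the SQL 2000 way to cap the rows per pass):

declare @c integer
set @c = 1

set rowcount 10000   -- cap each delete at 10000 rows
while (@c > 0)
begin
    delete from SomeArchiveTable     -- hypothetical table
    where ArchiveDate < '20050101'   -- hypothetical purge condition

    set @c = @@rowcount
end
set rowcount 0       -- remove the cap again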

"Sydney Lotterby" <syd...@infosearch.com> wrote in message
news:OzIix5rM...@TK2MSFTNGP10.phx.gbl...

Sydney Lotterby

Feb 16, 2006, 9:29:58 AM
Dan,
This is a one time conversion script and all rows in gfi.instrlocid are
null. Since I am updating from 'bottom to top' the 'where instrlocid is
null' is redundant. So I would say the two scripts are essentially the
same -- except of course for the 'between ID' constraint.

The Log file starts at 1 Mb and grows. When I killed the big script it was
~ 1 gb and I know it is set to auto-grow at the default increment (10%??).
So perhaps this automatic growth is part of the problem. Perhaps locks are
a problem too (?)

1) Since this is a 'once-off' and not a production script - can I turn off
LOCKs in an UPDATE? I know I can in a SELECT.
2) What do you recommend I do with the log file?
3) Can you elaborate on your statement "There are lots of possible reasons
for the difference in performance"?

tia


"Daniel Crichton" <msn...@worldofspack.co.uk> wrote in message
news:uVND9Vv...@TK2MSFTNGP11.phx.gbl...

Sydney Lotterby

Feb 16, 2006, 9:32:44 AM
JT,

This is a one time conversion script and all rows in gfi.instrlocid are
null. Since I am updating from 'bottom to top' the 'where instrlocid is
null' is redundant. So I would say the two scripts are essentially the
same -- except of course for the 'between ID' constraint.

1) It would help if I understood what gets posted to the Log file in a big
loop like this.
2) Can you elaborate on your statement "The programming required to perform
the loop and conditional processing is fairly derivative and can be easily
carried over to another update / delete statement"?

tia

"JT" <som...@microsoft.com> wrote in message
news:uiosGFwM...@TK2MSFTNGP12.phx.gbl...

JT

Feb 16, 2006, 10:26:40 AM
> 1) It would help if I understood what gets posted to the Log file in a big
> loop like this.

Sydney,

Any data pages or index pages that are modified by the update statement are
copied to the transaction log, and the transaction is uncommitted and active
for the duration of the update. If you update every row in a 2 GB table,
then at least 2 GB will be written to the log, and in some cases several
times more data will be logged. The purpose of all this logging is to enable
SQL Server to recover from a failed transaction by rolling back the pages to
their original state.
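
If you want to watch how much of the log the statement is consuming while
it runs, you can check from another connection with:

DBCC SQLPERF (LOGSPACE)   -- reports log size and percent used for each database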

If gfi.instrlocid is a column of a variable length data type (such as
VarChar), then updating it from NULL to a non-NULL value will increase the
total length of the row, causing movement of the row to another physical
location, perhaps page splitting, and even more transaction logging. If you
know that every value in this column will eventually be updated later, then
it may be best to initially insert it with an appropriate number of spaces,
or perhaps even define it as Char instead.
Understanding SQL Server's DBCC SHOWCONTIG:
http://www.sql-server-performance.com/dt_dbcc_showcontig.asp
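
For example, to check fragmentation on the table you are updating:

DBCC SHOWCONTIG ('__GoFwdImages') WITH TABLERESULTS, ALL_INDEXES
-- Scan Density and Logical Scan Fragmentation are the numbers to watch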

If there are other processes attempting to select from the table while the
update is taking place, then either the select process or the update process
may be held in a blocked state, thus increasing the duration of the
transaction.
Incomplete transaction may hold large number of locks and cause blocking:
http://support.microsoft.com/kb/295108/EN-US/
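
A quick way to see whether the update is blocked (or is blocking something
else) is to run, from a second connection:

EXEC sp_who2     -- the BlkBy column shows which spid is blocking
EXEC sp_lock     -- lists the locks currently held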

There are also some techniques, such as placing the location of the log file
on another drive, that can improve the general performance of transaction
logging:
Optimizing Transaction Log Performance:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/optimsql/odp_tun_1_5aqt.asp

How to stop the transaction log of a SQL Server database from growing
unexpectedly:
http://support.microsoft.com/default.aspx?scid=kb;en-us;873235
A transaction log grows unexpectedly or becomes full on a computer that is
running SQL Server:
http://support.microsoft.com/kb/317375/
Checkpoints and the Active Portion of the Log:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/architec/8_ar_da2_8y3y.asp

> 2) Can you elaborate on your statement "The programming required to
> perform the loop and conditional processing is fairly derivative and can
> be easily carried over to another update / delete statement"?

I meant that if you have several different procedures that perform mass
update or deletes, the existing T-SQL can be retrofitted fairly easily by
wrapping it in generic looping logic and adding an additional where clause
condition.


"Sydney Lotterby" <syd...@infosearch.com> wrote in message

news:u4PbFYwM...@TK2MSFTNGP09.phx.gbl...

Daniel Crichton

Feb 16, 2006, 10:37:08 AM
Sydney wrote on Thu, 16 Feb 2006 08:29:58 -0600:

> Dan,
> This is a one time conversion script and all rows in gfi.instrlocid are
> null. Since I am updating from 'bottom to top' the 'where instrlocid is
> null' is redundant. So I would say the two scripts are essentially the
> same -- except of course for the 'between ID' constraint.
>
> The Log file starts at 1 Mb and grows. When I killed the big script it
> was ~ 1 gb and I know it is set to auto-grow at the default increment
> (10%??). So perhaps this automatic growth is part of the problem. Perhaps
> locks are a problem too (?)
>
> 1) Since this is a 'once-off' and not a production script - can I turn off
> LOCKs in an UPDATE? I know I can in a SELECT.

As you're updating all rows in the first query, you'll end up with a much
wider lock. You can't turn off locks in an update - if you try WITH (NOLOCK)
you'll get an error saying you can't perform an update if you have locks
disabled. You could try various locking hints (check BOL for options) to see
if any help, but SQL Server is trying to use the locking scheme it thinks is
most appropriate for the rows you are updating.

> 2) What do you recommend I do with the log file?

Make it a little larger than you'll need to handle the entire single update.
Resizing that log file is most likely the biggest performance killer.
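
Something along these lines would do it (the database and logical log file
names here are placeholders - run sp_helpfile to find yours):

ALTER DATABASE MyDatabase
MODIFY FILE (NAME = MyDatabase_Log, SIZE = 2000MB)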

> 3) Can you elaborate on your statement "There are lots of possible reasons
> for the difference in performance"?

Different execution plans, different locking schemes on the table(s),
parallelism kicking in (if you have multiple processors/cores, sometimes
parallelism can slow things down rather than speed them up), disk
fragmentation, index fragmentation, etc. If you have an index on the column
you're updating, consider removing it before the update, and then adding it
again. There are plenty of posts in this group about various checks for
performance issues, have a dig around and try some of them.
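
For the drop/re-create suggestion, a rough outline (the index name here is a
placeholder - script out your real index definition first):

DROP INDEX __GoFwdImages.IX_GoFwdImages_InstrlocID

-- ... run the update here ...

CREATE INDEX IX_GoFwdImages_InstrlocID ON __GoFwdImages (InstrlocID)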

Dan

Sydney Lotterby

Feb 17, 2006, 1:44:52 PM
JT,
Thanks for the pointers. I was aware of some of these but not others.
FYI - gfi.instrlocid is int and contains NULLs and points to another table,
but is not defined as an FK, so perhaps I could preset these to 0 so there
would be no dynamic allocation of the 4 bytes for an int. Am I thinking
correctly on this?

Also, in BOL there is this article below. Since I absolutely guarantee that
no other user will be accessing my db during the update, will any of these
settings help performance with respect to LOCKs?

tia

How can I set the database to single user mode and restrict the access to
dbo use only?

Answer:

In SQL Server 2000, a database cannot be in single-user mode with dbo use
only. Instead, the following alternative options are available by using the
ALTER DATABASE command:
ALTER DATABASE database SET SINGLE_USER.
This command restricts access to the database to only one user at a time.

ALTER DATABASE database SET RESTRICTED_USER.
This command restricts access to the database to only members of the
db_owner, dbcreator, or sysadmin roles.

ALTER DATABASE database SET MULTI_USER.
This command returns access to the database to its normal operating state.

"JT" <som...@microsoft.com> wrote in message

news:OMuTj3wM...@TK2MSFTNGP10.phx.gbl...

Sydney Lotterby

Feb 17, 2006, 1:46:51 PM
Thanks for your input. I have posted a response to the second reply to my
original posting in this thread. If you have any input on that I would like
to hear it.

"Daniel Crichton" <msn...@worldofspack.co.uk> wrote in message

news:OjzDa7wM...@TK2MSFTNGP12.phx.gbl...

JT

Feb 17, 2006, 2:53:47 PM
In order of importance:

#1 Continue "chunking" your updates; that's the main thing. Also add a
CHECKPOINT between each iteration to flush the buffered writes and designate
a recovery point (a combined sketch is below).
#2 You are updating from a query, so execute a single chunk in Query
Analyzer using the Show Execution Plan option and confirm that your joins to
[instloc], [rectype], etc. are indexed.
#3 Configure the location of the log file on a separate drive.
#4 Set the database recovery model (at least temporarily) to Simple
Recovery. This tells SQL Server to truncate the transaction log at each
checkpoint and minimize its size.
#5 If you are sure about not wanting user access to the table during this
update, then specify the TABLOCKX hint to hold an exclusive lock on the
table for the duration of the update statement. Setting the database to DBO
only mode would lock out the entire database, but perhaps that would also
yield some marginal performance gain.
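
Putting #1, #4 and #5 together against your original script, a rough sketch
(the database name is a placeholder, and I'm assuming your current recovery
model is Full):

-- Placeholder database name; assumes the current recovery model is Full
ALTER DATABASE MyDatabase SET RECOVERY SIMPLE

declare @i integer, @c integer, @chunk integer
set @chunk = 10000
set @i = 1
set @c = 1

while (@c > 0)
begin
    UPDATE gfi
    set gfi.InstrlocID = i.instrlocid
    FROM __GoFwdImages gfi WITH (TABLOCKX)
    JOIN instloc i on i.volume = gfi.volume and i.page = gfi.page
    JOIN rectype rt on gfi.rt = rt.shortdesc
    JOIN rectype rt2 on rt2.rectype = i.rectype
    where gfi.InstrlocID is null
    and gfi.id >= @i and gfi.id < @i + @chunk

    set @c = @@rowcount
    set @i = @i + @chunk

    checkpoint   -- flush dirty pages and let the (simple-recovery) log truncate
end

ALTER DATABASE MyDatabase SET RECOVERY FULL   -- put the original model back afterwards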

"Sydney Lotterby" <syd...@infosearch.com> wrote in message

news:%23A8fqJ$MGHA...@TK2MSFTNGP09.phx.gbl...
