counting rows

buca...@gmail.com

unread,

Mar 30, 2006, 4:11:40 PM3/30/06

to

Consider the tables USERS and PICTURES. When retrieving the number of
pictures from a user there are two ways:

1) I count them from pictures table.
2) I pick it directly from a field "numPictures" in table users.

How are you used to do this? Which is better?
Thanks

paul c

unread,

Mar 30, 2006, 7:56:45 PM3/30/06

to

In a way, it's like the riddles about changing light bulbs (the one I
like best is 'how many psychiatrists?' - answer 'only one but the light
bulb has to want to change'). When even mainframes took a second or two
to count to a million it seemed the fastest way to do that was store
(pardon me, I should have said 'memorize') the number in the database.
I think it depends on what the number means to you and what if anything
its production means to you. I don't believe that any DB dogma has
anything useful in the sense of right or wrong to say about this.

pc

-CELKO-

unread,

Mar 30, 2006, 7:58:30 PM3/30/06

to

Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, data types, etc. in
your schema are. Sample data is also a good idea, along with clear
specifications. It is very hard to debug code when you do not let us
see it.

paul c

unread,

Mar 30, 2006, 8:10:39 PM3/30/06

to

The OP didn't ask about language or code. He asked about 'better'.

pc

Frank Hamersley

unread,

Mar 30, 2006, 11:38:05 PM3/30/06

to

Then he had 'better' post the requested details if he wants the best
outcome.

Cheers, Frank.

Tony Rogerson

unread,

Mar 31, 2006, 2:46:20 AM3/31/06

to

Number 2 will be best because you are effectively storing the aggregate
instead of the server having to count the number of rows in the table which
would require IO.

Tony.

--
Tony Rogerson
SQL Server MVP
http://sqlserverfaq.com - free video tutorials

<buca...@gmail.com> wrote in message
news:1143753100....@u72g2000cwu.googlegroups.com...

Volker Hetzer

unread,

Mar 31, 2006, 6:13:52 AM3/31/06

to

buca...@gmail.com schrieb:

From a software maintenance view, 1) is better. It's way more difficult
to maintain the correct count than to simply put an index on the picture
table and count.

If users have more than let's say ten thousand pictures, you got to use
your particular database's special abilities like materialized views and
so on. At that time, doing a very careful trigger implementation might
become a serious contender too, depending on how poor your database is
at real applications.

Lots of Greetings!
Volker

David Cressey

unread,

Mar 31, 2006, 6:18:45 AM3/31/06

to

"paul c" <toledob...@oohay.ac> wrote in message
news:h7%Wf.201992$sa3.129417@pd7tw1no...

It depends.

Number 1 is better from the point of view of managing redundancy. Number 2
stores a single fact in more than one place. In this case the "fact" is a
count of rows and also the recorded count.

If you mismanage redundancy, unlikely in this case, you can end up with a
database that contradicts itself.
(Where the recorded count doesn't match the actual count).

Number 1 is also better from the point of view of data capture. There is
less work to do when inserting a new picture.

Number 2 is faster for retrieval. It will generally take more work to count
the pictures than to read a recorded count.

If number 2 is too slow at picture insert time, you can sometimes defer
adding one to the recorded count, but you have to make sure that the
recorded count doesn't get used when it's out of date. (Again, redundancy
mismanagement).

For small numbers, this will be a trivial difference. For millions of
pictures, it will make a difference.

David Cressey

unread,

Mar 31, 2006, 6:19:55 AM3/31/06

to

"paul c" <toledob...@oohay.ac> wrote in message
news:h7%Wf.201992$sa3.129417@pd7tw1no...

My favorite is "how many Zen Buddhists does it take to change a light bulb?

Two: one to change the light bulb, and the other to NOT change the light
bulb.

Tony Andrews

unread,

Apr 1, 2006, 6:37:33 AM4/1/06

to

Tony Rogerson wrote:
> Number 2 will be best because you are effectively storing the aggregate
> instead of the server having to count the number of rows in the table which
> would require IO.

That is wildly simplistic. Based on that advice, it would be "best" to
always store the count, sum, min, max, avg, etc. of everything in its
parent table(s). Please tell me you don't!?

Tony Rogerson

unread,

Apr 1, 2006, 11:46:19 AM4/1/06

to

When somebody asks this...

>>Consider the tables USERS and PICTURES. When retrieving the number of
>>pictures from a user there are two ways:

>>1) I count them from pictures table.
>>2) I pick it directly from a field "numPictures" in table users.

>>How are you used to do this? Which is better?

I give an answer to THAT question and not add in other stuff, if the OP
wanted min, max, sum etc... then I'd answer differently.

Given 10 concurrent users all wanting the number of pictures and the number
of rows is 10,000, which in your opinion is better?

-- The engine may need to read 20MBytes of memory per user to realise the
result.
SELECT COUNT(*) FROM yourtable WHERE user = 'x'

or

-- The engine only needs to read a single row per user, a couple of
KBytes at most.
SELECT num_pict FROM user_details WHERE user = 'x'

num_pict is kept up-to-date using a before (instead of) trigger

The trigger simply does a +1 or -1 each time a row is added/removed.

--
Tony Rogerson
SQL Server MVP
http://sqlserverfaq.com - free video tutorials

"Tony Andrews" <andr...@onetel.com> wrote in message
news:1143891453.7...@i39g2000cwa.googlegroups.com...

E. Lefty Kreouzis

unread,

Apr 1, 2006, 12:12:51 PM4/1/06

to

On 2006-04-01, Tony Rogerson <tonyro...@sqlserverfaq.com> wrote:
> When somebody asks this...
> [snip]

>
> Given 10 concurrent users all wanting the number of pictures and the number
> of rows is 10,000, which in your opinion is better?
>
> -- The engine may need to read 20MBytes of memory per user to realise the
> result.
> SELECT COUNT(*) FROM yourtable WHERE user = 'x'
>
> or
>
> -- The engine only needs to read a single row per user, a couple of
> KBytes at most.
> SELECT num_pict FROM user_details WHERE user = 'x'
>
> num_pict is kept up-to-date using a before (instead of) trigger
>
> The trigger simply does a +1 or -1 each time a row is added/removed.
>

Lets extend this scenario a bit you have 100 concurrent users, some
adding pictures (and thus inserting into the picture tables), some
deleting pictures and some doing other unrelated stuff. All in
transactions running at an appropriate isolation level ( at least
repeatable read if not serializable). How is the trigger solution
going to handle that without horrid locking issues?

This is by the way the reason that select count(*) in PostgreSQL
scans the table rather than reading some magic stored count. Just because
different tuples might be visible to different transactions.

Simple example an initially empty table with 3 open transactions t1,
t2 & t3. t1 inserts 2 rows,
t2 inserts 5 rows and
t2 doesn't insert any.

what is count(*) for t1? t2? t3?

E. Lefty Kreouzis

Tony Rogerson

unread,

Apr 1, 2006, 2:03:05 PM4/1/06

to

I'll answer that question with another question - do you want it accurate?
In which case you'll need to serialise anyway.

Locking will be there anyway because of the inserting of the picture row
data.

The update of count is simply modifying 4 bytes on a single row which would
be accessed via a unique index (very little IO), do you really think the
contention will be with the update?

By doing COUNT(*) and serialising the query will cause considerable
contention because the COUNT(*) query will be a lot longer.

--
Tony Rogerson
SQL Server MVP
http://sqlserverfaq.com - free video tutorials

"E. Lefty Kreouzis" <le...@internet.gr> wrote in message
news:slrne2td4n...@goedel.home.rtfm.gr...

Volker Hetzer

unread,

Apr 1, 2006, 3:51:43 PM4/1/06

to

Tony Rogerson schrieb:

> I'll answer that question with another question - do you want it accurate?
> In which case you'll need to serialise anyway.

Read committed would do it.

>
> Locking will be there anyway because of the inserting of the picture row
> data.
>
> The update of count is simply modifying 4 bytes on a single row which would
> be accessed via a unique index (very little IO), do you really think the
> contention will be with the update?
>
> By doing COUNT(*) and serialising the query will cause considerable
> contention because the COUNT(*) query will be a lot longer.

The count has to be protected from modification by accident, code bugs
or someone who doesn't understand the schema. Stuff like this makes a
system more fragile.

I'd consider that trigger stuff only after some serious profiling. And even
then I'd first look what else the database has to offer in order to solve
that problem.

Lots of Greetings!
Volker

Message has been deleted

Volker Hetzer

unread,

Apr 1, 2006, 3:58:05 PM4/1/06

to

Tony Rogerson schrieb:

> I'll answer that question with another question - do you want it accurate?
> In which case you'll need to serialise anyway.
>
> Locking will be there anyway because of the inserting of the picture row
> data.
>
> The update of count is simply modifying 4 bytes on a single row which would
> be accessed via a unique index (very little IO), do you really think the
> contention will be with the update?
>
> By doing COUNT(*) and serialising the query will cause considerable
> contention because the COUNT(*) query will be a lot longer.

The count has to be protected from modification by accident, code bugs
or someone who doesn't understand the schema. Stuff like this makes a
system more fragile.

I'd consider that trigger stuff only after some serious profiling. And even
then I'd first look what else the database has to offer in order to solve
that problem.

Here's a discussion on a related subject (summing up rows in a child table):
http://tinyurl.com/ktc2l

Lots of Greetings!
Volker

E. Lefty Kreouzis

unread,

Apr 1, 2006, 4:41:13 PM4/1/06

to

On 2006-04-01, Tony Rogerson <tonyro...@sqlserverfaq.com> wrote:
> I'll answer that question with another question - do you want it accurate?
> In which case you'll need to serialise anyway.
>

You clearly want it correct, otherwise you don't want to evaluate count(*)
you want to evaluate something like 'estimate rows' which is a different
cettle of fish alltogether.

> Locking will be there anyway because of the inserting of the picture row
> data.
>

No locking is only an issue on /some/ databases.
e.g. in read commited isolation level (the default) with two transactions
both inserting into the same table neither Oracle nor PostgreSQL lock
the table. Both give the correct reply for count(*).

> The update of count is simply modifying 4 bytes on a single row which would
> be accessed via a unique index (very little IO), do you really think the
> contention will be with the update?
>

The contention will be because locking will be mandatory in the trigger since
in order for the count to be correct the transactions will *have* to be serialised.
IO has *nothing* to do with it. If you have to serialize you take a performance
hit on concurrent transactions, no two ways about it.

> By doing COUNT(*) and serialising the query will cause considerable
> contention because the COUNT(*) query will be a lot longer.
>

As I said before, in PostgreSQL in transaction isolation level serializable you
you don't have contention. I just tried the following experiment:

open two sessions s1 & s2

s1: create table s (id int4 primary key);
s1: set session characteristics as transaction isolation level serializable;
s2: set session characteristics as transaction isolation level serializable;
s1: begin work;
s2: begin work;
s1: insert into s values(1);
s1: select * from s;
id
----
1
(1 row)
s1: select count(*) from s;
count
-------
1
(1 row)
s2: select * from s;
id
----
(0 rows)
s2: select count(*) from s;
count
-------
0
(1 row)
s2: insert into s values(2);
s2: select * from s;
id
----
2
(1 row)
s2: select count(*) from s;
count
-------
1
(1 row)
s1: select * from s;
id
----
1
(1 row)
s1: insert into s values(3);
s1: select * from s;
id
----
1
3
(2 rows)
s1: select count(*) from s;
count
-------
2
(1 row)
s2: commit;
s1: select * from s;
id
----
1
3
(2 rows)
s1: select count(*) from s;
count
-------
2
(1 row)
s1: commit;
s1: select * from s;
id
----
1
2
3
(3 rows)
s1: select count(*) from s;
count
-------
3
(1 row)

No locks, no contention (mind you I was careful not to insert using the same
primary key - if I had done so then I would have had the following:

s1: begin work;
s2: begin work;
s1: insert into s values(4);
INSERT 0 1
s2: insert into s values(4);
------- Waiting for lock
s1: commit;
s2: prints ERROR: duplicate key violates unique constraint "s_pkey"

Lefty Kreouzis

Murdoc

unread,

Apr 1, 2006, 5:38:07 PM4/1/06

to

Tony Rogerson wrote:

> When somebody asks this...
>
> > > Consider the tables USERS and PICTURES. When retrieving the number of
> > > pictures from a user there are two ways:
>
> > > 1) I count them from pictures table.
> > > 2) I pick it directly from a field "numPictures" in table users.
>
> > > How are you used to do this? Which is better?
>
> I give an answer to THAT question and not add in other stuff, if the OP wanted min, max, sum
> etc... then I'd answer differently.
>
> Given 10 concurrent users all wanting the number of pictures and the number of rows is 10,000,
> which in your opinion is better?
>
> -- The engine may need to read 20MBytes of memory per user to realise the result.
> SELECT COUNT(*) FROM yourtable WHERE user = 'x'

Wouldn't it be more efficient to do:

SELECT COUNT(user_picture_id) FROM yourTable WHERE user = 'x'

ensuring of course that
- user_picture_id is the unique key of the picture table; and
- there is an index on the table for the user code

>
> or
>
> -- The engine only needs to read a single row per user, a couple of KBytes at most.
> SELECT num_pict FROM user_details WHERE user = 'x'
>
> num_pict is kept up-to-date using a before (instead of) trigger
>
> The trigger simply does a +1 or -1 each time a row is added/removed.

We actually do that with several types of information: # of transactions in a batch; # of members
in a fund; running account balance, etc. However, we do it with code, and not with database-level
triggers.

--

Tony Rogerson

unread,

Apr 2, 2006, 3:08:05 AM4/2/06

to

> No locking is only an issue on /some/ databases.
> e.g. in read commited isolation level (the default) with two transactions
> both inserting into the same table neither Oracle nor PostgreSQL lock
> the table. Both give the correct reply for count(*).

At the start of the transaction they do (sql 2005 has versioning now also
and read committed without blockers), but they don't take into account
inserts and updates do they?

But as I said, you'd need to serialise in order to make sure you didn't have
any phantom inserts/deletes.... (and that is regardless of using a trigger
or not)

> As I said before, in PostgreSQL in transaction isolation level
> serializable you
> you don't have contention. I just tried the following experiment:

Thats too basic an experiment, what happens when inserts are occuring at the
same time the COUNT(*) is taking place - you need to serialise so that
doesn't happen.

And when you serialise it becomes an IO bottleneck, everything waits until
that serialised transaction completes.

--
Tony Rogerson
SQL Server MVP
http://sqlserverfaq.com - free video tutorials

"E. Lefty Kreouzis" <le...@goedel.home.rtfm.gr> wrote in message
news:slrne2tsrs...@goedel.home.rtfm.gr...

Tony Rogerson

unread,

Apr 2, 2006, 3:12:16 AM4/2/06

to

> SELECT COUNT(user_picture_id) FROM yourTable WHERE user = 'x'
>
> ensuring of course that
> - user_picture_id is the unique key of the picture table; and
> - there is an index on the table for the user code

What if there where 1000's, 10000's of pictures, thats a lot of rows to
count, it will be quick, but its the other stuff around it - how do you
protect yourself from phantoms? How do you make sure nobody remove a row
while you are counting?

Unless you serialise then as you are counting the 10000 rows, 5 might get
deleted, you'll incorrectly give 10000 as the total when its actually 9995 -
hence you need to use the isolation of serialisable which then starts to
have an effect because of the number of pages that need to be read to
satisfy the query.

> We actually do that with several types of information: # of transactions
> in a batch; # of members
> in a fund; running account balance, etc. However, we do it with code, and
> not with database-level
> triggers.

That's dangerous, with a trigger its consistent, you can't have people
mangle the numbers so the aggregates get out of date to the facts in the
table - you are relying on too much there I'm afraid.

A trigger will give a consistent value, the trigger fire whenever an
insert/update/delete occurrs.

--
Tony Rogerson
SQL Server MVP
http://sqlserverfaq.com - free video tutorials

"Murdoc" <murd...@hotmail.com> wrote in message
news:xn0ekhvd...@news-south.connect.com.au...

E. Lefty Kreouzis

unread,

Apr 2, 2006, 4:01:44 AM4/2/06

to

On 2006-04-02, Tony Rogerson <tonyro...@sqlserverfaq.com> wrote:
>> No locking is only an issue on /some/ databases.
>> e.g. in read commited isolation level (the default) with two transactions
>> both inserting into the same table neither Oracle nor PostgreSQL lock
>> the table. Both give the correct reply for count(*).
>
> At the start of the transaction they do (sql 2005 has versioning now also
> and read committed without blockers), but they don't take into account
> inserts and updates do they?
>

Well yes they do, as long as two transactions don't try to change the same
row of course. The correct count(*) was given /after/ inserts by the two
long running transactions.

However I see where you come from since doing the same test on MS SQL
Server 2000 (MSDE v8.00.760), this happened

s1: insert -> OK
s1: select count(*) -> OK
s2: insert -> OK
s2: select count(*) -> froze (waiting for lock)
s1: select * -> ERROR: Transaction (Process ID 68) was deadlocked on lock \
resources with another process and has been chosen as the deadlock victim. \
Rerun the transaction.
s2: continues returning the count

> But as I said, you'd need to serialise in order to make sure you didn't have
> any phantom inserts/deletes.... (and that is regardless of using a trigger
> or not)

No you need at least read commited isolation level, however arguably isolation
level serialized is more correct.

The problem is that if you have a trigger to maintain the count then the row
that the trigger updates is also the /same/ row that the other transactions
update. In such a case locking is unavoidable.

> Thats too basic an experiment, what happens when inserts are occuring at the
> same time the COUNT(*) is taking place - you need to serialise so that
> doesn't happen.
>

But the inserts are happening while the other transactions are active. I was
testing long running transactions (autocommit off). When the lock will be
released then has more to do with when the transaction will complete (which
is not necessarily IO bound - think slow network link & extended transactions)
than with how much IO the database server will do.

> And when you serialise it becomes an IO bottleneck, everything waits until
> that serialised transaction completes.
>

Lefty Kreouzis

E. Lefty Kreouzis

unread,

Apr 2, 2006, 4:07:48 AM4/2/06

to

Just to make things clear I did the test on 3 databases using 2 long
running transactions. The results were as follows:

Database Isolation Level Result
Oracle 10g Read committed No locking
PostgreSQL Read committed No Locking
PostgreSQL Serializable No Locking
SQL Server 2000 default (read commited?) Deadlock

The above was done using the default implementation of count(*) for
each database. My claim is that using a trigger would result in
locking for all cases just because the trigger would update the
/same/ row.

Lefty Kreouzis

Tony Rogerson

unread,

Apr 2, 2006, 5:54:00 AM4/2/06

to

Hang on a minute, the whole point about the serialisable transaction
isolation is that no other connections can insert rows within the key range
your query is operating on.

So how on earth can anybody insert a row for that user if you are in the
middle of a COUNT(*)?

The test you are doing is not realistic, you are isolating the connections
by partition the data which in reality isn't something that happens.

My point is that the lock held by the trigger will be significantly less in
duration than all those COUNT(*) queries.

Post the SQL you used for SQL Server 2000 and I'll take a look and code it
to demonstrate what i'm talking about - current version is SQL Server 2005
by the way, its got the snapshot isolation and read committed without
writers blocking readers now.

--
Tony Rogerson
SQL Server MVP
http://sqlserverfaq.com - free video tutorials

"E. Lefty Kreouzis" <le...@goedel.home.rtfm.gr> wrote in message

news:slrne2v1io...@goedel.home.rtfm.gr...

Tony Andrews

unread,

Apr 2, 2006, 6:57:03 AM4/2/06

to

> "Tony Andrews" <andr...@onetel.com> wrote in message
> news:1143891453.7...@i39g2000cwa.googlegroups.com...
> > Tony Rogerson wrote:
> >> Number 2 will be best because you are effectively storing the aggregate
> >> instead of the server having to count the number of rows in the table
> >> which
> >> would require IO.
> >
> > That is wildly simplistic. Based on that advice, it would be "best" to
> > always store the count, sum, min, max, avg, etc. of everything in its
> > parent table(s). Please tell me you don't!?

> Tony Rogerson wrote:
> When somebody asks this...
>
> >>Consider the tables USERS and PICTURES. When retrieving the number of
> >>pictures from a user there are two ways:
>
> >>1) I count them from pictures table.
> >>2) I pick it directly from a field "numPictures" in table users.
>
> >>How are you used to do this? Which is better?
>
> I give an answer to THAT question and not add in other stuff, if the OP
> wanted min, max, sum etc... then I'd answer differently.

So I guess what you are saying is: given that there IS a column called
numPictures in table Users anyway, it is best to use it. You are not
suggesting that such a column should generally be created and
maintained, which is what I inferred from your answer?

E. Lefty Kreouzis

unread,

Apr 2, 2006, 7:02:15 AM4/2/06

to

On 2006-04-02, Tony Rogerson <tonyro...@sqlserverfaq.com> wrote:

> Hang on a minute, the whole point about the serialisable transaction
> isolation is that no other connections can insert rows within the key range
> your query is operating on.
>

No I'm not, the two transactions are inserting rows with different primary
keys. Easy to do with autoincrement columns.

> So how on earth can anybody insert a row for that user if you are in the
> middle of a COUNT(*)?
>

the row that the other transaction is insering is not visible to the transaction
that does thw count(*) so that it shouldn't modify the result.

> The test you are doing is not realistic, you are isolating the connections
> by partition the data which in reality isn't something that happens.
>

As I said, I am not partitioning anything.

> My point is that the lock held by the trigger will be significantly less in
> duration than all those COUNT(*) queries.
>

As I have said again the lock that the trigger will hold will live for the
duration of the holding transaction which can be quite large (think long
running transactions).

> Post the SQL you used for SQL Server 2000 and I'll take a look and code it
> to demonstrate what i'm talking about - current version is SQL Server 2005
> by the way, its got the snapshot isolation and read committed without
> writers blocking readers now.
>

I only have access to an MSDE version of SQL server 200 so that is what I ran my
tests against. I used SquirrelSQL and the freeTDS jdbc driver, opened two
connections, and disabled autocommit.

Then I did the insert, select in the two windows.

It is good that SQL Server 2005 has this feature now.

Lefty Kreouzis

Tony Rogerson

unread,

Apr 2, 2006, 7:15:52 AM4/2/06

to

I think you've fallen into the usual developer trap in thinking what works
small will scale.

If there are only 5 or 6 rows per user then fair enough, but what about a
few thousand, 10'000's? Do you really think the IO will scale....

Say it costs 10MBytes to realise the COUNT(*) each time a user queries it,
now times that by 10, its 100MBytes, then by a thousand its 1GByte, thats
1GByte of memory (if cached) that needs to be queried in order to realise
the result.

Locking aside (which is still a problem because you need to serialise and
get an accurate result), IO comes in to play.

The trigger is a simple look up via a unique index, probably a few KBytes
and then an update on those same pages - changing a 4 byte value.

Try scaling it up - give each user 1000 pictures and simulate 10 users
running COUNT(*) and measure your CPU, now use the trigger method (+1 or -1)
and your CPU will be negligable.

If you can post the SQL you used I can show you what I me.

--
Tony Rogerson
SQL Server MVP
http://sqlserverfaq.com - free video tutorials

"E. Lefty Kreouzis" <le...@goedel.home.rtfm.gr> wrote in message

news:slrne2vbpq...@goedel.home.rtfm.gr...

Tony Rogerson

unread,

Apr 2, 2006, 8:07:59 AM4/2/06

to

> So I guess what you are saying is: given that there IS a column called
> numPictures in table Users anyway, it is best to use it. You are not
> suggesting that such a column should generally be created and
> maintained, which is what I inferred from your answer?

I'm simply answering the question from an implementation and scalability
point of view rather than relational theory.

I am saying, if performance and scalability requirements dictate
(implementation of the logical design) then yes, a number_of_pictures column
should be created, whether that feeds back into the logical model I'm not
sure as its an implementation thing, in the say way you can argue surrogate
keys are an implementation thing and shouldn't be in the logical model.

--
Tony Rogerson
SQL Server MVP
http://sqlserverfaq.com - free video tutorials

"Tony Andrews" <andr...@onetel.com> wrote in message
news:1143975423.1...@e56g2000cwe.googlegroups.com...

Tony Andrews

unread,

Apr 2, 2006, 8:20:08 AM4/2/06

to

Tony Rogerson wrote:
> > So I guess what you are saying is: given that there IS a column called
> > numPictures in table Users anyway, it is best to use it. You are not
> > suggesting that such a column should generally be created and
> > maintained, which is what I inferred from your answer?
>
> I'm simply answering the question from an implementation and scalability
> point of view rather than relational theory.
>
> I am saying, if performance and scalability requirements dictate
> (implementation of the logical design) then yes, a number_of_pictures column
> should be created, whether that feeds back into the logical model I'm not
> sure as its an implementation thing, in the say way you can argue surrogate
> keys are an implementation thing and shouldn't be in the logical model.

Nothing in the original question suggested that performance and
scalability requirements did so dictate! It just asked which was
"better". I am concerned that the OP (who is obviously a database
newbie or wouldn't be asking the question) could get the impression
from your answer that it is always "best" to create and maintain
summary columns for data that you may want to select the count of. I
would have expected some sort of "it depends" in your answer, but there
was none. Of course I don't say NEVER create and maintain such
columns, but I would say that they are the exception rather than the
rule.

E. Lefty Kreouzis

unread,

Apr 2, 2006, 11:29:59 AM4/2/06

to

On 2006-04-02, Tony Rogerson <tonyro...@sqlserverfaq.com> wrote:

> [snip]

> Say it costs 10MBytes to realise the COUNT(*) each time a user queries it,
> now times that by 10, its 100MBytes, then by a thousand its 1GByte, thats
> 1GByte of memory (if cached) that needs to be queried in order to realise
> the result.
>

No that isn't correct, the same 10MB are cached per user no matter how many
times the user queries select count(*) sto it is 100MB not 1GByte.

> Locking aside (which is still a problem because you need to serialise and
> get an accurate result), IO comes in to play.
>
> The trigger is a simple look up via a unique index, probably a few KBytes
> and then an update on those same pages - changing a 4 byte value.
>
> Try scaling it up - give each user 1000 pictures and simulate 10 users
> running COUNT(*) and measure your CPU, now use the trigger method (+1 or -1)
> and your CPU will be negligable.
>
> If you can post the SQL you used I can show you what I me.
>

OK I did my test on Postgresql using jmeter on a remote server over Wifi
11 Mbit.

First the schema:

CREATE TABLE pictures (
id serial NOT NULL,
user_id integer NOT NULL,
picture_name text
);
ALTER TABLE public.pictures OWNER TO test;
CREATE TABLE users (
user_id integer NOT NULL,
name text NOT NULL,
num_pictures integer
);
ALTER TABLE public.users OWNER TO test;
ALTER TABLE ONLY pictures
ADD CONSTRAINT pictures_pkey PRIMARY KEY (id);
ALTER TABLE ONLY users
ADD CONSTRAINT users_name_key UNIQUE (name);
ALTER TABLE ONLY users
ADD CONSTRAINT users_pkey PRIMARY KEY (user_id);
CREATE INDEX picture_user_idx ON pictures USING btree (user_id);
ALTER TABLE ONLY pictures
ADD CONSTRAINT pictures_user_id_fkey FOREIGN KEY (user_id) REFERENCES users(user_id);

we have 10 users (user_1 to user 10) and 1000 pictures per user.

Jmeter has 100 threads, repeated 10

what runs is the following:

insert into pictures(user_id, picture_name) values ( ${user_id}, 'JMeter ${user_id}');
select count(*) from pictures where user_id = ${user_id};

I got a throughput of 16029 transactions per second.

With 10 threads and 100 repeats I get 17192 transactions per second.

I will do the same on SQL Server & Oracle & report the results

Lefty Kreouzis

E. Lefty Kreouzis

unread,

Apr 2, 2006, 11:39:23 AM4/2/06

to

I think that the important thing to keep in mind is the following:

We should design & test on a properly normalised database schema.
IF (and it is a big if) we get to /real/ and /existing/ performance
problems only then we should carefully trade off normalisation
(and concequently data integrity) for performance.

The tradeoff depends very heavily on the /actual/ use patterns
of our application.

Suppose that an insert to the picture table say takes 3ms,
a select count(*) 150ms and
the trigger execution takes 5ms.

Then if every time we insert a picture we execute the count(*)
then the trigger is faster (5ms vs 150ms).

If however the count(*) is a rare operation (say once every 1000 inserts)
then we will have payed 1000*5 = 5000ms in order to gain 150ms!

So we should be very carefull in denormalizing and only with
real world application loads & use patterns.

Let's not forget that premature optimization is the root of many
evils.

Lefty Kreouzis

Murdoc

unread,

Apr 3, 2006, 4:28:23 AM4/3/06

to

Tony Rogerson wrote:

> > SELECT COUNT(user_picture_id) FROM yourTable WHERE user = 'x'
> >
> > ensuring of course that
> > - user_picture_id is the unique key of the picture table; and
> > - there is an index on the table for the user code
>
> What if there where 1000's, 10000's of pictures, thats a lot of rows to count, it will be quick,
> but its the other stuff around it - how do you protect yourself from phantoms? How do you make
> sure nobody remove a row while you are counting?

That's true, although another factor that has been mentioned, is how often is the count required?
Is it required on a frequent or seldom basis? And yes, I do know that these are very subjective,
but the time it takes to count the rows needs to be weighed against the relative number of times
that count is required. For example, if a count of 1,000,000 rows was required every two weeks,
then it would probably not require an additional field, whereas a count of 50,000 rows every 15
seconds will definitely.

> Unless you serialise then as you are counting the 10000 rows, 5 might get deleted, you'll
> incorrectly give 10000 as the total when its actually 9995 - hence you need to use the isolation
> of serialisable which then starts to have an effect because of the number of pages that need to
> be read to satisfy the query.
>
> > We actually do that with several types of information: # of transactions in a batch; # of
> > members in a fund; running account balance, etc. However, we do it with code, and not with
> > database-level triggers.
>
> That's dangerous, with a trigger its consistent, you can't have people mangle the numbers so the
> aggregates get out of date to the facts in the table - you are relying on too much there I'm
> afraid.

First, we aren't. Since all of our business logic in implement in 4GL, it is the only thing that
can access the database. Clients only have read-only access outside of our software. Secondly, by
putting though into the transaction scoping, and making decisions about what is critically
necessary to be up-to-date and what isn't, it can be extremely easy to avoid using database
triggers.

The other advantage in not using database triggers, is that if a bug is found in the business
logic, no schema has to be delivered to client sites to resolve the issue - it is all purely done
in code. We only use triggers for system auditing (who changed what, when, from what to what, etc).

> A trigger will give a consistent value, the trigger fire whenever an insert/update/delete occurrs.

And yet ours also gives a consistent values (for those that it needs to).
--

Volker Hetzer

unread,

Apr 3, 2006, 9:21:29 AM4/3/06

to

Tony Rogerson schrieb:

> I think you've fallen into the usual developer trap in thinking what works
> small will scale.

Nothing scales indefinitely. Neither do triggers. OTOH, with the current
data warehousing hype, databases have become good at aggragating.

> If there are only 5 or 6 rows per user then fair enough, but what about a
> few thousand, 10'000's? Do you really think the IO will scale....

Yes. Just tried it with 10000 users and 10000 picture entries each.
After taking an eternity to insert (this is a very small machine and everything
is on one disk), create the index and gather the optimizer statistics, oracle
reported a total read of 39824 bytes in the execution plan of
"select count(*) from tmp_pictures where user_id=1000".
Executing the query didn't take any noticeable time. I think there's plenty
of room to scale for a real database server with a real disk layout.

> The trigger is a simple look up via a unique index, probably a few KBytes
> and then an update on those same pages - changing a 4 byte value.

The killer scenario here is a bunch of applications registering pictures
for a single user.
a) each trigger firing is a context switch between sql and the trigger language
b) every time two triggers fire, they will have to lock, the second will get a
"not serializable" error as soon as the first one commits and then has to
try again until he's lucky enough to be fastest.

Look what your database offers for large scale stuff. Oracle has
materialized views (create materialized view as select x, count(*) from y group by x)
with fast incremental update and they do exactly what you want without all that
hassle. I'm sure, sqlserver has something comparable.

And *please* OT, what's your load estimate? How many user, how many inserts and
how many queries?

Lots of Greetings!
Volker

buca...@gmail.com

unread,

Apr 3, 2006, 5:53:39 PM4/3/06

to

I would like to thank you all for the feedback, I see the question is
more then just this or that way. Not entering in the detailed level of
the history, the users/pictures were only an example of a situation I
frequently found myself in. I'm not too concerned with performance at
this time, since I even don't know if the project will succed. It was
much more an issue of right db designing than anything else (although
db designing is tied to performance). I think I will store the count,
keep things in their places, (what would be searching without
sorting?). Despite the need of increment/decrement, the count will be
immediately available in the object's result set.

Brian Selzer

unread,

Apr 4, 2006, 2:30:06 AM4/4/06

to

If you're not concerned about performance, then the answer to your question
is clear: don't store the count because it's redundant.

From a theoretical standpoint, redundancy is always bad. A fact can either
be known or not, so it makes no sense for the same fact to be known more
than once. In addition, redundancy shifts the responsibility for
maintaining integrity from the database to the application. Since there can
be more than one application for any given database, it's best to maintain
integrity within the database.

Even from a practical standpoint, redundancy is almost always bad.
Redundancy increases the amount of procedural code required to implement a
solution. This reduces reliability: more code = more bugs. The additional
lock duration combined with the increased query complexity leads to more
blocking and a higher probability for deadlocks, which further reduces
reliability.

Only in very rare instances would you ever want to store the result of an
aggregate function in an OLTP database that also contains the unaggregated
information. If the performance of a particular query is a stated
requirement, and if the only way to meet that requirement (without busting
the budget) is to denormalize or to otherwise introduce redundancy, then do
it. But only after every other avenue has been exhausted.

<buca...@gmail.com> wrote in message
news:1144101219.1...@j33g2000cwa.googlegroups.com...

Volker Hetzer

unread,

Apr 4, 2006, 3:00:16 AM4/4/06

to

buca...@gmail.com wrote:
> It was
> much more an issue of right db designing than anything else (although
> db designing is tied to performance). I think I will store the count,
> keep things in their places,

My sense of "right design" tells me, information shouldn't be twice in
the database. That includes row counts. That kind of "caching" is
definitely not part of the schema design phase.

> (what would be searching without
> sorting?). Despite the need of increment/decrement, the count will be
> immediately available in the object's result set.

I don't know what you mean about immediately, since everything takes time,
but if this is about query complexity, use a view:
create my_user_table_view
as
select attribute_1,...,attribute_n,count(*) picture_count
from users,pictures
where users.userid=pictures.userid
group by attribute_*
;
Then, always use this view instead of my_user_table.
If you run into a performance problem (for instance because a frequent
query which for some reason can't take a few seconds is
"select * from my_user_table_view" without any where clauses), use a
materialized, indexed or cube view, depending on your database system.

For me a general heuristics is to design the tables and relations as
clean as possible, then use views for the regular queries, nesting them
so as to keep the structure simple for the guy who has to maintain
it when you are on vacation.
Then, if there are performance problems, materialize/index/cube the
offending views.

Lots of Greetings!
Volker