Re: SQL Humor

Mikito Harakiri

unread,

Aug 18, 2005, 1:17:34 PM8/18/05

to

Another source for quote of the week.

Although the author is unfair emphasizing that DBA has to be aware char
vs varchar2 difference. The character string types in SQL are just
plain silly. For how long ordinary programming languages have [a
single] type String [with unlimited bounds] already? Also when did you
design a database schema with char datatype last time? In a word, there
is not a single advantage of char over varchar2. Therefore, char is not
even worth mentioned, and should just be deprecated: there are to many
really important things on DBA plate.

Mike Labosh wrote:
> I am sure many of you folks are already subscribed, but for those who are
> not, here's a story of some DBA job interviews of candidates that should
> just be summarily shot:
>
> (This will wrap)
>
> http://www.sqlservercentral.com/columnists/sMcCown/howdoyouspellsql_printversion.asp
>

JT

unread,

Aug 18, 2005, 1:33:16 PM8/18/05

to

If this was intended to be a humorous quote from another faux DBA, then
thanks for the laugh! If that was your own opinion (and I'm assuming not),
then do society a favor by quiting your job and living off unemployment.
;-)

"Mikito Harakiri" <mikharaki...@yahoo.com> wrote in message
news:1124385453....@f14g2000cwb.googlegroups.com...

Mikito Harakiri

unread,

Aug 18, 2005, 2:14:09 PM8/18/05

to

No, that was my opinion. Do you imply I'm not qualified to be a DBA?
Well, I would be ashamed to be called a master of extents and segment
management.

vc

unread,

Aug 18, 2005, 2:21:54 PM8/18/05

to

Mikito Harakiri wrote:
> No, that was my opinion. Do you imply I'm not qualified to be a DBA?
> Well, I would be ashamed to be called a master of extents and segment
> management.
>

What's so shameful about being called a DBA ?

> >

Razvan Socol

unread,

Aug 18, 2005, 2:23:34 PM8/18/05

to

> there is not a single advantage of char over varchar2

"varchar2" ? You seem to live in an Oracle world, but here we are MS
SQL Server people. And in SQL Server there is an advantage for char
over varchar: char(n) has takes two bytes less to store than
varchar(n), if the string has always n characters.

Razvan

Mikito Harakiri

unread,

Aug 18, 2005, 2:44:32 PM8/18/05

to

Hmm... With todays terabytes of disc memory, are 2 extra bytes really
that important? If storage size is really critical, then maybe column
compression might help?

Stu

unread,

Aug 18, 2005, 2:55:55 PM8/18/05

to

Not that I totally disagree with the concept of deprecating char vs
varchar, but I see this argument used a lot, and there's two flaws with
it.

1. Storage size may be cheap and plentiful, but performance should
always be foremost in the DBA mind. 2 bytes in a single column pf
storage may not be much, but you also need to write queries that
retrieve that extra 2 bytes. It's not just disk space; it's also
memory and CPU.

2. Failing to appreciate the differences between varchar and char sets
us on a path of lazy design. If it doesn't matter if I use char(10) vs
varchar(10), then what's the matter with varchar(50)? How about
varchar(51) etc? Eventually we could get away with "oh screw it; disks
are cheap, memory is cheap, CPU's are powerful, let's just put
everything in an text column and parse it on the fly".

I realize that what you are saying is a long way from suggesting that
we just throw away relational design altogether, but it's the small
steps that lead to the bigger leaps that lead to the cliffs.

Just my .02
Stu

JT

unread,

Aug 18, 2005, 3:29:12 PM8/18/05

to

Consider what percentage of database servers in production have
terabytes of unused disk storage available for whatever machiavelian
database model some theorist chooses to dream up? Within the next 50 years,
advances in quantum mechanics may allow for transdimentional data storage,
but today we are constrained by budgets and the limitations of silicon and
magnetic metal plates.
What impact the 2 byte overhead has on the total width of a record
depends on how many VarChar columns are included on the record layout. None
of the top three database server platforms have native support for data
compression, but even if they did, there is still the issue CPU cycles
required fro decompression and the memory consumed by the uncompressed
image.

"Mikito Harakiri" <mikharaki...@yahoo.com> wrote in message

news:1124390672....@z14g2000cwz.googlegroups.com...

JT

unread,

Aug 18, 2005, 3:31:58 PM8/18/05

to

You smell like someone who spends most of their time lurking around the
halls of a university or research lab rather than developing database models
for real production systems.

"Mikito Harakiri" <mikharaki...@yahoo.com> wrote in message

news:1124388849.1...@g47g2000cwa.googlegroups.com...

vc

unread,

Aug 18, 2005, 3:48:49 PM8/18/05

to

JT wrote:
> Consider what percentage of database servers in production have
> terabytes of unused disk storage available for whatever machiavelian
> database model some theorist chooses to dream up? Within the next 50 years,
> advances in quantum mechanics may allow for transdimentional data storage,
> but today we are constrained by budgets and the limitations of silicon and
> magnetic metal plates.
> What impact the 2 byte overhead has on the total width of a record
> depends on how many VarChar columns are included on the record layout. None
> of the top three database server platforms have native support for data
> compression, but even if they did, there is still the issue CPU cycles
> required fro decompression and the memory consumed by the uncompressed
> image.

Actually, Oracle has and has had since release 9.

AK

unread,

Aug 18, 2005, 4:03:54 PM8/18/05

to

>> I would be ashamed to be called a master of extents and segment
management. <<

what is so shameful in it?

Mikito Harakiri

unread,

Aug 18, 2005, 4:17:37 PM8/18/05

to

Stu wrote:
> Not that I totally disagree with the concept of deprecating char vs
> varchar, but I see this argument used a lot, and there's two flaws with
> it.
>
> 1. Storage size may be cheap and plentiful, but performance should
> always be foremost in the DBA mind. 2 bytes in a single column pf
> storage may not be much, but you also need to write queries that
> retrieve that extra 2 bytes. It's not just disk space; it's also
> memory and CPU.

How much performance difference does it really make? Keep in mind that
in a typical table you can declare char a couple of boolean (Y/N)
columns at most.

> 2. Failing to appreciate the differences between varchar and char sets
> us on a path of lazy design. If it doesn't matter if I use char(10) vs
> varchar(10), then what's the matter with varchar(50)? How about
> varchar(51) etc? Eventually we could get away with "oh screw it; disks
> are cheap, memory is cheap, CPU's are powerful, let's just put
> everything in an text column and parse it on the fly".

Failing to appreciate the differences between varchar and char sets is
an ability to raise the level of abstraction. (BTW, the skill many DBAs
lack.) With low level of abstraction you would never quit chaising
perceived problems (eg. trying to figure out the "optimal" block size,
pondering if

select * from table

is faster than

select col1, col2, ... from table

etc.)

JT

unread,

Aug 18, 2005, 4:14:00 PM8/18/05

to

I said none of the top three have it. ;-)

"vc" <bost...@hotmail.com> wrote in message
news:1124394529.6...@f14g2000cwb.googlegroups.com...

AK

unread,

Aug 18, 2005, 4:22:39 PM8/18/05

to

>> > storage may not be much, but you also need to write queries that
> retrieve that extra 2 bytes. It's not just disk space; it's also
> memory and CPU.

How much performance difference does it really make?
<<

beleive me or not, in some cases it's 100% or more

AK

unread,

Aug 18, 2005, 4:24:04 PM8/18/05

to

>> I said none of the top three have it. ;-)
<<

then which ones do you call top 3 and what's your criteria

Mikito Harakiri

unread,

Aug 18, 2005, 4:28:24 PM8/18/05

to

Do you have a bencmark to prove it?

Hugo Kornelis

unread,

Aug 18, 2005, 4:50:33 PM8/18/05

to

On 18 Aug 2005 13:17:37 -0700, Mikito Harakiri wrote:

>Stu wrote:
>> Not that I totally disagree with the concept of deprecating char vs
>> varchar, but I see this argument used a lot, and there's two flaws with
>> it.
>>
>> 1. Storage size may be cheap and plentiful, but performance should
>> always be foremost in the DBA mind. 2 bytes in a single column pf
>> storage may not be much, but you also need to write queries that
>> retrieve that extra 2 bytes. It's not just disk space; it's also
>> memory and CPU.
>
>How much performance difference does it really make? Keep in mind that
>in a typical table you can declare char a couple of boolean (Y/N)
>columns at most.

Hi Mikito,

The extra performance cost of declaring Y/N columns varchar instead of
char is trivial compared to the extra cost of doing so for foreign key
columns.

Most databases have lots of so-called "lookup tables". In the main
table, the state is stored as a 2-letter code, country as a 3-letter ISO
code and currency as another 3-letter ISO code. These codes are foreign
keys into the States, Countries and Currencies tables, that have the
primary key 2- or 3-letter code, a unique long name/description and
possibly some other columns as well.

All these 2- and 3-letter codes will be in indexes, and these indexes
will be heavily used during inserts and updates (to verify the foreign
key constraint) and in queries (because users generally prefer to see
the full name of the state/country/currency instead of the code).

Using varchar for a 2-letter code means that the space taken is doubled.
This halves the number of rows that fit on one leaf page of the index.
The result will be: more logical reads, lower cache hit ration, more
physical reads --> slower performance.

>> 2. Failing to appreciate the differences between varchar and char sets
>> us on a path of lazy design. If it doesn't matter if I use char(10) vs
>> varchar(10), then what's the matter with varchar(50)? How about
>> varchar(51) etc? Eventually we could get away with "oh screw it; disks
>> are cheap, memory is cheap, CPU's are powerful, let's just put
>> everything in an text column and parse it on the fly".
>
>Failing to appreciate the differences between varchar and char sets is
>an ability to raise the level of abstraction. (BTW, the skill many DBAs
>lack.)

It's not a prime skill for DBAs. The abstraction level where the actual
data type is irrelevant is where you'll find the information analyst,
functional designer, conceptual modeler or whatever the current name of
those people is. It's not the DBA's task to check if they did their job
properly. The DBA should ensure that the database runs smoothly, that as
blocking and deadlocks are minimized and that queries perform as fast as
possible. Especially the latter is impossible to do without appreciating
the difference between char, nchar, varchar, nvarchar, text, and ntext.

> With low level of abstraction you would never quit chaising
>perceived problems (eg. trying to figure out the "optimal" block size,

No SQL Server DBA will ever try tio figure out a block size. (Remember
that this thread started in a SQL Server group and was crossposted to
.theory later!)

>pondering if
>
>select * from table
>
>is faster than
>
>select col1, col2, ... from table

No need to ponder that -- all SQL Server DBAs (and presumably all DBAs
for all serious RDBMS's) know that SELECT * should never be used in
production code (except in a EXISTS(..) subquery).

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)

VC

unread,

Aug 18, 2005, 5:14:08 PM8/18/05

to

"JT" <som...@microsoft.com> wrote in message
news:Omzl8GDp...@TK2MSFTNGP15.phx.gbl...

>I said none of the top three have it. ;-)
>

Then they are not the top three but impostors ;)

Mikito Harakiri

unread,

Aug 18, 2005, 5:17:54 PM8/18/05

to

Now, we are talking! Verifying the foreign key constraint requires a
unique index scan. If the index grows in size by factor of two, the
number of levels might go up by 1, but typically would stay the same.
In short, verifying a foreign key constraint would be 2 or 3 logical
reads, in either case.

Normally, upper levels of index are cached. I guess for country codes,
the whole index is cached, so you are correct about country code PK
index taking twice the space. Let's see 200 countries multipled by 2
bytes. 800 bytes versus 400 bytes!

> >> 2. Failing to appreciate the differences between varchar and char sets
> >> us on a path of lazy design. If it doesn't matter if I use char(10) vs
> >> varchar(10), then what's the matter with varchar(50)? How about
> >> varchar(51) etc? Eventually we could get away with "oh screw it; disks
> >> are cheap, memory is cheap, CPU's are powerful, let's just put
> >> everything in an text column and parse it on the fly".
> >
> >Failing to appreciate the differences between varchar and char sets is
> >an ability to raise the level of abstraction. (BTW, the skill many DBAs
> >lack.)
>
> It's not a prime skill for DBAs. The abstraction level where the actual
> data type is irrelevant is where you'll find the information analyst,
> functional designer, conceptual modeler or whatever the current name of
> those people is. It's not the DBA's task to check if they did their job
> properly. The DBA should ensure that the database runs smoothly, that as
> blocking and deadlocks are minimized and that queries perform as fast as
> possible. Especially the latter is impossible to do without appreciating
> the difference between char, nchar, varchar, nvarchar, text, and ntext.

Well, operating a machine indeed doesn't require abstract thinking. I
never said it does. I'm just implying that without critical analysis of
what you are doing, you would never be able to quit fighting the mess
of real world.

BTW, in my (admittedly very limited) SQL tuning experience, I have yet
to see the case there the size of a datatype make any difference.

Mikito Harakiri

unread,

Aug 18, 2005, 5:24:19 PM8/18/05

to

Could you please give a hint to a humor impaired person what this laugh
is about?

Hugo Kornelis

unread,

Aug 18, 2005, 5:50:01 PM8/18/05

to

On 18 Aug 2005 14:17:54 -0700, Mikito Harakiri wrote:

(snip quoteback)

>Now, we are talking! Verifying the foreign key constraint requires a
>unique index scan. If the index grows in size by factor of two, the
>number of levels might go up by 1, but typically would stay the same.
>In short, verifying a foreign key constraint would be 2 or 3 logical
>reads, in either case.
>
>Normally, upper levels of index are cached. I guess for country codes,
>the whole index is cached, so you are correct about country code PK
>index taking twice the space. Let's see 200 countries multipled by 2
>bytes. 800 bytes versus 400 bytes!

Hi Mikito,

Ah, cynicism - I love it.

I chose these examples because they are well known by everyone. My bad;
I should have chosen more convincing (though less well-known) examples.

Many companies use a short mnemonic code for their customers. The use of
a short mnemonic code for products is not uncommon either. Imagine a
customers table with 10,000 customers, a products table with 5,000
products and an orders table with a few million rows. Now would you
prefer char(6) or varchar(6) for CustomerCode? And char(5) or varchar(5)
for ProductCode?

Another example: ticker symbols. Take a look at http://finance.yahoo.com
To be able to generate all the graphs they offer, they have to have
quite a few rows of historic quotes in their DB. How would you rate the
overhead of char(5) vs varchar(5) for ticker symbol in a row that only
has three columns: ticker, date/time and quote.

jxstern

unread,

Aug 18, 2005, 5:52:55 PM8/18/05

to

On 18 Aug 2005 11:23:34 -0700, "Razvan Socol" <rso...@gmail.com>
wrote:

I've had my ear bent about some more putative performance advantages
to char() fields, in that SQLServer can find the n'th data field more
efficiently, so that you should put the char() fields to the front of
the record and the varchar()'s to the end.

No idea if it's true or significant. Hope it's not too significant,
since I've never actually paid any attention to it!

J.

Mikito Harakiri

unread,

Aug 18, 2005, 6:02:56 PM8/18/05

to

Hugo Kornelis wrote:
> Many companies use a short mnemonic code for their customers. The use of
> a short mnemonic code for products is not uncommon either. Imagine a
> customers table with 10,000 customers, a products table with 5,000
> products and an orders table with a few million rows. Now would you
> prefer char(6) or varchar(6) for CustomerCode? And char(5) or varchar(5)
> for ProductCode?

But if lookup table cardinality goes up, then, the storage factor
char(n) vs varchar(n) goes down! Perhaps, you can convince me that the
effect of the two trailing bytes is not miniscule, as I previously
thought, but it just can't be significant. Any benchmark demonstrating
that the performance degradation is not in single percentage digits is
welcome.

Colin Dawson

unread,

Aug 18, 2005, 6:11:49 PM8/18/05

to

"Mikito Harakiri" <mikharaki...@yahoo.com> wrote in message
news:1124385453....@f14g2000cwb.googlegroups.com...

> Another source for quote of the week.
>
> Although the author is unfair emphasizing that DBA has to be aware char
> vs varchar2 difference. The character string types in SQL are just
> plain silly. For how long ordinary programming languages have [a
> single] type String [with unlimited bounds] already? Also when did you
> design a database schema with char datatype last time? In a word, there
> is not a single advantage of char over varchar2. Therefore, char is not
> even worth mentioned, and should just be deprecated: there are to many
> really important things on DBA plate.
>

Last time I used a Char column? erm, how about today. They're ideal to
storing either single characters where a VarChar is completely pointless.
Also in longer columns when the data will always be a fixed length.

Regards

Colin Dawson
DBA ;-)
www.cjdawson.com

Mikito Harakiri

unread,

Aug 18, 2005, 6:13:28 PM8/18/05

to

My bad. I meant

select count(1) from table

vs.

select count(*) from table

BTW, you triggered the other example: is EXISTS or IN faster? This
question could come up only from somebody who is completely unaware of
SQL expression equivalency and query rewrite. Well, making sure the
extents and segments are layed out on disk properly, leaves little room
for education and abstract thinking.

-CELKO-

unread,

Aug 18, 2005, 6:14:45 PM8/18/05

to

>> Any benchmark demonstrating that the performance degradation is not in single percentage digits is welcome. <<

I am more worried about various numbers of extra blanks in the
VARCHAR(n) making display problems. Trimming and padding are costly.

Colin Dawson

unread,

Aug 18, 2005, 6:25:28 PM8/18/05

to

"Mikito Harakiri" <mikharaki...@yahoo.com> wrote in message

news:1124402576.4...@o13g2000cwo.googlegroups.com...

Surely, when developing a database application and performance tuning it
sqeezing every single performance point out of the database is worthwhile.
Everytime that you make a decision which costs a percent of two of
performance you degrade your database performance. Given enough of these
compomises the database performance will drop significatly, I've seen this
happen for real on databases with hundreds of thousands of rows, the
performance degraded so much that the application became unusable. A few
minor tweeks here and there and the program was running faster than it had
ever run before.

It may seem that it's a very academic kind of argument but in real terms the
more pedantic and performance oriented the DBA is the better the final
product will function.

Yes, you can always throw more kit at a problem, but when you can litterally
flick a switch or two and boost the performance it's worth doing. To do
otherwise could be contrude as neglegence!

Regards

Colin Dawson
www.cjdawson.com

Stu

unread,

Aug 18, 2005, 6:31:39 PM8/18/05

to

Let me try to be more clear; my objection to your original statement
(paraphrased as "we've got lot's of space; why should a couple of bytes
matter?") is grounded NOT in practical matters, but rather issues of
design theory and discipline. I would agree that in terms of
performance you'd be hard pressed to find differences between char and
varchar for 90% of use cases that you can come up with. Char is useful
in other ways besides just performance (e.g., validation, but that too
can be emulated by other means); however, I'll concede (as I did in my
original post) that distinguishing between char and varchar IN MOST
CASES is moot in terms of performance.

However, my point is this; if you disregard the differences here
because of substantial resources, why bother with any sort of datatype
at all? If a varchar(50) will do the trick of a char(10), what's 42
extra bytes? In the days of terabytes sized RAID arrays, why not use a
varchar (8000) for everything? Why use an integer if a varchar will
do? Obviously at some point, there will be a performance impact; why
not be disciplined enough to strike at the lowest level of that curve
and use the least expensive resource that is the best fit for your data?

Mikito Harakiri

unread,

Aug 18, 2005, 6:50:39 PM8/18/05

to

Stu wrote:
> if you disregard the differences here
> because of substantial resources, why bother with any sort of datatype
> at all? If a varchar(50) will do the trick of a char(10), what's 42
> extra bytes? In the days of terabytes sized RAID arrays, why not use a
> varchar (8000) for everything? Why use an integer if a varchar will
> do? Obviously at some point, there will be a performance impact; why
> not be disciplined enough to strike at the lowest level of that curve
> and use the least expensive resource that is the best fit for your data?

Humans are notoriously bad at storage management. Storage layout is
something that has to be hidden and managed automatically. When did you
specify memory parameters for your web browser program (Let see: 10M
for fonts, 15M for web pages cache, 5M for cookies, or, wait a minute,
maybe 2M for cookies would make my browser faster?)

I want a string datatype, and can't possibly predict (and don't really
care) how long it would be. Is it so difficult to design a RDBMS engine
that would allow me to declare a string of arbitrary length with some
rudimentary intelligence as far as storage is concerned?

Stu

unread,

Aug 18, 2005, 6:56:55 PM8/18/05

to

Mikito Harakiri wrote:

> Humans are notoriously bad at storage management. Storage layout is
> something that has to be hidden and managed automatically. When did you
> specify memory parameters for your web browser program (Let see: 10M
> for fonts, 15M for web pages cache, 5M for cookies, or, wait a minute,
> maybe 2M for cookies would make my browser faster?)
>

Apples to oranges; I can live with the default settings of my web
browser (with a few minor tweaks :)), but there's not several other
people all trying to run off my web browser at the same time I am.
This has nothing to do with design, and lots to do with end-user
performance.

> I want a string datatype, and can't possibly predict (and don't really
> care) how long it would be. Is it so difficult to design a RDBMS engine
> that would allow me to declare a string of arbitrary length with some
> rudimentary intelligence as far as storage is concerned?

Not difficult at all; however, designing one that performs well, that's
a different story.

Stu

unread,

Aug 18, 2005, 6:57:29 PM8/18/05

to

Mikito Harakiri wrote:

> Humans are notoriously bad at storage management. Storage layout is
> something that has to be hidden and managed automatically. When did you
> specify memory parameters for your web browser program (Let see: 10M
> for fonts, 15M for web pages cache, 5M for cookies, or, wait a minute,
> maybe 2M for cookies would make my browser faster?)
>

Apples to oranges; I can live with the default settings of my web
browser (with a few minor tweaks :)), but there's not several other
people all trying to run off my web browser at the same time I am.
This has nothing to do with design, and lots to do with end-user
performance.

> I want a string datatype, and can't possibly predict (and don't really

> care) how long it would be. Is it so difficult to design a RDBMS engine
> that would allow me to declare a string of arbitrary length with some
> rudimentary intelligence as far as storage is concerned?

Not difficult at all; however, designing one that performs well, that's
a different story.

Colin Dawson

unread,

Aug 18, 2005, 7:03:14 PM8/18/05

to

"Mikito Harakiri" <mikharaki...@yahoo.com> wrote in message

news:1124405438.9...@g47g2000cwa.googlegroups.com...

For that instance use a VarChar(8000) or even a Text column. When you do
know the structure of the data that you'll be storing, specify it. IMO,
the root of assigning datatypes is partly for performance and partly for
clarity. If you're able to specify that a UK Postcode is at most 8
characters long, then do so. If it's possible that your column could
contain the text from war and peace, make is capable of doing so.

The imporant thing here is that you have the choice. By making the smarter
decision means that there's a possibility that you'll be rewarded with a
slight database performance boost. As I said in another post, making lots
of smart decisions will add up and the overall performance will be
significant.

Regards

Colin Dawson
www.cjdawson.com

rchr...@patmedia.net

unread,

Aug 18, 2005, 10:07:31 PM8/18/05

to

Hope you don't mind, but I have some fuzzy thoughts on datatypes, smart
decisions, column lengths, and maybe the ghost of Y2K. How many DBAs
use number datatypes to hold Social Security Numbers? How many use
varchar(9) or char(9)? I always use the latter because the guru once
told me never to use a numeric datatype if I wasn't going to do the
math. Now, I wonder. There are about 270,000,000 people in the US of A,
each with their own unique SSN. Say another 50,000,000 to 100,000,000
IDs are already accounted for for one reason or another. 10 to the 9th
is only 1 billion, so nearly half the possible SSNs are used up
already. Folks, I'm closer to retirement age than whipper-snapper age
but I believe I could live to see us run out of SSNs unless the Feds do
something DRASTIC, like add a digit. Then what are we going to do?
Those who ignored the guru are probably sitting pretty. A BIGINT has
plenty of room to grow. All those char and varchar tables, however, are
going to need rewriting. Doesn't that sound familiar?

Stu

unread,

Aug 18, 2005, 10:38:31 PM8/18/05

to

two words: leading zeros. Most of the people born in the Northeast
have an SSN that starts with 0.

http://www.socialsecurity.gov/foia/stateweb.html

Nice, but no cigar.

Stu

Brian Selzer

unread,

Aug 18, 2005, 10:49:20 PM8/18/05

to

It's even worse if they're used in PK/FK relationships--in other words,
spread throughout the database. Another argument against natural primary
keys.... On the other hand, I really shouldn't complain: as a consultant, I
will be raking in the dough when the change finally happens. Maybe that's
why Joe Celko's so set against artificial keys :)

<rchr...@patmedia.net> wrote in message
news:1124417251.1...@o13g2000cwo.googlegroups.com...

Tibor Karaszi

unread,

Aug 19, 2005, 3:43:51 AM8/19/05

to

> , so that you should put the char() fields to the front of
> the record and the varchar()'s to the end.

SQL Server does this for you :-). If you study the row layout in SLQ server, you have all the fixed
columns, in order, in the beginning of the row. Then comes some fluff like a null block etc. Then
you have a variable column offset table, which has pointers to the last section of the row, the
actual variable column data block. In the variable column offset block, you first have a 2 byte
section describing at which position the first variable column's data ends, etc.

One can argue that this also has performance impact (handling the variable length stuff compared to
the simple fixed length algorithms). But as you know, we rarely measure CPU cycles in SQL Server
(like you probably do in a C compiler).

I'm surprised why this discussion haven't focused on the data? If it is fixed length data, use char!
Like SSN (I would imagine the counterpart in Sweden is fixed length). If it isn't fixed length
(firstname, lastname), use varchar.

--
Tibor Karaszi, SQL Server MVP
http://www.karaszi.com/sqlserver/default.asp
http://www.solidqualitylearning.com/
Blog: http://solidqualitylearning.com/blogs/tibor/

"jxstern" <jxs...@nowhere.xyz> wrote in message news:em0ag1p2v0mcc3726...@4ax.com...

Mike Hodgson

unread,

Aug 19, 2005, 3:53:56 AM8/19/05

to

Mikito Harakiri wrote:

Hugo Kornelis wrote:

On 18 Aug 2005 13:17:37 -0700, Mikito Harakiri wrote:

pondering if

select * from table

is faster than

select col1, col2, ... from table

No need to ponder that -- all SQL Server DBAs (and presumably all DBAs
for all serious RDBMS's) know that SELECT * should never be used in
production code (except in a EXISTS(..) subquery).

My bad. I meant

select count(1) from table

vs.

select count(*) from table

"select count(1) from table" and "select count(*) from table" will both come up with the same execution plan. They will both count the number of entries in the narrowest index on that table. That's, perhaps, a bad example as the optimiser is specifically designed to deal with that case I believe. See example below (both cases require 227 logical reads and return 121371 rows in this example): StmtText ------------------------------------------- select count(1) from dbo.SalesOrderDetail (1 row(s) affected) StmtText ------------------------------------------------------------------------------------------------------------------ |--Compute Scalar(DEFINE:([Expr1002]=Convert([Expr1003]))) |--Stream Aggregate(DEFINE:([Expr1003]=Count(*))) |--Index Scan(OBJECT:([AdventureWorks2000].[dbo].[SalesOrderDetail].[IX_SalesOrderDetail_ProductID])) (3 row(s) affected) StmtText ----------------------------------------------- select count(*) from dbo.SalesOrderDetail (1 row(s) affected) StmtText ------------------------------------------------------------------------------------------------------------------ |--Compute Scalar(DEFINE:([Expr1002]=Convert([Expr1003]))) |--Stream Aggregate(DEFINE:([Expr1003]=Count(*))) |--Index Scan(OBJECT:([AdventureWorks2000].[dbo].[SalesOrderDetail].[IX_SalesOrderDetail_ProductID])) (3 row(s) affected)

BTW, you triggered the other example: is EXISTS or IN faster? This
question could come up only from somebody who is completely unaware of
SQL expression equivalency and query rewrite. Well, making sure the
extents and segments are layed out on disk properly, leaves little room
for education and abstract thinking.

The EXISTS() predicate is typically a fairly efficient predicate because it only needs to scan until it gets a match, at which time it returns. The worst case scenario (it finds a match on the last physical row, or it doesn't find any matching row) is the same I/O as the IN() predicate case because IN() will evaluate the entire subquery.

--
mike hodgson
blog: http://sqlnerd.blogspot.com

Roy Hann

unread,

Aug 19, 2005, 6:03:42 AM8/19/05

to

"AK" <AK_TIRE...@hotmail.COM> wrote in message
news:1124396559....@g43g2000cwa.googlegroups.com...

I simply don't believe you. I don't believe the difference can be 100%. I
certainly don't believe it can be more. And I don't believe you can even
measure what difference there is. And even if I did believe all that, I
don't believe it is anything but a bug in your software.

So I 100% don't believe you. However, I am willing to be persuaded by
proof.

Roy

Paul

unread,

Aug 19, 2005, 6:16:51 AM8/19/05

to

I guess the char() is really just being used as a hint to the DBMS so it
can make an informed decision of what physical data structures to use.

Maybe a better solution would be to have a single type ("string" or
whatever) but then have a check constraint like len(column) < 6.

So you could regard the char datatype as shorthand for a varchar
datatype with a check constraint. The two ways of looking at it are
functionally identical. I think in practice certain DBMSs will
internally store a varchar(n) as a char(n) for sufficiently low values
of n anyway.

In theory, check constraints should be helpful to the DBMS, both for
queries and for deciding physical storage structures.

So there are three possibilities:
1) have char(n) and varchar(n) datatypes
2) scrap char(n) and just have varchar(n). The DBMS can decide to use
different internal storage methods if n is small enough.
3) have varchar with no maximum length specified. Now if you forget to
specify a maximum length via a check constraint, the DBMS isn't able to
optimize things so well.

You might have a similar argument with tinyint, smallint, int, bigint types.

tinyint is really just an int column with a check constaint of
0 <= column <= 255

The question is: should the length constraint be part of the type or
part of the database? Does it matter even?

Paul.

Paul

unread,

Aug 19, 2005, 6:20:31 AM8/19/05

to

Mike Hodgson wrote:
> The EXISTS() predicate is typically a fairly efficient predicate because
> it only needs to scan until it gets a match, at which time it returns.
> The worst case scenario (it finds a match on the last physical row, or
> it doesn't find any matching row) is the same I/O as the IN() predicate
> case because IN() will evaluate the entire subquery.

Why does IN() need to evaluate the entire subquery? Couldn't it in
theory work exactly the same as EXISTS() at the physical level?

Paul.

Tony Andrews

unread,

Aug 19, 2005, 7:25:17 AM8/19/05

to

Yes it could, and indeed does (Oracle 9i). In the following example,
IN and EXISTS are processed the same way, and DO NOT evaluate the
entire subquery:

SQL> create table t1 as select object_id, object_name from all_objects;

Table created.

SQL> alter table t1 add constraint t1_pk primary key (object_id);

Table altered.

SQL> create table t2 as select object_id, object_name from all_objects
where rownum=1;

Table created.

SQL> alter table t2 add constraint t2_pk primary key (object_id);

Table altered.

SQL> analyze table t1 compute statistics;

Table analyzed.

SQL> analyze table t2 compute statistics;

Table analyzed.

SQL> select count(*) from t1;

COUNT(*)
----------
47355

SQL> select count(*) from t2;

COUNT(*)
----------
1

SQL> set autotrace on
SQL> select * from t2 where object_id in (select object_id from t1);

OBJECT_ID OBJECT_NAME
---------- ------------------------------
18164 /1005bd30_LnkdConstant

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=1 Card=1 Bytes=30)
1 0 NESTED LOOPS (Cost=1 Card=1 Bytes=30)
2 1 TABLE ACCESS (FULL) OF 'T2' (Cost=1 Card=1 Bytes=26)
3 1 INDEX (UNIQUE SCAN) OF 'T1_PK' (UNIQUE)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
5 consistent gets
3 physical reads
0 redo size
227 bytes sent via SQL*Net to client
314 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL> select * from t2 where exists (select null from t1 where
t1.object_id = t2.object_id);

OBJECT_ID OBJECT_NAME
---------- ------------------------------
18164 /1005bd30_LnkdConstant

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=1 Card=1 Bytes=26)
1 0 FILTER
2 1 TABLE ACCESS (FULL) OF 'T2' (Cost=1 Card=1 Bytes=26)
3 1 INDEX (UNIQUE SCAN) OF 'T1_PK' (UNIQUE) (Cost=1 Card=1 B
ytes=4)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
5 consistent gets
0 physical reads
0 redo size
227 bytes sent via SQL*Net to client
314 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

Mike Hodgson

unread,

Aug 19, 2005, 8:44:13 AM8/19/05

to

Perhaps it's an implementation detail of Oracle then (I'm not very familiar with how Oracle performs the various operations). What would happen if you had slightly more realistic data (like both tables with more than 1 row in them)? For example, say you had: InsurancePolicies {policy_id, broker_id, inception_date, ...} - (80000 rows) clustered index on policy_id (don't know what Oracle parlance for "clustered index" is) InsuranceBrokers {broker_id, broker_name, ...} - (200 rows) clustered index on broker_id then you said: select * from InsuranceBrokers b where broker_id in ( select distinct broker_id from InsurancePolicies p where p.inception_date > '20050101' ) select * from InsuranceBrokers b where exists ( select * from InsurancePolicies p where p.broker_id = b.broker_id and p.inception_date > '20050101' ) Both queries should return the same data, namely all the brokers who own 1 or more policies that started this year. Now, with the IN() query, if the physical data in InsurancePolicies is sorted by policy_id, then how does the query engine know it's got all of the policies that started this year unless it goes through every single row in InsurancePolicies? With the EXISTS() version, as soon as the query engine finds that the broker in question owns a single policy that started this year it would stop trawling through the 80000 row policy table. Best case scenario for EXISTS(), the first policy row for that broker started this year so that broker is included in the result set (scan 1 row out of 80000); worst case scenario, the only policy the broker owns that started this year was created yesterday (and so has the greatest policy_id and so is last in the physical order of rows in the table - ie. full index scan; scan 80000 rows out of 80000). For all cases for IN() the query engine needs to go through every policy row that (that started this year) to compile the distinct list to present back to the outer query - i.e. full index scan. Perhaps Oracle have done some particular optimisations in that area, but I believe that's the way Microsoft deal with it. Bit of a dumb example really because an inner join would be the best way to write that query anyway (well it would in SQL Server - I assume the same would hold true for Oracle) but it's the simplest example my poor tired brain would come up with at 10:30 on a Friday night.

--
mike hodgson
blog: http://sqlnerd.blogspot.com

Paul

unread,

Aug 19, 2005, 9:27:16 AM8/19/05

to

Mike Hodgson wrote:
> select * from InsuranceBrokers b
> where broker_id in
> (
> select distinct broker_id from InsurancePolicies p
> where p.inception_date > '20050101'
> )

...

> Best case scenario for EXISTS(), the first policy row for that broker
> started this year so that broker is included in the result set (scan 1
> row out of 80000); worst case scenario, the only policy the broker owns
> that started this year was created yesterday (and so has the greatest
> policy_id and so is last in the physical order of rows in the table -
> ie. full index scan; scan 80000 rows out of 80000). For all cases for
> IN() the query engine needs to go through every policy row that (that
> started this year) to compile the distinct list to present back to the
> outer query - i.e. full index scan.

The optimizer should be able to recognise that the distinct is
unnecessary, so it wouldn't need to go through the whole table.

Just because the subquery uses "distinct" it doesn't mean that the DBMS
must materialise that internally if it's not necessary. Though I guess
index statistics may cause it to choose this option if it thinks it
would be faster.

If you think about it, both queries are logically identical so it would
be possible (at least in theory) for them to use identical query plans,
if the query optimizer is clever enough.

Paul.

Tony Andrews

unread,

Aug 19, 2005, 10:16:17 AM8/19/05

to

Here is the best simulation of your example I can come up with at the
moment. It shows that Oracle is indeed choosing different plans this
time, though the elapsed time is similar either way (EXISTS slightly
faster, 0.3 secs rather than 0.4)

>From the Oracle docs, it appears that "where x in (select PKCOL from
y)" is a special case, that can be optimized into a join.

SQL> select count(*) from insurance_brokers;

COUNT(*)
----------
67

SQL> select count(*), count(distinct broker_id) from
insurance_policies;

COUNT(*) COUNT(DISTINCTBROKER_ID)
---------- ------------------------
30881 30

SQL> select * from Insurance_Brokers b
2 where broker_id in
3 (
4 select distinct broker_id from Insurance_Policies p
5 where p.inception_date > date '2005-01-01'
6 );

BROKER_ID BROKER_NAME
---------- ------------------------------
0 SYS
5 SYSTEM
11 OUTLN
18 DBSNMP
20 WMSYS
41 AURORA$JIS$UTILITY$
29 ORDSYS
30 ORDPLUGINS
31 MDSYS
32 CTXSYS
34 XDB
38 FSC
39 RB
42 OSE$HTTP$ADMIN
56 PORTAL30
121 FLOWS_010600
120 FLOWS_FILES
58 PORTAL30_SSO
60 PORTAL30_SSO_PS
61 PORTAL30_DEMO
62 SCOTT
118 AFOSTER
71 WORKFLOW
122 SX3COM
128 TOAD
132 BMS1
134 DESDIR
141 JREED
145 NARROW
146 ARROW

30 rows selected.

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=94 Card=30 Bytes=720
)

1 0 HASH JOIN (Cost=94 Card=30 Bytes=720)
2 1 VIEW OF 'VW_NSO_1' (Cost=92 Card=30 Bytes=390)
3 2 SORT (UNIQUE) (Cost=92 Card=30 Bytes=270)
4 3 TABLE ACCESS (FULL) OF 'INSURANCE_POLICIES' (Cost=9
Card=30881 Bytes=277929)

5 1 TABLE ACCESS (FULL) OF 'INSURANCE_BROKERS' (Cost=1 Card=
67 Bytes=737)

SQL> select * from Insurance_Brokers b
2 where exists
3 (
4 select * from Insurance_Policies p
5 where p.broker_id = b.broker_id
6 and p.inception_date > date '2005-01-01'
7 );

BROKER_ID BROKER_NAME
---------- ------------------------------
0 SYS
5 SYSTEM
11 OUTLN
18 DBSNMP
20 WMSYS
41 AURORA$JIS$UTILITY$
29 ORDSYS
30 ORDPLUGINS
31 MDSYS
32 CTXSYS
34 XDB
38 FSC
39 RB
42 OSE$HTTP$ADMIN
56 PORTAL30
121 FLOWS_010600
120 FLOWS_FILES
58 PORTAL30_SSO
60 PORTAL30_SSO_PS
61 PORTAL30_DEMO
62 SCOTT
118 AFOSTER
71 WORKFLOW
122 SX3COM
128 TOAD
132 BMS1
134 DESDIR
141 JREED
145 NARROW
146 ARROW

30 rows selected.

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=1 Card=3 Bytes=33)
1 0 FILTER
2 1 TABLE ACCESS (FULL) OF 'INSURANCE_BROKERS' (Cost=1 Card=
3 Bytes=33)

3 1 TABLE ACCESS (BY INDEX ROWID) OF 'INSURANCE_POLICIES' (C
ost=6 Card=1029 Bytes=9261)

4 3 INDEX (RANGE SCAN) OF 'IP_IB' (NON-UNIQUE) (Cost=2 Car
d=1029)

JT

unread,

Aug 19, 2005, 10:36:54 AM8/19/05

to

Well, the thread started out as a recollection of humorous responses given
by people interviewing for a DBA job position who clearly had little
eperience as a DBA. Then you showed up with your pointy headed rumenations
on database theory, and it's like something from Dilbert cartoon.

"Mikito Harakiri" <mikharaki...@yahoo.com> wrote in message

news:1124400259.3...@g49g2000cwa.googlegroups.com...

JT

unread,

Aug 19, 2005, 10:49:59 AM8/19/05

to

Perhaps I'm misreading your point, but it's data in a Char column that
typically contains trailing blanks that need a Trim. A VarChar won't contain
extraneous blanks unless the developer explicitly includes them in the
value.

"-CELKO-" <jcel...@earthlink.net> wrote in message
news:1124403285.8...@g43g2000cwa.googlegroups.com...

JT

unread,

Aug 19, 2005, 10:57:49 AM8/19/05

to

There is a TEXT datatype that supports up to 2 GB. However, such free form
datatypes require more meta data overhead, and it's bad design practice to
store blobs of text in a relational database.

"Mikito Harakiri" <mikharaki...@yahoo.com> wrote in message

news:1124405438.9...@g47g2000cwa.googlegroups.com...

JT

unread,

Aug 19, 2005, 11:07:28 AM8/19/05

to

And on top of that, specific ranges of the SSN coding scheme are reserved
for foreign nationals with working visas. As far as primary keys go, I would
trust SQL Server to generate a unique identity before I would trust the Fed
to assign a unique SSN. For person tables, just stick SSN in a varchar(20)
as an extraneous attribute and use an 4 byte integer identity for foreign
key relationships.

<rchr...@patmedia.net> wrote in message
news:1124417251.1...@o13g2000cwo.googlegroups.com...

--CELKO--

unread,

Aug 19, 2005, 11:30:35 AM8/19/05

to

>> just stick SSN in a varchar(20) as an extraneous attribute and use an 4 byte integer identity for foreign key relationships. <<

And sure enough, you will get a 20 letter SSN in your database. And it
will be a duplicate on top of that.

A good DB designer would know to use CHAR(9) and how to write at least
a minimal constraint to insure some data integrity. A good DB designer
would know how to verify SSNs via several commercial and puiblic
services.

But a bad DB designer does not have to bother with all that research
and work. He has his proprietary, magical, universal,
one-size-fits-all-creation IDENTITY column. Let's let him tell the IRS
that SSNs are an "extraneous attribute" while they haul himoff to jail
and close us down.

AK

unread,

Aug 19, 2005, 11:32:56 AM8/19/05

to

>>
Those who ignored the guru are probably sitting pretty. A BIGINT has
plenty of room to grow. All those char and varchar tables, however, are
going to need rewriting. Doesn't that sound familiar?
<<

it does.
We have considered the guru's advice and found it very impractical in
our situation.
More to that: starting several years ago, more and more of
our corporate clients are demanding that we wipe out their employee's
SS#s in our database. , so we just run

update ... set ssn=0 where ...

and that is that, the business keeps running. Note that some privacy
advocates are suggesting a law forbidding insurance companies from
storing their clients SSNs. Should it happen, we shat just drop an
index and run an update without where clause.

--CELKO--

unread,

Aug 19, 2005, 11:49:18 AM8/19/05

to

>> but I believe I could live to see us run out of SSNs unless the Feds do something DRASTIC, like add a digit. <<

Evolutionary, not drastic. An extra digit in the serial section of the
number would do it and not effect the validation by group number. The
nice thing aobut industry standards is that they have a migration path
when they change. Ever what ahppens when an IDENTITY runs out and
wraps? I was surprised at how often it can happen -- a bad routine
keeps failing in a loop, increment the exposed counter and makes huge
gaps. Since there is no audit trail on IDENTITY numbers, nobody knows
until you ge to a failure point.

Look at how the UPC codes are moving from 10 to 13 digits. See any
retail stores closing down because of it? Is P&G out of business?
Hallmark cards alone has over 10 million products to track and they
have used the industry standard barcodes for decades.

Talk about being old, do you remember "The Anti-Digit Dialing League"
that got national press when we went from exchange abbreviations to
all digit dialing? When we went to ten digit dialing in major metro
areas, nobody said a word.

Or the various Anti-ZIP code groups? But when we went to ZIP+4, nobody
said a word.

Dan

unread,

Aug 19, 2005, 11:58:56 AM8/19/05

to

The beauty of a RDMS is that one can change VARCHAR(9) to VARCHAR(10)
and it is no big deal. Client programs keep running with maybe light
display issues.

paul c

unread,

Aug 19, 2005, 12:27:46 PM8/19/05

to

--CELKO-- wrote:
> ...

> Talk about being old, do you remember "The Anti-Digit Dialing League"
> that got national press when we went from exchange abbreviations to
> all digit dialing? When we went to ten digit dialing in major metro
> areas, nobody said a word.

> ...

Heh, they were right and I did, but nobody listened, including the
telephone manufacturers. Talk about keys! If they had asked me, I would
have wanted area names *as well as* exchange names, say Vancouver Cedar
Cottage instead of the 604-708-xxxx that I have to remember. If I went
to Europe, I'd expect that there might be a conflict with the local
names, but I'd be expecting to have to make lots of other adjustments
too, like having to remember that in South Africa, directions will often
state that one must turn left at the next robot. I'll bet the other
'phone companies are pretty much as screwed up with their number 'keys'
as Bell Canada who for years couldn't send a single bill if one had two
residential 'phone numbers and maybe still can't for all I know.
(Apparently, they thought the telephone number *is* the customer!) The
only smart thing I ever saw them do was introduce 'invisible keys' to
the 'phone book, making it easier to lift and thumb through.

Thanks for listening to my rant!

p

AK

unread,

Aug 19, 2005, 12:33:57 PM8/19/05

to

>>
the Feds do something DRASTIC, like add a digit.
<<

the Feds may phase out using SSNs in many places, that would be
drastic.
They may let Canadians work here without getting an SSN, using their
SIN instead.
So on

JT

unread,

Aug 19, 2005, 12:32:53 PM8/19/05

to

"--CELKO--" <jcel...@earthlink.net> wrote in message
news:1124465435.1...@g14g2000cwa.googlegroups.com...

>>> just stick SSN in a varchar(20) as an extraneous attribute and use an 4
>>> byte integer identity for foreign key relationships. <<
>
> And sure enough, you will get a 20 letter SSN in your database. And it
> will be a duplicate on top of that.
>

I make good use of check constraints, especially on user entered VarChar
columns. The reason I suggested VarChar(20) is to support in future changes
the Fed may make to the SSN coding scheme.

> A good DB designer would know to use CHAR(9) and how to write at least
> a minimal constraint to insure some data integrity. A good DB designer
> would know how to verify SSNs via several commercial and puiblic
> services.
>

A good DBA would only be concerned with implementing a check constraint to
confirm that the SSN conforms to formatting specifications. If and when the
specication changes at some future date, the constaint can be modified as
needed, and the extended length provides growing room without additional
structural changes needed to existing data. It's the application developers
perogative to make whatever remote procedure calls are needed to confirm the
authenticity of the SSN number itself. Perhaps the developer would also want
to cross reference the customer's name with the supplied phone number, or
(if it's a flight scheduling application) even run their name against a
terrorist watch list, but that degree of data validation is beyond the scope
of database management.

> But a bad DB designer does not have to bother with all that research
> and work. He has his proprietary, magical, universal,
> one-size-fits-all-creation IDENTITY column.

The purpose of the identity column is to provide a non-changing single
integer column for use in foreign key relationships.

>Let's let him tell the IRS that SSNs are an "extraneous attribute" while
>they haul himoff to jail and close us down.
>

The Social Security numbering and allocation system was invented by the Fed
for the Fed's own accounting purposes. Over the years, it has been adopted
by other organizations, because it is the closest thing we have to a unique
identifier of US citizens. It provides a somewhat reliable method of looking
up a master detail record for a specific customer, but is not reliable
enough to be used as a foreign key. From the perspective of an eCommerce or
magazine subscription system, it is an extraneous attribute; no more
relevent at the application level than driver's license number or eye color.
Even if you store the SSN in an integer, it still would be an inappropriate
choice as a foreign key, because it is a confidential piece of customer
information, and it would be difficult to scope to what degree it propogates
throughout the enterprise. For compliance reasons, it is much better to
assign your customers a sequentialy generated integer as the primary key and
place a unique constraint on SSN.

Gene Wirchenko

unread,

Aug 19, 2005, 12:58:35 PM8/19/05

to

On 18 Aug 2005 19:38:31 -0700, "Stu" <stuart.a...@gmail.com>
wrote:

001-003 New Hampshire
004-007 Maine
008-009 Vermont
010-034 Massachusetts
035-039 Rhode Island
040-049 Connecticut
050-134 New York>

It looks rather likely to me.

Sincerely,

Gene Wirchenko

unread,

Aug 19, 2005, 12:58:35 PM8/19/05

to

On Fri, 19 Aug 2005 11:07:28 -0400, "JT" <som...@microsoft.com>
wrote:

>And on top of that, specific ranges of the SSN coding scheme are reserved
>for foreign nationals with working visas. As far as primary keys go, I would

I am a foreign national working in Washington state. My SSN
starts with 538 which according to
http://www.socialsecurity.gov/foia/stateweb.html is in the range for
Washington.

[snip]

Sincerely,

Gene Wirchenko

Mikito Harakiri

unread,

Aug 19, 2005, 1:01:00 PM8/19/05

to

JT wrote:
> There is a TEXT datatype that supports up to 2 GB. However, such free form
> datatypes require more meta data overhead, and it's bad design practice to
> store blobs of text in a relational database.

Think about it for a minute. You have varchar2 limited to 4000 bytes.
Then you have text to cover 4K to 2G range. Then you have to rely on
some other option to be able store data bigger than 2G (split it into
chunks in your application????).

If you suggested such a design to programmic language community, you
would be laughed at. Of course, you have a performance disclamer to
back you up, while in reality there is nothing that makes text datatype
inherently less performant than varchar2. In fact, those ugly length
limited datatypes are just artifacts of the early SQL days, when
Fortran was the most popular programming language, and datatypes with
dynamic memory managenent were yet to become mainstream.

mi...@sherrillshelton.net

unread,

Aug 19, 2005, 1:01:30 PM8/19/05

to

On Thu, 18 Aug 2005 22:50:33 +0200, Hugo Kornelis
<hugo@pe_NO_rFact.in_SPAM_fo> wrote:

[snip]
>The abstraction level where the actual
>data type is irrelevant is where you'll find the information analyst,
>functional designer, conceptual modeler or whatever the current name of
>those people is.
[snip]

The current name appears to be "Java programmer". <cough>

--
mike sherrill

Gene Wirchenko

unread,

Aug 19, 2005, 1:18:04 PM8/19/05

to

On Fri, 19 Aug 2005 12:32:53 -0400, "JT" <som...@microsoft.com>
wrote:

[snip]

>I make good use of check constraints, especially on user entered VarChar
>columns. The reason I suggested VarChar(20) is to support in future changes
>the Fed may make to the SSN coding scheme.

Why 20? Why not 30? Or 16?

[snip]

Sincerely,

Gene Wirchenko

AK

unread,

Aug 19, 2005, 1:22:41 PM8/19/05

to

>> You have varchar2 limited to 4000 bytes.
Then you have text to cover 4K to 2G range. Then you have to rely on
some other option to be able store data bigger than 2G (split it into
chunks in your application????).

If you suggested such a design to programmic language community, you
would be laughed at.
<<

oh really? then how comes C# has 8 types to store integers only, not
counting decimal type? Is it funny too?

>> In fact, those ugly length
limited datatypes are just artifacts of the early SQL days,
<<

on the contrary, it is a very powerful tool in enforsing data integrity

JT

unread,

Aug 19, 2005, 1:34:44 PM8/19/05

to

None of the data models I have designed included SSN number, so I havn't
participated in that round table debate to any degree of depth. However,
from what I have seen looking at systems that I have extended or maintained,
it seems that the method of implementing SSN has variations of VarChar to
Char. I've even seen cases where the length of SSN was inconsistent from one
table to another. Someone expressed concerns about the possibility of the
coding scheme changing, so I suggested a VarChar of 20. The reason I suggest
VarChar over BigInt is that the Fed could easily start adding alpha prefixes
or suffixes. Bad design considering it has been all numeric in the past, but
I would not put it past them. Doesn't matter that much about the actual
length; VarChar(20) with a 9 char code consumes the same disk storage as
VarChar(16). Conceptually, I would treat SSN in a manner similar to
LastName; it's just an identifier that has been 999999999 in the past, but
could change in the future.

"Gene Wirchenko" <ge...@ucantrade.com.NOTHERE> wrote in message
news:ku4cg1l8brqndtlr3...@4ax.com...

JT

unread,

Aug 19, 2005, 1:45:43 PM8/19/05

to

Once you start dealing with text of length > 4000 (Unicode) 8000 (ANSI),
you are practically working with documents not attributes, so this data
should be stored in a document management system with numeric or URL
pointers stored in the relational database. The TEXT datatype is there for
large blobs of data for those so inclined, but it's an entirely different
case usage than VarChar.
On the application side, if you are storing gigabytes or even 10s
megabytes of data in a string or array, then it is time to start
re-considering the application design.

"Mikito Harakiri" <mikharaki...@yahoo.com> wrote in message

news:1124470860.9...@f14g2000cwb.googlegroups.com...

Mikito Harakiri

unread,

Aug 19, 2005, 2:25:01 PM8/19/05

to

JT wrote:
> Once you start dealing with text of length > 4000 (Unicode) 8000 (ANSI),
> you are practically working with documents not attributes, so this data
> should be stored in a document management system with numeric or URL
> pointers stored in the relational database. The TEXT datatype is there for
> large blobs of data for those so inclined, but it's an entirely different
> case usage than VarChar.
> On the application side, if you are storing gigabytes or even 10s
> megabytes of data in a string or array, then it is time to start
> re-considering the application design.

Remarkable. You suggest that arbitrary limitations on datatype length
provide invaluable insight on application design weakness? So if my app
processes 1GB strings all over the places (think XML!), I'm fine, but
if I get 3GB string, then I should redesign my application?

Gene Wirchenko

unread,

Aug 19, 2005, 3:00:15 PM8/19/05

to

On 19 Aug 2005 10:22:41 -0700, "AK" <AK_TIRE...@hotmail.COM>
wrote:

>>> You have varchar2 limited to 4000 bytes.
>Then you have text to cover 4K to 2G range. Then you have to rely on
>some other option to be able store data bigger than 2G (split it into
>chunks in your application????).
>
>If you suggested such a design to programmic language community, you
>would be laughed at.
><<
>
>oh really? then how comes C# has 8 types to store integers only, not
>counting decimal type? Is it funny too?

Is there anything different between a string 4096 long and a
string 4097 long? Are there any different machine instructions that
are required to handle them? No?

What about with integers? Yes?

There is your difference.

>>> In fact, those ugly length
>limited datatypes are just artifacts of the early SQL days,
><<
>
>on the contrary, it is a very powerful tool in enforsing data integrity

Oh? How? Data values have very little to do the difference
between 4096 and 4097 long strings.

Sincerely,

Gene Wirchenko

JT

unread,

Aug 19, 2005, 2:59:12 PM8/19/05

to

GB sized XML documents are typically loaded into structures such as datasets
which swap the data into and out of memory as needed. I don't see why an
application would need to load the entire contents into a flat memory object
like a string. If you want to write an application that loads GBs of data
into memory, then I'm sure the end users will express their own opinions
about the design when their PC or server bogs down due to depleted
resources.

"Mikito Harakiri" <mikharaki...@yahoo.com> wrote in message

news:1124475901.2...@o13g2000cwo.googlegroups.com...

AK

unread,

Aug 19, 2005, 3:31:15 PM8/19/05

to

>>
>>> In fact, those ugly length
>limited datatypes are just artifacts of the early SQL days,
><<

>on the contrary, it is a very powerful tool in enforsing data integrity

Oh? How?
<<

by declaring STATE_CODE CHAR(2) I don't let old style abbreviations
'Fla' and 'Ill' in

by declaring COUNTRY_CODE CHAR(3), etc.

Gene Wirchenko

unread,

Aug 19, 2005, 4:04:30 PM8/19/05

to

On 19 Aug 2005 12:31:15 -0700, "AK" <AK_TIRE...@hotmail.COM>
wrote:

[??? wrote:]

>>>> In fact, those ugly length
>>limited datatypes are just artifacts of the early SQL days,

[AK wrote:]

>>on the contrary, it is a very powerful tool in enforsing data integrity

[Gene Wirchenko wrote:]

> Oh? How?

>by declaring STATE_CODE CHAR(2) I don't let old style abbreviations
>'Fla' and 'Ill' in
>
>by declaring COUNTRY_CODE CHAR(3), etc.

True, but that was not the point. The point was that datatypes
limited to arbirary characters are artifacts. (Note: "datatypes" not
"columns". Your usage is good.)

(Please do not strip attributions when you quote someone.)

Sincerely,

Gene Wirchenko

Tom Ivar Helbekkmo

unread,

Aug 19, 2005, 4:37:45 PM8/19/05

to

Joe,

as a long time satisfied reader of your books, I'd like to ask you
specifically about this "IDENTITY is bad" thing of yours:

Say I do book-keeping for a club. Say I want to keep track of
members, and their addresses, due payments, and other details. Say I
can't legally ask them for their SSNs, and anyway would rather not.
Now, "Joe Celko, 21 Cedar Drive, 12345 Smallville" may be all I have
to identify a given person by. This is a pretty good candidate key
(although it is not really guaranteed to be unique), but it is not
something that is certain to be permanent. Worse, if I want to use it
as a foreign key in a lot of tables, that means I'll be using a
multi-column key all over the place, wasting storage space, and
complicating any code that accesses the data.

In this situation, I'd be sorely tempted to say "OK, our club will
have a 'member number' that identifies each member, and I'll use that
as my primary key for the members table, and Joe Celko can just have a
coronary if he wants to". :-) (He can also move, or change his name,
and it won't be a problem.)

Of course, the member number has no physical reality, and it will be
an automatically assigned serial number (without reuse of defunct
numbers when people quit or die) -- generally, it seems to be what you
rant against.

Am I doing something stupid here? If so, why?

-tih
--
Don't ascribe to stupidity what can be adequately explained by ignorance.

Ross Presser

unread,

Aug 19, 2005, 4:33:29 PM8/19/05

to

God have pity on the human race if they ever come close to needing 10^20
SSNs. Even 10^16 humans, even spread out over a hundred thousand years of
living and dying, is far more than Earth could ever support.

Mikito Harakiri

unread,

Aug 19, 2005, 4:52:47 PM8/19/05

to

Tom Ivar Helbekkmo wrote:
> In this situation, I'd be sorely tempted to say "OK, our club will
> have a 'member number' that identifies each member, and I'll use that
> as my primary key for the members table, and Joe Celko can just have a
> coronary if he wants to". :-) (He can also move, or change his name,
> and it won't be a problem.)
>
> Of course, the member number has no physical reality, and it will be
> an automatically assigned serial number (without reuse of defunct
> numbers when people quit or die) -- generally, it seems to be what you
> rant against.

Of course, it has physical reality. What do people mean when they say
"I was employee #10 at Waffle corp"? These numbers are imprinted on
their badges!

Gene Wirchenko

unread,

Aug 19, 2005, 5:17:26 PM8/19/05

to

On Fri, 19 Aug 2005 22:37:45 +0200, Tom Ivar Helbekkmo
<t...@hamartun.priv.no> wrote:

[snip]

>Of course, the member number has no physical reality, and it will be
>an automatically assigned serial number (without reuse of defunct
>numbers when people quit or die) -- generally, it seems to be what you
>rant against.

No, it is when the number is referring to how the data is
physically stored. If the number is a record number or an address or
other such, then it ties your data to the physical implementation. A
member number (if used just as that) would not do that.

>Am I doing something stupid here? If so, why?

You are misunderstanding.

Sincerely,

Gene Wirchenko

-CELKO-

unread,

Aug 19, 2005, 5:37:55 PM8/19/05

to

>> A VARCHAR(n) won't contain extraneous blanks unless the developer explicitly includes them in the value. <<

Not the developer, the data entry person who is probably an end user
these days. The DB guy's job is to add a constaint like
"CHECK (TRIM (BOTH foobar) = UPPER (foobar))" to the column on his
side. Now how often have you seen Newbies go to that trouble?

Instead you see a lot of VARCHAR(50) -- notice the magic number fifty
from ACCESS programmers -- columns which will eventually collect
garbage, like 40 spaces and a period. Even worse, NVARCHAR(50) whcih
collect garbage in Chinese!

Mikito Harakiri

unread,

Aug 19, 2005, 6:11:14 PM8/19/05

to

Why not

CHECK CONSTRAINT(LENGTH(COUNTRY_CODE)<=2)

BTW, this way you could also declare that COUNTRY_CODE has exactly 2
characters.

Hugo Kornelis

unread,

Aug 19, 2005, 6:32:28 PM8/19/05

to

On 18 Aug 2005 15:02:56 -0700, Mikito Harakiri wrote:

>Hugo Kornelis wrote:
>> Many companies use a short mnemonic code for their customers. The use of
>> a short mnemonic code for products is not uncommon either. Imagine a
>> customers table with 10,000 customers, a products table with 5,000
>> products and an orders table with a few million rows. Now would you
>> prefer char(6) or varchar(6) for CustomerCode? And char(5) or varchar(5)
>> for ProductCode?
>
>But if lookup table cardinality goes up, then, the storage factor
>char(n) vs varchar(n) goes down! Perhaps, you can convince me that the
>effect of the two trailing bytes is not miniscule, as I previously
>thought, but it just can't be significant. Any benchmark demonstrating
>that the performance degradation is not in single percentage digits is
>welcome.

Hi Mikito,

Below is the script I used to compare performance. I compared three
tasks:

1. Performing lots of single inserts (speed of inserts matters in
databases that have to process thousands of new rows per second)
2. Adding an index (not an everyday task, but there are situations where
a process can be sped up by adding an index, doing the process, then
dropping the index again)
3. Generating a typical report that involves some joins (the CASE
expressions I used may look silly, but imagine that I am using a payment
status instead - also note that I store the results in a temporary table
first, to make sure that network speed won't influence the results)

You'll find the full code further below. But first the results:

1. For the inserts, the elapsed time on my machine was 118156 ms with
char, 124406 ms with varchar. A degradation of over 5 %.

2. Adding the index took 2840 ms with char, 3236 ms with varchar. A
performance degradation of almost 14 %.

3a. Generating the report with the index created in step 2 took 224 ms
on average for char; 234 ms on average for varchar (I used three
consecutive executions for both tests, each time starting with a clean
cache). Performance degradation: 4.5 %.
3b. Generating the same report after dropping the supporting index
showed an even bigger difference: avg 2281 for char; avg 3250 for
varchar. Degradation: over 40 %.

1a. To top it off, I repeated the insertion test, but this time AFTER
creating the extra index. For char, the time taken was 227936 ms; for
varchar 313110 ms. Performance degradation: 37 %.

So here's your choice: either you use the index, degrade the queries by
"only" 4.5% but at the cost of slowing inserts down by 37%, or you
discard the index to reverse these figures (inserts slowed down 5%;
queries slowed down 40%).

Or you choose char instead of varchar when you know that the length is
more or less fixed. :-)

Here's the code I used for this benchmark:

-- Set up the tables
CREATE TABLE Clients
(ClientID char(6) NOT NULL,
ClientName varchar(36) NOT NULL,
PRIMARY KEY (ClientID)
)
CREATE TABLE Products
(ProdID char(5) NOT NULL,
ProdName varchar(30) NOT NULL,
PRIMARY KEY (ProdID)
)
CREATE TABLE Orders
(ClientID char(6) NOT NULL,
ProdID char(5) NOT NULL,
PRIMARY KEY (ClientID, ProdID),
FOREIGN KEY (ClientID) REFERENCES Clients,
FOREIGN KEY (ProdID) REFERENCES Products
)
go
-- Put some rows in the Clients table
DECLARE @i int, @r int, @NumClients int
SET @NumClients = 10000
SET @i = 0
WHILE @i < @NumClients
BEGIN
SET @r = CAST(@i AS bigint) * 308915776 / @NumClients
INSERT Clients (ClientID, ClientName)
VALUES(CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26)) + CHAR(65
+ ((@r / 676) % 26))
+ CHAR(65 + ((@r / 17576) % 26)) + CHAR(65 + ((@r / 456976) %
26))
+ CHAR(65 + ((@r / 11881376) % 26)),
LEFT(CAST(NewID() as VARCHAR(36)), 10 + RAND() * 26))
SET @i = @i + 1
END
select count(*) from Clients
go
-- Put some rows in the Products table
DECLARE @i int, @r int, @NumProds int
SET @NumProds = 5000
SET @i = 0
WHILE @i < @NumProds
BEGIN
SET @r = CAST(@i AS bigint) * 11881376 / @NumProds
INSERT Products (ProdID, ProdName)
VALUES(CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26)) + CHAR(65
+ ((@r / 676) % 26))
+ CHAR(65 + ((@r / 17576) % 26)) + CHAR(65 + ((@r / 456976) %
26)),
LEFT(CAST(NewID() as VARCHAR(36)), 5 + RAND() * 25))
SET @i = @i + 1
END
select count(*) from Products
go
-- Have each client order up to 10 products
DECLARE @Start datetime, @End datetime
SET @Start = CURRENT_TIMESTAMP
DECLARE @c int, @i int, @p int, @r int, @NumClients int, @NumProds int,
@ClientID char(6), @ProdID char(5), @Amt tinyint
SET @NumProds = 5000
SET @NumClients = 10000
SET @c = 0
WHILE @c < @NumClients
BEGIN
SET @r = CAST(@c AS bigint) * 308915776 / @NumClients
SET @ClientID =
CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26)) + CHAR(65
+ ((@r / 676) % 26))
+ CHAR(65 + ((@r / 17576) % 26)) + CHAR(65 + ((@r / 456976) %
26))
+ CHAR(65 + ((@r / 11881376) % 26))
SET @Amt = RAND() * 20 + 20
SET @i = 0
WHILE @i < @Amt
BEGIN
SET @r = CAST(RAND() * @NumProds AS bigint) * 11881376 /
@NumProds
SET @ProdID =
CHAR(65 + (@r % 26)) + CHAR(65 + ((@r / 26) % 26)) +
CHAR(65 + ((@r / 676) % 26))
+ CHAR(65 + ((@r / 17576) % 26)) + CHAR(65 + ((@r / 456976)
% 26))
INSERT Orders (ClientID, ProdID)
SELECT @ClientID, @ProdID
WHERE NOT EXISTS (SELECT *
FROM Orders
WHERE ClientID = @ClientID
AND ProdID = @ProdID)
SET @i = @i + 1
END
SET @c = @c + 1
IF @c % 100 = 0
PRINT @c
END
SET @End = CURRENT_TIMESTAMP
SELECT 'Generating orders', @Start, @End, DATEDIFF(ms, @Start, @End) AS
Elapsed
select count(*) from Orders
go
-- Add an index
DECLARE @Start datetime, @End datetime
SET @Start = CURRENT_TIMESTAMP
CREATE INDEX Orders_ProdID ON Orders(ProdID)
SET @End = CURRENT_TIMESTAMP
SELECT 'Adding an index', @Start, @End, DATEDIFF(ms, @Start, @End) AS
Elapsed
go
-- DROP INDEX Orders.Orders_ProdID
-- Flush buffers
CHECKPOINT
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
go
-- Perform a typical query
DECLARE @Start datetime, @End datetime
SET @Start = CURRENT_TIMESTAMP
SELECT p.ProdID,
COUNT(CASE WHEN LEN(c.ClientName) < 20 THEN 1 END) AS
ShortNameOrders,
COUNT(CASE WHEN LEN(c.ClientName) > 30 THEN 1 END) AS
LongNameOrders,
COUNT(c.ClientName) AS TotalOrders
INTO #Results
FROM Products AS p
INNER JOIN Orders AS o
ON o.ProdID = p.ProdID
INNER JOIN Clients AS c
ON c.ClientID = o.ClientID
WHERE p.ProdID LIKE 'D%'
GROUP BY p.ProdID
SET @End = CURRENT_TIMESTAMP
SELECT 'Query with join', @Start, @End, DATEDIFF(ms, @Start, @End) AS
Elapsed
SELECT * FROM #Results
DROP TABLE #Results
go
-- Clean up the mess
DROP TABLE Orders
DROP TABLE Products
DROP TABLE Clients
go

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)

paul c

unread,

Aug 19, 2005, 6:35:10 PM8/19/05

to

Gene Wirchenko wrote:
> On Fri, 19 Aug 2005 22:37:45 +0200, Tom Ivar Helbekkmo
> <t...@hamartun.priv.no> wrote:
>
> [snip]
>
>
>>Of course, the member number has no physical reality, and it will be
>>an automatically assigned serial number (without reuse of defunct
>>numbers when people quit or die) -- generally, it seems to be what you
>>rant against.
>
>
> No, it is when the number is referring to how the data is
> physically stored. If the number is a record number or an address or
> other such, then it ties your data to the physical implementation. A
> member number (if used just as that) would not do that.

> ...

i don't believe that som physical connection ensues by a machine's
choice of key, but maybe it doesn't matter as i'm stupid which some of
my postings prove. once the value of the number is in a relation, if we
are talking about relations and not something else, and we agree, if we
may, to call it a 'key', then i'd say it is pretty much immutable, until
somebody 'deletes' it. sorry for all the conditions, but my limited
understanding requires me to specify limits!

cheers,
p

paul c

unread,

Aug 19, 2005, 6:35:30 PM8/19/05

to

--CELKO-- wrote:
>...

> all digit dialing? When we went to ten digit dialing in major metro
> areas, nobody said a word.
> ...

Heh, I did, but nobody listened, including the telephone manufacturers.

Talk about keys! If they had asked me, I would have wanted area
names *as well as* exchange names, say Vancouver Cedar Cottage instead

of the 604-707-xxxx that I have to use. If I went to Europe, I'd expect

that there might be a conflict with the local names, but I'd be
expecting to have to make lots of other adjustments too, like having to
remember that in South Africa, directions will often state that one must
turn left at the next robot. I'll bet the other 'phone companies are
pretty much as screwed up with their number 'keys' as Bell Canada who
for years couldn't send a single bill if one had two residential 'phone

numbers and maybe still can't for all I know. The only smart thing I

Hugo Kornelis

unread,

Aug 19, 2005, 6:43:09 PM8/19/05

to

On Fri, 19 Aug 2005 10:49:59 -0400, JT wrote:

>Perhaps I'm misreading your point, but it's data in a Char column that
>typically contains trailing blanks that need a Trim. A VarChar won't contain

>extraneous blanks unless the developer explicitly includes them in the
>value.

Hi JT,

I think that Joe refers to the ANSI comparison rules for char and
varchar:

* Comparing two char (assume same length): Do character by character
compare.

* Comparing two varchar: First, find if actual length (not maximum
length!) is different. Then, pad shortest string with spaces until it's
just as long as the longest string. Finally, do character by character
compare of longest string to padded version of shortest string.

For comparison of char with different length, padding is also required,
but this padding is independent of the actual data; it's the same for
each row and the amount of padding can be determined in the compile
phase. (But in joins, you should not compare two different-length char
columns anyway!)

Hugo Kornelis

unread,

Aug 19, 2005, 6:46:35 PM8/19/05

to

On 18 Aug 2005 15:13:28 -0700, Mikito Harakiri wrote:

>Hugo Kornelis wrote:
>> On 18 Aug 2005 13:17:37 -0700, Mikito Harakiri wrote:
>> >pondering if
>> >
>> >select * from table
>> >
>> >is faster than
>> >
>> >select col1, col2, ... from table
>>
>> No need to ponder that -- all SQL Server DBAs (and presumably all DBAs
>> for all serious RDBMS's) know that SELECT * should never be used in
>> production code (except in a EXISTS(..) subquery).
>
>My bad. I meant
>
>select count(1) from table
>
>vs.
>
>select count(*) from table

Hi Mikito,

As Mike said: absolutely no difference between the two.

>BTW, you triggered the other example: is EXISTS or IN faster? This
>question could come up only from somebody who is completely unaware of
>SQL expression equivalency and query rewrite.

SQL Server will often produce the same execution for both versions. If
performance is really critical, always test all versions. If performance
is important but not criticat, use EXISTS - I've seen cases where it's
faster than IN, but I haven't seen the reverse yet.

> Well, making sure the
>extents and segments are layed out on disk properly, leaves little room
>for education and abstract thinking.

Huh? I don't know what your DMBS of choice is, but SQL Server doesn't
bother the DBA with extents and segments.

Hugo Kornelis

unread,

Aug 19, 2005, 6:46:52 PM8/19/05

to

ROFL!!

<cough> indeed.

Hugo Kornelis

unread,

Aug 19, 2005, 6:49:14 PM8/19/05

to

On Fri, 19 Aug 2005 12:00:15 -0700, Gene Wirchenko wrote:

>On 19 Aug 2005 10:22:41 -0700, "AK" <AK_TIRE...@hotmail.COM>
>wrote:
>
>>>> You have varchar2 limited to 4000 bytes.
>>Then you have text to cover 4K to 2G range. Then you have to rely on
>>some other option to be able store data bigger than 2G (split it into
>>chunks in your application????).
>>
>>If you suggested such a design to programmic language community, you
>>would be laughed at.
>><<
>>
>>oh really? then how comes C# has 8 types to store integers only, not
>>counting decimal type? Is it funny too?
>
> Is there anything different between a string 4096 long and a
>string 4097 long? Are there any different machine instructions that
>are required to handle them? No?
>
> What about with integers? Yes?
>
> There is your difference.

Hi Gene,

So what exactly is the difference between a number that might reach
30,000 or a number that might reach 40,000? And how exactly is that
difference different from the difference between a 4096 character string
and 4097 character string?

Hugo Kornelis

unread,

Aug 19, 2005, 6:52:07 PM8/19/05

to

On 18 Aug 2005 15:50:39 -0700, Mikito Harakiri wrote:

>Stu wrote:
>> if you disregard the differences here
>> because of substantial resources, why bother with any sort of datatype
>> at all? If a varchar(50) will do the trick of a char(10), what's 42
>> extra bytes? In the days of terabytes sized RAID arrays, why not use a
>> varchar (8000) for everything? Why use an integer if a varchar will
>> do? Obviously at some point, there will be a performance impact; why
>> not be disciplined enough to strike at the lowest level of that curve
>> and use the least expensive resource that is the best fit for your data?
>
>Humans are notoriously bad at storage management. Storage layout is
>something that has to be hidden and managed automatically. When did you
>specify memory parameters for your web browser program (Let see: 10M
>for fonts, 15M for web pages cache, 5M for cookies, or, wait a minute,
>maybe 2M for cookies would make my browser faster?)

Hi Mikito,

What DBMS are you talking about here? Please remind me never to take a
job that involves managing one of those....

BTW, Internet Explorer does allow you to specify maximum cache space...

>I want a string datatype, and can't possibly predict (and don't really
>care) how long it would be. Is it so difficult to design a RDBMS engine
>that would allow me to declare a string of arbitrary length with some
>rudimentary intelligence as far as storage is concerned?

Such a string type exists. In SQL Server, and I bet in all other major
databases as well.
But the flexibility comes at a price: performance suffers. So if you
care about performance, you should only implement this flexibility where
you need it.

Hugo Kornelis

unread,

Aug 19, 2005, 6:56:04 PM8/19/05

to

On 18 Aug 2005 15:13:28 -0700, Mikito Harakiri wrote:

(snip)

>BTW, you triggered the other example: is EXISTS or IN faster?

Woops - forgot to add this to my previous reply:

Before thinking about speed, you should think about the difference
between the two. IN can result in UNKNOWN; EXISTS can only result in
TRUE or FALSE. Since UNKNOWN and FALSE are treated the same in a WHERE,
WHEN, or HAVING clause, many people doon't notice the difference - until
they start combining IN with NOT and getting unexpected results!

Only choose the faster version if they truly are equal!!

Hugo Kornelis

unread,

Aug 19, 2005, 6:58:40 PM8/19/05

to

On Fri, 19 Aug 2005 14:17:26 -0700, Gene Wirchenko wrote:

>On Fri, 19 Aug 2005 22:37:45 +0200, Tom Ivar Helbekkmo
><t...@hamartun.priv.no> wrote:
>
>[snip]
>
>>Of course, the member number has no physical reality, and it will be
>>an automatically assigned serial number (without reuse of defunct
>>numbers when people quit or die) -- generally, it seems to be what you
>>rant against.
>
> No, it is when the number is referring to how the data is
>physically stored. If the number is a record number or an address or
>other such, then it ties your data to the physical implementation. A
>member number (if used just as that) would not do that.

Hi Gene,

Unless I am missing the context here, Tom was talking about surrogate
keys generated in MS SQL Server by the IDENTITY property.

And there is no relation between IDENTITY and physical storage.

paul c

unread,

Aug 19, 2005, 6:59:05 PM8/19/05

to

i don't mean to single anybody out, but much of the talk lately has been
about domains and syntax of domains. in spite of the many messages
about 'equality', i get the impression that very few people take the
minimalist view that i do which is that the RM only gets confused by
these questions. and that for some of us, the confusion is
never-ending if they are entertained. 'equality', if you will, is
fundamentally a point-of-view, from the RM's perspective. it just seems
to me that the RM becoms simpler (a possible advantage being simpler
implementation) if questions about reality are kept out of it.
obviously, when i say 'keep reality out of it', many people will start
flaming about 'what good is it then?', but i'll try to ignore them as i
attempt to understand the 'theory'. the big questions for the RM, *for
me* seem to include view updatability which has to do with the operators
of the algebra as well as whether a relational engine can implement
customary features such as concurrency control and presentation
coherence without being written in a language that eschews the
relational operators.

thanks again for listening to my rant!

pc

Mikito Harakiri

unread,

Aug 19, 2005, 7:18:03 PM8/19/05

to

Hugo,

I was going to perform this test in oracle, but then I came across the
following passage:

Tom...
I was told once that if you only need a single char
use CHAR(1) since using VARCHAR2(1) has overhead
due to the Oracle having to maintain how long
the value is in a VARCHAR2 field

Is this a load of rubbish or is there some truth
to it?

Followup:

that is rubbish.

A char(n) is a varchar2(n) that is blank padded to it's maximum length.
The
implementation of a char physically on disk is identical to a varchar2
-- there is a length byte for a char(1) just as there is for a
varchar2(1).

http://www.jlcomp.demon.co.uk/faq/char_vs_varchar2.html
3. As opposed to a wide-spread misunderstanding, char (1) does *not*
use less space in the database than varchar2 (1) when the value 'a' is
stored because also char (1) has a length field. This can be proved by
the dump () function.

I guess this storage quirk doesn't really affect TPC-C benchmarks:)

It might be interesting, however, to dig down into the insert
performance difference. As I mentioned, 2 extra bytes shouldn't really
affect the speed of index unique scan.

BTW, I always use INTEGERs for id columns.

Mikito Harakiri

unread,

Aug 19, 2005, 7:35:47 PM8/19/05

to

Hugo Kornelis wrote:
> On 18 Aug 2005 15:13:28 -0700, Mikito Harakiri wrote:
>
> (snip)
> >BTW, you triggered the other example: is EXISTS or IN faster?
>
> Woops - forgot to add this to my previous reply:
>
> Before thinking about speed, you should think about the difference
> between the two. IN can result in UNKNOWN; EXISTS can only result in
> TRUE or FALSE. Since UNKNOWN and FALSE are treated the same in a WHERE,
> WHEN, or HAVING clause, many people doon't notice the difference - until
> they start combining IN with NOT and getting unexpected results!
>
> Only choose the faster version if they truly are equal!!

This tiny difference is really a pain in the butt. This UNNKNOWN based
logic is rubbish, and any query involving NULLs still could give a
surprising result. Therefore, the life would be much simpler if SQL
standard just defined NOT IN and NOT EXIST to be identical. After all,
they both are just dumbed down aggregate scalar subquery expressions:

--IN, EXIST
select * from dept d
where 0<(select count(*) from emp
where dept.deptno=emp.deptno)

--NOT IN, NOT EXIST
select * from dept d
where 0=(select count(*) from emp
where dept.deptno=emp.deptno)

(both also known as semijoin and antijoin). Aggregate scalar subquery
syntax is more general, and generic solution always rule.

Mike

unread,

Aug 20, 2005, 8:39:52 AM8/20/05

to

On 19 Aug 2005 10:22:41 -0700, "AK" <AK_TIRE...@hotmail.COM> wrote:

>>> You have varchar2 limited to 4000 bytes.
>Then you have text to cover 4K to 2G range. Then you have to rely on
>some other option to be able store data bigger than 2G (split it into
>chunks in your application????).
>
>If you suggested such a design to programmic language community, you
>would be laughed at.
><<
>
>oh really? then how comes C# has 8 types to store integers only, not
>counting decimal type? Is it funny too?

Since the text you quoted was not written by me, and doesn't even
appear in my post, I'm not sure exactly what to laugh at. But I think
one of your news operators must be broken.

--
Mike Sherrill

Paul

unread,

Aug 20, 2005, 7:05:37 PM8/20/05

to

Hugo Kornelis wrote:
> So what exactly is the difference between a number that might reach
> 30,000 or a number that might reach 40,000? And how exactly is that
> difference different from the difference between a 4096 character string
> and 4097 character string?

No difference; it's really just a choice of whether you want the
constraint as part of the domain definition or just as a standard check
constraint in the database.

Are there any reasons to choose one over the other? Or is it an
arbitrary choice? I'm thinking here in theory rather than for a specific
DBMS.

Paul.

AK

unread,

Aug 22, 2005, 9:22:23 AM8/22/05

to

>>
> beleive me or not, in some cases it's 100% or more

I simply don't believe you. I don't believe the difference can be
100%. I
certainly don't believe it can be more. And I don't believe you can
even
measure what difference there is. And even if I did believe all that,
I
don't believe it is anything but a bug in your software.

So I 100% don't believe you. However, I am willing to be persuaded by
proof.
<<

Vow, I missed such a statement.
Well, imagine a table with rows that are almost 4K each. That's 2 rows
per an 8K page. Add some additional width, and you end up with just one
row per data page, a 100% increase in table size. Believe me or not,
that's not a bug in our software, that's the way SQL Server stores
data.
On one hand, moving from 2K pages in 65 to 8K pages in 70 in 200 is a
big improvement. On the other hand, in Oracle we can choose page size
(blocksize in Oracle universe)

Mike Labosh

unread,

Aug 22, 2005, 3:29:20 PM8/22/05

to

> Not the developer, the data entry person who is probably an end user
> these days. The DB guy's job is to add a constaint like
> "CHECK (TRIM (BOTH foobar) = UPPER (foobar))" to the column on his
> side. Now how often have you seen Newbies go to that trouble?

Quite.

Last week I discovered a record in our "City" table called "A** HOLE" linked
to a "CompanyLocation" record called "F*** YOU" and had a contact at that
location named after some 1970's British speed-punk band, who was in charge
of IBM mainframe products like "EAT SH**"

I have also come across contacts with names like "[TYPE CONTACT NAME HERE]"
and "#Error" (That's the funny one. #Error is what you get when you're in
MS Access and there's a data binding issue)

Working here is like living inside a Salvador Dali or MC Escher painting.

--
Peace & happy computing,

Mike Labosh, MCSD
"Musha ring dum a doo dum a da!" -- James Hetfield

--CELKO--

unread,

Aug 22, 2005, 3:46:56 PM8/22/05

to

My favorite was a Data Warehousing project migration with the part
number "I hate my job" repeated 9000 or so times. Jack Nicholson does
data entry!!

>> Working here is like living inside a Salvador Dali or MC Escher painting. <<

Nah! Escher and Dali had a pattern to their works -- projective
geometery and math for Escher, Freudian Psychology for Dali. we are in
Hieronymus Bosch! Crude, brutal views of Hell from the Middle Ages
mixed with stupidity.

Mike Labosh

unread,

Aug 22, 2005, 4:31:55 PM8/22/05

to

> Nah! Escher and Dali had a pattern to their works -- projective
> geometery and math for Escher, Freudian Psychology for Dali. we are in
> Hieronymus Bosch! Crude, brutal views of Hell from the Middle Ages
> mixed with stupidity.

I had never heard of this Hieronymus Bosch person so I Googled the name. He
is definately in need of some psychological help :)

Gene Wirchenko

unread,

Aug 22, 2005, 5:00:27 PM8/22/05

to

On Sat, 20 Aug 2005 00:49:14 +0200, Hugo Kornelis
<hugo@pe_NO_rFact.in_SPAM_fo> wrote:

[snip]

>So what exactly is the difference between a number that might reach

>30,000 or a number that might reach 40,000? And how exactly is that

Different machine instructions.

>difference different from the difference between a 4096 character string
>and 4097 character string?

This is just iterating a little more on the longer string.

Sincerely,

Gene Wirchenko

Hugo Kornelis

unread,

Aug 22, 2005, 6:32:20 PM8/22/05

to

On 19 Aug 2005 16:18:03 -0700, Mikito Harakiri wrote:

(snip)

>Hugo,
>
>I was going to perform this test in oracle, but then I came across the
>following passage:

(snip)

>I guess this storage quirk doesn't really affect TPC-C benchmarks:)

Hi Mikito,

I've never worked with Oracle, so I'll just have to take your word for
it, I guess.

>It might be interesting, however, to dig down into the insert
>performance difference. As I mentioned, 2 extra bytes shouldn't really
>affect the speed of index unique scan.

I think the culprint for the inserts is the extra index. The size of one
entry in the index grows from 11 bytes to 15 bytes (a large percentage).
And that will mean that less entries fit into one page - and since the
inserts are sequential for the PRIMARY KEY, but non-sequential for the
UNIQUE, the supporting index for the UNIQUE constraint will face lots of
page splits.

I include these two statements in the main loop for the inserts to have
some indication that the system is still busy:

>> IF @c % 100 = 0
>> PRINT @c

If you run the script, you'll see that the speed at which the counter
increase will graudually slow down as the number of index pages grows.

>BTW, I always use INTEGERs for id columns.

Yeah, I guess I should have named the key columns ClientCode and
ProdCode instead of ClientID and ProdID. :-)

Hugo Kornelis

unread,

Aug 22, 2005, 6:38:01 PM8/22/05

to

On Mon, 22 Aug 2005 14:00:27 -0700, Gene Wirchenko wrote:

>On Sat, 20 Aug 2005 00:49:14 +0200, Hugo Kornelis
><hugo@pe_NO_rFact.in_SPAM_fo> wrote:
>
>[snip]
>
>>So what exactly is the difference between a number that might reach
>>30,000 or a number that might reach 40,000? And how exactly is that
>
> Different machine instructions.

Hi Gene,

Isn't an RDBMS all about abstracting away from implementation details?

>>difference different from the difference between a 4096 character string
>>and 4097 character string?
>
> This is just iterating a little more on the longer string.

Toppling the size needed just over the maximum amount reserved for it in
specific internal structures. Or, put more simply, it takes other
machine instructions to handle.

Mike Hodgson

unread,

Aug 22, 2005, 7:44:48 PM8/22/05

to

Mike Labosh wrote:

Nah!   Escher and Dali had a pattern to their works -- projective
geometery and math for Escher, Freudian Psychology for Dali. we are in
Hieronymus Bosch! Crude, brutal views of Hell from the Middle Ages
mixed with stupidity.

I had never heard of this Hieronymus Bosch person so I Googled the name.  He 
is definately in need of some psychological help  :)

I think a 490 year old corpse is a little beyond help at this stage (but I agree that in his time he could have done with a good psychiatrist or 10).

--
mike hodgson
blog: http://sqlnerd.blogspot.com

Mark D Powell

unread,

Aug 23, 2005, 12:20:14 PM8/23/05

to

Colin, with SQL Server it might be "pointless" to store a one-byte
character in a varchar but it isn't in Oracle and may not be with some
other rdbms managers. In Oracle a char(1) and a varchar2(1) both take
2 bytes of internal storage when a value is present. There is a null
indicator and the data value for the char column. A length byte and a
value for the varchar2. If no value is present, that is the value is
said to be null, then the char(1) still takes two physical bytes of
storage but the varchar2(1) only requires 1 byte. If the column is at
the end of the row then Oracle effectively does not even store the null
indicator so the length becomes zero.

HTH -- Mark D Powell --

Gene Wirchenko

unread,

Aug 23, 2005, 12:57:58 PM8/23/05

to

On Tue, 23 Aug 2005 00:38:01 +0200, Hugo Kornelis
<hugo@pe_NO_rFact.in_SPAM_fo> wrote:

>On Mon, 22 Aug 2005 14:00:27 -0700, Gene Wirchenko wrote:
>
>>On Sat, 20 Aug 2005 00:49:14 +0200, Hugo Kornelis
>><hugo@pe_NO_rFact.in_SPAM_fo> wrote:
>>
>>[snip]
>>
>>>So what exactly is the difference between a number that might reach
>>>30,000 or a number that might reach 40,000? And how exactly is that
>>
>> Different machine instructions.

>Isn't an RDBMS all about abstracting away from implementation details?

And here I thought it was about storing/retrieving checked data.

Whether one uses, for example, a string or an integer for a
particular column is an implementation detail.

>>>difference different from the difference between a 4096 character string
>>>and 4097 character string?
>>
>> This is just iterating a little more on the longer string.
>
>Toppling the size needed just over the maximum amount reserved for it in
>specific internal structures. Or, put more simply, it takes other
>machine instructions to handle.

"more" not "other".

Sincerely,

Gene Wirchenko

Mike Labosh

unread,

Aug 23, 2005, 1:30:31 PM8/23/05

to

>>Toppling the size needed just over the maximum amount reserved for it in
>>specific internal structures. Or, put more simply, it takes other
>>machine instructions to handle.
>
> "more" not "other".

SHEESH!

I meant to pass on some amusement last week. I cannot believe you people
are still arguing about the miniscule details.