"We are planning on creating an 'account opening sources' lookup table to store
'account source' codes and corresponding descriptions. The problem with storing
this data in a dedicated table is that as we go on we'd end up with tens and
possibly hundreds of lookup tables.
I suggest that instead we create a more generic table that stores various
codes used within a schema, along with corresponding description and a code
'category'. Another table (optional but recommended) would contain
'category' descriptions.
This is not my idea, I've seen this approach used at other companies.
Please see attached for an illustration.
Let me know what you think."
I'd like to know how other companies do this. Do you have a separate table
for each set of codes or do you have one table with basically three columns:
code_category, code_value, code_description (or just category, value,
description)?
Thanks for any insight.
Frank
> This is not my idea, I've seen this approach used at other companies.
> Please see attached for an illustration.
> Let me know what you think."
>
> I'd like to know how other companies do this. Do you have a separate table
> for each set of codes or do you have one table with basically three columns:
> code_category, code_value, code_description (or just category, value,
> description)?
The "one big Entity-Attribute-Value table" is a well-known DB
anti-pattern, i.e. a common mistake.
It _might_ make sense sometimes, but most of the time it doesn't.
The name of this design fallacy is OTLT ("One True Lookup Table") or
MUCK ("Massively Unified Code Keys"). You can Google around more
details or read Section 4.4 of THINKING IN SETS: AUXILIARY, TEMPORAL
AND VIRTUAL TABLES.
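For anyone who hasn't met the pattern, here is a minimal sketch of the two shapes (SQLite through Python purely for illustration; all table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The OTLT/MUCK shape: every kind of code crammed into one table.
conn.execute("""CREATE TABLE one_true_lookup (
    code_category TEXT NOT NULL,
    code_value    TEXT NOT NULL,
    code_descr    TEXT NOT NULL,
    PRIMARY KEY (code_category, code_value))""")
conn.executemany("INSERT INTO one_true_lookup VALUES (?, ?, ?)",
    [("ACCT_SOURCE", "WEB", "Opened via web site"),
     ("STATE",       "CO",  "Colorado"),
     ("AIRLINE",     "CO",  "Continental Airlines")])

# The conventional shape: one small table per domain of codes.
conn.execute("""CREATE TABLE account_sources (
    source_code  TEXT PRIMARY KEY,
    source_descr TEXT NOT NULL)""")
conn.execute("INSERT INTO account_sources VALUES ('WEB', 'Opened via web site')")

# Every OTLT lookup must remember to filter on the category as well.
otlt_descr = conn.execute(
    "SELECT code_descr FROM one_true_lookup "
    "WHERE code_category = 'ACCT_SOURCE' AND code_value = 'WEB'").fetchone()[0]
plain_descr = conn.execute(
    "SELECT source_descr FROM account_sources "
    "WHERE source_code = 'WEB'").fetchone()[0]
print(otlt_descr == plain_descr)  # True
```

Both shapes answer the same question; the difference is everything else you lose with the first one.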
>> This is not my idea, I've seen this approach used at other companies. <<
Me, too. I charge them $1K-2K per day to clean up stuff like this.
That was my thought. I will try to discourage them from this path.
Thanks!
Frank
How did I know you'd have an opinion? :-)
No, I agree. I thought it was not the way to go, but I wanted to make sure
I was thinking of the correct issue and not just something similar.
Thanks,
Frank
So what? Tables aren't rationed.
The desire to conceal complexity is not the same as the desire to remove
complexity. The former is counterproductive while the latter is
praiseworthy. What you describe is a desire to conceal what is going on.
How does that help anyone?
Maybe the implicit concern is not the number of tables in the database but
the amount of code required to maintain them. That's a programming problem.
Get the programmers off their asses and tell them to learn how to write
dynamic SQL.
Roy
Interesting. I will pass it on.
I have to be honest, though, that some of your reasons against it seem
spurious. For instance, the concern about someone adding a code_value that
is a lot of whitespace terminated by a single space. Unless you are using a
CHAR(1) field this is always a possibility. I'm not sure what the maximum
length of the field has to do with how likely the occurrence would be. Maybe
you have some sort of statistics on this? :-)
Also, the whole VARCHAR(x) where x is a 'large number'. Does it take more
space to put a value of 'X' in a VARCHAR(255) column than it does to put 'X'
in a VARCHAR(40) column? Perhaps it does, and if so that's a good reason to
be against it. But if it doesn't take up more space I'm not sure I see the
reason to prefer the smaller max length to the larger one. I'm not honestly
suggesting that everything be VARCHAR(20000) or whatever the max is, if you
know you'll never come close to having a column that would take up that much
space, but if you aren't sure of the maximum possible length of something
then why not make it something fairly large?
One reason I *didn't* see in your article is the use of the OTLT as a parent
table for foreign key constraints. Certainly it seems better to say:
Alter Table a_table
Add Constraint a_table_fk1
Foreign Key (state_code)
References state_codes (state_code)
Enforced
Than it would be to put in a CHECK constraint against values on an OTLT. To
me this seems the best argument against it.
Frank
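To make the foreign-key argument concrete, a sketch (SQLite through Python, invented table names) of the DBMS rejecting a bad code through a declarative FK to a dedicated lookup table; an OTLT can't offer this without a clumsier two-column key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite has FK checking off by default

conn.execute("CREATE TABLE state_codes (state_code TEXT PRIMARY KEY, descr TEXT)")
conn.execute("INSERT INTO state_codes VALUES ('CO', 'Colorado')")
conn.execute("""CREATE TABLE a_table (
    id         INTEGER PRIMARY KEY,
    state_code TEXT REFERENCES state_codes (state_code))""")

conn.execute("INSERT INTO a_table VALUES (1, 'CO')")      # valid code: accepted
try:
    conn.execute("INSERT INTO a_table VALUES (2, 'ZZ')")  # bogus code
    rejected = False
except sqlite3.IntegrityError:
    rejected = True                                       # the DBMS said no
print(rejected)  # True
```

No application code had to remember to validate anything; the constraint did it.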
>
>> and
>> possibly hundreds of lookup tables.
>
> So what? Tables aren't rationed.
>
> The desire to conceal complexity is not the same as the desire to remove
>
> complexity. The former is counterproductive while the latter is
> praiseworthy. What you describe is a desire to conceal what is going
> on.
> How does that help anyone?
Honestly, I don't know his reasoning. We're going to have a discussion next
week about it, and I'm sure more than one of us will shoot it down. Until
then I'm not sure what his real concern is.
> Maybe the implicit concern is not the number of tables in the database
> but
> the amount of code required to maintain them. That's a programming
> problem.
Hmm, I don't think that's it. Why would it be any more work? If anything
it's more work for the DBA, because he has to define the new table!
> Get the programmers off their asses and tell them to learn how to write
> dynamic SQL.
Hmm, better watch it here. :-) Both he and I are programmers.
Though I'm not sure where dynamic SQL comes in to play here.
Frank
"Build it and they will come." One of the evil things I do is load the
Heart Sutra into huge NVARCHAR() columns when I stress a design. If
they will not learn RDBMS, they will have a chance for
Enlightenment :)
>> I'm not sure what the maximum length of the field [sic] has to do with how likely the occurrence would be. Maybe you have some sort of statistics on this? <<
Since I just started a book on industry standards, most are fixed
length strings under 25 characters -- boy, is that a generalization!
The other programmer's reasoning is presumably exactly what he said in his
email (the snippet that you included), the concern that there will be an
explosion in the number of tables. That he's not aware of the pitfalls of
his suggested approach is no great surprise...back in the day I surely
designed tables like this also, I'm sure all of us have.
The burden may be on you to show why the OTLT is not a good thing. Joe's
article will help, also this
(http://tonyandrews.blogspot.com/2004/10/otlt-and-eav-two-big-design-mistakes.html),
and also this (http://www.projectdmx.com/dbdesign/lookup.aspx) Just be
prepared to do the heavy lifting. :-) A *lot* of programmers are not
particularly database savvy, although most of us like to think we are, and
this can include senior developers and architects. You may find, for
example, that your boss (and the other programmer's boss too) likes the OTLT
idea...best thing to do may be to weigh in on the proposal before the most
senior person (or any persons more senior than you) can open their mouths,
which gives them a chance to nod sagely and offer their own condemnations.
>> Maybe the implicit concern is not the number of tables in the database
>> but
>> the amount of code required to maintain them. That's a programming
>> problem.
>
> Hmm, I don't think that's it. Why would it be any more work? If anything
> it's more work for the DBA, because he has to define the new table!
>
>> Get the programmers off their asses and tell them to learn how to write
>> dynamic SQL.
>
> Hmm, better watch it here. :-) Both he and I are programmers.
See above. My comments also apply to not pissing off your colleague. Since
he has staked some of his credibility to this flawed design, depending on
whether or not he's otherwise a decent programmer and a decent guy, you
might be doing him a favour by writing him an email, attaching references to
OTLT/MUCK articles, that apprises him of the downsides. That way he can
educate himself before the discussion/meeting, maybe retract the suggestion
beforehand, and otherwise not look like a clueless idiot. If he has any
decency he'll even credit you. As it is, best case, he'll still lose some
face.
My apologies if this is teaching my grandmother how to suck eggs. I'm not
far off the mid-century mark myself, and have little patience for
point-scoring these days. I see this scenario as an excellent opportunity
for having a productive discussion about good database design (and I
guarantee you that maybe over half of the other coders involved will get
educated), rather than having a slanging-match. Approached the wrong way
this particular situation could turn into a furball.
Don't get me wrong - I don't believe in letting people off easily if they
need an object lesson. For example, if I saw a senior programmer
perpetrating atrocities with threads or class design I'd think it time that
they were brought to task for not knowing basics. I've had my ass hauled
over the coals a few times too. But for something like this, where (let's
face it) many (if not most) general-purpose programmers, even senior ones,
aren't necessarily that good at database design, it might be better to craft
a lesson out of it rather than have an Inquisition...
[ SNIP ]
AHS
Roy's response is excellent! Concealing complexity is not the same thing as
removing complexity. I wish I'd said that... and I probably will!
>
> > Maybe the implicit concern is not the number of tables in the database
> > but
> > the amount of code required to maintain them. That's a programming
> > problem.
>
> Hmm, I don't think that's it. Why would it be any more work? If anything
> it's more work for the DBA, because he has to define the new table!
>
One of the values of a database, as compared with a simple files solution, is
that the data is documented, at least to some extent. When a new type of
lookup table is to be created, it most often means that a new type of data
and a new entity exists in the system.
The people who want to bypass the DBA in the creation process thereby
inherit the responsibility of making the new type of code useable by
everyone who might need to know about it, and therefore the responsibility
of documenting the new data type. Often, when programmers are able to
create and store new types of code on the fly, in an EAV -- OTLT type of
design, the only place where knowledge about the data is kept is in the
brains of the programmers.
There are a few circumstances where this works out ok, but most of the time
it ends up causing a severe problem when people other than the original
programmers try to make use of the data.
Coding a lookup from an EAV table involves a more complex set of criteria.
e.g.:
where code = 'CO' and code_type = 'AIRLINE_CODE'
If different code types are stored in different columns (and therefore
probably in different tables) you only have to look up the code, and look
in the right table. So it's adding complexity to the code, while reducing
complexity in the table structure. More than one programmer has committed
the bug of leaving off the second part of the criteria above.
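That bug is easy to reproduce. A sketch (SQLite through Python, invented names) where 'CO' legitimately exists under two code types:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lookup (code_type TEXT, code TEXT, descr TEXT)")
conn.executemany("INSERT INTO lookup VALUES (?, ?, ?)",
    [("AIRLINE_CODE", "CO", "Continental Airlines"),
     ("STATE_CODE",   "CO", "Colorado")])

# Correct: both halves of the predicate.
ok = conn.execute("SELECT descr FROM lookup "
                  "WHERE code = 'CO' AND code_type = 'AIRLINE_CODE'").fetchall()
# The classic bug: drop code_type and silently match every domain.
buggy = conn.execute("SELECT descr FROM lookup WHERE code = 'CO'").fetchall()
print(len(ok), len(buggy))  # 1 2
```

With one table per domain the buggy form is simply impossible to write.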
> > Get the programmers off their asses and tell them to learn how to write
> > dynamic SQL.
>
> Hmm, better watch it here. :-) Both he and I are programmers.
> Though I'm not sure where dynamic SQL comes in to play here.
>
I'm not sure how dynamic SQL applies either...
Some of us learned database design before we began designing databases.
I'd better explain then, since someone else queried it.
When the OTLT is combined with EAV you seem to be able to get away with
having a single maintenance function. If you abandon OTLT and have multiple
lookup tables then EAV is seen to be as stupid as it is, and so that gets
abandoned too. At which point it seems you need a distinct maintenance
function for each lookup table. Dynamic SQL allows you to code a single
maintenance function because you can adapt to different numbers of columns
and columns of different types.
Obviously this is not an issue with interfaces like ODBC and JDBC, where you
already code at a low level similar to dynamic SQL, but it is unfamiliar to
Cobol programmers and their kind, who are used to embedded SQL.
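Such a generic maintenance routine might look like this sketch (Python over SQLite; the helper name and table layout are invented): it reads the column count from the catalog and assembles the INSERT statement at run time.

```python
import sqlite3

def insert_lookup_row(conn, table, values):
    # Dynamic SQL: the statement text is assembled at run time, so one
    # routine serves every lookup table.  Identifiers can't be bound as
    # parameters, so the table name is validated against the catalog.
    catalog = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    if table not in catalog:
        raise ValueError(f"no such table: {table}")
    ncols = len(conn.execute(f"PRAGMA table_info({table})").fetchall())
    if len(values) != ncols:
        raise ValueError(f"{table} has {ncols} columns, got {len(values)} values")
    placeholders = ", ".join("?" * ncols)
    conn.execute(f"INSERT INTO {table} VALUES ({placeholders})", values)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state_codes (code TEXT PRIMARY KEY, descr TEXT)")
conn.execute("CREATE TABLE account_sources (code TEXT PRIMARY KEY, descr TEXT)")
insert_lookup_row(conn, "state_codes", ("CO", "Colorado"))
insert_lookup_row(conn, "account_sources", ("WEB", "Opened via web site"))
print(conn.execute("SELECT descr FROM state_codes").fetchone()[0])  # Colorado
```

One function, any number of separate lookup tables: the "maintenance code explosion" argument for OTLT evaporates.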
Roy
Frank
Most of us learn by doing, and it's during that process that you make
mistakes. You're not seriously suggesting you've never made any?
AHS
Interesting. Here's a page my co-worker is using to justify his position:
http://www.dbforums.com/showthread.php?t=1619660
Frank
I guess if neither the programmer nor his manager knows what a DBMS is
supposed to provide they won't ever see much harm in subverting it. I
haven't the time or energy to start explaining what's wrong with that, and
no possibility of being rewarded for it if I did. In any case, ignorance
is bliss.
Roy
>"David Cressey" <cres...@verizon.net> wrote in message
>news:_dO7k.7468$qb7.996@trndny08...
>>
>> "Arved Sandstrom" <asand...@accesswave.ca> wrote in message
[snip]
>>> The other programmer's reasoning is presumably exactly what he said in
>>> his
>>> email (the snippet that you included), the concern that there will be an
>>> explosion in the number of tables. That he's not aware of the pitfalls of
>>> his suggested approach is no great surprise...back in the day I surely
>>> designed tables like this also, I'm sure all of us have.
>>>
>> Some of us learned database design before we began designing databases.
>
>Most of us learn by doing, and it's during that process that you make
>mistakes. You're not seriously suggesting you've never made any?
There is making mistakes while learning to apply a workable
method, and there is making mistakes because one did not learn first.
Sincerely,
Gene Wirchenko
Computerese Irregular Verb Conjugation:
I have preferences.
You have biases.
He/She has prejudices.
I agree. But it's also possible to be conscientious about educating
oneself - doing lots of Googling, reading articles and books, working
quality tutorials, posing questions on newsgroups - and nevertheless miss
things. Introductory material will not address intermediate and advanced
issues in detail, for example. Learning OTJ from peers and superiors is also
hit and miss. And not infrequently you simply will not think to research
something because you do not know that you do not know.
Take the subject of this thread, for example. *Once* you are somewhat aware
that there is such a database design issue, it doesn't take much Googling to
turn up some good articles about it. But absent that initial awareness it's
not that obvious.
AHS
Of course not. But the OTLT mistake is such a fundamental one that it can
be avoided by someone who has learned database design before designing
databases. You can, of course, design a database any way you like. But
many of those ways are anti-patterns. Most of the people who make that
mistake in real production databases are learning database design by sheer
trial and error.
There's a long distance between never making any mistakes and not learning
anything before diving in.
> AHS
>
>
The replies you gave to Gene show me how you and I can come to such
different views on the same subject. I learned database design in about the
1984 to 1986 time frame, after programming for fifteen years. There was no
Google. There were no forums (unless you count VAXnotes conferences on
the Digital e-net). There were, however, a few good books on the subject,
some good lecture series, a few very good mentors, people who taught
database material for a living, and most importantly, some examples of well
designed databases.
So in my context it was very easy to become aware that there was something
fundamental that I did not know, and that I needed to learn. I had designed
indexed files before to support my apps, so I wasn't starting from ground
zero. And I had unconsciously applied some of the normalization rules in my
design of indexed files. That made it easier for me to learn the formal
normalization rules and the consequences of departing from them.
Your context sounds quite different. I don't want to put words in your
mouth. I will say that there are a lot of people out there today who switch
over from storing data in files to storing the same data in a database with
the idea that the switch is a relatively trivial matter with a few technical
details but no fundamentals to rethink. And the benefits they derive from
using a database and a DBMS turn out to be relatively trivial, while the
costs turn out to be huge.
Some of those people learn the hard way. Some never learn. They move back
to using files because, according to them, databases aren't worth the cost
and aggravation.
Humans have the nearly unique ability to learn from others' mistakes.
You are not suggesting the ONLY way to learn is to make mistakes, are
you?
Ed
I'll just make a brief comment on OTLT.
I have seen this work, but only because of two key factors:
First factor was that the number of entries in the look up table was
small (about 20). Several of the entries existed only because that
DBMS lacked sequences.
The other factor was the discipline of the developers. New entries to
this table were made with consideration similar to that of adding new
columns to a normal table.
Seeing an OTLT with hundreds of entries means to me there is no logic
in the table. It has become a garbage heap and smells appropriately.
Ed
I'm more familiar with embedded SQL as an interface to application code than
I am with ODBC or JDBC interfaces. (I never programmed in COBOL but I
taught database programming to COBOL programmers). I suspect that low level
programming carries a cost with it, and that part of that cost is that
programmers continue to "think like programmers", instead of learning to
think in SQL. That could be part of the attraction of OTLT and EAV to them.
But the embedded SQL that I used was not "dynamic", as I understand the
term. The precompiler processed the embedded SQL, opened a database whose
metadata was supposed to be the same as the target database, rewrote the
SQL in some lower level language that I didn't need to know, and replaced
the embedded SQL with calls to these generated lower level routines before
passing the result to the compiler. We had something we called "dynamic"
SQL as well. But that involved generating SQL at run time, and letting an
SQL interpreter parse it. I suspect you are using "dynamic" in a different
sense than I am, but I'm not sure.
In any event, the interesting thing is, IMO, not so much at what point the
SQL is parsed and bound, but how automatic the process is of generating the
SQL needed to reference and maintain a new code table. Back in the day,
new code tables didn't occur often enough to do some heavy automation of the
process, but the necessary coding was so well understood that the manual
effort was relatively trivial. So the idea of storing all in one table "to
save programming effort" seemed like a bad joke to me. And the "delay due
to the cranky DBA" was not the kind of issue in my context that it is for
some of today's programmers.
In all my years of visiting newsgroups and other forums, I've never quite
learned why programmers think that database programming is so difficult. I
think this discussion might help me understand that. Thanks for a
considered reply.
I don't know where to start. If I were to discuss all his points in detail
I would end up writing a book on bad database design and how to avoid it.
And that book wouldn't be any better than several books Joe Celko has
already written.
But I'll give a brief summary:
*** One easy to find table with all valid values for given fields. This
means a maintenance screen can be added so these values can be kept up to
date in the future.
The table may be easy to find, but the relevant values for data validation
are just as hard to find, even harder to find IMO. You have just as many
types of codes in OTLT as you would have in separate reference tables and
you have to know just as much to restrict your view of the OTLT as you would
have to know to reference a separate table.
There is no need to add a maintenance screen for reference data. If the
data is to be maintained manually (and that's a big if), then a tool built
for that purpose is provided by the DBMS maker. That tool has a near zero
learning curve, and requires no programming. If worst comes to worst, you
can even use MS Access, set up a table link to each reference table, and
let Access make the grids or forms for you.
*** No need to create a new table for new lookup values. This means the
database schema doesn't change, it means no new code needs to be written and
it means you don't need to schedule a weekend release for your code.
This is possibly the worst reason of them all. What he's saying is that the
true structure of the data can be mutable while the database schema remains
static. It is precisely the fact that the database schema and the true
structure of the data are in conformance to each other that makes the
database schema a valuable tool for managing and using the data. A database
schema can be used as a poor man's data dictionary. In fact, Oracle refers
to its metadata as "the Oracle Dictionary".
Anybody who doesn't understand the usefulness of a data dictionary should
stay away from databases.
As far as new code goes, no new code will have to be written for new table
maintenance, as I said before. New code will have to be written to use the
new code type. But that's true regardless of whether you create a new table
or add to an existing OTLT. And as far as adding a new table breaking
existing code goes, it just doesn't happen in the real world. In over ten
years of database work, I've never seen any code that was dependent on the
non-existence of a certain table for its correct functioning.
*** A single stored proc can be provided to test if a value is valid before
adding it to your param table. This proc should be the only method of adding
data to the database.
Code validation can be enforced using referential integrity constraints. A
single stored proc can still be built, even if the tables are kept
separate. I'll describe how to do that elsewhere.
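In outline, one way to get that single entry point while keeping separate tables (a sketch in Python over SQLite; the registry and function names are invented): the routine dispatches on code type to the right reference table, while FKs still do the real enforcement underneath.

```python
import sqlite3

# Hypothetical registry: one dedicated reference table per code domain.
CODE_TABLES = {"STATE": "state_codes", "ACCT_SOURCE": "account_sources"}

def is_valid_code(conn, code_type, value):
    # A single validation routine over many separate lookup tables.
    table = CODE_TABLES[code_type]
    return conn.execute(f"SELECT 1 FROM {table} WHERE code = ?",
                        (value,)).fetchone() is not None

conn = sqlite3.connect(":memory:")
for t in CODE_TABLES.values():
    conn.execute(f"CREATE TABLE {t} (code TEXT PRIMARY KEY, descr TEXT)")
conn.execute("INSERT INTO state_codes VALUES ('CO', 'Colorado')")
print(is_valid_code(conn, "STATE", "CO"), is_valid_code(conn, "STATE", "ZZ"))
# True False
```

The caller gets one uniform interface; the schema keeps one honest table per domain.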
*** Fewer tables in a database usually means less code meaning less bugs.
Just plain false. Fewer tables in a database generally means more code
meaning more bugs.
*** All the features you provide in your OTLT will be available for all
lookup values. So if you provide a method to order values in HTML forms then
this facility will be available for all lookups. You could provide default
values or transformations that might turn UNKNOWN values to NULL.
It's easy to propagate standard features to a new table. It's a moot point.
*** I can understand that they don't feel comfortable with storing the
allowed values 1,2,3,4 in a varchar field within our OTLT. Would it help if
we regard these values not as being integers but simply as valid values that
are stored in our valid values table.
It isn't just the form of the data that I'm not comfortable with. The
CHAR(2) code "CO" can mean "Continental Airlines" or "Colorado" depending
on the context. These two codes are from different domains, even though
they have the exact same form.
My answers are just the tip of the iceberg. What I really want your friend
to learn is that database experts aren't total fools. Most of us have been
around real live databases, and we know what can go wrong from direct
experience, not just ivory tower dreaming. This is true regardless of
whether we learned from our own mistakes, or from the mistakes of others.
Nicely phrased, and you're quite correct, my context is not the same as
yours. My initial background is not databases at all - it's data
files...potentially reams and reams of them. Think scientific programming
back in the late '70's, '80's and early '90's and you'll have a good idea of
my initial computing environment. The typical data file would be
individually well-described and well-structured, both text and binary, but
there really wasn't a strong argument for using databases. netCDF is a good
example of the kind of system that a scientific data shop would use. There
is a lot of discipline, no less so than in a good RDBMS-oriented place, but
one does not think quite the same way.
I suspect the first database that I had anything serious to do with was
likely FoxPro 2.x, which was in use at one organisation for writing an app
to allow people to easily define queries to retrieve end-user data sets
(i.e. final-format data in tables that had already been through very
rigorous conditioning and computation....this latter process being more my
end of things).
At the time that I first started getting introduced to RDBMSs, which
wouldn't be much over a decade ago, Google didn't exist yet either. I don't
recall really using any other Internet search engines back then. I believe I
read a bunch of books (probably including some not so good ones), and I
lucked out in having some fairly database-savvy bosses at my first few
"commercial" jobs. But there's no question that my initial self-education,
however well-intentioned, was imperfect, and I made quite a few mistakes. I
suspect what kept me interested was this: having a mathematical bent made
relations intuitive (and so ideas of normalization inevitably followed), and
I also intuitively liked the declarative paradigm...i.e. SQL core. So I
thought it was worth pursuing.
Certainly my comments are phrased against the above background.
I might add, with respect to some of your final comments, that I'm not so
sure I'd want people who give up on using RDBMSs to switch back to files.
Because proper use of files is not a trivial matter. To put in a recent
timeframe, would _you_ want a person who can't get SQL and normalized tables
and indices to switch over to XML and XSLT? :-)
AHS
Not at all. But in order to learn from a mistake you need to know that
something is a mistake in the first place. To keep the discussion to
databases, you need to see a database that tackles a task differently than
you would have done it yourself - this may not happen. Or maybe you have a
peer or superior who sees your design and points out that you're heading
down the wrong path...or maybe you don't have that. Or maybe your choice of
professional books is good but incomplete, or just too basic. Maybe your
Internet searches are missing the mark because, being unaware of a concept,
you never search on suitable keywords. Maybe you get bad advice on a NG but
you don't know it was bad.
One company I worked at, around the turn of the century, happened to be
loaded with talent...guys that were very up to speed on DBMSs and SQL. I was
lucky to get that experience. Because in no job since, whether FT or
contract, have I seen people that skilled in database design. Not on a
programming team with one large Canadian IT company, not on a programming
team assembled by a very large multinational consulting shop, and certainly
not at any of the smallish operations I've worked for/with. So I can
certainly see how any number of programmers can start working with databases
without ever having had the benefit of someone else's experience. Failing
that it comes down to self-education, and you can miss stuff doing
that...there will be mistakes along the way.
AHS
Indeed. But I hear the words of Harvey "The Wolf" Keitel carried faintly on
the breeze... :-)
> So I hope we can
> explore this subject in some detail, and generate more light than heat.
Let's have a go.
> I'm more familiar with embedded SQL as an interface to application code
> than
> I am with ODBC or JDBC interfaces. (I never programmed in COBOL but I
> taught database programming to COBOL programmers). I suspect that low
> level
> programming carries a cost with it, and that part of that cost is that
> programmers continue to "think like programmers", instead of learning to
> think in SQL. That could be part of the attraction of OTLT and EAV to
> them.
>
> But the embedded SQL that I used was not "dynamic", as I understand the
> term. The precompiler processed the embedded SQL, opened a database
> whose
> metadata was supposed to be the same as the target database, rewrote the
> SQL in some lower level language that I didn't need to know, and replaced
> the embedded SQL with calls to these generated lower level routines
> before
> passing the result to the compiler.
Yep, that's plain old embedded SQL.
> We had something we called "dynamic"
> SQL as well. But that involved generating SQL at run time, and letting
> an
> SQL interpreter parse it. I suspect you are using "dynamic" in a
> different
> sense than I am, but I'm not sure.
I mean something different (I think). Dynamic SQL is somewhere intermediate
between embedded SQL and coding to an API directly. With dynamic SQL you
construct your SQL statement on the fly by concatenating strings or
whatever. (It doesn't matter where it comes from; you could even get the
user to key it by hand at run time.) You then send the SQL statement off to
the server and use the DESCRIBE statement to get a descriptor back that
tells you what the output (if any) of the query is going to look like. You
then set up a buffer with a sufficient amount of storage and a description
of where each column value will go in the buffer (the details are host
language-specific). You then tell the server to execute the statement, and
then fetch successive rows into the buffer and iterate over the returned
columns. It sounds complicated, but having done it once you find (1) it's
not that hard after all, and (2) the same bit of code is reusable virtually
everywhere you need to do such a thing.
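In an API like Python's DB-API, for instance, that descriptor step surfaces as `cursor.description`. A rough sketch of the execute-describe-fetch loop, where the statement text is arbitrary and known only at run time:

```python
import sqlite3

def run_any_statement(conn, sql):
    # Execute arbitrary SQL text, then interrogate the result descriptor
    # to learn the output shape -- the dynamic-SQL DESCRIBE step.
    cur = conn.execute(sql)
    if cur.description is None:                # DDL/DML: no result set
        return None, None
    columns = [d[0] for d in cur.description]  # the "descriptor"
    return columns, cur.fetchall()

conn = sqlite3.connect(":memory:")
run_any_statement(conn, "CREATE TABLE t (a INTEGER, b TEXT)")
run_any_statement(conn, "INSERT INTO t VALUES (1, 'x')")
cols, rows = run_any_statement(conn, "SELECT * FROM t")
print(cols, rows)  # ['a', 'b'] [(1, 'x')]
```

Exactly as described: having written it once, the same bit of code handles any statement against any table.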
> In any event, the interesting thing is, IMO, not so much at what point
> the
> SQL is parsed and bound, but how automatic the process is of generating
> the
> SQL needed to reference and maintain a new code table.
That's a "how long is a piece of string" question. It could be easy or you
could want all kinds of bells and whistles.
> Back in the day,
> new code tables didn't occur often enough to do some heavy automation of
> the
> process, but the necessary coding was so well understood that the manual
> effort was relatively trivial.
Agreed, although I suspect the turnaround would still be measured in days at
least, whereas adding a row to a lookup table takes seconds. The case we
have to answer is why we think those days are worth it. My argument--which
I won't spell out here--is that someone still spends days dealing with the
new code, it's just someone else and they do it later, over and over again.
The net benefit to the business of being able to improvise new codes
willy-nilly will usually be near zero or worse in the long term. I currently
deal with two systems that have development teams that are maybe 25-50%
bigger than they need to be because productivity is so low owing to
mysterious bits of the business model that are encoded in EAV and OTLT
tables instead of being made manifest in the database design. The knowledge
of these things is transmitted orally, like the folklore it is, and then
code is written to manifest the real business model at run time.
> So the idea of storing all in one table "to
> save programming effort" seemed like a bad joke to me. And the "delay
> due
> to the cranky DBA" was not the kind of issue in my context that it is for
> some of today's programmers.
>
> In all my years of visiting newsgroups and other forums, I've never quite
> learned why programmers think that database programming is so difficult.
It is, doing it the way some of them do it. I just finished a re-write of a
batch job that consisted of over 14,000 lines of code and sent 2.5 million
queries to the server, and ran for 4.5 hours. When I'd finished with it, it
was 600 lines of code and it sent 6 queries to the server and ran in 12
seconds. And I am pretty sure one of the six queries isn't needed.
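The usual culprit in rewrites like that is row-at-a-time code where one set-based statement would do. A toy illustration (Python over SQLite, invented table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, float(i)) for i in range(1000)])

# Row-at-a-time style: one query per row (the 2.5-million-query shape).
slow_total = 0.0
for (oid,) in conn.execute("SELECT id FROM orders").fetchall():
    (t,) = conn.execute("SELECT total FROM orders WHERE id = ?",
                        (oid,)).fetchone()
    slow_total += t

# Set-based style: let the server do the work in one query.
(fast_total,) = conn.execute("SELECT SUM(total) FROM orders").fetchone()
print(slow_total == fast_total)  # True
```

Same answer, 1001 round trips versus one; at real table sizes and over a network, that is the difference between hours and seconds.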
> I
> think this discussion might help me understand that. Thanks for a
> considered reply.
Roy
I did a little scientific programming myself back in the 70s. Actually, it
was engineering oriented business programming. For example, I had to
modify a program written to keep track of the "inventory" for a company that
ran a natural gas pipeline that was about 200 miles long, and fed about two
dozen retailers. "Inventory management" of a product like natural gas is
more like scientific work than commercial work, even though the purpose was
commerce. At that time, I had no database background either. Traditional
inventory management involves counting things. This kind of inventory
management involved thermodynamic formulae.
I once heard someone say that scientific programming involves very
sophisticated processing of fairly simple data, while commercial
programming involves fairly simple processing of richly structured data.
It's an oversimplification, but there's some truth in it.
My "conversion" to database work took place at the same time as my
switchover to business oriented information systems. To me, it was the
beginning of a second career.
As far as telling some people to stick to files, that's not as snide as it
might sound. In the first place, there is actually some work that *should*
be done in XML. (I don't know any XSLT, but I presume the authors of it are
not total fools.) Some of my fellow experts disagree with me on this score,
but that's my story and I'm sticking to it.
Second, many of today's databases are built to be embedded in a single
application. The databases I worked on in the 80s were intended for use by
a large and growing number of applications. For example, data came from
workstations where disk drives were being built. It could be used for
inventory management, worker incentive management, work scheduling, assembly
line balancing, quality control, and even underlying manufacturing process
analysis, leading to improvements. There were separate applications or
tools for each of these functions, but they all shared data.
Contrast this to a contract where the client wanted some custom reports
written from the data in a database that was part of a purchased application
package. When I got to hear from the lead programmer for the app vendor,
he said, quote: "There is no user useable data inside the database. If you
want to use the data, you have to come in through the application". I
declined the contract.
Today, creating a new database has become so easy that many programmers use
a DBMS to manage data in almost exactly the same way they would use a file
system to manage data. I'm not being facetious when I suggest that some of
those people would get more productivity out of files.
Ultimately data sharing and data management on an enterprise wide scale is
more productive if you learn a good RDBMS, database programming, database
design, data modeling, and data analysis. I don't know where the crossover
point is from where the shorter learning curve for files and the greater
potential for databases balance each other.
I just encountered a situation which shows a part of the problem. This
system uses a codes table and a duplicate had been entered
inadvertently. This caused my application to operate incorrectly as
its SELECT INTO failed. So a data change broke my code.
Which reminds me of a maxim I like and try to follow:
It's not how well it works when it works that matters. It's how well
it works when it doesn't work that matters.
So until a programming or data error happens, EAV (which includes
OTLT) seems to work fine. But later, during maintenance (code or data
changes) when a duplicate or other error is introduced into the
system, things start deteriorating. The value of DBMS enforced
integrity constraints begins to shine in the maintenance phase of the
system. EAV can never come close in large systems. When the original
chief programmer leaves, which system would you rather reverse
engineer? ('Cause I will bet you, the EAV design document is
nonexistent or, due to the nature of EAV, does not include the real
information you want about how the system really works.)
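Ed's failure mode — a duplicate code slipping into an unconstrained codes table and breaking a query that expects exactly one row — is easy to reproduce and easy to prevent. A minimal SQLite sketch (table names are mine, for illustration only): with a declared key, the duplicate is stopped at entry time, long before any application query can fail on it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE account_sources (
    source_code TEXT PRIMARY KEY,   -- the DBMS enforces uniqueness here
    description TEXT NOT NULL
)""")
con.execute("INSERT INTO account_sources VALUES ('WEB', 'Opened online')")
try:
    # The inadvertent duplicate never gets in; the data-entry operation
    # fails loudly instead of silently breaking queries later.
    con.execute(
        "INSERT INTO account_sources VALUES ('WEB', 'Opened on the web')")
    duplicate_slipped_in = True
except sqlite3.IntegrityError:
    duplicate_slipped_in = False
print("duplicate stored?", duplicate_slipped_in)
```

In an OTLT, the equivalent key would have to span (category, code), and any lapse in declaring it reintroduces exactly this bug.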
Ed
Very interesting. It's going to take me a while to absorb this. In the
meantime, a reaction.
"On the fly" could mean several things, and it could be done for several
reasons. In my experience with people using dynamic SQL (with DEC Rdb/VMS)
there were two kinds: computer linguists who were building expert systems
that verged on artificial intelligence, and people who didn't want to do a
thorough data analysis before writing code. These two groups had very
different reasons for deferring the construction of the query to the last
possible time.
I think the best I ever did was 15 minutes to 6 seconds. The amazing thing
to me is that almost all of the truly doggy code I've seen in database work
was justified as "we did it that way for efficiency". People who try to
optimize an application by doing manually what a good query optimizer does
mechanically almost always slow themselves down.
Back in the 80s, you saw the same thing among people who spent too much
time trying to write applications that would cause fewer page faults,
instead of getting the logic right.
> > I think this discussion might help me understand that. Thanks for a
> > considered reply.
>
> Roy
>
> Repeat:
> The net benefit to the business of being able to improvise new codes
> willy-nilly will usually be near zero or worse in the long term.
This is another jewel from you, Roy. Thanks for expressing it so clearly.
I think almost all the people who think the opposite way work for a software
vendor, and not for the client who will end up using and/or struggling with
the business data.
> I think the best I ever did was 15 minutes to 6 seconds. The amazing
> thing to me is that almost all of the truly doggy code I've seen in
> database work was justified as "we did it that way for efficiency".
> People who try to optimize an application by doing manually what a good
> query optimizer does mechanically almost always slow themselves down.
Good point, and in fact my anecdote above has an amusing prequel. About
four months earlier I had decommissioned a system for a customer who had
seven years worth of data on a high-end Alphaserver with a StorageWorks
array. They wanted to keep the data available so I suggested porting it to
a virtual PC on one of their available PCs. It had a single very large
capacity disk drive which was just big enough for the database. About a
month later I got a call complaining that it worked but the reports were
dog-slow. (Really? Whoda thunk?) Anyway, I consented to see what I could
do and a couple of hours of poking about showed it would be possible to use
a rule-based partitioning scheme on the tables to speed up the reports. A
couple of hours later the big tables were suitably partitioned and I turned
on parallel query execution to process the partitions in parallel. Presto!
The virtual PC with one disk now out-performed the original Alphaserver
running the same report by 40-50%. (If I'd had four or five disks to play
with it would probably have been 200-300% faster.)
Flushed with this notable triumph, when the next client came along with a
performance problem, I smugly trotted out this tale and made extravagant
claims for what I could do for them almost for free.
Wrong-o!
Unfortunately the next application had completely subverted the server (for
performance reasons) and was not only bringing the tables to the application
and doing the joins in the application, they had also completely hardcoded a
query strategy that actually started with one of the least restrictive
conditions. Anyway, to cut the story short, that's how I ended up doing a
re-write. The original programmer had made it impossible for the optimizer
to have a global picture of the problem; the code was incapable of using
server-side parallelism to exploit all the cores and register sets, and the
storm of messages flying back and forth between the application and the
server actually looked like a denial of service attack!
As Hoare wrote (and was quoted by Knuth): premature optimization is the root
of all evil. To which I add: especially when it is so premature you are
still clueless.
Roy
After dealing with performance problems on Rdb/VMS on several contracts, I
developed a short list of things to check for, regardless of where the
client told me to look for performance issues. Briefly summarized, they run
like this:
1. Defragment the Disks
2. Rough tune the virtual memory.
3. Map all unmapped tables (this one is good for Oracle as well)
4. Index all unindexed tables, unless there's a real good reason not to.
(Oracle gives you an index when you declare a primary key. Rdb didn't,
back then)
5. Rebuild all sorted indexes that allow duplicates (a flaw in Rdb).
6. Put over-populated hash indexes in a right-sized area (read:
tablespace)
7. Eliminate redundant indexes.
8. Tune the queries in the application code.
9. Balance the load on the disks.
10. Increase the number of buffers for database pages.
If I got positive results on about 5 of these, it was enough to make me
look like a genius. I remember one site where the end users said "We didn't
realize you were going to install a whole new system." They were used to
the app delaying 10 minutes to insert a new contract, instead of only
waiting two seconds.
Then again, I remember the site where I recommended defragmenting the disks,
and the DBA treated me like I was criminally insane. That guy thought that
everyone was a moron, except for him.
PS: I like your comment about subverting the server. Some programmers like
to tell the server what to do, a little bit at a time, out of fear of what
will happen if they tell the server what they want and not how to retrieve
it. The programmers are just avenging themselves on the users who reveal the
requirements to the programmers a little bit at a time, instead of all at
once.
Don't worry. We had 'the meeting' today and we're definitely *not* going
forward with the OTLT 'solution'.
Frank
Good. But don't imagine the fight is over. You may have to defend that
decision repeatedly.
Roy
Yes, there is a fair bit of truth to that. Look at physical oceanography,
for instance. You get a huge amount of mileage out of nothing more than
time/position readings of temperature and salinity.
> My "conversion" to database work took place at the same time as my
> switchover to business oriented information systems. To me, it was the
> beginning of a second career.
>
> As far as telling some people to stick to files, that's not as snide as
> it might sound. In the first place, there is actually some work that
> *should* be done in XML. (I don't know any XSLT, but I presume the
> authors of it are not total fools.) Some of my fellow experts disagree
> with me on this score, but that's my story and I'm sticking to it.
I'm absolutely not knocking XML. I happen to think it's overused, but it is
a natural fit for many applications.
My actual point was, flat files deserve respect also. The attitude I was
thinking of is exemplified by a poster in another newsgroup, who had to
process legacy flat files produced by what the fellow referred to as "old
programmers who only know Fortran", or something like that. What he was
actually having a problem with was header lines. I figure, if a programmer
doesn't have the discipline or willingness to "get" databases, they won't be
so hot with regular files either.
I'd also be a bit dubious about such a programmer tackling binary data
files...
[ SNIP ]
> Today, creating a new database has become so easy that many programmers
> use a DBMS to manage data in almost exactly the same way they would use a
> file system to manage data. I'm not being facetious when I suggest that
> some of those people would get more productivity out of files.
I suspect you're right. I'd phrase it differently: depending on the
application any one of us might get more productivity out of files. Another
factor is this - a huge amount of the productivity associated with files
derives from skillful use of shell programs like awk, cut, paste, grep etc
etc, and the percentage of programmers that are dab hands with tools like
this is not great.
> Ultimately data sharing and data management on an enterprise wide scale
> is more productive if you learn a good RDBMS, database programming,
> database design, data modeling, and data analysis. I don't know where the
> crossover point is from where the shorter learning curve for files and
> the greater potential for databases balance each other.
Very interesting question. And then there is the question of how do you
classify things like Berkeley DB?
AHS
If you want to learn from mistakes, let them be someone else's
mistakes! Much cheaper that way.
> I'm absolutely not knocking XML. I happen to think it's overused, but it
> is a natural fit for many applications.
>
> My actual point was, flat files deserve respect also. The attitude I was
> thinking of is exemplified by a poster in another newsgroup, who had to
> process legacy flat files produced by what the fellow referred to as "old
> programmers who only know Fortran", or something like that. What he was
> actually having a problem with was header lines. I figure, if a
> programmer doesn't have the discipline or willingness to "get" databases,
> they won't be so hot with regular files either.
>
You are right. I still use flat files on occasion, and I use CSV files as
well. I also still tune in to AM radio on occasion. Our entire legacy
deserves respect. The people who went before us were just as smart as we
are, and some of them were even smarter.
I partly disagree with your assessment about people who don't "get"
databases. There are people who don't "get" query optimization or
concurrency management, or transaction isolation, but who work just fine in
a single user files environment. To me the question is more like: do they
or do they not "get" data? There are a lot of programmers who just don't
think that there is anything to "get" about data. These people tend to
think that a computer program is a bunch of tedious garbage, followed by
the words "procedure division" or "begin" or some such introduction to "the
real computer program".
Unfortunately, the object oriented paradigm has taught many people that
data is a bunch of details that are best hidden away where almost no one can
see them. In the context of object oriented programming, that is a
valuable truth. In the context of data management, it's the worst possible
attitude.
> I'd also be a bit dubious about such a programmer tackling binary data
> files...
>
> [ SNIP ]
> > Today, creating a new database has become so easy that many
> > programmers use a DBMS to manage data in almost exactly the same way
> > they would use a file system to manage data. I'm not being facetious
> > when I suggest that some of those people would get more productivity
> > out of files.
>
> I suspect you're right. I'd phrase it differently: depending on the
> application any one of us might get more productivity out of files.
> Another factor is this - a huge amount of the productivity associated
> with files derives from skillful use of shell programs like awk, cut,
> paste, grep etc etc, and the percentage of programmers that are dab hands
> with tools like this is not great.
I agree with your point. But I reiterate that there are some people who
will be more productive with files in an environment where you or I would be
more productive using a DBMS.
Not "getting" data is part of a larger problem, which is forgetting what
software is actually for. Most end users couldn't care less about how their
solution is implemented, they just want a useable program. Useable meaning
reliable, not too hard to use, reasonably fast and so forth. Not "getting"
data is also accompanied by not "getting" user interfaces, reporting
requirements, documentation, error recovery etc.
One job I worked at involved property management software. We didn't write
the stuff, we had to access the backend data for the various applications in
use. Some apps used databases, some used files. Let's just say that on a
scale of 1 to 10 not many of these programs rose above a 5 on any objective
criterion. The treatment of data was often particularly bad. As an example,
how can you have a system that has as one of its main purposes the
management of rental information, that only provides for the entry of
personal information about one tenant? Guess how the typical landlord or
property manager is going to store information about two or three people?
That's right, any way that works...like stuffing three last names into a
single surname field. One application at least anticipated that there might
be multiple tenants - they provided six separate fields to account for the
first and last names of up to 3 people...not exactly a stellar solution. And
these apps often featured what I call "just in case" columns in their
tables, columns labelled like Info1, Info2, Info3 or suchlike, not
infrequently exposed to the end user, and not surprisingly used by the end
user in very personalized ways to store data. In other words, dangling data
as far as standard (anticipated) retrieval was concerned.
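The "up to 3 people" design and its first-normal-form repair can be put side by side. The schema below is invented for illustration (no property-management product in this thread is being quoted); it's a sketch of why one row per tenant beats a fixed set of name slots.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- The flawed design: a fixed number of name slots, so tenant #4 gets
-- stuffed into whatever field still has room.
CREATE TABLE leases_flat (
    lease_id INTEGER PRIMARY KEY,
    first1 TEXT, last1 TEXT,
    first2 TEXT, last2 TEXT,
    first3 TEXT, last3 TEXT
);

-- First normal form: one row per tenant, any number of tenants per lease,
-- and every name lands in a column that retrieval can anticipate.
CREATE TABLE leases  (lease_id INTEGER PRIMARY KEY);
CREATE TABLE tenants (
    lease_id   INTEGER NOT NULL REFERENCES leases(lease_id),
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL
);
INSERT INTO leases VALUES (1);
INSERT INTO tenants VALUES (1, 'Ann', 'Smith'), (1, 'Bob', 'Smith'),
                           (1, 'Cam', 'Jones'), (1, 'Dee', 'Jones');
""")
n = con.execute(
    "SELECT COUNT(*) FROM tenants WHERE lease_id = 1").fetchone()[0]
print(n, "tenants on lease 1")  # no ceiling, no stuffed surname field
```

Searching for a tenant by last name is now a plain WHERE clause instead of six OR'd comparisons against name slots, and a fourth tenant costs nothing.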
This wasn't a failure of "getting" data - this was a failure to "get"
practically anything.
A lot of programmers don't "get" any of this because they are too focused on
the implementation details, and not enough on what the actual purpose of the
software is. Every programmer should have to work at least once directly for
an end user who just wants results, and doesn't care about the details of
how it's done. In real life not many programmers ever are answerable to a
client. And I mean answerable to a real end-user, not an internal end-user.
The kind of end-user that will look at your new app running on a Pocket PC,
and say "it's a pretty interface, and my IT manager tells me you guys have
done a nice job of integrating your server side with our ERP system, but you
know, my drivers will never use this".
>> I'd also be a bit dubious about such a programmer tackling binary data
>> files...
>>
>> [ SNIP ]
>> > Today, creating a new database has become so easy that many
>> > programmers use a DBMS to manage data in almost exactly the same way
>> > they would use a file system to manage data. I'm not being facetious
>> > when I suggest that some of those people would get more productivity
>> > out of files.
>>
>> I suspect you're right. I'd phrase it differently: depending on the
>> application any one of us might get more productivity out of files.
>> Another factor is this - a huge amount of the productivity associated
>> with files derives from skillful use of shell programs like awk, cut,
>> paste, grep etc etc, and the percentage of programmers that are dab
>> hands with tools like this is not great.
>
> I agree with your point. But I reiterate that there are some people who
> will be more productive with files in an environment where you or I
> would be more productive using a DBMS.
That's very likely true. You mentioned a crossover point before, and this is
of course a personal crossover point. But I think we'd both agree that there
are objective criteria also for when a file approach is indicated, and when
a DBMS is indicated.
AHS
End-users absolutely do not care about anything except how easily they can
get through the 9-5. They don't own the business and they don't pay its
bills and they have no investment in its most valuable (and expensive)
non-tangible asset--namely its data. The end users' opinions on the
subject of databases are as relevant as the cows' opinions on cheese-making.
> Useable meaning reliable, not too hard to use, reasonably fast and so
> forth. Not "getting" data is also accompanied by not "getting" user
> interfaces, reporting requirements, documentation, error recovery etc.
Not "getting" data means being a fraud and an imposter. No business buys
database applications just to run applications; they buy them to get
accurate information and for no other reason. Not "getting" the other
things you list means delay and awkwardness and you'll surely be in trouble
if you can't deliver efficiency and speed, but those come second. To prove
it, propose to a company director that you can make his slow systems twice
as fast but they'll produce undetectable corruptions of the data, and see if
he goes for it.
Roy
> Not "getting" data is part of a larger problem, which is forgetting what
> software is actually for. Most end users couldn't care less about how
> their solution is implemented, they just want a useable program. Useable
> meaning reliable, not too hard to use, reasonably fast and so forth. Not
> "getting" data is also accompanied by not "getting" user interfaces,
> reporting requirements, documentation, error recovery etc.
There is a larger problem, as you say, about not getting what software is
for. Or, generalizing some more, not getting what systems are for. But my
interest, in this discussion, is programmers who do get it about the
software systems they build but don't get it about the data their systems
manage.
Whether a useable program conceals all the data from the user or reveals
appropriate data to the user depends on the context. When I drive a car,
I'm a user of a system under the hood that controls the gas-air mixture. If
it protects my engine, delivers better power, and improves gas mileage, I'm
happy. I couldn't care less about the data it uses. But there are other
systems where the handling of data is critical to the useability of the
system.
I disagree. The "up to 3 people" case is clearly either a failure to
understand first normal form or an unwillingness to decompose records for
one reason or another. That's not getting data. The dangling data exposed
to the user for unanticipated purposes is a sin that I've committed myself.
It's only a sin when management later expects reports that aggregate
idiosyncratic data.
Which is every single time.
In a way, OTLT, where this discussion started, is a way of dangling some
columns where users (or perhaps programmers) can define any reference data
they want, so as to accommodate unanticipated codes.
[snip]
>Not "getting" data means being a fraud and an imposter. No business buys
>database applications just to run applications; they buy them to get
>accurate information and for no other reason. Not "getting" the other
>things you list means delay and awkwardness and you'll surely be in trouble
>if you can't deliver efficiency and speed, but those come second. To prove
>it, propose to a company director that you can make his slow systems twice
>as fast but they'll produce undetectable corruptions of the data, and see if
>he goes for it.
You just have to say it right. "I denormalise for performance."
I can't tell you the number of times programmers have come to me for help
with a query
that produces wrong results, and when I try to change their "select" to
"select distinct" to see if maybe that's the problem, they exclaim "Oh I
never use 'select distinct'! It runs too slow!"
I patiently explained to them that I first want to come up with a query that
is logically correct, then if necessary, come up with one that's logically
equivalent, but runs fast.
> I can't tell you the number of times programmers have come to me for help
> with a query
> that produces wrong results, and when I try to change their "select" to
> "select distinct" to see if maybe that's the problem,
Hey, wait a minute.
Usually, it's the other way round.
Some people _always_ run DISTINCT because they are not able to "remove
duplicates" :-)
> I patiently explained to them that I first want to come up with a query that
> is logically correct, then if necessary, come up with one that's logically equivalent, but
runs fast.
If "fast" comes first, any program will take longer to be "correct" and
will seldom ever be "clean".
When I see
SELECT DISTINCT
in anything but an ad hoc query, I know something is wrong with the
query. Possibly it is joining to a wrong table (to a detail table
instead of a header/parent table for example) or some filtering
condition is missed (a column not in the select list needs to be
used), or some join condition was missed (there is a compound key and
one component was left out). Even in an ad hoc query I tend to use
SELECT COUNT(*)
since that gives me a little bit more information for the same cost as
DISTINCT (i.e. it reads the same amount of data).
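Ed's diagnosis — that DISTINCT often papers over a join to the wrong table or a missing condition — can be demonstrated with a toy schema (tables and data are mine, purely illustrative). The buggy query joins to a detail table and repeats each parent row once per line item; DISTINCT hides the symptom, while restating the real condition needs no DISTINCT at all.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, customer TEXT);
CREATE TABLE order_lines (order_id INTEGER, item TEXT);
INSERT INTO orders VALUES (1, 'Frank');
INSERT INTO order_lines VALUES (1, 'widget'), (1, 'gadget'), (1, 'sprocket');
""")
# Symptom masked: the join to the detail table multiplies rows, and
# DISTINCT is slapped on to collapse them again.
masked = con.execute("""
    SELECT DISTINCT o.customer
    FROM orders o
    JOIN order_lines l ON l.order_id = o.order_id""").fetchall()
# Cause fixed: the query states what it means -- customers having at
# least one order line -- so no duplicates arise in the first place.
fixed = con.execute("""
    SELECT o.customer
    FROM orders o
    WHERE EXISTS (SELECT 1 FROM order_lines l
                  WHERE l.order_id = o.order_id)""").fetchall()
print(masked == fixed)  # same rows; only one query says what it means
```

Both return the same set here, which is precisely why the masked version survives review until the day the join multiplies something that matters.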
Seeing this post, I'm actually a little disappointed in you, David.
Your last sentence does show you mainly follow a good approach. It's
just that DISTINCT should be left to ad hoc queries IMO.
Have a good day.
Ed
> When I see
> SELECT DISTINCT
> in anything but an ad hoc query, I know something is wrong with the
> query.
Eh?! How can you possibly "know" any such thing? Projecting away
distinguishing columns is perfectly respectable, desirable and necessary.
And since SQL won't discard the resulting duplicates automatically (unless,
bizarrely, it is in a UNION) we need to tell it to do so explicitly.
> Possibly it is joining to a wrong table (to a detail table
> instead of a header/parent table for example) or some filtering
> condition is missed (a column not in the select list needs to be
> used), or some join condition was missed (there is a compound key and
> one component was left out). Even in an ad hoc query I tend to use
> SELECT COUNT(*)
> since that gives me a little bit more information for the same cost as
> DISTINCT (i.e. it reads the same amount of data).
>
> Seeing this post, I'm actually a little disappointed in you, David.
> Your last sentence does show you mainly follow a good approach. It's
> just that DISTINCT should be left to ad hoc queries IMO.
This is not a matter to be decided on the basis of mere opinion.
Roy
I know what you are saying, but I regard this and a few other things
as symptoms of bad SQL, not absolute proof of it. I look for a CROSS
JOIN that wasn't supposed to be there. But SELECT DISTINCT can be
valid.
Other symptoms are the use of ORs when an IN() would work. Likewise,
failure to use BETWEEN is another symptom. These tell me he has not
un-learned his original procedural language.
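Joe's two symptoms are simple to show side by side. The schema is invented for illustration; both queries below return the same rows, but the second pair of predicates reads as set logic rather than transliterated procedural code.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE accounts (
    account_id INTEGER PRIMARY KEY, source TEXT, opened INTEGER);
INSERT INTO accounts VALUES
    (1, 'WEB', 2006), (2, 'MAIL', 2007), (3, 'BRANCH', 2008);
""")
# Procedural habit: a chain of ORs and paired comparisons.
verbose = con.execute("""
    SELECT account_id FROM accounts
    WHERE (source = 'WEB' OR source = 'MAIL')
      AND opened >= 2006 AND opened <= 2007
    ORDER BY account_id""").fetchall()
# Set-oriented habit: IN() and BETWEEN state the same membership and
# range tests declaratively.
idiomatic = con.execute("""
    SELECT account_id FROM accounts
    WHERE source IN ('WEB', 'MAIL')
      AND opened BETWEEN 2006 AND 2007
    ORDER BY account_id""").fetchall()
print(verbose == idiomatic)
```

The optimizer will usually treat the two forms identically; the difference is for the next human reader, which is the point of un-learning the procedural style.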
Keep the context in mind. David was noting cases where he was asked to
<quote> help with a query that produces wrong results, and when I try to
change their "select" to "select distinct" to see if maybe that's the
problem </quote>
I am saying DISTINCT should be one of the last things tried in this
situation. I usually begin by asking the programmer: so exactly what
is wrong with the results? It is easier to debug code when you know
what the result should be.
But yes I stick to my opinion that DISTINCT is seldom needed in
production queries. Right now I cannot think of any case where I
needed it in a production application.
Ed
>"Ed Prochak" <edpr...@gmail.com> wrote in message
>news:4b1507c9-207e-4555...@59g2000hsb.googlegroups.com...
>> On Jun 26, 12:31 pm, "David Cressey" <cresse...@verizon.net> wrote:
>
>> When I see
>> SELECT DISTINCT
>> in anything but an ad hoc query, I know something is wrong with the
>> query.
>
>Eh?! How can you possibly "know" any such thing? Projecting away
>distinguishing columns is perfectly respectable, desirable and necessary.
Why is it necessary?
[snip]
Keep in mind this is in context of someone coming up asking for help
on a query. DISTINCT would be about the last thing I would add to a
query in that case. I think you understand this, but I say it for the
SQL newbies.
>
> I know what you are saying, but I regard this and a few other things
> as symptoms of bad SQL, not absolute proof of it. I look for a CROSS
> JOIN that wasn't supposed to be there. But SELECT DISTINCT can be
> valid.
Agreed. Maybe you can provide an example, but I cannot recall ever
using DISTINCT in a production query.
>
> Other symptoms are the use of ORs when an IN() would work. Likewise,
> failure to use BETWEEN is another symptom. These tell me he has not
> un-learned his original procedural language.
Good points, Joe. Thanks for the follow-up.
Ed
[snip]
>But yes I stick to my opinion that DISTINCT is seldom needed in
>production queries. Right now I cannot think of any case where I
>needed it in a production application.
Not so long ago, I removed all of them from my app. Use a group
by. If that will not do it, then the original query is screwy.
You may have misunderstood as to who I meant by end users. Or maybe I
misunderstand by who you mean by "don't own the business". The end user to
me is the person who owns the database - it's their data...and presumably
also their business. Either that, or they work for the guy who owns the data
and business. Seems to me that they do have an investment in the data.
>> Useable meaning reliable, not too hard to use, reasonably fast and so
>> forth. Not "getting" data is also accompanied by not "getting" user
>> interfaces, reporting requirements, documentation, error recovery etc.
>
> Not "getting" data means being a fraud and an imposter. No business buys
> database applications just to run applications; they buy them to get
> accurate information and for no other reason. Not "getting" the other
> things you list means delay and awkwardness and you'll surely be in
> trouble if you can't deliver efficiency and speed, but those come second.
> To prove it, propose to a company director that you can make his slow
> systems twice as fast but they'll produce undetectable corruptions of the
> data, and see if he goes for it.
>
> Roy
I'm pretty sure that describing _any_ potential program flaw or deficiency
accurately and honestly to a buyer is going to result in no sale or a demand
for a fix. You're right - if I told a potential purchaser that he'll have a
responsive interface and beautiful reports, but that he really can't rely
100% on data integrity any more, he'd bail. But so too would he bail if I
told him that data integrity was assured, but that the user interface and
documentation were so bad that the operators of the system would be highly
likely to enter an unacceptable amount of incorrect information. That is,
the information will be faithfully stored and retrieved, but it just won't
be meaningful.
As an example, I am reasonably familiar with a CRM system for a large ISP.
One of the functions of the CRM is to allow for quick description of the
nature of the interaction with the customer, in a fairly standard way -
department(s) who handled the interaction, main category for the issue,
subcategories, and a similar breakdown for resolution. IOW, a fairly
standard thing for a CRM. The databases are reliable enough, but the
interface for doing the above call-wrapping is so bad that interactions are
simply not being correctly described. It's too difficult to find the correct
categories/subcategories, the overall interface was very badly designed
anyway (something I frequently see with Web UIs), and in any case the actual
choices are also not well thought-out. This is a classic GIGO
system...although one may be assured that the garbage will be safely stored
and speedily retrieved.
So, is this a data problem, or something else? Some of it actually is a data
problem, insofar as data also involves proper choices of acceptable domain
values, but a lot of it is a UI problem. Not much point in making it
possible for a user to do the right thing but ensuring that it's usually
easier to do the wrong thing.
AHS
Your comment doesn't agree with my experience. In general, there are two
circumstances where SELECT DISTINCT is the right approach. Both involve
situations where SELECT generates a bag, but the desired result is a set.
The first is a situation where the database has been misdesigned. If the
database has been frozen for a year or more, and there is a lot of
production software that depends on the existing table design, it may
simply be unfeasible to correct the design in order to make one query
logically easy to write.
The second is a situation where the query requires a subset of the data that
does not include any candidate keys from the underlying tables. In that
case, the generation of a query requires its own mechanism for eliminating
duplicates.
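David's second case — a projection that keeps no candidate key — is the textbook legitimate use of DISTINCT. A small illustrative sketch (schema invented here): once every key column is projected away, the result is a bag unless the query itself eliminates duplicates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,  -- the only candidate key
    name TEXT, city TEXT
);
INSERT INTO customers VALUES
    (1, 'Ann', 'Boston'), (2, 'Bob', 'Boston'), (3, 'Cam', 'Austin');
""")
# The projection drops customer_id, so duplicate cities are inevitable;
# DISTINCT is doing real, necessary work here, not masking a bad join.
cities = con.execute(
    "SELECT DISTINCT city FROM customers ORDER BY city").fetchall()
print(cities)  # [('Austin',), ('Boston',)]
```

A GROUP BY city with no aggregates is logically equivalent, which is the substitution suggested elsewhere in this thread; which spelling is clearer is a matter of taste.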
Perhaps the development of a new program, a long time after the database is
in production, and not easily altered, has much the same flavor of what you
have called an "ad hoc query". Using the database to organize data in a way
that the designer did not contemplate.
You weren't talking to the programmers I was talking to.
>
> But yes I stick to my opinion that DISTINCT is seldom needed in
> production queries. Right now I cannot think of any case where I
> needed it in a production application.
>
There is a large body of queries where SELECT and SELECT DISTINCT can be
shown to necessarily produce identical results. I can't help wishing that
the optimizer could figure it out. But that's probably way down on the list
of things an optimizer should be able to do.
I will strongly suggest to you that SELECT DISTINCT is more faithful to the
relational data model than SELECT ALL is. I also think that it's weird that
UNION defaulted to UNION DISTINCT instead of UNION ALL.
I stand by my original comment. If I can show that SELECT DISTINCT produces
the right results where SELECT (meaning SELECT ALL) produces wrong results,
then the programmer and I have saved about half an hour of discussing "what
are you really trying to do?".
Sometimes the DISTINCT feature had to be added to a subquery. That gets
really interesting.
And I didn't always do this when consulted by a programmer. Only when
something gave me a hunch that the programmer was thinking in terms of sets
while SQL was thinking in terms of bags.
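A sketch of the subquery flavor, again with sqlite3 (names invented): a plain join counts a customer once per matching order, so the DISTINCT belongs inside the subquery rather than on the outer SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cust_id INTEGER, status TEXT);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO orders VALUES (10, 1, 'open'), (11, 1, 'open'), (12, 2, 'closed');
""")

# Counting customers with open orders via a plain join double-counts Ann...
naive = cur.execute("""
    SELECT COUNT(c.cust_id)
    FROM customers c JOIN orders o ON o.cust_id = c.cust_id
    WHERE o.status = 'open'
""").fetchone()[0]

# ...so the duplicate elimination goes into the subquery.
fixed = cur.execute("""
    SELECT COUNT(*)
    FROM (SELECT DISTINCT cust_id FROM orders WHERE status = 'open')
""").fetchone()[0]

print(naive)  # 2
print(fixed)  # 1
```

The same fan-out hazard is why moving DISTINCT into the subquery "gets really interesting": slapping it on the outer SELECT instead would not fix the count at all.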
Okay, you have a case where the design error is not in the query. We
agree there is still something wrong here.
>
> The second is a situation where the query requires a subset of the data that
> does not include any candidate keys from the underlying tables. In that
> case, the generation of a query requires its own mechanism for eliminating
> duplicates.
I'd like to see an example of this case.
Then the question would be: is this case a rare exception?
I think it is very rare, since honestly I haven't seen it.
You seem to be saying it is not so rare (especially in your other
post). If so, an example or two should be easy to show.
Maybe I am making some queries more complicated than they need to be.
Please show me the error of my ways.
>
> Perhaps the development of a new program, a long time after the database is
> in production, and not easily altered, has much the same flavor of what you
> have called an "ad hoc query". Using the database to organize data in a way
> that the designer did not contemplate.
I am thinking of ad hoc as in a throwaway query. One you might do to
research an issue in the system. Occasionally it might be a business
request, but most are developer queries as part of new development
(especially when you just started a new job and there is little or no
documentation). And I would say it includes organizing data in ways
the designer did not anticipate. Whether these new ways are useful or
not is yet to be determined. So I guess, yes, in a few cases these
queries might work their way into a new program.
Ed
Actually, it is not too hard to detect in many cases. If the result
set has a key in it without a cross join, then the rows are UNIQUE.
Perfect hashing and bit vector indexing can also spot and reject
redundant duplicates during processing.
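The key-detection observation is easy to check directly (table names invented): when a candidate key appears in the select list and no cross join has multiplied the rows, SELECT and SELECT DISTINCT necessarily return the same result, which is exactly the equivalence an optimizer could, in principle, exploit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, dept TEXT)")
cur.executemany("INSERT INTO emp VALUES (?, ?)",
                [(1, "sales"), (2, "sales"), (3, "ops")])

# The key emp_id is in the select list, so every row is already unique:
plain = cur.execute("SELECT emp_id, dept FROM emp").fetchall()
dedup = cur.execute("SELECT DISTINCT emp_id, dept FROM emp").fetchall()

print(plain == dedup)  # True
```

In this situation DISTINCT is a pure no-op; whether a given engine actually skips the duplicate-elimination step is an implementation detail.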