The easiest mistake I've bumped into thus far is to disregard the
presence of nulls. They're the kind of vice you should know, but which
you still don't mind enough. They just sort of slip through your
fingers.
My vice was the basic temporal table. You have the start time, and the
end time, then you have the rest of the key attached to those, plus
finally the dependent data, nicely normalized. You'd think you're
okay, and off to well-normalized salvation. Except that you let those
pesky nulls pass onto the end date. Nothing sinister in that -- after
all, the data that is current does not *have* an end date. The the
null is *such* a convenient marker for this very special case.
Perfectly suited for the occasion; and as it happens, for ending up in
the data purgatory.
All of the sudden your code is littered with is-null statements or the
converse.
You're fighting to keep the special cases in control when summarizing
your table.
Both entity and referential integrity are out of the window, since
nulls and keys don't really mingle. In the worst case, if you happen
to be repentant, you're already in limbo with null-minded triggers.
As Oracle at least doesn't index nulls, your carefully tuned table
refuses to perform with the most common online query, which is to ask
for the current status of a single entity that is currently
valid...having a null end time. Try as you might, indexing only makes
things worse since you want to index a null, and the DBMS just won't
comply.
Not to mention the fact that when you optimize your indexes for online
queries, suddenly you need another set of entirely different indexes
for the historical data. 'Cause it surely don't get hit by OLTP stuff.
It's all about temporal analytics, which translates to stabbing or
intersection queries on the start and end dates. This time onto
definite ones -- which of course cannot be satisfied by indexes well
enough, because of the possibility of a null popping up somewhere, and
the fact that just about nobody besides yourself knows how to exclude
that sort of behavior in logical level SQL. Suddenly your OLAP load
starts causing entire table scans, and pays no nevermind to physical
level optimization.
There was a point, once, where you could have intervened. That was
when you allowed the null in the end of validity column. Had you just
realized the hint that null/tuple mark was giving you, you might have
realized that you're actually trying to model two different kinds of
data altogether: current, transactional data which is mostly accessed
by key-absent-time, and properly indexed as such, and then historical
data which is mostly accessed for analytical, statistical purposes,
based on time. That the presence of the null actually suggested this
might be the case, from the very start, and that the special nature of
nulls suggested indexing would go amiss from the start. Not to mention
the hassle in maintaining the queries, the views, the constraints, and
so on. But no, you made your choice and now you're in data hell, with
the level steadily descending towards ASCII-in-an-unformatted-
flatfile.
I've committed the sin of "current data has a null end date" something
like three to four times. Thus far it's only bitten me back once. But
it's still one evil construct, alluring, far-reaching and just plain
wicked, all at once. It'll definitely return to haunt me in the
future. Don't go there, then. Separate the current and historical
data, at least for the most part, and if you don't, at least do not
use null as the separating factor.
Did you try FBI (on pseudocolumn such as "case when x=null then 'N/A'
else x end" )?
> I'm no Catholic, but I've still sinned my part in modeling data, and
> need my absolution.
That was a truly admirable confession although I am an invincible
agnostic.
Unfortunately my sin might be worse in a way. I claim never to be
willing to admit nulls into a design, yet faced with a mob of 30+
technical reviewers, down-stream users, and developers from other
departments all furiously telling me I'm wrong and that I'm making
things needlessly difficult, I have usually found it easiest to just
cravenly give in to them. Telling them that I am revealing rather than
creating complexity never works. I haven't the time or energy to
repeat the arguments and the demonstrations 30 times. I give in, go
home on time, and get paid just the same. Not very admirable huh?
I am looking forward to Part 2 now.
PS: Does anyone else see how ironic it is that in one thread I am
telling someone how a single table design can sometimes be OK and in
another I am claiming that I get accused of making things needlessly
complex? It's enough to give me a crisis of faith (or at least a
hissy-fit)!
--
Roy
A long time ago when I thought I already knew something about data but
hadn't yet worked with databases -- telling in itself -- I had the
opportunity to spread my views on file formats to a bunch of nerds. Of
course I already knew that no such discussion would be complete
without at least mentioning normalization. So I decided to illustrate
it in passing by an example. Since obviously the goal of normalization
is to remove redundancy, the first example naturally was pushing
strings into a lookup table and linking to them via their index in it.
Now, perhaps I could have gotten away with that had a gone on to
actually factor something into 3NF, but no. That was the only example.
I don't think Hail Maries quite cut that one, eventhough in a very
limited sense the part about redundancy is true.
Half a decade later I was designing one of my first schemas. Now I
actually knew what normal forms and integrity were about. But it seems
old habits die hard. The result was a beautifully constrained and
normalized base...with every single entity keyed by surrogate. I was
happy and went my merry way.
A couple of years ago I decided to see how the base was doing
nowadays. It was humming along nicely. The surrogates bothered me a
bit, but since they did simplify a couple of practical things and
indeed compressed the base (to its then whopping 20MB), I thought
everything was fine. Until I realized two minor details. First, I
hadn't been stern enough with the web developers, so they'd of course
put my surrogates on the user visible reports, and now they'd leaked
into the wild. Ooops. And second, it suddenly dawned on me why the
rest of the stuff seemed to be in order: during development I had
indeed been strict enough in some things. There was not a single null
in the schema...nor a single table auto generating surrogates which
wasn't also constrained by a unique key other than the surrogate.
There are no free lunches, but there is such a thing as karmic debt. I
think that goes a long way towards explaining why I now work with the
internals of a Siebel instance. In case you haven't heard, that's the
product whose trainer asks on the top of the introductory course
whether anybody present is a DBA. In case one is found, he's told he's
not going to like what he's about to see. One of those things is that
everything, and I mean *everything* is keyed by surrogates, and in one
particular table not a single field besides that surrogate is declared
NOT NULL.
Of course, my debt being what it is, we're not quite done yet. You
see, our setup is the central hub for a number of independent Siebel
installations. We distribute, deduplicate and publish data between
them. That means that each of the systems has its own, identically
formatted, and overlapping in range surrogate for each of the objects.
Since our system is supposed to have something to do with master data,
it of course additionally generates its own, authoritative, master ID
for each new object. Which isn't to say that some of the affiliate
systems wouldn't have their own, autogenerated, user visible keys in
addition to what Siebel does under the covers. The result is that the
maximum number of separate autogen keys for a single object I've seen
thus far has been in the vicinity of ten.
My title is global data administrator. It basically means I'm
responsible for the quality and uniqueness of the data.
--
Sampo
As an addendum, I just learned today that one of our affiliate systems
contains a nifty feature: they can manually change this authoritative,
master ID. And no, updates to our database are not correlated by the
ID we assigned. Instead they're done via some fuzzy logic, among other
things involving the surrogate the source system is using. The result?
We have duplicate, authoritative master ID's now.
That problem is not a new one. A thorough cleansing exercise already
took place before my "reign" to eradicate it. It's just that nobody
thought of declaring the problematic attribute UNIQUE after the
cleasing was done, and the problem recurred. (It couldn't have been
declared the primary key since, after all, we're working on top of
Siebel's out-of-the-box data model which elects to place objects there
which cannot be assigned one of our ID's. Plus of course the primary
key is already set as the Siebel internal surrogate.)
On the plus side, I was assured the fix to stop this from happening
again was coming Any Day Now. It essentially consists of keeping the
matching logic the same, but ignoring updates to the master ID.
Correlating updates by our ID was apparently not an option, eventhough
one of my colleagues has been talking about that ever since the
cleanup took place.
As the perennial optimist, I'm hoping the new code rejects updates
with changed master ID's or actively reverts them. But at this point I
think it's more likely that the update will be accepted, only with the
master ID stripped off, giving rise to inconsistency between the two
communicating systems.
The lesson? If you couple loosely, you don't need yet another key but
simply a rock-solid mapping mechanism. If you try to do MDM, you
should have the authority to force your affiliates to adopt your
master key as the One True Key. Mixing the two approaches ain't really
an option.
--
Sampo