
Mapping arbitrary number of attributes to DB


Joshua J. Kugler

Oct 20, 2006, 9:15:08 PM
Hello all.

A little background: I've done a fair bit of database application
programming over the years, starting with MS Access 2 and 97 <shudder> and
have read a lot about database theory, starting with this book:
http://www.amazon.com/SQL-Structured-Language-Carolyn-Hursch/dp/0830638032/sr=1-4/qid=1161392334/ref=sr_1_4/104-7369811-8083948?ie=UTF8&s=books

I'm very comfortable with 1NF (and maybe even 2NF, 3NF and higher, until my
head starts to hurt), so I can generally design tables and relations I
need.

I've got a problem now that, while I think I have figured out how to do it,
I'm wondering how others have solved this problem. I've tried googling,
but I really don't know the words to use to search.

Here it is: the application is going to be importing data from files that
contain sensor readings. These files can have an arbitrary number of
columns, anywhere from maybe three or four all the way up to 200+ depending
on the number of sensors attached to the particular unit sending the data.

Creating a table for each datafile does not seem practical (proliferation of
tables, plus the number of columns can change, making ongoing maintenance a
nightmare).

So here is what I'm thinking about doing.

Table: Station
id

Table: Station_log (stores information about each line in the datafile)
id
station_id

Table: Stored_data
id
Station_log_id
column_num
data_value

Where column_num would be the column number in the original file, and
data_value (a float) would be the value in that column. So, for each row
in a data file, there will be one row in Station_log, and for each *column*
in a row, there will be one row in Stored_data.

I think this is the best way to do things, keeping everything relationally
sound, and getting at a series of values later for a particular column
should be rather simple.
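
For illustration, here is a rough SQL sketch of that layout (the
logged_at timestamp and the exact types are my own guesses, not part of
the proposal above):

  -- Hypothetical DDL; dialect details will vary.
  CREATE TABLE station (
      id INTEGER PRIMARY KEY
  );

  CREATE TABLE station_log (
      id         INTEGER NOT NULL PRIMARY KEY,
      station_id INTEGER NOT NULL REFERENCES station(id),
      logged_at  TIMESTAMP NOT NULL  -- assumed: one row per line of the data file
  );

  CREATE TABLE stored_data (
      id             INTEGER NOT NULL PRIMARY KEY,
      station_log_id INTEGER NOT NULL REFERENCES station_log(id),
      column_num     INTEGER NOT NULL,  -- column position in the original file
      data_value     FLOAT   NOT NULL,
      UNIQUE (station_log_id, column_num)
  );

  -- Getting the series of values for one column of one station:
  SELECT sl.logged_at, sd.data_value
  FROM station_log sl
  JOIN stored_data sd ON sd.station_log_id = sl.id
  WHERE sl.station_id = 42 AND sd.column_num = 7
  ORDER BY sl.logged_at;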

Other ideas on how to accomplish storing this kind of data?
Resources/reading to which you can point me? Other tips?

Thanks!

j

--
Joshua Kugler
Lead System Admin -- Senior Programmer
http://www.eeinternet.com
PGP Key: http://pgp.mit.edu/  ID 0xDB26D7CE

--
Posted via a free Usenet account from http://www.teranews.com

Cimode

Oct 21, 2006, 5:17:21 PM

Joshua J. Kugler wrote:
<<Here it is: the application is going to be importing data from files
that contain sensor readings. These files can have an arbitrary number
of columns, anywhere from maybe three or four all the way up to 200+
depending>>
No need to go any further to tell you this is the wrong direction.

> Other ideas on how to accomplish storing this kind of data?
> Resources/reading to which you can point me? Other tips?
I am afraid no *tips* can really help... There are rules based on
applied science to implement if you want to do a clean design.

1) Conceptual design
2) Logical design: full normalization
3) Physical design of the logical design

For more info

Practical Issues in Database Management (F. Pascal)
An Introduction to Database Systems (C. J. Date, 8th edition)

Hope this helps...

David Portas

Oct 23, 2006, 3:16:01 PM

I would have expected the file to be no more than a convenient medium
for interchanging the data. In that case the file structure is surely
the least relevant part of the problem - yet you have apparently based
your whole model around it!

The route to an effective logical design is via an understanding of the
concepts you are trying to model. I don't think columns in files will
help achieve that. Even given the limited information you supplied, your
design still looks highly impractical.

--
David Portas

Joshua J. Kugler

Oct 23, 2006, 4:00:34 PM
Cimode wrote:
> > Here it is: the application is going to be importing data from files
> > that contain sensor readings. These files can have an arbitrary number
> > of columns, anywhere from maybe three or four all the way up to 200+
> > depending
> No need to go any further to tell you this is the wrong direction

And David Portas wrote:

> I would have expected the file to be no more than a convenient medium
> for interchanging the data. In that case the file structure is surely
> the least relevant part of the problem - yet you have apparently based
> your whole model around it!
>
> The route to an effective logical design is via an understanding of the
> concepts you are trying to model. I don't think columns in files will
> help achieve that. Even given the limited information you supplied your
> design still looks highly impractical.


Sadly enough, I agree with both of you. The problem is, part of the way the
data is represented is out of my control. Data is coming in from weather
stations. Some of these stations may have 1 sensor (e.g. temperature),
some of these stations may have many sensors (temperature, wind speed, soil
temperature, humidity, air pressure, many of them redundant). Thus, each
station can have a number of attributes that vary from one to dozens. I
can't tell the scientists "You may have 32 sensors per station, no more, no
less." That just won't work, for reasons I hope are obvious. :)

What I'm trying to do is find a generic way to map these attributes to a DB
so that I can import and process any data file. I know the kind of design
I proposed ("flattening" it to one value per row) isn't the best, but right
now, I'm not sure how else I'd go about it. That's why I asked for ideas.

So, I guess stated another way, forget about those files...how might I go
about (or where might I read about) mapping an arbitrary number of
attributes to a database in a clean, efficient way?

David Portas

Oct 23, 2006, 4:54:45 PM
Joshua J. Kugler wrote:
>
> Sadly enough, I agree with both of you. The problem is, part of the way the
> data is represented is out of my control. Data is coming in from weather
> stations. Some of these stations may have 1 sensor (e.g. temperature),
> some of these stations may have many sensors (temperature, wind speed, soil
> temperature, humidity, air pressure, many of them redundant). Thus, each
> station can have a number of attributes that vary from one to dozens. I
> can't tell the scientists "You may have 32 sensors per station, no more, no
> less." That just won't work, for reasons I hope are obvious. :)

No-one except you suggested that.

So you'll need attributes for Wind Speed, Temperature, Humidity etc.
Obviously those are very different domains so they belong in separate
columns. If each type of reading is totally independent of the others
then one would expect to see a separate table for each, with a common
key presumably comprising the station number and a timestamp. This is
of course totally unsupported guesswork on my part.
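
Something along these lines, perhaps (names and types are equally
unsupported guesswork):

  CREATE TABLE temperature_reading (
      station_id INTEGER   NOT NULL,
      read_at    TIMESTAMP NOT NULL,
      celsius    FLOAT     NOT NULL,
      PRIMARY KEY (station_id, read_at)
  );

  CREATE TABLE wind_speed_reading (
      station_id     INTEGER   NOT NULL,
      read_at        TIMESTAMP NOT NULL,
      metres_per_sec FLOAT     NOT NULL,
      PRIMARY KEY (station_id, read_at)
  );

  -- ...and so on, one relation per kind of reading, all sharing the
  -- (station, timestamp) key.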

--
David Portas

Cimode

Oct 23, 2006, 4:55:07 PM

Joshua J. Kugler wrote:

> What I'm trying to do is find a generic way to map these attributes to a DB
> so that I can import and process any data file. I know the kind of design
> I proposed ("flattening" it to one value per row) isn't the best, but right
> now, I'm not sure how else I'd go about it. That's why I asked for ideas.
>
> So, I guess stated another way, forget about those files...how might I go
> about (or where might I read about) mapping an arbitrary number of
> attributes to a database in a clean, efficient way?
>
> Thanks!

Again, an online exchange does not help in understanding the full
spectrum of specifications bound to your needs. All I can say is that
you need to do some reading. Also, as with any measuring system with a
high number of attributes, you will probably need to separate
measurements by type of input (discrete values from ranges, numeric
values from string qualifications)...

Though I admit to hating it, you may consider a schema such as:

station:
stationid, station_(attributes)

measurement_type_numeric_values:
measurement_type_numeric_values_description (type of
measurement, thresholds etc....)

Then you can separate pairs of tables by the type of possible input;
below, two types of measurement: numeric (such as wind speed) and
string (such as {Good, Fair, Bad}).

measurement_type_numeric_values#station:
stationid, measurement_type_numeric_values_description,
unitofmeasurement

measurement_numeric_values
stationid, measurement_type_numeric_values_description,
value, measurement_date

measurement_type_string_values#station:
stationid, measurement_type_string_values_description

measurement_string_values
stationid, measurement_type_string_values_description,
value, measurement_date
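
In SQL terms, a very rough transcription of the numeric pair might look
like this (column types are guesses, names lightly adapted to be legal
identifiers; the string pair mirrors it with a VARCHAR value):

  CREATE TABLE measurement_type_numeric_values_station (
      stationid         INTEGER      NOT NULL,
      description       VARCHAR(100) NOT NULL,  -- type of measurement
      unitofmeasurement VARCHAR(20)  NOT NULL,
      PRIMARY KEY (stationid, description)
  );

  CREATE TABLE measurement_numeric_values (
      stationid        INTEGER      NOT NULL,
      description      VARCHAR(100) NOT NULL,
      value            FLOAT        NOT NULL,
      measurement_date TIMESTAMP    NOT NULL,
      PRIMARY KEY (stationid, description, measurement_date)
  );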

I hope this helps, but I doubt it... I encourage you to do some reading
of the books I pointed out...

Joshua J. Kugler

Oct 23, 2006, 8:32:39 PM
David Portas wrote:

>> Thus, each
>> station can have a number of attributes that vary from one to dozens. I
>> can't tell the scientists "You may have 32 sensors per station, no more,
>> no
>> less." That just won't work, for reasons I hope are obvious. :)
>
> No-one except you suggested that.

Sorry, I didn't mean to suggest that anyone did... just thinking out
loud, I guess.

> So you'll need attributes for Wind Speed, Temperature, Humidity etc.
> Obviously those are very different domains so they belong in separate
> columns. If each type of reading is totally independent of the others
> then one would expect to see a separate table for each, with a common
> key presumably comprising the station number and a timestamp. This is
> of course totally unsupported guesswork on my part.

What you suggest is good, and might work if I had a fixed domain and set of
sensors. But the set of sensors, and types, could change at any time. If
it were feasible to take that approach (currently, it's not), we could go
into production, and two months down the road, we would need to import data
for sensor type Zerple, that does not fit in any of our previous profiles,
and we might have 20 different readings of type Zerple on each sample (line
of the data file). Creating a new table for each type of new sensor is not
going to work.

I admit, I'm working with a problem that is *very* database unfriendly. I
guess I'll see what I can read and come up with.

Thanks for your input so far.

Joshua J. Kugler

Oct 23, 2006, 8:36:24 PM
Cimode wrote:
> ---------------------Then you can separate pairs of tables by type of
> possible input below two types of measurement :Numeric such as wind
> speed AND String such {Good, Fair, Bad}

<SNIP SAMPLE SCHEMA>


> I hope this helps, but I doubt...I encourage you to do some reading
> onto the books I pointed out...

It points me in a possible direction. Also see my reply to David. I'll see
what I dig up in my reading.

Thanks for your input so far.

j

--

paul c

Oct 23, 2006, 9:15:37 PM

There are others here who design a lot more db's than I ever did, but
let me make two comments: 1) It seems important to me to ask the
'scientists' how they plan to use/manipulate this data. E.g., you might
find out that they plan to summarize it initially and do most of their
analysis from summaries, only keeping the raw data for ad hoc historical
uses, but of course I'm just guessing. Or, if they have their hearts
set on using Perl or some such, or on using half an hour of CPU time to
crunch data, I might wonder what good a dbms is. Or maybe they aren't
sure what they're going to do with it and you could just go upstairs and
tell the big kahuna that all they need is an archive system. 2)
Regarding the Zerple, I doubt that any dbms could be made that would be
'eternal'. Things change, same with all systems. That's life.

paul c.

Bob Stearns

Oct 24, 2006, 12:56:28 AM
What is wrong with a model like:

Station(stationid, location, name, etc)
Instrument(instrumentid, datatype, rangemin, rangemax, etc)
StationInstruments(stationid, instrumentid, positionininputstream)
Recordings(stationid, instrumentid, recordingid [maybe a timestamp],
value)

When new instrument types come on line new Instrument rows are added. If
instruments deliver non numeric data, then the simple (rangemin,
rangemax) will have to be extended in fairly obvious ways.
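
In rough SQL (types, and the choice of a timestamp for recordingid, are
my own guesses):

  CREATE TABLE Station (
      stationid INTEGER PRIMARY KEY,
      location  VARCHAR(100),
      name      VARCHAR(100)
  );

  CREATE TABLE Instrument (
      instrumentid INTEGER PRIMARY KEY,
      datatype     VARCHAR(20),
      rangemin     FLOAT,
      rangemax     FLOAT
  );

  CREATE TABLE StationInstruments (
      stationid             INTEGER REFERENCES Station(stationid),
      instrumentid          INTEGER REFERENCES Instrument(instrumentid),
      positionininputstream INTEGER,  -- which column of the incoming file
      PRIMARY KEY (stationid, instrumentid)
  );

  CREATE TABLE Recordings (
      stationid    INTEGER   NOT NULL,
      instrumentid INTEGER   NOT NULL,
      recordingid  TIMESTAMP NOT NULL,  -- "maybe a timestamp"
      value        FLOAT     NOT NULL,
      PRIMARY KEY (stationid, instrumentid, recordingid),
      FOREIGN KEY (stationid, instrumentid)
          REFERENCES StationInstruments (stationid, instrumentid)
  );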

Roy Hann

Oct 24, 2006, 2:28:56 AM
"Joshua J. Kugler" <jos...@eeinternet.com> wrote in message
news:453d522f$0$19658$8826...@free.teranews.com...

> I admit, I'm working with a problem that is *very* database unfriendly. I
> guess I'll see what I can read and come up with.

Actually your problem is not at all database unfriendly. Almost any product
you can buy today that calls itself a DBMS can handle the problem with a few
keystrokes or mouse clicks. What you have there is a programming problem.

Because it takes so much time and effort to make even trivial changes to
the programs that will manipulate your database, you are hoping to find
a database design that will allow you to write a one-size-fits-all
program.

I don't have a programming solution to your programming problem (well I
kinda do, but I am under no illusion that anyone would want to use it). But
as long as you hope for a database solution to a programming problem you are
going to be dissatisfied with the result.

Roy


David Cressey

Oct 24, 2006, 7:34:59 AM

"Joshua J. Kugler" <jos...@eeinternet.com> wrote in message
news:453d522f$0$19658$8826...@free.teranews.com...


PMFJI.

Now that I see what you are up against, I have a better handle on why you
are perplexed.

If I can generalize a little bit from what you've written here, you want a
data model that will be impervious to changing information requirements. In
essence, the introduction of a new, and hitherto not conceived, type of
sensor introduces new data types into the universe of discourse, and
possibly new entities as well.

The problem of coming up with a model so flexible that it can automatically
adapt to any possible change in the universe of discourse is one that has
vexed practitioners of relational systems for at least 36 years.

I think the best you are going to be able to do is to come up with some kind
of "meta model", a model that models models. I see your proposed solution
as the first step in this direction. It's highly flexible, but almost
useless. The next step, as I see it,
is to come up with a good model of the CURRENT state of the subject matter,
given the CURRENT information supplied by sensors.

This should be a straightforward exercise in data analysis and database
design, even if it takes a while. I'll call this the normalized database.

After that you need to come up with processes (programs) that will transform
input data in the form you proposed into the form I proposed. You will have
to make your programs work on incremental inputs, most likely, in order to
keep up.
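
For example, if the normalized database had explicit columns per reading
type, each incremental load might be a pivoting query of roughly this
shape (every table and column name here is purely illustrative):

  -- Pivot the generic one-value-per-row form into one wide, normalized
  -- row per (station, reading time). Column numbers 1 and 2 stand in
  -- for whatever the file layout actually says.
  INSERT INTO station_reading (station_id, read_at, temperature, wind_speed)
  SELECT sl.station_id,
         sl.logged_at,
         MAX(CASE WHEN sd.column_num = 1 THEN sd.data_value END),
         MAX(CASE WHEN sd.column_num = 2 THEN sd.data_value END)
  FROM station_log sl
  JOIN stored_data sd ON sd.station_log_id = sl.id
  WHERE sl.logged_at > (SELECT COALESCE(MAX(read_at),
                                        TIMESTAMP '1970-01-01 00:00:00')
                        FROM station_reading)  -- incremental: only new rows
  GROUP BY sl.station_id, sl.logged_at;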

Then you need to make metadata updates to the normalized database as simple,
easy, and automatic as you can. You will need to make metadata updates
whenever a new kind of sensor is introduced.

After that, you might want to build a reporting database along some kind of
OLAP principles that allow your data analysts to adapt more or less
automatically to the changing world the sensors depict. ETL between the
normalized database and the reporting database will be a real bear, because
of the ongoing metadata updates to the normalized database.

How many years do you plan on working before retirement? ;-)

Roy Hann

Oct 24, 2006, 7:49:02 AM
"David Cressey" <dcre...@verizon.net> wrote in message
news:DTm%g.7058$ke4.4635@trndny02...

> The problem of coming up with a model so flexible that it can
> automatically
> adapt to any possible change in the universe of discourse is one that has
> vexed practitioners of relational systems for at least 36 years.

No, no, no! A thousand times no. It is purely a programming problem. And
it is a programming problem because for some reason in the 21st century we
still insist on writing programs. Any DBMS, even a howling piece of crap
like MySQL or even MS Access can handle the changing universe of discourse
in a trice.

Roy


Joshua J. Kugler

Oct 24, 2006, 2:44:09 PM
Bob Stearns wrote:

> Joshua J. Kugler wrote:
>> So, I guess stated another way, forget about those files...how might I go
>> about (or where might I read about) mapping an arbitrary number of
>> attributes to a database in a clean, efficient way?
>>
> What is wrong with a model like:
>
> Station(stationid, location, name, etc)
> Instrument(instrumentid, datatype, rangemin, rangemax, etc)
> StationInstruments(stationid, instrumentid, positionininputstream)
> Recordings(stationid, instrumentid, recordingid [maybe a timestamp],
> value)
>
> When new instrument types come on line new Instrument rows are added. If
> instruments deliver non numeric data, then the simple (rangemin,
> rangemax) will have to be extended in fairly obvious ways.

Well, that is almost exactly what I had originally. :) See:
http://groups.google.com/group/comp.databases.theory/browse_thread/thread/78267e0f38faa5c9/d24f8dcf6619342b

I just wondered if there were better ideas. :) Maybe that is the best idea
after all.

I also have toyed with the idea of putting each set of readings in a list
structure, serializing it, and inserting it into the database. Is that
ugly? Yeah, probably, but *it fits requirements* because all we really
need to be able to do is store/retrieve by time/date...searching for
specific values isn't really a part of the system; getting at a set of
readings by time/date is. So, we'll see. I'll talk to the stakeholders
some more and get them to determine exactly how much searching they want
vs. straight retrieval.
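
If we went that route, the storage side would be about as simple as it
gets (the payload type is a guess; it could be TEXT, a BLOB, or whatever
the serializer spits out):

  CREATE TABLE reading_set (
      station_id INTEGER   NOT NULL,
      read_at    TIMESTAMP NOT NULL,
      payload    TEXT      NOT NULL,  -- serialized list of readings
      PRIMARY KEY (station_id, read_at)
  );

  -- Retrieval by time/date is then just a range scan on the key:
  SELECT read_at, payload
  FROM reading_set
  WHERE station_id = 42
    AND read_at BETWEEN TIMESTAMP '2006-10-01 00:00:00'
                    AND TIMESTAMP '2006-10-31 23:59:59'
  ORDER BY read_at;

The obvious cost is that the DBMS can no longer see, constrain, index or
search the individual values, which is exactly the trade-off to put in
front of the stakeholders.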

Thanks.

Joshua J. Kugler

Oct 24, 2006, 2:56:51 PM
David Cressey wrote:
> PMFJI.
>
> Now that I see what you are up against, I have a better handle on why you
> are perplexed.
>
> If I can generalize a little bit from what you've written here, you want
> a
> data model that will be impervious to changing information requirements.

Actually, I just want a data model that can have a flexible number of
attributes since various stations will have a flexible number of
attributes. :)

But I do understand the impossibility of preparing for every eventuality. I
just want to be prepared for an eventuality I know will happen.


mAsterdam

Oct 24, 2006, 5:36:24 PM
Joshua J. Kugler wrote:
> ... all we really need to be able to do is

> store/retrieve by time/date...searching for
> specific values isn't really a part of the system; getting at a set of
> readings by time/date is. So, we'll see. I'll talk to the stakeholders
> some more and get them to determine exactly how much searching they want
> vs. straight retrieval.

(Try to) Establish /specific/ information needs - not a
description, but real examples. Starting from those needs
you can work your way back to the (real) facts you will need to capture.

Typifying those facts gives you a good starting
structure for your design.

J M Davitt

Oct 24, 2006, 7:44:04 PM
Joshua J. Kugler wrote:
> David Cressey wrote:
>
>>PMFJI.
>>
>>Now that I see what you are up against, I have a better handle on why you
>>are perplexed.
>>
>>If I can generalize a little bit from what you've written here, you want
>>a
>>data model that will be impervious to changing information requirements.
>
>
> Actually, I just want a data model that can have a flexible number of
> attributes since various stations will have a flexible number of
> attributes. :)
>
> But I do understand the impossibility of preparing for every eventuality. I
> just want to be prepared for an eventuality I know will happen.
>
> j
>

[I'm not recommending this, but...]

I'm surprised that the entity-attribute-value design
hasn't been mentioned yet.

paul c

Oct 24, 2006, 9:57:44 PM

I thought the majority here were trying to not mention it.

p

Gene Wirchenko

Oct 24, 2006, 10:56:10 PM
paul c <toledob...@dbms.yuc> wrote:

>J M Davitt wrote:

[snip]

>> [I'm not recommending this, but...]
>>

>> I'm surprised that the [EVIL THAT SHOULD NOT BE NAMED]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Previous poster's line edited for obvious reasons.

>> hasn't been mentioned yet.
>
>I thought the majority here were trying to not mention it.

Nothing personal, gents, but we are going to have to hang the two
of you.

Sincerely,

Gene Wirchenko

Computerese Irregular Verb Conjugation:
I have preferences.
You have biases.
He/She has prejudices.

Marshall

Oct 25, 2006, 1:27:22 AM
On Oct 24, 7:56 pm, Gene Wirchenko <g...@ocis.net> wrote:
>
> >> I'm surprised that the [EVIL THAT SHOULD NOT BE NAMED]
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Nothing personal, gents, but we are going to have to hang the two
> of you.

LOL

mAsterdam

Oct 25, 2006, 3:38:45 AM
paul c wrote:
> J M Davitt wrote:
>>
>> I'm surprised that the entity-attribute-value design
>> hasn't been mentioned yet.
>
> I thought the majority here were trying to not mention it.

:-)

David Cressey

Oct 25, 2006, 7:30:17 AM

"Gene Wirchenko" <ge...@ocis.net> wrote in message
news:qcktj25lnfle11ord...@4ax.com...

> paul c <toledob...@dbms.yuc> wrote:
>
> >J M Davitt wrote:
>
> [snip]
>
> >> [I'm not recommending this, but...]
> >>
> >> I'm surprised that the [EVIL THAT SHOULD NOT BE NAMED]
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Previous poster's line edited for obvious reasons.
>
> >> hasn't been mentioned yet.
> >
> >I thought the majority here were trying to not mention it.
>
> Nothing personal, gents, but we are going to have to hang the two
> of you.
>
Just refer to it as the "abomination that causes desolation", and assign to
it the object id = 666.

Sampo Syreeni

Oct 25, 2006, 8:23:07 AM
On 2006-10-24, J M Davitt wrote:

> I'm surprised that the entity-attribute-value design hasn't been
> mentioned yet.

Maybe by name it hasn't, but the principle is implicit in Joshua's
original design: E is {station_id, row_id}, A is {column_number} and V
is {value}. Granted, the representation is only applied to the part of
the schema that is expected to change rapidly and the result seems nicer
than normal because in this case it just so happens that all of the
attribute values are known to be floats. Still, if it walks like a
chicken, and talks like one...

OTOH, to me storing data with an open ended and not completely
understood structure as EAV doesn't seem too bad, as long as the data
we're actually querying and transforming is lifted off it and put into
an ordinary schema. (Which you can always do because as soon as you
understand the data sufficiently well to utilize it in any way besides
passing it through to somebody more knowledgeable than you, you'll
already know enough to write the DDL.) The best thing would be if the
DBMS had full support for this sort of thing -- e.g. you could import
blobs of data, a partial structuring schema could be imposed on them at
the physical level, suddenly the part of the data that was declared
would become visible as first class relations, and the rest of it would
remain accessible as binary or some semistructured representation like
EAV so that it could still be passed on as-is, maybe revealed on the
logical level later on, and in any case maintained under the strong ACID
guarantees that only DBMS's provide -- but as long as it doesn't and you
need to be able to abstract away from the structure of some piece of
data, circumventing the relational typing machinery can be the least bad
alternative.
--
Sampo Syreeni, aka decoy - mailto:de...@iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2

Roy Hann

Oct 25, 2006, 8:45:03 AM
"Sampo Syreeni" <de...@iki.fi> wrote in message
news:Pine.SOL.4.62.06...@kruuna.helsinki.fi...

> The best thing would be if the DBMS had full support for this sort of
> thing -- e.g. you could import blobs of data, a partial structuring schema
> could be imposed on them at the physical level, suddenly the part of the
> data that was declared would become visible as first class relations, and
> the rest of it would remain accessible as binary or some semistructured
> representation like EAV so that it could still be passed on as-is, maybe
> revealed on the logical level later on, and in any case maintained under
> the strong ACID guarantees that only DBMS's provide -- but as long as it
> doesn't and you need to be able to abstract away from the structure of
> some piece of data, circumventing the relational typing machinery can be
> the least bad alternative.

To repeat what I already said in this thread, even the crappiest so-called
SQL DBMSs on the market already solve the problem extremely well for
practical purposes. We aren't going to move one inch further forward until
we acknowledge where the problem really is now, and it sure ain't in the one
part that's already solved it.

This is an application development problem. Let's look at what needs to
change there instead of trashing the one success we can already point to.
That would be a far more interesting conversation than rehashing "the
abomination that dare not speak its name" (in various forms) yet again.

Roy


JOG

Oct 25, 2006, 9:10:43 AM
Joshua J. Kugler wrote:
> David Cressey wrote:
> > PMFJI.
> >
> > Now that I see what you are up against, I have a better handle on why you
> > are perplexed.
> >
> > If I can generalize a little bit from what you've written here, you want
> > a
> > data model that will be impervious to changing information requirements.
>
> Actually, I just want a data model that can have a flexible number of
> attributes since various stations will have a flexible number of
> attributes. :)

I worry that you are conflating entity-attributes with tuple-attributes
(proposition roles). Nevertheless my instinct would be to use something
like the following:

STATION:1 has INSTRUMENT:thermometer with RECORDING:57 at
TIME:25/10/06.
STATION:2 has an INSTRUMENT:wind_speed with RECORDING:120 at
TIME:25/10/06.
etc.

If you went for a proposition attribute such as THERMOMETER_READING for
every possible 'entity attribute' you would record a huge number of
facts with extremely varied structures (i.e. with very few
commonalities between them from which to extract predicates), and
would hence end up with an unwieldy number of tables.
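
So, roughly (names and types invented for the sake of the example):

  CREATE TABLE reading (
      station_id INTEGER     NOT NULL,
      instrument VARCHAR(50) NOT NULL,  -- e.g. 'thermometer', 'wind_speed'
      recording  FLOAT       NOT NULL,
      read_at    DATE        NOT NULL,
      PRIMARY KEY (station_id, instrument, read_at)
  );

  INSERT INTO reading VALUES (1, 'thermometer', 57, DATE '2006-10-25');
  INSERT INTO reading VALUES (2, 'wind_speed', 120, DATE '2006-10-25');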

Bob Badour

Oct 25, 2006, 9:21:59 AM
Marshall wrote:

What?!? No tarring? No feathering? No riding out on rails? You guys
spoil all the fun.

Gene Wirchenko

Oct 25, 2006, 12:41:38 PM
Bob Badour <bba...@pei.sympatico.ca> wrote:

Fine, Bob. I can adjust a spec.

The method of punishment shall be implementation-defined.

Sampo Syreeni

Oct 25, 2006, 1:01:24 PM
On 2006-10-25, Roy Hann wrote:

> To repeat what I already said in this thread, even the crappiest
> so-called SQL DBMSs on the market already solve the problem extremely
> well for practical purposes.

I don't think this is true, so we probably disagree about what the
problem to be solved really is. You seem to think this is a nonissue
because even the crappiest SQL database allows for schema evolution.
Fair enough, but somebody skilled in data modeling has to issue those
DDL statements, that somebody is going to have to be paid for the
effort, and so there has to be a tangible return on the modeling
investment before the data can be stored in bona fide relations.

Often the ROI just isn't there even though the data is. Instead we only
know that we have some data, some proper subset of it will be useful
later on, and so we want to store it, to be sorted out later. It doesn't
make sense to invest into real data modeling now because later on we'll
know better which subset of the data is going to be used. That both
enables us to cut down on the eventual total modeling cost and to get
the first application out without delay, while allowing today's data to
be built upon at a later date. If the DBMS doesn't support inclusion of
such data in some semistructured format, then the data is going to get
stored as a blob, and if even this is disallowed, the DBMS will be
ditched for flat file storage in an ad hoc format, like XML. This
doesn't make a whole lot of sense because then we'll risk losing some of
the structure that already exists, and we'll either have to reimplement
or do without all of the useful, schema independent features of the
DBMS, like recovery guarantees and access control.

A second typical example is data that is useful right now, but that is
part of an application that is not supported well enough by the existing
DBMSs. (Some examples would be free text, objects used in hard realtime
systems and large scientific datasets.) In this case we still want to
leverage some parts of the system, like the recovery mechanisms, but
also want to implement the core data model somewhere else (say, because
of performance issues). The end result is the same as above; making
fully developed schemas a requirement only causes the DBMS to be traded
in for more primitive tools, and the data to be locked into formats with
no clear evolutionary path towards a proper data model.

The way I see it, DBMS's primarily exist to make the total cost of
ownership of data lower. Since all of the above happens on cost grounds,
allowing limited use of semistructured data within a DBMS is just good
engineering, provided it isn't overdone.

> We aren't going to move one inch further forward until we acknowledge
> where the problem really is now, and it sure ain't in the one part
> that's already solved it.

So what is the problem? In your earlier post you didn't really elaborate
on that.

> This is an application development problem. Let's look at what needs
> to change there instead of trashing the one success we can already
> point to.

I don't think I'm trashing much of anything, because I do advocate
proper modeling and relational storage as soon as we know what to model
and the RDBMS adequately supports the application.

Bob Badour

Oct 25, 2006, 5:14:30 PM
Sampo Syreeni wrote:

> On 2006-10-25, Roy Hann wrote:
>
>> To repeat what I already said in this thread, even the crappiest
>> so-called SQL DBMSs on the market already solve the problem extremely
>> well for practical purposes.
>
> I don't think this is true, so we probably disagree about what the
> problem to be solved really is. You seem to think this is a nonissue
> because even the crappiest SQL database allows for schema evolution.
> Fair enough, but somebody skilled in data modeling has to issue those
> DDL statements, that somebody is going to have to be paid for the
> effort, and so there has to be a tangible return on the modeling
> investment before the data can be stored in bona fide relations.

You are begging the question.


> Often the ROI just isn't there even though the data is. Instead we only
> know that we have some data, some proper subset of it will be useful
> later on, and so we want to store it, to be sorted out later. It doesn't
> make sense to invest into real data modeling now because later on we'll
> know better which subset of the data is going to be used.

A dbms will allow one to dump each file into a relation that directly
reflects the structure of the file. That doesn't cost a whole lot in
terms of designing, and it does accomplish exactly what you describe above.
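
For instance, a file from one class of station could land in a staging
relation that mirrors its columns one-for-one (names and column count
are illustrative; the bulk-load syntax varies by product):

  CREATE TABLE raw_station_file_a (
      line_no INTEGER PRIMARY KEY,
      col1 FLOAT,
      col2 FLOAT,
      col3 FLOAT,
      col4 FLOAT,
      col5 FLOAT
  );
  -- Load with whatever the product offers (COPY, LOAD DATA, bcp, ...);
  -- the point is only that the relation preserves the file's structure.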

[remainder snipped]

J M Davitt

Oct 26, 2006, 4:32:46 AM

Let's get it out of the way and advise the OP that EAV
is not a solution...

Sampo Syreeni

Oct 26, 2006, 5:44:47 AM
On 2006-10-25, Bob Badour wrote:

> You are begging the question.

I believe the problem is the cost of up front modeling. I claim that it
is sometimes solved by postponing the effort while storing the data,
which means that the data will not be fully structured in the meanwhile.
EAV on top of an RDBMS is one means of storing such semistructured data.
Hence, there is a limited justification for using EAV or some similarly
less-than-relational data representation. How is this circular?

> A dbms will allow one to dump each file into a relation that directly
> reflects the structure of the file. That doesn't cost a whole lot in
> terms of designing, and it does accomplish exactly what you describe
> above.

Correct, but how precisely is that better than EAV? Suppose you've
progressed far enough to know that your users are going to be needing
trend reports on the sensor data that is present in field #1 of the
files produced by instrument A and in a different form in field #2 of
instrument type B. You can somehow tell the instruments apart (e.g. by
one of the attribute values or by the set of attributes present),
you've added an attribute in your core data model and you've implemented
functions to convert the raw data into a uniform representation. You now
need to move the data into the real schema. Which representation do you
think makes this easier, EAV or mass-of-relations, if you're using plain
SQL on the one hand, or the best existing tools for the model on the
other? Which model reacts better if one of the sensors is upgraded to
produce a new field? What happens when one of the fields has
substructure which needs to be preserved?

JOG

Oct 26, 2006, 6:05:06 AM
Sampo Syreeni wrote:
> On 2006-10-25, Bob Badour wrote:
>
> > You are begging the question.
>
> I believe the problem is the cost of up front modeling. I claim that it
> is sometimes solved by postponing the effort while storing the data,
> which means that the data will not be fully structured in the meanwhile.
> EAV on top of an RDBMS is one means of storing such semistructured data.

What /exactly/ do you mean by 'semistructured data' Sampo? I see this
term bandied around a lot, but am yet to see a clear explanation of
what it /is/. I've heard that it is data that doesn't fit into the RM,
but AFAIK all data fits in the RM, so I find this highly suspect. Ta,
J.

Bob Badour

Oct 26, 2006, 10:31:13 AM
Sampo Syreeni wrote:
> On 2006-10-25, Bob Badour wrote:
>
>> You are begging the question.
>
> I believe the problem is the cost of up front modeling.

As I stated clearly, you were begging the question. You invented an up
front modelling cost that doesn't necessarily even exist and then
alerted us to this 'problem'.


> How is this circular?

You assumed a non-existent large up-front 'modelling cost' to prove a
large up-front 'modelling cost'.


>> A dbms will allow one to dump each file into a relation that directly
>> reflects the structure of the file. That doesn't cost a whole lot in
>> terms of designing, and it does accomplish exactly what you describe
>> above.
>
> Correct, but how precisely is that better than EAV?

In every way imaginable.

Sampo Syreeni

Oct 26, 2006, 1:16:40 PM
On 2006-10-26, Bob Badour wrote:

> You invented an up front modelling cost that doesn't necessarily even
> exist and then alerted us to this 'problem'.

No. I started with a proposed, EAV-like model. Since people around these
parts don't exactly like EAV and to me it seems like a reasonable first
pass design when your understanding of the data is as limited as the
proposer's is, I have to come up with something that can be used to
support my position, or alternatively can be refuted. I have to probe
into why one would want to use EAV even though the relational model
exists. This is simple abduction. After that, we have a deductive,
noncircular argument backwards from the cost to why one might sometimes
want to consider EAV.

> You assumed a non-existent large up-front 'modelling cost' to prove a
> large up-front 'modelling cost'.

No, to prove that EAV has at least one valid application. But yes, at
this point the cost is a factual hypothesis, not a tried and true fact.

You claim that the cost is low. I on the other hand claim that the cost
of training and employing relational experts is great, and that with
current DBMS's this cost is largely incurred up front because there is
no incremental path from totally unstructured, opaque blobs stored
within the system to a fully relational, normalized, tightly constrained
and well annotated design. Your proposed solution is partial at best,
because in it the kinds of analysis that are often needed to generate
and validate subsequent design decisions reaches over huge numbers of
relations. Generating queries like that is cumbersome, current DBMS's
aren't really prepared to handle them and tools that would help organize
all this intermediate data are not readily available, whereas for at
least some EAV-like representations they are.

>>> A dbms will allow one to dump each file into a relation that directly
>>> reflects the structure of the file. That doesn't cost a whole lot in terms
>>> of designing, and it does accomplish exactly what you describe above.
>>
>> Correct, but how precisely is that better than EAV?
>
> In every way imaginable.

Then surely you must be able to list at least a few. My list starts with
the ease in EAV of querying for the presence of certain attributes
across all of the data, because this sort of thing is rather natural
when you're looking for potential dependencies and unifiable types in
your data. In your representation this would translate into complicated
SQL against a nonstandard catalog; in EAV it's a matter of self-joins on
A. Queries involving both A and V in EAV are also easy enough, but in
your representation would already involve quantification over relvars,
which existing DBMS's fail to support unless you resort to custom
procedural code.
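
For instance, "which entities report both attribute X and attribute Y"
is a single self-join in EAV (attribute names invented for the example):

  -- eav(entity_id, attribute, value), one row per recorded attribute.
  SELECT a.entity_id
  FROM eav a
  JOIN eav b ON b.entity_id = a.entity_id
  WHERE a.attribute = 'temperature'
    AND b.attribute = 'humidity';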

Marshall

Oct 26, 2006, 1:29:36 PM
On Oct 26, 7:31 am, Bob Badour <bbad...@pei.sympatico.ca> wrote:
>
> > I believe the problem is the cost of up front modeling.
> As I stated clearly, you were begging the question. You invented an up
> front modelling cost that doesn't necessarily even exist and then
> alerted us to this 'problem'.
> [...]

> You assumed a non-existent large up-front 'modelling cost' to prove a
> large up-front 'modelling cost'.

Wait, are you saying that modelling is free?

I want to make sure I understand what you are saying. It
*sounds* as if you are saying that, after gathering requirements,
turning those requirements into a relational data model (example:
SQL CREATE TABLE statements) costs less than epsilon.

My experience is that such a process usually involves lots
of wailing and gnashing of teeth. But perhaps it would
make sense to attribute that effort to extended
requirements gathering?

Not sure I follow. The idea that there is a cost
to figuring out the right schema seems reasonable
to me. It is my experience that the cost is vastly
outweighed by the benefits almost all the time,
but I'd still say there was an associated cost.


Marshall

Bob Badour

Oct 26, 2006, 1:50:37 PM
Marshall wrote:

> On Oct 26, 7:31 am, Bob Badour <bbad...@pei.sympatico.ca> wrote:
>
>>>I believe the problem is the cost of up front modeling.
>>
>>As I stated clearly, you were begging the question. You invented an up
>>front modelling cost that doesn't necessarily even exist and then
>>alerted us to this 'problem'.
>>[...]
>>You assumed a non-existent large up-front 'modelling cost' to prove a
>>large up-front 'modelling cost'.
>
> Wait, are you saying that modelling is free?

He posited a situation where the data needs to be stored without any
further knowledge of the requirements (with the requirements presumably
left for future discovery). Creating a schema that is a 1:1 mapping of a
file may not be free, but it is no more costly than any other option.


> I want to make sure I understand what you are saying. It
> *sounds* as if you are saying that, after gathering requirements,
> turning those requirements into a relational data model (example:
> SQL CREATE TABLE statements) costs less than epsilon.
>
> My experience is that such process usually involves lots
> of wailing and gnashing of teeth. But perhaps it would
> make sense to attribute that effort to extended
> requirements gathering?

I suggest you go back and read what Roy wrote and how Sampo responded.
Sampo posited a large up-front 'modelling cost' at the same time he
stated the requirements were largely unknown except for simple storage
for future analysis. How large a cost does one incur to say: "We don't
know what we are going to do with this so we better preserve its
original structure as closely as possible" ?

If my requirement is "I need a place to dump this for future use",
dumping it someplace that offers extremely effective management and
manipulation capabilities strikes me as a fair choice. That "even the
crappiest so-called SQL DBMSs on the market already solve the problem
[of dumping data without further analysis] extremely well for practical
purposes" strikes me as almost self-evident. What Sampo wrote in no way
contradicts Roy's assertion.


> Not sure I follow. The idea that there is a cost
> to figuring out the right schema seems reasonable
> to me. It is my experience that the cost is vastly
> outweighed by the benefits almost all the time,
> but I'd still say there was an associated cost.

What requirements? Sampo posited a situation with no known requirements:

Sampo Syreeni

Oct 26, 2006, 1:58:11 PM
On 2006-10-26, JOG wrote:

> What /exactly/ do you mean by 'semistructured data' Sampo? I see this
> term bandied around a lot, but am yet to see a clear explanation of
> what it /is/.

Buneman defines it as data where the information normally associated
with schemas is part of the data itself. I.e. each time you load some
data, the metadata within it gets to tell you what the keys, attributes,
domains and constraints are, and you don't get a say in the matter.

I'd frame the issue a bit differently. Metadata like that is often used
when we want to abstract away from some part of the representation but
do not want to make the data completely opaque. We for example can tell
that a certain part of the data only concerns a specific application and
so that our application can safely ignore it, but we still want to be
able to both store the data so that it can be fed to said application
and to retain what structure it has so that e.g. parsing, buffering and
annotation code can be shared. That's how multimedia and object
container formats work. In this sense semistructured data is data
conforming to a schema that has been coarsened from what a relational
database would expect, and that is more weakly typed as a result. There
are also some applications with lots of semantic machinery that lives at
a level higher than RM's logical one, or which utilize a fundamentally
different data model; those end up being emulated/reified on top of RM.
Semistructured data as a concept tries to capture the common features of
all of these, allow for the data determining the schema instead of the
other way around, and bridge the gap between opaque blobs and well
structured databases.

> I've heard that it is data that doesn't fit into the RM, but AFAIK all
> data fits in the RM, so I find this highly suspect.

As Bob just showed, everything does fit in the RM. However, current
RDBMS's do not necessarily make storage of all kinds of data easy or
uniform enough. One nice example is disjunctive data from knowledge
bases. Even that fits, but the representations become rather ugly and an
RDBMS doesn't really understand them. Lots of extra machinery is needed
on top of the database to make sense of logical or's, so in this case
the RDBMS is really used more like a dumb record store, the real
semantics of the data are never revealed to it, and essentially the
final, disjunction aware database ends up being emulated on top of the
RM.

Marshall

Oct 26, 2006, 5:22:23 PM
On Oct 26, 10:58 am, Sampo Syreeni <d...@iki.fi> wrote:
>
> One nice example is disjunctive data from knowledge
> bases. Even that fits, but the representations become rather ugly and an
> RDBMS doesn't really understand them. Lots of extra machinery is needed
> on top of the database to make sense of logical or's, so in this case
> the RDBMS is really used more like a dumb record store, the real
> semantics of the data are never revealed to it, and essentially the
> final, disjunction aware database ends up being emulated on top of the
> RM.

Can you give an example? I understand all the words in the above
but I can't construct a clear picture of what you're talking about.


Marshall

Sampo Syreeni

Oct 27, 2006, 5:03:21 AM
On 2006-10-26, Marshall wrote:

>> [...] the final, disjunction aware database ends up being emulated on

>> top of the RM.
>
> Can you give an example?

Take the medical expert system. Many of the diagnostic rules they store
are disjunctions, because from a limited set of symptoms and diagnostics
you can predict a number of different illnesses. (What makes a good
expert system is that it can do heuristic and statistical inference and
use them to ask the right questions so that the alternatives are rapidly
and cheaply narrowed down.) The system might also contain rules which
enable it to second guess test results and the like.

Take the normal semantics of the relational model. Assume we have two
predicates, p(x,y) and q(z), meaning patient x has disease y and test z
has failed, respectively. It's entirely possible that some rule in the
system has as its consequent p(x,y) or q(z), and that you want to be
able to store that in the database as soon as the antecedent of the rule
can be evaluated, so that the next time the patient comes in, the doctor
can consider whether to diagnose x with y or to order z to be redone.

If you use two vanilla tables to store these predicates, you'll get the
wrong semantics. Inserting (x,y) into p and (z) into q means that the
patient both has the disease and that the test failed.

You can still fit the stuff into RM. Two alternatives that immediately
come to mind are to store all the disjunctive stuff as clear-text logical
formulae -- really nasty -- or to reify p and q and point to them. In the
latter, instead of taking (x,y) in p to mean the fact that x has disease
y, you interpret it to mean the corresponding statement which may or may
not be true, and assign an id to it. Then you have secondary relvars
containing the structure of the formulae, e.g. r(x,y) meaning formula x
is parent of formula or statement y, and c(x,z) where x is a formula and
z is the logical connective applied to its children. Now you can represent
what was originally p(x,y) as {p(f1,x,y), r(f2,f1), c(f2,and)} and the
disjunction as {p(f1,x,y), q(f2,z), r(f3,f1), r(f3,f2), c(f3,or)}. (This
is not the only design possible and probably not even the best, but it
suffices to illustrate the point.)
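
A rough SQL sketch of that reification, just to make the shape concrete
(all names and types invented for illustration):

  -- Reified statements: each row is a statement that may or may not hold.
  CREATE TABLE diagnosis_stmt   (stmt_id INTEGER PRIMARY KEY,
                                 patient_id INTEGER, disease VARCHAR(50));
  CREATE TABLE test_failed_stmt (stmt_id INTEGER PRIMARY KEY,
                                 test_id INTEGER);

  -- Formula structure: parent/child edges plus a connective per formula.
  CREATE TABLE formula_child      (parent_id INTEGER, child_id INTEGER,
                                   PRIMARY KEY (parent_id, child_id));
  CREATE TABLE formula_connective (formula_id INTEGER PRIMARY KEY,
                                   connective VARCHAR(3));  -- 'and' / 'or'

  -- "p(x,y) or q(z)" then becomes:
  INSERT INTO diagnosis_stmt     VALUES (1, 1001, 'measles');  -- f1
  INSERT INTO test_failed_stmt   VALUES (2, 17);               -- f2
  INSERT INTO formula_connective VALUES (3, 'or');             -- f3 = f1 or f2
  INSERT INTO formula_child      VALUES (3, 1);
  INSERT INTO formula_child      VALUES (3, 2);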

The problem is that the database has no idea about what this all means.
When asked, it can easily list all of the statements and formulas, but
in order to get out only the true facts, you'll need extensive reasoning
capability on top. The data model that lives inside the database is far
more complex than the fragment of propositional logic the RM captures,
so the RM is mostly blind to it and is only used to emulate the real
thing.

Finally, the database isn't really even equipped to deal with the
pattern of queries that result. A typical query against this sort of
database would ask which illnesses are still possible after considering
some symptoms, the tests done, the medication in effect and the
patient's history. In this representation you can still apply single
table recursive joins (of the bill-of-materials type) to get the
relevant formulae and join everywhere from there, but both the number of
columns and the number of rows in such queries suffer from combinatorial
explosion. Then there are going to be predicates describing incompatible
medications, side effects, afterillnesses, combined risk factors and
whathaveyou. Those translate into path queries and mutually recursive
transitive joins, with intermediate results that are going to be huge
unless the database actually understands the logic and can prune the
search intelligently. The only way to implement something reasonable on
top is to have the logic middleware navigate the database, but then high
performance navigation is again something that current relational
software doesn't support as well as, say, object frameworks.

Bob Badour

Oct 27, 2006, 9:11:05 AM
Sampo Syreeni wrote:

> On 2006-10-26, Bob Badour wrote:
>
>> You invented an up front modelling cost that doesn't necessarily even
>> exist and then alerted us to this 'problem'.
>
> No.

Look, I am perfectly capable of reading what Roy wrote and how you
responded. Denial is just stupid.

Bob Badour

Oct 27, 2006, 9:32:07 AM
Sampo Syreeni wrote:

> On 2006-10-26, Marshall wrote:
>
>>> [...] the final, disjunction aware database ends up being emulated on
>>> top of the RM.
>>
>> Can you give an example?

[long-winded straw man snipped]

> The problem is that the database has no idea about what this all means.

Sets of facts don't have ideas and don't ascribe meanings. Do you have a
point?

> When asked, it can easily list all of the statements and formulas, but
> in order to get out only the true facts, you'll need extensive reasoning
> capability on top.

Um, you mean like a query?


> The data model that lives inside the database is far
> more complex than the fragment of propositional logic the RM captures,
> so the RM is mostly blind to it and is only used to emulate the real thing.

That is a remarkable assertion--all the more remarkable for its complete
lack of support or evidence.


> Finally, the database isn't really even equipped to deal with the
> pattern of queries that result.

Databases aren't equipped to deal with queries--that's what data
management systems do.

[handwaving about combinatorial explosions snipped]


> Then there are going to be predicates describing incompatible
> medications, side effects, afterillnesses, combined risk factors and
> whathaveyou. Those translate into path queries and mutually recursive
> transitive joins,

Path query? Transitive join? What utter nonsense. If recursion and
transitive closure facilitate expressing the queries, then express the
queries using recursion and transitive closure.


> with intermediate results that are going to be huge

Bullshit.


> unless the database actually understands the logic and can prune the
> search intelligently.

Sets of facts don't understand things. However, the manipulation
function of the dbms uses symbolic manipulation of formal logic to avoid
unnecessary work.

[remaining nonsense snipped]

Sampo, you have no intellectual honesty. Your posts are long-winded and
almost entirely devoid of sensible content. Plonk.
