Ian Goddard wrote:
> Richard Smith wrote:
> > Conversely, the danger of developing the data model in vacuo and only
> > later implementing it is that it turns out to be too cumbersome
>
> IME a well-done simple model can fit complex data easily. An
> ill-thought-out model can be cumbersome, fit all data badly and be a pig
> to implement & maintain: been there, had one imposed, the clients were
> for ever having to take the back off it and fiddle with the innards for
> every change in requirement.
I don't think we're really disagreeing here. I suspect that what I'm
referring to when I talk about a model developed in vacuo and without
real-world testing is an example of what you're calling an ill-thought-
out model.
> > Arguably it is issues such as these
> > that have led to the Gentech data model largely being ignored,
> > despite seeming very well thought out on paper.
>
> I'm not so sure about that:
>
> - It doesn't separate Person from Persona.
It does in a way. Its persona object can be grouped into higher-level
personae. In your terminology, the lower-level one is the Persona and
the higher level one is the Person. The difference is that Gentech
allows multiple layers of such groupings whereas your model does
not. However, as I've argued elsewhere in this thread (and so won't
repeat here), I think the ability to have multiple layers of
personae is a good thing.
> - The Assertion seems to be applied to join all sorts of data types,
Yes, in many ways that's a mess, especially as not all the
combinations are really meaningful.
> - The Repository entity is, IMV superfluous - it would be far simpler
> and more flexible to have a single self-referential entity for all
> levels of a provenance chain, the repositories, publishers, etc. being
> those instances with a null parent field.
I can't make up my mind whether I agree with you here. On balance I
think there is a useful distinction, but Gentech doesn't get it
quite right.
The Gentech source object is hierarchical. A source can represent a
page, another source can be the parish register enclosing the page, a
third source can be the collection of records deposited for a given
parish, and each of these sources has a single, possibly null, parent
field (the Higher-Source-ID field). So a page cannot be in two
different registers.
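As a rough sketch of that hierarchy in Python (the class and function
names are my own shorthand; only the Higher-Source-ID field comes from
the Gentech model, and the sample records are invented):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Source:
    source_id: int
    title: str
    # Stands in for Higher-Source-ID; None marks the top of a chain.
    higher_source_id: Optional[int] = None


def provenance_chain(sources: Dict[int, Source], source_id: int) -> List[str]:
    """Walk from a source up through each enclosing source in turn."""
    chain = []
    current: Optional[int] = source_id
    while current is not None:
        src = sources[current]
        chain.append(src.title)
        current = src.higher_source_id
    return chain


# Invented sample data: a page inside a register inside a parish deposit.
sources = {
    1: Source(1, "Records deposited for the parish"),
    2: Source(2, "Parish register", higher_source_id=1),
    3: Source(3, "Page 12", higher_source_id=2),
}
chain = provenance_chain(sources, 3)
```

Because each source carries exactly one parent reference, the walk is a
simple loop with no branching, which is also why a page can't belong
to two registers.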
By contrast, a repository is the location of a source, and a source
may exist in multiple repositories. For example, a book or microfiche
may exist in several libraries. In its original intent, it's clearly
meant as a physical place: somewhere with opening hours, an address
and a phone number. But almost certainly any modern implementation
would extend it to include websites, so that Archive.org or
Ancestry.com would be a repository. That would require certain
changes to the repository data model (at the very least to include its
URL) which could have been pre-empted by using a more general and
extensible contact information model.
This way Gentech distinguishes between a source and a specific copy of
a source. If I'm including a citation in a published report, I just
want to reference the source -- ordinarily I wouldn't state where I
accessed it, unless the only copies happened to be in obscure places.
But it's nevertheless useful to record where I found a copy of that
source so that I can plan future research: for example, I can ask my
system to give me a list of the tasks on my 'to-do' list that need
doing in the Public Record Office and that can't also be done online
or in my local library.
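To make that sort of query concrete, here is a minimal sketch in
Python. Everything in it is illustrative: the Task class, the holdings
table and the sample sources are my own invention, not anything Gentech
defines.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Task:
    description: str
    source: str  # the source this task needs to consult


# Invented holdings: which repositories hold a copy of each source.
holdings: Dict[str, Set[str]] = {
    "Parish register": {"Public Record Office"},
    "Census microfiche": {"Public Record Office", "Local library"},
    "Trade directory": {"Public Record Office", "Online"},
}

tasks = [
    Task("Check the 1790s baptisms", "Parish register"),
    Task("Find the household in the census", "Census microfiche"),
    Task("Look up the shop address", "Trade directory"),
]


def tasks_only_at(repo: str, easier: Set[str]) -> List[str]:
    """Tasks whose source is held at `repo` and at none of the `easier` places."""
    return [
        t.description
        for t in tasks
        if repo in holdings[t.source] and not (holdings[t.source] & easier)
    ]


todo = tasks_only_at("Public Record Office", {"Online", "Local library"})
```

The query only works because the source-to-repository link is
many-to-many and kept separate from the source itself.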
But with online sources this gets complicated. Gentech makes the
medium a property of the source: a source might be a book, or a
collection of loose leaves, or a photograph, or a map. That makes
sense, but an online copy clearly has a different type: it has a MIME
type (e.g. image/jpeg), a resolution, whether it's colour or
greyscale, and so on. And different online copies will have different
properties. Gentech has no facility to store this sort of per-
instance metadata. The alternative view is that each functionally
distinct online copy is a separate source, and there's some mechanism
for recording that one source is derived from another. But again,
Gentech has no means of recording that one source is derived from
another. Either way, it's a deficiency.
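The two alternatives might look something like this in Python. Neither
exists in Gentech; the Copy class, the derived_from field and all the
sample values are assumptions of mine, purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


# Option 1: metadata hangs off each copy of a source held in a repository.
@dataclass
class Copy:
    source: str
    repository: str
    mime_type: Optional[str] = None
    resolution_dpi: Optional[int] = None
    colour: Optional[bool] = None


# Option 2: every distinct copy is itself a source, linked to its original.
@dataclass
class Source:
    title: str
    medium: str
    derived_from: Optional["Source"] = None


def original(source: Source) -> Source:
    """Follow derived_from links back to the underived source."""
    while source.derived_from is not None:
        source = source.derived_from
    return source


# Invented examples of each option.
scan = Copy("Parish register", "Archive.org", "image/jpeg", 300, colour=False)

register = Source("Parish register", "bound volume")
film = Source("Parish register (microfilm)", "microfilm", derived_from=register)
jpeg = Source("Parish register (scan)", "image/jpeg", derived_from=film)
```

Option 1 keeps citations simple, since there is still one source to
cite; option 2 records the chain of derivation, at the cost of more
source records.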
> - The citation chain is on the face of it equally superfluous as the
> material it contains should be part of the provenance chain; I think
> it's there to contain the damage from ESW fanboys who can't separate
> presentation and data layers.
I expect you're right. When I implemented some of that, I certainly
wasn't able to come up with a good reason why the chain of Citation-
Parts was separate from the chain of Sources. But given the lack of
examples and rationale in the Gentech spec, I wouldn't rule out there
being a valid reason.
> - But I think its main problem is that it's an ER design in what ought
> to be an OO domain. Consider, for instance how much easier it would be
> to be able to simply specify "PersonalName" in the first cut of the
> design and specify it as being implemented by its own class knowing that
> you'll be not only be able to sub-class it later to deal with the
> different name structures of different cultures but also add sub-classes
> whenever the demand for a new cultural requirement is added.
I agree that this is a big problem, but not for the reason you say.
The ER design is really just the way in which the data model is
documented. It doesn't preclude an OO implementation. Personal names
are complicated because they're partly folded into the general
characteristics subsystem. The Personal-Name attribute on the Persona
entity is simply a text string recording what the source said the name
was. If a Persona is associated with a source which refers to him as
'John Smith', then that's what goes there. If the Persona entity
represents a conclusion person, then the Personal-Name attribute
is your preferred display name. Arguably that whole Personal-Name
attribute is redundant and should be dropped; but in any case it's not
really pertinent to Gentech's ER name mechanism, which is done with
characteristics.
A name is a type of characteristic. In the case of a name with
multiple components, such as "Homer J Simpson", this is a single name
Characteristic, but each word is a separate Characteristic-Part.
"Homer" is a given name; "J" is an initial; "Simpson" is a surname.
The correct order of these parts is ensured by the Sequence-Number
attribute on the Characteristic-Part. The list of possible name
components is extensible because "given name", "initial", "surname",
together with things like "patronym" and "regnal number" are all
Characteristic-Part-Type entities. A program implementing this in a
language with OO duck-typing (such as Python or Perl) may very well
implement these as dynamically generated subclasses of an abstract
name class.
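A minimal Python sketch of that idea follows. The class names and the
PART_TYPES list are stand-ins of my own for the Characteristic-Part-Type
entities; a real implementation would presumably load them from the
database rather than hardcode them.

```python
from typing import Iterable


class NamePart:
    """Abstract base for name components (one Characteristic-Part each)."""

    part_type = "unspecified"

    def __init__(self, value: str, sequence_number: int):
        self.value = value
        self.sequence_number = sequence_number


# Stand-in for the rows of Characteristic-Part-Type; extend freely.
PART_TYPES = ["given name", "initial", "surname", "patronym", "regnal number"]

# One dynamically generated subclass per part type, e.g. "GivenName".
PART_CLASSES = {
    name: type(name.title().replace(" ", ""), (NamePart,), {"part_type": name})
    for name in PART_TYPES
}


class NameCharacteristic:
    """A single name Characteristic made up of ordered parts."""

    def __init__(self, parts: Iterable[NamePart]):
        self.parts = list(parts)

    def display(self) -> str:
        # Sequence-Number, not insertion order, decides the word order.
        ordered = sorted(self.parts, key=lambda p: p.sequence_number)
        return " ".join(p.value for p in ordered)


homer = NameCharacteristic([
    PART_CLASSES["surname"]("Simpson", 3),
    PART_CLASSES["given name"]("Homer", 1),
    PART_CLASSES["initial"]("J", 2),
])
```

Adding a new kind of name component is then just a matter of adding a
row to the part-type list; no code changes are needed.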
But I suspect that's not quite what you mean. I suspect you mean more
hardcoded classes for "western name" (being one or more given names,
followed by a surname), "Slavic name" (given name, patronymic,
surname), "royal name" (given name, regnal number) and so on. That's
entirely achievable in this sort of ER design too. In the ER paradigm
you would simply add a Characteristic-Type entity which would name and
aggregate a series of Characteristic-Part-Type entities. Translated
into OO terms, the *-Type entities describe the class hierarchy, the
other entities contain the data in them.
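As a hedged illustration in Python, a Characteristic-Type could be
little more than a named, ordered list of part types. The table and the
matches() helper below are my own invention, and collapsing repeated
parts is a deliberate simplification.

```python
from itertools import groupby
from typing import Dict, List, Tuple

# Hypothetical Characteristic-Type rows: each names an ordered series of
# Characteristic-Part-Type entries.
CHARACTERISTIC_TYPES: Dict[str, List[str]] = {
    "western name": ["given name", "surname"],
    "Slavic name": ["given name", "patronymic", "surname"],
    "royal name": ["given name", "regnal number"],
}


def matches(type_name: str, parts: List[Tuple[str, str]]) -> bool:
    """True if the (part type, value) pairs follow the named pattern.

    Consecutive parts of the same type are collapsed into one, so "one
    or more given names" passes. This is a simplification: it would
    also let any other part type repeat.
    """
    collapsed = [part_type for part_type, _ in groupby(p[0] for p in parts)]
    return collapsed == CHARACTERISTIC_TYPES[type_name]


homer = [("given name", "Homer"), ("given name", "Jay"), ("surname", "Simpson")]
henry = [("given name", "Henry"), ("regnal number", "VIII")]
```

Here the CHARACTERISTIC_TYPES table plays the role of the class
definitions, and lists like homer and henry are the instances.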
To summarise my point, an ER specification is entirely compatible with
an OO implementation and an OO exchange format. In the specific case
of names, I think their specification isn't ideal, but only in its
details. The ER specification may be less familiar to those used to
the OO paradigm, but I'm not sure it's inappropriate here. After all, I
suspect a lot of people will be thinking in terms of an SQL-backed
implementation, in which case the ER formalism is much more natural.
Richard