Hi,
Not sure where this work is headed and unfortunately I lack the time &
head space just now to play any active role. But here's a quick
initial interjection - take or leave as you wish.
From what I can see the aim is to purely record dataset quality
measures against some measurement system. [1]
What you might want to consider is how such measures break down into
evidence and provide ways of recording those.
Let's take an example of "currency" (as in how current the data is).
Rather than have a good/bad/indifferent score for currency it might be
more helpful to have a set of assertions which cover frequency of
update of the data, time lag between change and publication, history
of actual updates etc. That way I can choose whether/how to use the
data based on my specific requirements. If I'm doing stock trading
then the 15min delay figures aren't useful but if I'm doing historical
analysis they are fine. If I'm looking at road layout data then an
update rate of weekly would mostly be just fine but not if I'm
interested in changes due to traffic works.
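To make that concrete, here is a minimal sketch in Python of recording currency as separate assertions and letting each consumer test them against their own needs. The class names, field names and the example figures are all made up for illustration; nothing here comes from an existing vocabulary.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class CurrencyEvidence:
    """Hypothetical set of currency assertions published with a dataset."""
    update_interval: timedelta   # how often the data is refreshed
    publication_lag: timedelta   # delay between a change and its publication


@dataclass(frozen=True)
class CurrencyRequirement:
    """Hypothetical statement of what one consumer can tolerate."""
    max_update_interval: timedelta
    max_publication_lag: timedelta


def fit_for_purpose(evidence: CurrencyEvidence,
                    requirement: CurrencyRequirement) -> bool:
    """True if the published evidence satisfies this consumer's requirement."""
    return (evidence.update_interval <= requirement.max_update_interval
            and evidence.publication_lag <= requirement.max_publication_lag)


# Illustrative values only: a price feed published with a 15 minute delay.
delayed_quotes = CurrencyEvidence(
    update_interval=timedelta(minutes=1),
    publication_lag=timedelta(minutes=15),
)

# Two consumers with different (assumed) tolerances for the same dataset.
trading = CurrencyRequirement(
    max_update_interval=timedelta(minutes=1),
    max_publication_lag=timedelta(seconds=5),
)
historical_analysis = CurrencyRequirement(
    max_update_interval=timedelta(days=1),
    max_publication_lag=timedelta(days=1),
)

print(fit_for_purpose(delayed_quotes, trading))
print(fit_for_purpose(delayed_quotes, historical_analysis))
```

The point is that the dataset carries the same evidence in both cases and only the requirement changes, which a single good/bad score cannot express.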
Similarly some aspects of authority and accuracy are bundled up with
provenance. I might find it more useful to know exactly where the data
set came from and how it was processed and make my own judgement on
what that means for quality [2].
As an example of the sort of metadata about datasets which relates to
quality it is worth looking at the statistics world, which takes a huge
amount of care defining "data flows" with associated update frequency,
timeliness, provenance, measurement methodology, status etc. In
particular the SDMX standard has a set of "content oriented
guidelines" [3]. The cross-domain concepts (annexe 1) provide
concepts for defining frequency, validation, whether a particular
observation is real/estimated/forecast, measurement standards used,
professionalism standards used by the staff preparing the data etc.
The metadata guidelines (annexe 4) have a lot more concepts which
relate to quality plus a "quality index" concept. Now only a handful
of these are code lists, most are free text, so the level of machine
processability of such annotations is limited but it is a good example
of the kind of richness needed in domains where the data actually
matters. If you wanted this vocabulary to apply to government related
data then, at least in Europe, you would probably want to be able to say
how it relates to the existing standards like SDMX.
FWIW we have RDF conversions of the SDMX annexe stuff. In particular
we encode the cross-domain concepts as SKOS concepts [4] and as
various sorts of qb properties, including qb:AttributeProperty [5], and
have the metadata annexe captured as a SKOS concept list [6].
Cheers,
Dave
[1] BTW that aim should be written down somewhere. IMHO to define a
vocabulary you need an aim and a set of competency questions that it
will allow you to answer. I take the current competency question to be
"how high quality is this data set on these dimensions", whereas what
I'm suggesting is you take the top level question of "is this data of
sufficient quality for my purpose" and break that down into more
specific questions like "has this been updated in the last week", "is
it likely to be updated sufficiently regularly in the future".
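As a rough illustration, those two specific questions could be answered from a recorded update history along these lines. This is only a sketch: the function names and the example dates are invented, and treating past regularity as a predictor of future updates is my assumption, not something a vocabulary could guarantee.

```python
from datetime import datetime, timedelta


def updated_within(update_history, now, window):
    """Competency question: has this been updated within `window` of `now`?"""
    return any(now - window <= stamp <= now for stamp in update_history)


def largest_gap(update_history):
    """Longest interval between consecutive recorded updates, or None."""
    ordered = sorted(update_history)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=None)


def likely_regular(update_history, expected_interval):
    """Competency question: is it likely to keep being updated regularly?

    Assumption: past gaps no longer than `expected_interval` are taken as
    a reasonable proxy for future behaviour.
    """
    gap = largest_gap(update_history)
    return gap is not None and gap <= expected_interval


# Illustrative update history for a hypothetical dataset.
history = [datetime(2011, 3, 1), datetime(2011, 3, 8), datetime(2011, 3, 15)]
now = datetime(2011, 3, 18)

print(updated_within(history, now, timedelta(days=7)))
print(likely_regular(history, timedelta(days=7)))
```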
[2] Which is what we do in the data.gov.uk work, where we standardized
on OPMV as the vocabulary for provenance and we record processing
steps in sometimes quite fine-grained detail.
[3] http://sdmx.org/?page_id=11
[4] http://purl.org/linked-data/sdmx/2009/concept#
[5] http://purl.org/linked-data/sdmx/2009/attribute#
[6] http://purl.org/linked-data/sdmx/2009/metadata# (humm that seems
to 404 at the moment, can send a copy if you are interested)