I have one suggestion that's significantly different from the ideas
I've seen here. It's the one real diatribe I'd like to drop on this
conversation (so far).
I'd like to see a bibliographic exchange format include a provision
for carrying the change history of every record along with the record.
Before you just snort coffee out your nose and click the "Trash"
button, give me a minute to try to convince you...
Several lines of reasoning prompt me to suggest this. I think it's
important for "buy-in" right now. I think it matters for
future-proofing. And I think it's quite plausibly lightweight to
implement (in terms of complexity).
Right now, the standard model is that every taxonomist keeps his or
her own reference information privately, hermetically sealed from the
rest of the world (except as expressed in the citation lists of
published papers). Why would I (were I a taxonomist) do that? Because
I know just how much (or little) effort I've put into verifying the
accuracy of my references. And I care about that. A lot. Someone
else's references are going to be a mish-mash of exquisitely verified
information and crud screen-scraped off the end of a piece of grey
literature.
In putting together the decapod reference collection, we found that we
got adoption from workers in the field after we reluctantly
surrendered to the demand of one of them: track the changes made to
each reference in the database. Why did that make the difference?
Because then the level of verification for each reference was
apparent. If a reference has a series of six change records (each
attributed to a person) that document corrections to the exact author
spelling, establish the non-imprinted year of publication, confirm
that the species name really is misspelled in the paper title, etc.,
then workers are willing to adopt that reference and use it. The
granularity is important: a single change (say "fixing" the year) can
be backed out or personally rejected without putting the whole record
in doubt.
If the reference just has a source associated with it, and it shows no
subsequent record of verification and confirmation, workers tend to
look at its accuracy skeptically (rightly so). In those cases, they're
likely to believe their own information (year of publication, etc.)
over a centralized source of information that's not detectably
verified. Being able to carry the change information with a
bibliographic record retains the scholarship that speaks to the
accuracy. Stripping it out means that the reference has now
degenerated back to "uncertain" status, no matter what verification
was done to it.
One of the themes that kept surfacing at the e-Biosphere conference
(for example) was the need for tracking data provenance. It's
becoming more and more important to keep track of who
generates data, who has contributed to its development, and how
trustworthy it is, now that we're starting to look at products other
than published papers as professional scientific products. Being able
to keep hold of the "audit trail" of a bibliographic reference is part
of that picture. Once again: bibliographic references in taxonomy
receive bizarrely huge amounts of time and effort to ensure their
validity. That work needs to be captured, it needs to be attributable,
and it needs to be fine-grained. A system for exchanging bibliographic
data really needs to be able to transport that scholarship intact as a
documented series of changes.
Finally, I don't think this should be awfully complex to implement.
The only really new information needed for tracking changes is the
description of the change and the name/ID of the person doing it.
Every earlier version of a reference uses the same data fields as the
current one, so no new field definitions are needed. Prior versions
could be carried along in their entirety, or a scheme of deltas could
be used. Either way, this adds nothing intrinsically complex to a
data schema.
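
To make the "deltas" option concrete, here is a minimal sketch in
Python. It is only an illustration under my own assumptions, not a
proposed schema: the class names (Change, Reference), the field names,
and the sample reference and editors are all hypothetical. Each change
carries the two new pieces of information (a description and the
person's name/ID) plus the field touched and its old and new values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Change:
    """One attributed edit to one field (all names here are hypothetical)."""
    editor: str                 # name/ID of the person making the change
    description: str            # what was changed and why
    field_name: str             # which bibliographic field was touched
    old_value: Optional[str]    # None if the field did not exist before
    new_value: Optional[str]    # None if the change removed the field


@dataclass
class Reference:
    """Latest field values plus the list of deltas that led to them."""
    fields: Dict[str, str]
    history: List[Change] = field(default_factory=list)

    def apply(self, editor, description, field_name, new_value):
        """Record an attributed change and update the current value."""
        change = Change(editor, description, field_name,
                        self.fields.get(field_name), new_value)
        if new_value is None:
            self.fields.pop(field_name, None)
        else:
            self.fields[field_name] = new_value
        self.history.append(change)
        return change

    def back_out(self, change, editor):
        """Reverse a single change, logging the reversal as a change itself."""
        if self.fields.get(change.field_name) != change.new_value:
            raise ValueError("field has been edited since; review by hand")
        return self.apply(editor, "backed out: " + change.description,
                          change.field_name, change.old_value)


# Made-up reference and editors, purely for illustration.
ref = Reference({"author": "Smith, J.", "year": "1901"})
fix = ref.apply("editor-1", "issue actually appeared the following year",
                "year", "1902")
ref.back_out(fix, "editor-2")   # editor-2 rejects just that one change
```

Because each delta names a single field, backing out the year "fix"
leaves the rest of the record alone, and the reversal is itself
recorded and attributed rather than silently erased.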
Clients supplying references that are unable to include versioning
information are no worse off than they were: their contributions
become "stripped" as they are supplied (bummer). Similarly, clients
unable to use (or uninterested in) version history could discard it on
receipt. Because that will account for many users, I think that the
most recent version of a reference should be immediately available (it
should not have to be derived by playing back deltas starting with an
initial version). But I expect that it will be increasingly non-
optional to track change information on data (including bibliographic
references), so more and more systems will hold this information.
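
As a rough sketch of how that could look on the wire (again in Python,
and again under assumptions of my own: the JSON layout, the "current"
and "history" keys, and the sample data are hypothetical, not part of
any existing format), the latest version sits at the top level and the
history is an optional extra:

```python
import json


def to_exchange(current, history=None):
    """Serialise a reference; the latest values are always stored directly."""
    payload = {"current": dict(current)}
    if history:                      # suppliers without history just omit it
        payload["history"] = list(history)
    return json.dumps(payload, sort_keys=True)


def read_current(message):
    """A history-unaware client: take the latest version, ignore the rest."""
    return json.loads(message)["current"]


def read_history(message):
    """A history-aware client: an absent history is simply an empty one."""
    return json.loads(message).get("history", [])


# Made-up data, purely for illustration.
current = {"author": "Smith, J.", "year": "1902"}
history = [{"editor": "editor-1", "description": "corrected year",
            "field": "year", "old": "1901", "new": "1902"}]

full = to_exchange(current, history)      # supplier that tracks changes
stripped = to_exchange(current)           # supplier that cannot
```

Both kinds of client read the same "current" block without replaying
anything; the only difference is whether a history comes along with it.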
By building in, at the outset, the capability to exchange change
information with bibliographic references, I think we encourage uptake
in the short term (because users will be able to see data provenance
now, where available) and position the system/protocol for the future
(because I think tracking data history is going to become effectively
mandatory), all for very little increase in complexity.
-Dean
--
Dean Pentcheff
pent...@gmail.com