This is mostly directed at the people who participated in the literature
discussions at TDWG, but comments from any/all are certainly welcome!
Towards the end of our discussion on the final day, we got to the bit in the
EndNote DTD that described how "style" was applied to various fields (such
as titles, etc.). EndNote has a mechanism whereby certain aspects of style,
such as italicized words in titles, could be captured as attributes rather
than as any sort of embedded markup in the data field itself. We discussed
it briefly, and came to the conclusion that such style markup would not
assist in disambiguating duplicate records, and therefore was not necessary
for our "exchange" standard.
However, having thought about this a bit more, I realize that:
1) "exchange" is not necessarily just about disambiguation and
de-duplication; it is actually about "exchange"; and
2) our community tends to have a bunch of italicized words in the titles;
and
3) those italicized words (and the fact that they are italicized) are useful
to our community not just for formatting purposes, but also for tagging
certain taxon names.
Given that this could/should be the preferred exchange standard for
contributing content to "CiteBank", and given that "CiteBank" will serve as
a direct source of bibliographic citations for many folks who might want to
render properly italicized scientific names in titles, perhaps we *should*
consider retaining some mechanism for embedding some form of italics style
within the exchange standard, so that datasets with known italicized words
do not have to discard that information when contributing citations to
CiteBank.
Thoughts?
Aloha,
Rich
Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deep...@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html
<titles>
  <title titleType="FullTitle" xml:lang="en" markup="none">The Chronicles of Narnia: Voyage of The Dawn Treader</title>
  <title titleType="FullTitle" xml:lang="zh" markup="none">黎明號的遠航</title>
  <title titleType="FullTitle" xml:lang="en" markup="html">The Chronicles of Narnia: Voyage of <i>The Dawn Treader</i></title>
  <title titleType="FullTitle" xml:lang="en" markup="markdown">The Chronicles of Narnia: Voyage of *The Dawn Treader*</title>
  <title titleType="ShortTitle" xml:lang="en" markup="none">Voyage of The Dawn Treader</title>
  <title titleType="ShortTitle" xml:lang="en" markup="html">Voyage of <i>The Dawn Treader</i></title>
  <title titleType="ShortTitle" xml:lang="en" markup="markdown">Voyage of *The Dawn Treader*</title>
  <title titleType="AbbreviatedTitle" xml:lang="en" markup="none">Chron. Narnia: Voy. Dawn Treader</title>
  <title titleType="AbbreviatedTitle" xml:lang="en" markup="html">Chron. Narnia: Voy. <i>Dawn Treader</i></title>
  [...etc....]
</titles>
Same idea, but I don't see any need for defining an extra element for
"formattedTitle", because whether or not the title is formatted is evident
from the attributes.
I'm not sure if these attribute names are the best (are there any other
existing defined attributes that equate to these?). I originally thought of
using "styleMarkup" instead of just "markup" to encourage people to limit
their markup to aspects of style. However, I also like Kevin Richards's idea
of leaving the door open to more robust semantic markup (e.g., of taxon
names) within titles, if someone has that sort of information. Mind you,
I'm not necessarily advocating it -- just leaving the door open for it by
using a generalized solution to flagging the markup.
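For what it's worth, here is a toy sketch (Python; the function and the
markup values mirror the attribute in the XML above, but none of this is a
defined API) of how a consumer could reduce a marked-up title to plain text
for matching/de-duplication, while keeping the formatted form for display:

import re

def plain_text(title, markup):
    """Reduce a marked-up title to plain text, e.g. for de-duplication."""
    if markup == "html":
        return re.sub(r"</?\w+[^>]*>", "", title)  # strip <i>...</i> etc.
    if markup == "markdown":
        return re.sub(r"[*_]", "", title)          # strip *...* emphasis
    return title                                   # markup == "none"

formatted = "Voyage of <i>The Dawn Treader</i>"
assert plain_text(formatted, "html") == "Voyage of The Dawn Treader"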
Rich
Whatever publications in the future will be like, they will all
reference and build on "old school" publications (openly accessible
or not) that are originally referenced through classic citation
strings. So even future publications will be founded on this existing
body of literature, further growing this ever-expanding directed graph,
where publications are the nodes and "Paper A references Paper B" relations
are the edges. In order to be able to browse this graph by means of
hyperlinks or UUID (to use Rich's general term) resolution, we need
two things:
- a UUID for all the existing publications, at least for the ones that
are referenced by others (in the graph model, nodes having in-edges)
- a facility to obtain the UUID of the referenced publications based
on the citation strings given in the referencing publications
(in the graph model, find out where exactly the out-edges of a node go)
As we do not digitize bottom-up (i.e., starting with Linnaeus), but more
or less top-down (i.e., with today's publications), we need to assign a
UUID to a publication based on the mere reference string, even if the
referenced publication is not yet available digitally and therefore
does not have a UUID yet. In other words, we need to be able to assign
UUIDs based on parsed reference strings, and exactly this is what
CiteBank is intended to make possible.
Synchronization is then to make sure each node in the reference graph
has exactly one UUID, but assigning more than one is not a severe
problem, as synchronization can later establish that several UUIDs
actually point to the same publication / node in the reference graph.
As far as I understand it, this "collect reference strings and issue
UUIDs now, synchronize later" model is what came out of the CiteBank
discussion in Montpellier.
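A minimal sketch of this "issue UUIDs now, synchronize later" idea (Python;
all names are illustrative, not CiteBank's actual interface):

import uuid

registry = {}   # normalized reference string -> issued UUID
same_as = {}    # UUID -> canonical UUID, filled in by later synchronization

def issue_uuid(reference_string):
    """Return the UUID already issued for this string, or issue a new one."""
    key = " ".join(reference_string.lower().split())  # crude normalization
    if key not in registry:
        registry[key] = uuid.uuid4()
    return registry[key]

def canonical(u):
    """Follow later-established synonymy so duplicate UUIDs resolve together."""
    while u in same_as:
        u = same_as[u]
    return u

a = issue_uuid("Linnaeus, C. 1758. Systema Naturae. 10th ed.")
b = issue_uuid("Linnaeus 1758, Systema Naturae, ed. 10")
same_as[b] = a            # synchronization: both denote the same node
assert canonical(b) == canonical(a)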
Please, Rich, correct me if I'm mistaken.
Putting up a service offering something like "give me the reference
for the original description of taxon XYZ" based on CiteBank should
not be too much of a problem technically. And as far as I understand the
previous mails in this discussion, this is what taxonomists want, isn't it?
So far my two cents,
Guido
And to pick up on Donat's comment: one way to pull in the "classical"
taxonomists is to build a service that lets them get their work done
better than they do now with the tools they have now. Not a new way of
doing biology, but doing the "old" way quicker. They'll contribute if
their contributions come back better than when they put them in. At
least that's been our experience with the decapod system.
And finally... the issue of society journals. I take it we all have
just about no sympathy for commercial journals: they're a business.
Either they figure out how to make money giving us (all of us) what we
want, or they die. But there's the argument that society journals are
a significant source of income for biological societies.
We raised this issue with members (and officers) of the Crustacean
Society, publisher of the Journal of Crustacean Biology. We were
raising it in the context of our initiative to make decapod crustacean
taxonomy articles freely available to the public on our web server
(and that would necessarily include many JCB articles).
We made the argument essentially this way: No professional society
dedicated to advancing the study of a discipline can, in good
conscience, support itself by restricting access to the published
knowledge on that subject. There was silence. Heads nodded. And people
started talking about different financial models.
I still believe that argument.
-Dean
--
Dean Pentcheff
pent...@gmail.com
Great discussion -- but I only have a few minutes right now.
Very briefly:
> Please, Rich, correct me if I'm mistaken.
Actually, when I used the term "UUID", I used it very specifically to mean
the UUID mechanism for generating non-resolvable identifiers, which I
advocate should serve as the "identification" part of our "persistent
resolvable identifiers". The "resolvable" part would come from an HTTP
prefix (as one example of a resolution protocol) prepended to the UUID. In
the context of what you wrote, I would use the more generic term "GUID"
(although lately, thanks in part to what I would consider a misleading
Wikipedia page, the term "GUID" is almost synonymous with "UUID"...but
let's not go there).
Conceptually, though, I think what you write is consistent with my view.
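To make the distinction concrete, a two-line sketch (the resolver host
below is purely hypothetical):

import uuid

identifier = uuid.uuid4()  # RFC 4122 UUID: the non-resolvable "identification" part
resolvable = "http://resolver.example.org/" + str(identifier)  # HTTP prefix supplies resolution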
> Putting up a service offering something like "give me the
> reference for the original description of taxon XYZ" based on
> CiteBank should not be too much of a problem technically.
Actually, this would be a service of GNUB. I imagine CiteBank as indexing
the units of literature and literature-like documentation sources; and the
taxonomic stuff falls under GNA.
Rich
- Guido
That's *exactly* what I think we're trying to do here.
Rich
That said, I think all professional societies that publish journals (at
least all with the good sense to pay attention) are examining ways to
accommodate open access and change current fiscal models. But it won't
happen overnight. Besides, open access cannot be equated with
universal access - until all have access to appropriate technology that
can be reliably provided, we need the alternative structure to keep
things going.
Let's be honest about something - if authors have to pay (from personal
budgets) to get their research published, then the pace of publication
could slow dramatically. I wonder how many of us have enough loose
change to cover these costs ourselves? Do you envision a system in which
only the relatively privileged few can continue publishing?
Finally, I'm glad to learn that most publishing taxonomists are at
"corporate institutions" that have funds readily available to pay for
publication costs. Please show me the data to support this contention.
And, given that you appear to know these things, please tell me where in
my institution's budget you have identified the funds to pay publication
costs?
Dick J
Richard Jensen, Professor
Department of Biology
Saint Mary's College
Well, I was just throwing around preliminary ideas, and making the point that libraries have a conflict of interest here, so clearly they don't want all literature to be open access on the net very soon...
>Do you envision a system in which only the relatively privileged few can continue publishing?
Absolutely not! I envisage a system where institutions pay the publishing costs of their researcher employees, and other people can apply for grants/exemptions. Something along these lines already exists, see: http://www.royalsociety.org.nz/Site/publish/authors/submit.aspx
>if authors have to pay (from personal budgets) to get their research published, then the pace of publication could slow dramatically
Well, the pace, around here anyway, is pretty bloody slow as it is! How slow can you go?
Bear in mind that authors also require access to literature, i.e., references for their publications, which on an open access model will be free. Institutions will spend far less on library budgets, so in principle can spend that money paying publication fees.
>Finally, I'm glad to learn that most publishing taxonomists are at "corporate institutions" that have funds readily available to pay for publication costs. Please show me the data to support this contention. And, given that you appear to know these things, please tell me where in my institution's budget you have identified the funds to pay publication costs?
I didn't say that they have funds "readily available". A complete overhaul of the business model would be required. Though, as I said above, money saved on journal subscriptions COULD go towards publishing costs. Currently in the corporate sector, I see an awful lot of money going on transport and accommodation costs, particularly for senior management types. The money is there - we "just" need to somehow make sure that it gets used appropriately. Big ask, I know! :)
Stephen
________________________________________
From: taxo...@googlegroups.com [taxo...@googlegroups.com] On Behalf Of Richard Jensen [rje...@saintmarys.edu]
Sent: Friday, 18 December 2009 3:57 a.m.
To: Stephen Thorpe
Cc: Richard Pyle; 'Rod Page'; 'Taxonomic Literature'
Subject: Re: [TaxonLit] Re: Thoughts on Style
Libraries, as we know them, may be doomed to extinction, but using that
Dick J
Saint Mary’s College
Notre Dame, IN 46556
Tel: 574-284-4674
Stephen Thorpe wrote:
> Yes, but this is the very obstacle that needs to be changed, by changing the system. In the electronic future, libraries are doomed to extinction anyway! It seems fair enough to me for authors to have to pay to get their papers published, papers that will then be freely available. Most authors in taxonomy are 'professionals' working for corporate institutions who judge them by the number of their publications, so the institutions should pay to get those publications out. At any rate, it isn't at all clear to me that spending millions on numerous bioinformatics initiatives, all trying to get around the above problem, will be any cheaper in the long run than simply buying copyrights or else paying the cost of registering DOIs ...
I generally agree with Dean's points about the need for conveying a few
bits of style metadata in the exchange standard; and further that it should
be limited to style only (not semantic markup). The italics probably don't
help to uniquely identify a piece of literature, but almost every consumer
of these citations will want that information included in the output when
downloading content. I think they only need to apply to titles -- no other
pieces of metadata.
> There's not much markup other than italicization that seems to occur.
> There's a very rare (in the taxonomic literature) occurrence
> of super- or sub-scripts. Boldface might occur (but I've
> never seen it). Beyond those, I think we're getting into
> MathML, and I don't think we want to go there.
I think Unicode+italics+subscript+superscript covers everything that most
consumers will want. UTF-8 is a given, so we're just talking about style
markup. Even though we only want those three (italics+sub/superscript)
initially, the mechanism for conveying this information should be generic
(and extensible). I think there are two general approaches, each with an
array of sub-approaches. One approach is to embed the markup with HTML tags
(or similar tags) directly in the titles. The other approach is to embed
the information externally (e.g., as attributes within the <Title> tag). I
think I slightly prefer the latter, but could be persuaded either way.
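To illustrate the "external" approach, a toy sketch (Python; the span
representation is just one possibility, not a proposed schema) that converts
inline <i>...</i> tags into plain text plus style offsets:

import re

def externalize(title_html):
    """Convert inline <i>...</i> markup to (plain text, style spans)."""
    text, spans, pos = "", [], 0
    for m in re.finditer(r"<i>(.*?)</i>", title_html):
        text += title_html[pos:m.start()]
        start = len(text)
        text += m.group(1)
        spans.append(("italic", start, len(text)))  # (style, start, end)
        pos = m.end()
    text += title_html[pos:]
    return text, spans

text, spans = externalize("Voyage of <i>The Dawn Treader</i>")
# text  == "Voyage of The Dawn Treader"
# spans == [("italic", 10, 26)]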
> There's a broader issue raised by the seemingly minor issue
> of title italics (and I suspect that I may differ from Rod
> Page here, though I hold out hope of convincing him!). I
> think it's important to make sure that this exchange format
> can be used (albeit indirectly) to create fully and correctly
> formatted bibliographic reference lists (such as would appear
> at the end of a taxonomic publication in a journal).
Yes, I think that's what many/most end-user consumers/clients will want to
be able to do. Myself included.
Aloha,
Rich
I agree with the first. Second, I'm not so sure (I've seen italicized small
caps, so I'm not sure there really are any ambiguous examples). Third, I
definitely do not add italics where they didn't originally occur. In a
bibliography, I would not italicize scientific names if they were not
originally italicized.
Also, I only bother with markup if specific words in the title are
italicized. If the whole title is in italics, I don't bother. The
exception to this is the (very rare) case where the whole title is in
italics, but the scientific names are not in italics. In that case I think
I would tag the names as italics.
> What to do with all-caps titles?
I don't attempt to retain case (upper vs. lower vs. small caps, etc.) when
it's used for the whole title. But there are some capitalization issues
that we should try to standardize -- like how to capitalize journal article
titles vs. book titles, etc.
> What punctuation to use to separate title and subtitle if
> they are only differentiated by typeface or point size in the
> original?
Good question -- I don't know.
> What to do with non-conventional capitalization in older papers (e.g.
> "Systematics and taxonomy of the Genus Abadabba")?
> etc.
I tend to retain those as originally rendered on the title page.
I think these are the kinds of details we'll need to think about when
developing the business rules around the "clean bucket" part of CiteBank.
And whatever those business rules are will be implemented in the output from
CiteBank. But I'm not sure we need to worry so much about them for *input*
to CiteBank (which, I suspect, in most cases will flow through the "dirty
bucket"). In other words, I think this sort of thing is more of an issue
for CiteBank business rules than for the exchange standard. By contrast,
the italics thing *is* relevant to the exchange standard, because it
affects the actual structure of the exchange standard. The caps thing
really only affects what gets inserted into the content of the exchanged
documents -- not the structure of the exchange standard itself.
Rich
Yes -- more than any other biodiversity data initiative I've been involved
with, I think this one has the potential to "hit the ground running". And
compared to the analogous stuff in Taxon-name-land, this one is VERY close
to ready. Just a few more details to sort out, then we can go.
Rich
- A "dirty bucket" where any text string purported to represent a piece of
literature may be deposited by anyone, with a link back to where that text
string came from (i.e., some sort of identifier that points to the source
database record). This is exactly modelled after GNI.
- A "clean bucket" representing fully parsed citation records stored in a
robust and normalized data structure, issuing the GUIDs we'll all
(eventually) share.
- The above two "buckets" would not exist as single instances, but rather as
many, many replicate copies spread over the world with robust means to
maintain synchronization (i.e., replication and mirroring).
- A suite of services that allows reliable mapping between records in the
dirty bucket to GUIDs in the clean bucket
I believe the workflow would be something along the lines of the following:
- Any citation database (="content providers") can dump their
full-text-string citations into the dirty bucket.
- Services will parse these text strings, and establish "fuzzy" matches with
records in the "clean bucket".
- A report of these mappings, including confidence levels for each mapping,
will be provided back to the content provider.
- Where the content provider is confident in the mapping, the content
provider creates the link to their local copy of the clean bucket.
- Where the content provider has records that do not confidently map to the
"clean bucket", some mechanism for creating a new record in the clean bucket
would be followed.
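A toy sketch of the matching step (Python; record contents, names, and the
threshold are purely illustrative):

from difflib import SequenceMatcher

clean_bucket = {
    "uuid-1": "Author, A. 1900. An example title. Example Journal 1: 1-10.",
}

def fuzzy_match_report(dirty_strings, threshold=0.85):
    """For each dirty string, report the best clean-bucket match and a confidence."""
    report = []
    for s in dirty_strings:
        best_id, best_score = None, 0.0
        for cid, citation in clean_bucket.items():
            score = SequenceMatcher(None, s.lower(), citation.lower()).ratio()
            if score > best_score:
                best_id, best_score = cid, score
        # below-threshold matches go back to the provider as "unmapped"
        report.append((s, best_id if best_score >= threshold else None, best_score))
    return report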
Ultimately, all literature databases would be cross-linked to the clean
bucket; at which time we are in the realm that Rod and Donat envision.
My interest is to get us from where we are now, to where Donat and Rod (and
the rest of us) want the entire community to be, as quickly and efficiently
as possible.
Rich
And yes, I can see the ideal place where we can pull up a reference
list at the end of an old publication and have that near-automatically
linked up with a definitive database of accumulated taxonomic
references. Then we're in shiny linked-data-ville.
A few comments and questions on the proposed architecture:
How dirty should the Dirty Bucket be? I like the idea of a sort of
purgatory for semi-processed records, before they enter the paradise
of full-checkedness. But... as described, the Dirty Bucket could be
pretty much a bin of cut-and-pasted reference lists from the back end
of any/every taxonomic paper. I think that might be too permissive, in
that I doubt we'd get a substantial percentage of those definitively
linked to Clean Bucket references -- the workload would be just too
high.
One way to constrain that a bit might be to set things up so that the
Dirty Bucket will only accept some form(s) of parsed references. That
ensures that there's at least been some effort to "digest" the
references before dumping them in.
The valuable service of fuzzy-matching an arbitrary reference string
to a Clean Bucket record would be a service completely separate from
the Dirty Bucket, in that case.
Another reason I'm inclined to push for pre-parsed references in the
Dirty Bucket is that it's damned hard to parse arbitrary references.
Well, I'll qualify that: I found it damned hard. And I never got code
to do it well enough that it could run as better than a kind of
"parsing assistant" (see http://decapoda.nhm.org/recite). Nearly every
reference needs some sort of manual intervention to be properly parsed
-- journal-formatted bibliographic output just loses too much
field-specificity to be easily reversed back into a parsed record.
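To illustrate why: a deliberately naive parser (Python; the pattern and
field names are mine) handles exactly one house style and nothing else --
which is roughly where automation stalls and the "parsing assistant" takes
over:

import re

PATTERN = re.compile(
    r"^(?P<authors>.+?)\.\s+(?P<year>\d{4})\.\s+(?P<title>.+?)\.\s+"
    r"(?P<journal>.+?)\s+(?P<volume>\d+):\s*(?P<pages>[\d-]+)\.?$"
)

def parse_reference(s):
    m = PATTERN.match(s.strip())
    return m.groupdict() if m else None  # None -> route to manual intervention

ok = parse_reference("Smith, J. 1950. An example title. Example Journal 12: 1-10.")
bad = parse_reference("Smith, J. (1950) An example title. Example Journal, 12, 1-10")
# 'ok' parses into fields; 'bad' (the same reference in another journal's
# house format) falls straight through to a human.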
The next thing I'm scared about is the fuzzy comparison between Dirty
Bucket entries and Clean Bucket records. Because the comparisons have
to be fuzzy (something like a Levenshtein distance), one is stuck with
a quadratic problem: every query record has to be checked against
every potential target record. Actually, it's a little worse: several
"title" fields really need to be checked against all title fields in
each target record. You can do a little preprocessing to speed the
comparisons, but you can't just do checksums and then an indexed
search.
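One standard mitigation (offered only as a sketch, not a claim that it
solves the problem) is "blocking": precompute a cheap key so each query is
fuzzy-compared only within its block rather than against every record:

from collections import defaultdict
from difflib import SequenceMatcher

def build_index(clean_records):
    """Build the blocking index once; key = (year, first letter of authors)."""
    index = defaultdict(list)
    for r in clean_records:
        index[(r["year"], r["authors"][:1].lower())].append(r)
    return index

def best_match(query, index):
    """Fuzzy-compare the query title only within its own block."""
    block = index[(query["year"], query["authors"][:1].lower())]
    scored = [(SequenceMatcher(None, query["title"].lower(),
                               r["title"].lower()).ratio(), r) for r in block]
    return max(scored, key=lambda t: t[0], default=(0.0, None))

Of course, a wrong year or author in the dirty record defeats the key, so
blocking only trims the bulk of the comparisons; the fuzzy step itself stays.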
I'm not saying we shouldn't be doing that comparison stage. But it's
going to take some careful planning, and probably some pretty inspired
input from some really smart computer science / information theory
kind of people. Without that, we'll very quickly get into a
computational quandary.
Another stage that will take some careful work is the "resolution"
phase once a candidate Dirty Bucket record is being presented next to
a menu of plausible Clean Bucket records. That's very much like the
deduplication problem. The complex part there is more of an interface
issue. What we have found is that it's rarely the case that one
can just say "Yup, record A is a dupe of record B, move on". Much more
often, it's "Yeah, looks like they are the same thing, but the journal
name on the new one is more complete than the existing reference,
however the existing one has the issue number that I'm missing....."
Rather than a simple "take this one, trash that one", the session
turns into more of a pick-and-choose field-by-field update of the
existing record.
But that's more of a look forward to interface design than underlying
database design.
-Dean
--
Dean Pentcheff
pent...@gmail.com
2009/12/19 Richard Pyle <deep...@bishopmuseum.org>:
Any text string purported to represent a citation, regardless of how
complete/incomplete, clean/dirty, verified/unverified it may be.
> I like the idea of a
> sort of purgatory for semi-processed records, before they
> enter the paradise of full-checkedness. But... as described,
> the Dirty Bucket could be pretty much a bin of cut-and-pasted
> reference lists from the back end of any/every taxonomic
> paper.
Yes, that's exactly what it should be.
> I think that might be too permissive, in that I doubt
> we'd get a substantial percentage of those definitively
> linked to Clean Bucket references -- the workload would be
> just too high.
Hard to say. The ones that remain unlinked remain unlinked. The cleaner
ones are more likely to get linked. But here's the key: with modern
database engines, the presence of the unlinked records does not have any
meaningful impact on the function of the system as a whole. In other words,
excluding the dirtiest of dirty records has almost no down-side for the
utility of the not-so-dirty records. The main benefits of a liberal "gate"
for the dirty bucket are:
- You get a larger scope of possible permutations of how citations may be
represented, which will help identify the scope of variation that any
citation might take. Once the dirty ones do get linked, that facilitates
the linking of future dirty ones.
- You lower the bar for participation to anyone with any set of text strings
purported to represent citations, including "microcitations" gleaned from
scanned literature, OCR'd bibliographies from published papers, etc., etc.
This is essentially the model for the Global Names Index
(www.globalnames.org), and based on conversations that Chris and I and
others had at TDWG, I am absolutely convinced that this model will serve as
important a function for CiteBank as GNI does for GNA.
> One way to constrain that a bit might be to set things up so
> that the Dirty Bucket will only accept some form(s) of parsed
> references. That ensures that there's at least been some
> effort to "digest" the references before dumping them in.
Why? What value do you gain by excluding the un-parsed text strings? As with
GNI, there will be parsing algorithms for the text strings. Also as has
been discussed for GNI, there should be a mechanism for content providers
with pre-parsed records (or, better yet, records pre-linked to the clean
bucket) to submit that parsed/linked content directly to the dirty bucket,
to help the parsing algorithms "learn" how to improve their methods.
> The valuable service of fuzzy-matching an arbitrary reference
> string to a Clean Bucket record would be a service completely
> separate from the Dirty Bucket, in that case.
It will be a completely separate service in any case. The more links that
are made (and verified), the more robust the linking capabilities become.
In other words: one does not need to link every unlinked dirty record to a
clean record -- one need only link the dirty records to other dirty records
that have already been linked to a clean record (such links would, of
course, require some sort of verification -- as would all links between
dirty bucket and clean bucket that were algorithmically derived).
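A sketch of that transitive idea (Python; identifiers illustrative):

dirty_to_clean = {"dirty-1": "clean-42"}   # verified dirty -> clean links
dirty_to_dirty = {"dirty-2": "dirty-1"}    # verified "same citation" links

def resolve(dirty_id):
    """Follow dirty-to-dirty links until a clean-bucket link is found."""
    seen = set()
    while dirty_id not in dirty_to_clean:
        if dirty_id in seen or dirty_id not in dirty_to_dirty:
            return None                    # unlinked, for now
        seen.add(dirty_id)
        dirty_id = dirty_to_dirty[dirty_id]
    return dirty_to_clean[dirty_id]

assert resolve("dirty-2") == "clean-42"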
> Another reason I'm inclined to push for pre-parsed references
> in the Dirty Bucket is that it's damned hard to parse
> arbitrary references.
Agreed. I've been *mighty* impressed by the work done for GNI in parsing
name-strings; but I agree that citation-strings are more complex and
potentially ambiguous. On the other hand, there is a much larger body of
work that has already been done on that (by other communities), and there
are many dictionaries that can assist the parsing algorithms. In any case,
I think there should be an option to include pre-parsed records, but I see
no reason why it would be advantageous to constrain the "least common
denominator" for contributed content to the pre-parsed subset.
> The next thing I'm scared about is the fuzzy comparison
> between Dirty Bucket entries and Clean Bucket records.
> Because the comparisons have to be fuzzy (something like a
> Levenshtein distance), one is stuck with a quadratic
> problem: every query record has to be checked against every
> potential target record. Actually, it's a little worse:
> several "title" fields really need to be checked against all
> title fields in each target record. You can do a little
> preprocessing to speed the comparisons, but you can't just do
> checksums and then an indexed search.
I think we'll find that with modern computer technology, this is not so
scary. I am utterly *amazed* at how quickly the GNI comparisons can be done
-- and that will end up as a MUCH larger dataset.
> I'm not saying we shouldn't be doing that comparison stage.
> But it's going to take some careful planning, and probably
> some pretty inspired input from some really smart computer
> science / information theory kind of people. Without that,
> we'll very quickly get into a computational quandary.
Yup.
> Another stage that will take some careful work is the "resolution"
> phase once a candidate Dirty Bucket record is being presented
> next to a menu of plausible Clean Bucket records. That's
> very much like the deduplication problem. The complex part
> there is more of an interface issue. What we have found is
> that it's rarely the case that one can just say "Yup,
> record A is a dupe of record B, move on". Much more often,
> it's "Yeah, looks like they are the same thing, but the
> journal name on the new one is more complete than the
> existing reference, however the existing one has the issue
> number that I'm missing....."
> Rather than a simple "take this one, trash that one", the
> session turns into more of a pick-and-choose field-by-field
> update of the existing record.
Yup -- not easy. But not insurmountable. And again, I don't see how the
presence of unlinked dirty records in any way hampers the cross-linking
process for the less-dirty (e.g., pre-parsed) records.
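For what it's worth, a toy sketch of that field-by-field merge (Python; the
"longer value wins" heuristic is a placeholder for real business rules and
human review):

def merge(existing, incoming):
    merged = {}
    for field in existing.keys() | incoming.keys():
        a, b = existing.get(field, ""), incoming.get(field, "")
        merged[field] = a if len(a) >= len(b) else b
    return merged

old = {"journal": "J. Crust. Biol.", "issue": "3"}
new = {"journal": "Journal of Crustacean Biology", "issue": ""}
assert merge(old, new) == {"journal": "Journal of Crustacean Biology",
                           "issue": "3"}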
Rich