Thoughts on Style


Richard Pyle

Nov 14, 2009, 12:24:29 AM
to Taxonomic Literature

Hi All,

This is mostly directed at the people who participated in the literature
discussions at TDWG, but comments from any/all are certainly welcome!

Towards the end of our discussion on the final day, we got to the bit in the
EndNote DTD that described how "style" was applied to various fields (such
as titles, etc.). EndNote has a mechanism whereby certain aspects of style
such as italicized words in titles could be captured as attributes, rather
than as any sort of embedded markup in the data-field itself. We discussed
it briefly, and came to the conclusion that such style markup would not
assist in disambiguating duplicate records, and therefore was not necessary
for our "exchange" standard.

However, having thought about this a bit more, I realize that:

1) "exchange" is not necessarily just about disambiguation and
de-duplication; it is actually about "exchange"; and

2) our community tends to have a bunch of italicized words in the titles;
and

3) those italicized words (and the fact that they are italicized) are useful
to our community not just for formatting purposes, but also for tagging
certain taxon names.

Given that this could/should be the preferred exchange standard for
contributing content to "CiteBank", and given that "CiteBank" will serve as
a direct source of bibliographic citations for many folks who might want to
render titles with properly italicized scientific names, perhaps
we *should* consider retaining some mechanism for embedding some form of
italics style within the exchange standard, so that datasets with known
italicized words do not have to discard that information when contributing
citations to CiteBank.

Thoughts?

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deep...@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html

sau...@ira.uka.de

Nov 14, 2009, 7:43:16 PM
to taxo...@googlegroups.com
Hi All,

if we want to format titles, I'd rather suggest having something like
a title type "formatted", or even a formattedTitle element (as it's
rather distinct from the data-oriented title element we designed).
For formatting purposes, we can then allow HTML tags in the
textual content of the formattedTitle element.

While this approach adds slight redundancy, I see one considerable advantage:
The content of the original title elements can be normalized (e.g.,
regarding whitespace) without blowing apart the offsets stored in
the style element. Normalization tends to help a great deal with
matching and de-duplication.
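
To illustrate the offset problem described above, here is a minimal Python sketch. It assumes, purely for illustration, that the style element stores an italic span as (start, end) character offsets into the raw title; the names normalize_whitespace, raw_title and italic_span are hypothetical and not part of any schema discussed here.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace to single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical record: a raw title containing a double space, plus an italic
# span stored externally as (start, end) offsets into that raw title.
raw_title = "Voyage  of The Dawn Treader"
italic_span = (11, 27)  # covers "The Dawn Treader" in raw_title

normalized = normalize_whitespace(raw_title)
start, end = italic_span

print(repr(raw_title[start:end]))   # the span is correct against the raw title
print(repr(normalized[start:end]))  # the same offsets are now one character off
```

Keeping the formatted form separate means the plain title can be normalized freely, since no offsets point into it.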

What do you think?

Hope you all had a safe trip home.

All the best,
Guido

Richard Pyle

Nov 14, 2009, 8:26:26 PM
to sau...@ira.uka.de, taxo...@googlegroups.com

Thanks, Guido.

I don't have any insights on this question (even though I've wrestled with
it for many years). The two basic options are:

1) Store formatting metadata external to the title itself; or
2) Embed formatting metadata (e.g., as HTML tags) within the title itself.

Each has advantages and disadvantages. In my implementations, I store both
in separate fields (the formattedTitle, and the cleanTitle; the latter being
derived from the former), but I'm very happy to swap it out. It seems
like we're heading toward a model that accommodates any number of title
representations, so we have some flexibility. If we went with the external
formatting metadata approach, we would potentially need to attach it to
every representation of the title. If we went with the embedded approach, we
could treat it as an attribute (e.g., with an attribute for
"embeddedFormatting" with values of "None", "HTML", etc.), or we could
include it as part of the definition of the "kindOfTitle" (e.g.,
"FullTitle", "FullTitleHTML", "ShortTitle", "ShortTitleHTML", etc.).

Like I said, they all have costs & benefits.
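
As an illustration of the "store both" arrangement mentioned above, where the clean title is derived from the formatted one, here is a minimal Python sketch. It assumes the embedded formatting is HTML and uses only the standard-library html.parser module; the function name clean_title is hypothetical.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects character data and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_title(formatted_title: str) -> str:
    """Derive the plain title from a title with embedded HTML style tags."""
    parser = _TextOnly()
    parser.feed(formatted_title)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(clean_title("Voyage of <i>The Dawn Treader</i>"))
```

The derivation only goes one way: the formatted title cannot be recovered from the clean one, which is why the formatted form has to be the stored source in this arrangement.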

We might also want to define the scope of what is allowable for formatting.
In my implementations, the only ones I bother with are italics, subscript,
and superscript. There may also be a case for bold and/or underlined. Any
others?

We should, of course, keep in mind that this is still intended as an
exchange schema, and that the question of whether or not to include
formatting is probably less of a disambiguation issue than it is a "this is
useful information for most clients, so let's include it in the exchange
standard" issue.

Aloha,
Rich

Kevin Richards

Nov 15, 2009, 3:38:48 AM
to Richard Pyle, sau...@ira.uka.de, taxo...@googlegroups.com
We too have grappled with this issue in the past.
We also keep both fields in our DB - the formatted one being a "cache" of the other.

Our reference structure has always been quite structured (as Rich has promoted).

We tend to maintain the various ways of formatting reference citations in a generic reference format table.
The reference formatting structure is roughly:

FormatType  RefFieldType  Sequence  PreText  PostText
-----------------------------------------------------
Article     Author        1         null     ,
Article     Year          2         null     :
Article     Title         3         null     ;
...
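
A minimal Python sketch of how a client might apply a format table like the one above. The three rules are taken from the table; everything else is an assumption made for illustration: the FormatRule and build_citation names, joining the pieces with a single space, skipping empty fields, and the sample field values.

```python
from typing import NamedTuple, Optional


class FormatRule(NamedTuple):
    format_type: str
    ref_field_type: str
    sequence: int
    pre_text: Optional[str]
    post_text: Optional[str]


# The three rows shown in the table; None stands for "null".
RULES = [
    FormatRule("Article", "Author", 1, None, ","),
    FormatRule("Article", "Year", 2, None, ":"),
    FormatRule("Article", "Title", 3, None, ";"),
]


def build_citation(fields: dict, format_type: str, rules=RULES) -> str:
    """Concatenate reference fields in Sequence order with their Pre/PostText."""
    selected = sorted(
        (r for r in rules if r.format_type == format_type),
        key=lambda r: r.sequence,
    )
    pieces = []
    for rule in selected:
        value = fields.get(rule.ref_field_type)
        if not value:
            continue  # assumption: fields with no value are skipped
        pieces.append(f"{rule.pre_text or ''}{value}{rule.post_text or ''}")
    return " ".join(pieces)  # assumption: pieces are separated by one space


# Hypothetical sample values, not a real reference.
sample = {"Author": "Smith", "Year": "2009", "Title": "An example title"}
print(build_citation(sample, "Article"))
```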

I am therefore happy to see structured reference fields and various forms of formatted citations.

It would also be a good idea to store the original citation (verbatim).

So the various citation forms may include:
- verbatim reference
- standard ref citation
- formatted ref citation (HTML)
- Citation in X format style
- Other citations in various formats ...

As for the exchange standard, I think you at least need to consider all these alternatives and recommend some best practices.
Exchange standards are often not quite enough for a particular use case, and people tend to ask "why haven't they considered X?" or say "it is not quite right, so I will come up with my own", which is often just a dump to a CSV file with a note of which fields are in the dump.
I think it is therefore important to consider the whole problem domain, define an "encompassing" ontology, define use cases, then define best practices and exchanges schemas for particular use cases (the exchange schemas would be a subset of the encompassing ontology for a particular use case).
This way you can then map the ontology (and exchange schema) to the "TDWG core ontology" and other external ontologies.

Kevin


Rod Page

Nov 15, 2009, 5:08:39 AM
to Taxonomic Literature
Gack, do we really need formatted titles? If we want taxonomic names,
why not store these as keywords and/or use name extraction algorithms?
Already it seems we're off adding stuff, heading down the road of
feature bloat...

Regards

Rod



Kevin Richards

Nov 15, 2009, 5:46:52 AM
to Rod Page, Taxonomic Literature
I prefer to think of it as "feature structuring" rather than "feature bloat". ;-)


Richard Pyle

Nov 15, 2009, 7:23:24 AM
to Kevin Richards, sau...@ira.uka.de, taxo...@googlegroups.com
Thanks, Kevin.

We may be discussing slightly different things here. We're not talking
about formatting in the sense of concatenating fields into a single formatted
citation record. Rather, we were talking specifically about embedding tags
for italics (and other such font style parameters) within the Title field
itself. In one sense, they shouldn't be in the Title field, because they
are not really part of the title. In another sense they are particularly
useful for this community, so we know which words in a title to render in
italics (for example).

Because we are looking at this as an exchange standard, we can leave it to
the client to assemble the various bits into a full citation blob of text.
But we do need to decide if (and how) we need to include information about
italics, superscript, subscript, and other such styles as they appear within
the title of a work. I don't think this applies to any other field in the
exchange realm.

Rich




Rod Page

Nov 15, 2009, 7:40:31 AM
to Taxonomic Literature
My gut instinct is that any formatting information in the title is a
bad idea. Any client database would have to strip the formatting in
order to index the field. Then you have to decide what formatting tags
to use (HTML, Wikipedia, Markdown, RTF, BibTeX, OpenDoc, etc.). And
how do I know that your idea of formatting matches mine? Surely we
want to separate data from presentation as much as possible?

If you want to make some elements in a title appear in italics, why
not just leave that to clients to figure out?

Regards

Rod

Richard Pyle

Nov 15, 2009, 7:49:39 AM
to Taxonomic Literature
Yes, that's my gut feeling as well. But I'm also not comfortable with
stripping the information entirely from the exchange format, because then
I've lost the ability to render the titles of imported refs the way they
were originally rendered. We could solve part of the problem with lists of
taxon names and such -- but that leaves open the problem of super- and
subscripts, plus other italicized words (e.g., names of ships). If the
provider database tracks these in some way (and most such databases that
I've used do track it), then I think it would be nice to include in the
exchange standard.

The EndNote spec already accommodates this by storing the style information
external to the title itself. That way, the title can remain in its clean
form, but the client can also extract the style metadata from the exchange
content (if it's provided), and reconstruct the properly rendered title
using whatever formatting tags they want.

It seems like the best way to have our cake and eat it too.
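
A minimal Python sketch of the external-style idea described above: the title stays clean, and a client rebuilds a rendered form in whatever markup it prefers. This is not EndNote's actual representation; the (start, end, style) span tuples, the style names and the render function are assumptions for illustration. The sketch also assumes spans do not overlap, and it does not escape special characters in the title text.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, style); end is exclusive

# Hypothetical style-to-markup tables for two output formats.
HTML_TAGS: Dict[str, Tuple[str, str]] = {
    "italic": ("<i>", "</i>"),
    "superscript": ("<sup>", "</sup>"),
    "subscript": ("<sub>", "</sub>"),
}
MARKDOWN_TAGS: Dict[str, Tuple[str, str]] = {"italic": ("*", "*")}


def render(title: str, spans: List[Span], tags: Dict[str, Tuple[str, str]]) -> str:
    """Wrap each styled span of a clean title in the client's chosen markup."""
    out = []
    pos = 0
    for start, end, style in sorted(spans):
        open_tag, close_tag = tags.get(style, ("", ""))  # unknown styles pass through
        out.append(title[pos:start])
        out.append(open_tag + title[start:end] + close_tag)
        pos = end
    out.append(title[pos:])
    return "".join(out)


clean = "Voyage of The Dawn Treader"
style_spans = [(10, 26, "italic")]  # covers "The Dawn Treader"

print(render(clean, style_spans, HTML_TAGS))
print(render(clean, style_spans, MARKDOWN_TAGS))
```

The same exchanged record yields both renderings, and a client that ignores the spans still gets an indexable plain title.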

Kevin Richards

Nov 15, 2009, 1:49:51 PM
to Richard Pyle, sau...@ira.uka.de, taxo...@googlegroups.com
Yes, I did realise that you were talking about HTML formatting. I just had a feeling these were two parts of the same issue, i.e. that there are many ways to structure a reference citation.

In the end, any reference citation is just a "cached" string of the fields of a reference (author, year, title, etc.), and one containing HTML tags is just one form.
I agree that it doesn't feel right to leave HTML tags in the citation, but I have had to deal with parsing and building these formatted citations in the past, and they are a real pain in the butt to parse, especially if you don't know the possible positions of the HTML tags.

It may be that we end up with a GRI (global reference index) of all possible citation strings, linking together equivalent ones - as Rod is suggesting, and others are working on. But I see this as an outcome rather than a component of an exchange standard.

Kevin

Richard Pyle

Nov 15, 2009, 9:37:27 PM
to Kevin Richards, sau...@ira.uka.de, taxo...@googlegroups.com

Right -- but there are two levels to this issue. One level is how you
concatenate the bits into a citation. The other issue is how you represent
style within a parsed bit (e.g., italics within a title). Assuming the
exchange standard parses/atomizes the bits sufficiently, we can leave the
(much larger) concatenation formatting issues to the client; who can insert
whatever style tags they want that apply to an entire element. But my gut
feeling is that we want some way to communicate some minimal style
information *within* a data element (Title seems to be the only element this
might be needed for).

As for GRI -- this is *exactly* what we discussed at TDWG. Basically, the
GNI/GNUB model seems like it's *exactly* the right model for dealing with
citation data (i.e., GRI/GRUB).

More on that later...gotta run now.

Rich


sau...@ira.uka.de

Nov 15, 2009, 10:09:21 PM
to Richard Pyle, Kevin Richards, taxo...@googlegroups.com
Hi Rich,


Following your comment on the journal names and series designators, I
have come to feel that the textual content of our title element
should simply be allowed to bear basic layout info in the form of HTML
tags, and the normalized title generally should be stored in an
attribute of the title element.

- Guido

Rod Page

Nov 15, 2009, 11:05:22 PM
to Taxonomic Literature
I'd be in favour of the reverse. Formatting is a secondary, client
issue. The normalised text string is primary. I also wonder why the
assumption is that HTML formatting is the main use case. If you're
going to have formatted strings, then wouldn't it make sense to have
them in their own field, with attributes specifying the format (hence
telling the client how to parse the field)?

Regards

Rod

Richard Pyle

Nov 16, 2009, 12:11:49 PM
to Rod Page, Taxonomic Literature

I agree with Rod on this one (although I perhaps don't understand the
implications thoroughly enough). Let me play around a bit with how EndNote
manages this sort of formatting, and then later today I'll report back on
several examples.

Guido Sautter

Nov 16, 2009, 12:24:40 PM
to Richard Pyle, Rod Page, Taxonomic Literature
The implication is this: the normalized title is a plain string with no
embedded tags, so it is easily stored in an attribute. The formatted
title, in turn, can/will have tags embedded, which is possible only in
the content of the title element, unless we want to accept considerable
overhead for escaping and un-escaping.
Bottom line: Rod's proposal feels more intuitive and closer to the
semantic relationship between the two title forms, but it incurs more
processing overhead, which IMHO is more important for a data exchange,
storage and comparison standard. I won't fight for it, but I see some
advantages in my proposal, i.e., having the normalized title in an
attribute and the formatted form in the content of the title element.
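
The escaping point can be seen with Python's standard-library xml.etree.ElementTree. This is a minimal sketch; normalizedForm follows the attribute name used later in this thread, while formattedForm is a hypothetical name for the reverse arrangement.

```python
import xml.etree.ElementTree as ET

formatted = "Voyage of <i>The Dawn Treader</i>"
normalized = "Voyage of The Dawn Treader"

# Variant A: plain string in an attribute, markup as element content.
variant_a = ET.fromstring(
    f'<title normalizedForm="{normalized}">{formatted}</title>'
)
print(variant_a.get("normalizedForm"))  # attribute needs no escaping
print(variant_a.find("i").text)         # markup is parsed as real elements

# Variant B (the reverse): markup inside an attribute must be escaped on
# writing and is only available as an opaque string after reading.
variant_b = ET.Element("title", {"formattedForm": formatted})
variant_b.text = normalized
serialized_b = ET.tostring(variant_b, encoding="unicode")
print(serialized_b)                     # the tags appear as &lt;i&gt; ... &lt;/i&gt;

round_trip = ET.fromstring(serialized_b).get("formattedForm")
print(round_trip)                       # un-escaped again, but still just a string
```

In variant B a client has to run a second parse over the attribute value before it can do anything with the tags.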

D.J.King

Nov 17, 2009, 7:15:01 AM
to taxo...@googlegroups.com
From reading this long thread, we seem to have various options for representing titles and any styles contained within them.
(Sorry I am late to join this thread; I took some time off after TDWG and am still catching up.)


There's Guido's approach, which involves having the clean form of the title as an attribute and the complete form in child elements:

<title normalizedTitle="Voyage of The Dawn Treader">
<formattedTitle>Voyage of <i>The Dawn Treader</i></formattedTitle>
</title>

Then there's the classic XML (complete with XHTML style elements) like this:

<title>
<normalizedTitle>Voyage of The Dawn Treader</normalizedTitle>
<formattedTitle>Voyage of <i>The Dawn Treader</i></formattedTitle>
</title>

This makes manipulating the data quite straightforward because everything is exposed directly as elements in the two forms our users want it.

Then there's a third version picking up on Rod's comment that formatted elements should be in their own field, which I interpret as this TEI inspired solution:

<title>
<titleText>Voyage of </titleText>
<titleText hi="italic">The Dawn Treader</titleText>
</title>

To extract the normalised form of the title remains easy, ie just select title, and if we need the formatted version then it is equally easily available.
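
A minimal Python sketch of reading the third (TEI-inspired) layout with the standard-library xml.etree.ElementTree. The function names plain_title and html_title are hypothetical, and mapping hi="italic" to an HTML i tag is an assumption about what a client would want.

```python
import xml.etree.ElementTree as ET

OPTION_3 = """<title>
<titleText>Voyage of </titleText>
<titleText hi="italic">The Dawn Treader</titleText>
</title>"""


def plain_title(title_xml: str) -> str:
    """Join the text of every titleText segment, ignoring the style hints."""
    root = ET.fromstring(title_xml)
    return "".join(seg.text or "" for seg in root.findall("titleText"))


def html_title(title_xml: str) -> str:
    """Rebuild an HTML rendering, wrapping italic segments in <i> tags."""
    root = ET.fromstring(title_xml)
    parts = []
    for seg in root.findall("titleText"):
        text = seg.text or ""
        parts.append(f"<i>{text}</i>" if seg.get("hi") == "italic" else text)
    return "".join(parts)


print(plain_title(OPTION_3))
print(html_title(OPTION_3))
```

Joining the segment texts explicitly, rather than taking the string value of the whole title element, keeps the line breaks between segments out of the result.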

We seem to have the classic XML design problem here: because XML is a data markup language, NOT a data modelling language, it will not provide any guidance as to how to structure the data. Hence, all design decisions must be based on the proposed use of the data, not on any intrinsic value of the data itself.

So are we after a data-centric view of the data (i.e. elements provide semantic clues)? That would suggest that the second solution is our answer.
Or are we after a document-centric view of the data (i.e. elements just represent the text)? That would suggest that the third solution is our answer.

I would be uncomfortable with the first solution because it distributes data between attributes and elements in, to my mind, a confusing manner. I would prefer to keep all data as elements because that's the way XML is intended to work. While attributes can be used for data, you quickly run into problems.

Finally, I don't think that performance considerations should influence our design decisions. We are not going to be using this data as a database but as a storage and exchange format. The marginal differences in XSLT and XPath retrieval times are of no consequence for the sort of one-off accesses that this data would be subject to.


Cheers,
Dauvit.

Guido Sautter

Nov 17, 2009, 7:24:30 AM
to D.J.King, taxo...@googlegroups.com
Hi Dave,

my variant is actually even more simplistic, for there is no need for
the formattedTitle element:

<title normalizedForm="Voyage of The Dawn Treader">Voyage of <i>The Dawn
Treader</i></title>

If we want to avoid using attributes here (what sort of problems are you
talking about, btw?), I'd definitely vote for the second of Dave's
alternatives, not for the third, as the latter would never give any form
of the title as a whole to a SAX parser, which might become a true pita
in stream processing. This is not a matter of performance, but of
implementing applications that use our format.
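
To make the stream-processing concern concrete, here is a minimal sketch using Python's standard-library xml.sax. With the segmented third layout, the handler never receives the title in one piece, so the application has to buffer character data between the start and end of the title element itself. The TitleCollector and collect_titles names are hypothetical.

```python
import xml.sax

SEGMENTED = (
    "<title>"
    "<titleText>Voyage of </titleText>"
    '<titleText hi="italic">The Dawn Treader</titleText>'
    "</title>"
)


class TitleCollector(xml.sax.ContentHandler):
    """Reassembles each title from the character events inside it."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None means "not inside a title element"

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # May be called several times per title, once or more per segment.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


def collect_titles(xml_text: str) -> list:
    handler = TitleCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles


print(collect_titles(SEGMENTED))
```

The buffering is not hard, but every client that streams the format has to write it.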

- Guido

Roderic Page

Nov 17, 2009, 7:56:27 AM
to Guido Sautter, D.J.King, taxo...@googlegroups.com
I don't recognise my proposal in Dave's example.

I guess I'd prefer something like this:

<title xml:lang="en">Voyage of The Dawn Treader</title>
<title xml:lang="zh">黎明號的遠航</title>
<formattedTitle xml:lang="en" format="html">Voyage of <i>The Dawn
Treader</i></formattedTitle>
<formattedTitle xml:lang="en" format="markdown">Voyage of *The Dawn
Treader*</formattedTitle>

and so on. Client can pick language and format, if they want, simply
by querying on attributes. I guess there's lots of ways to do this, my
own preference is that the tag name tells us what the tag contents are
about, and the attributes qualify it (e.g., language). I also like
flat XML files; nesting tags never seems a good idea. Most of this
is down to taste, and how easy we want to make developing clients.
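
A minimal Python sketch of a client picking language and format by querying on attributes, using the standard-library xml.etree.ElementTree. The enclosing record element and the helper names formatted_titles and inner_markup are assumptions for illustration; the flat title and formattedTitle elements follow the example in this message.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

RECORD = """<record>
  <title xml:lang="en">Voyage of The Dawn Treader</title>
  <formattedTitle xml:lang="en" format="html">Voyage of <i>The Dawn Treader</i></formattedTitle>
  <formattedTitle xml:lang="en" format="markdown">Voyage of *The Dawn Treader*</formattedTitle>
</record>"""


def formatted_titles(record: ET.Element, lang: str, fmt: str) -> list:
    """Return the formattedTitle elements matching a language and a format."""
    matches = record.findall(f"formattedTitle[@format='{fmt}']")
    return [el for el in matches if el.get(XML_LANG) == lang]


def inner_markup(el: ET.Element) -> str:
    """Serialize an element's content, keeping any child tags such as <i>."""
    return (el.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in el
    )


record = ET.fromstring(RECORD)
for fmt in ("html", "markdown"):
    for el in formatted_titles(record, "en", fmt):
        print(fmt, "->", inner_markup(el))
```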

Regards

Rod
---------------------------------------------------------
Roderic Page
Professor of Taxonomy
DEEB, FBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Email: r.p...@bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
AIM: rodpa...@aim.com
Facebook: http://www.facebook.com/profile.php?id=1112517192
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html






Guido Sautter

Nov 17, 2009, 8:43:27 AM
to Roderic Page, taxo...@googlegroups.com
Hi Rod,

this seems a little too flat for me: As we might have multiple title
elements per record (eg for title and sub title), there might be as many
corresponding formatted titles per language. The structure you propose
lacks any facility for finding out which title and formattedTitle belong
together - aside from grouping the elements by kindOfTitle and language,
which looks pretty odd to me. Settling on elements for the data, the
whole thing should look like this:

<title xml:lang="en">
<normalizedTitle>Voyage of The Dawn Treader</normalizedTitle>
<formattedTitle format="html">Voyage of <i>The Dawn Treader</i></formattedTitle>
<formattedTitle format="markdown">Voyage of *The Dawn Treader*</formattedTitle>
</title>
<title xml:lang="zh">
<normalizedTitle>黎明號的遠航</normalizedTitle>
</title>

or, somewhat more simplistic in cases where we don't have any formatted
titles (the most?), using a simple and a complex content model for the
title element:

<title xml:lang="en">
<normalizedTitle>Voyage of The Dawn Treader</normalizedTitle>
<formattedTitle format="html">Voyage of <i>The Dawn Treader</i></formattedTitle>
<formattedTitle format="markdown">Voyage of *The Dawn Treader*</formattedTitle>
</title>
<title xml:lang="zh">黎明號的遠航</title>

- Guido

Roderic Page

Nov 17, 2009, 8:54:22 AM
to Guido Sautter, taxo...@googlegroups.com
Dear Guido,

To be honest I'm not terribly fussed which way it goes, so long as
it's relatively simple to process, keeps formatting separate from
content, and atomises metadata wherever possible (unlike, say, OAI
Dublin Core). The reality is any tool in the area will be dealing with
a bunch of plain text, XML, and JSON formats, this format will simply
be one of them.

Regards

Rod

D.J.King

Nov 17, 2009, 11:00:29 AM
to Roderic Page, Guido Sautter, taxo...@googlegroups.com
Rod, Guido - thank you both for the clarifications.

While Guido proposes a beautifully simple solution, and such elegance counts in Java programming, it doesn't work that way in XML. I find Rod's approach preferable because it recognises that the title is data in its own right, not metadata about another element. Hence in Rod's version the data is readily retrievable in a variety of presentations that can easily be extended should we need to record a new language for example. In addition, this design captures both document and data centric views of the text.

This is one area where attributes can lead to difficulties because the design has to be right first time. Everything you can do to an attribute you can do to an element, but the opposite is not true. As Rod's example demonstrates, it is elements that make XML extendable and flexible. So unless there's a clear reason to use an attribute, eg it really is just a qualifier, then use an element and keep the options open for the future. It's more difficult to change an attribute to an element than an element to an attribute - ask the designers of DocBook who have had to convert attributes to elements several times now.

BTW: If you search for element and attribute design choices in both XML and SGML you'll find this debate is very old (and yes I have been using XML since it first appeared in 1998 and SGML before that so I've been here before). Sometimes you have no choice about the decision, eg if you want substructures you have to use an element, but sometimes the decisions are just down to personal style. That said, Rod's taste follows what is generally accepted as best practice, ie "tag name tells us what the tag contents are about, and the attributes qualify it".

Cheers,
Dauvit.

PS On other matters:
1) Rod, while I agree in general nesting tags only add complexity to an XML file and should be avoided, there are times when they are a good idea, but that's another debate!
2) Guido, I've sent you an XSLT that should address your concerns about retrieving the title from the redundant third option in my previous e-mail. I suggest we carry on that discussion between ourselves rather than clutter up the general mailing list with the intricacies of SAXON/XALAN processing, if that's all right with you.

Richard Pyle

Nov 17, 2009, 1:19:19 PM
to taxo...@googlegroups.com

How about a slightly modified version of Rod's approach; something along the
lines of the following:

<titles>
<title titleType="FullTitle" xml:lang="en" markup="none">The Chronicles of
Narnia: Voyage of The Dawn Treader</title>
<title titleType="FullTitle" xml:lang="zh" markup="none">黎明號的遠航</title>
<title titleType="FullTitle" xml:lang="en" markup="html">The Chronicles of
Narnia: Voyage of <i>The Dawn Treader</i></title>
<title titleType="FullTitle" xml:lang="en" markup="markdown">The
Chronicles of Narnia: Voyage of *The Dawn Treader*</title>
<title titleType="ShortTitle" xml:lang="en" markup="none">Voyage of The
Dawn Treader</title>
<title titleType="ShortTitle" xml:lang="en" markup="html">Voyage of <i>The
Dawn Treader</i></title>
<title titleType="ShortTitle" xml:lang="en" markup="markdown">Voyage of *The
Dawn Treader*</title>
<title titleType="AbbreviatedTitle" xml:lang="en" markup="none">Chron.
Narnia: Voy. Dawn Treader</title>
<title titleType="AbbreviatedTitle" xml:lang="en" markup="html">Chron.
Narnia: Voy. <i>Dawn Treader</i></title>
[...etc....]
</titles>

Same idea, but I don't see any need for defining an extra element for
"formattedTitle"; because whether or not the title is formatted is evident
from the attributes.
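
A minimal Python sketch of how a client might consume this layout, using the standard-library xml.etree.ElementTree: ask for a title type, a language and a preferred markup, and fall back to the unformatted title when the preferred markup was not supplied. The pick_title function and its fallback rule are assumptions for illustration; only a subset of the titles above is included.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

TITLES = """<titles>
  <title titleType="FullTitle" xml:lang="en" markup="none">The Chronicles of Narnia: Voyage of The Dawn Treader</title>
  <title titleType="FullTitle" xml:lang="en" markup="html">The Chronicles of Narnia: Voyage of <i>The Dawn Treader</i></title>
  <title titleType="ShortTitle" xml:lang="en" markup="none">Voyage of The Dawn Treader</title>
</titles>"""


def pick_title(titles: ET.Element, title_type: str, lang: str, preferred_markup: str):
    """Return (markup, content) for the best match, or None if nothing matches."""
    for markup in (preferred_markup, "none"):  # assumption: fall back to plain
        for el in titles.iter("title"):
            key = (el.get("titleType"), el.get(XML_LANG), el.get("markup"))
            if key == (title_type, lang, markup):
                content = (el.text or "") + "".join(
                    ET.tostring(child, encoding="unicode") for child in el
                )
                return markup, content
    return None


titles = ET.fromstring(TITLES)
print(pick_title(titles, "FullTitle", "en", "html"))
print(pick_title(titles, "ShortTitle", "en", "html"))  # falls back to markup="none"
```

Because the markup kind is an attribute value rather than an element name, a new markup type needs no schema change on the client side, only a new value to match.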

I'm not sure if these attribute names are the best (are there any other
existing defined attributes that equate to these?). I originally thought of
using "styleMarkup" instead of just "markup" to encourage people to limit
their markup to aspects of style. However, I also like Kevin Richards's idea
of leaving the door open to more robust semantic markup (e.g., of taxon
names) within titles; if someone has that sort of information. Mind you,
I'm not advocating it, necessarily -- just leaving the door open for it by
using a generalized solution to flagging the

Rich

D.J.King

Nov 17, 2009, 2:25:52 PM
to taxo...@googlegroups.com
Nice one Rich. It keeps things simple, and gets rid of the superfluous "formattedTitle" element.
 
Depending on the rest of the design we might not need the container element "titles" either. If all of these "title" elements are already contained within a common higher element for the document then there is no need for a container at this level. The document element is an adequate container.
 
Cheers,
Dauvit.
 
 

Guido Sautter

Nov 17, 2009, 2:32:44 PM
to taxo...@googlegroups.com
I agree, we don't need the "titles" element. Apart from that, I'm
perfectly fine with Rich's proposal.

Cheers,
Guido

Dean Pentcheff

Dec 14, 2009, 8:14:33 PM
to Taxonomic Literature
Supporting italics in article/book titles is absolutely required.
Losing them will cause a huge part of the systematics community to
reject this format out of hand.

Sure, one option could be to create markup that denotes species,
genus, family, or other taxonomic entities instead of a
typographically-oriented "italic" markup. But... really what we're
doing in this format (I think) is encoding taxonomic _literature_, not
_taxa_. The publication itself has the genus/species italicized in the
title, and that's what's being represented. That the italicization
denotes a particular taxonomic rank for those words is interesting
(and very possibly machine-readable-useful), but is secondary to
faithfully representing the original publication in a taxonomically
responsible way.

There's not much markup other than italicization that seems to occur.
There's a very rare (in the taxonomic literature) occurrence of super-
or sub-scripts. Boldface might occur (but I've never seen it). Beyond
those, I think we're getting into MathML, and I don't think we want to
go there.

There's a broader issue raised by the seemingly minor issue of title
italics (and I suspect that I may differ from Rod Page here, though I
hold out hope of convincing him!). I think it's important to make sure
that this exchange format can be used (albeit indirectly) to create
fully and correctly formatted bibliographic reference lists (such as
would appear at the end of a taxonomic publication in a journal).
That's one motivation behind my remarks above on the necessity of
retaining title italics. Clients that don't want/need the markup can
easily strip it; clients that need it, well, need it!

If the exchange format doesn't make that goal possible, then I think
it will be near-instantly rejected by the non-byte-headed taxonomic
community (I put all of us corresponding on this in the byte-head
category). Today, systematists want to use paper and electronic
resources to put together taxonomic treatments. Those get published in
journals. A bibliographic exchange format that doesn't retain all the
information from a reference needed to make a full bibliographic entry
is a non-starter for them.

One goal of a bibliographic reference is to locate the publication.
But I would argue that the bibliographic reference is, itself, an
object that comes close to being a real data object for taxonomists. A
reference (as such, not just as a pointer to a piece of paper/PDF) is
where taxonomy stores important parts of its "wisdom" about
nomenclature. Dates that differ from the imprinted year, author name
subtleties, volume number typographic corrections... All that kind of
information is part of a reference, but is not necessarily in the
paper to which a reference points.

Developing carefully curated references is one of the core (if
ludicrously fussy) elements of doing taxonomy. It involves making
pointers to the original publication, but includes much more
information than that. The information included in taxonomic
references cannot be replaced purely with an electronic pointer to the
original published page.

We really want to be able to harvest the insanely careful
bibliographic curation that systematists have invested in their
reference collections. To do that, we'll have to be able to promise
that they'll get back reference information that's at least as
complete as what they push into some sort of on-line electronic
reference storage or exchange system. If not, they just won't play and
we will: (1) fail to capture their expertise; and (2) fail to give
them the incentive to be more forward-thinking in how they work.

-Dean
--
Dean Pentcheff
pent...@gmail.com

Dick Jensen

unread,
Dec 15, 2009, 8:31:27 AM12/15/09
to Dean Pentcheff, Taxonomic Literature
I'm curious about something I have missed while watching this discussion: what if species' names are not italicized in both the table of contents and the paper title itself? Is this a problem?

Dick J

Richard Jensen, Professor
Department of Biology
Saint Mary's College
Notre Dame, IN 46556

tel: 574-284-4674

----- Original Message -----
From: Dean Pentcheff <pent...@gmail.com>
To: Taxonomic Literature <taxo...@googlegroups.com>
Sent: Mon, 14 Dec 2009 20:14:33 -0500 (EST)
Subject: [TaxonLit] Re: Thoughts on Style





Dean Pentcheff

unread,
Dec 15, 2009, 5:16:56 PM12/15/09
to Taxonomic Literature
I can tell you the decisions we made for the decapod literature list,
but I don't pretend to think it's a "standard" (nor would I even argue
hard that it's the best approach!).

Where genus/species were italicized in the title, we retained that.
Where the title text was ambiguous (e.g. all-caps no-italics title),
we italicized genus/species names (as well as lower-casing the title
text).
Where the title text was upper- and lower-case but the genus/species
was not italicized, we went ahead and italicized it (that's the
approach that I suspect is the least justifiable, but was also
extremely uncommonly [perhaps never?] seen).

What your question raises is the more general issue: what typographic
transformations are desirable or required? For example:

What to do with all-caps titles?
What punctuation to use to separate title and subtitle if they are
only differentiated by typeface or point size in the original?
What to do with non-conventional capitalization in older papers (e.g.
"Systematics and taxonomy of the Genus Abadabba")?
etc.

-Dean
--
Dean Pentcheff
pent...@gmail.com

Rod Page

unread,
Dec 15, 2009, 6:43:31 PM12/15/09
to Taxonomic Literature
Dean,

I can't escape the feeling that fretting over styling is a huge
mistake. Roger Hyam has a great quote from David King on his blog
(http://www.hyam.net/blog/archives/730):

"When we recruited applications programmers they were told in no
uncertain terms that to simply take an existing paper based workflow
and replace it with one that just mapped one piece of a paper to one
screen was tantamount to a sackable offence."

If we have an identifier for the publication, and that identifier is
resolvable, then as far as I'm concerned we're basically done. Fussing
with text formatting, whether plates should be listed in pages,
etc., are all legacies of a pre-digital age. The sooner we get this
out of our system the better. It's a huge obstacle in the way of
making progress.

If every taxonomic publication had a DOI (for example), then this
discussion simply wouldn't arise.

Regards

Rod



Stephen Thorpe

unread,
Dec 15, 2009, 7:13:03 PM12/15/09
to Rod Page, Taxonomic Literature
Finally, I see Rod's point! If every publication had a DOI, we could just cite that as the reference, which would make life so much easier. OK, not every publication has a DOI, but what would it take to give them all one???


________________________________________
From: taxo...@googlegroups.com [taxo...@googlegroups.com] On Behalf Of Rod Page [r.p...@bio.gla.ac.uk]
Sent: Wednesday, 16 December 2009 12:43 p.m.
To: Taxonomic Literature

sau...@ira.uka.de

unread,
Dec 15, 2009, 7:19:18 PM12/15/09
to taxo...@googlegroups.com
Hi Rod,

how right you are, DOIs/Handles/etc are the brilliant future.

Yet we have to deal with the tedious paper-based past. Fretting over
style is definitely one step too far in my opinion, but we'll have to
deal with the author-year-title-journal-issue-pages stuff (and its
counterparts for books, book chapters, etc) as well as micro
citations and other oddities for bridging the past to the future, or
- in other terms - anchoring the future in the past.

We have to be able to _retrieve_ the DOI/Handle/whatever for any given
author-year-title-journal-issue-pages string. That's just what we are
confronted with when bringing the legacy data onto the web, and we
have to be able to recognize which DOI/Handle/whatever this string
refers to. Therefore, we need to be able to _query_ registries with
reference strings, getting back one or more possible
DOIs/Handles/whatevers. Ideally, a quorum of users can help with
disambiguation in case of ambiguity.

That's basically the same as with taxonomic names and LSIDs: We have
to be able to _retrieve_ the appropriate LSID for what we find in the
legacy data: taxonomic name strings. Otherwise, we'll never be able to
link the treasures hidden in BHL's repositories to modern
DOI/Handle/whatever based referencing.

Bottom line: DOIs/Handles/etc are great. All we need is registries
that facilitate mapping the less-than-great (but human readable)
identifiers used in the past to the modern unique identifiers, through
respective query mechanisms, not _only_ vice versa.

So much for my two cents from the legacy data markup front,

Guido

Stephen Thorpe

unread,
Dec 15, 2009, 7:36:35 PM12/15/09
to Rod Page, Taxonomic Literature
I'm assuming Rod's idea goes something like this:
If every publication had a STABLE electronic identifier, then this is all we need to reference the publication in a database. Ideally, all publications would be digitised and allocated a DOI. Then we could link to the publication from the database using the DOI (maybe via a paywall, for non-freely available publications). Then we could simply forget entirely about referencing and parsing publication details in the old way ...

________________________________________
From: taxo...@googlegroups.com [taxo...@googlegroups.com] On Behalf Of Rod Page [r.p...@bio.gla.ac.uk]
Sent: Wednesday, 16 December 2009 12:43 p.m.
To: Taxonomic Literature

Rod Page

unread,
Dec 16, 2009, 2:44:32 AM12/16/09
to Taxonomic Literature
Guido,

I completely agree that we need to deal with the legacy literature,
indeed that was one of the motivations for developing the bioGUID
OpenURL resolver http://bioguid.info/openurl, which retrieves DOIs,
Handles, or URLs (e.g., from JSTOR) for articles, given basic metadata
such as journal, volume, and pagination.
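
As a rough illustration of what such a lookup request might look like, here is a sketch that only builds the query URL and does not fetch anything. The key names (genre, title, volume, spage, date) are the conventional OpenURL 0.1 keys; whether this particular resolver accepts exactly these keys is an assumption, and the function name is made up for the example.

```python
from urllib.parse import urlencode

# Resolver address as given in the message above.
RESOLVER = "http://bioguid.info/openurl"


def openurl_query(journal: str, volume: str, start_page: str, year: str) -> str:
    """Build an OpenURL-style query from basic article metadata.

    Assumption: the resolver understands the usual OpenURL 0.1 keys.
    """
    params = {
        "genre": "article",
        "title": journal,     # journal title, in OpenURL 0.1 usage
        "volume": volume,
        "spage": start_page,  # first page of the article
        "date": year,
    }
    return RESOLVER + "?" + urlencode(params)


print(openurl_query(
    "Bulletin de la Société Vaudoise des Sciences Naturelles",
    "44", "218", "1908",
))
```

The point of the design is that the client needs nothing more than journal, volume, and pagination; the resolver does the work of mapping that to a DOI, Handle, or URL.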

In the same way, I've spent the last couple of months exploring ways
to extract article-level metadata from BHL, and I'm developing an
OpenURL resolver to locate articles in that repository (and generate
identifiers for those articles).

For my purposes the primary goal is to resolve citations to
identifiers, then link those identifiers together (e.g., through
citation networks) and to other identifiers (e.g., for names and
specimens). Hence, I see the discussion about literature as
fundamentally about how to go from legacy string identifiers (i.e.,
citations) to digital identifiers, not how to reproduce Byzantine
practices from another age.

Regards

Rod

Richard Pyle

unread,
Dec 16, 2009, 2:52:00 AM12/16/09
to Stephen Thorpe, Rod Page, Taxonomic Literature

Hi Stephen -- yes, this is *exactly* what we've *all* been pushing for. We
know where we want to be (the DOI bit is debatable, but if you replace that
with the more generic "persistent resolvable identifier", it's what we're
all on about).

What we are all discussing now is how to *get* there (as Guido described).
There are two basic issues:

1) Where do the persistent resolvable identifiers come from? DOIs make a
lot of intuitive sense, but have a couple of major limitations: namely, the
vast, vast majority of the literature we want to index does not have them,
and likely won't have them anytime soon; they cost money; it's not clear
whether they can be assigned to non-traditional units of publication that we
need to reference and/or different levels of citations; etc.

2) How do we cross-link the hundreds of existing literature databases to
these identifiers?

As with many aspects of biodiversity informatics, where we want to be is
easy to envision. How to get there is always the tricky bit.

Rich

Richard Pyle

unread,
Dec 16, 2009, 3:00:31 AM12/16/09
to Rod Page, Taxonomic Literature
> For my purposes the primary goal is to resolve citations to
> identifiers, then link those identifiers together (e.g.,
> through citation networks) and to other identifiers (e.g.,
> for names and specimens). Hence, I see the discussion about
> literature as fundamentally about how to go from legacy
> string identifiers (i.e.,
> citations) to digital identifiers, not how to reproduce
> Byzantine practices from another age.

I see our emails crossed. What Rod describes above is *exactly* what I
think this discussion is (or ought to be) about.

Aloha,
Rich


Donat Agosti

unread,
Dec 16, 2009, 6:34:46 AM12/16/09
to taxo...@googlegroups.com, sau...@ira.uka.de, Roderic Page
I agree with Rod, we should abandon the discussion on details of a citation.
What the citation was for, at least in the past, was finding the respective
publication somewhere in a library; today we just need a DOI/handle to
build the bridge.

Today, the library is one mouse click away from the citation. The only thing
we need is to have a copy of the publication sitting somewhere, be it BHL or
another digital archive, such as we have at antbase or plazi, and a handle
or similar to find it. This would resolve all the issues. Having a pdf at
hand would resolve all the issues about style, because the publication is
actually in front of you, and secondly, if you still would like to have a
nicely groomed bibliographic database, you could immediately check the
original.

So, why not work on combining the efforts of BHL and the many small
collectors of taxon-specific PDF collections, which are in almost all cases
linked to a bibliographic database, and assign handles or similar as routine?

We also have to talk to publishers to include our handles as they do for
DOIs, and if we have something (a digital library) to offer, this might
actually work.

This could even work, if some of the pdfs are only available for the Club
due to copyright restrictions.

Donat






Richard Pyle

unread,
Dec 16, 2009, 11:13:51 AM12/16/09
to Donat Agosti, taxo...@googlegroups.com, sau...@ira.uka.de, Roderic Page
> I agree with Rod, we should abandon the discussion on details
> of a citation.

Really? Then how are we going to get all of the millions of data records
(specimens, names, BHL scanned pages, etc.) that are currently linked to
citations that are represented by those citation details, cross-linked to
the identifiers? And where are the identifiers coming from? How do we
*make* the crosslinks if not through the citation details?

In my mind, the identifiers should come from CiteBank. Where DOIs exist,
they should absolutely be part of the metadata for those CiteBank records.
But for the reasons I outlined, we'll be sitting on our hands for a long
time if we wait for DOIs to be assigned not only to every historical
article, but also to every historical book, every taxonomic treatment within
every book/article, etc. I'd rather not wait another decade for that to
happen.

So if CiteBank is built, and issues the identifiers, then how do we know
what identifier goes with what publication, if not for the citation details?
And the crux, how do we link the existing records in the hundreds of
citation databases (like HNS), which are themselves cross-linked to millions
of names/specimens/pages/etc. records, to those identifiers -- if not
through a mechanism that involves citation details?

> So, why not work on combining the efforts of BHL and the many
> small collectors of taxon-specific PDF collections, which are
> in almost all cases linked to a bibliographic database, and
> assign handles or similar as routine.

Having the PDFs would be great! But more fundamentally we need to
establish the cross-links among all of the currently disconnected
datasets.

Aloha,
Rich


Dean Pentcheff

unread,
Dec 16, 2009, 12:57:25 PM12/16/09
to taxo...@googlegroups.com
I think we need to be mindful of William Gibson's observation: "The
future is already here. It's just not very evenly distributed."

Taxonomists can today assemble taxonomic works by referencing past
work using DOIs, link to species or taxon treatments directly with
hyperlinks, and publish electronically with embedded TaxonX (or other)
markup. We've arrived!

Um. Well. Except for the 99.9% of publications that aren't done that
way, the 99% of taxonomists who don't know how to make a link in
Microsoft Word, and the 99% of taxonomic literature that (even if a
small part of it is digital) isn't uniformly locatable via a
standardized DOI (or DOI-like) system.

We're in the business of envisioning and enabling a fully digital
future for taxonomic science. But we _must_ simultaneously be in the
business of rolling the existing world of taxonomy forward into
that future. Doing that means creating a very visible, very
incremental pathway forward for all the practicing taxonomists who
have never heard of TaxonX or DOIs, and who push Return and several
spaces to make hanging indents in their reference lists.

That means (I think) that we do need to build systems that encapsulate
all the messy complexities and conventions of traditional
bibliographic curation (title italics is, of course, just one piece of
that).

We've made a cottage industry out of moaning about the decline in the
number of taxonomists. We can't afford to set up digital taxonomic
systems that fail to absorb the work of traditionally-practicing
taxonomists. We need to encapsulate their expertise now -- there isn't
going to be a big new generation of experienced and digital
taxonomists who can inherit that knowledge base via apprenticeship.

My experience with slinging taxonomic reference information around
recently has convinced me of several things:

1. Traditional taxonomists really do have an incredibly rich and
detailed knowledge of their taxon groups (both critters and
publications).
2. Except for a handful of gear-heads, taxonomists are some of the
least technically savvy, least digitally adventurous, and most
methodologically conservative scientists in the world.
3. Traditional taxonomists can be migrated forward digitally if you
can enable them to do what they're doing now better and faster.

We want a taxonomic world of digitally linked taxon treatments, names,
publications, and other beautiful shiny things. The only way to pull
most taxonomists into that world is to seduce them by building digital
systems that make it easier to do what they see as their job today:
publishing taxonomic papers (in print (that's ink on mashed up woody
bits)). Those same digital systems will also pivot the field forward
to a point where the traditional approaches we worked hard to
accommodate will become obsolete. Victory.

So there's the long-winded, pompous background to why I want italics.

-Dean
--
Dean Pentcheff
pent...@gmail.com

Donat Agosti

unread,
Dec 16, 2009, 12:58:45 PM12/16/09
to Richard Pyle, taxo...@googlegroups.com, sau...@ira.uka.de, Roderic Page
I mean the details of whether something should be italics, bold, etc. A
reference has to be good enough that it allows the discovery of the original
publication on the one hand, and unambiguous enough that we can be sure
another citation refers to the same publication.

Forel, 1908 is obviously not enough:
# 10226 Forel, A. 1908. Lettre à la Société Entomologique de Belgique.
Annales de la Société Entomologique de Belgique 52: 180-181.
# 10227 Forel, A. 1908. Remarque sur la réponse de M. le prof. Emery.
Bulletin de la Société Vaudoise des Sciences Naturelles 44: 218.
# 22785 Forel, A. 1908. E. Formiciden. Wissenschaftliche Ergebnisse der
Expedition Filchner nach China und Tibet, 1903-1905. X. Band - 1. Teil. :
105.
# 4013 Forel, A. 1908. Fourmis de Ceylan et d'Égypte, récoltées par le Prof.
E. Bugnion. Bulletin de la Société Vaudoise des Sciences Naturelles 44:
1-22.
# 4014 Forel, A. 1908. Fourmis de Costa-Rica, récoltées par M. Paul Biolley.
Bulletin de la Société Vaudoise des Sciences Naturelles 44: 35-72.
# 4015 Forel, A. 1908. Ameisen aus Sao Paulo (Brasilien), Paraguay, etc.
Gesammelt von Prof. Herm. v. Ihering, Dr. Lutz, Dr. Fiebrig, etc.
Verhandlungen der Zoologisch-Botanischen Gesellschaft in Wien 58: 340-418.
# 4016 Forel, A. 1908. Catalogo systematico da collecção de formigas do
Ceará. Boletim do Museu Rocha, Fortaleza 1: 62-69.
# 1


So it needs a bit more.

A year and a page range is enough in almost 95% of cases, and adding a
journal name would resolve almost all of the rest. The remaining ambiguities
can only be resolved if the content of the publication is known.
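
A small sketch of that idea, using four of the Forel 1908 entries listed above as already-parsed records. The dictionary layout and the helper names are assumptions made for the illustration; the sketch simply groups records by a chosen set of fields and reports which groups are still ambiguous.

```python
from collections import defaultdict

SVSN = "Bulletin de la Société Vaudoise des Sciences Naturelles"

# Four of the Forel 1908 records from the list above, as parsed fields.
records = [
    {"id": 10226, "author": "Forel", "year": 1908, "volume": "52", "pages": "180-181",
     "journal": "Annales de la Société Entomologique de Belgique"},
    {"id": 10227, "author": "Forel", "year": 1908, "volume": "44", "pages": "218",
     "journal": SVSN},
    {"id": 4013, "author": "Forel", "year": 1908, "volume": "44", "pages": "1-22",
     "journal": SVSN},
    {"id": 4014, "author": "Forel", "year": 1908, "volume": "44", "pages": "35-72",
     "journal": SVSN},
]


def group_by(recs, fields):
    """Group record ids by the tuple of values in the given fields."""
    groups = defaultdict(list)
    for rec in recs:
        groups[tuple(rec[f] for f in fields)].append(rec["id"])
    return groups


def ambiguous(recs, fields):
    """Return the groups of ids that the given fields cannot tell apart."""
    return [ids for ids in group_by(recs, fields).values() if len(ids) > 1]


print(ambiguous(records, ["author", "year"]))   # [[10226, 10227, 4013, 4014]]
print(ambiguous(records, ["year", "pages"]))    # []
```

Author and year alone leave all four records in one bucket; year plus page range separates them, and the journal name is there as a further tie-breaker when two papers share a year and pagination.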

I also think that if we can do 90% by machine, and need to correct 10%
because of incorrect matching, that's fine, as long as there are mechanisms
to correct the errors and change the database, similar to what you have now
in Zoobank.

For example, in GoldenGATE and our mark-up process we deal with this issue on
a regular basis:
1. We locate the citation.
2. We identify the elements of a citation (e.g. author, year, title...). This
is done via an interactive window, so you can confirm or change. For new
publications there are very few wrong predictions.
3. We find all the short internal citations, and link them up with the
complete citations.

Unfortunately, there is no system yet, or no longer (bioguid worked for a
while), where we can upload the citations or get a handle/DOI.

I wonder, whether we should get back to our own system where we use the
handles we assigned to all the ant literature, and just use them.

This way we would have, for a small body of literature (only ants), everything
we are talking about in place. Through the interactive citation parsing, we
also have quite a high degree of quality control of the data: for all the ant
taxonomic publications we can easily refer to the original publications, which
are all scanned and available.

I agree, we should have something like Citebank. But I don't think we need a
complete Citebank before we can start. What we need are seeds of taxa that
make use of Citebank, let it grow, and attract others. But without
content, Citebank is yet another nice empty shell for the history books.

Data sets are not just disconnected but in various stages of completion -
therefore it needs an approach that takes care of that. I would also make an
effort to talk to some publishers that are close to what we do and make
sure they include our DOIs/handles for old literature. And again, my bet
is that this might work if we can find a taxon-based group where there
is a mixture of authors, databases and publishers that work together. And
Hymenoptera might be such a taxon.


Donat




Dean Pentcheff

unread,
Dec 16, 2009, 1:06:31 PM12/16/09
to Taxonomic Literature
Yes. I think inevitably we will progress forward in chunks, the main
quantum of which will be "drops" of literature from well-organized
taxon worker-groups. That's one element that I see as a major
contributor to buy-in from taxonomic workers.

People suck onto a system or resource when it gets over some (hard to
define) threshold of completeness and usability. Once there's a very
good chance you'll find what you need at a resource, you keep going
there (and you are likely to contribute to it). If the
papers/references/whatever you need are so sparse in a resource that
you only get a rare "hit", it's not interesting.

So having pre-curated taxon-specific batches of elements appear in
online resources is a great way to get buy-in, community by community.

-Dean
--
Dean Pentcheff
pent...@gmail.com

Donat Agosti

unread,
Dec 16, 2009, 1:16:28 PM12/16/09
to Dean Pentcheff, taxo...@googlegroups.com
I am not interested in the 99.9% if nobody is using them. And if people are
using them, then there are bibliographic databases around. Even we from the
0.1% did something we discuss here: spent weeks to years sorting out all the
references, finding reprints or library sources, and scanning thousands of
them to create, among other things, PDFs, discovering during this process
many of the errors, which allowed us to correct them. Linking species
citations to a particular page (e.g. Forel 1908:34) in the appropriate PDF
was one of the best controls.

I assume that there are enough databases with PDFs available that Citebank
can fly and will attract more contributions. But without a live Citebank,
even a small one supported by a vibrant community, and a vision to scale up
(e.g. BHL, or commitment from some large-scale initiative such as Noyes'
Chalcid database or HNS), we will keep talking as we do now.

It needs work, a lot. But I think the work to create the content should be
the focus, not a highly sophisticated system to support the conversion and
all its idiosyncrasies. The good thing is that it is a one-off job, like
the activation energy in an exothermic chemical reaction, even though the
energy can be exorbitantly high, especially if few people have to deliver.
And the rule in this life is that nobody but a few will do it.

Donat






Guido Sautter

unread,
Dec 16, 2009, 1:22:28 PM12/16/09
to taxo...@googlegroups.com
I do think we need something like CiteBank to grow along with the
initial data set from the start ...

1. We need a service to obtain unique identifiers from, given a parsed
citation string - and this service can just be a first implementation of
CiteBank.

2. Once the system is established and many people want to use it, it's
too late to start developing something lasting as CiteBank - it has to
be up & running by then. And the Hymenoptera literature surely is a
suitable data set to test CiteBank with, as well as a nice demonstrator
that will help bring other people aboard.

In this respect, I'm in line with Rich - we should not waste effort on
creating prototypes, but start building the real thing. The basic
structure is not so complex that it might prevent a successful start. We
just have to keep the basic backend software sufficiently extensible so
we can optimize it later on as both traffic and amount of data increase.

Cheers,
Guido


Dean Pentcheff

unread,
Dec 16, 2009, 1:32:59 PM12/16/09
to Taxonomic Literature
Yes, this should not be yet-one-more little proof-of-concept island. I
think we've got lots of those already. (Though we should be prepared
to do major rethink and redesign if we realize we're going down the
wrong road in some way.)

> 1. We need a service to obtain unique identifiers from, given a parsed
> citation string - and this service can just be a first implementation of
> CiteBank.
This will also have to be accompanied by a strong system for being
able to "synonymize" references (and identifiers). Inevitably we will
be generating multiple unique identifiers for references that turn out
to be pointing to the same publication. I have a creepy feeling that
this synonymization will turn out to be much more complex than it
seems on the surface.
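
One reason it gets complicated is that synonymy is transitive: if A is merged with B and later B with C, then A must also resolve to C's target. A minimal sketch of that bookkeeping, using a union-find structure, follows. The class name and the "cit:N" identifiers are hypothetical, and a real system would also need to record who asserted each synonymy and allow it to be undone.

```python
class IdentifierSynonymy:
    """Track which identifiers point at the same publication (union-find)."""

    def __init__(self):
        self._parent = {}

    def find(self, ident):
        """Return the preferred identifier for ident."""
        self._parent.setdefault(ident, ident)
        root = ident
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited identifier straight at the root.
        while self._parent[ident] != root:
            nxt = self._parent[ident]
            self._parent[ident] = root
            ident = nxt
        return root

    def synonymize(self, a, b):
        """Declare that a and b refer to the same publication."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Arbitrary but deterministic choice: keep the smaller string.
            keep, drop = sorted((root_a, root_b))
            self._parent[drop] = keep
        return self.find(a)


syn = IdentifierSynonymy()
syn.synonymize("cit:7", "cit:3")
syn.synonymize("cit:9", "cit:7")
print(syn.find("cit:9"))
```

Every identifier ever issued stays resolvable; it just resolves to whichever identifier the group currently prefers.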

-Dean
--
Dean Pentcheff
pent...@gmail.com


Donat Agosti

unread,
Dec 16, 2009, 1:37:29 PM12/16/09
to Dean Pentcheff, Taxonomic Literature
Maybe the wrong road was that we haven't done it yet.

Rich can tell you stories about synchronization. But at the same time, all
this is a one-off process, and once done, we still have the chance to correct
- which we will do, if we really use the system.

Guido Sautter

unread,
Dec 16, 2009, 2:03:53 PM12/16/09
to Taxonomic Literature
I think the first and foremost thing we need is a stable registry for the
citation strings.

When uploading a new string, the system can fuzzy query the strings
already in the database and return possible matches for the uploader to
choose from, or a new ID if no possible match is found.
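
A minimal sketch of this first approach, assuming nothing more than Python's standard difflib for the fuzzy comparison. The class name, the "cit:N" identifier format, and the 0.85 similarity cutoff are all invented for the illustration; a production registry would need a proper index rather than a linear scan.

```python
import difflib
import itertools


class CitationRegistry:
    """Toy registry: fuzzy lookup of citation strings, new ID when none match."""

    def __init__(self, cutoff=0.85):
        self.cutoff = cutoff              # assumed similarity threshold
        self._ids = {}                    # normalized string -> identifier
        self._counter = itertools.count(1)

    @staticmethod
    def normalize(citation):
        """Lower-case and collapse whitespace before comparing."""
        return " ".join(citation.lower().split())

    def candidates(self, citation, n=5):
        """Return (identifier, stored string) pairs that look similar."""
        matches = difflib.get_close_matches(
            self.normalize(citation), list(self._ids), n=n, cutoff=self.cutoff
        )
        return [(self._ids[m], m) for m in matches]

    def register(self, citation):
        """Store the string (if new) and return its identifier."""
        key = self.normalize(citation)
        if key not in self._ids:
            self._ids[key] = f"cit:{next(self._counter)}"
        return self._ids[key]


reg = CitationRegistry()
reg.register("Forel, A. 1908. Remarque sur la réponse de M. le prof. Emery. "
             "Bulletin de la Société Vaudoise des Sciences Naturelles 44: 218.")
# A slightly different rendering of the same reference:
print(reg.candidates("Forel, A. 1908. Remarque sur la réponse de M. le prof. Emery. "
                     "Bulletin de la Société Vaudoise des Sciences Naturelles 44:218"))
```

The uploader would be shown the candidates and either pick one or call register to mint a new ID.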

Or the system initially just stores each distinct citation string as is,
and synchronization can be done asynchronously through a web interface,
maybe even as a citizen science project, relying on the votes of
multiple users for ensuring data quality.

The nice thing about the second approach is that the system can be put
in place and start collecting data and issuing IDs pretty quickly, and
the synchronization sub system can be added later. After
synchronization, the IDs issued for the plain strings can be resolved to
a single ID existing for the referenced data item, thus resulting in a
two step resolution process.
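
A rough sketch of the second approach, to show how little is needed to start issuing IDs. Everything here is an assumption made for illustration: the class name, the "str:N" identifier format, and the rule that a merge takes effect once a fixed number of distinct users have voted for it.

```python
from collections import defaultdict


class DeferredRegistry:
    """Store citation strings as-is; merge later on the votes of several users."""

    def __init__(self, quorum=3):
        self.quorum = quorum            # assumed number of votes needed to merge
        self.string_ids = {}            # citation string -> string-level ID
        self.work_of = {}               # string-level ID -> work-level ID
        self.votes = defaultdict(set)   # (id, id) -> users voting "same work"

    def deposit(self, citation):
        """Issue an ID immediately; each string starts out as its own work."""
        if citation not in self.string_ids:
            sid = f"str:{len(self.string_ids) + 1}"
            self.string_ids[citation] = sid
            self.work_of[sid] = sid
        return self.string_ids[citation]

    def vote_same(self, user, sid_a, sid_b):
        """Record one user's opinion that two strings cite the same work."""
        pair = tuple(sorted((sid_a, sid_b)))
        self.votes[pair].add(user)
        if len(self.votes[pair]) >= self.quorum:
            target, old = self.work_of[pair[0]], self.work_of[pair[1]]
            for sid, work in self.work_of.items():
                if work == old:
                    self.work_of[sid] = target

    def resolve(self, citation):
        """Two-step resolution: string -> string-level ID -> work-level ID."""
        sid = self.string_ids[citation]
        return self.work_of[sid]
```

IDs handed out before any merging remain valid afterwards; they simply resolve, in the second step, to the single ID that stands for the referenced publication.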

- Guido

Stephen Thorpe

unread,
Dec 16, 2009, 3:51:13 PM12/16/09
to Richard Pyle, Rod Page, Taxonomic Literature
Well, some thoughts:

(1) the fact that Magnolia Press don't use DOIs is a shame

(2) I tend to think that more money and effort ought to go into making literature freely available online. In this regard ZooKeys has the advantage over Zootaxa. If we forget about cross-linking to existing databases, and think instead of creating a new comprehensive database, the database itself could contain the literature files, and the unique identifiers could be assigned and used purely internally, without humans even needing to see the identifiers. The user could just click on a literature citation of the form Smith, 1876c and the PDF would open...


Richard Jensen

Dec 16, 2009, 3:59:58 PM12/16/09
to Stephen Thorpe, Richard Pyle, Rod Page, Taxonomic Literature
As for your second point, we need to be careful about this. As you know,
many of the best journals are published by scientific societies whose
major funding for the journal comes from library subscriptions. One
thing we know about libraries is that, if the journals are
simultaneously made available online, they (the libraries) will stop
subscribing. This isn't a hypothetical - the libraries have made this
abundantly clear. We need a system that makes publications available
online and free, but not at the cost of losing the very journals that
serve as outlets.

Dick J

Richard Jensen, Professor
Department of Biology
Saint Mary's College
Notre Dame, IN 46556
Tel: 574-284-4674

Donat Agosti

Dec 16, 2009, 4:00:58 PM12/16/09
to Stephen Thorpe, Richard Pyle, Rod Page, Taxonomic Literature
If we have a database from which ZooBank could add the DOIs, then they might
do it.

The issue is actually not so much ZooKeys or Zootaxa, but the authors. About
80% of the authors of Zootaxa, a hybrid journal, don't care about open
access. It is up to them to pay around USD 20 per page to get the publication
open access. In ZooKeys, a gold-road OA journal, everybody has to pay, and
all the articles are open access.
There is a similar experience: almost none of the Zootaxa authors made
the small effort to register their new taxonomic names in ZooBank.
That of course makes you wonder who is then going to make the effort to
enter all the bibliographic data into CiteBank...

I guess a few of us will make it happen...

Donat

Donat Agosti

Dec 16, 2009, 4:05:43 PM12/16/09
to Taxonomic Literature, rje...@saintmarys.edu
There is a lot of talk about this far beyond our community, but what we
need is access to what we publish. Essentially we are not here to conserve
our past, but to create the best possible environment to communicate our
discoveries. Open access does that - so we have to find new ways, including
new business models, to get where we want to be.
Donat



Stephen Thorpe

Dec 16, 2009, 4:11:47 PM12/16/09
to rje...@saintmarys.edu, Richard Pyle, Rod Page, Taxonomic Literature
Yes, but this is the very obstacle that needs to be changed, by changing the system. In the electronic future, libraries are doomed to extinction anyway! It seems fair enough to me for authors to have to pay to get their papers published, papers that will then be freely available. Most authors in taxonomy are 'professionals' working for corporate institutions who judge them by the number of their publications, so the institutions should pay to get those publications out. At any rate, it isn't at all clear to me that spending millions on numerous bioinformatics initiatives, all trying to get around the above problem, will be any cheaper in the long run than simply buying copyrights or else paying the cost of registering DOIs ...


Stephen Thorpe

Dec 16, 2009, 4:12:11 PM12/16/09
to Donat Agosti, Taxonomic Literature, rje...@saintmarys.edu
Yes, I think that is what I am saying: new business models to facilitate open access ... anybody thought of advertising? It might seem crass to have an advert for McDonalds in the middle of a taxonomic publication, but hey! ...


Stephen Thorpe

Dec 16, 2009, 4:30:37 PM12/16/09
to rje...@saintmarys.edu, Richard Pyle, Rod Page, Taxonomic Literature
One option would be to simply let the diversity of taxonomic journals in the world today go extinct, and instead fund one "megajournal" for the publication of new taxa. It would make life easier ...


sau...@ira.uka.de

Dec 16, 2009, 7:19:04 PM12/16/09
to taxo...@googlegroups.com
Let's just for a moment not bring up visions of what the future might be
like, much less discuss those visions, but see where we are:

Whatever publications in the future will be like, they will all
reference and build on "old school" publications (openly accessible
or not) that are originally referenced through classic citation
strings. So even future publications will rest on this existing body
of literature, extending this ever-growing directed graph, where
publications are the nodes and "Paper A references Paper B" relations
are the edges. In order to be able to browse this graph by means of
hyperlinks or UUID (to use Rich's general term) resolution, we need
two things:
- a UUID for all the existing publications, at least for the ones that
are referenced by others (in the graph model, nodes having in-edges)
- a facility to obtain the UUID of the referenced publications based
on the citation strings given in the referencing publications
(in the graph model, find out where exactly the out-edges of a node go)
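
The graph model can be sketched as follows. This is only an illustration: the paper names and citation strings are invented, and each raw citation string stands in for a node whose identifier still has to be obtained.

```python
from collections import defaultdict

# Hypothetical out-edges: (citing paper, raw citation string found in it).
edges = [
    ("paperA", "Smith 1876c"),
    ("paperB", "Smith 1876c"),
    ("paperB", "Linnaeus 1758"),
]


def cited_nodes(edges):
    """Map each cited string to the set of papers citing it (its in-edges)."""
    in_edges = defaultdict(set)
    for citing, cited in edges:
        in_edges[cited].add(citing)
    return dict(in_edges)
```

The keys of the returned mapping are the nodes that have in-edges, i.e. the publications that need an identifier first.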

As we do not digitize bottom-up (i.e. starting with Linnaeus), but more
or less top-down (i.e. with today's publications), we need to assign a
UUID to a publication based on the mere reference string, even if the
referenced publication is not yet available digitally and therefore
does not have a UUID yet. In other words, we need to be able to assign
UUIDs based on parsed reference strings, and exactly this is what
CiteBank is intended to make possible.

Synchronization is then to make sure each node in the reference graph
has exactly one UUID, but assigning more than one is not a severe
problem, as synchronization can later establish that several UUIDs
actually point to the same publication / node in the reference graph.
As far as I understand it, this "collect reference strings and issue UUIDs now,
synchronize later" model is what came out of the CiteBank discussion
in Montpellier.
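
One way to picture "issue now, synchronize later" is a union-find structure over identifiers, sketched below. This is an assumption about how synchronization could be recorded, not a description of any CiteBank design; the class name and the "id-N" identifiers are hypothetical.

```python
class IdentifierAliases:
    """Union-find over identifiers: merged IDs resolve to one representative."""

    def __init__(self):
        self.parent = {}

    def find(self, ident):
        """Return the representative identifier for ident."""
        self.parent.setdefault(ident, ident)
        while self.parent[ident] != ident:
            # Path halving keeps later lookups short.
            self.parent[ident] = self.parent[self.parent[ident]]
            ident = self.parent[ident]
        return ident

    def merge(self, a, b):
        """Record that identifiers a and b denote the same publication."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a
```

Every identifier ever issued stays valid; merging only changes which representative it resolves to.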

Please, Rich, correct me if I'm mistaken.

Putting up a service offering something like "give me the reference
for the original description of taxon XYZ" based on CiteBank should
not be too much of a problem technically. And as far as I understand the
previous mails in this discussion, this is what taxonomists want, isn't it?

So far my two cents,
Guido

Dean Pentcheff

Dec 16, 2009, 7:57:48 PM12/16/09
to Taxonomic Literature
I agree with Guido's perspective: we must build a system so that we
can include/point to the historical literature as well as linking up
present and future literature. That's what makes taxonomy weird among
the sciences: we really don't get to let the past drop off into the
abyss.

And to pick up on Donat's comment: one way to pull in the "classical"
taxonomists is to build a service that lets them get their work done
better than they do now with the tools they have now. Not a new way of
doing biology, but doing the "old" way quicker. They'll contribute if
their contributions come back better than when they put them in. At
least that's been our experience with the decapod system.

And finally... the issue of society journals. I take it we all have
just about no sympathy for commercial journals: they're a business.
Either they figure out how to make money giving us (all of us) what we
want, or they die. But there's the argument that society journals are
a significant source of income for biological societies.

We raised this issue with members (and officers) of the Crustacean
Society, publisher of the Journal of Crustacean Biology. We were
raising it in the context of our initiative to make decapod crustacean
taxonomy articles freely available to the public on our web server
(and that would necessarily include many JCB articles).

We made the argument essentially this way: No professional society
dedicated to advancing the study of a discipline can, in good
conscience, support itself by restricting access to the published
knowledge on that subject. There was silence. Heads nodded. And people
started talking about different financial models.

I still believe that argument.

-Dean
--
Dean Pentcheff
pent...@gmail.com

Richard Pyle

Dec 17, 2009, 1:13:09 AM12/17/09
to sau...@ira.uka.de, taxo...@googlegroups.com

Hi All,

Great discussion -- but I only have a few minutes right now.

Very briefly:

> Please, Rich, correct me if I'm mistaken.

Actually, when I use the term "UUID", I use it very specifically to mean
the UUID mechanism for generating non-resolvable identifiers, which I
advocate should serve as the "identification" part of our "persistent
resolvable identifiers". The "resolvable" part would come from an HTTP
prefix (as one example of a resolution protocol) appended to the front of
the UUID. In the context of what you wrote, I would use the more generic
term "GUID" (although lately, thanks in part to what I would consider a
misleading Wikipedia page, the term "GUID" is almost synonymous with
"UUID"...but let's not go there).

Conceptually, though, I think what you write is consistent with my view.
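
The split between the bare identifier and its resolvable form could look like the sketch below. The prefix is a placeholder (example.org), not a real CiteBank or GNUB endpoint, and the function names are invented for illustration; only the standard-library uuid module is used.

```python
import uuid

# Hypothetical resolver prefix; not a real CiteBank or GNUB endpoint.
RESOLVER_PREFIX = "http://example.org/id/"


def mint_identifier():
    """Generate a random (version 4) UUID; by itself it resolves nowhere."""
    return str(uuid.uuid4())


def to_resolvable(identifier, prefix=RESOLVER_PREFIX):
    """Put a resolution protocol in front of the bare identifier."""
    return prefix + identifier


def from_resolvable(url, prefix=RESOLVER_PREFIX):
    """Strip the prefix to recover the bare, protocol-independent UUID."""
    if not url.startswith(prefix):
        raise ValueError("unexpected prefix")
    return str(uuid.UUID(url[len(prefix):]))
```

Because the identification part does not depend on the prefix, the resolution protocol can change without reissuing identifiers.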

> Putting up a service offering something like "give me the
> reference for the original description of taxon XYZ" based on
> CiteBank should not be too much of a problem technically.

Actually, this would be a service of GNUB. I imagine CiteBank as indexing
the units of literature and literature-like documentation sources; and the
taxonomic stuff falls under GNA.

Rich


Richard Pyle

Dec 17, 2009, 1:20:52 AM12/17/09
to taxo...@googlegroups.com

Apologies to everyone who is not Donat, but it appears that none of my
off-list messages to Donat (RE: HNS) are getting through, and many of his
are not getting through to me. Donat -- please try deep...@hawaii.rr.com.

Rich


Guido Sautter

Dec 17, 2009, 6:25:32 AM12/17/09
to Taxonomic Literature
Hi Rich,

> the UUID mechanism for generating non-resolvable identifiers, which I
> advocate should serve as the "identification" part of our "persistent
> resolvable identifiers". The "resolvable" part would come from an HTTP
> prefix (as one example of a resolution protocol) appended to the front of
> the UUID. In the context of what you wrote, I would use the more generic
> term "GUID" (although lately, thanks in part to what I would consider a
> misleading Wikipedia page, the term "GUID" is almost synonymous with
> "UUID"...but let's not go there).
>
> Conceptually, though, I think what you write is consistent with my view.
>
Just meant "some sort of unique, resolvable identifier", and meant to
avoid saying DOI/Handle/etc all the time to prevent the discussion from
drifting towards which of these identifier systems to use, as this is
completely irrelevant to CiteBank system design ...

>> Putting up a service offering something like "give me the
>> reference for the original description of taxon XYZ" based on
>> CiteBank should not be too much of a problem technically.
>>
> Actually, this would be a service of GNUB. I imagine CiteBank as indexing
> the units of literature and literature-like documentation sources; and the
> taxonomic stuff falls under GNA.
>
... which implies that GNUB and GNA need to interoperate from the
start, a point to consider early in system design. Adapters implemented
later tend to be one of (a) complex and highly expensive,
(b) rather fragile stopgap solutions, or (c) lasting "prototypes" with
limited functionality or performance. None of these is what we want, so
let's think of interoperability from the start.

- Guido

Richard Pyle

Dec 17, 2009, 7:58:21 AM12/17/09
to Guido Sautter, Taxonomic Literature

> ... which implies that GNUB and GNA need to interoperate
> from the start, a point to consider early in system design.
> Adapters implemented later tend to be one of (a)
> complex and highly expensive,
> (b) rather fragile stopgap solutions, or (c) lasting
> "prototypes" with limited functionality or performance.
> None of these is what we want, so let's think of interoperability
> from the start.

That's *exactly* what I think we're trying to do here.

Rich


Richard Jensen

Dec 17, 2009, 9:57:16 AM12/17/09
to Stephen Thorpe, Richard Pyle, Rod Page, Taxonomic Literature
Libraries, as we know them, may be doomed to extinction, but using that
argument is a cop out - humans are doomed to extinction as well. I'm
willing to bet that libraries will not go extinct in our lifetimes and,
until they do, we need to work with what is here now and what will be
here until the transition is complete.

That said, I think all professional societies that publish journals (at
least all with the good sense to pay attention) are examining ways to
accommodate open access and change current fiscal models. But it won't
happen overnight. Besides, open access cannot be equated with
universal access - until all have access to appropriate technology that
can be reliably provided, we need the alternative structure to keep
things going.

Let's be honest about something - if authors have to pay (from personal
budgets) to get their research published, then the pace of publication
could slow dramatically. I wonder how many of us have enough loose
change to cover these costs ourselves? Do you envision a system in which
only the relatively privileged few can continue publishing?

Finally, I'm glad to learn that most publishing taxonomists are at
"corporate institutions" that have funds readily available to pay for
publication costs. Please show me the data to support this contention.
And, given that you appear to know these things, please tell me where in
my institution's budget you have identified the funds to pay
publication costs?

Dick J

Richard Jensen, Professor
Department of Biology

Saint Mary's College

Stephen Thorpe

Dec 17, 2009, 3:13:03 PM12/17/09
to rje...@saintmarys.edu, Richard Pyle, Rod Page, Taxonomic Literature
Hi Dick,

Well, I was just throwing around preliminary ideas, and making the point that libraries have a conflict of interest here, so clearly they don't want all literature to be open access on the net very soon...

>Do you envision a system in which only the relatively privileged few can continue publishing?

Absolutely not! I envisage a system where institutions pay the publishing costs of their researcher employees, and other people can apply for grants/exemptions. Something along these lines already exists, see: http://www.royalsociety.org.nz/Site/publish/authors/submit.aspx

>if authors have to pay (from personal budgets) to get their research published, then the pace of publication could slow dramatically

Well, the pace, around here anyway, is pretty bloody slow as it is! How slow can you go?
Bear in mind that authors also require access to literature, i.e., references for their publications, which on an open access model will be free. Institutions will spend far less on library budgets, so in principle can spend that money paying publication fees.

>Finally, I'm glad to learn that most publishing taxonomists are at "corporate institutions" that have funds readily available to pay for publication costs. Please show me the data to support this contention. And, given that you appear to know these things, please tell me where in my institution's budget you have identified the funds to pay publication costs?

I didn't say that they have funds "readily available". A complete overhaul of the business model would be required. Though, as I said above, money saved on journal subscriptions COULD go towards publishing costs. Currently in the corporate sector, I see an awful lot of money going on transport and accommodation costs, particularly for senior management types. The money is there - we "just" need to somehow make sure that it gets used appropriately. Big ask, I know! :)

Stephen


Richard Pyle

Dec 19, 2009, 11:32:13 AM12/19/09
to Dean Pentcheff, Taxonomic Literature

Just catching up on old posts....

I generally agree with Dean's points about the need for conveying a few
bits of style metadata in the exchange standard; and further that it should
be limited to style only (not semantic markup). The italics probably don't
help to uniquely identify a piece of literature, but almost every consumer
of these citations will want that information included in the output when
downloading content. I think they only need to apply to titles -- no other
pieces of metadata.

> There's not much markup other than italicization that seems to occur.
> There's a very rare (in the taxonomic literature) occurrence
> of super- or sub-scripts. Boldface might occur (but I've
> never seen it). Beyond those, I think we're getting into
> MathML, and I don't think we want to go there.

I think Unicode+italics+subscript+superscript covers everything that most
consumers will want. UTF-8 is a given, so we're just talking about style
markup. Even though we only want those three (italics+sub/superscript)
initially, the mechanism for conveying this information should be generic
(and extensible). I think there are two general approaches, each with an
array of sub-approaches. One approach is to embed the markup with HTML tags
(or similar tags) directly in the titles. The other approach is to store
the information outside the title text (e.g., as attributes within the
<Title> tag). I think I slightly prefer the latter, but could be persuaded
either way.
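
To make the two approaches concrete, here is a rough sketch that converts between them. It assumes inline tags limited to <i>, <sub> and <sup>, and represents the external form as (start, end, style) character offsets into the plain title; both the tag set and the offset representation are assumptions for illustration, not part of any agreed standard. Escaping of literal "<" characters in titles is not handled.

```python
import re

TAG = re.compile(r"<(/?)(i|sub|sup)>")


def to_standoff(marked_up):
    """Split inline-tagged text into plain text plus (start, end, style) spans."""
    plain, spans, open_tags = [], [], []
    pos = length = 0
    for m in TAG.finditer(marked_up):
        chunk = marked_up[pos:m.start()]
        plain.append(chunk)
        length += len(chunk)
        pos = m.end()
        style = m.group(2)
        if m.group(1) == "":
            open_tags.append((style, length))  # remember where the span starts
        else:
            if not open_tags or open_tags[-1][0] != style:
                raise ValueError("mismatched tag: " + m.group(0))
            _, start = open_tags.pop()
            spans.append((start, length, style))
    if open_tags:
        raise ValueError("unclosed tag")
    plain.append(marked_up[pos:])
    return "".join(plain), sorted(spans)


def to_inline(plain, spans):
    """Rebuild inline tags from plain text and non-overlapping spans."""
    out, pos = [], 0
    for start, end, style in sorted(spans):
        out.append(plain[pos:start])
        out.append(f"<{style}>{plain[start:end]}</{style}>")
        pos = end
    out.append(plain[pos:])
    return "".join(out)
```

With the external form the title field stays a clean string for matching, while the spans carry the style; to_inline() as written only round-trips spans that do not nest or overlap.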

> There's a broader issue raised by the seemingly minor issue
> of title italics (and I suspect that I may differ from Rod
> Page here, though I hold out hope of convincing him!). I
> think it's important to make sure that this exchange format
> can be used (albeit indirectly) to create fully and correctly
> formatted bibliographic reference lists (such as would appear
> at the end of a taxonomic publication in a journal).

Yes, I think that's what many/most end-user consumers/clients will want to
be able to do. Myself included.

Aloha,
Rich


Richard Pyle

Dec 19, 2009, 11:43:08 AM12/19/09
to Dean Pentcheff, Taxonomic Literature
> Where genus/species were italicized in the title, we retained that.
> Where the title text was ambiguous (e.g. all-caps no-italics
> title), we italicized genus/species names (as well as
> lower-casing the title text).
> Where the title text was upper- and lower-case but the
> genus/species was not italicized, we went ahead and
> italicized it (that's the approach that I suspect is the
> least justifiable, but was also extremely uncommonly [perhaps
> never?] seen).

I agree with the first. Second, I'm not so sure (I've seen italicized small
caps, so I'm not sure there really are any ambiguous examples). Third, I
definitely do not add italics where they didn't originally occur. In a
bibliography, I would not italicize scientific names if they were not
originally italicized.

Also, I only bother with markup if specific words in the title are
italicized. If the whole title is in italics, I don't bother. The
exception to this is the (very rare) case where the whole title is in
italics, but the scientific names are not in italics. In that case I think
I would tag the names as italics.

> What to do with all-caps titles?

I don't attempt to retain case (upper vs. lower vs. small caps, etc.) when
it's used for the whole title. But there are some capitalization issues
that we should try to standardize -- like how to capitalize journal article
titles vs. book titles, etc.

> What punctuation to use to separate title and subtitle if
> they are only differentiated by typeface or point size in the
> original?

Good question -- I don't know.

> What to do with non-conventional capitalization in older papers (e.g.
> "Systematics and taxonomy of the Genus Abadabba")?
> etc.

I tend to retain those as originally rendered on the title page.

I think these are the kinds of details we'll need to think about when
developing the business rules around the "clean bucket" part of CiteBank.
And whatever those business rules are will be implemented in the output from
CiteBank. But I'm not sure we need to worry so much about them for *input*
to CiteBank (which, I suspect, in most cases will flow through the "dirty
bucket"). In other words, I think this sort of thing is more of an issue
for CiteBank business rules than for the exchange standard. By contrast, the
italics thing *is* relevant to the exchange standard, because it affects the
actual structure of the exchange standard. The caps thing really only
affects what gets inserted into the content of the exchanged documents --
not the structure of the exchange standard itself.

Rich


Richard Pyle

Dec 19, 2009, 11:49:41 AM12/19/09
to Guido Sautter, taxo...@googlegroups.com
> In this respect, I'm in line with Rich - we should not waste
> effort on creating prototypes, but start building the real
> thing. The basic structure is not so complex that it might
> prevent a successful start. We just have to keep the basic
> backend software sufficiently extensible so we can optimize
> it later on as both traffic and amount of data increase.

Yes -- more than any other biodiversity data initiative I've been involved
with, I think this one has the potential to "hit the ground running". And
compared to the analogous stuff in Taxon-name-land, this one is VERY close
to ready. Just a few more details to sort out, then we can go.

Rich


Richard Pyle

Dec 19, 2009, 12:04:40 PM12/19/09
to Taxonomic Literature

I suspect that for the next 3-5 years or so, the vast bulk of the effort will
be synchronization among existing bibliographic databases. As such, I think
we need to optimize the mechanism (part of which is the exchange standard)
for reconciliation among existing literature databases. I believe the
architecture should be something like the following:

- A "dirty bucket" where any text string purported to represent a piece of
literature may be deposited by anyone, with a link back to where that text
string came from (i.e., some sort of identifier that points to the source
database record). This is exactly modelled after GNI.

- A "clean bucket" representing fully parsed citation records stored in a
robust and normalized data structure, issuing the GUIDs we'll all
(eventually) share.

- The above two "buckets" would not exist as single instances, but rather as
many, many replicate copies spread over the world with robust means to
maintain synchronization (i.e., replication and mirroring).

- A suite of services that allows reliable mapping from records in the
dirty bucket to GUIDs in the clean bucket.

I believe the workflow would be something along the lines of the following:

- Any citation database (="content providers") can dump their
full-text-string citations into the dirty bucket.

- Services will parse these text strings, and establish "fuzzy" matches with
records in the "clean bucket".

- A report of these mappings, including confidence levels for each mapping,
will be provided back to the content provider.

- Where the content provider is confident in the mapping, the content
provider creates the link to their local copy of the clean bucket.

- Where the content provider has records that do not confidently map to the
"clean bucket", some mechanism for creating a new record in the clean bucket
would be followed.
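
The matching and reporting steps of this workflow might be sketched as below. Everything here is illustrative: the two clean-bucket records and their "guid-" keys are invented, the 0.8 threshold is arbitrary, and difflib's SequenceMatcher ratio stands in for whatever parsing and fuzzy-matching services would really be used.

```python
from difflib import SequenceMatcher

# Hypothetical clean-bucket records: GUID -> citation text.
clean_bucket = {
    "guid-001": "Smith, J. 1876. On the ants of Hawaii. "
                "Journal of Natural History 3: 1-10.",
    "guid-002": "Jones, A. 1902. Fishes of the Pacific. "
                "Bishop Museum Bulletin 12: 1-200.",
}


def normalize(text):
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def match_report(dirty_strings, clean, threshold=0.8):
    """For each dirty string, report the best clean match and its score.

    Strings whose best score is below the threshold get guid None, meaning
    a new clean-bucket record may be needed.
    """
    report = []
    for raw in dirty_strings:
        best_guid, best_score = None, 0.0
        for guid, citation in clean.items():
            score = SequenceMatcher(
                None, normalize(raw), normalize(citation)).ratio()
            if score > best_score:
                best_guid, best_score = guid, score
        if best_score < threshold:
            best_guid = None
        report.append({"source": raw, "guid": best_guid,
                       "confidence": round(best_score, 2)})
    return report
```

The report goes back to the content provider, who decides which mappings to accept; nothing is linked automatically in this sketch.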

Ultimately, all literature databases would be cross-linked to the clean
bucket; at which time we are in the realm that Rod and Donat envision.

My interest is to get us from where we are now, to where Donat and Rod (and
the rest of us) want the entire community to be, as quickly and efficiently
as possible.

Rich

Dean Pentcheff

Dec 20, 2009, 12:16:44 PM12/20/09
to Richard Pyle, taxonlit
The distinction between the business rules of CiteBank vs. the needs
of an exchange format seems excellent. Not having been at the
Montpellier discussions, I wasn't entirely clear on that structure.

And yes, I can see the ideal place where we can pull up a reference
list at the end of an old publication and have that near-automatically
linked up with a definitive database of accumulated taxonomic
references. Then we're in shiny linked-data-ville.

A few comments and questions on the proposed architecture:

How dirty should the Dirty Bucket be? I like the idea of a sort of
purgatory for semi-processed records, before they enter the paradise
of full-checkedness. But... as described, the Dirty Bucket could be
pretty much a bin of cut-and-pasted reference lists from the back end
of any/every taxonomic paper. I think that might be too permissive, in
that I doubt we'd get a substantial percentage of those definitively
linked to Clean Bucket references -- the workload would be just too
high.

One way to constrain that a bit might be to set things up so that the
Dirty Bucket will only accept some form(s) of parsed references. That
ensures that there's at least been some effort to "digest" the
references before dumping them in.

The valuable service of fuzzy-matching an arbitrary reference string
to a Clean Bucket record would be a service completely separate from
the Dirty Bucket, in that case.

Another reason I'm inclined to push for pre-parsed references in the
Dirty Bucket is that it's damned hard to parse arbitrary references.
Well, I'll qualify that: I found it damned hard. And I never got code
to do it well enough that it could run as better than a kind of
"parsing assistant" (see http://decapoda.nhm.org/recite). Nearly every
reference needs some sort of manual intervention to be properly parsed
-- journal-formatted bibliographic output just loses too much
field-specificity to be easily reversed back into a parsed record.


The next thing I'm scared about is the fuzzy comparison between Dirty
Bucket entries and Clean Bucket records. Because the comparisons have
to be fuzzy (something like a Levenshtein distance), one is stuck with
a quadratic (all-against-all) problem: every query record has to be checked against
every potential target record. Actually, it's a little worse: several
"title" fields really need to be checked against all title fields in
each target record. You can do a little preprocessing to speed the
comparisons, but you can't just do checksums and then an indexed
search.
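
For reference, a plain dynamic-programming Levenshtein distance with one cheap piece of preprocessing is sketched below. The length filter relies on the fact that the edit distance can never be smaller than the difference in string lengths; the function names and the idea of a fixed max_distance are assumptions for illustration, and this is not a claim about how any existing matcher works.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def candidates_within(query, targets, max_distance):
    """Targets within max_distance edits of query.

    The length difference is a lower bound on the edit distance, so targets
    whose length differs by more than max_distance are skipped cheaply.
    """
    hits = []
    for target in targets:
        if abs(len(target) - len(query)) > max_distance:
            continue
        if levenshtein(query, target) <= max_distance:
            hits.append(target)
    return hits
```

The filter trims the work per query, but every target is still visited once, so the overall all-against-all cost remains.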

I'm not saying we shouldn't be doing that comparison stage. But it's
going to take some careful planning, and probably some pretty inspired
input from some really smart computer science / information theory
kind of people. Without that, we'll very quickly get into a
computational quandary.


Another stage that will take some careful work is the "resolution"
phase once a candidate Dirty Bucket record is being presented next to
a menu of plausible Clean Bucket records. That's very much like the
deduplication problem. The complex part there is more of an interface
issue. What we have found is that it's almost the rare case that one
can just say "Yup, record A is a dupe of record B, move on". Much more
often, it's "Yeah, looks like they are the same thing, but the journal
name on the new one is more complete than the existing reference,
however the existing one has the issue number that I'm missing....."
Rather than a simple "take this one, trash that one", the session
turns into more of a pick-and-choose field-by-field update of the
existing record.
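
The pick-and-choose update can be sketched as a field-by-field merge. The field names and the rule "keep the existing value and flag the conflict" are assumptions made for this example; a real interface would let a person choose per field.

```python
def merge_records(existing, incoming):
    """Merge two parsed citations field by field.

    Returns the merged record plus a list of fields where both records have
    different non-empty values and a person should choose.
    """
    merged, conflicts = {}, []
    for field in sorted(set(existing) | set(incoming)):
        old, new = existing.get(field), incoming.get(field)
        if not old:
            merged[field] = new   # fill a gap from the new record
        elif not new or old == new:
            merged[field] = old   # nothing new to add
        else:
            merged[field] = old   # keep existing until reviewed
            conflicts.append(field)
    return merged, conflicts
```

In the example from the message, the issue number would be kept from the existing record while the differing journal names would be flagged for a human decision.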

But that's more of a look forward to interface design than underlying
database design.

-Dean
--
Dean Pentcheff
pent...@gmail.com


Richard Pyle

Dec 20, 2009, 2:12:03 PM12/20/09
to Dean Pentcheff, taxonlit

> How dirty should the Dirty Bucket be?

Any text string purported to represent a citation, regardless of how
complete/incomplete, cleaned/dirty, verified/unverified it may be.

> I like the idea of a
> sort of purgatory for semi-processed records, before they
> enter the paradise of full-checkedness. But... as described,
> the Dirty Bucket could be pretty much a bin of cut-and-pasted
> reference lists from the back end of any/every taxonomic
> paper.

Yes, that's exactly what it should be.

> I think that might be too permissive, in that I doubt
> we'd get a substantial percentage of those definitively
> linked to Clean Bucket references -- the workload would be
> just too high.

Hard to say. The ones that remain unlinked remain unlinked. The cleaner
ones are more likely to get linked. But here's the key: with modern
database engines, the presence of the unlinked records does not have any
meaningful impact on the function of the system as a whole. In other words,
including the dirtiest of dirty records has almost no down-side for the
utility of the not-so-dirty records. The main benefits of a liberal "gate"
for the dirty bucket are:
- You get a larger scope of possible permutations of how citations may be
represented, which helps identify the scope of variation that any
citation might take. Once the dirty ones do get linked, that facilitates
the linking of future dirty ones.
- You lower the bar for participation to anyone with any set of text strings
purported to represent citations, including "microcitations" cleaned from
scanned literature, OCR'd bibliographies from published papers, etc., etc.
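
To make the liberal "gate" concrete, here is a minimal sketch (my own
illustration, not how GNI or CiteBank is actually built; every name in it is
hypothetical). The dirty bucket accepts any string, and a record simply
carries an empty link until somebody or something connects it to a clean
record:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DirtyRecord:
    text: str                       # raw citation string, accepted as-is
    clean_id: Optional[str] = None  # None until linked to a clean record
    verified: bool = False          # algorithmic links start unverified


@dataclass
class DirtyBucket:
    records: List[DirtyRecord] = field(default_factory=list)

    def add(self, text: str) -> DirtyRecord:
        # No quality gate: any string purported to be a citation is kept.
        rec = DirtyRecord(text)
        self.records.append(rec)
        return rec

    def linked(self) -> List[DirtyRecord]:
        return [r for r in self.records if r.clean_id is not None]

    def unlinked(self) -> List[DirtyRecord]:
        return [r for r in self.records if r.clean_id is None]


# Made-up strings, purely for illustration.
bucket = DirtyBucket()
good = bucket.add("Smith, J. 1999. A made-up title. J. Example 1: 1-10.")
junk = bucket.add("smth 99 made up ttl")
good.clean_id = "clean:1"  # hypothetical clean-bucket identifier
```

The unlinked record just sits in `bucket.unlinked()`; nothing about it changes
how the linked record behaves.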

This is essentially the model for the Global Names Index
(www.globalnames.org), and based on conversations that Chris and I and
others had at TDWG, I am absolutely convinced that this model will serve as
as important a function for CiteBank as GNI does for GNA.

> One way to constrain that a bit might be to set things up so
> that the Dirty Bucket will only accept some form(s) of parsed
> references. That ensures that there's at least been some
> effort to "digest" the references before dumping them in.

Why? What value do you gain by excluding the unparsed text strings? As with
GNI, there will be parsing algorithms for the text strings. Also as has
been discussed for GNI, there should be a mechanism for content providers
with pre-parsed records (or, better yet, records pre-linked to the clean
bucket) to submit that parsed/linked content directly to the dirty bucket,
to help the parsing algorithms "learn" how to improve their methods.

> The valuable service of fuzzy-matching an arbitrary reference
> string to a Clean Bucket record would be a service completely
> separate from the Dirty Bucket, in that case.

It will be a completely separate service in any case. The more links that
are made (and verified), the more robust the linking capabilities become.
In other words: one does not need to link every unlinked dirty record to a
clean record -- one need only link the dirty records to other dirty records
that have already been linked to a clean record (such links would, of
course, require some sort of verification -- as would all links between
dirty bucket and clean bucket that were algorithmically derived).
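
A minimal sketch of that dirty-to-dirty shortcut, assuming a very crude
normalization step (lowercase, strip punctuation) stands in for whatever real
matching would be used. The class, the function names, and the "clean:1"
identifier are all hypothetical:

```python
import re
from typing import Dict, Optional, Tuple


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


class LinkIndex:
    def __init__(self) -> None:
        # normalized dirty string -> (clean record id, verified flag)
        self._links: Dict[str, Tuple[str, bool]] = {}

    def link(self, dirty_text: str, clean_id: str, verified: bool = False) -> None:
        self._links[normalize(dirty_text)] = (clean_id, verified)

    def resolve(self, dirty_text: str) -> Optional[Tuple[str, bool]]:
        """Return (clean_id, verified) if an equivalent dirty string is already linked."""
        return self._links.get(normalize(dirty_text))


# Made-up citation strings, purely for illustration.
index = LinkIndex()
index.link("Smith, J. (1999). Made-up title. J. Example 1: 1-10.", "clean:1", verified=True)
hit = index.resolve("SMITH J 1999 made-up title, J Example 1:1-10")
miss = index.resolve("Jones 2001 something else entirely")
```

Here `hit` comes back pointing at the clean record without the new string ever
being compared against the clean bucket itself. The flag returned belongs to
the earlier link; the newly derived link would still need its own verification.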

> Another reason I'm inclined to push for pre-parsed references
> in the Dirty Bucket is that it's damned hard to parse
> arbitrary references.

Agreed. I've been *mighty* impressed by the work done for GNI in parsing
name-strings; but I agree that citation-strings are more complex and
potentially ambiguous. On the other hand, there is a much larger body of
work that has already been done on that (by other communities), and there
are many dictionaries that can assist the parsing algorithms. In any case,
I think there should be an option to include pre-parsed records, but I see
no reason why it would be advantageous to constrain the "least common
denominator" for contributed content to the pre-parsed subset.

> The next thing I'm scared about is the fuzzy comparison
> between Dirty Bucket entries and Clean Bucket records.
> Because the comparisons have to be fuzzy (something like a
> Levenshtein distance), one is stuck with an exponential
> problem: every query record has to be checked against every
> potential target record. Actually, it's a little worse:
> several "title" fields really need to be checked against all
> title fields in each target record. You can do a little
> preprocessing to speed the comparisons, but you can't just do
> checksums and then an indexed search.

I think we'll find that with modern computer technology, this is not so
scary. I am utterly *amazed* at how quickly the GNI comparisons can be done
-- and that will end up as a MUCH larger dataset.
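
For what it's worth, here is a minimal sketch of the kind of preprocessing
mentioned above. Comparing everything against everything grows with the
product of the two bucket sizes, so the usual trick is to "block" on a cheap
key and only run the expensive edit distance inside a block. Using the
publication year as the key is purely my assumption for illustration, and I
make no claim that GNI works this way:

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def block_key(citation: str) -> Optional[str]:
    """Cheap blocking key: first plausible 4-digit year (an assumed heuristic)."""
    m = re.search(r"\b(1[5-9]\d\d|20\d\d)\b", citation)
    return m.group(1) if m else None


def build_blocks(clean: Dict[str, str]) -> Dict[Optional[str], List[str]]:
    """Group clean record ids by blocking key, once, ahead of any query."""
    blocks: Dict[Optional[str], List[str]] = defaultdict(list)
    for clean_id, text in clean.items():
        blocks[block_key(text)].append(clean_id)
    return blocks


def candidates(query: str, clean: Dict[str, str],
               blocks: Dict[Optional[str], List[str]],
               max_dist: int = 10) -> List[Tuple[int, str]]:
    """Score the query only against clean records that share its block key."""
    scored = []
    for clean_id in blocks.get(block_key(query), []):
        d = levenshtein(query.lower(), clean[clean_id].lower())
        if d <= max_dist:
            scored.append((d, clean_id))
    return sorted(scored)


# Made-up clean records, purely for illustration.
clean = {
    "c1": "Smith, J. 1999. Made-up title.",
    "c2": "Jones, A. 2001. Another made-up title.",
}
blocks = build_blocks(clean)
result = candidates("Smith J 1999 Made-up title", clean, blocks)
```

The obvious weakness is that a dirty string with a missing or mistyped year
never meets its true match, so a real system would presumably need several
overlapping keys rather than one.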

> I'm not saying we shouldn't be doing that comparison stage.
> But it's going to take some careful planning, and probably
> some pretty inspired input from some really smart computer
> science / information theory kind of people. Without that,
> we'll very quickly get into a computational quandary.

Yup.

> Another stage that will take some careful work is the "resolution"
> phase once a candidate Dirty Bucket record is being presented
> next to a menu of plausible Clean Bucket records. That's
> very much like the deduplication problem. The complex part
> there is more of an interface issue. What we have found is
> that it's almost the rare case that one can just say "Yup,
> record A is a dupe of record B, move on". Much more often,
> it's "Yeah, looks like they are the same thing, but the
> journal name on the new one is more complete than the
> existing reference, however the existing one has the issue
> number that I'm missing....."
> Rather than a simple "take this one, trash that one", the
> session turns into more of a pick-and-choose field-by-field
> update of the existing record.

Yup -- not easy. But not insurmountable. And again, I don't see how the
presence of unlinked dirty records in any way hampers the cross-linking
process for the less-dirty (e.g., pre-parsed) records.
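
As a rough sketch of what the pick-and-choose step might hand to the interface
(an illustration only; the field names and the "longer value is probably more
complete" rule are my assumptions, and a person still makes the final call):

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]


def merge_suggestion(existing: Record, incoming: Record) -> Dict[str, dict]:
    """Propose, field by field, which value to keep."""
    proposal = {}
    for name in sorted(set(existing) | set(incoming)):
        old, new = existing.get(name), incoming.get(name)
        if not new or old == new:
            choice, note = old, "keep existing"
        elif not old:
            choice, note = new, "fill gap from incoming"
        else:
            # Both present and different: suggest the longer one, flag for review.
            choice = new if len(new) > len(old) else old
            note = "conflict: review"
        proposal[name] = {"existing": old, "incoming": new,
                          "suggested": choice, "note": note}
    return proposal


# Made-up records: the new one has the fuller journal name,
# the existing one has the issue number.
existing = {"journal": "J. Ex.", "issue": "3", "year": "1999"}
incoming = {"journal": "Journal of Examples", "issue": None, "year": "1999"}
proposal = merge_suggestion(existing, incoming)
```

In this example the journal name is flagged as a conflict with the fuller
incoming value suggested, while the issue number is kept from the existing
record.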

Rich

