I'm currently on hour 34 of my long journey home. I spent much of last
night in Marseille airport (and two batteries' worth of time on the plane to
Houston) cleaning up my own list of journal titles. What I currently lack,
and what I think would be EXTREMELY useful, is a list of "clean" (i.e.,
validated) full journal titles (plus one or more standard abbreviations),
that we could use to cross-link our own (often messy & unverified) journal
titles to. I've worked on this before, and I know that certain nomenclator
databases have pretty clean lists of group-specific journal titles +
abbreviations (e.g., Catalog of Fishes, Hymenoptera Name Server, Hexacoralia
database, etc.). Chris (F.) mentioned that BHL has some sort of list, and
that no such list for our community exists out in library-land. Cathy
indicated that we might be able to extract something from the Library of
Congress or some other Library resource. Rod -- were you able to compile a
list of clean journal titles in your bioguid work on this?
From my perspective, this should be the first content-related priority
(after we get a draft data standard & data model hammered out).
One more thing: at TDWG I mentioned ISO 833 (a PDF of which I just
uploaded to the site:
http://groups.google.com/group/taxonlit/web/ISO833.pdf). This might be
useful for normalizing our standard abbreviations (even if it is no longer a
valid ISO standard). Anybody want to take on the task of OCRing it (or
finding an electronic copy online)?
Aloha,
Rich
Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deep...@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html
Yes, that's what I intended to convey; but I see now I didn't quite put it
that way in my email. Obviously, the goal is to build the links. I have
found that having one "clean" form of the title is the easiest way to
achieve this, so that you have "n" variants linking to one "master", rather
than "n" variants cross-linked to each other. Obviously, we're going to want
to accommodate multiple simultaneously legitimate representations of journal
titles, so I completely agree with that. I was more thinking in terms of
building a foundation to reconcile all the illegitimate variants
(misspellings, missing words or punctuation, truncated titles, etc.)
against a "clean" variation that accommodates multiple legitimate
representations (different languages, different character sets,
abbreviations, legitimate alternate titles, etc.).
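
A rough Python sketch of the "n variants linking to one master" idea, in case it helps to make the structure concrete. Everything here is hypothetical: the journal titles, the master ID, and the names `MASTERS`, `VARIANTS`, `normalize`, and `resolve` are invented for illustration, and the normalization rule is only one of many possible choices.

```python
import re
import unicodedata


def normalize(title: str) -> str:
    """Crude matching key: strip accents, lowercase, drop punctuation, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = re.sub(r"[^\w\s]", " ", no_accents.lower())
    return " ".join(no_punct.split())


# Hypothetical master record: one ID, one clean title, and any number of
# legitimate representations (abbreviations, other languages, and so on).
MASTERS = {
    1: {
        "title": "Journal of Example Zoology",
        "legitimate": ["J. Example Zool.", "Zeitschrift für Beispielzoologie"],
    },
}

# Hypothetical illegitimate variants (misspellings, missing words), each
# pointing at exactly one master rather than at each other.
VARIANTS = {
    "Journ. of Exmple Zoology": 1,
    "Journal Example Zoology": 1,
}


def build_index(masters, variants):
    """Map every normalized form, legitimate or not, to its master ID."""
    index = {}
    for master_id, record in masters.items():
        for form in [record["title"], *record["legitimate"]]:
            index[normalize(form)] = master_id
    for form, master_id in variants.items():
        index[normalize(form)] = master_id
    return index


INDEX = build_index(MASTERS, VARIANTS)


def resolve(title: str):
    """Return the master ID for a title, or None if the form is unknown."""
    return INDEX.get(normalize(title))
```

With this layout, adding a newly seen variant is one new row pointing at the master; nothing needs to be cross-linked to the other variants.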
> Journals can
> have more than one name in use. My approach in bioGUID is to
> gather alternative names, link them by the journal ISSN,
Yes, the ISSN is handy, but only applies to journals with ISSNs. For
example, I'm currently working with a set of 4700 journal titles from Zoo.
Record, of which over 500 lack ISSNs. I have another master list of 157,000
titles of periodicals, which represent 103,000 unique titles, of which
31,000 lack ISSNs.
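
A minimal sketch of how ISSN-based linking and the no-ISSN leftovers might be handled together, assuming the input is a list of (title, ISSN) pairs. The function names and the placeholder titles are hypothetical; the check-digit rule is the standard ISSN one (weights 8 down to 2 on the first seven digits, modulo 11, with "X" standing for 10). The two sample ISSNs are the ones from the bioGUID URLs quoted in this thread, paired with placeholder titles only.

```python
import re
from collections import defaultdict

ISSN_RE = re.compile(r"^(\d{4})-(\d{3})([\dXx])$")


def is_valid_issn(issn: str) -> bool:
    """Check the NNNN-NNNC format and the ISSN check digit."""
    match = ISSN_RE.match(issn.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weights 8, 7, 6, 5, 4, 3, 2 applied to the first seven digits.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return match.group(3).upper() == expected


def group_titles(records):
    """Group titles by valid ISSN; collect everything else for title matching."""
    by_issn = defaultdict(list)
    unlinked = []
    for title, issn in records:
        if issn and is_valid_issn(issn):
            by_issn[issn.strip().upper()].append(title)
        else:
            unlinked.append(title)
    return dict(by_issn), unlinked


# Placeholder titles; only the ISSNs come from the thread.
sample = [
    ("Placeholder title A", "0454-6296"),
    ("Placeholder title A (variant)", "0454-6296"),
    ("Placeholder title B", "0372-1426"),
    ("Placeholder title C", None),         # no ISSN at all
    ("Placeholder title D", "0454-6295"),  # bad check digit
]
linked, unlinked = group_titles(sample)
```

Titles that end up in `unlinked` are the ones that still need to be reconciled by title alone.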
> I think the ability to match names is what we really want, a
> clean definitive list is simply a by-product of supporting that.
In one sense it's a byproduct; but in another sense it's also a tool to help
us achieve the reconciliation/matching.
> I could make a dump of what I have in bioGUID available. You
> can also access individual journals like this:
>
> http://bioguid.info/issn/index.php?issn=0454-6296
>
> http://bioguid.info/issn/index.php?issn=0372-1426
>
> I also have a journal lookup service that uses approximate
> string matching at http://bioguid.info/services/
Cool! Thanks! I'd definitely like to compare what you have to what I have
-- especially in terms of matching ISSNs for those journals that have them.
> A clean list by itself isn't of much use given that people
> don't always use these. What you want are a list of what
> people do use, plus tools to cluster these (and variants you
> haven't seen yet) into sets that refer to the same journal.
Unfortunately, in my experience, most of what people use is a hodge-podge of
abbreviations, full titles (some with periods, some without), and many,
many spelling errors, truncations and duplicates. I see the one "clean"
version of the title as the most effective mechanism I have found for
consolidating titles, both within a messy set of journals and between two
sets (whether they are messy or clean). The problem I keep encountering has
more to do with illegitimate alternative representations of titles than with
legitimate ones.
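
As a rough illustration of reconciling a messy title against a clean list, here is a small sketch using `difflib.get_close_matches` from the Python standard library. This is not how the bioGUID lookup service works (I don't know its internals); the clean titles, the `key` and `best_match` helpers, and the 0.8 cutoff are all assumptions made for the example.

```python
import difflib

# Hypothetical clean list; a real one would come from a validated source.
CLEAN_TITLES = [
    "Journal of Example Zoology",
    "Annals of Hypothetical Botany",
    "Bulletin of Sample Ichthyology",
]


def key(title: str) -> str:
    """Lowercase, treat periods as spaces, and collapse whitespace."""
    return " ".join(title.lower().replace(".", " ").split())


def best_match(messy: str, clean_titles, cutoff: float = 0.8):
    """Return the closest clean title, or None if nothing scores above the cutoff."""
    keyed = {key(t): t for t in clean_titles}
    hits = difflib.get_close_matches(key(messy), list(keyed), n=1, cutoff=cutoff)
    return keyed[hits[0]] if hits else None


match = best_match("Journl of Example Zoology", CLEAN_TITLES)
```

Similarity scoring of this kind tends to catch misspellings and small omissions. Heavily truncated titles and abbreviations are probably better handled by linking known variants to the clean title explicitly.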
I definitely look forward to seeing a dump of what you have.
Rich