Firstly, I think Biodiversity Informatics needs one good names management environment that is open and flexible. It needs to be an environment that interconnects available services and authoritative data sources.
I presume that a significant goal is to interconnect distributed data by using the names. That means we have to overcome the problem that the same species can have many names, written in various ways. Then we have to standardise the results in the context of any one of many authoritative taxonomies. These are the challenges that Global Names was set up to address, building on a 10-year discussion (boringly outlined at globalnames.org).
We are more or less in the position of delivering pretty solid services that will deal with some of the issues that Arlin has identified.
GN has names recognition and discovery tools, so we can run through sources ranging from docs, PDFs, text and HTML to images and so on, find the names, and spit them out. This is scalable and is currently being run against the full corpus of the Biodiversity Heritage Library.
Once we have the names we can run a variety of services.
One of these deals with variant name strings. We currently apply the Tony Rees / Mike Giddens fuzzy match, but should also be looking at other options that are available. In addition, we have already rendered down our reference system of about 22 million name strings to about 7 million groups. This environment also helps to open up 'Did you mean ...?' options.
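To make the idea of grouping variant name strings concrete, here is a minimal sketch in Python. It is not the Tony Rees / Mike Giddens fuzzy match and not how GN does it; it swaps in the standard library's difflib for the fuzzy step, and the function names (normalise, group_variants, did_you_mean) and the crude "first two words" normalisation are assumptions made purely for illustration.

```python
import difflib
import re
from collections import defaultdict


def normalise(name_string: str) -> str:
    """Reduce a name string to a crude key: lower-cased first two words.

    Assumption for illustration only: the first two alphabetic tokens are
    the genus and the species epithet; authorship and year are discarded.
    """
    # Drop parenthesised text such as bracketed authorship.
    cleaned = re.sub(r"\(.*?\)", " ", name_string)
    tokens = re.findall(r"[A-Za-z]+", cleaned)
    return " ".join(token.lower() for token in tokens[:2])


def group_variants(name_strings):
    """Group name strings that share the same normalised key."""
    groups = defaultdict(list)
    for name_string in name_strings:
        groups[normalise(name_string)].append(name_string)
    return dict(groups)


def did_you_mean(query: str, known_keys, cutoff: float = 0.8):
    """Suggest known keys that are close to a (possibly misspelt) query."""
    return difflib.get_close_matches(
        normalise(query), list(known_keys), n=3, cutoff=cutoff
    )


if __name__ == "__main__":
    groups = group_variants([
        "Homo sapiens Linnaeus, 1758",
        "Homo sapiens L.",
        "HOMO SAPIENS",
        "Pan troglodytes (Blumenbach, 1775)",
    ])
    print(groups)
    print(did_you_mean("Homo sapien", groups))
```

The point of the sketch is only the two-step shape: exact grouping on a normalised key first, with fuzzy matching kept for the strings that do not land in any known group.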
Alongside that, we need to ensure that we cover the homonym problem. Through collaboration with Tony Rees and IRMNG, we have access to a vast amount of homonymy information, for much but not all of which we can offer taxonomic context that can be used to help disambiguate homonyms.
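As a rough illustration of how taxonomic context can separate homonyms, the Python sketch below scores each candidate by how many of its higher taxa also appear in the context of the source being processed. The data structure, the disambiguate function and the tiny hand-made table are hypothetical and are not the IRMNG data model or a GN service; the example relies only on the fact that Ficus is both a genus of figs (plants) and a genus of sea snails (animals).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One sense of a homonymous name, with some of its higher taxa."""
    name: str
    classification: frozenset


# Hypothetical, hand-made homonym table for illustration only.
HOMONYMS = {
    "Ficus": [
        Candidate("Ficus", frozenset({"Plantae", "Moraceae"})),
        Candidate("Ficus", frozenset({"Animalia", "Mollusca",
                                      "Gastropoda", "Ficidae"})),
    ],
}


def disambiguate(name, context_taxa):
    """Return the candidate whose classification best overlaps the context.

    Returns None when the name is unknown, when nothing in the context
    matches, or when the best score is tied (no safe decision possible).
    """
    context = set(context_taxa)
    scored = sorted(
        ((len(c.classification & context), c) for c in HOMONYMS.get(name, [])),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not scored or scored[0][0] == 0:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return scored[0][1]


if __name__ == "__main__":
    # Context taxa would come from the surrounding document or data set.
    print(disambiguate("Ficus", {"Gastropoda", "Mollusca"}))
    print(disambiguate("Ficus", {"Plantae"}))
    print(disambiguate("Ficus", set()))
```

Returning None on a tie or an empty context reflects the caveat above: where no taxonomic context is available, the homonym has to stay unresolved.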
I am attaching a copy of our TREE article in which we laid out our approach.
I am sure there will be many more emails from me as I work through my inbox.
Paddy