Advice sought for DOI metadata, taxonomic name-finding and resolution


David Shorthouse

Apr 18, 2012, 3:26:13 PM
to drya...@googlegroups.com
Folks,

I noticed on your development list, http://wiki.datadryad.org/Repository_Development_Plan, that you are considering ingestion of taxonomic / vernacular names to help supplement search across your holdings. I also understand that you have been in touch with David Patterson, PI of the NSF-funded Global Names project.

I am looking for a simple way for you to take advantage of the Global Names taxonomic name-finding and resolution services. These are still under development and we are still receiving feedback from consumers. Nonetheless, one of our services can take a URL as a query parameter and find all names in the document it points to. This URL could point to a PDF, image, doc, xls, etc., and the service does OCR on the fly as needed. The response is a list of unique names. Another service of ours can take a flat list of names and resolve these against other lists (e.g. Catalogue of Life, NCBI, EOL, GBIF), producing their local identifiers for a linking service as well as their tree paths to root for possible concept expansion in your index.

So, I'm writing to inquire if you have plans to include direct links to data packages (and MIME type, though not immediately necessary) in responses to DOI content negotiations.

For example:
curl -LH "Accept: application/rdf+xml" "http://dx.doi.org/10.5061/dryad.584"
(or any of your other supported content types as expressed at http://data.datacite.org/10.5061/dryad.584)

...gives me some nice metadata, but doesn't actually give me a link to the data package that I'm most interested in. The only apparent way to get the package is to visit http://datadryad.org/resource/doi:10.5061/dryad.584 and fish for it. Had a link to the package been provided, you'd be pretty close to scratching "...Search over hierarchical concepts (e.g., "all lizards")..." off your list.
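
For anyone who would rather script this than use curl, the same request can be sketched in Python with only the standard library. This is a rough sketch under assumptions: the helper names `build_negotiation_request` and `fetch_metadata` are made up for illustration and are not part of any Dryad or DataCite client, and nothing is assumed about the response beyond what the curl call returns.

```python
import urllib.request


def build_negotiation_request(doi, accept="application/rdf+xml"):
    """Build a request asking the DOI resolver for metadata in one format.

    Hypothetical helper for illustration; not part of any Dryad or
    DataCite library.
    """
    url = "http://dx.doi.org/" + doi
    # The Accept header is what drives content negotiation.
    return urllib.request.Request(url, headers={"Accept": accept})


def fetch_metadata(doi, accept="application/rdf+xml"):
    """Send the request and return the body as text.

    urllib follows redirects by default, much like curl's -L flag.
    """
    request = build_negotiation_request(doi, accept)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_metadata("10.5061/dryad.584")` should return the same metadata as the curl command, which is to say metadata without a direct link to the data package.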

There are, however, going to be some limitations and requirements for names within any of your submitted data packages. These may have to be fed back to data depositors if they wish to have the names within their submissions recognized and indexed. We can chat more about that at a later date.

All the best,

David P. Shorthouse
Marine Biological Laboratory
Woods Hole, MA

Mark Diggory

Apr 19, 2012, 11:28:57 AM
to David Shorthouse, drya...@googlegroups.com
David,

Having worked on the "/resource/" service for Dryad, I can advise that the idea was for it to eventually support content negotiation and exposure of the resource in various citation, LoD and XML formats. At this time, however, this is still limited. I did want to give you an example of how metadata is exposed via the resource service.

Currently we are supporting the following without content negotiation:

Eventually OAI-ORE available from

would be available from

And in the case of an RDF representation, the intention would be to eventually use 303 redirection and content negotiation to expose it at variations such as the following for RDF browsers:
http://datadryad.org/resource/doi:10.5061/dryad.584/rdf.n3


Conversely, the approach we are taking to populate DataCite identifiers and metadata is to make back-end calls to those services at the time it is appropriate to release the identifier, metadata or content publicly. I could see a similar approach with globalnames.org. It could be feasible to add providers that would interact with the API to submit content and to retrieve and store suggested fields/values that may be appropriate for the data package and data file records. An initial strategy for this could operate asynchronously in the background, after submissions have completed workflow processing, have been archived, and can be shared with external services. This feedback would allow the curators and submitters to clean up suggested values after the submission has been completed.

If it's your interest to pre-calculate those records by harvesting the data beforehand, I do imagine calculating and traversing the above paths to be feasible. Eventually, RDF representations could be traversed via the content negotiation of the /resource/ reference from DataCite redirection.

At this time I think the hierarchical graph of DataPackage and DataFiles is not something we can easily push into DataCite. The actual Data File Record and its parent Package Record are separate entries in DataCite, and the actual content bitstreams would be poorly described in the service.

I would be interested to find out more about the API for globalnames.org.

Regards,

--
@mire Inc. 
Mark Diggory (Schedule a Meeting)
2888 Loker Avenue East, Suite 305, Carlsbad, CA. 92010
Esperantolaan 4, Heverlee 3001, Belgium
http://www.atmire.com


Hilmar Lapp

Apr 19, 2012, 11:34:47 AM
to Mark Diggory, David Shorthouse, Dryad Developers
David - do I understand your email correctly in that you wanted to crawl the actual data files for taxonomic names, and not the metadata?

Programmatic access to the actual data files is possible, but not quite as straightforward and clean as we want it to be. The progress and aims are documented here: http://wiki.datadryad.org/Data_Access

I think the DataONE data access API will be live within a week or two, if it isn't already with the just-released v1.11. (Ryan can update us on that.)

-hilmar
 
--
You received this message because you are subscribed to the Google Groups "dryad-dev" group.
To post to this group, send email to drya...@googlegroups.com.
To unsubscribe from this group, send email to dryad-dev+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/dryad-dev?hl=en.

-- 
===========================================================
: Hilmar Lapp  -:- Durham, NC -:- informatics.nescent.org :
===========================================================



Shorthouse, David

Apr 19, 2012, 12:06:14 PM
to Hilmar Lapp, Mark Diggory, Dryad Developers
> David - do I understand your email correctly in that you wanted to crawl
> the actual data files for taxonomic names, and not the metadata?

Actually, both. And, because these are early days in my exploration,
I'm looking for ways to ensure that name-crawling procedures can be as
simple as possible, either upon ingestion or throughout the lifecycle
of the data package. As you know, name-crawling and indexing cannot be
a one-off event for any one data package. Rather, taxonomy is a moving
target so packages will have to be re-crawled for names.

After having examined many data sets already accessible via Dryad, I
see all manner of idiosyncratic ways taxonomic names have been
expressed, which is not overly surprising. Crawling will work well in
some instances (i.e. if near code-compliant), but not so well in
others (e.g. genera lowercase, underscores between genus and epithet,
etc.). So, crawling for names will have to hit both the metadata and
the data packages (and ideally, the published works as well) in an
attempt to get as much coverage as possible.
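
To make the two examples above concrete, a pre-processing pass over raw values might look something like the sketch below. It is only an illustration and is not part of any Global Names service. The function name `normalize_candidate` is hypothetical, and the pass handles just the two quirks mentioned (lowercase genera and underscores between genus and epithet). It would mangle anything that carries authorship or rank markers, so treat it as a starting point at best.

```python
import re


def normalize_candidate(raw):
    """Tidy a raw value into 'Genus epithet' form before name-finding.

    Hypothetical helper for illustration. Handles only underscores used
    as separators and lowercase genera; it does not validate the name.
    """
    # Treat underscores and runs of whitespace as a single space.
    words = re.sub(r"[_\s]+", " ", raw).strip().split(" ")
    if not words[0]:
        return ""
    # Capitalize the genus and lowercase any remaining epithets.
    genus = words[0][:1].upper() + words[0][1:].lower()
    epithets = [word.lower() for word in words[1:]]
    return " ".join([genus] + epithets)
```

With this, a value such as `anolis_carolinensis` becomes `Anolis carolinensis`, which stands a better chance of being recognized.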

> Programmatic access to the actual data files is possible, but not quite as
> straightforward and clean as we want to be. The progress and aims are
> documented here: http://wiki.datadryad.org/Data_Access

Excellent. This is a great start.

> I think the DataONE data access API will be live within the week or two,
> if it isn't already since with the just-released v1.11. (Ryan can update us
> on that.)
>
> -hilmar

The very early GN APIs are accessible at: http://gnrd.globalnames.org
and http://resolver.globalnames.org. The latter is *very* early days,
but will give you a sense for what it can do. Its rudimentary API doc
is here: https://github.com/GlobalNamesArchitecture/gni/wiki/Name-resolution-API-(early-public-ALPHA).
There's also a chance we might deprecate the former and roll it into
the latter, which would eliminate an API call. Needless to say, we've
been concentrating more on the backend robustness, algorithms, and
names list acquisition/ingestion.

Cheers,

David P. Shorthouse

Hilmar Lapp

May 1, 2012, 1:43:35 PM
to davidpsh...@gmail.com, Mark Diggory, Dryad Developers

On Apr 19, 2012, at 12:06 PM, Shorthouse, David wrote:

> The very early GN APIs are accessible at: http://gnrd.globalnames.org

Do I understand correctly that in contrast to the other one this interface doesn't allow the choosing of data sources?

And BTW if I feed it a data package URL (http://datadryad.org/resource/doi:10.5061/dryad.743) it (erroneously) complains in the response that "That URL was inaccessible". Any thoughts on what might be happening?

> and http://resolver.globalnames.org. The latter is *very* early days, but will give you a sense for what it can do. It's rudimentary API doc is here: https://github.com/GlobalNamesArchitecture/gni/wiki/Name-resolution-API-(early-public-ALPHA).

BTW the HTML output isn't working for that (it doesn't show any metadata for the list datasources example). Aside from that, the documentation seems to be suggesting that you need to present it with a list of name string candidates already, rather than with a URL to a page or a PDF that may contain such name strings. Is that true, or is the documentation behind? Also, there is no good example of what a resolution query would actually look like.

For my part, though, I am looking forward to these resources finally becoming functional one day.

Cheers,

-hilmar

David Shorthouse

May 1, 2012, 2:31:22 PM
to drya...@googlegroups.com, davidpsh...@gmail.com, Mark Diggory

> The very early GN APIs are accessible at: http://gnrd.globalnames.org

> Do I understand correctly that in contrast to the other one this interface doesn't allow the choosing of data sources?

That's correct. This is purely name-finding using both a dictionary-based and an NLP-based approach; it does not do any resolution against any sources of names. It is built for speed.
 
> And BTW if I feed it a data package URL (http://datadryad.org/resource/doi:10.5061/dryad.743) it (erroneously) complains in the response that "That URL was inaccessible". Any thoughts on what might be happening?

Your server returns an HTTP 405 Method Not Allowed for a HEAD request, which I do prior to a GET request in the event there are any 404s or 301/302 redirects. Is there any reason why your server(s) is/are configured to prevent HEAD requests? This would certainly affect your page ranks because many search engine bots do HEAD requests.
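
For reference, a client that wants to tolerate servers like this could fall back to GET when HEAD is refused. The following is a minimal sketch of that idea, not how the name-finding service is actually implemented. The function name `probe` and its `opener` parameter are hypothetical, and `opener` exists only so the logic can be exercised without a network connection.

```python
import urllib.error
import urllib.request


def probe(url, opener=urllib.request.urlopen):
    """Return the HTTP status from HEAD, or from GET if HEAD is refused.

    Hypothetical helper for illustration. Redirects are followed by
    urllib, so the status returned is that of the final response.
    """
    for method in ("HEAD", "GET"):
        request = urllib.request.Request(url, method=method)
        try:
            with opener(request) as response:
                return response.status
        except urllib.error.HTTPError as error:
            # 405 on HEAD means the method itself is refused: retry with GET.
            if method == "HEAD" and error.code == 405:
                continue
            # Any other error status (e.g. 404) is reported as-is.
            return error.code
```

The cost is one extra round trip for servers that reject HEAD, which is why fixing the server configuration is still the better outcome.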
 
> and http://resolver.globalnames.org. The latter is *very* early days, but will give you a sense for what it can do. It's rudimentary API doc is here: https://github.com/GlobalNamesArchitecture/gni/wiki/Name-resolution-API-(early-public-ALPHA).

> BTW the HTML output isn't working for that (it doesn't show any metadata for the list datasources example). Aside from that, the documentation seems to be suggesting that you need to present it with a list of name string candidates already, rather than with a URL to a page or a PDF that may contain such name strings. Is that true, or is the documentation behind? Also, there is no good example for how a resolution query would actually look like.

Indeed, the resolver accepts lists of candidate names and cross-references these against known lists. As you suggest, the documentation is a bit behind and we will be working on it over the next few weeks. So, now's the time to make requests.

Cheers,

David P. Shorthouse

Ryan Scherle

May 2, 2012, 11:36:25 AM
to David Shorthouse, Dryad Developers

On May 1, 2012, at 2:31 PM, David Shorthouse wrote:

> Your server returns an HTTP 405 Method Not Allowed for a HEAD request, which I do prior to a GET request in the event there are any 404s or 301/302 redirects. Is there any reason why your server(s) is/are configured to prevent HEAD requests? This would certainly affect your page ranks because many search engine bots do HEAD requests.

Thanks for bringing this to our attention. No, there is no reason to prevent HEAD requests. We will get this corrected soon.

--- Ryan

