[periodicals] Integration of crossref titles

Chris Clarke

May 1, 2009, 2:59:20 PM
to datain...@googlegroups.com
Hello,

I've been working on integrating the 20048 titles covered by CrossRef
into the periodicals dataset. I've also taken the chance to change
some other bits and pieces and I've loaded these into the store for
review via http://periodicals.dataincubator.org:

* The void dataset descriptor has changed. Rather than just one
descriptor, the main descriptor now has 3 subsets, one each for
Highwire, NLM and CrossRef. Each subset's source now points to the
actual file the data was generated from. Each resource has one or more
dct:partOf predicates pointing back to the subset(s) it was generated
from.
* For items from the CrossRef set, and where available, a title-level
DOI is provided as a literal via bibo:doi. This is also turned into a
bibo:uri by prepending the dx.doi.org prefix, and an owl:sameAs is
generated to a resource describing the DOI. See http://periodicals.dataincubator.org/journal/cellmotilityandthecytoskeleton
* Resource URIs for titles from all three sets are now based upon
lower-casing and stripping spaces and special chars from the
periodical's title. This means that data that overlaps between the
sets is naturally merged - e.g. http://periodicals.dataincubator.org/journal/cellmotilityandthecytoskeleton
now benefits from data gleaned from both the NLM (provides the
foaf:primaryTopicOf) and Crossref (provides the DOI and publisher
info) sets.
* Added a new target in the rake file to also upload the void.rdf file
* Added dct:publisher literal where available

On the last point, dct:publisher is modeled and used as a literal.
However, what do people think about minting URIs and creating entities
for the publishers themselves? Under what schema/ontology would these
be modeled?

Have a good weekend,

Chris
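
For illustration, the URI scheme described in the list above can be sketched in Python. This is a minimal sketch under stated assumptions, not the code behind the dataset (the thread only mentions a rake file). The function names, the restriction to ASCII letters and digits, the http scheme on the resolver prefix, and the sample DOI are all assumptions; the journal base URI and the dx.doi.org host come from the message.

```python
import re

# Base URI taken from the message; the resolver scheme (http) is an assumption.
JOURNAL_BASE = "http://periodicals.dataincubator.org/journal/"
DOI_RESOLVER = "http://dx.doi.org/"


def title_key(title: str) -> str:
    """Lower-case the title and drop everything except ASCII letters and digits.

    How the real data handles non-ASCII characters is not stated in the
    thread; dropping them here is an assumption.
    """
    return re.sub(r"[^a-z0-9]", "", title.lower())


def journal_uri(title: str) -> str:
    """Build a journal resource URI from a periodical title."""
    return JOURNAL_BASE + title_key(title)


def doi_uri(doi: str) -> str:
    """Turn a DOI literal into a resolver URI by prepending the prefix."""
    return DOI_RESOLVER + doi.strip()


if __name__ == "__main__":
    print(journal_uri("Cell Motility and the Cytoskeleton"))
    # "10.9999/example" is a made-up DOI used only for illustration.
    print(doi_uri("10.9999/example"))
```

Because case and spacing are discarded, variants such as "Cell Motility and the Cytoskeleton" and "Cell motility and the cytoskeleton" yield the same URI, which is why records from different source sets merge onto one resource.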

Bruce D’Arcus

May 1, 2009, 3:23:43 PM
to Data Incubator


On May 1, 2:59 pm, "Chris Clarke" <chris.cla...@talis.com> wrote:

> I've been working on integrating the 20048 titles covered by CrossRef  
> into the periodicals dataset. I've also taken the chance to change  
> some other bits and pieces and I've loaded these into the store for  
> review via http://periodicals.dataincubator.org:

Cool. Am still looking for some social scientific geography journals
though :-)

> * The void dataset descriptor has changed. Rather than just one  
> descriptor, the main descriptor now has 3 subsets, one each for  
> Highwire, NLM and CrossRef. Each subset's source now points to the  
> actual file the data was generated from. Each resource has one or more  
> dct:partOf predicates pointing back to the subset(s) it was generated  
> from.
> * For items from the CrossRef set, and where available, a title-level  
> DOI is provided as a literal via bibo:doi. This is also turned into a  
> bibo:uri by pre-pending the dx.doi.org prefix and also a owl:sameAs is  
> generated to a resource describing the DOI. See http://periodicals.dataincubator.org/journal/cellmotilityandthecytosk...

Hmm ... am not sure I like this idea. The link, after all, can be
automatically generated from the DOI, and at least my intention behind
bibo:uri was always in essence to refer to the HTTP locator for web
resources. E.g. for a citation processor, it would print that when
present.

A doi resolver link seems a little different.

But I'm open on this; just wanted to throw out my perspective.

> * Resource URIs for titles from all three sets are now based upon  
> lower-casing and stripping spaces and special chars from the  
> periodical's title. This means that data that overlaps between the  
> sets is naturally merged - e.g. http://periodicals.dataincubator.org/journal/cellmotilityandthecytosk...

Ugh; don't like that. Can you not use a standard slugifying algorithm,
where you replace the spaces with hyphens?

>   now benefits from data gleaned from both the NLM (provides the  
> foaf:primaryTopicOf) and Crossref (provides the DOI and publisher  
> info) sets.
> * Added a new target in the rake file to also upload the void.rdf file
> * Added dct:publisher literal where available
>
> On the last point, dct:publisher is modeled and used as a literal,  
> however what do people think about minting URIs and creating entities  
> for the publishers themselves?

Yes, but question: does that belong in a separate, linked, dataset?

> Under what schema/ontology would these  
> be modeled?

I was thinking a publisher is a foaf:Organization, or some subclass of
it.

Bruce

Chris Clarke

May 1, 2009, 4:13:06 PM
to datain...@googlegroups.com, Data Incubator
To be honest, I wasn't sure where to put it; if we can think of
somewhere more suitable, I'm happy to move it.


>> * Resource URIs for titles from all three sets are now based upon
>> lower-casing and stripping spaces and special chars from the
>> periodical's title. This means that data that overlaps between the
>> sets is naturally merged - e.g. http://periodicals.dataincubator.org/journal/cellmotilityandthecytosk...
>
> Ugh; don't like that. Can you not use a standard slugifying algorithm,
> where you replace the spaces with a hyphens?

Sure, not a problem.

>
>
>> now benefits from data gleaned from both the NLM (provides the
>> foaf:primaryTopicOf) and Crossref (provides the DOI and publisher
>> info) sets.
>> * Added a new target in the rake file to also upload the void.rdf
>> file
>> * Added dct:publisher literal where available
>>
>> On the last point, dct:publisher is modeled and used as a literal,
>> however what do people think about minting URIs and creating entities
>> for the publishers themselves?
>
> Yes, but question: does that belong in a separate, linked, dataset?

Maybe

>
>
>> Under what schema/ontology would these
>> be modeled?
>
> I was thinking a publisher is a foaf:Organization, or some subclass of
> it.

Sounds perfect. I think we've misused dct:publisher in Aspire by
linking it directly to a foaf:Org. So, given that it stays as a literal,
how could the foaf:Org be attached to the journal?

>
>
> Bruce

Bruce D’Arcus

May 1, 2009, 5:28:28 PM
to Data Incubator


On May 1, 4:13 pm, "Chris Clarke" <Chris.Cla...@talis.com> wrote:

> Sounds perfect. I think we've misused dct:publisher in Aspire by  
> linking it direct to a foaf:Org, so given that it stays as a literal  
> how could the foaf:Org be attached to the journal?

Just to be clear, this is what I mean:

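@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://ex.net/1> a bibo:Periodical ;
    dct:publisher <http://ex.net/2> .

<http://ex.net/2> a foaf:Organization ;
    foaf:name "Some Publisher" .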

Bruce

Chris Clarke

May 2, 2009, 8:08:31 AM
to datain...@googlegroups.com
> Cool. Am still looking for some social scientific geography journals
> though :-)

I forgot to say that I'd also sorted out the search indexing. Do any of these fit the bill?

http://periodicals.dataincubator.org/~search.html?query=social+geography

Bruce D’Arcus

May 2, 2009, 11:20:47 AM
to Data Incubator


On May 2, 8:08 am, "Chris Clarke" <Chris.Cla...@talis.com> wrote:
> > Cool. Am still looking for some social scientific geography journals
> > though :-)
>
> I forgot to say that I'd also sorted out the search indexing. Do any of these fit the bill?
>
> http://periodicals.dataincubator.org/~search.html?query=social+geography

Awesome!

Another little suggestion on the slugifying algorithm: strip out the
standard stop words ("the", "a", etc.).

That would mean what is now:

<http://periodicals.dataincubator.org/journal/theprofessionalgeographer>

... would be:

<http://periodicals.dataincubator.org/journal/professional-geographer>

Bruce
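
The two suggestions in this thread (hyphens between words, stop words removed) could be combined roughly as follows. This is only a sketch: the function name, the exact stop-word list, and the fallback for titles made up entirely of stop words are assumptions, since the thread names only "the" and "a" explicitly.

```python
import re

# Assumed stop-word list; the thread gives only "the", "a", "etc."
STOP_WORDS = frozenset({"a", "an", "the"})


def slugify(title: str, stop_words=STOP_WORDS) -> str:
    """Lower-case a title, drop stop words, and join the rest with hyphens."""
    # Split on anything that is not an ASCII letter or digit.
    words = re.findall(r"[a-z0-9]+", title.lower())
    kept = [word for word in words if word not in stop_words]
    # If every word was a stop word, keep them all rather than return "".
    return "-".join(kept or words)


if __name__ == "__main__":
    print(slugify("The Professional Geographer"))
```

Case and whitespace variants of a title still collapse to the same slug, so the merging behaviour of the current scheme is kept. One trade-off to be aware of: two titles that differ only by a stop word would also share a slug.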

Bruce D’Arcus

May 2, 2009, 11:31:35 AM
to Data Incubator
A few other comments/questions as I looked at the RDF:

<rdf:Description rdf:about="http://periodicals.dataincubator.org/journal/theprofessionalgeographer">

<bibo:issn>0033-0124</bibo:issn>
<foaf:isPrimaryTopicOf rdf:resource="http://www.ncbi.nlm.nih.gov/sites/entrez?Db=nlmcatalog&amp;doptcmdl=Expanded&amp;cmd=search&amp;Term=100969256%5BNlmId%5D"/>
<foaf:isPrimaryTopicOf rdf:resource="http://locatorplus.gov/cgi-bin/Pwebrecon.cgi?DB=local&amp;v1=1&amp;ti=1,1&amp;Search_Arg=100969256&amp;Search_Code=0359&amp;CNT=20&amp;SID=1"/>
<dct:publisher>Informa UK (Taylor &amp; Francis)</dct:publisher>

Per previous, better not as a literal.

<owl:sameAs rdf:resource="http://periodicals.dataincubator.org/eissn/1467-9272"/>
<owl:sameAs rdf:resource="http://periodicals.dataincubator.org/issn/0033-0124"/>
<dct:partOf rdf:resource="http://periodicals.dataincubator.org/datasets/nlm"/>

<dct:partOf rdf:resource="http://periodicals.dataincubator.org/datasets/crossref"/>

Question: I guess this gets to Leigh's earlier questions; is
dct:partOf the right relation above? Also, isn't it dct:isPartOf?

<bibo:eissn>1467-9272</bibo:eissn>
<dct:title>The Professional Geographer</dct:title>
<dct:title>The Professional geographer </dct:title>
<dct:identifier>100969256</dct:identifier>

Note duplicate dct:title properties (except for the title casing and
whitespace issues). Not sure how to deal with that.

<rdf:type rdf:resource="http://purl.org/ontology/bibo/Journal"/>
<bibo:shortTitle>Prof Geogr</bibo:shortTitle>

</rdf:Description>

Otherwise, looking good!

Bruce
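
On the duplicate dct:title values: one possible approach is to compare titles after collapsing whitespace and ignoring case, and keep a single spelling per group. The sketch below assumes the first spelling seen should win, which is an arbitrary choice, not something decided in this thread; the function name is likewise hypothetical.

```python
def dedupe_titles(titles):
    """Return titles with case and whitespace variants collapsed.

    The first spelling encountered is kept (an assumption; preferring a
    particular source set would be an equally valid rule).
    """
    seen = set()
    unique = []
    for title in titles:
        # Collapse runs of whitespace and trim the ends.
        cleaned = " ".join(title.split())
        key = cleaned.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


if __name__ == "__main__":
    print(dedupe_titles(["The Professional Geographer",
                         "The Professional geographer "]))
```

With the two titles from the record above, only "The Professional Geographer" would remain.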

pitman

May 3, 2009, 4:40:22 PM
to Data Incubator
Chris, re:

> what do people think about minting URIs and creating entities for the publishers themselves?

I strongly support creation and maintenance of a comprehensive open
index of journal/publisher relations. A good start on this, including
a listing of URIs for publishers that may be adequate or could be
extended if necessary, is provided by SHERPA:
http://www.sherpa.ac.uk/romeo/api.html
"The SHERPA/RoMEO Application Programmers' Interface (API) is a
machine-to-machine interface that lets programmers access SHERPA/RoMEO
data from their applications. For instance, you could use the API to
incorporate an automatic look-up of a journal or publisher into your
repository's deposition process." This supports much of what is
needed, though it returns only XML rather than RDF, e.g.

# Publisher's name search - 'Institute of Physics':

http://www.sherpa.ac.uk/romeoapi11.php?pub=institute%20of%20physics&qtype=exact

# Journal title search - 'Journal of Geology' - Single Result:

http://www.sherpa.ac.uk/romeoapi11.php?jtitle=Journal%20of%20Geology

# Journal title search - containing 'Modern Language' - Multiple Results:

http://www.sherpa.ac.uk/romeoapi11.php?jtitle=modern%20language&qtype=contains

# ISSN - International Standard Serial Number:

http://www.sherpa.ac.uk/romeoapi11.php?issn=1444-1586

It would seem a useful exercise to map the SHERPA data in bulk to RDF
and provide access to it on the Talis platform. I for one would much
rather see search results delivered in JSON than XML, and the RDF
should assist integration of multiple data sets.

The issue of maintenance and synchronization of the Talis and SHERPA
data may still be vexing. There is the same issue for every journal
data source, another significant one being JournalSeek
http://journalseek.net/

--Jim
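
As a rough illustration, the example queries above can be generated with Python's standard library. The endpoint and the parameter names (pub, jtitle, issn, qtype) are copied from the URLs in this message; the helper name is an assumption, and whether the endpoint still accepts these parameters is not verified here. The sketch only builds URLs and makes no network request.

```python
from urllib.parse import quote, urlencode

# Endpoint as it appears in the example URLs above.
ROMEO_ENDPOINT = "http://www.sherpa.ac.uk/romeoapi11.php"


def romeo_url(**params: str) -> str:
    """Build a SHERPA/RoMEO query URL from keyword parameters.

    quote_via=quote encodes spaces as %20 (as in the examples above)
    instead of the "+" that urlencode uses by default.
    """
    return ROMEO_ENDPOINT + "?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    print(romeo_url(pub="institute of physics", qtype="exact"))
    print(romeo_url(jtitle="Journal of Geology"))
    print(romeo_url(issn="1444-1586"))
```

A bulk mapping to RDF would then fetch each URL and convert the XML response, which is outside the scope of this sketch.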

Chris Clarke

May 6, 2009, 2:11:47 PM
to datain...@googlegroups.com
Hi Jim,


> The issue of maintenance and synchronization of the Talis and SHERPA
> data may still be vexing. There is the same issue for every journal
> data source, another significant one being JournalSeek
> http://journalseek.net/

As discussed on the phone earlier, I'm interested in incorporating more sources where available - by available I mean in a form where we can bulk download the data and transform it to RDF, and licensed in such a way that we can include the results at a more permanent namespace outside dataincubator once we're happy with the modeling. Do you know if JournalSeek provide a bulk download? I couldn't find anything on their site.

Does anyone have any suggestions of other such sources to include? So far we have PubMed, CrossRef and Highwire titles.

Chris
