Query dataverse by DOI / extract DOI list?


Reinhard Engels

May 30, 2013, 10:16:45 AM
to dataverse...@googlegroups.com
Hi all,

We're trying to figure out if there's any overlap between articles in our repository (DASH) and datasets in dataverse, so that maybe we could create a widget in DASH that would display a link to any relevant dataverse records.

The most obvious way to try to join them would be using DOI. As a first step, we'd love to get a quick list of DOIs in dataverse, but we're having some trouble figuring out how to extract this info from dataverse. Even just a raw, messy data dump, if it's embedded in citation or some other field, would be helpful just to get a sense if we have any matches at all (if not, there isn't really any point in proceeding yet). 

Anyone know if there's a way to grab this info? I imagine it could be useful for linking records to lots of external data sources beyond DASH.

Thanks in advance for any light you can shed on this,

Reinhard

Philip Durbin

May 30, 2013, 1:53:53 PM
to dataverse...@googlegroups.com
I don't know the answer or if this helps at all but I thought I'd
point out http://projects.iq.harvard.edu/ojs-dvn since it seems
related:

"The Dataverse Network in collaboration with the Public Knowledge
Project (PKP) are on a two-year endeavor to make data sharing and
preservation an intrinsic part of the publication process."
> --
> You received this message because you are subscribed to the Google Groups
> "Dataverse Users Community" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to dataverse-commu...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>



--
Philip Durbin
Software Developer for http://thedata.org
http://www.iq.harvard.edu/people/philip-durbin

Condon, Kevin

May 30, 2013, 2:26:19 PM
to dataverse...@googlegroups.com

Reinhard,

We currently use handles as persistent identifiers for our studies but are looking at allowing DOIs to be used as well. So, an example of a global/persistent identifier for a study currently in production is: hdl:1902.1/00001

Does this help answer your question? Would that information help?

Kevin

--

August Muench

May 30, 2013, 3:23:09 PM
to dataverse...@googlegroups.com
I think that what is being asked for is an ability to pull the DOIs out of the <relPubl> DDI tag matched to all the dataverse study handles so he can cross-reference those DOIs to those in DASH.

We want this too (as a full dump, not as a query) for cross-indexing data studies to SAO-NASA ADS records (though our users understand "bibcodes" more than DOIs). I know that there is a way to do it via either the OAI interface or the Search API. But I have not worked out a script to do it.  My faint memory tells me I can't do a full pull on all Astronomy records via the search API (only a results query).  

It is also a bit confusing, though easy to spot, that the related publication DOI appears multiple times in each record:

<titlStmt>
<IDNo agency="DOI">10.1086/115522</IDNo>
</titlStmt>

and 


but maybe that is how the creator/author/distributor authored the record (filling in both an ID and a URL/URI). That duplication is a bit weird too, btw.
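
For anyone who wants to pull these IDs out of exported DDI, here is a minimal Python sketch. It assumes the related publication DOI sits in an <IDNo agency="DOI"> element as in the example above; the <relPubl>/<citation> wrapper in the sample and the function name extract_dois are illustrative assumptions, not a confirmed description of the export. It matches on the local tag name so it should also cope with a namespaced DDI document, and it drops repeats within a record.

```python
import xml.etree.ElementTree as ET

# Hypothetical DDI fragment modeled on the <IDNo agency="DOI"> example in this
# thread; the surrounding <relPubl>/<citation> elements are an assumption.
SAMPLE_DDI = """
<relPubl>
  <citation>
    <titlStmt>
      <IDNo agency="DOI">10.1086/115522</IDNo>
    </titlStmt>
  </citation>
</relPubl>
"""


def extract_dois(ddi_xml):
    """Return the distinct DOI values found in IDNo elements, in document order."""
    root = ET.fromstring(ddi_xml)
    dois = []
    for node in root.iter():
        # Compare only the local name so a namespace prefix does not matter.
        local_name = node.tag.rsplit("}", 1)[-1]
        if local_name != "IDNo":
            continue
        if node.get("agency", "").upper() != "DOI" or not node.text:
            continue
        doi = node.text.strip()
        if doi and doi not in dois:
            dois.append(doi)
    return dois


if __name__ == "__main__":
    print(extract_dois(SAMPLE_DDI))
```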


- Gus
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-community+unsub...@googlegroups.com.

Reinhard Engels

May 30, 2013, 3:57:33 PM
to dataverse...@googlegroups.com

Thanks, all! Yes, August, that's precisely what we're trying to do: cross-reference works we have in DASH using this existing identifier. I haven't been able to figure out how to do it using the dataverse search API.

Ideally I'd like 2 things:

1. a way to dump all the DOIs in dataverse
2. a way to query by a single DOI to determine if dataverse has any matching records and grab URLs for them.

The goal is to create a JavaScript widget that will check if dataverse has anything for a DOI, and if so, provide a link or links to the user.

#2 would be sufficient for that, but #1 will make it easier to determine if there is any overlap at all between our repositories (if not, the project isn't really worth pursuing just yet).
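
A rough Python sketch of the overlap check in #1, assuming we already have a list of DOIs from DASH and a dump of (handle, DOI) pairs from dataverse. The function names, the prefix list, and the sample pairing of handle and DOI are made up for illustration. DOIs are case-insensitive, so both sides are lowercased and stripped of common prefixes before comparing.

```python
# Prefixes that often appear in front of a bare DOI (an illustrative list).
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Lowercase a DOI and strip one leading prefix such as 'doi:'."""
    doi = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def find_overlap(dash_dois, dataverse_records):
    """Map each DOI present on both sides to the dataverse handles citing it.

    dataverse_records is an iterable of (handle, doi) pairs.
    """
    wanted = {normalize_doi(d) for d in dash_dois}
    matches = {}
    for handle, doi in dataverse_records:
        key = normalize_doi(doi)
        if key in wanted:
            matches.setdefault(key, []).append(handle)
    return matches


if __name__ == "__main__":
    # Hypothetical sample data: the pairing of this handle and DOI is invented.
    dash = ["doi:10.1086/115522", "10.1000/ABC"]
    dataverse = [("hdl:1902.1/00001", "DOI:10.1086/115522")]
    print(find_overlap(dash, dataverse))
```

An empty result would be the signal that there is no overlap yet.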

There isn't really anything DASH-specific about this widget -- might be a neat thing for dataverse to offer in general. We'll certainly be happy to share whatever we come up with.

Reinhard

Philip Durbin

May 30, 2013, 4:03:17 PM
to dataverse...@googlegroups.com
Hmm, I don't see "relPubl" (which Gus mentioned) as a searchable field
in the DVN Data Sharing API:

https://thedata.harvard.edu/dvn/api/metadataSearchFields

A search on "title" works fine:

https://thedata.harvard.edu/dvn/api/metadataSearch/title:democracy

See also http://guides.thedata.org/book/data-sharing-api
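
In case it helps with scripting, here is a small Python sketch that builds search URLs in the same field:term shape as the "title" example above. The helper name metadata_search_url is hypothetical, and percent-encoding the term is my assumption; I have not checked how the API treats encoded characters such as the slash in a DOI.

```python
from urllib.parse import quote

# Base URL taken from the search example in this thread.
SEARCH_BASE = "https://thedata.harvard.edu/dvn/api/metadataSearch/"


def metadata_search_url(field, term):
    """Build a metadataSearch URL of the form <base><field>:<term>.

    The colon is kept literal; everything else unsafe is percent-encoded,
    including "/" (assumption: the server accepts encoded terms).
    """
    return SEARCH_BASE + quote(f"{field}:{term}", safe=":")


if __name__ == "__main__":
    print(metadata_search_url("title", "democracy"))
```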



Durand, Gustavo

May 30, 2013, 6:52:27 PM
to <dataverse-community@googlegroups.com>
There are two related searchable fields in the API called: publicationCitation and publicationReplicationData. 

But when I tried those with a DOI, they didn't return any results. I think this could partly be because the search covers the content of the citation but not the IDs (though I did try some searches that should have returned something and didn't, so I may need to ask Leonid about these fields).

I also went to advanced search through the UI and did two searches:
- Publication, Replication for: "doi" and got 124 studies
- Related Publications:  'doi' and got 46 studies.


Clearly, we should add a searchable field for publication ID (or make it part of a general field for publication, as in advanced search) and/or fix the above fields. If anyone has specific suggestions on how they would like to see this work, please let us know.
______

Regardless, I ran a direct query against the database just now to try to get all current DOIs:

select distinct s.protocol||':'||s.authority||'/'||s.studyid as handle, 
        idtype||':'||idnumber as doi,
        CASE 
                WHEN strpos(lower(text), 'doi:') > 0 
                THEN substr(text,strpos(lower(text), 'doi:'),50)
        END as text
from studyrelpublication, studyversion sv, metadata m, study s
where studyrelpublication.metadata_id = m.id
        and m.id = sv.metadata_id
        and sv.study_id = s.id
        and sv.versionstate='RELEASED'
        and ((idtype='DOI' and idnumber != '') or lower(text) like ('%doi:%'))
order by handle;



This returned 222 results. I've attached the results of the query as a text file. Look for the DOI either in the 2nd column, which means they filled out the ID field, or in the 3rd column, which means it was found somewhere in the text. Note that for the 3rd column I just find "doi:" and then take the next 50 characters, since a DOI can be of variable length:




Let me know if this helps or if I can try to clean it up any more.
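
One possible way to tidy the 50-character text fragments, sketched in Python. The pattern (a "10." prefix with 4 to 9 digits, a slash, then non-whitespace characters) is a common heuristic rather than a formal DOI grammar, and the name clean_doi and the sample fragments are made up for illustration. A DOI longer than the 50-character window would still come out truncated.

```python
import re

# Heuristic DOI shape: "10.", 4-9 digits, "/", then a run of non-whitespace.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")

# Punctuation that commonly trails a DOI inside a citation sentence.
TRAILING_PUNCTUATION = ".,;:)]>\"'"


def clean_doi(fragment):
    """Return the first DOI-like token in fragment, lowercased, or None."""
    if not fragment:
        return None
    match = DOI_PATTERN.search(fragment)
    if match is None:
        return None
    return match.group(0).rstrip(TRAILING_PUNCTUATION).lower()


if __name__ == "__main__":
    # Hypothetical fragment in the shape the query's third column produces.
    print(clean_doi("doi:10.1086/115522. See also the related study"))
```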




Also, Gus, I checked the DDI export code and you are correct, the author populated both - we just spit out the ID in the ID field and URL in the URL field.

Thanks,
Gustavo



doi_query

Reinhard Engels

May 31, 2013, 10:09:48 AM
to dataverse...@googlegroups.com
Beautiful -- thank you Gustavo!
