Direct data access


Jamie Monogan

May 23, 2015, 2:05:05 AM
to dataverse...@googlegroups.com
Hi everyone! Is it possible for me to make datasets I post on Dataverse directly accessible through a URL in R? In other words, I want to do something to the effect of this in R, but with a Dataverse-hosted URL:

evolution<-read.dta("http://j.mp/BPchap7",convert.factors=FALSE)

I've written a book about using R and want the code to call stable references to the data when readers try things out. Thank you for any tips you can offer!

Philip Durbin

May 23, 2015, 12:32:32 PM
to dataverse...@googlegroups.com
Hi Jamie!

The short answer is yes. :)

Let's first define our terms, especially dataset vs. file.

A "dataset" within Dataverse is a collection of one or more files *plus* metadata about those files: http://guides.dataverse.org/en/4.0/user/dataset-management.html

Since I see `read.dta` in your example and http://j.mp/BPchap7 redirects to http://spia.uga.edu/faculty_pages/monogan/computing/r/BPchap7.dta it sounds like what you're calling a dataset would be referred to as a *file* in Dataverse. You want to download a Stata file to process in R.

You *can* download files directly like this:

https://apitest.dataverse.org/api/access/datafile/12

But of course when you say "stable" you probably mean that you'd like to reference a DOI. Files in Dataverse do not have DOIs but datasets do. Currently, the best way to look up a dataset via DOI is via SWORD: http://guides.dataverse.org/en/4.0/api/sword.html

From SWORD you can get a list of file IDs, such as "12" in the example above. That "access" API endpoint is documented at http://guides.dataverse.org/en/4.0/api/dataaccess.html

Since you're using R, you'll probably be interested in keeping an eye on the status of Dataverse 4.0 compatibility of this R package: https://github.com/ropensci/dvn/issues/23 . That package is also listed here: http://guides.dataverse.org/en/4.0/api/client-libraries.html

I hope this helps! Please let me know if anything is unclear. There's a lot to unpack. :)

Phil

p.s. I'd love to hear about your book. :)

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To post to this group, send email to dataverse...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/1a71d272-5174-45f0-a0df-da97ea97fccd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




Thomas Leeper

May 24, 2015, 5:38:12 AM
to dataverse...@googlegroups.com
Just following up on Phil's comments. My current "dvn" package is designed to work only with Dataverse < v4.0, so there's no direct data access available there. I'm still working on developing a newer package that will work with Dataverse 4.0 and thus allow direct data access. Hopefully I'll have the time to publish it soon.

If you want to just retrieve a file via direct URL access, it's worth being cautious about using R's native data import functions. Before R 3.2.0, these for the most part did not support HTTPS URL schemes (like those used by Dataverse), or at least behaved inconsistently across platforms. So, you may want to check out "rio" (http://cran.r-project.org/web/packages/rio/index.html) as an easier way to import data in general and from the web in particular. For your particular use case (importing Stata data), it should be particularly helpful because it wraps Hadley Wickham's "haven" package, which supports contemporary Stata file formats (>= v13), which "foreign" does not support.
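
One way to sidestep the HTTPS limitation in the import functions themselves is to download the file to disk first and read it from there. The following is a minimal sketch, assuming `download.file()` on your platform can handle the URL scheme; `fetch_to_temp` is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: download a remote file to a temporary path and
# return that path, so any local-file reader can be used afterwards.
fetch_to_temp <- function(url, ext = ".dta") {
  dest <- tempfile(fileext = ext)
  # mode = "wb" keeps binary formats such as Stata files intact on Windows
  status <- download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed with status ", status)
  dest
}

# Usage (not run here; needs network access and the haven package).
# Substitute the URL of your own Stata file:
# path <- fetch_to_temp("https://apitest.dataverse.org/api/access/datafile/12")
# evolution <- haven::read_dta(path)
```

Separating the download from the read means the reader function only ever sees a local path, so it does not matter whether that function understands HTTPS.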

-Thomas

Mercè Crosas

May 24, 2015, 10:44:10 AM
to dataverse...@googlegroups.com, Thomas Robitaille, Vito D'Orazio
Following up on Phil and Thomas: Tom Robitaille (cc'ed) is working on something similar for Python, to import datasets from Dataverse into Glue, using the search and data APIs. You might want to follow up with him.

Also, Vito D'Orazio (cc'ed) is accessing datasets from Dataverse from TwoRavens (a web interface to run statistical summaries and analysis using Zelig/R). He might have some additional information on how best to access data from Dataverse from R.

Mercè Crosas, Ph.D.
Director of Data Science, IQSS
Harvard University



Jamie Monogan

May 24, 2015, 4:32:19 PM
to dataverse...@googlegroups.com, philip...@harvard.edu
Hi Phil and everyone:

Thank you for all of your help on this. This all looks quite promising. Where I'm getting stuck is in trying to access the API endpoint. I'll admit that I'm trying to use SWORD in Terminal and not getting a lot of results. In the HTML printouts as I try to look things up, I frequently see the message, "Access to the specified resources has been forbidden." So I'm not sure if there's some permission I need to change or what.

Would it help to try a specific file? Here's the TwoRavens link to a file I put up called vaDistMeasures.dta:

https://dataverse.harvard.edu/dataexplore/gui.html?dfId=2461083&key=0c81e12c-3510-4c50-a728-18221ad8ef61

I thought 2461083 might be the code, so I've tried R access code like this, using "rio":

distMeasures<-import('https://apitest.dataverse.org/api/access/datafile/2461083',format='read_dta')

And I've tried various combinations of that. If I wanted to get that file into R from a URL, could you give me a specific example with that? How would I extract the correct API endpoint from the Terminal with SWORD for this file? How would I get the data to import given that API endpoint? For what it's worth, the relevant file is in this dataset: http://hdl.handle.net/1902.1/22006

Thank you for all of your help and time, everyone! And since you asked, the R book is forthcoming with Springer, "Political Analysis Using R." We expect it to be out this fall, and it should be freely available via SpringerLink.

Thank you,
Jamie



Philip Durbin

May 24, 2015, 6:41:59 PM
to dataverse...@googlegroups.com
Right, 2461083 is the id of your file (the database id, technically). It's how your "vaDistMeasures" file is uniquely identified in the system.

You can download it in various ways:

- tab-separated values (TSV) file: https://dataverse.harvard.edu/api/access/datafile/2461083
- original: https://dataverse.harvard.edu/api/access/datafile/2461083?format=original
- RData: https://dataverse.harvard.edu/api/access/datafile/2461083?format=RData
- JSON: https://dataverse.harvard.edu/api/access/datafile/2461083?format=prep
- DDI (XML): https://dataverse.harvard.edu/api/meta/datafile/2461083

I believe there are other possibilities such as subsetting documented at http://guides.dataverse.org/en/4.0/api/dataaccess.html
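
The URL patterns listed above can be assembled programmatically. This is a minimal sketch: `dataverse_file_url` and its argument names are hypothetical, and the function only covers the five variants shown in this message, not subsetting or any other option in the guide.

```r
# Hypothetical helper that assembles the download URLs listed above.
dataverse_file_url <- function(file_id,
                               fmt = c("tabular", "original", "RData", "prep", "ddi"),
                               server = "https://dataverse.harvard.edu") {
  fmt <- match.arg(fmt)
  # Avoid scientific notation (e.g. "1e+05") when the id is passed as a number
  id <- format(file_id, scientific = FALSE, trim = TRUE)
  if (fmt == "ddi") {
    # DDI metadata lives under /api/meta rather than /api/access
    return(paste0(server, "/api/meta/datafile/", id))
  }
  url <- paste0(server, "/api/access/datafile/", id)
  if (fmt != "tabular") {
    # The default (no query string) returns the tab-separated version
    url <- paste0(url, "?format=", fmt)
  }
  url
}

dataverse_file_url(2461083, "original")
# "https://dataverse.harvard.edu/api/access/datafile/2461083?format=original"
```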

I hope this helps!

Phil

p.s. As Mercè mentioned, TwoRavens calls into these APIs. I think it constructs the initial "pebbles" based on summary data (JSON or XML) provided by one API endpoint and uses another endpoint to actually download the file to pass it to Zelig.




Jamie Monogan

May 24, 2015, 10:22:52 PM
to dataverse...@googlegroups.com, philip...@harvard.edu
Thank you, all. I tried this a few ways in rio:

distMeasures<-import('https://dataverse.harvard.edu/api/access/datafile/2461083?format=original',format='read_dta')
distMeasures<-import('https://dataverse.harvard.edu/api/access/datafile/2461083',format='read.table')
distMeasures<-import('https://dataverse.harvard.edu/api/meta/datafile/2461083')

Every time I got something to the effect of:

Error in import("https://dataverse.harvard.edu/api/access/datafile/2461083?format=prep",  :
  Unrecognized file format

Of course, if I just put the URL in my browser, the file is ready to download. Is this the issue Thomas was talking about with R not recognizing https? I upgraded to R 3.2.0, and still got this result.

If anyone else has other ideas and can read these data into R, that would be appreciated. This feels close, but if R doesn't have the ability, then the other option would be to see if my publisher can host stable server space. Thank you all again!

Thomas Leeper

May 25, 2015, 3:49:06 AM
to dataverse...@googlegroups.com, philip...@harvard.edu
Jamie,

Ah, sorry about that. It is a small bug in the most recent version of rio. I've fixed the issue on GitHub and will send that version to CRAN shortly. You can then do:


or


Both will return the data.frame:

> str(distMeasures)
'data.frame':   11 obs. of  9 variables:
 $ cd        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ mrpcd     : atomic  56.8 56.9 50.8 57.3 58.8 ...
  ..- attr(*, "label")= chr "mrp.cd"
 $ mrpcdvar  : atomic  680 668 674 674 691 ...
  ..- attr(*, "label")= chr "mrp.cd.var"
 $ obama08   : num  48.5 51 76 50.5 48.5 ...
 $ dem08     : num  42.4 52.5 100 40 50 ...
 $ unweighted: num  61.8 67.3 47.8 62.5 55.6 ...
 $ weighted  : num  61.5 66.4 48.9 59 51.8 ...
 $ krigecd   : atomic  52.4 50.7 50 52.1 57.3 ...
  ..- attr(*, "label")= chr "krige.cd"
 $ krigecdvar: atomic  738 730 745 736 751 ...
  ..- attr(*, "label")= chr "krige.cd.var"
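
For reference, calls of roughly the following shape import the file with rio. This is a sketch and not necessarily the exact code from this message. It assumes that rio's `import()` accepts a URL together with an explicit `format` argument, which is needed here because the Dataverse URL carries no file extension.

```r
file_url <- "https://dataverse.harvard.edu/api/access/datafile/2461083"
original_url <- paste0(file_url, "?format=original")

# Only attempt the import when rio is installed and the session is
# interactive, so the script can be sourced offline without error.
if (interactive() && requireNamespace("rio", quietly = TRUE)) {
  # Original Stata file; the format is stated explicitly
  distMeasures <- rio::import(original_url, format = "dta")
  # Default tab-separated version of the same file
  distMeasures_tsv <- rio::import(file_url, format = "tsv")
  str(distMeasures)
}
```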

To get the latest version of rio, install from GitHub:

library("devtools")
install_github("leeper/rio")

Best,
-Thomas

Jamie Monogan

May 28, 2015, 1:44:14 AM
to dataverse...@googlegroups.com, philip...@harvard.edu
Success! Thank you everyone! With Thomas's new version of rio, I was able to load the Stata format of these data on a Snow Leopard Mac running R 3.1.3.

Thank you all for all of your generosity with your time, and your extensive help on this. So Thomas and Phil, if my book were your book, would you feel perfectly comfortable providing data access in this way? That is, in the printed example code, when the data are first loaded, would telling readers to load rio and use it to load the Dataverse data be a good permanent solution? If the URLs are going to be permanent and rio will always provide this functionality, then I think I'll probably do things this way. Thank you all so much!

Take care,
Jamie

Philip Durbin

May 29, 2015, 9:59:21 AM
to dataverse...@googlegroups.com
Jamie, it's great news that you got it all working!

Since your audience is R users, ideally you would point them at the
"dataverse" package but it's still under development as Thomas
indicated in this thread and at
https://github.com/ropensci/dvn/issues/23 . I like the idea of the
"dataverse" R package being a bit of a buffer between R users and the
Dataverse APIs. As the APIs change the package can be updated. I'm not
much of an R user myself but from what I understand the older "dvn" R
package that works with DVN 3.x is very nice!

I don't mean to pressure Thomas though so let's figure out what we can
do in the meantime!

One thing you could do to make your book a bit more future proof is to
indicate which Dataverse API version you're using by putting it in the
URL like this:

https://dataverse.harvard.edu/api/v1/access/datafile/2461083

For more about how we version Dataverse APIs please see
http://guides.dataverse.org/en/4.0/api/native-api.html
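
If the unversioned URLs are already scattered through example code, the version segment can be added mechanically. A minimal sketch, assuming the version always sits directly after `/api/` as in the URL above; `pin_api_version` is a hypothetical helper name.

```r
# Hypothetical helper: insert an API version segment after "/api/"
# in an unversioned Dataverse API URL. URLs that already carry a
# version (e.g. "/api/v1/") are returned unchanged.
pin_api_version <- function(url, version = "v1") {
  if (grepl("/api/v[0-9]+/", url)) return(url)
  sub("/api/", paste0("/api/", version, "/"), url, fixed = TRUE)
}

pin_api_version("https://dataverse.harvard.edu/api/access/datafile/2461083")
# "https://dataverse.harvard.edu/api/v1/access/datafile/2461083"
```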

Inevitably, we will be deprecating API versions in the future similar
to how GitHub does it: https://developer.github.com/v3/versions/

Please let me know if any of this is unclear! If you plan to publish
any of this as a blog post or whatever, please send along a link!

Phil



Philip Durbin

Sep 4, 2015, 12:54:12 PM
to dataverse...@googlegroups.com
Hi Jamie and everyone,

I'm reviving this old thread because I finally got around to writing a little script in R to download a file from Dataverse 4: https://github.com/IQSS/dataverse/commit/812424a

I'm not an R programmer but I hope it helps. Thomas, I see what you mean about R's native download.file method[1] being a little difficult to work with. If I leave off "method = 'curl'" I get "unsupported URL scheme" when using HTTPS. So I left it in there at the risk of making this little script less portable. Jamie, I know you got Thomas's rio example working too. Great. More portable, I'm sure.
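
The portability concern can be reduced by choosing the download method at run time instead of hard-coding `method = 'curl'`. The following is a sketch under stated assumptions, not the script from the commit above: `pick_download_method` is a hypothetical helper, and it assumes the "libcurl" method is only usable when `capabilities("libcurl")` reports TRUE.

```r
# Hypothetical helper: choose a download.file() method that can speak HTTPS.
pick_download_method <- function() {
  if (isTRUE(unname(capabilities("libcurl")))) {
    "libcurl"   # R itself was built with libcurl support
  } else if (nzchar(Sys.which("curl"))) {
    "curl"      # an external curl binary was found on the PATH
  } else {
    "auto"      # fall back to R's default choice
  }
}

# Usage (not run here; needs network access):
# download.file("https://dataverse.harvard.edu/api/v1/access/datafile/2461083",
#               destfile = "vaDistMeasures.tab",
#               method = pick_download_method(), mode = "wb")
```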

In a comment I link to these examples I gave earlier in this thread of how to construct the file URL depending on if you want the original file, tab-separated values (TSV), RData, JSON, or DDI (XML): https://groups.google.com/d/msg/dataverse-community/fFrJi7NnBus/LNpfXItbtZYJ

I also linked to this comment about how currently (as of Dataverse 4.1) the most reliable way to get a list of file IDs is via the SWORD API: https://github.com/IQSS/dataverse/issues/1837#issuecomment-121736332

In the example I left "v1" in the URL which is the current and only version but as I indicated earlier in this thread, we reserve the right to someday deprecate API versions: https://groups.google.com/d/msg/dataverse-community/fFrJi7NnBus/4ymZwq2CqhEJ

Related to all of this are a couple of newish issues that people might not be aware of. The first is https://github.com/IQSS/dataverse/issues/2416 entitled "Hovering mouse over Download button does not reveal the URL of the file and the URL does not contain the file name." Jamie, since you have tabular data, from the GUI, you can probably just click "Explore" at http://dx.doi.org/10.7910/DVN/ARKOTI and find the file ID in the URL. Or you can get the file IDs from SWORD, as I mentioned. You *should* be able to use https://cran.r-project.org/web/packages/dvn/ for this even with Dataverse 4.0 since we tried to keep as much backward compatibility with DVN 3.x as possible for the SWORD API.
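
Pulling the file ID out of an "Explore" URL can be scripted. A minimal sketch, assuming the URL carries the ID in a `dfId` query parameter, as the TwoRavens link posted earlier in this thread does; `extract_file_id` is a hypothetical helper name.

```r
# Hypothetical helper: return the value of the dfId query parameter
# as a string, or NA if the URL does not contain one.
extract_file_id <- function(explore_url) {
  m <- regmatches(explore_url,
                  regexpr("(?<=[?&]dfId=)[0-9]+", explore_url, perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

extract_file_id("https://dataverse.harvard.edu/dataexplore/gui.html?dfId=2461083&key=0c81e12c-3510-4c50-a728-18221ad8ef61")
# "2461083"
```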

The other new issue is https://github.com/IQSS/dataverse/issues/2438 about persistent identifiers for files. Right now files are uniquely identified by their database id but on that issue or in another thread we are asking for ideas about what else (DOIs, UUIDs, etc.) we could use: https://groups.google.com/d/msg/dataverse-community/gtz2npccWjU/i7_EVs2LBgAJ

Phew! I hope this helps! Again, in the future we plan to have a nice Dataverse 4-compatible R package for everyone to use, thanks to Mr. Thomas Leeper (no pressure!). The issue to track for now is https://github.com/ropensci/dvn/issues/23 and the package will someday live at https://github.com/IQSS/dataverse-client-r