Setting up a dynamic, self-updating analysis across multiple datasets owned by different people (including some not yet born)


A Cristia

Mar 1, 2019, 5:48:06 AM3/1/19
to Dataverse Users Community
Dear colleagues,

I believe the following should conceptually be possible, but I would welcome advice on two points and examples of other researchers who have set up similar pipelines.

The end goal is to set up a dynamic, self-updating analysis across multiple datasets owned by different people that can persist for a very long time (decades at least). Imagine a project on language experiences using a very specific piece of equipment (call it LENA) that has a highly standardized data output unlikely to change on that time scale.

The procedure would be:
1. Create instructions for people who have relevant datasets to contribute them to Dataverse using (A) a standardized (meta)data format, and (B) some sort of tag or identifier that uniquely marks the dataset as relevant (e.g., a tag "language-experiences-LENA").
2. Write a Shiny app that uses a Dataverse API to list all datasets anywhere on Dataverse that carry the tag "language-experiences-LENA" and integrates them into the analysis.

Can you please confirm that:
- tags like the one in my example can be created (it doesn't seem so -- would an alternative be to have data contributors tag their data as replications of a dataset I could contribute?)
- I can use an API to list ALL datasets rather than only "my" datasets (again, this doesn't seem possible -- and I could not think of an alternative route)

Thank you in advance!

Philip Durbin

Mar 1, 2019, 8:01:02 AM3/1/19
to dataverse...@googlegroups.com
I think what you want is possible, but are you planning to run your own installation of Dataverse, or to make use of one of the (currently three?) installations of Dataverse that host data from anyone? Last I knew, these three installations are:


(Please correct me if I'm wrong about this list! If there's a canonical list out there that I'm not aware of, please let me know!)

I'll assume for a moment that you don't want to run Dataverse yourself. You would instruct your contributors to standardize on a (hopefully unique) tag or identifier.

At the dataset level, you could use the "Keyword" field, which allows arbitrary values, such as "Politics" in the example below:

[screenshot: Search API example for the keyword "Politics"]

At the file level, you could use the "File Tag" field, which allows arbitrary values, such as "disaster" in the example below:

[screenshot: Search API example for the file tag "disaster"]
These examples make use of the Search API: http://guides.dataverse.org/en/4.11/api/search.html
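As a rough sketch of what a client of the Search API could do (the tag, the demo base URL, and the `data.items` / `data.total_count` response layout are my assumptions here, based on the documented API; this is not an official example), listing all datasets matching an agreed-upon tag might look like:

```python
import json
import urllib.parse
import urllib.request

BASE = "https://demo.dataverse.org"  # any Dataverse installation
TAG = "language-experiences-LENA"    # hypothetical agreed-upon tag

def search_url(base, query, dtype, start=0, per_page=100):
    """Build a Search API URL: GET /api/search?q=...&type=...&start=...&per_page=..."""
    params = {"q": query, "type": dtype, "start": start, "per_page": per_page}
    return base + "/api/search?" + urllib.parse.urlencode(params)

def tagged_datasets(base, tag):
    """Yield dataset hits matching the quoted tag, paging through the results."""
    start = 0
    while True:
        with urllib.request.urlopen(search_url(base, f'"{tag}"', "dataset", start)) as r:
            data = json.load(r)["data"]
        yield from data["items"]
        start += len(data["items"])
        if start >= data["total_count"]:
            break
```

A Shiny app could do the same thing with the dataverse R package or plain `httr` calls; the point is only that everything needed is a single paginated GET endpoint.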

The problem with this approach, if you're not in control of the installation, is that other people could come along and use your tag/identifier. I could dream up a great unique Twitter hashtag (#languageExperiencesLENA or whatever), but that doesn't prevent others from starting to use it for a different purpose in the future.

A whole world of possibilities opens up if you run your own installation of Dataverse, such as creating a custom metadata block specific to your data. This task is not for the faint of heart, but it's documented at http://guides.dataverse.org/en/4.11/admin/metadatacustomization.html

I hope this helps!

Phil

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To post to this group, send email to dataverse...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/27ebe448-83d6-4a23-a533-df36087ef4c1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Sebastian Karcher

unread,
Mar 1, 2019, 8:10:25 AM3/1/19
to dataverse...@googlegroups.com
Two things to add to Phil's great answer:
1) Since you mention Shiny apps, just making sure you're aware of the dataverse package for R (https://cran.r-project.org/web/packages/dataverse/index.html), which includes what I believe is full API functionality, certainly allowing any search possible in the API.
2) Short of running your own Dataverse installation, you can run your own _dataverse_ on, e.g., the Harvard Dataverse (which is definitely possible even for the faint of heart ;)), control who can publish there and/or vet datasets before publication, and then restrict your API query to that sub-dataverse. That solves the issue of other people using "your" keyword/tag.

Question for Phil -- any reason fileTag isn't included in the JSON response from the API? That's a bit... confusing.

Hth,
Sebastian




--
Sebastian Karcher, PhD
www.sebastiankarcher.com

Philip Durbin

Mar 1, 2019, 8:34:44 AM3/1/19
to dataverse...@googlegroups.com
Good point about subdataverses, Sebastian. (The term "Dataverse" is a bit overloaded, unfortunately.) One would use the "subtree" feature of the Search API to filter down to a subdataverse, as explained in the API Guide I linked earlier.
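For instance (with `my-lena-collection` as a hypothetical alias for a curated subdataverse), restricting the same search to one subdataverse is just a matter of adding the documented `subtree` parameter:

```python
import urllib.parse

def subtree_search_url(base, query, alias):
    """Search API URL restricted to one (sub)dataverse via its alias."""
    params = {"q": query, "type": "dataset", "subtree": alias}
    return base + "/api/search?" + urllib.parse.urlencode(params)

# Hypothetical curated collection on Harvard Dataverse
url = subtree_search_url("https://dataverse.harvard.edu",
                         '"language-experiences-LENA"', "my-lena-collection")
```

Because publishing into that subdataverse can be vetted, the tag-collision problem above largely disappears.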

Yes, the "dataverse" R package works, but Thomas Leeper is looking for a new maintainer, if there are any R hackers out there who are interested in helping. Please see https://github.com/IQSS/dataverse-client-r/issues/21

We did recently expose some additional fields at the file level from the Search API in Dataverse 4.11 as part of https://github.com/IQSS/dataverse/issues/5339 and more fields could be exposed without much effort. File tags are already indexed in Solr (otherwise the example I gave wouldn't have worked), so it's just a matter of tweaking the Search API output*. I would certainly welcome more issues and pull requests from the community for this! :)

Phil




Philipp at UiT

Mar 1, 2019, 9:16:36 AM3/1/19
to Dataverse Users Community
One thing to add to Phil's and Sebastian's answers:

Since the initial example is about language, you may consider using the Tromsø Repository of Language and Linguistics (TROLLing) to implement your analysis; cf. https://trolling.uit.no.

Best,
Philipp

Jonathan Crabtree

Mar 1, 2019, 1:42:14 PM3/1/19
to dataverse...@googlegroups.com

All great suggestions.

We also have a use case for longitudinal studies across various datasets, so this would be interesting.

If Odum can help, let us know. I think starting with a curated sub-dataverse and custom tags is a good first step.

Jon
