Hi Amber,
Thanks, those slides look interesting and I appreciate you opening those issues. I'm looking forward to learning more.
As for Croissant and what's supported, yes, there is support for variable-level metadata. Primarily I've been playing around with a sample file from Stata about cars that we were already using for tests in the main Dataverse code base. Here's how the file and the first two fields look when expressed in Croissant:
"distribution": [
{
"@type": "cr:FileObject",
"@id": "data/stata13-auto.dta",
"name": "stata13-auto.dta",
"encodingFormat": "application/x-stata-13",
"md5": "7b1201ce6b469796837a835377338c5a",
"contentSize": "6443",
"description": "",
"contentUrl": "http://localhost:8080/api/access/datafile/6?format=original"
}
],
"recordSet": [
{
"@type": "cr:RecordSet",
"field": [
{
"@type": "cr:Field",
"name": "make",
"description": "Make and Model",
"dataType": "sc:Text",
"source": {
"@id": "11",
"fileObject": {
"@id": "data/stata13-auto.dta"
}
}
},
{
"@type": "cr:Field",
"name": "price",
"description": "Price",
"dataType": "sc:Integer",
"source": {
"@id": "5",
"fileObject": {
"@id": "data/stata13-auto.dta"
}
}
},
(If that formatting is not so good, you can also see it here:
https://github.com/gdcc/exporter-croissant/blob/croissant-0.1.2/src/test/resources/cars/expected/cars-croissant.json#L82 )
That is, "distribution" is an array that lists files and "recordSet" lists fields (columns). Notice that each field can show "name", "description" and "dataType" (text vs. integer vs. float). I'm not sure if there's a place to put codes or categories (Slava might know), and I couldn't tell from a quick look at the spec:
https://docs.mlcommons.org/croissant/docs/croissant-spec.html
Yes, Croissant is for discovery (you can now find a Croissant icon at Google Dataset Search) but it's also designed to help people feed the data into machine learning models. That's why it's useful to know the types (text vs. integer), for example. I've been meaning to play with some ML tools to better understand how they work.
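To make the "distribution" and "recordSet" structure a bit more concrete, here is a minimal Python sketch that walks a trimmed copy of the excerpt above and lists each field with its file, type and description. It only uses the standard json module; the list_fields helper is a name I made up for illustration, and it is not part of the exporter or of any Croissant library.

```python
import json

# Trimmed copy of the excerpt above, keeping only the keys used below.
CROISSANT = json.loads("""
{
  "distribution": [
    {"@type": "cr:FileObject", "@id": "data/stata13-auto.dta", "name": "stata13-auto.dta"}
  ],
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "field": [
        {"@type": "cr:Field", "name": "make", "description": "Make and Model",
         "dataType": "sc:Text",
         "source": {"@id": "11", "fileObject": {"@id": "data/stata13-auto.dta"}}},
        {"@type": "cr:Field", "name": "price", "description": "Price",
         "dataType": "sc:Integer",
         "source": {"@id": "5", "fileObject": {"@id": "data/stata13-auto.dta"}}}
      ]
    }
  ]
}
""")


def list_fields(croissant):
    """Return (file id, field name, data type, description) for every field.

    Hypothetical helper for illustration only.
    """
    rows = []
    for record_set in croissant.get("recordSet", []):
        for field in record_set.get("field", []):
            # Each field points back to its file through source.fileObject.@id.
            file_id = field.get("source", {}).get("fileObject", {}).get("@id")
            rows.append(
                (file_id, field["name"], field.get("dataType"), field.get("description"))
            )
    return rows


for row in list_fields(CROISSANT):
    print(row)
```

An ML tool would presumably do something similar with "dataType" to decide how to parse each column, though I haven't checked how any particular tool does it.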
There are also some interesting examples in the Croissant spec where fields can be defined semantically, through Wikidata for example:
"Finally, the following example shows an enumeration featuring the url field to describe the semantic meaning of the enumeration values. It is extracted from the Titanic Croissant definition, and is used to define the passenger's gender. Wikidata URLs are used to define both the meaning of the general enumeration (gender - Q48277) as well as the meaning of individual enumeration values (female - Q6581072, male - Q6581097)."
What we want ultimately is for Dataverse to know that gender, for example, is being stored in various columns in various tabular data files across various datasets so that the information can be combined (joined). Right now we just see 0 and 1 in various columns and we don't know what's what. Once we do know, we should describe the field as gender, for example, in Croissant and other formats, like DDI.
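As a rough sketch of what "knowing" could look like, suppose each column carried a Wikidata concept ID. The file names, column names and the annotation structure below are all made up for illustration (Dataverse does not store this today); Q48277 is the gender concept cited in the spec quote above.

```python
from collections import defaultdict

# Hypothetical annotations: (file, column) -> Wikidata concept ID, or None.
# File and column names are invented; Q48277 is "gender" per the spec quote.
ANNOTATIONS = {
    ("survey-a.tab", "sex"): "Q48277",
    ("survey-b.tab", "gender"): "Q48277",
    ("survey-b.tab", "price"): None,  # no semantic annotation yet
}


def columns_by_concept(annotations):
    """Group annotated columns by the concept they represent."""
    grouped = defaultdict(list)
    for (file_name, column), concept in annotations.items():
        if concept is not None:
            grouped[concept].append((file_name, column))
    return dict(grouped)


print(columns_by_concept(ANNOTATIONS))
```

Columns that land under the same concept are the candidates for combining across datasets, even when their names and codings differ.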
I'm sure the Croissant task force would appreciate your perspective. I'd love to get you and Stephanie on the schedule.
Thanks,
Phil
p.s. Here's slide 11 for anyone who's curious:
Recommendations for generalist repos
- Related resources are crucial and need to be emphasized/required
- limit proliferation of duplicate copies of data/code already identified and preserved elsewhere (also: importance of unambiguous citations)
p.p.s. It seems like Stephanie might have given a similar talk at
https://www.youtube.com/watch?v=fPqzScvKT8U . Slides at
https://www.cni.org/topics/digital-curation/future-proofing-research-data-repositories-keeping-up-with-the-machine-learning-artificial-intelligence-revolution . Great stuff. It was interesting to hear her compare the search interfaces of Zenodo and Figshare vs. the UC Irvine ML Repo and OpenML. UC Irvine allows you to search by number of features (variables) and instances (observations). We show these numbers for each tabular file already, but they aren't searchable.