Croissant, a high-level format for machine learning datasets

Philip Durbin

Mar 15, 2024, 11:09:39 AM
to dataverse...@googlegroups.com
Hello Dataverse enthusiasts!

During the community meeting last week in Mexico, a new format called Croissant was mentioned here and there. It's brand new, announced last Wednesday, March 6th.

From the Dataverse perspective, it is the successor to the format we call "Schema.org JSON-LD" used by Google Dataset Search*. (The same people and others are behind it.) It also promises to make datasets more discoverable for machine learning and exposes variable-level data.



I'm actively working on an exporter for the Croissant format for Dataverse. See https://github.com/gdcc/dataverse-exporters/pull/4 and https://github.com/IQSS/dataverse/issues/10341

The Croissant Working Group has weekly meetings and this coming Wednesday, March 20th, Slava and I plan to present what we have so far. As described at https://mlcommons.org/working-groups/data/croissant/ the meetings are held on Wednesdays from 9:05am-10:00am Pacific. To join these meetings (on Google Meet), you can join the mailing list as described at https://github.com/mlcommons/croissant#getting-involved

By the way, there's also a Croissant Editor that reminds me vaguely of the Data Curation Tool. You can play with it at https://huggingface.co/spaces/MLCommons/croissant-editor

I think that's everything. Back to coding! Have a good weekend!

Phil


p.s. Here's the main place we're discussing Croissant on Zulip: https://dataverse.zulipchat.com/#narrow/stream/379673-dev/topic/Croissant/near/426549770

Vyacheslav Tikhonov

Mar 20, 2024, 2:22:03 PM
to Dataverse Users Community
Hi all,

You can find the Croissant Task Force minutes here:

https://docs.google.com/document/d/1C33FAR6s421WV9U50dzlBkVZRTWTlWguc-RoxakOly0/edit 

and presentation slides covering the present and the future of Croissant support in Dataverse: 

https://zenodo.org/records/10843668


Best,

Slava

DANS-KNAW

Philip Durbin

Apr 24, 2024, 9:43:59 AM
to dataverse...@googlegroups.com
We had an accidental Croissant meeting the other day during a pyDataverse* meeting. :)

The first 40 minutes or so are mostly about Croissant, if you're interested: https://drive.google.com/file/d/1BdaTgmhcqnfB4mReD5Ab4BViYYOC55Pu/view?usp=share_link


Meanwhile, I'm still hacking on https://github.com/gdcc/dataverse-exporters/pull/4 and opening issues like these (more to come):


Thanks,

Phil



Philip Durbin

May 30, 2024, 2:21:22 PM
to dataverse...@googlegroups.com
Me again. A couple days ago I published a Croissant exporter that you can start playing around with BUT please keep in mind that it hasn't gone through review or QA yet, so it isn't final.


There's a README at https://github.com/gdcc/exporter-croissant with various details and you're welcome to open an issue in that repo (or reply here) if you find any bugs.


In short, to test it, tell Dataverse where you'd like to put these jar files for external exporters, put the Croissant jar file there, and restart Payara.

Thanks,

Phil

Philip Durbin

Jun 11, 2024, 4:18:44 PM
to dataverse...@googlegroups.com
Hello Dataverse enthusiasts,

I have a couple more quick updates on Croissant.

Tomorrow (Wednesday) at noon Boston time I'm going to be presenting the Dataverse implementation of Croissant to the Croissant Task Force. You're welcome to join. Here are the slides I threw together: https://docs.google.com/presentation/d/17cAn1aAkVQ_EGIiQEPe44BGnvkLPC1AGPCs8UX5n2Mk/edit?usp=sharing

Sorry for the late notice. I just found out myself. As before, the way to join is described at https://github.com/mlcommons/croissant#getting-involved but if you really want the link (on Google Meet) just let me know privately, please.

The other update is that I released a new version, 0.1.2, that fixed a couple bugs that Geoff from Kaggle identified. See https://github.com/gdcc/exporter-croissant/blob/croissant-0.1.2/CHANGELOG.md and download from https://repo1.maven.org/maven2/io/gdcc/export/croissant/0.1.2/croissant-0.1.2.jar
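For anyone who wants to script the install step described earlier in the thread (put the jar in the directory Dataverse uses for external exporters, then restart Payara), here is a minimal Python sketch. It only assumes the Maven Central URL pattern shown in the 0.1.2 link above. The function names and the `exporters_dir` argument are hypothetical, and the sketch does not configure Dataverse or restart Payara for you.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base path taken from the 0.1.2 download link in this thread.
MAVEN_BASE = "https://repo1.maven.org/maven2/io/gdcc/export/croissant"


def croissant_jar_url(version: str) -> str:
    """Build the Maven Central URL for a given exporter version."""
    return f"{MAVEN_BASE}/{version}/croissant-{version}.jar"


def download_exporter(version: str, exporters_dir: str) -> Path:
    """Download the jar into exporters_dir (a hypothetical local path).

    exporters_dir should be whatever directory your Dataverse installation
    is configured to scan for external exporters.
    """
    target_dir = Path(exporters_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"croissant-{version}.jar"
    urlretrieve(croissant_jar_url(version), str(target))
    return target
```

Calling `download_exporter("0.1.2", "/path/to/exporters")` (with your own directory in place of the placeholder path) would fetch the jar. Payara still needs to be restarted afterwards.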

As before, these Croissant jar files have not gone through review or QA yet so please consider them experimental, but I'm happy to listen to any feedback you have.

Thanks,

Phil

Amber Leahey

Jun 13, 2024, 10:22:57 AM
to Dataverse Users Community
Hi Phil, 
This is great to see. We also learned about machine learning datasets at IASSIST a few weeks ago. Many of these kinds of datasets are ending up in Dataverse, and they aren't that big, actually (in some cases!). See slide 11 of Stephanie Labou's presentation, "Metadata Ahoy! Charting a reusable path for machine learning" (cern.ch).

Many data repositories can do more to support deposit of this kind of data. For example, we could improve the "Related Dataset" field in the metadata deposit form (Citation Block) so that it is structured like "Related Publication" and added to the first level of deposit. That would support machine learning research use cases, but also other kinds of secondary use, through structured fields including PIDs, URLs, and defined relations for citing the data sources used. A Machine Learning metadata block could also be added to support descriptions of "number of instances", "task type", and other technical details (according to Stephanie's presentation).

Also, they created a neat tool called PyCurator (you may have heard of it?) that automates routine metadata API calls to get metadata from existing repositories (including Dataverses, and others!). See "Welcome to PyCurator!" in the PyCurator 0.1.2 documentation.

Regarding Croissant exporter: this is so cool! thank you for all your work on this. 
I have a question about how it handles tabular dataset metadata for variables, codes, categories, etc. Does it handle this level of metadata, or is this still at the level of citation/discovery? Do you see anything in Croissant that can be helpful for thinking about how to build better metadata fields for describing machine learning datasets? Stephanie also mentions that repositories are generally better suited for publishing these kinds of datasets (since we support licensing, for example) than some of the existing machine learning community sites, so there is a lot of possibility for improvement in Dataverse.
Best, 
Amber

Philip Durbin

Jun 13, 2024, 4:17:41 PM
to dataverse...@googlegroups.com
Hi Amber,

Thanks, those slides look interesting and I appreciate you opening those issues. I'm looking forward to learning more.

As for Croissant and what's supported, yes, there is support for variable-level metadata. Primarily I've been playing around with a sample file from Stata about cars that we were already using for tests in the main Dataverse code base. Here's how the file and the first two fields look when expressed in Croissant:

"distribution": [
    {
        "@type": "cr:FileObject",
        "@id": "data/stata13-auto.dta",
        "name": "stata13-auto.dta",
        "encodingFormat": "application/x-stata-13",
        "md5": "7b1201ce6b469796837a835377338c5a",
        "contentSize": "6443",
        "description": "",
        "contentUrl": "http://localhost:8080/api/access/datafile/6?format=original"
    }
],
"recordSet": [
    {
        "@type": "cr:RecordSet",
        "field": [
            {
                "@type": "cr:Field",
                "name": "make",
                "description": "Make and Model",
                "dataType": "sc:Text",
                "source": {
                    "@id": "11",
                    "fileObject": {
                        "@id": "data/stata13-auto.dta"
                    }
                }
            },
            {
                "@type": "cr:Field",
                "name": "price",
                "description": "Price",
                "dataType": "sc:Integer",
                "source": {
                    "@id": "5",
                    "fileObject": {
                        "@id": "data/stata13-auto.dta"
                    }
                }
            },

(If that formatting is not so good, you can also see it here: https://github.com/gdcc/exporter-croissant/blob/croissant-0.1.2/src/test/resources/cars/expected/cars-croissant.json#L82 )

That is, "distribution" is an array that lists files and "recordSet" lists fields (columns). Notice that each field can show "name", "description" and "dataType" (text vs. integer vs. float). I'm not sure if there's a place to put codes or categories (Slava might know), and I couldn't tell from a quick look at the spec: https://docs.mlcommons.org/croissant/docs/croissant-spec.html
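To make the relationship between "distribution" and "recordSet" concrete, here is a small Python sketch that walks the excerpt above and prints each field with its type and the file it comes from. The excerpt is trimmed, and its closing brackets were added so that it parses as standalone JSON. The real export has more keys and more fields. The helper name `list_fields` is made up for this illustration.

```python
import json

# Trimmed copy of the excerpt above, closed off so it is valid JSON.
CROISSANT_EXCERPT = """
{
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "data/stata13-auto.dta",
      "name": "stata13-auto.dta",
      "encodingFormat": "application/x-stata-13"
    }
  ],
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "field": [
        {
          "@type": "cr:Field",
          "name": "make",
          "description": "Make and Model",
          "dataType": "sc:Text",
          "source": {"@id": "11", "fileObject": {"@id": "data/stata13-auto.dta"}}
        },
        {
          "@type": "cr:Field",
          "name": "price",
          "description": "Price",
          "dataType": "sc:Integer",
          "source": {"@id": "5", "fileObject": {"@id": "data/stata13-auto.dta"}}
        }
      ]
    }
  ]
}
"""


def list_fields(croissant: dict) -> list[tuple[str, str, str]]:
    """Return (file name, field name, dataType) for every field."""
    # Map each file's @id to its human-readable name.
    files = {f["@id"]: f["name"] for f in croissant.get("distribution", [])}
    rows = []
    for record_set in croissant.get("recordSet", []):
        for field in record_set.get("field", []):
            # Each field points back to its file through source.fileObject.
            file_id = field["source"]["fileObject"]["@id"]
            rows.append((files.get(file_id, file_id), field["name"], field["dataType"]))
    return rows


doc = json.loads(CROISSANT_EXCERPT)
for file_name, field_name, data_type in list_fields(doc):
    print(f"{file_name}: {field_name} ({data_type})")
```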

Yes, Croissant is for discovery (you can now find a Croissant icon at Google Dataset Search) but it's also designed to help people feed the data into machine learning models. That's why it's useful to know the types (text vs integer) for example. I've been meaning to play with some ML tools to better understand how they work.
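As a rough illustration of why the types matter, the sketch below converts a row of raw strings into typed Python values based on each field's "dataType". It is not how any particular ML tool consumes Croissant. The mapping table, the function name `coerce_row`, and the sample row values are assumptions made for the example.

```python
# Assumed mapping from Croissant dataType strings to Python converters.
CONVERTERS = {"sc:Text": str, "sc:Integer": int, "sc:Float": float}


def coerce_row(raw_row: dict, field_types: dict) -> dict:
    """Convert raw string values using the declared type of each field."""
    typed = {}
    for name, value in raw_row.items():
        # Fall back to str when a field has no declared or known type.
        convert = CONVERTERS.get(field_types.get(name), str)
        typed[name] = convert(value)
    return typed


# Field types as declared in the excerpt above; the row values are illustrative.
field_types = {"make": "sc:Text", "price": "sc:Integer"}
raw_row = {"make": "AMC Concord", "price": "4099"}
print(coerce_row(raw_row, field_types))
```

Without the declared types, "price" would stay a string and would need guessing before it could be used as a numeric feature.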

There are also some interesting examples in the Croissant spec where fields can be defined semantically, through Wikidata for example:

"Finally, the following example shows an enumeration featuring the url field to describe the semantic meaning of the enumeration values. It is extracted from the Titanic Croissant definition, and is used to define the passenger's gender. Wikidata URLs are used to define both the meaning of the general enumeration (gender - Q48277) as well as the meaning of individual enumeration values (female - Q6581072, male - Q6581097)."

What we want ultimately is for Dataverse to know that gender, for example, is being stored in various columns in various tabular data files across various datasets so that the information can be combined (joined). Right now we just see 0 and 1 in various columns and we don't know what's what. Once we do know, we should describe the field as gender, for example, in Croissant and other formats, like DDI.
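Here is a minimal sketch of that idea in Python: once each column carries a concept URL, columns from different files and datasets can be grouped by concept. The list-of-dicts layout, the dataset, file and column names, and the second concept URL are all hypothetical. This is not Croissant's or Dataverse's actual representation, only the Wikidata gender identifier (Q48277) comes from the spec text quoted above.

```python
from collections import defaultdict

GENDER = "https://www.wikidata.org/wiki/Q48277"  # "gender", per the spec quote
OTHER = "https://example.org/concept/age"  # placeholder concept URL

# Hypothetical column descriptions gathered from several datasets.
columns = [
    {"dataset": "dataset-A", "file": "survey.tab", "column": "sex", "concept": GENDER},
    {"dataset": "dataset-A", "file": "survey.tab", "column": "age", "concept": OTHER},
    {"dataset": "dataset-B", "file": "patients.tab", "column": "gender_code", "concept": GENDER},
]


def group_by_concept(cols: list[dict]) -> dict[str, list[tuple[str, str, str]]]:
    """Group (dataset, file, column) triples by their shared concept URL."""
    grouped = defaultdict(list)
    for col in cols:
        grouped[col["concept"]].append((col["dataset"], col["file"], col["column"]))
    return dict(grouped)


grouped = group_by_concept(columns)
for dataset, file_name, column in grouped[GENDER]:
    print(f"{dataset}/{file_name}: column '{column}' stores gender")
```

Columns that land in the same group are candidates for combining, even when their names ("sex" vs. "gender_code") differ.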

I'm sure the Croissant task force would appreciate your perspective. I'd love to get you and Stephanie on the schedule.

Thanks,

Phil

p.s. Here's slide 11 for anyone who's curious:

Recommendations for generalist repos
- Related resources are crucial and need to be emphasized/required
  - limit proliferation of duplicate copies of data/code already identified and preserved elsewhere (also: importance of unambiguous citations)

p.p.s. It seems like Stephanie might have given a similar talk at https://www.youtube.com/watch?v=fPqzScvKT8U . Slides at https://www.cni.org/topics/digital-curation/future-proofing-research-data-repositories-keeping-up-with-the-machine-learning-artificial-intelligence-revolution . Great stuff. It was interesting to hear her compare the search interfaces of Zenodo and Figshare vs. UC Irvine ML Repo and OpenML. UC Irvine allows you to search by number of features (variables) and instances (observations). We show these numbers for each tabular file already, but they aren't searchable.
