FW: DDI Summary Statistics CV

Hilde Orten

Oct 29, 2025, 10:50:38 AM
to ddi...@googlegroups.com

For your info

 

From: Sanda Ionescu <san...@umich.edu>
Sent: Wednesday, October 29, 2025 04:12
To: Pascal Heus <pas...@codata.org>
Cc: Hilde Orten <hilde...@sikt.no>; Arofan <il...@yahoo.com>; Simon Hodson <si...@codata.org>
Subject: Re: DDI Summary Statistics CV

 

Hi, Pascal.

 

Nice to hear from you.

I envy you and everyone else who is going to be at Dagstuhl. I miss it.

 

Regarding the vocabularies, the DDI CVs are produced by us (the Alliance) - we only use the CESSDA tool to publish and manage them.

It's the only tool available for this purpose.

We do synchronize the CESSDA site with our own CV page(s) on the Alliance site, but this does not happen regularly, so for the most up-to-date versions you should always check (and use) the CESSDA site.

 

For the Summary statistics list - yes, indeed, this list is limited to summary statistics at the variable level, and yes, this is intentional.

The vocabulary is built for the summary statistics elements in DDI-L and DDI-C (2.5), which only document summary statistics for the variable.

If you find that some terms are needed, but missing at this level, you can ask the DDI_CVG group (I am the lead in this group, so you can email the group or just me) to update the vocabulary to include those terms.

If you plan to do this we would also need some tentative definitions for the terms you would like us to add.

If you send us a request, we would also be happy to invite you to one or two of our meetings in which the proposed update will be discussed.

Just let me/us know.

 

We would not extend our vocabulary to file-level statistics, or frequencies, because the said DDI field(s) do not describe those.

For frequencies we might need another vocabulary. For file-level, I am not sure if there is a DDI field to document them. But again, I think this would be a separate list.

 

Existing vocabularies may be "extended" by using the Other code in the CV and then filling in the actual intended code/term in the OtherType attribute. For details please see the Usage tab in the vocabulary documentation. If it is still unclear, please let me know and I can try to provide details.
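
As a rough illustration of the "Other" mechanism described above, here is a minimal Python sketch. It assumes a small, made-up subset of CV codes (the names in KNOWN_CODES are placeholders, not the published list) and a placeholder "type" key for the field holding the code; only the Other code and the OtherType attribute come from the description above. The Usage tab in the vocabulary documentation remains the authoritative guide.

```python
# Illustrative subset only: these code names are assumptions for the sketch.
# The authoritative list is the published Summary Statistics vocabulary.
KNOWN_CODES = {"Minimum", "Maximum", "Median", "Mode"}


def encode_statistic(term, known_codes=KNOWN_CODES):
    """Return (code, other_type) for a summary statistic term.

    A term found in the vocabulary is used directly. Anything else is
    recorded with the "Other" code, and the intended term is carried
    in OtherType.
    """
    if term in known_codes:
        return term, None
    return "Other", term


def to_attributes(term):
    """Build an attribute mapping for a metadata record.

    The "type" key is a hypothetical placeholder for whichever field
    holds the CV code in the target schema.
    """
    code, other_type = encode_statistic(term)
    attrs = {"type": code}
    if other_type is not None:
        attrs["OtherType"] = other_type
    return attrs


if __name__ == "__main__":
    for term in ("Minimum", "NullCases", "UniqueCases"):
        print(term, "->", to_attributes(term))
```

With this approach a term such as NullCases or UniqueCases (two of the entries Pascal reported missing) can be documented right away, and later switched to a proper code if the vocabulary is updated to include it.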

If you are not using DDI for your metadata, but just the vocabularies, also let me know.

We would be willing to add terms to the existing Summary statistics vocabulary to accommodate your needs.

 

For file-level and variable frequencies, maybe we can work to create DDI vocabularies, or you can use your own. 

 

Please let me know if you have comments, or additional questions.


Best,

Sanda.

 

 

 

Sanda Ionescu

Documentation Specialist Senior

ICPSR

University of Michigan

Ann Arbor, MI

Phone numbers:

cell: 734-474-7605

office (please leave a message): 734-615-2932

 

I've traveled a long way and not all of the roads were paved.

 

 

On Tue, Oct 28, 2025 at 1:16 PM Pascal Heus <pas...@codata.org> wrote:

Sanda, Hilde:

 

Hope all is well with you. I'm reaching out as, through CODATA CDIF, I'm involved with MLCommons Croissant, a metadata specification describing data sets for AI / machine learning purposes. This is backed by Google, the Open Data Institute, and other key stakeholders and rapidly growing in popularity.

 

In that context, I'm looking into adding support for descriptive statistics in Croissant, and there is general agreement that the DDI controlled vocabulary should be used as a reference / starting point. 

 

The CV is quite comprehensive, but I have a few questions and identified some gaps. When you get the chance, would you provide some information on:

- How is this maintained and possibly updated (I understand it is under CESSDA)?

- Are there any recommended extension mechanisms?

- It does not cover statistics at the file level (nRecords, nVariables, , ) or frequencies. Is this intentional (only variable level)? 

 

FYI, some entries that I could not find include things like NullCases, UniqueCases or statistics on text fields (min/max/avg length, min/max/avg word count, lexical diversity, etc.). See this spreadsheet for more details (work in progress).

 

Let me know if you have any questions/suggestions. Happy to schedule a chat if it is easier (we can also discuss in Dagstuhl in a couple of weeks). Many thanks!

 

Best,

*P

 

 

 
