Creating a better understanding of DV's Dates Fields


Michael Steeleworthy

Aug 29, 2018, 2:06:36 PM
to Dataverse Users Community
Hi everyone,

As some of you are aware, in Canada, there is a "Dataverse North Working Group" (DVNWG), representing several instances of DV across the country. Within DVNWG is a metadata working group, where right now we're trying to develop some DV best practices for our users. We are considering both novice and experienced users of all different 'types' (e.g., researcher, archivist, DevOps, downloader, librarian, etc.).

Our problem:
Like many other DV users, we are having difficulty understanding the different date fields in DV and then recommending proper use. This is a tough problem since the fields can represent different concepts or might be used in different ways from one discipline to another; i.e., recommending a best practice for a trans-disciplinary platform is not always an easy thing to do! Our goal is to provide clear, plain-language advice to users, but we're being held up on dates, especially regarding the Production Date, Date of Deposit and Date of Distribution.

We have summarized the fields in the table at this link, with some context back to our conference calls: 

Our Ask:
We'd like to get the community's input on their understanding and recommended use of DV's date fields. What do these fields mean to you and your users? How do you interpret them or recommend their use? This is working with the understanding that DV can be generalist in nature and that use-cases of DV itself (instance by instance) can differ widely as well.


How to contribute:
If you're up for it, please feel free to add notes to the shared and open Google Doc, or respond to the thread on the mailing list. On our end, we will respond with some kind of crowdsourced summary between the channels.


Thanks, from the entire group!


--
Michael Steeleworthy
Coordinator, Research Data Services
Wilfrid Laurier University Library
Waterloo, Ontario, Canada

julian...@g.harvard.edu

Aug 30, 2018, 11:47:55 AM
to Dataverse Users Community
Hi Michael and DVNWG,

Thank you so much for this effort to bring clarity! The curation team for Harvard's Dataverse installation will share how we interpret those fields as soon as possible, and take a look at how depositors actually use those fields (Harvard Dataverse is largely self-curated).

Very much looking forward to learning what others think!


Julian Gautier
Product Research Specialist, IQSS

Sherry Lake

Aug 30, 2018, 3:46:55 PM
to Dataverse Users Community
Be prepared for a history lesson ;-).

But first a comment on users and what they know and do: I have no idea what users think, just that they are confused & have different opinions. UVa has tried to make things clearer, by changing a field label and some popup text.

UVa has changed the label for “Production Date" to “Data Creation Date”. You can also change the text in the hover definition if you feel this would help your users better understand the context of the field.  

So now on to DDI history - Maybe history will help???
As I understand it, Dataverse metadata fields are based on DDI metadata, so I went back (yes, way back to 2001) to my paper files and found the documentation for DDI V2.1. Reading the definitions of the date fields in the original context might help “us” understand.

I have my paper copy, but you can see the documentation here on the wayback machine:

There is no publication date in DDI, so I assume that for Dataverse it is the date that the dataverse dataset was published in dataverse?

1st define the “Producer”: The producer of the data collection is the person or organization with the financial or administrative responsibility for the physical processes whereby the data collection was brought into existence.

So the Production Date is the date that the collection was produced “brought into existence”.
To me it's the date when all the stuff ready to go into the dataset was complete, but “production” in a digital world isn’t exactly like that of an analog world, which is what the DDI was originally created to describe - DDI was the machine-readable version of the paper codebooks of ICPSR.

Next definitions: 
“Distributor” (again referring to the analog version of things): The organization designated by the author or producer to generate copies of a particular data collection including any necessary editions or revisions

“Depositor” person (or institution) who provided this data collection to the archive storing it.

Date of Deposit: date that the data collection was deposited with the archive that originally received it.
so would make sense with dataverse that this is the date the dataset was created in dataverse - the Date that the Dataset was deposited into the repository. (In dataverse, it is auto populated with CURRENT day when a dataset is created)

Date of Distribution: date that the data collection was released for distribution.
which to me sounds very much like a “publication date” - but I don’t think it gets a value unless entered in by the user.

Date of collection is in the Study Scope section:

Date of Collection
Contains the date(s) when the data were collected. 

And there is another date - Time Period Covered The time period to which the data refer. This item reflects the time period covered by the data, not the dates of coding or making documents machine-readable or the dates the data were collected.
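
For anyone who would rather see these definitions side by side, here is a minimal Python sketch that collects them into one lookup table. The labels and definitions paraphrase the DDI text quoted above. The DDI element names (prodDate, depDate, distDate, collDate, timePrd) are assumptions added for illustration and should be checked against the DDI documentation before relying on them.

```python
# Date-related definitions, paraphrased from the DDI text quoted above.
# The "ddi_element" values are assumptions, not taken from this thread.
DDI_DATE_FIELDS = {
    "Production Date": {
        "ddi_element": "prodDate",
        "definition": "Date the data collection was produced (brought into existence).",
    },
    "Date of Deposit": {
        "ddi_element": "depDate",
        "definition": "Date the data collection was deposited with the archive "
                      "that originally received it.",
    },
    "Date of Distribution": {
        "ddi_element": "distDate",
        "definition": "Date the data collection was released for distribution.",
    },
    "Date of Collection": {
        "ddi_element": "collDate",
        "definition": "Date(s) when the data were collected.",
    },
    "Time Period Covered": {
        "ddi_element": "timePrd",
        "definition": "Time period to which the data refer, not when they were "
                      "collected or coded.",
    },
}


def describe(label: str) -> str:
    """Return a one-line description for a date field label."""
    entry = DDI_DATE_FIELDS[label]
    return f"{label} ({entry['ddi_element']}): {entry['definition']}"


print(describe("Date of Deposit"))
```

A table like this could also hold a locally preferred label (such as UVa's “Data Creation Date”) next to each DDI definition.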

Not sure if this makes things any clearer. This history lesson tells me that there are some dates that just don't apply to born digital items. Maybe if these fields are never used, dataverse can remove them from the metadata pages?

Michael hope this helps??

--

Sherry Lake | Scholarly Repository Librarian | University of Virginia Library | shL...@virginia.edu | 434.924.6730 | @shLakeUVA | Alderman Library, 160 N. McCormick Road, Charlottesville, VA 22903 | Alderman 563 | LinkedIn Profile | orcid.org/0000-0002-5660-2970  | “Keeper of the Dataverse" 

Michael Steeleworthy

Aug 31, 2018, 9:41:09 AM
to Dataverse Users Community
Thanks,  Sherry.   While our discussions have been grounded in DDI (as well as mapping onto DataCite, if I recall), I don't think we've considered how the fields and their definitions might have been considered in their sometimes-original analogue state.  It does add some colour to the conversation on Production Date and Distribution Date, which are the pesky fields for us at the moment.

Regarding users, what users think, and UX, this is the million dollar question for us, as well as for everyone else, I bet. The tough thing for us is that we're trying to build out some "leading practices" for a number of different DV instances in Canada, their admins and librarians, and their users. Some instances are fully self-serve, some are fully mediated, and some are a mixture of the two. So, we are being careful not to over-interpret or prescribe too much, in part because we don't want to deviate from the norm but also because local considerations may affect our recommendations and documentation.

It is a tall order, but we think we can meet our goal on this one, which is to build upon and contextualize the existing DV documentation and knowledge base for its users.

Thanks again!

Michael.

Pete Meyer

Aug 31, 2018, 10:40:26 AM
to Dataverse Users Community
For another perspective, most of the datasets I deal with are ones where a researcher did an experiment, and the dataset is the output of the detector during that experiment (producing a set of files).  My interpretation is that "Production Date" is the date those files came off the detector - this seemed to have significant overlap with "Dates of Collection", so for these cases I'd consider the second redundant.  I'd consider "Date of Deposit" to be either the date that the dataset was registered with the repository, or when the checksums for the data files were verified internally by the repository (in current implementation I believe it's the first, but the second might be closer to the meaning of "deposit").  I consider "Publication Date" to be when a dataset was made available through the repository, with "Date of Distribution" redundant for these cases.

Extending slightly to a case where a researcher produced another dataset based on the analysis of the primary dataset (detector output), in that case I'd consider "Dates of Collection" inapplicable, and "Production Date" to be the date the analysis programs finished (and treating the other dates you mentioned the same).

The assumption underlying these perspectives is that a dataset doesn't contain both input and output files (the input is either a detector / measurement process, or another dataset) - this assumption is far from universal though, which is something to keep in mind when trying to generalize.

Best,
Pete

Philipp at UiT

Sep 4, 2018, 3:41:27 AM
to Dataverse Users Community
A note and question about the producer field:

In our Dataverse, this field is usually pre-filled with the name of the institution the author works for / is associated with.
I just had a look at the CESSDA Expert Tour Guide on Data Management. In the section about DMPs, they use the terms

Data producer:
- Which organisation has the administrative responsibility for the data?

Data owner:
- Which organisation(s) own(s) the data?
- If several organisations are involved, which organisation owns what data?

Do we need a field for data owner in Dataverse?

Best,
Philipp

Michael Steeleworthy

Sep 5, 2018, 2:30:31 PM
to Dataverse Users Community
Thanks, Pete. Your example is similar to one of the use cases that come up in our conversations. It doesn't fully solve the issue at hand, but it does remind me how difficult a task it can be to normalize these terms, in plain language, across disciplines/subject domains.

julian...@g.harvard.edu

Sep 11, 2018, 9:06:51 PM
to dataverse...@googlegroups.com

Hi everyone,

 

I've been looking at what depositors put in the date metadata fields in Harvard Dataverse and a few other Dataverse installations, and wanted to share what I've learned so far and some thoughts:

 

Deposit date:

  • This field is pre-populated with the date that the dataset was first created in the Dataverse repository (when someone clicks "New dataset"). If someone creates a dataset and doesn't change the pre-populated date, then visits the repository a week later to publish the dataset, the deposit date and publication date are a week apart.
    • I’m not sure if we or DDI intended the field to be equivalent to the date that the first draft was saved, or the date that the dataset is publicly available. In Harvard Dataverse, most depositors publish datasets where the deposit date and the publication date contain different dates.
  • The DDI definition of deposit date is the date of deposit into the original repository, so if Dataverse isn’t the original repository, depositors are able to change the date. A lot of datasets in ADA Dataverse use the deposit date field this way, as well as some dataverses in Harvard Dataverse (e.g. Murray, UCLA Social Science Archive).
    • Do depositors who use the deposit date field this way mean the date when the data was first added/stored someplace or the date that it was available for others to access? Is this a distinction they're thinking about or care about?
    • Discussion with the DDI's metadata group about its DDI definition is happening in this JIRA ticket.
  • Some dataverses in Harvard Dataverse change this date to the latest published version (e.g. Antislavery Petition dataverse). That dataverse's admin wrote that doing it this way "seems useful because if I'm browsing dataverse contents I might want to know when the data I'm looking at was written to Dataverse. For example, if I see that the deposit date for a particular dataset is more recent than the last time I was browsing, I might pay it special [attention]." Steve McEachern also thinks a dataset's deposit date should be tied to the dataset version. I should note that Dataverse records and displays the date that each version was published, as well as the version number.
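
To make the first point concrete, here is a small Python sketch of the gap between the pre-populated deposit date and the publication date. The function name and the example dates are hypothetical, and nothing here calls Dataverse itself.

```python
from datetime import date


def days_between_deposit_and_publication(deposit: date, publication: date) -> int:
    """Number of days from the (pre-populated) deposit date to the publication date."""
    return (publication - deposit).days


# Hypothetical example: the draft is created on one day and published a week later.
deposit_date = date(2018, 9, 4)       # day "New dataset" was clicked
publication_date = date(2018, 9, 11)  # day the dataset was published

print(days_between_deposit_and_publication(deposit_date, publication_date))  # 7
```

A result of 0 would mean the dataset was created and published on the same day. A negative result would suggest the depositor edited the deposit date by hand.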

Distribution date:

  • Some depositors are using this field for the date that embargoed data is released (when all or some files will be unrestricted and available to everyone). I think this includes DataverseNO and at least one dataverse in Harvard Dataverse. (Nice coincidence that embargo is on the roadmap and being designed. https://github.com/IQSS/dataverse/issues/4052)
  • I think distribution date should not be used for an embargo release date, unless it's used only that way.
  • Regardless, does the "date that the work was made available for distribution/presentation" mean:
    • the date when the data was first distributed anywhere? (which I think could be the same as the deposit date)
    • or the date when the data was first distributed in the current (Dataverse) repository? (which I think would be the same as the publication date)


Publication date:

  • DataCite’s description of the property "publicationyear" includes: “If an embargo period has been in effect, use the date when the embargo period ends.” This sounds like once Dataverse has an embargo feature, the publicationyear that Dataverse sends to DataCite should be the year of the embargo release date (and not the year of the publication date that Dataverse sends to DataCite now, which is the date that the dataset's first version was released).
  • But if the embargo release date is the day when the files become unrestricted, then why doesn’t Dataverse do this now? That is, why doesn’t Dataverse use the year in which the files become unrestricted as the publicationyear? It’s because when depositors hit publish, Dataverse has to send DataCite a publicationyear, and depositors have no way to indicate when the files will become unrestricted (until there’s an embargo feature).
  • For datasets where an embargo is set, if Dataverse sends the embargo release date to DataCite as the publicationyear, then in some cases the publicationyear that DataCite has will be different than the publication year in Dataverse’s dataset citation… unless the publication date in the citation changes to the embargo release date.
  • Trying to find what others have written about what “publish” really means. Could also reach out to DataCite’s metadata group.
  • Update (2018-09-18): DataCite recommends using the "date" property with the dateType "available" to indicate the end of an embargo period. Also, none of the datasets on DataCite Search have publication years in the future (2018 is the latest year), which makes it less likely that other repositories ("Data Centers") are entering the embargo release years as the publication years (otherwise there would be future years... or DataCite is hiding anything greater than 2018 in the facets). So I think the publication date should be the date that the dataset's first version is published (and embargo start and end dates can be set through the embargo feature and sent to DataCite using their schema's date attributes).
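
As an illustration of the update above, this Python sketch builds the date-related part of a DataCite-style record: the publication year comes from the first published version, and the embargo end is carried as a date with dateType "Available". The function name, the dictionary layout and the example dates are assumptions for illustration, not what Dataverse actually sends to DataCite.

```python
from datetime import date
from typing import Optional


def datacite_date_metadata(first_published: date,
                           embargo_end: Optional[date] = None) -> dict:
    """Build a DataCite-style fragment (hypothetical helper, illustrative layout)."""
    record = {
        # Year in which the dataset's first version was published.
        "publicationYear": first_published.year,
        "dates": [],
    }
    if embargo_end is not None:
        # The end of the embargo is expressed as an "Available" date,
        # leaving the publication year untouched.
        record["dates"].append(
            {"date": embargo_end.isoformat(), "dateType": "Available"}
        )
    return record


# Hypothetical dataset: published in 2018, files embargoed until 2020-01-01.
print(datacite_date_metadata(date(2018, 9, 18), date(2020, 1, 1)))
```

With this split, the year in the dataset citation and the year held by DataCite stay the same even when files are released later.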


Production date:

  • I'd interpret this as the date when the data was "finalized" and ready to be analyzed or distributed, as the DVNWG wrote. I don't see any problem with research archives interpreting this as "the date that the data was given to the archive because that’s the closest approximation we can make if no other input on this timestamp was offered to us."
  • So I'd assume that the production date should always come BEFORE the distribution date, but there are hundreds of datasets in Harvard Dataverse whose production dates come after. Trying to find out why.
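
The ordering check described above could be sketched like this in Python. The record layout (dictionaries with "productionDate" and "distributionDate" keys holding full ISO dates) is an assumption for illustration, and the sample records are made up. Real metadata may hold partial dates such as a bare year, which this sketch does not handle.

```python
from datetime import date


def production_after_distribution(records):
    """Return ids of records whose production date is later than the distribution date."""
    flagged = []
    for rec in records:
        prod = rec.get("productionDate")
        dist = rec.get("distributionDate")
        if not prod or not dist:
            continue  # nothing to compare when either date is missing
        if date.fromisoformat(prod) > date.fromisoformat(dist):
            flagged.append(rec["id"])
    return flagged


# Made-up sample records, for illustration only.
sample = [
    {"id": "A", "productionDate": "2015-03-01", "distributionDate": "2016-01-15"},
    {"id": "B", "productionDate": "2017-06-30", "distributionDate": "2017-01-01"},
    {"id": "C", "productionDate": "2018-02-02"},
]

print(production_after_distribution(sample))  # ['B']
```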

 

Here's my running list of descriptions of dates I’ve seen used so far, trying to be as distinct and jargon-free as possible:

  • Dates that the data refers to
  • Dates in which actions were taken to create the deposited dataset (could be a single date if the data was collected in one day; the DDI element collDate has attributes for single dates, date ranges, and “cycles”, not sure how “cycles” would work)
  • Date when data is "finalized" and ready for distribution (could happen to be the same date as “date of collection” in cases such as Pete Meyer’s, where I think the data is created within one day and either doesn't really go through a "finalization" process or that process also happens within the same day that it was created)
  • Date when data was first deposited anywhere, but not available for distribution or presentation
  • Date when dataset was first published/distributed anywhere.
  • Date when data was first deposited in current repository (e.g. Dataverse), but not available for distribution or presentation
  • Date when dataset was first published/distributed in current repository (e.g. Dataverse)
  • Dates when different versions have been published/distributed (in Dataverse this is system generated of course)
  • Date when files are no longer restricted (embargo release date)


I’m testing an online survey and hope to use it soon so we can learn more from curators, and maybe as a way to test changes that we hope will clarify the fields. Looking forward to hearing everyone’s thoughts.

 

Julian

Amber Leahey

Sep 13, 2018, 12:45:52 PM
to Dataverse Users Community
Many many thanks Julian, this is really helpful!  I'm looking forward to the survey and responses from folks. Your input and running list of possible dates used in interpreting these fields is great. 



Philipp at UiT

Sep 18, 2018, 6:56:03 AM
to Dataverse Users Community
There is another thing to bear in mind: The different date fields are specified at dataset level. However, in some cases a dataset may consist of both locked/embargoed files and unlocked/open files. This is why we earlier have suggested the new embargo feature to be available at file level.

Philipp

julian...@g.harvard.edu

Sep 27, 2018, 11:43:34 AM
to dataverse...@googlegroups.com
Thanks Philipp. You're right, the thinking so far is that a depositor will be able to embargo some (or all) files in a dataset, and set the release date for those files, but make other files publicly available when she makes the dataset itself publicly available.

I don't think our community has discussed in depth how we'd handle the metadata for embargoes, and I see how a discussion dependent on a feature that hasn't been designed yet (a discussion I'd love to have eventually) might complicate and hamper this discussion. So I'm wondering if we can attempt to avoid embargo release dates (I know, I brought it up) for these five metadata fields. Otherwise I worry we'll need to wait until the design of the embargo feature is settled, or consider all of the possible ways the feature could work, before figuring out how embargo affects the use of these five fields.

I just realized that I should mention that I've been updating my last post, from Sept. 11, as I learn more about how some people are interpreting and using these fields and how metadata standards contend with different types of dates.

Philip Durbin

May 12, 2022, 1:51:10 PM
to dataverse...@googlegroups.com
I know this thread is quite old (we have an embargo feature now!) but I thought I'd point out that there's a new discussion going on having to do with dates, specifically, some sort of retention date. "Due to our national legislation, these medical datasets should be deleted after a defined period." Please feel free to join in: https://groups.google.com/g/dataverse-community/c/OFpPiUzYihk/m/URiC_WtKBQAJ
