Set of allowed values for subject metadata and integration with other applications

Pablo Valério Polônia

Apr 20, 2022, 1:33:17 PM
to Dataverse Users Community
Hi,

Subject metadata is mandatory, but it is restricted to a controlled vocabulary of 14 items.

The list offers few options, and the field is also mandatory in the API.

I don't know exactly what the solution could be, but would like to bring the matter up for debate.

Maybe making it optional, or more flexible, could help with integration with other applications.

What kind of user is the feature intended for?
API users, and users of systems that could be integrated with Dataverse.

What inspired it?
We're working on a new implementation of a plugin to integrate OPS, and OJS 3.x in the future, with Dataverse (version 5.x at least).

The plugin uses the SWORD API to communicate with Dataverse.

The SWORD API docs explain this rule:

> “Subject” uses our controlled vocabulary list of subjects. This list is in the Citation Metadata of our User Guide > Metadata References. Otherwise, if a term does not match our controlled vocabulary list, it will put any subject terms in “Keyword”. If Subject is empty it is automatically populated with “N/A”.

But the areas of knowledge defined on an OPS press or OJS journal do not always match the list of required Dataverse subjects.
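To make the quoted rule concrete, here is a minimal Python sketch of how a client could predict where each term ends up. It is only a model of the documented behaviour, not Dataverse's actual code: the function name `split_terms` is hypothetical, the subject list is the one posted later in this thread, and exact (case-sensitive) matching is an assumption.

```python
# The 14 controlled-vocabulary subjects, as listed later in this thread.
DATAVERSE_SUBJECTS = [
    "Agricultural Sciences",
    "Arts and Humanities",
    "Astronomy and Astrophysics",
    "Business and Management",
    "Chemistry",
    "Computer and Information Science",
    "Earth and Environmental Sciences",
    "Engineering",
    "Law",
    "Mathematical Sciences",
    "Medicine, Health and Life Sciences",
    "Physics",
    "Social Sciences",
    "Other",
]


def split_terms(terms):
    """Hypothetical helper: split terms into (subjects, keywords).

    Models the documented rule: a term that matches the controlled
    vocabulary becomes a Subject, any other term becomes a Keyword,
    and an empty Subject is filled with "N/A".
    """
    subjects, keywords = [], []
    for term in terms:
        cleaned = term.strip()
        if not cleaned:
            continue  # skip blank entries
        # Assumption: matching is exact and case-sensitive.
        if cleaned in DATAVERSE_SUBJECTS:
            if cleaned not in subjects:
                subjects.append(cleaned)
        else:
            keywords.append(cleaned)
    if not subjects:
        subjects = ["N/A"]
    return subjects, keywords
```

For example, `split_terms(["Chemistry", "coffee"])` returns `(["Chemistry"], ["coffee"])`, while `split_terms(["Linguistics"])` returns `(["N/A"], ["Linguistics"])`.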

Any related open or closed issues to this feature request?
I'm having trouble setting a subject from the controlled vocabulary through the SWORD API.

I filed an issue on the Dataverse GitHub, but reconsidered because this list seems more suitable for discussions. I kept the message organized as in the issue template, because that seemed to make the motivation and context clearer.

Thanks,
Pablo


Philip Durbin

Apr 25, 2022, 4:46:43 PM
to dataverse...@googlegroups.com
Hi Pablo,

First, thank you very much for working on the OPS/OJS 3 integration with Dataverse! I haven't had a chance to reply on the PKP forum myself, but I encouraged others in the Dataverse community to post: https://groups.google.com/g/dataverse-community/c/cEQWBKVo634/m/yhbyCf20BAAJ

I worked on the original OJS 2/Dataverse 3 integration almost a decade (!) ago.

You're asking about the famous 14 subjects of Dataverse. There has been a lot of discussion over the years about where to take that "Subject" field. In short, it's well known that more subjects are desired. I don't remember lots of people advocating strongly that "Subject" should not be a required field. It's possible to fill in "Other" or in some cases "N/A".

Thanks for opening https://github.com/IQSS/dataverse/issues/8625 about how dcterms:subject isn't working as documented. As I wrote there, I'm not sure how long this hasn't been working. :(

Debate is certainly welcome!

Even if Dataverse had hundreds or thousands of subjects to choose from, I imagine OPS, OJS, and other systems won't have a perfect match of subjects. In this case, "Other" or "N/A" seems appropriate. The difference is that when "N/A" gets into the system (through migration or the SWORD API), the user will be prompted to select a subject the next time they edit the dataset in the GUI.

As you've already discovered, you can fill in "Keyword" from the SWORD API. This seems like a good place to put subjects (areas of knowledge) from OPS/OJS that Dataverse doesn't know about.
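As an illustration of sending both kinds of terms, here is a rough Python sketch that builds a SWORD Atom entry with one dcterms:subject element per term. The function name `build_entry` and its arguments are hypothetical, the entry leaves out any other fields a real deposit may require, and, given issue 8625, how the server actually sorts these terms into "Subject" and "Keyword" may differ from the documentation.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"


def build_entry(title, creator, terms):
    """Hypothetical helper: return an Atom entry as an XML string.

    Every term is sent as dcterms:subject. Per the SWORD API docs,
    terms outside the controlled vocabulary should end up in "Keyword".
    """
    # Use the conventional prefixes instead of generated ns0/ns1.
    ET.register_namespace("", ATOM_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)

    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}title").text = title
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}creator").text = creator
    for term in terms:
        ET.SubElement(entry, f"{{{DCTERMS_NS}}}subject").text = term
    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values; "Chemistry" is in the vocabulary, "coffee" is not.
    print(build_entry("Example dataset", "Doe, Jane", ["Chemistry", "coffee"]))
```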

As for making "Subject" optional, this was one of the big changes we made when rewriting DVN 3 to Dataverse 4. Previously "Subject" was optional but for a variety of reasons, it's now required. DataCite (a search index of datasets) strongly recommends it. Also, having a subject improves metadata quality.

I hope others join this discussion! Thanks again for all your hard work on the plugin!

Thanks,

Phil

p.s. I thought I'd include an issue about a desire to customize the subject list (it was later merged into an umbrella issue) https://github.com/IQSS/dataverse/issues/5938

p.p.s. Here's a little recent discussion of where the 14 subjects came from: https://groups.google.com/g/dataverse-community/c/rpkfWoid-bw/m/_qAtVhczAAAJ


Pablo Valério Polônia

May 3, 2022, 6:45:03 PM
to Dataverse Users Community

Hi Philip,

Thanks for the reply.

Our initial plan is to use "N/A" as the Subject value for now, because in the current MVP stage it seems like the simplest decision.

We want to come back to this topic later. One idea for the future is, for example, to display the 14 Dataverse subjects as check-boxes for selection during dataset submission.

What do you think about it?

Another point is related to keywords. Is using the preprint keywords as the dataset keywords considered good practice?

We thought that in some situations this won’t fit very well. For example, the keywords of the dataset can be more specific than the keywords of the preprint. Maybe the dataset could have its own keyword properties.

Comments are welcome!

Thanks,
Pablo

Philip Durbin

May 4, 2022, 11:29:49 AM
to dataverse...@googlegroups.com
Sure, using "N/A" sounds fine.

And yes, presenting the 14 subjects sounds fine. Years ago we talked about presenting the list of possible subjects via API and there's technically a method for this (below) but it's hidden behind the "admin" API and not documented. Please feel free to create a GitHub issue if you'd like us to clean this up and make it real.

As for preprint keywords vs. dataset keywords, I don't have a good intuitive feel for the difference. We see all sorts of keywords on the dataset side. I'd say you should feel free to send whatever keywords you like.

Thanks,

Phil

p.s. Here's the "list subjects" API endpoint:

curl -s http://localhost:8080/api/admin/datasetfield/controlledVocabulary/subject | jq .
{
  "status": "OK",
  "data": [
    "Agricultural Sciences",
    "Arts and Humanities",
    "Astronomy and Astrophysics",
    "Business and Management",
    "Chemistry",
    "Computer and Information Science",
    "Earth and Environmental Sciences",
    "Engineering",
    "Law",
    "Mathematical Sciences",
    "Medicine, Health and Life Sciences",
    "Physics",
    "Social Sciences",
    "Other"
  ]
}
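
For a plugin that wants to show these subjects as check-boxes, the same call could be made from code. Below is a minimal Python sketch that assumes the response shape shown above. The helper names (`parse_subjects`, `fetch_subjects`, `to_checkbox_options`) and the option dictionary layout are hypothetical, and since the endpoint sits behind the undocumented "admin" API it may be blocked or may change on a real installation.

```python
import json
from urllib.request import urlopen

# Same URL as the curl command above (a local development installation).
SUBJECTS_URL = (
    "http://localhost:8080/api/admin/datasetfield/controlledVocabulary/subject"
)


def parse_subjects(body):
    """Extract the subject list from the JSON response text."""
    payload = json.loads(body)
    if payload.get("status") != "OK":
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    return list(payload["data"])


def fetch_subjects(url=SUBJECTS_URL):
    """Call the endpoint and return the subjects as a list of strings."""
    with urlopen(url, timeout=10) as response:
        return parse_subjects(response.read().decode("utf-8"))


def to_checkbox_options(subjects):
    """Turn subjects into unchecked check-box options (hypothetical layout)."""
    return [{"label": s, "value": s, "checked": False} for s in subjects]
```

A submission form could then call `to_checkbox_options(fetch_subjects())` once and cache the result, since the list changes rarely.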
