How do you think Dataverse's Keyword and Topic Classification fields are similar and different?

Gautier, Julian

Feb 24, 2025, 1:50:02 PM

to dataverse...@googlegroups.com

Hi everyone,

Have you used or considered using Dataverse's Keyword and Topic Classification fields and thought about how the fields are similar and different?

If so, in this Google Groups post or in an email to my email address (julian...@g.harvard.edu), could you share what you think of these fields, and particularly any thoughts you've had about their similarities and differences?

I've also more directly contacted folks who manage collections that have datasets where both fields are used, so apologies if I've already contacted you.

Lastly, just for more context I'm considering reaching out to the folks who work on the DDI Codebook metadata standard, since that standard has influenced how we've designed those two fields (and lots of others).

Best regards,

Julian

Julian Gautier (he/him)

Product Research Specialist, IQSS

Interested in helping test Dataverse? Sign up for user experience research


Philipp Conzett

Feb 25, 2025, 3:22:00 AM

to Dataverse Users Community

Hi Julian, everyone,

Back in 2013 when we started our Dataverse journey, we experimented somewhat with the Topic Classification field in TROLLing, which is our domain-specific repository for linguistic data. We experienced that quite a few users were struggling distinguishing between Topic Classification and Keyword. As a result, we added a manual controlled vocabulary to the Topic Classification field, populated with sub-fields within Linguistics. At that time, we experienced some technical challenges managing the adapted metadata schema when upgrading to new versions of the Dataverse software, so in the end, we decided to hide the Topic Classification field from the Citation Metadata schema.

Today, when there is support for external controlled vocabularies, I think it would make sense for TROLLing as well as DataverseNO (a national, generalist repository) to consider reactivating the Topic Classification field, connected to external controlled vocabularies that cover sub-fields within disciplines. As external controlled vocabulary support is still experimental in Dataverse, we haven't deployed this feature yet in our repositories.

Best,
Philipp

Marion Wittenberg

Feb 25, 2025, 6:05:17 AM

to dataverse...@googlegroups.com

Hi Julian, everyone,

The ELSST thesaurus used by CESSDA repositories makes use of ELSST Keywords as well as the CESSDA Topic Classification. At DANS we make use of both for the SSH Data Station.

Kind regards,

Marion


--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/dataverse-community/97f3ac69-985b-4168-8690-b20329ce1a13n%40googlegroups.com.

Katie Mika

Feb 25, 2025, 1:14:15 PM

to dataverse...@googlegroups.com

Hi Julian,

Perhaps to support Philipp's idea about "reactivating the Topic Classification field, connected to external controlled vocabularies that cover sub-fields within disciplines," I've found that many users in Harvard Dataverse find the "Subject" field too broad and wish to clarify or more granularly describe their data or sub-discipline. They can use a specific classification system (like LCSH or MeSH) and find that Topic Classification more naturally describes the subject/content of the dataset, while the "Keyword" field seems to be more useful for natural language keywords that might fall outside of the controlled vocabulary's useful purview.

For example, a researcher may apply "Economic History" as a Topic Classification term with an LCSH URL (https://lccn.loc.gov/sh85040817), then in the keywords they might also use "trade history," "USA manufacturing," or "1970s trade statistics," which are more natural, human-readable terms that users searching for data might use.
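As a rough illustration of that example, the sketch below builds the two fields in the nested layout that Dataverse's dataset JSON uses for compound metadata fields. The subfield names (`topicClassValue`, `topicClassVocab`, `topicClassVocabURI`, `keywordValue`) are given as I understand them from the Citation metadata block and should be checked against your installation; the helper `compound_field` is hypothetical. Putting the term's URL in the vocabulary URL subfield is an assumption that follows the example above.

```python
import json


def compound_field(type_name, entries):
    """Build one multi-valued compound field (hypothetical helper)."""
    return {
        "typeName": type_name,
        "typeClass": "compound",
        "multiple": True,
        "value": [
            {
                name: {
                    "typeName": name,
                    "typeClass": "primitive",
                    "multiple": False,
                    "value": value,
                }
                for name, value in entry.items()
            }
            for entry in entries
        ],
    }


# Controlled term: the value plus the vocabulary it comes from.
# Assumption: the term's URL is stored in the vocabulary URL subfield.
topic_classification = compound_field(
    "topicClassification",
    [
        {
            "topicClassValue": "Economic History",
            "topicClassVocab": "LCSH",
            "topicClassVocabURI": "https://lccn.loc.gov/sh85040817",
        }
    ],
)

# Free-text keywords: only the term itself, no vocabulary attached.
keywords = compound_field(
    "keyword",
    [
        {"keywordValue": "trade history"},
        {"keywordValue": "USA manufacturing"},
        {"keywordValue": "1970s trade statistics"},
    ],
)

print(json.dumps([topic_classification, keywords], indent=2))
```

The two fields share the same structure; the difference in this usage is only whether the vocabulary subfields are filled in.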

Best,

Katie


Youn Noh

Feb 26, 2025, 10:58:16 AM

to Dataverse Users Community

I strongly support the idea of having the Subject field extended to include subfields within disciplines, e.g., Social Sciences > Political Science > Political Behavior. The subfields might correspond very roughly to academic departments and specializations within those departments. If the hierarchical structure could also be represented in the UI, that would be a plus.

Julian Gautier

Feb 28, 2025, 2:01:29 PM

to Dataverse Users Community

Thanks everyone for the insight so far! I have lots of questions!

Philipp, you wrote that users were "struggling distinguishing between Topic Classification and Keyword". Could you write more about that? How do you know they were struggling?

When TROLLing's Topic Classification field was using a controlled vocabulary of Linguistics sub-fields defined in its Citation metadata block TSV file, what were you expecting depositors to enter in the Keyword field?

And now that the Topic Classification field is hidden, are you expecting depositors to enter Linguistics sub-fields into the Keyword field?

Marion, so the SSH Data Station has a metadata block named dansSocialSciences, which has the two fields "Keyword ELSST" and "Topic Classification CESSDA". Depositors can enter keywords from the ELSST Thesaurus into that "ELSST Keywords" field, and they can enter CESSDA Topic Classifications from the ELSST thesaurus into that "Topic Classification CESSDA" field.

SSH Data Station's Citation metadata block also has the Keyword field, which I see is being used. What do you expect depositors to enter into that Keyword field?

And are depositors able to use the Citation metadata block's Topic Classification field, too? If so, how do you expect depositors to use that field? I see the field in SSH Data Station's Citation metadata block, but I couldn't find any published datasets in SSH Data Station that are using that field.

It would be really helpful to learn more about how depositors choose terms from the ELSST thesaurus when using the "ELSST Keywords" and "Topic Classification CESSDA" fields, but to keep this thread focused on the differences between the keyword and topic classification fields, I'll reach out directly.

Katie, could you write more about the "Keyword" field seeming to be more useful for natural language keywords that might fall outside of the controlled vocabulary's useful purview? Is it Harvard Dataverse depositors who find the keyword field more useful for natural language keywords?

Youn, if the Citation metadata block's Subject field let depositors choose subfields within disciplines, like your "Political Behavior" example, what do you think the Keyword and Topic Classification fields could be used for?

Philipp Conzett

Mar 3, 2025, 1:34:28 AM

to Dataverse Users Community

Hi Julian, all,

It's been more than 10 years since we used the Topic Classification field, so I can't really recall the details, but I think some depositors would also add Linguistics sub-fields in the Keyword field, because they couldn't find appropriate terms in the controlled vocabulary we used in the Topic Classification field. We also were uncertain where to draw the line between Linguistics sub-fields and more specific topic terms describing the dataset. I think that's why we in the end opted for only displaying the Keyword fields.

At the time we used the Topic Classification field, metadata schema customization worked differently from now. This was pre-v4 of Dataverse; I think it was possible to define vocabularies through the graphical user interface back then.

We'd be happy to discuss how the Topic Classification field could be coupled with external controlled vocabularies, e.g., covering sub-fields from different research disciplines or domains.

Best,
Philipp

Marion Wittenberg

Mar 3, 2025, 9:01:58 AM

to dataverse...@googlegroups.com, Ricarda Braukmann, Laura Huis in 't Veld

Hi Julian,

I will ask my colleagues Ricarda Braukmann (Data Station manager SSH) and Laura Huis in 't Veld to answer your question.

Best, Marion


Katie Mika

Mar 3, 2025, 9:31:45 AM

to dataverse...@googlegroups.com

Hi Julian - basically yes, I see Harvard Dataverse users understanding and using the "keyword" field to add terms that users may use for simple keyword searching. This is distinct from the way they want to use the Topic Classification field, which is generally understood the way Philipp describes it: as a field for adding controlled vocabulary terms for subject sub-fields.

Another example, in addition to my earlier one, could be a Topic Classification of "Acute Kidney Injury" with the URI link: https://www.ncbi.nlm.nih.gov/mesh/?term=acute+kidney+failure. A user may then add Keyword terms such as "Renal failure" or "Diabetes" without any controlled vocab or schema.

Does that help?

Best,

Katie


Youn Noh

Mar 3, 2025, 10:19:40 AM

to Dataverse Users Community

Hi, Julian. Thanks for your question and for starting this discussion. I also would expect depositors and searchers to enter natural language terms into the Keyword field and to enter controlled vocabulary terms into the Topic Classification field.

I think both could be useful in combination with an expanded hierarchical Subject field by allowing users to browse to a field or specialization within that field by Subject, search using a term of their choice in Keyword, then broaden or narrow their search results using a controlled term in Topic Classification. Continuing with the "Social Sciences > Political Behavior" example, a keyword within that specialization could be "climate activism", a narrower controlled term from a classification scheme could be "Environmental Concern", and a broader controlled term could be "Social Psychology" or "Climate Change".

I think the advantage of Keyword is that it allows users to select terms that are in current use among researchers or are very specific but might not be part of a classification scheme. Controlled vocabulary terms permit classification by interdisciplinary or cross-cutting research areas that may not be represented in Subject and could prevent Subject, even expanded into a hierarchy, from getting too unwieldy.

I hope this helps.

Youn

Janet McDougall - Australian Data Archive

Mar 4, 2025, 11:36:42 PM

to Dataverse Users Community

hi All

It's been interesting to read various use cases for Keyword & Topic Classification.

Keyword:

ADA has been using the 'Keyword' field to hold standard search terms, originally based on Australian Public Affairs Information Service (APAIS) - a subject thesaurus, unfortunately no longer updated but indicative of the types of keywords depositors expect their data to be searched under. We ask depositors to decide on keywords, but ADA archivists may include more.

Topic Classification:

ADA has had 2 main use cases for this field. 15 yrs ago or so the field was used internally to indicate catalogue structure with in-house code and Nesstar archive platform.

Since moving to Dataverse in 2017, and requirements by one of the Aust gov funders (ARDC) to harvest our metadata to a central catalogue, we use the field specifically for Aust & NZ Std Research Classification (ANZSRC) Fields of Research (FoR) Classification vocabulary. ANZSRC is a product of the Aust Bureau of Stats (ABS) - it is a statistical classification used for the measurement and analysis of R&D in Aust and NZ - basically research output reporting. ** NZ is New Zealand...

Amber Leahey

Mar 6, 2025, 1:39:30 PM

to Dataverse Users Community

We use 'Keyword' more for non-standardized terms that may apply to a series or to a dataset's list of variables and themes (e.g. job status, childcare, smoking, etc.). We then apply topic classifications for filtering in our Odesi Browse Categories (https://odesi.ca > Browse), which are derived from Statistics Canada subject headings that we have used for many years. Our metadata best practices guide (Odesi Metadata Best Practices Guide 2023) has more information about how we use Dataverse keyword and topic fields for our social science data collections in Odesi.

Youn Noh

Aug 18, 2025, 9:46:50 AM

to Dataverse Users Community

Would there be any utility in using the Keyword and Topic Classification fields for narrower and broader terms in the same controlled vocabulary? I think you could accomplish the same if you were using external vocabulary services, but that might not always be the case. I think it would also make it easier to relate Keyword and Topic Classification to each other. However, the optimal degree of separation between the terms selected would be subject to interpretation or might need to be standardized.

Also, would there be any value in adding a Term URI subfield for Topic Classification? I think it would be nice to have the subfields for Keyword and Topic Classification be consistent.
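To make the subfield comparison concrete, here is a minimal sketch that lines up the subfields of the two fields and reports what Topic Classification lacks. The subfield names are listed as I understand them from recent versions of the Citation metadata block, so treat them as assumptions and verify them against your installation. The role labels (`term`, `term_uri`, and so on) are my own, purely for the comparison.

```python
# Subfield name -> role it plays (role labels are hypothetical, for comparison only).
# Assumption: names reflect a recent Citation metadata block; verify locally.
KEYWORD_SUBFIELDS = {
    "keywordValue": "term",
    "keywordTermURI": "term_uri",
    "keywordVocabulary": "vocabulary_name",
    "keywordVocabularyURI": "vocabulary_uri",
}

TOPIC_CLASSIFICATION_SUBFIELDS = {
    "topicClassValue": "term",
    "topicClassVocab": "vocabulary_name",
    "topicClassVocabURI": "vocabulary_uri",
}

# Roles that Keyword covers but Topic Classification does not.
missing_roles = set(KEYWORD_SUBFIELDS.values()) - set(
    TOPIC_CLASSIFICATION_SUBFIELDS.values()
)

print(sorted(missing_roles))
```

Under these assumptions, the only role left over is the term URI, which matches the inconsistency described above.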