Why are the MDC download counts from Dataverse API/database different than MDC download counts from DataCite API/Event Data?


Julian Gautier

Jan 13, 2026, 12:01:14 PM
to Dataverse Users Community
Hi everyone,

Why are the MDC download counts from Dataverse API/database different than MDC download counts from DataCite API/Event Data?

Here's more context and examples:

As I've been learning what users think about the Make Data Count usage metrics, I've been collecting these MDC download counts from datasets and collections in Dataverse repositories, mostly from Harvard Dataverse but also from other Dataverse repositories that collect and report MDC counts.

Until yesterday, I've been doing this by only using the DataCite API to retrieve information about each DOI. For example, for the dataset at https://doi.org/10.7910/DVN/WS9OUR, the API call https://api.datacite.org/dois/10.7910/DVN/WS9OUR returns 31 as the total MDC downloadCount (as of today) and then breaks that count down by month and year:

[Image: Screenshot 2026-01-13 at 10.50.37 AM.png]
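For anyone who wants to repeat this lookup in a script, here is a minimal Python sketch. The `downloadCount` attribute is the one named above; the `downloadsOverTime`, `yearMonth`, and `total` field names are assumptions about how the monthly breakdown is laid out in the response and should be checked against a real record. The sample record is made up so the parsing can be tried offline.

```python
import json
from urllib.request import urlopen

DATACITE_API = "https://api.datacite.org/dois/"


def fetch_doi_record(doi: str) -> dict:
    """Fetch the DataCite REST API record for a DOI (needs network access)."""
    with urlopen(DATACITE_API + doi) as response:
        return json.load(response)


def download_summary(record: dict) -> tuple[int, dict[str, int]]:
    """Return (total downloads, {"YYYY-MM": count}) from a DOI record."""
    attributes = record["data"]["attributes"]
    # Assumed field names for the monthly breakdown; verify against real output.
    monthly = {
        entry["yearMonth"]: entry["total"]
        for entry in attributes.get("downloadsOverTime", [])
    }
    return attributes.get("downloadCount", 0), monthly


# Made-up record in the assumed shape, not real data for any DOI.
sample_record = {
    "data": {
        "attributes": {
            "downloadCount": 3,
            "downloadsOverTime": [
                {"yearMonth": "2025-01", "total": 2},
                {"yearMonth": "2025-02", "total": 1},
            ],
        }
    }
}

total, by_month = download_summary(sample_record)
print(total, by_month)
```

To run it against the live API, pass the result of `fetch_doi_record("10.7910/DVN/WS9OUR")` to `download_summary` instead of the sample record.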

DataCite's docs for this API say that those counts are "pulled from Event Data", which I assume is the same as what we've been calling Make Data Count counts. These are also the same counts shown in DataCite commons, such as https://commons.datacite.org/doi.org/10.7910/DVN/WS9OUR.

When I instead use the Dataverse API or query the Dataverse database where Dataverse records those MDC counts, I get different counts.
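As a rough illustration of the Dataverse side, the sketch below only builds the request URL for the "makeDataCount/downloadsTotal" endpoint mentioned later in this thread. The path pattern reflects my understanding of the Dataverse native API and is an assumption here; the helper name `mdc_metric_url` is hypothetical, and the shape of the JSON response is deliberately not shown.

```python
from urllib.parse import quote


def mdc_metric_url(base_url: str, doi: str, metric: str = "downloadsTotal") -> str:
    """Build the URL for a dataset-level Make Data Count metric.

    The path pattern is an assumption based on the Dataverse native API;
    check it against the guide for the installation's version.
    """
    # Keep ":" and "/" readable, since the persistent ID is "doi:<prefix>/<suffix>".
    pid = quote(f"doi:{doi}", safe=":/")
    return (
        f"{base_url.rstrip('/')}/api/datasets/:persistentId"
        f"/makeDataCount/{metric}?persistentId={pid}"
    )


url = mdc_metric_url("https://dataverse.harvard.edu", "10.7910/DVN/WS9OUR")
print(url)
```

Swapping the `metric` argument would let the same helper request other metric names, if the installation supports them.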


From the Dataverse database's datasetmetrics table, I can also see these counts broken down by month so that I'm able to see for which months the counts are different between the two sources.
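A month-by-month comparison like this can be sketched in a few lines of Python. The two dictionaries below hold made-up numbers, not counts for any real dataset, and the function name `compare_monthly` is hypothetical; months missing from one source are treated as zero.

```python
def compare_monthly(
    dataverse: dict[str, int], datacite: dict[str, int]
) -> list[tuple[str, int, int, int]]:
    """List the months where the two sources disagree.

    Each row is (month, Dataverse count, DataCite count, difference).
    """
    rows = []
    for month in sorted(set(dataverse) | set(datacite)):
        dv = dataverse.get(month, 0)  # month absent from a source counts as 0
        dc = datacite.get(month, 0)
        if dv != dc:
            rows.append((month, dv, dc, dv - dc))
    return rows


# Made-up monthly counts for illustration only.
dataverse_counts = {"2025-01": 4, "2025-02": 2, "2025-03": 1}
datacite_counts = {"2025-01": 3, "2025-02": 2}

differences = compare_monthly(dataverse_counts, datacite_counts)
for row in differences:
    print(row)
```

A positive difference means the repository's database reports more than DataCite for that month.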

The differences in the examples I've given are relatively small - 31 versus 35. I suppose they could be explained by caching or other timing issues, especially when the counts are from more recent months?

But for other datasets, the differences get bigger, like:
From the datasets I've checked so far, the counts from the Dataverse API have been greater than the counts from the DataCite API.

Lastly, I haven't compared the view counts, but I or others can if folks think that might help with troubleshooting.

Thanks in advance for any insights you can provide :)

Julian Gautier (he/him)
Product Research Specialist, IQSS
Interested in helping test Dataverse? Sign up for user experience research

James Myers

Jan 13, 2026, 1:03:26 PM
to dataverse...@googlegroups.com

FWIW: My initial guess would be that the reporting to DataCite failed for some months. As you said, for the dataset you cite, the DataCite API shows views and downloads per month, and you can see there are gaps in that list (months where no views/downloads occurred). I’d suggest looking at the DataCite reports API to see if there are reports for the months that are missing from the dataset-level info (for QDR the query is https://api.test.datacite.org/reports?platform=QDR&created-by=QDR). If not, then those missing reports could contain the missing counts. Alternatively, you should be able to get the total download counts for each month for that dataset from the datasetmetrics table and see if all the missing counts are from months not shown in the DataCite API.
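The check for missing monthly reports could be sketched as follows. It assumes you have already pulled each report's reporting-period start date out of the reports API output as a "YYYY-MM-DD" string; how those dates are nested in the response is not shown here, and the function names are hypothetical.

```python
def month_range(first: str, last: str) -> list[str]:
    """Return every month from first to last, inclusive, as "YYYY-MM" strings."""
    year, month = map(int, first.split("-"))
    end_year, end_month = map(int, last.split("-"))
    months = []
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}-{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def months_without_reports(
    first: str, last: str, report_begin_dates: list[str]
) -> list[str]:
    """Return the months in the range that no report covers."""
    covered = {begin_date[:7] for begin_date in report_begin_dates}
    return [m for m in month_range(first, last) if m not in covered]


# Made-up report start dates for illustration only.
missing = months_without_reports(
    "2024-11", "2025-02", ["2024-11-01", "2025-01-01"]
)
print(missing)
```

Any month returned here is a candidate for counts that exist in the repository's database but never reached DataCite.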

 

-- Jim

 

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/dataverse-community/099b8865-d76e-44b0-acf7-e51dfb5114ecn%40googlegroups.com.

Julian Gautier

Jan 13, 2026, 3:24:45 PM
to Dataverse Users Community
Thanks Jim. I think your reply answered a few questions I could have made more explicit earlier:
  • It's fair to think that ideally the MDC download counts from Dataverse (API and database) and the download counts from DataCite (API and Event Data) should be the same. And it's fair to think that the downloadCounts from DataCite's API, which are "pulled from Event Data", are the same as what we've been calling Make Data Count counts. I'd been wondering if I've misunderstood the intention and the different terms being used.
  • The differences between the counts from both sources are great enough to suspect that it's not only a matter of something expected and temporary, like caching, that should correct itself in a reasonable amount of time.
I'll look more closely at some datasets' monthly counts from each source and check out those reports you mentioned. I see the ones for Harvard Dataverse at https://api.datacite.org/reports?platform=Harvard+Dataverse&created-by=Harvard+Dataverse&page[size]=1000.

And I'll ping Steve Winship. Ceilyn reminded me that he also worked on how Dataverse sends this info to DataCite.

Julian Gautier

Jan 14, 2026, 9:08:24 AM
to Dataverse Users Community
I made a Google Sheet to compare the monthly Make Data Count download counts from Harvard Dataverse's database to monthly counts from DataCite's API, using the same example dataset I wrote about yesterday, https://doi.org/10.7910/DVN/WS9OUR. There are 45 months.

The table shows that DataCite has reports for all 45 months, from what I can tell from the output of the reports API endpoint you shared, Jim.

In that endpoint's output, each month has the exception "usage data needs to be uncompressed" / "Report is compressed using gzip". Is that relevant here? I couldn't find any discussion about this exception in a couple of IQSS's GitHub repos and in my email inbox.

Two of the months also have the exception "usage data has not been processed for the entire reporting period" / "partial data returned". Those two months do have different counts, but almost all of the months do, too. And there are months with larger differences. So I'm not sure that that exception explains anything.

It's weird to me that for six of the months, the Harvard Dataverse database has no counts for this dataset, but the DataCite API does.

Could DataCite be getting counts from other sources?

This makes me less certain that the more accurate source of download counts is in the repository's database and what's reported by Dataverse's "makeDataCount/downloadsTotal" endpoint.

Julian Gautier

Jan 14, 2026, 9:59:32 AM
to Dataverse Users Community
Ugh, I just noticed that I mixed up counts from two different datasets. Sorry!

I updated the Google sheet. Now it's showing monthly MDC counts from the dataset at https://doi.org/10.7910/DVN/DCDKZQ.

There are 40 months where either or both sources report download counts.

In the output of the DataCite API's "reports" endpoint, there's just one month that has the "usage data has not been processed for the entire reporting period" / "partial data returned" exception.

For the 31 months where the sources have different counts for this dataset, the counts from Harvard Dataverse's database are always greater than the counts from the DataCite API.

So the big correction in my mind is that there are no months where the DataCite API has counts but the Harvard Dataverse database has none. So there's nothing here that suggests that DataCite is getting counts about datasets published in Harvard Dataverse from sources other than Harvard Dataverse.

Julian Gautier

Jan 15, 2026, 11:44:42 AM
to Dataverse Users Community
To the Google Sheet I added pageview counts for the dataset at https://doi.org/10.7910/DVN/DCDKZQ, to compare pageview counts from Harvard Dataverse's database to pageview counts from the DataCite API.

I think these pageview count differences are significant, too. And nothing I could learn from DataCite's reports API explains these differences.

Like the download counts, the total pageview counts from the repository's database are greater than the total pageview counts from the DataCite API. But unlike the download counts, there are months, about half of the months, where pageview counts from the DataCite API are greater than pageview counts from the repository's database.

It kind of makes sense that DataCite can record pageview counts that aren't in the repository's database, right? Maybe DataCite is recording pageview counts when people view the dataset and its metadata from DataCite Commons? And as far as I understand, Dataverse repositories aren't pulling the MDC counts from DataCite/EventData. They're only pushing counts to DataCite.

Whereas for download counts, people can't download this dataset's files from DataCite. So is it fair to say that DataCite (or EventData) should be getting the download counts only from the Dataverse repository's reports? What if a second DOI is published with metadata that indicates that it's the same data as the first dataset? Is EventData accounting for that? It's been a while since I tried to understand what DataCite's PID graph is doing.

When I saw the differences between the download counts from the two sources, I wondered if I should consider the database counts as more accurate, and I thought about using those instead when gathering these counts to show to users to learn what they think of them. But thinking more about this, I'm less sure which counts are accurate or most accurate, and less inclined to learn what users think of them until it's more clear which counts are accurate.

Julian Gautier

Jan 16, 2026, 4:03:10 PM
to Dataverse Users Community
To see if this is affecting only Harvard Dataverse, I compared the counts from 11 Dataverse installations whose users can see MDC counts on dataset pages. There are at least 23 installations showing MDC counts on their dataset pages, but the DataCite API reports download or pageview counts, or both, for only 11 of the 23.

The comparisons are in a second tab in the Google Sheet, "Total comparisons - other Dataverse repositories".

For each of the 11 installations, I got the counts for the 10 datasets that DataCite's API said were downloaded most often, then used Dataverse's API to get the MDC download and pageview counts of those datasets according to each installation.

For almost all of these datasets, which represent the most downloaded and viewed datasets in each of these 11 installations, the counts are different. That is, the MDC counts on the installations' dataset pages (and from the Dataverse API and in their databases) are different from the counts from the DataCite API (and on each dataset's page on DataCite Commons).

I could only compare total counts from the installations, and not monthly counts, since I'm not able to get monthly counts from the installations. Monthly counts are in each installation's datasetmetrics database table, and of course I have access only to Harvard Dataverse's database (or rather a copy of it). And it doesn't look like it's possible to use the Dataverse API to get these counts by month.

So I can't tell for which months there are differences or consider how each installation's monthly reports affect these differences.

But from what I can tell, almost all of the 110 datasets were published over a year ago, and most were published three or more years ago. I'd imagine that any expected timing issue would have been resolved by now, but I'm not sure.

I expect I'd find similar count differences for other or even most datasets in Harvard Dataverse.

I think it's a reasonable assumption that standardized metrics that users can get from repositories and from DataCite would be the same, and that the metrics are much less valuable when they're not the same.

I'm very curious to hear what folks running these repositories think about this, and what their users think. I'll catch up with my colleagues at IQSS to see what they think.

James Myers

Jan 16, 2026, 5:10:55 PM
to dataverse...@googlegroups.com

FWIW: Part of the difference appears to be that DataCite is reporting unique views whereas we’re reporting total views. Both are derived quantities that sum lower-level entries (as named in our database): viewsuniqueregular and viewsuniquemachine for unique views, or viewstotalregular and viewstotalmachine for total views, across all country codes and then across all months. (I think both types of counting already remove ‘double-clicks’, i.e. requests for the same URL within 30 seconds.)
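The summation described above might look like this in Python. The four count column names come from the message; the `monthyear` and `countrycode` keys and all of the numbers are assumptions made up for illustration, and `derived_views` is a hypothetical helper, not Dataverse code.

```python
def derived_views(rows: list[dict]) -> dict[str, int]:
    """Sum regular and machine entries over every month and country row."""
    views_total = sum(
        row["viewstotalregular"] + row["viewstotalmachine"] for row in rows
    )
    views_unique = sum(
        row["viewsuniqueregular"] + row["viewsuniquemachine"] for row in rows
    )
    return {"viewsTotal": views_total, "viewsUnique": views_unique}


# Made-up rows standing in for one dataset's entries in the datasetmetrics table.
rows = [
    {"monthyear": "2025-01", "countrycode": "us",
     "viewstotalregular": 5, "viewstotalmachine": 2,
     "viewsuniqueregular": 3, "viewsuniquemachine": 1},
    {"monthyear": "2025-01", "countrycode": "de",
     "viewstotalregular": 4, "viewstotalmachine": 0,
     "viewsuniqueregular": 2, "viewsuniquemachine": 0},
    {"monthyear": "2025-02", "countrycode": "us",
     "viewstotalregular": 1, "viewstotalmachine": 6,
     "viewsuniqueregular": 1, "viewsuniquemachine": 2},
]

summary = derived_views(rows)
print(summary)
```

With these sample rows the total is twice the unique figure, which shows how the two derived quantities can drift apart even before any reporting problem is involved.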

 

Since unique means “Multiple activities qualifying for the metric type in question representing the same dataset and occurring in the same user-sessions MUST be counted as only one “unique” activity for that dataset.”, I think that would mean downloading of 1000 datafiles in a dataset by one user would count as 1 unique download and 1000 regular downloads (haven’t fully verified that) which could account for the big differences on big datasets. It’s been too long for me to recall what was discussed when we chose what to display, but I wouldn’t be surprised if there was concern that unique views would mean big datasets would be underrepresented (e.g. fewer counts than if you simply spread the files across multiple datasets).
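A toy Python sketch of that reading of "unique" follows: every request adds to the total, while a session adds at most one unique count per dataset. This simplifies the quoted definition (it ignores double-click filtering and how sessions are actually identified), and the event tuples and names are made up.

```python
def count_requests(events: list[tuple[str, str, str]]) -> dict[str, tuple[int, int]]:
    """Return {dataset_id: (total requests, unique requests)}.

    Each event is (session_id, dataset_id, file_id). A session contributes
    one unique request per dataset, however many files it fetched.
    """
    totals: dict[str, int] = {}
    sessions: dict[str, set[str]] = {}
    for session_id, dataset_id, _file_id in events:
        totals[dataset_id] = totals.get(dataset_id, 0) + 1
        sessions.setdefault(dataset_id, set()).add(session_id)
    return {ds: (totals[ds], len(sessions[ds])) for ds in totals}


# One session downloads 1000 files; a second session downloads a single file.
events = [("session-a", "big-dataset", f"file-{i}") for i in range(1000)]
events.append(("session-b", "big-dataset", "file-0"))

counts = count_requests(events)
print(counts)
```

Under these assumptions the dataset gets 1001 total requests but only 2 unique ones, which is the kind of gap that could make datasets with many files look very different in the two sources.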

 

-- Jim

 



Julian Gautier

Jan 20, 2026, 10:04:45 AM
to Dataverse Users Community
Thanks Jim! I think that's worth a lot! I updated the Google Sheet to show each dataset's unique counts in Dataverse databases, too, and to show the differences between those unique counts and DataCite's counts. The differences are much smaller, which I hope supports the idea that the folks at DataCite decided to report unique counts in their API and on DataCite Commons pages.

I've also thought about what you wrote about unique counts, that any amount of activity on a dataset within a user-session, like viewing or downloading 1000 files in a dataset, counts as 1 activity. And I also don't remember the rationale for having Dataverse display total counts as opposed to unique counts, but this leads to much more specific questions:

- Why did we decide to show only total views and downloads on dataset pages, as opposed to unique views and downloads, or both?
- When users are interested in datasets with lots of files, would they be concerned that unique views underrepresent the activity of those datasets?
- Are the counts shown on DataCite Commons pages and in their API unique counts? Why have folks at DataCite decided to report only unique counts?
- What types of MDC counts have folks from other repositories and repository platforms decided to show and why?

I'm glad that users can see that total views and downloads are shown on dataset pages. And of course the Dataverse API lets users get different types of counts.

James Myers

Jan 20, 2026, 11:59:02 AM
to dataverse...@googlegroups.com

I may have been mistaken that DataCite is using total unique counts. Instead, it may be showing regular unique counts (total = regular + machine):

 

Looking at the first two datasets listed for Syracuse/QDR, the db has

6107 R 5006 M = 11113 total

6578 R 1433 M = 8011 total

 

Whereas DataCite has 5895 and 6417 respectively for those two, which a) looks close to the Regular counts (slightly below; I haven’t checked whether we had reporting problems in some months since 2019-10-01, when we started MDC, but wouldn’t be too surprised), and b) differs from our totals by amounts that mirror the sizes of the Machine counts (i.e. while the regular counts differ by <10%, the machine counts are ~3x different for these two datasets).
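The arithmetic behind this comparison can be checked with a short Python sketch. It uses only the six numbers quoted above; the dictionary keys and variable names are made up for illustration.

```python
# Counts quoted above for the first two Syracuse/QDR datasets.
datasets = [
    {"regular": 6107, "machine": 5006, "datacite": 5895},
    {"regular": 6578, "machine": 1433, "datacite": 6417},
]

for d in datasets:
    d["total"] = d["regular"] + d["machine"]
    # Relative shortfall of the DataCite count against each candidate.
    d["gap_vs_regular"] = (d["regular"] - d["datacite"]) / d["regular"]
    d["gap_vs_total"] = (d["total"] - d["datacite"]) / d["total"]
    print(
        f"total={d['total']} "
        f"gap vs regular={d['gap_vs_regular']:.1%} "
        f"gap vs total={d['gap_vs_total']:.1%}"
    )

# How much the machine counts differ between the two datasets.
machine_ratio = datasets[0]["machine"] / datasets[1]["machine"]
print(f"machine ratio={machine_ratio:.1f}")
```

The DataCite figures sit within a few percent of the regular counts but far from the regular-plus-machine totals, which is consistent with the guess that machine usage is being left out.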

 

FWIW: There has been another discussion about machine counts, with DataCite not picking them up with their new browser script (and possibly not reporting them in the API) and several of us pointing out that we have tools/scripts (PyDataverse, etc.) in use that represent real scientific use but don’t get counted as regular (via a browser) counts.

 

-- Jim

 


Julian Gautier

Jan 20, 2026, 3:26:31 PM
to Dataverse Users Community
I was about to ping folks from DataCite today to ask, but took another look through their docs and found a section on their "Consuming Views and Downloads" page that I think confirms your hunch, Jim. On the page they say that the usage counts (views and downloads) described on this page "are the same ones shared through DataCite's APIs and visible in DataCite Commons", and in the middle of the page is this:

[Image: Screenshot 2026-01-20 at 2.52.26 PM.png]

So I think this confirms that the usage counts shown on DataCite Commons pages and returned by their API exclude machine views and machine downloads. And if a user asked why the count of downloads they see on a Dataverse repository's dataset page is different from the count of downloads for the same dataset on a DataCite Commons page, the reasons would be that:
  • Dataverse is showing "regular" plus "machine" downloads that are not "unique" (or where metric-type is total-dataset-requests and where access-method is "regular" plus "machine")
  • DataCite is showing only "regular" downloads that are "unique" (or where metric-type is unique-dataset-requests and where access-method is "regular")
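The two selections above can be sketched in Python as filters over the same set of usage entries. The `metric-type` and `access-method` values are the ones named in the bullets; the list layout, the `count` key, and the numbers are assumptions made up for illustration, not a real report.

```python
def sum_counts(instances: list[dict], metric_type: str, access_methods: set[str]) -> int:
    """Sum the counts that match one metric-type and any of the access methods."""
    return sum(
        item["count"]
        for item in instances
        if item["metric-type"] == metric_type
        and item["access-method"] in access_methods
    )


# Made-up usage entries for one dataset.
instances = [
    {"metric-type": "total-dataset-requests", "access-method": "regular", "count": 40},
    {"metric-type": "total-dataset-requests", "access-method": "machine", "count": 25},
    {"metric-type": "unique-dataset-requests", "access-method": "regular", "count": 12},
    {"metric-type": "unique-dataset-requests", "access-method": "machine", "count": 5},
]

# What a Dataverse dataset page would show, per the first bullet.
dataverse_style = sum_counts(instances, "total-dataset-requests", {"regular", "machine"})
# What DataCite Commons would show, per the second bullet.
datacite_style = sum_counts(instances, "unique-dataset-requests", {"regular"})
print(dataverse_style, datacite_style)
```

The same underlying usage yields 65 in one view and 12 in the other, so the two displays can disagree without either source losing data.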
Like you wrote, we decided to include machine usage because it represents real usage. And the fact that the Dataverse API always includes machine usage tells me that there was a fair amount of confidence that users would always want machine views and downloads to be included in the usage counts they see.

So I wonder if another specific question is:
- Why did the folks at DataCite decide to exclude machine views and machine downloads from their unique counts?

Maybe the reasons are similar to why their newer browser script doesn't pick up machine counts. I recall that we haven't adopted that browser script because it doesn't pick up machine counts, or that was at least one reason why. But I don't remember hearing DataCite folks' reasons why their browser script doesn't pick up machine counts.

Jim, do you know off hand if we ever learned why? Or maybe we've postponed that discussion with them?