JHOVE - how does it work?

Julie

Aug 28, 2024, 2:35:21 PM

to Dataverse Users Community

Hello Dataverse community,

I’m starting a project with Borealis to better understand how format information is handled in Dataverse and what we could do with that information. Specifically, I’m curious to know:

Why JHOVE was chosen for format identification, etc.
What operations JHOVE performs in the Dataverse context
Where results from JHOVE are recorded/stored: so far, we’ve found the “contenttype” field in the “datafile” table for MIME types
If/how other folks are making use of the MIME types or other format information that’s recorded in DV

I’d also be interested to know if anyone is using external tools to gather format information about their files in Dataverse. Any advice is much appreciated!

Thank you,

Julie

Julie Shi

Digital Preservation Librarian

Scholars Portal

Philip Durbin

Aug 28, 2024, 5:06:33 PM

to dataverse...@googlegroups.com

Hi Julie,

We've been using JHOVE forever but I think we'd be fine with switching to something else if it's better.

I believe we only use it for file detection. And yes, like you said, we stuff the value into that "contenttype" field.

File-level external tools use the MIME/content type. You can find some examples at https://guides.dataverse.org/en/6.3/admin/external-tools.html but think of previewers for images or text or CSV, for example.
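As a rough illustration of how a file's MIME/content type can decide which file-level tools are offered, here is a minimal Python sketch. The tool names, the `FileTool` class and the `tools_for` function are hypothetical, and exact string matching on the content type is an assumption made for the example, not a description of Dataverse's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileTool:
    """A hypothetical file-level tool registered for one content type."""
    name: str
    content_type: str


# Illustrative registry only; these are not real tool registrations.
TOOLS = [
    FileTool("Image Previewer", "image/png"),
    FileTool("Text Previewer", "text/plain"),
    FileTool("CSV Previewer", "text/csv"),
]


def tools_for(content_type, tools=TOOLS):
    """Return the names of tools whose content type matches exactly (an assumption)."""
    return [tool.name for tool in tools if tool.content_type == content_type]


if __name__ == "__main__":
    print(tools_for("text/csv"))         # the CSV previewer is offered
    print(tools_for("application/pdf"))  # nothing registered for PDF here
```

The point of the sketch is only that a wrong or overly generic value in "contenttype" means the matching tool is never offered for that file.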

I'm sure others have thoughts on this stuff. I hope this helps!

Thanks,

Phil

--

Philip Durbin
Software Developer for http://dataverse.org
http://www.iq.harvard.edu/people/philip-durbin

Julie

Sep 12, 2024, 10:33:26 PM

to Dataverse Users Community

Hi Phil,

Apologies for the delay! This is very helpful, particularly that the external tools rely on MIME/content types, which makes a lot of sense!

Would you happen to know if only the file detection features are used because the external tools are the main use case that JHOVE is supporting in Dataverse? Just curious about how/why JHOVE arrived on the Dataverse scene.

Best,

Julie

Leonid Andreev

Sep 24, 2024, 9:24:55 AM

to Dataverse Users Community

Hi Julie,

Sorry for the delay on our part.

The short answer is just "legacy", really. The reason JHOVE was originally chosen (and, as Phil said, we've been using it "forever") was that we needed something to help identify the MIME types of uploaded files, and it was a) readily available at the time and b) another Harvard-developed and maintained package. The latter hasn't been the case for the longest time (but at least it appears to be actively maintained now, after a long hiatus, by the Open Preservation Foundation). It is entirely possible that there are better software packages available by now. We just haven't gotten around to evaluating any alternatives, because, "legacy".

I should mention that Dataverse does not rely solely on JHOVE to determine the MIME type of an uploaded file. There are several ways of detecting, or guessing, the content type, and Dataverse then makes an educated guess as to which type to use. For example, we have our own code for identifying the formats that we recognize and support for producing data- and variable-level metadata, such as Stata and SPSS. If one of these tests produces a positive result, the type produced (for example, "application/x-stata-14") always wins over whatever JHOVE says. Dataverse will also check the type that the browser or the API client supplies with the upload. There are situations where that will be the "richest"/most specific MIME type available, and it is therefore chosen over the JHOVE-produced result. There is also the "direct upload" workflow, where the file is uploaded directly to an S3/Globus volume, bypassing Dataverse. This is much more efficient than streaming the data through Dataverse and much better suited for large uploads. However, the drawback is that Dataverse cannot perform any content type detection during the initial upload workflow, via JHOVE or otherwise, since it never sees the actual data bytes. So it has to rely solely on the type supplied by the browser, or on guessing the type from the filename extension (Dataverse maintains a list of extensions for this purpose).
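The precedence described above can be sketched roughly as follows. This is a minimal Python illustration, not Dataverse's actual code (Dataverse itself is written in Java). The function name `choose_content_type`, the `GENERIC_TYPES` set, the small `EXTENSION_TYPES` table, and the rule that "most specific" means "not a generic type, with JHOVE preferred on a tie" are all assumptions made for the example.

```python
import os

# Assumption: these types count as "generic", i.e. less specific than others.
GENERIC_TYPES = {"application/octet-stream", "text/plain"}

# Hypothetical stand-in for the extension list that Dataverse maintains.
EXTENSION_TYPES = {
    ".csv": "text/csv",
    ".png": "image/png",
}


def choose_content_type(filename, client_type=None, jhove_type=None,
                        ingest_type=None, bytes_available=True):
    """Pick one content type from several detection results.

    ingest_type     -- result of a format-specific check (e.g. Stata, SPSS)
    jhove_type      -- result reported by JHOVE
    client_type     -- type supplied by the browser or API client
    bytes_available -- False for direct upload, where the bytes are never seen
    """
    # 1. A positive format-specific check always wins.
    if bytes_available and ingest_type:
        return ingest_type

    # 2. Collect the remaining candidates; JHOVE needs the actual bytes.
    candidates = []
    if bytes_available and jhove_type:
        candidates.append(jhove_type)
    if client_type:
        candidates.append(client_type)

    # 3. Prefer the first candidate that is not generic (assumed tie-break).
    specific = [t for t in candidates if t not in GENERIC_TYPES]
    if specific:
        return specific[0]

    # 4. Otherwise guess from the filename extension.
    extension = os.path.splitext(filename)[1].lower()
    if extension in EXTENSION_TYPES:
        return EXTENSION_TYPES[extension]

    # 5. Fall back to whatever was reported, or to a generic binary type.
    return candidates[0] if candidates else "application/octet-stream"


if __name__ == "__main__":
    # The format-specific check beats JHOVE and the client.
    print(choose_content_type("survey.dta",
                              client_type="application/octet-stream",
                              jhove_type="application/octet-stream",
                              ingest_type="application/x-stata-14"))
    # The client-supplied type is more specific than JHOVE's generic answer.
    print(choose_content_type("table.csv", client_type="text/csv",
                              jhove_type="text/plain"))
    # Direct upload: no bytes, no client type, so the extension decides.
    print(choose_content_type("photo.png", bytes_available=False))
```

How Dataverse actually ranks two competing types is more involved than a single "generic" set; the sketch only shows the order in which the different sources are consulted.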

Hope this helps.

Best,

-Leo

Leonid Andreev

Sep 24, 2024, 11:11:24 AM

to Dataverse Users Community

P.S. Knowing the correct MIME types of files is important in the context of configuring external tools. But I wouldn't say that this is the main or most important use case. There is also the workflow I mentioned earlier, where we extract additional metadata from certain types of files. Generating thumbnails relies on being able to identify image file types. And just for the general purpose of data preservation, being able to identify and describe the data objects in a dataset as precisely as possible has value of its own. In other words, we assume that the MIME type is an important metadata element, worth making an extra effort to determine correctly.
