remediation of misidentified files


Tracy P.

Mar 28, 2023, 3:53:41 PM
to Digital Curation
Hello,

I've used several tools to identify files such as DROID and FITS. These tools report when there is a file extension or file type mismatch; however, they do not correct file extension mismatches.

Does anyone have tool suggestions or workflow examples where file extensions or other misidentified characteristics are reported, modified and documented on batch/large scale?

Thanks
Tracy Popp

Pejsa, Stanislav

Mar 29, 2023, 11:07:47 AM
to digital-...@googlegroups.com
Hello Tracy,

I am not sure it is desirable for the tools to correct the reported
mismatch automatically. I inspect these files on occasion, but in most
cases there is nothing wrong with the mismatch. This happens when DROID
recognises the file signature, but the file has an extension other than
the "conventional" one.

The DROID documentation says on this:
> 3.10 File extension mismatch warning
> Sometimes file extensions are incorrect or missing. If DROID detects
> that the file extension for a file name does not match the format of
> the file it has identified by signature or container signature, it
> will issue a file extension mismatch warning.

We did a project in which DROID identified almost 300,000 files, and only
some 10,000 of them were reported as extension mismatches. Most of those
(85%) were identified as XML but had the extensions vtk and vtp (and also
htm).
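
As a rough illustration of the kind of tally described above, here is a minimal Python sketch that counts mismatch rows in a DROID CSV export, grouped by extension and identified format. The column names (EXT, EXTENSION_MISMATCH, PUID, FORMAT_NAME) are assumed to match the export profile in use and should be checked against your own file; the function name and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def tally_mismatches(csv_file):
    """Count extension-mismatch rows per (extension, PUID, format name)."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Assumed column names; adjust them to match your export.
        if (row.get("EXTENSION_MISMATCH") or "").strip().lower() != "true":
            continue
        ext = (row.get("EXT") or "").strip().lower() or "(none)"
        key = (ext, row.get("PUID") or "", row.get("FORMAT_NAME") or "")
        counts[key] += 1
    return counts


# Made-up sample rows standing in for a real export.
sample = io.StringIO(
    "NAME,EXT,EXTENSION_MISMATCH,PUID,FORMAT_NAME\n"
    "mesh1.vtk,vtk,true,fmt/101,Extensible Markup Language\n"
    "mesh2.vtp,vtp,true,fmt/101,Extensible Markup Language\n"
    "mesh3.vtp,vtp,true,fmt/101,Extensible Markup Language\n"
    "notes.xml,xml,false,fmt/101,Extensible Markup Language\n"
)

# Print the groups from most to least frequent for review.
for (ext, puid, name), n in tally_mismatches(sample).most_common():
    print(f"{n:>6}  .{ext}  {puid}  {name}")
```

A summary like this only groups the warnings for human review. It does not decide whether any given mismatch is actually a problem.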

At least over here, it is individual files rather than batches that
need some intervention, but YMMV.

Best, Standa

--
Stanislav Pejša
data curator @ PURR

Purdue University Libraries
WALC 2032M
340 Centennial Mall Drive
West Lafayette, IN 47907
spe...@purdue.edu
+001 765-496-3736

Popp, Tracy Marie

Mar 30, 2023, 10:32:52 AM
to digital-...@googlegroups.com
Thanks for the reply, Standa! Do you mind if I contact you off list for future discussion?

Thanks
Tracy

Tyler Thorsted

Apr 3, 2023, 12:55:30 PM
to Digital Curation
Tracy, I am curious if this is happening often or for a particular format. 

There are some older formats, such as WordPerfect, where users often used arbitrary extensions such as their initials. In these situations we either ignore the mismatch or append an extension to the end. Early Macintosh-related formats often had no extension at all; in some circumstances we will append a known extension.
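
One way to sketch the "append an extension" approach in Python, so that the original name is kept intact and every change is documented: the helper names, the log layout, and the sample file name below are hypothetical, and the list of files is assumed to have been reviewed by a person first rather than taken straight from a tool's mismatch report.

```python
import csv
import tempfile
from pathlib import Path


def plan_appended_names(paths, new_ext):
    """Return (old, new) pairs that append new_ext and remove nothing."""
    new_ext = "." + new_ext.lstrip(".")
    plan = []
    for p in map(Path, paths):
        if p.name.lower().endswith(new_ext.lower()):
            continue  # already carries the target extension
        plan.append((p, p.with_name(p.name + new_ext)))
    return plan


def apply_plan(plan, log_path, dry_run=True):
    """Write the plan to a CSV log; rename only when dry_run is False."""
    with open(log_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["original_path", "new_path", "action"])
        for old, new in plan:
            if dry_run:
                action = "planned"
            elif new.exists():
                action = "skipped: target exists"  # never overwrite
            else:
                old.rename(new)
                action = "renamed"
            writer.writerow([str(old), str(new), action])


# Demo in a throwaway directory with a made-up file named with initials.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "REPORT.JS"
    sample.write_bytes(b"")
    plan = plan_appended_names([sample], "wpd")
    apply_plan(plan, Path(tmp) / "rename_log.csv", dry_run=False)
    print(sorted(p.name for p in Path(tmp).iterdir()))
```

The default is a dry run that only writes the log, so the proposed names can be checked before anything on disk changes.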

In the case where a format uses an extension that is not in PRONOM but that we find acceptable, we will submit the new extension to PRONOM for consideration.

I also don't believe it is good practice to programmatically change extensions based on results from DROID. 

Tyler Thorsted

Popp, Tracy Marie

Apr 3, 2023, 5:16:29 PM
to digital-...@googlegroups.com

Hi Tyler,

Thanks for the input. We have a fairly large backlog of files with missing or incorrect extensions, so I was hoping for some magic batch-review solution to make the project more manageable in relation to available resources, but I may need to rethink that strategy 😊

The most recent collection contained WordStar files where, per the LoC, “There is no official file extension for WordStar files as authors could create their own. The ones listed here are examples of common extension used.” https://www.loc.gov/preservation/digital/formats/fdd/fdd000552.shtml?loclr=blogsig

I also encounter the issue with legacy Mac files, which frequently do not have extensions. This is a known issue due to how legacy Apple/Mac OSs handled extensions, or rather didn’t require them, working instead with the type and creator codes, as I understand it.

Tracy
