Preserving donors' file names

elizabetha...@ucalgary.ca

Jan 5, 2021, 8:01:08 PM

to Digital Curation

Hi all,

I'm curious what other people/institutions are doing when it comes to preserving the file names that donors have given their files. Obviously this is important provenance information, but I'm also concerned about the length of some file paths, and thinking about the way we use barcodes to uniquely identify physical media like tapes. Having unique identifiers for the items we migrate from physical digital media seems useful, but I'm also concerned about preserving the donors' names for the files.

Thank you for any insight or ideas you can provide!

Elizabeth-Anne Johnson

University of Calgary

Charles Blair

Jan 5, 2021, 9:00:05 PM

to digital-...@googlegroups.com

In a linked-data setting one might do the following.

@prefix premis3: <http://www.loc.gov/premis/rdf/v3/> .

premis3:originalName "G4104-C6-1933-U5-m.tif" ;

"How much of the file path to preserve would be up to the repository." https://www.loc.gov/standards/premis/v3/premis-3-0-final.pdf, p. 86.

The above example is for a file which a department in our institution has named according to established principles (here using the Library of Congress classification for the item) and then deposited, but the manner of recording the information would still apply with a donor-supplied file. I believe that preserving original file names from donors is important because people will often use filenames as metadata, so the filename might preserve important information. In our use case, a depositor might refer to a file by the original name ("Do you have a copy of [filename]?"), not the name the repository has given it; in our case this would involve a unique identifier. If you do decide to record original filenames, please be prepared to handle non-Roman characters and encode them in a standards-compliant, not software-specific, way.
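
One way to keep a donor-supplied name in a software-neutral encoding is to store it as UTF-8 text. What follows is a minimal Python sketch rather than anything described in this thread: the function name `original_name_record` and the JSON keys are hypothetical, and the extra NFC-normalized copy is an assumption intended only to make later lookups more predictable.

```python
import json
import unicodedata


def original_name_record(name: str) -> bytes:
    """Return a UTF-8 encoded JSON record for a donor-supplied file name."""
    record = {
        # The name exactly as received, with no character substitution.
        "originalName": name,
        # Hypothetical convenience field: the same name in Unicode NFC form,
        # so that visually identical names compare equal when searching.
        "originalNameNFC": unicodedata.normalize("NFC", name),
    }
    # ensure_ascii=False writes the characters themselves instead of \u escapes.
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


if __name__ == "__main__":
    print(original_name_record("Cafe\u0301 отчёт 日本語.tif").decode("utf-8"))
```

Keeping the untouched string alongside the normalized one means the record never loses what the donor actually typed.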

--
You received this message because you are subscribed to the Google Groups "Digital Curation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digital-curati...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/digital-curation/318c14c8-6b11-4376-8bfb-1c933f8f89a1n%40googlegroups.com.

--

Charles Blair | Director, Digital Library Development Center, University of Chicago Library | https://www.lib.uchicago.edu/~chas/

Underdown, David

Jan 7, 2021, 8:45:45 PM

to digital-...@googlegroups.com

We always preserve the original filepath (it may well be the only descriptive information we have for the file). Early on we did run into some issues with long paths, but this is becoming less of an issue now that you can enable long paths in Windows 10. When we did hit those limits, we came up with a way of recording the initial parts of the path that were common to many files and replacing them with a placeholder (I forget the exact details now, but could look them up). The system (based around Preservica) automatically generates a UUID for every file ingested, and we also have a catalogue reference for each file.
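
The exact placeholder scheme is not given in the thread, so the following Python sketch is only one possible reading of it. It assumes the paths are already normalized and use forward slashes; the `{ROOT}` marker and the function names `shorten_paths` and `restore_path` are hypothetical.

```python
import posixpath

PLACEHOLDER = "{ROOT}"  # hypothetical marker standing in for the shared prefix


def shorten_paths(paths):
    """Replace the directory prefix shared by all paths with a placeholder.

    Returns the shared prefix (to be recorded once) and the shortened paths.
    Assumes normalized, forward-slash paths with no doubled separators.
    """
    common = posixpath.commonpath(paths)
    shortened = [PLACEHOLDER + p[len(common):] for p in paths]
    return common, shortened


def restore_path(common, shortened):
    """Rebuild the original path from the recorded prefix and a shortened path."""
    return common + shortened[len(PLACEHOLDER):]


if __name__ == "__main__":
    originals = [
        "Donor/Projects 1998-2004/Final versions/report.doc",
        "Donor/Projects 1998-2004/Drafts/report v2.doc",
    ]
    prefix, short = shorten_paths(originals)
    print(prefix)
    for s in short:
        print(s, "->", restore_path(prefix, s))
```

Because the prefix is stored once and each shortened path can be expanded again, no part of the original path is lost.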

In our online catalogue the file name is currently used as the Title element, and we display the rest of the path in the “physical arrangement” field (we’re currently limited to fields that are in ISAD(G)). We’d like to make that a clickable link so that you could easily see everything that was originally in the same folder together, but that hasn’t been possible yet.

--

David Underdown | Senior Digital Archivist
T: 020 3908 9228 | W: nationalarchives.gov.uk
Twitter: @DavidUnderdown9
The National Archives, Kew, Richmond, Surrey TW9 4DU

If you wish to submit a request for information, please use the form provided at: https://www.nationalarchives.gov.uk/contact/

Stephen McConnachie

Feb 16, 2021, 5:13:46 PM

to Digital Curation

At the BFI National Archive we rename all files for ingest to our preservation repository (with a strict filename policy that is validated by our ingest scripts), but we capture the original filepath and filename in our Collections Management System, whether the item is a file or a folder (and we specify which it is).
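
As a rough illustration of keeping the original name alongside a repository-assigned one, here is a Python sketch. It does not reflect the BFI's actual filename policy or systems: the UUID-based new name, the CSV columns, and the function name `build_rename_log` are all assumptions, and the input paths are assumed to use forward slashes.

```python
import csv
import io
import uuid
from pathlib import PurePosixPath


def build_rename_log(original_paths):
    """Return CSV text that maps each new repository name to its original path."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["repository_name", "original_path", "original_filename"])
    for original in original_paths:
        path = PurePosixPath(original)
        # Hypothetical naming rule: a random UUID plus the lower-cased extension.
        new_name = f"{uuid.uuid4()}{path.suffix.lower()}"
        writer.writerow([new_name, original, path.name])
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_rename_log(["Films/1967 Holiday (final).MOV"]))
```

The log can be written before any file is actually renamed, so the mapping exists even if a later ingest step fails.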

Stephen McConnachie

Head of Data and Digital Preservation,

BFI National Archive

Peter Bubestinger

Feb 17, 2021, 1:48:53 PM

to digital-...@googlegroups.com

Hi everyone :)

I completely agree that preserving the *information* of the original
structure/naming of data coming from external sources is important. But for
many reasons, I would advise normalizing the structure/naming to a common syntax.

At the Austrian Mediathek (www.mediathek.at), we preserve the original
folder-structure and filenames (plus filesystem metadata like
timestamps) in a simple plain-text file during the ingest of external
data carriers (HDDs, USB-sticks, etc).

The ingest is done on Linux machines (*), so the listing is done like this:

`ls -laR --time-style=long-iso`

Then everything is renamed during ingest, according to the in-house naming
rules. This is also very (very!) helpful and necessary to avoid issues with
non-ASCII characters in filenames, long paths, etc. throughout the
processing workflow.

The original folder/filename listing is stored next to the final
archival media files.

Kind regards,
Peter Bubestinger-Steindl

(*) Linux was chosen as OS for digital file ingest, because it is able
to read almost all filesystems (including HFS+), we can define a
write-barrier and don't have to worry about incoming viruses ;)
That's the main reason the listing text is in this format.
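
Where GNU `ls` is not available, a comparable listing could be produced with a short script. This is a rough Python sketch and not the Mediathek's tooling: the function name `list_tree` and the tab-separated output (size, modification time in UTC, relative path) are assumptions, and names containing tabs or newlines would need extra escaping.

```python
import os
from datetime import datetime, timezone


def list_tree(root):
    """Return one line per file under root: size, ISO mtime (UTC), relative path."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so the walk order is reproducible
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            info = os.lstat(full)  # lstat: describe symlinks, do not follow them
            mtime = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            rel = os.path.relpath(full, root)
            lines.append(
                f"{info.st_size}\t{mtime.isoformat(timespec='seconds')}\t{rel}"
            )
    return lines


if __name__ == "__main__":
    # Write the listing as UTF-8 text next to the ingested material.
    with open("original-listing.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(list_tree(".")) + "\n")
```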

--
AV-RD e.U.
AudioVisual Research & Development
www.av-rd.com

Tel.: +43 660 200 5734
Stein 115
A-8282 Bad Loipersdorf
UID ATU70313939

nickkrab...@nypl.org

Feb 17, 2021, 1:48:58 PM

to Digital Curation

At NYPL, we retain original filenames and also record each file's name in BagIt manifests while we are doing the manual portions of ingest.
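
To show how a manifest ties a checksum to an original name, here is a minimal Python sketch that produces lines in the style of a BagIt `manifest-sha256.txt`. It is not NYPL's workflow and not a complete bag writer: the function name `manifest_lines` is hypothetical, and the percent-encoding that BagIt 1.0 requires for certain characters in paths is deliberately left out.

```python
import hashlib
import os


def manifest_lines(payload_dir):
    """Return 'checksum  data/relative/path' lines for every file in payload_dir."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(payload_dir):
        dirnames.sort()  # sort in place so the output order is reproducible
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as fh:
                # Read in 1 MiB chunks so large media files are not loaded whole.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            # Manifest paths use forward slashes regardless of the host OS.
            rel = os.path.relpath(full, payload_dir).replace(os.sep, "/")
            lines.append(f"{digest.hexdigest()}  data/{rel}")
    return lines


if __name__ == "__main__":
    for line in manifest_lines("."):
        print(line)
```

Since the file name appears verbatim next to its checksum, the manifest doubles as a record of what each file was called at the time of transfer.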

Previous repository systems have then altered filenames during their part of ingest, by doing things like 'sanitizing characters' or renaming files to a generic UUID. We would prefer to avoid any 'sanitizing' process that converts non-American English characters to similar characters or underscores, both for management/location reasons and because of the ethical issue of delegitimizing the languages of the people who created the filenames. There's a great paper about this by Elvia Arroyo-Ramirez. I can understand the architectural reasons for renaming to a generic UUID, although I think with growing use of object-store infrastructure, this becomes less necessary.

We've also faced some issues with path-length, especially on our Windows workstations. However, as David said, with Windows 10 we don't see the issues anymore. If you are having path-length issues, a potential strategy is to package up a set of files within a tar or zip file. This might impact the way that you serve the files, but it effectively replaces a complex web of directory structures with a single object for the file system to manage.
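
The packaging idea can be sketched with Python's standard `tarfile` module. This is an illustrative sketch under stated assumptions, not NYPL's process: the function name `pack_directory` is hypothetical, and the PAX format is chosen here because it can store long, UTF-8 encoded member names.

```python
import os
import tarfile


def pack_directory(src_dir, tar_path):
    """Pack src_dir into an uncompressed tar file and return its member names."""
    top = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(tar_path, "w", format=tarfile.PAX_FORMAT) as tar:
        # arcname keeps paths inside the archive relative to the top folder.
        tar.add(src_dir, arcname=top)
    # Reopen the archive to confirm which original paths were stored.
    with tarfile.open(tar_path) as tar:
        return tar.getnames()


if __name__ == "__main__":
    for member in pack_directory("donor_files", "donor_files.tar"):
        print(member)
```

The original directory structure survives as member names inside the archive, while the file system only has to handle one short path.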

-Nick

Andreas Romeyke

Mar 4, 2021, 1:56:10 PM

to digital-...@googlegroups.com

Hi,

At the Saxon State Library we use BagIt 1.0 as the SIP format for ingest. The submission application is based on the Perl module Archive::BagIt (available on CPAN) and therefore supports UTF-8 based file paths (it also works on Windows 10). Our archive system uses its own file paths internally for the AIPs and keeps the original file path information as metadata.

With best regards

Andreas

SpectateSwamp Original

Jul 15, 2021, 2:14:12 PM

to Digital Curation

We scanned the family albums 20+ years ago.

Gave them all 8 character names. The name included the sequence # we put on the back of each picture....

later

I took the catalog info and added it in front of the initial 8 characters...

The name stays with the pic or video.
