Invalid Encoding of Directories and Files

Jarrett Drake

Jun 7, 2016, 3:53:38 PM

to Digital Curation

Hello all,

At my library, I am assisting a colleague working with over 150 3.5-inch floppy disks of an author from Latin America. My colleague has already created images for each (with the exception of 1) disk and is now ready to create the SIP, for which we use Bagger, for deposit into our preservation environment. However, Bagger is unable to bag those directories and files with invalid encoding errors. We suspect these messages stem from the heavy usage of accent marks in many of the folder and file names. In the cases we've experimented with, the files can actually be opened and read just perfectly, so the invalid encoding is not presenting an access issue. But we need to find a way to rectify the encoding so as to proceed with later preservation activities.

Have you encountered these issues with non-English-language materials from other countries? And if so, how did you resolve the issues programmatically? I've been trying my luck with a few different scripts found online, but to no avail so far. The total number of files on all disks is close to 5,000, and about 15-20 percent of them have encoding issues, meaning the only viable solution will be a scripted one. Alternatively, if you can recommend a particular resource for learning more about the matter, that too would be of assistance. Any other advice or ideas are also appreciated. Thanks!

Best,
Jarrett

Kevin Hawkins

Jun 7, 2016, 7:47:05 PM

to digital-...@googlegroups.com

Unfortunately, software will sometimes choke on or mangle non-ASCII characters in filenames. Aside from finding another tool to create bags or fixing the bug in Bagger, the workaround that comes to mind would be to run a script on the files that replaces each non-ASCII character with an "unaccented" ASCII version. This is, I admit, not at all ideal from a preservation point of view, but you may need to compromise on ideals for expediency. Here are some leads on finding a script to replace characters:

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=remove+accents+filenames
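For what it's worth, the replacement itself can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not any particular script from the search results; the function name `to_ascii_name` and the choice of `_` as a fallback character are my own assumptions. It also assumes the filename has already been decoded correctly into a Python string.

```python
import unicodedata


def to_ascii_name(name: str) -> str:
    """Return `name` with accents removed; leftover non-ASCII becomes '_'."""
    # NFKD splits an accented letter into its base letter followed by a
    # combining mark, e.g. "é" becomes "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Characters with no decomposition (for example "ø") are still non-ASCII.
    return "".join(ch if ch.isascii() else "_" for ch in stripped)


print(to_ascii_name("canción_año.txt"))
```

Two cautions if you go this way: different original names can collapse into the same ASCII name, and the result can change the meaning of a word, so it is worth writing out a log of old and new names alongside the renamed files.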

Good luck!

Kevin

Bertram Lyons

Jun 7, 2016, 10:47:38 PM

to digital-...@googlegroups.com

Hi Jarrett --

If the disks were imaged, and are stored as disk images, are you also
extracting files from the disk images and storing them outside of the
serialized image?

In this scenario, is it possible for you to store the disk image
itself as a single object (and to Bag it), instead of extracting all
files separately for bagging? This would remove the current obstacle
with file names from within the original floppy filesystems. You
could use a schema such as dfxml to write out a list of filenames and
internal directory structures for discovery purposes, but store the
files themselves as part of a single raw disk image.
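To make the idea concrete, here is a rough sketch of what a bag holding one disk image looks like on disk, written by hand with the Python standard library. It is not how Bagger or the bagit libraries do it, and the function name `bag_disk_image` is made up for illustration. It assumes the image file itself has a plain ASCII name (e.g. `disk001.img`), which is the whole point of bagging the image rather than its contents, and it uses the 0.97 version string from the BagIt draft current at the time of writing.

```python
import hashlib
import shutil
from pathlib import Path


def bag_disk_image(image: Path, bag_dir: Path) -> Path:
    """Create a minimal BagIt-style bag whose only payload file is `image`."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)

    # Copy the raw image into the payload directory.
    payload = data_dir / image.name
    shutil.copy2(image, payload)

    # Checksum the payload for the manifest.
    digest = hashlib.sha256(payload.read_bytes()).hexdigest()

    # Bag declaration: version plus the encoding used for the tag files.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # Payload manifest: one "<checksum>  <path>" line per payload file.
    (bag_dir / "manifest-sha256.txt").write_text(
        f"{digest}  data/{image.name}\n", encoding="utf-8"
    )
    return bag_dir
```

Calling `bag_disk_image(Path("disk001.img"), Path("disk001_bag"))` would give a directory with `bagit.txt`, `manifest-sha256.txt`, and `data/disk001.img`. The accented names never reach the bagging step because they stay inside the image.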

The complication in this area, however, would be the inability to
easily perform format migration tasks if such tasks were ever called
for, without first extracting the target file from within the stored
disk image file.

Perhaps I am missing something in your explanation, though. Is this at
all useful as a possible way forward?

Best --

Bert

______________________________________

Bertram Lyons, CA
AVPreserve
634 W. Main St., Ste 202
Madison, Wisconsin 53703

office: 202-430-4457

http://www.avpreserve.com
Facebook.com/AVPreserve
twitter.com/AVPreserve

Michael Kjörling

Jun 8, 2016, 9:25:18 AM

to digital-...@googlegroups.com

On 7 Jun 2016 12:53 -0700, from jarrett...@gmail.com (Jarrett Drake):

> At my library, I am assisting a colleague working with over 150 3.5-inch
> floppy disks of an author from Latin America. My colleague has already
> created images for each (with the exception of 1) disk and is now ready to
> create the SIP, for which we use Bagger, for deposit into our preservation
> environment. However, Bagger is unable to bag those directories and files
> with invalid encoding errors. We suspect these messages stem from the heavy
> usage of accent marks in many of the folder and file names.

It's a bit of an aside, and this might not help you at all, but are
these disks formatted using FAT? In FAT, file names were simply 11
byte strings, and really only the space character (ASCII 32 decimal)
and perhaps NULL (ASCII 0) were particularly special. Other characters
were simply stored as-is, and at least MS-DOS could be reconfigured to
use a wide variety of "code pages" and keyboard maps to map input to
bytes. (A common trick to make an "inaccessible" directory was to name
it with an ASCII 255 decimal character at the end. That displayed as a
blank space, so there was no visual indication that it was there, but
it was not a space so you had to enter it manually to enter the
directory using the command line. Of course, that trick offered no
protection whatsoever against someone armed with a file manager.)

If the disks are formatted as FAT, then you would need to determine
the code page that was used for the file names (usually, this was one
code page covering one or more language/country combinations,
providing the character support commonly needed), and work from that.
If you are storing the files extracted from the disks, you could use
this information to convert the file names to their equivalents in a
modern encoding such as for example UTF-8.
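As a rough sketch of that conversion in Python, assuming the code page has already been determined (CP850 is used here purely as a placeholder default) and assuming the extracted files sit on a POSIX-style filesystem where names are raw bytes. The function names `recode_name` and `rename_tree` are hypothetical, not from any existing tool.

```python
import os


def recode_name(raw: bytes, source_encoding: str = "cp850") -> bytes:
    """Re-encode one filename from a legacy code page to UTF-8."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8 (this includes plain ASCII)
    except UnicodeDecodeError:
        return raw.decode(source_encoding).encode("utf-8")


def rename_tree(root: bytes, source_encoding: str = "cp850", dry_run: bool = True):
    """Walk `root` bottom-up; return (old, new) pairs and optionally rename."""
    changes = []
    # Passing a bytes path makes os.walk return names as undecoded bytes.
    # Bottom-up order means a directory is renamed only after its contents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            new_name = recode_name(name, source_encoding)
            if new_name != name:
                old_path = os.path.join(dirpath, name)
                new_path = os.path.join(dirpath, new_name)
                changes.append((old_path, new_path))
                if not dry_run:
                    os.rename(old_path, new_path)
    return changes
```

With `dry_run=True` (the default here) nothing is renamed and the returned list can be reviewed or saved as a record of the changes. One weakness of the "already valid UTF-8" check is that a legacy name can occasionally happen to be valid UTF-8 by coincidence, so the output still deserves a look by eye.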

--
Michael Kjörling • https://michael.kjorling.se • mic...@kjorling.se
“People who think they know everything really annoy
those of us who know we don’t.” (Bjarne Stroustrup)

Ferran Jorba

Jun 8, 2016, 9:59:28 AM

to digital-...@googlegroups.com

Hello Jarrett,

At Tue, 7 Jun 2016 12:53:38 -0700 (PDT), Jarrett Drake wrote:
>
> At my library, I am assisting a colleague working with over 150
> 3.5-inch floppy disks of an author from Latin America. My colleague
> has already created images for each (with the exception of 1) disk
> and is now ready to create the SIP, for which we use Bagger, for
> deposit into our preservation environment. However, Bagger is
> unable to bag those directories and files with invalid encoding
> errors. We suspect these messages stem from the heavy usage of
> accent marks in many of the folder and file names. In the cases
> we've experimented with, the files can actually be opened and read
> just perfectly, so the invalid encoding is not presenting an access
> issue. But we need to find a way to rectify the encoding so as to
> proceed with later preservation activities.

If you are copying those files to a Linux box, then you may use detox
(http://detox.sourceforge.net/), also available for Debian
(http://packages.debian.org/detox), and thus, for any Debian
derivative.

Detox can 'clean' filenames, removing diacritics and any problematic
characters, even recursing into directories. As those floppies come
from Latin America, they are most likely encoded in CP850 and, that
being a close relative of ISO-8859-1, the default values of detox's
iso8859_1 rules should do a good job.
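Before committing to one code page, it may be worth decoding a few of the problem names under several candidates and seeing which reading looks right. The sketch below uses only Python's built-in codecs and says nothing about how detox itself behaves. The sample bytes and the function name `decode_candidates` are invented for illustration.

```python
def decode_candidates(raw: bytes, encodings=("utf-8", "cp850", "cp437", "latin-1")):
    """Decode `raw` under each candidate encoding; None marks a failed decode."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# A made-up filename as it might be stored on a DOS-era floppy.
for enc, text in decode_candidates(b"canci\xa2n.txt").items():
    print(f"{enc:8} {text!r}")
```

For this sample, the CP850 and CP437 readings give "canción.txt", the ISO-8859-1 reading gives a cent sign in place of the "ó", and UTF-8 fails outright. So the character repertoires are similar, but the byte values are not interchangeable.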

Best regards,

Ferran Jorba
Universitat Autònoma de Barcelona

Simon Spero

Jun 8, 2016, 10:17:35 AM

to digital-...@googlegroups.com

It's not an aside, it's front and center :-)

BagIt allows the use of any character encoding as long as it's UTF-8, and any character in the Unicode repertoire except for byte order markers (BOMs) [asterisk end of line markers].

Bagger uses Java strings, which are UTF-16 internally and which, unless otherwise specified, are decoded using a configurable default decoder. Multiple decoders can be used in the same application, but it can be a little bit trickier to wrangle things in file name translation. Not impossible, but trickier.

Important diagnostic information:

1. Is there a stack trace for where the error is being thrown?
2. What type of file system did the files come from?
3. What type of file system are the files unpacked onto?
4. Do the filenames show up properly when looking at them on the bagging machine (with accents)?
5. Is there a sample disk image with problematic files you can provide?
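To help answer the questions about the unpacked files, something like the following Python sketch could list exactly which names are not valid UTF-8. It assumes a POSIX-style filesystem where names are stored as raw bytes (on Windows, names are already Unicode and this check would not tell you much). The function names are hypothetical.

```python
import os


def is_valid_utf8(raw: bytes) -> bool:
    """True if `raw` decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_invalid_names(root: str):
    """Return the raw byte paths under `root` whose names are not valid UTF-8."""
    bad = []
    # os.fsencode gives a bytes path, so os.walk yields undecoded names.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                bad.append(os.path.join(dirpath, name))
    return bad
```

Printing each result with `repr()` shows the offending bytes as escapes such as `\xa2`, which is useful evidence when working out the original code page.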

Simon

Andrew Berger

Jun 8, 2016, 10:56:28 AM

to digital-...@googlegroups.com

Hi Jarrett,

I'm guessing you've tried this, but in case you haven't, have you tried running the python bagit script instead of Bagger, which I think uses the Java library? I seem to remember when I was doing testing that python bagit handled encoding better than Java bagit. (I was running bagit directly from the command line, not using Bagger, however.) We haven't had to deal with non-ASCII encodings in production (yet), so it's been a while since I tried.

Alternatively, and this is possibly excessive for your needs, you could run Archivematica purely as a packaging tool. Archivematica runs a filename sanitization script that generates a log of filename changes. Since an Archivematica AIP is just a bag, you could take those "AIPs" and turn them around for deposit into your preservation environment. It might be possible to pull the filename sanitization micro-service out of Archivematica to run by itself, but I'm not sure how much effort that would involve.

Best,

Andrew

Jarrett Drake

Jun 12, 2016, 12:09:28 PM

to Digital Curation

Hi all,

Thanks for your questions and suggestions. To recap, my colleague and I found and executed the process described at the following link: https://www.leaseweb.com/labs/2013/12/file-name-%EF%BF%BD-invalid-encoding/. It worked perfectly and allowed the bag to be created and validated. I hope this option will aid others.

Best,

Jarrett

John Scancella

Jul 6, 2016, 9:32:27 AM

to Digital Curation

@Simon The BagIt spec just specifies UTF-8 for the bagit.txt file. In that file you can then specify the character encoding of the other files. See https://tools.ietf.org/html/draft-kunze-bagit-13#section-2.1.1
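For anyone who wants to check what a given bag declares, here is a small Python sketch that reads the two fields from the text of a bagit.txt file. The function name `read_bag_declaration` is made up, and the sample text is simply the typical two-line declaration, not taken from a real bag.

```python
def read_bag_declaration(text: str) -> dict:
    """Parse the 'Label: value' lines of a bagit.txt file into a dict."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            label, value = line.split(":", 1)
            fields[label.strip()] = value.strip()
    return fields


declaration = read_bag_declaration(
    "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
)
print(declaration["Tag-File-Character-Encoding"])
```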