Glob, encoding & rename

Ted Nolan <tednolan>

Jun 8, 2018, 4:38:33 PM

I'm running FreeBSD, and after years of hanging on to it,
have finally left the "C" locale and its nice collations
for the LANG=en_US.UTF-8 locale.

To somewhat complicate matters, over the years, I have ripped a
number of CDs (classical ones are the most problematic in this
respect) where the names of the tracks, or the artists involve
European letters. It appears that in those cases, the filenames
generated are encoded in iso8859-1 and stored that way in the
Unix directory entries.

This causes me occasional problems as the "ls" command now
(after my locale shift) displays those files with '?' characters
for the characters over 0x7f, and sometimes other operations fail
as well.

I would like to write a Tcl script to look at all the files in a
directory and rename them from the iso8859-1 encoding to the UTF-8
encoding.

I'm a little confused about what "glob {*}" does in the face of
files which do not have valid names in the current system locale.

I wrote a brief test script to see what glob returns, and it does
return an entry for each file in the directory. Furthermore, I can
transcode that entry and then create a new (empty) file with the
same name, but in valid utf-8:

#!/usr/local/bin/tclsh8.6

proc main {} {
    foreach f [glob {*}] {
        set f [encoding convertfrom iso8859-1 $f]
        close [open "/tmp/t/$f" w]
        puts $f
    }
    exit 0
}

main

The generated files in /tmp/t have names which display correctly
with "ls".

So:

1) Should glob be telling me it's giving me names that aren't
valid in the current encoding?

2) If not, is there some way I can test a name to see if it is
valid UTF-8 or not? (Ideally, I would like to run a recursive
script that only renames iso8859-1 encoded filename files).

3) Is "file rename" going to work reliably if I give it the name
from glob on the left and the transcoded name on the right?
--
------
columbiaclosings.com
What's not in Columbia anymore..

Robert Heller

Jun 8, 2018, 5:27:51 PM

I would say yes. Filenames are valid if they contain valid characters --
under UNIX (BSD or Linux), that is pretty much anything except a NUL or a '/'.
The *file system* does not actually care about the current encoding. That is
left to whatever software is presenting the strings to the user. Glob is
basically a wrapper for file system API calls, which don't themselves bother
with the locale or encoding system currently in effect.

>
> 2) If not, is there some way I can test a name to see if it is
> valid UTF-8 or not? (Ideally, I would like to run a recursive
> script that only renames iso8859-1 encoded filename files).

Not sure...

>
> 3) Is "file rename" going to work reliably if I give it the name
> from glob on the left and the transcoded name on the right?

It should. But make a backup first, just in case.

--
Robert Heller -- 978-544-6933
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
hel...@deepsoft.com -- Webhosting Services

Rich

Jun 8, 2018, 5:51:50 PM

Ted Nolan <tednolan> <t...@loft.tnolan.com> wrote:
>
> I'm running FreeBSD, and after years of hanging on to it,
> have finally left the "C" locale and its nice collations
> for the LANG=en_US.UTF-8 locale.
> ...
> #!/usr/local/bin/tclsh8.6
>
> proc main {} {

You might want to explicitly set Tcl's 'system' encoding here to
8859-1, otherwise if Tcl's system encoding is utf-8, and your files
contain 8859-1 encoded characters, you might not always get what you
might expect or want.

> foreach f [glob {*}] {
>
> set f [encoding convertfrom iso8859-1 $f]
> close [open "/tmp/t/$f" w]
> puts $f
> }
>
> exit 0
> }
>
> main
>
> The generated files in /tmp/t have names which display correctly
> with "ls".
>
> So:
>
> 2) If not, is there some way I can test a name to see if it is
> valid UTF-8 or not? (Ideally, I would like to run a recursive
> script that only renames iso8859-1 encoded filename files).

Well, you can test the bytes to see if they comply with the encoding
rules of utf-8. There are some details here: https://wiki.tcl.tk/1211.
But you'd have to write your own test for this, as I do not think Tcl
has anything built in that will test a string for utf-8 encoding validity.

There is, however, one simple rule you can apply. If *all* the bytes
of the filename have values of 0x7f or smaller, then the bytes
are valid UTF-8 (and valid ASCII, and valid 8859-1). This also means
you do not need to do anything with filenames whose bytes are all
valued 0x7f or less, because the code point to character mapping for
0x7f and below is identical for Unicode, ASCII, and 8859-1. This one
rule would eliminate you needing to do anything to files that have no
need to be translated.
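
Both checks can be sketched in a few lines of Tcl. This is only an illustration, not a tested tool: the proc names isAscii and isValidUtf8 are invented here, and the code assumes the name is held as a byte string, that is, one Tcl character per byte with every value in the range 0x01-0xff (a filename cannot contain a NUL). The pattern follows the well-formed byte sequences of the UTF-8 definition, so overlong forms and surrogates are rejected.

```tcl
# Hypothetical helpers; the names are invented for this sketch.

# True if every byte is 0x7f or below (plain ASCII, nothing to convert).
proc isAscii {bytes} {
    return [string is ascii $bytes]
}

# True if the byte string consists only of well-formed UTF-8 sequences.
proc isValidUtf8 {bytes} {
    # One alternative per legal sequence shape.
    set seq [join {
        {[\x01-\x7f]}
        {[\xc2-\xdf][\x80-\xbf]}
        {\xe0[\xa0-\xbf][\x80-\xbf]}
        {[\xe1-\xec\xee\xef][\x80-\xbf]{2}}
        {\xed[\x80-\x9f][\x80-\xbf]}
        {\xf0[\x90-\xbf][\x80-\xbf]{2}}
        {[\xf1-\xf3][\x80-\xbf]{3}}
        {\xf4[\x80-\x8f][\x80-\xbf]{2}}
    } |]
    return [regexp -- "^(?:$seq)*\$" $bytes]
}

puts [isAscii "track01.flac"]                               ;# pure ASCII name
puts [isValidUtf8 [encoding convertto utf-8 "caf\u00e9"]]   ;# UTF-8 bytes c3 a9
puts [isValidUtf8 "caf\xe9"]                                ;# lone Latin-1 byte e9
```

A name that passes isValidUtf8 may still be Latin-1 text that happens to form legal UTF-8 sequences, so treat the result as a strong hint rather than proof.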

> 3) Is "file rename" going to work reliably if I give it the name
> from glob on the left and the transcoded name on the right?

I'd suggest maybe instead using 'file link' and creating a new hard
link under a slightly different name (maybe an extra extension). This
way you do not attempt to change the original filename at all (so no
risk of damaging it), you just create a new name pointing at the same
disk blocks. Then after you verify the new names are ok, you can bulk
remove the old names and bulk rename (remove extra extension, if that
was what you used) the new names from the shell.
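
A rough sketch of that hard-link approach follows. The proc names utf8Name and linkAsUtf8 and the ".utf8" suffix are made up for illustration. It assumes Tcl's system encoding is set to iso8859-1, so that each character of a name corresponds to exactly one byte on disk both when glob reads names and when file link creates them, and it assumes every non-ASCII name in the directory really is 8859-1.

```tcl
# Hypothetical helpers; the proc names and the ".utf8" suffix are invented
# for this sketch.

# Return a Latin-1 name re-encoded as UTF-8, one character per output byte.
proc utf8Name {latin1Name} {
    return [encoding convertto utf-8 $latin1Name]
}

# Hard-link every non-ASCII regular file in $dir under a UTF-8 encoded name.
# Assumes every such name is currently Latin-1; a name that is already UTF-8
# would be double-encoded, so filter those out first if any exist.
proc linkAsUtf8 {dir} {
    # With this system encoding each character of a name maps to exactly
    # one byte on disk. The setting stays in effect after the proc returns.
    encoding system iso8859-1
    set made {}
    foreach old [glob -nocomplain -directory $dir -types f *] {
        set tail [file tail $old]
        if {[string is ascii $tail]} continue
        set new [file join $dir [utf8Name $tail].utf8]
        file link -hard $new $old
        lappend made $new
    }
    return $made
}

# Usage (not run here): linkAsUtf8 /path/to/ripped/cds
```

The original names are never touched; the returned list of new names can be checked with "ls" before anything is removed or renamed.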

briang

Jun 12, 2018, 11:09:43 PM

On Friday, June 8, 2018 at 2:51:50 PM UTC-7, Rich wrote:
> Ted Nolan <tednolan> <t...@loft.tnolan.com> wrote:
> >
> > So:
> >
> > 2) If not, is there some way I can test a name to see if it is
> > valid UTF-8 or not? (Ideally, I would like to run a recursive
> > script that only renames iso8859-1 encoded filename files).
>
> Well, you can test the bytes to see if they comply with the encoding
> rules of utf-8. There's some details here: https://wiki.tcl.tk/1211.
> But you'd have to write your own test for this, as I do not think Tcl
> has any built in that will test a string for utf-8 encoding validity.
>
> There is, however, one simple rule you can apply. If *all* the bytes
> of the filename have code point values of 7f or smaller, then the bytes
> are valid UTF-8 (and valid ASCII, and valid 8859-1). This also means
> you do not need to do anything with filenames with bytes that are all
> valued less than 7f, because the code point to character mapping for 7f
> and below is identical for Unicode, ASCII, and 8859-1. This one rule
> would eliminate you needing to do anything to files that have no need
> to be translated.

I believe it is also true that all the 8859-1 printable characters do not interfere with utf-8 special characters, which is why Tcl manages to "do-the-right-thing".

I think you can get away with [encoding convertto utf-8 [encoding convertfrom iso8859-1 $fname]], where $fname is an element returned by [glob].

-Brian

Rich

Jun 13, 2018, 7:36:49 AM

briang <bgriffin...@gmail.com> wrote:
> On Friday, June 8, 2018 at 2:51:50 PM UTC-7, Rich wrote:
>> Ted Nolan <tednolan> <t...@loft.tnolan.com> wrote:
>> >
>> > So:
>> >
>> > 2) If not, is there some way I can test a name to see if it is
>> > valid UTF-8 or not? (Ideally, I would like to run a recursive
>> > script that only renames iso8859-1 encoded filename files).
>>
>> Well, you can test the bytes to see if they comply with the encoding
>> rules of utf-8. There's some details here:
>> https://wiki.tcl.tk/1211. But you'd have to write your own test for
>> this, as I do not think Tcl has any built in that will test a string
>> for utf-8 encoding validity.
>>
>> There is, however, one simple rule you can apply. If *all* the
>> bytes of the filename have code point values of 7f or smaller, then
>> the bytes are valid UTF-8 (and valid ASCII, and valid 8859-1). This
>> also means you do not need to do anything with filenames with bytes
>> that are all valued less than 7f, because the code point to
>> character mapping for 7f and below is identical for Unicode, ASCII,
>> and 8859-1. This one rule would eliminate you needing to do
>> anything to files that have no need to be translated.
>
> I believe it is also true that all the 8859-1 printable characters do
> not interfere with utf-8 special characters, which is why Tcl manages
> to "do-the-right-thing".

There are not "special characters". There are just "characters". The
code point assignments between 8859-1 printables and Unicode below code
point 0x100 may be identical, so the code points represent the same
characters, but the byte level encoding differs between the two for the
same code point.
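
A small demonstration of that difference, using a sample name chosen for this example ("caf" followed by e-acute): the code point stays U+00E9, but it is the single byte e9 in 8859-1 and the two bytes c3 a9 in UTF-8. The variable names are invented for the sketch.

```tcl
# "caf" + e-acute as ISO 8859-1 bytes: 63 61 66 e9
set latin1Bytes [binary format H* 636166e9]

# Decode the bytes to characters, then encode those characters as UTF-8.
set chars [encoding convertfrom iso8859-1 $latin1Bytes]
set utf8Bytes [encoding convertto utf-8 $chars]

# Show the code point of the last character and the UTF-8 bytes in hex.
binary scan $utf8Bytes H* utf8Hex
puts [format "code point of last char: U+%04X" [scan [string index $chars end] %c]]
puts "UTF-8 bytes: $utf8Hex"
```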

> I think you can get away with [encoding convertto utf-8 [encoding
> convertfrom iso8859-1 $fname]], where $fname is an element returned
> by [glob].

But, likely only if you set Tcl's system encoding to iso8859-1 before
running glob. Otherwise, if Tcl's system encoding is utf-8 it will
have already tried to perform effectively an "encoding convertfrom
utf-8" on the filenames before you get them back from glob. And
attempting to convert 8859-1 bytes by utf-8 decoding them will not work
reliably for anything above 0x7f, as bytes in the range 0x80-0xff are
valid utf-8 only as part of a well-formed multi-byte sequence, which
8859-1 text will rarely happen to form.

