
File name encoding problems (again)


Gerhard Reithofer

Apr 7, 2018, 10:09:04 AM
Hi,
I have large old unix file archives which use char \xb1 in their file
names, example: 33310±±±±±±17.11.05±±±±17UH.file
The terminal display is (UTF8 mode):
$ ls 33310*
33310??????17.11.05????17UH.file

According to the UTF-8 table the +/- char is c2b1, and the plus/minus char
is only shown in the Tcl shell window.
If I want to access such a file via a Tcl file function (like [file exists ...]),
I get an error message:

% set fn [glob 33310*]
33310±±±±±±17.11.05±±±±17UH.file
% file exists $fn
could not read "33310±±±±±±17.11.05±±±±17UH.file": no such file or
directory

Converting the encoding did not help either.

% set new [encoding convertfrom utf-8 $fn]
33310±±±±±±17.11.05±±±±17UH.file
% file exists $new
0

% expr {$fn eq $new}
1

Any idea what to do?
How can I read these "unencoded" file names?

TIA,
Gerhard

--
Gerhard Reithofer - Techn. EDV Reithofer - http://www.tech-edv.co.at

Rich

Apr 7, 2018, 10:45:48 AM
Gerhard Reithofer <gerhard....@tech-edv.co.at> wrote:
> Hi,
> I have large old unix file archives which use char \xb1 in their file
> names, example: 33310±±±±±±17.11.05±±±±17UH.file
> The terminal display is (UTF8 mode):
>
> Any idea what to do?
> How can I read these "unencoded" file names?

Here are a couple of possible solutions:

1) Use Unix/Linux tools to rename the files, changing the ± characters
to something else.

2) Given that these characters are invalid UTF-8, but valid iso8859-1,
you might try changing Tcl's system encoding to iso8859-1 before you
try to read the names. On my Linux system with iso8859-1 encoding I
can create and read this character just fine:

$ rlwrap tclsh
% encoding system
iso8859-1
% set name "a\xb1c"
a±c
% set fd [open $name {WRONLY CREAT TRUNC}]
file5
% close $fd
% file exists $name
1
% set name2 [glob a*c]
a±c
% file exists $name2
1
%

You can get the current system encoding with [encoding system] (as I
show above). You can tell Tcl to use a different system encoding with
"encoding system ?encoding?" as per the man page. So:

encoding system iso8859-1

before you try globbing these names might work wonders. Or it might
not. I can only test on my system.

Gerhard Reithofer

Apr 7, 2018, 11:58:01 AM
Hi Rich,
On Sat, 7 Apr 2018, Rich wrote:
> Gerhard Reithofer <gerhard....@tech-edv.co.at> wrote:
> > Hi,
> > I have large old unix file archives which use char \xb1 in their file
> > names, example: 33310±±±±±±17.11.05±±±±17UH.file
> > The terminal display is (UTF8 mode):
> >
> > Any idea what to do?
> > How can I read these "unencoded" file names?
>
> Here are a couple of possible solutions:
>
> 1) Use Unix/Linux tools to rename the files, changing the ± characters
> to something else.

The shell has similar problems, as the names are not "displayed"
correctly.

> 2) Given that these characters are invalid UTF-8, but valid iso8859-1,
> you might try changing Tcl's system encoding to iso8859-1 before you
> try to read the names, on my Linux system with iso8859-1 encoding I
> can create and read this character just fine:
>
> $ rlwrap tclsh
> % encoding system
> iso8859-1
> % set name "a\xb1c"
> a±c

This also worked on my system, with identical output, without changing
the encoding. BUT as my system encoding is utf-8, a file name with the
following byte sequence was created:

$ ls a*c | od -h
0000000 c261 63b1 000a
0000005

You see it is: "a" 0xc2b1 "c" "\n" - 0xc2b1 is the utf-8 code for the
plus/minus char.

...

> You can get the current system encoding with [encoding system] (as I
> show above). You can tell Tcl to use a different system encoding with
> "encoding system ?encoding?" as per the man page. So:
>
> encoding system iso8859-1

Yeeeaaah! That did it :-) :-)

> before you try globbing these names might work wonders.

Globbing is the only way to read the file name into a Tcl variable (I
use it for my tests), but the "wrong" byte sequence was created, using
the 0xc2b1 characters, and with this name the file can't be addressed.

Setting the system encoding to iso8859-1 leaves the name as it was
(that is my assumption; at least it works for this character).

My repair code is:

encoding system iso8859-1
set pat 3331*
set org [glob $pat]
set ect [encoding convertto utf-8 $org]
file rename $org $ect

Now it looks exactly like on the old unix machines and all plus/minus
are converted to system readable utf-8 characters.

Now I have to loop over several thousand files to rename them all, and
it "looks like in ancient times" ;-)

BTW: Is there a way to "guess" an encoding system by looking at files?

Thank you very, very much,

Rich

Apr 7, 2018, 2:01:58 PM
Gerhard Reithofer <gerhard....@tech-edv.co.at> wrote:
> Hi Rich,
> On Sat, 7 Apr 2018, Rich wrote:
>> Gerhard Reithofer <gerhard....@tech-edv.co.at> wrote:
>> > Hi,
>> > I have large old unix file archives which use char \xb1 in their file
>> > names, example: 33310±±±±±±17.11.05±±±±17UH.file
>> > The terminal display is (UTF8 mode):
>> >
>> > Any idea what to do?
>> > How can I read these "unencoded" file names?
>>
>> Here are a couple of possible solutions:
>>
>> 1) Use Unix/Linux tools to rename the files, changing the ± characters
>> to something else.
>
> The shell has similar problems as the things are not "displayed"
> correctly.

Display and the shell's ability to modify the bytes are orthogonal.
Look into the $'' expansion in Bash; it allows you to specify the exact
byte values present in the names. A PIA, yes, but possible.

>> 2) Given that these characters are invalid UTF-8, but valid
>> iso8859-1, you might try changing Tcl's system encoding to
>> iso8859-1 before you try to read the names, on my Linux system
>> with iso8859-1 encoding I can create and read this character just
>> fine:
>>
>> $ rlwrap tclsh
>> % encoding system
>> iso8859-1
>> % set name "a\xb1c"
>> a±c
>
> This worked also on my system with the identical output without
> changing the encoding, BUT as I have a system encoding utf-8 a file
> name with the following byte sequence was created.
>
> $ ls a*c | od -h
> 0000000 c261 63b1 000a
> 0000005
>
> You see it is: "a" 0xc2b1 "c" "\n" - 0xc2b1 is the utf-8 code for the
> plus/minus char.

Yes, the plus/minus in iso8859-1 is not the same byte sequence as the
plus/minus character in UTF-8. As well, a standalone 0xb1 is not a
valid UTF-8 byte sequence, which is why it gets messed up when Tcl's in
UTF-8 mode for the system encoding.
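
To make the byte difference concrete, here is a small sketch, assuming
Tcl 8.6 or later (for [binary encode hex]). It only prints how the same
plus/minus character is encoded under each of the two encodings; the
variable names are just for illustration.

```tcl
# U+00B1 is the plus/minus character.
set pm "\u00b1"

# Encode the same character both ways and show the resulting bytes as hex.
set latin1_hex [binary encode hex [encoding convertto iso8859-1 $pm]]
set utf8_hex   [binary encode hex [encoding convertto utf-8 $pm]]

puts "iso8859-1: $latin1_hex"   ;# prints "iso8859-1: b1"
puts "utf-8:     $utf8_hex"     ;# prints "utf-8:     c2b1"
```

The single byte b1 is what the old archives have in their file names;
the pair c2 b1 is what a UTF-8 system expects for the same character.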

>> You can get the current system encoding with [encoding system] (as I
>> show above). You can tell Tcl to use a different system encoding
>> with "encoding system ?encoding?" as per the man page. So:
>>
>> encoding system iso8859-1
>
> Yeeeaaah! That did :-) :-)
>
>> before you try globbing these names might work wonders.
>
> Globbing is the only way to read the file name into a tcl variable

Well, no, not the 'only' way. This also works:

# exec does not expand wildcards, so let find match the pattern itself
set name [exec find . -name a*c]

and this:

set fd [open "|find . -name a*c"]
set name [gets $fd]
close $fd

But glob is the best integrated into Tcl... :)

> use it for my tests) but the "wrong" byte sequence was created, using
> the 0xc2b1 characters and with this name it can't be addressed.

Tcl was using UTF-8 for the system encoding. That is correct for
file names encoded as UTF-8, but not for file names encoded using
iso8859-1.

> Setting the system encoding to iso8859-1 leaves the name as it was
> (assumption and at least it works for this character).

It tells Tcl that when it reads the filenames, it should interpret
their bytes as iso8859-1 characters, so you get the correct bytes in
your variable to match the filenames on disk.

> My repair code is:
>
> encoding system iso8859-1
> set pat 3331*
> set org [glob $pat]
> set ect [encoding convertto utf-8 $org]
> file rename $org $ect

I'd recommend wrapping that in:

set old_encoding [encoding system]
...
encoding system $old_encoding

to put the system encoding back to what it was originally set to, which
prevents trouble elsewhere.


> Now it looks exactly like on the old unix machines and all plus/minus
> are converted to system readable utf-8 characters.
>
> Now I have to loop over several 1000 files for renaming all files
> and it "looks like in ancient times" ;-)

A pain, but not hard to do.
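
A rough sketch of such a loop, combining the repair code with the
save/restore of the system encoding. It assumes Tcl 8.6 (for
try/finally) and that every matching name is stored as iso8859-1 bytes;
a name that is already UTF-8 would be encoded a second time, so test on
a copy first. The proc name rename_latin1_to_utf8 is made up for this
example.

```tcl
# Hypothetical helper: rename all files matching $pattern from
# iso8859-1 encoded names to UTF-8 encoded names.
proc rename_latin1_to_utf8 {pattern} {
    set old_encoding [encoding system]
    # Read and write file names byte for byte.
    encoding system iso8859-1
    try {
        foreach org [glob -nocomplain -- $pattern] {
            # Each character of $ect is one byte of the UTF-8 form.
            set ect [encoding convertto utf-8 $org]
            # Pure ASCII names come back unchanged; leave them alone.
            if {$ect ne $org} {
                file rename -- $org $ect
            }
        }
    } finally {
        # Always put the original system encoding back.
        encoding system $old_encoding
    }
}
```

Usage would be something like: rename_latin1_to_utf8 3331*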

> BTW: Is there a way to "guess" an encoding system by looking at
> files?

Not with 100% certainty. And generally it is easier to exclude than
include (i.e., decide that something is clearly not UTF-8).
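
As an illustration of the "exclude" approach, here is a minimal sketch
that checks whether a string of raw bytes survives a UTF-8 round trip.
The proc name looks_like_utf8 is invented for this example, and it
assumes the name was read with the system encoding set to iso8859-1, so
that each character holds exactly one byte. Behaviour for 4-byte UTF-8
sequences may differ between Tcl versions.

```tcl
# Hypothetical helper: returns 1 if $bytes decodes as UTF-8 and encodes
# back to the identical byte sequence, 0 otherwise.
proc looks_like_utf8 {bytes} {
    # Newer Tcl versions may raise an error on invalid input.
    if {[catch {encoding convertfrom utf-8 $bytes} text]} {
        return 0
    }
    # Older versions map invalid bytes to characters instead, so the
    # round trip no longer matches the input.
    return [expr {[encoding convertto utf-8 $text] eq $bytes}]
}

puts [looks_like_utf8 "a\xb1c"]       ;# 0: a lone 0xb1 is not valid UTF-8
puts [looks_like_utf8 "a\xc2\xb1c"]   ;# 1: 0xc2 0xb1 is UTF-8 plus/minus
```

Note that pure ASCII names also return 1, since ASCII is valid UTF-8;
a result of 1 does not prove the name was meant as UTF-8, while a 0 is a
fairly clear sign that it was not.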


Ralf Fassel

Apr 9, 2018, 4:12:00 AM
* Gerhard Reithofer <gerhard....@tech-edv.co.at>
| % set fn [glob 33310*]
| 33310±±±±±±17.11.05±±±±17UH.file
| % file exists $fn
| could not read "33310±±±±±±17.11.05±±±±17UH.file": no such file or
| directory

I would have expected a 0/1 response from 'file exists', but not an
ENOENT error?!? Are you sure that 'file exists' err'd out?

R'

Gerhard Reithofer

Apr 9, 2018, 9:14:08 AM
Hello Ralf,
you are right, it's a "manually edited typing error" ;-)

I used [file exists ...] at first, and then the 2nd test was [file size
...], which results in a runtime error.
The quoted output is a partial mix of both, sorry for that :-(

I could provide the "bad" file, but creating one with the wrong system
encoding, as Rich showed in his example, also behaves as explained.

My fault!

Bye,