>> > names, example: 33310ąąąąąą17.11.05ąąąą17UH.file
>> > The terminal display is (UTF8 mode):
>> >
>> > Any idea what to do?
>> > How can I read these "unencoded" file names?
>>
>> You have a couple of possible solutions:
>>
>> 1) Use Unix/Linux tools to rename the files, changing the ą characters
>> to something else.
>
> The shell has similar problems as the things are not "displayed"
> correctly.
Display and the shell's ability to modify the bytes are orthogonal.
Look into the $'' expansion in Bash, it allows you to specify the exact
byte value present in the names. A PIA, yes, but possible.
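A minimal sketch of the $'' approach, run in a scratch directory so
nothing real is touched. The names a\xb1c and a_c are only examples,
and this assumes Bash on a filesystem that accepts arbitrary bytes in
names (as Linux filesystems normally do):

```shell
# Work in a scratch directory so nothing real is touched.
cd "$(mktemp -d)" || exit 1

# Create a file whose name contains the raw byte 0xb1
# (the iso8859-1 plus/minus), as on the old machines.
touch $'a\xb1c'

# Rename it by spelling out that byte explicitly. How the
# terminal displays the name does not matter here.
mv -- $'a\xb1c' 'a_c'

ls
```

The same $'...' spelling works with any command that takes a file
name, not just mv.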
>> 2) Given that these characters are invalid UTF-8, but valid
>> iso8859-1, you might try changing Tcl's system encoding to
>> iso8859-1 before you try to read the names. On my Linux system
>> with iso8859-1 encoding I can create and read this character just
>> fine:
>>
>> $ rlwrap tclsh
>> % encoding system
>> iso8859-1
>> % set name "a\xb1c"
>> aąc
>
> This also worked on my system, with identical output, without
> changing the encoding, BUT as my system encoding is utf-8, a file
> name with the following byte sequence was created.
>
> $ ls a*c | od -h
> 0000000 c261 63b1 000a
> 0000005
>
> You see it is: "a" 0xc2b1 "c" "\n" - 0xc2b1 is the utf-8 code for the
> plus/minus char.
Yes, the plus/minus in iso8859-1 is not the same byte sequence as the
plus/minus character in UTF-8. As well, a standalone 0xb1 is not a
valid UTF-8 byte sequence, which is why it gets messed up when Tcl's in
UTF-8 mode for the system encoding.
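To see the difference at the byte level, here is a small sketch using
printf, iconv and od (assuming an iconv that knows ISO-8859-1 and
UTF-8, which is the usual case). \261 is simply 0xb1 written in octal
so that any POSIX printf accepts it:

```shell
# The iso8859-1 plus/minus is the single byte b1.
printf '\261' | od -An -tx1

# Converted to UTF-8, the same character becomes the two bytes c2 b1.
printf '\261' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1

# A lone b1 is not valid UTF-8, so a strict UTF-8 decoder rejects it.
if printf 'a\261c' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    echo "valid UTF-8"
else
    echo "invalid UTF-8"
fi
```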
>> You can get the current system encoding with [encoding system] (as I
>> show above). You can tell Tcl to use a different system encoding
>> with "encoding system ?encoding?" as per the man page. So:
>>
>> encoding system iso8859-1
>
> Yeeeaaah! That did it :-) :-)
>
>> before you try globbing these names might work wonders.
>
> Globbing is the only way to read the file name into a tcl variable
Well, no, not the 'only' way. This also works:
set name [exec find . -name a*c]
and this:
set fd [open "|find . -name a*c"]
set name [gets $fd]
close $fd
But glob is the best integrated into Tcl... :)
> use it for my tests) but the "wrong" byte sequence was created, using
> the 0xc2b1 characters and with this name it can't be addressed.
Tcl was using UTF-8 for the system encoding, which is correct for
filenames encoded as UTF-8, but not for filenames encoded as
iso8859-1.
> Setting the system encoding to iso8859-1 leaves the name as it was
> (assumption and at least it works for this character).
It tells Tcl that when it reads the filenames, it should interpret
their bytes as iso8859-1 characters, so you get the correct bytes in
your variable to match the filenames on disk.
> My repair code is:
>
> encoding system iso8859-1
> set pat 3331*
> set org [glob $pat]
> set ect [encoding convertto utf-8 $org]
> file rename $org $ect
I'd recommend wrapping that in:
set old_encoding [encoding system]
...
encoding system $old_encoding
so the system encoding is put back to what it was originally, to
prevent trouble elsewhere.
> Now it looks exactly like on the old unix machines and all plus/minus
> are converted to system readable utf-8 characters.
>
> Now I have to loop over several 1000 files for renaming all files
> and it "looks like in ancient times" ;-)
A pain, but not hard to do.
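If you ever want to do the bulk rename from the shell instead (the
option 1 route), a rough sketch could look like the following.
fix_names is a hypothetical helper name, the 3331* pattern is taken
from your repair code, and it assumes iconv plus an mv that supports
-n (GNU and BSD both do). Try it on a copy of the directory first:

```shell
# fix_names DIR: hypothetical helper that renames every file matching
# 3331* in DIR from an iso8859-1 name to the equivalent UTF-8 name.
fix_names() {
    (
        cd "$1" || exit 1
        for old in 3331*; do
            [ -e "$old" ] || continue      # pattern matched nothing
            # Names that already decode as UTF-8 are left alone, so
            # running this twice does not double-encode anything.
            if printf '%s' "$old" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
                continue
            fi
            new=$(printf '%s' "$old" | iconv -f ISO-8859-1 -t UTF-8) || continue
            mv -n -- "$old" "$new"         # -n: never overwrite an existing file
        done
    )
}
```

Call it as "fix_names ." from inside the directory holding the files.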
> BTW: Is there a way to "guess" an encoding system by looking at
> files?
Not with 100% certainty. And generally it is easier to exclude than
include (i.e., decide that something is clearly not UTF-8).
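As a sketch of that "exclude" test, assuming iconv is available: the
hypothetical is_utf8 helper below only reports whether a name's bytes
form valid UTF-8. A "no" is reliable; a "yes" only means the name
could be UTF-8 (pure ASCII always passes, for instance).

```shell
# is_utf8 STRING: succeeds if the bytes of STRING form valid UTF-8.
is_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Classify every name in the current directory.
for f in *; do
    [ -e "$f" ] || continue                # empty directory
    if is_utf8 "$f"; then
        echo "maybe UTF-8: $f"
    else
        echo "not UTF-8:   $f"
    fi
done
```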