Chinese characters in file name


tm.work

Jan 26, 2012, 7:37:41 AM
Hi,

I need to generate a list of the file names of all files in a
particular folder. The file names have Chinese characters and 'glob'
is not displaying them right.

I'm sure this relates to the internationalization settings, but this
is all too new to me. Can anyone please help?

I'm running TCL 8.4.19 on a Windows XP machine.

Example:

File name as copied from Explorer and pasted over here:
GOI151102IIA01 灯丝变压器初级线圈绕线架.pdf

From within TCL shell, 'glob *' produces:
GOI151102IIA01 ????????????.pdf

Thanks
TM

Gerald W. Lester

Jan 26, 2012, 8:02:53 AM
Ignoring how the return value from glob is displayed, does [open] (or any other
file command) work on the value glob returns? In other words, what does
the following produce:
foreach file [glob *] {
    puts stdout "The size of $file is [file size $file]"
}

Is this in the console that wish displays (i.e. the Tk console) or in a DOS
command window?


--
+------------------------------------------------------------------------+
| Gerald W. Lester, President, KNG Consulting LLC |
| Email: Gerald...@kng-consulting.net |
+------------------------------------------------------------------------+
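Gerald's check can be taken one step further by printing the Unicode code points of each name, which shows what glob actually returned regardless of what the terminal can draw. This is only a sketch: the helper name codepoints is made up for illustration and does not appear in the thread.

```tcl
# Hypothetical helper (not from the thread): list the code points of a
# string so the result does not depend on the console's code page.
proc codepoints {s} {
    set out {}
    foreach ch [split $s ""] {
        # [scan $ch %c] returns the integer code of the character
        lappend out [format U+%04X [scan $ch %c]]
    }
    return $out
}

# -nocomplain avoids an error when the directory is empty
foreach f [glob -nocomplain *] {
    puts "[file size $f]\t[codepoints [file tail $f]]"
}
```

If the Chinese characters show up as code points such as U+5939 rather than U+003F, the name is intact inside Tcl and only the display is lossy.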

tm.work

Jan 26, 2012, 8:27:06 AM
Hi Gerald,

Thanks for the feedback.

Running:
foreach f [glob GOI151102IIA01*] {set file $f}
puts stdout "The size of $file is [file size $file]"

From the tclsh console, the command returns:
The size of GOI151102IIA01 ????????????.pdf is 86113

From the wish console, the command returns:
The size of GOI151102IIA01 灯丝变压器初级线圈绕线架.pdf is 86113


I'm getting near what I want to do, but I still have a couple of
questions.

1. What makes the 2 shells display the Chinese characters differently?

2. Writing the name of the file to a text file still writes the
Chinese characters as ?????

From within wish:

catch {open out.txt w} fhdl
puts -nonewline $fhdl $out
catch {close $fhdl} e

The contents of the file read: 'GOI151102IIA01 ????????????.pdf'

Thanks
TM

Bruce

Jan 26, 2012, 9:30:40 AM
Sounds like the characters are fine and Tcl does not have a problem;
the shell you are running just can't display them.

When you say the contents of the file read ??????, how are you looking
at the contents: dumping them in the shell, or opening the file in an
application? If so, which application?

Bruce

Harald Oehlmann

Jan 26, 2012, 9:46:40 AM
Hi TM,
On 26 Jan., 14:27, "tm.work" <t.mene...@c-x-r.com> wrote:
> Hi Gerald,
>
> Thanks for the feedback.
>
> Running:
> foreach f [glob GOI151102IIA01*] {set file $f}
> puts stdout "The size of $file is [file size $file]"
>
> From the tclsh console, the command returns:
> The size of GOI151102IIA01 ????????????.pdf is 86113
>
> From the wish console, the command returns:
> The size of GOI151102IIA01 灯丝变压器初级线圈绕线架.pdf is 86113
>
> I'm getting near what I want to do, but I still have a couple of
> questions.
>
> 1. What makes the 2 shells display the Chinese characters differently?
The wish console works directly with Tcl's internal Unicode (UTF-8)
strings, which can represent Chinese characters.
The DOS console probably uses cp437 or some other code page.

In general, when Tcl translates between character sets, "?" is used for
characters that cannot be represented in the destination character
set. Have a look at the docs for "encoding convertto".
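The "?" substitution can be reproduced without any console involved. The following is a minimal sketch, assuming Tcl 8.x as used in this thread (later major versions may handle unrepresentable characters more strictly); the file name is the one from the thread, written with \u escapes so the script does not depend on its own source encoding.

```tcl
# File name from the thread, with the three Chinese characters as escapes
set name "GOL10001700A00 \u5939\u7ebf\u6761.pdf"

# cp1252 has no slot for these characters, so Tcl 8.x substitutes "?"
set shown [encoding convertto cp1252 $name]
puts $shown

# The original string is untouched; only the converted copy is lossy
puts [string length $name]
```

The conversion is one-way: once the characters have become "?", the original name cannot be recovered from the converted bytes.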

> 2. Writing the name of the file to a text file still writes the
> Chinese characters as ?????
>
> From within wish:
>
> catch {open out.txt w} fhdl
> puts -nonewline $fhdl $out
> catch {close $fhdl} e
>
> The contents of the file read: 'GOI151102IIA01 ????????????.pdf'

Here you have the same issue as with the console. Your system encoding
(type "encoding system") probably does not contain Chinese characters.
To solve this, specify the encoding manually:
catch {
    set fhdl [open out.txt w]
    # the encoding name is "utf-8"; "utf8" is not a known encoding name
    fconfigure $fhdl -encoding utf-8
    puts -nonewline $fhdl $out
    close $fhdl
} e

This saves the file in UTF-8, so you need an editor that can read UTF-8,
such as Notepad++, to open it. Choose "UTF-8" there to display the contents.

Instead of utf-8 you may use a regional encoding such as "cp936" (GBK,
Simplified Chinese) or "big5" (cp950, Traditional Chinese).
Type "encoding names" to get a list of the available encodings.
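To confirm that the file really holds the right characters without relying on an editor, the text can be written and read back with the same explicit encoding. This is a sketch only; the output file name out-utf8.txt is an assumption chosen for the example.

```tcl
# File name from the thread, Chinese characters written as escapes
set name "GOL10001700A00 \u5939\u7ebf\u6761.pdf"
set path out-utf8.txt   ;# hypothetical output file for this example

# Write with an explicit encoding instead of the system default
set fh [open $path w]
fconfigure $fh -encoding utf-8
puts -nonewline $fh $name
close $fh

# Read back with the same encoding and compare with the original
set fh [open $path r]
fconfigure $fh -encoding utf-8
set back [read $fh]
close $fh

puts [string equal $name $back]
file delete $path
```

If the comparison prints 1, the characters survived the round trip and any remaining "?" is coming from the viewer, not from Tcl.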

TM

Jan 26, 2012, 10:21:43 AM
Hi Bruce,

I meant opening the contents of the output file with the list of file
names (with Chinese characters).

I'm viewing it with Geany (text editor) which cannot display Chinese
characters.

I know I'm not outputting the correct characters because when I copy
the list of file names from within Geany and paste it to an
application that can render Chinese characters (eg. Google Translate)
I do not get the intended file names.

Does that make sense?

Cheers

TM

Jan 26, 2012, 10:38:01 AM
Hi Harald,

I got the results that I wanted using your code with UTF-8 - thanks.

The script below generates the correct output file from within both
wish & tclsh.

However, wish can also print the correct list to the screen (stdout),
whereas tclsh cannot. Is this related to the codepage? How do I set
the codepage for tclsh/DOS?

foreach f [glob *] {append out [file tail $f]\n}
catch {
    set fhdl [open ${fname} w]
    fconfigure $fhdl -encoding utf-8
    puts -nonewline $fhdl $out
    close $fhdl
} err

puts stdout $out


wish shell (stdout):
GOI151102IIA01 灯丝变压器初级线圈绕线架.pdf

tclsh shell (stdout):
GOI151102IIA01 ????????????.pdf

Thanks for your help
TM

Christian Gollwitzer

Jan 26, 2012, 11:53:27 AM
Am 26.01.12 16:38, schrieb TM:
> However, wish can also print the correct list to the screen (stdout),
> whereas tclsh cannot. Is this related to the codepage? How do I set
> the codepage for tclsh/DOS?

I believe this is a limitation of the "DOS box" (this is an erroneous
term, btw.) of Windows. When you type "dir", do you see the file name
with Chinese characters? If not, there is probably no
way to do this. Otherwise, the system encoding depends on your language
setting. Is this a Chinese Windows system, or just an English system
where you have activated a Chinese keyboard?

Christian

Harald Oehlmann

Jan 27, 2012, 2:41:45 AM
Could you execute the following within the DOS box before starting tclsh:
chcp
This shows your current code page.
If needed, change the code page to Big5:
chcp 950
Start tclsh and show Tcl's system encoding:
encoding system
Configure the encoding of Tcl's stdout (is this even possible?):
fconfigure stdout -encoding cp950
(I have tried this on my computer and it did not work at all, not even
with cp437, but you may try.)
I personally do not use tclsh within a DOS box; the wish console
window is so useful for me.
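The Tcl side of these checks can be collected in a few lines. This sketch only queries the current settings and changes nothing; whether reconfiguring stdout helps depends on the console, as noted above.

```tcl
# Encoding Tcl uses by default for files and for talking to the OS
puts "system encoding: [encoding system]"

# Encoding currently configured on the stdout channel
puts "stdout encoding: [fconfigure stdout -encoding]"

# Check whether a given encoding ships with this Tcl installation
set have950 [expr {[lsearch -exact [encoding names] cp950] >= 0}]
puts "cp950 available: $have950"
```

Comparing these values with the output of chcp shows whether Tcl and the console agree on a code page.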
-Harald

TM

Jan 27, 2012, 5:34:27 AM
Hi Christian,

> When you type "dir", do you see the file name with chinese characters?

Typing dir from the DOS window doesn't display the Chinese characters
correctly:

27/01/2012 10:21 <DIR> .
27/01/2012 10:21 <DIR> ..
14/09/2011 11:35 69,353 GOL10001700A00 ???.pdf

So I guess that the output of 'puts stdout $filename' is as much a
product of the OS setup as it is of the Tcl setup.

Cheers
TM

TM

Jan 27, 2012, 5:48:46 AM
Harald,

I haven't installed a Chinese keyboard. My system has standard UK
keyboard & language settings.

TKcon is happy to display Chinese characters:
ls --> GOL10001700A00 夹线条.pdf

Wish outputs gibberish:
ls --> GOL10001700A00 å¤¹çº¿æ ¡.pdf

And so does DOS:
dir --> GOL10001700A00 ???.pdf

So what is it that allows Tkcon to do a good job where wish and DOS
fail miserably?

Also, why do wish and DOS produce different outputs?


Tkcon: encoding system --> cp1252
Wish: encoding system --> cp1252
tclsh: encoding system --> cp1252

From within the DOS window:
chcp --> Active code page: 437
chcp 950 --> Invalid code page

These language-related issues are way beyond me...

Thanks
TM

Donal K. Fellows

Jan 27, 2012, 6:52:35 AM
Before we go any further, be aware that font/character issues are deeply
complex. They bamboozle many programmers, especially if they think that
one byte is one character or that one byte is always that character.
This is not the case. Take care!

On 27/01/2012 10:48, TM wrote:
> I haven't installed a Chinese keyboard. My system has standard UK
> keyboard& language settings.
>
> TKcon is happy to display Chinese characters:
> ls --> GOL10001700A00 夹线条.pdf

Tkcon works at the level of Tcl result strings and has access to Tk's
main font handling system. That means it knows directly what the
characters are — there's no misinterpretation step involved — and knows
exactly how to display those characters.

> Wish outputs gibberish:
> ls --> GOL10001700A00 å¤¹çº¿æ ¡.pdf

The issue there is that although it's to a Tk window, it's been mangled
through a (fake) channel that's set to the system encoding and that
causes problems. (Alas, it seems to be the "wrong sort" of mangling too,
with the bytes on the channel being UTF-8 but being interpreted as a
single-byte encoding; looks like there's a bug here.)

You can get the same displayed output in Tkcon (or something very
similar in Tclsh) by using:

encoding convertto utf-8 $filename

> And so does DOS:
> dir --> GOL10001700A00 ???.pdf

Again, mangling through an encoding, though this time in a "correct" way
(there are no Chinese characters in the encoding, so they *can't* be
represented at all and are instead converted to "?" symbols). This is
indeed information-lossy.

You can get the same displayed output in Tkcon using:

encoding convertto cp1252 $filename

> Tkcon: encoding system --> cp1252
> Wish: encoding system --> cp1252
> tclsh: encoding system --> cp1252

Yes, cp1252 can't contain any symbols from any east Asian alphabet.
Tkcon doesn't care and doesn't need to care. Wish cares (but shouldn't
as it is directing to a channel where we can ensure that both ends are
correct). Tclsh cares and is directing output to a channel where that
care is properly justified.

Donal.
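The gibberish case can be inspected at the byte level. As a rough sketch using the three characters from the file name above: converting to UTF-8 yields three bytes per character, and those nine bytes are what a single-byte encoding then misreads as nine separate symbols.

```tcl
# The three Chinese characters from the file name, as escapes
set name "\u5939\u7ebf\u6761"

# Encode to UTF-8: the result is a string of bytes, not characters
set bytes [encoding convertto utf-8 $name]
puts [string length $name]    ;# 3 characters
puts [string length $bytes]   ;# 9 bytes

# Show the bytes in hex; e5 a4 b9 is what cp1252 displays as "å¤¹"
binary scan $bytes H* hex
puts $hex

# Decoding with the matching encoding restores the original text
puts [string equal $name [encoding convertfrom utf-8 $bytes]]
```

Unlike the "?" case, nothing is lost here: the bytes are correct and only the interpretation at the receiving end is wrong.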

Harald Oehlmann

Jan 30, 2012, 4:42:08 AM
On 27 Jan., 12:52, "Donal K. Fellows"
<donal.k.fell...@manchester.ac.uk> wrote:
Thank you Donal for the light and TM for the good questions.

I would conclude:
- Tcl works well internally.
- It behaves correctly on a DOS console, which is not able to show
  Chinese characters: it prints "?" instead.
- A wish console tries to imitate the console, which gives
  unsatisfactory results.
- We are sorry about the experience - it is as it is.
- I have put a pointer to this thread on the wiki: wiki.tcl.tk/encoding

Harald

TM

Feb 10, 2012, 8:40:47 AM
Donal, thanks for clarifying things. I'm not sure if you are saying
this is a bug or a feature...

Harald, should a bug or a feature request be filed somewhere? If so,
where exactly?

Thanks everyone on this thread
TM

Donal K. Fellows

Feb 10, 2012, 8:55:32 AM
On 10/02/2012 13:40, TM wrote:
> Donal, thanks for clarifying things. I'm not sure if you are saying
> this is a bug or a feature...

I'm saying that there's some unfortunate confusion in the console
handling that is leading to the garbling of the characters, which is a
_minor_ bug (the relevant channel appears to be using the wrong encoding
by default). There's also the issue of the conversion to ??? which is
actually a feature/consequence of the fact that the characters were sent
to somewhere which couldn't cope with them (the "?" is a fallback for
that case).

The _good_ news is that things are OK inside Tcl. You can work with the
files just fine. You just can't display the filenames other than through
Tk windows that you create (or tkcon). That's OK, since that's what
you'd want to do anyway.

Donal.

TM

Feb 10, 2012, 9:00:14 AM
Hi Donal,

Yes, I've fixed my script and it is running a treat.

I was just wondering if there was a bug I should be filing somewhere.

Cheers
TM

Harald Oehlmann

Feb 13, 2012, 4:59:11 AM
Hi TM, Hi Donal,
a minor bug is still a bug. In this case, it is in the wish console,
which is part of Tk.
Please file a low-priority bug (after checking that it does not exist
yet) at:
http://sourceforge.net/tracker/?group_id=12997&atid=112997

Thank you all,
Harald