Encoding of file names

utabintarbo

Dec 8, 2005, 8:02:35 AM
Here is my situation:

I am trying to programmatically access files created on an IBM AIX
system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
on a Win32 system. Not confused? OK, let's move on... ;-)

When I ask for an os.listdir() of a relevant directory, I get filenames
with embedded escaped characters (ex.
'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')
which will read as "False" when applying an os.path.isfile() to it. I
wish to apply some operations to these files, but am unable, since
python (on Win32, at least) does not recognize this as a valid
filename.

Help me, before my thin veneer of genius is torn from my boss's eyes!
;-)

Peter Hansen

Dec 8, 2005, 9:19:50 AM
to pytho...@python.org
utabintarbo wrote:
> I am trying to programmatically access files created on an IBM AIX
> system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
> on a Win32 system. Not confused? OK, let's move on... ;-)
>
> When I ask for an os.listdir() of a relevant directory, I get filenames
> with embedded escaped characters (ex.
> 'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')
> which will read as "False" when applying an os.path.isfile() to it. I
> wish to apply some operations to these files, but am unable, since
> python (on Win32, at least) does not recognize this as a valid
> filename.

I'm not sure of the answer, but note that .isfile() is not just checking
whether the filename is valid, it's checking that something *exists*
with that name, and that it is a file. Big difference... at least in
telling you where to look for the solution. In this case, checking
which of the two tests in ntpath.isfile() is actually failing might be a
first step if you don't have some other lead. (ntpath is what os.path
translates into on Windows, so look for ntpath.py in the Python lib folder.)
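
A minimal sketch of that first step, written for Python 3 rather than the Python 2 used elsewhere in this thread. It assumes that isfile() amounts to an os.stat() call followed by a regular-file test on the result; the exact code in ntpath.py may differ between versions, and explain_isfile is a hypothetical helper name.

```python
import os
import stat


def explain_isfile(path):
    """Report which step of an isfile()-style check fails for path."""
    try:
        # Fails if nothing exists under exactly this name.
        st = os.stat(path)
    except OSError as exc:
        return "stat failed: %s" % exc.__class__.__name__
    # The name exists; now check that it is a regular file.
    if not stat.S_ISREG(st.st_mode):
        return "exists, but is not a regular file"
    return "regular file"
```

Calling explain_isfile(os.path.join(someDir, name)) on one of the problem names would show whether the lookup itself fails or whether the entry is found but is not a regular file.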

If you're really seeing what you're seeing, I suspect a bug since if
os.listdir() can find it (and it's really a file), os.path.isfile() should
report it as a file, I would think.

-Peter

Peter Otten

Dec 8, 2005, 10:17:59 AM
utabintarbo wrote:

> I am trying to programmatically access files created on an IBM AIX
> system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
> on a Win32 system. Not confused? OK, let's move on... ;-)
>
> When I ask for an os.listdir() of a relevant directory, I get filenames
> with embedded escaped characters (ex.
> 'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')
> which will read as "False" when applying an os.path.isfile() to it. I
> wish to apply some operations to these files, but am unable, since
> python (on Win32, at least) does not recognize this as a valid
> filename.

Does the problem persist if you feed os.listdir() a unicode path?
This will cause listdir() to return unicode filenames which are less prone
to encoding confusion.
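
For readers on Python 3, where str is already Unicode, the same distinction shows up as str versus bytes paths: the type of the argument decides the type of the names returned. A small sketch, assuming Python 3; list_both_ways is a hypothetical helper name.

```python
import os


def list_both_ways(directory):
    """List directory once with a str path and once with a bytes path."""
    text_names = os.listdir(directory)               # str in, str out
    byte_names = os.listdir(os.fsencode(directory))  # bytes in, bytes out
    return sorted(text_names), sorted(byte_names)
```

Comparing the two lists for a problem directory is a quick way to see whether a name changes when it passes through the byte-string interface.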

Peter

Kent Johnson

Dec 8, 2005, 10:39:42 AM
utabintarbo wrote:
> Here is my situation:
>
> I am trying to programmatically access files created on an IBM AIX
> system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
> on a Win32 system. Not confused? OK, let's move on... ;-)
>
> When I ask for an os.listdir() of a relevant directory, I get filenames
> with embedded escaped characters (ex.
> 'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')
> which will read as "False" when applying an os.path.isfile() to it. I
> wish to apply some operations to these files, but am unable, since
> python (on Win32, at least) does not recognize this as a valid
> filename.

Just to eliminate the obvious, you are calling os.path.join() with the
parent name before calling isfile(), yes? Something like

for f in os.listdir(someDir):
    fp = os.path.join(someDir, f)
    if os.path.isfile(fp):
        ...

Kent

Fredrik Lundh

Dec 8, 2005, 11:48:04 AM
to pytho...@python.org
"utabintarbo" wrote:

> I am trying to programmatically access files created on an IBM AIX
> system, stored on a Sun OS 5.8 fileserver, through a samba-mapped drive
> on a Win32 system. Not confused? OK, let's move on... ;-)
>
> When I ask for an os.listdir() of a relevant directory, I get filenames
> with embedded escaped characters (ex.
> 'F07JS41C.04389525AA.UPR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model')

how did you print that name? "\xa6" is a "broken vertical bar", which, as
far as I know, is a valid filename character under both Unix and Windows.

if DIR is a variable that points to the remote directory, what does this
print:

import os
files = os.listdir(DIR)
file = files[0]
print file
print repr(file)
fullname = os.path.join(DIR, file)
print os.path.isfile(fullname)
print os.path.isdir(fullname)

(if necessary, replace [0] with an index that corresponds to one of
the problematic filenames)

when you've tried that, try this variation (only the listdir line has
changed):

import os
files = os.listdir(unicode(DIR)) # <-- this line has changed
file = files[0]
print file
print repr(file)
fullname = os.path.join(DIR, file)
print os.path.isfile(fullname)
print os.path.isdir(fullname)

</F>

utabintarbo

Dec 8, 2005, 12:48:23 PM
Fredrik, you are a God! Thank You^3. I am unworthy </ass-kiss-mode>

I believe that may do the trick. Here are the results of running your
code:

>>> DIR = os.getcwd()
>>> files = os.listdir(DIR)
>>> file = files[-1]
>>> file
'L07JS41C.04389525AA.QTR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model'
>>> print file
L07JS41C.04389525AA.QTRªINR.EªC-P.D11.081305.P2.KPF.model
>>> print repr(file)
'L07JS41C.04389525AA.QTR\xa6INR.E\xa6C-P.D11.081305.P2.KPF.model'
>>> fullname = os.path.join(DIR, file)
>>> print os.path.isfile(fullname)
False
>>> print os.path.isdir(fullname)
False
>>> files = os.listdir(unicode(DIR))
>>> file = files[-1]
>>> print file
L07JS41C.04389525AA.QTR¦INR.E¦C-P.D11.081305.P2.KPF.model
>>> print repr(file)
u'L07JS41C.04389525AA.QTR\u2592INR.E\u2524C-P.D11.081305.P2.KPF.model'
>>> fullname = os.path.join(DIR, file)
>>> print os.path.isfile(fullname)
True <--- Success!
>>> print os.path.isdir(fullname)
False

Thanks to all who posted. :-)

"Martin v. Löwis"

Dec 8, 2005, 5:45:06 PM
to utabintarbo
utabintarbo wrote:
> Fredrik, you are a God! Thank You^3. I am unworthy </ass-kiss-mode>
>
> I believe that may do the trick. Here are the results of running your
> code:

For all those who followed this thread, here is some more explanation:

Apparently, utabintarbo managed to get U+2592 (MEDIUM SHADE, a filled
50% grayish square) and U+2524 (BOX DRAWINGS LIGHT VERTICAL AND LEFT,
a vertical line in the middle, plus a line from that going left) into
a file name. How he managed to do that, I can only guess: most likely,
the Samba installation assumes that the file system encoding on
the Solaris box is some IBM code page (say, CP 437 or CP 850). If so,
the byte on disk would be \xb4. Where this came from, I have to guess
further: perhaps it is ACUTE ACCENT from ISO-8859-*.

Anyway, when he used listdir() to get the contents of the directory,
Windows applies the CP_ACP encoding (known as "mbcs" in Python).
For reasons unknown to me, the US and several European versions
of XP map this to \xa6, VERTICAL BAR (I can somewhat see that
as meaningful for U+2524, but not for U+2592).

So when he then applies isfile to that file name, \xa6 is mapped
to U+00A6, which then isn't found on the Samba side.

So while Unicode here is the solution, the problem is elsewhere;
most likely in a misconfiguration of the Samba server (which assumes
some encoding for the files on disk, yet the AIX application
uses a different encoding).
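
The chain of conversions described above can be replayed with Python's own codecs. This is only a sketch for Python 3 under two assumptions: that CP 437 is the code page Samba assumed, and that cp1252 stands in for the Windows ANSI code page ("mbcs" itself is only available on Windows). Python's cp1252 codec is strict, so where Windows silently substitutes \xa6, the encode call here raises instead.

```python
# The two characters seen in the Unicode listing on Windows.
name_chars = "\u2592\u2524"

# Assumption: Samba reads the on-disk bytes as CP 437. Encoding the
# characters back gives the bytes that would be stored on the server.
on_disk = name_chars.encode("cp437")

# The same bytes read as ISO-8859-1 instead.
as_latin1 = on_disk.decode("latin-1")

# Assumption: cp1252 stands in for the ANSI code page. Neither character
# exists there, so a strict conversion fails rather than substituting.
try:
    name_chars.encode("cp1252")
    lossless = True
except UnicodeEncodeError:
    lossless = False

# The substitute byte \xa6 decodes to U+00A6, a different character from
# both originals, so the name sent back to Samba no longer matches.
sent_back = b"\xa6".decode("cp1252")
```

Here on_disk is b"\xb1\xb4", as_latin1 is PLUS-MINUS SIGN followed by ACUTE ACCENT, lossless is False, and sent_back is "\u00a6".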

Regards,
Martin

Tom Anderson

Dec 9, 2005, 6:05:28 AM
On Thu, 8 Dec 2005, "Martin v. Löwis" wrote:

> utabintarbo wrote:
>
>> Fredrik, you are a God! Thank You^3. I am unworthy </ass-kiss-mode>
>

> For all those who followed this thread, here is some more explanation:
>
> Apparently, utabintarbo managed to get U+2592 (MEDIUM SHADE, a filled
> 50% grayish square) and U+2524 (BOX DRAWINGS LIGHT VERTICAL AND LEFT, a
> vertical line in the middle, plus a line from that going left) into a
> file name. How he managed to do that, I can only guess: most likely, the
> Samba installation assumes that the file system encoding on the Solaris
> box is some IBM code page (say, CP 437 or CP 850). If so, the byte on
> disk would be \xb4. Where this came from, I have to guess further:
> perhaps it is ACUTE ACCENT from ISO-8859-*.
>
> Anyway, when he used listdir() to get the contents of the directory,
> Windows applies the CP_ACP encoding (known as "mbcs" in Python). For
> reasons unknown to me, the US and several European versions of XP map
> this to \xa6, VERTICAL BAR (I can somewhat see that as meaningful for
> U+2524, but not for U+2592).
>
> So when he then applies isfile to that file name, \xa6 is mapped to
> U+00A6, which then isn't found on the Samba side.
>
> So while Unicode here is the solution, the problem is elsewhere; most
> likely in a misconfiguration of the Samba server (which assumes some
> encoding for the files on disk, yet the AIX application uses a different
> encoding).

Isn't the key thing that Windows is applying a non-roundtrippable
character encoding? If i've understood this right, Samba and Windows are
talking in unicode, with these (probably quite spurious, but never mind)
U+25xx characters, and Samba is presenting a quite consistent view of the
world: there's a file called "double bucky backslash grey box" in the
directory listing, and if you ask for a file called "double bucky backslash
grey box", you get it. Windows, however, maps that name to the 8-bit
string "double bucky backslash vertical bar", but when you pass *that*
back to it, it gets encoded as the unicode string "double bucky backslash
vertical bar", which Samba then doesn't recognise.

I don't know what Windows *should* do here. I know it shouldn't do this -
this leads to breaking of some very basic invariants about files and
directories, and so the kind of confusion utabintarbo suffered. The
solution is either to apply an information-preserving encoding (UTF-8,
say), or to refuse to do it at all (ie, raise an error if there are
unencodable characters), neither of which is a particularly beautiful
solutions. I think Windows is in a bit of a rock/hard place situation
here, poor thing.
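
The "refuse to do it at all" option can be sketched in a few lines of Python 3. This illustrates the idea only and is not what Windows does; roundtrips and encode_or_refuse are hypothetical helper names, and cp1252 in the usage note is just an example of an 8-bit code page.

```python
def roundtrips(name, encoding):
    """Return True if name survives encode/decode in encoding unchanged."""
    try:
        return name.encode(encoding).decode(encoding) == name
    except UnicodeEncodeError:
        return False


def encode_or_refuse(name, encoding):
    """Encode name, raising ValueError rather than substituting characters."""
    if not roundtrips(name, encoding):
        raise ValueError("%r cannot be represented in %s" % (name, encoding))
    return name.encode(encoding)
```

With these helpers, roundtrips("QTR\u2592INR", "utf-8") is True, while roundtrips("QTR\u2592INR", "cp1252") is False and encode_or_refuse raises for the same arguments.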

Incidentally, for those who haven't come across CP_ACP before, it's not
yet another character encoding, it's a pseudovalue which means 'the
system's current default character set'.

tom

--
Women are monsters, men are clueless, everyone fights and no-one ever
wins. -- cleanskies

utabintarbo

Dec 9, 2005, 9:49:14 AM
Part of the reason (I think) is that our CAD/Data Management system
(which produces the aforementioned .MODEL files) substitutes (stupidly,
IMNSHO) non-printable characters for embedded spaces in file names.
This is part of what leads to my consternation here.

And yeah, Windows isn't helping matters much. No surprise there. :-P

Just for s&g's, I ran this on python 2.3 on knoppix:

>>> DIR = os.getcwd()
>>> files = os.listdir(DIR)
>>> file = files[-1]
>>> print file
L07JS41C.04389525AA.QTR±INR.E´C-P.D11.081305.P2.KPF.model
>>> print repr(file)
'L07JS41C.04389525AA.QTR\xb1INR.E\xb4C-P.D11.081305.P2.KPF.model'
>>> fullname = os.path.join(DIR, file)
>>> print os.path.isfile(fullname)
True <--- It works fine here
>>> print os.path.isdir(fullname)
False
>>> files = os.listdir(unicode(DIR))
>>> file = files[-1]
>>> print file
L07JS41C.04389525AA.QTR±INR.E´C-P.D11.081305.P2.KPF.model
>>> print repr(file)
'L07JS41C.04389525AA.QTR\xb1INR.E\xb4C-P.D11.081305.P2.KPF.model'
>>> fullname = os.path.join(DIR, file)
>>> print os.path.isfile(fullname)
True <--- It works fine here too!
>>> print os.path.isdir(fullname)
False
>>>

This is when mounting the same samba share in Linux. This tends to
support Tom's point re: the "non-roundtrippability" thing.

Thanks again to all.

"Martin v. Löwis"

Dec 9, 2005, 5:13:30 PM
to Tom Anderson
Tom Anderson wrote:
> Isn't the key thing that Windows is applying a non-roundtrippable
> character encoding?

This is a fact, but it is not a key thing. Of course Windows is
applying a non-roundtrippable character encoding. What else could it
do?

> Windows, however, maps that name to the
> 8-bit string "double bucky backslash vertical bar"

Only if you ask it to. There are two sets of APIs: one to apply
if you ask for byte strings (FindFirstFileA), and one to apply when you
ask for Unicode strings (FindFirstFileW).

In one case it has to convert; in the other, it doesn't.

> I don't know what Windows *should* do here. I know it shouldn't do this
> - this leads to breaking of some very basic invariants about files and
> directories, and so the kind of confusion utabintarbo suffered.

It always did this, and always will. Applications should stop using the
*A versions of the API. If they continue to do so, they will continue
to get bogus results in border cases.

The real issue here really is that there was a border case, when there
shouldn't be one.

Regards,
Martin

Tom Anderson

Dec 9, 2005, 7:08:18 PM
On Fri, 9 Dec 2005, "Martin v. Löwis" wrote:

> Tom Anderson wrote:
>
>> Isn't the key thing that Windows is applying a non-roundtrippable
>> character encoding?
>
> This is a fact, but it is not a key thing. Of course Windows is applying
> a non-roundtrippable character encoding. What else could it do?

Well, i'm no great thinker, but i'd say that errors should never pass
silently, and that in the face of ambiguity, one should refuse the
temptation to guess. So, as i said in my post, if the name couldn't be
translated losslessly, an error should be raised.

>> I don't know what Windows *should* do here. I know it shouldn't do this
>> - this leads to breaking of some very basic invariants about files and
>> directories, and so the kind of confusion utabintarbo suffered.
>
> It always did this, and always will. Applications should stop using the
> *A versions of the API.

Absolutely true.

> If they continue to do so, they will continue to get bogus results in
> border cases.

No. The availability of a better alternative is not an excuse for
gratuitous breakage of the worse alternative.

tom

--
Whose house? Run's house!

"Martin v. Löwis"

Dec 10, 2005, 5:00:45 AM
Tom Anderson wrote:
>> This is a fact, but it is not a key thing. Of course Windows is
>> applying a non-roundtrippable character encoding. What else could it do?
>
>
> Well, i'm no great thinker, but i'd say that errors should never pass
> silently, and that in the face of ambiguity, one should refuse the
> temptation to guess. So, as i said in my post, if the name couldn't be
> translated losslessly, an error should be raised.

I believe this would not work, the way the API is structured. You first
call FindFirstFile, getting a file name and a handle. Then you call
FindNextFile repeatedly, passing the handle. An error from FindFirstFile
is indicated by returning an invalid handle.

So if you wanted FindFirstFile to return an error for unencodable file
names, it would not be possible to get a listing of the other files
in the directory.

FindFirstFile also gives the 8.3 file name (if present), and that is
valid without problems.

Regards,
Martin
