os.walk the apostrophe and unicode

Rod Person

Jun 24, 2017, 3:07:05 PM

Hi,

I'm working on a program that will walk a file system and clean the ID3
tags of mp3 and flac files. Everything was working great until the
following file was found:

'06 - Todd's Song (Post-Spiderland Song in Progress).flac'

For some reason that I can't understand, os.walk() returns this file
name as

'06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'

which then causes more hell than a little bit for me. I don't
understand why the apostrophe (') becomes \xe2\x80\x99, or what I can do
about it.

The script is Python 3; the file system it is running on is a HAMMER
filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS which
runs some kind of Linux, so it is probably ext3/4. The files came from
various systems (Mac, Windows, FreeBSD).

--
Rod

http://www.rodperson.com

alister

Jun 24, 2017, 3:10:47 PM

On Sat, 24 Jun 2017 14:57:21 -0400, Rod Person wrote:

> \xe2\x80\x99,

Because the file name was created using a "right single quotation mark"
instead of an apostrophe; the glyphs look identical in many fonts.

--
"If you understand what you're doing, you're not learning anything."
-- A. L.

John Ladasky

Jun 24, 2017, 3:21:11 PM

On Saturday, June 24, 2017 at 12:07:05 PM UTC-7, Rod Person wrote:
> Hi,
>
> I'm working on a program that will walk a file system and clean the id3
> tags of mp3 and flac files, everything is working great until the
> follow file is found
>
> '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
>
> for some reason that I can't understand os.walk() returns this file
> name as
>
> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>
> which then causes more hell than a little bit for me. I'm not
> understand why apostrophe(') becomes \xe2\x80\x99, or what I can do
> about it.

That's a "right single quotation mark" character in Unicode.

http://unicode.scarfboy.com/?s=E28099

Something in your code is choosing to interpret the text variable as an old-fashioned byte array of characters, where every character is represented by a single byte. That works as long as the file name only uses characters from the old ASCII set, but there are only 128 of those.

> The script is Python 3, the file system it is running on is a hammer
> filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS which
> runs some kind of Linux so it probably ext3/4. The files came from
> various system (Mac, Windows, FreeBSD).

Since you are working in Python 3, you can call the str.encode() and bytes.decode() methods to translate between Unicode strings and byte strings (which you still need on occasion).
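
A minimal sketch of that round trip, using the character from the file name in question. The variable names are illustrative only and do not come from the original script.

```python
# str.encode() turns text into bytes; bytes.decode() turns bytes back into text.
text = 'Todd\u2019s'            # contains U+2019 RIGHT SINGLE QUOTATION MARK
raw = text.encode('utf-8')      # the three bytes \xe2\x80\x99 stand for U+2019
back = raw.decode('utf-8')      # decoding with the same codec restores the text

print(raw)
print(back == text)
```

The important point is that both directions use the same codec; decoding UTF-8 bytes with a different codec is what produces garbled names.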


MRAB

Jun 24, 2017, 3:26:28 PM

On 2017-06-24 19:57, Rod Person wrote:
> Hi,
>
> I'm working on a program that will walk a file system and clean the id3
> tags of mp3 and flac files, everything is working great until the
> follow file is found
>
> '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
>
> for some reason that I can't understand os.walk() returns this file
> name as
>
> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>
> which then causes more hell than a little bit for me. I'm not
> understand why apostrophe(') becomes \xe2\x80\x99, or what I can do
> about it.
>

> The script is Python 3, the file system it is running on is a hammer
> filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS which
> runs some kind of Linux so it probably ext3/4. The files came from
> various system (Mac, Windows, FreeBSD).
>

If you treat it as a bytestring b'\xe2\x80\x99' and decode it:

>>> c = b'\xe2\x80\x99'.decode('utf-8')
>>> ascii(c)
"'\\u2019'"
>>> import unicodedata
>>> unicodedata.name(c)
'RIGHT SINGLE QUOTATION MARK'

It's not an apostrophe, it's '\u2019' ('\N{RIGHT SINGLE QUOTATION MARK}').

It looks like the filename is encoded as UTF-8, but Python thinks that
the filesystem encoding is something like Latin-1.
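
One way to check what Python assumes is to ask it directly. This is only a diagnostic sketch; note that sys.getfilesystemencodeerrors() needs Python 3.6 or later, so it would not be available on 3.5.

```python
import locale
import sys

# Encoding Python uses to convert file name bytes to str and back.
fs_encoding = sys.getfilesystemencoding()

# Error handler used for bytes that do not decode (Python 3.6+).
fs_errors = sys.getfilesystemencodeerrors()

# Locale-derived encoding that open() normally uses for text files
# when no encoding argument is given.
text_encoding = locale.getpreferredencoding(False)

print(fs_encoding, fs_errors, text_encoding)
```

If the first value is not UTF-8 while the names on disk are UTF-8, that would be consistent with the mismatch described above.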

Peter Otten

Jun 24, 2017, 3:29:06 PM

Rod Person wrote:

> Hi,
>
> I'm working on a program that will walk a file system and clean the id3
> tags of mp3 and flac files, everything is working great until the
> follow file is found
>
> '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
>
> for some reason that I can't understand os.walk() returns this file
> name as
>
> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>
> which then causes more hell than a little bit for me. I'm not
> understand why apostrophe(') becomes \xe2\x80\x99, or what I can do
> about it.

>>> import unicodedata
>>> b"\xe2\x80\x99".decode("utf-8")
'’'
>>> unicodedata.name(_)
'RIGHT SINGLE QUOTATION MARK'

So it's '’' rather than "'".

> The script is Python 3, the file system it is running on is a hammer
> filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS which
> runs some kind of Linux so it probably ext3/4. The files came from
> various system (Mac, Windows, FreeBSD).

There seems to be a mismatch between the assumed and the actual file system
encoding somewhere in this mix. Is this the only glitch or are there similar
problems with other non-ascii characters?

Michael Torrie

Jun 24, 2017, 3:29:15 PM

On 06/24/2017 12:57 PM, Rod Person wrote:
> Hi,
>
> I'm working on a program that will walk a file system and clean the id3
> tags of mp3 and flac files, everything is working great until the
> follow file is found
>
> '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
>
> for some reason that I can't understand os.walk() returns this file
> name as
>
> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'

That's basically a UTF-8 string there:

$ python3
>>> a = b'06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>>> print(a.decode('utf-8'))
06 - Todd’s Song (Post-Spiderland Song in Progress).flac

The NAS is just happily reading the UTF-8 bytes and passing them on the
wire.

> which then causes more hell than a little bit for me. I'm not
> understand why apostrophe(') becomes \xe2\x80\x99, or what I can do
> about it.

It's clearly not an apostrophe in the original filename, but probably
U+2019 (’).

> The script is Python 3, the file system it is running on is a hammer
> filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS which
> runs some kind of Linux so it probably ext3/4. The files came from
> various system (Mac, Windows, FreeBSD).

It's the file serving protocol that dictates how filenames are
transmitted. In your case it's probably SMB. SMB (Samba) is just passing
the native bytes along from the file system. Since you know the native
file system is just UTF-8, you can just decode every filename from UTF-8
bytes into unicode.
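
A sketch of that approach, assuming the names on disk really are UTF-8. The helper name walk_decoded is made up for illustration. Passing a bytes path to os.walk() makes it return bytes names, so the decoding is explicit rather than dependent on the locale.

```python
import os
import tempfile


def walk_decoded(top, encoding='utf-8'):
    """Yield (raw_path, display_name) for every file below top."""
    # A bytes argument makes os.walk() yield bytes paths and names.
    for dirpath, _dirnames, filenames in os.walk(os.fsencode(top)):
        for raw_name in filenames:
            # errors='replace' keeps the walk going if a name is not UTF-8.
            text_name = raw_name.decode(encoding, errors='replace')
            yield os.path.join(dirpath, raw_name), text_name


# Demo on a throwaway directory, creating the file by its raw bytes name.
with tempfile.TemporaryDirectory() as tmp:
    raw = 'Todd\u2019s Song.flac'.encode('utf-8')
    open(os.path.join(os.fsencode(tmp), raw), 'wb').close()
    found = [name for _path, name in walk_decoded(tmp)]
    print([ascii(name) for name in found])
```

Keeping the raw bytes path next to the decoded name means the file can still be opened even when the decoded name is only fit for display.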

Rod Person

Jun 24, 2017, 3:37:51 PM

On Sat, 24 Jun 2017 21:28:45 +0200
Peter Otten <__pet...@web.de> wrote:

> Rod Person wrote:
>
> > Hi,
> >
> > I'm working on a program that will walk a file system and clean the
> > id3 tags of mp3 and flac files, everything is working great until
> > the follow file is found
> >
> > '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
> >
> > for some reason that I can't understand os.walk() returns this file
> > name as
> >
> > '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in
> > Progress).flac'
> >

> > which then causes more hell than a little bit for me. I'm not
> > understand why apostrophe(') becomes \xe2\x80\x99, or what I can do
> > about it.
>

> >>> b"\xe2\x80\x99".decode("utf-8")
> '’'
> >>> unicodedata.name(_)
> 'RIGHT SINGLE QUOTATION MARK'
>
> So it's '’' rather than "'".
>

> > The script is Python 3, the file system it is running on is a hammer
> > filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS
> > which runs some kind of Linux so it probably ext3/4. The files came
> > from various system (Mac, Windows, FreeBSD).
>

> There seems to be a mismatch between the assumed and the actual file
> system encoding somewhere in this mix. Is this the only glitch or are
> there similar problems with other non-ascii characters?
>

This is the only glitch in file names so far.

--
Rod

http://www.rodperson.com

Who at Clitorius fountain thirst remove
Loath Wine and, abstinent, meer Water love.

- Ovid

Rod Person

Jun 24, 2017, 3:47:25 PM

On Sat, 24 Jun 2017 13:28:55 -0600
Michael Torrie <tor...@gmail.com> wrote:

> On 06/24/2017 12:57 PM, Rod Person wrote:
> > Hi,
> >
> > I'm working on a program that will walk a file system and clean the
> > id3 tags of mp3 and flac files, everything is working great until
> > the follow file is found
> >
> > '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
> >
> > for some reason that I can't understand os.walk() returns this file
> > name as
> >
> > '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in
> > Progress).flac'
>

> That's basically a UTF-8 string there:
>
> $ python3

> >>> a= b'06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in
> Progress).flac'
> >>> print (a.decode('utf-8'))

> 06 - Todd’s Song (Post-Spiderland Song in Progress).flac
> >>>
>

> The NAS is just happily reading the UTF-8 bytes and passing them on
> the wire.
>

> > which then causes more hell than a little bit for me. I'm not
> > understand why apostrophe(') becomes \xe2\x80\x99, or what I can do
> > about it.
>

> It's clearly not an apostrophe in the original filename, but probably
> U+2019 (’)
>

> > The script is Python 3, the file system it is running on is a hammer
> > filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS
> > which runs some kind of Linux so it probably ext3/4. The files came
> > from various system (Mac, Windows, FreeBSD).
>

> It's the file serving protocol that dictates how filenames are
> transmitted. In your case it's probably smb. smb (samba) is just
> passing the native bytes along from the file system. Since you know
> the native file system is just UTF-8, you can just decode every
> filename from utf-8 bytes into unicode.

This is the impression that I was under. My unicode isn't that strong, so
maybe my understanding is off...but I tried:

file_name = file_name.decode('utf-8', 'ignore')

but when I get to my logging code:

logfile.write(file_name)

that throws the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 39-41: ordinal not in range(128)

Andre Müller

Jun 24, 2017, 4:31:06 PM

Can os.fsencode and os.fsdecode help? I've seen them somewhere,
but I've never used them.

To fix encodings, I sometimes use the ftfy module.
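
For reference, a small sketch of what the two os functions do: they convert between bytes and str using the file system encoding and its error handler. Whether this helps depends on what that encoding is on the machine in question; the bytes literal below is just the name from this thread, shortened.

```python
import os

raw = b'Todd\xe2\x80\x99s Song.flac'   # name as stored on disk (UTF-8 bytes)

name = os.fsdecode(raw)     # bytes -> str, using the file system encoding
again = os.fsencode(name)   # str -> bytes, the inverse operation

print(ascii(name))
print(again == raw)
```

The round trip gives back the original bytes even when the file system encoding is wrong for the name, which is why a str returned by os.walk() can still be used to open the file.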

Greetings
Andre

MRAB

Jun 24, 2017, 4:35:25 PM

Your logfile was opened with the 'ascii' encoding, so you can't write
anything outside the ASCII range.

Open it with the 'utf-8' encoding instead.
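
A minimal sketch of that change. The log location and the name log_path are placeholders, not taken from the original script.

```python
import os
import tempfile

file_name = '06 - Todd\u2019s Song (Post-Spiderland Song in Progress).flac'

# Hypothetical log location, used here only so the example is self-contained.
log_path = os.path.join(tempfile.mkdtemp(), 'cleanup.log')

# An explicit encoding avoids falling back to the locale default (ASCII here).
with open(log_path, 'w', encoding='utf-8') as logfile:
    logfile.write(file_name + '\n')

# Read it back with the same encoding to confirm nothing was lost.
with open(log_path, encoding='utf-8') as logfile:
    logged = logfile.read()

print(logged == file_name + '\n')
```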

Peter Otten

Jun 24, 2017, 5:17:31 PM

Rod Person wrote:

> On Sat, 24 Jun 2017 21:28:45 +0200
> Peter Otten <__pet...@web.de> wrote:
>

>> Rod Person wrote:
>>
>> > Hi,
>> >
>> > I'm working on a program that will walk a file system and clean the
>> > id3 tags of mp3 and flac files, everything is working great until
>> > the follow file is found
>> >
>> > '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
>> >
>> > for some reason that I can't understand os.walk() returns this file
>> > name as
>> >
>> > '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in
>> > Progress).flac'
>> >

>> > which then causes more hell than a little bit for me. I'm not
>> > understand why apostrophe(') becomes \xe2\x80\x99, or what I can do
>> > about it.
>>

>> >>> b"\xe2\x80\x99".decode("utf-8")
>> '’'
>> >>> unicodedata.name(_)
>> 'RIGHT SINGLE QUOTATION MARK'
>>
>> So it's '’' rather than "'".
>>

>> > The script is Python 3, the file system it is running on is a hammer
>> > filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS
>> > which runs some kind of Linux so it probably ext3/4. The files came
>> > from various system (Mac, Windows, FreeBSD).
>>

>> There seems to be a mismatch between the assumed and the actual file
>> system encoding somewhere in this mix. Is this the only glitch or are
>> there similar problems with other non-ascii characters?
>>
>
> This is the only glitch as in file names so far.
>

Then I'd fix the name manually...

Steve D'Aprano

Jun 24, 2017, 9:01:15 PM

On Sun, 25 Jun 2017 07:17 am, Peter Otten wrote:

> Then I'd fix the name manually...

The file name isn't broken.

What's broken are the parts of the OP's code that assume non-ASCII file names
are broken...

--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

Peter Otten

Jun 25, 2017, 2:58:19 AM

Steve D'Aprano wrote:

> On Sun, 25 Jun 2017 07:17 am, Peter Otten wrote:
>
>> Then I'd fix the name manually...
>
> The file name isn't broken.
>
>
> What's broken is parts of the OP's code which assumes that non-ASCII file
> names are broken...

Hm, the OP says

'06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'

Shouldn't it be

'06 - Todd’s Song (Post-Spiderland Song in Progress).flac'

if everything worked correctly? Though I don't understand why the OP doesn't
see

'06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac'

which is the repr() that I get.

Steve D'Aprano

Jun 25, 2017, 3:53:43 AM

On Sun, 25 Jun 2017 04:57 pm, Peter Otten wrote:

> Steve D'Aprano wrote:
>
>> On Sun, 25 Jun 2017 07:17 am, Peter Otten wrote:
>>
>>> Then I'd fix the name manually...
>>
>> The file name isn't broken.
>>
>>
>> What's broken is parts of the OP's code which assumes that non-ASCII file
>> names are broken...
>
> Hm, the OP says
>
> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>
> Shouldn't it be
>
> '06 - Todd’s Song (Post-Spiderland Song in Progress).flac'

It should, if the OP did everything right.

He has a file name containing the word "Todd’s":

# Python 3.5

py> fname = 'Todd’s'
py> repr(fname)
"'Todd’s'"

On disk, that is represented in UTF-8:

py> repr(fname.encode('utf-8'))
"b'Todd\\xe2\\x80\\x99s'"

The OP appears to be using Python 2, so when he calls os.listdir() he gets the
file names as bytes, not Unicode. That means:

- the file name will be a Python 2 str, which is a *byte string*, not a text string;
- so it is not Unicode,
- but rather the individual bytes of the UTF-8 encoding of the file name.

So in Python 2.7 instead of 3.5 above:

py> fname = u'Todd’s'
py> repr(fname)
"u'Todd\\u2019s'"
py> repr(fname.encode('utf-8'))
"'Todd\\xe2\\x80\\x99s'"

> if everything worked correctly? Though I don't understand why the OP doesn't
> see
>
> '06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>
> which is the repr() that I get.

That's mojibake and is always wrong :-) I'm not sure how you got that. Something
to do with an accidental decode to Latin-1?

# Python 2.7
py> repr(fname.encode('utf-8').decode('latin-1'))
"u'Todd\\xe2\\x80\\x99s'"

# Python 3.5
py> repr(fname.encode('utf-8').decode('latin-1'))
"'Toddâ\\x80\\x99s'"

Peter Otten

Jun 25, 2017, 4:47:48 AM

Steve D'Aprano wrote:

> On Sun, 25 Jun 2017 04:57 pm, Peter Otten wrote:

>> if everything worked correctly? Though I don't understand why the OP
>> doesn't see
>>
>> '06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>>
>> which is the repr() that I get.
>
> That's mojibake and is always wrong :-)

Yes, that's my very point.

> I'm not sure how you got that.

I took the OP's string at face value and pasted it into the interpreter:

# python 3.4

>>> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
'06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac'

> Something to do with an accidental decode to Latin-1?

If the above filename is the only one or one of a few that seem broken, and
other non-ascii filenames look OK the OP's toolchain/filesystem may work
correctly and the odd name might have been produced elsewhere, e. g. by
copying an already messed-up freedb.org entry.

[Eureka]

However, the most likely explanation is that the filename is correct and
that the OP is not using Python 3 as he claims but Python 2.

Yes, it took that long for me to realise ;) Python 2 is slowly sinking into
oblivion...

wxjm...@gmail.com

Jun 25, 2017, 5:23:43 AM

On Saturday, June 24, 2017 at 21:10:47 UTC+2, alister wrote:
> On Sat, 24 Jun 2017 14:57:21 -0400, Rod Person wrote:
>
> > \xe2\x80\x99,
>
> because the file name has been created using "Right single quote" instead
> of apostrophe, the glyphs look identical in many fonts.
>
>

Trust me. Fonts clearly make a distinction between
\u0027 and \u2019.

alister

Jun 25, 2017, 7:39:07 AM

Not all, and even when they do, it has absolutely nothing to do with the
point of the post.
The character in the file name is \u2019, a right single quotation mark, and
not the apostrophe that the OP was assuming.
He needs to decode the file name correctly.

--
You will be held hostage by a radical group.

Rod Person

Jun 25, 2017, 8:19:03 AM

Ok...so after reading all the replies in the thread, I thought it would
be easier to send a general reply and include some links to screenshots.

As Peter mentioned, the logical thing to do would be to fix the file name
to what I actually thought it was, and if this was for work that's
probably what I would have done, but since I want to understand what's
going on I decided to waste time on that.

I have to admit, I didn't think the file system was utf-8, as seeing what
looked to be an apostrophe sent me down the road of "why is this
apostrophe screwed up" instead of "ah, this must be unicode".

But doing a simple ls of that directory shows it is unicode, but with the
offending character replaced.

http://rodperson.com/graphics/uc/ls.png

I am in fact using Python 3.5. I may be lacking in unicode skills but I
do have sense enough to know the version of Python I am invoking.
So I included this screenshot showing the version of Python and the
file list returned by os.walk.

http://rodperson.com/graphics/uc/files.png

So the fact that it shows as a string and not bytes in the debugger was
throwing me for a loop. In my log section I was trying to determine whether
it was unicode and decode it if so, and otherwise do nothing, which wasn't
working.

http://rodperson.com/graphics/uc/log_section.png
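
The screenshot is not reproduced here, so the following is only a guess at the intent: in Python 3 the check that matters is whether the value is bytes, because only bytes has a decode() method. The helper name to_text is hypothetical.

```python
def to_text(name, encoding='utf-8'):
    """Return name as str, decoding it only if it is still bytes."""
    if isinstance(name, bytes):
        # errors='replace' avoids an exception on names that are not UTF-8.
        return name.decode(encoding, errors='replace')
    # Already a str: Python 3 str objects have no decode() method.
    return name


print(ascii(to_text(b'Todd\xe2\x80\x99s')))   # 'Todd\u2019s'
print(ascii(to_text('plain.flac')))           # 'plain.flac'
```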


--
Rod

http://www.rodperson.com

wxjm...@gmail.com

Jun 25, 2017, 9:21:27 AM

I recommend asking the Python devs directly. They are
brilliant at solving this kind of problem.

Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)] on win32
>>> eta runs etazero.py...
...etazero has been executed
>>> import time
>>> print(time.tzname)
('Europe de l\x92Ouest', 'Europe de l\x92Ouest (heure d\x92été)')
>>>

Python 3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)] on win32
>>> eta runs etazero.py...
...etazero has been executed
>>> import time
>>> print(time.tzname)
('Europe de l’Ouest', 'Europe de l’Ouest (heure d’été)')
>>> ascii(time.tzname[0])
"'Europe de l\\u2019Ouest'"
>>>

Maybe not...

Michael Torrie

Jun 25, 2017, 10:19:01 AM

On 06/25/2017 06:19 AM, Rod Person wrote:
> But doing a simple ls of that directory show it is unicode but the
> replacement of the offending character.
>
> http://rodperson.com/graphics/uc/ls.png

Now that is really strange. Your OS seems to not recognize that the
filename is in UTF-8. I suspect this has something to do with the NAS
file sharing protocol (smb). Though I'm pretty sure that Samba can
handle UTF-8 filenames correctly.

> I am in fact using Python 3.5. I may be lacking in unicode skills but I
> do have the sense enough to know the version of Python I am invoking.
> So I included this screenshot of that so the version of Python and the
> files list returned by os.walk
>
> http://rodperson.com/graphics/uc/files.png

If I create a file that has the U+2019 character in it on my Linux
machine (BtrFS), and do os.walk on it, I see the character in the
string properly. So it looks like Python does the right thing,
automatically decoding from UTF-8.

In your situation I think the problem is the file sharing protocol that
your NAS is using. Somehow some information is being lost and your OS
does not know that the filenames are in UTF-8, and just thinks they are
bytes. And therefore Python doesn't know to decode the string, so you
just end up with each byte being converted to a unicode code point and
being shoved into the unicode string.

How to get around this issue I don't know. Maybe there's a way to
convert the unicode string to bytes using the value of each character,
and then decode that back to unicode.
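
One way to do that, as a sketch: Latin-1 maps code points 0 to 255 directly onto the same byte values, so encoding with it recovers the original bytes, which can then be decoded as UTF-8. This assumes the string really is UTF-8 that was read one byte per character; the function name is made up for illustration.

```python
def undo_latin1_mojibake(name):
    """Reinterpret a byte-per-character string as UTF-8, if that is possible."""
    try:
        return name.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the name has characters above U+00FF or the bytes are not
        # valid UTF-8, so it was probably fine already: leave it alone.
        return name


garbled = 'Todd\xe2\x80\x99s'
fixed = undo_latin1_mojibake(garbled)
print(ascii(fixed))   # 'Todd\u2019s'
```

Names that were decoded correctly in the first place pass through unchanged, because they fail one of the two conversions.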

Peter Otten

Jun 25, 2017, 10:28:39 AM

Rod Person wrote:

> Ok...so after reading all the replies in the thread, I thought I would
> be easier to send a general reply and include some links to screenshots.
>
> As Peter mention, the logic thing to do would be to fix the file name
> to what I actually thought it was and if this was for work that
> probably what I would have done, but since I want to understand what's
> going on I decided to waste time on that.
>
> I have to admit, I didn't think the file system was utf-8 as seeing what
> looked to be an apostrophe sent me down the road of why is this
> apostrophe screwed up instead of "ah this must be unicode".
>

> But doing a simple ls of that directory show it is unicode but the
> replacement of the offending character.
>
> http://rodperson.com/graphics/uc/ls.png

Have you set LANG to something that implies ASCII?

$ touch Todd’s ähnlich üblich löblich
$ ls
ähnlich löblich Todd’s üblich
$ LANG=C ls
Todd???s l??blich ??hnlich ??blich
$ python3 -c 'import os; print(os.listdir())'
['Todd’s', 'üblich', 'ähnlich', 'löblich']
$ LANG=C python3 -c 'import os; print(os.listdir())'
['Todd\udce2\udc80\udc99s', '\udcc3\udcbcblich', '\udcc3\udca4hnlich',
'l\udcc3\udcb6blich']
$ LANG=en_US.utf-8 python3 -c 'import os; print(os.listdir())'
['Todd’s', 'üblich', 'ähnlich', 'löblich']

For file names Python resorts to surrogates whenever a byte does not
translate into a character in the advertised encoding.
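
The same effect can be reproduced without touching the file system, using the surrogateescape error handler by hand. This is a sketch with ASCII standing in for the advertised encoding, as in the LANG=C run above.

```python
raw = b'Todd\xe2\x80\x99s'   # UTF-8 bytes of the file name

# Bytes that are not valid ASCII become lone surrogates U+DC80..U+DCFF.
escaped = raw.decode('ascii', errors='surrogateescape')
print(ascii(escaped))        # 'Todd\udce2\udc80\udc99s'

# The escape is reversible: encode the same way, then decode as UTF-8.
recovered = escaped.encode('ascii', errors='surrogateescape').decode('utf-8')
print(ascii(recovered))      # 'Todd\u2019s'
```

Because the original bytes survive inside the surrogates, such a name can still be passed back to open() or os.rename(), even though it cannot be printed or written to a UTF-8 log as is.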

> I am in fact using Python 3.5. I may be lacking in unicode skills but I
> do have the sense enough to know the version of Python I am invoking.

I've made so many "stupid errors" myself that I always consider them first
;)

> So I included this screenshot of that so the version of Python and the
> files list returned by os.walk
>
> http://rodperson.com/graphics/uc/files.png
>

> So the fact that it shows as a string and not bytes in the debugger was
> throwing me for a loop, in my log section I was trying to determine if
> it was unicode decode it...if not don't do anything which wasn't working
>
> http://rodperson.com/graphics/uc/log_section.png

Rod Person

Jun 25, 2017, 11:14:53 AM

On Sun, 25 Jun 2017 08:18:45 -0600
Michael Torrie <tor...@gmail.com> wrote:

> On 06/25/2017 06:19 AM, Rod Person wrote:

> > But doing a simple ls of that directory show it is unicode but the
> > replacement of the offending character.
> >
> > http://rodperson.com/graphics/uc/ls.png
>

> Now that is really strange. Your OS seems to not recognize that the
> filename is in UTF-8. I suspect this has something to do with the NAS
> file sharing protocol (smb). Though I'm pretty sure that Samba can
> handle UTF-8 filenames correctly.
>

> > I am in fact using Python 3.5. I may be lacking in unicode skills
> > but I do have the sense enough to know the version of Python I am
> > invoking. So I included this screenshot of that so the version of
> > Python and the files list returned by os.walk
> >
> > http://rodperson.com/graphics/uc/files.png
>

> If I create a file that has the U+2019 character in it on my Linux
> machine (BtrFS), and do os.walk on it, I see the character in then
> string properly. So it looks like Python does the right thing,
> automatically decoding from UTF-8.
>
> In your situation I think the problem is the file sharing protocol
> that your NAS is using. Somehow some information is being lost and
> your OS does not know that the filenames are in UTF-8, and just
> thinks they are bytes. And therefore Python doesn't know to decode
> the string, so you just end up with each byte being converted to a
> unicode code point and being shoved into the unicode string.
>
> How to get around this issue I don't know. Maybe there's a way to
> convert the unicode string to bytes using the value of each character,
> and then decode that back to unicode.

I think your theory is on the right path. I'm actually attached to the
NAS via NFS, not Samba. Quickly looking into that, it seems the NFS
server needs an option set to pass unicode correctly...but my NAS
software doesn't allow me access to the settings, only to turn NFS on or
off.

Looks like my option is the original one: correct the file name.