
UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and...


Alf P. Steinbach

Nov 23, 2009, 4:06:29 PM
This is the tragic story of this evening:
1. Aspirins to lessen the pain somewhat.
2. Over in [comp.programming] someone mentions paper on Quicksort.
3. I recall that X once sent me link to paper about how to foil
Quicksort, written by was it Doug McIlroy, anyway some Bell Labs guy.
Want to post that link in response to [comp.programming] article.
4. Checking in Thunderbird, no mails from X or about QS there.
5. But his mail address in address list so something funny going on!
6. Googling, yes, it seems Thunderbird has a habit of "forgetting" mails. But
they're really there after all. It's just the index that's screwed up.
7. OK, opening Thunderbird mailbox file (it's just text) in nearest editor.
8. Machine hangs, Windows says it must increase virtual memory, blah blah.
9. Making little Python script to extract individual mails from file.
10. It says UnicodeDecodeError on mail nr. something something.
11. I switch mode to binary. Didn't know if that would work with std input.
12. It's now apparently ten times faster but *still* UnicodeDecodeError!
13. I ask here!

Of course could have googled that paper, but at each step above it seemed just a
half minute more to find the link in mails, and now I decided it must be found.

And I'm hesitant to just delete index file, hoping that it'll rebuild.

Thunderbird does funny things, so best would be if Python script worked.


<code>
import os
import fileinput

def write( s ): print( s, end = "" )

msg_id = 0
f = open( "nul", "w" )
for line in fileinput.input( mode = "rb" ):
    if line.startswith( "From - " ):
        msg_id += 1;
        f.close()
        print( msg_id )
        f = open( "msg_{0:0>6}.txt".format( msg_id ), "w+" )
    else:
        f.write( line )
f.close()
</code>


<last few lines of output>
955
956
957
958
Traceback (most recent call last):
  File "C:\test\tbfix\splitmails.py", line 11, in <module>
    for line in fileinput.input( mode = "rb" ):
  File "C:\Program Files\cpython\python31\lib\fileinput.py", line 254, in __next__
    line = self.readline()
  File "C:\Program Files\cpython\python31\lib\fileinput.py", line 349, in readline
    self._buffer = self._file.readlines(self._bufsize)
  File "C:\Program Files\cpython\python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 2188: character maps to <undefined>
</last few lines of output>


Cheers,

- Alf

Alf P. Steinbach

Nov 23, 2009, 5:37:58 PM
* Alf P. Steinbach:

The following worked:


<code>
import sys
import fileinput

def write( s ): print( s, end = "" )

msg_id = 0
f = open( "nul", "w" )

input = sys.stdin.detach() # binary
while True:
    line = input.readline()
    if len( line ) == 0:
        break
    elif line.decode( "ascii", "ignore" ).startswith( "From - " ):
        msg_id += 1;
        f.close()
        print( msg_id )
        f = open( "msg_{0:0>6}.txt".format( msg_id ), "wb+" )
    else:
        f.write( line )
f.close()
</code>
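
For comparison, here is a minimal sketch of the same split that reads the mailbox from a named file instead of stdin and compares against a bytes prefix, so nothing is ever decoded. The function name `split_mbox` and its `out_dir` parameter are made up for illustration; like the script above, it drops the "From - " separator lines.

```python
import os

def split_mbox(path, out_dir, separator=b"From - "):
    """Write each message of the mbox file at `path` to its own file in `out_dir`.

    Returns the number of messages found. Everything stays as bytes, so no
    codec is involved at any point. (Hypothetical helper, not from the thread.)
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    out = None
    with open(path, "rb") as src:
        for line in src:                     # a binary file yields bytes lines
            if line.startswith(separator):   # bytes prefix against bytes line
                if out is not None:
                    out.close()
                count += 1
                name = "msg_{0:0>6}.txt".format(count)
                out = open(os.path.join(out_dir, name), "wb")
            elif out is not None:            # skip anything before the first separator
                out.write(line)
    if out is not None:
        out.close()
    return count
```

Called as `split_mbox("Inbox", "out")`, it should produce `out/msg_000001.txt`, `out/msg_000002.txt` and so on, assuming `Inbox` is the mailbox file.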


Cheers,

- Alf

Terry Reedy

Nov 23, 2009, 8:48:40 PM
to pytho...@python.org
Alf P. Steinbach wrote:

> import os
> import fileinput
>
> def write( s ): print( s, end = "" )

I believe this is the same as
write = sys.stdout.write
though you never use it that I see.


>
> msg_id = 0
> f = open( "nul", "w" )
> for line in fileinput.input( mode = "rb" ):

I presume you are expecting the line to be undecoded bytes, as with
open(f,'rb'). To be sure, add write(type(line)).
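
The type check matters because Python 3 does not mix bytes and str. A small sketch, using a throwaway sample file (the path and contents are made up) that contains the 0x8f byte from the traceback:

```python
import os
import tempfile

# Hypothetical sample file containing the byte from the traceback (0x8f).
path = os.path.join(tempfile.mkdtemp(), "sample.mbox")
with open(path, "wb") as fh:
    fh.write(b"From - Mon Nov 23 2009\nbody \x8f\n")

with open(path, "rb") as fh:
    first = fh.readline()

line_type = type(first)                       # bytes, not str, in binary mode
matches_bytes = first.startswith(b"From - ")  # bytes prefix against bytes: fine

try:
    first.startswith("From - ")               # str prefix against bytes
    str_prefix_ok = True
except TypeError:
    str_prefix_ok = False                     # Python 3 refuses to mix the two
```

So if the lines really were bytes, the `startswith( "From - " )` test would fail with a TypeError rather than a UnicodeDecodeError.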

>     if line.startswith( "From - " ):
>         msg_id += 1;
>         f.close()
>         print( msg_id )
>         f = open( "msg_{0:0>6}.txt".format( msg_id ), "w+" )

I do not understand why you are writing since you just wanted to look.
In any case, you open in text mode.


>     else:
>         f.write( line )
> f.close()
> </code>
>
>
> <last few lines of output>
> 955
> 956
> 957
> 958
> Traceback (most recent call last):
>   File "C:\test\tbfix\splitmails.py", line 11, in <module>
>     for line in fileinput.input( mode = "rb" ):
>   File "C:\Program Files\cpython\python31\lib\fileinput.py", line 254, in __next__
>     line = self.readline()
>   File "C:\Program Files\cpython\python31\lib\fileinput.py", line 349, in readline
>     self._buffer = self._file.readlines(self._bufsize)
>   File "C:\Program Files\cpython\python31\lib\encodings\cp1252.py", line 23, in decode
>     return codecs.charmap_decode(input,self.errors,decoding_table)[0]
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 2188: character maps to <undefined>

It goes ahead and tries to decode to str anyway. Maybe there is a bug,
though maybe the text-mode open in the loop somehow changes fileinput,
especially if you write to something it has open. So I would not report
a bug until I tried reading without writing.

tjr

Lie Ryan

Nov 23, 2009, 11:02:23 PM
Alf P. Steinbach wrote:
>
> And I'm hesitant to just delete index file, hoping that it'll rebuild.

It'll be rebuilt the next time you start Thunderbird:
(MozillaZine: http://kb.mozillazine.org/Disappearing_mail)
* It's possible that the ".msf" files (index files) are corrupted. To
rebuild the index of a folder, right-click it, select Properties, and
choose "Rebuild Index" from the General Information tab. You can also
close Thunderbird and manually delete them from your profile folder;
they will be rebuilt when Thunderbird starts.

Nobody

Nov 24, 2009, 1:15:18 AM
On Mon, 23 Nov 2009 22:06:29 +0100, Alf P. Steinbach wrote:

> 10. It says UnicodeDecodeError on mail nr. something something.

That's what you get for using Python 3.x ;)

If you must use 3.x, don't use the standard descriptors. If you must use
the standard descriptors in 3.x, call detach() on them to get the
underlying binary stream, i.e.

stdin = sys.stdin.detach()
stdout = sys.stdout.detach()

and use those instead.

Or set LC_ALL or LC_CTYPE to an ISO-8859-* locale (any stream of bytes can
be decoded, and any string resulting from decoding can be encoded).
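
The parenthetical can be checked directly. A minimal sketch: ISO-8859-1 round-trips every byte value, whereas cp1252 (the codec named in the traceback) has no mapping for 0x8f.

```python
data = bytes(range(256))                 # every possible byte value

# ISO-8859-1 (latin-1) maps each byte to the code point with the same number,
# so decoding never fails and encoding gives the original bytes back.
text = data.decode("iso-8859-1")
round_trip_ok = text.encode("iso-8859-1") == data

# cp1252 leaves a few byte values undefined; 0x8f is one of them.
try:
    b"\x8f".decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False
```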

Steven D'Aprano

Nov 24, 2009, 8:02:09 AM
On Mon, 23 Nov 2009 22:06:29 +0100, Alf P. Steinbach wrote:


> 6. Googling, yes, it seems Thunderbird has a habit of "forgetting"
> mails. But they're really there after all. It's just the index that's
> screwed up.

[...]


> And I'm hesitant to just delete index file, hoping that it'll rebuild.

Right-click on the mailbox and choose "Rebuild Index".

If you're particularly paranoid, and you probably should be, make a
backup copy of the entire mail folder first.

http://kb.mozillazine.org/Compacting_folders
http://kb.mozillazine.org/Recover_messages_from_a_corrupt_folder
http://kb.mozillazine.org/Disappearing_mail
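
The backup step can be scripted as well. A rough sketch, assuming Thunderbird is closed while it runs; the helper name `backup_mail_folder` and both directory arguments are hypothetical, and the real profile location has to be looked up per installation.

```python
import os
import shutil
import time

def backup_mail_folder(mail_dir, backup_root):
    """Copy `mail_dir` into a new timestamped directory under `backup_root`.

    Hypothetical helper: both paths are supplied by the caller.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = os.path.basename(os.path.normpath(mail_dir))
    target = os.path.join(backup_root, base + "-" + stamp)
    shutil.copytree(mail_dir, target)    # raises if `target` already exists
    return target
```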


Good grief, it's about six weeks away from 2010 and Thunderbird still
uses mbox as its default mail box format. Hello, the nineties called,
they want their mail formats back! Are the tbird developers on crack or
something? I can't believe that they're still using that crappy format.

No, I tell a lie. I can believe it far too well.

--
Steven

Chris Jones

Nov 24, 2009, 1:19:10 PM
to pytho...@python.org
On Tue, Nov 24, 2009 at 08:02:09AM EST, Steven D'Aprano wrote:

> Good grief, it's about six weeks away from 2010 and Thunderbird still
> uses mbox as its default mail box format. Hello, the nineties called,
> they want their mail formats back! Are the tbird developers on crack or
> something? I can't believe that they're still using that crappy format.
>
> No, I tell a lie. I can believe it far too well.

:-)

I realize that's somewhat OT, but what mail box format do you recommend,
and why?

Thanks,

CJ

Steven D'Aprano

Nov 24, 2009, 5:43:32 PM

maildir++

http://en.wikipedia.org/wiki/Maildir

Corruption is less likely, if there is corruption you'll only lose a
single message rather than potentially everything in the mail folder[*],
at a pinch you can read the emails using a text editor or easily grep
through them, and compacting the mail folder is lightning fast, there's
no wasted space in the mail folder, and there's no need to mangle lines
starting with "From " in the body of the email.
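
For anyone who hasn't met the mangling: in an mbox file a line starting with "From " marks the start of the next message, so a body line that happens to start that way has to be escaped when it is stored. A sketch of the classic quoting rule (prefix such lines with ">"); the function name is made up for illustration.

```python
def mbox_escape(body_lines):
    """Prefix body lines that start with b'From ' with b'>' (classic mbox quoting)."""
    return [b">" + line if line.startswith(b"From ") else line
            for line in body_lines]

body = [b"Hi,\n", b"From here on it gets tricky.\n", b"Bye\n"]
escaped = mbox_escape(body)   # only the second line is changed
```

With one file per message, as in maildir, there is no separator line to collide with, so the body can be stored untouched.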

The only major downside is that because you're dealing with potentially
thousands of smallish files, it *may* have reduced performance on some
older file systems that don't deal well with lots of files. These days,
that's not a real issue.

Oh yes, and people using Windows can't use maildir because (1) it doesn't
allow colons in names, and (2) it doesn't have atomic renames. Neither of
these are insurmountable problems: an implementation could substitute
another character for the colon, and while that would be a technical
violation of the standard, it would still work. And the lack of atomic
renames would simply mean that implementations have to be more careful
about not having two threads writing to the one mailbox at the same time.
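
Python's standard `mailbox` module allows for the colon substitution: `Maildir` has a `colon` attribute that can be set to another character. A small sketch, with a temporary directory standing in for a real mailbox location and "!" as the replacement character:

```python
import mailbox
import os
import tempfile

root = os.path.join(tempfile.mkdtemp(), "Maildir")   # hypothetical location
box = mailbox.Maildir(root, create=True)
box.colon = "!"                        # used instead of ':' in message file names

msg = mailbox.MaildirMessage("Subject: test\n\nbody\n")
msg.set_subdir("cur")                  # flags only show up in names under cur/
msg.set_flags("S")                     # mark as seen
box.add(msg)
box.close()

names = os.listdir(os.path.join(root, "cur"))   # one file, named <unique>!2,S
```

This covers only the file-name problem; the sketch says nothing about atomic renames.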


[*] I'm assuming normal "oops there's a bug in the mail client code"
corruption rather than "I got drunk and started deleting random files and
directories" corruption.

--
Steven

samwyse

Nov 24, 2009, 6:09:00 PM
On Nov 24, 4:43 pm, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:

> Oh yes, and people using Windows can't use maildir because (1) it doesn't
> allow colons in names, and (2) it doesn't have atomic renames. Neither of
> these are insurmountable problems: an implementation could substitute
> another character for the colon, and while that would be a technical
> violation of the standard, it would still work. And the lack of atomic
> renames would simply mean that implementations have to be more careful
> about not having two threads writing to the one mailbox at the same time.

A common workaround for the former is to URL-encode the names, which
lets you stick in all sorts of odd characters.
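
In Python that could look roughly like this; the file name is a made-up maildir-style example, and `encode_name` is just an illustrative wrapper around `urllib.parse.quote`.

```python
from urllib.parse import quote, unquote

def encode_name(name):
    """Percent-encode every character except letters, digits and '_.-~'."""
    return quote(name, safe="")

original = "1259082540.M1P2.host:2,S"   # made-up maildir-style file name
encoded = encode_name(original)          # the colon and comma become %XX escapes
decoded = unquote(encoded)               # gives the original name back
```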

I'm afraid I can't help with the latter, though.

Chris Jones

Nov 25, 2009, 12:11:08 AM
to pytho...@python.org
On Tue, Nov 24, 2009 at 05:43:32PM EST, Steven D'Aprano wrote:
> On Tue, 24 Nov 2009 13:19:10 -0500, Chris Jones wrote:
>
> > On Tue, Nov 24, 2009 at 08:02:09AM EST, Steven D'Aprano wrote:
> >
> >> Good grief, it's about six weeks away from 2010 and Thunderbird still
> >> uses mbox as its default mail box format. Hello, the nineties called,
> >> they want their mail formats back! Are the tbird developers on crack or
> >> something? I can't believe that they're still using that crappy format.
> >>
> >> No, I tell a lie. I can believe it far too well.
> >
> > :-)
> >
> > I realize that's somewhat OT, but what mail box format do you recommend,
> > and why?
>
> maildir++
>
> http://en.wikipedia.org/wiki/Maildir

Outside the two pluses, maildir also goes back to the 90s - 1995, Daniel
Bernstein's original specification.

> Corruption is less likely, if there is corruption you'll only lose a
> single message rather than potentially everything in the mail folder[*],
> at a pinch you can read the emails using a text editor or easily grep
> through them, and compacting the mail folder is lightning fast, there's
> no wasted space in the mail folder, and there's no need to mangle lines
> starting with "From " in the body of the email.

This last aspect is very welcome.

> The only major downside is that because you're dealing with potentially
> thousands of smallish files, it *may* have reduced performance on some
> older file systems that don't deal well with lots of files. These days,
> that's not a real issue.
>
> Oh yes, and people using Windows can't use maildir because (1) it doesn't
> allow colons in names, and (2) it doesn't have atomic renames. Neither of
> these are insurmountable problems: an implementation could substitute
> another character for the colon, and while that would be a technical
> violation of the standard, it would still work. And the lack of atomic
> renames would simply mean that implementations have to be more careful
> about not having two threads writing to the one mailbox at the same time.
>
>
> [*] I'm assuming normal "oops there's a bug in the mail client code"
> corruption rather than "I got drunk and started deleting random files and
> directories" corruption.

I'm not concerned with the other aspects, but I'm reaching a point where
mutt is becoming rather sluggish with the mbox format, especially with
mail boxes that have more than about 3000 messages, and it looks like
maildir, especially with some form of header caching, might help.

Looks like running a local IMAP server would probably be more effective,
though.

Thank you for your comments.

CJ

Terry Reedy

Nov 26, 2009, 1:30:57 PM
to pytho...@python.org
Ken Seehart wrote:
> I need to create a pipe where I have one thread (or maybe a generator)
> writing data to the tail while another python object is reading from the
> head. This will run in real time, so the data must be deallocated after
> it is consumed.

CPython does that when last reference disappears.

> Reading should block until data is written, and writing
> should block when the buffer is full (i.e. until some of the data is
> consumed). I assume there must be a trivial way to do this, but I don't
> see it. Any ideas or examples?
>
> I'm using python 2.6.
>
queue module

Dave Angel

Nov 26, 2009, 2:50:45 PM
to Ken Seehart, pytho...@python.org

Ken Seehart wrote:
> I need to create a pipe where I have one thread (or maybe a generator)
> writing data to the tail while another python object is reading from
> the head. This will run in real time, so the data must be deallocated
> after it is consumed. Reading should block until data is written, and
> writing should block when the buffer is full (i.e. until some of the
> data is consumed). I assume there must be a trivial way to do this,
> but I don't see it. Any ideas or examples?
>
> I'm using python 2.6.
>
>

Seems to me collections.deque is a good data structure for the purpose,
at least if both operations are in the same thread.

For multithreading, consider Queue module (or queue in Python 3.x).
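
A minimal sketch of the multithreaded case, written for Python 3 (on 2.6 the import would be `Queue` instead of `queue`). The bound of 4 and the `None` sentinel are arbitrary choices for the example.

```python
import queue
import threading

def producer(q, items):
    for item in items:
        q.put(item)          # blocks while the queue is full
    q.put(None)              # sentinel: no more data

def consumer(q, out):
    while True:
        item = q.get()       # blocks until something is available
        if item is None:
            break
        out.append(item)

q = queue.Queue(maxsize=4)   # small bound, so the writer has to wait for the reader
received = []
writer = threading.Thread(target=producer, args=(q, range(20)))
reader = threading.Thread(target=consumer, args=(q, received))
writer.start()
reader.start()
writer.join()
reader.join()
```

Each item is dropped from the queue as soon as `get()` returns it, so consumed data does not pile up.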

DaveA

Aahz

Dec 1, 2009, 11:45:13 AM
In article <031bc732$0$1336$c3e...@news.astraweb.com>,
Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
>
>Good grief, it's about six weeks away from 2010 and Thunderbird still
>uses mbox as its default mail box format. Hello, the nineties called,
>they want their mail formats back! Are the tbird developers on crack or
>something? I can't believe that they're still using that crappy format.

Just to be contrary, I *like* mbox.
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

The best way to get information on Usenet is not to ask a question, but
to post the wrong information.

Michael Ströder

Dec 3, 2009, 6:52:35 PM
Aahz wrote:
> In article <031bc732$0$1336$c3e...@news.astraweb.com>,
> Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
>> Good grief, it's about six weeks away from 2010 and Thunderbird still
>> uses mbox as its default mail box format. Hello, the nineties called,
>> they want their mail formats back! Are the tbird developers on crack or
>> something? I can't believe that they're still using that crappy format.
>
> Just to be contrary, I *like* mbox.

Me too. :-)

Ciao, Michael.

Steven D'Aprano

Dec 3, 2009, 7:33:57 PM


Why? What features or benefits of mbox do you see that make up for its
disadvantages?

--
Steven

David Robinow

Dec 3, 2009, 7:59:30 PM
to pytho...@python.org

I've never heard of mbox. Is it written in Python?

Steven D'Aprano

Dec 3, 2009, 8:27:11 PM
On Thu, 03 Dec 2009 19:59:30 -0500, David Robinow wrote:


> I've never heard of mbox. Is it written in Python?

It is a file format used for storing email. Wikipedia is your friend:

http://en.wikipedia.org/wiki/Mbox
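
As for Python: the standard library can read the format through the `mailbox` module. A small sketch that writes a two-message sample file (path and contents are made up) and lists the subjects:

```python
import mailbox
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "Inbox")    # hypothetical mbox file
with open(path, "wb") as fh:
    fh.write(b"From - Mon Nov 23 16:06:29 2009\n"
             b"Subject: first\n\nbody one\n\n"
             b"From - Mon Nov 23 17:37:58 2009\n"
             b"Subject: second\n\nbody two\n")

box = mailbox.mbox(path, create=False)
subjects = [msg["Subject"] for msg in box]   # iterating yields message objects
box.close()
```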


--
Steven

Chris Jones

Dec 3, 2009, 8:47:55 PM
to pytho...@python.org
On Thu, Dec 03, 2009 at 07:59:30PM EST, David Robinow wrote:
> On Thu, Dec 3, 2009 at 7:33 PM, Steven D'Aprano
> <st...@remove-this-cybersource.com.au> wrote:
> > On Fri, 04 Dec 2009 00:52:35 +0100, Michael Ströder wrote:
> >
> >> Aahz wrote:
> >>> Just to be contrary, I *like* mbox.
> >>
> >> Me too. :-)
> >
> >
> > Why? What features or benefits of mbox do you see that make up for its
> > disadvantages?
>
> I've never heard of mbox. Is it written in Python?

English, actually.. short for mail box, I gather.

CJ

Nobody

Dec 5, 2009, 11:20:08 AM
On Fri, 04 Dec 2009 00:33:57 +0000, Steven D'Aprano wrote:

>>> Just to be contrary, I *like* mbox.
>>
>> Me too. :-)

Me too.

> Why? What features or benefits of mbox do you see that make up for its
> disadvantages?

Simplicity and performance.

Maildir isn't simple when you add in the filesystem or archive format
(leaving aside the fact that maildir cannot be processed using nothing but
ANSI C).

Nor is it particularly quick if you want to grep for a message in a
decade's worth of archives (even on Linux; and NTFS is *much* worse for
dealing with many small files).
