
Perl script to identify corrupt mbox messages?


Tuxedo

Jun 14, 2007, 10:06:00 AM
I'm trying to repair a gigantic mbox file that appears to have been
corrupted: Mozilla displays only the 17 most recent of the 3087
messages actually contained in the file.

The mail application used is Mozilla on Windows. The exact same error
occurs in all Mozilla applications I tested the file with, i.e. the full
Mozilla suite, the most recent Seamonkey and the stand-alone Thunderbird.

The identical error also happens when testing the mbox in Mozilla
Thunderbird on a Linux system.

I've tried to fix the error by manually removing the message just before
the first message shown in the index, but this does not seem to be the
(only) position of the culprit. The same mbox works entirely when viewed in,
for example, MUTT or the standard KDE mail client, Kmail. In other words,
the error may partly be attributed to how Mozilla parses its own mbox and
partly to incorrectly formatted messages. Perhaps there are several
incorrectly formatted messages, in the form of junk mail, which have been
crafted, intentionally or not, to corrupt standard mbox files within Mozilla.

Does anyone have or know of a Perl script that traverses an mbox
file and can identify incorrectly formatted mail messages within it?

Tuxedo

--
"Imagine if every Thursday your shoes exploded if you tied them the
usual way. This happens to us all the time with computers, and nobody
thinks of complaining."
-- Jeff Raskin, interviewed in Doctor Dobb's Journal

Glenn Jackman

Jun 14, 2007, 11:38:33 AM
At 2007-06-14 10:06AM, "Tuxedo" wrote:
> I'm trying to repair a gigantic mbox file that appears to have been
> corrupted in that it displays only 17 of the most recent and total 3087
> messages contained in the actual file.
>
> The mail application used is Mozilla on Windows. The exact same error
> occurs in all Mozilla applications I tested the file with, i.e. the full
> Mozilla suite, the most recent Seamonkey and the stand-alone Thunderbird.
>
> The identical error also happens when testing the mbox in Mozilla
> Thunderbird on a Linux system.
>
> I've tried to fix the error by manually removing the message just before
> the first message shown in the index, but this does not seem to be the
> (only) position of the culprit. The same mbox works entirely when viewed in
> for example MUTT or the standard KDE mail client, Kmail. In other words,
> the error may partly be attributed to how Mozilla parses it's own mbox and
> partly to any incorrectly formatted messages. Perhaps there are several
> incorrectly formatted messages, in the form of junk mail, which have been
> intentionally or not crafted to currupt standard mbox files within Mozilla.
>
> Does anyone have or know of a perl script that traverses through an mbox
> file and that can identify incorrectly formatted mail messages within?

Make sure that supposedly empty lines are in fact empty (no stray spaces
or carriage returns). Check for the proper existence of /^From / lines
following a blank line (except for the first line of the mbox) which are
the indicator for a message in an mbox file.
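
The checks above could be sketched in Perl roughly as follows. This is
only a minimal sketch, not a full mbox validator: the function name
check_mbox, the problem labels and the file name in the usage comment are
made up for illustration, and it assumes that every line starting with
"From " is meant to be a message separator (an unescaped "From " at the
start of a body line will therefore be reported as well).

```perl
use strict;
use warnings;

# Scan an mbox line by line (so a 150 MB file is never held in memory) and
# report the problems named above. Returns a hash reference with the number
# of "From " lines seen, a count per problem kind, and the line numbers of
# the first few occurrences of each kind.
sub check_mbox {
    my ($fh, $max_examples) = @_;
    $max_examples = 20 unless defined $max_examples;
    my %report = (messages => 0, count => {}, examples => {});
    my $note = sub {
        my ($kind, $lineno) = @_;
        $report{count}{$kind}++;
        my $seen = $report{examples}{$kind} ||= [];
        push @$seen, $lineno if @$seen < $max_examples;
    };
    my $lineno     = 0;
    my $prev_empty = 1;    # the first line needs no empty line before it
    while (my $line = <$fh>) {
        $lineno++;
        $line =~ s/\n\z//;
        $note->('bad_first_line', $lineno)
            if $lineno == 1 && $line !~ /^From /;
        $note->('carriage_return', $lineno) if $line =~ /\r\z/;
        $note->('whitespace_only_line', $lineno) if $line =~ /^[ \t\r]+\z/;
        if ($line =~ /^From /) {
            $report{messages}++;
            $note->('from_without_empty_line', $lineno) unless $prev_empty;
        }
        $prev_empty = ($line eq '');
    }
    return \%report;
}

# Command-line use, e.g.: perl check_mbox.pl Inbox
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    binmode $in;    # keep carriage returns visible on Windows as well
    my $r = check_mbox($in);
    close $in;
    print "$r->{messages} \"From \" lines found\n";
    for my $kind (sort keys %{ $r->{count} }) {
        print "$kind: $r->{count}{$kind} line(s), first at ",
            join(', ', @{ $r->{examples}{$kind} }), "\n";
    }
}
```

If the number of "From " lines it reports is close to the number of
messages you expect, the separators are probably intact and the problem
lies inside individual messages.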

--
Glenn Jackman
"You can only be young once. But you can always be immature." -- Dave Barry

Tuxedo

Jun 14, 2007, 1:17:20 PM
Glenn Jackman wrote:

[...]

> Make sure that supposedly empty lines are in fact empty (no stray spaces
> or carriage returns). Check for the proper existence of /^From / lines
> following a blank line (except for the first line of the mbox) which are
> the indicator for a message in an mbox file.
>

Many thanks for those pointers!

Petr Vileta

Jun 14, 2007, 8:01:43 PM
Tuxedo wrote:
> I'm trying to repair a gigantic mbox file that appears to have been
> corrupted in that it displays only 17 of the most recent and total
> 3087 messages contained in the actual file.
>
> The mail application used is Mozilla on Windows. The exact same error
> occurs in all Mozilla applications I tested the file with, i.e. the
> full Mozilla suite, the most recent Seamonkey and the stand-alone
> Thunderbird.
>
You can try my freeware Tbird2OE (search for it on Google). The program is
not primarily meant for recovering mbox files, but if it works and creates
more than 17 files, I can send you the part of the Perl code that reads and
parses the mbox file.
--

Petr Vileta, Czech republic
(My server rejects all messages from Yahoo and Hotmail. Send me your mail
from another non-spammer site please.)

Tuxedo

Jun 15, 2007, 1:01:50 AM
Petr Vileta wrote:

[...]

> You can try my freeware Tbird2OE (use google). This program is not primary
> mean for recovery mbox file but if this will work and will create more
> then 17 files then I can send you part of Perl code for read and parse
> mbox file.

Ok, thanks. Your program converts folders from mbox format to separate .eml
messages. I will try that to see if it recognizes all 3000+ messages in the
mbox. With a bit of luck, it might be possible to use Mozilla's built-in
import function to re-import the messages, and perhaps the new folder will
be fixed.

Dr.Ruud

Jun 15, 2007, 4:22:27 PM
Tuxedo schreef:

> I'm trying to repair a gigantic mbox file that appears to have been
> corrupted in that it displays only 17 of the most recent and total
> 3087 messages contained in the actual file.

Consider formail, it is on about every *ix system.

--
Affijn, Ruud

"Gewoon is een tijger."

Tuxedo

Jun 15, 2007, 5:34:24 PM
> Consider formail, it is on about every *ix system.

Sounds good but unfortunately the mbox belongs to a company which would not
consider anything better than their crap Windows system!

Dr.Ruud

Jun 15, 2007, 6:00:20 PM
Tuxedo schreef:
> [attribution repaired] Ruud:

>> Consider formail, it is on about every *ix system.
>
> Sounds good but unfortunately the mbox belongs to a company which
> would not consider anything better than their crap Windows system!

Well, you could still move the file to a different system, or add
cygwin.

Tuxedo

Jun 15, 2007, 6:40:05 PM
Dr.Ruud wrote:

> Tuxedo schreef:
> > [attribution repaired] Ruud:
>
> >> Consider formail, it is on about every *ix system.
> >
> > Sounds good but unfortunately the mbox belongs to a company which
> > would not consider anything better than their crap Windows system!
>
> Well, you could still move the file to a different system, or add
> cygwin.
>

True, but the file still needs to be parsed by the Mozilla application
running on a Windows system. In fact, the exact same error occurs on a
Linux system when placing the mbox in the Mozilla mail directory.

Dr.Ruud

Jun 15, 2007, 8:40:56 PM
Tuxedo schreef:
> Dr.Ruud:
>> Tuxedo:
>>> [attribution repaired] Ruud:

I suggested formail so that you can use it to repair the file.

Tuxedo

Jun 16, 2007, 3:24:05 AM
Dr.Ruud wrote:

[...]

> Use formail to repair the file, is why I suggested it.
>

My mistake, and thanks for the tip! I was not familiar with 'formail'
before. This interesting utility is indeed on my Linux system. But
having no idea what error(s) the original mailbox may contain, and not
being familiar with formail, it is a bit complicated to guess how best
to process it. Nevertheless, I tried the following examples:

formail -ds <my_crappy_mbox >>reinvigorated_mbox

... this certainly made some changes, in fact, 10 or so additional messages
appear in the Mozilla index which did not show up earlier, including a
couple without a valid sender which are now listed by Mozilla as from
foo@bar, but which appear to be file fragments, i.e. not real mail.

Most of the 3000+ messages, however, still do not show up in Mozilla.

So I tried:

formail -zds <my_crappy_mbox >>reinvigorated_mbox

... but this made the file no more readable in Mozilla than the previous try.

And then:

formail -rds <my_crappy_mbox >>reinvigorated_mbox

... but with the same result as the former try.

Naturally I removed the generated (.msf) index files as well as terminated
the Mozilla application between the tries, in case something would get
cached otherwise.

The Mozilla application simply appears to be choking on the mbox while
building the index. The progress bar is helplessly trying to move forward,
but then falls back, then forward a bit, and then back again, until it
finally gives up. In other words, the graphical indicator at the bottom
right of the application, which is meant to indicate the progress of
building the index, never reaches its maximum.

Perhaps the mbox contains some very odd characters, maybe part of some
attachment, which cause Mozilla but not other mail clients to choke.
Perhaps it is the result of some malformed mail circulating via zombie
machines, Outlook and whatever, that affects Mozilla on multiple platforms.

Peter J. Holzer

Jun 16, 2007, 5:22:20 AM
On 2007-06-14 14:06, Tuxedo <tux...@mailinator.net> wrote:
> I'm trying to repair a gigantic mbox file that appears to have been
> corrupted in that it displays only 17 of the most recent and total 3087
> messages contained in the actual file.
>
> The mail application used is Mozilla on Windows.
[...]

> The same mbox works entirely when viewed in for example MUTT or the
> standard KDE mail client, Kmail.

Have you tried copying all messages to a new mbox file with mutt or
kmail and then reading the new mbox file with Mozilla?

hp


--
_ | Peter J. Holzer | I know I'd be respectful of a pirate
|_|_) | Sysadmin WSR | with an emu on his shoulder.
| | | h...@hjp.at |
__/ | http://www.hjp.at/ | -- Sam in "Freefall"

Mumia W.

Jun 16, 2007, 5:40:54 AM
On 06/16/2007 02:24 AM, Tuxedo wrote:
> [...]
> formail -ds <my_crappy_mbox >>reinvigorated_mbox
>
> .... this certainly made some changes, in fact, 10 or so additional messages
> appear in the Mozilla index which did not show up earlier, including a
> couple without a valid sender which are now listed by Mozilla as from
> foo@bar, but which appear to be file fragments, i.e. not real mail.
>
> Most of the 3000+ messages, however, still do not show up in Mozilla.
>
> So I tried: ...
> formail -zds <my_crappy_mbox >>reinvigorated_mbox
> ... but this made the file no more readable in Mozilla than the previous try.

>
> and ...
> formail -rds <my_crappy_mbox >>reinvigorated_mbox
> ... but with the same result as the former try.

>
> Naturally I removed the generated (.msf) index files as well as terminated
> the Mozilla application between the tries, in case something would get
> cached otherwise.
>
> The Mozilla application simply appears to be choking on the mbox while
> building the index. The progress bar is helplessly trying to move forward,
> but then falls back, then forward a bit, and then back again, until it
> finally gives up. In other words, the graphical indicator at the bottom
> right of the application, which is meant to indicate the progress of
> building the index, never reaches its maximum.
>
> Perhaps the mbox contains some very odd characters, maybe part of some
> attachment, which causes Mozilla but not other mail clients to choke.
> Perhaps it is the result of some malformatted mail circulating via zoombie
> machines, Outlook and whatever, that affects Mozilla on multiple platforms.
>

Research the problem with the help of this website:
http://kb.mozillazine.org/

In particular, this article may (or may not) be of help:
http://kb.mozillazine.org/Inbox_stays_blank

Here is a script that might improve things a little bit:

use strict;
use warnings;
use FileHandle;
use Email::Folder;
use Date::Parse qw(str2time);
use POSIX qw(ctime);

# Input mailbox and output file; change both as appropriate.
my ($file)  = glob('~/tmp/mozmail/OldTests');
my $outfile = 'output.mbox';

my $fh = FileHandle->new($outfile, '>') or die("Stop: $!");
my $folder = Email::Folder->new($file);

my $count = 0;
while (my $msg = $folder->next_message) {
    # Rebuild the "From - <date>" separator line from the Date header.
    my $date  = $msg->header('Date');
    my $epoch = defined $date ? str2time($date) : undef;
    # Fall back to the current time if the header is missing or unparsable.
    $date = ctime(defined $epoch ? $epoch : time);
    chomp $date;
    $fh->print("From - $date\n");
    $fh->print($msg->as_string() . "\n");
    $count++;
}
print "There are $count messages in the folder.\n";

$fh->close;

Email::Folder and Date::Parse are modules you can download from CPAN.
The other modules are standard parts of Perl. You should change $file
and $outfile as appropriate. You shouldn't modify the original mailbox file.

You probably won't need the script. Things should improve after you've
deleted the .msf (index) file and closed and reopened Mozilla.

(Followups set to alt.fan.mozilla)

Tuxedo

Jun 16, 2007, 5:57:56 AM
Peter J. Holzer wrote:

> On 2007-06-14 14:06, Tuxedo <tux...@mailinator.net> wrote:
> > I'm trying to repair a gigantic mbox file that appears to have been
> > corrupted in that it displays only 17 of the most recent and total 3087
> > messages contained in the actual file.
> >
> > The mail application used is Mozilla on Windows.
> [...]
> > The same mbox works entirely when viewed in for example MUTT or the
> > standard KDE mail client, Kmail.
>
> Have you tried copying all messages to a new mbox file with mutt or
> kmail and then reading the new mbox file with Mozilla?

Yes, that was the first thing I tried, but it didn't work :-(

I assume therefore that some crummy characters are contained within
one or more messages, and/or in headers, which somehow cause Mozilla to
choke, and so whatever conversion is done, they are simply carried forward.

The file, coming from Windows, appeared to be in DOS format. I've
un-DOS'ed it, split it into individual messages with awk, and recombined
them with the proper empty line followed by a ^From line, but without luck.
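
For anyone wanting to repeat that clean-up in Perl rather than with awk,
here is a rough sketch of the same steps. It is not the exact procedure I
ran: the name normalize_mbox and the file names in the usage comment are
made up, and the $SEPARATOR pattern is my assumption about what a
separator line looks like ("From", a sender, then a ctime-style date, as
in "From - Thu Jun 14 10:06:00 2007"). Adjust it if your mbox differs.

```perl
use strict;
use warnings;

# Assumed shape of a separator line: "From <sender> <ctime-style date>".
my $SEPARATOR = qr/^From \S+ +\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}$/;

# Copy an mbox from $in to $out, converting CRLF line endings to LF and
# making sure each separator line is preceded by a truly empty line.
sub normalize_mbox {
    my ($in, $out) = @_;
    my $held;    # previous line, held back until the next line is known
    while (my $line = <$in>) {
        $line =~ s/\r?\n?\z//;    # strip LF, CRLF or a trailing CR
        if (defined $held) {
            if ($line =~ $SEPARATOR) {
                if ($held =~ /^\s*\z/) {
                    print {$out} "\n";           # blank-looking -> empty
                }
                else {
                    print {$out} "$held\n\n";    # add the missing empty line
                }
            }
            else {
                print {$out} "$held\n";
            }
        }
        $held = $line;
    }
    print {$out} "$held\n" if defined $held;
    return;
}

# Command-line use, e.g.: perl normalize_mbox.pl Inbox Inbox.fixed
if (@ARGV == 2) {
    open my $in,  '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $out, '>', $ARGV[1] or die "Cannot create $ARGV[1]: $!";
    binmode $_ for $in, $out;    # no newline translation on Windows
    normalize_mbox($in, $out);
    close $out or die "Cannot finish writing $ARGV[1]: $!";
}
```

Lines inside message bodies are left alone apart from the line ending;
only the line directly before a separator is ever emptied.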

The file is around 150 MB and I wish I could post the entire mbox here, but
it's not mine to distribute, and it surely contains much private
communication. Maybe there is a way to encode the entire content into
Mozilla-safe characters, as this is obviously a Mozilla bug.

Dr.Ruud

Jun 16, 2007, 5:57:00 AM
Tuxedo schreef:
> Dr.Ruud:

>> Use formail to repair the file, is why I suggested it.
>

> This interesting utility is indeed is on my Linux system. But
> having neither a remote idea what error(s) the original mailbox may
> contain, nor being familiar with formail, it is a bit complicated to
> guess how to best process it. Nevertheless, I tried the following
> examples:
>
> formail -ds <my_crappy_mbox >>reinvigorated_mbox
>
> ... this certainly made some changes, in fact, 10 or so additional
> messages appear in the Mozilla index which did not show up earlier,
> including a couple without a valid sender which are now listed by
> Mozilla as from foo@bar, but which appear to be file fragments, i.e.
> not real mail.

I assume that you want to find out at which message the problem starts
and at which line in the mbox file that is, and start fixing from there.

Be careful not to introduce extra problems when moving the mbox
file from the Windows to the Linux system (maybe you should run
dos2unix on the file, and then maybe you shouldn't; formail will DWIM).

You can use formail together with procmail to convert from mbox to
maildir format (the one file per message in new/ cur/ tmp/ structure)
like this:

formail -defYz \
  -s procmail -m VERBOSE=yes DEFAULT="test_maildir/" /dev/null < crappy.mbx

(the "test_maildir/" will be created in the user's $HOME, include
MAILDIR="/some/path" to redirect)

The "-defYz" just lists all interesting options for this case, change at
will, see man formail.

From the maildir structure you should be able to find out at which
message the split up breaks.

Tuxedo

Jun 16, 2007, 7:01:01 AM
Mumia W. wrote:

Excellent! However, the problem does not want to be so easily solved. It
was no problem getting the above script running with the two up-to-date
non-standard modules, and after saving the script as fixbox.pl, I
successfully tested it on a small mbox file containing only 3 messages.

However, with the real file, on a 2 GHz notebook with 512 MB of memory
and Perl 5.8.7, munching through the approximately 150 MB mbox, the
above script (or the shell) returned: "Out of Memory!". The resulting
'output.mbox' file remained empty.

Personally, I'm not a fan of Mozilla mail, and looking a bit closer, I
could not find a solution to this particular issue on kb.mozillazine.org
either. The problematic mailbox is someone else's. I'm seriously
contemplating telling them to abandon 1) Windows and 2) Mozilla mail.

Tuxedo

Jun 16, 2007, 7:38:05 AM
Dr.Ruud wrote:

> Tuxedo schreef:
> > Dr.Ruud:
>
> >> Use formail to repair the file, is why I suggested it.
> >
> > This interesting utility is indeed is on my Linux system. But
> > having neither a remote idea what error(s) the original mailbox may
> > contain, nor being familiar with formail, it is a bit complicated to
> > guess how to best process it. Nevertheless, I tried the following
> > examples:
> >
> > formail -ds <my_crappy_mbox >>reinvigorated_mbox
> >
> > ... this certainly made some changes, in fact, 10 or so additional
> > messages appear in the Mozilla index which did not show up earlier,
> > including a couple without a valid sender which are now listed by
> > Mozilla as from foo@bar, but which appear to be file fragments, i.e.
> > not real mail.
>
> I assume that you want to find out at which message the problem starts
> and at which line in the mbox file that is, and start fixing from there.

Yes. If I only knew. I tried to split it up, delete sections and so on,
only to find the problem appearing in many sections, but not knowing
exactly where. The file is just too big to locate the error manually.


>
> Be careful not to introduce extra problemes with the move of the mbox
> file from the Windows to the Linux system (maybe you should do a
> dos2unix on the file, and then maybe you shouldn't, formail will DWIM).
>
> You can use formail together with procmail to convert from mbox to
> maildir format (the one file per message in new/ cur/ tmp/ structure)
> like this:
>
> formail -defYz \
> -s procmail -m VERBOSE=yes DEFAULT="test_maildir/" /dev/null <
> crappy.mbx
>
> (the "test_maildir/" will be created in the user's $HOME, include
> MAILDIR="/some/path" to redirect)
>
> The "-defYz" just lists all interesting options for this case, change at
> will, see man formail.
>
> From the maildir structure you should be able to find out at which
> message the split up breaks.
>

Many thanks for all the tips, especially about formail. However, in this
particular case, I will leave the problematic mbox to rest, because it is
my ambition not to deal with anything from the Windows user space. Having
realised it is 99% likely a Mozilla bug, my solution became to simply
transfer the file to a good old BSD mail server and let the file owner
access its content via Neomail, an exceptional Perl program which handles
mbox POP mail via a no-frills web interface. I just tested that and the
full mbox was read just as well as in MUTT or another native *nix mailer.

Sooner or later Mozilla developers will likely fix the bug, whatever it is.


Peter J. Holzer

Jun 16, 2007, 7:53:30 AM
On 2007-06-16 09:57, Tuxedo <tux...@mailinator.net> wrote:
> Peter J. Holzer wrote:
>
>> On 2007-06-14 14:06, Tuxedo <tux...@mailinator.net> wrote:
>> > I'm trying to repair a gigantic mbox file that appears to have been
>> > corrupted in that it displays only 17 of the most recent and total 3087
>> > messages contained in the actual file.
>> >
>> > The mail application used is Mozilla on Windows.
>> [...]
>> > The same mbox works entirely when viewed in for example MUTT or the
>> > standard KDE mail client, Kmail.
>>
>> Have you tried copying all messages to a new mbox file with mutt or
>> kmail and then reading the new mbox file with Mozilla?
>
> Yes that was the first thing I tried, but it didn't work :-(
>
> I assume therefore that that some crummy characters are contained within
> one or more messages, or/and in headers, which somehow cause the Mozilla to
> choke, and so whatever conversion is done is simply carried forward.

That sounds likely. I think the best way to identify the message(s) with
the crummy characters is to split your big mailbox into a small number
(at most 10 or so) of smaller mboxes. You can do that with formail or a
perl (or awk) script or any text editor you find convenient. Then try to open
each mbox. For each mbox which Mozilla cannot open, split it again,
until you have little mboxes with one bad message in each. Then you can
check what they have in common.
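
A splitting script along those lines might look like the sketch below. It
is only a rough illustration: the name split_mbox, the output naming
scheme ("prefix.1", "prefix.2", ...) and the file names in the usage
comment are made up, and it assumes a message starts at any "From " line
that is either the first line of the file or follows an empty line (a
line holding only a carriage return counts as empty here, so it also
copes with a DOS-format file).

```perl
use strict;
use warnings;

# Split the mbox file $in into at most $parts files named "$prefix.1",
# "$prefix.2", ... cutting only at message boundaries. Returns the number
# of messages written to each part.
sub split_mbox {
    my ($in, $parts, $prefix) = @_;

    # Pass 1: count the messages so they can be divided evenly.
    open my $fh, '<', $in or die "Cannot open $in: $!";
    binmode $fh;
    my ($total, $prev_empty) = (0, 1);
    while (my $line = <$fh>) {
        $total++ if $prev_empty && $line =~ /^From /;
        $prev_empty = ($line =~ /^\r?\n\z/);
    }
    die "No messages found in $in\n" unless $total;
    my $per_part = int(($total + $parts - 1) / $parts);    # round up

    # Pass 2: copy the lines, starting a new file every $per_part messages.
    seek $fh, 0, 0 or die "Cannot rewind $in: $!";
    my ($out, @sizes);
    $prev_empty = 1;
    while (my $line = <$fh>) {
        if ($prev_empty && $line =~ /^From /) {
            if (!@sizes || $sizes[-1] == $per_part) {
                if ($out) { close $out or die "Cannot finish a part: $!" }
                my $name = "$prefix." . (@sizes + 1);
                open $out, '>', $name or die "Cannot create $name: $!";
                binmode $out;
                push @sizes, 0;
            }
            $sizes[-1]++;
        }
        die "Data before the first \"From \" line in $in\n" unless $out;
        print {$out} $line;
        $prev_empty = ($line =~ /^\r?\n\z/);
    }
    close $out or die "Cannot finish the last part: $!";
    close $fh;
    return @sizes;
}

# Command-line use, e.g.: perl split_mbox.pl Inbox 10 part
if (@ARGV == 3) {
    my @sizes = split_mbox(@ARGV);
    print "$ARGV[2].$_: $sizes[$_ - 1] message(s)\n" for 1 .. @sizes;
}
```

The parts are plain copies of the original bytes, so concatenating them
in order gives back the original file. It reads the file twice but never
holds more than one line in memory.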


> The file, coming from Windows appeared to have been in DOS format. I've
> un-DOS'ed it, awk-splitted each individual message, re-combed with the
> proper empty line, followed by a ^From occurance, but without luck.
>
> The file is around 150 MB and I wish I could post the entire mbox here, but
> its not mine to distribute, and it surely contains much private
> communication. Maybe there is a way to encode the entire content into
> Mozilla safe characters as this is obviously a Mozilla bug.

Parsing each message with MIME::Parser and then printing it out again
may help. But I would hate to do that for a whole folder of 3000
messages. It may or may not work and you still don't know the problem
afterwards. Try to find the bad message(s) and convert only these. Then
you also have a test case for a bug report to Mozilla.

Tuxedo

Jun 16, 2007, 8:32:06 AM
Peter J. Holzer wrote:

That is certainly the best solution, but unfortunately it would take too
much time. I would however consider submitting the entire file, given
permission by the file owner and without making it public, to someone who
could identify the problem, file a bug report and potentially fix it.


Petr Vileta

Jun 16, 2007, 10:06:02 PM
Tuxedo wrote:
What about my Tbird2OE? Did it help you? I want to know the result ;-)

Tuxedo

Jun 17, 2007, 9:25:41 AM
Petr Vileta wrote:

> Tuxedo wrote:
> Waht my Tbird2OE? Helped you? I want to know the result ;-)
>

Sorry, but unfortunately I did not get a chance to test your program,
mainly due to time constraints. Also, I have a feeling that converting to
maildir or eml and then back to mbox again will not work, in that the cause
will simply remain throughout the process and the problem will appear again
in Mozilla. Instead, my solution has been to move the mbox to a unix system
where the user can access it via a webmail interface. The mbox is
completely readable in any mbox reader except Mozilla. Nevertheless, thanks
for your kind advice! I have bookmarked your program for another time.


Tuxedo

Jun 17, 2007, 4:59:57 PM
Tuxedo wrote:

> Petr Vileta wrote:
>
> > Tuxedo wrote:
> > Waht my Tbird2OE? Helped you? I want to know the result ;-)
> >
>
> Sorry, but unfortunately I did not get a chance to test your program,

Correction to the above: I've now tested Tbird2OE, with the following result.

All messages which existed in the mbox format were successfully imported
into Outlook Express. After then importing and converting the Outlook
(eml) files back into the combined Mozilla mbox format, all messages were
restored and show up in the newly created Mozilla index. In other words,
Tbird2OE effectively repaired the mbox and circumvented whatever problem
Mozilla, Seamonkey and Thunderbird have parsing an mbox containing certain
buggy messages, or perhaps inconsistencies in how they were separated.

Thanks a ton, Petr, for fixing the problem! I wish you all the best with
your application, even if my particular purpose is not exactly what
Tbird2OE was designed for.
