A customer just gave me a massive mail file in mbox format which has
accrued over several years. The file was rescued from an old drive of a
previous but now broken system, and so I would like to restore the mailbox
in a mail application on a new system.
The mail file was readable on the previous system in Mozilla Thunderbird,
as there it had a corresponding .msf index. However, the .msf file no
longer exists and the mbox itself is nearly 3GB. When placing this in a new
T-Bird mail folder, the mail application tries but soon fails to generate
the index which is necessary to display the messages.
At first I thought the file may be corrupt so I tried running:
formail -zds < big_mbox >> fixed_mbox
But soon after formail began munching its way into the big_mbox there was
an "Out of memory" error returned by the shell, which I guess was also what
the mail client silently did.
I guess I need more RAM to process such a big file and that any mail
application, formail included, simply needs more than the filesize, which
unfortunately I do not have. In any case, I think the file is probably Ok
since it worked fine on the previous system.
What methods exist to process and restore this huge file? How about for
example splitting it into parts, such as 5 or 10 different files, obviously
cut at the right points between messages. I guess the individual mbox files
can then easily be readable in more or less any mail application. Can this
be done via the shell and if so how?
Are there any particular Unix tools to split such huge message files or
create an .msf index without running out of memory in the process?
Many thanks for any ideas and advice.
Tuxedo
>customer just gave me a massive mail file in mbox format
>Are there any particular Unix tools to split such huge message files
http://en.wikipedia.org/wiki/Mbox says:
mbox is a generic term for a family of related file formats used for
holding collections of electronic mail messages. All messages in an
mbox mailbox are concatenated and stored as plain text in a single
file. The beginning of each message is indicated by a line whose first
five characters consist of "From" followed by a space (the so-called
"From_ line" or "'From ' line") and the return path e-mail address. A
blank line is appended to the end of each message.
IOW, it's not hard to identify message boundaries. You can use common text
processing tools to split the big file into smaller ones.
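As a rough illustration of the point above, the separator lines can be counted with a streaming tool, no editor required. This is only a sketch: the function names are made up for the example, and the count will be too high if any message body contains an unescaped line starting with "From ".

```shell
#!/bin/sh
# Count "From " separator lines. grep reads its input line by line,
# so the size of the file is not a problem for memory.
count_messages() {
    grep -c '^From ' "$1"
}

# Print the first few separator lines (default 5) to see what they look like.
show_separators() {
    grep '^From ' "$1" | head -n "${2:-5}"
}
```

Usage would be something like `count_messages big_mbox` followed by `show_separators big_mbox 3`.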
Use formail:
formail -s savemail < "$mbox"
Where savemail is a script containing:
cat > $(date +%Y-%m-%d_%H:%M:%S)-$(uuidgen)
This will put each message in a separate file. Adjust to taste if
you want to put more than one message into each file or to use
different filenames.
--
Chris F.A. Johnson, author <http://shell.cfajohnson.com/>
===================================================================
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
Pro Bash Programming: Scripting the GNU/Linux Shell (2009, Apress)
You can even use perl and use something like
@mail = split(/\nFrom /,$mboxfile);
That assumes your mail system uses the "put a '>' before 'From' in all
email" option.
>John Kelly <j...@isp2dial.com> writes:
>
>> On Tue, 15 Jun 2010 01:18:19 +0200, Tuxedo <tux...@mailinator.com>
>> wrote:
>>>customer just gave me a massive mail file in mbox format
>>>Are there any particular Unix tools to split such huge message files
>> IOW, it's not hard identify message boundaries. You can use common text
>> processing tools to split the big file into smaller ones.
>
>You can even use perl and use something like
>
> @mail = split(/\nFrom /,$mboxfile);
That will read it into memory all at once, which may cause thrashing
with his 3GB file. In his scenario, better to read and write one line
at a time, and open a new output file every so many messages.
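A minimal sketch of that line-at-a-time approach, using awk rather than Perl (an assumption here, since no particular tool was prescribed). The function name split_mbox, the four-digit suffix and the chunk size are all hypothetical choices, and it assumes every line starting with "From " really is a message separator.

```shell
#!/bin/sh
# split_mbox FILE N PREFIX
# Streams FILE line by line and writes N messages per output file
# (PREFIX0001, PREFIX0002, ...). Only one output file is open at a time.
split_mbox() {
    awk -v n="$2" -v prefix="$3" '
        /^From / {
            if (count % n == 0) {            # time to start a new chunk
                if (out != "") close(out)
                out = sprintf("%s%04d", prefix, ++part)
            }
            count++
        }
        out != "" { print > out }            # lines before the first "From " are skipped
    ' "$1"
}
```

For example, `split_mbox big_mbox 5000 part_` would write part_0001, part_0002 and so on, each holding up to 5000 messages.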
It's easy to shoot yourself in the foot with Perl.
It's easy to shoot yourself in the foot with any language. The above
code assumes that the file is already stored in the variable $mboxfile.
Any language--even $YOUR_FAVORITE_LANGUAGE--can do this. Please don't
spread FUD about something that's possible with any tool. Beyond that,
the above regex will run into problems parsing mailboxes.
Back on topic: please don't parse email with regex. You don't know what
you're doing and you will get it wrong (Google "Email Hates the Living"*
and watch the Google Video to see why). Use a module, like Mail::Box or
Email::Folder::Mbox, something that's been tested and in production use
at large ESPs for decades.
* Disclaimer: I used to work for that man and have seen first-hand the
true nature of email.
--
Thanks and best regards,
Chris Nehren
[...]
> This is the problem with not enforcing email quotas. I have a friend
> who does just what your user did--kept everything in a single mbox file
Problem is that it is on a local system application, which is a Windows PC.
The users do what they want there.
[...]
> messages. You could use vi or emacs to read the mbox file and chop it
> into monthly parts, then see if formail will work on that. I think
> that's the best you're going to do for this user.
I'm not sure how to do that with vi and emacs. I don't think any editor
will actually open the file. Or is there a command line sequence for emacs
or the likes. Yearly batches would probably work Ok, but as soon as a
procedure tries to read the full file into memory the process just runs out
of memory and terminates.
> Users who do this sort of thing give me gas.
The annoying thing is that it's not the first time this same customer is doing this
same thing. The mail applications are not idiot proof or designed for
non-technical people. Perhaps there should be a warning in Mozilla GUI
applications, such as:
"Your mailbox will soon be larger than your system can handle; please divide
it into separate segments".
Tuxedo
[...]
> IOW, it's not hard identify message boundaries. You can use common text
> processing tools to split the big file into smaller ones.
Thanks for the tip but I'm not sure what processing tools can be used to
split the file into smaller ones? At least no editor that I know will open
the file. It's simply too big.
Tuxedo
> Back on topic: please don't parse email with regex. You don't know what
> you're doing and you will get it wrong (Google "Email Hates the Living"*
> and watch the Google Video to see why).
Sounds like a good film - will watch!
> Use a module, like Mail::Box or
> Email::Folder::Mbox, something that's been tested and in production use
> at large ESPs for decades.
How can I use these Perl modules to split the mbox? Will they not also
attempt to read the entire file in one go and run out of memory...
Thanks for any further tips.
Tuxedo
[...]
> Use formail:
>
> formail -s savemail < "$mbox"
>
> Where savemail is a script containing:
>
> cat > $(date +%Y-%m-%d_%H:%M:%S)-$(uuidgen)
>
> This will put each message in a separate file. Adjust to taste if
> you want to put more than one message into each file or to use
> different filenames.
Thanks for this procedure, it works fine on a not-too-large mbox. However,
it fails with the huge file because the system runs out of memory, as I
guess cat or formail tries to read in the full file to process. But it's a
good example how to split an mbox into individual files. I will probably
use this idea for something else.
Many thanks,
Tuxedo.
Maybe try this variant, in the hope that it will be less greedy and won't exhaust the system:
$ export FILENO=000000 ; formail -n32 +1ds procmail -p 'DEFAULT=/tmp/_mb_$FILENO' /dev/null<yourMbox
Borrowing from the Email::Folder docs:
#!/usr/bin/perl
use strict;
use warnings;
use Email::Folder;
my $folder = Email::Folder->new("some_file");
while(my $message = $folder->next_message) {
print $message->header('Subject'), "\n";
}
Or thereabouts. No, it will not read the entire file all at once, unless
you call ->messages on the Email::Folder object. For more information on
what you can do with the $message object, see Email::Simple's docs.
Mail::Box not covered here because, while it is the swiss-army chainsaw
of mail modules, it's also more complex with a higher learning curve.
Another plan might be to use the "reformail" tool. I've used it in
similar situations though nothing on quite the same scale. In
particular the -s option runs a program for each mail in the mbox file;
the message is provided on stdin and an environment variable provides
access to a counter so you can simply number the messages.
It is often part of the "maildrop" package though I think it was
originally part of the courier mail system.
--
Ben.
> Tuxedo <tux...@mailinator.com> writes:
> <snip>
>> Thanks for any further tips.
>
> Another plan might be to use the "reformail" tool.
I see that formail has been suggested already. I am not sure if
reformail is another implementation (in which case it may be worth
trying) or just a renaming of formail (in which case it might also have
trouble with the mbox size). Maybe someone who knows both can comment.
<snip>
--
Ben.
>John Kelly wrote:
I was not talking about text editors, where you read the whole file into
memory all at once. Tools like grep, sed, and awk read one line at a
time. Or you could write a simple while loop in bash to read a file one
line at a time.
while IFS= read -r; do
# each line is in $REPLY
# do something with it
done < mybigfile
If you don't have enough knowledge of these tools to devise a solution,
Chris's idea of Email::Folder may work for you.
>> It's easy to shoot yourself in the foot with Perl.
>It's easy to shoot yourself in the foot with any language. The above
>code assumes that the file is already stored in the variable $mboxfile.
>Any language--even $YOUR_FAVORITE_LANGUAGE--can do this. Please don't
>spread FUD about something that's possible with any tool.
Sorry, I couldn't resist. But it is "easy" with Perl ;-)
>Back on topic: please don't parse email with regex. You don't know what
>you're doing and you will get it wrong
I filter mail with procmail regexes. Works for me.
Thanks for this tip! An interesting and useful tool. While this worked on a
smaller mbox with or without the print $message->body('Body') bit it also
failed on the large mailbox unfortunately:
#!/usr/bin/perl
use strict;
use warnings;
use lib "/tmp/perl/lib/perl5/site_perl/5.8.8";
use Email::Folder;
my $folder = Email::Folder->new("bigmbox");
while(my $message = $folder->next_message) {
print $message->header('Subject'), "\n";
print $message->body('Body'), "\n";
}
Is there any other possible way to split such a huge file into, for example,
three or four parts? Then it would probably be "small enough" to process.
Actually, I incurred a different error with the above perl procedure (not
"Out of memory"). While the system was working really hard and after a bit
less than a minute of file crunching, the error was:
"bigmbox is not an mbox file at
/tmp/perl/lib/perl5/site_perl/5.8.8/Email/Folder.pm line 81".
(The /tmp path just refers to my temporary external module installation).
Maybe there is something wrong with my 'bigmbox'. The original file (on a
Windows drive) is reported to be 2.8 GB while the version I transferred to
my Linux box is only 2.0G for some reason. I downloaded it via lan using
Mozilla (as I think FTP has a maximum file size transfer limit).
Also, the previous "Out of memory" error that occurred while processing by
formail, what could it relate to specifically? Ram, CPU, swap? I tested
processing the file on a Windows box which has 4GB RAM, which I guess
should be large enough for the 2.8GB file and any concurrent processes.
In theory, given a fast enough system, I guess any regular Mozilla
application should be able to process and generate the index needed to view
and modify the troubled mbox.
It may be fair to think that to allow such large files to accrue is down to
user error or even stupidity, but at the same time, non-technical users do
not necessarily think about what size their mailboxes may be or that one
single file in mbox format could grow so large. After all, to them it would
only be logical to think whatever fits on a drive should be Ok to store in
a 'mail folder'. In this sense, GUI mail app. designers have failed to make
this very clear in user interfaces. Maybe the advice is given in some
readme document, which nobody reads. I guess no one anticipated that users'
mailboxes would grow to the size they actually do after years of usage,
until one day they simply stop working and cannot easily be restored.
Tuxedo
Yes, basically you'd read n messages, write them to a file, read n more,
write them to another file. I'm out of fish for the day, though: please
see e.g. http://p3rl.org/bp for a proper tutorial on Perl if you don't
know where to start.
> Actually, I incurred a different error with the above perl procedure (not
> "Out of memory"). While the system was working really hard and after a bit
> less than a minute of file crunching, the error was:
> "bigmbox is not an mbox file at
> /tmp/perl/lib/perl5/site_perl/5.8.8/Email/Folder.pm line 81".
> (The /tmp path just referers to my temporary external module installation).
>
> Maybe there is something wrong with my 'bigmbox'. The original file (on a
> Windows drive) is reported to be 2.8 GB while the version I transferred to
> my Linux box is only 2.0G for some reason. I downloaded it via lan using
> Mozilla (as I think FTP has a maximum file size transfer limit).
The file wasn't transferred completely, yes, you hit the file size
limit. What does "downloaded it via lan" mean? You'll need to download
the file in toto before being able to process it in toto. Please keep
in mind that this is not the fault of the Perl code but rather that you
have an incomplete file on your hands.
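One way to check whether a copy is complete is to compare byte counts and checksums before processing anything. The sketch below assumes both files are reachable as local paths (for instance with the Windows drive mounted); same_file is a made-up helper name. If the files live on different machines, the same wc and cksum commands can be run on each side and the results compared by eye.

```shell
#!/bin/sh
# same_file ORIGINAL COPY
# Succeeds only if both files have the same byte count and the same checksum.
same_file() {
    size_a=$(wc -c < "$1") || return 2
    size_b=$(wc -c < "$2") || return 2
    if [ "$size_a" -ne "$size_b" ]; then
        echo "size differs: $size_a vs $size_b bytes" >&2
        return 1
    fi
    sum_a=$(cksum < "$1")    # CRC and byte count of the original
    sum_b=$(cksum < "$2")    # CRC and byte count of the copy
    [ "$sum_a" = "$sum_b" ]
}
```

A call such as `same_file /mnt/windows/bigmbox ./bigmbox && echo complete` (both paths are placeholders) would report a truncated copy immediately.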
--
Thanks and best regards,
Chris Nehren
Unless noted, all content I post is CC-BY-SA.
> On Tue, 15 Jun 2010 09:39:16 +0200, Tuxedo <tux...@mailinator.com>
> wrote:
>
> >John Kelly wrote:
>
> >> IOW, it's not hard identify message boundaries. You can use common
> >> text processing tools to split the big file into smaller ones.
> >
> >Thanks for the tip but I'm not sure what processing tools can be used to
> >split the file into smaller ones? At least no editor that I know will
> >open the file. It's simply too big.
>
> I was not talking about text editors, where you read the whole file into
> memory all at once. Tools like grep, sed, and awk read one line at at
> time. Or you could write a simple while loop in bash to read a file one
> line at a time.
Aah. I'm familiar with grep procedures and have used sed and awk but I'm
not really any good at it. But it sounds like the right solution to my
problem!
>
> while read; do
>
> # each line is in $REPLY
> # do something with it
>
> done < mybigfile
>
> If you don't have enough knowledge of these tools to devise a solution,
> Chris idea of Email::Folder may work for you.
I tried Chris's idea but had some type of error "bigmbox is not an mbox
file.." described in my previous post. I'm not quite sure what the cause of
the error may be.
Thanks,
Tuxedo
>Is there any other possible way to split such a huge file in in for example
>three or four parts?
Yes with text tools as I suggested. But not many people want to devise
your solution for free. You might try the awk group, they sometimes go
beyond the call of duty to help.
>I guess no one anticipated that users mailboxes would grow to the size
>they actually do after years of usage, until one day they simply day stop
>working and cannot easily be restored.
Some people like to archive mail, but it's dangerous. Eliot Spitzer
advised against email. Same goes for Usenet. Never say anything you
don't want repeated in a court of law.
>John Kelly wrote:
>> I was not talking about text editors, where you read the whole file into
>> memory all at once. Tools like grep, sed, and awk read one line at at
>> time. Or you could write a simple while loop in bash to read a file one
>> line at a time.
>
>Aah. I'm familar with grep procedures and have used sed and awk but I'm
>not really any good at it. But it sounds like the right solution to my
>problem!
>I tried Chris idea but had some type of error "bigmbox is not an mbox
>file.." described in my previous post. I'm not quite sure what the cause of
>the error may be.
The problem with canned solutions is, they will choke on improper data.
You may have some inconsistency in the data, where you need to get down
into the dirty details to figure it out and fix it.
Read the Wikipedia article on mbox; it explains how to recognize mbox
message boundaries. Understanding that, you may be able to make some
progress.
>Read the wikipedia article on mbox, it explains how to recognize mbox
>messages boundaries. Understanding that, you may be able to make some
>progess.
And for a wild idea, try this:
Use dd to copy the file into 10 equally sized pieces. That will break
the messages on the tail and nose of each piece, but then you can use a
text editor to extract the broken pieces and put them back together
where they belong.
And if you do have some inconsistency in the data, you can learn which
of the 10 pieces has the problem, and proceed to fix it.
Divide and conquer. It rarely fails.
[...]
> And for a wild idea, try this:
>
> Use dd to copy the file into 10 equally sized pieces. That will break
> the messages on the tail and nose of each piece, but then you can use a
> text editor to extract the broken pieces and put them back together
> where they belong.
>
> And if you do have some inconsistency in the data, you can learn which
> of the 10 pieces has the problem, and proceed to fix it.
>
> Divide and conquer. It rarely fails.
It sounds like the perfect poor man's solution :-).
I did 'man dd' but remain somewhat puzzled... I'm not familiar with the
command line options. I know I'm asking for free advice here, but how would you
divide a file named for example myBigCrapBox into ten pieces using dd ?
Many thanks,
Tuxedo
> On 2010-06-15, Tuxedo scribbled these curious markings:
> > Chris Nehren wrote:
> >
> >> On 2010-06-15, Tuxedo scribbled these curious markings:
> > Thanks for this tip! In interesting and useful tool. While this worked
> > on a
> > smaller mbox with or without the print $message->body('Body') bit it
> > also failed on the large mailbox unfortunately:
> >
> > #!/usr/bin/perl
> > use strict;
> > use warnings;
> >
> > use lib "/tmp/perl/lib/perl5/site_perl/5.8.8";
> >
> > use Email::Folder;
> >
> > my $folder = Email::Folder->new("bigmbox");
> > while(my $message = $folder->next_message) {
> > print $message->header('Subject'), "\n";
> > print $message->body('Body'), "\n";
> > }
> >
> > Is there any other possible way to split such a huge file in in for
> > example three or four parts? Then it would probably be "small enough" to
> > process.
>
> Yes, basically you'd read n messages, write them to a file, read n more,
> write them to another file. I'm out of fish for the day, though: please
> see e.g. http://p3rl.org/bp for a proper tutorial on Perl if you don't
> know where to start.
I like reading Perl tutorials, especially since there are so many :-)
> > Actually, I incurred a different error with the above perl procedure
> > (not "Out of memory"). While the system was working really hard and
> > after a bit less than a minute of file crunching, the error was:
> > "bigmbox is not an mbox file at
> > /tmp/perl/lib/perl5/site_perl/5.8.8/Email/Folder.pm line 81".
> > (The /tmp path just referers to my temporary external module
> > installation).
> >
> > Maybe there is something wrong with my 'bigmbox'. The original file (on
> > a Windows drive) is reported to be 2.8 GB while the version I
> > transferred to my Linux box is only 2.0G for some reason. I downloaded
> > it via lan using Mozilla (as I think FTP has a maximum file size
> > transfer limit).
>
> The file wasn't transferred completely, yes, you hit the file size
> limit. What does "downloaded it via lan" mean? You'll need to download
> the file in toto before being able to process it in toto. Please keep
> in mind that this is not the fault of the Perl code but rather that you
> have an incomplete file on your hands.
Thanks for advising me on this too, one more mystery solved...
Tuxedo
>I did 'man dd' but remain somewhat puzzled... I'm not familiar with the
>command line options. I know I'm asking free advise here, but how would you
>divide a file named for example myBigCrapBox into ten pieces using dd ?
"count" limits the number of blocks copied
"seek" determines the starting block
"bs" sets the block size, for example bs=4k
"if" is your input file
"of" is your output file (creates if not existing)
Experiment with some junk file till you get the hang of it. But be
careful, you can really hurt yourself with dd while logged in as root.
>"seek" determines the starting block
Whoops! I meant skip, not seek
>Experiment with some junk file till you get the hang of it. But be
>careful, you can really hurt yourself with dd while logged in as root.
See what I mean
> On Tue, 15 Jun 2010 16:20:20 +0000, John Kelly <j...@isp2dial.com> wrote:
>
>
> >"seek" determines the starting block
>
> Whoops! I meant skip, not seek
>
>
> >Experiment with some junk file till you get the hang of it. But be
> >careful, you can really hurt yourself with dd while logged in as root.
>
> See what I mean
I think this is all too complex for my basic understanding of dd :-)
Tuxedo
>> >"seek" determines the starting block
>>
>> Whoops! I meant skip, not seek
>>
>>
>> >Experiment with some junk file till you get the hang of it. But be
>> >careful, you can really hurt yourself with dd while logged in as root.
>>
>> See what I mean
>
>I think this is all too complex for my basic understanding of dd :-)
The five parameters I listed are all you need to break up the big file.
If that seems too complex, you may be unable to solve your puzzle.
> On Tue, 15 Jun 2010 18:46:54 +0200, Tuxedo <tux...@mailinator.com>
> wrote:
>
> >> >"seek" determines the starting block
> >>
> >> Whoops! I meant skip, not seek
> >>
> >>
> >> >Experiment with some junk file till you get the hang of it. But be
> >> >careful, you can really hurt yourself with dd while logged in as root.
> >>
> >> See what I mean
> >
> >I think this is all too complex for my basic understanding of dd :-)
>
> The five parameters I listed are all you need to break up the big file.
> If that seems too complex, you may be unable to solve your puzzle.
Yes, it's beyond me. The syntax described in the man page is not clear to
me as now is the first time I've come across the command and I'm not a
programmer. There is not even an example in the man page on my
system. I will search for some.
Thanks for putting me on track.
Tuxedo
>John Kelly wrote:
>> >> >Experiment with some junk file till you get the hang of it. But be
>> >> >careful, you can really hurt yourself with dd while logged in as root.
>> >>
>> >> See what I mean
>> >
>> >I think this is all too complex for my basic understanding of dd :-)
>>
>> The five parameters I listed are all you need to break up the big file.
>> If that seems too complex, you may be unable to solve your puzzle.
>
>Yes, it's beyond me. The syntax described in the man page is not clear to
>me as now is the first time I've come across the command and I'm not a
>programmer. There is not even an example in the in the man page on my
>system. I will search for some.
>
>Thanks for putting me on track.
Quick tutorial:
The dd command is simple -- all it does is, copy one file to another, as
a raw byte stream.
To duplicate a file:
dd if=myfile of=mydupfile
To create a file filled with nulls:
dd count=1 if=/dev/zero of=xxx
1+0 records in
1+0 records out
512 bytes (512 B) copied, 8.6045e-05 s, 6.0 MB/s
As you can see, the default blocksize is 512 bytes. To change the
blocksize, use the bs parameter
dd bs=4k count=1 if=/dev/zero of=xxx
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 6.4394e-05 s, 63.6 MB/s
Now all you need to put in the mix is, the "skip" parameter. For the
first piece, you start at the beginning and don't need to skip any
bytes.
But for the second piece, you skip the same number of blocks you copied
to the first piece.
So if your count on the first piece was 40000, your "skip" on the second
piece also will be 40000. The third piece 80000, and so on.
It's not hard. Try it on some junk data to get the hang of it. Once
you learn how to use dd, you can do ANYTHING. Including wiping out your
hard drive, heh.
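Putting the tutorial's pieces together, here is a hedged sketch of cutting a file into equal pieces with skip and count. dd_split is a hypothetical helper, the 4 KiB block size is simply the one used in the example above, and the pieces are cut at arbitrary byte positions, so messages at the boundaries will be broken and need gluing back together by hand.

```shell
#!/bin/sh
# dd_split FILE PIECES
# Writes FILE.1 ... FILE.PIECES, each holding the same number of 4 KiB blocks
# (the last piece takes whatever remains, so it may be shorter).
dd_split() {
    file=$1 pieces=$2 bs=4096
    size=$(wc -c < "$file")
    blocks=$(( (size + bs - 1) / bs ))                # total blocks, rounded up
    per_piece=$(( (blocks + pieces - 1) / pieces ))   # blocks per piece, rounded up
    i=0
    while [ "$i" -lt "$pieces" ]; do
        # skip = blocks already copied into earlier pieces
        dd if="$file" of="$file.$((i + 1))" bs="$bs" \
           skip=$(( i * per_piece )) count="$per_piece" 2>/dev/null
        i=$(( i + 1 ))
    done
}
```

Try it on junk data first, for example `dd_split junkfile 10`, and check that concatenating the pieces in order reproduces the original.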
> dd bs=4k count=1 if=/dev/zero of=xxx
> 1+0 records in
> 1+0 records out
> 4096 bytes (4.1 kB) copied, 6.4394e-05 s, 63.6 MB/s
>
>
> Now all you need to put in the mix is, the "skip" parameter. For the
> first piece, you start at the beginning and don't need to skip any
> bytes.
>
> But for the second piece, you skip the same number of blocks you copied
> to the first piece.
>
> So if your count on the first piece was 40000, your "skip" on the second
> piece also will be 40000. The third piece 80000, and so on.
>
> It's not hard. Try it on some junk data to get the hang of it. Once
> you learn how to use dd, you can do ANYTHING. Including wiping out your
> hard drive, heh.
In this case, perhaps "split" could also have been used.
>> It's not hard. Try it on some junk data to get the hang of it. Once
>> you learn how to use dd, you can do ANYTHING. Including wiping out your
>> hard drive, heh.
>
>in this case, perhaps "split" could also have been used.
Never heard of it till now, but I see it with "man split"
Doesn't sound like as much fun as dd though, you can't wipe out your
hard drive with split. :-D
Did you try my suggestion with multithread formail/procmail?
Did it fail?
If so, maybe try another poor man's idea with awk (not tested on Solaris ;-)
$ awk '/^From /{a++}{print > ("/tmp/_mb_" a)}' yourBigBox
(adapt "/tmp/_mb_" prefix part for your fav path)
or maybe try it with 'csplit' (read the info page for extended parms on paths)
$ csplit -z yourBigBox '/^From /' '{*}'
Note that these are not foolproof: if some mails in the box themselves contain
other mails, you may have to check the split points and glue some pieces back together.
But if your box is a simple, plain mbox, you should be safe with this. :-)
>try it with 'csplit' (read the info page for extended parms on paths)
>$ csplit -z yourBigBox '/^From /' '{*}'
That's a good one to remember.
It works for me on an mbox file larger than my total RAM.
> as I guess cat or formail tries to read in the full file to process.
> But it's a good example how to split an mbox into individual files.
> I will probably use this idea for something else.
[...]
> It's not hard. Try it on some junk data to get the hang of it. Once
> you learn how to use dd, you can do ANYTHING. Including wiping out your
> hard drive, heh.
Thanks for the quick tutorial. The wonders of dd have finally come to light!
I first tested it on a smaller already functioning mbox.
I thereafter tested to copy 100 bytes of the beginning of the huge file:
dd count=1 bs=100 if=myBigCrapBox of=myBigCrapBox.1
But thereby I realise there must be something wrong with the huge mbox
file. The resulting file, myBigCrapBox.1, should be the first 100 bytes
ASCII but it all appears to be binary data, or nothing at all; one editor
(Nedit) just shows the file with a long line of <nul><nul><nul>, while the
filesize is exactly 100 bytes.
I'm not sure how this happened because all other smaller mboxes created by
Mozilla Thunderbird are indeed in plain text format.
Does anyone know if the Mozilla mail applications use some kind of exotic
compression format for files above a certain size?
I even tested placing the resulting 100-byte file in a Mozilla mail
directory. The mail folder (or file) shows up but is empty, not even a start
of a single message. I also tested with versions longer than the 100 bytes.
I guess I have been doomed with a corrupt mbox file! But how can such a large
2.8GB file contain nothing readable? It should be a direct copy of the mbox
and a full version of the file, not a truncated 2GB limit file via ftp or
other file transfer. I copied the file from the original Windows drive via
USB Flash media directly onto a Linux system where I ran the dd command.
Thanks for any advice or theories on how this possibly corrupt mbox may be
reinvigorated and viewed.
Tuxedo
>I even tested placing the resulting file 100 bytes file in a Mozilla mail
>directory. The mailfolder (or file) shows up but is empty, not even a start
>of a single message. I also tested with version longer than the 100 bytes.
>
>I guess I have been doomed with a corrupt mbox file! But how can such large
>2.8GB file contain nothing readable? It should be a direct copy of the mbox
>and a full version of the file, not a truncated 2GB limit file via ftp or
>other file transfer. I copied the file from the original Windows drive via
>USB Flash media directly onto a Linux system where I ran the dd command.
>
>Thanks for any advise or theories on how this possibly corrupt mbox may be
>reinvigorated and viewed.
100 bytes is not enough to see the big picture. Try more, 1,000 or
10,000, or whatever it takes until you see some data that looks like
mail messages. Then use the skip feature of dd to read past that when
copying.
>I guess I have been doomed with a corrupt mbox file! But how can such large
>2.8GB file contain nothing readable? It should be a direct copy of the mbox
>and a full version of the file, not a truncated 2GB limit file via ftp or
>other file transfer. I copied the file from the original Windows drive via
>USB Flash media directly onto a Linux system where I ran the dd command.
Are you sure the original Windows file is mbox format? Even if it is,
there are opportunities for extra garbage to be added when copying from
one system to another.
If you can find mbox messages somewhere in the file, you can use dd to
strip off the leading garbage.
But maybe it's not really mbox format, and there is extra garbage
between each message. Or worse, some kind of compressed format where
you can't really see what you have just by looking at the data.
Tinkering with the data, using dd, can help you answer those questions.
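A possible way to do the "find the mail, then strip the garbage" step without guessing offsets by hand. This sketch assumes GNU grep (for the -a, -b and -m options) and that the first real message begins on its own line; both function names are invented for the illustration.

```shell
#!/bin/sh
# first_from_offset FILE
# Prints the byte offset of the first line starting with "From ",
# or nothing if there is no such line. -a makes grep treat binary data as text.
first_from_offset() {
    LC_ALL=C grep -a -b -m 1 '^From ' "$1" | cut -d: -f1
}

# strip_leading_garbage FILE OUT
# Copies FILE to OUT, starting at the first "From " line.
strip_leading_garbage() {
    offset=$(first_from_offset "$1")
    [ -n "$offset" ] || { echo "no From line found in $1" >&2; return 1; }
    tail -c +"$(( offset + 1 ))" "$1" > "$2"    # tail -c +N counts bytes from 1
}
```

If first_from_offset prints nothing at all, the file probably contains no mbox data in plain text, which is itself a useful answer.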
I haven't read the whole bandworm thread, so that may already have been
suggested; say you want the mails sorted by month and year, as defined
in the From field (e.g. "From - Sun Dec 27 21:08:44 2009", and all mails
from Dec 2009 in file mbox shall be stored in a file mbox_2009-Dec)...
awk '/^From / { f = "mbox_"$NF"-"$4 } { print > f }' mbox
(If the number of created files will exceed some number of allowed open
file descriptors, please tell us, then the code needs some adjustments.)
Janis
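A sketch of the file-descriptor adjustment hinted at above, assuming Thunderbird-style separator lines with exactly seven fields, such as "From - Sun Dec 27 21:08:44 2009". It keeps only one output file open at a time by closing the previous one whenever the month changes, and it appends (>>) so that a month which comes up again later is not overwritten. split_by_month and its prefix argument are hypothetical names.

```shell
#!/bin/sh
# split_by_month FILE PREFIX
# Appends each message to PREFIX<year>-<month>, e.g. PREFIX2009-Dec.
split_by_month() {
    awk -v prefix="$2" '
        NF == 7 && /^From / {
            name = prefix $NF "-" $4         # last field is the year, 4th the month
            if (name != out) {
                if (out != "") close(out)    # release the previous descriptor
                out = name
            }
        }
        out != "" { print >> out }           # skip anything before the first separator
    ' "$1"
}
```

Because it appends, it should be run in a directory with no output files left over from an earlier attempt, for example `split_by_month mbox mbox_`.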
>
> Tuxedo
Janis Papanagnou wrote:
>
> I haven't read the whole bandworm thread, so that may already have been
> suggested; say you want the mails sorted by month and year, as defined
> in the From field (e.g. "From - Sun Dec 27 21:08:44 2009", and all mails
> from Dec 2009 in file mbox shall be stored in a file mbox_2009-Dec)...
>
> awk '/^From / { f = "mbox_"$NF"-"$4 } { print > f }' mbox
To avoid matching a message body line that starts with "From [...]", you can define
the pattern more precisely; instead of /^From / specify (for example)...
/^From - [A-Z][a-z][a-z] [A-Z][a-z][a-z] .* [0-9][0-9][0-9][0-9]$/ {...}
or perhaps just
NF==7 && /^From / {...}
[...]
> > awk '/^From / { f = "mbox_"$NF"-"$4 } { print > f }' mbox
>
> To prevent a message body line starting with "From [...]" you can defined
> the pattern more accurate, instead of /^From / specify (for example)...
>
> /^From - [A-Z][a-z][a-z] [A-Z][a-z][a-z] .* [0-9][0-9][0-9][0-9]$/ {...}
>
> or perhaps just
>
> NF==7 && /^From / {...}
>
> >
> > (If the number of created files will exceed some number of allowed open
> > file descriptors, please tell us, then the code needs some adjustments.)
> >
> > Janis
>
Thanks for this awk tip!
But you are right, the first one catches message body text that simply
begins a line with "From":
awk '/^From / { f = "mbox_"$NF"-"$4 } { print > f }' mbox
The other versions, however, I get some errors with. I presume I am
replicating it in some wrong way:
awk '/^From - [A-Z][a-z][a-z] [A-Z][a-z][a-z] .* [0-9][0-9][0-9][0-9]$/ { f
= "mbox_"$NF"-"$4 } { print > f }' mbox
The error for the above is "redirection has null string value".
awk 'NF==7 && /^From { f = "mbox_"$NF"-"$4 } { print > f }' sent-mail
The error here is "unterminated regexp".
Perhaps you can correct the above or type your two last examples in full?
Thanks,
Tuxedo
(ITYM, a line starting with "From ".)
> awk '/^From / { f = "mbox_"$NF"-"$4 } { print > f }' mbox
>
> The other versions, however, I get some errors with. I presume I am
> replicating it in some wrong way:
>
> awk '/^From - [A-Z][a-z][a-z] [A-Z][a-z][a-z] .* [0-9][0-9][0-9][0-9]$/ { f
> = "mbox_"$NF"-"$4 } { print > f }' mbox
Have you put the whole statement in a single line? Or in more lines as
follows...
awk '
/^From - [A-Z][a-z][a-z] [A-Z][a-z][a-z] .* [0-9][0-9][0-9][0-9]$/ {
f = "mbox_"$NF"-"$4
}
{ print > f }
' mbox
>
> The error for the above is "redirection has null string value".
Please compare the defined awk pattern with your "From " line in the mbox
file.
grep '^From ' mbox | head -1 | od -c
might help to spot invisible white space characters.
>
> awk 'NF==7 && /^From { f = "mbox_"$NF"-"$4 } { print > f }' sent-mail
awk 'NF==7 && /^From / { f = "mbox_"$NF"-"$4 } { print > f }' sent-mail
>
> The error here is "unterminated regexp".
>
> Perhaps you can correct the above or type your two last examples in full?
Please retry.
Janis
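The "redirection has null string value" error means that a line was printed before any "From " line had matched, so f was still empty. One possible cause, and this is only a guess for a file that came from a Windows system, is a carriage return at the end of each line, which would defeat the trailing $ in the strict pattern. A sketch that tolerates CRLF line endings and parks unmatched leading lines in a separate file (split_strict and mbox_unmatched are made-up names):

```shell
# split_strict FILE
# Split by month using the strict "From - Day Mon ... YYYY" pattern.
split_strict() {
    awk '
    {
        line = $0
        sub(/\r$/, "", line)         # ignore a trailing carriage return when matching
    }
    line ~ /^From - [A-Z][a-z][a-z] [A-Z][a-z][a-z] .* [0-9][0-9][0-9][0-9]$/ {
        n = split(line, p, " ")
        f = "mbox_" p[n] "-" p[4]    # e.g. mbox_2009-Dec
    }
    # lines seen before the first separator go to mbox_unmatched
    { print >> (f == "" ? "mbox_unmatched" : f) }
    ' "$1"
}

# Usage:
#   split_strict sent-mail
```

The original lines are written unchanged, carriage returns included. Output files are never closed here, so the open file limit mentioned earlier still applies.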
>
> Thanks,
> Tuxedo
awk '/^From / {n++} {print >"msg" n ".mbx" }' big_mbox
but never tried on such big files
"Tuxedo" <tux...@mailinator.com> wrote in message news:
hv6dbr$hn0$00$1...@news.t-online.com...
> Tuxedo
[..]
> 100 bytes is not enough to see the big picture. Try more, 1,000 or
> 10,000, or whatever it takes until you see some data that looks like
> mail messages. Then use the skip feature of dd to read past that when
> copying.
I thought the mbox format was meant to begin with "From" on the first line
of the file. At least that's how mboxes look on my Linux box. But who knows
what could have been inserted by some Windows application.
So I tried the larger values, bs=10000 etc, but same result.
The likely broken mbox file appears to be all binary, yet it doesn't look
like a typical binary file in my editor. I'm not sure what it is...
I tested the awk trick posted by Janis Papanagnou today:
awk '/^From / { f = "mbox_"$NF"-"$4 } { print > f }' myBigCrapBox
While this works on another good mailbox, the output I get with
myBigCrapBox is:
awk: cmd. line:1: fatal: grow_iop_buffer: iop->buf: can't allocate
-2147483646 bytes of memory (Cannot allocate memory)
The computer worked away at its maximum power for 30 seconds or so until
the process died without managing to find a single occurrence of a ^From
string.
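One plausible reading of that error: awk reads its input one record (line) at a time, so a file with very few newlines forces it to buffer one enormous "line", and the negative byte count looks like a size that overflowed a 32-bit integer. A quick way to test that idea without reading the whole file is to count bytes, newlines and NUL bytes in a sample from the start. The function name sample_stats and the default sample size are assumptions.

```shell
# sample_stats FILE [KIB]
# Report byte, newline and NUL counts for the first KIB kibibytes (default 1024).
sample_stats() {
    kib=${2:-1024}
    total=$(dd if="$1" bs=1024 count="$kib" 2>/dev/null | wc -c)
    newlines=$(dd if="$1" bs=1024 count="$kib" 2>/dev/null | wc -l)
    nonnul=$(dd if="$1" bs=1024 count="$kib" 2>/dev/null | tr -d '\000' | wc -c)
    echo "bytes=$((total)) newlines=$((newlines)) nul=$((total - nonnul))"
}

# Usage:
#   sample_stats myBigCrapBox 10240
```

A normal text mbox should show a newline every few dozen bytes and no NUL bytes at all.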
Tuxedo
[...]
> But maybe it's not really mbox format, and there is extra garbage
> between each message. Or worse, some kind of compressed format where
> you can't really see what you have just by looking at the data.
Yes I think the file must be in some compressed format. There are other
smaller mail folders from the same Windows drive that were used with the
same mail application but which are plain text. It's only the huge 2.8GB
file that is unreadable. Perhaps Mozilla Thunderbird is compressing any and
only those mboxes that exceed a certain file size, but if so, they do not
get a suffix, like .zip, .gz etc. If this theory is true, I'm still not
sure what compression format is used, in case it's even a standard format.
Tuxedo
Thanks, it works, though not with the actual huge mailbox, because I've just
found that something else is wrong with that one (it's probably compressed). But the
above examples work great with any 'normal' mbox I have.
Very useful to know.
Tuxedo
Silly mortal, assuming software adheres to standards. Have you watched
that video yet? :)
--
Thanks and best regards,
Chris Nehren
Unless noted, all content I post is CC-BY-SA.
>To prevent a message body line starting with "From [...]" you can defined
>the pattern more accurate, instead of /^From / specify (for example)...
>
> /^From - [A-Z][a-z][a-z] [A-Z][a-z][a-z] .* [0-9][0-9][0-9][0-9]$/ {...}
>
>or perhaps just
>
> NF==7 && /^From / {...}
I wonder how mail programs cope with that. The extra test is good, but
not foolproof. No test can be foolproof, unless "^From " in the body is
escaped (mangled) when stored.
>Yes I think the file must be in some compressed format.
Low level tools like dd help focus on the real problem.
>I customer just gave me a massive mail file in mbox format which has
>accrued over several years. The file was rescued from an old drive of a
>previous but now broken system, and so I would like to restore the mailbox
>in a mail application on a new system.
>
>The mail file was readable on the previous system in Mozilla Thunderbird,
>as there it had a corresponding .msf index. However, the .msf file no
>longer exists and the mbox itself is nearly 3GB.
What? You said you rescued an old drive. So if the .msf file no longer
exists, how can you know it had a .msf file?
I'm beginning to wonder if this thread is a practical joke.
[...]
> Silly mortal, assuming software adheres to standards. Have you watched
> that video yet? :)
I watched about half of "Email Hates the Living" on Google, entertaining
stuff! I will watch the rest.
In case of Thunderbird mbox format, the mbox files normally begin with
'From', at least so it does in other working T-Bird mail files from the
same system where the 2.8GB mail file comes from. It appears that T-Bird is
using some compression format when an mbox file hits a certain size:
https://wiki.mozilla.org/Talk:Thunderbird:2.0_Product_Planning#Auto_compress_folders_after_relative_changes_in_size
If I only knew which, I could try and uncompress it.
Tuxedo
> On Mon, 14 Jun 2010 21:17:26 -0400, Maxwell Lol <nos...@com.invalid>
> wrote:
>>You can even use perl and use something like
>>
>> @mail = split(/\nFrom /,$mboxfile);
>
> That will read it into memory all at once, which may cause thrashing
> with his 3GB file. In his scenario, better to read and write one line
> at a time, and open a new output file every so many messages.
Sure. I just wanted to mention this technique, because it's useful at times.
> It's easy to shoot yourself in the foot with Perl.
Of course. Dealing with 3GB files can be a concern.
However, if you only have to do it once, sometimes it's better to let
the computer do the work, even if it's not the most elegant solution.
There are times when I know it will take (say) 30 seconds longer for a
command to complete, but it's easier to do that, than to write a
better script (which will take longer than 20 seconds).
Mental triage, so to speak.
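For comparison, the streaming alternative mentioned above (read one line at a time and open a new output file every so many messages) can be sketched in awk without holding more than one line in memory. The function name split_every and the chunk_NNNN.mbx naming are made up for illustration, and the simple /^From / test has the same weakness with unescaped body lines as discussed elsewhere in the thread.

```shell
# split_every FILE MESSAGES_PER_CHUNK
# Write chunk_0001.mbx, chunk_0002.mbx, ... with at most MESSAGES_PER_CHUNK messages each.
split_every() {
    awk -v per="$2" '
    /^From / {
        if (count % per == 0) {
            if (out != "") close(out)              # only one output file open at a time
            out = sprintf("chunk_%04d.mbx", ++chunk)
        }
        count++
    }
    out != "" { print > out }
    ' "$1"
}

# Usage:
#   split_every big_mbox 5000
```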
>>> @mail = split(/\nFrom /,$mboxfile);
> It's easy to shoot yourself in the foot with any language. The above
> code assumes that the file is already stored in the variable $mboxfile.
> Any language--even $YOUR_FAVORITE_LANGUAGE--can do this. Please don't
> spread FUD about something that's possible with any tool. Beyond that,
> the above regex will run into problems parsing mailboxes.
As I said, this will work if your mailboxes purposely prevent lines
starting with "From" from coming in unmodified.
Sendmail has an option to do this, so mailboxes do exist with this
characteristic. I use it to convert messages into digest format.
However, Chris makes a good point.
>As I said, this will work if your mailboxes purposely prevent lines
>starting with "From" to come in umodified.
>Sendmail has an option to do this
I would like to know the sendmail option.
> Tuxedo
>
>
--
[It is] best to confuse only one issue at a time.
-- K&R
> On Tue, 15 Jun 2010 06:26:28 +0000 (UTC), Chris Nehren
> <ape...@invalid.isuckatdomains.localhost.net> wrote:
>
>>> It's easy to shoot yourself in the foot with Perl.
>
>>It's easy to shoot yourself in the foot with any language. The above
>>code assumes that the file is already stored in the variable $mboxfile.
>>Any language--even $YOUR_FAVORITE_LANGUAGE--can do this. Please don't
>>spread FUD about something that's possible with any tool.
>
> Sorry, I couldn't resist. But it is "easy" with Perl ;-)
It's even easier with "rm" - Better not use that either, :-)
> I'm not sure how to do that with vi and emacs. I don't think any editor
> will actually open the file.
Emacs will open and edit large binary files.
> Low level tools like dd help focus on the real problem.
dd is good :-)
> What? You said you rescued an old drive. So if the .msf file no longer
> exists, how can you know it had a .msf file?
I'm not entirely sure, I only presume the mbox had a corresponding .msf
index when it was working, because the huge mailbox was working on the
original system.
As far as I understand and have also tested, if an mbox without .msf file
is placed in a relevant Mozilla T-Bird folder, the .msf index will be
created once firing up T-Bird. If the index does not exist, the parsing of
the mbox takes longer, which is mainly noticeable when initially created
against a fairly large mbox. By that I also understand that the mail
application reads the index rather than the mbox itself when used and only
when clicking on a particular message listing, the mail application returns
the relevant message part of the mbox to display to the user by the GUI.
Transferring other less than a gigabyte sized mail folders worked fine and
the .msf indexes were created as expected. Whatever happened to the
original .msf index for the huge mbox, I do not know for sure. I did not
personally rescue the data, so maybe the file was lost or perhaps it did in
fact not even exist.
The problem I think I have is while the mail application attempts to create
the index on a new set up the file has grown so huge that the process of
creating this index terminates because the file is too large for the system
or application to read in at once. Anyway, I guess this is the problem, but
there could also be something else wrong with the file. It surely is not a
plain text format mbox.
> I'm beginning to wonder if this thread is a practical joke.
Not at all, though I can think of more entertaining topics than trying to
recover the contents of a 2.8GB Windows Mozilla T-Bird mbox :-)
Tuxedo
> It surely is not a plain text format mbox.
Must be in Mozilla compacted format. But if you can't get any Mozilla
to read it, I don't know what more you can do.
[...]
> Must be in Mozilla compacted format. But if you can't get any Mozilla
> to read it, I don't what more you can do.
What actually happens after moving the file into a relevant Mozilla folder
and firing up T-Bird is that the progress bar in the bottom right of the
mail application begins to move and the status bar indicates the following:
Building summary file for MyBigCrapBox...
Determining which messages to index...
While the progress bar continues in hopeful anticipation until it all ends
in a small tragedy maybe about a minute later, choking on the big file. No
related .msf created. No message shown in the folder.
I'm trying this on a fast Windows box with 4GB RAM. So it may be
similar to trying to read in the full file with formail on Linux
until it runs out of memory.
Mozilla Compact sounds familiar! I guess nobody here happens to know an
uncompress command for that exotic compression format :-). After all, it
should be open source...
Tuxedo
>Mozilla Compact sounds familiar! I guess nobody here happes to know an
>uncompress command for that exotic compression format :-). After all, it
>should be open source...
It is. You could read the Mozilla source and try to hack out your own
solution. But as for shell related solutions, I think we have reached
the point of diminishing returns.
Certainly not foolproof. You've read the "perhaps just", I suppose?
Just to be sure that you understand what I suggested; in the first
pattern the /.*/ was meant as an abbreviation (nonetheless functionally
effective) for a complete pattern, as defined in the /^From / line.
I abbreviated it to fit on a line, for posting purposes.
Generally; how would a regexp be able to distinguish, in a primitive
mbox data format, between the From line and a data line starting with
/From/ that is preceded by an empty line and followed by /Tags:/
lines? We both know that.
Janis
> On Wed, 16 Jun 2010 08:19:25 -0400, Maxwell Lol <nos...@com.invalid>
> wrote:
>
>>As I said, this will work if your mailboxes purposely prevent lines
>>starting with "From" to come in umodified.
>
>>Sendmail has an option to do this
>
> I would like to know the sendmail option.
It's only an option when it processes incoming mail messages (port
25). So it won't help your problem. I have a mail server, and it uses
this option when it saves mail into a file.
[...]
> It is. You could read the Mozilla source and try to hack out your own
> solution. But as for shell related solutions, I think we have reached
> the point of diminishing returns.
Yes, I agree, it's now off-topic here. I will search for relevant Mozilla
groups/forums/docs.
Thanks for the advice, especially on dd procedures.
Tuxedo
> John Kelly wrote:
>
> [...]
>
>> It is. You could read the Mozilla source and try to hack out your own
>> solution. But as for shell related solutions, I think we have reached
>> the point of diminishing returns.
>
> Yes, I agree, it's now off-topic here. I will search for relevant Mozilla
> groups/forums/docs.
Just in case you missed the suggestion a while back... have you tried
the file command? It may be able to tell you the format if it is some
standard compressed file. It is worth while simply because it takes no
more than a second to try.
--
Ben.
grep '^From ' <your_mbox_file>
If so, perhaps grep's -n option will tell you the first line beginning
a valid mbox format message. You might then use head(1) to retrieve the
leading "binary garbage" and deal with that separately. The messages
following the leading garbage could be retrieved with tail(1).
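The grep -n, head and tail steps described above might look like the following sketch. It assumes a grep with -a and -m (GNU or BSD), and the function name split_at_first_from is made up. Being line based, it only helps if the leading garbage actually contains newlines; one giant newline-free block will strain these tools just as it strains awk.

```shell
# split_at_first_from IN GARBAGE_OUT MBOX_OUT
# Put everything before the first "From " line in GARBAGE_OUT, the rest in MBOX_OUT.
split_at_first_from() {
    n=$(LC_ALL=C grep -a -n -m 1 '^From ' "$1" | cut -d: -f1)
    [ -n "$n" ] || { echo "no 'From ' line found in $1" >&2; return 1; }
    if [ "$n" -gt 1 ]; then
        head -n "$((n - 1))" "$1" > "$2"    # the leading garbage
    else
        : > "$2"                            # nothing precedes the first message
    fi
    tail -n +"$n" "$1" > "$3"               # from the first message onward
}

# Usage:
#   split_at_first_from your_mbox_file garbage.bin messages.mbox
```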
[...]
> Just in case you missed the suggestion a while back... have you tried
> the file command? It may be able to tell you the format if it is some
> standard compressed file. It is worth while simply because it takes no
> more than a second to try.
I didn't miss it, there always seems to exist a new and interesting command
to try. Unfortunately the output is a bit vague on this particular file:
$ file MyBigCrapBox
MyBigCrapBox: data
So it just returns 'data'. I guess it is 'compacted' Mozilla mailbox data.
I'm searching various Mozilla related sites for possible solutions. At the
same time I have a feeling even if there is some way to 'uncompact' the
data my system will likely run out of memory while trying to do so. I guess
I could split the file first with dd, but then it might not work to
uncompact properly.
Thanks for the tip.
Tuxedo
It's very likely not, as jwz is too pleased to rant^Wtell you:
http://www.jwz.org/doc/mailsum.html (or at least it wasn't a standard
format in the iterations he discusses there)
Though perhaps searching the Mozilla community sites will yield an
extension or addon that can massage the data into something resembling a
sane (inasmuch as email can be sane) format.
[...]
> It's very likely not, as jwz is too pleased to rant^Wtell you:
> http://www.jwz.org/doc/mailsum.html (or at least it wasn't a standard
> format in the iterations he discusses there)
>
> Though perhaps searching the Mozilla community sites will yield an
> extension or addon that can massage the data into something resembling a
> sane (inasmuch as email can be sane) format.
Thanks for the jwz link. I searched a couple of Mozilla related forums but
for some reason links from for example:
http://kb.mozillazine.org/Thunderbird_:_Tips_:_Compacting_Folders to:
http://kb.mozillazine.org/Compacting_folders#Compacting_does_not_seem_to_work
do not load.
I think I'll put this whole affair down to bad software design by Mozilla
developers for not clearly communicating the growing mailbox risk to
end-users as part of the actual user interface, combined with user-error of
the person having lost years of data that is now probably irretrievable on
anything less than a small super-computer and/or by custom programming
solutions. That said, I've learned some interesting stuff in this thread :-)
Tuxedo
OK, that just means file does not know this format. It was worth a try.
<snip>
--
Ben.
Could it be, by any chance, that the file was opened for writing by
Mozilla and the system crashed, so that all open files that had not
been flushed to disk were nullified?
In any case, I'd look at the other end of the file and check whether
there is a valid mail file somewhere. You could also try mutt -f
big_broken_file to see if mutt can read it. In my experience,
mutt's mbox parser works quite well and I have been able to restore
broken mbox files using mutt (that contained junk in between, because it
was an originally deleted file).
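A rough way to look at the other end of the file and test the "nullified" theory at the same time is to count non-NUL bytes and "From " lines in the last part of the file. This is only a sketch; the function name tail_report and the default sample size are assumptions, and grep -a is a GNU/BSD extension.

```shell
# tail_report FILE [KIB]
# Inspect the last KIB kibibytes (default 1024) of FILE.
tail_report() {
    kib=${2:-1024}
    # bytes left after deleting NULs; 0 means the sample is all zero bytes
    nonnul=$(tail -c "$((kib * 1024))" "$1" | tr -d '\000' | wc -c)
    # grep -c exits non-zero when it counts 0 matches, hence the || true
    froms=$(tail -c "$((kib * 1024))" "$1" | LC_ALL=C grep -a -c '^From ' || true)
    echo "nonnul=$((nonnul)) from_lines=$((froms))"
}

# Usage:
#   tail_report MyBigCrapBox 10240
```

A result of nonnul=0 would support the idea that unflushed blocks were zero filled, while any from_lines above zero suggests that part of the mailbox survived.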
Any way, to come back to your original question, to split a huge mbox
file, I'd try archivemail.
regards,
Christian
[...]
> Could it be, by any chance, that the file was opened for writing by
> mozilla and the system crashed. So that all opened files that have not
> been flushed to disk have been nullified?
That may very well be the case as the system probably crashed with the mail
application open. But I don't know for sure what happened and the user
probably doesn't either.
> In any case, it look at the other end of the file and check whether
> there is somewhere a valid mail file. You could also try mutt -f
> big_brocken_file as well to see if mutt can read it. In my experience,
> mutt's mbox parser works quite well and I have been able to restore
> broken mbox files using mutt (that contained junk in between, because it
> was a original deleted file).
$ mutt -f MyBigCrapBox
MyBigCrapBox: Value too large for defined data type (errno = 75)
> Any way, to come back to your original question, to split a huge mbox
> file, I'd try archivemail.
I don't have archivemail on my system but maybe I'll give it a try,
although dd appears to be the ideal solution for dealing with extra large
files.
Thanks,
Tuxedo
As someone else mentioned downthread, "split" is probably better
than "dd" for this purpose. "csplit" may be even better (I've
never used it myself).
As for using it while logged in as root, you shouldn't even *consider*
doing something like this as root. Non-root accounts exist for a reason.
--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
No application of this type should ever try to load all its data in memory
at once. That's just bad design.
A 3GB mail collection may seem big to us, but it's just the tip of the
iceberg, really. Since MS Outlook went beyond the 32 bit boundary, I've seen
users with 12GB and even 17GB mail files. (in MS Outlook's PST format).
We might not like it. We might think them idiots, but it's getting worse
every day. Users keep all mails, no matter how irrelevant. They don't
organise them or archive them. And since they have no clue about file
transfers, they send/receive most files as email attachments. And they store
those mails, too.
The problem is not even the capacity of the mail program to handle those
volumes. What those people don't realise, is that they can't even archive
those data. DVDs are already too small for the volumes I encounter. And
17GB is not that far from a single-sided Blu-ray (25GB)....
> What methods exists to process and restore this huge file? How about for
> example splitting it into parts, such as 5 or 10 different files, obviously
> cut at the right points between messages. I guess the individual mbox files
> can then easily be readable in more or less any mail application. Can this
> be done via the shell and if so how?
I see you got plenty of answers there. My choice would have been to write a
python script, reading the big file one line at a time, writing output files
of - at most - xxx MB (volume to be determined by testing with your
Thunderbird)
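The same idea (read one line at a time, start a new output file at a message boundary once a size limit is reached) can also be sketched with awk from the shell. The function name split_by_size and the part_NNN.mbox naming are made up, and the right limit would still have to be found by testing. LC_ALL=C is set so that length() counts bytes rather than characters.

```shell
# split_by_size FILE MAX_BYTES
# Write part_001.mbox, part_002.mbox, ...; a new part starts at the first
# "From " line seen after the current part has reached MAX_BYTES.
split_by_size() {
    LC_ALL=C awk -v max="$2" '
    /^From / && (out == "" || bytes >= max) {
        if (out != "") close(out)
        out = sprintf("part_%03d.mbox", ++part)
        bytes = 0
    }
    out != "" {
        print > out
        bytes += length($0) + 1      # +1 for the newline awk stripped
    }
    ' "$1"
}

# Usage (about 500 MB per part):
#   split_by_size big_mbox 500000000
```

Parts can exceed the limit by up to one message, since a message is never cut in the middle.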
Good luck with it.
--
Any time things appear to be going better, you have overlooked
something.
[...]
> So it just returns 'data'. I guess it is 'compacted' Mozilla mailbox data.
> I'm searching various Mozilla related sites for possible solutions. At the
> same time I have a feeling even if there is some way to 'uncompact' the
> data my system will likely run out of memory while trying to do so. I
> guess I could split the file first with dd, but then it might not work to
> uncompact properly.
Although this information is largely irrelevant to comp.unix.shell, I found
that the Mozilla mbox 'compact' is not a compressed format after all, but
simply plain text:
http://forums.mozillazine.org/viewtopic.php?p=9510377#p9510377
http://kb.mozillazine.org/Compacting_folders#What_is_compacting.3F
It appears I have a seriously broken mbox which is surely beyond any means
of repair. I presume the mail application must have been trying to write to
file by compacting/deleting or perhaps appending new messages while the
system or simply the application must have crashed (as suggested by
Christian Brabandt in a post) and that it resulted in 2.8GB of gibberish
being written to the file. It's the only logical conclusion I can imagine.
Thanks to everyone for their response in trying to solve the unsolvable....
Tuxedo
All right already. Just put the file up for download somewhere, we'll have
a look at it. :-)
PS: don't forget to compress it.