Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Making thunderbird attachments searchable

2 views
Skip to first unread message

qoo eiji

unread,
Jan 3, 2010, 8:18:08 AM1/3/10
to scr...@perl.org
Okay, I wrote this script because I've been trying to find a way to
search my attachments in Thunderbird.
I figured I'm not the only one but it's a quick hack and not a module
so I don't know where to post it, so here it is:

You can do anything you want with it. It's not very polished because I
thought I'd give myself more time before polishing it up and put
options into it. Right now it's just a pipe - cos I can't guarantee
that it won't kill your mbox!

After running your mbox through this code you should be able to search
your attachment names under the tage "X-META-01".

#!/bin/perl
# process mbox and add X-META-01 tag with filenames of attachments
# code by zlel
# put into the public domain 2010.01.03
# filename: mboxtagattachments
# usage : cat Inbox | mboxtagattachments > newInbox

$OUTPUTENCODING = "UTF-8";

use MIME::Base64 qw( encode_base64 decode_base64 );
use Text::Iconv;
use Data::Dumper;

$STATE = "INIT";
$LASTLINE = 1;
while (($_=<STDIN>) || $LASTLINE) {
if (!$_) {
$LASTLINE = 0;
$_ = "From -";
}
s/[\n\r]*$//g;
$_ .= "\r\n";
if (/^From -/) {
if ($RAW_PROLOG) {
## -------------------------------------------
## HANDLE MAIL THAT ENDED
## -------------------------------------------

%subvalues = getSubValues($MAIL_HEADER
{"Content-Type"});
$CHARSET = $subvalues{"charset"};

# print STDERR "$CHARSET\n";
# if ($CHARSET) {
# $MAIL_BODY = Text::Iconv->new
($CHARSET, $OUTPUTENCODING)->convert($MAIL_BODY);
#}
#print STDERR Dumper(\%MAIL_HEADER);

print $RAW_PROLOG;
for (keys %MAIL_ATTACHMENTS) {
$fn = $MAIL_ATTACHMENTS{$_};
if ($fn =~ /[^[:alnum:][:punct:]
[:space:]]/) {
if ($CHARSET) {
$fn = Text::Iconv->new
($OUTPUTENCODING, $CHARSET)->convert($fn);
}
}
print "X-META-01: $fn\r\n";
}
print $RAW_HEADER;
print "\r\n";
print $RAW_BODY;
}
$RAW_PROLOG = $_;
$RAW_HEADER = "";
$RAW_BODY = "";
%MAIL_HEADER = ();
$MAIL_BODY = "";
%MAIL_ATTACHMENTS = ();
$STATE = "HEADER";
$KEY = "";
next;
}
if ($STATE eq "HEADER") {
if (/^\s*$/ && ($MAIL_HEADER{"From"} ne "")) {
$STATE = "BODY";
if ($MAIL_HEADER{"Content-Type"} =~ /
multipart/) {

%subvalues = getSubValues($MAIL_HEADER
{"Content-Type"});
$partbody_boundary = $subvalues
{"boundary"};

%partbody_headers = ();
$partbody_count = 0;
$partbody_filename = "";
$KEY = "";
$STATE = "PART-HEADER";
}
} else {
if (/^X-META-01:/) {
next;
} else {
$RAW_HEADER .= $_;
}
}
if (/^\s/) {
chomp;
$KEY = $PREVKEY;
$VALUE = $_;
} else {
($KEY, $VALUE) = /^(\S[^:]*):(.*)/;
$PREVKEY = $KEY;
}
if ($KEY eq "Subject") {
$VALUE =~ s/^\s*//;
$VALUE = decode($VALUE);
}
if ($KEY) {
$MAIL_HEADER{$KEY} .= $VALUE;
}
next;
}
if ($STATE eq "PART-BODY") {
$RAW_BODY .= $_;
if (/$partbody_boundary/) {
{
## -----------------------------------
## HANDLE COMPLETED PART
## -----------------------------------

%subvalues = getSubValues
($partbody_headers{"Content-Type"});
$partbody_filename = decode($subvalues
{"name"});

if ($partbody_filename eq "") {
%subvalues = getSubValues
($partbody_headers{"Content-Disposition"});
$partbody_filename = $subvalues
{"filename"};
}

# if ($partbody_filename eq "") {
# for (my $c=0 ; $subvalues
{"filename*$c*"}; $c++) {
#
$partbody_filename .= $subvalues{"filename*$c*"};
# }
# $partbody_filename = decode
($partbody_filename, "url");
# }

if ($partbody_filename) {
$MAIL_ATTACHMENTS
{$partbody_count} = $partbody_filename;
}

# print "PART $partbody_count\n";
# print Dumper(\%partbody_headers);
# print "FILENAME $partbody_filename
\n";
# print $partbody;
}
$STATE = "PART-HEADER";
$partbody = "";
$partbody_count++;
$partbody_filename = "";
} else {
$partbody .= $_;
}
next;
}
if ($STATE eq "PART-HEADER") {
$RAW_BODY .= $_;
if (/^\s*$/) {
$STATE = "PART-BODY";
}
if (/^\s/) {
chomp;
if ($_) {
$partbody_headers{$PREVKEY} .= "\n".
$_;
}
next;
} else {
($KEY, $VALUE) = /^(\S[^:]*):(.*)/;
if ($KEY && $VALUE) {
$partbody_headers{$KEY} = $VALUE;
}
$PREVKEY = $KEY;
}
next;
}
if ($STATE eq "BODY") {
$RAW_BODY .= $_;
$MAIL_BODY .= $_;
next;
}
}

sub getSubValues {
my $value = shift @_;
my %subvalues = ();
my $initial;
my $oldvalue = "";

($initial, $value) = $value =~ m/^([^;\n\r]*)[;[:space:]\n\r]*
(.*)/sg;
$subvalues{""} = $initial;
while ($value) {
($key, $value) = $value =~ /([^=]*)=(.*)/s;
if ($value =~ /^"/) {
($keyvalue, $value) = $value =~ /"([^"]
*)"[[:space:]\n\r]*(.*)/s;
} else {
($keyvalue, $value) = $value =~ /\s*([^;
[:space:]]*)[;[:space:]\n\r]*(.*)/s;
}
$subvalues{$key} = $keyvalue;
if ($oldvalue eq $value) {
break;
}
$oldvalue = $value;
}
return %subvalues;
}

sub encode {
my $fn = shift @_;
my $encoding = shift @_;
my $charset = shift @_;
if ($charset eq "") {
$charset = $OUTPUTENCODING;
}
if ($encoding eq "url") {
$fn = "$charset''".URLEncode(Text::Iconv->new
($OUTPUTENCODING, $charset)->convert($fn));
} else {
$fn = encode_base64(Text::Iconv->new($OUTPUTENCODING,
$charset)->convert($fn));
chomp($fn);
$fn = "=?$charset?$fn?=";
}
return $fn;
}
sub decode {
my $fn = shift @_;
my $encoding = shift;
if ($encoding eq "url") {
$fn =~ s/([^']*?)''([^;]*?);/Text::Iconv->new($1,
$OUTPUTENCODING)->convert(URLDecode($2))/eg;
} else {
$fn =~ s/=\?([^\?]*?)\?([^\?]*?)\?([^\?]*?)\?=/
Text::Iconv->new($1, $OUTPUTENCODING)->convert(decode_base64($3))/eg;
}
return $fn;
}
sub URLDecode {
my $theURL = $_[0];
$theURL =~ tr/+/ /;
$theURL =~ s/%([a-fA-F0-9]{2,2})/chr(hex($1))/eg;
$theURL =~ s/<!?(.|\n)*?>//g;
return $theURL;
}
sub URLEncode {
my $theURL = $_[0];
$theURL =~ s/([\W])/%".uc(sprintf("%2.2x",ord($1)))/eg;
return $theURL;
}
exit;

John W. Krahn

unread,
Jan 4, 2010, 6:55:48 PM1/4/10
to scr...@perl.org
qoo eiji wrote:
> Okay, I wrote this script because I've been trying to find a way to
> search my attachments in Thunderbird.
> I figured I'm not the only one but it's a quick hack and not a module
> so I don't know where to post it, so here it is:
>
> You can do anything you want with it. It's not very polished because I
> thought I'd give myself more time before polishing it up and put
> options into it. Right now it's just a pipe - cos I can't guarantee
> that it won't kill your mbox!
>
> After running your mbox through this code you should be able to search
> your attachment names under the tage "X-META-01".
>
> [ SNIP ]

>
> sub getSubValues {
> my $value = shift @_;
> my %subvalues = ();
> my $initial;
> my $oldvalue = "";
>
> ($initial, $value) = $value =~ m/^([^;\n\r]*)[;[:space:]\n\r]*
> (.*)/sg;

The [:space:] character class already includes the characters \n and \r.

> $subvalues{""} = $initial;
> while ($value) {
> ($key, $value) = $value =~ /([^=]*)=(.*)/s;
> if ($value =~ /^"/) {
> ($keyvalue, $value) = $value =~ /"([^"]
> *)"[[:space:]\n\r]*(.*)/s;
> } else {
> ($keyvalue, $value) = $value =~ /\s*([^;
> [:space:]]*)[;[:space:]\n\r]*(.*)/s;
> }
> $subvalues{$key} = $keyvalue;
> if ($oldvalue eq $value) {
> break;

'break' is not valid in this context, perhaps you meant 'last' instead.

> }
> $oldvalue = $value;
> }
> return %subvalues;
> }

John
--
The programmer is fighting against the two most
destructive forces in the universe: entropy and
human stupidity. -- Damian Conway

Johan Vromans

unread,
Jan 6, 2010, 4:04:17 AM1/6/10
to qoo eiji, scr...@perl.org
qoo eiji <eijik...@gmail.com> writes:

> Okay, I wrote this script because I've been trying to find a way to
> search my attachments in Thunderbird.

Thanks for your contribution! You are a brave person.

> #!/bin/perl
> ...
> $OUTPUTENCODING = "UTF-8";

For any script that you write, always use

use strict;
use warnings;

This will help you finding typing mistakes and other sources of
hard-to-find errors.

It is always worth the trouble to investigate the miscellaneous CPAN
modules that may help you. There are several modules that can parse
mail messages, saving you a lot of energy.

> $STATE = "INIT";

In general, uppercase names are 'reserved' for Perl. You're free to
use them but one day Perl may change its mind.

Also, using 'use strict' will force you to declare all variables.

> s/[\n\r]*$//g;

This can better be written as

s/[\n\r]+$//;

The 'g' modifier is useless since the pattern is anchored to the end
of the string.

> for (keys %MAIL_ATTACHMENTS) {
> $fn = $MAIL_ATTACHMENTS{$_};

You may consider using the form:

while ( my($key,$value) = each(%MAIL_ATTACHMENTS) ) {

> if (/$partbody_boundary/) {

The variable may contain characters that have a special meaning when
used inside a pattern.

You can use either:

$partbody_boundary = quotemeta($partbody_boundary);
if ( /$partbody_boundary/ )

or

if ( /\Q$partbody_boundary\E/ )

This will 'quote' the potential special characters so they only match
themselves.

-- Johan

Vlado Keselj

unread,
Jan 6, 2010, 7:50:13 AM1/6/10
to Johan Vromans, qoo eiji, scr...@perl.org

On Wed, 6 Jan 2010, Johan Vromans wrote:

...


> > if (/$partbody_boundary/) {
>
> The variable may contain characters that have a special meaning when
> used inside a pattern.
>
> You can use either:
>
> $partbody_boundary = quotemeta($partbody_boundary);
> if ( /$partbody_boundary/ )
>
> or
>
> if ( /\Q$partbody_boundary\E/ )
>
> This will 'quote' the potential special characters so they only match
> themselves.

Or, probably better, just use:
if ( index($_,$partbody_boundary) >= $[ )

One would expect index to be faster, eventhough regex are likely
sufficiently optimized, but we also save on transforming the string with
\Q, and compiling it as an regex.

--Vlado

Terrence Brannon

unread,
Jan 6, 2010, 9:10:46 AM1/6/10
to Johan Vromans, qoo eiji, scr...@perl.org

Johan Vromans wrote:
>
> The variable may contain characters that have a special meaning when
> used inside a pattern.
>
> You can use either:
>
> $partbody_boundary = quotemeta($partbody_boundary);
> if ( /$partbody_boundary/ )
>
> or
>
> if ( /\Q$partbody_boundary\E/ )
>

How about
if ( qr/$regexp/ )

Per http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators

you can precompile the regexp this way if necessary.

qoo eiji

unread,
Jan 6, 2010, 7:21:21 AM1/6/10
to scr...@perl.org
Hi Johan,

Thank you on the note on quotemeta. Overlooked that one totally!

I did search CPAN, but I was concerned about preserving my mbox as
much as possible while injecting tags into them, so that a simple diff
can help me verify if my processed mbox was correct - maybe I wasn't
thorough, but a quick search on CPAN didn't return promising
results... so while my script may not be as reliable, at least I know
that my output is verifiable...

Besides, there are other hacks I need to do to work around
Thunderbird, so I was hoping the code could serve as a skeleton for
people like me... until Thunderbird 3 finally manages to fix all my
problems!

exit;

Johan Vromans

unread,
Jan 6, 2010, 2:18:57 PM1/6/10
to scr...@perl.org
qoo eiji <eijik...@gmail.com> writes:

> I did search CPAN, but I was concerned about preserving my mbox as
> much as possible while injecting tags into them, so that a simple diff
> can help me verify if my processed mbox was correct - maybe I wasn't
> thorough, but a quick search on CPAN didn't return promising
> results... so while my script may not be as reliable, at least I know
> that my output is verifiable...

You can use the modules to analyse the mailbox, and still write the
mailbox yourself.

-- Johan

Johan Vromans

unread,
Jan 6, 2010, 2:16:22 PM1/6/10
to scr...@perl.org
>>> Please followup-to the list only!

Terrence Brannon <meta...@gmail.com> writes:

> How about
> if ( qr/$regexp/ )

qr// returns a pre-compiled regexp, it doesn't perform a match.

my $pat = qr/.../;
if ( $string =~ $pat ) { ... }

-- Johan

0 new messages