Parsing an email message

Bernie Cosell

Jan 10, 2022, 6:37:33 PM
I need to parse an email message and pull its various parts apart. Is
there some not-so-difficult way to do it? Courriel looks like it would be
just the thing, unfortunately it won't run on Windows. The Mail:: and
Email:: modules seem very complicated when all I want to do is feed it a
complete message and get at the various pieces [body, attachments, etc] and
the headers [from, date, etc]. Is there a _simple_ package that'll do
that? If not, are there tutorials or the like for Mail:: and/or Email::?
They seem to be much more focused on managing actual mailboxes {Mail::} and
*composing* emails [Email::] and give pretty short shrift [to my struggling
with the man pages] to just *parsing* an email. Thanks!

/Bernie\
--
Bernie Cosell Fantasy Farm Fibers
ber...@fantasyfarm.com Pearisburg, VA
--> Too many people, too few sheep <--

Rainer Weikusat

Jan 11, 2022, 12:31:51 PM
Bernie Cosell <ber...@fantasyfarm.com> writes:
> I need to parse an email message and pull its various parts apart. Is
> there some not-so-difficult way to do it? Courriel looks like it would be
> just the thing, unfortunately it won't run on Windows. The Mail:: and
> Email:: modules seem very complicated when all I want to do is feed it a
> complete message and get at the various pieces [body, attachments, etc] and
> the headers [from, date, etc]. Is there a _simple_ package that'll do
> that? If not, are there tutorials or the like for Mail:: and/or Email::?
> They seem to be much more focused on managing actual mailboxes {Mail::} and
> *composing* emails [Email::] and give pretty short shrift [to my struggling
> with the man pages] to just *parsing* an email. Thanks!

There is no simple way to parse an e-mail message: That's literally the
most complicated grammar I ever wrote a parser for.

Henry Law

Jan 11, 2022, 5:58:38 PM
On Mon, 10 Jan 2022 18:37:26 -0500, Bernie Cosell wrote:

> Is there a _simple_ package that'll do that? If not, are there
> tutorials or the like for Mail:: and/or Email::?

I use Email::MIME. How "simple" it is depends on your point of view but,
as someone else has already observed, MIME email has a complicated
structure (e.g. separate parts within one message are themselves
Email::MIME structures), and you're not going to get a /simple/ piece of
code that understands that.

However, if you pass the text of a single message to Email::MIME, the
object will then give you a "header_pairs" method, which will give you a
great deal of what you need. And there's a "body" method which will give
you the body, surprisingly.
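
A minimal sketch of those two calls, assuming Email::MIME is installed from CPAN. The sample message and the helper name headers_and_body are made up for illustration; a real message would be read from a file or from stdin.

```perl
use strict;
use warnings;
use Email::MIME;

# Hypothetical sample message, built line by line for readability.
my $text = join "\n",
    'From: alice@example.com',
    'To: bob@example.com',
    'Subject: Test',
    '',
    'Hello, Bob.',
    '';

# Hypothetical helper: parse the message text and return a reference to
# the flat (name, value, name, value, ...) header list plus the body.
sub headers_and_body {
    my ($message_text) = @_;
    my $email = Email::MIME->new($message_text);
    my @pairs = $email->header_pairs;
    return (\@pairs, $email->body);
}

my ($pairs, $body) = headers_and_body($text);

# Walk the flat list two entries at a time.
for (my $i = 0; $i < @$pairs; $i += 2) {
    print "$pairs->[$i]: $pairs->[$i + 1]\n";
}
print "\n$body";
```

For a multipart message, "body" on the top-level object is rarely what you want; the interesting content is in the subparts.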

If you want to send me a mail (address is valid) I can let you have great
wodges of code that does this stuff; maybe reading through it and taking
out the bits you don't need might help you. It's object-oriented so you
might even be able to use the packages.

--
Henry Law n e w s @ l a w s h o u s e . o r g
Manchester, England

Andreas Karrer

Jan 11, 2022, 7:18:40 PM
* Bernie Cosell <ber...@fantasyfarm.com>:
> I need to parse an email message and pull its various parts apart. Is
> there some not-so-difficult way to do it? Courriel looks like it would be

There is no really simple way because mail headers and MIME are not
simple. A MIME message may be an arbitrarily complex tree of parts,
parts may be items of a whole lot of media types such as text, html,
images, videos, pdf etc. Then there is the further complexity of
"multipart/alternative", where you will have to decide by some
heuristic which of the alternatives you want to extract or display.

I'd recommend Email::MIME, maybe that qualifies as "not-so-difficult".

"arbitrarily complex tree" is a hint that a recursive approach should
be used.

This skeleton passes the mail message in $message to Email::MIME for
parsing. The "showparts" function then displays a summary of each direct
subpart and calls itself recursively for that subpart. It uses
Email::MIME::ContentType to parse the "Content-Type" headers, which may
be quite complex, too.

use Email::MIME;
use Email::MIME::ContentType;

# $message holds the complete text of the mail message
my $email = Email::MIME->new($message);

sub showparts;    # forward declaration, so the calls below need no parentheses
sub showparts {
    my $item   = shift;
    my $indent = shift;
    my $i = 1;
    for my $part ($item->subparts) {
        my $ct  = parse_content_type($part->content_type);
        my $len = length $part->body;
        print "part$indent $i: $ct->{type}/$ct->{subtype}, $len bytes\n";
        showparts $part, "$indent $i";    # recurse into nested parts
        $i++;
    }
}
showparts $email, "";

If you are, for example, just interested in all PDF attachments, it
might be enough to filter out the parts with a Content-Type of
application/pdf or application/x-pdf.
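
A rough sketch of such a filter, using the same recursive walk over "subparts". The helper name pdf_parts and the tiny sample message are made up for illustration, and real mail may label PDFs with other media types (application/octet-stream, for instance), so treat this as a starting point only.

```perl
use strict;
use warnings;
use Email::MIME;
use Email::MIME::ContentType;

# Hypothetical helper: recursively collect every part whose media type
# is application/pdf or application/x-pdf.
sub pdf_parts {
    my ($item) = @_;
    my @found;
    for my $part ($item->subparts) {
        my $ct   = parse_content_type($part->content_type);
        my $type = lc "$ct->{type}/$ct->{subtype}";
        push @found, $part
            if $type eq 'application/pdf' or $type eq 'application/x-pdf';
        push @found, pdf_parts($part);    # descend into nested multiparts
    }
    return @found;
}

# Hypothetical two-part message: a text part and a base64-encoded PDF stub.
my $message = <<'END';
From: alice@example.com
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain

Hi
--XYZ
Content-Type: application/pdf; name="doc.pdf"
Content-Transfer-Encoding: base64

JVBERi0xLjQK
--XYZ--
END

my @pdfs = pdf_parts(Email::MIME->new($message));
for my $pdf (@pdfs) {
    # body() returns the content with the transfer encoding removed
    printf "found a PDF part, %d bytes\n", length $pdf->body;
}
```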



- Andi

Bernie Cosell

Jan 19, 2022, 1:40:00 PM
Bernie Cosell <ber...@fantasyfarm.com> wrote:

} I need to parse an email message and pull its various parts apart. Is
} there some not-so-difficult way to do it?

Wow -- thanks for all the info. I knew MIME messages were messy but I
didn't really realize just *how* messy. I think I'll need to pin down
exactly what I want from the message and then focus on
finding/extracting just that.

Bernie Cosell

Jan 26, 2022, 12:18:34 PM
Bernie Cosell <ber...@fantasyfarm.com> wrote:

} I need to parse an email message and pull its various parts apart. Is
} there some not-so-difficult way to do it?

I'm still struggling with this and I can't figure out what I'm doing wrong.
I've been trying to start simple and ease my way into the morass [and thanks
for all the sample code and advice... alas, I'm still kinda lost]. I tried a
very very simple program:
-------------------------------------------------------
#!/usr/bin/perl
use v5.10 ;
use strict;
use warnings ;
use Email::Simple ;
use Email::MIME ;
use Email::MIME::ContentType ;
use Email::Simple::Header ;

foreach my $msg (@ARGV)
{ checkmsg($msg) ; }
exit ;

sub checkmsg
{   my $email = Email::Simple->new($_[0]) ;
    my @header_names = $email->header_names ;
    say scalar(@header_names) ;
    foreach my $header (@header_names)
    {   say "$header" ; }
    exit ;
}
---------------------------------------------------------

I tried it with a simple message [headers in part]
---------------------
[...]
Content-Type: multipart/alternative;
        boundary="Apple-Mail=_AB70B143-E35C-42EB-86E0-84730EB5E4A7"
Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.13\))
Date: Sun, 9 Jan 2022 13:27:13 -0500
Subject: Getting involved on state level
Message-Id: <840F9FC5-346F-4D62...@swva.net>
X-Mailer: Apple Mail (2.3654.120.0.1.13)
X-PMFLAGS: 570966400 0 65537 PT49NPRZ.CNM
[...]

--Apple-Mail=_AB70B143-E35C-42EB-86E0-84730EB5E4A7
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
        charset=utf-8
--------------------------------------

I don't care about sorting out the MIME section, I just want to see if I
can get the headers parsed, but when I try it:

D:\Desktop\>showparts Mailbox\multipart
0

What am I doing wrong? THANKS!! /bernie\

Bernie Cosell

Jan 26, 2022, 12:45:22 PM
Bernie Cosell <ber...@fantasyfarm.com> wrote:

} Bernie Cosell <ber...@fantasyfarm.com> wrote:
}
} } I need to parse an email message and pull its various parts apart. Is
} } there some not-so-difficult way to do it?
}
} I'm still struggling with this ...

Please ignore. When I looked again I realized the idiot mistake I had
made. DUH. It wants the *text* of the message, not a stupid file-name.
When I did the open()... $msg=<..> it all magically worked. What a dolt I
am... Sorry to bother y'all
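
For anyone who lands on the same problem: Email::Simple->new expects the message text, so a file name handed to it gets parsed as if it were the message itself, which is presumably why no headers turned up. A minimal sketch of the fix described here, assuming Email::Simple is installed; parse_message_file is a hypothetical helper name, not part of the module.

```perl
use strict;
use warnings;
use Email::Simple;

# Hypothetical helper: slurp a message file and hand its *text* to
# Email::Simple, rather than the file name.
sub parse_message_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $text = do { local $/; <$fh> };    # read the whole file at once
    close $fh;
    return Email::Simple->new($text);
}

# Print the header names of every message file named on the command line.
for my $path (@ARGV) {
    my $email = parse_message_file($path);
    my @names = $email->header_names;
    print scalar(@names), " header names in $path\n";
    print "  $_\n" for @names;
}
```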

/Bernie\