Bug#984581: pst-utils: Fails to extract email addresses for emails having ARC headers from PST file

sai kalyan

Mar 5, 2021, 7:10:03 AM

Package: pst-utils
Version: 0.6.71-0.1
Severity: important
Tags: patch

Hi,

We have been using the tool to extract emails from PST files. Recently we
observed that, for some mails whose transport headers contain ARC headers, the
email addresses are not extracted from the PST and only usernames are
available in the MIME content of the extracted emails.
After enabling debug logs we found that all the internet headers are being
ignored as bogus headers, including the To:, From: ... headers in which the
email addresses are available.

As the tool is open source, we debugged it and identified that the headers
are ignored (as bogus headers) and that the tool instead uses the extracted
metadata, in which the email addresses are missing, to construct the MIME
content of the email.

We would like to point at two places where the issue could be happening:
1) Parsing the mail from the PST, as the structure variable does not contain
the addresses for these emails.
2) Ignoring the headers as bogus headers because of an incorrect comparison.


We were not able to look into the parsing part, but we made some changes to
verify the behavior of the bogus-header identification; they are probably not
the appropriate changes.

Sample Data:
Below is the sample MIME Content that is extracted for an email from
PST by readpst utility

From: user_1
To: user_2
CC: user_3

where user_1, user_2 and user_3 are just usernames without email addresses

We would like to hear back as soon as possible.

Thank you
Sai Kalyan



-- System Information:
Debian Release: 10.5
APT prefers stable-updates
APT policy: (500, 'stable-updates'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.19.0-8-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8),
LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages pst-utils depends on:
ii libc6 2.28-10
ii libgcc1 1:8.3.0-6
ii libgd3 2.2.5-5.2
ii libglib2.0-0 2.58.3-2+deb10u2
ii libgsf-1-114 1.14.45-1
ii libpst4 0.6.71-0.1
ii libstdc++6 8.3.0-6

pst-utils recommends no packages.

pst-utils suggests no packages.

Paul Wise

Mar 5, 2021, 9:50:02 PM

Control: tags -1 + moreinfo

On Fri, 2021-03-05 at 23:06 +0530, sai kalyan wrote:

> Version: 0.6.71-0.1

Could you test version 0.6.75-1 from Debian bullseye?

> Tags: patch

Could you attach your patch to the bug report?

> for some mails where the transport headers contain ARC headers

Could you provide some information about what ARC headers are?

> the email addresses are not extracted from the PST and only usernames
> are available in the MIME content of emails that are extracted.

Please supply an example PST file that this problem occurs with.

--
bye,
pabs

https://wiki.debian.org/PaulWise

Paul Wise

Mar 7, 2021, 8:40:03 PM

Control: found -1 0.6.75-1

On Sun, 2021-03-07 at 17:42 +0000, Surla, Sai Kalyan wrote:

> Already tried with version 0.6.75-1.

Thanks, marking the bug as found in that version.

> Also compiled the latest code available and tried with it, still the
> same results.

Thanks for testing this too.

> Please find the changes in the attached file. (readpst.c line no. : 1238)

It is traditional to provide changes in the patch format by using the
`diff -u` command or the corresponding commands from the version
control system that the upstream project is using.

Below is the output from the Mercurial diff for your change.

$ hg diff
diff -r 7200790e46ac src/readpst.c
--- a/src/readpst.c Tue Jun 16 17:18:28 2020 -0700
+++ b/src/readpst.c Mon Mar 08 09:20:50 2021 +0800
@@ -1235,7 +1235,7 @@

 int header_match(char *header, char*field) {
     int n = strlen(field);
-    if (strncasecmp(header, field, n) == 0) return 1; // tag:{space}
+    if (strstr(header,field) != NULL || strncasecmp(header, field, n) == 0) return 1; // tag:{space}
     if ((field[n-1] == ' ') && (strncasecmp(header, field, n-1) == 0)) {
         char *crlftab = "\r\n\t";
         DEBUG_INFO(("Possible wrapped header = %s\n", header));


I am fairly certain that this is not the correct fix for this issue.
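
One way to see why the strstr change is too permissive is to compare the two
tests side by side. This is only a minimal sketch: prefix_match and
patched_match are hypothetical names standing in for the first line of
header_match before and after the change, and the sample strings are made up.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Prefix-only test, mirroring the unpatched first line of header_match.
 * Hypothetical helper, not part of readpst. */
static int prefix_match(const char *header, const char *field) {
    assert(header != NULL && field != NULL);
    return strncasecmp(header, field, strlen(field)) == 0;
}

/* Test from the change above: the field may appear anywhere in the
 * string (case-sensitively), or at the start (case-insensitively). */
static int patched_match(const char *header, const char *field) {
    assert(header != NULL && field != NULL);
    return strstr(header, field) != NULL || prefix_match(header, field);
}
```

With these helpers, patched_match("X-GM-THRID: 1\r\nTo: a@example.org\r\n",
"To: ") returns 1, which is why the change appears to help. However,
patched_match("Hi,\r\nplease reply To: bob\r\n", "To: ") also returns 1, so a
fragment of message body would be accepted as headers, which is what the
check is meant to prevent.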

> ARC headers are kind of email authentication headers.

Thanks for the info.

> For some security reasons we cannot share the original

Understood.

> if possible we will try to share the inhouse sample pst.

That would be necessary to be able to fix the issue.

> Meanwhile our observation is if the headers start with the following
> headers (...) it is treated as bogus, this email is starting with
> some header which is not one of the listed.

That does look like what the code does indeed, probably the right fix
is to scan through all of the headers instead of just the first one.
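
Scanning every header line rather than only the first could look roughly like
the sketch below. It is an illustration only: any_line_matches and the known[]
list are hypothetical and are not taken from readpst.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Return 1 if any line of the header block starts with one of the known
 * field names, 0 otherwise. known[] is a NULL-terminated list. */
static int any_line_matches(const char *headers, const char *const known[]) {
    assert(headers != NULL && known != NULL);
    const char *line = headers;
    while (*line != '\0') {
        for (size_t i = 0; known[i] != NULL; i++) {
            if (strncasecmp(line, known[i], strlen(known[i])) == 0) return 1;
        }
        const char *eol = strchr(line, '\n');   /* move to the next line */
        if (eol == NULL) break;
        line = eol + 1;
    }
    return 0;
}
```

A block whose first line is X-GM-THRID but which has a To: line further down
would then be accepted. The trade-off is that a body fragment containing a
line that happens to start with a known field name would be accepted too.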

Paul Wise

Mar 14, 2021, 11:30:04 PM

On Wed, 2021-03-10 at 09:28 +0000, Surla, Sai Kalyan wrote:

> Hope you got a chance to look at the issue that we reported.

I am looking at the issue today.

I managed to reproduce the issue that you have reported using the
sample PST file that you have provided.

I acknowledge that I am seeing both the issues you reported:

 * only a limited set of headers are being extracted
 * email address is missing from the To header
    - but the From header is correct

The readpst -d option to output debug information was instrumental in
reproducing this; it writes all the info in the PST file and the entire
sequence of decoding steps to a debug file.

I modified the valid_headers function to also accept the ARC-Seal
header but that does not fix the problem. Looking at the debug output I
noticed that the X-GM-THRID header is the first header. I then added
X-GM-THRID to the valid_headers function and that fixed the problem. I
think that messages with a different first header will not work though,
you would have to add all of the first headers that could exist to the
valid_headers function, which seems like an incorrect thing to do.

If you have any sample PST files that *do* work with the current code,
that would allow me to compare the working PST with the broken PST,
which would be very helpful in tracking down where the problem is.

Until I can figure out the correct fix, I suggest you work around this
bug by adding "return 1;" without quotes as the first line in the
valid_headers function. This way you can keep readpst working for your
customers while the correct fix is found. I believe that the modern PST
files that you have available are all valid files, while the
valid_headers function aims to detect broken files, so there should be
no risk to the conversion process for your case.

Paul Wise

Mar 15, 2021, 12:30:03 AM

I did some further investigation of the PST file you sent.

I conclude that there are two problems you are experiencing:

The first one is that readpst doesn't consider the headers as valid
even though they clearly are valid. Since the header validity detection
was added to detect invalid PST files I am going to have to discuss
this with the upstream author. Perhaps the header validity detection
will have to become more generic or perhaps it will be discarded or
perhaps the invalid PST files will be detected in a different way.
Fixing this will bring back all the headers, including ARC & To.

The second one is that for your particular PST file, the To field does
not contain an email address. Looking at the debug output I see that
the "Display Sent-To Address" contains only the name, not the email.
This appears to be a problem with the PST file itself, as the 0x0E04
type, which is PR_DISPLAY_TO, aka the "Address Sent-To", does not
contain the email address. The email address does appear in the
"Contact Address" and "Search Key" though. I am not sure if it is
correct to merge the contact address into the to address though.

If you have any more samples of working or broken PST files, I would be
happy to have a copy of them to debug further.

Paul Wise

Mar 15, 2021, 12:50:03 AM

Hi Carl,

A Debian user reported a bug in libpst's readpst tool:

https://bugs.debian.org/984581

The bug report contains two issues experienced with a sample PST file:

The first is that normal MIME headers were not extracted, because the
headers were not considered valid, because the first header was the
X-GM-THRID header rather than one of the limited list of headers
considered valid by readpst. Clearly the header validity function is
not going to keep up with changes in the list of headers that are
commonly in PST files. My suggestion is to replace the current header
validity function with one that just checks if either the first header,
or the entire header block complies with the email header RFC.
Alternatively the header validity function could be removed, or made
optional (either disabled or enabled by default).
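
A generic check along these lines might look like the following sketch. It
assumes the RFC 5322 rule that a field name is one or more printable US-ASCII
characters other than the colon, followed by a colon, and that folded
continuation lines start with a space or tab. The function names are
hypothetical and nothing here is taken from readpst.

```c
#include <assert.h>
#include <stddef.h>

/* ftext per RFC 5322: printable US-ASCII (33..126) except ':'. */
static int is_ftext(unsigned char c) {
    return c >= 33 && c <= 126 && c != ':';
}

/* Return 1 if [line, end) starts with "field-name:". */
static int line_is_field(const char *line, const char *end) {
    const char *p = line;
    while (p < end && is_ftext((unsigned char)*p)) p++;
    return p > line && p < end && *p == ':';
}

/* Return 1 if every line up to the first empty line is either a header
 * field or a folded continuation of one, and at least one field exists. */
static int headers_look_valid(const char *headers) {
    assert(headers != NULL);
    const char *line = headers;
    int seen_field = 0;
    while (*line != '\0') {
        const char *end = line;
        while (*end != '\0' && *end != '\n') end++;
        const char *text_end = end;
        if (text_end > line && text_end[-1] == '\r') text_end--;  /* drop CR */
        if (text_end == line) break;            /* empty line ends the block */
        if (*line == ' ' || *line == '\t') {
            if (!seen_field) return 0;          /* nothing to continue */
        } else if (line_is_field(line, text_end)) {
            seen_field = 1;
        } else {
            return 0;                           /* not header-shaped */
        }
        line = (*end == '\n') ? end + 1 : end;
    }
    return seen_field;
}
```

This accepts a block starting with X-GM-THRID without needing a list of known
header names, and rejects text whose first line has no "name:" shape. It does
not validate field bodies, so it is a plausibility test rather than a full
RFC parser.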

The second is that when the headers are considered invalid, the To
field doesn't get an email address, only a name. This is because in the
PST file the 0x0E04 aka PR_DISPLAY_TO aka "Address Sent-To" field does
not contain the email address, only a name. The email address does
appear in some of the other PST fields (contact and search key) though,
but I am not sure if all PST files have this problem, and I am not sure
if other PST files have the email address in other fields and I am not
sure if it is right to copy those fields to the To field.

I welcome any help you can give on these two topics.

Paul Wise

Mar 18, 2021, 8:50:03 PM

On Thu, 2021-03-18 at 17:14 +0000, Surla, Sai Kalyan wrote:

> Please find a PST file

As far as I can tell from the `readpst -d debug.log` output, this new
PST file does not have any MIME headers in it, so it is expected that
fixing the valid_headers function will do nothing. I expect if you look
at the PST file in Outlook you will see there are no MIME headers.

I noticed something in common between the original PST file and the new
PST file you have sent, they both have an unknown MAPI type 0x39fe that
contains the email addresses of the recipients. So I will try to find
out in the PST file specifications what this MAPI type is for and then
add some code to libpst and readpst to decode it.

$ rm -f * ; /usr/bin/readpst -d debug.log ~/stash/samples/pst/bugs.debian.org/984581/forpst.pst ; echo ; grep -A5 'mapi-id: 0x39fe' debug.log
Opening PST file and indexes...
Processing Folder "Deleted Items"
Processing Folder "for pst"
"Outlook Data File" - 2 items done, 0 items skipped.
"for pst" - 1 items done, 0 items skipped.

2356166 pst_process libpst.c(2194) #10 - mapi-id: 0x39fe type: 0x1f length: 0x13
2356166 pst_process libpst.c(3172) Unknown type 0x39fe Unicode String Data [size = 0x13]
2356166 pst_process libpst.c(3174)
2356166 000000 :64 65 65 70 74 69 73 6b 40 67 6d 61 69 6c 2e 63 :deeptisk@gmail.c
2356166 000010 :6f 6d 00 :om.

$ rm -f * ; /usr/bin/readpst -d debug.log ~/stash/samples/pst/bugs.debian.org/984581/u3si.pst ; echo ; grep -A5 'mapi-id: 0x39fe' debug.log
Opening PST file and indexes...
Processing Folder "Deleted Items"
Processing Folder "Sent Items"
"Outlook Data File" - 2 items done, 0 items skipped.
"Sent Items" - 1 items done, 0 items skipped.

2356205 pst_process libpst.c(2194) #13 - mapi-id: 0x39fe type: 0x1f length: 0x16
2356205 pst_process libpst.c(3172) Unknown type 0x39fe Unicode String Data [size = 0x16]
2356205 pst_process libpst.c(3174)
2356205 000000 :4d 79 55 73 65 72 31 40 65 78 63 68 31 33 66 61 :MyUser1@exch13fa
2356205 000010 :73 2e 6c 6f 63 00 :s.loc.

--
2356205 pst_process libpst.c(2194) #13 - mapi-id: 0x39fe type: 0x1f length: 0x1c
2356205 pst_process libpst.c(3172) Unknown type 0x39fe Unicode String Data [size = 0x1c]
2356205 pst_process libpst.c(3174)
2356205 000000 :41 64 6d 69 6e 69 73 74 72 61 74 6f 72 40 65 78 :Administrator@ex
2356205 000010 :63 68 31 33 66 61 73 2e 6c 6f 63 00 :ch13fas.loc.

Paul Wise

Mar 18, 2021, 9:20:03 PM

On Fri, 2021-03-19 at 08:30 +0800, Paul Wise wrote:

> I noticed something in common between the original PST file and the
> new PST file you have sent, they both have an unknown MAPI type
> 0x39fe that contains the email addresses of the recipients. So I will
> try to find out in the PST file specifications what this MAPI type is
> for and then add some code to libpst and readpst to decode it.

The specs indicate that 0x39fe is indeed the recipient address:

https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-pst/141923d5-15ab-4ef1-a524-6dce75aae546
https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-pst/5ee9a00a-858b-47db-95b3-f91518640ea7
https://docs.microsoft.com/en-us/office/client-developer/outlook/mapi/pidtagsmtpaddress-canonical-property

Paul Wise

Mar 18, 2021, 11:10:03 PM

On Fri, 2021-03-19 at 09:03 +0800, Paul Wise wrote:

> The specs indicate that 0x39fe is indeed the recipient address:

The issue in libpst when there are no MIME headers in the PST file is:

There are some MAPI properties for To/CC/BCC:

https://docs.microsoft.com/en-us/office/client-developer/outlook/mapi/pidtagdisplayto-canonical-property
https://docs.microsoft.com/en-us/office/client-developer/outlook/mapi/pidtagdisplaycc-canonical-property
https://docs.microsoft.com/en-us/office/client-developer/outlook/mapi/pidtagdisplaybcc-canonical-property

These contain *only* the names and not the addresses.

Outlook fills them automatically from the list of recipients.

Outlook stores the recipients in a table separate from the email properties.

libpst stores them in the sentto/cc/bcc fields of the email structure.

libpst has no storage of the recipients table of the PST file.

libpst processes the MAPI types one-by-one rather than in separate
tables and only has one action per MAPI type.

So this is not going to be easy to fix.

I will discuss this with upstream.

Paul Wise

Mar 18, 2021, 11:50:03 PM

On Mon, 2021-03-15 at 12:38 +0800, Paul Wise wrote:

> The second is that when the headers are considered invalid, the To
> field doesn't get an email address, only a name.

I figured out the problem here, the PR_DISPLAY_BCC, PR_DISPLAY_CC and
PR_DISPLAY_TO fields are basically bogus and contain *only* the names
and not the addresses and are filled out automatically by Outlook based
on the recipients of the message, which are stored in a separate MAPI
table to the email properties. libpst extracts the email properties,
but doesn't store the recipients table anywhere as far as I can tell.
I propose the following set of fixes for this issue:

Add a pst_item_recipient struct with the set of properties described by
Microsoft at the URL below and a pointer to the next recipient struct.

https://docs.microsoft.com/en-us/office/client-developer/outlook/mapi/recipient-tables

Add a recipients linked list element to the pst_item struct with a
pointer to the first pst_item_recipient struct.

Add code to the pst_process function to populate the recipients linked
list in a similar way to how the attachments are done.
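
The proposed data structure could be sketched roughly as follows. Only the
struct name pst_item_recipient comes from the proposal above; the field names,
the helper functions and the example addresses are hypothetical, and the code
that would fill the list from the PST recipients table is not shown. The type
values 1, 2 and 3 are the MAPI_TO, MAPI_CC and MAPI_BCC constants.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical recipient record; field names are illustrative only. */
typedef struct pst_item_recipient {
    char *display_name;                /* name only, as in PR_DISPLAY_TO */
    char *smtp_address;                /* address, as seen in 0x39fe */
    int   recipient_type;              /* 1 = To, 2 = CC, 3 = BCC */
    struct pst_item_recipient *next;   /* next recipient, or NULL */
} pst_item_recipient;

static char *dup_string(const char *s) {
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL) memcpy(copy, s, n);
    return copy;
}

/* Append a copy of (name, addr, type) to the list; 0 on success, -1 on
 * allocation failure. */
static int recipient_append(pst_item_recipient **head, const char *name,
                            const char *addr, int type) {
    assert(head != NULL && name != NULL && addr != NULL);
    pst_item_recipient *r = calloc(1, sizeof *r);
    if (r == NULL) return -1;
    r->display_name = dup_string(name);
    r->smtp_address = dup_string(addr);
    r->recipient_type = type;
    if (r->display_name == NULL || r->smtp_address == NULL) {
        free(r->display_name); free(r->smtp_address); free(r);
        return -1;
    }
    while (*head != NULL) head = &(*head)->next;   /* walk to the tail */
    *head = r;
    return 0;
}

/* Write "Name <addr>, Name <addr>" for one recipient type into buf and
 * return how many recipients fit. */
static int format_recipients(const pst_item_recipient *head, int type,
                             char *buf, size_t size) {
    assert(buf != NULL && size > 0);
    size_t used = 0;
    int count = 0;
    buf[0] = '\0';
    for (const pst_item_recipient *r = head; r != NULL; r = r->next) {
        if (r->recipient_type != type) continue;
        int n = snprintf(buf + used, size - used, "%s%s <%s>",
                         count ? ", " : "", r->display_name, r->smtp_address);
        if (n < 0 || (size_t)n >= size - used) {
            buf[used] = '\0';          /* undo the truncated entry */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}

static void recipients_free(pst_item_recipient *head) {
    while (head != NULL) {
        pst_item_recipient *next = head->next;
        free(head->display_name); free(head->smtp_address); free(head);
        head = next;
    }
}
```

With such a list available, readpst could build the To, CC and BCC headers
from the recipient entries when the PST file has no MIME headers, instead of
relying on the name-only PR_DISPLAY_* fields.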

Does that seem correct to you?

Are you willing to work on this or should I?

Paul Wise

Mar 22, 2021, 3:40:04 AM

On Mon, 2021-03-22 at 05:41 +0000, Surla, Sai Kalyan wrote:

> In this case can we still go with the temporary change that you
> suggested as the issue is little different with this PST?

The temporary change will not work for the second PST, since it only
works around the header detection issue, but the second PST doesn't
have the full MIME headers, only the predefined PST To/CC/BCC fields.

There isn't any easy workaround for the issue with the second PST.

Paul Wise

Apr 5, 2021, 2:50:03 AM

On Mon, 2021-04-05 at 06:04 +0000, Surla, Sai Kalyan wrote:

> Is there any update on the issues.

I discussed the issues with upstream.

Upstream doesn't have time to work on the issues.

Upstream confirmed my suggested solutions sound OK.

I haven't yet had time to work on the solutions.

Paul Wise

May 29, 2021, 10:40:04 PM

On Mon, 5 Apr 2021 06:04:49 +0000 "Surla, Sai Kalyan" wrote:

> Is there any update on the issues.

I finally found time to work on the first issue (header detection)
where we had a workaround already and created proper patches (attached)
for the issue and sent them to the upstream maintainer.
0001-Add-debugging-for-header-detection.patch
0002-Also-detect-email-headers-wrapped-with-space-instead.patch
0003-Detect-reasonable-email-headers-too.patch

Paul Wise

Aug 17, 2021, 12:10:04 AM

Control: forwarded -1 https://bugzilla.redhat.com/show_bug.cgi?id=1994178

On Sun, 30 May 2021 10:26:21 +0800 Paul Wise wrote:
> On Mon, 5 Apr 2021 06:04:49 +0000 "Surla, Sai Kalyan" wrote:
>
> > Is there any update on the issues.
>
> I finally found time to work on the first issue (header detection)
> where we had a workaround already and created proper patches (attached)
> for the issue and sent them to the upstream maintainer.

I have forwarded the patches to the Fedora bug tracker, hopefully that
will mean that the upstream maintainer will accept them now.

I had to fix a bug with the first patch causing a segfault.

I will include the patches in the next upload to Debian unstable.