Possible reasons for "lost connection after DATA"

Sean Durkin

unread,

Sep 10, 2014, 3:56:48 AM9/10/14

to

Hello,

some of my users were complaining about losing incoming mail, namely Amazon shipping notifications, newsletters and such things that they were absolutely sure were sent out, but never reached their inbox. After doing some digging, increasing log verbosity and such, I found a lot of this:

[... snip ...]
Sep 10 00:06:37 mail postfix/smtpd[23095]: lost connection after DATA (17511 bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 00:06:48 mail postfix/smtpd[23111]: lost connection after DATA (22788 bytes) from mail18-92.srv2.de[193.169.181.92]
Sep 10 00:13:35 mail postfix/smtpd[23348]: lost connection after DATA (17441 bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 00:27:49 mail postfix/smtpd[23454]: lost connection after DATA (22788 bytes) from mail18-97.srv2.de[193.169.181.97]
Sep 10 00:31:03 mail postfix/smtpd[23103]: lost connection after DATA (49116 bytes) from quinqueunus.psi.cust-cluster.com[195.140.187.51]
Sep 10 00:48:46 mail postfix/smtpd[23890]: lost connection after DATA (22788 bytes) from mail18-98.srv2.de[193.169.181.98]
Sep 10 00:51:53 mail postfix/smtpd[23564]: lost connection after DATA (49116 bytes) from quattuorocto.psi.cust-cluster.com[195.140.187.48]
Sep 10 01:09:16 mail postfix/smtpd[24565]: lost connection after DATA (17511 bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 01:09:44 mail postfix/smtpd[24290]: lost connection after DATA (22788 bytes) from mail18-99.srv2.de[193.169.181.99]
Sep 10 01:16:14 mail postfix/smtpd[24674]: lost connection after DATA (17441 bytes) from smtp-out-127-107.amazon.com[176.32.127.107]
Sep 10 01:30:44 mail postfix/smtpd[24782]: lost connection after DATA (22788 bytes) from mail18-100.srv2.de[193.169.181.100]
Sep 10 01:51:42 mail postfix/smtpd[25198]: lost connection after DATA (22788 bytes) from mail18-105.srv2.de[193.169.181.105]
Sep 10 01:54:29 mail postfix/smtpd[24966]: lost connection after DATA (49116 bytes) from quattuorocto.psi.cust-cluster.com[195.140.187.48]
Sep 10 02:12:21 mail postfix/smtpd[25784]: lost connection after DATA (17511 bytes) from smtp-out-127-107.amazon.com[176.32.127.107]
Sep 10 02:12:42 mail postfix/smtpd[25656]: lost connection after DATA (22788 bytes) from mail18-106.srv2.de[193.169.181.106]
Sep 10 02:20:11 mail postfix/smtpd[25892]: lost connection after DATA (17441 bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 02:33:41 mail postfix/smtpd[26077]: lost connection after DATA (22788 bytes) from mail18-107.srv2.de[193.169.181.107]
Sep 10 02:54:22 mail postfix/smtpd[26178]: lost connection after DATA (49116 bytes) from quinquenulla.psi.cust-cluster.com[195.140.187.50]
Sep 10 02:54:40 mail postfix/smtpd[26490]: lost connection after DATA (22788 bytes) from mail18-108.srv2.de[193.169.181.108]
Sep 10 03:14:16 mail postfix/smtpd[26585]: lost connection after DATA (49116 bytes) from quinquenulla.psi.cust-cluster.com[195.140.187.50]
Sep 10 03:15:39 mail postfix/smtpd[26905]: lost connection after DATA (22788 bytes) from mail18-113.srv2.de[193.169.181.113]
Sep 10 03:15:52 mail postfix/smtpd[27091]: lost connection after DATA (17511 bytes) from smtp-out-127-106.amazon.com[176.32.127.106]
Sep 10 03:23:15 mail postfix/smtpd[27214]: lost connection after DATA (17441 bytes) from smtp-out-127-107.amazon.com[176.32.127.107]
[... snip ...]

So to me that looks as if either the external SMTP server closes its connection before it is done with the entire message (the transferred size does not match the size passed through "MAIL FROM: SIZE=XYZ"), or the connection times out. I can see in the logs and looking at the queue directories that these messages are put in the incoming queue by cleanup, are then found but skipped by qmgr (probably since they are not finished); they lurk in the incoming queue for awhile and disappear about the time the "lost connection" message is put in the logs. So this points to a timeout.

The weird thing is that the data sizes are always the same for the same message-id being delivered, even if it is delivered via different servers from a cluster. If it were a timeout, network problem or such, I'd expect a more or less random value for the received data size, not always exactly the same.

This seems to be a "new" problem (meaning I just recently got aware of it; I don't know when it started, but I do know everything was working fine for years before). This does not seem to be a problem of the sepcific hosts above; in between, I've been getting messages from the same hosts successfully. It seems that some messages go through on the first try, other messages from the same hosts are always losing connection, they never are delivered completely, ultimately resulting in the external host giving up; the message is then lost, and the user never knows about it.

The first question is:
Can I rule out it's my fault? I don't have traffic shaping or ICMP blocking running on that host, which maybe could cause something like that. Doing a traceroute to the hosts above does not show anything out of the ordinary, so it doesn't look like a routing problem, either. Connection times are OK as well, connections should not be so slow they time out.

The second question is:
Is there anything I can do on my end? Since it looks as if I simply do not get a complete delivery, the sender just stops sending after a certain number of bytes, I don't see what I could do...

Regards,
Sean

Robert Schetterer

unread,

Sep 10, 2014, 4:11:41 AM9/10/14

to

Am 10.09.2014 um 09:56 schrieb Sean Durkin:
> The first question is:
> Can I rule out it's my fault?

have you changed anything last days/month upgrades/updates software
hardware ?
please send you postfix config , search list archive "lost connection
after DATA"

Best Regards
MfG Robert Schetterer

--
[*] sys4 AG

http://sys4.de, +49 (89) 30 90 46 64
Franziskanerstraße 15, 81669 München

Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
Vorstand: Patrick Ben Koetter, Marc Schiffbauer
Aufsichtsratsvorsitzender: Florian Kirstein

Viktor Dukhovni

unread,

Sep 10, 2014, 10:19:12 AM9/10/14

to

On Wed, Sep 10, 2014 at 09:56:48AM +0200, Sean Durkin wrote:

> Some of my users were complaining about losing incoming mail,

> namely Amazon shipping notifications, newsletters and such things
> that they were absolutely sure were sent out, but never reached
> their inbox. After doing some digging, increasing log verbosity
> and such, I found a lot of this:

Have you tried disabling TCP window scaling? It might be confusing
some middle-box (firewall, NAT device, ...) on path between the
remote systems and your MTA.

> [... snip ...]
> Sep 10 00:06:37 mail postfix/smtpd[23095]: lost connection after DATA (17511 bytes) from smtp-out-127-108.amazon.com[176.32.127.108]

Post the hostname/IP address of the receving system.

Capture and examine a tcpdump recording of mail from one of the
problem senders. Any sign of retransmission by the sender?

For at least one such session, post all related messages from the
"postfix/smtpd[pid]" that occur between "connect from" and
"disconnect from".

--
Viktor.

Message has been deleted

Sean Durkin

unread,

Sep 10, 2014, 3:19:58 PM9/10/14

to

Hi Viktor,

Am 10.09.2014 um 16:19 schrieb Viktor Dukhovni:
> Have you tried disabling TCP window scaling? It might be confusing
> some middle-box (firewall, NAT device, ...) on path between the
> remote systems and your MTA.

I would not have thought of that... I've tried that now, but it does not seem to help.

> Post the hostname/IP address of the receving system.

mail.tuxroot.de

> Capture and examine a tcpdump recording of mail from one of the
> problem senders. Any sign of retransmission by the sender?

I'm trying to get a good dump and will post results once I get one.
Not that easy since the external hosts keep changing all the time. All mail affected comes from mass mailers that use server clusters, so I keep getting those messages from lots of different remote hosts. I'm waiting for it to happen from one of the hosts I've seen before.

Retransmission is tried numerous times, but for every retransmission the lost connection message is the same (identical number of bytes), as far as I can tell. That's one thing that puzzles me... So e.g. a message is delivered twice and each time the connection is lost after exactly 17441 bytes, even if it's different remote hosts trying, that's kind of odd.

> For at least one such session, post all related messages from the
> "postfix/smtpd[pid]" that occur between "connect from" and
> "disconnect from".

Here's one: http://pastebin.com/twb3Z8Eg
And this seems to be the same message being redelivered later, from a different host, with the same result (connection lost after exactly 17441 bytes):
http://pastebin.com/Qihbjz3w

What I do notice there is that in fact the connection seems to be *very* slow. In the above example, the whole process takes several minutes. I don't have any throughput or network speed issues with other hosts, though. I've tried sending mail from Gmail, Yahoo, my workplace, my former university, GMX, whatever; everything goes through on the first attempt each and every time, and quickly. But it seems it is always slow for a few hosts.

Regards,
Sean

Wietse Venema

unread,

Sep 10, 2014, 3:52:14 PM9/10/14

to

Sean Durkin:
[ Charset windows-1252 converted... ]

> Hi Viktor,
>
> Am 10.09.2014 um 16:19 schrieb Viktor Dukhovni:
> > Have you tried disabling TCP window scaling? It might be confusing
> > some middle-box (firewall, NAT device, ...) on path between the
> > remote systems and your MTA.
> I would not have thought of that... I've tried that now, but it does not seem to help.
>
> > Post the hostname/IP address of the receving system.
> mail.tuxroot.de
>
> > Capture and examine a tcpdump recording of mail from one of the
> > problem senders. Any sign of retransmission by the sender?
> I'm trying to get a good dump and will post results once I get one.
> Not that easy since the external hosts keep changing all the time. All mail affected comes from mass mailers that use server clusters, so I keep getting those messages from lots of different remote hosts. I'm waiting for it to happen from one of the hosts I've seen before.
>
> Retransmission is tried numerous times, but for every retransmission
> the lost connection message is the same (identical number of bytes),
> as far as I can tell. That's one thing that puzzles me... So e.g.
> a message is delivered twice and each time the connection is lost
> after exactly 17441 bytes, even if it's different remote hosts
> trying, that's kind of odd.

No, it means the same problem is happening. Same error,
same symptom.

> What I do notice there is that in fact the connection seems to be
> *very* slow. In the above example, the whole process takes several
> minutes. I don't have any throughput or network speed issues with
> other hosts, though. I've tried sending mail from Gmail, Yahoo,

Slow performance is typical for TCP window scaling problems. Have
you tried to turn it off in your kernel?

# sysctl -w net.ipv4.tcp_window_scaling=0

To make it permanent:

# echo 'net.ipv4.tcp_window_scaling = 0' >> /etc/sysctl.conf

Wietse

Viktor Dukhovni

unread,

Sep 10, 2014, 5:03:57 PM9/10/14

to

On Wed, Sep 10, 2014 at 09:19:58PM +0200, Sean Durkin wrote:

> > For at least one such session, post all related messages from the
> > "postfix/smtpd[pid]" that occur between "connect from" and
> > "disconnect from".
> Here's one: http://pastebin.com/twb3Z8Eg

This trace has an insane level of debugging turned on, to the point
that syslogd is overwhelmed and is losing messages. PLEASE DISABLE
ALL VERBOSE logging. NO "-v" options in master.cf, NO debug_peer_list,
...

Please make sure that the /dev/log syslog socket is a "dgram" not
a "stream" socket, and that mail logging is not synchronous.

Then if the problem persists, report just normal Postfix logging,
not the flood of noise from verbose logging.

--
Viktor.

Sean Durkin

unread,

Sep 11, 2014, 5:29:10 AM9/11/14

to

Hello Wietse,

Am 10.09.2014 um 21:52 schrieb Wietse Venema:

> Slow performance is typical for TCP window scaling problems. Have
> you tried to turn it off in your kernel?

Yes, Viktor suggested that also and I tried it. It does not make a difference, the problem persists.

Regards,
Sean

Wietse Venema

unread,

Sep 11, 2014, 7:49:25 AM9/11/14

to

Sean Durkin:

What is the distribution of DATA sizes before failure? In your
example I see numbers around 3kB, 9kB, 12kB.

Some failures are triggered by packet content, and may be replaced
only by replacing hardware that operates marginally. Does the problem
go away when you

- Replace the server (either the network card or the whole box)

- Replace the cable that connects the server to the network switch

- Replace the network switch that the server is plugged into.

- Replace the cable that connects the switch to the router

- Replace the router

- And so on...

If you think this is a stupid idea, then you haven't been around
long enough.

Wietse

Sean Durkin

unread,

Sep 11, 2014, 8:36:51 AM9/11/14

to

Hi Viktor,

Am 10.09.2014 um 23:03 schrieb Viktor Dukhovni:
> This trace has an insane level of debugging turned on, to the point
> that syslogd is overwhelmed and is losing messages. PLEASE DISABLE
> ALL VERBOSE logging. NO "-v" options in master.cf, NO debug_peer_list,
> ...

Yes, sorry, I cranked up the debug level, since normal logging looks like this:

Sep 11 09:43:31 mail postfix/smtpd[25170]: connect from mail18-21.srv2.de[193.169.181.21]
Sep 11 09:43:31 mail postfix/smtpd[25170]: 2C076C4026A: client=mail18-21.srv2.de[193.169.181.21]
Sep 11 09:46:33 mail postfix/smtpd[25170]: lost connection after DATA (33290 bytes) from mail18-21.srv2.de[193.169.181.21]
Sep 11 09:46:33 mail postfix/smtpd[25170]: disconnect from mail18-21.srv2.de[193.169.181.21]
...
Sep 11 10:10:59 mail postfix/smtpd[25537]: connect from quattuorocto.psi.cust-cluster.com[195.140.187.48]
Sep 11 10:10:59 mail postfix/smtpd[25537]: 8736FC40A7D: client=quattuorocto.psi.cust-cluster.com[195.140.187.48]
Sep 11 10:36:44 mail postfix/smtpd[25537]: lost connection after DATA (36809 bytes) from quattuorocto.psi.cust-cluster.com[195.140.187.48]
Sep 11 10:36:44 mail postfix/smtpd[25537]: disconnect from quattuorocto.psi.cust-cluster.com[195.140.187.48]
..
Sep 11 10:38:48 mail postfix/smtpd[25913]: connect from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 11 10:38:49 mail postfix/smtpd[25913]: 2558DC40458: client=smtp-out-127-108.amazon.com[176.32.127.108]
Sep 11 10:41:01 mail postfix/smtpd[25913]: lost connection after DATA (17511 bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 11 10:41:01 mail postfix/smtpd[25913]: disconnect from smtp-out-127-108.amazon.com[176.32.127.108]

I didn't think that info alone was particularly useful...

> Please make sure that the /dev/log syslog socket is a "dgram" not
> a "stream" socket, and that mail logging is not synchronous.

Logging is not synchronous, the socket is a datagram socket (it has all been set up that way all along).

No change, still the same problem, see above.

Meanwhile, I've managed to record a tcpdump of such a failed session. What exactly am I looking for there?
I don't see anything out of the ordinary, except increasing delays between received packets from the external host, until the host sends a "RST".
It seems I simply do not receive any packets. The ones I get are immediately ACK'd, but then there's seconds and later minutes until the next one even arrives, until finally the remote host gives up and terminates the connection.

I'll try to get more dumps for comparison, including some from hosts that have no problems delivering.

There's no packet filtering or rate limiting on port 25, at least not on my machine. The hosting provider might have something there, I'd have to ask them...

Regards,
Sean

Viktor Dukhovni

unread,

Sep 11, 2014, 9:01:37 AM9/11/14

to

On Thu, Sep 11, 2014 at 02:36:51PM +0200, Sean Durkin wrote:

> > PLEASE DISABLE
> > ALL VERBOSE logging. NO "-v" options in master.cf, NO debug_peer_list,
>

> Yes, sorry, I cranked up the debug level, since normal logging looks like this:
>
> Sep 11 09:43:31 mail postfix/smtpd[25170]: connect from mail18-21.srv2.de[193.169.181.21]
> Sep 11 09:43:31 mail postfix/smtpd[25170]: 2C076C4026A: client=mail18-21.srv2.de[193.169.181.21]
> Sep 11 09:46:33 mail postfix/smtpd[25170]: lost connection after DATA (33290 bytes) from mail18-21.srv2.de[193.169.181.21]
> Sep 11 09:46:33 mail postfix/smtpd[25170]: disconnect from mail18-21.srv2.de[193.169.181.21]

That's sufficient. It shows you're likely not using TLS here, and
the time beetween message start and connection loss. The number
of samples is rather small now. I would expect the session duration
for each sending host to be essentially constant over multiple
deliveries (equal to the remote machine's TCP timeout).

Possibilities include a broken network interface somewhere or a
bad cable that corrupts IP or TCP packet headers given specific
input patterns.

If the problem is with the message payload, you could try enabling
inbound TLS, perhaps these sending servers support it. Don't recall
whether you already have TLS. If the problem is not with the
payload, then TLS won't make any difference (some hosts will still
fail even after TLS).

> I didn't think that info alone was particularly useful...

It is sufficient, and the verbose logs just add noise.

> Meanwhile, I've managed to record a tcpdump of such a failed
> session. What exactly am I looking for there?

Retransmissions from the sender of data with the same sequence
number...

Post "tcpdump" output (without packet content is fine), containing
packets from just a single failed session.

> There's no packet filtering or rate limiting on port 25, at least
> not on my machine. The hosting provider might have something there,
> I'd have to ask them...

They probably have middle boxes, which might be the cause of the
problem.

--
Viktor.

Sean Durkin

unread,

Sep 11, 2014, 9:25:57 AM9/11/14

to

Hi Wietse,

Am 11.09.2014 um 13:49 schrieb Wietse Venema:
> What is the distribution of DATA sizes before failure? In your
> example I see numbers around 3kB, 9kB, 12kB.

At the moment, I see these sizes:

- always exactly 17511 bytes from smtp-out-127-*.amazon.com (today, seems to be only 3 different hosts trying)
- always exactly 49116 bytes from *.psi.cust-cluster.com (I've seen about 60 different hosts from there today)
- always exactly 33290 bytes from mail18-*.srv2.de (about a dozen different hosts)

It seems those are always the same 3 messages being re-tried constantly (when I look at them in the incoming queue folder, it's the same recipient and sender and the same message-ID, as far as I can tell). I have problems only with messages from these clusters, everything else seems unaffected (at least I haven't seen any "lost connection" messages from any other hosts as far as my logfiles go back).

Yesterday I had an additional message with exactly 17441 bytes on every try before failure from the Amazon-cluster. That one was finally delivered completely early this morning, and has since disappeared from the cycle.

FWIW, I have received a handful of messages from the Amazon-cluster that did not have any delays/problems yesterday and today, one of them even from one of the "problematic" hosts that can't deliver the other message.

> Some failures are triggered by packet content, and may be replaced
> only by replacing hardware that operates marginally. Does the problem
> go away when you
>
> - Replace the server (either the network card or the whole box)
>
> - Replace the cable that connects the server to the network switch
>
> - Replace the network switch that the server is plugged into.
>
> - Replace the cable that connects the switch to the router
>
> - Replace the router
>
> - And so on...
>
> If you think this is a stupid idea, then you haven't been around
> long enough.

By no means do I think that's stupid. :)
I'm only doing this server stuff "for fun" in my spare time, but my real job is in microelectronics and hardware, so I've had my share of mysterious and seemingly unexplainable stuff (ISI, crosstalk, low-frequency jitter, ground bounce, ESD-induced phenomena, you know the drill...).

Problem is that this box is a rented root server in a data center somwhere, so I don't have access to the hardware to try any of that. I can contact support, but they of course charge you for everything they do, and as long as I haven't ruled out that the reason is just some stupid configuration mistake on my part (or a routing/filtering issue at my hosting provider, or Amazon, or...), I don't want to start replacing hardware, obviously...

Regards,
Sean

Viktor Dukhovni

unread,

Sep 11, 2014, 10:04:00 AM9/11/14

to

On Thu, Sep 11, 2014 at 03:25:57PM +0200, Sean Durkin wrote:

> I can contact support, but they of course charge you for
> everything they do, and as long as I haven't ruled out that the
> reason is just some stupid configuration mistake on my part (or a
> routing/filtering issue at my hosting provider, or Amazon, or...),
> I don't want to start replacing hardware, obviously...

The Postfix configuration has no impact on the TCP layer, beyond
optionally specifying the TCP window size. Since it is the TCP
layer that fails, the problem is not related to the Postfix
configuration.

Your PCAP files should demonstrate repeated retransmission of data,
are the ACKs you're sending confirming receipt of packets that are
sent repeatedly? In that case your ACKs are getting lost? Is
there a sequence number gap in the data received from the server?
In that case the remote server's data is getting lost. Does the
capture confirm that window scaling is not in use? ...

--
Viktor.

Wietse Venema

unread,

Sep 11, 2014, 11:10:54 AM9/11/14

to

Sean Durkin:

> Hi Wietse,
>
> Am 11.09.2014 um 13:49 schrieb Wietse Venema:
> > What is the distribution of DATA sizes before failure? In your
> > example I see numbers around 3kB, 9kB, 12kB.
>
> At the moment, I see these sizes:
>
> - always exactly 17511 bytes from smtp-out-127-*.amazon.com (today, seems to be only 3 different hosts trying)
> - always exactly 49116 bytes from *.psi.cust-cluster.com (I've seen about 60 different hosts from there today)
> - always exactly 33290 bytes from mail18-*.srv2.de (about a dozen different hosts)
>
> It seems those are always the same 3 messages being re-tried
> constantly (when I look at them in the incoming queue folder, it's
> the same recipient and sender and the same message-ID, as far as
> I can tell). I have problems only with messages from these clusters,
> everything else seems unaffected (at least I haven't seen any "lost
> connection" messages from any other hosts as far as my logfiles
> go back).
>
> Yesterday I had an additional message with exactly 17441 bytes on
> every try before failure from the Amazon-cluster. That one was
> finally delivered completely early this morning, and has since
> disappeared from the cycle.

That increases my suspicion of a data-dependent error - some marginal
cable/switch/router, perhaps some middle box with a memory bit error
that requires a power cycle to clear the problem. If the problem is
caused by crosstalk defect, then only physical replacement will
solve it.

> Problem is that this box is a rented root server in a data center
> somwhere, so I don't have access to the hardware to try any of

> that. I can contact support, but they of course charge you for

> everything they do, and as long as I haven't ruled out that the
> reason is just some stupid configuration mistake on my part (or a
> routing/filtering issue at my hosting provider, or Amazon, or...),
> I don't want to start replacing hardware, obviously...

Try power cycling.

Wietse

Hannes Erven

unread,

Sep 11, 2014, 2:48:31 PM9/11/14

to

Hi Sean,

> Meanwhile, I've managed to record a tcpdump of such a failed session.
> What exactly am I looking for there?

I remember a possibly similar situation back in 2008... the culprit was
a not-fully-up-to-date Cisco ASA firewall that corrupted TCP SACK fields
and hence had the remote site send RSET.
Anyways on our end the connection seemed to starve, just as you describe
it.

We detected that by comparing tcpdumps from both affected ends. Of
course we had been lucky enough to have that happen with a business
partner with competent IT people who we got a hold of, spotted the
problem and also temporarily switched the feature off on their side to
prove that this actually is the problem.
A firmware upgrade on my client's firewall then fixed the issue.

With a server hosted somewhere and incoming connections from big
clusters, you might not be as lucky as that...

best regards,

-hannes

Wietse Venema

unread,

Sep 11, 2014, 3:03:19 PM9/11/14

to

Sean Durkin:

> Meanwhile, I've managed to record a tcpdump of such a failed
> session. What exactly am I looking for there?

- The receiving host's window announcement in the tcp handshake
and in subsequent ACKs.

- Whether there is a "gap" in the sender packet sequence numbers
as seen by the receiving host.

Such a gap means that a particular packet is being dropped.

Just to bore you with a few examples of bad middleboxes:

- Shortly after the first Postfix release there was a problem with
traffic corruption due to a buggy middlebox (a Packeteer traffic
shaper). The error had a very distinct signature.

- For many years, there were problems with CISCO PIX "firewalls"
that inspected SMTP traffic but failed to properly handle the
case that <CR><LF>.<CR><LF> happened to fall on a packet boundary.

- http://www.arschkrebs.de/postfix/postfix_cisco_pix_bugs.shtml
has other examples where CISCO PIX/ASA "firewalls" will mis-handle
SMTP traffic in various ways.

In your case, you may have to collaborate with someone who is willing
to send large amounts of random email; hopefully some messages will
trigger the bug, and then the sender and receiver can compare tcpdump
recordings.

Wietse

L. Mark Stone

unread,

Sep 11, 2014, 4:59:55 PM9/11/14

to

Any chance there is a UTM device in the email stream?

We see lots of these errors when our SonicWalls do an RBL lookup, don't like the data in the email stream etc. The SonicWalls then just drop the connection and Postfix logs the drop.

Hope that helps,

Mark

Sean Durkin

unread,

Sep 12, 2014, 4:27:03 AM9/12/14

to

Hi Viktor,

Am 11.09.2014 um 16:04 schrieb Viktor Dukhovni:
> Your PCAP files should demonstrate repeated retransmission of data,
> are the ACKs you're sending confirming receipt of packets that are
> sent repeatedly? In that case your ACKs are getting lost? Is
> there a sequence number gap in the data received from the server?
> In that case the remote server's data is getting lost. Does the
> capture confirm that window scaling is not in use? ...

I've now collected a big enough data sample (I wanted to get more data from different hosts and such).

What I am seeing is this:
- window scaling is turned off
- there's two cases:
1. I am simply not receiving any data anymore after a few packets, also no retransmissions (up to that point, everything seems normal, no sequence number gaps, no missing ACKs); after 20mins or so of complete silence, I receive a RST and the connection is killed. That is the less common case. That looks like the connection is suddenly completely dead and the remote hosts just keeps sending RST until one of them happens to go through after all.
2. The connection starts out OK (first few packets all OK), then I start receiving packets out of order or not at all. I then get a lot of retransmissions of packets that I didn't ACK because I never received them in the first place (that's the majority of cases). This goes on until the remote host is fed up and kills the connection.

So that looks like this is an "external" problem I can't fix by changing my config...

Regards,
Sean

Sean Durkin

unread,

Sep 12, 2014, 4:36:51 AM9/12/14

to

Hi Wietse,

Am 11.09.2014 um 17:10 schrieb Wietse Venema:
> That increases my suspicion of a data-dependent error - some marginal
> cable/switch/router, perhaps some middle box with a memory bit error
> that requires a power cycle to clear the problem. If the problem is
> caused by crosstalk defect, then only physical replacement will
> solve it.

If it is a hardware error on my server (NIC, cable, anything on the physical layer...), wouldn't I be seeing CRC errors or something?
Shouldn't I be seeing something i.e. in the ifconfig error counters or the like?

If it's somewhere else in the data center (router/switch), then there's really no way for me to know.
I could contact support, but that request would be much to vague to be any useful ("You've probably got some broken hardware somewhere...") and they probably wouldn't care anyway unless more customers were affected.

If it's a middle box somewhere along the way, that's even worse. Even more different people potentially involved...

> Try power cycling.

Did that, no change.

So, the plan now is to just sit it out, keeping an eye on things...

Regards,
Sean

Sean Durkin

unread,

Sep 12, 2014, 4:43:46 AM9/12/14

to

Hi Hannes,

Am 11.09.2014 um 20:48 schrieb Hannes Erven:
> I remember a possibly similar situation back in 2008... the culprit was a not-fully-up-to-date Cisco ASA firewall that corrupted TCP SACK fields and hence had the remote site send RSET.
> Anyways on our end the connection seemed to starve, just as you describe it.
>
> We detected that by comparing tcpdumps from both affected ends. Of course we had been lucky enough to have that happen with a business partner with competent IT people who we got a hold of, spotted the problem and also temporarily switched the feature off on their side to prove that this actually is the problem.
> A firmware upgrade on my client's firewall then fixed the issue.
>
> With a server hosted somewhere and incoming connections from big clusters, you might not be as lucky as that...

Yup. Looks like I'll just have to sit it out. This is just a small, private, low-traffic server, it's not like anyone at Amazon cares that I have problems. ;)

And even if they did, I have neither the know-how nor the time and resources to do anything useful to fix it.

I'll just keep my eyes open to see if it gets any worse and recommend my users have their Amazon and newsletter stuff sent to other Email addresses. The advantage of being small is that that really is a feasible option. :)

Regards,
Sean

Sean Durkin

unread,

Sep 12, 2014, 4:46:08 AM9/12/14

to

Hi Mark,

Am 11.09.2014 um 22:59 schrieb L. Mark Stone:
> Any chance there is a UTM device in the email stream?

Possible, but I wouldn't know. This is a rented rootserver in some data center. I don't know their topology, and they probably wouldn't tell me even if I asked.

> We see lots of these errors when our SonicWalls do an RBL lookup, don't like the data in the email stream etc. The SonicWalls then just drop the connection and Postfix logs the drop.

I'll contact support and ask, that won't hurt.

Regards,
Sean

Viktor Dukhovni

unread,

Sep 12, 2014, 10:30:10 AM9/12/14

to

On Fri, Sep 12, 2014 at 10:36:51AM +0200, Sean Durkin wrote:

> If it's a middle box somewhere along the way, that's even worse.
> Even more different people potentially involved...

I would rent a backup MX server (deploy identical anti-spam policies,
and lists of valid recipients, ...) at a different site, which
relays mail to your primary. If no problems are not observed at
that server, make it the primary (cut over) and stop paying for
the original server.

Let the vendor know that there is a network problem at the present
location. You can refer them to this thread.

http://archives.neohapsis.com/archives/postfix/2014-09/thread.html#118

It should be in their interest to fix the problem. They can deploy
network sniffers upstream from your machine and work with you to
find out what's really happening. Unless the problem is an NSL
mandated network sniffer upstream of your machine. :-)

--
Viktor.

Ruud Harmsen

unread,

Jul 2, 2022, 8:02:34 AM7/2/22

to

I know this discussion is almost 8 years old, but this morning
I had the problem too, and I found a solution that goes in a
totally different direction. Perhaps still useful. I found your
discussion in:
https://mailing.postfix.users.narkive.com/3vrsw06p/possible-reasons-for-lost-connection-after-data

My situation:
I wanted to send a reply using IPA characters, so way outside
the Latin-1 range, which stupid old Eudora is still restricted
to. I normally run Eudora in Wine under Linux Mint, used to be
Windows. I know there are better programs nowadays, but I want
to keep by decades old mail archive. Converting it isn't trivial.

I tried Evolution for Linux just to send this odd message. But
for some strange reason (unrelated to this discussion) it
wouldn't. After a while I gave up. Then I started playing e-mail
client myself, typing SMTP or ESMTP commands to telnet directly.
And after a while, I had a working solution, using this syntax:
telnet localhost 25 < prepared-file

Localhost is really my remote e-mailserver, because I redirected
ports 25, 587 and 110 to it, using "ssh -L 25:localhost:25 [...]"
etc.

My prepared-file contains simple ESMTP, minimal mail headers, and
the message that I want to send, in UTF-8. It works. Postfix
understandably complains "improper command pipelining after EHLO
from localhost[::1]", because indeed I ignore Postfix's responses
and just rattle on. But it works!

That is, it did until I replaced my test message with a serious
real one. Then I ran into a problem exactly like what you guys
were discussing 8 years ago. Fewer bytes accepted than prepared,
then Postfix logged:
"lost connection after DATA (584 bytes) from localhost[::1]"
although in reality my DATA was over 1600 bytes long.

After a while I found a cause: in a quoted part of the e-mail
I was responding to, there was the word “sector”, like that,
in rounded quotes. Now when I replaced those by simple ASCII
quotes, "sector", the problem was gone!! Message sent
flawlessly!

Even stranger: when I tried this: "sector”, so one straight
quote, then a round one, I got this in the command line:
==
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

telnet> some text after quoted word
?Invalid command
telnet> Connection closed.
==

So I seems it has nothing to do, in my case at least, with
Postfix, TCP packets, hardware, firewalls, whatever, but it's
just telnet interpreting data that in my opinion it should
just pass unaltered. But its behaviour may also be completely
legit and by design. I don't know.

Now of course, in the situation you guys were talking about,
there was probably no telnet at play, but still, the software
that _was_ sending those e-mails, may exhibit a similar
unexpected interpretation of round or straight quotes. Hence
my message. HTH.

Ruud Harmsen

unread,

Jul 2, 2022, 8:08:25 AM7/2/22

to

I forgot to mention that I am using:
Content-transfer-encoding: 8bit
but using base64 or quoted-printable instead, makes no difference.

redouane red

unread,

Sep 3, 2022, 7:21:20 AM9/3/22

to

hello sir please i need ask you how can send in account amzon to gmail please help me