SMTP IsAlive Timeout

Brandi Baylon

Jul 10, 2024, 6:20:53 AM

to milnwallprofneu

When looking at the TCP 25 traffic, the trace is mostly just showing 'TCP Retransmission'. There are 8.3 million entries in the trace! The trace was only running for the time it took me to start the telnet and for it to timeout, which was less than 30 seconds. That is way high... Given that 99% of the entries are 'TCP Retransmission', it makes me think that the NetScaler is not acknowledging/accepting the traffic from the clients correctly...?

Were you also using DSR (direct server return mode) for your SMTP load balancing with USIP mode? Which is possibly why your backend smtp servers still keep their default gateway instead of the ADC snip? (Since you seem to also have MB on the servicegroup enabled.)

When the keyword argument timeout is specified as a number (default: 30), then TIMEOUT will be raised after the value specified has elapsed, in seconds, for any of the expect() family of method calls. When None, TIMEOUT will not be raised, and expect() may block indefinitely until match.

When the keyword argument timeout is -1 (default), then TIMEOUT will raise after the default value specified by the class timeout attribute. When None, TIMEOUT will not be raised and may block indefinitely until match.

This reads at most size characters from the child application. It includes a timeout. If the read does not complete within the timeout period then a TIMEOUT exception is raised. If the end of file is read then an EOF exception will be raised. If a logfile is specified, a copy is written to that log.

If timeout is None then the read may block indefinitely. If timeout is -1 then the self.timeout value is used. If timeout is 0 then the child is polled and if there is no data immediately ready then this will raise a TIMEOUT exception.

On the other hand, if there are bytes available to read immediately, all those bytes will be read (up to the buffer size). So, if the buffer size is 1 megabyte and there is 1 megabyte of data available to read, the buffer will be filled, regardless of timeout.
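The three timeout cases described above (None, -1, and 0) can be illustrated with a small standard-library sketch. This is not pexpect itself: the `Timeout` class, the `read_with_timeout` function, and the `default_timeout` parameter are hypothetical stand-ins, and the sketch assumes a POSIX system where `select` works on pipe file descriptors.

```python
import os
import select


class Timeout(Exception):
    """Hypothetical stand-in for a TIMEOUT exception."""


def read_with_timeout(fd, size=1, timeout=-1, default_timeout=30):
    """Read at most `size` bytes from file descriptor `fd`.

    timeout=-1   -> fall back to default_timeout
    timeout=None -> block indefinitely until data or EOF
    timeout=0    -> poll once and raise Timeout if nothing is ready
    """
    if timeout == -1:
        timeout = default_timeout
    # select() returns an empty list when the wait expires with no data.
    ready, _, _ = select.select([fd], [], [], timeout)
    if not ready:
        raise Timeout(f"no data within {timeout} seconds")
    data = os.read(fd, size)
    if data == b"":
        # An empty read on a readable descriptor means end of file.
        raise EOFError("end of file reached")
    return data
```

Note that bytes that are already available are returned at once, up to `size`, whatever the timeout is, which matches the last paragraph above.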

If the SMTP server is not reachable, the ping request will time out. However, some SMTP servers are configured to explicitly block ping requests. Thus, even if the ping times out, you should still try to telnet to the SMTP server unless you are absolutely sure that ping is not blocked.

Most probably though, if the ping times out, the server address is not reachable. If the SMTP server does not work although its IP address is reachable by ping, the SMTP server software might not be running on the specified machine.

When telnet terminal is open, we can now continue with SMTP connectivity testing. To connect to the SMTP server, type: o smtp.example.com 25 in telnet command prompt (where smtp.example.com should be replaced by actual SMTP server and 25 by the actual SMTP server port).
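The same check that the telnet session performs by hand can be sketched in Python: open a TCP connection to the SMTP port with a timeout and look for the 220 greeting. The function name `smtp_is_alive` and the 10 second default are assumptions made for illustration, not part of any tool mentioned in this thread.

```python
import socket


def smtp_is_alive(host, port=25, timeout=10.0):
    """Return (ok, detail). ok is True if a 220 greeting arrives in time."""
    try:
        # The timeout applies to the connect and to the recv that follows.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            banner = sock.recv(1024).decode("ascii", errors="replace").strip()
    except socket.timeout:
        return False, "timed out"
    except OSError as exc:
        return False, f"connection failed: {exc}"
    # An SMTP server that is ready greets with a line starting with 220.
    return banner.startswith("220"), banner
```

For example, `smtp_is_alive("smtp.example.com", 25)` would report a timeout in the situation described at the top of this thread, where the TCP connection is never accepted.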

Second, by default the NVMe drivers included in most operating systems implement an I/O timeout. If an I/O does not complete in an implementation-specific amount of time, usually tens of seconds, the driver will attempt to cancel the I/O, retry it, or return an error to the component that issued the I/O. The Xen PV block device interface does not time out I/O, which can result in processes that cannot be terminated if they are waiting for I/O. The Linux NVMe driver behavior can be modified by specifying a higher value for the nvme.io_timeout kernel module parameter.

Third, the NVMe interface can transfer much larger amounts of data per I/O, and in some cases may be able to support more outstanding I/O requests, compared to the Xen PV block interface. This can cause higher I/O latency if very large I/Os or a large number of I/O requests are issued to volumes designed to support throughput workloads like EBS Throughput Optimized HDD (st1) and Cold HDD (sc1) volumes. This I/O latency is normal for throughput optimized volumes in these scenarios, but may cause I/O timeouts in NVMe drivers. The I/O timeout can be adjusted in the Linux driver by specifying a larger value for the nvme_core.io_timeout kernel module parameter.

Monitors determine the availability and performance of devices, links, and services on a network. Health monitors check the availability. Performance monitors check the performance and load. If a monitored device, link, or service does not respond within a specified timeout period, or the status indicates that performance is degraded or that the load is excessive, the BIG-IP system can redirect the traffic to another resource.

Active monitoring checks the status of a pool member or node on an ongoing basis as specified. If a pool member or node does not respond within a specified timeout period, or the status of a node indicates that performance is degraded, the BIG-IP system can redirect the traffic to another pool member or node. There are many active monitors. Each active monitor checks the status of a particular protocol, service, or application. For example, one active monitor is HTTP. An HTTP monitor allows you to monitor the availability of the HTTP service on a pool, pool member, or node. A WMI monitor allows you to monitor the performance of a node that is running the Windows Management Instrumentation (WMI) software. Active monitors fall into two categories: Extended Content Verification (ECV) monitors for content checks, and Extended Application Verification (EAV) monitors for service checks, path checks, and application checks.

When a virtual server that is being monitored by a health monitor does not respond to a probe from the BIG-IP system within a specified timeout period, the system marks the virtual server down and no longer load balances traffic to that virtual server. When the health monitor determines that the virtual server is once again responsive, the system again begins to load balance traffic to that virtual server. To illustrate, a Gateway Internet Control Message Protocol (ICMP) monitor pings a virtual server. If the monitor does not receive a response from the virtual server, the BIG-IP system marks that virtual server down. When the ping is successful, the system marks the virtual server up.
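The up/down behaviour described above can be sketched as a small state machine. This is only an illustration of the idea, not how the BIG-IP system is implemented: the class name `HealthMonitor`, the `probe` callable, and the injectable `clock` are all hypothetical.

```python
import time


class HealthMonitor:
    """Marks a resource down when no probe has succeeded within `timeout`."""

    def __init__(self, probe, timeout, clock=time.monotonic):
        self.probe = probe          # callable returning True on a good response
        self.timeout = timeout      # seconds allowed without a good response
        self.clock = clock          # injectable so the logic can be tested
        self.last_success = clock()
        self.up = True

    def check(self):
        """Run one probe and return the current up/down state."""
        if self.probe():
            # Any successful probe marks the resource up again.
            self.last_success = self.clock()
            self.up = True
        elif self.clock() - self.last_success >= self.timeout:
            # No good response for the whole timeout period: mark down.
            self.up = False
        return self.up
```

A caller would invoke `check()` at a fixed interval and stop sending traffic to the resource while it returns False.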

Pooled connections that have been idle in the pool for longer than this timeout will be tested before they are used again, to ensure they are still alive. If this option is set too low, an additional network call will be incurred when acquiring a connection, which causes a performance hit. If this is set high, no longer live connections might be used which might lead to errors. Hence, this parameter tunes a balance between the likelihood of experiencing connection problems and performance. Normally, this parameter should not need tuning. Value 0 means connections will always be tested for validity.
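The trade-off in this paragraph can be made concrete with a minimal pool sketch. It is not the code of any particular driver: `IdleTestingPool`, `connect`, `is_alive`, and `idle_test_timeout` are hypothetical names, and the liveness test stands in for the extra network call mentioned above.

```python
import time


class IdleTestingPool:
    """Tests a pooled connection only if it sat idle for too long."""

    def __init__(self, connect, is_alive, idle_test_timeout, clock=time.monotonic):
        self._connect = connect              # creates a new connection
        self._is_alive = is_alive            # liveness test (costs a round trip)
        self._idle_test_timeout = idle_test_timeout
        self._clock = clock
        self._idle = []                      # (connection, time it was released)

    def acquire(self):
        while self._idle:
            conn, released_at = self._idle.pop()
            idle_for = self._clock() - released_at
            # Recently used connections are trusted without a test. With a
            # timeout of 0 this shortcut never applies, so every connection
            # is tested.
            if idle_for < self._idle_test_timeout:
                return conn
            if self._is_alive(conn):
                return conn
            # The connection is dead: discard it and try the next one.
        return self._connect()

    def release(self, conn):
        self._idle.append((conn, self._clock()))
```

A low timeout pays for the liveness test on almost every acquire, while a high one risks handing out a connection that the server has already closed.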

The transaction settings help you manage the transactions in your database, for example, the transaction timeout, the lock acquisition timeout, the maximum number of concurrently running transactions, etc. For more information, see Manage transactions and Locks and deadlocks.

Just now one machine had the issue again. I checked and saw that we
were down to just two smtpd processes and even though master was still
bound to port 25 no new connections were accepted. I did telnet to it,
but the connection was not accepted and ran into timeout.

How does the timer issue relate to the master process not accepting
any more TCP/IP connections on port 25?

> The setup and configuration works like a charm for hours at a time and
> all of a sudden it stops working leading to two issues (not at the same
> time):
>
> 1) First issue was that suddenly smtp stopped delivering email to that
> multi-A record. We noticed a few thousand emails in the active queue (I
> guess all emails were in the active queue by that time). We rule out
> problems with the destination servers since the remaining postfix
> instances still delivered mail during that time. Even the last
> submission of email from the now locked up postfix finished without
> issue. There are just no more tries to reach the destination. Postfix
> also stops to deliver to both destination IPs at the same time. There
> was no logging anymore, but from anvil giving us some statistics about
> connection rates.
>
> 2) The second issue, occurring more often, is that smtpd stopped
> responding or doing anything actually. Sometimes the number of processes
> went to max (500), at other occurrences it just stayed at like 30-50 but
> symptoms were still the same.
>

> Looking at netstat all tcp connections to the smtpd processes went away
> after some time. We believe that clients trying to deliver email to us
> did disconnect due to reaching their timeout.
>

This means that the smtpd processes are hanging, the master is
hanging, or both.

At this point I will not speculate further until you report the
result of following the instructions in
_README.html#logging

If I don't see a credible report about warnings etc. in Postfix
logfiles, then that means that you are flying blind, and that needs
to be addressed first.

The following is for background information only.

The master daemon watches the SMTP port only when all existing
smtpd processes have reported that they are busy (i.e. talking to
an SMTP client). Otherwise, some idle smtpd process will watch
the port.

When all smtpd processes have reported that they are busy, the
master starts a new smtpd process in response to a new connection,
provided that the per-service process limit is not reached (otherwise
the master logs a warning that all ports are busy).

In your case, the two smtpd processes got stuck before sending the
"I am busy" message to the master daemon, so the master daemon
still believes that the two processes are idle. I don't know if
this has anything to do with broken virtual timers.

Regardless of why a process hangs, if it hangs then you should see
watchdog errors in the Postfix logs. If you don't see those then
either your virtual timer is busted, or your logging is busted.

Wietse