Prevent hang when the ACK of an EOS package is missing

Hongxu Ma

Dec 21, 2022, 11:07:16 AM
to gpdb...@greenplum.org
Hi All,
In IC-UDP mode, MotionSender can occasionally hang when the ACK for an EOS packet is lost. An example of the hang stack:
```
// MotionSender
#0  0x00007f4e1bc8dccd in poll () 
#1  0x00000000009e8737 in pollAcks () at ic_udpifc.c:5104
#2  SendEosUDPIFC () at ic_udpifc.c:5463
#3  0x00000000009d8bf6 in SendEndOfStream()
  ...
#5  execMotionSender () at nodeMotion.c:330
#6  ExecMotion () at nodeMotion.c:274
  ...
```
The issue exists in all versions: 5x, 6x and 7x. We have encountered it in many scenarios (internal and community).
I gave a brief description in a community issue; please refer to:
https://github.com/greenplum-db/gpdb/issues/12961#issuecomment-1347835199

Here are my thoughts on possible fixes:
  1. Always respond with an ACK for an EOS, regardless of whether the connection is in ic_control_info.connHtab.
  2. Add a new timeout mechanism, e.g. a short transmit_timeout for EOS (but this shifts the hang risk to the receiver).
  3. No change, just provide workarounds, such as setting the IC type to TCP.
After offline discussion, I prefer option 1: always send an ACK for an EOS. One small concern: is there any hidden danger if we decide to do so?

I hope to get suggestions here; discussion is welcome.

Thanks,
Hongxu

Ashwin Agrawal

Dec 21, 2022, 8:48:28 PM
to Hongxu Ma, gpdb...@greenplum.org
On Wed, Dec 21, 2022 at 8:07 AM Hongxu Ma <int...@outlook.com> wrote:
>
> Hi All,
> In IC-UDP mode, MotionSender can occasionally hang when the ACK for an EOS packet is lost. An example of the hang stack:
> ```
> // MotionSender
> #0 0x00007f4e1bc8dccd in poll ()
> #1 0x00000000009e8737 in pollAcks () at ic_udpifc.c:5104
> #2 SendEosUDPIFC () at ic_udpifc.c:5463
> #3 0x00000000009d8bf6 in SendEndOfStream()
> ...
> #5 execMotionSender () at nodeMotion.c:330
> #6 ExecMotion () at nodeMotion.c:274
> ...
> ```
> The issue exists in all versions: 5x, 6x and 7x. We have encountered it in many scenarios (internal and community).
> I gave a brief description in a community issue; please refer to:
> https://github.com/greenplum-db/gpdb/issues/12961#issuecomment-1347835199

Thank you for the fantastic description of the problem in that issue.

>
> Here are my thoughts on possible fixes:
>
> 1. Always respond with an ACK for an EOS, regardless of whether the connection is in ic_control_info.connHtab.
> 2. Add a new timeout mechanism, e.g. a short transmit_timeout for EOS (but this shifts the hang risk to the receiver).
> 3. No change, just provide workarounds, such as setting the IC type to TCP.
>
> After offline discussion, I prefer option 1: always send an ACK for an EOS. One small concern: is there any hidden danger if we decide to do so?
>

Let's clearly define the expected end state for the Motion Receiver
(MR) on EOS. Once that is defined, a retried EOS should make sure that
end state is reached: if the entry exists, act on it; if it doesn't,
treat the EOS as a no-op. In either case, respond with an ACK
(option 1). We should also look into the code and verify how the
Motion Sender (MS) behaves when it receives multiple ACKs for an EOS
from the MR. If that is fine and acts as a no-op, then option 1 seems
safe.
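
The behavior described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the ic_udpifc.c code: `RecvTable`, `RecvConn`, `SendConn`, `recv_handle_eos` and `send_handle_eos_ack` are hypothetical names, a small array stands in for ic_control_info.connHtab, and a counter stands in for actually sending the ACK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; NOT the real ic_udpifc.c structures. */
#define MAX_CONNS 8

typedef struct RecvConn {
    uint32_t conn_id;      /* hypothetical connection key */
    bool     in_use;       /* slot holds a live connection */
    bool     eos_received; /* the receiver's "end state" for EOS */
} RecvConn;

typedef struct RecvTable {
    RecvConn conns[MAX_CONNS]; /* stand-in for the connection hash table */
    int      acks_sent;        /* stand-in for actually sending an ACK */
} RecvTable;

/* Linear lookup; returns NULL when the connection is unknown. */
static RecvConn *recv_lookup(RecvTable *t, uint32_t conn_id)
{
    for (size_t i = 0; i < MAX_CONNS; i++)
        if (t->conns[i].in_use && t->conns[i].conn_id == conn_id)
            return &t->conns[i];
    return NULL;
}

/*
 * Receiver side: handle an EOS (first arrival or retry).
 * Entry exists  -> move it to the end state (idempotent on retries).
 * Entry missing -> no-op.
 * Either way an ACK is "sent". Returns true if an entry was found.
 */
static bool recv_handle_eos(RecvTable *t, uint32_t conn_id)
{
    assert(t != NULL);
    RecvConn *c = recv_lookup(t, conn_id);
    if (c != NULL)
        c->eos_received = true;
    t->acks_sent++; /* always ACK, found or not */
    return c != NULL;
}

typedef struct SendConn {
    bool eos_sent;  /* sender has transmitted EOS */
    bool eos_acked; /* sender may stop waiting */
    int  acks_seen; /* counts ACKs, duplicates included */
} SendConn;

/* Sender side: the first ACK completes the EOS; duplicates change nothing. */
static void send_handle_eos_ack(SendConn *s)
{
    assert(s != NULL);
    s->acks_seen++;
    if (!s->eos_sent)
        return; /* stray ACK before EOS was sent: ignore */
    s->eos_acked = true; /* setting it again on a duplicate is harmless */
}
```

In this sketch, calling `recv_handle_eos` twice for a known connection and once for an unknown one produces three ACKs and leaves the known connection in the same end state, and calling `send_handle_eos_ack` twice leaves `eos_acked` set just as a single call would. Whether the real MR and MS code paths have these properties is exactly what needs to be verified.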

Option 2 - not sure what behavior we would have on timeout: failing the query?
Option 3 - is just a no-go. We have to fix the module that has the
problem :-)

--
Ashwin Agrawal (VMware)

Hongxu Ma

Jan 3, 2023, 12:00:47 PM
to Ashwin Agrawal, gpdb...@greenplum.org
Thanks for your reply, Ashwin.

> Option 2 - not sure what behavior we have with timeout, failing the query?
Yes, so the user gets an error message promptly rather than a long hang.
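
As a rough illustration of the option 2 idea (a bounded wait that ends in an error instead of an open-ended hang), here is a minimal C sketch built on poll(). It is not the pollAcks() code: `wait_for_eos_ack`, `EosWaitResult`, `interval_ms` and `max_retries` are hypothetical names, and the retransmission of EOS between attempts is only marked with a comment.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
typedef enum {
    EOS_ACKED = 0,     /* something arrived on the ACK descriptor */
    EOS_TIMED_OUT = 1, /* retry budget exhausted; caller fails the query */
    EOS_POLL_ERROR = 2 /* poll() failed or the descriptor is in error */
} EosWaitResult;

/*
 * Wait for ack_fd to become readable, for at most max_retries attempts
 * of interval_ms milliseconds each. An attempt interrupted by a signal
 * (EINTR) still counts against the budget, so the total wait is bounded.
 */
static EosWaitResult wait_for_eos_ack(int ack_fd, int interval_ms, int max_retries)
{
    assert(interval_ms >= 0 && max_retries > 0);

    for (int attempt = 0; attempt < max_retries; attempt++) {
        struct pollfd pfd = { .fd = ack_fd, .events = POLLIN, .revents = 0 };
        int n = poll(&pfd, 1, interval_ms);

        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted: try the next attempt */
            return EOS_POLL_ERROR;
        }
        if (n > 0) {
            if (pfd.revents & POLLIN)
                return EOS_ACKED;  /* data is ready to read */
            return EOS_POLL_ERROR; /* POLLERR, POLLHUP or POLLNVAL */
        }
        /* n == 0: this attempt timed out. A real sender would
         * retransmit the EOS packet here before polling again. */
    }
    return EOS_TIMED_OUT;
}
```

With a descriptor that never becomes readable, the function returns `EOS_TIMED_OUT` after about `max_retries * interval_ms` milliseconds, at which point the caller could raise an error for the query. The trade-off mentioned in option 2 still applies: the sender stops waiting, but the receiver may not have seen the EOS.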

I am going to dig deeper into option 1 and do some verification, then post more details for discussion later.

Thanks,
Hongxu


From: Ashwin Agrawal <ashwi...@gmail.com>
Sent: Thursday, December 22, 2022 9:48
To: Hongxu Ma <int...@outlook.com>
Cc: gpdb...@greenplum.org <gpdb...@greenplum.org>
Subject: Re: Prevent hang when the ACK of an EOS package is missing
 