On Wed, Dec 21, 2022 at 8:07 AM Hongxu Ma <int...@outlook.com> wrote:
>
> Hi All
> In IC-UDP mode, MotionSender can occasionally hang when the ACK for an EOS packet is lost. Here is an example stack from a hung sender:
> ```
> // MotionSender
> #0 0x00007f4e1bc8dccd in poll ()
> #1 0x00000000009e8737 in pollAcks () at ic_udpifc.c:5104
> #2 SendEosUDPIFC () at ic_udpifc.c:5463
> #3 0x00000000009d8bf6 in SendEndOfStream()
> ...
> #5 execMotionSender () at nodeMotion.c:330
> #6 ExecMotion () at nodeMotion.c:274
> ...
> ```
> The issue exists in all versions: 5x, 6x and 7x. We have encountered it in many scenarios (internal and community).
> I gave a brief description in a community issue; please refer to:
> https://github.com/greenplum-db/gpdb/issues/12961#issuecomment-1347835199

Thank you for the fantastic description of the problem in that issue.
>
> Here are my thoughts on possible fixes:
>
> 1. Always respond with an ACK for an EOS, regardless of whether the connection is in ic_control_info.connHtab.
> 2. Add new timeout mechanisms, e.g. a short transmit_timeout for EOS (but this shifts the hang risk to the receiver).
> 3. No change, just provide workarounds, such as setting the IC type to TCP.
>
> After an offline discussion, I prefer option 1: always ACK an EOS. One small concern, though: is there any hidden danger if we decide to do so?
>
Let's clearly define what the expected end state is for the Motion
Receiver (MR) on EOS. Once that's defined, a retried EOS should make
sure that end state is reached: if the entry exists, act on it; if it
doesn't, treat it as a no-op. In either case, respond with an ACK
(option 1). Also, we should look at the code and verify how the Motion
Sender (MS) behaves when it receives multiple ACKs for the same EOS
from the MR. If it handles the extra ACKs as a no-op, then option 1
seems safe.
Option 2 - not sure what behavior we would have on timeout. Would the
query fail?
Option 3 - is just a no-go. We have to fix the module that has the
problem :-)
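
To make the option 1 idea concrete, here is a rough sketch of what
idempotent EOS handling could look like on both sides. It is only an
illustration under assumptions: RecvTable, RecvConn, SendConn,
handle_eos and handle_eos_ack are hypothetical names, not the real
ic_udpifc.c structures, and "sending an ACK" is modeled as bumping a
counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; NOT the real ic_udpifc.c types. */
#define MAX_CONNS 8

typedef struct RecvConn {
    bool     in_use;    /* slot holds a live connection */
    uint32_t conn_id;   /* assumed connection key */
    bool     eos_seen;  /* end state: EOS already processed */
} RecvConn;

typedef struct RecvTable {
    RecvConn slots[MAX_CONNS];  /* stand-in for a connection hash table */
} RecvTable;

typedef enum { EOS_APPLIED, EOS_DUPLICATE, EOS_UNKNOWN_CONN } EosResult;

/* Linear lookup; returns NULL when the connection is not in the table. */
RecvConn *find_conn(RecvTable *t, uint32_t conn_id)
{
    for (size_t i = 0; i < MAX_CONNS; i++) {
        if (t->slots[i].in_use && t->slots[i].conn_id == conn_id)
            return &t->slots[i];
    }
    return NULL;
}

/*
 * Receiver side: apply the EOS if the entry exists and has not seen it,
 * otherwise do nothing. An ACK is "sent" in every case, so a sender
 * retrying an EOS whose ACK was lost always gets an answer.
 */
EosResult handle_eos(RecvTable *t, uint32_t conn_id, int *acks_sent)
{
    assert(t != NULL && acks_sent != NULL);

    RecvConn *c = find_conn(t, conn_id);
    EosResult result;

    if (c == NULL) {
        result = EOS_UNKNOWN_CONN;   /* entry gone: no-op */
    } else if (c->eos_seen) {
        result = EOS_DUPLICATE;      /* retry of a processed EOS: no-op */
    } else {
        c->eos_seen = true;          /* first EOS: reach the end state */
        result = EOS_APPLIED;
    }

    (*acks_sent)++;                  /* stand-in for sending the ACK */
    return result;
}

typedef struct SendConn {
    bool eos_acked;  /* sender has already seen an ACK for its EOS */
} SendConn;

/*
 * Sender side: the first ACK completes the EOS handshake; any later
 * duplicate is ignored. Returns true only when state actually changed.
 */
bool handle_eos_ack(SendConn *s)
{
    if (s->eos_acked)
        return false;
    s->eos_acked = true;
    return true;
}
```

In this sketch a retried EOS for a known connection, or an EOS for a
connection that has already been removed, changes no receiver state but
still produces an ACK, and a second ACK at the sender is ignored. Whether
the real MS and MR code paths behave this way is exactly what needs to be
verified.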
--
Ashwin Agrawal (VMware)