Deadlock with v8.2.4699

42 views
Skip to first unread message

leit...@gmail.com

unread,
Apr 23, 2022, 2:52:43 AM4/23/22
to vim_dev
Hi all,

I'm one of the maintainers for govim, a Go plugin for vim. As part of our CI we run against latest vim daily and recently it started to fail. I bisected the change to v8.2.4699, with the commit message "hard to reproduce hang when reading from channel" but it seems like it introduced a new (100% reproducible) hang for us.

I picked one of our tests that failed, where we use "feedkeys()" to trigger autocompletion.
From earlier discussions on this list I know that we should avoid using "feedkeys()" in favour of "normal", but I've seen this hang triggered when using normal as well.

What happens is that vim spins at 100% CPU until the test times out.

This is the steps executed by that test, they probably doesn't make sense as is but I hope that the diff below can provide some hints:
 vim ex 'e main.go'
 vim ex 'call cursor(11,1)'
 vim ex 'call feedkeys(\"A\\<C-X>\\<C-O>\\<C-N>\\<C-N>\\<ESC>\", \"xt\")'
 vim ex 'call feedkeys(\"A(Con\", \"xt\")'
 vim ex 'call feedkeys(\"A\\<C-X>\\<C-O>\\<C-N>\\<ESC>\", \"xt\")'
 vim ex 'call feedkeys(\"A)\", \"xt\")'
 vim ex 'w'

Changing so that we send <ESC> on the second feedkeys, and add mode "!" to the last will make the test pass:

 vim ex 'e main.go'
 vim ex 'call cursor(11,1)'
 vim ex 'call feedkeys(\"A\\<C-X>\\<C-O>\\<C-N>\\<C-N>\\<ESC>\", \"xt\")'
 vim ex 'call feedkeys(\"A(Con\\<C-X>\\<C-O>\\<C-N>\\<ESC>\", \"xt\")'
 vim ex 'call feedkeys(\"A)\\<ESC>\", \"xt!\")'
 vim ex 'w'

I haven't created a smaller reproducible yet but I did capture a stack trace when it happens, https://gist.github.com/leitzler/a75f2312f551332ec0c687397e77dda5.

Any guidance on how to proceed with the debugging here?

Regards, Pontus

Bram Moolenaar

unread,
Apr 23, 2022, 8:51:13 AM4/23/22
to vim...@googlegroups.com, leit...@gmail.com
The problem with using feedkeys() is that it does not go through the
normal execution loop. It often works, but not for everything.
Especially message handling, which happens when waiting for the user to
type something, can be a problem.

Have you tried doing these tests by running Vim in a terminal window?
That is more realistic, using term_sendkeys() to have it execute
commands.

--
Bumper sticker: Honk if you love peace and quiet.

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// \\\
\\\ sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

leit...@gmail.com

unread,
Apr 26, 2022, 3:49:05 AM4/26/22
to vim_dev
Thanks for the response. I haven't tried via terminal window, that sounds like a good idea to investigate.

The reason I think feedback() is innocent here is that there are other tests as well that fails, that aren't using feedkeys() at all. I did find one of them that is fairly small where we basically just open a file, move the cursor and delete two lines:
vim ex 'e main.go'
vim ex 'call cursor(9,1)'
vim ex 'normal 2dd'

The commands are sent via a channel and executed on the vim side inside a callback handler (https://github.com/govim/govim/blob/3dce913d76ea92ddd2e6a43fad8aebb72284f9c4/plugin/govim.vim#L262-L276).

A interesting thing that I did find while debugging is that simply adding a sleep to channel_read_block_json() (where the new coded was added in v8.2.4699) changes the behaviour so that the tests pass:
@@ -3999,8 +3999,10 @@ channel_read_json_block(

            // channel_parse_messages() may fill the queue with new data to
            // process.
-           if (channel_has_readahead(channel, part))
+           if (channel_has_readahead(channel, part)) {
+               do_sleep(200L, FALSE);
                continue;
+           }

Since we do have a bunch different tests that started to fail I would really like to understand what is going on, might very well be that v8.2.4699 uncovered something we are doing wrong at our side.

I also captured two channel logs, one where it fails and the other that pass (with sleep): https://gist.github.com/leitzler/c66e9a1b181e15516ff73cfdff530a14.
The interesting part is at the end of broken.log where I manually kill the test at 115.17, but vim still remains running at 100% CPU afterwards:

0.944675 RECV on 0(sock): '[0,[85,"ex","normal 2dd"]]
'
0.944682 : looking for messages on channels
0.944694 on 0: Getting JSON message 0
0.944699 on 0: Invoking channel callback <SNR>17_define
0.944770 : SafeState: reset: ins_typebuf()
0.944858 SEND on 0(sock): '[97,["callback",85,[""]]]
'
0.944923 SEND on 0(sock): '[98,["function","function:GOVIM_internal_BufChanged",[1,9,11,-2,[{"lnum":9,"col":1,"added":-2,"end":11,"lines":[]}]]]]
'
0.944936 on 0: Blocking read JSON for id 98
0.944941 : looking for messages on channels
0.944945 on 0: Waiting for up to 30000 msec
0.945009 RECV on 0(sock): '[96,["",""]]
'
0.945017 : looking for messages on channels
115.174200 : raw terminal output: "^[[24;1H^[[m^[[0m"
115.174473 : Exiting...
115.174906 RECV on 0(sock): '[98,["",null]]
[0,[86,"call","s:schedule",10]]
[0,[87,"call","s:schedule",11]]
'
115.174993 SEND on 0(sock): '[99,["shutdown"]]
'
115.175030 on 0: Blocking read JSON for id 99
115.175053 : looking for messages on channels
115.175069 on 0: Getting JSON message 0
115.175088 on 0: Invoking channel callback <SNR>17_define
115.175293 SEND on 0(sock): '[100,["callback",86,["",0]]]
'
115.175326 ERR on 0: ch_sendexpr(): write failed
115.175356 : ERROR: Error detected while processing function <SNR>17_triggerDrain[1]..<SNR>17_drainScheduleBacklog[35]..<SNR>17_ch_evalexpr[9]..GOVIM_internal_EnrichDelta[4]..GOVIM_internal_BufChanged[2]..<SNR>17_callbackFunction[2]..<SNR>17_ch_evalexpr[9]..VimLeave Autocommands for "*"..function <SNR>17_doShutdown[5]..<SNR>17_ch_evalexpr[9]..<SNR>17_define:
115.175380 : ERROR: line 66:
115.175398 : ERROR: E631: ch_sendexpr(): write failed

Bram Moolenaar

unread,
Apr 26, 2022, 5:23:07 AM4/26/22
to vim...@googlegroups.com, leit...@gmail.com

> Thanks for the response. I haven't tried via terminal window, that sounds
> like a good idea to investigate.
>
> The reason I think feedback() is innocent here is that there are other
> tests as well that fails, that aren't using feedkeys() at all. I did find
> one of them that is fairly small where we basically just open a file, move
> the cursor and delete two lines:
> vim ex 'e main.go'
> vim ex 'call cursor(9,1)'
> vim ex 'normal 2dd'
>
> The commands are sent via a channel and executed on the vim side inside a
> callback handler
> (https://github.com/govim/govim/blob/3dce913d76ea92ddd2e6a43fad8aebb72284f9c4/plugin/govim.vim#L262-L276).
>
> A interesting thing that I did find while debugging is that simply adding a
> sleep to channel_read_block_json() (where the new coded was added in
> v8.2.4699) changes the behaviour so that the tests pass:
> @@ -3999,8 +3999,10 @@ channel_read_json_block(
>
> // channel_parse_messages() may fill the queue with new data to
> // process.
> - if (channel_has_readahead(channel, part))
> + if (channel_has_readahead(channel, part)) {
> + do_sleep(200L, FALSE);
> continue;
> + }

Perhaps what happens is that there is readahead, but it's not sufficient
to parse more. Then we busy-loop here. What we could do is get the
length of the typeahed (or better: a hash) and if it's the same as last
time, ignore it.

> Since we do have a bunch different tests that started to fail I would
> really like to understand what is going on, might very well be that
> v8.2.4699 uncovered something we are doing wrong at our side.

> I also captured two channel logs, one where it fails and the other that
> pass (with
> sleep): https://gist.github.com/leitzler/c66e9a1b181e15516ff73cfdff530a14.
> The interesting part is at the end of broken.log where I manually kill the
> test at 115.17, but vim still remains running at 100% CPU afterwards:
>
> 0.944675 RECV on 0(sock): '[0,[85,"ex","normal 2dd"]]
> '
> 0.944682 : looking for messages on channels
> 0.944694 on 0: Getting JSON message 0
> 0.944699 on 0: Invoking channel callback <SNR>17_define
> 0.944770 : SafeState: reset: ins_typebuf()
> 0.944858 SEND on 0(sock): '[97,["callback",85,[""]]]
> '
> 0.944923 SEND on 0(sock):
> '[98,["function","function:GOVIM_internal_BufChanged",[1,9,11,-2,[{"lnum":9,"col":1,"added":-2,"end":11,"lines":[]}]]]]
> '
> 0.944936 on 0: Blocking read JSON for id 98
> 0.944941 : looking for messages on channels
> 0.944945 on 0: Waiting for up to 30000 msec
> 0.945009 RECV on 0(sock): '[96,["",""]]
> '
> 0.945017 : looking for messages on channels
> 115.174200 : raw terminal output: "^[[24;1H^[[m^[[0m"

It says waiting for 30 sec, but there appears to be a gap of 115 sec.


--
Over the years, I've developed my sense of deja vu so acutely that now
I can remember things that *have* happened before ...

leit...@gmail.com

unread,
Apr 26, 2022, 8:01:31 AM4/26/22
to vim_dev
> Perhaps what happens is that there is readahead, but it's not sufficient
> to parse more. Then we busy-loop here. What we could do is get the
> length of the typeahed (or better: a hash) and if it's the same as last
> time, ignore it.

Yes, that sounds like a plausible explanation and a good idea. My C is unfortunately way too rusty to submit a patch, but I'd be more than happy to try out any patches if you'd like confirmation before pushing a change.

> It says waiting for 30 sec, but there appears to be a gap of 115 sec.

The job itself is initiated via job_start() with a timeout of 30000, but that timeout never occurs when vim hangs like this. I killed it manually after 115 sec.

Bram Moolenaar

unread,
Apr 26, 2022, 1:47:26 PM4/26/22
to vim...@googlegroups.com, leit...@gmail.com

> > Perhaps what happens is that there is readahead, but it's not sufficient
> > to parse more. Then we busy-loop here. What we could do is get the
> > length of the typeahed (or better: a hash) and if it's the same as last
> > time, ignore it.
>
> Yes, that sounds like a plausible explanation and a good idea. My C is
> unfortunately way too rusty to submit a patch, but I'd be more than happy
> to try out any patches if you'd like confirmation before pushing a change.
>
> > It says waiting for 30 sec, but there appears to be a gap of 115 sec.
>
> The job itself is initiated via job_start() with a timeout of 30000, but
> that timeout never occurs when vim hangs like this. I killed it manually
> after 115 sec.

Yeah, if it loops here it doesn't get to the part where it uses the
timeout.

I can guess what happens, I'll make a patch based on that. But I can't
be 100% sure it will solve the problem. Please check after the relevant
patch. I don't know how to write a test for this too.

--
hundred-and-one symptoms of being an internet addict:
60. As your car crashes through the guardrail on a mountain road, your first
instinct is to search for the "back" button.

leit...@gmail.com

unread,
Apr 27, 2022, 6:51:26 AM4/27/22
to vim_dev
Our CI ran about an hour ago using v8.2.4832 and the tests passed now, seems like you were spot on!
Thank you for the fix and swift response, much appreciated.

Regards, Pontus

Reply all
Reply to author
Forward
0 new messages