PySpark Jupyter notebook not connected... but it IS


kjell...@sony.com

Oct 24, 2018, 2:20:27 AM
to Project Jupyter
Hi all!

I believe I am not alone in having trouble with Jupyter Notebooks that get disconnected while the kernel is running and which refuse to reconnect.

But today I saw a new twist: the notebook header shows "Disconnected" (and pressing this button makes the notebook attempt to reconnect, but it fails). However, a progress bar for a heavy task keeps updating, and the "Spark Jobs" dialog also keeps updating, showing fresh progress every time I open it.

The Jupyter overview of running processes also shows the kernel as active. So something is obviously running, and the notebook is able to update some of its items. Why, then, does the header show "Not connected" (and the kernel icon, which is supposed to show whether the kernel is running, idle, or disconnected, also show "disconnected")? Is this a known bug?

/kjell

Kevin Bates

Oct 24, 2018, 8:51:33 PM
to Project Jupyter
If this is reproducible, enable debug logging on your Notebook Server (`--log-level=DEBUG`) and reproduce your issue.  Once disconnected, you should see log messages indicating that messages are being buffered.  Upon re-connecting, you'll see a message that N messages are being discarded for a given value (i.e., key).
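For example, with the classic Notebook server the `--log-level=DEBUG` flag can also be set persistently in `jupyter_notebook_config.py`; a minimal config sketch:

```python
# jupyter_notebook_config.py -- enable debug-level logging on the
# Notebook server so buffered/discarded websocket messages show up
# in its log. Equivalent to starting with:
#   jupyter notebook --log-level=DEBUG
c.NotebookApp.log_level = 'DEBUG'
```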

The issue is that the key used to determine if the buffered messages should be replayed is a connection-specific value and a new value is produced with each new connection.  As a result, the value associated with the now current connection won't be found in the dictionary of buffered messages, so none of the messages relative to the previous (disconnected) connection will ever be replayed.

I think the crux of the issue is locating a value that persists across "tab invocations" (i.e. connections) but isn't kernel-scoped because, apparently, a given kernel instance running from the same notebook server can have multiple connections simultaneously.
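The failure mode above can be sketched in a few lines. This is a simplified illustration, not the actual notebook-server source: the names and the `uuid`-based key are stand-ins for whatever connection-specific value the server really uses.

```python
# Sketch of the replay bug: buffered messages are stored under a
# per-connection key, so a reconnecting client never finds them.
import uuid

message_buffer = {}  # connection key -> buffered messages

def on_disconnect(connection_key, pending):
    # Output produced while no client is attached gets buffered
    # under the key of the connection that just dropped.
    message_buffer[connection_key] = pending

def on_reconnect():
    # Every new websocket connection gets a fresh key...
    new_key = uuid.uuid4().hex
    # ...so this lookup misses, and the old output is never replayed.
    return message_buffer.get(new_key, [])

on_disconnect("conn-1", ["stream: 42%", "stream: 43%"])
replayed = on_reconnect()
# replayed is empty even though "conn-1" still holds buffered output
```

A kernel-scoped key would make the lookup succeed, but as noted below, the front end would likely still need changes before the replayed messages actually appear.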

Once replay is resolved within the server, there's a good chance that some set of changes in the front end will also be necessary.  I know because I've modified the server to use a kernel-scoped key value so messages are replayed to the front-end, but they still don't appear.  

See Issue https://github.com/jupyter/notebook/issues/4105, and PRs https://github.com/jupyter/notebook/pull/2871 and https://github.com/jupyter/notebook/pull/4110.

Roland Weber

Oct 25, 2018, 1:58:36 AM
to Project Jupyter
On Wednesday, October 24, 2018 at 8:20:27 AM UTC+2, kjell...@sony.com wrote:
a progress bar for a heavy task keeps updating
...
Why, then, does the header show "Not connected"?

In our environments, we've seen problems when a "heavy task" in a kernel consumes so much CPU that the Jupyter process no longer responds in a timely manner. Or at least, that's what we currently suppose is happening.

It is also possible that your kernel is so busy executing the heavy task that it no longer responds to the other messages the browser sends to determine kernel status. Yet the progress bar updates still get through, because producing them is part of the heavy task itself.
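A hypothetical cell illustrating this (the function name and step counts are made up, not from the thread): a CPU-bound loop keeps the single-threaded kernel busy, so its own progress output still reaches the browser while requests on other channels (such as the status checks behind the kernel indicator) go unanswered until the loop finishes.

```python
# Illustrative "heavy cell": the loop never yields control, so the
# kernel cannot service other incoming messages, yet the progress
# writes below are emitted by the loop itself and still flow out.
import sys

def heavy_task(n_steps=5):
    completed = []
    for i in range(n_steps):
        # simulate CPU-bound work that never yields to other handlers
        sum(j * j for j in range(100_000))
        # progress output is part of the task itself, so it is
        # produced even while replies to status requests stall
        sys.stdout.write(f"\rprogress: {i + 1}/{n_steps}")
        sys.stdout.flush()
        completed.append(i)
    return completed
```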

(Maybe that's what's happening in our environments, too :-)

cheers,
  Roland

kjell...@sony.com

Nov 1, 2018, 8:26:37 AM
to Project Jupyter
Thanks a lot Kevin! I cannot say I fully understand it all, but I have forwarded your comments to our tools team, since I cannot change these settings myself. However, since this only happens occasionally, I am not sure whether they will agree to keep the debug setting active. We'll see.

   /kjell

Kevin Bates

Nov 1, 2018, 9:42:00 AM
to Project Jupyter
Yeah, after reading your original post again in combination with Roland's response, I'd actually lean toward Roland's theory here.  Nevertheless, gaining access to the notebook server log and, if possible, the kernel's log, along with knowledge about what a particular cell is doing at the time can go a long way in deciphering the issue.  Sounds like your tools team may be the best resource at the moment.  Good luck.