kdb randomly hanging when run inside docker container


joshm...@yandex.com

Apr 2, 2016, 7:48:58 PM
to Kdb+ Personal Developers

I'm not sure how to debug this, so looking for suggestions.

I've been running multiple kdb instances inside a docker container.
On multiple occasions, one of the instances will hang:

- Running hopen on its port from another kdb process hangs indefinitely.
- Trying to run an HTTP query also hangs until the client times out.
- I am, however, able to successfully open a connection with telnet.
- Sending SIGINT/SIGTERM has no effect. SIGKILL is needed to bring down the hung process.
- Other kdb instances in the container are unaffected.
- Appears to happen randomly, sometimes after 1 hour, sometimes after 2 days.

Most likely a docker bug, but so far it's only manifested itself with kdb.

James Little

Apr 3, 2016, 5:25:56 AM
to personal...@googlegroups.com
I'd try attaching strace to the pid concerned, to see if it's stuck waiting on the return of a syscall. Normally you would see regular select calls, corresponding to the file descriptors/ports the process is listening on.
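As a rough sketch of the attach step: the two helper names below (`proc_state`, `attach_strace`) and the default log file name are made up for illustration, not anything from kdb or Docker. The first reads the kernel-reported process state from /proc, which is a quick check before tracing; the second attaches strace to an already-running pid. Attaching needs ptrace permission, which inside a container usually means the SYS_PTRACE capability, so treat this as something to adapt rather than a recipe.

```shell
# Print the kernel-reported state of a process, e.g. "S (sleeping)"
# or "D (disk sleep)". Reads the State: line of /proc/<pid>/status (Linux only).
proc_state() {
    awk -F'\t' '/^State:/ {print $2}' "/proc/$1/status"
}

# Attach strace to a running process.
#   -f  also trace the process's other threads
#   -tt prefix each line with a timestamp
#   -o  write the trace to a file (hypothetical default name below)
attach_strace() {
    strace -f -tt -p "$1" -o "${2:-strace_$1.log}"
}
```

Usage would be something like `proc_state 1234` followed by `attach_strace 1234`, where 1234 stands in for the pid of the hung q process.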

If there's nothing useful in the strace output once it gets into this state, you could try running it with strace from the beginning, with something like

nohup strace -s 99999 -o ~/str_out.log q myprog.q > ~/q_out.log 2>&1 < /dev/null &

and you may be able to glean something useful in the run-up to the issue. 

If it's not obvious already whether it's _really_ stuck, you could run a function within kdb on a timer (e.g. log a timestamp to disk), and see if that continues working. That might help narrow down the issue to the Docker (i.e. Linux containers) networking stack, and/or its signal-handling.
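The watching side of that timer check could look roughly like this. It assumes the q process rewrites some heartbeat file from its timer (the path and the 5-second default threshold are hypothetical), and that GNU `stat` is available. The helper names are invented for this sketch.

```shell
# Print the number of seconds since FILE was last modified (GNU stat).
heartbeat_age() {
    now=$(date +%s)
    mtime=$(stat -c %Y "$1" 2>/dev/null) || return 1
    echo $(( now - mtime ))
}

# Exit 0 when the heartbeat is older than MAX_AGE seconds (default 5).
# A missing file is treated as stale.
heartbeat_stale() {
    age=$(heartbeat_age "$1") || return 0
    [ "$age" -gt "${2:-5}" ]
}
```

For example, `heartbeat_stale /tmp/q_heartbeat 5 && echo "timer stopped"` run from cron or a loop would flag the moment the timer stops firing. If the file keeps updating while hopen still hangs, that points at networking rather than the process itself.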




joshm...@yandex.com

Apr 3, 2016, 9:59:36 AM
to personal...@googlegroups.com

Thanks. I'll give strace a try. And it does appear to be really stuck. The
process was running a function on a timer once a second that sent data
to another process, which stopped receiving updates when this issue occurred.

On 3 April 2016 08:27 UTC, James Little <j...@jameslittle.me.uk> wrote:

> I'd try attaching strace to the pid concerned, to see if it's stuck waiting
> on the return of a syscall. Normally you would see regular select
> <http://linux.die.net/man/2/select> calls, corresponding to the file