Plea for help diagnosing strange "signal: killed" issues in previously-working code

Russtopia

Mar 2, 2024, 12:59:27 PM3/2/24
to golang-nuts
Hi all,

Symptom: mysterious "signal: killed" occurrences with processes spawned from Go via exec.Cmd.Start()/Wait()

Actors:

'Server': Intel i5, running Funtoo 1.4 - 4GB RAM, 4GB swap
'Laptop': Intel Core i7 9thGen, running Devuan Chimaera - 15GB RAM, 15GB swap

A quite small program I developed (~900 lines of code in two .go files), which has been running just fine on 'Server' for years, spawns other programs using exec.Cmd.Start() and then monitors them with exec.Cmd.Wait(). It started acting very strangely a few days ago, after either upgrading from go 1.18 to go 1.22 (amd64) and/or some non-Go-related minor updates (regretfully I don't recall exactly what, but it was not a distro or kernel upgrade, nor a major glibc version bump or anything like that).
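
(For reference, the spawn/monitor code is essentially the standard Start()/Wait() idiom -- a minimal sketch with illustrative names, not the actual ~900-line tool:)

package main

import (
	"fmt"
	"os/exec"
)

// runJob is a minimal sketch of the spawn-and-monitor pattern described
// above (illustrative only, not the real tool's code).
func runJob(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("start failed: %w", err)
	}
	// Wait blocks until the child exits; if the child dies from a fatal
	// signal, the returned *exec.ExitError stringifies as e.g. "signal: killed".
	if werr := cmd.Wait(); werr != nil {
		return fmt.Errorf("job ended abnormally: %w", werr)
	}
	return nil
}

func main() {
	if err := runJob("/bin/sleep", "5"); err != nil {
		fmt.Println(err)
	}
}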

Tasks that used to launch smoothly from my Go program became erratically (then almost always) aborted jobs, with the status "signal: killed". I can watch the system with 'top' or 'htop' and don't see any obvious spikes in memory or CPU usage, and never have with this setup before either.

I spent a few evenings verifying 'Server' wasn't out of memory (oom_reaper), which many web searches suggest is the biggest cause of the above, and checking other system-related things, with no luck (no OOM logs in /var/log/kern.log, dmesg, etc.).

I even tried launching my Go program with some of the debug options mentioned here, but didn't see anything unusual:

https://github.com/golang/go/issues/31517

Everything else on 'Server' ran fine, so I began to suspect the unthinkable: could the new Go toolchain 1.22.0 I installed have some subtle issue with the server's runtime environment, or could there be a compiler bug?

Out of habit, after any Go update I usually rebuild this project using the new toolchain, which I believe I had done just before this issue began; so first I reverted to go1.18 on 'Server' (still there as a fallback) and rebuilt my tool; no help. I then tried clearing out all of ~/go/pkg/mod/*, refetching all dependencies, and rebuilding again. Still no help.

Then I tried building the project on another machine, 'Laptop', which had go1.21.0; on that machine it ran fine. So I copied *that* build of the tool back to 'Server', and it failed with "signal: killed" as well.

I tried the reverse ('Server' build of the project, made with go1.22.0, copied to 'Laptop' running go1.18.x), but that failed, as a GLIBC incompatibility prevented it from running. (I think Funtoo had a slightly newer GLIBC version than Devuan.)

Anyhow, in case it *was* a Go toolchain or C library issue beneath Go itself, I downloaded the latest go 1.22.0 source to 'Server' and rebuilt it there from scratch (all.bash); using that to rebuild my tool, I was happy to see it work normally again... for about 5 rounds of spawning programs, then the issue returned!

The 'Server' isn't very powerful, but it's definitely not out of memory and has run my tool reliably for a long time, until the last week:

$ free
               total        used        free      shared  buff/cache   available
Mem:         3928680     1334988      592232       29368     2001460     2277376
Swap:        4194300      136192     4058108

This issue even occurs spawning a simple 'do-nothing' shell script, which just loops sleeping for 5 seconds a few times before exiting. These processes/scripts run perfectly from the shell command line, without using my Go program to launch them.

What the heck is going on? How do I diagnose this?

Ian Lance Taylor

Mar 2, 2024, 1:05:06 PM3/2/24
to Russtopia, golang-nuts
On Sat, Mar 2, 2024 at 9:59 AM Russtopia <rma...@gmail.com> wrote:
>
> Symptom: mysterious "signal: killed" occurrences with processes spawned from Go via exec.Cmd.Start()/Wait()

The first step is to tell us the exact and complete error that you
see. "signal: killed" can have different causes, and the rest of the
information should help determine what is causing this one.

Ian

Robert Engels

Mar 2, 2024, 1:23:52 PM3/2/24
to Ian Lance Taylor, Russtopia, golang-nuts
I would also try reverting the Go version and ensuring that it continues to work. Other system libraries may have been updated.


Russtopia

Mar 2, 2024, 1:52:21 PM3/2/24
to Ian Lance Taylor, golang-nuts
Hi, I tried outputting the value of werr.(*exec.ExitError).Stderr, but it's empty.

Outputting all of werr.(*exec.ExitError) via

fmt.Printf("[job *ExitError:%+v]\n", werr.(*exec.ExitError))

..gives merely

[job *ExitError:signal: killed]
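
(Side note for anyone chasing a similar problem: the terminating signal can usually be recovered from the ExitError's embedded ProcessState -- it tells you *which* signal killed the child, though not who sent it. A rough sketch, assuming Linux/Unix, with illustrative names:)

package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// describeExit sketches how to pull the terminating signal out of the
// error returned by cmd.Wait() (Unix-only; names are illustrative).
func describeExit(werr error) {
	ee, ok := werr.(*exec.ExitError)
	if !ok {
		return
	}
	if ws, ok := ee.Sys().(syscall.WaitStatus); ok && ws.Signaled() {
		fmt.Printf("[job killed by signal %d (%v)]\n", ws.Signal(), ws.Signal())
	}
}

func main() {
	cmd := exec.Command("/bin/sleep", "60")
	_ = cmd.Start()
	_ = cmd.Process.Kill()   // simulate the mystery SIGKILL
	describeExit(cmd.Wait()) // prints: [job killed by signal 9 (killed)]
}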


Russtopia

Mar 2, 2024, 1:53:12 PM3/2/24
to Robert Engels, Ian Lance Taylor, golang-nuts
I have tried rebuilding with go1.18.6 and go1.15.15, with no difference.

Robert Engels

Mar 2, 2024, 2:40:15 PM3/2/24
to Russtopia, Ian Lance Taylor, golang-nuts
Please clarify - does it work using the older versions of Go?

Russtopia

Mar 2, 2024, 2:58:33 PM3/2/24
to Robert Engels, Ian Lance Taylor, golang-nuts
It no longer does... which suggests to me that something external has changed, but I have no clue what that might be -- the process being started by my Go tool runs just fine from a shell. And it *does* run fine on my laptop (which, granted, is beefier, but this server ran the tool just fine for years and I haven't done any big upgrades -- though it *is* possible some minor underlying package update has severely changed the environment somehow). Unfortunately I don't have a system-wide snapshot I can revert to.

Perhaps this will end up being a question of Linux diagnostics more than Go, but I haven't yet seen any way to tell *why* the process is being killed, whether it's due to some bug tickled by Go's exec or something else. The oom_reaper doesn't say a thing in the system logs; I don't see my free RAM or swap suddenly drop... I've even checked my server for rootkits out of paranoia :). Everything else on the system is just fine; I just cannot seem to run these scripts any more when launched from my Go program (again, even a 'do-nothing' script that just sleeps a few times, then completes with exit status 0, no longer works -- it just gets 'killed').

Russtopia

Mar 2, 2024, 3:00:51 PM3/2/24
to Robert Engels, Ian Lance Taylor, golang-nuts
... I should add that I have completely restarted the Go program many times during testing, so I don't think it's a case of some long-term 'leak' in the tool's own code, since it has been relaunched repeatedly and doesn't have any big 'state' to restore.

Robert Engels

Mar 2, 2024, 3:01:44 PM3/2/24
to Russtopia, Ian Lance Taylor, golang-nuts
I’m guessing some other library or package you installed has corrupted your Linux installation. Sadly, my suggestion would be a fresh install of Linux. 

Russtopia

Mar 2, 2024, 3:06:18 PM3/2/24
to Robert Engels, Ian Lance Taylor, golang-nuts
:( I was afraid someone would say that. Yes I worry something very strange must be going on. Sigh. There goes my weekend.

Kurtis Rader

Mar 2, 2024, 3:08:35 PM3/2/24
to Russtopia, golang-nuts
My recommendation is to run your program under the control of `strace`. The simplest way to do that is to rename the current program you're launching by adding something like a ".orig" extension. Then create a script with the name of the program containing this:

#!/bin/sh
# forward the original arguments to the real program
exec strace -o /tmp/strace.out -t /path/to/prog.orig "$@"



--
Kurtis Rader
Caretaker of the exceptional canines Junior and Hank

Russtopia

Mar 2, 2024, 4:51:29 PM3/2/24
to Robert Engels, Ian Lance Taylor, golang-nuts
SOLVED!

Thank you all for the helpful suggestions. Although it has turned out to be something totally different, and a teachable lesson in web app design...

This go tool of mine has a very simple web interface with controls for a set of jobs on the main page.
The jobs on this list can be run, viewed, and most importantly, cancelled via an endpoint of the form "/cancel/?id=nnnnnnn" ...
I have had the site up in a "demo" mode on the public internet at various times, including recently -- it turns out some crawl bots must have found it very recently, and they are following those /cancel/ links on the dashboard almost as soon as they appear -- they must be scanning at <5-second intervals to pick up the new unique jobIDs encoded in each 'cancel' link. Oops :)

The app is usually behind an auth page, but this time it wasn't. I'm not a web dev, so a rookie mistake, I suppose!
I guess I really should have a 'robots.txt' file served by my Go app to discourage this, and perhaps consider client session IDs to prevent outside crawlers from accidentally activating my app's link endpoints.
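
(For anyone curious, a minimal robots.txt handler with net/http might look like the sketch below -- illustrative only, and advisory at best: polite crawlers honour it, but it's no substitute for auth on state-changing endpoints.)

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Advisory only: well-behaved crawlers will honour this, but it is no
	// substitute for authentication on state-changing links like /cancel/.
	http.HandleFunc("/robots.txt", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain")
		fmt.Fprint(w, "User-agent: *\nDisallow: /\n")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}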

Thank you again, all.

Robert Engels

Mar 2, 2024, 6:32:15 PM3/2/24
to Russtopia, Ian Lance Taylor, golang-nuts
Glad you figured it out. Not certain how requests could crash a process like… I think you’d be better off configuring a maximum heap size rather than having the OOM killer kick in 

Russtopia

Mar 2, 2024, 8:27:06 PM3/2/24
to Robert Engels, Ian Lance Taylor, golang-nuts
Oh, the OOM killer wasn't kicking in at all, it turns out. My tool simply invokes the cancellation function supplied by

cmdCancelCtx, cmdCancelFunc := context.WithCancel(j.mainCtx)

... if the /cancel/?id=nnnnnnn endpoint of my web dashboard is hit any time after the Wait() call. The result from exec.Cmd.Wait() then comes back as "killed" regardless. My handler wasn't logging that something/someone had actually visited the cancel endpoint. The process was indeed being terminated 'intentionally', in the sense that it was done via this 'cancel' link on my web dashboard -- but by web crawl bots!
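
(This matches the documented exec behaviour: when the context passed to exec.CommandContext is cancelled, the child is killed -- by default with SIGKILL -- so Wait() reports just "signal: killed" with no further detail. A tiny sketch demonstrating it, names illustrative:)

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// Job started under a cancellable context; by default CommandContext
	// kills the child with SIGKILL once the context is cancelled.
	cmd := exec.CommandContext(ctx, "/bin/sleep", "60")
	_ = cmd.Start()

	// Simulate the /cancel/?id=... handler firing (e.g. a crawler hit it).
	go func() {
		time.Sleep(time.Second)
		cancel()
	}()

	fmt.Println(cmd.Wait()) // prints: signal: killed
}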

With the HTTP auth feature enabled, bots can't visit the tool's dashboard or any other app endpoints now, and even if they could, my endpoints are now guarded by a user-agent check that rejects anything with "bot" in the User-Agent string. That stopped the accidental activation of any function, especially cancellation of the jobs!
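
(Roughly what such a guard might look like is sketched below -- illustrative only; User-Agent strings are trivially spoofed, so the HTTP auth is the real protection. Making state-changing actions like 'cancel' POST-only rather than plain GET links also helps, since well-behaved crawlers generally only follow GETs.)

package main

import (
	"log"
	"net/http"
	"strings"
)

// rejectBots is a sketch of a User-Agent guard for state-changing
// endpoints (names illustrative; spoofable, so auth is still essential).
func rejectBots(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.Contains(strings.ToLower(r.UserAgent()), "bot") {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/cancel/", func(w http.ResponseWriter, r *http.Request) {
		// job-cancellation logic would live here
	})
	log.Fatal(http.ListenAndServe(":8080", rejectBots(mux)))
}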