Regular Crashes


Niklas Merz

Aug 28, 2018, 5:27:17 PM
to Perkeep
Hi everyone,

I have been running perkeepd on two servers for some time now. I built the binary myself and use one server as the primary and the other as a backup.

These Perkeep instances crash regularly for no reason I can see; the last log entries usually look completely normal. My primary often crashes when it tries to sync to the secondary after a crash, but the primary also has problems with other actions like Twitter imports and searches.

I am not able to keep Perkeep stable for any longer period of time. I would appreciate your help.

Thanks and regards
Niklas

Mathieu Lonjaret

Aug 28, 2018, 7:57:59 PM
to per...@googlegroups.com
Hi,

Can you be a bit more specific about anything please? Like a stack
trace, or a way to reproduce, or what your configuration and workflow
is, etc?
I mean, we can't help at all if we don't have at least something to start with.
Feel free to open separate issues on the tracker for each of the problems.

Cheers,
Mathieu
> --
> You received this message because you are subscribed to the Google Groups "Perkeep" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to perkeep+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Diego Medina

Aug 28, 2018, 9:37:31 PM
to per...@googlegroups.com
A one-time random guess: is your instance running out of memory, or is your server restarting?


Niklas Merz

Aug 29, 2018, 1:06:02 PM
to Perkeep
The server and RAM should be fine. I'm running perkeepd with systemd. How much RAM does your server have? Do you recommend any amount?

The problem is that I cannot find a way to reproduce this. It happens during normal operations like imports or syncs via the app. The log file also shows no problems: after a crash there are only normal entries like syncing or blob received. How can I get stack traces or debug this further? Is there a verbose log mode?

Thanks, and sorry for this stupid request. I know non-reproducible bugs are terrible to support.

Mathieu Lonjaret

Aug 29, 2018, 2:41:52 PM
to per...@googlegroups.com
On Wed, 29 Aug 2018 at 19:06, Niklas Merz <Nikla...@gmx.net> wrote:
>
> The server and RAM should be fine. I'm running perkeepd with systemd. How much RAM does your server have? Do you recommend any amount?

well, there's no good answer for that, as it depends on how much data you
have. But I do run a throwaway instance on an f1-micro GCE (600MB),
where I barely have any data, and I feel that's pretty close to being
the minimum.

> The problem is that I cannot find a way to reproduce this. It happens during normal operations like imports or syncs via the app. The log file also shows no problems: after a crash there are only normal entries like syncing or blob received. How can I get stack traces or debug this further? Is there a verbose log mode?

Not that I can remember, no. I mean, I think we could enable verbose
HTTP logging, but I doubt that would help.

> Thanks, and sorry for this stupid request. I know non-reproducible bugs are terrible to support.

Np. Let's try to make some progress.
Since you don't see any error or panic stack trace, I guess we can try
to check that you don't see any of the "regular" termination messages
either. So, at the end of the log, do you see any of the messages that
are in the handleSignals func from
https://perkeep.org/server/perkeepd/perkeepd.go ?
Secondly, could you add a logging message at the very end of this same
file, right after
  if testHookServerUp.call(); !testHookWaitToShutdown.call() {
    select {}
  }
?
Normally, we should never get past this select{}. So if by any chance
we do, it would be interesting to know about it.


Niklas Merz

Aug 29, 2018, 5:29:52 PM
to per...@googlegroups.com

> well, there's no good answer for that, as it depends on how much data you
> have. But I do run a throwaway instance on an f1-micro GCE (600MB),
> where I barely have any data, and I feel that's pretty close to being
> the minimum.
 
Ok, my VM with 2 GB RAM for about 20 GB of data in Perkeep should be fine. I ran some uploads and watched the usage, and it looked good.

> Np. Let's try to make some progress.
> Since you don't see any error or panic stack trace, I guess we can try
> to check that you don't see any of the "regular" termination messages
> either. So, at the end of the log, do you see any of the messages that
> are in the handleSignals func from
> https://perkeep.org/server/perkeepd/perkeepd.go ?
> Secondly, could you add a logging message at the very end of this same
> file, right after
>   if testHookServerUp.call(); !testHookWaitToShutdown.call() {
>     select {}
>   }
> ?
> Normally, we should never get past this select{}. So if by any chance
> we do, it would be interesting to know about it.

Ok, I will try that and watch out for another termination.

Thank you and I will report back with results.

clive boulton

Aug 29, 2018, 6:49:39 PM
to per...@googlegroups.com
If everything else checks out: 2 GB of RAM suggests an older server. On that assumption, a failing CMOS battery can cause random reboots, as can malware that has wound its way into RAM. I recommend a new motherboard battery.


Mathieu Lonjaret

Aug 29, 2018, 7:11:44 PM
to per...@googlegroups.com
And in any case, next time it happens please do send the server logs
(in private if you prefer). I don't think there's any sensitive info
in there, but maybe double-check. We never know, I might spot
something that you somehow missed.

Euan Kemp

Aug 30, 2018, 8:05:08 AM
to per...@googlegroups.com
On Wed, Aug 29, 2018 at 11:29:38PM +0200, Niklas Merz wrote:
> Ok my VM with 2 GB RAM for about 20GB data in perkeep should be fine. I ran
> some uploads and watched the usage and it looked good.

A better way to check whether it's getting OOM-killed is to look at the
kmsg logs.

The output of "dmesg" will have it if it's recent, and if it's not
recent and has fallen out of the kernel's ring buffer, it should be in
whatever's collecting those logs (`journalctl -k` on many modern
machines).

If it's crashing in certain ways (e.g. segfault), it could also core
dump, at which point it's possible to find a record of that in some
cases. You can find more about this in "man 5 core".

On many modern machines, you can use `coredumpctl` to view any core
dumps that might have happened. To see if that output is accurate, you
should check "/proc/sys/kernel/core_pattern" to see if it references
systemd-coredump, or any other similar tool.


One other thing that might help is the exit code. You said you're
running it via systemd, so the following command should show the exit
code:

$ journalctl UNIT=perkeep.service -t systemd

This should show a line with something like "Main process exited, status=#"
Knowing that number could help a little bit.

You might need to configure your machine further to store kmsg output
and coredumps if it's not configured to, but such changes could help
track down the issue.

- Euan

Niklas Merz

Sep 1, 2018, 4:09:07 AM
to Perkeep
So it failed again.

The exit code is:
Aug 28 20:21:16 stuff systemd[1]: Started Perkeep Server.
-- Subject: Unit perkeep.service has finished start-up
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit perkeep.service has finished starting up.
--
-- The start-up result is done.
Aug 30 16:49:23 stuff systemd[1]: perkeep.service: Main process exited, code=killed, status=9/KILL
Aug 30 16:49:23 stuff systemd[1]: perkeep.service: Unit entered failed state.
Aug 30 16:49:23 stuff systemd[1]: perkeep.service: Failed with result 'signal'.

$ dmesg | grep perkeepd

[9339544.204149] perkeepd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null),  order=0, oom_score_adj=0
[9339544.204153] perkeepd cpuset=ns mems_allowed=0
[9339544.204156] CPU: 3 PID: 11500 Comm: perkeepd Tainted: P           O    4.13.16-2-pve #1
[9339544.204447] [11488]  1000 11488   459372   282365     638       6        0             0 perkeepd
[9339544.204450] Memory cgroup out of memory: Kill process 11488 (perkeepd) score 539 or sacrifice child
[9339544.205419] Killed process 11488 (perkeepd) total-vm:1837488kB, anon-rss:1129460kB, file-rss:0kB, shmem-rss:0kB
[9339544.258384] oom_reaper: reaped process 11488 (perkeepd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[9340611.483660] [15711]  1000 15711   682515   276764     707       7        0             0 perkeepd
[9340611.483664] Memory cgroup out of memory: Kill process 15711 (perkeepd) score 529 or sacrifice child
[9340611.484713] Killed process 15711 (perkeepd) total-vm:2730060kB, anon-rss:1107056kB, file-rss:0kB, shmem-rss:0kB
[9341096.168476] perkeepd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null),  order=0, oom_score_adj=0
[9341096.168478] perkeepd cpuset=ns mems_allowed=0
[9341096.168482] CPU: 3 PID: 18250 Comm: perkeepd Tainted: P           O    4.13.16-2-pve #1
[9341096.168836] [18246]  1000 18246   421195   280089     619       6        0             0 perkeepd
[9341096.168842] Memory cgroup out of memory: Kill process 18246 (perkeepd) score 535 or sacrifice child
[9341096.169788] Killed process 18246 (perkeepd) total-vm:1684780kB, anon-rss:1120356kB, file-rss:0kB, shmem-rss:0kB
[9341096.228024] oom_reaper: reaped process 18246 (perkeepd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[9501543.900541] perkeepd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null),  order=0, oom_score_adj=0
[9501543.900543] perkeepd cpuset=ns mems_allowed=0
[9501543.900547] CPU: 3 PID: 22466 Comm: perkeepd Tainted: P           O    4.13.16-2-pve #1
[9501543.900832] [22465]  1000 22465   412239   265550     648       6    11826             0 perkeepd
[9501543.900845] Memory cgroup out of memory: Kill process 22465 (perkeepd) score 530 or sacrifice child
[9501543.901793] Killed process 22465 (perkeepd) total-vm:1648956kB, anon-rss:1062200kB, file-rss:0kB, shmem-rss:0kB
[9501543.956730] oom_reaper: reaped process 22465 (perkeepd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

So it ran out of memory, right? I will allocate more RAM to this VM and try running it again.

Brad Fitzpatrick

Sep 1, 2018, 4:49:29 AM
to per...@googlegroups.com
Yup, out of memory.


Mathieu Lonjaret

Sep 1, 2018, 1:18:51 PM
to per...@googlegroups.com
And you're saying that in 'journalctl -u perkeep', at the time of
perkeep's death, there's no sign of the OOM? I'm asking because
it's usually pretty obvious just from looking at perkeep's output.
If that's the case, it's something to keep in mind for future
debugging of Perkeep under systemd.

Brad Fitzpatrick

Sep 1, 2018, 1:44:27 PM
to per...@googlegroups.com
The kernel is sending it an uncatchable SIGKILL. It has no chance to log.

Mathieu Lonjaret

Sep 1, 2018, 1:48:34 PM
to per...@googlegroups.com
Hmm, so what is it that makes it behave that way? If I force Perkeep
to OOM (not on systemd), it clearly shows up in the output here. Is my
kernel not behaving the same way because Perkeep is not running
through systemd?

Brad Fitzpatrick

Sep 1, 2018, 1:50:48 PM
to per...@googlegroups.com
What type of "OOM" are you referring to and how did you cause it? What do your logs show when you do so?

Mathieu Lonjaret

Sep 1, 2018, 1:52:32 PM
to per...@googlegroups.com
I make it allocate forever in a loop.
Hold on, I'll reproduce it again in a minute.

Mathieu Lonjaret

Sep 1, 2018, 1:57:46 PM
to per...@googlegroups.com
2018/09/01 19:55:57 server: available at https://home.granivo.re/ui/
2018/09/01 19:55:57 http: TLS handshake error from 192.168.1.9:40429: EOF
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x1390c86, 0x16)
/home/mpl/go1/src/runtime/panic.go:608 +0x72
runtime.sysMap(0xc0b0000000, 0x48000000, 0x24fa1d8)
/home/mpl/go1/src/runtime/mem_linux.go:156 +0xc7
runtime.(*mheap).sysAlloc(0x24df200, 0x48000000, 0x0, 0x7f7ff0ff8cf8)
/home/mpl/go1/src/runtime/malloc.go:619 +0x1c7
runtime.(*mheap).grow(0x24df200, 0x22b31, 0x0)
/home/mpl/go1/src/runtime/mheap.go:920 +0x42
runtime.(*mheap).allocSpanLocked(0x24df200, 0x22b31, 0x24fa1e8, 0x0)
/home/mpl/go1/src/runtime/mheap.go:848 +0x337
runtime.(*mheap).alloc_m(0x24df200, 0x22b31, 0xc000050101, 0x1)
etc.

Brad Fitzpatrick

Sep 1, 2018, 1:58:52 PM
to per...@googlegroups.com
If the Go runtime can't get memory from the kernel, Go can crash itself with a log message.

If Go can allocate but the kernel overcommitted itself (on by default), then the kernel picks a process to kill.

Mathieu Lonjaret

Sep 1, 2018, 2:00:26 PM
to per...@googlegroups.com
Ah, I see, that's subtle. Thanks.

Euan Kemp

Sep 1, 2018, 2:01:23 PM
to per...@googlegroups.com
You can reproduce a kernel OOM-kill of a specific process using oom_score_adj and the magic SysRq hotkey or file.

For example, to make pid $pid get oomkilled, you can do the following:

$ echo 1000 | sudo tee /proc/$pid/oom_score_adj
$ echo f | sudo tee /proc/sysrq-trigger

This assumes you haven't adjusted the oom score for any other pids significantly up (unlikely) and that you have CONFIG_MAGIC_SYSRQ in your kernel (very likely).

Another way to do it would be to run the process in a specific cgroup (e.g. with cgexec) and enforce a very low memory limit on that cgroup, and that's probably a bit more robust though also a little more complicated.

Mathieu Lonjaret

Sep 1, 2018, 2:02:27 PM
to per...@googlegroups.com
Following that reasoning, I'm thinking that if I choose a small enough
increment to allocate (that one was doing increments of 100M), I might
be able to get into the second case, right?

Mathieu Lonjaret

Sep 1, 2018, 2:07:30 PM
to per...@googlegroups.com
well ok, but it would be more interesting to be able to reproduce
Niklas' system behaviour without any such tricks.

Meister Roerich

Sep 1, 2018, 2:18:58 PM
to per...@googlegroups.com
Can I provide you with something to test this? I am using Debian 9 in an LXC container on Proxmox with a very simple systemd script.


Niklas Merz

Sep 1, 2018, 2:20:21 PM
to Perkeep
Can I provide you with something to test this? I am using Debian 9 in an LXC container on Proxmox with a very simple systemd script.

(The previous email was possibly me posting from a stupid mail account, please ignore.)

Euan Kemp

Sep 1, 2018, 2:23:09 PM
to per...@googlegroups.com, Niklas Merz
> well ok, but it would be more interesting to be able to reproduce Niklas' system behaviour without any such tricks.
The oom-killer being invoked is the oom-killer being invoked. It's not really a trick, and the main point was to clarify why perkeep can't/isn't logging in this scenario.

I agree this little venture into the exact details of the OOM mechanics isn't really interesting for this issue.

I think our next steps are the following:

1. Get pprof heap output from before a crash to see if there are any memory leaks in Perkeep, and fix said leak; or
2. Declare that for the workload in question, more memory is needed.

Niklas, if you're familiar with reading pprof output, you could add the environment variable `CAMLI_HTTP_PPROF=true` to your Perkeep instance, take regular snapshots of "/debug/pprof/heap", and use "go tool pprof" to check whether memory leaks show up in the period before a crash. For security reasons, be sure you're not exposing that endpoint to the outside world.

I think that's our best bet for a concrete next step, though if someone else has a better idea, please do suggest it :)

Mathieu Lonjaret

Sep 1, 2018, 2:23:27 PM
to per...@googlegroups.com
Well this new information in itself is already important, and probably
sufficient. :-)

Mathieu Lonjaret

Sep 1, 2018, 2:35:34 PM
to per...@googlegroups.com
On Sat, 1 Sep 2018 at 20:23, Euan Kemp <euan...@gmail.com> wrote:
>
> > well ok, but it would be more interesting to be able to reproduce Niklas' system behaviour without any such tricks.
> The oom-killer being invoked is the oom-killer being invoked. It's not really a trick, and the main point was to clarify why perkeep can't/isn't logging in this scenario.

Well, the point failed then.

> I agree this little venture into the exact details of the OOM mechanics isn't really interesting for this issue.

I did not say that. It is interesting to me.

Theodore Y. Ts'o

Sep 1, 2018, 8:27:31 PM
to per...@googlegroups.com
On Sat, Sep 01, 2018 at 11:20:21AM -0700, Niklas Merz wrote:
> Can I provide you something to test this? I am using a Debian 9 in a LXC
> container on Proxmox with a very simple systemd script.

Are you setting up an explicit or implicit memory limit on your
containers?

If you are constraining your container so it has a tiny amount of
memory, that would certainly explain why the kernel is OOM killing it.

Quoting from: https://stgraber.org/2016/03/26/lxd-2-0-resource-control-412/

All limits can also be inherited through profiles in which case
each affected container will be constrained by that limit. That
is, if you set limits.memory=256MB in the default profile, every
container using the default profile (typically all of them) will
have a memory limit of 256MB.

- Ted

Euan Kemp

Sep 1, 2018, 8:34:07 PM
to per...@googlegroups.com
> If you are constraining your container so it has a tiny amount of memory, that would certainly explain why the kernel is OOM killing it.

Note that we can already answer that from the dmesg output:

[9501543.900845] Memory cgroup out of memory: Kill process 22465 (perkeepd) score 530 or sacrifice child
[9501543.901793] Killed process 22465 (perkeepd) total-vm:1648956kB, anon-rss:1062200kB, file-rss:0kB, shmem-rss:0kB

The fact that it says "Memory cgroup out of memory" rather than "Out of memory" at the beginning indeed means it's a cgroup limit. We can also see that it was using about 1 GB of resident memory, so it's reasonable to assume the cgroup's memory limit is somewhat above 1 GB.

That's not enough to answer whether this is a memory leak or not though. 1GiB seems like a lot of memory for perkeep, but it also might be okay for some workloads, so without information about what that memory's being used for, we can't be sure whether anything's really wrong or not.

Niklas Merz

Sep 13, 2018, 6:08:50 AM
to Perkeep
With 4 GB of RAM delegated to this container, it runs more stably. That seems a bit much to me, but maybe it's only the case with LXC containers on Proxmox. Maybe I will try running Perkeep on bare metal or a Raspberry Pi to check.

Mathieu Lonjaret

Sep 13, 2018, 10:10:06 AM
to per...@googlegroups.com
Last time I checked (but it was a long time ago), RPis were very low
on memory, and doing anything on them was very slow. For example, I
think even building Perkeep itself was not possible. So I wouldn't get
my hopes up for Perkeep on an RPi.

Niklas Merz

Sep 13, 2018, 10:28:53 AM
to per...@googlegroups.com
I could try cross-compiling and running it on an RPi 3 with 1 GB RAM, just for fun.

I try to keep RAM as low as possible on virtual machines. Maybe a QEMU VM behaves differently; I will try that too. Thanks for the help.