http://code.google.com/p/memcached/wiki/ReleaseNotes1418
Well, that learns me for trying to write software without the 10+ VM
buildbots...
The i386 one, can you include the output of "stats settings", and also
manually run: "lru_crawler enable" (or start with -o lru_crawler) then run
"stats settings" again please? Really weird that it fails there, but not
the lines before it looking for the "OK" while enabling it.
On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39
from 3 to 8 and try again? I was trying to be clever but that may not be
working out.
Uhh... is your cross compile goofed?
> On Sat, Apr 19, 2014 at 12:43 PM, dormando <dorm...@rydia.net> wrote:
> Well, that learns me for trying to write software without the 10+ VM
> buildbots...
>
> The i386 one, can you include the output of "stats settings", and also
> manually run: "lru_crawler enable" (or start with -o lru_crawler) then run
> "stats settings" again please? Really weird that it fails there, but not
> the lines before it looking for the "OK" while enabling it.
>
>
> As soon as I type "lru_crawler enable", memcached crashes. I see this in dmesg.
>
> [189571.108397] traps: memcached-debug[31776] general protection ip:f7749988 sp:f47ff2d8 error:0 in libpthread-2.19.so[f7739000+18000]
> [189969.840918] traps: memcached-debug[2600] general protection ip:7f976510a1c8 sp:7f976254aed8 error:0 in libpthread-2.19.so[7f97650f9000+18000]
> [195892.554754] traps: memcached-debug[31871] general protection ip:f76f0988 sp:f46ff2d8 error:0 in libpthread-2.19.so[f76e0000+18000]
>
> Starting with "-o lru_crawler" also crashes.
>
> [195977.276379] traps: memcached-debug[2182] general protection ip:f7738988 sp:f75782d8 error:0 in libpthread-2.19.so[f7728000+18000]
>
> This is running both 32 bit and 64 bit executables on the same build box; note in the above dmesg output that two of them appear to be from 32-bit
> processes, and we also see a crash in what looks a lot like a 64 bit pointer address, if I'm reading this right...
Any chance you could start the memcached-debug binary under gdb and then
crash it the same way? Get a full stack trace.
Thinking if I even have a 32bit host left somewhere to test with... will
have to spin up the VM's later, but a stacktrace might be enlightening
anyway.
Thanks!
>
> On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39
> from 3 to 8 and try again? I was trying to be clever but that may not be
> working out.
>
>
> Didn't change anything, same two failures with the same output listed.

I feel like something's a bit different between your two tests. In the
first set, it's definitely not crashing for the 64bit test, but not
working either. Is something weird going on with the second set of tests?
You noted it seems to be running a 32bit binary still.
On Sat, Apr 19, 2014 at 1:45 PM, dormando <dorm...@rydia.net> wrote:
Any chance you could start the memcached-debug binary under gdb and then
crash it the same way? Get a full stack trace.
Thinking if I even have a 32bit host left somewhere to test with... will
have to spin up the VM's later, but a stacktrace might be enlightening
anyway.
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf7dbfb40 (LWP 7)]
0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
(gdb) bt
#0 0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
#1 0xf7f790e0 in __pthread_mutex_unlock_usercnt () from /usr/lib/libpthread.so.0
#2 0xf7f79bff in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#3 0x08061bfe in item_crawler_thread ()
#4 0xf7f75f20 in start_thread () from /usr/lib/libpthread.so.0
#5 0xf7ead94e in clone () from /usr/lib/libc.so.6
https://github.com/dormando/memcached/tree/crawler_fix
Can you try this? The lock elision might've made my "undefined behavior"
mistake of not holding a lock before initially waiting on the condition
fatal.
A further fix might be required, as it's possible someone could kill the
do_etc flag before the thread fully starts and it'd drop out with the lock
held. That would be an incredible feat though.
>> Thanks!

Thanks, I'm just trying to reason why it's failing in two different ways.
>
> >
> > On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39
> > from 3 to 8 and try again? I was trying to be clever but that may not be
> > working out.
> >
> >
> > Didn't change anything, same two failures with the same output listed.
>
> I feel like something's a bit different between your two tests. In the
> first set, it's definitely not crashing for the 64bit test, but not
> working either. Is something weird going on with the second set of tests?
> You noted it seems to be running a 32bit binary still.
>
> I'm willing to ignore the 64-bit failures for now until we figure out the 32-bit ones.
>
> In any case, I wouldn't blame the cross-compile or toolchain, these have all been built in very clean, single architecture systemd-nspawn chroots.
The initial failure of finding 90 items when it expected 60 is a timing
glitch, the other ones are this thread crashing the daemon.
> One machine was an i7 with TSX, thus the lock elision segfaults. The other is a much older Core2 machine. Enough differences there to cause
> problems, especially if we are dealing with threading-type things?

Can you give me a summary of what the core2 machine gave you? I've built
on a core2duo and a nehalem i7 and they all work fine. I've also torture
tested it on a brand new 16-core (2x8) Xeon.
> On the i7 machine, I think we're still experiencing segfaults. Running just the LRU test; note the two "undef" values showing up again:
>
Ok. I might still be goofing the lock somewhere. Can you see if memcached
is crashing at all during these tests? Inside the test script you can see
it's just a few raw commands to copy/paste and try yourself.
You can also use an environment variable to start a memcached external to
the tests within a debugger:
    if ($ENV{T_MEMD_USE_DAEMON}) {
        my ($host, $port) = ($ENV{T_MEMD_USE_DAEMON} =~
            m/^([^:]+):(\d+)$/);
T_MEMD_USE_DAEMON="127.0.0.1:11211" or something, I think. Haven't used
that in a while.
Makes no goddamn sense. Maybe the fix below will.. fix it.
> On Sat, Apr 19, 2014 at 2:45 PM, dormando <dorm...@rydia.net> wrote:
> > One machine was an i7 with TSX, thus the lock elision segfaults. The other is a much older Core2 machine. Enough differences there to
> cause
> > problems, especially if we are dealing with threading-type things?
>
> Can you give me a summary of what the core2 machine gave you? I've built
> on a core2duo and nehalem i7 and they all work fine. I've also torture
> tested it on a brand new 16 core (2x8) xeon.
>
>
> I ran the test suite on the Core2 a number of times (at least 5). Sometimes it completes without failure, other times I still get these two
> failures. This is with `sleep 3` changed to `sleep 8`.
>
> # Failed test 'slab1 now has 60 used chunks'
> # at t/lru-crawler.t line 57.
> # got: '90'
> # expected: '60'
>
> # Failed test 'slab1 has 30 reclaims'
> # at t/lru-crawler.t line 59.
> # got: '0'
> # expected: '30'
> # Looks like you failed 2 tests of 189.
> t/lru-crawler.t ...... Dubious, test returned 2 (wstat 512, 0x200)
> Failed 2/189 subtests
Please try again. Wonder if I can somehow fund getting a Haswell NUC
bought just for my build VM's. Will TSX work within a VM..?
None of the other places I intend to run build VM's have lock elision...
Thanks for your patience on this. It's been a huge help!
> Once I wrapped my head around it, figured this one out. This cheap patch "fixes" the test, although I'm not sure it is the best actual solution. Because we don't set the lru_crawler_running flag on the main thread, but in the LRU thread itself, we have a race condition here. pthread_create() is by no means required to actually start the thread right away or
> schedule it, so the test itself asks too quickly if the LRU crawler is running, before the auxiliary thread has had the time to mark it as running. The sleep ensures we at least give that thread time to start.
>
> (Debugged by way of adding a print to STDERR statement in the while(1) loop. The only time I saw the test actually pass was when that loop caught and repeated itself for a while. It failed when it only ran once, which would make sense if the thread hadn't actually set the flag yet.)

Ahh okay. Weird that you're able to see that, as the crawl command signals
the thread. Hmm... no easy way to tell if it *had* fired or if it's not
yet fired.
The parts I thought really hard about seem to be doing okay, but the
scaffolding I apparently goofed fairly bad, heh.
I just pushed another commit to the crawler_fix tree, can you try it and
see if it works with an unmodified test?