1.4.18


dormando

Apr 17, 2014, 7:28:24 PM
to memc...@googlegroups.com

Dan McGee

Apr 19, 2014, 12:51:08 PM
to memc...@googlegroups.com
On Thursday, April 17, 2014 6:28:24 PM UTC-5, Dormando wrote:
http://code.google.com/p/memcached/wiki/ReleaseNotes1418

I just tried building the Arch Linux package for this and got failures when running the test suite. This was the output from the 32-bit i686 build; I saw the same results building for x86_64. Let me know what other relevant information might help.

#   Failed test at t/lru-crawler.t line 45.
#          got: undef
#     expected: 'yes'
t/lru-crawler.t ......
Failed 96/189 subtests
t/lru.t .............. ok
t/maxconns.t ......... ok
t/multiversioning.t .. ok
t/noreply.t .......... ok
t/slabs_reassign.t ... ok
t/stats-conns.t ...... ok
t/stats-detail.t ..... ok
t/stats.t ............ ok
t/touch.t ............ ok
t/udp.t .............. ok
t/unixsocket.t ....... ok
t/whitespace.t ....... skipped: Skipping tests probably because you don't have git.

Test Summary Report
-------------------
t/lru-crawler.t    (Wstat: 13 Tests: 94 Failed: 1)
  Failed test:  94
  Non-zero wait status: 13
  Parse errors: Bad plan.  You planned 189 tests but ran 94.
Files=48, Tests=6982, 113 wallclock secs ( 0.76 usr  0.05 sys +  2.27 cusr  0.35 csys =  3.43 CPU)
Result: FAIL
Makefile:1376: recipe for target 'test' failed
make: *** [test] Error 1
==> ERROR: A failure occurred in check().
    Aborting...



Running out of a git checkout on x86_64, I get slightly different results:

t/item_size_max.t .... ok
t/line-lengths.t ..... ok
t/lru-crawler.t ...... 93/189
#   Failed test 'slab1 now has 60 used chunks'
#   at t/lru-crawler.t line 57.
#          got: '90'
#     expected: '60'

#   Failed test 'slab1 has 30 reclaims'
#   at t/lru-crawler.t line 59.
#          got: '0'
#     expected: '30'
# Looks like you failed 2 tests of 189.
t/lru-crawler.t ...... Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/189 subtests
t/lru.t .............. ok
t/maxconns.t ......... ok
t/multiversioning.t .. ok
t/noreply.t .......... ok
t/slabs_reassign.t ... ok
t/stats-conns.t ...... ok
t/stats-detail.t ..... ok
t/stats.t ............ ok
t/touch.t ............ ok
t/udp.t .............. ok
t/unixsocket.t ....... ok
t/whitespace.t ....... 1/120
#   Failed test '0001-Support-V-version-option.patch (see devtools/clean-whitespace.pl)'
#   at t/whitespace.t line 40.
t/whitespace.t ....... 27/120 # Looks like you failed 1 test of 120.
t/whitespace.t ....... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/120 subtests

Test Summary Report
-------------------
t/lru-crawler.t    (Wstat: 512 Tests: 189 Failed: 2)
  Failed tests:  96-97
  Non-zero exit status: 2
t/whitespace.t     (Wstat: 256 Tests: 120 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
Files=48, Tests=7193, 115 wallclock secs ( 1.39 usr  0.15 sys +  5.39 cusr  1.02 csys =  7.95 CPU)
Result: FAIL
Makefile:1482: recipe for target 'test' failed
make: *** [test] Error 1

 
$ git describe
1.4.18

$ uname -a
Linux galway 3.14.1-1-ARCH #1 SMP PREEMPT Mon Apr 14 20:40:47 CEST 2014 x86_64 GNU/Linux

$ gcc --version
gcc (GCC) 4.8.2 20140206 (prerelease)
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

dormando

Apr 19, 2014, 1:43:25 PM
to memc...@googlegroups.com
Well, that learns me for trying to write software without the 10+ VM
buildbots...

The i386 one, can you include the output of "stats settings", and also
manually run: "lru_crawler enable" (or start with -o lru_crawler) then run
"stats settings" again please? Really weird that it fails there, but not
the lines before it looking for the "OK" while enabling it.

On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39
from 3 to 8 and try again? I was trying to be clever but that may not be
working out.

Thanks! At least there're still people trying to maintain it for some
distros...
> --
>
> ---
> You received this message because you are subscribed to the Google Groups "memcached" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to memcached+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
>

Dan McGee

Apr 19, 2014, 2:28:49 PM
to memc...@googlegroups.com
On Sat, Apr 19, 2014 at 12:43 PM, dormando <dorm...@rydia.net> wrote:
Well, that learns me for trying to write software without the 10+ VM
buildbots...

The i386 one, can you include the output of "stats settings", and also
manually run: "lru_crawler enable" (or start with -o lru_crawler) then run
"stats settings" again please? Really weird that it fails there, but not
the lines before it looking for the "OK" while enabling it.

As soon as I type "lru_crawler enable", memcached crashes. I see this in dmesg.

[189571.108397] traps: memcached-debug[31776] general protection ip:f7749988 sp:f47ff2d8 error:0 in libpthread-2.19.so[f7739000+18000]
[189969.840918] traps: memcached-debug[2600] general protection ip:7f976510a1c8 sp:7f976254aed8 error:0 in libpthread-2.19.so[7f97650f9000+18000]
[195892.554754] traps: memcached-debug[31871] general protection ip:f76f0988 sp:f46ff2d8 error:0 in libpthread-2.19.so[f76e0000+18000]

Starting with "-o lru_crawler" also crashes.

[195977.276379] traps: memcached-debug[2182] general protection ip:f7738988 sp:f75782d8 error:0 in libpthread-2.19.so[f7728000+18000]

This is running both 32 bit and 64 bit executables on the same build box; note in the above dmesg output that two of them appear to be from 32-bit processes, and we also see a crash in what looks a lot like a 64 bit pointer address, if I'm reading this right...


On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39
from 3 to 8 and try again? I was trying to be clever but that may not be
working out.

Didn't change anything, same two failures with the same output listed.
 
You received this message because you are subscribed to a topic in the Google Groups "memcached" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/memcached/Tw6t_W-a6Xc/unsubscribe.
To unsubscribe from this group and all its topics, send an email to memcached+...@googlegroups.com.

dormando

Apr 19, 2014, 2:45:32 PM
to memc...@googlegroups.com
> On Sat, Apr 19, 2014 at 12:43 PM, dormando <dorm...@rydia.net> wrote:
> Well, that learns me for trying to write software without the 10+ VM
> buildbots...
>
> The i386 one, can you include the output of "stats settings", and also
> manually run: "lru_crawler enable" (or start with -o lru_crawler) then run
> "stats settings" again please? Really weird that it fails there, but not
> the lines before it looking for the "OK" while enabling it.
>
>
> As soon as I type "lru_crawler enable", memcached crashes. I see this in dmesg.
>
> [189571.108397] traps: memcached-debug[31776] general protection ip:f7749988 sp:f47ff2d8 error:0 in libpthread-2.19.so[f7739000+18000]
> [189969.840918] traps: memcached-debug[2600] general protection ip:7f976510a1c8 sp:7f976254aed8 error:0 in libpthread-2.19.so[7f97650f9000+18000]
> [195892.554754] traps: memcached-debug[31871] general protection ip:f76f0988 sp:f46ff2d8 error:0 in libpthread-2.19.so[f76e0000+18000]
>
> Starting with "-o lru_crawler" also crashes.
>
> [195977.276379] traps: memcached-debug[2182] general protection ip:f7738988 sp:f75782d8 error:0 in libpthread-2.19.so[f7728000+18000]
>
> This is running both 32 bit and 64 bit executables on the same build box; note in the above dmesg output that two of them appear to be from 32-bit
> processes, and we also see a crash in what looks a lot like a 64 bit pointer address, if I'm reading this right...

Uhh... is your cross compile goofed?

Any chance you could start the memcached-debug binary under gdb and then
crash it the same way? Get a full stack trace.

Thinking if I even have a 32bit host left somewhere to test with... will
have to spin up the VM's later, but a stacktrace might be enlightening
anyway.

Thanks!

>
> On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39
> from 3 to 8 and try again? I was trying to be clever but that may not be
> working out.
>
>
> Didn't change anything, same two failures with the same output listed.

I feel like something's a bit different between your two tests. In the
first set, it's definitely not crashing for the 64bit test, but not
working either. Is something weird going on with the second set of tests?
You noted it seems to be running a 32bit binary still.

dormando

Apr 19, 2014, 2:51:41 PM
to memc...@googlegroups.com
Er... reading comprehension fail. I meant "64bit binary still" at the
bottom there.

Dan McGee

Apr 19, 2014, 3:06:03 PM
to memc...@googlegroups.com
On Sat, Apr 19, 2014 at 1:45 PM, dormando <dorm...@rydia.net> wrote:
> On Sat, Apr 19, 2014 at 12:43 PM, dormando <dorm...@rydia.net> wrote:
>       Well, that learns me for trying to write software without the 10+ VM
>       buildbots...
>
>       The i386 one, can you include the output of "stats settings", and also
>       manually run: "lru_crawler enable" (or start with -o lru_crawler) then run
>       "stats settings" again please? Really weird that it fails there, but not
>       the lines before it looking for the "OK" while enabling it.
>
>
> As soon as I type "lru_crawler enable", memcached crashes. I see this in dmesg.
>
> [189571.108397] traps: memcached-debug[31776] general protection ip:f7749988 sp:f47ff2d8 error:0 in libpthread-2.19.so[f7739000+18000]
> [189969.840918] traps: memcached-debug[2600] general protection ip:7f976510a1c8 sp:7f976254aed8 error:0 in libpthread-2.19.so[7f97650f9000+18000]
> [195892.554754] traps: memcached-debug[31871] general protection ip:f76f0988 sp:f46ff2d8 error:0 in libpthread-2.19.so[f76e0000+18000]
>
> Starting with "-o lru_crawler" also crashes.
>
> [195977.276379] traps: memcached-debug[2182] general protection ip:f7738988 sp:f75782d8 error:0 in libpthread-2.19.so[f7728000+18000]
>
> This is running both 32 bit and 64 bit executables on the same build box; note in the above dmesg output that two of them appear to be from 32-bit
> processes, and we also see a crash in what looks a lot like a 64 bit pointer address, if I'm reading this right...

Uhh... is your cross compile goofed?

Any chance you could start the memcached-debug binary under gdb and then
crash it the same way? Get a full stack trace.

Thinking if I even have a 32bit host left somewhere to test with... will
have to spin up the VM's later, but a stacktrace might be enlightening
anyway.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf7dbfb40 (LWP 7)]
0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
#1  0xf7f790e0 in __pthread_mutex_unlock_usercnt () from /usr/lib/libpthread.so.0
#2  0xf7f79bff in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#3  0x08061bfe in item_crawler_thread ()
#4  0xf7f75f20 in start_thread () from /usr/lib/libpthread.so.0
#5  0xf7ead94e in clone () from /usr/lib/libc.so.6
 

Thanks!

>
>       On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39
>       from 3 to 8 and try again? I was trying to be clever but that may not be
>       working out.
>
>
> Didn't change anything, same two failures with the same output listed.

I feel like something's a bit different between your two tests. In the
first set, it's definitely not crashing for the 64bit test, but not
working either. Is something weird going on with the second set of tests?
You noted it seems to be running a 32bit binary still.

I'm willing to ignore the 64-bit failures for now until we figure out the 32-bit ones.

In any case, I wouldn't blame the cross-compile or toolchain, these have all been built in very clean, single architecture systemd-nspawn chroots.

-Dan

Dan McGee

Apr 19, 2014, 3:13:09 PM
to memc...@googlegroups.com
On Sat, Apr 19, 2014 at 2:06 PM, Dan McGee <dpm...@gmail.com> wrote:
On Sat, Apr 19, 2014 at 1:45 PM, dormando <dorm...@rydia.net> wrote:

Any chance you could start the memcached-debug binary under gdb and then
crash it the same way? Get a full stack trace.

Thinking if I even have a 32bit host left somewhere to test with... will
have to spin up the VM's later, but a stacktrace might be enlightening
anyway.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf7dbfb40 (LWP 7)]
0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
#1  0xf7f790e0 in __pthread_mutex_unlock_usercnt () from /usr/lib/libpthread.so.0
#2  0xf7f79bff in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#3  0x08061bfe in item_crawler_thread ()
#4  0xf7f75f20 in start_thread () from /usr/lib/libpthread.so.0
#5  0xf7ead94e in clone () from /usr/lib/libc.so.6
 

And here is a backtrace with some debug symbols, not sure what is relevant to output but I picked the obvious pthread-related variables. `bt full` shows basically nothing extra.


Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf7dbfb40 (LWP 7)]
0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
#1  0xf7f790e0 in __pthread_mutex_unlock_usercnt () from /usr/lib/libpthread.so.0
#2  0xf7f79bff in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#3  0x08061bfe in item_crawler_thread (arg=0x0) at items.c:770

#4  0xf7f75f20 in start_thread () from /usr/lib/libpthread.so.0
#5  0xf7ead94e in clone () from /usr/lib/libc.so.6
(gdb) bt full

#0  0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
No symbol table info available.

#1  0xf7f790e0 in __pthread_mutex_unlock_usercnt () from /usr/lib/libpthread.so.0
No symbol table info available.

#2  0xf7f79bff in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
No symbol table info available.
#3  0x08061bfe in item_crawler_thread (arg=0x0) at items.c:770
        i = <optimized out>

#4  0xf7f75f20 in start_thread () from /usr/lib/libpthread.so.0
No symbol table info available.

#5  0xf7ead94e in clone () from /usr/lib/libc.so.6
No symbol table info available.
(gdb) p lru_crawler_cond
$1 = {__data = {__lock = 1, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x8081590 <lru_crawler_lock>, __nwaiters = 0, __broadcast_seq = 0},
  __size = "\001", '\000' <repeats 31 times>, "\220\025\b\b", '\000' <repeats 11 times>, __align = 1}
(gdb) p lru_crawler_lock
$2 = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 256, __nusers = 0, {d = {__espins = 0, __elision = 3}, __list = {__next = 0x30000}}},
  __size = '\000' <repeats 13 times>, "\001\000\000\000\000\000\000\000\000\003", __align = 0}
 

dormando

Apr 19, 2014, 3:18:08 PM
to memc...@googlegroups.com
Holy crap lock elision. I have one machine with a haswell chip here, but
I'll have to USB boot. Is getting an Arch liveimage especially time
consuming?

https://github.com/dormando/memcached/tree/crawler_fix

Can you try this? The lock elision might've made my "undefined behavior"
mistake of not holding a lock before initially waiting on the condition
fatal.

A further fix might be required, as it's possible someone could kill the
do_etc flag before the thread fully starts and it'd drop out with the lock
held. That would be an incredible feat though.
 
>
> Thanks!
>
> >
> >       On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39
> >       from 3 to 8 and try again? I was trying to be clever but that may not be
> >       working out.
> >
> >
> > Didn't change anything, same two failures with the same output listed.
>
> I feel like something's a bit different between your two tests. In the
> first set, it's definitely not crashing for the 64bit test, but not
> working either. Is something weird going on with the second set of tests?
> You noted it seems to be running a 32bit binary still.
>
> I'm willing to ignore the 64-bit failures for now until we figure out the 32-bit ones.
>
> In any case, I wouldn't blame the cross-compile or toolchain, these have all been built in very clean, single architecture systemd-nspawn chroots.

Thanks, I'm just trying to reason why it's failing in two different ways.
The initial failure of finding 90 items when it expected 60 is a timing
glitch, the other ones are this thread crashing the daemon.

Dan McGee

Apr 19, 2014, 3:38:08 PM
to memc...@googlegroups.com
Not at all; if you download the latest install ISO (https://www.archlinux.org/download/) it is a live CD and you can boot straight into an Arch environment. You can do an install if you want, or just run live and install any necessary packages (`pacman -S base-devel gdb`) and go from there.
 

https://github.com/dormando/memcached/tree/crawler_fix

Can you try this? The lock elision might've made my "undefined behavior"
mistake of not holding a lock before initially waiting on the condition
fatal.

A further fix might be required, as it's possible someone could kill the
do_etc flag before the thread fully starts and it'd drop out with the lock
held. That would be an incredible feat though.

The good news is that, now that we've traced this to lock elision, both 64-bit and 32-bit builds (including one straight from git and outside the normal packaging build machinery) blow up in the same place. No segfault after applying this patch, so we've made progress.
 
>
>       Thanks!
>
>       >
>       >       On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39
>       >       from 3 to 8 and try again? I was trying to be clever but that may not be
>       >       working out.
>       >
>       >
>       > Didn't change anything, same two failures with the same output listed.
>
> I feel like something's a bit different between your two tests. In the
> first set, it's definitely not crashing for the 64bit test, but not
> working either. Is something weird going on with the second set of tests?
> You noted it seems to be running a 32bit binary still.
>
> I'm willing to ignore the 64-bit failures for now until we figure out the 32-bit ones.
>
> In any case, I wouldn't blame the cross-compile or toolchain, these have all been built in very clean, single architecture systemd-nspawn chroots.

Thanks, I'm just trying to reason why it's failing in two different ways.
The initial failure of finding 90 items when it expected 60 is a timing
glitch, the other ones are this thread crashing the daemon.

One machine was an i7 with TSX, thus the lock elision segfaults. The other is a much older Core2 machine. Enough differences there to cause problems, especially if we are dealing with threading-type things?

On the i7 machine, I think we're still experiencing segfaults. Running just the LRU test; note the two "undef" values showing up again:

$ prove t/lru-crawler.t
t/lru-crawler.t .. 93/189

#   Failed test 'slab1 now has 60 used chunks'
#   at t/lru-crawler.t line 57.
#          got: '90'
#     expected: '60'

#   Failed test 'slab1 has 30 reclaims'
#   at t/lru-crawler.t line 59.
#          got: '0'
#     expected: '30'

#   Failed test 'disabled lru crawler'
#   at t/lru-crawler.t line 69.
#          got: undef
#     expected: 'OK
# '

#   Failed test at t/lru-crawler.t line 72.
#          got: undef
#     expected: 'no'
# Looks like you failed 4 tests of 189.
t/lru-crawler.t .. Dubious, test returned 4 (wstat 1024, 0x400)
Failed 4/189 subtests


Changing the `sleep 3` to `sleep 8` gives non-deterministic results; two runs in a row were different.

$ prove t/lru-crawler.t
t/lru-crawler.t .. 93/189

#   Failed test 'slab1 now has 60 used chunks'
#   at t/lru-crawler.t line 57.
#          got: '90'
#     expected: '60'

#   Failed test 'slab1 has 30 reclaims'
#   at t/lru-crawler.t line 59.
#          got: '0'
#     expected: '30'

#   Failed test 'ifoo29 == 'ok''
#   at /home/dan/memcached/t/lib/MemcachedTest.pm line 59.
#          got: undef
#     expected: 'VALUE ifoo29 0 2
# ok
# END
# '
t/lru-crawler.t .. Failed 10/189 subtests

Test Summary Report
-------------------
t/lru-crawler.t (Wstat: 13 Tests: 182 Failed: 3)
  Failed tests:  96-97, 182
  Non-zero wait status: 13
  Parse errors: Bad plan.  You planned 189 tests but ran 182.
Files=1, Tests=182,  8 wallclock secs ( 0.03 usr  0.00 sys +  0.04 cusr  0.00 csys =  0.07 CPU)
Result: FAIL


$ prove t/lru-crawler.t
t/lru-crawler.t .. 93/189

#   Failed test 'slab1 now has 60 used chunks'
#   at t/lru-crawler.t line 57.
#          got: '90'
#     expected: '60'

#   Failed test 'slab1 has 30 reclaims'
#   at t/lru-crawler.t line 59.
#          got: '0'
#     expected: '30'

#   Failed test 'sfoo28 == <undef>'
#   at /home/dan/memcached/t/lib/MemcachedTest.pm line 53.
#          got: undef
#     expected: 'END
# '
t/lru-crawler.t .. Failed 11/189 subtests

Test Summary Report
-------------------
t/lru-crawler.t (Wstat: 13 Tests: 181 Failed: 3)
  Failed tests:  96-97, 181
  Non-zero wait status: 13
  Parse errors: Bad plan.  You planned 189 tests but ran 181.
Files=1, Tests=181,  8 wallclock secs ( 0.02 usr  0.00 sys +  0.03 cusr  0.00 csys =  0.05 CPU)
Result: FAIL

-Dan

dormando

Apr 19, 2014, 3:45:18 PM
to memc...@googlegroups.com
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 0xf7dbfb40 (LWP 7)]
> > 0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
> > (gdb) bt
> > #0  0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
> > #1  0xf7f790e0 in __pthread_mutex_unlock_usercnt () from /usr/lib/libpthread.so.0
> > #2  0xf7f79bff in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
> > #3  0x08061bfe in item_crawler_thread ()
> > #4  0xf7f75f20 in start_thread () from /usr/lib/libpthread.so.0
> > #5  0xf7ead94e in clone () from /usr/lib/libc.so.6
>
> Holy crap lock elision. I have one machine with a haswell chip here, but
> I'll have to USB boot. Is getting an Arch liveimage especially time
> consuming?
>
>
> Not at all; if you download the latest install ISO (https://www.archlinux.org/download/) it is a live CD and you can boot straight into an Arch
> environment. You can do an install if you want, or just run live and install any necessary packages (`pacman -S base-devel gdb`) and go from there.

Okay, seems like I'll have to give it a shot since this still isn't
working well.
 
>
> https://github.com/dormando/memcached/tree/crawler_fix
>
> Can you try this? The lock elision might've made my "undefined behavior"
> mistake of not holding a lock before initially waiting on the condition
> fatal.
>
> A further fix might be required, as it's possible someone could kill the
> do_etc flag before the thread fully starts and it'd drop out with the lock
> held. That would be an incredible feat though.
>
>
> The good news here is now that we found our way to lock elision, both 64-bit and 32-bit builds (including one straight from git and outside the
> normal packaging build machinery) blow up in the same place. No segfault after applying this patch, so we've made progress.

I love progress.

> >
> >       Thanks!
> >
> >       >
> >       >       On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39
> >       >       from 3 to 8 and try again? I was trying to be clever but that may not be
> >       >       working out.
> >       >
> >       >
> >       > Didn't change anything, same two failures with the same output listed.
> >
> > I feel like something's a bit different between your two tests. In the
> > first set, it's definitely not crashing for the 64bit test, but not
> > working either. Is something weird going on with the second set of tests?
> > You noted it seems to be running a 32bit binary still.
> >
> > I'm willing to ignore the 64-bit failures for now until we figure out the 32-bit ones.
> >
> > In any case, I wouldn't blame the cross-compile or toolchain, these have all been built in very clean, single architecture
> systemd-nspawn chroots.
>
> Thanks, I'm just trying to reason why it's failing in two different ways.
> The initial failure of finding 90 items when it expected 60 is a timing
> glitch, the other ones are this thread crashing the daemon.
>
>
> One machine was an i7 with TSX, thus the lock elision segfaults. The other is a much older Core2 machine. Enough differences there to cause
> problems, especially if we are dealing with threading-type things?

Can you give me a summary of what the core2 machine gave you? I've built
on a core2duo and nehalem i7 and they all work fine. I've also torture
tested it on a brand new 16 core (2x8) xeon.

Ok. I might still be goofing the lock somewhere. Can you see if memcached
is crashing at all during these tests? Inside the test script you can see
it's just a few raw commands to copy/paste and try yourself.

You can also use an environment variable to start a memcached external to
the tests within a debugger:
    if ($ENV{T_MEMD_USE_DAEMON}) {
        my ($host, $port) = ($ENV{T_MEMD_USE_DAEMON} =~
m/^([^:]+):(\d+)$/);

T_MEMD_USE_DAEMON="127.0.0.1:11211" or something, I think. haven't used
that in a while.

Thanks!

Dan McGee

Apr 19, 2014, 3:57:27 PM
to memc...@googlegroups.com
On Sat, Apr 19, 2014 at 2:45 PM, dormando <dorm...@rydia.net> wrote:
> One machine was an i7 with TSX, thus the lock elision segfaults. The other is a much older Core2 machine. Enough differences there to cause
> problems, especially if we are dealing with threading-type things?

Can you give me a summary of what the core2 machine gave you? I've built
on a core2duo and nehalem i7 and they all work fine. I've also torture
tested it on a brand new 16 core (2x8) xeon.

I ran the test suite on the Core2 a number of times (at least 5). Sometimes it completes without failure, other times I still get these two failures. This is with `sleep 3` changed to `sleep 8`.

#   Failed test 'slab1 now has 60 used chunks'
#   at t/lru-crawler.t line 57.
#          got: '90'
#     expected: '60'

#   Failed test 'slab1 has 30 reclaims'
#   at t/lru-crawler.t line 59.
#          got: '0'
#     expected: '30'
# Looks like you failed 2 tests of 189.
t/lru-crawler.t ...... Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/189 subtests
 
> On the i7 machine, I think we're still experiencing segfaults. Running just the LRU test; note the two "undef" values showing up again:
>

Ok. I might still be goofing the lock somewhere. Can you see if memcached
is crashing at all during these tests? Inside the test script you can see
it's just a few raw commands to copy/paste and try yourself.

You can also use an environment variable to start a memcached external to
the tests within a debugger:
    if ($ENV{T_MEMD_USE_DAEMON}) {
        my ($host, $port) = ($ENV{T_MEMD_USE_DAEMON} =~
m/^([^:]+):(\d+)$/);

T_MEMD_USE_DAEMON="127.0.0.1:11211" or something, I think. haven't used
that in a while.


Simple repro, running standalone, no other commands have been issued:
$ nc localhost 11211
lru_crawler enable
OK
lru_crawler crawl 1
OK
lru_crawler disable

SIGSEGV happens; here is the backtrace (surprised to see it in start_thread...):

(gdb) bt
#0  0x00007ffff79881c8 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
#1  0x00007ffff7982fc7 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#2  0x0000000000414f61 in item_crawler_thread (arg=<optimized out>) at items.c:771
#3  0x00007ffff797f0a2 in start_thread () from /usr/lib/libpthread.so.0
#4  0x00007ffff76b4d1d in clone () from /usr/lib/libc.so.6

It does NOT segfault if you run enable immediately followed by disable, with no `crawl 1` in between.

dormando

Apr 19, 2014, 6:23:16 PM
to memc...@googlegroups.com
> On Sat, Apr 19, 2014 at 2:45 PM, dormando <dorm...@rydia.net> wrote:
> > One machine was an i7 with TSX, thus the lock elision segfaults. The other is a much older Core2 machine. Enough differences there to
> cause
> > problems, especially if we are dealing with threading-type things?
>
> Can you give me a summary of what the core2 machine gave you? I've built
> on a core2duo and nehalem i7 and they all work fine. I've also torture
> tested it on a brand new 16 core (2x8) xeon.
>
>
> I ran the test suite on the Core2 a number of times (at least 5). Sometimes it completes without failure, other times I still get these two
> failures. This is with `sleep 3` changed to `sleep 8`.
>
> #   Failed test 'slab1 now has 60 used chunks'
> #   at t/lru-crawler.t line 57.
> #          got: '90'
> #     expected: '60'
>
> #   Failed test 'slab1 has 30 reclaims'
> #   at t/lru-crawler.t line 59.
> #          got: '0'
> #     expected: '30'
> # Looks like you failed 2 tests of 189.
> t/lru-crawler.t ...... Dubious, test returned 2 (wstat 512, 0x200)
> Failed 2/189 subtests

Makes no goddamn sense. Maybe the fix below will.. fix it.

Good lord I suck at this. I really wish I could make that
pthread_cond_wait "undefined" behavior actually error out so I don't test
this on 3+ platforms and then have it error out elsewhere :/

Just force-pushed this:
https://github.com/dormando/memcached/tree/crawler_fix

At some point I'd refactored it and didn't push the unlock far enough
south. Now it actually unlocks when it's stopping the thread...

Please try again. Wonder if I can somehow fund getting a haswell NUC
bought just for my build VM's. Will TSX work within a VM..?

None of the other places I intend to run build VM's have lock elision...

Thanks for your patience on this. It's been a huge help!

Dan McGee

Apr 19, 2014, 6:53:12 PM
to memc...@googlegroups.com
On Sat, Apr 19, 2014 at 5:23 PM, dormando <dorm...@rydia.net> wrote:
> On Sat, Apr 19, 2014 at 2:45 PM, dormando <dorm...@rydia.net> wrote:
>       > One machine was an i7 with TSX, thus the lock elision segfaults. The other is a much older Core2 machine. Enough differences there to
>       cause
>       > problems, especially if we are dealing with threading-type things?
>
> Can you give me a summary of what the core2 machine gave you? I've built
> on a core2duo and nehalem i7 and they all work fine. I've also torture
> tested it on a brand new 16 core (2x8) xeon.
>
>
> I ran the test suite on the Core2 a number of times (at least 5). Sometimes it completes without failure, other times I still get these two
> failures. This is with `sleep 3` changed to `sleep 8`.
>
> #   Failed test 'slab1 now has 60 used chunks'
> #   at t/lru-crawler.t line 57.
> #          got: '90'
> #     expected: '60'
>
> #   Failed test 'slab1 has 30 reclaims'
> #   at t/lru-crawler.t line 59.
> #          got: '0'
> #     expected: '30'
> # Looks like you failed 2 tests of 189.
> t/lru-crawler.t ...... Dubious, test returned 2 (wstat 512, 0x200)
> Failed 2/189 subtests

Makes no goddamn sense. Maybe the fix below will.. fix it.

Once I wrapped my head around it, I figured this one out. This cheap patch "fixes" the test, although I'm not sure it is the best actual solution. Because we don't set the lru_crawler_running flag on the main thread, but in the LRU thread itself, we have a race condition here. pthread_create() is by no means required to actually start the thread right away or schedule it, so the test itself asks too quickly if the LRU crawler is running, before the auxiliary thread has had the time to mark it as running. The sleep ensures we at least give that thread time to start.

(Debugged by way of adding a print to STDERR statement in the while(1) loop. The only time I saw the test actually pass was when that loop caught and repeated itself for a while. It failed when it only ran once, which would make sense if the thread hadn't actually set the flag yet.)

diff --git a/t/lru-crawler.t b/t/lru-crawler.t
index 8c82623..9b1c7e7 100644
--- a/t/lru-crawler.t
+++ b/t/lru-crawler.t
@@ -47,6 +47,9 @@ is(scalar <$sock>, "OK\r\n", "enabled lru crawler");

 print $sock "lru_crawler crawl 1\r\n";
 is(scalar <$sock>, "OK\r\n", "kicked lru crawler");
+
+sleep 1;
+
 while (1) {
     my $stats = mem_stats($sock);
     last unless $stats->{lru_crawler_running};
 
This makes things happy on the TSX-using machine. Awesome!
 

Please try again. Wonder if I can somehow fund getting a haswell NUC
bought just for my build VM's. Will TSX work within a VM..?
I don't know on this one.


None of the other places I intend to run build VM's have lock elision...

Thanks for your patience on this. It's been a huge help!

dormando

Apr 19, 2014, 7:05:14 PM4/19/14
to memc...@googlegroups.com
>
> Once I wrapped my head around it, figured this one out. This cheap patch "fixes" the test, although I'm not sure it is the best actual solution. Because we don't set the lru_crawler_running flag on the main thread, but in the LRU thread itself, we have a race condition here. pthread_create() is by no means required to actually start the thread right away or
> schedule it, so the test itself asks too quickly if the LRU crawler is running, before the auxiliary thread has had the time to mark it as running. The sleep ensures we at least give that thread time to start.
>
> (Debugged by way of adding a print to STDERR statement in the while(1) loop. The only time I saw the test actually pass was when that loop caught and repeated itself for a while. It failed when it only ran once, which would make sense if the thread hadn't actually set the flag yet.)

Ahh okay. Weird that you're able to see that, as the crawl command signals
the thread. Hmm... no easy way to tell if it *had* fired or if it's not
yet fired.

The parts I thought really hard about seem to be doing okay, but the
scaffolding I apparently goofed fairly bad, heh.

I just pushed another commit to the crawler_fix tree, can you try it and
see if it works with an unmodified test?

> This makes things happy on the TSX-using machine. Awesome!

Huzzah. I'd simply inverted the usage of the condition by accident. Had to
catch both ends of it.
 
>
> Please try again. Wonder if I can somehow fund getting a haswell NUC
> bought just for my build VM's. Will TSX work within a VM..?
>
> I don't know on this one.
>

I will.. find out.

Dan McGee

Apr 19, 2014, 7:12:38 PM4/19/14
to memc...@googlegroups.com
On Sat, Apr 19, 2014 at 6:05 PM, dormando <dorm...@rydia.net> wrote:
>
> Once I wrapped my head around it, figured this one out. This cheap patch "fixes" the test, although I'm not sure it is the best actual solution. Because we don't set the lru_crawler_running flag on the main thread, but in the LRU thread itself, we have a race condition here. pthread_create() is by no means required to actually start the thread right away or
> schedule it, so the test itself asks too quickly if the LRU crawler is running, before the auxiliary thread has had the time to mark it as running. The sleep ensures we at least give that thread time to start.
>
> (Debugged by way of adding a print to STDERR statement in the while(1) loop. The only time I saw the test actually pass was when that loop caught and repeated itself for a while. It failed when it only ran once, which would make sense if the thread hadn't actually set the flag yet.)

Ahh okay. Weird that you're able to see that, as the crawl command signals
the thread. Hmm... no easy way to tell if it *had* fired or if it's not
yet fired.

The parts I thought really hard about seem to be doing okay, but the
scaffolding I apparently goofed fairly bad, heh.

I just pushed another commit to the crawler_fix tree, can you try it and
see if it works with an unmodified test?

We're good to go now, as far as I can tell. Ran the LRU test about 10 times on both machines I've been using today and it works every time now; no problems with the full test suite at this point either.

dormando

Apr 19, 2014, 7:17:53 PM4/19/14
to memc...@googlegroups.com
Cool, thanks again. I just pushed these changes to master. I kinda want to
find some other stuff to put in before shoveling out a .19 though. Are you
a packager for Arch? Can you ship .18 with the patches?

Dan McGee

Apr 19, 2014, 7:48:51 PM4/19/14
to memc...@googlegroups.com
Yep, I'm a packager (and the one responsible for the memcached package). Shipping .18 with these patches won't be a problem at all.

Thanks again for your help!

-Dan