The last week I tried to debug a nasty (but reproducable :D) bug in our
code probably caused by some kind of memory bug somewhere. During this I
tried valgrind, but somehow could not get it to connect to any other
peer at all.
So my next try was ElectricFence which is a library which reimplements
malloc/free so that the allocated memory lies at a page boundary with a
free page after it, causing any overwrite to crash.
However I could not get to the problem, it always crashed much earlier,
and I could reproduce this in the pingpong example, full backtrace after
the mail and at http://pastebin.com/EkWc3EuS (with proper syntax
highlighting and longer lines, ignore the second short back trace
there). It crashes somewhere in the Identifier code. Note that we also
had issues with not being able to compare identifiers returned by ariba
properly, this could be another bug, the same bug, or simply a mistake
on our side, I worked around it by converting the node ids to strings,
which fixed the problem. Still have to investigate that one, my initial
belief was that the additional padding in Identifier was not cleared
correctly, but this does not seem to be the case.
This problem is quite annoying as it completely blocks using advanced
memory debuggers with ariba. Valgrind finds 2-3 other invalid
operations, one might want to look at these as well.
Did anybody make similar observations, or completely different ones?
Regards,
Mathias
Stack trace, ariba pingpong example, 0.7.0, completely unmodified,
crashes only after some successfully messages exchanged:
[... pages of debug output here ...]
8042 [0x7ffff656c700] INFO PingPong null - pinging overlay neighbors
with ping id 6
8043 [0x7ffff656c700] INFO PingPong null - ----> I am nodeone and I know
nodetwo
8044 [0x7ffff656c700] INFO PingPong null - sending ping message to
ca2e99e978ad3f1e60c28b4ffc1570a139a8fa0800000000
8112 [0x7ffff656c700] WARN PeriodicBroadcast null - failed sending
message through ipv6 socket
8878 [0x7ffff656c700] INFO PingPong null - received ping message on link
a4429b1ea33d72602c31cfdb6f320642502694df1df175a3 from node
ca2e99e978ad3f1e60c28b4ffc1570a139a8fa0800000000: ping pong message id 141
9043 [0x7ffff656c700] INFO PingPong null - pinging overlay neighbors
with ping id 7
9043 [0x7ffff656c700] INFO PingPong null - sending ping message to
ca2e99e978ad3f1e60c28b4ffc1570a139a8fa0800000000
9112 [0x7ffff656c700] WARN PeriodicBroadcast null - failed sending
message through ipv6 socket
9918 [0x7ffff656c700] INFO BaseOverlay null - --- link state
-------------------------------
9919 [0x7ffff656c700] INFO BaseOverlay null - LINK_STATE: 0: up=1 init=1
id=e782 serv=0 node=ca2e relaying=1 miss=0 auto=0 | Direct: using
id=7e65 (up=1) ip{192.168.0.101 | 127.0.0.1 | 127.0.0.1};tcp{5003};
9920 [0x7ffff656c700] INFO BaseOverlay null - LINK_STATE: 1: up=1 init=0
id=a442 serv=111 node=ca2e relaying=0 miss=0 auto=0 | Direct: using
id=05fc (up=1) ip{192.168.0.101 | 127.0.0.1 | 127.0.0.1};tcp{5003};
9921 [0x7ffff656c700] INFO BaseOverlay null - LINK_STATE: 2: up=1 init=1
id=0c1b serv=111 node=ca2e relaying=0 miss=0 auto=1 | Direct: using
id=933f (up=1) ip{192.168.0.101 | 127.0.0.1 | 127.0.0.1};tcp{5003};
9922 [0x7ffff656c700] INFO BaseOverlay null - LINK_STATE: 3: up=0 init=0
id=4db5 serv=4294967295 node=<uns relaying=0 miss=4 auto=0 | Direct:
using id=4db5 (up=1)
9922 [0x7ffff656c700] INFO BaseOverlay null - LINK_STATE: 4: up=1 init=0
id=a65c serv=0 node=ca2e relaying=1 miss=0 auto=0 | Direct: using
id=a65c (up=1) ip{192.168.0.101 | 127.0.0.1 | 127.0.0.1};tcp{5003};
9923 [0x7ffff656c700] INFO BaseOverlay null -
----------------------------------------------
9988 [0x7ffff656c700] INFO Chord null - Running stabilization: #links=1
#neighbors=1
9990 [0x7ffff656c700] INFO Chord null - Discovered nodes:
9991 [0x7ffff656c700] INFO Chord null - *
141a1183ffbdca4ca87b429b1db333b570d6a7b400000000
9991 [0x7ffff656c700] INFO Chord null - *
ca2e99e978ad3f1e60c28b4ffc1570a139a8fa0800000000
9991 [0x7ffff656c700] INFO Chord null - Running orphan removal
10043 [0x7ffff656c700] INFO PingPong null - pinging overlay neighbors
with ping id 8
10043 [0x7ffff656c700] INFO PingPong null - sending ping message to
ca2e99e978ad3f1e60c28b4ffc1570a139a8fa0800000000
10112 [0x7ffff656c700] WARN PeriodicBroadcast null - failed sending
message through ipv6 socket
10920 [0x7ffff656c700] INFO BaseOverlay null - Link connection request
is stale, closing: up=0 init=0 id=4db5 serv=4294967295 node=<uns
relaying=0 miss=5 auto=0 | Direct: using id=4db5 (up=1)
10921 [0x7ffff656c700] INFO BaseOverlay null - Link timed out. Dropping
up=0 init=0 id=4db5 serv=4294967295 node=<uns relaying=0 miss=5 auto=0 |
Direct: using id=4db5 (up=1)
10921 [0x7ffff656c700] INFO BaseOverlay null - Dropping link (initiated
locally):4db58aa854d6a3b4a2cd97d4731bba1e0c49936a19680d2c
10922 [0x7ffff656c700] INFO BaseOverlay null - onLinkDown descriptor:
up=0 init=0 id=4db5 serv=4294967295 node=<uns relaying=0 miss=5 auto=0 |
Direct: using id=4db5 (up=1)
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff656c700 (LWP 8348)]
ariba::utility::Identifier::compareTo (this=0x7fffedf57d58, compKey=...)
at utility/types/Identifier.cpp:680
680 if( compKey.isUnspec == false && isUnspec == false )
Missing separate debuginfos, use: debuginfo-install
ElectricFence-2.2.2-31.fc15.x86_64 apr-1.4.5-1.fc15.x86_64
apr-util-1.3.12-1.fc15.x86_64 boost-regex-1.46.0-3.fc15.x86_64
boost-system-1.46.0-3.fc15.x86_64 boost-thread-1.46.0-3.fc15.x86_64
cyrus-sasl-lib-2.1.23-16.fc15.x86_64 db4-4.8.30-3.fc15.x86_64
expat-2.0.1-11.fc15.x86_64 gmp-4.3.2-3.fc15.x86_64
keyutils-libs-1.2-7.fc15.x86_64 krb5-libs-1.9.1-5.fc15.x86_64
libcom_err-1.41.14-2.fc15.x86_64 libicu-4.4.2-7.fc15.x86_64
libselinux-2.0.99-4.fc15.x86_64 libuuid-2.19.1-1.3.fc15.x86_64
log4cxx-0.10.0-10.fc15.x86_64 nspr-4.8.8-1.fc15.x86_64
nss-3.12.10-1.fc15.x86_64 nss-softokn-freebl-3.12.10-2.fc15.x86_64
nss-util-3.12.10-1.fc15.x86_64 openldap-2.4.24-2.fc15.x86_64
openssl-1.0.0d-1.fc15.x86_64 zlib-1.2.5-3.fc15.x86_64
(gdb) bt
#0 ariba::utility::Identifier::compareTo (this=0x7fffedf57d58,
compKey=...) at utility/types/Identifier.cpp:680
#1 0x00007ffff7b4fa28 in operator== (compKey=..., this=0x7fffedf57d58)
at utility/types/Identifier.cpp:326
#2 ariba::utility::Identifier::operator== (this=0x7fffedf57d58,
compKey=...) at utility/types/Identifier.cpp:321
#3 0x00007ffff7ac21c7 in ariba::overlay::BaseOverlay::eraseDescriptor
(this=0x7ffff5499cb0, link=..., communication=false) at
overlay/BaseOverlay.cpp:298
#4 0x00007ffff7ac2413 in ariba::overlay::BaseOverlay::dropLink
(this=0x7ffff5499cb0, link=<optimized out>) at overlay/BaseOverlay.cpp:1042
#5 0x00007ffff7ac950e in ariba::overlay::BaseOverlay::stabilizeLinks
(this=0x7ffff5499cb0) at overlay/BaseOverlay.cpp:403
#6 0x00007ffff7ad4201 in ariba::overlay::BaseOverlay::eventFunction
(this=0x7ffff5499cb0) at overlay/BaseOverlay.cpp:1966
#7 0x00007ffff7b486c4 in
ariba::utility::SystemQueue::QueueThread::threadFunc
(obj=0x7ffff7dd2160) at utility/system/SystemQueue.cpp:255
#8 0x00007ffff71e7b89 in thread_proxy () from
/usr/lib64/libboost_thread-mt.so.1.46.0
#9 0x0000003d96007af1 in start_thread (arg=0x7ffff656c700) at
pthread_create.c:305
#10 0x0000003d958dfb7d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) bt full
#0 ariba::utility::Identifier::compareTo (this=0x7fffedf57d58,
compKey=...) at utility/types/Identifier.cpp:680
No locals.
#1 0x00007ffff7b4fa28 in operator== (compKey=..., this=0x7fffedf57d58)
at utility/types/Identifier.cpp:326
No locals.
#2 ariba::utility::Identifier::operator== (this=0x7fffedf57d58,
compKey=...) at utility/types/Identifier.cpp:321
No locals.
#3 0x00007ffff7ac21c7 in ariba::overlay::BaseOverlay::eraseDescriptor
(this=0x7ffff5499cb0, link=..., communication=false) at
overlay/BaseOverlay.cpp:298
ld = 0x7fffedf57d20
i =
#4 0x00007ffff7ac2413 in ariba::overlay::BaseOverlay::dropLink
(this=0x7ffff5499cb0, link=<optimized out>) at overlay/BaseOverlay.cpp:1042
__PRETTY_FUNCTION__ = "void
ariba::overlay::BaseOverlay::dropLink(const ariba::utility::LinkID&)"
ld = 0x7fffee003d20
#5 0x00007ffff7ac950e in ariba::overlay::BaseOverlay::stabilizeLinks
(this=0x7ffff5499cb0) at overlay/BaseOverlay.cpp:403
ld = 0x7fffee003d20
_foreach_continue400 = false
oldlinks = std::vector of length 1, capacity 1 = {0x7fffee003d20}
now = <optimized out>
__PRETTY_FUNCTION__ = "void
ariba::overlay::BaseOverlay::stabilizeLinks()"
#6 0x00007ffff7ad4201 in ariba::overlay::BaseOverlay::eventFunction
(this=0x7ffff5499cb0) at overlay/BaseOverlay.cpp:1966
No locals.
#7 0x00007ffff7b486c4 in
ariba::utility::SystemQueue::QueueThread::threadFunc
(obj=0x7ffff7dd2160) at utility/system/SystemQueue.cpp:255
ev = {listener = 0x7ffff5499cc8, type = {static DEFAULT =
{static DEFAULT = <same as static member of an already seen type>, id = 0,
static ids =
{<std::_Vector_base<ariba::utility::SystemEventType::Descriptor,
std::allocator<ariba::utility::SystemEventType::Descriptor> >> = {
_M_impl =
{<std::allocator<ariba::utility::SystemEventType::Descriptor>> =
{<__gnu_cxx::new_allocator<ariba::utility::SystemEventType::Descriptor>>
= {<No data fields>}, <No data fields>}, _M_start = 0x7ffff65f9fc0,
_M_finish = 0x7ffff65f9ff0, _M_end_of_storage = 0x7ffff65fa000}}, <No
data fields>}}, id = 1,
static ids = <same as static member of an already seen
type>}, data = 0x0,
scheduledTime =
{<boost::date_time::base_time<boost::posix_time::ptime,
boost::date_time::counted_time_system<boost::date_time::counted_time_rep<boost::posix_time::millisec_posix_time_system_config>
> >> = {<boost::less_than_comparable<boost::posix_time::ptime,
boost::equality_comparable<boost::posix_time::ptime,
boost::posix_time::ptime,
boost::detail::empty_base<boost::posix_time::ptime>,
boost::detail::false_t>,
boost::detail::empty_base<boost::posix_time::ptime>,
boost::detail::true_t>> =
{<boost::less_than_comparable1<boost::posix_time::ptime,
boost::equality_comparable<boost::posix_time::ptime,
boost::posix_time::ptime,
boost::detail::empty_base<boost::posix_time::ptime>,
boost::detail::false_t> >> =
{<boost::equality_comparable<boost::posix_time::ptime,
boost::posix_time::ptime,
boost::detail::empty_base<boost::posix_time::ptime>,
boost::detail::false_t>> =
{<boost::equality_comparable1<boost::posix_time::ptime,
boost::detail::empty_base<boost::posix_time::ptime> >> =
{<boost::detail::empty_base<boost::posix_time::ptime>> = {<No data
fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <No
data fields>}, time_ = {time_count_ = {value_ = 212178174514956525}}},
<No data fields>}, delayTime = 0, remainingDelay = 0}
lock = {m = 0x7ffff7dd2188, is_locked = true}
__PRETTY_FUNCTION__ = "static void
ariba::utility::SystemQueue::QueueThread::threadFunc(ariba::utility::SystemQueue::QueueThread*)"
#8 0x00007ffff71e7b89 in thread_proxy () from
/usr/lib64/libboost_thread-mt.so.1.46.0
No symbol table info available.
#9 0x0000003d96007af1 in start_thread (arg=0x7ffff656c700) at
pthread_create.c:305
__res = <optimized out>
pd = 0x7ffff656c700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {1,
336293511465707841, 264509654752, 140737326270912, 0, 3,
-336312536103212735, 347319412242295105}, mask_was_saved = 0}}, priv = {
pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup =
0x0, canceltype = 0}}}
not_first_call = 0
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
#10 0x0000003d958dfb7d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
==8613== Address 0x614db4c is 0 bytes after a block of size 172 alloc'd
within the serializing code and
==8613== Address 0x612b340 is 64 bytes inside a block of size 736 free'd
in the identifier bug which the gdb backtrace belongs to.
Regards,
Mathias Gottschlag
> [*snip*]
... which is fixed by this patch (should apply to svn).
I have never seen such a convincing reason not to pass parameters as
references.
The remaining valgrind errors look like an off-by-one error in the
serialization code. Deeply hidden below two layers of macro magic which
make translating stack traces a nightmare, so I give up for today. -.-
Good night,
Mathias Gottschlag
@mathias: Yeah, sometimes references are a mess.
Anyway, thanks a lot for the investigation and the fix!!!
The serializer was a nice idea that turned out to be
too complex in the end.
@mario: bit-wise serialization means, all subsequent data, after accept,
come next really bitwise! I will investigate if this
is a problem with the bit-encoders...
In the meanwhile you should encode boolean types into a
bit-field, lets say a uint8_t or something different to
reduce the troubles.
Furthermore, the bit-wise encoding might read and write one byte
at over the end of the buffer, but does no changes to it in debug
mode (since left optimization do not remove that uneccessary operation).
So, I suggest not going after that "bug", when it does not
do anything evil!
Best regards,
Sebastian
> X&& accept&& T(key)&& skip;
> } sznEnd();
>
> #endif /* ACCEPT_MESSAGE_H_ */
>
> --------------------------------------------------------------------------
>
> If I create a new AcceptMessage like this:
> AcceptMessage m(true, "hello buggy", 100);
> all is fine:
>>> GOT BUG-Message: 1, hello buggy, 100
>
> But with odd numbers:
> AcceptMessage m(true, "hello buggy", 99);
>>> GOT BUG-Message: 1, hello buggy, 98
>
> changing the serialization code to this:
> sznBeginDefault(AcceptMessage, X)
> {
> X&& accept&& skip&& T(key);
> } sznEnd();
>
> solves the Problem for now..
>
> So, I assume a very strange bug in the serializer, but haven't
> investigated myself yet.
>
> The number also is only changed at the receiver-side. At the sender's
> the 99 stays untouched.
>
> Best, Mario
--
Karlsruher Institut für Technologie (KIT)
Institut für Telematik
Sebastian Mies
Wissenschaftlicher Mitarbeiter
Zirkel 2
Gebäude 20.20
76131 Karlsruhe
Telefon: +49 721 608-46416
Fax: +49 721 608-46789
E-Mail: mi...@kit.edu
Mathias
Am 23.07.2011 12:54, schrieb omn...@mario.n-ten.de:
> X&& accept&& T(key)&& skip;
> } sznEnd();
>
> #endif /* ACCEPT_MESSAGE_H_ */
>
> --------------------------------------------------------------------------
>
> If I create a new AcceptMessage like this:
> AcceptMessage m(true, "hello buggy", 100);
> all is fine:
>>> GOT BUG-Message: 1, hello buggy, 100
>
> But with odd numbers:
> AcceptMessage m(true, "hello buggy", 99);
>>> GOT BUG-Message: 1, hello buggy, 98
>
> changing the serialization code to this:
> sznBeginDefault(AcceptMessage, X)
> {
> X&& accept&& skip&& T(key);