Bug#688797: Dead lock on BDB and partial stop on slapd

Jose Manuel dos Santos Calhariz

unread,

Oct 4, 2012, 11:30:03 AM10/4/12

to

Following a previous bug report about a stopping slapd server, during
normal day work. As now we have a partially working slapd server, that
answers some queries but ignores others. Using db5.1_stat and gdb
was possible to get the following information:

1 - The BDB have an active dead lock, from db5.1_stat:

Locks grouped by lockers:
Locker Mode Count Status ----------------- Object ---------------
8002c7de dd=14 locks held 0 write locks 0 pid/thread 16486/140064000468736 priority 100
8002c7de READ 1 WAIT istPersonServices.bdb page 28
8002c7df dd=13 locks held 0 write locks 0 pid/thread 16486/140063983683328 priority 100
8002c7df READ 1 WAIT istPersonServices.bdb page 28
8002c7e0 dd=12 locks held 1 write locks 1 pid/thread 16486/140063992076032 priority 100
8002c7e0 WRITE 1 HELD istPersonServices.bdb page 28
(...)

2 - The thread 13 that owns the write lock has a very small stack
trace, from gdb:

Thread 13 (Thread 0x7f63307e5700 (LWP 22245)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f635ff2fc85 in ldap_int_thread_pool_wrapper (xpool=<optimized out>)
at ../../../../libraries/libldap_r/tpool.c:675
#2 0x00007f635e14e8ca in start_thread (arg=<optimized out>) at pthread_create.c:300
#3 0x00007f635deb592d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4 0x0000000000000000 in ?? ()

3 - This slapd daemon is a slave and no longer receives updates
from the ldap master.

netstat -ntp
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
(...)
tcp 241009 0 XXX.XXX.XXX.XXX:49140 YYY.YYY.YYY.YYY:636 ESTABLISHED 16486/slapd
(...)

Attached is the full output from "db5.1_stat -CA" and is available on
request the full stack trace because of possible sensitive
information.

Jose Calhariz

--
--

Não há assunto pequeno. Há pequeno investidor

--Luiz da Câmara Cascudo

db5.1_stat-20121004-1439.txt

signature.asc

Quanah Gibson-Mount

unread,

Oct 4, 2012, 4:10:02 PM10/4/12

to

--On Thursday, October 04, 2012 4:19 PM +0100 Jose Manuel dos Santos

Calhariz <jose.c...@netvisao.pt> wrote:

>
> Following a previous bug report about a stopping slapd server, during
> normal day work. As now we have a partially working slapd server, that
> answers some queries but ignores others. Using db5.1_stat and gdb
> was possible to get the following information:

Hi Jose,

As I previously noted, this is a known bug with BDB 5.x series. Please
read:

<http://www.openldap.org/its/index.cgi/?findid=7378>
<http://www.openldap.org/its/index.cgi/?findid=7401>

Again, if you can show a deadlock in a current OpenLDAP build with a known
good version of BDB (4.7.25 + all patches), then that would be of
interesting.

--Quanah

--

Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration

--
To UNSUBSCRIBE, email to debian-bugs-...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listm...@lists.debian.org

Jose Manuel dos Santos Calhariz

unread,

Oct 10, 2012, 10:30:01 AM10/10/12

to

On Thu, Oct 04, 2012 at 01:05:15PM -0700, Quanah Gibson-Mount wrote:
> --On Thursday, October 04, 2012 4:19 PM +0100 Jose Manuel dos Santos
> Calhariz <jose.c...@netvisao.pt> wrote:
>
> >
> >Following a previous bug report about a stopping slapd server, during
> >normal day work. As now we have a partially working slapd server, that
> >answers some queries but ignores others. Using db5.1_stat and gdb
> >was possible to get the following information:
>
> Hi Jose,
>
> As I previously noted, this is a known bug with BDB 5.x series.
> Please read:
>
> <http://www.openldap.org/its/index.cgi/?findid=7378>
> <http://www.openldap.org/its/index.cgi/?findid=7401>
>
> Again, if you can show a deadlock in a current OpenLDAP build with a
> known good version of BDB (4.7.25 + all patches), then that would be
> of interesting.

Is openldap 2.4.31 current enough?

I have recompiled openldap this time with BDB 4.7.25 (+ 4 patches).
The symptoms are the same.

db4.7_stat -CA:

(...)
800002ae dd=139 locks held 1 write locks 0 pid/thread 20718/139646692484864
800002ae READ 1 WAIT istPersonServices.bdb page 28
800002ae READ 1 HELD istPersonServices.bdb page 1
800002af dd=138 locks held 1 write locks 1 pid/thread 20718/139646700877568
800002af WRITE 1 HELD istPersonServices.bdb page 28
(...)

gdb thread apply all bt

(...)
Thread 25 (Thread 0x7f0207fff700 (LWP 23812)):

#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162

#1 0x00007f02225f2c85 in ldap_int_thread_pool_wrapper (xpool=<optimized out>)
at ../../../../libraries/libldap_r/tpool.c:675
#2 0x00007f02208258ca in start_thread (arg=<optimized out>) at pthread_create.c:300
#3 0x00007f022058cb6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112

#4 0x0000000000000000 in ?? ()

(...)

I attached the full log of db4.7_stat -CA and is available on request
the gdb full stack trace.

>
> --Quanah
>

Jose Calhariz

--
--
O progresso e a concretização das utopias.
-- Oscar Wilde

db4.7_stat-20121010-1413.txt

signature.asc

Howard Chu

unread,

Oct 11, 2012, 10:40:02 AM10/11/12

to

Can you give some more background on what has happened just prior to this
hang? Do you have slapd logs (running with TRACE debug level, -d1 or -s1)
leading up to the hang? In particular, I'm interested in knowing whether you
see a message like:
<= index_entry_%s( %ld, \"%s\" ) failure
just before. Also, are you using cn=config on these servers, and did you just
reconfigure the index of the relevant attribute prior to the hang? (E.g., in
your trace, the istPersonServices attribute.)

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

Jose Manuel dos Santos Calhariz

unread,

Oct 17, 2012, 9:40:03 AM10/17/12

to

On Wed, Oct 10, 2012 at 03:18:50PM +0100, Jose Manuel dos Santos Calhariz wrote:
> On Thu, Oct 04, 2012 at 01:05:15PM -0700, Quanah Gibson-Mount wrote:
> > --On Thursday, October 04, 2012 4:19 PM +0100 Jose Manuel dos Santos
> > Calhariz <jose.c...@netvisao.pt> wrote:
> >
> > >
> > >Following a previous bug report about a stopping slapd server, during
> > >normal day work. As now we have a partially working slapd server, that
> > >answers some queries but ignores others. Using db5.1_stat and gdb
> > >was possible to get the following information:
> >
> > Hi Jose,
> >
> > As I previously noted, this is a known bug with BDB 5.x series.
> > Please read:
> >
> > <http://www.openldap.org/its/index.cgi/?findid=7378>
> > <http://www.openldap.org/its/index.cgi/?findid=7401>
> >
> > Again, if you can show a deadlock in a current OpenLDAP build with a
> > known good version of BDB (4.7.25 + all patches), then that would be
> > of interesting.
>
> Is openldap 2.4.31 current enough?

It even happen with 2.4.33 :-( It was compiled with BDB (4.7.25 + all
patches). Waiting for the next stop to collect debug information.

>
> I have recompiled openldap this time with BDB 4.7.25 (+ 4 patches).
> The symptoms are the same.
>
> db4.7_stat -CA:
>
> (...)
> 800002ae dd=139 locks held 1 write locks 0 pid/thread 20718/139646692484864
> 800002ae READ 1 WAIT istPersonServices.bdb page 28
> 800002ae READ 1 HELD istPersonServices.bdb page 1
> 800002af dd=138 locks held 1 write locks 1 pid/thread 20718/139646700877568
> 800002af WRITE 1 HELD istPersonServices.bdb page 28
> (...)
>
> gdb thread apply all bt
>
> (...)
> Thread 25 (Thread 0x7f0207fff700 (LWP 23812)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> #1 0x00007f02225f2c85 in ldap_int_thread_pool_wrapper (xpool=<optimized out>)
> at ../../../../libraries/libldap_r/tpool.c:675
> #2 0x00007f02208258ca in start_thread (arg=<optimized out>) at pthread_create.c:300
> #3 0x00007f022058cb6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> #4 0x0000000000000000 in ?? ()
> (...)
>
> I attached the full log of db4.7_stat -CA and is available on request
> the gdb full stack trace.
>
> >
> > --Quanah
> >
>
> Jose Calhariz
>
>

Jose Calhariz

--
--
Nihil est in intellectu quod non ante fuerit in sensu.
(Nada está no intelecto que não tenha passado antes pelos sentidos)

signature.asc

Quanah Gibson-Mount

unread,

Oct 17, 2012, 9:50:01 AM10/17/12

to

--On October 17, 2012 2:35:30 PM +0100 Jose Manuel dos Santos Calhariz

<jose.c...@netvisao.pt> wrote:

> On Wed, Oct 10, 2012 at 03:18:50PM +0100, Jose Manuel dos Santos Calhariz
> wrote:
>> On Thu, Oct 04, 2012 at 01:05:15PM -0700, Quanah Gibson-Mount wrote:
>> > --On Thursday, October 04, 2012 4:19 PM +0100 Jose Manuel dos Santos
>> > Calhariz <jose.c...@netvisao.pt> wrote:
>> >
>> > >
>> > > Following a previous bug report about a stopping slapd server, during
>> > > normal day work. As now we have a partially working slapd server,
>> > > that answers some queries but ignores others. Using db5.1_stat and
>> > > gdb was possible to get the following information:
>> >
>> > Hi Jose,
>> >
>> > As I previously noted, this is a known bug with BDB 5.x series.
>> > Please read:
>> >
>> > <http://www.openldap.org/its/index.cgi/?findid=7378>
>> > <http://www.openldap.org/its/index.cgi/?findid=7401>
>> >
>> > Again, if you can show a deadlock in a current OpenLDAP build with a
>> > known good version of BDB (4.7.25 + all patches), then that would be
>> > of interesting.
>>
>> Is openldap 2.4.31 current enough?
>
> It even happen with 2.4.33 :-( It was compiled with BDB (4.7.25 + all
> patches). Waiting for the next stop to collect debug information.

Ok. did you see the information Howard requested you gather via his
comment in the bug?

Thanks,
Quanah

--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc

Jose Manuel dos Santos Calhariz

unread,

Oct 17, 2012, 11:00:02 AM10/17/12

to

On Wed, Oct 17, 2012 at 06:46:11AM -0700, Quanah Gibson-Mount wrote:
>
>
> --On October 17, 2012 2:35:30 PM +0100 Jose Manuel dos Santos
> Calhariz <jose.c...@netvisao.pt> wrote:
>
> >On Wed, Oct 10, 2012 at 03:18:50PM +0100, Jose Manuel dos Santos Calhariz
> >wrote:

> >>> (...)

> >>
> >>Is openldap 2.4.31 current enough?
> >
> >It even happen with 2.4.33 :-( It was compiled with BDB (4.7.25 + all
> >patches). Waiting for the next stop to collect debug information.
>
> Ok. did you see the information Howard requested you gather via his
> comment in the bug?

For some reason I did't receive that comment. Let me give the
information that I have now.

>> Can you give some more background on what has happened just prior to this
>> hang? Do you have slapd logs (running with TRACE debug level, -d1 or -s1)
>> leading up to the hang? In particular, I'm interested in knowing whether you
>> see a message like:
>> <= index_entry_%s( %ld, \"%s\" ) failure

I will change the debug level to include trace.

>> just before. Also, are you using cn=config on these servers, and did you just
>> reconfigure the index of the relevant attribute prior to the hang? (E.g., in
>> your trace, the istPersonServices attribute.)

Yes, we use cn=config, but didn't do any changes to any index.

>
> Thanks,
> Quanah
>
>

Thanks,

signature.asc

Jose Manuel dos Santos Calhariz

unread,

Oct 19, 2012, 11:20:01 AM10/19/12

to

On Wed, Oct 17, 2012 at 03:50:32PM +0100, Jose Manuel dos Santos Calhariz wrote:
> On Wed, Oct 17, 2012 at 06:46:11AM -0700, Quanah Gibson-Mount wrote:
> >
> >
> > --On October 17, 2012 2:35:30 PM +0100 Jose Manuel dos Santos
> > Calhariz <jose.c...@netvisao.pt> wrote:
> >
> > >On Wed, Oct 10, 2012 at 03:18:50PM +0100, Jose Manuel dos Santos Calhariz
> > >wrote:
> > >>> (...)
> > >>
> > >>Is openldap 2.4.31 current enough?
> > >
> > >It even happen with 2.4.33 :-( It was compiled with BDB (4.7.25 + all
> > >patches). Waiting for the next stop to collect debug information.
> >
> > Ok. did you see the information Howard requested you gather via his
> > comment in the bug?
>
> For some reason I did't receive that comment. Let me give the
> information that I have now.

Another partial stop with 2.4.33. Again what stopped was the
synchronization with the LDAP master. A watchdog detected the
situation and restarted the slapd daemon.

>
> >> Can you give some more background on what has happened just prior to this
> >> hang?

The slaves are running the normal day workload until the thread that
process the replication stops processing the updates. We have a
watchdog that detect that the slave is no longer up to date with the
master and restart the slave.

I have the exact time the daemon was restarted, 15:10:42, and the
full log with -d "Stats Stats2 Sync Trace"

> >> Do you have slapd logs (running with TRACE debug level, -d1 or -s1)
> >> leading up to the hang? In particular, I'm interested in knowing whether you
> >> see a message like:
> >> <= index_entry_%s( %ld, \"%s\" ) failure
>
> I will change the debug level to include trace.

Looking into the logs I only see:
<= index_entry_%s( %ld, \"%s\" ) success

>
> >> just before. Also, are you using cn=config on these servers, and did you just
> >> reconfigure the index of the relevant attribute prior to the hang? (E.g., in
> >> your trace, the istPersonServices attribute.)
>
> Yes, we use cn=config, but didn't do any changes to any index.
>
> >
> > Thanks,
> > Quanah
> >
> >
>
> Thanks,
> Jose Calhariz
>

> _______________________________________________
> ns-list mailing list
> ns-...@dsi.ist.utl.pt
> https://mlists.ist.utl.pt/mailman/listinfo/groups.ciist.ns-list

signature.asc