gdb help: debugging a segfault in boost::shared

phear...@gmail.com

Dec 5, 2006, 7:33:33 AM

I am working on a multithreaded application that contains a database
connection pool which is using shared_ptr to pass connections around.

After recent changes I've been getting random segfaults in the
shared_ptr code handling ref counting. The gdb session below helps
explain the context.

I acknowledge that this is most likely my own screwup, but since I am
unable to get much meaningful information from gdb, I'm starting to run
out of ideas.

The following is what I have discovered by research:
- shared_ptr is thread safe (from boost docs)
- sp_counted_base uses lock-free algorithms for refcounting (from
header)

The questions I currently have are:
- How can I get more details about the segfault? I.e., which instruction
or memory address is involved?
- Why can't I access *pw?
- How can atomic_increment segfault? Is it possible that gdb's stack
trace is wrong?

An answer to any of these questions would be greatly appreciated.

[START GDB session]

(gdb) run

[...]

TEST: GetEntry
======================================================================

Entry: 000000000000576(539)
Get adapter

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1311896656 (LWP 29331)]
0xb7c556db in boost::detail::atomic_increment (pw=0x8006d73) at
sp_counted_base_gcc_x86.hpp:66
66 );
(gdb) bt
#0 0xb7c556db in boost::detail::atomic_increment (pw=0x8006d73) at
sp_counted_base_gcc_x86.hpp:66
#1 0xb7c55700 in boost::detail::sp_counted_base::add_ref_copy
(this=0x8006d6f) at sp_counted_base_gcc_x86.hpp:133
#2 0xb7c55746 in shared_count (this=0xb1cdff9c, r=@0xb455140c) at
shared_count.hpp:170
#3 0xb7c559dc in shared_ptr (this=0xb1cdff98) at shared_ptr.hpp:106
#4 0xb7c55290 in DatabaseAdapterPool::GetAdapter (this=0xb7d00f20,
dataSource=@0xb1ce00f8) at DatabaseAdapterPool.cc:188
#5 0xb7c3b0a3 in IDatabaseAdapter::GetAdapter (dataSource=@0xb1ce00f8)
at DatabaseAdapter.cc:31
#6 0xb7c4dffd in Instance::GetEntry (this=0x8056c90,
formName=@0xb1ce02c8, entryId=@0xb1ce02c4, fields=@0xb1ce02ac,
values=@0xb1ce02a0) at Instance.cc:85
#7 0xb7c3e42e in ARDBCGetEntry (object=0x8056c90, tableName=0xb453e284
"orcl", vendorFieldList=0xb1ce039c, transId=0,
entryIdList=0xb1ce0394, idList=0xb1ce038c, fieldList=0xb1ce0384,
status=0xb1ce037c) at syscomardbc.cc:386
#8 0x0804c7ee in ardbctest::testGetEntry (this=0x80cec80) at
ardbctest.h:379
#9 0x0804cddc in
boost::detail::function::void_function_obj_invoker0<ardbctest,
void>::invoke (function_obj_ptr=
{obj_ptr = 0x80cec80, const_obj_ptr = 0x80cec80, func_ptr =
0x80cec80, data = "\200"}) at ardbctest.h:217
#10 0x0804f7a1 in boost::function0<void,
std::allocator<boost::function_base> >::operator() (this=0xb1ce0434)
at function_template.hpp:576
#11 0x0804e5f6 in thread_proxy (param=0xbf96bd0c) at
../src/thread.cpp:113
#12 0xb7f3a341 in start_thread () from
/lib/tls/i686/cmov/libpthread.so.0
#13 0xb7dcd4ee in clone () from /lib/tls/i686/cmov/libc.so.6
(gdb) l -
56 {
57 //atomic_exchange_and_add( pw, 1 );
58
59 __asm__
60 (
61 "lock\n\t"
62 "incl %0":
63 "=m"( *pw ): // output (%0)
64 "m"( *pw ): // input (%1)
65 "cc" // clobbers
(gdb) l
66 );
67 }
68
69 inline int atomic_conditional_increment( int * pw )
70 {
71 // int rv = *pw;
72 // if( rv != 0 ) ++*pw;
73 // return rv;
74
75 int rv, tmp;
(gdb) print pw
$12 = (int *) 0x8006d73
(gdb) print *pw
Cannot access memory at address 0x8006d73
(gdb) f 4
#4 0xb7c55290 in DatabaseAdapterPool::GetAdapter (this=0xb7d00f20,
dataSource=@0xb1ce00f8) at DatabaseAdapterPool.cc:188
188 shared_ptr<IDatabaseAdapter> a = pool.GetAdapter();
(gdb) l
183
184 boost::mutex::scoped_lock lock( pool_mutex );
185
186 cout << "Finding available adapter. DataSource: " << dataSource.name << " (" << &dataSource << ")" << endl;
187
188 shared_ptr<IDatabaseAdapter> a = pool.GetAdapter();
189
190 if ( a.get() == NULL )
191 a = CreateAdapter( dataSource );
192

[END GDB session]

NOTE: pool.GetAdapter() returns either a shared_ptr to an available
adapter, or an empty shared_ptr.

phear

Dec 5, 2006, 9:54:24 AM

Forgot to mention that most of my code (stack frame 4 through 7) is in
a shared library, which might cause some extra fun/trouble.

Paul Pluzhnikov

Dec 5, 2006, 10:13:28 AM

phear...@gmail.com writes:

> I am working on a multithreaded application that contains a database
> connection pool which is using shared_ptr to pass connections around.

These are fun to debug :-(

> The questions I currently have are:
> - How can I get more details about the segfault? Ie. Which instruction
> or memory address is involved?

It's pretty obvious that the instruction at 0xb7c556db caused
the fault. You can examine it with

(gdb) x/i 0xb7c556db

The faulting instruction is "lock incl *pw", and it's faulting
because pw is corrupt.

> - Why can't I access *pw?

Because it is obviously corrupt -- it isn't properly aligned (0x8006d73
is not a multiple of 4), and its last 2 bytes form the ASCII string
"ms" (0x6d 0x73).
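
The two checks mentioned above (alignment and "does this pointer look like text?") can be sketched in a few lines of C++. This is only an illustration: the helper names is_aligned and ascii_view are made up for this example, and the addresses are the ones shown in the gdb session (pw=0x8006d73, this=0x8006d6f).

```cpp
#include <cstdint>
#include <string>

// Returns true if addr is a multiple of alignment (which must be a power of two).
bool is_aligned(std::uintptr_t addr, std::uintptr_t alignment) {
    return (addr & (alignment - 1)) == 0;
}

// Renders the four bytes of a 32-bit address, most significant byte first,
// as printable ASCII; non-printable bytes are shown as '.'.
std::string ascii_view(std::uint32_t addr) {
    std::string out;
    for (int shift = 24; shift >= 0; shift -= 8) {
        unsigned char byte = static_cast<unsigned char>((addr >> shift) & 0xFF);
        out += (byte >= 0x20 && byte < 0x7F) ? static_cast<char>(byte) : '.';
    }
    return out;
}
```

An int used by "lock incl" would normally sit on a 4-byte boundary, so is_aligned(0x8006d73, 4) being false is already suspicious, and printable low bytes in a heap pointer suggest that string data was written over it.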

> - How can atomic_increment segfault? Is it possible that gdb's stack
> trace is wrong?

How can it fail if you pass it a bogus pointer?
Easy -- it just does.

How you got a bogus pointer is a whole other story.

> NOTE: pool.GetAdapter() returns either a shared_ptr to an available
> adapter, or an empty shared_ptr.

You should examine "pool" to see if the pointer is corrupt there.
If it is, you'll have to "trace back" to find where it gets
corrupted. That task will be much simpler IF you can eliminate
threads from the picture.

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.

phear

Dec 5, 2006, 1:07:19 PM

Thanks for the quick reply, Paul.

You are absolutely correct, the last two bytes of pw are in fact a
fragment of data being read from the database. sp_counted_base's this
pointer also contains fragments of this data. I have examined several
sessions, and in more than half of them I have been able to identify
the fragments as most likely originating from the data being read.

Running the app extensively with only one thread does not trigger this
error.

I use std::string throughout, and strncpy or strdup if I need to deal
with char arrays. Although other libraries might cause buffer
overflows, I am far from convinced that this is caused by that.

It also strikes me as odd that this error always happens in shared_ptr,
but only after having executed the exact same code several hundred
times.

I am afraid I am still quite stuck here. I can not see how I can debug
this any further, since I can't detect it before it segfaults.


Arnold Hendriks

Dec 5, 2006, 6:46:05 PM

phear wrote:

> I use std::string throughout, and strncpy or strdup if I need to deal
> with char arrays. Although other libraries might cause buffer
> overflows, I am far from convinced that this is caused by that.
>
> It also strikes me as odd that this error always happens in shared_ptr,
> but only after having executed the exact same code several hundred
> times.

I would still look for a bug in your code. I've been running shared_ptr
trouble free for years (with a disabled atomic refcounting as I don't
need it, not passing any pointers between threads without a lock).
shared_ptr (especially in the earlier boost versions) has a pretty
simple implementation which you can verify yourself.

Perhaps valgrind can provide hints? Boost also has intrusive shared
pointers (boost::intrusive_ptr), which keep the reference count inside
the object. That might simplify analysis of the data corruption, but it
does require quite intrusive code changes.

Joe Seigh

Dec 5, 2006, 9:24:08 PM

phear...@gmail.com wrote:
> I am working on a multithreaded application that contains a database
> connection pool which is using shared_ptr to pass connections around.
>
> After recent changes I've been getting random segfaults in the
> shared_ptr code handling ref counting. The gdb session below help
> explain the context.
>
> I acknowledge that this is most likely my own screwup, but since I am
> unable to get much meaningful information from gdb, I'm starting to run
> out of ideas.
>
> The following is what I have discovered by research:
> - shared_ptr is thread safe (from boost docs)
> - sp_counted_base uses lock-free algorithms for refcounting (from
> header)
>
> The questions I currently have are:
> - How can I get more details about the segfault? Ie. Which instruction
> or memory address is involved?
> - Why can't I access *pw?
> - How can atomic_increment segfault? Is it possible that gdb's stack
> trace is wrong?
>
> An answer to any of these questions would be greatly appreciated.
>

In this case lock-free doesn't mean what you think it means. shared_ptr's
are only safe to copy if you own the source ptr (and the target ptr obviously).
There are experimental smart pointers that are atomically thread-safe.
shared_ptr isn't one of them as you found out. You'll need a mutex
to safely pass those shared_ptr's around.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

jasen

Dec 6, 2006, 12:32:34 AM

On 2006-12-05, phear <phear...@gmail.com> wrote:
> Thanks for the quick reply paul.
>
> You are absolutely correct, the last two bytes of pw is in fact a
> fragment of data being read from the database. sp_counted_base's this
> pointer also contains fragments of this data. I have examined several
> sessions, and in more than half of them I have been able to identify
> the fragments as most likely originating from the data being read.
>
> Running the app extensively with only one thread does not trigger this
> error.
>
> I use std::string throughout, and strncpy or strdup if I need to deal
> with char arrays. Although other libraries might cause buffer
> overflows, I am far from convinced that this is caused by that.

Make sure all those methods and functions are either thread safe or
appropriately mutexed.

Bye.
Jasen

phear

Dec 6, 2006, 4:54:08 AM

On Dec 6, 12:46 am, Arnold Hendriks <a.hendr...@b-lex.nl> wrote:
> phear wrote:
> > I use std::string throughout, and strncpy or strdup if I need to deal
> > with char arrays. Although other libraries might cause buffer
> > overflows, I am far from convinced that this is caused by that.
>
> > It also strikes me as odd that this error always happens in shared_ptr,
> > but only after having executed the exact same code several hundred
> > times.
>
> I would still look for a bug in your code. I've been running shared_ptr
> trouble free for years (with a disabled atomic refcounting as I don't
> need it, not passing any pointers between threads without a lock).
> shared_ptr (especially in the earlier boost versions) has a pretty
> simple implementation which you can verify yourself.
>
> Perhaps valgrind can provide hints ? Boost also has intrusive shared
> pointers, which keep the reference count inside the object - it might
> simplify analysis of the data corruption (but it does require quite
> intrusive code changes)

Yes, I still believe this is a bug in my code. I just can't figure out
how to narrow it down from "very strange random thing going on".

Valgrind reports quite a few errors/warnings, most of them occurring in
the oracle library on connect, which I think is normal. I will try
using mysql, to see if I can replicate the error there.

Valgrind also reports two errors that are likely related:

==9500== 994 errors in context 184 of 186:
==9500== Invalid read of size 4
==9500== at 0x43FD729:
boost::detail::shared_count::shared_count(boost::detail::shared_count
const&) (shared_count.hpp:165)
==9500== by 0x43FD9DB:
boost::shared_ptr<IDatabaseAdapter>::shared_ptr(boost::shared_ptr<IDatabaseAdapter>
const&) (shared_ptr.hpp:106)
==9500== by 0x43FD28F: DatabaseAdapterPool::GetAdapter(DataSource
const&) (DatabaseAdapterPool.cc:188)
==9500== by 0x43E30A2: IDatabaseAdapter::GetAdapter(DataSource
const&) (DatabaseAdapter.cc:31)
==9500== by 0x43F6A99: Instance::GetListEntryWithFields(std::string
const&, std::vector<ARVendorFieldStruct,
std::allocator<ARVendorFieldStruct> > const&, std::vector<std::string,
std::allocator<std::string> > const&, Qualification const&, unsigned,
unsigned, std::vector<AREntryListFieldValueStruct,
std::allocator<AREntryListFieldValueStruct> >*, unsigned*)
(Instance.cc:505)
==9500== by 0x43E6F25: ARDBCGetListEntryWithFields
(syscomardbc.cc:757)
==9500== by 0x804D011: ardbctest::testGetListEntryWithFields()
(ardbctest.h:458)
==9500== by 0x804E11D: ardbctest::run() (ardbctest.h:180)
==9500== by 0x8049CAA: run() (main.cc:47)
==9500== by 0x8049EBB: main (main.cc:68)
==9500== Address 0x591D60C is 20 bytes inside a block of size 24
free'd
==9500== at 0x401D268: operator delete(void*)
(vg_replace_malloc.c:246)
==9500== by 0x43FDF68:
__gnu_cxx::new_allocator<std::_List_node<DatabaseAdapterPool::InternalPool::PoolObject>
>::deallocate(std::_List_node<DatabaseAdapterPool::InternalPool::PoolObject>*, unsigned) (new_allocator.h:94)
==9500== by 0x43FDF9D:
std::_List_base<DatabaseAdapterPool::InternalPool::PoolObject,
std::allocator<DatabaseAdapterPool::InternalPool::PoolObject>
>::_M_put_node(std::_List_node<DatabaseAdapterPool::InternalPool::PoolObject>*) (stl_list.h:317)
==9500== by 0x43FE1CA:
std::list<DatabaseAdapterPool::InternalPool::PoolObject,
std::allocator<DatabaseAdapterPool::InternalPool::PoolObject>
>::_M_erase(std::_List_iterator<DatabaseAdapterPool::InternalPool::PoolObject>) (stl_list.h:1163)
==9500== by 0x43FE22C:
std::list<DatabaseAdapterPool::InternalPool::PoolObject,
std::allocator<DatabaseAdapterPool::InternalPool::PoolObject>
>::erase(std::_List_iterator<DatabaseAdapterPool::InternalPool::PoolObject>) (list.tcc:98)
==9500== by 0x43FCD07:
DatabaseAdapterPool::InternalPool::GetAdapter()
(DatabaseAdapterPool.cc:68)
==9500== by 0x43FD280: DatabaseAdapterPool::GetAdapter(DataSource
const&) (DatabaseAdapterPool.cc:188)
==9500== by 0x43E30A2: IDatabaseAdapter::GetAdapter(DataSource
const&) (DatabaseAdapter.cc:31)
==9500== by 0x43F6A99: Instance::GetListEntryWithFields(std::string
const&, std::vector<ARVendorFieldStruct,
std::allocator<ARVendorFieldStruct> > const&, std::vector<std::string,
std::allocator<std::string> > const&, Qualification const&, unsigned,
unsigned, std::vector<AREntryListFieldValueStruct,
std::allocator<AREntryListFieldValueStruct> >*, unsigned*)
(Instance.cc:505)
==9500== by 0x43E6F25: ARDBCGetListEntryWithFields
(syscomardbc.cc:757)
==9500== by 0x804D011: ardbctest::testGetListEntryWithFields()
(ardbctest.h:458)
==9500== by 0x804E11D: ardbctest::run() (ardbctest.h:180)
==9500==
==9500== 994 errors in context 185 of 186:
==9500== Invalid read of size 4
==9500== at 0x43FD9BD:
boost::shared_ptr<IDatabaseAdapter>::shared_ptr(boost::shared_ptr<IDatabaseAdapter>
const&) (shared_ptr.hpp:106)
==9500== by 0x43FD28F: DatabaseAdapterPool::GetAdapter(DataSource
const&) (DatabaseAdapterPool.cc:188)
==9500== by 0x43E30A2: IDatabaseAdapter::GetAdapter(DataSource
const&) (DatabaseAdapter.cc:31)
==9500== by 0x43F6A99: Instance::GetListEntryWithFields(std::string
const&, std::vector<ARVendorFieldStruct,
std::allocator<ARVendorFieldStruct> > const&, std::vector<std::string,
std::allocator<std::string> > const&, Qualification const&, unsigned,
unsigned, std::vector<AREntryListFieldValueStruct,
std::allocator<AREntryListFieldValueStruct> >*, unsigned*)
(Instance.cc:505)
==9500== by 0x43E6F25: ARDBCGetListEntryWithFields
(syscomardbc.cc:757)
==9500== by 0x804D011: ardbctest::testGetListEntryWithFields()
(ardbctest.h:458)
==9500== by 0x804E11D: ardbctest::run() (ardbctest.h:180)
==9500== by 0x8049CAA: run() (main.cc:47)
==9500== by 0x8049EBB: main (main.cc:68)
==9500== Address 0x591D608 is 16 bytes inside a block of size 24
free'd
==9500== at 0x401D268: operator delete(void*)
(vg_replace_malloc.c:246)
==9500== by 0x43FDF68:
__gnu_cxx::new_allocator<std::_List_node<DatabaseAdapterPool::InternalPool::PoolObject>
>::deallocate(std::_List_node<DatabaseAdapterPool::InternalPool::PoolObject>*, unsigned) (new_allocator.h:94)
==9500== by 0x43FDF9D:
std::_List_base<DatabaseAdapterPool::InternalPool::PoolObject,
std::allocator<DatabaseAdapterPool::InternalPool::PoolObject>
>::_M_put_node(std::_List_node<DatabaseAdapterPool::InternalPool::PoolObject>*) (stl_list.h:317)
==9500== by 0x43FE1CA:
std::list<DatabaseAdapterPool::InternalPool::PoolObject,
std::allocator<DatabaseAdapterPool::InternalPool::PoolObject>
>::_M_erase(std::_List_iterator<DatabaseAdapterPool::InternalPool::PoolObject>) (stl_list.h:1163)
==9500== by 0x43FE22C:
std::list<DatabaseAdapterPool::InternalPool::PoolObject,
std::allocator<DatabaseAdapterPool::InternalPool::PoolObject>
>::erase(std::_List_iterator<DatabaseAdapterPool::InternalPool::PoolObject>) (list.tcc:98)
==9500== by 0x43FCD07:
DatabaseAdapterPool::InternalPool::GetAdapter()
(DatabaseAdapterPool.cc:68)
==9500== by 0x43FD280: DatabaseAdapterPool::GetAdapter(DataSource
const&) (DatabaseAdapterPool.cc:188)
==9500== by 0x43E30A2: IDatabaseAdapter::GetAdapter(DataSource
const&) (DatabaseAdapter.cc:31)
==9500== by 0x43F6A99: Instance::GetListEntryWithFields(std::string
const&, std::vector<ARVendorFieldStruct,
std::allocator<ARVendorFieldStruct> > const&, std::vector<std::string,
std::allocator<std::string> > const&, Qualification const&, unsigned,
unsigned, std::vector<AREntryListFieldValueStruct,
std::allocator<AREntryListFieldValueStruct> >*, unsigned*)
(Instance.cc:505)
==9500== by 0x43E6F25: ARDBCGetListEntryWithFields
(syscomardbc.cc:757)
==9500== by 0x804D011: ardbctest::testGetListEntryWithFields()
(ardbctest.h:458)
==9500== by 0x804E11D: ardbctest::run() (ardbctest.h:180)

These only confirm what I already know though. The count shows that
it happens almost every time, but oddly enough it never results in a
segfault when running through valgrind.

You mentioned doing some intrusive code changes to analyze the
corruption. What did you have in mind?

phear

Dec 6, 2006, 5:04:38 AM

What exactly do you mean? I do own both source and target pointers, and
DatabaseAdapterPool::GetAdapter() is protected by a mutex.
IDatabaseAdapter.GetAdapter() is not though. I will try locking this as
well.

I might also try rewriting the code to use normal pointers instead,
just to see what happens then. I am afraid this will just hide the bug
though.

On Dec 6, 3:24 am, Joe Seigh <jseigh...@xemaps.com> wrote:

> In this case lock-free doesn't mean what you think it means. shared_ptr's
> [...]

phear

Dec 6, 2006, 11:17:10 AM

Putting a lock on IDatabaseAdapter.GetAdapter() did not help. Locking
Instance::GetEntry fixed it, but this of course defeats the purpose of
multi-threading.

I have tried using normal pointers instead, and although a lot more
rare, it still segfaults occasionally.

I am currently following a lead from valgrind, but I'm not going to
hold my breath just yet.

==4101== Invalid read of size 8
==4101== at 0x541EC58: (within
/u01/app/oracle/product/localhost/db_1/lib/libnnz10.so)
==4101== Address 0x4372BC8 is 128 bytes inside a block of size 133
alloc'd
==4101== at 0x401CC6B: operator new[](unsigned)
(vg_replace_malloc.c:197)
==4101== by 0x440672D: SAString::AllocBuffer(int) (SQLAPI.cpp:282)
==4101== by 0x4407106: SAString::SAString(char const*)
(SQLAPI.cpp:603)
==4101== by 0x43E9899: SqlApiDatabaseAdapter::Execute(std::string
const&) (SqlApiDatabaseAdapter.cc:67)
==4101== by 0x4404A4D: OracleDatabaseAdapter::GetFieldInfo()
(OracleDatabaseAdapter.cc:35)
==4101== by 0x43F3BE7: Instance::GetFields(std::string const&,
ARFieldMappingList*, ARFieldLimitList*, ARUnsignedIntList*)
(Instance.cc:211)
==4101== by 0x43E3FE8: ARDBCGetMultipleFields (syscomardbc.cc:622)
==4101== by 0x804D802: ardbctest::testGetFields(ARCompoundSchema&)
(ardbctest.h:303)
==4101== by 0x804DFDE: ardbctest::testGetSchemasAndFields()
(ardbctest.h:280)
==4101== by 0x804E162: ardbctest::run() (ardbctest.h:179)
==4101== by 0x8049CAA: run() (main.cc:47)
==4101== by 0x8049EBB: main (main.cc:68)

NOTE: SQLAPI is a third-party library between my app and oracle (and
other db apis).

Joe Seigh

Dec 6, 2006, 11:48:26 AM

phear wrote:
> Putting a lock on IDatabaseAdapter.GetAdapter() did not help. Locking
> Instance::GetEntry fixed it, but this ofcourse defies the purpose of
> multi-threading.
>
> I have tried using normal pointers instead, and although a lot more
> rare, it still segfaults occasionally.
>
> I am currently following a lead from valgrind, but I'm not going to
> hold my breath just yet.
>

[...]

It's not a problem with shared_ptr or with the debugger. The "lock-free"
stuff in shared_ptr is a red herring. It's an implementation detail used
to make it thread-safe; it could just as easily have been lock based, and
it is lock based on some platforms. Thread-safe means that shared_ptr
handles internally shared data safely, i.e. the reference count. It does
not mean it is atomically thread-safe like Java pointers, and it does not
mean that you can skip external synchronization when a shared_ptr
instance itself is shared between threads.
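
As a rough illustration of that distinction, here is a minimal sketch. It assumes C++11 std::shared_ptr, std::mutex and std::thread as stand-ins for the Boost equivalents discussed in this thread, and the names SharedSlot and hammer_slot are invented for the example. Each thread may freely copy and destroy its own local shared_ptr, but the one instance that several threads read and write is only touched under a mutex.

```cpp
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// One shared_ptr instance that several threads read and replace.
// Every access to the instance itself goes through the mutex.
class SharedSlot {
public:
    void store(std::shared_ptr<int> p) {
        std::lock_guard<std::mutex> lock(mutex_);
        ptr_ = std::move(p);
    }
    std::shared_ptr<int> load() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return ptr_;  // the copy is made while the lock is held
    }
private:
    mutable std::mutex mutex_;
    std::shared_ptr<int> ptr_;
};

// Each worker repeatedly replaces the slot and then reads it back through
// a private copy. Returns the sum of all values read.
long hammer_slot(int threads, int iterations) {
    SharedSlot slot;
    slot.store(std::make_shared<int>(1));
    std::vector<long> sums(threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&slot, &sums, t, iterations] {
            for (int i = 0; i < iterations; ++i) {
                slot.store(std::make_shared<int>(1));      // write the shared instance
                std::shared_ptr<int> local = slot.load();  // private copy
                sums[t] += *local;                         // no lock needed for the copy
            }
        });
    }
    for (auto& w : workers) w.join();
    long total = 0;
    for (long s : sums) total += s;
    return total;
}
```

Without the mutex in store() and load(), two threads could assign to and copy from ptr_ at the same time, which is exactly the case the reference-count atomics do not cover.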

If you want to see what an atomically thread-safe refcount pointer looks
like, take a look at atomic_ptr in
http://atomic-ptr-plus.sourceforge.net/

We can't give you too much specific advice on how to solve your
problem since we know few specifics on what you are trying
to accomplish.

phear

Dec 6, 2006, 2:46:51 PM

I am developing a database connectivity plugin for the Remedy Action
Request System (ARS). ARS is a platform for developing form based
applications, usually for request/ticket/case management. In order to
support integrations to multiple database systems through native ARS
forms, I need a plugin that can be configured to access multiple
configured data sources.

Due to restrictions in the ARS plugin interface and performance
requirements, I've developed a database pool for the plugin so the time
used to connect to the data sources will be greatly reduced. The pool
is very simple in its design. It holds a list of database connections
not in use, and will upon request return one of these, or if none is
available create a new one. The requester is required to give the
adapter back to the pool when done with, so the pool can insert it back
to the available list. The pool will also check if the connections are
expired before returning them, and remove them if they are.
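
A minimal sketch of a pool along the lines described above, for readers who want something concrete. It is not the actual plugin code: Connection, ConnectionPool, max_idle and idle_count are hypothetical names, std::shared_ptr and std::mutex stand in for the Boost types, and "expired" is assumed to mean "idle for longer than a fixed duration".

```cpp
#include <chrono>
#include <cstddef>
#include <list>
#include <memory>
#include <mutex>
#include <utility>

struct Connection { int id; };  // hypothetical stand-in for IDatabaseAdapter

class ConnectionPool {
public:
    using Clock = std::chrono::steady_clock;

    explicit ConnectionPool(Clock::duration max_idle) : max_idle_(max_idle) {}

    // Returns an idle connection that has not expired, or creates a new one.
    std::shared_ptr<Connection> GetAdapter() {
        std::lock_guard<std::mutex> lock(mutex_);
        const Clock::time_point now = Clock::now();
        while (!available_.empty()) {
            Entry entry = available_.front();  // copy out before removing the node
            available_.pop_front();
            if (now - entry.released_at <= max_idle_) {
                return entry.conn;
            }
            // Expired: letting entry go out of scope drops the connection.
        }
        return std::make_shared<Connection>(Connection{next_id_++});
    }

    // The requester hands the connection back when done with it.
    void ReleaseAdapter(std::shared_ptr<Connection> conn) {
        std::lock_guard<std::mutex> lock(mutex_);
        available_.push_back(Entry{std::move(conn), Clock::now()});
    }

    std::size_t idle_count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return available_.size();
    }

private:
    struct Entry {
        std::shared_ptr<Connection> conn;
        Clock::time_point released_at;
    };

    Clock::duration max_idle_;
    mutable std::mutex mutex_;
    std::list<Entry> available_;
    int next_id_ = 0;
};
```

Note that GetAdapter() copies the entry out of the list before removing the node, so nothing refers to list storage after it has been freed.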

The shared_ptrs are responsible for cleaning up the database
connection. The initial idea was that if an exception occurred in a
thread holding a database connection, the shared_ptr would
automatically clean it up. This might not be possible though, since the
pool might have to keep track of the connections in order to limit the
number of open connections, unless this can be handled on another
level.

The pointers are never passed between threads outside of the pool.
Access to the pool is synchronized in the only two methods used to
access the pool, GetAdapter() and ReleaseAdapter(). The thread safety
of the pointers should therefore not be an issue anyway. This is also
indicated by the fact that this bug also occurs when using ordinary
pointers.

Thanks for enlightening me about the details of shared_ptr and
atomic-ptr though. That will probably save my day sometime :)


Arnold Hendriks

Dec 6, 2006, 3:09:49 PM

phear wrote:
>
> Valgrind reports quite a few errors/warnings, most of them occuring in
> the oracle library on connect, which I think is normal. I will try
> using mysql, to see if I can replicate the error there.

If you're using OCI, I feel sorry for you. I have plenty of valgrind and
core dumps all segfaulting deep in oracle code (mostly Oracle's internal
resolver libraries, judging by the function names - for me, it appeared
to get better by switching to connection pools and manually setting up
the connection and keeping environment handles open instead of relying
on the easier OCILogon)

>
> Valgrind also reports two errors that is likely related:
>
> =9500== 994 errors in context 184 of 186:
> ==9500== Invalid read of size 4
> ==9500== at 0x43FD729:
> boost::detail::shared_count::shared_count(boost::detail::shared_count
> const&) (shared_count.hpp:165)
> ==9500== by 0x43FD9DB:
> boost::shared_ptr<IDatabaseAdapter>::shared_ptr(boost::shared_ptr<IDatabaseAdapter>
> const&) (shared_ptr.hpp:106)

> ==9500== Address 0x591D60C is 20 bytes inside a block of size 24
> free'd
> ==9500== at 0x401D268: operator delete(void*)
> (vg_replace_malloc.c:246)
> ==9500== by 0x43FDF68:
> __gnu_cxx::new_allocator<std::_List_node<DatabaseAdapterPool::InternalPool::PoolObject>
> >::deallocate(std::_List_node<DatabaseAdapterPool::InternalPool::PoolObject>*, unsigned) (new_allocator.h:94)
> ==9500== by 0x43FDF9D:

Well, at first glance, this would appear to be a race condition when
handling the shared_ptr: perhaps the object containing a shared_ptr got
destroyed but somehow the object is still being accessed. Are you making
use of shared_ptr's get() function instead of taking an explicit
reference, or passing the object (or one of its members) by reference to
a function instead of by value? If so, are you sure the lifetime of the
object is guaranteed for as long as some code can use the pointer
returned by get() or the reference?

> These only confirm what I already know though. Although the count shows
> that it happens almost every time, but oddly enough never resulting in
> a segfault when running it though valgrind.

That is to be expected: valgrind takes the hit (addresses stay valid
longer than normally so valgrind can monitor them) so no segfault
occurs. I believe you can use environment variables to tune how long
valgrind keeps deallocated addresses.

Without valgrind, the "Invalid read" would have caused a segfault, if
that address happened to be on a part of the heap that was returned back
to the system. Even without valgrind, the allocator doesn't immediately
return freed memory back to the OS.

>
> You mentioned doing some intrusive code changes to analyze the
> corruption. What did you have in mind?

Using the intrusive boost shared pointers. It keeps the reference count
(the thing that caused the actual segfault) inside the objects you are
counting. It also saves quite a few allocations (the reference count
isn't allocated separately) but it does require you to modify the
classes to hold a reference count and supply counting functions

(older boost implementations only required you to derive from an
intrusive pointer class, and shared_ptr would automatically adapt. I
guess they figured that was too easy to use and that it needed to be
more complex)
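
To make the idea concrete, here is a minimal sketch of intrusive reference counting in plain C++. The hook names intrusive_ptr_add_ref and intrusive_ptr_release are the ones boost::intrusive_ptr looks up; Adapter and IntrusiveHandle are hypothetical, and IntrusiveHandle is only a simplified stand-in for boost::intrusive_ptr so the example compiles without Boost.

```cpp
#include <atomic>
#include <utility>

// Hypothetical class that carries its own reference count.
class Adapter {
public:
    Adapter() : refs_(0) {}
    int use_count() const { return refs_.load(); }

private:
    std::atomic<int> refs_;

    // Counting hooks, found through argument-dependent lookup.
    friend void intrusive_ptr_add_ref(Adapter* p) { p->refs_.fetch_add(1); }
    friend void intrusive_ptr_release(Adapter* p) {
        if (p->refs_.fetch_sub(1) == 1) delete p;  // last owner frees the object
    }
};

// Simplified stand-in for boost::intrusive_ptr<T>.
template <class T>
class IntrusiveHandle {
public:
    explicit IntrusiveHandle(T* p = nullptr) : p_(p) {
        if (p_) intrusive_ptr_add_ref(p_);
    }
    IntrusiveHandle(const IntrusiveHandle& other) : p_(other.p_) {
        if (p_) intrusive_ptr_add_ref(p_);
    }
    IntrusiveHandle& operator=(IntrusiveHandle other) {
        std::swap(p_, other.p_);  // copy-and-swap; old pointer released by other
        return *this;
    }
    ~IntrusiveHandle() {
        if (p_) intrusive_ptr_release(p_);
    }
    T* get() const { return p_; }
    T* operator->() const { return p_; }

private:
    T* p_;
};
```

Because the count lives inside the Adapter, a corrupted count means the object itself was overwritten, which narrows down where to look compared with a separately allocated count block.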

Paul Pluzhnikov

Dec 6, 2006, 5:23:02 PM

"phear" <phear...@gmail.com> writes:

> Valgrind also reports two errors that is likely related:

I think this is *the* cause of your crashes, and I believe you are
not interpreting these errors correctly.

> ==9500== Invalid read of size 4
> ==9500== at 0x43FD729: boost::detail::shared_count::shared_count(boost::detail::shared_count const&) (shared_count.hpp:165)
> ==9500== by 0x43FD9DB: boost::shared_ptr<IDatabaseAdapter>::shared_ptr(boost::shared_ptr<IDatabaseAdapter> const&) (shared_ptr.hpp:106)
> ==9500== by 0x43FD28F: DatabaseAdapterPool::GetAdapter(DataSource const&) (DatabaseAdapterPool.cc:188)
> ==9500== by 0x43E30A2: IDatabaseAdapter::GetAdapter(DataSource const&) (DatabaseAdapter.cc:31)

> ==9500== by 0x43F6A99: Instance::GetListEntryWithFields(...) (Instance.cc:505)

> ==9500== by 0x43E6F25: ARDBCGetListEntryWithFields (syscomardbc.cc:757)
> ==9500== by 0x804D011: ardbctest::testGetListEntryWithFields() (ardbctest.h:458)
> ==9500== by 0x804E11D: ardbctest::run() (ardbctest.h:180)
> ==9500== by 0x8049CAA: run() (main.cc:47)
> ==9500== by 0x8049EBB: main (main.cc:68)
> ==9500== Address 0x591D60C is 20 bytes inside a block of size 24 free'd
> ==9500== at 0x401D268: operator delete(void*) (vg_replace_malloc.c:246)

> ==9500== by 0x43FDF68: __gnu_cxx::new_allocator<...>::deallocate(...) (new_allocator.h:94)
> ==9500== by 0x43FDF9D: std::_List_base<... >::_M_put_node(...) (stl_list.h:317)
> ==9500== by 0x43FE1CA: std::list<...>::_M_erase(...) (stl_list.h:1163)
> ==9500== by 0x43FE22C: std::list<...>::erase(...) (list.tcc:98)

> ==9500== by 0x43FCD07: DatabaseAdapterPool::InternalPool::GetAdapter() (DatabaseAdapterPool.cc:68)

Let me translate that message for you:

At DatabaseAdapterPool.cc:188 you are accessing a shared_ptr that
resides in a list node, and that list node has already been freed
by a list::erase() call made from DatabaseAdapterPool.cc:68.

> These only confirm what I already know though.

No, they tell you the exact problem, which you didn't know: that you
are misusing std::list.

In particular, your bug is likely along the lines of:

    list<...>::iterator it = someList.begin();  // [1]
    someList.erase(it);    // invalidates it and *it !!!  // [2]
    ...
    return it->someField;  // read dangling!              // [3]

The reason you observe the crash/corruption only with multiple threads
is that between points [2] and [3] the memory you will read at [3] is
free, so some other thread can malloc() it and write its own data
over it, corrupting it before you get to point [3].

Without interference from another thread, the memory will not change,
so at [3] the value of someField will be the same as it was before [2]
(unless the code between [2] and [3] also calls malloc()).
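
A minimal sketch of the usual fix for the [1]/[2]/[3] pattern above:
copy what you need out of the node before erasing it. The Entry type
and takeFront() are hypothetical (the real list element is not shown in
this thread), and std::shared_ptr stands in for boost::shared_ptr so
that the example is self-contained.

```cpp
#include <cassert>
#include <list>
#include <memory>

// Hypothetical element type, for illustration only.
struct Entry {
    std::shared_ptr<int> someField;
};

// Removes the first element and returns its field.
// The field is copied while the node is still alive, so the returned
// shared_ptr holds its own reference and never touches freed memory.
std::shared_ptr<int> takeFront(std::list<Entry>& someList) {
    assert(!someList.empty());
    std::list<Entry>::iterator it = someList.begin();
    std::shared_ptr<int> result = it->someField;  // copy before erase
    someList.erase(it);                           // `it` is invalid from here on
    return result;                                // safe: independent copy
}
```

The only change from the buggy version is the order: the read happens
before erase(), not after it.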

phear

Dec 7, 2006, 4:22:36 AM

I can't believe I missed that.

I store the connections in a std::list<PoolObject>, where
PoolObject is a wrapper that contains the pointer and some other data.

When checking out a connection, I get a reference to each PoolObject in
order to conveniently check for expiration and such. If this connection
is ok to be checked out, I move it to a "used" list, and use the same
reference as before to return the pointer. This obviously doesn't work
very well :)

A stupid mistake, but at least I've learned a lot while trying to figure
this out.

Thank you all for helping me out. You've been great. And a special
thanks to Paul for continuously stating "the obvious" :) You've
certainly got a great eye for detail.
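
One way to avoid the dangling reference described above is to move the
node between the two lists with std::list::splice. In C++11 and later,
splice relinks the node instead of copying and erasing it, so a
reference taken before the move still refers to the same element
afterwards. The sketch below is an assumption about the pool's shape,
not the actual code: Connection, the members of PoolObject, and
checkOut() are hypothetical names, std::shared_ptr stands in for
boost::shared_ptr, and the caller is assumed to hold whatever lock
protects the pool.

```cpp
#include <cassert>
#include <list>
#include <memory>

// Hypothetical stand-ins for the types described in the post.
struct Connection { int id; };
struct PoolObject {
    std::shared_ptr<Connection> conn;
    bool expired;
};

// Moves the first non-expired PoolObject from `idle` to `used` and
// returns its connection, or an empty pointer if none is available.
std::shared_ptr<Connection> checkOut(std::list<PoolObject>& idle,
                                     std::list<PoolObject>& used) {
    for (std::list<PoolObject>::iterator it = idle.begin();
         it != idle.end(); ++it) {
        if (it->expired) continue;
        PoolObject& obj = *it;
        // Relink the node into `used`; no element is copied or freed.
        used.splice(used.end(), idle, it);
        assert(&used.back() == &obj);  // same node, now owned by `used`
        return obj.conn;
    }
    return std::shared_ptr<Connection>();
}
```

If the "used" list has to hold a copy instead, the same effect comes
from copying the pointer out of the PoolObject before erasing it from
the first list.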

On Dec 6, 11:23 pm, Paul Pluzhnikov <ppluzhnikov-...@charter.net>
wrote:

> "phear" <phear.d...@gmail.com> writes:
> > Valgrind also reports two errors that is likely related:
>
> I think this is *the* cause of your crashes, and I believe you are
> not interpreting these errors correctly.
>
> > ==9500== Invalid read of size 4
> > ==9500== at 0x43FD729: boost::detail::shared_count::shared_count(boost::detail::shared_count const&) (shared_count.hpp:165)
> > ==9500== by 0x43FD9DB: boost::shared_ptr<IDatabaseAdapter>::shared_ptr(boost::shared_ptr<IDatabaseAdapter> const&) (shared_ptr.hpp:106)
> > ==9500== by 0x43FD28F: DatabaseAdapterPool::GetAdapter(DataSource const&) (DatabaseAdapterPool.cc:188)
> > ==9500== by 0x43E30A2: IDatabaseAdapter::GetAdapter(DataSource const&) (DatabaseAdapter.cc:31)
> > ==9500== by 0x43F6A99: Instance::GetListEntryWithFields(...) (Instance.cc:505)
> > ==9500== by 0x43E6F25: ARDBCGetListEntryWithFields (syscomardbc.cc:757)
> > ==9500== by 0x804D011: ardbctest::testGetListEntryWithFields() (ardbctest.h:458)
> > ==9500== by 0x804E11D: ardbctest::run() (ardbctest.h:180)
> > ==9500== by 0x8049CAA: run() (main.cc:47)
> > ==9500== by 0x8049EBB: main (main.cc:68)
> > ==9500== Address 0x591D60C is 20 bytes inside a block of size 24 free'd
> > ==9500== at 0x401D268: operator delete(void*) (vg_replace_malloc.c:246)
> > ==9500== by 0x43FDF68: __gnu_cxx::new_allocator<...>::deallocate(...) (new_allocator.h:94)
> > ==9500== by 0x43FDF9D: std::_List_base<... >::_M_put_node(...) (stl_list.h:317)
> > ==9500== by 0x43FE1CA: std::list<...>::_M_erase(...) (stl_list.h:1163)
> > ==9500== by 0x43FE22C: std::list<...>::erase(...) (list.tcc:98)
> > ==9500== by 0x43FCD07: DatabaseAdapterPool::InternalPool::GetAdapter() (DatabaseAdapterPool.cc:68)
>
> Let me translate that message for you:
>
> in DatabaseAdapterPool.cc:188 you are accessing shared_ptr which
> resides in a list node, and that list node has already been free()d
> by list::erase() operation, called from DatabaseAdapterPool.cc:68
>
> > These only confirm what I already know though.
>
> No, they tell you the exact problem, which you didn't know: that you
> are misusing std::list.

phear

Dec 7, 2006, 4:26:33 AM

I (un)fortunately don't deal with Oracle directly. Since this plugin
has to support a wide array of database systems, I use the commercial
SQLAPI++ library to manage that for me.

Thanks for the info though.

On Dec 6, 9:09 pm, Arnold Hendriks <a.hendr...@b-lex.nl> wrote:
> phear wrote:
>
> > Valgrind reports quite a few errors/warnings, most of them occuring in
> > the oracle library on connect, which I think is normal. I will try
> > using mysql, to see if I can replicate the error there.
>
> If you're using OCI, I feel sorry for you. I have plenty of valgrind and
>
> > a segfault when running it though valgrind.
>
> That is to be expected: valgrind takes the hit (addresses stay valid
> longer than normally so valgrind can monitor them) so no segfault
> occurs. I believe you can use environment variables to tune how long
> valgrind keeps deallocatted addresses.
>
> Without valgrind, the "Invalid read" would have caused a segfault, if
> that address happened to be on a part of the heap that was returned back
> to the system. Even without valgrind, the allocator doesn't immediately
> return freed memory back to the OS.
>
>
>
> > You mentioned doing some intrusive code changes to analyze the
> > corruption. What did you have in mind?
>
> Using the intrusive boost shared pointers. It keeps the reference count
