http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7a67bc70d425ca23
(READ ALL!)
Sergey's single-threaded mem_pool code is here:
http://ders.stml.net/cpp/mtprog/mtprog.html
(website)
http://ders.stml.net/cpp/mtprog/code.zip
(code)
http://ders.stml.net/cpp/mtprog/doc/index.html
(doxygen)
His benchmark code is as follows:
void start_std(void*)
{
list<int> lst;
for (int i=0; i<N; i++) {
for (int j=0; j<M; j++) lst.push_back(j);
for (int j=0; j<M; j++) lst.pop_front();
}
}
simple enough. He challenged me to post results from my multi-threaded
general-purpose allocator using the same benchmark code. Here is my
response:
http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html
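For anyone who wants to reproduce the numbers, a minimal sketch of a harness around that loop might look like the following. It assumes C++11 `std::thread` and `std::chrono`, which the original posts did not use. `list_workload` and `run_benchmark` are hypothetical names, and the default allocator stands in for whichever allocator is under test.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <list>
#include <thread>
#include <vector>

// Per-thread workload: the push_back/pop_front loop from the post.
// Returns the final list size so a caller can check that it drained to 0.
inline std::size_t list_workload(int n, int m) {
    assert(n >= 0 && m >= 0);
    std::list<int> lst;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; ++j) lst.push_back(j);
        for (int j = 0; j < m; ++j) lst.pop_front();
    }
    return lst.size();
}

// Hypothetical harness: runs the workload on `threads` threads and
// returns the elapsed wall-clock time in milliseconds.
inline long long run_benchmark(int threads, int n, int m) {
    auto const start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([n, m] { (void)list_workload(n, m); });
    }
    for (auto& th : pool) th.join();
    auto const stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

Calling `run_benchmark(2, N, M)` and `run_benchmark(10, N, M)` would correspond to the 2-thread and 10-thread runs reported in this thread.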
The platform i used for the tests is a P4 3.06mhz HyperThread 512mb...
For a simple test with 2 threads (e.g., cmd-line<2 ders>), here is the
output I get with MINGW GCC 3.4.5 with an optimization level of -O3:
2 9656 ders
The second column is the number of milli-seconds the test ran.
Here is the output I get from running my code with 2 threads (e.g.,
THREAD_COUNT macro defined to a value of 2) compiled with MSVC 8.0 release
build:
(1) - test running
(1) - test finished
test time: 2875 ms
--------------------
For a test with 10 threads (e.g., cmd-line<10 ders>), Sergey's code gives
me:
10 39213 ders
Mine is (e.g., THREAD_COUNT macro defined to a value of 10):
(1) - test running
(1) - test finished
test time: 14563 ms
--------------------
Just to recap:
2 threads: Sergey(9656) -vs- Chris(2875) milliseconds
10 threads: Sergey(39213) -vs- Chris(14563) milliseconds
This does not prove anything! I urge others to run the benchmarks and post
the results here please!
Thank you all.
> The platform i used for the tests is a P4 3.06mhz HyperThread 512mb...
P4 3.06-GHZ Hyper-Thread 512-MB
!
[...]
Have you disabled assertions with -DNDEBUG?
--
With all respect, Sergey. http://ders.stml.net/
mailto : ders at skeptik.net
> 2 threads: Sergey(9656) -vs- Chris(2875) milliseconds
>
> 10 threads: Sergey(39213) -vs- Chris(14563) milliseconds
>
> This does not prove anything! I urge others to run the benchmarks and post
> the results here please!
I compile both with MSVC8 /Ox. Hardware is Intel Core 2 Quad 2.4GHz.
And set N=10000, M=10000.
4 threads:
std 254600
ders 2578
vzoom 2671 (std with defined THREAD_MALLOC_OVERLOAD_NEW_DELETE)
16 threads:
std - don't even want to run
ders 9500
vzoom 10515
Dmitriy V'jukov
Yeah. My code needs to have NDEBUG defined, or it performs an atomic
operation on every allocation and deallocation which would tank its
performance. Anyway, I need to see some numbers generated by others before I
can come to any concrete conclusions.
Great! Now that's more like it. I was wondering why my region allocator was
beating his memory allocation algorithm. Perhaps it is because region
allocators do not like certain usage patterns at all. I took a look at his
algorithm, and I concluded that region allocation should perform slower.
Also, I was compiling his code with MINGW/CYGWIN 3.4.5/3.4.4, that could
make a big difference.
Anyway, could you run test with 2 threads and 10 threads? Also, try a
single-threaded run. And make sure to define NDEBUG.
I would love to run Sergey's code through the benchmark code provided in the
following link:
http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
but I cannot because it requires the ability of allocating in one thread and
freeing in another...
;^(...
> I was wondering why my region allocator was beating his memory allocation
> algorithm.
Need to postfix the following to the sentence above:
`on the platform I was running the tests on.'
[...]
> > 2 threads: Sergey(9656) -vs- Chris(2875) milliseconds
> >
> > 10 threads: Sergey(39213) -vs- Chris(14563) milliseconds
> >
> > This does not prove anything! I urge others to run the benchmarks and
> > post
> > the results here please!
> I compile both with MSVC8 /Ox. Hardware is Intel Core 2 Quad 2.4GHz.
> And set N=10000, M=10000.
[...]
One other request, Dmitriy: in the following section of my code:
/* Global Settings
_____________________________________________________________*/
#define THREAD_MALLOC_OVERLOAD_NEW_DELETE
#if ! defined(THREAD_ALLOCATOR_HEAP_SIZE)
# define THREAD_ALLOCATOR_HEAP_SIZE 262144
#endif
#if ! defined(THREAD_ALLOCATOR_DTOR_MARK)
# define THREAD_ALLOCATOR_DTOR_MARK 0x80000000
#endif
#if ! defined(THREAD_ALLOCATOR_PRIME_COUNT)
# define THREAD_ALLOCATOR_PRIME_COUNT 1
#endif
#if ! defined(THREAD_ALLOCATOR_MAX_COUNT)
# define THREAD_ALLOCATOR_MAX_COUNT 2
#endif
Try increasing the prime count by 2 or 3, and increasing the max count by 3
or 4; just make sure the max count is a little bit larger than the prime
count. See if that affects the testing at all. Also, you could try
increasing the heap size a bit. I am interested to see how it changes things
for better or worse.
> > 2 threads: Sergey(9656) -vs- Chris(2875) milliseconds
>
> > 10 threads: Sergey(39213) -vs- Chris(14563) milliseconds
>
> > This does not prove anything! I urge others to run the benchmarks and post
> > the results here please!
>
> I compile both with MSVC8 /Ox. Hardware is Intel Core 2 Quad 2.4GHz.
> And set N=10000, M=10000.
>
> 4 threads:
> std 254600
> ders 2578
> vzoom 2671 (std with defined THREAD_MALLOC_OVERLOAD_NEW_DELETE)
>
> 16 threads:
> std - don't even want to run
> ders 9500
> vzoom 10515
Well, it seems that comparison was not very honest.
First, by concatenation of circumstances ders allocator was using
vzoom allocator as underlying allocator. Now I turn off vzoom
allocator on ders test.
Second, I create stl compatible allocator for vzoom.
Third, I turn off debug checks wrt startup in vzoom allocator.
Also I notice 2 errors in ders allocator. First, it doesn't align
memory! It just returns random addresses. Second, it returns NULL,
when I allocate 0 bytes. I don't fix them.
Here are the new results:
4 threads:
ders 2531
vzoom 2437
16 threads:
ders 9406
vzoom 9484
They look roughly equal. I think that alignment support will slow
ders down a bit. And note that vzoom is a MULTI-THREADED allocator,
while ders is single-threaded.
More important is that the vzoom approach can turn the ders allocator into a
multi-threaded allocator while PRESERVING its single-threaded performance.
Dmitriy V'jukov
I set:
# define THREAD_ALLOCATOR_HEAP_SIZE (262144*4)
# define THREAD_ALLOCATOR_PRIME_COUNT (1+3)
# define THREAD_ALLOCATOR_MAX_COUNT (2+4)
Now results:
4 threads:
vzoom 2390
16 threads:
vzoom 9718
Note that I made some changes in test:
http://groups.google.ru/group/comp.programming.threads/msg/2143c3a4344f25a7
Dmitriy V'jukov
> Anyway, could you run test with 2 threads and 10 threads? Also, try a
> single-threaded run. And make sure to define NDEBUG.
2 threads:
ders 2312
vzoom 2359
10 threads:
ders 5843
vzoom 6093
Compiler options:
/Ox /Ob2 /Oi /Ot /Oy /GT /GL /D "WIN32" /D "NDEBUG" /D "_MBCS" /GF /
FD /EHsc /MT /GS- /arch:SSE2 /Zi /TP
Dmitriy V'jukov
[...]
> Well, it seems that comparison was not very honest.
> First, by concatenation of circumstances ders allocator was using
> vzoom allocator as underlying allocator. Now I turn off vzoom
> allocator on ders test.
> Second, I create stl compatible allocator for vzoom.
> Third, I turn off debug checks wrt startup in vzoom allocator.
> Also I notice 2 errors in ders allocator. First, it doesn't align
> memory! It just returns random addresses. Second, it returns NULL,
> when I allocate 0 bytes. I don't fix them.
the alignment code in the pre-alpha version of my region allocator is
suboptimal at best... See, I am using sizeof on the following union to
determine max alignment:
union aligner {
char m_char;
short m_short;
int m_int;
long m_long;
double m_double;
float m_float;
aligner* m_this;
void* m_ptr;
void* (*m_fptr) (void*);
std::size_t m_size_t;
std::ptrdiff_t m_ptrdiff_t;
};
This clearly makes the max alignment bigger than it should be. I should do
this instead:
struct aligner_calc {
char padder;
aligner calc;
};
#define ALIGN_MAX offsetof(aligner_calc, calc)
Also, I am using an expensive modulo operator in the following macro:
#define IS_ALIGNED(mp_this, mp_align) ( \
! (((std::ptrdiff_t const)((mp_this))) % \
((std::ptrdiff_t const)((mp_align)))) \
)
this macro is used every time something is freed via the region::from_ptr
function! NOT GOOD! I need to change the macro to something like:
#define IS_ALIGNED(mp_this, mp_type, mp_align) ( \
(mp_this) == ALIGN_SIZE(mp_this, mp_type, mp_align) \
)
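The two ideas above, deriving the maximum alignment with `offsetof` and replacing the modulo with a mask, can be sketched roughly as follows. This is not the vzoom code: the `_sketch` names, `align_up`, and `is_aligned` are hypothetical, the `ALIGN_SIZE` macro from the post is not shown in this thread, and the mask trick assumes the alignment is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Union of candidate types, mirroring the `aligner` union from the post.
union aligner_sketch {
    char m_char;
    short m_short;
    int m_int;
    long m_long;
    double m_double;
    float m_float;
    void* m_ptr;
    void* (*m_fptr)(void*);
    std::size_t m_size_t;
    std::ptrdiff_t m_ptrdiff_t;
};

// The offset of `calc` is the alignment the compiler enforces for the union,
// which can be smaller than sizeof(aligner_sketch).
struct aligner_calc_sketch {
    char padder;
    aligner_sketch calc;
};

static const std::size_t ALIGN_MAX_SKETCH = offsetof(aligner_calc_sketch, calc);

// Round `value` up to the next multiple of `align` (power of two only).
inline std::uintptr_t align_up(std::uintptr_t value, std::size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value + (align - 1)) & ~static_cast<std::uintptr_t>(align - 1);
}

// Mask test instead of `%`; valid only for power-of-two alignments.
inline bool is_aligned(const void* p, std::size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (align - 1)) == 0;
}
```

Whether the mask is measurably faster than `%` depends on the compiler, which may already reduce a modulo by a constant power of two to the same instruction.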
That said, I notice a bug in my allocator. It won't throw std::bad_alloc if
an allocation fails. Well, I guess that it's good that it's only version 0.0.0
pre-alpha!
Ouch!
Okay, I will go ahead and fix those issues, and repost the code.
BTW, thank you so much for your time Dmitriy! I really do appreciate all of
your help wrt this issue.
[...]
I fixed the alignment code:
http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html
http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_v000_cpp.html
Now it's not wasting memory on every allocation!
> That said, I notice a bug in my allocator. It won't throw std::bad_alloc
> if an allocation fails. Well, I guess that its good that its only version
> 0.0.0 pre-alpha!
Stupid me! I forgot that the region::create function already throws
bad_alloc if allocation fails!
ARGHGH!
> I set:
> # define THREAD_ALLOCATOR_HEAP_SIZE (262144*4)
> # define THREAD_ALLOCATOR_PRIME_COUNT (1+3)
> # define THREAD_ALLOCATOR_MAX_COUNT (2+4)
> Now results:
> 4 threads:
> vzoom 2390
> 16 threads:
> vzoom 9718
It makes a couple of hundred ms improvement; not that bad... I wonder how
much correct alignment code would slow down the ders allocator; humm... When you
mentioned this alignment issue, I realized that my aligner code was fairly
crappy:
http://groups.google.com/group/comp.programming.threads/msg/24a5de7cb6c08614
therefore I fixed it:
http://groups.google.com/group/comp.programming.threads/msg/efb379890789a0e8
Then I thought, hey, mine at least tries to align things correctly, and ders
apparently does not... Well heck, that's not fair! Also, it seems like his
code simply does not like hyper-threading... Why? Well, I don't know.
Perhaps it is because of a thread stack alignment issue inherent in early
hyper-threading processors which causes false-sharing. Intel's fix was to
offset each thread stack by an increasing size. This is why AppCore contains
the following code in the file:
http://webpages.charter.net/appcore/appcore/src/ac_thread_c.html
(refer to last function in file...)
void* AC_CDECL
prv_thread_entry
( void *state )
{
int ret;
void *uret;
ac_thread_t *_this = state;
ret = pthread_setspecific
( g_tls_key,
_this );
if ( ret ) { assert( ! ret ); abort(); }
if ( _this->id < 64 )
{
AC_OS_ALLOCA( 2048 * _this->id );
uret = _this->fp_entry( (void*)_this->state );
}
else
{
uret = _this->fp_entry( (void*)_this->state );
}
return uret;
}
I use the Intel recommended hack to get around the problem. Now, let me try to
find a link to documentation of the problem:
http://softwarecommunity.intel.com/Wiki/Multi-threadappsforMulti-core/487.htm
Bingo! The 64-k aliasing problem! Nasty bug. False-sharing sucks!
;^)
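To see why the stagger helps, here is a small arithmetic sketch. It assumes that aliasing is decided by the low 16 bits of an address and that thread stacks are laid out 1 MiB apart; both the layout and the helper names are assumptions made for illustration, not something taken from AppCore.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Low 16 bits of an address: two addresses with the same value here are
// a multiple of 64 KiB apart.
inline std::uint32_t alias_index(std::uintptr_t addr) {
    return static_cast<std::uint32_t>(addr & 0xFFFFu);
}

// Stagger used in the post: 2048 bytes per thread id, for ids below 64.
inline std::size_t stack_stagger(unsigned id) {
    return id < 64 ? static_cast<std::size_t>(2048) * id : 0;
}

// Hypothetical layout: stacks spaced 1 MiB apart and growing downward,
// so the stagger is subtracted from each stack top.
inline std::uintptr_t staggered_top(std::uintptr_t first_top, unsigned id) {
    std::uintptr_t const spacing = static_cast<std::uintptr_t>(1) << 20;
    std::uintptr_t const drop = spacing * id + stack_stagger(id);
    assert(first_top >= drop);
    return first_top - drop;
}
```

Under these assumptions, unstaggered stack tops all share one alias index, while staggered ones differ for neighbouring ids. Since 32 * 2048 = 65536, ids 0 and 32 land on the same index again.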
> Note that I made some changes in test:
> http://groups.google.ru/group/comp.programming.threads/msg/2143c3a4344f25a7
Cool.
The blocks are aligned to sizeof(void*)
http://ders.stml.net/cpp/mtprog/doc/mem__pool_8cpp-source.html
> Second, it returns NULL,
> when I allocate 0 bytes. I don't fix them.
>
The same URL:
void* mem_pool::allocate(size_t size)
{
assert(size>0);
...
Read this entire thread:
http://groups.google.com/group/comp.lang.c/browse_frm/thread/6fc4da438e08028b
Aligning to sizeof(void*) is simply insufficient... free function, and
especially member function pointers, might need different alignments. Also,
on my 32-bit x86 platform with MSVC and GCC, doubles need stricter alignment
than sizeof(void*); they need to be on 8-byte boundaries.
Also, run this program:
http://groups.google.com/group/comp.lang.c/msg/be7e0d0e97c5e1d9
on one of your systems and report the output. Here is what I happen to get:
(8) == ALIGN_MAX
(001EFF70) == rawbuf
(001EFF80) == l2cachebuf
(001F0000) == pagebuf
(001F0000) == superbuf
and sizeof(void*) is 4. Well, your allocator will cause undefined behavior
on my system.
>> Second, it returns NULL,
>> when I allocate 0 bytes. I don't fix them.
>>
> The same URL:
>
> void* mem_pool::allocate(size_t size)
> {
> assert(size>0);
You can legitimately pass zero to malloc. Here are the docs:
http://www.opengroup.org/onlinepubs/009695399/functions/malloc.html
An implementation can return NULL or a pointer to some unique block. Anyway,
it's undefined behavior if the user makes use of this block for anything other
than passing it back to free.
It's explicitly designed to generate freeing across thread boundaries; I did
this to test the scalability of the allocator under load. I think your test
works fine for a pure thread-local test. But I have always been interested
in creating an allocator with great single-threaded and scalable multi-threaded
performance characteristics.
General purpose memory allocator must return address with the MOST
STRINGENT alignment, not just sizeof(void*).
See ISO C++03:
3.8/1:
The lifetime of an object is a runtime property of the object. The
lifetime of an object of type T begins
when:
-- storage with the proper alignment and size for type T is obtained,
and...
3.9/5:
Object types have alignment requirements (3.9.1, 3.9.2). The alignment
of a complete object type is an
implementation-defined integer value representing a number of bytes;
an object is allocated at an address
that meets the alignment requirements of its object type.
5.3.4/10:
A new-expression passes the amount of space requested to the
allocation function as the first argument of
type std::size_t. That argument shall be no less than the size of the
object being created; it may be
greater than the size of the object being created only if the object
is an array. For arrays of char and
unsigned char, the difference between the result of the new-expression
and the address returned by the
allocation function shall be an integral multiple of the most
stringent alignment requirement (3.9) of any
object type whose size is no greater than the size of the array being
created. [Note: Because allocation
functions are assumed to return pointers to storage that is
appropriately aligned for objects of any type, this
constraint on array allocation overhead permits the common idiom of
allocating character arrays into which
objects of other types will later be placed. ]
3.7.3.1/2:
The pointer returned shall be suitably aligned so that it can be
converted to a
pointer of any complete object type and then used to access the object
or array in the storage allocated...
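The idiom mentioned in the note to 5.3.4/10, allocating a character array and placing an object of another type into it, can be sketched like this. The function name is hypothetical, and the example leans on the quoted guarantee that storage from `new char[n]` is aligned for any object no larger than `n`.

```cpp
#include <cassert>
#include <cstdint>
#include <new>

// Builds a double inside storage obtained as a char array, reads it back,
// then releases the storage. Returns the value that was read.
inline double roundtrip_in_char_storage(double value) {
    char* raw = new char[sizeof(double)];
    // The storage is expected to be suitably aligned for a double.
    assert(reinterpret_cast<std::uintptr_t>(raw) % alignof(double) == 0);
    double* obj = new (raw) double(value);  // placement new into the raw storage
    double const result = *obj;
    delete[] raw;  // double is trivially destructible, so no destructor call is needed
    return result;
}
```

A replacement allocator that only aligns to sizeof(void*) would make the assertion fail on a platform where double needs 8-byte alignment and pointers are 4 bytes wide.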
> > Second, it returns NULL,
> > when I allocate 0 bytes. I don't fix them.
>
> The same URL:
>
> void* mem_pool::allocate(size_t size)
> {
> assert(size>0);
User CAN allocate 0 bytes. And allocator MUST return unique address
anyway.
3.7.3.1/2 Allocation functions:
Even if the size of the space requested is zero, the request can fail.
If the request succeeds, the value returned shall be a nonnull pointer
value (4.10) p0 different from any previously returned value p1.
I am quite surprised to see C++ memory allocator which doesn't satisfy
those requirements...
Dmitriy V'jukov
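The requirement in 3.7.3.1/2 can be checked directly against the global allocation function. The helper name below is hypothetical; the calls themselves are the standard `::operator new` and `::operator delete`.

```cpp
#include <cassert>
#include <new>

// Requests two zero-byte blocks from the global allocation function and
// reports whether they are both non-null and distinct from each other.
// ::operator new throws std::bad_alloc instead of returning null on failure.
inline bool zero_size_blocks_are_distinct() {
    void* a = ::operator new(0);
    void* b = ::operator new(0);
    bool const ok = (a != nullptr) && (b != nullptr) && (a != b);
    ::operator delete(b);
    ::operator delete(a);
    return ok;
}
```

A conforming implementation should make `assert(zero_size_blocks_are_distinct());` pass, and a replacement for global operator new would need to keep that property.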
> General purpose memory allocator must return address with the MOST
> STRINGENT alignment, not just sizeof(void*).
[...]
> User CAN allocate 0 bytes. And allocator MUST return unique address
> anyway.
> 3.7.3.1/2 Allocation functions:
> Even if the size of the space requested is zero, the request can fail.
> If the request succeeds, the value returned shall be a nonnull pointer
> value (4.10) p0 different from any previously returned value p1.
AFAICT, the above says that malloc can respond to a request of zero bytes in
one of two ways...
It can "pretend" it failed by returning NULL, or it can return non-NULL
address which can be passed back to free. This is implementation defined.
Although, I believe that the only time it returns NULL is if the allocator
was truly out of memory, or else it could trick the program into thinking
it's under an out-of-memory condition when it's really not. Also, any code
which does something like:
char* ptr = malloc(0);
if (ptr) {
free(ptr);
}
is 100% conforming. Although, a program that does this:
char* ptr = malloc(0);
if (ptr) {
ptr[0] = 'a';
free(ptr);
}
results in undefined behavior because according to the allocation, the
non-null `ptr' variable points to a chunk of "something" in the allocator
that is effectively zero-bytes long which means that any mutations will
cause demons to fly out of the programmer's nose or something... So, AFAICT,
the following should be legit:
void* malloc(size_t sz) {
if (! sz) {
static char zero_bytes;
return &zero_bytes;
}
[...];
return [...];
}
> I am quite surprised to see C++ memory allocator which doesn't satisfy
> those requirements...
yeah. Well, I would overlook it if the code was version 0.0.0 pre-alpha...
;^D
> aligning to sizeof(void*) is simply insufficient... free function, and
> especially member function pointers, might need different alignments. Also,
> on my 32-bit x86 platform with MSVC and GCC, doubles need stricter alignment
> than sizeof(void*); they need to be on 8-byte boundaries.
>
> Also, run this program:
>
> http://groups.google.com/group/comp.lang.c/msg/be7e0d0e97c5e1d9
>
> on one of your systems and report the output. Here is what I happen to get:
>
> (8) == ALIGN_MAX
> (001EFF70) == rawbuf
> (001EFF80) == l2cachebuf
> (001F0000) == pagebuf
> (001F0000) == superbuf
>
> and sizeof(void*) is 4. Well, your allocator will cause undefined behavior
> on my system.
You forget about platform-specific stuff. For example x86 have SSE2.
And data for most SSE2 instructions MUST be aligned on 16 bytes
(otherwise you will get hardware exception). On such platform user is
free to allocate memory with custom memory allocator, and then pass it
to SSE2 instruction. And he will be surprised if he will get hardware
exception. Standard malloc() on x86 allocates memory exactly with
alignment 16.
So, to be production-ready you have to extend your 'union aligner'
with ifdefs and platform-specific stuff.
> You can legitimately pass zero to malloc. Here is docs:
>
> http://www.opengroup.org/onlinepubs/009695399/functions/malloc.html
>
> An implementation can return NULL or some pointer to unique block. Anyway,
> its undefined behavior if user makes use of this block for anything else but
> passing it back to free.
User can compare address of such block with another block for
equality.
For example. Server accepts requests. Some requests have associated
data and some don't have. For latter requests you can allocate zero-
sized block of memory, and then use something like this:
if (request_data_1 == request_data_2) ...
when you searching list of requests. Even if there are 2 requests
without associated data, their addresses MUST BE different.
Dmitriy V'jukov
> > User CAN allocate 0 bytes. And allocator MUST return unique address
> > anyway.
> > 3.7.3.1/2 Allocation functions:
> > Even if the size of the space requested is zero, the request can fail.
> > If the request succeeds, the value returned shall be a nonnull pointer
> > value (4.10) p0 different from any previously returned value p1.
>
> AFAICT, the above says that malloc can respond to a request of zero bytes in
> one of two ways...
>
> It can "pretend" it failed by returning NULL, or it can return non-NULL
> address which can be passed back to free. This is implementation defined.
> Although, I believe that the only time it returns NULL is if the allocator
> was truly out of memory, or else it could trick the program into thinking
> its under an out-of-memory condition, when its really not.
Yeah, I will be very surprised if implementation will do so.
But some custom memory allocator is free to do so, provided it clearly
documents this moment. Then user will be just watching out to not
request 0 bytes. But if implementation at same time overloads global
operator new/malloc, then it can affect stdlib implementation, which
is not ready for such behavior.
> Also, any code
> which does something like:
>
> char* ptr = malloc(0);
> if (ptr) {
> free(ptr);
>
> }
>
> is 100% conforming. Although, a program that does this:
>
> char* ptr = malloc(0);
> if (ptr) {
> ptr[0] = 'a';
> free(ptr);
>
> }
>
> results in undefined behavior because according to the allocation, the
> non-null `ptr' variable points to a chunk of "something" in the allocator
> that is effectively zero-bytes long which means that any mutations will
> cause demons to fly out of the programmers nose or something... So, AFAICT,
> the following should be legit:
>
> void* malloc(size_t sz) {
> if (! sz) {
> static char zero_bytes;
> return &zero_bytes;
> }
> [...];
> return [...];
> }
NO! "the value returned shall be a nonnull pointer value (4.10) p0
different from any previously returned value p1"! Even if I allocate 0
bytes, I MUST have some "object identity" (unique address).
Dmitriy V'jukov
>> void* mem_pool::allocate(size_t size)
>> {
>> assert(size>0);
>
> You can legitimately pass zero to malloc. Here is docs:
>
mem_pool isn't malloc. The preconditions differ.
You're right! Thank you. Sure enough, when I include <xmmintrin.h> and add the
data-type `__m128' to the `union aligner', the `REGION_ALIGN_MAX' expands to
16 instead of 8. What do you think is the most elegant way to handle this...
Humm. I REALLY need to think here. Basically, if a user wants to use the
current commercial version of the vzoom slab allocator with SSE she/he would
have to use a special function called:
void* vz_malloc_aligned(void** pbasemem, size_t size, size_t align);
and use it like:
void* sse_base_mem;
__m128* sse_align_mem = vz_malloc_aligned(&sse_base_mem, sizeof(__m128),
16);
if (sse_align_mem) {
[use sse_align_mem];
vz_free(sse_base_mem);
}
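The source of `vz_malloc_aligned` is not shown in this thread. A function with the same shape could be sketched on top of `std::malloc` as below; `sketch_malloc_aligned` is a hypothetical stand-in, and it assumes `align` is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for an interface shaped like vz_malloc_aligned:
// stores the raw block in *pbasemem (the pointer to free later) and
// returns an address inside that block aligned to `align`.
inline void* sketch_malloc_aligned(void** pbasemem, std::size_t size, std::size_t align) {
    assert(pbasemem != nullptr);
    assert(align != 0 && (align & (align - 1)) == 0);
    *pbasemem = nullptr;
    if (size > SIZE_MAX - (align - 1)) return nullptr;  // size + slack would overflow
    void* const base = std::malloc(size + (align - 1));  // over-allocate by the slack
    if (base == nullptr) return nullptr;
    *pbasemem = base;
    std::uintptr_t const raw = reinterpret_cast<std::uintptr_t>(base);
    std::uintptr_t const aligned =
        (raw + (align - 1)) & ~static_cast<std::uintptr_t>(align - 1);
    return static_cast<char*>(base) + (aligned - raw);
}
```

The caller keeps the base pointer and frees that, never the aligned pointer, which matches the usage pattern shown above.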
So far, I have received no complaints. Humm... What do you think? Should I
just add an analogous function to the vzoom region allocator? Or should I
perform the per-arch #ifdefs? Something like:
#ifdef VZBUILD_COMPILER_MSVC
# ifdef VZBUILD_ARCH_X86_SSE
# include <xmmintrin.h>
# define VZBUILD_ARCH_X86_SSE_TYPE char m_ ## __LINE__
# endif
#endif
#ifndef VZBUILD_ARCH_X86_SSE_TYPE
# define VZBUILD_ARCH_X86_SSE_TYPE
#endif
#define VZBUILD_MALLOC_ALIGNER_EXTRA() VZBUILD_ARCH_X86_SSE_TYPE
That would work. Also, I could do:
# ifdef VZBUILD_ARCH_X86_SSE
# define VZBUILD_ARCH_X86_SSE_TYPE VZBUILD_DECLSPEC_ALIGN(16) char m_ ## __LINE__
# endif
#ifndef VZBUILD_ARCH_X86_SSE_TYPE
# define VZBUILD_ARCH_X86_SSE_TYPE
#endif
#define VZBUILD_MALLOC_ALIGNER_EXTRA() VZBUILD_ARCH_X86_SSE_TYPE
This could be extendable to different types... like:
# ifdef VZBUILD_ARCH_X86_SSE
# define VZBUILD_ARCH_X86_SSE_TYPE VZBUILD_DECLSPEC_ALIGN(16) char m_x86_sse
# endif
# ifdef VZBUILD_ARCH_WHATEVER_SPECIAL
# define VZBUILD_ARCH_WHATEVER_SPECIAL_TYPE VZBUILD_DECLSPEC_ALIGN(128) char m_whatever_special
# endif
#ifndef VZBUILD_ARCH_X86_SSE_TYPE
# define VZBUILD_ARCH_X86_SSE_TYPE
#endif
#ifndef VZBUILD_ARCH_WHATEVER_SPECIAL_TYPE
# define VZBUILD_ARCH_WHATEVER_SPECIAL_TYPE
#endif
#define VZBUILD_MALLOC_ALIGNER_EXTRA() \
VZBUILD_ARCH_X86_SSE_TYPE ; \
VZBUILD_ARCH_WHATEVER_SPECIAL_TYPE
Finally, after all of the #ifdef MESS, the aligner union can be as follows:
union aligner {
char m_char;
short m_short;
int m_int;
long m_long;
double m_double;
float m_float;
aligner* m_this;
void* m_ptr;
void* (*m_fptr) (void*);
std::size_t m_size_t;
std::ptrdiff_t m_ptrdiff_t;
VZBUILD_MALLOC_ALIGNER_EXTRA();
};
Which method do you like the best? Theoretically, the
`VZBUILD_MALLOC_ALIGNER_EXTRA' macro could contain special type alignments
for every platform vzoom supports.
One more thing... I need to add a member function type in the aligner... I
forgot to! ARGH.
;^(...
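As a rough sketch of the #ifdef route, an aligner union can pull in `__m128` only where SSE is available and let `offsetof` report the resulting alignment. The detection macros used here (`__SSE__`, `_M_X64`, `_M_IX86_FP`) are the usual GCC/Clang and MSVC ones; the `SKETCH_` and `_sketch` names are hypothetical and not part of vzoom.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define SKETCH_HAS_SSE 1
#else
#  define SKETCH_HAS_SSE 0
#endif

union aligner_ext {
    long m_long;
    double m_double;
    void* m_ptr;
    void* (*m_fptr)(void*);
    std::size_t m_size_t;
#if SKETCH_HAS_SSE
    __m128 m_sse;  // raises the union's alignment to 16 where SSE is available
#endif
};

struct aligner_ext_calc {
    char padder;
    aligner_ext calc;
};

static const std::size_t REGION_ALIGN_MAX_SKETCH = offsetof(aligner_ext_calc, calc);

// Rounds `sz` up to a multiple of the computed maximum alignment.
inline std::size_t round_to_region_align(std::size_t sz) {
    assert((REGION_ALIGN_MAX_SKETCH & (REGION_ALIGN_MAX_SKETCH - 1)) == 0);
    return (sz + (REGION_ALIGN_MAX_SKETCH - 1)) & ~(REGION_ALIGN_MAX_SKETCH - 1);
}
```

The trade-off is that every allocation is then rounded up to the wider alignment, whereas a separate aligned-allocation function only costs the callers that need it.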
> > You can legitimately pass zero to malloc. Here is docs:
> >
> > http://www.opengroup.org/onlinepubs/009695399/functions/malloc.html
> >
> > An implementation can return NULL or some pointer to unique block.
> > Anyway,
> > its undefined behavior if user makes use of this block for anything else
> > but
> > passing it back to free.
> User can compare address of such block with another block for
> equality.
> For example. Server accepts requests. Some requests have associated
> data and some don't have. For latter requests you can allocate zero-
> sized block of memory, and then use something like this:
> if (request_data_1 == request_data_2) ...
> when you searching list of requests. Even if there are 2 requests
> without associated data, their addresses MUST BE different.
Great point. This is a bug in my region allocator as well. Luckily, the fix
is very trivial indeed! Here is what I need to do. In the following
function:
___________________________________________________________________________
void* allocate(std::size_t sz) throw(std::bad_alloc) {
0: assert(m_startup);
1: startup();
2: if (! m_startup) {
3: std::printf("STATIC STARTUP ORDER ERROR!!!!!!!!!!\n");
}
4: sz = ALIGN_SIZE(sz, std::size_t, REGION_ALIGN_MAX);
if (sz <= T_heap_size) {
region* node = m_head;
while (node) {
void* const ptr = node->allocate_local(sz);
if (ptr) {
if (node != m_head) {
if (! node->is_full()) {
promote(node);
}
}
return ptr;
}
node = node->m_next;
}
return expand()->allocate_local(sz);
}
return NULL;
}
___________________________________________________________________________
I need to change line 4 to something like:
sz = ALIGN_SIZE((! sz) ? 1 : sz, std::size_t, REGION_ALIGN_MAX);
this would allow the following to never fail:
{
tls_malloc.startup();
void* a = tls_malloc.allocate(0);
void* b = tls_malloc.allocate(0);
assert(a != b);
tls_malloc.deallocate(b);
tls_malloc.deallocate(a);
tls_malloc.shutdown();
}
Thanks again Dmitriy!!!! I am applying fixes and will post them here
shortly.
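The effect of rounding a zero-byte request up to one byte can be seen in a toy bump allocator. This is only an illustration under assumed constants (16-byte alignment, 1 KiB buffer); `bump_region` is a hypothetical class and not the vzoom region allocator.

```cpp
#include <cassert>
#include <cstddef>

// Toy bump allocator over a fixed buffer.
class bump_region {
public:
    bump_region() : m_used(0) {}

    // Returns nullptr when the buffer cannot satisfy the request.
    void* allocate(std::size_t sz) {
        if (sz == 0) sz = 1;                 // zero-byte requests still get their own slot
        if (sz > CAPACITY) return nullptr;   // also keeps the rounding below from overflowing
        std::size_t const rounded = (sz + (ALIGN - 1)) & ~(ALIGN - 1);
        if (rounded > CAPACITY - m_used) return nullptr;
        void* const ptr = m_buf + m_used;
        m_used += rounded;
        return ptr;
    }

private:
    static const std::size_t ALIGN = 16;       // assumed maximum alignment
    static const std::size_t CAPACITY = 1024;  // assumed region size
    alignas(16) unsigned char m_buf[CAPACITY];
    std::size_t m_used;
};
```

Because each zero-byte request consumes one aligned slot, two such requests cannot return the same address while both are live.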
The type `double' on the platform I happen to be using right now requires an
alignment of 8. sizeof(void*) is 4, so it's not going to work.
Good point.
YIKES! STUPID ME!!!
:^o
I fixed this in the region allocator. It rounds zero-byte allocations up to 1.
That way, every zero-byte request will be unique, unless an out-of-memory
condition is hit; then it can return NULL, which will trigger a bad_alloc
exception.
Ummm. That is supposed to be:
# define VZBUILD_ARCH_X86_SSE_TYPE __m128 m_ ## __LINE__
of course!!!
> # endif
> #endif
>
> #ifndef VZBUILD_ARCH_X86_SSE_TYPE
> # define VZBUILD_ARCH_X86_SSE_TYPE
> #endif
>
> #define VZBUILD_MALLOC_ALIGNER_EXTRA() VZBUILD_ARCH_X86_SSE_TYPE
[...]
> I.e. sizeof(double) will be rounded to 2*sizeof(void*) which is exactly 8.
Where are you getting the 2 from? Is it sizeof(double) / sizeof(void*)? I
still don't think you're doing alignment correctly. What am I missing? Why
does the following GCC program not work on my platform as-is:
________________________________________________________________
#include <cstdio>
#include <cstddef>
#include <list>
#include <ders/stl_alloc.hpp>
typedef __attribute__((aligned(16))) unsigned char __m128[16];
#define IS_ALIGNED(mp_this, mp_align) ( \
! (((std::ptrdiff_t)(mp_this)) % (mp_align)) \
)
int main() {
ders::mem_pool mp;
double* d1 = (double*)mp.allocate(sizeof(*d1));
__m128* sse1 = (__m128*)mp.allocate(sizeof(*sse1));
double* d2 = (double*)mp.allocate(sizeof(*d2));
__m128* sse2 = (__m128*)mp.allocate(sizeof(*sse2));
if (! IS_ALIGNED(d1, 8) ||
! IS_ALIGNED(d2, 8)) {
std::puts("double not aligned!");
}
if (! IS_ALIGNED(sse1, 16) ||
! IS_ALIGNED(sse2, 16)) {
std::puts("sse not aligned!");
}
mp.deallocate(sse2, sizeof(*sse2));
mp.deallocate(d2, sizeof(*d2));
mp.deallocate(sse1, sizeof(*sse1));
mp.deallocate(d1, sizeof(*d1));
return 0;
}
________________________________________________________________
I get an output of
sse not aligned!
This will segfault if I try to use your allocator and SSE... AFAICT, using
SSE is real-world. I need to fix my allocator to handle this as well. Also,
why am I required to include the size of an allocation when I call your
deallocate function? This can be a nuisance, and may not work for some programs.
You are not returning pointers with the strictest alignment...
Ummm.. Well, on my MSVC 8, the following program outputs OH CRAP! which
means that the standard allocator does not return pointers aligned on a
boundary of 16!:
_________________________________________________________________
#include <xmmintrin.h>
#include <cstddef>
#include <cstdlib>
#include <cstdio>
#define IS_ALIGNED(mp_this, mp_align) ( \
! (((std::ptrdiff_t)(mp_this)) % (mp_align)) \
)
int main() {
__m128* sse1 = (__m128*)std::malloc(sizeof(*sse1));
__m128* sse2 = (__m128*)std::malloc(sizeof(*sse2));
__m128* sse3 = (__m128*)std::malloc(sizeof(*sse3));
if (! IS_ALIGNED(sse1, 16) ||
! IS_ALIGNED(sse2, 16) ||
! IS_ALIGNED(sse3, 16)) {
std::puts("OH CRAP!");
}
std::free(sse3);
std::free(sse2);
std::free(sse1);
return 0;
}
_________________________________________________________________
Run the program and tell me what you get.
> So, to be production-ready you have to extend your 'union aligner'
> with ifdefs and platform-specific stuff.
Maybe not. The standard allocator on my copy of MSVC does not work for SSE!
:^O
[...]
> Also, why am I required to include the size of an allocation when
> I call your deallocate function?
>
This is a C++-specific question.
The point is that it helps to create really fast C++ (class)
allocators...
Moreover, http://ders.stml.net/cpp/mtprog/doc/destroy_8hpp-source.html
shows some other non-trivial C++ tricks so you need certain C++
background to grok what's going on...
Ummm... Don't worry about this too much because, well, the standard
allocator (e.g., malloc) did not align for SSE either. I think a user is
going to have to do it manually. One other point, no matter how hard we try
to get alignment code to work, it's still not going to be perfect. There are
some conforming platforms that will require weird alignments that our
allocators will not be able to cope with no matter how hard we try... I can
get my region allocator to conform to SSE by adding the `__m128' data-type
to the aligner union, but that's only x86 specific. There will be some other
platform that will have something that the aligner union simply does not
include.
Apparently SSE is out of scope for system provided allocator as well!!
;^)
> > Also, why am I required to include the size of an allocation when
>> I call your deallocate function?
> >
> This is a C++-specific question.
> The point is that it helps to create really fast C++ (class)
> allocators...
Completely agreed. It's a huge help to know the size when you're freeing
memory.
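One reason the size helps: a pool can pick the right free list from the size alone, so blocks need no per-block header. The sketch below is a toy single-threaded pool with assumed 16-byte size classes; `sized_pool` is a hypothetical class and not ders::mem_pool.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Toy pool: one free list per 16-byte size class, up to 128 bytes.
class sized_pool {
public:
    sized_pool() { for (std::size_t i = 0; i < CLASSES; ++i) m_free[i] = nullptr; }
    sized_pool(const sized_pool&) = delete;
    sized_pool& operator=(const sized_pool&) = delete;

    ~sized_pool() {
        for (std::size_t i = 0; i < CLASSES; ++i) {
            while (m_free[i]) {
                node* n = m_free[i];
                m_free[i] = n->next;
                ::operator delete(n);
            }
        }
    }

    void* allocate(std::size_t size) {
        assert(size > 0 && size <= MAX_SIZE);  // precondition; zero is not accepted
        std::size_t const c = class_of(size);
        if (node* n = m_free[c]) { m_free[c] = n->next; return n; }
        return ::operator new((c + 1) * GRANULE);
    }

    // The caller supplies the size, so the class is found without a header.
    void deallocate(void* p, std::size_t size) {
        assert(p != nullptr && size > 0 && size <= MAX_SIZE);
        std::size_t const c = class_of(size);
        node* n = ::new (p) node;  // reuse the freed block as a list node
        n->next = m_free[c];
        m_free[c] = n;
    }

private:
    struct node { node* next; };
    static const std::size_t GRANULE = 16;
    static const std::size_t CLASSES = 8;
    static const std::size_t MAX_SIZE = GRANULE * CLASSES;
    static std::size_t class_of(std::size_t size) { return (size - 1) / GRANULE; }
    node* m_free[CLASSES];
};
```

The cost of this design is the one raised earlier in the thread: the caller has to remember the size and pass the same size class back on deallocation.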
> Moreover, http://ders.stml.net/cpp/mtprog/doc/destroy_8hpp-source.html
> shows some other non-trivial C++ tricks so you need certain C++ background
> to grok what's going on...
I get it. I am more of a C guy, and my region allocator was specifically
designed to overload global operator new/delete. I don't think you can do a
global delete operator with an additional size parameter. Am I right?
;^(...
Yes, you are right.
They just write: "The storage space pointed to by the return value is
guaranteed to be suitably aligned for storage of any type of object".
But doesn't specify what is "any type of object".
So I think, it's Okay for vzoom to use 8 byte alignment by default,
and provide vz_malloc_aligned().
Dmitriy V'jukov