Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

alloca-alternative

161 views
Skip to first unread message

Bonita Montero

unread,
Nov 6, 2019, 5:44:09 AM11/6/19
to
I just found that alloca() with MSVC isn't that fast that it could be.
alloca() calls a function called __chkstk which touches the pages down
the stack to tigger Windows' overcommitting of stacks, That's while
Windows is only able to allocate new pages to a stack when the pages
are touched down the stack, i.e. you'll get a exception if you skip
a page.
So I came to the conclusion to write a little class that has a static
internal buffer with two template-parameters: first the type of the
internal static array and second the size of the array. The construc-
tor takes a parameter which will be the final size of the container;
if it is larger than the second template-parameter, an external array
will be allocated via new T[N].
The allocation will have more overhead than an alloca(), but my idea
is that if there are a larger number of entries the processing time
on the entires will outweigh the allocation.
I'm asking myself if there's a class in boost or another well-known
classlib that implements the same pattern.

So here's the code:

#pragma once
#include <cstddef>
#include <utility>
#include <stdexcept>
#include <algorithm>

template<typename T, std::size_t N>
struct overflow_array
{
overflow_array( std::size_t n );
~overflow_array();
T &operator []( std::size_t i );
T &front();
T &back();
T *data();
T *begin();
T *end();
T const *cbegin() const;
T const *cend() const;
void resize( std::size_t n );

private:
T m_array[N];
T *m_external;
T *m_begin,
*m_end;
};

template<typename T, std::size_t N>
inline
overflow_array<T, N>::overflow_array( std::size_t n )
{
if( N <= n )
{
m_external = nullptr;
m_begin = m_array;
return;
}
m_external = new T[n];
m_begin = m_external;
m_end = m_external + n;
}

template<typename T, std::size_t N>
inline
overflow_array<T, N>::~overflow_array()
{
if( m_external )
delete []m_external;
}

template<typename T, std::size_t N>
inline
T &overflow_array<T, N>::operator []( std::size_t i )
{
return m_begin[i];
}

template<typename T, std::size_t N>
inline
T &overflow_array<T, N>::front()
{
return *m_begin;
}

template<typename T, std::size_t N>
inline
T &overflow_array<T, N>::back()
{
return m_end[-1];
}

template<typename T, std::size_t N>
inline
T *overflow_array<T, N>::data()
{
return m_begin;
}

template<typename T, std::size_t N>
inline
T *overflow_array<T, N>::begin()
{
return m_begin;
}

template<typename T, std::size_t N>
inline
T *overflow_array<T, N>::end()
{
return m_end;
}

template<typename T, std::size_t N>
inline
T const *overflow_array<T, N>::cbegin() const
{
return m_begin;
}

template<typename T, std::size_t N>
inline
T const *overflow_array<T, N>::cend() const
{
return m_end;
}

template<typename T, std::size_t N>
inline
void overflow_array<T, N>::resize( std::size_t n )
{
if( n <= N )
return;
T *newExternal = new T[n];
copy( m_begin, m_end, newExternal );
delete []m_external;
m_external = newExternal;
m_begin = newExternal;
m_end = newExternal + n;
}

Bonita Montero

unread,
Nov 6, 2019, 5:46:26 AM11/6/19
to
> template<typename T, std::size_t N>
> inline
> overflow_array<T, N>::overflow_array( std::size_t n )
> {
>     if( N <= n )
>     {
>         m_external = nullptr;
>         m_begin = m_array;
m_end = m_array + n;

Paavo Helde

unread,
Nov 6, 2019, 6:18:36 AM11/6/19
to
On 6.11.2019 12:43, Bonita Montero wrote:
> I just found that alloca() with MSVC isn't that fast that it could be.
> alloca() calls a function called __chkstk which touches the pages down
> the stack to tigger Windows' overcommitting of stacks, That's while
> Windows is only able to allocate new pages to a stack when the pages
> are touched down the stack, i.e. you'll get a exception if you skip
> a page.
> So I came to the conclusion to write a little class that has a static
> internal buffer with two template-parameters: first the type of the
> internal static array and second the size of the array. The construc-
> tor takes a parameter which will be the final size of the container;
> if it is larger than the second template-parameter, an external array
> will be allocated via new T[N].
> The allocation will have more overhead than an alloca(), but my idea
> is that if there are a larger number of entries the processing time
> on the entires will outweigh the allocation.
> I'm asking myself if there's a class in boost or another well-known
> classlib that implements the same pattern.

Looks like boost::container::small_vector:
"https://www.boost.org/doc/libs/1_71_0/doc/html/boost/container/small_vector.html"

Marcel Mueller

unread,
Nov 6, 2019, 3:01:07 PM11/6/19
to
Am 06.11.19 um 11:43 schrieb Bonita Montero:
> I just found that alloca() with MSVC isn't that fast that it could be.
> alloca() calls a function called __chkstk which touches the pages down
> the stack to tigger Windows' overcommitting of stacks,

Guard pages are no Windows-only feature.

> That's while
> Windows is only able to allocate new pages to a stack when the pages
> are touched down the stack, i.e. you'll get a exception if you skip
> a page.

Exactly. This prevents from accidental out of bounds access.

But there should be functions that commit more than one stack page at
once. (I don't remember exactly on WinXX)

On the other side it is bad practice to allocate large data structures
on the stack. The maximum stack size is always limited and cannot be
enlarged once a thread has been started, because the virtual address
space of the stack must be continuous.


> So I came to the conclusion to write a little class that has a static
> internal buffer with two template-parameters: first the type of the
> internal static array and second the size of the array. The construc-
> tor takes a parameter which will be the final size of the container;
> if it is larger than the second template-parameter, an external array
> will be allocated via new T[N].
> The allocation will have more overhead than an alloca(), but my idea
> is that if there are a larger number of entries the processing time
> on the entires will outweigh the allocation.

So basically you wrote a collection with a limited inline storage.

> I'm asking myself if there's a class in boost or another well-known
> classlib that implements the same pattern.

Not that I know. But some string implementations use this pattern.
Keyword "short string optimization".


Marcel

Chris M. Thomasson

unread,
Nov 6, 2019, 7:27:32 PM11/6/19
to
On 11/6/2019 2:43 AM, Bonita Montero wrote:
> I just found that alloca() with MSVC isn't that fast that it could be.
> alloca() calls a function called __chkstk which touches the pages down
> the stack to tigger Windows' overcommitting of stacks, That's while
> Windows is only able to allocate new pages to a stack when the pages
> are touched down the stack, i.e. you'll get a exception if you skip
> a page.
> So I came to the conclusion to write a little class that has a static
> internal buffer with two template-parameters: first the type of the
> internal static array and second the size of the array. The construc-
> tor takes a parameter which will be the final size of the container;
> if it is larger than the second template-parameter, an external array
> will be allocated via new T[N].
> The allocation will have more overhead than an alloca(), but my idea
> is that if there are a larger number of entries the processing time
> on the entires will outweigh the allocation.
> I'm asking myself if there's a class in boost or another well-known
> classlib that implements the same pattern.
[...]

This reminds me of my ralloc experiment. It can be used as a sort of alloca:

https://pastebin.com/raw/f37a23918

https://groups.google.com/forum/#!original/comp.lang.c/7oaJFWKVCTw/sSWYU9BUS_QJ

Bonita Montero

unread,
Nov 7, 2019, 12:21:03 AM11/7/19
to
>> I just found that alloca() with MSVC isn't that fast that it could be.
>> alloca() calls a function called __chkstk which touches the pages down
>> the stack to tigger Windows' overcommitting of stacks,

> Guard pages are no Windows-only feature.

You didn't understand: Windows has a moving guard-page until the#
stack-bottom. This guard-page triggers Windows stack-overcomitting.

>> That's while
>> Windows is only able to allocate new pages to a stack when the pages
>> are touched down the stack, i.e. you'll get a exception if you skip
>> a page.

> Exactly. This prevents from accidental out of bounds access.

That's stupid. Windows doesn't know if it's a out-of-bound-access or
a correct access.

>> I'm asking myself if there's a class in boost or another well-known
>> classlib that implements the same pattern.

> Not that I know. But some string implementations use this pattern.
> Keyword "short string optimization".

short-string-optimizations aren't made for stack-storage; they can
reside on the heap as well. But my class is made for stack-storage
only.

Alf P. Steinbach

unread,
Nov 7, 2019, 11:40:57 AM11/7/19
to
On 06.11.2019 11:43, Bonita Montero wrote:
> I just found that alloca() with MSVC isn't that fast that it could be.

Checking things out, at least in a little toy test program the standard
allocation with g++ is amazingly fast.

I got the following results from appending single chars to a
std::basic_string, respectively using an arena allocator I cooked up (it
works like alloca except it defers all deallocation to the end) and the
standard allocator, and -O3 option to g++


Arena alloc: 987654 chars in 0.0034617 seconds.
Standard alloc: 987654 chars in 0.0035589 seconds.
Arena alloc: 987654 chars in 0.003044 seconds.
Standard alloc: 987654 chars in 0.0038853 seconds.
Arena alloc: 987654 chars in 0.0030895 seconds.
Standard alloc: 987654 chars in 0.003398 seconds.
Arena alloc: 987654 chars in 0.0061186 seconds.
Standard alloc: 987654 chars in 0.003454 seconds.
Arena alloc: 987654 chars in 0.0031488 seconds.
Standard alloc: 987654 chars in 0.0041217 seconds.

Presumably the standard allocator, before memory starts getting
fragmented just ends up doing the same as an arena allocator. No costly
searches for free blocks. Still it boggles my mind that the standard
allocator at times is even /faster/ than the simple stack like
"increment a pointer or size" allocation.

- Alf

Bonita Montero

unread,
Nov 7, 2019, 11:46:33 AM11/7/19
to
>> I just found that alloca() with MSVC isn't that fast that it could be.

> Checking things out, at least in a little toy test program the standard
> allocation with g++ is amazingly fast.

Of course alloca() is very fast if you don't have this stupid stack
-touching. Here's a little benchmark:

#include <iostream>
#include <random>
#include <chrono>
#if defined(_MSC_VER)
#include <malloc.h>
#elif defined(__linux__)
#include <alloca.h>
#endif

using namespace std;
using namespace chrono;

void allocaFn( size_t *sizes, size_t n )
{
char volatile *p;
for( size_t i = 0; i != n; ++i )
p = (char volatile *)alloca( sizes[i] ),
*p = '*';
}

int main()
{
size_t const ALLOCATIONS_PER_SLICE = 16,
SLICES = 1024;
size_t sizes[ALLOCATIONS_PER_SLICE * SLICES];
size_t const ROUNDS = 100'000;
size_t const LOWER_LIMIT = 1,
UPPER_KIMIT = 32768;
random_device rd;
uniform_int_distribution<size_t> sizeGen( LOWER_LIMIT, UPPER_KIMIT );
for( size_t &s : sizes )
s = sizeGen( rd );
time_point<high_resolution_clock> start = high_resolution_clock::now();
for( size_t round = ROUNDS; round; --round )
for( size_t slice = 0; slice != SLICES * ALLOCATIONS_PER_SLICE;
slice += ALLOCATIONS_PER_SLICE )
allocaFn( sizes + slice, ALLOCATIONS_PER_SLICE );
double nsPerAllocation = duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count() / ((double)ROUNDS *
ALLOCATIONS_PER_SLICE * SLICES);
cout << nsPerAllocation << endl;
}

Alf P. Steinbach

unread,
Nov 7, 2019, 12:11:29 PM11/7/19
to
On 07.11.2019 17:46, Bonita Montero wrote:
>>> I just found that alloca() with MSVC isn't that fast that it could be.
>
>> Checking things out, at least in a little toy test program the
>> standard allocation with g++ is amazingly fast.
>
> Of course alloca() is very fast if you don't have this stupid stack
> -touching.

No no, the numbers shown were about ordinary `std::alloc` versus a
simple stack like arena allocator, used for `std::basic_string`.

Used like this:


#include <my/arena/all.hpp>
#include <chrono>
#include <type_traits> // std::conditional_t

namespace my {
namespace chrono = std::chrono;
using std::conditional_t;

using Timer_clock =
conditional_t<chrono::high_resolution_clock::is_steady,
chrono::high_resolution_clock, chrono::steady_clock
>;

template< class Rep, class Period >
inline auto as_seconds( const chrono::duration<Rep, Period>
duration_value )
-> double
{ return chrono::duration<double>( duration_value ).count(); }
} // namespace my

auto common_storage()
-> my::arena::Storage&
{
static auto the_storage = my::arena::Storage( 2'000'000 );
return the_storage;
}

void test_arena_allocator()
{
using std::basic_string, std::char_traits, std::cout, std::endl;
namespace arena = my::arena;
using String = basic_string<char, char_traits<char>,
my::arena::Allocator<char>>;

const auto start_time = my::Timer_clock::now();
auto& storage = common_storage();
auto allocator = my::arena::Allocator<char>( storage );
auto s = String( allocator ); // Awkward!
for( int i = 1; i <= 987'654; ++i ) {
s += '-';
}
const auto end_time = my::Timer_clock::now();

cout << "Arena alloc: "
<< s.length() << " chars"
<< " in " << my::as_seconds( end_time - start_time ) << "
seconds."
<< endl;
}

void test_std_allocator()
{
using std::string, std::cout, std::endl;

const auto start_time = my::Timer_clock::now();
auto s = string();
for( int i = 1; i <= 987'654; ++i ) {
s += '-';
}
const auto end_time = my::Timer_clock::now();

cout << "Standard alloc: "
<< s.length() << " chars"
<< " in " << my::as_seconds( end_time - start_time ) << "
seconds."
<< endl;
}

auto main() -> int
{
using std::vector;
{ (void) vector<char>( 1'234'567 ); } // Prime the runtime
library stuff.

for( int i = 1; i <= 5; ++i ) {
test_arena_allocator();
test_std_allocator();
}
}


- Alf

Bonita Montero

unread,
Nov 7, 2019, 12:16:45 PM11/7/19
to
The discussion is about the speed of alloca() and not your private
allocator.

Alf P. Steinbach

unread,
Nov 7, 2019, 12:43:23 PM11/7/19
to
On 07.11.2019 18:16, Bonita Montero wrote:
> The discussion is about the speed of alloca() and not your private
> allocator.

The speed of the standard allocator is quite relevant.

And an arena allocator is comparable in speed to alloca, they do just
about the same (a single arithmetic operation) to allocate.

A main difference is that you can't use alloca in an allocator for e.g.
a string. Another main difference is that failure detection for alloca
is not portable. In the old comp.std.c++ discussions about whether to
add variable length array support to C++, the clincher argument was
usually that one can just use an arena allocator.


- Alf

PS: If you want I can post the allocator code. It's not very complex.
However, it's not been tested other than via my toy
compare-arena-and-std-allocation program that concatenated characters to
strings.

Bonita Montero

unread,
Nov 7, 2019, 12:49:48 PM11/7/19
to
>> The discussion is about the speed of alloca() and not your private
>> allocator.

> The speed of the standard allocator is quite relevant.

1. But not in this thread.
2. std::allocator-compliant allocators need to be heap-compliant to
usual heap-allocation. And fast heap-allocations with low fragmen-
tation´is rocket-science.

> A main difference is that you can't use alloca in an allocator for e.g.
> a string.

You're telling things everyone knows.

Alf P. Steinbach

unread,
Nov 7, 2019, 1:00:40 PM11/7/19
to
On 07.11.2019 18:49, Bonita Montero wrote:
>>> The discussion is about the speed of alloca() and not your private
>>> allocator.
>
>> The speed of the standard allocator is quite relevant.
>
> 1. But not in this thread.

Using alloca has costs. In particular for failure checking, but also (as
you've discovered) by exhausting the available stack space, which then
leads to either UB or inefficiency, depending on the implementation.
When alloc is not significantly faster than the standard allocator, you
shouldn't use it.

So of course the relative speed of the standard allocator is relevant.


> 2. std::allocator-compliant allocators need to be heap-compliant to
>    usual heap-allocation. And fast heap-allocations with low fragmen-
>    tation´is rocket-science.

I don't get what you intend to communicate here. I'm going to try the
old coffee trick.


>> A main difference is that you can't use alloca in an allocator for
>> e.g. a string.
>
> You're telling things everyone knows.

But do you understand that you /can/ use an arena allocator? Which has
roughly the same minimal allocation action as alloca?


- Alf

Mike Terry

unread,
Nov 7, 2019, 1:02:10 PM11/7/19
to
On 06/11/2019 10:43, Bonita Montero wrote:
> I just found that alloca() with MSVC isn't that fast that it could be.
> alloca() calls a function called __chkstk which touches the pages down
> the stack to tigger Windows' overcommitting of stacks, That's while
> Windows is only able to allocate new pages to a stack when the pages
> are touched down the stack, i.e. you'll get a exception if you skip
> a page.

You may be interested in the MSVC #pragma:

#pragma check_stack( [{ on | off }] )

Also, there is a MSVC compiler option /Gs that is related to stack probes...

Regards,
Mike.

Bonita Montero

unread,
Nov 7, 2019, 1:18:01 PM11/7/19
to
> Using alloca has costs. In particular for failure checking, but also (as
> you've discovered) by exhausting the available stack space, which then
> leads to either UB or inefficiency, depending on the implementation.

alloca() is done for small allocations without any failure-checking.
That's like using local variables for which you also don't check if
they can be stored on the stack even for weird recursions.

> When alloc is not significantly faster than the standard allocator, you
> shouldn't use it.

alloca() should be unbeatable by a general-purpose-allocator.

> But do you understand that you /can/ use an arena allocator?
> Which has roughly the same minimal allocation action as alloca?

std::allocator-compliant allocators only make sense if they're compliant
to general-purpose-allocators. An arena-allocator doesn't comply with
this properties.

Bonita Montero

unread,
Nov 7, 2019, 1:25:33 PM11/7/19
to
#pragma check_stack and the /Gs-option are only related to local
variables and not to stack-space allocated via alloca(); I tried
both and __chkstk is always called.

Alf P. Steinbach

unread,
Nov 7, 2019, 1:37:58 PM11/7/19
to
On 07.11.2019 19:17, Bonita Montero wrote:
>> Using alloca has costs. In particular for failure checking, but also
>> (as you've discovered) by exhausting the available stack space, which
>> then leads to either UB or inefficiency, depending on the implementation.
>
> alloca() is done for small allocations without any failure-checking.
> That's like using local variables for which you also don't check if
> they can be stored on the stack even for weird recursions.

Very dangerous. Don't do that. Even if Microsoft did do that for string
conversions in MCF and I think also in ATL/WTL.


>> When alloc is not significantly faster than the standard allocator,
>> you shouldn't use it.
>
> alloca() should be unbeatable by a general-purpose-allocator.

That's the thing you need to test. When you turn off the safety
measures, is it then really much faster?


>> But do you understand that you /can/ use an arena allocator?
>> Which has roughly the same minimal allocation action as alloca?
>
> std::allocator-compliant allocators only make sense if they're compliant
> to general-purpose-allocators. An arena-allocator doesn't comply with
> this properties.

So you don't understand it, but you believe that you do.

I have to guess about what it is that you failed to take into
consideration. Maybe the fact that for a standard container all its
allocations are deallocated before it ceases to exist. I.e., allocated
blocks are not referenced anywhere after the lifetime of the container,
which means that when you use the allocator for objects of scoped
lifetimes you're fine. The point of the arena is that it allows
deallocations in arbitrary order, not strictly like a stack. It's only
the allocations of arenas, from common storage, that have to be in LIFO
order.

Anyway, since you named the thread "alloca-alternative", do consider
arena allocators: they have none of the problems with alloca, and as
mentioned, they can be used in C++-ish cases such as providing
theoretically efficient allocations for `std::basic_string`.

- Alf

Bonita Montero

unread,
Nov 7, 2019, 1:43:49 PM11/7/19
to
>> alloca() is done for small allocations without any failure-checking.
>> That's like using local variables for which you also don't check if
>> they can be stored on the stack even for weird recursions.

> Very dangerous. Don't do that. Even if Microsoft did do that for string
> conversions in MCF and I think also in ATL/WTL.

If i have one MB stack which is the default for windows and allocate
only a small amout of that that's not dangerous.

>> alloca() should be unbeatable by a general-purpose-allocator.

> That's the thing you need to test. When you turn off the safety
> measures, is it then really much faster?

Read how fast allocators like TCMalloc, mimalloc, Hoard and whatever
are built; they have more complexity even than alloca() of MSVC with
this stack touching. And alloca() without stack-touching like with
gcc is even faster.

> I have to guess about what it is that you failed to take into
> consideration. Maybe the fact that for a standard container all its
> allocations are deallocated before it ceases to exist. I.e., allocated
> blocks are not referenced anywhere after the lifetime of the container,
> which means that when you use the allocator for objects of scoped
> lifetimes you're fine. The point of the arena is that it allows
> deallocations in arbitrary order, not strictly like a stack. It's only
> the allocations of arenas, from common storage, that have to be in LIFO
> order.

An allocator that works for larger runtime has to cope with fragmen-
tation as optimally as possible. All these simple allcators don't.

Ian Collins

unread,
Nov 7, 2019, 3:54:14 PM11/7/19
to
On 08/11/2019 07:37, Alf P. Steinbach wrote:
>
> So you don't understand it, but you believe that you do.

Maybe you didn't notice that you were being trolled by a rude person?

--
Ian.

0 new messages