
Can I avoid the use of arrays in this situation or do I have to use them?


mike3

Nov 23, 2007, 4:42:17 PM
Hi.

I've been reworking a bignum package of mine a bit. I've got this thing
here that multiplies two strings of digits using a simple "grade
school" algorithm, and it seems I can't help but use an array in there
for a temporary buffer, as reallocating an std::vector<> over and over
again takes lots of time, and storing a global buffer would be a
problem since my program will have multithreading. (Although I suppose
one could allocate a number of global buffers equal to the maximum
number of threads one could run and then pass an index parameter to
each routine, but that seems kind of silly. And how do you keep track
of all these buffers, especially with overloaded operators where you
can't pass a parameter?! Eww.) Can I avoid the use of the array and
still get maximum performance? Or is this a case where arrays are
appropriate? The point of the temporary buffer is to allow this to be
used as an "in-place" multiplication procedure as well as an
out-of-place one. There are other places in the bignum package where
similar array-vs-vector performance considerations arise involving
temporary buffers (there's a floating point addition/subtraction
routine that requires shifting the bits in one of the operands, and to
avoid messing up the original it creates a temporary buffer... Also,
that buffer needs to be able to be passed to routines similar to this
one which accept vectors... Should I just give up and use arrays
everywhere for this?! (sucks a lot)).

Also, are the typecasts in there necessary or could they be removed
and the full-width product of the two digits still be obtained?

This is the code:

---
/* Multiply two digit vectors.
* Parameters:
* vec0: Reference to destination vector.
* vec1: Reference to vector to multiply.
* vec2: Reference to vector to multiply by.
* length1: Length of vec1
* length2: Length of vec2
*
* Returns: Error code
*
* Operation: vec0 = vec1 * vec2.
*
* Note: vec0 must be able to hold length1 + length2 worth of
* elements.
*/
FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
                              const std::vector<Digit> &vec1,
                              const std::vector<Digit> &vec2,
                              int length1, int length2)
{
    /* Set up temporary digit buffer */
    Digit tmpbuf[2*MAX_PRECISION]; // use array instead of vector to
                                   // maximize performance. Vector
                                   // takes forever to constantly
                                   // de/reallocate.

    /* Safety */
    if(length1 > MAX_PRECISION)
        return(FG3DError(FG3D_INVALID_PRECISION, length1));
    if(length2 > MAX_PRECISION)
        return(FG3DError(FG3D_INVALID_PRECISION, length2));

    /* Clear buffer */
    std::fill(tmpbuf, tmpbuf + length1 + length2, 0);

    /* Set up iterators */
    std::vector<Digit>::iterator v0i(vec0.begin());
    std::vector<Digit>::const_iterator v1i(vec1.begin());
    std::vector<Digit>::const_iterator v2i(vec2.begin());
    std::vector<Digit>::const_iterator v2init(v2i);

    /* Do the multiplication */
    for(int i(0); i<length1; i++, ++v1i)
    {
        TwoDigit carry;
        carry = 0;
        v2i = v2init;
        for(int j(0); j<length2; j++, ++v2i)
        {
            TwoDigit tmp((static_cast<TwoDigit>(*v1i)*static_cast<TwoDigit>(*v2i))
                         +static_cast<TwoDigit>(tmpbuf[i+j])+carry);
            carry = tmp>>DIGIT_SIZE;
            tmp &= MAX_DIGIT;
            tmpbuf[i+j] = tmp;
        }
        tmpbuf[i+length2] = carry;
    }

    /* Copy result back */
    vec0.reserve(length1 + length2);
    std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());

    return(FG3DError(FG3D_SUCCESS));
}
---

Alf P. Steinbach

Nov 23, 2007, 5:02:48 PM
* mike3:

Why do you have the length arguments?

A std::vector has a length, accessible via size() and resize().


> {
> /* Set up temporary digit buffer */
> Digit tmpbuf[2*MAX_PRECISION]; // use array instead of vector to
>                                // maximize performance. Vector
>                                // takes forever to constantly
>                                // de/reallocate.

Have you measured this?

Anyway, reserve ALL UPPERCASE for macros.

If MAX_PRECISION is a macro, make it a typed constant.


>
> /* Safety */
> if(length1 > MAX_PRECISION)
> return(FG3DError(FG3D_INVALID_PRECISION, length1));
> if(length2 > MAX_PRECISION)
> return(FG3DError(FG3D_INVALID_PRECISION, length2));

Ensure class invariants in operations so that you don't have to pepper
the code with checks all over the place.

In other words,

class Bignum
{
private:
std::vector<Digit> myDigits;
public:
// Whatever, all operations assume a max length
// and ensure that max length.
};
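
where, just as a sketch, the length check can live in exactly one place
(Digit as in your code, and MAX_PRECISION wrapped into a typed function):

#include <cstddef>
#include <stdexcept>
#include <vector>

class Bignum
{
public:
    explicit Bignum( std::size_t length )
        : myDigits( checkedLength( length ) )     // checked exactly once
    {}

    // ... operations can now assume the invariant ...

private:
    static std::size_t maxLength() { return MAX_PRECISION; }

    static std::size_t checkedLength( std::size_t length )
    {
        if( length > maxLength() )
            throw std::length_error( "Bignum: precision too large" );
        return length;
    }

    std::vector<Digit> myDigits;
};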


> /* Clear buffer */
> std::fill(tmpbuf, tmpbuf + length1 + length2, 0);

Just zero-initialize the array if you're using an array,

Digit tmpbuf[whatever] = {0};


>
> /* Set up iterators */
> std::vector<Digit>::iterator v0i(vec0.begin());
> std::vector<Digit>::const_iterator v1i(vec1.begin());
> std::vector<Digit>::const_iterator v2i(vec2.begin());
> std::vector<Digit>::const_iterator v2init(v2i);
>
> /* Do the multiplication */
> for(int i(0);i<length1;i++,++v1i)
> {
> TwoDigit carry;
> carry = 0;
> v2i = v2init;
> for(int j(0);j<length2;j++,++v2i)
> {
> TwoDigit
> tmp((static_cast<TwoDigit>(*v1i)*static_cast<TwoDigit>(*v2i))
> +static_cast<TwoDigit>(tmpbuf[i+j])+carry);

This looks rather suspicious.

However, you don't provide definitions of Digit and TwoDigit, so
difficult to say.


> carry = tmp>>DIGIT_SIZE;
> tmp &= MAX_DIGIT;
> tmpbuf[i+j] = tmp;
> }
> tmpbuf[i+length2] = carry;
> }
>
> /* Copy result back */
> vec0.reserve(length1 + length2);
> std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());

Don't modify out-arguments directly.

Create the result in a local variable, then at the end swap().

For exception safety.


> }---
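
Something like this, perhaps (an untested sketch, reusing the names from
your code, and letting the vectors carry their own lengths):

FG3DError FG3DDigitVector_Mul( std::vector<Digit>& vec0,
                               const std::vector<Digit>& vec1,
                               const std::vector<Digit>& vec2 )
{
    std::vector<Digit> result( vec1.size() + vec2.size(), 0 );

    for( std::size_t i = 0; i < vec1.size(); ++i )
    {
        TwoDigit carry = 0;
        for( std::size_t j = 0; j < vec2.size(); ++j )
        {
            TwoDigit tmp = TwoDigit( vec1[i] )*vec2[j] + result[i+j] + carry;
            carry        = tmp >> DIGIT_SIZE;
            result[i+j]  = Digit( tmp & MAX_DIGIT );
        }
        result[i + vec2.size()] = Digit( carry );
    }

    result.swap( vec0 );   // no copying; vec0 now holds the product
    return FG3DError( FG3D_SUCCESS );
}

The swap at the end is cheap (it just exchanges internal pointers), and
vec0 is only touched once the complete result exists.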

Cheers, & hth.,

- Alf


--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?


Marco Manfredini

Nov 23, 2007, 7:03:33 PM
mike3 wrote:
> Hi.
>
> I've been reworking a bignum package I made a bit. I've got this thing
> here that multiplies two strings of digits using a simple "grade
> school" algorithm, and it seems I can't help but use an array in there
> for a temporary buffer, as reallocating an std::vector<> over and over

Well, if you really want in-place multiplication you can try this:
Arrange your loops in such a way that the output digits (r) are
computed from 0 straight up. That is, for each n in your *result*,
starting at 0, sum up all the a_i*b_j where i+j=n. You will need a finite
number (usually 2, unless you want to multiply verrrry long numbers) of
temporaries to hold the overflow that will leap to the next few digits.
I'm leaving it to you how to juggle these. Now look at this example
multiplication r=a*b, where len(a)==3 and len(b)==4:

r_0 = a_0*b_0
r_1 = a_0*b_1 + a_1*b_0 + overflow
r_2 = a_0*b_2 + a_1*b_1 + a_2*b_0 + overflow
r_3 = a_0*b_3 + a_1*b_2 + a_2*b_1 + overflow
r_4 = a_1*b_3 + a_2*b_2 + overflow
r_5 = a_2*b_3 + overflow
r_6 = overflow

You might have noticed that after we are finished with r_3, a_0 is never
read again, and so the computation of r_4 obsoletes a_1, etc. Also,
pushing the carries through the result happens only once. Now I rename r:

a_3 = a_0*b_0
a_4 = a_0*b_1 + a_1*b_0 + overflow
a_5 = a_0*b_2 + a_1*b_1 + a_2*b_0 + overflow
a_6 = a_0*b_3 + a_1*b_2 + a_2*b_1 + overflow // a_0 obsoleted!
a_0 = a_1*b_3 + a_2*b_2 + overflow // a_1 obsoleted!
a_1 = a_2*b_3 + overflow // a_2 obsoleted!
a_2 = overflow

Your result is there, rotate a left by 3 to put it in place.
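
If it helps, here is a rough, untested sketch of the scheme (using the
Digit/TwoDigit/DIGIT_SIZE/MAX_DIGIT names from your post, and assuming
both operands are non-empty and shorter than the digit base):

#include <algorithm>   // std::rotate
#include <cstddef>
#include <vector>

// a holds la significant digits on entry and receives the la+lb digit
// product; b holds lb digits and is left untouched.
void mul_in_place( std::vector<Digit>& a, std::size_t la,
                   const std::vector<Digit>& b, std::size_t lb )
{
    const std::size_t lr = la + lb;      // digits in the result
    a.resize( lr );

    TwoDigit acc = 0;   // low two digits of the running column sum
    TwoDigit hi  = 0;   // overflow above that (the extra temporaries)

    for( std::size_t n = 0; n < lr; ++n )
    {
        const std::size_t ilo = ( n >= lb ) ? n - lb + 1 : 0;
        const std::size_t ihi = ( n < la ) ? n : la - 1;

        for( std::size_t i = ilo; i <= ihi; ++i )
        {
            // a[i] is still an original digit here: it is only
            // overwritten at the end of column i+lb, and i+lb > n.
            const TwoDigit p   = TwoDigit( a[i] ) * b[n - i];
            const TwoDigit old = acc;
            acc += p;
            if( acc < old ) ++hi;        // wraparound => carry upward
        }

        // r_n goes to position (n + la) mod lr, as in the renaming above.
        a[( n + la ) % lr] = Digit( acc & MAX_DIGIT );

        // shift the accumulator down one digit for the next column
        acc = ( acc >> DIGIT_SIZE ) | ( hi << DIGIT_SIZE );
        hi >>= DIGIT_SIZE;
    }

    // rotate left by la so the least significant digit ends up at a[0]
    std::rotate( a.begin(), a.begin() + la, a.end() );
}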

This should keep you busy.

mike3

Nov 23, 2007, 7:24:15 PM
On Nov 23, 3:02 pm, "Alf P. Steinbach" <al...@start.no> wrote:
<snip>

> Why do you have the length arguments?
>
> A std::vector has a length, accessible via size() and resize().
>

I thought resize() is to do just that, resize the vector,
not get its length. Also, I have the length arguments
so I can use lengths different from those of the vector
if one needs to do that for some reason.

> > {
> > /* Set up temporary digit buffer */
> > Digit tmpbuf[2*MAX_PRECISION]; // use array instead of vector to
> >                                // maximize performance. Vector
> >                                // takes forever to constantly
> >                                // de/reallocate.
>
> Have you measured this?
>

Sure did. Took ~34 secs, for 100,000,000 calls, with all
compiler optimizations turned on.

I just reduced the routine to this:

FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
const std::vector<Digit> &vec1,
const std::vector<Digit> &vec2,
int length1, int length2)

{
std::vector<Digit> tmpbuf(length1 + length2);
return(FG3DError(FG3D_SUCCESS));
}

and just ran it over and over 100,000,000 times to
see how long that allocate/free takes and I was
stunned to find it wastes 34 seconds, more time than
the entire multiplication loop!

> Anyway, reserve ALL UPPERCASE for macros.
>
> If MAX_PRECISION is a macro, make it a typed constant.
>

It's a macro, #defined as some number equal to the maximum
amount of "Digit"s that we want to allow (right now, it's
set at 128).

>
>
> > /* Safety */
> > if(length1 > MAX_PRECISION)
> > return(FG3DError(FG3D_INVALID_PRECISION, length1));
> > if(length2 > MAX_PRECISION)
> > return(FG3DError(FG3D_INVALID_PRECISION, length2));
>
> Ensure class invariants in operations so that you don't have to pepper
> the code with checks all over the place.
>
> In other words,
>
> class Bignum
> {
> private:
> std::vector<Digit> myDigits;
> public:
> // Whatever, all operations assume a max length
> // and ensure that max length.
> };
>

The problem is I need to be able to work on pieces of the
digit string as well as the entire string with some
operations (addition, subtraction). And if the operations
ensure the max length does not that require those if()
checks in them to make sure the length parameter passed
does not exceed the limit?

> > /* Clear buffer */
> > std::fill(tmpbuf, tmpbuf + length1 + length2, 0);
>
> Just zero-initialize the array if you're using an array,
>
> Digit tmpbuf[whatever] = {0};
>

Alright.

>
>
>
>
>
>
> > /* Set up iterators */
> > std::vector<Digit>::iterator v0i(vec0.begin());
> > std::vector<Digit>::const_iterator v1i(vec1.begin());
> > std::vector<Digit>::const_iterator v2i(vec2.begin());
> > std::vector<Digit>::const_iterator v2init(v2i);
>
> > /* Do the multiplication */
> > for(int i(0);i<length1;i++,++v1i)
> > {
> > TwoDigit carry;
> > carry = 0;
> > v2i = v2init;
> > for(int j(0);j<length2;j++,++v2i)
> > {
> > TwoDigit
> > tmp((static_cast<TwoDigit>(*v1i)*static_cast<TwoDigit>(*v2i))
> > +static_cast<TwoDigit>(tmpbuf[i+j])+carry);
>
> This looks rather suspicious.
>
> However, you don't provide definitions of Digit and TwoDigit, so
> difficult to say.
>

TwoDigit can be any unsigned type that has twice the bitlength
of Digit. You could use unsigned int (32 bit) for TwoDigit,
unsigned short (16 bit) for Digit, for example.

> > carry = tmp>>DIGIT_SIZE;
> > tmp &= MAX_DIGIT;
> > tmpbuf[i+j] = tmp;
> > }
> > tmpbuf[i+length2] = carry;
> > }
>
> > /* Copy result back */
> > vec0.reserve(length1 + length2);
> > std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());
>
> Don't modify out-arguments directly.
>
> Create the result in a local variable, then at the end swap().
>
> For exception safety.
>

The result is already being built in the temp buffer, that's the
point. And does this mean one cannot resize the output vector if
it cannot hold the full size product? Or will swap() do that
automatically? Is swap() slower than the above?

> > }---
>

Alf P. Steinbach

Nov 24, 2007, 3:03:15 AM
* mike3:

> On Nov 23, 3:02 pm, "Alf P. Steinbach" <al...@start.no> wrote:
> <snip>
>> Why do you have the length arguments?
>>
>> A std::vector has a length, accessible via size() and resize().
>>
>
> I thought resize() is to do just that, resize the vector,
> not get it's length.

It sets the length. Plus allocates if necessary, and initializes
elements that become accessible.


> Also, I have then length arguments
> so I can use lengths different from those of the vector
> if one needs to do that for some reason.

Yeah, but /do/ you need that?

Without needing it the passed lengths are just one more thing that can
go wrong.


>>> {
>>> /* Set up temporary digit buffer */
>>> Digit tmpbuf[2*MAX_PRECISION]; // use array instead of vector to
>>>                                // maximize performance. Vector
>>>                                // takes forever to constantly
>>>                                // de/reallocate.
>> Have you measured this?
>>
>
> Sure did. Took ~34 secs, for 100,000,000 calls, with all
> compiler optimizations turned on.
>
> I just reduced the routine to this:
>
> FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
> const std::vector<Digit> &vec1,
> const std::vector<Digit> &vec2,
> int length1, int length2)
> {
> std::vector<Digit> tmpbuf(length1 + length2);
> return(FG3DError(FG3D_SUCCESS));
> }
>
> and just ran it over and over 100,000,000 times to
> see how long that allocate/free takes and I was
> stunned to find it wastes 34 seconds, more time than
> the entire multiplication loop!

Well, it comes down to usage pattern.

If the out-parameter is generally an empty vector or of less capacity
than required for the result, then there will be an allocation anyway in
most cases. And then the most efficient will generally be to declare
the temporary as a vector, and swap at the end.

On the other hand, if the out-parameter is generally of sufficient
capacity, then from a local point of view the most efficient will
generally be to reuse that capacity, i.e. to use a raw array for the
temporary or to build the result right into the out-parameter.

However, reuse of capacity runs the risk of "leaking" memory in the
sense that the capacity is never lowered, only increasing.

So measuring for actual usage, typical usage pattern, is critical.


[snip]


>> Ensure class invariants in operations so that you don't have to pepper
>> the code with checks all over the place.
>>
>> In other words,
>>
>> class Bignum
>> {
>> private:
>> std::vector<Digit> myDigits;
>> public:
>> // Whatever, all operations assume a max length
>> // and ensure that max length.
>> };
>>
>
> The problem is I need to be able to work on pieces of the
> digit string as well as the entire string with some
> operations (addition, subtraction).

Sounds like slices.

The standard library isn't very well equipped when it comes to slices:
you may have to build that yourself.

There is std::valarray, which I think supports slices, but that's a
not-quite-finished piece of the library. If it works, great. If not...


> And if the operations
> ensure the max length does not that require those if()
> checks in them to make sure the length parameter passed
> does not exceed the limit?

No, the point is to not pass separate length parameters, that whatever
object you're passing knows its length.

But again, it sounds as if you're dealing with slices.

Then, if so, then Bignum (or whatever) needs to support slices,
efficiently. In the operation that creates a logical slice there will
be a length check. But that will then be the only one, centralized.
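
Very roughly, something like this (Digit as in your code):

#include <cstddef>
#include <stdexcept>
#include <vector>

// A non-owning view of part of a digit vector.  The length check happens
// once, here, when the slice is created; the arithmetic routines then
// just trust the slice.
class DigitSlice
{
public:
    DigitSlice( std::vector<Digit>& v, std::size_t offset, std::size_t length )
        : p_( 0 ), length_( length )
    {
        if( offset + length > v.size() )      // the one, centralized check
            throw std::out_of_range( "DigitSlice" );
        p_ = v.empty() ? 0 : &v[0] + offset;
    }

    Digit& operator[]( std::size_t i ) const { return p_[i]; }
    std::size_t length() const               { return length_; }

private:
    Digit*      p_;
    std::size_t length_;
};

The addition/subtraction routines then take DigitSlice arguments and
index away without any further checking.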


[snip]


>>> /* Do the multiplication */
>>> for(int i(0);i<length1;i++,++v1i)
>>> {
>>> TwoDigit carry;
>>> carry = 0;
>>> v2i = v2init;
>>> for(int j(0);j<length2;j++,++v2i)
>>> {
>>> TwoDigit
>>> tmp((static_cast<TwoDigit>(*v1i)*static_cast<TwoDigit>(*v2i))
>>> +static_cast<TwoDigit>(tmpbuf[i+j])+carry);
>> This looks rather suspicious.
>>
>> However, you don't provide definitions of Digit and TwoDigit, so
>> difficult to say.
>
> TwoDigit can be any unsigned type that has twice the bitlength
> of Digit. You could use unsigned int (32 bit) for TwoDigit,
> unsigned short (16 bit) for Digit, for example.

Well, the ordinary arithmetic promotions were designed to handle this
kind of stuff.

It's enough that one operand of an operation is of type TwoDigit in
order to force the other operand to be promoted.

This is one case where I'd forsake the newfangled casts for clarity,

TwoDigit tmp = TwoDigit(*v1i)*(*v2i) + tmpbuf[i+j] + carry;

"TwoDigit(*v1i)" is a syntax that was introduced in C++ to make it look
like a constructor call, but is defined for a primitive type as
equivalent to a C style cast "((TwoDigit)*v1i)". I.e. in some other
context it could invoke reinterpret_cast or const_cast or even cast to
inaccessible base. I don't know why it wasn't defined as static_cast.
Probably an oversight or misunderstanding: they bungled it.


>>> carry = tmp>>DIGIT_SIZE;
>>> tmp &= MAX_DIGIT;
>>> tmpbuf[i+j] = tmp;
>>> }
>>> tmpbuf[i+length2] = carry;
>>> }
>>> /* Copy result back */
>>> vec0.reserve(length1 + length2);
>>> std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());
>> Don't modify out-arguments directly.
>>
>> Create the result in a local variable, then at the end swap().
>>
>> For exception safety.
>>
>
> The result is already being built in the temp buffer, that's the
> point. And does this mean one cannot resize the output vector if
> it cannot hold the full size product? Or will swap() do that
> automatically? Is swap() slower than the above?

You can always resize. Whether swap is more or less efficient depends
on the context, in effect on the usage pattern for the calls of this
function. However, given your measurements I think what I wrote above
is too categorical: you'll need to try out both approaches (array temp +
resize + copy, or vector temp + swap) in the actual typical usage,
measuring.

Happily there's not much difference regarding the rest of the code.


Cheers, & hth.,

mike3

Nov 24, 2007, 3:14:26 PM
On Nov 24, 1:03 am, "Alf P. Steinbach" <al...@start.no> wrote:
> * mike3:
>
> > On Nov 23, 3:02 pm, "Alf P. Steinbach" <al...@start.no> wrote:
> > <snip>
> >> Why do you have the length arguments?
>
> >> A std::vector has a length, accessible via size() and resize().
>
> > I thought resize() is to do just that, resize the vector,
> > not get it's length.
>
> It sets the length. Plus allocates if necessary, and initializes
> elements that become accessible.
>

Oh, so then size() gets the length and resize() resets it.

> > Also, I have then length arguments
> > so I can use lengths different from those of the vector
> > if one needs to do that for some reason.
>
> Yeah, but /do/ you need that?
>

Well, not for the multiplication, so maybe I can get rid of it there.
But for the addition and subtraction, I need variable lengths _and_
variable starting offsets so I can manipulate pieces of the digit
strings (to allow for the bit/word shift used when adding floating
point).

> Without needing it the passed lengths are just one more thing that can
> go wrong.
>

Is it possible, though, that the std::vector may expand/shrink on me,
so that I have to update the "length" field in my bignums accordingly?
Or that I have to be prepared to handle operands of differing lengths
going into the routine? That would be bad for the addition/subtraction,
as every operation counts in terms of speed, and coding things up to
handle varying lengths -- when in the floating point arithmetic all
passed lengths of digit slices _should_ be the same -- is a waste of
performance AND a silly way to code something! Why put in what you
don't use?

But the simple act of declaring the vector (as I showed above) takes
mammoth amounts of time -- unreasonable amounts, in fact.


> On the other hand, if the out-parameter is generally of sufficient
> capacity, then from a local point of view the most efficient will
> generally be to reuse that capacity, i.e. to use a raw array for the
> temporary or to build the result right into the out-parameter.
>

This is the case I have. These low-level digit routines
are used as building blocks to build a floating point
bignum package. The out-parameter _is_ of sufficient
capacity, as it's a temporary buffer _inside the floating
point routine_. So I suppose I could scrap the
buffer in the mul routine above entirely. But that of
course raises the question of how to handle the buffer
in the floating point routine. How could I do that?

> However, reuse of capacity runs the risk of "leaking" memory in the
> sense that the capacity is never lowered, only increasing.
>

In the floating point library, the precision/"capacity"
of all operands should be fixed, not increasing _or_
decreasing as if this happens too frequently (as in
more than almost never) it's going to cut into the
performance of the fractal calculation.

> So measuring for actual usage, typical usage pattern, is critical.
>

But what this shows is the mere act of creating the temporary buffer
consumes ludicrous amounts of time.
This is an application in which every last bit of speed
I can get counts. There are calculations that need
to be done involving these operations -- lots of 'em --
for many different data points (and they're all complex
numbers -- don't even get me started on evaluating
transcendental functions). This has ramifications for
the floating point library which needs temporary buffers
to avoid munging operands (in the case of addition/
subtraction with its bit shift) or to give the
correct result (in the case of multiplication which
requires the upper bits of the 2x-width product).

> [snip]
>
>
>
> >> Ensure class invariants in operations so that you don't have to pepper
> >> the code with checks all over the place.
>
> >> In other words,
>
> >> class Bignum
> >> {
> >> private:
> >> std::vector<Digit> myDigits;
> >> public:
> >> // Whatever, all operations assume a max length
> >> // and ensure that max length.
> >> };
>
> > The problem is I need to be able to work on pieces of the
> > digit string as well as the entire string with some
> > operations (addition, subtraction).
>
> Sounds like slices.
>
> The standard library isn't very well equipped when it comes to slices:
> you may have to build that yourself.
>

How? Any ideas on how to design it for maximum performance?

> There is std::valarray, which I think supports slices, but that's a
> not-quite-finished piece of the library. If it works, great. If not...
>

So does this mean I should just scrap the vector<>, use
arrays, and just be careful with the length/offset parameters I pass?

> > And if the operations
> > ensure the max length does not that require those if()
> > checks in them to make sure the length parameter passed
> > does not exceed the limit?
>
> No, the point is to not pass separate length parameters, that whatever
> object you're passing knows its length.
>

But for maximum speed, I'd rather demand all operands
to, say, the addition/subtraction routines have the same
length, as that's all that's going to occur in my floating
point arithmetic. Having all that redundancy not only
wastes performance but just doesn't feel good in my
opinion.

> But again, it sounds as if you're dealing with slices.
>
> Then, if so, then Bignum (or whatever) needs to support slices,
> efficiently. In the operation that creates a logical slice there will
> be a length check. But that will then be the only one, centralized.
>

Does such an operation cost less than the addition/
subtraction of the digits & the bit shifting combined,
even when repeated billions of times as it is in the
fractal generation?

There are 4 possible cases that can occur:

Floating Point Digit Add/Sub Scenarios:

R = A +/- B

MSD --- LSD
RRRRRRRRRRR
AAAAAAAAAAA
BBBBB (we need 2 slices of A, 1 of B, 2 of R)

RRRRRRRRRRR
AAAAAAAAAAA
BBBB (we need 3 slices of A, 1 of B, 3 of R)

RRRRRRRRRRR
AAAAAAA
BBBBBB (we need 2 slices of A, 2 of B, 3 of R)

RRRRRRRRRRR
AAAAA
BBBB (we need 1 slice of A, 1 of B, 3 of R)

Any one of these cases may occur, although case 1
will be the most frequent, since the numbers involved
have equal precision. That case requires 5 slice requests.
With around 5 slice-requiring floating point operations
per fractal iteration, we must do this 25 times per
iteration; one fractal image may average 12,000
iterations per pixel and hence, summed over all its
pixels, require over 92 billion such requests.

<snip>


> Well, the ordinary arithmetic promotions were designed to handle this
> kind of stuff.
>
> It's enough that one operand of an operation is of type TwoDigit in
> order to force the other operand to be promoted.
>

But doesn't that require a cast? Also, carry is of type
TwoDigit, so shouldn't it promote everything else?

> This is one case where I'd forsake the newfangled casts for clarity,
>
> TwoDigit tmp = TwoDigit(*v1i)*(*v2i) + tmpbuf[i+j] + carry;
>
> "TwoDigit(*v1i)" is a syntax that was introduced in C++ to make it look
> like a constructor call, but is defined for a primitive type as
> equivalent to a C style cast "((TwoDigit)*v1i)". I.e. in some other
> context it could invoke reinterpret_cast or const_cast or even cast to
> inaccessible base. I don't know why it wasn't defined as static_cast.
> Probably an oversight or misunderstanding: they bungled it.
>

Thanks, I'll do that instead. Although, what about carry?
That's a TwoDigit, so wouldn't it automatically do the
promotion, and hence one would not even need the cast
around (*v1i)? Or does one operand for each binary operator need to be
TwoDigit (in which case putting it at the beginning ensures all the
successive operations receive at least one TwoDigit)?

>
>
> >>> carry = tmp>>DIGIT_SIZE;
> >>> tmp &= MAX_DIGIT;
> >>> tmpbuf[i+j] = tmp;
> >>> }
> >>> tmpbuf[i+length2] = carry;
> >>> }
> >>> /* Copy result back */
> >>> vec0.reserve(length1 + length2);
> >>> std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());
> >> Don't modify out-arguments directly.
>
> >> Create the result in a local variable, then at the end swap().
>
> >> For exception safety.
>
> > The result is already being built in the temp buffer, that's the
> > point. And does this mean one cannot resize the output vector if
> > it cannot hold the full size product? Or will swap() do that
> > automatically? Is swap() slower than the above?
>
> You can always resize. Whether swap is more or less efficient depends
> on the context, in effect on the usage pattern for the calls of this
> function. However, given your measurements I think what I wrote above
> is too categorical: you'll need to try out both approaches (array temp +
> resize + copy, or vector temp + swap) in the actual typical usage,
> measuring.
>

Well, in my usage it might be possible to entirely avoid the in-place
multiplication in the tight inner fractal loops, at least for the
Mandelbrot set (that takes 1 complex multiplication, equivalent to 3
real multiplications). However, for floating point multiplication one
still needs a temporary buffer to hold the 2x-width product of the
mantissas, which is then truncated to 1x-width and rounded. So that
simply moves the problem outside the multiplication routine I gave and
into its caller.

terminator

Nov 25, 2007, 5:40:36 AM
/*Very bad. I would do this instead:*/

const size_t len=vec1.size()+vec2.size();
vector<Digit> tmpbuf(len);//to be swapped.

/*if you really need to check you can do this instead:*/

vector<Digit> tmpbuf;
const size_t MaxSize = tmpbuf.get_allocator().max_size();
if(len > std::min( MaxSize, size_t(MAX_PRECISION) ))
    throw FG3DError(FG3D_INVALID_PRECISION, len);
tmpbuf.reserve(len);

Do not do this:


> /* Clear buffer */
> std::fill(tmpbuf, tmpbuf + length1 + length2, 0);

No need to do that; the vector has already done it.

Do not do this:


> /* Copy result back */
> vec0.reserve(length1 + length2);
> std::copy(tmpbuf, tmpbuf + length1 + length2, vec0.begin());

/*you can simply swap instead of duplicating via copy:*/

vec0.swap(tmpbuf);//tmpbuf needs to be a vector<Digit>.

operations other than multiplication may reduce the size, so you may
need to do this prior to the above:

typedef vector<Digit>::reverse_iterator Rit;
const Rit rstart=tmpbuf.rbegin();
const Rit rfine=tmpbuf.rend();
const Rit it=find_if(rstart,rfine,std::bind1st(std::not_equal_to<Digit>(),0));
const size_t newlen = rfine-it;
if(newlen!=len)
    tmpbuf.resize(newlen);

putting an intrinsic array on the stack is really bad in that it
increases the risk of stack overflow, which is usually an unrecoverable
error that leads to abrupt termination of the program; a vector, on the
other hand, puts the data on the heap.

regards,
FM.

terminator

Nov 25, 2007, 7:37:38 AM
Turning the debug switch off, I found the following two sequences almost
equivalent in time:

1.vector buf ; swap ;
2. array buf ; fill ; result.reserve ; copy ;

#include <iostream>
#include <conio.h>
#include <vector>
#include <string>
#include <ctime>
using namespace std;

enum {sz=1000,loop=100*1000};

// 'A' was defined elsewhere in my original experiment; presumably an
// element type that counts how many times it gets constructed.
struct A {
A() { ++get(); }
A(const A&) { ++get(); }
static int& get() { static int i=0; return i; }
};

void vec1(){
vector <int> v(sz),dest;
v.swap(dest);
}

void vec0(){
vector <int> v,dest;
v.reserve(sz);
// NB: this writes into reserved but unconstructed storage (past size())
fill(v.begin(),v.begin()+v.capacity(),0);
v.swap(dest);
}

void vec2(){
vector <int> v(sz),dest;
swap(dest,v);
}

void arr(){
int arr[sz];
vector <int> dest;
fill(arr,&(arr[sz]),0);
dest.reserve(sz);
// NB: same here -- copying into reserved but unconstructed storage
copy(arr,&arr[sz],dest.begin());
}

template <typename pred>
void test(pred f,string txt){
cout<<("testing:"+txt)<<endl;
const clock_t c=clock();
for(int i=0;i<loop;i++)
f();
cout<<clock()-c<<" ms"<<endl;
}

int main()
{
vector<A> v1(sz);
cout << "\nvector * i=" << A::get() << endl ;
A::get()=0;
vector<A> v2;
v2.reserve(sz);
cout << "reserve * i=" << A::get() << endl ;
test(vec0,"reserve");
test(vec1,"vector");
test(vec2,"std::swap");
test(arr,"array");
getch();
return 0;
}

output:

testing:reserve
1172 ms
testing:vector
1141 ms
testing:std::swap
1125 ms
testing:array
1141 ms


regards,
FM.

terminator

Nov 25, 2007, 8:35:14 AM

Testing the following convinced me that you need to pay a time
overhead: you cannot put a large number of large objects on the
stack, and if you use dynamic allocation (as vector does), then
allocation and deallocation cost you some time. In-place calculation
is the only case that does not demand dynamic allocation, and only in
that one case can using a normal array placed on the stack
help. Putting arrays on the stack is not normally a good idea.

void arr3(){
int *arr=new int[sz];
fill(arr,&(arr[sz]),0);
delete arr;
};

void vec4(){
vector <int> v;
v.reserve(sz);
};

void vec5(){
vector <int> v;
v.reserve(sz);
v.insert(v.begin(),sz,0);
};

void vec6(){
vector <int> v;
v.insert(v.begin(),sz,0);
};

void vec7(){
vector <int> v;
v.resize(sz);
};

ps:
reserve does not add elements to the vector, it just increases the
current capacity, so use insert or resize to add elements to the vector.
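i.e. something like:

std::vector<int> v;
v.reserve(100);    // capacity() >= 100, but size() is still 0
// v[0] = 1;       // wrong: there is no element 0 yet
v.resize(100);     // now size() == 100, elements value-initialized
v[0] = 1;          // fine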

regards,
FM.

mike3

Nov 25, 2007, 3:05:29 PM
On Nov 25, 6:35 am, terminator <farid.mehr...@gmail.com> wrote:
<snip>

> testing the following urged me that the you need to pay a time
> overhead because you can not put a large number of large objects on
> stack and if you use dynamic allocation(as vector does) then
> allocation and deallocation costs you some time. In place calculation
> is the only case that does not demand dynamic allocation and only for
> this one case using a normal array placed on the stack can
> help.Putting arrays on stack is not normally a good idea .
>

So then in this case, if I want to maximize performance, I'll
need to use an array instead of vector?

mike3

Nov 25, 2007, 3:08:25 PM
On Nov 25, 5:37 am, terminator <farid.mehr...@gmail.com> wrote:
<snip>

I noticed, however, that even in the "array" routine
a vector is created. Have you tried one where
there are _no_ vectors created in the routines?

terminator

Nov 26, 2007, 5:16:17 AM
> there are _no_ vectors created in the routines?
>

later I added:

> void arr3(){
> int *arr=new int[sz];
> fill(arr,&(arr[sz]),0);

> //delete arr; //wrong
/* that was wrong. This is what I meant: */


delete[] arr;
> };
>
> void vec4(){
> vector <int> v;
> v.reserve(sz);
> };
>
> void vec5(){
> vector <int> v;
> v.reserve(sz);
> v.insert(v.begin(),sz,0);
> };
>
> void vec6(){
> vector <int> v;
> v.insert(v.begin(),sz,0);
> };
>
> void vec7(){
> vector <int> v;
> v.resize(sz);
> };
>
> ps:
> reserve dose not add elements to vector,it just increase current
> capacity, so use insert or resize to add elements to the vector.

If you test the above functions you can see that using a vector is
equivalent to dynamically allocating and deallocating an array. Thus,
if you do not perform an in-place calculation, sooner or later you
will need to store the result in some dynamically allocated array
object. No matter when the dynamic allocation happens, it always
increases runtime dramatically, in an OS-dependent (unforecastable but
estimable) manner, and it is inevitable. So, except for the case of
in-place calculation, I would just start with a vector and finish with
a swap, which takes place in no time.

regards,
FM.

terminator

Nov 26, 2007, 5:20:47 AM

For the n-th digit, n products need to be summed, so one should make
sure that n is less than the upper bound of the digit type, or extra
carry digits will be needed (e.g. with 16-bit digits, the operands
should stay shorter than 65536 digits).

regards,
FM.

terminator

Nov 26, 2007, 5:25:15 AM
On Nov 24, 11:03 am, "Alf P. Steinbach" <al...@start.no> wrote:
>
> Sounds like slices.
>
> The standard library isn't very well equipped when it comes to slices:
> you may have to build that yourself.
>
> There is std::valarray, which I think supports slices, but that's a
> not-quite-finished piece of the library. If it works, great. If not...
>
on my platform, indexing/iterating through a valarray is twice as slow
as through a vector, though for some mysterious reason construction is
somewhat faster.

regards,
FM.

mike3

Nov 26, 2007, 4:09:50 PM

Then why does the routine using an array seem to run
much faster than the one using a vector?

mike3

Nov 26, 2007, 5:02:06 PM
On Nov 23, 2:42 pm, mike3 <mike4...@yahoo.com> wrote:
<snip>

I just finished testing another attempt at this. It seems the
problem is not the choice of vector vs array, but instead how
the memory is handled.

A temporary array made like this:

Digit *digits(new Digit[4]);

and then removed using delete is just as slow as a vector.

Whereas an array like this:

Digit digits[4];

is far faster.

The problem is though those variable-length arrays are
necessary -- no easy ways around 'em with the Floating Point
(FP) routines as I need to store bitshifted versions of the
operands without munging said operands. It might be possible
to allocate arrays of the second type I discussed above, but
then all the digit-manipulators have to take arrays not
vectors and that just doesn't seem like a good idea. Now,
if my program were single-threaded, I would have been able to
make do with a global buffer that would only get allocated once,
and always use that. But the program is going to be multi-
threaded, so having a single global buffer cannot really work.
And it seems ugly to have to pass a buffer operand to every
single arithmetic routine and is downright impossible when
using overloaded operators.

What do you do?

Victor Bazarov

Nov 26, 2007, 5:17:51 PM
mike3 wrote:
> On Nov 23, 2:42 pm, mike3 <mike4...@yahoo.com> wrote:
> <snip>
>
> I just finished testing another attempt at this. It seems the
> problem is not the choice of vector vs array, but instead how
> the memory is handled.
>
> A temporary array made like this:
>
> Digit *digits(new Digit[4]);
>
> and then removed using delete is just as slow as a vector.
>
> Whereas an array like this:
>
> Digit digits[4];
>
> is far faster.

Welcome to the dark side of dynamic memory management.

> The problem is though those variable-length arrays are
> necessary -- no easy ways around 'em with the Floating Point
> (FP) routines as I need to store bitshifted versions of the
> operands without munging said operands. It might be possible
> to allocate arrays of the second type I discussed above, but
> then all the digit-manipulators have to take arrays not
> vectors and that just doesn't seem like a good idea. Now,
> if my program were single-threaded, I would have been able to
> make do with a global buffer that would only get allocated once,
> and always use that. But the program is going to be multi-
> threaded, so having a single global buffer cannot really work.
> And it seems ugly to have to pass a buffer operand to every
> single arithmetic routine and is downright impossible when
> using overloaded operators.
>
> What do you do?

There are commercial and homegrown memory managers that can be
made much faster than the generic one your RTL supplies. Also,
if you know the largest possible size of your array, consider
using that to define an automatic array, even if you only
use part of it in the current call to your function.

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask


werasm

Nov 26, 2007, 6:30:03 PM
On Nov 26, 11:02 pm, mike3 <mike4...@yahoo.com> wrote:

> The problem is though those variable-length arrays are
> necessary -- no easy ways around 'em with the Floating Point
> (FP) routines as I need to store bitshifted versions of the
> operands without munging said operands. It might be possible
> to allocate arrays of the second type I discussed above, but
> then all the digit-manipulators have to take arrays not
> vectors and that just doesn't seem like a good idea.

A vector should be compatible with arrays, as the memory
of a vector is guaranteed to be contiguous. Therefore
having an array should IMO not force you to use arrays
all over the show. If you really don't have a choice,
I would use a scheme whereby one pre-creates vectors as
you require them, whereafter you just pop them off
a stack. This would/should reduce the required memory
allocations to a minimum. Something like this (probably
a naive implementation, you would have to test and experiment
a little):

#include <cassert>
#include <vector>
#include <memory>
//...#include "yourMutex.h"

template <class T>
struct VecStack
{
VecStack(){ vecList_.reserve( 50 ); }

std::auto_ptr<std::vector<T> > pop()
{
//...Mutex::guard lock( mutex_ );
std::auto_ptr<std::vector<T> > v( 0 );

if( vecList_.empty() )
{
v.reset( new std::vector<T> );
}
else
{
v.reset( vecList_.back() );
vecList_.pop_back();
}
return v;
}

void push( std::auto_ptr< std::vector<T> > v )
{
//...Mutex::guard lock( mutex_ );
assert( v.get() );
vecList_.push_back( v.release() );
}

private:
std::vector<std::vector<T>*> vecList_;
//...Mutex mutex_;
};


int main()
{
VecStack<int> vs;
std::auto_ptr<std::vector<int> > ivec( vs.pop() );
vs.push( ivec );
}


Replace Mutex with your favourite mutex... I used
auto_ptr in case something goes wrong in the using
function (a throw preventing the push). OTOH, if
I were not forced to use arrays all over the show
and speed did matter, I would stick with the array. I
think the above solution might be worth a try,
because popping the vectors IMO should be almost
as fast as creating the array. OTOH you might lose
some speed as a result of the indirection as well.

Kind regards,

Werner

terminator

Nov 27, 2007, 10:04:21 AM
On Nov 27, 1:02 am, mike3 <mike4...@yahoo.com> wrote:
> On Nov 23, 2:42 pm, mike3 <mike4...@yahoo.com> wrote:
> <snip>
>
> I just finished testing another attempt at this. It seems the
> problem is not the choice of vector vs array, but instead how
> the memory is handled.
>
> A temporary array made like this:
>
> Digit *digits(new Digit[4]);
>
> and then removed using delete is just as slow as a vector.
>
> Whereas an array like this:
>
> Digit digits[4];
>
> is far faster.
>
> The problem is though those variable-length arrays are
> necessary -- no easy ways around 'em with the Floating Point
> (FP) routines as I need to store bitshifted versions of the
> operands without munging said operands. It might be possible
> to allocate arrays of the second type I discussed above, but
> then all the digit-manipulators have to take arrays not
> vectors and that just doesn't seem like a good idea.

That is what I am talking about.

> Now,
> if my program were single-threaded, I would have been able to
> make do with a global buffer that would only get allocated once,
> and always use that. But the program is going to be multi-
> threaded, so having a single global buffer cannot really work.
> And it seems ugly to have to pass a buffer operand to every
> single arithmetic routine and is downright impossible when
> using overloaded operators.
>
> What do you do?

even in that case (global buffer) you would need to store the result in
a dynamically allocated array, so allocating memory is inevitable and
I prefer to do it at the beginning of the function.
If you do an in-place calculation, no dynamic allocation is necessary if
the capacity of the destination is already enough, but if you need to
store the initial value of the destination in another 'bignum' object
then one dynamic allocation is needed. You are destined to pay that
overhead, so a vector buffer followed by a swap is the clean, readable
way of doing it, but for the sake of performance I would declare an
in-place version too (just the way we can write 'x=y+z' and 'x+=y').

cheers,
FM.


terminator

Nov 27, 2007, 10:10:15 AM

Great idea.

> Also
> consider that if you know the possible largest size of your
> array, use that to define an automatic array, even if you only
> use part of it in the current call to your function.
>

consider:
a=b*c;

'a' would need to be dynamically allocated in the end.
Dynamic allocation is generally inevitable in programs using that
'bignum'; so why dirty the stack to prevent what has to happen anyway?

regards,
FM.

mike3

Nov 27, 2007, 2:54:22 PM

But it seems even the simplest method of allocating
the memory as an array is very slow.

Unless one puts in a precision limit (which it has)
and uses static arrays for the temporary buffers -- but
you can't mix vectors and arrays easily: the routines
have to take one or the other, and that would mean using
arrays directly inside objects. And I've heard that
arrays are "evil".

> > Also
> > consider that if you know the possible largest size of your
> > array, use that to define an automatic array, even if you only
> > use part of it in the current call to your function.
>
> consider:
> a=b*c;
>
> 'a' would need to be dynamically allocated in the end.
> Dynamic allocation is generally inevitable in programs using that
> 'bignum'; so why to dirty the stack for preventing what has to happen?
>

Not with floating point where you have to round at some
point. The 2x-width product of b and c need only be stored
in a temporary buffer before rounding to the precision of
"a", then it can be discarded.

> regards,
> FM.

mike3

Nov 27, 2007, 3:05:37 PM

Why does the global array need to be dynamic, constantly
changing its precision? The result is going to get rounded
at the end anyway, and the precision of the results is not
varied through the calculations constantly -- it does not
keep growing, so there's no reason to keep dynamically
allocating more memory.

If you have set up to use X amount of precision, your
global array needs only hold 2X worth. There may be another
array that holds X+1 or X+2's worth for computing
transcendental functions, hence the multiplication
buffer may need to hold 2X+2 or 2X+4's worth. But that is
not a problem. One need only resize it if the precision
being used is too small. Then when it's resized, it stays
that big. (It cannot grow indefinitely though as the
bignum package contains a max precision limit coded in.)
And the memory consumption is not too great -- these
numbers can take up at most 1 Kbyte or so, so that's only
a few Kbytes of storage which is nothing on today's PCs.

werasm

Nov 28, 2007, 2:33:32 AM
On Nov 24, 1:24 am, mike3 <mike4...@yahoo.com> wrote:
> I just reduced the routine to this:
>
> FG3DError FG3DDigitVector_Mul(std::vector<Digit> &vec0,
> const std::vector<Digit> &vec1,
> const std::vector<Digit> &vec2,
> int length1, int length2)
> {
> std::vector<Digit> tmpbuf(length1 + length2);
> return(FG3DError(FG3D_SUCCESS));
>
> }

I wrote a little test using the VecStack I mentioned below. One
has to consider that the test ran in a single-threaded
environment and things may change if you throw a
critical section in there (but not too much if it isn't
contended). I used Dev C++ (don't know which STL),
turned debugging off and enabled optimization for
speed. It turned out the vector code executed faster
than the array code by +10 seconds for 30000000
items. The test is below:

#include <iostream>


#include <cassert>
#include <vector>
#include <memory>
//...#include "yourMutex.h"

template <class T>
struct VecStack
{

typedef std::auto_ptr<std::vector<T> > ClientItem;

VecStack( unsigned defaultSz )
: defaultSz_( defaultSz )
{ vecList_.reserve( 50 ); }

ClientItem pop()


{
//...Mutex::guard lock( mutex_ );
std::auto_ptr<std::vector<T> > v( 0 );

if( vecList_.empty() )
{
v.reset( new std::vector<T>( defaultSz_ ) );


}
else
{
v.reset( vecList_.back() );
vecList_.pop_back();
}
return v;
}

void push( ClientItem v )


{
//...Mutex::guard lock( mutex_ );

//assert( v.get() );
vecList_.push_back( v.release() );
}

private:
std::vector<std::vector<T>*> vecList_;

unsigned defaultSz_;
//...Mutex mutex_;

};

enum
{
eMAX_EXPECTED_SZ = 512,
eLOOP_SZ = 30000000,
eINNER_SZ = 500
};


void testVect1( unsigned len )
{
typedef VecStack<int> StackType;
static StackType stack( eMAX_EXPECTED_SZ );

StackType::ClientItem item = stack.pop();
item->resize( len );
stack.push( item );
}
void testVect2( unsigned len )
{
int Items[eMAX_EXPECTED_SZ] = { 0 };
}
void doTest( void(*op)(unsigned) )
{
for( int i = 0; i < eLOOP_SZ/eINNER_SZ; ++i )
{
for( int j = 0; j < eINNER_SZ; ++j )
{
(*op)( j );
}
}
}

int main(int argc, char *argv[])
{
std::cout << "Press key to start test!" << std::endl;
std::cin.get();
std::cout << "Test1 running...\n";
doTest( testVect1 );
std::cout << "Test2 running... \n";
doTest( testVect2 );
std::cout << "End... Press key to exit." << std::endl;
std::cin.get();
return EXIT_SUCCESS;
}

Kind regards,

Werner

terminator

Nov 29, 2007, 9:35:16 AM

I did not mean the global buffer to be placed on the heap; I mean that
when you store the value in the result 'bignum' a dynamic allocation
happens, and it is better to have a non-static vector buffer (whose
implementation performs the dynamic allocation) and swap it into the
result (that prevents a second dynamic allocation).

>
> If you have set up to use X amount of precision, your
> global array needs only hold 2X worth.

actually a multiplication result holds **less than** or equal to
length1+length2 digits.
If you want to store just the necessary digits then a normal array
buffer can help decrease the size of the resulting vector, but there
are better programming techniques to avoid a normal array while
achieving similar performance.

> There may be another
> array that holds X+1 or X+2's worth for computing
> transcendental functions hence the multipliaction
> buffer may need to hold 2X+2 or 2X+4's worth. But that is
> not a problem. One need only resize it if the precision
> being used is too small. Then when it's resized, it stays
> that big. (It cannot grow indefinitely though as the
> bignum package contains a max precision limit coded in.)
> And the memory consumption is not too great -- these
> numbers can take up at most 1 Kbyte or so, so that's only
> a few Kbytes of storage which is nothing on today's PCs.

Suppose a number of length 'length1' is multiplied by another of length
'length2'; then you just need to check whether 'length1+length2' is
greater than 'maximum_size'.
'length1+length2' is at most '2*max(length1,length2)', and neither
operand needs to be less than 'maximum_size/2', but as mentioned
formerly, the sum of the lengths must not be greater than 'maximum_size'.

regards,
FM.

mike3

Nov 29, 2007, 2:40:04 PM
On Nov 29, 7:35 am, terminator <farid.mehr...@gmail.com> wrote:
> On Nov 27, 11:05 pm, mike3 <mike4...@yahoo.com> wrote:
<snip>

> I did not mean the global buffer to be placed on heap;I mean that when
> you store the value in the result 'bignum' a dynamic allocation
> happens and it is better to have a none static vector buffer ( whose
> implementation performs the dynamic allocation ) and swap it to the
> result (that prevents a second dynamic allocation).
>

Why does a dynamic allocation happen to the bignum?
In the floating point routines, the multiplication
stores its wide result in a temporary buffer that's been
preallocated to the full size. Then the _upper part_ of
this buffer (its most-significant digits) is copied to
the destination floating point number, so that it fits
within the allotted precision. Floating point numbers
are intended to be static; their precision does not
change except with specifically-made resize functions,
for maximum performance.

>
>
> > If you have set up to use X amount of precision, your
> > global array needs only hold 2X worth.
>
> actually a mutiplication holds **less than** or equal to
> length1+length2.

Yes, that's right.

> if you want to store just necessary digits then a normal array buffer
> can help decrease the size of resulting vector , but there are better
> programing techniques to avoid a normal array while achiving similar
> performance.
>

But then you need to count digits to figure out how
big the product is going to be.

The big problem here is handling the allocation/
deallocation of the memory.

> > There may be another
> > array that holds X+1 or X+2's worth for computing
> > transcendental functions hence the multipliaction
> > buffer may need to hold 2X+2 or 2X+4's worth. But that is
> > not a problem. One need only resize it if the precision
> > being used is too small. Then when it's resized, it stays
> > that big. (It cannot grow indefinitely though as the
> > bignum package contains a max precision limit coded in.)
> > And the memory consumption is not too great -- these
> > numbers can take up at most 1 Kbyte or so, so that's only
> > a few Kbytes of storage which is nothing on today's PCs.
>
> suppose a number of 'length1 ' multiplied by another of 'length2' then
> you just need to check whether 'length1+length2' is greater than the
> 'maximum_size' .
> 'length1+lenght2' is less than '2*max(length1,length2)'.And none of
> the operands need be less than 'maximum_size/2' but as mention
> formerly ,the sum of lengthes must be not greater than 'maximum_size'.
>

But where is this temporary buffer stored? Also,
in my multithreaded application, how can you give
each thread its own buffer without having to pass
pass special parameters to every bignum routine
(impossible with overloaded operators!), _and_
not having a "rationing" method where the threads
take turns using a single buffer, which utterly
destroys the parallelism that would be gained
on machines with multiprocessing ability
(multiple CPUs/multiple cores)?

terminator

Nov 30, 2007, 1:50:32 PM
On Nov 29, 5:35 pm, terminator <farid.mehr...@gmail.com> wrote:
> I mean that when
> you store the value in the result 'bignum' a dynamic allocation
> happens and it is better to have a none static vector buffer ( whose
> implementation performs the dynamic allocation ) and swap it to the
> result (that prevents a second dynamic allocation).
>

actually I missed something: if the result 'bignum' has been used
before, then its size/capacity may be enough to hold the result of the
calculation, in which case a buffer on the stack may increase speed.

regards,
FM.

mike3

Nov 30, 2007, 7:00:17 PM

That's right, and with floating point we don't store the full-width
product; it's only stored as an intermediate temporary prior to
rounding. However, a stack array would require the digit manipulation
routines to take arrays instead of vectors, which would therefore force
everything to be arrays -- unless of course one makes two routines,
one for each type of argument, but that seems like a poor, poor
coding practice.

mike3

Nov 30, 2007, 7:00:51 PM

And therefore, because of that, what should I do?

terminator

Dec 1, 2007, 3:42:48 AM

No, you just cannot use swap (which would not be needed in this regard
anymore).

mike3

Dec 1, 2007, 4:03:13 AM

So then how do I avoid reallocation on every call of the routines?
If I keep a stack of preallocated vector buffers, one for each thread,
how do I make the bignum routines know which to use without passing
parameters all the time, which to me is poor coding, and also
downright
impossible with overloaded operators?

werasm

Dec 1, 2007, 2:00:10 PM
On Dec 1, 10:03 am, mike3 <mike4...@yahoo.com> wrote:

> So then how do I avoid reallocation on every call of the routines?
> If I keep a stack of preallocated vector buffers, one for each thread,
> how do I make the bignum routines know which to use without passing
> parameters all the time, which to me is poor coding, and also
> downright

> impossible with overloaded operators?

I don't know what I'm missing here. To me the solution I
proposed should work fine. You don't need a vector per
thread. You only need a vector for the number of threads
passing through the routine simultaneously. Furthermore,
you would not have that many reallocations. Guess at
the largest buffer required and reserve on instantiation.
Chances are that you would not have more than two or
three threads passing through the function simultaneously
anyway. I tested it (the stack of vectors) for the single-
threaded scenario and it beat the array by some margin
(obviously optimizations were at maximum). The solution
should at least improve things dramatically in comparison
to your original solution.

Regards,

Werner

mike3

Dec 1, 2007, 4:59:30 PM

But still, I need more than one buffer. How do I indicate
to the threads which buffer they should use without
including funny "use this buffer" parameters on every
bignum routine, which to me equals poor coding practice and
is also 100% impossible with overloaded operators?

mike3

Dec 1, 2007, 11:15:14 PM
On Nov 27, 8:04 am, terminator <farid.mehr...@gmail.com> wrote:

And of course the reallocations shouldn't get done in every single
iteration of the fractals!


mike3

Dec 2, 2007, 2:52:09 AM
On Dec 1, 2:59 pm, mike3 <mike4...@yahoo.com> wrote:
> On Dec 1, 12:00 pm, werasm <wer...@gmail.com> wrote:
<snip>

I think I figured out how to get the threads assigned
the right buffers, but I need a very fast "mutex" control.
Would something simple suffice, like a boolean "in use"
flag in the buffer stack that pushing/popping waits for
with a simple while() loop?

mike3

Dec 3, 2007, 4:05:59 PM
On Nov 26, 4:30 pm, werasm <wer...@gmail.com> wrote:
> On Nov 26, 11:02 pm, mike3 <mike4...@yahoo.com> wrote:
>
> > The problem is though those variable-length arrays are
> > necessary -- no easy ways around 'em with the Floating Point
> > (FP) routines as I need to store bitshifted versions of the
> > operands without munging said operands. It might be possible
> > to allocate arrays of the second type I discussed above, but
> > then all the digit-manipulators have to take arrays not
> > vectors and that just doesn't seem like a good idea.
>
> A vector should be compatible with arrays as the memory
> of vectors are guaranteed to be contiguous. Therefore
> having an array should IMO not force you to use arrays
> all over the show. If you really don't have a choice,
> I would use a scheme whereby one pre creates vectors as
> you require them, whereafter you just pop them of
> a stack. This would/should reduce the required memory
> allocations to a minimum. Something like this (probably
> naive implementation, you would have to test and experiment
> a little):
<snip>

The problem is parameter type conflicts and repeated code.
See, the basic routines that manipulate the digits take
special objects of type "RawDigit", not simple arrays of
digits. This avoids having nasty bounds checking all over
the place (note the bounds checking in the routine in my
original post), as RawDigit contains its own "length" field.
Therefore you'd have to have separate routines for array
manipulation, and having two sets of routines like that is
a poor design and bad for maintainability purposes (think:
bug inducing). Thus for the purpose of good design, I am
forced to use the bunch of preallocated buffers. Furthermore,
what about a thread that needs a buffer deep down in the
stack? You have to pop off _all_ the vectors down to that
point, retrieve it, then push them _all_ back on. Also,
what would be the fastest possible "mutex", anyway?

werasm

Dec 3, 2007, 4:32:24 PM
On Dec 1, 10:59 pm, mike3 <mike4...@yahoo.com> wrote:

> But still, I need more that none buffer.

Don't follow?

> How do I indicate
> to the threads though which buffer they should use without
> including funny "use this buffer" parameters on every
> bignum routine, which to me equals poor coding practice and
> also is 100% impossible with overloaded operators?

I don't see why it is necessary to indicate to threads
which buffer they should use. It is irrelevant which
buffer is used: you simply pop off the next buffer.
If none exists (at the time of the pop), it means all
are used at that instant, in which case one gets created.

After using the buffer, you just push it (or the pointer
to it) back. Threads never share buffers as once a buffer
is used it is not available to other threads. As buffers
are only used temporarily (like the array on the stack)
you don't need to keep state, and threads don't care
which buffer is used (see my first reply to you
for an example):

http://groups.google.com/group/comp.lang.c++/tree/browse_frm/thread/3a4878ba5a6a844a/30edbcd282782897?rnum=11&_done=%2Fgroup%2Fcomp.lang.c%2B%2B%2Fbrowse_frm%2Fthread%2F3a4878ba5a6a844a%2Fbef7efe28a247163%3F#doc_9d759b72822e5491

Your original question: Can I avoid the use of arrays...?

My answer - yes, use a stack of pointers to vectors
and allocate the vectors as required (see example).
In testing it for one thread of execution it beat
zero-initialized arrays, to my surprise. For
the multithreaded (MT) case it depends on how good your
MT primitives are. The whole "which thread uses
which buffer" question is irrelevant (emphasis).

I used auto_ptr for the generic stack as this
emphasizes that the buffer does not belong to
the stack once popped, until returned. Therefore,
should exceptions happen, you would have no
leaks (transfer of ownership to the caller -
to me the purpose of using auto_ptr).
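
For reference, here is a condensed sketch of what I mean (the full
example is in the linked post). Mutex::guard is just a stand-in for
whatever scoped lock you have available; the rest follows the names
used above:

#include <cstddef>
#include <memory>
#include <vector>

// Placeholder lock so the sketch is self-contained -- substitute a
// real scoped lock (CRITICAL_SECTION, pthread mutex, ...) for any
// multithreaded use.
struct Mutex
{
    struct guard
    {
        explicit guard( Mutex& ) {}   // no-op placeholder
    };
};

template <class T>
class VecStack
{
public:
    typedef std::auto_ptr< std::vector<T> > ClientItem;

    explicit VecStack( std::size_t defaultSz ) : defaultSz_( defaultSz ) {}

    ~VecStack()
    {
        for( std::size_t i = 0; i < vecList_.size(); ++i )
            delete vecList_[i];
    }

    // Borrow a buffer. A fresh one is allocated only when none is
    // available, i.e. the first time, or when several threads are
    // inside the bignum routines at the same instant.
    ClientItem pop()
    {
        Mutex::guard lock( mutex_ );
        if( !vecList_.empty() )
        {
            ClientItem v( vecList_.back() );
            vecList_.pop_back();
            return v;
        }
        return ClientItem( new std::vector<T>( defaultSz_ ) );
    }

    // Hand the buffer back so it can be reused by the next caller.
    void push( ClientItem v )
    {
        if( v.get() != 0 )
        {
            Mutex::guard lock( mutex_ );
            vecList_.push_back( v.release() );
        }
    }

private:
    std::vector< std::vector<T>* > vecList_;  // owned while stored here
    Mutex mutex_;
    std::size_t defaultSz_;
};

pop() never returns null, and push() takes ownership back; a buffer
that is popped but never pushed is simply deleted by its auto_ptr, so
exceptions cannot leak it.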

Regards,

Werner


mike3

unread,
Dec 4, 2007, 2:05:36 AM12/4/07
to
On Dec 3, 2:32 pm, werasm <wer...@gmail.com> wrote:
> On Dec 1, 10:59 pm, mike3 <mike4...@yahoo.com> wrote:
>
> > But still, I need more than one buffer.
>
> Don't follow?
>
> > How do I indicate
> > to the threads though which buffer they should use without
> > including funny "use this buffer" parameters on every
> > bignum routine, which to me equals poor coding practice and
> > also is 100% impossible with overloaded operators?
>
> I don't see why it is necessary to indicate to threads
> which buffer they should use. It is irrelevant which
> buffer is used: you simply pop off the next buffer.
> If none exists (at the time of the pop), it means all
> are used at that instant, in which case one gets created.
>
> After using the buffer, you just push it (or the pointer
> to it) back. Threads never share buffers as once a buffer
> is used it is not available to other threads. As buffers
> are only used temporarily (like the array on the stack)
> you don't need to keep state, and threads don't care
> which buffer is used (see my first reply to you
> for an example):
>
> http://groups.google.com/group/comp.lang.c++/tree/browse_frm/thread/3...

>
> Your original question: Can I avoid the use of arrays...?
>
> My answer - yes, use a stack of pointers to vectors
> and allocate the vectors as required (see example).
> In testing it for one thread of execution it beat
> zero-initialized arrays, to my surprise. For
> the multithreaded (MT) case it depends on how good your
> MT primitives are. The whole "which thread uses
> which buffer" question is irrelevant (emphasis).
>
> I used auto_ptr for the generic stack as this
> emphasizes that the buffer does not belong to
> the stack once popped, until returned. Therefore,
> should exceptions happen, you would have no
> leaks (transfer of ownership to the caller -
> to me the purpose of using auto_ptr).
>
> Regards,
>
> Werner

Ah, now I see how it works. So if you have, say,
8 threads running, you create and push on 8 buffers.
One will never need more so long as the amount of
threads does not exceed that limit. If one needs more
threads, one need only allocate the extra buffers once,
not on all 500 million bignum operations
needed to render an image.
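
So, using the names from your example, each bignum routine would just
do something like this rough sketch (scratchPool and SomeDigitRoutine
are made-up names here):

#include <cstddef>

typedef unsigned int Digit;             // stand-in for my actual digit type

VecStack<Digit> scratchPool( 64 );      // shared by all the bignum routines

void SomeDigitRoutine( std::size_t neededDigits )
{
    VecStack<Digit>::ClientItem buf = scratchPool.pop(); // reuse, or allocate once
    if( buf->size() < neededDigits )
        buf->resize( neededDigits );                      // grow only when required

    // ... do the digit work in *buf instead of a local array ...

    scratchPool.push( buf );                              // hand it back for reuse
}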

Also, what about the speed of the mutex, which
gets called so constantly? Will that significantly
impede performance? In the single-thread case the
mutex calls can be suppressed, but what about in the
multi-threaded one?

werasm

unread,
Dec 4, 2007, 3:57:28 AM12/4/07
to
On Dec 4, 8:05 am, mike3 <mike4...@yahoo.com> wrote:

> Ah, now I see how it works. So if you have, say,
> 8 threads running, you create and push on 8 buffers.
> One will never need more so long as the amount of
> threads does not exceed that limit. If one needs more
> threads, one need only allocate the extra buffers once,
> not on all 500 million bignum operations
> needed to render an image.

Actually, for this implementation you don't need to
push any buffers at all initially. You just pop,
and then push when you're done. If the
underlying stack is empty a new vector gets created
automatically (the first time). Then buffers will
only be added when more than one thread passes
through the function simultaneously. See the example
in main.


You only need as many buffers as the number of threads
passing through the function in question simultaneously.

> Also, what about the speed of the mutex, which
> gets called so constantly? Will that significantly
> impede performance?

This depends on your OS. Most OSes have mechanisms
that only cause a significant reduction in
performance when contention arises (switching to
kernel mode). On Windows you could typically wrap
the critical section (I think Boost, or a library
like Loki, will have done this for you already
with some scoped locking mechanism). Under Windows
you should rather use a CRITICAL_SECTION (and not
a mutex), as a mutex is for interprocess synchronization
and has more overhead.
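
Something like this minimal wrapper is what I mean (illustrative only;
Boost and Loki ship polished versions of the same idea):

#include <windows.h>

class CriticalSection
{
public:
    CriticalSection()  { InitializeCriticalSection( &cs_ ); }
    ~CriticalSection() { DeleteCriticalSection( &cs_ ); }

    void enter() { EnterCriticalSection( &cs_ ); }
    void leave() { LeaveCriticalSection( &cs_ ); }

    // Scoped lock: acquires on construction, releases on destruction,
    // so early returns and exceptions cannot leave the section held.
    class guard
    {
    public:
        explicit guard( CriticalSection &cs ) : cs_( cs ) { cs_.enter(); }
        ~guard() { cs_.leave(); }
    private:
        CriticalSection &cs_;
        guard( const guard& );              // non-copyable
        guard &operator=( const guard& );
    };

private:
    CRITICAL_SECTION cs_;
    CriticalSection( const CriticalSection& );            // non-copyable
    CriticalSection &operator=( const CriticalSection& );
};

The pop/push of the buffer stack would then just take a
CriticalSection::guard at the top of each call.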

Regards,

Werner

werasm

unread,
Dec 4, 2007, 3:58:51 AM12/4/07
to
On Dec 4, 9:57 am, werasm <wer...@gmail.com> wrote:
> On Dec 4, 8:05 am, mike3 <mike4...@yahoo.com> wrote:
>
> > Ah, now I see how it works. So if you have, say,
> > 8 threads running, you create and push on 8 buffers.
> > One will never need more so long as the amount of
> > threads does not exceed that limit. If one needs more
> > threads, one need only allocate the extra buffers once,
> > not on all 500 million bignum operations
> > needed to render an image.
>
> Actually, for this implementation you don't need to
> push any buffers at all initially. You just pop,
> and then push when you're done. If the
> underlying stack is empty a new vector gets created
> automatically (the first time). Then buffers will
> only be added when more than one thread passes
> through the function simultaneously. See the example
> in main.
>
> You only need as many buffers as the number of threads
> passing through the function in question simultaneously.

And this is determined automatically, because if pop
is called when the underlying structure is empty, a
new vector gets created...

mike3

unread,
Dec 4, 2007, 4:29:42 AM12/4/07
to

Alright, I'll do that then.

Thanks for the answers.

terminator

unread,
Dec 4, 2007, 5:53:13 AM12/4/07
to

pop can be designed better so that it never returns a null in a
multithreaded environment:

ClientItem pop()
{
    Mutex::guard lock( mutex_ );   // protect the underlying stack

    if( !vecList_.empty() )
    {
        ClientItem v( vecList_.back() );
        vecList_.pop_back();
        if( v.get() != 0 )
            return v;              // hand out an existing buffer
    }
    // Reached when the stack was empty (or the stored pointer was null):
    // allocate a fresh vector rather than returning null.
    return ClientItem( new std::vector<T>( defaultSz_ ) );
}

You can also define an allocator instead of a VecStack with a similar
implementation (cache a few pointers and allocate new memory if the
cache is full).

regards,
FM.

werasm

unread,
Dec 4, 2007, 6:19:29 AM12/4/07
to
On Dec 4, 11:53 am, terminator <farid.mehr...@gmail.com> wrote:

> pop can be designed better so that it never returns a null in a
> multithreaded environment:

I don't follow - currently it never returns a null. The
idea is that you can pop from an empty stack. I probably
shouldn't call it a stack, as it is more of an allocator.


> you can also define an allocator instead of a vecstack with similar
> implementation(cache a few pointers and allocate new memory if cache
> is full).

Yes, you could use an allocator interface. I don't think
caching pointers will buy you that much though. Perhaps
give an example.

Regards,

Werner

werasm

unread,
Dec 4, 2007, 6:43:27 AM12/4/07
to
On Dec 4, 11:53 am, terminator <farid.mehr...@gmail.com> wrote:

> pop can be designed better so that it never returns a null in a
> multithreaded environment:

My idea was rather to prevent pushing NULL onto
the stack. Therefore the implementation of push
becomes:

assert( v.get() );
Mutex::guard lock( mutex_ );
vecList_.push_back( v.release() );

or even...

if( v.get() != NULL )
{
    Mutex::guard lock( mutex_ );
    vecList_.push_back( v.release() );
}

... as popping from the empty stack is OK
in this case. The only thing that I don't like
about the above is that it is not consistent with
a traditional stack. Therefore an allocator
interface/adapter would be better, yes. I still
don't follow the whole caching-of-pointers idea,
or rather I don't think it will buy you much.

Regards,

Werner

terminator

unread,
Dec 4, 2007, 9:16:22 AM12/4/07
to

Mistake is mine, don't worry.

Really sorry.

werasm

unread,
Dec 4, 2007, 9:31:17 AM12/4/07
to
On Dec 4, 3:16 pm, terminator <farid.mehr...@gmail.com> wrote:


> mistake is mine don`t worry

No, perhaps you are right though. Calling
the data structure a stack is slightly
confusing. It's more of a vector pool
implemented in terms of an underlying
stack.

Perhaps...

template <class T>
struct VecStack
{
    //...
    ClientItem get();
    void put( ClientItem v );
    //...
};

... would be a clearer interface.

Regards,

Werner

terminator

unread,
Dec 5, 2007, 12:44:34 PM12/5/07
to

I still think that it is better to use this pattern (with minor
modifications) for an allocator, not a stack of vectors.

regards,
FM.

werasm

unread,
Dec 5, 2007, 4:01:20 PM12/5/07
to
On Dec 5, 6:44 pm, terminator <farid.mehr...@gmail.com> wrote:


> I still think that it is better to use this pattern(with minor
> modifications) for an allocator , not a stack of vectors.

Without a little example I'm in the dark as to what you
mean. Which pattern - the one I've suggested, and what
minor modifications? I would love the example as I believe
I might learn something in the process. I've contemplated
traditional allocators but to me this is just not the
same thing. Care to take the time to give an example?

Regards,

Werner
