Optimizing Power Function for Integral Types

131 views
Skip to first unread message

joel.g....@gmail.com

unread,
Jan 17, 2019, 11:02:13 AM1/17/19
to ISO C++ Standard - Future Proposals
As of C++11, per overload 7 of the 'pow' function (https://en.cppreference.com/w/cpp/numeric/math/pow), integral types can be used as parameters, but they will be cast to double before calculation. I believe that an integral-specific overload should be made, which makes use of faster integer-operations available to CPUs.

As a disclaimer, I am still a rather inexperienced programmer (College Sophomore), and I my have some misunderstands as to what would fall under the ISO standard and what would fall under compiler-implementations. As I understand it, the line about "If any argument has integral type, it is cast to double." is a component of the ISO standard.

As stated on isocpp.org's FAQ section, this idea is based on existing practices. As discussed here (https://stackoverflow.com/questions/2398442/why-isnt-int-powint-base-int-exponent-in-the-standard-c-libraries), multiple implementations of this function have been made on a per-user basis, due to its general simplicity and lack of inclusion in previous standards. In the below testing, I created a function using x86 asm. It is worth noting that even with more standard C++ approaches, as testified in the stackoverflow discussion above, a significant performance improvement can be made.

For the sake of brevity, the data-table and source have been withheld, but if it is relevant to the conversation, I would be more than willing to provide them.

Do any of you agree that this is a useful addition to the C++ standards?

Performance Testing:
Tests were run on an Intel Core i3 6006U @ 2.0GHz

Custom Function (VS 2017) :
__declspec(naked) __int32 __fastcall powi(__int32 base, __int32 exp) { __asm { ... } }

Test Range:
base: [-46340, 46340]  (int-truncation of the sqrt of MAX_INT)
exp: [-19, 19] (highest power of 3, that does not overflow the int type)

Results: (Approx. x20 avg. improvement)

Untitled.png


Possible Use-Cases:

Allocation of N-dimensional arrays, with N^M elements.

Dilip Ranganathan

unread,
Jan 17, 2019, 11:06:25 AM1/17/19
to std-pr...@isocpp.org
I am just writing to +1 this. I have long wondered why there is no int overload
for the pow function but wasn't confident enough of asking here fearing that
I would've missed something blindingly obvious.

Thanks Joel for taking that baton.

--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposal...@isocpp.org.
To post to this group, send email to std-pr...@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/fbfec82c-5416-4c96-b53b-ef3d4309b613%40isocpp.org.
Message has been deleted

joel.g....@gmail.com

unread,
Jan 18, 2019, 1:12:54 PM1/18/19
to ISO C++ Standard - Future Proposals
To amend the graph given; I have discovered a mistake in my timing code. After fixing said mistake, the graph no-longer shows O(log(n)) complexity for either function with positive powers. The 20x improvement in performance is still present. I know that this doesn't effect the primary points of the proposal, but I just wanted to clarify a bit of information.

joel.g....@gmail.com

unread,
Jan 18, 2019, 1:17:27 PM1/18/19
to ISO C++ Standard - Future Proposals
Thanks for the positive feedback. I will admit, there was a certain amount of doubt when I was first considering pushing for this. It felt like every good suggestion had already been made and considered; To a certain extent, I had considered the STL holy. That's not to say I don't understand some of the rationale behind this optimization's inclusion. Bloat is a major design constraint.

Arthur O'Dwyer

unread,
Jan 18, 2019, 5:38:49 PM1/18/19
to ISO C++ Standard - Future Proposals
I am skeptical of this graph's reality. It looks too neat to me, with an identical dip at n^0 and a constant gap between the two lines even on a log scale. (I could believe a constant gap on a linear scale, if what we were witnessing is the constant cost of two int-to-double conversions. But not a constant gap on a log scale!)
Do you have a link to your exact test cases and the software that produced this graph, such that other people could reproduce your experiment from scratch on their own hardware?

Another thing that makes me skeptical of this graph is the weirdly constant and very low latencies for n^-19 through n^-1 (which are displayed as -n^19 to -n^1 on the X-axis, but I assume that's a typo).  Solving n^-19 (a.k.a. 1/n^19) in integers is just "return (n == 1) ? 1 : (n == -1) ? -1 : 0;", right? Whereas solving n^-19 in floating-point ought to require the exact same "log, multiply, exp" that would be required for n^19.

As for adding a new overload of std::pow — no, you'll never get away with that.

    double x = std::pow(10, 20);

currently has well-defined behavior — it yields 1e+20.  You're proposing to make it have undefined behavior instead.  I don't think that's going to fly.

You'd have better luck asking GCC and Clang to optimize `__builtin_pow` for small integers. Which I think they might already do?
Looks like maybe they special-case std::pow(1, n) but nothing else.

HTH,
Arthur

Jake Arkinstall

unread,
Jan 18, 2019, 6:42:19 PM1/18/19
to ISO C++ Standard - Future Proposals
I agree on that - but I also dislike like the unnecessary abbreviation "pow", so I'd propose a templated function std::power, maybe with a default for unsigned integral powers if T operator*(T const&, T const&) exists.

Even better, an operator for it instead (e.g. ^^, as ** introduces ambiguities).

joel.g....@gmail.com

unread,
Jan 18, 2019, 8:12:18 PM1/18/19
to ISO C++ Standard - Future Proposals
The case you pointed out of using the double-type of the return for 'pow(int, int)' would be a major concern. Perhaps a workaround would be the requirement of an explicit template? i.e. "double x = std::pow(10, 20);" would use overload 7, whereas "std::pow<int, int>(10, 20)" would return int? Alternatively, as Jake Arkinstall suggested, a new syntax for 'pow' as 'power' might be in order, in which case, such contingencies wouldn't be as large a concern. I would also argue that the above code segment is poor practice, as it is relying upon the pow function to convert the two integers to doubles. More explicit typing would be preferred in most coding environments. Another solution, as is implemented with powl and powf, is to use a single-letter abbreviation to explicitly define the types used. i.e. powi.

As for the graph; I mentioned in a reply to my original posting that I have since found an issue with the testing method I used for graphing the execution times for pow and powi. I have since done other tests as shown below [avg. of 20000 runs for powi, and 500 for pow (since the execution is longer, I decided to neglect the sample-count somewhat)]:

Untitled2.png


As you can see, powi does experience a gradual increase in execution length with the increasing exponents (since it needs to do at worst [if an implementation were to use a simple linear approach, O(n) multiplications]).


While the normal method does appear to be O(1), the execution lengths for floating-point operations are significant enough to negate those benefits with the ranges of 32 and 64-bit integer types.


Also below, I have included the collected data, which was gathered using "std::chrono::high_resolution_clock::now()" surrounding for-loops executing the functions within the base-value ranges for each exponent.


Power STL Custom
-19 152.4881 5.922077
-18 156.3487 5.497594
-17 141.8481 5.546061
-16 145.8614 5.683295
-15 145.8326 6.041384
-14 149.2215 5.701708
-13 145.5323 5.59141
-12 149.3403 5.602157
-11 146.3427 5.974374
-10 152.5245 6.180647
-9 146.4068 5.772337
-8 143.3287 5.782097
-7 152.735 5.678149
-6 146.4267 5.743723
-5 145.5064 5.750488
-4 150.1708 5.772985
-3 142.524 6.569755
-2 140.4772 5.600695
-1 143.6424 6.281142
0 107.1507 3.854533
1 140.658 6.386001
2 147.3853 6.076429
3 146.9267 8.102243
4 141.4524 6.984409
5 136.9645 10.44516
6 139.3443 8.883989
7 142.125 7.988396
8 141.0955 7.625333
9 140.6131 12.97227
10 138.1168 11.3224
11 143.1305 10.61643
12 139.4992 10.38626
13 141.5211 11.2896
14 141.0866 10.22915
15 140.7223 9.948738
16 139.7512 9.173968
17 138.1409 15.03408
18 141.6834 13.97868
19 153.7554 14.77899

joel.g....@gmail.com

unread,
Jan 18, 2019, 8:19:17 PM1/18/19
to ISO C++ Standard - Future Proposals
I think the idea of un-abbreviating pow would be a nice design shift, though I am skeptical due to the amount of code which would be made 'legacy' by such a change. As for the operators... Those would need to be introduced at the core-language level, and I don't think that it would be an appropriate proposal here. As a side note here; I DO think it would be an interesting shift in the core language to support user-defined operators. It would be very difficult to define a consistent methodology, but it would significantly improve the general readability of a good number of object-interactions. For instance: operator::@(MyPointClass c)". Like I said though, this really isn't the place for such ideas, as those would need to happen at the lower-levels of C++.

Arthur O'Dwyer

unread,
Jan 19, 2019, 1:59:21 AM1/19/19
to ISO C++ Standard - Future Proposals
On Fri, Jan 18, 2019 at 8:12 PM <joel.g....@gmail.com> wrote:
The case you pointed out of using the double-type of the return for 'pow(int, int)' would be a major concern. Perhaps a workaround would be the requirement of an explicit template? i.e. "double x = std::pow(10, 20);" would use overload 7, whereas "std::pow<int, int>(10, 20)" would return int? Alternatively, as Jake Arkinstall suggested, a new syntax for 'pow' as 'power' might be in order, in which case, such contingencies wouldn't be as large a concern. I would also argue that the above code segment is poor practice, as it is relying upon the pow function to convert the two integers to doubles. More explicit typing would be preferred in most coding environments. Another solution, as is implemented with powl and powf, is to use a single-letter abbreviation to explicitly define the types used. i.e. powi.

As for the graph; I mentioned in a reply to my original posting that I have since found an issue with the testing method I used for graphing the execution times for pow and powi. I have since done other tests as shown below [avg. of 20000 runs for powi, and 500 for pow (since the execution is longer, I decided to neglect the sample-count somewhat)]:

Untitled2.png

As you can see, powi does experience a gradual increase in execution length with the increasing exponents (since it needs to do at worst [if an implementation were to use a simple linear approach, O(n) multiplications]).

But really O(lg n) multiplications, right?
That graph looks a lot better (flatter), but I'd still have to see the code before I believe it. You say your data was

> gathered using "std::chrono::high_resolution_clock::now()" surrounding for-loops executing the functions

which pointedly omits any reference to e.g. Google Benchmark or ClobberMemory/DoNotOptimize. Which suggests that your benchmark might not have been measuring anything meaningful.

–Arthur

joel.g....@gmail.com

unread,
Jan 22, 2019, 12:50:59 PM1/22/19
to ISO C++ Standard - Future Proposals
I wasn't aware of Google Benchmark before now. I will attempt to familiarize myself with it, so that I can get more accurate timings. In the meanwhile, below is my implementation of what I hope to propose. If the results from Google Benchmark don't align with what I have presented, I sincerely apologize for the misinformation.

__declspec(naked) __int32 __fastcall powi(__int32 base, __int32 exponent) {
  __asm
{
 
// No Stack Usage
 
/*------- Prolog ---------*/
 
//push ebp
 
//mov ebp, esp
 
//sub esp, __LOCAL_SIZE
 
/*------------------------*/


 
// Entry Register Values
 
//------------------------
 
// EAX = UNDEFINED
 
// EBX = UNDEFINED
 
// ECX = base
 
// EDX = exponent
 
// ESI = UNDEFINED
 
// EDI = UNDEFINED
 
//------------------------


 
// eax = 0
  xor eax
, eax;


 
// if(exponent == 0) { return 1; }
  test edx
, edx;
  jnz if_zero_exp
;
  inc eax
; // Set eax to 1
  jmp powi_exit
;
if_zero_exp
:
 
 
// if(base == 0) { return 0; }
  test ecx
, ecx;
  jz powi_exit
;


 
// ebx = base
 
// ecx = exponent
  mov ebx
, ecx;
  mov ecx
, edx;


 
 
// ecx = abs(exponent)
 
// edx = sign(exponent)
  sar edx
, 31;
  xor ecx
, edx;
 
sub ecx, edx;


 
// Create a copy of the base, and move it to a temporary register
  mov esi
, ebx;
 
 
/* find abs(base) */
  mov eax
, ebx; // Copy the base to the accumulator
  sar eax
, 31;  // Fill with sign bit
  xor ebx
, eax; // Not the value, if sign bit is set
 
sub ebx, eax; // Subtract the sign bit, giving the absolute value
 
 
// Return the copy to the accumulator
  mov eax
, esi;


 
// EAX = base
 
// EBX = abs(base)
 
// ECX = abs(exponent)
 
// EDX = sign(exponent)


 
// if exponent is negative
 
/*=========================================================*/
  test edx
, edx;
  jz if_exp_sign_eq_0
;


 
// if abs(base) == 1
 
/*----------------------------*/
  cmp ebx
, 1;
  jg if_base_gt_1
;
 
// ecx = exponent & 1
 
and ecx, ebx;
 
 
// eax = exponent & 1 ? (base < 0 ? -1 : 1) : 0
  imul ecx
;


 
// ecx = exponent & 1 ? 0 : 1
  xor ecx
, ebx;


 
// eax = exponent & 1 ? (base < 0 ? -1 : 1) : 1
  add eax
, ecx;


 
// Return exponent & 1 ? (base < 0 ? -1 : 1) : 1
  jmp powi_exit
;
if_base_gt_1
:
 
/*----------------------------*/
 
 
// Return 0
  xor eax
, eax;
  jmp powi_exit
;


if_exp_sign_eq_0
:
 
/*=========================================================*/


 
/* if exponent == 1, return base */
  dec ecx
;
  jz powi_exit
;


 
// If base == 2
 
//cmp ebx, 2;
 
//jne if_base_2;
 
// sal eax, cl;
 
// ret;
 
//if_base_2:


 
/* Exponentiate the multiplication */
  mov ebx
, eax;
multiply
:
  shr ecx
, 1;    // Shift and check the first bit of the loop counter
  jnc noMulByEBX
  imul ebx
;  // Multiply by the current multiplier
noMulByEBX
:
  imul ebx
, ebx; // Square the multiplier
  test ecx
, ecx; // Test if this was the last itteration
  jnz multiply
;


 
/* Exit the function, returning the value of eax */
powi_exit
:
 
// No stack usage
 
/*------- Epilog ---------*/
 
//mov esp, ebp;
 
//pop ebp;
 
/*------------------------*/
  ret
;
 
}
}

Arthur O'Dwyer

unread,
Jan 22, 2019, 12:52:45 PM1/22/19
to ISO C++ Standard - Future Proposals
Why don't you just post your benchmark code? Why all the graphs and "oops here's the new graph" and so on?

--
You received this message because you are subscribed to a topic in the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this topic, visit https://groups.google.com/a/isocpp.org/d/topic/std-proposals/y9qMRNKrS2E/unsubscribe.
To unsubscribe from this group and all its topics, send an email to std-proposal...@isocpp.org.

To post to this group, send email to std-pr...@isocpp.org.
Reply all
Reply to author
Forward
0 new messages