
How would my SoC calculate an integer square root?


jlad...@itu.edu

Jan 4, 2017, 6:27:16 PM
Hello everyone,

I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52 is an ARM Cortex-M4. My development system is a Linux box running GCC. My embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, as recommended by Nordic.

I have written a fair amount of code for this device, but now I have to explore a bit deeper. I have an algorithm that may require me to calculate a lot of square roots, and quickly. However, the numbers in question are all small integers.

Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get involved in this square root calculation. Optimized algorithms for various math functions exist. A nice discussion of the optimization of the integer square root algorithm, written by Dr. Jack Crenshaw, is found here:

http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots

I am interested in understanding exactly what code will be compiled if I #include <math.h> in my program, and then call sqrt() with an int32 argument. It stands to reason that optimized code, like that shown in the blog above, is what should be used. How can I determine this?

I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for clues. The only math-related file that I can find there is called tgmath.h, and it just #defines aliases which point back to math.h. I am looking for source code.

Thanks for any advice you can share.

Clifford Heath

Jan 4, 2017, 6:46:43 PM
On 05/01/17 10:27, jlad...@itu.edu wrote:
> I am interested in understanding exactly what code will be compiled if I #include <math.h> in my program,
> and then call sqrt() with an int32 argument.

sqrt() takes a double-precision floating point number, so
your int32 will be promoted to double before sqrt is called,
and you'll get a double back.

jlad...@itu.edu

Jan 4, 2017, 7:17:19 PM
On Wednesday, January 4, 2017 at 3:46:43 PM UTC-8, Clifford Heath wrote:
Thanks for your reply, Clifford. That's actually not an encouraging answer though! If that is the case, I think I will write the square root function that was shown in Listing 4 of Jack Crenshaw's blog. The floating-point function will spend time calculating digits of precision that I do not need for my application.

David Brown

Jan 4, 2017, 7:23:47 PM
On 05/01/17 00:27, jlad...@itu.edu wrote:
> Hello everyone,
>
> I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52
> is an ARM Cortex-M4. My development system is a Linux box running
> GCC. My embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1,
> as recommended by Nordic.
>
> I have written a fair amount of code for this device, but now I have
> to explore a bit deeper. I have an algorithm that may require me to
> calculate a lot of square roots, and quickly. However, the numbers
> in question are all small integers.
>
> Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get
> involved in this square root calculation. Optimized algorithms for
> various math functions exist. A nice discussion of the optimization
> of the integer square root algorithm, written by Dr. Jack Crenshaw,
> is found here:
>
> http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
>
> I am interested in understanding exactly what code will be compiled
> if I #include <math.h> in my program, and then call sqrt() with an
> int32 argument. It stands to reason that optimized code, like that
> shown in the blog above, is what should be used. How can I determine
> this?

No, it does not "stand to reason" at all. The C standard library header
<math.h> goes along with an implementation of the C standard library
function for sqrt() - this will do the calculations in double-precision
floating point, as required by the C standards, and giving the answer to
full IEEE accuracy.

There are a lot of ways you can calculate square roots, with various
algorithms, lookup tables, etc., providing different balances between
accuracy, run-time, and code space. There is no one "perfect" answer.

And don't dismiss the floating point unit in the M4 too quickly - for
some sorts of calculations, it may be faster than using integer
arithmetic, even though you are using integer operands at the start.

lasselangwad...@gmail.com

Jan 4, 2017, 7:28:53 PM
The FPU instruction VSQRT.F32 takes 14 cycles. If floats are enough, that is probably hard to beat.

anti...@math.uni.wroc.pl

Jan 4, 2017, 7:54:50 PM
jlad...@itu.edu wrote:
> Hello everyone,
>
> I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52 is an ARM Cortex-M4. My development system is a Linux box running GCC. My embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, as recommended by Nordic.
>
> I have written a fair amount of code for this device, but now I have to explore a bit deeper. I have an algorithm that may require me to calculate a lot of square roots, and quickly. However, the numbers in question are all small integers.
>
> Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get involved in this square root calculation. Optimized algorithms for various math functions exist. A nice discussion of the optimization of the integer square root algorithm, written by Dr. Jack Crenshaw, is found here:
>
> http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots

Actually, this method is good if you do not have hardware multiplication,
or possibly if code size is more important than speed. However,
otherwise method based on table lookup for initial approximation
and Newton iteration is likely to be faster.
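
A sketch of that idea (my construction, untested; the seed below uses
a count-leading-zeros builtin where a small table lookup on the top
bits would give a tighter starting point):

#include <stdint.h>

/* floor(sqrt(x)) by Newton iteration.  The seed is a power of two
   just above sqrt(x); iterate until the estimate stops decreasing. */
static uint32_t isqrt_newton(uint32_t x)
{
    if (x < 2)
        return x;
    unsigned bitlen = 32 - (unsigned)__builtin_clz(x);
    uint32_t x0 = 1u << ((bitlen + 1) / 2);    /* >= sqrt(x) */
    uint32_t x1 = (x0 + x / x0) / 2;
    while (x1 < x0) {
        x0 = x1;
        x1 = (x0 + x / x0) / 2;
    }
    return x0;
}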

> I am interested in understanding exactly what code will be compiled if I #include <math.h> in my program, and then call sqrt() with an int32 argument. It stands to reason that optimized code, like that shown in the blog above, is what should be used. How can I determine this?

Use something like:

/mnt/m1/pom/usr/bin/arm-none-eabi-gcc -Wall -mthumb -mcpu=cortex-m4 -march=armv7e-m -mfloat-abi=hard -mfpu=fpv4-sp-d16 -O3 -S your_file.c

(replace /mnt/m1/pom/usr/bin/arm-none-eabi-gcc with the path to your gcc
and use your compiler options). In the file 'your_file.s' you will find
the resulting assembly code. Using gcc-5.3 I get:

bl __aeabi_i2d
vmov d0, r0, r1
bl sqrt
vmov r0, r1, d0
bl __aeabi_d2iz

AFAICS this is: conversion from integer to double, moving the argument
to a floating point register, calling sqrt from the C library, moving
the result back to integer registers, and conversion from double to integer.

What the library sqrt will do depends on the C library; I do not
use any...

> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for clues. The only math-related file that I can find there is called tgmath.h, and it just #defines aliases which point back to math.h. I am looking for source code.

'sqrt' is in the C library. Some frequently used library functions
are built into gcc, but apparently 'sqrt' is not one of them.
So do not look inside gcc. Instead, look at the C library. It
seems that newlib uses an implementation which converts back to
integers and does most of the calculation on integers. That seems
highly suboptimal.

--
Waldek Hebisch

Reinhardt Behm

Jan 4, 2017, 8:22:38 PM
One of the rules of optimization is: Only do it after you have numbers to
compare. You are jumping to conclusions without any data. First benchmark
and measure, then optimize, measure again and compare.

--
Reinhardt

rickman

Jan 4, 2017, 8:32:29 PM
The first rule of optimizing is, "don't" until you find you clearly need
it.

--

Rick C

Tim Wescott

Jan 4, 2017, 9:56:51 PM
The FPU hardware is almost certainly faster than anything you could do in
software. Unless the largest-value integer you need to take the square
root of is small enough that you can just implement a look-up table.

Benchmark, and see.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

I'm looking for work -- see my website!

David Brown

Jan 5, 2017, 3:38:33 AM
As long as "-ffast-math" and at least -O1 are enabled, as well as
appropriate cpu flags, the "sqrtf" function should generate a VSQRT.F32
automatically. Note that "sqrtf" is not the same as "sqrt".
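
For example, with something like the following (this exact flag set is
an assumption, check your toolchain's documentation):

arm-none-eabi-gcc -mthumb -mcpu=cortex-m4 -mfloat-abi=hard \
    -mfpu=fpv4-sp-d16 -O2 -ffast-math -S test.c

a call to sqrtf() in test.c should show up in test.s as a vsqrt.f32
plus a couple of conversion instructions, rather than a library call.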

David Brown

Jan 5, 2017, 3:43:54 AM
Rules about optimisation are often misunderstood - perhaps because
"optimisation" is such a general term, and is used for everything
between enabling a compiler switch to algorithmic changes to fine-tuning
hand-written assembly routines.

The first rule of optimising is from Knuth - "Premature optimisation is
the root of all evil." The emphasis is on /premature/. That
encompasses not putting effort into optimisation that is not needed -
but equally, it allows for optimisation when you know that it /is/ needed.

Measurement is useful - invaluable, even - when you are comparing
different solutions, finding bottlenecks, or identifying which changes
made which effects. But it is not necessary when you can get a good
enough estimation of the scales involved. If you want to plan a route
across a country, you might want to calculate carefully the times for
driving, trains, or taking a plane - but you can already rule out
walking from the start as a poor "algorithm", and also rule out rockets
and submarines for various reasons. You don't need any sort of
measurements to start with.

The OP here says specifically that he needs to do lots of square roots,
and needs them fast, with small integers. Until I know otherwise, I
assume he has a reasonable idea of what he is doing, and can have a
reasonable estimate that a software-only double-precision IEEE accuracy
full range square root function is going to be too slow. It makes sense
to think about this from the start, not after benchmarking and
measurement - it could easily be a make-or-break issue for the algorithm
and his whole project. I assume that the OP /does/ have useful data,
such as the number of square roots per second and the clock speed of the
device, even if he does not have /all/ the relevant data.


There are also plenty of optimisations that are very "cheap" to do - and
in general, should /always/ be in place. Basic compiler optimisation -
at least "-O1", but usually "-O2" or "-Os", should be standard. It
costs nothing, gives better code (assembly code with "-O1" is usually a
lot easier to understand than code from "-O0"), and enables better
static warning and error checking.

If you are using floating point, "-ffast-math" should also be standard
for most uses - unless you /really/ need full IEEE support.

You don't need to measure anything before using these - just as you
don't need to time your car journeys before going out of first gear.


And there is also the common misconception that "optimised code" is
difficult to understand, difficult to write, and difficult to debug.
Sometimes that is the case - often it is not. The OP needs the square
roots of small integers - the first idea that comes to my mind is a
lookup table, which is pretty simple.


Dimiter_Popoff

Jan 5, 2017, 5:18:53 AM
Here are my two replies to a similar query some 9 years ago:

> https://groups.google.com/d/msg/comp.arch.embedded/4osSFiWt2WI/jz_myBDmvvUJ

> https://groups.google.com/d/msg/comp.arch.embedded/4osSFiWt2WI/CzUxHUdKYA0J

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/

jlad...@itu.edu

Jan 5, 2017, 6:13:48 PM
On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8, lasselangwad...@gmail.com wrote:

> The FPU instruction VSQRT.F32 takes 14 cycles. If floats are enough, that is probably hard to beat.

My information about the performance of FPU's is clearly a few years behind the times. That's actually a very impressive result, and you're right, it would be quite hard to beat.

jlad...@itu.edu

Jan 5, 2017, 6:23:01 PM
Thanks Waldek, I haven't looked at the .s files and that's definitely helpful. If I compile using the "fast math" flag that was suggested by David Brown, that "sqrt" MIGHT compile to a single FPU instruction, VSQRT.F32, which apparently only requires 14 clock cycles. While the FPU result may not achieve IEEE precision, I need speed.

jlad...@itu.edu

Jan 5, 2017, 6:38:15 PM
On Thursday, January 5, 2017 at 12:43:54 AM UTC-8, David Brown wrote:
> On 05/01/17 02:22, Reinhardt Behm wrote:
>
> The OP here says specifically that he needs to do lots of square roots,
> and needs them fast, with small integers. Until I know otherwise, I
> assume he has a reasonable idea of what he is doing, and can have a
> reasonable estimate that a software-only double-precision IEEE accuracy
> full range square root function is going to be too slow.

My nRF52 has a 32 MHz CPU clock. The system will need to calculate about 1,000 square roots per second. The largest number I expect to see is about 10,000. If this task was all that concerned me, I wouldn't sweat. However: my system also has several other real-time tasks to perform. It's also battery-powered. I am thinking about optimization because I need to conserve both time and power.

> There are also plenty of optimisations that are very "cheap" to do - and
> in general, should /always/ be in place. Basic compiler optimisation -
> at least "-O1", but usually "-O2" or "-Os", should be standard. It
> costs nothing, gives better code (assembly code with "-O1" is usually a
> lot easier to understand than code from "-O0"), and enables better
> static warning and error checking.
>
> If you are using floating point, "-ffast-math" should also be standard
> for most uses - unless you /really/ need full IEEE support.

That's helpful. Now you have me wondering whether there are compiler flags for optimizing integer math as well.

> And there is also the common misconception that "optimised code" is
> difficult to understand, difficult to write, and difficult to debug.
> Sometimes that is the case - often it is not. The OP needs the square
> roots of small integers - the first idea that comes to my mind is a
> lookup table, which is pretty simple.

As you can see, a 100-number lookup table should address my needs. A bit long, perhaps. A binary search through the table could require 7 iterations.
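
For what it's worth, here is a sketch of that search, computing the
squares on the fly rather than storing them (untested):

#include <stdint.h>

/* floor(sqrt(x)) for 0 <= x <= 10000 by binary chop, ~7 rounds. */
static int32_t isqrt_bsearch(int32_t x)
{
    int32_t lo = 0, hi = 100;            /* sqrt(10000) == 100 */
    while (lo < hi) {                    /* invariant: lo*lo <= x */
        int32_t mid = (lo + hi + 1) / 2;
        if (mid * mid <= x)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}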

Tim Wescott

Jan 5, 2017, 6:44:44 PM
Keep in mind, as David pointed out, that you want to make sure it's a
single-precision floating point: the FPU on the M4 doesn't do double-
precision math, so that would still go the horridly slow software route
(although a smart, FPU-aware algorithm might get an initial guess from
the FPU).

Robert Wessel

Jan 5, 2017, 8:54:23 PM
The basic binary digit-by-digit algorithm for numbers of that scale
ought to be at least a bit faster than a binary search of a table that
size, and be less code (and no data).

https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Binary_numeral_system_.28base_2.29

The inner loop of that is about a dozen instructions. And if you're
really limited to 10,000, you can initialize "bit" to 1<<12. If you
have a count-leading-zeros instruction available, you can improve the
initial computation of "bit" a fair, *ahem*, bit.

Obviously you can do that faster, but almost always with considerably
more complexity and/or memory usage.
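
In rough C, following the Wikipedia page above (an untested sketch;
the 1<<12 start assumes inputs up to 10,000):

#include <stdint.h>

/* Binary digit-by-digit square root; returns floor(sqrt(x)). */
static uint32_t isqrt_bits(uint32_t x)
{
    uint32_t bit = 1u << 12;      /* highest power of 4 <= 10000 */
    uint32_t res = 0;

    while (bit > x)               /* skip leading zero bit-pairs */
        bit >>= 2;

    while (bit != 0) {
        if (x >= res + bit) {
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;
}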

upsid...@downunder.com

Jan 6, 2017, 12:54:22 AM
On Thu, 05 Jan 2017 17:44:36 -0600, Tim Wescott
<seemyw...@myfooter.really> wrote:

>On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>
>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>> lasselangwad...@gmail.com wrote:
>>
>>> The FPU instruction VSQRT.F32 takes 14 cycles. If floats are enough,
>>> that is probably hard to beat.
>>
>> My information about the performance of FPU's is clearly a few years
>> behind the times. That's actually a very impressive result, and you're
>> right, it would be quite hard to beat.
>
>Keep in mind, as David pointed out, that you want to make sure it's a
>single-precision floating point: the FPU on the M4 doesn't do double-
>precision math, so that would still go the horridly slow software route
>(although a smart, FPU-aware algorithm might get an initial guess from
>the FPU).

If I understand correctly, the OP wants a square root algorithm
accepting an integer and returning an integer (not a float).

That algorithm will give precise values for sqrt of 1, 4, 9, 16, 25
and so on. The sqrt for other values contains a more or less large
error.

The discussion about single vs double precision floats is not relevant
in this environment.

The OP said that the argument is a small integer and as long as the
argument is 6 decimal digits or less, it can be represented exactly
in single precision float. Thus the result will have up to 4 decimal
digits in integer part and a few decimal digits in the fractional
part. Just convert the single precision value to integer to throw away
the fractional part, if an integer result is truly required.

It should be noted that conversion from float to integer can take a
long time on some architectures (if no barrel shifter available), so
it may be a good idea to keep integer and float calculations separate
and avoid mixing them in the same calculations.

David Brown

Jan 6, 2017, 2:46:15 AM
On 06/01/17 00:44, Tim Wescott wrote:
> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>
>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>> lasselangwad...@gmail.com wrote:
>>
>>> The FPU instruction VSQRT.F32 takes 14 cycles. If floats are enough,
>>> that is probably hard to beat.
>>
>> My information about the performance of FPU's is clearly a few years
>> behind the times. That's actually a very impressive result, and you're
>> right, it would be quite hard to beat.
>

Not all FPU's give accurate square root results from their hardware
instructions. The ARM documentation does not give details, so (unless
someone else knows better) we can assume that it /does/ give accurate
values, but not necessarily following IEEE specifications precisely. I
have seen other FPU's equivalent to the M4F's where the square root
instruction is explicitly an approximation. It would be good enough for
many uses, but for a highly accurate result you would use it as the
estimate and then run a couple of rounds of Newton-Raphson.

> Keep in mind, as David pointed out, that you want to make sure it's a
> single-precision floating point: the FPU on the M4 doesn't do double-
> precision math, so that would still go the horridly slow software route
> (although a smart, FPU-aware algorithm might get an initial guess from
> the FPU).
>

A smart FPU-aware algorithm /might/ do that, but it is very unlikely in
practice. It would not actually save many cycles - you can already get
a good initial estimate by simply taking the exponent from the floating
point number and halving it. By the time you have handled the
conversions between double and single precision and back, and
considering the cost of the rest of the double-precision software
floating point operations needed in the square root, I can't see you
saving more than a couple of percent off the total time. It is not
worth building a special version of the library just for this one case -
other double-precision maths cannot, in general, get any benefit from
using the single-precision hardware.

David Brown

Jan 6, 2017, 3:12:48 AM
On 06/01/17 06:54, upsid...@downunder.com wrote:
> On Thu, 05 Jan 2017 17:44:36 -0600, Tim Wescott
> <seemyw...@myfooter.really> wrote:
>
>> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>>
>>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>>> lasselangwad...@gmail.com wrote:
>>>
>>>> The FPU instruction VSQRT.F32 takes 14 cycles. If floats are enough,
>>>> that is probably hard to beat.
>>>
>>> My information about the performance of FPU's is clearly a few years
>>> behind the times. That's actually a very impressive result, and you're
>>> right, it would be quite hard to beat.
>>
>> Keep in mind, as David pointed out, that you want to make sure it's a
>> single-precision floating point: the FPU on the M4 doesn't do double-
>> precision math, so that would still go the horridly slow software route
>> (although a smart, FPU-aware algorithm might get an initial guess from
>> the FPU).
>
> If I understand correctly, the OP wants a square root algorithm
> accepting an integer and returning an integer (not a float).
>

Yes.

> That algorithm will give precise values for sqrt of 1, 4, 9, 16, 25
> and so on. The sqrt for other values contains a more or less large
> error.
>

Yes. The OP has not said if he wants round-down, round-to-nearest, or
does not care.

> The discussion about single vs double precision floats is not relevant
> in this environment.

It is very relevant. An easy way to implement his requirements
(ignoring range checks) is:

int sqrt_int(int x) {
    return sqrtf(x);
}

But that requires the compiler flags that have been discussed, and
"sqrtf" instead of "sqrt", in order to compile to a VSQRT instruction
and a couple of conversion instructions.

>
> The OP said that the argument is a small integer and as long as the
> argument is 6 decimal digits or less, it can be represented exactly
> in single precision float. Thus the result will have up to 4 decimal
> digits in integer part and a few decimal digits in the fractional
> part. Just convert the single precision value to integer to throw away
> the fractional part, if an integer result is truly required.

It should work fine - as long as doubles are avoided (to avoid the
slowdown).

>
> It should be noted that conversion from float to integer can take a
> long time on some architectures (if no barrel shifter available), so
> it may be a good idea to keep integer and float calculations separate
> and avoid mixing them in the same calculations.
>

The M4F has float to integer conversion instructions (as do all cpus
with floating point hardware that I have ever seen). Again, flags like
-ffast-math and -O1 are required to be sure that these are used without
library calls to handle the possibility of "weird" floats.

Your point is important if doubles are involved, or on an architecture
without hardware floating point. Conversions back and forth between
integers and floating point formats can be relevant for the cost of the
operations. However, they are not /that/ expensive - they are typically
not worse than general arithmetic like addition and multiplication, and
cheaper than division. They may also give the compiler more
optimisation opportunities than floating point values (especially if you
have forgotten -ffast-math).



(And the ARM architecture has a barrel shifter.)

David Brown

Jan 6, 2017, 3:25:36 AM
Integer maths is optimised with "-O1" or higher. It does not need
separate flags. (There are other flags that can affect the way integer
overflow is handled, but the default is to follow the C standards, and
that is usually what gives the fastest results anyway.)

>
>> And there is also the common misconception that "optimised code"
>> is difficult to understand, difficult to write, and difficult to
>> debug. Sometimes that is the case - often it is not. The OP needs
>> the square roots of small integers - the first idea that comes to
>> my mind is a lookup table, which is pretty simple.
>
> As you can see, a 100-number lookup table should address my needs. A
> bit long, perhaps. A binary search through the table could require 7
> iterations.
>

A 10000 number lookup table will give you the result in a single
operation. Do you have 20K of flash (or ram) space to spare? I don't
know your particular chip, but many Cortex devices actually have quite a
lot of flash space, and while 20K for a square root table might seem
excessive, it could work. And it would be the fastest possible method.

You can also do smart things with a smaller table - say, a 157 entry
table for the square roots of (x >> 6), shifted left by 3 bits. Your
algorithm is then to take x, shift right 6 bits, look it up in the table
as y. The integer square root of x is then between y and y + 7. A
simple algorithm to check these would take 8 checks, each of which is
simpler than binary search rounds. (And the table takes less flash.) A
smarter algorithm could probably jump to the correct y faster, based on
(x >> 6) and (x & 63).
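
A sketch of that scheme (untested; the table is filled at startup here
just to keep the snippet self-contained, where a real build would use
a const array in flash):

#include <stdint.h>

static uint8_t tab[157];               /* tab[k] = isqrt(k) << 3 */

static void isqrt_tab_init(void)
{
    uint32_t r = 0;
    for (uint32_t k = 0; k < 157; k++) {
        while ((r + 1) * (r + 1) <= k)
            r++;                       /* r = floor(sqrt(k)) */
        tab[k] = (uint8_t)(r << 3);
    }
}

static uint32_t isqrt_157(uint32_t x)  /* x <= 10000 */
{
    uint32_t y = tab[x >> 6];          /* isqrt(x) is in [y, y+7] */
    while ((y + 1) * (y + 1) <= x)     /* at most 7 bumps */
        y++;
    return y;
}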

That would still be slower than using VSQRT, but is a possibility if you
want to avoid floating point.

upsid...@downunder.com

Jan 6, 2017, 5:21:20 AM
On Fri, 06 Jan 2017 09:12:44 +0100, David Brown
<david...@hesbynett.no> wrote:

>
>> The discussion about single vs double precision floats is not relevant
>> in this environment.
>
>It is very relevant. An easy way to implement his requirements
>(ignoring range checks) is:
>
>int sqrt_int(int x) {
> return sqrtf(x);
>}
>
>But that requires the compiler flags that have been discussed, and
>"sqrtf" instead of "sqrt", in order to compile to a VSQRT instruction
>and a couple of conversion instructions.

The time used to discuss what compiler flags to use could have been
better used coding such a simple function in assembler :-).

upsid...@downunder.com

Jan 6, 2017, 5:50:41 AM
On Fri, 06 Jan 2017 08:46:11 +0100, David Brown
<david...@hesbynett.no> wrote:

>On 06/01/17 00:44, Tim Wescott wrote:
>> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>>
>>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>>> lasselangwad...@gmail.com wrote:
>>>
>>>> The FPU instruction VSQRT.F32 takes 14 cycles. If floats are enough,
>>>> that is probably hard to beat.

If those cycles are the same as the actual clock oscillator cycles,
the 32 MHz oscillator would allow more than 2 million VSQRT
instructions each second.

The OP estimated a thousand square roots each second, which would
consume less than 0.1 % of the clock cycles. No big deal and not a
reason for optimization.

>>> My information about the performance of FPU's is clearly a few years
>>> behind the times. That's actually a very impressive result, and you're
>>> right, it would be quite hard to beat.
>>
>
>Not all FPU's give accurate square root results from their hardware
>instructions. The ARM documentation does not give details, so (unless
>someone else knows better) we can assume that it /does/ give accurate
>values, but not necessarily following IEEE specifications precisely.

Not following the IEEE standard usually means it doesn't handle (extremely
small) denormalized values, might not generate proper +/- infinity
values (for huge values), or might not properly handle NaN (Not a Number)
cases.

> I
>have seen other FPU's equivalent to the M4F's where the square root
>instruction is explicitly an approximation. It would be good enough for
>many uses, but for a highly accurate result you would use it as the
>estimate and then run a couple of rounds of Newton-Raphson.

Floating point values are nearly always approximations anyway. Only
some special cases (small integers) are exact, but you can't represent
exactly values like 0.1 or 0.01.

To calculate the sqrt on hardware with basic floating point
instructions but no hardware SQRT instruction, I have used the
following approach:
* normalize the value to between 1..4 so that the exponent is even
* divide the exponent by two to get the result exponent
* use a polynomial in the mantissa bits to calculate the result
mantissa
* 4th order polynomials give sufficient results for float and 8th
order for double.
* I have seen implementations (such as the VAX) with 3rd degree for
single and 6th degree for double.
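
As a rough illustration of that normalize-and-approximate structure
(not the exact polynomial scheme - this sketch substitutes a linear
first guess plus two Newton steps, with the classic textbook
coefficients; untested):

#include <math.h>

float sqrt_approx(float x)
{
    if (x <= 0.0f)
        return 0.0f;                   /* domain handling kept trivial */

    int e;
    float m = frexpf(x, &e);           /* x = m * 2^e, m in [0.5, 1) */
    if (e & 1) {                       /* make the exponent even */
        m *= 2.0f;
        e--;
    }
    /* m in [0.5, 2); sqrt(x) = sqrt(m) * 2^(e/2) */
    float s = 0.41731f + 0.59016f * m; /* linear first guess */
    s = 0.5f * (s + m / s);            /* Newton step */
    s = 0.5f * (s + m / s);            /* and one more */
    return ldexpf(s, e / 2);
}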


David Brown

Jan 6, 2017, 10:50:07 AM
No, it could not - it is better to learn how to use your tools (such as
the compiler) properly, so that you can re-use the information to
improve more code later, rather than to learn a single inconvenient
work-around.

Understanding the point of -ffast-math and optimisation flags, and about
float vs. double, will improve /all/ floating point code written by
/everyone/ who reads this thread and learned from it. A simple inline
assembly function (which is, in fact, slightly fiddly if you want to get
the best from it) would merely help one person in one single situation.

Give a man a fish, and all that.

David Brown

Jan 6, 2017, 10:55:21 AM
On 06/01/17 11:50, upsid...@downunder.com wrote:
> On Fri, 06 Jan 2017 08:46:11 +0100, David Brown
> <david...@hesbynett.no> wrote:
>
>> On 06/01/17 00:44, Tim Wescott wrote:
>>> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>>>
>>>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>>>> lasselangwad...@gmail.com wrote:
>>>>
>>>>> The FPU instruction VSQRT.F32 takes 14 cycles. If floats are enough,
>>>>> that is probably hard to beat.
>
> If those cycles are the same as the actual clock oscillator cycles,
> the 32 MHz oscillator would allow more than 2 million VSQRT
> instructions each second.
>
> The OP estimated a thousand square roots each second, which would
> consume less than 0.1 % of the clock cycles. No big deal and not a
> reason for optimization.

He also made it clear that he wants to do other things during the time
than just square roots, as well as wanting to minimise the time in order
to reduce power consumption.

>
>>>> My information about the performance of FPU's is clearly a few years
>>>> behind the times. That's actually a very impressive result, and you're
>>>> right, it would be quite hard to beat.
>>>
>>
>> Not all FPU's give accurate square root results from their hardware
>> instructions. The ARM documentation does not give details, so (unless
>> someone else knows better) we can assume that it /does/ give accurate
>> values, but not necessarily following IEEE specifications precisely.
>
> Not following the IEEE standard usually means it doesn't handle (extremely
> small) denormalized values, might not generate proper +/- infinity
> values (for huge values), or might not properly handle NaN (Not a Number)
> cases.

Mostly true. It is also not uncommon for rounding to be done in a
different way, or to lose a LSB here or there compared to IEEE
references. Usually these details don't matter. But if you compile
without -ffast-math (or the relevant individual flags), the compiler
will do its best to get IEEE conformance /exactly/ right - even if that
means a library subroutine to handle the final details.

>
>> I
>> have seen other FPU's equivalent to the M4F's where the square root
>> instruction is explicitly an approximation. It would be good enough for
>> many uses, but for a highly accurate result you would use it as the
>> estimate and then run a couple of rounds of Newton-Raphson.
>
> Floating point values are nearly always approximations anyway. Only
> some special cases (small integers) are exact, but you can't represent
> exactly values like 0.1 or 0.01.
>
> To calculate the sqrt on hardware with basic floating point
> instructions but no hardware SQRT instruction, I have used the
> following approach:
> * normalize the value to between 1..4 so that the exponent is even
> * divide the exponent by two to get the result exponent
> * use a polynomial in the mantissa bits to calculate the result
> mantissa
> * 4th order polynomials give sufficient results for float and 8th
> order for double.
> * I have seen implementations (such as the VAX) with 3rd degree for
> single and 6th degree for double.
>

Polynomials like this are sometimes used. Usually the "best" choice of
algorithm here depends on the relative speeds of division and
multiplication. Early floating point units (such as the VAX) were
notoriously slow at division, so an algorithm that uses polynomials
(addition and multiplication) is preferred.


Dimiter_Popoff

Jan 6, 2017, 11:29:13 AM
On 06.1.2017 г. 01:38, jlad...@itu.edu wrote:
> On Thursday, January 5, 2017 at 12:43:54 AM UTC-8, David Brown wrote:
>> On 05/01/17 02:22, Reinhardt Behm wrote:
>>
>> The OP here says specifically that he needs to do lots of square roots,
>> and needs them fast, with small integers. Until I know otherwise, I
>> assume he has a reasonable idea of what he is doing, and can have a
>> reasonable estimate that a software-only double-precision IEEE accuracy
>> full range square root function is going to be too slow.
>
>My nRF52 has a 32 MHz CPU clock. The system will need to calculate about
>1,000 square roots per second. The largest number I expect to see is
>about 10,000. If this task was all that concerned me, I wouldn't sweat.
>However: my system also has several other real-time tasks to perform.
>It's also battery-powered. I am thinking about optimization because I
>need to conserve both time and power.
>

So calculating the square root of a 14 bit number is fine for you; you
will get a 7 bit result.
Using my algorithm this will cost you 7 multiply instructions, 7
compares, 7 bset/bclr or some equivalent (e.g. a shift + or sort of
thing), 7 conditional branches and 7 decrement-and-branch opcodes.
That makes 35; say I forgot something, so call it 70 cycles per square
root.
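
In outline, one round per result bit looks something like this (a
sketch rather than the production code; untested):

#include <stdint.h>

/* floor(sqrt(x)) for x < 16384: 7 rounds, each one multiply, one
   compare and one conditional bit-set. */
static uint32_t isqrt14(uint32_t x)
{
    uint32_t r = 0;
    for (int b = 6; b >= 0; b--) {
        uint32_t t = r | (1u << b);    /* try setting this result bit */
        if (t * t <= x)
            r = t;
    }
    return r;
}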

70000 cycles per second out of a total of 32M cycles per second
makes.... 0.22% of the computing power you have.

Does not sound like something worth all that head banging.

upsid...@downunder.com

Jan 6, 2017, 12:08:21 PM
On Fri, 6 Jan 2017 16:50:02 +0100, David Brown
I did not mean in-line assembly in particular, but also separately
compiled assembler code. Of course, you need to know the parameter
passing mechanism for separately compiled/assembled units and, for
functions, how results are returned. In many calling conventions,
parameters are passed on the stack; some push the first parameter
first, while others push it last.

>Give a man a fish, and all that.

Understanding the parameter passing mechanism in general is a useful
skill.


upsid...@downunder.com

Jan 6, 2017, 12:13:24 PM
On Fri, 6 Jan 2017 16:55:18 +0100, David Brown
<david...@hesbynett.no> wrote:

>On 06/01/17 11:50, upsid...@downunder.com wrote:
>> On Fri, 06 Jan 2017 08:46:11 +0100, David Brown
>> <david...@hesbynett.no> wrote:
>>
>>> On 06/01/17 00:44, Tim Wescott wrote:
>>>> On Thu, 05 Jan 2017 15:13:42 -0800, jladasky wrote:
>>>>
>>>>> On Wednesday, January 4, 2017 at 4:28:53 PM UTC-8,
>>>>> lasselangwad...@gmail.com wrote:
>>>>>
>>>>>> The FPU instruction VSQRT.F32 takes 14 cycles. If floats are enough,
>>>>>> that is probably hard to beat.
>>
>> If those cycles are the same as the actual clock oscillator cycles,
>> the 32 MHz oscillator would allow more than 2 million VSQRT
>> instructions each second.
>>
>> The OP estimated a thousand square roots each second, which would
>> consume less than 0.1 % of the clock cycles. No big deal and not a
>> reason for optimization.
>
>He also made it clear that he wants to do other things during the time
>than just square roots, as well as wanting to minimise the time in order
>to reduce power consumption.

He just has 99.9 % of the total CPU cycles available for the "other"
tasks.


John Devereux

Jan 6, 2017, 1:33:36 PM
It looks like you did not read to the end of his sentence. It will
consume 100% of the cycles if the CPU only wakes up to do that one thing.

Modern MCUs can achieve very low power operation by sleeping most of the
time. Power consumption can then be nearly proportional to the wake duty
cycle. So going from "0.1%" to "0.2%" could be the difference between a
4 year and a 2 year life when powered by a coin cell, say.


--

John Devereux

Tim Wescott

Jan 6, 2017, 4:40:19 PM
On Wed, 04 Jan 2017 15:27:10 -0800, jladasky wrote:

> Hello everyone,
>
> I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52 is
> an ARM Cortex-M4. My development system is a Linux box running GCC. My
> embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, as
> recommended by Nordic.
>
> I have written a fair amount of code for this device, but now I have to
> explore a bit deeper. I have an algorithm that may require me to
> calculate a lot of square roots, and quickly. However, the numbers in
> question are all small integers.
>
> Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get
> involved in this square root calculation. Optimized algorithms for
> various math functions exist. A nice discussion of the optimization of
> the integer square root algorithm, written by Dr. Jack Crenshaw, is
> found here:
>
> http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
>
> I am interested in understanding exactly what code will be compiled if I
> #include <math.h> in my program, and then call sqrt() with an int32
> argument. It stands to reason that optimized code, like that shown in
> the blog above, is what should be used. How can I determine this?
>
> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for clues.
> The only math-related file that I can find there is called tgmath.h,
> and it just #defines aliases which point back to math.h. I am looking
> for source code.
>
> Thanks for any advice you can share.

I notice, in our arguing about what you said, that some folks are
thinking you want to get an integer _out_ of your algorithm -- is this
true?

I'm not sure if it has much of a bearing on things, because I suspect
that using the processor's floating point math is going to be the
quickest, even with conversion to and from float. But I'm curious.

jlad...@itu.edu

Jan 6, 2017, 7:52:29 PM
On Friday, January 6, 2017 at 1:40:19 PM UTC-8, Tim Wescott wrote:

> I notice, in our arguing about what you said, that some folks are
> thinking you want to get an integer _out_ of your algorithm -- is this
> true?

That's correct. Rounding or even truncating is OK. These square roots are being fed to an algorithm which needs the rank order of the results to be preserved, but it's OK if the exact distribution is a little distorted.

> I'm not sure if it has much of a bearing on things, because I suspect
> that using the processor's floating point math is going to be the
> quickest, even with conversion to and from float. But I'm curious.

After reading everyone's input, I'm starting to think that the code which would be produced by -ffast-math will be a pretty good start.

I am giving serious consideration to David Brown's suggestion of a 10,000 entry lookup table. I have 512K of flash, and I only need to store int16's.

Les Cargill

Jan 6, 2017, 9:28:22 PM
jlad...@itu.edu wrote:
> Hello everyone,
>
> I am working with a Nordic nRF52 SoC. The CPU subsystem of the nRF52 is an ARM Cortex-M4. My development system is a Linux box running GCC. My embedded systems toolchain is gcc-arm-none-eabi-4_9-2015q1, as recommended by Nordic.
>
> I have written a fair amount of code for this device, but now I have to explore a bit deeper. I have an algorithm that may require me to calculate a lot of square roots, and quickly. However, the numbers in question are all small integers.
>
> Even though the ARM Cortex-M4 has an FPU, it shouldn't need to get involved in this square root calculation. Optimized algorithms for various math functions exist. A nice discussion of the optimization of the integer square root algorithm, written by Dr. Jack Crenshaw, is found here:
>
> http://www.embedded.com/electronics-blogs/programmer-s-toolbox/4219659/Integer-Square-Roots
>
> I am interested in understanding exactly what code will be compiled if I #include <math.h> in my program, and then call sqrt() with an int32 argument. It stands to reason that optimized code, like that shown in the blog above, is what should be used. How can I determine this?
>
> I am poking around in the gcc-arm-none-eabi-4_9-2015q1 folder for clues. The only math-related file that I can find there is called tgmath.h, and it just #defines aliases which point back to math.h. I am looking for source code.
>
> Thanks for any advice you can share.
>


I'd measure the floating point version before piling off into a
bespoke (custom) square root routine.

--
Les Cargill

Wouter van Ooijen

Jan 7, 2017, 3:16:29 AM
Op 07-Jan-17 om 01:52 schreef jlad...@itu.edu:
The sqrt of 10000 is only 100, so you'd need to store only unsigned chars.

Wouter "Objects? No Thanks!" van Ooijen

Dimiter_Popoff

Jan 7, 2017, 5:29:27 AM
On 07.1.2017 г. 02:52, jlad...@itu.edu wrote:
> .....
>
> After reading everyone's input, I'm starting to think that the code
> which would be produced by -ffast-math will be a pretty good start.
>
> I am giving serious consideration to David Brown's suggestion of a
> 10,000 entry lookup table. I have 512K of flash, and I only need to
> store int16's.
>

At 512k flash this is a no brainer of course.

Dimiter


Mike Perkins

Jan 7, 2017, 5:47:32 AM
ARM documentation suggests this takes 14/15 clock cycles. There is then
the conversion from integer to floating point.

--
Mike Perkins
Video Solutions Ltd
www.videosolutions.ltd.uk

Mike Perkins

Jan 7, 2017, 5:58:01 AM
For me the lookup table is the best and simplest, especially on a device
with 512kB of ROM. Is your program that big? Given the answer is always
going to be of byte length, for the square root of a number up to 10,000
you would only need 10kB of table.
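
A sketch of the full table (built at startup here so the snippet is
self-contained; in practice you would generate it offline and put it
in ROM as a const array):

#include <stdint.h>

static uint8_t sqrt_tab[10001];

static void sqrt_tab_init(void)
{
    uint32_t r = 0;
    for (uint32_t n = 0; n <= 10000; n++) {
        while ((r + 1) * (r + 1) <= n)
            r++;
        sqrt_tab[n] = (uint8_t)r;      /* floor(sqrt(n)), max 100 */
    }
}

/* A lookup is then a single indexed load: sqrt_tab[x]. */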

An alternative is a variation of:

https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Decimal_.28base_10.29
except you would work in base 256 and have a lookup table of just 256
bytes to find the sqrt of 0 to 255.

Robert Wessel

Jan 8, 2017, 12:03:24 AM
On Fri, 06 Jan 2017 09:25:32 +0100, David Brown
<david...@hesbynett.no> wrote:

>A 10000 number lookup table will give you the result in a single
>operation. Do you have 20K of flash (or ram) space to spare? I don't
>know your particular chip, but many Cortex devices actually have quite a
>lot of flash space, and while 20K for a square root table might seem
>excessive, it could work. And it would be the fastest possible method.


Of course he'd only need a 10000 byte table; all the results of
isqrt(n) for n<10000 fit in a byte.

David Brown

Jan 8, 2017, 10:50:00 AM
Separately compiled assembly usually involves quite a bit more messing
around than inline assembly. You have to handle a variety of assembler
directives to get it to work. You need to understand more, in order to
follow the calling convention properly (including preserving registers
if and only if necessary). And the result is a lot less efficient for
short sections of code, because of the function call overhead. Inline
assembly is shorter, simpler, and can work together with the optimiser -
though you still need to know what you are doing.

> In many calling conventions, parameters are passed on the stack;
> some push the first parameter first, while others push it last.
>

The calling convention varies a lot between platforms. On targets with
plenty of registers, parameters are mostly passed in registers, with the
stack used when there are too many parameters or for things like structs.

>> Give a man a fish, and all that.
>
> Understanding the parameter passing mechanism in general is a useful
> skill.
>

It /can/ be a useful skill, yes. But there is no such thing as
"parameter passing mechanism in general" as it varies between platforms.
And the skill is not nearly as helpful as the skill of understanding
how to use your compiler. Learning some of the basics of how compiler
flags work and how compiler optimisation works is /vastly/ more useful
than trying to write your own assembly functions. It is also much
easier, much safer (if you get it wrong you are likely to get slower
code, but still equally correct code), much more portable, and leads to
writing clearer and more maintainable code instead of cryptic assembly.

(I strongly advocate learning to /read/ assembly on the platforms you
work with, in order to understand the results of your compilation. But
for most developers, if you are writing more assembly than a very
occasional single-line inline assembly function, you should probably
re-consider your development strategy.)


David Brown

Jan 8, 2017, 11:04:50 AM
True.

David Brown

Jan 8, 2017, 11:09:24 AM
It is /almost/ a no-brainer. There might be other considerations, such
as a slow over-the-air firmware update.

And even if this sounds like a simple answer, it's good that he is
giving the different ideas "serious consideration". Some of the ideas
and points raised in this thread could be useful learning experience for
the future.

