Cortex M4 Floating Point Size

Tim Wescott

Jul 30, 2013, 5:08:52 PM
I am, apparently, incompetent at reading data sheets.

At least when they get up to several hundred pages.

Do Cortex M4 parts deal with 64-bit floating point in hardware, or just
32-bit?

Thanks...

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Roberto Waltman

Jul 30, 2013, 5:37:06 PM
Tim Wescott wrote:

>Do Cortex M4 parts deal with 64-bit floating point in hardware, or just
>32-bit?


32, I believe.

From the Cortex-M4 Technical Reference Manual
(DDI0439D_cortex_m4_processor_r0p1_trm.pdf):

"2.1 About the functions
Optional Floating Point Unit (FPU) providing:
* 32-bit instructions for single-precision (C float) data-processing
operations.
* Combined Multiply and Accumulate instructions for increased
precision (Fused MAC).
* Hardware support for conversion, addition, subtraction,
multiplication with optional accumulate, division, and square-root.
* Hardware support for denormals and all IEEE rounding modes.
* 32 dedicated 32-bit single precision registers, also addressable as
16 double-word registers.
* Decoupled three stage pipeline."


"7.1 - About the FPU
The Cortex-M4 FPU is an implementation of the single precision variant
of the ARMv7-M Floating-Point Extension (FPv4-SP).
It provides floating-point computation functionality that is compliant
with the ANSI/IEEE Std 754-2008, IEEE Standard for Binary
Floating-Point Arithmetic, referred to as the IEEE 754 standard.
The FPU supports all single-precision data-processing instructions and
data types described in the ARMv7-M Architecture Reference Manual."


And from infocenter.arm.com:
"ARMv7-M Architecture Reference Manual
...
This document is only available ... to registered ARM customers."
--
Roberto Waltman

[ Please reply to the group,
return address is invalid ]

Tim Wescott

Jul 30, 2013, 6:11:24 PM
Crud.

Thanks.

I guess I'll test my algorithm with 32-bit arithmetic and see how it flies,
then.

Anders....@kapsi.spam.stop.fi.invalid

Jul 30, 2013, 9:20:27 PM
Tim Wescott <t...@seemywebsite.really> wrote:
> I am, apparently, incompetent at reading data sheets.
>
> At least when they get up to several hundred pages.
>
> Do Cortex M4 parts deal with 64-bit floating point in hardware, or just
> 32-bit?

In addition to the information Roberto posted, it may be worth keeping
in mind that the parts with the FPU are "Cortex-M4F", and the parts
without are plain "Cortex-M4". At least some of Freescale's Kinetis
parts are of the latter kind.

-a

Jim Stewart

Jul 31, 2013, 1:59:39 PM
Just out of idle curiosity, what kind of an
application might require 64 bit floating point?



Tim Wescott

Jul 31, 2013, 3:43:58 PM
Most control loops that need any precision won't work quite right with
32-bit floating point. You need more than the 25 bits' worth of mantissa
that comes with single-precision floating point (32-bit fixed point often
works quite well, however). If you're just spinning a motor then you can
get by, but if you've got a PID loop with 16-bit or better inputs and a
high sampling-rate-to-bandwidth ratio, then you need integrators with
more than 25 bits of precision.
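
To make that concrete, here is a small standalone illustration (plain C,
with numbers picked only to show the rounding, not taken from any real
loop): a single-precision integrator whose per-sample increment is below
half an ulp at its operating point simply stops integrating, while a
double-precision one does not.

#include <stdio.h>

int main(void)
{
    /* Integrator sitting near 1000.0 with a tiny per-sample increment,
       as in a PID loop with a high sampling-rate-to-bandwidth ratio. */
    float  acc32 = 1000.0f;
    double acc64 = 1000.0;
    const double step = 1.0e-5;  /* error * gain * sample period (made up) */

    for (long i = 0; i < 1000000; i++) {
        acc32 += (float)step;
        acc64 += step;
    }

    /* acc64 ends near 1010.0; acc32 is still 1000.0, because 1e-5 is
       below half an ulp of a float at 1000.0 (about 3e-5) and rounds
       away on every addition. */
    printf("float: %f   double: %f\n", acc32, acc64);
    return 0;
}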

In this case it's a Kalman filter application. It may work with 32 bits,
but I haven't tested it against the data that I have, and it'll be
tight. So either I'll need to rearrange the algorithm (Kalman filters
can use a "square root" algorithm that basically halves the required
precision in the most sensitive areas, in return for a whole bunch of
extra, and extra-weird, math) or re-think my processor choice.

Sigh...

in...@quantum-leaps.com

Jul 31, 2013, 9:35:31 PM
The single-precision FPU of Cortex-M4F needs to be enabled before it is used (the FPU is disabled out of reset). Typically the FPU is enabled in the startup code, but you need to check to be sure.
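
For reference, the enable sequence usually looks something like this (CMSIS register names; this is just the common idiom, so check it against your own startup files rather than taking it from here):

SCB->CPACR |= (3UL << 10*2)    /* CP10: full access */
            | (3UL << 11*2);   /* CP11: full access */
__DSB();
__ISB();   /* make sure the setting takes effect before any FP instruction */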

Also, the FPU in the Cortex-M4F comes with its own register bank, which needs to be saved/restored if the FPU can be used in ISRs or in tasks of a preemptive RTOS. The need for saving/restoring this context is a huge penalty for using the FPU in such circumstances. To reduce this (really unacceptable) overhead, ARM has introduced a feature called "lazy stacking", described in the ARM App Note: http://infocenter.arm.com/help/topic/com.arm.doc.dai0298a/DAI0298A_cortex_m4f_lazy_stacking_and_context_switching.pdf . Lazy stacking of FPU registers is enabled by default.

Miro Samek
state-machine.com/arm

FreeRTOS info

Aug 1, 2013, 2:11:29 PM
We seem to have gone off the topic of the OP, but...


[hardware] lazy stacking breaks down when using a true multi-threaded
OS, requiring the FPU registers to be saved on a task context switch.
The reason is that the lazy stacking logic [obviously] cannot be aware
of the kernel's radical stack-pointer manipulation - it can only be
aware of predictable stack-pointer increments and decrements.

Regards,
Richard.

+ http://www.FreeRTOS.org
Designed for microcontrollers. More than 103000 downloads in 2012.

+ http://www.FreeRTOS.org/plus
Trace, safety certification, FAT FS, TCP/IP, training, and more...


Paul Rubin

Aug 1, 2013, 11:08:05 PM
in...@quantum-leaps.com writes:
> Also, the FPU in Cortex-M4F comes with its own register bank, which
> needs to be saved/restored if the FPU can be used in the ISRs or in
> tasks of a preemptive RTOS.

In Tim's application, I wonder whether the FPU can be exclusively used
by a single task, so nothing else touches the registers. Is that a
reasonable approach?

upsid...@downunder.com

Aug 2, 2013, 12:38:34 AM
Floating-point instructions in ISRs? I have never encountered such
ISRs.

Why not use the same principle for the highest-priority tasks, and
perform the FP-register save/restore only below a certain priority
level? At the lower priorities the full save/restore cost is not that
significant, since those tasks typically execute for quite a long time
at a stretch. Of course, this requires some hooks into the task
scheduler, but it should not be too hard to implement.
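
Roughly like this, say - a hypothetical sketch only, where the type and
function names are invented and the actual save/restore would be a few
lines of VSTM/VLDM assembly:

#include <stdint.h>

#define FP_SAVE_PRIO_LIMIT  4u  /* tasks below this priority get FP save/restore */

typedef struct {
    uint32_t priority;
    uint32_t fp_regs[32];       /* S0..S31 */
    uint32_t fpscr;
    /* ... integer context, stack pointer, etc. ... */
} tcb_t;

/* implemented in assembly (VSTMIA/VLDMIA of S0-S31, plus FPSCR) */
extern void save_fp_context(uint32_t *regs, uint32_t *fpscr);
extern void restore_fp_context(const uint32_t *regs, const uint32_t *fpscr);

void switch_hook_fp(tcb_t *outgoing, tcb_t *incoming)
{
    if (outgoing->priority < FP_SAVE_PRIO_LIMIT) {
        save_fp_context(outgoing->fp_regs, &outgoing->fpscr);
    }
    if (incoming->priority < FP_SAVE_PRIO_LIMIT) {
        restore_fp_context(incoming->fp_regs, &incoming->fpscr);
    }
}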

Paul Rubin

Aug 2, 2013, 3:29:50 AM
upsid...@downunder.com writes:
> Floating point instructions in ISRs ? I have never encountered such
> ISRs.

Well I've heard of applications whose main loop consisted of a halt
instruction repeated endlessly. All the functionality happened at
interrupt level. No idea if they used floating point. :)

FreeRTOS info

Aug 2, 2013, 4:03:47 AM
Where in this thread does it say that the OP is using multitasking or a
task scheduler?

If multithreading is not being used then the Cortex-M4F will handle
everything for you by only saving the floating point registers when it
is absolutely necessary (the save being triggered by a floating point
instruction being executed - if you turn this functionality on).

If multithreading is being used then there are several different ways of
doing it...the best of which can only be determined when you know how
the application is using the FPU (from how many tasks, how often, etc.).

However, as per my previous post, I think this is quite off topic for a
question of "is it 32 bits or 64 bits", so probably not a helpful
discussion for the OP.

hamilton

Aug 2, 2013, 10:18:06 AM
On 8/1/2013 10:38 PM, upsid...@downunder.com wrote:
> On Thu, 01 Aug 2013 20:08:05 -0700, Paul Rubin
> <no.e...@nospam.invalid> wrote:
>
>> in...@quantum-leaps.com writes:
>>> Also, the FPU in Cortex-M4F comes with its own register bank, which
>>> needs to be saved/restored if the FPU can be used in the ISRs or in
>>> tasks of a preemptive RTOS.
>>
>> In Tim's application, I wonder whether the FPU can be exclusively used
>> by a single task, so nothing else touches the registers. Is that a
>> reasonable approach?
>
> Floating point instructions in ISRs ? I have never encountered such
> ISRs.

I did that years ago (1985) on the i286 w/floating point co-processor
(i287).

3-axis vertical mill: at each 8 ms interrupt a new position for one of
the axes was computed.

A simple mutex handled the FPU.

There was no RTOS involved, just a simple round robin of each axis.
All code was written with Turbo C.


Also did the same with a Z80 and an AM9511a co-processor before that.
This one used Microsoft BASIC.

hamilton

Tim Wescott

Aug 2, 2013, 12:44:50 PM
It would. I've thought of that. At the moment the whole application is
small enough that I'm planning on using a home-rolled cooperative
multitasker that dodges the whole context-switch thing at the expense of
weighing down the developer with the need to chop low-priority
computations up into bits that are small enough that they don't bog down
important tasks. So the whole "can't RTOS" thing is moot for me at the
moment.

As far as the "only one task gets the math processor", I've actually
already been there, done that (sorta), with the ADSP 2101 using an RTOS.
The ADSP 2101 has some hardware context associated with its DSP
functionality that is simply not accessible via software (except by
"push" and "pop" into very shallow hardware stacks). It's not even a
matter of "slow" -- it's "you can't, sucker". So if you want to use its
DSP features in an RTOS you're limited to doing it in one task. (Well,
one task and one ISR, thanks to those shallow stacks).

All the "regular processor" stuff can be context-switched just fine,
however. So we used the thing exactly that way: we had one task for the
heavy lifting (running a spinning-wheel gyroscope that had to be in
closed loop) with a bunch of tasks to make it play nice with the balance
of the system. That one magic control task was the _only_ task that got
its fingers onto the MAC and associated instructions; everything else was
kept away.

The board, by the way, worked great.

It would be harder to do this with the M4F. Ironically, it's because the
tools support floating point -- in the case of that 2101, the tools
didn't know what to do with a MAC instruction and never generated one.
So it was easy to tuck all the "DSP" stuff away in assembly language code
that was only called from one C file.

I suppose it might be possible to compile just one or two magic files
using the M4F switch, and compile the rest using the M4 switch (or
whatever the GNU compiler supports -- that's my next task!!!). If so,
and if it works without weird namespace or other collisions, then I'd get
software-synthesized math for most of the thing, and hardware math for
the important stuff.
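
For what it's worth, with the GNU toolchain something along these lines
should work (the file names here are made up; as far as I know "softfp"
keeps the soft-float calling convention, so objects built this way stay
link-compatible with plain soft-float objects, which "hard" does not):

# the one or two "magic" files get real FPU instructions
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -mfloat-abi=softfp -mfpu=fpv4-sp-d16 -c kalman.c

# everything else gets software floating point
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -mfloat-abi=soft -c main.c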

Tim Wescott

Aug 2, 2013, 12:49:49 PM
Totally off topic, yes. But still interesting, and useful in that I may
get 32 bit to work for me, and the selection of a multithreaded OS isn't
entirely off the table. This side discussion has certainly set a pretty
high bar for any multitasking OS that I do select, so it's useful in that
regard.

As I mentioned elsewhere, I'm currently planning on using a cooperative
multitasker because (a) I have it lying around, and (b) I'm the only
author on this software, so I don't have to worry about some dip****
trying to compute pi to 100 decimal places in the lowest-priority task
without yielding.

in...@quantum-leaps.com

Aug 2, 2013, 1:06:47 PM
Tim, Richard: To be strictly on topic, the whole discussion can be closed with just one number: 32, so all of the posts that go beyond this number are OT.

But, I still believe that the mention of the "lazy stacking" feature of the Cortex-M4F FPU _is_ relevant, even in the absence of a preemptive RTOS or ISRs that use the FPU. I think it's good to know about "lazy stacking", because it is enabled by default (when you enable the FPU), so if you don't know about it, it can hit you with unexpectedly high stack usage. "Lazy stacking" always allocates the space for the FPU registers on the stack, but the actual saving/restoring of the registers does not happen until the FPU is used. This also has interesting implications for real-time behavior, because if an ISR uses the FPU, its timing will carry the penalty of stacking the FPU registers.

Miro Samek
state-machine.com

in...@quantum-leaps.com

Aug 2, 2013, 1:27:42 PM
Indeed, a traditional RTOS kernel that can block in multiple places in a task body probably cannot take advantage of the "lazy stacking" feature.

But a simpler class of run-to-completion preemptive kernels _can_ take advantage of the "lazy stacking" and, in fact, this feature integrates very seamlessly with this type of kernels. The use of the Cortex-M4F FPU with a preemptive QK kernel is described in Section 4.2 of the AppNote, available at: http://www.state-machine.com/arm/AN_QP_and_ARM-Cortex-M-IAR.pdf .

Miro Samek
state-machine.com/arm

in...@quantum-leaps.com

Aug 2, 2013, 1:30:38 PM
On Thursday, August 1, 2013 11:08:05 PM UTC-4, Paul Rubin wrote:
> In Tim's application, I wonder whether the FPU can be exclusively used
> by a single task, so nothing else touches the registers. Is that a
> reasonable approach?

Yes, this is the most efficient use of the FPU. In this case, you can disable "lazy stacking" to save stack space. The CMSIS-compliant code for disabling "lazy stacking" is:

FPU->FPCCR &= ~((1U << FPU_FPCCR_ASPEN_Pos) | (1U << FPU_FPCCR_LSPEN_Pos));
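
(That clears both ASPEN and LSPEN, i.e. the processor then performs no automatic FP state preservation at all on exception entry, so it is only safe when no ISR touches the FPU.)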

Miro Samek
state-machine.com/arm

dp

Aug 2, 2013, 6:54:47 PM
On Wednesday, July 31, 2013 8:59:39 PM UTC+3, Jim Stewart wrote:
> ....
>
> Just out of idle curiosity, what kind of an
>
> application might require 64 bit floating point?

Oh, more than those which can get by with 32 bits, for sure.
For example, if you will be DSP-ing (that is, doing lots of MACs),
32-bit FP is just useless; the 24-bit mantissa begins
to lose data before you know it.

32-bit FP can be useful, of course, but not that useful if
the FPU is constrained to 32 bits only. If it has both 32 and 64,
one tends to use both - well, at least I tend to do so.

Dimiter

------------------------------------------------------
Dimiter Popoff Transgalactic Instruments

http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/




dp

Aug 2, 2013, 7:03:41 PM
Or, if an OS is well written, it allows the tasks to
switch FPU saving on/off when needed - like I do under DPS
all the time: need the FPU - call "fpuon$", which returns the former
state of "fpu" for that task. On return from the function, if the
former state was off, switch it off again; otherwise leave it on.
So FPU registers are saved during a task switch only when
necessary.
This is not applicable to IRQ handlers, of course, but I can
think of no IRQ handler I have written in, what, nearly 30 years,
which needs/uses the FPU.

upsid...@downunder.com

Aug 3, 2013, 1:59:40 AM
On Fri, 2 Aug 2013 15:54:47 -0700 (PDT), dp <d...@tgi-sci.com> wrote:

>On Wednesday, July 31, 2013 8:59:39 PM UTC+3, Jim Stewart wrote:
>> ....
>>
>> Just out of idle curiosity, what kind of an
>>
>> application might require 64 bit floating point?
>
>Oh more than those which can use 32 bits for sure.
>For example, if you will be DSP-ing (that is, doing lots of MAC),
>32-bit FP is just useless, the 24 bit mantissa begins
>to lose data before you know.

Are FP-DSP processors really doing hardware MAC instructions internally
using a 32-bit FP representation? I very much doubt that.

When doing MACs in software using integer instructions, why would one
use FP for the MAC processing? Use some big 32/64-bit integer/fixed-point
accumulator and only convert the final result to floating point for
further processing.
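
Something along these lines, for example (just a sketch, not tuned for
any particular core): a Q15 FIR kernel that accumulates exactly in 64
bits and converts only the final sum.

#include <stdint.h>

/* N-tap FIR with Q15 samples and Q15 coefficients. Each 16x16 product
   is exact in 32 bits, the 64-bit accumulator cannot overflow for any
   realistic tap count, and nothing is rounded until the very end. */
float fir_q15(const int16_t *x, const int16_t *h, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t)x[i] * h[i];
    }
    /* Q15 * Q15 = Q30, so scale by 2^30 when converting the result. */
    return (float)acc / 1073741824.0f;
}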

FP add/sub are nasty, since they may require normalization of the
result: first it must be determined how many bits need to be shifted,
and then the mantissa must be shifted that many bits to the left.
Without some hardware support (a find-first-bit-set style HW
instruction), this is quite time consuming and causes variable latency.

Thus, when doing higher-degree polynomial calculations, the
intermediate results should be kept in integer/fixed-point format and
only the final result rounded/truncated to the required representation.

dp

Aug 3, 2013, 10:44:38 AM
On Saturday, August 3, 2013 8:59:40 AM UTC+3, upsid...@downunder.com wrote:
> On Fri, 2 Aug 2013 15:54:47 -0700 (PDT), dp <d...@tgi-sci.com> wrote:
>
> >On Wednesday, July 31, 2013 8:59:39 PM UTC+3, Jim Stewart wrote:
> >> ....
> >>
> >> Just out of idle curiosity, what kind of an
> >>
> >> application might require 64 bit floating point?
> >
>
> >Oh more than those which can use 32 bits for sure.
> >For example, if you will be DSP-ing (that is, doing lots of MAC),
> >32-bit FP is just useless, the 24 bit mantissa begins
> >to lose data before you know.
>
> Are FP-DSP processor really doing hardware MAC instructions internally
> using 32 bit FP representation ? I very much doubt that.
>

Don't know about specialized FP DSPs, never used one. I have been
doing a lot of DSP-ing on a Power (PPC) FPU (mostly on an MPC5200B).
It has 32 64-bit FPU regs and can do a MAC at both 32- and 64-bit
precision: 1 cycle per 32-bit MAC and 2 cycles per 64-bit MAC.
Reaching that is not as straightforward as on a DSP, though; there
are data dependencies to take into account. OTOH, having 32 registers
can save a lot of loads and stores during the filter loop - I managed
the 2 cycles/MAC in a loop with only about a 10% load/store etc.
overhead penalty. Here is how I did it (VPA macros, self-explanatory
enough though):

http://tgi-sci.com/misc/mac8.sa

Without going through that, instead of 5 ns/MAC I was getting 30 ns/MAC
in a plain loop - to be expected really, as the pipeline is 6 stages
IIRC.

> When doing MACs in software using integer instruction, why would one
> use FP for MAC processing ? Use some big 32/64 bit integer/fixed point
> accumulator and only convert the final result to floating point for
> further processing.

Well, of course; the thing with "normal" 32-bit processors is that
they do not have 64-bit accumulators, and 32 bits is nowhere near
sufficient. 64-bit FP, OTOH, is quite handy - especially on the
Power-architecture FPU, where one can load 32-bit FP data
and have it expanded to 64 bits in a single cycle.

> FP add/sub are nasty, since these may require normalization of the
> result, in which first must be determined how many bits needs to be
> shifted and then shift the mantissa that amount of bits to the left.

Last time I had the fun of doing this was on a CPU32 (on the 68340),
quite a while ago :-). But the hardware FPUs on the Power-architecture
processors are really good at this; somehow they manage add/sub/mul
within a single cycle.

Jon Kirwan

Aug 4, 2013, 5:37:36 AM
On Wed, 31 Jul 2013 14:43:58 -0500, Tim Wescott
<t...@seemywebsite.really> wrote:

><snip>
>Most control loops that need any precision won't work quite right with 32
>bit floating point. You need more than the 25 bits worth of mantissa
>that comes with single-precision floating point
><snip>

I know you don't really need the details but:

Most 32-bit FPUs use 8 bits for the signed exponent, one bit
for the sign, and this leaves only 23 bits for the mantissa.
Not 25. (There is also the hidden bit, of course.)

Just being pedantic.

Jon