
Basic 64-bit Forth considerations


minf...@arcor.de

Mar 15, 2018, 12:11:30 PM
The standard intentionally does not specify cell sizes and address sizes (cell/s representing a pointer to memory).

In a 32-bit Forth system everything is simple: cells and addresses are 32 bits wide.

For a 64-bit Forth there are two obvious choices:
A) cells 32-bit and addresses 64-bit
B) cells 64-bit and addresses 64-bit

A) is practical enough, matches most CPU instruction sets, Forth double numbers are still there for 64-bit math, and it means a rather simple extension starting from an existing 32-bit system.

B) seems more linear at first glance, but there are tons of practical implementation challenges like limited support by simple CPU instructions, 128-bit integer math for doubles (who needs that monster?), 128-bit division/multiplication in number conversion words like # #S >NUMBER, etc.

I am inclined toward A), but then I am no real systems developer and may have overlooked important things. What were your 64-bit design considerations?

Anton Ertl

Mar 15, 2018, 12:43:14 PM
minf...@arcor.de writes:
>The standard intentionally does not specify cell sizes and address sizes
>(cell/s representing a pointer to memory).

It does specify that an address fits into a cell. Address arithmetic
is performed with the same operations as integer arithmetic, etc.

>For a 64-bit Forth there are two obvious choices:
>A) cells 32-bit and addresses 64-bit
>B) cells 64-bit and addresses 64-bit
>
>A) is practical enough, matches most CPU instruction sets, Forth double numbers
>are still there for 64-bit math, and it means a rather simple extension
>starting from an existing 32-bit system.

I really have a hard time imagining how that should work. Do you mean
that addresses use two cells? AFAIK even in the bad old days of the
8086 nobody went there (at least for general addresses; F-PC used two
cells for return addresses). How would you perform address
arithmetic in such a system?

>B) seems more linear at first glance, but there are tons of practical
>implementation challenges like limited support by simple CPU instructions

On all 64-bit CPUs I know there is full support for 64-bit data.

> 128-bit integer math for doubles (who needs that monster?)

Anyone working with big integers. Otherwise, not that frequent, true.

>128-bit division/multiplication in number conversion words like # #S >NUMBER, etc.
>
>I am inclined toward A), but then I am no real systems developer and may have
>overlooked important things. What were your 64-bit design considerations?

Quite simple: The standard requires that an address fits into a cell,
so if the address grows to 64 bits, the cell grows, too. That did not
require any thought.

My experience is that Forth code tends to be pretty portable between
32-bit and 64-bit systems (more than C code). If you want a
double-cell address, code would not port between systems with
single-cell addresses and double-cell addresses.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2017: http://euro.theforth.net/

hughag...@gmail.com

Mar 15, 2018, 12:48:32 PM
It is not that obvious. Straight Forth does this:
cells 64-bit and addresses 32-bit
A number is fixed-point assuming a unity of 2^32.
An address is the integer part of a number with the fractional part zeroed.
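A minimal C sketch of that cell layout as I read it (hypothetical: the names, the arithmetic-shift assumption, and the layout details are mine, not Straight Forth's actual code):

```c
#include <stdint.h>

/* Hypothetical sketch: a 64-bit cell is a fixed-point number with
   unity 2^32, so the upper 32 bits are the integer part and double
   as the 32-bit address; heap addresses are negative. */
typedef int64_t cell;

#define UNITY ((cell)1 << 32)

/* integer part = address; assumes arithmetic right shift (GCC/Clang) */
static int32_t cell_to_addr(cell c) { return (int32_t)(c >> 32); }
static cell addr_to_cell(int32_t a) { return (cell)a * UNITY; }

/* negative integer part => heap; non-negative => dictionary */
static int in_heap(cell c) { return cell_to_addr(c) < 0; }
```

With this layout a word like FREE could branch on `in_heap` and become a no-op for dictionary addresses, as described above.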

The heap has negative addresses. This allows you to easily determine if an address is in the heap or in the dictionary.
For example, FREE will free a node from the heap, but will do nothing if the node is in the dictionary.

Data-structures can be filled either at compile-time (in the dictionary) or at run-time (in the heap).
They can be moved from one to the other as necessary, and the code that works with them will continue working unchanged.
One of the many stupid things about ANS-Forth is that it has incompatible code for the dictionary and the heap.
The dictionary uses ALLOT and the heap uses ALLOCATE --- why not use the same word for both? --- ANS-Forth is just a jumbled mess!

ANS-Forth is messy like this because there was no planning ahead, and there was a great emphasis on keeping legacy code from the 1970s going.
In the 1970s there was no heap and there was no plan to have a heap. Later on the heap got introduced --- essentially a patch over a hole in the design.
Now ANS-Forth is a patchwork of bug-fixes and bad-design work-arounds.

Forth is not going to succeed until ANS-Forth is discarded and the design is started fresh --- this time with some thinking involved!

hughag...@gmail.com

Mar 15, 2018, 1:10:26 PM
On Thursday, March 15, 2018 at 9:43:14 AM UTC-7, Anton Ertl wrote:
> minf...@arcor.de writes:
> > 128-bit integer math for doubles (who needs that monster?)
>
> Anyone working with big integers. Otherwise, not that frequent, true.

Straight Forth has four data-stacks:
a single-cell stack (one cell), double-cell stack (two cells), a float stack (double-precision IEEE-754 or 80-bit extended, and only 8 elements deep),
and a string-stack (one cell: 32-bit address and 32-bit count packed together into a 64-bit cell).

The 128-bit double-stack contains these types of data intermixed:
double-number: assumes a unity of 2^64
ratio: 64-bit numerator and 64-bit denominator
big-number: assumes a unity of 1 (this is primarily provided to support continued fractions)

Mixed-precision arithmetic automatically uses both the single-stack and the double-stack:
For example: UM* takes two arguments from the single-stack and puts the result on the double-stack.
UM/MOD takes a numerator on the double-stack and a denominator on the single-stack, and puts both the quotient and remainder on the single-stack.
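The mixed-precision pairing described above can be sketched in C for a 64-bit system (a hedged illustration, not Straight Forth's actual implementation; it relies on GCC/Clang's `unsigned __int128` extension, and the function names are mine):

```c
#include <stdint.h>

typedef uint64_t cell;

/* UM*: multiply two single cells into a double-cell (lo, hi) result */
static void um_star(cell a, cell b, cell *lo, cell *hi) {
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (cell)p;
    *hi = (cell)(p >> 64);
}

/* UM/MOD: divide a double-cell by a single cell, giving single-cell
   quotient and remainder; caller must ensure hi < u so the quotient
   fits in one cell */
static void um_div_mod(cell lo, cell hi, cell u, cell *q, cell *r) {
    unsigned __int128 ud = ((unsigned __int128)hi << 64) | lo;
    *q = (cell)(ud / u);
    *r = (cell)(ud % u);
}
```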

> My experience is that Forth code tends to be pretty portable between
> 32-bit and 64-bit systems (more than C code). If you want a
> double-cell address, code would not port between systems with
> single-cell addresses and double-cell addresses.

There haven't been any 32-bit x86 computers built in over 15 years --- you are living in the past if you think that a 32-bit Forth is meaningful.

Numeric programs are generally only portable if you know what precision you are working with.

minf...@arcor.de

Mar 15, 2018, 4:12:52 PM
On Thursday, March 15, 2018 at 17:43:14 UTC+1, Anton Ertl wrote:
> minf...@arcor.de writes:
> >The standard intentionally does not specify cell sizes and address sizes
> >(cell/s representing a pointer to memory).
>
> It does specify that an address fits into a cell.

You're right, I found it in Table 3.1 of the standard.

> >For a 64-bit Forth there are two obvious choices:
> >A) cells 32-bit and addresses 64-bit
> >B) cells 64-bit and addresses 64-bit
> >
> >A) is practical enough, matches most CPU instruction sets, Forth double numbers
> >are still there for 64-bit math, and it means a rather simple extension
> >starting from an existing 32-bit system.
>
> I really have a hard time imagining how that should work. Do you mean
> that addresses use two cells? AFAIK even in the bad old days of the
> 8086 nobody went there (at least for general addresses; F-PC used two
> cells for return addresses). How would you perform address
> arithmetics in such a system?

No, double cells for addresses would be a mess unless in a typed system.
I had the strange idea that integer numbers would only occupy the lower
32-bit half of a 64-bit cell; the upper half would always hold the sign.

I did not think it through, obviously. Adding 64-bit address pointers to a 32-bit
Forth (that otherwise remains more or less unchanged) would not reach far enough, and
is problematic due to the typeless nature of Forth.

>
> On all 64-bit CPUs I know there is full support for 64-bit data.

Yes, but not always for the 128-bit data needed by Forth mixed integer mul/div or
double words. Such operations require longish assembler code or libraries.

Lars Brinkhoff

Mar 16, 2018, 2:25:18 AM
minforth wrote:
> Anton Ertl:
>>>The standard intentionally does not specify cell sizes and address
>>>sizes (cell/s representing a pointer to memory).
>> It does specify that an address fits into a cell.
> You're right, I found it in Table 3.1 of the standard

As pointed out, handling addresses wider than a cell is a problem.
How would they fit in the stack?

A possible solution might be to always put addresses on the return
stack. Someone else in c.l.f suggested that at one time. Now, you want
to do arithmetic on addresses, so it's still inconvenient.

minf...@arcor.de

Mar 16, 2018, 4:14:40 AM
As I said, addresses as double cells would be a mess, because commutativity
of adding integers to addresses would be hurt. For example:
( adrlo adrhi offset ) + ... versus
( offset adrlo adrhi ) + ???

Possible solution: boxing. But then it would be a typed Forth which is quite
another beast.

Or use extended precision 80-bit floating point with 64-bit mantissa
throughout for any and all elementary data types.

Lars Brinkhoff

Mar 16, 2018, 4:28:07 AM
minforth wrote:
>> A possible solution might be to always put addresses on the return
>> stack. Someone else in c.l.f suggested that at one time. Now, you
>> want to do arithmetic on addresses, so it's still inconvenient.
> As I said, addresses as double cells would be a mess

Addresses wouldn't need to be double cells if they always stay on the
return stack (or address stack).

There could be a separate set of operations for doing arithmetic on
addresses on the return stack. This may not be as far fetched as it
sounds. The 68000 had a separate set of opcodes for operating on
address registers.

minf...@arcor.de

Mar 16, 2018, 5:05:16 AM
Interesting. In other words a 32-bit Forth plus a 64-bit address stack
plus a suitable set of new operators like A+ for adding an offset to an
address?

A lot of standard Forth words would have to be adapted too, though, to work
on the address stack.
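For what it's worth, the idea can be sketched in C (purely hypothetical: the names A+, apush, etc. are illustrative, and no overflow or underflow checking is done):

```c
#include <stdint.h>

/* Sketch of a 32-bit data stack plus a separate 64-bit address stack,
   with A+ adding a data-stack offset to the top address. */
#define STACK_DEPTH 16

static uint32_t dstack[STACK_DEPTH]; static int dsp = 0;
static uint64_t astack[STACK_DEPTH]; static int asp = 0;

static void push(uint32_t x)  { dstack[dsp++] = x; }
static uint32_t pop(void)     { return dstack[--dsp]; }
static void apush(uint64_t a) { astack[asp++] = a; }
static uint64_t apop(void)    { return astack[--asp]; }

/* A+ ( n -- ) ( A: addr -- addr+n ): the offset comes from the
   data stack and is sign-extended to 64 bits */
static void a_plus(void) {
    int32_t n = (int32_t)pop();
    astack[asp - 1] += (int64_t)n;
}
```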

Anton Ertl

Mar 16, 2018, 5:11:48 AM
minf...@arcor.de writes:
>On Thursday, March 15, 2018 at 17:43:14 UTC+1, Anton Ertl wrote:
>> minf...@arcor.de writes:
>> >For a 64-bit Forth there are two obvious choices:
>> >A) cells 32-bit and addresses 64-bit
>> >B) cells 64-bit and addresses 64-bit
>> >
>> >A) is practical enough, matches most CPU instruction sets, Forth double numbers
>> >are still there for 64-bit math, and it means a rather simple extension
>> >starting from an existing 32-bit system.
>>
>> I really have a hard time imagining how that should work. Do you mean
>> that addresses use two cells? AFAIK even in the bad old days of the
>> 8086 nobody went there (at least for general addresses; F-PC used two
>> cells for return addresses). How would you perform address
>> arithmetics in such a system?
>
>No, double cells for addresses would be a mess unless in a typed system.
>I had the strange idea that integer numbers would only occupy the lower
>32-bit half of a 64-bit cell; the upper half would always hold the sign.

Ok, but that would mean that address arithmetic would not work with
addresses outside the -2G..2G address range. Essentially you would
still have a 32-bit system, but cells would now consume 64 bits.

>> On all 64-bit CPUs I know there is full support for 64-bit data.
>
>Yes, but not always for the 128-bit data needed by Forth mixed integer mul/div or
>double words. Such operations require longish assembler code or libraries.

It's not that hard to write these things. But if you don't want to go
there, the simplest thing is to define M* UM* UM/MOD FM/MOD SM/REM to
use just one 64x64->64 operation, and give an error (ideally) or wrong
result if the upper half of the 128-bit operand/result is not the
sign-extended or zero-extended lower half. If you are right and
numbers that do not fit into 64 bits are not needed, the error will
never happen.
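A sketch of such a "fake doubles" check in C (assuming 64-bit cells and an unsigned, zero-extended double-cell operand; the function name and error convention are mine):

```c
#include <stdint.h>

typedef uint64_t cell;

/* UM/MOD implemented with a single 64/64 division: the upper half of
   the double-cell dividend must be the zero extension of a 64-bit
   value (i.e. zero), otherwise report an error instead of silently
   computing a wrong result. */
static int fake_um_div_mod(cell ud_lo, cell ud_hi, cell u,
                           cell *q, cell *r) {
    if (u == 0 || ud_hi != 0)
        return -1;            /* error: operand exceeds 64 bits */
    *q = ud_lo / u;
    *r = ud_lo % u;
    return 0;
}
```

The signed words (M*, SM/REM, ...) would analogously require the upper half to be the sign extension of the lower half.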

IIRC we had such a discussion a few years ago, where somebody reported
doing without doubles or somesuch. I think the fake doubles outlined
above are nicer for compatibility.

Lars Brinkhoff

Mar 16, 2018, 5:20:02 AM
minforth wrote:
> Interesting. In other words a 32-bit Forth plus a 64-bit address stack
> plus a suitable set of new operators like A+ for adding an offset to an
> address?

Right, something like that. It's just an untested idea. I really don't
see much advantage to doing things this way. But if you're in an
experimental mood it might be the way to go.

> A lot of standard Forth words would have to be adapted too, though, to work
> on the address stack.

Yes, it's not compatible with the standard.

Stephen Pelc

Mar 16, 2018, 6:15:07 AM
On Thu, 15 Mar 2018 09:11:28 -0700 (PDT), minf...@arcor.de wrote:

>B) seems more linear at first glance, but there are tons of practical
>implementation challenges like limited support by simple CPU instructions,
>128-bit integer math for doubles (who needs that monster?), 128-bit
>division/multiplication in number conversion words like # #S >NUMBER, etc.

B satisfies the *requirement* that an address is a single cell, in
that a VARIABLE has cell-sized data.

Doubles larger than 64 bits are required by any commercial app that
handles money and exchange rate calculations. According to CCS, you
could probably get away with 96 bits, but the overhead of 128 bit
is very low compared to 96 bits.
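A small C illustration of why the double-width intermediate matters (a sketch assuming GCC/Clang's `__int128`; the names and scaling convention are mine): scaling a 64-bit fixed-point amount by a fixed-point rate overflows 64 bits long before the rescaled result does.

```c
#include <stdint.h>

/* Scale a fixed-point amount by a fixed-point rate: the intermediate
   product needs 128 bits even though inputs and result are 64-bit. */
static int64_t scale(int64_t amount, int64_t rate, int64_t unity) {
    __int128 p = (__int128)amount * rate;   /* full 128-bit product */
    return (int64_t)(p / unity);            /* rescale back to 64 bits */
}
```

With unity 10^9, an amount of 10^12 times a rate of 1.5 gives an intermediate product of 1.5e21, well past the 64-bit range, even though the rescaled result fits comfortably.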

Stephen

--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441
web: http://www.mpeforth.com - free VFX Forth downloads

Albert van der Horst

Mar 16, 2018, 7:16:52 AM
In article <d2cbc9a2-4569-4948...@googlegroups.com>,
<minf...@arcor.de> wrote:
>The standard intentionally does not specify cell sizes and address sizes
>(cell/s representing a pointer to memory).
>
>In a 32-bit Forth system everything is simple: cells and addresses are 32 bits wide.
>
>For a 64-bit Forth there are two obvious choices:
>A) cells 32-bit and addresses 64-bit
>B) cells 64-bit and addresses 64-bit
>
>A) is practical enough, matches most CPU instruction sets, Forth double numbers
>are still there for 64-bit math, and it means a rather simple extension
>starting from an existing 32-bit system.
>
>B) seems more linear at first glance, but there are tons of practical
>implementation challenges like limited support by simple CPU instructions,
>128-bit integer math for doubles (who needs that monster?), 128-bit
>division/multiplication in number conversion words like # #S >NUMBER, etc.

You don't know what you're talking about, huh?
The difference between 32-bit ciforth and 64-bit ciforth w.r.t.
double precision aka number conversion is done with an automatic
conversion
AX -> EAX
or
AX -> RAX

And yes, double precision comes in very handy for Project Euler
problems. Java folks constantly complain about unsignalled overflow
errors.

>
>I am inclined toward A), but then I am no real systems developer and may have
>overlooked important things. What were your 64-bit design considerations?

None, just map 32-bit operations and registers to 64-bit operations.

OS calls in Linux and MS-Windows are different, but you cannot call
" let's use 64 bit MS-Windows calls under 64 bit Windows"
a design consideration.

Regards, Albert
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

minf...@arcor.de

Mar 16, 2018, 8:10:08 AM
On Friday, March 16, 2018 at 11:15:07 UTC+1, Stephen Pelc wrote:
> On Thu, 15 Mar 2018 09:11:28 -0700 (PDT), minf...@arcor.de wrote:
>
> >B) seems more linear at first glance, but there are tons of practical
> >implementation challenges like limited support by simple CPU instructions,
> >128-bit integer math for doubles (who needs that monster?), 128-bit
> >division/multiplication in number conversion words like # #S >NUMBER, etc.
>
> B satisfies the *requirement* that an address is a single cell, in
> that a VARIABLE has cell-sized data.
>
> Doubles larger than 64 bits are required by any commercial app that
> handles money and exchange rate calculations. According to CCS, you
> could probably get away with 96 bits, but the overhead of 128 bit
> is very low compared to 96 bits.
>

Thanks, that was new to me. I am more at home with Forth for controller
boards. The only desktop application was prototyping controls for
such boards. Unfortunately we don't deal with trillions of currency units
there. :-)

But with 32 bits dwindling away there are considerations to "upgrade".

john

Mar 16, 2018, 10:11:24 AM
In article <5aab9808....@news.eternal-september.org>,
ste...@mpeforth.com says...
>
> Doubles larger than 64 bits are required by any commercial app that
> handles money and exchange rate calculations. According to CCS, you
> could probably get away with 96 bits, but the overhead of 128 bit
> is very low compared to 96 bits.
>
> Stephen
>
>
Some ideas seem a bit quart/pint pot-ish to me.

Rigging Forth to start using the return stack seems to undermine the forth
standard in a way that may be damaging to those who need a standard to
justify their products. I'd like to suggest such open heart surgery be kept well
away from the standard.

As minforth mentioned the 68000 is possibly a good model for the 32/64 mix.

Given an OS such as Linux I can't see any reason a CPU manufacturer couldn't
jump into the market and start making a clean simple CPU and support system
to target various markets - cross assemblers are cheap after all. The monolithic
Intel monster looks very susceptible to replacement to me.
If I had the resources I'd do it myself. The time seems right.

There are some older comments here that are tangential but may be of interest
to other posters:
https://lwn.net/Articles/631734/
These aren't new issues.


--


john

=========================
http://johntech.co.uk
=========================

Coos Haak

Mar 16, 2018, 11:34:26 AM
On Fri, 16 Mar 2018 14:11:16 -0000, john wrote:

<snip>
> As minforth mentioned the 68000 is possibly a good model for the 32/64 mix.
As there is no 68000 with 64-bit addresses, are you thinking of
a 64-bit implementation (cells) with 32-bit addressing?

Regards, Coos

a...@littlepinkcloud.invalid

Mar 16, 2018, 1:57:43 PM
Albert van der Horst <alb...@cherry.spenarnc.xs4all.nl> wrote:
> In article <d2cbc9a2-4569-4948...@googlegroups.com>,
> <minf...@arcor.de> wrote:
>>The standard intentionally does not specify cell sizes and address sizes
>>(cell/s representing a pointer to memory).
>>
>>In a 32-bit Forth system everything is simple: cells and addresses are 32 bits wide.
>>
>>For a 64-bit Forth there are two obvious choices:
>>A) cells 32-bit and addresses 64-bit
>>B) cells 64-bit and addresses 64-bit
>>
>>A) is practical enough, matches most CPU instruction sets, Forth double numbers
>>are still there for 64-bit math, and it means a rather simple extension
>>starting from an existing 32-bit system.
>>
>>B) seems more linear at first glance, but there are tons of
>>practical implementation challenges like limited support by simple
>>CPU instructions, 128-bit integer math for doubles (who needs that
>>monster?), 128-bit division/multiplication in number conversion
>>words like # #S >NUMBER, etc.
> You don't know what you're talking about, huh?
> The difference between 32-bit ciforth and 64-bit ciforth w.r.t.
> double precision aka number conversion is done with an automatic
> conversion
> AX -> EAX
> or
> AX -> RAX

You're rather assuming Intel here. Not every 64-bit processor has,
for example, 128/64 divide instructions.

Andrew.

john

Mar 17, 2018, 8:58:04 AM
In article <kukqzjhtdw5z.1riha4ujl40pw$.d...@40tude.net>, htr...@gmail.com
says...
> > As minforth mentioned the 68000 is possibly a good model for the 32/64 mix.

> As there is no 68000 with 64 bit addresses, do you think of
> a 64 bit implementation (cells) with 32 bit addressing?
>
> groet Coos
>
>

Hi
maybe - why not - but the original comment was based
on the notion of specific assembler instructions and register use for wider data
paths rather than specific implementations of hardware - the 68K was an example
of clever thinking over brute force. (6502 vs. Z80 if you like - and are old enough)

Start talking about applications that have issues and how to resolve them
and you may be able to get the horse back in front of the cart.

The only issue is - as Stephen pointed out - some markets do need wider data
for maths precision or other needs that go outside the norm. And markets like
UHD Film editing may need very large amounts of memory (address space) but
maybe not the maths precision register/bus widths - it all depends.

The real question is how much of the market needs what exactly and how often.
Answer that first.

The research we did last year suggests 32 bits is perfectly adequate for 95%+
of computer use. 64 bits is more about marketing than benefits, while some
military and scientific use we found was using up to 512 bits of data. (not
necessarily with very large addressing ranges)

Personally - I'd say - If you're not into the sort of things I'm doing then I'd ignore
the entire concept and just stick to whatever fits your existing hardware best.
You'll be more efficient that way whichever market you're in. Including handling
pictographs. Text manipulation doesn't need real-time radar responses.

The 0.5% speed increase you get from convoluted fiddling with the forth
language will be overtaken in 6 months time just by plugging in a newer
processor anyway.

If you all start from applications issues perhaps you can stop tilting at windmills.

Rod Pemberton

Mar 17, 2018, 11:49:29 AM
On Thu, 15 Mar 2018 10:10:25 -0700 (PDT)
hughag...@gmail.com wrote:

> On Thursday, March 15, 2018 at 9:43:14 AM UTC-7, Anton Ertl wrote:

> There haven't been any 32-bit x86 computers built in over 15 years

Every 64-bit Intel and 64-bit AMD processor still runs 32-bit x86 code.
So, why are you saying these machines aren't 32-bit x86? ...

> --- you are living in the past if you think that a 32-bit Forth is
> meaningful.

My 32-bit Forth is coded in C. It needs a recompile for 64-bits, not a
complete recode of the assembly.


Rod Pemberton
--
feedback loop: Russian aggression -> World complains -> Russian
paranoia -> Russian threats -> repeat

minf...@arcor.de

Mar 17, 2018, 12:07:56 PM
On Saturday, March 17, 2018 at 13:58:04 UTC+1, john wrote:
> The research we did last year suggests 32bits is perfectly adequate for 95% +
> of computer use.

I think so too. But now even small laptops are equipped with lots of RAM
and are capable of running signal processing software, like video stuff.

In our case it is signal archives (up to several years of measured and
calculated data) of size often beyond 4 gig. It would be nice to load them
completely into ram for further processing.

A 32 bit system is ample enough for our processing tasks. The only thing that
is required would be handling/reading/writing of memory blocks > 4 gig.

Anton Ertl

Mar 17, 2018, 1:19:37 PM
john <jo...@example.com> writes:
>68K was an example
>of clever thinking over brute force.

Where do you see the clever thinking in the 68000, and what is the
brute force you contrast it with?

The 68000 was more advanced, but also used more transistors and area
(more brute force) than previous CPUs. One might consider the ARM1 as
a result of clever thinking, but then that was started 7 years later
(and available 5 years later), and the result of the RISC research was
available when they started, so it was clever thinking of many people.

john

Mar 17, 2018, 4:42:32 PM
In article <ff8a42e1-7287-42cb...@googlegroups.com>,
minf...@arcor.de says...
>
> A 32 bit system is ample enough for our processing tasks. The only thing that
> is required would be handling/reading/writing of memory blocks > 4 gig.
>
A 32-bit processor/Forth with 64-bit addressing - presumably you use x86, which
does the job nicely enough with the 64-bit mode of IA-32. It's controlled at the code
segment level.

The default operand size is 32 bits, the default address size is 64 bits linear, with extra registers
and SIMD thrown in for free (I think).
Registers are widened to 64 bits. You can use 64-bit
operands with a prefix but I've never done this. I stopped at standard 32-bit PM
mode for my boot loader.

minf...@arcor.de

Mar 24, 2018, 4:21:38 AM
On Saturday, March 17, 2018 at 21:42:32 UTC+1, john wrote:
> In article <ff8a42e1-7287-42cb...@googlegroups.com>,
> minf...@arcor.de says...
> >
> > A 32 bit system is ample enough for our processing tasks. The only thing that
> > is required would be handling/reading/writing of memory blocks > 4 gig.
> >
> A 32-bit processor/Forth with 64-bit addressing - presumably you use x86, which
> does the job nicely enough with the 64-bit mode of IA-32. It's controlled at the code
> segment level.
>
> The default operand size is 32 bits, the default address size is 64 bits linear, with extra registers
> and SIMD thrown in for free (I think).
> Registers are widened to 64 bits. You can use 64-bit
> operands with a prefix but I've never done this. I stopped at standard 32-bit PM
> mode for my boot loader.
>

These past days I tinkered our 32-bit Forth up to 64 bits. So far so good; it
passes the core test. The 128/64-bit division was tough though. It's the
classic binary shift-subtract algorithm. I didn't find any simpler method
on the web.

Perhaps some good fellow could give me a hint how the 128/64 division could
be composed of simple 64/32-bit division elements?

Anton Ertl

Mar 24, 2018, 5:06:52 AM
minf...@arcor.de writes:
>The 128/64 bit division was tough though. It's the
>classic binary shift-subtract-algorithm. I didn't find any simpler method
>in the web.
>
>Perhaps some good fellow could give me a hint how the 128/64 division could
>be composed of simple 64/32 bit division elements?

It's already bad enough if the element is 64/64. With 64/32, the
shift-and-subtract variant may be fastest; and it's certainly the
simplest. Another option is to use the 64/32 in combination with
shifts to get an approximate result, and then refine that.
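The shift-and-subtract variant can be written in portable C without any 128-bit type; a minimal sketch (the function name is mine, and it assumes the usual UM/MOD precondition hi < u so the quotient fits in one cell):

```c
#include <stdint.h>

/* Restoring shift-and-subtract 128/64 division: the dividend is the
   pair hi:lo, the running remainder lives in hi, and one bit of lo
   is shifted in per iteration. Requires hi < u. */
static void um_div_mod(uint64_t lo, uint64_t hi, uint64_t u,
                       uint64_t *q, uint64_t *r) {
    uint64_t quot = 0;
    for (int i = 0; i < 64; i++) {
        uint64_t lo_msb = lo >> 63;    /* bit moving from lo into hi */
        uint64_t hi_msb = hi >> 63;    /* bit shifted out of hi */
        lo <<= 1;
        hi = (hi << 1) | lo_msb;
        quot <<= 1;
        /* subtract whenever the 65-bit remainder >= divisor */
        if (hi_msb || hi >= u) {
            hi -= u;                   /* wraps correctly mod 2^64 */
            quot |= 1;
        }
    }
    *q = quot;
    *r = hi;
}
```

Each iteration shifts one dividend bit into the running remainder and subtracts the divisor when it fits: exactly the classic restoring division the later posts describe.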

Albert van der Horst

Mar 24, 2018, 6:27:41 AM
In article <77b8ce6c-c1bc-4a99...@googlegroups.com>,
<minf...@arcor.de> wrote:
>On Saturday, March 17, 2018 at 21:42:32 UTC+1, john wrote:
>> In article <ff8a42e1-7287-42cb...@googlegroups.com>,
>> minf...@arcor.de says...
>> >
>> > A 32 bit system is ample enough for our processing tasks. The only thing that
>> > is required would be handling/reading/writing of memory blocks > 4 gig.
>> >
>> A 32-bit processor/Forth with 64-bit addressing - presumably you use x86, which
>> does the job nicely enough with the 64-bit mode of IA-32. It's controlled at the code
>> segment level.
>>
>> The default operand size is 32 bits, the default address size is 64 bits linear, with extra registers
>> and SIMD thrown in for free (I think).
>> Registers are widened to 64 bits. You can use 64-bit
>> operands with a prefix but I've never done this. I stopped at standard 32-bit PM
>> mode for my boot loader.
>>
>
>These past days I tinkered our 32-bit Forth up to 64 bits. So far so good; it
>passes the core test. The 128/64-bit division was tough though. It's the
>classic binary shift-subtract algorithm. I didn't find any simpler method
>on the web.
>
>Perhaps some good fellow could give me a hint how the 128/64 division could
>be composed of simple 64/32-bit division elements?

Seriously?

Assuming you're running a 64 bit processor:
; UM/MOD
POP EBX ;DIVISOR
POP EDX ;MSW OF DIVIDEND
POP EAX ;LSW OF DIVIDEND
DIV EBX ;DIVIDE BY 16 BITS
PUSH EDX
PUSH EAX

In this 32-bit code, replace all E's by R's to get the 64-bit version.

(The comments reflect that it is actually 16-bit code,
milled through m4:
define({BX},{EBX})dnl
or
define({BX},{RBX})dnl
)

There is also IDIV, on behalf of SM/REM.

Albert van der Horst

Mar 24, 2018, 7:05:34 AM
In article <2018Mar2...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>minf...@arcor.de writes:
>>The 128/64 bit division was tough though. It's the
>>classic binary shift-subtract-algorithm. I didn't find any simpler method
>>in the web.
>>
>>Perhaps some good fellow could give me a hint how the 128/64 division could
>>be composed of simple 64/32 bit division elements?
>
>It's already bad enough if the element is 64/64. With 64/32, the
>shift-and-subtract variant may be fastest; and it's certainly the
>simplest. Another option is to use the 64/32 in combination with
>shifts to get an approximate result, and then refine that.

C is like assembler with no flag registers.
E.g. Z80 code for shift-and-subtract is easy to find, but it uses flags.
DEC Alpha assembler is also like that and it has no basic divide
instruction.
Instead a register that I called compare is used.

This is my shift-and-subtract code for the DEC alpha.
The macro processor is used to give registers a name.
CW is the cell width

CODE_HEADER({UM/MOD},{USLAS})
define({divisor},{SR1})
define({dividend},{SR3})
define({modulus},{SR3})
define({quotient},{SR0})
define({compare},{SR2})
define({mask},{{$}{5}})
LDQ divisor, _CELLS(0)(SPO) #{LSW OF DIVISOR}
LDQ quotient, _CELLS(1)(SPO) #{MSW OF DIVIDEND}
LDQ dividend, _CELLS(2)(SPO) #{LSW OF DIVIDEND}
ADDQ SPO, _CELLS(3), SPO
BEQ divisor, DZERO /* div by zero */
BIS $31,1,mask

/* shift divisor left */
USLAS1: CMPULT divisor,modulus,compare
BLT divisor, USLAS3
ADDQ divisor,divisor,divisor
ADDQ mask,mask,mask
BNE compare,USLAS1

/* ok, start to go right again.. */
USLAS2: SRL divisor,1,divisor
BEQ mask, USLAS9
SRL mask,1,mask
USLAS3: CMPULE divisor,modulus,compare
BEQ compare,USLAS2
ADDQ quotient,mask,quotient
BEQ mask,USLAS9
SUBQ modulus,divisor,modulus
BR USLAS2
USLAS9:
_2PUSH

undefine({divisor})
undefine({dividend})
undefine({modulus})
undefine({quotient})
undefine({compare})
undefine({mask})

#
#{ DIVIDE BY ZERO ERROR - SHOW MAX NUMBERS}
DZERO: SUBQ $31, 1, SR0
ADDQ SR0, 0, SR3
_2PUSH #{STORE QUOT/REM}
#

\--------------- after macro ------------------------------
X_USLAS:

LDQ $1, (CW*(0))($29) #LSW OF DIVISOR
LDQ $0, (CW*(1))($29) #MSW OF DIVIDEND
LDQ $3, (CW*(2))($29) #LSW OF DIVIDEND
ADDQ $29, (CW*(3)), $29
BEQ $1, DZERO /* div by zero */
BIS $31,1,$5

/* $1 left */
USLAS1: CMPULT $1,$3,$2
BLT $1, USLAS3
ADDQ $1,$1,$1
ADDQ $5,$5,$5
BNE $2,USLAS1

/* ok, start to go right again.. */
USLAS2: SRL $1,1,$1
BEQ $5, USLAS9
SRL $5,1,$5
USLAS3: CMPULE $1,$3,$2
BEQ $2,USLAS2
ADDQ $0,$5,$0
BEQ $5,USLAS9
SUBQ $3,$1,$3
BR USLAS2
USLAS9:
SUBQ $29, 2*CW, $29
STQ $3, CW($29)
STQ $0, 0($29)
LDQ $0, 0($30)
ADDQ $30, CW, $30
LDQ $1, 0($0)
JMP $31, ($1), 0

#
# DIVIDE BY ZERO ERROR - SHOW MAX NUMBERS
DZERO: SUBQ $31, 1, $0
ADDQ $0, 0, $3
SUBQ $29, 2*CW, $29
STQ $3, CW($29)
STQ $0, 0($29)
LDQ $0, 0($30)
ADDQ $30, CW, $30
LDQ $1, 0($0)
JMP $31, ($1), 0 #STORE QUOT/REM


\----------------------------------------------------------

(You can also look at the DEC Alpha Forth on my site below.)

This is of course tricky code.

minf...@arcor.de

Mar 24, 2018, 9:00:06 AM
On Saturday, March 24, 2018 at 10:06:52 UTC+1, Anton Ertl wrote:
> minf...@arcor.de writes:
> >The 128/64 bit division was tough though. It's the
> >classic binary shift-subtract-algorithm. I didn't find any simpler method
> >in the web.
> >
> >Perhaps some good fellow could give me a hint how the 128/64 division could
> >be composed of simple 64/32 bit division elements?
>
> It's already bad enough if the element is 64/64. With 64/32, the
> shift-and-subtract variant may be fastest; and it's certainly the
> simplest. Another option is to use the 64/32 in combination with
> shifts to get an approximate result, and then refine that.
>

Yes I think so too. Given that shift and subtract operations are pretty
fast, other algorithms with repeated or iterative multiplications or
divisions wouldn't beat it. We are not talking about bignums here.

minf...@arcor.de

unread,
Mar 24, 2018, 9:01:17 AM
to
Yes thanks but I cannot use assembler.

minf...@arcor.de

unread,
Mar 24, 2018, 9:09:52 AM
to
Thanks for posting this, somewhat hard to digest indeed.

Here's my C code for what it's worth:

: MU/MOD \ ( ud u -- r udq ) mixed unsigned division
C #ifdef _MF64 // e.g. visual c++
C mfUCell u=mftos, a, sa, b=mfthd, c=mfsec;
C if(c==0) { mfthd=b%u, mfsec=b/u, mftos=0; return; }
C a = c/u, c= c%u;
C for (int i=0; i<64; i++)
C { sa = c >> 63;
C c = (c << 1) | (b >> 63);
C b = (b << 1) | (a >> 63);
C a = a << 1;
C if (sa | (c >= u)) c -= u, a += 1; }
C mfsec = a, mftos = b, mfthd = c;
C #else // 32 bit or gnu c 32/64 bit
C mfUCell u=(mfUCell)mftos; mfUDbl udq,ud=*(mfUDbl*)(mfsp-2);
C udq=ud/u, mfthd=ud-udq*u, *(mfUDbl*)(mfsp-1)=udq;
C #endif
;
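For readers without the transpiler's stack macros (mftos, mfsec, etc.), the same bit-at-a-time algorithm can be written as a self-contained C function. The name and interface below are mine, not part of the posted system:

```c
#include <stdint.h>

/* 128/64 shift-and-subtract division, the same algorithm as the
   _MF64 branch of MU/MOD above, with plain parameters.
   Divides (hi:lo) by v; the full 128-bit quotient comes back as the
   pair (*qhi:*qlo) and the remainder is the return value.
   Assumes v != 0 (the posted word handles that case separately). */
static uint64_t udivmod128(uint64_t hi, uint64_t lo, uint64_t v,
                           uint64_t *qhi, uint64_t *qlo)
{
    uint64_t a = hi / v;   /* quotient bits accumulate here (low word) */
    uint64_t c = hi % v;   /* running remainder */
    uint64_t b = lo;       /* drains dividend bits, refills with quotient bits */
    for (int i = 0; i < 64; i++) {
        uint64_t sa = c >> 63;        /* bit shifted out of the remainder */
        c = (c << 1) | (b >> 63);
        b = (b << 1) | (a >> 63);
        a <<= 1;
        if (sa | (c >= v)) {          /* remainder overflowed or >= divisor */
            c -= v;
            a += 1;                   /* record a 1 quotient bit */
        }
    }
    *qhi = b;
    *qlo = a;
    return c;
}
```

The b register does double duty: it feeds dividend bits into the remainder out its top while the quotient's overflow bits from a fill it from the bottom, so after 64 steps (b:a) holds the complete 128-bit quotient.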

Anton Ertl

unread,
Mar 24, 2018, 10:00:29 AM
to
minf...@arcor.de writes:
>Here's my C code for what it's worth:

If you have C, why would you use 64/32 as element, and not 64/64?

Anyway, you can look at

http://git.savannah.gnu.org/cgit/gforth.git/tree/engine/support.c?id=9e4d1f2ee4197cc532e702a20c81af57535ea8b2#n791

and the following lines to see how Gforth does it. There is code
there for subtract-and-shift, as well as for shift-64/64-and-adjust.

minf...@arcor.de

unread,
Mar 24, 2018, 11:48:32 AM
to
Am Samstag, 24. März 2018 15:00:29 UTC+1 schrieb Anton Ertl:
> minf...@arcor.de writes:
> >Here's my C code for what it's worth:
>
> If you have C, why would you use 64/32 as element, and not 64/64?
>

Because this is input to a transpiler from Forth to C.
You are right that when the target is a PC workstation I can use gcc or
other good C compilers that support int_64 as backend. And the PC is
indeed used as the prototyping and development platform.

However when the targets are controller boards the transpiler generates
different code and int_64 is not supported there. But as I said in a previous
posting, I am more interested in addressing huge signal archives. Full
128-bit math is not really needed. Quad division is just a fun exercise.

Albert van der Horst

unread,
Mar 24, 2018, 2:09:12 PM
to
>minf...@arcor.de writes:
>>The 128/64 bit division was tough though. It's the
>>classic binary shift-subtract-algorithm. I didn't find any simpler method
>>in the web.
>>
>>Perhaps some good fellow could give me a hint how the 128/64 division could
>>be composed of simple 64/32 bit division elements?
>
>It's already bad enough if the element is 64/64. With 64/32, the
>shift-and-subtract variant may be fastest; and it's certainly the
>simplest. Another option is to use the 64/32 in combination with
>shifts to get an approximate result, and then refine that.

The UM/MOD I published is my transcription of a 64 by 64 bit unsigned
divide of Linus Torvalds.
(I'll correct that and show my own 128 by 64 version).
Anyhow this is what he said

"
/*
* arch/alpha/lib/divide.S
*
* (C) 1995 Linus Torvalds
*
* Alpha division..
*/

/*
* The alpha chip doesn't provide hardware division, so we have to do it
* by hand. The compiler expects the functions
*
<SNIP>
*
* In short: painful.
*
* This is a rather simple bit-at-a-time algorithm: it's very good
* at dividing random 64-bit numbers, but the more usual case where
* the divisor is small is handled better by the DEC algorithm
* using lookup tables. This uses much less memory, though, and is
* nicer on the cache.. Besides, I don't know the copyright status
* of the DEC code.
*/
"
So apparently DEC had more sophisticated algorithms in store.
The copyright is now with Intel probably.

One may notice that what Linus does is friendly for small dividends,
but not for small divisors. (See other post.)
For printing in base 10 fast, apparently multiplying with a reciprocal
would be indicated.

>
>- anton

Albert van der Horst

unread,
Mar 24, 2018, 2:20:24 PM
to
In article <p95bdn$4t7$1...@cherry.spenarnc.xs4all.nl>,
Albert van der Horst <alb...@cherry.spenarnc.xs4all.nl> wrote:
>In article <2018Mar2...@mips.complang.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>minf...@arcor.de writes:
>>>The 128/64 bit division was tough though. It's the
>>>classic binary shift-subtract-algorithm. I didn't find any simpler method
>>>in the web.
>>>
>>>Perhaps some good fellow could give me a hint how the 128/64 division could
>>>be composed of simple 64/32 bit division elements?
>>
>>It's already bad enough if the element is 64/64. With 64/32, the
>>shift-and-subtract variant may be fastest; and it's certainly the
>>simplest. Another option is to use the 64/32 in combination with
>>shifts to get an approximate result, and then refine that.
>
>C is like assembler with no flag registers.
>E.g. Z80 code for shift-and-subtract is easy to find, but it uses flags.
>Dec alpha assembler is also like that and it has no basic divide
>instruction.
>Instead a register that I called compare is used.
>
>This is my shift-and-subtract code for the DEC alpha.

But it is not a 128 by 64 division, just 64 by 64 copied
from Linus Torvalds' assembler code. That was the first version
I had in my alpha Forth.

<SNIP>

Here comes the 128 by 64 division for DEC Alpha.

worddoc( {MULTIPLYING},{UM/MOD},{u_slash},{ud u1 --- u2 u3},{ISO},
{Leave the unsigned remainder forthvar({u2}) and unsigned
quotient forthsamp({u3}) from the unsigned double dividend
forthvar({ud}) and unsigned divisor forthvar({u1}) .},
{{UM*},{SM/REM},{/}},
{
{-1. -1 UM/MOD . .},{-1 -1},
{DECIMAL 12.34 100 UM/MOD . .},{12 34}
},
enddoc)
CODE_HEADER({UM/MOD},{USLAS})
define({div},{{$}{4}})
define({modh},{SR3})
define({modl},{SR2})
define({quot},{SR0})
define({mask},{{$}{5}})
define({carryout},{{$}{6}})
# SR1 is scratch
LDQ div, _CELLS(0)(SPO) #{LSW OF DIVISOR}
LDQ modh, _CELLS(1)(SPO) #{MSW OF DIVIDEND}
LDQ modl, _CELLS(2)(SPO) #{LSW OF DIVIDEND}
ADDQ SPO, _CELLS(3), SPO

BIS $31,1,mask
BIS $31,0,quot
SLL mask, 63, mask
CMPULT modh, div, SR1
BEQ SR1, DZERO #{Overflow.}

UMSLAS3:
CMPLT modl, $31, SR1
CMPLT modh, $31, carryout
SLL modh, 1, modh
SLL modl, 1, modl
ADDQ modh, SR1, modh
CMPULE div,modh,SR1
BIS SR1, carryout, SR1

BEQ SR1,UMSLAS2
ADDQ quot,mask,quot
SUBQ modh,div,modh
UMSLAS2:

SRL mask,1,mask
BNE mask, UMSLAS3
UMSLAS9:

_2PUSH


undefine({div})
undefine({modl})
undefine({modh})
undefine({quot})
undefine({mask})

Compared to other processors the code is straightforward
because handling carries with registers is easier than
a carry flag, as soon as there are two carries.

minf...@arcor.de

unread,
Mar 24, 2018, 5:32:48 PM
to
Implying that Forth primitives with carries/borrows in registers are recommended?

Albert van der Horst

unread,
Mar 24, 2018, 10:40:02 PM
to
In article <1928ad6b-ff26-4b13...@googlegroups.com>,
<minf...@arcor.de> wrote:
>Am Samstag, 24. März 2018 19:20:24 UTC+1 schrieb Albert van der Horst:
>> In article <p95bdn$4t7$1...@cherry.spenarnc.xs4all.nl>,
>> Albert van der Horst <alb...@cherry.spenarnc.xs4all.nl> wrote:
>> >In article <2018Mar2...@mips.complang.tuwien.ac.at>,
>> >Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> >>minf...@arcor.de writes:
>> >>>The 128/64 bit division was tough though. It's the
>> >>>classic binary shift-subtract-algorithm. I didn't find any simpler method
>> >>>in the web.
>> >>>
>> >>>Perhaps some good fellow could give me a hint how the 128/64 division could
>> >>>be composed of simple 64/32 bit division elements?
>> >>
>> >>It's already bad enough if the element is 64/64. With 64/32, the
>> >>shift-and-subtract variant may be fastest; and it's certainly the
>> >>simplest. Another option is to use the 64/32 in combination with
>> >>shifts to get an approximate result, and then refine that.
>> >
>> >C is like assembler with no flag registers.
>> >E.g. Z80 code for shift-and-subtract is easy to find, but it uses flags.
>> >Dec alpha assembler is also like that and it has no basic divide
>> >instruction.
>> >Instead a register that I called compare is used.
>> >
>> >This is my shift-and-subtract code for the DEC alpha.
>>
>> But it is not a 128 by 64 division, just 64 by 64 copied
>> from Linus Torvalds' assembler code. That was the first version
>> I had in my alpha Forth.
>>
>> <SNIP>
>>
>> Here comes the 128 by 64 division for DEC Alpha.
>>
<SNIP>
>> Compared to other processors the code is straightforward
>> because handling carries with registers is easier than
>> a carry flag, as soon as there are two carries.
>>
>
>Implying that Forth primitives with carries/borrows in registers are recommended?

You can't decide that, it is enforced by the processor.
The DEC alpha just has no status register,
It has sufficient (32) registers to do that .
In multi-tasking/OS there is no status register to save.
Purists may think it is commendable.
If you reserve a register for carry, you can do interesting tricks
for multi precision, I guess.

minf...@arcor.de

unread,
Mar 25, 2018, 12:07:18 AM
to
Maybe for simple numeric experiments. In most other cases more complex numeric
algorithms would achieve better efficiency. Math textbooks are full of them.

Mark Humphries

unread,
Mar 25, 2018, 10:19:56 AM
to
Here's my untested C code for 4 types of 128/64 bit integer division and scaling: unsigned, truncating, floored, and Euclidean.

// Our 64-bit integer scaling functions multiply two 64-bit integers
// producing a 128-bit intermediary product that they then divide by a
// third 64-bit integer resulting in a 64-bit quotient and a 64-bit
// remainder.
// There are two factors of this operation that ISO C11 does not provide:
// - the high-order 64-bits of a 64-bit by 64-bit integer product
// (cf. the signed and unsigned MULH instructions on some processors)
// - 128-bit by 64-bit integer division
// We will therefore roll our own in this section.

// High-order and low-order 32 of 64 bits
#define LO(x) ((x)&UINT32_C(-1))
#define HI(x) ((x)>>32) // *** N.B. uses arithmetic shift if x signed ***

// High-order 64 out of the 128 bits of a 64x64 bit product
// Cf. the signed and unsigned MULH instructions on some processors. [HD]
#define HL(x,y) (HI(x)*LO(y) +HI(LO(x)*LO(y))) // factor of MULH()
#define MULH(x,y) (HI(x)*HI(y) +HI(HL(x,y)) +HI(LO(x)*HI(y) +LO(HL(x,y))))
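Written out as a function, the HL/MULH splitting above looks like this (same arithmetic as the macros, just with named intermediates; the function name is mine):

```c
#include <stdint.h>

/* High 64 bits of an unsigned 64x64-bit product, by 32-bit halves:
   x*y = xh*yh*2^64 + (xh*yl + xl*yh)*2^32 + xl*yl.
   "mid" below is HL(x,y): xh*yl plus the carry out of xl*yl.
   None of the partial sums can overflow 64 bits. */
static inline uint64_t mulh_u64(uint64_t x, uint64_t y)
{
    uint64_t xl = x & 0xFFFFFFFFu, xh = x >> 32;
    uint64_t yl = y & 0xFFFFFFFFu, yh = y >> 32;
    uint64_t mid = xh * yl + ((xl * yl) >> 32);          /* HL(x,y) */
    return xh * yh + (mid >> 32)
           + ((xl * yh + (mid & 0xFFFFFFFFu)) >> 32);   /* MULH(x,y) */
}
```

For example, mulh_u64 of two all-ones words gives 0xFFFFFFFFFFFFFFFE, the high half of (2^64-1)^2.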

// 128-bit by 64-bit unsigned integer division
// uhi is the 64 most significant bits of the dividend.
// ulo is the 64 least significant bits of the dividend.
// v is the 64 bit divisor. *** N.B. Assumes divisor is not zero ***
// Returns remainder, stores quotient at q.
// On overflow:
// - returns (U64)-1, which is an impossible remainder value, since
// for a valid division the remainder must be less than the divisor.
// - quotient at q is undefined.

static U64 divu(U64 uhi, U64 ulo, U64 v, U64 *q){
// Check if quotient can possibly fit in 64 bits,
// if not return a remainder of (U64)-1 to indicate overflow.
if(uhi>=v) return U64C(-1);

U8 s=nlz64(v); // shift normalization amount, 0<=s<=63
v<<=s; // normalize divisor
U64 vhi=HI(v), vlo=LO(v); // break normalized divisor into high and low halves

// dividend digit pair 3,2
U64 u3_2=(uhi<<s);
if(s) u3_2|=ulo>>(64-s); // [UB]

U64 u1_0=ulo<<s; // dividend digit pair 1,0

U64 u1=HI(u1_0), u0=LO(u1_0); // break dividend digit pair u1_0 into 2 digits

// compute high 32 bits of quotient

const U64 b=U64C(0x100000000); // number base (32 bits)

U64 qhi=u3_2/vhi;
U64 r_est=u3_2-qhi*vhi; // remainder estimate
while(qhi>=b || qhi*vlo>b*r_est+u1){ --qhi; r_est+=vhi; if(r_est>=b) break; }

// multiply and subtract: digit pair 2,1
U64 u2_1=u3_2*b +u1 -qhi*v;

// compute low 32 bits of quotient

U64 qlo=u2_1/vhi;
r_est=u2_1-qlo*vhi; // remainder estimate
while(qlo>=b || qlo*vlo>b*r_est+u0){ --qlo; r_est+=vhi; if(r_est>=b) break; }

*q=qhi*b +qlo; // quotient
return (u2_1*b +u0 -qlo*v)>>s; // remainder
}

// Signed Integer Division
// nhi is the 64 most significant bits of the dividend.
// nlo is the 64 least significant bits of the dividend.
// m is the 64 bit divisor. *** N.B. Assumes divisor is not zero ***
// Returns remainder, stores quotient at q.
// On overflow:
// - returns impossible remainder of INT64_MIN.
// - quotient at q is undefined.

// Truncating 128-bit by 64-bit signed division AKA symmetric division. [HD]
static I64 div_trunc(I64 nhi, I64 nlo, I64 m, I64 *q){
I64 n_neg=nhi>>63; // fill with sign bit of n
I64 m_neg=m>>63; // fill with sign bit of m
I64 diff=n_neg^m_neg; // -1 if signs of n and m differ, otherwise 0

// absolute value of n
U64 ulo=(U64)((nlo^n_neg)-n_neg); // low 64 bits
U64 uhi=(U64)((nhi^n_neg)+(n_neg&!ulo)); // high 64 bits (carry in when ulo==0)

U64 v=(U64)((m^m_neg)-m_neg); // absolute value of m

I64 r=(I64)divu(uhi,ulo,v,(U64 *)q);
if(r==-1) return INT64_MIN; // divu() overflowed

*q=(*q^diff)-diff; // negate quotient if signs of n and m differ
if((*q^diff)<0 && *q) return INT64_MIN; // overflow

r=(r^n_neg)-n_neg; // negate remainder if dividend negative
return r;
}

// Floored 128-bit by 64-bit signed division
SI I64 div_floor(I64 nhi, I64 nlo, I64 m, I64 *q){
I64 r=div_trunc(nhi,nlo,m,q);

I64 diff=(nhi^m)>>63; // -1 if signs of n and m differ, otherwise 0

*q+=diff; // decrement truncating division quotient if signs of n and m differ

// if the signs of n and m differ and no overflow occurred
// add m to the truncating division remainder
r+=(diff&-(r!=INT64_MIN)&m);
return r;
}

// Euclidean 128-bit by 64-bit signed division AKA modulus division.
SI I64 div_euclid(I64 nhi, I64 nlo, I64 m, I64 *q){
I64 r=div_trunc(nhi,nlo,m,q);
I64 adjust=-(r<0)&(1|-(m<0));
*q-=adjust;
r+=((adjust*m)&-(r!=INT64_MIN)); // adjust remainder if no overflow occured
return r;
}
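Since the floored and Euclidean variants above only adjust the truncating result, the adjustments can also be seen in isolation on plain 64-bit operands, using C's native truncating / and % (helper names are mine, assuming m != 0 and no INT64_MIN overflow):

```c
#include <stdint.h>

/* Floored division: quotient rounds toward minus infinity, so the
   truncating result is decremented whenever there is a nonzero
   remainder and the operand signs differ. */
static inline int64_t q_floor(int64_t n, int64_t m)
{
    int64_t q = n / m, r = n % m;
    return (r != 0 && ((r < 0) != (m < 0))) ? q - 1 : q;
}
static inline int64_t r_floor(int64_t n, int64_t m)
{
    int64_t r = n % m;
    return (r != 0 && ((r < 0) != (m < 0))) ? r + m : r;
}

/* Euclidean division: the remainder is always non-negative,
   0 <= r < |m|, regardless of the signs of n and m. */
static inline int64_t q_euclid(int64_t n, int64_t m)
{
    int64_t q = n / m, r = n % m;
    if (r < 0) q -= (m > 0) ? 1 : -1;
    return q;
}
static inline int64_t r_euclid(int64_t n, int64_t m)
{
    int64_t r = n % m;
    if (r < 0) r += (m > 0) ? m : -m;
    return r;
}
```

For instance, -7 divided by 2 gives quotient -3 remainder -1 truncating, but quotient -4 remainder 1 under both the floored and Euclidean conventions; the two differ only when the divisor is negative.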

// Unsigned integer scaling
// Quotient and remainder of (i*j)/k, uses 128 bits for the unsigned
// intermediate product (i*j). *** N.B. Assumes divisor is not zero ***
// Returns remainder, stores quotient at q.
// On overflow:
// - returns (U64)-1, which is an impossible remainder value, since
// for a valid division the remainder must be less than the divisor.
// - quotient at q is undefined.

SI U64 scaleu(U64 i, U64 j, U64 k, U64 *q){ return divu(MULH(i,j),i*j,k,q); }

// Signed integer scaling
// Quotient and remainder of (i*j)/k, uses 128 bits for the signed intermediate
// product (i*j). *** N.B. Assumes divisor is not zero ***
// Returns remainder, stores quotient at q.
// On overflow:
// - returns impossible remainder of INT64_MIN.
// - quotient at q is undefined.

// Signed integer scaling using truncating division
SI I64 scale_trunc(I64 i, I64 j, I64 k, I64 *q){
return div_trunc(MULH(i,j),i*j,k,q);
}

// Signed integer scaling using floored division
SI I64 scale_floor(I64 i, I64 j, I64 k, I64 *q){
return div_floor(MULH(i,j),i*j,k,q);
}

// Signed integer scaling using Euclidean division
SI I64 scale_euclid(I64 i, I64 j, I64 k, I64 *q){
return div_euclid(MULH(i,j),i*j,k,q);
}

Mark Humphries

unread,
Mar 25, 2018, 10:29:19 AM
to
Forgot to post some defines required by my previously posted C code snippet:

#define SI static inline
#define I64 int64_t
#define U64 uint64_t
#define U64C UINT64_C

Anton Ertl

unread,
Mar 25, 2018, 11:25:41 AM
to
alb...@cherry.spenarnc.xs4all.nl (Albert van der Horst) writes:
>So apparently Dec had more sophisticated algorithms in store.
>The copyright is now with Intel probably.

More likely with HPE. Compaq bought DEC, HP bought Compaq, and the
enterprise part of HP was spun off as HPE. Intel only bought some of
the semiconductor business and StrongARM.

>For printing in base 10 fast, apparently multiplying with a reciprocal
>would be indicated.

It would certainly be faster than using division, and not only on
Alpha. Forth code for performing the division in two stages
(1. create a reciprocal; 2. multiply with the reciprocal) can be found
at

http://git.savannah.gnu.org/cgit/gforth.git/tree/compat/stagediv.fs

What is currently missing is signed division, and division of unsigned
doubles in this way; the latter is needed for implementing "#".
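For the common base-10 case the reciprocal method reduces to one widening multiply and a shift. A minimal sketch, assuming GCC/Clang's non-ISO unsigned __int128 for the high half of the product; the constant is ceil(2^67/10), which happens to give exact quotients for every 64-bit input:

```c
#include <stdint.h>

/* Unsigned divide by 10 without a division instruction:
   q = floor(n * ceil(2^67/10) / 2^67).  The constant
   0xCCCCCCCCCCCCCCCD is exact for all 64-bit n. */
static inline uint64_t div10(uint64_t n)
{
    return (uint64_t)(((unsigned __int128)n
                       * UINT64_C(0xCCCCCCCCCCCCCCCD)) >> 67);
}

static inline uint64_t mod10(uint64_t n)
{
    return n - 10u * div10(n);   /* remainder recovered from the quotient */
}
```

A digit loop for # would peel digits with mod10 and step with div10; gcc and clang perform the same strength reduction automatically for n/10 on most 64-bit targets, since multiply-high is far cheaper than divide.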

Anton Ertl

unread,
Mar 25, 2018, 11:41:39 AM
to
minf...@arcor.de writes:
>> Compared to other processors the code is straightforward
>> because handling carries with registers is easier than
>> a carry flag, as soon as there are two carries.
>>
>
>Implying that Forth primitives with carries/borrows in registers are recommended?

I doubt that he wanted to imply this. You can do what the Alpha, MIPS
and RISC-V does as follows:

: +c ( u1 u2 -- u uc )
over + tuck u< negate ;

and a somewhat sophisticated compiler will generate the two
instructions on these architectures that an assembly programmer would.

However, even if you have +c as primitive, it will be pretty hard for
a compiler to generate good code for it on machines that have a single
carry flag in a flags register. You can see an example where gcc or
clang generated pretty horrible code for a similar builtin in
<2016May2...@mips.complang.tuwien.ac.at> and its ancestors.
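The same idiom in C, for comparison (a sketch; on Alpha/MIPS/RISC-V a decent compiler turns this into exactly the add and unsigned-compare pair an assembly programmer would write):

```c
#include <stdint.h>

/* Add with carry-out, flagless style: the carry is recovered by an
   unsigned comparison, since the sum wrapped iff it is below either
   operand. */
static inline uint64_t add_carry(uint64_t u1, uint64_t u2, uint64_t *carry)
{
    uint64_t sum = u1 + u2;
    *carry = sum < u1;    /* 1 on unsigned overflow, else 0 */
    return sum;
}
```

Note that the Forth +c above negates the comparison to get the usual -1/0 truth values; here the carry is left as 0/1, which chains more naturally into a following addition.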

hughag...@gmail.com

unread,
Mar 25, 2018, 4:45:12 PM
to
On Saturday, March 24, 2018 at 6:01:17 AM UTC-7, minf...@arcor.de wrote:
> Am Samstag, 24. März 2018 11:27:41 UTC+1 schrieb Albert van der Horst:
> > >The past days I tinkered our 32-bit Forth up to 64 bit. So far so good, it
> > >passes the core test. The 128/64 bit division was tough though. It's the
> > >classic binary shift-subtract-algorithm. I didn't find any simpler method
> > >in the web.
> > >
> > >Perhaps some good fellow could give me a hint how the 128/64 division could
> > >be composed of simple 64/32 bit division elements? =20
> >
> > Seriously?
> >
> > Assuming you're running a 64 bit processor:
>
> Yes thanks but I cannot use assembler.

I have D/ in the novice-package --- this is a 64/64 division done in a 32-bit ANS-Forth system --- this was actually written by Nathaniel Grosman in his "Forth Dimensions" article and is now in CF.4TH.

I don't know how to do a 128/64 division in 32-bit ANS-Forth --- that is more difficult.

Who are you? What is your real name? All I'm seeing here is minf...@arcor.de that doesn't tell me anything.

minf...@arcor.de

unread,
Mar 28, 2018, 4:38:01 AM
to
YOU are really the right one to whine about netiquette
with all your past foul words to other people in clf

Rob Sciuk

unread,
Mar 29, 2018, 12:14:00 PM
to

On Thu, 15 Mar 2018, minf...@arcor.de wrote:

> Date: Thu, 15 Mar 2018 09:11:28 -0700 (PDT)
> From: minf...@arcor.de
> Newsgroups: comp.lang.forth
> Subject: Basic 64-bit Forth considerations
>
> The standard intentionally does not specify cell sizes and address sizes (cell/s representing a pointer to memory).
>
> In a 32-bit Forth system everything is simple: cells and addresses are 32 bit wide.
>
> For a 64-bit Forth there are two obvious choices:
> A) cells 32-bit and addresses 64-bit
> B) cells 64-bit and addresses 64-bit
>
> A) is practical enough, matches most CPU instruction sets, Forth double
> numbers are still there for 64-bit math, and it means a rather simple
> extension starting from an existing 32-bit system.
>
> B) seems more linear at first glance, but there are tons of practical
> implementation challenges like limited support by simple CPU
> instructions, 128-bit integer math for doubles (who needs that
> monster?), 128-bit division/multiplication in number conversion words
> like # #S >NUMBER, etc.
>
> I am inclining to A) but I then am no real systems developer and may
> have overlooked important things. What were your 64-bit design
> considerations?
>

OneFileForth simply uses the native word size. There is no reason to
complicate your life and code by having double word pointers when they
should ideally fit into an object the same size as an int.

HTH,
Rob.

minf...@arcor.de

unread,
Mar 29, 2018, 2:51:46 PM3/29/18
to
Yes there is when objects are larger than your max address range.
Right now we are using files in memory. For big files we have to
rewind or swap (archive files can be up to 12 gig and contain data spanning
several years). Now we have a 32-bit system but could upgrade RAM up
to 16 gig.

hughag...@gmail.com

unread,
Mar 30, 2018, 12:46:43 AM
to
You are still not telling us your real name.

Just making a guess here, but this is one of Marcel Hendrix's sock-puppets.
https://groups.google.com/forum/#!topic/comp.lang.forth/wP5nw1ClzsM%5B1-25%5D
Marcel Hendrix posted my code on comp.lang.forth with my name and copyright notice removed,
claimed that he had improved it, then admitted that he didn't know what it did.
Is that you?

Ron Aaron

unread,
Mar 30, 2018, 1:06:02 AM
to


On 30/03/2018 7:46, hughag...@gmail.com wrote:

> You are still not telling us your real name.

Why should he tell you his real name? So you can dox him, or what?

Alex McDonald

unread,
Mar 30, 2018, 4:45:38 AM
to
Hugh isn't Hugh's real name either. And Alex isn't mine. It's a funny
old world.

--
Alex

john

unread,
Mar 30, 2018, 6:38:59 AM
to
In article <e53e03f1-89aa-497e...@googlegroups.com>,
minf...@arcor.de says...
> Yes there is when objects are larger than your max address range.
> Right now we are using files in memory. For big files we have to
> rewind or swap (archive files can be up to 12 gig and contain data spanning
> several years). Now we have a 32-bit system but could upgrade RAM up
> to 16 gig.
>
>

I'm just curious - is there some reason you need to have so much
data in memory at the same time?

I have seen 2 systems that benefit (need) vast data records in memory
but they were quite specialised applications and were not run on
desktop systems by any means.
I'm guessing climate modelling and that sort of thing may benefit too
but since I've never had access to the models I don't really know how they are
implemented.

If you only need up to 16GB-ish maybe an SSD could make chunking fast
enough (SSDs are crippled by the manufacturers because they outperform
data buses - they can make a very slow system really responsive)
Second gen SSD tech (used instead of ram) will be even faster and I think is now
becoming available from Intel at least when used as a system/boot ram.
The problems with using it more are bus not ram related.

Maybe just re-designing your implementation method could help.


--

john

=========================
http://johntech.co.uk
=========================

minf...@arcor.de

unread,
Mar 30, 2018, 9:32:31 AM
to
Complete redesign is out of question. The data are archived plant process
signal recordings. The task is doing cross-correlation of various signal sets
over years for early detection of silent deterioration by aging machinery.
The goal is to get indicators for maintenance and preventive repair planning.
It's heavy number crunching but that's what computers are good for.

Right now it's done offline after copying from the archive to a processing
server. It would be nice if we could do it right in the control system.

john

unread,
Mar 30, 2018, 11:25:33 AM3/30/18
to
In article <e13a9a86-c4d2-4110...@googlegroups.com>,
minf...@arcor.de says...
> Complete redesign is out of question. The data are archived plant process
> signal recordings. The task is doing cross-correlation of various signal sets
> over years for early detection of silent deterioration by aging machinery.
> The goal is to get indicators for maintenance and preventive repair planning.
> It's heavy number crunching but that's what computers are good for.
>
> Right now it's done offline after copying from the archive to a processing
> server. It would be nice if we could do it right in the control system.
>
There are control systems specialists who do this I believe.
Just run a mile from anyone mentioning the "C" word. (cloud)

Data analytics isn't a field
I know much about - that being said I was a member of the Operational
Research Society (ORSOC) about 25 years ago but I don't think industrial
predictive data analytics was in much demand back then!

What I would say is sticking a plaster on it is probably not a good way to go
with Industry 4.0 snapping at your heels.

I don't know enough about what you are doing obviously but it does sound like
your problem is just number crunching which has some pretty standard solutions

Maybe you need to think about installing real time nodes (PLC/equipment
monitoring, etc). I believe those can process data and feed your control system
but all I have is academic knowledge of this. My entire knowledge of process
tech is just from working for an industrial graphics company for a couple of years.
(Control terminals/xray image processing etc).
You should be able to extract entire scheduling and reporting data from node
systems for both scheduling maintenance and end-of-life tasks though.
There's a few new interesting things in the trade mags these days.

Dumping data into big RAM on a PC isn't a way I'd want to go either.
I'm surprised your existing control system supplier isn't all over this.
I don't see you sticking with Forth for much longer in truth.

If you have budget issues maybe a number crunching plugin board
could help if such exists. There's a lot going on in the DSP/FPGA
world these days.

If you need more serious input feel free to contact me anytime.

Alex McDonald

unread,
Mar 30, 2018, 12:39:44 PM3/30/18
to
On 30-Mar-18 11:32, john wrote:

>
> If you only need up to 16GB-ish maybe an SSD could make chunking fast
> enough (SSD's are crippled by the manufacturers because they out
> perform data buses - they can make a very slow system really
> responsive)

SSDs aren't crippled by manufacturers; it's an engineering problem, not
a marketing decision. They're limited by the underlying speed of the
flash chips and the interfaces to the outside world. There's a mismatch
between the two. Current interfaces are SATA, SAS, and PCIe, which are
all much slower (both in bandwidth and latency numbers) than most flash
used to make the SSDs.

SSDs are all block based disk-like interfaces. They use I/O software
stacks that have microsecond latencies (in the order of 100us or less),
which, when used with traditional spinning disks and their millisecond
latencies (5ms to 7ms being typical), don't make much difference.

Flash is so much faster that latencies are in the 100s of us; that is,
many times faster than disk. The IO stack is now a significant part of
the overhead. Newer interfaces like NVDIMM based block devices can cut
the hardware overhead way down, but the software element is pretty fixed.

Even with those restrictions it's possible to do millions of IO
operations per second (IOPS) at sub-ms latencies on a single NVDIMM
flash device; compare that with 250 IOPS at 7ms for a fast hard disk drive.

Even with that, SSDs are now really at the limit of what can be
achieved. Updating a single byte takes a complete read and write of a 4K
block, which is quite an overhead. So...

> Second gen SSD tech (used instead of ram) will be even
> faster and I think is now becoming available from Intel at least
> when used as a system/boot ram. The problems with using it more are
> bus not ram related.

... we're moving on to memory bus connected.

Flash is block based, but the newer memories (Intel's is called 3D
Xpoint) are persistent (like flash) and byte addressable (unlike flash).
These are part of the memory space, and have speeds approaching that of
DRAM. They're not SSDs at all.

They bring a host of new opportunities and a shedload of programming
problems too.
https://www.snia.org/tech_activities/standards/curr_standards/npm

>
> Maybe just re-designing your implementation method could help.

New memory technologies are going to need new ways of addressing them,
but in the meanwhile, the easiest, fastest and most portable solution is
to do very large file access through memory mapping (mmap).
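A minimal POSIX sketch of that approach (the function name is mine; on a 32-bit build, files beyond 4 GB additionally need a 64-bit off_t, e.g. -D_FILE_OFFSET_BITS=64, and windowed rather than whole-file mappings):

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file read-only into the address space.
   Returns the mapping (and its length via *len), or NULL on failure. */
static void *map_whole_file(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);               /* the mapping stays valid after close */
    if (p == MAP_FAILED) return NULL;
    *len = (size_t)st.st_size;
    return p;
}
```

Pages are faulted in on demand, so a multi-gigabyte archive can be scanned with plain pointer arithmetic while the OS handles caching and read-ahead, whether the backing store is disk, flash, or persistent memory.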

--
Alex

Paul Rubin

unread,
Mar 30, 2018, 4:15:54 PM3/30/18
to
Alex McDonald <al...@rivadpm.com> writes:
> Flash is block based, but the newer memories (Intel's is called 3D
> Xpoint) are persistent (like flash) and byte addressable (unlike
> flash). These are part of the memory space, and have speeds
> approaching that of DRAM. They're not SSDs at all.

Of course 3d xpoint ("Optane") is solid state. It doesn't use any
spinning parts or vacuum tubes, and it's marketed as an SSD:

https://www.anandtech.com/show/12562/intel-previews-optane-enterprise-m2-ssd

Anyway, yes, mapping a big file into ram is often a convenient way to
access its contents.

Alex McDonald

unread,
Mar 30, 2018, 5:04:46 PM3/30/18
to
On 30-Mar-18 21:15, Paul Rubin wrote:
> Alex McDonald <al...@rivadpm.com> writes:
>> Flash is block based, but the newer memories (Intel's is called 3D
>> Xpoint) are persistent (like flash) and byte addressable (unlike
>> flash). These are part of the memory space, and have speeds
>> approaching that of DRAM. They're not SSDs at all.
>
> Of course 3d xpoint ("Optane") is solid state. It doesn't use any
> spinning parts or vacuum tubes, and it's marketed as an SSD:
>
> https://www.anandtech.com/show/12562/intel-previews-optane-enterprise-m2-ssd

Yes, Optane is an SSD; a solid state disk. 3D Xpoint is persistent
memory. Optanes are built out of 3D Xpoint. But I was not talking about
them.

The point is that 3D Xpoint is not /flash/ memory. Flash is only
readable in pages (big, say 4KB in size) and writeable in blocks, or
groups of pages (huge, typically 256KB or more). Flash is great for
making fast disks (with the caveat that they really only perform well
for random reads).

SSDs are a stop-gap form factor for these new PM (persistent memory)
technologies, since disk protocols and block access are not very
memory-like, and the block based interfaces and protocol stacks have
high overheads. It's like building disks out of battery backed
DRAM; expensive and not very interesting, since you can make DRAM memory
go a lot faster on the memory bus.

That's where PM technologies are being implemented; on the memory bus.
The form factor is NVDIMM. Wikipedia has an ok-ish summary.
https://en.wikipedia.org/wiki/NVDIMM

> Anyway, yes, mapping a big file into ram is often a convenient way to
> access its contents.

Not just RAM, but NVDIMM too; the file doesn't need to move, since it's
always in memory. The important bit is that mapping the file provides
independence from the underlying storage, whether it is PM or disk; and
byte addressability, which is more like memory than traditional storage.

--
Alex

Paul Rubin

unread,
Mar 30, 2018, 6:44:39 PM3/30/18
to
Alex McDonald <al...@rivadpm.com> writes:
> The point is that 3D Xpoint is not /flash/ memory.

Of course I know that: the only thing I was saying is that it's still an
SSD despite not being flash.

> That's where PM technologies are being implemented; on the memory
> bus. The form factor is NVDIMM. Wikipedia has an ok-ish
> summary. https://en.wikipedia.org/wiki/NVDIMM

That seems like a stopgap too, since it occupies a memory channel and
those are often scarce on cpus. They have to be physically close to the
CPU itself and the controllers for them consume silicon on the CPU die.
Using that for slower media (even Optane) isn't that great, especially
if you want lots of that type of storage (tens of sockets, say).

There needs to be something a little bit further away from the CPU but
closer than the PCI bus, to allow installing lots of these devices.
Maybe we'll see new boards and chipsets for that purpose sometime.

john

unread,
Mar 31, 2018, 5:40:59 AM
to
In article <p9lp8e$mh0$1...@dont-email.me>, al...@rivadpm.com says...
>
> SSDs aren't crippled by manufacturers; it's an engineering problem, not
> a marketing decision. They're limited by the underlying speed of the
> flash chips and the interfaces to the outside world. There's a mismatch
> between the two
>

yes - that's what I said.

Alex McDonald

unread,
Mar 31, 2018, 9:18:29 AM3/31/18
to
On 30-Mar-18 23:44, Paul Rubin wrote:
> Alex McDonald <al...@rivadpm.com> writes:
>> The point is that 3D Xpoint is not /flash/ memory.
>
> Of course I know that: the only thing I was saying is that it's still an
> SSD despite not being flash.

But I am not talking about the Optane (and haven't been); I'm talking
about 3D Xpoint and similar persistent memories.

>
>> That's where PM technologies are being implemented; on the memory
>> bus. The form factor is NVDIMM. Wikipedia has an ok-ish
>> summary. https://en.wikipedia.org/wiki/NVDIMM
>
> That seems like a stopgap too, since it occupies a memory channel and

Yes, it's an issue, but not for the reasons you give.

> those are often scarce on cpus. They have to be physically close to the
> CPU itself and the controllers for them consume silicon on the CPU die.

We already have OSes that support up to 24TB (Windows Server 2016, for
example), and Supermicro sell boxes that support 24TB. (x86-64 only
supports 48 address bits, so the absolute max currently is 256TB.)

The issue is getting the data in and out of these beasts, for which
there are several proposed networking technology improvements; NVMeoF
(NVMe over Fabrics) which supports fibre channel and coming soon, TCP;
and PMoF (persistent memory over fabrics) is more than a twinkle in
some people's eye. These are all classes of remote direct memory access
(RDMA) that allow high-throughput, low-latency networking because it
doesn't involve the CPU at all. The controller overheads are currently
in the single digit us/100s of ns range (ignoring the wire length).

> Using that for slower media (even Optane) isn't that great, especially

Again, not Optane. That's an SSD.

> if you want lots of that type of storage (tens of sockets, say).
>
> There needs to be something a little bit further away from the CPU but
> closer than the PCI bus, to allow installing lots of these devices.
> Maybe we'll see new boards and chipsets for that purpose sometime.
>

We'll see improvements and extensions to the memory bus. Here are some
links on 3 front runners; CCIX, Open CAPI and Gen-Z.

https://www.ccixconsortium.com/
http://opencapi.org/
https://genzconsortium.org/

--
Alex

Paul Rubin

unread,
Mar 31, 2018, 1:15:19 PM3/31/18
to
Alex McDonald <al...@rivadpm.com> writes:
> But I am not talking about the Optane (and haven't been); I'm talking
> about 3D Xpoint and similar persistent memories.

Optane is the brand name Intel uses for SSDs built from 3D XPoint
memory.

> We already have OSes up to 24TB (Windows Server 2016 for example), and
> Supermicro sell boxes that support 24TB.

But only super duper high end multi-socket cpus can address that much
ram, and only with ultra expensive extra-dense ram modules. Actually do
you know the box model offhand? Is it really 24TB on a single compute
node? If it's multiple servers in a single box that doesn't really
count. The biggest servers I know of have 6TB, so maybe that box has 4
of those inside.

If you use a reasonable single-socket cpu then you will be short of
channels quickly if you start using it for nvdimm.

> These are all classes of remote direct memory access (RDMA) that
> allows high-throughput, low-latency networking

Yeah, that stuff has been around for a while. Intel has some kind of
network interface that even bypasses RAM and goes directly to the cpu's
L2 cache.

>> Using that for slower media (even Optane) isn't that great, especially
> Again, not Optane. That's an SSD.

Optane is an SSD.

> We'll see improvements and extensions to the memory bus. Here are some
> links on 3 front runners; CCIX, Open CAPI and Gen-Z.

Cool!

Alex McDonald

unread,
Mar 31, 2018, 2:19:42 PM3/31/18
to
On 31-Mar-18 18:15, Paul Rubin wrote:
> Alex McDonald <al...@rivadpm.com> writes:
>> But I am not talking about the Optane (and haven't been); I'm
>> talking about 3D Xpoint and similar persistent memories.
>
> Optane is the brand name Intel uses for SSDs built from 3D XPoint
> memory.

Yes. Yes, I know.

> Actually do you know the box model offhand? Is it really 24TB on a
> single compute node? If it's multiple servers in a single box that
> doesn't really count. The biggest servers I know of have 6TB, so
> maybe that box has 4 of those inside.

SGI's UV 300 has 32 sockets and 24TB of coherent shared memory.

>> These are all classes of remote direct memory access (RDMA) that
>> allows high-throughput, low-latency networking
>
> Yeah, that stuff has been around for a while. Intel has some kind
> of network interface that even bypasses RAM and goes directly to the
> cpu's L2 cache.

DDIO.
https://www.intel.co.uk/content/www/uk/en/io/data-direct-i-o-technology.html

>
>>> Using that for slower media (even Optane) isn't that great,
>>> especially
>> Again, not Optane. That's an SSD.
>
> Optane is an SSD.

There's definitely an echo in here.

--
Alex

Paul Rubin

unread,
Mar 31, 2018, 3:18:58 PM3/31/18
to
Alex McDonald <al...@rivadpm.com> writes:
> SGI's UV 300 has 32 sockets and 24TB of coherent shared memory.

https://insidehpc.com/2014/12/introducing-sgi-uv-300-big-memory-supercomputer/

That page (from 2014) says to get 24TB you have to connect 8 boxes
together (5U size boxes). So there will be interconnect delay (6 feet
of cable among other things) slower than what we usually think of as a
memory bus. AND you will have to populate all the CPU sockets with E7
CPUs. So that sounds like an awfully expensive way to free up ram
sockets for your nvdimms.

It does say they get in 24 DIMMS per CPU socket, which is a lot more
than I've seen in other systems, so it's still interesting.

> DDIO.

Yes, that, thanks.

Alex McDonald

unread,
Mar 31, 2018, 5:33:30 PM3/31/18
to
On 31-Mar-18 20:18, Paul Rubin wrote:
> Alex McDonald <al...@rivadpm.com> writes:
>> SGI's UV 300 has 32 sockets and 24TB of coherent shared memory.
>
> https://insidehpc.com/2014/12/introducing-sgi-uv-300-big-memory-supercomputer/
>
> That page (from 2014) says to get 24TB you have to connect 8 boxes
> together (5U size boxes). So there will be interconnect delay (6 feet
> of cable among other things) slower than what we usually think of as a
> memory bus. AND you will have to populate all the CPU sockets with E7
> CPUs. So that sounds like an awfully expensive way to free up ram
> sockets for your nvdimms.
>
> It does say they get in 24 DIMMS per CPU socket, which is a lot more
> than I've seen in other systems, so it's still interesting.

How about the HPE Superdome X?
https://h20195.www2.hpe.com/v2/GetPDF.aspx/c04383189.pdf

16 processors, 384 DIMM slots with up to 48 TB of DDR4 memory.

(I'm no processor expert btw. Storage & storage networking is my area.)

--
Alex

Paul Rubin

unread,
Mar 31, 2018, 6:26:03 PM3/31/18
to
Alex McDonald <al...@rivadpm.com> writes:
> How about the HPE Superdome X?
> https://h20195.www2.hpe.com/v2/GetPDF.aspx/c04383189.pdf
> 16 processors, 384 DIMM slots with up to 48 TB of DDR4 memory.
> (I'm no processor expert btw. Storage & storage networking is my area.)

This is also 8 separate computers in a single box, connected by some
kind of communication fabric. It's for huge-memory HPC problems which
are a limited area these days. The numerics supercomputing crowd has
long since shifted to GPU-based machines, and almost everyone else tries
to parallelize their workload across clusters of normal (largish)
machines if they can. But EVERYONE wants tons of storage.

It looks like some of those NVDIMMs are ordinary RAM backed by
capacitor-powered persistent storage right there on the module, so the
data is saved in the event of a sudden power loss. There would also
have to be enough capacitor power to let the CPU flush its caches to ram
before the ram persists. For this purpose flash seems ok as a
persistence medium.

I guess it's an interesting approach since it might be more reliable
than powering a computer through a normal UPS. But, the OS and
application software would both have to know about the nvdimm and
essentially treat it like a disk. I.e. you'd have to read and write it
with database-like operations so that the ram contents are never in an
inconsistent state. And because its capacity is limited, you'd still
have to flush out its contents to conventional disk/ssd. The main
application I see is database servers, where the database software
itself would be in charge of the nvdimm and know how to manage it.

Alex McDonald

unread,
Mar 31, 2018, 7:10:33 PM3/31/18
to
On 31-Mar-18 23:25, Paul Rubin wrote:
> Alex McDonald <al...@rivadpm.com> writes:
>> How about the HPE Superdome X?
>> https://h20195.www2.hpe.com/v2/GetPDF.aspx/c04383189.pdf
>> 16 processors, 384 DIMM slots with up to 48 TB of DDR4 memory.
>> (I'm no processor expert btw. Storage & storage networking is my area.)
>
> This is also 8 separate computers in a single box, connected by some
> kind of communication fabric. It's for huge-memory HPC problems which
> are a limited area these days. The numerics supercomputing crowd has

SAP/HANA, Apache Ignite, and other big database/big data applications.
In-memory is very popular.

> long since shifted to GPU-based machines, and almost everyone else tries
> to parallelize their workload across clusters of normal (largish)
> machines if they can. But EVERYONE wants tons of storage.
>
> It looks like some of those NVDIMMs are ordinary RAM backed by
> capacitor-powered persistent storage right there on the module, so the
> data is saved in the event of a sudden power loss. There would also
> have to be enough capacitor power to let the CPU flush its caches to ram
> before the ram persists. For this purpose flash seems ok as a
> persistence medium.

NVDIMM-F or NVDIMM-N. They use supercaps on the DIMM to provide the
power to copy from DRAM to flash. NVDIMM-P will employ 3D Xpoint/ReRAM
type memory.

>
> I guess it's an interesting approach since it might be more reliable
> than powering a computer through a normal UPS. But, the OS and
> application software would both have to know about the nvdimm and
> essentially treat it like a disk. I.e. you'd have to read and write it
> with database-like operations so that the ram contents are never in an
> inconsistent state. And because its capacity is limited, you'd still
> have to flush out its contents to conventional disk/ssd. The main
> application I see is database servers, where the database software
> itself would be in charge of the nvdimm and know how to manage it.

The point is that you don't need to treat persistent memory like disk;
you can treat it like memory. Loads and stores rather than I/O
operations, using the C++11 memory model for consistency (that also
handles the cache issues you raised). I gave a link about the approaches
used in a previous thread.

https://www.snia.org/tech_activities/standards/curr_standards/npm

As a stepping stone and to remove the need for application changes,
there's support for file systems in PM (the document refers to it as NVM
or non-volatile memory) that is built into both Linux and Windows; DAX
(Direct Access).


--
Alex