Permutation Tensor

Arnold Doray

unread,

Jul 5, 2012, 3:13:53 AM7/5/12

to

What's the best way to implement the 3D permutation tensor E(I,J,K)?

Thanks,
Arnold

A. K.

unread,

Jul 5, 2012, 3:25:07 AM7/5/12

to

There is no generic answer to your question.
It really depends on
- which planes you need most
- whether you use another library
- whether you intend to use the GPU
- which tensor elements you access mostly
etc etc

Arnold Doray

unread,

Jul 5, 2012, 4:42:47 AM7/5/12

to

No, I mean the *permutation* tensor... not tensors in general.

Given I, J, K on the stack, return the -1, 1 or 0.

-AD

A. K.

unread,

Jul 5, 2012, 5:25:43 AM7/5/12

to

How about
( i j k --> n ) 10 * + 10 * +
then CASEs
n = 123 or 312 or 231 --> eps = 1
n = 132 or 321 or 213 --> eps = -1
otherwise --> eps = 0

Krishna Myneni

unread,

Jul 6, 2012, 9:18:34 PM7/6/12

to

I would use a precomputed binary lookup table -- there are only 27
elements x 2 bits/element = 54 bits, so the entire tensor should fit
into a single 64-bit word. Now, you need a mapping function from {i,
j, k} to the start bit position, but that's easy.

Krishna

Krishna Myneni

unread,

Jul 6, 2012, 9:33:40 PM7/6/12

to

Spoke too soon. A.K.'s method is easier.

Krishna

Krishna Myneni

unread,

Jul 6, 2012, 9:49:42 PM7/6/12

to

A slight correction is necessary to the above:

\ levi-civita.4th
\
\ Return the value of the Levi-Civita symbol in 3D, also known as
\ the 3D permutation tensor.
\

\ For i,j,k in {1,2,3}, returns n = -1, 0, or 1
: lcsymbol ( i j k -- n )
>r >r 10 * r> + 10 * r> +
case
123 of 1 endof
231 of 1 endof
312 of 1 endof
132 of -1 endof
213 of -1 endof
321 of -1 endof
0 swap
endcase
;

KM

Paul Rubin

unread,

Jul 6, 2012, 11:46:22 PM7/6/12

to

"A. K." <a...@nospam.org> writes:
> then CASEs
> n = 123 or 312 or 231 --> eps = 1
> n = 132 or 321 or 213 --> eps = -1
> otherwise --> eps = 0

With locals:

: eps { i j k -- n }
i j - j k - * k i - * 2/ ;

Marcel Hendrix

unread,

Jul 7, 2012, 2:32:40 AM7/7/12

to

Krishna Myneni <krishna...@ccreweb.org> writes Re: Permutation Tensor

[..]

> \ For i,j,k in {1,2,3}, returns n = -1, 0, or 1
> : lcsymbol ( i j k -- n )
> >r >r 10 * r> + 10 * r> +
> case
> 123 of 1 endof
> 231 of 1 endof
> 312 of 1 endof
> 132 of -1 endof
> 213 of -1 endof
> 321 of -1 endof
> 0 swap
> endcase
> ;

Why not:

: lcsymbol2 ( i j k -- n )
swap 2 lshift or swap 4 lshift or
case [ 4 base ! ]

123 of 1 endof
231 of 1 endof
312 of 1 endof
132 of -1 endof
213 of -1 endof
321 of -1 endof

[ decimal ]
0 swap
endcase ;

: TEST 4 1 DO
4 1 DO
4 1 DO
CR K . J . I .
K J I lcsymbol .
K J I lcsymbol2 .
LOOP
LOOP
LOOP ;

-marcel

A. K.

unread,

Jul 7, 2012, 3:49:07 AM7/7/12

to

shorter:

... 4 lshift or 4 lshift or
case
$321 of ... et cetera

Krishna Myneni

unread,

Jul 7, 2012, 8:35:43 AM7/7/12

to

On Jul 7, 1:32 am, m...@iae.nl (Marcel Hendrix) wrote:
> Krishna Myneni <krishna.myn...@ccreweb.org> writes Re: Permutation Tensor

Definitely an improvement! Now, if we can replace the case
statement ...

Krishna

Krishna Myneni

unread,

Jul 7, 2012, 8:36:09 AM7/7/12

to

Nice.

KM

Marcel Hendrix

unread,

Jul 7, 2012, 9:45:03 AM7/7/12

to

Krishna Myneni <krishna...@ccreweb.org> writes Re: Permutation Tensor

> On Jul 7, 1:32 am, m...@iae.nl (Marcel Hendrix) wrote:
>> Krishna Myneni <krishna.myn...@ccreweb.org> writes Re: Permutation Tensor

[..]

> Definitely an improvement! Now, if we can replace the case
> statement ...

CREATE tab #64 CELLS ALLOT

MARKER -tab
: cases ( ix -- val )
case [ 4 base ! ]
012 of 1 endof
120 of 1 endof
201 of 1 endof
021 of -1 endof
102 of -1 endof
210 of -1 endof
[ decimal ]
0 swap
endcase ;

: lcsymbol2 ( i j k -- n )

1-
swap 1- 2 lshift or
swap 1- 4 lshift or cases ;

:NONAME tab #64 0 DO I cases OVER ! CELL+ LOOP drop ; EXECUTE -tab

\ For i,j,k in {1,2,3}, returns n = -1, 0, or 1

: lcsymbol3 ( i j k -- n ) 1- swap 1- 2 lshift or swap 1- 4 lshift or CELLS tab + @ ;

: tt 1 2 3 lcsymbol3 . ; see tt
Flags: ANSI
$0124CC40 : tt
$0124CC4A push $0124BEB0 qword-offset
$0124CC50 jmp .+10 ( $011398A2 ) offset NEAR
$0124CC55 ;

Bytes would be even shorter, but then we have to do sign extension 255 -> -1.

-marcel

Krishna Myneni

unread,

Jul 7, 2012, 10:28:47 AM7/7/12

to

On Jul 6, 8:18 pm, Krishna Myneni <krishna.myn...@ccreweb.org> wrote:

Just for fun, I went ahead and implemented a lookup table version
also.

\ WARNING: this version assumes a minimum 32 bit cell size and
\ little endian ordering! Also, i,j,k are zero-based!

\ We will use 3 cells to hold the tensor. Each cell will
\ represent one plane of the 3D tensor. k will index the cell.
\ i and j will index the bits in the cell.

create lctable 1639701 , 1316133 , 1381905 ,

\ i, j, k are zero based!

: lcsym4 ( i j k -- n )
cells lctable + @ \ return the k^th tensor plane
>r
3 lshift over 2* + nip \ compute the bit offset
r> swap rshift 3 and 1-
;

Krishna Myneni

unread,

Jul 7, 2012, 10:32:54 AM7/7/12

to

On Jul 7, 8:45 am, m...@iae.nl (Marcel Hendrix) wrote:
> Krishna Myneni <krishna.myn...@ccreweb.org> writes Re: Permutation Tensor
>
> > On Jul 7, 1:32 am, m...@iae.nl (Marcel Hendrix) wrote:
> >> Krishna Myneni <krishna.myn...@ccreweb.org> writes Re: Permutation Tensor
>
> [..]
>
> > Definitely an improvement! Now, if we can replace the case
> > statement ...
>
> CREATE tab #64 CELLS ALLOT

...

>
> Bytes would be even shorter, but then we have to do sign extension 255 -> -1.
>

You can remap {-1,0,1} to {0,1,2} and just subtract 1 after fetching
the value. I implemented the table using 3 cells (12 bytes on a 32-bit
system), and I expect it might be faster than your solution above.

Krishna

Krishna Myneni

unread,

Jul 7, 2012, 11:02:05 AM7/7/12

to

On Jul 7, 9:28 am, Krishna Myneni <krishna.myn...@ccreweb.org> wrote:
> On Jul 6, 8:18 pm, Krishna Myneni <krishna.myn...@ccreweb.org> wrote:
>
> > On Jul 5, 2:13 am, Arnold Doray <inva...@invalid.com> wrote:
>
> > > What's the best way to implement the 3D permutation tensor E(I,J,K)?
>
> > > Thanks,
> > > Arnold
>
> > I would use a precomputed binary lookup table -- there are only 27
> > elements x 2 bits/element = 54 bits, so the entire tensor should fit
> > into a single 64-bit word. Now, you need a mapping function from {i,
> > j, k} to the start bit position, but that's easy.
>
> > Krishna
>
> Just for fun, I went ahead and implemented a lookup table version
> also.
>
> \ WARNING: this version assumes a minimum 32 bit cell size and
> \ little endian ordering! Also, i,j,k are zero-based!
>

The warning is incorrect. The code should work fine on a big endian
system as well, provided the cell size is at least 32 bits.

KM

Marcel Hendrix

unread,

Jul 7, 2012, 11:53:39 AM7/7/12

to

Krishna Myneni <krishna...@ccreweb.org> writes Re: Permutation Tensor

> On Jul 7, 8:45=A0am, m...@iae.nl (Marcel Hendrix) wrote:
>> Krishna Myneni <krishna.myn...@ccreweb.org> writes Re: Permutation Tensor

>> > On Jul 7, 1:32 am, m...@iae.nl (Marcel Hendrix) wrote:
>> >> Krishna Myneni <krishna.myn...@ccreweb.org> writes Re: Permutation Tensor

[..]

> You can remap {-1,0,1} to {0,1,2} and just subtract 1 after fetching
> the value.

A very good idea.

> I implemented the table using 3 cells (12 bytes on a 32-bit
> system), and I expect it might be faster than your solution above.

Apparently not, at least not with iForth.

The difference between branching and table lookup is unexpectedly large
- I thought that memory was supposed to be slow. Apparently branching is
becoming very expensive.

-marcel

-- ------------------------------------------------

\ For i,j,k in {1,2,3}, returns n = -1, 0, or 1

: lcsymbol ( i j k -- n )
>r >r 10 * r> + 10 * r> +
case
123 of 1 endof
231 of 1 endof
312 of 1 endof
132 of -1 endof
213 of -1 endof
321 of -1 endof
0 swap
endcase
;

: lcsymbol2 ( i j k -- n )
swap 2 lshift or swap 4 lshift or
case [ 4 base ! ]
123 of 1 endof
231 of 1 endof
312 of 1 endof
132 of -1 endof
213 of -1 endof

321 of -1 endof
[ decimal ]
0 swap
endcase ;

CREATE tab #64 CHARS ALLOT

MARKER -tab
: cases ( ix -- val )
case [ 4 base ! ]

012 of 2 endof
120 of 2 endof
201 of 2 endof
021 of 0 endof
102 of 0 endof
210 of 0 endof
[ decimal ]
1 swap
endcase ;

: lcsymbolX ( i j k -- n )

1-
swap 1- 2 lshift or

swap 1- 4 lshift or cases 1- ;

:NONAME tab #64 0 DO I cases OVER C! CHAR+ LOOP drop ; EXECUTE -tab

\ For i,j,k in {1,2,3}, returns n = -1, 0, or 1

: lcsymbol3 ( i j k -- n ) 1- swap 1- 2 lshift or swap 1- 4 lshift or tab + C@ 1- ;

create lctable 1639701 , 1316133 , 1381905 ,

\ i, j, k are zero based!

: lcsymbol4 ( i j k -- n )

cells lctable + @ \ return the k^th tensor plane
>r
3 lshift over 2* + nip \ compute the bit offset
r> swap rshift 3 and 1-
;

: TEST ( -- n ) 0 4 1 DO 4 1 DO 4 1 DO K J I lcsymbol + LOOP LOOP LOOP ;
: TEST2 ( -- n ) 0 4 1 DO 4 1 DO 4 1 DO K J I lcsymbol2 + LOOP LOOP LOOP ;
: TEST3 ( -- n ) 0 4 1 DO 4 1 DO 4 1 DO K J I lcsymbol3 + LOOP LOOP LOOP ;
: TEST4 ( -- n ) 0 3 0 DO 3 0 DO 3 0 DO K J I lcsymbol4 + LOOP LOOP LOOP ;

: TOPTEST ( u -- )
1 UMAX LOCAL #times
CR timer-reset 0 #times 0 ?DO TEST + LOOP . .elapsed
CR timer-reset 0 #times 0 ?DO TEST2 + LOOP . .elapsed
CR timer-reset 0 #times 0 ?DO TEST3 + LOOP . .elapsed
CR timer-reset 0 #times 0 ?DO TEST4 + LOOP . .elapsed ;

\ FORTH> 10000000 toptest
\ 0 2.997 seconds elapsed.
\ 0 2.972 seconds elapsed.
\ 0 1.648 seconds elapsed.
\ 0 1.695 seconds elapsed. ok

Krishna Myneni

unread,

Jul 7, 2012, 12:10:31 PM7/7/12

to

On Jul 7, 10:53 am, m...@iae.nl (Marcel Hendrix) wrote:
> Krishna Myneni <krishna.myn...@ccreweb.org> writes Re: Permutation Tensor
>
> > On Jul 7, 8:45=A0am, m...@iae.nl (Marcel Hendrix) wrote:
> >> Krishna Myneni <krishna.myn...@ccreweb.org> writes Re: Permutation Tensor
> >> > On Jul 7, 1:32 am, m...@iae.nl (Marcel Hendrix) wrote:
> >> >> Krishna Myneni <krishna.myn...@ccreweb.org> writes Re: Permutation Tensor
>
> [..]
>
> > You can remap {-1,0,1} to {0,1,2} and just subtract 1 after fetching
> > the value.
>
> A very good idea.
>
> > I implemented the table using 3 cells (12 bytes on a 32-bit
> > system), and I expect it might be faster than your solution above.
>
> Apparently not, at least not with iForth.
>
> The difference between branching and table lookup is unexpectedly large
> - I thought that memory was supposed to be slow. Apparently branching is
> becoming very expensive.
>

...

> : TOPTEST ( u -- )
> 1 UMAX LOCAL #times
> CR timer-reset 0 #times 0 ?DO TEST + LOOP . .elapsed
> CR timer-reset 0 #times 0 ?DO TEST2 + LOOP . .elapsed
> CR timer-reset 0 #times 0 ?DO TEST3 + LOOP . .elapsed
> CR timer-reset 0 #times 0 ?DO TEST4 + LOOP . .elapsed ;
>
> \ FORTH> 10000000 toptest
> \ 0 2.997 seconds elapsed.
> \ 0 2.972 seconds elapsed.
> \ 0 1.648 seconds elapsed.
> \ 0 1.695 seconds elapsed. ok

I stand corrected! How about Paul's locals version? It appears that
recent gforth's UTIME is broken, or I would have tested the speed of
the locals version. I don't provide locals in kForth, and using VALUEs
gave me an even slower performance than the branching version.

Krishna

Marcel Hendrix

unread,

Jul 7, 2012, 12:41:18 PM7/7/12

to

Krishna Myneni <krishna...@ccreweb.org> writes Re: Permutation Tensor

> On Jul 7, 10:53 am, m...@iae.nl (Marcel Hendrix) wrote:
>> Krishna Myneni <krishna.myn...@ccreweb.org> writes Re: Permutation Tensor

[..]

> I stand corrected! How about Paul's locals version?

[..]

Nice register based code (when using the iForth PARAMS|), but the
multiplications apparently slow it down (eps). With Standard LOCALS| the
code is prepostorously slow (eps2).

-marcel

-- ------------------------------------------------
\ For i,j,k in {1,2,3}, returns n = -1, 0, or 1
: lcsymbol ( i j k -- n )
>r >r 10 * r> + 10 * r> +
case
123 of 1 endof
231 of 1 endof
312 of 1 endof

[..]

: eps params| i j k | ( i j k -- n ) i j - j k - * k i - * 2/ ;
: eps2 LOCALS| k j i | ( i j k -- n ) i j - j k - * k i - * 2/ ;

[..]

: TESTe ( -- n ) 0 4 0 DO 4 0 DO 4 0 DO K J I eps + LOOP LOOP LOOP ;
: TESTe2 ( -- n ) 0 4 0 DO 4 0 DO 4 0 DO K J I eps2 + LOOP LOOP LOOP ;

: TOPTEST ( u -- )
1 UMAX LOCAL #times

CR ." lcsymbol : " timer-reset 0 #times 0 ?DO TEST + LOOP . .elapsed
CR ." lcsymbol2 : " timer-reset 0 #times 0 ?DO TEST2 + LOOP . .elapsed
CR ." lcsymbol3 : " timer-reset 0 #times 0 ?DO TEST3 + LOOP . .elapsed
CR ." lcsymbol4 : " timer-reset 0 #times 0 ?DO TEST4 + LOOP . .elapsed
CR ." eps : " timer-reset 0 #times 0 ?DO TESTe + LOOP . .elapsed
CR ." eps2 : " timer-reset 0 #times 0 ?DO TESTe2 + LOOP . .elapsed ;

\ FORTH> 10000000 toptest
\ lcsymbol : 0 3.059 seconds elapsed.
\ lcsymbol2 : 0 2.773 seconds elapsed.
\ lcsymbol3 : 0 1.492 seconds elapsed.
\ lcsymbol4 : 0 1.531 seconds elapsed.
\ eps : 0 2.850 seconds elapsed.
\ eps2 : 0 5.933 seconds elapsed. ok

Albert van der Horst

unread,

Jul 7, 2012, 3:58:55 PM7/7/12

to

In article <jt3erh$cd2$1...@dont-email.me>,

Arnold Doray <inv...@invalid.com> wrote:
>What's the best way to implement the 3D permutation tensor E(I,J,K)?

Something along
: eps
3DUP + + 3 MOD IF 3DROP 0 EXIT THEN
/ At this point we know that all indices are different.
BEGIN DUP 3 <> WHILE ROT REPEAT
DROP < 2* 1-
;

Come on boys, cases!

>
>Thanks,
>Arnold

--
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Krishna Myneni

unread,

Jul 7, 2012, 3:24:10 PM7/7/12

to

Below are the test results with gforth-fast (Gforth 0.7.9-20101227).
There's no problem with
gforth's UTIME, just with the user :) . Times are in milliseconds.

$ gforth-fast levi-civita.fs

\ lcsymbol : 0 1192
\ lcsymbol2 : 0 1083
\ lcsymbol3 : 0 503
\ lcsymbol4 : 0 523
\ eps : 0 1852
\ eps2 : 0 1857

Apparently, use of locals can have a severe drawback in efficiency.

The full test code is given below:

--
\ levi-civita.fs

\
\ Return the value of the Levi-Civita symbol in 3D, also known as
\ the 3D permutation tensor.
\

\ For i,j,k in {1,2,3}, returns n = -1, 0, or 1

\ Version by AK and KM, posted on comp.lang.forth,
\ 6 July 2012.

: lcsymbol ( i j k -- n )
>r >r 10 * r> + 10 * r> +
case
123 of 1 endof
231 of 1 endof
312 of 1 endof

132 of -1 endof
213 of -1 endof
321 of -1 endof
0 swap
endcase
;

\ Version by AK, KM, and MH, posted on comp.lang.forth,
\ 7 July 2012.

: lcsymbol2 ( i j k -- n )
swap 2 lshift or swap 4 lshift or
case [ 4 base ! ]

123 of 1 endof
231 of 1 endof
312 of 1 endof

132 of -1 endof
213 of -1 endof
321 of -1 endof
[ decimal ]
0 swap
endcase ;

\ Paul Rubin's solution using locals, posted on comp.lang.forth,
\ 6 July 2012.

: eps { i j k -- n }

i j - j k - * k i - * 2/ ;

\ Using standard locals

: eps2 LOCALS| k j i | ( i j k -- n ) i j - j k - * k i - * 2/ ;

\ Faster version with lookup table by MH, posted on c.l.f,
\ on 7 July 2012.

CREATE tab #64 CHARS ALLOT

MARKER -tab
: cases ( ix -- val )
case [ 4 base ! ]
012 of 2 endof
120 of 2 endof
201 of 2 endof
021 of 0 endof
102 of 0 endof
210 of 0 endof
[ decimal ]
1 swap
endcase ;

: lcsymbolX ( i j k -- n )
1-
swap 1- 2 lshift or
swap 1- 4 lshift or cases 1- ;

:NONAME tab #64 0 DO I cases OVER C! CHAR+ LOOP drop ; EXECUTE -tab

\ For i,j,k in {1,2,3}, returns n = -1, 0, or 1

: lcsymbol3 ( i j k -- n ) 1- swap 1- 2 lshift or swap 1- 4 lshift or
tab + C@ 1- ;

\ Alternate table lookup which uses less storage, by KM.
\ Posted on c.l.f. on 7 July 2012.

\ We will use 3 cells to hold the tensor. Each cell will
\ represent one plane of the 3D tensor. k will index the cell.
\ i and j will index the bits in the cell.

\
\ For this implementation, we will take the indices i,j,k to
\ be 0-based.

create lctable 1639701 , 1316133 , 1381905 ,

: lcsymbol4 ( i j k -- n )

cells lctable + @ \ return the appropriate tensor plane

>r
3 lshift over 2* + nip \ compute the bit offset
r> swap rshift 3 and 1-
;

\ Speed Comparison
1 [IF]
: ms@ ( -- u ) utime 1 1000 m*/ d>s ;

: TEST ( -- n ) 0 4 1 DO 4 1 DO 4 1 DO K J I lcsymbol + LOOP
LOOP LOOP ;
: TEST2 ( -- n ) 0 4 1 DO 4 1 DO 4 1 DO K J I lcsymbol2 + LOOP
LOOP LOOP ;
: TEST3 ( -- n ) 0 4 1 DO 4 1 DO 4 1 DO K J I lcsymbol3 + LOOP
LOOP LOOP ;
: TEST4 ( -- n ) 0 3 0 DO 3 0 DO 3 0 DO K J I lcsymbol4 + LOOP
LOOP LOOP ;

: TESTe ( -- n ) 0 4 0 DO 4 0 DO 4 0 DO K J I eps + LOOP
LOOP LOOP ;
: TESTe2 ( -- n ) 0 4 0 DO 4 0 DO 4 0 DO K J I eps2 + LOOP
LOOP LOOP ;

: .elapsed ( msstart msend -- ) swap - . ;

1000000 value #times

: TOPTEST ( u -- )

CR ms@ 0 #times 0 ?DO TEST + LOOP ms@ swap . .elapsed
CR ms@ 0 #times 0 ?DO TEST2 + LOOP ms@ swap . .elapsed
CR ms@ 0 #times 0 ?DO TEST3 + LOOP ms@ swap . .elapsed
CR ms@ 0 #times 0 ?DO TEST4 + LOOP ms@ swap . .elapsed
CR ms@ 0 #times 0 ?DO TESTe + LOOP ms@ swap . .elapsed
CR ms@ 0 #times 0 ?DO TESTe2 + LOOP ms@ swap . .elapsed
;

toptest cr
bye
[THEN]

Paul Rubin

unread,

Jul 7, 2012, 4:03:09 PM7/7/12

to

m...@iae.nl (Marcel Hendrix) writes:
> Nice register based code (when using the iForth PARAMS|), but the
> multiplications apparently slow it down (eps). With Standard LOCALS| the
> code is prepostorously slow (eps2).

That's kind of interesting about the multiplications, on a modern
machine. Can you change them to additions (so it will give the wrong
answer) and see if it's faster? Thanks.

Marcel Hendrix

unread,

Jul 7, 2012, 5:54:43 PM7/7/12

to

Paul Rubin <no.e...@nospam.invalid> writes Re: Permutation Tensor

My hypothesis was wrong, the multiplication isn't important.

I completely rewrote the test because the loop overhead was too high
with respect to the work done by the lcsymbol code. With the new test,
all non-local implementations have approximately the same speed, and the
locals add a sizable overhead (because iForth doesn't inline words that
use locals.)

Also interesting: lcsymbol4 is very fast, but only for 64 bit code where
more free registers are available to the code generator.

-marcel

-- ----------
ANEW -lcsymbol

\ For i,j,k in {1,2,3}, returns n = -1, 0, or 1
: lcsymbol ( i j k -- n )
>r >r 10 * r> + 10 * r> +
case
123 of 1 endof
231 of 1 endof
312 of 1 endof

132 of -1 endof
213 of -1 endof
321 of -1 endof
0 swap
endcase
;

4 BASE !

: lcsymbol2 ( i j k -- n )

swap 2 lshift or swap 10 lshift or

case
123 of 1 endof
231 of 1 endof
312 of 1 endof
132 of -1 endof
213 of -1 endof
321 of -1 endof
0 swap
endcase ;

DECIMAL

1 VALUE a
2 VALUE b
3 VALUE c

CREATE tab #64 CHARS ALLOT tab #64 CONST-DATA

MARKER -tab
: cases ( ix -- val )
case [ 4 base ! ]
012 of 2 endof
120 of 2 endof
201 of 2 endof
021 of 0 endof
102 of 0 endof
210 of 0 endof
[ decimal ]
1 swap
endcase ;

: lcsymbolX ( i j k -- n )
1-
swap 1- 2 lshift or
swap 1- 4 lshift or cases 1- ;

:NONAME tab #64 0 DO I cases OVER C! CHAR+ LOOP drop ; EXECUTE -tab

\ For i,j,k in {1,2,3}, returns n = -1, 0, or 1
: lcsymbol3 ( i j k -- n ) 1- swap 1- 2 lshift or swap 1- 4 lshift or tab + C@ 1- ;

create lctable 1639701 , 1316133 , 1381905 , lctable 3 cells CONST-DATA

\ i, j, k are zero based!

: lcsymbol4 ( i j k -- n )

cells lctable + @ \ return the k^th tensor plane

>r
3 lshift over 2* + nip \ compute the bit offset
r> swap rshift 3 and 1-
;

: eps params| i j k | ( i j k -- n ) i j - j k - * k i - * 2/ ;
: eps2 LOCALS| k j i | ( i j k -- n ) i j - j k - * k i - * 2/ ;

: weps params| i j k | ( i j k -- n ) i j - j k - + k i - + 2/ ;
: weps2 LOCALS| k j i | ( i j k -- n ) i j - j k - + k i - + 2/ ;

: TEST ( n1 -- n2 ) 0 SWAP 0 DO a b c lcsymbol + a b c lcsymbol + a b c lcsymbol + a b c lcsymbol + LOOP drop ;
: TEST2 ( n1 -- n2 ) 0 SWAP 0 DO a b c lcsymbol2 + a b c lcsymbol2 + a b c lcsymbol2 + a b c lcsymbol2 + LOOP drop ;
: TEST3 ( n1 -- n2 ) 0 SWAP 0 DO a b c lcsymbol3 + a b c lcsymbol3 + a b c lcsymbol3 + a b c lcsymbol3 + LOOP drop ;
: TEST4 ( n1 -- n2 ) 0 SWAP 0 DO a b c lcsymbol4 + a b c lcsymbol4 + a b c lcsymbol4 + a b c lcsymbol4 + LOOP drop ;
: TESTe ( n1 -- n2 ) 0 SWAP 0 DO a b c eps + a b c eps + a b c eps + a b c eps + LOOP drop ;
: TESTe2 ( n1 -- n2 ) 0 SWAP 0 DO a b c eps2 + a b c eps2 + a b c eps2 + a b c eps2 + LOOP drop ;
: TESTwe ( n1 -- n2 ) 0 SWAP 0 DO a b c weps + a b c weps + a b c weps + a b c weps + LOOP drop ;
: TESTwe2 ( n1 -- n2 ) 0 SWAP 0 DO a b c weps2 + a b c weps2 + a b c weps2 + a b c weps2 + LOOP drop ;

: TOPTEST ( u -- )

2/ 2/ 1 UMAX LOCAL #times
CR ." \ lcsymbol : " timer-reset #times TEST .elapsed
CR ." \ lcsymbol2 : " timer-reset #times TEST2 .elapsed
CR ." \ lcsymbol3 : " timer-reset #times TEST3 .elapsed
CR ." \ lcsymbol4 : " timer-reset #times TEST4 .elapsed
CR ." \ eps : " timer-reset #times TESTe .elapsed
CR ." \ eps2 : " timer-reset #times TESTe2 .elapsed
CR ." \ weps : " timer-reset #times TESTwe .elapsed
CR ." \ weps2 : " timer-reset #times TESTwe2 .elapsed ;

\ 1,000,000,000 toptest ( 64bit code)
\ lcsymbol : 2.495 seconds elapsed.
\ lcsymbol2 : 2.217 seconds elapsed.
\ lcsymbol3 : 2.075 seconds elapsed.
\ lcsymbol4 : 1.778 seconds elapsed.
\ eps : 2.281 seconds elapsed.
\ eps2 : 7.390 seconds elapsed.
\ weps : 2.307 seconds elapsed.
\ weps2 : 7.264 seconds elapsed. ok

\ 1,000,000,000 toptest ( 32bit code)
\ lcsymbol : 2.400 seconds elapsed.
\ lcsymbol2 : 2.093 seconds elapsed.
\ lcsymbol3 : 2.069 seconds elapsed.
\ lcsymbol4 : 4.320 seconds elapsed.
\ eps : 2.265 seconds elapsed.
\ eps2 : 7.352 seconds elapsed.
\ weps : 2.298 seconds elapsed.
\ weps2 : 7.234 seconds elapsed. ok

Paul Rubin

unread,

Jul 7, 2012, 7:43:49 PM7/7/12

to

m...@iae.nl (Marcel Hendrix) writes:
> all non-local implementations have approximately the same speed, and the
> locals add a sizable overhead (because iForth doesn't inline words that
> use locals.)

Thanks, how about this?

: eps3 ( i j k -- e ) 2dup - >r -rot 2dup - >r drop - 2r> * * 2/ ;

Gerry Jackson

unread,

Jul 8, 2012, 2:48:01 AM7/8/12

to

On 07/07/2012 20:24, Krishna Myneni wrote:
> On Jul 7, 11:41 am, m...@iae.nl (Marcel Hendrix) wrote:
>> Krishna Myneni <krishna.myn...@ccreweb.org> writes Re: Permutation Tensor
>>

>
>

How does this version compare?

\ Using a byte-wide look-up table
\ Assumes i, j , k are all in the range [1..3]
create dat
2 c, 5 c, 8 c,
11 c, 15 c, 13 c,
12 c, 11 c, 17 c,
16 c, 14 c, 11 c,
1 c, 1 c, 1 c,
2 c, 1 c, 1 c,
0 c, 1 c, 1 c,

: (eps) ( u -- u2 ) dat + + c@ ;

: eps ( i j k -- n ) -1 (eps) (eps) (eps) 1- ;

--
Gerry

Marcel Hendrix

unread,

Jul 8, 2012, 3:07:18 AM7/8/12

to

Paul Rubin <no.e...@nospam.invalid> writes Re: Permutation Tensor

FORTH> 1000000000 toptest
\ lcsymbol : 2.409 seconds elapsed.
\ lcsymbol2 : 2.136 seconds elapsed.
\ lcsymbol3 : 2.016 seconds elapsed.
\ lcsymbol4 : 1.711 seconds elapsed.
\ eps : 2.221 seconds elapsed.
\ eps2 : 7.174 seconds elapsed.
\ eps3 : 2.221 seconds elapsed.
\ weps : 2.225 seconds elapsed.
\ weps2 : 7.008 seconds elapsed. ok

As you can see, eps and eps3 differ insignificantly.

Thanks to your questions, I looked into it a little further.
Instead of doing only a b c eps3, I tried a b c eps3
1 b c eps3a, 1 2 c eps3ab, and 1 2 3 eps3abc (i.e. some
arguments are constants instead of variables).

The 1 2 3 eps3abc will calculate lcsymbol at compile time,
therefore the loop overhead time comes out at 0.5 seconds.

The other three implementations cause the word to fetch
up to three cells from memory:

\ eps3 : 2.221 seconds elapsed. ( fetch 3 cells )
\ eps3a : 1.579 seconds elapsed. ( fetch 2 cells )
\ eps3ab : 1.399 seconds elapsed. ( fetch 1 cell )
\ eps3abc : 0.513 seconds elapsed. ( compute at compile-time)

The run-time difference for these implementations
appears to be caused by their number of memory accesses.

-marcel

Gerry Jackson

unread,

Jul 8, 2012, 3:09:03 AM7/8/12

to

Sorry the stack comment for (eps) should be

: (eps) ( u1 u2 -- u3 ) ...

--
Gerry

Marcel Hendrix

unread,

Jul 8, 2012, 3:23:25 AM7/8/12

to

Gerry Jackson <ge...@jackson9000.fsnet.co.uk> writes Re: Permutation Tensor

> On 07/07/2012 20:24, Krishna Myneni wrote:
>> On Jul 7, 11:41 am, m...@iae.nl (Marcel Hendrix) wrote:

[..]

> How does this version compare?

[..]

Quite good.

-marcel

\ Assumes i, j , k are all in the range [1..3]
create dat
2 c, 5 c, 8 c,
11 c, 15 c, 13 c,
12 c, 11 c, 17 c,
16 c, 14 c, 11 c,
1 c, 1 c, 1 c,
2 c, 1 c, 1 c,
0 c, 1 c, 1 c,

: (eps) ( u -- u2 ) dat + + c@ ;

: epsg ( i j k -- n ) -1 (eps) (eps) (eps) 1- ;

[..]

: TESTg ( n -- ) 0 SWAP 0 DO a b c epsg + a b c epsg + a b c epsg + a b c epsg + LOOP drop ;

[..]

: TOPTEST ( u -- )

2/ 2/ 1 UMAX LOCAL #times
CR ." \ lcsymbol : " timer-reset #times TEST .elapsed
CR ." \ lcsymbol2 : " timer-reset #times TEST2 .elapsed
CR ." \ lcsymbol3 : " timer-reset #times TEST3 .elapsed
CR ." \ lcsymbol4 : " timer-reset #times TEST4 .elapsed
CR ." \ eps : " timer-reset #times TESTe .elapsed
CR ." \ eps2 : " timer-reset #times TESTe2 .elapsed

CR ." \ eps3 : " timer-reset #times TESTe3 .elapsed
CR ." \ eps3a : " timer-reset #times TESTe3a .elapsed
CR ." \ eps3ab : " timer-reset #times TESTe3ab .elapsed
CR ." \ eps3abc : " timer-reset #times TESTe3abc .elapsed
CR ." \ epsg : " timer-reset #times TESTg .elapsed

CR ." \ weps : " timer-reset #times TESTwe .elapsed
CR ." \ weps2 : " timer-reset #times TESTwe2 .elapsed ;

FORTH> 1000000000 toptest
\ lcsymbol : 2.408 seconds elapsed.
\ lcsymbol2 : 2.137 seconds elapsed.
\ lcsymbol3 : 2.007 seconds elapsed.
\ lcsymbol4 : 1.711 seconds elapsed.
\ eps : 2.275 seconds elapsed.
\ eps2 : 7.356 seconds elapsed.
\ eps3 : 2.224 seconds elapsed.
\ eps3a : 1.576 seconds elapsed.
\ eps3ab : 1.401 seconds elapsed.
\ eps3abc : 0.512 seconds elapsed.
\ epsg : 2.224 seconds elapsed.
\ weps : 2.223 seconds elapsed.

Albert van der Horst

unread,

Jul 8, 2012, 9:03:19 AM7/8/12

to

In article <7xpq87b...@ruckus.brouhaha.com>,

I come up with this:
(now tested)

The idea is once you established that the modulo 3 sum,
you need only inspect two to find out what the eps is.

"
REQUIRE CASE-INSENSITIVE
CASE-INSENSITIVE
REQUIRE 2>R

: -ROT POSTPONE ROT POSTPONE ROT ; IMMEDIATE

: eps3 ( i j k -- e ) 2dup - >r -rot 2dup - >r drop - 2r> * * 2/ ;

\ Ensure proper mathematical mod.
: MOD' >R S>D R> FM/MOD DROP ;

: eps2 >R 2DUP + R> + 3 MOD IF 2DROP 0 ELSE SWAP - 1+ 3 MOD' 1- THEN ;

\ Less elegant works with floored mod too.
: eps2' >R 2DUP + R> + 3 MOD IF 2DROP 0 ELSE SWAP - 4 + 3 MOD 1- THEN ;

: test 4 1 DO I 4 1 DO 4 1 DO
DUP . J . I .
DUP J I EPS3 .
DUP J I EPS2 .
DUP J I EPS2' .
CR
LOOP LOOP DROP LOOP ;

"

In the FORTRAN era I hated the trick to use multiplication
to combine cases. But I'm sure your code will win hands down
on a Pentium for lack of branching.

The mod could be a table lookup, as the range is only -2..6

Groetjes Albert

Alex Wegel

unread,

Jul 8, 2012, 8:24:30 AM7/8/12

to

Arnold Doray <inv...@invalid.com> wrote:

> What's the best way to implement the 3D permutation tensor E(I,J,K)?

To not actually do it (advantageous in some contexts).

Let's assume that we already know that all i,j,k triplets we'll have to
handle are valid (ie. permutations of 1,2,3 - ie. with no duplicates).

This assumption will hold for certain sources (algorithms), and also
when the triplets don't change too often, and thus can be checked
somewhere else (eg. on value change, or on selection, or earlier in a
function body, if i,j,k were derived from function arguments).

Assuming validity of i,j,k, we could ignore one of them completely
(because it's redundand) and only look at the difference between the two
others.
By separating the checking from the determination of "handedness", we
also simplify the value returned by the function - no need to indicate
an invalid triplet - we can just return a boolean.

: <E> ( i j -- f ) - 3 mod 2 - ; \ check if i follows j or vice versa
: E ( i j k -- -1|+1 ) drop <E> 2* 1+ ; \ wrapper for compatibility

I think this (untested) code illustrates that the "best implementation"
still depends on context (intended usage).

Cheers,
Alex

Krishna Myneni

unread,

Jul 8, 2012, 8:55:09 AM7/8/12

to

On Jul 8, 1:48 am, Gerry Jackson <ge...@jackson9000.fsnet.co.uk>
wrote:

Below are my test results with gforth-fast (64-bit version). My
earlier test code had some errors, for the locals implementations.
Times are in ms. The fastest is lcsymbol4 with the test system.

$ gforth-fast levi-civita.fs

\ lcsymbol : 0 3553
\ lcsymbol2 : 0 3363
\ lcsymbol3 : 0 1810
\ lcsymbol4 : 0 1764
\ eps : 0 2712
\ eps2 : 0 2730
\ eps3 : 0 1991
\ epsg : 0 4048

Test code below.

KM

: eps2 LOCALS| k j i | ( i j k -- n ) i j - j k - * k i - * 2/ ;

\ Not using locals, by P.R., posted on c.l.f. on 7 July 2012.
: eps3 ( i j k -- e ) 2dup - >r -rot 2dup - >r drop - 2r> * * 2/ ;

\ Table lookup using bytes by GJ, posted on c.l.f. on 8 July 2012.

\ Using a byte-wide look-up table
\ Assumes i, j , k are all in the range [1..3]
create dat
2 c, 5 c, 8 c,
11 c, 15 c, 13 c,
12 c, 11 c, 17 c,
16 c, 14 c, 11 c,
1 c, 1 c, 1 c,
2 c, 1 c, 1 c,
0 c, 1 c, 1 c,

: (eps) ( u1 u2 -- u3 ) dat + + c@ ;

: epsg ( i j k -- n ) -1 (eps) (eps) (eps) 1- ;

\ Speed Comparison
1 [IF]
: ms@ ( -- u ) utime 1 1000 m*/ d>s ;

: TEST ( -- n ) 0 4 1 DO 4 1 DO 4 1 DO

K J I lcsymbol + K J I lcsymbol + K J I lcsymbol + LOOP LOOP

LOOP ;
: TEST2 ( -- n ) 0 4 1 DO 4 1 DO 4 1 DO

K J I lcsymbol2 + K J I lcsymbol2 + K J I lcsymbol2 + LOOP LOOP

LOOP ;
: TEST3 ( -- n ) 0 4 1 DO 4 1 DO 4 1 DO

K J I lcsymbol3 + K J I lcsymbol3 + K J I lcsymbol3 + LOOP LOOP

LOOP ;
: TEST4 ( -- n ) 0 3 0 DO 3 0 DO 3 0 DO

K J I lcsymbol4 + K J I lcsymbol4 + K J I lcsymbol4 + LOOP LOOP
LOOP ;
: TESTe ( -- n ) 0 4 1 DO 4 1 DO 4 1 DO
K J I eps + K J I eps + K J I eps + LOOP LOOP LOOP ;
: TESTe2 ( -- n ) 0 4 1 DO 4 1 DO 4 1 DO
K J I eps2 + K J I eps2 + K J I eps2 + LOOP LOOP LOOP ;
: TESTe3 ( -- n ) 0 4 1 DO 4 1 DO 4 1 DO
K J I eps3 + K J I eps3 + K J I eps3 + LOOP LOOP LOOP ;
: TESTeg ( -- n ) 0 4 1 DO 4 1 DO 4 1 DO
K J I epsg + K J I epsg + K J I epsg + LOOP LOOP LOOP ;

: .elapsed ( msstart msend -- ) swap - . ;

1000000 value #times
: TOPTEST ( u -- )
CR ms@ 0 #times 0 ?DO TEST + LOOP ms@ swap . .elapsed
CR ms@ 0 #times 0 ?DO TEST2 + LOOP ms@ swap . .elapsed
CR ms@ 0 #times 0 ?DO TEST3 + LOOP ms@ swap . .elapsed
CR ms@ 0 #times 0 ?DO TEST4 + LOOP ms@ swap . .elapsed
CR ms@ 0 #times 0 ?DO TESTe + LOOP ms@ swap . .elapsed
CR ms@ 0 #times 0 ?DO TESTe2 + LOOP ms@ swap . .elapsed

CR ms@ 0 #times 0 ?DO TESTe3 + LOOP ms@ swap . .elapsed
CR ms@ 0 #times 0 ?DO TESTeg + LOOP ms@ swap . .elapsed
;

toptest cr
bye
[THEN]

Krishna Myneni

unread,

Jul 8, 2012, 8:45:29 AM7/8/12

to

On Jul 7, 2:58 pm, Albert van der Horst <alb...@spenarnc.xs4all.nl>
wrote:
> In article <jt3erh$cd...@dont-email.me>,

> Arnold Doray <inva...@invalid.com> wrote:
>
> >What's the best way to implement the 3D permutation tensor E(I,J,K)?
>
> Something along
> : eps
> 3DUP + + 3 MOD IF 3DROP 0 EXIT THEN
> / At this point we know that all indices are different.
> BEGIN DUP 3 <> WHILE ROT REPEAT
> DROP < 2* 1-
> ;
>

The above seems to go into an infinite loop for me.

: epsh ( i j k -- n )
( 3DUP) 2dup 2>r 2 pick 2r>
+ + 3 MOD IF ( 3DROP) 2drop drop 0 EXIT THEN
\ At this point we know that all indices are different.

BEGIN DUP 3 <> WHILE ROT REPEAT
DROP < 2* 1-
;

1 1 1 epsh

KM

Gerry Jackson

unread,

Jul 8, 2012, 3:14:23 PM7/8/12

to

No good therefore.

> Test code below.
>
[...]

> \ Alternate table lookup which uses less storage, by KM.
> \ Posted on c.l.f. on 7 July 2012.
>
> \ We will use 3 cells to hold the tensor. Each cell will
> \ represent one plane of the 3D tensor. k will index the cell.
> \ i and j will index the bits in the cell.
> \
> \ For this implementation, we will take the indices i,j,k to
> \ be 0-based.
>
> create lctable 1639701 , 1316133 , 1381905 ,
>
> : lcsymbol4 ( i j k -- n )
> cells lctable + @ \ return the appropriate tensor plane
> >r
> 3 lshift over 2* + nip \ compute the bit offset
> r> swap rshift 3 and 1-
> ;
>

Why did you make i,j,k zero based? Your method will still work in a 32
bit system with them 1-based as in the original question e.g. (plus a
bit of tidying up)

align here
hex 64145400 , 50549400 , 54584400 , decimal
1 cells - constant lctable

: lcsymbol5 ( i j k -- n )

cells lctable + @ \ return the appropriate tensor plane

-rot
3 lshift swap 2* + \ compute the bit offset
rshift 3 and 1-
;

--
Gerry

Albert van der Horst

unread,

Jul 8, 2012, 4:05:38 PM7/8/12

to

In article <6774ceb8-dbe5-47db...@d24g2000yqh.googlegroups.com>,
Krishna Myneni <krishna...@ccreweb.org> wrote:
>On Jul 7, 2:58=A0pm, Albert van der Horst <alb...@spenarnc.xs4all.nl>

>wrote:
>> In article <jt3erh$cd...@dont-email.me>,

>> Arnold Doray =A0<inva...@invalid.com> wrote:
>>
>> >What's the best way to implement the 3D permutation tensor E(I,J,K)?
>>
>> Something along
>> : eps

>> =A0 =A0 =A0 =A0 3DUP + + 3 MOD IF 3DROP 0 EXIT THEN
>> =A0 =A0 =A0 =A0 / At this point we know that all indices are different.
>> =A0 =A0 =A0 =A0 BEGIN DUP 3 <> WHILE ROT REPEAT
>> =A0 =A0 =A0 =A0 DROP < 2* 1-

>> ;
>>
>
>The above seems to go into an infinite loop for me.
>
>: epsh ( i j k -- n )
> ( 3DUP) 2dup 2>r 2 pick 2r>
> + + 3 MOD IF ( 3DROP) 2drop drop 0 EXIT THEN
> \ At this point we know that all indices are different.
> BEGIN DUP 3 <> WHILE ROT REPEAT
> DROP < 2* 1-
>;
>
>1 1 1 epsh

Too hasty! I overlooked the possibility of 3 equals and didn't
test.

>
>
>KM
>

Groetjes Albert

Krishna Myneni

unread,

Jul 8, 2012, 4:49:39 PM7/8/12

to

On Jul 8, 2:14 pm, Gerry Jackson <ge...@jackson9000.fsnet.co.uk>

Yes, it's easy to make it 1-based also. I just wanted to eliminate the
extra 1- words in the code, because I thought indexing into the packed
bit structures was easier that way (but your code below proves that
point wrong). More important, in actual use, the tensor will likely
multiply arrays or matrices. In the FSL, we use 0-based indices for
arrays and matrices, so I expect the 0-based Levi-Civita symbol may be
closer to what will be used in practice. But, if a 1-based index is
necessary, your version fits the bill.

> align here
> hex 64145400 , 50549400 , 54584400 , decimal
> 1 cells - constant lctable
>
> : lcsymbol5 ( i j k -- n )
> cells lctable + @ \ return the appropriate tensor plane
> -rot
> 3 lshift swap 2* + \ compute the bit offset
> rshift 3 and 1-
> ;
>

Thanks.

Krishna

Alex Wegel

unread,

Jul 8, 2012, 5:44:43 PM7/8/12

to

Arnold Doray <inv...@invalid.com> wrote:

> What's the best way to implement the 3D permutation tensor E(I,J,K)?

Another one by me (my first one might not match a very common scenario,
and was a cheat anyway).

This one is based on the following observation:

For every case where E(i,j,k) yields a non-zero value (-1, or 1), the
following holds (asuming modular arithmetic, base 3):
j-i = k-j = i-k = E(i,j,k)

So, if j-i is equal to k-j then we can basically just return this value,
otherwise the result is 0.

I then dare to make the usual assumptions about integer representation
and booleans, which simplifies the combination of the two differences
and the recovery of the end result :-)

: eps ( i j k -- n )

over - 3 mod >r
swap - 3 mod r> ( j-i k-j )
and \ this AND does a lot of what's needed
dup 2 = or \ change result 2 to -1 (assumes true == -1)
;

: test
4 1 do
4 1 do
4 1 do
i 4 .r j 4 .r k 4 .r
i j k eps 4 .r cr
loop
loop
loop
;

test bye

Cheers,
Alex

Anton Ertl

unread,

Jul 9, 2012, 10:50:12 AM7/9/12

to

m...@iae.nl (Marcel Hendrix) writes:
>The difference between branching and table lookup is unexpectedly large
>- I thought that memory was supposed to be slow. Apparently branching is
>becoming very expensive.

Cache hits are relatively fast (3-4 cycles for a L1 cache hit).
Mispredicted branches are much slower (10-20 cycles).

Of course, if you miss all caches (several hundred cycles) and compare
that with a correctly predicted branch (~1 cycle) things look
differently.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2012: http://www.euroforth.org/ef12/

m.a.m....@tue.nl

unread,

Jul 10, 2012, 3:28:19 AM7/10/12

to

On Monday, July 9, 2012 4:50:12 PM UTC+2, Anton Ertl wrote:
> m...@iae.nl (Marcel Hendrix) writes:
>> The difference between branching and table lookup is unexpectedly large
>> - I thought that memory was supposed to be slow. Apparently branching is
>> becoming very expensive.
>
> Cache hits are relatively fast (3-4 cycles for a L1 cache hit).
> Mispredicted branches are much slower (10-20 cycles).

It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2:

4 BASE !

: lcsymbol2 ( i j k -- n )

swap 2 lshift or swap 10 lshift or

case
\ 123 of 1 endof
231 of 1 endof
312 of 1 endof
132 of -1 endof
213 of -1 endof
321 of -1 endof

123 of 1 endof
0 swap
endcase ;
DECIMAL

Note that the test harness in effect performs "1 2 3 lcsymbol2."

FORTH> 1000000000 TOPTEST ( original)
\ lcsymbol : 1.959 seconds elapsed.
\ lcsymbol2 : 1.750 seconds elapsed.

FORTH> 1000000000 TOPTEST ( swapped test cases)
\ lcsymbol : 1.957 seconds elapsed.
\ lcsymbol2 : 4.478 seconds elapsed.

This makes the jumptable approach a clear favorite.

-marcel

Anton Ertl

unread,

Jul 10, 2012, 6:11:08 AM7/10/12

to

m.a.m....@tue.nl writes:
>On Monday, July 9, 2012 4:50:12 PM UTC+2, Anton Ertl wrote:
>> m...@iae.nl (Marcel Hendrix) writes:
>>> The difference between branching and table lookup is unexpectedly large
>>> - I thought that memory was supposed to be slow. Apparently branching is
>>> becoming very expensive.
>>
>> Cache hits are relatively fast (3-4 cycles for a L1 cache hit).
>> Mispredicted branches are much slower (10-20 cycles).
>
>It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2:
>
>4 BASE !
>: lcsymbol2 ( i j k -- n )
> swap 2 lshift or swap 10 lshift or
> case
>\ 123 of 1 endof
> 231 of 1 endof
> 312 of 1 endof
> 132 of -1 endof
> 213 of -1 endof
> 321 of -1 endof
>
> 123 of 1 endof
> 0 swap
> endcase ;
>DECIMAL
>
>Note that the test harness in effect performs "1 2 3 lcsymbol2."

Hmm, sounds like perfectly predictable branches. Then it's probably
something else. You can check the number of mispredictions with
performance counters.

>FORTH> 1000000000 TOPTEST ( original)
>\ lcsymbol : 1.959 seconds elapsed.
>\ lcsymbol2 : 1.750 seconds elapsed.
>
>FORTH> 1000000000 TOPTEST ( swapped test cases)
>\ lcsymbol : 1.957 seconds elapsed.
>\ lcsymbol2 : 4.478 seconds elapsed.
>
>This makes the jumptable approach a clear favorite.

Can you post the machine code for the second LCSYMBOL2?

Krishna Myneni

unread,

Jul 10, 2012, 8:10:36 AM7/10/12

to

On Jul 10, 2:28 am, m.a.m.hend...@tue.nl wrote:
> On Monday, July 9, 2012 4:50:12 PM UTC+2, Anton Ertl wrote:
> > m...@iae.nl (Marcel Hendrix) writes:
> >> The difference between branching and table lookup is unexpectedly large
> >> - I thought that memory was supposed to be slow. Apparently branching is
> >> becoming very expensive.
>
> > Cache hits are relatively fast (3-4 cycles for a L1 cache hit).
> > Mispredicted branches are much slower (10-20 cycles).
>
> It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2:
>

...

> This makes the jumptable approach a clear favorite.
>
> -marcel

Lookup table or jump table?

Krishna

Marcel Hendrix

unread,

Jul 10, 2012, 3:37:41 PM7/10/12

to

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: Permutation Tensor

> m.a.m....@tue.nl writes:
[..]

>>It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2:

[..]

> Hmm, sounds like perfectly predictable branches. Then it's probably
> something else. You can check the number of mispredictions with
> performance counters.

>>This makes the jumptable approach a clear favorite.

[ Krishna: should have been lookup table ]

> Can you post the machine code for the second LCSYMBOL2?

4 BASE !
: lcsymbl2a ( i j k -- n )

swap 2 lshift or swap 10 lshift or
case
\ 123 of 1 endof
231 of 1 endof
312 of 1 endof
132 of -1 endof
213 of -1 endof
321 of -1 endof
123 of 1 endof
0 swap
endcase ;
DECIMAL

1 VALUE a
2 VALUE b
3 VALUE c

: TEST2a ( n -- )
0 SWAP 0 DO a b c lcsymbl2a +
a b c lcsymbl2a +
a b c lcsymbl2a +
a b c lcsymbl2a +
LOOP drop ;

FORTH> ' test2a idis
$0124B540 : [trashed]
$0124B54A pop rbx
$0124B54B push 0 b#
$0124B54D push rbx
$0124B54E xor rbx, rbx
$0124B551 pop rcx
$0124B552 call (DO) offset NEAR
$0124B55C lea rax, [rax 0 +] qword

$0124B560 mov rdi, $012497A0 qword-offset
$0124B567 lea rdi, [rdi*4 0 +] qword
$0124B56F or rdi, $012497C0 qword-offset
$0124B576 mov rax, $01249780 qword-offset
$0124B57D shl rax, 4 b#
$0124B581 or rdi, rax
$0124B584 cmp rdi, #45 b#
$0124B588 push rbx
$0124B589 mov rbx, rcx
$0124B58C mov rcx, rdi
$0124B58F mov rbx, rcx
$0124B592 jne $0124B5A4 offset NEAR
$0124B598 mov rbx, 1 d#
$0124B59F jmp $0124B615 offset NEAR
$0124B5A4 cmp rbx, #54 b#
$0124B5A8 jne $0124B5BA offset NEAR
$0124B5AE mov rbx, 1 d#
$0124B5B5 jmp $0124B615 offset NEAR
$0124B5BA cmp rbx, #30 b#
$0124B5BE jne $0124B5D0 offset NEAR
$0124B5C4 mov rbx, -1 d#
$0124B5CB jmp $0124B615 offset NEAR
$0124B5D0 cmp rbx, #39 b#
$0124B5D4 jne $0124B5E6 offset NEAR
$0124B5DA mov rbx, -1 d#
$0124B5E1 jmp $0124B615 offset NEAR
$0124B5E6 cmp rbx, #57 b#
$0124B5EA jne $0124B5FC offset NEAR
$0124B5F0 mov rbx, -1 d#
$0124B5F7 jmp $0124B615 offset NEAR
$0124B5FC cmp rbx, #27 b#
$0124B600 jne $0124B612 offset NEAR
$0124B606 mov rbx, 1 d#
$0124B60D jmp $0124B615 offset NEAR
$0124B612 push 0 b#
$0124B614 pop rbx
$0124B615 pop rdi
$0124B616 mov rax, $012497A0 qword-offset
$0124B61D lea rax, [rax*4 0 +] qword
$0124B625 or rax, $012497C0 qword-offset
$0124B62C mov rdx, $01249780 qword-offset
$0124B633 shl rdx, 4 b#
$0124B637 or rax, rdx
$0124B63A lea rdx, [rdi rbx*1] qword

[ etc. ]

$0124B837 push 0 b#
$0124B839 pop rbx
$0124B83A pop rdi
$0124B83B lea rbx, [rdi rbx*1] qword
$0124B83F add [rbp 0 +] qword, 1 b#
$0124B844 add [rbp 8 +] qword, 1 b#
$0124B849 jno $0124B560 offset NEAR
$0124B84F add rbp, #24 b#
$0124B853 ;

-marcel

Krishna Myneni

unread,

Jul 10, 2012, 8:06:31 PM7/10/12

to

On Jul 8, 4:44 pm, awe...@arcor.de (Alex Wegel) wrote:

This definition of eps is not portable in Forth-94. The use of MOD
makes it implementation dependent, since the system might do either
floored division or symmetric division. For example, the following
happens on systems that do symmetric division,

1 3 1 eps .
-1 ok

Krishna

Alex Wegel

unread,

Jul 11, 2012, 1:09:46 AM7/11/12

to

Krishna Myneni <krishna...@ccreweb.org> wrote:

> This definition of eps is not portable in Forth-94. The use of MOD
> makes it implementation dependent, since the system might do either
> floored division or symmetric division. For example, the following
> happens on systems that do symmetric division,
>
> 1 3 1 eps .
> -1 ok
>
> Krishna

Yes - i fathomed that, (mostly because AvdH even added a definition of
mod to his code to address this). Sorry that i didn't mention it
alongside the other nonportabilities..

So, lets get rid of that ugly MOD - mathematically it's overkill here.
In the following, the nonportable "3 mod" sequences are being replaced:

: (3mod) dup 0< 3 and + ; \ "3 mod" valid for input range -2..2

: epsh1 ( i j k -- n )
over - (3mod) >r
swap - (3mod) r> ( j-i k-j )
and
dup 2 = or
;

This ran faster than the original already, but the fastest variant so
far is the following where the (3mod) code is inlined:

: epsh2 ( i j k -- n )
over - dup 0< 3 and + >r
swap - dup 0< 3 and + r> ( j-i k-j )
and
dup 2 = or
;

I added all 3 versions to the "levi-civita.fs" benchmark (as epsh0 -
epsh2), and epsh2 looks quite competitive here (gforth-fast, powerpc g5
@ 1.8GHz):

\ lcsymbol : 0 7848
\ lcsymbol2 : 0 6280
\ lcsymbol3 : 0 3018
\ lcsymbol4 : 0 3469
\ eps : 0 4260
\ eps2 : 0 3851
\ eps3 : 0 3806
\ epsg : 0 6177
\ epsh0 : 0 5817 <- original, using 3 mod
\ epsh1 : 0 6291 <- fixed, using call
\ epsh2 : 0 3360 <- fixed, inlined

Cheers,
Alex

PS: epsi2 (from the other post) came out at 3020.

Alex Wegel

unread,

Jul 11, 2012, 1:09:46 AM7/11/12

to

Evolution of old code:

\ epsi0 is like the original version,
\ but as a single definition (included for reference)
\ - only works when correct result is -1 or 1
\ - has mod portability bug
: epsi0 ( i j k -- -1|+1 ) drop - 3 mod 2 - 2* 1+ ;

\ complete, (mod-)fixed & fast version
: epsi2 ( i j k -- -1|0|+1 )
>r 2dup xor r> xor \ check if i,j,k are permutation of 1,2,3; drop k
if 2drop 0 else \ return 0 otherwise
- dup 0< 3 and + \ deduce result from difference i-j (the else part
2 - 2* 1+ \ is just epsi0 w/o the drop and the mod-bug)
then ;

I guess there are still ways to simplify the code, esp. the else-part...

Anton Ertl

unread,

Jul 11, 2012, 9:50:58 AM7/11/12

to

m...@iae.nl (Marcel Hendrix) writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: Permutation Tensor
>
>> m.a.m....@tue.nl writes:
>[..]
>>>It is possible to see that branching is indeed slow by swapping the test cases in lcsymbol2:
>[..]
>> Hmm, sounds like perfectly predictable branches. Then it's probably
>> something else. You can check the number of mispredictions with
>> performance counters.

[...]

>$0124B592 jne $0124B5A4 offset NEAR
>$0124B598 mov rbx, 1 d#
>$0124B59F jmp $0124B615 offset NEAR
>$0124B5A4 cmp rbx, #54 b#
>$0124B5A8 jne $0124B5BA offset NEAR
>$0124B5AE mov rbx, 1 d#
>$0124B5B5 jmp $0124B615 offset NEAR
>$0124B5BA cmp rbx, #30 b#
>$0124B5BE jne $0124B5D0 offset NEAR
>$0124B5C4 mov rbx, -1 d#
>$0124B5CB jmp $0124B615 offset NEAR
>$0124B5D0 cmp rbx, #39 b#
>$0124B5D4 jne $0124B5E6 offset NEAR
>$0124B5DA mov rbx, -1 d#
>$0124B5E1 jmp $0124B615 offset NEAR
>$0124B5E6 cmp rbx, #57 b#
>$0124B5EA jne $0124B5FC offset NEAR
>$0124B5F0 mov rbx, -1 d#
>$0124B5F7 jmp $0124B615 offset NEAR
>$0124B5FC cmp rbx, #27 b#
>$0124B600 jne $0124B612 offset NEAR
>$0124B606 mov rbx, 1 d#
>$0124B60D jmp $0124B615 offset NEAR

[...]

Hmm, I remember something about the number of branch prediction table
entries per cache line or something being limited. Maybe that's the
reason for the low performance.

If I wanted to verify this, I would first use performance counters to
check if there is a high number of branch mispredictions, and if so,
insert some padding after the "jmp"s to see if it helps.

Alex Wegel

unread,

Jul 11, 2012, 12:42:01 PM7/11/12

to

More evolution (now fastest here when using gforth-fast. i'm curious how
it compares under different cpus/forths):

-1 , 1 , here 0 , -1 , 1 , constant tab

: epsi5 ( i j k -- -1|+1 )

>r 2dup xor r> xor

if
2drop 0
else
- cells tab + @
then
;

Marcel Hendrix

unread,

Jul 11, 2012, 2:18:36 PM7/11/12

to

awe...@arcor.de (Alex Wegel) writes Re: Permutation Tensor

You are going in the wrong direction, at least for iForth.

FORTH> 1,000,000,000 toptest
\ lcsymbol : 2.422 seconds elapsed.
\ lcsymbol2 : 2.141 seconds elapsed.
\ lcsymbl2a : 5.467 seconds elapsed.
\ lcsymbol3 : 2.010 seconds elapsed.
\ lcsymbol4 : 1.708 seconds elapsed.
\ eps : 2.224 seconds elapsed.
\ eps2 : 7.177 seconds elapsed.
\ eps3 : 2.225 seconds elapsed.
\ eps3a : 1.580 seconds elapsed.
\ eps3ab : 1.400 seconds elapsed.
\ eps3abc : 0.512 seconds elapsed.
\ epsg : 2.227 seconds elapsed.
\ Ea : 12.108 seconds elapsed.
\ epsh2 : 3.309 seconds elapsed.
\ epsi2 : 3.585 seconds elapsed.
\ epsi5 : 4.530 seconds elapsed. ok

-marcel

Alex Wegel

unread,

Jul 12, 2012, 6:43:04 AM7/12/12

to

Marcel Hendrix <m...@iae.nl> wrote:

> You are going in the wrong direction, at least for iForth.

Thanks for your answer. I'm not too surprised though, because even the
two versions of gforth/powerpc (ie: gforth vs. gforth-fast) lead to
different "rankings":

\ gforth-fast ppc gforth ppc iForth
\ noop : 2354 6498
\ lcsymbol : 7593 25195 2.422
\ lcsymbol2 : 6176 24478 2.141
\ lcsymbol3 : 2937 18094 2.010
\ lcsymbol4 : 3217 16031 1.708
\ eps : 4353 17486 2.224
\ eps2 : 4132 18988 7.177
\ eps3 : 3840 14916 2.225
\ epsg : 6060 25045 2.227
\ epsh0 : 5744 18312
\ epsh1 : 6038 29271
\ epsh2 : 3442 20999 3.309
\ epsi2 : 2950 16314 3.585
\ epsi3 : 2986 16079
\ epsi4 : 3760 15598
\ epsi5 : 2765 15011 4.530

It all just seems to confirm that manual micro-optimization is a gamble.

Cheers,
Alex

Arnold Doray

unread,

Jul 25, 2012, 10:07:46 AM7/25/12

to

Perhaps the difference is due to CPU architecture. AFAIK, the pentium has
a deeper pipeline compared to the PPC? So iForth is slower probably
because of pipeline bubbles due to branch misprediction? -AD

Arnold Doray

unread,

Jul 25, 2012, 10:08:51 AM7/25/12

to

On Thu, 05 Jul 2012 07:13:53 +0000, Arnold Doray wrote:

> What's the best way to implement the 3D permutation tensor E(I,J,K)?
>

> Thanks,
> Arnold

Thank you all for your very interesting solutions to this problem. Much
appreciated. -Arnold

Alex Wegel

unread,

Jul 27, 2012, 7:42:22 AM7/27/12

to

Arnold Doray <inv...@invalid.com> wrote:

> On Thu, 12 Jul 2012 12:43:04 +0200, Alex Wegel wrote:
>
> > Marcel Hendrix <m...@iae.nl> wrote:
> >
> >> You are going in the wrong direction, at least for iForth.
> >
> > Thanks for your answer. I'm not too surprised though, because even the
> > two versions of gforth/powerpc (ie: gforth vs. gforth-fast) lead to
> > different "rankings":
> >
> > \ gforth-fast ppc gforth ppc iForth \ noop :
> > 2354 6498 \ lcsymbol : 7593 25195
> > 2.422 \ lcsymbol2 : 6176 24478 2.141 \
> > lcsymbol3 : 2937 18094 2.010 \ lcsymbol4 : 3217
> > 16031 1.708 \ eps : 4353
> > 17486 2.224 \ eps2 : 4132 18988
> > 7.177 \ eps3 : 3840 14916 2.225 \ epsg
> > : 6060 25045 2.227 \ epsh0 : 5744
> > 18312 \ epsh1 : 6038 29271 \ epsh2 : 3442
> > 20999 3.309 \ epsi2 : 2950 16314
> > 3.585 \ epsi3 : 2986 16079 \ epsi4 : 3760
> > 15598 \ epsi5 : 2765 15011 4.530
> >
> > It all just seems to confirm that manual micro-optimization is a gamble.
> >
> > Cheers,
> > Alex
>
> Perhaps the difference is due to CPU architecture.

One part of it surely is, but - as i stated - the forth compiler used
plays a major role too.

> AFAIK, the pentium has
> a deeper pipeline compared to the PPC?

It used to be this way - my ppc is rather old anyway, so it's pipeline
might look even shorter when comparing with a modern x86 (but i didn't
care for such trends at all lately, so this is just a guess).

> So iForth is slower probably
> because of pipeline bubbles due to branch misprediction? -AD

That's where guessing and gambling starts without access to everything
involved (iForth, x86).

In the end, there are a lot of factors, esp. when looking at todays
superscalar CPUs, and the real world might still look *totally*
different than a simplistic benchmark would suggest. (Different input
data, different call pattern, etc.).

To sum it up: Searching the "best" variant of such a function by using a
benchmark is nerd-fun, but that's probably it.

Cheers,
Alex

Marcel Hendrix

unread,

Jul 27, 2012, 9:39:37 AM7/27/12

to

awe...@arcor.de (Alex Wegel) writes Re: Permutation Tensor

> Arnold Doray <inv...@invalid.com> wrote:
[..]

>> Perhaps the difference is due to CPU architecture.

> One part of it surely is, but - as i stated - the forth compiler used
> plays a major role too.

[..]

>> So iForth is slower probably
>> because of pipeline bubbles due to branch misprediction? -AD

> That's where guessing and gambling starts without access to everything
> involved (iForth, x86).

Would it be so much less of a gamble when *did* have access?

> In the end, there are a lot of factors, esp. when looking at todays
> superscalar CPUs, and the real world might still look *totally*
> different than a simplistic benchmark would suggest. (Different input
> data, different call pattern, etc.).

This exercise has shown that the difference between algorithms is (1) large,
and (2) unpredictable and counterintuitive. I looked again just now,
and I would not have expected lcsymbol4 to be the best (in iForth64), nor
can can I understand its terrible performance in 32bit mode.

> To sum it up: Searching the "best" variant of such a function by using a
> benchmark is nerd-fun, but that's probably it.

In Forth we can test such things at runtime very easily. Imagine
Forth applications that adapt themselves to their surroundings
organically :-)

-marcel

Arnold Doray

unread,

Jul 30, 2012, 12:52:43 AM7/30/12

to

On Fri, 27 Jul 2012 15:39:37 +0200, Marcel Hendrix wrote:

>> To sum it up: Searching the "best" variant of such a function by using
>> a benchmark is nerd-fun, but that's probably it.
>
> In Forth we can test such things at runtime very easily. Imagine Forth
> applications that adapt themselves to their surroundings organically :-)
>
> -marcel

A tongue-in-cheek comment, but still an interesting thought.

JITs do this all the time, but the direction for optimization is
unambigious. With multiple algorithms, it would take longer to test for
optimality, and for the hypothetical "JIT" to kick in, unless there are
some heuristics. A more obvious application is processing data across
multiple nodes -- sometimes it's good to just stream the data from the
"data node" to the "compute node", at others, to send the computation to
be performed on the "data node" itself and relay the results back to the
compute node.