Vector addition:
: v+ + ;
Cheers,
Alex
Not sure what you mean by that, but in graphics applications I've
frequently coded V+ to add the x,y components of a 2-vector. Related
words include V*/ which scales a 2-vector by a ratio of scalars.
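For instance, the definitions might look like this (just a sketch; the
exact words depend on the application):

: V+  ( x1 y1 x2 y2 -- x3 y3 )    rot + >r + r> ;
: V*/ ( x y n m -- x*n/m y*n/m )  2dup 2>r */ swap 2r> */ swap ;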
Cheers,
Elizabeth
--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310.999.6784
5959 West Century Blvd. Suite 700
Los Angeles, CA 90045
http://www.forth.com
"Forth-based products and Services for real-time
applications since 1973."
==================================================
> On 9/12/10 6:21 PM, Alex Wegel wrote:
> > Vector addition:
> >
> > : v+ + ;
> >
>
> Not sure what you mean by that,
I admit, this was terse, but that was the point.
Also, i wanted to see what happens :-)
> but in graphics applications I've
> frequently coded V+ to add the x,y components of a 2-vector. Related
> words include V*/ which scales a 2-vector by a ratio of scalars.
Not sure either - i gather you mean processing cell-pairs?
Then that's something different.
Let's riddle just a bit more - a very important word to help v+ is:
: vclc $7f7f7f7f and ;
..but the constant depends on the format you want the vector to have.
:-)
I'll post a working example later.
> : vclc $7f7f7f7f and ;
> ..but the constant depends on the format you want the vector to have.
You should use the built-in assembler in iForth for that:
@ab offset -> mm3 movd,
@cd offset -> mm1 movd,
..
mm3 -> mm1 paddb,
.. etc.
-marcel
> I'll post a working example later.
A friend of mine recently pointed me to a web page showing some possible
implementations of vector addition in Forth. [1]
(To be fair: That page is not about the best vector addition algorithm,
but rather meant as a kind of introductory text to Forth (?))
To make it short: After reading through it, i felt that there's too much
time spent improving the implementation details of a conventional
approach, rather than hitting the problem itself hard enough.
My inner zen (or was it jedi) master says: Think no box!
Thinking inside the box, thinking outside the box - who put that box
there anyway??
Especially in the case of vector functions, this proverbial box often
manifests itself clearly in the code (making it look like a
spreadsheet), so this seemed like a great simple case for unthinking it.
So, here's the example:
\ Packed vectors
hex
\ Note: >v is what i previously called vclc - never mind.
: >v 7f7f7f7f and ; \ format as vector (= clear carry/separator bits)
: v>s dup >v swap 2* 80808080 and or ; \ sign extend vec. elements
: v@ @ >v ; \ fetch vector
: v! swap v>s swap ! ; \ store vector
: v+c + ; \ add two vecs together, leave the carry flags in the result
: v+ + >v ; \ add two vectors together, return a proper vector
\ printing
: s7ext dup 80 and 0<> -100 and or ; \ sign extend a 7bit number
: v@. 4 0 do dup c@ s7ext . char+ loop drop ; \ print a stored vector
\ Example
decimal
create va 1 c, 2 c, 5 c, 4 c,
create vb -5 c, 3 c, -2 c, 0 c,
create vr 4 chars allot
va v@ vb v@ dup v+ v+ vr v!
.( va = ) va v@. cr
.( vb = ) vb v@. cr
.( va + 2*vb = ) vr v@. cr
\ end of code
So how did i attack it?
First, with the broadsword: Vector multiplication and other stuff is of
no matter, so chop off everything it would demand. This allows for
packing the elements together with only 1 bit lost for separating two
adjacent elements.
Second, with the sledgehammer: Who (bash) needs (bash) 32bit (bash) ints
(bash) anyway ?!
Oh yes, even before first, with the magic lantern: "I wish i had a SIMD
engine" - whoosh - it's there (was there all the time).
Sorry if this seems pointless or something, but i felt like making the more
general point that working on improving a solution might sometimes not
be worthwhile, when one shouldn't have the problem in the first place.
:-)
Cheers
Alex
> You should use the built-in assembler in iForth for that:
>
> @ab offset -> mm3 movd,
> @cd offset -> mm1 movd,
> ..
> mm3 -> mm1 paddb,
> .. etc.
I don't think so.
1) No iForth here
2) No x86 here (at least my ppc can do vperm - miles ahead ;-)
3) Are the "mm"'s vector regs? If yes, then it's not my point.
I'm just posting the example i promised. It's about adding two vectors,
basically using only one single scalar instruction.
More generally, it's about rethinking requirements instead of throwing
big code/silicon at a problem. (And in this respect it even relates to
other current discussions here.)
Alex
> A friend of mine recently pointed me to a web page showing some possible
> implementations of vector addition in Forth. [1]
> (To be fair: That page is not about the best vector addition algorithm,
> but rather meant as a kind of introductory text to Forth (?))
> [..]
> [1] http://prog21.dadgum.com/33.html
That guy's examples seem to be a bit overcomplicated to me; what about:
: 3swap ( v1 v2 v3 w1 w2 w3 -- w1 w2 w3 v1 v2 v3 ) -rot 2swap swap 2rot rot ;
: vadd ( v1 v2 v3 w1 w2 w3 -- y1 y2 y3 ) 2swap 3swap 2swap + -rot + 2swap + ;
--
Zbigniew
>First, with the broadsword: Vector multiplication and other stuff is of
>no matter, so chop off everything it would demand. This allows for
>packing the elements together with only 1 bit lost for separating two
>adjacent elements.
>
>Second, with the sledgehammer: Who (bash) needs (bash) 32bit (bash) ints
>(bash) anyway ?!
I recently ran into overflow problems on my 64-bit Forth while solving
problem 301 of projecteuler.net
>
>Cheers
>Alex
>
>[1] http://prog21.dadgum.com/33.html
Greetings, Albert
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
At last, a justification for an optimising compiler and locals. ;-)
: vadd { v1 v2 v3 w1 w2 w3 -- y1 y2 y3 } v1 w1 + v2 w2 + v3 w3 + ;
> On 14 Sep, 00:05, ZB <zbTHIS...@ispid.THIS-NOcom.pl> wrote:
>> On 13.09.2010, Alex Wegel <awe...@arcor.de> wrote:
[..]
>> : 3swap ( v1 v2 v3 w1 w2 w3 -- w1 w2 w3 v1 v2 v3 ) -rot 2swap swap 2rot rot ;
>> : vadd ( v1 v2 v3 w1 w2 w3 -- y1 y2 y3 ) 2swap 3swap 2swap + -rot + 2swap + ;
[..]
> At last, a justification for an optimising compiler and locals. ;-)
> : vadd { v1 v2 v3 w1 w2 w3 -- y1 y2 y3 } v1 w1 + v2 w2 + v3 w3 + ;
Not always possible, but here it works:
FORTH> : vadd params| v1 v2 v3 w1 w2 w3 | v1 w1 + v2 w2 + v3 w3 + ; ok
FORTH> : test 1 2 3 11 22 33 vadd ; ' test idis
$01242780 : [trashed]
$0124278A push #12 b#
$0124278C push #24 b#
$0124278E push #36 b#
$01242790 ;
-marcel
>> : 3swap ( v1 v2 v3 w1 w2 w3 -- w1 w2 w3 v1 v2 v3 ) -rot 2swap swap 2rot rot ;
>> : vadd ( v1 v2 v3 w1 w2 w3 -- y1 y2 y3 ) 2swap 3swap 2swap + -rot + 2swap + ;
>>
>> --
>> Zbigniew
>
> At last, a justification for an optimising compiler and locals. ;-)
>
>: vadd { v1 v2 v3 w1 w2 w3 -- y1 y2 y3 } v1 w1 + v2 w2 + v3 w3 + ;
...but my solution takes 50 microseconds - while the locals-based one takes 77... :)
(Gforth, Linux, Pentium III/750)
--
Zbigniew
>: 3swap ( v1 v2 v3 w1 w2 w3 -- w1 w2 w3 v1 v2 v3 ) -rot 2swap swap 2rot rot ;
>: vadd ( v1 v2 v3 w1 w2 w3 -- y1 y2 y3 ) 2swap 3swap 2swap + -rot + 2swap + ;
Wow - that's a huge amount of stack shuffling, which is why James
Hague's version passed around pointers to data. A search of the VFX
code base shows three uses of 2ROT in total. However, after adding a
code generator for 2ROT to VFX, the source above gives:
dis vadd
VADD
( 004C1A50 8B5510 ) MOV EDX, [EBP+10]
( 004C1A53 035504 ) ADD EDX, [EBP+04]
( 004C1A56 8B4D00 ) MOV ECX, [EBP]
( 004C1A59 034D0C ) ADD ECX, [EBP+0C]
( 004C1A5C 035D08 ) ADD EBX, [EBP+08]
( 004C1A5F 894D0C ) MOV [EBP+0C], ECX
( 004C1A62 895510 ) MOV [EBP+10], EDX
( 004C1A65 8D6D0C ) LEA EBP, [EBP+0C]
( 004C1A68 C3 ) NEXT,
( 25 bytes, 9 instructions )
ok
Although the code is compilable and can be made efficient, the
source code is a maintenance nightmare! The register caching
of TOS reduces the number of memory accesses to seven, whereas
pointer code would probably require at least nine.
It would be interesting to see usage examples of VADD from a real
application.
Stephen
--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads
> Wow - that's a huge amount of stack shuffling, which is why James
> Hague's version passed around pointers to data.
Yes, it doesn't look as nice as his listings - but I was trying to make
it simple. Aren't the operations on the stack the fastest ones?
> Although the code is compilable and can be made efficient, the
> source code is a maintenance nightmare!
But why, actually? Just two words, nothing particularly complicated...
--
Zbigniew
Strange, my Core 2 Duo 3 GHz needs 49 nanoseconds for your version and 35
ns for the one with locals. Gforth, Cygwin, Vista/32
--
Coos
CHForth, 16 bit DOS applications
http://home.hccnet.nl/j.j.haak/forth.html
> On 14 Sep 2010 22:42:12 +0200, ZB wrote:
>
>> On 14.09.2010, Alex McDonald <bl...@rivadpm.com> wrote:
>>
>>>> : 3swap ( v1 v2 v3 w1 w2 w3 -- w1 w2 w3 v1 v2 v3 ) -rot 2swap swap 2rot rot ;
>>>> : vadd ( v1 v2 v3 w1 w2 w3 -- y1 y2 y3 ) 2swap 3swap 2swap + -rot + 2swap + ;
>>>>
>>>> --
>>>> Zbigniew
>>>
>>> At last, a justification for an optimising compiler and locals. ;-)
>>>
>>>: vadd { v1 v2 v3 w1 w2 w3 -- y1 y2 y3 } v1 w1 + v2 w2 + v3 w3 + ;
>>
>> ...but my solution takes 50 microseconds - while the locals-based one takes 77... :)
>> (Gforth, Linux, Pentium III/750)
>
> Strange, my Core 2 Duo 3 GHz needs 49 nanoseconds for your version and 35
> ns for the one with locals. Gforth, Cygwin, Vista/32
Faster still, using gforth-fast gave 29 and 28 ns respectively, so not much
of a difference here ;)
But running 64-bit Gforth under Linux gives 51 and 79 ns (consistent with
your observation; here locals seem to be slow).
Gforth-fast gave 27 and 29 ns.
> In article <1jorwt9.12qny7ixbqfjrN%awe...@arcor.de>,
> Alex Wegel <awe...@arcor.de> wrote:
> >
> >Second, with the sledgehammer: Who (bash) needs (bash) 32bit (bash) ints
> >(bash) anyway ?!
>
> I recently ran into overflow problems on my 64-bit Forth while solving
> problem 301 of projecteuler.net
You see? You don't need 32 bit ;-)
Actually, i think this supports my point, that the problem really needs
more bashing to arrive at a better implementation, or that it needs
dedicated arithmetic functions beyond the "standard" stuff anyway.
I never liked the "nim" example (or hanoi) - but i don't know why :-(
Alex
> On 13.09.2010, Alex Wegel <awe...@arcor.de> wrote:
>
> > A friend of mine recently pointed me to a web page showing some possible
> > implementations of vector addition in Forth. [1]
> > (To be fair: That page is not about the best vector addition algorithm,
> > but rather meant as a kind of introductory text to Forth (?))
> > [..]
> > [1] http://prog21.dadgum.com/33.html
>
> That guy's examples seem to be a bit overcomplicated to me;
Yes - and he's citing Charles Moore, so the whole thing cries for a
radical simplification.
> what about:
>
> : 3swap ( v1 v2 v3 w1 w2 w3 -- w1 w2 w3 v1 v2 v3 ) -rot 2swap swap 2rot rot ;
> : vadd ( v1 v2 v3 w1 w2 w3 -- y1 y2 y3 ) 2swap 3swap 2swap + -rot + 2swap + ;
Hmm - how about:
: vadd >r >r >r rot r> + rot r> + rot r> + ;
Or, if v1+w1 is known not to overflow:
: vadd >r rot >r D+ 2r> + ;
Or even (just for variety's sake):
: >4th sp@ [ 3 cells ] literal + ;
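\ note: >4th assumes sp@ addresses the top item and that the stack grows downward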
: +!+ tuck +! cell+ ;
: vadd >4th +!+ +!+ +!+ drop ;
All still complicated, and/or expensive (when compared to a single
scalar '+' and ANDing with a constant value).
It all shows that the stack is not a random access data structure, and
it sure gets worse with larger vectors, not to mention matrices.
BTW: The version i posted already adds two 4-element vectors ;-)
The same code, with only the constants changed (huh!), can be used to
add vectors of more (smaller) elements, fewer (bigger) ones, or even mixed sizes.
And then there's still 'D+' :-))
OTOH, my code is no plug-in solution for adding vectors of cell-sized
ints (it can be scaled up to any integer size by using the carry bits,
though), and it has some overhead for getting the values in and out
(depending on the situation). Also, it doesn't work like that for
multiplication.
The purpose of this thread is to generally encourage thinking beyond the
problem one is facing - it is easy to get "hypnotized" trying to solve
the problem in the form that it presents itself in, instead of finding
the *real* problem behind it.
The SIMD-style vadd seemed a great example, because it shows (despite
its limitations) how much difference there can be compared to a
conventional approach (emulating a "conventional" language and sticking
to a "scalar interface").
Well, i think something like Charles Moore's decision to include 2 pointer
regs in the new processor cores helps vector stuff a lot.
Alex
>On 14.09.2010, Stephen Pelc <steph...@mpeforth.com> wrote:
>
>> Wow - that's a huge amount of stack shuffling, which is why James
>> Hague's version passed around pointers to data.
>
>Yes, it doesn't look as nice as his listings - but I was trying to make
>it simple. Aren't the operations on the stack the fastest ones?
Not always. The simple rule of thumb is that memory operations are
more expensive than register operations.
>
>> Although the code is compilable and can be made efficient, the
>> source code is a maintenance nightmare!
>
>But why, actually? Just two words, nothing particularly complicated...
>: 3swap ( v1 v2 v3 w1 w2 w3 -- w1 w2 w3 v1 v2 v3 ) -rot 2swap swap 2rot rot ;
>: vadd ( v1 v2 v3 w1 w2 w3 -- y1 y2 y3 ) 2swap 3swap 2swap + -rot + 2swap + ;
Remember that VFX got VADD down to 9 instructions. What was compiled
was:
2swap
-rot 2swap swap 2rot rot \ was 3swap
2swap + -rot + 2swap + ;
That's 13 source code tokens, of which only 3 (the adds) do useful
work. I end up concluding that 'basic' Forth is not a good way to
express this kind of problem. That's why I wanted to see how your
VADD is used in practice.
> That's 13 source code tokens, of which only 3 (the adds) do useful
> work. I end up concluding that 'basic' Forth is not a good way to
> express this kind of problem. That's why I wanted to see how your
> VADD is used in practice.
OK, so here is "optimized" version:
: vadd ( v1 v2 v3 w1 w2 w3 - y1 y2 y3 ) 2swap -rot + 2swap + 2swap + swap rot ;
--
Zbigniew
> But running 64-bit Gforth under Linux gives 51 and 79 ns (consistent with
> your observation; here locals seem to be slow).
> Gforth-fast gave 27 and 29 ns.
Strange. Two strange things, even:
1) That the results - and the difference itself - are OS-dependent.
2) That your Core needs the same time as my old Pentium III.
--
Zbigniew
>OK, so here is "optimized" version:
>
>: vadd ( v1 v2 v3 w1 w2 w3 - y1 y2 y3 ) 2swap -rot + 2swap + 2swap + swap rot ;
: vadd ( v1 v2 v3 w1 w2 w3 - y1 y2 y3 ) 2swap -rot + 2swap + 2swap +
swap rot ; ok
ok
dis vadd
VADD
( 004C1A10 035D08 ) ADD EBX, [EBP+08]
( 004C1A13 8B5500 ) MOV EDX, [EBP]
( 004C1A16 03550C ) ADD EDX, [EBP+0C]
( 004C1A19 8B4D04 ) MOV ECX, [EBP+04]
( 004C1A1C 034D10 ) ADD ECX, [EBP+10]
( 004C1A1F 89550C ) MOV [EBP+0C], EDX
( 004C1A22 894D10 ) MOV [EBP+10], ECX
( 004C1A25 8D6D0C ) LEA EBP, [EBP+0C]
( 004C1A28 C3 ) NEXT,
( 25 bytes, 9 instructions )
ok
That's essentially the same as before because the stack usage
is almost the same. What I am asking for is not the implementation
of VADD but examples (with comments) of words that call VADD.
In my experience with optimising compilers, operations that require
6 items on the stack are better expressed using pointers to data
structures. I mean better both in terms of code readability and
performance.
Although there are 7 memory operations in the code above and
a pointer version will require at least 9, the 6 items had
to be put on the stack, probably as memory operations, for
a total of 6+7=13 memory operations for the stack version.
Yes, 3 is about the limit. This problem is crying out for a vector
stack.
Andrew.
> Or, if v1+w1 is known not to overflow:
>: vadd >r rot >r D+ 2r> + ;
Oh, yeah: completely missed the fact that I could add two "ordered" pairs
using a single 'd+'.
So most probably now it can't be made even shorter.
> Or even (just for variety's sake):
>: >4th sp@ [ 3 cells ] literal + ;
>: +!+ tuck +! cell+ ;
>: vadd >4th +!+ +!+ +!+ drop ;
Interesting, different approach - still the above seems to be more "compact". ;)
> BTW: The version i posted already adds two 4-element vectors ;-)
> The same code, with only the constants changed (huh!), can be used to
> add vectors of more (smaller) elements, fewer (bigger) ones, or even mixed sizes.
Do you mean, with a little modification?
: v4add 2>r 2rot d+ 2swap 2r> d+ ;
--
Zbigniew
> OK, so here is "optimized" version:
>
>: vadd ( v1 v2 v3 w1 w2 w3 - y1 y2 y3 ) 2swap -rot + 2swap + 2swap + swap rot ;
But, of course, Alex' version
: vadd >r rot >r D+ 2r> + ;
...is the shortest one.
--
Zbigniew
variable VP \ grows downward
: (vadd)
VP @ 3 cells + dup @ VP @ + swap !
1 cells VP +! ;
: vadd ( V: v1 v2 -- v3 )
(vadd) (vadd) (vadd) ;
-Brad
Oops, forgot a @
: (vadd)
VP @ 3 cells + dup @ VP @ @ + swap !
1 cells VP +! ;
>> I'm dying to see how VFX does with this VADD:
variable VP \ grows downward ok
: (vadd)
VP @ 3 cells + dup @ VP @ @ + swap !
1 cells VP +! ; ok
: vadd ( V: v1 v2 -- v3 )
(vadd) (vadd) (vadd) ; ok
ok
dis vadd
VADD
( 004C1A90 8B154C9E4500 ) MOV EDX, [00459E4C]
( 004C1A96 83C20C ) ADD EDX, 0C
( 004C1A99 8B0A ) MOV ECX, 0 [EDX]
( 004C1A9B A14C9E4500 ) MOV EAX, [00459E4C]
( 004C1AA0 8B00 ) MOV EAX, 0 [EAX]
( 004C1AA2 03C1 ) ADD EAX, ECX
( 004C1AA4 8902 ) MOV 0 [EDX], EAX
( 004C1AA6 83054C9E450004 ) ADD DWord Ptr [00459E4C], 04
( 004C1AAD 8B154C9E4500 ) MOV EDX, [00459E4C]
( 004C1AB3 83C20C ) ADD EDX, 0C
( 004C1AB6 8B0A ) MOV ECX, 0 [EDX]
( 004C1AB8 A14C9E4500 ) MOV EAX, [00459E4C]
( 004C1ABD 8B00 ) MOV EAX, 0 [EAX]
( 004C1ABF 03C1 ) ADD EAX, ECX
( 004C1AC1 8902 ) MOV 0 [EDX], EAX
( 004C1AC3 83054C9E450004 ) ADD DWord Ptr [00459E4C], 04
( 004C1ACA 8B154C9E4500 ) MOV EDX, [00459E4C]
( 004C1AD0 83C20C ) ADD EDX, 0C
( 004C1AD3 8B0A ) MOV ECX, 0 [EDX]
( 004C1AD5 A14C9E4500 ) MOV EAX, [00459E4C]
( 004C1ADA 8B00 ) MOV EAX, 0 [EAX]
( 004C1ADC 03C1 ) ADD EAX, ECX
( 004C1ADE 8902 ) MOV 0 [EDX], EAX
( 004C1AE0 83054C9E450004 ) ADD DWord Ptr [00459E4C], 04
( 004C1AE7 C3 ) NEXT,
( 88 bytes, 25 instructions )
ok
The problem here is the VP @ ... VP +! pair in each (VADD). If
(VADD) took the contents of VP on the stack and updated it, a lot
of the noise would probably disappear.
> On Wed, 15 Sep 2010 08:37:06 -0700 (PDT), Brad <hwf...@gmail.com> wrote:
>>> I'm dying to see how VFX does with this VADD:
> variable VP \ grows downward ok
> : (vadd)
> VP @ 3 cells + dup @ VP @ @ + swap !
> 1 cells VP +! ; ok
> : vadd ( V: v1 v2 -- v3 )
> (vadd) (vadd) (vadd) ; ok
> ok
> The problem here is the VP @ ... VP +! pair in each (VADD). If
> (VADD) took the contents of VP on the stack and updated it, a lot
> of the noise would probably disappear.
variable VP \ grows downward
: (vadd) ( [sp] -- [sp+] )
>R
R@ 3 cells + dup @
R@ @ + swap !
R> 1 cells + ;
: vadd ( V: v1 v2 -- v3 )
VP (vadd) (vadd) (vadd) DROP ;
FORTH> ' vadd idis
$012437C0 : [trashed]
$012437CA mov rbx, $01243328 qword-offset
$012437D1 add rbx, $01243310 qword-offset
$012437D8 mov $01243328 qword-offset, rbx
$012437DF mov rbx, $01243330 qword-offset
$012437E6 add rbx, $01243318 qword-offset
$012437ED mov $01243330 qword-offset, rbx
$012437F4 mov rbx, $01243338 qword-offset
$012437FB add rbx, $01243320 qword-offset
$01243802 mov $01243338 qword-offset, rbx
$01243809 ;
-marcel
The same? I looped 10^9 (short billion/long milliard) times and counted
seconds*, because I haven't figured out yet how to measure time in Gforth.
Did you really mean microseconds and not nanoseconds?
: now ( -- seconds ) time&date 3drop 60 * + 60 * + ;
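\ used like this (sketch):
\   now  <run the benchmark>  now swap - .  \ prints elapsed seconds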
>variable VP \ grows downward
>: (vadd) ( [sp] -- [sp+] )
> >R
> R@ 3 cells + dup @
> R@ @ + swap !
> R> 1 cells + ;
>: vadd ( V: v1 v2 -- v3 )
> VP (vadd) (vadd) (vadd) DROP ;
Shouldn't that be as below?
: vadd ( V: v1 v2 -- v3 )
VP @ (vadd) (vadd) (vadd) DROP ;
ZB <zbTH...@ispid.THIS-NOcom.pl> wrote:
> On 15.09.2010, Alex Wegel <awe...@arcor.de> wrote:
>
> > Or, if v1+w1 is known not to overflow:
> >: vadd >r rot >r D+ 2r> + ;
>
> Oh, yeah: completely missed the fact that I could add two "ordered" pairs
> using a single 'd+'.
>
> So most probably now it can't be made even shorter.
I did a benchmark of your and my vadd's, and your optimized version
turned out fastest (powerpc g5 gforth), i think my >r>r>r version was
second, but i threw the prog and the results away already, so i'm not
entirely sure anymore about which of mine was 2nd :-/
> > Or even (just for variety's sake):
> >: >4th sp@ [ 3 cells ] literal + ;
> >: +!+ tuck +! cell+ ;
> >: vadd >4th +!+ +!+ +!+ drop ;
>
> Interesting, different approach - still the above seems to be more "compact". ;)
Different approaches are what i started this thread for, but i think a
real difference can only be made by not looking for ever better
ways to shuffle 6 numbers on the stack, and instead solving the real
problem of a missing concept/infrastructure for vector processing.
In vector processing there's something to exploit in the nature of the
data involved (hopefully homogeneous arrays offering a nice consecutive
arrangement of elements).
This isn't fully reflected by putting all the contents on the stack - we
don't really want to know any of the vector element values most of the
time anyway, just add them all together, or whatever.
If (standard) Forth offered something like the A and B pointer registers
in Moore's latest chips, or a clean way to get register-based locals and
primitive pre- or post-increment memory access operators, then - i might
just use them :-)
Anyway - one thing that has gained importance is how the data is
arranged, because for SIMD processing it's best to have vectors full of
homogeneous data, for example: not R,G,B,R,G,B,... but rather R,R,R,...
G,G,G,... B,B,B,...
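Just as an illustration (sizes made up), a planar layout would be
something like:

100 constant #pix
create reds    #pix chars allot  \ all the R's together,
create greens  #pix chars allot  \ then all the G's,
create blues   #pix chars allot  \ then all the B's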
> > BTW: The version i posted already adds two 4-element vectors ;-)
> > The same code, with only the constants changed (huh!), can be used to
> > add vectors of more (smaller) elements, fewer (bigger) ones, or even mixed sizes.
>
> Do you mean, with a little modification?
>
> : v4add 2>r 2rot d+ 2swap 2r> d+ ;
Uhm no. I meant my older post. It's so simple:
hex \ for sure
7f7f7f7f constant v:-) \ vector mask for 4 * 7bit vector
: >v v:-) and ; \ format as vector (= clear carry/separator bits)
: v+c + ; \ add two vecs together, leave the carry flags in the result
: v+ + >v ; \ add two vectors together, return a proper vector
The trick is to first pack all elements into a single cell, leaving a
zero bit between each of them, and feed this cell to + .
After the addition, these "separation bits" contain the carry flags,
which can just be masked out in order to keep on using the cell as a
vector, or be used independently. Just make sure they're clear before
using v+.
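A tiny illustration (hex base, with the 7f7f7f7f mask from above):

7f 01 +    .  \ 80 : lane 0 overflows - the carry lands in the separator bit
7f 01 + >v .  \ 0  : >v masks the separator bits again, leaving a clean vector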
So - my "code" added 2 vectors of four 7-bit values at once by using the
mask 7f7f7f7f, which splits up the cell into 4 "compartments".
Just by moving the zeros in that mask to other places, you can define
other vector formats.
Some examples:
ffdffbff - 3 elements, 10 bits each (leftmost element has no carry)
55555555 - 16 elements, 1 bit each
3FDFEFF - 3 elements, 8 bits each (some space left for e.g. a 5 bit
element or 3 flags)
With 2and and d+ (i.e. 64-bit arithmetic) it becomes:
7fff7fff7fff7fff. 2constant v:-) \ vector mask for 4 * 15bit vector
: >2v v:-) 2and ;
: 2v+c d+ ;
: 2v+ d+ >2v ;
More interesting vectors can be built with this method (3 * 20 bit,
4 * 15 bit, 7 * 8 bit, ...).
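For example, two 4 * 15-bit vectors, one element per 16-bit field
(assuming a 32-bit system, so a double literal holds 64 bits):

0001000200030004. 000a000b000c000d. 2v+ d.  \ prints b000d000f0011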
But - as i mentioned, it breaks down for multiplication (hmm - any way
to get around that?), so it's not too useful after all.
(I thought about going further by completely turning around the bits,
such that the first cell contains the least-significant bit of all the
elements, the next cell contains the second-to-least ones, etc.
Then it would be possible to apply 1-bit-at-a-time algorithms for
arithmetic - but so far i don't think that it can get fast enough to be
considered for doing ordinary arithmetic).
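For the curious, one bit-plane of such an adder could look like this
(untested sketch):

: plane+ ( a b cin -- sum cout )
   >r 2dup and -rot xor   \ a&b a^b              ( R: cin )
   dup r@ and rot or      \ a^b cout
   swap r> xor swap ;     \ sum=a^b^cin  cout=(a&b)|(cin&(a^b))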
Alex
> On 15 Sep 2010 13:14:07 +0200, ZB wrote:
>
> > On 14.09.2010, Coos Haak <chf...@hccnet.nl> wrote:
> >
> >> But running 64-bit Gforth under Linux gives 51 and 79 ns (consistent with
> >> your observation; here locals seem to be slow).
> >> Gforth-fast gave 27 and 29 ns.
> >
> > Strange. Two strange things, even:
> >
> > 1) That the results - and the difference itself - are OS-dependent.
> > 2) That your Core needs the same time as my old Pentium III.
>
> The same? I looped 10^9 (short billion/long milliard) times and counted
> seconds*, because I haven't figured out yet how to measure time in Gforth.
cputime - it's being used in the benchmarks that come with gforth, see
onebench.fs in the gforth directory.
It's also documented in gforth's docs.
> Coos Haak <chf...@hccnet.nl> wrote:
>> The same? I looped 10^9 (short billion/long milliard) times and counted
>> seconds*, because I haven't figured out yet how to measure time in Gforth.
>
> cputime - it's being used in the benchmarks that come with gforth, see
> onebench.fs in the gforth directory.
> It's also documented in gforth's docs.
Sorry, does not work under Vista with Cygwin.
Nor with Ubuntu in VBox under Vista/32
Nor with Ubuntu in VBox under Windows 7/64
Proof: cputime d. d. 1000 ms cputime d. d. gives two same pairs.
I've recently recovered from a big crash (both Windows and Ubuntu) and I'm
not going to set up a dual boot for the coming year or so.
CPUTIME does work on my P133 with SuSE. But doing benchmarks on a slow
machine is not really rocket science.
> > cputime - it's being used in the benchmarks that come with gforth, see
> > onebench.fs in the gforth directory.
> > It's also documented in gforth's docs.
>
> Sorry, does not work under Vista with Cygwin.
> Nor with Ubuntu in VBox under Vista/32
> Nor with Ubuntu in VBox under Windows 7/64
> Proof: cputime d. d. 1000 ms cputime d. d. gives two same pairs.
Oops.
I guess then utime won't either.
Sorry, i don't know much about cygwin - maybe it's possible to time the
benchmarks using the shell? (Cygwin does have a shell, right?)
You could still run the program on the fast PC and have the other PC
measure the time -- ok, maybe i'll stop here :-)
Alex
That's the likely result. CPUTIME reports the CPU time, and 1000 ms
consumes very little CPU time (a few microseconds), and, given the
granularity of the OS timer that is used (typically 1ms-10ms), it's
very unlikely that the results of these two invocations are different.
If you want a test where the times should differ, run something that
actually consumes CPU time.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2010: http://www.euroforth.org/ef10/
Don't do that. It fails if one is negative. Recently bitten by that.
>
>So most probably now it can't be made even shorter.
>
>> Or even (just for variety's sake):
>>: >4th sp@ [ 3 cells ] literal + ;
>>: +!+ tuck +! cell+ ;
>>: vadd >4th +!+ +!+ +!+ drop ;
>
>Interesting, different approach - still the above seems to be more "compact". ;)
>
>> BTW: The version i posted already adds two 4-element vectors ;-)
>> The same code, with only the constants changed (huh!), can be used to
>> add vectors of more (smaller) elements, fewer (bigger) ones, or even mixed sizes.
>
>Do you mean, with a little modification?
>
>: v4add 2>r 2rot d+ 2swap 2r> d+ ;
>
>--
>Zbigniew
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
> In article <slrni91i8u.3...@localhost.localdomain>,
> ZB <zbTH...@ispid.THIS-NOcom.pl> wrote:
> >On 15.09.2010, Alex Wegel <awe...@arcor.de>:
> >
> >> Or, if v1+w1 is known not to overflow:
> >>: vadd >r rot >r D+ 2r> + ;
> >
> >Oh, yeah: completely missed the fact that I could add two "ordered" pairs
> >using a single 'd+'.
>
> Don't do that. It fails if one is negative. Recently bitten by that.
More exactly it fails when addition of the two numbers which end up in
the low words of the summands gives an *unsigned* result >ffffffff.
Sure, this is much more likely with a negative number, and always the
case with two.
The other 2 numbers (forming the summands' high order words) don't have
this restriction.
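A quick illustration with that vadd (32-bit cells assumed):

-1 0 0  1 0 0  vadd  \ -1 + 1 carries unsigned: leaves 0 1 0, not 0 0 0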
Alex
> What I am asking for is not the implementation
> of VADD but examples (with comments) of words that call VADD.
Indeed: in such form the word isn't especially convenient. But it can be
tweaked a little - and "factored" a bit - for example this way:
: (vadd) ( v1 v2 v3 w1 w2 w3 -- y1 y2 y3 ) 2swap -rot + 2swap + 2swap + ;
: vector2stack ( addr -- s1 s2 s3 ) 3 0 do dup i cells + @ swap loop drop ;
: stack2vector ( addr s3 s2 s1 -- ) 3 0 do tuck i cells + ! loop drop ;
: vadd ( addr1 addr2 addr3 -- )
>r >r vector2stack r> vector2stack (vadd) r> stack2vector ;
Not sure whether it can be made even shorter.
--
Zbigniew
> The same? I looped 10^9 (short billion/long milliard) times
Indeed: _not_ the same...
--
Zbigniew
> Don't do that. It fails if one is negative. Recently bitten by that.
You're right - of course; unfortunately, vectors can have negative coordinates.
--
Zbigniew
>: stack2vector ( addr s3 s2 s1 -- ) 3 0 do tuck i cells + ! loop drop ;
Of course, the comment above should be: ( s3 s2 s1 addr -- )
--
Zbigniew
> On 16.09.2010, Albert van der Horst <alb...@spenarnc.xs4all.nl> wrote:
>
> > Don't do that. It fails if one is negative. Recently bitten by that.
>
> You're right - of course, unfortunately vectors can have negative coordinates
The point about this is that sometimes you know that these elements
won't be negative, and then tricks like this can be useful.
> On 15.09.2010, Stephen Pelc <steph...@mpeforth.com> wrote:
>
> > What I am asking for is not the implementation
> > of VADD but examples (with comments) of words that call VADD.
>
> Indeed: in such form the word isn't especially convenient. But it can be
> tweaked a little - and "factored" a bit - for example this way:
>
> : (vadd) ( v1 v2 v3 w1 w2 w3 -- y1 y2 y3 ) 2swap -rot + 2swap + 2swap + ;
> : vector2stack ( addr -- s1 s2 s3 ) 3 0 do dup i cells + @ swap loop drop ;
> : stack2vector ( addr s3 s2 s1 -- ) 3 0 do tuck i cells + ! loop drop ;
> : vadd ( addr1 addr2 addr3 -- )
> >r >r vector2stack r> vector2stack (vadd) r> stack2vector ;
>
> Not sure whether it can be made even shorter.
Yes we can :-)
But then - given the additional complications of (vadd), it might be
shorter to implement vadd without using vector2stack and stack2vector at
all:
\ i shortened the names too ;)
: v3@ dup @ swap cell+ dup @ swap cell+ @ ;
: dcell+ [ 2 cells ] literal + ;
: v3! tuck dcell+ ! tuck cell+ ! ! ;
: cellcell+ [ cell cell ] 2literal d+ ; \ addresses don't overflow.
: vadd >r
over @ over @ + r@ ! cellcell+
over @ over @ r@ cell+ ! cellcell+
@ swap @ + r> dcell+ ! ;
> But then - given the additional complications of (vadd), it might be
> shorter to implement vadd without using vector2stack and stack2vector at
> all:
Decided to make these parts separate words, because they can then be used
in any other word related to vector math.
> \ i shortened the names too ;)
>: v3@ dup @ swap cell+ dup @ swap cell+ @ ;
>: dcell+ [ 2 cells ] literal + ;
>: v3! tuck dcell+ ! tuck cell+ ! ! ;
>: cellcell+ [ cell cell ] 2literal d+ ; \ addresses don't overflow.
>: vadd >r
> over @ over @ + r@ ! cellcell+
> over @ over @ r@ cell+ ! cellcell+
> @ swap @ + r> dcell+ ! ;
I'm curious how this would look - my solution, and yours - using
Stephen's disassembler(?).
--
Zbigniew
> The point about this is that sometimes you know that these elements
> won't be negative, and then tricks like this can be useful.
Of course - if one can assure only positive coords.
After trying to implement the vector product, and some matrix calculations,
I've got a feeling that the guy's approach wasn't that bad, actually. :]
Maybe a bit verbose, but - at the same time - more "generic".
By putting all the data onto the stack first - then shuffling and
calculating - one quickly reaches the "limit of sense". It can be done -
but it becomes less and less comprehensible, and "stack noise" becomes the
larger part of all operations, the more data I need to shuffle. 6 cells on
the stack - not a problem - but doing a vector product I need to shuffle 12
(and even more when the space is more than 3-dimensional). So it's better
to fetch the data partially, calculate, and move it out of the way ASAP,
into a variable.
--
Zbigniew
> On 18.09.2010, Alex Wegel <awe...@arcor.de> wrote:
>
> > But then - given the additional complications of (vadd), it might be
> > shorter to implement vadd without using vector2stack and stack2vector at
> > all:
>
> Decided to make these parts separate words, because they can then be used
> in any other word related to vector math.
Yes, that's why i included them too as v3@ and v3!. The loops were
overkill for 3 items, but how about generalizing them by omitting the
3's, and maybe guarding the loops with ?do :
: n@ ( addr n -- x1..xn ) 0 ?do dup i cells + @ swap loop drop ;
: n! ( x1..xn addr n -- ) 0 ?do tuck i cells + ! loop drop ;
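E.g. (illustrative):

create vec 1 , 2 , 3 ,
vec 3 n@ .s  \ <3> 1 2 3  ( the highest cell ends up on top )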
> > \ i shortened the names too ;)
> >: v3@ dup @ swap cell+ dup @ swap cell+ @ ;
> >: dcell+ [ 2 cells ] literal + ;
> >: v3! tuck dcell+ ! tuck cell+ ! ! ;
> >: cellcell+ [ cell cell ] 2literal d+ ; \ addresses don't overflow.
> >: vadd >r
> > over @ over @ + r@ ! cellcell+
> > over @ over @ r@ cell+ ! cellcell+
> > @ swap @ + r> dcell+ ! ;
Oops - forgot to apply the double-cell trick:
: v3@ dup @ swap cell+ 2@ swap ;
: v3! homework ;
> I'm curious how this would look - my solution, and yours - using
> Stephen's disassembler(?).
Maybe it should use inlined versions of dcell+ and cellcell+ then.
Alex
> By putting all the data onto the stack first - then shuffling and
> calculating - one quickly reaches the "limit of sense". It can be done -
> but it becomes less and less comprehensible, and "stack noise" becomes the
> larger part of all operations, the more data I need to shuffle. 6 cells on
> the stack - not a problem - but doing a vector product I need to shuffle 12
> (and even more when the space is more than 3-dimensional). So it's better
> to fetch the data partially, calculate, and move it out of the way ASAP,
> into a variable.
You are reaching Forth's arm length here. Vector/list/table handling
_can_ be done of course, but you either have to 'roll one yourself', try
to digest&adapt someone else's Forth code, or use another more
supportive programming environment eg Lua.
It's a real and deep pity that the 200x guys here seem to be happy with
banning TIB, instead of agreeing/compromising on an OO standard. Only
then could one expect to share portable and debugged class libraries for
more extended data types than scalars. But I'll probably barf when I see
the next message-object versus object-message syntax discussion.
Andreas
>Decided to make these parts separate words, because they can then be used
>in any other word related to vector math.
>
>> \ i shortened the names too ;)
>>: v3@ dup @ swap cell+ dup @ swap cell+ @ ;
>>: dcell+ [ 2 cells ] literal + ;
>>: v3! tuck dcell+ ! tuck cell+ ! ! ;
>>: cellcell+ [ cell cell ] 2literal d+ ; \ addresses don't overflow.
>>: vadd >r
>> over @ over @ + r@ ! cellcell+
>> over @ over @ r@ cell+ ! cellcell+
>> @ swap @ + r> dcell+ ! ;
>
>I'm curious how this would look - my solution, and yours - using
>Stephen's disassembler(?).
Under Windows or Linux, there's no reason not to do it for yourself.
Just download the relevant version of VFX Forth.
It is rare that I'm going to examine code without stack comments.
Sorry - but I don't need to suffer.
Stephen
Stephen Pelc <steph...@mpeforth.com> wrote:
> It is rare that I'm going to examine code without stack comments.
> Sorry - but I don't need to suffer.
Please let me do the suffering :-)
There was a bug anyway (one more point against the stack noise, even
when using pointers).
: v3@ ( a-addr -- n1 n2 n3 ) dup @ swap cell+ 2@ swap ;
: dcell+ ( n1 -- n2 ) [ 2 cells ] literal + ;
: v3! ( n1 n2 n3 a-addr -- ) tuck dcell+ ! tuck cell+ ! ! ;
: cellcell+ ( a1 a2 -- a3 a4 ) [ cell cell ] 2literal d+ ;
: vadd ( a-addr1 a-addr2 a-addr3 -- )
>r over @ over @ + r@ ! cellcell+
over @ over @ + r@ cell+ ! cellcell+ \ bug fixed: '+' was missing
@ swap @ + r> dcell+ ! ;
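A quick check (illustrative):

create va 1 , 2 , 3 ,  create vb 10 , 20 , 30 ,  create vr 3 cells allot
va vb vr vadd  vr v3@ .s  \ <3> 11 22 33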
Anyway, this is just a recoded example.
I think that, in the long run, vector processing should be properly
defined in Forth, so that common SIMD architectures work at their full
potential.
This means that the focus is on doing things in sets of hardware
vector registers, and on avoiding loads and stores.
This in turn puts a vector stack into the second row as a pure
save/restore stack, and probably suggests defining more assembler-like
bindings to the operations (using register numbers as operands).
A more complicated compiler (complicator?) could hide this, and handle
the register allocation stuff, but even with that, the "low-level"
vector bindings might be a good basis for such a compiler.
Alex ( c--u)