Skybuck's Parallel Universal Code

Skybuck Flying

unread,

Sep 29, 2015, 3:07:06 AM9/29/15

to

Hello,

(It is now 29 september 2015)

As promised here is Skybuck's Parallel Universal Code demonstration program.

This posting contains a Delphi and C/C++ version for you to learn from.

(I was kinda thinking of adding some kind of case statement/array and out of
order execution/randomization to proof that the lines can be executed in any
order, though I didn't come around to that (yet) been playing World of
Warships =D at 1 fps to 30 fps lol)
(If anybody doubt that these lines can be executed out of order I may
eventually add such a feature as welll.. maybe I will do it even for the fun
of it ! ;))
(I was also maybe thinking of trying to through it all into a nice
Tprocessor class with an actual thread to show it off as well... didn't come
around to that yet either ;):))
(However if you clearly examine the indexes used you will discover that
there is no overlap, all indexes are uniquely read and such... no
write-read-write-read-sequential stuff going on... it can all be read in
parallel... except for building rowoffset array).

// ** Begin of Delphi Program ***

program TestProgram;

{$APPTYPE CONSOLE}

{$R *.res}

uses
System.SysUtils;

{

Skybuck's Parallel Universal Code, Demonstration program

version 0.01 created on 19 september 2015

This demonstration program will illustrate how to decode fields of data in
parallel.

For this demonstration program bit fiddling will be avoided by allowing 1
bit per array index to illustrate the general idea.

Also lengths may be stored by 1 integer per array index. In a real world
scenerio it would be Skybuck Universally Encoded.

Skybuck's Parallel Universal Code, Design Document:

version 1 created on 16 september 2015 (after seeing another mentioning of
IBM's processor MIL which makes me sick, as if decoding fields in parallel
is hard ! LOL ;) :))

(I also wrote a little introductionary question for it see other file for it
and I even learned something from it: Parallel decoding lesson for you.txt)

All bits of the fields are split up in such a way that the first bit of each
field is right next to each other, the second bit of each field is also next
to each other, and so forth.

Conceptually view:

First currently situation

Field A consists out of a1a2a3a4 (4 bits)
Field B consists out of b1b2b3 (3 bits)
Field C consists out of c1 (1 bit)
Field D consists out of d1d2d3d4d5d6 ( 6 bits)

These bits are stored as follows:

"First row":
a1b1c1d1

"Second row":
a2b2d2

"Third row":
a3b3d3

"Fourth row":
a4d4

"Fiveth row":
d5

"Sixth row":
d6

These rows will be stores sequentially as follows:
a1b1c1d1a2b2d2a3b3d4a4d4d5d6

Now the question is: How does a processor know where each row begins ? and
how many bits of each field there is, the answers are given below:

The bit length of each row is stored preemptively/prefixed:

"First row": 4
"Second row": 3
"Third row": 3
"Fourth row": 2
"Fiveth row": 1
"Sixth row" : 1

Now for each row their offset can be computed:

First row starts at 0,
Second row starts at 0 + 4 = 4
Third row starts at 0 + 4 + 3 = 7
Fouth row starts at 0 + 4 + 3 + 3 = 10
Fiveth row starts at 0 + 4 + 3 + 3 + 2 = 12
Sixth row starts at 0 + 4+ 3 + 3+ 2 + 1 = 13

Let's check if this is true:

0 1 2 3 4 5 6 7 8 9 10 11 12 13
a1 b1 c1 d1 a2 b2 d2 a3 b3 d3 a4 d4 d5 d6

Bingo, all match.

The processor can compute the offsets of each row.

And thus the processor can reach each field in parallel as follows:

Read field A bit 0 at offset 0
Read field B bit 0 at offset 4
Read field C bit 0 at offset 7
Read field D bit 0 at offset 10

Now the question is how can the processor know when to stop reading ?

A marker/meta bit could be used like described in Skybuck's Universal Code
version 1, which indicates if the field continues or stops.

These bits can be stored in the same way as the data bits above.

And thus for each data bit, a meta bit can be read as well.

This way the processor knows that C2 does not exist and can stop reading
field C

For huffman codes that would not even be required, since the processor can
see in the huffman tree when it reaches a leave/end node and then it will
know the end of C was reached.

So marker/beta bits can be avoided by using Huffman codes ! Pretty neeto !
;) =D

However huffman has a drawback that some fields might get large, and may not
be suited for rare information and modifications and so forth ! ;)

The last thing to do to make this usuable is to include another prefix field
which indicates how many prefixes there are.

So final stream will look something like:

[Number of Fields][Set of Field Lengths][Set of Field Data Bits
Intermixed/Parallel]

First field can be universally coded.
Second set of fields can be universally coded.
Third set of field contains data bits intermixed/in parallel.

Additional:

The problem can be solved by sorting the fields from largest to smallest
field, so new stream looks like:
(sorting from smallest to largest would be possible too... but code below
assumes first field is largest so it uses
largest to smallest sorting solution which is also applied to the stream A,
so stream A below is sorted that way)

Stream A: 433211d1a1b1c1d2a2b2d3a3b3d4a4d5d6

Bye,
Skybuck.

}

function Constrain( Para : integer ) : integer;
begin
result := Para;
if Para < 0 then Result := 0;
if Para >= 1 then Result := 1;
end;

procedure Main;
// must put variables here otherwise won't show up in debugger.
const
MaxProcessorCount = 4;
var
// information stream, input
Stream : array[0..20+(MaxProcessorCount-1)] of integer; // add max
processor count to create a safe "padding" for reading so no out of
bounds/range check errors with arrays.

// bits representing fields of data
a1,a2,a3,a4 : integer;
b1,b2,b3 : integer;
c1 : integer;
d1,d2,d3,d4,d5,d6 : integer;

// output
RowIndex : integer;

RowCount : integer;
RowLength : array[0..5] of integer;
RowOffset : array[0..5] of integer;
FieldRowMultiplier : array[0..3,0..5] of integer;

DataOffset : integer;

FieldCount : integer;
FieldLength : array[0..3] of integer;

Processor : array[0..3] of integer;

// debug fields
FieldA : integer;
FieldB : integer;
FieldC : integer;
FieldD : integer;
begin
a1 := 1; a2 := 1; a3:= 1; a4 := 1;
b1 := 1; b2 := 1; b3 := 1;
c1 := 1;
d1 := 1; d2 := 1; d3 := 1; d4 := 1; d5 := 1; d6 := 1;

// compute input fields to compare it later with output fields
FieldA := (a1) or (a2 shl 1) or (a3 shl 2) or (a4 shl 3);
FieldB := (b1) or (b2 shl 1) or (b3 shl 2);
FieldC := (c1);
FieldD := (d1) or (d2 shl 1) or (d3 shl 2) or (d4 shl 3) or (d5 shl 4)
or (d6 shl 5);

// print field values
writeln( 'FieldA: ', FieldA );
writeln( 'FieldB: ', FieldB );
writeln( 'FieldC: ', FieldC );
writeln( 'FieldD: ', FieldD );
writeln;

// number of rows
Stream[0] := 6;

// row lengths
Stream[1] := 4;
Stream[2] := 3;
Stream[3] := 3;
Stream[4] := 2;
Stream[5] := 1;
Stream[6] := 1;

// sorted information stream:
// d1a1b1c1d2a2b2d3a3b3d4a4d5d6

// data bits
Stream[7] := d1;
Stream[8] := a1;
Stream[9] := b1;
Stream[10] := c1;
Stream[11] := d2;
Stream[12] := a2;
Stream[13] := b2;
Stream[14] := d3;
Stream[15] := a3;
Stream[16] := b3;
Stream[17] := d4;
Stream[18] := a4;
Stream[19] := d5;
Stream[20] := d6;

// now the decoding algorithm:

// determine number of rows
RowCount := Stream[0];

// extract row lengths
RowLength[0] := Stream[1];
RowLength[1] := Stream[2];
RowLength[2] := Stream[3];
RowLength[3] := Stream[4];
RowLength[4] := Stream[5];
RowLength[5] := Stream[6];

// determine field count
FieldCount := RowLength[0]; // row[0] indicates number of fields.

// let's assume first field is largest so it always consumes all row
information.
// should be 0 if negative, should be zero if zero, should be 1 if
positive so and will do the trick to constaint it.
// the -0, -1, -2, -3 represents subtracting the processor
number/identify from it... to allow parallel processing ! ;)

// contraint is wrong... hahga unnt.
FieldRowMultiplier[0,0] := Constrain(RowLength[0]-0); // shoudl be set
to one if larger.
FieldRowMultiplier[0,1] := Constrain(RowLength[1]-0);
FieldRowMultiplier[0,2] := Constrain(RowLength[2]-0);
FieldRowMultiplier[0,3] := Constrain(RowLength[3]-0);
FieldRowMultiplier[0,4] := Constrain(RowLength[4]-0);
FieldRowMultiplier[0,5] := Constrain(RowLength[5]-0);

// now second field may consume less if there are not enough bits.
FieldRowMultiplier[1,0] := Constrain(RowLength[0]-1);
FieldRowMultiplier[1,1] := Constrain(RowLength[1]-1);
FieldRowMultiplier[1,2] := Constrain(RowLength[2]-1);
FieldRowMultiplier[1,3] := Constrain(RowLength[3]-1);
FieldRowMultiplier[1,4] := Constrain(RowLength[4]-1);
FieldRowMultiplier[1,5] := Constrain(RowLength[5]-1);

// now third field may consume less if there are not enough bits.
FieldRowMultiplier[2,0] := Constrain(RowLength[0]-2);
FieldRowMultiplier[2,1] := Constrain(RowLength[1]-2);
FieldRowMultiplier[2,2] := Constrain(RowLength[2]-2);
FieldRowMultiplier[2,3] := Constrain(RowLength[3]-2);
FieldRowMultiplier[2,4] := Constrain(RowLength[4]-2);
FieldRowMultiplier[2,5] := Constrain(RowLength[5]-2);

// now fourth field may consume less if there are not enough bits.
FieldRowMultiplier[3,0] := Constrain(RowLength[0]-3);
FieldRowMultiplier[3,1] := Constrain(RowLength[1]-3);
FieldRowMultiplier[3,2] := Constrain(RowLength[2]-3);
FieldRowMultiplier[3,3] := Constrain(RowLength[3]-3);
FieldRowMultiplier[3,4] := Constrain(RowLength[4]-3);
FieldRowMultiplier[3,5] := Constrain(RowLength[5]-3);

// now compute field lengths

// not necessary to multiply anything just add them up ! ;) =D
FieldLength[0] :=
FieldRowMultiplier[0,0] +
FieldRowMultiplier[0,1] +
FieldRowMultiplier[0,2] +
FieldRowMultiplier[0,3] +
FieldRowMultiplier[0,4] +
FieldRowMultiplier[0,5];

FieldLength[1] :=
FieldRowMultiplier[1,0] +
FieldRowMultiplier[1,1] +
FieldRowMultiplier[1,2] +
FieldRowMultiplier[1,3] +
FieldRowMultiplier[1,4] +
FieldRowMultiplier[1,5];

FieldLength[2] :=
FieldRowMultiplier[2,0] +
FieldRowMultiplier[2,1] +
FieldRowMultiplier[2,2] +
FieldRowMultiplier[2,3] +
FieldRowMultiplier[2,4] +
FieldRowMultiplier[2,5];

FieldLength[3] :=
FieldRowMultiplier[3,0] +
FieldRowMultiplier[3,1] +
FieldRowMultiplier[3,2] +
FieldRowMultiplier[3,3] +
FieldRowMultiplier[3,4] +
FieldRowMultiplier[3,5];

// though the field multipliers could come in handy later to read the
bits ! nice ! ;) =D

// for each row the offset must be calculated this can be done serially
or by all processors for themselfes at the same time:
// row zero starts after the number of rows which is indicated by the
first stream value =D

// first determine data offset properly ! ;) :) 1 for the row count +
RowCount to skip over row lengths.
DataOffset := 1 + RowCount;

RowOffset[0] := DataOffset;
RowOffset[1] := RowOffset[0] + RowLength[0];
RowOffset[2] := RowOffset[1] + RowLength[1];
RowOffset[3] := RowOffset[2] + RowLength[2];
RowOffset[4] := RowOffset[3] + RowLength[3];
RowOffset[5] := RowOffset[4] + RowLength[4];

// now calculate and/ore detemrine length of each field

// first determine number of fields

// now that all row offsets are calculated it's possible to decode the
stream into each processor, by each processor.

// each processor knows it's own number/identitiy represented by the +0
+1 +2 +3 down below:
// the general idea is:
// row[] here is RowOffset[]
{
Processor[0] :=
Stream[Row[0]+0]+Stream[Row[1]+0]+Stream[Row[2]+0]+Stream[Row[3]+0]+Stream[Row[4]+0]+Stream[Row[5]+0];
Processor[1] :=
Stream[Row[0]+1]+Stream[Row[1]+1]+Stream[Row[2]+1]+Stream[Row[3]+1]+Stream[Row[4]+1]+Stream[Row[5]+1];
Processor[2] :=
Stream[Row[0]+2]+Stream[Row[1]+2]+Stream[Row[2]+2]+Stream[Row[3]+2]+Stream[Row[4]+2]+Stream[Row[5]+2];
Processor[3] :=
Stream[Row[0]+3]+Stream[Row[1]+3]+Stream[Row[2]+3]+Stream[Row[3]+3]+Stream[Row[4]+3]+Stream[Row[5]+3];
}
// however a processor should only include the bits of a stream if it's
within the field's length for that we need
// to compute each field length and here we have an oops ! ;) :) cannot
compute field length if multiple fields
// per row. or can we ?! ;) :)perhaps we can... we know that fields have
at least 1 bit because of row 0
// and we know which fields have 2 bits because of row 2 and so forth...
and since all fields
// are ditributed parallely... we don't actually need to know where
their bits are.... cause they all packed lol...
// we only need to know what max length is or something... though how
can we dan be sure that a field ends up
// wehere it needs to be... well we dont... a field can end up in any
processor.

// so how should it actually look like then... well as follows:

// since the fields are stored as follows: A,B,C,D we know A ends up in
processor 0, B in 1, C in 2, D in 3 and so forth.
// thus... processor 0 can determine length of A by looking at row[0],
row[1], row[2], row[3], row[4], row[5], row[6].
// but how to know which count matches to who's field ?
// is the length of row 5 for A or B or C or D ? we don't know do we ?!
;)

// now we should be able to solve the problem... we know the bits belong
to the first few fields... cool ! ;) =D

// now we should be able to solve it easily... by only including a bit
from the stream if the multiplier is set to 1 ;) :)
// and now only thing left to do is shifting the bits into proper
position and or-ing them together ! ;)

Processor[0] :=
((Stream[RowOffset[0]+0]*FieldRowMultiplier[0,0]) shl 0) or
((Stream[RowOffset[1]+0]*FieldRowMultiplier[0,1]) shl 1) or
((Stream[RowOffset[2]+0]*FieldRowMultiplier[0,2]) shl 2) or
((Stream[RowOffset[3]+0]*FieldRowMultiplier[0,3]) shl 3) or
((Stream[RowOffset[4]+0]*FieldRowMultiplier[0,4]) shl 4) or
((Stream[RowOffset[5]+0]*FieldRowMultiplier[0,5]) shl 5);

Processor[1] :=
((Stream[RowOffset[0]+1]*FieldRowMultiplier[1,0]) shl 0) or
((Stream[RowOffset[1]+1]*FieldRowMultiplier[1,1]) shl 1) or
((Stream[RowOffset[2]+1]*FieldRowMultiplier[1,2]) shl 2) or
((Stream[RowOffset[3]+1]*FieldRowMultiplier[1,3]) shl 3) or
((Stream[RowOffset[4]+1]*FieldRowMultiplier[1,4]) shl 4) or
((Stream[RowOffset[5]+1]*FieldRowMultiplier[1,5]) shl 5);

Processor[2] :=
((Stream[RowOffset[0]+2]*FieldRowMultiplier[2,0]) shl 0) or
((Stream[RowOffset[1]+2]*FieldRowMultiplier[2,1]) shl 1) or
((Stream[RowOffset[2]+2]*FieldRowMultiplier[2,2]) shl 2) or
((Stream[RowOffset[3]+2]*FieldRowMultiplier[2,3]) shl 3) or
((Stream[RowOffset[4]+2]*FieldRowMultiplier[2,4]) shl 4) or
((Stream[RowOffset[5]+2]*FieldRowMultiplier[2,5]) shl 5);

Processor[3] :=
((Stream[RowOffset[0]+3]*FieldRowMultiplier[3,0]) shl 0) or
((Stream[RowOffset[1]+3]*FieldRowMultiplier[3,1]) shl 1) or
((Stream[RowOffset[2]+3]*FieldRowMultiplier[3,2]) shl 2) or
((Stream[RowOffset[3]+3]*FieldRowMultiplier[3,3]) shl 3) or
((Stream[RowOffset[4]+3]*FieldRowMultiplier[3,4]) shl 4) or
((Stream[RowOffset[5]+3]*FieldRowMultiplier[3,5]) shl 5);

// *** STILL TO DO (solved by extending array):
********************************
// *** ^^^ may have to look into potential out of range problem ^^^ ****
// *********************************************************************
// for now it seems ok, also as long as stream array has
+NumberOfProcessors scratch pad at end it may be ok ! ;) :)

// one last problem might remain... the row offset may go out of
range...
// we could either use a scratch pad or solve this in another way.
// I think it's best to leave it as as and perhaps make the stream a bit
larger or omething let's see what happens.

// print processor values.
writeln( 'Processor[0]: ', Processor[0] );
writeln( 'Processor[1]: ', Processor[1] );
writeln( 'Processor[2]: ', Processor[2] );
writeln( 'Processor[3]: ', Processor[3] );
writeln;
end;

begin
try
Main;
except
on E: Exception do
Writeln(E.ClassName, ': ', E.Message);
end;
ReadLn;
end.

// *** End of Delphi Program ***

// *** Begin of C/C++ Program ***

// ParallelDecodingCVersion.cpp : Defines the entry point for the console
application.
//

// Full C version 0.01 created on 23 september 2015 by Skybuck Flying =D

#include "stdafx.h"

// Begin of Dummy Decoder Example

const int MaxProcessorCount = 4;

int Constrain( int Para )
{
if (Para < 0) Para = 0;
if (Para >= 1) Para = 1;
return Para;
}

int _tmain(int argc, _TCHAR* argv[])
{
// information stream, input
int Stream[21+(MaxProcessorCount-1)]; // add max processor count to
create a safe "padding" for reading so no out of bounds/range check errors
with arrays.

// bits representing fields of data
int a1,a2,a3,a4;
int b1,b2,b3;
int c1;
int d1,d2,d3,d4,d5,d6;

// output
int RowIndex;

int RowCount;
int RowLength[6];
int RowOffset[6];
int FieldRowMultiplier[4][6];

int DataOffset;

int FieldCount;
int FieldLength[4];

int Processor[4];

// debug fields
int FieldA;
int FieldB;
int FieldC;
int FieldD;

// input test 1
/*
a1 = 1; a2 = 1; a3 = 1; a4 = 1;
b1 = 1; b2 = 1; b3 = 1;
c1 = 1;
d1 = 1; d2 = 1; d3 = 1; d4 = 1; d5 = 1; d6 = 1;
*/

// input test 2
a1 = 1; a2 = 0; a3 = 0; a4 = 1;
b1 = 1; b2 = 1; b3 = 0;
c1 = 1;
d1 = 1; d2 = 1; d3 = 0; d4 = 0; d5 = 1; d6 = 0;

// compute input fields to compare it later with output fields
FieldA = (a1) | (a2 << 1) | (a3 << 2) | (a4 << 3);
FieldB = (b1) | (b2 << 1) | (b3 << 2);
FieldC = (c1);
FieldD = (d1) | (d2 << 1) | (d3 << 2) | (d4 << 3) | (d5 << 4) | (d6 <<
5);

// print field values
printf( "FieldD: %d \n", FieldD );
printf( "FieldA: %d \n", FieldA );
printf( "FieldB: %d \n", FieldB );
printf( "FieldC: %d \n\n", FieldC );

// number of rows
Stream[0] = 6;

// row lengths
Stream[1] = 4;
Stream[2] = 3;
Stream[3] = 3;
Stream[4] = 2;
Stream[5] = 1;
Stream[6] = 1;

// sorted information stream:
// d1a1b1c1d2a2b2d3a3b3d4a4d5d6

// data bits
Stream[7] = d1;
Stream[8] = a1;
Stream[9] = b1;
Stream[10] = c1;
Stream[11] = d2;
Stream[12] = a2;
Stream[13] = b2;
Stream[14] = d3;
Stream[15] = a3;
Stream[16] = b3;
Stream[17] = d4;
Stream[18] = a4;
Stream[19] = d5;
Stream[20] = d6;

// now the decoding algorithm:

// determine number of rows
RowCount = Stream[0];

// extract row lengths
RowLength[0] = Stream[1];
RowLength[1] = Stream[2];
RowLength[2] = Stream[3];
RowLength[3] = Stream[4];
RowLength[4] = Stream[5];
RowLength[5] = Stream[6];

// determine field count
FieldCount = RowLength[0]; // row[0] indicates number of fields.

// I will help out a bit... by leaving this code in ! ;) seems somewhat
obvious ;)
// first determine data offset properly ! ;) :) 1 for the row count +
// RowCount to skip over row lengths.
DataOffset = 1 + RowCount;

RowOffset[0] = DataOffset;
RowOffset[1] = RowOffset[0] + RowLength[0];
RowOffset[2] = RowOffset[1] + RowLength[1];
RowOffset[3] = RowOffset[2] + RowLength[2];
RowOffset[4] = RowOffset[3] + RowLength[3];
RowOffset[5] = RowOffset[4] + RowLength[4];

// let's assume first field is largest so it always consumes all row
information.
// should be 0 if negative, should be zero if zero, should be 1 if
positive so and will do the trick to constaint it.
// the -0, -1, -2, -3 represents subtracting the processor
number/identify from it... to allow parallel processing ! ;)

// contraint is wrong... hahga unnt.
FieldRowMultiplier[0][0] = Constrain(RowLength[0]-0); // shoudl be set
to one if larger.
FieldRowMultiplier[0][1] = Constrain(RowLength[1]-0);
FieldRowMultiplier[0][2] = Constrain(RowLength[2]-0);
FieldRowMultiplier[0][3] = Constrain(RowLength[3]-0);
FieldRowMultiplier[0][4] = Constrain(RowLength[4]-0);
FieldRowMultiplier[0][5] = Constrain(RowLength[5]-0);

// now second field may consume less if there are not enough bits.
FieldRowMultiplier[1][0] = Constrain(RowLength[0]-1);
FieldRowMultiplier[1][1] = Constrain(RowLength[1]-1);
FieldRowMultiplier[1][2] = Constrain(RowLength[2]-1);
FieldRowMultiplier[1][3] = Constrain(RowLength[3]-1);
FieldRowMultiplier[1][4] = Constrain(RowLength[4]-1);
FieldRowMultiplier[1][5] = Constrain(RowLength[5]-1);

// now third field may consume less if there are not enough bits.
FieldRowMultiplier[2][0] = Constrain(RowLength[0]-2);
FieldRowMultiplier[2][1] = Constrain(RowLength[1]-2);
FieldRowMultiplier[2][2] = Constrain(RowLength[2]-2);
FieldRowMultiplier[2][3] = Constrain(RowLength[3]-2);
FieldRowMultiplier[2][4] = Constrain(RowLength[4]-2);
FieldRowMultiplier[2][5] = Constrain(RowLength[5]-2);

// now fourth field may consume less if there are not enough bits.
FieldRowMultiplier[3][0] = Constrain(RowLength[0]-3);
FieldRowMultiplier[3][1] = Constrain(RowLength[1]-3);
FieldRowMultiplier[3][2] = Constrain(RowLength[2]-3);
FieldRowMultiplier[3][3] = Constrain(RowLength[3]-3);
FieldRowMultiplier[3][4] = Constrain(RowLength[4]-3);
FieldRowMultiplier[3][5] = Constrain(RowLength[5]-3);

// now compute field lengths

// not necessary to multiply anything just add them up ! ;) =D
FieldLength[0] =
FieldRowMultiplier[0][0] +
FieldRowMultiplier[0][1] +
FieldRowMultiplier[0][2] +
FieldRowMultiplier[0][3] +
FieldRowMultiplier[0][4] +
FieldRowMultiplier[0][5];

FieldLength[1] =
FieldRowMultiplier[1][0] +
FieldRowMultiplier[1][1] +
FieldRowMultiplier[1][2] +
FieldRowMultiplier[1][3] +
FieldRowMultiplier[1][4] +
FieldRowMultiplier[1][5];

FieldLength[2] =
FieldRowMultiplier[2][0] +
FieldRowMultiplier[2][1] +
FieldRowMultiplier[2][2] +
FieldRowMultiplier[2][3] +
FieldRowMultiplier[2][4] +
FieldRowMultiplier[2][5];

FieldLength[3] =
FieldRowMultiplier[3][0] +
FieldRowMultiplier[3][1] +
FieldRowMultiplier[3][2] +
FieldRowMultiplier[3][3] +
FieldRowMultiplier[3][4] +
FieldRowMultiplier[3][5];

// though the field multipliers could come in handy later to read the
bits ! nice ! ;) =D

// for each row the offset must be calculated this can be done serially
or by all processors for themselfes at the same time:
// row zero starts after the number of rows which is indicated by the
first stream value =D

// first determine data offset properly ! ;) :) 1 for the row count +
RowCount to skip over row lengths.
DataOffset = 1 + RowCount;

RowOffset[0] = DataOffset;
RowOffset[1] = RowOffset[0] + RowLength[0];
RowOffset[2] = RowOffset[1] + RowLength[1];
RowOffset[3] = RowOffset[2] + RowLength[2];
RowOffset[4] = RowOffset[3] + RowLength[3];
RowOffset[5] = RowOffset[4] + RowLength[4];

// now calculate and/ore detemrine length of each field

// first determine number of fields

// now that all row offsets are calculated it's possible to decode the
stream into each processor, by each processor.

// each processor knows it's own number/identitiy represented by the +0
+1 +2 +3 down below:
// the general idea is:
// row[] here is RowOffset[]
/*
Processor[0] :=
Stream[Row[0]+0]+Stream[Row[1]+0]+Stream[Row[2]+0]+Stream[Row[3]+0]+Stream[Row[4]+0]+Stream[Row[5]+0];
Processor[1] :=
Stream[Row[0]+1]+Stream[Row[1]+1]+Stream[Row[2]+1]+Stream[Row[3]+1]+Stream[Row[4]+1]+Stream[Row[5]+1];
Processor[2] :=
Stream[Row[0]+2]+Stream[Row[1]+2]+Stream[Row[2]+2]+Stream[Row[3]+2]+Stream[Row[4]+2]+Stream[Row[5]+2];
Processor[3] :=
Stream[Row[0]+3]+Stream[Row[1]+3]+Stream[Row[2]+3]+Stream[Row[3]+3]+Stream[Row[4]+3]+Stream[Row[5]+3];
*/
// however a processor should only include the bits of a stream if it's
within the field's length for that we need
// to compute each field length and here we have an oops ! ;) :) cannot
compute field length if multiple fields
// per row. or can we ?! ;) :)perhaps we can... we know that fields have
at least 1 bit because of row 0
// and we know which fields have 2 bits because of row 2 and so forth...
and since all fields
// are ditributed parallely... we don't actually need to know where
their bits are.... cause they all packed lol...
// we only need to know what max length is or something... though how
can we dan be sure that a field ends up
// wehere it needs to be... well we dont... a field can end up in any
processor.

// so how should it actually look like then... well as follows:

// since the fields are stored as follows: A,B,C,D we know A ends up in
processor 0, B in 1, C in 2, D in 3 and so forth.
// thus... processor 0 can determine length of A by looking at row[0],
row[1], row[2], row[3], row[4], row[5], row[6].
// but how to know which count matches to who's field ?
// is the length of row 5 for A or B or C or D ? we don't know do we ?!
;)

// now we should be able to solve the problem... we know the bits belong
to the first few fields... cool ! ;) =D

// now we should be able to solve it easily... by only including a bit
from the stream if the multiplier is set to 1 ;) :)
// and now only thing left to do is shifting the bits into proper
position and or-ing them together ! ;)

Processor[0] =
((Stream[RowOffset[0]+0]*FieldRowMultiplier[0][0]) << 0) |
((Stream[RowOffset[1]+0]*FieldRowMultiplier[0][1]) << 1) |
((Stream[RowOffset[2]+0]*FieldRowMultiplier[0][2]) << 2) |
((Stream[RowOffset[3]+0]*FieldRowMultiplier[0][3]) << 3) |
((Stream[RowOffset[4]+0]*FieldRowMultiplier[0][4]) << 4) |
((Stream[RowOffset[5]+0]*FieldRowMultiplier[0][5]) << 5);

Processor[1] =
((Stream[RowOffset[0]+1]*FieldRowMultiplier[1][0]) << 0) |
((Stream[RowOffset[1]+1]*FieldRowMultiplier[1][1]) << 1) |
((Stream[RowOffset[2]+1]*FieldRowMultiplier[1][2]) << 2) |
((Stream[RowOffset[3]+1]*FieldRowMultiplier[1][3]) << 3) |
((Stream[RowOffset[4]+1]*FieldRowMultiplier[1][4]) << 4) |
((Stream[RowOffset[5]+1]*FieldRowMultiplier[1][5]) << 5);

Processor[2] =
((Stream[RowOffset[0]+2]*FieldRowMultiplier[2][0]) << 0) |
((Stream[RowOffset[1]+2]*FieldRowMultiplier[2][1]) << 1) |
((Stream[RowOffset[2]+2]*FieldRowMultiplier[2][2]) << 2) |
((Stream[RowOffset[3]+2]*FieldRowMultiplier[2][3]) << 3) |
((Stream[RowOffset[4]+2]*FieldRowMultiplier[2][4]) << 4) |
((Stream[RowOffset[5]+2]*FieldRowMultiplier[2][5]) << 5);

Processor[3] =
((Stream[RowOffset[0]+3]*FieldRowMultiplier[3][0]) << 0) |
((Stream[RowOffset[1]+3]*FieldRowMultiplier[3][1]) << 1) |
((Stream[RowOffset[2]+3]*FieldRowMultiplier[3][2]) << 2) |
((Stream[RowOffset[3]+3]*FieldRowMultiplier[3][3]) << 3) |
((Stream[RowOffset[4]+3]*FieldRowMultiplier[3][4]) << 4) |
((Stream[RowOffset[5]+3]*FieldRowMultiplier[3][5]) << 5);

// *** STILL TO DO (solved by extending array):
********************************
// *** ^^^ may have to look into potential out of range problem ^^^ ****
// solved by adding padding for reading ;)
// *********************************************************************
// for now it seems ok, also as long as stream array has
+NumberOfProcessors scratch pad at end it may be ok ! ;) :)

// one last problem might remain... the row offset may go out of
range...
// we could either use a scratch pad|solve this in another way.
// I think it's best to leave it as as and perhaps make the stream a bit
larger or omething let's see what happens.

// print processor values.
printf( "Processor[0]: %d \n", Processor[0] );
printf( "Processor[1]: %d \n", Processor[1] );
printf( "Processor[2]: %d \n", Processor[2] );
printf( "Processor[3]: %d \n\n", Processor[3] );

return 0;
}

// *** End of C/C++ Program ***

Bye,
Skybuck.

Kerr Mudd-John

unread,

Sep 29, 2015, 8:23:51 AM9/29/15

to

On Tue, 29 Sep 2015 08:07:08 +0100, Skybuck Flying
<skybu...@hotmail.com> wrote:

> Hello,
>
> (It is now 29 september 2015)
>
> As promised here is Skybuck's Parallel Universal Code demonstration
> program.
>
> This posting contains a Delphi and C/C++ version for you to learn from.

[]

> // *** End of C/C++ Program ***
>
> Bye,
> Skybuck.
>

Shere Jenious!

But not asm.

--
Bah, and indeed, Humbug

dunno

unread,

Sep 30, 2015, 7:12:49 AM9/30/15

to

Are you trying to emulate few processes with one which have out-of-order
execution? Usually this problem is solved differently.

What you just tried to wrote is implemented as actual CPU by MultiClet:
http://multiclet.com/
They have many (currently four) "processors" in one. "Processors" have same
set of registers, and they execute code out-of-order, waiting only when
result of operation depends on result of another operation which is
currently being executed by another "processor". Results of operations are
kept in internal CPU memory which is able to hold up to 64 results.

You actually can buy evaluation board with their MultiClet processor from
them. Unfortunately, their website is not that informative and IIRC the
English documentation is really outdated.

--
dunno

Skybuck Flying

unread,

Sep 30, 2015, 7:18:40 AM9/30/15

to

"
Are you trying to emulate few processes with one which have out-of-order
execution? Usually this problem is solved differently.
"

No parallel = out of order.

If it's not out of order it's not parallel.

Bye,
Skybuck.

Skybuck Flying

unread,

Sep 30, 2015, 7:19:37 AM9/30/15

to

Let me rephrase that for you

"
Are you trying to emulate few processes with one which have out-of-order
execution? Usually this problem is solved differently.
"

No, parallel = out of order.

If it's not out of order it's not parallel.

Bye,
Skybuck.

There ya go ;)

Skybuck Flying

unread,

Sep 30, 2015, 7:22:27 AM9/30/15

to

"
What you just tried to wrote is implemented as actual CPU by MultiClet:
http://multiclet.com/
They have many (currently four) "processors" in one. "Processors" have same
set of registers, and they execute code out-of-order, waiting only when
result of operation depends on result of another operation which is
currently being executed by another "processor". Results of operations are
kept in internal CPU memory which is able to hold up to 64 results.

You actually can buy evaluation board with their MultiClet processor from
them. Unfortunately, their website is not that informative and IIRC the
English documentation is really outdated.
"

Sounds nice, I may investigate further later on ! ;)

For now proving that code can be executed out of order also proves that it
can run in parallel.

So it's more about proving in case anybody doubts it.

If you don't doubt... then you can leave my out of order comments out of it.

Though perhaps one order... might make more sense than another ! ;) =D

^ optimization possibilities ? ;) ^

^ Caching possibilities ? ^

Bye,
Skybuck.

dunno

unread,

Sep 30, 2015, 7:36:21 AM9/30/15

to

Skybuck Flying <skybu...@hotmail.com> wrote:
> "
> What you just tried to wrote is implemented as actual CPU by MultiClet:
> http://multiclet.com/
> They have many (currently four) "processors" in one. "Processors" have same
> set of registers, and they execute code out-of-order, waiting only when
> result of operation depends on result of another operation which is
> currently being executed by another "processor". Results of operations are
> kept in internal CPU memory which is able to hold up to 64 results.
>
> You actually can buy evaluation board with their MultiClet processor from
> them. Unfortunately, their website is not that informative and IIRC the
> English documentation is really outdated.
> "
>
> Sounds nice, I may investigate further later on ! ;)
>
> For now proving that code can be executed out of order also proves that it
> can run in parallel.

No one doubts it, but there is catch. If you find out what the catch is,
you'll realize why MultiClet still don't have optimizing C compiler, even
though they already produce CPUs for few years.

> So it's more about proving in case anybody doubts it.
>
> If you don't doubt... then you can leave my out of order comments out of it.
>
> Though perhaps one order... might make more sense than another ! ;) =D
>
> ^ optimization possibilities ? ;) ^
>
> ^ Caching possibilities ? ^
>
> Bye,
> Skybuck.
>
>

--
dunno

Skybuck Flying

unread,

Sep 30, 2015, 8:51:07 AM9/30/15

to

"
No one doubts it, but there is catch. If you find out what the catch is,
you'll realize why MultiClet still don't have optimizing C compiler, even
though they already produce CPUs for few years.
"

Not sure, but let's go back to the original intention/problem.

The problem was decoding instructions in parallel.

This is and should now be solved.

However the instructions still are part of a sequential stream. (Though it
can also be extended to multiple parallel sequential instruction
streams/multi threading).

Eventually this stream will have to be executed in serial or partially in
parallel or totally parallel.

That depends on how smart the chip is.

Eventually the result (usually data, sometimes instruction stream
modifications) will be the product of some kind of instruction stream, a
serial result.

I am also not very familiar with C compilers or optimizations and also do
not see why it is relevant for this discussion.

Since the computer chip itself is responsibly for the parallel execution of
instructions.

So the C compiler has very little to do with it at first glance.

One could write assembly language and be perfectly fine with such a parallel
instruction decoding chip.

Perhaps the problem/catch is mostly related to MultiClet itself.

I have hard of bandwidth limitations between cores and such.. though again
don't see what that has to do with C compiler.

Bye,
Skybuck.

Skybuck Flying

unread,

Sep 30, 2015, 8:58:59 AM9/30/15

to

To visually understand my design imagine this:

It's more like compressed matrix, which is decompressed and then computed.

So it mimics the very first computers which were used to create
cloths/fabric.

All matrixes 0's and 1's to indicate which results/operations should be
done/kept or not done/kept.

It's funny to see how some things become very simple when it's completely in
parallel ! ;)

Bye,
Skybuck :)

dunno

unread,

Sep 30, 2015, 7:15:32 PM9/30/15

to

The catch is that almost all of algorithms that people use cannot be
executed out-of-order. In other words, you cannot take any random HLL
source code and expect it to be effectively compiled for CPU which is
trying to run the code out-of-order. It's possible, but the result will be
slow, effectively removing all the advantage of parallel execution.

--
dunno