The Great Chunky to Planar Competition.
[NOTE: The term chunky is used to refer to display organisations where
all the bits associated with the colour (logical or physical) of
a pixel are located together or "chunked" into one byte, word or
long word. Another common term used is "byte per pixel" display.]
[EXAMPLE: A 640x480x256 colour screen may be represented by a block of
307200 bytes. The first byte is the colour of the first pixel,
the second byte is the colour of the second pixel, and so forth.
notice how there is only one "plane", and all the bits of a single
pixel are located together - not distributed over separate planes.]
INTRODUCTION:
The amigas currently available contain no direct support for chunky
displays, instead the display architecture is based on the concept
of bitplanes.
[ Ok, wickedly perverse low resolution chunky displays are possible ]
[ through the use of copperlists, or rapid palette changes via the ]
[ cpu. However these have limited appeal. ]
For many programming tasks, a chunky display architecture provides a
significantly more efficient basis for implementation. This occurs
because the chunky approach is most efficient when all the bits of a
pixel are to be modified, in contrast to the bit plane approach, which
is most efficient when a subset of pixels bits are to be modified.
These issues are most pronounced because of the finite limit on registers
available within the 680x0 cpu, and the finite width of memory words.
THE COMPETITION:
Write the fastest chunky to planar converter and have it included
in an archive for other programmers to use. The winner will receive
the glory and admiration of zillions of other amiga programmers :)
[Actually because the rules are so flexible, everyone is a winner]
Submissions can be made to: jmcc...@postoffice.utas.edu.au
At the end of every two months, I will endeavor to update and upload
an archive of all useful solutions to aminet, so other programmers may
use them. Hopefully all submissions will be made in good spirit and all
other programmers will be able to use them free of charge. If people
object to this, then perhaps I can mark some routines as being free
for private/non-commercial use, but commercial users should get in
contact with the original author(s).
RULES:
There are no rules. Rules are just another conceptual framework to
make you follow the leader like a sheep. However to give you some
guidance you may consider the following.
Probably the most important chunky to planar convertor is one that
takes 8 bit per pixel chunky images to 8 or fewer bitplanes. For
the case of going to fewer than 8 bitplanes, some means of dithering
might be considered. Alternatively certain bits could just be ignored.
Ideally the code should be system friendly (that means no cheating to
gain the A7 as an extra general purpose register), however two versions
could be written. Data alignment/width restrictions would ideally be
32 bits, and your code/data should be of a reasonable size.
You are free to choose your own calling interface, an example interface
is:
; ChunkyToPlanar(src,dst,src_size,dst_size)
; [non interleaved output planes]
;
; input: a0.l = chunky image
; d0.l = size of chunky image in bytes
; a1.l = pointer to first of 8 bitplanes
; d1.l = size of each bitplane in bytes.
;
...
POSSIBLE METHODS:
Many chunky to planar convertors may be considered to consist of 3 stages:
1. Reading in the chunky data.
2. Massaging the data.
3. Writing out the planar data.
Design of the chunky to planar convertor requires careful analysis to
ensure that none of 1-3 stand out as an individual bottleneck. The
main issue is concerned with the size of data items read, converted and
written to memory.
There are four obvious methods for chunky to planar data conversion:
1. Bit-by-Bit
2. Bit-Spread
3. Bit-Rotation
4. Bit-Shuffle
Bit-by-Bit:
Each bit from a chunky pixel is processed in a serial fashion. Commonly
the 680x0 ADD instruction is used to pop a bit off the top of a register
into the X flag, for insertion into a register with a following ADDX
instruction. The alert thinker will of course see that a single ADDX
instruction can both insert a bit, and grab the next bit - two birds
with one stone (note this is not necessarily better - think about it).
Bit-Spread:
Suppose you are converting byte pixels. Read in a byte, look up a
64 bit version of this byte with all its bits spread, do this seven
more times, each time mixing in the freshly looked up bits.
Bit-Rotation:
Perverse method, look up Chris Greens tmapdemo or Graphics Gems II
rotate8x8.c method.
Bit-Shuffle:
If you know how the FFT algorithm works and do a bit of thinking...
Suppose you had a function that took two words and interleaved the
bits of the two words. Suppose you applied it 3-4 times to your data
with a bit of perverse thinking you would have converted your data from
chunky to planar. This approach is most suited to blitter
implemention. (I Guess its really a common sense interpretation of
the abstract and totally bizarre Bit-Rotation method).
FURTHER ISSUES:
Machines to support:
Because many chunky to planar routines are extremely sensitive to
memory bandwidths and cache sizes, it may be wise to consider 9 classes
of machine configurations as listed below.
CPU Chipset Chipram Fastram I-cache D-cache
68000 OCS/ECS 16-bit none none none
68000 OCS/ECS 16-bit 16-bit none none
68020 OCS/ECS 16-bit 32-bit 256 none
68030 OCS/ECS 16-bit* 32-bit 256 256
68040 OCS/ECS 16-bit* 32-bit 4k 4k
68020 AGA 32-bit none 256 none
68020 AGA 32-bit 32-bit 256 none
68030 AGA 32-bit 32-bit 256 256
68040 AGA 32-bit 32-bit 4k 4k
[* Potentially 32 bit chip ram as seen from the CPU (A3000) ]
Format of output planes:
The following probably cover all of those cases which will ever be wanted.
- Normal fully separated planes
- Line interleaved planes
- Word interleaved planes [Atari ST style, Spot the Falcon030 coder :-) ).
- Sprite Planes [Demo coders?].
Format of input data:
The most common input format will no doubt be byte per pixel chunky
maps, however the following might also be considered:
CHUNKY (non compact)
Each pixel is represented by one byte, at any of 1,2,3,4,5,6,7 or 8 bits
of precision. Data may be either left or right justified within each
byte.
CHUNKY (compact)
Each pixel is represented by 1,2,4 or 8 bits, with as many pixels packed
into each byte as possible.
OTHERS
- Word per pixel mode (like Falcon030). With 5 bits for each of red, green
and blue.
- Long word per pixel mode. With 8-10 bits for each of red, green and blue.
- Three plane chunky mode (as used in image processing). A separate chunky
plane is used for each of red, green and blue. Particularly useful
when working on high speed polygon colourizers with a different (R,G,B)
vector on each vertex.
TARGETS:
To give you an idea of what you are aiming for... good 68000 solutions
to byte per pixel -> 8 planes, currently take under 70 cycles per pixel
without table lookups.
Suggested targets useful to commercial games writers (and demo coders) are:
68020 - byte per pixel -> 8 bit plane convertor (AGA)
68000 - byte per pixel -> 4,5 or 6 planes possibly via dither table (ECS)
THE FUTURE:
Stay tuned for "The Great Copper Abstraction Competition", and also
"The Great Kickstart Optimisation Competition (patches)".
So, instead, I'll pass on a thought: there is a program available,
from the usual places (probably prep.ai.mit.edu) a program called
GSO (gnu super optimiser) which will find absolutely the fastest
code to achieve a given operation. It does it, if necessary, by *exhaustive
search* in the space of the results of executing opcodes. Since
the conversion problem is well specified, and since (I believe) the gso
knows the 68000 instruction set, a few hours running on a fast computer
ought to solve this problem for once and for all.
Good luck
michael
>Good luck
>michael
Nice try... but I just made a breakthru. Using a blitter routine, I'm
sure for A1200+fastram.. we can almost forget we never had a chunky mode :)...
I don't think the GCC super-compiler/optimser knows about the blitter :)
And I dont beleive any code optimizer has setup
enabling the user to describe the machine bus system
or any other special cases...
Anyway.I think a Blitter version will need 10 pass
on the data... Thats no problem if the CPU is slower
at calculating the chunky data before conversion.
But like the GREAT discusion on blitter VS cPU
I think the blitter is only good in some cases :)
If the CPU has to wait for the blitter to finish
all is lost.
By the way if can do this in less than 10 blitter
pass you are a wizard :)
Thats how I see the blitter do it.:
0123456701234567012345670123456701234567012345670123456701234567
0 4 1 5 2 6 3 70 4 1 5 2 6 3 7
0 4 1 5 2 6 0 4 1 5 2 6
0 4 1 5 0 4 1 5
0 4 0 4
1 5 2 6 3 7 1 5 2 6 3 7
2 6 3 7 2 6 3 7
3 7 3 7
0000444411115555222266663333777700004444111155552222666633337777
0000 1111 2222 3333
4444 5555 6666 7777
0000000011111111222222223333333344444444555555556666666677777777
00000000
00000000
add the plane copy....
Thats the same has the funky way I 680x0, bu the CPU can
read 2 long work on it and save.The Blitter need multiple
pass over... But if the CPU is not faster at creating the
frame we can say: "Hey we have Chunky to plane in hardware!" :))
S.Schaem
> Anyway.I think a Blitter version will need 10 pass
>on the data... Thats no problem if the CPU is slower
>at calculating the chunky data before conversion.
> But like the GREAT discusion on blitter VS cPU
> I think the blitter is only good in some cases :)
Wrong! Depending on your image generation speed you get the cpu
to help the blitter... the blitter will always give some speed up.
When your cpu copy data from fastram to chipram it may do some
passes that the blitter would do :)... how much it does depends on
your exact framrate... pretty easy to calculate and select one of 3
fastram->chipram copiers and one of 3 blitter chunky->planar routines.
Hehe i think 12.5 fps full screen 256 colour {288x180} is possible
giving 68020 about 30 cycles to generate and copy each pixel from
fastram to chipram.. 30 cycles per pixel rendering time should allow
complex scenes... this should be about 2x faster than tmapdemo on
A1200+fast!!!
> If the CPU has to wait for the blitter to finish
>all is lost.
Sure... once you get more than tripple buffering all is lost for sure :)
> By the way if can do this in less than 10 blitter
>pass you are a wizard :)
Here goes my entry into wizardness...
Four pass technique. Each pass is broken into two sub passes. Because the
blitter can only write out one destination word for every set of source words
read in, two sub passes are necessary.
PASS-1: Byte interleaving [ BLTCDAT = %1111111100000000 ] 8
PASS-2: Nibble interleaving [ BLTCDAT = %1111000011110000 ] 4
PASS-3: Pair interleaving [ BLTCDAT = %1100110011001100 ] 2
PASS-4: Bit interleaving. [ BLTCDAT = %1010101010101010 ] 1
Each passes uses the following blitter function:
D = (A and C) or (B and ~C) [only A,B,D dma channels enabled].
Now as a hint I demonstrate how the first pass works, you can work out
the rest from there.
Suppose Chunky = A B C D E F G H a b c d e f g h [each letter is a byte]
^ ^
| |
BLTAPT BLTBPT
subpass1: BLTDPT = TemporaryBuffer1
BLTAMOD = 8
BLTBMOD = 8
BLTSIZE = HEIGHT=NumberChunkyPixels/16, WIDTH=4
ASHFT = 0
BSHFT = 8
LF = AC+Bc, ASCENDING MODE
subpass2: BLTDPT = TemporaryBuffer1 + Size-2
BLTxMOD, as before.
BLTSIZE as before.
ASHFT =8, BSHFT =0, DESCENDING MODE.
BLTAPT = Chunky+Size-2-8
BLTBPT = Chunky+Size-2
Thus output is A a C c E e G g ....
B b D d F f H h ....
Another 3 passes and you can be home :)