http://www.ibookguy.com/wp-content/uploads/2010/12/53COUMNS.png
If anyone wants a copy of the executable or source code, I'll be happy to
send it. I thought it might make a good routine for a text adventure.
Would you be willing to wrap this in CONIO for cc65? I'll pay $$$.
It would need to be a complete drop-in replacement for the stock CONIO
library for the 64.
Funny thing is, I already thought of doing that. However, truth be
told, I don't know how.
But I want 64 ~ the one thing that grated with C64 Forth
implementations was the "not quite standard" Blocks because of 40
column wide lines instead of 64 column wide lines.
Heck, switch back to 40 columns wide in the bottom 9 rows, just get a
16 line by 64 column block display.
This could be done too, using a 5-pixel font. However, at that point
you would lose the double-thick lines common to the standard 64
fonts, which make them easy to read on NTSC. Once you move to
single-pixel lines, the text becomes blurry. My goal with the 6-pixel
wide font was to keep it looking like the standard font.
On the TI-99, we used a 4-pixel font to get 64 columns in Forth...
wasn't too bad, although the characters were somewhat blocky... but it
made for a great Forth editor, 64 columns by 16 rows...
Tried a 'proportional' font where some characters used only 3 pixels,
but the results were - well, pretty bad...
At least it allows a full 1024 block to be on the screen (you lose 24
with an 80x25 wide screen), even with a non-standard line length (just
don't use the ``\'' comment word) of 1 12 long and 21 48 long lines.
If a 5-wide font routine puts in the blank pixel itself, it can be a 4-
pixel font, which allows the font to be doubled to reduce the amount
of shifting required.
does it involve border-opening tricks/sprites?
--
No, it is simply using the 320x240 hi-res graphics mode with a 6-pixel
wide font and a program designed to plot the characters on the screen
very quickly.
pretty interesting then, though personally I gave up using the c64 as a
text terminal not so long ago in favour of much more power- and
cost-efficient microcontrollers (like the atmega simple tv terminal)
--
you wish, 320x200 ;)
--
-=[]=--- iAN CooG/HVSC & C64Intros ---=[]=-
Dog crawls under doors, programs crawls under windows...
---------------
I dunno if shifting is that big a deal.
I wrote a 64x32 display routine for a terminal emulator once, using 5x6
characters (scrunched down from the real terminal's 8x16 characters). Didn't
have any trouble keeping up at 1200 baud even though shifting (if necessary)
always went to the right. Always used two bytes, a 'left' and a 'right',
both always plotted using masks, even though in many cases the shift didn't
actually put any pixels in the 'right' byte.
Much later I realized I could get by with one byte by rotating bits that
'fall off' the right edge into the left edge (since both 'left' and 'right'
bytes had to be masked anyway and there was no overlap between what they
held). Even that wouldn't be necessary unless the shift was for more than
three bits (for a five-pixel wide character).
And of course when using one byte with 'stuffing' at the other end of a
shift, a one-bit left shift has the same result as a seven-bit right shift.
Extending that, no more than four shifts are needed in any case. An indirect
jump set at the start of plotting each character points to the appropriate
place in a 'cascade' of shifts to start shifting each byte.
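
The "no more than four shifts" claim can be checked with a small model. This is only a sketch in Python rather than 6502 code; the helper names (rol8, ror8, cheapest_rotation) are made up for illustration and are not from the emulator.

```python
def rol8(value, count):
    """Rotate an 8-bit value left by count bits."""
    count %= 8
    return ((value << count) | (value >> (8 - count))) & 0xFF


def ror8(value, count):
    """Rotate an 8-bit value right by count bits."""
    count %= 8
    return ((value >> count) | (value << (8 - count))) & 0xFF


def cheapest_rotation(right_shift):
    """Return (direction, steps) using the fewest single-bit rotates."""
    right_shift %= 8
    if right_shift <= 4:
        return ("right", right_shift)
    return ("left", 8 - right_shift)


# A right rotate by k always equals a left rotate by 8 - k.
for pattern in range(256):
    for k in range(8):
        assert ror8(pattern, k) == rol8(pattern, 8 - k)

worst_case = max(cheapest_rotation(k)[1] for k in range(8))
print("most single-bit rotates ever needed:", worst_case)
```

Picking the shorter direction is what caps the cascade at four entries.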
And if that wasn't fast enough, there was always the possibility of
pre-computing the shifts in seven lookup tables. (the emulator was only
about 3K in its first version; I had lots of space).
Still, I calculated that even the slowest version could place a character
faster than the next one could arrive at 1200 baud. The real bottleneck was
erasing the whole screen; that always took many character arrival times to
complete. What I should have done was use two hi-res screens, erasing a
small bit of the currently unused screen each pass through the main loop.
When the command to erase the screen came, in the best case it would simply
mean switching screens during the vertical blanking interval. In the worst
case, if nothing at all had been done to erase the unused screen before the
next erase command came, it was no worse than what I was doing already (and
switching during the vertical blank would have prevented the user from
seeing the erasing going on, another plus).
- Anton Treuenfels
> I dunno if shifting is that big a deal.
>
> I wrote a 64x32 display routine for a terminal emulator once, using 5x6
> characters (scrunched down from the real terminal's 8x16 characters). Didn't
> have any trouble keeping up at 1200 baud even though shifting (if necessary)
> always went to the right.
Keeping up with 1200 baud is not a very high bar to set.
----------------------
Maybe not, but it was the fastest speed I had access to at the time (early
80's), and the fastest modem speed ever widely available for the C64 anyway.
Moreover, since the C64 has no hardware UART, everything regarding bit
pushing has to be handled in software during NMI interrupts, further
reducing time available to plot.
Still, you're correct in that, even if it takes as much as 2000 cycles to
plot each character, 120 characters per second needs only 240K of the 1000K
cycles available per second (lots of time left for those 1200 NMIs).
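
The budget above as a quick calculation. This is a sketch: the 10 bits per character (one start bit, eight data bits, one stop bit) is my assumption, chosen because it gives the 120 characters per second quoted, and the clock is rounded to the "1000K" figure.

```python
CPU_HZ = 1_000_000        # rounded clock rate, the "1000K cycles" above
BAUD = 1200
BITS_PER_CHAR = 10        # assumption: 8N1 framing
CYCLES_PER_PLOT = 2000    # pessimistic cost of plotting one character

chars_per_second = BAUD // BITS_PER_CHAR
plot_cycles_per_second = chars_per_second * CYCLES_PER_PLOT
cycles_between_chars = CPU_HZ // chars_per_second
plot_share = plot_cycles_per_second / CPU_HZ

print(f"characters per second:    {chars_per_second}")
print(f"plot cycles per second:   {plot_cycles_per_second}")
print(f"cycles between arrivals:  {cycles_between_chars}")
print(f"share of CPU spent plotting: {plot_share:.0%}")
```

Under those assumptions each character has a bit over 8000 cycles before the next one arrives, so a 2000-cycle plot uses about a quarter of the machine.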
But I'm not entirely sure what your point is. If I understand what you mean
by "font doubling", you want to place two identical patterns in each half of
a character definition byte to "reduce shifting". Well, if the width of a
character before plotting is only four pixels, there are only 16 different
patterns that characters can be made out of. So only 16 bytes are necessary
to represent those patterns shifted by one pixel, 16 more by two, etc. 112
bytes suffices to represent all shifts from one to seven pixels. Since it's
so few, why not go all the way and replace all shifts with pre-computed
lookups?
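
A sketch of what those tables would contain, in Python for brevity. It assumes the four-pixel pattern sits in the high nibble of a byte and that bits falling off the right edge wrap around to the left (the one-byte rotation trick); both choices are my assumptions for illustration.

```python
def ror8(value, count):
    """Rotate an 8-bit value right by count bits."""
    count %= 8
    return ((value >> count) | (value << (8 - count))) & 0xFF


def build_shift_tables():
    """One 16-entry table per shift of 1..7 pixels, indexed by the 4-bit pattern."""
    tables = {}
    for shift in range(1, 8):
        tables[shift] = bytes(ror8(pattern << 4, shift) for pattern in range(16))
    return tables


tables = build_shift_tables()
total_bytes = sum(len(table) for table in tables.values())
print("tables:", len(tables), "total bytes:", total_bytes)
print("pattern %1111 shifted by 5:", format(tables[5][0b1111], "08b"))
```

Seven tables of 16 bytes each give the 112 bytes mentioned above.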
- Anton Treuenfels
The fastest user port serial port is 9600baud, but AFAIR, that was
developed later than the 80's ... but hardware serial port cartridges
were available, and ran up to 36kb (I guess faster for direct serial
connections, but I didn't get a 56kb modem until after I got my first
two-3.5" floppy DOS computer, a clunky transportable with supertwist
monochrome LCD screen).
In any event, Forth at the time normally worked with 8 or more 1K
blocks cached in RAM, with individual blocks edited with the blocks in
RAM, so being able to keep up as a terminal on a 1200baud serial port
does not seem to be saying much with respect to how much of a
perceived slowdown there is in the context of editing text held in RAM
buffers.
> Still, you're correct in that, even if it takes as much as 2000 cycles to
> plot each character, 120 characters per second needs only 240K of the 1000K
> cycles available per second (lots of time left for those 1200 NMIs).
> But I'm not entirely sure what your point is. If I understand what you mean
> by "font doubling", you want to place two identical patterns in each half of
> a character definition byte to "reduce shifting". Well, if the width of a
> character before plotting is only four pixels, there are only 16 different
> patterns that characters can be made out of. So only 16 bytes are necessary
> to represent those patterns shifted by one pixel, 16 more by two, etc. 112
> bytes suffices to represent all shifts from one to seven pixels. Since it's
> so few, why not go all the way and replace all shifts with pre-computed
> lookups?
Because with only zero, one, or two shifts required, the shifts are on
average as fast as the lookup. TAX ; LDA table,X is six cycles. And a
distinct routine for each base bit avoids a bit of computing as well.
But for plotting, and not just sequential EMIT with CR (carriage
return), probably do want to precompute the base address of each
screen line and the offset of each column, as well as the base pixel
bit of each column, which can be 0 through 7.
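
As a rough model of those precomputed tables (Python, not 6502): this assumes a 5-pixel character pitch for 64 columns, the standard C64 bitmap layout of 320 bytes per character row with 8 bytes per cell, and a bitmap at $2000, which is a hypothetical address chosen only for the example.

```python
SCREEN_BASE = 0x2000   # hypothetical bitmap address
CHAR_WIDTH = 5         # assumption: 5-pixel pitch gives 64 columns in 320 pixels
ROWS, COLS = 25, 64

# Base address of each character row: 40 cells of 8 bytes each.
row_base = [SCREEN_BASE + 320 * row for row in range(ROWS)]
# Byte offset of the cell holding each column's first pixel.
col_offset = [(CHAR_WIDTH * col // 8) * 8 for col in range(COLS)]
# Starting pixel of each column inside that cell (0 = leftmost).
col_bit = [(CHAR_WIDTH * col) % 8 for col in range(COLS)]


def cell_address(row, col):
    """Return (address of the cell's top byte, starting pixel 0..7)."""
    return row_base[row] + col_offset[col], col_bit[col]


first_wide = next(col for col in range(COLS) if col_offset[col] > 255)
print("first column whose offset needs a high byte:", first_wide)
print("last column:", [hex(cell_address(0, 63)[0]), cell_address(0, 63)[1]])
```

The column offsets run past 255, which is why separate low and high column tables are needed.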
Off the top of my head, something like:
; PLOTXY, screencode in A, row in Y, column in X, to set row and col
PHA
CLC
LDA LINELO,Y
ADC COLLO,X
STA SCRN
LDA LINEHI,Y
ADC COLHI,X
STA SCRN+1
; screencode in A
PLA
LDY #>(font/8) ; high byte of font/8 (assumes font is 2K-aligned)
STY CHAR+1
ASL A
ROL CHAR+1
ASL
ROL CHAR+1
ASL
ROL CHAR+1
STA CHAR
LDA MOD6x8,X
BNE BIT1
LDY #7
LP0:
LDA (SCRN),Y
AND #$F0
STA TEMP
LDA (CHAR),Y
AND #$0F
ORA TEMP
STA (SCRN),Y
DEY
BPL LP0
RTS
BIT1:
TAX
DEX
BNE BIT2
LDY #7
LP1:
LDA (SCRN),Y
AND #$E1
STA TEMP
LDA (CHAR),Y
AND #$0F
ASL A
ORA TEMP
STA (SCRN),Y
DEY
BPL LP1
RTS
BIT2:
DEX
BNE BIT3
LDY #7
LP2:
LDA (SCRN),Y
AND #$87
STA TEMP
LDA (CHAR),Y
AND #$0F
ASL A
ASL A
ORA TEMP
STA (SCRN),Y
DEY
BPL LP2
RTS
BIT3:
DEX
BNE BIT4
LDY #7
LP3:
LDA (SCRN),Y
AND #$87
STA TEMP
LDA (CHAR),Y
AND #$F0
LSR A
ORA TEMP
STA (SCRN),Y
DEY
BPL LP3
RTS
BIT4:
DEX
BNE BIT5
LDY #7
LP4:
LDA (SCRN),Y
AND #$0F
STA TEMP
LDA (CHAR),Y
AND #$F0
ORA TEMP
STA (SCRN),Y
DEY
BPL LP4
RTS
BIT5:
DEX
BNE BIT6
LDY #15
LPHI5:
LDA (SCRN),Y
AND #$FE
STA TEMP
LDA (CHAR),Y
AND #$0F
LSR A
LSR A
LSR A
ORA TEMP
STA (SCRN),Y
DEY
CPY #8
BPL LPHI5
LPLO5:
LDA (SCRN),Y
AND #$1F
STA TEMP
LDA (CHAR),Y
AND #$F0
ASL A
ORA TEMP
STA (SCRN),Y
DEY
BPL LPLO5
RTS
BIT6:
DEX
BNE BIT7
LDY #15
LPHI6:
LDA (SCRN),Y
AND #$FC
STA TEMP
LDA (CHAR),Y
AND #$0F
LSR A
LSR A
ORA TEMP
STA (SCRN),Y
DEY
CPY #8
BPL LPHI6
LPLO6:
LDA (SCRN),Y
AND #$3F
STA TEMP
LDA (CHAR),Y
AND #$F0
ASL A
ASL A
ORA TEMP
STA (SCRN),Y
DEY
BPL LPLO6
RTS
BIT7:
LDY #15
LPHI7:
LDA (SCRN),Y
AND #$F8
STA TEMP
LDA (CHAR),Y
AND #$0F
LSR A
ORA TEMP
STA (SCRN),Y
DEY
CPY #8
BPL LPHI7
LPLO7:
LDA (SCRN),Y
AND #$7F
STA TEMP
LDA (CHAR),Y
AND #$F0
ASL A
ASL A
ASL A
ORA TEMP
STA (SCRN),Y
DEY
BPL LPLO7
RTS
> Maybe not, but it was the fastest speed I had access to at the time (early
> 80's), and the fastest modem speed ever widely available for the C64 anyway.
When I graduated from a Commodore 1200-baud modem
to the Aprotek Minimodem-C24 running at 2400 baud, it
was quite a wonder to cruise at such a speed!
Merry Christmas!
Robert Bernardo
Fresno Commodore User Group
http://videocam.net.au/fcug
--------------------------------------
Well, it's a start. Looks pretty complicated to me. My first attempt at
something like this was a pipeline-type thing: copy char pattern to temp,
shift temp, copy shifted temp to screen. Simple, but not the only way to do
it, of course.
One thing I notice is that these routines assume four-pixel wide chars, but
I understood you to mean four-pixel wide definitions of five-pixel wide
chars. The fifth pixel is implied as always clear, hence does not need to be
stored and is "cleared out" by the plotting routine. I don't see that
happening here.
Anyhow...sixty-four five-pixel wide chars on each row means that chars begin
on pixel columns 0, 5, 10, 15, 20, 25, 30, 35, and so on. The eight-pixel
wide columns of the C64 hires screen mean that the pixel column remainders
run in the sequence 0, 5, 2, 7, 4, 1, 6, 3 and then repeat. So using your
scheme of the same four-pixel pattern repeated in both halves of a character
image byte, something like this should happen, I think:
0: mask out bits 0..4 of screen byte, OR in bits 0..3 of pattern byte
   (no shift)
5: mask out bits 5..7 of screen byte, OR in bits 4..6 of pattern byte
   (right shift one),
   mask out bits 0..1 of screen byte+8, OR in bit 3 of pattern byte
   (left shift three)
2: mask out bits 2..6 of screen byte, OR in bits 0..3 of pattern byte
   (right shift two)
7: mask out bit 7 of screen byte, OR in bit 4 of pattern byte
   (right shift three),
   mask out bits 0..3 of screen byte+8, OR in bits 1..3 of pattern byte
   (left shift one)
4: mask out bits 4..7 of screen byte, OR in bits 4..7 of pattern byte
   (no shift),
   mask out bit 0 of screen byte+8
1: mask out bits 1..5 of screen byte, OR in bits 0..3 of pattern byte
   (right shift one)
6: mask out bits 6..7 of screen byte, OR in bits 4..5 of pattern byte
   (right shift two),
   mask out bits 0..2 of screen byte+8, OR in bits 2..3 of pattern byte
   (left shift two)
3: mask out bits 3..7 of screen byte, OR in bits 4..7 of pattern byte
   (left shift one)
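
The remainder sequence and the pixels each case has to clear can be generated mechanically. A sketch in Python, numbering pixels 0..7 from the left as in the list above (on the C64 bitmap, pixel 0 is bit 7 of the byte); the function names are made up for illustration.

```python
CHAR_WIDTH = 5  # pixels per character, including the blank fifth column


def remainder_sequence(columns=8):
    """Starting pixel within a cell for each of the first few columns."""
    return [(CHAR_WIDTH * col) % 8 for col in range(columns)]


def field_pixels(remainder):
    """Pixels covered in the left cell and in the cell to its right."""
    last = remainder + CHAR_WIDTH - 1
    left = list(range(remainder, min(last, 7) + 1))
    right = list(range(0, last - 7)) if last > 7 else []
    return left, right


def clear_mask(pixels):
    """AND mask that clears the given pixels (pixel 0 is bit 7)."""
    mask = 0xFF
    for pixel in pixels:
        mask &= ~(0x80 >> pixel) & 0xFF
    return mask


for remainder in remainder_sequence():
    left, right = field_pixels(remainder)
    print(remainder,
          format(clear_mask(left), "08b"),
          format(clear_mask(right), "08b") if right else "--------")
```

Remainders 0 to 3 leave the right-hand cell untouched, which is what makes them the easy cases.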
Still looks pretty complicated to me. I might write the main loop for the
pixel remainders 0..3 (the easy cases that don't cross cell boundaries)
something like this:
lda x_coor
and #%0011 ; mod 4
tax ; 0..3
lda shiftvct_lo,x
sta shift_vct
lda shiftvct_hi,x
sta shift_vct+1
ldy #8-1
- lda (char_ptr),y
jmp (shift_vct)
rght3:
lsr
rght2:
lsr
rght1:
lsr
plot:
sta temp
lda (screen_ptr),y
and mask,x
ora temp
sta (screen_ptr),y
dey
bpl -
rts
For pixel remainders 4..7 (the nasty cases that cross cell boundaries), I
might do something like this:
lda x_coor
and #%0111 ; mod 8
tax ; 4..7 (not 0..7)
lda shiftvct_lo,x
sta shift_vct
lda shiftvct_hi,x
sta shift_vct+1
clc
lda screen_ptr
adc #8
sta screen_ptr2
lda screen_ptr+1
adc #0
sta screen_ptr2+1
ldy #8-1 ; rows 7..0
- lda #$00
sta temp
lda (char_ptr),y
jmp (shift_vct)
rgt4:
asl ; 4 lft = 4 rgt
rol temp
rgt5:
asl ; 3 lft = 5 rgt
rol temp
rgt6:
asl ; 2 lft = 6 rgt
rol temp
rgt7:
asl ; 1 lft = 7 rgt
rol temp
plot2:
sta temp+1
lda (screen_ptr),y
and lft_mask,x
ora temp
sta (screen_ptr),y
lda (screen_ptr2),y
and rgt_mask,x
ora temp+1
sta (screen_ptr2),y
dey
bpl - ; loop until row 0 has been plotted
rts
That's only an outline of one way to do it with lots of details left out, of
course.
- Anton Treuenfels
> One thing I notice is that these routines assume four-pixel wide chars, but
> I understood you to mean four-pixel wide definitions of five-pixel wide
> chars. The fifth pixel is implied as always clear, hence does not need to be
> stored and is "cleared out" by the plotting routine. I don't see that
> happening here.
No, I realized that after I posted it. The AND masks are off by one
bit, and the four wide needs to clear the top bit of bytes 8~15.
The shift vector is a few cycles faster when computed directly, and more
compact once the table size is counted:
clc
> lda x_coor
> and #%0011 ; mod 4
tax
adc #<rght3 ; low byte; each LSR is one byte, so X of them are skipped
sta shift_vct
lda #>rght3 ; high byte
adc #0
sta shift_vct+1
ldy #7
> - lda (char_ptr),y
> jmp (shift_vct)
>
> rght3:
> lsr
> rght2:
> lsr
> rght1:
> lsr
>
> plot:
> sta temp
plot1:
> lda (screen_ptr),y
> and mask,x
> ora temp
> sta (screen_ptr),y
> dey
> bpl plot1
> rts
Obviously, unrolling the loop is faster than that for the first four
bit positions, since the masks are immediates and the "DEX / BMI"
averages faster than the jump vector, though in a second draft I'd
look to splitting it between 0~3 and 4~7. Like all unrolled loops, it
is more repetitive than complex.
---------------------------
Speaking of second drafts, here's a more complete version. This one
minimizes code size. If that's not a concern then it could be made a bit
faster by writing each of the eight cases as a separate routine, eliminating
the indirect jump and using immediate values for masks.
; A = screen code of char to plot (0..255)
; Y = row (0..24)
; X = col (0..63)
PlotChar:
; create pointer to character image definition
; - bits 0..4 of eight consecutive bytes
asl
rol temp+1
asl
rol temp+1
asl
rol temp+1
clc
adc #<char_image_base
sta char_ptr
lda temp+1
and #%00000111
adc #>char_image_base
sta char_ptr+1
; create pointers to screen cells
txa
sta temp
asl ; *2
asl ; *4
adc temp ; *5
tax
lda #$00
rol
sta temp+1
txa
and #%11111000
adc screen_row_lo,y
sta screen_ptr_lft
lda temp+1
adc screen_row_hi,y
sta screen_ptr_lft+1
tay
lda screen_ptr_lft
adc #8
sta screen_ptr_rgt
bcc +
iny
+ sty screen_ptr_rgt+1
; set indirect jump for character image shifting
txa
and #%00000111
tax
lda char_shift_lo,x
sta shift_vct
lda char_shift_hi,x
sta shift_vct+1
; set screen cell masks
lda screen_lft_mask,x
sta lft_mask
lda screen_rgt_mask,x
sta rgt_mask
; main loop
ldy #8-1
next_row:
lda (char_ptr),y
jmp (shift_vct)
; plots that cross cell boundaries
rgt4: ; lft 4 = rgt 4
cmp #$80
rol
rgt5: ; lft 3 = rgt 5
cmp #$80
rol
rgt6: ; lft 2 = rgt 6
cmp #$80
rol
rgt7: ; lft 1 = rgt 7
cmp #$80
rol
; plot right screen cell
tax
eor (screen_ptr_rgt),y
and rgt_mask
eor (screen_ptr_rgt),y
sta (screen_ptr_rgt),y
txa
jmp rgt0 ; always (bit 7 of A can be set after the rotates, so BPL is not safe)
; plots that do not cross cell boundaries
rgt3:
lsr
rgt2:
lsr
rgt1:
lsr
rgt0:
; plot left screen cell
eor (screen_ptr_lft),y
and lft_mask
eor (screen_ptr_lft),y
sta (screen_ptr_lft),y
; another row ?
dey
bpl next_row ; b:yes
rts
; character shift vectors
; - can replace "char_shift_hi" lookup with constant page number
; if guaranteed that all eight targets are on the same page
char_shift_lo:
.byte <rgt0,<rgt1,<rgt2,<rgt3,<rgt4,<rgt5,<rgt6,<rgt7
char_shift_hi:
.byte >rgt0,>rgt1,>rgt2,>rgt3,>rgt4,>rgt5,>rgt6,>rgt7
; screen cell masks
; the first four bytes of "screen_rgt_mask" are "don't care" since they
; will never be used (but it doesn't save anything to check for this
; when setting masks - maybe can overlap tables somewhere if useful)
screen_lft_mask:
.byte %11111000,%01111100,%00111110,%00011111
.byte %00001111,%00000111,%00000011,%00000001
screen_rgt_mask:
.byte %00000000,%00000000,%00000000,%00000000
.byte %10000000,%11000000,%11100000,%11110000
; screen row starts
screen_row_lo:
]ptr = screen_base
.repeat 25
.byte <]ptr
]ptr = ]ptr + 320
.endr
screen_row_hi:
]ptr = screen_base
.repeat 25
.byte >]ptr
]ptr = ]ptr + 320
.endr
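
The plot steps in the listing above merge the shifted pattern into the screen byte with an EOR / AND / EOR sequence instead of masking the screen byte separately. A small Python check of that identity follows; it is only a model, with the mask values copied from screen_lft_mask and the function names invented for the example.

```python
def masked_merge(screen, pattern, mask):
    """Mirror of EOR (screen) / AND mask / EOR (screen) on the accumulator."""
    a = pattern ^ screen
    a &= mask
    a ^= screen
    return a


def reference_merge(screen, pattern, mask):
    """Take the mask bits from the pattern and every other bit from the screen."""
    return (pattern & mask) | (screen & ~mask & 0xFF)


LEFT_MASKS = [0b11111000, 0b01111100, 0b00111110, 0b00011111,
              0b00001111, 0b00000111, 0b00000011, 0b00000001]

for mask in LEFT_MASKS:
    for screen in range(256):
        for pattern in range(256):
            assert masked_merge(screen, pattern, mask) == reference_merge(screen, pattern, mask)

print(format(masked_merge(0b10101010, 0b11111000, LEFT_MASKS[0]), "08b"))
```

In this form the mask marks the bits taken from the character, the opposite sense from an AND mask that clears the field first.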
; POSXY - assumes screen is aligned on page boundary
; X = col (0-64)
; Y = row (0-23)
; Row base pre-computed as an absolute address
; Col pre-computed as an offset byte, overflow inferred on col>50
; Result: screen=address, col=column, row=row, mod=bit-offset, uses A
Pos64XY
STX col
STY row
TXA
AND #7
TAX
LDA bitoffset,x
CLC
LDA rowlo,y
ADC colbyte,x
STA screen
LDA rowhi,y
CPX #50
BPL Pos64XY1
ADC #0
STA screen+1
LDA modebyte,x
TAX
RTS
Pos64XY1
ADC #1
STA screen+1
LDA colmod,x
STA bitoffset
RTS
NextCol:
; Result: screen=address, col=column, X=bit offset
; logic: bit offsets 0,1,2,3,4 crossed a byte boundary
LDX col
INX
CPX #64
BEQ NextRow
STX col
LDA colmod,x
STA bitoffset
CMP #5
BPL NextCol1 ; offsets 5,6,7: still in the same cell
CLC
LDA screen
ADC #8
STA screen
BCC NextCol1
INC screen+1 ; carry into the high byte
NextCol1:
RTS
; NextRow
; Result: screen=address, row=row, col=bitoffset=0
; carry set = screen overflow, row stalls
; ; could also wrap around
NextRow:
SEC
LDX #0
STX col
STX bitoffset
LDY row
CPY #24
BEQ NextRow1
INY
STY row
CLC
NextRow1:
LDA rowlo,y
STA screen
LDA rowhi,y
STA screen+1
RTS
> PlotChar:
>
> ; create pointer to character image definition
> ; - bits 0..4 of eight consecutive bytes
; this assumes that the font table is aligned on an 8-byte boundary
; aligning on a binary page boundary would be the most common
clc
adc #<(char_image_base / 8)
sta char_ptr
lda #>(char_image_base / 8)
adc #0
asl char_ptr
rol
asl char_ptr
rol
asl char_ptr
rol
sta char_ptr+1
It occurs to me that a clear row is faster than a clear character
position, so I might split out clear row and clear character position
routines and have the character plot routine assume that the target is
cleared:
; ClearRow
; clear a full row of characters, screen is set to base of row
ClearRow
LDA #0
TAY
ClearRow1
STA (screen),Y
INY
BNE ClearRow1
INC screen+1
LDY #63
ClearRow2
STA (screen),Y
DEY
BPL ClearRow2
DEC screen+1
RTS
; ClearPos
; clear the current position, already set
; 12 byte mask table, maskhi=masklo+4
ClearPos
LDX mod
CPX #4
BMI ClearPos2 ; offsets 0..3 touch the left cell only
LDY #15
ClearPos1:
LDA (screen),Y
AND maskhi,x
STA (screen),Y
DEY
CPY #8
BPL ClearPos1
ClearPos2
LDY #7
ClearPos3:
LDA (screen),Y
AND masklo,x
STA (screen),Y
DEY
BPL ClearPos3
RTS
Having the higher level printing routine determine whether to clear a
full row or clear a single character position simplifies the character
plot. For one thing, the anomaly of bitoffset=4, where a single bit of
the high byte is cleared but is not touched by the character plot,
is handled in the set-up of the mask.
Yes, it costs 9 cycles to do the patch of one bit mask if the mask is
tabled, and the immediate saved 3 cycles per iteration versus the X-
indexed read. See below, that is actually faster on average than the
unrolled loop.
But I think a similar saving is available through stashing the mask in
X, if the clearing of the field and the plotting of the character are
done independently.
...
LDA masklo,X
TAX
LDY #7
LP:
TXA
AND (screen),Y
STA (screen),Y
DEY
BPL LP
... and since that costs 7 cycles for the single byte loop set-up,
where the unrolled loop (with a previous split between 0-3 and 4-7)
would be:
(1) bit 0, 2 cycles (fall through a branch)
(2) bit 1, 9 cycles, store to X, DEX, take one branch, fall through a
branch
(3) bit 2, 14 cycles store to X, DEX*2, take two branches, fall
through a branch
(4) bit 3, 15 cycles, store to X, DEX*2, take three branches
An average of 10 cycles overhead, so stashing in X and using it is
about three cycles faster as well as much more compact.
Doing the clearing of the space separately from the plotting not only
saves a bucket of cycles when clearing a whole row at a time, it also
means that the X-index only needs to be used once. However, it's still
faster to have eight separate loops:
LDY #7
LDX mod
BEQ bit0
CPX #4
BPL bit47
CPX #2
BMI bit1
BEQ bit2
BPL bit3
bit47:
BEQ bit4
CPX #6
BMI bit5
BEQ bit6
BNE bit7
bit0:
LDA (char),Y
AND #$0F
ORA (screen),Y
STA (screen),Y
DEY
BPL bit0
RTS
bit1:
LDA (char),Y
AND #$0F
ASL
ORA (screen),Y
STA (screen),Y
DEY
BPL bit1
RTS
bit2:
LDA (char),Y
AND #$0F
ASL
ASL
ORA (screen),Y
STA (screen),Y
DEY
BPL bit2
RTS
bit3:
LDA (char),Y
AND #$F0
LSR
ORA (screen),Y
STA (screen),Y
DEY
BPL bit3
RTS
bit4:
LDA (char),Y
AND #$F0
ORA (screen),Y
STA (screen),Y
DEY
BPL bit4
RTS
bit5:
LDA (char),Y
AND #$F0
ASL
ORA (screen),Y
STA (screen),Y
DEY
BPL bit5
LDY #15
bit5hi:
LDA (char),Y
AND #$0F
LSR
ORA (screen),Y
STA (screen),Y
DEY
CPY #8
BPL bit5hi
RTS
bit6:
LDA (char),Y
AND #$F0
ASL
ASL
ORA (screen),Y
STA (screen),Y
DEY
BPL bit6
LDY #15
bit6hi:
LDA (char),Y
AND #$0F
LSR
LSR
ORA (screen),Y
STA (screen),Y
DEY
CPY #8
BPL bit6hi
RTS
bit7:
LDA (char),Y
AND #$F0
ASL
ASL
ASL
ORA (screen),Y
STA (screen),Y
DEY
BPL bit7
LDY #15
bit7hi:
LDA (char),Y
AND #$0F
LSR
LSR
LSR
ORA (screen),Y
STA (screen),Y
DEY
CPY #8
BPL bit7hi
RTS
Even self-modifying code within a loop to jump over the correct number
of shifts has a penalty of 24 cycles, since the jump is taken eight
times. And taking the clearing out of the character plotting loops
makes the per-bit routines much shorter.
I think if the routine does not clear the screen, but instead clears
the top two rows and then clears the row below the current row when
starting a new line, the delay from the high res character display
would be not much more than for a four bit wide character display.
LDX bitoffset
CPX #4
BMI clearlo
LDA maskhi,x
TAX
LDY #15
lphi:
TXA
AND (screen),Y
STA (screen),Y
DEY
CPY #8
BPL lphi
clearlo:
LDY #7
LDX bitoffset
LDA masklo,x
TAX
lplo:
TXA
AND (screen),Y
STA (screen),Y
DEY
BPL lplo
RTS
The mask table is 12 bytes, with: maskhi=masklo+4.
; POSXY - assumes screen is aligned on page boundary
; X = col (0-64)
; Y = row (0-23)
; Row base pre-computed as an absolute address
; Col pre-computed as an offset byte, overflow inferred on col>50
; Result: screen=address, col=column, row=row, mod=bit-offset, uses A
Pos64XY
STX col
STY row
TXA
AND #7
TAX ; now X=0..7
LDA bitoffset,x
CLC
LDA rowlo,y
ADC colbyte,x
STA screen
LDA rowhi,y
CPX #50 ; ...which makes this test have only one result
------------------------------
Oh heck, let's unroll:
ClearPos:
ldy mod
cpy #4 ; left cell only ?
bcc + ; b:yes
ldx maskhi,y
ldy #16-1
jsr ClearCell ; clear right cell
ldy mod
+ ldx masklo,y
ldy #8-1 ; fall through
ClearCell:
txa
and (screen),y
sta (screen),y
.repeat 7
dey
txa
and (screen),y
sta (screen),y
.endrepeat
rts
- Anton Treuenfels
I like the fully unrolled 8-byte mask. 8x6 vs 1x8 is an extra 16
bytes, to save 23 clock cycles.