53-columns

David Murray

Dec 22, 2010, 2:29:50 PM

I know we've all seen 80-columns done on the C64, and it is quite hard
to read. Here is a compromise: 53 columns. I wrote this routine
this morning which gives a little more room for text, while still
being quite readable. Here is a screenshot:

http://www.ibookguy.com/wp-content/uploads/2010/12/53COUMNS.png

If anyone wants a copy of the executable or source code, I'll be happy to
send it. I thought it might make a good routine for a text adventure.

Payton Byrd

Dec 22, 2010, 3:22:56 PM

Would you be willing to wrap this in CONIO for cc65? I'll pay $$$.
It would need to be a complete drop-in replacement for the stock CONIO
library for the 64.

David Murray

Dec 22, 2010, 3:46:17 PM

> Would you be willing to wrap this in CONIO for cc65? I'll pay $$$.
> It would need to be a complete drop-in replacement for the stock CONIO
> library for the 64.

Funny thing is, I already thought of doing that. However, truth be
told, I don't know how.

BruceMcF

Dec 22, 2010, 6:18:49 PM

On Dec 22, 2:29 pm, David Murray <adri...@yahoo.com> wrote:
> I know we've all seen 80-columns done on the C64, and it is quite hard
> to read. Here is a compromise: 53 columns. I wrote this routine
> this morning which gives a little more room for text, while still
> being quite readable. Here is a screenshot:

But I want 64 ~ the one thing that grated with C64 Forth
implementations was the "not quite standard" Blocks because of 40
column wide lines instead of 64 column wide lines.

Heck, switch back to 40 columns wide in the bottom 9 rows, just get a
16 line by 64 column block display.

David Murray

Dec 22, 2010, 8:31:51 PM

> But I want 64 ~ the one thing that grated with C64 Forth
> implementations was the "not quite standard" Blocks because of 40
> column wide lines instead of 64 column wide lines.

This could be done too, using a 5-pixel font. However, at that point
you would lose the double-thick lines common to the standard 64
fonts, which make them easy to read on NTSC. Once you move to
single-pixel lines, the text becomes blurry. My goal with the
6-pixel-wide font was to keep it looking like the standard font.
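
For anyone checking the arithmetic behind these column counts, here is a small Python sketch. It is only an illustration, not the plotting routine, and it assumes the 320-pixel-wide hi-res bitmap; the names are made up for the example.

```python
SCREEN_WIDTH = 320  # width of the C64 hi-res bitmap in pixels


def columns(font_width):
    """Whole character cells that fit across one bitmap row."""
    return SCREEN_WIDTH // font_width


for width in (8, 6, 5, 4):
    print(width, "pixel cells ->", columns(width), "columns")
```

A 6-pixel cell leaves two pixels unused at the right edge, since 53 * 6 = 318.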

winston...@yahoo.com

Dec 23, 2010, 12:36:57 AM

On the TI-99, we used a 4-pixel font to get 64 columns in Forth...
wasn't too bad, although the characters were somewhat blocky... but it
made for a great Forth editor, 64 columns by 16 rows...

Tried a 'proportional' font where some characters used only 3 pixels,
but the results were - well, pretty bad...

BruceMcF

Dec 23, 2010, 9:49:31 AM

At least it allows a full 1024 block to be on the screen (you lose 24
with an 80x25 wide screen), even with a non-standard line length (just
don't use the ``\'' comment word) of 1 12 long and 21 48 long lines.

If a 5-wide font routine puts in the blank pixel itself, it can be a 4-
pixel font, which allows the font to be doubled to reduce the amount
of shifting required.

Piotr "Curious" Slawinski

Dec 23, 2010, 12:23:12 PM

David Murray wrote:

does it involve border-opening tricks/sprites?

--

David Murray

Dec 23, 2010, 12:40:03 PM

> does it involve border-opening tricks/sprites?

No, it is simply using the 320x240 hi-res graphics mode with a 6-pixel
wide font and a program designed to plot the characters on the screen
very quickly.

Piotr "Curious" Slawinski

Dec 23, 2010, 1:00:30 PM

David Murray wrote:

pretty interesting then, though personally i gave up
using the c64 as a text terminal not so long ago in favour of much more power- and
cost-efficient microcontrollers (like the atmega simple tv terminal)

--

iAN CooG

Dec 23, 2010, 5:09:28 PM

David Murray <adr...@yahoo.com> wrote:
>> does it involve border-opening tricks/sprites?
>
> No, it is simply using the 320x240 hi-res graphics mode

you wish, 320x200 ;)

--
-=[]=--- iAN CooG/HVSC & C64Intros ---=[]=-
Dog crawls under doors, programs crawls under windows...

Anton Treuenfels

Dec 23, 2010, 7:35:57 PM

"BruceMcF" <agi...@netscape.net> wrote in message
news:f266112c-09db-4370...@o4g2000yqd.googlegroups.com...

---------------

I dunno if shifting is that big a deal.

I wrote a 64x32 display routine for a terminal emulator once, using 5x6
characters (scrunched down from the real terminal's 8x16 characters). Didn't
have any trouble keeping up at 1200 baud even though shifting (if necessary)
always went to the right. Always used two bytes, a 'left' and a 'right',
both always plotted using masks, even though in many cases the shift didn't
actually put any pixels in the 'right' byte.

Much later I realized I could get by with one byte by rotating bits that
'fall off' the right edge into the left edge (since both 'left' and 'right'
bytes had to be masked anyway and there was no overlap between what they
held). Even that wouldn't be necessary unless the shift was for more than
three bits (for a five-pixel wide character).

And of course when using one byte with 'stuffing' at the other end of a
shift, a one-bit left shift has the same result as a seven-bit right shift.
Extending that, no more than four shifts are needed in any case. An indirect
jump set at the start of plotting each character points to the appropriate
place in a 'cascade' of shifts to start shifting each byte.
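
To make the rotation argument concrete, here is a small Python model. It is a sketch of the idea only, not the emulator's code, and the function names are hypothetical.

```python
def rol8(value, count):
    """Rotate an 8-bit value left by count bits."""
    count %= 8
    return ((value << count) | (value >> (8 - count))) & 0xFF


def ror8(value, count):
    """Rotate an 8-bit value right by count bits."""
    count %= 8
    return ((value >> count) | (value << (8 - count))) & 0xFF


def fewest_rotates(right_shift):
    """Single-bit rotates needed if either direction may be used."""
    right_shift %= 8
    return min(right_shift, 8 - right_shift)


# One rotate left gives the same byte as seven rotates right.
same = all(rol8(v, 1) == ror8(v, 7) for v in range(256))
worst_case = max(fewest_rotates(n) for n in range(8))
```

Picking the shorter direction for each pixel offset is what caps the cascade at four rotates.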

And if that wasn't fast enough, there was always the possibility of
pre-computing the shifts in seven lookup tables. (the emulator was only
about 3K in its first version; I had lots of space).

Still, I calculated that even the slowest version could place a character
faster than the next one could arrive at 1200 baud. The real bottleneck was
erasing the whole screen; that always took many character arrival times to
complete. What I should have done was use two hi-res screens, erasing a
small bit of the currently unused screen each pass through the main loop.
When the command to erase the screen came, in the best case it would simply
mean switching screens during the vertical blanking interval. In the worst
case, if nothing at all had been done to erase the unused screen before the
next erase command came, it was no worse than what I was doing already (and
switching during the vertical blank would have prevented the user from
seeing the erasing going on, another plus).

- Anton Treuenfels

BruceMcF

Dec 23, 2010, 7:50:45 PM

On Dec 23, 7:35 pm, "Anton Treuenfels" <teamtemp...@yahoo.com> wrote:

> I dunno if shifting is that big a deal.
>
> I wrote a 64x32 display routine for a terminal emulator once, using 5x6
> characters (scrunched down from the real terminal's 8x16 characters). Didn't
> have any trouble keeping up at 1200 baud even though shifting (if necessary)
> always went to the right.

Keeping up with 1200 baud is not a very high bar to set.

Anton Treuenfels

Dec 24, 2010, 6:46:55 PM

"BruceMcF" <agi...@netscape.net> wrote in message

news:51dff1ef-94f0-4e28...@r29g2000yqj.googlegroups.com...

----------------------

Maybe not, but it was the fastest speed I had access to at the time (early
80's), and the fastest modem speed ever widely available for the C64 anyway.
Moreover, since the C64 has no hardware UART, everything regarding bit
pushing has to be handled in software during NMI interrupts, further
reducing time available to plot.

Still, you're correct in that, even if it takes as much as 2000 cycles to
plot each character, 120 characters per second needs only 240K of the 1000K
cycles available per second (lots of time left for those 1200 NMIs).
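
The budget works out like this in a quick Python sketch. It assumes 10 bits per character on the wire (start bit, 8 data bits, stop bit) and the round 1000K cycles per second used above; both are assumptions for illustration.

```python
BAUD = 1200
BITS_PER_CHAR = 10      # assumed framing: start + 8 data + stop
CPU_HZ = 1_000_000      # round figure; the real clock is close to 1 MHz
PLOT_CYCLES = 2000      # pessimistic cost of plotting one character

chars_per_second = BAUD // BITS_PER_CHAR
plot_load = chars_per_second * PLOT_CYCLES
fraction = plot_load / CPU_HZ

print(chars_per_second, "chars/s use", plot_load, "cycles/s")
```

That leaves roughly three quarters of the machine for the NMI bit handling and everything else.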

But I'm not entirely sure what your point is. If I understand what you mean
by "font doubling", you want to place two identical patterns in each half of
a character definition byte to "reduce shifting". Well, if the width of a
character before plotting is only four pixels, there are only 16 different
patterns that characters can be made out of. So only 16 bytes are necessary
to represent those patterns shifted by one pixel, 16 more by two, etc. 112
bytes suffices to represent all shifts from one to seven pixels. Since it's
so few, why not go all the way and replace all shifts with pre-computed
lookups?
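
A Python sketch of what those tables would contain. It assumes the 4-pixel pattern sits in the high nibble and that bits falling off the right edge wrap to the left (the one-byte rotation scheme from my earlier post); the names are hypothetical.

```python
def ror8(value, count):
    """Rotate an 8-bit value right by count bits."""
    count %= 8
    return ((value >> count) | (value << (8 - count))) & 0xFF


def build_shift_tables():
    """One 16-entry table per shift count 1..7, indexed by 4-bit pattern."""
    tables = {}
    for shift in range(1, 8):
        tables[shift] = bytes(ror8(pattern << 4, shift) for pattern in range(16))
    return tables


tables = build_shift_tables()
total_bytes = sum(len(t) for t in tables.values())
```

On the 6502 each table would be a plain indexed load, so the cost per row no longer depends on the shift count.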

- Anton Treuenfels

BruceMcF

Dec 24, 2010, 8:32:36 PM

On Dec 24, 6:46 pm, "Anton Treuenfels" <teamtemp...@yahoo.com> wrote:
> "BruceMcF" <agil...@netscape.net> wrote in message

>
> news:51dff1ef-94f0-4e28...@r29g2000yqj.googlegroups.com...
> On Dec 23, 7:35 pm, "Anton Treuenfels" <teamtemp...@yahoo.com> wrote:
>
> > I dunno if shifting is that big a deal.
>
> > I wrote a 64x32 display routine for a terminal emulator once, using 5x6
> > characters (scrunched down from the real terminal's 8x16 characters).
> > Didn't
> > have any trouble keeping up at 1200 baud even though shifting (if
> > necessary)
> > always went to the right.
>
> Keeping up with 1200 baud is not a very high bar to set.
>
> ----------------------
>
> Maybe not, but it was the fastest speed I had access to at the time (early
> 80's), and the fastest modem speed ever widely available for the C64 anyway.
> Moreover, since the C64 has no hardware UART, everything regarding bit
> pushing has to be handled in software during NMI interrupts, further
> reducing time available to plot.

The fastest user port serial port is 9600baud, but AFAIR, that was
developed later than the 80's ... but hardware serial port cartridges
were available, and ran up to 36kb (I guess faster for direct serial
connections, but I didn't get a 56kb modem until after I got my first
two-3.5" floppy DOS computer, a clunky transportable with supertwist
monochrome LCD screen).

In any event, Forth at the time normally worked with 8 or more 1K
blocks cached in RAM, with individual blocks edited with the blocks in
RAM, so being able to keep up as a terminal on a 1200baud serial port
does not seem to be saying much with respect to how much of a
perceived slowdown there is in the context of editing text held in RAM
buffers.

> Still, you're correct in that, even if it takes as much as 2000 cycles to
> plot each character, 120 characters per second needs only 240K of the 1000K
> cycles available per second (lots of time left for those 1200 NMIs).

> But I'm not entirely sure what your point is. If I understand what you mean
> by "font doubling", you want to place two identical patterns in each half of
> a character definition byte to "reduce shifting". Well, if the width of a
> character before plotting is only four pixels, there are only 16 different
> patterns that characters can be made out of. So only 16 bytes are necessary
> to represent those patterns shifted by one pixel, 16 more by two, etc. 112
> bytes suffices to represent all shifts from one to seven pixels. Since it's
> so few, why not go all the way and replace all shifts with pre-computed
> lookups?

Because with only zero, one, or two shifts required, the shifts are on
average as fast as the lookup. TAX ; LDA table,X is six cycles. And a
distinct routine for each base bit avoids a bit of computing as well.

But for plotting, and not just sequential EMIT with CR (carriage
return), probably do want to precompute the base address of each
screen line and the offset of each column, as well as the base pixel
bit of each column, which can be 0 through 7.
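
As a model of those precomputed tables, here is a Python sketch for a 64-column layout with 5-pixel cells. The screen base address and the table names are assumptions for illustration; the 320 bytes per character row and 8 bytes per cell follow from the C64 bitmap layout.

```python
CELL_WIDTH = 5        # pixels per character cell (64-column layout)
COLUMNS = 64
ROWS = 25
ROW_BYTES = 320       # 40 bitmap cells * 8 bytes per character row
SCREEN_BASE = 0x2000  # hypothetical bitmap address

row_base = [SCREEN_BASE + r * ROW_BYTES for r in range(ROWS)]
col_offset = [(c * CELL_WIDTH // 8) * 8 for c in range(COLUMNS)]
col_bit = [(c * CELL_WIDTH) % 8 for c in range(COLUMNS)]


def cell_address(row, col):
    """Byte address of the cell's top row, plus the starting pixel bit."""
    return row_base[row] + col_offset[col], col_bit[col]
```

The largest column offset is 312, which does not fit in one byte; that is why the column table needs a low and a high half.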

Off the top of my head, something like:

; PLOTXY, screencode in A, row in Y, column in X, to set row and col
PHA
CLC
LDA LINELO,Y
ADC COLLO,X
STA SCRN
LDA LINEHI,Y
ADC COLHI,X
STA SCRN+1
; screencode in A
PLA
LDY #>(font/8)
STY CHAR+1
ASL A
ROL CHAR+1
ASL
ROL CHAR+1
ASL
ROL CHAR+1
STA CHAR
LDA MOD6x8,X
BNE BIT1
LDY #7
LP0:
LDA (SCRN),Y
AND #$F0
STA TEMP
LDA (CHAR),Y
AND #$0F
ORA TEMP
STA (SCRN),Y
DEY
BPL LP0
RTS
BIT1:
TAX
DEX
BNE BIT2
LDY #7
LP1:
LDA (SCRN),Y
AND #$E1
STA TEMP
LDA (CHAR),Y
AND #$0F
ASL A
ORA TEMP
STA (SCRN),Y
DEY
BPL LP1
RTS
BIT2:
DEX
BNE BIT3
LDY #7
LP2:
LDA (SCRN),Y
AND #$87
STA TEMP
LDA (CHAR),Y
AND #$0F
ASL A
ASL A
ORA TEMP
STA (SCRN),Y
DEY
BPL LP2
RTS
BIT3:
DEX
BNE BIT4
LDY #7
LP3:
LDA (SCRN),Y
AND #$87
STA TEMP
LDA (CHAR),Y
AND #$F0
LSR A
ORA TEMP
STA (SCRN),Y
DEY
BPL LP3
RTS
BIT4:
DEX
BNE BIT5
LDY #7
LP4:
LDA (SCRN),Y
AND #$0F
STA TEMP
LDA (CHAR),Y
AND #$F0
ORA TEMP
STA (SCRN),Y
DEY
BPL LP4
RTS
BIT5:
DEX
BNE BIT6
LDY #15
LPHI5:
LDA (SCRN),Y
AND #$FE
STA TEMP
LDA (CHAR),Y
AND #$0F
LSR A
LSR A
LSR A
ORA TEMP
STA (SCRN),Y
DEY
CPY #8
BPL LPHI5
LPLO5:
LDA (SCRN),Y
AND #$1F
STA TEMP
LDA (CHAR),Y
AND #$F0
ASL A
ORA TEMP
STA (SCRN),Y
DEY
BPL LPLO5
RTS

BIT6:
DEX
BNE BIT7
LDY #15
LPHI6:
LDA (SCRN),Y
AND #$FC
STA TEMP
LDA (CHAR),Y
AND #$0F
LSR A
LSR A
ORA TEMP
STA (SCRN),Y
DEY
CPY #8
BPL LPHI6
LPLO6:
LDA (SCRN),Y
AND #$3F
STA TEMP
LDA (CHAR),Y
AND #$F0
ASL A
ASL A
ORA TEMP
STA (SCRN),Y
DEY
BPL LPLO6
RTS

BIT7:
LDY #15
LPHI7:
LDA (SCRN),Y
AND #$F8
STA TEMP
LDA (CHAR),Y
AND #$0F
LSR A
ORA TEMP
STA (SCRN),Y
DEY
CPY #8
BPL LPHI7
LPLO7:
LDA (SCRN),Y
AND #$7F
STA TEMP
LDA (CHAR),Y
AND #$F0
ASL A
ASL A
ASL A
ORA TEMP
STA (SCRN),Y
DEY
BPL LPLO7
RTS

RobertB

Dec 25, 2010, 12:35:53 AM

On Dec 24, 3:46 pm, "Anton Treuenfels" <teamtemp...@yahoo.com> wrote:

> Maybe not, but it was the fastest speed I had access to at the time (early
> 80's), and the fastest modem speed ever widely available for the C64 anyway.

When I graduated from a Commodore 1200-baud modem
to the Aprotek Minimodem-C24 running at 2400 baud, it
was quite a wonder to cruise at such a speed!

Merry Christmas!
Robert Bernardo
Fresno Commodore User Group
http://videocam.net.au/fcug

Anton Treuenfels

Dec 26, 2010, 4:41:28 AM

"BruceMcF" <agi...@netscape.net> wrote in message

news:46fa7794-d87a-4d40...@z19g2000yqb.googlegroups.com...

--------------------------------------

Well, it's a start. Looks pretty complicated to me. My first attempt at
something like this was a pipeline-type thing: copy char pattern to temp,
shift temp, copy shifted temp to screen. Simple, but not the only way to do
it, of course.

One thing I notice is that these routines assume four-pixel wide chars, but
I understood you to mean four-pixel wide definitions of five-pixel wide
chars. The fifth pixel is implied as always clear, hence does not need to be
stored and is "cleared out" by the plotting routine. I don't see that
happening here.

Anyhow...sixty-four five-pixel wide chars on each row means that chars begin
on pixel columns 0, 5, 10, 15, 20, 25, 30, 35, and so on. The eight-pixel
wide columns of the C64 hires screen mean that the pixel column remainders
run in the sequence 0, 5, 2, 7, 4, 1, 6, 3 and then repeat. So using your
scheme of the same four-pixel pattern repeated in both halves of a character
image byte, something like this should happen, I think:

0: mask out bits 0..4 of screen byte, OR in bits 0..3 of pattern byte (no
shift)

5: mask out bits 5..7 of screen byte, OR in bits 4..6 of pattern byte (right
shift one),
mask out bits 0..1 of screen byte+8, OR in bit 3 of pattern byte (left shift
three)

2: mask out bits 2..6 of screen byte, OR in bits 0..3 of pattern byte (right
shift two)

7: mask out bit 7 of screen byte, OR in bit 4 of pattern byte (right shift
three),
mask out bits 0..3 of screen byte+8, OR in bits 1..3 of pattern byte
(left shift one)

4: mask out bits 4..7 of screen byte, OR in bits 4..7 of pattern byte (no
shift),
mask out bit 0 of screen byte+8

1: mask out bits 1..5 of screen byte, OR in bits 0..3 of pattern byte (right
shift one)

6: mask out bits 6..7 of screen byte, OR in bits 4..5 of pattern byte (right
shift two),
mask out bits 0..2 of screen byte+8, OR in bits 2..3 of pattern byte (left
shift two)

3: mask out bits 3..7 of screen byte, OR in bits 4..7 of pattern byte (left
shift one)
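
The remainder sequence and the split between easy and nasty cases can be checked with a few lines of Python. This is a sketch for the 5-pixel cell width discussed here; the variable names are made up for the example.

```python
CELL_WIDTH = 5  # pixels per character, including the implied blank pixel

# Pixel offset within a byte for the first eight character columns.
remainders = [(col * CELL_WIDTH) % 8 for col in range(8)]

# A cell spills into the next byte when it does not fit in what is left.
crossing = sorted(r for r in remainders if r + CELL_WIDTH > 8)
contained = sorted(r for r in remainders if r + CELL_WIDTH <= 8)

print(remainders)
print("one byte:", contained, "two bytes:", crossing)
```

Because 5 and 8 share no common factor, every offset from 0 to 7 turns up exactly once before the pattern repeats.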

Still looks pretty complicated to me. I might write the main loop for the
pixel remainders 0..3 (the easy cases that don't cross cell boundaries)
something like this:

lda x_coor
and #%0011 ; mod 4
tax ; 0..3
lda shiftvct_lo,x
sta shift_vct
lda shiftvct_hi,x
sta shift_vct+1

ldy #8-1
- lda (char_ptr),y
jmp (shift_vct)

rght3:
lsr
rght2:
lsr
rght1:
lsr

plot:
sta temp
lda (screen_ptr),y
and mask,x
ora temp
sta (screen_ptr),y
dey
bpl -
rts

For pixel remainders 4..7 (the nasty cases that cross cell boundaries), I
might do something like this:

lda x_coor
and #%0111 ; mod 8
tax ; 4..7 (not 0..7)
lda shiftvct_lo,x
sta shift_vct
lda shiftvct_hi,x
sta shift_vct+1

clc
lda screen_ptr
adc #8
sta screen_ptr2
lda screen_ptr+1
adc #0
sta screen_ptr2+1

ldy #8-1 ; rows 7..0
- lda #$00
sta temp
lda (char_ptr),y
jmp (shift_vct)

rgt4:
asl ; 4 lft = 4 rgt
rol temp
rgt5:
asl ; 3 lft = 5 rgt
rol temp
rgt6:
asl ; 2 lft = 6 rgt
rol temp
rgt7:
asl ; 1 lft = 7 rgt
rol temp

plot2:
sta temp+1
lda (screen_ptr),y
and lft_mask,x
ora temp
sta (screen_ptr),y
lda (screen_ptr2),y
and rgt_mask,x
ora temp+1
sta (screen_ptr2),y
dey
bpl - ; loop until row 0 is done
rts

That's only an outline of one way to do it with lots of details left out, of
course.

- Anton Treuenfels

BruceMcF

Dec 26, 2010, 4:41:07 PM

On Dec 26, 4:41 am, "Anton Treuenfels" <teamtemp...@yahoo.com> wrote:

> One thing I notice is that these routines assume four-pixel wide chars, but
> I understood you to mean four-pixel wide definitions of five-pixel wide
> chars. The fifth pixel is implied as always clear, hence does not need to be
> stored and is "cleared out" by the plotting routine. I don't see that
> happening here.

No, I realized that after I posted it. The AND masks are off by one
bit, and the four wide needs to clear the top bit of bytes 8~15.

The shift vector is a few cycles faster done directly, and more compact
when the table is included:
clc

> lda x_coor
> and #%0011 ; mod 4
tax

adc #<rght3 ; low byte of target
sta shift_vct
lda #>rght3 ; high byte of target
adc #0
sta shift_vct+1
ldy #7

> - lda (char_ptr),y
> jmp (shift_vct)
>
> rght3:
> lsr
> rght2:
> lsr
> rght1:
> lsr
>
> plot:
> sta temp

plot1:

> lda (screen_ptr),y
> and mask,x
> ora temp
> sta (screen_ptr),y
> dey

> bpl plot1
> rts

Obviously, unrolling the loop is faster than that for the first four
bit positions, since the masks are immediates and the "DEX / BMI"
averages faster than the jump vector, though in a second draft I'd
look to splitting it between 0~3 and 4~7. Like all unrolled loops, it
is more repetitive than complex.

Anton Treuenfels

Dec 26, 2010, 5:40:53 PM

"BruceMcF" <agi...@netscape.net> wrote in message

news:b2e95aec-3d56-4f18...@s5g2000yqm.googlegroups.com...

Obviously, unrolling the loop is faster than that for the first four
bit positions, since the masks are immediates and the "DEX / BMI"
averages faster than the jump vector, though in a second draft I'd
look to splitting it between 0~3 and 4~7. Like all unrolled loops, it
is more repetitive than complex.

---------------------------

Speaking of second drafts, here's a more complete version. This one
minimizes code size. If that's not a concern then it could be made a bit
faster by writing each of the eight cases as a separate routine, eliminating
the indirect jump and using immediate values for masks.

; A = screen code of char to plot (0..255)
; Y = row (0..24)
; X = col (0..63)

PlotChar:

; create pointer to character image definition
; - bits 0..4 of eight consecutive bytes

asl
rol temp+1
asl
rol temp+1
asl
rol temp+1
clc
adc #<char_image_base
sta char_ptr
lda temp+1
and #%00000111
adc #>char_image_base
sta char_ptr+1

; create pointers to screen cells

txa
sta temp
asl ; *2
asl ; *4
adc temp ; *5
tax
lda #$00
rol
sta temp+1
txa
and #%11111000
adc screen_row_lo,y
sta screen_ptr_lft
lda temp+1
adc screen_row_hi,y
sta screen_ptr_lft+1
tay
lda screen_ptr_lft
adc #8
sta screen_ptr_rgt
bcc +
iny
+ sty screen_ptr_rgt+1

; set indirect jump for character image shifting

txa
and #%00000111
tax
lda char_shift_lo,x
sta shift_vct
lda char_shift_hi,x
sta shift_vct+1

; set screen cell masks

lda screen_lft_mask,x
sta lft_mask
lda screen_rgt_mask,x
sta rgt_mask

; main loop

ldy #8-1

next_row:
lda (char_ptr),y
jmp (shift_vct)

; plots that cross cell boundaries

rgt4: ; lft 4 = rgt 4
cmp #$80
rol
rgt5: ; lft 3 = rgt 5
cmp #$80
rol
rgt6: ; lft 2 = rgt 6
cmp #$80
rol
rgt7: ; lft 1 = rgt 7
cmp #$80
rol

; plot right screen cell

tax
eor (screen_ptr_rgt),y
and rgt_mask
eor (screen_ptr_rgt),y
sta (screen_ptr_rgt),y
txa
jmp rgt0 ; bit 7 can be set after the rotates, so bpl is not always taken

; plots that do not cross cell boundaries

rgt3:
lsr
rgt2:
lsr
rgt1:
lsr
rgt0:

; plot left screen cell

eor (screen_ptr_lft),y
and lft_mask
eor (screen_ptr_lft),y
sta (screen_ptr_lft),y

; another row ?

dey
bpl next_row ; b:yes
rts

; character shift vectors
; - can replace "char_shift_hi" lookup with constant page number
; if guaranteed that all eight targets are on the same page

char_shift_lo:
.byte <rgt0,<rgt1,<rgt2,<rgt3,<rgt4,<rgt5,<rgt6,<rgt7
char_shift_hi:
.byte >rgt0,>rgt1,>rgt2,>rgt3,>rgt4,>rgt5,>rgt6,>rgt7

; screen cell masks
; the first four bytes of "screen_rgt_mask" are "don't care" since they
; will never be used (but it doesn't save anything to check for this
; when setting masks - maybe can overlap tables somewhere if useful)

screen_lft_mask:
.byte %11111000,%01111100,%00111110,%00011111
.byte %00001111,%00000111,%00000011,%00000001
screen_rgt_mask:
.byte %00000000,%00000000,%00000000,%00000000
.byte %10000000,%11000000,%11100000,%11110000

; screen row starts

screen_row_lo:
]ptr = screen_base
.repeat 25
.byte <]ptr
]ptr = ]ptr + 320
.endr

screen_row_hi:
]ptr = screen_base
.repeat 25
.byte >]ptr
]ptr = ]ptr + 320
.endr
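
The eor/and/eor sequence used in both plot steps above is a masked merge: it takes the new bits where the mask is 1 and keeps the screen bits where it is 0, without needing a temporary. A Python sketch comparing it with the more familiar and/or form (the function names are hypothetical):

```python
def masked_merge(new_bits, screen_byte, mask):
    """EOR, AND, EOR: new bits where mask is 1, screen bits elsewhere."""
    return ((new_bits ^ screen_byte) & mask) ^ screen_byte


def and_or_merge(new_bits, screen_byte, mask):
    """Reference version: clear the field, then OR in the masked new bits."""
    return (screen_byte & ~mask & 0xFF) | (new_bits & mask)


example = masked_merge(0b10110111, 0b01010101, 0b11111000)
print(format(example, "08b"))
```

Any stray bits in the rotated character byte that fall outside the mask are discarded, which is why the rotated value can be used for both cells unchanged.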

BruceMcF

Dec 26, 2010, 10:18:45 PM

On Dec 26, 5:40 pm, "Anton Treuenfels" <teamtemp...@yahoo.com> wrote:
> ; A = screen code of char to plot (0..255)
> ; Y = row (0..24)
> ; X = col (0..63)

; POSXY - assumes screen is aligned on page boundary
; X = col (0-64)
; Y = row (0-23)
; Row base pre-computed as an absolute address
; Col pre-computed as an offset byte, overflow inferred on col>50
; Result: screen=address, col=column, row=row, mod=bit-offset, uses A

Pos64XY
STX col
STY row
TXA
AND #7
TAX
LDA bitoffset,x
CLC
LDA rowlo,y
ADC colbyte,x
STA screen
LDA rowhi,y
CPX #50
BPL Pos64XY1
ADC #0
STA screen+1
LDA modebyte,x
TAX
RTS
Pos64XY1
ADC #1
STA screen+1
LDA colmod,x
STA bitoffset
RTS

NextCol:
; Result: screen=address, col=column, X=bit offset
; logic: bit offsets 0,1,2,3,4 crossed a byte boundary
LDX col
INX
CPX #64
BEQ NextRow
STX col
LDA colmod,x
CMP #5
BPL NextCol1
CLC
LDA screen
ADC #8
STA screen
NextCol1:
STA bitoffset
RTS

NextRow:
; Result: screen=address, row=row, col=bitoffset=0
; carry set = screen overflow, row stalls
; (could also wrap around)
SEC
LDX #0
STX col
STX bitoffset
LDY row
CPY #24
BEQ NextRow1
INY
CLC
NextRow1:
LDA rowlo,y
STA screen
LDA rowhi,y
STA screen+1
RTS

> PlotChar:
>
> ; create pointer to character image definition
> ; - bits 0..4 of eight consecutive bytes

; this assumes that the font table is aligned on an 8-byte boundary
; aligning on a binary page boundary would be the most common

clc
adc #<(char_image_base / 8)
sta char_ptr
lda #>(char_image_base / 8)
adc #0
asl char_ptr
rol
asl char_ptr
rol
asl char_ptr
rol
sta char_ptr+1

It occurs to me that a clear row is faster than a clear character
position, so I might split out clear row and clear character position
routines and have the character plot routine assume that the target is
cleared:

; ClearRow
; clear a full row of characters, screen is set to base of row

ClearRow
LDA #0
TAY
ClearRow1
STA (screen),Y
INY
BNE ClearRow1
INC screen+1
LDY #63
ClearRow2
STA (screen),Y
DEY
BPL ClearRow2
DEC screen+1
RTS

; ClearPos
; clear the current position, already set
; 12 byte mask table, maskhi=masklo+4

ClearPos
LDX mod
CPX #4
BMI ClearPos2 ; offsets 0..3 touch only the low cell
LDY #15
ClearPos1:
LDA (screen),Y
AND maskhi,x
STA (screen),Y
DEY
CPY #8
BPL ClearPos1
ClearPos2
LDY #7
ClearPos3:
LDA (screen),Y
AND masklo,x
STA (screen),Y
DEY
BPL ClearPos3
RTS

Having the higher level printing routine determine whether to clear a
full row or clear a single character position simplifies the character
plot. For one thing, the anomaly of bitoffset=4 where a single bit of
the high byte is cleared but it is not touched by the character plot
is handled in the set-up of the mask.

Joe Forster/STA

Dec 27, 2010, 11:21:51 AM

I haven't delved into this code, but couldn't self-modifying code be
a good compromise between code size and speed? I mean there's a single
skeleton of code and the differences - shift counts, bit masks - are
patched into it beforehand.

BruceMcF

Dec 27, 2010, 2:09:24 PM

Yes, it costs 9 cycles to do the patch of one bit mask if the mask is
tabled, and the immediate saved 3 cycles per iteration versus the X-
indexed read. See below, that is actually faster on average than the
unrolled loop.

But I think a similar saving is available through stashing the mask in
X, if the clearing of the field and the plotting of the character are
done independently.
...
LDA masklo,X
TAX
LDY #7
LP:
TXA
AND (screen),Y
STA (screen),Y
DEY
BPL LP

... and since that costs 7 cycles for the single byte loop set-up,
where the unrolled loop (with a previous split between 0-3 and 4-7)
would be:
(1) bit 0, 2 cycles (fall through a branch)
(2) bit 1, 9 cycles, store to X, DEX, take one branch, fall through a
branch
(3) bit 2, 14 cycles store to X, DEX*2, take two branches, fall
through a branch
(4) bit 3, 15 cycles, store to X, DEX*2, take three branches

An average of 10 cycles overhead, so stashing in X and using it is
about three cycles faster as well as much more compact.
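
The comparison is just an average over the four cases listed above, taking the per-case cycle counts as given. In Python:

```python
unrolled_overhead = [2, 9, 14, 15]  # cycles for bit 0, 1, 2, 3 as listed
stash_in_x_overhead = 7             # cycles of set-up for the single loop

average = sum(unrolled_overhead) / len(unrolled_overhead)
saving = average - stash_in_x_overhead

print(average, saving)
```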

Doing the clearing of the space separately from the plotting not only
saves a bucket of cycles when clearing a whole row at a time, it also
means that the X-index only needs to be used once. However, it's still
faster to have eight separate loops:

LDY #7
LDX mod
BEQ bit0
CPX #4
BPL bit47
CPX #2
BMI bit1
BEQ bit2
BPL bit3
bit47:
BEQ bit4
CPX #6
BMI bit5
BEQ bit6
BNE bit7

bit0:
LDA (char),Y
AND #$0F
ORA (screen),Y
STA (screen),Y
DEY
BPL bit0
RTS

bit1:
LDA (char),Y
AND #$0F
ASL
ORA (screen),Y
STA (screen),Y
DEY
BPL bit1
RTS

bit2:
LDA (char),Y
AND #$0F
ASL
ASL
ORA (screen),Y
STA (screen),Y
DEY
BPL bit2
RTS

bit3:
LDA (char),Y
AND #$F0
LSR
ORA (screen),Y
STA (screen),Y
DEY
BPL bit3
RTS

bit4:
LDA (char),Y
AND #$F0
ORA (screen),Y
STA (screen),Y
DEY
BPL bit4
RTS

bit5:
LDA (char),Y
AND #$F0
ASL
ORA (screen),Y
STA (screen),Y
DEY
BPL bit5
LDY #15
bit5hi:
LDA (char),Y
AND #$0F
LSR
ORA (screen),Y

STA (screen),Y
DEY
CPY #8

BPL bit5hi
RTS

bit6:
LDA (char),Y
AND #$F0
ASL
ASL
ORA (screen),Y
STA (screen),Y
DEY
BPL bit6
LDY #15
bit6hi:
LDA (char),Y
AND #$0F
LSR
LSR
ORA (screen),Y

STA (screen),Y
DEY
CPY #8

BPL bit6hi
RTS

bit7:
LDA (char),Y
AND #$F0
ASL
ASL
ASL
ORA (screen),Y
STA (screen),Y
DEY
BPL bit7
LDY #15
bit7hi:
LDA (char),Y
AND #$0F
LSR
LSR
LSR
ORA (screen),Y

STA (screen),Y
DEY
CPY #8

BPL bit7hi
RTS

Even self-modifying code within a loop to jump over the correct number
of shifts has a penalty of 24 cycles, since the jump is taken eight
times. And taking the clearing out of the character plotting loops
makes the per-bit routines much shorter.

I think if the routine does not clear the screen, but instead clears
the top two rows and then clears the row below the current row when
starting a new line, the delay from the high res character display
would be not much more than for a four bit wide character display.

BruceMcF

Dec 27, 2010, 3:31:08 PM

If the bitoffset is passed in X, the base of the screen passed in AY:

LDX bitoffset
CPX #4
BMI clearlo
LDA maskhi,x
TAX
LDY #15
lphi:

TXA
AND (screen),Y
STA (screen),Y
DEY

CPY #8
BPL lphi
clearlo:
LDY #7
LDX bitoffset
LDA masklo,x
TAX
lplo:

TXA
AND (screen),Y
STA (screen),Y
DEY

BPL lplo
RTS

The mask table is 12 bytes, with: maskhi=masklo+4.
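
A Python sketch of how that 12-byte table could be generated, assuming a 5-pixel field and the hi-res convention that the leftmost pixel of a byte is bit 7. The helper names are hypothetical.

```python
CELL_WIDTH = 5


def clear_masks(offset):
    """AND masks for the left and right byte that clear a 5-pixel field
    starting `offset` pixels from the left edge of the left byte."""
    field = ((1 << CELL_WIDTH) - 1) << (16 - CELL_WIDTH - offset)
    keep = ~field & 0xFFFF
    return keep >> 8, keep & 0xFF


mask_lo = [clear_masks(o)[0] for o in range(8)]     # bytes 0..7 of the cell pair
mask_hi = [clear_masks(o)[1] for o in range(4, 8)]  # bytes 8..15, offsets 4..7 only
table = bytes(mask_lo + mask_hi)

print(" ".join(format(b, "02X") for b in table))
```

With maskhi defined as masklo+4 and X running from 4 to 7, the indexed read lands on the last four entries, so no bytes are wasted on offsets that never touch the second cell.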

Anton Treuenfels

Dec 27, 2010, 6:36:15 PM

"BruceMcF" <agi...@netscape.net> wrote in message

news:e27f244d-a2c1-4c16...@j25g2000yqa.googlegroups.com...

On Dec 26, 5:40 pm, "Anton Treuenfels" <teamtemp...@yahoo.com> wrote:
> ; A = screen code of char to plot (0..255)
> ; Y = row (0..24)
> ; X = col (0..63)

; POSXY - assumes screen is aligned on page boundary
; X = col (0-64)
; Y = row (0-23)
; Row base pre-computed as an absolute address
; Col pre-computed as an offset byte, overflow inferred on col>50
; Result: screen=address, col=column, row=row, mod=bit-offset, uses A

Pos64XY
STX col
STY row
TXA
AND #7

TAX ; now X=0..7

LDA bitoffset,x
CLC
LDA rowlo,y
ADC colbyte,x
STA screen
LDA rowhi,y

CPX #50 ; ...which makes this test have only one result

------------------------------

Oh heck, let's unroll:

ClearPos:
ldy mod
cpy #4 ; left cell only ?
bcc + ; b:yes
ldx maskhi,y
ldy #16-1
jsr ClearCell ; clear right cell
ldy mod
+ ldx masklo,y
ldy #8-1 ; fall through

ClearCell:
txa
and (screen),y
sta (screen),y
.repeat 7
dey
txa
and (screen),y
sta (screen),y
.endrepeat
rts

- Anton Treuenfels

BruceMcF

Dec 27, 2010, 8:13:44 PM

Yeah, there's a missing LDX col in there somewhere.

I like the fully unrolled 8-byte mask. 8x6 vs 1x8 is an extra 16
bytes, to save 23 clock cycles.