Fast Fizz Buzz program

fizz buzz

unread,

Jul 16, 2018, 3:01:10 AM7/16/18

to

Hello.

Me and my friend are trying to create fastest Fizz Buzz program. This is very simple test that often appear as interview question. Here are the original rules:

> Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”."

Many examples and some interesting articles can be found here:

<http://wiki.c2.com/?FizzBuzzTest>

We changed the quantity of numbers from 100 to 1000000 (1 million) so it will be easier to measure program performance. Here is what I got so far. This code is written for x86_64 Linux and optimized to print numbers only up to 7 digits long.

bash$ time ./asm-bitmask-mul > /dev/null

real 0m0.010s
user 0m0.006s
sys 0m0.003s

I was wondering if anyone can suggest how to optimize it further. I will try to send same question to alt.lang.asm.

Thank you.

###############################################################################
.set SYS_EXIT, 60
.set SYS_WRITE, 1

.set STDIN, 0
.set STDOUT, 1
.set STDERR, 2

.set ITERATIONS, 1000000

.set FIZZBUZZ_MASK, 38504 # 1001011001101000
.set BUZZ_MASK, 1056 # 0000010000100000

###############################################################################
.data

.lcomm OutputBuffer, ITERATIONS*9
# .lcomm OutputBuffer, 6274073
Fizz:
.ascii "Fizz\n "
Buzz:
.ascii "Buzz\n "
FizzBuzz:
.ascii "FizzBuzz\n "

Str100p:
.word 0x3030, 0x3130, 0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930
.word 0x3031, 0x3131, 0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931
.word 0x3032, 0x3132, 0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932
.word 0x3033, 0x3133, 0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933
.word 0x3034, 0x3134, 0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934
.word 0x3035, 0x3135, 0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935
.word 0x3036, 0x3136, 0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936
.word 0x3037, 0x3137, 0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937
.word 0x3038, 0x3138, 0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938
.word 0x3039, 0x3139, 0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939

###############################################################################
.text

.globl start
.globl _main

.p2align 4, 0x90
start:
_main:

Initializing:
xorl %r12d, %r12d # counter
leaq OutputBuffer(%rip), %rdi # output buffer pointer
incq %rdi
xorl %r14d, %r14d # secondary counter ( 1 .. 15 )

movq $FIZZBUZZ_MASK, %r13 # FIZZBUZZ_MASK, 0 - print number, 1 - print word
movq $BUZZ_MASK, %r15 # BUZZ_MASK, 0 - print Fizz, 1 - print Buzz

xorl %r11d, %r11d
incb %r11b # mask pointer

movq Fizz(%rip), %r10
movq Buzz(%rip), %rbp
movq FizzBuzz(%rip), %r9

leaq Str100p(%rip), %rsp # Str100p table pointer

.p2align 4, 0x90
MainLoop:
incq %r12 # increase counter
incb %r14b # increase secondary counter
shlq %r11 # shift bitmask pointer

movl %r13d, %esi
andl %r11d, %esi # check mask
jz PrintMainCounter

cmpb $15, %r14b # check if we're at the FizzBuzz point
je PrintFizzBuzz

andl %r15d, %esi
jnz PrintBuzz
PrintFizz:
movq %r10, (%rdi)
addq $5, %rdi

cmpq $ITERATIONS, %r12
jne MainLoop
jmp PrintBuffer

.p2align 4, 0x90
PrintBuzz:
movq %rbp, (%rdi)
addq $5, %rdi

cmpq $ITERATIONS, %r12
jne MainLoop
jmp PrintBuffer

.p2align 4, 0x90
PrintFizzBuzz:
movq %r9, (%rdi)
addq $9, %rdi
movb $0xA, -1(%rdi)

incb %r11b # reset mask pointer
xorb %r14b, %r14b # reset secondary counter

cmpq $ITERATIONS, %r12
jne MainLoop
jmp PrintBuffer

.p2align 4, 0x90
PrintMainCounter:
movl %r12d, %eax
PrintDecimalNumber:
movl $0xA, %esi # accumulator
xorl %ecx, %ecx # counter
incb %cl

.p2align 4, 0x90
1:
movslq %eax, %rax
imulq $1374389535, %rax, %rdx # magic number 0x51EB851F
movq %rdx, %rbx
shrq $63, %rbx
sarq $37, %rdx
addl %ebx, %edx
imull $100, %edx, %ebx # %rax = 1234567, %rbx = 1234500, %rdx = 12345
subl %ebx, %eax
xchgl %edx, %eax

testl %eax, %eax
jz 2f

shlq $16, %rsi
movw (%rsp, %rdx, 2), %si
addb $2, %cl
jmp 1b
2:
shlq $16, %rsi
movw (%rsp, %rdx, 2), %si

cmpb $0x30, %sil
jne 3f

decq %rdi # adjust output buffer pointer - current number and buffer content will overlap
movb $0xA, %sil # overlapped byte should always be '\n'
3:
addb $2, %cl
movq %rsi, (%rdi)
addq %rcx, %rdi

cmpq $ITERATIONS, %r12
jne MainLoop
PrintBuffer:
movq %rdi, %rdx
leaq OutputBuffer(%rip), %rsi
incq %rsi
subq %rsi, %rdx # number of symbols
movl $STDOUT, %edi
movl $SYS_WRITE, %eax
syscall
Exit:
xorl %edx, %edx
movl $SYS_EXIT, %eax
syscall

###############################################################################

Terje Mathisen

unread,

Jul 16, 2018, 5:01:23 AM7/16/18

to

fizz buzz wrote:
> Hello.
>
> Me and my friend are trying to create fastest Fizz Buzz program. This
> is very simple test that often appear as interview question. Here are
> the original rules:
>
>> Write a program that prints the numbers from 1 to 100. But for
>> multiples of three print “Fizz” instead of the number and for the
>> multiples of five print “Buzz”. For numbers which are multiples of
>> both three and five print “FizzBuzz”."
>
> Many examples and some interesting articles can be found here:
>
> <http://wiki.c2.com/?FizzBuzzTest>
>
> We changed the quantity of numbers from 100 to 1000000 (1 million) so
> it will be easier to measure program performance. Here is what I got
> so far. This code is written for x86_64 Linux and optimized to print
> numbers only up to 7 digits long.

This is still trivial...

I assume you run the program with the output redirected to a file or
NULL? ... Yes, I see that you do so!

Otherwise the console output time will totally dominate.

The easiest way to speed this up is to unroll the code 15 times, in
which case you take away _all_ the internal decision-making and only the
counter increment and output remains.

It is tempting to preformat the output, unroll by 30 and work in ascii
since that would leave all the ascii carries in fixed positions, i.e.
the bottom digit would stay fixed and then you would add '3' to the
second digit and handle any carries. I do suspect thought that you can
in fact work in binary and do the bin-to-ascii conversion on the fly for
the 8 out of 15 values that require this.

I would definitely allocate an output buffer large enough to store the
entire output, so that the printout becomes a single write() call.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

fizz buzz

unread,

Jul 17, 2018, 1:10:21 AM7/17/18

to

On Monday, July 16, 2018 at 4:01:23 AM UTC-5, Terje Mathisen wrote:
> fizz buzz wrote:
> > Hello.
> >
> > Me and my friend are trying to create fastest Fizz Buzz program. This
> > is very simple test that often appear as interview question. Here are
> > the original rules:
> >
> >> Write a program that prints the numbers from 1 to 100. But for
> >> multiples of three print “Fizz” instead of the number and for the
> >> multiples of five print “Buzz”. For numbers which are multiples of
> >> both three and five print “FizzBuzz”."
> >
> > Many examples and some interesting articles can be found here:
> >
> > <http://wiki.c2.com/?FizzBuzzTest>
> >
> > We changed the quantity of numbers from 100 to 1000000 (1 million) so
> > it will be easier to measure program performance. Here is what I got
> > so far. This code is written for x86_64 Linux and optimized to print
> > numbers only up to 7 digits long.
>
> This is still trivial...

Yes, of course, I'm not a professional programmer.

> I assume you run the program with the output redirected to a file or
> NULL? ... Yes, I see that you do so!
>
> Otherwise the console output time will totally dominate.

Correct; we're not trying to measure speed of console output.

> The easiest way to speed this up is to unroll the code 15 times, in
> which case you take away _all_ the internal decision-making and only the
> counter increment and output remains.

I tried to do so, but for some reason the resulting code was actually slower
than previous version. Perhaps I missed some mistake in the code? I thought the
resulting program is too big to fit to processor cache or something like that so I
came up with bimask version, which is still faster than any other program doing
honest loop with decision-making.

> It is tempting to preformat the output, unroll by 30 and work in ascii
> since that would leave all the ascii carries in fixed positions, i.e.
> the bottom digit would stay fixed and then you would add '3' to the
> second digit and handle any carries. I do suspect thought that you can
> in fact work in binary and do the bin-to-ascii conversion on the fly for
> the 8 out of 15 values that require this.

I'm not sure I fully understand this paragraph. Can you please give a code example?
(Don't have to be actual assembler code, just pseudo-code.)

Thank you!

Terje Mathisen

unread,

Jul 17, 2018, 11:56:04 AM7/17/18

to

fizz buzz wrote:
> On Monday, July 16, 2018 at 4:01:23 AM UTC-5, Terje Mathisen wrote:
>> It is tempting to preformat the output, unroll by 30 and work in
>> ascii since that would leave all the ascii carries in fixed
>> positions, i.e. the bottom digit would stay fixed and then you
>> would add '3' to the second digit and handle any carries. I do
>> suspect thought that you can in fact work in binary and do the
>> bin-to-ascii conversion on the fly for the 8 out of 15 values that
>> require this.
>
> I'm not sure I fully understand this paragraph. Can you please give a
> code example? (Don't have to be actual assembler code, just
> pseudo-code.)

count = 0;
for (j = 0; j < 30; j+= 10) {
number = sprintf("%d", count+j);
lastdigit = number + strlen(number)-1;

for (i = j; i < j+10; i++) {

if ((i % 15) == 0) {
printf("FizzBuzz");
}
else if ((i % 5) == 0) {
printf("Buzz");
}
else if ((i %3) == 0) {
printf("Fizz");
}
else {
printf(number);
}
printf("\n");
lastdigit++;
}
}

Instead of the printf() statements you would output the actual ASM code
to handle this particular case, i.e. use a C program to generate an
unrolled ASM program.

By only generating the ascii numeric value once for every 10 lines, the
bin-to-ascii overhead pretty much goes away and you end up with asm code
like this:

mod_0:
mov esi, offset FizzBuzz
mov ecx, 10 ; FizzBuzz+CRLF length
rep movsb
inc bl ; Last digit value
mov [ebp],bl ; Update last digit

mod_1:
mov esi, offset number
mov ecx, edx ; Current number length, including CRLF
rep movsb
inc bl ; Last digit value
mov [ebp],bl ; Update last digit

mod_2:
mov esi, offset number
mov ecx, edx ; Current number length
rep movsb
inc bl ; Last digit value
mov [ebp],bl ; Update last digit

mod_3:
mov esi, offset Fizz
mov ecx, 6 ; Fizz+CRLF length
rep movsb
inc bl ; Last digit value
mov [ebp],bl ; Update last digit

mod_4:
mov esi, offset number
mov ecx, edx ; Current number length
rep movsb
inc bl ; Last digit value
mov [ebp],bl ; Update last digit

mod_5:
mov esi, offset Buzz
mov ecx, 6 ; Buzz+CRLF length
rep movsb
inc bl ; Last digit value
mov [ebp],bl ; Update last digit

etc up to

mod_14:
mov esi, offset number
mov ecx, edx ; Current number length
rep movsb
inc bl ; Last digit value
mov [ebp],bl ; Update last digit

mod_15:
mov esi, offset FizzBuzz
mov ecx, 10 ; FizzBuzz+CRLF length
rep movsb
inc bl ; Last digit value
mov [ebp],bl ; Update last digit

repeat for 16 to 29

The next obvious tweak is to only update the ascii number value when
needed, i.e. increment by two when the next line will output Fizz/Buzz
and then skip the update for those lines.

Total time should be ~5 cycles per line, so 5 million cycles or 1.5 ms
to count from 1 to 1E6.

The next idea would use SSE regs to write the output bytes, with
unaligned store operations but no memory value updates:

We can fit two 6-digit numbers+CRLF in a 16-byte SSE reg, 4 in AVX256
and 8 in AVX512, so we only need 8 AVX regs to handle all 30 output
lines for an unrolled block. Even if we have to specialcase all values
below 100 000, the 900K values from there and up would run at pretty
much streaming memory store speed, so several GB/s and the possibility
to get down to less than a cycle/line.

Maybe 0.2 to 0.3 ms in total, a bit more if we have to be able to save
the output to a file?

Rick C. Hodgin

unread,

Jul 17, 2018, 2:56:15 PM7/17/18

to

On 7/16/2018 12:41 AM, fizz buzz wrote:
> Hello.
>

> Me and my friend are trying to create fastest Fizz Buzz program...[snip]

Wouldn't the fastest fizz buzz program be to just print the output in
a hard-coded form?

Here's an idea I had for a general algorithm. If there was a better
way to do the bit shift across two 64-bit values it would be better.

-----[ Begin ]-----

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

void fizz(int num) { printf("fizz\n"); }
void buzz(int num) { printf("buzz\n"); }
void fizzbuzz(int num) { printf("fizzbuzz\n"); }
void other(int num) { printf("%d\n", num); }

struct STargets { void (*func)(int num); };
STargets funcs[4] = { other, fizz, buzz, fizzbuzz };

uint64_t three_data[2] = { 0x4924924924924924, 0x0000000492492492 };
uint64_t five_data[2] = { 0x0842108421084210, 0x0000000842108421 };

int main(int argc, char* argv[])
{
uint32_t lnI;

for (lnI = 1; lnI <= 100; ++lnI)
{
funcs[(three_data[0] & 0x1) +
(2 * (five_data[0] & 0x1))].func(lnI);

three_data[0] = (three_data[0] >> 1) |
((three_data[1] & 1) << 63);
five_data[0] = (five_data[0] >> 1) |
((five_data[1] & 1) << 63);

three_data[1] >>= 1;
five_data[1] >>= 1;
}
return(0);
}

-----[ End ]-----

--
Rick C. Hodgin

Rick C. Hodgin

unread,

Jul 17, 2018, 3:26:18 PM7/17/18

to

A more generalized algorithm. Everything else remains the same:

-----[ Begin ]-----

// Set cycling bit values
uint32_t three_data = 0x924;
uint32_t five_data = 0x84210;

int main(int argc, char* argv[])
{
uint32_t lnI;

for (lnI = 1; lnI <= 100; ++lnI)
{

funcs[(three_data & 0x1) + (2 * (five_data & 0x1))].func(lnI);

three_data = (three_data >> 1) | ((three_data & 1) << 11);
five_data = (five_data >> 1) | ((five_data & 1) << 19);

Rick C. Hodgin

unread,

Jul 17, 2018, 3:56:22 PM7/17/18

to

I apologize for not posting in assembly. Here's the same algorithm,
but I left the print algorithms because they can be made generic as
per other fast / small algorithms for printing. Two parameters in
a table indicating how many bytes to add, and how many bytes to print
could be used to create a single "fizzbuzz" variable, and a single
print algorithm for fizz,buzz,fizzbuzz. A separate one for the int
value would be needed, and since the Fast Fizz Buzz OP said they had
it go to a higher value than 100, it would probably need to be a
general purpose algorithm.

In addition, for speed testing, it would probably be wise just to
generate the data for printing, but not to actually print, as that
would likely be the slowest factor, even if it was piped to an
output file:

_asm
{
mov esi,offset funcs
mov ecx,100
mov edx,offset three_data
mov edi,offset five_data

top_loop:
mov eax,[edx]
mov ebx,[edi]
and eax,1
and ebx,1
shl ebx,1
add eax,ebx
lea ebx,[esi+eax*4]

; Print algorithm can be modified to something simpler
pushad
mov eax,101
sub eax,ecx
push eax
call [ebx]
add esp,4
popad

; Shift algorithms as above
mov eax,dword ptr [edx]
mov ebx,dword ptr [edi]
and eax,1
and ebx,1
shr dword ptr [edx],1
shr dword ptr [edi],1
shl eax,11
shl ebx,19
or dword ptr [edx],eax
or dword ptr [edi],ebx

loop top_loop
}

--
Rick C. Hodgin

Rod Pemberton

unread,

Jul 18, 2018, 1:41:53 AM7/18/18

to

On Sun, 15 Jul 2018 21:41:09 -0700 (PDT)
fizz buzz <demon.a...@gmail.com> wrote:

> Me and my friend are trying to create fastest Fizz Buzz program.

First, see CLAX FizzBuzz thread in 2016 started by Mike Gonta:

https://groups.google.com/d/msg/comp.lang.asm.x86/Kx2fY75bwlo/xdHOI-nPAwAJ
Usenet msg-id neg376$k9u$1...@dont-email.me

Second, since Terje and Rick posted some C, I'll do so as well. As you
can see, no division or modulo is required for the base problem. You
can use down counters. The code also falls through instead of if-else.

Since it's not unrolled like Terje's example, nor optimized assembly
with shifts like yours, I'm curious as to how slow it is. Could you
time it?

Please don't send output to the screen as you'll time the slow console
output instead of the code. Yes, a while() usually optimizes better
than a for().

/* gcc -Wall -ansi -pedantic -O2 -o fizzbuzz fizzbuzz.c */

#include <stdio.h>

int main(void)
{
int i=0,i3=3,i5=5;

while(1)
{
i++;
i3--;
i5--;
if(!i3)
{
printf("Fizz");
i3=3;
}
if(!i5)
{
printf("Buzz");
i5=5;
}
if((i3!=3)&(i5!=5))
printf("%d",i);
printf("\n");
if(i==100)
break;
}

return(0);
}

Rod Pemberton
--
As long as the UK continues to work with the EU, Brexit won't happen.
The first pawn sacrifice: Gibraltar. Set Gibraltar free.

Rod Pemberton

unread,

Jul 18, 2018, 1:41:55 AM7/18/18

to

On Sun, 15 Jul 2018 21:41:09 -0700 (PDT)
fizz buzz <demon.a...@gmail.com> wrote:

> Many examples and some interesting articles can be found here:
>
> <http://wiki.c2.com/?FizzBuzzTest>
>

Rosetta Code has very many examples in many languages. Unfortunately,
all the C examples use slow modulo arithmetic.

http://rosettacode.org/wiki/FizzBuzz

Rick C. Hodgin

unread,

Jul 18, 2018, 8:12:19 AM7/18/18

to

On Tuesday, July 17, 2018 at 3:56:22 PM UTC-4, Rick C. Hodgin wrote:
> On 7/17/2018 3:17 PM, Rick C. Hodgin wrote:
> >     // Set cycling bit values
> >     uint32_t three_data = 0x924;
> >     uint32_t five_data = 0x84210;

An optimizing occurred to me driving in to work this morning. If
we shift the bits in five_data left by one, and cycle through 21
bits instead of 20, we can avoid the multiply by ANDing the value
by 2 instead of 1.

Change to:

uint32_t five_data = 0x84210 << 1;

> and ebx,1 <===
> shl ebx,1 <===
> add eax,ebx <===
> lea ebx,[esi+eax*4]

Change to:

and ebx,2

add eax,ebx
lea ebx,[esi+eax*4]

> ; Print algorithm can be modified to something simpler
> pushad
> mov eax,101
> sub eax,ecx
> push eax
> call [ebx]
> add esp,4
> popad
>
> ; Shift algorithms as above
> mov eax,dword ptr [edx]
> mov ebx,dword ptr [edi]
> and eax,1
> and ebx,1
> shr dword ptr [edx],1
> shr dword ptr [edi],1
> shl eax,11

> shl ebx,19 <====

Change to:

shl ebx,20

> or dword ptr [edx],eax
> or dword ptr [edi],ebx
>
> loop top_loop
> }

Just popped in my head while driving. :-)

--
Rick C. Hodgin

fizz buzz

unread,

Jul 18, 2018, 8:42:22 AM7/18/18

to

On Tuesday, July 17, 2018 at 10:56:04 AM UTC-5, Terje Mathisen wrote:

> count = 0;
> for (j = 0; j < 30; j+= 10) {
> number = sprintf("%d", count+j);
> lastdigit = number + strlen(number)-1;
>
> for (i = j; i < j+10; i++) {
>
> if ((i % 15) == 0) {
> printf("FizzBuzz");
> }
> else if ((i % 5) == 0) {
> printf("Buzz");
> }
> else if ((i %3) == 0) {
> printf("Fizz");
> }
> else {
> printf(number);
> }
> printf("\n");
> lastdigit++;
> }
> }
>
> Instead of the printf() statements you would output the actual ASM code
> to handle this particular case, i.e. use a C program to generate an
> unrolled ASM program.
>
> By only generating the ascii numeric value once for every 10 lines, the
> bin-to-ascii overhead pretty much goes away and you end up with asm code
> like this:

Yes, I understand now, thank you! I used your idea and I was able to considerably
reduce execution time. On my laptop it went down just a bit, but on my Linux server
user execution time shows as 0.0001 s! Six times faster than before.

However, I should note couple things:

> mod_14:
> mov esi, offset number
> mov ecx, edx ; Current number length
> rep movsb
> inc bl ; Last digit value
> mov [ebp],bl ; Update last digit
>
> mod_15:
> mov esi, offset FizzBuzz
> mov ecx, 10 ; FizzBuzz+CRLF length
> rep movsb
> inc bl ; Last digit value
> mov [ebp],bl ; Update last digit

I don't use REP MOVSB neither when printing a number nor when
printing a word. I found that it's faster to keep everything in registers, copy
the whole register (8 bytes) to a buffer, and then adjust buffer length by
actual length of data.

If you check my first source code, you'll see that I don't do any memory reads
except in PrintNumber routine. All constants (strings "Fizz", "Buzz", "FizzBuzz",
etc.) are kept in registers. Even in PrintNumber routine I copy and accumulate
ASCII digits in a register instead of memory buffer. That's why I can print numbers
only up to seven digits (seven digits plus newline character is eight bytes, exactly
the size of 64 bit register).

> The next obvious tweak is to only update the ascii number value when
> needed, i.e. increment by two when the next line will output Fizz/Buzz
> and then skip the update for those lines.

Yes, that's what I'm doing.

> Total time should be ~5 cycles per line, so 5 million cycles or 1.5 ms
> to count from 1 to 1E6.
>
> The next idea would use SSE regs to write the output bytes, with
> unaligned store operations but no memory value updates:
>
> We can fit two 6-digit numbers+CRLF

Linux needs only one character to represent new line.

> in a 16-byte SSE reg, 4 in AVX256
> and 8 in AVX512, so we only need 8 AVX regs to handle all 30 output
> lines for an unrolled block. Even if we have to specialcase all values
> below 100 000, the 900K values from there and up would run at pretty
> much streaming memory store speed, so several GB/s and the possibility
> to get down to less than a cycle/line.

I have to check if my laptop and server both support AVX512.

However, I suspect it won't be as easy as it seems. Even after first 100000 values there
still will be words that have to be printed (i.e. stored in AVX register mixed with numeric
values), and those words have different lengths. Plus, if I remember correctly, it's
possible to OR memory content and AVX reg, but not regular reg and AVX reg. In other
words, every time there is a need to shift AVX reg and load new string in it, one will have
to do memory read, and suffer speed penalty for that.

> Maybe 0.2 to 0.3 ms in total, a bit more if we have to be able to save
> the output to a file?

Current speed on server:

bash$ time ./asm-unfold-mul > /dev/null

real 0m0.007s
user 0m0.001s
sys 0m0.006s

On laptop:

bash$ time ./asm-unfold-mul > /dev/null

real 0m0.008s
user 0m0.004s
sys 0m0.003s

P.S. My current code is quite long. Is it OK to post (almost 10 Kb), would it
be OK to post it here?

Rick C. Hodgin

unread,

Jul 18, 2018, 8:57:25 AM7/18/18

to

On 7/18/2018 8:31 AM, fizz buzz wrote:
> P.S. My current code is quite long. Is it OK to post (almost 10 Kb), would it
> be OK to post it here?

If you are concerned, use pastebin.com, and paste a link here.
The content on pastebin is persistent over years.

Many Usenet users tell you to always post here as it leaves a
permanent record.

--
Rick C. Hodgin

fizz buzz

unread,

Jul 18, 2018, 9:12:30 AM7/18/18

to

On Wednesday, July 18, 2018 at 12:41:53 AM UTC-5, Rod Pemberton wrote:

> First, see CLAX FizzBuzz thread in 2016 started by Mike Gonta:
>
> https://groups.google.com/d/msg/comp.lang.asm.x86/Kx2fY75bwlo/xdHOI-nPAwAJ
> Usenet msg-id neg376$k9u$1...@dont-email.me

Thank you, very interesting.

> Second, since Terje and Rick posted some C, I'll do so as well. As you
> can see, no division or modulo is required for the base problem. You
> can use down counters. The code also falls through instead of if-else.
>
> Since it's not unrolled like Terje's example, nor optimized assembly
> with shifts like yours, I'm curious as to how slow it is. Could you
> time it?

On my laptop:

bash$ time ./fizzbuzz > /dev/null

real 0m0.168s
user 0m0.154s
sys 0m0.002s

fizz buzz

unread,

Jul 18, 2018, 9:12:32 AM7/18/18

to

On Wednesday, July 18, 2018 at 7:57:25 AM UTC-5, Rick C. Hodgin wrote:

> Many Usenet users tell you to always post here as it leaves a
> permanent record.

###############################################################################
.set SYS_EXIT, 60
.set SYS_WRITE, 1

.set STDIN, 0
.set STDOUT, 1
.set STDERR, 2

.set ITERATIONS, 1000000

###############################################################################
.data

.lcomm OutputBuffer, ITERATIONS*9

Fizz:
.ascii "Fizz\n "
Buzz:
.ascii "Buzz\n "
FizzBuzz:
.ascii "FizzBuzz\n "

FirstNum:
.ascii "0\n "

Str100p:
.word 0x3030, 0x3130, 0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930
.word 0x3031, 0x3131, 0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931
.word 0x3032, 0x3132, 0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932
.word 0x3033, 0x3133, 0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933
.word 0x3034, 0x3134, 0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934
.word 0x3035, 0x3135, 0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935
.word 0x3036, 0x3136, 0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936
.word 0x3037, 0x3137, 0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937
.word 0x3038, 0x3138, 0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938
.word 0x3039, 0x3139, 0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939

###############################################################################
.text

.globl start
.globl _main

.p2align 4, 0x90
start:
_main:

Initializing:
xorl %r12d, %r12d # counter
leaq OutputBuffer(%rip), %rdi # output buffer pointer
incq %rdi

movq Fizz(%rip), %r9
movq Buzz(%rip), %r10
movq FizzBuzz(%rip), %r11

leaq Str100p(%rip), %rsp # Str100p tasile pointer

movq FirstNum(%rip), %rsi
movq $2, %r13
xorl %ecx, %ecx
jmp mod_01

.p2align 4, 0x90
mod_00:

incq %r12 # increase counter

movl %r12d, %eax

movl $0xA, %esi # accumulator
xorl %ecx, %ecx

xorl %r13d, %r13d # counter
incb %r13b

.p2align 4, 0x90
1:
movslq %eax, %rax
imulq $1374389535, %rax, %rdx # magic number 0x51EB851F
movq %rdx, %rbx
shrq $63, %rbx
sarq $37, %rdx
addl %ebx, %edx
imull $100, %edx, %ebx # %rax = 1234567, %rbx = 1234500, %rdx = 12345
subl %ebx, %eax
xchgl %edx, %eax

testl %eax, %eax
jz 2f

shlq $16, %rsi
movw (%rsp, %rdx, 2), %si

addb $2, %r13b

jmp 1b
2:
shlq $16, %rsi
movw (%rsp, %rdx, 2), %si

cmpb $0x30, %sil
jne 3f

decb %r13b
shrq $8, %rsi
3:
movb %r13b, %cl
shlb $3, %cl
addb $2, %r13b

movq %r11, (%rdi) # print FizzBuzz

addq $9, %rdi
movb $0xA, -1(%rdi)

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_01:
incq %r12

xorl %eax, %eax
incb %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_02:
incq %r12

xorl %eax, %eax
incb %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_03:
incq %r12

movq %r9, (%rdi) # print Fizz

addq $5, %rdi

cmpq $ITERATIONS, %r12

jae PrintBuffer
mod_04:
incq %r12

xorl %eax, %eax
addb $2, %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_05:
incq %r12

movq %r10, (%rdi) # print Buzz

addq $5, %rdi

cmpq $ITERATIONS, %r12

jae PrintBuffer
mod_06:
incq %r12

movq %r9, (%rdi) # print Fizz

addq $5, %rdi

cmpq $ITERATIONS, %r12

jae PrintBuffer
mod_07:
incq %r12

xorl %eax, %eax
addb $3, %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_08:
incq %r12

xorl %eax, %eax
incb %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_09:
incq %r12

movq %r9, (%rdi) # print Fizz

addq $5, %rdi

cmpq $ITERATIONS, %r12

jae PrintBuffer
mod_10:

incq %r12 # increase counter

movl %r12d, %eax

movl $0xA, %esi # accumulator
xorl %ecx, %ecx

xorl %r13d, %r13d # counter
incb %r13b

.p2align 4, 0x90
1:
movslq %eax, %rax
imulq $1374389535, %rax, %rdx # magic number 0x51EB851F
movq %rdx, %rbx
shrq $63, %rbx
sarq $37, %rdx
addl %ebx, %edx
imull $100, %edx, %ebx # %rax = 1234567, %rbx = 1234500, %rdx = 12345
subl %ebx, %eax
xchgl %edx, %eax

testl %eax, %eax
jz 2f

shlq $16, %rsi
movw (%rsp, %rdx, 2), %si

addb $2, %r13b

jmp 1b
2:
shlq $16, %rsi
movw (%rsp, %rdx, 2), %si

cmpb $0x30, %sil
jne 3f

decb %r13b
shrq $8, %rsi
3:
movb %r13b, %cl
shlb $3, %cl
addb $2, %r13b

movq %r10, (%rdi) # print Buzz

addq $5, %rdi

cmpq $ITERATIONS, %r12

jae PrintBuffer
mod_11:
incq %r12

xorl %eax, %eax
incb %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_12:
incq %r12

movq %r9, (%rdi) # print Fizz

addq $5, %rdi

cmpq $ITERATIONS, %r12

jae PrintBuffer
mod_13:
incq %r12

xorl %eax, %eax
addb $2, %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_14:
incq %r12

xorl %eax, %eax
incb %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_15:
incq %r12

movq %r11, (%rdi) # print FizzBuzz

addq $9, %rdi
movb $0xA, -1(%rdi)

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_16:
incq %r12

xorl %eax, %eax
addb $2, %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_17:
incq %r12

xorl %eax, %eax
incb %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_18:
incq %r12

movq %r9, (%rdi) # print Fizz

addq $5, %rdi

cmpq $ITERATIONS, %r12

jae PrintBuffer
mod_19:
incq %r12

xorl %eax, %eax
addb $2, %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_20:

incq %r12 # increase counter

movl %r12d, %eax

movl $0xA, %esi # accumulator
xorl %ecx, %ecx

xorl %r13d, %r13d # counter
incb %r13b

.p2align 4, 0x90
1:
movslq %eax, %rax
imulq $1374389535, %rax, %rdx # magic number 0x51EB851F
movq %rdx, %rbx
shrq $63, %rbx
sarq $37, %rdx
addl %ebx, %edx
imull $100, %edx, %ebx # %rax = 1234567, %rbx = 1234500, %rdx = 12345
subl %ebx, %eax
xchgl %edx, %eax

testl %eax, %eax
jz 2f

shlq $16, %rsi
movw (%rsp, %rdx, 2), %si

addb $2, %r13b

jmp 1b
2:
shlq $16, %rsi
movw (%rsp, %rdx, 2), %si

cmpb $0x30, %sil
jne 3f

decb %r13b
shrq $8, %rsi
3:
movb %r13b, %cl
shlb $3, %cl
addb $2, %r13b

movq %r10, (%rdi) # print Buzz

addq $5, %rdi

cmpq $ITERATIONS, %r12

jae PrintBuffer
mod_21:
incq %r12

movq %r9, (%rdi) # print Fizz

addq $5, %rdi

cmpq $ITERATIONS, %r12

jae PrintBuffer
mod_22:
incq %r12

xorl %eax, %eax
addb $2, %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_23:
incq %r12

xorl %eax, %eax
incb %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_24:
incq %r12

movq %r9, (%rdi) # print Fizz

addq $5, %rdi

cmpq $ITERATIONS, %r12

jae PrintBuffer
mod_25:
incq %r12

movq %r10, (%rdi) # print Buzz

addq $5, %rdi

cmpq $ITERATIONS, %r12

jae PrintBuffer
mod_26:
incq %r12

xorl %eax, %eax
addb $3, %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_27:
incq %r12

movq %r9, (%rdi) # print Fizz

addq $5, %rdi

cmpq $ITERATIONS, %r12

jae PrintBuffer
mod_28:
incq %r12

xorl %eax, %eax
addb $2, %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer
mod_29:
incq %r12

xorl %eax, %eax
incb %al
shlq %cl, %rax
addq %rax, %rsi
movq %rsi, (%rdi) # emulate number
addq %r13, %rdi

cmpq $ITERATIONS, %r12
jae PrintBuffer

jmp mod_00

.p2align 4, 0x90

PrintBuffer:
movq %rdi, %rdx
leaq OutputBuffer(%rip), %rsi
incq %rsi
subq %rsi, %rdx # number of symbols
movl $STDOUT, %edi
movl $SYS_WRITE, %eax
syscall
Exit:

xorl %edi, %edi

fizz buzz

unread,

Jul 18, 2018, 9:12:33 AM7/18/18

to

On Tuesday, July 17, 2018 at 1:56:15 PM UTC-5, Rick C. Hodgin wrote:

> Wouldn't the fastest fizz buzz program be to just print the output in
> a hard-coded form?

Yes, but that would be cheating. :-)

demon.a...@gmail.com

unread,

Jul 18, 2018, 12:54:29 PM7/18/18

to

I think that my original code would be faster. I understand it might be harder to read since I use AT&T syntax. I'll try to emphasize actual logic:

.set FIZZBUZZ_MASK, 38504 # 1001011001101000
.set BUZZ_MASK, 1056 # 0000010000100000

...

xorl %r14d, %r14d # secondary counter ( 1 .. 15 )

movq $FIZZBUZZ_MASK, %r13 # FIZZBUZZ_MASK, 0 - print number, 1 - print word
movq $BUZZ_MASK, %r15 # BUZZ_MASK, 0 - print Fizz, 1 - print Buzz

xorl %r11d, %r11d
incb %r11b # mask pointer

...
MainLoop:
...

incb %r14b # increase secondary counter
shlq %r11 # shift bitmask pointer

movl %r13d, %esi
andl %r11d, %esi # check mask
jz PrintMainCounter

cmpb $15, %r14b # check if we're at the FizzBuzz point
je PrintFizzBuzz

andl %r15d, %esi
jnz PrintBuzz

...
jmp MainLoop

I use 5 registers: two bitmask constants, one counter, one bitmask
pointer, and one temporary register. I use 9 commands in loop (inc,
shl, mov, and, jz, cmp, je, and, jnz). Basically all decision making is
two ANDs and one CMP.

> --
> Rick C. Hodgin

Rick C. Hodgin

unread,

Jul 18, 2018, 1:09:40 PM7/18/18

to

On 7/18/2018 8:48 AM, demon.a...@gmail.com wrote:
> I think that my original code would be faster. I understand it might be harder to read since I use AT&T syntax. I'll try to emphasize actual logic:

> [snip]

>
> I use 5 registers: two bitmask constants, one counter, one bitmask
> pointer, and one temporary register. I use 9 commands in loop (inc,
> shl, mov, and, jz, cmp, je, and, jnz). Basically all decision making is
> two ANDs and one CMP.

I'll post an updated version of this later. I've thought of some
additional optimizations to my algorithm.

It's been about 15 years since I did assembly programming on any
daily basis. I'm very rusty. But, it's coming back to me as I
go about my day. :-)

I enjoy these kinds of mental exercises. I was hoping Terje would
show up and school me on the many points I missed.

One is I don't need the full constant, just the initial bits to
cycle through. Another is I don't need the reference to funcs
consuming esi. I can probably also free up edx and edi and use
direct constants. Then I can keep everything in registers. The
code to print should be optimized, but I haven't thought of that
yet. Since fizz and buzz are each 4-bytes, it would work as a
direct constant move into a memory buffer. Not sure on the code
to generate an arbitrary ASCII representation of an integer value.
Will have to think about that.

--
Rick C. Hodgin

Rick C. Hodgin

unread,

Jul 18, 2018, 1:54:44 PM7/18/18

to

On 7/18/2018 1:05 PM, Rick C. Hodgin wrote:
> I'll post an updated version of this later. I've thought of some
> additional optimizations to my algorithm.

Here's the latest version:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

void fizz(int num) { printf("fizz\n"); }
void buzz(int num) { printf("buzz\n"); }
void fizzbuzz(int num) { printf("fizzbuzz\n"); }
void other(int num) { printf("%d\n", num); }

void (*funcs[4])(int num) = { other, fizz, buzz, fizzbuzz };

int main(int argc, char* argv[])
{

_asm
{
xor esi,esi ; Count (1..100)
mov edx,1 << 2 ; 3-bit cycler
mov edi,1 << 5 ; 5-bit cycler

top_loop:
inc esi ; Increase our count
mov eax,edx ; Load 3-bit value
mov ebx,edi ; Load 5-bit value
and eax,1 ; Lower-bit of 3-bit value
and ebx,2 ; 2nd bit of 5-bit value
add eax,ebx ; Add and get function offset
lea ebx,[offset funcs + eax*4]

; Print algorithm can be modified to something simpler
pushad

push esi

call [ebx]
add esp,4
popad

; Rotate lower 3-bits
mov eax,edx
and eax,1
shr edx,1
shl eax,2
or edx,eax

; Rotate bits 6..1 (ignores lower bit after shifting)
shr edi,1
mov ebx,edi
and ebx,1
shl ebx,5
or edi,ebx

; Are we done?
cmp esi,100
jb top_loop
}
return(0);
}

I still see some possible optimizations. Will work on it when I
get time. :-)

Oh Terje ... please school us on how to better optimize.

--
Rick C. Hodgin

Rod Pemberton

unread,

Jul 18, 2018, 4:24:53 PM7/18/18

to

On Wed, 18 Jul 2018 06:00:13 -0700 (PDT)
fizz buzz <demon.a...@nospicedham.gmail.com> wrote:

> On Wednesday, July 18, 2018 at 12:41:53 AM UTC-5, Rod Pemberton wrote:

> > Second, since Terje and Rick posted some C, I'll do so as well. As
> > you can see, no division or modulo is required for the base
> > problem. You can use down counters. The code also falls through
> > instead of if-else.
> >
> > Since it's not unrolled like Terje's example, nor optimized assembly
> > with shifts like yours, I'm curious as to how slow it is. Could you
> > time it?
>
> On my laptop:
>
> bash$ time ./fizzbuzz > /dev/null
>
> real 0m0.168s
> user 0m0.154s
> sys 0m0.002s
>

Well, I was interested in timings for the other C versions ... So, I'll
post timings for assembly (your 2 posts) and C code (mine, Terje, Rick)
for 64-bit Linux on my older processor. I also post the modified and
completed versions of Terje's, Rick's, and my C code that I used for
testing.

(3.2Ghz AMD Phenom II X2 555 dual-core, 64-bit Linux)

Please note that results from "time" can vary +/- 0m0.002/0m0.003 or
more on my machine. It's not real precise. I've not averaged the
results of multiple runs, but simply picked the most representative or
frequent result.

For 100 to 1M, for my C version, I get:
(gcc -Wall -ansi -pedantic -O2)

:# time ./fizzbuzz > /dev/null

real 0m0.083s
user 0m0.081s
sys 0m0.000s

For 100 to 1M, for Terje's C version (reworked, completed), I get:
(gcc -Wall -ansi -pedantic -O2)

:# time ./fizzbuzz > /dev/null

real 0m0.085s
user 0m0.083s
sys 0m0.001s

For 0 (not 100) to 1M for Rick's C version (extended cycle), I get:
(gcc -Wall -ansi -pedantic -O2)

:# time ./rick > /dev/null

real 0m0.077s
user 0m0.074s
sys 0m0.002s

Unfortunately, there is no easy way to adjust the bitmaps in Rick's code
for a different initial value, i.e., 100 instead of 1. Shifting the
bitmaps or creating them at runtime adds time too. So, we'll just
subtract Rick's time for 0 to 100 from his 0 to 1M time:

real 0m0.001s
user 0m0.000s
sys 0m0.000s

For 100 to 1M, net time for Rick's C version (extended cycle):

real 0m0.076s
user 0m0.074s
sys 0m0.002s

Rick wins the C version for 100 to 1M, even after having the cycle
extended, i.e., the code is doing much more work now. From the
assembly, it seems that this code generates well above average
instruction pairing and near full 64-bit register usage. So, the
substantially longer, i.e., huge, instruction sequence is apparently
nullified. I've never before seen C compilers like GCC generate code
like this, but I'm used to code from the 32-bit GCC 3.x series, not
64-bit 4.x series.

Now for your two assembly versions.

For 100 to 1M, for your assembly version asm-bitmask-mul, I get:

root:# time ./asm-bitmask-mul > /dev/null
(gcc -nostdlib)

real 0m0.014s
user 0m0.010s
sys 0m0.004s

That's slightly slower than your server:

> bash$ time ./asm-bitmask-mul > /dev/null
>
> real 0m0.010s
> user 0m0.006s
> sys 0m0.003s

For 100 to 1M, for your assembly version asm-unfold-mul, I get:
(gcc -nostdlib)

:# time ./asm-unfold-mul > /dev/null

real 0m0.007s
user 0m0.003s
sys 0m0.003s

That's slightly faster than you laptop and very close to your server:

> Current speed on server:
>
> bash$ time ./asm-unfold-mul > /dev/null
>
> real 0m0.007s
> user 0m0.001s
> sys 0m0.006s
>
> On laptop:
>
> bash$ time ./asm-unfold-mul > /dev/null
>
> real 0m0.008s
> user 0m0.004s
> sys 0m0.003s

My 3.2Ghz AMD Phenom II X2 555 dual-core processor is from 2009 which is
well over a decade old. Your C results are twice as slow on your
laptop versus my machine, but the assembly is slightly faster for both
your server and laptop. How can that be? ... I don't know why there
would be such a huge discrepancy for the C code, while your assembly is
slightly faster.

Did you use -O2 for the C code? (gcc -Wall -ansi -pedantic -O2). Even
if I use -O0 on my C code, it only slows down very slightly:

:# time ./fizzbuzz > /dev/null
(gcc -Wall -ansi -pedantic -O0)

real 0m0.087s
user 0m0.083s
sys 0m0.002s

> On my laptop:
>
> bash$ time ./fizzbuzz > /dev/null
>
> real 0m0.168s
> user 0m0.154s
> sys 0m0.002s
>

Clearly, it doesn't slow down to the speed you're getting for C on your
laptop, as I would expect, given that the assembly results are so close
together.

What C compiler are you using? (GCC v. 4.7.2 here, 64-bit Linux)

What processors are being used on your laptop and server? I.e., are
they a decade old like mine is or are they new and underpowered?

Here's the reduced and completed C version of Terje's:

#include <stdio.h>
#include <string.h>

int main(void)
{
int i;

for(i=100;i<1000001;i++) {

if((i%15)==0) {
printf("FizzBuzz");
}
else if((i%5)==0) {
printf("Buzz");
}
else if((i%3)==0) {
printf("Fizz");
}
else {

printf("%d",i);
}
printf("\n");
}

return(0);
}

Here's the adjusted C version of mine:

#include <stdio.h>

int main(void)
{
int i,i3,i5;

i=100;
i3=((3-(i%3))%3)+1;
i5=((5-(i%5))%5)+1;
while(1)
{

i3--;
i5--;
if(!i3)
{
printf("Fizz");
i3=3;
}
if(!i5)
{
printf("Buzz");
i5=5;
}

if((i3<3)&(i5<5))

printf("%d",i);
printf("\n");

i++;
if(i==1000001)
break;
}

return(0);
}

Here's the corrected and bitmask extended C version of Rick's:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

void fizz(uint32_t num) {printf("Fizz\n");}
void buzz(uint32_t num) {printf("Buzz\n");}
void fizzbuzz(uint32_t num) {printf("FizzBuzz\n");}
void other(uint32_t num) {printf("%d\n",num);}

struct STargets {void (*func)(uint32_t num);};
struct STargets funcs[4]={{other},{fizz},{buzz},{fizzbuzz}};

uint64_t three_data[3]={0x4924924924924924,0x2492492492492492,
0x9249249249249249};
uint64_t five_data[5]={0x0842108421084210,0x1084210842108421,
0x2108421084210842,0x4210842108421084,
0x8421084210842108};

int main(int argc, char* argv[])
{

uint32_t lnI;
uint64_t saved_data;

/* WARNING bitmasks start at value one */
/* bitmasks must be shifted for lnI != 1 */

for(lnI=1;lnI<=1000000;++lnI)
{
funcs[(three_data[0]&0x1)+(2*(five_data[0]&0x1))].func(lnI);

saved_data=three_data[0];
three_data[0]=(three_data[0]>>1)|((three_data[1]&1)<<63);
three_data[1]=(three_data[1]>>1)|((three_data[2]&1)<<63);
three_data[2]=(three_data[2]>>1)|((saved_data&1)<<63);

saved_data=five_data[0];
five_data[0]=(five_data[0]>>1)|((five_data[1]&1)<<63);
five_data[1]=(five_data[1]>>1)|((five_data[2]&1)<<63);
five_data[2]=(five_data[2]>>1)|((five_data[3]&1)<<63);
five_data[3]=(five_data[3]>>1)|((five_data[4]&1)<<63);
five_data[4]=(five_data[4]>>1)|((saved_data&1)<<63);
}

return(0);
}

HTH,

Rick C. Hodgin

unread,

Jul 18, 2018, 6:55:02 PM7/18/18

to

There's a simpler version that's been posted now on comp.lang.c.
It is written by Ben Bacarisse, altered slightly here by me. I
have not tested it, but you can see it here and see if it's any
faster using less values, and less complexity:

#include <stdio.h>

void fizz(void) { printf("fizz\n"); }
void buzz(void) { printf("buzz\n"); }
void fizzbuzz(void) { printf("fizzbuzz\n"); }
void other(void) { printf("%d\n", num); }

void (*funcs[4])(int) = { other, fizz, buzz, fizzbuzz };

int num;

int main(void)
{
int threes = 1 << 2, fives = 1 << 4;

for (num = 1; num <= 100; ++num) {
funcs[(threes & 1) + (2 * (fives & 1))]();

threes = (threes >> 1) | ((threes & 1) << 2);
fives = (fives >> 1) | ((fives & 1) << 4);
}
}

--
Rick C. Hodgin

Terje Mathisen

unread,

Jul 19, 2018, 4:55:45 AM7/19/18

to

Just do a few typical snippets, showing various stages.

Kerr-Mudd,John

unread,

Jul 19, 2018, 4:55:48 AM7/19/18

to

On Wed, 18 Jul 2018 13:00:13 GMT, fizz buzz
<demon.a...@nospicedham.gmail.com> wrote:

> On Wednesday, July 18, 2018 at 12:41:53 AM UTC-5, Rod Pemberton wrote:
>
>> First, see CLAX FizzBuzz thread in 2016 started by Mike Gonta:
>>
>> https://groups.google.com/d/msg/comp.lang.asm.x86/Kx2fY75bwlo/xdHOI-nP
>> AwAJ Usenet msg-id neg376$k9u$1...@dont-email.me
>
> Thank you, very interesting.

I was focussed on mimimal size; and the spec was up to 100, which allowed
a 16bit count in display decimal in 1 x86 register;

I think this was my best for that spec -

org 0x100 ; FizzBuzz MJ R [ 63 ]
cpu 8086 ; use TotalCnt to end. p/p ax
start: ; ax=0 bx=0 cx=00xx dx=cs=ds=es=xxxx si=0100 di=sp=FFFE bp=09xx nz
; prog assumes ch=0, mem available at 0x3030, nz at start
mov dx,Prtarea ; harded code dword [ 3 ]
eachnum: ; -setup main- [ 5 ]
mov di,dx ; reset to print next
mov si,FBtbl-textlth ; fixup for fallthru 1st time
nextFB: ; -main- [ 16 ]
mov cl,textlth ; reset for each Textstr
jnz notthisFB ; then skip Textstr to next
mov byte [si-2],ah ; have a FB, reset count
repnz movsb ; cpy Textstr
notthisFB:
add si,cx ; add 0 if moved, or textlth to skip.
inc byte [si] ; count up: either FBcnt or TotalCnt
lodsw ; FB:al=num, ah=reset, si->Textstr
jng nextFB ; TC:al=TotalCnt,ah=09,si->crlf$
push ax
cmp dx,di ; if any FB set, skip num [ 4 ]
jne goprt
aam ; -cpynum- ah overwritten [ 7 ]
xchg ah,al ; other endian; div costs more
or ax,dx ; 2 digit ascii num; dx=0x3030
stosw ; with leading Z
goprt:
movsw ; -cpycrlf$- [ 2 ]
movsb ; add eos
; ### stack not happy around int21
;; int 0x21 ; flags preserved!
pop ax
cmp al,100 ; - end/ prt- max 100 [ 7 ]
int 0x21 ; flags preserved, but not al
jl eachnum ; nz from cmp
exit: ret
textlth equ 4
FBtbl db -3,-3,"Fizz" ; (tbl entries=2)*6 [ 12 ]
db -5,-5,"Buzz"
; db -7,-7,"Bang"
; db -11,-11,"Zapp"
TotalCnt db 0 ; endFB [ 2 ]
Dosprtstr db 0x09 ; loaded
Crlf db 0x0D,0x0A,'$' ; must be next after TotalCnt [ 3 ]
proglth equ $-start
Prtarea equ 0x3030 ; hard coded dw total [*63*]
[]

--
Bah, and indeed, Humbug.

Terje Mathisen

unread,

Jul 19, 2018, 4:55:50 AM7/19/18

to

Rod Pemberton wrote:
> For 100 to 1M, for Terje's C version (reworked, completed), I get:
> (gcc -Wall -ansi -pedantic -O2)
>
> :# time ./fizzbuzz > /dev/null
>
> real 0m0.085s
> user 0m0.083s
> sys 0m0.001s

Please don't time that code!

This is only there to generate an unrolled-by-30 version with no tests
or branches.

Alternatively I could create an unrolled C version, this would be of
similar speed (within 20%) of the unrolled asm.

Bernhard Schornak

unread,

Jul 19, 2018, 7:11:00 AM7/19/18

to

fizz buzz wrote:

<very much...>

Theoretical thought:

The inserted FIZZes, BUZZes and FIZZBUZZes always follow the
same pattern

n1
n2
FIZZ
n4
BUZZ
FIZZ
n7
n8
FIZZ
BUZZ
n11
FIZZ
n13
n14
FIZZBUZZ

Hence, we can save all comparisons regarding replacement and
prefill the output buffer with the above pattern as often as
required to reach the desired number. If we define a *fixed*
output size (e.g.: 8 byte for 6 digit numbers plus line feed
plus trailing zero), the above patterns could be loaded into
eight XMM registers and moved to the output buffer n/8 times
to "bypass" the comparisons completely.

If we use another fixed pattern for the numbers to print, we
can skip the hex to dec translation, as well, and replace it
by brute force incrementing zeroes until we reach the nines.
Then we just had to increment the next higher digits and re-
set the lowest one by zero to repeat until the entire number
is processed.

Just my five cents...

Bernhard Schornak

fizz buzz

unread,

Jul 19, 2018, 7:41:03 AM7/19/18

to

This is a good thought. However, a *fixed* output size might be considered
cheating, since all other FizzBuzz programs have to deal with non-fixed
strings.

If it were possible to do fixed-size strings, I probably would try to write
threaded version. One thread for Fizzes, one for Buzzes, one for FizzBuzzes,
and one for numbers. However, that would only work in case of fixed-size
strings, because it would be possible to calculate string offsets without caring
about lengths of other strings.

Here is another cheating thought. Original rules say:

> Write a program that prints the numbers from 1 to 100...

But it does not say that they have to be printed in sequence! "From 1 to
100" is just a range, so theoretically output can be in any sequence or even
out of sequence. This makes threaded version possible. Again, one thread for
Fizzes, one for Buzzes, one for FizzBuzzes, and one for numbers. The only
problem Is that output will not be synchronized... but should it? :-)

fizz buzz

unread,

Jul 19, 2018, 8:26:07 AM7/19/18

to

On Wednesday, July 18, 2018 at 3:24:53 PM UTC-5, Rod Pemberton wrote:

> My 3.2Ghz AMD Phenom II X2 555 dual-core processor is from 2009 which is
> well over a decade old. Your C results are twice as slow on your
> laptop versus my machine, but the assembly is slightly faster for both
> your server and laptop. How can that be? ... I don't know why there
> would be such a huge discrepancy for the C code, while your assembly is
> slightly faster.

I would consider time command output as relative, not absolute numbers. There are
million factors that can affect it (for example, background processes can slow
system down). I know that it will show me different numbers depending on the
laptop's battery state (if laptop plugged in or not).

> Did you use -O2 for the C code? (gcc -Wall -ansi -pedantic -O2).

Yes.

> What C compiler are you using? (GCC v. 4.7.2 here, 64-bit Linux)

Apple LLVM version 9.0.0 (clang-900.0.38)
Target: x86_64-apple-darwin17.0.0

> What processors are being used on your laptop and server? I.e., are
> they a decade old like mine is or are they new and underpowered?

My laptop is 3.1 GHz Intel Core i7, my server is just a VPS, so I'm not sure how accurate
it would be to measure its speed. Theoretically it's "Intel Broadwell microarchitecture,
speed 2200 MHz (estimated)".

If we really want measure programs performance we should come up with some
better software, like OProfile (<http://oprofile.sourceforge.net/faq/>), but I didn't find
any profiler which would run on both Linux and OS X.

Rosario19

unread,

Jul 19, 2018, 8:41:09 AM7/19/18

to

On Sun, 15 Jul 2018 21:41:09 -0700 (PDT), fizz buzz wrote:

>Hello.

>
>Me and my friend are trying to create fastest Fizz Buzz program.

as someone said for me the fastes solution it is:

write manually the output that it is
and than put it in a data label
and than print it
the only instruction is print one array of chars

fizz buzz

unread,

Jul 19, 2018, 8:56:12 AM7/19/18

to

According to this logic, fastest prime number calculating algorithm would
just print text file with already calculated prime numbers.

No cheating, guys! :-)

Bernhard Schornak

unread,

Jul 19, 2018, 11:56:27 AM7/19/18

to

fizz buzz wrote:

>> Write a program that prints the numbers from 1 to 100...

Sorry, but I prefer practical stuff useful for daily tasks,
where the fastest way is the best solution. Actually, there
is no rule not to use my method in your exercise, so you're
free to choose any way leading to the proper solution. Just
not to use the fastest way because no on else had that idea
is a little bit stupid, isn't it?

Greetings from Augsburg

Bernhard Schornak

Rick C. Hodgin

unread,

Jul 19, 2018, 12:26:31 PM7/19/18

to

On 7/18/2018 4:20 PM, Rod Pemberton wrote:
> ... From the

> assembly, it seems that this code generates well above average
> instruction pairing and near full 64-bit register usage. So, the
> substantially longer, i.e., huge, instruction sequence is apparently
> nullified. I've never before seen C compilers like GCC generate code
> like this, but I'm used to code from the 32-bit GCC 3.x series, not
> 64-bit 4.x series.

It was hand-coded by me. I translated the C code idea into assembly.

The newer versions use registers explicitly and I think they would be
notably faster. I'll work on a fast print algorithm once I get the
chance. I think a table referencing what to print would be desirable,
and another table indicating the lengths to copy. The ecx register
is still available below, as would be ebp if we didn't have a stack
frame.

There's still room for improvement in design, let alone optimization:

-----

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

void fizz(int num) { printf("fizz\n"); }
void buzz(int num) { printf("buzz\n"); }
void fizzbuzz(int num) { printf("fizzbuzz\n"); }
void other(int num) { printf("%d\n", num); }

void (*funcs[4])(int num) = { other, fizz, buzz, fizzbuzz };

int main(int argc, char* argv[])
{

--
Rick C. Hodgin

Alex McDonald

unread,

Jul 19, 2018, 12:55:32 PM7/19/18

to

It's called memoization, and it's a great technique that trades time for
space. https://en.wikipedia.org/wiki/Memoization.

Lookup tables for expensive-to-calculate functions have been used by
human computers since Jesus was a boy. Perhaps the ultimate memoization
tools prior to our obsession for calculating this stuff from raw with
algorithms, was the slide rule, or the astrolabe -- or just even the
fairly thick books of trig tables with sin, cos and tan to 5 decimal
places that I remember as a boy using in my maths classes.

--
Alex

Terje Mathisen

unread,

Jul 19, 2018, 4:10:48 PM7/19/18

to

Alex McDonald wrote:
> It's called memoization, and it's a great technique that trades time for
> space. https://en.wikipedia.org/wiki/Memoization.
>
> Lookup tables for expensive-to-calculate functions have been used by
> human computers since Jesus was a boy. Perhaps the ultimate memoization
> tools prior to our obsession for calculating this stuff from raw with
> algorithms, was the slide rule, or the astrolabe -- or just even the
> fairly thick books of trig tables with sin, cos and tan to 5 decimal
> places that I remember as a boy using in my maths classes.
>

I put such a 5-digit bok on my christmas wish list one year, and
actually got it. :-)

James Van Buskirk

unread,

Jul 19, 2018, 6:10:56 PM7/19/18

to

"fizz buzz" wrote in message
news:f32b85f8-e105-4869...@googlegroups.com...

> According to this logic, fastest prime number calculating algorithm would
> just print text file with already calculated prime numbers.

Whoa, you can calculate and format the primes via the Sieve
of Eratosthenes way faster than you can read the data off disk.

Terje Mathisen

unread,

Jul 20, 2018, 4:56:29 AM7/20/18

to

Exactly right, and we have the same case here:

As long as you get above 10-100K numbers it will be much faster to
calculate the output on the fly instead of loading it from a disk file.

Alex McDonald

unread,

Jul 20, 2018, 9:11:42 AM7/20/18

to

On 20-Jul-18 09:54, Terje Mathisen wrote:
> James Van Buskirk wrote:
>> "fizz buzz" wrote in message
>> news:f32b85f8-e105-4869...@googlegroups.com...
>>
>>> According to this logic, fastest prime number calculating algorithm
>>> would just print text file with already calculated prime numbers.
>>
>> Whoa, you can calculate and format the primes via the Sieve of
>> Eratosthenes way faster than you can read the data off disk.
>>
> Exactly right, and we have the same case here:
>
> As long as you get above 10-100K numbers it will be much faster to
> calculate the output on the fly instead of loading it from a disk file.
>
> Terje
>

It's my understanding that neither sys or user process time under Linux
includes the load of the executable, since there's no process in place
to time. So if I start the timing of my program from when it starts
execution and exclude the load time, then a program with memoized values
as part of the executable will always appear to be faster.

--
Alex

Terje Mathisen

unread,

Jul 20, 2018, 11:41:52 AM7/20/18

to

That seems obvious, but isn't actually true:

It can be significantly faster to store data from SIMD registers than to
copy it from the memory image, i.e. from one address to another.

It is only if you replace the entire program with a single (unbuffered,
so needs 4kB aligned data array) write() call that an immediate array
will be faster.

James Van Buskirk

unread,

Jul 20, 2018, 12:26:55 PM7/20/18

to

"Terje Mathisen" wrote in message news:pisv6a$1hkd$1...@gioia.aioe.org...

> It can be significantly faster to store data from SIMD registers than to
> copy it from the memory image, i.e. from one address to another.

> It is only if you replace the entire program with a single (unbuffered,
> so needs 4kB aligned data array) write() call that an immediate array
> will be faster.

But shouldn't the task be converting a stream of random inputs
into Fizz Buzz rather that a trivial arithmetic progression? Sounds
like more opportunity for fun assembly programming to me.

Terje Mathisen

unread,

Jul 20, 2018, 1:42:01 PM7/20/18

to

That's actually an interesting challenge!

Write the fastest possible function to print fixx/buzz/number given a
random 32 (or 64?) bit input.

I.e. this requires calculating %3 & %5 as fast as possible, while at the
same time doing bin to ascii conversion.

fizz buzz

unread,

Jul 21, 2018, 3:12:44 AM7/21/18

to

Not sure about that. I suspect time command does count loading time.

As you can see in the code, I define output buffer as ".lcomm OutputBuffer,
ITERATIONS*9", so the memory is requested by loader at the loading time.
However, if I'll try to do "OutputBuffer: .fill ITERATIONS*9, 1, 0x00" so the
OutputBuffer will be part of executable, then output of time command will
show much bigger numbers. I guess it's because it takes much longer to
load such big executable. I don't have other explanation.

> --
> Alex

fizz buzz

unread,

Jul 21, 2018, 3:27:47 AM7/21/18

to

On Thursday, July 19, 2018 at 10:56:27 AM UTC-5, Bernhard Schornak wrote:
> fizz buzz wrote:
>
>
> >> Write a program that prints the numbers from 1 to 100...
>
>
> Sorry, but I prefer practical stuff useful for daily tasks,
> where the fastest way is the best solution.

What we're doing right now is similar to sport. Everyone knows that
fastest way to run 100 meters is to use a car, but for some reason
athletes still trying to beat the record using their legs. :-)

> Actually, there
> is no rule not to use my method in your exercise, so you're
> free to choose any way leading to the proper solution. Just
> not to use the fastest way because no on else had that idea
> is a little bit stupid, isn't it?

I am sorry, Bernhard, I didn't try to look mean or anything like that.
It all depends on how we define rules and what we consider cheating.

Me and my friend have a collection of FizzBuzz programs, and all those
programs have same output:

"1\n2\nFizz\n4Buzz\nFizz\n...999997\n999998\nFizz\nBuzz\n"

File with this output will be 6274073 bytes long. We compare the output
of our version with this file to ensure our programs giving us the right results.

If we starting to stretch the rules, then how far can we go?

For example, rules technically say nothing about new line characters. I
can say that I don't really have to divide output with new lines or blanks.
It will make my life much easier because then I can print "Fizz" (4 bytes) and
"FizzBuzz" (8 bytes) while now I'm forced to print "Fizz\n" (5 bytes) and
"FizzBuzz\n" (9 bytes).

Kerr-Mudd,John

unread,

Jul 21, 2018, 4:57:56 AM7/21/18

to

On Sat, 21 Jul 2018 07:24:52 GMT, fizz buzz
<demon.a...@nospicedham.gmail.com> wrote:

> On Thursday, July 19, 2018 at 10:56:27 AM UTC-5, Bernhard Schornak
> wrote:
>> fizz buzz wrote:
>>
>>
>> >> Write a program that prints the numbers from 1 to 100...
>>

[]

> What we're doing right now is similar to sport. Everyone knows that
> fastest way to run 100 meters is to use a car, but for some reason
> athletes still trying to beat the record using their legs. :-)
>

[]

> It all depends on how we define rules and what we consider cheating.
>

[]

> If we starting to stretch the rules, then how far can we go?

[]

I think the point of FizzBuzz (that seems to be getting lost here) is the
followup question; how easy is it to amend your program for larger
values, or alternative numbers (I used "Bang" for 7) or maybe adjust the
word lengths ("Splodge!").

Bernhard Schornak

unread,

Jul 21, 2018, 8:43:13 AM7/21/18

to

fizz buzz wrote:

> On Thursday, July 19, 2018 at 10:56:27 AM UTC-5, Bernhard Schornak wrote:
>> fizz buzz wrote:
>>
>>>> Write a program that prints the numbers from 1 to 100...
>>
>> Sorry, but I prefer practical stuff useful for daily tasks,
>> where the fastest way is the best solution.
>
> What we're doing right now is similar to sport. Everyone knows that
> fastest way to run 100 meters is to use a car, but for some reason
> athletes still trying to beat the record using their legs. :-)

Yes. To use a car requires to have a car, right? ;)

>> Actually, there
>> is no rule not to use my method in your exercise, so you're
>> free to choose any way leading to the proper solution. Just
>> not to use the fastest way because no on else had that idea
>> is a little bit stupid, isn't it?
>
> I am sorry, Bernhard, I didn't try to look mean or anything like that.

I never felt offended, and I hope you got it that my replies
- except my solution, of course - always were written with a
twinkling eye... ;)

> It all depends on how we define rules and what we consider cheating.

Agreed.

> Me and my friend have a collection of FizzBuzz programs, and all those
> programs have same output:

Oh! Now that you realised there is a much faster way, you're
coming up with hidden rules no one knew before?

> "1\n2\nFizz\n4Buzz\nFizz\n...999997\n999998\nFizz\nBuzz\n"

Yes, that's the obvious way to solve a task where 90 percent
of the task's rules are hidden from the testees eyes... ;)

> File with this output will be 6274073 bytes long. We compare the output
> of our version with this file to ensure our programs giving us the right results.

That's the most stupid thing I ever heard. What happens if a
testee uses my libraries and writes properly ordered numbers
to the output file:

1
2
...
9 999 999 997
9 999 999 998
9 999 999 999

Does that testee fail the test even if it's solved properly?
Isn't the "optical presentation" part of *any* challenge, as
well?

My conversion functions allow to "format" numbers in several
ways (including pseudo FP), but they always deliver properly
sized numbers like above. It hurts my eyes if numbers aren't
properly padded with spaces to keep powers of ten in one and
the same column, regardless of how many rows there are...

> If we starting to stretch the rules, then how far can we go?

As far as you want. But to be fair, you had to tell all your
testees the entire ruleset before the test starts, not after
the test, when your exercise was solved, but you do not like
that some of you students found a better way to solve it.

> For example, rules technically say nothing about new line characters. I
> can say that I don't really have to divide output with new lines or blanks.
> It will make my life much easier because then I can print "Fizz" (4 bytes) and
> "FizzBuzz" (8 bytes) while now I'm forced to print "Fizz\n" (5 bytes) and
> "FizzBuzz\n" (9 bytes).

Which limits the solution to Linux systems - locking out the
vast majority of Windows and MacOS users.

As I told, the entire FIZZely-BUZZely thing is an artificial
contest. To stay with your parable, it is a contest "Who can
walk on the tips of his fingers the fastest?" I'm not really
interested in such contests, but as a human being, I'm still
able to develop strategies how to be faster, even if I never
will partake in such contests... ;)

Enjoy the weekend!

Bernhard Schornak

fizz buzz

unread,

Jul 21, 2018, 12:28:33 PM7/21/18

to

No, no. I'm not trying to come up with "new hidden rules". I'm just pointing out
that rules are not completely defined here. What we did, we just took a bunch
of already written FizzBuzz programs, changed number of iterations from 100 to
1000000 (1 million), and tried to come up with fastest version. I already gave a
link where we got most of other versions:

<http://wiki.c2.com/?FizzBuzzTest>

All these programs use newline characters to separate output. The only reason
we used million iterations is because fast C++ and Assembler programs are working too
fast to get any kind of meaningful results doing just 100 iterations (they all will show
speed of 0.000 s, because time command is just not precise enough.)

Tell you the truth, I'm not sure myself how far we can stretch the rules in our
favor here. Perhaps we better vote or something like that?

> > For example, rules technically say nothing about new line characters. I
> > can say that I don't really have to divide output with new lines or blanks.
> > It will make my life much easier because then I can print "Fizz" (4 bytes) and
> > "FizzBuzz" (8 bytes) while now I'm forced to print "Fizz\n" (5 bytes) and
> > "FizzBuzz\n" (9 bytes).
>
>
> Which limits the solution to Linux systems - locking out the
> vast majority of Windows and MacOS users.

Actually, I'm writing my programs on MacBook. :-) It's the same as Linux
in this regard.

I definitely can change it to print "CRLF", it won't really affect anything, because
I always "print" 8 bytes at a time anyway, except of special "FizzBuzz" case, where I
have to do two memory writes, one for the word, one for newline character. The only
seven digit number (1000000) that program have to deal with is not printed anyway,
because it's divisible by 5, so program will end before it have to print 1000001.

> As I told, the entire FIZZely-BUZZely thing is an artificial
> contest. To stay with your parable, it is a contest "Who can
> walk on the tips of his fingers the fastest?" I'm not really
> interested in such contests, but as a human being, I'm still
> able to develop strategies how to be faster, even if I never
> will partake in such contests... ;)

It was not really a contest. I was wondering if I'm missing some important
optimization technique. My current program is just 3x slower than cheat
version (the program that just output huge text without doing any calculations.)

I actually do have idea how to (possibly) make it even faster, and, maybe,
have no newline issues anymore. Not sure if it will actually work. Unfortunately,
it will require Linux-specific system call (clone) which I don't have on my MacBook.

If I understand correctly, threads created with clone can access same memory addresses.

I am thinking to use separate thread to fill output buffer with "\n" characters (perhaps
using MOVQ filling 8 bytes at a time), and then just print words and numbers in main
thread without caring about newlines. For example, I copy word "Fizz" to the buffer,
which is 4 bytes, overwriting four newline characters, but add 5 to buffer pointer,
so next word or number won't overwrite newline character that's located in the buffer
right after the "Fizz".

Not sure if I'm explaining it clear enough.

There should be no way that additional thread will take
longer to execute, so words and numbers should never be overwritten by newline
characters. It will require more computing power, but it should be faster. Theoretically. :-)

Or am I wrong?

> Enjoy the weekend!

Thanks, you too.

> Bernhard Schornak

demon.a...@gmail.com

unread,

Jul 21, 2018, 1:18:29 PM7/21/18

to

On Friday, July 20, 2018 at 12:42:01 PM UTC-5, Terje Mathisen wrote:
> James Van Buskirk wrote:
> > "Terje Mathisen" wrote in message news:pisv6a$1hkd$1...@gioia.aioe.org...
> >> It can be significantly faster to store data from SIMD registers than
> >> to copy it from the memory image, i.e. from one address to another.
> >
> >> It is only if you replace the entire program with a single
> >> (unbuffered, so needs 4kB aligned data array) write() call that an
> >> immediate array will be faster.
> >
> > But shouldn't the task be converting a stream of random inputs
> > into Fizz Buzz rather that a trivial arithmetic progression? Sounds
> > like more opportunity for fun assembly programming to me.
> >
> That's actually an interesting challenge!
>
> Write the fastest possible function to print fixx/buzz/number given a
> random 32 (or 64?) bit input.
>
> I.e. this requires calculating %3 & %5 as fast as possible, while at the
> same time doing bin to ascii conversion.
>
> Terje

Sounds interesting. How program will receive random input? What's the
range and how many numbers input will contain? And how far can we
stretch original rules? :-)

Bart

unread,

Jul 21, 2018, 1:57:15 PM7/21/18

to

Maybe you /should/ have tested it, because it doesn't compile!

But I ran Ben's original version under gcc-O3, and it was the same speed
as any other C compiler with or without optimisation.

In fact, it was slightly slower than a version I did using interpreted
byte-code. Which sort of shows the pointlessness of optimising a program
like this, even with ultra-slow console i/o eliminated.

(Tested with N increased to 1 million, and redirecting output to a file
for the C version, writing (in one go) to a file for the interpreted
version [see after sig]. Both measured by calling clock() at entry and
exit from program, and both run under Windows.)

Saving modulo operations also seems pointless when you still use a "%d"
format which usually involves divide and remainder ops to turn a binary
number into text.

It might be a little more interesting if the task is reduced to one of
of filling a text buffer full of numbers, fizz and buzz (some 6.3MB to
be filled when N is 1 million and using '\n' separators, or a possible
7.3MB on Windows if using cr,lf).

With i/o eliminated, changing the algorithms might then make a bigger
difference, and you would expect optimised C to do better than
interpreted byte-code.

--
bart

proc fb2=
s:=""
n:=1 million
for i to n do
a:=i rem 3
b:=i rem 5
if a+b=0 then
s+:="FizzBuzz\n"
elsif a=0 then
s+:="Fizz\n"
elsif b=0 then
s+:="Buzz\n"
else
s+:=tostr(i)
s+:=10
fi
od

writestrfile("output",s)
end

Rick C. Hodgin

unread,

Jul 21, 2018, 6:27:41 PM7/21/18

to

On Saturday, July 21, 2018 at 1:57:15 PM UTC-4, Bart wrote:
> Maybe you /should/ have tested it, because it doesn't compile!

Exclamation point and all..

Bart, are you a developer? The error in the compilation is trivial
and easy to solve. I missed something in quick editing and I saw
it almost as I hit the post button, but I didn't think it was worth
reposting because I presume other people could fix it.

--
Rick C. Hodgin

James Van Buskirk

unread,

Jul 21, 2018, 6:57:45 PM7/21/18

to

wrote in message
news:02ad19d1-2b51-4c2d...@googlegroups.com...

> On Friday, July 20, 2018 at 12:42:01 PM UTC-5, Terje Mathisen wrote:

> > James Van Buskirk wrote:

> > > But shouldn't the task be converting a stream of random inputs
> > > into Fizz Buzz rather that a trivial arithmetic progression? Sounds
> > > like more opportunity for fun assembly programming to me.

> > That's actually an interesting challenge!

> > Write the fastest possible function to print fixx/buzz/number given a
> > random 32 (or 64?) bit input.

> > I.e. this requires calculating %3 & %5 as fast as possible, while at the
> > same time doing bin to ascii conversion.

> Sounds interesting. How program will receive random input? What's the
> range and how many numbers input will contain? And how far can we
> stretch original rules? :-)

Maybe some sort of linear congruential RNG with options to make it
more hateful if any test programs start correctly predicting some
branches. Maybe any unsigned 32-bit integer as input. If data were
to be precomputed so as not to mask conversion time by RNG time,
it would have to fit in some low level of cache, perhaps 1 MB or
262144 inputs.

I'm not sure just now whether the conversion would use mask
and multiply or PSHUFB and PSADBW to determine congruence
class mod 15. If one were to get really carried away distributing
tasks among the cores would be a nice addition to the problem.

Bart

unread,

Jul 21, 2018, 7:27:49 PM7/21/18

to

Is it that simple? The first error in your code was that 'num' was not
defined. So I moved that further up.

But there was another error: there were too few arguments to the
function call, one done via an array of functions pointers.

What should the arguments be? I don't know; I wanted to just blindly run
your code without needing to understand or tweak it. And if I did tweak
it, would I be running your code or mine?

Since you clearly didn't run the posted code, it didn't inspire much
confidence in wanting to fix it!

(I since have modified it to change the function pointer to take no
parameters - a guess - and ran it and it seemed to work. And somewhat
faster than Ben's version too, perhaps 15% faster. Although I would need
to run both within a short time of each other to properly compare.)

It is still not significantly faster than my interpreted code version
(218ms yours vs 234ms interpreted vs 249ms Ben's). So I don't think much
be done fiddling with the logic unless perhaps you get rid of printf and
directly write to a memory buffer. But it would need to be an expanding
one unless you can estimate the size from N.

--
bart

Terje Mathisen

unread,

Jul 22, 2018, 5:28:22 AM7/22/18

to

fizz buzz wrote:
> I am thinking to use separate thread to fill output buffer with "\n"
> characters (perhaps using MOVQ filling 8 bytes at a time), and then
> just print words and numbers in main thread without caring about
> newlines. For example, I copy word "Fizz" to the buffer, which is 4
> bytes, overwriting four newline characters, but add 5 to buffer
> pointer, so next word or number won't overwrite newline character
> that's located in the buffer right after the "Fizz".
>
> Not sure if I'm explaining it clear enough.
>
> There should be no way that additional thread will take longer to
> execute, so words and numbers should never be overwritten by newline
> characters. It will require more computing power, but it should be
> faster. Theoretically. :-)

If we allow multiple threads, then it becomes trivial to split the
output range into N slices, one for each thread. Since you can also
trivially calculate exactly how long (in characters) a given range will
be, you can even have those threads writing to a common/shared output
buffer.

It is however faster to keep writing to a single L1/L2 sized buffer and
then flush that to a file/NULL when nearly full.

I think for this test we should stay with

a) No precalculating the output, i.e. the range will be given on the
command line, this can be up to 1E9 so it will fit in a 32-bit register.

b) A single CPU thread.

c) A flag to flush/save the output.

Terje

Bernhard Schornak

unread,

Jul 22, 2018, 7:13:30 AM7/22/18

to

fizz buzz wrote:

> On Saturday, July 21, 2018 at 7:43:13 AM UTC-5, Bernhard Schornak wrote:
>> fizz buzz wrote:
>>
>>> On Thursday, July 19, 2018 at 10:56:27 AM UTC-5, Bernhard Schornak wrote:
>>>> fizz buzz wrote:
>>>>

>>>> <snip>

>>>>
>>> If we starting to stretch the rules, then how far can we go?
>>
>> As far as you want. But to be fair, you had to tell all your
>> testees the entire ruleset before the test starts, not after
>> the test, when your exercise was solved, but you do not like
>> that some of you students found a better way to solve it.
>
> No, no. I'm not trying to come up with "new hidden rules". I'm just pointing out
> that rules are not completely defined here.

Okay. If there are no rules, my method is perfectly valid.

> What we did, we just took a bunch
> of already written FizzBuzz programs, changed number of iterations from 100 to
> 1000000 (1 million), and tried to come up with fastest version. I already gave a
> link where we got most of other versions:
>
> <http://wiki.c2.com/?FizzBuzzTest>

Sorry, but I only speak German, English and assembler. ;)

> All these programs use newline characters to separate output. The only reason
> we used million iterations is because fast C++ and Assembler programs are working too
> fast to get any kind of meaningful results doing just 100 iterations (they all will show
> speed of 0.000 s, because time command is just not precise enough.)

How about timing a million runs of the program? Here we hit the
problem how to get proper timing results for (quite) small code
snippets. My solution is here:

https://drive.google.com/open?id=0B1OgMlxNnSNEMWIyNXdiT2FxT1k

The 'average' results come close to the true ones measured with
external equipment. It's crucial to stay below the time defined
as 'timeslice' for a thread. If the TaskManager interrupts your
thread (all processes are threads), you 'measure' the execution
time of not related threads, as well.

> Tell you the truth, I'm not sure myself how far we can stretch the rules in our
> favor here. Perhaps we better vote or something like that?

Might be a little bit too late to 'correct' errors of those who
initiated this challenge the very first time?

>>> For example, rules technically say nothing about new line characters. I
>>> can say that I don't really have to divide output with new lines or blanks.
>>> It will make my life much easier because then I can print "Fizz" (4 bytes) and
>>> "FizzBuzz" (8 bytes) while now I'm forced to print "Fizz\n" (5 bytes) and
>>> "FizzBuzz\n" (9 bytes).
>>
>> Which limits the solution to Linux systems - locking out the
>> vast majority of Windows and MacOS users.
>
> Actually, I'm writing my programs on MacBook. :-) It's the same as Linux
> in this regard.

The difference is that Linux uses a 'NewLine' character (0x0A),
while MacOS uses a 'CarriageReturn' character (0x0D). Only DOS,
Windoze and OS/2 use the proper sequence 'New Line / Line Feed'
(0x0D 0x0A), because a file can be sent to a dot matrix printer
rather than to screen. On MacOS systems, multi-lined text would
be printed in one line (=> no LineFeed), while on Linux systems
all characters would be printed at the last position of each of
the lines once the carriage reached the right-most position (=>
no CarriageReturn).

Both (Linux and MacOS) define conventions for lazy programmers,
but both require extra code to handle matrix printers and tele-
typewriters properly.

> I definitely can change it to print "CRLF", it won't really affect anything,

This does not address the problem I pointed out. With CRLF, the
filesize does not match the proposed size any longer - applying
your rules, files with different size are invalid per se. What,
as I told, is a quite stupid idea... ;)

> because
> I always "print" 8 bytes at a time anyway, except of special "FizzBuzz" case, where I
> have to do two memory writes, one for the word, one for newline character. The only
> seven digit number (1000000) that program have to deal with is not printed anyway,
> because it's divisible by 5, so program will end before it have to print 1000001.

What about languages requiring unicode characters, e.g. Eastern
Europe, Russia, Arabia, Asia? They produce output that will not
fit into a 64 bit register. In case of the FIZZBUZZ + CRLF, you
need a 256 bit YMM register to store FIZZBUZZ+CRLF/LF/CR...

>> As I told, the entire FIZZely-BUZZely thing is an artificial
>> contest. To stay with your parable, it is a contest "Who can
>> walk on the tips of his fingers the fastest?" I'm not really
>> interested in such contests, but as a human being, I'm still
>> able to develop strategies how to be faster, even if I never
>> will partake in such contests... ;)
>
> It was not really a contest. I was wondering if I'm missing some important
> optimization technique. My current program is just 3x slower than cheat
> version (the program that just output huge text without doing any calculations.)

Plain copy operations are faster than complex calculations with
a lot of comparisons and not continuously repeating conditional
jumps. One conditional branch with random branch target 'costs'
the same time than multiple cacheline writes, shuffling 64 byte
blocks simultaneously.

> I actually do have idea how to (possibly) make it even faster, and, maybe,
> have no newline issues anymore. Not sure if it will actually work. Unfortunately,
> it will require Linux-specific system call (clone) which I don't have on my MacBook.
>
> If I understand correctly, threads created with clone can access same memory addresses.

All threads belonging to the same process can access all memory
allocated by that main process (at least in Windows). Possibly,
you have to pass the block address (as long as it is not stored
in a global variable 'visible' to all child processes).

> I am thinking to use separate thread to fill output buffer with "\n" characters (perhaps
> using MOVQ filling 8 bytes at a time), and then just print words and numbers in main
> thread without caring about newlines. For example, I copy word "Fizz" to the buffer,
> which is 4 bytes, overwriting four newline characters, but add 5 to buffer pointer,
> so next word or number won't overwrite newline character that's located in the buffer
> right after the "Fizz".

Why not SSE registers? Using a fixed width of 16 byte can serve
all possible characters including LF, CR or CRLF. With variable
width, you had to play around with the content of XMM registers
and MOVDQU rather than MOVDQA (introducing penalties for uneven
addresses not divisible by 16), but this applies to all integer
registers as well, because unaligned memory accesses always are
punished with extra cycles. (Which might be the reason why your
attempts are slower than the reference...)

Thinking it over, it should be fastest to do all work in one 15
step loop. If you preload XMM0 through XMM2 with the FIZZes and
BUZZes and XMM3 with a 0x20 * 16 pattern, store 0x20 in as much
integer registers as digits required, you could skip the entire
hex to decimal conversation, as well. You just needed a routine
to increment the LSB register from 0x30 to 0x39, then increment
the next higher digit register, reset the LSB register to 0x30,
and so on. As last step of the routine the byte-portions of all
digit registers and the leading spaces (MOVQ XMM3) plus CRLF/LF
/CR (move BYTE/WORD to 0x14/0x15[address]) are written to their
proper places before the routine returns to the main loop. This
introduces one conditional jump per digit, two each tenth digit
(whenever the digit counter becomes zero), three each hundredth
digit, and so on (much faster than any conversion could be).

> Not sure if I'm explaining it clear enough.

Yes. Threading will not help you much, because this task better
is done in a single thread. If you invalidate cachelines, it is
no good idea to access the same addresses soon after. Writing a
straight sequence of continuous addresses is the fastest way to
go. If you thought about splitting the total number into pieces
matching the processor count - 1, it will accelerate execution,
of course.

> There should be no way that additional thread will take
> longer to execute, so words and numbers should never be overwritten by newline
> characters. It will require more computing power, but it should be faster. Theoretically. :-)
>
> Or am I wrong?

Possibly. Threading is a perfect solution for tasks that can be
broken down into smaller, endlessly repeatable subtasks. In our
case, there is nothing that can be done out of order - even the
initially proposed pre-filling was no good threading appliance,
because the main process would access that currently pre-filled
memory simultaneously with the filler-thread.

Rick C. Hodgin

unread,

Jul 22, 2018, 7:58:36 AM7/22/18

to

On Saturday, July 21, 2018 at 7:27:49 PM UTC-4, Bart wrote:
> Is it that simple? The first error in your code was that 'num' was not
> defined. So I moved that further up.

I only see one error in the code posted in this thread. I
forgot to remove the "int num" parameter from other(). It
should've been void.

--
Rick C. Hodgin

James Van Buskirk

unread,

Jul 22, 2018, 8:13:38 AM7/22/18

to

"Bernhard Schornak" wrote in message news:pj1ole$sl1$1...@dont-email.me...

> fizz buzz wrote:

> Plain copy operations are faster than complex calculations with
> a lot of comparisons and not continuously repeating conditional
> jumps. One conditional branch with random branch target 'costs'
> the same time than multiple cacheline writes, shuffling 64 byte
> blocks simultaneously.

That's why you do 8 in parallel in ymm registers with branchless
code.

Bart

unread,

Jul 22, 2018, 8:58:42 AM7/22/18

to

> void other(void) { printf("%d\n", num); }
>
> void (*funcs[4])(int) = { other, fizz, buzz, fizzbuzz };
>
> int num;
>

Problem 1: the body of other() uses 'num', which is not defined until later.

Problem 2; funcs[4] is declared as function pointers taking an int,
although it is initialised to functions which take no arguments, and the
pointers are called with no arguments.

Now you mention an 'int num' parameter, which I can't see in your code,
so a potential Problem 3.

It is anyway not a simple typo that anyone can see at a glance. More
importantly, if code hasn't been run, then we don't know if it's going
to work, especially after ad hoc modifications by someone else who
doesn't know what you had in mind.

That's why I went with Ben's version just to get a ball-park performance
figure for a C version of the algorithm.

--
bart

Terje Mathisen

unread,

Jul 22, 2018, 11:43:52 AM7/22/18

to

Bernhard Schornak wrote:
> Why not SSE registers? Using a fixed width of 16 byte can serve
> all possible characters including LF, CR or CRLF. With variable
> width, you had to play around with the content of XMM registers
> and MOVDQU rather than MOVDQA (introducing penalties for uneven
> addresses not divisible by 16), but this applies to all integer
> registers as well, because unaligned memory accesses always are
> punished with extra cycles. (Which might be the reason why your
> attempts are slower than the reference...)
>
> Thinking it over, it should be fastest to do all work in one 15
> step loop. If you preload XMM0 through XMM2 with the FIZZes and
> BUZZes and XMM3 with a 0x20 * 16 pattern, store 0x20 in as much
> integer registers as digits required, you could skip the entire
> hex to decimal conversation, as well. You just needed a routine
> to increment the LSB register from 0x30 to 0x39, then increment
> the next higher digit register, reset the LSB register to 0x30,
> and so on. As last step of the routine the byte-portions of all
> digit registers and the leading spaces (MOVQ XMM3) plus CRLF/LF
> /CR (move BYTE/WORD to 0x14/0x15[address]) are written to their
> proper places before the routine returns to the main loop. This
> introduces one conditional jump per digit, two each tenth digit
> (whenever the digit counter becomes zero), three each hundredth
> digit, and so on (much faster than any conversion could be).

This is similar to what I suggested a week or so ago, today I decided to
write a version and check the actual speed:

My algorithm uses 5 SSE regs, one with "Fizz\n", one with "Buzz\n", one
with "FizzBuzz\n" and one that contains the current number as ascii with
a trailing newline but no leading spaces/zero digits.

The final reg contains zero in all positions except the one that
corresponds to the last digit of the ascii number reg where it has a 1 byte.

I always output 3 blocks of 10 lines, using unaligned writes, and after
each line I update the ascii number by adding that increment register.

When I finish a block I have to handle the carry, so I go back to the
memory version of the number where I increment the second-to-last digit,
if that generates a carry I set it to '0' and step back one more digit
and retry the increment.

When I do this I also have to update the increment register since the
last digit has now been pushed one byte further out.

My first attempt was somewhat disappointing, it ran in almost exactly 10
cycles/line, i.e. 3+ ms on my 3+ GHz cpu. I suspect this is partly
because I have to write everything to a single 7 MB buffer, so I'm
getting a write speed of 2+ GB/s to RAM.

Terje Mathisen

unread,

Jul 22, 2018, 2:29:01 PM7/22/18

to

Terje Mathisen wrote:
> This is similar to what I suggested a week or so ago, today I decided
> to write a version and check the actual speed:
>
> My algorithm uses 5 SSE regs, one with "Fizz\n", one with "Buzz\n",
> one with "FizzBuzz\n" and one that contains the current number as
> ascii with a trailing newline but no leading spaces/zero digits.
>
> The final reg contains zero in all positions except the one that
> corresponds to the last digit of the ascii number reg where it has a
> 1 byte.
>
> I always output 3 blocks of 10 lines, using unaligned writes, and
> after each line I update the ascii number by adding that increment
> register.
>
> When I finish a block I have to handle the carry, so I go back to
> the memory version of the number where I increment the second-to-last
> digit, if that generates a carry I set it to '0' and step back one
> more digit and retry the increment.
>
> When I do this I also have to update the increment register since
> the last digit has now been pushed one byte further out.
>
> My first attempt was somewhat disappointing, it ran in almost exactly
> 10 cycles/line, i.e. 3+ ms on my 3+ GHz cpu. I suspect this is
> partly because I have to write everything to a single 7 MB buffer, so
> I'm getting a write speed of 2+ GB/s to RAM.

The actual write speed is quite a bit higher since I'm rewriting a lot
of output, i.e. for the range from 100K to 1M there are 190 output bytes
per 30 lines, or just 6.33 bytes/line so almost 10 of the 16 bytes are
wasted.

I made a small optimization where I stored "fuzz\n" directly after the
number value, then I could just save 5 more bytes for each of the 4
places in a 30-block where a number is followed by a fuzz line. I also
had to generate an inc2 register which would increment by 2 instead of 1
each time I skipped a line.

The end result was a drop from 10 to 8 cycles/line so now I'm getting
close to 1 cycle/byte.

I also considered storing those 190 output bytes in 12 SSE or 6 AVX
registers, but at that point I would also need to update the 18
different locations where a number is the output, and I could not reload
everything from memory for each iteration, i.e. I would need a full
190-byte increment array (rounded up to 192 and properly aligned) so
that I could increment all of them with 12/6 SSE/AVX byte adds.

This is still not enough since I would also have to handle carries into
at least one more digit, using compares against '9', byte shuffles and
another set of adds. I.e. this would be a lot more work for very
marginal gains. :-(

Rod Pemberton

unread,

Jul 22, 2018, 9:44:27 PM7/22/18

to

On Sun, 22 Jul 2018 13:48:31 +0100
Bart <b...@nospicedham.freeuk.com> wrote:

Ok, I'm not sure why he didn't post corrections yet. I'm tired of
reading this part of the thread. So, this is probably what he intended.

#include <stdio.h>

void fizz(int num) { printf("fizz\n"); }
void buzz(int num) { printf("buzz\n"); }
void fizzbuzz(int num) { printf("fizzbuzz\n"); }
void other(int num) { printf("%d\n", num); }

void (*funcs[4])(int) = { other, fizz, buzz, fizzbuzz };

int num;

int main(void)
{
int threes = 1 << 2, fives = 1 << 4;

for (num = 1; num <= 100; ++num) {

funcs[(threes & 1) + (2 * (fives & 1))](num);

threes = (threes >> 1) | ((threes & 1) << 2);
fives = (fives >> 1) | ((fives & 1) << 4);
}

return(0);
}

Rod Pemberton
--
As long as the UK continues to work with the EU, Brexit won't happen.
The first pawn sacrifice: Gibraltar. Set Gibraltar free.

Bernhard Schornak

unread,

Jul 23, 2018, 5:29:55 AM7/23/18

to

Here we differ. I think of putting all required digits into integer
registers and write their byte portions to memory. Assuming, R15 is
the current address of the output buffer, EAX...EBP hold the digits
0...6 (expand with R9...R10 if required), R14 is the current count,
R13 holds the current count (0...9), R12 holds 0x30, R11 holds 0x31
and XMM3 holds 16 spaces:

fizzbuzz_loop:

[align to next multiple of 32]

call write_routine # => 1st
call write_routine # => 2nd
movdqa %xmm0, 0x00(%r15) # => FIZZ
addq $0x10, %r15 # R15 = current EA
call write_routine # => 4th
movdqa %xmm1, 0x00(%r15) # => BUZZ
movdqa %xmm0, 0x10(%r15) # => FIZZ
addq $0x20, %r15 # R15 = current EA
call write_routine # => 7th
call write_routine # => 8th
movdqa %xmm0, 0x00(%r15) # => FIZZ
movdqa %xmm1, 0x10(%r15) # => BUZZ
call write_routine # => 11th
movdqa %xmm0, 0x00(%r15) # => FIZZ
call write_routine # => 13th
call write_routine # => 14th
movdqa %xmm2, 0x00(%r15) # => FIZZBUZZ
decq %r14 # R14 = loop count
jne fizzbuzz_loop

[END]

write_routine:

[align to next multiple of 32]

movq %xmm3, 0x00(%r15)
movb %bpl, 0x08(%r15)
movb %sil, 0x09(%r15)
movb %dil, 0x0A(%r15)
movb %dl, 0x0B(%r15)
movb %cl, 0x0C(%r15)
movb %bl, 0x0D(%r15)
movb %al, 0x0E(%r15)
movb $0x0A, 0x0F(%r15)
addq $0x10, %r15
incb %al # RAX = 1st digit
decq %r13
jne RET
incb %bl
movq $0x09, %r13 # R13 = digit counter
cmpb $0x20, %bl # RBX = 2nd digit
cmove %r11b, %bl
je RET
cmpb $0x09, %bl
cmova %r12b, %bl
jne RET
.
. # repeat for the remaining digits
.

[align to next multiple of 32]

RET: nop
ret

Because we just count from n to m, we do not need a conversion, but
can count up our number using simple INC instructions. the compares
(prefering CMOV over JMP) are much faster than calculating stuff.

Writing to straight ascending addresses is ways faster than writing
stuff out of order, not to speak of overwriting already invalidated
memory locations in one and the same function.

Bernhard Schornak

unread,

Jul 23, 2018, 5:44:57 AM7/23/18

to

Bernhard Schornak wrote:

> write_routine:
>
> [align to next multiple of 32]
>
> movq %xmm3, 0x00(%r15)
> movb %bpl, 0x08(%r15)
> movb %sil, 0x09(%r15)
> movb %dil, 0x0A(%r15)
> movb %dl, 0x0B(%r15)
> movb %cl, 0x0C(%r15)
> movb %bl, 0x0D(%r15)
> movb %al, 0x0E(%r15)
> movb $0x0A, 0x0F(%r15)
> addq $0x10, %r15
> incb %al # RAX = 1st digit
> decq %r13
> jne RET
> incb %bl
> movq $0x09, %r13 # R13 = digit counter
> cmpb $0x20, %bl # RBX = 2nd digit

This should be "cmpb $0x21, %bl", of course...

> cmove %r11b, %bl
> je RET
> cmpb $0x09, %bl
> cmova %r12b, %bl
> jne RET
> .
> . # repeat for the remaining digits
> .
>
>
> [align to next multiple of 32]
>
> RET: nop
> ret

Terje Mathisen

unread,

Jul 23, 2018, 5:59:59 AM7/23/18

to

"The proof of the pudding is in the eating"

I'm willing to bet that your routine is significantly slower than mine
(maybe by an order of magnitude?).

I.e. please time (with RDTSC) a run from 1 to 1E6. Your target should be
to beat or match my current 8M clock cycles. :-)

Terje

Rick C. Hodgin

unread,

Jul 25, 2018, 2:14:03 PM7/25/18

to

On Sunday, July 22, 2018 at 8:58:42 AM UTC-4, Bart wrote:
> On 22/07/2018 12:56, Rick C. Hodgin wrote:
> > On Saturday, July 21, 2018 at 7:27:49 PM UTC-4, Bart wrote:
> >> Is it that simple? The first error in your code was that 'num' was not
> >> defined. So I moved that further up.
> >
> > I only see one error in the code posted in this thread. I
> > forgot to remove the "int num" parameter from other(). It
> > should've been void.
> >
>
> > void other(void) { printf("%d\n", num); }
> >
> > void (*funcs[4])(int) = { other, fizz, buzz, fizzbuzz };
> >
> > int num;
> >
>
> Problem 1: the body of other() uses 'num', which is not defined until later.

Correct. I forgot C requires forward declarations. :-) CAlive
does not. My bad.

> Problem 2; funcs[4] is declared as function pointers taking an int,
> although it is initialised to functions which take no arguments, and the
> pointers are called with no arguments.

Yes. That was the only bug I was referring to earlier. I forgot
about the int num; needing to be defined before:

https://groups.google.com/d/msg/comp.lang.asm.x86/avLse_BYz1I/44ydyBD8CgAJ

> Now you mention an 'int num' parameter, which I can't see in your code,
> so a potential Problem 3.
>
> It is anyway not a simple typo that anyone can see at a glance. More
> importantly, if code hasn't been run, then we don't know if it's going
> to work, especially after ad hoc modifications by someone else who
> doesn't know what you had in mind.
>
> That's why I went with Ben's version just to get a ball-park performance
> figure for a C version of the algorithm.

Here's the corrected code:

#include <stdio.h>

==> // Moved up:
==> int num;

void fizz(void) { printf("fizz\n"); }
void buzz(void) { printf("buzz\n"); }
void fizzbuzz(void) { printf("fizzbuzz\n"); }

void other(void) { printf("%d\n", num); }

==> // Removed int param, changed to void:
==> void (*funcs[4])(void) = { other, fizz, buzz, fizzbuzz };

int main(void)
{
int threes = 1 << 2, fives = 1 << 4;

for (num = 1; num <= 100; ++num) {
funcs[(threes & 1) + (2 * (fives & 1))]();

threes = (threes >> 1) | ((threes & 1) << 2);
fives = (fives >> 1) | ((fives & 1) << 4);
}

return 0;
}

-----
My mistake, Bart. I apologize for disrupting your life as I have
done. It was not intentional.

--
Rick C. Hodgin