In the recently completed Advent of Code 2021 cycle, one of the puzzles
was called Lanternfish, where you had a population of these
light-flashing fish: Each fish would count down from 6 to 0, then it
would flash, reset to 6 and produce an offspring which needed two
additional days to mature, i.e. it would start at 8.
Directly modelling each fish was doable for 10+ generations (part 1
asked for 80 days), but part 2 required 256 days, at which point you had
to realize that you didn't need to model individual fishes, only the
counts of how many fish were at each countdown level. This insight
speeded up the process by many orders of magnitude and made the second
part easy instead of unfeasible.
Now we get to the interesting part for c.arch:
The obvious algorithm have a fish[9] array with counts for 0..8, then on
each day you do:
tmp = fish[0]; for (i = 1; i < 9; i++) fish[i-1]=fish[i];
fish[6]+=tmp; fish[8]=tmp;
and this runs very well indeed.
First optimization extends the array to 266 entries and simply moves the
front pointer instead of rotating, avoiding all of the copying:
f[7]+=f[9]=f[0]; f++;
Alternatively, make the array 16 entries long and do all accesses as
above, but masked by 15:
fish[(day+7)&15]+=fish[(day+9)&15]=fish[day&15]; day++;
However, Robert Collins came up with the to me non-obvious idea of
moving it all into registers and manually rotating them (using Rust):
for _ in 0..days {
tmp=t0;t0=t1;t1=t2;t2=t3;t4=t5;t5=t6;t6=t7+tmp;t7=t8;t8=tmp;
}
So this is 8 reg-reg moves and a single addition, plus the
tw-instruction loop overhead, but due to how modern OoO cpus can handle
such moves in the decoder, using zero execution slots, the running time
dropped by an order of magnitude compared to rotating memory variables.
(It did each iteration in ~0.5 cycles, so it had to run two
iterations/22 instructions per cycle!)
At this point I tried to unroll by four, which reduced the number of
reg-reg MOVes to 5 and did 4 ADDs per iteration:
u64 processdays(int days)
{
u64 tmp0, tmp1, tmp2, tmp3, t0 = fish[0], t1 = fish[1], t2 =
fish[2], t3 = fish[3],
t4 = fish[4], t5 = fish[5], t6 = fish[6], t7 = fish[7], t8 =
fish[8];
for (int d = 3; d < days; d += 4) {
tmp0 = t0, tmp1 = t1, tmp2 = t2, tmp3 = t3;
t0 = t4, t1 = t5, t2 = t6;
t3 = t7 + tmp0;
t4 = t8 + tmp1;
t5 = tmp0 + tmp2;
t6 = tmp1 + tmp3;
t7 = tmp2;
t8 = tmp3;
}
return t0 + t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8;
}
This was less than twice as fast as the single update/iteration, so the
OoO magic did a very good job with Robert's original scalar version.
Terje
--
- <Terje.Mathisen at
tmsw.no>
"almost all programming can be viewed as an exercise in caching"