Some processor bugs can be fixed by microcode.
A typical retail motherboard can carry as many as eight
microcode files, covering the CPUs in the compatibility
table on the motherboard maker's site, and those are stored
in the BIOS flash chip. Microcode patches are tiny and
variable in length. The last time I took apart a BIOS file,
the microcode segments were in multiples of 2KB or so.
Microcode releases have a revision number, and
a patcher loading microcode is only allowed to
install a patch that has a higher revision number
than the one currently in the processor.
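
As a rough illustration of that "higher revision only" rule, here is a minimal sketch. The function names and the revision values are made up for the example, and real patchers live in firmware or the OS kernel, not in Python.

```python
# Sketch of the "only install a higher revision" rule.
# All names and values here are hypothetical, for illustration only.

def should_install(current_rev, candidate_rev):
    """Return True only if the candidate is newer than what the CPU has."""
    return candidate_rev > current_rev


def pick_patch(current_rev, candidate_revs):
    """Pick the highest candidate revision that passes the rule, or None."""
    newer = [rev for rev in candidate_revs if should_install(current_rev, rev)]
    return max(newer) if newer else None


# CPU currently at revision 0x07; three patches are on offer.
chosen = pick_patch(0x07, [0x05, 0x07, 0x0A])
print(hex(chosen) if chosen is not None else "nothing to install")
```

The same check would apply whether the patch comes from the BIOS or from the OS, so whichever one holds the newest file ends up supplying it.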
The BIOS has its own microcode patcher. The microcode
it ships must be good enough to allow the system to boot
into the OS. So no storage bugs can exist with the shipped
BIOS microcode. All it has to do is get the system booted.
Windows and Linux also have microcode patchers.
The Windows one does its job early after boot,
and then the service exits, so you don't really
see it. It also allows deployment of updates.
It's unclear whether a BIOS update would deploy
a new version any faster than Microsoft could
push a new file via Windows Update.
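
On the Linux side you can at least see which revision ended up loaded. As far as I know, x86 kernels report a "microcode" line per logical CPU in /proc/cpuinfo. A minimal sketch that pulls those values out, assuming that line format (the helper names are mine):

```python
import re

# Matches lines such as "microcode\t: 0xf0", the format assumed here
# for /proc/cpuinfo on x86 Linux.
_MICROCODE_LINE = re.compile(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$")


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions reported across all CPUs."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        match = _MICROCODE_LINE.match(line)
        if match:
            revisions.add(int(match.group(1), 16))
    return revisions


def read_local_revisions(path="/proc/cpuinfo"):
    """Read the local file; returns an empty set if it can't be opened."""
    try:
        with open(path) as handle:
            return microcode_revisions(handle.read())
    except OSError:
        return set()
```

Calling read_local_revisions() on a Linux box should normally give a set with a single value, since all cores are expected to run the same revision.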
If you have a copy of the Intel Processor Identification
Utility (PIU), the field "revision" is actually the
revision number of the microcode. There was one incident
where no microcode was getting loaded, and the number
was zero. Most of the time, you will find a small nonzero
number in that field. In some cases, the utility
mistakenly masks the value read out, and some digits
may not belong there. (Maybe you see F07 instead
of 07.)
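
To show how a wrong mask could give you F07 instead of 07, here is a small sketch. The raw value and both masks are invented for the example. I don't know the actual bit layout the utility reads, so treat this as an illustration of the masking mistake only.

```python
def shown_revision(raw_value, mask):
    """Format the masked value as upper-case hex, at least two digits."""
    return f"{raw_value & mask:02X}"


# Hypothetical raw read-out: revision 0x07 plus unrelated upper bits.
raw = 0xF07

print(shown_revision(raw, 0xFFF))  # mask too wide, stray digit leaks through
print(shown_revision(raw, 0x0FF))  # mask limited to the assumed revision bits
```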
Some bugs in processors are fixed by actual code.
When AMD had a TLB bug in the Phenom 9500, they distributed
a code module of maybe 15KB or so, to be added to the
BIOS. This code disabled the TLB, or a portion of
it, costing a small amount of performance. A
fixed version of the processor, in the same family,
had "50" added to the lower digits, so if you bought
a 9550 you knew it was fixed, whereas a 9500 wasn't.
So that fix wasn't microcode based, because it
wasn't an actual instruction problem. It was a
problem of some sort with virtual-to-physical
address translation.
The average processor has something like 100 errata. Some
of the errata are discovered a year or two after the first
batch is distributed for sale. Testing continues after release.
Many bugs are repaired via microcode updates. Some
are labeled "won't fix", meaning that even if a new mask
revision were in the pipe, the manufacturer has no plan to
patch out the problem. Some issues are innocuous enough they
don't need fixing.
In the case of the Prime95 issue above, the hand-coded FFTs
are perfect material for uncovering bugs. Frequently,
compilers produce "lame" code that doesn't give particularly
good fault coverage. So you don't see bugs, because the
instruction sequences aren't that challenging.
One AMD processor had an FPU bug caused by actual
electrical noise. It was discovered after release.
It took assembler code to trigger it. The assembler code
consisted of a nonsensical continuous sequence of
one FPU instruction after another. This drew enough
current to cause a noise problem in the substrate.
Errata like that receive a "will not fix" rating,
because it is not expected that anyone will be
coding with assembler, and using that stupid a sequence
of instructions. Real FPU code needs an occasional
bit test, branch condition, and so isn't solid 100% FPU
instructions one after another. And when a HLL is used,
the compiler/assembler wouldn't even get close to
the required FPU code density to break that processor.
(If I owned such a processor though, I'd be pissed
that it wasn't caught in testing, or recognized
as a potential issue during design.)
When it comes to test benches for hardware design, you
run the important ones first (and try to finish them by
design close). The ridiculous tests are saved for later,
after production has begun. And that's when the AMD testers
carried out their artificial 100% density test and discovered
a problem. For our chip designs, some staff were running
simulation test cases a year after we had hardware in hand.
(And ours didn't have microcode to patch with either.
We had another feeble mechanism for emergencies :-) )
The level of bugs is rather constant. I don't recollect ever
looking at an errata sheet for a CPU and seeing zero bugs. It
just doesn't happen. I expect in some cases, staff already know
of multiple errata, even before design close, but the
boss says "ship it". I doubt they would hold up a mask
release, chasing every possible bug and making the CPU
two years late. That just isn't going to happen, especially
when the "good ole microcode" can pull your bacon out of the
fire.
So while it's sad that this "bendable" processor also has
an erratum, it probably has another 99 errata to keep that
one company. Most of those errata are invisible to end
users. The janitorial staff already cleaned up the mess :-)
It's the ones that cost you performance that
make the fanbois crazy. The AMD TLB bug certainly
upset a few happy owners of the affected silicon.
And the Intel FDIV bug certainly cost Intel
(all those Excel spreadsheet jokes). Intel has had
a few wakeup calls, and I like to think that Skylake
is another wakeup call ("staff getting sloppy,
poor decision making"). So far it hasn't cost them
any "big money". I don't know if users are
returning their "bent" processors or not.
Paul