> It's not a traditional Harvard arch -- it's an MB-lite.
AFAIK the MB-lite *is* a traditional Harvard arch. Separate instruction
memory from data memory, simultaneous access to both instruction *and*
data.
> AFAICT, there
> are no provisions in the standard core to provide "data" paths to the
> program store.
This is also my conclusion.
>> Often, if the data address space is large enough, a part of the data
>> address space is mapped to the program store, for write accesses.
>
> The OP would be better served (?) just adding the ancillary hardware
> to the core as he's already got the VHDL for the core; assuming the
> interface needn't be "terribly fast", adding an autoincrement register
> at a specific place in the (data) address space to which he can write
> a specific "starting (program) address"... and, another from which he
> can read the contents of the program memory and *write* back to it
> (letting the autoincrement register advance him to the next address)
> seems to be the easiest interface.
Even easier would be to let an external scrubber (a few-state FSM) take
control of the memory bus (instruction and data) while the processor is
on hold.
Since the scrubbing rate is not that high, it would be essentially
transparent to the processor, which would not even realize anything had
happened.
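To make the idea concrete, here is a behavioral sketch in C of such a
scrubber FSM. Everything here (MEM_WORDS, cpu_hold, edac_correct) is a
made-up stand-in for the real bus-arbitration signals and EDAC logic of
the actual design:

#include <stdint.h>
#include <stdbool.h>

#define MEM_WORDS 65536u

static uint64_t mem[MEM_WORDS];  /* data word + check bits, modeled flat */

static void cpu_hold(bool hold) { (void)hold; /* stall/release the CPU */ }

static uint64_t edac_correct(uint64_t w)
{
    /* placeholder: the real EDAC recomputes the syndrome and flips
     * the faulty bit (see the SEC-DED example further down-thread) */
    return w;
}

enum scrub_state { S_HOLD, S_READ, S_WRITE, S_RELEASE };

/* one pass of the FSM over a single address; an address counter
 * around this gives the full scrub loop */
static void scrub_one(uint32_t addr)
{
    enum scrub_state st = S_HOLD;
    uint64_t word = 0;

    while (1) {
        switch (st) {
        case S_HOLD:    cpu_hold(true);                 st = S_READ;    break;
        case S_READ:    word = edac_correct(mem[addr]); st = S_WRITE;   break;
        case S_WRITE:   mem[addr] = word;               st = S_RELEASE; break;
        case S_RELEASE: cpu_hold(false);                return;
        }
    }
}

int main(void)
{
    /* scrub the whole store once */
    for (uint32_t a = 0; a < MEM_WORDS; a++)
        scrub_one(a);
    return 0;
}

In hardware this collapses to a handful of states plus an address
counter; the C version just makes the sequencing explicit.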
[...]
> Presumably, the OP will be *designing* the EDAC -- over in another
> corner of the same die that implements the processor, itself. As
> such, he can build the scrubbing functionality into it directly -- and
> just report status to the processor (i.e., instead of correcting the
> bits fed to the CPU, implement a RMW cycle in place of the "opcode
> fetch").
Replacing the 'fetch' with such heavy artillery would be rather
inefficient. Scrubbing is necessary only to avoid error accumulation,
which would otherwise end in a double error (not correctable in our
case). So scrubbing should not be done on *every* fetch, but rather once
every 10K fetches or even less often.
> He may want to be able to control this as it can slow down execution
> (or, cause the pipeline to starve). But, it seems more prudent to
> "fix" the bad read *now* (automatically) rather than hope some
> software routine gets around to it "eventually".
It's a matter of the tolerance to soft errors you want to achieve. I
don't have the numbers off the top of my head, but scrubbing a memory
cell every 500 us is something that would keep you going without any
impact for quite some time (read: years). The processor, running at
40 MHz, does not even know the memory has been scrubbed.
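As a back-of-the-envelope check of how cheap that is (only the 500 us
period and the 40 MHz clock come from above; the 64 Ki-word store size
and the 4-cycle cost per read-modify-write are assumptions):

#include <stdio.h>

int main(void)
{
    const double f_clk      = 40e6;     /* CPU clock, from the post   */
    const double t_scrub    = 500e-6;   /* one cell every 500 us      */
    const double words      = 65536.0;  /* assumed program store size */
    const double rmw_cycles = 4.0;      /* assumed cost of one RMW    */

    double pass_s   = words * t_scrub;                /* full-memory pass */
    double overhead = rmw_cycles / (t_scrub * f_clk); /* stolen cycles    */

    printf("full scrub pass : %.1f s\n", pass_s);          /* ~32.8 s  */
    printf("CPU overhead    : %.4f %%\n", overhead * 100.0); /* ~0.02 % */
    return 0;
}

So a full pass over the store takes about half a minute, and the stolen
cycles amount to roughly 0.02% of the CPU's time, i.e. effectively
invisible, as said.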
> [the hooks to read/write "arbitrary" program memory locations still
> seem to be needed if he truly wants to walk *all* of program memory to
> ensure it is periodically scrubbed.]
As said elsewhere in this thread, I've come to the conclusion that it
would be easier to implement the necessary hardware *around* the
processor in order to get the scrubbing done.
> The bigger issue I would pose (I've posed this to folks running server
> farms) is: what do you do when you get an error (corrected or
> otherwise)?
If the error is corrected you need to write the corrected value back in
order to avoid accumulation of errors, since your Hamming code cannot
cope with more than a double error (detected, but not correctable, in
our case). At that point you'd need to reboot or rewrite the memory from
a pristine copy you trust (typically held in some sort of non-volatile
memory).
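For illustration, a toy SEC-DED Hamming(8,4) encoder/decoder in C: it
corrects any single-bit error and flags (without correcting) double
errors, which is exactly why the corrected value must be written back
before a second bit flips. A real EDAC would use e.g. a (39,32) or
(72,64) code, but the mechanics are identical:

#include <stdio.h>
#include <stdint.h>

static int bit(uint8_t w, int i) { return (w >> i) & 1; }

/* encode data nibble d (d1 = MSB .. d4 = LSB) into the codeword
 * [p1 p2 d1 p3 d2 d3 d4 P], P being the overall parity for DED */
static uint8_t encode(uint8_t d)
{
    int d1 = bit(d,3), d2 = bit(d,2), d3 = bit(d,1), d4 = bit(d,0);
    int p1 = d1 ^ d2 ^ d4;            /* covers positions 1,3,5,7 */
    int p2 = d1 ^ d3 ^ d4;            /* covers positions 2,3,6,7 */
    int p3 = d2 ^ d3 ^ d4;            /* covers positions 4,5,6,7 */
    uint8_t c = (p1<<6)|(p2<<5)|(d1<<4)|(p3<<3)|(d2<<2)|(d3<<1)|d4;
    int P = p1 ^ p2 ^ d1 ^ p3 ^ d2 ^ d3 ^ d4;
    return (uint8_t)((c << 1) | P);
}

/* returns 0 = clean, 1 = single error corrected, 2 = double error */
static int decode(uint8_t *cw, uint8_t *d)
{
    uint8_t w = *cw;
    int v[8];                       /* v[k] = bit at Hamming position k */
    for (int k = 1; k <= 7; k++) v[k] = bit(w, 8 - k);
    int P = bit(w, 0);

    int s1  = v[1]^v[3]^v[5]^v[7];
    int s2  = v[2]^v[3]^v[6]^v[7];
    int s3  = v[4]^v[5]^v[6]^v[7];
    int syn = s1 | (s2<<1) | (s3<<2);           /* = error position   */
    int par = P ^ v[1]^v[2]^v[3]^v[4]^v[5]^v[6]^v[7];

    if (syn && !par) return 2;                  /* double error       */
    if (syn)      *cw = w ^ (uint8_t)(1 << (8 - syn)); /* flip bad bit */
    else if (par) *cw = w ^ 1;                  /* parity bit itself  */

    w = *cw;
    *d = (uint8_t)((bit(w,5)<<3)|(bit(w,3)<<2)|(bit(w,2)<<1)|bit(w,1));
    return (syn || par) ? 1 : 0;
}

int main(void)
{
    uint8_t cw = encode(0xB);       /* data nibble 1011            */
    cw ^= 0x80;                     /* inject a single-bit upset   */
    uint8_t d;
    int r = decode(&cw, &d);
    printf("status %d, data 0x%X\n", r, d); /* -> status 1, data 0xB */
    return 0;
}

A scrubber simply calls decode() and, on status 1, writes the corrected
codeword back to memory.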
> And, when do you lose confidence in the memory subsystem?
That's a tough question. How do you measure confidence? The way I
approach the problem is very simple: how many errors per unit of time
can you tolerate? From that number, work out the probability that your
system will encounter an error and apply the necessary mitigation
techniques in order to meet the desired goal *within* a certain level of
confidence (3 sigma, 5 sigma, 7 sigma... typically this factor is
proportional to the criticality of your system).
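A minimal sketch of that "work backwards from a tolerable rate"
exercise, treating upsets as a Poisson process (the FIT figure, memory
size and mission length below are made-up inputs):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double fit_per_Mb = 1000.0;     /* assumed device soft-error rate */
    double size_Mb    = 64.0;       /* assumed memory size            */
    double hours      = 5 * 8760.0; /* assumed 5-year mission         */

    /* FIT = failures per 1e9 device-hours */
    double lambda = fit_per_Mb * size_Mb * hours / 1e9; /* expected upsets */
    double p_any  = 1.0 - exp(-lambda);  /* Poisson: P(at least one)      */

    printf("expected upsets over mission: %.2f\n", lambda);
    printf("P(>=1 upset)                : %.3f\n", p_any);
    return 0;
}

If P(>=1 upset) exceeds what your criticality level allows, you add
mitigation (ECC, scrubbing, redundancy) until it doesn't.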
> How many undetected errors are creeping in if some number of
> corrected/*uncorrected* errors occur??]
There's no such thing as 'undetected errors'. Your code provides you
with a means to *detect* and/or *correct* a certain class of errors.
Given the class of errors you want to protect against (because, in
terms of probability, they account for the bulk of your errors), your
code implementation will provide the necessary protection.
In the case of a server, which presumably uses non rad-hard components,
a reasonable estimate of the soft error rate for DRAM would range
between a few tens and a few thousand FIT per Mb, while SRAM is much
more sensitive, at around a few hundred thousand FIT per Mb. This is why
today's caches are ECC protected; otherwise they'd experience a fault
every half hour!
On the contrary, protecting DRAM may not be strictly necessary (once in
two years for a 10 Gib system). [1]
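For anyone wanting to plug in their own figures: FIT is failures per
1e9 device-hours, so the mean time between faults is simply
1e9 / (FIT_per_Mb * size_Mb) hours. A one-line converter, seeded with
the low end of the DRAM range quoted above:

#include <stdio.h>

int main(void)
{
    double fit_per_Mb = 10.0;           /* "few tens" FIT/Mb, low end */
    double size_Mb    = 10.0 * 1024.0;  /* 10 Gib system              */

    double mtbf_h = 1e9 / (fit_per_Mb * size_Mb);
    printf("one fault every %.0f hours (~%.1f years)\n",
           mtbf_h, mtbf_h / 8760.0);
    return 0;
}

which lands in the same ballpark as the "once in two years" figure.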
In the space market, EDAC is pretty much the standard approach when it
comes to memory, but alone it would not suffice. You need a scrubber
that reads the corrected data and writes it back. In the write operation
the faulty bit is reset to the correct value and no accumulation occurs.
Al
[1] a good reference for all those numbers:
http://lambda-diode.com/opinion/investigations/transmutations/.../ecc-memory-3