> It's not a traditional Harvard arch -- it's an MB-lite.
AFAIK the MB-lite *is* a traditional Harvard arch. Separate instruction
memory from data memory, simultaneous access to both instruction *and*
data.
> AFAICT, there
> are no provisions in the standard core to provide "data" paths to the
> program store.
This is also my conclusion.
>> Often, if the data address space is large enough, a part of the data
>> address space is mapped to the program store, for write accesses.
>
> The OP would be better served (?) just adding the ancillary hardware
> to the core as he's already got the VHDL for the core; assuming the
> interface needn't be "terribly fast", adding an autoincrement register
> at a specific place in the (data) address space to which he can write
> a specific "starting (program) address"... and, another from which he
> can read the contents of the program memory and *write* back to it
> (letting the autoincrement register advance him to the next address)
> seems to be the easiest interface.
Even easier would be to let an external scrubber (a few-state FSM) take
control of the memory bus (instruction and data) while the processor is
on hold.
Since the scrubbing rate is not that high, it would be essentially
transparent to the processor, which would not even realize anything had
happened.
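To make the idea concrete, here is a behavioral sketch in C of such a
scrubber FSM. Everything here (MEM_WORDS, cpu_hold, edac_correct) is a
made-up stand-in for the real bus-arbitration signals and EDAC logic of
the actual design:

#include <stdint.h>
#include <stdbool.h>

#define MEM_WORDS 65536u

static uint64_t mem[MEM_WORDS];  /* data word + check bits, modeled flat */

static void cpu_hold(bool hold) { (void)hold; /* stall/release the CPU */ }

static uint64_t edac_correct(uint64_t w)
{
    /* placeholder: the real EDAC recomputes the syndrome and flips
     * the faulty bit (see the SEC-DED example further down-thread) */
    return w;
}

enum scrub_state { S_HOLD, S_READ, S_WRITE, S_RELEASE };

/* one pass of the FSM over a single address; an address counter
 * around this gives the full scrub loop */
static void scrub_one(uint32_t addr)
{
    enum scrub_state st = S_HOLD;
    uint64_t word = 0;

    while (1) {
        switch (st) {
        case S_HOLD:    cpu_hold(true);                 st = S_READ;    break;
        case S_READ:    word = edac_correct(mem[addr]); st = S_WRITE;   break;
        case S_WRITE:   mem[addr] = word;               st = S_RELEASE; break;
        case S_RELEASE: cpu_hold(false);                return;
        }
    }
}

int main(void)
{
    /* scrub the whole store once */
    for (uint32_t a = 0; a < MEM_WORDS; a++)
        scrub_one(a);
    return 0;
}

In hardware this collapses to a handful of states plus an address
counter; the C version just makes the sequencing explicit.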
[...]
> Presumably, the OP will be *designing* the EDAC -- over in another
> corner of the same die that implements the processor, itself. As
> such, he can build the scrubbing functionality into it directly -- and
> just report status to the processor (i.e., instead of correcting the
> bits fed to the CPU, implement a RMW cycle in place of the "opcode
> fetch").
Replacing the 'fetch' with such heavy artillery would be rather
inefficient. Scrubbing is necessary only to avoid error accumulation,
which would otherwise end in a double error (not correctable in our
case). So scrubbing should not be done on *every* fetch, but rather once
every 10K fetches or even less often.
> He may want to be able to control this as it can slow down execution
> (or, cause the pipeline to starve). But, it seems more prudent to
> "fix" the bad read *now* (automatically) rather than hope some
> software routine gets around to it "eventually".
It's a matter of the tolerance to soft errors you want to achieve. I
don't have the numbers off the top of my head, but scrubbing a memory
cell every 500 us is something that would keep you going without any
impact for quite some time (read: years). The processor, running at
40 MHz, does not even know the memory has been scrubbed.
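As a back-of-the-envelope check of how cheap that is (only the 500 us
period and the 40 MHz clock come from above; the 64 Ki-word store size
and the 4-cycle cost per read-modify-write are assumptions):

#include <stdio.h>

int main(void)
{
    const double f_clk      = 40e6;     /* CPU clock, from the post   */
    const double t_scrub    = 500e-6;   /* one cell every 500 us      */
    const double words      = 65536.0;  /* assumed program store size */
    const double rmw_cycles = 4.0;      /* assumed cost of one RMW    */

    double pass_s   = words * t_scrub;                /* full-memory pass */
    double overhead = rmw_cycles / (t_scrub * f_clk); /* stolen cycles    */

    printf("full scrub pass : %.1f s\n", pass_s);          /* ~32.8 s  */
    printf("CPU overhead    : %.4f %%\n", overhead * 100.0); /* ~0.02 % */
    return 0;
}

So a full pass over the store takes about half a minute, and the stolen
cycles amount to roughly 0.02% of the CPU's time, i.e. effectively
invisible, as said.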
> [the hooks to read/write "arbitrary" program memory locations still
> seem to be needed if he truly wants to walk *all* of program memory to
> ensure it is periodically scrubbed.]
As said elsewhere in this thread, I've come to the conclusion that it
would be easier to implement the necessary hardware *around* the
processor in order to get the scrubbing done.
> The bigger issue I would pose (I've posed this to folks running server
> farms) is: what do you do when you get an error (corrected or
> otherwise)?
If the error is corrected you need to write the corrected value back in
order to avoid accumulation of errors, since your Hamming code cannot
cope with more than a double error (detected, but not correctable, in
our case). At that point you'd need to reboot or rewrite the memory from
a pristine copy you trust (typically held in some sort of non-volatile
memory).
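For illustration, a toy SEC-DED Hamming(8,4) encoder/decoder in C: it
corrects any single-bit error and flags (without correcting) double
errors, which is exactly why the corrected value must be written back
before a second bit flips. A real EDAC would use e.g. a (39,32) or
(72,64) code, but the mechanics are identical:

#include <stdio.h>
#include <stdint.h>

static int bit(uint8_t w, int i) { return (w >> i) & 1; }

/* encode data nibble d (d1 = MSB .. d4 = LSB) into the codeword
 * [p1 p2 d1 p3 d2 d3 d4 P], P being the overall parity for DED */
static uint8_t encode(uint8_t d)
{
    int d1 = bit(d,3), d2 = bit(d,2), d3 = bit(d,1), d4 = bit(d,0);
    int p1 = d1 ^ d2 ^ d4;            /* covers positions 1,3,5,7 */
    int p2 = d1 ^ d3 ^ d4;            /* covers positions 2,3,6,7 */
    int p3 = d2 ^ d3 ^ d4;            /* covers positions 4,5,6,7 */
    uint8_t c = (p1<<6)|(p2<<5)|(d1<<4)|(p3<<3)|(d2<<2)|(d3<<1)|d4;
    int P = p1 ^ p2 ^ d1 ^ p3 ^ d2 ^ d3 ^ d4;
    return (uint8_t)((c << 1) | P);
}

/* returns 0 = clean, 1 = single error corrected, 2 = double error */
static int decode(uint8_t *cw, uint8_t *d)
{
    uint8_t w = *cw;
    int v[8];                       /* v[k] = bit at Hamming position k */
    for (int k = 1; k <= 7; k++) v[k] = bit(w, 8 - k);
    int P = bit(w, 0);

    int s1  = v[1]^v[3]^v[5]^v[7];
    int s2  = v[2]^v[3]^v[6]^v[7];
    int s3  = v[4]^v[5]^v[6]^v[7];
    int syn = s1 | (s2<<1) | (s3<<2);           /* = error position   */
    int par = P ^ v[1]^v[2]^v[3]^v[4]^v[5]^v[6]^v[7];

    if (syn && !par) return 2;                  /* double error       */
    if (syn)      *cw = w ^ (uint8_t)(1 << (8 - syn)); /* flip bad bit */
    else if (par) *cw = w ^ 1;                  /* parity bit itself  */

    w = *cw;
    *d = (uint8_t)((bit(w,5)<<3)|(bit(w,3)<<2)|(bit(w,2)<<1)|bit(w,1));
    return (syn || par) ? 1 : 0;
}

int main(void)
{
    uint8_t cw = encode(0xB);       /* data nibble 1011            */
    cw ^= 0x80;                     /* inject a single-bit upset   */
    uint8_t d;
    int r = decode(&cw, &d);
    printf("status %d, data 0x%X\n", r, d); /* -> status 1, data 0xB */
    return 0;
}

A scrubber simply calls decode() and, on status 1, writes the corrected
codeword back to memory.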
> And, when do you lose confidence in the memory subsystem?
That's a tough question. How do you measure confidence? The way I
approach the problem is very simple: how many errors per unit of time
can you tolerate? From that number, work out the probability that your
system will encounter an error and apply the necessary mitigation
techniques in order to meet the desired goal *within* a certain level of
confidence (3 sigma, 5 sigma, 7 sigma... typically this factor is
proportional to the criticality of your system).
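A minimal sketch of that "work backwards from a tolerable rate"
exercise, treating upsets as a Poisson process (the FIT figure, memory
size and mission length below are made-up inputs):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double fit_per_Mb = 1000.0;     /* assumed device soft-error rate */
    double size_Mb    = 64.0;       /* assumed memory size            */
    double hours      = 5 * 8760.0; /* assumed 5-year mission         */

    /* FIT = failures per 1e9 device-hours */
    double lambda = fit_per_Mb * size_Mb * hours / 1e9; /* expected upsets */
    double p_any  = 1.0 - exp(-lambda);  /* Poisson: P(at least one)      */

    printf("expected upsets over mission: %.2f\n", lambda);
    printf("P(>=1 upset)                : %.3f\n", p_any);
    return 0;
}

If P(>=1 upset) exceeds what your criticality level allows, you add
mitigation (ECC, scrubbing, redundancy) until it doesn't.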
> How many undetected errors are creeping in if some number of
> corrected/*uncorrected* errors occur??]
There's no such thing as 'undetected errors'. Your code provides you
with a means to *detect* and/or *correct* a certain class of errors.
Given the class of errors you want to protect against (because, in
terms of probability, they account for the bulk of your errors), your
code implementation will provide the necessary protection.
In the case of a server, which presumably uses non rad-hard components,
a reasonable estimate of the soft error rate for DRAM would range
between a few tens and a few thousand FIT per Mb, while SRAM is much
more sensitive, at around a few hundred thousand FIT per Mb. This is why
today's caches are ECC protected; otherwise they'd experience a fault
every half hour!
On the contrary, protecting DRAM may not be strictly necessary (once in
two years for a 10 Gib system). [1]
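For anyone wanting to plug in their own figures: FIT is failures per
1e9 device-hours, so the mean time between faults is simply
1e9 / (FIT_per_Mb * size_Mb) hours. A one-line converter, seeded with
the low end of the DRAM range quoted above:

#include <stdio.h>

int main(void)
{
    double fit_per_Mb = 10.0;           /* "few tens" FIT/Mb, low end */
    double size_Mb    = 10.0 * 1024.0;  /* 10 Gib system              */

    double mtbf_h = 1e9 / (fit_per_Mb * size_Mb);
    printf("one fault every %.0f hours (~%.1f years)\n",
           mtbf_h, mtbf_h / 8760.0);
    return 0;
}

which lands in the same ballpark as the "once in two years" figure.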
In the space market, EDAC is pretty much the standard approach when it
comes to memory, but alone it would not suffice. You need a scrubber
that reads the corrected data and writes it back. In the write operation
the faulty bit is reset to the correct value and no accumulation occurs.
Al
[1] a good reference for all those numbers:
http://lambda-diode.com/opinion/investigations/transmutations/.../ecc-memory-3