ECC or non/ECC Memory


Gregory Abbey

Oct 13, 1998
I'm building a 350 MHz Pentium-based Linux box with an Asus P2B
mainboard. I need to make a decision on whether to use ECC or non-ECC
RAM. Is there any overhead associated with error correction?? What
are the pros and cons?? I know that ECC will cost more!!

david kahana

Oct 13, 1998
Gregory Abbey wrote:

I think that there is some time overhead
for ECC relative to non ECC RAM. The ECC
algorithm takes some time to execute. I don't
think that it amounts to a major overhead in actual
use, maybe only a couple of percent, from an
extra wait state or so per memory read.

There used to be something called `parity' memory.
Probably there still is, though I've heard that
to save a (little) bit of money it's being eliminated
in some motherboard designs. Anyway, I have an old 286 with 640
KB of parity memory. Parity operates just as fast as
non-parity memory. It can detect one bit errors but not
correct them.
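
To make the parity idea concrete, here is a minimal Python sketch. It
assumes even parity over one 8-bit byte with a ninth stored bit; the
function names are made up for illustration and the real check happens
in hardware, not in software.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit for an 8-bit value: 1 if the count of 1 bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def check(byte: int, stored_parity: int) -> bool:
    """True if the byte still agrees with the parity bit stored beside it."""
    return parity_bit(byte) == stored_parity


data = 0b10110010
p = parity_bit(data)             # would be stored as the ninth bit
one_flip = data ^ 0b00000100     # simulate a single-bit error
two_flips = data ^ 0b00010100    # simulate a two-bit error

print(check(data, p))        # True: no error
print(check(one_flip, p))    # False: error detected, position unknown
print(check(two_flips, p))   # True: the two flips cancel and go unnoticed
```

The single parity bit only says "an odd number of bits changed", which
is why the error can be detected but not located or repaired.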

Therefore if you got a parity error, you were basically
dead. The machine crashed (I think the memory controller
generated an NMI and shut down the processor). Then the
BIOS told you there was a parity error. You could then
reboot and try again, or replace the memory, but that was
about it.

How common are these errors? I don't know in general.
It actually never happened to me on my 286, not even once,
and I used that machine for about three or four years, though
not continuously by any means. But I saw it happen on other
old PC's.

The ECC chipsets for PC's that I have heard of can
correct single bit errors in a 64 bit block of
information, and can detect 2, 3 or 4 bit errors
but can't correct them. An NMI is generated if one
of those comes up, and the processor shuts down.

In any case you usually have to enable ECC in the BIOS
if you have it on the motherboard. On some you can choose
ECC, parity only, or non-parity. The single bit errors
are corrected in ECC mode, and possibly recorded. If there
is any pattern to them, you probably have a hardware problem.
The operating system can keep records of this. Linux
can, I believe, and my motherboard actually keeps some
records in some buffers in the BIOS I think which I can check
on a reboot. I don't actually know how to get Linux to do it though.
Anyone?
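
One possible answer, as a hedged sketch: much later Linux kernels expose
corrected and uncorrected error counters through the EDAC subsystem in
sysfs, provided a driver exists for the memory controller. That is an
assumption about a newer system, not something the 2.1.x kernels in this
thread offer. The helper name read_edac_counts is made up.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs files.

    Returns an empty dict when the directory is missing, e.g. when no
    EDAC driver is loaded for the chipset.
    """
    counts = {}
    root = Path(base)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # controller directory without readable counters
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected count that keeps growing is the kind of pattern that would
point to a hardware problem.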

I don't think Win95 can do it at all. Maybe NT can but I have
no experience.

I have so far never seen any single bit errors on
my Linux box at home. It has ECC memory, is up more
or less continuously, has been running about 1.5 years
and I use it reasonably heavily for numerical calculations.
It's not a server or anything, but it is used to run
CPU/memory intensive programs that take a long time to
finish. That is why I used ECC memory.

I think I probably could have gotten away without using ECC,
but then again, maybe the ECC memory I bought is a bit higher
quality, and that's why I haven't gotten any errors. If I do
in the future, they would presumably be corrected,
too -- I am not looking forward to the day, but I'm not very
worried about it either ;)

In summary, I would say if you are not using your system
in a very critical spot, as a server which really shouldn't
go down much, and which should maybe give you warning if
the memory is about to go bad, then you probably can do
without ECC. Otherwise, it's probably worth it, and you will
probably not care about the extra cost in that situation.

Sorry for the long response.

cheers,

-dave k.

david kahana

Oct 13, 1998
david kahana wrote:

> How common are these errors? I don't know in general.

For the record, I found some measurements on the web
relevant to the rate of soft single bit errors expected from
cosmic ray background radiation. See:

http://net.wpi.edu/ram/ibmnasa/ibm

It has improved over time and is manufacturer dependent.
It seems to go from about 1 error per year in a 256 Kb chip in
1986 to as low as 0.00046 errors per year in a 4Mb chip in 1993.
But there is a range of about a factor of 100 depending on the
manufacturer.
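
To put those per-chip figures in system terms, here is a rough Python
sketch. It assumes "Mb" means megabits, that errors scale linearly with
the number of chips, and a hypothetical 128 MB machine built from 4Mb
chips. Reading the factor of 100 as a spread upward from the best figure
is also an assumption.

```python
def errors_per_year(total_megabits, chip_megabits, rate_per_chip):
    """Expected soft errors per year, assuming independent chips."""
    chips = total_megabits / chip_megabits
    return chips * rate_per_chip


total = 128 * 8                                # hypothetical 128 MB, in megabits
best = errors_per_year(total, 4, 0.00046)      # best 1993 figure quoted above
worst = best * 100                             # assumed manufacturer spread

print(f"{best:.3f} to {worst:.1f} soft errors per year")
print(f"roughly one error every {1 / best:.1f} years at best")
```

Under these assumptions the best chips give about one soft error in
eight to nine years, while the worst would give one about every month.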

I don't know whether cosmic rays are the main source of
errors. Seemingly electronic noise might matter too, and who
knows what else. But these seem to be actual measurements.

I suppose they were taken at sea level ....

NASA also did some research, by directly bombarding the
chips with proton beams:

http://flick.gsfc.nasa.gov/radhome/papers/d121696a.htm

But they only give you the cross-section per bit, for a
16Mb chip. You will have to work out the error rate given
the cosmic ray flux at your location. I'm leaving that as
an exercise for the reader :)
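
The shape of that exercise, as a minimal Python sketch: the rate is the
cross-section per bit times the particle flux times the number of bits.
The numbers below are placeholders chosen only to show the arithmetic.
They are NOT taken from the NASA paper or from any measurement.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def upsets_per_year(cross_section_cm2_per_bit, flux_per_cm2_per_s, bits):
    """Expected upsets per year = sigma * flux * bits * seconds in a year."""
    return cross_section_cm2_per_bit * flux_per_cm2_per_s * bits * SECONDS_PER_YEAR


# Placeholder inputs (hypothetical, for illustration only)
sigma = 1e-14          # cm^2 per bit
flux = 1e-3            # particles per cm^2 per second
bits = 16 * 2**20      # one 16Mb chip

print(upsets_per_year(sigma, flux, bits))
```

Substitute the cross-section from the paper and the flux for your
altitude to get a real estimate.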

cheers,

- dave k.


RobinHood @ Parts-Unknown Com

Oct 13, 1998
In article <3624c679.20332256@news>,

Gregory Abbey <g-abbey'nospam'@home.com> wrote:
>I'm building a 350 MHz Pentium-based Linux box with an Asus P2B
>mainboard. I need to make a decision on whether to use ECC or non-ECC
>RAM. Is there any overhead associated with error correction?? What
>are the pros and cons?? I know that ECC will cost more!!
>
>

You can get ECC memory with 8ns cycle time and 6ns access time; it
just costs more. I have worked in the UNIX server world for a while now
and memory does fail. On OS's other than Linux you can get reports about
your memory. Every few months we had issues: magnetic disturbance,
temperature fluctuations. Most errors were logged and corrected by the
ECC memory. It takes some of the randomness out of your system. It is
not cost effective unless you really have critical data on your system.

By the same token, I bought a 128MB 8ns/8ns NEC DIMM for my computer. I
can't stand it when computers behave erratically. My NT machine at my office
is continually having memory troubles, and I believe it is having them
more often than it reports them, which is part of the reason it reboots
and my compiler crashes now and then.... There is also the shitty software
theory... But my old Compaq had different instability problems.

Most memory failures don't crash systems, they just make them act
weird, and they wreak havoc on your compiling.


--
-R*S

Henrik Carlqvist

Oct 13, 1998
david kahana wrote:

> Gregory Abbey wrote:
> > Is there any overhead associated with error correction??
>
> I think that there is some time overhead
> for ECC relative to non ECC RAM. The ECC
> algorithm takes some time to execute.

As far as I know it is done in hardware so you will not lose any
performance with ECC.

> Parity operates just as fast as non-parity memory. It can detect one
> bit errors but not correct them.

Yes, that is also done in hardware.

> Therefore if you got a parity error, you were basically
> dead. The machine crashed (I think the memory controller
> generated an NMI and shut down the processor).

It generates an NMI, however, it's up to the OS to shut down. Linux
doesn't shut down, it only gives a message in syslog. It might not be a
bad idea to shut down, a bad memory could cause even worse things to
happen.

> How common are these errors?

I have seen parity errors and ecc errors on both PCs and Suns. However,
this is not always because of bad memory. Just as often it has been
because of a bad motherboard or oxide on the simms.

> If there is any pattern to them, you probably have a hardware
> problem. The operating system can keep records of this. Linux
> can, I believe, and my motherboard actually keeps some
> records in some buffers in the BIOS I think which I can check
> on a reboot. I don't actually know how to get Linux to do it though.
> Anyone?

No, I don't know how to make Linux do this.

> I think I probably could have gotten away without using ECC,
> but then again, maybe the ECC memory I bought is a bit higher
> quality, and that's why I haven't gotten any errors. If I do
> in the future, they would presumably be corrected,
> too -- I am not looking forward to the day, but I'm not very
> worried about it either ;)

And best of all, you know that you can trust your memory and your
program results.

regards Henrik
--
spammer strikeback:
root@localhost te...@AOCI.COM extr...@WWNET.NET tech-c...@WWNET.NET
t...@NET-SHOPPE.COM tra...@INTEGRACOM.NET ad...@INTEGRACOM.NET

david kahana <dek@bnl.gov>

Oct 15, 1998
to henrik.c...@swipnet.se
Henrik Carlqvist wrote:

> david kahana wrote:
>> I think that there is some time overhead
>> for ECC relative to non ECC RAM. The ECC
>> algorithm takes some time to execute.
>
> As far as I know it is done in hardware so you will not lose any
> performance with ECC.

Yes, for sure it's done in hardware, otherwise the time cost
would presumably be horrific. But hardware runs at a finite
speed, and I had heard that ECC was a bit slower than a simple
parity check, which is a pretty trivial thing. On a read of
a 64-bit + 8-bit block with ECC, you have to compute some
7-bit checksum for the block and compare it against the stored
7-bit checksum, as well as read out the whole data block. For
parity it's more or less the same, you have to read out and
do a comparison too, but the computation is seemingly
much simpler.

With ECC you know which bit has changed if there is an error,
whereas with parity you don't. There must be some cost for that,
if not in time, then in extra transistors.
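
For anyone curious how the "which bit changed" part can work, here is a
Python sketch of an extended Hamming code over a 64-bit word. It is one
textbook way to do single-error correction with double-error detection,
and it is an assumption that any given chipset uses this exact layout.
It uses 7 Hamming check bits plus one overall parity bit, which fills
the 8 extra bits of a 72-bit module.

```python
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]     # Hamming check-bit positions
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]   # 64 positions


def encode(word):
    """Return 72 bits: index 0 is overall parity, 1..71 the Hamming code."""
    code = [0] * 72
    for i, pos in enumerate(DATA_POS):
        code[pos] = (word >> i) & 1
    for p in PARITY_POS:
        # check bit p gives even parity over positions whose index has bit p set
        code[p] = sum(code[j] for j in range(1, 72) if j & p) % 2
    code[0] = sum(code) % 2               # makes the parity of all 72 bits even
    return code


def decode(code):
    """Return (word, status): status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 72):
        if code[pos]:
            syndrome ^= pos               # XOR of the positions of all set bits
    odd_overall = sum(code) % 2 == 1
    if syndrome == 0 and not odd_overall:
        status = "ok"
    elif odd_overall and syndrome < 72:
        code[syndrome] ^= 1               # syndrome is the flipped position
        status = "corrected"
    else:
        return None, "uncorrectable"      # e.g. two bits flipped
    word = 0
    for i, pos in enumerate(DATA_POS):
        word |= code[pos] << i
    return word, status


word = 0x0123456789ABCDEF
one = encode(word)
one[37] ^= 1                              # single-bit error
two = list(one)
two[5] ^= 1                               # second error in the same word
print(decode(one)[1])                     # corrected
print(decode(two))                        # (None, 'uncorrectable')
```

The syndrome is just a handful of XOR trees, so in hardware the cost is
mostly extra gates and some gate delay on the read path. Note that this
scheme only guarantees detection of two-bit errors; three or more flips
can be misread as a correctable single error.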

I thought it is similar to the way some complex instructions
(on non-RISC machines) can take more cycles to execute in
the processor than for example a simple integer addition does.

Maybe I'm wrong, though, and it actually is done quickly
enough that the memory doesn't operate any slower. I don't
know for sure. But I seem to remember someone I trusted
to know such things saying it.

Do you know the actual way ECC is done, timings and such?

I really don't, and I'm not trying to be snippy, I just
would like to know for sure.

>> Therefore if you got a parity error, you were basically
>> dead. The machine crashed (I think the memory controller
>> generated an NMI and shut down the processor).
>>
> It generates an NMI, however, it's up to the OS to shut down. Linux
> doesn't shut down, it only gives a message in syslog. It might not be a
> bad idea to shut down, a bad memory could cause even worse things to
> happen.

Yes you are definitely right about that. I was thinking of
the behaviour under DOS when I said that :) And it's true
enough, too, that it could be better to shut down in some cases.

>> I think I probably could have gotten away without using ECC,
>> but then again, maybe the ECC memory I bought is a bit higher
>> quality, and that's why I haven't gotten any errors. If I do
>> in the future, they would presumably be corrected,
>> too -- I am not looking forward to the day, but I'm not very
>> worried about it either ;)
>
> And best of all, you know that you can trust your memory and your
> program results.

Absolutely right. For me that is the main advantage.

cheers,

- dave k.

Henrik Carlqvist

Oct 15, 1998
david kahana wrote:
> With ECC you know which bit has changed if there is an error,
> whereas with parity you don't. There must be some cost for that,
> if not in time, then in extra transistors.

> Do you know the actual way ECC is done, timings and such?

I would guess that it is all done in hardware without any performance
loss. ECC memory costs some extra and not all motherboards support
ECC, which makes me guess that you will have to pay some extra to get ECC
support. However, there is one way to find out if no one knows. As I have
ECC memory I could try to run a benchmark like lmbench without parity
check, with parity check and with ECC. Then we could see if there is any
difference.

Eric Lee Green

Oct 16, 1998
On 13 Oct 1998 18:50:26 GMT, RobinHood @ Parts-Unknown Com <sp...@msn.com> wrote:
>In article <3624c679.20332256@news>,
>Gregory Abbey <g-abbey'nospam'@home.com> wrote:
>>I'm building a 350 MHz Pentium-based Linux box with an Asus P2B
>>mainboard. I need to make a decision on whether to use ECC or non-ECC
>>RAM. Is there any overhead associated with error correction?? What
>>are the pros and cons?? I know that ECC will cost more!!

>ECC memory. It takes some of the Randomness out of your system. It is
>not cost effective unless you really have critical data on your system.

It's not really that much more expensive nowadays, especially if you're
paying PC-100 rates anyhow.

We made a "command decision" a couple of months ago that we were no longer
going to bother with non-ECC PC-100 memory. The cost difference wasn't
that much, and the gains were too great. We figure that if somebody is paying
the price for a PII-350 and PC-100 memory, chopping $50 off the cost of the
system by using non-ECC memory is false economy.

--
Eric Lee Green er...@linux-hw.com http://www.linux-hw.com/~eric
"To call Microsoft an innovator is like calling the Pope Jewish ..."
-- James Love (Consumer Project on Technology)

david kahana <dek@bnl.gov>

Oct 16, 1998
Henrik Carlqvist wrote:

> david kahana wrote:
> > With ECC you know which bit has changed if there is an error,
> > whereas with parity you don't. There must be some cost for that,
> > if not in time, then in extra transistors.
>
> > Do you know the actual way ECC is done, timings and such?
>
> I would guess that it is all done in hardware without any performance
> loss. ECC memory costs some extra and not all motherboards support
> ECC, which makes me guess that you will have to pay some extra to get ECC
> support. However, there is one way to find out if no one knows. As I have
> ECC memory I could try to run a benchmark like lmbench without parity
> check, with parity check and with ECC. Then we could see if there is any
> difference.

Sounds like the best idea. I will try the same thing too, if
I can manage it. I think that not all ECC memory can be
run as parity memory though, since the extra eight bits
used to store the ECC checksum can't always be read
out individually. The only way to see if it will work is
to try it.

Please let me know how it comes out, and I'll do the
same.

cheers,

- dave k.

david kahana

Oct 18, 1998
to henrik.c...@swipnet.se
david kahana wrote:

I downloaded the lmbench package and ran it on my system:
Intel PR440FX, 2 x PPro 200MHz, 8k L1 cache, 256k L2 cache,
cache line size 64 bytes, processors not matched, one has
stepping 07 the other 09. However that is supposed to work
according to Intel ...

My kernel is 2.1.117.

This is a very nice package. I ran it both with ECC enabled in
the BIOS (AMI 1.0.0.8DI0) and with ECC disabled. I have
192 MB of ECC EDO DIMMs installed. I will put the graphs
of memory load latency on my machine at work for download
(not until this evening):

ftp://bnlnth.phy.bnl.gov/pub/linux/ecc/ecc.ps.gz

for the ECC enabled results, and:

ftp://bnlnth.phy.bnl.gov/pub/linux/ecc/noecc.ps.gz

for the non-ECC results.

The processor cycles at 5 nanoseconds. L1 cache
has a latency of about 20 nanoseconds, L2 of 30-40,
depending on the stride.

The conclusion: main memory back to back load
latency is 250 nanoseconds with ECC enabled. The
only stride which differs from this value is stride 16,
at 175 nanoseconds.

With ECC disabled the picture is more complicated.
The latency to main memory is dependent on the
stride. The values generally cluster at 200-220
nanoseconds, with the fastest at 175ns (stride 16 again),
and the slowest at 225ns (stride 1028).

So it looks like there is a time cost for the ECC,
for whatever reason. It amounts to some 50
nanoseconds, or 10 extra processor cycles.
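
The arithmetic behind that estimate, as a small Python sketch using only
the figures above (200 MHz clock, 250 ns with ECC, 200-220 ns without).
The function name ecc_penalty is made up for illustration.

```python
def ecc_penalty(ecc_ns, base_ns, cpu_mhz):
    """Return (extra ns, extra CPU cycles, percent slower) for one load."""
    cycle_ns = 1000 / cpu_mhz             # 200 MHz gives 5 ns per cycle
    extra = ecc_ns - base_ns
    return extra, extra / cycle_ns, 100 * extra / base_ns


for base in (200, 220):                   # typical cluster with ECC disabled
    extra, cycles, pct = ecc_penalty(250, base, 200)
    print(f"vs {base} ns: {extra} ns = {cycles:.0f} cycles, {pct:.0f}% slower")
```

So the penalty is somewhere between 6 and 10 cycles per uncached load,
depending on which non-ECC figure is taken as the baseline.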

However, on all other system tests, there is no
distinguishable difference between ECC enabled
and ECC disabled. Memory bandwidth,
context switches, everything else looks just the
same. So I don't think it will be a serious cost in
normal use.

One anomaly I noticed, is that the L2 cache, which
is supposed to be 256kB, actually seems to degrade
in performance at an array size of 128kB. I don't know
what the cause could be, but it is strange. Maybe a result
of my mismatched processors??

All the tests were run with the system in single user mode,
with a quiet system, no network connections up. That, I
found out, makes quite a difference to the results.

cheers,

- dave k.

