nfs-server silent data corruption

Arno J. Klaassen

Apr 20, 2008, 7:37:00 PM

to sta...@freebsd.org, n...@freebsd.org

Hello,

I've a strange problem with a box I'm setting up as nfs-server
under 7-stable :

- tyan S2895 MB, 2*285Dualcore Opteron, 4G-ECC, ahd-scsi, nfe-network
- stripped GENERIC as kernel
- sources as of last saturday afternoon (European time)

I removed everything from /boot/loader.conf and /etc/sysctl.conf, still
I get "easily" data corruption when exporting ahd-scsi over nfs
(NB exporting geom_raid5 gives same data corruption)

Testing with the following pseudo code :

while checksum1 == checksum2 do
    create random file of $1 MBytes
    calculate md5 checksum1
    copy
    calculate md5 checksum2 on copy
done

Tested on both (as nfs-client) a 6-stable-i386 from a couple of weeks
ago as well as a linux 2.6.15-gentoo-r1 of about two years ago :
within half an hour the copy will be different .... ;(

I played with nfs-options on client side (nfs[23], conn, intr, [udp|tcp],
-r=, -w= ) but none seem to matter.

Starting/stopping rpc.lockd/rpc.statd on server/client just provoked some :

cp: utimes: BIG2: No such file or directory
cp: chown: BIG2: Stale NFS file handle
cp: chmod: BIG2: Stale NFS file handle
cp: chflags: BIG2: Operation not supported
cp: BIG2: Stale NFS file handle
cp: setting permissions for `BIG2': Stale NFS file handle
cp: closing `BIG2': Stale NFS file handle

[and then the while loop continued ... as if the NFS handle were not
that stale ..]

Anyway, I'll try to nail this down more (e.g. nfs-write performance
is horrible ... (nfsd falling down to 0% cpu and then after a while
'waking up' and being at around 3-6% again))

I didn't stress-test this MB for a while, but last time I did was
with 7-PRELEASE/RC?/CANTremember-exactly-but-close-to-release
and all worked great

I did add 2G ECC to the 2nd CPU since, though I doubt that interferes
with NFS.

In short, if anyone has a suggestion ???? (I will try downgrading
to RELENG_7_0 iff no one has a new suggestion for RELENG_7, but I'd like
to go forward and test some maybe suspect recent MFC or other
suggestion)

Thanx in advance,

best, Arno

_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stabl...@freebsd.org"

Kris Kennaway

Apr 21, 2008, 5:48:21 AM

to Arno J. Klaassen, sta...@freebsd.org, n...@freebsd.org

On Mon, Apr 21, 2008 at 01:02:33AM +0200, Arno J. Klaassen wrote:

> I didn't stress-test this MB for a while, but last time I did was
> with 7-PRELEASE/RC?/CANTremember-exactly-but-close-to-release
> and all worked great
>
> I did add 2G ECC to the 2nd CPU since, though I doubt that interferes
> with NFS.

Uh, you're getting server-side data corruption, it could definitely be
because of the memory you added.

Kris

--
In God we Trust -- all others must submit an X.509 certificate.
-- Charles Forsythe <fors...@alum.mit.edu>

Arno J. Klaassen

Apr 21, 2008, 10:56:21 AM

to Kris Kennaway, sta...@freebsd.org, Clayton Milos, n...@freebsd.org

Kris Kennaway <kr...@FreeBSD.ORG> writes:

> On Mon, Apr 21, 2008 at 01:02:33AM +0200, Arno J. Klaassen wrote:
>
> > I didn't stress-test this MB for a while, but last time I did was
> > with 7-PRELEASE/RC?/CANTremember-exactly-but-close-to-release
> > and all worked great
> >
> > I did add 2G ECC to the 2nd CPU since, though I doubt that interferes
> > with NFS.
>
> Uh, you're getting server-side data corruption, it could definitely be
> because of the memory you added.

yop, though I'm still not convinced the memory is bad (the very same
Kingston ECC as the 2*1G in use for about half a year already) :

I added it directly to the 2nd CPU (diagram on page 9 of
http://www.tyan.com/manuals/m_s2895_101.pdf) and the problem
seems to be the interaction between nfe0 and powerd .... :

- if I stop powerd, problems go away
- if I let powerd run but turn off txcsum and tso4 on the interface,
the problem is a lot harder to produce (if ever this gives
a hint to anyone)

Device is :

nfe0@pci0:0:10:0: class=0x068000 card=0x289510f1 chip=0x005710de rev=0xa3 hdr=0x00
    vendor     = 'Nvidia Corp'
    device     = 'nForce4 Ultra NVidia Network Bus Enumerator'
    class      = bridge
    cap 01[44] = powerspec 2  supports D0 D1 D2 D3  current D0

(this is with the default BIOS setting " LAN Bridge Enabled", disabling
that setting makes pciconf say "class = network" but does not influence
my problem)

I will restart my tests now by populating all 4G to only CPU1 and
say whether that matters.

Best, Arno

Jeremy Chadwick

Apr 21, 2008, 11:45:09 AM

to Arno J. Klaassen, Clayton Milos, Kris Kennaway, sta...@freebsd.org, n...@freebsd.org

On Mon, Apr 21, 2008 at 04:52:55PM +0200, Arno J. Klaassen wrote:
> Kris Kennaway <kr...@FreeBSD.ORG> writes:
> > Uh, you're getting server-side data corruption, it could definitely be
> > because of the memory you added.
>
> yop, though I'm still not convinced the memory is bad (the very same
> Kingston ECC as the 2*1G in use for about half a year already) :

Can you download and run memtest86 on this system, with the added 2G ECC
installed? memtest86 doesn't guarantee showing signs of memory problems,
but in most cases it'll start spewing errors almost immediately.

One thing I did notice in the motherboard manual below is something
called "Hammer Configuration". It appears to default to 800MHz, but
there's an "Auto" choice. Does using Auto fix anything?

> I added it directly to the 2nd CPU (diagram on page 9 of
> http://www.tyan.com/manuals/m_s2895_101.pdf) and the problem
> seems to be the interaction between nfe0 and powerd .... :

That board is the weirdest thing I've seen in years.

Two separate CPUs using a single (shared) memory controller, two
separate (and different!) nVidia chipsets, a SMSC I/O controller
probably used for serial and parallel I/O, two separate nVidia NICs with
Marvell PHYs (yet somehow you can bridge the two NICs and PHYs?), two
separate PCI-e busses (each associated with a separate nVidia chipset),
two separate PCI-X busses... the list continues.

I know you don't need opinions at this point, but what a behemoth. I
can't imagine that thing running reliably.

> - if I stop powerd, problems go away

This would imply that clock frequency stepping is somehow contributing
to the corruption. I don't see any BIOS options for controlling
things related to AMD's Cool-n-Quiet or PowerNow! feature, which is
usually what handles this.

> - if I let powerd run but turn off txcsum and tso4 on the interface,
> the problem is a lot harder to produce (if ever this gives
> a hint to anyone)

Possibly shared interrupts are causing problems? MSI/MSI-X doing
something odd? Have you tried disabling MSI/MSI-X and see if it makes a
difference?

Can you boot the machine in verbose mode, and put the dmesg up
somewhere?

> Device is :
>
> nfe0@pci0:0:10:0: class=0x068000 card=0x289510f1 chip=0x005710de rev=0xa3 hdr=0x00
> vendor = 'Nvidia Corp'
> device = 'nForce4 Ultra NVidia Network Bus Enumerator'
> class = bridge
> cap 01[44] = powerspec 2 supports D0 D1 D2 D3 current D0
>
> (this is with the default BIOS setting " LAN Bridge Enabled", disabling
> that setting makes pciconf say "class = network" but does not influence
> my problem)

I think you mean "MAC LAN Bridge", according to the motherboard manual.
I'm not even sure what that really does; somehow trunks the two NICs
together to give you the equivalent of 2000mbit of traffic? I don't
know.

Does the corruption you see go away if you install a separate NIC (e.g.
an Intel NIC) in a PCI or PCI-e slot, and disable the onboard NICs
(should be "MAC LAN: Disable" on both the primary and slave)?

Mike Tancsa

Apr 21, 2008, 12:07:39 PM

to Arno J. Klaassen, sta...@freebsd.org

At 10:52 AM 4/21/2008, Arno J. Klaassen wrote:

>Device is :
>
>nfe0@pci0:0:10:0: class=0x068000 card=0x289510f1
>chip=0x005710de rev=0xa3 hdr=0x00
> vendor = 'Nvidia Corp'
> device = 'nForce4 Ultra NVidia Network Bus Enumerator'
> class = bridge
> cap 01[44] = powerspec 2 supports D0 D1 D2 D3 current D0
>
>(this is with the default BIOS setting " LAN Bridge Enabled", disabling
> that setting makes pciconf say "class = network" but does not influence
> my problem)
>
>I will restart my tests now by populating all 4G to only CPU1 and
>say whether that matters.

Hi,
How long does it take for the problem to show up ? I have what
appears to be a very similar Tyan board (I have a Socket 939 X2 CPU)
with the same NIC, but this one is running RELENG_7 from April
17th. There have been a few fixes for the nfe driver since 7.0

I am running this small script below on a nfs client (em nic) against
the server (nfe) ( mount options on the client 192.168.245.1:/backup
/backup nfs rw,-r=32768,-w=32768,tcp,noauto )

#!/bin/sh
i=0
while true
do
    i=`expr $i + 1`
    # write an 80 MB random file locally, then copy it to the NFS mount
    dd if=/dev/urandom of=/tmp/junk.txt bs=1024 count=81920 > /dev/null 2>&1
    cp -p /tmp/junk.txt /backup/
    orig=`md5 -q /tmp/junk.txt`
    # remount so the copy is read back from the server, not from the client cache
    umount /backup
    sleep 2
    mount /backup
    copy=`md5 -q /backup/junk.txt`
    echo "$orig and $copy on $i"
    if [ "$orig" != "$copy" ]; then
        echo "\a copy not ok on $i"
        exit 255
    fi
done

On the server, I have

nfe0@pci0:0:10:0: class=0x068000 card=0x286510f1 chip=0x005710de rev=0xa3 hdr=0x00
    vendor     = 'Nvidia Corp'
    device     = 'nForce4 Ultra NVidia Network Bus Enumerator'
    class      = bridge
    cap 01[44] = powerspec 2  supports D0 D1 D2 D3  current D0

# ifconfig nfe0
nfe0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=10b<RXCSUM,TXCSUM,VLAN_MTU,TSO4>
        ether 00:e0:81:58:91:6a
        inet 192.168.245.1 netmask 0xffffff00 broadcast 192.168.245.255
        media: Ethernet autoselect (1000baseTX <full-duplex,flag0,flag1>)
        status: active

How long does it take for the problem to come up ?

---Mike

Erik Trulsson

Apr 21, 2008, 12:47:35 PM

to Jeremy Chadwick, Clayton Milos, Kris Kennaway, sta...@freebsd.org, n...@freebsd.org

On Mon, Apr 21, 2008 at 08:43:33AM -0700, Jeremy Chadwick wrote:
> On Mon, Apr 21, 2008 at 04:52:55PM +0200, Arno J. Klaassen wrote:
> > Kris Kennaway <kr...@FreeBSD.ORG> writes:
> > > Uh, you're getting server-side data corruption, it could definitely be
> > > because of the memory you added.
> >
> > yop, though I'm still not convinced the memory is bad (the very same
> > Kingston ECC as the 2*1G in use for about half a year already) :
>
> Can you download and run memtest86 on this system, with the added 2G ECC
> installed? memtest86 doesn't guarantee showing signs of memory problems,
> but in most cases it'll start spewing errors almost immediately.
>
> One thing I did notice in the motherboard manual below is something
> called "Hammer Configuration". It appears to default to 800MHz, but
> there's an "Auto" choice. Does using Auto fix anything?
>
> > I added it directly to the 2nd CPU (diagram on page 9 of
> > http://www.tyan.com/manuals/m_s2895_101.pdf) and the problem
> > seems to be the interaction between nfe0 and powerd .... :
>
> That board is the weirdest thing I've seen in years.
>
> Two separate CPUs using a single (shared) memory controller,

No. Each CPU contains its own memory controller (just like all of AMD's
Opteron/Athlon64 CPUs do.)

> two
> separate (and different!) nVidia chipsets,

More like one chipset consisting of several physical chips. (Which is
actually quite common. The most common division is a
"northbridge/southbridge" division, but other ways are possible too.)

The only unusual thing is that there are several chips connected directly
to the CPUs, instead of having the CPUs talk to a single chip which in
turn talks to another chip, which can easily create bottlenecks.

> a SMSC I/O controller
> probably used for serial and parallel I/O

Just like almost all other motherboards.

>, two separate nVidia NICs with
> Marvell PHYs (yet somehow you can bridge the two NICs and PHYs?)

What is so weird about that? If you want to have more than one ethernet
connection, then you normally have more than one NIC.
Bridging can easily (and commonly) be done over separate NICs.

>, two
> separate PCI-e busses (each associated with a separate nVidia chipset),

Since it is always the case that each PCI-E slot or PCI-E device sits
on its own bus I fail to see anything strange about that.
(And it is actually very common to have the PCI-E slots on motherboards
be connected to different chips.)

> two separate PCI-X busses... the list continues.

Having more than one PCI-X bus used to be fairly common on server boards for
performance reasons. Nowadays PCI-X is slowly being replaced with PCI-E,
so on the latest generation of server boards there is usually no more than
one PCI-X bus.

>
> I know you don't need opinions at this point, but what a behemoth. I
> can't imagine that thing running reliably.

I would rather say it is a quite elegant design for a high-end motherboard
intended for server/workstation installations.

It is a dual-socket Opteron board. Each Opteron has its own
memory-controller and uses HyperTransport to connect to other components.
Each dual-socket Opteron has three HyperTransport links available.
One from each CPU will be needed to the other CPU, leaving two links from
each CPU available to connect to other chips. From that starting point it
is a fairly obvious design.
To maximise the available bandwidth one would want to spread out the chips
over these links, which this motherboard does fairly well, using three
of the four available links.
(And hanging the most important things from CPU0, so you can actually use
the board even if you have only one CPU installed.)

As for reliability I see no particular reason for that board to be less
reliable than any other multi-CPU board.

--
<Insert your favourite quote here.>
Erik Trulsson
ertr...@student.uu.se

Arno J. Klaassen

Apr 21, 2008, 1:11:51 PM

to Jeremy Chadwick, sta...@freebsd.org

Hello,

Jeremy Chadwick <koi...@freebsd.org> writes:

> On Mon, Apr 21, 2008 at 04:52:55PM +0200, Arno J. Klaassen wrote:
> > Kris Kennaway <kr...@FreeBSD.ORG> writes:
> > > Uh, you're getting server-side data corruption, it could definitely be
> > > because of the memory you added.

[ .. stuff deleted; I'll answer in more detail later ..]

> Can you boot the machine in verbose mode, and put the dmesg up
> somewhere?

attached.

Jeremy Chadwick

Apr 21, 2008, 1:30:52 PM

to Arno J. Klaassen, Clayton Milos, Kris Kennaway, sta...@freebsd.org, n...@freebsd.org

On Mon, Apr 21, 2008 at 06:24:45PM +0200, Erik Trulsson wrote:
> As for reliability I see no particular reason for that board to be less
> reliable than any other multi-CPU board.

Sorry for my complete and total opinionated noise, then.

Arno J. Klaassen

Apr 21, 2008, 2:32:58 PM

to Jeremy Chadwick, Kris Kennaway, sta...@freebsd.org

yet another quick partial answer :

Jeremy Chadwick <koi...@freebsd.org> writes:

> On Mon, Apr 21, 2008 at 04:52:55PM +0200, Arno J. Klaassen wrote:
> > Kris Kennaway <kr...@FreeBSD.ORG> writes:
> > > Uh, you're getting server-side data corruption, it could definitely be
> > > because of the memory you added.
> >
> > yop, though I'm still not convinced the memory is bad (the very same
> > Kingston ECC as the 2*1G in use for about half a year already) :
>
> Can you download and run memtest86 on this system, with the added 2G ECC
> installed? memtest86 doesn't guarantee showing signs of memory problems,
> but in most cases it'll start spewing errors almost immediately.

It has been running for 15 minutes now without any warning; I'll let it run
while cooking a meal [ with 2*1G mem for each CPU to be clear ].

NB, (CC to kris@ for this) why is memtest86 port marked as i386-only?
It only seems to install floppy.bin and memtest.iso, but alas
(maybe I should leave one box dedicated to freebsd-i386 for things
like this ;) )

Best, Arno

Peter Jeremy

Apr 21, 2008, 3:20:31 PM

to Arno J. Klaassen, sta...@freebsd.org

On Mon, Apr 21, 2008 at 08:30:48PM +0200, Arno J. Klaassen wrote:
>NB, (CC to kris@ for this) why is memtest86 port marked as i386-only?

Basically because it's a bootable i386 binary image.

>(maybe I should leave one box dedicated to freebsd-i386 for things
>like this ;) )

No need - just download the image, burn it to a CD and boot the CD.

--
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed behaviour.

Arno J. Klaassen

Apr 21, 2008, 5:48:09 PM

to Jeremy Chadwick, Clayton Milos, Kris Kennaway, sta...@freebsd.org, n...@freebsd.org

re,

Jeremy Chadwick <koi...@freebsd.org> writes:

> On Mon, Apr 21, 2008 at 04:52:55PM +0200, Arno J. Klaassen wrote:
> > Kris Kennaway <kr...@FreeBSD.ORG> writes:
> > > Uh, you're getting server-side data corruption, it could definitely be
> > > because of the memory you added.
> >
> > yop, though I'm still not convinced the memory is bad (the very same
> > Kingston ECC as the 2*1G in use for about half a year already) :
>
> Can you download and run memtest86 on this system, with the added 2G ECC
> installed? memtest86 doesn't guarantee showing signs of memory problems,
> but in most cases it'll start spewing errors almost immediately.

it finished in a bit less than 3 hours without a single error/warning

I feel pretty confident all memory is fine

> One thing I did notice in the motherboard manual below is something
> called "Hammer Configuration". It appears to default to 800MHz, but
> there's an "Auto" choice. Does using Auto fix anything?

Nope

> > I added it directly to the 2nd CPU (diagram on page 9 of
> > http://www.tyan.com/manuals/m_s2895_101.pdf) and the problem
> > seems to be the interaction between nfe0 and powerd .... :
>
> That board is the weirdest thing I've seen in years.

;) I agree I raised my eyebrows the first time I saw that
diagram

> Two separate CPUs using a single (shared) memory controller, two
> separate (and different!) nVidia chipsets, a SMSC I/O controller
> probably used for serial and parallel I/O, two separate nVidia NICs with
> Marvell PHYs (yet somehow you can bridge the two NICs and PHYs?), two
> separate PCI-e busses (each associated with a separate nVidia chipset),
> two separate PCI-X busses... the list continues.

some may say "it's just four wheels, an engine and a steering wheel", but she looks
different compared to most others

> I know you don't need opinions at this point, but what a behemoth. I
> can't imagine that thing running reliably.

though it does ;) (till the day I decided she deserved a -stable upgrade
and 2 more gigs ...)

> > - if I stop powerd, problems go away
>
> This would imply that clock frequency stepping is somehow contributing
> to the corruption. I don't see any BIOS options for controlling
> things related to AMD's Cool-n-Quiet or PowerNow! feature, which is
> usually what handles this.

you can turn it on/off; anyway, the problem *seems* easy to reproduce
when freq drops quickly from 2600 MHz to 1000 MHz ....
I just inspected a few corrupted copies, but out of 10-200 MBytes
just 1 byte was 0 instead of \t
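
A single wrong byte like this can be located by comparing the original and the copy byte by byte instead of only by checksum. The following is a minimal sketch, assuming both files are still readable on the client; the helper names list_diff_bytes and count_diff_bytes are made up for illustration, and the only tool relied on is POSIX "cmp -l":

```shell
#!/bin/sh
# Sketch only: list_diff_bytes and count_diff_bytes are hypothetical helpers,
# not something used in the thread.

# Print one line per differing byte: the 1-based byte offset (decimal),
# followed by the byte value in each file (octal), as "cmp -l" reports it.
list_diff_bytes() {
    cmp -l "$1" "$2"
}

# Print only the number of differing bytes (0 if the files are identical).
count_diff_bytes() {
    cmp -l "$1" "$2" | wc -l | tr -d ' '
}
```

With the file names from the test script, "count_diff_bytes BIG BIG2" would print how many bytes differ, and "list_diff_bytes BIG BIG2" would show where; a tab replaced by a NUL shows up as the octal pair 11 and 0.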

> > - if I let powerd run but turn off txcsum and tso4 on the interface,
> > the problem is a lot harder to produce (if ever this gives
> > a hint to anyone)
>
> Possibly shared interrupts are causing problems?

don't think so; I first had two Promise TX4 cards in this box instead of
the Marvell 8port card; since I had problems with TX4 some time
ago I first suspected them. The board is still running memtest86, but
from the dmesg I posted I don't see a shared irq.

> MSI/MSI-X doing
> something odd? Have you tried disabling MSI/MSI-X and see if it makes a
> difference?

MSI is disabled as is PCI-e Error reporting (or something like
that)

>
> I think you mean "MAC LAN Bridge", according to the motherboard manual.
> I'm not even sure what that really does; somehow trunks the two NICs
> together to give you the equivalent of 2000mbit of traffic? I don't
> know.

probably; I never tried ;) I need the second NIC for a separate
subnet

> Does the corruption you see go away if you install a separate NIC (e.g.
> an Intel NIC) in a PCI or PCI-e slot, and disable the onboard NICs
> (should be "MAC LAN: Disable" on both the primary and slave)?

Don't have one available right now (for a 2U server).
I will test if I do not find another solution.

Thanx, Arno

Arno J. Klaassen

Apr 21, 2008, 5:59:06 PM

to Mike Tancsa, sta...@freebsd.org

Hello,

Mike Tancsa <mi...@sentex.net> writes:

> At 10:52 AM 4/21/2008, Arno J. Klaassen wrote:
>
> >Device is :
> >
> > nfe0@pci0:0:10:0: class=0x068000 card=0x289510f1
> > chip=0x005710de rev=0xa3 hdr=0x00
> > vendor = 'Nvidia Corp'
> > device = 'nForce4 Ultra NVidia Network Bus Enumerator'
> > class = bridge
> > cap 01[44] = powerspec 2 supports D0 D1 D2 D3 current D0
> >
> >(this is with the default BIOS setting " LAN Bridge Enabled", disabling
> > that setting makes pciconf say "class = network" but does not influence
> > my problem)
> >
> >I will restart my tests now by populating all 4G to only CPU1 and
> >say whether that matters.
>
> Hi,
> How long does it take for the problem to show up ?

Less than an hour in general (running the same client script
simultaneously on a 100Mbps linux box and a 1Gbps bsd6-x86)

quite the same as what I do (apart from the umount/sleep/mount and I
use same partition for write and copy) :

SIZE=$1

COUNTER=${2:-20}

until [ "$COUNTER" -lt 1 ]; do
    echo "**** Still $COUNTER iterations to go *** "
    echo
    echo -n "Creating random file of $SIZE MBytes ..."
    dd if=/dev/random of=BIG bs=1048576 count="${SIZE}" > /dev/null 2>&1
    echo Done
    echo -n "Calculating md5 checksum ..."
    CS1=`md5 -q BIG`
    echo Done
    echo -n "Copying file ..."
    cp -fp BIG BIG2
    echo Done
    echo -n "Calculating md5 checksum ..."
    CS2=`md5 -q BIG2`
    echo Done
    # stop at the first copy whose checksum differs from the original
    if [ "${CS1}" != "${CS2}" ]; then
        echo CHECKSUM MISMATCH
        exit 1
    else
        echo
    fi
    COUNTER=$((COUNTER - 1))
done

for info, I test with args '38 999' (38M, try 999 times) on linux
(slightly adapted script BTW) and '138 999' on bsd. The best 'score' I
got was 'still 871 iterations to go'
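
The Linux adaptation of the script is not shown in the thread. One plausible difference is the checksum tool: FreeBSD ships md5 (where -q prints only the digest), while Linux boxes usually have md5sum. The following is a hedged sketch of a helper that hides that difference; file_md5 and copy_matches are hypothetical names, not part of the original script:

```shell
#!/bin/sh
# Sketch only: file_md5 and copy_matches are hypothetical helpers.

# Print the bare MD5 digest of a file using whichever tool is installed.
file_md5() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"                        # BSD: -q prints only the digest
    else
        md5sum < "$1" | cut -d ' ' -f 1    # GNU: drop the trailing "  -"
    fi
}

# Succeed only if both digests could be computed and are equal.
copy_matches() {
    a=$(file_md5 "$1") && b=$(file_md5 "$2") && [ -n "$a" ] && [ "$a" = "$b" ]
}
```

In the loop above, the comparison could then read "copy_matches BIG BIG2 || exit 1" on either system.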

> On the server, I have
>
> nfe0@pci0:0:10:0: class=0x068000 card=0x286510f1 chip=0x005710de
> rev=0xa3 hdr=0x00
> vendor = 'Nvidia Corp'
> device = 'nForce4 Ultra NVidia Network Bus Enumerator'
> class = bridge
> cap 01[44] = powerspec 2 supports D0 D1 D2 D3 current D0

idem

> # ifconfig nfe0
> nfe0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
> options=10b<RXCSUM,TXCSUM,VLAN_MTU,TSO4>
> ether 00:e0:81:58:91:6a
> inet 192.168.245.1 netmask 0xffffff00 broadcast 192.168.245.255
> media: Ethernet autoselect (1000baseTX <full-duplex,flag0,flag1>)
> status: active

idem

> How long does it take for the problem to come up ?

as said : approximately half an hour; never more than 4 hours

Best, Arno

Mike Tancsa

Apr 21, 2008, 9:06:07 PM

to Arno J. Klaassen, sta...@freebsd.org

At 05:57 PM 4/21/2008, Arno J. Klaassen wrote:

>Less than an hour in general (running the same client script
>simultaneously on a 100Mbps linux box and a 1Gbps bsd6-x86)

Hi,
I ran it for over an hour without cpufreq and powerd without
problems with just one client. I will recompile the kernel tomorrow
(can't load it as a kld for some reason) and see if I can trigger it.

> > I have what appears
> > to be a very similar Tyan board (I have a Socket 939 X2 CPU) with the
> > same NIC, but this one is running RELENG_7 from April 17th. There
> > have been a few fixes for the nfe driver since 7.0
> >
> > I am running this small script below on a nfs client (em nic) against
> > the server (nfe) ( mount options on the client 192.168.245.1:/backup
> > /backup nfs rw,-r=32768,-w=32768,tcp,noauto )
>

>quite the same as what I do (apart from the umount/sleep/mount and I
>use same partition for write and copy) :

I do the umount/mount as it's a good way to make sure the file is not in cache.

What version of nfe are you running ? If it's older than

% ident /usr/src/sys/dev/nfe/if_nfe.c
/usr/src/sys/dev/nfe/if_nfe.c:
$OpenBSD: if_nfe.c,v 1.54 2006/04/07 12:38:12 jsg Exp $
$FreeBSD: src/sys/dev/nfe/if_nfe.c,v 1.21.2.5 2008/04/17
04:22:32 yongari Exp $

perhaps try that on your machine to see if it helps. Prior versions
did not work very well for me at all. Not sure if it will just
"work" or you need to do a full cvsup buildworld/buildkernel.

---Mike

Mike Tancsa

Apr 22, 2008, 11:03:33 AM

to Arno J. Klaassen, sta...@freebsd.org

At 05:57 PM 4/21/2008, Arno J. Klaassen wrote:
> > Hi,
> > How long does it take for the problem to show up ?
>
>
>Less than an hour in general (running the same client script
>simultaneously on a 100Mbps linux box and a 1Gbps bsd6-x86)

I am running my nic at gig speeds only... I recompiled the kernel
this morning to include cpufreq as well as made sure the cool&quiet
was enabled in the BIOS.

>for info, I test with args '38 999' (38M, try 999 times) on linux
>(slightly adapted script BTW) and '138 999' on bsd. The best 'score' I
>got was 'still 871 iterations to go'

So far I have done 150 loops with an 80MB file and no issues and 200
loops with a 160MB file. My nfe nic does not support MSI and has its
own interrupt.

# vmstat -i
interrupt                      total    rate
irq1: atkbd0                       5       0
irq4: sio0                      3049       1
irq16: twe0                   327046     164
irq19: bge0                   385147     194
irq21: atapci1                976355     492
irq23: nfe0                 11876726    5986
cpu0: timer                  3966420    1999
cpu1: timer                  3964392    1998

I have powerd started up with
powerd_enable="YES"
powerd_flags="-a adaptive -b adaptive -n adaptive"

FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
ioapic0: Changing APIC ID to 2
ioapic0 <Version 1.1> irqs 0-23 on motherboard
kbd1 at kbdmux0
acpi0: <Nvidia AWRDACPI> on motherboard
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
acpi0: reservation of 0, a0000 (3) failed
acpi0: reservation of 100000, dfde0000 (3) failed
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1008-0x100b on acpi0
cpu0: <ACPI CPU> on acpi0
powernow0: <Cool`n'Quiet K8> on cpu0
cpu1: <ACPI CPU> on acpi0
powernow1: <Cool`n'Quiet K8> on cpu1
acpi_button0: <Power Button> on acpi0
.
.
nfe0: <NVIDIA nForce4 CK804 MCP9 Networking Adapter> port
0xb400-0xb407 mem 0xfebf9000-0xfebf9fff irq 23 at device 10.0 on pci0
miibus0: <MII bus> on nfe0
e1000phy0: <Marvell 88E1111 Gigabit PHY> PHY 1 on miibus0
e1000phy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX,
1000baseTX-FDX, auto
nfe0: Ethernet address: 00:e0:81:58:91:6a
nfe0: [FILTER]

With the "sleep" in my test script, powerd does seem to be fiddling
with frequencies as well during the inactivity.

# sysctl dev. | grep -i fre
dev.cpu.0.freq: 1800
dev.cpu.0.freq_levels: 2200/110000 2000/105600 1800/89100 1000/49000
dev.powernow.0.freq_settings: 2200/110000 2000/105600 1800/89100 1000/49000
dev.powernow.1.freq_settings: 2200/110000 2000/105600 1800/89100 1000/49000
dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%parent: cpu0
dev.cpufreq.1.%driver: cpufreq
dev.cpufreq.1.%parent: cpu1

# sysctl dev. | grep -i fre
dev.cpu.0.freq: 2200
dev.cpu.0.freq_levels: 2200/110000 2000/105600 1800/89100 1000/49000
dev.powernow.0.freq_settings: 2200/110000 2000/105600 1800/89100 1000/49000
dev.powernow.1.freq_settings: 2200/110000 2000/105600 1800/89100 1000/49000
dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%parent: cpu0
dev.cpufreq.1.%driver: cpufreq
dev.cpufreq.1.%parent: cpu1

---Mike

Arno J. Klaassen

Apr 22, 2008, 1:40:22 PM

to Mike Tancsa, sta...@freebsd.org

Hello,

Mike Tancsa <mi...@sentex.net> writes:

irq1: atkbd0                       4       0
irq14: ata0                       69       0
irq20: nfe0                 11650955    5283
irq24: atapci1                    94       0
irq28: atapci2                   178       0
irq29: ahd0                   355704     161
cpu0: timer                  4409020    1999
cpu1: timer                  4391646    1991
cpu2: timer                  4391643    1991
cpu3: timer                  4391641    1991

> I have powerd started up with
> powerd_enable="YES"
> powerd_flags="-a adaptive -b adaptive -n adaptive"

slightly different, I mostly use "-b adaptive -i 90 -n adaptive -r 80"
but the problem shows up without flags as well.

> With the "sleep" in my test script, powerd does seem to be fiddling
> with frequencies as well during the inactivity.

I most often provoke slight swapping for "randomizing" frequency changes
and run a burnK7 or similar to push the frequency up and down by hand

> # sysctl dev. | grep -i fre
> dev.cpu.0.freq: 1800
> dev.cpu.0.freq_levels: 2200/110000 2000/105600 1800/89100 1000/49000
> dev.powernow.0.freq_settings: 2200/110000 2000/105600 1800/89100 1000/49000
> dev.powernow.1.freq_settings: 2200/110000 2000/105600 1800/89100 1000/49000
> dev.cpufreq.0.%driver: cpufreq
> dev.cpufreq.0.%parent: cpu0
> dev.cpufreq.1.%driver: cpufreq
> dev.cpufreq.1.%parent: cpu1

funny, when I do that :

# sysctl dev. | grep -i fre

dev.cpu.0.freq: 995
dev.cpu.0.freq_levels: 2587/95000 2388/90300 2189/76200 1990/63800 1791/53200 995/36100
dev.powernow.0.freq_settings: 2587/95000 2388/90300 2189/76200 1990/63800 1791/53200 995/36100
dev.powernow.1.freq_settings: 2587/95000 2388/90300 2189/76200 1990/63800 1791/53200 995/36100
dev.powernow.2.freq_settings: 2587/95000 2388/90300 2189/76200 1990/63800 1791/53200 995/36100
dev.powernow.3.freq_settings: 6747/95000 6228/90300 5709/76200 5190/63800 4671/53200 2595/36100

dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%parent: cpu0
dev.cpufreq.1.%driver: cpufreq
dev.cpufreq.1.%parent: cpu1

dev.cpufreq.2.%driver: cpufreq
dev.cpufreq.2.%parent: cpu2
dev.cpufreq.3.%driver: cpufreq
dev.cpufreq.3.%parent: cpu3

especially the dev.powernow.3.freq_settings look weird ...
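
An outlier like this is easy to miss by eye. Below is a minimal sketch, assuming sysctl output in the "name: value" form shown above is fed on standard input; find_odd_freq_settings is a hypothetical helper name, and the first CPU's table is simply taken as the reference:

```shell
#!/bin/sh
# Sketch only: find_odd_freq_settings is a hypothetical helper.
# Reads "dev.powernow.N.freq_settings: ..." lines on stdin and prints the
# name of every entry whose table differs from the first one seen.
find_odd_freq_settings() {
    awk -F': ' '
        /^dev\.powernow\.[0-9]+\.freq_settings:/ {
            if (ref == "") { ref = $2; next }   # first CPU is the reference
            if ($2 != ref) print $1             # report any CPU that deviates
        }
    '
}
```

On the affected box this could be run as "sysctl dev.powernow | find_odd_freq_settings"; with the output above it would be expected to name only dev.powernow.3.freq_settings.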

that said, I once more dug up the old acpi_ppc.c and slightly
adapted it for fbsd7 (basically some name changes and using
read_cpu_time() instead of cp_time) and the problem disappears ...

the algo of acpi_ppc makes it somewhat harder to push up frequencies,
though I doubt that matters.

I tried as well with hint.acpi_throttle.0.disabled="1" in loader.conf
with no luck (using powerd).

I'm out of office tomorrow but will try to find time tomorrow evening
to test with another NIC.

Best, Arno

Mike Tancsa

Apr 22, 2008, 1:43:41 PM

to Arno J. Klaassen, sta...@freebsd.org

At 01:38 PM 4/22/2008, Arno J. Klaassen wrote:

>I'm out of office tomorrow but will try to find time tomorrow evening
>to test with another NIC.

Are you using the latest RELENG_7, or at least the latest version of
nfe that's in RELENG_7 ?

---Mike

Arno J. Klaassen

Apr 22, 2008, 1:53:13 PM

to Peter Jeremy, sta...@freebsd.org

Hello,

Peter Jeremy <peter...@optushome.com.au> writes:

> On Mon, Apr 21, 2008 at 08:30:48PM +0200, Arno J. Klaassen wrote:
> >NB, (CC to kris@ for this) why is memtest86 port marked as i386-only?
>
> Basically because it's a bootable i386 binary image.

yop, but building it could be allowed on more archs (at least amd64 imho)

but no hard feelings! just a thought

Arno J. Klaassen

Apr 22, 2008, 2:01:56 PM

to Mike Tancsa, sta...@freebsd.org

Mike Tancsa <mi...@sentex.net> writes:

> At 01:38 PM 4/22/2008, Arno J. Klaassen wrote:
>
> >I'm out of office tomorrow but will try to find time tomorrow evening
> >to test with another NIC.
>
>
> Are you using the latest RELENG_7, or at least the latest version of
> nfe that's in RELENG_7 ?

Think so :

# cvs status if_nfe.c
===================================================================
File: if_nfe.c Status: Up-to-date

Working revision: 1.21.2.5 Sat Apr 19 14:27:41 2008
Repository revision: 1.21.2.5 /home/ncvs/src/sys/dev/nfe/if_nfe.c,v
Sticky Tag: RELENG_7 (branch: 1.21.2)
Sticky Date: (none)
Sticky Options: (none)

++, Arno

PS: in the end the memory does not seem to be involved : populating 4G in CPU1 or
2G in CPU1 and 2G in CPU2 does not make a difference

Mike Tancsa

Apr 22, 2008, 2:07:55 PM

to Arno J. Klaassen, sta...@freebsd.org

At 02:00 PM 4/22/2008, Arno J. Klaassen wrote:
> >
> > Are you using the latest RELENG_7, or at least the latest version of
> > nfe thats in RELENG_7 ?
>
>
>Think so :

OK, and is it the latest RELENG_7? Or has just the if_nfe.c file
been manually updated? Also, are you using the ULE or the 4BSD
scheduler? I still have 4BSD on the box I am testing on.

---Mike

Arno J. Klaassen

Apr 22, 2008, 2:36:26 PM

to Mike Tancsa, sta...@freebsd.org

re,

Mike Tancsa <mi...@sentex.net> writes:

> At 02:00 PM 4/22/2008, Arno J. Klaassen wrote:
> > >
> > > Are you using the latest RELENG_7, or at least the latest version of
> > > nfe thats in RELENG_7 ?
> >
> >
> >Think so :
>
> OK, and it is the latest RELENG_7 ?

From Saturday (but I haven't seen any RELENG_7 commit since then that
could be related to this)

> Also, you are using ULE or the 4BSD scheduler ? I
> still have 4BSD on the box I am testing on.

Interesting, this is with ULE. I didn't really test 4BSD on this
box (I believed those who said SMP needs ULE *and* am quite
satisfied with overall performance). I'll try 4BSD, though time
is getting short; I promised to deliver this box next Thursday but will
still have some days for on-site testing.

++, Arno

pluknet

Apr 22, 2008, 4:15:10 PM

to Mike Tancsa, sta...@freebsd.org

On 22/04/2008, Mike Tancsa <mi...@sentex.net> wrote:
> At 02:00 PM 4/22/2008, Arno J. Klaassen wrote:
>
> > >
> > > Are you using the latest RELENG_7, or at least the latest version of
> > > nfe thats in RELENG_7 ?
> >
> >
> > Think so :
> >
>
> OK, and it is the latest RELENG_7 ? Or just the if_nfe.c file has been
> manually updated ? Also, you are using ULE or the 4BSD scheduler ? I still
> have 4BSD on the box I am testing on.

Hi, I have the same problem with data corruption (with nfe on the nfs server side),
particularly when transferring large files.
Maybe this is related to the topic.

My simple test case:
truncate -s 1000m bigfile
^^ here I get a zero-filled file
cp bigfile /nfs/mounted
^^ here I get a file that is not entirely zero-filled, after uploading to the nfs server

I looked at the corrupted file. It contains a few ranges filled with
non-zero bytes:
equal to zero? real 4-byte value offset
======================================
not equal 1200355616 at pos=38797316
... <-- this range contains garbage in every 4-byte word, omitted
not equal 3879749905 at pos=38813696

not equal 161160732 at pos=45613060
.. <-- ditto
not equal 575257183 at pos=45629440

not equal 1943682165 at pos=59768836
.. <-- ditto
not equal 2843639625 at pos=59785216

not equal 2653910121 at pos=60293124
.. <-- ditto
not equal 3462830780 at pos=60309504
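
A quick arithmetic check on the offsets in the table above may help whoever looks at this next. The following is only a sketch in Python: the 4096-byte page size is an assumption used as a reference alignment, and `describe` is a hypothetical helper name. It computes the length of each garbage range and where the range starts and ends relative to a 4096-byte boundary.

```python
# First and last non-zero 4-byte word of each range, copied from the table above.
RANGES = [
    (38797316, 38813696),
    (45613060, 45629440),
    (59768836, 59785216),
    (60293124, 60309504),
]

WORD = 4     # the table lists 4-byte values
PAGE = 4096  # assumption: 4 KiB page size, used only as a reference alignment


def describe(first, last):
    """Return (length in bytes, first % PAGE, last % PAGE) for one range."""
    length = last - first + WORD  # count the last reported word as well
    return length, first % PAGE, last % PAGE


for first, last in RANGES:
    length, first_mod, last_mod = describe(first, last)
    print(f"{first}..{last}: {length} bytes, "
          f"first % {PAGE} = {first_mod}, last % {PAGE} = {last_mod}")
```

For these four ranges the sketch reports 16384 bytes each, with the first bad word 4 bytes past a 4096-byte boundary and the last bad word exactly on one. Whether that regularity points at a particular buffer size is not established here.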

Some info:

nfs server on 8-CURRENT as of Apr 17
nfs client on 7.0-STABLE as of Apr 12

dmesg | grep nfe
nfe0: <NVIDIA nForce2 MCP2 Networking Adapter> port 0xe000-0xe007 mem
0xe2001000-0xe2001fff irq 20 at device 4.0 on pci0

miibus0: <MII bus> on nfe0

nfe0: Ethernet address: 00:04:61:6c:76:b1
nfe0: [FILTER]
nfe0: tx v1 error 0x6001
nfe0: tx v1 error 0x6001
nfe0: tx v1 error 0x6001
nfe0: tx v1 error 0x6001
nfe0: tx v1 error 0x6001
nfe0: tx v1 error 0x6001
nfe0: tx v1 error 0x6001
nfe0: tx v1 error 0x6001
nfe0: tx v1 error 0x6001
nfe0: tx v1 error 0x6001
nfe0: tx v1 error 0x6001
nfe0: tx v1 error 0x6001
^^^
These appear while cp'ing the file to the server.
(btw they do not appear with polling disabled, so it's probably a separate issue)

vmstat -i | grep nfe
irq20: nfe0 ohci0 1 0

nfe0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500

options=48<VLAN_MTU,POLLING>
ether 00:04:61:6c:76:b1
inet 192.168.200.137 netmask 0xffffff00 broadcast 192.168.200.255
media: Ethernet autoselect (100baseTX <full-duplex>)
status: active
I can reproduce the corruption with or without polling.

nfe0@pci0:0:4:0: class=0x020000 card=0x10001695 chip=0x006610de
rev=0xa1 hdr=0x00

wbr,
pluknet

Mike Tancsa

Apr 22, 2008, 5:56:48 PM

to Arno J. Klaassen, sta...@freebsd.org

At 02:35 PM 4/22/2008, Arno J. Klaassen wrote:

> > Also, you are using ULE or the 4BSD scheduler ? I
> > still have 4BSD on the box I am testing on.
>
>Interesting, this is with ULE. I didn't really test 4BSD on this
>box (I believed those who said SMP needs ULE *and* am quite
>satisfied with overall performance). I'll try 4BSD though time
>is getting short; I promised to deliver this box next thursday but will
>still have some days for on-site testing.

I have recompiled the kernel with ULE, and it seems fine as well. I
ran 160 iterations of a 300MB file and there was no corruption. Same
process: copy a random junk file over the nfs mount, unmount the nfs
mount, remount it, copy it back, compare the files.
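
The copy-and-compare part of that process can be sketched in Python as follows. This is a minimal illustration, not the script actually used: the function names and paths are hypothetical, and the unmount/remount step is left out because it needs root. Without that step the client cache may serve the data, so the check is weaker than the procedure described above.

```python
import hashlib
import os
import shutil
import tempfile


def md5sum(path, chunk=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def make_random_file(path, size):
    """Create a file of `size` random bytes."""
    with open(path, "wb") as f:
        f.write(os.urandom(size))


def roundtrip_ok(src, remote_dir, back_path):
    """Copy src into remote_dir, copy it back, and compare checksums.

    remote_dir stands in for the nfs mount point (hypothetical).
    A real test would unmount and remount remote_dir between the two
    copies so that the data is re-read from the server.
    """
    remote = os.path.join(remote_dir, os.path.basename(src))
    shutil.copyfile(src, remote)
    shutil.copyfile(remote, back_path)
    return md5sum(src) == md5sum(back_path)


if __name__ == "__main__":
    # Demo against a local temporary directory instead of a real nfs mount.
    with tempfile.TemporaryDirectory() as work, \
            tempfile.TemporaryDirectory() as mount:
        src = os.path.join(work, "junk")
        make_random_file(src, 1 << 20)
        print("match" if roundtrip_ok(src, mount, src + ".back") else "CORRUPT")
```

Running `roundtrip_ok` in a loop with a fresh random file each time corresponds to the iterations mentioned above.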

---Mike

Mike Tancsa

Apr 22, 2008, 9:09:50 PM

to pluknet, sta...@freebsd.org

At 04:13 PM 4/22/2008, pluknet wrote:
>Hi, I have the same problem with data corruption (with nfe on nfs
>server side),
>particularly when transferring large files.
>Maybe this is somehow associated with the topic.
>
>My simple test case:
>truncate -s 1000m bigfile
>^^ here I get zero-filed file
>cp bigfile /nfs/mounted
>^^ here I get not-at-all-zero-filed file, after uploading to nfs server
>

>nfs server on 8-CURRENT as of Apr 17
>nfs client on 7.0-STABLE as of Apr 12

Hi,

On a RELENG_6 client and the same RELENG_7 (now ULE) server
with nfe, all seems to work fine:

[ns1]# mount /backup/
[ns1]# cp -p j2.txt /backup/
[ns1]# umount /backup/
[ns1]# mount /backup/
[ns1]# cp -p /backup/j2.txt /tmp/j2-copy.txt
[ns1]# md5 /tmp/j2-copy.txt
MD5 (/tmp/j2-copy.txt) = b0977dceb7b511bb8c542ac4f18c7128
[ns1]# md5 /tmp/j2.txt
MD5 (/tmp/j2.txt) = b0977dceb7b511bb8c542ac4f18c7128
[ns1]# cd /tmp
[ns1]# truncate -s 1000m bigfile
[ns1]# ls -lh bigfile
-rw-r--r-- 1 root wheel 1.0G Apr 22 21:03 bigfile
[ns1]# md5 bigfile
MD5 (bigfile) = e5c834fbdaa6bfd8eac5eb9404eefdd4
[ns1]# cp -p bigfile /backup/
[ns1]# umount /backup/
[ns1]# mount /backup/
[ns1]# cp -p /backup/bigfile /tmp/b2
[ns1]# md5 /tmp/b2
MD5 (/tmp/b2) = e5c834fbdaa6bfd8eac5eb9404eefdd4
[ns1]#

---Mike

Pyun YongHyeon

Apr 23, 2008, 1:21:46 AM

to pluknet, sta...@freebsd.org

On Wed, Apr 23, 2008 at 12:13:44AM +0400, pluknet wrote:
> On 22/04/2008, Mike Tancsa <mi...@sentex.net> wrote:
> > At 02:00 PM 4/22/2008, Arno J. Klaassen wrote:
> >
> > > >
> > > > Are you using the latest RELENG_7, or at least the latest version of
> > > > nfe thats in RELENG_7 ?
> > >
> > >
> > > Think so :
> > >
> >
> > OK, and it is the latest RELENG_7 ? Or just the if_nfe.c file has been
> > manually updated ? Also, you are using ULE or the 4BSD scheduler ? I still
> > have 4BSD on the box I am testing on.
>
> Hi, I have the same problem with data corruption (with nfe on nfs server side),
> particularly when transferring large files.
> Maybe this is somehow associated with the topic.
>
> My simple test case:
> truncate -s 1000m bigfile
> ^^ here I get zero-filed file
> cp bigfile /nfs/mounted
> ^^ here I get not-at-all-zero-filed file, after uploading to nfs server
>
> I looked at the corrupted file. It contains a few ranges, filed with
> non-zero bytes:
> equal to zero? real 4-byte value offset
> ======================================
> not equal 1200355616 at pos=38797316

> ... <-- this range contains per-4bytes garbage, omit

> not equal 3879749905 at pos=38813696
>
> not equal 161160732 at pos=45613060

> ... <-- ditto

> not equal 575257183 at pos=45629440
>
> not equal 1943682165 at pos=59768836

> ... <-- ditto

> not equal 2843639625 at pos=59785216
>
> not equal 2653910121 at pos=60293124

> ... <-- ditto

> not equal 3462830780 at pos=60309504
>
> Some info:
>
> nfs server on 8-CURRENT as of Apr 17
> nfs client on 7.0-STABLE as of Apr 12
>
> dmesg | grep nfe
> nfe0: <NVIDIA nForce2 MCP2 Networking Adapter> port 0xe000-0xe007 mem
> 0xe2001000-0xe2001fff irq 20 at device 4.0 on pci0
> miibus0: <MII bus> on nfe0
> nfe0: Ethernet address: 00:04:61:6c:76:b1
> nfe0: [FILTER]
> nfe0: tx v1 error 0x6001
> nfe0: tx v1 error 0x6001
> nfe0: tx v1 error 0x6001
> nfe0: tx v1 error 0x6001
> nfe0: tx v1 error 0x6001
> nfe0: tx v1 error 0x6001
> nfe0: tx v1 error 0x6001
> nfe0: tx v1 error 0x6001
> nfe0: tx v1 error 0x6001
> nfe0: tx v1 error 0x6001
> nfe0: tx v1 error 0x6001
> nfe0: tx v1 error 0x6001
> ^^^

I'm not sure it's related to the data corruption issue, but 0x6001
would mean a Tx underflow error. I recall these Tx errors were seen
on nfe(4) when the negotiated speed/duplex did not match the link
partner or MACs.
Does the link partner also agree on the speed/duplex settings of nfe(4)?
Which PHY driver does nfe(4) use?

> This appears while cp'ing file to server.
> (btw they do not appear with disabled polling, probably it's an another issue)
>
> vmstat -i | grep nfe
> irq20: nfe0 ohci0 1 0
>
> nfe0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
> options=48<VLAN_MTU,POLLING>
> ether 00:04:61:6c:76:b1
> inet 192.168.200.137 netmask 0xffffff00 broadcast 192.168.200.255
> media: Ethernet autoselect (100baseTX <full-duplex>)
> status: active
> I can reproduce it regardless polling presence.
>
> nfe0@pci0:0:4:0: class=0x020000 card=0x10001695 chip=0x006610de
> rev=0xa1 hdr=0x00
>

--
Regards,
Pyun YongHyeon

pluknet

Apr 23, 2008, 5:53:31 AM

to pyu...@gmail.com, sta...@freebsd.org

2008/4/23 Pyun YongHyeon <pyu...@gmail.com>:

There is one unmanaged 10/100 switch between them (both are 100baseTX),
so I cannot say exactly :( Though I can achieve speeds up to 100 Mbps.
I can test with a direct connection later if needed.

> What PHY driver nfe(4) use?
>

$ kldload if_nfe

nfe0: <NVIDIA nForce2 MCP2 Networking Adapter> port 0xe000-0xe007 mem
0xe2001000-0xe2001fff irq 20 at device 4.0 on pci0

nfe0: Ethernet address: 00:04:61:6c:76:b1
nfe0: [FILTER]
miibus0: <MII bus> on nfe0

rlphy0: <RTL8201L 10/100 media interface> PHY 1 on miibus0
rlphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
nfe0: link state changed to DOWN
nfe0: link state changed to UP

So, it seems to be rlphy.

>
> > This appears while cp'ing file to server.
> > (btw they do not appear with disabled polling, probably it's an another issue)
> >
> > vmstat -i | grep nfe
> > irq20: nfe0 ohci0 1 0
> >
> > nfe0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
> > options=48<VLAN_MTU,POLLING>
> > ether 00:04:61:6c:76:b1
> > inet 192.168.200.137 netmask 0xffffff00 broadcast 192.168.200.255
> > media: Ethernet autoselect (100baseTX <full-duplex>)
> > status: active
> > I can reproduce it regardless polling presence.
> >
> > nfe0@pci0:0:4:0: class=0x020000 card=0x10001695 chip=0x006610de
> > rev=0xa1 hdr=0x00
> >

wbr,
pluknet

Thomas Hurst

Apr 24, 2008, 12:48:00 PM

to Jeremy Chadwick, Clayton Milos, Kris Kennaway, sta...@freebsd.org, n...@freebsd.org

* Jeremy Chadwick (koi...@freebsd.org) wrote:

> > I added it directly to the 2nd CPU (diagram on page 9 of
> > http://www.tyan.com/manuals/m_s2895_101.pdf) and the problem
> > seems to be the interaction between nfe0 and powerd .... :
>
> That board is the weirdest thing I've seen in years.

K8WE's a very popular workstation board. I've been using one for years.

> Two separate CPUs using a single (shared) memory controller,

Er, no. Where are you getting that? 4 DIMMs are connected per CPU,
though it's hardly strange to only have one, just cheap and nasty.

> two separate (and different!) nVidia chipsets, a SMSC I/O controller
> probably used for serial and parallel I/O,

Er, so? Sun X4x00 M2's do exactly the same; they run a 2200 off one CPU
and a 2050 off another (via an AMD 8132 no less). !M2's did much the
same with a pair of AMD 8131's. They use SMSC IO controllers too:

http://www.sun.com/servers/entry/x4100/arch-wp.pdf
http://www.sun.com/servers/netra/x4200/wp.pdf

We've used dozens of these systems in production in various
configurations for years without a problem.

> two separate nVidia NICs with Marvell PHYs (yet somehow you can bridge
> the two NICs and PHYs?),

They're not separate, they hang off the same chip according to the
linked document. They are nve nonsense, though, not worth using imo.

> two separate PCI-e busses (each associated with a separate nVidia
> chipset), two separate PCI-X busses... the list continues.

Again, nothing surprising. Each CPU gets its own bus via its own HT
link. Back in the day when the K8WE was first released, this was the
only way to get a pair of 16x PCIe slots.

> I know you don't need opinions at this point, but what a behemoth. I
> can't imagine that thing running reliably.

The only stability problems I've experienced have been the occasional
lockup using PowerNow since migrating from dual single core to dual dual
core.

--
Thomas 'Freaky' Hurst
http://hur.st/

Arno J. Klaassen

Apr 26, 2008, 11:15:55 AM

to Mike Tancsa, sta...@freebsd.org, plu...@gmail.com

Hello,

Mike Tancsa <mi...@sentex.net> writes:

> At 02:35 PM 4/22/2008, Arno J. Klaassen wrote:
>
> > > Also, you are using ULE or the 4BSD scheduler ? I
> > > still have 4BSD on the box I am testing on.
> >
> >Interesting, this is with ULE. I didn't really test 4BSD on this
> >box (I believed those who said SMP needs ULE *and* am quite
> >satisfied with overall performance). I'll try 4BSD though time
> >is getting short; I promised to deliver this box next thursday but will
> >still have some days for on-site testing.
>
>
> I have recompiled the kernel with ULE, and it seems fine as well. I
> ran 160 iterations of a 300MB file and there was no corruption. Same
> process - copy a junk random file over nfs mount, unmount the nfs
> mount, remount it copy it back, compare the files.

Let me summarise my investigations till now :

- in all failing cases just *one* byte is corrupted, 4 or all 8 bits
set to zero *and* the original value of each changed half-byte is
one out of the limited subset {1, 8, 9} ....

here is the output of `cmp -x $i/BIG $i/BIG2` for some failing
cases I saved :

03869a48 09 00
05209d88 09 00
01777148 09 00
00f10f88 09 00
01f4c4c8 11 00
06c3d6c8 11 00
0725ca48 18 00
01608008 09 00
00f3b888 18 00

07aa45c8 29 20

- it does *not* seem to depend on :

- the interface : I could produce it using nfe0, nfe1 and
re0 using some netgear pci-card

- the distribution of the 4Gig memory : installing 4G at
CPU1 or 2G at CPU1 and 2G at CPU2 produces same results
(NB, all memory passed memtest.iso in both situations
for complete run)

- the frequency control method : easier to produce with
cpufreq/powerd, but finally I can reproduce the corruption
as well using acpi_ppc

- the nfs-client and options (not exhaustively tested, but the different
tests include i386-releng6, amd64-releng6 and linux, and quite
a set of different try-and-see mount_nfs options)

I am testing right now with a fixed frequency of 1 GHz.

I am not so inclined to test 4BSD, since reboot possibilities are
limited for me now on this box, but next week I will set up a similar
board (S3992e) (iff I can find quad-core socket F over here ...)
and in a certain sense hope I can reproduce it on that board as well.

Best, Arno

Arno J. Klaassen

May 1, 2008, 2:56:30 PM

to Mike Tancsa, sta...@freebsd.org, plu...@gmail.com

Hello,

> [ .. stuff deleted .. ]

> > I have recompiled the kernel with ULE, and it seems fine as well. I
> > ran 160 iterations of a 300MB file and there was no corruption. Same
> > process - copy a junk random file over nfs mount, unmount the nfs
> > mount, remount it copy it back, compare the files.
>
>
> Let me summarise my investigations till now :

> [ .. more stuff deleted .. ]

> - it does *not* seem to depend on :
>
> - the interface : I could produce it using nfe0, nfe1 and
> re0 using some netgear pci-card
>
> - the distribution of the 4Gig memory : installing 4G at
> CPU1 or 1G at CPU1 and 2G at CPU2 produces same results
> (NB, all memory passed memtest.iso in both situtations
> for complete run)
>
> - the frequency control method : easier to produce with
> cpufreq/powerd, but finally I can reproduce the cooruption
> as well using acpi_ppc
>
> - the nfs-client and options (not exhaustively tested, but different
> test include i386-releng6, amd64-releng6 and linux, and quite
> a set of different try and see mounf_nfs options
>
> I am testing right now with a fixed frequency of 1Ghz.

I cannot reproduce it at a fixed cpu-frequency with cpufreq loaded (I
ran my test for three days without a problem; normally a couple of hours
was enough to trigger it).

But I looked again at the corrupted copies :

# for i in raid5/xps/SAVE/1 raid5/pxe/SAVE/1 raid5/pxe/SAVE/2 raid5/pxe/SAVE/3 \
    raid5/blockhead/SAVE/1 scsi/pxe/SAVE/1 scsi/blockhead/SAVE/1 \
    scsi/blockhead/SAVE/2 scsi/blockhead/SAVE/3 scsi/blockhead/SAVE/4; do
    ls -l $i/BIG; cmp -x $i/BIG $i/BIG2; echo; done

-rw-r--r-- 1 root wheel 144703488 Apr 26 16:06 raid5/xps/SAVE/1/BIG
004fd908 18 00
02c9e6c8 11 00
034ab6c8 90 00
037e4648 09 00
039e85c8 91 01
04484408 00 09
06115cc8 00 81
06e5d148 01 91
07016048 18 00
074307c8 08 19
07aa45c8 29 20
080bfb88 00 11

-rw-r--r-- 1 root wheel 144703488 Apr 20 14:07 raid5/pxe/SAVE/1/BIG
03869a48 09 00

-rw-r--r-- 1 root wheel 144703488 Apr 20 14:47 raid5/pxe/SAVE/2/BIG
05209d88 09 00

-rw-r--r-- 1 root wheel 39845888 Apr 20 15:17 raid5/pxe/SAVE/3/BIG
01777148 09 00

-rw-r--r-- 1 root wheel 144703488 Apr 20 14:54 raid5/blockhead/SAVE/1/BIG
00f10f88 09 00

-rw-r--r-- 1 root wheel 39845888 Apr 20 16:08 scsi/pxe/SAVE/1/BIG
01f4c4c8 11 00

-rw-r--r-- 1 root wheel 144703488 Apr 20 15:38 scsi/blockhead/SAVE/1/BIG
06c3d6c8 11 00

-rw-r--r-- 1 root wheel 144703488 Apr 20 16:11 scsi/blockhead/SAVE/2/BIG
0725ca48 18 00

-rw-r--r-- 1 root wheel 144703488 Apr 20 17:32 scsi/blockhead/SAVE/3/BIG
01608008 09 00

-rw-r--r-- 1 root wheel 144703488 Apr 23 19:26 scsi/blockhead/SAVE/4/BIG
00f3b888 18 00

The output for raid5/xps/SAVE/1/BIG was obtained after installing the box
at a lab with, without doubt, more sophisticated switches than I use. It is
the first case I was able to produce with more than just one byte corrupted,
but still with the same pattern :

it looks like the position always is 2^3 * 'something without a factor of two'

(e.g. factor(hex2dec('00f10f88')) = 2 2 2 809 2441

factor(hex2dec('01f4c4c8')) = 2 2 2 317 12941 )

and the corruption is one out of the following half-byte transitions :

1 -> 0
8 -> 0
9 -> 0
0 -> 1
0 -> 8
0 -> 9
8 -> 9
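
The transition list can be re-derived mechanically from the `cmp -x` output. The sketch below (Python; the helper names are hypothetical) takes the 21 original/corrupted byte pairs listed above, collects every half-byte that changed, and also ORs together the XOR of each pair to see which bit positions ever flip.

```python
# Original/corrupted byte pairs copied from the cmp -x output above
# (12 from raid5/xps/SAVE/1, then one per remaining saved case).
PAIRS_HEX = """
18 00  11 00  90 00  09 00  91 01  00 09
00 81  01 91  18 00  08 19  29 20  00 11
09 00  09 00  09 00  09 00  11 00  11 00
18 00  09 00  18 00
"""


def parse_pairs(text):
    """Turn whitespace-separated hex bytes into (old, new) integer pairs."""
    tokens = text.split()
    return [(int(tokens[i], 16), int(tokens[i + 1], 16))
            for i in range(0, len(tokens), 2)]


def nibble_transitions(pairs):
    """Return the set of (old, new) half-byte values that actually changed."""
    seen = set()
    for old, new in pairs:
        for shift in (4, 0):  # high half-byte, then low half-byte
            a = (old >> shift) & 0xF
            b = (new >> shift) & 0xF
            if a != b:
                seen.add((a, b))
    return seen


def flipped_bits_mask(pairs):
    """OR together old XOR new over all pairs: every bit that ever flipped."""
    mask = 0
    for old, new in pairs:
        mask |= old ^ new
    return mask


PAIRS = parse_pairs(PAIRS_HEX)
for a, b in sorted(nibble_transitions(PAIRS)):
    print(f"{a:x} -> {b:x}")
print(f"bits that ever flip: 0x{flipped_bits_mask(PAIRS):02x}")
```

On this sample the sketch finds the same seven transitions, and the combined mask is 0x99, i.e. only bits 0 and 3 of a half-byte change in any of the saved cases.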

Maybe this gives a hint to someone ...
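
The observation about the positions can be checked for all saved cases at once. This is a sketch in Python with hypothetical helper names; the 64-byte modulus is an assumption chosen as a candidate alignment to test, not something established in this thread.

```python
# Offsets of the corrupted bytes, copied from the cmp -x output above.
OFFSETS_HEX = """
004fd908 02c9e6c8 034ab6c8 037e4648 039e85c8 04484408 06115cc8
06e5d148 07016048 074307c8 07aa45c8 080bfb88 03869a48 05209d88
01777148 00f10f88 01f4c4c8 06c3d6c8 0725ca48 01608008 00f3b888
"""
OFFSETS = [int(tok, 16) for tok in OFFSETS_HEX.split()]

BLOCK = 64  # assumption: candidate alignment to test against


def factorize(n):
    """Prime factorization by trial division (fine for numbers this small)."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def power_of_two(n):
    """Exponent of the largest power of two dividing n (n > 0)."""
    k = 0
    while n % 2 == 0:
        n //= 2
        k += 1
    return k


print("factor(0x00f10f88) =", factorize(0x00f10f88))
print("powers of two seen :", sorted({power_of_two(o) for o in OFFSETS}))
print(f"offset % {BLOCK} seen   :", sorted({o % BLOCK for o in OFFSETS}))
```

For the 21 offsets above, every position is divisible by exactly 2^3, and every position is also 8 modulo 64, which is a slightly stronger statement of the same pattern.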