SLES10 LNXI users?

nacks

Jul 23, 2008, 3:41:30 PM
to Linux Networx Users Group

We recently went through a pretty significant upgrade here, going from
SLES9.3 to SLES10.1 on our LNXI cluster (some older Supermicro-based
nodes, but mostly Dell nodes). For the most part things have gone
smoothly, but we are running into some scaling issues that we weren't
seeing on our test system (and that we weren't seeing in our SLES9.3
environment). Several things we ran into on the day of the upgrade:

- The default ARP cache size settings in SLES10.1 were far too low; these
are the values we ended up using:

# net.ipv4.neigh
net.ipv4.neigh.default.gc_thresh3 = 3072
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh1 = 1536
net.ipv4.neigh.default.gc_stale_time = 300
net.ipv4.neigh.eth0.gc_stale_time = 300
net.ipv4.neigh.ib0.gc_stale_time = 300
net.ipv4.neigh.lo.gc_stale_time = 300
# net.ipv6.neigh
net.ipv6.neigh.default.gc_thresh3 = 3072
net.ipv6.neigh.default.gc_thresh2 = 2048
net.ipv6.neigh.default.gc_thresh1 = 1536
net.ipv6.neigh.default.gc_stale_time = 900
net.ipv6.neigh.eth0.gc_stale_time = 900
net.ipv6.neigh.ib0.gc_stale_time = 900
net.ipv6.neigh.lo.gc_stale_time = 900

Prior to setting these on all of our nodes we were seeing ARP floods
(entries were being evicted and re-requested constantly, to the point
that our provisioning network was unusable). We set the management
server's thresholds higher in anticipation of an expansion planned for
the near future (even with the above settings we still see around 1400
ARP table entries on our management server, due to the total number of
nodes in the cluster and all of the various network interfaces).
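
In case it helps anyone compare, this is roughly how we sanity-check the
cache on a node (just a quick sketch; how you persist the settings is up
to your site):

# count live IPv4 neighbour entries (compare against gc_thresh3)
ip -4 neigh show | wc -l
# confirm the running kernel picked up the new thresholds
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3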

Another big thing we ran into is that we had to force TSO (TCP
segmentation offload) off on all of our Ethernet cards (we did this
under SLES9.3 as well, but changes to boot order, etc. meant we had to
change where that is done in our configuration). See the ethtool
snippet below.
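
For reference, the actual knob is just ethtool; we hook something like
this into the interface bring-up scripts (eth0 here is only an example,
and where you run it depends on your boot order):

ethtool -K eth0 tso off                    # force TSO off
ethtool -k eth0 | grep -i segmentation     # confirm it took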

Right now we are still struggling with SLES10.1's default nscd
configuration. We use LDAP at our site, and following the upgrade,
using the same default nscd configuration we had under SLES9.3, we
were seeing a 0% cache hit rate and the load on the LDAP servers went
up 4x (with no caching being done, every query was going straight to
the LDAP servers).
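
For anyone who wants to compare numbers, the hit rate we're quoting
comes straight out of nscd's own statistics (run as root; the exact
output format may differ a bit between versions):

nscd -g                                  # dump per-database cache statistics, including hit rate
grep -i enable-cache /etc/nscd.conf      # see which databases are supposed to be cached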

We are just wondering if anyone else out there has gone through this
upgrade and has seen similar or different problems. Maybe we can
exchange some information.

thanks
-Nick