Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Mathematica problems on Solaris

237 views
Skip to first unread message

dave (from the uk)

unread,
Mar 23, 2006, 8:53:50 AM3/23/06
to
There were several messages here recently about excessive CPU usage of
Matematica on Solaris (SPARC).

I thought some my like to know I have raised this with Wolfram Research
who are investiating it. I've not received confirmation that they have
verified it. I'd hope they contact Sun about
this, as some recent Solaris changes (eithe Solaris 10 and/or patches)
have caused this to appear, although that does not mean it is a Solaris
bug - it might well be a Mathematica bug.

I hope it is a Solaris bug, as I suspect it will get fixed quicker - I
have no idea when a new release of Mathematica is due out. Still, even
ignoring this bug, Mathematica performance on SPARC is *very* poor,
making it a poor platform. In contrast Mathematica performance on a Sun
Ultra 40 running Solaris is *excellent* - as is documented on some Sun
web pages, as at on point it helded a Mathematica benchmark record.

If others have support contracts on this they might like to raise it
too.

Dave (from the UK)

PS
I'm sending from my PDA via Google groups, so don't reply to this emai
. I'm sure you can work out my address if you want to.

Wes W

unread,
Mar 23, 2006, 9:48:40 AM3/23/06
to
dave (from the uk) wrote:
> There were several messages here recently about excessive CPU usage of
> Matematica on Solaris (SPARC).
>
> I thought some my like to know I have raised this with Wolfram Research
> who are investiating it. I've not received confirmation that they have
> verified it. I'd hope they contact Sun about
> this, as some recent Solaris changes (eithe Solaris 10 and/or patches)
> have caused this to appear, although that does not mean it is a Solaris
> bug - it might well be a Mathematica bug.
>
> I hope it is a Solaris bug, as I suspect it will get fixed quicker - I
> have no idea when a new release of Mathematica is due out. Still, even
> ignoring this bug, Mathematica performance on SPARC is *very* poor,
> making it a poor platform. In contrast Mathematica performance on a Sun
> Ultra 40 running Solaris is *excellent* - as is documented on some Sun
> web pages, as at on point it helded a Mathematica benchmark record.
>
> If others have support contracts on this they might like to raise it
> too.
>
> Dave (from the UK)
>


Thanks for the update. Keep up posted if you hear of the certified
cause or solution.

Casper H.S. Dik

unread,
Mar 25, 2006, 8:09:51 AM3/25/06
to
"dave (from the uk)" <sorry_no_kn...@hotmail.com> writes:

>I thought some my like to know I have raised this with Wolfram Research
>who are investiating it. I've not received confirmation that they have
>verified it. I'd hope they contact Sun about
>this, as some recent Solaris changes (eithe Solaris 10 and/or patches)
>have caused this to appear, although that does not mean it is a Solaris
>bug - it might well be a Mathematica bug.

You don't have this problem on older versions of S10?

Have you tried this with Solaris 10 (unpatched) vs Solaris 10 (patched).

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Dave (from the UK)

unread,
Mar 25, 2006, 12:41:42 PM3/25/06
to
Casper H.S. Dik wrote:
> "dave (from the uk)" <sorry_no_kn...@hotmail.com> writes:
>
>
>>I thought some my like to know I have raised this with Wolfram Research
>>who are investiating it. I've not received confirmation that they have
>>verified it. I'd hope they contact Sun about
>>this, as some recent Solaris changes (eithe Solaris 10 and/or patches)
>>have caused this to appear, although that does not mean it is a Solaris
>>bug - it might well be a Mathematica bug.
>
>
> You don't have this problem on older versions of S10?
>
> Have you tried this with Solaris 10 (unpatched) vs Solaris 10 (patched).
>
> Casper

Casper,

I have not tried this with an unpatched Solaris 10 system, and I don't
know anyone that has. All cases I know of are on patched Solaris 10
(3/05 or 1/06) SPARC systems.

I had resisted the temptation to set up a system for testing this, but
your post has inspired me (sad person I am!!), so I will do a fresh
install and test it without any patches.

But I noticed this bug before 19th Jan 06. Given I have no Sun contract,
this probably means there were no patches on my system when I first
found the problem. Or would there have been publically available patches
for Solaris 10 update 1 on the 19th Jan 2006??

--
Dave K MCSE.

MCSE = Minefield Consultant and Solitaire Expert.

Please note my email address changes periodically to avoid spam.
It is always of the form: month-year@domain. Hitting reply will work
for a couple of months only. Later set it manually.

Dave (from the UK)

unread,
Mar 26, 2006, 12:23:27 AM3/26/06
to
Casper H.S. Dik wrote:
> "dave (from the uk)" <sorry_no_kn...@hotmail.com> writes:
>
>
>>I thought some my like to know I have raised this with Wolfram Research
>>who are investiating it. I've not received confirmation that they have
>>verified it. I'd hope they contact Sun about
>>this, as some recent Solaris changes (eithe Solaris 10 and/or patches)
>>have caused this to appear, although that does not mean it is a Solaris
>>bug - it might well be a Mathematica bug.
>
>
> You don't have this problem on older versions of S10?
>
> Have you tried this with Solaris 10 (unpatched) vs Solaris 10 (patched).
>
> Casper


Casper, I've just done a few more tests. The Ultra 60 referred to is the
same one in each case (360 MHz, 512 MB).


1) Ultra 60, fresh install of Solaris 9 (first release, whenever that
was). No problems - Mathematica 5.2 works fine.

2) Ultra 60, fresh install of Solaris 10 3/05, with no patches. Problem
present.

3) Ultra 60, fresh install of Solaris 10 1/06, with no patches. Problem
present.

4) Unknown machine. I think it might be a Netra T1, but I don't know for
sure. (I only have remote access). uname -a shows.

SunOS wickedwitch 5.10 Generic_118822-19 sun4u sparc SUNW,UltraAX-i2

This Netra T1 (or whatever) has Solaris 10, patched a month or two ago.
Problem present.

As has been pointed out once before by someone else

http://groups.google.co.uk/group/comp.unix.solaris/tree/browse_frm/thread/3f751c580fffde23/a23fd5b7de81c11e?rnum=11&hl=en&_done=%2Fgroup%2Fcomp.unix.solaris%2Fbrowse_frm%2Fthread%2F3f751c580fffde23%2F35233d67d579039f%3Flnk%3Draot%26hl%3Den%26#doc_ecebd6ae8db1c775

a Blade 1000, despite having Solaris 10, had *no* such bug. Are modern
machines exempt??

I don't personally have access to a more modern machine with Solaris 10
on it. The machines either have Solaris 10 and are pretty old, or are
new and have Solaris 9.

So it seems you need all these 3 have a problem.

1) Mathematica 5.1 or 5.2. Both 4.0 and 4.2 are unaffected. I can't say
for Mathematica 5, as I don't have it installed.

2) Solaris 10 (either release, with or without patches)

3) UltraSPARC II or IIe CPU. Given there is only one Solaris 10 machine
that does not suffer the problem, it may be unwise to say only
UltraSPARC II and IIe are affected.

Any thoughts on how one might be able to get a workaround? Using truss
on the CPU bound process (MathKernel), I found this a few weeks ago.

http://groups.google.co.uk/group/comp.unix.solaris/browse_frm/thread/3f751c580fffde23/35233d67d579039f?lnk=raot&hl=en#35233d67d579039f

Casper H.S. Dik

unread,
Mar 26, 2006, 5:06:47 AM3/26/06
to
"Dave (from the UK)" <see-my-s...@southminster-branch-line.org.uk> writes:

>Casper H.S. Dik wrote:
>> "dave (from the uk)" <sorry_no_kn...@hotmail.com> writes:
>>
>>
>>>I thought some my like to know I have raised this with Wolfram Research
>>>who are investiating it. I've not received confirmation that they have
>>>verified it. I'd hope they contact Sun about
>>>this, as some recent Solaris changes (eithe Solaris 10 and/or patches)
>>>have caused this to appear, although that does not mean it is a Solaris
>>>bug - it might well be a Mathematica bug.
>>
>>
>> You don't have this problem on older versions of S10?
>>
>> Have you tried this with Solaris 10 (unpatched) vs Solaris 10 (patched).
>>
>> Casper

>Casper,

>I have not tried this with an unpatched Solaris 10 system, and I don't
>know anyone that has. All cases I know of are on patched Solaris 10
>(3/05 or 1/06) SPARC systems.

>I had resisted the temptation to set up a system for testing this, but
>your post has inspired me (sad person I am!!), so I will do a fresh
>install and test it without any patches.

>But I noticed this bug before 19th Jan 06. Given I have no Sun contract,
>this probably means there were no patches on my system when I first
>found the problem. Or would there have been publically available patches
>for Solaris 10 update 1 on the 19th Jan 2006??

I'm interested about what happens on S10 03/05 (01/06 is a "patchy"
release; it has tons of patches applied in the factory)

Casper H.S. Dik

unread,
Mar 26, 2006, 5:09:55 AM3/26/06
to
"Dave (from the UK)" <see-my-s...@southminster-branch-line.org.uk> writes:

>1) Ultra 60, fresh install of Solaris 9 (first release, whenever that
>was). No problems - Mathematica 5.2 works fine.

>2) Ultra 60, fresh install of Solaris 10 3/05, with no patches. Problem
>present.

Ok, so this is an issue introduced in S10.

>I don't personally have access to a more modern machine with Solaris 10
>on it. The machines either have Solaris 10 and are pretty old, or are
>new and have Solaris 9.

I have no idea how this could be a CPU specific issue.

>So it seems you need all these 3 have a problem.

>1) Mathematica 5.1 or 5.2. Both 4.0 and 4.2 are unaffected. I can't say
>for Mathematica 5, as I don't have it installed.

>2) Solaris 10 (either release, with or without patches)

>3) UltraSPARC II or IIe CPU. Given there is only one Solaris 10 machine
>that does not suffer the problem, it may be unwise to say only
>UltraSPARC II and IIe are affected.

>Any thoughts on how one might be able to get a workaround? Using truss
>on the CPU bound process (MathKernel), I found this a few weeks ago.

>http://groups.google.co.uk/group/comp.unix.solaris/browse_frm/thread/3f751c580fffde23/35233d67d579039f?lnk=raot&hl=en#35233d67d579039f

Looping in poll, seems like? prstat confirms that /3 is the looping
thread?

Casper

Dave (from the UK)

unread,
Mar 26, 2006, 6:17:43 AM3/26/06
to
Casper H.S. Dik wrote:

>>I don't personally have access to a more modern machine with Solaris 10
>>on it. The machines either have Solaris 10 and are pretty old, or are
>>new and have Solaris 9.

I thought it a bit odd.

Could Solaris be making use of instructions that are present in the
Blade 1000's processor for maximum speed, but emulating them when the
CPU does not have them?

Perhaps bits of Solaris are compiled with -xarch=v8plusb to generate
code for the UltrSPARC III processor?

>>Any thoughts on how one might be able to get a workaround? Using truss
>>on the CPU bound process (MathKernel), I found this a few weeks ago.
>
>
>>http://groups.google.co.uk/group/comp.unix.solaris/browse_frm/thread/3f751c580fffde23/35233d67d579039f?lnk=raot&hl=en#35233d67d579039f
>
>
> Looping in poll, seems like? prstat confirms that /3 is the looping
> thread?
>
> Casper

prstat does indeed indicate that. Here is is on my Ultra 80 after
computing 1+1=2. This has 4 CPUs, so the 25% basically means it's CPU
bound.

sparrow /export/home/drkirkby % prstat
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
3743 drkirkby 63M 18M cpu3 10 0 0:02:45 25% MathKernel/3
2418 drkirkby 169M 102M sleep 14 0 0:09:20 2.8% mozilla-bin/3
2276 drkirkby 72M 17M sleep 41 0 0:00:55 1.2% metacity/1

Casper H.S. Dik

unread,
Mar 26, 2006, 3:39:25 PM3/26/06
to
"Dave (from the UK)" <see-my-s...@southminster-branch-line.org.uk> writes:

>Could Solaris be making use of instructions that are present in the
>Blade 1000's processor for maximum speed, but emulating them when the
>CPU does not have them?

No, it generally refuses to execute such binaries (with the exception
if the V8 instructions V7 hardware and VIS instructions on some
hardware.

>Perhaps bits of Solaris are compiled with -xarch=v8plusb to generate
>code for the UltrSPARC III processor?

No, except bits which are only meant to run on US-III CPUs.

>prstat does indeed indicate that. Here is is on my Ultra 80 after
>computing 1+1=2. This has 4 CPUs, so the 25% basically means it's CPU
>bound.

>sparrow /export/home/drkirkby % prstat
>PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
>3743 drkirkby 63M 18M cpu3 10 0 0:02:45 25% MathKernel/3
>2418 drkirkby 169M 102M sleep 14 0 0:09:20 2.8% mozilla-bin/3
>2276 drkirkby 72M 17M sleep 41 0 0:00:55 1.2% metacity/1

What does this thread do when there is no issue?

And does "truss -v" show anything interesting about the arguments?

E.g., is the timeout different on the platforms.

Casper

Dave (from the UK)

unread,
Mar 26, 2006, 5:05:29 PM3/26/06
to b...@nwowhv.de
Casper H.S. Dik wrote:

>>sparrow /export/home/drkirkby % prstat
>>PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
>>3743 drkirkby 63M 18M cpu3 10 0 0:02:45 25% MathKernel/3
>>2418 drkirkby 169M 102M sleep 14 0 0:09:20 2.8% mozilla-bin/3
>>2276 drkirkby 72M 17M sleep 41 0 0:00:55 1.2% metacity/1
>
>
> What does this thread do when there is no issue?

Casper,

I'm a bit "out of my depth" here, but I'll try.

First, I don't have access to the Blade 1000, so I can't do any testing
on the Solaris 10 machine which works fine. I assume some truss data
from that would be useful. Perhaps Rainer Beushausen (b...@nwowhv.de) will
post some if he is able to. I've copied this to him, in case he is not
following the thread.

Comparing the truss output from the two machine I have access to:

1) 360 MHz Ultra 60 running Solaris 9 (Mathematica 5.2 OK)
2) 4 x 450 MHz Ultra 80 running Solaris 10. (Mathematica 5.2 has problems)

the difference between the two systems seems to be that on the Solaris 9
box, thread 3 is calling poll(), but on the Solaris 10 box,
thread 3 calls pollsys(). I guess it would be interesting to
know what happens to thread 3 on the Blade 1000 running Solaris 10.

On a Solaris 9 box, the CPU usage is fine, with MathKernel (pid=662)
using 0.0% of the CPU time.

solaris9 % prstat

PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP

713 drkirkby 4256K 3952K cpu0 39 0 0:00:00 0.3% prstat/1
693 drkirkby 6952K 5080K sleep 59 0 0:00:00 0.1% dtterm/1
620 drkirkby 9816K 8056K sleep 49 0 0:00:00 0.1% dtfile/1
695 drkirkby 2768K 2176K sleep 49 0 0:00:00 0.1% tcsh/1
662 drkirkby 62M 18M sleep 59 0 0:00:01 0.0% MathKernel/3

Truss shows a lot of calls to poll().

solaris9 % truss -p 662

662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0


Using the -v option on truss.

solaris9 % truss -v lwp_park -v poll -p 662
662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
662/1: timeout: 0.019998000 sec
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: fd=23 ev=POLLRDNORM rev=0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: fd=23 ev=POLLRDNORM rev=0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: fd=23 ev=POLLRDNORM rev=0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: fd=23 ev=POLLRDNORM rev=0
662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
662/1: timeout: 0.019998000 sec
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: fd=23 ev=POLLRDNORM rev=0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: fd=23 ev=POLLRDNORM rev=0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: fd=23 ev=POLLRDNORM rev=0

I see a timeout of 20ms on lwp_park(), but nothing for poll() - I am
rather lost as to what is happening here.

Whereas on the Solaris 10 box the CPU usage of MathKernel is excessive
(25%, which is basically one CPU flat out)

solaris10 % prstat

PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP

6205 drkirkby 63M 17M cpu2 40 0 0:01:48 25% MathKernel/3
6124 drkirkby 136M 68M sleep 59 0 0:00:58 2.5% mozilla-bin/3
5848 drkirkby 139M 22M sleep 59 0 0:00:23 0.7% Xsun/1
6050 drkirkby 233M 78M sleep 49 0 0:00:15 0.5% java/21
6151 drkirkby 84M 20M sleep 49 0 0:00:06 0.3% gnome-terminal/2

and truss shows repeated calls to pollsys() on thread 3.

solaris10 % truss -p 6205

/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/1: lwp_park(0xFFFFFFFF7FFFE020, 0) Err#62 ETIME
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/1: lwp_park(0xFFFFFFFF7FFFE020, 0) Err#62 ETIME
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/2: pollsys(0xFFFFFFFF79EF9CD0, 1, 0xFFFFFFFF79EF9E40, 0x00000000)
(sleeping...)
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0


> And does "truss -v" show anything interesting about the arguments?
>

I can't seem to see any way of getting a timeout on the Solaris 9 box,
but on the Solaris 10 there is a timeout of 1 us as you can see below,
which is clearly a lot shorter than the 20ms shown on the Solaris 9
machine. But these are not the same functions, so I am not sure.

solaris10 % truss -v pollsys -p 6205
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: fd=23 ev=POLLRDNORM rev=0
/3: timeout: 0.000001000 sec
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: fd=23 ev=POLLRDNORM rev=0
/3: timeout: 0.000001000 sec
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: fd=23 ev=POLLRDNORM rev=0
/3: timeout: 0.000001000 sec
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: fd=23 ev=POLLRDNORM rev=0
/3: timeout: 0.000001000 sec
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: fd=23 ev=POLLRDNORM rev=0
/3: timeout: 0.000001000 sec

Chris Ridd

unread,
Mar 27, 2006, 1:52:52 AM3/27/06
to
On 26/3/06 11:05, in article
44271029...@southminster-branch-line.org.uk, "Dave (from the UK)"
<see-my-s...@southminster-branch-line.org.uk> wrote:

> solaris10 % truss -v pollsys -p 6205
> /3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
> /3: fd=23 ev=POLLRDNORM rev=0
> /3: timeout: 0.000001000 sec

Do you know what fd 23 is associated with? I've a feeling lsof might tell
you this, or maybe dtrace can, or failing all that you'll have to truss the
process starting up and watch for open() calls.

Cheers,

Chris

Dan Foster

unread,
Mar 27, 2006, 3:56:41 AM3/27/06
to

% lsof -p 12171 | awk '$4 ~ /^23/ { print $1 " " $NF }'

should do the trick.

Info on where to get lsof from its author: http://people.freebsd.org/~abe/

I think SunFreeware has a binary lsof, though building from source is
most preferred as it will better match your running kernel's API
interface.

lsof is a small package and builds quickly.

Another option:

% /usr/proc/bin/pfiles 12171

(This will require SUNWesu be installed, and /proc mounted.)

-Dan

Casper H.S. Dik

unread,
Mar 27, 2006, 4:13:32 AM3/27/06
to
"Dave (from the UK)" <see-my-s...@southminster-branch-line.org.uk> writes:

>the difference between the two systems seems to be that on the Solaris 9
>box, thread 3 is calling poll(), but on the Solaris 10 box,
>thread 3 calls pollsys(). I guess it would be interesting to
>know what happens to thread 3 on the Blade 1000 running Solaris 10.

Ok, so in S9 it calls poll with a 1ms wait time; this causes a delay
of around one clock tick (really 10-20ms); so the below trace looks
somewhat strange if the lwp_park tmeout is the actual time slept
(perhaps you can truss with one of the time options)

>662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0

>Using the -v option on truss.

>solaris9 % truss -v lwp_park -v poll -p 662
>662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
>662/1: timeout: 0.019998000 sec
>662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
>662/3: fd=23 ev=POLLRDNORM rev=0
>662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
>662/3: fd=23 ev=POLLRDNORM rev=0
>662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
>662/3: fd=23 ev=POLLRDNORM rev=0
>662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
>662/3: fd=23 ev=POLLRDNORM rev=0

>I see a timeout of 20ms on lwp_park(), but nothing for poll() - I am

>rather lost as to what is happening here.

That's because truss doesn't show timeouts (it's the 3rd argument and
it's in ms)

>Whereas on the Solaris 10 box the CPU usage of MathKernel is excessive
>(25%, which is basically one CPU flat out)

>solaris10 % truss -v pollsys -p 6205


>/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
>/3: fd=23 ev=POLLRDNORM rev=0
>/3: timeout: 0.000001000 sec

Hm, this timeout value looks suspicious.

They might be calling select() with the smallest possible timeout (1us)
which would map to 1ms poll in S9 but 1us pollsys in S10.


Inspecting the code, it seems that pollsys() uses a cv_timedwait() which
uses the real time to compute when it needs to wakeup.

Because the *slow* systems can't manage poll setup in under 1us,
they will hit cv_timedwait when we're past the 1us mark and the
function will return immediately. Fast systems, OTOH, will be able
to make the 1us deadline and cv_timedwait will sleep until the next
clocktick.

If this hypothesis is correct, then perhaps the following preload
will work (preload w/ LD_PRELOAD=...); different options for 64 bit
executable.

/*
* Select roundup preload. (caspe...@you.know.where)
* cc -G -Kpic select_preload.c -o select_preload.so
*
*/


#include <dlfcn.h>
#include <sys/time.h>

#define FUN_PROTO(type,internal,parms) \
type internal parms

#define DECLARE(type,name, parms) static FUN_PROTO(type,(*name), parms)
#define CAST(type, parms) (FUN_PROTO(type,(*), parms))

DECLARE(int,next_select,(int, fd_set *, fd_set *, fd_set *, struct timeval *));

#ifdef __GNUC__
void loadit(void) __attribute__ ((constructor));
#else
#pragma init(loadit)
#endif

void
loadit(void)
{
extern char **environ;
char **env;
int offset;

next_select = CAST(int, (int, fd_set *, fd_set *, fd_set *, struct timeval *
))dlsym(RTLD_NEXT, "select");
}

int select(int nfds, fd_set *restrict readfds, fd_set *restrict writefds,
fd_set *restrict errorfds, struct timeval *restrict timeout)
{

if (timeout != NULL && timeout->tv_sec == 0 && timeout->tv_usec > 0 &&
timeout->tv_usec < 1000)
timeout->tv_usec = 1000;

return (next_select(nfds, readfds, writefds, errorfds, timeout));
}


Casper

Dave (from the UK)

unread,
Mar 27, 2006, 4:42:49 AM3/27/06
to
Chris Ridd wrote:
> On 26/3/06 11:05, in article
> 44271029...@southminster-branch-line.org.uk, "Dave (from the UK)"
> <see-my-s...@southminster-branch-line.org.uk> wrote:
>
>
>>solaris10 % truss -v pollsys -p 6205
>>/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
>>/3: fd=23 ev=POLLRDNORM rev=0
>>/3: timeout: 0.000001000 sec
>
>
> Do you know what fd 23 is associated with?

I certainly did not, but I might have found out what it is - see
below.

> I've a feeling lsof might tell

First, I should explain something. The program causing the problem, which is
Mathematica.

http://www.wolfram.com/products/mathematica/index.html

consists of 4 important parts.

1) A *graphical* front end which is called by the user as
"mathematica" but that is in fact just a shell script
which calls a 32-bit binary "Mathematica".

I'm told this is 32-bit, as there is no advantage in having a
64-bit front end.

2) A command line driven front end.

3) MathKernel, which is a 64-bit binary. This does all the calculations.
Since Mathematica has rather a liking for memory, the kernel is 64-bit.

4) A link called 'mathlink'
http://www.wolfram.com/solutions/mathlink/
whcih allows a front end to communicate with the kernel.

Normally, I start the GUI front end and let it
communicate with the computational kernel over
'mathlink'. But one can write ones own front end if one
wishes.

What is interesting is that if the GUI is used, the MathKernel
uses loads of CPU time. But if the command line interface is used,
the MathKernal does not use excessive CPU time. So changing the front
end causes the kenel to either behave itself or not as the case may be.

The front end and kernel do not have to be run on the same machine, but
I have tried running them on differnet machines, but the problem still exists.

Here is the output from prstat an hour or so after mathemtica was asked to
compute 1+1. As you see, it has used over an hour of CPU time and is
CPU bound on the quad processor machine.

sparrow /export/home/drkirkby % uname -a
SunOS sparrow 5.10 Generic_118833-02 sun4u sparc SUNW,Ultra-80

PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP

8637 drkirkby 63M 17M cpu3 10 0 1:05:39 25% MathKernel/3
6705 drkirkby 236M 168M sleep 49 0 0:38:25 1.5% mozilla-bin/3
6453 drkirkby 180M 78M cpu0 52 0 0:16:31 0.8% Xsun/1
6611 drkirkby 80M 25M sleep 59 0 0:03:38 0.4% metacity/1
8305 drkirkby 205M 105M sleep 49 0 0:04:13 0.4% acroread/1
6688 drkirkby 233M 65M sleep 49 0 0:08:24 0.3% java/21
6758 drkirkby 99M 33M cpu3 49 0 0:01:47 0.2% gnome-terminal/2
13499 drkirkby 4824K 4456K cpu2 57 0 0:00:00 0.1% prstat/1
6630 drkirkby 73M 16M sleep 49 0 0:01:10 0.1% wnck-applet/1
6634 drkirkby 71M 13M sleep 49 0 0:01:04 0.0% gnome-netstatus/1
6609 drkirkby 10M 5096K sleep 59 0 0:00:10 0.0% gnome-smproxy/1
6636 drkirkby 71M 12M sleep 49 0 0:00:20 0.0% mixer_applet2/1
6615 drkirkby 75M 17M sleep 59 0 0:00:14 0.0% gnome-panel/1
190 root 2352K 1160K sleep 100 - 0:03:28 0.0% xntpd/1
13457 drkirkby 3064K 2512K sleep 57 0 0:00:00 0.0% tcsh/1
8607 drkirkby 15M 12M sleep 49 0 0:00:06 0.0% Mathematica/2

Just to check, we again see fd=23 in use.

sparrow /export/home/drkirkby % truss -v pollsys -p 8637


/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: fd=23 ev=POLLRDNORM rev=0
/3: timeout: 0.000001000 sec

/1: lwp_park(0xFFFFFFFF7FFFE020, 0) Err#62 ETIME

/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: fd=23 ev=POLLRDNORM rev=0
/3: timeout: 0.000001000 sec
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: fd=23 ev=POLLRDNORM rev=0
/3: timeout: 0.000001000 sec

Looking with lsof (is this the sort of thing you were thinking of?)

sparrow /export/home/drkirkby % lsof -d 23
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
inetd 433 root 23u VCHR 105,110 55050244 /devices/pseudo/tl@0:ticots->timod->tl
fmd 570 root 23r VREG 32,30 13829 23832 /usr/lib/fm/dict/SUN4U.dict
smbd 776 root 23u FIFO 0x60002699e60 0t0 406 (fifofs) PIPE->0x60002699ef0
Xsun 6453 drkirkby 23u unix 105,8584 0t2438596 55050244 /devices/pseudo/tl@0:ticots->/tmp/.X11-unix/X0 (0x60002db6678)
gnome-ses 6576 drkirkby 23u unix 105,8581 0t11380 55050244 /devices/pseudo/tl@0:ticots->/tmp/.X11-unix/X0 (0x600032b3840->0x60009b8b9c0)
gconfd-2 6585 drkirkby 23u unix 105,8606 0t49294 55050244 /devices/pseudo/tl@0:ticots->/var/tmp/orbit-drkirkby/linc-19b9-0-44276f3ccec37 (0x60002db64e0)
bonobo-ac 6591 drkirkby 23u unix 105,8552 0t0 55050244 /devices/pseudo/tl@0:ticots->/var/tmp/orbit-drkirkby/linc-19b0-0-44270d1feae4c (0x60002db7998->0x600066e2ac0)
gnome-set 6594 drkirkby 23u FIFO 0x60003126aa0 0t0 54056 (fifofs) PIPE->0x60003126b30
gnome-pan 6615 drkirkby 23u unix 105,8697 0t27416 55050244 /devices/pseudo/tl@0:ticots->/var/tmp/orbit-drkirkby/linc-19e6-0-44277c1c4e44 (0x6000314f368->0x60008c16f40)
nautilus 6617 drkirkby 23u FIFO 0x60003127c20 0t0 54104 (fifofs) PIPE->0x60003127cb0
gnome-vfs 6622 drkirkby 23u FIFO 0x60003127a30 0t0 54106 (fifofs) PIPE->0x600031279a0
clock-app 6628 drkirkby 23u FIFO 0x60003127ae0 0t0 54117 (fifofs) PIPE->0x60003127b70
wnck-appl 6630 drkirkby 23u FIFO 0x60003da47a0 0t0 54122 (fifofs) PIPE->0x60003da4830
notificat 6632 drkirkby 23u FIFO 0x60003126320 0t0 54127 (fifofs) PIPE->0x600031263b0
gnome-net 6634 drkirkby 23u FIFO 0x60002f5b200 0t0 54132 (fifofs) PIPE->0x60002f5b290
mixer_app 6636 drkirkby 23u FIFO 0x60003da52e0 0t0 54137 (fifofs) PIPE->0x60003da5370
mozilla-b 6705 drkirkby 23wW VREG 32,7 0 460098 /export/home (/dev/dsk/c0t1d0s7)
gnome-ter 6758 drkirkby 23u FIFO 0x600026984b0 0t0 54281 (fifofs) PIPE->0x60002698420
acroread 8305 drkirkby 23r VREG 32,30 644551 7092 /./usr/bin/../lib/AdobeReader/Reader/sparcsolaris/plug_ins/SendMail.api
Mathemati 8607 drkirkby 23u FIFO 0x600026991e0 0t2 55844 (fifofs) PIPE->0x60002699270
MathKerne 8637 drkirkby 23u FIFO 0x60003da5060 0t2 55843 (fifofs) PIPE->0x60003da50f0


I assume there are multiple programs all with fd=23 in use. Most of them
are irrelevant I guess.

The last two lines of lsof output refer to processes Mathematica and MathKernal
and what seems to be a pipe. This makes sense for the reasons I explained about the
mathlink protocol connecting them. I'm a bit surprised there is nothing in the lsof
output to indicate that these are connected together. The DEVICE and NAME
of the last two lines does not agree. Should they? Perhaps there have lost
communication, although that seems unlikely, as the front end can still
get calculations done.

This is making me think. Could the problem be that the GUI and front end have a
timeout which is too short on the Ultra 80, but is sufficiently long on the faster
Blade 1000? That could explain the fact it works on the faster machine, but not
on the slower one.

Dave (from the UK)

unread,
Mar 27, 2006, 7:33:55 AM3/27/06
to
Casper H.S. Dik wrote:
>>solaris10 % truss -v pollsys -p 6205
>>/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
>>/3: fd=23 ev=POLLRDNORM rev=0
>>/3: timeout: 0.000001000 sec
>
>
> Hm, this timeout value looks suspicious.

> They might be calling select() with the smallest possible timeout (1us)
> which would map to 1ms poll in S9 but 1us pollsys in S10.

Ah, that makes some sense.

> Inspecting the code, it seems that pollsys() uses a cv_timedwait() which
> uses the real time to compute when it needs to wakeup.

Cheers.

> Because the *slow* systems can't manage poll setup in under 1us,
> they will hit cv_timedwait when we're past the 1us mark and the
> function will return immediately. Fast systems, OTOH, will be able
> to make the 1us deadline and cv_timedwait will sleep until the next
> clocktick.
>
> If this hypothesis is correct, then perhaps the following preload
> will work (preload w/ LD_PRELOAD=...); different options for 64 bit
> executable.
>
> /*
> * Select roundup preload. (caspe...@you.know.where)
> * cc -G -Kpic select_preload.c -o select_preload.so
> *
> */

<snip>

Thanks for that. I have compiled it 64 bit, as the Mathematica kernel is 64-bit.

% cc -xtarget=ultra -xarch=v9 -G -Kpic select_preload.c -o select_preload.so
su
# mv select_preload.so /usr/local/lib/

The 64-bit kernal gets started by the front end. The front end calls
a script which starts the kernel. The script ends in the line:

exec "${MathKernel}" "$@"

I'm not sure how I should alter that line, but I tried:

LD_PRELOAD=/usr/local/lib/select_preload.so exec "${MathKernel}" "$@"

That did not effect the CPU usage.

I guess there might be some point in starting the kernal with
truss and logging the data?

Casper H.S. Dik

unread,
Mar 27, 2006, 8:04:39 AM3/27/06
to
"Dave (from the UK)" <see-my-s...@southminster-branch-line.org.uk> writes:

>Thanks for that. I have compiled it 64 bit, as the Mathematica kernel is 64-bit.

>% cc -xtarget=ultra -xarch=v9 -G -Kpic select_preload.c -o select_preload.so
>su
># mv select_preload.so /usr/local/lib/

>The 64-bit kernal gets started by the front end. The front end calls
>a script which starts the kernel. The script ends in the line:

>exec "${MathKernel}" "$@"

>I'm not sure how I should alter that line, but I tried:

>LD_PRELOAD=/usr/local/lib/select_preload.so exec "${MathKernel}" "$@"

Try:

LD_PRELOAD_64=...

also, the above syntax does not work with "exec":

FOO=bar exec sh
echo $FOO
# Nothing

>That did not effect the CPU usage.

So try again:

LD_PRELOAD_64=/usr/local/lib/select_preload.so
export LD_PRELOAD_64
exec "${MathKernel}" "$@"

>I guess there might be some point in starting the kernal with


>truss and logging the data?

Sure.

Casper

Dave (from the UK)

unread,
Mar 27, 2006, 8:25:34 AM3/27/06
to
Casper H.S. Dik wrote:

>>That did not effect the CPU usage.
>
>
> So try again:
>
> LD_PRELOAD_64=/usr/local/lib/select_preload.so
> export LD_PRELOAD_64
> exec "${MathKernel}" "$@"

Cheers, you are a genius, that works.

CPU usage is now sensible

sparrow / % prstat


PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP

6705 drkirkby 235M 168M sleep 49 0 0:58:42 1.2% mozilla-bin/3
6453 drkirkby 189M 87M sleep 59 0 0:28:24 0.9% Xsun/1
6758 drkirkby 103M 41M sleep 59 0 0:05:22 0.8% gnome-terminal/2
8305 drkirkby 205M 103M sleep 49 0 0:08:30 0.4% acroread/1
6688 drkirkby 233M 64M sleep 49 0 0:11:51 0.3% java/21
20422 drkirkby 63M 18M sleep 59 0 0:00:04 0.3% MathKernel/3
6611 drkirkby 80M 25M sleep 59 0 0:08:05 0.2% metacity/1
6617 drkirkby 76M 18M sleep 59 0 0:00:18 0.2% nautilus/5
6630 drkirkby 74M 16M sleep 49 0 0:03:46 0.2% wnck-applet/1

MathKernel has only used 4 seconds despite having been run a
few minutes and done a couple of simple calculations.

You always seem to be able to solve any problems I have with commercial
software (remember the 'ib' driver for the National Instruments
GPIB card?) that the vendors can't work out.


>>I guess there might be some point in starting the kernal with
>>truss and logging the data?
>
>
> Sure.

Not much point now!!

>
> Casper


Cheers, Dave

Casper H.S. Dik

unread,
Mar 27, 2006, 9:16:40 AM3/27/06
to
"Dave (from the UK)" <see-my-s...@southminster-branch-line.org.uk> writes:

>Cheers, you are a genius, that works.

Thanks :-)

>You always seem to be able to solve any problems I have with commercial
>software (remember the 'ib' driver for the National Instruments
>GPIB card?) that the vendors can't work out.

I remember. It's not entirely fair to blame the vendors for not
understanding Solaris as well as a Solaris engineer.

In this particular case, I think it's a real Solaris bug:

6404383 select() behaviour changed in Solaris 10, breaking binary compatibility

The "1us" select() was understood as "poll for the shortest time possible"
and not poll and return immediately.

>Not much point now!!

Indeed.

as you can see, maintaining binary compatibility is *really* hard because
every change can be an incompatible one.

In Solaris 9 and before, we had just the "poll" system call as a
building block for select() and this caused the time to be round up
to 1ms.

In Solaris 10 came pselect() which uses a "struct timespec" (nano seconds)
and which adds a signal mask. The signal mask required a new interface
so we added pollsys() which is based on poll but carries a signal mask
and a struct timespec argument. poll, select and pselect were (re)written
to use the new interface and so select() now passes 1us to the kernel.
This then causes a failure on slow hardware (but on a 1.2GHz it also
can't always doe the pre-work in under 1us)

I think that we will round up the select timeout to 1ms again in the
library for select() (not for pselect() as the new code can be written
to live with this behaviour)

The program below should finish in 5 seconds, but runs much more quickly
on slower hardware.

/*
* Demo bug 6404383: program finishes much more quickly on slow
* system, pegging CPU at 100%.
*/
#include <sys/time.h>
#include <sys/param.h>
#include <stdio.h>
#include <string.h>

int
main(int argc, char **argv)
{
static struct timeval tv = { 0, 1 };
hrtime_t start, end;
int i;
int count = HZ * 5;
fd_set fdr;
int s1, sx;

FD_ZERO(&fdr);
start = gethrtime();
for (i = 0; i < count; i++) {
FD_SET(0, &fdr);
(void) select(3, &fdr, NULL, NULL, &tv);
}
end = gethrtime();

s1 = (int)((end - start)/count);
sx = 1000000000/HZ;
(void) printf("ns/call = %d, expected = %d (%.2f times too fast)\n",
s1, sx, (double)sx/s1);

Dave (from the UK)

unread,
Mar 27, 2006, 9:55:19 AM3/27/06
to
Casper H.S. Dik wrote:
> "Dave (from the UK)" <see-my-s...@southminster-branch-line.org.uk> writes:
>
>
>>Cheers, you are a genius, that works.
>
>
> Thanks :-)
>
>
>>You always seem to be able to solve any problems I have with commercial
>>software (remember the 'ib' driver for the National Instruments
>>GPIB card?) that the vendors can't work out.
>
>
> I remember. It's not entirely fair to blame the vendors for not
> understanding Solaris as well as a Solaris engineer.
>

True.

And in the 'ib' case, it seems to me Sun should have an agreed
convention for naming drivers, not just people picking names at random.

> In this particular case, I think it's a real Solaris bug:

I'm sure Wolfram Research will be glad to know Mathematica is not at fault!

> 6404383 select() behaviour changed in Solaris 10, breaking binary compatibility

Is that a bug that was already in the Sun database, or one you submitted
today?

> as you can see, maintaining binary compatibility is *really* hard because
> every change can be an incompatible one.

Yes, I can see that. But if you don't make changes, you don't make
progress.

I'll post some sort of summary of this to comp.soft-sys.math.mathematica
and sci.math.symbolic. It will take some time to appear, as
comp.soft-sys.math.mathematica is moderated. I can't say I am too keen
on moderated newsgroups myself. (In case you don't know, the moderator
of comp.soft-sys.math.mathematica is Steve of the Sunfreeware site)

Dave (from the UK)

unread,
Mar 27, 2006, 10:33:11 AM3/27/06
to

> * cc -G -Kpic select_preload.c -o select_preload.so
> *
> */
>
<snip>

Thanks for that. I have compiled it 64 bit, as the Mathematica kernel
is 64-bit.

% cc -xtarget=ultra -xarch=v9 -G -Kpic select_preload.c -o
select_preload.so
su
# mv select_preload.so /usr/local/lib/

The 64-bit kernal gets started by the front end. The front end calls
a script which starts the kernel. The script ends in the line:

exec "${MathKernel}" "$@"

I'm not sure how I should alter that line, but I tried:

LD_PRELOAD=/usr/local/lib/select_preload.so exec "${MathKernel}" "$@"

That did not effect the CPU usage.

I guess there might be some point in starting the kernal with


truss and logging the data?

--

Casper H.S. Dik

unread,
Mar 27, 2006, 10:34:06 AM3/27/06
to
"Dave (from the UK)" <see-my-s...@southminster-branch-line.org.uk> writes:

>> 6404383 select() behaviour changed in Solaris 10, breaking binary compatibility

>Is that a bug that was already in the Sun database, or one you submitted
>today?

Submitted today.

>Yes, I can see that. But if you don't make changes, you don't make
>progress.

Quite.

>I'll post some sort of summary of this to comp.soft-sys.math.mathematica
>and sci.math.symbolic. It will take some time to appear, as
>comp.soft-sys.math.mathematica is moderated. I can't say I am too keen
>on moderated newsgroups myself. (In case you don't know, the moderator
>of comp.soft-sys.math.mathematica is Steve of the Sunfreeware site)

Yeah, they don't work too well, IMHO.

Casper

0 new messages