
threading question


og...@gene.pbi.nrc.ca

Jun 12, 2001, 2:30:11 PM
Hello,

I am a summer student implementing a multi-threaded version of a very
popular bioinformatics tool. So far it compiles and runs without problems
(as far as I can tell ;) on Linux 2.2.x, Sun Solaris, SGI IRIX and Compaq
OSF/1 running on Alpha. I have run a lot of timing tests comparing it to
the sequential version of the tool on all of these machines (most of them
are dual-CPU, although I am also running tests on a 12-CPU Solaris box and
a 108 CPU SGI IRIX). On dual-CPU machines the speedups are as follows: my
version is 1.88 times faster than the sequential one on IRIX, 1.81 times
on Solaris, 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times
on the Linux 2.4 kernel. Why are the numbers on the Linux machines so much
lower? It is the same multi-threaded code and I am not using any tricks;
the code basically uses PTHREAD_CREATE_DETACHED and PTHREAD_SCOPE_SYSTEM,
and the thread stack size is set to 8K (but the numbers are the same with
larger/smaller stack sizes).
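
For readers unfamiliar with the pthreads attribute calls, the setup described
above would look roughly like this (a minimal sketch, not the tool's actual
code; the worker function and job argument are placeholders):

#include <pthread.h>

/* Placeholder worker -- stands in for one alignment job. */
static void *worker(void *job)
{
        /* ... do one pairwise alignment ... */
        return NULL;
}

/* Create one detached, system-scope thread with an 8K stack, as described
   above.  Error handling omitted; note that some platforms reject stack
   sizes below PTHREAD_STACK_MIN. */
static int spawn_worker(void *job)
{
        pthread_t tid;
        pthread_attr_t attr;
        int ret;

        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
        pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
        pthread_attr_setstacksize(&attr, 8 * 1024);
        ret = pthread_create(&tid, &attr, worker, job);
        pthread_attr_destroy(&attr);
        return ret;
}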

Is there anything I am missing? Is this to be expected due to the Linux way
of handling threads (the clone call)? I am just trying to explain the
numbers and nothing else comes to mind...

Best regards,
Ognen Duzlevski
--
og...@gene.pbi.nrc.ca
Plant Biotechnology Institute
National Research Council of Canada
Bioinformatics team


Davide Libenzi

Jun 12, 2001, 2:40:07 PM

On 12-Jun-2001 og...@gene.pbi.nrc.ca wrote:
> Hello,
>
> I am a summer student implementing a multi-threaded version of a very
> popular bioinformatics tool. So far it compiles and runs without problems
> (as far as I can tell ;) on Linux 2.2.x, Sun Solaris, SGI IRIX and Compaq
> OSF/1 running on Alpha. I have ran a lot of timing tests compared to the
> sequential version of the tool on all of these machines (most of them are
> dual-CPU, although I am also running tests on 12-CPU Solaris and 108 CPU
> SGI IRIX). On dual-CPU machines the speedups are as follows: my version
> is 1.88 faster than the sequential one on IRIX, 1.81 times on Solaris,
> 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux 2.4
> kernel. Why are the numbers on Linux machines so much lower? It is the
> same multi-threaded code, I am not using any tricks, the code basically
> uses PTHREAD_CREATE_DETACHED and PTHREAD_SCOPE_SYSTEM and the thread stack
> size is set to 8K (but the numbers are the same with larger/smaller stack
> sizes).
>
> Is there anything I am missing? Is this to be expected due to Linux way of
> handling threads (clone call)? I am just trying to explain the numbers and
> nothing else comes to mind....

How is your vmstat while your tool is running ?

- Davide

Olivier Sessink

Jun 12, 2001, 3:00:17 PM
Hi all,

Today my girlfriend reported that all programs accessing my
NFS-mounted drive were crashing. I use Linux 2.4.5 on the client
with these .config options (for NFS):
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
# CONFIG_ROOT_NFS is not set
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
CONFIG_SUNRPC=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y

The server is a very old install, running user-space NFS daemon:
fender:~$ /usr/sbin/rpc.nfsd --version
Universal NFS Server 2.2beta41

When running dmesg on the client I got this output:

Code: 0f 0b 83 c4 0c f6 83 f4 00 00 00 10 75 19 68 e8 01 00 00 68
kernel BUG at inode.c:486!
invalid operand: 0000
CPU: 0
EIP: 0010:[<c013fffb>]
EFLAGS: 00010286
eax: 0000001b ebx: cc703ba0 ecx: 00000001 edx: c025ba84
esi: c025ec60 edi: c976eac0 ebp: c32fdfa4 esp: c32fdeec
ds: 0018 es: 0018 ss: 0018
Process gmc (pid: 1193, stackpage=c32fd000)
Stack: c021b86d c021b8cc 000001e6 cc703ba0 c01409c7 cc703ba0 cceee320
cc703ba0
c015e62a cc703ba0 c013e5d6 cceee320 cc703ba0 cceee320 00000000
c013723c
cceee320 c32fdf68 c013795a c976eac0 c32fdf68 00000000 c8587000
00000000
Call Trace: [<c01409c7>] [<c015e62a>] [<c013e5d6>] [<c013723c>] [<c013795a>]
[<c0137f68>] [<c0135276>]
[<c0106a7b>] [<c010002b>]

Code: 0f 0b 83 c4 0c f6 83 f4 00 00 00 10 75 19 68 e8 01 00 00 68
kernel BUG at inode.c:486!
invalid operand: 0000
CPU: 0
EIP: 0010:[<c013fffb>]
EFLAGS: 00010286
eax: 0000001b ebx: c62eb840 ecx: 00000001 edx: c025ba84
esi: c025ec60 edi: c976eac0 ebp: c7135fa4 esp: c7135eec
ds: 0018 es: 0018 ss: 0018
Process gmc (pid: 1239, stackpage=c7135000)
Stack: c021b86d c021b8cc 000001e6 c62eb840 c01409c7 c62eb840 cf7285e0
c62eb840
c015e62a c62eb840 c013e5d6 cf7285e0 c62eb840 cf7285e0 00000000
c013723c
cf7285e0 c7135f68 c013795a c976eac0 c7135f68 00000000 c89ac000
00000000
Call Trace: [<c01409c7>] [<c015e62a>] [<c013e5d6>] [<c013723c>] [<c013795a>]
[<c0137f68>] [<c0135276>]
[<c0106a7b>] [<c010002b>]

Code: 0f 0b 83 c4 0c f6 83 f4 00 00 00 10 75 19 68 e8 01 00 00 68
kernel BUG at inode.c:486!
invalid operand: 0000
CPU: 0
EIP: 0010:[<c013fffb>]
EFLAGS: 00010286
eax: 0000001b ebx: c62ebde0 ecx: 00000001 edx: c025ba84
esi: c025ec60 edi: c976eac0 ebp: c32fdfa4 esp: c32fdeec
ds: 0018 es: 0018 ss: 0018
Process gmc (pid: 1243, stackpage=c32fd000)
Stack: c021b86d c021b8cc 000001e6 c62ebde0 c01409c7 c62ebde0 cf7288e0
c62ebde0
c015e62a c62ebde0 c013e5d6 cf7288e0 c62ebde0 cf7288e0 00000000
c013723c
cf7288e0 c32fdf68 c013795a c976eac0 c32fdf68 00000000 c55df000
00000000
Call Trace: [<c01409c7>] [<c015e62a>] [<c013e5d6>] [<c013723c>] [<c013795a>]
[<d8e7dda3>] [<c0137f68>]
[<c0135276>] [<c0106a7b>] [<c010002b>]

Code: 0f 0b 83 c4 0c f6 83 f4 00 00 00 10 75 19 68 e8 01 00 00 68

I have no idea what this means; to me it looks serious, so I decided to
post it to the kernel mailing list. Is this a real bug? If I need to provide
more detailed information, please tell me what you need and how to get it.

thanks,
Olivier Sessink

Christoph Hellwig

Jun 12, 2001, 3:10:06 PM
In article <Pine.LNX.4.30.010612...@gene.pbi.nrc.ca> you wrote:
> On dual-CPU machines the speedups are as follows: my version
> is 1.88 faster than the sequential one on IRIX, 1.81 times on Solaris,
> 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux 2.4
> kernel. Why are the numbers on Linux machines so much lower?

Does your measurement include the time needed to actually create the
threads or do you even frequently create and destroy threads?

The code for creating threads is _horribly_ slow in Linuxthreads due
to the way it is implemented.

> It is the
> same multi-threaded code, I am not using any tricks, the code basically
> uses PTHREAD_CREATE_DETACHED and PTHREAD_SCOPE_SYSTEM and the thread stack
> size is set to 8K (but the numbers are the same with larger/smaller stack
> sizes).
>
> Is there anything I am missing? Is this to be expected due to Linux way of
> handling threads (clone call)? I am just trying to explain the numbers and
> nothing else comes to mind....

Linuxthreads is a rather bad pthreads implementation performance-wise,
mostly due to the rather different linux-native threading model.

Christoph

--
Of course it doesn't work. We've performed a software upgrade.

og...@gene.pbi.nrc.ca

Jun 12, 2001, 3:20:05 PM
Hello,

due to the nature of the problem (a pairwise mutual alignment of n
sequences results in at most n^2 alignments, each of which can be done in a
separate thread), I need to create and destroy threads frequently.

I am not really comfortable with 1.4 - 1.5x speedups since the solution was
intended primarily as a Linux one, and it just happened that it works (and
now even better) on Solaris/SGI/OSF...

Best regards,
Ognen

On Tue, 12 Jun 2001, Christoph Hellwig wrote:

> In article <Pine.LNX.4.30.010612...@gene.pbi.nrc.ca> you wrote:
> > On dual-CPU machines the speedups are as follows: my version
> > is 1.88 faster than the sequential one on IRIX, 1.81 times on Solaris,
> > 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux 2.4
> > kernel. Why are the numbers on Linux machines so much lower?
>
> Does your measurement include the time needed to actually create the
> threads or do you even frequently create and destroy threads?
>
> The code for creating threads is _horribly_ slow in Linuxthreads due
> to the way it is implemented.
>
> > It is the
> > same multi-threaded code, I am not using any tricks, the code basically
> > uses PTHREAD_CREATE_DETACHED and PTHREAD_SCOPE_SYSTEM and the thread stack
> > size is set to 8K (but the numbers are the same with larger/smaller stack
> > sizes).
> >
> > Is there anything I am missing? Is this to be expected due to Linux way of
> > handling threads (clone call)? I am just trying to explain the numbers and
> > nothing else comes to mind....
>
> Linuxthreads is a rather bad pthreads implementation performance-wise,
> mostly due to the rather different linux-native threading model.
>
> Christoph


Kip Macy

Jun 12, 2001, 3:20:07 PM
This may sound like flamebait, but it's not. Linux threads are basically
just processes that share the same address space. Their performance is
measurably worse than it is on most commercial Unixes and FreeBSD.
They are not, or at least two years ago were not, POSIX-compliant
(they behaved badly with respect to signals). The impoverished
implementation of threads is not an accidental oversight; threads are not
looked upon favorably by most of the core Linux kernel hackers. A quote
from Larry McVoy's home page, attributed to Alan Cox, illustrates this
reasonably well: "A computer is a state machine. Threads are for people
who can't program state machines." Sorry for not being more helpful.

-Kip

Alexander Viro

Jun 12, 2001, 3:20:11 PM

On Tue, 12 Jun 2001, Kip Macy wrote:

> implementation of threads is not an accidental oversight, threads are not
> looked upon favorably by most of the core linux kernel hackers. A quote

s/threads/POSIX threads/.

Kip Macy

Jun 12, 2001, 3:20:13 PM
For heavy threading, try a user-level threads package.

-Kip


On Tue, 12 Jun 2001 og...@gene.pbi.nrc.ca wrote:

Christoph Hellwig

Jun 12, 2001, 3:20:14 PM
On Tue, Jun 12, 2001 at 01:07:11PM -0600, og...@gene.pbi.nrc.ca wrote:
> Hello,
>
> due to the nature of the problem (a pairwise mutual alignment of n
> sequences results in mx. n^2 alignments which can each be done in a
> separate thread), I need to create and destroy the threads frequently.
>
> I am not really comfortable with 1.4 - 1.5 speedups since the solution was
> intended as a Linux one primarily and it just happenned that it works (and
> now even better) on Solaris/SGI/OSF...

If you heavily create threads under load you're rather screwed. If you want
to stay with the (IMHO rather suboptimal) POSIX threads API you might want
to take a look at the stuff IBM has produced:

http://oss.software.ibm.com/developerworks/projects/pthreads/

Otherwise a simple wrapper for clone might be a _lot_ faster, but it has its
own disadvantages: no ready-to-use locking primitives and no cross-platform
support (OK, it should be easy to port to the FreeBSD rfork).
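
For illustration only, such a wrapper might look roughly like this (a sketch
assuming the glibc clone() wrapper on x86; the stack is simply leaked here,
and none of the pthreads machinery -- locking, join, thread-specific data --
is provided):

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

#define STACK_SIZE (64 * 1024)

/* Minimal "thread" via clone(): shares VM, filesystem info, file
   descriptors and signal handlers with the parent. */
static int spawn(int (*fn)(void *), void *arg)
{
        char *stack = malloc(STACK_SIZE);

        if (!stack)
                return -1;
        /* On x86 the stack grows down, so pass the top of the block. */
        return clone(fn, stack + STACK_SIZE,
                     CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, arg);
}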

Christoph

--
Of course it doesn't work. We've performed a software upgrade.

Russell Leighton

Jun 12, 2001, 3:30:12 PM


Any recommendations for alternate threading packages?

Alexander Viro wrote:

--
---------------------------------------------------
Russell Leighton russell....@247media.com
---------------------------------------------------

Christoph Hellwig

Jun 12, 2001, 3:40:05 PM
In article <Pine.GSO.4.10.101061...@orbit-fe.eng.netapp.com> you wrote:
> For heavy threading, try a user-level threads package.

Sure, userlevel threading is the best way to get SMP-scalability...

--
Of course it doesn't work. We've performed a software upgrade.

Olivier Sessink

Jun 12, 2001, 4:20:10 PM
Hi all,

this is attempt two; I just read the mailing list FAQ and used ksymoops to
translate the symbols.

Today my girlfriend reported that all programs accessing my
NFS-mounted drive were crashing. Before this, she had done
a lot of deleting and moving around of files on the NFS drive.

I use Debian Woody with Linux 2.4.5 on the client:
olivier@aria:~ $ uname -a
Linux aria 2.4.5 #4 Mon May 28 18:19:37 CEST 2001 i686 unknown
I built the kernel myself using this gcc:
olivier@aria:~ $ gcc -v
Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
gcc version 2.95.4 20010319 (Debian prerelease)
I used these .config options (for NFS):


CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
# CONFIG_ROOT_NFS is not set
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
CONFIG_SUNRPC=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y

The client is an ABIT BE6 motherboard, PIII 450, 256 MB RAM, an
IBM-DTLA-307045 IDE disk on the onboard HPT366 controller running
UDMA4, an SMC EtherPower II NIC running full-duplex 100 Mbit, an NCR860
SCSI board, a Miro PCTV TV card and an old Ensoniq Soundscape ISA
card.

The server is a very old install, running user-space NFS daemon:
fender:~$ /usr/sbin/rpc.nfsd --version
Universal NFS Server 2.2beta41

The server doesn't show any warning in any logfile.

this is the output from dmesg, used ksymoops to decode the symbols:

Warning (compare_maps): ksyms_base symbol
__VERSIONED_SYMBOL(shmem_file_setup) not found in System.map. Ignoring
ksyms_base entry


Code: 0f 0b 83 c4 0c f6 83 f4 00 00 00 10 75 19 68 e8 01 00 00 68

Using defaults from ksymoops -t elf32-i386 -a i386

Code; 00000000 Before first symbol
00000000 <_EIP>:
Code; 00000000 Before first symbol
0: 0f 0b ud2a
Code; 00000002 Before first symbol
2: 83 c4 0c add $0xc,%esp
Code; 00000005 Before first symbol
5: f6 83 f4 00 00 00 10 testb $0x10,0xf4(%ebx)
Code; 0000000c Before first symbol
c: 75 19 jne 27 <_EIP+0x27> 00000027 Before
first symbol
Code; 0000000e Before first symbol
e: 68 e8 01 00 00 push $0x1e8
Code; 00000013 Before first symbol
13: 68 00 00 00 00 push $0x0

kernel BUG at inode.c:486!
invalid operand: 0000
CPU: 0
EIP: 0010:[<c013fffb>]
EFLAGS: 00010286
eax: 0000001b ebx: cc703ba0 ecx: 00000001 edx: c025ba84
esi: c025ec60 edi: c976eac0 ebp: c32fdfa4 esp: c32fdeec
ds: 0018 es: 0018 ss: 0018
Process gmc (pid: 1193, stackpage=c32fd000)
Stack: c021b86d c021b8cc 000001e6 cc703ba0 c01409c7 cc703ba0 cceee320
cc703ba0
c015e62a cc703ba0 c013e5d6 cceee320 cc703ba0 cceee320 00000000
c013723c
cceee320 c32fdf68 c013795a c976eac0 c32fdf68 00000000 c8587000
00000000
Call Trace: [<c01409c7>] [<c015e62a>] [<c013e5d6>] [<c013723c>] [<c013795a>]
[<c0137f68>] [<c0135276>]
[<c0106a7b>] [<c010002b>]
Code: 0f 0b 83 c4 0c f6 83 f4 00 00 00 10 75 19 68 e8 01 00 00 68

>>EIP; c013fffb <clear_inode+33/f4> <=====
Trace; c01409c7 <iput+137/14c>
Trace; c015e62a <nfs_dentry_iput+22/28>
Trace; c013e5d6 <dput+d6/144>
Trace; c013723c <cached_lookup+48/54>
Trace; c013795a <path_walk+536/78c>
Trace; c0137f68 <__user_walk+3c/58>
Trace; c0135276 <sys_lstat64+16/70>
Trace; c0106a7b <system_call+33/38>
Trace; c010002b <startup_32+2b/a5>
Code; c013fffb <clear_inode+33/f4>
00000000 <_EIP>:
Code; c013fffb <clear_inode+33/f4> <=====
0: 0f 0b ud2a <=====
Code; c013fffd <clear_inode+35/f4>
2: 83 c4 0c add $0xc,%esp
Code; c0140000 <clear_inode+38/f4>
5: f6 83 f4 00 00 00 10 testb $0x10,0xf4(%ebx)
Code; c0140007 <clear_inode+3f/f4>
c: 75 19 jne 27 <_EIP+0x27> c0140022
<clear_inode+5a/f4>
Code; c0140009 <clear_inode+41/f4>
e: 68 e8 01 00 00 push $0x1e8
Code; c014000e <clear_inode+46/f4>
13: 68 00 00 00 00 push $0x0

kernel BUG at inode.c:486!
invalid operand: 0000
CPU: 0
EIP: 0010:[<c013fffb>]
EFLAGS: 00010286
eax: 0000001b ebx: c62eb840 ecx: 00000001 edx: c025ba84
esi: c025ec60 edi: c976eac0 ebp: c7135fa4 esp: c7135eec
ds: 0018 es: 0018 ss: 0018
Process gmc (pid: 1239, stackpage=c7135000)
Stack: c021b86d c021b8cc 000001e6 c62eb840 c01409c7 c62eb840 cf7285e0
c62eb840
c015e62a c62eb840 c013e5d6 cf7285e0 c62eb840 cf7285e0 00000000
c013723c
cf7285e0 c7135f68 c013795a c976eac0 c7135f68 00000000 c89ac000
00000000
Call Trace: [<c01409c7>] [<c015e62a>] [<c013e5d6>] [<c013723c>] [<c013795a>]
[<c0137f68>] [<c0135276>]
[<c0106a7b>] [<c010002b>]
Code: 0f 0b 83 c4 0c f6 83 f4 00 00 00 10 75 19 68 e8 01 00 00 68

>>EIP; c013fffb <clear_inode+33/f4> <=====
Trace; c01409c7 <iput+137/14c>
Trace; c015e62a <nfs_dentry_iput+22/28>
Trace; c013e5d6 <dput+d6/144>
Trace; c013723c <cached_lookup+48/54>
Trace; c013795a <path_walk+536/78c>
Trace; c0137f68 <__user_walk+3c/58>
Trace; c0135276 <sys_lstat64+16/70>
Trace; c0106a7b <system_call+33/38>
Trace; c010002b <startup_32+2b/a5>
Code; c013fffb <clear_inode+33/f4>
00000000 <_EIP>:
Code; c013fffb <clear_inode+33/f4> <=====
0: 0f 0b ud2a <=====
Code; c013fffd <clear_inode+35/f4>
2: 83 c4 0c add $0xc,%esp
Code; c0140000 <clear_inode+38/f4>
5: f6 83 f4 00 00 00 10 testb $0x10,0xf4(%ebx)
Code; c0140007 <clear_inode+3f/f4>
c: 75 19 jne 27 <_EIP+0x27> c0140022
<clear_inode+5a/f4>
Code; c0140009 <clear_inode+41/f4>
e: 68 e8 01 00 00 push $0x1e8
Code; c014000e <clear_inode+46/f4>
13: 68 00 00 00 00 push $0x0

kernel BUG at inode.c:486!
invalid operand: 0000
CPU: 0
EIP: 0010:[<c013fffb>]
EFLAGS: 00010286
eax: 0000001b ebx: c62ebde0 ecx: 00000001 edx: c025ba84
esi: c025ec60 edi: c976eac0 ebp: c32fdfa4 esp: c32fdeec
ds: 0018 es: 0018 ss: 0018
Process gmc (pid: 1243, stackpage=c32fd000)
Stack: c021b86d c021b8cc 000001e6 c62ebde0 c01409c7 c62ebde0 cf7288e0
c62ebde0
c015e62a c62ebde0 c013e5d6 cf7288e0 c62ebde0 cf7288e0 00000000
c013723c
cf7288e0 c32fdf68 c013795a c976eac0 c32fdf68 00000000 c55df000
00000000
Call Trace: [<c01409c7>] [<c015e62a>] [<c013e5d6>] [<c013723c>] [<c013795a>]
[<d8e7dda3>] [<c0137f68>]
[<c0135276>] [<c0106a7b>] [<c010002b>]
Code: 0f 0b 83 c4 0c f6 83 f4 00 00 00 10 75 19 68 e8 01 00 00 68

>>EIP; c013fffb <clear_inode+33/f4> <=====
Trace; c01409c7 <iput+137/14c>
Trace; c015e62a <nfs_dentry_iput+22/28>
Trace; c013e5d6 <dput+d6/144>
Trace; c013723c <cached_lookup+48/54>
Trace; c013795a <path_walk+536/78c>
Trace; d8e7dda3 <END_OF_CODE+8502a84/????>
Trace; c0137f68 <__user_walk+3c/58>
Trace; c0135276 <sys_lstat64+16/70>
Trace; c0106a7b <system_call+33/38>
Trace; c010002b <startup_32+2b/a5>
Code; c013fffb <clear_inode+33/f4>
00000000 <_EIP>:
Code; c013fffb <clear_inode+33/f4> <=====
0: 0f 0b ud2a <=====
Code; c013fffd <clear_inode+35/f4>
2: 83 c4 0c add $0xc,%esp
Code; c0140000 <clear_inode+38/f4>
5: f6 83 f4 00 00 00 10 testb $0x10,0xf4(%ebx)
Code; c0140007 <clear_inode+3f/f4>
c: 75 19 jne 27 <_EIP+0x27> c0140022
<clear_inode+5a/f4>
Code; c0140009 <clear_inode+41/f4>
e: 68 e8 01 00 00 push $0x1e8
Code; c014000e <clear_inode+46/f4>
13: 68 00 00 00 00 push $0x0

I have no idea what this means; to me it looks serious, so I decided to
post it to the kernel mailing list. Is this a real bug? If I need to provide
more detailed information, please tell me what you need and how to get it. I
had more messages like these; if posting all of them is useful, please tell
me.

thanks,
Olivier Sessink

Davide Libenzi

Jun 12, 2001, 5:50:08 PM

On 12-Jun-2001 Christoph Hellwig wrote:
> In article <Pine.LNX.4.30.010612...@gene.pbi.nrc.ca> you
> wrote:
>> On dual-CPU machines the speedups are as follows: my version
>> is 1.88 faster than the sequential one on IRIX, 1.81 times on Solaris,
>> 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux 2.4
>> kernel. Why are the numbers on Linux machines so much lower?
>
> Does your measurement include the time needed to actually create the
> threads or do you even frequently create and destroy threads?

This is an extract from the busiest part of the vmstat report while his tool was running:

12 0 0 15508 40980 24880 355480 0 0 0 0 141 481 100 0 0
19 0 0 15508 40248 24880 355480 0 0 0 0 142 564 100 0 0
12 0 0 15508 40112 24880 355480 0 0 0 0 150 543 100 0 0
11 0 0 15508 41272 24880 355480 0 0 0 0 156 594 99 1 0
17 0 0 15508 40408 24880 355480 0 0 0 0 156 474 99 1 0
17 0 0 15508 39840 24880 355480 0 0 0 0 135 475 100 0 0
21 0 0 15508 39568 24880 355480 0 0 0 0 125 409 100 0 0
21 0 0 15508 39668 24880 355480 0 0 0 0 135 420 100 0 0
16 0 0 15508 39760 24880 355480 0 0 0 0 149 486 100 0 0


The context-switch rate is very low and the user CPU utilization is 100%; I
don't think the system is responsible here (clearly a CPU-bound program).
Even though the run queue is long, the context-switch rate is low.
Right next to me I have a dual PIII 1 GHz workstation running an MTA that
uses Linux pthreads, with context switching ranging between 5000 and 11000
and a thread creation rate of about 300 threads/sec (relaying 600000
msg/hour). No problem at all with the system, even though the load average
is a bit high (about 8).


- Davide

og...@gene.pbi.nrc.ca

Jun 12, 2001, 6:00:10 PM
Hello,

a good suggestion was given to me to create only as many threads as
there are CPUs (or a few more) and then keep them asking for work when
they are done. This should help (and avoids the repeated pthread_create/
pthread_exit). I will implement this and report my results if there is
interest.

Thank you all,
Ognen

--
Ognen Duzlevski


Plant Biotechnology Institute
National Research Council of Canada
Bioinformatics team


Albert D. Cahalan

Jun 12, 2001, 6:10:07 PM
Davide Libenzi writes:
> On 12-Jun-2001 Christoph Hellwig wrote:
>> In article <Pine.LNX.4.30.010612...@gene.pbi.nrc.ca> you
>> wrote:

>>> On dual-CPU machines the speedups are as follows: my version
>>> is 1.88 faster than the sequential one on IRIX, 1.81 times on Solaris,
>>> 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux
>>> 2.4 kernel. Why are the numbers on Linux machines so much lower?

...


> The context switch is very low and the user CPU utilization is 100%,
> I don't think it's system responsibility here ( clearly a CPU bound
> program ). Even if the runqueue is long, the context switch is low.
> I've just close to me a dual PIII 1GHz workstation that run an MTA
> that uses linux pthreads with context switching ranging between 5000
> and 11000 with a thread creation rate of about 300 thread/sec (
> relaying 600000 msg/hour ). No problem at all with the system even
> if the load avg is a bit high ( about 8 ).

In that case, this could be a hardware issue. Note that he seems
to be comparing an x86 PC against SGI MIPS, Sun SPARC, and Compaq
Alpha hardware.

His data set is most likely huge. It's DNA data.

The x86 box likely has small caches, a fast core, and a slow bus.
So most of the time the CPU will be stalled waiting for a memory
operation.

Maybe there are performance monitor registers that could be used
to determine if this is the case.

(Not that the app design is sane though.)

Mike Castle

Jun 12, 2001, 7:30:12 PM
On Tue, Jun 12, 2001 at 03:25:54PM -0400, Russell Leighton wrote:
> Any recommendations for alternate threading packages?

Does NSPR use native methods (ie, clone), or just ride on top of pthreads?

What about the gnu threading package?

mrc
--
Mike Castle dal...@ix.netcom.com www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

J . A . Magallon

Jun 12, 2001, 8:10:05 PM

On 20010612 Albert D. Cahalan wrote:
>
> In that case, this could be a hardware issue. Note that he seems
> to be comparing an x86 PC against SGI MIPS, Sun SPARC, and Compaq
> Alpha hardware.
>
> His data set is most likely huge. It's DNA data.
>
> The x86 box likely has small caches, a fast core, and a slow bus.
> So most of the time the CPU will be stalled waiting for a memory
> operation.
>

Perhaps it is just synchronization of caches.
Say you want to sum all the elements of a vector in parallel, split into
two pieces:

int total=0;
thread 1:
    for first half
        total += v[i]
thread 2:
    for second half
        total += v[i]

and you thought: 'well, I need a mutex for access to total. That will slow
things down, let's use separate counters':

int bigtotal;
int total[2];
thread 1:
    for first half
        total[0] += v[i]
thread 2:
    for second half
        total[1] += v[i]

bigtotal = total[0]+total[1]

The problem? total[0] and total[1] are right next to each other, so they sit
in the same cache line. So on every write to total[?], even though the
writes are independent, the system has to synchronize the caches.

Big iron (SGI, SPARC) has special hardware for this, but cheap PC mobos...
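
The usual cure, sketched below, is to pad the per-thread counters so they
land in different cache lines (the 128-byte line size is just a conservative
guess, not a measured value for any particular box):

#define CACHE_LINE 128                  /* conservative guess */
#define N          (1 << 20)

static int v[N];

/* One padded slot per thread: the two partial sums can never share a
   cache line, so the writes no longer ping-pong between the CPUs. */
struct padded_sum {
        long sum;
        char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_sum partial[2];

static void *sum_half(void *arg)        /* arg is 0 or 1 */
{
        long id = (long)arg;
        long i, start = id * (N / 2), end = start + N / 2;

        for (i = start; i < end; i++)
                partial[id].sum += v[i];
        return NULL;
}

/* After joining both threads: bigtotal = partial[0].sum + partial[1].sum */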

--
J.A. Magallon # Let the source be with you...
mailto:jamag...@able.es
Linux Mandrake release 8.1 (Cooker) for i586
Linux werewolf 2.4.5-ac13 #1 SMP Sun Jun 10 21:42:28 CEST 2001 i686

Pavel Machek

Jun 13, 2001, 5:40:09 AM
Hi!

> I am a summer student implementing a multi-threaded version of a very
> popular bioinformatics tool. So far it compiles and runs without problems
> (as far as I can tell ;) on Linux 2.2.x, Sun Solaris, SGI IRIX and Compaq
> OSF/1 running on Alpha. I have ran a lot of timing tests compared to the
> sequential version of the tool on all of these machines (most of them are
> dual-CPU, although I am also running tests on 12-CPU Solaris and 108 CPU
> SGI IRIX). On dual-CPU machines the speedups are as follows: my version
> is 1.88 faster than the sequential one on IRIX, 1.81 times on Solaris,
> 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux 2.4
> kernel. Why are the numbers on Linux machines so much lower? It is
> the

But this is all different hw, no?

So a dual-CPU SPARC is more efficient than a dual-CPU i686. Maybe the SPARCs
have faster RAM and slower CPUs...
Pavel
--
I'm pa...@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at dis...@linmodems.org

Kurt Garloff

Jun 13, 2001, 8:30:12 AM
On Tue, Jun 12, 2001 at 01:07:11PM -0600, og...@gene.pbi.nrc.ca wrote:
> due to the nature of the problem (a pairwise mutual alignment of n
> sequences results in mx. n^2 alignments which can each be done in a
> separate thread), I need to create and destroy the threads frequently.
>
> I am not really comfortable with 1.4 - 1.5 speedups since the solution was
> intended as a Linux one primarily and it just happenned that it works (and
> now even better) on Solaris/SGI/OSF...

Nor would I.

What I do in my numerics code to avoid this problem is to create all the
threads (as many as there are CPUs) at program startup and have them wait
(block) on a condition. As soon as there is something to do, the variables
for the thread are set up (protected by a mutex) and the thread gets
signalled (cond_signal).
If you're interested in the code, tell me.

This is supposed to be much faster than thread creation.
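
A minimal sketch of that pattern, with all names invented for illustration
and error handling left out, might look like:

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_ready = PTHREAD_COND_INITIALIZER;
static pthread_cond_t job_taken  = PTHREAD_COND_INITIALIZER;
static void *pending_job;       /* variables describing the next job */

static void *worker(void *unused)
{
        void *job;

        for (;;) {
                pthread_mutex_lock(&lock);
                while (pending_job == NULL)     /* block until signalled */
                        pthread_cond_wait(&work_ready, &lock);
                job = pending_job;
                pending_job = NULL;
                pthread_cond_signal(&job_taken);
                pthread_mutex_unlock(&lock);
                /* ... process job ... */
        }
        return NULL;
}

/* Master side: publish a job and wake exactly one waiting thread. */
static void submit(void *job)
{
        pthread_mutex_lock(&lock);
        while (pending_job != NULL)             /* previous job not yet taken */
                pthread_cond_wait(&job_taken, &lock);
        pending_job = job;
        pthread_cond_signal(&work_ready);
        pthread_mutex_unlock(&lock);
}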

Regards,
--
Kurt Garloff <gar...@suse.de> Eindhoven, NL
GPG key: See mail header, key servers Linux kernel development
SuSE GmbH, Nuernberg, FRG SCSI, Security

J . A . Magallon

Jun 13, 2001, 9:40:11 AM

On 20010613 Kurt Garloff wrote:
>
> What I do in my numerics code to avoid this problem, is to create all the
> threads (as many as there are CPUs) on program startup and have then wait
> (block) for a condition. As soon as there's something to to, variables for
> the thread are setup (protected by a mutex) and the thread gets signalled
> (cond_signal).
> If you're interested in the code, tell me.
>

I use the reverse approach: you feed work to the threads, whereas I create
the threads and let them ask a master for work until it says 'done'. When
the master is queried for work, it locks a mutex, decides the next piece of
work for that thread, and unlocks it. I think this gives less contention and
is simpler to manage.
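
In code, that pull model amounts to something like the following sketch
(the work units are just indices here, and all names are made up for
illustration):

#include <pthread.h>

static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static int next_item, total_items;

/* Each worker asks the master state for the next item until none is left. */
static void *worker(void *unused)
{
        int item;

        for (;;) {
                pthread_mutex_lock(&qlock);
                item = (next_item < total_items) ? next_item++ : -1;
                pthread_mutex_unlock(&qlock);
                if (item < 0)
                        break;          /* the master says 'done' */
                /* ... process work item 'item' ... */
        }
        return NULL;
}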

--
J.A. Magallon # Let the source be with you...
mailto:jamag...@able.es
Linux Mandrake release 8.1 (Cooker) for i586
Linux werewolf 2.4.5-ac13 #1 SMP Sun Jun 10 21:42:28 CEST 2001 i686

Philips

Jun 13, 2001, 10:20:11 AM
"J . A . Magallon" wrote:
>
> On 20010613 Kurt Garloff wrote:
> >
> > What I do in my numerics code to avoid this problem, is to create all the
> > threads (as many as there are CPUs) on program startup and have then wait
> > (block) for a condition. As soon as there's something to to, variables for
> > the thread are setup (protected by a mutex) and the thread gets signalled
> > (cond_signal).
> > If you're interested in the code, tell me.
> >
>
> I use the reverse approach. you feed work to the threads, I create the threads
> and let them ask for work to a master until it says 'done'. When the
> master is queried for work, it locks a mutex, decide the next work for
> that thread, and unlocks it. I think it gives the lesser contention and
> is simpler to manage.
>

BTW,
a question was popping into my mind and it finally got a negative answer
from my own mind ;-)

Is it possible to do something like this:


char a[100] = {...}
char b[100] = {...}
char c[100];
char d[100];

1: { // run this on first CPU
for (int i=0; i<100; i++) c[i] = a[i] + b[i];
};
2: { // run this on any other CPU
for (int i=0; i<100; i++) d[i] = a[i] * b[i];
};

...
// do something else...
...

wait 1,2; // to be sure c[] and d[] are ready.


What was popping into my mind was some prefix (like the 0x66 prefix Intel
used for 32-bit instructions) to say that this instruction should run on
another CPU. I know - stupid idea. Too many questions arise.
Say we do

PREFIX jmp far some_routine

and this routine runs on another CPU without blocking the current execution
thread (who will clean up the stack? when?.. questions without answers...).

Is there anything like this in the computer world? I heard about old
computers that had special instructions to explicitly run code on a given
processor. Is it possible to emulate this behavior on PCs?


og...@gene.pbi.nrc.ca

Jun 13, 2001, 11:10:12 AM
Solaris has pset_create() and pset_bind(), with which you can bind LWPs to
specific processors, but I doubt this works on anything else...
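
(For comparison: much later Linux kernels and glibc gained an equivalent via
sched_setaffinity() and the pthread_setaffinity_np() extension; nothing like
it exists in the 2.2/2.4 kernels discussed in this thread. A sketch:)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single CPU (GNU extension, not part of
   POSIX and not available on the kernels discussed above). */
static int bind_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}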

Best regards,
Ognen

--

Ognen Duzlevski
Plant Biotechnology Institute
National Research Council of Canada
Bioinformatics team


bert hubert

Jun 13, 2001, 1:50:08 PM
On Tue, Jun 12, 2001 at 12:06:40PM -0700, Kip Macy wrote:
> This may sound like flamebait, but its not. Linux threads are basically
> just processes that share the same address space. Their performance is
> measurably worse than it is on most commercial Unixes and FreeBSD.

Thread creation may be a bit slow. But the kludges to provide posix threads
completely from userspace also hurt. Notably, they do not scale over
multiple CPUs.

> They are not, or at least two years ago, were not POSIX compliant
> (they behaved badly with respect to signals). The impoverished

POSIX threads are silly with respect to signals. I do almost all my
programming these days with pthreads and I find that I really do not miss
signals at all.

> from Larry McVoy's home page attributed to Alan Cox illustrates this
> reasonably well: "A computer is a state machine. Threads are for people
> who can't program state machines." Sorry for not being more helpful.

I got that response too. When I pressed kernel people for details, it turned
out that they think having hundreds of runnable threads/processes (mostly
the same thing under Linux) is wasteful. The scheduler is just not optimised
for that.

Regards,

bert

--
http://www.PowerDNS.com Versatile DNS Services
Trilab The Technology People
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet

Hubertus Franke

Jun 13, 2001, 3:10:11 PM

>I got that response too. When I pressed kernel people for details it turns
>out that they think having hundreds of runnable threads/processes (mostly
>the same thing under Linux) is wasteful. The scheduler is just not optimised
>for that.

Try out the http://lse.sourceforge.net/scheduling patches. The MQ kernel
scheduler sure can handle this kind of load :-)

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: fra...@us.ibm.com
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003


Helge Hafting

Jun 14, 2001, 3:00:08 AM
bert hubert wrote:

> > from Larry McVoy's home page attributed to Alan Cox illustrates this
> > reasonably well: "A computer is a state machine. Threads are for people
> > who can't program state machines." Sorry for not being more helpful.
>
> I got that response too. When I pressed kernel people for details it turns
> out that they think having hundreds of runnable threads/processes (mostly
> the same thing under Linux) is wasteful. The scheduler is just not optimised
> for that.

The scheduler can be optimized for that, but so far only at the cost of
pessimizing the common case with few threads. The bigger problem here is
that your CPU (particularly the TLBs and caches) isn't optimized for
switching between a lot of threads either. This will always be a problem as
long as CPUs have level 1 caches much smaller than the combined working
set of your threads. So run one thread per CPU, perhaps two if you expect
I/O stalls. The task at hand may easily be divided into many more parts,
but serializing those extra parts will be better for performance.

Helge Hafting

Alan Cox

Jun 14, 2001, 2:30:11 PM
> they are done. This should help it (and avoid the pthread_create,
> pthread_exit). I will implement this and report my results if there is
> interest.

You should also check up on cache colouring. x86 boxes have relatively poor
memory performance, and most x86 chips have lousy behaviour when data bounces
between processors or is driven out of cache.

Alan Cox

Jun 14, 2001, 2:40:09 PM
> just processes that share the same address space. Their performance is
> measurably worse than it is on most commercial Unixes and FreeBSD.

Actually their performance is massively superior. But that is because we were
not stupid enough to burden the kernel with all of the POSIX pthread crap.
Pthreads is an ugly compromise API that can be badly implemented in both
userland and kernel space. Unfortunately it's also a standard.

So you have two choices:
1. Pthread performance is poorer due to library glue
2. Every single signal delivery is 20% slower, threaded or otherwise, due
   to all the crap that it adds
   And it does damage to other calls too.

In the big picture #1 is definitely preferable.

There are really only two reasons for threaded programming.

- Poor programmer skills/language expression of event handling

- OS implementation flaws (and yes the posix/sus unix api has some of these)

Co-routines or better language choices are much more efficient ways to express
the event handling problem.

fork() is often a better approach than pthreads at least for the design of an
SMP threaded application because unless you explicitly think about what you
share you will never get the cache affinity you need for good performance.

And if you don't care about cache affinity then you shouldn't care about
pthread_create overhead, because quite frankly pthread_create overhead is
easily mitigated (thread cache) and in most real-world applications is
considerably less of a performance hit.

Alan

bert hubert

Jun 14, 2001, 3:10:06 PM
On Thu, Jun 14, 2001 at 07:28:32PM +0100, Alan Cox wrote:

> There are really only two reasons for threaded programming.
>
> - Poor programmer skills/language expression of event handling

The converse is that pthreads are:

- Very easy to use from C at a reasonable runtime overhead

It is very convenient for a userspace coder to be able to just start a
function in a different thread. Now it may be that a kernel is not there to
provide ease of use for userspace coders, but it is a factor.

I see lots of people only using:
pthread_create()/pthread_join()
mutex_lock/unlock
sem_post/sem_wait
no signals

My gut feeling is that you could implement this subset in a way that is both
fast and right - although it would not be 'pthreads compliant'. Can anybody
confirm this feeling?

Regards,

bert

--
http://www.PowerDNS.com Versatile DNS Services
Trilab The Technology People
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet

Russell Leighton

Jun 14, 2001, 3:30:06 PM

bert hubert wrote:

> <stuff deleted>


>
> I see lots of people only using:
> pthread_create()/pthread_join()
> mutex_lock/unlock
> sem_post/sem_wait
> no signals
>
> My gut feeling is that you could implement this subset in a way that is both
> fast and right - although it would not be 'pthreads compliant'. Can anybody
> confirm this feeling?

... add condition variables (maybe a small per-thread storage area)
and I'd toss out pthreads for most apps I write...especially if it is very efficient.

--
---------------------------------------------------
Russell Leighton russell....@247media.com
---------------------------------------------------

og...@gene.pbi.nrc.ca

Jun 14, 2001, 6:50:08 PM
Hello,

I have implemented thread pooling (with an environment variable
where I can give the number of threads to be created). Results:

1. Linux, no change in the times (not under 2.2.x or 2.4)

2. SGI/Solaris/OSF/1: times decreased when the number of threads matched
the number of processors available. The times were the same as with my
previous version, or a couple of percent better, when I exaggerated the
number of threads and created, say, 128 threads on a 2-CPU machine.

3. The load on the machines has decreased considerably with the new
solution. I consider this to be the only positive impact I have seen from
this solution.

The solution is basically designed in the following way:

1. Threads are created and they wait on a condition with pthread_cond_wait
2. The main thread sets up the data (which are global) and then signals on
the same condition variable that there is work to be done. The first
thread to be awakened takes the work; the remaining threads keep waiting.
3. Go to 2 as long as there is work to distribute

I am now pretty much inclined to believe that it is either a) a hardware
issue (someone mentioned that SPARCs and MIPSes handle things differently)
or b) Linux for some reason just can't give me what IRIX/Solaris can in
this particular case.

Regretfully, the organization I work for prohibits me from releasing the
code I am talking about until the lawyers decide what to do with it. My
hope is to be able to release it for free to anyone interested since this
sequence alignment tool is used a lot :). This kind of defeats the purpose
of my question(s) since without the code it is difficult to talk.

Best regards,
Ognen Duzlevski

On Thu, 14 Jun 2001, Alan Cox wrote:

> > they are done. This should help it (and avoid the pthread_create,
> > pthread_exit). I will implement this and report my results if there is
> > interest.
>
> You should also check up the cache colouring. X86 boxes have relatively poor
> memory performance and most x86 chips have lousy behaviour when data bounces
> between processors or is driven out of cache

--

Ognen Duzlevski
Plant Biotechnology Institute
National Research Council of Canada
Bioinformatics team

Mike Castle

Jun 14, 2001, 7:10:08 PM
On Thu, Jun 14, 2001 at 04:42:29PM -0600, og...@gene.pbi.nrc.ca wrote:
> 2. The main thread sets up the data (which are global) and then signals
> that there is work to be done on the same condition variable. The first
> thread to get awaken takes the work. the remaining threads keep waiting.

For curiosity's sake, at what point would this technique result in a
thundering-herd issue? Does it happen near the level at which the number of
schedulable entities equals the number of processors, or does it have to be
much greater than that?

mrc
--
Mike Castle dal...@ix.netcom.com www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

J . A . Magallon

Jun 14, 2001, 7:10:10 PM

On 20010614 Alan Cox wrote:
>
>So you have two choices
>1. Pthread performance is poorer due to library glue
>2. Every single signal delivery is 20% slower threaded or otherwise due
> to all the crap that it adds
> And it does damage to other calls too.
>

Pthreads are a standard. You say 'use Linux native calls, they are faster
and make signal management efficient'. But then portability goes to hell.
Now I can run the same code on Linux, IRIX and Solaris; your way, I would
have to write three versions with clone(), sproc() and lwp_xxxx().
Take the example of OpenGL on IRIX boxes: some time ago it was a wrapper
over IrisGL, now it is native. If you have a notably poor implementation of
a standard, nobody will use your system.

>In the big picture #1 is definitely preferable.
>
>There are really only two reasons for threaded programming.
>
>- Poor programmer skills/language expression of event handling
>
>- OS implementation flaws (and yes the posix/sus unix api has some of these)
>
>Co-routines or better language choices are much more efficient ways to express
>the event handling problem.
>
>fork() is often a better approach than pthreads at least for the design of an
>SMP threaded application because unless you explicitly think about what you
>share you will never get the cache affinity you need for good performance.
>

Joking? That only works if your most complex structure is an array. Try to
write a rendering program with a complex linked list/tree data structure
for the geometry, materials, textures, etc. while thinking about cache
affinity. You can only think about that locally: hmm, I need a counter for
each thread, I would not put them all in an array because that would thrash
the caches, let's put them in separate variables; I need to return data to
a segment of a big array, let's use a local copy and then pass it back.
But no more than that. Yes, you can change all your malloc() or new calls
to shm's, but what is the gain? That is the beauty of shared-memory boxes.

What Linux needs is a good implementation of POSIX threads. I do not mean
putting pthreads right into the kernel, but perhaps some small change or
addition can make the user-space part much, much faster. There are many
apps that could benefit greatly from using threads: they use a big data
segment in read-only mode and communicate only a little between themselves
(a threaded web server, a rendering program).

--
J.A. Magallon # Let the source be with you...
mailto:jamag...@able.es
Linux Mandrake release 8.1 (Cooker) for i586
Linux werewolf 2.4.5-ac13 #1 SMP Sun Jun 10 21:42:28 CEST 2001 i686

Dieter Nützel

Jun 14, 2001, 7:10:11 PM
> Hello,
>
> I have implemented thread pooling (with an environment variable
> where I can give the number of threads to be created). Results:
>
> 1. Linux, no change in the times (not under 2.2.x or 2.4)
[snip]

> I am now pretty much inclined to believe that it is either a) hardware
> issue (someone mentioned that SPARCs and MIPSes handle things differently)
> or b) Linux for some reason just cant give me what IRIX/Solaris can in
> this particular case
[snip]

Hello Ognen,

can you get your hands on a dual AMD Athlon MP 1/1.2 GHz system?
The only mobo currently on the market is the AMD 760MP based Tyan Thunder K7.
It has (all) the good stuff (point-to-point bus, crossbar) which formerly
only the (big) Alphas/Sun/SGI etc. had.

http://www.amd.com/products/cpg/server/athlon/index.html
http://www.tyan.com/products/html/thunderk7.html

Regards,
Dieter
--
Dieter Nützel
Graduate Student, Computer Science

University of Hamburg
Department of Computer Science
Cognitive Systems Group
Vogt-Kölln-Straße 30
D-22527 Hamburg, Germany

email: nue...@kogs.informatik.uni-hamburg.de
@home: Dieter....@hamburg.de

Anil Kumar

Jun 15, 2001, 7:30:07 AM
Since, when using only a small subset of the primitives provided by
pthreads, the burden of maintaining all the other primitives is that much
greater, I too feel that when we use only a small part it is better to
implement it ourselves, in the way we require, for performance reasons.

Michael Rothwell

Jun 16, 2001, 10:20:09 AM
On 14 Jun 2001 19:28:32 +0100, Alan Cox wrote:

> Co-routines or better language choices are much more efficient ways to express
> the event handling problem.

Can you provide any info and/or examples of co-routines? I'm curious to
see a good example of co-routines' "betterness."

Thanks,

--
Michael Rothwell
roth...@holly-springs.nc.us

Alan Cox

Jun 16, 2001, 11:30:08 AM
> Can you provide any info and/or examples of co-routines? I'm curious to
> see a good example of co-routines' "betterness."

With co-routines you don't need

8K of kernel stack
Scheduler overhead
Fancy locking

You don't get the automatic thread switching stuff though.

So you might get code that reads like this (note that aio_ stuff works rather
well combined with co-routines as it fixes a lack of asynchronicity in the
unix disk I/O world)


select(....)

if(FD_ISSET(copier_fd))
run_coroutine(&copier_state);

...


and the copier might be something like

while(1)
{
        // Yes, 1 byte at a time is dumb, but this is an example..
        // Yes, I'm ignoring EOF for this
        if(read(copier_fd, &buf[bufptr], 1)==-1)
        {
                if(errno==EWOULDBLOCK)
                {
                        coroutine_return();
                        continue;
                }
        }
        if(bufptr==255 || buf[bufptr]=='\n')
        {
                run_coroutine(run_command, buf);
                bufptr=0;
        }
        else
                bufptr++;
}


It lets you express a state machine as a set of multiple such small state
machines instead. run_coroutine() will continue a routine from where it last
coroutine_return()'d. Thus in the above case we are expressing "read bytes
until you see a newline" cleanly - not mangled in with keeping state in
global structures, but using natural C local variables and code flow.

Alan

Russell Leighton

Jun 16, 2001, 2:40:07 PM

Is there a user-space implementation (library?) of coroutines that would work from C?


Alan Cox wrote:

--


---------------------------------------------------
Russell Leighton russell....@247media.com
---------------------------------------------------

Michael Rothwell

Jun 16, 2001, 3:20:09 PM

Russell Leighton

Jun 16, 2001, 5:30:14 PM

Any chance this or the equivalent could become part of glibc?

This seems like a very handy abstraction; in many apps, threads would then
really only be needed for true parallelism.

Dan Maas

Jun 16, 2001, 6:30:06 PM
> Is there a user-space implemenation (library?) for
> coroutines that would work from C?

Here is another one:

http://oss.sgi.com/projects/state-threads/


Regards,
Dan

Arun Sharma

Jun 17, 2001, 11:01:09 PM
Christoph Hellwig <h...@caldera.de> wrote in message news:<2001061221...@caldera.de>...
>
> If you havily create threads under load you're rather srewed. If you want
> to stay with the (IMHO rather suboptimal) posix threads API you might want
> to take a look at the stuff IBM has produced:
>
> http://oss.software.ibm.com/developerworks/projects/pthreads/
>
> Otherwise a simple wrapper for clone might be a _lot_ faster, but has it's
> own disadvantages: no ready-to-use lcoking primitives, no cross-platform
> support (ok, it should be portable to the FreeBSD rfork easily).
>
> Christoph

The FreeBSD rfork based implementation already exists.

http://oss.software.ibm.com/pipermail/pthreads-devel/2001-May/000031.html

-Arun
