I have run into a strange issue. My OpenMP parallelized program runs on
several threads (numbers varying from 1-8 as given by OMP_NUM_THREADS),
as reported by omp_get_num_threads(). However, it appears that all
threads are running on a single CPU core, meaning that instead of a
near-perfect speedup I get some slowdown when increasing the number of
threads.
What is even stranger is that my code used to run nicely on multiple
cores, and indeed does so with Intel Fortran on our cluster. Has anyone
experienced similar issues with OpenMP, or can you suggest somewhere to
start looking for the trouble?
(I have found that most of the program runtime is spent in BLAS
routines, so the performance difference between compilers is small. I
have been using Goto BLAS for these tests, and the Goto BLAS routines
are indeed running nicely in multicore mode.)
Cheers,
Paul.
What operating system and gfortran version are you using? I've seen
no such OpenMP problems with gfortran 4.5.2 on Mac OSX 10.6.6 (Snow
Leopard).
Al Greynolds
www.ruda-cardinal.com
Sorry, I forgot to mention that. Goto BLAS does not employ several cores
on my Mac OS X 10.6.6 (Snow Leopard), but my own openmp-ified code does
parallelize nicely. I also have gfortran 4.5.2. I have to agree with
your conclusions there.
The question was asked with respect to gfortran 4.5.1 on Fedora Linux
release 14. The processor is an Intel Core i7 2.80 GHz (we have several
similar machines with varying GHz ratings), 4 cores with hyperthreading.
Here, Goto BLAS does employ several cores, but my own openmp-ified code
does not parallelize. There is in fact a small performance penalty when
increasing the number of OpenMP threads, and top reports CPU usage at
100% (i.e. one full core), so I believe that all threads are running on
one core.
Paul.
I checked it by noting that
1) 'top' in the linux terminal reports 100% CPU usage (it reports up to
800% for other programs)
2) Program execution does not speed up on the linux server when
increasing the OMP_NUM_THREADS environment variable, while it does speed
up on my mac laptop
3) omp_get_num_threads() reports 1,2,4,8 threads (as set by
OMP_NUM_THREADS env. var.) correctly, but there is no speedup
Perhaps omp_get_num_threads is reporting something other than the number
of actual threads? I do not know. I am posting here because I have no
idea where to look for fixes to this problem.
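For what it's worth, I am aware that omp_get_num_threads() only reports
the size of the current team, so it returns 1 when called outside a
parallel region; I call it from inside the parallel loop. Schematically:

use omp_lib
!$omp parallel
if (omp_get_thread_num() == 0) print *, omp_get_num_threads()  ! team size, e.g. 8
!$omp end parallel
print *, omp_get_num_threads()  ! 1 here, outside any parallel region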
Cheers,
Paul
Try a simple OpenMP program (like a single for loop), see if you can
reproduce the issue with that, then post here your compiler version
(output of "gfortran -v") and the exact code and command line you are
using.
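Something along these lines should do as a test (a made-up
microbenchmark; any loop with enough work per iteration is fine):

program omp_test
  use omp_lib
  implicit none
  integer, parameter :: n = 50000000
  integer :: i
  real(8) :: s, t0, t1
  s = 0.0d0
  t0 = omp_get_wtime()
  !$omp parallel do reduction(+:s)
  do i = 1, n
     s = s + sin(dble(i))   ! arbitrary per-iteration work
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()
  print *, omp_get_max_threads(), ' threads, time ', t1 - t0, ' s'
end program omp_test

compiled with "gfortran -fopenmp omp_test.f90" and run with varying
OMP_NUM_THREADS.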
--
FX
This is part of the problem: the simple OpenMP loop runs on all cores.
Also, the OpenMP code I wrote works on Mac OS X gfortran 4.5.2. The
OpenMP code also works flawlessly on Rocks cluster linux with intel
fortran v. 11.
I have now tested on gfortran 4.4.5 and 4.5.1 on Fedora Linux release
14, and both of these compilers result in no speedup. I know I can't
expect anyone here to find my problem (as I have no idea where to look
myself and a simple program doesn't reproduce the error), but it would
be interesting to see if someone here has had the same experience.
Paul
You mentioned that your code "used to run nicely on multiple cores." Is
it possible that something in the code or in your environment has
changed recently?
You might think about methodically removing pieces of your program until
it either works properly or is equivalent (for some definition thereof)
to a simple program that does work. This is guaranteed to be tedious
and time-consuming, but it has a chance of helping you isolate the problem.
Louis
> I have run into a strange issue. My OpenMP parallelized program runs on
> several threads (numbers varying from 1-8 as given by OMP_NUM_THREADS),
> as reported by omp_get_num_threads(). However, it appears that all
> threads are running on a single CPU core, meaning that instead of a
> near-perfect speedup I get some slowdown when increasing the number of
> threads.
OpenMP works fine on GCC, including gfortran.
Threads are managed by the operating system. If you have 8 logical
CPUs (e.g. Intel i7) and 8 threads, all CPUs could theoretically be
saturated. Thus, either the process is started with affinity restricted
to one CPU (this could e.g. be set by the shell), or there is something
in your program that forces sequential execution.
The latter could be an OpenMP directive that specifies that only one
thread may execute a major portion of the code:
!$OMP SINGLE
!$OMP MASTER
or contention for a global mutex, e.g. an unnamed critical section:
!$OMP CRITICAL
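For instance, if the body of a hot loop sits inside an unnamed critical
section, the threads just take turns and you get serial speed at best
(schematic; do_work is made up):

!$omp parallel do
do i = 1, n
   !$omp critical
   call do_work(i)   ! the whole body is serialized by the unnamed critical
   !$omp end critical
end do
!$omp end parallel do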
There could also be programming mistakes, such as an !$OMP PARALLEL
block without an !$OMP DO or multiple !$OMP SECTION directives inside.
There could also be a typo like !$OMP PARALLEL instead of !$OMP PARALLEL
DO, which will have the effect you reported.
If the program parallelizes nicely on other systems, the problem
must be related to restricted CPU affinity, which is an OS or command
shell issue. If it does not, the problem is likely related to the
OpenMP code.
If nothing else helps, I'd suggest you try to rebuild GCC and
recompile your program.
Sturla
This is interesting reading. I am posting my one single OpenMP loop
below, so you can have a look. I do believe it is a very simple !$OMP
PARALLEL DO and, as mentioned before, it usually (not always, oddly
enough) works nicely on Intel Fortran.
It would appear that this might be an OS issue, then?
Paul
--------------------
!$omp parallel do &
!$omp private(icol, irow, til, ip1, ip2, fra, iq1, iq2, alpha_0) &
!$omp firstprivate(imin, imax) &
!$omp shared(LHS)
do icol = imin, imax
if (icol == 1) then
print *, 'omp num threads:', omp_get_num_threads() ! TODO Remove
end if
fra = reverse_lookup(icol, 1)
iq1 = reverse_lookup(icol, 2)
iq2 = reverse_lookup(icol, 3)
alpha_0 = alpha_0s(iq1, iq2)
do irow = 1, size(LHS, 1)
til = reverse_lookup(irow, 1)
ip1 = reverse_lookup(irow, 2)
ip2 = reverse_lookup(irow, 3)
LHS(irow, icol) = generate_element(pm_lhs, fra, til, iq1, &
iq2, ip1, ip2, numerics, alphas(ip1, ip2), &
alpha_0, fft_of_zetas, qs)
end do
end do
!$omp end parallel do
> It would appear that this might be an OS issue, then?
That is hard to tell. The function "reverse_lookup" or
"generate_element" could still be synchronized with a mutex. That it
sometimes fails with Intel's compiler suggests a programming issue.
Sturla
Just to clarify, reverse_lookup is in fact an array which is built
before entering the loop. Perhaps there is some sort of programming
issue with the generate_element() function, I will look into that
possibility. Thanks a lot for your suggestions!
Speaking of which, I guess reverse_lookup should be declared
firstprivate or shared? I didn't even think of that before.
Paul.
Hi,
Can you post the single loop so I can compile-link-run?
Fernando.
No, in fact the effect should be something like the opposite: every
thread will do all the work (race conditions aside).
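Schematically (array a and bound n are made up):

!$omp parallel do       ! iterations are divided among the threads
do i = 1, n
   a(i) = 2.0d0*a(i)
end do
!$omp end parallel do

!$omp parallel          ! no worksharing: every thread executes all
do i = 1, n             ! n iterations itself, so no speedup (and
   a(i) = 2.0d0*a(i)    ! concurrent redundant writes to a)
end do
!$omp end parallel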
> No, in fact the effect should be something like the opposite: every
> thread will do all the work (race conditions aside).
Yes. You are right. My mistake, sorry.
Sturla
Sure! Just see below. The outer loop of a double loop is
OpenMP-parallelized. The function generate_element contains a function
generate_j, which also contains a loop, so significant work is done even
in the innermost loop. Unfortunately, that loop is way too short (only
10 iterations) to usefully parallelize. I guess it would be easier to
debug that way, though.
Cheers,
Paul.
----------------------
subroutine setup_LHS(...)
! LHS is the left hand side of an equation system
(snip)
!$omp parallel do &
!$omp private(icol, irow, til, ip1, ip2, fra, iq1, iq2, alpha_0) &
!$omp firstprivate(qs, numerics, imin, imax, reverse_lookup, alpha_0s, alphas) &
!$omp shared(LHS, fft_of_zetas)
Sorry, I don't have enough time to "complete" the code with
declarations/initializations/etc.; that's why I need (and asked for)
some code I can compile-link-run-play with almost directly from the
command line. Maybe you can just assign a constant to LHS to check for
parallel behavior...
Looking at the previous posts I was thinking along the lines of OpenMP
implementation/OS problems; did you see anything related to taskset?
Fernando.
I see, and I understand. However, the code is a bit complex. This is
probably part of the reason for the strange behavior! As mentioned
earlier, a simple do loop example compiles, runs and speeds up as expected.
I will consider your advice with respect to just assigning a constant to
LHS and see if that works out. I think it could be helpful!
I have never used taskset, so I have no idea how I would go about using
it in this context.
Paul.
I see, no problem, I understood from a previous post (I copy the text
here):
> This is part of the problem: the simple OpenMP loop runs on all cores.
> Also, the OpenMP code I wrote works on Mac OS X gfortran 4.5.2. The
> OpenMP code also works flawlessly on Rocks cluster linux with intel
> fortran v. 11.
> I have now tested on gfortran 4.4.5 and 4.5.1 on Fedora Linux release
> 14, and both of these compilers result in no speedup.
that the simple loop did not work on Fedora; my mistake, sorry.
About code complexity and parallel performance: I think code complexity
only "affects" correctness, not whether the threads all end up on one
core, but obviously I don't know a lot of the details/facts.
About taskset: since I was thinking along the lines of OpenMP
implementation/OS problems, I suggested taking a look at taskset, which
is used to
"retrieve or set a process's CPU affinity" -copied from
http://www.unix.com/man-page/Linux/1/taskset/
Since threads (I think always) inherit scheduling properties from the
process in which they are created, maybe you can verify/play with the
process scheduling properties. But it's just a guess...
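For example (assuming the util-linux taskset), "taskset -cp <pid>"
prints the list of CPUs a running process is allowed to use, and
"taskset -c 0-3 ./a.out" launches a program restricted to cores 0-3; if
the first of these reports a single CPU for your program, the affinity
mask would explain what you see.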
Fernando.
That's more-or-less correct, in theory. Whether it is the case
in practice is less clear.
>About taskset: since I was thinking in OpenMP implementation/OS
>problems, I suggested taking a look at taskset, which is used to
>"retrieve or set a process's CPU affinity" -copied from
>http://www.unix.com/man-page/Linux/1/taskset/
>Since threads (I think always) inherit scheduling properties from the
>process in which they are created, maybe you can verify/play with the
>process scheduling properties. But it's just a guess...
This will sound a bit patronising, but you are completely lost.
Nobody except an expert (in using operating systems, primarily)
should think of using such facilities, as they don't work the way
that they appear to, and never have (on ANY operating system!)
And most experts know better than to use such facilities except
in extremis, for that reason and many others.
The most that it is worth doing is calling a little C function
that returns the TID (thread identifier), which will let you check
that different parts of the code are using different threads. But
DO be warned that the mapping between OpenMP threads and system
threads is not simple, not at all.
If they all give the same TID, then your problem is that it is not
starting enough threads; if they give different ones, you have a
scheduler issue, and you should stop looking at the OpenMP code
and look elsewhere. And the latter is HARD.
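For what it's worth, a minimal sketch of that check (assuming
Linux/glibc; here I bind straight to pthread_self rather than writing a
separate C file, and the returned id should be treated purely as an
opaque label, compared only for equality):

program show_tids
  use omp_lib
  use iso_c_binding
  implicit none
  interface
     function pthread_self() bind(c, name='pthread_self') result(tid)
       import :: c_intptr_t
       integer(c_intptr_t) :: tid
     end function pthread_self
  end interface
  !$omp parallel
  print *, 'OpenMP thread', omp_get_thread_num(), ' system tid', pthread_self()
  !$omp end parallel
end program show_tids

Build with "gfortran -fopenmp show_tids.f90"; if every line shows the
same system tid, not enough threads are being started.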
Regards,
Nick Maclaren.
What do you mean by "a bit patronising"? (I understand the part where
you say I'm completely lost... ;P ...).
> Nobody except an expert (in using operating systems, primarily)
> should think of using such facilities, as they don't work the way
> that they appear to, and never have (on ANY operating system!)
I'm not exactly an expert (and I'm not able to define what an expert
is), and I've used such facilities, and they worked the way I
expected... This reminds me that someone once told me to maintain some
software with its current bugs... Maybe I'm completely lost and
completely lucky! :))
> And most experts know better than to use such facilities except
> in extremis, for that reason and many others.
Again, a realm I don't know. I just suggested a very simple task: to
play a little with an OS command, just guessing, in case nothing else
turns up.
> The most that it is worth doing is calling a little C function
> that returns the TID (thread identifier), which will check that
> different parts of the code are using different threads. But
> DO be warned that the mapping between OpenMP threads and system
> threads is not simple, not at all.
Yes, that's why I didn't suggest that; besides, I don't know how to call
little C functions from Fortran (yes, I'm far, far away from being an
expert...). About the OpenMP implementation: yes, that's a place I don't
want to be; I usually just try to figure out whether the problem is mine
or there is some bug in the OpenMP implementation. The answer is usually
known before I start looking, but I like to find out the reason...
> If they all give the same TID, then your problem is that it is not
> starting enough threads; if they give different ones, you have a
> scheduler issue, and you should stop looking at the OpenMP code
> and look elsewhere. And the latter is HARD.
Yes, I think so. Btw, do you have some suggestion for the OP that is not
so HARD, or for experts only, or for those who know better? I really
don't have any idea, so if there is no such suggestion I'll know the
answer is in the pile-of-things-I-don't-know. I'll keep learning,
however, just trying to reach those places and understand a little bit
more.
Thank you very much,
Fernando.
The best that I can offer is to email the slides of a course that
I have part-written, largely on how to avoid getting into trouble
with OpenMP. But it doesn't contain ANYTHING about the actual
configuration and system usage - except general guidelines. Please
say if you would like to see it.
Regards,
Nick Maclaren.
If you have a course on how not to get into trouble with OpenMP, I for
one am interested.
Paul.
Me too, and thank you,
Fernando.