rput in OpenMP Parallel Region


Sam Olivier

Apr 13, 2018, 7:18:47 PM
to UPC++
Hi,
I'm trying to put together a distributed 3D FFT (for CS267 at Berkeley) and would like to add OpenMP threading to compute the local 1D FFTs in parallel. I have set UPCXX_THREADMODE=par and have been able to use OpenMP threads elsewhere. Everything has worked except when I do an rput inside an OpenMP parallel region. Is this possible? I'm wondering if there is a GASNet/UPC++ setting similar to the MPI settings for allowing communication from threads? I am on Cori.

Here's the code (it works when I compile without -fopenmp or even with OMP_NUM_THREADS=1):
#pragma omp parallel for
    for (INT k=0; k<m_Nz; k++) { // loop through 2 local dimensions
        for (INT j=0; j<m_dims[1]; j++) {
            cdouble* row = m_local + j*m_dims[0] + k*m_dims[0]*m_dims[1]; // current row of 3D array
            m_fft_x.transform(row, DIR); // transform each row with abstracted FFTW class
            INT send_to = j/Ny; // figure out who to send the data to for the global transpose
            INT row_num = j % Ny;
            INT loc = upcxx::rank_me()*m_Nz*m_dims[0]*Ny +
                k*m_dims[0]*Ny +
                m_dims[0]*row_num;
            if (send_to == upcxx::rank_me()) { // use memcpy if local (seems to improve performance slightly)
                memcpy(tlocal+loc, row, m_dims[0]*sizeof(cdouble));
            } else { // otherwise do rput asynchronously
                upcxx::rput(row, tmp[send_to]+loc, m_dims[0]);                
            }
        }
    }
    upcxx::barrier(); // not sure if a barrier implies the rput's have finished like future.wait() would

Thanks,
Sam

Dan Bonachea

Apr 13, 2018, 7:31:30 PM
to Sam Olivier, UPC++
Hi Sam -
One problem that immediately jumps out is that you are ignoring the return value of rput, which is never safe.
You need the future it returns in order to synchronize the result (a barrier does NOT synchronize rputs).

You probably want each thread to build a future chain - see section 10 of the Programmer's Guide for a full example.
So each thread would declare a thread-specific variable like:
  upcxx::future<> f = upcxx::make_future();
outside the loop,  and then in the body do
  f = upcxx::when_all(f, upcxx::rput(row, tmp[send_to]+loc, m_dims[0]));
and at the end do
  f.wait();
Note it's important that these all be thread-specific, because UPC++ futures are never thread-safe.
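Putting those pieces together, the loop could look roughly like this (a sketch reusing the variable names from your code, untested):

```
#pragma omp parallel for
for (INT k = 0; k < m_Nz; k++) {
    // One chain per iteration, so the future is never shared across threads.
    upcxx::future<> f = upcxx::make_future();
    for (INT j = 0; j < m_dims[1]; j++) {
        // ... compute row, send_to, loc as before ...
        f = upcxx::when_all(f, upcxx::rput(row, tmp[send_to]+loc, m_dims[0]));
    }
    f.wait(); // all rputs issued in this iteration are now complete
}
upcxx::barrier();
```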

Hope this helps!
-D


--
You received this message because you are subscribed to the Google Groups "UPC++" group.
To unsubscribe from this group and stop receiving emails from it, send an email to upcxx+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sam Olivier

Apr 15, 2018, 5:02:11 PM
to UPC++
Thanks for the future-chain tip; it helps in another area. Unfortunately, it didn't fix the rput-in-an-OpenMP-parallel-region issue.

I've attached a minimal (conditionally) working example of what I'm trying to do (rput.cpp). I do
  module swap PrgEnv-intel PrgEnv-gnu; module load upcxx
  export UPCXX_THREADMODE=par
and compile with
  g++ rput.cpp -o rput.exe -fopenmp $(eval upcxx-meta PPFLAGS) $(eval upcxx-meta LDFLAGS) $(eval upcxx-meta LIBFLAGS)
and run with srun -n2 ./rput.exe (with OMP_NUM_THREADS=2).

rput.cpp doesn't segfault when OMP_NUM_THREADS=1 or if I only do rputs from the master OMP thread. It segfaults if I wrap the rputs in an OMP critical section or if I try to send from only the non-master OMP thread.

It seems to crash right when rput is called from the non-master OpenMP thread, which makes me think this isn't supported or I'm not configuring something correctly.

Thanks again,
Sam
rput.cpp

Dan Bonachea

Apr 15, 2018, 7:38:38 PM
to Sam Olivier, UPC++
Hi Sam - Can you please submit an issue in the UPC++ issue tracker so we can track progress on this?

Make sure to include your test program and build steps.

Please also try building in debug mode, by compiling after setting:

 export UPCXX_CODEMODE=debug

and enable automatic backtracing on crash at runtime with:
 
 export GASNET_BACKTRACE=1

and report the backtrace you get.

Other misc note: on the Cray you should compile C++ programs for the compute nodes using "CC" not "g++" - the former is Cray's wrapper that adds various tuning knobs and ensures you get the right g++ version to match the loaded PrgEnv. Equivalently, you could use $(eval upcxx-meta CXX) which expands to CC on Cray XC.
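For example, your earlier compile line would become (untested, same flags as before):

```
CC rput.cpp -o rput.exe -fopenmp $(eval upcxx-meta PPFLAGS) $(eval upcxx-meta LDFLAGS) $(eval upcxx-meta LIBFLAGS)
```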

-D




John Bachan

Apr 16, 2018, 6:58:53 PM
to Dan Bonachea, Sam Olivier, UPC++
There's a barrier in the wrong place. The barrier after the pointer broadcast needs to be moved after the initialization of the `local[i]` values; otherwise there's a race between rank 0's rputs and rank 1's initialization. I have tested that the example works with this fix.
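In outline (names here are illustrative, not taken from rput.cpp):

```
upcxx::global_ptr<double> gp = upcxx::new_array<double>(N);
double *local = gp.local();
// ... broadcast/exchange gp so other ranks can target it ...
for (int i = 0; i < N; i++) local[i] = 0;  // initialize the landing zone first
upcxx::barrier();  // now incoming rputs cannot race with the initialization
// ... only from here on is it safe to issue rputs at other ranks ...
```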

Sam Olivier

Apr 16, 2018, 7:39:19 PM
to UPC++
Yes, the barrier position is incorrect. I've since noticed that rput.cpp runs and produces the correct output if I compile with UPCXX_THREADMODE=seq but it segfaults with UPCXX_THREADMODE=par (for both CC on Cori and g++ on my local machine). Did you run in sequential mode? I attached the code with the new barrier position just in case.
rput_v2.cpp

John Bachan

Apr 16, 2018, 8:17:03 PM
to Sam Olivier, UPC++
I only tested it on my laptop, and that worked. Now I'm running on Cori and can reproduce the segfault. After poking around, it appears the issue is yet another platform with broken C++11 "thread_local" support. It looks like thread-local variable constructors for non-trivial types are not being executed for the non-main threads. upcxx relies on this internally, so essentially upcxx thread-safe support is critically handicapped on this platform.

Sam, thanks for finding this issue. Given the importance of running on Cori, this strongly motivates the need to move away from our internal reliance on thread_local. Unfortunately, that will require non-trivial reengineering, so for the moment my only suggestion is to run with flat, threadless processes. Sorry about that.


John Bachan

Apr 17, 2018, 1:19:40 PM
to Sam Olivier, UPC++
Update! Sam, good news: I have a workaround. This isn't actually a bug in the platform, it is entirely the fault of the UPC++ runtime. The odr-use rules of C++11 thread_local storage are tricky, and I got it wrong.

You can ensure that upcxx gets its thread-specific internals properly initialized by adding a superfluous call to `upcxx::default_persona_scope()` at the beginning of each thread. See here:

```
if (upcxx::rank_me() == 0) {
#pragma omp parallel
{
  // Call this anytime we *might* be on a new thread, and before any other calls to upcxx by that thread.
  upcxx::default_persona_scope();
 
  int tid = omp_get_thread_num(); // OMP thread id 
  upcxx::future<> f = upcxx::make_future(); // setup future chain
```

smsol...@gmail.com

Apr 17, 2018, 2:16:46 PM
to UPC++
> One problem that immediately jumps out is you are ignoring the return value of rput, which is never safe. You need the future it returns in order to synchronize the result (a barrier does NOT synchronize rputs).
Awesome! Thanks, that fixed it!

Paul Hargrove

Apr 26, 2018, 6:13:01 PM
to Sam Olivier, UPC++

Sam,


There is now a version of UPC++ installed on NERSC's Cori and Edison which does not have this bug.

I encourage you to "module load upcxx/snapshot-2018-04-25" to get this fixed version.

Note this is not an officially vetted, well-tested release, but it should solve the OpenMP compatibility problem you encountered, and light testing shows it to be stable.


Thank you, again, for bringing this issue to our attention.

-Paul


--
Paul H. Hargrove <PHHar...@lbl.gov>
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department
Lawrence Berkeley National Laboratory