Very large BUPC jobs on x86_64 hardware with CentOS 7, BUPC 2019.4.4


Riebs, Andy

Feb 28, 2020, 2:37:23 PM
to upc-...@lbl.gov

Hi,

 

We were having a problem with a large job on a large cluster using bupc 2.28.0, so I tried to upgrade to bupc 2019.4.4.

 

With 2019.4.4, it seems that if I approach or exceed 65K processes (say 2000 nodes with 40 tasks/node, i.e. 80,000 processes), I get the message "job size exceeds ibv-conduit capabilities".

 

From gasnet/ibv-conduit/gasnet_core.c, this message appears in

 

#if GASNET_MAXNODES > 65535
#error "Update gasneti_bootstrapExchange for > 16-bit node count"
#endif

 

I wrote and tried this patch,

 

--- berkeley_upc-2019.4.4.orig/gasnet/ibv-conduit/gasnet_core.c 2019-09-19 21:22:52.000000000 +0000
+++ berkeley_upc-2019.4.4/gasnet/ibv-conduit/gasnet_core.c      2020-02-28 18:03:45.215990950 +0000
@@ -429,9 +429,9 @@
   }
}

-#if GASNET_MAXNODES > 65535
-#error "Update gasneti_bootstrapExchange for > 16-bit node count"
-#endif
+// #if GASNET_MAXNODES > 65535
+// #error "Update gasneti_bootstrapExchange for > 16-bit node count"
+// #endif

#if GASNETC_USE_RCV_THREAD
   static gasneti_atomic_t gasnetc_sys_exchange_rcvd[2][16] =
diff -urN berkeley_upc-2019.4.4.orig/gasnet/ibv-conduit/gasnet_core_fwd.h berkeley_upc-2019.4.4/gasnet/ibv-conduit/gasnet_core_fwd.h
--- berkeley_upc-2019.4.4.orig/gasnet/ibv-conduit/gasnet_core_fwd.h     2019-09-19 21:22:52.000000000 +0000
+++ berkeley_upc-2019.4.4/gasnet/ibv-conduit/gasnet_core_fwd.h  2020-02-28 17:30:54.377906521 +0000
@@ -32,7 +32,7 @@

/* 16K is the limit on the LID space, but we must allow more than 1 proc per node */
/* 64K corresponds to 16 bits used in the AM Header and 16-bit gex_Rank_t */
-#define GASNET_MAXNODES        65535
+#define GASNET_MAXNODES        200000

 

but wound up with the message "*** FATAL ERROR (proc 47377): Unexpected error Invalid argument (rc=1 errno=22) from ibv_create_qp()" on only 128 nodes, so I'm guessing that the 16 bits are more significant than I was hoping.

 

1. Does this mean we’re limited to 65K processes?

2. Did the same limit exist (perhaps implicitly, generating erroneous results but no error message) in bupc 2.28.0?

 

Andy

 

--

Andy Riebs

andy....@hpe.com

Hewlett Packard Enterprise

High Performance Computing Software Engineering

+1 404 648 9024

 

 

 

 


 

Paul Hargrove

Feb 28, 2020, 2:56:45 PM
to Riebs, Andy, upc-...@lbl.gov
Andy,

1.
Yes, you are currently limited to 65,535 processes.
I hope (but cannot promise) this can be addressed within the next 12 to 18 months.

However, Berkeley UPC does support the use of pthreads as UPC ranks, which means you could run one process per socket with many pthreads each to reach far larger UPC thread counts.

2.
Yes, the same limitation was present in 2.28.0 (and earlier), but it was not reported in the same way.
I cannot recall precisely how things would have failed.

As indicated in the following comment you quoted in diff context, there is a 16-bit field that is limiting things:

/* 64K corresponds to 16 bits used in the AM Header and 16-bit gex_Rank_t */


-Paul



--
Paul H. Hargrove <PHHar...@lbl.gov>
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department
Lawrence Berkeley National Laboratory

Riebs, Andy

Feb 28, 2020, 3:14:02 PM
to Paul Hargrove, upc-...@lbl.gov

Many thanks Paul -- you’ve saved me a long weekend of hair-pulling and gnashing of teeth!

 

Andy
