Hi,
We were having a problem with a large job on a large cluster using bupc 2.28.0, so I tried to upgrade to bupc 2019.4.4.
With 2019.4.4, it seems that if I approach or exceed 65K processes (say 2000 nodes with 40 tasks/node), I get the message "job size exceeds ibv-conduit capabilities".
From gasnet/ibv-conduit/gasnet_core.c, this message appears in
#if GASNET_MAXNODES > 65535
#error "Update gasneti_bootstrapExchange for > 16-bit node count"
#endif
I wrote and tried this patch,
--- berkeley_upc-2019.4.4.orig/gasnet/ibv-conduit/gasnet_core.c 2019-09-19 21:22:52.000000000 +0000
+++ berkeley_upc-2019.4.4/gasnet/ibv-conduit/gasnet_core.c 2020-02-28 18:03:45.215990950 +0000
@@ -429,9 +429,9 @@
}
}
-#if GASNET_MAXNODES > 65535
-#error "Update gasneti_bootstrapExchange for > 16-bit node count"
-#endif
+// #if GASNET_MAXNODES > 65535
+// #error "Update gasneti_bootstrapExchange for > 16-bit node count"
+// #endif
#if GASNETC_USE_RCV_THREAD
static gasneti_atomic_t gasnetc_sys_exchange_rcvd[2][16] =
diff -urN berkeley_upc-2019.4.4.orig/gasnet/ibv-conduit/gasnet_core_fwd.h berkeley_upc-2019.4.4/gasnet/ibv-conduit/gasnet_core_fwd.h
--- berkeley_upc-2019.4.4.orig/gasnet/ibv-conduit/gasnet_core_fwd.h 2019-09-19 21:22:52.000000000 +0000
+++ berkeley_upc-2019.4.4/gasnet/ibv-conduit/gasnet_core_fwd.h 2020-02-28 17:30:54.377906521 +0000
@@ -32,7 +32,7 @@
/* 16K is the limit on the LID space, but we must allow more than 1 proc per node */
/* 64K corresponds to 16 bits used in the AM Header and 16-bit gex_Rank_t */
-#define GASNET_MAXNODES 65535
+#define GASNET_MAXNODES 200000
but wound up with the message "*** FATAL ERROR (proc 47377): Unexpected error Invalid argument (rc=1 errno=22) from ibv_create_qp()" on only 128 nodes, so I'm guessing that the 16-bit limit is more significant than I was hoping.
1. Does this mean we’re limited to 65K processes?
2. Did the same limit exist (perhaps implicitly, generating erroneous results but not an error message) in bupc 2.28.0?
Andy
--
Andy Riebs
Hewlett Packard Enterprise
High Performance Computing Software Engineering
Many thanks, Paul -- you've saved me a long weekend of hair-pulling and gnashing of teeth!
Andy