Hi Daniel - Thanks for your query.
To answer the question you seem to be asking, UPC++/GASNet do not "transparently" or "implicitly" aggregate fine-grained communication operations that happen to be issued in temporal proximity. Initiating an RMA or RPC destined for an off-node process will inject a network packet to initiate that operation before the initiation call returns.
However, UPC++ does provide an explicit API for RMA aggregation - these are the "Non-Contiguous One-sided Communication" APIs in chapter 15 of the Programmer's Guide and detailed further in the Specification. These allow the programmer to explicitly aggregate RMA destined for the same peer; calls like rput_irregular() and rput_strided() will automatically pack together the specified discontiguous pieces of source data and pipeline sending appropriately large packets on the network, automatically unpacking them at the target to the specified destination memory. Because these use explicit aggregation, the "batching" is fully under the UPC++ programmer's control.
I should also mention that several groups have implemented explicit communication aggregation libraries layered over UPC++/GASNet primitives. One prominent example of this is the AggrStore library in upcxx-utils. Another is the Berkeley Container Library (BCL).
Finally, as you've observed, all communication in UPC++ is asynchronous and we strongly encourage programmers to overlap communication latency with other communication and computation. UPC++ provides future/promise and completion callback synchronization mechanisms that support an aggressively asynchronous style of communication to hide network latency. In particular, UPC++ includes features that make it easy to "batch together" the synchronization for multiple operations, and even build entire DAGs of asynchronous communication and computation to execute dynamically as dependencies become satisfied.
Hope this helps.