Haskell bindings for CCI

Peter Braam

Jan 17, 2012, 4:22:13 PM

to cci_dev...@email.ornl.gov, cloudh...@googlegroups.com

Hi -

Facundo implemented Haskell bindings for CCI and, with Paul Monday's help, we tested them with the IB verbs driver on our IB QDR cluster. The results are amazingly close to C performance, within a few percent. Below are some event-only and RDMA ping-pong tests.

Hopefully everyone finds this good news!

Peter

---------- Forwarded message ----------
From: Facundo Domínguez <facundo....@parsci.com>
Date: 2012/1/14
Subject: [dev] Perhaps pingpong definitive numbers
To: Dev <d...@parsci.com>

Hi all,

Test results follow. The hard part of making these tests was
imitating the behavior of the C pingpong (a messy program) in Haskell.
Once that was achieved, no optimizations were required beyond the ones
ghc performs automatically.

Cheers!
Facundo

Servers and clients are executed in different nodes.
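
For readers unfamiliar with the shape of a pingpong test, here is a minimal sketch of the measurement loop. It does not use CCI or the Haskell bindings at all: two threads in one process exchange messages through `MVar`s as a stand-in for the network, and the names `echoServer` and `oneWayLatencyUs` are hypothetical. It also assumes that one-way latency is reported as elapsed time divided by twice the number of round trips, which is the usual convention but is not confirmed by this thread.

```haskell
module Main where

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forever, replicateM_)
import GHC.Clock (getMonotonicTimeNSec)
import Text.Printf (printf)

-- | "Server" side: echo every message received on 'ping' back on 'pong'.
echoServer :: MVar Int -> MVar Int -> IO ()
echoServer ping pong = forever (takeMVar ping >>= putMVar pong)

-- | "Client" side: perform 'iters' round trips and return the one-way
-- latency in microseconds, assumed here to be elapsed / (2 * iters).
oneWayLatencyUs :: Int -> MVar Int -> MVar Int -> IO Double
oneWayLatencyUs iters ping pong = do
  start <- getMonotonicTimeNSec
  replicateM_ iters (putMVar ping 0 >> takeMVar pong)
  end <- getMonotonicTimeNSec
  let elapsedUs = fromIntegral (end - start) / 1000 :: Double
  pure (elapsedUs / (2 * fromIntegral iters))

main :: IO ()
main = do
  ping <- newEmptyMVar
  pong <- newEmptyMVar
  -- The echo thread is terminated automatically when main returns.
  _ <- forkIO (echoServer ping pong)
  lat <- oneWayLatencyUs 10000 ping pong
  printf "one-way latency: %.2f us\n" lat
```

In the real tests the two sides run on different nodes and the exchange goes through CCI sends or RMA writes instead of `MVar`s, so the numbers from this sketch are not comparable to the tables below.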

== C pingpong implementation with a reliable ordered connection
sending active messages ==

[facundo.dominguez@pg73-v3 tests]$ CCI_CONFIG=../../../../cci.ini ./pingpong -h verbs://10.155.90.37:35480 -c RO
Using RO connection
Opened verbs://10.155.90.13:46154
Bytes    Latency (one-way)    Throughput
0                  2.28 us     0.00 MB/s
1                  2.95 us     0.34 MB/s
2                  2.96 us     0.67 MB/s
4                  2.95 us     1.36 MB/s
8                  2.93 us     2.73 MB/s
16                 3.01 us     5.32 MB/s
32                 3.03 us    10.58 MB/s
64                 3.00 us    21.36 MB/s
128                3.08 us    41.62 MB/s
256                3.24 us    79.05 MB/s
512                3.48 us   146.95 MB/s
1024               4.02 us   255.02 MB/s
2048               4.91 us   417.27 MB/s

== Haskell pingpong implementation with a reliable ordered connection
sending active messages ==

[facundo.dominguez@pg155-n17 cci-haskell]$ CCI_CONFIG=../cci.ini dist/build/ex-pingpong/ex-pingpong -h verbs://10.155.90.13:46163
verbs://10.155.90.37:35483
Bytes    Latency (one-way)    Throughput
0                  2.31 us     0.00 MB/s
1                  2.93 us     0.34 MB/s
2                  2.95 us     0.68 MB/s
4                  2.96 us     1.35 MB/s
8                  2.93 us     2.73 MB/s
16                 2.93 us     5.47 MB/s
32                 2.94 us    10.87 MB/s
64                 2.91 us    21.96 MB/s
128                3.01 us    42.52 MB/s
256                3.21 us    79.76 MB/s
512                3.43 us   149.08 MB/s
1024               3.91 us   261.96 MB/s
2048               4.84 us   422.95 MB/s

== C pingpong implementation with a reliable ordered connection making
RMA writes ==

[facundo.dominguez@pg73-v3 tests]$ CCI_CONFIG=../../../../cci.ini ./pingpong -h verbs://10.155.90.37:35491 -c RO -w -m 4194304
Using RO connection
Opened verbs://10.155.90.13:46179
server RMA handle is 0x15759d0
local_rma_handle is 0x12f9530
Bytes    Latency (round-trip)    Throughput
1                     3.04 us     0.33 MB/s
2                     3.03 us     0.66 MB/s
4                     3.03 us     1.32 MB/s
8                     3.03 us     2.64 MB/s
16                    3.16 us     5.07 MB/s
32                    3.25 us     9.83 MB/s
64                    3.32 us    19.30 MB/s
128                   3.38 us    37.84 MB/s
256                   3.47 us    73.69 MB/s
512                   3.57 us   143.62 MB/s
1024                  4.02 us   254.85 MB/s
2048                  4.86 us   421.37 MB/s
4096                  5.38 us   760.69 MB/s
8192                  6.69 us  1225.22 MB/s
16384                 9.13 us  1794.02 MB/s
32768                14.01 us  2338.28 MB/s
65536                23.72 us  2762.39 MB/s
131072               43.17 us  3036.47 MB/s
262144               81.99 us  3197.28 MB/s
524288              159.64 us  3284.23 MB/s
1048576             314.96 us  3329.23 MB/s
2097152             625.55 us  3352.51 MB/s
4194304            1247.76 us  3361.46 MB/s

== Haskell pingpong implementation with a reliable ordered connection
making RMA writes ==

[facundo.dominguez@pg155-n17 cci-haskell]$ CCI_CONFIG=../cci.ini dist/build/ex-pingpong/ex-pingpong -h verbs://10.155.90.13:46171 -r 4194304
verbs://10.155.90.37:35487
Bytes    Latency (one-way)    Throughput
1                  3.09 us     0.32 MB/s
2                  3.13 us     0.64 MB/s
4                  3.11 us     1.29 MB/s
8                  3.10 us     2.58 MB/s
16                 3.09 us     5.18 MB/s
32                 3.14 us    10.19 MB/s
64                 3.15 us    20.33 MB/s
128                3.24 us    39.46 MB/s
256                3.23 us    79.37 MB/s
512                3.45 us   148.58 MB/s
1024               3.87 us   264.53 MB/s
2048               4.66 us   439.10 MB/s
4096               5.30 us   772.11 MB/s
8192               6.53 us  1255.38 MB/s
16384              8.94 us  1832.89 MB/s
32768             13.82 us  2371.12 MB/s
65536             23.52 us  2786.72 MB/s
131072            42.98 us  3049.52 MB/s
262144            81.61 us  3212.22 MB/s
524288           158.95 us  3298.41 MB/s
1048576          313.33 us  3346.56 MB/s
2097152          622.05 us  3371.34 MB/s
4194304         1239.41 us  3384.12 MB/s
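
A note on reading the tables: the throughput column appears to be the message size divided by the reported latency, with 1 MB taken as 10^6 bytes (bytes per microsecond is numerically the same as MB/s). The sketch below recomputes a few rows of the C RMA-write table under that assumption. The function name `throughputMBps` is hypothetical, and small discrepancies are expected because the printed latencies are rounded to two decimals.

```haskell
module Main where

import Text.Printf (printf)

-- | Throughput in MB/s (1 MB = 10^6 bytes) for a message of 'bytes' bytes
-- delivered with a latency of 'latencyUs' microseconds.
throughputMBps :: Int -> Double -> Double
throughputMBps bytes latencyUs = fromIntegral bytes / latencyUs

-- (bytes, latency in us, throughput as printed), copied from the
-- C RMA-write table in this message.
samples :: [(Int, Double, Double)]
samples =
  [ (1024, 4.02, 254.85)
  , (65536, 23.72, 2762.39)
  , (4194304, 1247.76, 3361.46)
  ]

main :: IO ()
main = mapM_ report samples
  where
    report (bytes, lat, printed) =
      printf "%8d bytes: computed %8.2f MB/s, printed %8.2f MB/s\n"
        bytes (throughputMBps bytes lat) printed
```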

Facundo Domínguez

Jan 17, 2012, 5:03:41 PM

to Atchley, Scott, Peter Braam, cci_dev...@email.ornl.gov, cloudh...@googlegroups.com

> Interestingly, after 8 bytes, the Haskell latencies are better than the C latencies. I would be curious to understand the differences between the Haskell pingpong and the C version.

I can think of a couple of sources for the small differences:
* Network performance fluctuates slightly between runs.
* ghc uses its own native code generator, so the binaries produced by
ghc and gcc are bound to differ. Note also that, because of temporary
build issues, we are using ghc to compile the ORNL CCI implementation
statically rather than building a shared library with gcc. I didn't
dig into how ghc compiles C source code, though.
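
To put numbers on the differences being discussed, the sketch below computes the relative latency difference between the two active-message runs, using the values copied from the tables earlier in the thread. The helper name `relDiffPct` is hypothetical. This only summarizes one run of each program, so it cannot separate run-to-run network fluctuation from code-generation effects.

```haskell
module Main where

import Text.Printf (printf)

-- (bytes, C latency in us, Haskell latency in us), copied from the two
-- active-message tables earlier in the thread.
latencies :: [(Int, Double, Double)]
latencies =
  [ (0, 2.28, 2.31), (1, 2.95, 2.93), (2, 2.96, 2.95), (4, 2.95, 2.96)
  , (8, 2.93, 2.93), (16, 3.01, 2.93), (32, 3.03, 2.94), (64, 3.00, 2.91)
  , (128, 3.08, 3.01), (256, 3.24, 3.21), (512, 3.48, 3.43)
  , (1024, 4.02, 3.91), (2048, 4.91, 4.84)
  ]

-- | Difference of the Haskell latency relative to the C latency, in percent.
-- A negative value means the Haskell run was faster.
relDiffPct :: Double -> Double -> Double
relDiffPct c hs = (hs - c) / c * 100

main :: IO ()
main = mapM_ row latencies
  where
    row (bytes, c, hs) =
      printf "%5d bytes: C %.2f us, Haskell %.2f us, diff %+.1f%%\n"
        bytes c hs (relDiffPct c hs)
```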

If you are interested, we could provide you with the Haskell bindings
and the pingpong draft so you can dig into it further.

Cheers!
Facundo

On Tue, Jan 17, 2012 at 7:33 PM, Atchley, Scott <atch...@ornl.gov> wrote:
> Peter,
>
> Excellent!
>
> Interestingly, after 8 bytes, the Haskell latencies are better than the C latencies. I would be curious to understand the differences between the Haskell pingpong and the C version.
>
> Scott

>> _______________________________________________
>> CCI_Developers mailing list
>> CCI_Dev...@email.ornl.gov
>> https://email.ornl.gov/mailman/listinfo/cci_developers
>> To unsubscribe, send a blank email to cci_developer...@email.ornl.gov

Ryan N

Jan 17, 2012, 11:42:49 PM

to CloudHaskell

Hi Facundo,

Excellent!

> If you are interested, we could provide you with the Haskell bindings
> and the pingpong draft so you can dig into it further.

I'd love to run this on some of our InfiniBand systems here. Please
sign me up for getting this code as well.

Best,
-Ryan

Rob Stewart

Jan 18, 2012, 8:25:05 AM

to rrne...@gmail.com, CloudHaskell

Hi Facundo,

I'm looking at fault-tolerant distributed-memory Haskell for my PhD. I
tried using mpich2, which offers only limited fault-tolerant behaviour
(v1.4.1 has a flag to withstand node failure), so instead I focused on
silent failure semantics using sockets. I'd love to use a
higher-performance communication layer and to investigate the
fault-tolerance semantics of CCI. So I'd also very much like to sign
up to get this code.

Haskell bindings for CCI - hurray!
