Homemade parallel computing benchmark: come test your speedup (bring it on)

C.D.Luminate

Mar 22, 2016, 4:53:16 AM

to xidian_linux

Hi,

On a whim I wanted to try writing parallel code, so I used OpenMP to build a few simple BLAS routines of my own. The code is at

https://github.com/CDLuminate/withLinux/blob/master/booknote/parallel/omp/benchmark.c

Makefile

https://github.com/CDLuminate/withLinux/blob/master/booknote/parallel/omp/Makefile

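Note, however, that this Makefile is the brute-force version accelerated with GNU parallel. If you don't have GNU parallel installed, you will need to adjust the Makefile.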

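My laptop has an i5-2430M (2 cores / 4 threads @ 2.5 GHz). The speedup (parallel speed divided by serial speed, i.e. serial time over parallel time) is roughly 2; in the run below it ranges from about 1.5 to about 2.4 depending on the routine.

******** Come and crush my CPU *********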

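Also, while I was writing the program the float type once again blew up with the "large number swallows small number" bug (absorption of small addends), so all the routines are implemented in double precision.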

Wellcome to Lumin's serial/parallel benchmark, init ... [OK]

I: [initialization] time cost is 0.811117 seconds.

--------------------------------------------------------------------------------

I: [dcopy in serial] time cost is 0.187758 seconds.

A 1.000000 1.000000 C 1.000000 1.000000

I: [dcopy in parallel] time cost is 0.110634 seconds.

A 1.000000 1.000000 C 1.000000 1.000000

--------------------------------------------------------------------------------

I: [dasum serial] time cost is 0.244707 seconds.

resA 67108864.000000

I: [dasum parallel] time cost is 0.126894 seconds.

resB 67108864.000000

--------------------------------------------------------------------------------

I: [ddot in serial] time cost is 0.249027 seconds.

resA 67108864.000000

I: [ddot in parallel] time cost is 0.119332 seconds.

resB 67108864.000000

--------------------------------------------------------------------------------

I: [dscal in serial] time cost is 0.244976 seconds.

A 0.500000 0.500000

I: [dscal in parallel] time cost is 0.102829 seconds.

A 0.250000 0.250000

--------------------------------------------------------------------------------

I: [daxpby in serial] time cost is 0.300344 seconds.

A 0.250000 0.250000 C 1.625000 1.625000

I: [daxpby in parallel] time cost is 0.178659 seconds.

A 0.250000 0.250000 C 2.562500 2.562500

--------------------------------------------------------------------------------

I: [dgemv in serial] time cost is 0.315916 seconds.

Y 0.250000 0.250000 DEST 2048.250000 2048.250000

I: [dgemv in parallel] time cost is 0.212380 seconds.

Y 0.250000 0.250000 DEST 2048.250000 2048.250000

I: [dgemv in parallelv2] time cost is 0.205021 seconds.

Y 0.250000 0.250000 DEST 2048.250000 2048.250000

--------------------------------------------------------------------------------

I: [dgemm in serial] time cost is 1.151119 seconds.

X 1.000000 1.000000 Y 1.000000 1.000000 DEST 512.000000 512.000000

I: [dgemm in parallel] time cost is 0.700665 seconds.

X 1.000000 1.000000 Y 1.000000 1.000000 DEST 512.000000 512.000000

--------------------------------------------------------------------------------

I: [All benchmark] time cost is 4.450948 seconds.

---
Regards,
C.D.Luminate

C.D.Luminate

Mar 22, 2016, 5:07:48 AM

to xdl

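One interesting point: back when I was still using the float type, accumulating a large number of 1s really did trigger the classic IEEE 754 single-precision bug. The serial algorithm should have summed to 67108864, but it only got as far as 16777216. (16777216 is 2^24, the point past which a float can no longer represent every integer, so adding 1 no longer changes the sum.)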

This bug is easy to verify with this program:

https://github.com/CDLuminate/withLinux/blob/master/booknote/parallel/omp/trap1.c

However, the parallel version of the same routine produced the correct result [1]. The parallel routine uses

reduction (+:sum)

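This piece of black magic is equivalent to:

1. splitting the for loop and mapping it onto multiple threads, each of which gets its own private sum variable

2. reduce: adding each thread's private sum into the original sum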

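Because each thread's private sum is large in magnitude, float cannot ignore any of the addends in the reduce step, and so the result comes out correct. (Actually, this is not the only coincidence involved.)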

Learned something new.

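[1] The usual situation is the opposite: the serial version is reliable and the parallel version is the one that tends to blow up. Here it is the other way around.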

---
Regards,
C.D.Luminate
