Hi all,
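Two clusters here have been running Intel MPI all along, and it has been stable with no problems. Open MPI, however, has never worked: every mpirun fails with "bash: orted: command not found". This is Open MPI 1.8.3, built with ICC 15 and ICC 14.
I have checked PATH and LD_LIBRARY_PATH and both look correct; running orted directly in a terminal does find it. I have tried bash, sh, zsh and ksh, and none of them works. None of the fixes suggested on the Open MPI mailing list apply.
The systems run Torque, but the problem occurs whether or not I go through Torque (interactive job).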
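Whether Torque is involved in the launch at all depends on whether this Open MPI build includes the tm components; without them mpirun typically falls back to starting orted over ssh/rsh. A minimal sketch of that check, assuming ompi_info lists components in its usual "MCA plm: tm (...)" form; has_tm_support is a hypothetical helper, not an Open MPI command:

```shell
# has_tm_support (hypothetical helper) reads ompi_info output on stdin and
# succeeds if a Torque (tm) launcher (plm) or allocator (ras) component is listed.
has_tm_support() {
    grep -Eq 'MCA[[:space:]]+(plm|ras):[[:space:]]+tm([[:space:]]|$)'
}

# Usage on a node where the Open MPI module is loaded:
#   if ompi_info | has_tm_support; then
#       echo "tm components present: mpirun can launch through Torque"
#   else
#       echo "no tm components: mpirun likely launches orted via ssh/rsh"
#   fi
```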
One cluster runs RHEL 6.5 (call it Cluster A) and the other CentOS 6.5 (Cluster B). I wondered whether the Mellanox cards were involved, but "command not found" does not look like a hardware problem, and the latest MLX firmware and drivers are installed anyway. The detailed cluster configuration is at the end.
Any ideas? It looks like some shell variable is wrong somewhere, but I cannot find it. The environment is managed with Environment Modules, which seems the most likely culprit.
Relevant part of PATH: /ssd/dependencies/openmpi-1.8.3_intel15/bin:/ssd/compilers/intel15/composer_xe_2015.0.090/bin/intel64:/ssd/compilers/intel15/inspector_xe_2015/bin64:/ssd/compilers/intel15/advisor_xe_2015/bin64:/ssd/compilers/intel15/vtune_amplifier_xe_2015/bin64
Relevant part of LD_LIBRARY_PATH: /ssd/dependencies/openmpi-1.8.3_intel15/lib:/ssd/compilers/intel15/composer_xe_2015.0.090/mkl/lib/intel64:/ssd/compilers/intel15/composer_xe_2015.0.090/compiler/lib/intel64
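Since orted resolves fine in an interactive terminal, one thing that may be worth ruling out is that the non-interactive shell started on a remote node gets a different PATH from the one shown here (for example, if the module is only loaded from an interactive startup file). A minimal sketch of that check follows; find_in_path is a hypothetical helper and node01 a placeholder hostname, neither comes from the setup above.

```shell
# find_in_path (hypothetical helper): print the first executable named $1
# found in the colon-separated PATH string $2; return non-zero if none is found.
find_in_path() {
    local cmd="$1" path_string="$2" dir
    local IFS=':'
    for dir in $path_string; do
        # Skip empty entries and directories that happen to share the name.
        if [ -n "$dir" ] && [ -x "$dir/$cmd" ] && [ ! -d "$dir/$cmd" ]; then
            printf '%s\n' "$dir/$cmd"
            return 0
        fi
    done
    return 1
}

# Usage: compare the PATH of a non-interactive ssh shell with the local one.
#   remote_path=$(ssh node01 'echo $PATH')
#   find_in_path orted "$remote_path" || echo "orted is not on the non-interactive PATH"
#   find_in_path orted "$PATH"        || echo "orted is not on the local PATH either"
```

If the remote check fails while the local one succeeds, the difference would lie in which startup files each kind of shell reads, not in Open MPI itself.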
Detailed configuration:
Cluster A:
Infiniband:
03:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
07:00.0 InfiniBand: Mellanox Technologies MT27600 [Connect-IB]
CPU:
processor : 31
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Kernel:
2.6.32-431.20.3.el6.x86_64
LSB Release:
LSB Version: :base-4.0-amd64:base-4.0-ia32:base-4.0-noarch:core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Release: 6.5
Cluster B:
Infiniband:
08:00.0 InfiniBand: Mellanox Technologies MT27600 [Connect-IB]
CPU:
processor : 55
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
Kernel:
2.6.32-431.29.2.el6.x86_64
LSB Release:
LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: CentOS
Release: 6.5
Regards.
Afa.L Cheng
Disclaimer:
This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. If verification is required please request a hard-copy version.
I forgot to include the module file:
#%Module1.0#####################################################################
##
## OMPI 1.8.3 w/ Intel 15 modulefile
##
proc ModulesHelp { } {
    global version modroot
    puts stderr "\tOpenMPI 1.8.3 for Intel 15"
}
module-whatis "invoke OpenMPI 1.8.3 built with Intel 15.0.0 Compilers (64-bit)"
set openmpi /ssd/dependencies/openmpi-1.8.3_intel15
prepend-path PATH $openmpi/bin
prepend-path LD_LIBRARY_PATH $openmpi/lib
prepend-path LIBRARY_PATH $openmpi/lib
setenv CC mpicc
setenv CXX mpicxx
setenv FC mpifort
setenv F90 mpif90
setenv MPIHOME $openmpi
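
A modulefile like this only changes the environment of the shell that loads it. One way to avoid depending on the remote shell's startup files is mpirun's --prefix option, which tells mpirun where Open MPI is installed on the remote nodes so that it can set PATH and LD_LIBRARY_PATH there before starting orted. A sketch, assuming the same install path exists on every node; prefix_from_mpirun is a hypothetical helper and node01,node02 are placeholder hostnames:

```shell
# prefix_from_mpirun (hypothetical helper): strip the trailing /bin/mpirun
# from an absolute mpirun path to recover the installation prefix.
prefix_from_mpirun() {
    dirname "$(dirname "$1")"
}

OMPI_PREFIX=$(prefix_from_mpirun /ssd/dependencies/openmpi-1.8.3_intel15/bin/mpirun)
echo "$OMPI_PREFIX"   # prints /ssd/dependencies/openmpi-1.8.3_intel15

# Launch with the prefix passed explicitly (hostnames are placeholders):
#   mpirun --prefix "$OMPI_PREFIX" -np 2 -host node01,node02 hostname
```

According to the mpirun man page, invoking mpirun by its absolute path has the same effect as passing --prefix, so running /ssd/dependencies/openmpi-1.8.3_intel15/bin/mpirun directly is a quick way to try this.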
Regards.
Afa.L Cheng
You guys have already started on this....
Regards.
Afa.L Cheng
--
-- You received this message because you are subscribed to the Google Groups Shanghai Linux User Group group. To post to this group, send email to sh...@googlegroups.com. To unsubscribe from this group, send email to shlug+unsubscribe@googlegroups.com. For more options, visit this group at https://groups.google.com/d/forum/shlug?hl=zh-CN
--
Ray
http://maskray.me