orted: command not found when running OpenMPI mpirun

Alpha Cheng

Nov 4, 2014, 2:14:52 AM
to sh...@googlegroups.com

Hi all,

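We have two clusters that have always used Intel MPI, which has been very stable and trouble-free. OpenMPI, however, has never worked on them: every mpirun reports "bash: orted: command not found". The OpenMPI version is 1.8.3, built with ICC 15 and ICC 14.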

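I have checked PATH and LD_LIBRARY_PATH and both look fine; running orted directly in a terminal does find it. I have tried bash, sh, zsh, and ksh, and none of them works. None of the fixes suggested on the OpenMPI mailing list apply.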
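
One way to narrow this down is to compare the interactive result with the PATH that a non-interactive ssh command sees on a compute node, since that is the kind of shell that starts orted when mpirun launches over ssh. A minimal sketch, assuming passwordless ssh and using node01 as a placeholder hostname (neither is taken from the setup described here); path_has_dir and remote_noninteractive_path are hypothetical helper names:

```shell
#!/bin/sh
# Succeed if DIR ($2) is one of the colon-separated entries of PATH_STRING ($1).
path_has_dir() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the PATH seen by a non-interactive shell on HOST ($1).
# Assumes passwordless ssh; defined here but not called automatically.
remote_noninteractive_path() {
    ssh "$1" 'echo "$PATH"'
}

# Example (node01 is a placeholder hostname):
#   remote_path=$(remote_noninteractive_path node01)
#   if ! path_has_dir "$remote_path" /ssd/dependencies/openmpi-1.8.3_intel15/bin; then
#       echo "OpenMPI bin directory missing from non-interactive PATH on node01"
#   fi
```

If the directory is missing there while it is present in an interactive login, the environment is probably being set up only for interactive shells.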

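The systems run Torque, but the problem occurs both with it (in an interactive job) and without it.

One cluster runs RHEL 6.5 (call it Cluster A) and the other CentOS 6.5 (Cluster B). I considered whether the Mellanox cards could be responsible, but "command not found" does not look like a hardware problem, and the latest Mellanox firmware and drivers are already installed. The detailed cluster configuration is attached at the end.

Any ideas? It looks as if some bash variable is wrong somewhere, but I cannot find it. The environment is managed with Environment Modules, which does look like a likely place for the problem.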
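
If the module-provided PATH does not reach the shells that mpirun starts on the other nodes, Open MPI's mpirun accepts a --prefix option that tells the remote side where the installation lives, and invoking mpirun by its absolute path has the same effect. Whether this was among the fixes already tried is not stated here. A small sketch that assembles such a command line for inspection; build_mpirun_cmd is a hypothetical helper, and the -np 4 ./a.out arguments in the usage comment are placeholders:

```shell
#!/bin/sh
# Print an mpirun command line that names the Open MPI install prefix
# explicitly, so remote nodes do not depend on a module-loaded PATH.
# $1 = install prefix, remaining arguments = passed through to mpirun.
# The result is meant for display; arguments containing spaces are not quoted.
build_mpirun_cmd() {
    ompi_prefix=$1
    shift
    printf '%s/bin/mpirun --prefix %s' "$ompi_prefix" "$ompi_prefix"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# Usage (everything after the prefix is a placeholder):
#   build_mpirun_cmd /ssd/dependencies/openmpi-1.8.3_intel15 -np 4 ./a.out
```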

Relevant excerpt of PATH: /ssd/dependencies/openmpi-1.8.3_intel15/bin:/ssd/compilers/intel15/composer_xe_2015.0.090/bin/intel64:/ssd/compilers/intel15/inspector_xe_2015/bin64:/ssd/compilers/intel15/advisor_xe_2015/bin64:/ssd/compilers/intel15/vtune_amplifier_xe_2015/bin64

Relevant excerpt of LD_LIBRARY_PATH: /ssd/dependencies/openmpi-1.8.3_intel15/lib:/ssd/compilers/intel15/composer_xe_2015.0.090/mkl/lib/intel64:/ssd/compilers/intel15/composer_xe_2015.0.090/compiler/lib/intel64

Detailed configuration:

Cluster A:
Infiniband:

03:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
07:00.0 InfiniBand: Mellanox Technologies MT27600 [Connect-IB]

CPU:

processor       : 31
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Kernel:

2.6.32-431.20.3.el6.x86_64

LSB Release:

LSB Version:    :base-4.0-amd64:base-4.0-ia32:base-4.0-noarch:core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Release:        6.5

Cluster B:
Infiniband:

08:00.0 InfiniBand: Mellanox Technologies MT27600 [Connect-IB]

CPU:

processor       : 55
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz

Kernel:

2.6.32-431.29.2.el6.x86_64

LSB Release:

LSB Version:    :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: CentOS
Release:        6.5

Regards.

Afa.L Cheng

Disclaimer:

This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. If verification is required please request a hard-copy version.

Alpha Cheng

Nov 4, 2014, 2:16:45 AM
to sh...@googlegroups.com

I left out the module file:

#%Module1.0#####################################################################
##
## OMPI 1.8.3 w/ Intel 15 modulefile
##
proc ModulesHelp { } {
        global version modroot

        puts stderr "\tOpenMPI 1.8.3 for Intel 15"
}

module-whatis "invoke OpenMPI 1.8.3 built with Intel 15.0.0 Compilers (64-bit)"

set openmpi /ssd/dependencies/openmpi-1.8.3_intel15

prepend-path  PATH    $openmpi/bin
prepend-path  LD_LIBRARY_PATH    $openmpi/lib
prepend-path  LIBRARY_PATH    $openmpi/lib

setenv CC   mpicc
setenv CXX  mpicxx
setenv FC   mpifort
setenv F90  mpif90

setenv    MPIHOME  $openmpi
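
The modulefile only sets variables in the shell where module load runs; it does not guarantee that $openmpi/bin/orted exists on every node. Whether /ssd is shared or local to each node is not stated in this thread, so a quick per-node check may be worthwhile. A minimal sketch; prefix_has_orted is a hypothetical helper name:

```shell
#!/bin/sh
# Succeed if an executable orted exists under PREFIX/bin ($1 = PREFIX).
prefix_has_orted() {
    [ -x "$1/bin/orted" ]
}

# Usage on each node, with the prefix from the modulefile above:
#   if prefix_has_orted /ssd/dependencies/openmpi-1.8.3_intel15; then
#       echo "orted present on $(hostname)"
#   else
#       echo "orted missing on $(hostname)"
#   fi
```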

Regards.

Afa.L Cheng

Ray Song

Nov 8, 2014, 12:47:49 PM
to Alpha Cheng, sh...@googlegroups.com
Are you from Purdue University? We met in Guangzhou this year...

You guys are already getting started....

--
Ray
http://maskray.me

Alpha Cheng

Nov 8, 2014, 2:46:23 PM
to Ray Song, sh...@googlegroups.com
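Our cover is blown...

The problem is not solved, and there is no time left to solve it, so we have decided to drop OMPI and keep using Intel MPI. Apart from using somewhat more resources, its performance is quite good.


PS:
> You guys are already getting started....
What we are getting started on this time is SC14, half a month from now. Also, I heard that the HPL guru from Zhongshan (Sun Yat-sen University) has moved to Tsinghua?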

Regards.

Afa.L Cheng
