Why can't mpirun utilize all CPU resources in ns-3?


Tuvie

Apr 21, 2016, 4:54:44 AM
to ns-3-users
Hi,

I tried to run a simulation on a server with two CPUs (Xeon 5650; 24 cores shown in /proc/cpuinfo in total).
If I use "mpirun -np 12 ....." to run the simulation, I get a CPU utilization of 50%.
But when I use "mpirun -np 20 ..." to run this simulation, I still get a CPU utilization of about 50%.

At first, I thought the Intel Xeon CPU was the reason, because each CPU actually has only 6 physical cores (supporting 12 hardware threads).

So I tested a simple Python program with "mpirun -np 24 python test.py".
However, in that case the CPU utilization was 100%.

Why can't mpirun utilize all CPU resources in ns-3?
I searched for an answer but found nothing.

Could anyone kindly tell me the reason?

Thanks in advance.


Nat P

Apr 21, 2016, 5:07:00 AM
to ns-3-users


On Thursday, April 21, 2016 at 10:54:44 AM UTC+2, Tuvie wrote:
Hi,

I tried to run a simulation on a server with two CPUs (Xeon 5650; 24 cores shown in /proc/cpuinfo in total).
If I use "mpirun -np 12 ....." to run the simulation, I get a CPU utilization of 50%.
But when I use "mpirun -np 20 ..." to run this simulation, I still get a CPU utilization of about 50%.

It seems you've encountered a scalability limit in your program: it can utilize only 50% of the resources. Perhaps there is a bug in your code?

Nat

Tuvie

Apr 21, 2016, 5:35:53 AM
to ns-3-users
Hi Nat,

Thanks very much for your reply.

Do you know what kinds of bugs might introduce such a scalability limit?


On Thursday, April 21, 2016 at 5:07:00 PM UTC+8, Nat P wrote:

Nat P

Apr 21, 2016, 9:42:37 AM
to ns-3-users


On Thursday, April 21, 2016 at 11:35:53 AM UTC+2, Tuvie wrote:
Hi Nat,

Thanks very much for your reply.

Do you know what kinds of bugs might introduce such a scalability limit?

Algorithmic problems (e.g. half of the processors have a lot of computation to do while the other half have nothing to do) are the most common source; however, there are both theoretical issues and practical issues. Distinguishing between them (and finding them) is hard without the code, and without the time to analyze it.
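For example, here is a toy program (plain MPI in C++, not ns-3; purely illustrative) that shows this kind of imbalance: half the ranks compute, the other half have nothing to do, so aggregate CPU utilization stays around 50% while it runs.

// Toy illustration of algorithmic load imbalance (not ns-3 code):
// half the ranks are busy, the other half are idle.
#include <mpi.h>
#include <chrono>
#include <cmath>
#include <thread>

int main (int argc, char **argv)
{
  MPI_Init (&argc, &argv);
  int rank = 0, size = 1;
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Comm_size (MPI_COMM_WORLD, &size);

  volatile double sink = 0.0;
  if (rank < size / 2)
    {
      // "Busy" half: heavy computation, close to 100% of one core each.
      for (long i = 0; i < 200000000L; ++i)
        {
          sink = sink + std::sqrt (static_cast<double> (i));
        }
    }
  else
    {
      // "Idle" half: nothing to do, consumes (almost) no CPU.
      std::this_thread::sleep_for (std::chrono::seconds (10));
    }

  MPI_Finalize ();
  return 0;
}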

Nat

pdbarnes

Apr 21, 2016, 10:47:16 AM
to ns-3-users
To run an ns-3 model in parallel (with MPI) you have to partition the Nodes by assigning them to MPI ranks, using the systemId argument of the constructor: Node (uint32_t systemId).

Perhaps you only use systemIds from 0-11? Then the extra ranks have (almost) no work, and you'll only see 50% utilization with 24 ranks.
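Something along these lines (an untested sketch of the usual ns-3 distributed setup; the node count and the round-robin mapping are just placeholders):

// Sketch: spread nodes across every MPI rank so no rank is left idle.
#include "ns3/core-module.h"
#include "ns3/network-module.h"
#include "ns3/mpi-module.h"

using namespace ns3;

int main (int argc, char *argv[])
{
  // Use the distributed simulator implementation and start MPI.
  GlobalValue::Bind ("SimulatorImplementationType",
                     StringValue ("ns3::DistributedSimulatorImpl"));
  MpiInterface::Enable (&argc, &argv);

  uint32_t systemId = MpiInterface::GetSystemId ();   // this process's rank
  uint32_t systemCount = MpiInterface::GetSize ();    // number of ranks (-np)
  NS_LOG_UNCOND ("Rank " << systemId << " of " << systemCount);

  // Assign nodes round-robin over all ranks via Node (uint32_t systemId).
  NodeContainer nodes;
  for (uint32_t i = 0; i < 48; ++i)   // 48 nodes is an arbitrary example
    {
      nodes.Add (CreateObject<Node> (i % systemCount));
    }

  // ... install links and applications here; only install applications on
  //     nodes whose GetSystemId () matches this rank ...

  Simulator::Stop (Seconds (10.0));
  Simulator::Run ();
  Simulator::Destroy ();
  MpiInterface::Disable ();
  return 0;
}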

Peter

Tuvie

Apr 21, 2016, 11:19:37 AM
to ns-3-users
Hi Nat,

Thanks for your reply again.

Embarrassingly, I found that I made a mistake. The overall CPU utilization is not about 50%: the "user" CPU utilization is about 58%, while the "system" CPU utilization is about 25% (see Figure 1 below). Very sorry for the earlier wrong description.

But there is still a question. When I run "mpirun -np 12", all of the CPU utilization is "us" and the "sy" utilization is 0 (see Figure 2). However, when I run "mpirun -np 20", the "sy" utilization is 25%. Why? What is the difference between the two cases?


Figure 1: top output after running "mpirun -np 20" on the 24-core server.


Figure 2: top output after running "mpirun -np 12" on the 24-core server.



On Thursday, April 21, 2016 at 9:42:37 PM UTC+8, Nat P wrote:

Tuvie

Apr 21, 2016, 11:23:53 AM
to ns-3-users
Hi pdbarnes,

Thanks for your reply.
I have already adapted my simulation code to different numbers of processors. (I change the nodes' systemId assignments when I use more processors.)

I think the description in my first mail was wrong. Please see my previous mail, where I clarified the problem.
 

On Thursday, April 21, 2016 at 10:47:16 PM UTC+8, pdbarnes wrote:

pdbarnes

Apr 21, 2016, 2:16:12 PM
to ns-3-users
Hmm. I haven't looked at this, but I wonder if hyper-threading induces more system time?

P

Tuvie

Apr 22, 2016, 4:10:30 AM
to ns-3-users
Hi pdbarnes,

Hyper-threading looks like the best explanation so far. It is said that hyper-threading does not always increase the performance of MPI.
Maybe I should read more material about MPI and hyper-threading to understand this clearly.

Thank you all


On Friday, April 22, 2016 at 2:16:12 AM UTC+8, pdbarnes wrote:

Nat P

Apr 22, 2016, 6:36:05 AM
to ns-3-users
I think it is still an algorithmic problem. Your algorithm seems to spend a lot of time transferring data between nodes (25%) and comparatively little time doing computation (50%). Increasing the number of ranks only worsens the situation (more data to exchange).

But really, if you want to learn more about this, my advice is to follow a parallel computing course on Coursera.

Nat

pdbarnes

Apr 22, 2016, 10:34:43 AM
to ns-3-users
I just noticed you ran mpirun -np 20, not 24, so the 12 physical cores are unevenly loaded (assuming your model is balanced). Perhaps the system is chasing that imbalance?

Could you please share your script (as an attachment)?

Thanks,
Peter

Tuvie

Apr 28, 2016, 10:15:01 PM
to ns-3-users
Hi Nat,

Sorry for the late reply.
I think you are right. I increased the amount of data transferred between nodes in different LPs when I switched to 20 LPs.
I need to check this.

Thank you.

On Friday, April 22, 2016 at 6:36:05 PM UTC+8, Nat P wrote:

Tuvie

Apr 28, 2016, 10:33:34 PM
to ns-3-users
Hi pdbarnes,

Sorry for the late reply.

I didn't share my script before because I also modified the routing code (not just the script).

Here is my script. I want to simulate a fat-tree datacenter topology (k = 8).
There are two kinds of traffic, bulk send and UDP echo, generated randomly.

When I use 12 LPs, I assign the 8 pods to 8 LPs and all the core switches to the remaining 4 LPs.
When I use 20 LPs, I assign all the hosts to 8 LPs, all the switches within the pods to another 8 LPs, and all the core switches to the remaining 4 LPs.
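To make that mapping concrete, it looks roughly like this in my script (the helper names below are simplified for this mail, not the exact code in the attachment):

// 20-LP mapping for the k = 8 fat-tree:
//   hosts of pod p      -> LP p       (LPs 0-7)
//   switches of pod p   -> LP 8 + p   (LPs 8-15)
//   core switches (16)  -> LPs 16-19, spread round-robin
uint32_t
HostSystemId (uint32_t pod)               // pod in [0, 7]
{
  return pod;
}

uint32_t
PodSwitchSystemId (uint32_t pod)          // edge + aggregation switches of a pod
{
  return 8 + pod;
}

uint32_t
CoreSwitchSystemId (uint32_t coreIndex)   // coreIndex in [0, 15]
{
  return 16 + (coreIndex % 4);
}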

Referring to Nat's answer, I think it may be because too much data is transferred between nodes in different LPs.
I will run more experiments to check that.

On Friday, April 22, 2016 at 10:34:43 PM UTC+8, pdbarnes wrote:
datacenter-ecmp.cc