Is there anybody who has installed and run MOOSE on a commercial HPC cloud service?


Jean Francois Leon

Jul 14, 2015, 8:51:50 PM
to moose...@googlegroups.com
Hi,
I don't have access to government or university HPC resources.

My whole approach with MOOSE relies on the ability to run it on a commercial cloud service.

I want to validate my workflow, and I plan to spend some time trying to do that early on, even if my apps are not ready for it...

So my questions:
Has anybody already done this?
If yes, are there any pitfalls or specifics I should be aware of?

Are the cluster install notes valid for these services?
[I am thinking of starting with Amazon's EC2, but I am open-minded about Google or other commercial providers.]

Regardless, I will keep you posted on my progress in this thread in the weeks to come.

Cheers
JF


Derek Gaston

Jul 14, 2015, 9:00:14 PM
to moose...@googlegroups.com
I have run it on Amazon EC2... but not extensively.  If you use an Ubuntu 14.04 image on there... then our Redistributable Package will install just fine and you should be up and running.

I haven't used it in tandem with MOOSE (yet) but a lot of people are using Star Cluster ( http://star.mit.edu/cluster/ ) to manage Amazon EC2 clusters.  If you have some success using it with MOOSE we would love to hear about it!

Finally, our wiki does have Cluster Installation Instructions ( http://mooseframework.org/wiki/ClusterInstructions/ ).  There are two options: Multi-User and Single-User.  The difference is whether you install everything into the "system" directories (Multi-User) or set everything up in your home directory (Single-User).

If you have any trouble at all, please email us!

Derek

--
You received this message because you are subscribed to the Google Groups "moose-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to moose-users...@googlegroups.com.
Visit this group at http://groups.google.com/group/moose-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/moose-users/687c4fca-918c-4a53-8d9f-5543b2a2dd3c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jean Francois Leon

Jul 15, 2015, 9:28:18 AM
to moose...@googlegroups.com
Thanks for the input, Derek.

Reassuring to know I am not the first on this path.

I have started looking at StarCluster. It looks great, by the way; I had never heard of it before.

The first thing I note, though, is that both Amazon AND StarCluster recommend using CentOS-based AMIs for HPC clusters.

Is there any known MOOSE-specific issue I should be aware of in that regard, compared to Ubuntu?
(I am familiar with the different distros in general, so my question is really specific to MOOSE on clusters.)

Is there a good reason to go against this recommendation and go with an Ubuntu AMI?
JF

Miller, Jason M

Jul 15, 2015, 9:58:03 AM
to moose...@googlegroups.com
A few reasons I would hand out:

We do not have a package for CentOS, so you would be building everything from the ground up on your own for that platform (we would still help you, of course!). The other reason: CentOS holds on to older stable libraries. That sounds great on paper, but sometimes those libraries end up being too old, and you simply end up installing newer versions of them by hand anyway. Ubuntu, Mint, Fedora, etc. have a much more modern set of libraries available.

Just some insight :)
Jason



Jean Francois Leon

Jul 15, 2015, 10:04:33 AM
to moose...@googlegroups.com
Thanks, Jason, that's a good set of reasons indeed!

So, go with Ubuntu.

Now here comes the first of many naive questions:

I have never compiled on a cluster before.
Is there a cluster-specific compile, or do I just compile on one instance and the same executable is duplicated on the nodes, with OpenMPI managing the interactions?

I am really trying to figure out a test workflow here, to make sure it will work when I need it...

JF

Miller, Jason M

Jul 15, 2015, 10:46:38 AM
to moose...@googlegroups.com
I started answering this question and quickly realized I may have given you the wrong impression about CentOS. The reason our HPC clusters run SLES or RHEL (of which CentOS is basically a clone) and not Ubuntu/Mint/Fedora is that our job scheduler requires it. I am not familiar with Amazon's job scheduler, or whether there is one available for Ubuntu. I am _guessing_ there is. Derek would know more about this, methinks. In fact, that may be exactly what Star Cluster does.

That being said, if you have a job scheduler available in your cluster environment, you should only have to build your application once. That application can then be executed through the job scheduler (to utilize all the nodes and time you purchased). The job scheduler will manage the OpenMPI implementation.

Jason





Derek Gaston

Jul 15, 2015, 10:58:59 PM
to moose...@googlegroups.com
You don't want to roll your own cluster image.  If you actually want to create a "cluster in the cloud" use one of the Star Cluster images.  You'll then want to go through our cluster install instructions I linked earlier.

If all you want is single image runs (i.e. you're not going to use more than one node at a time... like 8 processors or whatever at a time) then it's fine to just use a regular Ubuntu image from Amazon EC2 and install our package.

To answer your question about executables... the way it works is that your home directory will be NFS mounted ( https://en.wikipedia.org/wiki/Network_File_System ) across all instances of your image so your executables (and all of your data) will be available everywhere automatically.

Star Cluster is about more than just having an "OS".... it's a complete package that also takes care of dynamically starting up and shutting down nodes as you need them.

Derek

Jean Francois Leon

Jul 18, 2015, 7:26:37 AM
to moose...@googlegroups.com
Hey,
So I am going to do some tests on both AWS EC2 (Amazon) and gcloud (Google).
I have been talking to people at Google, and their offering (which is recent) seems to be coming on strong for HPC applications... as they tell me a comparative test will show (which I am going to run).

Right now I am preparing these tests, and I am facing a new (and unknown) world...

So the first thing I did was review the cluster installation instructions you have on the wiki [single-user mode],
and I have a very simple question to start with.

When I compare these instructions with the Linux manual install, I note that in the latter case you recommend a special install of OpenMPI before installing PETSc, while the cluster install notes do not mention it at all.
Is that normal?

In my (possibly incorrect) view, the cluster install is compiled on one node... and then the executable is mirrored and used on the compute nodes as needed each time I launch an mpirun.

My understanding is that the MOOSE cluster install notes are really "single user without root access" install notes and should work as such on any Linux machine.

Am I correct? (Except for the MPI stuff and a few directory changes, they are similar.)
[I tried them on my workstation in that context and PETSc refused to configure for compilation because of an MPI issue.]

Thanks
JF

Derek Gaston

Jul 18, 2015, 11:55:08 AM
to moose...@googlegroups.com
The single-user directions are intended for people using an existing cluster that someone else maintains (and like you say they don't have root access).  In that scenario the admin for that machine has already installed and setup MPI... because MPI is VERY specific on real clusters (you can't just use *anything*... you have to use exactly the right MPI for your hardware and queuing setup).

In this case you ARE the admin... so you need to decide what MPI to setup.

You are also still a bit off on how MPI works.  MPI does not "mirror your executable".  It depends on the nodes using a network filesystem that will give access to the very same executable across nodes in the same way.

When an MPI job starts it simply ssh's to each node and runs your executable... the executable must already be accessible there.
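
As a sketch of that launch model (the hostnames and path here are hypothetical, not from an actual setup), a plain OpenMPI hostfile-based run looks like:

```
# hosts.txt lists the nodes and the MPI slots on each, e.g.:
#   node001 slots=2
#   node002 slots=2
# mpirun ssh'es into each listed node and starts the very same binary there,
# so the path must resolve identically on every node (e.g. via the
# NFS-mounted home directory):
mpirun -n 4 --hostfile hosts.txt ~/moose/test/moose_test-opt -i simple_diffusion.i
```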

I think that this is probably all too much to learn at once and will once again point you at using Star Cluster (or an alternative).  If you've never set up a cluster before there is simply too much to learn.

Another option is to set up a local "cluster" first so you can learn.  Just get two machines on the same ethernet network and install Ubuntu on both of them and try to go through the steps to get to the point where you can run an MPI job across both of them.

Trying to learn Amazon EC2 and "cloud" stuff at the same time you're trying to learn how to set up a cluster is going to be a big PITA.

Derek

Jean Francois Leon

Jul 18, 2015, 2:43:34 PM
to moose...@googlegroups.com
Hi Derek, thanks for all that.
It is very helpful.
I hear [as I previously heard] your recommendations and intend to follow them.
In fact I am already using StarCluster on EC2, and I pointed out its ease of use to my Google Cloud contact :-)

Now, Google Cloud has a slightly different approach to clustering [the cluster is built in using what they call containers and pods], but they are willing to help me get up and running.
The reason I am doing all this is to gain enough understanding of MOOSE-specific requirements to talk to these people... I am not going to reinvent the wheel.

In that context, my early-morning post was motivated by the following, which I stumbled onto while doing some "homework":

1. I have MOOSE up and running the "normal way" on my workstation [so I have MPI installed].
2. I tried to see if I could install MOOSE as a single user on this workstation, ignoring the already available install [I created a specific user for that purpose so that I don't alter my working environment].

I tried to do that because I could not find anything "cluster specific" in your single-user cluster install instructions, and I try to understand things as much as possible.
It failed miserably at the first step, configuring PETSc, without a clear understanding of why [the error message tells me my MPI options don't work]... which tells me I am missing something...

My next step is an install on EC2 through StarCluster.
...to be continued

JF

Jean Francois Leon

Jul 19, 2015, 4:34:11 PM
to moose...@googlegroups.com
Hi All,

I am (almost) up and running on EC2.
Could someone be kind enough to post a documented example of how to launch a run on a cluster with n nodes, each node having p cores [so the total number of cores is n*p]?
For example: 4 nodes (one master + 3 slaves) of 32 cores each, using all of them.
It is understood that this is a virtual cluster built with StarCluster and using Ubuntu 14.04, even if that is most probably irrelevant for this question.

I have looked at cluster_launcher.py --dump,
but it is not very helpful for a beginner [most importantly because the relationship between a node and a PBS chunk is not clear to me, and Google is not really helpful on this one so far...].

So an example to start with would be very helpful.

Thanks again, guys; I wouldn't be here without you all!

JF

Derek Gaston

Jul 20, 2015, 9:52:27 AM
to moose...@googlegroups.com
cluster launcher won't help in this scenario.  StarCluster uses Sun Grid Engine for job queuing... and currently cluster launcher only supports PBS (another job queuing system that is very popular... and is what we use at INL).  Conceivably, cluster_launcher could be extended at some point to support Sun Grid Engine... we'll have to see.

At any rate, you need to look at how to submit a job: http://star.mit.edu/cluster/docs/0.93.3/guides/sge.html

You should make sure that a simple job works (like the 'hostname' job in the documentation I linked to) but you should really use a "Job Script" as is explained further down the page.

Toward the bottom of that documentation it tells you how to run an MPI job.  Read it carefully.
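
As a sketch, a minimal SGE job script for a MOOSE run might look like this (the parallel environment name "orte", the slot count, and the paths are assumptions for a StarCluster-style setup; check the SGE documentation linked above for your cluster's actual configuration):

```
#!/bin/bash
# job.sh -- submit with: qsub job.sh
#$ -cwd          # run in the directory the job was submitted from
#$ -V            # export the current environment to the job
#$ -pe orte 8    # request 8 slots in the "orte" parallel environment
#$ -o job.out    # stdout log
#$ -e job.err    # stderr log

# SGE sets $NSLOTS to the number of slots actually granted; the
# OpenMPI/SGE integration discovers the allocated hosts automatically.
mpirun -n $NSLOTS ~/moose/test/moose_test-opt -i simple_diffusion.i
```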

Derek


Jean Francois Leon

Jul 21, 2015, 9:35:28 AM
to moose...@googlegroups.com
Hi All

I am up and running with MOOSE on StarCluster as of this morning.

I am running a bunch of tests and learning to use SGE now.

Is there any cluster-specific test it would make sense for me to run for testing and benchmarking purposes? Any suggestions?

What I have done so far:
./run_tests -j 2 on a 1-node, 2-core t2.medium EC2 instance configuration, which gives:

Ran 936 tests in 482.1 seconds
934 passed, 17 skipped, 0 pending, 2 FAILED

I then ran ./run_tests -j 4 on 2 nodes with 2 cores each (same instance type), which gives:

Ran 936 tests in 311.6 seconds
933 passed, 17 skipped, 0 pending, 3 FAILED

So more nodes matter already, "out of the box", without any fancy SGE/mpirun tweaks.

Regarding test results, the ones that fail are:
misc/jacobian.simple................................................................ FAILED (NO EXPECTED OUT)
misc/jacobian.med....
They pass on my workstation.
Anything I should be worried about?

If there is interest, I can write a detailed post about the steps I had to go through [and thanks to Derek for providing the right pointers at the right time... I don't know where I would be without them].
Next week or the one after, I will (try to) do the same on Google Cloud... another new world.
In the meantime I will continue running cases on AWS.

Thanks all again for your support

JF 

Derek Gaston

Jul 21, 2015, 10:59:36 AM
to moose...@googlegroups.com
Don't run run_tests with SGE... it's not really intended to work that way.

Can you tell me all 3 failures?  I'm sure they can all be ignored, but I just want to see what they are.

What you should do now is go run a test manually using SGE.  So go into moose/test/tests/kernels/simple_diffusion and run moose_test manually like so:

../../../moose_test-opt -i simple_diffusion.i

Then try to run the same thing by submitting a job to SGE.  It's a small run so just run on it on 4 or so processors.  If you want to make the problem a bit larger so you can run on more processors you can uniformly refine the mesh using something like this:

 ../../../moose_test-opt -i simple_diffusion.i Mesh/uniform_refine=2

Derek


Jean Francois Leon

Jul 21, 2015, 1:10:05 PM
to moose...@googlegroups.com
OK,
here are the results of the suggested tests.

First of all, I was not able to reproduce the 3 failing tests I reported earlier; I got only 2 every time, as shown at the end of the test report:
misc/jacobian.simple................................................................ FAILED (NO EXPECTED OUT)
misc/jacobian.med................................................................... FAILED (NO EXPECTED OUT)
-------------------------------------------------------------------------------------------------------------
Ran 936 tests in 117.6 seconds
934 passed, 17 skipped, 0 pending, 2 FAILED
[Run on the same test configuration described above: 2 nodes with 2 cores each.]

I then ran, on the same configuration, a simple: mpirun -n 4 ../../../moose_test-opt -i simple_diffusion.i Mesh/uniform_refine=4
The mesh refinement level is 4; smaller numbers ran way too fast. I copy the dump generated by this case below.

Note: I am not using SGE, as I have not had time to read the manual... but it seems that I can use a lot of cores on AWS without it.
[I don't ignore the benefits of job scheduling, though... I am just not there yet.]
Cheers
JF
=======================================================
test output:
Framework Information:
MOOSE version:           git commit 009c866 on 2015-07-20
PETSc Version:           3.6.0
Current Time:            Tue Jul 21 16:49:34 2015
Executable Timestamp:    Tue Jul 21 12:45:55 2015

Parallelism:
  Num Processors:          4
  Num Threads:             1

Mesh:
  Distribution:            serial
  Mesh Dimension:          2
  Spatial Dimension:       3
  Nodes:
    Total:                 25921
    Local:                 6572
  Elems:
    Total:                 34100
    Local:                 8587
  Num Subdomains:          1
  Num Partitions:          4
  Partitioner:             metis

Nonlinear System:
  Num DOFs:                25921
  Num Local DOFs:          6572
  Variables:               "u"
  Finite Element Types:    "LAGRANGE"
  Approximation Orders:    "FIRST"

Execution Information:
  Executioner:             Steady
  Solver Mode:             Preconditioned JFNK
  Preconditioner:          hypre boomeramg



 0 Nonlinear |R| = [32m1.268858e+01 [39m
      0 Linear |R| = [32m1.268858e+01 [39m
      1 Linear |R| = [32m3.293630e-01 [39m
      2 Linear |R| = [32m1.509189e-02 [39m
      3 Linear |R| = [32m5.418635e-04 [39m
      4 Linear |R| = [32m2.432376e-05 [39m
 1 Nonlinear |R| = [32m2.445487e-05 [39m
      0 Linear |R| = [32m2.445487e-05 [39m
      1 Linear |R| = [32m1.761497e-06 [39m
      2 Linear |R| = [32m8.878479e-08 [39m
      3 Linear |R| = [32m5.254326e-09 [39m
      4 Linear |R| = [32m2.591032e-10 [39m
      5 Linear |R| = [32m1.095354e-11 [39m
 2 Nonlinear |R| = [32m1.832061e-10 [39m

 ------------------------------------------------------------------------------------------------------------
| Moose Test Performance: Alive time=1.96121, Active time=0.813166                                           |
 ------------------------------------------------------------------------------------------------------------
| Event                         nCalls     Total Time  Avg Time    Total Time  Avg Time    % of Active Time  |
|                                          w/o Sub     w/o Sub     With Sub    With Sub    w/o S    With S   |
|------------------------------------------------------------------------------------------------------------|
|                                                                                                            |
|                                                                                                            |
| Exodus                                                                                                     |
|   output()                    2          0.0837      0.041843    0.0837      0.041843    10.29    10.29    |
|                                                                                                            |
| Solve                                                                                                      |
|   ComputeResidualThread       17         0.4281      0.025181    0.4281      0.025181    52.64    52.64    |
|   computeDiracContributions() 19         0.0000      0.000002    0.0000      0.000002    0.00     0.00     |
|   compute_dampers()           2          0.0000      0.000001    0.0000      0.000001    0.00     0.00     |
|   compute_jacobian()          2          0.1011      0.050557    0.1011      0.050559    12.43    12.44    |
|   compute_residual()          17         0.0920      0.005412    0.5237      0.030806    11.31    64.40    |
|   compute_user_objects()      44         0.0001      0.000001    0.0001      0.000001    0.01     0.01     |
|   residual.close3()           17         0.0018      0.000103    0.0018      0.000103    0.22     0.22     |
|   residual.close4()           17         0.0018      0.000106    0.0018      0.000106    0.22     0.22     |
|   solve()                     1          0.1046      0.104623    0.7295      0.729524    12.87    89.71    |
 ------------------------------------------------------------------------------------------------------------
| Totals:                       138        0.8132                                          100.00            |
 ------------------------------------------------------------------------------------------------------------
 -------------------------------------------------------------------------------------------------------------------------
| Setup Performance: Alive time=1.96141, Active time=0.338222                                                             |
 -------------------------------------------------------------------------------------------------------------------------
| Event                                      nCalls     Total Time  Avg Time    Total Time  Avg Time    % of Active Time  |
|                                                       w/o Sub     w/o Sub     With Sub    With Sub    w/o S    With S   |
|-------------------------------------------------------------------------------------------------------------------------|
|                                                                                                                         |
|                                                                                                                         |
| Setup                                                                                                                   |
|   Create Executioner                       1          0.0002      0.000200    0.0002      0.000200    0.06     0.06     |
|   FEProblem::init::meshChanged()           1          0.0537      0.053701    0.0537      0.053701    15.88    15.88    |
|   Initial computeUserObjects()             1          0.0000      0.000008    0.0000      0.000008    0.00     0.00     |
|   Initial execMultiApps()                  1          0.0000      0.000002    0.0000      0.000002    0.00     0.00     |
|   Initial execTransfers()                  1          0.0000      0.000002    0.0000      0.000002    0.00     0.00     |
|   Initial updateActiveSemiLocalNodeRange() 1          0.0019      0.001925    0.0019      0.001925    0.57     0.57     |
|   Initial updateGeomSearch()               2          0.0000      0.000002    0.0000      0.000002    0.00     0.00     |
|   NonlinearSystem::update()                1          0.0089      0.008864    0.0089      0.008864    2.62     2.62     |
|   Output Initial Condition                 1          0.0647      0.064676    0.0647      0.064676    19.12    19.12    |
|   Prepare Mesh                             1          0.0002      0.000167    0.0002      0.000167    0.05     0.05     |
|   copySolutionsBackwards()                 1          0.0039      0.003856    0.0039      0.003856    1.14     1.14     |
|   eq.init()                                1          0.2031      0.203056    0.2031      0.203056    60.04    60.04    |
|   getMinQuadratureOrder()                  1          0.0000      0.000004    0.0000      0.000004    0.00     0.00     |
|   initial adaptivity                       1          0.0000      0.000001    0.0000      0.000001    0.00     0.00     |
|   maxQps()                                 1          0.0017      0.001749    0.0017      0.001749    0.52     0.52     |
|   reinit() after updateGeomSearch()        1          0.0000      0.000007    0.0000      0.000007    0.00     0.00     |
|                                                                                                                         |
| ghostGhostedBoundaries                                                                                                  |
|   eq.init()                                1          0.0000      0.000001    0.0000      0.000001    0.00     0.00     |
 -------------------------------------------------------------------------------------------------------------------------
| Totals:                                    18         0.3382                                          100.00            |
 -------------------------------------------------------------------------------------------------------------------------


Daniel Schwen

Jul 21, 2015, 1:14:16 PM
to moose...@googlegroups.com
Hi Jean Francois,
Those failing Jacobian tests are mine. They test the moose/python/jacobiandebug/analyzejacobian.py script, which launches MOOSE by itself (and can only work in serial... I may need to restrict those tests to serial!).
TL;DR: don't worry about those failures for now!
Daniel

Cody Permann

Jul 21, 2015, 1:18:07 PM
to moose...@googlegroups.com
Also, I frequently get this question: those "funny" characters are color codes. You can view them with a tool like "less", or you can remove them altogether with --no-color on the command line.

Cody


Derek Gaston

Jul 21, 2015, 1:27:56 PM
to moose...@googlegroups.com
This looks good JF.  MOOSE is definitely running on 4 processors... and the solve looks good.

Without running with SGE you won't have access to more processors than are available on your one (head) node.  So if you are running on a t2.medium you only really have access to 2 processors (note: you can oversubscribe them by doing "mpirun -n 4" but you won't see any speedup over doing "mpirun -n 2 ").

You should probably focus on using the C4 instances on Amazon.  They're using Haswell processors which are hella fast.  C3 is also a good choice.  It's slightly "cheaper" but your jobs will run slightly slower with the older Intel processors.

Of course, while you're just messing around and learning... using T2 instances is very cost effective :-)

Glad you're headed in the right direction!

Derek


Jean Francois Leon

Jul 21, 2015, 1:45:15 PM
to moose...@googlegroups.com
Derek,
I fully agree with your comment regarding instance choice: the bigger the better. Plus, the c4 instances have InfiniBand interconnect and other HPC-optimized goodies, while the t2 are... well, the bottom of the barrel.
I am just testing the workflow here on my own dime... so yeah, I try to keep it cheap.

One point, though: the current stable release of StarCluster (0.95.6) does not support c4 instances, so c3 is the best option.

Regarding SGE: point taken ==> I need to dive into it.

JF

Jean Francois Leon

Jul 23, 2015, 11:08:26 AM
to moose-users, j...@galtenco.com
Hi all,

MOOSE runs in full glory on AWS under an SGE environment.

I use the command line:
qsub -V -b y -cwd -pe orte 4 -e error_Log1.txt -o output.txt mpirun ../../../moose_test-opt -i simple_diffusion.i Mesh/uniform_refine=5

[103041 DoFs]

on my 2-node, 2-CPU-per-node "bottom of the barrel" configuration, and it works without a glitch.

Now here is the quirk:
On this slow, low-power configuration, it ends up being much slower to use the full 2-node, 2-core setup than a single node with 2 cores.
No big deal so far; I know things should change with dedicated, powerful cluster instances...

But it makes me want to look into the details of a run to understand more: to see where the bottlenecks are, and so forth...

So here is a new question (for the SGE savvy):

What tools/log files [if any] can I activate to examine how a job is run on an SGE-based cluster? By default I cannot see any log file in /var/log, for example.
I looked at a few SGE group lists and docs but did not find anything obvious...

Thanks again for everything,
JF


Cody Permann

Jul 23, 2015, 11:26:45 AM
to moose-users, j...@galtenco.com
MOOSE can print out a performance log telling you at a high level where we are spending time. That can be turned on with the "print_perf_log = true" parameter in the Outputs block.
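
In the input file, that looks something like this (a config fragment; the exodus line just stands in for whatever outputs your block already has):

```
[Outputs]
  exodus = true           # whatever outputs you already have
  print_perf_log = true   # print the performance log at the end of the run
[]
```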

However if you want more detailed information, you'll have to dig deeper with a profiling tool. In a cluster environment there are often proprietary profiling tools available. "gprof" is always a reliable free alternative. On OS X with Xcode you can use "Instruments". All of these tools are designed to give you source level information about where your programs are spending the most time.

Cody


Derek Gaston

Jul 23, 2015, 1:27:31 PM
to moose...@googlegroups.com, j...@galtenco.com
I would be interested in seeing the MOOSE performance logs from those two runs.

One thing you need to keep in mind is that you need a large enough problem that can take advantage of more cores before you get any speed up from a parallel run. A good rule of thumb is that you need about 20,000 degrees of freedom per processor (anything smaller than that and communication time will swamp your computation time, especially for a really simple, linear problem). So for 4 processors you need about 80,000 DoFs... how many did you have in your test problem?
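
That rule of thumb is easy to turn into a quick sizing check (a back-of-the-envelope sketch, not part of MOOSE; the 20,000 figure is just the heuristic above):

```python
# Rule-of-thumb parallel sizing: ~20,000 degrees of freedom per MPI
# process before communication time starts to swamp computation.
DOFS_PER_PROC = 20_000

def min_dofs(nprocs):
    """Smallest problem size worth running on nprocs processes."""
    return nprocs * DOFS_PER_PROC

def max_procs(ndofs):
    """Largest process count worth using for a problem with ndofs."""
    return max(1, ndofs // DOFS_PER_PROC)

print(min_dofs(4))       # 80000 -> ~80k DoFs needed for 4 processes
print(max_procs(25921))  # 1 -> the 25,921-DoF run posted earlier fits 1 process
```

(The 25,921 figure is the "Num DOFs" from the simple_diffusion output posted earlier in the thread.)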

But, like you say, it's not surprising that using the cheap machines on EC2 gives poor results. Do you have any idea if the two nodes are even located "near" each other at all (like even in the same building... or even the same part of the country)?

Derek


Jean Francois Leon

Jul 25, 2015, 12:19:47 PM
to moose-users, frie...@gmail.com
Test results:

OK, I did a few tests this morning.

I used the test in kernels/simple_diffusion and set the mesh size to have slightly more than 100,000 DoFs.
The option "print_perf_log = true" is activated, so I generate the detailed output files.

I used a c4.large instance. The c4 instances are the highest-performing instances on AWS at the moment, and the .large size has 2 cores.
With these instances, according to the AWS website (and assuming I understood what I read), the nodes in one cluster are grouped together (whatever that means... at least the same building, I guess) with a 10G link.

I ran the following tests, which solve the same problem in 3 different ways:
test 1: run the executable on one node, one core [no mpirun].
test 2: run mpirun -n 2 ..executable on one node.
I then launched an additional node [same instance type]; I now have access to a total of 4 cores.

test 3: qsub -V -b y -cwd -pe orte 4 -e error_Log1.txt -o output.txt mpirun ..executable

Here are the results I get:
test 1: Moose Test Performance: Alive time=9.68299, Active time=7.29408
test 2: Moose Test Performance: Alive time=11.508, Active time=6.79423
test 3: Moose Test Performance: Alive time=10.3707, Active time=3.79845

So if I assume that the active time is the "real working time", going from 1 to 2 cores changes almost nothing in this case [which I find surprising, given that the number of cores is doubled on the same workstation]. I ran the tests several times and found very similar results, with variations of a few % in time spent.
Going from 1 to 2 nodes [and 4 procs] gives a solid 2X speedup on the active time, but NO decrease in alive time... I understand this is a small test; I hope the difference between alive and active time decreases rapidly with increasing problem size...

So right now what really puzzles me is the lack of improvement on the one-node configuration going from one core to 2... is it meaningful?
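For what it's worth, here is the arithmetic behind those numbers (a quick Python sketch, using the active times quoted above):

```python
# Speedup and parallel efficiency implied by the "Active time" figures above.
active = {
    "1 core / 1 node": (1, 7.29408),
    "2 cores / 1 node": (2, 6.79423),
    "4 cores / 2 nodes": (4, 3.79845),
}
base = active["1 core / 1 node"][1]
for config, (ncores, t) in active.items():
    speedup = base / t
    efficiency = speedup / ncores
    print(f"{config}: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

which makes the pattern explicit: essentially no gain from the second core on one node, and only ~48% parallel efficiency at 4 cores.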
Cheers
JF
ps: I tried to attach the log files but get a server error every time... not sure why; they are small in size.
I will try to send them privately to Derek, who asked for them, and to anyone else interested.

Derek Gaston

unread,
Jul 25, 2015, 4:53:05 PM
to Jean Francois Leon, moose-users
Wow - that is pretty terrible.

The number that you actually want to compare is the "Total Time With Sub" for solve().  That's the number that matters as you go to larger runs.  These small runs that only take tens of seconds will spend a significant percentage of their time just setting up and tearing down... which won't be true for real runs.
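If it helps automate the comparison, here is a rough sketch of pulling that number out of a saved log. The table layout below is a guess for illustration; adjust the parsing to match the actual perf-log table your MOOSE version prints:

```python
# Hypothetical perf-log excerpt -- the real table layout may differ.
log = """
| Event              | Self Time | Total Time With Sub |
| solve()            |   2.1000  |       6.9884        |
| compute_residual() |   1.5000  |       1.5000        |
"""

def total_time_with_sub(text, event="solve()"):
    """Return the last numeric column (assumed 'Total Time With Sub') of an event row."""
    for line in text.splitlines():
        if event in line:
            columns = [c.strip() for c in line.strip().strip("|").split("|")]
            return float(columns[-1])
    raise ValueError(f"{event} row not found")

print(total_time_with_sub(log))  # 6.9884
```

That makes it easy to diff solve() times across the 1-core, 2-core, and 4-core logs without eyeballing the tables.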

That said, it doesn't help things at all.  solve() is 6.9884 on one processor and 6.3589 on two.  That's terrible.

I just checked... and on my laptop I get this:

../../../moose_test-opt -i simple_diffusion.i Mesh/uniform_refine=5
solve(): 3.8154 seconds

mpiexec -n 2 ../../../moose_test-opt -i simple_diffusion.i Mesh/uniform_refine=5
solve(): 1.9606 seconds

Which is essentially perfect (as it should be).

I have no idea what's going on there on EC2.  It simply doesn't make sense.

Derek

Dmitry Karpeyev

unread,
Jul 25, 2015, 5:03:24 PM
to Jean Francois Leon, moose-users

Any discussion of solver performance should be accompanied by the output of running with -log_summary on the command line. Also, a basic connection, even if it claims to be "10G" (whatever that means), isn't likely to lead to good solver performance. You really need to figure out how to get those instances connected by a (virtualized) InfiniBand interconnect.

Dmitry



Jean Francois Leon

unread,
Jul 27, 2015, 7:26:13 AM
to moose-users, kar...@mcs.anl.gov
Thanks for the input and the help!

I understand and agree with everything.

It is just a work in progress.
I did a few more tests by launching instances of higher quality: c4.xlarge, c4.2xlarge...
The non-obvious (to me) finding is that, while in theory those are based on the same hardware, the one-core test duration dramatically decreases on instances carrying more cores...
Based on some benchmarks I found on the web, the best numbers we could get on AWS (or Google Cloud, for that matter) are 30% slower than running the same test on the underlying bare metal [again, speaking of a single instance here] - this is the overhead of the virtualized environment and seems pretty standard.
When we get to clusters, it is another ballgame when it comes to optimizing performance... and a fast interconnect for efficient MPI message passing seems to be key [there is a line on this subject on the PETSc website].
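To put that ~30% figure in perspective, simple arithmetic (using the 1-core active time from the earlier test) gives the implied bare-metal time:

```python
# If the cloud run is ~30% slower than bare metal, estimate the bare-metal time.
cloud_active_time = 7.29408   # 1-core active time from the earlier c4.large test
overhead = 0.30               # cloud assumed ~30% slower than bare metal
bare_metal_estimate = cloud_active_time / (1 + overhead)
print(f"estimated bare-metal active time: {bare_metal_estimate:.2f} s")
```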

Derek: regarding your test results, could you share your laptop specs with me, please? (I might want to buy it :-)

I have a Dell M4700 with an i7-3840QM (@2.80GHz) running openSUSE 13.2 (a few years old, granted), and the one-core test gives me an active time of 22 sec.
With 2 cores it decreases to 11.5 sec;
with 4 cores: 3 seconds... much higher numbers than yours for one and 2 cores [I ran the same test as previously, with uniform_refine=5].

At that point I am going to take some time to do some "real work" and will keep you posted on this thread if anything significant happens on that front.
Despite these poor "absolute" performance numbers, MOOSE on an AWS cluster works and can be an appealing and cost-effective solution when one doesn't/can't invest in bare metal...


Cheers
JF

Dmitry Karpeyev

unread,
Jul 27, 2015, 7:29:30 AM
to Jean Francois Leon, moose-users
On Mon, Jul 27, 2015 at 1:26 PM Jean Francois Leon <j...@galtenco.com> wrote:
[...] when we get to clusters it is another ballgame when it comes to optimizing performance... and a fast interconnect for efficient MPI message passing seems to be key [there is a line on this subject on the PETSc website]
Yes, that's exactly why you need to make sure your instances are connected by a (virtualized) Infiniband (at least on AWS; I don't know exactly how Google does it). 

Derek Gaston

unread,
Jul 27, 2015, 6:37:10 PM
to moose...@googlegroups.com, kar...@mcs.anl.gov
Thanks for the update.  I kind of wondered if the non-scaling issue is due to oversubscription of EC2 resources... I guess more testing will eventually be necessary.

As for my laptop... that's easy.  Just choose the highest end stuff here: http://www.apple.com/shop/buy-mac/macbook-pro?product=MF841LL/A&step=config

You will never go back ;-)

Derek


Derek Gaston

unread,
Jul 27, 2015, 6:38:16 PM
to moose...@googlegroups.com, kar...@mcs.anl.gov
Whoops... wrong one.  Actually it's this one: http://www.apple.com/shop/buy-mac/macbook-pro?product=MJLT2LL/A&step=config

Sorry about that.

Derek