Performance of running multiple chains


larryh...@gmail.com

May 19, 2017, 2:42:33 PM
to QUESO-users mailing list
Hi,
I tried to run multiple chains for an inverse problem with this command:
mpirun -np 4 myexe  my.inp  
Compared with running only one chain, the run time increases almost linearly with the number of chains. This seems unreasonable for MPI, since "np" is only 4 here and I have far more than 4 processors.
I wondered whether the time is affected by writing the results to the output files: the more chains there are, the larger the "ip_raw_chain.m" file becomes, and I guess this file must be written sequentially. I checked the QUESO documentation and found this in "The Parallel C++ Statistical Library for Bayesian Inference: QUESO":
QUESO also provides the user with the option of writing each chain, handled by its corresponding processor, in a separate file, which is accomplished by setting the variable ip_mh_rawChain_dataOutputAllowedSet = 0 1 ... Ns-1.
After setting this, I still got "ip_raw_chain.m" along with the "ip_raw_chain_subX.m" files. For example, with ip_mh_rawChain_dataOutputAllowedSet = 0 1, I got these files:
<https://lh3.googleusercontent.com/-lmKciG1159o/WR87wmQJFdI/AAAAAAAAADM/TgSv0QNelWY4LWDfTWk7aNgWO-Lw1eSTACLcB/s1600/%25E5%25BE%25AE%25E4%25BF%25A1%25E6%2588%25AA%25E5%259B%25BE_20170519133838.png>

My questions are:
1. Is there any way to stop QUESO from generating "ip_raw_chain.m"?
2. Furthermore, can I stop QUESO from generating "ip_raw_chain_loglikelihood.m", "ip_raw_chain_loglikelihood_subX.m", "ip_raw_chain_logtarget.m" and "ip_raw_chain_logtarget_subX.m"?
3. Can QUESO generate only "ip_filt_chain" instead of "ip_raw_chain", since the former is much smaller?
4. Besides file writing, can you think of any other reason for the poor performance when running multiple chains?

Thanks!

Damon McDougall

May 22, 2017, 3:36:06 PM
to larryh...@gmail.com, QUESO-users mailing list



Just a couple of questions before I address the I/O:

1. How long does your likelihood evaluation take?
2. How long does a run with -np 1 take versus a run with -np 4?
3. How many samples are you writing to disk?

Now to answer your questions:

1. As far as I'm aware, no. Feel free to open a ticket requesting this
feature, though; you can do so here:
https://github.com/libqueso/queso/issues/new. I honestly believe it'd
be surprising to the majority of users if, by default, we omitted
writing the concatenated file whenever the user asked for only a subset
of processes to write output files. If you can think of a less
surprising behaviour, I'm happy to hear it. In your example, if you had
asked for 8 chains under that behaviour, only the first two processes
would write their chains, and you wouldn't be given the concatenated
chain. That is, you would be missing the chains from processes
2, 3, ..., 7.

2. Yes. You can set
ip_mh_outputLogLikelihood = 0
ip_mh_outputLogTarget = 0

3. Yes. You can set
ip_mh_rawChain_dataOutputFileName = .

but bear in mind you will probably need to set one of the earlier ones,
like

env_subDisplayFileName = outputData8/display

just so the output directory gets created.

4. You might consider writing out an HDF5 file instead (you can use
ip_mh_dataOutputFileType = h5). Writing a binary file is much quicker
than converting floating point numbers to ASCII first. As I test this,
however, we seem to have a bug writing multiple output files from
multiple chains in HDF5 format, so hold off on that for now. I'll open
a ticket for this.
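
The difference between ASCII and binary output mentioned in point 4 can be illustrated without QUESO or HDF5. The sketch below is plain Python, not QUESO code: it serialises a synthetic 10000 x 5 chain once as text and once as raw 8-byte doubles, then prints the size and time of each. The "%.16e" text format and the helper names are assumptions made for the illustration and are not necessarily what QUESO uses.

```python
import random
import struct
import time


def make_chain(n_samples=10000, n_params=5, seed=0):
    """Synthetic chain of Gaussian samples; stands in for real MCMC output."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_params)]
            for _ in range(n_samples)]


def to_ascii(chain):
    """Format every number as text, one sample per line (assumed format)."""
    lines = (" ".join("%.16e" % x for x in row) for row in chain)
    return ("\n".join(lines) + "\n").encode("ascii")


def to_binary(chain):
    """Pack every number as a raw little-endian 8-byte double."""
    flat = [x for row in chain for x in row]
    return struct.pack("<%dd" % len(flat), *flat)


def timed(fn, chain):
    """Run fn(chain) and return (payload, elapsed seconds)."""
    start = time.perf_counter()
    payload = fn(chain)
    return payload, time.perf_counter() - start


if __name__ == "__main__":
    chain = make_chain()
    for name, fn in (("ascii", to_ascii), ("binary", to_binary)):
        payload, seconds = timed(fn, chain)
        print("%-6s %8d bytes  %.4f s" % (name, len(payload), seconds))
```

The binary payload is exactly 8 bytes per number, while the text payload depends on the chosen format. Timings vary by machine, so none are quoted here.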

If any of this behaviour isn't acceptable, let us know. We're always
trying to improve the usability of QUESO so if you think something can
be improved, please do feel free to open a ticket on our GitHub page and
we can discuss it further.

--
Damon McDougall
http://dmcdougall.co.uk
Institute for Computational Engineering Sciences
201 E. 24th St., Stop C0200
The University of Texas at Austin
Austin, TX 78712-1229

larryh...@gmail.com

May 22, 2017, 11:19:05 PM
to QUESO-users mailing list
Hi,

Thank you for your detailed reply.
My answers to your three questions:
1. I have not timed the likelihood evaluation.
2. In my current tests, np = 1 took about 9 minutes and np = 4 took more than 40 minutes (measured with my own timer).
3. There are 5 unknown parameters and the chain length is 10000, so each raw chain holds 5*10000 floating point numbers, which takes about 1 MB. Do you think this is the reason for the bad performance?
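
As a rough check on the size estimate in item 3, the arithmetic can be written out. The bytes-per-number values are assumptions (about 20 characters per number in an ASCII .m file, 8 bytes for a binary double), not figures taken from QUESO, and the function name is made up for this sketch.

```python
def chain_size_bytes(n_params, chain_length, bytes_per_number):
    """Approximate storage for one chain: one number per parameter per sample."""
    return n_params * chain_length * bytes_per_number


N_PARAMS = 5          # from the post
CHAIN_LENGTH = 10000  # from the post
ASCII_BYTES = 20      # assumption: characters per number in a text file
BINARY_BYTES = 8      # size of an IEEE 754 double

if __name__ == "__main__":
    ascii_size = chain_size_bytes(N_PARAMS, CHAIN_LENGTH, ASCII_BYTES)
    binary_size = chain_size_bytes(N_PARAMS, CHAIN_LENGTH, BINARY_BYTES)
    print("ASCII : %d bytes (%.1f MB)" % (ascii_size, ascii_size / 1e6))
    print("binary: %d bytes (%.1f MB)" % (binary_size, binary_size / 1e6))
```

Under these assumptions one raw chain is about 1 MB as text and 0.4 MB as binary, which is consistent with the estimate in item 3.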


You may have misunderstood my point about writing output data. My point is that, since the raw chain data can be written to separate output files (ip_raw_chain_subX), there is no need to also write them to the single file "ip_raw_chain". I raised this because I think writing all chain data to a single file has to be done sequentially, whereas writing to separate files can be done in parallel by each processor (please correct me if I am wrong). I think it would be more reasonable to let the user choose between 1. writing data only to a single file and 2. writing data only to separate files, but currently I don't see how to do option 2. In my application, I plan to run inversions for different points simultaneously (think of seismic inversion at different positions). In that case I need the output chain of each point separately, not a concatenated chain, although I can split it manually.
Also, in my example I ran only 2 chains; sorry I didn't make that clear.

Thank you again!
Best regards,
Han

larryh...@gmail.com

May 24, 2017, 1:44:26 PM
to QUESO-users mailing list
Hi Damon,

After several tests, I got an interesting result.
As I said before, when I ran multiple chains the performance was very poor.
I run my program on a cluster with 4 nodes; each node has 32 cores (8 processors x 4 cores). Generally speaking, if I use only one node, I should see good performance with np <= 32. Even if each process used a whole processor, performance should be fine for np <= 4. In my experiments, however, the run took more than 1 hour with np = 4 (with np = 1 it took 10 minutes).
Then I tried running 4 chains on 4 nodes by using --host host1,host2,host3... Surprisingly, the elapsed time was almost the same as for one chain. It seems that each process was using a whole node, not a core or processor.
Searching online, I found the mpirun option "--bind-to-core", which forces each MPI process to use one core. I tried 4 processes with "--bind-to-core", and the result was good.
It turns out that the earlier bad performance was due to improper settings; I had always thought that MPI allocates one core to each process automatically. So I guess the problem is solved now.
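
A simple model, offered as an assumption rather than a measurement, shows why the run time could grow almost linearly with np: if independent chains of equal cost have to share fewer cores than there are processes, each one only gets a fraction of a core. The function name and the model are illustrative only and ignore I/O, communication and scheduling overhead.

```python
def estimated_wall_minutes(single_chain_minutes, n_procs, cores_in_use):
    """Wall time when n_procs equal, independent chains share cores_in_use cores.

    Fair-share model: no slowdown while every process has its own core,
    otherwise the time scales with the number of processes per core.
    """
    if n_procs < 1 or cores_in_use < 1:
        raise ValueError("n_procs and cores_in_use must be positive")
    slowdown = max(1.0, n_procs / cores_in_use)
    return single_chain_minutes * slowdown


if __name__ == "__main__":
    # 10 minutes for one chain is the figure reported in this post.
    print("4 processes on 1 core :", estimated_wall_minutes(10, 4, 1), "min")
    print("4 processes on 4 cores:", estimated_wall_minutes(10, 4, 4), "min")
```

With all four processes on one core the model gives 40 minutes, against 10 minutes when each process has its own core. The measured time of more than an hour is higher still, so contention overhead may add to the simple fair-share estimate.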
Thank you very much for your help!

Best regards,
Han 

Damon McDougall

May 24, 2017, 10:38:54 PM
to queso...@googlegroups.com
Hi Han,

I was trying to locally recreate your runtime issue and I couldn't, and
now I understand why.

What you're experiencing is common and you'll slowly learn that MPI
stacks all behave differently. Intel's MPI and MVAPICH and OpenMPI all
have different default behaviour when it comes to thread and process
affinity.
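
One way to see what a given MPI stack does by default is to have each process report which CPUs it is allowed to run on. The sketch below is a small standalone Python script, not part of QUESO. It assumes a platform where os.sched_getaffinity exists (for example Linux) and reads the rank from the OMPI_COMM_WORLD_RANK or PMI_RANK environment variables, which some launchers set; other launchers may use different names.

```python
import os
import socket


def allowed_cpus():
    """Return the sorted CPU ids this process may run on, or None if unknown.

    os.sched_getaffinity is only available on some platforms (e.g. Linux).
    """
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return None


def rank_label(environ=os.environ):
    """Best-effort MPI rank taken from launcher environment variables.

    The variable names checked here are assumptions about the launcher.
    """
    for name in ("OMPI_COMM_WORLD_RANK", "PMI_RANK"):
        if name in environ:
            return environ[name]
    return "unknown"


if __name__ == "__main__":
    print("rank %s on %s: allowed CPUs %s"
          % (rank_label(), socket.gethostname(), allowed_cpus()))
```

Launched under mpirun in place of the real executable (for example as "mpirun -np 4 python check_affinity.py", where the file name is hypothetical), the output shows whether several ranks are restricted to the same CPU, and how that changes when a binding option is added.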

I recently co-organised a conference where Jerome Vienne (TACC) gave a
talk about MPI tuning, and one of the topics was exactly this. I highly
recommend reading his slides:
https://drive.google.com/drive/folders/0B9iMLUCVC_INZk5uVVNCUHI5MXc