Optimal number of CPUs


Alex Harris

Oct 12, 2021, 10:04:37 PM
to or-tools-discuss
Hello,

I am trying to figure out the optimal number of CPUs for the type of problem I am solving.
Based on my (somewhat limited) observations, increasing the number of CPUs up to 8 gives an almost linear gain in performance, which is great. I simply set the number of workers equal to the number of CPUs.
However, when I switch from 8 to 16 CPUs I don't get any gain in performance.

Is this expected?
Does this depend on the OR-Tools version? (I am using a build from "master" that is a couple of months old.)
Can I configure the solver differently, or change the problem, in a way that gets performance gains with > 8 CPUs? (BTW the problem is already decomposed into the smallest subproblems possible, and they are solved in parallel.)
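For illustration, the decomposition described above can be sketched like this; `solve_subproblem` is a hypothetical stand-in for whatever solver call each decomposed piece actually uses:

```python
from concurrent.futures import ThreadPoolExecutor

def solve_subproblem(sub):
    # Hypothetical stand-in: in the real setup each call would invoke
    # the solver on one decomposed piece of the main problem.
    return sum(sub)

subproblems = [[1, 2], [3, 4], [5, 6]]

# Solve the decomposed pieces in parallel, roughly one worker per CPU.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(solve_subproblem, subproblems))
```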

Thanks!

Laurent Perron

Oct 12, 2021, 11:59:07 PM
to or-tools-discuss
Which solver?


Alex Harris

Oct 13, 2021, 12:49:06 AM
to or-tools-discuss
CP-SAT

Laurent Perron

Oct 13, 2021, 1:43:07 AM
to or-tools-discuss
So:

The solver uses a portfolio approach.
The gain from parallelism is not linear :-)

Until 6-12 months ago, we tuned the solver for 8 threads. That is the minimum for good performance.

More recently, we have started tuning for 16 threads, so we recommend users do the same if they have the right hardware.

Alex Harris

Oct 13, 2021, 2:29:35 AM
to or-tools-discuss
Thanks Laurent!

Currently I use a Docker image built around 20 July 2021 from "master" (a snapshot between 9.0 and 9.1), i.e. it is not an official release of OR-Tools.
I see there is a newer official version, 9.1, released 12 days ago. I will switch to 9.1 and hopefully it will bring extra performance gains when using 16 CPUs.

Bruno De Backer

Oct 13, 2021, 2:29:53 AM
to or-tools...@googlegroups.com
From what I understand, you have decomposed your main problem into many small, quickly solvable problems, right? 

Do you have real cores, or are you using hardware multithreading/SMT/hyperthreading (all names for the same thing)? I.e., do you have an 8-core/16-thread machine or a 16-core machine?

I'm asking because:
1) As a rule of thumb, using SMT (i.e. running two streams of instructions on a single core simultaneously) gives you a 1.5x gain in performance instead of the expected 2x. Your mileage will vary depending on memory accesses, the mix of instructions in the different threads, whether they interfere with each other in the processor pipeline, etc. You should see only 1x in rare cases, but I've personally never witnessed anything approaching 2x.
2) If you have two processors (i.e. two sockets on the motherboard), it's a different story again: each processor manages its own memory and can share it with the other processor, but at a huge cost, in the ballpark of 300 to 400 cycles. If your processes consume a lot of memory, you may end up with your processors exchanging data slowly and at a high cost.

If my premise is correct, you could consider using a second machine (or more), or trying a true 16-core machine, to see how it behaves. There will be memory contention at some point, but if you have one of the latest workstation or server designs, there should be ample cache, and this should probably work.
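The SMT rule of thumb above can be put into numbers; the 1.5x factor is an estimate from experience, not a guarantee:

```python
def effective_throughput(physical_cores, smt_gain=1.5):
    # Rule of thumb: two hardware threads on one core deliver roughly
    # 1.5x the throughput of a single thread, not the naive 2x.
    return physical_cores * smt_gain

# An "8-core/16-thread" machine behaves more like 12 cores than 16.
print(effective_throughput(8))  # 12.0
```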

I hope this answers the question from the right angle. It's not solver-related but hardware-related. 

Happy to follow up if you have more questions.

BdB

Alex Harris

Oct 13, 2021, 3:38:14 AM
to or-tools-discuss
Thank you Bruno!

I test the performance on AWS, simply instantiating different "instance types" with different hardware specs. AWS uses the notion of virtual CPUs (vCPUs).
For example:
 - instance type "c5d.2xlarge" (8 vCPUs) with 8 OR-Tools workers takes 640 seconds to solve my test case (the same instance type with fewer OR-Tools workers performs worse)
 - instance type "c5d.4xlarge" (16 vCPUs) with 16 OR-Tools workers takes 657 seconds to solve my test case

Based on these numbers it looks like throwing more powerful hardware (more vCPUs/memory) at the problem won't give me better results beyond 8 vCPUs unless OR-Tools somehow takes advantage of the extra CPUs.
It might be that my problem (or at least the way I model it) is stopping me from taking advantage of more CPUs.

I don't expect the gains to be linear. In this particular test case, going from my laptop (without specifying the number of OR-Tools workers, i.e. I assume one is created by default in the version I currently use) to a "c5d.2xlarge" with 8 OR-Tools workers improved the solve time from 4373 seconds (laptop) to 640 seconds (c5d.2xlarge), a 6.8x speedup. That's why I said the gain was linear; it's not a very fair comparison, though, and it would be naive to expect that in general.
The OR-Tools team knows best how to select complementary strategies and distribute them amongst workers for specific types of problems, which is what pushed me to ask this question.
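The timings reported in this thread work out as follows:

```python
# Solve times in seconds, as reported above.
laptop, c5d_2xlarge, c5d_4xlarge = 4373, 640, 657

speedup_to_8 = laptop / c5d_2xlarge        # laptop -> 8 workers
speedup_8_to_16 = c5d_2xlarge / c5d_4xlarge  # 8 -> 16 workers

print(round(speedup_to_8, 1))    # 6.8
print(round(speedup_8_to_16, 2))  # 0.97, i.e. slightly slower
```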