
How to make a foreign function run as fast as possible in Windows?


jf...@ms4.hinet.net
Sep 26, 2016, 9:49:08 PM
This function is in a DLL. It's small but may run for days before completing. I want it to use 100% of a core. Threading doesn't seem like a good idea, since the threads would share one core. Will the multiprocessing module do it? Any suggestions?

Thanks ahead.

--Jach

eryk sun
Sep 26, 2016, 11:44:49 PM
On Tue, Sep 27, 2016 at 1:48 AM, <jf...@ms4.hinet.net> wrote:
> This function is in a DLL. It's small but may run for days before completing. I want it
> to use 100% of a core. Threading doesn't seem like a good idea, since the threads
> would share one core. Will the multiprocessing module do it?

The threads of a process do not share a single core. The OS schedules
threads to distribute the load across all cores. However, CPython's
global interpreter lock (GIL) does serialize access to the
interpreter. If N threads want to use the interpreter, then N-1
threads are blocked while waiting to acquire the GIL.

A thread that makes a potentially blocking call to a non-Python API
should first release the GIL, which allows another thread to use the
interpreter. Calling a ctypes function pointer releases the GIL if the
function pointer is from CDLL, WinDLL, or OleDLL (i.e. anything but
PyDLL).
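
For instance, a minimal sketch (the DLL name "work.dll" and the export "crunch" are made up for illustration):

import ctypes

# Hypothetical DLL and export name, for illustration only.
# CDLL (like WinDLL and OleDLL) releases the GIL around each call,
# so other Python threads can run while the native code executes.
lib = ctypes.CDLL("./work.dll")
lib.crunch.argtypes = (ctypes.c_uint64, ctypes.c_uint64)
lib.crunch.restype = ctypes.c_uint64

# By contrast, ctypes.PyDLL("./work.dll") keeps the GIL held for the
# entire call; it's only meant for functions that use the Python C API.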

If your task can be partitioned and executed in parallel, you could
use a ThreadPoolExecutor from the concurrent.futures module. Since the
task is CPU bound, use os.cpu_count() instead of the default number of
threads.

https://docs.python.org/3/library/concurrent.futures
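
A rough sketch of that partitioning, assuming the hypothetical lib.crunch(start, stop) above is thread-safe and computes an independent chunk of the work:

import os
from concurrent.futures import ThreadPoolExecutor

def run_parallel(total):
    # One thread per logical CPU; the GIL is released during each
    # DLL call, so the threads can genuinely run in parallel.
    workers = os.cpu_count()
    step = total // workers
    ranges = [(i * step, (i + 1) * step) for i in range(workers)]
    ranges[-1] = (ranges[-1][0], total)  # last chunk takes the remainder
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda r: lib.crunch(*r), ranges)
    return sum(results)  # merging is task-specific; sum() is a placeholder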

jf...@ms4.hinet.net
Sep 27, 2016, 10:14:07 PM
eryk sun at 2016/9/27 11:44:49AM wrote:
> The threads of a process do not share a single core. The OS schedules
> threads to distribute the load across all cores....

Hmmm... your answer completely overturns what I knew about Python threads :-( I had actually considered using ProcessPoolExecutor to do it.

If the load is distributed by the OS scheduler across all cores, does that mean I can't make one core run a piece of code solely for me, and so I have no control over its performance?

Gene Heskett
Sep 28, 2016, 12:10:35 AM
Someone in the know would have to elaborate on whether Python knows
about, or can cooperate with, the boot-time parameter 'isolcpus='.

Cheers, Gene Heskett
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

eryk sun
Sep 28, 2016, 1:05:32 AM
On Wed, Sep 28, 2016 at 2:13 AM, <jf...@ms4.hinet.net> wrote:
> If the load is distributed by the OS scheduler across all cores,
> does that mean I can't make one core run a piece of code solely
> for me, and so I have no control over its performance?

In Unix, Python's os module may have sched_setaffinity() to set the
CPU affinity for all threads in a given process.

In Windows, you can use ctypes to call SetProcessAffinityMask,
SetThreadAffinityMask, or SetThreadIdealProcessor (a hint for the
scheduler). On a NUMA system you can call GetNumaNodeProcessorMask(Ex)
to get the mask of CPUs that are on a given NUMA node. The cmd shell's
"start" command supports "/numa" and "/affinity" options, which can be
combined.
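
A hedged sketch of both approaches, pinning everything to CPU 0 purely for illustration:

import ctypes
import os
import sys

if sys.platform == "win32":
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # Pin all threads of the current process to CPU 0 (bit 0 of the mask).
    if not kernel32.SetProcessAffinityMask(kernel32.GetCurrentProcess(), 1):
        raise ctypes.WinError(ctypes.get_last_error())
elif hasattr(os, "sched_setaffinity"):
    os.sched_setaffinity(0, {0})  # pid 0 means the calling process

From cmd, `start /affinity 1 python script.py` does the same thing without touching the code.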

alister
Sep 28, 2016, 5:43:12 AM
This would be implementation specific, not part of the language
specification.



--
Ryan's Law:
Make three correct guesses consecutively
and you will establish yourself as an expert.

Paul Moore
Sep 28, 2016, 11:31:50 AM
On Tuesday, 27 September 2016 02:49:08 UTC+1, jf...@ms4.hinet.net wrote:
> This function is in a DLL. It's small but may run for days before completing. I want it to use 100% of a core. Threading doesn't seem like a good idea, since the threads would share one core. Will the multiprocessing module do it? Any suggestions?

Taking a step back from the more detailed answers, would I be right to assume that you want to call this external function multiple times from Python, and each call could take days to run? Or is it that you have lots of calls to make and each one takes a small amount of time but the total time for all the calls is in days?

And furthermore, can I assume that the external function is *not* written to take advantage of multiple CPUs, so that if you call the function once, it's only using one of the CPUs you have? Is it fully utilising a single CPU, or is it actually not CPU-bound for a single call?

To give specific suggestions, we really need to know a bit more about your issue.

Paul

jf...@ms4.hinet.net
Sep 28, 2016, 9:07:58 PM
eryk sun at 2016/9/28 1:05:32PM wrote:
> In Unix, Python's os module may have sched_setaffinity() to set the
> CPU affinity for all threads in a given process.
>
> In Windows, you can use ctypes to call SetProcessAffinityMask,
> SetThreadAffinityMask, or SetThreadIdealProcessor (a hint for the
> scheduler). On a NUMA system you can call GetNumaNodeProcessorMask(Ex)
> to get the mask of CPUs that are on a given NUMA node. The cmd shell's
> "start" command supports "/numa" and "/affinity" options, which can be
> combined.

Seems I have to dive into Windows to understand how to use these :-)

jf...@ms4.hinet.net
Sep 28, 2016, 9:23:13 PM
Paul Moore at 2016/9/28 11:31:50PM wrote:
> Taking a step back from the more detailed answers, would I be right to assume that you want to call this external function multiple times from Python, and each call could take days to run? Or is it that you have lots of calls to make and each one takes a small amount of time but the total time for all the calls is in days?
>
> And furthermore, can I assume that the external function is *not* written to take advantage of multiple CPUs, so that if you call the function once, it's only using one of the CPUs you have? Is it fully utilising a single CPU, or is it actually not CPU-bound for a single call?
>
> To give specific suggestions, we really need to know a bit more about your issue.

Forgive me, I didn't realize these details would influence the answer :-)

Python will call it once. The central part of this function was written in assembly for performance. During execution, this part might be called billions of times. The function was written to run on a single CPU, but the problem it solves can easily be distributed across multiple CPUs.

--Jach

Paul Moore
Sep 30, 2016, 7:07:35 AM
OK. So if your Python code only calls the function once, the problem needs to be fixed in the external code (the assembly routine). But if you can split the task up at the Python level, making multiple calls to the function, each handling a part of the task, then you could set up multiple threads in your Python code and have Python merge the results of the sub-parts to give you the final answer. Does that make sense? Without any explicit code, it's hard to be sure I'm explaining myself clearly.

Paul

jf...@ms4.hinet.net
Sep 30, 2016, 9:27:46 PM
Paul Moore at 2016/9/30 7:07:35PM wrote:
> OK. So if your Python code only calls the function once, the problem needs to be fixed in the external code (the assembly routine). But if you can split the task up at the Python level, making multiple calls to the function, each handling a part of the task, then you could set up multiple threads in your Python code and have Python merge the results of the sub-parts to give you the final answer. Does that make sense? Without any explicit code, it's hard to be sure I'm explaining myself clearly.
>

That's what I will do later: split the task across multiple cores by passing a range parameter (such as 0~14, 15~29, ...) to each instance. Right now, I have just embedded the range in the function to keep testing simple.

At the moment my interest is in how to make it run at 100% core usage. Windows Task Manager shows this function using only ~70%, and the number varies during execution, sometimes even dropping to 50%.

I also tested a batch file (copied from another discussion forum):

@echo off
:loop
goto loop

It takes ~85% usage, and the number is stable.

The results are obviously different. My question is how to control this :-)

--Jach

Chris Angelico
Sep 30, 2016, 11:25:03 PM
On Sat, Oct 1, 2016 at 11:27 AM, <jf...@ms4.hinet.net> wrote:
> At the moment my interest is in how to make it run at 100% core usage. Windows Task Manager shows this function using only ~70%, and the number varies during execution, sometimes even dropping to 50%.
>

What's it doing? Not every task can saturate the CPU - sometimes they
need the disk or network more.

ChrisA

jf...@ms4.hinet.net
Oct 1, 2016, 5:34:47 AM
Chris Angelico at 2016/10/1 11:25:03AM wrote:
> What's it doing? Not every task can saturate the CPU - sometimes they
> need the disk or network more.
>
This function has no I/O or similar activity, just pure data processing, and it needs less than 200 bytes of working memory.

My CPU is an i3 (2 cores/4 threads). I suppose that once this job is assigned to run on a particular core, the OS shouldn't bother that core with other system-related tasks anymore. If it does, wouldn't that be a stupid OS design?

I am puzzled: why doesn't the core reach 100% usage? Why is there always some percentage of idle?

--Jach
