RPC simulation communication overhead

Bence Balázs Mészáros

Jun 16, 2025, 10:39:31 AM
to ipi-users
Dear Developers,

I am aiming to run RPC (ring-polymer contraction) simulations in a multiple-time-step setup with 32 beads using i-PI 3.0. The forcefield I use for the beads is really fast (<1 ms/step). Ideally, when calculating forces only on the beads and running 32 clients for the "cheap" forcefield in parallel, one would expect a timing of ~1 ms/step. However, the socket communication overhead (~1 ms/bead) adds up over the beads, so I get around ~30 ms/step (which is the normal behavior in i-PI, if I understand correctly). How could I circumvent this overhead? Here are the things I have already tried, which didn't help:

- Reducing the latency in ffsocket to 1e-4 (see the sketch after this list)
- Reducing the output printing of the simulation to minimize I/O
- Using a Python implementation of the "cheap" forcefield through ffdirect to avoid socket communication (both with threaded="True" and threaded="False"). For this, I only ran the i-PI server (i-pi config.xml) and did not launch any driver afterwards. Comparing timings with different numbers of beads, I concluded that there is no parallel evaluation in a multiple-replica scenario; each replica was evaluated serially.
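
For reference, this is the kind of change I mean, sketched on a stripped-down ffsocket block (the forcefield name and address below are placeholders, and the exact tags should be double-checked against the i-PI 3.0 input reference):

    <ffsocket name='cheap'>
      <address> cheap_driver </address>
      <latency> 1e-4 </latency>
    </ffsocket>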

Is there a way to use ffdirect in parallel for many beads, or any other way to evaluate forces without accumulating communication overhead?

Thanks for the help.

Best regards,
Bence Mészáros

Michele Ceriotti

Jun 16, 2025, 11:31:31 AM
to ipi-users
Are you using an inet or a unix socket? The latter will be much faster, and will work as long as you run on a single node.
Unfortunately ffdirect cannot parallelize because of limitations of Python (the global interpreter lock).
Also keep in mind that integrating PIMD in general has some overhead: lots of Fourier transforms.
Overall, 30 ms/step will still give you more than 1 ns/day, not too bad for 32-bead PIMD.
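
For illustration, the switch is essentially the mode attribute and the address/port fields of the ffsocket block (names and the port number below are placeholders; check the input reference for your version):

    <!-- inet: goes through the TCP/IP stack even when client and server are on the same node -->
    <ffsocket name='cheap' mode='inet'>
      <address> localhost </address>
      <port> 31415 </port>
    </ffsocket>

    <!-- unix: a local domain socket, usually with noticeably lower per-message latency -->
    <ffsocket name='cheap' mode='unix'>
      <address> cheap_driver </address>
    </ffsocket>

With mode='unix' the drivers then connect to a local socket file instead of a TCP port.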

Bence Balázs Mészáros

Jun 17, 2025, 11:17:28 AM
to ipi-users
Dear Michele,

As an example, I ran PIMD with 32 beads using this cheap forcefield and 32 ffsocket clients, resulting in a timing of 35 ms/step. My problem is that even when I use 8 cores, my CPU usage is only 17%; if I increase the number of cores, the timing does not improve and the CPU efficiency drops proportionally. Based on this, it is not the Fourier transforms that take the time, but rather the different ffsocket clients waiting in a queue because of the communication overhead, which leaves the CPUs idle most of the time.
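
As a rough consistency check (assuming ~1 ms of communication overhead per bead, as in my first message):

    32 beads x ~1 ms/bead ≈ 32 ms/step, close to the observed 35 ms/step
    ~1 busy core out of 8 ≈ 12% utilization, close to the observed 17%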

Best regards,
Bence Mészáros

Michele Ceriotti

Jun 17, 2025, 3:12:20 PM
to ipi-users
Again, are you using inet or unix domain sockets? 35 ms/step is on the high side but not unreasonable, depending on how many atoms you have.
And again, it should be fast enough to run more than 1 ns/day with 0.5 fs/step, which should be enough for most applications where a cheapo FF makes sense and NQEs are needed. If you want to dig deeper, it would be best if you opened an issue on GitHub and provided an MWE.
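
For reference, the throughput estimate follows directly from the numbers quoted above:

    86400 s/day / 0.035 s/step ≈ 2.5 million steps/day
    2.5 million steps/day x 0.5 fs/step ≈ 1.2 ns/day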

Bence Balázs Mészáros

Jun 17, 2025, 4:20:25 PM
to ipi-users
Sorry, I forgot to mention: I'm using unix sockets, and I have 768 atoms. Not that 1 ns/day is bad performance; I just want to push the limits further and exploit the CPU resources as much as I can.