Handling Connection Loss in SpiNNaker Simulations


Ahmad Waseem

Mar 19, 2025, 2:30:37 PM
to SpiNNaker Users Group
Hi, 

I’m working on a SpiNNaker-based reinforcement learning simulation, and I occasionally lose the connection to the SpiNNaker server during long-running simulations (more than 50 episodes). This disrupts training and loses progress.

I have shared the log at the end of the email.

I can implement a checkpointing mechanism that periodically saves the network weights, which should in theory let me reload the last saved weights and resume training in a new session. However, I’d like to know whether there is a way around this, and whether SpiNNaker has best practices or built-in features for handling connection loss more gracefully.
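
A minimal sketch of the checkpointing approach described above, in Python using only the standard library. It assumes the weights have already been read back from the machine as a plain Python object; `save_checkpoint`, `load_checkpoint`, `train`, and the `run_episode` callable are hypothetical names for illustration, not part of any SpiNNaker API.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, episode, weights):
    """Atomically write the episode index and weights to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so an interruption mid-write
    # cannot corrupt the previous checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump({"episode": episode, "weights": weights}, handle)
        os.replace(tmp_path, path)  # atomic when on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return (next_episode, weights), or (0, None) if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, None
    with open(path, "rb") as handle:
        state = pickle.load(handle)
    return state["episode"] + 1, state["weights"]


def train(run_episode, total_episodes, path, every=5):
    """Run episodes, resuming from `path` and saving every `every` episodes.

    `run_episode(episode, weights)` is a hypothetical callable that runs one
    episode and returns the updated weights.
    """
    start, weights = load_checkpoint(path)
    for episode in range(start, total_episodes):
        weights = run_episode(episode, weights)
        if (episode + 1) % every == 0 or episode + 1 == total_episodes:
            save_checkpoint(path, episode, weights)
    return weights
```

Restarting the script after a dropped connection would then resume from the episode after the last saved checkpoint, at the cost of repeating any episodes run since that save.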

Specifically, I’d like to ask:
  1. Are there recommended strategies to maintain a stable connection to the SpiNNaker server during long simulations?
  2. Does SpiNNaker provide any built-in support for checkpointing or saving simulation state?
  3. Are there any tools or configurations to automatically recover from a lost connection?
Example Log 1
============================================================
2025-03-07 06:55:39 INFO: ** Awaiting for a response from an external source to state its ready for the simulation to start **
2025-03-07 06:55:39 INFO: ** Sending start / resume message to external sources to state the simulation has started or resumed. **
2025-03-07 06:55:39 INFO: ** Awaiting for a response from an external source to state its ready for the simulation to start **
2025-03-07 06:55:39 INFO: Application started; waiting 0.101s for it to stop
2025-03-07 06:55:39 INFO: ** Sending pause / stop message to external sources to state the simulation has been paused or stopped. **
2025-03-07 06:55:39 INFO: Time 0:00:00.207826 taken by ApplicationRunner
Extracting IOBUF from the machine
|0% 50% 100%|
==============================Reconnected to spalloc server successfully.
2025-03-11 11:03:20 INFO: Reconnected to spalloc server successfully.

Example Log 2
2025-03-19 06:20:00 INFO: Time 0:00:00.046976 taken by ChipRuntimeUpdater
2025-03-19 06:20:00 INFO: *** Running simulation... ***
Loading buffers
|0% 50% 100%|
============================================================
2025-03-19 06:20:00 INFO: ** Awaiting for a response from an external source to state its ready for the simulation to start **
2025-03-19 06:20:00 INFO: ** Sending start / resume message to external sources to state the simulation has started or resumed. **
2025-03-19 06:20:00 INFO: ** Awaiting for a response from an external source to state its ready for the simulation to start **
2025-03-19 06:20:00 INFO: Application started; waiting 0.11s for it to stop
2025-03-19 06:20:01 INFO: ** Sending pause / stop message to external sources to state the simulation has been paused or stopped. **
2025-03-19 06:20:01 INFO: Time 0:00:00.218819 taken by ApplicationRunner
Extracting IOBUF from the machine
|0% 50% 100%|
==============================

Thank you,
Ahmad

Andrew Rowley

Mar 20, 2025, 11:04:53 AM
to Ahmad Waseem, SpiNNaker Users Group
Hi,

This may have been caused by a couple of server restarts unfortunately. We are in the process of doing some updates, and although these shouldn't upset jobs, it can happen! Sorry I didn't let you know though... I think most of these are done now so hopefully it won't be a problem again.

Andrew :)

Ahmad Waseem

Mar 27, 2025, 9:20:11 AM
to SpiNNaker Users Group
Hi Andrew, 

I tried running my program again and am encountering the same issue. Do you have any suggestions?

Andrew Rowley

Apr 2, 2025, 10:54:50 AM
to Ahmad Waseem, SpiNNaker Users Group
Hi,

One possibility is that the Jupyter notebook is timing out when not in use; when you reconnect, it has to re-establish the connection, which loses the simulation. Is this happening specifically on spinn-20.cs.man.ac.uk or via EBRAINS Lab? You could try running something on a newer server we are experimenting with, at https://sands.cs.man.ac.uk/. This server deletes all local files and stops all running notebooks when you log out, except for files stored within the work or EBRAINS drive. However, it doesn't have a timeout that I know of, so if you don't log out, things should keep running.

Let me know if that helps, or if you still have the problems even then.

Thanks,

Andrew :)
