[euler-users] Addressing recent outages and how we will move forward

Colin Vanden Heuvel

Nov 14, 2025, 6:27:03 PM
to 'Colin Vanden Heuvel' via euler-users
Hello Euler Users,

As many of you have already noticed, a filesystem driver bug has been causing Euler nodes to periodically become unresponsive for about 8 weeks now. Moreover, in recent weeks, the frequency of deadlocks has increased due to increased cluster usage, and it's affecting some users as often as multiple times per day.

There are no available workarounds for this issue other than to reset unresponsive systems as soon as they are reported. To make matters more complicated, we at CAE were just last week made aware that the vendor has chosen not to fix the problem and has taken the extraordinary measure of dropping support for the filesystem entirely in their most recent patch release. As such, the only way out for us is to completely redeploy Euler using an OS which is still providing support for it.

A migration like this would normally require months of preparation. My colleagues and I have spent countless hours identifying and exploring fixes for the problem, and some of the preliminary work for a platform migration is already done. However, there is a lot left to do, and we need to squeeze a lot of testing into a relatively short period of time if we want to have it resolved as soon as possible. The best way to make sure that we cover the things you, our end users, care about is to ask you to help us by testing your software in advance of the full deployment. Bear in mind that the filesystem locking issues won't go away until ALL systems are converted to the new platform. During the beta test, we are looking for any other problems which might be an obstacle to using this platform.


This week, our primary focus was on deploying the infrastructure needed to migrate the cluster. While most of that runs on the backend, there is now one major thing that needs testing: a new login node with new versions of most software. If you have some time this weekend or in the coming week, please visit euler-login-3.cae.wisc.edu and test building your software packages. Report any issues to euler-...@engr.wisc.edu and we will add them to the list of bugs to be repaired in the upcoming beta test. 

Our next milestone is to have some compute nodes deployed on the new platform and start allowing job scheduling. We expect to have a limited set of hardware prepared by the end of the coming week (Nov. 21st).

PLEASE NOTE: Until next week Tuesday (Nov 18) or Wednesday (Nov 19), euler-login-3 will only be accessible from systems with an IPv6 network connection. That capability is available on most CAE-managed systems, such as our Linux labs. A number of home internet providers also support this with a correctly-configured router, and it is supported by default on 5G mobile connections as well. If you can't access it right away, don't worry; we'll send out updated instructions that should work for everyone else next Wednesday.
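For anyone unsure whether their current network can reach the new login node, the following is a minimal Python sketch of an IPv6 reachability check. It is not an official CAE tool. The port (22, the usual SSH default), the timeout, and the function names are assumptions made for illustration.

```python
import socket


def ipv6_addresses(host, port=22, resolver=socket.getaddrinfo):
    """Return the unique IPv6 addresses the resolver reports for host."""
    try:
        infos = resolver(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # no IPv6 record found, or name resolution failed
    addresses = []
    for info in infos:
        addr = info[4][0]  # sockaddr is (address, port, flowinfo, scope_id)
        if addr not in addresses:
            addresses.append(addr)
    return addresses


def can_reach_over_ipv6(host, port=22, timeout=5.0, resolver=socket.getaddrinfo):
    """Try a TCP connection to each IPv6 address; True if any one succeeds."""
    for addr in ipv6_addresses(host, port, resolver):
        try:
            with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as sock:
                sock.settimeout(timeout)
                sock.connect((addr, port))
                return True
        except OSError:
            continue  # no IPv6 route, refused, or timed out; try the next one
    return False
```

Calling `can_reach_over_ipv6("euler-login-3.cae.wisc.edu")` returns `True` only if a TCP connection over IPv6 succeeds from the machine running it. A `False` result most likely means the local network has no working IPv6 route, in which case waiting for the updated instructions is the simplest option.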

Regards,
Colin Vanden Heuvel

Colin Vanden Heuvel

Nov 19, 2025, 6:23:15 PM
to 'Colin Vanden Heuvel' via euler-users
Hello again,

The beta test cluster is now available via SSH at euler-beta.engr.wisc.edu; any SSH configuration that would normally work with Euler should also work with that address. A couple of compute nodes should be available to run jobs by the end of the week, as planned.

Regards,
Colin

P.S. I'd like to give special thanks to the one very diligent user who has already logged in to start testing things; you know who you are and I really appreciate it!



Colin Vanden Heuvel

Nov 24, 2025, 1:32:46 PM
to 'Colin Vanden Heuvel' via euler-users

The following message was originally sent on Friday, November 21st, 2025, but it was not processed correctly by the mailing list software.


Greetings, Euler users,

One compute node is now available on the beta cluster for testing non-GPU jobs. Depending on the outcome of some pending tests, a node with GPU support may become available tomorrow. We expect 5 or more nodes to be available by the start of the Thanksgiving holiday; if you'd like your hardware to be considered for inclusion in the test at this time, please contact euler-...@engr.wisc.edu.

Please keep jobs relatively short (< 4 hours) in order to allow other users to have the chance to test their work. Keep the bug reports coming as well!

Best,
Colin Vanden Heuvel



Colin Vanden Heuvel

Dec 4, 2025, 5:11:24 PM
to 'Colin Vanden Heuvel' via euler-users
Hello again, Euler users,

The beta test has been going well and users have not had many problems to report, so it's almost time to move into the next phase of deployment. Unless a major issue is discovered, we will begin migrating hardware to the new platform in earnest next week! This is a shorter timeline than we would normally prefer, but as I've shared before, the issues with the old platform are serious enough to justify getting away from it as quickly as possible.

Please take note of these upcoming dates and deadlines:

Monday, December 8th - Nodes which are in an "Idle", "Drain", or "Down" state will be moved to the beta cluster as opportunities become available.

Wednesday, December 10th - The SSH hostname "euler.engr.wisc.edu" will be updated to point to the NEW cluster by default. The old cluster will be reachable as "euler-legacy.engr.wisc.edu" until the final migration deadline.

Thursday, December 11th - euler-login-1 will move to the new cluster. euler-login-2 will remain connected to the old cluster until the final migration deadline.

Monday, December 15th - All nodes except for selected systems on the "instruction" and "critical" partitions will stop accepting new jobs. Jobs on other systems will be allowed to complete up until the final migration deadline, unless the owner of such hardware requests that we move it sooner.

Tuesday, December 16th - Environment module updates begin; documentation for the changed module system will be available by Friday of the same week.

Monday, December 22nd - Final migration deadline. Logins to the old cluster will be disabled and the remaining hardware will be merged into the new cluster at the next available opportunity.
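For scripts that need to pick the right SSH hostname during the transition, the schedule above can be sketched in Python roughly as follows. This is an illustration, not an official tool. The function name is hypothetical, and whole-day granularity is an assumption because the schedule gives no times of day. That "euler.engr.wisc.edu" reaches the old cluster before December 10th is inferred from the wording of the schedule.

```python
from datetime import date

# Dates copied from the schedule above (day-level granularity is an assumption).
DNS_SWITCH = date(2025, 12, 10)      # euler.engr.wisc.edu starts pointing to the new cluster
FINAL_DEADLINE = date(2025, 12, 22)  # logins to the old cluster are disabled


def ssh_hostname(cluster, today):
    """Return the hostname for the 'new' or 'old' cluster on a given day.

    Returns None when the requested cluster no longer accepts logins.
    """
    if cluster == "new":
        if today >= DNS_SWITCH:
            return "euler.engr.wisc.edu"
        return "euler-beta.engr.wisc.edu"  # beta address from the Nov 19 message
    if cluster == "old":
        if today >= FINAL_DEADLINE:
            return None  # old cluster logins disabled
        if today >= DNS_SWITCH:
            return "euler-legacy.engr.wisc.edu"
        return "euler.engr.wisc.edu"
    raise ValueError("cluster must be 'new' or 'old'")
```

For example, `ssh_hostname("old", date(2025, 12, 12))` gives `"euler-legacy.engr.wisc.edu"`, while the same call for December 22nd or later gives `None`.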


If you have any questions or concerns about this process, please reach out to euler-...@engr.wisc.edu ASAP. Please submit bug reports for the new platform as well; the more we can resolve in advance, the smoother this transition will be for everyone.

Regards,
Colin Vanden Heuvel

