[euler-users] Unplanned Partial Outage

Colin Vanden Heuvel

Jan 5, 2024, 2:58:43 PM
to 'Colin Vanden Heuvel' via euler-users
Hello everyone,

It looks like a configuration file was corrupted, which scrambled aspects of Euler's network settings. I'm working to remedy the issue, but it is likely to take several days, and there may be some intermittent downtime as certain facilities are patched and rebooted.

I will send a follow-up when the issue is resolved or if anything changes.

If you have any questions or concerns, please contact euler-...@engr.wisc.edu for assistance.

Regards,
Colin Vanden Heuvel

Colin Vanden Heuvel

Jan 5, 2024, 7:27:25 PM
to 'Colin Vanden Heuvel' via euler-users
A fix is in place as of about 15 minutes ago. We will need to schedule some rolling reboots in the near future to iron out some of the finer details, but we will send advance notice of those so as to minimize the interruption to everyone's work.

The scope of this problem was limited to legacy IPv4 traffic originating within Euler, so there was no impact to the cluster's storage or its compute capabilities. Only specific types of workloads which rely on communication with legacy-only services outside of Euler or on the public internet should have been affected.

We will be monitoring the situation for a few days to ensure that the fix is behaving as expected. Users may continue to work on Euler during this observation period. If you don't hear otherwise by Tuesday of this coming week, you may assume that the issue is fully resolved.

If you run into network issues on Euler, please let us know at euler-...@engr.wisc.edu ASAP so we can account for them in our fixes.

Regards,
Colin V.

Colin Vanden Heuvel

Jan 6, 2024, 3:22:56 AM
to 'Colin Vanden Heuvel' via euler-users
Good evening Euler users,
I just woke up to about a hundred alert messages (and counting), which indicate that the fix did NOT hold. Worse, it seems to have unearthed another, much more severe bug elsewhere.

In order to prevent any damage or data loss from this new bug, there is now an emergency pause on all of Euler's file servers. That means the cluster is effectively offline until the issue can be resolved. There is no indication of data loss at this time; this pause aims to ensure that things stay that way while the people working on the issue get some much-needed rest.

Work will resume midmorning and will continue through the weekend or until the problem is fixed.

-- Colin

Colin Vanden Heuvel

Jan 7, 2024, 8:30:58 PM
to 'Colin Vanden Heuvel' via euler-users
A brief update before the start of the work week:

Both the new bug and the original network issue were patched last night and appear to be holding stable after 24 hours of observation. The emergency pause has been lifted so folks can get back to work tomorrow morning. Because repair efforts were focused on core facilities like the cluster filesystem and network, it is still possible that individual compute nodes could be having problems. Please reach out if you see any job behavior that seems out of the ordinary.

Thanks for your patience and cooperation.

Regards,
Colin Vanden Heuvel
