ARC Maintenance Updates
Maintenance Dates:
• Armis2/Lighthouse: January 6-9
• Great Lakes: January 17, 3:00 p.m. - January 24
Overview
This page provides key details about the schedule and purpose of the upcoming ARC maintenance.
HPC (High-Performance Computing) - All Clusters
NEW version                            | OLD version
Red Hat 8.8 EUS                        | Red Hat 8.6 EUS
Mlnx-ofa_kernel-modules                | Mlnx-ofa_kernel-modules
Slurm 24.11.0 compiled with:           | Slurm 23.11.6 compiled with:
  PMIx LD config /opt/pmix/4.2.9/lib   |   PMIx LD config /opt/pmix/4.2.9/lib
  PMIx versions available in /opt      |   PMIx versions available in /opt
Singularity CE (Sylabs.io)             | Singularity CE (Sylabs.io)
NVIDIA driver 560.35.05                | NVIDIA driver 550.54.15
Open OnDemand 3.1.9                    | Open OnDemand 3.1.7
Slurm Release Notes: Slurm-24.11
New Features/Behaviors in Slurm 24.11:
- Multiple QOS Support: Users can now submit jobs with multiple Quality of Service (QOS) levels, prioritized by their configured order.
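A minimal sketch of how a job might request more than one QOS, assuming the comma-separated --qos list syntax described in the release notes; the QOS names gpu-priority and standard are placeholders for QOS actually defined on your cluster:

    #!/bin/bash
    #SBATCH --job-name=multi-qos-demo
    #SBATCH --ntasks=1
    #SBATCH --time=00:10:00
    # Request two QOS levels; Slurm considers them in their configured
    # priority order (the QOS names below are hypothetical).
    #SBATCH --qos=gpu-priority,standard

    srun hostname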
- Support for %b in filename patterns, which expands to the array task ID modulo 10. For instance, specifying an output file pattern like output_%b.txt in your Slurm script would result in tasks with IDs 0, 10, 20, etc. writing to output_0.txt; tasks with IDs 1, 11, 21, etc. writing to output_1.txt; and so on. This behavior is particularly useful when you want to group tasks or outputs into sets of 10.
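A short sketch of the pattern described above, using a 30-task array whose outputs are grouped by task ID modulo 10:

    #!/bin/bash
    #SBATCH --job-name=modulo-output-demo
    #SBATCH --array=0-29
    # %b expands to the array task ID modulo 10, so tasks 0, 10, and 20
    # write to output_0.txt; tasks 1, 11, and 21 to output_1.txt; and so on.
    #SBATCH --output=output_%b.txt

    echo "Array task ${SLURM_ARRAY_TASK_ID} running on $(hostname)"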
- srun --cpu-bind=rank removed: The --cpu-bind=rank option has been removed. Users are encouraged to use alternative binding options such as --cpu-bind=cores, --cpu-bind=sockets, or --cpu-bind=threads to control task-to-CPU binding. These options provide more explicit control over how tasks are distributed across the available CPU resources.
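An illustrative sketch of the replacement bindings (./my_app is a placeholder for your own executable):

    # Bind each task to the cores it was allocated:
    srun --ntasks=4 --cpu-bind=cores ./my_app

    # Or bind at socket or thread granularity:
    srun --ntasks=4 --cpu-bind=sockets ./my_app
    srun --ntasks=4 --cpu-bind=threads ./my_app

    # Add "verbose" to print the binding that was actually applied:
    srun --ntasks=4 --cpu-bind=verbose,cores ./my_app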
- salloc --get-user-env removed: The --get-user-env option of the salloc command has also been removed. It was previously used to load the user's environment variables during the allocation process. Users should now ensure that the necessary environment variables are set within their job scripts or interactive sessions to maintain the desired environment during job execution.
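A sketch of setting up the environment by hand inside an interactive allocation, now that --get-user-env no longer does it for you (the module, variable, and path names are illustrative):

    # Request an interactive allocation:
    salloc --ntasks=1 --time=00:30:00

    # Inside the allocation, set the environment your job expects:
    module load gcc                        # re-load any modules you need
    export MY_DATA_DIR=/scratch/myproject  # hypothetical variable and path
    srun ./my_app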
HPC (High-Performance Computing) - Great Lakes System
• Hardware Upgrade: The /scratch storage system on the Great Lakes cluster will be upgraded to improve performance and capacity.
The hardware that manages the Great Lakes /scratch file system requires immediate replacement. This replacement will necessitate a complete rebuild of the /scratch file system, meaning all files currently in /scratch will be lost. We strongly encourage all users to back up all critical files in /scratch well before the maintenance period begins, as there may be contention for storage bandwidth closer to the start date. If you have jobs expected to complete near the beginning of the maintenance period, please ensure their data is saved to an alternative location before the scheduled downtime. Additionally, any queued jobs will be deleted as part of the /scratch rebuild; users can resubmit them once the cluster is back in production.
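One way to copy critical data out of /scratch before the rebuild is rsync; the source and destination paths below are placeholders for your own allocation's directories, and Globus is another good option for large transfers:

    # Copy a /scratch project directory to Turbo (or home) before maintenance.
    # Paths are hypothetical; substitute your own directories.
    rsync -av --progress /scratch/myaccount_root/myproject/ \
        /nfs/turbo/myaccount/scratch-backup/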
HPC (High-Performance Computing) - Armis2 and Lighthouse
The Data Center Engineering team will be working with contractors to perform the annual preventative maintenance on the data center where Armis2 and Lighthouse are racked.
Preventative Maintenance Work
• MDC (Main Data Center) Maintenance
• Planned Shutdown: MDC will undergo a shutdown for comprehensive maintenance tasks, including:
• Emergency Power Off (EPO) testing
• Fire Alarm testing
• ECS (Environmental Control System) maintenance
• Electrical system scanning
• UPS (Uninterruptible Power Supply) maintenance
• High-voltage systems testing
HPC Software
All software libraries will be updated to their latest versions, and these updates will become the new default versions.
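After the maintenance, users can confirm which versions have become the defaults; a quick sketch using the Lmod module commands available on ARC clusters (python is just an example package):

    module avail            # list installed modules; defaults are marked (D)
    module spider python    # show every available version of a package
    module load python      # with no version given, loads the new default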
Storage
1. Turbo System Networking Upgrade
• Networking Update: Upgrading networking infrastructure for Turbo to improve connectivity and resilience.
Globus
• Version: To Be Determined (TBD)
SES (Secure Enclave Services)
1. Firmware and Networking Updates:
• Firmware Updates: Applying the latest firmware versions to ensure stability and security.
• Networking Configuration and Hardware Update: Updating networking configuration and hardware. No end-user impact is expected.
Contact Information
For questions or additional support, please contact ARC Support at [ARC Support Email/Contact Information].