2025 ARC Winter Maintenance

ARC Maintenance Updates

Maintenance Dates:

Armis2/Lighthouse: January 6-9

Great Lakes: January 17, 3:00 p.m. - January 24


Overview

This page provides key details about the 2025 winter maintenance: the schedule, the work being performed, and the software versions that will be in place afterward.


HPC (High-Performance Computing) - All Clusters

For each component, the new version is listed first, followed by the old version it replaces.

New: Red Hat 8.8 EUS

  • Kernel 4.18.0-477.81.1.el8_8.x86_64
  • glibc-2.28-225.el8_8.11.x86_64
  • ucx-1.16.0-1.2310213.x86_64 (OFED LTS provided)
  • gcc-8.5.0-18.2.el8_8.x86_64

Old: Red Hat 8.6 EUS

  • Kernel 4.18.0-477.51.1.el8_8.x86_64
  • glibc-2.28-225.el8_8.9.x86_64
  • ucx-1.16.0-1.2310213.x86_64 (OFED LTS provided)
  • gcc-8.5.0-18.2.el8_8.x86_64

New: Mlnx-ofa_kernel-modules

  • OFED 23.10-2.1.3.1

Old: Mlnx-ofa_kernel-modules

  • OFED 23.10-2.1.3.1

New: Slurm 24.11.0 compiled with:

  • PMIx
    • /opt/pmix/4.2.9
    • /opt/pmix/5.0.4
  • hwloc 2.2.0-3 (OS provided)
  • ucx-1.16.0-1.2310213.x86_64 (OFED LTS provided)
  • slurm-libpmi
  • slurm-contribs

Old: Slurm 23.11.6 compiled with:

  • PMIx
    • /opt/pmix/4.2.9
    • /opt/pmix/5.0.2
  • hwloc 2.2.0-3 (OS provided)
  • ucx-1.16.0-1.2310213.x86_64 (OFED LTS provided)
  • slurm-libpmi
  • slurm-contribs

New: PMIx LD config /opt/pmix/4.2.9/lib

Old: PMIx LD config /opt/pmix/4.2.9/lib

New: PMIx versions available in /opt:

  • 4.2.9
  • 5.0.4

Old: PMIx versions available in /opt:

  • 4.2.9
  • 5.0.2

New: Singularity CE (Sylabs.io)

  • 3.11.5
  • 4.1.3

Old: Singularity CE (Sylabs.io)

  • 3.11.5
  • 4.1.3

New: NVIDIA driver 560.35.05

  • CUDA 12.6.3 support

Old: NVIDIA driver 550.54.15

  • CUDA 12.4.1 support

New: Open OnDemand 3.1.9

Old: Open OnDemand 3.1.7
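After the clusters return to production, you can confirm the new versions from a login or compute node. A quick sketch using standard commands (exact output formats vary):

    cat /etc/redhat-release   # expect Red Hat Enterprise Linux release 8.8
    uname -r                  # expect 4.18.0-477.81.1.el8_8.x86_64
    sinfo --version           # expect slurm 24.11.0
    singularity --version     # 3.11.5 or 4.1.3, depending on what you load
    nvidia-smi                # GPU nodes: driver 560.35.05, CUDA 12.6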

Slurm Release Notes: Slurm-24.11

New Features/Behaviors in Slurm 24.11:

  • Multiple QOS support: Users can now submit jobs with multiple Quality of Service (QOS) levels, prioritized by their configured order (see the first sketch after this list).
  • Support for %b in filenames for array task IDs modulo 10.
    • For instance, an output file pattern like output_%b.txt sends tasks with IDs 0, 10, 20, etc. to output_0.txt, tasks with IDs 1, 11, 21, etc. to output_1.txt, and so on. This is particularly useful when you want to distribute tasks or outputs into groups of 10 (see the second sketch after this list).
  • srun --cpu-bind=rank removed.
    • The --cpu-bind=rank option has been removed. Use an explicit binding option such as --cpu-bind=cores, --cpu-bind=sockets, or --cpu-bind=threads instead; these give more direct control over how tasks are distributed across the available CPU resources (example after this list).
  • salloc --get-user-env removed.
    • For the salloc command, the --get-user-env option has also been removed. It was previously used to load the user's environment variables during the allocation process. Set any needed environment variables yourself in job scripts or interactive sessions to maintain the desired environment during job execution (example after this list).
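Multiple QOS: a job can now list more than one QOS in a single submission. A minimal sketch (the QOS names standard and debug are placeholders for QOS levels your account actually has):

    #!/bin/bash
    #SBATCH --job-name=multi-qos-demo
    #SBATCH --qos=standard,debug   # Slurm tries these in their configured priority order
    #SBATCH --time=00:10:00

    srun hostname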
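%b pattern: one way to use it is to spread array-task outputs across ten directories keyed by task ID modulo 10, so no single directory fills with files. A sketch (note that Slurm does not create output directories, so they must exist before the tasks run):

    mkdir -p bucket_{0..9}   # one directory per value of %b

    sbatch <<'EOF'
    #!/bin/bash
    #SBATCH --array=0-99
    #SBATCH --output=bucket_%b/run_%a.out   # %b = task ID mod 10, %a = full task ID
    srun hostname
    EOF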
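CPU binding: if your scripts used --cpu-bind=rank, switch to one of the explicit options. For example (./my_app is a placeholder):

    srun --ntasks=8 --cpu-bind=cores ./my_app     # bind each task to its own core
    srun --ntasks=2 --cpu-bind=sockets ./my_app   # bind tasks to whole sockets

Adding verbose to the bind list (e.g. --cpu-bind=verbose,cores) prints the resulting bindings, which helps when validating a migrated script.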
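Environment in salloc: where a workflow relied on salloc --get-user-env, re-create the environment explicitly inside the allocation instead. A sketch (the module and variable names are examples):

    salloc --nodes=1 --time=00:30:00
    # once the allocation starts, rebuild the environment by hand:
    source ~/.bashrc              # or a project-specific environment file
    module load gcc               # reload any modules you need
    export OMP_NUM_THREADS=8      # re-set variables your job expects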


HPC (High-Performance Computing) - Great Lakes System

•    Hardware Upgrade: The /scratch storage system on the Great Lakes cluster will be upgraded to enhance performance and capacity.

READ CAREFULLY: All data in /scratch will be removed.

We will be replacing the hardware that manages the Great Lakes /scratch filesystem; the current hardware must be replaced without further delay. The replacement requires a complete rebuild of the /scratch filesystem, meaning all files currently in /scratch will be lost. We strongly encourage all users to back up critical files from /scratch well before the maintenance period begins, as there may be contention for storage bandwidth closer to the start date. If you have jobs expected to complete near the beginning of the maintenance period, make sure their data is saved to an alternative location before the scheduled downtime. Any queued jobs will also be deleted as part of the /scratch rebuild; you can resubmit them once the cluster is back in production.
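For example, one straightforward way to copy a /scratch directory to home storage is rsync (all paths are placeholders; substitute your own account root, directory, and destination, and mind your home quota):

    # copy from /scratch to a backup location you control
    rsync -av --progress /scratch/your_root/your_dir/ ~/scratch_backup/your_dir/

    # spot-check that the copy is complete before maintenance begins
    diff -rq /scratch/your_root/your_dir/ ~/scratch_backup/your_dir/

For larger data sets, a Globus transfer to Turbo or another endpoint may be a better fit than rsync.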

HPC (High-Performance Computing) - Armis2 and Lighthouse

The Data Center Engineering team will be working with contractors to perform the annual preventative maintenance on the data center where Armis2 and Lighthouse are racked.

Preventative Maintenance Work

•    MDC (Main Data Center) Maintenance

•    Planned Shutdown: MDC will undergo a shutdown for comprehensive maintenance tasks, including:

•    Emergency Power Off (EPO) testing

•    Fire Alarm testing

•    ECS (Environmental Control System) maintenance

•    Electrical system scanning

•    UPS (Uninterruptible Power Supply) maintenance

•    High-voltage systems testing

 

HPC Software

All software libraries will be updated to their latest versions, and those versions will become the new defaults.
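In practice, this means a module load without an explicit version may resolve to a different version after the maintenance. A sketch, assuming the Lmod-style module commands used on the clusters (gcc is just an example module):

    module avail gcc        # list the versions installed
    module load gcc         # loads the new default version
    module load gcc/8.5.0   # pin an explicit version for reproducibility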

 


Storage

1.    Turbo System Networking Upgrade

•    Networking Update: Upgrading networking infrastructure for Turbo to improve connectivity and resilience.

Globus

•    Version: To Be Determined (TBD)

SES (Secure Enclave Services)

1.    Firmware and Networking Updates:

•    Firmware Updates: Applying latest firmware versions to ensure stability and security.

•    Networking Configuration and Hardware Update: Updating networking configuration and hardware. No end-user impacts expected.

Contact Information

For questions or additional support, please contact ARC Support at [ARC Support Email/Contact Information].
