Update 2021 Overview

The cluster is being updated to Ubuntu 20.04. Along with the OS update come several changes to the way the cluster works. This page summarizes those changes. Workshops were held in June and July of 2021 to discuss these changes in detail.

Rather than shutting the cluster down and doing the upgrade in one swell foop, we are providing a new cluster (with a new login node) already upgraded to Ubuntu 20.04 “Focal Fossa”. You can move to the new cluster when it is convenient (although you will have to move in the near future - defined as the next 3-4 months).

From September 6th, no new jobs will be allowed on the old cluster. Jobs that are already running on that date will be allowed to continue running, but the old cluster now has a 30-day time limit on all queues.

Most of the old cluster nodes have now been moved to the new cluster. The new cluster includes 6 brand new compute nodes to which you can submit jobs using SLURM as usual (a minimal submission sketch follows the list below). These 6 nodes each have:

  • 2 Intel Xeon Gold 6226R CPUs (2.90GHz, 16 cores each)
  • 384GB of RAM
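
Submitting a job works the same way as on the old cluster. The script below is only a minimal sketch: the job name, program name, and resource numbers are placeholders, and the core and memory options it uses are the ones discussed in the SLURM sections below.

#!/bin/bash
#SBATCH --job-name=example          # placeholder job name
#SBATCH --cpus-per-task=4           # cores requested; see "Enforcing Core Counts" below
#SBATCH --mem=16G                   # memory requested; see "Enforcing Memory Allocation" below

# my_program is a placeholder; tell it to use exactly the cores SLURM allocated.
./my_program --threads "$SLURM_CPUS_PER_TASK"

Submit the script with something like:

sbatch myjob.sh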

There are some important differences in the SLURM submission configuration described below.

When you log in to the new cluster you will have access to your home directory exactly as on the current head node. Unfortunately that does not mean that all the software you have installed in your home directory will continue to work: the new OS has updated shared libraries that may or may not be compatible with the programs you have installed. So you will need to spend some time testing and/or re-installing the software you need. (See the paragraph below about scratch space that could be used for testing software updates.)

The nodes on the old cluster will be slowly disappearing from that cluster and reappearing on the new cluster. The plan is to move one or two nodes per week, so with about 30 nodes to move, after a few months the old cluster will have no nodes left - hence the need for everyone to move.

Allison told me that we need scratch space on the cluster (i.e. disk space that is not backed up and is intended only for somewhat temporary files). To that end the new cluster has a 70TB volume mounted at /scratch, and each (active) account can have a sub-directory of /scratch for temporary downloads or runtime temporary files. Files on /scratch will be automatically deleted once they are 90 days old (this time limit may be changed, up or down, as we see how heavily the scratch space is used). I have created subdirectories of /scratch for all users who have been active on the old cluster in the last month. Any other user can send an e-mail to Kevin or me to request that we create a sub-directory for them.
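
If you want a job to do its temporary work on /scratch, something like the sketch below should work. It assumes your sub-directory is named after your user name; check the actual name of your directory under /scratch before relying on that.

# Assumption: your scratch sub-directory is /scratch/$USER
SCRATCH=/scratch/$USER/myjob        # "myjob" is a placeholder working directory
mkdir -p "$SCRATCH"
cd "$SCRATCH"
# ... run the job here, writing large temporary files to $SCRATCH rather than to your home directory ...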

For those of you working with sensitive data, the scratch space is encrypted “at rest”. So you can use it too.

I have not attempted to install all the software from the old cluster on the new because I would like to start fresh and install only what we need. I have installed the most obvious software: R, singularity, samtools, etc. Feel free to send e-mail requesting the installation of specific software.

The new cluster makes use of “Environment Modules”. These allow the cluster to provide different versions of the same software. See the section on environment modules below. You are still welcome to install whatever software you need locally in your own home directory, but we hope that it will be easier for users to, for instance, upgrade to a new version of R, or stick with an older version if necessary.

New SLURM Configuration

Enforcing Core Counts

On the old cluster SLURM was configured so that it tracked allocated cores on each node, but did not prevent a job from using more cores than specified in the submission: a job could be submitted with the default 1-core allocation, yet run a program that started 10 threads and actually used 10 cores. This occasionally led to nodes being bogged down.

On the new cluster jobs will have access to exactly as many cores as requested in the job submission. If you submit a job requesting 1 core, and start a program that uses 10 threads, all 10 threads will be time-sliced on 1 core. Your job might run something like 10 times slower than you expected it to.

So, you should be more careful in specifying how many cores your job needs. See the “-c” or “--cpus-per-task” option to sbatch.
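
For example, to give a job 8 cores for a program that runs 8 threads, either of the following forms works (the script name is a placeholder):

sbatch -c 8 myjob.sh
sbatch --cpus-per-task=8 myjob.sh

Inside the script you can read the allocation from the SLURM_CPUS_PER_TASK environment variable, so the program is started with the same number of threads as were requested.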

See Enforcing Core Counts for more details.

Enforcing Memory Allocation

On the old cluster memory was not treated as a SLURM “consumable resource”. A job could use as much memory as it wanted, and this occasionally led to nodes running out of memory.

On the new cluster, memory is a SLURM consumable resource, which means you should be careful to specify how much memory your job needs. By default your job is allocated 8GB per requested core; since the default request is 1 core, a job that specifies nothing gets 8GB of memory. If your job attempts to use more than its memory allocation it will be killed by the operating system.

You can change the amount of memory your job is allocated using either the “--mem” option or the “--mem-per-cpu” option to sbatch.
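
For example, to request 32GB in total for a single-core job, or 4GB per core for an 8-core job (the script name is again a placeholder):

sbatch --mem=32G myjob.sh
sbatch -c 8 --mem-per-cpu=4G myjob.sh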

See Enforcing Memory Allocation for more details.

Environment Modules

Environment modules allow you to control which software (and which version of that software) is available in your environment. For instance, the new cluster has 4 different versions of standard R installed: 3.5.3, 3.6.3, 4.0.5, and 4.1.0. When you first log in and try to run R, the OS will respond with “command not found”. To activate R in your environment you would type:

module add R
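
To see what is available, and to load a particular version rather than the default, you can use commands like the following. The exact module names (e.g. R/4.1.0) are assumptions about how the versions are labelled; “module avail” will show the real names.

module avail            # list all available modules
module avail R          # list just the R modules
module add R/4.1.0      # load a specific version (name format is an assumption)
module list             # show the modules currently loaded
module rm R             # unload R again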

See Environment Modules for more details.

Connecting to the Updated Cluster

To connect to the new cluster, SSH to either of the following hostnames:

  • captainsisko.statgen.ncsu.edu
  • brccluster2021.statgen.ncsu.edu

Use the same user name and password that you use for the old cluster.
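
For example (replace “username” with your own account name):

ssh username@captainsisko.statgen.ncsu.edu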
