For help with desktop computing or network printing at SPH, please submit a ticket via email to the SPH Computing Services help desk at sph.help@umich.edu.
For help with the CSG cluster, please submit a ticket via email to our IT help desk at csg.help@umich.edu. We aim to respond to, and often completely resolve, all requests within one business day. Examples of appropriate CSG cluster support requests include but are not limited to:
Please note that while our staff may occasionally work tickets at their discretion during evenings, weekends and holidays, in general, we do not guarantee response to support requests outside of normal business hours except in case of major service outage.
When reporting trouble on a gateway node, please include information such as the host name of the gateway node, the full command line of the program that failed to run, the full text of any error output produced by the program and the file paths that the program was attempting to access when it failed.
When reporting trouble with a Slurm job, the job ID number of the job that failed and the full srun or sbatch command submitted to run the job are particularly helpful.
When reporting login or connectivity issues, please include your cluster user name, the stage of login at which you are having trouble, your site and connectivity method.
Our quick reference guide covers a helpful assortment of common Linux and Slurm commands.
If you have prior experience in an environment that used a batch queuing system other than Slurm, the rosetta stone of workload managers can assist you in applying that knowledge when working with Slurm.
While the use of RAID technology for CSG storage increases resilience to failure, and CSG IT operations staff make every effort to safeguard user data on the cluster, incidents beyond our control do occasionally happen and large-scale data loss may occur.
All users on the cluster are encouraged to take independent measures to safeguard their data, particularly source code and job scripts. These files are often relatively small (easy to move around and store copies of elsewhere) and require more effort to replace if lost (rewriting code from scratch, as opposed to just downloading data again from an external source). Toward this end, please consider one or more of the following:
If you find yourself in the position of being a steward of irreplaceable research data on project node file shares, please begin a dialog with our help desk so we can work with you to formulate a backup strategy for this data. Because backup resources are limited, we depend on analysts who are close to the data and understand what is most critical to flag files for backup arrangements.
If you have lost data due to, for example, an accidental deletion, please open a ticket with our help desk and we can check our backups.
All potential users must have a valid U-M uniqname before an account request can be processed.
If you do not have a U-M appointment and an account on the CSG cluster already, please contact your advisor or collaborator at CSG and ask them to submit an account request on your behalf.
Cluster gateway nodes are dual-homed (connected to two networks). One network interface on each gateway is connected to the U-M campus network. This allows users to access them from U-M campus and the public Internet. A second interface on each gateway is connected to a private internal cluster network. This allows the gateway nodes to share files and communicate with the cluster compute worker nodes.
Cluster compute worker nodes are connected only to the private internal cluster network. This isolates them from the public Internet for security purposes and conserves addresses on the U-M campus network. Compute worker nodes cannot be accessed directly and are intended to accept work only from the cluster resource manager. Within the resource manager, compute worker nodes may be assigned to various logical partitions based on the lab or project that funded them.
Please note that if you are working off campus, the University of Michigan now requires the use of the U-M VPN to initiate SSH connections to systems on U-M networks. Logging in to the U-M VPN will require the use of your U-M uniqname password. Documentation for the U-M VPN and installers for the U-M VPN client may be found at the following URL:
https://its.umich.edu/enterprise/wifi-networks/vpn/getting-started
If you are affiliated with the Abecasis lab or Zoellner lab, you should use one of the Abecasis main cluster gateways:
If you are affiliated with the Zhou lab, you should use the following cluster gateway:
If you are affiliated with the Willer lab, you should use the following cluster gateway:
If you are affiliated with the Mukherjee lab or Fritsche lab, you should use the following cluster gateway:
If you are affiliated with the Tsoi lab, you should use the following cluster gateway:
If you are affiliated with the Kardia-Smith lab, you should use the following cluster gateway:
If you are an external (non-UM) affiliate, you should use the following cluster gateway:
If your lab affiliation is not otherwise listed above, use one of the Boehnke lab or Abecasis lab cluster gateways.
Certain research projects may also have their own project-specific cluster gateway nodes. We tend to call these project nodes in casual conversation around the lab. If you are working on a project that has a dedicated gateway node, your advisor or local collaborator will inform you of the details.
The Abecasis and Boehnke lab gateway nodes were the original four gateway nodes available on the CSG cluster and tend to be catch-alls for other CSG-affiliated PIs and their associated students who do not have a private cluster gateway node for their lab. For that reason, we may call the Abecasis and Boehnke lab gateways main gateways in casual conversation around the lab.
Nodes using Duo will prompt you to use Duo authentication after entering your password during the SSH login process.
Password:
Duo two-factor login for your_username

Enter a passcode or select one of the following options:

 1. Duo Push to XXX-XXX-XXXX
 2. Phone call to XXX-XXX-XXXX
 3. SMS passcodes to XXX-XXX-XXXX

Passcode or option (1-3):
Upon successfully responding to the Duo prompt, your login will proceed as usual.
1. Download PuTTY from the URL:
https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
You can use the full MSI installer, or just download putty.exe to your Desktop.
2. Run PuTTY.
3. Find the "Host Name" field in the "PuTTY Configuration" window under the "Session" category (this category will show by default when PuTTY opens).
4. Enter the desired host name to connect to in the "Host Name" field.
5. Click "Open".
6. If this is the first time you have connected to a particular gateway node, a dialog box will pop up prompting you to accept the host key. Click OK.
7. A connection window will open. Enter your cluster user name at the "login as" prompt.
8. Enter your cluster password at the password prompt.
2. Navigate to the Utilities folder in the Applications folder.
3. Run Terminal.
4. At the Terminal command prompt, use the following command to connect to a cluster gateway node:
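For example, assuming the gateway host name is one of those listed above (the user name and host name below are placeholders):

$ ssh your_username@<gateway_host_name>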
5. If this is the first time you have connected to a particular gateway node, you will be prompted to accept the host key. Type "yes" and hit ENTER.
6. Enter your cluster password at the password prompt.
2. At the terminal command prompt, use the following command to connect to a cluster gateway node:
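The command takes the same form as on macOS; the user name and host name below are placeholders:

$ ssh your_username@<gateway_host_name>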
3. If this is the first time you have connected to a particular gateway node, you will be prompted to accept the host key. Type "yes" and hit ENTER.
4. Enter your cluster password at the password prompt.
You should be familiar with htop before running jobs on a gateway. htop is used to monitor CPU and memory usage of processes/jobs running on a machine. Please watch a tutorial on htop, such as:
Do not abuse the gateway nodes!
If you need to run a large number of jobs, you should be submitting batch jobs to the resource manager instead.
Only a very small number of short-lived jobs should be run on a gateway, and only if the gateway is not currently running a large number of jobs.
Too many jobs on a gateway will overload it, slowing down everyone else's work.
Using too much memory will overload the gateway and may require a machine reboot, which will cause loss of work for other users. We have project-specific nodes with large amounts of system memory, if you need them.
Use df -kh to view available disk capacity on the system that you are using and the network shares that are mounted there.
Use du -sh to determine disk utilization of files in your home directory or on project storage.
Remember that bandwidth is limited. Too many disk accesses in parallel to a gateway can overload the gateway, making it VERY slow.
If you submit a large number of jobs that read data from a gateway, please regularly check the gateway with dstat and also attempt to list a few directories to get a sense for whether it is affecting the performance of the machine.
When possible, run I/O intensive jobs on project clusters reading from/writing to project storage rather than gateways.
Project data belongs on project machines, not in your main gateway home directories.
Submit a ticket to our help desk csg.help@umich.edu if you need help moving your data to an appropriate place or finding high-throughput scratch storage for your I/O intensive jobs.
In some cases, however, you may wish to run a very small number of jobs on a gateway. Reasons for doing so might be:
Before doing so:
To make sure your jobs continue running when you logout of the gateway, use one of the following methods:
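For example, a minimal sketch using nohup or tmux, assuming those tools are among the methods covered (the script name is a placeholder):

$ nohup ./my_analysis.sh > my_analysis.log 2>&1 &
$ tmux new -s analysis    # run your command inside the session, then detach with Ctrl-b d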
https://github.com/statgen/SLURM-examples
We also have the following official training slide decks from the Slurm developers available for reference. These are protected under copyright and you must log in with your cluster user name and password to view them.
Parameter | Value | Notes
---|---|---
Default number of CPUs per task | 1 |
Maximum number of CPUs per task | unlimited | No worker node has more than 80 CPU cores (see Bad Constraints in Slurm)
Lowest common denominator number of CPUs | 24 | Any node will have at least this many cores
Default memory allocation per CPU | 2 GB |
Maximum memory allocation per CPU | unlimited | No worker node has more than ~560 GB RAM (see Bad Constraints in Slurm)
Lowest common denominator physical memory | 64 GB | Any node will have at least this much physical memory
Default job run time | 25 hours |
Maximum job run time | 28 days |
Maximum jobs in queue (running plus pending) | 200,000 |
Maximum array size | 25,000 elements |
Lowest common denominator /tmp space | 800 GB | Any node will have at least this much /tmp space
Maximum available /tmp space | 8.5 TB | At least a few nodes will have this much /tmp space
Please note that there may still be minor variations in CPU specification (clock rate, core count) and physical memory within each of these major machine types. However, all machines tagged with a given constraint are guaranteed to be of the same product generation and microarchitecture.
Refer to the table below for a list of available constraints for each machine type.
Machine Type | Representative CPU | Constraint | Notes
---|---|---|---
Dell C6100 | Intel Xeon X5660 | c6100 |
Dell C6220/C6220 II | Intel Xeon E5-2640 v2 | c6220 | Requires hunt partition access
Dell PowerEdge R630 | Intel Xeon E5-2680 v3 | r630 |
Dell PowerEdge R640 | Intel Xeon Gold 6248 | r640 |
Dell PowerEdge R830 | Intel Xeon E5-4650 v4 | r830 | Requires encore partition access
Dell PowerEdge R840 | Intel Xeon Gold 6138 | r840 | Requires encore partition access
Dell PowerEdge R920 | Intel Xeon E7-4890 v2 | r920 | Requires topmed or inpsyght partition access
Dell PowerEdge R930 | Intel Xeon E7-8855 v4 | r930 | Requires topmed, giant-glgc or encore partition access
Dell PowerEdge R940 | Intel Xeon Platinum 8268 | r940 | Requires topmed partition access
HPE DL360G9 | Intel Xeon E5-2680 v3 | dl360g9 |
HPE DL580G9 | Intel Xeon E7-4850 v3 | dl580g9 | Requires topmed partition access
These may be specified to the batch scheduler with the flag:
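For example, node features are requested in Slurm with the --constraint option; here using the c6100 constraint from the table above (the job script name is a placeholder):

$ sbatch --constraint=c6100 myjob.sh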
If you need more specific information about the machine on which your job runs, add the following line to the head of your job script. The host name, make and model of the node, CPU type, clock rate and amount of physical memory will appear in the log file produced by the job.
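As a minimal sketch, standard Linux utilities can report the same information from within a job script (the exact line used on the cluster may differ):

hostname
cat /sys/class/dmi/id/sys_vendor /sys/class/dmi/id/product_name    # make and model of the node
lscpu | grep -E 'Model name|MHz'                                    # CPU type and clock rate
grep MemTotal /proc/meminfo                                         # physical memory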
If you have any questions about detailed machine specifications on the cluster, please open a ticket with our help desk at csg.help@umich.edu.
If you are working on a particular project, you may wish to submit jobs specifically to the cluster nodes associated with that project. You can use the sinfo command to see which partitions are available to you. When submitting jobs, simply add the appropriate partition flag to your sbatch or srun command, as in the sketch below.
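A minimal sketch (the partition and script names are placeholders):

$ sinfo
$ sbatch --partition=<project_partition> myjob.sh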
To maximize resource utilization, compute worker nodes dedicated to a specific project will also take up work from the main cluster batch queues if they are otherwise idle, with the caveat that this work may be preempted and requeued if higher priority work comes in from the project-specific partition (See Job Preemption in Slurm).
Send an email to csg.help@umich.edu to request the use of project-specific compute nodes. An appropriate case for this might be a deadline crunch.
If your job runs a program that is compiled to use AVX2 instructions, specify the following flag when submitting your job:
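A sketch, assuming the cluster exposes an avx2 node feature (the script name is a placeholder):

$ sbatch --constraint=avx2 myjob.sh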
If your job runs a program that is compiled to use AVX512 instructions, specify the following flag when submitting your job:
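Similarly, assuming an avx512 node feature:

$ sbatch --constraint=avx512 myjob.sh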
Nodes supporting a greater level of AVX instructions also include support for previous levels of AVX instructions. A node supporting AVX2 will also support AVX. A node supporting AVX512 will also support AVX2 and AVX.
For an array job, just taking the job ID shown in squeue and passing it to sstat will not work. Instead, take the job ID from squeue and use scontrol to obtain the real job ID of the array step:
$ scontrol show job 40626979_5 | head -1
JobId=40628098 ArrayJobId=40626979 ArrayTaskId=5 JobName=perm
Then take that reported JobID and pass it as the -j parameter to sstat.
$ sstat -j 40628098.batch -o JobID,AveRSS,MaxRSS,AveVMSize,MaxVMSize,AvePages,MaxPages
       JobID     AveRSS     MaxRSS  AveVMSize  MaxVMSize   AvePages   MaxPages
------------ ---------- ---------- ---------- ---------- ---------- ----------
40628098.ba+   1834920K   2293272K     12600K    265184K          0          0
For jobs that are not array jobs, just take the job ID given by squeue and supply that as the -j argument to sstat.
The sacct command takes similar command line arguments to sstat. For example:
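A sketch mirroring the sstat field list above, reusing the array-step job ID from the earlier example:

$ sacct -j 40628098 -o JobID,AveRSS,MaxRSS,AveVMSize,MaxVMSize,Elapsed,State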
The MaxRSS field is the maximum amount of memory used by the job at any time over the course of the run.
The time command is also useful to obtain the memory utilization of a running program. It's important to keep in mind there is both a shell builtin version and a standalone version of the time command. When using the time command to gather memory consumption statistics, be sure to run the standalone time command by furnishing the full path, as the shell builtin version of the time command does not support memory utilization profiling.
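For example, assuming GNU time is installed at /usr/bin/time (the program name is a placeholder); the -v flag prints the maximum resident set size along with other statistics:

$ /usr/bin/time -v ./my_program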
Please be aware that careless use of the email notification facility can result in thousands of email messages being sent to the address specified in the job script in a very short period of time.
Follow the guidelines below when using email notification with your Slurm jobs:
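As a conservative sketch, request mail only when a job ends or fails, and avoid per-task notifications for array jobs (the email address is a placeholder):

#SBATCH --mail-user=your_uniqname@umich.edu
#SBATCH --mail-type=END,FAIL    # omit ARRAY_TASKS so an array job sends one summary message, not one per task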
To maximize hardware utilization, most compute worker nodes associated with a specific project are members of two Slurm partitions: their respective project-specific partition and the "main" partition.
This is not always immediately apparent because Slurm will not show project-specific partitions in sinfo unless you explicitly have access to them.
Under normal circumstances, project-specific compute worker nodes will pick up work from the "main" partition when they are idle. They will run jobs from "main" until they receive jobs via their higher-priority project-specific (mini-cluster) partition. When this occurs, work in progress from the lower-priority "main" partition is preempted and requeued, and the node works on jobs from the project-specific partition until no more are queued there. It then begins picking up work from the "main" partition once again.
Note that only work submitted with the sbatch command will be requeued when preempted. Work submitted via the srun command will simply be terminated by Slurm when preempted.
If you are concerned about your job being preempted (especially for long-running jobs, where preemption can be particularly painful), either submit to the "main-nopreempt" partition, which excludes all project-specific nodes, with the command:
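For example (the script name is a placeholder):

$ sbatch --partition=main-nopreempt myjob.sh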
Or use the Slurm --exclude switch when submitting jobs to "main" to restrict the scheduler to placing your jobs on nodes that are not also a member of a mini-cluster partition:
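For example, with placeholder node names standing in for the project nodes to avoid:

$ sbatch --partition=main --exclude=node1001,node1002 myjob.sh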
Nodes that are not members of any project-specific partition, and on which running jobs will therefore never be preempted, include:
Generally, the more demanding the constraints (e.g., more cores or more physical memory), the fewer nodes can fulfill them, and the longer your job may wait in the queue for a suitable node to become free.
When this occurs, you must cancel your jobs with the scancel command and resubmit them with appropriate constraints.
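For example (the job ID and resource values are placeholders):

$ scancel 12345678
$ sbatch --mem-per-cpu=4G --time=1-00:00:00 myjob.sh    # resubmit with limits a node can actually satisfy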
You can see the list of libraries available in R by running R and executing the command:
> library()
If the module you need is not already available in R on the cluster, we encourage you to contact our help desk at csg.help@umich.edu and request that the library be installed. Our operations team is happy to install R libraries per user request and library installations are almost always completed the same day they are requested. Requesting library installations instead of doing it yourself allows all cluster users to benefit from the library installation, saves you the effort of building and maintaining R libraries and allows our operations team to keep libraries up to date as major system changes occur.
If for development or other purposes you must maintain a private R library repository, follow the steps below.
1. Create a directory in your home directory that will be used to hold R libraries that you build.
$ mkdir -p ~/R/site-library
2. Set the R_LIBS_USER environment variable to point to that directory. If you are using the bash shell, the command would be:
$ export R_LIBS_USER=~/R/site-library
If you are using the tcsh shell, the command would be:
% setenv R_LIBS_USER ~/R/site-library
If the R_LIBS_USER environment variable is not set, R will attempt to install libraries to system directories when you run install.packages(), and this will fail because ordinary user accounts do not have access to write to these directories.
3. Run R and install libraries using the install.packages() command. For example:
$ R

R version 4.1.1 (2021-08-10) -- "Kick Things"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> install.packages('tidyverse', dependencies=TRUE);
When finished, the libraries will be installed to the directory that you configured for R_LIBS_USER.
4. To make the changes persist for all future sessions, update your dotfiles to set R_LIBS_USER when you log in. If you are using the bash shell, the command would be:
$ echo "export R_LIBS_USER=~/R/site-library" >> ~/.bashrc
If you are using the tcsh shell, the command would be:
% echo "setenv R_LIBS_USER ~/R/site-library" >> ~/.cshrc.aliases