For help with desktop computing or network printing at SPH, please submit a ticket via email to the SPH Computing Services help desk at sph.help@umich.edu.
For help with the CSG cluster, please submit a ticket via email to our IT help desk at csg.help@umich.edu. We aim to respond to, and often completely resolve, all requests within one business day. Examples of appropriate CSG cluster support requests include but are not limited to:
Please note that while our staff may occasionally work tickets at their discretion during evenings, weekends and holidays, in general, we do not guarantee response to support requests outside of normal business hours except in case of major service outage.
When reporting trouble on a gateway node, please include information such as the host name of the gateway node, the full command line of the program that failed to run, the full text of any error output produced by the program and the file paths that the program was attempting to access when it failed.
When reporting trouble with a Slurm job, the job ID number of the job that failed and the full srun or sbatch command submitted to run the job are particularly helpful.
When reporting login or connectivity issues, please include your cluster user name, the stage of login at which you are having trouble, your site and connectivity method.
Our quick reference guide covers a helpful assortment of common Linux and Slurm commands.
If you have prior experience in an environment that used a batch queuing system other than Slurm, the rosetta stone of workload managers can assist you in applying that knowledge when working with Slurm.
While the use of RAID technology for CSG storage increases resilience to failure, and CSG IT operations staff make every effort to safeguard user data on the cluster, incidents beyond our control do occasionally happen and large-scale data loss may occur.
All users on the cluster are encouraged to take independent measures to safeguard their data, particularly source code and job scripts. These files are often relatively small (easy to move around and store copies of elsewhere) and require more effort to replace if lost (rewriting code from scratch, as opposed to just downloading data again from an external source). Toward this end, please consider one or more of the following:
If you find yourself in the position of being a steward of irreplaceable research data on project node file shares, please begin a dialog with our help desk so we can work with you to formulate a backup strategy for this data. Because backup resources are limited, we depend on analysts who are close to the data and understand what is most critical to flag files for backup arrangements.
If you have lost data due to, for example, an accidental deletion, please open a ticket with our help desk and we can check our backups.
All potential users must have a valid U-M uniqname before an account request can be processed.
If you do not have a U-M appointment and an account on the CSG cluster already, please contact your advisor or collaborator at CSG and ask them to submit an account request on your behalf.
Cluster gateway nodes are dual-homed (connected to two networks). One network interface on each gateway is connected to the U-M campus network. This allows users to access them from U-M campus and the public Internet. A second interface on each gateway is connected to a private internal cluster network. This allows the gateway nodes to share files and communicate with the cluster compute worker nodes.
Cluster compute worker nodes are connected only to the private internal cluster network. This isolates them from the public Internet for security purposes and conserves addresses on the U-M campus network. Compute worker nodes cannot be accessed directly and are intended to accept work only from the cluster resource manager. Within the resource manager, compute worker nodes may be assigned to various logical partitions based on the lab or project that funded them.
Please note that if you are working off campus, the University of Michigan now requires the use of the U-M VPN to initiate SSH connections to systems on U-M networks. Logging in to the U-M VPN will require the use of your U-M uniqname password. Documentation for the U-M VPN and installers for the U-M VPN client may be found at the following URL:
https://its.umich.edu/enterprise/wifi-networks/vpn/getting-started
If you are affiliated with the Abecasis lab or Zoellner lab, you should use one of the Abecasis main cluster gateways:
If you are affiliated with the Zhou lab, you should use the following cluster gateway:
If you are affiliated with the Willer lab, you should use the following cluster gateway:
If you are affiliated with the Mukherjee lab or Fritsche lab, you should use the following cluster gateway:
If you are affiliated with the Tsoi lab, you should use the following cluster gateway:
If you are affiliated with the Kardia-Smith lab, you should use the following cluster gateway:
If you are an external (non-UM) affiliate, you should use the following cluster gateway:
If your lab affiliation is not otherwise listed above, use one of the Boehnke lab or Abecasis lab cluster gateways.
Certain research projects may also have their own project-specific cluster gateway nodes. We tend to call these project nodes in casual conversation around the lab. If you are working on a project that has a dedicated gateway node, your advisor or local collaborator will inform you of the details.
The Abecasis and Boehnke lab gateway nodes were the original four gateway nodes available on the CSG cluster and tend to be catch-alls for other CSG-affiliated PIs and their associated students who do not have a private cluster gateway node for their lab. For that reason, we may call the Abecasis and Boehnke lab gateways main gateways in casual conversation around the lab.
Nodes using Duo will prompt you to use Duo authentication after entering your password during the SSH login process.
Password:
Duo two-factor login for your_username

Enter a passcode or select one of the following options:

 1. Duo Push to XXX-XXX-XXXX
 2. Phone call to XXX-XXX-XXXX
 3. SMS passcodes to XXX-XXX-XXXX

Passcode or option (1-3):
Upon successfully responding to the Duo prompt, your login will proceed as usual.
1. Download PuTTY from the URL:
https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
You can use the full MSI installer, or just download putty.exe to your Desktop.
2. Run PuTTY.
3. Find the "Host Name" field in the "PuTTY Configuration" window under the "Session" category (this category will show by default when PuTTY opens).
4. Enter the desired host name to connect to in the "Host Name" field.
5. Click "Open".
6. If this is the first time you have connected to a particular gateway node, a dialog box will pop up prompting you to accept the host key. Click OK.
7. A connection window will open. Enter your cluster user name at the "login as" prompt.
8. Enter your cluster password at the password prompt.
2. Navigate to the Utilities folder in the Applications folder.
3. Run Terminal.
4. At the Terminal command prompt, use the following command to connect to a cluster gateway node:
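For example, assuming the gateway host name is one of those listed above (the user name and host name below are placeholders):

$ ssh your_username@<gateway_host_name>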
5. If this is the first time you have connected to a particular gateway node, you will be prompted to accept the host key. Type "yes" and hit ENTER.
6. Enter your cluster password at the password prompt.
2. At the terminal command prompt, use the following command to connect to a cluster gateway node:
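The command takes the same form as on macOS; the user name and host name below are placeholders:

$ ssh your_username@<gateway_host_name>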
3. If this is the first time you have connected to a particular gateway node, you will be prompted to accept the host key. Type "yes" and hit ENTER.
4. Enter your cluster password at the password prompt.
You should be familiar with htop before running jobs on a gateway. htop is used to monitor CPU and memory usage of processes/jobs running on a machine. Please watch a tutorial on htop, such as:
Do not abuse the gateway nodes!
If you need to run a large number of jobs, you should be submitting batch jobs to the resource manager instead.
Only a very small number of short-lived jobs should be run on a gateway, and only if the gateway is not currently running a large number of jobs.
Too many jobs on a gateway will overload it, slowing down everyone else's work.
Using too much memory will overload the gateway and may require a machine reboot, which will cause loss of work for other users. We have project-specific nodes with large amounts of system memory, if you need them.
Use df -kh to view available disk capacity on the system that you are using and the network shares that are mounted there.
Use du -sh to determine disk utilization of files in your home directory or on project storage.
Remember that bandwidth is limited. Too many disk accesses in parallel to a gateway can overload the gateway, making it VERY slow.
If you submit a large number of jobs that read data from a gateway, please regularly check the gateway with dstat and also attempt to list a few directories to get a sense for whether it is affecting the performance of the machine.
When possible, run I/O intensive jobs on project clusters reading from/writing to project storage rather than gateways.
Project data belongs on project machines, not in your main gateway home directories.
Submit a ticket to our help desk csg.help@umich.edu if you need help moving your data to an appropriate place or finding high-throughput scratch storage for your I/O intensive jobs.
In some cases, however, you may wish to run a very small number of jobs on a gateway. Reasons for doing so might be:
Before doing so:
To make sure your jobs continue running when you logout of the gateway, use one of the following methods:
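For example, a minimal sketch using nohup or tmux, assuming those tools are among the methods covered (the script name is a placeholder):

$ nohup ./my_analysis.sh > my_analysis.log 2>&1 &
$ tmux new -s analysis    # run your command inside the session, then detach with Ctrl-b d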
https://github.com/statgen/SLURM-examples
We also have the following official training slide decks from the Slurm developers available for reference. These are protected under copyright and you must log in with your cluster user name and password to view them.
Parameter | Value | Notes
---|---|---
Default number of CPUs per task | 1 |
Maximum number of CPUs per task | unlimited | No worker node has more than 80 CPU cores (see Bad Constraints in Slurm)
Lowest common denominator number of CPUs | 24 | Any node will have at least this many cores
Default memory allocation per CPU | 2 GB |
Maximum memory allocation per CPU | unlimited | No worker node has more than ~560 GB RAM (see Bad Constraints in Slurm)
Lowest common denominator physical memory | 64 GB | Any node will have at least this much physical memory
Default job run time | 25 hours |
Maximum job run time | 28 days |
Maximum jobs in queue (running plus pending) | 200,000 |
Maximum array size | 25,000 elements |
Lowest common denominator /tmp space | 800 GB | Any node will have at least this much /tmp space
Maximum available /tmp space | 8.5 TB | At least a few nodes will have this much /tmp space
Please note that there may still be minor variations in CPU specification (clock rate, core count) and physical memory within each of these major machine types. However, all machines tagged with a given constraint are guaranteed to be of the same product generation and microarchitecture.
Refer to the table below for a list of available constraints for each machine type.
Machine Type | Representative CPU | Constraint | Notes
---|---|---|---
Dell C6100 | Intel Xeon X5660 | c6100 |
Dell C6220/C6220 II | Intel Xeon E5-2640 v2 | c6220 | Requires hunt partition access
Dell PowerEdge R630 | Intel Xeon E5-2680 v3 | r630 |
Dell PowerEdge R640 | Intel Xeon Gold 6248 | r640 |
Dell PowerEdge R830 | Intel Xeon E5-4650 v4 | r830 | Requires encore partition access
Dell PowerEdge R840 | Intel Xeon Gold 6138 | r840 | Requires encore partition access
Dell PowerEdge R920 | Intel Xeon E7-4890 v2 | r920 | Requires topmed or inpsyght partition access
Dell PowerEdge R930 | Intel Xeon E7-8855 v4 | r930 | Requires topmed, giant-glgc or encore partition access
Dell PowerEdge R940 | Intel Xeon Platinum 8268 | r940 | Requires topmed partition access
HPE DL360G9 | Intel Xeon E5-2680 v3 | dl360g9 |
HPE DL580G9 | Intel Xeon E7-4850 v3 | dl580g9 | Requires topmed partition access
These may be specified to the batch scheduler with the flag:
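For example, node features are requested in Slurm with the --constraint option; here using the c6100 constraint from the table above (the job script name is a placeholder):

$ sbatch --constraint=c6100 myjob.sh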
If you need more specific information about the machine on which your job runs, add the following line to the head of your job script. The host name, make and model of the node, CPU type, clock rate and amount of physical memory will appear in the log file produced by the job.
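As a minimal sketch, standard Linux utilities can report the same information from within a job script (the exact line used on the cluster may differ):

hostname
cat /sys/class/dmi/id/sys_vendor /sys/class/dmi/id/product_name    # make and model of the node
lscpu | grep -E 'Model name|MHz'                                    # CPU type and clock rate
grep MemTotal /proc/meminfo                                         # physical memory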
If you have any questions about detailed machine specifications on the cluster, please open a ticket with our help desk at csg.help@umich.edu.
If you are working on a particular project, you may wish to submit jobs specifically to the cluster nodes associated with that project. You can use the sinfo command to see which partitions are available to you. When submitting jobs, simply add the appropriate partition flag to your sbatch or srun command, as in the sketch below.
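A minimal sketch (the partition and script names are placeholders):

$ sinfo
$ sbatch --partition=<project_partition> myjob.sh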
To maximize resource utilization, compute worker nodes dedicated to a specific project will also take up work from the main cluster batch queues if they are otherwise idle, with the caveat that this work may be preempted and requeued if higher priority work comes in from the project-specific partition (See Job Preemption in Slurm).
Send an email to csg.help@umich.edu to request the use of project-specific compute nodes. An appropriate case for this might be a deadline crunch.
If your job runs a program that is compiled to use AVX2 instructions, specify the following flag when submitting your job:
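A sketch, assuming the cluster exposes an avx2 node feature (the script name is a placeholder):

$ sbatch --constraint=avx2 myjob.sh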
If your job runs a program that is compiled to use AVX512 instructions, specify the following flag when submitting your job:
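Similarly, assuming an avx512 node feature:

$ sbatch --constraint=avx512 myjob.sh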
Nodes supporting a greater level of AVX instructions also include support for previous levels of AVX instructions. A node supporting AVX2 will also support AVX. A node supporting AVX512 will also support AVX2 and AVX.
For an array job, just taking the job ID shown in squeue and passing it to sstat will not work. Instead, take the job ID from squeue and use scontrol to obtain the real job ID of the array step:
$ scontrol show job 40626979_5 | head -1
JobId=40628098 ArrayJobId=40626979 ArrayTaskId=5 JobName=perm
Then take that reported JobID and pass it as the -j parameter to sstat.
$ sstat -j 40628098.batch -o JobID,AveRSS,MaxRSS,AveVMSize,MaxVMSize,AvePages,MaxPages
       JobID     AveRSS     MaxRSS  AveVMSize  MaxVMSize   AvePages   MaxPages
------------ ---------- ---------- ---------- ---------- ---------- ----------
40628098.ba+   1834920K   2293272K     12600K    265184K          0          0
For jobs that are not array jobs, just take the job ID given by squeue and supply that as the -j argument to sstat.
The sacct command takes similar command line arguments to sstat. For example:
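A sketch mirroring the sstat field list above, reusing the array-step job ID from the earlier example:

$ sacct -j 40628098 -o JobID,AveRSS,MaxRSS,AveVMSize,MaxVMSize,Elapsed,State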
The MaxRSS field is the maximum amount of memory used by the job at any time over the course of the run.
The time command is also useful to obtain the memory utilization of a running program. It's important to keep in mind there is both a shell builtin version and a standalone version of the time command. When using the time command to gather memory consumption statistics, be sure to run the standalone time command by furnishing the full path, as the shell builtin version of the time command does not support memory utilization profiling.
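For example, assuming GNU time is installed at /usr/bin/time (the program name is a placeholder); the -v flag prints the maximum resident set size along with other statistics:

$ /usr/bin/time -v ./my_program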
Please be aware that careless use of the email notification facility can result in thousands of email messages being sent to the address specified in the job script in a very short period of time.
Follow the guidelines below when using email notification with your Slurm jobs:
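As a conservative sketch, request mail only when a job ends or fails, and avoid per-task notifications for array jobs (the email address is a placeholder):

#SBATCH --mail-user=your_uniqname@umich.edu
#SBATCH --mail-type=END,FAIL    # omit ARRAY_TASKS so an array job sends one summary message, not one per task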
To maximize hardware utilization, most compute worker nodes associated with a specific project are members of two Slurm partitions: their respective project-specific partition and the "main" partition.
This is not always immediately apparent because Slurm will not show project-specific partitions in sinfo unless you explicitly have access to them.
Under normal circumstances, project-specific compute worker nodes will pick up work from the "main" partition when they are idle. They will run jobs from "main" until they receive jobs via their higher-priority project-specific (mini-cluster) partition. When this occurs, work in progress from the lower-priority "main" partition is preempted and requeued, and the node works on jobs from the project-specific partition until no more are queued there. It then begins picking up work from the "main" partition once again.
Note that only work submitted with the sbatch command will be requeued when preempted. Work submitted via the srun command will simply be terminated by Slurm when preempted.
If you are concerned about your job being preempted (especially for long-running jobs, where preemption can be particularly painful), either submit to the "main-nopreempt" partition, which excludes all project-specific nodes, with the command:
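For example (the script name is a placeholder):

$ sbatch --partition=main-nopreempt myjob.sh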
Or use the Slurm --exclude switch when submitting jobs to "main" to restrict the scheduler to placing your jobs on nodes that are not also a member of a mini-cluster partition:
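For example, with placeholder node names standing in for the project nodes to avoid:

$ sbatch --partition=main --exclude=node1001,node1002 myjob.sh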
Nodes that are not members of any project-specific partition, and on which running jobs will therefore never be preempted, include:
Generally, the more demanding the constraints (e.g., more cores or more physical memory), the fewer nodes can fulfill them, and the longer your job may wait in the queue for a suitable node to become free.
When this occurs, you must cancel your jobs with the scancel command and resubmit them with appropriate constraints.
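For example (the job ID and resource values are placeholders):

$ scancel 12345678
$ sbatch --mem-per-cpu=4G --time=1-00:00:00 myjob.sh    # resubmit with limits a node can actually satisfy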
You can see the list of libraries available in R by running R and executing the command:
> library()
If the module you need is not already available in R on the cluster, we encourage you to contact our help desk at csg.help@umich.edu and request that the library be installed. Our operations team is happy to install R libraries per user request and library installations are almost always completed the same day they are requested. Requesting library installations instead of doing it yourself allows all cluster users to benefit from the library installation, saves you the effort of building and maintaining R libraries and allows our operations team to keep libraries up to date as major system changes occur.
If for development or other purposes you must maintain a private R library repository, follow the steps below.
1. Create a directory in your home directory that will be used to hold R libraries that you build.
$ mkdir -p ~/R/site-library
2. Set the R_LIBS_USER environment variable to point to that directory. If you are using the bash shell, the command would be:
$ export R_LIBS_USER=~/R/site-library
If you are using the tcsh shell, the command would be:
% setenv R_LIBS_USER ~/R/site-library
If the R_LIBS_USER environment variable is not set, R will attempt to install libraries to system directories when you run install.packages(), and this will fail because ordinary user accounts do not have access to write to these directories.
3. Run R and install libraries using the install.packages() command. For example:
$ R

R version 4.1.1 (2021-08-10) -- "Kick Things"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> install.packages('tidyverse', dependencies=TRUE);
When finished, the libraries will be installed to the directory that you configured for R_LIBS_USER.
4. To make the changes persist for all future sessions, update your dotfiles to set R_LIBS_USER when you log in. If you are using the bash shell, the command would be:
$ echo "export R_LIBS_USER=~/R/site-library" >> ~/.bashrc
If you are using the tcsh shell, the command would be:
% echo "setenv R_LIBS_USER ~/R/site-library" >> ~/.cshrc.aliases