MOSIX Computing Clusters

MOSIX Computing Clusters

What is MOSIX

MOSIX is a management system of computational resources in a cluster or a Grid of x86 based Linux computers (nodes) with the aim of making all the nodes perform like a single computer with multiple processors (almost like an SMP). MOSIX users run their usual (parallel and sequential) applications while MOSIX transparently and automatically seeks resources and migrate their processes among nodes to improve the overall performance. For more details see the MOSIX web site. The command man mosix will also provide lots of details about program interfaces.

MOSIX is generic: it provides applications with a run-time environment that is identical to the underlying operating system (currently Linux), so there is no need to change or even link applications with any special library.

MOSIX was originally developed to manage a single private cluster. It has now been extended with new features that can make a number of independent clusters run as a federated system of cooperative computers, collectively called a MOSIX grid. A typical MOSIX grid consists of clusters in several departments and may also include a collection of servers, workstations and organizational shared clusters.

The main goal of MOSIX is to allow owners of nodes to share their computational resources from time to time, while still preserving the autonomy over their own clusters and their ability to disconnect their nodes from the grid at any time without disrupting already running programs.

A MOSIX grid can extend indefinitely as long as there is trust between its cluster owners, which is a key requirement for safe grid computing. This must include guarantees that guest applications will not be modified while running in remote clusters and that no hostile computers can be connected to the local network. Since nowadays these requirements are standard within clusters and intra-organizational grids, we recommend the use of MOSIX in such cases (other than that, nothing prevents the use of MOSIX in any grid).

MOSIX is most suitable for running compute intensive applications with low to moderate amount of I/O. Tests of MOSIX show that the performance of several such applications over a 1Gb/s campus grid is nearly identical to that of a single cluster. MOSIX is particularly suitable for:

  • Running applications with unpredictable resource requirements or run times.

  • Running long processes - which are automatically dispersed among grid nodes and are migrated back when nodes are disconnected from the grid.

  • Efficient utilization of grid-wide resources - by automatic resource discovery and load-balancing.

  • Combining nodes of different speeds - by migrating processes among nodes based on their respective speeds, current load and available memory.

MOSIX is a Linux kernel extension for single-system image clustering. This kernel extension turns a network of ordinary computers into a supercomputer for Linux applications. Nodes in the cluster talk to each other and the cluster adapts itself to the workload. Users SSH to a gateway machine which has a very large number of processors and lots of memory. From here the user starts tasks to run on the cluster nodes.

The Basics

For most practical purposes, logging onto and working with a gateway machine (e.g. fantasia) is pretty much like working on any Unix or Linux system (except that it's a lot more powerful and much more fun!). However, like any computer, the gateway has some limitations to bear in mind when using it in order to keep things running efficiently for everyone. This is particularly true when you start to run simulations using automated scripts or code.

In this situation it's helpful to realize that the gateway is really a set of many computers that have been linked together. The gateway node is where logins, application startup and disk I/O take place, while the other nodes are linked to the gateways and contribute CPU cycles.

Factors to Consider When Running Simulations on a Gateway

  1. How much memory will my job take? Although fantasia has a lot of memory as a whole, individual processes will run on a single node and will need to fit within the memory available on that node. This is particularly important to take into consideration when using statistical genetics software. The datasets we use tend to be large, and many of the problems are solved using numerical methods that can require vast amounts of memory. For linkage analyses (using Merlin or any other program), a primary factor influencing the amount of memory needed is the size of the family being analysed. This means that a merlin run that starts out using only 20MB might easily require 1GB or more of memory later in the simulation. Generally, it's a good idea to familiarize yourself with your data before starting a simulation and have some idea of what maximum memory usage may be. If you're not sure, and you really can't tell from the data, it's a good idea to start simulations at a time when you can keep an eye on their progress until you're sure they're running smoothly and then to check on them frequently as they progress using top.

  2. How many nodes will I need and what nodes are available? When writing your scripts, it can be helpful to bear in mind how many nodes you'll need and to weigh this against your time limitations and the general load on the cluster. If most nodes are already occupied or you're working at a time when many other people are also likely to have work to do, it's usually more reasonable to choose a smaller set of options to evaluate rather than to load up nodes and hope for the best. This will only slow things down for everyone (including yourself), and may even cause the cluster to crash. MOSIX will distribute your work in the most efficient manner it knows how to. So unless your process has particular requirements which can only be met by a particular client node, let MOSIX do the node picking.

    To see what nodes are available, you can use mon which shows node loads graphically if you want to determine the size and number of jobs running on each node. Once you know what nodes are available, you can direct your simulation to a specific node using the runon command described below. In general it's better to let MOSIX choose the node for you.

    NOTE: If you don't use runon, any command typed at the command prompt will be run on the main node. These processes cannot be migrated to client nodes in the cluster, so be sure to launch long running processes with runon.

  3. Can I run my simulations on the main node? It's usually best to direct any computationally intensive task to some other node. When you run a process on a gateway node, all other tasks normally handled there will have to compete with your simulation for CPU cycles and it is likely that the cluster as a whole will run more slowly.

  4. How much disk space will it take to save the output from my simulations? This can be deceiving, particularly if you are running hundreds or thousands of repetitions of a program over a variety of parameter values. While it may seem safer to just save all of your output "just in case", in practice, it's quite easy to accumulate 10-20GB of output by doing so. Huge increases in disk space usage can be problematic when space is already tight, and can cause the entire system to grind to a halt or even crash. To avoid this situation, it's usually best to write your scripts so that only values of interest are saved and to do a few test runs to be sure that the programs you're using aren't creating unusually large output files. Also, try to have your scripts remove unnecessary files when an analysis is complete.

    To check how much disk space you're using in a directory and all directories below it, you can use du -h. ls -l will allow you to see sizes of individual files within a directory.

Running Commands on MOSIX

Users of the cluster should understand a little about how things work in MOSIX:

  • Users need do nothing when coding their applications for use on the cluster. You should always preface your commands with runon. Process migration does not happen automatically, but only runon commands are migrated to other nodes.

  • When a process is migrated to another node, only the memory is moved. When a command is executed with runon, the code is loaded and begins executing on a client machine. Only the executing code and data in memory is moved to the destination node. All system calls by user's code (open, close files) mean the process returns to the gateway to execute the system call (open, close, get a line of data etc.).

Migrated processes use only the CPU and memory on a node. Processes doing lots of reading or writing files can run very slowly. "Lots" is a vague term, but we know from experience that a program which reads or writes a few files and then calculates most of the remaining time works fine in MOSIX. Conversely, a program that creates a thousand files will run more slowly on MOSIX than on a conventional Unix machine. Process migration is slow as computers go, and a program which continually issues system calls (e.g. print) which must be handled by the gateway node will run noticeably slower.

If something goes very wrong on the gateway node which causes it to 'panic' and need to be rebooted, all processes which were initiated on the gateway are lost. It's important to not overwhelm the gateway nodes. Be sure to run your long-running tasks on other nodes with runon

Your Environment

Facts about the MOSIX environment on this cluster.

  • There MOSIX gateways which generally appear identical to users, but there are differences between them.

  • Accounts must be manually created by the cluster administrators (see Policies below). The gateways do not automatically provide access to everyone. Uniqnames for accounts must match uniqnames in the ITD umich.edu realm. Passwords are separate from the umich.edu Kerberos system we use with AFS.

  • Access to AFS data and using AFS passwords is available on all gateways.

  • Java applications and SAS do not get along with MOSIX. If you need to run sas or java, run the command on the gateway nodes. If you have exceedingly long Java or SAS tasks to run, maybe you should be running elsewhere.

  • User's Unix HOMEs will reside on a local disk for each gateway. Users will therefore have to manage the files in separate HOME on each gateway and perhaps in AFS as well.

  • Although there should be adequate disk space on either gateway, please remember they are separate machines; users may need to spend time copying files between them. Files in user's HOME will be backed up to a third, independent source.

Useful Commands on a MOSIX Cluster

After users SSH to a gateway machine and a process is started with runon, the code is loaded on a client machine and begins executing there. Only the executing code and data in memory is moved to the destination node. All system calls by user's code (e.g. open, close files) mean the process returns to the gateway to execute the system call (open, close, get a line of data etc.). General Unix and shell scripting information is available here.

  • mon will provide a simple bar graph of all the nodes in the cluster. While it is running various single-character commands may be entered to change what is being displayed. 'man mon' will provide a summary of these commands.

  • mosps -aeu shows the commands that are executing on the cluster. It's your most convenient way to find out if your job is still running. It will also show you the node where your process is executing.

  • mtop is a special version of 'top' and will show processes executing across all nodes. In general you do not need this, as the only thing of interest in 'mtop' is the node where the code is running. Use 'top' as you would on normal Unix systems. Top has many options to help you see only the processes of interest - read about it with 'man top'.

  • runhome your-command will force the command 'your-command' to run on the gateway machine. It will not be migrated. This is almost never the right thing to do as it will consume resources vital to the other nodes.

  • runon your-command (or runon nodeid your-command) will force the command 'your-command' to run on a cluster node. If you specify the nodeid, the process will be run there. If you do not specify the nodeid, MOSIX will choose the best node (this is your best choice). You cannot coerce MOSIX to run too many tasks on a node, even if you specify the node. Rather MOSIX and the 'runon' command will queue your command for execution when the node is freer. This is why it's generally better to not specify the nodeid. You should start all long-running tasks with runon.

MOSIX Limititions

Some system-calls are not supported by mosrun, including system-calls that are tightly connected to resources of the local node or intended for system-administration. A complete list is provided with man mosrun. The system calls we have run into include:

  1. mmap - MAP_SHARED and mapping of special-character devices are not permitted. You might not think this is common, but some programs like Java, R and SAS use this call, even when it's not necessary. The runon script uses the -e option of mosrun which attempts to fix common misuses of this call. Thus, most things in R will work fine (you might see a warning about the unsupported system call). Unfortunately, both Java and SAS use calls which are not supported, so you need to know that niether Java nor SAS can be used on the cluster. You may, however, run these programs on the gateway - just don't use it with runon.

How to Send Jobs to the Background

When you start up a long simulation, you'll usually want to background your job. This will allow you to sign off of the gateway node, and makes it less likely that your job will be interrupted due to an error that causes you to lose your connection. Running your jobs in the background has the additional benefit of allowing you to start up one or more jobs from one computer and check them later from another computer.

To background a script, simply run it as you normally would, but append an & to the command, like this:

  runon 13 ./your_script.csh >& your_output &

If you'd still like to see the output from your script as your simulation progresses, you can do so using the tail command

  tail -f your_output 
to see progress reports as they are written to stdout. Note: Quitting the tail command will not interrupt your job. If you need to stop a job running in the background, you'll need to use top or mosps to get the process id and then stop all processes spawned by the script using the kill command.

Scripts for Long Running Tasks

If the gateway machine fails, all of the client nodes will fail. So while there may be nothing wrong with your wonderful code which runs for weeks, sometimes other problems can affect your task. If your task is one single program which calculates for weeks on end, there may be little you can do to protect yourself from a failure. If you are lucky, you might be able to change your program so that intermediate values are saved to a file and then in case of failure, your program can restart with the intermediate values.

A more common situation is that a task is really composed of running a program hundreds or even thousands of times, which each step takes a few minutes or hours. In this situation you can often use a wrapper script to keep track of where you are in the entire process and then if there is failure, your script can restart the iteration that was in progress at the time. The following script is a real-world example of just this case. You are encouraged to do something like the following. Be careful to tailor this code to your own needs. This is only provided as an example and not as a general solution for everyone.

The following script would normally be executed by runon.

#!/bin/csh -f 

# This script will do $n_sim gene-dropping simulations using merlin and
# will save the lod and zmean values for a specific marker ($marker)
# to a results directory called ~/merlin_results
# Requires a file (random.numbers) with random seeds, one per line

# restart == true  => start simulation at simulation number recorded in curr_simulation file, 
# restart == false => start with sim = 1
set restart = false

set file = "asp"
set marker = "MRK11"
set work_dir = "~/merlin_work"
set save_dir = "~/merlin_results"

# set up working and result directories
mkdir -p $work_dir
mkdir -p $save_dir

cp ${file}.ped ${work_dir}/
cp ${file}.dat ${work_dir}/
cp ${file}.map ${work_dir}/

# do actual computation in temporary work directory
cd $work_dir
echo "LOD" > lods
echo "ZMEAN" > zmeans

@ n_sims = 100
if ($restart == "false") then
   @ sim = 1
else
   @ sim = `sed -n '1p" curr_simulation"
endif 

while ($sim <= $n_sims)  
   # save index for current simulation to file for restart in case of crash
   echo "$sim" > curr_simulation
   # select the ith random number from random.numbers
   set random = `sed -n "$sim q;d" random.numbers` 
   merlin -p ${file}.ped -d ${file}.dat -m ${file}.map \
      --npl --simulate --markerNames -r $random > merlin.output
   # save lod and zmean value for  $marker 
   set line = `grep $marker_name pairs`   
   set lod = `echo $line | tr -s ' ' ' ' | cut -d' ' -f5`
   set zmean = `echo $line | tr -s ' ' ' ' | cut -d' ' -f2`
   echo "$lod" >> lods
   echo "$zmean" >> zmeans 
   @ sim ++
end
rm -f merlin.output

# write results to results directory
mv lods ${save_dir}/z_means_${file}
mv zmeans ${save_dir}/lods_${file}

# Get rid of the working directory
if ($work_dir != "")  rm -rf ${work_dir}

Resources


MOSIX Computing Clusters
$RCSfile: mosix.html,v $
$Date: 2007/04/10 12:27:17 $
$Revision: 1.4 $