MOSIX Computing Clusters
What is MOSIX
MOSIX is a management system for the computational resources of a cluster or a Grid
of x86-based Linux computers (nodes), with the aim of making all the nodes
perform like a single computer with multiple processors (almost like an SMP).
MOSIX users run their usual (parallel and sequential) applications
while MOSIX transparently and automatically seeks resources and
migrates their processes among nodes to improve the overall performance.
For more details see the MOSIX web site.
The command man mosix will also provide lots of details about
program interfaces.
MOSIX is generic: it provides applications with a run-time environment that
is identical to the underlying operating system (currently Linux), so there
is no need to change or even link applications with any special library.
MOSIX was originally developed to manage a single private cluster.
It has now been extended with new features that can make a number of
independent clusters run as a federated system of cooperative computers,
collectively called a MOSIX grid.
A typical MOSIX grid consists of clusters in several departments
and may also include a collection of servers, workstations and
organizational shared clusters.
The main goal of MOSIX is to allow owners of nodes to share their
computational resources from time to time, while still preserving
the autonomy over their own clusters and their ability to disconnect
their nodes from the grid at any time without disrupting already running
programs.
A MOSIX grid can extend indefinitely as long as there
is trust between its cluster owners, which is a key requirement
for safe grid computing. This must include guarantees that guest
applications will not be modified while running in remote clusters
and that no hostile computers can be connected to the local network.
Since nowadays these requirements are standard within clusters and
intra-organizational grids, we recommend the use of MOSIX in such cases
(other than that, nothing prevents the use of MOSIX in any grid).
MOSIX is most suitable for running compute-intensive applications with
a low to moderate amount of I/O. Tests of MOSIX show that the performance
of several such applications over a 1Gb/s campus grid is nearly identical
to that of a single cluster.
MOSIX is particularly suitable for:
- Running applications with unpredictable resource requirements or run times.
- Running long processes - which are automatically dispersed
among grid nodes and are migrated back when nodes are disconnected from the grid.
- Efficient utilization of grid-wide resources -
by automatic resource discovery and load-balancing.
- Combining nodes of different speeds - by migrating
processes among nodes based on their respective speeds, current
load and available memory.
MOSIX is a Linux kernel extension for single-system image clustering.
This kernel extension turns a network of ordinary computers into a
supercomputer for Linux applications.
Nodes in the cluster talk to each other and the cluster adapts itself to
the workload.
Users SSH to a gateway machine which
has a very large number of processors and lots of memory.
From here the user starts tasks to run on the cluster nodes.
The Basics
For most practical purposes, logging onto and working with a gateway machine
(e.g. fantasia) is pretty much like working on any Unix or Linux system
(except that it's a lot more powerful and much more fun!).
However, like any computer, the gateway has some limitations to bear in mind
when using it in order to keep things running efficiently for everyone.
This is particularly true when you start to run simulations using automated
scripts or code.
In this situation it's helpful to realize that the gateway is really a set
of many computers that have been linked together.
The gateway node is where logins, application startup and disk I/O take place,
while the other nodes are linked to the gateways and contribute CPU cycles.
Factors to Consider When Running Simulations on a Gateway
- How much memory will my job take?
Although fantasia has a lot of memory as a whole, individual processes will run
on a single node and will need to fit within the memory available on that node.
This is particularly important to take into consideration when using
statistical genetics software.
The datasets we use tend to be large, and many of the problems are solved using
numerical methods that can require vast amounts of memory.
For linkage analyses (using Merlin or any other program), a primary factor
influencing the amount of memory needed is the size of the family being analysed.
This means that a merlin run that starts out using only 20MB might
easily require 1GB or more of memory later in the simulation.
Generally, it's a good idea to familiarize yourself with your data before
starting a simulation and have some idea of what maximum memory usage may be.
If you're not sure, and you really can't tell from the data, it's a good idea
to start simulations at a time when you can keep an eye on their progress
until you're sure they're running smoothly and then to check on them frequently
as they progress using top.
- How many nodes will I need and what nodes are available?
When writing your scripts, it can be helpful to bear in mind how many
nodes you'll need and to weigh this against your time limitations and the
general load on the cluster.
If most nodes are already occupied or you're working at a time when many other
people are also likely to have work to do, it's usually more reasonable to
choose a smaller set of options to evaluate rather than to load up nodes and
hope for the best.
Overloading the nodes will only slow things down for everyone (including yourself),
and may even cause the cluster to crash.
MOSIX will distribute your work in the most efficient manner it knows how to.
So unless your process has particular requirements which can only be met by
a particular client node, let MOSIX do the node picking.
To see what nodes are available, you can use mon, which shows node loads graphically
and gives an idea of the size and number of jobs running on each node.
Once you know what nodes are available, you can direct your simulation to a
specific node using the runon command described below.
In general it's better to let MOSIX choose the node for you.
NOTE: If you don't use runon, any command typed at the command
prompt will be run on the main node.
These processes cannot be migrated to client nodes in the cluster,
so be sure to launch long running processes with runon.
- Can I run my simulations on the main node?
It's usually best to direct any computationally intensive task to some other node.
When you run a process on a gateway node, all other tasks normally handled there
will have to compete with your simulation for CPU cycles and it is
likely that the cluster as a whole will run more slowly.
- How much disk space will it take to save the output from my simulations?
This can be deceiving, particularly if you are running hundreds or thousands of
repetitions of a program over a variety of parameter values.
While it may seem safer to just save all of your output "just in case",
in practice, it's quite easy to accumulate 10-20GB of output by doing so.
Huge increases in disk space usage can be problematic when space is already tight,
and can cause the entire system to grind to a halt or even crash.
To avoid this situation, it's usually best to write your scripts so that only
values of interest are saved and to do a few test runs to be sure that the programs
you're using aren't creating unusually large output files.
Also, try to have your scripts remove unnecessary files when an analysis is complete.
To check how much disk space you're using in a directory and all directories
below it, you can use du -h.
ls -l will allow you to see sizes of individual files within a directory.
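The disk-space advice above can be sketched as two small helpers. This is only an illustration for a Bourne-style shell (sh or bash), not a script used on the cluster: the function names dir_usage_kb and cleanup_intermediates, and the convention that intermediate files end in .tmp, are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helpers; the names and the *.tmp convention are assumptions.

# Print the total size of a directory tree in kilobytes.
dir_usage_kb() {
    du -sk "$1" | cut -f1
}

# Remove intermediate files (assumed to end in .tmp) below a directory
# once an analysis is complete, leaving the saved results in place.
cleanup_intermediates() {
    find "$1" -type f -name '*.tmp' -exec rm -f {} +
}
```

Calling dir_usage_kb on your results directory before and after a short test run gives a rough idea of how much space each repetition costs, which you can then multiply by the number of repetitions you plan to run.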
Running Commands on MOSIX
Users of the cluster should understand a little about how
things work in MOSIX:
- Users need do nothing special when coding their applications for use on the cluster.
However, you should always preface your commands with runon.
Process migration does not happen automatically for every command; only
commands started with runon are migrated to other nodes.
- When a process is migrated to another node, only the memory is moved.
When a command is executed with runon, the code is loaded
and begins executing on a client machine.
Only the executing code and the data in memory are moved to the destination node.
Any system call made by the user's code (e.g. opening or closing a file, or
getting a line of data) means the process returns to the gateway to
execute that call.
Migrated processes use only the CPU and memory on a node.
Processes doing lots of reading or writing files can run very slowly.
"Lots" is a vague term, but we know from experience
that a program which reads or writes a few files and then calculates most
of the remaining time works fine in MOSIX.
Conversely, a program that creates a thousand files will run more slowly
on MOSIX than on a conventional Unix machine.
Process migration is slow as computers go, and a program which
continually issues system calls (e.g. print) which must be handled
by the gateway node will run noticeably slower.
If something goes very wrong on the gateway node which causes it to
'panic' and need to be rebooted, all processes which were initiated
on the gateway are lost.
It's important not to overwhelm the gateway nodes.
Be sure to run your long-running tasks on other nodes with runon.
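Because file operations are handled back on the gateway, it can help to keep the number of open and close calls small. The following is a minimal sketch for a Bourne-style shell; the function write_squares and its toy calculation are hypothetical stand-ins for a real job. Redirecting the whole loop opens the output file once, whereas appending with >> inside the loop would open and close it on every iteration. Each echo is still a separate write, so this reduces the overhead rather than removing it.

```shell
#!/bin/sh
# Hypothetical example: write N lines of results with a single open of the
# output file, by redirecting the whole loop instead of each echo.
write_squares() {
    n=$1
    out=$2
    i=1
    while [ "$i" -le "$n" ]; do
        echo "$i $((i * i))"
        i=$((i + 1))
    done > "$out"
}
```

For example, write_squares 1000 results.txt produces one file with 1000 lines, rather than a thousand small files or a thousand separate appends.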
Your Environment
Facts about the MOSIX environment on this cluster.
- There are multiple MOSIX gateways, which generally appear identical
to users, but there are differences between them.
- Accounts must be manually created by the cluster administrators (see
Policies below).
The gateways do not automatically provide access to everyone.
Uniqnames for accounts must match uniqnames in the ITD umich.edu realm.
Passwords are separate from the umich.edu Kerberos system we use with AFS.
- Access to AFS data and using AFS passwords is available
on all gateways.
- Java applications and SAS do not get along with MOSIX.
If you need to run sas or java, run the command on the gateway nodes.
If you have exceedingly long Java or SAS tasks to run, maybe you should be
running elsewhere.
- Users' Unix HOME directories reside on a local disk on each gateway.
Users will therefore have to manage the files in a separate HOME on
each gateway and perhaps in AFS as well.
- Although there should be adequate disk space on either gateway, please
remember they are separate machines; users may need to spend time copying files
between them.
Files in users' HOME directories will be backed up to a third, independent source.
Useful Commands on a MOSIX Cluster
After users SSH to a gateway machine and a process is started with runon,
the code is loaded on a client machine and begins executing there.
Only the executing code and the data in memory are moved to the destination node.
Any system call made by the user's code (e.g. opening or closing a file, or
getting a line of data) means the process returns to the gateway to
execute that call.
General Unix and shell scripting information is available
here.
- mon will provide a simple bar graph of all the nodes in
the cluster. While it is running various single-character commands may be
entered to change what is being displayed.
'man mon' will provide a summary of these commands.
- mosps -aeu shows the commands that are executing on the cluster.
It's your most convenient way to find out if your job is still running.
It will also show you the node where your process is executing.
- mtop is a special version of 'top' and will show processes
executing across all nodes.
In general you do not need this, as the only thing of interest in
'mtop' is the node where the code is running.
Use 'top' as you would on normal Unix systems.
Top has many options to help you see only the processes of interest -
read about it with 'man top'.
- runhome your-command will force the command 'your-command' to
run on the gateway machine. It will not be migrated.
This is almost never the right thing to do as it will consume
resources vital to the other nodes.
- runon your-command (or runon nodeid your-command)
will force the command 'your-command' to run on a cluster node.
If you specify the nodeid, the process will be run there.
If you do not specify the nodeid, MOSIX will choose the best node
(this is your best choice).
You cannot coerce MOSIX to run too many tasks on a node, even if
you specify the node.
Rather MOSIX and the 'runon' command will queue your command for
execution when the node is freer.
This is why it's generally better to not specify the nodeid.
You should start all long-running tasks with runon.
MOSIX Limitations
Some system-calls are not supported by mosrun, including system-calls
that are tightly connected to resources of the local node or intended for
system-administration. A complete list is provided with man mosrun.
The system calls we have run into include:
- mmap - MAP_SHARED and mapping of special-character devices
are not permitted.
You might not think this is common, but some programs like Java, R and SAS
use this call, even when it's not necessary.
The runon script uses the -e option of mosrun which attempts
to fix common misuses of this call.
Thus, most things in R will work fine (you might see a warning about
the unsupported system call).
Unfortunately, both Java and SAS use calls which are not supported,
so you need to know that neither Java nor SAS can be used on the cluster.
You may, however, run these programs on the gateway - just don't use them with runon.
How to Send Jobs to the Background
When you start up a long simulation, you'll usually want to background your job.
This will allow you to sign off of the gateway node, and makes it less likely
that your job will be interrupted due to an error that causes you to
lose your connection.
Running your jobs in the background has the additional benefit of allowing
you to start up one or more jobs from one computer and check them later
from another computer.
To background a script, simply run it as you normally would, but append
an & to the command, like this:
runon 13 ./your_script.csh >& your_output &
If you'd still like to see the output from your script as your simulation
progresses, you can do so using the tail command
tail -f your_output
to see progress reports as they are written to stdout.
Note: Quitting the tail command will not interrupt your job.
If you need to stop a job running in the background, you'll need to use
top or mosps to get the process id and then stop all processes
spawned by the script using the kill command.
Scripts for Long Running Tasks
If the gateway machine fails, all of the client nodes will fail.
So while there may be nothing wrong with your wonderful code which runs
for weeks, sometimes other problems can affect your task.
If your task is one single program which calculates for weeks on end,
there may be little you can do to protect yourself from a failure.
If you are lucky, you might be able to change your program so that
intermediate values are saved to a file and then in case of failure,
your program can restart with the intermediate values.
A more common situation is that a task is really composed of running
a program hundreds or even thousands of times, where each step takes
a few minutes or hours.
In this situation you can often use a wrapper script to keep track
of where you are in the entire process and then if there is failure,
your script can restart the iteration that was in progress at the time.
The following script is a real-world example of just this case.
You are encouraged to do something like the following.
Be careful to tailor this code to your own needs. This is only provided
as an example and not as a general solution for everyone.
The following script would normally be executed by runon.
#!/bin/csh -f
# This script will do $n_sims gene-dropping simulations using merlin and
# will save the lod and zmean values for a specific marker ($marker)
# to a results directory called ~/merlin_results
# Requires a file (random.numbers) with random seeds, one per line
# restart == true => start simulation at simulation number recorded in curr_simulation file,
# restart == false => start with sim = 1
set restart = false
set file = "asp"
set marker = "MRK11"
# use $HOME rather than a quoted ~ so the paths expand reliably
set work_dir = "$HOME/merlin_work"
set save_dir = "$HOME/merlin_results"
# set up working and result directories
mkdir -p $work_dir
mkdir -p $save_dir
cp ${file}.ped ${work_dir}/
cp ${file}.dat ${work_dir}/
cp ${file}.map ${work_dir}/
# the seeds are read from inside the work directory, so copy them as well
cp random.numbers ${work_dir}/
# do actual computation in temporary work directory
cd $work_dir
@ n_sims = 100
if ($restart == "false") then
    # fresh start: begin new result files and start at simulation 1
    echo "LOD" > lods
    echo "ZMEAN" > zmeans
    @ sim = 1
else
    # restart: keep the existing results and resume at the recorded simulation
    @ sim = `sed -n '1p' curr_simulation`
endif
while ($sim <= $n_sims)
    # save index for current simulation to file for restart in case of crash
    echo "$sim" > curr_simulation
    # select the ith random number from random.numbers
    set random = `sed -n "${sim}p" random.numbers`
    merlin -p ${file}.ped -d ${file}.dat -m ${file}.map \
        --npl --simulate --markerNames -r $random > merlin.output
    # save lod and zmean value for $marker
    # (assumes the marker's line in merlin.output has zmean in field 2 and lod in field 5)
    set line = `grep $marker merlin.output`
    set lod = `echo $line | tr -s ' ' ' ' | cut -d' ' -f5`
    set zmean = `echo $line | tr -s ' ' ' ' | cut -d' ' -f2`
    echo "$lod" >> lods
    echo "$zmean" >> zmeans
    @ sim++
end
rm -f merlin.output
# write results to results directory
mv lods ${save_dir}/lods_${file}
mv zmeans ${save_dir}/z_means_${file}
# Get rid of the working directory
if ($work_dir != "") rm -rf ${work_dir}
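Stripped of the merlin details, the checkpoint-and-restart idea in the script comes down to a few lines. The following is a minimal sketch for a Bourne-style shell (sh or bash) rather than csh. The names run_one and run_all are hypothetical, and run_one is only a placeholder for whatever program you actually run at each step.

```shell
#!/bin/sh
# Minimal checkpoint/restart loop (illustrative only).

# Hypothetical stand-in for one simulation step; replace with the real command.
run_one() {
    echo "result for step $1"
}

# run_all N_STEPS CHECKPOINT_FILE RESULTS_FILE
# Resumes from the step recorded in CHECKPOINT_FILE if that file exists,
# otherwise starts at step 1 with an empty RESULTS_FILE.
run_all() {
    n_steps=$1
    checkpoint=$2
    results=$3
    if [ -f "$checkpoint" ]; then
        step=$(cat "$checkpoint")
    else
        step=1
        : > "$results"
    fi
    while [ "$step" -le "$n_steps" ]; do
        # record the step about to run, so a crash restarts here
        echo "$step" > "$checkpoint"
        # only append to the results once the step has succeeded
        out=$(run_one "$step") || return 1
        echo "$out" >> "$results"
        step=$((step + 1))
    done
    # finished: nothing left to resume
    rm -f "$checkpoint"
}
```

Running run_all 100 curr_simulation results a second time after a failure picks up at the step that was in progress, without repeating the steps that had already been saved. Unlike the csh script, this sketch decides whether to restart from the presence of the checkpoint file rather than from a restart variable; that is a design choice of the example, not a requirement.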
$RCSfile: mosix.html,v $
$Date: 2007/04/10 12:27:17 $
$Revision: 1.4 $