MOSIX is a management system for computational resources in a cluster or grid of x86-based Linux computers (nodes). Its aim is to make all the nodes perform like a single computer with multiple processors (almost like an SMP). MOSIX users run their usual (parallel and sequential) applications while MOSIX transparently and automatically seeks resources and migrates their processes among nodes to improve overall performance. For more details see the MOSIX web site. The command man mosix also provides many details about program interfaces.
MOSIX is generic: it provides applications with a run-time environment that is identical to the underlying operating system (currently Linux), so there is no need to change or even link applications with any special library.
MOSIX was originally developed to manage a single private cluster. It has now been extended with new features that can make a number of independent clusters run as a federated system of cooperative computers, collectively called a MOSIX grid. A typical MOSIX grid consists of clusters in several departments and may also include a collection of servers, workstations and organizational shared clusters.
The main goal of MOSIX is to allow owners of nodes to share their computational resources from time to time, while still preserving the autonomy over their own clusters and their ability to disconnect their nodes from the grid at any time without disrupting already running programs.
A MOSIX grid can extend indefinitely as long as there is trust between its cluster owners, which is a key requirement for safe grid computing. This must include guarantees that guest applications will not be modified while running in remote clusters and that no hostile computers can be connected to the local network. Since nowadays these requirements are standard within clusters and intra-organizational grids, we recommend the use of MOSIX in such cases (other than that, nothing prevents the use of MOSIX in any grid).
MOSIX is most suitable for running compute-intensive applications with
a low to moderate amount of I/O. Tests of MOSIX show that the performance
of several such applications over a 1Gb/s campus grid is nearly identical
to that of a single cluster.
MOSIX is particularly suitable for:
MOSIX is a Linux kernel extension for single-system image clustering. This kernel extension turns a network of ordinary computers into a supercomputer for Linux applications. Nodes in the cluster talk to each other and the cluster adapts itself to the workload. Users SSH to a gateway machine which has a very large number of processors and lots of memory. From here the user starts tasks to run on the cluster nodes.
For most practical purposes, logging onto and working with a gateway machine (e.g. fantasia) is pretty much like working on any Unix or Linux system (except that it's a lot more powerful and much more fun!). However, like any computer, the gateway has some limitations to bear in mind when using it in order to keep things running efficiently for everyone. This is particularly true when you start to run simulations using automated scripts or code.
In this situation it's helpful to realize that the gateway is really a set of many computers that have been linked together. The gateway node is where logins, application startup and disk I/O take place, while the other nodes are linked to the gateways and contribute CPU cycles.
Users of the cluster should understand a little about how things work in MOSIX:
Migrated processes use only the CPU and memory on a node. Processes doing lots of reading or writing files can run very slowly. "Lots" is a vague term, but we know from experience that a program which reads or writes a few files and then calculates most of the remaining time works fine in MOSIX. Conversely, a program that creates a thousand files will run more slowly on MOSIX than on a conventional Unix machine. Process migration is slow as computers go, and a program which continually issues system calls (e.g. print) which must be handled by the gateway node will run noticeably slower.
If something goes very wrong on the gateway node, causing it to 'panic' and need a reboot, all processes that were initiated on the gateway are lost. It's important not to overwhelm the gateway nodes. Be sure to run your long-running tasks on other nodes with runon.
Facts about the MOSIX environment on this cluster.
After a user SSHes to a gateway machine and starts a process with runon, the code is loaded on a client machine and begins executing there. Only the executing code and its data in memory are moved to the destination node. Every system call made by the user's code (e.g. opening or closing a file, reading a line of data) sends the process back to the gateway to execute that call. General Unix and shell scripting information is available here.
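Because each open and close is handled on the gateway, one inexpensive habit is to open an output file once rather than once per line. The sketch below is written in POSIX sh rather than csh; the function names and the "result N" lines are hypothetical placeholders. It only illustrates the idea and makes no measured performance claim for this cluster.

```shell
#!/bin/sh
# Illustrative only: function names and arguments are hypothetical.

# Opens, appends to, and closes the output file once per line.
write_per_line() {
    n=$1 out=$2
    : > "$out"                      # start with an empty file
    i=1
    while [ "$i" -le "$n" ]; do
        echo "result $i" >> "$out"  # one open/close per iteration
        i=$((i + 1))
    done
}

# Opens the output file once for the whole loop and closes it at the end.
write_once() {
    n=$1 out=$2
    i=1
    while [ "$i" -le "$n" ]; do
        echo "result $i"
        i=$((i + 1))
    done > "$out"                   # single open/close for all lines
}
```

Both functions produce the same file. The second still issues one write per line, so it reduces rather than eliminates the calls that go to the gateway.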
Some system-calls are not supported by mosrun, including system-calls that are tightly connected to resources of the local node or intended for system-administration. A complete list is provided with man mosrun. The system calls we have run into include:
When you start up a long simulation, you'll usually want to background your job. This will allow you to sign off of the gateway node, and makes it less likely that your job will be interrupted due to an error that causes you to lose your connection. Running your jobs in the background has the additional benefit of allowing you to start up one or more jobs from one computer and check them later from another computer.
To background a script, simply run it as you normally would, but append an & to the command, like this:
runon 13 ./your_script.csh >& your_output &
If you'd still like to see the output from your script as your simulation progresses, you can do so using the tail command
tail -f your_output
to see progress reports as they are written to stdout. Note: Quitting the tail command will not interrupt your job. If you need to stop a job running in the background, you'll need to use top or mosps to get the process id and then stop all processes spawned by the script using the kill command.
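To avoid searching for the process id later, you can record it when the job starts. The following is a minimal sketch in POSIX sh, not csh. The file name job.pid and the three function names are hypothetical. It signals only the recorded process, so anything that process has spawned may still need to be found and stopped separately.

```shell
#!/bin/sh
# Hypothetical helpers: start a command in the background, remember its
# process id in a file, and use that id later to check on or stop it.
PIDFILE="${PIDFILE:-job.pid}"

# Run the given command in the background and save its process id.
start_job() {
    "$@" &
    echo "$!" > "$PIDFILE"
}

# Succeeds if the recorded process still exists (signal 0 sends nothing).
job_running() {
    kill -0 "$(cat "$PIDFILE")" 2>/dev/null
}

# Send the default TERM signal to the recorded process.
stop_job() {
    kill "$(cat "$PIDFILE")" 2>/dev/null
}

# Example (not executed here):
#   start_job runon 13 ./your_script.csh > your_output 2>&1
#   job_running && echo "still running"
#   stop_job
```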
If the gateway machine fails, all of the client nodes will fail. So while there may be nothing wrong with your wonderful code which runs for weeks, sometimes other problems can affect your task. If your task is one single program which calculates for weeks on end, there may be little you can do to protect yourself from a failure. If you are lucky, you might be able to change your program so that intermediate values are saved to a file and then in case of failure, your program can restart with the intermediate values.
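The idea of saving intermediate values can be sketched as follows. This is an illustration in POSIX sh, not code from this cluster: the file name checkpoint.txt, the function names, and the toy running-sum calculation are all hypothetical stand-ins for your program's real state.

```shell
#!/bin/sh
# Hypothetical checkpoint/restart sketch. The state file name and the toy
# "computation" (a running sum of step numbers) are illustrative assumptions.
CKPT="${CKPT:-checkpoint.txt}"

# Load "step total" from the checkpoint if it exists, else start fresh.
load_state() {
    if [ -f "$CKPT" ]; then
        read -r step total < "$CKPT"
    else
        step=0
        total=0
    fi
}

# Write to a temporary file first, then rename it, so a crash in the
# middle of a save never leaves a half-written checkpoint behind.
save_state() {
    printf '%s %s\n' "$step" "$total" > "$CKPT.tmp"
    mv "$CKPT.tmp" "$CKPT"
}

# Continue from the last saved step up to step $1, saving after each step.
run_until() {
    load_state
    while [ "$step" -lt "$1" ]; do
        step=$((step + 1))
        total=$((total + step))
        save_state
    done
}

# Example (not executed here):
#   run_until 1000    # if interrupted, running it again resumes where it stopped
```

How often to save is a trade-off: each save costs file I/O, which is handled on the gateway, so a long calculation would normally checkpoint far less often than once per step.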
A more common situation is that a task really consists of running a program hundreds or even thousands of times, where each step takes a few minutes or hours. In this situation you can often use a wrapper script to keep track of where you are in the entire process; if there is a failure, your script can restart the iteration that was in progress at the time. The following script is a real-world example of just this case. You are encouraged to do something similar, but be careful to tailor this code to your own needs: it is provided only as an example and not as a general solution for everyone.
The following script would normally be executed by runon.
#!/bin/csh -f
# This script will do $n_sims gene-dropping simulations using merlin and
# will save the lod and zmean values for a specific marker ($marker)
# to a results directory called ~/merlin_results
# Requires a file (random.numbers) with random seeds, one per line
# restart == true => start simulation at simulation number recorded in curr_simulation file,
# restart == false => start with sim = 1
set restart = false
set file = "asp"
set marker = "MRK11"
set work_dir = "~/merlin_work"
set save_dir = "~/merlin_results"
# set up working and result directories
mkdir -p $work_dir
mkdir -p $save_dir
cp ${file}.ped ${work_dir}/
cp ${file}.dat ${work_dir}/
cp ${file}.map ${work_dir}/
# the seeds file is read from the work directory after the cd below
cp random.numbers ${work_dir}/
# do actual computation in temporary work directory
cd $work_dir
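# write the header lines only on a fresh run, so that a restart
# keeps the results already collected
if ($restart == "false") then
echo "LOD" > lods
echo "ZMEAN" > zmeans
endif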
@ n_sims = 100
if ($restart == "false") then
@ sim = 1
else
@ sim = `sed -n '1p' curr_simulation`
endif
while ($sim <= $n_sims)
# save index for current simulation to file for restart in case of crash
echo "$sim" > curr_simulation
# select the ith random number from random.numbers
set random = `sed -n "${sim}p" random.numbers`
merlin -p ${file}.ped -d ${file}.dat -m ${file}.map \
--npl --simulate --markerNames -r $random > merlin.output
# save lod and zmean value for $marker
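# (the file name "pairs" is kept from the original script; make sure it
# matches the file that holds your merlin results)
set line = `grep $marker pairs`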
set lod = `echo $line | tr -s ' ' ' ' | cut -d' ' -f5`
set zmean = `echo $line | tr -s ' ' ' ' | cut -d' ' -f2`
echo "$lod" >> lods
echo "$zmean" >> zmeans
@ sim ++
end
rm -f merlin.output
# write results to results directory
mv lods ${save_dir}/lods_${file}
mv zmeans ${save_dir}/z_means_${file}
# Get rid of the working directory
if ($work_dir != "") rm -rf ${work_dir}