MACH 1.0 Tutorial
A full tutorial is not yet available, but the README file (pasted
below) should give you a flavor of how MACH works.
README File
INPUT FILES
===========
Mach 1.0 needs Merlin format data and pedigree files as input.
The data file should look like this:
M marker1
M marker2
...
The pedigree file should list one individual per row. Each row
should start with a family id and individual id, followed by a
father and mother id (which should both be 0, 'zero', since
Mach 1.0 assumes individuals are unrelated), and sex. These initial
columns are followed by a series of marker genotypes, each with
two alleles. Alleles can be coded as 1, 2, 3, 4 or A, C, G, T.
For example:
FAM1001 ID1234 0 0 M 1 1 1 2 2 2
FAM1002 ID1234 0 0 F 1 2 2 2 3 3
Or:
FAM1001 ID1234 0 0 M A A A C C C
FAM1002 ID1234 0 0 F A C C C G G
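The row layout described above can be checked mechanically before
running Mach. The following is a minimal Python sketch, not part of
Mach itself; the function name check_ped_row is hypothetical. The
README does not say how missing genotypes are coded, so the sketch
flags any allele code other than 1, 2, 3, 4, A, C, G, T; extend the
set if your files use a missing-data code.

```python
# Sketch: check one row of a Merlin-format pedigree file as described above.
# Assumption: only the allele codes listed in the README are accepted.
VALID_ALLELES = set("1234ACGT")

def check_ped_row(line, n_markers):
    """Return a list of problems found in one pedigree row (empty if none)."""
    fields = line.split()
    problems = []
    # family id, individual id, father, mother, sex, then two alleles per marker
    expected = 5 + 2 * n_markers
    if len(fields) != expected:
        problems.append(f"expected {expected} columns, found {len(fields)}")
        return problems
    if fields[2] != "0" or fields[3] != "0":
        problems.append("father and mother ids should both be 0")
    for i, allele in enumerate(fields[5:]):
        if allele not in VALID_ALLELES:
            problems.append(f"marker {i // 2 + 1}: unexpected allele code {allele!r}")
    return problems
```

Calling check_ped_row on each line of the pedigree file, with
n_markers set to the number of M lines in the data file, gives a
quick sanity check of both files together.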
USING MACH 1.0 for HAPLOTYPING
==============================
To use Mach 1.0 to haplotype a sample of unrelated individuals,
you'll need a MERLIN format pedigree and data file. You should
make sure that markers are ordered according to their physical
position and use the --phase command line option to request the
output of phased chromosomes.
The key parameters for managing the quality of inferred haplotypes
and the amount of computational effort expended in generating them
are the --rounds and --states parameters. If missing data is not
distributed evenly among the available individuals, you should
also consider the --weighted parameter (which favors using individuals
with more genotype data as templates for haplotyping other individuals).
The parameter --rounds K specifies how many iterations of the Markov
sampler should be run. Larger values generally result in better
solutions, at the cost of longer run times. If there isn't much
missing data, a value of 50 should give a reasonable solution.
The parameter --states K specifies how many haplotypes should be
considered when updating each individual. Larger values will generate
more accurate solutions, but may slow things down a bit (as well as
requiring more memory). A value of 200 or larger typically provides
quite good solutions. The default is to use all available haplotypes
for each update (but this can require a lot of memory and time!).
Other important parameters are --compact (reduces memory use) and
--poll K (to request intermediate solutions after K iterations).
Example Usage:
mach -d sample.dat -p sample.ped --rounds 50 --states 200 --phase
USING MACH 1.0 to INFER UNTYPED MARKERS
=======================================
To use Mach 1.0 to infer genotypes at untyped markers, you
should use the --geno command line option. There are two main
strategies for imputation:
INCLUDE REFERENCE (e.g. HAPMAP) GENOTYPES IN YOUR DATASET:
If you select this option, you should simply create one large
pooled dataset. Some individuals will have missing data and
others will have much more complete genotyping information.
In addition to estimating the most likely genotype for
each individual, you can use the --dosage and --quality
command line options to request additional information about
each inferred genotype.
USE REFERENCE (e.g. HAPMAP) HAPLOTYPES AS INPUT:
If you select this option, you should generate a file that
includes a set of reference haplotypes. These can be typed
at more markers than are available in your sample. You will
also need a small file that lists all the markers that appear
in the phased haplotypes.
Then, to estimate missing genotypes, you'll need to provide
the Merlin format data and pedigree files, the reference
haplotypes and the list of SNPs in the reference haplotypes.
All markers in the pedigree should also appear in the
reference haplotype set.
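The requirement that every marker in the pedigree also appears in
the reference set can be checked before starting a run. The sketch
below is illustrative Python, not part of Mach. It assumes the
reference SNP list is a plain text file of whitespace-separated
marker names, which the README does not specify; the function names
are hypothetical.

```python
def read_dat_markers(dat_text):
    """Marker names from the 'M <name>' lines of a Merlin data file."""
    markers = []
    for line in dat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "M":
            markers.append(fields[1])
    return markers

def missing_from_reference(dat_text, snps_text):
    """Markers in the data file that do not appear in the reference SNP list."""
    # Assumption: the SNP list holds whitespace-separated marker names.
    reference = set(snps_text.split())
    return [m for m in read_dat_markers(dat_text) if m not in reference]
```

An empty result means every sample marker is covered by the
reference; any names returned would need to be dropped from the
sample or added to the reference before imputation.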
Most of the time, you'll get good estimates of genotypes at untyped
markers using the --rounds N and --greedy options.
If you don't use the --greedy option, you can control computational
effort with the --weighted and --states options. However, this
alternative strategy generally requires quite a few more iterations
before converging to a good solution.
Examples:
mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --rounds 50 --greedy --geno
mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --rounds 500 --states 200 --geno
mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --rounds 500 --states 200 --weighted --geno
SPEEDING UP IMPUTATION
======================
The standard genotype imputation approach, described in the
preceding section, works best when you execute a large
number of iterations of the Markov Chain (50-100). These iterations
are used to simultaneously update the crossover map (which determines
the likely locations for haplotype transitions), to update the error
rate map (which flags unusual markers), and to estimate the
missing genotypes.
An alternative approach is to use a single set of estimates for
the crossover and error rate maps and, conditional on these, to
find the most likely genotypes. This approach seems to work quite
well. To use it, use the --crossovermap and --errormap options to
specify estimates of error and crossover rates from a previous
mach run, and request the --mle option instead of --geno.
If you don't have an available set of map estimates, you can
request that Mach estimate them using a small number of iterations
of the Markov Chain with the --rounds option.
Examples:
mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --crossovermap mach.rec --errormap mach.erate --greedy --mle
mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --greedy --mle --rounds 5
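The choice between the two commands above (reuse existing map
estimates, or estimate them with a few rounds) can be captured in a
small wrapper. This is a hedged Python sketch that only assembles the
argument list and uses only the options shown in this README; the
function name mle_command is hypothetical, and nothing is executed.

```python
def mle_command(dat, ped, haplos, snps,
                crossovermap=None, errormap=None, rounds=None):
    """Build an argument list for a --mle run; nothing is executed here."""
    cmd = ["mach", "-d", dat, "-p", ped, "-h", haplos, "-s", snps]
    if crossovermap and errormap:
        # Reuse crossover and error rate estimates from a previous run.
        cmd += ["--crossovermap", crossovermap, "--errormap", errormap]
    elif rounds:
        # No maps available: estimate them with a few Markov Chain iterations.
        cmd += ["--rounds", str(rounds)]
    else:
        raise ValueError("provide both map files or a number of rounds")
    return cmd + ["--greedy", "--mle"]
```

The resulting list can be passed to subprocess.run(cmd, check=True)
if you want to launch Mach from a script.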
MACH1 OUTPUT KEY
================
Mach 1.0 generates a table that provides useful information
about each marker. The filename for the table has the extension
.info or .mlinfo, depending on whether the --mle option is used.
This table includes the marker name, allele labels, and minor allele
frequency for each marker. In addition, the estimated probability
that an average imputed genotype will match an experimental
genotype is output (this should be 1.0 for genotyped markers, and
will often be less for untyped markers). You will also get an
estimate of the r-squared correlation between estimated
genotype scores and true genotypes.
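To make the r-squared measure concrete, the Python sketch below
computes the squared Pearson correlation between a set of genotype
scores and known genotypes at one marker. It illustrates the quantity
only: Mach reports an estimate without access to the true genotypes,
and how it does so is not described here. The sketch assumes both
inputs are coded as allele counts between 0 and 2.

```python
def r_squared(scores, truth):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_t = sum(truth) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(scores, truth))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    if var_s == 0 or var_t == 0:
        return float("nan")  # undefined when either sequence is constant
    return cov * cov / (var_s * var_t)
```

Values close to 1 indicate that the scores track the true genotypes
well; lower values indicate a marker that is harder to impute.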
ASSESSING QUALITY OF SOLUTIONS
==============================
One simple way to empirically assess quality of the solutions
generated by Mach 1.0 is to use the --mask option. This option
hides a small proportion of genotypes from the haplotyper and
then compares the imputed genotypes at these locations with
the actual genotypes.
Examples:
mach -d sample.dat -p sample.ped --rounds 50 --states 200 --mask 0.02
mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --rounds 50 --greedy --mask 0.02
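The idea behind masking can be sketched in a few lines of Python:
hide a random proportion of genotypes, impute them, and count how
often the imputed value matches the hidden one. This is an
illustration of the comparison only, not how Mach implements --mask;
the function names and the use of None to mark a hidden genotype are
assumptions of the sketch. Genotypes are represented as pairs of
alleles.

```python
import random

def mask_genotypes(genotypes, proportion, seed=0):
    """Hide a random proportion of genotypes; return (masked copy, hidden indices)."""
    rng = random.Random(seed)
    n_hidden = round(len(genotypes) * proportion)
    hidden = sorted(rng.sample(range(len(genotypes)), n_hidden))
    masked = list(genotypes)
    for i in hidden:
        masked[i] = None  # None marks a hidden genotype in this sketch
    return masked, hidden

def concordance(actual, imputed, hidden):
    """Fraction of hidden genotypes whose imputed value matches the actual one."""
    if not hidden:
        return float("nan")
    # Compare as unordered allele pairs so that A/C and C/A count as a match.
    matches = sum(sorted(actual[i]) == sorted(imputed[i]) for i in hidden)
    return matches / len(hidden)
```

With a proportion of 0.02, as in the examples above, roughly 2% of
the genotypes are set aside for the comparison.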