Genotype Imputation
Perhaps the most common use of MACH is to infer genotypes at untyped markers in genome-wide association scans. The
process makes it relatively straightforward to combine results of genome-wide association scans based on different genotyping platforms
(for two early examples of how the process works, see the papers by Willer et al (Nat Genet, 2008) and Sanna et al (Nat Genet, 2008)) and to increase power
of association analyses for studies based on a single platform.
To infer missing genotypes, you'll typically provide genotypes for your own samples as input together with haplotypes for a
reference sample, such as the HapMap. An alternative is to create a large pooled dataset that includes genotypes both for your own
samples and for the reference individuals in a single pedigree file. Since this alternative is not commonly used, we will focus here
on describing the first strategy.
Preliminary Checks
Before genotype imputation, you should carry out basic data quality checks on available genotypes. Typically, we exclude from
analysis markers that have low genotyping success rates (perhaps with <95% of genotypes called successfully), unexpected evidence for
deviations from Hardy-Weinberg equilibrium (perhaps with an HWE p-value < 0.000001 or so), large numbers of discrepancies among
duplicate samples or several Mendelian inconsistencies in available parent-offspring trios, or that are rare (with MAF < 1% or
so). All these checks are platform and study specific, and you'll have to figure out what is appropriate for your data. They are
mentioned here as a reminder.
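As a rough illustration of the checks above, the sketch below filters markers on summary statistics you have already computed with your own QC tools. The function name, the dictionary keys and the example markers are hypothetical; the default thresholds are simply the "perhaps" values quoted above and should be adapted to your platform and study.

```python
def passes_qc(marker, min_call_rate=0.95, min_hwe_p=1e-6, min_maf=0.01):
    """Return True if a marker clears all three (illustrative) thresholds."""
    return (
        marker["call_rate"] >= min_call_rate  # genotyping success rate
        and marker["hwe_p"] >= min_hwe_p      # Hardy-Weinberg test p-value
        and marker["maf"] >= min_maf          # minor allele frequency
    )

# Hypothetical per-marker summaries, one dictionary per SNP.
markers = [
    {"name": "snp_ok", "call_rate": 0.99, "hwe_p": 0.20, "maf": 0.15},
    {"name": "snp_low_call", "call_rate": 0.90, "hwe_p": 0.50, "maf": 0.20},
    {"name": "snp_hwe", "call_rate": 0.99, "hwe_p": 1e-9, "maf": 0.30},
    {"name": "snp_rare", "call_rate": 0.99, "hwe_p": 0.40, "maf": 0.005},
]

kept = [m["name"] for m in markers if passes_qc(m)]
print(kept)
```

Checks based on duplicate discrepancies and Mendelian inconsistencies are left out of this sketch because they depend on how your study records duplicates and trios.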
When MACH loads your pedigree and the reference haplotypes, it checks that allele labels in the two samples are compatible and that
allele frequencies are broadly comparable. If your sample includes no A/T or G/C SNPs (e.g. because it was genotyped on an Illumina
Infinium platform), you can use the --autoFlip option to ensure that alleles in the pedigree file and those in the reference haplotypes
refer to the same strand. If your sample does include A/T and G/C SNPs, you'll have to ensure they are aligned to the same strand
manually and inspect allele frequency discrepancies identified by MACH to help pinpoint problems. Although it is typical that a
small number of SNPs will drift in frequency between populations, we recommend that you read through the warnings generated by MACH. If
you see large frequency discrepancies or anything else suspicious ... investigate!
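To see why A/T and G/C SNPs need manual attention, consider the minimal sketch below. It is not how MACH implements --autoFlip; it only illustrates the logic under the assumption that alleles are given as single letters. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_strand_ambiguous(a1, a2):
    """A/T and G/C SNPs look identical on both strands."""
    return COMPLEMENT[a1] == a2

def align_to_reference(study, reference):
    """Return the study alleles on the reference strand, or None if unresolved.

    `study` and `reference` are (allele1, allele2) tuples.
    """
    if is_strand_ambiguous(*study):
        return None  # labels alone cannot reveal the strand; check manually
    if set(study) == set(reference):
        return study  # already on the same strand
    flipped = (COMPLEMENT[study[0]], COMPLEMENT[study[1]])
    if set(flipped) == set(reference):
        return flipped  # complementing the labels resolves the mismatch
    return None  # alleles are incompatible even after flipping

print(align_to_reference(("A", "G"), ("T", "C")))
print(align_to_reference(("A", "T"), ("A", "T")))
```

For an A/G SNP, a strand flip turns the labels into T/C, so a mismatch with the reference is detectable. For an A/T SNP, a flip yields T/A, which is indistinguishable from the original labels, so only allele frequencies or platform annotation can settle the strand.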
Newer versions of MACH will automatically ignore any SNPs that are present in your pedigree file but not in the reference panel.
SNPs that are present only in the reference panel but not in your pedigree will be imputed!
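The marker handling just described amounts to simple set arithmetic. The sketch below illustrates it under the assumption that you have the marker names from your data file and from the reference panel as plain lists; the function name and the SNP names are made up for the example.

```python
def partition_markers(study_snps, reference_snps):
    """Classify markers by where they appear (illustrative only)."""
    study, reference = set(study_snps), set(reference_snps)
    return {
        "ignored": sorted(study - reference),  # only in your pedigree file
        "used": sorted(study & reference),     # genotyped and in the panel
        "imputed": sorted(reference - study),  # only in the reference panel
    }

groups = partition_markers(
    ["snpA", "snpB", "snpC"],          # hypothetical study markers
    ["snpB", "snpC", "snpD", "snpE"],  # hypothetical reference markers
)
print(groups)
```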
Step 1: Estimating Model Parameters
Once you are happy with your input dataset, the most (computationally) efficient way to carry out imputation in large GWAS datasets
is to use the --greedy option and to carry out a two-step process. The first step is to build a model that relates your samples to
the haplotypes in the reference panel. This model includes both an estimate of the "error" rate for each marker (an omnibus
parameter that captures genotyping error, discrepancies between your platform and the reference panel, and recurrent mutation)
and of "crossover" rates for each interval (a parameter that describes breakpoints in haplotype stretches shared between your
samples and the reference panel).
The key choices for this first step are the number of iterations expended in estimating model parameters (specified with the
--rounds parameter) and the number of individuals in your sample to use for model building. In small samples, it
is often okay to include your entire sample in this model parameter estimation step. In larger samples, it is usually sufficient to
include a random subset of 200-500 individuals in this step.
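One way to draw such a random subset is sketched below. It assumes a pedigree file with one individual per line and a sample of unrelated individuals, so that dropping lines does not break up families; if your file contains pedigrees, you would want to sample whole families instead. The function name and the toy pedigree rows are hypothetical.

```python
import random

def sample_individuals(ped_lines, n, seed=0):
    """Keep a random subset of n pedigree rows, preserving file order."""
    rows = [line for line in ped_lines if line.strip()]  # skip blank lines
    if n >= len(rows):
        return rows  # small sample: use everyone
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    keep = set(rng.sample(range(len(rows)), n))
    return [row for i, row in enumerate(rows) if i in keep]

# Toy input: 1000 unrelated individuals with a single genotype column.
rows = ["FAM%d ID%d 0 0 1 A/G" % (i, i) for i in range(1000)]
subset = sample_individuals(rows, 300)
print(len(subset))
```

Writing the selected rows to a new file gives you a reduced pedigree file to pass to the parameter estimation step, while the full file is kept for the imputation step.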
A typical command line might look like this:
mach1 -d gwas.dat -p gwas_subset.ped -s hapmap.legend -h hapmap.phased --hapmapFormat --greedy -r 100 --prefix step1
Once all iterations are completed, MACH will store model parameters in two files, step1.rec (crossover rates) and step1.erate
(error rates). The base name of both files is set by the --prefix option. We will use these files as input for the next step,
where model parameters will be fixed.
Useful Tip: When analyzing very large samples, the --compact option can
help you save memory.
Step 2: Carrying Out Genotype Imputation
This step is relatively quick. It uses the parameters estimated in the previous step, which are calibrated to your specific dataset
and genotyping platform, to impute all SNPs in the reference panel in your sampled individuals.
mach1 -d gwas.dat -p gwas.ped -s hapmap.legend -h hapmap.phased --hapmapFormat \
--crossover step1.rec --errormap step1.erate --greedy --mle --mldetails --prefix step2
The --mle and --mldetails options request that MACH should carry out maximum likelihood genotype imputation. The
results of this process are summarized in a .mlinfo file and detailed in a series of additional files. The .mlinfo file
is a tabular file with one row per SNP. The fields it contains are as follows:
Column | Description
SNP | Marker name for this SNP
Al1 | Allele 1 label (e.g. A, C, G or T)
Al2 | Allele 2 label (e.g. A, C, G or T)
Freq | Frequency for Allele 1
Quality | The average posterior probability for the most likely genotype. For a given frequency, markers with higher quality are typically better imputed. However, it is hard to compare quality scores for markers with different minor allele frequencies.
Rsq | A better quality measure, which estimates the squared correlation between imputed and true genotypes. Typically, a cut-off of 0.30 or so will flag most of the poorly imputed SNPs, while flagging only a small number (<1%) of well imputed SNPs.
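As an illustration of how the Rsq cut-off might be applied, the sketch below reads .mlinfo content and keeps markers at or above a threshold. It assumes the file is whitespace-delimited with a header row that includes columns named SNP and Rsq, as in the table above; check the header of your own output before relying on it. The function names and the example rows are hypothetical.

```python
def read_mlinfo(lines):
    """Yield one dictionary per SNP, keyed by the header column names."""
    rows = iter(lines)
    header = next(rows).split()
    for line in rows:
        fields = line.split()
        if fields:  # skip blank lines
            yield dict(zip(header, fields))

def well_imputed(lines, min_rsq=0.30):
    """Return the names of SNPs whose Rsq is at least min_rsq."""
    return [r["SNP"] for r in read_mlinfo(lines) if float(r["Rsq"]) >= min_rsq]

# Made-up rows in the assumed layout.
example = [
    "SNP Al1 Al2 Freq Quality Rsq",
    "snp1 A G 0.80 0.95 0.91",
    "snp2 C T 0.97 0.94 0.12",
]
print(well_imputed(example))
```

With a real file, you could pass an open file handle instead of the list, for example `well_imputed(open("step2.mlinfo"))`, where the file name follows from the --prefix used in the imputation command.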
The additional output files encode the following information:
File | Contents
.mlgeno | Contains the best-guess (i.e., most likely) genotype for each individual at each SNP.
.mldose | Contains dosages (i.e., estimated counts) of the reference allele (Al1 in .mlinfo) in each individual. These estimates may be fractional and range from 0.0 to 2.0.
.mlqc | Contains a quality score for each imputed genotype. The quality score is the posterior probability for the most likely genotype, ranging from 0 to 1.
.mlprob | Contains posterior probabilities for the Al1/Al1 and Al1/Al2 genotypes at each marker for each individual.
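The quantities in these files are closely related, and a small sketch may help make that concrete. It assumes that the probability of the third genotype (Al2/Al2) is whatever remains after the two probabilities stored in .mlprob, and it computes an expected allele count, a best-guess genotype and a per-genotype quality score from them. Values that MACH writes may differ slightly because of rounding; the function name is hypothetical.

```python
def summarize_genotype(p11, p12):
    """Derive dosage, best-guess genotype and quality from two posteriors.

    p11 is the posterior probability of Al1/Al1 and p12 that of Al1/Al2.
    """
    p22 = 1.0 - p11 - p12        # assumed remainder: probability of Al2/Al2
    dosage = 2.0 * p11 + p12     # expected number of copies of Al1
    probs = {"Al1/Al1": p11, "Al1/Al2": p12, "Al2/Al2": p22}
    best = max(probs, key=probs.get)  # most likely genotype
    return dosage, best, probs[best]  # probs[best] plays the role of quality

print(summarize_genotype(0.5, 0.25))
```

The dosage is usually the more useful quantity for association testing because it keeps the uncertainty that a best-guess genotype discards.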
Hands-On Example
To try these analyses, go to the examples subdirectory in the mach distribution and execute the following commands:
# To estimate model parameters ...
prompt> mach1 -d sample.dat -p sample.ped -s hapmap.snps -h hapmap.haplos --greedy --rounds 10 --prefix round1
# To fill in missing genotypes ...
prompt> mach1 -d sample.dat -p sample.ped -s hapmap.snps -h hapmap.haplos --greedy --errormap round1.erate --cross round1.rec --mle --mldetails