University of Michigan Center for Statistical 


Genotype Imputation

Perhaps the most common use of MACH is to infer genotypes at untyped markers in genome-wide association scans. Imputation makes it relatively straightforward to combine results of genome-wide association scans based on different genotyping platforms (for two early examples of how the process works, see the papers by Willer et al. (Nat Genet, 2008) and Sanna et al. (Nat Genet, 2008)) and can increase the power of association analyses for studies based on a single platform.

To infer missing genotypes, you'll typically provide genotypes for your own samples as input together with haplotypes for a reference sample, such as the HapMap. An alternative is to create a large pooled dataset that includes genotypes both for your own samples and for the reference individuals in a single pedigree file. Since this alternative is not commonly used, we will focus here on describing the first strategy.

Preliminary Checks

Before genotype imputation, you should carry out basic data quality checks on the available genotypes. Typically, we exclude from analysis markers that have low genotyping success rates (perhaps with <95% of genotypes called successfully), unexpected evidence for deviations from Hardy-Weinberg equilibrium (perhaps with an HWE p-value < 0.000001 or so), large numbers of discrepancies among duplicate samples, several Mendelian inconsistencies in available parent-offspring trios, or a low minor allele frequency (MAF < 1% or so). All these checks are platform and study specific, and you'll have to figure out what is appropriate for your data. They are mentioned here as a reminder...
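The thresholds above can be sketched as a per-marker filter. This is a minimal illustration, not part of MACH: the function names are hypothetical, the inputs are genotype counts you would tabulate yourself, and the HWE p-value uses a simple 1-df chi-square approximation (an exact test is usually preferable at very small p-values).

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Approximate HWE p-value from genotype counts (1-df chi-square test)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic marker: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(n_aa, n_ab, n_bb, n_missing,
              min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-6):
    """Apply the call-rate, MAF and HWE thresholds suggested in the text."""
    called = n_aa + n_ab + n_bb
    total = called + n_missing
    if called == 0:
        return False
    call_rate = called / total
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1.0 - freq_a)
    return (call_rate >= min_call_rate
            and maf >= min_maf
            and hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p)
```

The default thresholds mirror the "or so" values quoted above and should be tuned to your platform and study. Duplicate-sample and Mendelian checks need sample-level data and are not covered by this sketch.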

When MACH loads your pedigree and the reference haplotypes, it checks that allele labels in the two samples are compatible and that allele frequencies are broadly comparable. If your sample includes no A/T or G/C SNPs (e.g. because it was genotyped on an Illumina Infinium platform), you can use the --autoFlip option to ensure that alleles in the pedigree file and those in the reference haplotypes refer to the same strand. If your sample does include A/T and G/C SNPs, you'll have to ensure manually that they are aligned to the same strand, and inspect the allele frequency discrepancies identified by MACH to help pinpoint problems. Although it is typical that a small number of SNPs will drift in frequency between populations, we recommend that you read through the warnings generated by MACH. If you see large frequency discrepancies or anything else suspicious ... investigate!
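To see why A/T and G/C SNPs need manual attention, consider the following sketch. It is an illustration of the strand logic only, not MACH's internal --autoFlip implementation, and the function names are hypothetical. For an A/G SNP, the complementary strand reads T/C, so a flip is detectable from the labels alone. For an A/T SNP, the complement is T/A, which looks identical, so the labels cannot tell you which strand was used.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1, a2):
    """True for A/T and C/G SNPs, whose labels look the same on both strands."""
    return COMPLEMENT[a1] == a2


def compare_to_reference(study, reference):
    """Classify a SNP given (allele1, allele2) label pairs from both sources.

    Returns 'ambiguous', 'same', 'flip' or 'mismatch'.
    """
    if is_strand_ambiguous(*study):
        return "ambiguous"  # needs manual checking, e.g. via allele frequencies
    if set(study) == set(reference):
        return "same"
    if {COMPLEMENT[a] for a in study} == set(reference):
        return "flip"       # study alleles are reported on the opposite strand
    return "mismatch"       # labels are incompatible even after flipping
```

SNPs classified as 'ambiguous' are the ones where comparing allele frequencies between your sample and the reference panel is the main remaining clue.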

Newer versions of MACH will automatically ignore any SNPs that are present in your pedigree file but not in the reference panel. SNPs that are present only in the reference panel but not in your pedigree will be imputed!
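The rule above amounts to simple set arithmetic on marker names. The sketch below is only an illustration with a hypothetical function name; it can be handy for predicting how many SNPs will be dropped or imputed before you start a long run.

```python
def partition_snps(study_snps, reference_snps):
    """Split marker names by how they are treated during imputation."""
    study, reference = set(study_snps), set(reference_snps)
    return {
        "genotyped": sorted(study & reference),  # observed and in the panel
        "ignored": sorted(study - reference),    # in your pedigree only
        "imputed": sorted(reference - study),    # in the reference panel only
    }
```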

Step 1: Estimating Model Parameters

Once you are happy with your input dataset, the most (computationally) efficient way to carry out imputation in large GWAS datasets is to use the --greedy option and to carry out a two-step process. The first step is to build a model that relates your samples to the haplotypes in the reference panel. This model includes an estimate of the "error" rate for each marker (an omnibus parameter which captures genotyping error, discrepancies between your platform and the reference panel, and recurrent mutation) and of "crossover" rates for each interval (a parameter that describes breakpoints in haplotype stretches shared between your samples and the reference panel).

The key choices for this first step are the number of iterations expended in estimating model parameters (specified with the --rounds parameter) and the number of individuals in your sample to use for model building. In small samples, it is often okay to include your entire sample in this parameter estimation step. In larger samples, it is usually sufficient to include a random subset of 200-500 individuals.
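One way to draw the random subset is sketched below. This is a hedged example rather than a MACH feature: it assumes a pedigree file with one individual per line and treats individuals as unrelated, so entire lines are sampled independently. The function names and the fixed seed are illustrative choices.

```python
import random


def sample_individuals(ped_lines, n, seed=1):
    """Return a random subset of n pedigree lines, keeping the original order."""
    lines = [line for line in ped_lines if line.strip()]
    if n >= len(lines):
        return lines  # small sample: keep everyone
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    keep = set(rng.sample(range(len(lines)), n))
    return [line for i, line in enumerate(lines) if i in keep]


def write_subset(in_path, out_path, n, seed=1):
    """Read a pedigree file and write a random subset of its individuals."""
    with open(in_path) as handle:
        subset = sample_individuals(handle.readlines(), n, seed)
    with open(out_path, "w") as handle:
        handle.writelines(subset)
```

For example, write_subset("gwas.ped", "gwas_subset.ped", 300) would produce a subset file like the one used in the step 1 command line. The data file (gwas.dat) does not change, because the set of markers is the same.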

A typical command line might look like this:

mach1 -d gwas.dat -p gwas_subset.ped -s hapmap.legend -h hapmap.phased --hapmapFormat --greedy -r 100 --prefix step1

Once all iterations are completed, MACH will store model parameters in two files, step1.rec and step1.erate. The base name of these files is set by the --prefix option. We will use these files as input for the next step, where model parameters will be fixed.

Useful Tip: When analyzing very large samples, the --compact option can help you save memory.

Step 2: Carrying Out Genotype Imputation

This step is relatively quick. It uses the parameters estimated in the previous step, which are calibrated to your specific dataset and genotyping platform, to impute all SNPs in the reference panel in your sampled individuals.

mach1 -d gwas.dat -p gwas.ped -s hapmap.legend -h hapmap.phased --hapmapFormat \
      --crossover step1.rec --errormap step1.erate --greedy --mle --mldetails --prefix step2

The --mle and --mldetails options request that MACH should carry out maximum likelihood genotype imputation. The results of this process are summarized in a .mlinfo file and detailed in a series of additional files. The .mlinfo file is a tabular file with one row per SNP. The fields it contains are as follows:

Column    Description
SNP       Marker name for this SNP
Al1       Allele 1 label (e.g. A, C, G or T)
Al2       Allele 2 label (e.g. A, C, G or T)
Freq      Frequency for Allele 1
Quality   The average posterior probability for the most likely genotype. For a given frequency, markers with higher quality are typically better imputed. However, it is hard to compare quality scores for markers with different minor allele frequencies.
Rsq       A better quality measure, which estimates the squared correlation between imputed and true genotypes. Typically, a cut-off of 0.30 or so will flag most of the poorly imputed SNPs, but only a small number (<1%) of well imputed SNPs.
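Applying the Rsq cut-off can be sketched as follows. The example assumes the .mlinfo file is whitespace-delimited with a header row that includes the SNP and Rsq column names listed above; check the header of your own file, since the exact column set may differ between MACH versions. The function names are hypothetical.

```python
def read_mlinfo(lines):
    """Parse whitespace-delimited .mlinfo text into dicts keyed by column name."""
    rows = [line.split() for line in lines if line.strip()]
    header, records = rows[0], rows[1:]
    return [dict(zip(header, fields)) for fields in records]


def well_imputed(records, min_rsq=0.30):
    """Return the names of SNPs whose estimated Rsq meets the cut-off."""
    return [rec["SNP"] for rec in records if float(rec["Rsq"]) >= min_rsq]
```

With the step 2 command above, you might call read_mlinfo(open("step2.mlinfo")) and pass the result to well_imputed. The 0.30 default follows the rule of thumb in the table and is not a hard limit.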

The additional output files encode the following information:

File      Contents
.mlgeno   Contains the best-guess (i.e., most likely) genotype for each individual at each SNP
.mldose   Contains dosages (i.e., estimated counts) of the reference allele (Al1 in .mlinfo) in each individual. These estimates may be fractional and range from 0.0 to 2.0.
.mlqc     Contains a quality score for each imputed genotype. The quality score is the posterior probability for the most likely genotype, ranging from 0 to 1.
.mlprob   Contains posterior probabilities for the Al1/Al1 and Al1/Al2 genotypes at each marker for each individual.
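The four files are closely related. Given the two posterior probabilities stored in .mlprob, the probability of the Al2/Al2 genotype is the remainder, and the other quantities follow from their definitions. The sketch below shows that arithmetic for a single genotype. It is an illustration of the definitions only; we have not verified that MACH computes its output files in exactly this way, and the function name is hypothetical.

```python
def summarize_genotype(p11, p12):
    """Derive dosage, best-guess genotype and quality from two posteriors.

    p11 = P(Al1/Al1) and p12 = P(Al1/Al2); P(Al2/Al2) is what is left over.
    """
    p22 = max(0.0, 1.0 - p11 - p12)  # guard against tiny negative rounding error
    dosage = 2.0 * p11 + p12         # expected number of copies of Al1 (0.0 to 2.0)
    probs = {"Al1/Al1": p11, "Al1/Al2": p12, "Al2/Al2": p22}
    best = max(probs, key=probs.get)  # most likely genotype, as in .mlgeno
    quality = probs[best]             # its posterior probability, as in .mlqc
    return dosage, best, quality
```

Dosages keep the uncertainty that best-guess genotypes throw away, which is why they are usually the better input for association tests on imputed SNPs.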

Hands-On Example

To try these analyses, go to the examples subdirectory in the MACH distribution and execute the following commands:

# To estimate model parameters ...
prompt> mach1 -d sample.dat -p sample.ped -s hapmap.snps -h hapmap.haplos --greedy --rounds 10 --prefix round1

# To fill in missing genotypes ...
prompt> mach1 -d sample.dat -p sample.ped -s hapmap.snps -h hapmap.haplos --greedy --errormap round1.erate --cross round1.rec --mle --mldetails
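If you run the two steps repeatedly (for example, once per chromosome), it can help to build the command lines in a script. The sketch below only assembles argument lists from options that appear in this tutorial; the function names are hypothetical, and it assumes the mach1 executable is on your PATH when you actually run the commands.

```python
import shlex


def step1_command(dat, ped, snps, haps, rounds, prefix):
    """Command line for estimating model parameters (step 1)."""
    return ["mach1", "-d", dat, "-p", ped, "-s", snps, "-h", haps,
            "--greedy", "--rounds", str(rounds), "--prefix", prefix]


def step2_command(dat, ped, snps, haps, step1_prefix, prefix):
    """Command line for imputation using the parameters from step 1."""
    return ["mach1", "-d", dat, "-p", ped, "-s", snps, "-h", haps,
            "--crossover", step1_prefix + ".rec",
            "--errormap", step1_prefix + ".erate",
            "--greedy", "--mle", "--mldetails", "--prefix", prefix]


def as_shell(command):
    """Render an argument list as a copy-and-paste shell command."""
    return shlex.join(command)
```

Each list can be passed to subprocess.run(command, check=True) so that a failure in step 1 stops the script before step 2 starts. For HapMap-format reference files, you would also add the --hapmapFormat option shown earlier.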


University of Michigan | School of Public Health | Abecasis Lab