The original data (generated by merging three preliminary call sets: (1) by Jared Maguire and colleagues at the Broad Institute; (2) by Yun Li and Goncalo Abecasis at the University of Michigan; and (3) by Quang Le and Richard Durbin at the Sanger Institute) are the March 2010 release of phased data from the 1000 Genomes Project, downloadable from The CEU dataset contains 120 haplotypes. Singletons (SNPs whose minor allele appears only once) are NOT removed.

The files can be fed directly to MaCH. We recommend a 2-step imputation procedure:
(step 1) a representative subset of >= 200 unrelated individuals is used to calibrate model parameters; and
(step 2) actual genotype imputation is performed for every person using the parameters inferred in step 1.
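
One possible way to draw the step 1 subset is sketched below. It is only an illustration, not part of MaCH: the function name make_subset is hypothetical, the pedigree file is assumed to be whitespace-delimited with one individual per line and the family ID in column 1, and "at most one individual per family" is used as a rough stand-in for "unrelated". It also assumes GNU shuf is available.

```shell
#!/bin/sh
# Hypothetical helper (not part of MaCH): build a subset pedigree file
# for step 1 by sampling individuals at random, at most one per family.
# Assumptions: one individual per line, family ID in column 1.
make_subset() {
    ped=$1   # full pedigree file, e.g. sample.ped
    n=$2     # number of individuals to draw, e.g. 200
    out=$3   # output file, e.g. subset.ped
    # Keep the first individual listed for each family ID,
    # then sample n of the remaining lines at random (GNU shuf).
    awk '!seen[$1]++' "$ped" | shuf -n "$n" > "$out"
}
```

For example, make_subset sample.ped 200 subset.ped would write the file used by the first command below. Whether the sampled individuals are truly unrelated and representative still needs to be checked against the study design.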

Example command lines for a 2-step imputation:

mach1 -d sample.dat -p subset.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip -r 100 -o par_infer > mach.infer.log
mach1 -d sample.dat -p sample.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip --errorMap par_infer.erate --crossoverMap par_infer.rec --mle --mldetails > mach.imp.log
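
The second command reads par_infer.erate and par_infer.rec, which are expected from the first run. A small guard such as the following sketch could be run between the two steps to confirm that both files exist and are non-empty. The function name check_params is hypothetical, and the check says nothing about whether the file contents are valid.

```shell
#!/bin/sh
# Hypothetical guard (not part of MaCH): verify that the parameter
# files named on the step 2 command line exist and are non-empty.
check_params() {
    prefix=$1   # output prefix passed to -o in step 1, e.g. par_infer
    for ext in erate rec; do
        if [ ! -s "$prefix.$ext" ]; then
            echo "missing or empty: $prefix.$ext" >&2
            return 1
        fi
    done
    return 0
}
```

For example, check_params par_infer && echo "ready for step 2".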

Please report to Yun Li if a large number of genotyped SNPs are discarded because they are absent from this reference. You can check with the following command:
> grep "will be ignored" mach.*.log
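
To get a quick count per log file rather than the full list, something like the sketch below may help. The function name count_ignored is hypothetical, and the count is only meaningful under the assumption that each discarded SNP produces its own "will be ignored" line in the log.

```shell
#!/bin/sh
# Hypothetical helper: print "<log file><TAB><number of matching lines>"
# for each log file given on the command line.
# Assumption: one "will be ignored" line per discarded SNP.
count_ignored() {
    for log in "$@"; do
        # grep -c prints the count of matching lines (0 if none);
        # "|| true" hides its non-zero exit status when nothing matches.
        n=$(grep -c "will be ignored" "$log" || true)
        printf '%s\t%s\n' "$log" "$n"
    done
}
```

For example, count_ignored mach.*.log prints one line per log file.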

Do not turn on --compact if memory is not an issue.


