1000G 2010-03 Download
The reference data are the March 2010 release of phased data from the 1000 Genomes Project,
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_03/pilot1/.
They were generated by merging three preliminary call sets: (1) by Jared Maguire and colleagues at the Broad Institute; (2) by Yun Li and Goncalo Abecasis at the
University of Michigan; and (3) by Quang Le and Richard Durbin at the Sanger Institute. The CEU dataset contains 120 haplotypes.
Singletons (SNPs with minor allele appearing once) are NOT removed.
The files can be fed directly to MaCH. We recommend a 2-step imputation procedure:
(step 1) a representative subset of >= 200 unrelated individuals is used to calibrate model parameters; and
(step 2) actual genotype imputation is performed for every person using the parameters inferred in step 1.
Example command lines for a 2-step imputation:
mach1 -d sample.dat -p subset.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip -r 100 -o par_infer > mach.infer.log
mach1 -d sample.dat -p sample.ped -s chr20.snps -h chr20.hap --compact --greedy --autoFlip --errorMap par_infer.erate --crossoverMap par_infer.rec --mle --mldetails > mach.imp.log
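The step 1 command reads subset.ped, which has to be prepared from the full pedigree file beforehand. A minimal sketch of one way to draw it is shown here. The helper name make_subset is hypothetical, the sketch assumes one individual per line in the pedigree file and the availability of GNU shuf, and it does not check that the sampled individuals are unrelated or representative.

```shell
# make_subset IN_PED N OUT_PED   (hypothetical helper, not part of MaCH)
# Randomly pick N lines (assumed: one individual per line) from IN_PED
# and write them to OUT_PED. Relatedness is NOT checked here.
make_subset() {
    in_ped=$1
    n=$2
    out_ped=$3

    # Count individuals so we can warn when fewer than N are available.
    total=$(awk 'END { print NR }' "$in_ped")
    if [ "$total" -lt "$n" ]; then
        echo "warning: only $total individuals in $in_ped; using all of them" >&2
    fi

    # shuf -n prints at most N randomly chosen lines without replacement.
    shuf -n "$n" "$in_ped" > "$out_ped"
}
```

For the file names used in the example commands, this could be called as: make_subset sample.ped 200 subset.ped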
Please report to Yun Li if a large number of genotyped SNPs are discarded because they are absent from this
reference. You can check with the following command:
> grep "will be ignored" mach.*.log
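To get a count per log file rather than the matching lines themselves, something like the following sketch may help. The function name count_ignored is hypothetical, and the count equals the number of discarded SNPs only under the assumption that MaCH writes one "will be ignored" line per SNP.

```shell
# count_ignored LOG...   (hypothetical helper, not part of MaCH)
# Print "<log file>: <number of lines containing 'will be ignored'>"
# for each log file given on the command line.
count_ignored() {
    for log in "$@"; do
        # grep -c prints 0 and exits non-zero when nothing matches;
        # "|| true" keeps the loop going in that case.
        n=$(grep -c "will be ignored" "$log" || true)
        echo "$log: $n"
    done
}
```

With the log names from the example commands, this could be called as: count_ignored mach.infer.log mach.imp.log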
If memory is not an issue, omit --compact from the example commands above.