MACH Input Files
MACH input files include information on experimental genotypes for a set of individuals and, optionally, on a set of known
haplotypes. MACH can estimate haplotypes for each sampled individual (conditional on the observed genotypes) or fill in missing
genotypes (conditional on observed genotypes at flanking markers and on the observed genotypes in other individuals). Since an
essential first step in any analysis is to make sure data is formatted correctly, it is worthwhile to go over the input files MACH
expects and their formats.
Observed Genotypes
The essential input for MACH is a set of observed genotypes for each individual being studied. Typically, MACH expects that all
the markers being examined map to one chromosome and that they appear in map order in the input files. These requirements
can be relaxed when using phased haplotypes as input (see below).
MACH expects observed genotype data to be stored in a set of matched pedigree and data files. The two files are intrinsically
linked: the data file describes the contents of the pedigree file (every pedigree file is slightly different) and the pedigree
file itself can only be decoded with its companion data file. The two files can use either the more modern Merlin/QTDT format or
the classic LINKAGE format. Detailed descriptions of each format are available elsewhere,
and here we focus on providing an overview of the bare essentials of the Merlin/QTDT format required for using MACH.
Data files can describe a variety of fields, including disease status information, quantitative traits and covariates, and marker
genotypes. A simple MACH data file just lists names for a series of genetic markers. Each marker name appears on its own line,
prefaced by an "M" field code. Here is an example:
<Example of a simple data file>
M marker1
M marker2
...
<End of simple data file>
The actual genotypes are stored in a pedigree file. The pedigree file encodes one individual per row. Each row should start with a
family id and individual id, followed by a father and mother id (which typically are both set to 0, 'zero', since the current version
of MACH assumes all sampled individuals are unrelated), and sex. These initial columns are followed by a series of marker genotypes,
each with two alleles. Alleles can be coded as 1, 2, 3, 4 or A, C, G, T. See below for an example:
<Example of a pedigree file with numerically coded alleles>
FAM1001 ID1234 0 0 M 1 1 1 2 2 2
FAM1002 ID5678 0 0 F 1 2 2 2 3 3
...
<End of pedigree file>
<Example of a pedigree file with base-pair coded alleles>
FAM1001 ID1234 0 0 M A A A C C C
FAM1002 ID5678 0 0 F A C C C G G
...
<End of pedigree file>
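Because the data file determines how each pedigree row is read, it can be worth checking that the two files agree before running MACH. The following Python sketch is not part of MACH; the function names are hypothetical. It assumes a data file that contains only "M" marker lines, so each pedigree row should hold five leading columns plus two alleles per marker. Codes for missing genotypes are not covered in this overview, so the allowed allele set below would need extending if your files use them.

```python
def read_marker_names(dat_lines):
    """Collect marker names from data-file lines of the form 'M <name>'."""
    markers = []
    for line in dat_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "M":
            markers.append(fields[1])
    return markers


def check_pedigree_row(row, n_markers):
    """Return a list of problems found in one pedigree row (empty if none)."""
    fields = row.split()
    # family id, individual id, father, mother, sex, then two alleles per marker
    expected = 5 + 2 * n_markers
    if len(fields) != expected:
        return [f"expected {expected} columns, found {len(fields)}"]
    # Assumption: only the allele codes listed above are allowed.
    valid = set("1234ACGT")
    return [f"unexpected allele code {a!r}" for a in fields[5:] if a not in valid]


markers = read_marker_names(["M marker1", "M marker2", "M marker3"])
print(check_pedigree_row("FAM1001 ID1234 0 0 M 1 1 1 2 2 2", len(markers)))
```

A row that passes yields an empty list; anything else describes what looks wrong with that row.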
In the MACH command line, the names of the data and pedigree files are indicated with the -d and -p options (in shorthand
form) or the --datfile and --pedfile options (in long form), respectively. For example:
mach -d genotypes.dat -p genotypes.ped
Optional Phased Haplotypes
For many analyses, but in particular for genotype imputation, it can be very helpful to provide a set of reference haplotypes as
input. Reference haplotypes can include genotypes for markers that were not examined in your own sample but which can, often, be
imputed based on genotypes at flanking markers. Most commonly, these haplotypes might be derived from a public resource such as
the International HapMap Project and, eventually, the 1000 Genomes
Project. You can retrieve a current set of phased HapMap format haplotypes from www.hapmap.org/downloads/phasing/.
Phased haplotype information is encoded in two files. The first file (which MACH calls the "snp file") lists the markers in the
phased haplotypes. The second file (which MACH calls the "haplotype file") lists one haplotype per line. If you retrieved these files
from the HapMap website, simply combine the --hapmapFormat option with the --snps option to indicate the name of the
HapMap legend file and the --haps option to indicate the name of the file with phased haplotypes. Here is an example:
prompt> mach1 --hapmapFormat --snps genotypes_chr1_CEU_r22_nr.b36_fwd_legend.txt.gz --haps genotypes_chr1_CEU_r22_nr.b36_fwd.phase.gz ...
If you don't use the --hapmapFormat option, MACH expects the snp file (indicated with the --snps option) to simply
list one marker name per line and the haplotype files (indicated with the --haps option) to list one haplotype per line.
Haplotypes can be prefaced by one or two optional labels followed by a series of single character alleles, one for each marker. Within
each haplotype, spaces are ignored. Here are two examples:
<Example of a snp list file>
marker1
marker2
...
<End of snp list file>
In the sample haplotype file below, note that the first two columns are automatically ignored (because, based on the legend file,
MACH knows the phased haplotypes should include only 13 markers, corresponding to the last string of digits on each line). Also note
that the alleles A, C, G, and T have been recoded as digits 1, 2, 3, and 4.
<Example of a phased haplotype file>
FAMILY1->PERSON1 HAPLO1 2332323244332
FAMILY1->PERSON1 HAPLO2 2332323422132
FAMILY2->PERSON1 HAPLO1 3332323244332
FAMILY2->PERSON1 HAPLO2 3311321242332
...
<End of phased haplotype file>
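To make the label-skipping rule concrete, here is a small Python sketch of one way to separate optional labels from alleles once the marker count is known. It is an illustration only, not MACH's actual parser, and the function name is hypothetical. It gathers whitespace-separated fields from the end of the line until it has exactly as many allele characters as there are markers.

```python
def parse_haplotype_line(line, n_markers):
    """Split a haplotype line into (labels, alleles) given the marker count."""
    fields = line.split()
    alleles = ""
    idx = len(fields)
    # Walk backwards, joining fields, since spaces inside a haplotype are ignored.
    while idx > 0 and len(alleles) < n_markers:
        idx -= 1
        alleles = fields[idx] + alleles
    if len(alleles) != n_markers:
        raise ValueError(
            f"could not find exactly {n_markers} alleles in line: {line!r}"
        )
    return fields[:idx], alleles


labels, alleles = parse_haplotype_line(
    "FAMILY1->PERSON1 HAPLO1 2332323244332", 13
)
print(labels, alleles)
```

The same call works when the alleles are broken up by spaces, for example "HAPLO1 23323 23244 332" with 13 markers.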
If you provide MACH with a set of reference haplotypes as input, the marker order in the phased haplotypes overrides any marker order
that may be specified in the pedigree and data files that contain the genotype data. This means that one convenient way to re-order
markers in your original pedigree and data file is to simply create an empty haplotype file and a companion snp file that lists markers in
the desired order. When you provide these two as input, they'll override the marker order specified in the data file.
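As a sketch of the re-ordering trick, the Python snippet below writes a snp file listing markers in the desired order together with an empty haplotype file. The function name and the file names order.snps and empty.haps are hypothetical choices for illustration.

```python
from pathlib import Path


def write_reorder_files(marker_order, snp_path, hap_path):
    """Write a snp file (one marker name per line) and an empty haplotype file."""
    Path(snp_path).write_text("".join(name + "\n" for name in marker_order))
    Path(hap_path).write_text("")  # empty on purpose: no reference haplotypes


if __name__ == "__main__":
    # Hypothetical file names; marker names must match those in the data file.
    write_reorder_files(["marker2", "marker1"], "order.snps", "empty.haps")
```

The two files would then be passed with the --snps and --haps options alongside the usual -d and -p options.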
Useful Tip: You can usually economize disk space by using gzip to
compress your input files (the data and pedigree files and any files containing the reference haplotypes). MACH can automatically
recognize gzipped files and decompress them on the fly.
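If the gzip command-line tool is not at hand, the compression can also be done from Python's standard library. This is a minimal sketch; the function name is hypothetical, and appending ".gz" to the file name is an assumed convention rather than a documented MACH requirement.

```python
import gzip
import shutil


def gzip_file(src_path, dst_path=None):
    """Write a gzip-compressed copy of src_path and return the new path."""
    if dst_path is None:
        dst_path = src_path + ".gz"  # assumed naming convention
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams the data, so large files are fine
    return dst_path
```

For example, gzip_file("genotypes.ped") would leave the original file in place and write genotypes.ped.gz next to it.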
That is all you should need to get started!