University of Michigan Center for Statistical 


PEDSTATS Tutorial - Input Files

PEDSTATS provides graphical and text summaries of the information contained in any pair of pedigree and data files. Pedigree (.ped) files describe relationships between individuals in your dataset and also store marker genotypes, disease status and quantitative trait values. Data (.dat) files provide a description of the contents of the associated pedigree file.

PEDSTATS supports input files in QTDT, LINKAGE or MENDEL format. Although the three formats are similar, the discussion below focuses on QTDT format.

Describing Relationships Between Individuals

Although pedigrees can become quite complex, all the information that is necessary to reconstruct individual relationships in a pedigree file can be summarized in five items: a family identifier, an individual identifier, a link to each parent (if available) and finally an indicator of each individual's sex.

[Figure: Example Pedigree]

As an example of how family relationships are described, we will construct a pedigree file for a small pedigree with two siblings, their parents and maternal grandparents.

For this simple pedigree, the five key items take the following values:

example    granpa   unknown  unknown    m
example    granny   unknown  unknown    f
example    father   unknown  unknown    m
example    mother   granpa   granny     f
example    sister   father   mother     f
example    brother  father   mother     m

These key values constitute the first five columns of any pedigree file. Because of restrictions in early genetic programs, text identifiers are usually replaced by unique numeric values. After replacing each identifier with a unique integer, coding unknown parents as 0 and recoding sexes as 2 (female) and 1 (male), this is what a basic space-delimited pedigree file would look like:

<contents of basic.ped>
1   1   0  0  1
1   2   0  0  2
1   3   0  0  1
1   4   1  2  2
1   5   3  4  2
1   6   3  4  1
<end of basic.ped>
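
The recoding step described above can be sketched in Python. This is only an illustration, not part of PEDSTATS: the function name `recode_pedigree` and the use of the word "unknown" for a missing parent are assumptions taken from the example table, and individuals are simply numbered in order of appearance within each family.

```python
def recode_pedigree(rows):
    """Recode (family, person, father, mother, sex) rows with integer identifiers.

    Hypothetical helper: "unknown" marks a missing parent (recoded as 0),
    and sex is given as "m" or "f" (recoded as 1 or 2).
    """
    sex_codes = {"m": 1, "f": 2}
    family_ids = {}  # family label -> integer
    person_ids = {}  # (family label, person label) -> integer within family

    # First pass: number families and individuals in order of appearance.
    for family, person, _father, _mother, _sex in rows:
        family_ids.setdefault(family, len(family_ids) + 1)
        seen_in_family = sum(1 for fam, _ in person_ids if fam == family)
        person_ids.setdefault((family, person), seen_in_family + 1)

    def parent_id(family, label):
        # A parent who is named but has no row of their own raises KeyError.
        return 0 if label == "unknown" else person_ids[(family, label)]

    # Second pass: rewrite every row using the integer codes.
    return [
        (
            family_ids[family],
            person_ids[(family, person)],
            parent_id(family, father),
            parent_id(family, mother),
            sex_codes[sex],
        )
        for family, person, father, mother, sex in rows
    ]


example = [
    ("example", "granpa", "unknown", "unknown", "m"),
    ("example", "granny", "unknown", "unknown", "f"),
    ("example", "father", "unknown", "unknown", "m"),
    ("example", "mother", "granpa", "granny", "f"),
    ("example", "sister", "father", "mother", "f"),
    ("example", "brother", "father", "mother", "m"),
]

for row in recode_pedigree(example):
    print(*row)
```

Run on the six-person example, the sketch prints the same five columns as basic.ped.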

A pedigree file can include multiple families. Each family can have a unique structure, independent of other families in the dataset.

Describing Phenotypes and Genotypes

Usually the five standard columns are followed by various types of genetic data, including phenotypes for discrete and quantitative traits and marker genotypes.

Disease status is usually encoded in a single column as

   U or 1 for unaffecteds,
   A or 2 for affecteds, and
   X or 0 for missing phenotypes.

Quantitative traits are encoded as numeric values, with X denoting missing values. (It is also possible to use a special numeric value to flag missing phenotypes, but that approach is error-prone and not recommended.)
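
As a rough illustration of these two encodings, the Python sketch below decodes one disease status entry and one quantitative trait entry. The function names are hypothetical, and treating the codes as case-insensitive is an assumption made here (not a documented PEDSTATS rule).

```python
def decode_affection(token):
    """Map a disease status entry to 'unaffected', 'affected' or None (missing)."""
    code = token.upper()  # assumption: codes are treated as case-insensitive
    if code in ("U", "1"):
        return "unaffected"
    if code in ("A", "2"):
        return "affected"
    if code in ("X", "0"):
        return None
    raise ValueError(f"unrecognized disease status: {token!r}")


def decode_trait(token):
    """Map a quantitative trait entry to a float, or None when it is missing (X)."""
    return None if token.upper() == "X" else float(token)


print(decode_affection("2"), decode_trait("1.234"), decode_trait("x"))
```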

Marker genotypes are encoded as two consecutive integers, one for each allele, optionally separated by a "/". A 0 (zero) or X can be used as a placeholder for missing alleles. The following are all valid genotype entries: 1/1 (homozygote for allele 1), 0/0 (missing genotype), and 3 4 (heterozygote for alleles 3 and 4). For the X chromosome, males should be encoded as if they had two identical alleles.
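
Because a genotype may occupy either one whitespace-delimited token (1/1) or two (3 4), a reader has to track how many tokens each genotype consumed. The Python sketch below shows one way to do that; `read_genotype` is a hypothetical helper, and accepting a lowercase x as missing is an assumption.

```python
def read_genotype(tokens, start):
    """Read one genotype from `tokens`, beginning at index `start`.

    Returns ((allele1, allele2), next_index). Missing alleles (0 or X)
    are returned as None.
    """
    first = tokens[start]
    if "/" in first:
        # Both alleles are in a single token, e.g. "1/1".
        left, right = first.split("/")
        used = 1
    else:
        # Alleles are in two consecutive tokens, e.g. "3 4".
        left, right = first, tokens[start + 1]
        used = 2

    def allele(text):
        return None if text.upper() in ("0", "X") else int(text)

    return (allele(left), allele(right)), start + used


tokens = "1/1 0/0 3 4".split()
position = 0
while position < len(tokens):
    genotype, position = read_genotype(tokens, position)
    print(genotype)
```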

This is what the previous pedigree file might look like after adding a column for disease status, measurements for a quantitative trait and genotypes for two markers:

<contents of basic2.ped>
1   1   0  0  1   1      x   3 3   x x
1   2   0  0  2   1      x   4 4   x x
1   3   0  0  1   1      x   1 2   x x
1   4   1  2  2   1      x   4 3   x x
1   5   3  4  2   2  1.234   1 3   2 2
1   6   3  4  1   2  4.321   2 4   2 2
<end of basic2.ped>

Notice that the two siblings (individuals 5 and 6 in the last two rows) are marked as affected (value 2 in the sixth column), while everyone else is marked as unaffected (value 1 in the sixth column). The quantitative trait (seventh column) takes values 1.234 and 4.321 for individuals 5 and 6 respectively, and is missing (x) for everyone else. Everyone is genotyped at the first marker, but only individuals 5 and 6 are genotyped at the second.

Describing the Pedigree File

Pedigree files can include any number of marker genotype, disease status and quantitative trait variables, limited only by available memory. Since each pedigree file has a unique structure (apart from the first five columns), its contents must be described in a companion data file.

The data file includes one row per data item in the pedigree file, indicating the data type (encoded as M - marker, A - affection status, T - quantitative trait, C - covariate and Z - twins) and providing a one-word label for each item. A data file for the pedigree above, which has one affection status followed by one quantitative trait and two marker genotypes, might read:

<contents of basic2.dat>
A  some_disease
T  some_trait
M  some_marker
M  another_marker
<end of basic2.dat>
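
To show how the data file drives the interpretation of each pedigree row, here is a minimal Python sketch that labels the columns of one row of basic2.ped using the items in basic2.dat. The function names `parse_dat` and `label_row` are hypothetical, and the sketch assumes that every non-marker item occupies exactly one column while a marker occupies one token (1/1) or two (1 1).

```python
def parse_dat(text):
    """Return a list of (type code, label) pairs, one per non-empty line."""
    items = []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            items.append((fields[0].upper(), fields[1]))
    return items


def label_row(items, ped_line):
    """Attach labels to the whitespace-delimited fields of one pedigree row."""
    tokens = ped_line.split()
    names = ("family", "person", "father", "mother", "sex")
    record = dict(zip(names, tokens[:5]))
    pos = 5
    for kind, label in items:
        if kind == "M" and "/" not in tokens[pos]:
            # Marker with alleles in two consecutive tokens.
            record[label] = (tokens[pos], tokens[pos + 1])
            pos += 2
        elif kind == "M":
            # Marker with alleles joined by "/".
            record[label] = tuple(tokens[pos].split("/"))
            pos += 1
        else:
            # Assumption: A, T, C and Z items each use a single column.
            record[label] = tokens[pos]
            pos += 1
    if pos != len(tokens):
        raise ValueError("pedigree row does not match the data file")
    return record


dat_text = """A  some_disease
T  some_trait
M  some_marker
M  another_marker
"""

items = parse_dat(dat_text)
print(label_row(items, "1   5   3  4  2   2  1.234   1 3   2 2"))
```

The final length check is a cheap way to catch a data file that has drifted out of step with its pedigree file.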

Now that you understand pedigree and data file formats, you'll probably want to actually run PEDSTATS! You can get a copy from our download page or, if you'd like, take a look at some text or PDF output first.

