SIMLINK: A PROGRAM FOR ESTIMATING THE POWER OF A PROPOSED
LINKAGE STUDY BY COMPUTER SIMULATION

Version 4.12
April 2, 1997

Michael Boehnke and Lynn M. Ploughman
Department of Biostatistics
School of Public Health
University of Michigan
Ann Arbor, Michigan 48109-2029
Phone: (734) 936-1001
FAX: (734) 763-2215
Email: boehnke@umich.edu

TABLE OF CONTENTS

I. Introduction
II. Definitions
III. Assumptions of the Power Calculation
IV. Options
V. Outline of the Power Calculation
VI. Input for SIMLINK
VII. Output from SIMLINK
VIII. Four Sample Problems
IX. Array Sizes, File Management, and Other Practical Hints
X. Error Conditions
XI. References

I. Introduction

This document describes a computer program to estimate the probability, or power, of detecting linkage given family history information on a set of identified pedigrees. It is assumed that the pedigrees are of known structure and that some data may be available for the genetic trait that is to be mapped. The analysis described here can be applied to autosomal or X-linked traits determined by a single major locus. The trait may be dichotomous with complete or reduced penetrance, or may be quantitative.

This power calculation is most usefully undertaken after family history data are gathered, but prior to examination and testing of pedigree members to obtain marker information. The result of this power calculation is an objective answer to the question: Will my families be sufficient to demonstrate linkage? The theoretical basis for this program is given by Ploughman and Boehnke (1989) and Boehnke (1986).

The program SIMLINK (LODSTAT is now incorporated as part of SIMLINK) required for this power calculation has three major components:

(A) Trait and Marker Genotype Simulation: This component of the program simulates cosegregation of trait and marker loci in pedigrees.
If simulating one marker locus for lod score analysis, a particular (set of) recombination fraction(s) is assumed; if simulating two flanking marker loci for analysis by location scores, a particular map distance is assumed. The program assumes that phenotypic information may be available for some pedigree members for the trait, but not for the marker(s). Genotypes are simulated in an unbiased fashion (Boehnke, 1986) so that individuals are assigned a trait genotype consistent with their observed trait phenotype and the phenotypes of the other pedigree members. Marker genotype simulation is based on population marker gene frequencies, trait genotypes, and the recombination fraction(s) between the trait and marker loci, and assumes Hardy-Weinberg and linkage equilibrium. Traits can be genetically homogeneous, or can be heterogeneous between pedigrees. Individuals identified as unavailable for sampling are assigned unknown marker phenotypes for subsequent lod or location score calculation. (B) Lod or Location Score Calculation: This component of the program calculates lod or location scores based on the simulation results for each replicate pedigree. Lod scores are calculated if one marker locus was simulated; location scores are calculated if two flanking marker loci were simulated. A modified version of the computer program MENDEL (Lange et al., 1988) acts as a subroutine for implementing these calculations. (C) Linkage Information Calculation: This component of the program calculates sample statistics for the maximum lod/location score distributions, resulting in estimates of (1) expected maximum lod/location scores, (2) probabilities of maximum lod/location scores sufficiently large to conclude linkage, and (3) expected exclusion regions when the trait is not linked to the marker(s). 
Expected maximum lod scores for each pedigree conditional on whether individual pedigree members are homozygous or heterozygous can be used to identify key individuals for the linkage analysis. To estimate the power of a proposed linkage study, multiple replicates of each pedigree for each of several true recombination fractions or map distances between the trait and marker loci are simulated. After a replicate pedigree has been simulated for each pedigree type and each true recombination fraction or map distance, MENDEL calculates lod or location scores. The resulting scores are used to estimate the maximum lod/location score for each pedigree and for the set of pedigrees and to update the linkage information statistics. Once this process has been completed for the desired number of replicates, estimates of the linkage information provided by the pedigrees, including expected maximum lod/location scores and the probabilities of maximum lod/location scores greater than particular constants, are calculated and output to a series of tables. The probability of a maximum lod/location score greater than 3.0 gives the probability that the pedigree or set of pedigrees will be sufficient to demonstrate linkage. We thank Kenneth Lange and Daniel Weeks for their work in developing MENDEL and for generously allowing us to incorporate portions of it into SIMLINK. Any problems that arise through the use of the modified version of MENDEL as a component of SIMLINK are the responsibilities of Boehnke and Ploughman, and questions should be directed to us. II. Definitions Several terms are used in this document that are of key importance. These include: True Recombination Fraction: recombination fraction used to simulate replicate pedigrees when simulating one marker locus. True Map Distance: map distance between the two flanking marker loci used to simulate replicate pedigrees when simulating two flanking marker loci. 
Replicate pedigrees are simulated placing the trait locus at a series of distances along the interval between the two marker loci. All map distances are converted to recombination fractions using Haldane's (1919) mapping function for use in the simulation. Test Recombination Fraction: recombination fraction at which lod/location scores are calculated. In general, there will be several test recombination fractions for each true recombination fraction or map distance, since by chance a replicate pedigree may achieve its maximum lod/location score at a recombination fraction or map position different from the true one. Replicate Pedigree: a copy of one of the user-supplied pedigrees for which trait and/or marker phenotypes are simulated. In general, a large number of replicate copies should be simulated for each pedigree to achieve sufficiently accurate estimates of statistical power and mean maximum lod/location scores. III. Assumptions of the Power Calculation This power calculation for a linkage study assumes: (A) One or more pedigrees have been identified in which a dichotomous or quantitative trait determined by a two-allele genetic locus is segregating. If the dichotomous trait exhibits incomplete penetrance, the penetrance function can be described by a piecewise linear or cumulative normal penetrance function. (B) Pedigree structures (that is, relationships among pedigree members) are known for all pedigrees. Trait phenotypes may be known (but need not be) for some or all pedigree members. Marker phenotypes are unknown. (C) Mode of inheritance is known for the trait. If mode of inheritance for the trait is not clear, the power calculation corresponds to the power of a linkage study if the assumed trait mode of inheritance is true. Given several different candidate trait models, it may be desirable to carry out a power calculation for each model. (D) Hardy-Weinberg and linkage equilibrium. 
(E) No interference, so that Haldane's (1919) mapping function is appropriate. This assumption is relevant only if flanking markers are simulated.

(F) No MZ-twins are present in the pedigrees. Given a pedigree with MZ twins, we recommend including only one of the twins in the data set for the power calculation.

IV. Options

The power calculation outlined here can be carried out in several different ways depending on the trait of interest and the interests and preferences of the investigator. Options available include:

(A) Chromosomal Location: The trait and marker loci may be either all autosomal or all X-linked.

(B) Marker Loci: The investigator must choose the situation to simulate: either a single marker locus or a pair of flanking marker loci. Marker mode of inheritance can follow any simple Mendelian pattern. The default maximum number of alleles per marker locus is 4, but can be increased by changing a set of dimension statements and recompiling. Gene frequencies must also be specified. If in the proposed study particular marker loci are to be used or are of predominant importance, modes of inheritance and allele frequencies for those markers can be simulated. If not, a reasonable choice might be to assume two-allele, codominant markers with equal allele frequencies.

(C) Recombination Fractions or Map Distances: The results of the power calculation depend very strongly on the distance to the linked marker(s). Therefore, it may be helpful to consider several true recombination fractions between the trait locus and a single marker locus or to consider several true map distances between the two flanking marker loci.

(D) Unlinked Marker: It is also of interest to estimate the region about an unlinked marker or pair of unlinked markers that might be excluded from linkage. This exclusion region may be estimated.

(E) Genetic Heterogeneity: Genetic heterogeneity can be allowed for using the admixture model for heterogeneity (Smith, 1963).
Under this model, the probability of the trait being linked in a given pedigree is alpha; with probability 1 - alpha the trait is unlinked. This model assumes that while different pedigrees may have different genetic forms of the disease, within a pedigree only a single genetic form is present. If genetic heterogeneity is allowed for, two different lod scores are calculated: the standard lod score which assumes genetic homogeneity, and a lod score which allows for maximization as a function of both the recombination fraction and the linked fraction alpha. Risch (1989) has demonstrated that for simple genetic models and nuclear family data, ignoring heterogeneity and calculating the standard lod score tends to be the more powerful choice unless the linked fraction alpha is small, the pedigrees are large, and the recombination fraction is small. The relative merits of these two analytic strategies for a specific combination of genetic model and pedigree data set can be evaluated using SIMLINK. (F) Identifying Key Pedigree Members: Often, particular pedigree members are of key importance in determining the linkage information provided by a pedigree. To assess that importance, we allow calculation of the expected maximum lod score for each pedigree conditional on the marker heterozygosity/homozygosity status of each pedigree member. We regard an individual as a key pedigree member if there is a large difference in the expected maximum lod score for his/her pedigree depending on whether or not (s)he is marker heterozygous. V. Outline of the Power Calculation The power calculation is a four step process, involving (A) calculation of genotype conditional probabilities for each pedigree member; (B) simulation of a replicate of each of the user-supplied pedigree(s); (C) calculation of lod/location scores for the replicate of each of the pedigree(s); and (D) calculation of statistics based on the lod/location scores. 
Step (A) is carried out once prior to replicate pedigree simulation, steps (B) and (C) are repeated in sequence for each replicate, and step (D) is carried out after all replicates have been simulated. Each of these steps is described in this section. (A) Calculation of Genotype Conditional Probabilities: To facilitate unbiased genotype simulation, conditional probabilities for the trait genotypes of each pedigree member are calculated conditional on the trait genotypes of (some of) their relatives. This is accomplished by a single trait-model likelihood evaluation using MENDEL. (B) Simulation of Pedigrees: SIMLINK simulates cosegregation at the trait and marker loci for multiple replicates of each pedigree. Simulations are carried out at the specified true recombination fractions for one marker locus or at the recombination fractions corresponding to the specified map distance for two flanking marker loci. Input required includes (for details, see Input): (1) Family History Information for Each Pedigree Member: an ID, IDs for the parents, gender, trait phenotype if known, trait availability indicator, and, if desired, a variable (e.g. age) which along with gender and genotype determines the penetrance function. (2) Trait and Marker Locus Descriptions: mode of inheritance and allele frequency information for the trait and marker loci in the form required by MENDEL. (3) Recombination Fractions/Map Distance: true recombination fractions at which cosegregation is to be simulated, if simulating one marker locus; a single map distance, if simulating two flanking marker loci. For two marker loci, the trait locus will be placed at positions along the interval between the two marker loci and the resulting map distances converted to recombination fractions using Haldane's (1919) mapping function. (4) Penetrance Function: Currently, SIMLINK allows for a piecewise-linear penetrance function or a cumulative normal penetrance function for dichotomous traits. 
The program allows for different forms of these penetrance functions for each trait genotype/gender combination and allows them to depend on one quantitative variable. This variable typically will be age, and we will assume that it is age for the remainder of this document. The piecewise-linear function assumes that a minimum penetrance holds for ages less than a minimum age, increases linearly to a maximum penetrance at a maximum age, and remains at the maximum penetrance for ages greater than the maximum age. The cumulative normal penetrance function assumes that penetrance increases from the minimum penetrance at age minus infinity to the maximum penetrance at age plus infinity following a cumulative normal distribution with a specified mean and standard deviation. Quantitative traits with genotype-specific normal distributions are the third penetrance option. (5) Control Information: Number of replicates to be simulated for each available pedigree, locus and pedigree file names, seeds for the random number generator, and other control variables. SIMLINK creates pedigree files appropriate for MENDEL containing a single replicate of each pedigree type. In each replicate pedigree, members with known trait phenotype are assigned their correct trait phenotype. Pedigree members of currently unknown trait phenotype may be assigned a trait phenotype if desired; marker phenotypes can also be simulated and assigned. When simulating one marker locus, one marker phenotype will be listed for each true recombination fraction under which pedigrees were simulated; when simulating two flanking marker loci, two marker phenotypes, one per locus, will be listed for each pair of true recombination fractions under which pedigrees were simulated. (C) Lod or Location Score Calculations: Using the pedigree file created by SIMLINK, MENDEL calculates log likelihoods for subsequent calculation of lod scores or location scores. 
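The map-distance conversion mentioned under input (3) of step (B) and the two dichotomous-trait penetrance functions described under input (4) are simple enough to sketch directly. The Python below is an illustration only and is not taken from the SIMLINK FORTRAN source; the function names are hypothetical, and the behavior exactly at the minimum and maximum ages is an assumption consistent with the description above.

```python
import math

def haldane_theta(map_distance_morgans):
    """Haldane's (1919) mapping function (no interference):
    recombination fraction for a map distance given in Morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * map_distance_morgans))

def piecewise_linear_penetrance(age, min_age, max_age, min_pen, max_pen):
    """Minimum penetrance up to min_age, linear increase to the maximum
    penetrance at max_age, maximum penetrance thereafter."""
    if age <= min_age:
        return min_pen
    if age >= max_age:
        return max_pen
    fraction = (age - min_age) / (max_age - min_age)
    return min_pen + fraction * (max_pen - min_pen)

def cumulative_normal_penetrance(age, mean, sd, min_pen, max_pen):
    """Penetrance rising from min_pen (age -> minus infinity) to max_pen
    (age -> plus infinity) along a normal cumulative distribution."""
    phi = 0.5 * (1.0 + math.erf((age - mean) / (sd * math.sqrt(2.0))))
    return min_pen + (max_pen - min_pen) * phi

# A constant 80% penetrance, independent of age, written as a
# piecewise-linear function with equal minimum and maximum penetrance.
constant_pen = piecewise_linear_penetrance(50.0, 0.0, 60.0, 0.80, 0.80)
```

With equal minimum and maximum penetrance the piecewise-linear form reduces to a constant, which is how an age-independent penetrance can be expressed.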
(D) Calculation of Linkage Information Estimates: SIMLINK calculates the following linkage information criteria for the pedigrees at the different true recombination fractions/map distances:

(1) For linked markers:
(a) the expected maximum lod/location score for each pedigree and for the summed pedigrees assuming homogeneity or allowing for heterogeneity (optional); and
(b) the probability of a maximum lod/location score greater than specified constants for each pedigree, the summed pedigrees assuming homogeneity or allowing for heterogeneity (optional), and any one pedigree.

(2) For unlinked markers:
(a) the expected lod/location score for several test recombination fractions/map distances for each pedigree and the summed pedigrees; and
(b) the probability of a lod/location score greater than specified constants.

These information criteria may be used to estimate:

(1) The Power of the Linkage Study: The power of a proposed linkage study is the probability of detecting a linked marker if it is tested. Equivalently, it is the probability of obtaining a maximum lod score of at least 3.0 for a linked marker (Morton, 1955). This probability is estimated under (1b) above when the constant equals 3.0. The power can be estimated for
(a) each pedigree alone,
(b) the summed pedigrees (under the assumption that the trait is caused by the same locus in all pedigrees),
(c) the summed pedigrees allowing for between-pedigree heterogeneity (optional), and
(d) all the pedigrees but without summing the lod scores (allowing in the analysis for the possibility that the trait may be caused by two or more loci, but assuming in the simulation that only one locus is actually involved).

(2) The Expected Exclusion Region for An Unlinked Marker (Pair): A lod score of less than -2.0 is customarily accepted as conclusive evidence for the exclusion of linkage (Morton, 1955).
Calculating the expected lod/location scores for an unlinked marker (pair) at each of several test recombination fractions/map distances, yields an estimate of the exclusion region when testing for linkage to an unlinked marker (pair). (3) Probability of Incorrectly Concluding Linkage: Estimating the probability of a maximum lod/location score greater than 3.0 for a true recombination fraction of .50 gives the probability of incorrectly concluding linkage to an unlinked marker (pair). In statistical terms, that is the probability "a" of making a type I error for a single marker (pair). Since many (pairs of flanking) markers will often be considered, the overall probability of making a type I error is greater. Assuming that the linkage calculations for the different (pairs of flanking) markers are independent, the overall probability of making a type I error becomes 1 - (1 - a)**n, where n is the number of (pairs of flanking) markers and "**" represents exponentiation. In addition, SIMLINK will as an option calculate the expected maximum lod score for each pedigree conditional on the heterozygosity/homozygosity status of each pedigree member. This provides a means of identifying pedigree member(s) whose marker status has a strong impact on the linkage information provided by the pedigree. VI. Input for SIMLINK Three input files are required: (A) the control file, (B) the locus file, and (C) the pedigree file. (A) The Control File: The control file contains general information describing the power calculation. The sample control file below requests a power calculation based on 100 replicates for a genetically homogeneous dominant trait called "TRAIT" with penetrance 0.80 in both males and females (independent of age). Power is to be estimated for a marker linked at 0%, 5%, or 10% recombination to the trait; free recombination is also simulated. 
The data will be echoed in the output file, and the effect of individual marker heterozygosity/homozygosity status will be determined.

     100       1       1       1       4       1       1       0    1.00
    0.00    0.05    0.10    0.50
    0.00    60.0    0.00    0.00
    0.00    60.0    0.80    0.80
    0.00    60.0    0.80    0.80
    0.00    60.0    0.00    0.00
    0.00    60.0    0.80    0.80
    0.00    60.0    0.80    0.80
M       F
TRAIT
LOCUS.DAT
PEDIG.DAT
   31171    2413   19771

The following records in the given order and with variables and formats as described below are required in the control file (see Examples):

1. Control Information: The following nine variables in order, each within an 8 column field, all but the last right justified (8I8,F8.5):

Note: This record and its format have been substantially altered since version 4.0. The definition of NTHETA has also been changed to include free recombination.

Col  1- 8  NREP: the number of replicate data sets to simulate.
Col  9-16  NMLOCI: the number of marker loci:
           =1 then lod scores are calculated,
           =2 then two markers are assumed to flank the trait locus and location scores are calculated.
Col 17-24  PENOPT: the indicator of the type of penetrance function for the trait:
           =1 a piecewise-linear penetrance function for a dichotomous trait,
           =2 a cumulative normal penetrance function for a dichotomous trait,
           =3 a quantitative trait due to a mixture of normal distributions.
Col 25-32  IFREE: indicator of whether free recombination between the trait and marker locus (loci) is to be simulated:
           =0 if no,
           =1 if yes.
Col 33-40  NTHETA: if using one marker locus, the number of different true recombination fractions between the trait and marker loci to be considered. Ignored if using two flanking marker loci.
Col 41-48  IECHO: data echoing indicator
           =0 if data will not be echoed in the output file
           =1 if data will be echoed in the output file
Col 49-56  INDINF: identify key individuals by heterozygosity/homozygosity status;
           =0 if no,
           =1 if yes
Col 57-64  LNKOPT: linkage heterogeneity option indicator
           =0 if genetic homogeneity is assumed
           =1 if genetic heterogeneity is allowed
Col 65-72  ALPHA: probability that a pedigree is segregating the linked form of the trait (ignored if LNKOPT=0)

2. Recombination Fractions/Map Distance: If lod scores are to be calculated (NMLOCI=1), the set of possible true recombination fractions between the trait and marker loci input in fields eight columns wide (8F8.6). If location scores are to be calculated (NMLOCI=2), the true map distance in Morgans between the two marker loci (only one distance is allowed), followed by the distance option variable DISOPT, input in fields eight columns wide (F8.6,I8), with DISOPT right justified.

Col  1- 8  First true recombination fraction if one marker locus or the true map distance if two marker loci,
Col  9-16  Second true recombination fraction if one marker locus or DISOPT if two marker loci (right justified).
           DISOPT=0 says to allow for multiple locations for the disease locus between the two markers;
           DISOPT=1 says to assume the disease locus is in the middle; DISOPT=1 requires much less computation.
Col 17-24  Third true recombination fraction if one marker locus
etc.

3. Parameter values for the trait penetrance function: For each possible trait genotype/gender combination, input four parameters per line in fields eight columns wide (4F8.4) (see Outline of the Power Calculation):

line 3: for a male with trait genotype 11;
line 4: for a male with trait genotype 12;
line 5: for a male with trait genotype 22;
line 6: for a female with trait genotype 11;
line 7: for a female with trait genotype 12;
line 8: for a female with trait genotype 22.
Here, alleles 1 and 2 correspond to the first and second trait alleles entered in the locus file, respectively. For a dichotomous trait with a piecewise linear penetrance function (PENOPT=1): Col 1- 8 minimum age (or whatever quantitative variable is to be used), Col 9-16 maximum age, Col 17-24 minimum penetrance, i.e., penetrance at the minimum age, Col 25-32 maximum penetrance, i.e., penetrance at the maximum age. Note: If a constant penetrance of 80% is desired, independent of age, a line with the values 0. 60. .80 .80 could be entered. For a dichotomous trait with a cumulative normal penetrance function (PENOPT=2): Col 1- 8 mean age for the penetrance function, Col 9-16 standard deviation of age for the penetrance function, Col 17-24 minimum penetrance assuming an age of minus infinity, Col 25-32 maximum penetrance assuming an age of plus infinity. If dealing with a quantitative trait due to a mixture of normal distributions (PENOPT=3): Col 1- 8 mean trait value at age zero, Col 9-16 rate at which the mean trait value changes linearly with age, Col 17-24 standard deviation of the trait value at age zero, Col 25-32 rate at which the standard deviation of the trait value changes linearly with age. 4. Male and female symbols: The symbols used to identify males and females in the pedigree file (e.g., M and F or 1 and 2). Enter the symbols in character fields eight columns wide (2A8): Col 1- 8 male symbol, Col 9-16 female symbol. 5. Trait locus name: The name given the trait locus in the locus file. Enter the name in a character field eight columns wide (A8): Col 1- 8 trait locus name. 6. Locus file name: The name of the locus file, in character format (A). 7. Pedigree file name: The name of the pedigree file, in character format (A). 8. Seeds for the random number generator: These three positive integers will be used to start the random number generator used in the simulation (Wichman and Hill, 1982). 
The values should be relatively large, though no larger than 32767, and should be changed from one run to the next. Input the numbers right justified in fields eight columns wide (3I8).

Col  1- 8  First random number generator seed,
Col  9-16  Second random number generator seed,
Col 17-24  Third random number generator seed.

Note: The control file should end with an end-of-file symbol.

(B) The Locus File: The locus file contains information describing the genetic loci involved in the power calculation. This includes one trait locus and either one or two marker loci. The sample locus file below includes a trait locus and two markers, and could be used for a linkage power calculation based on location scores.

TRAIT   AUTOSOME 2 3
d            .99
D            .01
1.       1
d/d
2.       2
d/d
D/d
3.       1
D/d
MARKER1 AUTOSOME 2 3
1            .50
2            .50
11       1
1/1
12       1
1/2
22       1
2/2
ABO     AUTOSOME 3 4
A            .26
B            .06
O            .68
A        2
A/A
A/O
B        2
B/B
B/O
AB       1
A/B
O        1
O/O

The trait locus has autosomal dominant inheritance with reduced penetrance; the specific penetrance functions are described in the control file. Because the D allele is relatively rare, the D/D genotype is assumed impossible, and unaffected spouses in the pedigree file (see below) will be assumed not at risk (phenotype 1.). While these assumptions are not exactly true, they are reasonably accurate, and they result in a much simplified power calculation. We strongly recommend the use of such assumptions whenever possible. It is important to remember that this is a power calculation; approximate answers should be quite satisfactory.

Note: excluding either homozygous genotype is not appropriate for an X-linked trait, since hemizygous males are assumed by MENDEL to be homozygous for their allele.

The first marker in the locus file is a two allele codominant marker with equal allele frequencies (note, allele names can be characters, including numbers).
Given no prior interest in a particular marker, we generally use such a codominant marker as a compromise along the broad continuum between infinitely polymorphic "magic markers" at one extreme and two allele polymorphisms with one rare allele at the other extreme. The second marker is the ABO locus, and demonstrates how dominance relationships are dealt with when all genotypes are allowed for. Inspection of this example shows that data on the loci are provided one locus at a time with the following records (also see Examples and Lange et al., 1988): 1. Trait locus general information: the following four variables in (2A8,2I2) format, the two integer variables right justified: Col 1- 8 the name of the trait locus, Col 9-16 the chromosomal type of the trait locus: =AUTOSOME, if the trait locus is autosomal, =X-LINKED, if the trait locus is X-linked. Col 17-18 number of alleles at the trait locus (must be 2), Col 19-20 number of trait phenotypes (by convention, this must be 3 for a dichotomous trait (see below) or 0 for a quantitative trait). 2. Trait allele information: for each allele, a record with the following two variables in (A8,F8.5) format: Col 1- 8 trait allele name, Col 9-16 trait allele frequency. Note: Allele frequencies should sum to 1.0. For each trait phenotype, enter record 3 below once and record 4 below once for each trait genotype that corresponds to the particular trait phenotype. For dichotomous traits, three trait phenotypes are possible: 1.=normal and not at risk of becoming affected; 2.=normal and at risk of becoming affected; 3.=affected. Using the not at risk phenotype 1. when possible (for example, for spouses who marry into the pedigree for a relatively rare trait) can result in substantial computational savings since it will usually correspond to fewer possible trait genotypes than the at risk phenotype 2. . For quantitative traits, by convention, zero trait phenotypes are possible. 
Note: The dichotomous trait phenotypes must be 1., 2., or 3. (in that order), and the trailing decimal points are required.

3. Trait phenotype information (dichotomous traits only): the following two variables in a record in (A8,I2) format, the integer variable right justified:

Col  1- 8  trait phenotype name: 1., 2., or 3. (in that order)
Col  9-10  number of trait genotypes associated with this trait phenotype.

4. Trait phenotype/genotype correspondence (dichotomous traits): following each trait phenotype record, list the trait genotypes corresponding to that phenotype, one record per genotype, each genotype in (A17) format. Each genotype is denoted by its two allele names separated by a slash (/). The slash character should not be part of an allele name.

Note: For an X-linked trait, no special symbols are required for males. If a listed phenotype is appropriate for both females and males, only the associated homozygous genotypes will be assigned to a male with the phenotype. Internally, the program identifies hemizygous genotypes with the corresponding homozygous genotypes.

Data on the marker loci are provided one locus at a time with the following records 5-8 required for each marker locus.

5. Marker locus general information: the following four variables in (2A8,2I2) format, the two integer variables right justified:

Col  1- 8  the marker locus name,
Col  9-16  the chromosomal type of the marker locus:
           =AUTOSOME, if the marker locus is autosomal,
           =X-LINKED, if the marker locus is X-linked,
Col 17-18  number of alleles at the marker locus,
Col 19-20  number of phenotypes at the marker locus.

Note: Lod/location score calculation time can increase rapidly as a function of the number of marker alleles. Given more alleles, attendant array sizes may also become too large, particularly on microcomputers.

6. Marker allele information: for each allele, a record with the following two variables in (A8,F8.5) format:

Col  1- 8  marker allele name,
Col  9-16  marker allele frequency.
Note: Allele frequencies should sum to 1.0. For each phenotype for the current marker, enter record 7 below once and record 8 below once for each marker genotype that corresponds to the particular marker phenotype. 7. Marker phenotype information: the following two variables in a record in (A8,I2) format, the integer variable right justified: Col 1- 8 marker phenotype name, Col 9-10 number of marker genotypes associated with this marker phenotype. 8. Marker phenotype/genotype correspondence: following each marker phenotype record, list the marker genotypes associated with the marker phenotype in one record per marker genotype, each genotype in (A17) format. Each marker genotype is denoted by its two allele names separated by a slash (/). The slash character should not be part of an allele name. Note: For an X-linked trait, no special symbols are required for males. If a listed phenotype is appropriate for both females and males, only the associated homozygous genotypes will be assigned to a male with the phenotype. Internally, the program identifies hemizygous genotypes with the corresponding homozygous genotypes. 9. End-of-file symbol. The locus file must end with one and only one end-of- file symbol. THIS IS CRITICAL!! On some computers and with some word processors, an end-of-file symbol is added automatically, and the symbol is invisible. On other computers there is a visible or partially visible symbol. All FORTRAN 77 compilers have an ENDFILE command if it is necessary to produce the end-of-file symbol. (C) The Pedigree File: The pedigree file contains information describing the pedigrees identified for use in the power calculation. The sample pedigree file below includes two pedigrees of ten and six individuals, respectively. (I3,1X,A8) (3(A3,1X),2A1,A2,T15,A2,A3,A4) 10 FAMILY1 1 M 3. 1. 80. 2 F 1. 1. 70. 3 1 2 F 3. 1. 80. 4 1 2 M 1. 1. 80. 5 8 9 F 3. 1. 80. 6 4 5 M 1. 1. 80. 7 4 5 M 1. 1. 85. 8 M 3. 1. 80. 9 F 1. 1. 75. 10 8 9 F 3. 1. 50. 
  6 FAMILY2
  1   5   6 M 3. 1. 80.
  2         F 1. 1. 70.
  3   1   2 F 3. 1. 80.
  4   1   2 M 3. 1. 80.
  5         M 3. 1. 80.
  6         F 1. 1. 80.

In the pedigree file, two format statements are followed by information on each pedigree, one pedigree at a time. Pedigree information includes a pedigree description record, followed by a record for each pedigree member. The following records, in the given order and with variables and formats as described below, are required in the pedigree file (see Examples and Lange et al., 1988):

1. Pedigree record format statement: This FORTRAN format statement is used to read the pedigree description records. It should consist of an integer format for reading the number of individuals in a pedigree and a character format (maximum of eight characters) for reading the pedigree ID. For example, (I3,1X,A8).

2. Individual record format statement: This FORTRAN format statement is used to read the individual records. Each individual record consists of an ID, the parents' IDs, gender, MZ-twin status, the trait phenotype (in character format corresponding exactly to what appears in the locus file for a dichotomous trait, or a blank field for a quantitative trait), the trait phenotype again (present for both dichotomous and quantitative traits), the observable phenotype indicator, and the penetrance variable (such as age). To read a dichotomous trait phenotype a second time, a tab (T) can be used to reread the previous field; two different fields must be read for quantitative trait data (see below). All items or fields on an individual record should be read in character format (A), and each should consist of eight characters or fewer. This includes the quantitative variables (trait phenotype, observable phenotype indicator, and penetrance variable), for which decimal points are mandatory. For example, (3(A3,1X),2A1,A2,T15,A2,A3,A4).

3. Pedigree information. This record is present once for each pedigree.
Enter the following two variables in the format specified in record 1.

   Field 1: the number of individuals in the pedigree (right justified),
   Field 2: the pedigree ID (optional).

4. Individual data. This record is present once for each pedigree member. For each pedigree member, input the following variables in the format specified in record 2.

   Field 1: Individual's ID,
   Field 2: ID of one of his/her parents, blank if the parent is not in the pedigree,
   Field 3: ID of the other parent, blank if the parent is not in the pedigree,
   Field 4: Individual's gender, using the symbols specified in the control file (for example, M or F, 1 or 2),
   Field 5: MZ-twin status, must be left blank since SIMLINK does not allow for MZ twins,
   Field 6: Individual's trait phenotype (see Note 4 below for quantitative traits),
   Field 7: Individual's trait phenotype again,
   Field 8: Indicator of the availability of the individual's phenotypes if a linkage study is carried out.
            =0. if marker phenotypes should not be simulated, and the trait phenotype should be left as specified in the pedigree file;
            =1. if marker phenotypes should be simulated, and a trait phenotype should be simulated if not listed in the pedigree file;
            =2. if marker phenotypes should be simulated, and the trait phenotype should be left as specified in the pedigree file;
            =3. if marker phenotypes should not be simulated, and the trait phenotype should be simulated if not listed in the pedigree file.
            Note: These last two options (2. and 3.) were not available in earlier versions of SIMLINK.
   Field 9: penetrance function variable, for example age.

Note 1: Individual IDs must be unique within pedigrees.

Note 2: Either both parents or neither parent of a person must be listed in a pedigree.

Note 3: Missing values for any field must be represented by blanks.

Note 4: For a dichotomous trait, the trait phenotype is read twice for each individual.
This can be done either by having two identical input fields and reading them both, or by having a single input field and reading it twice using a tab (T) in the format statement. For a quantitative trait, there must be two separate trait phenotype fields. The first trait phenotype field must be left blank and the second trait phenotype field must contain the quantitative trait phenotype. This approach to input makes it possible to use the same program for both dichotomous and quantitative traits. Our apologies for any confusion it may cause.

5. End-of-file symbol. The pedigree file must end with one and only one end-of-file symbol. THIS IS CRITICAL!! On some computers and with some word processors, an end-of-file symbol is added automatically, and the symbol is invisible. On other computers there is a visible or partially visible symbol. All FORTRAN 77 compilers provide an ENDFILE statement that can be used if it is necessary to produce the end-of-file symbol.

VII. Output from SIMLINK

The output from SIMLINK takes the form of up to seven tables, depending on the analyses carried out. Maximum lod/location scores for each replicate of each pedigree are estimated by quadratic interpolation over the lod/location score values calculated at the test recombination fractions/map distances.

Table 1. Summary of Information Used in the Simulation.

Table 1 summarizes the information used in the simulation. This includes the trait locus name, the number of pedigree replicates simulated, the true recombination fractions/map distances, and the test recombination fractions/map distances used.

Tables 2 and 3 give estimates of the mean maximum lod/location score and the probabilities of maximum lod/location scores greater than specified constants for each of the true recombination fractions/map distances.
These estimates are given for each pedigree separately (listed under 1, 2, and so forth), for the pedigrees combined assuming genetic homogeneity (under SUMMED), for the pedigrees combined allowing for between-pedigree heterogeneity (under SUMMEDH) (optional), and for any one pedigree over all the available pedigrees (under ANY). The values for a specific pedigree give estimates of the expected information provided by that pedigree. The values for the summed pedigrees estimate the expected information provided by pooling the data. Pooling the data in this way assumes that the trait is caused by a single genetic locus, that is, there is no heterogeneity. The values for the summed pedigrees allowing for heterogeneity estimate the expected information provided by pooling the data while explicitly allowing for heterogeneity. The values under ANY correspond to the information provided when an analysis is carried out under the assumption of genetic heterogeneity, so that information from different pedigrees is not pooled, but the trait is actually homogeneous.

Table 2. Estimated Mean Maximum Lod/Location Score for a Marker (Pair).

This table lists the estimated mean maximum lod/location score, its standard error, and the largest maximum lod/location score among all replicates for each pedigree, for the summed pedigrees assuming homogeneity, for the summed pedigrees allowing for between-pedigree heterogeneity (optional), and for any of the pedigrees. These estimates are reported for each of the true recombination fractions/map distances.

Note: Since the maximum of the sum is usually less than the sum of the maxima, the expected maximum summed lod/location score (for all pedigrees combined) will usually be less than the sum of the expected maximum lod/location scores for the individual pedigrees.

Table 3. Estimated Probabilities of Maximum Lod/Location Scores Greater than Specified Constants for a Linked Marker (Pair).

This table lists the estimates and standard errors of probabilities of maximum lod/location scores greater than 0.5, 1.0, 1.5, 2.0, 2.5, and 3.0 for each pedigree, for the summed pedigrees assuming homogeneity, for the summed pedigrees allowing for heterogeneity (optional), and for any of the pedigrees. These values are reported for each of the true recombination fractions/map distances.

For linked loci, the estimated probability of a maximum lod/location score greater than 3.0 estimates the power of a proposed linkage study based on the corresponding data and the assumption of a linked marker or a pair of flanking markers at the given recombination fraction/map distance.

For unlinked loci, the same quantity estimates the probability of incorrectly inferring linkage to an unlinked marker or pair of markers. In statistical terms, this estimates the probability "a" of making a type I error for a single analysis. Since many markers will often be considered, the overall probability of making a type I error is greater. Assuming that the linkage calculations for the different marker (pairs) are independent, the overall probability of making a type I error becomes 1-(1-a)**n, where n is the number of marker (pairs) and "**" represents exponentiation.

Table 4. Estimated Probabilities of Maximum Location Scores Greater Than Specified Constants, Averaged Over the Interval Between the Two Marker Loci.

This table lists estimates of the average probability, when the trait locus is located somewhere between the two marker loci, of a maximum location score greater than constants 0.5, 1.0, 1.5, 2.0, 2.5, and 3.0 for each pedigree, for the summed pedigrees assuming homogeneity, for the summed pedigrees allowing for heterogeneity, and for any of the pedigrees. Table 4 is omitted when simulating only one marker locus or if only a single location for the disease locus was chosen in the control file (see above).
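As a rough illustration of the overall type I error formula 1-(1-a)**n given under Table 3, the short Python sketch below evaluates it for a few numbers of marker (pairs). The function name overall_type1_error and the sample values of a and n are assumptions made only for this illustration; they are not part of SIMLINK, and the calculation relies on the same independence assumption stated under Table 3.

```python
def overall_type1_error(a: float, n: int) -> float:
    """Overall type I error for n independent marker (pair) analyses.

    a is the single-analysis type I error probability, for example the
    Table 3 estimate for an unlinked marker at the 3.0 threshold.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must be a probability between 0 and 1")
    if n < 0:
        raise ValueError("n must be a non-negative integer")
    # The probability that none of the n independent analyses gives a
    # false positive is (1-a)**n; its complement is the overall error.
    return 1.0 - (1.0 - a) ** n


if __name__ == "__main__":
    # Hypothetical single-analysis error rate, chosen only for illustration.
    a = 0.001
    for n in (1, 10, 100):
        print(n, overall_type1_error(a, n))
```

For small a, the result is close to n*a, which is one way to judge how many marker (pairs) can be screened before the overall error rate becomes a concern.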
See Boehnke (1986) for a method using two-point lod scores to calculate a lower bound on the information provided by flanking markers and location scores.

Tables 5 and 6 provide estimates of the expected lod/location score and probability of a lod/location score greater than specified constants when the marker (pair) is unlinked. These tables differ from Tables 2 and 3 by reporting values for each test recombination fraction/map distance, rather than maximizing over all test recombination fractions/map distances. Tables 5 and 6 can be used to estimate the distance to each side of an unlinked marker (pair) that is likely to be excluded using the available pedigrees. Tables 5 and 6 are included only if free recombination is simulated (that is, IFREE=1).

Table 5. Estimated Mean Lod/Location Score for an Unlinked Marker (Pair).

For each test recombination fraction/map distance, this table gives the estimate of the mean lod/location score, its standard error, and the sample maximum and minimum lod/location scores for each pedigree and for the summed pedigrees assuming homogeneity. In addition, an estimate of the test recombination fraction/map distance at which the mean lod/location score equals -2.0 is printed. This estimate is based on quadratic interpolation of the lod/location score. This recombination fraction/map distance gives an estimate of the expected exclusion distance when testing for linkage to an unlinked marker (pair). If interpolation is not possible, asterisks are printed.

Table 6. Estimated Probabilities of Lod/Location Scores Greater than Specified Constants for an Unlinked Marker (Pair).

For each test recombination fraction/map distance, estimates and standard errors for the probabilities of lod/location scores greater than -2.0, -1.5, -1.0, ... , 2.5, and 3.0 are given.
For each test recombination fraction/map distance, one minus the probability of a lod/location score greater than -2.0 gives an estimate of the probability that linkage will be excluded for at least that distance from an unlinked marker (pair).

VIII. Four Sample Problems

Input files for these examples are EXAMPLE*.CON, EXAMPLE*.LOC, and EXAMPLE*.PED; output files are EXAMPLE*.OUT (*=1,2,3,4). These files are all included on the diskette. Before using SIMLINK for your own data, we strongly recommend running the test problems to verify that you are obtaining the same results. The example input files should be helpful when you prepare input files for your own analyses.

Example 1: Eight Pedigrees, Autosomal Dominant Trait with Piecewise Linear Penetrance Function

Each of the eight pedigrees in this example is identical to that described by Ploughman and Boehnke (1989). Eight copies are used to achieve a moderate-sized power estimate for demonstration purposes. Pedigrees 1 through 8 are segregating an autosomal dominant trait with complete penetrance by age 40. Three pedigree members in each pedigree, numbered 4, 6, and 7, are unaffected, at risk, and below the age of 40. The penetrance for these pedigree members is described by a piecewise linear function (PENOPT=1) which increases from 0 at age 0 to 1.0 at age 40 for trait genotypes DD and Dd, and is 0 at all ages for trait genotype dd. The remaining pedigree members are either affected or unaffected and assumed not to be at risk. The ages listed for these pedigree members are not needed by the penetrance function and, hence, need not be correct (see pedigree file). Only 20 replicates are simulated in this example, so that it can be used to check quickly that the program is producing the same results as are given in EXAMPLE1.OUT.

Control file: EXAMPLE1.CON

Column numbers are provided for easy reference; they are not part of the input file.
         1         2         3         4         5         6
1234567890123456789012345678901234567890123456789012345678901234
20 1 1 1 4 1 0 0
0.00 0.10 0.20 0.50     2. True rec. frac.
0.0 40.0 0.0 1.0        3. for males, DD
0.0 40.0 0.0 1.0           for males, Dd
0.0 40.0 0.0 0.0           for males, dd
0.0 40.0 0.0 1.0           for females, DD
0.0 40.0 0.0 1.0           for females, Dd
0.0 40.0 0.0 0.0           for females, dd
M F                     4. male and female symbols
AUTODOM                 5. trait locus name
EXAMPLE1.LOC            6. locus file name
EXAMPLE1.PED            7. pedigree file name
3791 3271 313           8. seeds for random number generator

1. The control line states that 20 replicates will be simulated for each pedigree (NREP=20), 1 marker locus will be simulated (NMLOCI=1), the penetrance function is piecewise linear (PENOPT=1), free recombination will be simulated (IFREE=1), 4 true recombination fractions will be considered (NTHETA=4), the data will be echoed (IECHO=1), the effects of individual heterozygosity/homozygosity status will not be examined (INDINF=0), and the trait is assumed to be homogeneous (LNKOPT=0). Since LNKOPT=0, SIMLINK assumes the linked fraction alpha is 1.

2. Linked marker phenotypes will be simulated at the following true recombination fractions between the trait and marker loci: 0.00, 0.10, 0.20, and 0.50.

3. The minimum age, maximum age, minimum penetrance, and maximum penetrance for the piecewise linear penetrance function are given for each possible trait genotype/gender combination.

4. The male and female symbols used in the pedigree file are M and F.

5. The trait locus name is AUTODOM in the locus file.

6. The locus file name is EXAMPLE1.LOC, chosen to make clear the contents of the file.

7. The pedigree file name is EXAMPLE1.PED, chosen to make clear the contents of the file.

8. These three values are chosen as seeds for the random number generator. If the same values are used in a later run, the same results will be obtained. If they are changed, the results will change too.

Locus file: EXAMPLE1.LOC

Column numbers are provided for easy reference; they are not part of the input file.
         1         2
12345678901234567890123456789  Comments:
AUTODOM AUTOSOME 2 3           1. Trait locus information
D       .01                    2. Trait allele information
d       .99
1.       1                     3. Trait phenotype information
d/d                            4. Pheno/geno correspondence
2.       2                     3. Trait phenotype information
D/d                            4. Pheno/geno correspondence
d/d                            4. Pheno/geno correspondence
3.       1                     3. Trait phenotype information
D/d                            4. Pheno/geno correspondence
MARKER1 AUTOSOME 2 3           5. Marker locus information
A       .50                    6. Marker allele information
B       .50
AA       1                     7. Marker phenotype information
A/A                            8. Pheno/geno correspondence
AB       1                     7. Marker phenotype information
A/B                            8. Pheno/geno correspondence
BB       1                     7. Marker phenotype information
B/B                            8. Pheno/geno correspondence

1. The trait locus name is AUTODOM; it is autosomal, has 2 alleles, and 3 phenotypes.

2. The 2 trait alleles are the dominant disease-susceptibility allele D, with allele frequency 0.01, and the recessive allele d, with allele frequency 0.99.

3., 4. There are 3 trait phenotypes: phenotype 1. has 1 associated genotype, d/d, phenotype 2. has 2 associated genotypes, D/d and d/d, and phenotype 3. has 1 associated genotype, D/d. Because it is so rare, genotype D/D has been omitted from this analysis, reducing the amount of computation time substantially. We strongly recommend this approach whenever feasible.

   Note: Homozygous genotypes should not be eliminated if the trait locus is X-linked.

5. The marker locus name is MARKER1; it is autosomal, has 2 alleles, and 3 phenotypes.

6. The 2 marker alleles are A and B, each with allele frequency 0.50.

7., 8. There are 3 marker phenotypes: phenotype AA has 1 associated genotype, A/A, phenotype AB has 1 associated genotype, A/B, and phenotype BB has 1 associated genotype, B/B, so that the marker is codominant.

Pedigree file: EXAMPLE1.PED

Column numbers are provided for easy reference; they are not part of the input file.

         1         2
12345678901234567890123456789  Comments:
(I3,1X,A8)                     1. Pedigree record format
(3(A3,1X),2A1,A2,T15,A2,A3,A4) 2. Individual record format
 10 FAMILY 1                   3. Pedigree information
  1         M 3. 1. 80.        4. Individual data
  2         F 1. 1. 70.
  3   1   2 F 3. 1. 80.
  4   1   2 M 2. 1. 30.
  5   8   9 F 3. 1. 80.
  6   4   5 M 2. 1. 10.
  7   4   5 M 2. 1.  5.
  8         M 3. 1. 80.
  9         F 1. 1. 75.
 10   8   9 F 1. 1. 50.
 10 FAMILY 2                   3. Pedigree information
  1         M 3. 1. 80.        4. Individual data
  2         F 1. 1. 70.
  3   1   2 F 3. 1. 80.
  4   1   2 M 2. 1. 30.
  5   8   9 F 3. 1. 80.
  6   4   5 M 2. 1. 10.
  7   4   5 M 2. 1.  5.
  8         M 3. 1. 80.
  9         F 1. 1. 75.
 10   8   9 F 1. 1. 50.
  .
  .
  .
 10 FAMILY 8                   3. Pedigree information
  1         M 3. 1. 80.        4. Individual data
  2         F 1. 1. 70.
  3   1   2 F 3. 1. 80.
  4   1   2 M 2. 1. 30.
  5   8   9 F 3. 1. 80.
  6   4   5 M 2. 1. 10.
  7   4   5 M 2. 1.  5.
  8         M 3. 1. 80.
  9         F 1. 1. 75.
 10   8   9 F 1. 1. 50.

1. Each pedigree record, consisting of the number of individuals in a pedigree and the pedigree ID (optional), will be read in format (I3,1X,A8).

2. Each individual record, consisting of an ID, parents' IDs, gender, MZ-twin status (blank), trait phenotype, trait phenotype again (by tabbing to the previous field), the observable marker phenotype indicator, and age, will be read in format (3(A3,1X),2A1,A2,T15,A2,A3,A4).

3. There are ten individuals in each of the eight pedigrees. The pedigree IDs are FAMILY 1, FAMILY 2, ..., and FAMILY 8.

4. For each individual: his/her ID, the IDs of both of his/her parents, his/her gender (using the symbols M and F as specified in the control file), a blank field for MZ-twin status, his/her trait phenotype, a 1. indicating that his/her marker phenotype should be simulated, and his/her age.

Example 2: Two Pedigrees, Autosomal Dominant Trait with Cumulative Normal Penetrance Function

Pedigrees 1 and 2 are segregating a heterogeneous autosomal dominant trait with complete penetrance by age 40. In pedigree 1, individuals 32, 35, 39, and 40 are unaffected, at risk, and below the age of 40; likewise, in pedigree 2, individuals 30, 33, 36, and 38 are unaffected, at risk, and below the age of 40.
The penetrance for these individuals is described by a cumulative normal function (PENOPT=2) with a mean age of 10.0, a standard deviation of 4.0, a minimum penetrance of 0.0, and a maximum penetrance of 1.0 for trait genotypes DD and Dd. The penetrance is 0.0 at all ages for trait genotype dd. The remaining pedigree members are either affected or unaffected and not at risk. The linked fraction of pedigrees is assumed to be 0.80. A related example is described by Boehnke (1986).

Control file: EXAMPLE2.CON

250 1 2 1 2 1 0 1 0.80
0.05 0.50               2. True rec. frac.
10.0 4.0 0.0 1.0        3. for males, DD
10.0 4.0 0.0 1.0           for males, Dd
0.0 4.0 0.0 0.0            for males, dd
10.0 4.0 0.0 1.0           for females, DD
10.0 4.0 0.0 1.0           for females, Dd
0.0 4.0 0.0 0.0            for females, dd
1 2                     4. male and female symbols
AUTODOM                 5. trait locus name
EXAMPLE2.LOC            6. locus file name
EXAMPLE2.PED            7. pedigree file name
3191 371 21713          8. seeds for random number generator

Locus file: EXAMPLE2.LOC

AUTODOM AUTOSOME 2 3           1. Trait locus information
D       .01                    2. Trait allele information
d       .99
1.       1                     3. Trait phenotype information
d/d                            4. Pheno/geno correspondence
2.       2                     3. Trait phenotype information
D/d                            4. Pheno/geno correspondence
d/d                            4. Pheno/geno correspondence
3.       1                     3. Trait phenotype information
D/d                            4. Pheno/geno correspondence
MARKER1 AUTOSOME 2 3           5. Marker locus information
A       .50                    6. Marker allele information
B       .50
AA       1                     7. Marker phenotype information
A/A                            8. Pheno/geno correspondence
AB       1                     7. Marker phenotype information
A/B                            8. Pheno/geno correspondence
BB       1                     7. Marker phenotype information
B/B                            8. Pheno/geno correspondence

Pedigree file: EXAMPLE2.PED

(I2,1X,A8)                     1. Pedigree record format
(3(A3,1X),2A1,A3,T15,3A3)      2. Individual record format
40 FAMILY 1                    3. Pedigree information
  1         1 3. 0. 80.        4. Individual data
  2         2 1. 0. 80.
  3         2 1. 0. 80.
  4   1   2 1 3. 0. 80.
  5   1   2 1 3. 0. 80.
  6         2 1. 1. 80.
  7         2 1. 1. 80.
  8   3   4 1 3. 1. 80.
  9   3   4 2 1. 1. 80.
 10   3   4 1 3. 1. 80.
 11         2 1. 1. 80.
 12         2 1. 1. 80.
 13   5   6 1 3. 1. 80.
 14   5   6 1 3. 1. 80.
 15         2 1. 1. 80.
 16   5   6 2 1. 1. 80.
 17   5   6 2 3. 1. 80.
 18         1 1. 1. 80.
 19   5   6 1 1. 1. 80.
 20         1 1. 1. 80.
 21   7   8 2 3. 1. 80.
 22   7   8 1 1. 1. 80.
 23   7   8 1 1. 1. 80.
 24   7   8 1 3. 1. 80.
 25  10  11 1 1. 1. 80.
 26  10  11 2 1. 1. 80.
 27  10  11 1 3. 1. 80.
 28  12  13 2 1. 1. 80.
 29  12  13 2 3. 1. 80.
 30  12  13 2 1. 1. 80.
 31  14  15 2 1. 1. 80.
 32  14  15 2 2. 1. 10.
 33  14  15 1 3. 1. 80.
 34  17  18 2 1. 1. 80.
 35  17  18 2 2. 1.  5.
 36  17  18 2 3. 1. 80.
 37  17  18 1 1. 1. 80.
 38  20  21 2 1. 1. 80.
 39  20  21 2 2. 1. 12.
 40  20  21 2 2. 1.  8.
38 FAMILY 2                    3. Pedigree information
  1         1 3. 0. 80.        4. Individual data
  2         2 1. 0. 80.
  3         1 1. 1. 80.
  4   1   2 2 3. 0. 80.
  5   1   2 2 3. 1. 80.
  6   1   2 2 1. 1. 80.
  7         1 1. 1. 80.
  8   3   4 2 3. 1. 80.
  9   3   4 2 1. 1. 80.
 10   3   4 1 3. 1. 80.
 11         2 1. 1. 80.
 12         1 1. 1. 80.
 13   7   8 2 3. 1. 80.
 14         1 1. 1. 80.
 15   7   8 2 3. 1. 80.
 16   7   8 2 3. 1. 80.
 17         1 1. 1. 80.
 18  10  11 2 1. 1. 80.
 19  10  11 1 3. 1. 80.
 20         2 1. 1. 80.
 21  12  13 1 1. 1. 80.
 22  12  13 1 1. 1. 80.
 23  14  15 2 1. 1. 80.
 24         2 1. 1. 80.
 25  16  17 1 3. 1. 80.
 26  16  17 2 3. 1. 80.
 27         1 1. 1. 80.
 28  16  17 1 3. 1. 80.
 29  16  17 1 3. 1. 80.
 30  16  17 1 2. 1. 17.
 31  19  20 1 3. 1. 80.
 32  19  20 2 3. 1. 80.
 33  19  20 1 2. 1. 13.
 34  24  25 1 1. 1. 80.
 35  24  25 1 3. 1. 80.
 36  26  27 2 2. 1.  8.
 37  26  27 1 1. 1. 80.
 38  26  27 2 2. 1. 10.

Example 3: Three Pedigrees, X-linked Recessive Trait with Two Flanking Marker Loci

The rare, X-linked recessive trait segregating in these pedigrees is Becker Muscular Dystrophy. The pedigrees BD28, BD78, and BD9 were taken from Brown et al. (1985) with some modification of ages. This trait has age-dependent penetrance, usually appearing in the 20s. However, all unaffected individuals in the line of descent of the trait are beyond the typical range of onset ages, so assuming complete penetrance is reasonable for a power calculation and will save computation time.
Therefore, the piecewise linear penetrance function used in the analysis has complete penetrance for individuals with trait genotype dd and 0.0 penetrance for individuals with trait genotype DD or Dd. Two flanking marker loci with a true map distance of 10 cM between them were used in the simulation.

Control file: EXAMPLE3.CON

250 2 1 1 1 1 1 0
0.10 1                  2. True map dist., dist. option
0.0 40.0 1.0 1.0        3. for males, dd
0.0 40.0 0.0 0.0           for males, Dd
0.0 40.0 0.0 0.0           for males, DD
0.0 40.0 1.0 1.0           for females, dd
0.0 40.0 0.0 0.0           for females, Dd
0.0 40.0 0.0 0.0           for females, DD
M F                     4. male and female symbols
XREC                    5. trait locus name
EXAMPLE3.LOC            6. locus file name
EXAMPLE3.PED            7. pedigree file name
2791 3903 1313          8. seeds for random numbers

Locus file: EXAMPLE3.LOC

XREC    X-LINKED 2 3           1. Trait locus information
d       .0001                  2. Trait allele information
D       .9999
1.       2                     3. Trait phenotype information
D/D                            4. Pheno/geno correspondence
D/d
2.       3                     3. Trait phenotype information
D/D                            4. Pheno/geno correspondence
D/d
d/d
3.       1                     3. Trait phenotype information
d/d                            4. Pheno/geno correspondence
MARKER1 X-LINKED 2 3           5. Marker locus information
A       .50                    6. Marker allele information
B       .50
AA       1                     7. Marker phenotype information
A/A                            8. Pheno/geno correspondence
AB       1                     7. Marker phenotype information
A/B                            8. Pheno/geno correspondence
BB       1                     7. Marker phenotype information
B/B                            8. Pheno/geno correspondence
MARKER2 X-LINKED 2 3           5. Marker locus information
Y       .50                    6. Marker allele information
Z       .50
YY       1                     7. Marker phenotype information
Y/Y                            8. Pheno/geno correspondence
YZ       1                     7. Marker phenotype information
Y/Z                            8. Pheno/geno correspondence
ZZ       1                     7. Marker phenotype information
Z/Z                            8. Pheno/geno correspondence

Note: The genotypes DD and dd must be included in this X-linked example so that the male hemizygous genotypes will be allowed for by MENDEL.

Pedigree file: EXAMPLE3.PED

(I3,1X,A8)                     1. Pedigree record format
(3(A3,1X),2A1,A2,T15,A2,A3,A4) 2. Individual record format
 10 BD28                       3. Pedigree information
  1         M 1. 0. 80.        4. Individual data
  2         F 1. 0. 80.
  3         M 1. 1. 80.
  4   1   2 F 1. 1. 80.
  5   1   2 M 3. 0. 80.
  6         F 1. 1. 80.
  7   1   2 M 3. 1. 80.
  8   3   4 M 3. 1. 80.
  9   5   6 M 1. 1. 80.
 10   5   6 F 1. 1. 80.
  7 BD78                       3. Pedigree information
  1         M 1. 1. 90.        4. Individual data
  2         F 1. 1. 85.
  3         M 1. 1. 65.
  4   1   2 F 1. 1. 60.
  5   1   2 M 3. 0. 60.
  6   1   2 M 1. 1. 60.
  7   3   4 M 3. 1. 33.
 12 BD9                        3. Pedigree information
  1         M 1. 0. 90.        4. Individual data
  2         F 1. 0. 90.
  3         M 1. 1. 90.
  4   1   2 F 1. 1. 90.
  5   1   2 M 1. 1. 90.
  6   3   4 M 3. 1. 62.
  7   3   4 M 3. 1. 64.
  8   3   4 M 3. 1. 66.
  9   3   4 F 1. 1. 63.
 10         M 1. 1. 66.
 11   9  10 M 3. 1. 36.
 12   9  10 M 3. 1. 40.

Example 4: One Pedigree with an Autosomal Dominant Quantitative Trait

The large nuclear family in this example is segregating an autosomal major locus for a quantitative trait. The mean trait value for an individual with the DD or Dd trait genotype is 10.0 plus 0.10 times the age of the individual; the standard deviation is 1.0. The mean trait value for an individual with the dd trait genotype is 5.0 and is not a function of age; the standard deviation is also 1.0.

Control file: EXAMPLE4.CON

250 1 3 1 3 1 1 0
0.00 0.10 0.50          2. True rec. frac.
10.0 0.10 1.0 0.0       3. for males, DD
10.0 0.10 1.0 0.0          for males, Dd
5.0 0.0 1.0 0.0            for males, dd
10.0 0.10 1.0 0.0          for females, DD
10.0 0.10 1.0 0.0          for females, Dd
5.0 0.0 1.0 0.0            for females, dd
M F                     4. male and female symbols
QUANT                   5. trait locus name
EXAMPLE4.LOC            6. locus file name
EXAMPLE4.PED            7. pedigree file name
3191 371 21713          8. seeds for random number generator

Locus file: EXAMPLE4.LOC

QUANT   AUTOSOME 2 0           1. Trait locus information
D       .01                    2. Trait allele information
d       .99
MARKER1 AUTOSOME 2 3           5. Marker locus information
A       .50                    6. Marker allele information
B       .50
AA       1                     7. Marker phenotype information
A/A                            8. Pheno/geno correspondence
AB       1                     7. Marker phenotype information
A/B                            8. Pheno/geno correspondence
BB       1                     7. Marker phenotype information
B/B                            8. Pheno/geno correspondence

Pedigree file: EXAMPLE4.PED

(I2,1X,A8)                     1. Pedigree record format
(3(A3,1X),3A1,A4,A3,A4)        2. Individual record format
15 QUANT                       3. Pedigree information
  1         M   20. 1. 80.     4. Individual data
  2         F    5. 1. 70.
  3   1   2 M   19. 1. 55.
  4   1   2 F   16. 1. 52.
  5   1   2 M   16. 1. 50.
  6   1   2 M   14. 1. 48.
  7   1   2 M   15. 1. 46.
  8   1   2 F    6. 1. 44.
  9   1   2 M    4. 1. 41.
 10   1   2 F   17. 1. 39.
 11   1   2 F   16. 1. 36.
 12   1   2 M    5. 1. 35.
 13   1   2 F   12. 1. 33.
 14   1   2 F    6. 1. 31.
 15   1   2 M    5. 1. 29.

Note: A blank must be present in the first trait phenotype field for a quantitative trait.

IX. Array Sizes, File Management, and Other Practical Hints

The maximum sizes of the variables and arrays in SIMLINK are initially set according to the values of the following variables:

          Initial
Variable    Value  Description
MAXALL          4  maximum number of marker alleles
MAXCON          9  maximum number of constants for comparing to lod/location scores
MAXGEN         10  maximum number of marker genotypes
MAXP            4  maximum number of people on whom a person's conditional probabilities can depend
MAXPED         20  maximum number of pedigrees
MAXPEO        100  maximum number of people per pedigree
MAXPHN         10  maximum number of marker phenotypes
MAXTH           8  maximum number of true recombination fractions/map distances
MAXTOT        200  maximum number of people in entire data set
MAXTST          8  maximum number of test recombination fractions/map distances
MXGLST       1200  maximum size of GLIST array
MXMG         6400  maximum size of MARGEN array
MXMP         3200  maximum size of MKPHEN array
MXTM         1600  maximum size of the hetero/homozygos arrays
MXPLST        800  maximum size of PLIST array
MXPROB      16200  maximum size of CONDPR array
MXTEMP         81  maximum size of TEMPPR array (maximum number of conditional probabilities per person)
LENC          200  maximum size of CARRAY array for MENDEL
LENI         5000  maximum size of IARRAY array for MENDEL
LENL          100  maximum size of LARRAY array for MENDEL
LENR         5000  maximum size of RARRAY array for MENDEL

To modify these dimensions, as you will almost certainly need to do, modify the parameter statement in SIMLINK.FOR for the variable in question.
This may be accomplished by using a file editor. Then recompile SIMLINK.FOR and link the .OBJ files.

Note: Many of the maximum sizes listed above are interrelated, so that if one is altered, others may need to be as well. The relationships are given below:

   MAXTH  = maximum number of recombination fractions
   MAXTOT = maximum total number of people in the data set
   MAXLOC = maximum number of loci (1 or 2)
   MAXP   = maximum number of individuals on whom someone's conditional genotype probabilities might depend (roughly speaking, no more than 3 + the number of loops in a pedigree)
   MXGLST = MAXTOT*3*2 (where 3 is the number of possible trait genotypes and 2 is the number of haplotypes)
   MXMG   = MAXTOT*MAXTH*MAXLOC*2 (where 2 is the number of haplotypes)
   MXMP   = MXMG/2
   MXPLST <= MAXTOT*MAXP
   MXPROB <= MAXTOT*3**MAXP (where 3 is the number of possible trait genotypes and "**" represents exponentiation)
   MXTEMP = 3**MAXP (where 3 is the number of possible trait genotypes)
   MXTM   = MAXTOT*MAXTH

Note: MAXTST must be greater than or equal to the number of test recombination fractions/map distances (NTEST).

X. Error Conditions

When SIMLINK stops without completing the desired analysis, error messages may be found (1) on the screen, (2) in the output file, or (3) in the file SIMERR.SCR. SIMDOC.SCR can be consulted to determine the correspondence between input IDs and MENDEL IDs.

The most frequent error encountered when using SIMLINK is an insufficient array size for one of a large variety of arrays. This can be dealt with by editing SIMLINK.FOR, identifying the PARAMETER statement associated with the array dimension that is too small, increasing its value, recompiling SIMLINK.FOR, and linking the program.

NOTE: On a microcomputer using MICROSOFT FORTRAN, it may not be possible to make all arrays sufficiently large because of the 640K limitation of DOS.
In such cases, possible solutions include:

   (a) limiting the number of trait genotypes whenever possible (see above);
   (b) decreasing the number of marker alleles;
   (c) decreasing some array dimensions if possible;
   (d) calculating lod scores rather than location scores;
   (e) using the F77L-EM/32 compiler; or
   (f) using a larger computer.

As you encounter other errors that are not clearly explained by the error message(s) provided, I would appreciate knowing about them so that I can add them to this documentation and/or add better error messages to the program.

XI. References

Boehnke M (1986) Estimating the power of a proposed linkage study: a practical computer simulation approach. American Journal of Human Genetics 39:513-527.

Brown CS, Thomas NST, Sarfarazi M, Davies KE, Kunkel L, Pearson PL, Kingston HM, Shaw DJ, Harper PS (1985) Genetic linkage relationships of seven DNA probes with Duchenne and Becker muscular dystrophy. Human Genetics 71:62-74.

Haldane JBS (1919) The combination of linkage values and the calculation of distances between the loci of linked factors. Journal of Genetics 8:299-309.

Lange K, Boehnke M, Weeks D (1988) Documentation for MENDEL, Version 2.3, November, 1988.

Lange K, Weeks D, Boehnke M (1988) Programs for pedigree analysis: MENDEL, FISHER, and dGENE. Genetic Epidemiology 5:471-472.

Morton NE (1955) Sequential tests for the detection of linkage. American Journal of Human Genetics 7:277-318.

Ploughman LM, Boehnke M (1989) Estimating the power of a proposed linkage study for a complex genetic trait. American Journal of Human Genetics 44:543-551.

Risch N (1989) Linkage detection tests under heterogeneity. Genetic Epidemiology 6:473-480.

Smith CAB (1963) Testing for heterogeneity of recombination fraction values in human genetics. Annals of Human Genetics 27:175-182.

Wichman BA, Hill ID (1982) An efficient and portable pseudo-random number generator. Applied Statistics 31:188-192.