SIBMED
(SIBpair Mutation and Error Detection)
Version 1.0


April 18, 2000


Julie A. Douglas and Michael Boehnke

University of Michigan
Department of Biostatistics
School of Public Health
1420 Washington Heights
Ann Arbor, Michigan 48109-2029

Phone: (734) 647-9466 and (734) 936-1001
FAX:  (734) 763-2215
Email: jddoug@umich.edu and boehnke@umich.edu


TABLE OF CONTENTS

Introduction
Assumptions
Input
Results and Output
Advice on Running SIBMED
Default Array Dimensions
Compiling and Running SIBMED
Error Conditions and User Support
Acknowledgements
References

Introduction

	Sib-pair designs are routinely employed in linkage studies of complex 
diseases and quantitative traits.  When parents or additional sibs are 
available, genotyping errors and mutations can often be detected as Mendelian 
incompatibilities or apparent double recombinants.  However, given only sib-pair 
data, genotyping errors and mutations cannot be detected as Mendelian 
incompatibilities.  Still, they can be inferred probabilistically given a 
sufficiently dense map of markers.  

	Genotyping errors occur when the observed genotype does not correspond to 
the true underlying genotype, owing to laboratory error or incorrect data 
interpretation or entry.  Marker mutations, which are relatively frequent for 
short tandem repeat polymorphisms (Weber and Wong 1993), can mimic genotyping 
errors.  The net effect of undetected genotyping errors and mutations for 
affected sib-pair designs is to diminish evidence for linkage.  

	SIBMED is a FORTRAN 77 program that identifies likely genotyping errors 
and mutations for a sib pair in the context of multipoint mapping.  
Specifically, using a hidden Markov model, the program calculates the posterior 
probability of genotyping error or mutation for each sib-pair-marker 
combination, given all the available marker data, an assumed genotype-error 
rate, and a known genetic map.  The subset of combinations for which this 
posterior error probability is high may be considered for exclusion, review, or 
retyping.

	Because the empirical distributions of the posterior error probabilities 
are a function of map density, prior error rate, marker position, and marker 
allele frequency, we use Monte Carlo simulation to determine marker-specific 
cutoff constants in order to identify unusually high posterior error 
probabilities.  Monte Carlo simulation has the advantage of being completely 
specific to the data at hand, thereby providing a mechanism for controlling the 
false positive rate.  Subject to small false positive rates (0.001), SIBMED does 
not generate false evidence for linkage by removing incorrectly identified 
errors.  We recommend the use of larger false positive rates ONLY if re-scoring 
or retyping will be carried out.  

	Optionally, SIBMED allows users to forgo Monte Carlo simulation, and 
instead produce a list of high posterior error probabilities with the associated 
sib-pair-marker combination information.  Genotypes for which the posterior 
error probabilities are extremely high, for example, greater than 0.90, can be 
confidently excluded.    

	The accuracy of error detection using SIBMED is directly related to map 
density; it is most advantageous for dense maps (1-3 cM marker spacing).  The 
method and additional information are described in Douglas et al. (2000).


Assumptions

1. Only full sibs are considered; MZ-twin pairings are excluded.
2. All genetic markers are autosomal and codominant, and marker allele 
frequencies are known.
3. Intermarker distances are known, and there are no sex differences in 
recombination fractions.
4. The assumed relationship (i.e. full sibs) is correct.
5. Hardy-Weinberg and linkage equilibrium.
6. No genetic interference (for purposes of calculating multipoint genotype 
probabilities).
7. Kosambi's mapping function is used to convert map distances to recombination 
fractions.
8. Prior genotype-error rate e (0e1).
9. No marker has more than 99 alleles, and no more than 9999 markers are to be 
analyzed.  Chromosomes are of length no more than 50 Morgans.


Input

	SIBMED requires three input files:  (1) a control file that describes the 
basic analysis parameters and seeds for Monte Carlo simulation (optional), (2) a 
locus file that describes the marker loci to be analyzed, and (3) a pedigree 
file that describes the structure and genotype data for the families to be 
analyzed.  The locus and pedigree files are similar to those required by MENDEL 
(Lange et al. 1988) and identical to those required by RELPAIR (Boehnke and 
Duren, 1997).  

(1) Control File

	The control file contains information describing the analyses to be 
carried out.  It includes the following 11 (or 14) records:

(i) locus file name, left justified in columns 1-72
(ii) pedigree file name, left justified in columns 1-72
(iii) result file name, left justified in columns 1-72
(iv) output file name, left justified in columns 1-72
(v) echo marker map indicator, in column 1 ("y" if the map information should be 
echoed to the result file, "n" if not; note--do not enter the quotation marks)
(vi) echo pedigree data indicator, in column 1 ("y" if the pedigree data should 
be echoed to the result file, "n" if not; note--do not enter the quotation 
marks)
(vii) female symbol, left justified in columns 1-8 (for example, "F")
(viii) male symbol, left justified in columns 1-8 (for example, "M")
(ix) prior error rate ((.000001), left justified in columns 1-8 
(x) simulation indicator, in column 1 ("y" if the simulation should be 
performed, "n" if not; note--do not enter quotation marks)
(xi) posterior error probability cutoff constant ((.000001), left justified in 
columns 1-8 

or

(xi) prescribed false positive rate ((.000001), left justified in columns 1-8
(xii) first random number seed (32,767), right justified in columns 1-5
(xiii) second random number seed (32,767), right justified in columns 1-5
(xiv) third random number seed (32,767), right justified in columns 1-5


For example, the control file below specifies a prior error rate of 1% and a 
posterior error probability cutoff of 90%:

	sample.loc
	sample.ped
	sample.res
	sample.out
	y
	y
	F
	M
	0.01
	n
	0.90
End of file symbol

In contrast, the control file below specifies a prior error rate of 1% and a 
prescribed false positive rate of 0.1%:

	sample.loc
	sample.ped
	sample.res
	sample.out
	y
	y
	F
	M
	0.01
	y
	0.001
	18438
	 7390
	20094
	End of file symbol

Note that the control file must end with some kind of end of file symbol.  On 
some computers, PCs for example, this is automatically done, and the symbol is 
invisible.  On other computers, there is a visible or partially visible symbol.  
All FORTRAN 77 compilers have an ENDFILE command if it is necessary to produce 
the end of file symbol.  


 (2) Locus File

The locus file contains information describing the marker loci and map 
distances.  Marker loci are provided one locus at a time with the following 
records required for each locus:

(i) locus identifier record in (2A8,2I2,I4,F8.5) format specifying
(a) locus name, in columns 1-8
(b) AUTOSOME or X-LINKED, depending on whether the locus is autosomal or X-
linked, in columns 9-16 (SIBMED does not currently handle the X-linked case, so 
this option will be ignored)
(c) number of alleles (<=99), right justified in columns 17-18
(d) number of phenotypes, right justified in columns 19-20 (SIBMED only handles 
codominant marker loci, so this value must be 0 by convention)
(e) chromosome on which the locus is located, right justified in columns 21-24 
(1-22)
(f) map position of the locus (in Morgans) relative to one end of the 
chromosome, in columns 25-32

Note:  Kosambi's mapping function is used to convert map distances to 
recombination fractions.  Changing the program FUNCTION MAPFUN can easily modify 
this choice.  

(ii) for each allele, a record in (A8,F8.5) format specifying
(a) allele name, in columns 1-8
(b) allele population frequency, in columns 9-16

	For example, consider the locus file below:

	MARKER1 AUTOSOME 3 0   1    0.00
	1       .18
	2       .55
	3       .27
	MARKER2 AUTOSOME 4 0   1    0.02
	1       .01
	2       .07
	3       .86
	4       .06
	MARKER3 AUTOSOME 3 0   1    0.05
	1       .32
	2       .61
	3       .07
	MARKER4 AUTOSOME 5 0   1    0.07
	1       .14
	2       .02
	3       .79
	4       .01
	5       .04	
End of file symbol

This particular file specifies alleles and their frequencies for 4 autosomal 
marker loci with intermarker distances of 2, 3, and 2 cM, respectively.  	

The locus file and the pedigree file must be coordinated so that the genotype 
fields for individuals match exactly the order of the loci in the locus file.  
Thus, a pedigree appropriate to the above locus file must have MARKER1 and 
MARKER2 as the sixth and seventh items in each individual record (see below).  
Finally, the locus file must end with some kind of end of file symbol (see 
Control File).


(3) Pedigree File

	The pedigree file contains records for pedigrees and records for 
individuals within each pedigree.  A pedigree record specifies the number of 
people in a pedigree and an ID for the pedigree. An ID can have a maximum of 
eight characters.  Individual records for each person in the pedigree follow the 
pedigree record.  Each individual record provides an individual ID, the IDs of 
the individual's parents if present in the pedigree (blank otherwise), his/her 
sex, identical twin status, and all his/her marker genotypes. The order of these 
items is fixed.  Also, note that individual IDs must be unique within a 
pedigree.    
 
	The term pedigree has certain technical connotations.  To properly 
reconstruct relationships between individuals, individuals must often be 
included who are dead or otherwise unavailable for study.  As a general rule of
thumb, either both parents or neither parent of an individual must be listed in
the pedigree.

	Preceding the pedigree and individual records are two FORTRAN format 
records for reading in the data.  The first format record specifies the format 
for reading the pedigree records.  It consists of an integer format (I) for 
reading the number of individuals in a pedigree and a character format (A) for 
reading the pedigree ID.  (I2,1X,A8) gives a typical example.  Note that since 
the number of pedigree members is read in I format, this number must be right 
justified in its field, in this example, in columns 1 and 2.  The second format 
directs the reading of the individual records.  It is very important that the 
format records match the corresponding pedigree data exactly.  Note that a 
common source of error is mismatch between the number of markers specified in 
the format record and the actual number of genotypes present for each individual 
in the pedigree file.  


	For example, the pedigree file below contains genotype data for 3 
pedigrees (a total of 5 sib pairs) and 4 marker loci:

	(I2,1X,A8)
	(3A8,2A1,A3,3(1X,A3))
	 6 FAMILY1
	FATHER                  M
	MOTHER                  F
	MA                      F
	CHILD1  FATHER  MOTHER  M 1/2 1/1 1/3 3/5
	CHILD2  FATHER  MA      M 3/3 1/3 2/2 1/5
	CHILD3  FATHER  MOTHER  M 3/3 1/3 2/2 1/5
	 4 FAMILY2
	DAD                     M
	MOM                     F
	KID1    DAD     MOM     M 1/1 2/2 1/1 4/4
	KID2    DAD     MOM     F 2/3 1/2 2/3 1/5
	 4 FAMILY3
	PA                      M                
	MA                      F                
	BROTHER PA      MA      M 2/2 3/4     2/3
	SISTER  PA      MA      F 1/2 2/4 2/2 1/4
	End of File Symbol

	In this example, the individual ID and parental IDs are read in A8 format.  
Although the mother is listed as the second parent in the pedigree, there is no 
requirement that the first parent be the father and the second parent the 
mother.  Sex and MZ twin status are read in A1 format while MARKERS 1-4 are each 
in A3 format.  The 1X format item allows one space between reading two fields.  
All items or fields in an individual record should be read in character format 
(A), and with the exception of the marker genotypes, should consist of eight 
characters or fewer.  Marker genotypes are represented by two allele labels 
separated by a slash (/), where the allele labels are identical those in the 
locus file.  Recall (from Locus File), each allele should consist of eight 
characters or fewer.
	Missing values for any field are represented by blanks.  For a genotype, 
either zero or two alleles are present, never one.  In addition, we have used an 
'F' for female and 'M' for male in the above example.  Because we currently are 
not allowing for X-linkage, this variable is ignored in the analysis.  Different 
female and male symbols may be specified, if desired (see Control File).  
Moreover, the MZ twin field is left blank in this example because there are no 
identical twins.  If there are identical twins (triplets, etc.) in a pedigree, 
then each pair (trio, etc.) of them should be assigned the same unique-to-the-
twinship non-blank identifier.  Finally, the pedigree file must end with some 
kind of end of file symbol (see Control File).
	

Results and Output

	The SIBMED result file begins with a short header, followed by the 
information used in the analysis and simulations, including the names of the 
locus and pedigree files, the prior error rate, and the prescribed false 
positive rate or posterior error probability cutoff constant.  Optionally (see 
Control File), the marker map and/or pedigree data are then printed.  The result 
file concludes with a summary of the results.  The first section provides a 
detailed listing of the sib pairs identified with likely genotyping errors or 
mutations, including the pedigree ID, the individual sib IDs, the marker name 
and position (in cM), and the associated posterior error probability. The second 
section concludes with the total number of sib-pair-marker combinations 
identified with likely genotyping errors or mutations and the total number of 
sib pairs and sib-pair-marker combinations analyzed.  
	
	An example result file is given below:
	
 ***************************************************** 
 *************** SIBMED.F VERSION 1.0 **************** 
 ************** WRITTEN BY J.A. DOUGLAS ************** 
 ****************** APRIL 17, 2000 ******************* 
 ***************************************************** 

 LOCUS FILE:   sample.loc                                      

 PEDIGREE FILE:   sample.ped                                      

 ____________________________________________________________________________
 
 PRIOR ERROR RATE =      0.0100

 PRESCRIBED FALSE POSITIVE RATE =  0.0010

 ____________________________________________________________________________
 
 ***SUMMARY OF RESULTS*** 
 ____________________________________________________________________________
 
 PAIRS IDENTIFIED WITH LIKELY MARKER GENOTYPING ERRORS 
 OR MUTATIONS: 
 ____________________________________________________________________________
 
 FAMILY    SIB1      SIB2      MARKER     POS(CM)   PR.ERROR
 ____________________________________________________________________________
 
 FAMILY1   CHILD1    CHILD3    MARKER3      5.0      0.282287
 ____________________________________________________________________________
 
 SIBMED IDENTIFIED   1 SIB-PAIR-MARKER COMBINATIONS
 ____________________________________________________________________________
 
    4 SIB PAIRS ANALYZED 
      16 SIB-PAIR-MARKER COMBINATIONS CONSIDERED
 ____________________________________________________________________________
 

	The SIBMED output file provides the flagged sib-pair-marker combination 
information in column format, which can then be easily manipulated by other 
software packages.  As in the result file, the first column corresponds to the 
pedigree ID, the second and third columns correspond to the sib IDs, the fourth 
and fifth columns list the marker name and position (in cM), respectively, and 
the last column gives the posterior error probability.  For example, the output 
file corresponding to the previous result file is given below:  

 FAMILY1   CHILD1    CHILD3    MARKER3      5.0      0.282287


Advice on Running SIBMED

	If the number of sib-pair-marker combinations identified with likely 
genotyping errors or mutations is uncomfortably large, we recommend using a 
smaller prescribed false positive rate.  Again, we do not recommend using false 
positive rates greater than 0.001 unless re-scoring or retyping will be carried 
out.  Even then, the advantage of more complete error detection must be balanced 
against the additional effort required to re-score and possibly retype a much 
larger subset of potential errors and mutations.  Removing likely errors under 
less restrictive false positive rates can have the unfortunate consequence of 
introducing false linkage information. 

Inspecting results family by family may be helpful in sorting out who among 
several flagged sib-pair-marker combinations has a likely genotyping error or 
mutation.  For example, in a sibship of size three, if two of the three possible 
pairings of sibs yield relatively large posterior error probabilities for the 
same marker, then the sib in common to both of these pairings may be inferred to 
have a likely genotyping error or mutation.  Similarly, examining results marker 
by marker may be helpful in identifying problem markers.  The output file can be 
used for these post-processing purposes.  

Default Array Dimensions

	The maximum array dimensions for SIBMED are initially set by the values of 
the following variables:

Variable Name	Initial Value		Variable Description
			   
MAXALL		40	             	Maximum number of alleles for a genetic marker
MAXFAM		2,000             		Maximum number of families in the sample	
MAXLOC		500		   	Maximum number of genetic markers to analyze
MAXPEO		30	             	Maximum number of people per family
MXPTOT		8,000              		Maximum total number of people in the sample
MAXSIB		10,000			Maximum number of sib pairs in the sample		
NSIM			1,000,000		Maximum number of sib pairs in the simulation

	If you wish to choose your own array dimensions, modify the PARAMETER 
statement on the first non-comment lines of SIBMED (currently lines 58-59).  
This may be accomplished by using a file editor.  Then recompile the program as 
described below.

	If the data being analyzed are too large for the current array dimensions, 
the program will issue an error message and will terminate.  The error message 
will give the specific parameter whose value needs to be changed.


Compiling and Running SIBMED

	SIBMED has been developed on microcomputers running in MS-DOS* mode 
(Pentium) and on workstations running UNIX (System V Release 4.0).  Source code 
as well as executables for both these platforms is available on our web site.  
If you are using one of these platforms and if the default array dimensions with 
the shipped versions (see above) are sufficient for your problem, you can skip 
compiling and just do the following:

(1) Copy the appropriate executable file to your hard disk.

(2) Create the required input files as described above.

(3) Run the program by typing SIBMED<CR>, where <CR> represents a carriage 
return. SIBMED will print a header to the screen and ask you to specify the 
control file name.  Once this is done, SIBMED will attempt to carry out the 
analyses requested.

	The DOS executable files were compiled and linked using Leahy Fortran 90 
Release 1.00i with the commands:

lf90 SIBMED.FOR -bind -fix -tp -o3 -exe SIBMED

These commands may be found in the file SIBMED.BAT.

	The UNIX executable files were compiled and linked using f77 with the 
commands:

f77 SIBMED.F -s -dalign -xlibmopt -native -O4 -o SIBMED

These commands may be found in the file SIBMED.MAK.

	Since SIBMED is written in standard FORTRAN 77, we anticipate that the 
program should be portable with little or no modification to other platforms as 
well.  Please let us know if you have any difficulties with this.  Also, we 
would appreciate learning of the changes required for such transfers.

	*Note that due to the way Windows 95 allocates memory, you will not likely 
be able to run SIBMED from the MS-DOS prompt window.  Instead, you should re-
boot to MS-DOS mode.

Error Conditions and User Support

	When SIBMED stops without completing the desired analyses, error messages 
may be found either on the screen, in the result file, or both.  ALWAYS CHECK 
THE RESULT FILE FOR ERROR MESSAGES IF THE PROGRAM STOPS UNEXPECTEDLY!  After 
correcting the error(s) noted, the program can be re-run.

	While we have tried to carefully document SIBMED and to make it reasonably 
easy to use, problems will undoubtedly arise.  Before calling us, please read 
the documentation carefully.  It is our experience with other programs that many 
of the questions we are asked are answered in the documentation.  If after 
reading the documentation you are still having problems, help on an as-available 
basis can be obtained by phone, letter, fax, or e-mail from either Julie A. 
Douglas or Michael Boehnke.

	If you think you have found a bug in our program, or an error in the 
documentation, we also would like very much to know about it.  Since this is the 
first distribution version of SIBMED, errors may be found.  We will do our best 
to correct any errors discovered as quickly as possible and inform all users 
immediately.

	Although SIBMED is distributed free of charge, please do not pass on a 
copy of the programs to others.  Instead, please ask anyone wishing to use the 
programs to obtain them directly from our website 
(http://www.sph.umich.edu/csg/software).  This procedure allows us to keep 
accurate records of software users and to make users aware of improvements as 
they are made and errors as they are corrected.

Acknowledgements

	We thank Bill Duren for making subroutines from his computer program 
available to us.  This research was supported by NIH grants T32 HG00040 (JAD), 
R01 HG00376 (MB), and a University of Michigan Rackham Predoctoral Fellowship 
(JAD).

References

Douglas JA, Boehnke M, Lange K (2000) A multipoint method for detecting 
genotyping errors and mutations in sibling-pair linkage data.  Am J Hum Genet 
66: 1287-1297

Boehnke M, Duren WL (1997) RELPAIR:  Software for assessing pairwise 
relationships using marker data, Version 0.90


Kosambi DD (1944) The estimation of map distances from recombination values.  
Ann Eugen 12:172-175

Lange K, Weeks D, Boehnke M (1988) Programs for pedigree analysis:  MENDEL, 
FISHER, and dGENE.  Genet Epidemiol 5:471-472

Weber JL, Wong C (1993) Mutation of human short tandem repeats.  Hum Mol Genet 
2: 1123-1128