MERLIN Tutorial -- Simulation
When interpreting results for pedigree analysis, it is extremely
helpful to know how often a similar result might arise by chance. For
example, in a linkage analysis it may be helpful to know how many
peaks of similar height are expected conditional on the set of phenotypes
being analysed and the available marker map. When investigating
suspicious genotypes, it is important to characterize the false-positive
rate for error detection procedures.
MERLIN can perform gene-dropping simulations, which replace input data
with simulated chromosomes conditional on family structure, actual
marker spacings, allele frequencies and missing data patterns. The
procedure for generating simulated data is described in the reference
section.
For this example, we will use a data set from the examples
subdirectory of the MERLIN distribution as input. You can also
find the example data on the download page.
Estimating false positive rates for error detection
In the error detection tutorial, we identified 7 pairs of unlikely
genotypes in a 20-marker, 5-cM scan of 200 sib-pairs, corresponding to
8,000 total genotypes. The data are organized into three files: a
pedigree file summarizing genotypes and relationships (error.ped), a
data file describing the contents of the pedigree file (error.dat), and
a map file providing marker locations (error.map).
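The genotype total quoted above follows from simple arithmetic, sketched
here in shell. The sketch assumes that only the two siblings in each pair
are genotyped and that each flagged pair contributes two genotypes; the
tutorial does not state either assumption explicitly, and the variable
names are illustrative.

```shell
# Assumption: only the two siblings in each sib-pair are genotyped.
sib_pairs=200
sibs_per_pair=2
markers=20
total_genotypes=$((sib_pairs * sibs_per_pair * markers))
echo "total genotypes: $total_genotypes"

# Assumption: each flagged pair contributes two genotypes.
flagged_pairs=7
flagged_genotypes=$((flagged_pairs * 2))

# awk handles the floating-point division that shell arithmetic lacks.
awk -v f="$flagged_genotypes" -v t="$total_genotypes" \
    'BEGIN { printf "fraction flagged: %.5f\n", f / t }'
```

The fraction printed is the share of genotypes flagged in the real data,
which is the figure the simulated replicates below are compared against.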
To review a descriptive summary of the dataset, you could run pedstats:
prompt> pedstats -d error.dat -p error.ped
To review the original set of unlikely genotypes, you could use MERLIN's
automated error analysis:
prompt> merlin -d error.dat -m error.map -p error.ped --error
To estimate false positive rates, we will request that MERLIN analyse a
simulated data set with identical allele frequencies and marker spacing by
using the --simulate command line option. Try it out!
prompt> merlin -d error.dat -m error.map -p error.ped --error --simulate
You should first see the MERLIN start-up screen and summary of selected options.
Note that the options --error and --simulate are selected. Note also
that the current random seed is 123456. This seed indicates which simulated replicate
will be used, and selecting a different seed produces an alternative simulated data
set.
MERLIN 0.8.4 - (c) 2000-2001 Goncalo Abecasis
The following parameters are in effect:
Data File : error.dat (-dname)
Pedigree File : error.ped (-pname)
Missing Value Code : -99.999 (-xname)
Map File : error.map (-mname)
Allele Frequencies : ALL INDIVIDUALS (-f[a|e|f|file])
Steps Per Interval : 0 (-i9999)
Random Seed : 123456 (-r9999)
Data Analysis Options
General : --error [ON], --ibd, --kinship, --information
Linkage : --npl, --pairs, --qtl, --deviates, --vc
Haplotyping : --best, --sample, --all, --founders
Recombination : --zero, --one, --two, --three, --singlepoint
Limits : --bits [24], --megabytes
Output : --quiet, --markerNames
Simulation : --simulate [ON], --save
Additional : --simwalk2, --matrices, --swap
This start-up screen should be followed by an error detection analysis for the
replicate, which should indicate a single pair of unlikely genotypes:
Family: 38 - Founders: 2 - Descendants: 2 - Bits: 2
MRK6 genotype for individual 3 is unlikely [0.021855]
MRK6 genotype for individual 4 is unlikely [0.021855]
NOTE: In many newer versions of MERLIN, you may not find any unlikely genotypes
in the replicate produced with the default seed. This is not a problem; it merely
reflects the low false positive rate of the procedure. Continue reading to learn
how to use a different seed...
So MERLIN flags a single pair of unlikely genotypes in this particular replicate...
Is this typical of other replicates? There are two ways to investigate the issue further.
One option is to generate additional replicates, one at a time, by
repeating the above procedure with a different random seed. To do this, you will
need to set the -r command line option. The following command repeats the
previous analysis but sets the random seed to 1234, thus generating a different
set of simulated data:
prompt> merlin -d error.dat -m error.map -p error.ped --error --simulate -r 1234
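Repeating this by hand for several seeds can be wrapped in a small shell
loop. The following is a minimal sketch, not part of MERLIN itself: the
seed list is arbitrary, the helper name count_unlikely is hypothetical,
and it assumes each flagged genotype is reported on one line containing
"is unlikely", as in the sample output above.

```shell
# Count the lines that the error analysis marks as unlikely.
# Assumption: one "is unlikely" line per flagged genotype.
count_unlikely() {
    # grep -c exits non-zero when nothing matches, so mask the status
    grep -c 'is unlikely' || true
}

# Hypothetical seed list; any distinct integers would do.
seeds="1234 2345 3456 4567 5678"

if command -v merlin >/dev/null 2>&1; then
    for seed in $seeds; do
        n=$(merlin -d error.dat -m error.map -p error.ped \
                   --error --simulate -r "$seed" | count_unlikely)
        echo "seed $seed: $n unlikely genotypes"
    done
else
    echo "merlin not found on PATH; skipping simulation runs" >&2
fi
```

Each line of output then gives the number of flagged genotypes in one
simulated replicate, so replicates can be compared at a glance.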
Another option is to request that MERLIN loop through the simulation procedure
multiple times. This option is available through the --reruns command line
option in newer versions of MERLIN. To analyse 20 simulated datasets, try:
prompt> merlin -d error.dat -m error.map -p error.ped --error --simulate --reruns 20
Either way, it is straightforward to repeat any MERLIN analysis on simulated
chromosomes and to estimate false-positive rates for error detection or linkage
analysis (note that MERLIN does not change input phenotypes or disease status
when conducting simulations).
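Once flagged genotypes have been counted across replicates, a per-genotype
false-positive rate is the count divided by the total number of simulated
genotypes. The sketch below shows that calculation; the function name
fp_rate is hypothetical and the numbers in the example call are
illustrative only, not results from an actual run.

```shell
# Estimate a per-genotype false-positive rate from simulated replicates.
# Arguments: flagged genotype count, genotypes per replicate, replicates.
fp_rate() {
    awk -v f="$1" -v g="$2" -v r="$3" \
        'BEGIN { printf "%.6f\n", f / (g * r) }'
}

# Illustrative numbers only: 8 flagged genotypes across
# 20 replicates of 8,000 genotypes each.
fp_rate 8 8000 20
```

With MERLIN installed, the first argument could come from counting the
"is unlikely" lines in the output of the --reruns command, assuming each
replicate reports flagged genotypes in the same format as a single run.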
Although we focused on simulating data under the null hypothesis (that is, on simulating
random genotypes that are independent of the phenotype and genotype data), MERLIN can
also simulate quantitative trait loci associated with a specific simulated phenotype. The
procedure for these simulations under the alternative hypothesis is sketched out in the
reference section.
Now that you have seen how to generate simulated replicates, you could proceed
to haplotype analysis or IBD estimation. If you haven't already done so, you
could try the linkage or error detection tutorials.