University of Michigan Center for Statistical 


MERLIN Tutorial -- Linkage Analysis

Linkage analysis tests for co-segregation of a chromosomal region and a trait of interest. In this section, we will walk through a basic non-parametric and variance components linkage analysis using MERLIN.

For this example, we will use a simulated data set that you will find in the examples subdirectory of the MERLIN distribution or on the download page.

The dataset consists of a simulated 5-cM scan of chromosome 24 in 200 affected sib-pair families and is organized into three files: a data file (asp.dat), a pedigree file (asp.ped) and a map file. An overview of MERLIN input files is available elsewhere.

The recommended first step in any analysis is to verify that input files are being interpreted correctly. So let's start by running pedstats... Pedstats requires an input data file (-d parameter) and pedigree file (-p parameter):

prompt> pedstats -d asp.dat -p asp.ped

By examining the abbreviated pedstats output below, you should be able to confirm that there are 200 pedigrees, each with 4 individuals (two affected siblings and their parents). Among phenotyped individuals, the prevalence of the disease is 100% (there are no unaffected individuals in the sample) and the pedigree file also includes a quantitative trait. In addition, there are no phenotyped or genotyped founders.

Pedigree Statistics
(c) 1999-2001 Goncalo Abecasis

The following parameters are in effect:
            QTDT Pedigree File :         asp.ped (-pname)
                QTDT Data File :         asp.dat (-dname)
            Missing Value Code :         -99.999 (-xname)

          Individuals: 800 (400 founders, 400 nonfounders)
             Families: 200
 Average Family Sizes: 4.00
  Average Generations: 2.00

                   [Phenotypes]      [Founders]       Mean        Var
          trait      400  50.0%        0   0.0%      0.021      1.496

                  [Diagnostics]      [Founders] Prevalence
      affection      400  50.0%        0   0.0%     100.0%
          Total      400  50.0%        0   0.0%

                    [Genotypes]      [Founders]     Hetero      
           MRK1      400  50.0%        0   0.0%      72.8%      
           MRK2      400  50.0%        0   0.0%      73.2%      
        (...statistics for other markers would appear here...)
          Total     8000  50.0%        0   0.0%      74.1%
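
As a cross-check on the summary above, the family and founder counts can be recomputed directly from a pedigree file. The sketch below is a minimal illustration in Python and is not part of pedstats. It assumes the usual layout in which the first five columns are family ID, individual ID, father, mother and sex, with both parents coded 0 for a founder. The sample rows and the helper name summarize_pedigree are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows in the style of asp.ped: two families, each with two
# untyped parents and two affected siblings. Only the first five columns
# (family, individual, father, mother, sex) are read by this sketch.
SAMPLE_PED = """\
1 1 0 0 1 x x 0 0
1 2 0 0 2 x x 0 0
1 3 1 2 1 2 0.35 1 2
1 4 1 2 2 2 -1.20 1 3
2 1 0 0 1 x x 0 0
2 2 0 0 2 x x 0 0
2 3 1 2 2 2 0.80 2 2
2 4 1 2 1 2 1.10 2 3
"""


def summarize_pedigree(text):
    """Count individuals, founders and families in pedigree-file text."""
    family_sizes = defaultdict(int)
    founders = 0
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        family, _person, father, mother = fields[:4]
        family_sizes[family] += 1
        # Assumption: a founder has both parents coded as 0.
        if father == "0" and mother == "0":
            founders += 1
    individuals = sum(family_sizes.values())
    families = len(family_sizes)
    return {
        "individuals": individuals,
        "founders": founders,
        "nonfounders": individuals - founders,
        "families": families,
        "average_family_size": individuals / families if families else 0.0,
    }


summary = summarize_pedigree(SAMPLE_PED)
print(summary)
```

Applied to the real asp.ped, the same logic should report 800 individuals (400 founders, 400 nonfounders) in 200 families, matching the pedstats summary.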

Everything checks out, so let's run merlin! We will need to specify an input data file (-d parameter), pedigree file (-p parameter) and map file (-m parameter). In addition, we need to request a non-parametric linkage analysis. In this case, we will request calculation of both the Whittemore and Halpern NPL pairs (--pairs) and NPL all (--npl) statistics:

prompt> merlin -d asp.dat -p asp.ped -m --pairs --npl

After running the command, you should first see the MERLIN banner and a summary of currently selected options:

MERLIN 0.8.4 - (c) 2000-2001 Goncalo Abecasis

The following parameters are in effect:
                     Data File :         asp.dat (-dname)
                 Pedigree File :         asp.ped (-pname)
            Missing Value Code :         -99.999 (-xname)
                      Map File : (-mname)
            Allele Frequencies : ALL INDIVIDUALS (-f[a|e|f|file])
            Steps Per Interval :               0 (-i9999)
                   Random Seed :          123456 (-r9999)

Data Analysis Options
         General : --error, --ibd, --kinship, --information
         Linkage : --npl [ON], --pairs [ON], --qtl, --deviates, --vc
     Haplotyping : --best, --sample, --all, --founders
   Recombination : --zero, --one, --two, --three, --singlepoint
          Limits : --bits [24], --megabytes
          Output : --quiet, --markerNames
      Simulation : --simulate, --save
      Additional : --simwalk2, --matrices, --swap

Notice that allele frequencies were estimated by counting among all individuals (the default). Alternatively, one could calculate allele frequencies among founders only (-ff), request equal allele frequencies (-fe) or use an allele frequency file with custom frequencies.
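
To make the three choices concrete, here is a minimal Python sketch of allele counting among all individuals, among founders only, and with equal frequencies. It illustrates the idea only and makes no claim about MERLIN's internals. The record layout, the function names and the use of 0 as the code for an untyped allele are assumptions made for this example.

```python
from collections import Counter

MISSING = "0"  # assumed code for an untyped allele

# Hypothetical genotypes at one marker for a single family.
PEOPLE = [
    {"father": "0", "mother": "0", "genotype": ("0", "0")},  # untyped founder
    {"father": "0", "mother": "0", "genotype": ("0", "0")},  # untyped founder
    {"father": "1", "mother": "2", "genotype": ("1", "2")},
    {"father": "1", "mother": "2", "genotype": ("1", "3")},
]


def count_alleles(people, founders_only=False):
    """Count observed alleles, optionally restricted to founders."""
    counts = Counter()
    for person in people:
        is_founder = person["father"] == "0" and person["mother"] == "0"
        if founders_only and not is_founder:
            continue
        counts.update(a for a in person["genotype"] if a != MISSING)
    return counts


def frequencies(counts, equal=False):
    """Turn allele counts into frequencies (or equal frequencies)."""
    if not counts:
        return {}
    if equal:
        return {allele: 1.0 / len(counts) for allele in counts}
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


all_counts = count_alleles(PEOPLE)
by_counting = frequencies(all_counts)                 # like the default
equal_freqs = frequencies(all_counts, equal=True)     # like -fe
founder_freqs = frequencies(count_alleles(PEOPLE, founders_only=True))  # like -ff
print(by_counting, equal_freqs, founder_freqs)
```

In a sample like this one, where no founders are genotyped, the founders-only count in the sketch is empty. How MERLIN itself handles that situation is not covered here.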

After a few moments, you should see analysis results at each location:

Phenotype: affection [ALL] (200 families)
                 Pos   Zmean  pvalue    delta    LOD  pvalue
                 min  -20.00     1.0   -0.707 -60.21     1.0
                 max   20.00 0.00000    0.707  60.21 0.00000
               0.000    0.96     0.2    0.092   0.27    0.13
               5.268    1.39    0.08    0.126   0.54    0.06
              10.536    1.27    0.10    0.110   0.43    0.08
              15.804    1.43    0.08    0.128   0.56    0.05
              21.072    0.88     0.2    0.083   0.22     0.2
              26.340    1.37    0.08    0.130   0.55    0.06
              31.608    1.53    0.06    0.151   0.71    0.04
              36.876    2.18   0.014    0.197   1.32   0.007
              42.144    2.60   0.005    0.218   1.75   0.002
              47.412    3.00  0.0014    0.251   2.33  0.0005
              52.680    3.43  0.0003    0.286   3.05 0.00009
            (... results continue at other locations...)

The first two lines indicate the minimum and maximum possible scores for this dataset. These are followed by analysis results at each location (cM position, Z score, p-value assuming a normal approximation, Kong and Cox delta, Kong and Cox LOD score and Kong and Cox p-value). You will notice that results are identical for the NPL all and pairs statistics -- this is always the case for families with a single affected sib-pair! Linkage peaks at location 52.68 with a Z score of 3.43 (asymptotic p-value of 0.0003), corresponding to a Kong and Cox LOD score of 3.05 with a p-value of 0.00009.
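
The two p-value columns can be approximately recovered from the Z score and the LOD score. The Python sketch below assumes a one-sided normal tail for the Z score and, for the LOD score, a likelihood-ratio chi-square with one degree of freedom whose tail probability is halved. These conversions agree with the printed values to rounding, but they are offered as an illustration rather than a description of MERLIN's code, and the function names are made up.

```python
import math


def z_to_pvalue(z):
    """One-sided upper-tail p-value for a standard normal Z score."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def lod_to_pvalue(lod):
    """One-sided p-value for a non-negative LOD score.

    The LOD is converted to a chi-square, 2 * ln(10) * LOD, and the
    1-df tail probability is halved because only excess sharing counts
    as evidence for linkage (an assumption of this sketch).
    """
    chisq = 2.0 * math.log(10.0) * lod
    return 0.5 * math.erfc(math.sqrt(chisq / 2.0))


# Peak row from the output above: Zmean 3.43, LOD 3.05.
p_from_z = z_to_pvalue(3.43)
p_from_lod = lod_to_pvalue(3.05)
print(f"p from Z: {p_from_z:.4f}, p from LOD: {p_from_lod:.5f}")
```

For the peak row this gives values close to the reported 0.0003 and 0.00009.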

Two Merlin options can be helpful when sorting through large masses of linkage results. These two options are the --pdf option, which generates a simple graphical summary of your linkage curves, and the --tabulate option, which generates a tab-delimited file summarizing all the results for easy analysis in other programs.

Other commonly used linkage analysis options include requesting output with marker names, instead of cM positions (--markerNames option) and requesting analysis between markers (--steps n for n steps per interval) or along a grid of equally spaced locations along the chromosome (--grid n for an n-cM grid). Try them out! For example...

prompt> merlin -d asp.dat -p asp.ped -m --steps 4 --pairs --markerNames

... would calculate the NPL pairs statistic at 4 locations between consecutive markers and use marker names in the output.

TIP: The standard non-parametric linkage analysis carried out by Merlin uses the Kong and Cox (1997) linear model to evaluate the evidence for linkage. This model is designed to identify small increases in allele sharing spread across a large number of families -- this is what one usually expects in a complex disease. If you are searching for a large increase in allele sharing in a small number of families, you can select the Kong and Cox (1997) exponential model by adding the --exp option to your command line, after the --npl or --pairs options. This alternative model is more computationally intensive and requires more memory, but provides a better linkage test if you expect a large increase in allele sharing among affected individuals.
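
For intuition about the linear model, the Python sketch below evaluates a simplified form of it for sib pairs. It assumes complete inheritance information and equal family weights, in which case the LOD score at a given delta is the sum over families of log10(1 + delta * Z), where Z is the family's normalized sharing score. Under those assumptions, the extreme case in which all 200 sib pairs share both alleles identical by descent reproduces the "max" row of the output shown earlier (Zmean 20.00, delta 0.707, LOD 60.21). The function names are hypothetical.

```python
import math

# Under no linkage a sib pair shares 0, 1 or 2 alleles IBD with
# probability 1/4, 1/2 and 1/4, so the mean is 1 and the variance 1/2.
NULL_MEAN = 1.0
NULL_SD = math.sqrt(0.5)


def family_z(shared_alleles):
    """Normalized sharing score for one sib pair."""
    return (shared_alleles - NULL_MEAN) / NULL_SD


def z_mean(z_scores):
    """Overall Z score: sum of family scores over sqrt(number of families)."""
    return sum(z_scores) / math.sqrt(len(z_scores))


def linear_lod(delta, z_scores):
    """Simplified linear-model LOD (complete information, equal weights)."""
    return sum(math.log10(1.0 + delta * z) for z in z_scores)


# Extreme case: every one of 200 sib pairs shares both alleles IBD.
z_scores = [family_z(2)] * 200

# Largest delta for which 1 + delta * Z stays non-negative at the lowest
# possible family score, Z = -sqrt(2).
delta_max = 1.0 / family_z(2)

max_zmean = z_mean(z_scores)
max_lod = linear_lod(delta_max, z_scores)
print(f"Zmean {max_zmean:.2f}  delta {delta_max:.3f}  LOD {max_lod:.2f}")
```

Each family contributes log10(2) at the boundary, which is why the maximum LOD for 200 sib pairs is about 60.2.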

To carry out a variance components linkage analysis on the same data set, we will use the --vc option. If you are using a peculiar value, such as 1234 or -99.999 to represent missing values in your data, remember to use the -x peculiar_value option to tell MERLIN about it in all quantitative trait analyses. In the asp pedigree, missing values have been replaced by x. Let's try a variance components analysis:

prompt> merlin -d asp.dat -p asp.ped -m --vc

In the output, you will see the estimated sample heritability for each phenotype (in this case about 87%) followed by estimates of the genetic effect and LOD scores at each marker location:

Phenotype: trait [VC] (200 families, h2 = 86.74%)
            Position      H2    ChiSq     LOD  pvalue
               0.000   40.95%    5.21    1.13   0.011
               5.268   51.42%    9.88    2.15  0.0008
              10.536   56.26%   13.01    2.82  0.0002
              15.804   65.40%   19.63    4.26 0.00000
              21.072   60.89%   15.36    3.34 0.00004
            (... results continue at other locations...)
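
The ChiSq, LOD and pvalue columns above are related by simple conversions. The Python sketch below assumes that the LOD score is the chi-square divided by 2 ln(10) and that the p-value is half the one-degree-of-freedom chi-square tail probability. Both assumptions agree with the printed rows to rounding, but they are an illustration and not a statement about MERLIN's implementation. The function names are made up.

```python
import math

# (position in cM, chi-square) pairs copied from the output above.
ROWS = [
    (0.000, 5.21),
    (5.268, 9.88),
    (10.536, 13.01),
    (15.804, 19.63),
    (21.072, 15.36),
]


def chisq_to_lod(chisq):
    """Convert a likelihood-ratio chi-square to a LOD score."""
    return chisq / (2.0 * math.log(10.0))


def chisq_to_pvalue(chisq):
    """Half the 1-df chi-square tail probability (one-sided test)."""
    return 0.5 * math.erfc(math.sqrt(chisq / 2.0))


for position, chisq in ROWS:
    lod = chisq_to_lod(chisq)
    pvalue = chisq_to_pvalue(chisq)
    print(f"{position:8.3f}  LOD {lod:5.2f}  p {pvalue:.5f}")

# Location with the strongest evidence among the rows shown.
peak_position, peak_chisq = max(ROWS, key=lambda row: row[1])
print(f"peak at {peak_position} cM")
```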

In this case, linkage peaks at position 15.8 cM. You could identify which families are contributing the most to these linkage signals using the --perFamily option, which generates an additional file tabulating the contribution of each family to the overall LOD score (for non-parametric analysis this partial contribution will be labelled pLOD).

Since this is a selected sample, you might want to check out the simulation section to find out how to conduct gene-dropping simulations that could be used, for example, to estimate empirical p-values. Or proceed to the error detection (improves power!), haplotyping or IBD estimation sections.

