MERLIN Tutorial -- Modeling Marker-Marker Linkage Disequilibrium

This tutorial describe the procedures and options available for modeling marker-marker linkage disequilibrium with MERLIN. It assumes that you are relatively familiar with MERLIN and its standard command line options. If you haven't yet done so, it is a good idea to first learn about input file formats and non-parametric linkage analysis.

MERLIN can accomodate marker-marker linkage disequilibrium in nearly all available analyses, including parametric and non-parametric analysis of discrete traits, regression and variance-components based analysis of quantitative traits, haplotyping analyses and simulation. Modeling marker-marker linkage disequilibrium is especially important when analysing SNP linkage maps in datasets where some parental genotypes are missing. It has been shown that in these settings ignoring marker-marker linkage disequilibrium can result in severe biases in linkage calculations.

To model linkage disequilibrium, MERLIN organizes markers into clusters. Each cluster can include both SNP and microsatellite markers. MERLIN then uses population haplotype frequencies to assume linkage disequilibrium within each cluster. Two limitations of the model are that it assumes no recombination within clusters and no linkage disequilibrium between clusters. These approximations appear to be reasonable in many datasets.

For this example, we will use a simulated data set that you will find in the examples subdirectory of the MERLIN distribution or in the download page.

The dataset consists of a SNP linkage scan of a candidate chromosome in a set of 500 affected sibships, each with three genotyped affected siblings and one affected parent. The SNP data consists of clusters of 2-3 SNPs, all within 100kb of each other, genotyped approximately 5cM apart along a single chromosome (20 clusters and 59 SNPs in total).

The three standard Merlin format input files are the data file snp-scan.dat, pedigree file snp-scan.ped and map file snp-scan.map. In the pedigree file, SNP alleles 'A', 'C', 'G' and 'T' have been coded as allele 1, 2, 3 and 4, respectively. All input files are text files, and you can check their contents using the UNIX more command or using the following pedstats command:

prompt> pedstats -d snp-scan.dat -p snp-scan.ped

We are going to evaluate the evidence for linkage is this SNP data set, and a good place to start, is to run a standard non-parametric linkage analysis (--npl command option) ignoring linkage disequilibrium between markers. We will request that Merlin carry out analysis at positions spaced every 2 cM along the chromosome(--grid 2 command line option). Try running the following command:

prompt> merlin -d snp-scan.dat -p snp-scan.ped -m snp-scan.map --npl --grid 2

After the opening banner screen, your results should be similar to the following:

Phenotype: DISEASE [ALL] (500 families)
======================================================
            Pos   Zmean  pvalue    delta    LOD  pvalue
            min  -18.26     1.0   -0.408 -62.47     1.0
            max   54.77 0.00000    1.225  301.0 0.00000
          0.000    1.28    0.10    0.094   0.57    0.05
          5.000    1.85    0.03    0.109   0.96    0.02
         10.000    2.38   0.009    0.139   1.56   0.004
         15.000    3.23  0.0006    0.204   3.08 0.00008
         20.000    4.72 0.00000    0.308   6.75 0.00000
         25.000    3.97 0.00004    0.241   4.48 0.00000
         30.000    3.62 0.00014    0.234   3.98 0.00001
         35.000    2.64   0.004    0.149   1.86   0.002
         40.000    2.05    0.02    0.122   1.18   0.010
         45.000    2.27   0.012    0.126   1.35   0.006
         50.000    1.99    0.02    0.110   1.03   0.015
         55.000    2.27   0.012    0.133   1.43   0.005
         60.000    2.28   0.011    0.124   1.33   0.007
         65.000    3.21  0.0007    0.189   2.84 0.00015
         70.000    4.20 0.00001    0.236   4.62 0.00000
         75.000    3.78 0.00008    0.220   3.90 0.00001
         80.000    2.74   0.003    0.161   2.08  0.0010
         85.000    2.23   0.013    0.127   1.34   0.006
         90.000    2.20   0.014    0.127   1.33   0.007
         95.000    2.28   0.011    0.154   1.66   0.003
        100.000    1.87    0.03    0.187   1.65   0.003

The 4th column (labeled LOD score) is the Kong and Cox LOD score for this data. You will notice two very strong LOD score peaks, one around 20cM (LOD 6.75) and another around 70cM (LOD of 4.62). Unfortunately, ignoring marker-marker LD can lead to inflated LOD scores when some parental genotypes are missing. The results are typical of situations where marker-marker disequilibrium is not modeled appropriately and should not be taken as evidence for linkage.

To verify whether there is evidence for linkage, we will repeat the previous analysis, but modeling marker-marker disequibrium. First, we will carry out analyses using pre-specified clusters and haplotype frequencies. Next, we will see how Merlin can automatically define clusters using the available marker map and genotype data.

We will use cluster definitions in the file snp-scan.clusters. This file describes clusters of SNPs in linkage disequilibrium. This file can be generated by the user, or by a previous MERLIN run. We will describe the file in detail, since it should help clarify how MERLIN models linkage disequilibrium.

The file describes a series of clusters, each consisting of a series of consecutive markers. The description of each cluster begins on a separate line with the word CLUSTER followed by a series of marker names, that must exactly match the data and map files. Optionally, this line can be followed by a series of entries, each on a separate line, describing the haplotypes in the cluster and their frequencies. Each of these lines begins with the word HAPLO followed by a haplotype frequency and a series of alleles.

For example, this is the first cluster in the snp-scan.clusters file:

CLUSTER rs556990 rs553316 rs7989953
HAPLO 0.4500   3   2   1
HAPLO 0.3167   3   2   3
HAPLO 0.2000   1   4   1
HAPLO 0.0333   1   4   3

The cluster includes three markers (rs556990, rs553316, rs7989953) organized into 4 distinct haplotypes. The first two markers are in complete linkage disequilibrium, such that allele 3 at rs556990 always appears with allele 2 at rs553316, whereas allele 1 at rs556990 always appears with allele 4 at rs553316. The last marker is in strong, but incomplete disequilibrium with the first two: allele 3 for rs7989953 nearly always occurs on the 3-2 haplotype for markers rs556990 and rs553316.

After reading the file with clustering information, MERLIN will do the following:

So let's repeat the original analysis, but modeling of marker-marker disequilibrium enabled. To do this, use the following command-line:

prompt> merlin -d snp-scan.dat -p snp-scan.ped -m snp-scan.map --npl --grid 5 --cluster snp-scan.clusters

After the opening banner screen, you will first see a series of information messages:

MARKER CLUSTERS: Marker map changed, see [merlin-clusters.log]
MARKER CLUSTERS: User supplied file defines 20 clusters

Family:   101 - Founders: 2  - Descendants: 3  - Bits: 4
  Cluster at marker rs7334521 dropped [OBLIGATE RECOMBINANT]

Family:   287 - Founders: 2  - Descendants: 3  - Bits: 4
  Cluster at marker rs7334521 dropped [UNKNOWN HAPLOTYPE]

The first two lines indicate that the cluster information was successfully loaded. Since MERLIN assumes no recombination within clusters, the original genetic map was adjusted slightly -- you can examine the contents of merlin-clusters.log for details. In addition, MERLIN encountered two families (101 and 287) where genotypes for one cluster did not fit with the model described in the clustering file. In family 101, the observed genotypes imply an obligate recombinant in the cluster including markers rs7334521, rs4495999 and rs9546406. In family 287, the observed genotypes imply a haplotype that is not present in the clustering file. In both families, genotypes for markers rs7334521, rs4495999 and rs9546406 will be marked as missing to allow analysis to proceed. In our experience, discarding a small proportion of the available genotypes in this manner results in no noticeable biases.

After these messages, you will find the linkage analysis results, which should be similar to the following:

Phenotype: DISEASE [ALL] (500 families)
======================================================
            Pos   Zmean  pvalue    delta    LOD  pvalue
            min  -18.26     1.0   -0.408 -62.47     1.0
            max   54.77 0.00000    1.225  301.0 0.00000
          0.011    0.82     0.2    0.061   0.24    0.15
          5.011    1.17    0.12    0.070   0.39    0.09
         10.011    1.32    0.09    0.078   0.49    0.07
         15.011    1.31    0.09    0.083   0.52    0.06
         20.011    1.43    0.08    0.098   0.67    0.04
         25.011    1.66    0.05    0.107   0.84    0.02
         30.011    1.54    0.06    0.102   0.75    0.03
         35.011    1.47    0.07    0.085   0.60    0.05
         40.011    1.10    0.14    0.067   0.35    0.10
         45.011    1.30    0.10    0.074   0.46    0.07
         50.011    1.41    0.08    0.079   0.53    0.06
         55.011    1.37    0.09    0.081   0.53    0.06
         60.011    1.57    0.06    0.087   0.65    0.04
         65.011    2.26   0.012    0.136   1.44   0.005
         70.011    3.05  0.0011    0.178   2.55  0.0003
         75.011    3.24  0.0006    0.192   2.92 0.00012
         80.011    2.34   0.010    0.138   1.53   0.004
         85.011    1.74    0.04    0.099   0.82    0.03
         90.011    1.45    0.07    0.084   0.58    0.05
         95.011    0.88     0.2    0.060   0.25    0.14

There is now a single linkage peak around 75cM (LOD of 2.92). The original peak around 20cM has completely disappeared, and was simply an artifact of linkage disequilibrium between markers. Thus, there is some good evidence for a single linkage peak in these data (at around 75cM). The analysis ignoring linkage disequilibrium, which showed an additional peak at around 20cM was quite inaccurate.

If you want to model linkage disequilibrium, but do not have a file describing preset clusters for your SNP mapping panel, MERLIN provides two options for automatically clustering markers. The --distance k option inserts a cluster breakpoint between markers that are less than k cM apart (that is, all consecutive markers spaced less than k cM are placed into a cluster). The --rsq treshold option calculates pairwise r2 between neighboring markers and creates a cluster joining markers for which pairwise r2 > threshold and all intervening markers.

To explore these alternative options, try the following command lines:

prompt> merlin -d snp-scan.dat -p snp-scan.ped -m snp-scan.map --npl --grid 5 --rsq 0.1 --cfreq
prompt> merlin -d snp-scan.dat -p snp-scan.ped -m snp-scan.map --npl --grid 5 --dist 3 --cfreq
prompt> merlin -d snp-scan.dat -p snp-scan.ped -m snp-scan.map --npl --grid 5 --clusters snp-scan.clusters-only --cfreq

The first command-line, will search for markers for which r2 is > 0.10 and define clusters including each identified pair and the intervening markers. The second command-line will group markers that are less than 3 cM apart into a cluster. The final command-line will use the cluster definitions in the snp-scan.clusters-only file, but estimate haplotype frequencies from the available genotype data. In each case, the --cfreq flag requests that the estimated clusters and their frequencies should be saved to a file.

That is it! You should be on your way to modeling linkage disequilibrium between markers in your own data, so as to make the best use of available SNP mapping panels.

To learn about other analyses options, you might want to check the non-parametric linkage analysis or parametric linkage analysis sections, or proceed haplotyping, simulation or ibd estimation sections.