MERLIN Tutorial -- Association Analysis

Association analysis tests whether a SNP is associated with a trait of interest. In this section, we describe two family-based association tests for quantitative traits. We also show the genotype inference feature that is integrated into the association tests to handle missing genotypes (Burdick et al. 2006). The rapid association test implemented in MERLIN can be used for genome-wide scans, and the genotype inference feature makes the association tests more powerful than conventional approaches. Although the association tests can be applied to family data, they do not control for population stratification. If population stratification is a concern, population membership should be included as a covariate.

We first show an example analysis. The dataset consists of 20 pedigrees spanning three generations. At two loci, genotypes are missing for all individuals in the third generation. At the other loci, genotypes are more complete. The dataset is organized into three files: a data file (assoc.dat), a pedigree file (assoc.ped), and a map file (assoc.map). An overview of MERLIN input files is available elsewhere.

To run MERLIN for the association analysis, we need to specify a data file (-d parameter), a pedigree file (-p parameter), and a map file (-m parameter). In addition, we need to request one of the following association tests: a score test (--fastAssoc) or a likelihood-ratio test (--assoc). The score test (--fastAssoc) is rapid and well suited to screening the whole genome in a genome-wide association (GWA) analysis, while the likelihood-ratio test (--assoc) can be used to test a candidate region as a follow-up to the initial screen.

The command may look like:

prompt> merlin -d assoc.dat -p assoc.ped -m assoc.map --fastAssoc

After running the command, you should see the following results:

Phenotype: trait [FAST-ASSOC] (20 families, h2 = 62.09%)
==============================================================================
  Position        Marker  Allele  Effect      H2     LOD  pvalue
    68.550          SNP1       3   0.024   0.02%   0.008     0.8
    69.582          SNP2       1   0.106   0.53%   0.204     0.3
    70.634          SNP3       3  -0.216   2.34%   0.884    0.04
    73.100          SNP4       1  -0.007   0.00%   0.001     0.9
    75.018          SNP5       3  -0.017   0.01%   0.005     0.9
    75.629          SNP6       3   0.423   5.93%   2.446  0.0008
    75.651          SNP7       1   0.423   5.93%   2.446  0.0008
    75.889          SNP8       1  -1.061  47.78%  18.724 1.6e-20
    75.889          SNP9       4   0.526  10.81%   3.747 0.00003
    76.700         SNP10       3   0.032   0.05%   0.020     0.8
    77.554         SNP11       1   0.067   0.19%   0.065     0.6
    77.627         SNP12       3   0.343   4.61%   1.787   0.004
    79.352         SNP13       1   0.042   0.07%   0.027     0.7
    79.443         SNP14       1   0.101   0.50%   0.190     0.3
    79.865         SNP15       1  -0.291   3.76%   1.510   0.008
    80.700         SNP16       1   0.206   2.08%   0.775    0.06
    81.238         SNP17       1  -0.184   1.28%   0.459    0.15
  Peak -->          SNP8       1  -1.061  47.78%  18.724 1.6e-20

The output above shows the fastAssoc analysis of the phenotype "trait". The 7 columns are the position of the SNP, the name of the marker under test, the allele being tested, the effect of the allele, the proportion of total variance explained by this allele, the LOD score converted from the chi-square test statistic, and the p-value of the fastAssoc test. The last row indicates the peak locus, at which the association is the strongest among all tests above.
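For genome-wide runs it is usually easier to post-process this table with a script than to read it by eye. The following is a minimal Python sketch, assuming the output has been captured as text and that the 7-column layout matches the listing above; the function name parse_fastassoc and the p-value threshold of 0.001 are illustrative choices, not part of MERLIN.

```python
# Sketch: parse the SNP rows of a --fastAssoc table captured as text.
# The column layout is copied from the listing above; real output may differ.

def parse_fastassoc(text):
    """Return one dict per SNP row; header, separator and 'Peak' rows are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # SNP rows have exactly 7 whitespace-separated fields.
        if len(fields) != 7:
            continue
        try:
            position = float(fields[0])
        except ValueError:
            continue  # the "Position Marker ..." header line
        rows.append({
            "position": position,
            "marker": fields[1],
            "allele": fields[2],
            "effect": float(fields[3]),
            "h2_percent": float(fields[4].rstrip("%")),
            "lod": float(fields[5]),
            "pvalue": float(fields[6]),
        })
    return rows


# A few rows copied from the listing above.
sample = """\
Phenotype: trait [FAST-ASSOC] (20 families, h2 = 62.09%)
  Position        Marker  Allele  Effect      H2     LOD  pvalue
    70.634          SNP3       3  -0.216   2.34%   0.884    0.04
    75.629          SNP6       3   0.423   5.93%   2.446  0.0008
    75.889          SNP8       1  -1.061  47.78%  18.724 1.6e-20
    75.889          SNP9       4   0.526  10.81%   3.747 0.00003
  Peak -->          SNP8       1  -1.061  47.78%  18.724 1.6e-20
"""

rows = parse_fastassoc(sample)
peak = max(rows, key=lambda r: r["lod"])          # strongest association
hits = [r["marker"] for r in rows if r["pvalue"] < 0.001]  # illustrative cutoff
print(peak["marker"], hits)
```

The row with the largest LOD should agree with the "Peak" row that MERLIN prints, which gives a quick sanity check on the parsing.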

The fastAssoc screen can then be followed by the assoc test in the region around the peak:

prompt> merlin -d assoc.dat -p assoc.ped -m assoc.map --assoc --start 75.7 --stop 76.0

After running the command, you should see the following results:

Phenotype: trait [ASSOC] (20 families, h2 = 62.09%)
==============================================================================
  Position     H2      LOD  pvalue        Marker Allele  Effect     LOD  pvalue
    75.889  77.14%    7.52 2.0e-09          SNP8      1  -1.141  24.982 7.7e-27
    75.889  77.14%    7.52 2.0e-09          SNP9      4   0.562   3.807 0.00003

This analysis (--assoc) only examines SNPs in the region between 75.7 cM (specified by the --start parameter) and 76.0 cM (specified by the --stop parameter). The 9 columns in the result are the position of the SNP under test, the proportion of variance explained by the locus in a variance component (VC) linkage analysis, the LOD score of the VC linkage analysis (the corresponding parameter is --vc), the p-value of the VC linkage analysis, the name of the marker under test, the allele being tested, the effect of the allele, the LOD score converted from the likelihood-ratio test statistic, and the p-value of the assoc test. The last row indicates the peak locus at which the association is the strongest among all tests above.
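The relationship between an association LOD score and its p-value can be checked by hand. A LOD score corresponds to a chi-square statistic of 2 x ln(10) x LOD. The Python sketch below additionally assumes that the statistic is referred to a chi-square distribution with 1 degree of freedom; the text does not state this, but the assumption closely reproduces the association p-values in the listing above. It is applied to the association LOD column only, not to the linkage LOD column.

```python
import math


def lod_to_chisq(lod):
    """Convert a LOD score to the equivalent chi-square statistic."""
    return 2.0 * math.log(10.0) * lod


def chisq1_pvalue(chisq):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom
    (assumed here; see the note above)."""
    return math.erfc(math.sqrt(chisq / 2.0))


# Association LOD scores copied from the --assoc listing above.
for marker, lod in [("SNP8", 24.982), ("SNP9", 3.807)]:
    chisq = lod_to_chisq(lod)
    print(f"{marker}: LOD {lod} -> chi-square {chisq:.2f}, p = {chisq1_pvalue(chisq):.1e}")
```

The printed p-values should land close to the 7.7e-27 and 0.00003 reported by MERLIN for SNP8 and SNP9.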

One unique feature of the association tests in MERLIN is that missing genotypes are imputed and incorporated into the test. A simple example of genotype inference is illustrated in the accompanying figure (Burdick et al. 2006). All missing genotypes in the third generation are inferred in this example (the inferred genotypes are shown in italics). In this type of dataset, incomplete high-resolution SNP data are combined with more complete sparse markers from a linkage scan. Burdick et al. (2006) show that incorporating the genotype inference procedure can substantially improve the power of the association test. This suggests that, in a genome-wide association study following a linkage study, only a proportion of individuals needs to be genotyped, and the genotypes of the remaining family members can be obtained by genotype inference. This genotyping strategy has been shown to be cost-effective.
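To give a feel for what a genotype distribution for an ungenotyped relative looks like, here is a deliberately simplified Python sketch. It considers a single SNP and only Mendelian segregation from two genotyped parents; it ignores the flanking linkage markers and the rest of the pedigree, which the approach described above relies on, so it should not be read as MERLIN's algorithm. The function name and the allele labels are hypothetical.

```python
from collections import Counter
from itertools import product


def offspring_distribution(father, mother):
    """Genotype probabilities for a child given both parental genotypes.

    Genotypes are tuples of two allele labels, e.g. (1, 2). Each parent
    transmits either of its alleles with probability 1/2.
    """
    counts = Counter()
    for paternal, maternal in product(father, mother):
        # Store genotypes unordered: (2, 1) and (1, 2) are the same genotype.
        counts[tuple(sorted((paternal, maternal)))] += 1
    total = sum(counts.values())
    return {genotype: n / total for genotype, n in counts.items()}


# A homozygous father and a heterozygous mother.
print(offspring_distribution((1, 1), (1, 2)))
# Two heterozygous parents.
print(offspring_distribution((1, 2), (1, 2)))
```

With parental genotypes alone the child's genotype is often ambiguous; information from nearby, more completely typed markers is what allows the distribution to concentrate on a single genotype.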

The genotype inference procedure can also be carried out separately from an association test. MERLIN provides posterior distributions for all missing genotypes. This function may be important in some cases, e.g., the association analysis of qualitative traits. The parameter for genotype inference is --infer. The inferred data are stored in merlin-infer.dat and merlin-infer.ped. The command line for genotype inference may look like this:

prompt> merlin -d assoc.dat -p assoc.ped -m assoc.map --infer

In the inferred pedigree file (saved as merlin-infer.ped here), each locus is described in 5 columns: the most likely genotype for the missing genotype, the expected count of the tested allele, and the posterior probabilities of the three genotypes. In the inferred data file (saved as merlin-infer.dat here), the 5 rows corresponding to these 5 columns are labeled as types M, T, C, C, and C, respectively. The posterior probabilities can be used to infer a missing genotype: if one of the three posterior probabilities is large enough, the missing genotype is inferred as the genotype with the highest posterior probability; otherwise, the missing genotype remains uninferred. To control the error rate of genotype inference, a high cutoff value, say 0.99, is usually chosen for the posterior probabilities, so that the inference error rate is kept below 1%.
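The cutoff rule described above can be sketched in a few lines of Python. The genotype labels, the function names, and the example probabilities are hypothetical; the sketch does not read merlin-infer.ped and makes no assumption about the order in which the three posterior probabilities appear in that file.

```python
def call_genotype(posteriors, cutoff=0.99):
    """Return the genotype whose posterior reaches the cutoff, else None.

    posteriors: dict mapping a genotype label to its posterior probability.
    """
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] >= cutoff else None


def expected_allele_count(p_het, p_hom):
    """Expected copies of the tested allele: 1 for a heterozygote,
    2 for a homozygote carrying the allele, 0 otherwise."""
    return 1.0 * p_het + 2.0 * p_hom


# Hypothetical posteriors for two individuals with missing genotypes.
confident = {"1/1": 0.995, "1/2": 0.005, "2/2": 0.000}
ambiguous = {"1/1": 0.60, "1/2": 0.39, "2/2": 0.01}

print(call_genotype(confident))  # posterior reaches the cutoff: genotype is called
print(call_genotype(ambiguous))  # no posterior reaches the cutoff: stays missing
print(expected_allele_count(p_het=0.39, p_hom=0.60))
```

Even when no genotype reaches the cutoff, the expected allele count still summarizes the posterior distribution in a single number between 0 and 2.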

REFERENCE

Burdick JT, Chen WM, Abecasis GR, Cheung VG (2006) In silico method for inferring genotypes in pedigrees. Nat Genet 38:1002-4