University of Michigan Center for Statistical Genetics

Analysis of Case-Control Samples and Trio Families

In this example, we will use LAMP to carry out parametric association analyses in a sample of cases and controls. The exact same procedure can be applied to samples that also include parent-offspring trios. These samples contain no information about genetic linkage, but still allow a simple genetic model to be estimated at each marker.

We will analyse a simple dataset and discuss the differences between carrying out a parametric analysis using LAMP and other more common strategies for the analysis of these samples, such as the Transmission Disequilibrium Test (first described by Richard Spielman and colleagues) or chi-squared tests of association (with 1 or 2 degrees of freedom).

Input Files

We will analyse an exemplar dataset consisting of 250 cases and 250 matched controls, each genotyped at 15 SNPs within a candidate region. The genotype and phenotype data are described in two files, cc.dat and cc.ped (in the examples subdirectory of the LAMP distribution). You can check the contents of these files by opening them in a text editor or, more conveniently, using the PEDSTATS program.

The first few lines of the pedigree file are reproduced below:

< ... pedigree file excerpt begins after header line below ... >
FAMID   ID      FATID   MOTID  SEX MRK1 MRK2 MRK3 MRK4 ... 
1       1       0       0       1  a/a  c/c  g/g  c/c  a/g  t/t  a/a  a/t  a/t  t/t  t/c  t/t  g/g  t/t  c/a    1
10      1       0       0       1  a/a  c/t  g/c  c/g  a/g  a/t  a/a  a/t  a/t  t/t  c/c  t/t  g/g  t/t  a/a    1
11      1       0       0       2  a/a  c/c  g/g  c/c  a/a  t/t  a/a  a/a  a/a  t/t  t/t  t/t  g/g  t/t  c/c    1
12      1       0       0       1  a/a  t/c  c/g  g/c  g/a  t/t  a/a  t/a  t/a  t/t  c/t  t/t  g/g  t/t  a/c    1
40      1       0       0       2  a/a  c/c  g/g  c/c  a/a  a/a  g/a  a/a  a/a  t/t  t/t  t/t  g/t  t/t  c/c    2
64      1       0       0       1  a/a  t/c  c/g  g/c  g/a  t/a  a/a  t/a  t/a  t/t  c/t  t/c  g/g  t/t  a/c    2
< ... pedigree file continues with data for a total of 500 individuals ... >

Note that, because the sample consists of unrelated individuals, each individual has been assigned a unique family id (FAMID). In each row, this is followed by a dummy individual id, father id, and mother id. These are then followed by a sex code and a series of genotypes for each individual. In most pedigree files, genotypes are coded as integers, with '1' denoting the first allele, '2' the second allele, and so on. However, LAMP can also accommodate the letters 'a', 'c', 't' and 'g' as allele labels, as shown here. The last column indicates the disease status ('1' for unaffected, '2' for affected and '0' for unknown).
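If you would like to inspect the pedigree file programmatically rather than by eye, the column layout described above can be read with a few lines of Python. This is only a sketch: the function name parse_ped_line and the dictionary keys are made up for illustration, and it assumes the layout of cc.ped (five identifier columns, then genotypes, then a single disease status column). In general, the column order is defined by the data file, so adjust accordingly.

```python
from collections import Counter

# Two rows copied from the cc.ped excerpt above (whitespace-delimited).
SAMPLE_ROWS = """\
1       1       0       0       1  a/a  c/c  g/g  c/c  a/g  t/t  a/a  a/t  a/t  t/t  t/c  t/t  g/g  t/t  c/a    1
40      1       0       0       2  a/a  c/c  g/g  c/c  a/a  a/a  g/a  a/a  a/a  t/t  t/t  t/t  g/t  t/t  c/c    2
"""

# Disease status codes as described in the text.
STATUS_LABELS = {"0": "unknown", "1": "unaffected", "2": "affected"}


def parse_ped_line(line):
    """Split one pedigree row into ids, sex, genotypes and disease status.

    Assumes the cc.ped layout: FAMID, ID, FATID, MOTID, SEX,
    then one 'x/y' genotype per marker, then the status code last.
    """
    fields = line.split()
    famid, ind_id, father, mother, sex = fields[:5]
    genotypes = [tuple(g.split("/")) for g in fields[5:-1]]
    return {
        "famid": famid,
        "id": ind_id,
        "father": father,
        "mother": mother,
        "sex": sex,
        "genotypes": genotypes,
        "status": STATUS_LABELS[fields[-1]],
    }


records = [parse_ped_line(row) for row in SAMPLE_ROWS.splitlines() if row.strip()]
status_counts = Counter(r["status"] for r in records)

for r in records:
    print(r["famid"], r["status"], len(r["genotypes"]), "genotypes")
```

Run over the full file, the status counts should match the 250 cases and 250 controls described above; PEDSTATS remains the more thorough way to check your input.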

In addition to the pedigree and data files, LAMP also needs a file listing the positions of the markers to be tested for association (if a marker appears in the pedigree or data files, but not in this list of candidates, it will be ignored). In this case, the file is called cc.map and its first few lines are reproduced below:

< ... snippet of candidate SNP file starts here ... >
1       SNP1    27.000883
1       SNP2    27.002269
1       SNP3    27.003546
1       SNP4    27.004961
1       SNP5    27.006254
1       SNP6    27.007089
1       SNP7    27.007975
< ... snippet ends here ... >

Each line of the map file lists the chromosome, name, and position of a single marker, with positions typically given in megabases or in centimorgans.
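As a quick sanity check before running an analysis, you might read the map file and confirm that each line has three columns and that positions are in increasing order. The sketch below uses the first three lines of cc.map shown above; the function name parse_map is hypothetical, and the ordering check is simply a convenience, not something this tutorial says LAMP requires.

```python
# First three lines of cc.map, copied from the snippet above.
MAP_SNIPPET = """\
1       SNP1    27.000883
1       SNP2    27.002269
1       SNP3    27.003546
"""


def parse_map(text):
    """Return a list of (chromosome, marker_name, position) tuples."""
    markers = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        chrom, name, pos = line.split()  # raises ValueError if not 3 columns
        markers.append((chrom, name, float(pos)))
    return markers


markers = parse_map(MAP_SNIPPET)
positions = [pos for _, _, pos in markers]
is_sorted = positions == sorted(positions)

for chrom, name, pos in markers:
    print(f"chr{chrom}\t{name}\t{pos:.6f}")
print("positions in increasing order:", is_sorted)
```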

Running the Analysis

Since all the files are ready, we can proceed to run the analysis. Although LAMP will estimate disease allele frequencies and penetrances, we do need to provide an estimate of the prevalence of the trait at hand (through the --prevalence command line option). In this case, the trait was simulated with a prevalence of 0.05, and our final command line will look like this:

  lamp -d cc.dat -p cc.ped -c cc.map --prev 0.05 -f none

The first three options specify input file names: the data file (-d), the pedigree file (-p), and the candidate positions file (-c). The --prev option supplies the trait prevalence. The last option sets the name of the framework file (to none, since this file is currently not required for analysing samples of unrelated individuals or trios). If you execute the above command, you should see LAMP output scroll past. The most interesting part is usually the table of estimated LOD scores at the end, so we will start there.

Useful Tip: If the output scrolls through too quickly, you can page through it by adding "| more" to the end of the command line. Alternatively, you can redirect it to a file by adding "> lamp-output.txt" to the end of the command line.

Estimated Test Statistics

LAMP summarizes evidence for association using a LOD score. You can easily convert a LOD score into a chi-squared statistic by multiplying it by 2 ln(10) (about 4.61). In genetics, LOD scores are commonly used in settings where many statistical tests are performed -- such as most SNP association studies.
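The conversion is easy to try for yourself. The short sketch below turns a LOD score into a chi-squared statistic and then into a p-value; the function names are made up for illustration. It relies on the fact that, for 2 degrees of freedom, the upper tail of the chi-squared distribution is exp(-x/2), so no statistics library is needed. For other degrees of freedom you would need a proper chi-squared routine.

```python
import math


def lod_to_chisq(lod):
    """Convert a LOD score (log base 10) to a chi-squared statistic."""
    return 2.0 * math.log(10.0) * lod


def chisq_pvalue_2df(chisq):
    """Upper-tail p-value for a chi-squared statistic with exactly 2 df."""
    return math.exp(-chisq / 2.0)


lod = 5.28  # the peak LOD score in the table of results for this example
chisq = lod_to_chisq(lod)
p_value = chisq_pvalue_2df(chisq)

print(f"LOD {lod:.2f} -> chi-squared {chisq:.2f} -> p-value {p_value:.1e}")
```

A handy consequence is that, with 2 degrees of freedom, the p-value is simply 10 raised to the power of minus the LOD score.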

An abbreviated version of the table of LOD scores produced by LAMP is reproduced below. You will see that the strongest evidence for association was identified at SNP6, corresponding to a LOD score of 5.28 (with 2 degrees of freedom, p-value ≈ 5 × 10^-6).

                                         TEST FOR    
                                       ASSOCIATION  
                                    ----------------
LOCATION      TRAIT         ALLELE     LOD df pvalue   
====================================================
SNP1          MYSTERY_TRAIT      a    0.11  2    0.8 
SNP2          MYSTERY_TRAIT      c    0.02  2    0.9 
SNP3          MYSTERY_TRAIT      g    0.02  2    0.9 
SNP4          MYSTERY_TRAIT      c    0.00  2    1.0 
SNP5          MYSTERY_TRAIT      a    0.00  2    1.0 
SNP6          MYSTERY_TRAIT      a    5.28  2  5e-06 <-- Association peak here
SNP7          MYSTERY_TRAIT      a    0.14  2    0.7 
SNP8          MYSTERY_TRAIT      a    0.08  2    0.8 
SNP9          MYSTERY_TRAIT      a    0.08  2    0.8 
SNP10         MYSTERY_TRAIT      t    0.56  2    0.3 
SNP11         MYSTERY_TRAIT      t    0.22  2    0.6 
SNP12         MYSTERY_TRAIT      t    3.12  2 0.0008 
SNP13         MYSTERY_TRAIT      g    2.16  2  0.007 
SNP14         MYSTERY_TRAIT      t    0.03  2    0.9 
SNP15         MYSTERY_TRAIT      c    0.22  2    0.6 

... additional output lines removed here ...

For each SNP, LAMP first estimated the disease allele frequency under the null (i.e., assuming the SNP has no effect on disease status) and then under a simple model where the probability of being affected varies by genotype. If you are curious, you can check parameter estimates in the lamp-base.out file and lamp-direct-association.out file.

In the snippet of the lamp-base.out file below, you will see that, under the null, allele 'a' of SNP6 is estimated to have a frequency of 0.575 in the population.

< ... snippet of lamp-base.out begins here ... >
               TRAIT: MYSTERY_TRAIT
               LOCUS: SNP6
               MODEL: BASE

      LOG-LIKELIHOOD: -507.8747
   FITTED PARAMETERS: 1

ESTIMATED ALLELE FREQUENCIES
  ALLELE a : 0.5750     <-- estimated allele frequency
  ALLELE t : 0.4250
< ... snippet ends here ... >

In the snippet of the lamp-direct-association.out file below, you will see that, under the alternative model, the estimated frequency of the 'a' allele decreases to 0.52. You will also see that individuals carrying allele 'a' are much more likely to be affected and, since the sample is enriched for affected individuals, this probably explains the overestimate of the population frequency of the 'a' allele under the null model. The table includes other useful information, such as estimates of λsib and of the population attributable fraction associated with the allele.

< ... snippet of lamp-direct-association.out begins here ... >
               TRAIT: MYSTERY_TRAIT
               LOCUS: SNP6
               MODEL: DIRECT ASSOCIATION

      LOG-LIKELIHOOD: -495.7110
   FITTED PARAMETERS: 3

ESTIMATED ALLELE FREQUENCIES
  ALLELE a : 0.5168 "PRESUMED" CAUSAL ALLELE
  ALLELE t : 0.4832 

ALLELE a INCREASES SUSCEPTIBILITY AND IS LABELED '-' BELOW

DISEASE LOCUS PARAMETERS
            FREQUENCY -: 0.5168
         PENETRANCE +/+: 0.01963  t/t homozygotes have 2% chance of being affected  
         PENETRANCE +/-: 0.05382
         PENETRANCE -/-: 0.06941  a/a homozygotes have 7% chance of being affected 

             LAMBDA_SIB: 1.0625
  ATTRIBUTABLE FRACTION: 0.6073
< ... snippet ends here ... >
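To see how the summary quantities relate to the fitted parameters, you can recompute them from the allele frequency and penetrances in the snippet. The sketch below assumes Hardy-Weinberg genotype frequencies and uses standard textbook formulas: the attributable fraction is taken as (K - f)/K, where K is the prevalence and f the penetrance of the lowest-risk genotype, and λsib as 1 + (VA/2 + VD/4)/K². Whether LAMP uses exactly these formulas internally is an assumption, but they do reproduce the values reported above.

```python
# Fitted values copied from the lamp-direct-association.out snippet.
p_risk = 0.5168  # frequency of the '-' (risk) allele, here 'a'
pen = {"+/+": 0.01963, "+/-": 0.05382, "-/-": 0.06941}

# Genotype frequencies under Hardy-Weinberg equilibrium (an assumption).
q = 1.0 - p_risk
geno_freq = {"+/+": q * q, "+/-": 2.0 * p_risk * q, "-/-": p_risk * p_risk}

# Implied population prevalence: penetrances averaged over genotypes.
prevalence = sum(geno_freq[g] * pen[g] for g in pen)

# Fraction of cases that would be avoided if everyone had the +/+ risk.
attributable_fraction = (prevalence - pen["+/+"]) / prevalence

# Additive and dominance variance of the penetrances.
alpha = p_risk * (pen["-/-"] - pen["+/-"]) + q * (pen["+/-"] - pen["+/+"])
dominance_dev = pen["-/-"] - 2.0 * pen["+/-"] + pen["+/+"]
var_additive = 2.0 * p_risk * q * alpha ** 2
var_dominance = (p_risk * q * dominance_dev) ** 2

# Sibling recurrence risk ratio.
lambda_sib = 1.0 + (var_additive / 2.0 + var_dominance / 4.0) / prevalence ** 2

print(f"prevalence            {prevalence:.4f}")
print(f"attributable fraction {attributable_fraction:.4f}")
print(f"lambda_sib            {lambda_sib:.4f}")
```

Note that the implied prevalence comes back as 0.05, the value supplied on the command line, which is a useful check that the fitted model respects the prevalence you specified.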

Also useful is the estimated log-likelihood. This can be compared with other LAMP analyses that assume a different prevalence (--prevalence option) or that constrain the genetic model (--additive, --multiplicative, --dominant or --recessive options).
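Comparing log-likelihoods by hand is straightforward. The sketch below takes the two values reported for SNP6 in lamp-base.out and lamp-direct-association.out and recovers the LOD score and degrees of freedom from the table. It assumes the reported log-likelihoods are natural logarithms, which is consistent with the numbers here, and the variable names are purely illustrative. The same arithmetic can be applied when comparing a constrained model with the unconstrained one, provided the two models are nested.

```python
import math

# Values copied from the two output snippets for SNP6.
loglik_base = -507.8747   # MODEL: BASE, 1 fitted parameter
loglik_assoc = -495.7110  # MODEL: DIRECT ASSOCIATION, 3 fitted parameters
params_base = 1
params_assoc = 3

# Likelihood ratio statistic and its degrees of freedom.
chisq = 2.0 * (loglik_assoc - loglik_base)
df = params_assoc - params_base

# LOD score: the log-likelihood difference converted to log base 10.
lod = (loglik_assoc - loglik_base) / math.log(10.0)

# Closed-form chi-squared tail, valid only because df is 2 here.
p_value = math.exp(-chisq / 2.0)

print(f"chi-squared {chisq:.2f} on {df} df, LOD {lod:.2f}, p-value {p_value:.1e}")
```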

Analysing Your Own Data

Hopefully this tutorial gave you a flavor of how to use LAMP for simple association analyses. To analyze your own data, you will have to organize your data into a pedigree file, a data file and a map file. You will then have to decide whether to constrain the disease model (--additive, --multiplicative, --dominant or --recessive options) and specify the correct prevalence for your trait (--prevalence option). As always, if you identify evidence for association, you should check that it is not due to artifacts such as deviations from Hardy-Weinberg equilibrium or poor genotyping quality.

If your sample includes more extended pedigrees, LAMP will usually try to estimate more complex models that can distinguish direct and indirect association. Even in those datasets, it is often useful to use the --ignore-linkage command line option for an initial quick-and-dirty analysis.

Learning More

If you enjoyed this portion of the tutorial, you might want to try some of the other sections. You can learn about combined linkage and association analysis, parametric linkage analysis with MOD scores or return to the main tutorial menu.


 
 

University of Michigan | School of Public Health | Abecasis Lab