Analysis of Case-Control Samples and Trio Families
In this example, we will use LAMP to carry out parametric
association analyses in a sample of cases and controls. The exact same
procedure can be applied to samples that also include parent-offspring
trios. These samples contain no information about genetic linkage,
but still allow a simple genetic model to be estimated at each
marker.
We will analyse a simple dataset and discuss the differences between
carrying out a parametric analysis using LAMP and other more common
strategies for the analysis of these samples, such as the Transmission
Disequilibrium Test (first described by Richard Spielman and colleagues)
or chi-squared tests of association (with 1 or 2 degrees of freedom).
Input Files
We will analyse an exemplar dataset consisting of 250 cases and
250 matched controls, each genotyped at 15 SNPs within a candidate region.
The genotype and phenotype data are described in two files, cc.dat and
cc.ped (in the examples subdirectory of the
LAMP distribution). You can check the contents of these
files by opening them in a text editor or, more conveniently, using
the PEDSTATS program.
The first few lines of the pedigree file are reproduced below:
< ... pedigree file excerpt begins after header line below ... >
FAMID ID FATID MOTID SEX MRK1 MRK2 MRK3 MRK4 ...
1 1 0 0 1 a/a c/c g/g c/c a/g t/t a/a a/t a/t t/t t/c t/t g/g t/t c/a 1
10 1 0 0 1 a/a c/t g/c c/g a/g a/t a/a a/t a/t t/t c/c t/t g/g t/t a/a 1
11 1 0 0 2 a/a c/c g/g c/c a/a t/t a/a a/a a/a t/t t/t t/t g/g t/t c/c 1
12 1 0 0 1 a/a t/c c/g g/c g/a t/t a/a t/a t/a t/t c/t t/t g/g t/t a/c 1
40 1 0 0 2 a/a c/c g/g c/c a/a a/a g/a a/a a/a t/t t/t t/t g/t t/t c/c 2
64 1 0 0 1 a/a t/c c/g g/c g/a t/a a/a t/a t/a t/t c/t t/c g/g t/t a/c 2
< ... pedigree file continues with data for a total of 500 individuals ... >
Note that, because the sample consists of unrelated individuals, each individual has been assigned a unique family id (FAMID). In each row, this is followed by a dummy individual id, father id, and mother id. These are then followed by
a sex code and a series of genotypes for each individual. In most pedigree files, genotypes are coded as integers, with '1' denoting the first allele, '2' the second allele, and so on. However, LAMP can also accommodate the letters 'a', 'c', 't' and 'g' as allele labels, as shown here. The last column indicates the disease status ('1' for unaffected, '2' for affected and '0' for unknown).
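The row layout described above can be illustrated with a short Python sketch. This is not part of LAMP: the function and field names are hypothetical, and the code assumes the column order of this particular example (five identifier and sex columns, then genotypes, then one disease status column). In general the column order is defined by the data file.

```python
from typing import List, NamedTuple, Tuple

# Disease status codes as described in the text.
STATUS_LABELS = {"0": "unknown", "1": "unaffected", "2": "affected"}


class PedRow(NamedTuple):
    famid: str
    ind_id: str
    father: str
    mother: str
    sex: str
    genotypes: List[Tuple[str, ...]]
    status: str


def parse_ped_line(line: str) -> PedRow:
    """Split one whitespace-delimited pedigree row into its parts.

    Assumes the layout of the example file: 5 fixed columns,
    then genotypes written as allele/allele, then the status code.
    """
    fields = line.split()
    famid, ind_id, father, mother, sex = fields[:5]
    genotypes = [tuple(g.split("/")) for g in fields[5:-1]]
    return PedRow(famid, ind_id, father, mother, sex,
                  genotypes, STATUS_LABELS[fields[-1]])


# One of the rows reproduced in the excerpt above (family 40).
row = parse_ped_line(
    "40 1 0 0 2 a/a c/c g/g c/c a/a a/a g/a a/a a/a t/t t/t t/t g/t t/t c/c 2"
)
print(row.famid, row.status, len(row.genotypes), row.genotypes[6])
```

For real data, PEDSTATS remains the more convenient way to check the contents of these files.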
In addition to the pedigree and data files, LAMP also needs a file listing the
positions of the markers to be tested for association (if a marker appears
in the pedigree or data files, but not in this list of candidates,
it will be ignored). In this case, the file is called cc.map and its first
few lines are reproduced below:
< ... snippet of candidate SNP file starts here ... >
1 SNP1 27.000883
1 SNP2 27.002269
1 SNP3 27.003546
1 SNP4 27.004961
1 SNP5 27.006254
1 SNP6 27.007089
1 SNP7 27.007975
< ... snippet ends here ... >
Each line of the map file lists the chromosome, name and position of a single marker, with the position typically in megabases or in centimorgan units.
Running the Analysis
Since all the files are ready, we can proceed to run the analysis. Although
LAMP will estimate disease allele frequencies and penetrances, we do need to
provide an estimate of the prevalence of the trait at hand (through the
--prevalence command line option). In this case, the trait was simulated
with a prevalence of 0.05, and our final command line will look like
this:
lamp -d cc.dat -p cc.ped -c cc.map --prev 0.05 -f none
The first three options specify input file names, starting with the data file (-d),
followed by the pedigree file (-p) and the candidate positions file (-c). The
--prev option supplies the trait prevalence. The last option (-f) sets the name of the
framework file (to none, since this file is currently not required for
analysing samples of unrelated individuals or trios).
If you execute the above command, you should see LAMP output scroll through...
The most interesting part is usually the table of estimated LOD scores
at the end, so we will start there.
Useful Tip: If the output scrolls through too quickly, you can page through it by adding "| more " to the end of the command line. Alternatively, you can redirect it to a
file by adding "> lamp-output.txt" to the end of the command line.
Estimated Test Statistics
LAMP summarizes evidence for association using a LOD score. You can easily convert a LOD
score into a chi-squared statistic by multiplying it by 2ln(10) (about 4.61).
In genetics, LOD scores
are commonly used in settings where many statistical tests are performed -- such as most
SNP association studies.
An abbreviated version of the table of LOD scores produced by LAMP is reproduced below.
You will see that the strongest evidence for association was identified at SNP6, corresponding
to a LOD score of 5.28 (with 2 degrees of freedom, p-value ≈ 5 * 10^-6).
TEST FOR
ASSOCIATION
----------------
LOCATION TRAIT ALLELE LOD df pvalue
====================================================
SNP1 MYSTERY_TRAIT a 0.11 2 0.8
SNP2 MYSTERY_TRAIT c 0.02 2 0.9
SNP3 MYSTERY_TRAIT g 0.02 2 0.9
SNP4 MYSTERY_TRAIT c 0.00 2 1.0
SNP5 MYSTERY_TRAIT a 0.00 2 1.0
SNP6 MYSTERY_TRAIT a 5.28 2 5e-06 <-- Association peak here
SNP7 MYSTERY_TRAIT a 0.14 2 0.7
SNP8 MYSTERY_TRAIT a 0.08 2 0.8
SNP9 MYSTERY_TRAIT a 0.08 2 0.8
SNP10 MYSTERY_TRAIT t 0.56 2 0.3
SNP11 MYSTERY_TRAIT t 0.22 2 0.6
SNP12 MYSTERY_TRAIT t 3.12 2 0.0008
SNP13 MYSTERY_TRAIT g 2.16 2 0.007
SNP14 MYSTERY_TRAIT t 0.03 2 0.9
SNP15 MYSTERY_TRAIT c 0.22 2 0.6
... additional output lines removed here ...
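The conversion from LOD score to chi-squared statistic and p-value described earlier can be sketched in a few lines of Python. This is an illustrative sketch rather than LAMP's own code, and the function names are hypothetical. It relies on two closed forms for the chi-squared tail probability: exp(-x/2) for 2 degrees of freedom and erfc(sqrt(x/2)) for 1 degree of freedom.

```python
import math


def lod_to_chisq(lod: float) -> float:
    """Convert a LOD score to a chi-squared statistic (multiply by 2 ln 10)."""
    return 2.0 * math.log(10.0) * lod


def chisq_pvalue(chisq: float, df: int) -> float:
    """Upper-tail p-value of a chi-squared statistic for df = 1 or 2."""
    if df == 1:
        return math.erfc(math.sqrt(chisq / 2.0))
    if df == 2:
        return math.exp(-chisq / 2.0)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")


# LOD scores taken from the table above (SNP6, SNP12 and SNP13).
for lod in (5.28, 3.12, 2.16):
    chisq = lod_to_chisq(lod)
    print(f"LOD {lod:.2f} -> chi-squared {chisq:.2f}, "
          f"p-value (2 df) {chisq_pvalue(chisq, 2):.1e}")
```

With 2 degrees of freedom the p-value reduces to 10 raised to the power of minus the LOD score, which is why a LOD of 5.28 corresponds to a p-value of about 5e-06.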
For each SNP, LAMP first estimated the disease allele frequency under the null
(i.e., assuming the SNP has no effect on disease status) and then under a simple
model where the probability of being affected varies by genotype. If you are
curious, you can check parameter estimates in the lamp-base.out file and
lamp-direct-association.out file.
In the snippet of the lamp-base.out file below, you will see that,
under the null, allele 'a' of SNP6 is estimated to have a frequency of
about 0.575 in the population.
< ... snippet of lamp-base.out begins here ... >
TRAIT: MYSTERY_TRAIT
LOCUS: SNP6
MODEL: BASE
LOG-LIKELIHOOD: -507.8747
FITTED PARAMETERS: 1
ESTIMATED ALLELE FREQUENCIES
ALLELE a : 0.5750 <-- estimated allele frequency
ALLELE t : 0.4250
< ... snippet ends here ... >
In the snippet of the lamp-direct-association.out file below, you will see that,
under the alternative model, the estimated frequency of the 'a' allele
decreases to 0.52. You will also see that individuals carrying allele 'a' are much more
likely to be affected. Since the sample is enriched for affected individuals,
this probably explains the overestimate of the population frequency of the 'a' allele in
the analysis under the null. The output includes other useful information, such as
estimates of λsib and of the population attributable fraction
associated with the allele.
< ... snippet of lamp-direct-association.out begins here ... >
TRAIT: MYSTERY_TRAIT
LOCUS: SNP6
MODEL: DIRECT ASSOCIATION
LOG-LIKELIHOOD: -495.7110
FITTED PARAMETERS: 3
ESTIMATED ALLELE FREQUENCIES
ALLELE a : 0.5168 "PRESUMED" CAUSAL ALLELE
ALLELE t : 0.4832
ALLELE a INCREASES SUSCEPTIBILITY AND IS LABELED '-' BELOW
DISEASE LOCUS PARAMETERS
FREQUENCY -: 0.5168
PENETRANCE +/+: 0.01963 t/t homozygotes have 2% chance of being affected
PENETRANCE +/-: 0.05382
PENETRANCE -/-: 0.06941 a/a homozygotes have 7% chance of being affected
LAMBDA_SIB: 1.0625
ATTRIBUTABLE FRACTION: 0.6073
< ... snippet ends here ... >
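As a sanity check on the snippet above, the fitted allele frequency and penetrances can be combined to recover the prevalence, the attributable fraction and λsib. The sketch below assumes Hardy-Weinberg genotype proportions and uses standard textbook formulas (attributable fraction as 1 minus baseline penetrance over prevalence, and λsib from the additive and dominance variance components). Whether LAMP computes these quantities in exactly this way is an assumption; the variable names are hypothetical.

```python
# Values copied from the lamp-direct-association.out snippet for SNP6.
p = 0.5168                      # frequency of the risk allele ('-', i.e. 'a')
f0, f1, f2 = 0.01963, 0.05382, 0.06941   # penetrances for +/+, +/-, -/-

q = 1.0 - p                     # frequency of the other allele ('+')

# Prevalence: penetrances weighted by Hardy-Weinberg genotype frequencies.
prevalence = q * q * f0 + 2.0 * p * q * f1 + p * p * f2

# Fraction of cases that would be avoided if everyone had the
# baseline (+/+) risk.
attributable_fraction = 1.0 - f0 / prevalence

# Additive and dominance variance of the penetrance function.
alpha = p * (f2 - f1) + q * (f1 - f0)
var_additive = 2.0 * p * q * alpha ** 2
var_dominance = (p * q * (f2 - 2.0 * f1 + f0)) ** 2

# Sibling relative risk from the variance components.
lambda_sib = 1.0 + (var_additive / 2.0 + var_dominance / 4.0) / prevalence ** 2

print(f"prevalence            {prevalence:.4f}")
print(f"attributable fraction {attributable_fraction:.4f}")
print(f"lambda_sib            {lambda_sib:.4f}")
```

Up to rounding of the printed parameters, the three results agree with the prevalence of 0.05 supplied on the command line and with the LAMBDA_SIB and ATTRIBUTABLE FRACTION values reported in the snippet.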
Also useful is the estimated log-likelihood. This can be compared with other LAMP analyses
that assume a different prevalence (--prevalence option) or that constrain the
genetic model (--additive, --multiplicative, --dominant or --recessive
options).
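A comparison of two log-likelihoods can be sketched as follows, using the base and direct association values reported for SNP6. The sketch assumes that the reported log-likelihoods are natural logarithms, which is consistent with the LOD score in the results table; the function name is hypothetical.

```python
import math


def lod_from_loglik(loglik_null: float, loglik_alt: float) -> float:
    """Express a difference in natural-log likelihoods as a LOD score."""
    return (loglik_alt - loglik_null) / math.log(10.0)


# LOG-LIKELIHOOD values from lamp-base.out and lamp-direct-association.out.
loglik_base = -507.8747       # 1 fitted parameter
loglik_direct = -495.7110     # 3 fitted parameters

lod = lod_from_loglik(loglik_base, loglik_direct)
df = 3 - 1                    # difference in number of fitted parameters

print(f"LOD = {lod:.2f} with {df} degrees of freedom")
```

The result matches the LOD of 5.28 reported for SNP6. The same calculation could be applied to a pair of runs with and without a model constraint, with the degrees of freedom set to the difference in fitted parameters between the two runs.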
Analysing Your Own Data
Hopefully this tutorial gave you a flavor of how to use LAMP for simple association
analyses. To analyze your own data, you will have to organize your data into a pedigree
file, a data file and a map file. You will then have to decide whether to constrain the
disease model (--additive, --multiplicative, --dominant or --recessive
options) and specify the correct prevalence for your trait (--prevalence option). As always, if you identify evidence for association, you should check that
it is not due to artifacts such as deviations from Hardy-Weinberg equilibrium or
poor genotyping quality.
If your sample includes more extended pedigrees, LAMP will usually try to estimate
more complex models that can distinguish direct and indirect association. Even in those
datasets, it is often useful to use the --ignore-linkage command line option
for an initial quick-and-dirty analysis.
Learning More
If you enjoyed this portion of the tutorial, you might want to try some of the other
sections. You can learn about combined linkage and association
analysis, parametric linkage analysis with MOD scores or
return to the main tutorial menu.