Linkage analysis tests for co-segregation of a chromosomal region and a trait of interest. In this section, we will walk through a basic non-parametric and variance components linkage analysis using MERLIN.
For this example, we will use a simulated data set that you will find in the examples subdirectory of the MERLIN distribution or in the download page.
The dataset consists of a simulated 5-cM scan of chromosome 24 in 200 affected sib-pair families and is organized into 3 files, a data file (asp.dat), a pedigree file (asp.ped) and a map file (asp.map). An overview of MERLIN input files is available elsewhere.
The recommended first step in any analysis is to verify that input files are being interpreted correctly. So let's start by running pedstats... Pedstats requires an input data file (-d parameter) and pedigree file (-p parameter):
prompt> pedstats -d asp.dat -p asp.ped
By examining the abbreviated pedstats output below, you should be able to confirm that there are 200 pedigrees, each with 4 individuals (two affected siblings and their parents). Among phenotyped individuals, the prevalence of the disease is 100% (there are no unaffecteds in the sample) and the pedigree also includes a quantitative trait. In addition there are no phenotyped or genotyped founders.
Pedigree Statistics (c) 1999-2001 Goncalo Abecasis The following parameters are in effect: QTDT Pedigree File : asp.ped (-pname) QTDT Data File : asp.dat (-dname) Missing Value Code : -99.999 (-xname) PEDIGREE STRUCTURE ================== Individuals: 800 (400 founders, 400 nonfounders) Families: 200 Average Family Sizes: 4.00 Average Generations: 2.00 QUANTITATIVE TRAIT STATISTICS ============================= [Phenotypes] [Founders] Mean Var trait 400 50.0% 0 0.0% 0.021 1.496 AFFECTION STATISTICS ==================== [Diagnostics] [Founders] Prevalence affection 400 50.0% 0 0.0% 100.0% Total 400 50.0% 0 0.0% MARKER GENOTYPE STATISTICS ========================== [Genotypes] [Founders] Hetero MRK1 400 50.0% 0 0.0% 72.8% MRK2 400 50.0% 0 0.0% 73.2% (...statistics for other markers would appear here...) Total 8000 50.0% 0 0.0% 74.1%
Everything checks out, so let's run merlin! We will need to specify an input data file (-d parameter), pedigree file (-p parameter) and map file (-m parameter). In addition, we need to request a non-parametric linkage analysis. In this case, we will request calculation of both the Whittemore and Halpern NPL pairs (--pairs) and NPL all (--npl) statistics:
prompt> merlin -d asp.dat -p asp.ped -m asp.map --pairs --npl
After running the command, you should first see the MERLIN banner and a summary of currently selected options:
MERLIN 0.8.4 - (c) 2000-2001 Goncalo Abecasis The following parameters are in effect: Data File : asp.dat (-dname) Pedigree File : asp.ped (-pname) Missing Value Code : -99.999 (-xname) Map File : asp.map (-mname) Allele Frequencies : ALL INDIVIDUALS (-f[a|e|f|file]) Steps Per Interval : 0 (-i9999) Random Seed : 123456 (-r9999) Data Analysis Options General : --error, --ibd, --kinship, --information Linkage : --npl [ON], --pairs [ON], --qtl, --deviates, --vc Haplotyping : --best, --sample, --all, --founders Recombination : --zero, --one, --two, --three, --singlepoint Limits : --bits , --megabytes Output : --quiet, --markerNames Simulation : --simulate, --save Additional : --simwalk2, --matrices, --swap
Notice that allele frequencies were estimated by counting among all individuals (the default). Alternatively, one could calculate allele frequencies among founders only (-ff), request equal allele frequencies (-fe) or use an allele frequency file with custom frequencies.
After a few moments, you should see analysis results at each location:
Phenotype: affection [ALL] (200 families) ============================================================ Pos Zmean pvalue delta LOD pvalue min -20.00 1.0 -0.707 -60.21 1.0 max 20.00 0.00000 0.707 60.21 0.00000 0.000 0.96 0.2 0.092 0.27 0.13 5.268 1.39 0.08 0.126 0.54 0.06 10.536 1.27 0.10 0.110 0.43 0.08 15.804 1.43 0.08 0.128 0.56 0.05 21.072 0.88 0.2 0.083 0.22 0.2 26.340 1.37 0.08 0.130 0.55 0.06 31.608 1.53 0.06 0.151 0.71 0.04 36.876 2.18 0.014 0.197 1.32 0.007 42.144 2.60 0.005 0.218 1.75 0.002 47.412 3.00 0.0014 0.251 2.33 0.0005 52.680 3.43 0.0003 0.286 3.05 0.00009 (... results continue at other locations...)
The first two lines indicate the maximum possible scores for this dataset. These are followed by analysis results at each location (cM position, Zscore, p-value assuming normal approximation, Kong and Cox delta, K&C LOD score and K&C p-value). You will notice that results are identical for the NPL all and pairs statistics -- this is always the case for families with a single affected sib-pair! Linkage peaks at location 52.68 with a Zscore of 3.43 (assymptotic p-value of 0.0003), corresponding to a Kong and Cox LOD score of 3.05 with probability 0.00009.
Two Merlin options can be helpful when sorting through large masses of linkage results. These two options are the --pdf option, which generates a simple graphical summary of your linkage curves, and the --tabulate option, which generates a tab-delimited file summarizing all the results for easy analysis in other programs.
Other commonly used linkage analysis options include requesting output with marker names, instead of cM positions (--markerNames option) and requesting analysis between markers (--steps n for n steps per interval) or along a grid of equally spaced locations along the chromosome (--grid n for an n-cM grid). Try them out! For example...
prompt> merlin -d asp.dat -p asp.ped -m asp.map --steps 4 --pairs --markerNames
... would calculate the NPL pairs statistic at 4 locations between consecutive markers and use marker names in the output.
TIP:The standard non-parametric linkage analysis carried out by Merlin uses the Kong and Cox (1997) linear model to evaluate the evidence for linkage. This model is designed to identify small increases in allele sharing spread across a large number of families -- this is what one usually expects in a complex disease. If you are searching for a large increase in allele sharing in a small number of families, you can select the Kong and Cox (1997) exponential model by adding the --exp option to your command line, after the --npl or --pairs options. This alternative model is more computationally intensive and requires more memory, but provides a better linkage test if you expect a large increase in allele sharing among affected individuals.
To carry out a variance components linkage analysis on the same data set, we will use the --vc option. If you are using a peculiar value, such as 1234 or -99.999 to represent missing values in your data, remember to use the -x peculiar_value option to tell MERLIN about it in all quantitative trait analyses. In the asp pedigree, missing values have been replaced by x. Let's try a variance components analysis:
prompt> merlin -d asp.dat -p asp.ped -m asp.map --vc
In the output, you will see the estimated sample heritability for each phenotype (in this case 86%) followed by estimates of the genetic effect and LOD scores at each marker location:
Phenotype: trait [VC] (200 families, h2 = 86.74%) ===================================================== Position H2 ChiSq LOD pvalue 0.000 40.95% 5.21 1.13 0.011 5.268 51.42% 9.88 2.15 0.0008 10.536 56.26% 13.01 2.82 0.0002 15.804 65.40% 19.63 4.26 0.00000 21.072 60.89% 15.36 3.34 0.00004 (... results continue at other locations...)
In this case, linkage peaks at position 15.8 cM. You could identify which families are contributing the most to these linkage signals using the --perFamily option, which generates an additional file tabulating the contribution of each family to the overall LOD score (for non-parametric analysis this partial contribution will be labelled pLOD).
Since this is a selected sample, you might want to check out the simulation section to find out how to conduct gene-dropping simulations that could be used, for example, to estimate empirical p-values. Or proceed to the error detection (improves power!), haplotyping or ibd estimation sections.