FUGUE is the program used to construct haplotypes for the chromosome 22 and 19 linkage disequilibrium maps.
FUGUE is currently under development, but it already provides some functionality not available in other programs. If you decide to use FUGUE, please e-mail me at email@example.com.
This version is recommended for Unix users with access to the GNU C++ compiler. To install FUGUE, unpack the archive below, type make, and follow the instructions. Have fun!
If you do not have access to a C++ compiler, one of the following precompiled versions may work on your system:
Linux-fugue.tar.gz     For GNU/Linux systems
SunOS-fugue.tar.gz     For Sun workstations
Windows-fugue.tar.gz   For Windows workstations
This tutorial will give you a feel for how FUGUE works. To run it, you will need to have FUGUE and a recent version of MERLIN installed.
We will first see how to estimate haplotype frequencies in a sample of families, unrelated individuals, or both. This is a two-step process: MERLIN enumerates all possible haplotypes for each founder (assuming no recombination), and FUGUE then uses an E-M algorithm to estimate haplotype frequencies.
We will use the three input files family.dat, family.ped and family.map. For a description of input formats, see the MERLIN tutorial. If you examine the input files, you will find that they include genotypes for 569 individuals in 77 families spanning 3 or 4 generations. A total of 10 SNP markers, with an average heterozygosity of 48%, are listed.
To ask MERLIN to list all possible non-recombinant haplotypes for each family, we will use the --all, --zero and --founders command line options. Issue the command:
prompt> merlin -d family.dat -p family.ped -m family.map --zero --founders --all
This will generate a merlin.chr file, which details sets of possible haplotypes for each family, and a merlin.hap file, which summarizes possible haplotype sets for each founder. The latter file will be automatically detected and used by FUGUE as input. To run FUGUE, issue the command:
prompt> fugue -t 0.005
Your screen output will detail the estimated haplotype frequencies (excluding haplotypes with frequencies of zero or close to zero) and the estimated log-likelihood of the data, accurate up to an arbitrary constant. The -t 0.005 command line option requests that only haplotypes with estimated frequencies of 0.005 or greater should be displayed.
FUGUE - Frequency Using Graphs
(c) 2001 Goncalo Abecasis

The following parameters are in effect:
             Input File : merlin.hap (-fname)
               Max Bits : 16 (-b9999)
               Restarts : 0 (-r9999)
  Convergence Threshold : 1e-06 (-c99.999)
      Display Threshold : 0.005 (-t99.999)
     Divide-And-Conquer : OFF (-a[+|-])

Filtering data...
   Total: 390 Haplotypes in 76 Sets
   Known: 0 Haplotypes in 0 Sets [UNDERESTIMATE]

1024 haplotype frequencies will be estimated [~0 Mb of memory required]

Starting with equal allele frequencies...
Pass 9, log(lk) = -634.581
Best log(lk) = -634.58

Haplotypes with estimated frequency > 0.005
   0.55% 1111111222
  39.09% 1111112111
   0.84% 1111112112
   9.45% 1111121222
   5.59% 1111122111
   7.67% 1121121222
   0.83% 1121122111
   0.84% 2221112111
   1.10% 2222221221
  30.43% 2222221222
   0.56% 2222222111
These 11 haplotypes represent 96.94% of total probability
Other commonly used options include the -a option, which generates an approximate solution in datasets with many SNP markers (>20), and the -r option, which tries to avoid local maxima in the likelihood by carrying out a number of random restarts.
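As a rough illustration of the E-M step and of random restarts, here is a minimal Python sketch. It is not FUGUE's implementation: it assumes unrelated individuals only, each described by a list of compatible unordered haplotype pairs (a simplified stand-in for the haplotype sets MERLIN writes to merlin.hap). The function names em_haplotype_frequencies and best_of_restarts and the toy data are hypothetical.

```python
import math
import random
from collections import defaultdict


def em_haplotype_frequencies(candidates, tol=1e-6, max_iter=1000, seed=None):
    """E-M estimate of haplotype frequencies (illustrative sketch only).

    candidates: one list per individual of compatible unordered haplotype
    pairs, e.g. [("11", "22"), ("12", "21")] for a double heterozygote.
    Returns (frequencies, log-likelihood under the last pass's frequencies).
    """
    haplotypes = sorted({h for pairs in candidates for pair in pairs for h in pair})
    if seed is None:
        # Default start: equal frequencies for every observed haplotype.
        freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    else:
        # Random start, used for restarts.
        rng = random.Random(seed)
        raw = {h: rng.random() + 1e-3 for h in haplotypes}
        total = sum(raw.values())
        freq = {h: v / total for h, v in raw.items()}

    previous = -math.inf
    loglik = previous
    for _ in range(max_iter):
        counts = defaultdict(float)
        loglik = 0.0
        for pairs in candidates:
            # E-step: probability of each compatible pair under random mating.
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            loglik += math.log(total)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        freq = {h: counts[h] / (2 * len(candidates)) for h in haplotypes}
        if loglik - previous < tol:
            break
        previous = loglik
    return freq, loglik


def best_of_restarts(candidates, restarts=0):
    """Run E-M from equal frequencies, then from random starts; keep the best."""
    best = em_haplotype_frequencies(candidates)
    for seed in range(restarts):
        trial = em_haplotype_frequencies(candidates, seed=seed)
        if trial[1] > best[1]:
            best = trial
    return best


# Hypothetical toy data: four unambiguous individuals and one double heterozygote.
toy = [
    [("11", "11")],
    [("11", "11")],
    [("11", "11")],
    [("22", "22")],
    [("11", "22"), ("12", "21")],
]
frequencies, loglik = best_of_restarts(toy, restarts=3)
for haplotype, f in sorted(frequencies.items()):
    print(f"{100 * f:6.2f}% {haplotype}")
```

In this toy example the ambiguous individual is resolved almost entirely as 11/22, because those haplotypes are already common among the unambiguous individuals.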
A companion program to FUGUE, FUGUE-CC, is suitable for the analysis of haplotypes in case-control datasets. Similar analyses could be carried out with the standard version of FUGUE and a little bit of scripting, but FUGUE-CC is a timesaver.
For this example, we will use the cc.dat and cc.ped input files. These files contain a set of 44 affected and 43 unaffected individuals genotyped at 6 SNPs. To compare the haplotype frequencies in the case and control samples, run FUGUE-CC with the following options:
prompt> fugue-cc -d cc.dat -p cc.ped -s 10
In the program output, you will see estimated haplotype frequencies and corresponding log-likelihoods for the combined sample (LLK_ALL), for cases only (LLK_CASES), and for controls only (LLK_CONTROLS). In addition, you will see a log-likelihood ratio statistic defined as LLK_CASES + LLK_CONTROLS - LLK_ALL. The best way to evaluate its significance is to generate a number of permuted datasets and analyse each one.
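The permutation idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not FUGUE-CC's code: each individual is reduced to a single phase-known haplotype, so a multinomial maximum log-likelihood stands in for the E-M estimate, and the function names and toy data are hypothetical.

```python
import math
import random
from collections import Counter


def multinomial_llk(haplotypes):
    """Maximised log-likelihood of a list of phase-known haplotypes."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    return sum(c * math.log(c / n) for c in counts.values())


def llr_statistic(haplotypes, is_case):
    """LLK_CASES + LLK_CONTROLS - LLK_ALL for one labelling of the sample."""
    cases = [h for h, c in zip(haplotypes, is_case) if c]
    controls = [h for h, c in zip(haplotypes, is_case) if not c]
    return (multinomial_llk(cases) + multinomial_llk(controls)
            - multinomial_llk(haplotypes))


def permutation_test(haplotypes, is_case, n_perm=100, seed=0):
    """Shuffle case/control labels and count permutations with a higher ratio."""
    rng = random.Random(seed)
    observed = llr_statistic(haplotypes, is_case)
    labels = list(is_case)
    higher = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if llr_statistic(haplotypes, labels) > observed:
            higher += 1
    return observed, higher


# Hypothetical toy sample: haplotype "A" in every case, "B" in every control.
sample = ["A"] * 5 + ["B"] * 5
status = [True] * 5 + [False] * 5
observed, higher = permutation_test(sample, status, n_perm=50)
print(f"ratio = {observed:.3f}, permutations with higher ratio: {higher}/50")
```

Shuffling the labels keeps the group sizes and the pooled haplotype counts fixed, so only the association between haplotype and disease status is broken.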
The -s 10 command line option tells FUGUE-CC to generate 10 such permutations. In this case, a similar or larger statistic was not observed in any of the permuted data sets, so additional permutations are recommended. Here is the output with 100 permutations:
FUGUE FOR CASE-CONTROL DATA
(c) 2001-2003 Goncalo Abecasis

The following parameters are in effect:
                      Data File : cc.dat (-dname)
                  Pedigree File : cc.ped (-pname)
         Random Restarts for EM : 0 (-e9999)
 Random Permutations for Sample : 100 (-s9999)

The pedigree file includes:
   43 cases, 44 controls, 0 individuals of unknown phenotype
   87 founders, 0 non-founders

Haplotyping Combined Sample
===========================

Haplotypes with estimated frequency > 0.001
  34.10% 112111
   1.19% 112112
   0.99% 121221
  19.53% 121222
   5.53% 122111
   0.79% 221221
  35.59% 221222
   2.29% 222111
These 8 haplotypes represent 100.00% of total probability
The logLikelihood of the data is -202.9073

Haplotyping Case Sample
=======================

Haplotypes with estimated frequency > 0.001
  20.56% 112111
   1.16% 121221
  25.22% 121222
   3.86% 122111
   1.19% 221221
  48.01% 221222
These 6 haplotypes represent 100.00% of total probability
The logLikelihood of the data is -84.8631

Haplotyping Control Sample
==========================

Haplotypes with estimated frequency > 0.001
  47.36% 112111
   2.30% 112112
   1.17% 121221
  13.83% 121222
   6.93% 122111
  23.64% 221222
   4.77% 222111
These 7 haplotypes represent 100.00% of total probability
The logLikelihood of the data is -103.5748

Haplotyping Random Permutations of the Data
===========================================

Permutation 1: llk(cases) = -128.953, llk(controls) = -67.058, llk(sum) = -196.011
Permutation 2: llk(cases) = -49.853, llk(controls) = -141.836, llk(sum) = -191.689
(... subsequent lines removed ...)

Summary of Results
==================

  logLikelihood for Combined Sample: -202.907
            logLikelihood for Cases: -84.863
         logLikelihood for Controls: -103.575
 logLikelihood for Cases + Controls: -188.438
                logLikelihood ratio: 14.469
     Permutations with higher ratio: 0/100
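The summary figures can be checked by hand from the three log-likelihoods reported above. The short Python snippet below does the arithmetic. The add-one bound on the empirical p-value is a common convention for permutation tests and is an assumption here, not something FUGUE-CC reports.

```python
# Log-likelihoods copied from the FUGUE-CC output above.
llk_all = -202.9073
llk_cases = -84.8631
llk_controls = -103.5748

# Ratio statistic as defined in the text: LLK_CASES + LLK_CONTROLS - LLK_ALL.
llr = llk_cases + llk_controls - llk_all
print(f"logLikelihood ratio: {llr:.3f}")  # prints "logLikelihood ratio: 14.469"

# Assumption: add-one empirical p-value bound, (higher + 1) / (permutations + 1).
higher, permutations = 0, 100
p_bound = (higher + 1) / (permutations + 1)
print(f"empirical p-value is at most about {p_bound:.4f}")
```

With no permuted ratio exceeding the observed one, 100 permutations can only bound the p-value at roughly 0.01; more permutations would be needed for a tighter estimate.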
Hmm... Even with 100 permutations, none exceed the result in the original sample. This could be quite an interesting finding! ... Unfortunately, this is only a simulated dataset!