MERLIN Quick Reference

MERLIN - Reference Sheet

The following is a summary of all available MERLIN command line options and their meanings:

Input Files and Basic Parameters

-d datafile: Selects input data file, in linkage or QTDT format.
-p pedfile: Selects pedigree file, with genotype, phenotype and family structure information
Newer versions of Merlin (>1.1) can combine multiple data and pedigree files on the fly. To do this, list multiple data files separated by commas after the -d option, for example, -d pheno.dat,geno.dat, and also list the corresponding pedigree files separated by commas after the -p option, for example, -p pheno.ped,geno.ped.
-x missing_value_code: Selects the missing value code for quantitative phenotypes and covariates in the pedigree file. If possible, it is always safer to replace missing values with 'x', rather than use this option.
-m mapfile: File indicating chromosome and centimorgan position for each marker. Use with QTDT format input files. Recombination fractions will be derived from marker positions using the Haldane mapping function.
-f [a|e|f|m|file]: Source for allele frequency information. Allele frequencies can be set in a user-specified file (-f filename), estimated using maximum likelihood (-fm), estimated by counting in founders (-ff) or in all individuals (-fa), or assumed equal (-fe). For use with QTDT format input files.
-r seed: Selects a different random sequence for simulation and sampling of haplotypes.
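
To illustrate the map handling described for the -m option, the Haldane mapping function converts a map distance d (in Morgans) into a recombination fraction theta = (1 - exp(-2d)) / 2. The sketch below is not MERLIN code; the function name and the marker positions are made up for illustration.

```python
import math

def haldane_theta(distance_cm):
    """Recombination fraction for a map distance in centimorgans (Haldane)."""
    morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * morgans))

# Hypothetical marker positions (cM) along one chromosome.
positions_cm = [0.0, 5.0, 12.5, 40.0]

# Recombination fraction between each pair of consecutive markers.
thetas = [haldane_theta(b - a) for a, b in zip(positions_cm, positions_cm[1:])]
```

Under this function the recombination fraction grows almost linearly for small distances and approaches 0.5 as markers become far apart.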

General Analyses

--error: Find unlikely genotypes. Likely errors are listed in merlin.err file.
--information: Calculate information based on entropy at each analysis position.
--likelihood: Calculate likelihood of observed genotype data.
--model parametric_models.tbl: Calculate parametric LOD scores, using the models specified in parametric_models.tbl. For a detailed description of this option, see the MERLIN parametric linkage analysis tutorial and the MERLIN reference.

IBD State Calculations

--ibd: Output pairwise IBD coefficients to merlin.ibd
--kinship: Output pairwise kinship coefficients to merlin.kin
--matrices [see * note]: Calculate possible pairwise IBD matrices and their probabilities for each family. This information is stored in the file merlin.kmx
--extended [see * note]: Output extended IBD state information to merlin.s15. Extended IBD states track sharing of maternal and paternal alleles separately and also provide additional information for inbred pedigrees.
--select: Select most informative affected individuals on the basis of allele sharing information, and record the results in the file merlin.sel. If it is only practical to genotype a single individual per family in an association study, genotyping these individuals can improve power (Fingerlin et al, 2004).
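
The relationship between the pairwise IBD and kinship summaries listed above can be sketched as follows. For a pair of non-inbred individuals, the kinship coefficient equals 0.25 x P(IBD=1) + 0.5 x P(IBD=2). This is the textbook identity rather than a description of MERLIN's internal code, and the function name and probabilities are illustrative.

```python
def kinship_from_ibd(p0, p1, p2):
    """Kinship coefficient from P(IBD=0), P(IBD=1), P(IBD=2) for a non-inbred pair."""
    if abs(p0 + p1 + p2 - 1.0) > 1e-6:
        raise ValueError("IBD probabilities must sum to 1")
    return 0.25 * p1 + 0.5 * p2

# Prior (no marker data) IBD probabilities for common relationships.
full_sibs = kinship_from_ibd(0.25, 0.50, 0.25)
parent_child = kinship_from_ibd(0.0, 1.0, 0.0)
unrelated = kinship_from_ibd(1.0, 0.0, 0.0)
```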

Non-Parametric Linkage Analyses

--npl: Use the Whittemore and Halpern NPL all statistic to test for allele sharing among affected individuals. Also calculates a LOD score using the Kong and Cox linear model.
--pairs: Use the Whittemore and Halpern NPL pairs statistic to test for allele sharing among affected individuals. Also calculates a LOD score using the Kong and Cox linear model. Versions 0.10.1 and higher also consider sharing within inbred individuals when computing this statistic.
--qtl: Use a non-parametric statistic to test for sharing among individuals with similar phenotypes. Use the sample mean to estimate the population mean, and calculate a LOD score using the Kong and Cox linear model.
--deviates: Similar to --qtl, but assumes that phenotypes are deviates from the population mean.
--exp: Calculate non-parametric LOD scores using the Kong and Cox exponential model. Although more time consuming, this option can be powerful in datasets that show very strong linkage signals or which include larger pedigrees.
--zscores: Generate a compact file summarizing family-specific NPL scores at each location. The file can be used for additional follow-up analyses.

Variance Components Linkage Analysis

--vc: Perform variance components linkage analysis assuming no dominance. Also calculates sample heritability for each trait.
--useCovariates: Model covariate effects during analysis. In QTDT format data files, covariates are indicated by "C" data type.
--ascertainment: Model single proband ascertainment. In ascertained families, the proband can be tagged by setting his individual id to "proband" or by including a dummy affection status variable named "proband" and setting its value to 2 (or affected) for probands and missing otherwise.
--unlinked alpha: Use a simple heterogeneity model for linkage. The model assumes that a fraction alpha of the families are unlinked.

Association Analyses

--infer: Estimate missing genotypes in a pedigree. When this option is selected, MERLIN will estimate the posterior distribution of each missing SNP genotype conditional on available genotype data. A new pedigree file will be generated including the most likely genotype for each individual [whenever this most likely genotype has a posterior probability greater than 95%], the probability that each missing genotype is a homozygote for the reference allele, the probability that each missing genotype is a heterozygote, and the expected number of copies of a reference allele in each missing genotype.
Genotypes will only be inferred for markers with exactly two alleles. Multi-allelic and monomorphic markers will not be included in the output file. If you want a smaller output pedigree file, consider the --inferBest, --inferExpected and --inferProbabilities options as alternatives.
--assoc: This option uses a variance component model to estimate an additive effect for each SNP and carry out an association test. Before evaluating evidence for association, missing genotypes are estimated to increase power.
--fastAssoc: This option uses a rapid score test to estimate an additive effect for each SNP. It is slightly less accurate, but much more computationally efficient, than the --assoc option and recommended for first pass analysis of genome-wide scans and other large datasets.
--filter threshold: When the --fastAssoc option is used, only output p-values below threshold.
--custom covariates.tbl: The custom file allows users to customize the covariate model for each trait. For each trait to be analyzed, this file should contain two lines. The first line should include the TRAIT keyword followed by the trait name. The second line should include the COVARIATE keyword followed by a list of appropriate covariates.
This option affects both association analysis and quantitative trait linkage analyses.
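
A covariate model file of the kind described for --custom could be generated with a few lines of Python. This is a sketch only: the trait and covariate names are hypothetical, and separating entries with single spaces is an assumption, since the description above specifies only the TRAIT and COVARIATE keywords and the two-line layout.

```python
def format_custom_models(models):
    """Render {trait: [covariates]} as TRAIT / COVARIATE line pairs."""
    lines = []
    for trait, covariates in models.items():
        lines.append(f"TRAIT {trait}")
        lines.append("COVARIATE " + " ".join(covariates))
    return "\n".join(lines) + "\n"

# Hypothetical traits and covariates.
models = {"BMI": ["AGE", "SEX"], "HEIGHT": ["SEX"]}
text = format_custom_models(models)
print(text, end="")
```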

If you use the above options for association analysis, please cite Chen and Abecasis (AJHG, 2007) which provides a full account of the approach.
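
The quantities reported by --infer are related in a simple way. Given the posterior probabilities of the reference homozygote and the heterozygote, the expected number of reference allele copies is 2 x P(homozygote) + P(heterozygote), and a most likely genotype is reported only when its posterior probability exceeds 0.95. The sketch below illustrates that arithmetic; the function name and genotype labels are hypothetical, and it does not reproduce MERLIN's output format.

```python
def summarize_genotype(p_hom_ref, p_het, threshold=0.95):
    """Return (best_call_or_None, expected reference allele count)."""
    p_hom_alt = 1.0 - p_hom_ref - p_het
    if p_hom_alt < -1e-9:
        raise ValueError("probabilities exceed 1")
    p_hom_alt = max(p_hom_alt, 0.0)
    dosage = 2.0 * p_hom_ref + p_het
    posteriors = {"ref/ref": p_hom_ref, "ref/alt": p_het, "alt/alt": p_hom_alt}
    best = max(posteriors, key=posteriors.get)
    # Only report a hard call when the posterior clears the threshold.
    call = best if posteriors[best] > threshold else None
    return call, dosage

confident = summarize_genotype(0.98, 0.02)
uncertain = summarize_genotype(0.60, 0.35)
```

The expected allele count remains usable for association testing even when no single genotype clears the threshold.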

Analysis Positions

--steps:n: Carry out analyses at n equally spaced locations between consecutive markers.
--minStep:dist: When carrying out analyses between markers, ensure that consecutive analysis locations are separated by at least dist centiMorgans.
--maxStep:dist: When carrying out analyses between markers, ensure that consecutive analysis locations are separated by no more than dist centiMorgans.
--grid:n: Carry out analysis along an n-cM grid of equally spaced locations, starting at the location specified with --start option and continuing up to the location specified with the --stop option. If --start and --stop are left blank, start at the first marker and stop after the final marker in each chromosome.
--start:pos: Start analyses at pos centiMorgans.
--stop:pos: Stop analyses at pos centiMorgans.
--positions:pos1,pos2,...: Carry out analysis only at the specified positions. Each position can be a marker name or centimorgan location.
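
The grid behaviour described for --grid, --start and --stop can be sketched as below. This is an illustration of the description, not MERLIN's implementation: the function name and marker positions are hypothetical, and whether MERLIN adds a final analysis point at the stop location when it does not fall on the grid is not stated above, so this sketch does not add one.

```python
def grid_positions(step_cm, start_cm, stop_cm):
    """Equally spaced analysis positions from start_cm up to stop_cm."""
    if step_cm <= 0:
        raise ValueError("step must be positive")
    positions = []
    k = 0
    # Multiply rather than accumulate to avoid floating-point drift.
    while start_cm + k * step_cm <= stop_cm + 1e-9:
        positions.append(round(start_cm + k * step_cm, 6))
        k += 1
    return positions

# Hypothetical chromosome: first marker at 0 cM, last marker at 23 cM.
marker_positions = [0.0, 7.5, 16.0, 23.0]
grid = grid_positions(5, marker_positions[0], marker_positions[-1])
```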

Haplotyping Analyses

--best: Output the most likely haplotype vector to merlin.chr
--sample: Samples a likely haplotype vector according to likelihood and outputs it to merlin.chr. Use the random seed parameter, -r, to sample a different vector.
--sample:n: Repeats the sampling process n times for each family.
--all: List all possible haplotype vectors for each family in merlin.chr. Must be used with the --zero recombination option.
--founders: List founder haplotype graphs in merlin.hap.
--horizontal: Use an alternative, horizontal format for outputting haplotypes. In this alternative format alleles for each individual haplotype are listed along a single line

Recombination Options

--zero: Assume no recombination between markers. Families with obligate recombinants will be discarded.
--one, --two, --three: Allow 1, 2 or 3 recombination events between consecutive informative markers. This can improve performance of Lander-Green algorithm convolutions and still provide accurate solutions when markers are closely spaced.
--singlepoint: Consider each marker individually.

Marker Clustering Options for Modelling Linkage Disequilibrium

--cluster clustering.tbl: Model linkage disequilibrium for clusters of neighboring markers defined in the clustering.tbl file. The file should indicate groups of markers that are in linkage disequilibrium and, optionally, frequencies of the haplotypes they define. If haplotype frequencies are not provided, they will be estimated automatically. For more details of options for modeling linkage disequilibrium, see the tutorial on modeling marker-marker disequilibrium with MERLIN
--distance threshold: Automatically define clusters and estimate haplotype frequencies for groups of markers that are less than threshold cM apart.
--rsq threshold: Automatically define clusters including pairs of SNPs for which pairwise r² exceeds threshold and all intervening markers.
--cfreq: This option instructs merlin to generate a file summarizing clusters of markers in linkage disequilibrium and the haplotype frequency distribution within each cluster. This file can be used with the --cluster option in subsequent analysis.
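
For reference, the pairwise r² used by the --rsq threshold can be computed from haplotype and allele frequencies as r² = D² / (pA(1 - pA) pB(1 - pB)), where D = pAB - pA pB. The sketch below shows that standard formula; the function names are hypothetical, and how MERLIN estimates the underlying haplotype frequencies is not covered here.

```python
def pairwise_r2(p_ab, p_a, p_b):
    """r^2 between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and B at SNP 2
    p_a, p_b: frequencies of alleles A and B
    """
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for monomorphic markers")
    return d * d / denom

def should_cluster(p_ab, p_a, p_b, threshold):
    """True when the pair's r^2 exceeds the clustering threshold."""
    return pairwise_r2(p_ab, p_a, p_b) > threshold
```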

Resource Usage

--bits:n: Do not attempt to analyse pedigrees of more than n bit complexity.
--megabytes:n: Do not attempt to allocate more than n megabytes of memory. Starting with version 1.1 Merlin will select different strategies to analyze larger pedigrees when it expects the standard approach will exhaust memory. This option can stop unnecessary crashes and facilitate the analysis of large pedigrees.
--minutes:n: Do not attempt to analyse families where calculations for the forward portion of the Markov-Chain require more than n minutes.
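
As a rough guide to the --bits limit, pedigree complexity in Lander-Green style calculations is usually measured as 2n - f bits, where n is the number of non-founders and f the number of founders. The sketch below assumes that definition and uses hypothetical family sizes; it only illustrates which families would fall under a given limit.

```python
def pedigree_bits(n_founders, n_nonfounders):
    """Usual Lander-Green complexity: 2 * non-founders - founders."""
    return max(2 * n_nonfounders - n_founders, 0)

def families_within_limit(families, max_bits):
    """Keep families (name, founders, non-founders) at or below max_bits."""
    return [name for name, f, n in families if pedigree_bits(f, n) <= max_bits]

# Hypothetical families: (name, founders, non-founders).
families = [("nuclear", 2, 3), ("three_gen", 4, 10), ("large", 6, 20)]
kept = families_within_limit(families, 24)
```

Because the number of inheritance states doubles with each additional bit, a modest change in the limit can change run time and memory use substantially.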

Performance

--trim: Trim pedigree by removing individuals with no phenotype or genotype data who are not required to define kin relationships between other individuals in the pedigree
--noCoupleBits: Disable founder couple symmetry. This option generally slows things down, but allows grandmaternal and grandpaternal haplotypes to be distinguished during haplotyping analyses even when grandparents are not genotyped.
--swap: Use swap file to reduce memory usage.
--smallSwap: Uses an alternative strategy to manage swap files, so as to conserve disk space.

Output Formatting

--quiet: Do not output progress reports when analyzing large families
--markerNames: Use marker names, rather than cM positions, to label results
--frequencies: Output allele frequencies calculated internally by MERLIN to a file
--perFamily: Output LOD scores for each family to a file. For non-parametric analyses, output includes the non-parametric Z score for each family and two LOD scores calculated using the Kong and Cox method, one using the best fitting overall model (pLOD) and the other maximized within each family (LOD). For variance components analyses, the output includes each family's contributions to the log-likelihood under the null and alternative hypotheses, as well as to the LOD score.
--pdf: Output LOD score plots to pdf file merlin.pdf.
--tabulate: Generate tables summarizing key analysis results in tab-delimited format. These tables can be convenient for subsequent analysis.
--prefix label: Requests that output file names should be derived from label. For example, estimated haplotypes should be stored in a file called label.chr.

Simulation Options

--simulate

Perform gene dropping simulation. Generate random genotypes for each marker, conditional on the current missing data pattern, genetic map and allele frequencies. Use the random seed option (-r seed) to select a different replicate.

--reruns N

Repeat simulation N times.

--trait AFFECTION,FREQ(-),PEN(+/+),PEN(+/-),PEN(-/-),POSITION

--trait QTLNAME,SNP,Var(QTL),Var(Polygenes),Var(Environment)

When combined with the --simulate option, this instructs Merlin to simulate a quantitative trait or discrete trait. The --trait option is interpreted slightly differently in each case.

For a discrete trait, genotypes are simulated conditional on observed phenotypic data: if a particular family includes two affected individuals, Merlin will sample genotypes conditional on that outcome and the genetic model you specify. Merlin will simulate genotypes conditional on the phenotypes in a discrete trait labeled AFFECTION. It will assume that the disease risk allele has frequency FREQ(-) and that the probability of developing disease, conditional on the (unobserved) disease locus genotypes, is PEN(+/+), PEN(+/-) and PEN(-/-). The disease locus will be placed at position POSITION.

For a quantitative trait, simulated phenotypes will replace observed phenotypes (but the original missing data pattern will be respected so that if, for example, all parental trait values are missing in the original data, they will also be missing in the simulated data). With the second format above, simulated trait values will be stored in a column labeled QTLNAME. The QTL will be influenced by SNP, which will explain Var(QTL) of the total variance. The remaining variance will be either polygenic, Var(Polygenes), or environmental, Var(Environment).

The QTL phenotypes will replace the original trait values for QTLNAME, but will respect the original missing data pattern. The QTL genotypes will also replace the original genotypes for SNP, but will respect the original missing data pattern. An example set of options might be: --simulate --trait BMI,rs9930506,0.01,0.39,0.60. This would simulate trait BMI such that marker rs9930506 accounts for 0.01 of the variance, with residual polygenic variance of 0.39 and residual environmental variance of 0.60.
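
The variance partition in a quantitative --trait specification can be checked with a short script. The sketch below splits the example string into its parts and, assuming a purely additive SNP in Hardy-Weinberg equilibrium, derives the per-allele effect a that satisfies 2p(1 - p)a² = Var(QTL). The allele frequency of 0.3 is made up for illustration, the function names are hypothetical, and none of this is meant to describe how MERLIN parses the option or simulates data.

```python
import math

def parse_qtl_trait(spec):
    """Split 'NAME,SNP,Var(QTL),Var(Polygenes),Var(Environment)' into parts."""
    name, snp, v_qtl, v_poly, v_env = spec.split(",")
    return name, snp, float(v_qtl), float(v_poly), float(v_env)

def additive_effect(var_qtl, allele_freq):
    """Per-allele effect a such that 2p(1-p)a^2 equals var_qtl."""
    return math.sqrt(var_qtl / (2.0 * allele_freq * (1.0 - allele_freq)))

name, snp, v_qtl, v_poly, v_env = parse_qtl_trait("BMI,rs9930506,0.01,0.39,0.60")
total = v_qtl + v_poly + v_env           # total trait variance
qtl_share = v_qtl / total                # fraction explained by the SNP
effect = additive_effect(v_qtl, allele_freq=0.3)  # 0.3 is a made-up frequency
```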

--save

Save simulated pedigree and corresponding data, map and allele frequency files as merlin-replicate.ped, merlin-replicate.dat, merlin-replicate.map and merlin-replicate.freq, respectively.

Miscellaneous options

--simwalk2: Perform a smart linkage analysis in conjunction with Simwalk2. MERLIN tackles the small pedigrees, Simwalk2 handles the larger ones, and the results are combined into a single answer. This option requires Mega2 version 2.3 or later and MERLIN version 0.9.2 or later. Please see the Mega2 Manual for more detailed information.
--inverseNormal: Apply quantile normalization to each quantitative trait prior to analysis.

Options marked * are currently available on a trial basis. They probably require careful validation, but they may still be useful.

MINX: Chromosome X Analyses

MINX (MERLIN in X) is an X-specific version of Merlin. It is available in distributions of MERLIN version 0.9.1 and later. There is currently no manuscript describing MINX performance and algorithms in detail. Although I believe MINX results to be correct, the methods are unpublished and I would advise using with care.

MINX implements X-chromosome specific versions of the functions provided by the standard Merlin implementation. Males are hemizygous and carry only one X chromosome. MINX assumes that males are scored as homozygous in the input pedigree file.

MERLIN-REGRESS: Pedigree Wide Regression Analysis

Sham et al. (Am J Hum Genet 71:238-253)

MERLIN-REGRESS implements an extension of the Haseman-Elston quantitative trait linkage analysis procedure that extracts linkage information from trait squared sums and squared differences for all non-inbred relative pairs. For a detailed analytical description of this approach, please see the manuscript by Sham et al. (2002).

This regression approach provides a powerful quantitative trait linkage test even in selected samples, but requires specification of the trait mean, variance and covariances between different relative pairs. The present implementation derives covariances between different types of relative pairs from their kinship coefficients and an estimate of the trait heritability.
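
One common way to derive such covariances, assuming a purely additive model with no dominance or shared environment, is Cov = 2 x kinship x heritability x variance. The sketch below applies that formula to a few standard relative pairs; the heritability and variance values are made up, and the exact calculation used by MERLIN-REGRESS is not specified here.

```python
def expected_trait_covariance(kinship, heritability, variance):
    """Additive-model covariance between two relatives: 2 * phi * h^2 * sigma^2."""
    return 2.0 * kinship * heritability * variance

# Kinship coefficients for common non-inbred relative pairs.
KINSHIP = {"full sibs": 0.25, "half sibs": 0.125, "first cousins": 0.0625}

covariances = {
    pair: expected_trait_covariance(phi, heritability=0.5, variance=4.0)
    for pair, phi in KINSHIP.items()
}
```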

Most of the MERLIN-REGRESS options are shared with MERLIN and described above. The following are MERLIN-REGRESS specific options:

Basic Trait Modeling Options

--mean:x: Mean for the trait under investigation in an (unselected) population. Misspecifying this parameter will generally result in decreased power.
--variance:x: Variance for the trait under investigation in an (unselected) population. Misspecifying this parameter will generally result in decreased power.
--heritability:x: Heritability for the trait under investigation in an (unselected) population. Underestimating the trait heritability can result in inflated error rates, so it is prudent to avoid setting this value too low.
--testRetest: Specifies the correlation between repeated measures of the same variable. This is useful when multiple measurements have been taken (and averaged) for each subject. To use this option, the pedigree file should include covariates (one per trait) indicating the number of times each subject was measured for each trait. This variable must be named TRAIT_REPEATS for each TRAIT where repeat measurements are available.
-t modelsFile: Specifies the name of a file listing alternative models for analysis. This should be a space delimited file where each line indicates a trait name, mean, variance and heritability. An example is available in the tutorial. When this table exists, the --mean, --variance and --heritability command line options are ignored.
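
A models file of the kind described for -t could be read with a small parser like the one below. The trait names and values are hypothetical, and the sketch assumes the file has no header row, since only the four space-delimited fields per line are described above.

```python
def parse_models(text):
    """Parse lines of 'trait mean variance heritability' into dictionaries."""
    models = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        trait, mean, variance, heritability = fields
        models.append({
            "trait": trait,
            "mean": float(mean),
            "variance": float(variance),
            "heritability": float(heritability),
        })
    return models

# Hypothetical file contents: one model per line.
example = """\
BMI 25.0 16.0 0.40
HEIGHT 170.0 49.0 0.80
"""
models = parse_models(example)
```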

Options for Modeling Random Samples

--randomSample: Specifies that the sample was not selected, and that MERLIN-REGRESS should use the observed sample mean, variance and heritability as estimates of population parameters.
--useCovariates: Specifies that covariates in the pedigree file should be "regressed-out" before analysis. This option is only available for random samples.
--inverseNormal: Specifies that inverse normal transformation (where each measurement is transformed to its corresponding quantile in a standard normal distribution) should be applied to the data before analysis. This option is only available in random samples, and can be helpful in dealing with data where outliers are present.
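
The inverse normal transformation can be sketched in a few lines. Each value is ranked and mapped to the standard normal quantile of its rank. The (rank - 0.5) / n offset used here is one common convention and an assumption, as is the lack of special handling for ties; MERLIN-REGRESS may differ on both points.

```python
from statistics import NormalDist

def inverse_normal_transform(values):
    """Replace each value with the standard normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    transformed = [0.0] * n
    for rank, index in enumerate(order, start=1):
        # (rank - 0.5) / n keeps quantiles strictly inside (0, 1).
        transformed[index] = NormalDist().inv_cdf((rank - 0.5) / n)
    return transformed

# A sample with one extreme outlier.
scores = inverse_normal_transform([2.1, 3.4, 2.9, 250.0, 3.0])
```

After transformation the outlier is simply the largest of five evenly ranked scores, so it no longer dominates the sample variance.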

Other Options

--rankFamilies: Rank families according to their expected informativeness. This information can help focus genotyping efforts.

University of Michigan | School of Public Health | Abecasis Lab