MERLIN Tutorial -- Error detection

Genotyping errors can lead to misleading inferences about gene flow in pedigrees and greatly reduce the effectiveness of pedigree analysis. In this section, we will use MERLIN to conduct a sensitivity analysis of the likelihood and identify problem genotypes.

You can find the simulated data set for this section in the examples subdirectory of the MERLIN distribution or in the download page.

The dataset consists of a simulated 5-cM scan of chromosome 24 in 200 affected sib-pair families and is organized into 3 files, a data file (error.dat), a pedigree file (error.ped) and a map file (error.map). An overview of MERLIN input files is available elsewhere.

Sibpair with no errors

How does error detection work?

Before conducting the error detection analysis, we will review the basic principles behind it. Consider the simple pedigree to the left, with two siblings genotyped at several markers. Since their genotypes are identical at all markers, it seems quite likely that they share the stretch of chromosome under investigation.


Sibpair with one error

Now, consider what happens if we change the genotype for a single marker (indicated by the red circle)... This marker now contradicts information provided by all others, indicating that perhaps one of the parents carried two nearly identical copies of the chromosome or two recombination events occurred.

In the first example, inference about inheritance is relatively consistent at all markers, while in the second example inference about inheritance is strongly influenced by the single genotype. Intuitively, the first outcome seems much more plausible.

MERLIN finds genotypes that provide information about gene flow in a pedigree that contradicts information provided by other available data. MERLIN considers all available data simultaneously (not just pairs of individuals) so that error detection improves in accuracy in larger pedigrees. Genotypes flagged by MERLIN are likely to be errors and are certainly worth checking!

Error detection using MERLIN

To run error detection using merlin, we need to provide an input pedigree file (-p command line option) and matching data and map files (-d and -m options) and request an error detection analysis (--error option):

prompt> merlin -d error.dat -p error.ped -m error.map --error

Try it out! You should see the merlin banner and a summary of selected options, followed by a list of unlikely genotypes. In this case, this is the list:

Family:     2 - Founders: 2  - Descendants: 2  - Bits: 2
  MRK11 genotype for individual 3 is unlikely [0.003848]
  MRK11 genotype for individual 4 is unlikely [0.003848]

Family:    73 - Founders: 2  - Descendants: 2  - Bits: 2
  MRK17 genotype for individual 3 is unlikely [0.008866]
  MRK17 genotype for individual 4 is unlikely [0.008866]

Family:    81 - Founders: 2  - Descendants: 2  - Bits: 2
  MRK8 genotype for individual 3 is unlikely [0.001567]
  MRK8 genotype for individual 4 is unlikely [0.001567]

Family:    94 - Founders: 2  - Descendants: 2  - Bits: 2
  MRK12 genotype for individual 3 is unlikely [0.002101]
  MRK12 genotype for individual 4 is unlikely [0.002101]

Family:   136 - Founders: 2  - Descendants: 2  - Bits: 2
  MRK16 genotype for individual 3 is unlikely [0.008330]
  MRK16 genotype for individual 4 is unlikely [0.008330]

Family:   162 - Founders: 2  - Descendants: 2  - Bits: 2
  MRK14 genotype for individual 3 is unlikely [0.003037]
  MRK14 genotype for individual 4 is unlikely [0.003037]

Family:   164 - Founders: 2  - Descendants: 2  - Bits: 2
  MRK6 genotype for individual 3 is unlikely [0.001805]
  MRK6 genotype for individual 4 is unlikely [0.001805]

Unlikely genotypes listed in file [merlin.err]

In this data set with 20 markers and 200 sib-pair families, MERLIN flagged 7 pairs of unlikely genotypes. Since we are dealing with sib-pairs, errors are not pinpointed to specific individuals (all that we can tell is that at least one of the siblings is likely to have an erroneous genotype in each family!).

In a real-life setting it would be worthwhile re-checking genotype assays for these individuals. In this case, we will simply run pedwipe to erase genotypes that are flagged as problematic. Run:

prompt> pedwipe -d error.dat -p error.ped 

Pedwipe retrieves a list of unlikely genotypes from the merlin.err file and removes them from the data. A new set of data and pedigree files is created, named wiped.dat and wiped.ped. You can get a feel for the impact of these 7 problematic genotypes on linkage analysis by running a non-parametric linkage analysis before and after their removal:

prompt> merlin -d error.dat -p error.ped -m error.map --npl 
(...excerpt of results before removing problematic genotypes...)
Phenotype: affection [ALL] (200 families)
============================================================
                 Pos   Zmean  pvalue    delta    LOD  pvalue
              42.144    2.16    0.02    0.186   1.24   0.008
              47.412    2.39   0.008    0.204   1.51   0.004
              52.680    2.57   0.005    0.214   1.69   0.003
              57.948    1.72    0.04    0.145   0.76    0.03
              63.216    1.19    0.12    0.106   0.39    0.09
 
prompt> merlin -d wiped.dat -p wiped.ped -m error.map --npl 
(...excerpt of results after removing problematic genotypes...)
Phenotype: affection [ALL] (200 families)
============================================================
                 Pos   Zmean  pvalue    delta    LOD  pvalue
              42.144    2.24   0.012    0.191   1.32   0.007
              47.412    2.48   0.007    0.209   1.60   0.003
              52.680    2.87   0.002    0.237   2.10  0.0009
              57.948    2.10    0.02    0.175   1.13   0.011
              63.216    1.47    0.07    0.127   0.57    0.05

The seven problematic genotypes (out of 8,000 total genotypes), cause a 0.4 change in the Kong and Cox allele sharing LOD score! To learn about estimating false positive rates for error detection and linkage analysis you should proceed to the simulation section. Alternatively, you may want to learn more about linkage analysis, haplotyping or ibd estimation.