University of Michigan Center for Statistical Genetics

GAINQC Second Pass: Marker Checks

In the second pass, GAINQC performs marker-based checks and flags markers that fail to meet the user-set thresholds for marker quality. As in the sample checks, all genotypes with quality scores below the required quality threshold are discarded (set as missing). In addition, samples that were flagged in the first pass are excluded from the second pass.
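
The preparation step described above can be sketched in Python as follows. This is only an illustration, not GAINQC's own code: the function name, the dictionary layout, and the use of None for a missing genotype are all assumptions made for the example.

```python
MISSING = None  # hypothetical representation of a missing genotype


def prepare_marker_calls(calls, min_quality, flagged_samples):
    """Return the genotypes of one marker that enter the second pass.

    calls: dict mapping sample ID -> (genotype, quality score)
    min_quality: genotypes scoring strictly below this are set to missing
    flagged_samples: set of sample IDs flagged in the first pass
    """
    cleaned = {}
    for sample_id, (genotype, quality) in calls.items():
        if sample_id in flagged_samples:
            continue  # samples flagged in the first pass are not used
        if genotype is not MISSING and quality < min_quality:
            genotype = MISSING  # low-quality genotype is discarded
        cleaned[sample_id] = genotype
    return cleaned
```

For example, with a threshold of 0.5 a genotype scored 0.40 becomes missing, while a genotype belonging to a flagged sample is dropped altogether.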

SNP Checks

Genotyping completeness: Genotyping completeness is computed for every marker. Markers with a genotyping completeness rate below the prespecified threshold are flagged. Unlike sample completeness, markers are flagged only on an absolute scale.
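
As a rough sketch of this check, completeness can be taken as the fraction of non-missing genotypes at a marker. The function names, the use of None for a missing genotype, and the choice to report 0.0 for a marker with no samples are assumptions for illustration.

```python
def completeness(genotypes):
    """Fraction of non-missing genotypes for one marker."""
    total = len(genotypes)
    if total == 0:
        return 0.0  # assumption: a marker with no samples has completeness 0
    called = sum(1 for g in genotypes if g is not None)
    return called / total


def flag_low_completeness(marker_genotypes, min_completeness):
    """Names of markers whose completeness is below the absolute threshold."""
    return [name for name, genotypes in marker_genotypes.items()
            if completeness(genotypes) < min_completeness]
```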

Minor allele count: The number of minor alleles is counted for each marker. If this number is less than the user-specified minimum number of minor alleles, the marker is flagged.

Minor allele frequency: Using the minor allele count and the number of non-missing genotypes, the minor allele frequency (MAF) of each marker is calculated. As with the minor allele count, markers with a MAF lower than the user-specified threshold are flagged.
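
The two minor allele statistics can be illustrated with a short Python sketch. It assumes a biallelic, diploid marker whose genotypes are stored as 2-tuples of alleles with None for missing; these conventions and the function name are assumptions, not GAINQC's internal representation.

```python
from collections import Counter


def minor_allele_stats(genotypes):
    """Return (minor allele count, minor allele frequency) for one marker."""
    counts = Counter()
    n_called = 0
    for genotype in genotypes:
        if genotype is None:
            continue  # missing genotypes contribute no alleles
        n_called += 1
        counts.update(genotype)  # adds both alleles of the genotype
    if n_called == 0:
        return 0, 0.0
    # A monomorphic marker has no copies of the second allele.
    mac = min(counts.values()) if len(counts) >= 2 else 0
    maf = mac / (2 * n_called)  # two alleles per non-missing genotype
    return mac, maf
```

With genotypes A/A, A/G, G/G, A/A and one missing call, allele G is seen 3 times among 8 alleles, giving a minor allele count of 3 and a MAF of 0.375.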

Hardy-Weinberg equilibrium p-value: For each SNP, the exact p-value for Hardy-Weinberg equilibrium (HWE) is calculated. More details on the test can be found here. This test is performed only on the founders. In the case of the X chromosome, only female founders are used. Any marker with an HWE p-value less than a given threshold is flagged.
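
One common way to compute an exact HWE p-value is sketched below. It uses the standard distribution of the heterozygote count conditional on the observed allele counts, and defines the p-value as the total probability of all heterozygote counts that are no more likely than the observed one. Whether GAINQC uses exactly this definition is an assumption; the function name is hypothetical, and the caller is expected to pass genotype counts for founders only (female founders for X-chromosome markers).

```python
from math import exp, lgamma, log


def hwe_exact_pvalue(n_hom1, n_het, n_hom2):
    """Exact HWE p-value from the three genotype counts of one marker."""
    n = n_hom1 + n_het + n_hom2
    if n == 0:
        return 1.0
    n_a = 2 * min(n_hom1, n_hom2) + n_het  # copies of the rarer allele
    n_b = 2 * n - n_a                      # copies of the other allele

    def log_prob(het):
        # log P(het heterozygotes | n genotypes, n_a rare alleles)
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2.0) + lgamma(n + 1)
                - lgamma(hom_a + 1) - lgamma(het + 1) - lgamma(hom_b + 1)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    # Possible heterozygote counts share the parity of the rare allele count.
    probs = {het: exp(log_prob(het)) for het in range(n_a % 2, n_a + 1, 2)}
    total = sum(probs.values())
    observed = probs[n_het]
    # Small relative tolerance guards against floating-point ties.
    tail = sum(pr for pr in probs.values() if pr <= observed * (1 + 1e-9))
    return min(1.0, tail / total)
```

For very large samples the individual probabilities can underflow; a production implementation would typically use a recurrence relation or stay in log space.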

Duplicate mismatches: If there are duplicate samples in the study, GAINQC counts the number of mismatches among the duplicate genotypes of each marker. If there are only 2 duplicates, a mismatch is identified if the genotypes are different (and neither one is missing). If there are more than 2 duplicates (with the same sample ID), the number of mismatches is counted by constructing a majority-vote genotype and counting the number of duplicate samples that do not have that genotype. This process is repeated for all groups of duplicate samples. The total number of mismatches for each marker is then the sum of the mismatches over all the duplicate sample groups. Markers with more than a specified number of mismatch errors are flagged.
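
The counting rule above can be sketched as follows. Two points are assumptions, since the description does not settle them: missing genotypes in groups of more than 2 duplicates are ignored, and genotypes are stored in a canonical allele order so that equal genotypes compare equal. The function names are hypothetical.

```python
from collections import Counter


def group_mismatches(genotypes):
    """Mismatches at one marker within one group of duplicate samples."""
    called = [g for g in genotypes if g is not None]
    if len(genotypes) == 2:
        # A pair mismatches only if both genotypes are present and differ.
        return 1 if len(called) == 2 and called[0] != called[1] else 0
    if not called:
        return 0
    # Majority-vote genotype: the most frequent non-missing genotype.
    _, majority_count = Counter(called).most_common(1)[0]
    # Every called duplicate without the majority genotype is a mismatch.
    return len(called) - majority_count


def marker_mismatches(duplicate_groups):
    """Total mismatches at one marker over all duplicate sample groups."""
    return sum(group_mismatches(group) for group in duplicate_groups)
```

Because only the size of the majority matters, the count does not depend on how a tie for the majority genotype is broken.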

Mendelian inconsistencies: The number and rate of Mendelian inconsistencies per marker are also computed. These statistics are computed only if parent-offspring data are present in the study. For each marker, the number of Mendelian inconsistencies is the number of parent-offspring groups (trios or pairs) that show a Mendelian inconsistency. The rate is the ratio of the number of pairs or trios with an error to the total number of pairs or trios tested. Trios and pairs are not included if missing data make testing impossible. Markers with a large number (or rate) of Mendelian inconsistencies are flagged.
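
A minimal sketch of the count and rate for an autosomal marker is given below. Genotypes are 2-tuples of alleles with None for missing. Treating a trio with one missing parent as a parent-offspring pair is an assumption made for this example, as are the function names; sex-chromosome rules are not covered.

```python
def is_consistent(child, father, mother):
    """True or False for a testable family, None when testing is impossible."""
    if child is None or (father is None and mother is None):
        return None
    c1, c2 = child
    if father is not None and mother is not None:
        # Trio: one child allele must come from each parent.
        return ((c1 in father and c2 in mother)
                or (c2 in father and c1 in mother))
    # Pair: the child must share at least one allele with the known parent.
    parent = father if father is not None else mother
    return c1 in parent or c2 in parent


def mendel_stats(families):
    """Return (errors, families tested, error rate) for one marker.

    families: iterable of (child, father, mother) genotypes
    """
    tested = 0
    errors = 0
    for child, father, mother in families:
        result = is_consistent(child, father, mother)
        if result is None:
            continue  # untestable families are excluded from the rate
        tested += 1
        if not result:
            errors += 1
    rate = errors / tested if tested else 0.0
    return errors, tested, rate
```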

Additional Marker Statistics

These statistics are calculated per marker only if they are requested. They also require some extra data to be present, as noted in the description of each statistic.

Transmission Disequilibrium Test (TDT): If there are any trios (or parent-offspring pairs) present in the data, GAINQC can be asked to perform a transmission disequilibrium test (TDT). More information on the TDT is present here. For each marker, the two alleles, the number of times each allele is transmitted from a heterozygous parent to an offspring, the number of trios (or pairs) used, the TDT chi-squared statistic and the corresponding p-value are calculated. These are output in the SNP information file along with the other SNP statistics.
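
The usual form of the TDT statistic is (b - c)^2 / (b + c), where b and c are the numbers of times each allele is transmitted by heterozygous parents, compared against a chi-squared distribution with 1 degree of freedom. The sketch below applies it to complete trios only; handling of parent-offspring pairs is left out, and the function name and data layout (2-tuples of alleles, None for missing) are assumptions rather than GAINQC's implementation.

```python
from math import erfc, sqrt


def tdt(trios, allele_a, allele_b):
    """Return (transmitted a, transmitted b, trios used, chi-squared, p-value).

    trios: iterable of (father, mother, child) genotypes at one marker
    """
    trans_a = trans_b = used = 0
    for father, mother, child in trios:
        if father is None or mother is None or child is None:
            continue  # incomplete trios are skipped in this sketch
        parents = (father, mother)
        het_parents = sum(1 for p in parents if p[0] != p[1])
        if het_parents == 0:
            continue  # only heterozygous parents are informative
        hom_a_parents = sum(1 for p in parents if p == (allele_a, allele_a))
        # Copies of allele_a in the child not explained by homozygous parents
        # must have been transmitted by the heterozygous parent(s).
        from_het = child.count(allele_a) - hom_a_parents
        if not 0 <= from_het <= het_parents:
            continue  # Mendelian inconsistency, trio not used
        trans_a += from_het
        trans_b += het_parents - from_het
        used += 1
    informative = trans_a + trans_b
    if informative == 0:
        return trans_a, trans_b, used, 0.0, 1.0
    chi2 = (trans_a - trans_b) ** 2 / informative
    p_value = erfc(sqrt(chi2 / 2))  # chi-squared upper tail, 1 df
    return trans_a, trans_b, used, chi2, p_value
```

When both parents and the child are heterozygous, the sketch counts one transmission of each allele, which leaves the statistic's numerator unchanged.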

Association Tests: To perform association tests, samples must be labeled. This labeling can be done using the pedigree file. GAINQC performs an allelic association test with n-1 degrees of freedom, where n is the number of distinct labels. More information on the association test can be found here. For each marker, the association test chi-squared statistic, the corresponding p-value and the number of strata used are output in the SNP information file.
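
One way to obtain a statistic with n-1 degrees of freedom is a Pearson chi-squared test on the n x 2 table of allele counts by label, sketched below. That GAINQC uses exactly this statistic is an assumption, as are the function name and data layout; labels without any called genotypes are left out of the strata count, and the p-value (the upper tail of a chi-squared distribution with the returned degrees of freedom) is not computed here.

```python
from collections import Counter


def allelic_association(genotypes_by_label, allele_a, allele_b):
    """Return (chi-squared, degrees of freedom, strata used) for one marker.

    genotypes_by_label: dict mapping label -> list of genotypes, where each
    genotype is a 2-tuple of alleles or None when missing
    """
    table = {}
    for label, genotypes in genotypes_by_label.items():
        counts = Counter()
        for genotype in genotypes:
            if genotype is not None:
                counts.update(genotype)
        if counts[allele_a] + counts[allele_b] > 0:
            table[label] = (counts[allele_a], counts[allele_b])
    strata = len(table)
    total_a = sum(a for a, _ in table.values())
    total_b = sum(b for _, b in table.values())
    total = total_a + total_b
    if strata < 2 or total_a == 0 or total_b == 0:
        return 0.0, max(strata - 1, 0), strata  # test is not informative
    chi2 = 0.0
    for a, b in table.values():
        row_total = a + b
        for observed, column_total in ((a, total_a), (b, total_b)):
            expected = row_total * column_total / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2, strata - 1, strata
```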

Average quality score: The average quality score of the sample genotypes is computed using the genotypes before thresholding, i.e. all of the genotypes are used, including those with a quality score lower than the threshold. A histogram of average quality scores is included with the sample histograms. This statistic is computed only if quality scores are available.
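
The point that the average is taken before thresholding can be made concrete with a small sketch. The function name is hypothetical, and returning None when no scores are available is an assumption.

```python
def average_quality(quality_scores):
    """Mean quality score over all genotypes, ignoring the quality threshold."""
    scores = [q for q in quality_scores if q is not None]
    if not scores:
        return None  # no quality scores available for this marker
    return sum(scores) / len(scores)
```

A genotype scored 0.5 still contributes to the average even when the quality threshold would set that genotype to missing in the marker checks.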

NOTE: None of the additional marker statistics are used for flagging markers. They are only provided as additional information for each marker.


 
 

University of Michigan | School of Public Health | Abecasis Lab