GAINQC Second Pass: Marker Checks
In the second pass, GAINQC performs marker-based checks and flags markers that fail to meet the user-set
thresholds for marker quality. As in the sample checks, all genotypes with quality scores below the required
quality threshold are discarded (set to missing). In addition, samples that were flagged in the first pass are
not used in the second pass.
SNP Checks
Genotyping completeness:
The genotyping completeness is computed for every marker. Markers with a genotyping completeness rate less than the
prespecified threshold are flagged. Unlike the sample completeness, the markers are only flagged on an absolute scale.
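The completeness check can be sketched in Python as follows. This is an illustration only, not GAINQC's code: the function names, the 0/1/2 genotype coding with None for missing, and the parallel list of quality scores are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence

Genotype = Optional[int]  # assumed coding: 0, 1 or 2 copies of one allele; None = missing


def apply_quality_threshold(genotypes: Sequence[Genotype],
                            qualities: Sequence[float],
                            min_quality: float) -> List[Genotype]:
    """Set genotypes whose quality score is below the threshold to missing."""
    return [g if (g is not None and q >= min_quality) else None
            for g, q in zip(genotypes, qualities)]


def marker_completeness(genotypes: Sequence[Genotype]) -> float:
    """Fraction of samples with a non-missing genotype at one marker."""
    if not genotypes:
        return 0.0
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def flag_low_completeness(markers: Dict[str, Sequence[Genotype]],
                          min_completeness: float) -> List[str]:
    """Names of markers whose completeness is below the absolute threshold."""
    return [name for name, genos in markers.items()
            if marker_completeness(genos) < min_completeness]
```

In this sketch the quality filter would be applied first, so that low-quality genotypes count as missing when completeness is computed.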
Minor allele count:
The number of minor alleles is counted for each marker. If this number is less than the minimum number of minor alleles
required (user specified threshold), the marker is flagged.
Minor allele frequency:
Using the minor allele count and the number of non-missing genotypes, the minor allele frequency (MAF) of each marker is calculated.
As with the minor allele count, markers with a MAF lower than the user-specified threshold are flagged.
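A minimal sketch of the two minor allele checks is given below. It assumes diploid genotypes coded as the number of copies of one allele (0, 1 or 2, with None for missing), so that each non-missing genotype contributes two alleles; the function names and threshold parameters are hypothetical and not taken from GAINQC.

```python
from typing import Optional, Sequence, Tuple


def minor_allele_stats(genotypes: Sequence[Optional[int]]) -> Tuple[int, float]:
    """Return (minor allele count, minor allele frequency) for one marker."""
    called = [g for g in genotypes if g is not None]
    total_alleles = 2 * len(called)          # assumes two alleles per genotype
    if total_alleles == 0:
        return 0, 0.0
    coded_count = sum(called)                # copies of the coded allele
    mac = min(coded_count, total_alleles - coded_count)
    return mac, mac / total_alleles


def fails_minor_allele_checks(genotypes: Sequence[Optional[int]],
                              min_count: int,
                              min_maf: float) -> bool:
    """True if the marker would be flagged by either threshold."""
    mac, maf = minor_allele_stats(genotypes)
    return mac < min_count or maf < min_maf
```

For example, the genotypes [0, 1, 2, 2, None] give 8 observed alleles, 5 copies of the coded allele, and therefore a minor allele count of 3 and a MAF of 0.375.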
Hardy-Weinberg equilibrium p-value:
For each SNP, the exact p-value for the Hardy-Weinberg equilibrium (HWE) test is calculated. More details on the test can be found
here. This test is performed only on the founders. In the case of the X
chromosome, only female founders are used. Any marker with an HWE p-value less than a given threshold is flagged.
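For illustration, one widely used formulation of the exact HWE test conditions on the observed allele counts and sums the probabilities of all heterozygote counts that are no more likely than the observed one. The sketch below implements that formulation; whether GAINQC uses exactly this variant is an assumption, and the function name is hypothetical. The three genotype counts are assumed to have been tallied already from the appropriate samples (founders, or female founders on the X chromosome).

```python
def hwe_exact_pvalue(n_het: int, n_hom1: int, n_hom2: int) -> float:
    """Exact HWE p-value from heterozygote and the two homozygote counts."""
    n = n_het + n_hom1 + n_hom2
    if n == 0:
        return 1.0
    n_rare = 2 * min(n_hom1, n_hom2) + n_het      # copies of the rarer allele

    # probs[h] = relative probability of observing h heterozygotes
    probs = [0.0] * (n_rare + 1)

    # Start at the most likely heterozygote count with the right parity.
    mid = n_rare * (2 * n - n_rare) // (2 * n)
    if mid % 2 != n_rare % 2:
        mid += 1
    probs[mid] = 1.0
    total = 1.0

    # Recurrence towards fewer heterozygotes.
    het, hom_r = mid, (n_rare - mid) // 2
    hom_c = n - het - hom_r
    while het >= 2:
        probs[het - 2] = probs[het] * het * (het - 1) / (4.0 * (hom_r + 1) * (hom_c + 1))
        total += probs[het - 2]
        het, hom_r, hom_c = het - 2, hom_r + 1, hom_c + 1

    # Recurrence towards more heterozygotes.
    het, hom_r = mid, (n_rare - mid) // 2
    hom_c = n - het - hom_r
    while het <= n_rare - 2:
        probs[het + 2] = probs[het] * 4.0 * hom_r * hom_c / ((het + 2) * (het + 1))
        total += probs[het + 2]
        het, hom_r, hom_c = het + 2, hom_r - 1, hom_c - 1

    # Sum the probabilities of outcomes no more likely than the observed one.
    p_obs = probs[n_het]
    p_value = sum(p for p in probs if p <= p_obs) / total
    return min(1.0, p_value)
```

A marker would then be flagged when the returned value is below the chosen HWE threshold.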
Duplicate mismatches:
If there are duplicate samples in the study, GAINQC counts the number of mismatches among the duplicate genotypes of
each marker. If there are only 2 duplicates, a mismatch is identified if the genotypes are different (and neither one is
missing). If there are more than 2 duplicates (with the same sample ID), the number of mismatches is counted by
constructing a majority-vote genotype and counting the number of duplicate samples that do not have that genotype. This
process is repeated for all groups of duplicate samples. The total number of mismatches is then the sum of the
mismatches over all the duplicate sample groups for each marker. Markers with more than a specified number of mismatch
errors are flagged.
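The counting rule can be sketched as below. Two details are assumptions of this example rather than documented behaviour: missing genotypes are ignored when groups of more than 2 duplicates are compared, and ties in the majority vote are broken arbitrarily (the mismatch count is the same whichever tied genotype is chosen). The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence


def group_mismatches(genotypes: Sequence[Optional[int]]) -> int:
    """Mismatches at one marker within one group of duplicate samples."""
    called = [g for g in genotypes if g is not None]
    if len(called) < 2:
        return 0                      # nothing to compare against
    # Size of the majority-vote genotype class; everyone else is a mismatch.
    majority_count = Counter(called).most_common(1)[0][1]
    return len(called) - majority_count


def marker_duplicate_mismatches(groups: Iterable[Sequence[Optional[int]]]) -> int:
    """Total mismatches at one marker, summed over all duplicate groups."""
    return sum(group_mismatches(g) for g in groups)
```

With exactly two called duplicates the majority class has size 1 when they differ, so the function returns 1, which matches the two-duplicate rule described above.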
Mendelian inconsistencies:
The number and rate of Mendelian inconsistencies per marker are also computed. These statistics are computed only if
there is parent-offspring data present in the study. For each marker, the number of Mendelian inconsistencies is counted
as the number of parent-offspring groups (trios or pairs) that have a Mendelian inconsistency. The rate is calculated as the
ratio of the number of pairs or trios with an error to the total number of pairs or trios tested. Trios and pairs are not
included if missing data makes testing impossible. Markers with a large number (or rate) of Mendelian
inconsistencies are flagged.
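A minimal sketch of the per-marker count and rate is shown below. It assumes autosomal, diploid genotypes coded as the number of copies of one allele (0, 1 or 2, None for missing) and treats a family with one missing parent as a parent-offspring pair; sex chromosome handling is not covered. All names are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Alleles (as 0/1 copies of the coded allele) that each genotype can transmit.
TRANSMITTABLE = {0: {0}, 1: {0, 1}, 2: {1}}

Family = Tuple[Optional[int], Optional[int], Optional[int]]  # (child, father, mother)


def mendel_check(child: Optional[int],
                 father: Optional[int],
                 mother: Optional[int]) -> Optional[bool]:
    """True if inconsistent, False if consistent, None if untestable."""
    if child is None or (father is None and mother is None):
        return None
    # A missing parent could have transmitted either allele.
    from_father = TRANSMITTABLE[father] if father is not None else {0, 1}
    from_mother = TRANSMITTABLE[mother] if mother is not None else {0, 1}
    consistent = any(f + m == child for f in from_father for m in from_mother)
    return not consistent


def mendel_stats(families: Iterable[Family]) -> Tuple[int, int, float]:
    """Return (inconsistencies, families tested, inconsistency rate)."""
    results = [mendel_check(*fam) for fam in families]
    tested = [r for r in results if r is not None]
    errors = sum(tested)
    rate = errors / len(tested) if tested else 0.0
    return errors, len(tested), rate
```

Families that cannot be tested are left out of both the numerator and the denominator of the rate.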
Additional Marker Statistics
These statistics are calculated per marker only if they are requested. They also require some extra data to be present,
as noted in the description of each statistic.
Transmission Disequilibrium Test (TDT):
If there are any trios (or parent-offspring pairs) present in the data, GAINQC can be asked to perform a transmission
disequilibrium test (TDT). More information on the TDT can be found here. For each marker,
the two alleles, the number of times each allele is transmitted from a heterozygous parent to an offspring, the number of trios (or pairs) used,
the TDT chi-squared statistic and the corresponding p-value are calculated. These are output in the SNP information file along with
other SNP statistics.
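As an illustration, the classical TDT statistic compares the number of times each allele is transmitted from heterozygous parents, using chi-squared = (b - c)^2 / (b + c) with 1 degree of freedom. The sketch below is restricted to complete trios with genotypes coded as copies of allele B (0, 1 or 2, None for missing); how GAINQC handles parent-offspring pairs is not described here, so pairs are skipped, as are trios with a Mendelian inconsistency. The function names are hypothetical.

```python
import math
from typing import Iterable, Optional, Tuple

Trio = Tuple[Optional[int], Optional[int], Optional[int]]  # (child, father, mother)


def tdt_counts(trios: Iterable[Trio]) -> Tuple[int, int, int]:
    """Return (A transmissions, B transmissions, trios used) from het parents."""
    trans_a = trans_b = used = 0
    for child, father, mother in trios:
        if child is None or father is None or mother is None:
            continue                          # complete trios only in this sketch
        n_het = (father == 1) + (mother == 1)
        if n_het == 0:
            continue                          # no informative parent
        if n_het == 2:
            b_from_het = child                # both transmissions are informative
        else:
            hom_parent = mother if father == 1 else father
            b_from_het = child - hom_parent // 2   # remove the homozygote's allele
        if not 0 <= b_from_het <= n_het:
            continue                          # Mendelian inconsistency
        trans_b += b_from_het
        trans_a += n_het - b_from_het
        used += 1
    return trans_a, trans_b, used


def tdt_test(trans_a: int, trans_b: int) -> Tuple[float, float]:
    """Return (chi-squared statistic, p-value) for the 1-df TDT."""
    n = trans_a + trans_b
    if n == 0:
        return 0.0, 1.0
    chi2 = (trans_a - trans_b) ** 2 / n
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

The counts from tdt_counts can be passed directly to tdt_test, for example tdt_test(*tdt_counts(trios)[:2]).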
Association Tests:
To perform association tests, samples must be labeled. This labeling can be done using the pedigree
file. GAINQC performs an allelic association test with n-1 degrees of freedom, where n is the number of distinct labels. More information
on the association test can be found here. For each marker, the association test chi-squared
statistic, the corresponding p-value and the number of strata used are output in the SNP information file.
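One way to obtain a test with n-1 degrees of freedom is a Pearson chi-squared test on the 2 x n table of allele counts by label. The sketch below assumes that this is the intended test, which is not confirmed by the text above, and assumes genotypes coded as copies of allele B (0, 1 or 2, None for missing). It returns the statistic, the degrees of freedom and the number of strata; the p-value is the upper tail of a chi-squared distribution with that many degrees of freedom (for instance scipy.stats.chi2.sf(chi2, df) if SciPy is available). The function name is hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Optional, Sequence, Tuple


def allelic_association(genotypes: Sequence[Optional[int]],
                        labels: Sequence[Optional[Hashable]]) -> Tuple[float, int, int]:
    """Return (chi-squared statistic, degrees of freedom, strata used)."""
    # label -> [count of allele A, count of allele B]
    counts = defaultdict(lambda: [0, 0])
    for g, label in zip(genotypes, labels):
        if g is None or label is None:
            continue                      # unlabeled or missing: not used
        counts[label][0] += 2 - g
        counts[label][1] += g

    strata = len(counts)
    total_a = sum(c[0] for c in counts.values())
    total_b = sum(c[1] for c in counts.values())
    total = total_a + total_b
    if strata < 2 or total_a == 0 or total_b == 0:
        return 0.0, max(strata - 1, 0), strata   # test is uninformative

    chi2 = 0.0
    for a, b in counts.values():
        row_total = a + b
        for observed, col_total in ((a, total_a), (b, total_b)):
            expected = row_total * col_total / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2, strata - 1, strata
```

With two labels (for example cases and controls) this reduces to the familiar 1-df allelic test.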
Average quality score:
The average quality score of the sample genotypes is computed using the genotypes before thresholding, i.e. all the sample's genotypes
are used - including the ones that have a quality score lower than the threshold. A histogram of average quality scores is included with the
sample histograms. This is computed only if quality scores are available.
NOTE: None of the additional marker statistics are used for flagging markers. They are only provided as additional information
for each marker.