GAINQC First Pass: Sample Checks
GAINQC operates in two passes. In the first pass, it performs sample-based checks and flags samples that fail to meet the
user-set thresholds for sample quality. In the second pass, the flagged samples are excluded and quality checks are performed on
SNPs. This section describes, in moderate detail, the sample-based checks performed by GAINQC.
NOTE: All the tests are performed on genotypes after thresholding for quality score, if scores are provided. Genotypes
with a quality score below the threshold are blanked out (treated as missing).
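As an illustration of the thresholding step, the following is a minimal Python sketch, not GAINQC's actual code. The function name blank_low_quality, the use of None for a missing genotype, and the two-letter genotype strings are assumptions made for this example.

```python
from typing import List, Optional, Sequence


def blank_low_quality(
    genotypes: Sequence[Optional[str]],
    scores: Optional[Sequence[float]],
    threshold: float,
) -> List[Optional[str]]:
    """Return a copy of the genotypes with low-scoring calls set to None (missing)."""
    if scores is None:
        # No quality scores were provided, so the genotypes are used as they are.
        return list(genotypes)
    # A call is kept only if its score is not below the threshold.
    return [g if s >= threshold else None for g, s in zip(genotypes, scores)]


calls = blank_low_quality(["AA", "AB", "BB"], [0.95, 0.20, 0.50], threshold=0.50)
```

In this sketch the second call is blanked because its score (0.20) is below the threshold, while a score exactly equal to the threshold is kept.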
Individual Sample Checks
Genotyping completeness:
The genotyping completeness rate is computed for all samples. For each sample, it is the ratio of the number of
markers with a non-missing genotype call to the total number of markers. The sample genotyping completeness is checked both absolutely and
relative to other samples. Any sample whose absolute genotyping completeness is too low is flagged. Samples whose genotyping
completeness is too low or too high relative to other samples (outliers) are also flagged.
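The completeness computation and the two kinds of check can be sketched in Python as follows. This is an illustrative sketch only: the text above does not say how outliers are defined, so the rule used here (more than n_sd standard deviations from the mean across samples) is an assumption, as are the names completeness, flag_samples, abs_min and n_sd.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional, Sequence


def completeness(genotypes: Sequence[Optional[str]]) -> float:
    """Fraction of markers with a non-missing call (None means missing)."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def flag_samples(
    values: Dict[str, float], abs_min: float, n_sd: float = 3.0
) -> Dict[str, List[str]]:
    """Flag samples that are too low absolutely or are outliers among all samples."""
    mu = mean(values.values())
    sd = pstdev(values.values())
    flagged: Dict[str, List[str]] = {}
    for name, v in values.items():
        reasons = []
        if v < abs_min:
            reasons.append("absolute-low")
        # Assumed outlier rule: distance from the mean in standard deviations.
        if sd > 0 and abs(v - mu) > n_sd * sd:
            reasons.append("relative-outlier")
        if reasons:
            flagged[name] = reasons
    return flagged


rates = {"s1": 0.99, "s2": 0.98, "s3": 0.99, "s4": 0.60}
flags = flag_samples(rates, abs_min=0.90, n_sd=1.5)
```

With only four samples a small n_sd is used so that the outlier rule can trigger at all; with realistic sample counts a larger value would be more typical.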
Heterozygosity:
The heterozygosity of each sample is computed as the ratio of the number of heterozygous genotype calls to the total number of
non-missing calls. As with genotyping completeness, samples are flagged if their heterozygosity is too low or too high, both on
the absolute and the relative scale.
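A minimal Python sketch of the heterozygosity ratio is given below. It assumes genotypes are two-character strings such as "AG" with None for a missing call; the names heterozygosity and is_out_of_range and the bounds used are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def heterozygosity(genotypes: Sequence[Optional[str]]) -> Optional[float]:
    """Heterozygous calls divided by non-missing calls; None if nothing was called."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    # A call is heterozygous when its two alleles differ.
    het = sum(1 for g in called if g[0] != g[1])
    return het / len(called)


def is_out_of_range(value: float, low: float, high: float) -> bool:
    """Absolute check: flag a value that is too low or too high."""
    return value < low or value > high


h = heterozygosity(["AA", "AG", None, "GG", "AG"])
```

Here the missing call is excluded from the denominator, so the sample has 2 heterozygous calls out of 4 non-missing calls.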
Mendelian inconsistencies:
If trios are present in the study sample, GAINQC computes, for each sample, the number of markers at which the sample is part of an inconsistent trio.
Samples with too many Mendelian inconsistencies are flagged.
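The trio check can be illustrated with the Python sketch below. It assumes autosomal markers, two-character genotype strings, and that markers with any missing call in the trio are skipped; none of these details are stated above, and the function names are hypothetical.

```python
from typing import Optional, Sequence


def trio_is_consistent(father: str, mother: str, child: str) -> bool:
    """True if the child can have received one allele from each parent."""
    a, b = child[0], child[1]
    return (a in father and b in mother) or (b in father and a in mother)


def count_inconsistencies(
    father: Sequence[Optional[str]],
    mother: Sequence[Optional[str]],
    child: Sequence[Optional[str]],
) -> int:
    """Number of markers at which the trio is Mendelian-inconsistent."""
    count = 0
    for f, m, c in zip(father, mother, child):
        if f is None or m is None or c is None:
            continue  # assumption: incomplete trios are not counted
        if not trio_is_consistent(f, m, c):
            count += 1
    return count


n_bad = count_inconsistencies(
    ["AA", "AG", "GG", None],
    ["AA", "GG", "GG", "AG"],
    ["AG", "AG", "AA", "AA"],
)
```

In this example the first and third markers are inconsistent, the second is consistent, and the fourth is skipped. Following the description above, the resulting count would be attributed to each of the three samples in the trio.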
Sample Sex Odds:
For each sample, the odds of the sample being male as opposed to female are calculated using the X-linked markers. Negative values
(the odds are evidently reported on a log scale) indicate that the sample is more likely to be female, whereas positive values indicate that
the sample is more likely to be male. Samples whose sex odds contradict their putative sex with enough confidence (beyond a threshold) are flagged.
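The exact model behind the sex odds is not described here. The Python sketch below shows one plausible way such a score could be built, and every modelling choice in it is an assumption: genotypes are coded "AA", "AB" or "BB"; females follow Hardy-Weinberg proportions at X-linked markers; males carry a single X, so a heterozygous call in a male can only arise from a genotyping error with probability error_rate; and the score is a sum of log-odds over markers.

```python
import math
from typing import Optional, Sequence


def sex_log_odds(
    genotypes: Sequence[Optional[str]],
    freqs: Sequence[float],
    error_rate: float = 0.01,
) -> float:
    """Log-odds of male vs. female; freqs[i] is the frequency of allele A (0 < p < 1)."""
    total = 0.0
    for g, p in zip(genotypes, freqs):
        if g is None:
            continue  # missing calls carry no information
        q = 1.0 - p
        if g == "AB":
            p_male, p_female = error_rate, 2.0 * p * q
        elif g == "AA":
            p_male, p_female = (1.0 - error_rate) * p, p * p
        else:  # "BB"
            p_male, p_female = (1.0 - error_rate) * q, q * q
        total += math.log(p_male / p_female)
    return total


def sex_mismatch(log_odds: float, putative_sex: str, threshold: float) -> bool:
    """Flag only when the evidence against the putative sex exceeds the threshold."""
    if putative_sex == "M":
        return log_odds < -threshold
    return log_odds > threshold


male_like = sex_log_odds(["AA", "BB", "AA", "BB"], [0.5] * 4)
female_like = sex_log_odds(["AB", "AB", "AA", "AB"], [0.5] * 4)
```

Under these assumptions a sample with no heterozygous X-linked calls gets a positive score, and a sample with several heterozygous calls gets a strongly negative one.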
Individual Sample Statistics
These statistics are calculated per sample in the first pass of GAINQC but are not used to flag any samples. They are provided as additional information.
Log-Likelihood of the genotypes:
The likelihood of observing the genotypes of the sample is calculated using the allele frequencies computed for all the SNPs.
The allele frequencies are computed using all the available sample data. A histogram of log-likelihoods (log of the likelihood)
is generated and included with the sample histograms.
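One way to compute such a log-likelihood is sketched below in Python. The text above only says that allele frequencies are used; the assumption that genotype probabilities follow Hardy-Weinberg proportions, the "AA"/"AB"/"BB" coding, the skipping of monomorphic markers, and the function names are all choices made for this example.

```python
import math
from typing import Optional, Sequence


def allele_frequency(column: Sequence[Optional[str]]) -> float:
    """Frequency of allele A at one SNP, using all samples with a call."""
    called = [g for g in column if g is not None]
    return sum(g.count("A") for g in called) / (2 * len(called))


def sample_log_likelihood(
    genotypes: Sequence[Optional[str]], freqs: Sequence[float]
) -> float:
    """Sum over markers of log P(genotype), assuming Hardy-Weinberg proportions."""
    total = 0.0
    for g, p in zip(genotypes, freqs):
        if g is None or not 0.0 < p < 1.0:
            continue  # skip missing calls and monomorphic markers
        q = 1.0 - p
        prob = {"AA": p * p, "AB": 2.0 * p * q, "BB": q * q}[g]
        total += math.log(prob)
    return total


p_snp = allele_frequency(["AA", "AB", "BB", None])
ll = sample_log_likelihood(["AA", "AB", None], [0.5, 0.5, 0.5])
```

In the example the sample's log-likelihood is log(0.25) + log(0.5); samples with many rare genotypes would receive markedly lower values and so stand out in the histogram.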
Average quality score:
The average quality score of the sample genotypes is computed using the genotypes before thresholding, i.e. all the sample's genotypes
are used, including the ones that have a quality score lower than the threshold. A histogram of average quality scores is included with the
sample histograms. This is computed only if quality scores are available.
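A short Python sketch of this statistic follows. The function name average_quality and the convention of returning None when no scores are available are assumptions made for illustration.

```python
from typing import Optional, Sequence


def average_quality(scores: Optional[Sequence[float]]) -> Optional[float]:
    """Mean quality score over all of a sample's calls, before any thresholding."""
    if not scores:
        # Quality scores were not provided, so the statistic is not computed.
        return None
    return sum(scores) / len(scores)


avg = average_quality([0.9, 0.2, 0.7])
```

Note that the 0.2 score contributes to the average even if that genotype would be blanked out by the quality threshold in the other checks.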