GAINQC Output Files
GAINQC displays the status of the program on the screen. It also creates output files containing quality
information for samples and SNPs. If a relationship check is performed, it additionally generates a file
containing the misspecified relationship pairs and the estimated relationships for these pairs. The screen output and
the various output files from the QC program are described here.
Screen Output & Log Files
Screen Output
The screen output generated by GAINQC contains basic information about the run of the program. It lists the switches
that are in effect and the default settings for the various thresholds, followed by the settings that have been changed.
After the information on the current run, it displays a 1-3 line summary of the various input files that were loaded. It then shows
details of the first pass of the quality control, which consists of sample-based checks, followed by a summary of the actions taken in that
pass. Subsequently, the second pass information is displayed; it contains a summary of the SNP-based tests that were
performed. During both passes, the progress of the program is indicated using a simple progress monitor. An example screen output
can be seen here.
Program Log File: Error and Warning Messages
GAINQC creates a log file that records all error and warning messages from the program, along with useful
intermediate messages indicating the progress of the program. The size of the log file can be controlled using the command
line --logSize switch. In addition, messages of any one kind are logged only 1000 times; after that, the log reports that the "warning/error
message will no longer be logged". An example log file is attached here.
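The per-kind cap on log messages can be pictured with the short Python sketch below. It is only an illustration of the behaviour described above, not GAINQC's own code: the class name CappedLogger, the in-memory list standing in for the log file, and the choice to write the notice on the first suppressed message are all assumptions.

```python
from collections import Counter


class CappedLogger:
    """Log each kind of message at most `limit` times (hypothetical helper)."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.counts = Counter()  # number of messages seen per kind
        self.lines = []          # stand-in for the log file

    def log(self, kind, text):
        self.counts[kind] += 1
        seen = self.counts[kind]
        if seen <= self.limit:
            self.lines.append(f"{kind}: {text}")
        elif seen == self.limit + 1:
            # Assumption: the notice is written once, on the first suppressed message.
            self.lines.append(f"{kind}: message will no longer be logged")
        # Any further message of this kind is dropped silently.


# A small limit keeps the demonstration short.
logger = CappedLogger(limit=3)
for i in range(10):
    logger.log("WARNING", f"problem with marker {i}")
print(len(logger.lines))
```

With a limit of 3, the ten warnings above leave three messages and one notice in the log.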
Quality Statistics Files
Sample Statistics File
The sample statistics file contains the statistics that were computed for all the samples during the first pass of the program. It includes
information such as genotyping completeness, heterozygosity, proportion of markers with Mendelian inconsistencies, log odds of being male,
the log likelihood of the sample genotype configuration and average quality score. It also includes a column indicating whether the sample failed the
QC analysis and a column giving the reason for the failure.
Included below is a toy sample information file:
<contents of toy.sampleinfo>
SampleId Completeness Heterozygosity MendelErrors SexOdds LogL AvgQualityScore Flagged Comments
Samp1 0.9993 0.3715 0.0000 -2129.893814 -1325.334380 0.0000 PASSED -
Samp2 1.0000 0.3327 0.0000 775.551252 -828.696931 0.0000 PASSED -
Samp3 0.9993 0.3823 0.0000 -2242.627123 -1260.070164 0.0000 PASSED -
Samp4 0.9993 0.3918 0.0000 -2307.924867 -1274.799311 0.0000 PASSED -
Samp5 1.0000 0.0013 0.0000 789.111679 -820.437656 0.0000 FAILED LOW_HETEROZYGOSITY[ABSOLUTE]
Samp6 1.0000 0.3841 0.0000 -2243.299415 -1293.133159 0.0000 PASSED -
Samp8 0.7993 0.0013 0.0000 836.827641 -868.299592 0.0000 FAILED TOO_FEW_GENOTYPES[ABSOLUTE]
Samp9 1.0000 0.4077 0.0000 -2441.868670 -1254.438732 0.0000 FAILED HIGH_HETEROZYGOSITY[ABSOLUTE]
<end of toy.sampleinfo>
In this toy sample information file, samples Samp5 and Samp9 failed the QC analysis because of low and high absolute heterozygosity, respectively,
whereas sample Samp8 failed due to a low genotype call rate.
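A file in this layout is easy to post-process. The Python sketch below reads a few lines of the toy file and collects the failed samples with their reasons. It assumes the columns are separated by whitespace and that the first line is the header; the function names read_stats and failed_samples are made up for this illustration and are not part of GAINQC.

```python
import io


def read_stats(handle):
    """Parse a whitespace-delimited statistics file into a list of dicts."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # Split into at most len(header) fields so that a comment
        # containing spaces stays in the last column.
        fields = line.split(None, len(header) - 1)
        rows.append(dict(zip(header, fields)))
    return rows


def failed_samples(rows):
    """Map each failed sample id to the reason given in the Comments column."""
    return {r["SampleId"]: r["Comments"] for r in rows if r["Flagged"] == "FAILED"}


# A few lines copied from the toy sample information file.
TOY = """\
SampleId Completeness Heterozygosity MendelErrors SexOdds LogL AvgQualityScore Flagged Comments
Samp1 0.9993 0.3715 0.0000 -2129.893814 -1325.334380 0.0000 PASSED -
Samp5 1.0000 0.0013 0.0000 789.111679 -820.437656 0.0000 FAILED LOW_HETEROZYGOSITY[ABSOLUTE]
Samp8 0.7993 0.0013 0.0000 836.827641 -868.299592 0.0000 FAILED TOO_FEW_GENOTYPES[ABSOLUTE]
Samp9 1.0000 0.4077 0.0000 -2441.868670 -1254.438732 0.0000 FAILED HIGH_HETEROZYGOSITY[ABSOLUTE]
"""

rows = read_stats(io.StringIO(TOY))
for sample, reason in failed_samples(rows).items():
    print(sample, reason)
```

To process a real run, open the sample statistics file and pass the file handle to read_stats in place of the io.StringIO object.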
Mendelian Inconsistencies Log
If parent-offspring trios (or pairs) are present in the study sample, a check for Mendelian inconsistencies is performed.
The Mendelian inconsistencies found across all markers are logged in a Mendelian inconsistencies log file. This file contains the name of
the marker and the sample IDs (parent(s) and offspring) where the Mendelian error was found. Information on the chromosome of the marker and the sexes
of the samples is also provided.
A toy Mendelian errors log file is given below:
<contents of toy.mendelLog>
<end of toy.mendelLog>
SNP Statistics File
In the same spirit as the sample statistics file, the GAINQC program creates a SNP statistics file that contains all the statistics that were calculated
during the second pass, when quality control is done for SNPs using the samples that passed the first pass. In addition to SNP statistics such as MAF, completeness,
Mendelian error rate, odds of being an autosomal marker (against X-linked), number of mismatches in duplicates and average quality score, this file also contains
information on the TDT (if any trios were part of the sample) and the association test (if samples were assigned labels). Finally, there are two columns
indicating whether the SNP passed QC and, if it failed, the reason for failing QC.
A toy SNP statistics file is shown below:
<contents of toy.snpinfo>
Marker MinorAllele MAF Completeness HWEPvalue MendelErrors ImpliedMendelErrorRate Mismatches XlinkedOdds AvgQualScore Flagged Comments
rs16998050 T 0.138522 0.997788 0.618391 0 0.000000 0 -66.663183 0.7832 PASSED -
rs12010301 C 0.296053 1.000000 1 0 0.000000 0 -93.084500 0.7341 PASSED -
rs1921396 A 0.303430 0.997788 0.00749048 0 0.000000 0 -83.828533 0.7656 PASSED -
rs5913791 T 0.198946 0.997788 0.590222 0 0.000000 0 -10.654893 0.6168 PASSED -
rs17474852 G 0.109788 0.995575 1 0 0.000000 0 -66.257458 0.6885 PASSED -
rs1091272 C 0.402632 1.000000 0.726674 0 0.000000 0 -92.660182 0.8498 PASSED -
rs6611060 C 0.197368 1.000000 1 0 0.000000 0 -74.651094 0.7402 PASSED -
rs674007 C 0.269737 1.000000 1 0 0.000000 0 -87.456307 0.7555 PASSED -
rs11795511 T 0.192358 0.997788 0.287873 0 0.000000 0 -63.058974 0.7903 PASSED -
rs2205513 C 0.305263 1.000000 0.182062 0 0.000000 0 -85.541103 0.7167 PASSED -
<end of toy.snpinfo>
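The SNP statistics file can be summarised with a few lines of Python, for example to build a list of markers that passed QC or to find the marker with the smallest HWE p-value. The sketch below assumes whitespace-separated columns with a header line, as in the toy file; read_snp_stats and the NUMERIC tuple are names invented for this example.

```python
import io

# Columns converted to float in this sketch; the rest stay as text.
NUMERIC = ("MAF", "Completeness", "HWEPvalue")


def read_snp_stats(handle):
    """Return a dict mapping marker name to its row of statistics."""
    header = handle.readline().split()
    table = {}
    for line in handle:
        fields = line.strip().split(None, len(header) - 1)
        if not fields:
            continue
        row = dict(zip(header, fields))
        for name in NUMERIC:
            row[name] = float(row[name])
        table[row["Marker"]] = row
    return table


# The header and first three markers of the toy SNP statistics file.
TOY = """\
Marker MinorAllele MAF Completeness HWEPvalue MendelErrors ImpliedMendelErrorRate Mismatches XlinkedOdds AvgQualScore Flagged Comments
rs16998050 T 0.138522 0.997788 0.618391 0 0.000000 0 -66.663183 0.7832 PASSED -
rs12010301 C 0.296053 1.000000 1 0 0.000000 0 -93.084500 0.7341 PASSED -
rs1921396 A 0.303430 0.997788 0.00749048 0 0.000000 0 -83.828533 0.7656 PASSED -
"""

snps = read_snp_stats(io.StringIO(TOY))
passed = [marker for marker, row in snps.items() if row["Flagged"] == "PASSED"]
lowest_hwe = min(snps, key=lambda marker: snps[marker]["HWEPvalue"])
print(passed)
print(lowest_hwe, snps[lowest_hwe]["HWEPvalue"])
```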
Markers Not Assessed File
During the second pass of the GAINQC program, the QC step is skipped for some markers for reasons such as the SNP not being present in the SNP
information file or an information mismatch. GAINQC logs such issues in a file called the not assessed file. This file contains the IDs of the SNPs on
which QC was not performed. It also includes a reason for skipping the QC step for each SNP in this file.
A toy SNPs not assessed file is shown below:
<contents of toy.notassessed>
rs1927743
rs17727
rs719216
rs96024
<end of toy.notassessed>
Relation Information File: Putative and Inferred Relation Mismatch
If the study sample contains pairs of relatives (or simply to check that all samples are unrelated), GAINQC can be instructed to perform a
relationship check analysis. This analysis estimates IBD (Identical By Descent) probabilities, derives kinship from these probabilities and
compares the results to the expected values. It also flags relation pairs that are too far away from other pairs within the same relationship group. To provide
information about the pairs that have been flagged by this analysis, GAINQC creates a file with the sample IDs of the samples
in the pair, the estimated and expected kinships and the putative and estimated relationships. Only close relationships are estimated (Parent-Offspring,
Siblings, Unrelated, Duplicates/MZ, Half-siblings); all other possible relations are grouped together as RELATED. The program also creates histograms
of the distribution of the kinship within each relationship group.
A toy relationship information file is shown here:
<contents of toy.relationinfo>
<end of toy.relationinfo>
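The link between IBD probabilities and kinship can be made concrete with a small Python sketch. It uses the textbook relation for non-inbred pairs, kinship = P(IBD=1)/4 + P(IBD=2)/2, together with the standard expected IBD sharing for each relationship. Whether GAINQC computes kinship in exactly this way is an assumption, and the tolerance of 0.05 used for flagging is an arbitrary value chosen for the example, not a GAINQC default.

```python
# Expected (P(IBD=0), P(IBD=1), P(IBD=2)) for each relationship group.
EXPECTED_IBD = {
    "Duplicates/MZ": (0.0, 0.0, 1.0),
    "Parent-Offspring": (0.0, 1.0, 0.0),
    "Siblings": (0.25, 0.5, 0.25),
    "Half-siblings": (0.5, 0.5, 0.0),
    "Unrelated": (1.0, 0.0, 0.0),
}


def kinship(z0, z1, z2):
    """Kinship coefficient of a non-inbred pair from its IBD probabilities."""
    return 0.25 * z1 + 0.5 * z2


def is_flagged(putative, estimated_ibd, tolerance=0.05):
    """Flag a pair whose estimated kinship is far from the expected value.

    `tolerance` is a hypothetical threshold chosen for this illustration.
    """
    expected = kinship(*EXPECTED_IBD[putative])
    return abs(kinship(*estimated_ibd) - expected) > tolerance


for name, ibd in EXPECTED_IBD.items():
    print(name, kinship(*ibd))

# A pair recorded as siblings whose genotypes look unrelated is flagged.
print(is_flagged("Siblings", (0.98, 0.02, 0.0)))
```

Parent-offspring and sibling pairs share the same expected kinship of 0.25, which is why the individual IBD probabilities, and not the kinship alone, are needed to tell these two relationships apart.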
External Genotypes Comparison
The --hapmap option can be used to compare the genotype calls in the genotype file (from the study) to genotype calls for the same samples and
markers obtained from an external source. As the name indicates, the external source is expected to be HapMap. This
option generates two output log files:
- A log of allele flips for markers between the study genotypes and the external genotypes
- A log of sample genotype mismatches between the two genotype sources
In addition to these output files, this option prints a text histogram on the screen showing the genotype mismatches between the two sources.
An example of each type of log file is shown below:
<contents of toy-hapmap.log>
<end of toy-hapmap.log>
<contents of toy-vs-reference-sample.txt>
<end of toy-vs-reference-sample.txt>
Data Output Files: Post QC Filtered Data
Once GAINQC has performed all its analyses, one might want a filtered dataset containing only the SNPs and samples that passed quality control. This
is accomplished using the --output option. If this option is specified, the program outputs three data files:
- the filtered genotypes file: contains the genotypes (as read in) for all the markers and samples that passed quality control
- the filtered and thresholded genotypes file: contains the genotypes for all the markers and samples that passed quality control
(the genotypes with low quality scores are output as missing)
- the quality score file: contains the quality scores for all the genotypes output in the genotype files
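The difference between the filtered file and the filtered and thresholded file can be illustrated with a short Python sketch. The genotype strings, the missing-genotype code "0/0" and the cutoff of 0.5 are placeholders picked for this example; the actual encodings and threshold depend on the input format and on the settings of the run.

```python
def threshold_genotypes(genotypes, scores, min_score, missing="0/0"):
    """Replace every genotype whose quality score is below min_score.

    `missing` is a hypothetical code for a missing genotype.
    """
    if len(genotypes) != len(scores):
        raise ValueError("genotypes and scores must have the same length")
    return [g if s >= min_score else missing for g, s in zip(genotypes, scores)]


# One marker, four samples that passed quality control.
genotypes = ["A/A", "A/G", "G/G", "A/G"]
scores = [0.91, 0.42, 0.88, 0.67]

filtered = list(genotypes)  # genotypes as read in
thresholded = threshold_genotypes(genotypes, scores, min_score=0.5)
print(filtered)
print(thresholded)
```

Here only the second genotype falls below the cutoff, so it is the only one reported as missing in the thresholded output.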
All the text output files have been described above; the graphical output files with the histograms are described here.