University of Michigan Center for Statistical Genetics

GAINQC Output Files

GAINQC displays the status of the program on the screen. It also creates output files containing quality information for samples and SNPs. If a relationship check is performed, it additionally generates a file listing the misspecified relationship pairs and the estimated relationships for these pairs. The screen output and the various output files from the QC program are described here.

Screen Output & Log Files

Screen Output

The screen output generated by GAINQC contains basic information about the run of the program. It lists the switches that are in effect and the default settings for the various thresholds, followed by the settings that have been changed. After the information on the current run, it displays a 1-3 line summary of the various input files that were loaded. It then shows details of the first pass of the quality control, which consists of sample-based checks, and a summary of the actions taken in that pass. Subsequently, the second pass information is displayed, which contains a summary of the SNP-based tests that were performed. During both passes, the progress of the program is indicated using a simple progress monitor. An example screen output can be seen here.

Program Log File: Error and Warning Messages

GAINQC creates a log file for reporting error and warning messages. All messages from the program are written to this log file, along with useful intermediate messages indicating the progress of the program. The size of the log file can be controlled using the --logSize command line switch. In addition, messages of any one kind are logged at most 1000 times; after that, the program reports that the "warning/error message will no longer be logged". An example log file is attached here.
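
The per-kind cap on log messages can be pictured with a small sketch. This is not GAINQC's code: the class name CappedLog, the message kind MISSING_GENOTYPE and the exact wording of the logged lines are assumptions made for illustration, and only the limit of 1000 and the quoted cut-off notice come from the description above.

```python
from collections import Counter


class CappedLog:
    """Keeps at most `limit` messages per kind (hypothetical helper, not GAINQC code)."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.counts = Counter()  # messages seen so far, per kind
        self.lines = []          # what would end up in the log file

    def log(self, kind, message):
        self.counts[kind] += 1
        seen = self.counts[kind]
        if seen <= self.limit:
            self.lines.append(f"{kind}: {message}")
        elif seen == self.limit + 1:
            # Announce the cut-off once, then stay silent for this kind.
            self.lines.append(f"{kind}: warning/error message will no longer be logged")


log = CappedLog(limit=3)  # small limit so the effect is easy to see
for i in range(10):
    log.log("MISSING_GENOTYPE", f"marker {i}")  # hypothetical message kind
print("\n".join(log.lines))
```

With the limit set to 3, the first three messages are kept, the fourth triggers the single cut-off notice, and the remaining ones are dropped.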

Quality Statistics Files

Sample Statistics File

The sample statistics file contains the statistics that were computed for all the samples during the first pass of the program. It includes genotyping completeness, heterozygosity, the proportion of markers with Mendelian inconsistencies, the log odds of the sample being male, the log likelihood of the sample genotype configuration and the average quality score. It also includes a column indicating whether the sample failed the QC analysis and, if so, a column giving the reason for the failure.

Included below is a toy sample information file:

<contents of toy.sampleinfo>
SampleId	Completeness	Heterozygosity	MendelErrors	SexOdds	LogL	AvgQualityScore	Flagged	Comments
Samp1	0.9993	0.3715	0.0000	-2129.893814	-1325.334380	0.0000	PASSED	-
Samp2	1.0000	0.3327	0.0000	775.551252	-828.696931	0.0000	PASSED	-
Samp3	0.9993	0.3823	0.0000	-2242.627123	-1260.070164	0.0000	PASSED	-
Samp4	0.9993	0.3918	0.0000	-2307.924867	-1274.799311	0.0000	PASSED	-
Samp5	1.0000	0.0013	0.0000	789.111679	-820.437656	0.0000	FAILED	LOW_HETEROZYGOSITY[ABSOLUTE]
Samp6	1.0000	0.3841	0.0000	-2243.299415	-1293.133159	0.0000	PASSED	-
Samp8	0.7993	0.0013	0.0000	836.827641	-868.299592	0.0000	FAILED	TOO_FEW_GENOTYPES[ABSOLUTE]
Samp9	1.0000	0.4077	0.0000	-2441.868670	-1254.438732	0.0000	FAILED	HIGH_HETEROZYGOSITY[ABSOLUTE]
<end of toy.sampleinfo>

In this toy sample information file, Samp5 and Samp9 failed the QC analysis because of low and high absolute heterozygosity, respectively, whereas Samp8 failed due to a low genotype call rate (TOO_FEW_GENOTYPES).
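
A file like this can be post-processed with a few lines of Python. The sketch below assumes the file is tab-delimited with the header row shown in the toy example; the function name failed_samples is hypothetical, and only a subset of the toy rows is embedded to keep the example short.

```python
import csv
import io

# A few rows copied from the toy sample statistics file (tab-delimited).
TOY_SAMPLEINFO = (
    "SampleId\tCompleteness\tHeterozygosity\tMendelErrors\tSexOdds\tLogL\tAvgQualityScore\tFlagged\tComments\n"
    "Samp1\t0.9993\t0.3715\t0.0000\t-2129.893814\t-1325.334380\t0.0000\tPASSED\t-\n"
    "Samp5\t1.0000\t0.0013\t0.0000\t789.111679\t-820.437656\t0.0000\tFAILED\tLOW_HETEROZYGOSITY[ABSOLUTE]\n"
    "Samp8\t0.7993\t0.0013\t0.0000\t836.827641\t-868.299592\t0.0000\tFAILED\tTOO_FEW_GENOTYPES[ABSOLUTE]\n"
    "Samp9\t1.0000\t0.4077\t0.0000\t-2441.868670\t-1254.438732\t0.0000\tFAILED\tHIGH_HETEROZYGOSITY[ABSOLUTE]\n"
)


def failed_samples(handle):
    """Return {sample id: reason} for every row whose Flagged column is FAILED."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["SampleId"]: row["Comments"]
            for row in reader if row["Flagged"] == "FAILED"}


failures = failed_samples(io.StringIO(TOY_SAMPLEINFO))
for sample_id, reason in failures.items():
    print(sample_id, reason)
```

For a file on disk, the same function can be called on an open file handle instead of the in-memory string.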

Mendelian Inconsistencies Log

If parent-offspring trios (or pairs) are present in the study sample, a check for Mendelian inconsistencies is performed. The Mendelian inconsistencies found across all markers are logged in a Mendelian inconsistencies log file. This file contains the name of the marker and the ids of the samples (parent(s) and offspring) in which the Mendelian error was found. The chromosome of the marker and the sexes of the samples are also provided.

A toy mendelian errors log file is given below:

<contents of toy.mendelLog>
<end of toy.mendelLog>

SNP Statistics File

In the same spirit as the sample statistics file, the GAINQC program creates a SNP statistics file containing all the statistics calculated during the second pass, in which quality control is done for SNPs using only the samples that passed the first pass. In addition to SNP statistics such as MAF, completeness, Mendelian error rate, odds of being an autosomal marker (against X-linked), number of mismatches in duplicates and average quality score, this file also contains information on the TDT test (if any trios were part of the sample) and the association test (if samples were assigned labels). Finally, there are two columns indicating whether the SNP passed QC and, if it failed, the reason for the failure.

A toy SNP statistics file is shown below:

<contents of toy.snpinfo>
Marker MinorAllele MAF Completeness HWEPvalue MendelErrors ImpliedMendelErrorRate Mismatches XlinkedOdds AvgQualScore Flagged Comments
rs16998050   T       0.138522        0.997788        0.618391        0       0.000000        0       -66.663183      0.7832  PASSED  -
rs12010301   C       0.296053        1.000000        1               0       0.000000        0       -93.084500      0.7341  PASSED  -
rs1921396    A       0.303430        0.997788        0.00749048      0       0.000000        0       -83.828533      0.7656  PASSED  -
rs5913791    T       0.198946        0.997788        0.590222        0       0.000000        0       -10.654893      0.6168  PASSED  -
rs17474852   G       0.109788        0.995575        1               0       0.000000        0       -66.257458      0.6885  PASSED  -
rs1091272    C       0.402632        1.000000        0.726674        0       0.000000        0       -92.660182      0.8498  PASSED  -
rs6611060    C       0.197368        1.000000        1               0       0.000000        0       -74.651094      0.7402  PASSED  -
rs674007     C       0.269737        1.000000        1               0       0.000000        0       -87.456307      0.7555  PASSED  -
rs11795511   T       0.192358        0.997788        0.287873        0       0.000000        0       -63.058974      0.7903  PASSED  -
rs2205513    C       0.305263        1.000000        0.182062        0       0.000000        0       -85.541103      0.7167  PASSED  -
<end of toy.snpinfo>
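
The SNP statistics file can be read in much the same way. The sketch below assumes the columns are separated by runs of whitespace, as in the toy file above; the function name read_snpinfo and the p-value cut-off of 0.01 are illustrative choices made here, not GAINQC settings.

```python
# Header and three rows copied from the toy SNP statistics file.
TOY_SNPINFO = """\
Marker MinorAllele MAF Completeness HWEPvalue MendelErrors ImpliedMendelErrorRate Mismatches XlinkedOdds AvgQualScore Flagged Comments
rs16998050   T       0.138522        0.997788        0.618391        0       0.000000        0       -66.663183      0.7832  PASSED  -
rs1921396    A       0.303430        0.997788        0.00749048      0       0.000000        0       -83.828533      0.7656  PASSED  -
rs1091272    C       0.402632        1.000000        0.726674        0       0.000000        0       -92.660182      0.8498  PASSED  -
"""


def read_snpinfo(text):
    """Parse whitespace-delimited SNP statistics into a list of dicts."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    # Limit the number of splits so a Comments value containing spaces stays whole.
    return [dict(zip(header, line.split(None, len(header) - 1)))
            for line in lines[1:]]


rows = read_snpinfo(TOY_SNPINFO)

# Hypothetical follow-up filter: markers with a small HWE p-value.
HWE_CUTOFF = 0.01  # illustrative value only
low_hwe = [row["Marker"] for row in rows if float(row["HWEPvalue"]) < HWE_CUTOFF]
print(low_hwe)
```

All values are kept as strings and converted only where needed, which avoids guessing the type of columns whose contents may vary between runs.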

Markers Not Assessed File

During the second pass of the GAINQC program, the QC step is skipped for some markers for reasons such as the SNP not being present in the SNP information file or an information mismatch. GAINQC logs such cases in a file called the not assessed file. This file contains the ids of the SNPs on which QC was not performed, together with the reason for skipping the QC step for each SNP.

A toy SNPs not assessed file is shown below:

<contents of toy.notassessed>
rs1927743
rs17727
rs719216
rs96024
<end of toy.notassessed>
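
A minimal sketch for reading this file is given below. The toy file shows only marker ids, so the layout of the reason text is not known here; the sketch assumes that any text after the first whitespace-separated token on a line is the reason, and stores None when nothing follows. The function name read_not_assessed is hypothetical.

```python
# Contents copied from the toy not assessed file.
TOY_NOTASSESSED = """\
rs1927743
rs17727
rs719216
rs96024
"""


def read_not_assessed(text):
    """Return {marker id: reason or None} for the skipped markers."""
    skipped = {}
    for line in text.splitlines():
        fields = line.strip().split(None, 1)
        if not fields:
            continue  # ignore blank lines
        # Assumption: anything after the id is the reason for skipping.
        skipped[fields[0]] = fields[1] if len(fields) > 1 else None
    return skipped


skipped = read_not_assessed(TOY_NOTASSESSED)
print(sorted(skipped))
```

The resulting ids can then be used, for example, to confirm that a marker missing from the SNP statistics file was skipped rather than lost.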

Relation Information File: Putative and Inferred Relation Mismatch

If the study sample contains pairs of relatives (or simply to check that all samples are unrelated), GAINQC can be instructed to perform a relationship check analysis. This analysis estimates IBD (Identical By Descent) probabilities, derives kinship from these probabilities and compares the results to the expected values. It also flags relation pairs that lie too far from the other pairs within the same relationship group. To report the pairs flagged by this analysis, GAINQC creates a file containing the ids of the two samples in each pair, the estimated and expected kinships, and the putative and estimated relationships. Only the following relationships are estimated explicitly: Parent-Offspring, Siblings, Half-siblings, Duplicates/MZ and Unrelated. All other possible relations are grouped together as RELATED. The program also creates histograms of the distribution of kinship within each relationship group.

A toy relationship information file is shown here:

<contents of toy.relationinfo>
<end of toy.relationinfo>
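
The link between IBD probabilities and kinship can be made concrete with the standard relation phi = k1/4 + k2/2, where k0, k1 and k2 are the probabilities of sharing 0, 1 and 2 alleles identical by descent. The sketch below uses the textbook IBD probabilities for the relationship groups named above. The function names and the tolerance of 0.05 are assumptions for illustration; the source does not state how GAINQC decides that an estimate is too far from its expected value.

```python
# Textbook IBD sharing probabilities (k0, k1, k2) per relationship group.
EXPECTED_IBD = {
    "Parent-Offspring": (0.0, 1.0, 0.0),
    "Siblings": (0.25, 0.5, 0.25),
    "Half-siblings": (0.5, 0.5, 0.0),
    "Duplicates/MZ": (0.0, 0.0, 1.0),
    "Unrelated": (1.0, 0.0, 0.0),
}


def kinship(k0, k1, k2):
    """Kinship coefficient from IBD probabilities: phi = k1/4 + k2/2."""
    if abs(k0 + k1 + k2 - 1.0) > 1e-6:
        raise ValueError("IBD probabilities must sum to 1")
    return k1 / 4 + k2 / 2


def is_flagged(estimated_kinship, putative, tolerance=0.05):
    """Hypothetical check: is the estimate far from the expected kinship?"""
    expected = kinship(*EXPECTED_IBD[putative])
    return abs(estimated_kinship - expected) > tolerance


for relation, ibd in EXPECTED_IBD.items():
    print(relation, kinship(*ibd))
```

Parent-Offspring and Siblings share the same expected kinship (0.25), so kinship alone cannot tell them apart; the IBD probabilities themselves are needed for that.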

External Genotypes Comparison

The --hapmap option can be used to compare the genotype calls in the genotype file (from the study) to genotype calls for the same samples and markers obtained from an external source. As the name indicates, we expect the external source to be HapMap. This option generates 2 output log files:

  1. A log of allele flips for markers between the study genotypes and the external genotypes
  2. A log of sample genotype mismatches between the two genotype sources

In addition to these output files, this option prints a text histogram on the screen showing the genotype mismatches between the two sources.

An example of each type of log file is shown below:

<contents of toy-hapmap.log>
<end of toy-hapmap.log>
<contents of toy-vs-reference-sample.txt>
<end of toy-vs-reference-sample.txt>

Data Output Files: Post QC Filtered Data

Once GAINQC has performed all its analyses, one might want a filtered dataset containing only the SNPs and samples that passed quality control. This is accomplished using the --output option. If this option is specified, the program outputs 3 data files:

  • the filtered genotypes file: contains the genotypes (as read in) for all the markers and samples that passed quality control
  • the filtered and thresholded genotypes file: contains the genotypes for all the markers and samples that passed quality control (the genotypes with low quality scores are output as missing)
  • the quality score file: contains the quality scores for all the genotypes output in the genotype files
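
The difference between the filtered file and the filtered and thresholded file can be sketched as follows. The genotype encoding, the missing-genotype code "0/0" and the threshold of 0.5 are all hypothetical values chosen for the example; the source does not specify GAINQC's file format or its default quality score threshold.

```python
MISSING = "0/0"  # hypothetical code for a missing genotype


def threshold_genotypes(genotypes, scores, min_score):
    """Replace genotypes whose quality score is below min_score with MISSING."""
    if len(genotypes) != len(scores):
        raise ValueError("need exactly one quality score per genotype")
    return [g if s >= min_score else MISSING
            for g, s in zip(genotypes, scores)]


# Genotypes of one sample at three markers, with their quality scores.
filtered = ["A/A", "A/G", "G/G"]
quality = [0.95, 0.40, 0.88]

thresholded = threshold_genotypes(filtered, quality, min_score=0.5)
print(thresholded)
```

Here the filtered genotypes are left as read in, while the thresholded copy reports the second genotype as missing because its score falls below the chosen cut-off.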

All the text output files have been described above. The graphical output files with the histograms are described here.


 
 

University of Michigan | School of Public Health | Abecasis Lab