University of Michigan Center for Statistical Genetics

GAINQC Input Files

GAINQC performs data quality checks on genotype data. To do so, it requires three mandatory input files: the pedigree file, which describes the relationships between individuals; the SNP (marker) information file, which provides details on the markers; and the genotype file. Optionally, GAINQC also accepts input files containing settings and thresholds for the software, quality scores for each genotype, external genotypes, and a list of samples to ignore. These pages describe each input file format in detail, with examples.

Mandatory Input Files

Pedigree File: Describing Relationships Between Individuals

The pedigree file must contain five mandatory columns, which together describe the relationships between individuals and each individual's sex. In order, the five columns are a family identifier, an individual identifier, a link to each parent (if available) and an indicator of the individual's sex.

In addition to the standard pedigree file described above, the pedigree file for the GAINQC software can also contain additional columns in the KEY=VALUE format. These columns are used to attach labels to the individuals in the pedigree. Multiple label columns can be present, but only one of the labels is used for any one run of the software.

Example pedigree: the accompanying image shows a three-generation pedigree. We will construct a pedigree file for this pedigree. In addition to the five required items, we will add a KEY=VALUE column with key 'STATUS' and value 'AFFECTED' or 'UNAFFECTED'. Each individual is identified by a unique number, as shown in the image. For this pedigree, let us assume the family id to be 1. Following convention, males are assigned sex code 1 and females are assigned sex code 2. The character '0' is used to code missing values.


For this pedigree, the pedigree file is given below:

<contents of toy.ped>
1   1   0  0  1 STATUS=UNAFFECTED
1   2   0  0  2 STATUS=UNAFFECTED
1   3   1  2  1 STATUS=UNAFFECTED
1   4   0  0  2 STATUS=UNAFFECTED
1   5   3  4  1 STATUS=UNAFFECTED
1   6   3  4  2 STATUS=AFFECTED
1   7   3  4  2 STATUS=AFFECTED
1   8   3  4  2 STATUS=AFFECTED
<end of toy.ped>

A pedigree file can include multiple families, and each family can have its own structure, independent of the other families in the dataset. Unlike some other programs, GAINQC requires every individual to have a unique id across the whole file: two individuals cannot share an individual id, whether they belong to the same family or to different families.
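
The rules above can be checked with a short script before running GAINQC. The following Python sketch is not part of GAINQC; the names Person and parse_pedigree are hypothetical, and it assumes whitespace-separated columns with the father listed before the mother (which is consistent with the toy pedigree, but not stated explicitly above).

```python
from typing import NamedTuple


class Person(NamedTuple):
    family: str
    individual: str
    father: str   # assumed to be the first parent column; "0" means missing
    mother: str   # assumed to be the second parent column; "0" means missing
    sex: str      # "1" male, "2" female, "0" missing
    labels: dict  # optional KEY=VALUE label columns


def parse_pedigree(text):
    """Parse pedigree text into {individual id: Person}.

    Raises ValueError for short rows, malformed labels or repeated ids.
    """
    people = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 5:
            raise ValueError(f"line {line_no}: expected at least 5 columns")
        labels = {}
        for token in fields[5:]:
            key, sep, value = token.partition("=")
            if not sep or not key:
                raise ValueError(f"line {line_no}: bad label column {token!r}")
            labels[key] = value
        person = Person(*fields[:5], labels)
        if person.individual in people:
            raise ValueError(
                f"line {line_no}: repeated individual id {person.individual!r}"
            )
        people[person.individual] = person
    return people


TOY_PED = """\
1   1   0  0  1 STATUS=UNAFFECTED
1   2   0  0  2 STATUS=UNAFFECTED
1   3   1  2  1 STATUS=UNAFFECTED
1   4   0  0  2 STATUS=UNAFFECTED
1   5   3  4  1 STATUS=UNAFFECTED
1   6   3  4  2 STATUS=AFFECTED
1   7   3  4  2 STATUS=AFFECTED
1   8   3  4  2 STATUS=AFFECTED
"""

people = parse_pedigree(TOY_PED)
print(len(people), people["6"].labels["STATUS"])
```

Keying the result by individual id alone, rather than by (family, individual), mirrors the requirement that ids be unique across the whole file.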

Genotype File: Describing Genotypes

The genotype file contains the genotypes for the individuals in the pedigree file, in matrix format: each column corresponds to one sample and each row to one marker. The file begins with a header line containing the sample names. Duplicate samples are indicated by appending a '.' followed by a duplicate identifier to the sample name; common duplicate identifiers are '.d1' and '.dup'.

Each genotype is encoded as two letters, each one of 'A', 'C', 'T' or 'G', denoting the two bases, one for each allele. Missing genotypes are coded as 'NN'. For markers on the X chromosome, male genotypes must be coded either as 'AX', 'CX' etc. or as if the male were homozygous for the allele he carries ('AA', 'CC' etc.). For markers on the Y chromosome, female genotypes must be missing, and male genotypes must be coded either as 'AY', 'CY' etc. or as if the male were homozygous for the allele he carries.

A toy genotype file is given below, with 10 samples + 1 duplicate and 7 markers:

<contents of toy.geno>
markerID      Samp1   Samp2   Samp3   Samp4   Samp5   Samp6   Samp7   Samp8   Samp9   Samp10  Samp10.dup1
Marker1       TT      AT      AT      TT      TT      TT      TT      TT      AT      TT      TT
Marker2       GG      GG      GG      GG      GG      GA      GA      GG      GG      GG      GG
Marker3       GG      GA      GA      GG      GG      GA      NN      GG      GA      GG      NN
Marker4       CC      CC      CC      TC      NN      CC      CC      TC      CC      CC      CT
Marker5       CC      CC      CT      CT      CT      CT      CT      CT      CT      CC      CC
toyMarker_X   AG      AG      GG      GG      GG      AA      GX      GG      AX      AX      AA
toyMarker_Y   NN      NN      CY      CC      NN      NN      CC      CY      AY      AA      AA
<end of toy.geno>
In the toy genotype file, the first token of the header is a placeholder, since the first column holds the marker name. The rest of the header contains the sample ids for the corresponding columns. Each subsequent row contains the genotypes of all samples at one marker.
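
As an illustration of this layout, the Python sketch below reads a genotype matrix, checks each genotype code against the encodings described above, and splits duplicate sample names at the '.'. It is not GAINQC's own parser: parse_genotypes and split_sample_name are hypothetical names, and the code assumes whitespace-separated columns and sample ids that do not themselves contain a '.'. The data is an excerpt (three columns, four markers) of toy.geno.

```python
import re

# Two bases, 'NN' for missing, or one base followed by X or Y.
GENOTYPE_RE = re.compile(r"[ACGT]{2}|NN|[ACGT][XY]")


def split_sample_name(name):
    """Return (sample id, duplicate identifier or None)."""
    base, dot, dup = name.partition(".")
    return base, (dup if dot else None)


def parse_genotypes(text):
    """Return (sample names, {marker: {sample name: genotype}})."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    samples = lines[0].split()[1:]  # first header token is a placeholder
    genotypes = {}
    for line in lines[1:]:
        marker, *calls = line.split()
        if len(calls) != len(samples):
            raise ValueError(f"{marker}: expected {len(samples)} genotypes")
        for call in calls:
            if not GENOTYPE_RE.fullmatch(call):
                raise ValueError(f"{marker}: bad genotype code {call!r}")
        genotypes[marker] = dict(zip(samples, calls))
    return samples, genotypes


TOY_GENO_EXCERPT = """\
markerID      Samp9   Samp10  Samp10.dup1
Marker1       AT      TT      TT
Marker4       CC      CC      CT
toyMarker_X   AX      AX      AA
toyMarker_Y   AY      AA      AA
"""

samples, genotypes = parse_genotypes(TOY_GENO_EXCERPT)
print(samples)
print(split_sample_name("Samp10.dup1"))
print(genotypes["Marker4"]["Samp10.dup1"])
```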

SNP Information File: Describing SNP attributes

The SNP information file contains basic attributes of all the SNPs included in the study. It must include five essential columns: the SNP ID used in the study, the chromosome of the SNP, the position of the SNP on the chromosome, the quality control type for the SNP, and the flanking sequence for the SNP, including the alleles. Optionally, four other columns can also be included: the rs number of the SNP, the build (of the database) from which this information was obtained, the source of the DNA (buccal swab, blood etc.) and the strand of the SNP (+/-).

An example SNP information file with 7 SNPs is given below:

<contents of toy.snp>
PREFERRED_ID   CHR	POSITION	QC_TYPE	SEQUENCE
Marker1        1	3019921		A	CGGCT[A/T]ACGTA
Marker2        8	1002991		N	GAGGC[G/A]GGCTG
Marker3        16	19487782	A	ACCTA[G/A]CGGCT
Marker4        16	20938172	A	CGTAG[C/T]GCGGA
Marker5        21	1002991		A	GCTCC[C/T]CCTTA
toyMarker_X    X	3057758		X	CGGGT[A/G]ACCGT
toyMarker_Y    Y	2098382		Y	TTGGA[A/C]GCCAC
<end of toy.snp>

It is important to note that the sequence column contains the alleles of the SNP, in the format '[A/B]' where 'A' and 'B' are the two alleles at the SNP. There is no minimum flanking sequence length requirement, so if only the alleles are encoded, the sequence column can be just '[A/T]' instead of 'CGGCT[A/T]ACGTA'.
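
To make the '[A/B]' convention concrete, the Python sketch below reads an SNP information file by its header tokens and pulls the two alleles out of the sequence column. The function name parse_snp_info and the added 'ALLELES' entry are hypothetical, and the code assumes whitespace-separated columns with no empty fields and the default header tokens shown in toy.snp.

```python
import re

# Matches the '[A/B]' allele marker anywhere in the flanking sequence.
ALLELES_RE = re.compile(r"\[([ACGT])/([ACGT])\]")


def parse_snp_info(text):
    """Return {SNP id: row dict}, adding an 'ALLELES' tuple to each row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    snps = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        match = ALLELES_RE.search(row["SEQUENCE"])
        if match is None:
            raise ValueError(f"{row['PREFERRED_ID']}: no [A/B] alleles found")
        row["ALLELES"] = match.groups()
        snps[row["PREFERRED_ID"]] = row
    return snps


TOY_SNP_EXCERPT = """\
PREFERRED_ID   CHR  POSITION  QC_TYPE  SEQUENCE
Marker1        1    3019921   A        CGGCT[A/T]ACGTA
Marker2        8    1002991   N        [G/A]
toyMarker_X    X    3057758   X        CGGGT[A/G]ACCGT
"""

snps = parse_snp_info(TOY_SNP_EXCERPT)
print(snps["Marker1"]["ALLELES"])
print(snps["Marker2"]["ALLELES"])
```

The second row uses the alleles-only form of the sequence column, which the same pattern handles without any flanking bases.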

These three mandatory input files are sufficient to get GAINQC going. Given below are a few quirks of GAINQC:

  • Samples not present in the pedigree file will be ignored.
  • SNPs not present in the SNP information file will be ignored.
  • Duplicate samples do not need an entry in the pedigree file.
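
One way to picture these rules is as a filter over the genotype file's columns and rows. The Python sketch below is only an illustration, with hypothetical function names; in particular, it assumes that a duplicate sample is tied to the pedigree through the part of its name before the '.', which the text above implies but does not spell out.

```python
def usable_samples(sample_names, pedigree_ids):
    """Keep genotype columns whose base sample id appears in the pedigree."""
    kept = []
    for name in sample_names:
        base = name.split(".", 1)[0]  # strip any duplicate identifier
        if base in pedigree_ids:
            kept.append(name)
    return kept


def usable_markers(marker_names, snp_ids):
    """Keep genotype rows whose marker appears in the SNP information file."""
    return [name for name in marker_names if name in snp_ids]


# Hypothetical inputs for illustration only.
pedigree_ids = {"Samp1", "Samp2"}
snp_ids = {"Marker1", "Marker2"}

print(usable_samples(["Samp1", "Samp1.dup1", "Samp2", "Samp3"], pedigree_ids))
print(usable_markers(["Marker1", "Marker2", "Marker9"], snp_ids))
```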

Optional Input Files

Quality Scores File

The quality score file is a platform-specific file that indicates the confidence in each individual genotype call. Its format is very similar to that of the genotype file, except that each element is a number giving the quality score for the corresponding genotype rather than the genotype itself. A higher number indicates higher confidence in the genotype call. Typically, a quality score threshold is used to block (treat as missing) genotypes whose quality scores are lower than the threshold.

A toy quality score file is provided below:

<contents of toy.qual>
markerID	Samp1   Samp2   Samp3   Samp4   Samp5   Samp6   Samp7   Samp8   Samp9   Samp10  Samp10.dup1
Marker1 	22      28      26      19      28      28      28      26      22      28	26
Marker2 	21      22      22      11      22      04      04      22      22      22	22
Marker3 	28      28      26      28      28      28      22      28      28      28	28
Marker4 	11      13      18      25      19      16      18      18      20      21	20
Marker5 	06      11      03      03      03      03      03      03      03      06	10
toyMarker_X	03      03      03      03      03      03      03      03      03      03	03
toyMarker_Y	22      22      27      27      28      27      26      11      27      23	25
<end of toy.qual>
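
The blocking step can be sketched as follows. This Python fragment is an illustration, not GAINQC's implementation: mask_low_quality is a hypothetical name, and the input dictionaries are assumed to have already been read from the genotype and quality score files. The values are taken from the toy files (Marker2 and Marker5 for Samp1, Samp2 and Samp6), with the threshold of 7 used in the toy settings file.

```python
def mask_low_quality(genotypes, qualities, threshold):
    """Replace genotypes whose quality score is below threshold with 'NN'."""
    masked = {}
    for marker, calls in genotypes.items():
        masked[marker] = {
            sample: call if float(qualities[marker][sample]) >= threshold else "NN"
            for sample, call in calls.items()
        }
    return masked


genotypes = {
    "Marker2": {"Samp1": "GG", "Samp2": "GG", "Samp6": "GA"},
    "Marker5": {"Samp1": "CC", "Samp2": "CC", "Samp6": "CT"},
}
qualities = {
    "Marker2": {"Samp1": "21", "Samp2": "22", "Samp6": "04"},
    "Marker5": {"Samp1": "06", "Samp2": "11", "Samp6": "03"},
}

masked = mask_low_quality(genotypes, qualities, threshold=7)
print(masked["Marker2"])
print(masked["Marker5"])
```

A score exactly equal to the threshold is kept here, since only scores lower than the threshold are described as blocked.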

Algorithm Settings File

The settings file can be used to tailor the algorithm. In this file, one can specify all the thresholds for the tests performed by GAINQC. In addition, the label key ('STATUS' in our toy.ped example), the headings for the SNP information file ('PREFERRED_ID', 'CHR' etc.) and other auxiliary settings can also be altered. The program uses default settings if no settings file is given. It is also possible to alter only a few settings; any setting not specified in the file keeps its default value.

The settings in this file are in KEY=VALUE format. All possible keys are listed in the table at the end of this section. Comment lines can be included using C-style '//' at the beginning of the line. Each line may contain only one KEY=VALUE pair.

A toy settings file is shown below:

<contents of toy.settings>
// Don't drop samples because of bad markers
SAMPLE_CALLS_MIN = 0.80

// We don't apply an HWE cutoff at this stage ... those statistics are calculated per sample
MARKER_CALLS_MIN = 0.90
MARKER_HWE_PVALUE = 0
MARKER_MENDEL_RATE = 0.30
MARKER_MAX_MENDEL = 1

// This is the Perlegen recommended threshold for the first pass
QUALITY_THRESHOLD = 7

// We increase this value without decreasing quality very much to speed things up
RELATION_MAF_MIN = .2

//sample label property
SAMPLE_LABEL_KEY = POP
SAMPLE_LSEXODDS = 100
<end of toy.settings>

In this toy settings file, the sample call rate cutoff is set to 80%, the marker call rate threshold to 90%, and so on. Comment lines explain some of the choices.

Table of keys for settings file

Key                     Type               Default       Description
SAMPLE_CALLS_MIN        Float in [0, 1]    0.95          Minimum proportion of called genotypes per sample
SAMPLE_CALLS_ZSCORE     Integer ≥ 0        5             Maximum z-score for sample genotype completeness
SAMPLE_HET_MIN          Float in [0, 0.5]  0.10          Minimum sample heterozygosity
SAMPLE_HET_MAX          Float in [0, 0.5]  0.40          Maximum sample heterozygosity
SAMPLE_HET_ZSCORE       Integer ≥ 0        5             Maximum z-score for sample heterozygosity
SAMPLE_LSEXODDS         Float              10.0          Minimum log odds of being male
SAMPLE_MENDEL_MAX       Float in [0, 1]    0.02          Maximum proportion of markers with Mendelian errors in a trio containing this sample
SAMPLE_LABEL_KEY        String             (none)        Key used for labeling samples in the pedigree file
SAMPLE_PDF_MAX_BINS     Integer ≥ 1        100           Maximum number of bins in the sample pdf histograms
MARKER_CALLS_MIN        Float in [0, 1]    0.95          Minimum call rate for SNPs
MARKER_MISMATCHES_MAX   Integer ≥ 0        1             Maximum number of mismatches among duplicate samples for a marker
MARKER_HWE_PVALUE       Float in [0, 1]    10^-6         P-value for the exact test for Hardy-Weinberg Equilibrium
MARKER_MENDEL_RATE      Float in [0, 1]    0.01          Maximum proportion of trios with Mendelian errors
MARKER_MAX_MENDEL       Integer ≥ 0        5             Maximum number of trios with Mendelian errors
MAF_MIN                 Float in [0, 0.5]  0.0           Minimum minor allele frequency
MIN_ALLELE_COUNT        Integer ≥ 0        0             Minimum minor allele count
MARKER_PDF_MAX_BINS     Integer ≥ 1        100           Maximum number of bins in the marker pdf histograms
RELATION_ZSCORE         Integer ≥ 0        5             Maximum z-score for the relation check
RELATION_MAF_MIN        Float in [0, 0.5]  0.10          Minimum minor allele frequency for markers to be used for relationship checks
RELATION_MAX_MEAN_DIFF  Float in [0, 1]    0.10          Maximum difference between the mean kinship of all pairs with the same relationship and the expected kinship for the relationship
RELATION_BINS           Integer ≥ 1        100           Number of bins for relationship histograms
GENOTYPING_ERROR        Float in [0, 1]    0.01          Genotyping error rate
QUALITY_THRESHOLD       Float ≥ 0          0.0           Minimum quality score needed for a genotype to be considered valid
PREFERRED_ID            String             PREFERRED_ID  Header token for column in SNP information file containing the preferred id for the SNPs
RS_ID                   String             RS_ID         Header token for column in SNP information file containing the RS# for the SNPs
BUILD                   String             BUILD         Header token for column in SNP information file containing the build of the human genome from which the SNP information was obtained
CHR                     String             CHR           Header token for column in SNP information file containing the chromosome for the SNPs
POSITION                String             POSITION      Header token for column in SNP information file containing the position of the SNP on the chromosome
QC_TYPE                 String             QC_TYPE       Header token for column in SNP information file containing the QC type for the SNPs
SEQUENCE                String             SEQUENCE      Header token for column in SNP information file containing the flanking sequence (including the alleles) of the SNPs
SOURCE                  String             SOURCE        Header token for column in SNP information file containing the source of the genetic material using which the SNP was genotyped
STRAND                  String             STRAND        Header token for column in SNP information file containing the strand on which the SNP was genotyped
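
The way a settings file overrides defaults can be sketched in a few lines of Python. This is an illustration only: parse_settings is a hypothetical name, DEFAULTS holds just a handful of the defaults from the table above, and values are kept as strings rather than converted to the types listed in the table. The input is an excerpt of toy.settings.

```python
# A few of the documented defaults, kept as strings for simplicity.
DEFAULTS = {
    "SAMPLE_CALLS_MIN": "0.95",
    "MARKER_CALLS_MIN": "0.95",
    "QUALITY_THRESHOLD": "0.0",
    "RELATION_MAF_MIN": "0.10",
    "SAMPLE_LSEXODDS": "10.0",
}


def parse_settings(text, defaults):
    """Return defaults overridden by the KEY=VALUE lines found in text."""
    settings = dict(defaults)
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue  # blank line or comment line
        key, sep, value = stripped.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"line {line_no}: expected KEY=VALUE")
        settings[key.strip()] = value.strip()
    return settings


TOY_SETTINGS_EXCERPT = """\
// Don't drop samples because of bad markers
SAMPLE_CALLS_MIN = 0.80

MARKER_CALLS_MIN = 0.90
QUALITY_THRESHOLD = 7
RELATION_MAF_MIN = .2
"""

settings = parse_settings(TOY_SETTINGS_EXCERPT, DEFAULTS)
print(settings["SAMPLE_CALLS_MIN"], settings["SAMPLE_LSEXODDS"])
```

Keys that do not appear in the file, such as SAMPLE_LSEXODDS here, keep their default value.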

External Genotype File

Externally obtained genotypes can be used to verify the accuracy of the genotyping process. The external genotype file contains the genotypes obtained from the external source and must be in the HapMap genotype format. It is used along with the genotype file to compute the concordance between the two sets of genotypes, as well as some SNP characteristics such as whether the alleles match or are flipped. The best way to understand this format is to look at an example. More information can be found at HapMap's website.

A toy external genotype file is shown here:

<contents of toy.external>
rs# SNPalleles chrom pos strand genome_build center protLSID assayLSID panelLSID QC_code Samp1 Samp3 Samp7 Samp8 Samp10
rs11511647 C/T chr10 62765 + ncbi_b36 sanger - - - QC+ CC CT CC CT TT
rs4880608 A/G chr10 83299 + ncbi_b36 affymetrix - - - QC+ GG GG GG GG GG
rs12218882 A/G chr10 84172 + ncbi_b36 perlegen - - - QC+ NN NN AG GG GG
rs10904045 C/T chr10 84426 + ncbi_b36 perlegen - - - QC+ TT CT TT CT CC
rs10751931 C/T chr10 85949 + ncbi_b36 sanger - - - QC+ CC CC CC CC CC
rs11252127 C/T chr10 88087 + ncbi_b36 perlegen - - - QC+ CC CC CC CC CC
rs12775203 C/T chr10 88277 + ncbi_b36 sanger - - - QC+ TT TT TT TT TT
<end of toy.external>

The tokens in the header of the external genotype file are of critical importance. The three key columns that must be present, other than the marker name and the genotypes, are 'SNPalleles', 'strand' and 'QC_code'. To use the external genotype file, some SNPs and samples must overlap between the external genotype file and the user-supplied genotype file. If no matches are found, a warning message is output and the external genotype file is not used. If the external genotype file is used, the SNP information file must contain the rs id column (header token 'RS_ID' by default) with the rs# of the SNPs in the study.
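
A rough picture of a concordance check is given in the Python sketch below. It is not GAINQC's algorithm, whose details are not described here: parse_external and concordance are hypothetical names, the sample columns are assumed to start right after 'QC_code' (as in the toy file), genotypes are compared ignoring allele order, and a flipped SNP is handled by complementing the external bases. The study genotypes are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_external(text):
    """Return {rs id: row dict}; per-sample calls are under 'genotypes'."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    first_sample = header.index("QC_code") + 1  # assumption: samples follow QC_code
    rows = {}
    for line in lines[1:]:
        fields = line.split()
        row = dict(zip(header[:first_sample], fields[:first_sample]))
        row["genotypes"] = dict(zip(header[first_sample:], fields[first_sample:]))
        rows[fields[0]] = row
    return rows


def concordance(study, external, flipped=False):
    """Fraction of non-missing genotype pairs that agree, or None if none overlap."""
    compared = matched = 0
    for sample, ext_call in external.items():
        call = study.get(sample)
        if call is None or call == "NN" or ext_call == "NN":
            continue
        if flipped:
            ext_call = ext_call.translate(COMPLEMENT)
        compared += 1
        if sorted(call) == sorted(ext_call):  # ignore allele order
            matched += 1
    return matched / compared if compared else None


TOY_EXTERNAL_EXCERPT = """\
rs# SNPalleles chrom pos strand genome_build center protLSID assayLSID panelLSID QC_code Samp1 Samp3 Samp7 Samp8 Samp10
rs11511647 C/T chr10 62765 + ncbi_b36 sanger - - - QC+ CC CT CC CT TT
"""

external = parse_external(TOY_EXTERNAL_EXCERPT)["rs11511647"]
# Hypothetical study genotypes for the same SNP.
study = {"Samp1": "CC", "Samp3": "TC", "Samp7": "NN", "Samp8": "CT", "Samp10": "CC"}
print(concordance(study, external["genotypes"]))
```

In this example Samp7 is skipped because its study genotype is missing, and Samp10 is the only disagreement among the four remaining pairs.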

Ignore Samples File

In some cases, some samples may need to be excluded from the quality control analysis. This can be accomplished using the ignore samples file, which simply contains a list of sample ids, one per line, to be excluded from the QC run. Note that duplicate samples must be specified separately: listing the original sample id does not exclude its duplicates, so each duplicate must appear in the ignore samples file if it is also to be excluded.

A toy ignore sample file is given below:

<contents of toy.exclude>
Samp2
Samp3
Samp10.dup1
<end of toy.exclude>
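
The exact-match behaviour can be illustrated with a short Python sketch (the function names are hypothetical and this is not GAINQC code). Because names are compared as whole strings, listing 'Samp10.dup1' leaves 'Samp10' in the analysis, and listing only 'Samp10' would leave 'Samp10.dup1' in.

```python
def read_ignore_list(text):
    """Return the set of sample ids listed one per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def drop_ignored(sample_names, ignored):
    """Keep only the samples whose full name is not in the ignore list."""
    return [name for name in sample_names if name not in ignored]


TOY_EXCLUDE = """\
Samp2
Samp3
Samp10.dup1
"""

ignored = read_ignore_list(TOY_EXCLUDE)
remaining = drop_ignored(["Samp1", "Samp2", "Samp3", "Samp10", "Samp10.dup1"], ignored)
print(remaining)
```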

That is all we need to know about the input file formats. In the next section, we will look at the output file formats.


 
 

University of Michigan | School of Public Health | Abecasis Lab