GAINQC Input Files
GAINQC performs a data quality check on genotype data. To perform quality control, GAINQC requires
three mandatory input files: the pedigree file describing the relationships between individuals, the marker
information file providing details on the markers, and the genotype file. Optionally, GAINQC also takes
input files containing settings and thresholds for the software, quality scores for each genotype, external genotypes, and a list of samples to ignore. In these pages, the input file formats are described in detail along with examples.
Mandatory Input Files
Pedigree File: Describing Relationships Between Individuals
The pedigree file must contain 5 mandatory columns, which include all the information about the relationships between individuals and also the sex information for individuals. The five columns, in order, are a family identifier, an individual identifier, a link to each parent (if available) and finally an indicator of each individual's sex.
In addition to the standard pedigree file described above, the pedigree file for the GAINQC software can also contain additional columns in the KEY=VALUE format. These columns are used to attach labels to the individuals in the pedigree. Multiple label columns can be present, but only one of the labels is used for any one run of the software.
The image on the left shows a three-generation pedigree. We will construct a pedigree
file for this pedigree. In addition to the 5 required items, we will also add a KEY=VALUE column with key 'STATUS' and value 'AFFECTED' or 'UNAFFECTED'. Each individual is identified by a unique number, as shown in the image. For this pedigree, let us assume the family id to be 1. Following convention, males are assigned sex code 1 and females are assigned code 2. The character '0' is used to code missing values.
For this pedigree, the pedigree file is given below:
<contents of toy.ped>
1 1 0 0 1 STATUS=UNAFFECTED
1 2 0 0 2 STATUS=UNAFFECTED
1 3 1 2 1 STATUS=UNAFFECTED
1 4 0 0 2 STATUS=UNAFFECTED
1 5 3 4 1 STATUS=UNAFFECTED
1 6 3 4 2 STATUS=AFFECTED
1 7 3 4 2 STATUS=AFFECTED
1 8 3 4 2 STATUS=AFFECTED
<end of toy.ped>
A pedigree file can include multiple families. Each family can
have a unique structure, independent of other families in the dataset.
Unlike some other programs, GAINQC requires every individual to have a unique id: two individuals
cannot share an individual id, whether they belong to the same family or to different families.
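As a rough illustration of the format described above, the following Python sketch parses pedigree lines and enforces the unique-id rule. The names `Person` and `parse_pedigree` are hypothetical and not part of GAINQC. The father-then-mother column order is an assumption read off the toy example, and how GAINQC itself reports a repeated id is not described here.

```python
from typing import Dict, List, NamedTuple


class Person(NamedTuple):
    family: str
    ident: str
    father: str  # '0' means unknown (column order is an assumption)
    mother: str  # '0' means unknown
    sex: str     # '1' male, '2' female, '0' missing
    labels: Dict[str, str]


def parse_pedigree(text: str) -> List[Person]:
    """Parse pedigree lines: 5 fixed columns, then optional KEY=VALUE labels."""
    people: List[Person] = []
    seen = set()
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 5:
            raise ValueError(f"line {lineno}: expected at least 5 columns")
        family, ident, father, mother, sex = fields[:5]
        if ident in seen:
            # Ids must be unique across the whole file, not just per family.
            raise ValueError(f"line {lineno}: duplicate individual id {ident!r}")
        seen.add(ident)
        labels: Dict[str, str] = {}
        for field in fields[5:]:
            key, sep, value = field.partition("=")
            if not sep:
                raise ValueError(f"line {lineno}: expected KEY=VALUE, got {field!r}")
            labels[key] = value
        people.append(Person(family, ident, father, mother, sex, labels))
    return people
```

Calling `parse_pedigree` on the contents of toy.ped would yield eight `Person` records, each carrying a `STATUS` label.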
Genotype File: Describing Genotypes
Genotypes for the individuals in the pedigree file are present in the genotype
file. The genotypes are given in a matrix format. In this file format, each column
corresponds to one sample, whereas the rows correspond to markers. The file includes
a header line containing the sample names. Duplicate samples are indicated by adding a
'.' followed by a duplicate identifier to the sample name. Common duplicate identifiers
are '.d1' or '.dup'. Marker genotypes are encoded as two letters, one of "A", "C", "T"
or "G", denoting the 2 bases - one for each allele. Missing genotypes are coded using 'NN'.
For markers on the X chromosome, genotypes for males must be coded either as 'AX', 'CX' etc.
or as if they were homozygous for the allele that they have ('AA', 'CC' etc.).
For markers on the Y chromosome, genotypes for females must be missing, and genotypes for males
must be coded either as 'AY', 'CY' etc. or as if they were homozygous for the
allele that they possess.
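The coding rules for the sex chromosomes can be sketched as a small Python check that reduces both accepted male encodings to a single form. The function name `normalize_sex_chromosome_call` is hypothetical. Raising an error on a rule violation is an assumption made for illustration, since how GAINQC treats such genotypes is not stated here.

```python
BASES = set("ACGT")


def normalize_sex_chromosome_call(call: str, chrom: str, sex: str) -> str:
    """Return a two-base genotype for an X or Y marker, or 'NN' if missing.

    chrom is 'X' or 'Y'; sex is '1' (male) or '2' (female).
    """
    call = call.upper()
    if call == "NN":
        return "NN"
    if len(call) != 2 or call[0] not in BASES:
        raise ValueError(f"malformed genotype {call!r}")
    first, second = call[0], call[1]
    if sex == "2":
        if chrom == "Y":
            raise ValueError("females must have missing genotypes on Y")
        if second not in BASES:
            raise ValueError(f"female genotype {call!r} must be two bases")
        return call
    if sex == "1":
        # Accept 'AX' / 'AY' style or the homozygous style 'AA'.
        if second == chrom or second == first:
            return first + first
        raise ValueError(f"male genotype {call!r} on {chrom} has two different alleles")
    raise ValueError(f"unknown sex code {sex!r}")
```

For example, a male call of 'AX' on the X chromosome and a male call of 'AA' both come back as 'AA', so later comparisons only have to deal with one encoding.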
A toy genotype file is given below, with 10 samples + 1 duplicate and 7 markers:
<contents of toy.geno>
markerID Samp1 Samp2 Samp3 Samp4 Samp5 Samp6 Samp7 Samp8 Samp9 Samp10 Samp10.dup1
Marker1 TT AT AT TT TT TT TT TT AT TT TT
Marker2 GG GG GG GG GG GA GA GG GG GG GG
Marker3 GG GA GA GG GG GA NN GG GA GG NN
Marker4 CC CC CC TC NN CC CC TC CC CC CT
Marker5 CC CC CT CT CT CT CT CT CT CC CC
toyMarker_X AG AG GG GG GG AA GX GG AX AX AA
toyMarker_Y NN NN CY CC NN NN CC CY AY AA AA
<end of toy.geno>
In the toy genotype file, the first token of the header is a placeholder, since the first column is
the marker name. The header then contains the sample ids of the samples corresponding to the
columns. Each row then contains genotypes for the samples at one marker.
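A minimal Python sketch of reading this matrix layout and grouping duplicate samples is given below. The function names `parse_genotype_matrix` and `group_duplicates` are hypothetical, and the sketch assumes that original sample ids never contain a '.' themselves, so that everything after the first '.' is a duplicate identifier.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_genotype_matrix(text: str) -> Tuple[List[str], Dict[str, Dict[str, str]]]:
    """Return (sample names, {marker: {sample: genotype}})."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    samples = header[1:]  # the first header token is only a placeholder
    calls: Dict[str, Dict[str, str]] = {}
    for line in lines[1:]:
        fields = line.split()
        marker, genotypes = fields[0], fields[1:]
        if len(genotypes) != len(samples):
            raise ValueError(f"{marker}: expected {len(samples)} genotypes")
        calls[marker] = dict(zip(samples, genotypes))
    return samples, calls


def group_duplicates(samples: List[str]) -> Dict[str, List[str]]:
    """Map each original sample id to its columns, keeping only duplicated ones."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in samples:
        base = name.split(".", 1)[0]  # 'Samp10.dup1' -> 'Samp10'
        groups[base].append(name)
    return {base: names for base, names in groups.items() if len(names) > 1}
```

Applied to toy.geno, `group_duplicates` would report a single group, linking 'Samp10' to 'Samp10.dup1'.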
SNP Information File: Describing SNP attributes
The SNP information file contains basic attributes of all the SNPs included in the study.
It must include 5 essential columns: the SNP ID used in the study, the chromosome of the SNP,
the position of the SNP on the chromosome, the quality control type for the SNP, and the flanking sequence for the SNP,
including the alleles. Optionally, 4 other columns can also be included in the file: the rs
number of the SNP, the build (of the database) from which this information was obtained, the source of the DNA (buccal swab, blood etc.) and the strand of the SNP (+/-).
An example SNP information file is given below with 7 SNPs:
<contents of toy.snp>
PREFERRED_ID CHR POSITION QC_TYPE SEQUENCE
Marker1 1 3019921 A CGGCT[A/T]ACGTA
Marker2 8 1002991 N GAGGC[G/A]GGCTG
Marker3 16 19487782 A ACCTA[G/A]CGGCT
Marker4 16 20938172 A CGTAG[C/T]GCGGA
Marker5 21 1002991 A GCTCC[C/T]CCTTA
toyMarker_X X 3057758 X CGGGT[A/G]ACCGT
toyMarker_Y Y 2098382 Y TTGGA[A/C]GCCAC
<end of toy.snp>
It is important to note that the sequence column contains the alleles of the SNP, in the format '[A/B]' where 'A' and 'B' are
the two alleles at the SNP. There is no minimum flanking sequence length requirement, so if only the alleles are encoded, the sequence
column can be just '[A/T]' instead of 'CGGCT[A/T]ACGTA'.
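To make the '[A/B]' convention concrete, the following Python sketch pulls the two alleles out of a sequence entry, with or without flanking bases. The function name `alleles_from_sequence` is hypothetical, and restricting alleles to the four bases is an assumption based on the genotype coding described earlier.

```python
import re
from typing import Tuple

# One bracketed pair of single-base alleles, e.g. '[A/T]'.
_ALLELES = re.compile(r"\[([ACGT])/([ACGT])\]")


def alleles_from_sequence(sequence: str) -> Tuple[str, str]:
    """Extract the two alleles from a sequence such as 'CGGCT[A/T]ACGTA'."""
    matches = _ALLELES.findall(sequence.upper())
    if len(matches) != 1:
        raise ValueError(f"expected exactly one [A/B] allele block in {sequence!r}")
    return matches[0]
```

Both 'CGGCT[A/T]ACGTA' and the bare '[A/T]' give the same pair of alleles.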
These three mandatory input files are sufficient to get GAINQC going. Given below are a few quirks of GAINQC:
- Samples not present in the pedigree file will be ignored.
- SNPs not present in the snp information file will be ignored.
- Duplicate samples do not need an entry in the pedigree file.
Optional Input Files
Quality Scores File
The quality score file is a platform specific file that indicates the confidence in each individual genotype call.
The format of this file is very similar to that of the genotype file. In this file, instead of the genotypes, each element is a number
which indicates the quality score for the genotype. It must be noted that a higher number indicates higher confidence in the genotype call.
Typically, a quality score threshold can be used to block (treat as missing) genotypes with quality scores lower than the threshold.
A toy quality score file is provided below:
<contents of toy.qual>
markerID Samp1 Samp2 Samp3 Samp4 Samp5 Samp6 Samp7 Samp8 Samp9 Samp10 Samp10.dup1
Marker1 22 28 26 19 28 28 28 26 22 28 26
Marker2 21 22 22 11 22 04 04 22 22 22 22
Marker3 28 28 26 28 28 28 22 28 28 28 28
Marker4 11 13 18 25 19 16 18 18 20 21 20
Marker5 06 11 03 03 03 03 03 03 03 06 10
toyMarker_X 03 03 03 03 03 03 03 03 03 03 03
toyMarker_Y 22 22 27 27 28 27 26 11 27 23 25
<end of toy.qual>
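The thresholding idea can be sketched in Python as follows: genotypes whose score falls below the threshold are replaced by the missing code 'NN'. The function name `mask_low_quality` and the nested-dictionary layout ({marker: {sample: value}}) are assumptions made for this illustration, not GAINQC internals.

```python
from typing import Dict

Matrix = Dict[str, Dict[str, str]]  # {marker: {sample: value}}


def mask_low_quality(calls: Matrix, scores: Matrix, threshold: float) -> Matrix:
    """Return a copy of calls with low-scoring genotypes set to 'NN'."""
    masked: Matrix = {}
    for marker, by_sample in calls.items():
        masked[marker] = {
            # Scores strictly lower than the threshold are blocked.
            sample: "NN" if float(scores[marker][sample]) < threshold else genotype
            for sample, genotype in by_sample.items()
        }
    return masked
```

With a threshold of 7, the Marker2 genotype of Samp6 in the toy files (score 04) would be treated as missing, while that of Samp5 (score 22) would be kept.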
Algorithm Settings File
The settings file can be used to tailor the algorithm. In this file, one can specify all the thresholds for the tests performed by GAINQC.
In addition, the label key ('STATUS' in our toy.ped example), the headings for the SNP information file ('PREFERRED_ID', 'CHR' etc.) and other
auxiliary settings can also be altered. The file does not need to list every setting: any setting not specified in the file keeps its
default value, and the program runs entirely on defaults if no settings are specified.
The settings in this file are in a KEY=VALUE format. All possible keys can be found at the end of this section. Comment lines can also be included
using C-style comments ('//' at the beginning of the line). Each line may contain only one KEY=VALUE pair.
A toy settings file is shown below:
<contents of toy.settings>
// Don't drop samples because of bad markers
SAMPLE_CALLS_MIN = 0.80
// We don't apply an HWE cutoff at this stage ... those statistics are calculated per sample
MARKER_CALLS_MIN = 0.90
MARKER_HWE_PVALUE = 0
MARKER_MENDEL_RATE = 0.30
MARKER_MAX_MENDEL = 1
// This is the Perlegen recommended threshold for the first pass
QUALITY_THRESHOLD = 7
// We increase this value without decreasing quality very much to speed things up
RELATION_MAF_MIN = .2
//sample label property
SAMPLE_LABEL_KEY = POP
SAMPLE_LSEXODDS = 100
<end of toy.settings>
In this toy settings file, the sample call rate cutoff is set at 80%, the marker call rate threshold is set at 90% etc. Also some
comments are included explaining the choices.
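The way a partial settings file overrides defaults can be sketched in Python as below. The function name `parse_settings` is hypothetical, and converting each value to the type of its default is an assumption made for this illustration; how GAINQC handles unknown keys or malformed lines is not described here.

```python
from typing import Any, Dict


def parse_settings(text: str, defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay KEY=VALUE lines on a dictionary of default settings."""
    settings = dict(defaults)
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("//"):
            continue  # blank line or comment
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not key:
            raise ValueError(f"line {lineno}: expected KEY=VALUE")
        if key in defaults:
            # Reuse the default's type, e.g. float for thresholds.
            value = type(defaults[key])(value)
        settings[key] = value
    return settings
```

For instance, starting from defaults taken from the table of keys, a file that only sets SAMPLE_CALLS_MIN leaves MARKER_CALLS_MIN at 0.95.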
Table of keys for settings file
Key | Type | Default | Description
SAMPLE_CALLS_MIN | Float in [0, 1] | 0.95 | Minimum proportion of called genotypes per sample
SAMPLE_CALLS_ZSCORE | Integer ≥ 0 | 5 | Maximum z-score for sample genotype completeness
SAMPLE_HET_MIN | Float in [0, 0.5] | 0.10 | Minimum sample heterozygosity
SAMPLE_HET_MAX | Float in [0, 0.5] | 0.40 | Maximum sample heterozygosity
SAMPLE_HET_ZSCORE | Integer ≥ 0 | 5 | Maximum z-score for sample heterozygosity
SAMPLE_LSEXODDS | Float | 10.0 | Minimum log odds of being male
SAMPLE_MENDEL_MAX | Float in [0, 1] | 0.02 | Maximum proportion of markers with mendelian errors in a trio containing this sample
SAMPLE_LABEL_KEY | String | | Key used for labeling samples in the pedigree file
SAMPLE_PDF_MAX_BINS | Integer ≥ 1 | 100 | Maximum number of bins in the sample pdf histograms
MARKER_CALLS_MIN | Float in [0, 1] | 0.95 | Minimum call rate for SNPs
MARKER_MISMATCHES_MAX | Integer ≥ 0 | 1 | Maximum number of mismatches among duplicate samples for a marker
MARKER_HWE_PVALUE | Float in [0, 1] | 10^-6 | P-value for the exact test for Hardy-Weinberg Equilibrium
MARKER_MENDEL_RATE | Float in [0, 1] | 0.01 | Maximum proportion of trios with mendelian errors
MARKER_MAX_MENDEL | Integer ≥ 0 | 5 | Maximum number of trios with mendelian errors
MAF_MIN | Float in [0, 0.5] | 0.0 | Minimum minor allele frequency
MIN_ALLELE_COUNT | Integer ≥ 0 | 0 | Minimum minor allele count
MARKER_PDF_MAX_BINS | Integer ≥ 1 | 100 | Maximum number of bins in the marker pdf histograms
RELATION_ZSCORE | Integer ≥ 0 | 5 | Maximum z-score for the relation check
RELATION_MAF_MIN | Float in [0, 0.5] | 0.10 | Minimum minor allele frequency for markers to be used for relationship checks
RELATION_MAX_MEAN_DIFF | Float in [0, 1] | 0.10 | Maximum difference between the mean kinship of all pairs with the same relationship and the expected kinship for the relationship
RELATION_BINS | Integer ≥ 1 | 100 | Number of bins for relationship histograms
GENOTYPING_ERROR | Float in [0, 1] | 0.01 | Genotyping error rate
QUALITY_THRESHOLD | Float ≥ 0 | 0.0 | Minimum quality score needed for a genotype to be considered valid
PREFERRED_ID | String | PREFERRED_ID | Header token for column in SNP information file containing the preferred id for the SNPs
RS_ID | String | RS_ID | Header token for column in SNP information file containing the RS# for the SNPs
BUILD | String | BUILD | Header token for column in SNP information file containing the build of the human genome from which the SNP information was obtained
CHR | String | CHR | Header token for column in SNP information file containing the chromosome for the SNPs
POSITION | String | POSITION | Header token for column in SNP information file containing the position of the SNP on the chromosome
QC_TYPE | String | QC_TYPE | Header token for column in SNP information file containing the QC type for the SNPs
SEQUENCE | String | SEQUENCE | Header token for column in SNP information file containing the flanking sequence (including the alleles) of the SNPs
SOURCE | String | SOURCE | Header token for column in SNP information file containing the source of the genetic material from which the SNP was genotyped
STRAND | String | STRAND | Header token for column in SNP information file containing the strand on which the SNP was genotyped
External Genotype File
Externally obtained genotypes can be used to verify the accuracy of the genotyping process. This file contains the genotypes
obtained from the external source. The file must be in the HapMap genotype format. This file is used along with the genotype
file to compute concordance between the genotypes and also some snp characteristics such as whether the alleles match or are flipped.
The best way to understand this format is to look at an example. More information can be found on
HapMap's website.
A toy external genotype file is shown here:
<contents of toy.external>
rs# SNPalleles chrom pos strand genome_build center protLSID assayLSID panelLSID QC_code Samp1 Samp3 Samp7 Samp8 Samp10
rs11511647 C/T chr10 62765 + ncbi_b36 sanger - - - QC+ CC CT CC CT TT
rs4880608 A/G chr10 83299 + ncbi_b36 affymetrix - - - QC+ GG GG GG GG GG
rs12218882 A/G chr10 84172 + ncbi_b36 perlegen - - - QC+ NN NN AG GG GG
rs10904045 C/T chr10 84426 + ncbi_b36 perlegen - - - QC+ TT CT TT CT CC
rs10751931 C/T chr10 85949 + ncbi_b36 sanger - - - QC+ CC CC CC CC CC
rs11252127 C/T chr10 88087 + ncbi_b36 perlegen - - - QC+ CC CC CC CC CC
rs12775203 C/T chr10 88277 + ncbi_b36 sanger - - - QC+ TT TT TT TT TT
<end of toy.external>
The tokens in the header of the external genotype file are of critical importance. The three key columns that must be present,
other than marker name and genotypes, are 'SNPalleles', 'strand' and 'QC_code'. In order to use the external genotype file, some
SNPs and samples must overlap between the external genotype file and user supplied genotype file. In case no matches are
found, a warning message is output and the external genotype file is not used. If the external genotype file is used, the snp
information file must contain the rs id column with the rs# of the SNPs in the study.
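One simple way to think about concordance is sketched below in Python: for a single SNP, count the overlapping samples whose genotypes agree, ignoring allele order and skipping missing calls. This is an illustrative definition only. The function name `concordance` is hypothetical, and the exact concordance measure and allele-flip handling used by GAINQC are not specified here.

```python
from typing import Dict, Tuple


def concordance(study: Dict[str, str], external: Dict[str, str]) -> Tuple[int, int]:
    """Return (agreeing, compared) over samples present in both sources.

    Each argument maps sample id -> two-letter genotype for one SNP.
    """
    agree = compared = 0
    for sample in study.keys() & external.keys():
        a, b = study[sample], external[sample]
        if a == "NN" or b == "NN":
            continue  # a missing call on either side is not compared
        compared += 1
        if sorted(a) == sorted(b):  # 'TC' and 'CT' are the same genotype
            agree += 1
    return agree, compared
```

Samples that appear in only one of the two files do not enter the count, which mirrors the requirement that some samples must overlap for the external file to be usable.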
Ignore Samples File
In some cases, some samples may need to be excluded from the quality control analysis. This can be accomplished using the ignore samples file. This file
simply contains a list of sample ids, one per line, that must be excluded from the QC run. It is important to note that duplicate sample ids need to be
specified separately, i.e. just specifying the original sample id will not ignore all the duplicates; each duplicate must be listed in the ignore samples file
if they all need to be excluded from the analysis.
A toy ignore sample file is given below:
<contents of toy.exclude>
Samp2
Samp3
Samp10.dup1
<end of toy.exclude>
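The exact-match behaviour of the ignore list can be sketched in Python as follows. The function names `read_ignore_list` and `drop_ignored` are hypothetical and only illustrate the rule that a duplicate is excluded only when its own id is listed.

```python
from typing import List, Set


def read_ignore_list(text: str) -> Set[str]:
    """Collect one sample id per non-blank line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def drop_ignored(samples: List[str], ignored: Set[str]) -> List[str]:
    """Keep samples whose full id, including any duplicate suffix, is not listed."""
    return [name for name in samples if name not in ignored]
```

With the toy list above, 'Samp10.dup1' is dropped but 'Samp10' is kept, because only the duplicate's id appears in the file.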
That is all we need to know about the input file formats. In the next section, we will look at the output file formats.