Instructions and examples for the genetic matching program

Liming Liang

Instructions and examples

The program needs two types of input files:
(1) .ped file(s) that store the genotypes for autosomal markers (in QTDT format). When the -w parameter is larger than 1, the markers should be ordered by physical position so that independent markers can be chosen.

(2) .dat file(s) that describe the columns of the .ped file (in QTDT format). The affection status variable named "Affected" or "affected" is the one that will be analyzed.

The input files share the same name and differ only in the '.ped' and '.dat' suffixes, e.g. 'filename.ped' and 'filename.dat'.
Sometimes genotypes are stored in multiple files, for example one per chromosome. The program accepts multiple .ped and .dat files named 'filename1.ped' and 'filename1.dat' through 'filenameN.ped' and 'filenameN.dat', where 'N' must be specified with the '-c' parameter. You can download the example package (Linux, Windows) to get a sense of the data format and the file names.
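The naming rule above can be checked before a run is launched. The following Python sketch is only an illustration: the helper names `expected_input_files` and `missing_files` are hypothetical, and treating "no count given" as the single-file case ('filename.ped' and 'filename.dat') is an assumption drawn from the description above, not confirmed behavior of the program.

```python
from pathlib import Path


def expected_input_files(prefix, count=None):
    """Return the (.ped, .dat) name pairs implied by the naming rule.

    count=None -> one pair: 'prefix.ped' / 'prefix.dat' (assumed single-file case)
    count=N    -> 'prefix1' ... 'prefixN', where N is the value given to -c
    """
    if count is None:
        stems = [prefix]
    else:
        stems = [f"{prefix}{i}" for i in range(1, count + 1)]
    return [(f"{stem}.ped", f"{stem}.dat") for stem in stems]


def missing_files(prefix, count=None, directory="."):
    """List the expected input files that are absent from `directory`."""
    base = Path(directory)
    return [
        name
        for pair in expected_input_files(prefix, count)
        for name in pair
        if not (base / name).is_file()
    ]


if __name__ == "__main__":
    # The example package uses the prefix 'example' with two file pairs.
    print(expected_input_files("example", 2))
    print(missing_files("example", 2))
```

An empty list from `missing_files` means every .ped file has its matching .dat file in the given directory.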

The rest of this section walks through the example package. Genotypes are stored in 'example1.ped' and 'example2.ped', together with the corresponding .dat files. The package also includes the executable file 'score_match'. The data contain 100 SNPs. We will use the best 50 of them to perform the matching, and then test each of the 100 SNPs for association with the disease status 'affected'. By default, the additive effect model is used. We will also output the similarity score for each case-control pair and the matched sets of cases and controls. This can be done with the following command:

prompt> score_match -f example -c 2 -m 50 -w 1 -o result.out -os score.out -om match.out > print.out

Alternatively, you can use the Generalized Mantel-Haenszel test with the additive effect model. This can be done with:

prompt> score_match -f example -c 2 -m 50 -w 1 -o result.out -os score.out -om match.out --MHtest > print.out

The command tells the program that the file names start with 'example' and that there are 2 genotype files. Fifty SNPs, selected on the basis of a one-sided HWE test, are used to calculate the dissimilarity score for matching. We set the window parameter -w to 1 so that every marker can be a candidate. The association results are written to 'result.out', the scores to 'score.out', and the matching results to 'match.out'.
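When the program is called from a script, the command line above can be assembled from its parts. The Python sketch below uses only the flags that appear in the two commands above (-f, -c, -m, -w, -o, -os, -om and --MHtest). The function names and the default executable name are illustrative assumptions, not part of the program.

```python
import shlex
import subprocess


def build_score_match_command(prefix, n_files, n_markers, window,
                              result_file, score_file, match_file,
                              mh_test=False, executable="score_match"):
    """Assemble the argument list for one score_match run."""
    args = [
        executable,
        "-f", prefix,           # common prefix of the .ped/.dat files
        "-c", str(n_files),     # number of genotype files
        "-m", str(n_markers),   # number of SNPs used for matching
        "-w", str(window),      # window parameter
        "-o", result_file,      # association results
        "-os", score_file,      # pairwise scores
        "-om", match_file,      # matched sets
    ]
    if mh_test:
        args.append("--MHtest")  # Generalized Mantel-Haenszel test
    return args


def run_score_match(args, log_file="print.out"):
    """Run the command and redirect the screen printout to `log_file`."""
    with open(log_file, "w") as log:
        subprocess.run(args, stdout=log, check=True)


if __name__ == "__main__":
    cmd = build_score_match_command("example", 2, 50, 1,
                                    "result.out", "score.out", "match.out")
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems. `run_score_match` is only useful when the executable is on the search path or `executable` is set to its full location.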

The complete list of parameters can be found here.

The screen printout from the program is written to 'print.out'. It shows the parameter values and the status of each step. The genomic control parameter (Devlin & Roeder 1999) is also reported. It is '1' in this simple example, which indicates no stratification.

Results are stored in 'result.out'. The first few rows are listed below:
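For readers unfamiliar with genomic control, the following Python sketch shows one common form of the calculation: the median of the 1-df chi-square statistics divided by the median of the chi-square distribution with 1 degree of freedom (about 0.4549). This is an illustration under stated assumptions. The page does not say whether score_match uses the median or the mean, or whether it floors the parameter at 1. The flooring here is an assumption, and all function names are hypothetical.

```python
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square variable with 1 df

_STD_NORMAL = NormalDist()


def chisq_from_p(p):
    """Convert a 1-df chi-square p-value (0 < p <= 1) back to its statistic."""
    z = _STD_NORMAL.inv_cdf(1.0 - p / 2.0)
    return z * z


def p_from_chisq(x):
    """Upper-tail p-value of a 1-df chi-square statistic."""
    return 2.0 * (1.0 - _STD_NORMAL.cdf(x ** 0.5))


def genomic_control_lambda(chisq_values, floor_at_one=True):
    """Median-based inflation factor; flooring at 1 is an assumption."""
    lam = median(chisq_values) / CHI2_1DF_MEDIAN
    return max(lam, 1.0) if floor_at_one else lam


def adjust_p_values(p_values):
    """Return (lambda, adjusted p-values) for unadjusted 1-df p-values."""
    stats = [chisq_from_p(p) for p in p_values]
    lam = genomic_control_lambda(stats)
    return lam, [p_from_chisq(x / lam) for x in stats]


if __name__ == "__main__":
    # Unadjusted allelic p-values of the first nine SNPs in the example.
    p_chisq = [0.595657, 0.339203, 0.14851, 0.354586, 0.640485,
               0.741572, 0.0393779, 0.815852, 0.886642]
    lam, p_gc = adjust_p_values(p_chisq)
    print(lam, p_gc)
```

With a parameter of 1, the adjusted p-values equal the unadjusted ones, which matches the identical p(chisq) and p(GC) columns in the example output.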

N  SNP    Freq      n_case  n_ctrl  n_group  p(chisq)   p(GC)      coef1       se1       p(clogit)  return_code
1  rs001  0.823209  716     389     387      0.595657   0.595657   -0.0130281  0.131851  0.921269   0
1  rs002  0.879185  716     389     389      0.339203   0.339203   -0.119761   0.141066  0.393708   0
1  rs003  0.836355  716     389     388      0.14851    0.14851    -0.176763   0.129084  0.167787   0
1  rs004  0.441123  716     389     388      0.354586   0.354586   0.0325878   0.110581  0.768176   0
1  rs005  0.730525  716     389     389      0.640485   0.640485   0.0251796   0.128108  0.84421    0
1  rs006  0.842391  716     389     388      0.741572   0.741572   0.0986692   0.131289  0.45337    0
1  rs007  0.730664  716     389     386      0.0393779  0.0393779  -0.250753   0.148833  0.0892587  0
1  rs008  0.899004  716     389     389      0.815852   0.815852   0.0262791   0.161911  0.871168   0
1  rs009  0.865158  716     389     389      0.886642   0.886642   0.0838755   0.146485  0.56775    0

The first column 'N' indicates which dataset the SNP comes from.

'Freq' is the frequency of the first allele in the data.

'n_case' and 'n_ctrl' are the numbers of cases and controls among the genotyped individuals.

'n_group' is the number of matched sets.

'p(chisq)' is the p-value of the allelic chi-square test for association, without adjustment for potential population structure.

'p(GC)' is the p-value of the allelic chi-square test adjusted by the genomic control parameter (Devlin & Roeder 1999).

'coef1' and 'se1' are the estimate and standard error of the additive effect of the SNP from the conditional logistic regression (conditioning on the matched sets).

'p(clogit)' is the likelihood ratio test p-value for additive effect of the SNP in the conditional logistic regression.

'return_code' indicates the status while fitting the conditional logistic model. '0' means normal. Refer to here for the meaning of other codes.
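The columns described above can be post-processed with a short script. The Python sketch below assumes that 'result.out' is whitespace-delimited with a single header row, as the listing above suggests, and that 'coef1' is on the log-odds scale, as is standard for conditional logistic regression. Neither point is confirmed by this page. The function names are hypothetical, and the Wald p-value is shown only as a rough cross-check of the likelihood ratio p-value reported by the program.

```python
import math
from statistics import NormalDist

# Two rows copied from the example output above.
SAMPLE = """\
N SNP Freq n_case n_ctrl n_group p(chisq) p(GC) coef1 se1 p(clogit) return_code
1 rs001 0.823209 716 389 387 0.595657 0.595657 -0.0130281 0.131851 0.921269 0
1 rs007 0.730664 716 389 386 0.0393779 0.0393779 -0.250753 0.148833 0.0892587 0
"""


def parse_results(text):
    """Parse whitespace-delimited results into a list of dicts (string values)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def summarize(row):
    """Odds ratio, 95% interval and Wald p-value from coef1 and se1."""
    coef = float(row["coef1"])
    se = float(row["se1"])
    z = coef / se
    return {
        "SNP": row["SNP"],
        "odds_ratio": math.exp(coef),
        "ci95": (math.exp(coef - 1.96 * se), math.exp(coef + 1.96 * se)),
        "wald_p": 2.0 * (1.0 - NormalDist().cdf(abs(z))),
        "lrt_p": float(row["p(clogit)"]),
    }


if __name__ == "__main__":
    rows = parse_results(SAMPLE)
    # Keep only SNPs whose conditional logistic model fitted normally.
    fitted = [r for r in rows if r["return_code"] == "0"]
    for r in fitted:
        print(summarize(r))
```

To process a real run, replace `SAMPLE` with the contents of 'result.out', for example `open("result.out").read()`.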

University of Michigan | School of Public Health