University of Michigan Center for Statistical Genetics
 
 

 
 

A Tool for Checking Lane Identity (Under construction!)

Introduction

High-throughput technologies such as microarrays and DNA chips have recently become widely used in genetics and genomics, and next-generation sequencing is expected to become widely available over the next few years. In a single machine run, a sequencer can easily generate millions of reads, which are short fragments of DNA, in a short period. These reads can then be assembled and mapped to a reference genome. The ongoing international 1000 Genomes Project aims to provide a deep catalog of human genetic variation using next-generation sequencing technology. To date, the pilot studies are nearly finished thanks to the efforts of many different centers, and in total more than 10,000 lanes have been generated for 180 samples. A crucial problem is that errors may occur during the manual preparation and labeling of DNA: for example, a lane may be tagged with the wrong sample label, or the DNA may be contaminated. We therefore need a fast and efficient way to evaluate the accuracy of tens of thousands of lanes and to identify potential problems. A feasible approach is, for each lane, to map a small number of reads randomly distributed across the genome and compare each base to the available genotypes and to the reference genome.
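The comparison idea described above can be illustrated with a minimal Python sketch. This is not LaneCheck's implementation: the function name mismatch_rate, the dictionary layout keyed by chromosome and position, and the toy values are all assumptions made for illustration. The sketch counts how often the bases observed in a lane disagree with a sample's known homozygous genotypes. A mislabeled lane would be expected to show a noticeably higher mismatch rate against its labeled sample than against the sample it really came from.

```python
def mismatch_rate(observed, homozygous):
    """Return the fraction of mismatching bases at shared sites.

    observed:   {(chrom, pos): [bases seen in reads covering the site]}
    homozygous: {(chrom, pos): allele} for homozygous genotyped sites
    Both layouts are hypothetical, chosen only for this sketch.
    """
    matches = mismatches = 0
    for site, bases in observed.items():
        allele = homozygous.get(site)
        if allele is None:
            continue  # site has no homozygous genotype in this sample
        for base in bases:
            if base == allele:
                matches += 1
            else:
                mismatches += 1
    total = matches + mismatches
    return mismatches / total if total else float("nan")


# Toy data (made up): bases observed in one lane at three sites.
lane = {("1", 1000): ["A", "A"], ("1", 2000): ["C"], ("2", 500): ["G", "G"]}

# Homozygous genotypes for two hypothetical samples.
labeled_sample = {("1", 1000): "A", ("1", 2000): "C", ("2", 500): "G"}
other_sample = {("1", 1000): "T", ("1", 2000): "C", ("2", 500): "A"}

rate_labeled = mismatch_rate(lane, labeled_sample)  # all 5 bases match
rate_other = mismatch_rate(lane, other_sample)      # 4 of 5 bases mismatch
```

In this toy case the lane agrees perfectly with its labeled sample and disagrees at most bases with the other sample, which is the pattern expected for a correctly labeled lane.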

LaneCheck is an efficient, memory-saving program for detecting potentially problematic lanes that were generated by a shotgun sequencing machine and mapped by sequence assembly software (e.g. KARMA, Maq). It has been widely used in the pilot studies of the 1000 Genomes Project. The program requires that the read/lane files have already been mapped and stored in either SAM or BAM format, and that genotypes are available for each labeled sample.

Input Options

Option                            Description
--referencegenome                 NCBI36.fa: concatenated chromosome FASTA files from NCBI release 36.3. Download below.
--dbSNPFile                       GenomeSNP.dbsnp: packed binary file that marks known dbSNP positions. Download below.
--lanesampleFile                  Three required tab-delimited columns, in the order: SAM (or BAM) file, FamilyID, PersonID.
--pedFile, --datFile, --mapFile   Pedigree, data, and map files in either standard QTDT or LINKAGE format. See the description here.
--mapquality                      Only reads with MAPQ > mapquality are retained.
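As a rough illustration of two of the inputs above, the following Python sketch reads a lane-sample file with three tab-delimited columns and applies the documented MAPQ rule. It is a sketch under stated assumptions, not code from LaneCheck: the function names read_lane_sample_file and passes_mapq, and every file name and ID in the sample data, are invented for this example.

```python
import csv
import io


def read_lane_sample_file(handle):
    """Parse three tab-delimited columns: SAM/BAM file, FamilyID, PersonID."""
    lanes = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(
                f"expected 3 tab-delimited columns, got {len(row)}: {row}"
            )
        lanes.append(tuple(row))
    return lanes


def passes_mapq(mapq, mapquality):
    """Mirror the documented rule: keep reads with MAPQ > mapquality."""
    return mapq > mapquality


# Hypothetical key-file content; file names and IDs are invented.
example = io.StringIO("lane1.bam\tFAM1\tPERSON1\nlane2.sam\tFAM1\tPERSON2\n")
lanes = read_lane_sample_file(example)

# With a threshold of 10, a read with MAPQ exactly 10 is not retained.
kept = [q for q in (0, 10, 20, 30) if passes_mapq(q, 10)]
```

Checking the column count up front catches a common mistake, a key file delimited by spaces instead of tabs, before any mapped reads are processed.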

Output Option

Option                            Description
--prefix                          Prefix for the output, which reports a series of mismatch and match counts measured against non-dbSNP bases in the reference genome and against "good" HapMap homozygous sites.

Source Distribution

This version is recommended for Unix users with access to the GNU C++ compiler.

 
lanecheck-0.0.1: Pre-compiled binary file, ready to use.

lanecheck-0.0.1.tar.gz: Source code. If the pre-compiled binary does not work, download the source code, unpack it, and type 'make'.

Reference genome and dbSNP files: Required input files (850M). Click to download, or convert them yourself using the tools in KARMA.

Example

Download the example here and run the command below. NCBI36.fa and GenomeSNP.dbsnp can be downloaded above; if you put them in a different folder, give their full paths on the command line.
lanecheck --referencegenome NCBI36.fa \
  --dbSNPfile GenomeSNP.dbsnp --lanesamplefile test.key \
  --pedfile test.ped --datfile test.dat --mapfile test.map --prefix test

 
 

University of Michigan | School of Public Health