A Tool for Checking Lane Identity (Under construction!)
Introduction
High-throughput technologies such as microarrays and DNA chips have recently come into wide use in genetics and genomics, and next-generation sequencing is expected to become available and prevalent within the next few years. In a single machine run, a sequencer can easily generate millions of reads (short fragments of DNA) in a short period. These reads can then be assembled and mapped to a reference genome. The ongoing international "1000 Genomes Project" aims to provide a deep catalog of human genetic variation with the help of next-generation sequencing technology. To date, the pilot studies are nearly finished through the efforts of many different centers, and in total more than 10,000 lanes from 180 samples have been generated. A crucial problem is that errors may occur during the manual preparation and labeling of DNA: for example, a lane may be tagged with the wrong sample label, or the DNA may be contaminated. We therefore need a fast and efficient way to evaluate the accuracy of tens of thousands of lanes and to identify potential problems in them. A feasible approach is, for each lane, to map a small number of reads randomly distributed across the genome and compare each base to the available genotypes and to the reference genome.
LaneCheck is an efficient, memory-saving program for detecting potentially problematic lanes that were generated by a shotgun sequencing machine and mapped by sequence assembly software (e.g. KARMA, Maq). It has been widely used in the pilot study of the 1000 Genomes Project. The program requires that the read/lane files have been mapped and stored in either SAM or BAM format, and that genotypes are available for the labeled sample.
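The comparison idea described in the introduction can be illustrated with a small Python sketch. This is not LaneCheck's actual algorithm, which this page does not spell out. It assumes a hypothetical in-memory layout in which the bases observed in one lane and the genotypes of each candidate sample are both keyed by (chromosome, position), and it reports the candidate whose genotypes disagree least with the lane. The function names and the sample names are made up for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical layout for illustration: (chromosome, 1-based position).
Site = Tuple[str, int]


def mismatch_rate(read_bases: Dict[Site, str],
                  genotypes: Dict[Site, str]) -> Optional[float]:
    """Fraction of shared sites where the read base is not an allele of the genotype.

    Returns None when the lane and the sample share no genotyped site.
    """
    compared = 0
    mismatches = 0
    for site, base in read_bases.items():
        genotype = genotypes.get(site)
        if genotype is None:
            continue  # no genotype available at this site
        compared += 1
        if base not in genotype:  # genotype is a two-letter string such as "AA"
            mismatches += 1
    if compared == 0:
        return None
    return mismatches / compared


def best_matching_sample(read_bases: Dict[Site, str],
                         panel: Dict[str, Dict[Site, str]]) -> Tuple[str, float]:
    """Return (sample, rate) for the sample with the lowest mismatch rate."""
    rates = {}
    for sample, genotypes in panel.items():
        rate = mismatch_rate(read_bases, genotypes)
        if rate is not None:
            rates[sample] = rate
    if not rates:
        raise ValueError("no sample shares a genotyped site with this lane")
    best = min(rates, key=rates.get)
    return best, rates[best]


if __name__ == "__main__":
    lane = {("1", 100): "A", ("1", 200): "C", ("2", 50): "G", ("2", 75): "T"}
    panel = {
        "sampleA": {("1", 100): "AA", ("1", 200): "CC", ("2", 50): "GG", ("2", 75): "TT"},
        "sampleB": {("1", 100): "GG", ("1", 200): "CC", ("2", 50): "AA", ("2", 75): "TT"},
    }
    print(best_matching_sample(lane, panel))
```

If the best-matching sample differs from the label attached to the lane, the lane is a candidate for a labeling error. A uniformly elevated mismatch rate against every sample would instead be more consistent with contamination or poor data.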
Input Options
Option | Description
--referencegenome | NCBI36.fa: concatenated chromosome FASTA files from NCBI release 36.3. Download below.
--dbSNPFile | GenomeSNP.dbsnp: packed binary file that marks known dbSNP positions. Download below.
--lanesampleFile | Three tab-delimited columns are required, in the format: S(B)amfile FamilyID PersonID
--pedFile, --datFile, --mapFile | Either standard QTDT or LINKAGE format. See Description here.
--mapquality | Reads with MAPQ > mapquality are retained.
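To make two of the inputs above concrete, here is a minimal Python sketch of how a lane/sample file with three tab-delimited columns could be read, and how SAM alignment lines could be kept only when MAPQ exceeds a threshold. The names `LaneEntry`, `read_lane_sample_file` and `filter_sam_by_mapq` are hypothetical helpers and not part of LaneCheck. The only facts assumed about SAM are that header lines begin with `@` and that MAPQ is the fifth tab-delimited field.

```python
import io
from typing import Iterable, Iterator, List, NamedTuple


class LaneEntry(NamedTuple):
    alignment_file: str  # path to a SAM or BAM file
    family_id: str
    person_id: str


def read_lane_sample_file(handle: Iterable[str]) -> List[LaneEntry]:
    """Parse lines of the form: S(B)amfile <TAB> FamilyID <TAB> PersonID."""
    entries = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 tab-delimited columns, "
                f"got {len(fields)}"
            )
        entries.append(LaneEntry(*fields))
    return entries


def filter_sam_by_mapq(lines: Iterable[str], mapquality: int) -> Iterator[str]:
    """Yield SAM alignment lines whose MAPQ (5th column) is > mapquality."""
    for line in lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        # A SAM alignment line has at least 11 mandatory fields.
        if len(fields) >= 11 and int(fields[4]) > mapquality:
            yield line


if __name__ == "__main__":
    key = io.StringIO("lane1.sam\tFAM1\tP1\nlane2.bam\tFAM1\tP2\n")
    for entry in read_lane_sample_file(key):
        print(entry)
```

The comparison is strict (`>`), matching the description of --mapquality in the table, so a read whose MAPQ equals the threshold is dropped.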
Output Option
Option | Description
--prefix | Prefix for the output, a series of match and mismatch counts relative to non-dbSNP bases in the reference genome and to "good" HapMap homozygous sites.
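The two kinds of counts mentioned above can be sketched in Python as follows. The actual output columns of LaneCheck are not documented on this page, so the counter names (`ref_match`, `ref_mismatch`, `geno_match`, `geno_mismatch`) and the dictionary-based inputs are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def tally_counts(observations: Iterable[Tuple[int, str]],
                 reference: Dict[int, str],
                 dbsnp_positions: Set[int],
                 homozygous_alleles: Dict[int, str]) -> Counter:
    """Count matches and mismatches for observed (position, base) pairs.

    - Against the reference: only at positions that are NOT known dbSNP sites.
    - Against genotypes: only at positions with a homozygous genotype,
      represented here by its single allele (e.g. "A" for genotype A/A).
    """
    counts = Counter()
    for position, base in observations:
        if position in reference and position not in dbsnp_positions:
            key = "ref_match" if base == reference[position] else "ref_mismatch"
            counts[key] += 1
        allele = homozygous_alleles.get(position)
        if allele is not None:
            key = "geno_match" if base == allele else "geno_mismatch"
            counts[key] += 1
    return counts


if __name__ == "__main__":
    reference = {1: "A", 2: "C", 3: "G", 4: "T"}
    dbsnp_positions = {3}
    homozygous_alleles = {3: "A", 4: "T"}
    observations = [(1, "A"), (2, "T"), (3, "A"), (4, "C")]
    print(tally_counts(observations, reference, dbsnp_positions, homozygous_alleles))
```

Excluding known dbSNP positions from the reference comparison means a mismatch there is more likely to reflect a sequencing or mapping error than real variation, while the homozygous-site comparison tests whether the lane agrees with the labeled sample.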
Source Distribution
This version is recommended for Unix users with access to the GNU C++ compiler.
lanecheck-0.0.1 : Pre-compiled binary file, ready to use.
lanecheck-0.0.1.tar.gz : Source code.
If the pre-compiled binary does not work, download the source code, unpack it, and type 'make'.
Reference Genome and dbSNP files: Required input files (850M); click to download, or convert them yourself using the tools in KARMA.
Example
Download the example here and run the command below. NCBI36.fa and GenomeSNP.dbsnp can be downloaded above; give their full paths if you put them in a different folder.
lanecheck --referencegenome NCBI36.fa \
  --dbSNPfile GenomeSNP.dbsnp --lanesamplefile test.key \
  --pedfile test.ped --datfile test.dat --mapfile test.map --prefix test
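When the reference files live in a different folder, the command can be assembled with explicit paths. The Python sketch below only builds the argument list and does not run anything. The helper name `build_lanecheck_command` and the directory `/data/ref` are hypothetical, and the option spellings are copied from the example command above.

```python
import posixpath
import shlex
from typing import List


def build_lanecheck_command(resource_dir: str, prefix: str, key_file: str,
                            ped_file: str, dat_file: str, map_file: str) -> List[str]:
    """Build the lanecheck argument list with full paths to the reference files."""
    return [
        "lanecheck",
        "--referencegenome", posixpath.join(resource_dir, "NCBI36.fa"),
        "--dbSNPfile", posixpath.join(resource_dir, "GenomeSNP.dbsnp"),
        "--lanesamplefile", key_file,
        "--pedfile", ped_file,
        "--datfile", dat_file,
        "--mapfile", map_file,
        "--prefix", prefix,
    ]


if __name__ == "__main__":
    # "/data/ref" is a made-up location for the downloaded reference files.
    command = build_lanecheck_command(
        "/data/ref", "test", "test.key", "test.ped", "test.dat", "test.map"
    )
    print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` once the lanecheck binary is on the PATH.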