vcfCodingSnps v1.5

Input files should include an input SNP .vcf file, a gene file and a reference genome file. The gene file and the reference genome that user provided can be of various gene tracks and assemblies. The latest version takes gene list tracks such as UCSC known genes, RefSeq genes, Genecode genes, CCDS genes and Emsembl genes, and the assembly of the gene list and the reference genome can be of either hg16, hg17, hg18 or hg19. One can explore UCSC genome browser for a better understanding of different tracks and assemblies.

By default vcfColdingSnps uses a hg18 UCSC known gene list and the hg18 reference genome. It also provides versions of other tracks and assemblies at the user's conveinience so that they don't need to download those themselves. All the gene files that provided by the package are put in the folder "geneLists". And users are free to provide their own gene files.

In order to get a correct result of annotations, it is essential for the user to make sure that

  • the genome versions of your .vcf file, gene file and reference genome are MATCH !!!!
  • make sure your gene file is comprehensive enough to get the desired annotation result. Sometimes, users ask questions like "why this SNP is annotated as a nonsynonymous by another annotation tool but vcfCodingSnps doesn't give the nonsynonymous output?". For many cases, it is simply because that the other tool annotated that SNP as nonsynonymous with respect to a specific gene, while that gene is not included in the gene file user provided.

1. Example headlines of input VCF-format SNP file:

  ##format=VCFv3.2
  ##NA12891=../depthFilter/filtered.NA12891.chrom22.SLX.maq.SRP000032.2009_07.glf
  ##NA12892=../depthFilter/filtered.NA12892.chrom22.SLX.maq.SRP000032.2009_07.glf
  ##NA12878=../merged/NA12878.chrom22.merged.glf
  ##minTotalDepth=0
  ##maxTotalDepth=1000
  ##minMapQuality=30
  ##minPosterior=0.9990
  ##program=glfTrio
  ##versionDate=Tue Dec  1 00:42:24 2009
  #CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  NA12891  NA12892  NA12878

  22  14439753  .  a  t  100  mapQ=0  depth=68;duples=homs;mac=2  GT:GQ:DP  1|1:100:40  0|0:81:28  1|0:84:0

  22  14441250  .  t  c  59  mapQ=0  depth=40  GT:GQ:DP  1|1:56:25  1|1:31:15  1|1:32:0

  22  14443154  .  t  g  45  mapQ=9  depth=92;duples=homs;mac=2  GT:GQ:DP  1|1:49:21  0|0:60:20  1|0:100:51
  ... ...

2. Example headlines of NCBI released B36 reference genome file:

>1 dna:chromosome chromosome:NCBI36:1:1:247249719:1
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCT
AACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCT
AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCC
TAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAAC
CCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTACCCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTA
... ...

3. If user want to use his own gene file instead, here is a sample pathway of generating an input gene file from UCSC genome browser

Go to http://genome.ucsc.edu/ >>> Click "table" >>> Specify the fields required (clade: mammal, genome:human etc.) >>> In "track" filed, select "UCSC gene" >>> get output gene file

3.1. Gene file used should be of GenePred table format. The first11 fields of gene file are required tab delimited fields and must be put in the order as following:

1st string name "Name of gene"
2nd string chrom "Chromosome name"
3rd char[1] strand "+ or - for strand"
4th uint txStart "Transcription start position"
5th uint txEnd "Transcription end position"
6th uint cdsStart "Coding region start"
7th uint cdsEnd "Coding region end"
8th uint exonCount "Number of exons"
9th uint[exonCount] exonStarts "Exon start positions"
10th uint[exonCount] exonEnds "Exon end positions"
11th string gene symbol "Standard gene symbol"

Note: the 11th field is a mandatory field for running vcfCodingSnps. In the genelists provided with the package, this field gives the standard gene symbols such as "APOE", "LDL-R" etc. If a genelist downloaded by you own that does not contain such a field, you can simply make the 11th field equal to the first field which is the gene name in a specific track by a syntax like

awk `{FS="\t"; print $0"\t"$1 }` yourGenelist > yourNewGenelist

Note: each start position (geneStartPosition, codingStartposition, exonStartpositions) is zero-based start (the start position number is the actual base position ninus one) while for each end position (geneEndPosition, codingEndposition, exonEndpositions) is one-based end (the end position number is the same as the actual base position).

3.2. If gene file assumes an extended GenePred format, there will be an exctra "exonframe" field. For some genes, due to translational frame shifts or other reasons, the exonframe might not match what one would compute using mod 3 in counting codons. In such cases, the program will report a warning massage that "number of base pairs between code start and code end is not a multiple of three". While we will use the usual mod 3 method for counting codons.

3.3. A detailed instruction on using the table browser could be found at (http://genome.ucsc.edu/cgi-bin/hgTables).

3.4. One can specify the region to be the whole genome or any particular gene position (e.g. chr21:33031597-33041570).

Here is an example of input gene file headlines:

##ucscname  chrom  strand  txStart  txEnd  cdsStart  cdsEnd  exonCount  exonStarts  exonEnds  genename

uc001aaa.2  chr1  +  1115  4121  1115  1115  3  1115,2475,3083, 2090,2584,4121,  BC032353

uc009vip.1  chr1  +  1115  4272  1115  1115  2  1115,2475,  2090,4272,  AX748260

uc001aab.2  chr1  -  4268  14764  4268  4268  10  4268,4832,5658,6469,6716,7095,7468,7777,8130,14600,
  4692,4901,5810,6628,6918,7231,7605,7924,8242,14764,  DKFZp434K1323

uc009viq.1  chr1  -  4268  19221  4268  4268  7   4268,5658,6469,6720,7468,14600,19183,
  4692,5810,6628,6918,7924,14754,19221,  FLJ00038

uc009vir.1  chr1  -  4268  19221  4268  4268  10  4268,4832,5658,6469,6720,7095,7777,8130,14600,19183,
  4692,4901,5810,6628,6918,7605,7924,8229,14754,19221,  FLJ00038

uc001aac.2  chr1  -  4268  19221  4268  4268  11  4268,4832,5658,6469,6720,7121,7468,7777,8130,14600,19183,
  4692,4901,5810,6628,6918,7231,7605,7924,8232,14754,19221,  FLJ00038
  ......