vcfCodingSnps v1.5


1. Introduction

vcfCodingSnps is a variant annotation tool that annotates genetic variants such as single nucleotide polymorphisms (SNPs) and short insertions and deletions (INDELs) in a VCF-format input file. It takes a VCF file as input and generates an annotated VCF file as output.

Given a VCF SNP (or short INDEL) file, vcfCodingSnps annotates each variant, one per row, according to a user-specified gene list and reference genome. The gene list and reference genome can come from various gene tracks and assemblies. The latest version accepts gene list tracks such as UCSC known genes, RefSeq genes, GENCODE genes, CCDS genes and Ensembl genes, and the assembly of the gene list and the reference genome can be hg16, hg17, hg18 or hg19. The UCSC Genome Browser is a good place to learn more about the different tracks and assemblies. By default, vcfCodingSnps uses the hg18 UCSC known gene list and the hg18 reference genome. Versions for other tracks and assemblies are also provided for convenience, so users do not need to download them themselves. More detailed information about the input files is available in the input file documentation.

For each variant in the input VCF file, vcfCodingSnps annotates it within every gene region that covers it (including the upstream and downstream regions of the gene) and writes the result to the output VCF file. For each gene in the input gene list, vcfCodingSnps also lists the annotation results for the variants that lie in that gene region in the output log file, to facilitate gene-based analyses. More information about the output files is available in the output file documentation.

 

2. Functional Categories

The functional annotation categories integrated into the current version of vcfCodingSnps are:

* For single nucleotide polymorphisms (SNPs):

Category                Definition used in vcfCodingSnps

stop gained             A SNP in the coding sequence that introduces a TAG, TAA, or TGA stop codon.
stop lost               A SNP in the coding sequence that causes the loss of a TAG, TAA, or TGA stop codon.
non-synonymous coding   A SNP in the coding sequence, located in a codon, that changes the amino acid, excluding SNPs that qualify as stop gained or stop lost.
synonymous coding       A SNP in the coding sequence, located in a codon, that does not change the amino acid.
essential splice site   A SNP that changes the highly conserved GU in the first two base pairs of the intron or the AG in the last two base pairs of the intron.
splice site             A SNP occurring 3 to N1 base pairs into the intron, or within N2 base pairs into the exon. N1 defaults to 8 and N2 defaults to 3; both can be set with the options --n1 and --n2.
5' UTR                  A SNP located within the 5' UTR of a transcript.
3' UTR                  A SNP located within the 3' UTR of a transcript.
intronic                A SNP in an intron of a known gene that cannot be classified as essential splice site or splice site.
upstream                A SNP located within N kb of the transcript start site (5' end) of a known gene. N defaults to 5 and can be set with the option --ns.
downstream              A SNP located within N kb of the transcript end site (3' end) of a known gene. N defaults to 5 and can be set with the option --ns.
intergenic              A SNP not located within a known gene and not classified as upstream or downstream of a known gene.
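
The position-based categories above can be illustrated with a small Python sketch. This is not the vcfCodingSnps implementation; the function names, the 1-based offset convention, and the strand handling are assumptions made for illustration, and the essential splice site check is simplified to a position test rather than an inspection of the GU/AG bases.

```python
def classify_intron_position(offset, n1=8):
    """Classify a SNP by how far it lies into an intron.

    offset is 1-based (assumed convention): 1 is the intronic base
    directly adjacent to the nearest exon boundary.
    """
    if offset < 1:
        raise ValueError("offset must be >= 1")
    if offset <= 2:
        # First or last two base pairs of the intron (the GU / AG sites).
        return "essential splice site"
    if offset <= n1:
        # Base pairs 3 to N1 into the intron.
        return "splice site"
    return "intronic"


def classify_exon_edge(offset, n2=3):
    """Return 'splice site' if the SNP is within n2 bases of the exon edge.

    Returns None otherwise, meaning another category (coding, UTR) applies.
    """
    if offset < 1:
        raise ValueError("offset must be >= 1")
    return "splice site" if offset <= n2 else None


def classify_flank(pos, tx_start, tx_end, strand="+", ns_kb=5):
    """Classify a SNP outside a transcript as upstream, downstream or intergenic.

    Assumes tx_start <= tx_end in genome coordinates and that the 5' end
    is tx_start on the '+' strand and tx_end on the '-' strand.
    """
    if tx_start <= pos <= tx_end:
        return None  # inside the transcript; not a flanking category
    window = ns_kb * 1000
    if pos < tx_start:
        distance, before = tx_start - pos, True
    else:
        distance, before = pos - tx_end, False
    if distance > window:
        return "intergenic"
    five_prime_side = before if strand == "+" else not before
    return "upstream" if five_prime_side else "downstream"
```

For example, `classify_intron_position(9)` returns "intronic" with the default N1 of 8, and `classify_flank(6000, 10000, 20000)` returns "upstream" because the SNP lies 4 kb before the start of a plus-strand transcript.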

A toy example illustrates how SNPs are assigned to the categories above. Figure 1 shows a fragment of a reference genome with a gene located from bp 2 to bp 67 (from "Gene start" to "Gene end"). The upper-case letter in each box is the reference allele at that bp position on the reference genome. The blue blocks represent the exons of the gene. The yellow and orange bars below the box array represent codons in the coding region. We annotate ten SNPs, marked by bold red arrows above the corresponding bp positions. The alternative allele of each SNP and the annotation results are given in Table 1 below.

Figure 1. A toy example of annotating 10 SNPs in a gene on the reference genome

Table 1. Alternative alleles and annotation results
Pos  Alt Allele  Ref Allele  Alt Codon  Ref Codon  Alt AA  Ref AA  Anno Type
3    G           A           --         --         --      --      5'UTR
5    A           C           CAT        CCT        His     Pro     Non_Synonymous
13   G           T           CCG        CCT        Pro     Pro     Synonymous
25   C           A           TAC        TAA        Tyr     Stop    Stop Loss
43   A           C           --         --         --      --      Splice Site
44   G           T           --         --         --      --      Intronic
50   C           A           --         --         --      --      Essential Splice Site
53   A           C           ACC        CCC        Thr     Pro     Non_Synonymous
66   C           T           --         --         --      --      3'UTR
72   A           C           --         --         --      --      Downstream
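
The codon-based rows of Table 1 can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the tool's own code: the function name `classify_coding_snp` is hypothetical, the standard genetic code is assumed, and both codons are assumed to be given on the coding strand.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, alt_codon):
    """Classify a coding SNP from its reference and alternative codons."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if alt_aa == "*" and ref_aa != "*":
        return "stop gained"
    if ref_aa == "*" and alt_aa != "*":
        return "stop lost"
    if ref_aa != alt_aa:
        return "non-synonymous coding"
    return "synonymous coding"


if __name__ == "__main__":
    # (position, reference codon, alternative codon) taken from Table 1.
    for pos, ref, alt in [(5, "CCT", "CAT"), (13, "CCT", "CCG"),
                          (25, "TAA", "TAC"), (53, "CCC", "ACC")]:
        print(pos, classify_coding_snp(ref, alt))
```

Checking stop gained and stop lost before comparing amino acids mirrors the definition of non-synonymous coding, which excludes SNPs that fall into either stop category.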

 

3. A Quick Start Guideline

Here is an example for first-time users of vcfCodingSnps. After installation, from the root folder of the package, type

./vcfCodingSnps.v1.5 -s example/example.input.vcf

and you should see screen output like the following:

       ##################################################################################################          

vcfCodingSnps1.5 -- vcf SNP annotating tool
(c) 2010.5 Yanming Li, Goncalo Abecasis
Commend and (or) suggestions are welcome! Please send to liyanmin@umich.edu.

##################################################################################################

The following parameters are in effect:

Availabe Options
Input Files : --refgenome [referenceGenomes/genome.V36.fa],
--snpfile [example/example.input.vcf],
--genefile [geneLists/UCSCknownGene.B36.txt], --n1 [8],
--n2 [3], --ns [5]
Output Files : --outfile [vcfCodingSNP.out.vcf], --log [ON]

Reading chromosome >1 dna:chromosome chromosome:NCBI36:1:1:247249719:1...
Reading chromosome >2 dna:chromosome chromosome:NCBI36:2:1:242951149:1...
Reading chromosome >3 dna:chromosome chromosome:NCBI36:3:1:199501827:1...
Reading chromosome >4 dna:chromosome chromosome:NCBI36:4:1:191273063:1...
Reading chromosome >5 dna:chromosome chromosome:NCBI36:5:1:180857866:1...
Reading chromosome >6 dna:chromosome chromosome:NCBI36:6:1:170899992:1...
Reading chromosome >7 dna:chromosome chromosome:NCBI36:7:1:158821424:1...
Reading chromosome >8 dna:chromosome chromosome:NCBI36:8:1:146274826:1...
Reading chromosome >9 dna:chromosome chromosome:NCBI36:9:1:140273252:1...
Reading chromosome >10 dna:chromosome chromosome:NCBI36:10:1:135374737:1...
Reading chromosome >11 dna:chromosome chromosome:NCBI36:11:1:134452384:1...
Reading chromosome >12 dna:chromosome chromosome:NCBI36:12:1:132349534:1...
Reading chromosome >13 dna:chromosome chromosome:NCBI36:13:1:114142980:1...
Reading chromosome >14 dna:chromosome chromosome:NCBI36:14:1:106368585:1...
Reading chromosome >15 dna:chromosome chromosome:NCBI36:15:1:100338915:1...
Reading chromosome >16 dna:chromosome chromosome:NCBI36:16:1:88827254:1...
Reading chromosome >17 dna:chromosome chromosome:NCBI36:17:1:78774742:1...
Reading chromosome >18 dna:chromosome chromosome:NCBI36:18:1:76117153:1...
Reading chromosome >19 dna:chromosome chromosome:NCBI36:19:1:63811651:1...
Reading chromosome >20 dna:chromosome chromosome:NCBI36:20:1:62435964:1...
Reading chromosome >21 dna:chromosome chromosome:NCBI36:21:1:46944323:1...
Reading chromosome >22 dna:chromosome chromosome:NCBI36:22:1:49691432:1...
Reading chromosome >23 dna:chromosome chromosome:NCBI36:X:1:154913754:1...
start mapping
mapping snp file... ...DONE! snp mapsize = 29642
mapping gene file... ...DONE! gene mapsize = 66803
start annotating... ...
DONE! Complete annotating!!!

This should give you a feel for how vcfCodingSnps works.

 

4. Command Line Options

-s    SNP file               Specifies the name of the input VCF-format SNP file.
-r    reference genome file  Specifies the name of the input reference genome FASTA file. It should be either NCBI release 36 (hg18) or GRCh37 (hg19). By default the NCBI36 reference genome is loaded. Other versions of the reference genome files can be downloaded from the download page.
-g    gene file              Specifies the name of the input gene file. By default a gene file generated by the UCSC genome browser (UCSCknownGene.B36.txt) is used.
-o    output file            Specifies the name of the output VCF-format SNP file. The default name is vcfCodingSNP.out.vcf.
-l    log file               Specifies the name of the log file, which gives more detailed information for each annotated SNP. The default name is vcfCodingSNP.log.
--n1  parameter              User-defined number of bp into the intron for a splice site. The default is 8.
--n2  parameter              User-defined number of bp into the exon for a splice site. The default is 3.
--ns  parameter              User-defined number of kb for the upstream and downstream range of a gene. The default is 5.
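
When vcfCodingSnps is driven from a script, the options above can be assembled programmatically. The following Python sketch only builds the command line; the helper name `build_command` is hypothetical, the executable path is taken from the quick start example, and passing each option and its value as separate space-delimited arguments is an assumption based on that example.

```python
import shlex


def build_command(snp_file, ref_genome=None, gene_file=None, out_file=None,
                  log_file=None, n1=None, n2=None, ns=None,
                  executable="./vcfCodingSnps.v1.5"):
    """Build a vcfCodingSnps argument list from the documented options.

    Options left as None are omitted so the tool falls back to its defaults.
    """
    cmd = [executable, "-s", snp_file]
    optional = [
        ("-r", ref_genome),
        ("-g", gene_file),
        ("-o", out_file),
        ("-l", log_file),
        ("--n1", n1),
        ("--n2", n2),
        ("--ns", ns),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    # Widen the intronic splice site window and narrow the flanking range.
    cmd = build_command("example/example.input.vcf", n1=10, ns=2)
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the root folder of the package to launch the annotation.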