|
Input files should include an input SNP .vcf file, a gene file and a reference genome file. The gene file and the reference genome provided by the user can come from various gene tracks and assemblies. The latest version accepts gene list tracks such as UCSC known genes, RefSeq genes, Gencode genes, CCDS genes and Ensembl genes, and the assembly of the gene list and the reference genome can be hg16, hg17, hg18 or hg19. The UCSC genome browser is a good place to explore the different tracks and assemblies. By default vcfCodingSnps uses an hg18 UCSC known gene list and the hg18 reference genome. Gene files for other tracks and assemblies are also provided for the user's convenience, so that they do not need to be downloaded separately. All gene files provided with the package are in the folder "geneLists". Users are free to provide their own gene files. To get correct annotations, it is essential for the user to make sure that
1. Example first lines of an input VCF-format SNP file:
2. Example first lines of an NCBI-released B36 reference genome file:
3. If users want to use their own gene file instead, here is a sample pathway for generating an input gene file from the UCSC genome browser:
Go to http://genome.ucsc.edu/ >>> Click "table" >>> Specify the fields required (clade: mammal, genome: human etc.) >>> In the "track" field, select "UCSC gene" >>> get output gene file
3.1. The gene file should be in GenePred table format. The first 11 fields of the gene file are required, tab-delimited, and must appear in the following order:
Note: the 11th field is mandatory for running vcfCodingSnps. In the gene lists provided with the package, this field gives standard gene symbols such as "APOE", "LDL-R" etc. If a gene list you downloaded yourself does not contain such a field, you can simply make the 11th field equal to the first field, which is the gene name in the specific track, with a command like
awk 'BEGIN{FS="\t"} {print $0"\t"$1}' yourGenelist > yourNewGenelist
Note: each start position (geneStartPosition, codingStartposition, exonStartpositions) is a zero-based start (the start position number is the actual base position minus one), while each end position (geneEndPosition, codingEndposition, exonEndpositions) is a one-based end (the end position number is the same as the actual base position).
3.2. If the gene file uses the extended GenePred format, there will be an extra "exonframe" field. For some genes, due to translational frame shifts or other reasons, the exonframe might not match what one would compute using mod 3 when counting codons. In such cases, the program reports the warning message "number of base pairs between code start and code end is not a multiple of three" and still uses the usual mod 3 method for counting codons.
3.3. Detailed instructions on using the table browser can be found at http://genome.ucsc.edu/cgi-bin/hgTables.
3.4. One can specify the region to be the whole genome or any particular gene position (e.g. chr21:33031597-33041570). Here is an example of the first lines of an input gene file:
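As an aside to the notes under 3.1 and 3.2, the coordinate convention and the mod 3 check can be sketched in Python. This is a minimal illustration and not vcfCodingSnps's own code: the function names and the sample numbers are hypothetical, and it assumes that the "base pairs between code start and code end" are counted over the exonic bases that fall inside the coding region, and that exon positions are stored as a comma-separated string.

```python
def parse_positions(field):
    """Split a comma-separated position field (e.g. "100,300,") into ints.

    Empty pieces, such as the one left by a trailing comma, are ignored.
    """
    return [int(piece) for piece in field.split(",") if piece]


def to_one_based(start0, end1):
    """Convert (zero-based start, one-based end) to one-based inclusive.

    Only the start changes: the actual first base is start + 1.
    """
    return start0 + 1, end1


def coding_length(cds_start0, cds_end1, exon_starts0, exon_ends1):
    """Count exonic bases inside the coding region (assumed definition)."""
    total = 0
    for exon_start, exon_end in zip(exon_starts0, exon_ends1):
        # With zero-based starts and one-based ends, length = end - start.
        overlap = min(exon_end, cds_end1) - max(exon_start, cds_start0)
        if overlap > 0:
            total += overlap
    return total


def is_multiple_of_three(cds_start0, cds_end1, exon_starts0, exon_ends1):
    """True when the coding length divides evenly into codons."""
    return coding_length(cds_start0, cds_end1, exon_starts0, exon_ends1) % 3 == 0


if __name__ == "__main__":
    # Hypothetical record: two exons, coding region spanning both.
    starts = parse_positions("100,300,")
    ends = parse_positions("200,400,")
    print(to_one_based(100, 400))
    print(coding_length(150, 350, starts, ends))
    print(is_multiple_of_three(150, 350, starts, ends))
```

Under this convention, a stored start of 33031596 with a stored end of 33041570 corresponds to the one-based region 33031597-33041570, which is the form used in the chr21 example above. A record for which `is_multiple_of_three` returns False is the kind of case in which the warning described in 3.2 would be expected.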
|