Readme | Download |
Home | FAQ |
Update
Introduction
This program is for defining enriched regions on the genome using short sequence read data generated from the ChIP-Seq assay. The algorithm is based on a two-state HMM to help identifying significantly enriched regions.
Setup
Please visit Install page for details
Usage
perl HPeak.pl < -format FORMAT -t TFILE -n NAME > [Options]
Options:
-h Show this help message
-format Format of tag file, "ELAND" or "BED". REQUIRED
-t TFILE Treatment file name. REQUIRED
-n, -name Experiment name used to generate output file. REQUIRED
-c CFILE Control file name.
-f, -fw Estimated DNA fragment width. DEFAULT: 300
-w, -window Window size (bp). DEFAULT: 25
-p, -sig P-value threshold for peak detection. DEFAULT: 1e-3
-wig Whether to generate WIG file for UCSC genome brower
-s, -seq Whether to extract peak sequences
-a, -ann Whether to extract nearest gene information for peaks
Parameters:
–format: input file format:
our program currently allows two types of input formats: ELAND or BED. The first is the format used by ILLUMINA to report of sequence tag location information (see the Figure below).
>PATHBIO-SOLEXA_20F0CAAXX:1:39:231:694
TTTTTTGAGGAATATATGTATATATNTNTGTTGGGT U0
1 1
0 hs_ref_chr5.fa
68396743 F
DD
>PATHBIO-SOLEXA_20F0CAAXX:1:39:856:492
TCTCCCAATTGACCTTTGGGATATGNGNATAAAATT U0
1 0
0 hs_ref_chr7.fa
128899421 F
DD
Columns 7-9 contain mapping information we used are:
7. Genome file in which match was found.
8. Position of match (bases in file are numbered starting at 1).
9. Direction of match (F=forward strand, R=reverse).
The detailed information can be obtained from Illumina.
An alternative input format is the BED format. See http://genome.ucsc.edu/goldenPath/help/customTrack.html#BED for details. Four columns are required in this type of input files: Chromosome, start, end positions and strand (+: positive strand, -- negative strand). They have to occupy the first four columns and need to be in the right order. The sequences do not need to be in order. More columns are allowed, but will be ignored by this program. An example of BED format input file is shown below.
1
51131 51155
+
1 52543
52567 -
1 60254
60278 -
–t, –c: input filenames
The treatment file and control file contains location and names of input files in ELAND or BED format. Each line in these files represents one file. These files will be merged so sequences from multiple sources can be combined. A sample file is shown below.
Example.inp:
/data/GERALD/s_1_eland_result.txt
/data/GERALD/s_2_eland_result.txt
/data/GERALD/s_3_eland_result.txt
–f, –fw: fragment width
The average length of size-selected DNA fragment. For example, if the size of DNA fragment ranges from 100 to 500 bp, then the average fragment width is 300 bp.
–w, –window: window size
This program partitions the genome into small segment so counting of read coverage is conducted in each segment. This strategy allows comparison across samples. Larger window size reduce computation time and file size but lower resolution in defining enriched regions. The default value is 25bp.
–p, –sig: significance level
This is the p-value threshold to call a region significantly enriched. Multiple comparison is adjusted using the Bonferroni method, i.e., p/N, N is total number of regions. The default significance level is 0.001 (same as in Robertson et al. Nature Methods 2007).
–wig: whether to generate WIG format coverage profile file for the enriched regions.
–seq: whether to generate FASTA format sequence files for the enriched regions.
–ann: whether to generate detailed annotation information for the enriched regions.
Detailed annotation includeing AT%, conservation rate, genomic features (exon, intron, intergenic,…) and up- and down- stream gene name. A typical annotation file is showm below:
chr1:1273426-1273850
425 13
GC%: 64.9 Masked%: 0.0
0+/-0.01 Intron
DVL1 NM_182779
-
chr1:1332076-1332350 275
8 GC%: 64.4
Masked%: 8.4 0.31+/-0.44
Exon MRPL20
NM_017971 -
chr1:1499426-1499725 300
10 GC%: 76.0
Masked%: 0.0 0.03+/-0.16
Exon SSU72
NM_014188 -
Output:
The main output file is ***.allregions.txt. This is a BED format file indicating chromosome, start and end locations of all enriched regions
***.sum contains three main parts:
1). The
first part lists all the parameter values entered.
2). The second part contains total number of reads and uniquely mapped reads
from treated and control samples.
3). The third part summarizes the number of enriched regions, total length of
DNA covered by these regions, and range of read coverage in these regions.
Example
perl HPeak.pl -format ELAND -t example.txt -n Example -fw 300 -w 25 -p 1e-3 -wig -seq -ann
[ HPeak ] | Qin Lab | Chinnaiyan Lab ]