HPeak - a HMM-based Peak Finder for Next Generation Sequencing Data

HPeak: A HMM-based algorithm for defining read-enriched regions from massive parallel sequencing data

Introduction

This program is for defining enriched regions on the genome using short sequence read data generated from the ChIP-Seq assay. The algorithm is based on a two-state HMM to help identifying significantly enriched regions.

Setup

Please visit Install page for details

Usage

perl HPeak.pl < -format FORMAT -t TFILE -n NAME > [Options]

Options:

-h Show this help message

-format Format of tag file, "ELAND" or "BED". REQUIRED

-t TFILE Treatment file name. REQUIRED

-n, -name Experiment name used to generate output file. REQUIRED

-c CFILE Control file name.

-f, -fw Estimated DNA fragment width. DEFAULT: 300

-w, -window Window size (bp). DEFAULT: 25

-p, -sig P-value threshold for peak detection. DEFAULT: 1e-3

-wig Whether to generate WIG file for UCSC genome brower

-s, -seq Whether to extract peak sequences

-a, -ann Whether to extract nearest gene information for peaks

Parameters:

–format: input file format:

our program currently allows two types of input formats: ELAND or BED. The first is the format used by ILLUMINA to report of sequence tag location information (see the Figure below).

>PATHBIO-SOLEXA_20F0CAAXX:1:39:231:694 TTTTTTGAGGAATATATGTATATATNTNTGTTGGGT U0 1 1 0 hs_ref_chr5.fa 68396743 F DD

>PATHBIO-SOLEXA_20F0CAAXX:1:39:856:492 TCTCCCAATTGACCTTTGGGATATGNGNATAAAATT U0 1 0 0 hs_ref_chr7.fa 128899421 F DD

Columns 7-9 contain mapping information we used are:

7. Genome file in which match was found.
8. Position of match (bases in file are numbered starting at 1).
9. Direction of match (F=forward strand, R=reverse).

The detailed information can be obtained from Illumina.

An alternative input format is the BED format. See http://genome.ucsc.edu/goldenPath/help/customTrack.html#BED for details. Four columns are required in this type of input files: Chromosome, start, end positions and strand (+: positive strand, -- negative strand). They have to occupy the first four columns and need to be in the right order. The sequences do not need to be in order. More columns are allowed, but will be ignored by this program. An example of BED format input file is shown below.

1     51131       51155       +
1     52543       52567       -
1     60254       60278       -

–t, –c: input filenames

The treatment file and control file contains location and names of input files in ELAND or BED format. Each line in these files represents one file. These files will be merged so sequences from multiple sources can be combined. A sample file is shown below.

Example.inp:

/data/GERALD/s_1_eland_result.txt
/data/GERALD/s_2_eland_result.txt
/data/GERALD/s_3_eland_result.txt

–f, –fw: fragment width

The average length of size-selected DNA fragment. For example, if the size of DNA fragment ranges from 100 to 500 bp, then the average fragment width is 300 bp.

–w, –window: window size

This program partitions the genome into small segment so counting of read coverage is conducted in each segment. This strategy allows comparison across samples. Larger window size reduce computation time and file size but lower resolution in defining enriched regions. The default value is 25bp.

–p, –sig: significance level

This is the p-value threshold to call a region significantly enriched. Multiple comparison is adjusted using the Bonferroni method, i.e., p/N, N is total number of regions. The default significance level is 0.001 (same as in Robertson et al. Nature Methods 2007).

–wig: whether to generate WIG format coverage profile file for the enriched regions.

–seq: whether to generate FASTA format sequence files for the enriched regions.

–ann: whether to generate detailed annotation information for the enriched regions.

Detailed annotation includeing AT%, conservation rate, genomic features (exon, intron, intergenic,…) and up- and down- stream gene name. A typical annotation file is showm below:

chr1:1273426-1273850    425     13      GC%: 64.9       Masked%: 0.0    0+/-0.01        Intron DVL1    NM_182779       -
chr1:1332076-1332350    275     8       GC%: 64.4       Masked%: 8.4    0.31+/-0.44     Exon    MRPL20 NM_017971       -
chr1:1499426-1499725    300     10      GC%: 76.0       Masked%: 0.0    0.03+/-0.16     Exon    SSU72   NM_014188       -

Output:

The main output file is ***.allregions.txt. This is a BED format file indicating chromosome, start and end locations of all enriched regions

***.sum contains three main parts:

1). The first part lists all the parameter values entered.
2). The second part contains total number of reads and uniquely mapped reads from treated and control samples.
3). The third part summarizes the number of enriched regions, total length of DNA covered by these regions, and range of read coverage in these regions.

Example

perl HPeak.pl -format ELAND -t example.txt -n Example -fw 300 -w 25 -p 1e-3 -wig -seq -ann

[ HPeak ] | Qin Lab | Chinnaiyan Lab ]