Analysis of ChIP-seq data
This tutorial was inspired by efforts of Mo Heydarian and Mallory Freeberg. Tools higlighted here have been wrapped by Björn Grüning, Marius van den Beek and other IUC members. Dave Bouvier and Martin Cech helped fine tuning and deploying tools to Galaxy's public server.
In this tutorial we will:
- pre-process sequencing reads
- map reads
- post-process mapped data
- assess quality and strength of ChIP-signal
- display coverage plots in a genome browser
- call ChIP peaks with
- inspect obtained calls
- look for sequence motifs within called peaks
- look at distribution of enriched regions across genes.
For this analysis we will be using ChIP-exo datasets. For this experiment immunoprecipitation was performed with antobodies against Reb1. Reb1 recognizes a specific sequence (
TTACCCG) and is involved in many aspects of transcriptional regulation by all three yeast RNA polymerases and promotes formation of nucleosome-free regions (NFRs) (Hartley & Madhani:2009; Raisner:2005).
There are four datasets:
|Reb1_R1||ChIP experiment, Replicate 1|
|Input_R1||Input DNA, Replicate 1|
|Reb1_R2||ChIP experiment, Replicate 2|
|Input_R2||Input DNA, Replicate 2|
|Galaxy data library containing the reads. Here you can see two replicates (
After uploading datasets into Galaxy history we will combine all datasets into a single dataset collection. This will simplify downstream processing of the data. The process for creating a collection for this tutorial is is shown here.
In this particular case the data is of very high quality and do not need to be trimmed or postprocessed in any way before mapping. We will proceed by mapping all the data against the latest version of the yeast genome
|Mapping all data at once. Note that Select input type is set to
BWAon a collection will generate another collection of BAM files. Name this collection
mapped data(for help on how to rename a collection see this video).
For post-processing we will remove all non-uniquely mapped reads. This can be done by simply filtering out all reads with mapping quality less than
20 using NGS: SAMtools → Filter SAM or BAM:
|Filtering multi-mapped reads by restricting the data to reads with mapping quality above 20. Note that by selecting folder () button you can select as entire collection of BAM datasets to filter at once.|
Filter SAM or BAMon a collection will generate another collection of BAM files. Name this collection
filtered data(for help on how to rename a collection see this video).
After we mapped and filtered the reads it is time to make some inferences about how good the underlying data is.
In out experiment there are two replicates, each containing treatment and input (control) datasets. The first thing we can check is if the samples are correlated (in other words if treatment and control samples across the two replicates contain this same kind of signal). To do this we first generate read count matrix using NGS: DeepTools → multiBamSummary.
|Running multiBAMsummary on a collection of BAM datasets (as before you can select collection by pressing folder () button).|
This tool breaks genome into bins of fixed size (10,000 bp in our example) and computes the number of reads falling within each bin. Here is a fragment of its output:
#'chr' 'start' 'end' 'Reb1_R1' 'Input_R1' 'Input_R1' 'Reb1_R2' chrVI 0 1000 19.0 41.0 3.0 6.0 chrVI 1000 2000 29.0 30.0 13.0 5.0 chrVI 2000 3000 0.0 0.0 0.0 0.0 chrVI 3000 4000 0.0 2.0 0.0 0.0 chrVI 4000 5000 7447.0 139.0 7.0 2645.0
we can then feed this matrix into NGS: DeepTools → plotCorrelation to generate heat map like this:
|B. Heatmap of four samples: Treatments (Rab1) and controls (Input) are well correlated among themselves.|
Here we can see that there are good correlations between replicates (between Reb1_R1 and Reb1_R2, and between input_R1 and input_R2), while correlations between treatments (Reb1) and controls (input) are weak. This is a good sign implying that there is some signal on our data.
How do we tell is we do have signal coming from ChIP enrichment? One way of doing this is Signal Extraction Scaling (SES) proposed by Diaz:2012. SES works as follows. Suppose we have two datasets: ChIP and Input DNA. We divide genome into N non-overlapping windows (N = 10 in the example below) and for each window compute the number of reads. This way we end up with two lists: one listing read counts for ChIP (ChIP list) and the other for Input (Input list):
Window ChIP-count Input-count ------------------------------- 1 3 3 2 4 3 3 2 1 4 1 3 5 3 3 6 27 2 7 18 3 8 2 2 9 45 3 10 8 3
We then sort the ChIP list in ascending order and move elements from Input-list to match this order:
Window ChIP-count Input-count ------------------------------- 4 1 3 3 2 1 8 2 2 1 3 3 5 3 3 2 4 3 10 8 3 7 18 3 6 27 2 9 45 3
Now let's add another two columns to this dataset. These columns will show percentage of reads summing up to each row for ChIP and Input data. For example, 0.044 on row 3 is (1 + 2 + 2)/113 = 0.044.
1 2 3 4 5 ------------------------------- 4 1 3 0.008 0.115 3 2 1 0.026 0.153 8 2 2 0.044 0.230 1 3 3 0.070 0.346 5 3 3 0.097 0.230 2 4 3 0.132 0.576 10 8 3 0.203 0.692 7 18 3 0.362 0.807 6 27 2 0.601 0.884 9 45 3 1.000 1.000 ------------------------ 113 26 Where: 1 = Window, 2 = read count in ChIP, 3 = read count in Input 4 = % or read to this point in ChIP 5 = % of read to this point
In the matrix above a large portion of ChIP reads (column 4) is concentrated in the few bins close to the bottom. This is not the case for the input reads (column 5). If we plot two last columns of this matrix we will get a curve like this:
|SES plot for our toy example. Most "reads" in the ChIP experiment are concentrated in the last three bins.|
DeepTools provide a nice explanation of how the success of a ChIP experiment can be judged based on SES (also called fingerprint) plots:
|DeepTools explanation of SES plots.|
So let's apply this to our own data using NGS: DeepTools → plotFingerprint:
|B. SES fingerprint of four samples: Treatments (Rab1) show characteristic shape indicating of ChIP-signal. Approximately 30% of reads are contained in several % of genome.|
In this section we will convert BAM files generated with
bwa into bigWig format that will allow us to view read coverage distribution across the genome. We will also "pre-warm" a genome browser for displaying peaks we will be calling in the next section.
We will use NGS: DeepTools → bamCoverage:
|Running bamCoverage on a collection of filtered BAM datasets (as before you can select collection by pressing folder () button). Here we set Bin size to
|Finally we set Extend reads to the given average fragment size to
bamCoverageon a collection of BAM datasets will generate a collection of bigWig datasets. Name this collection
coverage(for help on how to rename a collection see this video).
Now we can display bigWig datasets generated in the previous section in a genome browser. There is a variety of available browsers. In this tutorial we will use IGV Browser (this video shows how to display multiple datasets in IGV).
|Collection of bigWigs produced by
Clicking this link in all four datasets (you will need to expand each dataset by clicking on it. This will expose IGV link) and focusing browser on MPH1 (YIR002C) gene will produce the following image:
|Coverage distributing within IGV. Here ChIP replicates are colored in orange and controls are blue. All four tracks were set to maximum value of
While the peaks shown in the browser screenshot above are pretty clear and consistent across the two replicates, looking at the entire genome in the browser is hardly a sustainable way to identify all peaks. There are several ways for identifying binding events genome-wide. They are summarized in the figure below:
|Outline of three ChIP-seq binding event detection methods. Peak-finding methods typically either shift the ChIP-seq tag locations in a 3′ direction by half the expected fragment length, or extend the length of the tag in a 3′ direction to be equal to the expected fragment length. Tags from opposite strands are merged to construct an unstranded tag density landscapes, and binding event locations are predicted from the locations with maximum tag coverage within each region that contains a significant enrichment of ChIP-seq tags (i.e. the peak summit). Peak-pairing methods [e.g. GeneTrack build similar tag density landscapes, but retain strandedness information and typically do not shift or extend the tag locations. Peak locations are determined on each strand separately, and nearby peaks in the correct stranded orientation within a given distance are paired together. Binding event locations are predicted from the peak-pair midpoint locations. Probabilistic binding detection methods aim to estimate the locations of binding events that could have given rise to the observed ChIP-seq tag locations. These methods begin training with initial guesses of binding event locations and a model of how tags are expected to be distributed around real ChIP-seq binding events. During each training step, every ChIP-seq tag is probabilistically associated with nearby binding events, depending on the distance between the tag and the event location. Given these probabilistic tag assignments, binding event locations are updated to achieve a better fit with their associated tags, and the model of how tags are distributed around binding events is updated to reflect the accumulation of tags around all current binding events. During the training process, binding events with few assigned tags are weeded out of the model, and the process eventually converges to a set of final binding locations. (Figure and legend from Mahony and Pugh:2015).|
In this tutorials we will use MACS2 peak caller.
MACS (or its current version
MACS2) performs several steps for calling peaks from paired treatment/control datasets:
|Steps of the MACS workflow (From Feng:2012).|
Here is a concise description of these steps:
- Removing redundancy - MACS retains uniquely mapped reads and removes reads that are repeatedly mapped to the same location. This reduces effects of PCR amplification biases during library preparation.
- Build model and estimate fragment size - one of the MACS inputs is the fragment size or bandwidth, which is approximate size of DNA fragments generated during fragmentation step of library preparation. MACS first slides a window sized at twice the bandwidth across the genome and finds instances where read counts enriched by between 10 and 30 fold relative to the genome background. It then randomly samples 1,000 of such regions and build the model. To build the model it separates reads mapping to each of the strands and build two distributions (two modes). The midpoint between the two modes is the middle of the binding size and the distance between the modes is the fragment size
d(see Figure below).
|Peaks mapped to two strands are treated separately to build two coverage density profiles - two two modes. The distance between the modes is the fragment size
Generate peaks - now that d has been defined MACS slides a window of size 2d across the genome to identify regions significantly enriched in the ChIP sample. MACS assumes that background reads obey Poisson distribution. Thus given the number of reads in a given interval within the control sample we can calculate the probability of having observed number of reads in the ChIP sample (e.g., see flood example here). This procedure is performed for several intervals around the examined location (2d, 1kb, 5kb, 10kb, and the whole genome) and the maximum value is chosen. One problem with this approach is that it only works if both samples (ChIP and control) are sequenced to the depth, which is not usually happening in practice. To correct with this MACS scales down the larger sample.
Compute False Discovery Rate (FDR) - Feng:2012 explains computing FDR in MACS as follows: "When a control sample is available (and you should really always use it - AN), MACS can also estimate an empirical FDR for every peak by exchanging the ChIP-seq and control samples and identifying peaks in the control sample using the same set of parameters used for the ChIP-seq sample. Because the control sample should not exhibit read enrichment, any such peaks found by MACS can be regarded as false positives. For a particular P value threshold, the empirical FDR is then calculated as the number of control peaks passing the threshold divided by the number of ChIP-seq peaks passing the same threshold."
In our case we have two replicates each containing ChIP and input DNA samples. We will first run
MACS2 on pooled data (combining two ChIP samples and two inputs, respectively). We will then run
MACS2 on each replicate individually. Finally, we will pick a robust set of peaks present in all three callsets.
One complication with the way we processed all data is that we have combined everything in a single dataset collection. MACS however will need for us to separate ChIP samples and controls. Fortunately for us we have set readgroups when we were mapping reads to the yeast genome. This will come handy for us right now because we will:
- merge the entire collection of mapped and filtered BAMs into a singe BAM dataset
- split this dataset into four separate BAM files using readgroups
- run MACS on resulting files.
First, to merge a collection of mapped, filtered BAM files into a single dataset we will use NGS: Picard → MergeSamFiles:
|Merging a collection with
Next, we will use NGS: SAMtools → Split to separate merged file into individual BAM files. Each resulting BAM file will contained aligned reads corresponding to original four datasets:
|Splitting BAM dataset on readgroups. This will produce four BAM datasets.|
|Resulting datasets. Each contains aligned reads from the four original conditions.|
Now it is time to run MACS2. First we will use NGS: Peak calling → MACS2 predictd tool from the
MACS2 package. This tool will help us to find optimal parameters for running peak calling function of
This procedure will help us estimate the d parameter by performing the [cross-correlation] analysis between reads mapping to + and - strands. Let's look at these results:
|Replicate 1||Replicate 2|
|Peak model and lag between strands.|
In the case of these data peaks are very sharp and have narrow gap between them:
33 bp for replicate 1 and 2, respectively. We will use an average of these values,
--extsize parameter for calling peaks using NGS: Peak calling → MACS2 callpeak:
|Calling peaks with
|In this lower part of
If you set parameters as was shown above
MACS2 will produce two outputs (if it produced more just find the ones called
narrow peaks and
summits). Let's click on the pencil icon() adjacent to
narrow peak datasets and rename then as shown below:
Next, we will run
MACS2 on BAM datasets for Replicate 1 only:
|Calling peaks with
- rename resulting datasets as
MACS2run on Replicate 2
- rename resulting
narrow peakdatasets as
In the end you should have something like this:
Looking at MACS2 data we have gotten the following numbers of peaks:
|Pooled||Replicate 1||Replicate 2|
Peaks data is generated in the following format:
1 2 3 4 5 6 7 8 9 10 -------------------------------------------------------------------- chrI 35 491 MACS2_peak_1 176 . 10.35332 21.51081 17.68957 101 chrI 87135 87212 MACS2_peak_2 127 . 7.71763 15.89278 12.78060 26 chrI 92612 92793 MACS2_peak_3 153 . 9.22373 18.72748 15.31966 49 chrI 119739 119782 MACS2_peak_4 78 . 6.08885 10.52482 7.82302 25
where columns are:
- Iterative id given by
- Integer score for display
- Strand (irrelevant in this case)
- Fold-change (fold enrichment for this peak summit against random Poisson distribution with local lambda)
- -log10P-value (e.g., 17.68 is 1 x 10-17)
- -log10Q-value from Benjamini–Hochberg–Yekutieli procedure
- Relative summit position to peak start
To see how many peaks are common between the pooled datasets and the two replicates we will use Operate on Genomic Intervals → Join tool twice.
First we will join
Peaks pooled with
|Joining Pooled and R1 results with
Next we will join the result of the previous operation with
|Joining Pooled/R1 with R2 results with
This results in 723 regions are shared among polled, R1, and R2 peaks. Let's call this High confidence set. Before we can use it however, let's cut out only relevant columns. Since we have produced this dataset by joining three other datasets it is three times wider (30 columns). To cut this first three columns we can use Text Manipulation → Cut columns tool:
|Cutting columns from
High confidence set. This will make it easy to find as we continue.
Cut columnstool produces a dataset of tabular type. However, by cutting the first ten columns we have created a dataset in BED format. Thus we need to let Galaxy know about that by resetting metadata as shown below.
Next we need to make sure that output of
Cut columns tool has the type
BED. To do this we will edit its metadata as show below:
|Setting metadata to datatype
Let's visualize Merged peaks as well as Narrow peaks and Summits produced by
MACS2 in IGV by clicking on
display with IGV local links adjacent to
Peaks pooled and
High confidence set datasets (you should already have browser open):
|An overview in IGV. Here you can see original bigWig datasets along with predicted peaks.|
In this experiment antibodies against Reb1 protein have been used for immunoprecipitaion. The recognition site for Reb1 is
TTACCCG (Badis:2008 and Harbison:2004). To find out which sequence motifs are found within our peaks we first need to convert coordinates into underlying sequences. This is done using Fetch Alignments/Sequences → Extract Genomic DNA tool:
|Extracting genomic DNA corresponding to ChIP-seq peaks. Here we use
Next, we need to make sure that all sequences are sufficiently long for finding patterns. MEME, the tools we will use to find motifs, required sequences to be at least 8 nucleotides long. So we will remove short sequences using FASTA manipulation → Filter sequences by length tool:
|Filtering FASTA by length. Here we are removing all sequences shorted than 8 nucleotides.|
Now we can run Motif Tools → MEME:
|Running MEME on length-filtered FASTA sequences from the previous step. Note that Options configuration is set to
MEME generates a number of outputs. The most interesting is HTML Report. It shows that 620 regions contain
|MEME Motif found in 620 sequences corresponding to common peak regions.|
How many genes contain upstream regions enriched in ChIP tags. This is often represented as a heatmap:
|Heatmap example from DeepTools documentation.|
To generate the heatmap we must first produce normalized datasets for the two replicated we have. This is done using NGS: DeepTools → bamCompare tool:
Because we want to plot enrichment around genes we need to download gene annotation. We will use Get Data → UCSC Main for this:
|Getting data from UCSC. Here make sure you select assembly called
|Here just click Send query to Galaxy.|
Next, to prepare data necessary for drawing the heatmap we will use NGS: DeepTools → computeMatrix utility:
|Computing matrix - the data from which heatmap will be built. Here both normalized datasets are select within Score file box, yeast genes we have just downloaded from UCSC are chosen as Regions to plot. 'reference-point is set as computeMatrix main option and, finally, upstream and downstream distances are set to 2,000 bp. Obviously you are welcome to play with these parameters.|
Finally, we can visualize the heatmap by using NGS: DeepTools → plotHeatmap tool:
|Drawing heatmap with
The resulting image shows that a significant fraction of 6,692 genes present in the annotation data we have used contain Reb1 binding sites within their upstream regions:
|Heatmap showing distribution of Reb1 binding sites across upstream regions of 6,692 yeast genes.|
This entire analysis is available as a Galaxy history here. Import it and play with it.
Hopefully this tutorial has given you the taste for what is possible. There are more tools out there so experiment! If things do not work - complain using
Open Chat button below or our support forum.