ChIP'ing the mammalian genome: technical advances and insights into functional elements

Characterization of the functional components in mammalian genomes depends on our ability to completely elucidate the genetic and epigenetic regulatory networks of chromatin states and nuclear architecture. Such endeavors demand the availability of robust and effective approaches to characterizing protein-DNA associations in their native chromatin environments. Considerable progress has been made through the application of chromatin immunoprecipitation (ChIP) to study chromatin biology in cells. Coupled with genome-wide analyses, ChIP-based assays enable us to take a global, unbiased and comprehensive view of transcriptional control, epigenetic regulation and chromatin structures, with high precision and versatility. The integrated knowledge derived from these studies is used to decipher gene regulatory networks and define genome organization. In this review, we discuss this powerful approach and its current advances. We also explore the possible future developments of ChIP-based approaches to interrogating long-range chromatin interactions and their impact on the mechanisms regulating gene expression.


Introduction
Now that the complete human genome sequence is available [1], the current challenges are to identify all the functional genetic elements it encodes and to elucidate the complex regulatory networks that coordinate the function of all genetic and epigenetic elements that are crucial for cellular homeostasis, development and disease progression [2,3]. Hence, research focus has turned to the annotation of the genome for functional properties and the characterization of regulatory elements involved in controlling gene expression, gene function and genome stability.
Among all the functional features of genome activities, dissecting the complex regulatory mechanisms controlling the precise spatial and temporal patterns of gene expression is critical for understanding developmental and cellular processes. The regulation of genome functions is largely mediated through highly controllable, dynamic and transient protein-chromatin interactions. In eukaryotes, genomic DNA is packaged by an octamer of four core histones into a nucleosome, the basic building block of chromatin. The intimate associations between DNA, histones and regulatory protein complexes within nucleosomes are critical for many nuclear activities such as transcription, DNA repair and replication (Figure 1) [4]. A detailed characterization of chromatin-DNA interactions is therefore required for understanding the molecular mechanisms behind gene regulation.
Significant efforts have been dedicated to deciphering global chromatin structures, modifications and chromatin-protein interactions. Due to the dynamic and transient nature of such interactions, early attempts using biochemical fractionation were problematic [5]. Thanks to a powerful approach called chromatin immunoprecipitation (ChIP) [4], our understanding of protein-DNA interactions within their native chromatin context, in relation to different nuclear activities, has been greatly advanced. ChIP captures snapshots of these interactions in living cells by employing efficient crosslinking agents. The chromatin is disrupted by sonication and the DNA fragments crosslinked to the proteins of interest are then selectively enriched by immunoprecipitation with specific antibodies. After reversal of the crosslinks, the enriched DNA can be subjected to further characterization. The ChIP method has been applied successfully in different areas, with a focus on the analysis of chromatin structures and transcriptional dynamics. These areas include transcription factor (TF) binding [6], structural components of chromatin complexes [7,8], histone modifications [9-11] and enzyme function in histone modifications [12,13] across a wide range of organisms. Here, we summarize the developments of ChIP-based assays, their technical specifications and how they are applied to reveal insights into the molecular mechanisms of transcriptional and epigenetic regulation.

Technical considerations in conducting ChIP analysis
The basic principle of ChIP is schematically illustrated in Figure 2a. In this process, intact cells are subjected to crosslinking and nuclear extracts are prepared from the crosslinked cells, which are then sonicated to shear the chromatin into fragments of manageable size [14]. The methods used to covalently link the protein to DNA in vivo include ultraviolet (UV) irradiation and formaldehyde treatment [15]. Formaldehyde, which crosslinks DNA (primarily dA and dC) to the α-amino group of all amino acids [16], produces both protein-nucleic acid and protein-protein crosslinks in vivo, making it a simple, fast and highly efficient agent for crosslinking. Experiments have suggested that different proteins are crosslinked with their interacting DNA with different efficiency [17], and excess exposure to formaldehyde can cause resistance to sonication, loss of material and low recovery. Therefore, small-scale trials with different crosslinking stringencies are recommended to evaluate the optimal condition. A key feature of formaldehyde-based crosslinking is that the crosslinks are fully reversible through extensive proteinase K digestion and heat treatment. Thus, the proteins and DNA can be purified separately to enable subsequent analyses [18]. As a result, formaldehyde has been the preferred and general strategy for crosslinking.
Figure 1. Overview of activities in the nucleus. Regulation of genome functions involves complex interactions between DNA, histones and protein complexes within the nucleosomes to bring about highly controlled and organized nuclear activities, such as transcription, DNA repair and replication, that are critical for cellular development.
Other important factors to be considered when performing ChIP include the antibody specificity and the fine balance between the crosslinking stringency and sonication conditions. The ability of ChIP to differentially select target regions over random genomic DNA is highly dependent on the availability of high-quality and high-affinity antibodies against the protein of interest. Community and industrial efforts have been initiated to characterize and catalog ChIP-grade antibodies against nuclear proteins of interest. Furthermore, ChIP with an antibody of a different isotype is commonly used to validate the binding events found. To further improve the efficiency of the ChIP process, a sequential ChIP can be attempted. In this method, two rounds of ChIP are performed sequentially using different subtypes of antibodies against different epitopes of the same protein. Although highly accurate, sequential ChIP is technically challenging and suffers from low yield, which limits its applications.

Readout methods for ChIP-based analysis
It is important to note that the ChIP process differentially enriches the targeted protein-DNA interactions from the entire pool of nuclear crosslinked chromatin-protein complexes through antibody selection; however, it is not a purification step. Therefore, once the ChIP material is available, additional steps are required to characterize the material pulled down and determine the relative enrichment of each region (Figure 2b-f). In a conventional ChIP assay, the enriched regions are initially analyzed using small-scale assays such as traditional cloning followed by a sequencing-based approach [19], Southern blot hybridization analysis [20] or quantitative real-time polymerase chain reaction (PCR) (ChIP-qPCR) [21]. The availability of the complete genome sequences of many complex organisms offers the opportunity to carry out genome-wide detection of protein-chromatin interactions. Two major approaches have been commonly adopted as readouts to determine the identity of these ChIP-enriched DNA fragments at the whole-genome scale: hybridization-based or sequencing-based methods.

Hybridization-based whole-genome ChIP analysis: ChIP-on-chip
To characterize the protein-DNA interaction profiles across different regions of the genome landscape, high-density microarrays are created and hybridization is used for the analysis of ChIP DNA (referred to as ChIP-on-chip). In brief, after reversal of the crosslinks, ChIP-enriched DNA and control DNA are amplified by PCR and fluorescently labeled with the cyanine dyes Cy5 and Cy3 for hybridization to DNA microarrays containing probes that correspond to the genomic sequences of interest (Figure 2b). The ratio of the Cy5 to Cy3 fluorescence intensities measured for each DNA element provides a measure of the extent of binding across the genomic regions covered by the array. Genomic loci with higher fluorescence intensity in the ChIP DNA than in the control DNA are considered enriched and are potential binding sites. Using this technique, the non-repeat sequences in the genome can be interrogated and many novel binding sites uncovered. For example, genes regulated by many TFs such as STE12 and GAL4p were characterized in detail in yeast systems and revealed new functional pathways regulated through multiple TF bindings [6,22].
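The Cy5-to-Cy3 comparison described above can be sketched in a few lines of Python. This is only a minimal illustration: it assumes the ChIP sample is in the Cy5 channel and the control in the Cy3 channel, and the probe names, intensities, pseudocount and two-fold cutoff are hypothetical values chosen for the example, not parameters from any published pipeline.

```python
import math


def log2_ratios(cy5, cy3, pseudocount=1.0):
    """Return log2(Cy5/Cy3) for every probe.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for probes with no signal in the control channel.
    """
    return {
        probe: math.log2((cy5[probe] + pseudocount) / (cy3[probe] + pseudocount))
        for probe in cy5
    }


def enriched_probes(ratios, threshold=1.0):
    """Probes whose log2 ratio meets a hypothetical cutoff (1.0 = two-fold)."""
    return sorted(probe for probe, ratio in ratios.items() if ratio >= threshold)


# Hypothetical background-corrected intensities for three probes.
cy5 = {"probe_A": 4095.0, "probe_B": 1023.0, "probe_C": 511.0}   # ChIP channel (assumed)
cy3 = {"probe_A": 1023.0, "probe_B": 1023.0, "probe_C": 1023.0}  # control channel (assumed)

ratios = log2_ratios(cy5, cy3)
print(enriched_probes(ratios))
```

In practice the two channels would first be normalized against each other and replicate arrays combined before any cutoff is applied; those steps are omitted here.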
Initially, array studies were limited to promoter regions amplified through PCR [23]. Over the years, significant improvements have been made to the ChIP-on-chip procedures as well as the array designs. High-density oligonucleotide tiling arrays that represent the entire genome are now available and enable comprehensive mapping of protein-DNA interactions [6,22,24].

Limitations of the ChIP-on-chip approach
Despite considerable success, array-based readout of ChIP signals does suffer from several limitations. Firstly, the hybridization-based platform is unable to detect signals in repeat regions. Due to the large size and complexity of mammalian genomes, the DNA microarrays available often contain only partial genomic content or promoter regions of well-characterized genes. Therefore, many ChIP-chip analyses provide incomplete information, as any biologically significant binding occurring within the non-interrogated regions cannot be captured. Nevertheless, the repetitive regions are important areas to examine, based on what we know about TF binding [25]. Secondly, PCR is generally used to amplify the ChIP material for hybridization, and biased amplification can introduce hybridization noise. To overcome non-specific amplification from direct PCR and cross-hybridization noise, an improved method called ChIP-DSL (DNA selection and ligation) was developed [26]. In ChIP-DSL, paired oligonucleotides corresponding to regions of interest are designed as signatures and selected by ChIP DNA. The annealed paired oligos are then ligated and amplified by PCR for array-based detection (Figure 2d). ChIP-DSL avoids direct amplification of ChIP fragments, and the amplicons are uniform in size to minimize PCR bias. Thirdly, as many different array designs and genome assemblies exist, the results from different groups can be difficult to compare. Lastly, the global ChIP-chip approach is dependent on the construction of whole-genome arrays. For certain complex genomes, these are not commercially available or economically practical. Due to these limitations, the whole-genome tiling array approach has not yet been adopted by the research community at large and has only been used in several large projects studying the genomes of human and mouse.

Sequencing-based whole-genome ChIP analyses
Sequencing-based methods emerged as an alternative for genome-wide readouts of ChIP analysis, particularly for complex genomes. To determine the identities of ChIP DNA by sequencing methods, large numbers of sequence reads are required. As the ChIP assay is only a process of enrichment, a significant amount of non-enriched background DNA will still be present in the ChIP DNA material. With a limited survey of the ChIP DNA pool, it is difficult to distinguish between genuine signal and noise. However, if the sampling of the DNA pool can be increased, the genuine ChIP-enriched sites can be defined by multiple overlapping ChIP fragments, whereas the non-specific regions will only be covered by random ChIP singletons. The bona fide sites can then be inferred from multiple mapped sequenced fragments.
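The distinction between overlapping fragments and singletons can be illustrated with a simple interval-merging routine. The sketch below is a simplification under stated assumptions: fragments are given as hypothetical (start, end) coordinates on a single chromosome, clusters are built by chaining any fragments that overlap, and the minimum cluster size of three is an arbitrary illustrative cutoff.

```python
def cluster_fragments(fragments):
    """Merge overlapping (start, end) fragments into clusters.

    Returns a list of (cluster_start, cluster_end, fragment_count) tuples.
    Fragments are chained: a fragment joins the current cluster when it
    starts at or before the current cluster end.
    """
    clusters = []
    for start, end in sorted(fragments):
        if clusters and start <= clusters[-1][1]:
            c_start, c_end, count = clusters[-1]
            clusters[-1] = (c_start, max(c_end, end), count + 1)
        else:
            clusters.append((start, end, 1))
    return clusters


# Hypothetical mapped ChIP fragments on one chromosome.
fragments = [(100, 400), (250, 550), (300, 600), (5000, 5300), (9000, 9300)]

clusters = cluster_fragments(fragments)
# Keep clusters supported by at least three fragments (illustrative cutoff).
candidate_sites = [c for c in clusters if c[2] >= 3]
singletons = [c for c in clusters if c[2] == 1]

print(candidate_sites)
print(singletons)
```

A real analysis would also estimate how often clusters of a given size arise by chance at the achieved sequencing depth before accepting them as binding sites.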

ChIP-SAGE
To reduce the amount of sequencing needed to reach sufficient coverage, short-tag-based sequencing strategies such as serial analysis of gene expression (SAGE) have been adopted. SAGE was originally developed for counting transcript levels [27] and later applied to genome scanning for transcription factor binding sites and histone modifications [28,29]. In ChIP-SAGE, the ChIP-enriched DNA fragments are end-ligated with a universal biotinylated linker, and 21-bp tags are generated by type II restriction enzyme digestion for sequencing (Figure 2e). Compared with the ChIP-on-chip hybridization approach, ChIP-SAGE extends coverage to the entire genome and increases resolution [28]. However, this monotag approach suffers from mapping ambiguity and cannot distinguish genuine enrichment from amplification bias, and thus has a lower accuracy.

ChIP-PET
In order to enhance the mapping accuracy of short tags and increase the information content while still exploiting the short-tag sequencing efficiency, a paired-end ditag (PET) method has been developed (ChIP-PET). Like SAGE, the PET approach was initially used for transcriptome analysis [30]. In ChIP-PET, the ChIP DNA is converted into PETs for ultra-high-throughput sequencing. Each PET sequence is mapped onto the genome and the locations of binding sites can be inferred from overlapping PET-defined clusters (Figure 2c). Over 90% of the sites identified can be validated by ChIP-qPCR, and de novo consensus binding motifs can be predicted from the overlapping regions [31]. The ChIP-PET approach has been demonstrated to map whole-genome TF binding sites and epigenetic modifications in both cancer and embryonic stem cells (ESCs) with high specificity and resolution [9,31,32]. Compared to ChIP-on-chip, the ChIP-PET approach is an unbiased and open system for identifying all DNA segments enriched by ChIP. This method is not restricted by the array coverage or probe performance and thus allows a truly genome-wide analysis. Its only limitation is the upfront requirement for large sequencing capacity.

ChIP-Seq
Recently, the development of robust and advanced sequencing technologies, particularly the ability to rapidly decode millions of DNA fragments simultaneously with high efficiency and at relatively low cost, has facilitated our ability to characterize ChIP DNA by direct sequencing (ChIP-Seq) [11,33]. ChIP-Seq has proved to be a simple and robust method for global, unbiased interrogation of TF binding sites and epigenetic modifications. In ChIP-Seq, the ChIP DNA is end-polished and ligated with the sequencing adaptors, followed by limited PCR amplification. Size-selected DNA fragments are then subjected to cluster amplification and sequencing (Figure 2f). Between 25 and 36 nucleotides from either end of the ChIP DNA fragments can be determined with high accuracy, and millions of high-quality reads can be generated within days. Based on their mapping locations, regions containing dense clusters of ChIP tag sequences are defined as ChIP enrichment sites. To further distinguish the true binding sites from the non-specific sites, control DNA (input) is sequenced to determine the noise, which can then be removed. ChIP-Seq enables deep sequencing at high resolution and low cost.
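The use of an input control to separate true enrichment from background can be sketched as a window-based comparison. This is a deliberately simplified illustration rather than any published peak caller: read positions are hypothetical, the genome is cut into fixed 200-bp windows, the input is scaled to the ChIP library size, and the fold-enrichment cutoff and pseudocount are assumptions made for the example.

```python
from collections import Counter


def window_counts(positions, window=200):
    """Count mapped read positions per fixed-size window (keyed by window index)."""
    return Counter(pos // window for pos in positions)


def enriched_windows(chip, control, window=200, min_fold=2.0, pseudocount=1.0):
    """Return (start, end, chip_count) for windows enriched over the input.

    The control counts are scaled by the ratio of library sizes so that the
    two samples are comparable; both samples must be non-empty.
    """
    chip_counts = window_counts(chip, window)
    ctrl_counts = window_counts(control, window)
    scale = len(chip) / len(control)
    hits = []
    for index, n_chip in sorted(chip_counts.items()):
        expected = ctrl_counts.get(index, 0) * scale + pseudocount
        if n_chip / expected >= min_fold:
            hits.append((index * window, (index + 1) * window, n_chip))
    return hits


# Hypothetical read start positions on one chromosome.
chip_reads = [1010, 1020, 1050, 1100, 1150, 1190, 1195, 1199, 3000, 7000]
input_reads = [1100, 3050, 5000, 7100, 9000]

print(enriched_windows(chip_reads, input_reads))
```

Here the window at 1000-1200 holds eight ChIP reads against a scaled input expectation of three, so it passes the cutoff, while the isolated reads at 3000 and 7000 do not.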

Insights from genome-wide ChIP analysis
With the availability of whole-genome and unbiased approaches to characterizing chromatin-DNA interactions, our knowledge of the genomic features, landscape, target genes and gene expression activity has drastically advanced in recent years. Here, we summarize what we have learnt collectively on the critical links between chromatin modifications and transcriptional outputs.

Identification of transcription factor binding sites
By applying ChIP-based assays to components of the transcription machinery or to TFs, their genomic targets and regulatory circuitries can be reconstructed [33-35]. One of the unique and intriguing findings from these genome-wide studies is that large numbers of the identified binding sites are located outside previously annotated promoters, which suggests that the functional regulatory elements of the genome are more extensive than previously envisioned. For example, over 30% of the estrogen receptor binding sites were found in intergenic regions at least 50 kb away from the neighboring genes [36]. Such an observation raises interesting questions about the functional nature of these binding sites and about how to accurately correlate the genes and their corresponding regulatory regions. The genome-wide ChIP assay can also be used to uncover the sequences bound by specific TFs and characterize their binding site selection. From the putative in vivo binding sites identified, the ab initio binding consensus sequences associated with the protein of interest can be efficiently derived [33]. We have also gained insights into how TFs have evolved different mechanisms to elicit target gene responses. Some individual TFs can elicit multiple transcriptional responses, while different TFs can be recruited to the same target regions to trigger transcriptional activation leading to cell differentiation [33]. In ESCs, key reprogramming factors and TFs involved in signaling pathways as well as self-renewal have been analyzed. Specifically, two clusters of genomic loci were found that were extensively targeted by multiple transcription factors in the ESC genome. The first cluster includes NANOG, OCT4, SOX2, SMAD1 and STAT3. The second cluster consists of c-Myc (MYC), n-Myc (MYCN), ZFX and E2F1. STAT3 and SMAD1 are major signaling components modulating the leukemia inhibitory factor (LIF) and bone morphogenetic protein (BMP) pathways. LIF and BMPs are protein factors required for the maintenance of the pluripotent state of ESCs. These results have shown that the LIF and BMP signaling pathways are integrated into the ESC pluripotency maintenance TF cluster (OCT4, SOX2 and NANOG) through SMAD1 and STAT3, and that clustering of multiple transcription factors is a mechanism for cell-specific enhancer targeting in lineage-specific transcriptional regulation.

Profiling chromatin modifications
In addition to TF binding, the ChIP assay can also be used to profile the distribution of the chromatin modification components, histone variants and modifications [10]. One of the pioneering efforts was to understand the mechanisms by which histone modifications regulate transcription and chromatin organization. Starting in the yeast system, the application of ChIP assays demonstrated that histone acetylation was a critical link between chromatin structure and transcriptional activation [37]. In mammalian genomes, Barski et al. have characterized the histone codes through profiling 20 lysine and arginine methylation modification patterns in histones, and identified the signatures for histone methylation patterns surrounding promoters, enhancers, insulators and transcribed regions [11]. Among them, monomethylations of H3K27, H3K9, H4K20, H3K79 and H2BK5 were found to be associated with gene activation, while trimethylation of H3K27, H3K9 and H3K79 was linked to gene repression. In a study to investigate the types of histone modifications that underlie the chromatin properties maintaining the pluripotent nature of the ESC genome, Lander and colleagues uncovered 109 domains showing overlapping opposing histone modification marks, termed 'bivalent domains', where large regions of H3K27me3 harbor smaller regions of H3K4me3 [10]. Following further characterization using a genome-wide ChIP-PET approach in human ESCs [9], H3K4me3 was found to be prevalent, occurring in nearly 70% of the promoters of annotated genes, while H3K27me3 occupies fewer promoter regions and forms a 'bivalent domain' by co-marking 10% of genes with H3K4me3. A large portion of the genes that are important for mesoderm development, neuroectoderm and other developmental processes are among the genes co-modified by H3K4me3 and H3K27me3 [9].
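The definition of a bivalent domain given above, a large H3K27me3 region harboring one or more smaller H3K4me3 regions, amounts to an interval-containment test. The sketch below is only an illustration of that idea: the coordinates are hypothetical, all regions are assumed to lie on one chromosome, and strict containment is used as the criterion, which is a simplifying assumption rather than the published definition.

```python
def bivalent_domains(k27_regions, k4_regions):
    """Pair each H3K27me3 region with the H3K4me3 regions it fully contains.

    Regions are (start, end) tuples on a single chromosome. Only H3K27me3
    regions containing at least one H3K4me3 region are returned.
    """
    domains = []
    for k27_start, k27_end in k27_regions:
        inner = [
            (start, end)
            for start, end in k4_regions
            if start >= k27_start and end <= k27_end
        ]
        if inner:
            domains.append(((k27_start, k27_end), inner))
    return domains


# Hypothetical enriched regions called from two separate ChIP experiments.
h3k27me3 = [(1000, 9000), (20000, 26000)]
h3k4me3 = [(3000, 4000), (15000, 16000)]

print(bivalent_domains(h3k27me3, h3k4me3))
```

In this toy input only the first H3K27me3 region qualifies, because the second H3K4me3 region falls outside both H3K27me3 regions.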
Through the application of genome-wide ChIP analyses across different organisms, we learnt that TF binding sites are not necessarily conserved among species [34,38] and that not all TF-chromatin interactions are functional [25]. Using the binding regions of seven mammalian TFs (ESR1, TP53, MYC, RELA, POU5F1, SOX2 and CTCF) identified on a genome-wide scale, we found that only a minority of sites appeared to be conserved at the sequence level, suggesting that evolution has adapted factor binding sites to aid the dynamic regulation of mammalian genomes.

New advances in ChIP technology
Up to now, most studies using the ChIP assay have focused on characterization of the DNA portions associated with the pulled-down ChIP material. Analysis of the proteins recovered from ChIP in their in vivo chromosomal context has only been reported recently [39]. In addition to protein-DNA interactions, ChIP can also be used to study RNA-protein interactions, especially non-coding, nuclear RNA-directed epigenetic control [40,41]. Using RNA immunoprecipitation followed by PCR (RIP-PCR), non-coding RNAs (ncRNAs), such as HOTAIR and Kcnq1ot1, have been shown to associate with Suz12 and G9a in primary human fibroblasts and mouse fetus, and these associations affect Hox genes as well as the expression of imprinted genes [42,43]. Although RIP has only been carried out in selected cells and at a limited scale, it is intriguing to suggest that there is a specific population of ncRNAs that acts in coordination with different components of the histone and DNA modification machineries to achieve gene-expression control. Through further advancement in RIP-based analysis (Figure 3b), it will be interesting to determine their identity, specificity and impact on cell differentiation.
The recent expansion of ChIP technologies has enabled a better understanding of the interactions between TFs and the regulatory networks contributing to gene regulation. Surprisingly, these analyses have demonstrated that many TFs bind promoter regions far less often than intergenic regions [36], suggesting critical roles for long-distance promoter-enhancer interactions in regulating gene expression in mammalian cells [44]. In some cases, it was found that transcriptional activation involved distal control elements located hundreds of kilobases away, which are brought together through connecting DNA loops that allow physical interactions between the regulatory elements for gene expression [45]. However, methods like ChIP-Seq can only reveal the functional genome in a linear fashion. Information on the long-range interactions harnessed within chromatin-protein complexes and how they impact transcriptional regulation is still lacking.
Initial efforts to characterize distant interactions have been technically challenging and mostly limited to microscopy techniques, which are laborious and of poor resolution. Through formaldehyde crosslinking followed by proximity-based ligation, long-range chromosomal interactions can be captured and detected by PCR (chromosome conformation capture, 3C), or by microarray analysis or high-throughput sequencing (4C or 5C), with limited scale and selective bias [46-48]. Applying 3C to the human β-globin locus, various specific interactions between the genes and the regulatory elements were demonstrated [49]. Although 3C and its variants are excellent tools to study complex interactions, these methods require prior knowledge of interacting candidates and hence cannot be used for genome-wide profiling of all chromatin interactions. As such, there is a need for approaches that reveal global chromatin interactions at the whole-genome scale in an unbiased and de novo manner. Building on the paired-end ditag concept, we further explored the ability of PETs to connect the two ends of a DNA fragment and delineate their relationship, in order to characterize interacting chromatin (chromatin interaction analysis by paired-end ditagging; ChIA-PET) [50]. In this approach, ChIP is performed with antibodies specific to the TF of interest. Specially designed short oligonucleotide linkers are ligated to the ends of each interacting DNA fragment, followed by a second, intramolecular ligation to connect two interacting DNA fragments together. PETs are extracted from the ligated DNA and analyzed by paired-end ditag sequencing. The linear binding sites along genomic DNA can be revealed from self-ligation PETs, and the interactions between the binding sites can be determined from inter-/intra-chromatin ligation PETs (Figure 3a). Therefore, a single ChIA-PET experiment can generate two inter-related datasets, depending on the step at which the ligation occurs (before or after reversal of the crosslinks). Such a feature, when supported by ultra-high-throughput sequencing, can reveal interactomes mediated by TFs or chromatin-modifying complexes. We expect that the mapping of the whole-genome interactome mediated by pertinent TFs or chromatin modifications will translate into knowledge that is critical for understanding the fundamental transcriptional regulation programs.
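The separation of a ChIA-PET library into self-ligation PETs and ligation PETs that join two different fragments can be sketched as a simple classification of mapped tag pairs. The heuristic used here, calling a PET self-ligated when both tags map to the same chromosome within a short span, is an assumption made for illustration and is not described in the text above; the 3-kb cutoff, the coordinates and the category labels are likewise hypothetical.

```python
from collections import Counter


def classify_pet(pet, max_self_span=3000):
    """Classify one PET given as ((chrom, pos), (chrom, pos)).

    Assumed heuristic: tags on the same chromosome and within max_self_span
    base pairs come from a single self-ligated ChIP fragment; all other
    PETs join two distinct fragments.
    """
    (chrom1, pos1), (chrom2, pos2) = pet
    if chrom1 != chrom2:
        return "inter-chromosomal ligation"
    if abs(pos2 - pos1) <= max_self_span:
        return "self-ligation"
    return "intra-chromosomal ligation"


# Hypothetical mapped PETs.
pets = [
    (("chr1", 10_000), ("chr1", 10_450)),    # two ends of one fragment
    (("chr1", 10_200), ("chr1", 250_000)),   # distant sites on one chromosome
    (("chr1", 10_300), ("chr7", 5_000_000)), # sites on different chromosomes
]

summary = Counter(classify_pet(pet) for pet in pets)
print(summary)
```

Self-ligation PETs would then feed a binding-site analysis of the kind used for ChIP-PET, while the remaining PETs are candidates for interactions, subject to filtering for random ligation events.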

Future prospects
As described in this paper, the combination of the ChIP assay with robust readout methods is extremely powerful for a variety of whole-genome analyses aimed at defining the functional components within mammalian genomes. The wide range of interactions and diverse organisms it has been applied to have already demonstrated the power of this approach. Considerable progress has been made in our understanding of transcriptional and epigenetic regulation, as well as in the elucidation of transcriptional regulatory networks and chromatin organization. Ultimately, with further improvement of the ChIP-based assays, particularly in the robustness of the enrichment and expansion of their applications, we foresee that ChIP will continue to be the critical approach to study chromatin biology and genome regulation. If successfully implemented, particularly for individual and personal human genome interrogations, such applications will further our understanding of how genetic and epigenetic regulation coordinates eukaryotic development. This knowledge has the potential to translate into a better understanding of the fundamental transcriptional regulation programs, and lead to biomarker discovery or therapeutic target stratification, which will ultimately guide the development of strategies for personalized medicine.