Getting personalized cancer genome analysis into the clinic: the challenges in bioinformatics

Progress in genomics has raised expectations in many fields, and particularly in personalized cancer research. The new technologies available make it possible to combine information about potential disease markers, altered function and accessible drug targets, which, coupled with pathological and medical information, will help produce more appropriate clinical decisions. The accessibility of such experimental techniques makes it all the more necessary to improve and adapt computational strategies to the new challenges. This review focuses on the critical issues associated with the standard pipeline, which includes: DNA sequencing analysis; analysis of mutations in coding regions; the study of genome rearrangements; extrapolating information on mutations to the functional and signaling level; and predicting the effects of therapies using mouse tumor models. We describe the possibilities, limitations and future challenges of current bioinformatics strategies for each of these issues. Furthermore, we emphasize the need for collaboration between the bioinformaticians who implement the software and use the data resources, the computational biologists who develop the analytical methods, and the clinicians, who are the systems' end users and those ultimately responsible for taking medical decisions. Finally, the different steps in cancer genome analysis are illustrated through examples of applications.

the PALB2 gene, was discovered by sequencing almost all the coding genes in the cancer cells from this patient [26]. Approximately 70 specific variations were detected in the tumor tissue and they were analyzed manually to search for mutations that might be related to the onset of the disease and, more importantly from a clinical point of view, that could be targeted with an existing drug. In this case, the mutation in the PALB2 gene was linked to a deficiency in the DNA repair mechanism [27] and this could be targeted by mitomycin C.
The obvious challenge in relation to this approach is to develop a systematic form of analysis in which a bioinformatics-assisted pipeline can rapidly and effectively analyze genomic data, thereby identifying targets and treatment options. An ideal scenario for personalized cancer treatment would require performing the sequencing and analysis steps before deciding on new treatments.
Unfortunately, there are still several scientific and technical limitations that make the direct implementation of such a strategy unfeasible. Although pipelines to analyze next-generation sequencing (NGS) data have become commonplace, the systematic analysis of mutations requires more time and effort than is available in routine hospital practice. A further challenge is to predict the functional impact of the variations discovered by sequencing, which presents serious obstacles in terms of the reliability of current bioinformatics methods. These difficulties are particularly relevant in terms of protein structure and function prediction, the analysis of non-coding regions, functional analyses at the cellular and subcellular levels, and the gathering of information about the relationships between mutations and drug interactions.
Our own strategy is focused on testing the drugs and treatments proposed by the computational analysis of genomic information in animal models as a key clinical element. The use of xenografts, in which nude mice are used to grow tumors seeded by implanting fragments of the patient's tissue, may be the most practical model of real human tumors. Despite their limitations, including the mixture of human and animal cells and the possible differences in the evolution of the tumors with respect to their human counterparts, such 'avatar' models provide valuable information about the possible treatment options. Importantly, such xenografts allow putative drugs or treatments for individual tumors to be assayed before applying them in clinical practice [25].
A summary of the elements that are required in an ideal data analysis pipeline is depicted in Figure 1, including: the analysis of genomic information; prediction of the consequences of specific mutations, particularly in protein coding regions; interpretation of the variation at the gene/protein network level; and the basic approaches in pharmacogenomic analysis to identify potential drugs related to the predicted genetic alterations. Finally, the pipeline includes the interfaces necessary to integrate the genomic information with other resources required by teams of clinicians, genome experts and bioinformaticians to analyze the information.
In this review, we outline the possibilities and limitations of a comprehensive pipeline and the future developments that will be required to generate it, including a brief description of the approaches currently available to cover each stage. We begin by examining the bioinformatics required for genome analysis, before focusing on how mutation and variation data can be interpreted, then explore network analysis and the downstream applications available for selecting appropriate drugs and treatments.

Genome analysis
Array technologies are relied on heavily to analyze disease-related tissue samples, including expression arrays and single nucleotide polymorphism (SNP) arrays to analyze point mutations and structural variations. However, personalized medicine platforms are now ready to benefit from the transition from these array-based approaches towards NGS technology [28].
The detection of somatic mutations by analyzing sequence data involves a number of steps to filter out technical errors. The first series of filters are directly related to the sequencing data and they vary depending on the technical setup. In general, this takes into consideration the base-calling quality of the variants in the context of the corresponding regions. It also considers the regions covered by sequencing and their representativeness or uniqueness at the genome level.
As the sequencing and software analysis technologies are not fully integrated, errors are not infrequent and, in practice, thousands of false positives are detected when the results move on to the validation phase. In many cases, this is due to the non-unique placement of the sequencing reads in the genome or the poor quality of alignments. In other cases, variants can be missed because of insufficient coverage of the genomic regions.
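The technical filters described above can be illustrated with a minimal Python sketch. It is not taken from any particular variant caller: the `VariantCall` record, its field names and all four thresholds are assumptions chosen for illustration, and real pipelines tune such cutoffs to the sequencing setup.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical record for one candidate variant (field names are illustrative)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    base_quality: float     # mean Phred quality of the bases supporting the variant
    mapping_quality: float  # mean mapping quality of the supporting reads
    depth: int              # total reads covering the position
    alt_reads: int          # reads supporting the alternate allele


# Assumed thresholds; appropriate values depend on the platform and protocol.
MIN_BASE_QUALITY = 20.0
MIN_MAPPING_QUALITY = 30.0
MIN_DEPTH = 10
MIN_ALT_READS = 3


def filter_reasons(call):
    """Return the list of technical filters that a candidate variant fails."""
    reasons = []
    if call.base_quality < MIN_BASE_QUALITY:
        reasons.append("low_base_quality")
    if call.mapping_quality < MIN_MAPPING_QUALITY:
        # Low mapping quality is used here as a proxy for non-unique read placement.
        reasons.append("ambiguous_alignment")
    if call.depth < MIN_DEPTH:
        reasons.append("low_coverage")
    if call.alt_reads < MIN_ALT_READS:
        reasons.append("few_supporting_reads")
    return reasons


def passing_calls(calls):
    """Keep only the candidates that fail none of the filters."""
    return [call for call in calls if not filter_reasons(call)]
```

Returning the reasons rather than a plain yes/no makes it easier to see which class of technical error dominates the false positives in a given run.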
The analysis of tumors is further complicated by their heterogeneous cellular composition. New experimental approaches are being made available to address the heterogeneity of normal and disease cells in tumors, including single-cell sequencing [29,30]. Other intrinsic difficulties include the strong mosaicism recently discovered [31-33], and thus greater sequencing quality and coverage is necessary and more stringent sample selection criteria must be applied. These requirements place additional pressure on the need to acquire samples in sufficient quantity and of appropriate purity, inevitably increasing the cost of such experiments.
After analyzing the sequence data, putative mutations must be compared with normal tissue from the same individual, as well as with other known genetic variants, to identify true somatic mutations related to the specific cancer. This step involves comparing the data obtained with information regarding variation and with complete genomes, which can be obtained from various databases (see below), as well as with information on rare variants [34,35]. For most applications, including the possible use in a clinical setup, a subsequent validation step is necessary, which is normally carried out by PCR sequencing of the variants or, where possible, by sequencing biological replicates.
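As a rough illustration of this comparison step, the sketch below treats each variant as a `(chromosome, position, reference, alternate)` tuple and subtracts the variants seen in the matched normal sample and in a set of known germline variants. The function name and the tuple representation are assumptions made for this example; production callers model both samples jointly rather than performing a plain set difference.

```python
def candidate_somatic(tumor, normal, known_germline):
    """Return tumor variants absent from the matched normal and from known germline sets.

    Each variant is a (chrom, pos, ref, alt) tuple. This is a deliberately
    simple set difference; it ignores coverage in the normal sample, so a
    variant missed in the normal because of low depth would be kept here.
    """
    return sorted(set(tumor) - set(normal) - set(known_germline))


# Illustrative, made-up coordinates.
tumor_calls = [("chr1", 100, "A", "T"), ("chr2", 500, "G", "C"), ("chr3", 42, "C", "T")]
normal_calls = [("chr2", 500, "G", "C")]
germline_catalog = [("chr3", 42, "C", "T")]

somatic = candidate_somatic(tumor_calls, normal_calls, germline_catalog)
```

The candidates that survive this subtraction are the ones that would then go forward to the validation step.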

Exome sequencing
The cost of whole-genome sequencing still remains high. Furthermore, when mutations associated with diseases are mapped in genome-wide association studies (GWASs) [36], they tend to map in regulatory and functional elements but not necessarily in the conserved coding regions, which actually represent a very small fraction of the genome. This highlights the importance of studying mutations in non-coding regions and the need for more experimental information on regulatory elements, including promoters, enhancers and microRNAs (miRNAs; see below). Despite all these considerations, the current alternative for economic and technical reasons is often to limit sequencing to the coding regions in the genome (exome sequencing), which can be performed for less than $2,000. Indeed, sequencing all the exons in a genome has already provided useful data for disease diagnosis, such as in identifying the genes responsible for Mendelian disorders in studies of a small number of affected individuals. Such proof-of-concept studies have correctly identified the genes previously known to underlie diseases such as Freeman-Sheldon syndrome [37] and Miller syndrome [38].

[Figure 1 caption] (1) Revision of genomic information. In this rapidly developing area methods and software are continuously changing to match the improvements in sequencing technologies. (2) Analysis of the consequences of specific mutations and genomic alterations. The analysis needs range from the area of point mutation prediction in proteins to the much more challenging area of prediction of mutations in non-coding regions, including promoter regions and TF binding sites. Other genetic alterations important in cancer must also be taken into consideration, such as copy number variation, modification of splice sites and altered splicing patterns. (3) Mapping of gene/protein variants at the network level. At this point, the relationships between individual components (genes and proteins) are analyzed in terms of their involvement in gene control networks, protein interaction maps and signaling/metabolic pathways. It is clearly necessary to develop a network analysis infrastructure and analysis methods capable of extracting information from heterogeneous data sources. (4) Translation of the information into potential drugs or treatments. The pharmacogenomic analysis of the information is essential to identify potential drugs or treatments. The analysis at this level integrates genomic information with that obtained from databases linking drugs and potential targets, combining it with data on clinical trials drawn from text or web sources. Toxicogenomics information adds an interesting dimension that enables additional exploration of the data. (5) Finally, it is essential to make the information extracted by the systems accessible to the end users in adequate conditions, including geneticists, biomedical scientists and clinicians.

Bioinformatics steps for personalized genome analysis
A key step in exome sequencing is the use of the appropriate capturing technology to enrich the DNA samples to be sequenced with the exons desired. There has been considerable progress in developing and commercializing arrays to capture specific exons (for example, see [39]), which has facilitated the standardization and systematization of such approaches, thereby increasing the feasibility of applying these techniques in clinical settings.
Despite the current practical advantages offered by exome sequencing, it is possible that technological advances will soon mean that it will be replaced by whole-genome sequencing, which will be cheaper in practice and require less experimental manipulation. However, such a scenario will certainly increase the complexity of the bioinformatic analysis (see, for example, [40] for an approach using whole-genome sequencing, or [19] for the combined use of whole-genome sequencing as a discovery system, followed by exome sequencing validation in a larger cohort).

Sequencing to study genome organization and expression
NGS can provide sequence information complementary to DNA sequencing that will be important for cancer diagnosis, prognosis and treatment. The main applications include RNA sequencing (RNA-seq), miRNAs and epigenetics.
NGS-based approaches can also be used to detect structural genomic variants, and these techniques are likely to provide better resolution than previous array technologies (see [41] for an initial example). Cancer research is an obvious area in which this technology will be applied, as chromosomal gains and losses are very common in cancer. Further improvements in this sequencing technology, and in the related computational methods, will enable more information to be obtained at a lower cost [42] (see also a recent application in [43] and the evolution of computational approaches from [44-46] to [47]).

RNA-seq
DNA sequencing data, particularly data from non-coding regions (see below), can be better understood when accompanied by gene expression data. Direct sequencing of RNA samples already provides an alternative to the use of expression arrays, and it promises to increase the accessible dynamic range and limits of sensitivity [48-50]. RNA-seq could be used to provide a comprehensive view of the differences in transcription between normal and diseased samples but also to correlate alterations in structure and copy number that may affect gene expression, thereby helping to interpret the consequences of mutations in gene control regions. Furthermore, RNA sequencing data can be used to explore the capacity of the genome to produce alternative splice variants [51-55]. Indeed, the prevalence of splice variants at the genomic level has been assessed, suggesting a potential role for the regulation of alternative splicing in different stages of disease, and particularly in cancer [56,57]. Recent evidence clearly points to the importance of mutations in splicing factors and RNA transport machinery in cancer [24,58].

miRNAs
NGS data on miRNAs can also complement sequencing data. This is particularly important in cancer research given the rapidly expanding roles proposed for miRNAs in cancer biology [59]. For example, interactions have been demonstrated between miRNA overexpression and the well-characterized Sonic hedgehog/Patched signaling pathway in medulloblastoma [60]. Moreover, novel miRNAs and miRNAs with altered expression have also been detected in ovarian and breast cancers [61,62].

Epigenetics
NGS can provide invaluable data on DNA methylation (methyl-seq) and the epigenetic modification of histones, for example through chromatin immunoprecipitation sequencing (ChIP-seq) with antibodies corresponding to the various modifications. Epigenetic mechanisms have been linked to disease [63,64] (reviewed in [65]).
The wealth of information provided by all these NGS-based approaches will substantially increase our capacity to understand the complete genomic landscape of the disease, although it will also increase the complexity of the analysis at all levels, from basic data handling to problems related to data linking and interpretation. There will also be complications in areas in which our knowledge of the basic biological processes is developing at the same pace as the analytical technology (for a good example of the intrinsic association between new discoveries in biology and the development of analytical technologies, see recent references on chromothripsis [66-68]). Furthermore, it is important to keep in mind that, from the point of view of clinical applications, most if not all drugs available target proteins. Thus, even if it is essential to have complete genomic information to understand a disease and to detect disease markers and stratification, as well as to design clinical trials, the identification of potential drugs and treatments will still be mainly based on the analysis of alterations in coding regions.

Interpreting mutation and variation data
The growing number of large-scale studies has led to a rapid increase in the number of potential disease-associated genes and mutations (Table 1). An overview of these studies can be found in [69] and the associated web catalog of GWASs [70].
Interpreting the causal relationship between the mutations considered to be significant in GWASs and the corresponding disease phenotypes is clearly complicated, and serious concerns about the efficacy of GWASs have been much discussed [71,72]. In the case of cancer research, the interpretation of mutations is additionally complicated by the dynamic nature of tumor progression, and by the need to distinguish between mutations associated with the initiation of the cancer and others that accumulate as the tumors evolve. In this field, the potential cancer initiators are known as 'drivers' and those that accumulate during tumor growth as 'passengers' (terminology taken from [73], referring metaphorically to the role of certain viruses in either causing or merely being passengers in infected cells).
In practice, the classification of mutations as drivers and passengers is based on their location at positions considered to be important because of their evolutionary conservation, and on observations in other experimental datasets (for a review of the methods used to classify driver mutations and the role of tumor progression models, see [74]). Ultimately, more realistic biological models of tumor development and a more comprehensive understanding of the relationship between individual mutations will be necessary to classify mutations according to their role in the underlying process of tumor progression (reviewed in [75]).
Despite the considerable advances in database development, it will take additional time and effort to fully consolidate all the information available in the scientific literature into databases and annotated repositories. To alleviate this problem, efforts have been made to extract mutations directly from the literature by systematically mapping them to the corresponding protein sequences. For example, CJO Baker and D Rebholz-Schuhmann organize a biennial workshop focusing on this particular approach (the ECCB Workshop: Annotation, Interpretation and Management of Mutations; the corresponding publication is [76]).
In the case of protein kinases, one of the most important families of proteins for cancer research, many mutations have been detected that are not currently stored in databases and that have been mapped to their corresponding positions in protein sequences [77]. However, for a large proportion of the mutations in kinases already introduced into databases, text mining provides additional links to stored information and mentions of the mutations in the literature.
These automated approaches, when applied not only to protein kinases but to any protein family [78-84], should be viewed as a means of facilitating rapid access to information, although they are not aimed at replacing databases, as the text mining results require detailed manual curation. Therefore, in the quest to identify and interpret mutations, it is important to bear in mind that text mining can provide additional information complementary to that retrieved in standard database searches.
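To make the idea of extracting mutation mentions from text concrete, the following Python sketch uses two regular expressions to find point mutations written in one-letter (for example, V600E) or three-letter (for example, Leu858Arg) notation. It is a toy illustration and not the method used by any of the cited systems; the function name and the example sentence are assumptions, and such patterns produce false positives (a histone name such as H2A matches the one-letter pattern), which is one reason manual curation remains necessary.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_LETTER_CODES = "ACDEFGHIKLMNPQRSTVWY"
THREE_LETTER_CODES = "|".join(THREE_TO_ONE)

ONE_LETTER = re.compile(rf"\b([{ONE_LETTER_CODES}])(\d+)([{ONE_LETTER_CODES}])\b")
THREE_LETTER = re.compile(rf"\b({THREE_LETTER_CODES})(\d+)({THREE_LETTER_CODES})\b")


def extract_mutations(text):
    """Return (wild_type, position, mutant) tuples in order of appearance."""
    found = []
    for match in ONE_LETTER.finditer(text):
        found.append((match.start(), match.group(1), int(match.group(2)), match.group(3)))
    for match in THREE_LETTER.finditer(text):
        found.append((
            match.start(),
            THREE_TO_ONE[match.group(1)],
            int(match.group(2)),
            THREE_TO_ONE[match.group(3)],
        ))
    # Sort by text offset, then drop the offset.
    return [(wt, pos, mut) for _, wt, pos, mut in sorted(found)]


sentence = "The BRAF V600E substitution and EGFR Leu858Arg were reported."
mentions = extract_mutations(sentence)
```

A real system would additionally have to link each mention to the correct protein and check that the wild-type residue agrees with the reference sequence at that position.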

Information about protein function
Accurately defining protein function is an essential step in analyzing mutations and predicting their possible consequences. Databases are annotated by extrapolating the functions of the small number of proteins on which detailed experiments have been carried out (estimated to be less than 3% of the proteins annotated in the UniProt database). The protocols for these extrapolations have been developed over the past 20 years and they are continually adjusted to incorporate additional filters and information sources [85-87]. Interestingly, several ongoing community-based efforts aim to evaluate the methods used to predict and extract information regarding protein function, such as BioCreative in the field of text mining [88,89], CASP for predicting function and binding sites [90], and the challenge in function prediction organized by Iddo Friedberg and Predrag Radivojac [91].

Protein function at the residue level
The analysis of disease-associated mutations naturally focuses on key regions of proteins that are directly related to their activity. The identification of binding sites and active sites in proteins is therefore an important aid to interpreting the effects of mutations. In this case, and as in other areas of bioinformatics, the availability of large and well-annotated repositories is essential. The annotations of binding sites and active sites in Swiss-Prot [92], the main database with hand-curated annotations of protein characteristics, provide a combination of experimental information and patterns of conservation of key regions. For example, the well-characterized GTP-binding site of the Ras family of small GTPases is divided into four small sequence regions. This definition is based on the conservation of these sequences, despite the fact that they include residues that do not directly contact GTP or participate in the catalytic mechanism. Obviously, the ambiguity of this type of definition tends to complicate the interpretation of mutations in such regions.
Various tools have been designed to provide validated annotations of binding sites (residues in direct contact with biologically relevant compounds) in proteins of known structure; these include FireDB and FireStar [93]. This information is organized according to protein families so as to help analyze the conservation of the compounds bound and the corresponding binding residues. Other resources, such as the Catalytic Site Atlas [94], provide detailed information about protein residues directly involved in the catalysis of biochemical reactions by enzymes. In addition to substrate binding sites, it is also important to interpret the possible incidence of mutations at sites of interaction between proteins. Indeed, there are a number of databases that store and annotate such interaction sites [95].
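Once residue-level annotations are available, checking whether a mutated position falls in or near an annotated site is straightforward. The Python sketch below is a minimal illustration; the annotation dictionary, the site names and the residue numbers are made up for the example and do not correspond to any real database entry or to the output format of the resources cited above.

```python
def sites_near(residue, annotations, window=0):
    """Return {site_name: sequence_distance} for annotated sites within `window` residues.

    `annotations` maps a site name to a set of residue numbers. Distance is
    measured along the sequence, which is only a crude proxy for spatial
    proximity in the folded protein.
    """
    hits = {}
    for name, residues in annotations.items():
        if not residues:
            continue
        distance = min(abs(residue - annotated) for annotated in residues)
        if distance <= window:
            hits[name] = distance
    return hits


# Hypothetical annotations for an imaginary protein.
example_annotations = {
    "ligand_binding": {20, 21, 22},
    "catalytic": {75},
}
```

With `window=0` only direct hits on annotated residues are reported; a small positive window also flags mutations flanking a site, which may matter given the ambiguity of conservation-based site definitions noted above.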
Given that there are still relatively few proteins for which binding sites can be deduced from their corresponding structures, it is particularly interesting to be able to predict substrate binding sites and regions of interaction with other protein effectors. Several methods are currently available for this purpose [96-98]; for example, a recently published method [99] automatically classifies protein families into functional subfamilies, and detects residues that may functionally differentiate between subfamilies (for a user-friendly visualization environment, see [100]).

Prediction of the consequences of point mutations
Several methods are currently used to predict the functional consequences of individual mutations. In general, they involve a combination of parameters related to the structure and stability of proteins, interference from known functional sites, and considerations about the evolutionary importance of sites. These parameters are calculated for a number of mutations known to be linked to diseases and in the majority of systems they are extrapolated to new cases using machine learning techniques (support vector machines, neural networks, decision trees and others; for a basic reference in the field, see [101]).
The process of predicting the consequences of mutations is hampered by numerous inherent limitations, such as those listed below.
(1) Most of the known mutations used to calibrate the system are only weakly associated with the corresponding disease. In some cases the relationship is indirect or even non-existent (for example, mutations derived from GWASs; see above).
(2) [...]
(3) [...] should ideally be interpreted in quantitative terms, taking energies and entropies into account. This requires biophysical data that are not yet available for most proteins.
(4) Predictions are made on the assumption that proteins act alone when, in reality, specific constraints and interactions within the cellular or tissue environment can considerably attenuate or enhance the effects of a mutation.
(5) The current knowledge of binding sites, active sites and interaction sites is limited (see above). The accuracy of predictions regarding the effects of mutations at these sites is thus similarly limited.
Despite such limitations, these approaches are very useful and they currently represent the only means of linking mutations with protein function (Table 2). Many of these methods are user-friendly and well documented, with their limitations emphasized to ensure careful analysis of the results. Indeed, an initial movement to assess prediction methods has been organized (a recent evaluation of such methods can be found in [102]).
For example, the PMUT method [103] (Table 2) is based on neural networks calibrated using known mutations, integrating several sequence and structural parameters (multiple sequence alignments generated with PSI-BLAST and PHD scores for secondary structure, conservation and surface exposure). The input required is the sequence or alignment, and the output consists of a list of the mutations with a corresponding disease prediction presented as a pathogenicity index that ranges from 0 to 1. The scores corresponding to the neural network's internal parameters are interpreted in terms of the level of confidence in the prediction. The system also provides precalculated results for large groups of proteins, thereby offering a fast and accessible web resource [103].
Perhaps the most commonly used method in this area is SIFT [104] (Table 2), which compiles PSI-BLAST alignments and calculates the probabilities for all the 20 possible amino acids at each position. From this information it predicts to what degree substitutions will affect protein function. In its predictions, SIFT does not use structural information; it relies instead on the diversity of the sequences in the multiple sequence alignments. The information provided about the variants in protein coding regions includes descriptions of the protein sequences and the families, the estimated evolutionary pressure and the frequency of SNPs at that position (if detected), as well as the association with diseases as found in the Online Mendelian Inheritance in Man (OMIM) database (Table 1).
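The general idea behind alignment-based predictors can be sketched in a few lines of Python. The code below estimates, for one alignment column, the probability of each of the 20 amino acids and scores a substitution relative to the most frequent residue. This is a deliberately simplified conservation score written for illustration; it is not the SIFT or PMUT algorithm, and the function names, the uniform pseudocount and the toy alignment are all assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(column, pseudocount=1.0):
    """Estimate amino acid probabilities for one alignment column.

    A uniform pseudocount keeps unobserved residues from receiving a
    probability of exactly zero. Gaps and unknown characters are ignored.
    """
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def substitution_score(alignment, position, mutant):
    """Score a substitution as its probability relative to the most likely residue.

    `position` is a 0-based column index. Scores near 1 suggest the residue
    is commonly seen at that position; scores near 0 suggest it is rarely
    or never observed there.
    """
    column = [sequence[position] for sequence in alignment]
    probabilities = column_probabilities(column)
    return probabilities[mutant] / max(probabilities.values())


# Toy alignment of four short, made-up sequences.
toy_alignment = ["MKVL", "MKIL", "MRVL", "MKVL"]
score_observed = substitution_score(toy_alignment, 2, "I")    # seen once in the column
score_unobserved = substitution_score(toy_alignment, 2, "W")  # never seen in the column
```

A score of this kind reflects only what the alignment contains, so it inherits the limitations listed above: it says nothing about structure, cellular context or the strength of the disease association.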
In the light of the current situation, it is clearly necessary to move beyond the simple predictive methods that are currently available to fulfill the requirements for personalized cancer treatment. As in other fields of bioinformatics (see above), competitions and community-based evaluation efforts that openly compare systems are of great practical importance. In this case, Yana Bromberg and Emidio Capriotti are organizing an interesting workshop on the prediction of the consequences of point mutations [105], and Steven E Brenner, John Moult and Sadhna Rana organize the Critical Assessment of Genome Interpretation (CAGI) to assess computational methods for predicting the phenotypic impacts of genomic variation [106].
A key technical step in analyzing the consequences of mutations in protein structures is the ability to map the mutations described at the genome level onto the corresponding protein sequences and structures. Translating information between coordinate systems (genomes and protein sequences and structures) is not trivial, and current methods only provide partial solutions to this problem. The protein structure classification database CATH [107] has addressed this issue using a system that allows the systematic transfer of DNA coordinates to positions in three-dimensional protein structures and models [108].
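The simplest part of this coordinate translation, from a genomic position to a residue number in a single transcript, can be sketched as follows. This is an illustrative Python example and not the CATH system: the function name and the exon representation are assumptions, and it ignores the real difficulties (multiple transcripts, indels, reference mismatches and the further step from sequence to structure numbering).

```python
def genomic_to_residue(pos, coding_exons, strand="+"):
    """Map a genomic position to (residue_number, position_in_codon), both 1-based.

    `coding_exons` is a list of (start, end) genomic intervals, 1-based and
    inclusive, covering only the coding part of one transcript. Returns None
    when the position lies outside the coding sequence.
    """
    # Walk the exons in transcription order.
    ordered = sorted(coding_exons, reverse=(strand == "-"))
    bases_before = 0
    for start, end in ordered:
        if start <= pos <= end:
            within_exon = pos - start if strand == "+" else end - pos
            cds_index = bases_before + within_exon  # 0-based index in the coding sequence
            return cds_index // 3 + 1, cds_index % 3 + 1
        bases_before += end - start + 1
    return None


# Made-up transcript with two coding exons of 6 and 9 bases.
example_exons = [(100, 105), (200, 208)]
```

For the example transcript, genomic position 200 is the seventh coding base on the forward strand and therefore the first base of the third codon.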
In addition to the general interpretation of the consequences of mutations, there is a large body of literature on the interpretation of mutations in specific protein families. By combining curated alignments and the detailed analysis of structures or models with sophisticated physical calculations, it is possible to gain additional insight into specific cases. For example, mutations in the protein kinase family have been analyzed, comparing the distribution of these mutations in terms of protein structure and their relationship with active sites and binding sites [109]. The conclusion of this study [109] was that putative cancer driver mutations tend to be more closely associated with key protein features than are other more common variants (non-synonymous SNPs) or somatic mutations (passengers) that are not directly linked to tumor progression. These driver-specific features include molecule binding sites, regions of specific binding to other proteins and positions conserved generally or in specific protein subfamilies at the sequence level. This observation fits well with the implication of altered protein kinase function in cancer pathogenicity, and it supports the link between cancer-associated driver mutations and altered protein kinase structure and function. Family-specific prediction methods based on the association of specific features in protein families [110], and on other methods that exploit family-specific information [111,112], pave the way to the development of a new generation of prediction methods that can assess all protein families using their specific characteristics.
Mutations do not only affect binding sites and functional sites but, in many cases, they also alter sites that are subject to post-translational modifications, potentially affecting the function of the corresponding proteins. Perhaps the largest and most effective resource to predict the mutational effects on sites subject to post-translational modification is that developed by Søren Brunak's group [113], which encompasses leucine-rich nuclear export signals, non-classical secretion of proteins, signal peptides and cleavage sites, arginine and lysine propeptide cleavage sites, generic and kinase-specific phosphorylation sites, C-mannosylation sites, glycation of ε-amino groups of lysines, N-linked glycosylation sites, O-GalNAc (mucin-type) glycosylation sites, amino-terminal acetylation, O-β-GlcNAc glycosylation and 'Yin-Yang' sites (intracellular/nuclear proteins). The output for each sequence predicts the potential of mutations to affect different sites. However, there is as yet no predictor capable of combining the output of this method and applying it to specific mutations. An example of a system to predict the consequences of mutations in an information-rich environment is provided in Figure 2.

Mutations in non-coding regions
Predicting the consequences of mutations in non-coding regions presents particular challenges, especially given that current methods are still very limited in formulating predictions based on gene sequence and structure, miRNA and transcription factor (TF) binding sites, and epigenetic modifications. For a review of our current knowledge of TFs and their activity, see [114]; the main data repositories are TRANSFAC, a database of TFs and their DNA binding sites [115], JASPAR, an open-access database of eukaryotic TF binding profiles [116], and ORegAnno, an open-access community-driven resource for regulatory annotation [117].
In principle, these information repositories make it possible to analyze any sequence for the presence of putative TF binding sites and to predict how binding would change following the introduction of mutations. In practice, however, the information relating to binding preferences is not very reliable as it is generally based on artificial in vitro systems. Furthermore, it is difficult to account for the effects of gene activation based on this information and it is also impossible to take into account any cooperation between individual binding sites. Although approaches based on NGS or ChIP-seq experiments would certainly improve the accuracy of the information available regarding true TF binding sites in different conditions, predicting the consequences of individual modifications in terms of the functional alterations produced is still difficult. The mapping of mutations in promoter regions and their correlation with TF binding sites thus provides us with only an indication of potentially interesting regions, but it does not yet represent an effective strategy to analyze mutations.
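The kind of analysis described above, scoring a putative site against a binding profile and asking how a mutation changes the score, can be sketched with a position weight matrix. The Python example below is a minimal illustration under stated assumptions: the count matrix is invented (it is not a TRANSFAC or JASPAR profile), the pseudocount and uniform background are arbitrary choices, and each position is scored independently, so the cooperation between sites mentioned above is not modeled.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.get(base, 0) for base in BASES) + pseudocount * len(BASES)
        matrix.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in BASES
        })
    return matrix


def score_site(matrix, site):
    """Sum the per-position log-odds scores for a site of the same length as the matrix."""
    return sum(column[base] for column, base in zip(matrix, site))


def mutation_effect(matrix, site, offset, alt):
    """Return the change in score when the base at `offset` (0-based) is replaced by `alt`."""
    mutated = site[:offset] + alt + site[offset + 1:]
    return score_site(matrix, mutated) - score_site(matrix, site)


# Invented three-position profile in which A, C and G are strongly preferred.
toy_counts = [{"A": 10}, {"C": 10}, {"G": 10}]
toy_matrix = log_odds_matrix(toy_counts)
delta = mutation_effect(toy_matrix, "ACG", 1, "T")  # negative: the mutation weakens the match
```

A negative score change only indicates a weaker match to the profile. As the text notes, this is an indication of a potentially interesting region rather than a prediction of the functional consequence.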
In the case of miRNAs and other non-coding RNAs, the 2012 Nucleic Acids Research database issue lists more than 50 databases providing information on miRNAs. As with the predictions of TF binding, it is possible to use these resources to explore the links between mutations and their corresponding sites. However, the methods currently available still cannot provide systematic predictions of the consequences of mutations in regions coding for miRNAs and other non-coding RNAs. Indeed, such approaches are becoming increasingly more difficult owing to the emergence of new forms of complex RNA, which pose further challenges to these prediction methods (reviewed in [118]).
Even if sequence analysis alone cannot provide a complete solution to the analysis of mutations in non-coding regions, combining such approaches with targeted gene expression experiments can shed further light on such events. In the context of personalized cancer treatment, combining genome and RNA sequencing of the same samples could enable the variation in coding capacity of different variants to be assessed directly. Hence, new methods and tools will be required to support the systematic analysis of such combined datasets.
In summary, predicting the functional consequences of point mutations in coding and non-coding regions still remains a challenge, requiring new and more powerful computational methods and tools. However, despite the inherent limitations, several useful methods and resources are now available, which, in combination with targeted experiments, should be explored further to analyze mutations more reliably in a context of personalized medicine.

Cancer and signaling pathways
Cancer has been repeatedly described as a systems disease. Indeed, the process of tumor evolution from primary to malignant forms, including metastasis to other tissues, involves competition between various cell lineages struggling to adapt to the changing conditions, both within and around the tumor. This complex process is closely associated with the occurrence of mutations and genetic alterations. In fact, it seems likely that rather than individual mutations themselves, combinations of mutations provide cell lineages with an advantage in terms of growth and their invasive capabilities. Given the complexity of this process, more elaborate biological models are needed to account for the role of networks of mutations in this competition between cell lineages [74].
Analyzing alterations in signaling pathways, as opposed to directly comparing mutated genes, has produced significant progress in interpreting cancer genome data [26]. In this study [119], a link between pancreatic cancer and certain specific signaling pathways was detected by carefully mapping the mutations detected in a set of cases. From this analysis, the general DNA damage pathway and several other pathways were broadly identified, highlighting the possibility of using drugs that target the proteins in these pathways to treat pancreatic cancer. Indeed, it was also relevant that the results from one patient in this study contradicted the relationship reported between pancreatic cancer and mutations in the DNA damage pathway. A manual analysis of the mutations in this patient revealed the crucial importance for treatment of a mutation in the PALB2 gene, a gene not considered to be a component of the DNA damage pathway in the signaling database at the time of the initial analysis, even though it was clearly associated with the pathway in the scientific literature [27]. This observation serves as an important reminder of the incomplete nature of the information organized in the current databases, the need for careful fact-checking and the difficulty in separating reactions that are naturally linked in cells into human-annotated pathways.
Figure 2 (caption fragment). ... (Table 2); (d) an alignment of related sequences, including information about conserved and variable positions; (e) the position of the mutations in the corresponding protein structure (when available); (f) sentences related to the specific mutations from [77]; (g) information about the function and interactions of the protein kinase extracted from PubMed with the iHOP system [149,150]. A detailed description of the wKinMut system can be found in [147] and in the documentation of the web site [148].
From a systems biology viewpoint, it is clear that detecting common elements in cancer by analyzing mutations at the protein level is fraught with difficulty. Thus, shifting the analysis to the systems level by considering the pathways and cellular functions affected might offer a more general view of the relationship between mutations and phenotypes, helping to detect common biological alterations associated with specific types of cancer.
This situation was illustrated in our systematic analysis of cancer mutations and cancer types at the pathway and functional levels [120]. The associated system (Figure 3) allows the types of cancer and associated pathways to be explored, and it identifies common features in the input information (mutations obtained from small- and large-scale studies).
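One common way to move from individual mutated genes to the pathway level is an over-representation test. The Python sketch below uses a hypergeometric tail probability; this is offered as a generic illustration and is not necessarily the statistic used in [120]. The gene identifiers, pathway names and universe size are all invented placeholders.

```python
from math import comb


def hypergeom_pvalue(universe, pathway_size, n_mutated, overlap):
    """P(X >= overlap) when n_mutated genes are drawn without replacement
    from a universe that contains pathway_size pathway genes."""
    upper = min(pathway_size, n_mutated)
    total = comb(universe, n_mutated)
    tail = sum(comb(pathway_size, k) * comb(universe - pathway_size, n_mutated - k)
               for k in range(overlap, upper + 1))
    return tail / total


def pathway_enrichment(mutated, pathways, universe):
    """Return (p-value, pathway name, shared genes) tuples, best first."""
    results = []
    for name, members in pathways.items():
        hits = mutated & members
        p = hypergeom_pvalue(universe, len(members), len(mutated), len(hits))
        results.append((p, name, sorted(hits)))
    return sorted(results)


# Toy input: placeholder gene identifiers, not real annotations.
toy_pathways = {
    "pathway_A": {"g1", "g2", "g3", "g4", "g5"},
    "pathway_B": {f"g{i}" for i in range(6, 16)},
}
toy_mutated = {"g1", "g2", "g3", "g20"}
ranking = pathway_enrichment(toy_mutated, toy_pathways, universe=100)
for p, name, hits in ranking:
    print(f"{name}: p={p:.2e}, mutated members={hits}")
```

In practice the p-values would need correction for the number of pathways tested, and the result depends entirely on how completely the pathways are annotated, which is the limitation discussed in this section.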
To overcome the limitations in defining the pathways and cell functions, as demonstrated in the study of pancreatic cancer [119], more flexible definitions of pathways and cell functions must be considered. Improvements to the main pathway information databases (that is, KEGG [121] and Reactome [122]) might be made possible by incorporating text mining systems to facilitate the task of annotation [123]. A further strategy to help detect proteins associated with specific pathways that might not have been detected by earlier biochemical approaches is to use information relating to the functional connections between proteins and genes, including gene control and protein interaction networks. For example, proteins that form complexes with other proteins in a given pathway can be considered as part of that pathway [124]. Candidates to be included in such analyses would be regulators, phosphatases and proteins with connector domains, in many cases corresponding to proteins that participate in more than one pathway and that provide a link between related cellular functions.
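The idea of extending a pathway with interaction partners can be sketched in a few lines of Python. The example below proposes as candidates any protein that interacts with at least `min_links` annotated pathway members. The threshold, the function name and the edge list are assumptions made for illustration; the edges are written by hand and were not retrieved from an interaction database, and the method of [124] may differ.

```python
from collections import defaultdict


def expand_pathway(members, interactions, min_links=2):
    """Propose proteins outside the pathway that interact with at least
    min_links pathway members. Returns {candidate: [linked members]}."""
    neighbours = defaultdict(set)
    for a, b in interactions:
        neighbours[a].add(b)
        neighbours[b].add(a)

    candidates = {}
    for protein, partners in neighbours.items():
        if protein in members:
            continue  # already annotated in the pathway
        links = partners & members
        if len(links) >= min_links:
            candidates[protein] = sorted(links)
    return candidates


# Toy annotation and hand-written edges (illustrative only).
toy_members = {"BRCA1", "BRCA2", "RAD51"}
toy_interactions = [
    ("PALB2", "BRCA1"),
    ("PALB2", "BRCA2"),
    ("BRCA2", "RAD51"),
    ("GENE_X", "RAD51"),   # hypothetical protein with a single link
]
proposed = expand_pathway(toy_members, toy_interactions)
print(proposed)
```

With this toy input, PALB2 is proposed because it is linked to two annotated members, whereas the hypothetical GENE_X falls below the threshold. Raising or lowering `min_links` trades sensitivity against the risk of pulling in promiscuous interactors.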
Even if the network and pathway-based approaches are a clear step forward in analyzing the consequences of mutations, it is necessary to be realistic about their present limitations. Current approaches to network analysis represent static scenarios where spatial and temporal aspects are not taken into account: for example, the tissue and stage of tumor development are not considered. Furthermore, important quantitative aspects, such as the amount of proteins and the kinetic parameters of reactions, are generally not available. In other words, we still do not have at hand the comprehensive quantitative and dynamic models necessary to fully understand the consequences of mutations at the physiological level. Indeed, generating such models would require considerable experimental and computational effort, and as such it remains one of the main challenges in systems biology today, if not the main challenge.

Linking drugs to genes/proteins and pathways
Even if comprehensive network-based approaches provide valuable information about the distribution of mutations and their possible functional consequences, they are still far from helping us reach the final objective of designing personalized cancer treatment. The final key preclinical stage is to associate the variation in proteins and pathways with drugs that directly or indirectly affect their function or activity. This is a direction that opens up a world of possibilities and may change the whole field of cancer research [125].
To go from possibilities to realities will require tools and methods that bring together the protein and pharmaceutical worlds (Table 3). The challenge is to identify proteins that when targeted by a known drug will interrupt the malfunctions in a given pathway or signaling system. This means that to identify potentially appropriate drugs, their effects must be described in different phases. First, adequate information must be compiled about the drugs and their targets in the light of our incomplete knowledge on the action in vivo of many drugs and the range of specificity in which many current drugs work. Second, the extent to which the effect of mutations that interrupt or overstimulate signaling pathways can be counteracted by the action of drugs must be assessed. This is a particularly difficult problem that requires an understanding of the consequences of the mutations at the network level, and the capacity to predict the appropriate levels of the network that can be used to counteract them (see above). Furthermore, the margin of operation is limited because most drugs tend to remove or diminish protein activity, as do most mutations. Hence, potential solutions will often depend on finding a node of the network that can be targeted by a drug and upregulated.
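The matching step between altered proteins and drug actions can be illustrated with a deliberately simplified Python sketch. It assumes that each alteration has already been classified as a gain or a loss of activity and that a table of drug actions is available. The drug names, targets and the `DrugAction` structure are hypothetical and do not reflect the schema of PharmGKB or any other resource in Table 3.

```python
from typing import NamedTuple


class DrugAction(NamedTuple):
    drug: str
    target: str
    action: str  # "inhibitor" or "activator"


# A gain of activity calls for an inhibitor, a loss for an activator.
OPPOSING_ACTION = {"gain": "inhibitor", "loss": "activator"}


def candidate_drugs(alterations, drug_actions):
    """Map each altered protein to the drugs whose action opposes the
    predicted effect of its mutation. alterations: {protein: 'gain'|'loss'}."""
    result = {}
    for protein, effect in alterations.items():
        wanted = OPPOSING_ACTION[effect]
        result[protein] = sorted(
            d.drug for d in drug_actions
            if d.target == protein and d.action == wanted
        )
    return result


# Hypothetical drug table and alterations (placeholders, not real drugs).
toy_table = [
    DrugAction("DRUG_1", "KINASE_A", "inhibitor"),
    DrugAction("DRUG_2", "KINASE_A", "inhibitor"),
    DrugAction("DRUG_3", "REPAIR_B", "inhibitor"),
]
toy_alterations = {"KINASE_A": "gain", "REPAIR_B": "loss"}
matches = candidate_drugs(toy_alterations, toy_table)
print(matches)
```

The empty list returned for the loss-of-activity protein reflects the narrow margin of operation described above: because most drugs inhibit, a loss usually cannot be countered at the mutated protein itself, and the search has to move to other nodes of the network.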
Given the limited precision of current genome analysis strategies (as described above), the large number of potential mutations and possible targets related to cancer phenotypes are difficult to disentangle. Similarly, the limited precision of the drug-protein target relationships makes reducing the genome analysis to the identification of a single potential drug almost impossible. Fortunately, the use of complementary animal models (avatar mice, see above) consistently increases the number of possible combinations of drugs that can be tested for each specific case. Perhaps the best example of the possibilities of current systems is the PharmGKB resource [126] (Table 3), which was recently used to calculate the drug response probabilities after a careful analysis of the genome of a single individual [127]. Indeed, this approach provided an interesting example of the technical and organizational requirements of such an application (reviewed in [128]).
Toxicology is an increasingly important field at the interface between genomics and disease, not least because of its influence on drug administration and its strategic importance for pharmaceutical companies. An important advance in this area will be to integrate information on mutations (and predictions of their consequences) within the context of a gene/protein, disease and drug network. In this area, the cooperation between pharmaceutical companies and research groups in the eTOX project [129] of the European 'Innovative Medicine Initiative' platform is particularly relevant (see also other IMI projects related to subjects discussed in this section [130]). From our knowledge of disease-linked genes and protein-related drugs, the connection between toxicology and the secondary effects of drugs has been used to find associations between necrosis of breast and lung cancer [131]. Recent work has also achieved drug repositioning using analysis of expression profiles [132,133] and analyzed drug relationships using common secondary effects [134].

Conclusions and future directions
We have presented here a global vision of the issues associated with the computational analysis of personalized cancer data, describing the main limitations and possible developments of current approaches and the currently available computational systems.
The development of systems to analyze individual genome data is an ongoing activity in many groups and institutions, with diverse implementations tailored to their bioinformatics and clinical units. In the future, this type of pipeline will allow oncology units at hospitals to offer treatment for individual cancer patients based on the comparison of their normal and cancer genomic compositions with those of successfully treated patients. However, this will require the exhaustive analysis of genomic data within an analytical platform that covers the range of topics described here. Such genomic information has to be considered as an addition to the rest of the physiological and medical data that are essential for medical diagnosis.
In practice, it seems likely that the initial systems will work in research environments to explore genomic information in cases of palliative treatment and most probably in cancer relapse. Specific regulations apply in these scenarios, and the time between the initial and secondary events provides a wider time window for the analysis. These systems, such as the one we use in our institution, will combine methods and results in a more flexible and exploratory setup than will need to be implemented in regulated clinical setups. The transition from such academic software platforms will require professional software development following industrial standards, and it will need to be developed in consortia between research and commercial partners. Initiatives such as the European flagship project proposal on Information Technology Future of Medicine (ITFoM) [135] could be an appropriate vehicle to promote such developments.
Figure 3 (caption fragment). ... [122,151]. The upper panel shows the menus for selecting specific cancer studies, databases for pathway analysis (or set of annotations) and the level of confidence required for the relationships. From the user's requests, the system identifies the pathways or functional classes common to the different cancer studies, and the interface allows the corresponding information to be retrieved. The graph represents various cancer studies (those selected in the 'tumor types' panel are represented by red circles) using the pathways extracted from the Reactome database [152] as the background (the reference selected in the 'Annotation databases' panel and represented by small triangles). For the selected lung cancer study, the 'Lung tumor mutated genes' panel provides a link to the related genes indicating the database (source) from where the information was extracted. The lower panel represents the information on the pathways selected by the user ('innate immunity signaling') as directly provided by the Reactome database.
The incorporation of genomic information into clinical practice will require consultation with specialists in relevant areas, including genomics, bioinformatics, systems biology, pathology and oncology. Each of the professionals involved will have their own specific requirements, and thus the driving forces for users and developers of this system will naturally differ:
(1) Clinicians, the end users of the resulting data, will require an analytical platform that is sufficiently accurate and robust to work continuously in a clinical setting. This system must be easy to understand and capable of providing validated results at each stage of the analysis.
(2) Bioinformaticians developing the analytical pipeline will require a system with a modular structure that is based on current programming paradigms and that can be easily expanded by incorporating new methods. New technology should be easy to introduce, so that the methods used can be continuously evaluated, and they should be capable of analyzing large amounts of heterogeneous data. Finally, this system will have to fulfill stringent security and confidentiality requirements.
(3) Computational biologists developing these methods will naturally be interested in the scientific issues behind each stage of the analytical platform. They will be responsible for designing new methods, and they will have to collaborate with clinicians and biologists studying the underlying biological problems (the molecular mechanisms of cancer).
A significant part of the challenge in developing personalized cancer treatments will be to ensure effective collaboration between these heterogeneous groups (for a description of the technical, practical, professional and ethical issues see [127,136]), and indeed, better training and technical facilities will be essential to facilitate such cooperation [137].
In the context of the integration of bioinformatics into clinical practice, ethical issues emerge as an essential component. The pipelines and methods described here have the capacity to reveal unexpected relationships between genomic traces and disease risks. It is currently of particular interest to define how such findings that are not directly relevant for the medical condition at hand should be dealt with, for example, the possible need to disclose this additional information to the family (such as children of the patient), as they could be affected by the mutations. For a discussion on the possible limitations of release of genome results, see [138-141].
At the very basic technical level, there are at least two key areas that must be improved to make these developments possible. Firstly, the facilities used for the rapid exchange and storage of information must become more advanced and, in some cases, additional confidentiality constraints will need to be introduced on genomic information, scientific literature, toxicology and drug-related documentation, ongoing clinical trial information and personal medical records. Secondly, adequate interfaces must be tailored to the needs of the individual professionals, which will be crucial to integrate the relevant information. User accessibility is a key issue in the context of personalized cancer treatment, as well as in bioinformatics in general.
The organization of this complex scenario is an important aspect of personalized cancer medicine, which must also include detailed discussions with patients and the need to deal with the related ethical issues, although this is beyond the scope of this review. The involvement of the general public and of patient associations will be an important step towards improved cancer treatment, presenting new and interesting challenges for bioinformaticians and computational biologists working in this area.

Competing interests
The authors declare that they have no competing interests.