Understanding cellular function and disease with comparative pathway analysis

Pathway analysis is important in interpreting the functional implications of high-throughput experimental results, but robust comparison across platforms and species is problematic. A new approach, Pathprinting, provides a cross-platform, cross-species comparative analysis of pathway expression signatures. This method calculates pathway-level statistics from gene expression across nearly 180,000 microarrays in the Gene Expression Omnibus. Pathprinting can accurately retrieve phenotypically similar samples and identify sets of human and mouse genes that are prognostic in cancer. See related Research paper, http://genomemedicine.com/content/5/7/68

enable researchers and clinicians to use data from multiple platforms, experiments and animal models to explore complex disease.
The Pathprinting analysis pipeline can be classified as a second-generation method according to the criteria of Atul Butte and colleagues [4], who recognized three generations of these methods. First-generation methods take a list of genes over-, under- or differentially expressed in a study, compute the proportion of pathway members therein compared with the proportion in a background dataset, and statistically test for enrichment. Second-generation methods improve on this by using information from the entire experiment (all genes, ranked according to a gene-level statistic) to generate pathway-level statistics that capture coordinated changes in the expression of genes in a pathway or gene set. Third-generation methods move beyond treating pathways as lists of genes, adding information about the connectivity and directionality of interactions. In this sense, Pathprinting is a second-generation method, but in moving beyond canonical pathways and including information from nearly the entire corpus of microarray data to generate pathway statistics (fingerprints), it captures crucial information on conserved and divergent coexpression that is absent from other methods.
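As an illustration of the first-generation idea described above, the following Python sketch tests one pathway for over-representation in a list of selected genes using an upper-tail hypergeometric probability. The hypergeometric test is one common choice and is assumed here; the function name, its parameters and the counts in the example are hypothetical and are not taken from Butte and colleagues [4] or from Pathprinting.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: genes measured in the experiment (the background set)
    n_pathway:    background genes annotated to the pathway
    n_selected:   genes in the differentially expressed list
    n_overlap:    selected genes that are also pathway members
    """
    max_overlap = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    # Sum the probabilities of every overlap at least as large as the one observed.
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 10 measured genes, 5 in the pathway,
# 5 selected, and all 5 selected genes are pathway members.
p = enrichment_p_value(10, 5, 5, 5)
print(f"p = {p:.4f}")
```

Because such a test only sees the selected list, it discards the ranking of all other genes, which is the limitation that second-generation methods address.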

Expression-based pathway signatures across platforms and species
Hide and colleagues' approach [3] was to retrieve normalized data (176,971 arrays) for six species, spanning 31 single-channel array platforms, from Gene Expression Omnibus (GEO) and to map probes to Entrez Gene identifiers. They computed a mean expression level for each gene by combining values for multiple probes representing single Entrez genes. They sourced pathway gene sets from KEGG, Reactome, Wikipathways and Netpath. To avoid introducing a bias towards well-annotated pathways, the authors used interactions derived from gene co-expression, protein-protein and protein-domain databases, Gene Ontology annotations and text mining to generate a 'functional interaction network' covering 181,706 interactions involving 9,452 human genes. They then applied Markov clustering to decompose the connected portion of this graph into 144 functional interaction clusters (static modules) covering 6,458 genes, 1,542 of which are not found in these pathway databases. This process yields 633 human pathways and static modules. Using NCBI Homologene, they then mapped the corresponding gene sets in the other five species.
Figure 1 illustrates a key part of the workflow. The authors ranked genes by expression level and computed the mean squared rank. The null hypothesis was generated by sample permutation of all arrays of the same platform type, thereby preserving gene-expression correlation structure within gene sets, particularly within pathways. As the expected distribution is unknown, background distributions were fitted to a two-component mixture model, with the normal component corresponding to the core distribution of pathway scores and the uniform distribution corresponding to expression outliers. From these they calculated a probability of expression, that is, the probability that a pathway expression score belongs to the uniform component, and assigned it a score of +1, zero or -1. These ternary scores formed components of the Pathprinting vector.
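The scoring step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy expression values, the 0.5 cutoff and the rule that takes the sign from the position of the score relative to the mean of the core distribution are all assumptions made for the example, and the mixture-model fit that would supply the outlier probability is not shown.

```python
def rank_genes(expression):
    """Rank genes by expression level within one array (1 = lowest)."""
    # Ties are broken by gene name so that the ranking is deterministic.
    ordered = sorted(expression, key=lambda gene: (expression[gene], gene))
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def mean_squared_rank(ranks, gene_set):
    """Pathway-level score: mean of the squared ranks of the member genes."""
    present = [gene for gene in gene_set if gene in ranks]
    if not present:
        raise ValueError("no gene of the set was measured on this array")
    return sum(ranks[gene] ** 2 for gene in present) / len(present)


def ternary_score(p_outlier, score, core_mean, threshold=0.5):
    """Map an outlier probability to +1, 0 or -1.

    p_outlier is assumed to be the probability that the score belongs to the
    uniform (outlier) component of the fitted background mixture. The
    threshold and the sign rule are hypothetical choices for this sketch.
    """
    if p_outlier < threshold:
        return 0
    return 1 if score > core_mean else -1


# Toy array with four genes and one two-gene pathway (hypothetical values).
expression = {"A": 5.0, "B": 1.0, "C": 3.0, "D": 9.0}
ranks = rank_genes(expression)                 # B=1, C=2, A=3, D=4
score = mean_squared_rank(ranks, {"A", "D"})   # (3**2 + 4**2) / 2
print(score)
```

Squaring the ranks gives more weight to highly expressed genes than a plain mean rank would, and working on ranks rather than raw intensities is what makes scores comparable across platforms.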
Within a group of fingerprints (such as for a tissue type) the mean score of each extended pathway was then binarized (+1 if above the threshold, -1 if below) and summarized in a vector (consensus fingerprint) that represents the set of functional modules significantly over- and underexpressed in a cell type or condition. This associated a set of pathway activities with a phenotype.
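A consensus fingerprint of this kind might be computed as in the sketch below. It rests on one reading of the thresholding rule that the text does not confirm: a symmetric cutoff, with +1 assigned when the mean exceeds the threshold, -1 when it falls below the negative threshold, and 0 otherwise. The function name and the 0.75 default are hypothetical.

```python
def consensus_fingerprint(fingerprints, threshold=0.75):
    """Summarize a group of ternary fingerprints as one consensus vector.

    fingerprints: list of equal-length lists with entries in {-1, 0, +1}
    threshold:    hypothetical symmetric cutoff applied to the per-pathway mean
    """
    if not fingerprints:
        raise ValueError("at least one fingerprint is required")
    n_pathways = len(fingerprints[0])
    if any(len(fp) != n_pathways for fp in fingerprints):
        raise ValueError("all fingerprints must have the same length")

    consensus = []
    for j in range(n_pathways):
        mean = sum(fp[j] for fp in fingerprints) / len(fingerprints)
        if mean > threshold:
            consensus.append(1)    # consistently over-expressed in the group
        elif mean < -threshold:
            consensus.append(-1)   # consistently under-expressed in the group
        else:
            consensus.append(0)    # no consistent signal
    return consensus


# Four hypothetical samples of one tissue, scored on three pathways.
group = [[1, 0, -1], [1, 1, -1], [1, 0, -1], [1, -1, -1]]
print(consensus_fingerprint(group))
```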
It is straightforward to calculate a 'functional distance' between fingerprint vectors. This distance is necessarily threshold-dependent, and the authors [3] considered at some length how thresholds might be optimized for the problem at hand. By seeding a consensus fingerprint profile, phenotypes can be matched into an expression database (here, GEO). The question of threshold significance is also relevant at this point, and the authors present a simple but appealing approach that assumes that the database contains a few highly matched but many non-matched samples.
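As a rough illustration of distance-based matching, the Python sketch below ranks a small in-memory collection of fingerprints by their distance from a consensus fingerprint. The Manhattan distance is assumed here as one natural choice for ternary vectors; the text above does not specify the metric. The sample identifiers and vectors are invented, and the significance thresholding discussed by the authors is not reproduced.

```python
def functional_distance(a, b):
    """Manhattan distance between two fingerprint vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_matches(consensus, database):
    """Return (sample_id, distance) pairs, closest match first.

    database: dict mapping a sample identifier to its fingerprint vector.
    """
    scored = [
        (sample_id, functional_distance(consensus, fingerprint))
        for sample_id, fingerprint in database.items()
    ]
    # Sort by distance, then by identifier so that ties are ordered reproducibly.
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))


# Hypothetical consensus fingerprint and three hypothetical database samples.
consensus = [1, 0, -1]
database = {
    "sample_1": [1, 0, -1],
    "sample_2": [-1, 0, 1],
    "sample_3": [1, 1, -1],
}
for sample_id, distance in rank_matches(consensus, database):
    print(sample_id, distance)
```

With ternary entries, a pathway scored in opposite directions contributes 2 to the distance and a pathway scored 0 against a non-zero value contributes 1, so the ranking depends on the thresholds used to produce the fingerprints.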
Code and data were implemented in the R package Pathprint. As few research groups are likely to have the necessary resources to implement this independently, the authors [3] helpfully provide pre-computed Pathprinting scores for these six species in a searchable database.

Applications and remaining challenges
The authors [3] briefly describe computational experiments illustrating three applications of Pathprinting. Using 127 human and mouse expression datasets, the authors derived an embryonic stem cell fingerprint indicating pluripotency and matched it to GEO. Of the top 1,000 matches, 90% are induced pluripotent stem cells from 140 human and mouse studies over 13 platforms; the others are cancer cell lines known to express embryonic stem cell functions. In another experiment, they used Pathprinting to jointly analyze human and mouse hematopoietic lineages; parsimony analysis of the individual Pathprinting states resolved the major myeloid and lymphoid lineages, irrespective of species. They also used Pathprinting to recognize four stemness-associated self-renewal pathways shared between human and mouse. The authors demonstrated the clinical relevance of these four pathways by computing Pathprints for four independent clinical studies of gene expression in patients with acute myeloid leukemia; high scores for these pathways were significantly associated with poor patient outcomes, and together these pathways had greater prognostic value than did the human or mouse pathways on their own.
Scope remains for further development of the Pathprinting framework. Not unreasonably, the authors [3] did not re-normalize historical array data using modern approaches. They averaged probe expression at the gene level, although this flattens out the signal from alternative splicing. Alternative approaches are available for orthology assignment. Their phenotype-matching threshold ignores potential multimodal distributions in tissue datasets; for example, datasets annotated as 'kidney' include not only normal kidney but also disease states including cancer, which can have different gene copy numbers and transcriptional programs. One could imagine (as do the authors) a feature-selection approach to identify genes that contribute most toward performance.
Finally, individual variation and environmental effects remain largely outside this paradigm. Like other second-generation pathway analysis techniques [4], this approach [3] ignores topology once the pathway- or module-specific gene sets have been defined.
Unlike other pathway analysis tools, however, Pathprinting is designed to enable integrated comparative pathway analysis. Although popular platforms, including GenMAPP [5] and DAVID [6], support pathway analyses for key model organisms, applying them cross-species requires the initial individual analyses to be followed by post hoc meta-analysis. OSCAR [7] enables integrated cross-species co-expression analysis and clustering, but for only a few datasets and without built-in functional analysis. PlaNet [8] performs co-expression and network analysis between Arabidopsis and six crop species, but only for Affymetrix GeneChip data. Pathprinting moves well beyond all these approaches by supporting the large-scale comparative functional analysis of clinical expression data across experiments, species and platforms within a computational framework.
Waddington [9] famously depicted cellular phenotype as a canalized landscape, the topography of which is actively shaped by underpinning cables tethered to genetic loci. Individual cables are connected not only to the landscape but often to each other as well, forming a web of epistatic interactions. From a twenty-first century 'omic' perspective, it is difficult not to reinterpret this substructure as genes linked to their expression products through a network of physical interactions, with cellular phenotypes, both structural and functional, emerging from this network. In this way, functional phenotype in its diverse contexts arises from definable subsets of the cellular network, such as local protein interactions or signaling reactions. To a first approximation, then, modules of molecular interaction are computationally relevant units of functional phenotype.

Pathways and modules as computational units of cellular function
Moving from the identification of characteristic gene expression profiles to delineating the pathways and networks that mechanistically underlie cellular function and disease has been, and remains, a major focus of molecular systems biology and systems medicine. Hide and colleagues [3] now provide the most comprehensive collection of modules so far, and a robust, principled approach to quantifying and comparing their effects along developmental trajectories, across species and in different patient groups. Rhodes and Chinnaiyan [10] envisaged an integrative analysis for molecular cancer research that allows experimental results to be analyzed in the context of existing data and compared on the basis of biological similarity. The achievement of Hide and colleagues brings this vision to reality.

Competing interests
The authors declare that they have no competing interests.