Understanding cellular function and disease with comparative pathway analysis
© BioMed Central Ltd 2013
Published: 26 July 2013
Pathway analysis is important in interpreting the functional implications of high-throughput experimental results, but robust comparison across platforms and species is problematic. A new approach, Pathprinting, provides a cross-platform, cross-species comparative analysis of pathway expression signatures. This method calculates pathway-level statistics from gene expression across nearly 180,000 microarrays in the Gene Expression Omnibus. Pathprinting can accurately retrieve phenotypically similar samples and identify sets of human and mouse genes that are prognostic in cancer.
See related Research paper, http://genomemedicine.com/content/5/7/68
Over the past two decades, microarray technologies have been used to characterize gene expression in various contexts, notably complex human disease and corresponding animal models. Many, perhaps most, analyses could be enriched by comparison with other experiments across species and platforms. However, experimental and platform biases tend to drown out changes in biological signal, and comparison across species presents a further challenge: it is difficult to accurately identify orthologs. Aggregating gene sets based on function is known to improve consistency, but fewer than half of human genes are represented in pathway databases. In a recent article published in Genome Medicine, Winston Hide and colleagues describe Pathprinting, a statistics-based approach to map gene expression to function in humans and the main animal models of disease (mouse, rat, zebrafish, fruit fly and nematode). By standardizing pathway analysis, basing it more globally across functional interactions and controlling for biases, Pathprinting will enable researchers and clinicians to use data from multiple platforms, experiments and animal models to explore complex disease.
The Pathprinting analysis pipeline can be classified as a second-generation method according to the criteria of Atul Butte and colleagues, who recognized three generations of these methods. First-generation methods take a list of genes over-, under- or differentially expressed in a study, compute the proportion of pathway members therein compared with the proportion in a background dataset, and statistically test for enrichment. Second-generation methods improve on this by using information from the entire experiment (all genes, ranked according to a gene-level statistic) to generate pathway-level statistics that capture coordinated changes in the expression of genes in a pathway or gene set. Third-generation methods move beyond treating pathways as lists of genes, adding information about the connectivity and directionality of interactions. In this sense, Pathprinting is a second-generation method, but in moving beyond canonical pathways and including information from nearly the entire corpus of microarray data to generate pathway statistics (fingerprints), it captures crucial information on conserved and divergent co-expression that is absent from other methods.
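The difference between the first two generations can be made concrete with a small sketch. The code below is illustrative only and is not the Pathprinting implementation: the function names, gene labels and the choice of a normalized mean rank as the pathway-level statistic are assumptions made for this example. The first function performs a first-generation over-representation test (upper-tail hypergeometric probability); the second computes a simple second-generation score from all genes in the experiment.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """First-generation test: P(overlap >= n_overlap) under the hypergeometric
    null, i.e. the selected gene list is a random draw from the background."""
    max_overlap = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def mean_rank_score(gene_stat, pathway):
    """Second-generation style score (an assumed, simplified statistic):
    rank every gene by its gene-level statistic, then return the mean rank
    of the pathway members, scaled to the interval (0, 1]."""
    # Sort by statistic, breaking ties by gene name so the ranking is stable.
    ordered = sorted(gene_stat, key=lambda g: (gene_stat[g], g))
    rank = {gene: i + 1 for i, gene in enumerate(ordered)}
    members = [g for g in pathway if g in rank]
    if not members:
        raise ValueError("no pathway member was measured in this experiment")
    return sum(rank[g] for g in members) / (len(members) * len(ordered))


# Hypothetical numbers: 20 background genes, a 5-gene pathway,
# 4 differentially expressed genes, 3 of which fall in the pathway.
p_value = hypergeom_enrichment_p(20, 5, 4, 3)

# Hypothetical gene-level statistics for a four-gene experiment.
score = mean_rank_score({"A": 0.1, "B": 2.0, "C": 1.5, "D": -0.3}, {"B", "C"})
```

The first function needs only a thresholded gene list, whereas the second uses every measured gene, which is the distinction drawn above.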
Hide and colleagues' approach was to retrieve normalized data (176,971 arrays) for six species, spanning 31 single-channel array platforms, from Gene Expression Omnibus (GEO) and to map probes to Entrez Gene identifiers. They computed a mean expression level for each gene by combining values for multiple probes representing single Entrez genes. They sourced pathway gene sets from KEGG, Reactome, WikiPathways and NetPath. To avoid introducing a bias towards well-annotated pathways, the authors used interactions derived from gene co-expression, protein-protein and protein-domain databases, Gene Ontology annotations and text mining to generate a 'functional interaction network' covering 181,706 interactions involving 9,452 human genes. They then applied Markov clustering to decompose the connected portion of this graph into 144 functional interaction clusters (static modules) covering 6,458 genes, 1,542 of which are not found in these pathway databases. This process yields 633 human pathways and static modules. Using NCBI HomoloGene, they then mapped the corresponding gene sets in the other five species.
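Two of the data-preparation steps described above, averaging probes to the gene level and carrying a human gene set over to another species, can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names, probe labels and identifiers are hypothetical, and the homolog table is assumed to be a simple one-to-one dictionary, which real homology data need not be.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Average the values of all probes that map to the same gene.
    Probes without a gene mapping are dropped."""
    by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            by_gene[gene].append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}


def map_gene_set(gene_set, homologs):
    """Translate a gene set into another species, keeping only the genes
    that have an entry in the (assumed one-to-one) homolog table."""
    return {homologs[gene] for gene in gene_set if gene in homologs}


# Hypothetical array: two probes measure gene 1001, one measures gene 1002,
# and probe p4 has no gene annotation.
probe_values = {"p1": 2.0, "p2": 4.0, "p3": 7.0, "p4": 1.0}
probe_to_gene = {"p1": "1001", "p2": "1001", "p3": "1002"}
gene_values = collapse_probes_to_genes(probe_values, probe_to_gene)

# Hypothetical human-to-mouse homolog table; gene 1003 has no homolog listed.
mouse_set = map_gene_set({"1001", "1002", "1003"}, {"1001": "m1", "1002": "m2"})
```

Genes without a listed homolog simply drop out of the mapped set, so a pathway can be smaller in one species than in another.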
It is straightforward to calculate a 'functional distance' between fingerprint vectors. This distance is necessarily threshold-dependent, and the authors considered at some length how thresholds might be optimized for the problem at hand. By seeding a consensus fingerprint profile, phenotypes can be matched against an expression database (here, GEO). The question of threshold significance is also relevant at this point, and the authors present a simple but appealing approach that assumes that the database contains a few highly matched but many non-matched samples.
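The matching procedure described above can be sketched as follows. Several details here are assumptions made for illustration and are not taken from the article: fingerprints are represented as equal-length vectors of discrete scores in {-1, 0, 1}, the distance is the Manhattan distance, the consensus rule is a cutoff on the per-pathway mean, and the retrieval cutoff `max_distance` is a fixed number rather than the significance threshold the authors derive. All sample identifiers are hypothetical.

```python
def fingerprint_distance(a, b):
    """Manhattan distance between two equal-length fingerprint vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must cover the same pathways")
    return sum(abs(x - y) for x, y in zip(a, b))


def consensus_fingerprint(fingerprints, min_agreement=0.75):
    """Build a consensus from seed fingerprints: a pathway is scored 1 or -1
    only if its mean score across the seeds reaches min_agreement in that
    direction; otherwise it is scored 0."""
    n = len(fingerprints)
    consensus = []
    for column in zip(*fingerprints):
        column_mean = sum(column) / n
        if column_mean >= min_agreement:
            consensus.append(1)
        elif column_mean <= -min_agreement:
            consensus.append(-1)
        else:
            consensus.append(0)
    return consensus


def retrieve(consensus, database, max_distance):
    """Return (distance, sample_id) pairs within max_distance, nearest first."""
    hits = [
        (fingerprint_distance(consensus, fingerprint), sample_id)
        for sample_id, fingerprint in database.items()
    ]
    return sorted(hit for hit in hits if hit[0] <= max_distance)


# Three hypothetical seed samples scored over four pathways.
seeds = [[1, -1, 0, 1], [1, -1, 1, 1], [1, 0, 0, 1]]
consensus = consensus_fingerprint(seeds, min_agreement=0.6)

# A hypothetical database of precomputed fingerprints.
database = {"s1": [1, -1, 0, 1], "s2": [1, -1, 1, 0], "s3": [-1, 1, 0, -1]}
matches = retrieve(consensus, database, max_distance=2)
```

Changing `min_agreement` changes which pathways enter the consensus, and therefore every distance computed from it, which is one way to see why the distance is threshold-dependent.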
The method and its data are implemented in the R package Pathprint. As few research groups are likely to have the resources to implement this independently, the authors helpfully provide pre-computed Pathprinting scores for these six species in a searchable database.
The authors briefly describe computational experiments illustrating three applications of Pathprinting. Using 127 human and mouse expression datasets, the authors derived an embryonic stem cell fingerprint indicating pluripotency and matched it to GEO. Of the top 1,000 matches, 90% are induced pluripotent stem cells from 140 human and mouse studies over 13 platforms; the others are cancer cell lines known to express embryonic stem cell functions. In another experiment, they used Pathprinting to jointly analyze human and mouse hematopoietic lineages; parsimony analysis of the individual Pathprinting states resolved the major myeloid and lymphoid lineages, irrespective of species. They also used Pathprinting to recognize four stemness-associated self-renewal pathways shared between human and mouse. The authors demonstrated the clinical relevance of these four pathways by computing Pathprints for four independent clinical studies of gene expression in patients with acute myeloid leukemia; high scores for these pathways were significantly associated with poor patient outcomes, and together these pathways had greater prognostic value than did the human or mouse pathways on their own.
Scope remains for further development of the Pathprinting framework. Not unreasonably, the authors did not re-normalize historical array data using modern approaches. They averaged probe expression at the gene level, although this flattens out the signal from alternative splicing. Alternative approaches are available for orthology assignment. Their phenotype-matching threshold ignores potential multimodal distributions in tissue datasets; for example, datasets annotated as 'kidney' include not only normal kidney but also disease states including cancer, which can have different gene copy numbers and transcriptional programs. One could imagine (as do the authors) a feature-selection approach to identify genes that contribute most toward performance. Finally, individual variation and environmental effects remain largely outside this paradigm.
Like other second-generation pathway analysis techniques, this approach ignores topology once the pathway- or module-specific gene sets have been defined. Unlike other pathway analysis tools, however, Pathprinting is designed to enable integrated comparative pathway analysis. Although popular platforms, including GenMAPP and DAVID, support pathway analyses for key model organisms, applying them cross-species requires the initial individual analyses to be followed by post hoc meta-analysis. OSCAR enables integrated cross-species co-expression analysis and clustering, but for only a few datasets and without built-in functional analysis. PlaNet performs co-expression and network analysis between Arabidopsis and six crop species, but only for Affymetrix GeneChip data. Pathprinting moves well beyond all these approaches by supporting the large-scale comparative functional analysis of clinical expression data across experiments, species and platforms within a computational framework.
Waddington famously depicted cellular phenotype as a canalized landscape, the topography of which is actively shaped by underpinning cables tethered to genetic loci. Individual cables are connected not only to the landscape but often to each other as well, forming a web of epistatic interactions. From a twenty-first century 'omic' perspective, it is difficult not to reinterpret this substructure as genes linked to their expression products through a network of physical interactions, with cellular phenotypes, both structural and functional, emerging from this network. In this way, functional phenotype in its diverse contexts arises from definable subsets of the cellular network, such as local protein interactions or signaling reactions. To a first approximation, then, modules of molecular interaction are computationally relevant units of functional phenotype.
Moving from the identification of characteristic gene expression profiles to delineating the pathways and networks that mechanistically underlie cellular function and disease has been, and remains, a major focus of molecular systems biology and systems medicine. Hide and colleagues now provide the most comprehensive collection of modules so far, and a robust, principled approach to quantifying and comparing their effects along developmental trajectories, across species and in different patient groups. Rhodes and Chinnaiyan envisaged an integrative analysis for molecular cancer research that allows experimental results to be analyzed in the context of existing data and compared on the basis of biological similarity. The achievement of Hide and colleagues brings this vision to reality.
GEO: Gene Expression Omnibus.
The authors acknowledge Australian Research Council CE0348221 and strategic funding from The University of Queensland.