Skip to main content

Large-scale data integration framework provides a comprehensive view on glioblastoma multiforme



Coordinated efforts to collect large-scale data sets provide a basis for systems level understanding of complex diseases. In order to translate these fragmented and heterogeneous data sets into knowledge and medical benefits, advanced computational methods for data analysis, integration and visualization are needed.


We introduce a novel data integration framework, Anduril, for translating fragmented large-scale data into testable predictions. The Anduril framework allows rapid integration of heterogeneous data with state-of-the-art computational methods and existing knowledge in bio-databases. Anduril automatically generates thorough summary reports and a website that shows the most relevant features of each gene at a glance, allows sorting of data based on different parameters, and provides direct links to more detailed data on genes, transcripts or genomic regions. Anduril is open-source; all methods and documentation are freely available.


We have integrated multidimensional molecular and clinical data from 338 subjects having glioblastoma multiforme, one of the deadliest and most poorly understood cancers, using Anduril. The central objective of our approach is to identify genetic loci and genes that have significant survival effect. Our results suggest several novel genetic alterations linked to glioblastoma multiforme progression and, more specifically, reveal Moesin as a novel glioblastoma multiforme-associated gene that has a strong survival effect and whose depletion in vitro significantly inhibited cell proliferation. All analysis results are available as a comprehensive website.


Our results demonstrate that integrated analysis and visualization of multidimensional and heterogeneous data by Anduril enables drawing conclusions on functional consequences of large-scale molecular data. Many of the identified genetic loci and genes having significant survival effect have not been reported earlier in the context of glioblastoma multiforme. Thus, in addition to generally applicable novel methodology, our results provide several glioblastoma multiforme candidate genes for further studies.

Anduril is available at

The glioblastoma multiforme analysis results are available at


Comprehensive characterization of complex diseases calls for coordinated efforts to collect and share genome-scale data from large patient cohorts. A prime example of such a coordinated effort is The Cancer Genome Atlas (TCGA), which currently provides more than five billion data points on glioblastoma multiforme (GBM) with the aim of improving diagnosis, treatment and prevention of GBM [1].

Translating genome-scale data into knowledge and further to effective diagnosis, treatment and prevention strategies requires computational tools that are designed for large-scale data analysis as well as for the integration of multidimensional data with clinical parameters and knowledge available in bio-databases. In addition, it is evident that until data integration tools are developed to the level that experimental scientists can independently interpret the vast amounts of data generated by genome-scale technologies, most of the potential of the generated data will be severely underexploited. In order to address these challenges, we have developed a data analysis and integration framework, Anduril, which facilitates the integration of various data formats, bio-databases and analysis techniques. Anduril manages and automates analysis workflows from importing raw data to reporting and visualizing the results. In order to facilitate interpretation of the large-scale data analysis results, Anduril generates a website that shows the most relevant features of each gene at a glance, allows sorting of data based on different parameters, and provides direct links to more detailed views of genes, transcripts, genomic regions, protein-protein interactions and pathways.

We demonstrate the utility of the Anduril framework by analyzing heterogeneous and multidimensional data from 338 GBM patients [1]. GBM is an aggressive brain cancer having a median survival of one year and is remarkably resistant to all current anti-cancer therapeutic regimens [2]. In order to understand the complex molecular mechanisms behind GBM, earlier efforts have analyzed data from one or two platforms, such as mutations, copy number and gene expression profiles and methylation patterns [37]. In contrast, we have analyzed all TCGA provided GBM data sets and collected the results into a comprehensive website that facilitates the interpretation of the data and allows an advanced view of genes and genomic regions crucial to GBM progression. Most importantly, Anduril can be applied to data from any accessible source.

Materials and methods

Documentation for algorithms, their parameters and usage in the analysis together with all results are available in Additional file 1.

Glioblastoma multiforme data set

The glioblastoma data set was originally released in 2008 [1] and has been updated online since then. An updated revision was used in the present work: comparative genomic hybridization array (aCGH), single nucleotide polymorphism (SNP), exon, gene expression and microRNA (miRNA) data were accessed May to August 2009, while methylation and clinical data were accessed October to November 2009. The data set consists of 338 primary glioblastoma patients with clinical annotations. Data were analyzed from the following microarray platforms: Affymetrix HU133A (269 GBM samples, 10 control samples), Affymetrix Human Exon 1.0 (298 GBM samples, 10 control samples), Agilent 244 k aCGH (238 GBM samples), Affymetrix SNP Array 6.0 (214 GBM blood samples), Illumina GoldenGate methylation array (243 GBM samples) and Agilent miRNA array (251 GBM samples, 10 control samples). Pre-normalized data (level 2) were used for gene, exon and miRNA expression and methylation arrays. Raw data (level 1) were used for aCGH and SNP platforms. Clinical annotations were used to compute the duration of patient survival in months from the initial diagnosis to death or to the last follow-up. The publicly available results in the present work do not reveal protected patient information.

Gene expression analyses

The gene and exon expression platforms include ten control samples from brain tissue extracted from non-cancer patients in addition to the glioblastoma samples. Transcript level expressions are calculated from the exon level expression data by considering the problem of transforming the exon-level data to transcripts as a least squares problem. For ith gene having m exons and n transcripts in Ensembl (v.58) we define a vector e i of length m that denotes the measured exon expressions, and an m times n matrix A i , where the values in each column denote if the exon belongs to the transcript (1) or not (0). Transcript expression values t i are solved from the equation A i t i = e i using the QR decomposition to ensure numerical stability. The gene level expression values for the exon array platform were computed by taking a median of the intensity of all the exons linked with the gene in Ensembl.

Differential expression is determined by computing fold changes and applying a t-test between glioblastoma and control groups, followed by multiple hypotheses correction [8]. Fold changes are computed by dividing the mean of glioblastoma expression values by the mean of control expression values.

Transcriptome survival analysis

Differentially expressed splice variants were selected as the basis of expression survival analysis. There were 8,887 splice variants (out of a total 75,083) that were differentially expressed having absolute fold change >2 and a multiple hypothesis corrected P-value < 0.05. For these splice variants we computed sample-specific fold changes by dividing the sample expression value by the mean of control expression values. These fold changes (FC) were discretized into classes denoted by '-1' (underexpression, FC < 0.5), '1' (overexpression, FC >2) and '0' (stable expression), and the samples were divided into three groups accordingly. This grouping was used in Kaplan-Meier survival analysis and groups with <20 patients were excluded. A log-rank test was computed for each differentially expressed splice variant.

SNP survival analysis

Affymetrix SNP 6.0 genotypes were called with the CRLMM algorithm [9]. Samples with a signal-to-noise ratio below five and markers with call probabilities below 0.95 were discarded. We restricted our analysis to a genetically homogeneous pool of samples by using only ethnically similar samples. Markers with a relative minor allele frequency below 0.1 were excluded from the survival analysis. The study time in the survival analysis was 36 months. If the size of the patient group with the rare homozygote genotype in a marker was less than 15, or its frequency was less than 0.1, then the rare homozygote group was combined with the heterozygote group. The uncorrected P-value limit was set to 0.0001.

Copy number and expression integration

Normalized aCGH data from tumor samples were segmented using circular binary segmentation [10]. A segment was called aberrated if its mean was over 0.632 or below -0.632. These thresholds were estimated from the 64 blood versus blood controls as two standard deviations from the mean of normalized probe intensities.

Based on gain and loss frequencies for each splice variant, aCGH and splice variant expression data were integrated with the statistical method originally applied to breast cancer [11, 12]. Briefly, the samples are first divided into amplified and non-amplified groups. The difference of the expression means in these groups is divided by the sum of their standard deviation, resulting in a weight value. Then statistical significance for the weight value is computed by randomly permuting the samples into amplified and non-amplified groups and comparing the permuted weight value to the original.

miRNA expression analysis

Differentially expressed miRNA genes were determined using the same procedure as for gene expression platforms. Annotations for target sites of miRNAs were obtained from the miRBase::Targets database [13]. Only target sites with a P-value < 10-5 were included. MiRBase::Targets version 4 was used to match the annotations used in constructing the Agilent human miRNA array (G4470A).

DNA methylation arrays

Illumina DNA Methylation Cancer Panel I (808 gene promoters) and a custom Illumina GoldenGate array (1,498 gene promoters) were used in the methylation analysis. Processed beta values were used as provided by the TCGA. The beta value is defined as M/(M + U), where M and U are signal levels of methylation and unmethylation, respectively. The range of beta is 0 to 1, with 0 indicating hypomethylation and 1 indicating hypermethylation. Probes that target the same gene promoter were combined by taking the median of beta values so that each gene has a unique combined beta.

Small interfering RNA assays

Cell lines A172 and U87MG were obtained from the European Collection of Cell Cultures (ECACC, Salisbury, UK), LN405 from Deutsche Sammlung von Microorganismen und Zellkulturen GmbH (DSMZ, Braunschweig, Germany) and SVGp12 from American Type Culture Collection (ATCC, Manassas, VA, USA). Cells were cultured in medium conditions recommended by the providers.

The small interfering RNAs (siRNAs) were purchased from Qiagen (Qiagen GmbH, Germany) and include AllStars Hs Cell Death Control siRNA and AllStars Negative Control siRNA; siRNA sequences for the other 11 genes are given in Additional file 2. Each siRNA was assayed as three replicate wells, and for each gene four siRNAs were used in reverse transfection. Briefly, the siRNAs were printed robotically to 384-well white, clear-bottom assay plates (Greiner Bio-One GmbH, Frickenhausen, Germany). SilentFect transfection agent (Bio-Rad Laboratories, Hercules, CA, USA) or Lipofectamine RNAiMax (Invitrogen, Carlsbad, CA, USA) diluted into OptiMEM (Gibco Invitrogen, Carlsbad, CA, USA) was aliquoted into each 384-plate well using a Multidrop 384 Microplate Dispenser (Thermo Fisher Scientific Inc, Waltham, MA, USA), and the plates were incubated for 1 h at room temperature. Subsequently, 35 μl of cell suspension (1,500 cells of A172, U87MG and SVGp12 or 1,200 LN405 cells) was added on top of the siRNA-lipid complexes (13 nM final siRNA concentration) and the plates were incubated for 48 h or 72 h at +37°C with 5% CO2.

Proliferation assay and analysis of caspase-3 and -7 activities

Cell proliferation was assayed 72 h after transfection with CellTiter-Glo Cell Viability assay (Promega, Madison, WI, USA) and induction of caspase-3 and -7 activities was detected 48 h after transfection either with homogeneous Caspase-Glo 3/7 assay or Apo-ONE assay (Promega). All assays were performed according to the manufacturer's instructions. The signals were quantified by using an Envision Multilabel Plate Reader (Perkin-Elmer, Massachusetts, MA, USA). Both assays were repeated twice from independent transfections. Signals from the proliferation and caspase-3/7 assays were calculated and presented as relative signal to the mean of negative control siRNA replicate wells that was given a value of one. The values for each siRNA were then transformed into robust z-scores using median of the replicates and the median absolute deviation (MAD). A t-test (two-tailed, unequal variances) was calculated for each siRNA treatment and P-values < 0.05, < 0.01 and < 0.001 were taken as significant.

Data for CDKN2A and MSN are from an earlier siRNA screen and the values have been normalized to the background signal of each plate. The values were normalized using a LOESS method similar to the one implemented in the cellHTS2 R-package [14]. Briefly, the statistical outliers were down-weighted when a polynomial surface was fitted to the intensities within each assay plate using local regression [15]. This ensured a robust fit even if plates differ in hit-rate. The fit, representing a systematic background signal, was then subtracted from the values. A span of 0.35 and a degree of two for polynomial kernel were used. Robust z-scores were then calculated from the corrected data.


Anduril framework

Anduril is a flexible framework for processing large-scale data sets and integrating knowledge from bio-databases (Figure 1). Anduril architecture is based on the concept of workflows. A workflow consists of a series of interconnected processing steps, each of which executes a well-defined part of an analysis, such as data import or the generation of summary reports. Anduril can be invoked from Eclipse [16], a multipurpose graphical user interface, or from the command line. Anduril is available under an open source license and is actively maintained; new versions are released at least every three months. Anduril source code, component repository, extensive documentation, an installation guide and VirtualBox image for convenient testing are downloadable from the Anduril website [17]. Full technical details of the framework together with worked examples are available in the Anduril User Guide [18].

Figure 1
figure 1

Schematic of the Anduril platform. Anduril is an extensible framework for analyzing large-scale data sets using workflows. Elementary analysis and reporting methods, as well as connections to external databases, are implemented as reusable Anduril components. Components can utilize libraries such as Bioconductor and Weka and are not limited to a particular programming language. Components are then wired into custom workflows, which implement complete analyses that take complex high-throughput data as input and automatically produce comprehensive final reports as result. Reports include generated web sites that show the most relevant features of genes at a glance, and detailed figures and tables produced by analysis methods such as Kaplan-Meier analysis, Gene Ontology enrichment, and so on. Analysis workflows and their parameters are also documented in reports.

Workflows are constructed using a custom workflow language called AndurilScript that resembles traditional programming languages and is designed to enable rapid construction of complex workflows. The elementary processing steps in a workflow are implemented by Anduril components, which are reusable software packages written in various programming languages, for instance, R, Java, MATLAB, Octave, Python and Perl. Components are executable processes that communicate with the workflow through files. The component model is programming language independent since the only requirement is the ability to read and write files. At the AndurilScript level, components are accessed using their external interfaces, which hides implementation details. The components can use software libraries, such as Bioconductor [19] and Weka [20], to bring well-tested libraries to the workflow environment. It is also possible to invoke command-line programs from workflows. Currently, the Anduril core repository consists of more than a hundred components, and new components are added regularly. For instance, we designed a computational platform to generate networks from a list of genes by integrating pathway and protein-protein interaction data in Anduril [21]. This represents a component bundle that uses the Anduril framework but is distributed independently from the Anduril core.

Anduril includes advanced features for working with complex workflows. Large workflows can be divided into nested subworkflows, so that each hierarchical level is simple to maintain. When a workflow is executed several times, Anduril caches results of components and only executes the components whose configuration has changed since the last run, which reduces execution time significantly. Selected parts of workflows can be enabled based on dynamic conditions, which increases the flexibility of the workflows.

Compared to traditional programming environments, for instance, R coupled with Bioconductor, the advantages of Anduril are the use of workflows and the support for several programming languages. Workflows have a higher level of abstraction than R code, which increases productivity and enables visualization of analysis configuration. Compared to workflow frameworks GenePattern [22], Ergatis [23] and Taverna [24], Anduril provides several novel features, such as efficient programming-like workflow construction with an advanced workflow engine, algorithms specifically designed for large-scale data analysis and automated result website construction, that enable efficient analysis and visualization of large-scale data sets (see [18] for details).

Anduril-generated result report and website for GBM data interpretation

We used Anduril to analyze high-throughput SNP, copy number, transcriptomics, miRNA, methylation and clinical data for 338 GBM patients (Table 1). Anduril reports the analysis results in two formats. Firstly, Anduril provides a comprehensive PDF document consisting of analysis workflow configurations, method parameters, tables and figures produced by individual components. This report is intended primarily for bioinformaticians as it contains all the necessary details to reproduce the results. The report file for the GBM analyses conducted herein is available in Additional file 1. Secondly, Anduril automatically generates a website that contains the results computed with the analysis pipelines without the technical details. The website is designed primarily for experimental scientists as it gives a comprehensive view of the data at a glance. The website for GBM analyses executed herein is available at [25].

Table 1 Analyses performed and corresponding TCGA glioblastoma data sets

An example of the Anduril generated web page is given in Figure 2. The genes are sorted according to survival effect in exon array data. Anduril provides hyperlinks to several important databases, such as the pathway database KEGG [26], the protein-protein interaction database PINA [27], the miRNA database miRBASE [13], and the gene annotation databases GeneCards [28] and Ensembl [29]. These links enable users to easily obtain more information on the function and structure of interesting genes.

Figure 2
figure 2

Example of Anduril-generated result website and links to external sources. Anduril generates a browsable website based on analysis results. (a) A screenshot of the gene level view of the data. The genes are sorted according to the survival P-value on the exon platform. The data are divided into 13 fields corresponding to analysis results and data sources. For example, the field 'GeneExpression' illustrates fold changes between GBM and control samples using data from gene expression microarrays. Exon array values are computed at the gene ('MedianExonExpression') and transcript levels ('TranscriptExpression'). For the transcript data the minimum and maximum transcript expression values show GBM-specific alternative splice variant candidates. The fields 'TranscriptExpression:Survival' and 'MedianExonExpression:Survival' show survival analysis P-values for the best transcript and gene in the exon arrays, whereas 'SNPSurvival' contains P-values for the survival associated SNPs. The green color for 'GeneExpression', 'FoldChange', 'Min', 'Max', 'Gain', 'Loss' and 'Methylation' denote downregulation and red denotes upregulation. The red color for P-values for the fields 'Survival', 'SNPSurvival' and 'ExonIntegration' denotes low P-values. (b) A web page that opens after clicking the gene MSN. This page contains detailed results and external links. (c, d) Clicking 'GeneName' opens a website in Genecards [28] (c), and 'GeneID' connects to Ensembl [29] (d). (e) Clicking 'Protein Interactions' opens a page listing known protein-protein interactions in PINA [27]. (f) Clicking an entry in 'KEGG pathway' allows accessing pathways at the KEGG [26] website. (g) Each splice variant is listed separately and if the survival P-value is < 0.01, the users can view the Kaplan-Meier curves. The groups '1', '-1' and '0' denote overexpression, underexpression (not shown for MSN) and stable expression, respectively ('-1' is not present in the figure). The dotted lines are 95% confidence intervals.

Integration of copy number and transcript expression GBM data

We identified genes that are frequently amplified or deleted in GBM samples and integrated these results with expression data in order to identify genes whose altered expression activity can potentially be explained by chromosomal aberrations. Genomic regions with significant amplifications include 7p11.2 (amplified in up to 54% of patients, housing EGFR), 12q13-12q15 (14%) and 4q12 (14%).

Integration of aCGH and exon expression data reveals 16 genes for which amplification is an explanatory factor for overexpression (P < 0.01 and gain frequency >5%). Of these, EGFR is amplified on the aCGH platform and overexpressed on both gene expression platforms (fold change 2.8 to 6.2; Additional file 3, panel A). EGFR is also hypomethylated (beta = 0.03), which may be an additional explanatory mechanism for its overexpression. However, not all genes located in the amplified region 7p11.2 show marked overexpression in the total patient population (Additional file 3, panel A). For example, LANCL2 (the closest annotated gene to EGFR in the 7p11.2 region) is amplified in 24% of patients but shows underexpression in the exon platform and only slight over-expression in the gene expression platform. Similar differential expression is seen also between METTL1 (overexpressed) and AGAP2 (underexpressed) in the amplified chromosomal location 12q14.1 (Additional file 3, panel B).

Gene deletions are generally thought to result in downregulation of the expression of genes coded by the deleted genomic region. Interestingly, Anduril-based analysis of the two most frequently deleted genes at 9p21.3, MTAP and CDKN2A, shows that even though the gene deletion is an explanatory factor for lower expression of these genes in patients with deletion, in total GBM patient material the MTAP expression is not inhibited and CDKN2A is overexpressed compared to normal tissue (Additional file 4). The seemingly contradictory correlation between gene deletion and overexpression suggests activation of MTAP and CDKN2 promoters, and thereby increased gene expression levels in patients who have not yet lost one or two copies of these genes. This hypothesis is supported by the observation that in patients with remaining MTAP and CDKN2 alleles, both MTAP and CDKN2A are hypomethylated. On the other hand, another gene at 9p21.3 (ELAVL2) shows classical behavior of a deleted gene; its expression correlates with deletion, and it is also significantly downregulated in both expression platforms.

These examples illustrate that Anduril allows researchers to detect critical parameters affecting expression levels of the gene of interest at a glance. Our results demonstrate that integrated data analysis combining amplification, expression, and methylation status is integral in order to draw conclusions about functional consequences of gene amplifications or deletions detected by aCGH microarrays.

Survival analysis of GBM data

Probably the most important feature of the Anduril analysis of the GBM data is the integration of patient survival information with both expression and SNP data, thereby allowing the user to sort the genomic alterations according to their clinical relevance.

In order to examine the relevance of gene expression levels to patient survival in GBM, we first searched for genes whose overexpression correlated significantly with poor survival (P < 0.01). Among the 100 most upregulated genes, only 15 genes showed significant correlation with poor survival. On the other hand, out of the top ten survival affecting genes, only one gene (MSN, encoding Moesin) showed consistent overexpression in the gene and exon expression platforms (Figure 2a). All the other genes affecting survival in this group were underexpressed. Three of the top ten genes affecting survival (ADAM22, SCRIB, WAC) had at least one transcript that was overexpressed when analyzed on the exon array platform. However, survival effects of these genes are related to underexpressed splice variants instead of the overexpressed variants. Together these results show that gene repression is a common mode for gene regulation among the genes that have the most significant survival effect in GBM. These results challenge the general assumption that the level of gene overexpression is the major determinant to separate between clinically relevant and non-relevant genes.

In order to test the association between genetic alterations in GBM and their relevance to patient survival, we linked gene amplifications, expression profiles and survival data. Among the 300 most amplified genes, only filamin C gamma (FLNC; 7q32.1) is amplified (9% of the patients) with consistent overexpression in the gene and exon arrays and significant survival effect (P < 0.01). Together these results indicate that there is unexpectedly poor concordance between gene amplification, overexpression of the genes from the amplicons, and patient survival in GBM.

In general, individual miRNA survival effects in GBM were much smaller than expression survival effects, which may be explained by their indirect mechanism of action. The highest expressed miRNA in the GBM data was hsa-miR-21 (fold change 15.5), which has been shown to increase apoptotic activity and reduce tumor size in vivo [3032]. Some of the most downregulated miRNAs according to our analysis were hsa-miR-124a, hsa-miR-137, hsa-miR-7, hsa-miR-128a and hsa-miR-128b. All of these have been connected functionally to glioblastoma, either via neuronal differentiation or growth regulation [33].

Finally, we correlated 550,000 SNPs on the SNP arrays to survival using Kaplan-Meier and log-rank methods. This analysis identified 50 genes that contain survival-associated SNPs. Of these genes, KIAA0040 is also overexpressed (fold change 1.7 to 2.6) and associated with poor survival in exon array data (P < 8.7 × 10-4). The role of KIAA0040 in cancer progression is also supported by a recent study where KIAA0040 overexpression was shown to correlate with poor prognosis in breast cancer [34]. Another example of a gene showing a significant survival-affecting SNP is rs17258085 of ODZ3. In contrast to KIAA0040, this gene is significantly underexpressed in the GBM samples.

Functional analysis of survival-affecting genes in vitro

We chose 11 genes having overexpression and a survival effect on the GBM for functional analysis with three glioma cell lines (A172, LN405, U87MG) and one control cell line (SVG p12; SV40 transformed fetal astrocyte). Each gene was targeted with four siRNA constructs. The phenotypes were cell proliferation and induction of apoptosis via caspase-3 and -7 activities assayed 48 to 72 h after transfection in a 384-well format. Positive control siRNAs against KIF11 and PLK1 as well as AllStars Hs Cell Death Control siRNA gave clear anti-proliferative effects in all four cell lines (Additional file 5). Cell Death Control and KIF11 siRNAs also showed a clear induction of apoptosis in all four cell lines (Additional file 6). The results for the A172 cell line are presented in Table 2, and all functional analysis results are given in Additional file 2.

Table 2 Functional siRNA screening data for 11 GBM survival-associated genes in the A172 glioblastoma cell line

Of the tested genes, only the silencing of MSN caused consistent inhibition of cell proliferation in all four cell lines. In addition, it caused an increase in caspase-3/7 activity in LN405 (Figure 3). The silencing of CDKN2A caused inhibition of cell proliferation with two siRNAs and an increase in caspase-3/7 activity in the LN405 and SVGp12 cell lines that do not have the CDKN2A deletion (Additional file 7). The silencing of the other genes did not result in consistent effects on cell proliferation or induction of apoptosis in the tested glioblastoma cell lines.

Figure 3
figure 3

Functional effects of knocking out MSN in three glioblastoma and one control cell line. Four MSN targeting siRNAs at a final concentration of 13 nM were transfected with Silenfect (BioRad) transfection reagent to A172, LN405 and U87MG glioma cell lines and the SVGp12 control cell line. (a) Cell proliferation was assayed 72 h after transfection using CellTiter-Glo Cell Viability assay. (b) Induction of caspase-3 and -7 activities was detected 48 h after transfection with homogeneous Apo-ONE assay (Promega). Loess normalized signals from the proliferation and caspase-3/7 assays are presented as relative scores to the mean of lipid-containing wells. Significant P-values < 0.05*, < 0.01** and < 0.001*** calculated by t-test are shown. Error bars indicate standard error of the mean (SEM).


Large-scale data gathering efforts require software and computational tools to facilitate interpretation of the data. We have developed Anduril, an efficient and systematic data integration framework, to conduct large-scale data analysis that necessarily requires a number of processing steps before the data can be interpreted. In the GBM analysis here, the workflow contained approximately 350 processing steps, demonstrating the efficiency of workflows - more code would be needed when working with traditional programming languages - as well as highlighting the need for complexity management in workflow software. The structure of the analysis is automatically documented together with all execution parameters of the participating components, which enables reproduction of the results. Anduril supports modular and programming-like workflow construction, which together with automated component testing and a version control system allows a team of bioinformaticians to work on the project simultaneously and to seamlessly integrate the analysis results.

We have demonstrated the utility of the Anduril framework with the GBM data from TCGA, one of the largest multidimensional cancer data sets currently available. We focused on the integration of mRNA expression, SNPs and copy number data to clinical parameters as these results can provide evidence of potential molecular markers with impact on GBM progression. This also facilitates the sorting of the genomic alterations according to their clinical relevance and further helps to focus future mechanistic studies on genetic alterations that have evidence of clinical relevance. While TCGA GBM data sources, such as The Cancer Genome Atlas Portal and the Cancer Molecular Analysis Portal, provide box-plots for single genes and genome-wide heatmaps, Anduril offers a significant step forward. It enables a comprehensive view of the most critical parameters influencing expression, miRNA, SNP and copy number levels, as well as correlation of these data to survival at a glance. In addition, Anduril provides a number of direct links to external databases, and is thus an easy access point for interpreting the vast amounts of heterogeneous data from multiple sources. These characteristics of Anduril facilitate scientists without bioinformatics training to interpret complex data sets, such as TCGA.

Analysis of the GBM data demonstrates the utility of Anduril in translating fragmented data to testable predictions. For example, detection of amplified genomic regions has traditionally been used to identify genes with potential causal roles in oncogenesis [35]. However, whether genomic amplification generally results in clinically relevant changes in gene expression from the amplicon has been difficult to assess because of the lack of Anduril-type websites combining gene expression, patient survival and aCGH amplification data. Our results show surprisingly poor concordance between gene amplification, overexpression of the genes in the amplicons, and patient survival. For example, even though EGFR is the most often amplified gene in GBM (54% of patients), and this amplification has been considered as a hallmark of the disease, EGFR overexpression does not correlate well with overall patient survival (P < 0.122). This result is supported by a recent study demonstrating that EGFR amplification does not determine patient survival in primary GBM [36]. Instead, our results demonstrate that gene repression, rather than activation, is a common mode for gene regulation among the genes that have the most significant effect on survival in GBM.

Interestingly, many of the most survival-affecting genes have not been previously implicated in GBM pathogenesis. An example of such a gene is ZRANB1 (encoding ubiquitin thioesterase), which is downregulated in exon arrays and has a strong survival effect (P < 3.2 × 10-5). It has been shown in Drosophila and in human cancer cell lines to function as a positive regulator of Wnt-signaling [37]. Another interesting survival-affecting gene revealed by our analysis is MSN (encoding Moesin). We have functionally demonstrated that Moesin depletion by siRNA significantly inhibited cell proliferation and induced apoptosis. Moesin is functionally involved in regulation of actin cytoskeleton and cell migration, which indicates that in GBM it may promote, in addition to proliferation, the highly invasive behavior of GBM cells.


The different analysis approaches described herein demonstrate the ability of Anduril to integrate several types of genomic information and above all its capacity to determine which of the observed genetic alterations have an impact on patient survival. In this regard, Anduril clearly facilitates scientists to focus future functional analysis on those cancer-related genes that have already been verified to have clinical significance. Interestingly, each of the survival analyses described above (SNP, expression level, copy number changes) identified clinically relevant genomic alterations in genes for which cancer relevance is not presently established. It is anticipated that further studies of genes (for example, MSN and ZRANB1) and clinically relevant SNPs (for example, rs2285218 in KIAA0040) will produce interesting novel mechanistic insights into GBM progression and oncogenesis.



comparative genomic hybridization array


glioblastoma multiforme




small interfering RNA


single nucleotide polymorphism


The Cancer Genome Atlas.


  1. Cancer Genome Atlas Research Network: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008, 455: 1061-1068. 10.1038/nature07385.

    Article  Google Scholar 

  2. Furnari FB, Fenton T, Bachoo RM, Mukasa A, Stommel JM, Stegh A, Hahn WC, Ligon KL, Louis DN, Brennan C, Chin L, DePinho RA, Cavenee WK: Malignant astrocytic glioma: genetics, biology, and paths to treatment. Genes Dev. 2007, 21: 2683-2710. 10.1101/gad.1596707.

    Article  PubMed  CAS  Google Scholar 

  3. Bredel M, Scholtens DM, Harsh GR, Bredel C, Chandler JP, Renfrow JJ, Yadav AK, Vogel H, Scheck AC, Tibshirani R, Sikic BI: A network model of a cooperative genetic landscape in brain tumors. JAMA. 2009, 302: 261-275. 10.1001/jama.2009.997.

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  4. Brennan C, Momota H, Hambardzumyan D, Ozawa T, Tandon A, Pedraza A, Holland E: Glioblastoma subclasses can be defined by activity among signal transduction pathways and associated genomic alterations. PLoS One. 2009, 4: e7752-10.1371/journal.pone.0007752.

    Article  PubMed  PubMed Central  Google Scholar 

  5. Cerami E, Demir E, Schultz N, Taylor BS, Sander C: Automated network analysis identifies core pathways in glioblastoma. PLoS One. 2010, 5: e8918-10.1371/journal.pone.0008918.

    Article  PubMed  PubMed Central  Google Scholar 

  6. Gaire RK, Bailey J, Bearfoot J, Campbell IG, Stuckey PJ, Haviv I: MIRAGAA - a methodology for finding coordinated effects of microRNA expression changes and genome aberrations in cancer. Bioinformatics. 2009, 26: 161-167. 10.1093/bioinformatics/btp654.

    Article  PubMed  Google Scholar 

  7. Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O'Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, Speed TP, Gray JW, Meyerson M, et al: Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010, 17: 98-110. 10.1016/j.ccr.2009.12.020.

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  8. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995, 57: 289-300.

    Google Scholar 

  9. Carvalho B, Bengtsson H, Speed TP, Irizarry RA: Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics. 2007, 8: 485-499. 10.1093/biostatistics/kxl042.

    Article  PubMed  Google Scholar 

  10. Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004, 5: 557-572. 10.1093/biostatistics/kxh008.

    Article  PubMed  Google Scholar 

  11. Hautaniemi S, Ringner M, Kauraniemi P, Autio R, Edgren H, Yli-Harja O, Astola J, Kallioniemi A, Kallioniemi OP: A strategy for identifying putative causes of gene expression variation in human cancer. J Jefferson Institute. 2004, 341: 77-88.

    Google Scholar 

  12. Hyman E, Kauraniemi P, Hautaniemi S, Wolf M, Mousses S, Rozenblum E, Ringnér M, Sauter G, Monni O, Elkahloun A, Kallioniemi O-P, Kallioniemi A: Impact of DNA amplification on gene expression patterns in breast cancer. Cancer Res. 2002, 62: 6240-6245.

    PubMed  CAS  Google Scholar 

  13. Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ: miRBase: tools for microRNA genomics. Nucleic Acids Res. 2008, 36: D154-158. 10.1093/nar/gkm952.

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  14. Boutros M, Bras LP, Huber W: Analysis of cell-based RNAi screens. Genome Biol. 2006, 7: R66-10.1186/gb-2006-7-7-r66.

    Article  PubMed  PubMed Central  Google Scholar 

  15. Cleveland W: Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc. 1979, 74: 829-836. 10.2307/2286407.

    Article  Google Scholar 

  16. Des Rivi J, Wiegand J: Eclipse: a platform for integrating development tools. IBM Syst J. 2004, 43: 371-383. 10.1147/sj.432.0371.

    Article  Google Scholar 

  17. Anduril. []

  18. Anduril User Guide. []

  19. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R80-10.1186/gb-2004-5-10-r80.

    Article  PubMed  PubMed Central  Google Scholar 

  20. Frank E, Hall M, Trigg L, Holmes G, Witten IH: Data mining in bioinformatics using Weka. Bioinformatics. 2004, 20: 2479-2481. 10.1093/bioinformatics/bth261.

    Article  PubMed  CAS  Google Scholar 

  21. Laakso M, Hautaniemi S: Integrative platform to translate gene sets to networks. Bioinformatics. 2010, 26: 1802-1803. 10.1093/bioinformatics/btq277.

    Article  PubMed  CAS  Google Scholar 

  22. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet. 2006, 38: 500-501. 10.1038/ng0506-500.

    Article  PubMed  CAS  Google Scholar 

  23. Orvis J, Crabtree J, Galens K, Gussman A, Inman JM, Lee E, Nampally S, Riley D, Sundaram JP, Felix V, Whitty B, Mahurkar A, Wortman J, White O, Angiuoli SV: Ergatis: a web interface and scalable software system for bioinformatics workflows. Bioinformatics. 2010, 26: 1488-1492. 10.1093/bioinformatics/btq167.

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  24. Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, Li P: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics. 2004, 20: 3045-3054. 10.1093/bioinformatics/bth361.

    Article  PubMed  CAS  Google Scholar 

  25. Anduril generated glioblastoma multiforme result website. []

  26. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, Yamanishi Y: KEGG for linking genomes to life and the environment. Nucleic Acids Res. 2008, 36: D480-484. 10.1093/nar/gkm882.

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  27. Wu J, Vallenius T, Ovaska K, Westermarck J, Makela TP, Hautaniemi S: Integrated network analysis platform for protein-protein interactions. Nat Methods. 2009, 6: 75-77. 10.1038/nmeth.1282.

    Article  PubMed  CAS  Google Scholar 

  28. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D: GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics. 1998, 14: 656-664. 10.1093/bioinformatics/14.8.656.

    Article  PubMed  CAS  Google Scholar 

  29. Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, Coates G, Fairley S, Fitzgerald S, Fernandez-Banet J, Gordon L, Graf S, Haider S, Hammond M, Holland R, Howe K, Jenkinson A, Johnson N, Kahari A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Lawson D, Longden I, et al: Ensembl 2009. Nucleic Acids Res. 2009, 37: D690-697. 10.1093/nar/gkn828.

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  30. Chan JA, Krichevsky AM, Kosik KS: MicroRNA-21 is an antiapoptotic factor in human glioblastoma cells. Cancer Res. 2005, 65: 6029-6033. 10.1158/0008-5472.CAN-05-0137.

    Article  PubMed  CAS  Google Scholar 

  31. Corsten MF, Miranda R, Kasmieh R, Krichevsky AM, Weissleder R, Shah K: MicroRNA-21 knockdown disrupts glioma growth in vivo and displays synergistic cytotoxicity with neural precursor cell delivered S-TRAIL in human gliomas. Cancer Res. 2007, 67: 8994-9000. 10.1158/0008-5472.CAN-07-1045.

    Article  PubMed  CAS  Google Scholar 

  32. Krichevsky AM, Gabriely G: miR-21: a small multi-faceted RNA. J Cell Mol Med. 2009, 13: 39-53. 10.1111/j.1582-4934.2008.00556.x.

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  33. Lawler S, Chiocca EA: Emerging functions of microRNAs in glioblastoma. J Neurooncol. 2009, 92: 297-306. 10.1007/s11060-009-9843-2.

    Article  PubMed  CAS  Google Scholar 

  34. Abba MC, Sun H, Hawkins KA, Drake JA, Hu Y, Nunez MI, Gaddis S, Shi T, Horvath S, Sahin A, Aldaz CM: Breast cancer molecular signatures as determined by SAGE: correlation with lymph node status. Mol Cancer Res. 2007, 5: 881-890. 10.1158/1541-7786.MCR-07-0055.

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  35. Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, Barretina J, Boehm JS, Dobson J, Urashima M, Mc Henry KT, Pinchback RM, Ligon AH, Cho YJ, Haery L, Greulich H, Reich M, Winckler W, Lawrence MS, Weir BA, Tanaka KE, Chiang DY, Bass AJ, Loo A, Hoffman C, Prensner J, Liefeld T, Gao Q, Yecies D, Signoretti S, et al: The landscape of somatic copy-number alteration across human cancers. Nature. 2010, 463: 899-905. 10.1038/nature08822.

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  36. Benito R, Gil-Benso R, Quilis V, Perez M, Gregori-Romero M, Roldan P, Gonzalez-Darder J, Cerda-Nicolas M, Lopez-Gines C: Primary glioblastomas with and without EGFR amplification: Relationship to genetic alterations and clinicopathological features. Neuropathology. 2009, 30: 392-400. 10.1111/j.1440-1789.2009.01081.x.

    Article  PubMed  Google Scholar 

  37. Tran H, Hamada F, Schwarz-Romond T, Bienz M: Trabid, a new positive regulator of Wnt-induced transcription with preference for binding and cleaving K63-linked ubiquitin chains. Genes Dev. 2008, 22: 528-542. 10.1101/gad.463208.

    Article  PubMed  CAS  PubMed Central  Google Scholar 

Download references


The authors are grateful to Dr Outi Monni for critically commenting on the manuscript, and Drs Olli Tynninen and Petri Bono for fruitful discussions. This work was supported by Academy of Finland (projects: 125826, 128416 and 1121413; Center of Excellence in Translational Genome-Scale Biology (SHP, PK)), Sigrid Jusélius Foundation, Foundation for the Finnish Cancer Institute, Finnish Cancer Associations, Biocentrum Helsinki, Helsinki University Funds, Finnish Graduate School in Computational Sciences (ML, PC, SK) and Helsinki Biomedical Graduate School (KO, AMLK).

Author information

Authors and Affiliations


Corresponding author

Correspondence to Sampsa Hautaniemi.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

KO designed and implemented the Anduril framework and contributed to writing the manuscript. ML contributed to overall design and Anduril component implementation. SHP performed siRNA experiments and contributed to writing the manuscript. GBM informatics analyses were conducted by RL, PC, VA, KO, ML and EV. Data analysis tools were designed and implemented by KO, ML, RL, PC, VA, EV, JNF, VR, SK, KN, AMLK, MM, LS and JWu. PK analyzed the siRNA data. SK and RL made Figures 1 and 2. JNF contributed to implementation of the Anduril core. JWe contributed to designing the case studies, interpreting the results from the informatics analyses and writing the manuscript. SH initiated and supervised the project and contributed to writing the manuscript. All authors read and approved the final manuscript.

Marko Laakso, Saija Haapa-Paananen contributed equally to this work.

Electronic supplementary material


Additional file 1: Automatically generated result file. The report contains analysis configurations, parameter settings, result lists, tables and figures. The document also includes some analyses not reported in the manuscript, such as Gene Ontology analysis. (ZIP 12 MB)


Additional file 2: Information and data for siRNA screens. The information includes identifiers, siRNA target sequences, normalized mean intensities, standard errors of the mean and P-values for four cell lines used. (XLS 114 KB)


Additional file 3: Screenshot from the Anduril-generated web site. Genes are sorted in decreasing order according to the fraction of amplification ('Gain') in the GBM samples. The strongest amplified region in the GBM samples is 7p11.2. Interestingly, the expression values of the genes in the same genomic region vary significantly. For example, EGFR is amplified and has high fold change whereas LANCL2 is amplified and downregulated (panel A). The same phenomenon is seen in another amplified region 12q14.1 (panel B). (PDF 1 MB)


Additional file 4: Screenshot from the Anduril-generated web site. Genes are sorted in decreasing order according to the fraction of deletion ('Loss') in the GBM samples. The strongest deleted region in the GBM samples is 9p21.3. The fraction of deletion varies from 35% to 69%. (PDF 1 MB)


Additional file 5: The effect of gene silencing on cell proliferation. Control siRNAs (13 nM final concentration) were transfected with Silenfect (BioRad) transfection reagent to A172, LN405 and U87MG glioma cell lines and the SVGp12 control cell line. Cell proliferation was assayed 72 h after transfection using CellTiter-Glo Cell Viability assay. The proliferation data are presented as relative score to the mean of scramble siRNA-containing wells. Error bars indicate median absolute deviation. (PDF 51 KB)


Additional file 6: The effect of gene silencing on caspase-3 and -7 activities. Control siRNAs (13 nM final concentration) were transfected with Silenfect (BioRad) transfection reagent to A172, LN405 and U87MG glioma cell lines and the SVGp12 control cell line. Induction of caspase-3 and -7 activities was detected 48 h after transfection with homogeneous Caspase-Glo 3/7 assay (Promega). The caspase activity is presented as relative median score to the mean of scramble siRNA containing wells. Error bars indicate median absolute deviation. (PDF 53 KB)


Additional file 7: The effects of silencing CDKN2Ain LN405 and SVGp12 cell lines on cell proliferation and apoptosis. (PDF 31 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ovaska, K., Laakso, M., Haapa-Paananen, S. et al. Large-scale data integration framework provides a comprehensive view on glioblastoma multiforme. Genome Med 2, 65 (2010).

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: