Computational ecosystems for data-driven medical genomics

Abstract
On the path towards personalized medicine, an integrative bioinformatics infrastructure is a critical enabling resource. Until large-scale reference data became available, the desirable attributes of such an infrastructure were postulated by many but remained mostly unverified. Now that large-scale initiatives such as The Cancer Genome Atlas (TCGA) are in full swing, the opportunity is at hand to find out which analytical approaches and computational architectures are really effective. A recent report did just that: first a software development environment was assembled as part of an informatics research program, and only then was the analysis of TCGA's glioblastoma multiforme data pursued at the multi-omic scale. The results of this complex analysis are the focus of the report highlighted here. However, the analysis also serves as validation for an infrastructure development effort guided by the iterative identification of sound design criteria for an integrative computational architecture. That effort is at least as valuable as the data analysis results themselves: computational ecosystems with their own high-level abstractions, rather than rigid pipelines with prescriptive recipes, appear to be the critical feature of an effective infrastructure. Only then can analytical workflows benefit from experimentation just like any other component of the biomedical research program.


Integrative computation for personalized medicine
The promises of personalized molecular medicine are increasingly driving large-scale associative genomics projects that bring together distributed teams from multiple disciplines. The unprecedented size and scope of initiatives such as TCGA inevitably come with new types of growing pains. The problem of integration, as described in a recent report by The National Academy [2], is quickly becoming the central challenge for the life sciences. Only a few years ago this was still the province of the visionary [3]. However, the clamor for better formal knowledge representation frameworks is now coming from all corners, including the critical contribution of the storage infrastructure community [4]. In that regard, the study by Ovaska et al. may be revealing of what is in store for the data analysis of large-scale genomic data generation initiatives. Anduril is not the first integrative framework proposed, and the report compares it with existing frameworks such as GenePattern, Ergatis and Taverna [5-7]. In fact, GenePattern is the framework where many of the analytical workflows of the TCGA initiative itself are deployed. What is particularly interesting about Anduril in this regard is not so much what it does as the extent to which its architecture successfully reflects the team that uses it.
Before dwelling on that, it is worth recalling that this framework has pushed the idea of component modularity all the way to a shared input/output (I/O) bus (that is, a set of logical connections that can be shared by multiple software components in order to communicate with one another). As a result, a computational ecosystem is enabled where, instead of workflows made of components designed to define a pipeline, one has components with application programming interfaces designed for scaled-up reusability. Whereas in the conventional pipeline approach each component is conceived as a piece of a specific analytical puzzle, in the ecosystem approach the application programming interface of each module is made sufficiently abstract to be treated as an autonomous, generic element of many possible workflows.
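The contrast between pipeline pieces and generic components on a shared bus can be illustrated with a minimal sketch. It is not taken from Anduril: the names `Component`, `run_on_bus` and `mean_center`, and the use of an in-memory dictionary as the "bus", are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Component:
    """A generic step that knows only its own port names (hypothetical)."""
    name: str
    inputs: List[str]            # names of the input ports
    output: str                  # name of the single output port
    func: Callable[..., object]  # the actual computation


def run_on_bus(component: Component,
               bus: Dict[str, object],
               wiring: Dict[str, str]) -> None:
    """Read the inputs from the shared bus and write the result back.

    `wiring` maps the component's port names to keys on the bus, so the
    same component can be attached to different workflows unchanged.
    """
    args = [bus[wiring[port]] for port in component.inputs]
    bus[wiring[component.output]] = component.func(*args)


# One generic component: subtract the mean from a list of numbers.
mean_center = Component(
    name="mean_center",
    inputs=["values"],
    output="centered",
    func=lambda v: [x - sum(v) / len(v) for x in v],
)

# Workflow A: the component is wired to expression data.
expression_bus = {"expr": [2.0, 4.0, 6.0]}
run_on_bus(mean_center, expression_bus,
           {"values": "expr", "centered": "expr_centered"})

# Workflow B: the same component, rewired to a different data type.
methylation_bus = {"meth": [1.0, 3.0]}
run_on_bus(mean_center, methylation_bus,
           {"values": "meth", "centered": "meth_centered"})
```

The component never refers to a particular workflow; only the wiring does. That is the sense in which an abstract interface lets one module serve as an element of many possible workflows.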
The Anduril framework [8] was devised with a specific team of users in mind. This team comprises three roles: molecular biologists at both the data acquisition and the interpretation ends of the workflow, computational statisticians developing specialized data analysis modules in a variety of programming environments, and, finally, dedicated analysts assisting and linking both groups. The command line operation of the framework suits the analyst group as an environment in which to make full use of a component-based workflow framework designed for maximum reusability and minimum administration load. The execution of individual components by the core engine of Anduril is automatically triggered by I/O dependencies that point to filenames in a shared file system. It is also telling that the ensuing high-level abstraction led the developers of Anduril to define their own domain-specific language, releasing the whole initiative from having to choose between the many programming languages actually used for the individual components. Even if it is far from certain that Anduril will find a broader community of users, it is clear that this computational framework was the critical resource that enabled this particular group to act as a team. Therefore, it appears that integrative multidisciplinary teams may respond better to computational frameworks (plural) designed to match them than to being forced into a shared workflow mold. The latter remains the primary impulse of large-scale genomics initiatives, with very mixed results.
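The idea of execution triggered by file dependencies can be sketched as follows. This is a simplified illustration, not Anduril's engine or its domain-specific language: `Step`, `run_workflow`, the two toy actions and the filenames are all hypothetical, and a temporary directory stands in for the shared file system.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, NamedTuple


class Step(NamedTuple):
    """A component described only by the files it reads and writes."""
    name: str
    inputs: List[str]   # filenames the step reads
    output: str         # filename the step writes
    action: Callable[[List[Path], Path], None]


def run_workflow(steps: List[Step], workdir: Path) -> List[str]:
    """Run each step once, as soon as all of its input files exist.

    Returns the step names in the order they were executed.
    """
    executed: List[str] = []
    progress = True
    while progress:
        progress = False
        for step in steps:
            ins = [workdir / f for f in step.inputs]
            if step.name not in executed and all(p.exists() for p in ins):
                step.action(ins, workdir / step.output)
                executed.append(step.name)
                progress = True
    return executed


def normalize(ins: List[Path], out: Path) -> None:
    # Scale every value by the maximum value in the file.
    values = [float(x) for x in ins[0].read_text().split()]
    top = max(values)
    out.write_text(" ".join(str(v / top) for v in values))


def summarize(ins: List[Path], out: Path) -> None:
    # Write the mean of the values in the file.
    values = [float(x) for x in ins[0].read_text().split()]
    out.write_text(str(sum(values) / len(values)))


# Steps are listed out of order on purpose: the file dependencies,
# not the listing order, decide when each one runs.
steps = [
    Step("summarize", ["normalized.txt"], "summary.txt", summarize),
    Step("normalize", ["raw.txt"], "normalized.txt", normalize),
]

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "raw.txt").write_text("1 2 4")
    order = run_workflow(steps, workdir)
    summary = (workdir / "summary.txt").read_text()
```

Because each step is defined only by the files it consumes and produces, the action behind it could equally be a script in any language launched as a subprocess. That is one way to read the commentary's point about not having to choose a single programming language for the components.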

Multidisciplinary collaboration in a distributed world
Another provocative observation is that the authors of this study, and of the supporting computational framework, are not themselves involved in the TCGA initiative. This may be the beginning of a trend towards computational integration between unrelated research groups. This may actually be the better way for large-scale genomics initiatives to be translated into biomedical applications. If that is the case, then global reach would become a priority feature of such initiatives, with critical attention to streamlined programmatic access to the data generated.
Some features of the integrative framework reflect the collaborative teamwork in ways that are less relevant to this commentary. The physical co-location of the computational components at universities in the Helsinki-Turku area allows for an architecture tied together by a shared file system. At a time of widening availability of Hypertext Transfer Protocol (HTTP) mediated cloud computing resources and convergence towards semantic web formalisms, the reliance of Anduril on I/O via read/write of files may be an unreasonable proposition for distributed deployments. A more distributed computational ecosystem may be better served by web services, potentially extending component execution, not just reporting, to any machine connected to the Web. Nevertheless, the design of Anduril as a platform able to host and sustain abstract workflow representations that call arbitrary components is novel and compelling beyond the specific details of its architecture.