Using a real-world translational bioinformatics analysis as a case study, we demonstrate that cloud computing is a viable and economical technology that enables large-scale data integration and analysis for studies in genomic medicine. Our computational challenge was motivated by a need to discover cancer-associated eQTLs through integration of two high-dimensional genomic data types (gene expression and genotype), requiring more than 13 billion distinct statistical computations.
It is notable that our analysis completed in approximately the same running time on both systems, because one might expect the cloud-based analysis to take longer owing to overhead incurred by the virtualization layer. However, in this analysis, we find no significant difference in execution performance between the cloud-based and local clusters. This may be attributable to our design of the analysis code, which made heavy use of CPU and system memory in an effort to minimize disk input/output. An analysis that required many random seeks on the disk might have shown a performance disparity between the two systems.
Although running the analysis on the cloud-based system cost approximately three times as much as running it on the local cluster, we assert that the magnitude of this cost is well within reach of the research (operational) budgets of a majority of clinical researchers. There are intrinsic differences between these approaches that prevent us from providing a completely accurate accounting of costs. Specifically, we chose to base our comparison on the cost per CPU hour because it provided the most equivalent metric for comparing running-time costs. However, because we are comparing capital costs (local cluster) to variable costs (cloud), this metric does not completely reflect the true cost of cloud computing, for two reasons: we could not use a 3-year amortized cost estimate for the cloud-based system, as was done for the local cluster; and the substantial delay required to purchase and install a local cluster was not taken into account. As these factors are more likely to favor the cloud-based solution, it is possible that a more sophisticated cost analysis would bring the costs of the two approaches closer to parity.
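The cost-per-CPU-hour comparison described above can be illustrated with a minimal sketch. It is not the accounting used in this study: the function names, the inclusion of an annual operating cost, and the `utilization` parameter are assumptions introduced for illustration, and no figures from our analysis are encoded here.

```python
HOURS_PER_YEAR = 365 * 24


def local_cost_per_cpu_hour(purchase_price, annual_operating_cost, n_cpus,
                            years=3.0, utilization=1.0):
    """Amortized cost of one CPU hour on a purchased cluster.

    Assumption: capital and operating costs are spread evenly over the
    CPU hours actually used during the amortization period.
    """
    total_cost = purchase_price + annual_operating_cost * years
    usable_cpu_hours = n_cpus * years * HOURS_PER_YEAR * utilization
    return total_cost / usable_cpu_hours


def cloud_cost_per_cpu_hour(instance_price_per_hour, cpus_per_instance):
    """Variable cost of one CPU hour on a rented instance."""
    return instance_price_per_hour / cpus_per_instance


def analysis_cost(cpu_hours, cost_per_cpu_hour):
    """Total cost of an analysis that consumes `cpu_hours` CPU hours."""
    return cpu_hours * cost_per_cpu_hour
```

Under these assumptions, the amortized local cost depends strongly on utilization: a cluster that sits idle half of the time has twice the effective cost per CPU hour, whereas the cloud cost per CPU hour is unaffected by idle periods.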
There are several notable differences in the capabilities of each system that give grounds for the higher cost of the cloud-based analysis. First, there are virtually no startup costs associated with the cloud-based analysis, whereas substantial costs are associated with building a local cluster, such as hardware, staff, and physical housing. Such costs range from tens to hundreds of thousands of dollars, likely making the purchase of a local cluster prohibitively expensive for many. It can take months to build, install and configure a large local cluster, so the non-monetary opportunity costs incurred while a local cluster is being set up must also be considered. The carrying costs of the local cluster that persist after the analysis has concluded should also be considered. Second, the cloud-based system offers many technical features and capabilities that are not matched by the local cluster. Chief among these is the 'elastic' nature of the cloud-based system, which allows it to scale the number of server instances based on need. Had there been a need to complete this large analysis within a day, or even several hours, the cloud-based system could have been scaled to several hundred server instances to accelerate the analysis, whereas the size of the local cluster is firmly bound by the number of CPUs installed. A related feature of the cloud is the user's ability to change the computing hardware at will, such as selecting fewer, more powerful computers instead of a larger cluster if the computing task lends itself to this approach.
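The trade-off behind elastic scaling can be sketched as follows. This is an idealized illustration that assumes a perfectly divisible workload with no start-up, data transfer or coordination overhead, and assumes that each instance is billed per started hour; the function names are hypothetical.

```python
import math


def wall_clock_hours(total_cpu_hours, n_instances, cpus_per_instance):
    """Idealized running time when the work divides evenly over all CPUs."""
    return total_cpu_hours / (n_instances * cpus_per_instance)


def billed_instance_hours(total_cpu_hours, n_instances, cpus_per_instance):
    """Instance hours charged, assuming billing per started instance hour."""
    hours = wall_clock_hours(total_cpu_hours, n_instances, cpus_per_instance)
    return n_instances * math.ceil(hours)


if __name__ == "__main__":
    # Hypothetical workload of 1,600 CPU hours on 8-CPU instances.
    for n in (10, 100, 300):
        print(n, wall_clock_hours(1600, n, 8), billed_instance_hours(1600, n, 8))
```

In this idealized model, moving from 10 to 100 instances shortens the running time tenfold while the billed instance hours stay the same; the bill only grows once the per-instance running time falls below the billing granularity.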
Other features unique to the cloud include 'snapshotting', which allows whole systems to be archived to persistent storage for subsequent reuse, and 'elastic' disk storage that can be dynamically scaled based on real-time storage needs. A feature of note that is proprietary to the particular cloud provider used here is the notion of 'spot instances', where a reduced per-hour price is set for an instance, and the instance is launched during periods of reduced cloud activity. Although this feature might have increased the total execution time of our analysis, it might also have reduced the cost of the cloud-based analysis by half, depending on market conditions. Clearly, any assessment of the disparity in cost between the two systems must take into account the additional features and technical capabilities of the cloud-based system.
While we find that the cost and performance characteristics of the cloud-based analysis are accommodating to translational research, it is important to acknowledge that substantial computational skills are still required in order to take full advantage of cloud computing. In our study, we purposefully chose a less sophisticated approach of decomposing the computational problem by simple fragmentation of the comparison set. This was done to simulate a low-barrier-to-entry approach to using cloud computing that would be most accessible to researchers lacking advanced informatics skills or resources. Alternatively, our analysis would likely have been accelerated significantly through utilization of cloud-enabled technologies such as MapReduce frameworks and distributed databases. It should also be noted that while this manuscript was under review, Amazon announced the introduction of Cluster Compute Instances intended for high-performance computing applications. Such computing instances could further increase accessibility to high-performance computing in the cloud for non-specialist researchers.
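A simple fragmentation of a comparison set can be sketched as below. This is an illustration under stated assumptions rather than the code used in this study: it assumes that the comparison set is every (expression probe, SNP) pair, that fragments are formed by slicing the SNP index, and that each fragment is handed to one server instance. The names `fragment_bounds` and `fragment_pairs` are hypothetical.

```python
def fragment_bounds(n_items, n_fragments):
    """Split range(n_items) into contiguous, near-equal (start, stop) slices."""
    base, extra = divmod(n_items, n_fragments)
    bounds = []
    start = 0
    for i in range(n_fragments):
        # The first `extra` fragments each take one additional item.
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def fragment_pairs(n_probes, snp_slice):
    """Yield every (probe, SNP) index pair belonging to one fragment."""
    start, stop = snp_slice
    for snp in range(start, stop):
        for probe in range(n_probes):
            yield (probe, snp)


if __name__ == "__main__":
    # Hypothetical sizes: 5 probes and 10 SNPs divided among 3 instances.
    for bounds in fragment_bounds(10, 3):
        print(bounds, sum(1 for _ in fragment_pairs(5, bounds)))
```

Because the fragments are disjoint and together cover the whole index range, each instance can process its own fragment independently, and the results can be concatenated afterwards without any coordination between instances.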
There are serious considerations that are unique to cloud computing. First, local clusters typically benefit from dedicated operators who are responsible for maintaining computer security. By contrast, cloud computing allows free configuration of virtual machine instances, thereby sharing the burden of security with the user. Second, cloud computing requires the transfer of data, which introduces delays and can lead to substantial additional costs given the size of many data sets used in translational bioinformatics. Users will need to consider this aspect carefully before adopting cloud computing. An additional data-related limitation that we faced repeatedly with our provider was a 1-terabyte limit on the size of the virtual disks.
However, the most significant impediment facing biomedical researchers wishing to adopt cloud computing is the software needed to design the computing environment and run the experiments. We believe that efforts to fully expose the capabilities of cloud-computing environments at the application level are key to enhancing the democratizing effect of cloud computing in genomic medicine. Specifically, intuitive and scalable software tools are needed to enable clinician scientists at the forefront of medical discovery to leverage fully the vast resources of public data and cloud-based computing infrastructure. Cloud-based tools should be specifically oriented to the particular modes of inquiry of clinician scientists, so as to enable unified evaluation of biological and clinical hypotheses. Rather than present the clinical investigator with a collection of bioinformatics tools (that is, the 'toolbox' approach), we believe clinician-oriented, cloud-based translational bioinformatics systems are key to facilitating data-driven translational research using cloud computing.
It is our hope that by demonstrating the utility and promise of cloud computing for enabling and facilitating translational research, investigators and funding agencies will commit efforts and resources towards the creation of open-source software tools that leverage the unique characteristics of cloud computing to allow for uploading, storage, integration and querying across large repositories of public and private molecular and clinical data. In this way, we might realize the formation of a biomedical computing commons, enabled by translational bioinformatics and cloud computing, that empowers clinician scientists to make full use of the available molecular data for formulating and evaluating important translational hypotheses bearing on the diagnosis, prognosis, and treatment of human disease.