Translational bioinformatics applications in genome medicine

Although investigators using methodologies in bioinformatics have always been useful in genomic experimentation in analytic, engineering, and infrastructure support roles, only recently have bioinformaticians been able to have a primary scientific role in asking and answering questions on human health and disease. Here, I argue that this shift in role towards asking questions in medicine is now the next step needed for the field of bioinformatics. I outline four reasons why bioinformaticians are newly enabled to drive the questions in primary medical discovery: public availability of data, intersection of data across experiments, commoditization of methods, and streamlined validation. I also list four recommendations for bioinformaticians wishing to get more involved in translational research.

Introduction
Over the past decade, a large amount of individual-level molecular data has come from the use of gene expression microarrays [1,2], proteomics [3], and DNA sequencing [4,5]. Although high-throughput measurement modalities such as these have been used in biomedical research for over a decade, the role of the bioinformatician has often been relegated to that of data analyst, librarian, database manager, distribution specialist, or software engineer. Occasionally, with introductions made early enough, bioinformaticians have been included in the early design phases of experiments, and their role noted as such on manuscripts and publications. These engineering and infrastructure roles, although important, evolved under the assumption that the scientists making these measurements already know good questions to ask but lack the specific skills to analyze, store, retrieve, and disseminate their data. Engineering roles in bioinformatics are important and are reasonably well funded today (such as in the cancer Biomedical Informatics Grid (caBIG), Biomedical Informatics Research Network (BIRN), and the National Centers for Biomedical Computing (NCBC), all in the United States).
But considering and funding solely the engineering roles in bioinformatics understates the potential function of bioinformaticians as scientists (here defined as those who come up with questions) and, even more importantly, it limits the vision for bioinformaticians to ask questions that no other scientists can ask or answer today. It has become increasingly rare for the bioinformatician to take the role of questioner, especially with regard to research that has an impact on medical care or research that yields tools for clinicians or patients. Here, I argue that the next steps needed for the field of bioinformatics are a shift in role towards asking questions and a shift in focus to medicine. The field of translational bioinformatics, defined as '…the development of storage, analytic and interpretive methods to optimize the transformation of increasingly voluminous biomedical data into proactive, predictive, preventative, and participatory health' [6], is the mechanism for this shift. I outline below four reasons why bioinformaticians are newly enabled to drive the questions in primary medical discovery, and provide four recommendations for bioinformaticians who would like to get more involved in translational research.
Four enabling opportunities
The most revolutionary force in translational bioinformatics is the public availability of molecular data. Sharing data is not new; large epidemiological datasets and DNA sequences have been shared in various forms for several decades, even before the internet era. In addition, the use of previously published data is not new; the biostatistics literature is full of novel methodology applied to well known datasets. But instead of using public data to just improve one's methodology (for example, to build yet another classifier on Todd Golub's leukemia data [7]), or in basic science (for example, to build yet another predictor for transcription factor binding sites), such data can now be used to enable new questions in applied sciences.
Coupled with the public availability of molecular measurement data is the promising capability of intersecting across multiple experiments. At the time of writing, the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) contains data from over 307,000 microarrays, from 12,100 independent experiments [8].
Although the growth rate has been exponential, GEO can be currently described as having made available roughly 100 new microarrays each day since its launch in January 2001. Imagine: a high school student today who needs to run a science fair project can type 'breast cancer' at the NCBI GEO home page to find data from nearly 400 experiments totaling 24,200 samples, as easily as she can find songs on iTunes. With the right tools, she could even discover the 'common denominator' across tens or hundreds of models of breast cancer. Rhodes et al. [9] used this approach to compile publicly available published microarray datasets in which cancer samples were compared with appropriate normal samples to find common changes in gene expression across cancers, such as cell cycle genes involved in metastasis, and my colleague and I [10] used 49 publicly available gene expression, proteomics, and RNA interference datasets to predict novel variants associated with obesity. Although there are challenges in using this approach [11], with over 30% of the human-disease morbidity already represented in GEO [12] there is clearly power in large numbers.
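As a minimal sketch of what intersecting across experiments can look like in practice, the example below counts how often each gene appears in the differentially expressed gene lists of several independent experiments and keeps the genes that recur. This is an illustration only, not the method of Rhodes et al. [9]; the function name `recurrent_genes`, the experiment labels, and the gene identifiers are hypothetical placeholders rather than real GEO data.

```python
from collections import Counter


def recurrent_genes(experiments, min_experiments):
    """Return genes listed as differentially expressed in at least
    `min_experiments` of the supplied experiments, sorted by name."""
    counts = Counter()
    for genes in experiments.values():
        # Use a set so a gene is counted at most once per experiment.
        counts.update(set(genes))
    return sorted(g for g, n in counts.items() if n >= min_experiments)


# Toy input: placeholder identifiers, NOT real measurements.
experiments = {
    "experiment_1": ["GENE_A", "GENE_B", "GENE_C"],
    "experiment_2": ["GENE_B", "GENE_C", "GENE_D"],
    "experiment_3": ["GENE_C", "GENE_E", "GENE_B", "GENE_B"],
}

common = recurrent_genes(experiments, min_experiments=3)
print(common)  # ['GENE_B', 'GENE_C']
```

A real analysis would also need to reconcile platforms, probe annotations, and statistical thresholds across experiments, which is where the challenges noted above [11] arise.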
A negative disruptive factor, potentially steering bioinformaticians away from staid approaches, has been the increasing commoditization of bioinformatics methodology. Over 1,100 databases are now listed in the Annual Database issue of Nucleic Acids Research [13], with another hundred web-servers listed separately [14]. Approximately 60 manuscripts are published each month describing software or methodology in bioinformatics in the journals Genome Biology, BMC Bioinformatics, BMC Genomics, and Bioinformatics. Even sophisticated choices on the best machine-learning algorithm to use in a particular context have been made trivial by free tools such as Weka [15], which essentially abstract away the need to know specific methodology. It is getting progressively harder to argue that increasing sophistication and knowledge of this type of methodology significantly improves one's results.
With the availability of enormous sets of data and the commoditization of methodology, merely making lists of potential biomarkers and causal factors will eventually lose value and significance. Although much additional value comes from validation in real human samples, these samples have typically been difficult to obtain, until now. Figure 1 shows one example out of many websites that now offer human samples, antibodies that can be used to stain those samples, and pathology services that can be used to read the results. One can always question the reliability and quality of these samples and services, as one can question samples and services within one's own institution. However, it is difficult to ignore the importance of having these facilities available to the bioinformatician. Although caveats must be acknowledged, in many ways all that is now left to do is to ask the interesting question.
Four recommendations
How can the field of bioinformatics successfully adapt to the translational movement? First, if the hardest part of scientific endeavors in biomedical informatics is to ask the right question, then investigators in biomedical informatics need to learn more about open problems in medicine. Some of this learning will come from non-traditional sources, such as medical or surgical grand rounds (regular conferences discussing the science around particularly challenging or instructive cases) in a medical center. Often, 'domain-specific learning' is viewed as a slippery slope; informaticians sometimes retort that it is not possible to gain competence across all areas of medicine while retaining expertise in a computational discipline. But learning about the unaddressed challenges even in one particular area of medicine is still better than knowing little or nothing about any area of medicine; as most physician scientists know, focus in one particular medical area of interest provides more than enough challenges for a career. As informatics tools become more easily accessible, understood, and used without assistance by medical researchers, the reverse also has to occur, with medical problems becoming understood and addressed by computational investigators.
The corollary to this point is a second recommendation directed towards bioinformaticians: with the commoditization of bioinformatics methodologies, researchers in informatics should not just build tools, they should be the first to use them, even on publicly available data. Indeed, no other investigator knows those tools better than the inventor. Those who build tools to address a specific medical question can and should report on both their tool and their findings. After tools and methods have been shown to answer one question particularly well, they can then be generalized for additional questions. This recommendation is contrary to the usual practice of building tools in bioinformatics to enable others. In general, this will mean that tools that have successfully enabled their creator to discover an important finding should be viewed with higher regard, as opposed to tools presenting a fancier user interface or marginal gains in performance.
It is often easiest to criticize the quality of publicly available resources, whether these resources are data or tools. Many initiatives within the community of biomedical informatics have tried to add value to these public resources by creating standardized annotations (and metadata), catalogs, structured vocabularies, and ontologies, which can be used to store, index, and retrieve them more efficiently and effectively [16,17]. Although these efforts have the best of intentions, we have to ensure that, in the push to improve the quality of metadata, we do not inadvertently cause a delay in the release of data or tools.
The final recommendation is for informaticians to broadly consider their sources for molecular data. A tertiary care academic medical center might see tens to hundreds of thousands of patients with injuries and diseases each year. In modern hospitals, nearly every intervention applied to these patients is electronically recorded, and hundreds of thousands of blood measurements are made yearly, along with high-resolution images and tissue pathology. The scale of the clinical enterprise easily dwarfs the abilities of most typical animal model facilities, and the requirements for quality assurance for medical measurements greatly exceed the typical levels of rigor applied in model experimentation. Put another way, the typical clinical laboratory measurement is much more believable than the typical spot on a microarray. There are barriers to accessing clinical data, but as these can be overcome, bioinformaticians should start considering humans as the ultimate model organism [18].

Conclusions
It is remarkable that in the decade or two since their creation, high-throughput molecular measurements, such as microarrays, have already been used to study so many human diseases, and that data from these experiments are publicly available. Representing so many diseases by molecular measurements in gene expression (and other measurement modalities in the future) brings us closer to a consideration of the nature of disease itself. As the community of biomedical informaticians is increasingly involved (and funded) in the construction of infrastructure and policies to gather and consolidate clinical and experimental data, we have to consider that this community will also be the prime user of these tools and techniques. Those who apply their research to publicly available data, commoditized tools, and streamlined paths through validation will be able to create novel diagnostics and discover fundamental causes of disease as targets for therapies. Investigators empowered by methodologies in bioinformatics have never been so well positioned to take on the role of translational scientist, to build the tools to ask the questions that yield discoveries to improve human health.
Abbreviations
GEO, Gene Expression Omnibus; NCBI, National Center for Biotechnology Information.
Competing interests
AB is or has served as a scientific advisor and/or consultant to NuMedii, Genstruct, Prevendia, Tercica, Eli Lilly and Company, and Johnson and Johnson.
Acknowledgements
The work was supported by grants from the Lucile Packard Foundation for Children's Health, National Library of Medicine (K22 LM008261 and R01 LM009719), National Institute of General Medical Sciences (R01 GM079719), Howard Hughes Medical Institute, and the Pharmaceutical Research and Manufacturers of America Foundation.