A key focus of genomic medicine is the identification of relationships between phenotype and genotype. Genome-wide association studies and exome/genome sequencing can reveal hundreds of candidate genes that may contribute to human disease. Given such a set of candidate genes, prioritizing them for functional validation emerges as a key challenge in biomedical informatics. Considerable effort has gone into developing methods for the quantitative association of genes with disease.
Across biomedical research fields, scientific publications are the currency of knowledge. A near-universal tool that life scientists use to access this 'bibliome' is the MEDLINE®/PubMed® bibliographic database of the US National Library of Medicine (NLM), an actively maintained central repository for biomedical literature references. As of 2010, MEDLINE® had indexed over 18.5 million citations, at a current rate exceeding 600,000 articles per year. Researchers face increasing difficulty navigating the growing body of published information in search of novel hypotheses. Encapsulating the bibliome for a disease or gene of interest in a form both understandable and informative is an increasingly important challenge in biomedical informatics [4, 5].
MEDLINE® provides data structures and curated annotations to help scientists extract pertinent articles from the bibliome of a biomedical entity. In an ongoing process, curators at the NLM identify the key topics addressed in each publication and attach the corresponding Medical Subject Headings (MeSH) terms as annotations to that publication's record in MEDLINE®, covering over 97% of all PubMed-indexed citations. The National Center for Biotechnology Information (NCBI) PubMed portal uses the annotated MeSH terms to support searching the citation database, extending the reach of users beyond naïve word matching to topic matching. As one of the constellation of NCBI resources, MEDLINE®/PubMed® citations are further linked to gene entries in Entrez Gene where appropriate, with over 450,000 MEDLINE®/PubMed® citations linked to an Entrez Gene entry for a human gene.
The analysis of gene annotation properties and gene-related literature is a core challenge within computational biology. Biomedical keywords for properties of genes, drawn from structured vocabularies, have been identified from unstructured gene annotations [7, 8], as well as directly from the primary literature [9–11]. Sets of genes can be analyzed to extract common annotated biomedical properties. Assigned descriptive terms can be visualized as 'tag clouds' [13, 14]. Comparison of gene annotation profiles can group genes, which has been used to expand protein-protein interaction and phenotype networks, derive regulatory networks and predict other gene-gene relationships [15–20]. Annotation analysis enables prioritization of candidate genes in genetics studies [10, 21–23], and, when integrated with other information sources, predicts novel properties of genes [24, 25]. Existing tools and techniques demonstrate the value of annotation analysis and suggest that its potential impact is high. Significant research opportunities remain to improve annotation and annotation-based analysis methods.
The development of computational disease information resources has run parallel to the aforementioned gene-based efforts. Controlled vocabularies for medical descriptions [26, 27] and disease-specific annotations [28, 29] are emerging to facilitate medical information systems. Biomedical annotations associated with disease literature have been analyzed, and networks of gene-disease association have been constructed, to investigate the common biological aspects underlying diseases [9, 30]. In tandem with the curation of MEDLINE® by the NLM, a disease category of the Medical Subject Headings has been developed over 50 years, providing an extensive inventory of medical disorders. By 2011, 4,494 MeSH disease terms had been established.
Key to accelerating the identification of gene-disease relationships is the development of systematic approaches to quantitatively represent bibliometric information and infer functionally important relationships between entities. We have previously introduced MeSH Over-representation Profiles (MeSHOPs) as a convenient tool for constructing quantitative annotations for sets of papers in MEDLINE® where each paper refers to the same entity (such as a gene or a disease). To demonstrate that the MeSHOP knowledge representation captures features important for prediction, we generate MeSHOPs for human genes and diseases, and compare the gene and disease MeSHOPs to predict novel associations. Predictive performance for gene-disease relationships is validated against co-occurrence in future publications and against curated databases. Comparing MeSHOPs is demonstrated to be an effective way to identify novel relationships between genes and diseases.
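To make the idea of an over-representation profile concrete, the following is a minimal sketch rather than the published MeSHOP procedure. The passage above does not specify the statistic, so a one-sided hypergeometric test is assumed here, and the names `hypergeom_sf` and `build_profile` are hypothetical. Each MeSH term attached to an entity's papers is scored by the probability of seeing it at least that often in a random set of papers of the same size.

```python
from collections import Counter
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n papers are drawn from N, of which K carry the term.

    Exact integer arithmetic; intended for small illustrative inputs.
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def build_profile(entity_papers, background_counts, n_background):
    """Map each MeSH term on the entity's papers to an over-representation p-value.

    entity_papers: list of MeSH term lists, one per paper linked to the entity.
    background_counts: papers per term in the whole collection (assumed input).
    n_background: total number of papers in the collection.
    """
    # Count each term at most once per paper.
    observed = Counter(term for terms in entity_papers for term in set(terms))
    n = len(entity_papers)
    return {
        term: hypergeom_sf(k, n_background, background_counts[term], n)
        for term, k in observed.items()
    }
```

In this sketch a small p-value marks a term that is unusually frequent among the entity's papers, while a ubiquitous term scores close to 1. Two such profiles, one for a gene and one for a disease, could then be compared term by term, although the choice of comparison measure is left open here.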