Mining the literature: new methods to exploit keyword profiles

Bibliographic records in the PubMed database of biomedical literature are annotated by curators with Medical Subject Headings (MeSH), which summarize the content of the articles. Two recent publications explain how to generate profiles of MeSH terms for a set of bibliographic records and how to use them to define any given concept by its associated literature. These concepts can then be related by their keyword profiles, and this can be used, for example, to detect new associations between genes and inherited diseases. See related research articles: http://www.biomedcentral.com/1471-2105/13/249/abstract and http://genomemedicine.com/content/4/9/75/abstract


The value of curators
These new publications, and many others that use data and text mining to try to infer biological knowledge by computational means, rely on databases maintained at the United States National Library of Medicine (NLM) [7]; in this case, PubMed and Entrez Gene. Structured database records, annotations, and cross-linking between databases not only facilitate manual querying of the data but also allow such computational analyses.
In particular, the annotation of each PubMed record with MeSH terms that summarize its content requires a monumental amount of human effort, as it necessitates understanding the main points of each new PubMed entry; and currently, new PubMed records are being created at a rate of 3,000 per day.
Moreover, MeSH terms are a controlled vocabulary arranged hierarchically in 16 categories; they are complex and numerous (at present, there are more than 26,000). To make matters worse, new MeSH terms are sometimes added or old ones are modified, so that old records in PubMed may need re-annotation.
However, although using MeSH terms for querying PubMed records results in more specific retrieval than using words found in abstracts [8], querying with MeSH terms is an option seldom chosen by PubMed users [9]. So ultimately, it is possible that the huge value of the work that the NLM curators do in creating and using MeSH terms lies in enabling computational mining methods such as the ones highlighted here.

Generating and comparing profiles
In the first publication [4], Cheung et al. describe how to build a profile out of a collection of PubMed references. Since these references are all annotated with MeSH terms, it is possible to identify the terms that are overrepresented in the collection compared with their background prevalence in all publications in PubMed. Specifically, they measure overrepresentation by using P-values from a Fisher's exact test. The resulting profiles show how a particular collection of literature records defines something that is specific.
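The enrichment step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the text specifies Fisher's exact test but not its sidedness, so the sketch assumes a one-sided test (the upper tail of the hypergeometric distribution); the function names `enrichment_p_value` and `build_profile`, the record identifiers and the toy annotations are all hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test P-value for over-representation.

    k: records in the collection annotated with the term
    n: records in the collection
    K: records in the background annotated with the term
    N: records in the background (the collection is part of it)
    """
    # Probability of seeing k or more annotated records when n records
    # are drawn at random from the background (hypergeometric upper tail).
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


def build_profile(collection_ids, annotations):
    """Map each MeSH term seen in the collection to its enrichment P-value.

    annotations: dict of record id -> set of MeSH terms (the background).
    collection_ids: unique ids of the records that define the concept.
    """
    n, N = len(collection_ids), len(annotations)
    background_counts = {}
    for terms in annotations.values():
        for term in terms:
            background_counts[term] = background_counts.get(term, 0) + 1
    collection_counts = {}
    for rid in collection_ids:
        for term in annotations[rid]:
            collection_counts[term] = collection_counts.get(term, 0) + 1
    return {
        term: enrichment_p_value(k, n, background_counts[term], N)
        for term, k in collection_counts.items()
    }


# Hypothetical toy background of six records (ids and terms are made up).
annotations = {
    1: {"Humans", "Retinal Degeneration"},
    2: {"Humans", "Retinal Degeneration"},
    3: {"Humans", "Neoplasms"},
    4: {"Humans", "Neoplasms"},
    5: {"Humans", "Mice"},
    6: {"Humans", "Mice"},
}
profile = build_profile([1, 2], annotations)
for term, p in sorted(profile.items(), key=lambda item: item[1]):
    print(f"{term}: P = {p:.3f}")
```

In this toy case a term attached to every record ("Humans") gets a P-value of 1 and so carries no information about the collection, whereas a term concentrated in the collection gets a small P-value. With a background the size of PubMed, an implementation would more likely rely on a statistics library than on exact integer arithmetic.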
In their second publication [5], Cheung et al. apply the profiles generated to establish comparisons between genes and inherited diseases. The approach is familiar: similar genes can be expected to be associated with similar diseases [10]. Therefore, if a profile is established for a particular disease based on its associated bibliography, particular features would be expected to be highlighted that also appear in the profiles of genes related to the disease. Given a chromosomal region linked to an inherited disease by analysis of the genomes of patients and healthy controls, a novel candidate gene in that region might be identified based on previous associations.
One interesting point in the work of Cheung et al. is that the metrics they use to compare profiles do not take into account the similarity between whole profiles of enrichment values, but only the similarity between the enrichment values of the overlapping terms.
This sets their method apart from other methods of keyword profile comparison: even a small, but significant, number of overlapping terms can be enough to highlight an association. It is easy then to go back to the original source papers (of which there may only have been three or four) that were annotated with the enriched MeSH terms for evidence of the association.
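To make the idea of comparing only the overlapping terms concrete, here is a minimal sketch. The exact metrics used by Cheung et al. are not given in this text, so the score below (the sum of -log10 P over terms enriched in both profiles) is an assumed stand-in chosen purely for illustration; the function name `overlap_score`, the significance cut-off `alpha`, the gene labels and all P-values are hypothetical.

```python
import math


def overlap_score(profile_a, profile_b, alpha=0.05):
    """Score two keyword profiles using only the terms enriched in both.

    Each profile maps a MeSH term to an enrichment P-value. The score
    (sum of -log10 P over shared enriched terms) is an illustrative choice.
    """
    shared = sorted(
        term
        for term in profile_a.keys() & profile_b.keys()
        if profile_a[term] <= alpha and profile_b[term] <= alpha
    )
    score = 0.0
    for term in shared:
        # Floor at a tiny value so a P-value that underflowed to 0 is safe.
        score -= math.log10(max(profile_a[term], 1e-300))
        score -= math.log10(max(profile_b[term], 1e-300))
    return score, shared


# Hypothetical profiles; terms and P-values are invented for illustration.
disease = {"Retinal Degeneration": 1e-6, "Blindness": 1e-3, "Child": 0.4}
candidate_genes = {
    "GENE_A": {"Retinal Degeneration": 1e-4, "Photoreceptor Cells": 1e-5, "Child": 0.5},
    "GENE_B": {"Neoplasms": 1e-8, "Child": 0.3},
}
ranking = sorted(
    ((gene, *overlap_score(disease, prof)) for gene, prof in candidate_genes.items()),
    key=lambda row: row[1],
    reverse=True,
)
for gene, score, shared in ranking:
    print(gene, round(score, 2), shared)
```

Terms that are enriched in only one of the two profiles do not lower the score, so a single strongly enriched shared term is enough to rank a candidate gene first. Returning the shared terms alongside the score keeps the link back to the source papers that were annotated with them.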

What next?
As shown in Figure 1, the way these profiles can be compared is by no means limited to genes and diseases. The web tool described in the highlighted manuscripts already supports the generation of profiles for three concept types (human genes, diseases and chemicals), but potentially allows the annotation of any concept (by its associated list of PubMed records). Similarly, any concept that is given one of these profiles can be compared to any other concept receiving another profile. So, there is no reason to stop at the gene-to-disease comparison: one could potentially examine associations between concepts of the same type (for example, genes with genes), or between other concepts (for example, protein sequence features and functions).
However, these calculations are computationally costly. This imposes limitations, precluding the authors from offering such general functionality. Rather, Cheung et al. focused on arguably the most pressing need; namely, to identify genes linked to disease. But they have clearly set out their method, and we can expect that this and other groups will use these ideas to create other metrics for the generation and comparison of keyword profiles, which may be even more efficient. Cheung et al. already give some ideas of how to benchmark such methods. These new developments are surely an encouragement for the support of continued efforts in high-quality annotation of large public databases.

Abbreviations
MeSH, Medical Subject Headings; NLM, National Library of Medicine.

Figure 1. Comparing biological concepts using keyword profiles. (a) A set of references from PubMed can be associated with a profile of keywords according to the Medical Subject Heading (MeSH) terms assigned to them. By extension, this can be done for any concept or database record with an associated set of PubMed references. (b) Entrez (human) genes, diseases (defined as MeSH terms) or chemicals (defined as MeSH terms) can be associated with bibliographic records in PubMed, and therefore can be defined by profiles based on any MeSH terms (middle). Conversely, it is possible to find the genes, chemicals and diseases associated with particular MeSH terms. Associations between concepts (genes and diseases, but also genes and chemicals, or diseases and chemicals) can be defined by comparing such profiles. This figure is partly taken, with permission, from the MeSHOPs website [6].