Table 5 Performance using gene2pubmed as the gene-literature data source

From: Inferring novel gene-disease associations using Medical Subject Heading Over-representation Profiles

| Scoring method | Novel MEDLINE validation AUC (02/2007-01/2009) | Novel MEDLINE validation AUC (02/2007-04/2010) | Pre-existing CTD validation AUC (11/2008) | Novel CTD validation AUC (11/2008-04/2010) | Pre-existing MEDLINE validation AUC (02/2007) | Mean AUC | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cosine distance of term frequency-inverse document frequency | 0.92 | 0.91 | 0.95 | 0.93 | 0.98 | 0.94 | 2 |
| Cosine distance of P-values | 0.53 | 0.51 | 0.65 | 0.63 | 0.53 | 0.57 | 16 |
| Cosine distance of term fractions | 0.90 | 0.89 | 0.93 | 0.91 | 0.96 | 0.92 | 5 |
| Sum of the log of combined P-values | 0.91 | 0.89 | 0.94 | 0.94 | 0.94 | 0.92 | 3 |
| Sum of the differences of log P-values | 0.91 | 0.91 | 0.77 | 0.83 | 0.93 | 0.87 | 7 |
| L2 of log-p of overlapping terms only | 0.96 | 0.95 | 0.92 | 0.94 | 0.99 | 0.95 | 1 |
| L2 of term fractions of overlapping terms only | 0.64 | 0.62 | 0.57 | 0.60 | 0.53 | 0.59 | 15 |
| L2 of log of P-values | 0.90 | 0.90 | 0.76 | 0.83 | 0.93 | 0.86 | 10 |
| L2 of P-values | 0.89 | 0.89 | 0.75 | 0.81 | 0.92 | 0.86 | 12 |
| L2 of term fractions | 0.92 | 0.90 | 0.91 | 0.92 | 0.95 | 0.92 | 4 |
| L2 of term frequency | 0.90 | 0.90 | 0.76 | 0.82 | 0.93 | 0.86 | 11 |
| Term coverage | 0.90 | 0.91 | 0.77 | 0.83 | 0.93 | 0.87 | 8 |
| Term overlap | 0.91 | 0.89 | 0.90 | 0.92 | 0.90 | 0.90 | 6 |
| Number of gene MeSH terms | 0.85 | 0.82 | 0.85 | 0.88 | 0.83 | 0.85 | 13 |
| Number of disease MeSH terms | 0.90 | 0.90 | 0.76 | 0.83 | 0.93 | 0.86 | 9 |
| Gene ID | 0.75 | 0.73 | 0.78 | 0.79 | 0.74 | 0.76 | 14 |

  1. The AUCs of the described scoring methods were tested and compared on the validation sets. AUC, area under the curve; CTD, Comparative Toxicogenomics Database.
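
The Mean AUC column appears to be the arithmetic mean of the five validation AUCs in each row, with Rank ordering the methods by that mean. The following is a minimal sketch of that reading, not the authors' code: the averaging rule, the names `reported` and `mean_auc`, and the 0.01 tolerance are assumptions made for illustration. Only three rows are copied from the table. Because the published values are rounded to two decimals, a recomputed mean can differ from the reported one in the last digit, and ties among rounded means cannot be resolved from this table alone.

```python
from statistics import mean

# Three rows copied from the table: the five validation AUCs in column order,
# followed by the reported mean AUC. The dictionary layout is illustrative only.
reported = {
    "L2 of log-p of overlapping terms only": ([0.96, 0.95, 0.92, 0.94, 0.99], 0.95),
    "Cosine distance of term frequency-inverse document frequency": ([0.92, 0.91, 0.95, 0.93, 0.98], 0.94),
    "Cosine distance of P-values": ([0.53, 0.51, 0.65, 0.63, 0.53], 0.57),
}


def mean_auc(aucs):
    """Arithmetic mean of the validation AUCs (assumed averaging rule)."""
    return mean(aucs)


# Recompute the mean for each method from the rounded table values.
recomputed = {name: mean_auc(aucs) for name, (aucs, _) in reported.items()}

# Order the methods from highest to lowest recomputed mean.
ranking = sorted(recomputed, key=recomputed.get, reverse=True)

for position, name in enumerate(ranking, start=1):
    reported_mean = reported[name][1]
    # Allow a small tolerance because the inputs are already rounded.
    agrees = abs(recomputed[name] - reported_mean) <= 0.01
    print(f"{position}. {name}: recomputed {recomputed[name]:.3f}, "
          f"reported {reported_mean:.2f}, agrees within tolerance: {agrees}")
```

The relative order of these three methods can be compared with their Rank values in the table (1, 2 and 16).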