Table 3 Explanation of the scoring functions evaluated

From: Inferring novel gene-disease associations using Medical Subject Heading Over-representation Profiles

Scoring method

Description

Cosine distance of term frequency-inverse document frequency

$\frac{\sum_{j \in M} g_i(j)\, d_i(j)}{\sqrt{\sum_{j \in M} g_i(j)^2}\, \sqrt{\sum_{j \in M} d_i(j)^2}}$

Cosine distance of P-values

$\frac{\sum_{i \in M} g_p(i)\, d_p(i)}{\sqrt{\sum_{i \in M} g_p(i)^2}\, \sqrt{\sum_{i \in M} d_p(i)^2}}$

Cosine distance of term fractions

$\frac{\sum_{i \in M} g_f(i)\, d_f(i)}{\sqrt{\sum_{i \in M} g_f(i)^2}\, \sqrt{\sum_{i \in M} d_f(i)^2}}$

Sum of the log of combined P-values

$\sum_{i \in M} \log\left( g_p(i) + d_p(i) - g_p(i)\, d_p(i) \right)$

Sum of the differences of log P-values

$\sum_{i \in M} \log \frac{g_p(i)}{d_p(i)} = \sum_{i \in M} \left( \log g_p(i) - \log d_p(i) \right)$

L2 of log-p of overlapping terms only

$\sqrt{\sum_{i \in (G \cap D)} \left( \log g_p(i) - \log d_p(i) \right)^2}$

L2 of term fractions of overlapping terms only

$\sqrt{\sum_{i \in (G \cap D)} \left( g_f(i) - d_f(i) \right)^2}$

L2 of log of P-values

$\sqrt{\sum_{i \in M} \left( \log \frac{g_p(i)}{d_p(i)} \right)^2} = \sqrt{\sum_{i \in M} \left( \log g_p(i) - \log d_p(i) \right)^2}$

L2 of P-values

$\sqrt{\sum_{i \in M} \left( g_p(i) - d_p(i) \right)^2}$

L2 of term fractions

$\sqrt{\sum_{i \in M} \left( g_f(i) - d_f(i) \right)^2}$

L2 of term frequency

$\sqrt{\sum_{i \in M} \left( g(i) - d(i) \right)^2}$

Term coverage

|G∪D|

Term overlap

|G∩D|

Number of gene MeSH terms

|G|

Number of disease MeSH terms

|D|

Gene ID

Entrez Gene ID of the gene

  1. M refers to the set of all MeSH terms, G and D to the MeSH terms for the gene and disease profile, respectively. $g(i)$, $g_f(i)$, $g_p(i)$ and $g_i(i)$ refer to the frequency, term fraction, hypergeometric P-value and term frequency-inverse document frequency for the MeSH term $i$ of the gene profile. $d(i)$, $d_f(i)$, $d_p(i)$ and $d_i(i)$ refer to the frequency, term fraction, hypergeometric P-value and term frequency-inverse document frequency for the MeSH term $i$ of the disease profile.
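
To make the scoring functions in the table concrete, here is a minimal Python sketch of a few of them. It is an illustration under stated assumptions, not the authors' implementation. The function names (`cosine_score`, `l2_distance`, `sum_log_combined_p`, `term_overlap`, `term_coverage`) are hypothetical, and each profile is assumed to be a dictionary mapping a MeSH term to one value (frequency, term fraction, P-value or TF-IDF). The table does not say how a term missing from one profile is handled, so the sketch exposes a `default` argument: 0 is a natural choice for frequencies and fractions, 1 for P-values. The L2 scores are computed with a square root, as in the usual Euclidean norm; that is also an assumption.

```python
import math


def cosine_score(g, d):
    """Cosine score between two sparse profiles (dict: MeSH term -> value).

    Assumption: a term missing from a profile has value 0, so only shared
    terms contribute to the dot product.
    """
    dot = sum(value * d.get(term, 0.0) for term, value in g.items())
    norm_g = math.sqrt(sum(value * value for value in g.values()))
    norm_d = math.sqrt(sum(value * value for value in d.values()))
    if norm_g == 0.0 or norm_d == 0.0:
        return 0.0  # convention chosen here for empty or all-zero profiles
    return dot / (norm_g * norm_d)


def l2_distance(g, d, default=0.0, overlap_only=False, use_log=False):
    """Euclidean (L2) distance between two profiles.

    overlap_only=True restricts the sum to terms in both profiles (G ∩ D).
    use_log=True compares log values, as in the log P-value rows; values
    must then be strictly positive, otherwise math.log raises ValueError.
    Terms absent from both profiles are skipped, which matches a sum over
    all of M whenever both profiles share the same default value.
    """
    terms = (g.keys() & d.keys()) if overlap_only else (g.keys() | d.keys())
    total = 0.0
    for term in terms:
        a = g.get(term, default)
        b = d.get(term, default)
        if use_log:
            a, b = math.log(a), math.log(b)
        total += (a - b) ** 2
    return math.sqrt(total)


def sum_log_combined_p(gp, dp, default=1.0):
    """Sum over terms of log(g_p + d_p - g_p * d_p).

    Assumption: a missing P-value is 1, so a term absent from either
    profile contributes log(1) = 0.
    """
    total = 0.0
    for term in gp.keys() | dp.keys():
        a = gp.get(term, default)
        b = dp.get(term, default)
        total += math.log(a + b - a * b)
    return total


def term_overlap(g, d):
    """|G ∩ D|: number of MeSH terms present in both profiles."""
    return len(g.keys() & d.keys())


def term_coverage(g, d):
    """|G ∪ D|: number of MeSH terms present in at least one profile."""
    return len(g.keys() | d.keys())


# Toy profiles with made-up term names and values, for illustration only.
gene_fractions = {"term_a": 0.5, "term_b": 0.5}
disease_fractions = {"term_b": 0.5, "term_c": 0.5}
print(cosine_score(gene_fractions, disease_fractions))
print(l2_distance(gene_fractions, disease_fractions))
print(term_overlap(gene_fractions, disease_fractions))
```

With the toy profiles, only `term_b` is shared, so the cosine score is about 0.5 and the term overlap is 1. Passing `default=1.0, use_log=True` to `l2_distance` gives a sketch of the "L2 of log of P-values" row under the same missing-value assumption.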