Assignment 3, Measuring Annotation-Based Semantic Similarity

Due Date: ~~April 13~~ April 20, 2025

Background

Semantic similarity measures how closely related two terms in an ontology are. This concept has been applied to measure the functional similarity between two genes annotated to the ontology. Note that a single gene can be annotated to multiple terms of an ontology. Therefore, the functional similarity between two genes can be computed as the semantic similarity between the two sets of terms annotating the genes.

Description

Implement the annotation-based pairwise semantic similarity measure: the information content of the most specific common ancestor divided by the average information content of the two terms of interest. Use the best-match averaging method to convert term-to-term similarities into the similarity between two sets of terms.
Measure the semantic similarity of each PPI (from Assignment 2) on the BP and MF ontologies separately, and take the larger of the two scores as the functional similarity of the PPI.
Evaluate the performance of the two semantic similarity measures (node-based and annotation-based).
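
The pairwise measure and best-match averaging described above could be sketched as follows. This is only an illustration under stated assumptions, not a reference solution: the names `information_content`, `term_similarity`, `best_match_average`, and `functional_similarity` are hypothetical, `ancestors[t]` is assumed to contain `t` itself, `counts[t]` is assumed to count the genes annotated to `t` or any of its descendants, and the most specific common ancestor is taken to be the common ancestor with the highest information content. Best-match averaging is taken here as the mean of the two directional averages; some sources instead divide the summed best matches by the total number of terms, so check which variant the course expects. The toy ontology at the bottom is made up for the example.

```python
import math


def information_content(counts, total):
    """IC(t) = log(total / count(t)), which equals -log p(t).

    Assumption: counts[t] is the number of genes annotated to t or to
    any of its descendants, so every count is at least 1.
    """
    return {t: math.log(total / c) for t, c in counts.items()}


def term_similarity(t1, t2, ancestors, ic):
    """IC of the most informative common ancestor over the mean IC of t1, t2."""
    common = ancestors[t1] & ancestors[t2]
    mean_ic = (ic[t1] + ic[t2]) / 2
    if not common or mean_ic == 0:
        return 0.0  # no shared ancestor, or both terms carry no information
    return max(ic[a] for a in common) / mean_ic


def best_match_average(terms1, terms2, ancestors, ic):
    """Mean of the two directional averages of best-match similarities."""
    if not terms1 or not terms2:
        return 0.0  # assumption: an unannotated protein scores 0
    best1 = [max(term_similarity(a, b, ancestors, ic) for b in terms2) for a in terms1]
    best2 = [max(term_similarity(a, b, ancestors, ic) for a in terms1) for b in terms2]
    return (sum(best1) / len(best1) + sum(best2) / len(best2)) / 2


def functional_similarity(bp_score, mf_score):
    """Functional similarity of a PPI: the larger of its BP and MF scores."""
    return max(bp_score, mf_score)


# Toy ontology (made up): root -> A, B and A -> C, D.
ancestors = {
    "root": {"root"},
    "A": {"A", "root"},
    "B": {"B", "root"},
    "C": {"C", "A", "root"},
    "D": {"D", "A", "root"},
}
counts = {"root": 8, "A": 4, "B": 4, "C": 2, "D": 1}
ic = information_content(counts, total=8)

sim_cd = term_similarity("C", "D", ancestors, ic)
bma = best_match_average({"C"}, {"D", "B"}, ancestors, ic)
print(sim_cd, bma)
```

In practice the same functions would be run once with the BP annotations and once with the MF annotations of each protein pair, and `functional_similarity` would combine the two resulting scores.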

Use the human protein complex dataset as ground truth. Assign a PPI to the positive set if its two proteins occur together in at least one protein complex; otherwise, assign it to the negative set.
Use the similarity score of each PPI to predict whether the PPI is positive or negative, and measure the area under the ROC curve (AUC). Compare the AUC values of the two semantic similarity measures.
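
The labeling and AUC steps above might look like the sketch below. It is a minimal illustration, assuming the complexes are available as an iterable of protein-ID collections and the PPIs as pairs of protein IDs; the function names `label_ppis` and `roc_auc` and the sample data are hypothetical. The AUC is computed from ranks (equivalent to the probability that a random positive PPI scores higher than a random negative one, with ties counted as one half) so the example has no third-party dependencies; a library routine could be used instead.

```python
from collections import defaultdict


def label_ppis(ppis, complexes):
    """Return 1 for PPIs whose proteins share at least one complex, else 0."""
    membership = defaultdict(set)  # protein ID -> indices of its complexes
    for idx, members in enumerate(complexes):
        for protein in members:
            membership[protein].add(idx)
    empty = set()
    return [
        1 if membership.get(a, empty) & membership.get(b, empty) else 0
        for a, b in ppis
    ]


def roc_auc(labels, scores):
    """Rank-based AUC; tied scores share the average of their ranks."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative PPI")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = 0.0
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(labels[order[k]] for k in range(i, j))
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Made-up sample data for illustration only.
complexes = [{"P1", "P2", "P3"}, {"P4", "P5"}]
ppis = [("P1", "P3"), ("P1", "P4"), ("P5", "P4"), ("P6", "P1")]
labels = label_ppis(ppis, complexes)
auc = roc_auc(labels, [0.9, 0.8, 0.3, 0.1])
print(labels, auc)
```

Calling `roc_auc` once with the node-based scores and once with the annotation-based scores, against the same labels, gives the two values to compare.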

Submission

Submit (1) your Python code "Assignment3.py", (2) the distribution of annotation-based semantic similarities of all PPIs, and (3) the comparison of AUC values of the two semantic similarity measures via LearnUs.

For the similarity distribution, show the frequency for every 0.1 of the similarity scores.
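
One possible way to tabulate the distribution is sketched below. It assumes scores lie in [0, 1] and uses half-open bins [0.0, 0.1), [0.1, 0.2), and so on, with a score of exactly 1.0 counted in the last bin; that bin convention and the function name `similarity_histogram` are assumptions, and the sample scores are made up.

```python
def similarity_histogram(scores, n_bins=10):
    """Count scores in equal-width bins over [0, 1]; 1.0 falls in the last bin."""
    counts = [0] * n_bins
    for s in scores:
        idx = min(int(s * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up scores for illustration only.
hist = similarity_histogram([0.0, 0.05, 0.15, 0.95, 1.0, 0.42])
for k, count in enumerate(hist):
    print(f"{k / 10:.1f}-{(k + 1) / 10:.1f}: {count}")
```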