Assignment 2, Measuring Node-Based Semantic Similarity

Due Date: March 31 April 2, 2025

 

Background
Semantic similarity is a measurement of the similarity between two terms in an ontology. This concept has been applied to measure functional similarity between two genes annotated to the ontology. Note that a single gene can be annotated to multiple terms of an ontology. Therefore, the functional similarity between two genes can be translated to the semantic similarity between two sets of terms annotating the genes.

Description
  1. Download the most recent version of the human annotation data (goa_human.gaf). Label the nodes in the BP (biological process) and MF (molecular function) ontologies with their annotating genes, but exclude annotations with the "IEA" evidence code and annotations carrying the "NOT" qualifier.
  2. Read the human PPI (protein-protein interaction) dataset provided. If the two proteins of an interacting pair are not both annotated to BP terms or both annotated to MF terms, remove the pair.
  3. Implement the node-based group-wise semantic similarity measure, i.e., the number of common ancestor terms divided by the number of all ancestor terms of the two sets of terms annotating the genes.
  4. Measure the semantic similarity of each PPI on the BP and MF ontologies separately, and select the larger of the two scores as the functional similarity of the PPI.
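
The steps above can be sketched in Python as follows. This is only a minimal illustration under several assumptions, not a reference solution: genes are keyed by the symbol in the third GAF column (the identifier used in the PPI file may differ), `parents` is a hypothetical dictionary mapping each GO term to its direct parent terms (built separately from the ontology file), a term counts as its own ancestor, and a pair with no ontology in which both genes are annotated is treated as removed (returned as None).

```python
from collections import defaultdict


def parse_gaf_lines(lines):
    """Collect GO terms per gene for BP (aspect 'P') and MF (aspect 'F').

    Skips header lines, IEA annotations, and NOT-qualified annotations.
    """
    annotations = {"P": defaultdict(set), "F": defaultdict(set)}
    for line in lines:
        if line.startswith("!"):  # GAF header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        # GAF columns (0-based): 2 symbol, 3 qualifier, 4 GO ID, 6 evidence, 8 aspect
        gene, qualifier, go_id = cols[2], cols[3], cols[4]
        evidence, aspect = cols[6], cols[8]
        if aspect not in annotations:  # ignore the CC ontology (aspect 'C')
            continue
        if evidence == "IEA" or "NOT" in qualifier.split("|"):
            continue
        annotations[aspect][gene].add(go_id)
    return annotations


def ancestor_closure(terms, parents):
    """Return the given terms together with all of their ancestors.

    Assumption: `parents` maps a term to an iterable of its direct parents.
    """
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue
        seen.add(term)
        stack.extend(parents.get(term, ()))
    return seen


def groupwise_similarity(terms_a, terms_b, parents):
    """Common ancestors divided by all ancestors of the two term sets."""
    anc_a = ancestor_closure(terms_a, parents)
    anc_b = ancestor_closure(terms_b, parents)
    union = anc_a | anc_b
    if not union:
        return 0.0
    return len(anc_a & anc_b) / len(union)


def functional_similarity(gene_a, gene_b, annotations, parents):
    """Larger of the BP and MF scores, or None if the pair shares no ontology."""
    scores = []
    for aspect in ("P", "F"):
        by_gene = annotations[aspect]
        if gene_a in by_gene and gene_b in by_gene:
            scores.append(
                groupwise_similarity(by_gene[gene_a], by_gene[gene_b], parents)
            )
    return max(scores) if scores else None
```

Whether a term is included among its own ancestors, and which ontology relationships are followed when building `parents`, both change the scores, so these choices should be stated in the submitted code.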

Submission
  • Submit your Python code "Assignment2.py" and the distribution of similarities of all PPIs via LearnUs.
    • For the similarity distribution, report the frequency of PPIs in each 0.1-wide interval of the similarity scores.
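
One possible way to tabulate the distribution is sketched below. It assumes ten intervals [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0], with a score of exactly 1.0 counted in the last interval; the function name `similarity_histogram` is hypothetical, and a different boundary convention is equally acceptable if stated.

```python
def similarity_histogram(scores):
    """Count scores in ten 0.1-wide bins covering the range 0.0 to 1.0."""
    n_bins = 10
    counts = [0] * n_bins
    for score in scores:
        # A score of exactly 1.0 would index bin 10, so clamp it into the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


def format_histogram(counts):
    """Return one text line per bin, e.g. '0.0-0.1: 2'."""
    rows = []
    for i, count in enumerate(counts):
        rows.append(f"{i / 10:.1f}-{(i + 1) / 10:.1f}: {count}")
    return rows
```

Calling `format_histogram(similarity_histogram(scores))` on the list of PPI scores gives ten lines that can be pasted into the submission as the distribution.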