Assignment 1, Parsing Gene Ontology

Due Date: March 22, 2025

 

Background
An ontology is a typical example of hierarchical data. Gene Ontology (GO) is one of the most widely used ontology databases in bioinformatics; it provides a framework for elucidating the biological roles of genes through semantic analysis. GO organizes its terms as a directed acyclic graph (DAG) in three domains: biological process (BP), molecular function (MF), and cellular component (CC).

Description
  1. Download the most recent version of Gene Ontology (the OBO file).
  2. Parse the OBO file and build the BP and MF ontologies separately, using the "is-a" and "part-of" relationships (written in the OBO file as "is_a:" and "relationship: part_of" tags).
    • Ignore obsolete terms, i.e., terms marked "is_obsolete: true".
  3. Correct any errors in the data set.
    • Because BP and MF are independent ontologies, there must be no relationships between them. Remove every relationship that links a BP term to an MF term.
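
The parsing and filtering steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not a reference solution: the toy OBO text, the function names (parse_obo, same_domain_edges, cross_domain_edges), and the choice to store relationships as (child, relation, parent) tuples are all hypothetical. It relies on the OBO conventions that each term is a "[Term]" stanza of "tag: value" lines and that text after " ! " is a comment.

```python
from collections import defaultdict

# Hypothetical toy data in OBO style; the real file is much larger.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: GO:0000001
name: toy root process
namespace: biological_process

[Term]
id: GO:0000002
name: toy child process
namespace: biological_process
is_a: GO:0000001 ! toy root process

[Term]
id: GO:0000003
name: toy function
namespace: molecular_function
relationship: part_of GO:0000002 ! toy child process

[Term]
id: GO:0000004
name: toy obsolete term
namespace: biological_process
is_obsolete: true

[Typedef]
id: part_of
name: part of
"""


def parse_obo(lines):
    """Return (namespace_by_id, edges) for non-obsolete [Term] stanzas.

    edges is a set of (child_id, relation, parent_id) tuples.
    """
    stanzas, current = [], None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = defaultdict(list) if line == "[Term]" else None
            if current is not None:
                stanzas.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            # Drop the trailing "! comment" part, if any.
            current[tag].append(value.split(" ! ")[0].strip())

    namespace, edges = {}, set()
    for stanza in stanzas:
        if "true" in stanza["is_obsolete"]:
            continue  # skip obsolete terms
        term_id = stanza["id"][0]
        namespace[term_id] = stanza["namespace"][0]
        for value in stanza["is_a"]:
            edges.add((term_id, "is_a", value.split()[0]))
        for value in stanza["relationship"]:
            parts = value.split()
            if parts[0] == "part_of":
                edges.add((term_id, "part_of", parts[1]))
    return namespace, edges


def same_domain_edges(namespace, edges, domain):
    """Edges whose child and parent both belong to `domain`."""
    return {e for e in edges
            if namespace.get(e[0]) == domain and namespace.get(e[2]) == domain}


def cross_domain_edges(namespace, edges, a, b):
    """Edges that link a term in domain `a` with a term in domain `b`."""
    return {e for e in edges
            if {namespace.get(e[0]), namespace.get(e[2])} == {a, b}}


namespace, edges = parse_obo(SAMPLE_OBO.splitlines())
bp_edges = same_domain_edges(namespace, edges, "biological_process")
errors = cross_domain_edges(namespace, edges,
                            "biological_process", "molecular_function")
```

For the real file, the same function could be given an open file object instead of SAMPLE_OBO.splitlines(). Building each ontology from same_domain_edges drops the cross-domain relationships, while cross_domain_edges keeps them available for counting.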

Submission
  • Submit your Python code "Assignment1.py" and your answers to the following questions via LearnUs.
    1. The number of terms: How many terms does the BP ontology have? How many terms does the MF ontology have?
    2. The root ID: What is the GO ID of the root term of the BP ontology? What is the GO ID of the root term of the MF ontology?
    3. Cross-ontology errors: Find the relationships that link a BP term to an MF term. How many such relationships exist?
    4. Multiple relationship types: Find the pairs of terms that are linked by two or more different types of relationships (e.g., both "is-a" and "part-of"). Which pairs of terms are they?
    5. The number of leaf terms: How many leaf nodes does the BP ontology have? How many leaf nodes does the MF ontology have?
    6. The term depth distribution: Define the depth of a term as the length of the shortest path from the root to that term. Show a histogram of the term depth distribution for the BP ontology and for the MF ontology (in Excel).
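
The graph quantities asked for above (roots, leaves, pairs with several relationship types, and shortest-path depth) can be illustrated on a small made-up graph. This is a sketch under assumptions, not a reference solution: the terms "A" to "E", the (child, relation, parent) edge representation, and the function names are hypothetical, and pairs are treated as ordered (child, parent) pairs. Depth is computed with a breadth-first search from the root, which reaches every term along a shortest path first.

```python
from collections import Counter, defaultdict, deque

# Hypothetical toy ontology: (child, relation, parent) edges.
TERMS = {"A", "B", "C", "D", "E"}
EDGES = {
    ("B", "is_a", "A"),
    ("C", "is_a", "A"),
    ("D", "is_a", "B"),
    ("D", "part_of", "B"),
    ("E", "is_a", "D"),
    ("E", "part_of", "C"),
}


def roots_and_leaves(terms, edges):
    """Roots have no parent; leaves have no child."""
    has_parent = {child for child, _, _ in edges}
    has_child = {parent for _, _, parent in edges}
    return terms - has_parent, terms - has_child


def multi_relation_pairs(edges):
    """Map each (child, parent) pair linked by 2+ relation types to those types."""
    relations = defaultdict(set)
    for child, relation, parent in edges:
        relations[(child, parent)].add(relation)
    return {pair: kinds for pair, kinds in relations.items() if len(kinds) > 1}


def depths(root, edges):
    """Shortest path length from the root to every reachable term (BFS)."""
    children = defaultdict(set)
    for child, _, parent in edges:
        children[parent].add(child)
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in depth:  # first visit is along a shortest path
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


roots, leaves = roots_and_leaves(TERMS, EDGES)
pairs = multi_relation_pairs(EDGES)
histogram = Counter(depths("A", EDGES).values())  # depth -> number of terms
```

In this toy graph, term "E" is reachable through "D" (three steps) and through "C" (two steps), so its depth is 2. The histogram counts could then be written out, for example as CSV, and plotted in Excel.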