Assignment 7, Graph Clustering by Graph Entropy Algorithm

Due Date: 11:59pm, December 12, 2024

Purpose

Understanding of density-based graph clustering algorithms
Practice of applying graph clustering techniques to large-scale graph data

Description

Genetic interaction data are provided to discover gene sets densely connected each other. The densely connected genes are likely to have the same cellular functions. Implement the Graph Entropy algorithm for density-based graph clustering.

Select a seed node of the highest degree among those which are not in any clusters previously determined, and form an initial cluster including the seed node and its neighbors.
Remove iteratively each neighbor of the seed in a decreasing order of degree if the removal of the neighbor decreases graph entropy.
Add iteratively each node on the outer boundary of the current cluster in a decreasing order of degree if the addition of the node decreases graph entropy.
Return the current cluster
Repeat steps-1, 2, 3, and 4 until there is no seed node to be selected.
Your python code should take an input file ("assignment6_input.txt"), and return an output file ("assignment7_output.txt") which has the clusters of size greater than 3. In an output file, print output clusters in a decreasing order of cluster size. Print each cluster at each line in the format of the size and gene names in the cluster, for example, "4: YBR160W YDR224C YPL231W YBR081C". In each cluster, print the gene names in an arphabetic order.

Data Set

Gene-gene interaction data are given in a tab-delimited text file. Each row represents an interaction (edge) between two genes (vertices).

Submission

Submit your Python code, "assignment7.py", and the output file via LearnUs.

Note

In steps-1, 2 and 3, if two or more nodes have the same highest degree, then select one of them in an alphabetic order of the gene names.
Please add comments in your code including your name at the first line.