Mastering Biological Knowledge Graphs: A Step-by-Step PyBEL Tutorial in Google Colab

Installing Essential Packages

This tutorial begins by installing PyBEL and its supporting libraries (NetworkX, Matplotlib, Seaborn, and Pandas) in Google Colab, so that everything needed to build and analyze a biological knowledge graph is available before the first graph is created.

!pip install pybel pybel-tools networkx matplotlib seaborn pandas -q
 
import pybel
import pybel.dsl as dsl
from pybel import BELGraph
from pybel.io import to_pickle, from_pickle
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

Building an Alzheimer’s Disease Knowledge Graph

A BELGraph is initialized to represent molecular interactions related to Alzheimer’s disease. Proteins (APP, Abeta, GSK3B, and tau under its HGNC symbol MAPT) and biological processes (inflammatory response and apoptotic process) are defined using PyBEL’s domain-specific language (DSL). The code then encodes causal 'increases' relations, a protein modification (phosphorylated tau), and an association between Abeta and tau, each annotated with a citation and an evidence string.

graph = BELGraph(
    name="Alzheimer's Disease Pathway",
    version="1.0.0",
    description="Example pathway showing protein interactions in AD",
    authors="PyBEL Tutorial"
)
 
app = dsl.Protein(name="APP", namespace="HGNC")
abeta = dsl.Protein(name="Abeta", namespace="CHEBI")
tau = dsl.Protein(name="MAPT", namespace="HGNC")
gsk3b = dsl.Protein(name="GSK3B", namespace="HGNC")
inflammation = dsl.BiologicalProcess(name="inflammatory response", namespace="GO")
apoptosis = dsl.BiologicalProcess(name="apoptotic process", namespace="GO")

# Causal edges: APP processing yields Abeta, which drives inflammation
graph.add_increases(app, abeta, citation="PMID:12345678", evidence="APP cleavage produces Abeta")
graph.add_increases(abeta, inflammation, citation="PMID:87654321", evidence="Abeta triggers neuroinflammation")

# Phosphorylated tau is modeled as MAPT carrying a "Ph" protein modification
tau_phosphorylated = dsl.Protein(name="MAPT", namespace="HGNC",
                               variants=[dsl.ProteinModification("Ph")])
graph.add_increases(gsk3b, tau_phosphorylated, citation="PMID:11111111", evidence="GSK3B phosphorylates tau")
graph.add_increases(tau_phosphorylated, apoptosis, citation="PMID:22222222", evidence="Hyperphosphorylated tau causes cell death")
graph.add_increases(inflammation, apoptosis, citation="PMID:33333333", evidence="Inflammation promotes apoptosis")

# Non-causal association between Abeta and unmodified tau
graph.add_association(abeta, tau, citation="PMID:44444444", evidence="Abeta and tau interact synergistically")

print(f"Created BEL graph with {graph.number_of_nodes()} nodes and {graph.number_of_edges()} edges")

Advanced Network Analysis

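Degree, betweenness, and closeness centrality are computed with NetworkX to identify key nodes in the graph. Degree centrality reflects how many edges touch a node, betweenness how often a node lies on shortest paths between other nodes, and closeness how short its paths to other nodes are. High-scoring proteins and processes are candidates for critical roles in the disease pathway.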

degree_centrality = nx.degree_centrality(graph)
betweenness_centrality = nx.betweenness_centrality(graph)
closeness_centrality = nx.closeness_centrality(graph)
 
most_central = max(degree_centrality, key=degree_centrality.get)
print(f"Most connected node: {most_central}")
print(f"Degree centrality: {degree_centrality[most_central]:.3f}")
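
To make the degree centrality value concrete, the following minimal sketch recomputes it by hand on a toy edge list. It assumes the normalization that NetworkX documents for degree_centrality (a node's degree divided by n - 1, counting both incoming and outgoing edges in a directed graph). The edge list and the helper name degree_centrality_by_hand are hypothetical and are not part of the tutorial's graph.

```python
from collections import Counter

# Hypothetical directed edges (source, target), loosely mirroring the pathway
toy_edges = [
    ("APP", "Abeta"),
    ("Abeta", "inflammation"),
    ("inflammation", "apoptosis"),
    ("GSK3B", "tau-Ph"),
    ("tau-Ph", "apoptosis"),
]

def degree_centrality_by_hand(edges):
    """Return degree / (n - 1) per node, counting in- and out-edges.

    Only nodes that appear in at least one edge are included.
    """
    degree = Counter()
    for source, target in edges:
        degree[source] += 1  # outgoing edge
        degree[target] += 1  # incoming edge
    n = len(degree)
    return {node: d / (n - 1) for node, d in degree.items()}

toy_centrality = degree_centrality_by_hand(toy_edges)
for node, value in sorted(toy_centrality.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {value:.3f}")
```

Because a BELGraph is a multigraph, parallel edges between the same pair of nodes each add to the degree, so a centrality above 1.0 is possible there.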

Classifying Biological Entities

Nodes are categorized by their biological function, such as Proteins or Biological Processes, allowing a quantitative overview of the graph composition.

node_types = Counter()
for node in graph.nodes():
    node_types[node.function] += 1

print("Node distribution:")
for func, count in node_types.items():
    print(f"  {func}: {count}")

Pathway Composition and Relationship Types

Proteins and biological processes are separated to assess pathway complexity. Edges are then counted by relation type, such as 'increases' or 'association', to show which kinds of interaction dominate the graph.

proteins = [node for node in graph.nodes() if node.function == 'Protein']
processes = [node for node in graph.nodes() if node.function == 'BiologicalProcess']
 
print(f"Proteins in pathway: {len(proteins)}")
print(f"Biological processes: {len(processes)}")
 
edge_types = Counter()
for u, v, data in graph.edges(data=True):
    edge_types[data.get('relation')] += 1

print("\nRelationship types:")
for rel, count in edge_types.items():
    print(f"  {rel}: {count}")

Analyzing Literature Evidence

Citation identifiers and evidence statements are extracted from edges to evaluate the graph's basis in scientific literature, summarizing the number of total and unique citations.

citations = []
evidences = []
for _, _, data in graph.edges(data=True):
    if 'citation' in data:
        # A citation may be stored as a dict-like object, which is unhashable;
        # keep its string form so set() can de-duplicate it below
        citations.append(str(data['citation']))
    if 'evidence' in data:
        evidences.append(data['evidence'])

print(f"Total citations: {len(citations)}")
print(f"Unique citations: {len(set(citations))}")
print(f"Evidence statements: {len(evidences)}")
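
Counting unique citations requires hashable values. The sketch below shows one way to reduce dict-like citation records to tuples before building a set. The record layout (keys 'namespace' and 'identifier'), the sample data, and the helper name citation_key are assumptions for illustration. Inspect the edge data produced by your PyBEL version to confirm the actual keys.

```python
# Hypothetical edge data, standing in for the dicts from graph.edges(data=True)
toy_edge_data = [
    {"citation": {"namespace": "pubmed", "identifier": "12345678"}, "evidence": "A"},
    {"citation": {"namespace": "pubmed", "identifier": "12345678"}, "evidence": "B"},
    {"citation": {"namespace": "pubmed", "identifier": "87654321"}, "evidence": "C"},
    {"relation": "hasVariant"},  # an edge without citation or evidence
]

def citation_key(citation):
    """Turn a citation into a hashable key; dicts become sorted item tuples."""
    if isinstance(citation, dict):
        return tuple(sorted((k, str(v)) for k, v in citation.items()))
    return str(citation)

keys = [citation_key(d["citation"]) for d in toy_edge_data if "citation" in d]
unique_keys = set(keys)

print(f"Total citations: {len(keys)}")
print(f"Unique citations: {len(unique_keys)}")
```

A tuple key keeps the namespace and identifier separate, which can be handy when grouping edges by source publication later.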

Subgraph Extraction for Inflammation

A focused subgraph centered on inflammation and its neighboring nodes is created to study specific disease mechanisms in detail.

inflammation_nodes = [inflammation]
inflammation_neighbors = list(graph.predecessors(inflammation)) + list(graph.successors(inflammation))
inflammation_subgraph = graph.subgraph(inflammation_nodes + inflammation_neighbors)
 
print(f"Inflammation subgraph: {inflammation_subgraph.number_of_nodes()} nodes, {inflammation_subgraph.number_of_edges()} edges")
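
The neighborhood extraction above amounts to taking an induced subgraph: keep the focus node, its direct predecessors and successors, and only those edges whose endpoints both survive. The following is a minimal sketch of that idea on a hypothetical edge list. The node names and the helper neighborhood_subgraph are illustrative and are not the tutorial's BEL nodes.

```python
# Hypothetical directed edges (source, target)
toy_edges = [
    ("APP", "Abeta"),
    ("Abeta", "inflammation"),
    ("inflammation", "apoptosis"),
    ("GSK3B", "tau-Ph"),
    ("tau-Ph", "apoptosis"),
]

def neighborhood_subgraph(edges, focus):
    """Return (nodes, edges) of the subgraph induced by focus and its direct neighbors."""
    predecessors = {s for s, t in edges if t == focus}
    successors = {t for s, t in edges if s == focus}
    keep = {focus} | predecessors | successors
    # Keep an edge only if both of its endpoints are in the kept node set
    kept_edges = [(s, t) for s, t in edges if s in keep and t in keep]
    return keep, kept_edges

sub_nodes, sub_edges = neighborhood_subgraph(toy_edges, "inflammation")
print(sorted(sub_nodes))
print(sub_edges)
```

An induced subgraph also retains any edge that runs directly between two neighbors, even if it does not touch the focus node.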

Graph Querying for Pathways

The script enumerates simple paths of at most three edges from APP to apoptosis to identify mechanistic routes leading to cell death. It also counts the nodes with a direct edge into apoptosis, which in this graph are the factors that increase it.

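try:
    paths = list(nx.all_simple_paths(graph, app, apoptosis, cutoff=3))
    print(f"Paths from APP to apoptosis: {len(paths)}")
    if paths:
        # Paths are not returned in order of length, so take the minimum
        print(f"Shortest path length: {min(len(p) for p in paths) - 1}")
    else:
        print("No paths found between APP and apoptosis within 3 steps")
except nx.NodeNotFound:
    # Raised when APP or apoptosis is missing from the graph
    print("APP or apoptosis is not in the graph")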
 
apoptosis_inducers = list(graph.predecessors(apoptosis))
print(f"Factors that increase apoptosis: {len(apoptosis_inducers)}")
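
For intuition about what bounded simple-path enumeration does, here is a minimal depth-first sketch on a hypothetical adjacency dict. It is not the NetworkX implementation. The helper name simple_paths and the toy graph are illustrative, and the cutoff is taken to mean the maximum number of edges in a path.

```python
# Hypothetical adjacency dict: node -> list of successors
toy_graph = {
    "APP": ["Abeta"],
    "Abeta": ["inflammation", "tau"],
    "inflammation": ["apoptosis"],
    "tau": ["tau-Ph"],
    "tau-Ph": ["apoptosis"],
    "apoptosis": [],
}

def simple_paths(graph, source, target, cutoff):
    """Yield paths from source to target that repeat no node and use at most cutoff edges."""
    stack = [[source]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == target:
            yield path
            continue
        if len(path) - 1 == cutoff:
            continue  # edge budget used up without reaching the target
        for neighbor in graph.get(node, []):
            if neighbor not in path:  # a simple path never revisits a node
                stack.append(path + [neighbor])

paths_3 = list(simple_paths(toy_graph, "APP", "apoptosis", cutoff=3))
paths_4 = list(simple_paths(toy_graph, "APP", "apoptosis", cutoff=4))
print(f"Paths within 3 edges: {len(paths_3)}")
print(f"Paths within 4 edges: {len(paths_4)}")
print(f"Shortest path length: {min(len(p) for p in paths_4) - 1}")
```

Raising the cutoff from 3 to 4 admits the longer route through tau. The enumeration order is not sorted by length, which is why the shortest length is taken with min() rather than read from the first path.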

Visualization and Export

Four plots in a 2×2 grid show the network layout, the degree centrality distribution, node type proportions, and relationship type counts, giving a quick visual summary of the knowledge graph.

adj_matrix = nx.adjacency_matrix(graph)
node_labels = [str(node) for node in graph.nodes()]
 
plt.figure(figsize=(12, 8))
 
plt.subplot(2, 2, 1)
pos = nx.spring_layout(graph, k=2, iterations=50)
nx.draw(graph, pos, with_labels=False, node_color='lightblue',
        node_size=1000, font_size=8, font_weight='bold')
plt.title("BEL Network Graph")
 
plt.subplot(2, 2, 2)
centralities = list(degree_centrality.values())
plt.hist(centralities, bins=10, alpha=0.7, color='green')
plt.title("Degree Centrality Distribution")
plt.xlabel("Centrality")
plt.ylabel("Frequency")
 
plt.subplot(2, 2, 3)
functions = list(node_types.keys())
counts = list(node_types.values())
plt.pie(counts, labels=functions, autopct='%1.1f%%', startangle=90)
plt.title("Node Type Distribution")
 
plt.subplot(2, 2, 4)
relations = list(edge_types.keys())
rel_counts = list(edge_types.values())
plt.bar(relations, rel_counts, color='orange', alpha=0.7)
plt.title("Relationship Types")
plt.xlabel("Relation")
plt.ylabel("Count")
plt.xticks(rotation=45)
 
plt.tight_layout()
plt.show()

This tutorial covered creating, annotating, analyzing, and visualizing a biological knowledge graph with PyBEL. The same steps can be applied to other datasets and combined with further omics analyses or machine learning workflows.