Quickstart Tutorial

This tutorial provides a quick introduction to PyAGC’s core functionality.

Installation

First, install PyAGC:

pip install pyagc

Basic Clustering Example

Example 1: Training-Free Clustering with SSGC

SSGC is a non-parametric method that requires no training:

import torch
from torch_geometric.data import Data
from pyagc.data import get_dataset
from pyagc.models import SSGC
from pyagc.clusters import KMeansClusterHead
from pyagc.metrics import label_metrics

# Load dataset
x, edge_index, y = get_dataset('Cora', root='./data')
data = Data(x=x, edge_index=edge_index)

# Create SSGC model (training-free)
model = SSGC(alpha=0.05, K=12, cached=True)

# Generate embeddings
z = model.embed(data.x, data.edge_index)

# Cluster with KMeans
n_clusters = int(y.max().item()) + 1
kmeans = KMeansClusterHead(n_clusters=n_clusters, backend='torch')
pred = kmeans.fit_predict(z)

# Evaluate
results = label_metrics(y, pred, metrics=['NMI', 'ARI', 'ACC', 'F1'])
print(f"ACC: {results['ACC']:.4f}, NMI: {results['NMI']:.4f}")

Output:

ACC: 0.6538, NMI: 0.5185
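
To make "training-free" concrete: SSGC (Simple Spectral Graph Convolution) has no learnable weights. It smooths the input features by averaging K rounds of propagation over the normalized adjacency matrix, mixing a fraction alpha of the raw features back in at each step. The following is a minimal pure-Python sketch of that idea on a toy graph, not PyAGC's implementation; the function name `ssgc_embed` and the dense-matrix representation are assumptions made for illustration, and the library's normalization details may differ.

```python
import math

def ssgc_embed(x, edges, alpha=0.05, K=12):
    """Average K propagation steps of x over a symmetric-normalized graph."""
    n, d = len(x), len(x[0])
    # Dense adjacency with self-loops (only sensible for a toy graph).
    adj = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    for i in range(n):
        adj[i][i] = 1.0
    deg = [sum(row) for row in adj]
    # T = D^{-1/2} (A + I) D^{-1/2}
    t = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]

    h = [row[:] for row in x]            # h holds T^k x after k steps
    z = [[0.0] * d for _ in range(n)]
    for _ in range(K):
        h = [[sum(t[i][j] * h[j][c] for j in range(n)) for c in range(d)]
             for i in range(n)]
        for i in range(n):
            for c in range(d):
                z[i][c] += ((1 - alpha) * h[i][c] + alpha * x[i][c]) / K
    return z

# Toy path graph 0 - 1 - 2 with two-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embeddings = ssgc_embed(features, [(0, 1), (1, 2)], alpha=0.05, K=4)
```

With alpha=1.0 the output equals the input features; a larger K averages in information from more distant neighbors.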

Example 2: Deep Contrastive Clustering with DGI

DGI (Deep Graph Infomax) learns node representations with a contrastive objective; the resulting embeddings are then clustered with KMeans:

import torch
from torch_geometric.data import Data
from pyagc.data import get_dataset
from pyagc.models import DGI
from pyagc.encoders import create_tuned_gnn
from pyagc.clusters import KMeansClusterHead
from pyagc.metrics import label_metrics

# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x, edge_index, y = get_dataset('Cora', root='./data')
data = Data(x=x, edge_index=edge_index).to(device)

# Create encoder and model
encoder = create_tuned_gnn(
    gnn_type='gcn',
    in_channels=data.num_features,
    hidden_channels=512,
    num_layers=1,
    act_last=True
)
model = DGI(hidden_channels=512, encoder=encoder).to(device)

# Train
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(1, 301):
    loss = model.train_full(data, optimizer, epoch, verbose=(epoch % 50 == 0))

# Inference and clustering
model.eval()
with torch.no_grad():
    z = model.infer_full(data)

n_clusters = int(y.max().item()) + 1
kmeans = KMeansClusterHead(n_clusters=n_clusters)
pred = kmeans.fit_predict(z.cpu())

# Evaluate
results = label_metrics(y, pred)
print(f"ACC: {results['ACC']:.4f}, NMI: {results['NMI']:.4f}")
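
The contrastive objective behind DGI can be sketched as follows: embeddings of real nodes (positives) and of a corrupted graph (negatives) are each scored against a graph-level summary, and a binary cross-entropy loss pushes the two sets of scores apart. This is a simplified pure-Python illustration, not PyAGC's `DGI` class: the function name `dgi_style_loss` is hypothetical, and a plain dot product stands in for the learned bilinear discriminator of the original method.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dgi_style_loss(pos, neg):
    """Binary cross-entropy between real and corrupted node embeddings."""
    n, d = len(pos), len(pos[0])
    # Graph summary: sigmoid of the mean positive embedding.
    summary = [sigmoid(sum(row[c] for row in pos) / n) for c in range(d)]

    def score(h):
        # Assumption: dot product instead of a learned bilinear form.
        return sigmoid(sum(a * b for a, b in zip(h, summary)))

    pos_loss = -sum(math.log(score(h)) for h in pos) / n
    neg_loss = -sum(math.log(1.0 - score(h)) for h in neg) / len(neg)
    return pos_loss + neg_loss

# Real nodes agree with the summary, corrupted nodes do not.
separated = dgi_style_loss([[2.0, 2.0], [2.0, 2.0]],
                           [[-2.0, -2.0], [-2.0, -2.0]])
# Embeddings that carry no signal at all.
uninformative = dgi_style_loss([[0.0, 0.0]], [[0.0, 0.0]])
```

In this sketch, all-zero embeddings give every pair a score of 0.5, so the loss is 2 ln 2; well-separated embeddings drive it toward zero.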

Configuration-Driven Workflow

For reproducible experiments, PyAGC supports YAML configuration files.

Create a configuration file (train.conf.yaml):

default:
  # Training
  lr: 0.001
  wd: 0.0
  epochs: 300
  patience: 50

  # Model architecture
  gnn_type: gcn
  hidden_channels: 512
  num_layers: 1
  dropout: 0.0
  norm: null
  act_last: true

  # Evaluation
  label_metrics: ['NMI', 'ARI', 'ACC', 'F1']
  struct_metrics: ['Mod', 'Cond']
  kmeans_backend: torch
  kmeans_n_init: 10

# Dataset-specific overrides
Cora:
  epochs: 300
  hidden_channels: 512
  num_layers: 1
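
The file layout suggests that keys under a dataset name override those under `default`. A minimal sketch of such a merge, assuming a shallow key-by-key override (the helper `resolve_config` is hypothetical, and PyAGC's own loader may behave differently):

```python
def resolve_config(config, dataset):
    """Return the default settings with dataset-specific keys layered on top."""
    resolved = dict(config.get("default", {}))
    resolved.update(config.get(dataset, {}))
    return resolved

# Parsed form of a shortened train.conf.yaml.
config = {
    "default": {"lr": 0.001, "epochs": 300, "hidden_channels": 512, "num_layers": 1},
    "Cora": {"epochs": 300, "hidden_channels": 512, "num_layers": 1},
}

cora_cfg = resolve_config(config, "Cora")
# A dataset without its own section falls back to the defaults.
fallback_cfg = resolve_config(config, "UnlistedDataset")
```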

Run the experiment from the command line:

cd benchmark/DGI
python main.py --dataset Cora --device cuda:0 --runs 5

Output:

============================================================
Configuration
============================================================
  dataset: Cora
  epochs: 300
  hidden_channels: 512
  lr: 0.001
  ...

============================================================
Training Mode: Full-batch
============================================================
Epoch: 001 Loss: 0.6931
...
Epoch: 300 Loss: 0.2134

============================================================
Clustering Stage
============================================================
Run 1/5: NMI=54.78, ARI=45.73, ACC=66.73, F1=65.66
Run 2/5: NMI=56.45, ARI=54.45, ACC=75.00, F1=72.98
...

============================================================
Final Results Summary
============================================================
Clustering Metrics (mean ± std):
  NMI   :  56.22 ± 0.87
  ARI   :  52.03 ± 3.22
  ACC   :  72.22 ± 2.92
  F1    :  70.07 ± 2.56
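
The summary lines aggregate each metric over the runs. A minimal sketch of that aggregation with the standard library, using made-up per-run values rather than the results above; whether PyAGC reports the population or the sample standard deviation is not stated here, so the choice of `pstdev` is an assumption, and `summarize` is a hypothetical helper:

```python
from statistics import mean, pstdev

def summarize(runs):
    """Map each metric name to (mean, std) across a list of per-run dicts."""
    return {
        metric: (mean(r[metric] for r in runs), pstdev(r[metric] for r in runs))
        for metric in runs[0]
    }

# Hypothetical per-run scores, in percent.
runs = [
    {"NMI": 50.0, "ACC": 60.0},
    {"NMI": 54.0, "ACC": 70.0},
]

summary = summarize(runs)
```

Swapping `pstdev` for `stdev` gives the sample standard deviation, which is noticeably larger when only a few runs are available.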

Comparing Multiple Methods

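PyAGC makes it easy to compare different algorithms because they share the same clustering head and metrics API. In the outline below, the DGI and DMoN training steps are elided, so `z_dgi` and `pred_dmon` must be produced before the corresponding lines run: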

from pyagc.data import get_dataset
from pyagc.models import SSGC, DGI, DMoN
from pyagc.clusters import KMeansClusterHead
from pyagc.metrics import label_metrics
import torch

# Load data
x, edge_index, y = get_dataset('Cora')
n_clusters = int(y.max().item()) + 1

# Method 1: SSGC (training-free)
ssgc = SSGC(alpha=0.05, K=12)
z_ssgc = ssgc.embed(x, edge_index)
pred_ssgc = KMeansClusterHead(n_clusters=n_clusters).fit_predict(z_ssgc)
results_ssgc = label_metrics(y, pred_ssgc)

# Method 2: DGI (decoupled)
# ... train DGI and get embeddings z_dgi ...
pred_dgi = KMeansClusterHead(n_clusters=n_clusters).fit_predict(z_dgi)
results_dgi = label_metrics(y, pred_dgi)

# Method 3: DMoN (joint end-to-end)
# ... train DMoN which outputs predictions directly ...
results_dmon = label_metrics(y, pred_dmon)

# Compare
print(f"SSGC: ACC={results_ssgc['ACC']:.4f}")
print(f"DGI:  ACC={results_dgi['ACC']:.4f}")
print(f"DMoN: ACC={results_dmon['ACC']:.4f}")

Scaling to Large Graphs

For large graphs, use mini-batch training.

Update the configuration to enable mini-batch training:

default:
  mini_batch: true
  batch_size: 1024
  fan_out: 10
  num_workers: 4
  infer_batch_size: 256

Run on a large dataset:

python main.py --dataset Products --device cuda:0 --runs 1

With mini-batch mode enabled, PyAGC automatically uses neighbor sampling for both training and inference, which keeps large graphs such as Products (2.4M nodes) tractable.
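
Neighbor sampling bounds memory by working on a batch of seed nodes plus a capped number of neighbors per node, rather than on the full graph. A minimal one-hop sketch in pure Python follows; `sample_neighbors` is a hypothetical helper, and the meaning assumed for `fan_out` (maximum neighbors kept per node) follows common usage rather than PyAGC's documented behavior:

```python
import random

def sample_neighbors(adj, seeds, fan_out, rng):
    """For each seed node, keep at most fan_out randomly chosen neighbors."""
    sampled = {}
    for node in seeds:
        neighbors = adj.get(node, [])
        if len(neighbors) > fan_out:
            neighbors = rng.sample(neighbors, fan_out)
        sampled[node] = list(neighbors)
    return sampled

# Toy adjacency list: node 0 is a hub with many neighbors.
adj = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0, 3], 3: [0, 2]}
rng = random.Random(0)  # fixed seed for repeatability

batch = sample_neighbors(adj, seeds=[0, 2], fan_out=2, rng=rng)
```

The hub node contributes only `fan_out` neighbors to the batch, while low-degree nodes keep all of theirs, so the cost per batch no longer depends on the largest degree in the graph.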

Understanding Evaluation Metrics

PyAGC provides both label-based and structure-based metrics:

from pyagc.metrics import label_metrics, structure_metrics

# Label-based: Compare with ground truth
label_results = label_metrics(
    y_true, y_pred,
    metrics=['NMI', 'ARI', 'ACC', 'F1', 'Homo', 'Comp']
)

# Structure-based: Evaluate community quality
struct_results = structure_metrics(
    edge_index, y_pred,
    metrics=['Mod', 'Cond']
)

Metrics explanation:

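  • NMI (Normalized Mutual Information): Mutual information between predicted clusters and ground-truth labels, normalized to [0, 1] (higher is better)

  • ARI (Adjusted Rand Index): Pairwise agreement between the predicted clustering and the labels, corrected for chance

  • ACC (Accuracy): Accuracy under the best one-to-one mapping of clusters to labels, found with the Hungarian algorithm

  • F1 (Macro-F1): Per-class harmonic mean of precision and recall, averaged over classes

  • Mod (Modularity): Measures the strength of the division into communities relative to a random graph with the same degrees (higher is better)

  • Cond (Conductance): Fraction of a cluster's edge volume that crosses the cluster boundary (lower is better)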
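
Two of these definitions are easy to check by hand on a toy example. The sketch below computes clustering accuracy by trying every cluster-to-label mapping (a brute-force stand-in for the Hungarian algorithm, practical only for a handful of clusters) and the conductance of a single cluster. The function names are hypothetical and this is not PyAGC's implementation; in particular, how the library aggregates conductance across clusters is not specified here.

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one relabeling of predicted clusters."""
    labels = sorted(set(y_true) | set(y_pred))
    best = 0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        hits = sum(mapping[p] == t for t, p in zip(y_true, y_pred))
        best = max(best, hits)
    return best / len(y_true)

def conductance(edges, cluster):
    """Cut edges divided by the smaller of the two sides' edge volumes."""
    cluster = set(cluster)
    cut = sum((u in cluster) != (v in cluster) for u, v in edges)
    vol_in = sum((u in cluster) + (v in cluster) for u, v in edges)
    vol_out = 2 * len(edges) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0

y_true = [0, 0, 1, 1]
y_pred = [1, 1, 0, 0]          # same partition, cluster ids swapped
acc = clustering_accuracy(y_true, y_pred)

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
cond = conductance(edges, {0, 1, 2})
```

Because cluster ids are arbitrary, the swapped prediction still counts as a perfect match. For the triangle cluster, one of its seven edge endpoints' worth of volume crosses the boundary, which is why a well-separated community has a conductance close to zero.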

Next Steps