pyagc.data
Dataset Loading
- get_dataset(name: str, root: str, return_splits=False)
Loads a graph dataset by name and returns its features, edges, and labels.
This function serves as a unified interface for loading a wide range of benchmark datasets used in graph learning, including both classical citation networks (e.g., Cora, PubMed) and large-scale Open Graph Benchmark (OGB) datasets (e.g., ogbn-arxiv, ogbn-products). It automatically normalizes node features and converts the graph to an undirected version.
Optionally, it can also return predefined train/validation/test node splits for benchmarking purposes.
- Parameters:
name (str) – The name of the dataset to load. Supported options include:
['cora', 'citeseer', 'pubmed', 'corafull', 'photo', 'computers', 'cs', 'physics', 'flickr', 'reddit', 'reddit2', 'ogbn-arxiv', 'arxiv', 'ogbn-mag', 'mag', 'ogbn-products', 'products', 'ogbn-papers100M', 'papers100m', 'hm-categories', 'hm', 'pokec-regions', 'pokec', 'web-topics', 'webtopic'].
root (str) – The root directory where the dataset should be stored.
return_splits (bool, optional) – If set to True, returns node-level split indices (train/valid/test) along with the features and edges. (default: False)
- Returns:
- Depending on return_splits:
If False, returns (x, edge_index, y):
x: Node feature matrix [num_nodes, num_features]
edge_index: Graph connectivity in COO format [2, num_edges]
y: Node label vector [num_nodes]
If True, returns (x, edge_index, y, train_idx, valid_idx, test_idx) with additional index tensors for data splits.
For papers100M with return_splits=True, additionally returns (x, edge_index, y, train_idx, valid_idx, test_idx, labeled_subgraph), where labeled_subgraph contains only edge_index and original_indices for structure metric computation.
- Return type:
(Tuple)
- Raises:
ValueError – If the provided dataset name is not recognized.
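The description above says that get_dataset() normalizes node features and makes the graph undirected. As a rough illustration of what those two steps mean, here is a minimal pure-Python sketch. It assumes row-sum normalization and a plain list of (source, target) pairs; the helper names row_normalize and to_undirected_edges are hypothetical and are not part of pyagc, whose actual preprocessing may differ.

```python
def row_normalize(features):
    """Scale each feature row so that it sums to 1 (all-zero rows are left as-is)."""
    normalized = []
    for row in features:
        total = sum(row)
        normalized.append([v / total for v in row] if total > 0 else list(row))
    return normalized


def to_undirected_edges(edges):
    """Add the reverse of every edge and drop duplicates."""
    symmetric = set()
    for u, v in edges:
        symmetric.add((u, v))
        symmetric.add((v, u))
    return sorted(symmetric)


# Toy input: 3 nodes with 2 features each, and a partially directed edge list.
features = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]]
edges = [(0, 1), (1, 2), (1, 0)]

normalized = row_normalize(features)
undirected = to_undirected_edges(edges)
print(normalized)
print(undirected)
```

After symmetrization, every edge appears once in each direction, which is the convention the [2, num_edges] COO layout uses for undirected graphs.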
- get_tabular_graphland_dataset(name: str, root: str, split: str = 'TH')
Load HM / Pokec / WebTopic from GraphLandTensorFrameDataset.
This loader is intentionally separate from the generic get_dataset() because these datasets store node attributes as TensorFrame instead of dense tensor features.
- Parameters:
- Returns:
- A PyG Data object with:
data.x: torch_frame.TensorFrame
data.edge_index: edge list
data.y: labels
train/val/test masks
tf_col_stats: statistics computed from train nodes only
- Return type:
Data
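The tf_col_stats field is described as being computed from train nodes only. The reason is to keep validation and test values from leaking into any normalization that uses those statistics. A minimal sketch of the idea, assuming a single numerical column and a boolean train mask (the function name train_only_stats and the sample values are hypothetical, not pyagc or torch-frame API):

```python
from statistics import mean, pstdev


def train_only_stats(column, train_mask):
    """Compute mean and population std using only the rows marked as training."""
    train_values = [value for value, is_train in zip(column, train_mask) if is_train]
    return {"mean": mean(train_values), "std": pstdev(train_values)}


# The last node is a held-out outlier; it must not influence the statistics.
prices = [10.0, 20.0, 30.0, 1000.0]
train_mask = [True, True, True, False]

stats = train_only_stats(prices, train_mask)
print(stats)
```

Because the outlier sits outside the training mask, the mean stays at the training average rather than being pulled toward 1000.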
Benchmark Datasets
PyAGC provides a curated collection of 12 benchmark datasets spanning diverse domains, scales, and feature types for comprehensive evaluation of attributed graph clustering algorithms.
Dataset Overview Table
| Scale | Dataset | Domain | #Nodes | #Edges | Avg.Deg | #Feat | Feat.Type | #Clus | \(\mathcal{H}_e\) | \(\mathcal{H}_n\) |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny | Cora | Citation | 2,708 | 10,556 | 3.9 | 1,433 | Textual | 7 | 0.81 | 0.83 |
| Tiny | Photo | Co-purchase | 7,650 | 238,162 | 31.1 | 745 | Textual | 8 | 0.83 | 0.84 |
| Small | Physics | Co-author | 34,493 | 495,924 | 14.4 | 8,415 | Textual | 5 | 0.93 | 0.92 |
| Small | HM | Co-purchase | 46,563 | 21,461,990 | 460.9 | 120 | Tabular | 21 | 0.16 | 0.35 |
| Small | Flickr | Social | 89,250 | 899,756 | 10.1 | 500 | Textual | 7 | 0.32 | 0.32 |
| Medium | ArXiv | Citation | 169,343 | 1,166,243 | 6.9 | 128 | Textual | 40 | 0.65 | 0.64 |
| Medium | Reddit | Social | 232,965 | 23,213,838 | 99.6 | 602 | Textual | 41 | 0.78 | 0.81 |
| Medium | MAG | Citation | 736,389 | 10,792,672 | 14.7 | 128 | Textual | 349 | 0.30 | 0.31 |
| Large | Pokec | Social | 1,632,803 | 44,603,928 | 27.3 | 56 | Tabular | 183 | 0.43 | 0.39 |
| Large | Products | Co-purchase | 2,449,029 | 61,859,140 | 25.4 | 100 | Textual | 47 | 0.81 | 0.82 |
| Large | WebTopic | Web | 2,890,331 | 24,754,822 | 8.6 | 528 | Tabular | 28 | 0.22 | 0.24 |
| Massive | Papers100M | Citation | 111,059,956 | 1,615,685,872 | 14.5 | 128 | Textual | 172 | 0.57 | 0.50 |
Note
\(\mathcal{H}_e\): Edge homophily (proportion of edges connecting same-class nodes)
\(\mathcal{H}_n\): Node homophily (average neighbor label consistency)
Feat.Type: Textual (bag-of-words, embeddings) or Tabular (categorical/numerical metadata)
For Papers100M, labels are available for a subset of ≈1.5M arXiv papers. The reported homophily metrics are calculated on the induced subgraph of these labeled nodes.
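The two homophily measures defined in this note can be computed directly from an edge list and a label vector. The sketch below is a pure-Python illustration under two assumptions: the edge list stores each undirected edge in both directions, and node homophily is averaged over nodes that have at least one neighbor. The function names are illustrative and are not taken from pyagc.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def node_homophily(edges, labels):
    """Average, over nodes with neighbors, of the fraction of same-label neighbors."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, []).append(v)
    ratios = [
        sum(1 for v in nbrs if labels[v] == labels[u]) / len(nbrs)
        for u, nbrs in neighbors.items()
    ]
    return sum(ratios) / len(ratios)


# Path graph 0-1-2-3 with two classes; each edge is listed in both directions.
labels = [0, 0, 1, 1]
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]

h_e = edge_homophily(edges, labels)
h_n = node_homophily(edges, labels)
print(f"edge homophily: {h_e:.3f}, node homophily: {h_n:.3f}")
```

On this toy graph the two values differ (2/3 versus 0.75) because node homophily weights every node equally, while edge homophily weights every edge equally.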
Dataset Details by Scale
Tiny Scale (\(N < 10^4\))
These datasets are suitable for rapid prototyping and sanity checking:
Cora: Classic citation network of machine learning papers with sparse bag-of-words features
Photo: Amazon product co-purchase graph with review-based features
Small Scale (\(10^4 \le N < 10^5\))
Suitable for comprehensive model development and ablation studies:
Physics: Co-authorship network from Microsoft Academic Graph with keyword features
HM: H&M fashion co-purchase network with tabular product metadata (color, weekday statistics)
Flickr: Image-sharing social network with tag-based features
Medium Scale (\(10^5 \le N < 10^6\))
Transitional regime requiring efficient implementations:
ArXiv: Computer Science paper citations with title/abstract embeddings
Reddit: Discussion posts connected by common commenters with GloVe features
MAG: Multi-venue academic citations (349 classes) with abstract embeddings
Large Scale (\(10^6 \le N < 10^8\))
Production-scale graphs requiring mini-batch training:
Pokec: Slovak social network with tabular user profiles (183 regions, heterophilous)
Products: Amazon co-purchase network with product description features
WebTopic: Web graph with tabular website metadata (28 topics, low homophily)
Massive Scale (\(N \ge 10^8\))
Extreme-scale benchmark for testing scalability limits:
Papers100M: 111M paper citation network (172 subjects) — requires neighbor sampling
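The scale tiers above are defined purely by node count, so they can be expressed as a small lookup. This is a sketch only: the function name scale_tier is hypothetical, and the upper boundary is treated as \(N \ge 10^8\) so that every node count falls into exactly one tier. The node counts come from the overview table.

```python
def scale_tier(num_nodes):
    """Map a node count to the benchmark's scale tier."""
    if num_nodes < 10**4:
        return "Tiny"
    if num_nodes < 10**5:
        return "Small"
    if num_nodes < 10**6:
        return "Medium"
    if num_nodes < 10**8:
        return "Large"
    return "Massive"  # assumption: N >= 10^8, so the boundary value is covered


# Node counts taken from the dataset overview table.
node_counts = {
    "Cora": 2_708,
    "HM": 46_563,
    "ArXiv": 169_343,
    "Products": 2_449_029,
    "Papers100M": 111_059_956,
}

tiers = {name: scale_tier(n) for name, n in node_counts.items()}
print(tiers)
```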
Example Usage
Basic Loading
from pyagc.data import get_dataset
# Load dataset (returns unpacked components)
x, edge_index, y = get_dataset('Cora', root='./data')
print(f"Node features shape: {x.shape}") # [num_nodes, num_features]
print(f"Edge index shape: {edge_index.shape}") # [2, num_edges]
print(f"Labels shape: {y.shape}") # [num_nodes]
print(f"Number of classes: {y.max().item() + 1}")
Loading with Train/Val/Test Splits
# Load with predefined splits
x, edge_index, y, train_idx, valid_idx, test_idx = get_dataset(
    'ArXiv',
    root='./data',
    return_splits=True
)
print(f"Training nodes: {train_idx.shape[0]}")
print(f"Validation nodes: {valid_idx.shape[0]}")
print(f"Test nodes: {test_idx.shape[0]}")
Loading Large-Scale Datasets
# Medium/Large datasets work the same way
x, edge_index, y = get_dataset('Products', root='./data')
print(f"Large graph: {x.shape[0]:,} nodes, {edge_index.shape[1]:,} edges")
Loading Massive-Scale Datasets (Papers100M)
# Papers100M requires special handling due to its size
# First-time loading will preprocess and cache the undirected graph
x, edge_index, y, train_idx, valid_idx, test_idx, labeled_subgraph = get_dataset(
    'Papers100M',
    root='./data',
    return_splits=True
)
# labeled_subgraph contains structure for computing structural metrics
# on the labeled subset (≈1.5M nodes)
print(f"Full graph: {x.shape[0]:,} nodes")
print(f"Labeled subgraph: {len(labeled_subgraph['original_indices']):,} nodes")
print(f"Subgraph edges: {labeled_subgraph['edge_index'].shape[1]:,}")
Warning
Papers100M Preprocessing Requirements:
First-time loading requires ~400 GB RAM for preprocessing
Preprocessed data is cached to disk for future use
Returns additional labeled_subgraph dict when return_splits=True
The labeled subgraph contains only structure (no features) for efficient structural metric computation
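To make the role of labeled_subgraph more concrete, the sketch below shows one way an induced subgraph with edge_index and original_indices fields could be built from a set of labeled nodes. It is a pure-Python illustration with hypothetical names and toy data; it does not reproduce pyagc's preprocessing, which operates on tensors at a much larger scale.

```python
def induced_subgraph(edges, keep_nodes):
    """Keep only edges between retained nodes and relabel them to 0..k-1."""
    original_indices = sorted(keep_nodes)
    new_id = {old: new for new, old in enumerate(original_indices)}
    sub_edges = [
        (new_id[u], new_id[v])
        for u, v in edges
        if u in new_id and v in new_id
    ]
    return {"edge_index": sub_edges, "original_indices": original_indices}


# Toy graph with 5 nodes, of which only nodes 1, 3 and 4 carry labels.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3)]
labeled_nodes = {1, 3, 4}

sub = induced_subgraph(edges, labeled_nodes)
print(sub["edge_index"])
print(sub["original_indices"])
```

The original_indices list maps each subgraph node back to its position in the full graph, which is what allows labels or predictions for the full graph to be aligned with the subgraph.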
Creating PyG Data Object
If you need a PyG Data object for compatibility:
from torch_geometric.data import Data
x, edge_index, y = get_dataset('Cora', root='./data')
# Wrap in PyG Data object
data = Data(x=x, edge_index=edge_index, y=y)
print(f"Nodes: {data.num_nodes}, Edges: {data.num_edges}")
print(f"Features: {data.num_features}, Classes: {data.y.max().item() + 1}")
Working with Tabular Features
GraphLand datasets (HM, Pokec, WebTopic) provide tabular node features.
Two usage patterns are supported:
Using Dense Tensor Features
# Load tabular feature dataset
x, edge_index, y = get_dataset('HM', root='./data')
print(f"Tabular features: {x.shape}") # Mixed categorical/numerical
print(f"Number of product categories: {y.max().item() + 1}")
# These features may require preprocessing (normalization, encoding)
# depending on your clustering algorithm
Using TensorFrame Features
from pyagc.data import get_tabular_graphland_dataset
data = get_tabular_graphland_dataset('HM', root='./data')
x = data.x # torch_frame.TensorFrame
print(x)
print(data.tf_col_stats) # column-wise statistics
Advantages of TensorFrame:
Preserves feature semantics (categorical vs numerical)
Avoids lossy preprocessing
Enables learnable feature encoders
Better suited for heterogeneous tabular graphs
Tip
When using TensorFrame inputs, pair with encoders from pyagc.encoders.tabencoder or custom torch-frame models.
Loading Other PyG Datasets
For datasets not included in the benchmark suite, you can directly use PyTorch Geometric:
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid, Amazon, Coauthor
from torch_geometric.utils import to_undirected
# Example 1: Load PubMed from Planetoid
dataset = Planetoid(root='./data', name='PubMed', transform=T.NormalizeFeatures())
data = dataset[0]
data.edge_index = to_undirected(data.edge_index) # Convert to undirected
x = data.x
edge_index = data.edge_index
y = data.y.squeeze()
print(f"PubMed: {x.shape[0]} nodes, {edge_index.shape[1]} edges")
# Example 2: Load Computers from Amazon
dataset = Amazon(root='./data', name='Computers', transform=T.NormalizeFeatures())
data = dataset[0]
data.edge_index = to_undirected(data.edge_index)
# Example 3: Load other Coauthor datasets
dataset = Coauthor(root='./data', name='CS', transform=T.NormalizeFeatures())
data = dataset[0]
# Example 4: Load any OGB dataset
from ogb.nodeproppred import PygNodePropPredDataset
dataset = PygNodePropPredDataset(root='./data', name='ogbn-proteins')
data = dataset[0]
data.edge_index = to_undirected(data.edge_index)
# Get splits for OGB datasets
split_idx = dataset.get_idx_split()
train_idx = split_idx['train']
valid_idx = split_idx['valid']
test_idx = split_idx['test']
Tip
Working with Custom PyG Datasets:
When using PyG datasets directly, remember to:
Apply to_undirected() for clustering tasks (most AGC methods assume undirected graphs)
Use the T.NormalizeFeatures() transform for consistent feature scaling
Convert labels to a 1D tensor: y = data.y.squeeze()
Check if the dataset provides train/val/test masks or splits
Loading Custom Datasets
You can also load your own graph data:
import torch
from torch_geometric.data import Data
from torch_geometric.utils import to_undirected
# Create graph from edge list and features
edge_index = torch.tensor([[0, 1, 1, 2],
[1, 0, 2, 1]], dtype=torch.long)
x = torch.randn(3, 16) # 3 nodes, 16 features
y = torch.tensor([0, 1, 1]) # Ground truth labels
# Ensure undirected
edge_index = to_undirected(edge_index)
data = Data(x=x, edge_index=edge_index, y=y)
# Load from numpy arrays
import numpy as np
import torch
from scipy.sparse import coo_matrix
# Load adjacency matrix (scipy sparse format)
adj_matrix = coo_matrix(...) # Your adjacency matrix
edge_index = torch.tensor(
np.vstack([adj_matrix.row, adj_matrix.col]),
dtype=torch.long
)
# Load features and labels
features = np.load('features.npy')
labels = np.load('labels.npy')
x = torch.tensor(features, dtype=torch.float)
y = torch.tensor(labels, dtype=torch.long)
# Normalize features
from torch_geometric.transforms import NormalizeFeatures
data = Data(x=x, edge_index=edge_index, y=y)
transform = NormalizeFeatures()
data = transform(data)
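If the adjacency matrix is available only as a dense array, the COO layout expected for edge_index can be derived by listing the coordinates of the nonzero entries. A minimal pure-Python sketch follows; the helper name adjacency_to_coo is hypothetical, and for real data scipy or torch sparse utilities would be the more practical choice.

```python
def adjacency_to_coo(adj):
    """Return [rows, cols] listing the coordinates of all nonzero entries."""
    rows, cols = [], []
    for i, row in enumerate(adj):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
    return [rows, cols]


# Symmetric adjacency matrix of the path graph 0-1-2.
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

edge_index = adjacency_to_coo(adj)
print(edge_index)
```

The result matches the hand-written edge_index in the first custom example: two rows, one column per directed edge.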
Dataset Name Aliases
The following aliases are supported for convenience:
# Case-insensitive loading
get_dataset('cora', root='./data') # ✓
get_dataset('Cora', root='./data') # ✓
get_dataset('CORA', root='./data') # ✓
# OGB dataset aliases
get_dataset('arxiv', root='./data') # Short form
get_dataset('ogbn-arxiv', root='./data') # Full OGB name
get_dataset('mag', root='./data') # Short form
get_dataset('ogbn-mag', root='./data') # Full OGB name
get_dataset('products', root='./data') # Short form
get_dataset('ogbn-products', root='./data') # Full OGB name
get_dataset('papers100m', root='./data') # Short form
get_dataset('ogbn-papers100M', root='./data') # Full OGB name
# GraphLand aliases
get_dataset('hm', root='./data') # Short form
get_dataset('hm-categories', root='./data') # Full name
get_dataset('pokec', root='./data') # Short form
get_dataset('pokec-regions', root='./data') # Full name
get_dataset('webtopic', root='./data') # Short form
get_dataset('web-topics', root='./data') # Full name
# Reddit aliases
get_dataset('reddit', root='./data') # Either works
get_dataset('reddit2', root='./data') # Same dataset
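The alias behavior listed above (case-insensitive names, short forms mapping to full names, and a ValueError for unknown names) can be pictured as a small lookup step. The sketch below is hypothetical: the table contents are taken from the aliases documented here, but the function resolve_name and its internal structure are assumptions and not pyagc's actual implementation.

```python
# Short form -> full name, as listed in the alias examples.
ALIASES = {
    "arxiv": "ogbn-arxiv",
    "mag": "ogbn-mag",
    "products": "ogbn-products",
    "papers100m": "ogbn-papers100m",
    "hm": "hm-categories",
    "pokec": "pokec-regions",
    "webtopic": "web-topics",
}

# A subset of names without aliases, plus every full name above.
KNOWN = {"cora", "photo", "physics", "flickr", "reddit"} | set(ALIASES.values())


def resolve_name(name):
    """Lower-case the name, expand short forms, and reject unknown datasets."""
    key = name.lower()
    key = ALIASES.get(key, key)
    if key not in KNOWN:
        raise ValueError(f"Unknown dataset name: {name!r}")
    return key


print(resolve_name("CORA"))
print(resolve_name("ArXiv"))
print(resolve_name("ogbn-papers100M"))
```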
GraphLand Industrial Datasets
PyAGC includes the GraphLand benchmark datasets (HM, Pokec, WebTopic) featuring:
Tabular node features (categorical + numerical)
Heterophilous structures (low homophily)
Industrial-scale complexity (millions of nodes)
Two loading paradigms are provided:
Standard Tensor Features
Using pyagc.data.get_dataset():
Node features are returned as dense tensors (\(\mathbf{X} \in \mathbb{R}^{N \times F}\))
Tabular features are preprocessed (encoded / normalized) beforehand
Compatible with traditional GNN pipelines
x, edge_index, y = get_dataset('HM', root='./data')
TensorFrame-based Features
Using pyagc.data.get_tabular_graphland_dataset():
Node features are stored as torch_frame.TensorFrame
Semantic types are preserved (categorical, numerical, etc.)
Feature preprocessing is deferred to model-side encoders
Enables advanced tabular learning (e.g., feature-wise embeddings)
from pyagc.data import get_tabular_graphland_dataset
data = get_tabular_graphland_dataset('HM', root='./data')
print(type(data.x)) # torch_frame.TensorFrame
print(data.edge_index.shape)
print(data.y.shape)
# TensorFrame statistics (computed on training nodes)
print(data.tf_col_stats)
Note
The TensorFrame-based pipeline is designed for integration with torch-frame encoders, allowing:
automatic handling of heterogeneous feature types
missing value processing
feature-wise embedding learning
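As a conceptual illustration of what "preserving semantic types" buys, the sketch below keeps categorical and numerical columns in separate groups, so that each group can later be handled by its own encoder. It uses plain Python containers rather than torch-frame; the column names, values, and helper functions are all hypothetical.

```python
def split_by_semantic_type(rows, column_types):
    """Group column values by their declared semantic type."""
    grouped = {"categorical": {}, "numerical": {}}
    for name, kind in column_types.items():
        grouped[kind][name] = [row[name] for row in rows]
    return grouped


def encode_categorical(values):
    """Map each distinct category to an integer id (input to an embedding table)."""
    vocab = {value: index for index, value in enumerate(sorted(set(values)))}
    return [vocab[value] for value in values], vocab


# Hypothetical product table with one categorical and one numerical column.
rows = [
    {"color": "red", "price": 10.0},
    {"color": "blue", "price": 12.5},
    {"color": "red", "price": 8.0},
]
column_types = {"color": "categorical", "price": "numerical"}

grouped = split_by_semantic_type(rows, column_types)
color_ids, color_vocab = encode_categorical(grouped["categorical"]["color"])
print(grouped["numerical"]["price"])
print(color_ids, color_vocab)
```

Keeping the integer ids separate from the numerical values avoids treating category codes as magnitudes, which is the kind of lossy preprocessing a dense feature matrix can introduce.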
Dataset Class
- class GraphLandTensorFrameDataset(root: str, name: str, split: str, to_undirected: bool = False, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, force_reload: bool = False)
Bases: InMemoryDataset
GraphLand dataset rewritten to store node attributes in TensorFrame.
Differences from the original implementation:
Graph structure is stored in Data.edge_index.
Node attributes are stored in Data.x (a torch_frame.TensorFrame).
Masks and targets are still stored in Data.
Notes:
The original sklearn-based feature preprocessing is intentionally removed. In a torch-frame workflow, semantic types are preserved and feature encoding/normalization/imputation is usually handled by the model-side encoders.
- GRAPHLAND_DATASETS = {'artnet-exp': 'binary_classification', 'artnet-views': 'regression', 'avazu-ctr': 'regression', 'city-reviews': 'binary_classification', 'city-roads-L': 'regression', 'city-roads-M': 'regression', 'hm-categories': 'multiclass_classification', 'hm-prices': 'regression', 'pokec-regions': 'multiclass_classification', 'tolokers-2': 'binary_classification', 'twitch-views': 'regression', 'web-fraud': 'binary_classification', 'web-topics': 'multiclass_classification', 'web-traffic': 'regression'}
- property raw_file_names: str
The name of the files in the self.raw_dir folder that must be present in order to skip downloading.
- Return type:
str
See Also
pyagc.models - Clustering models compatible with these datasets
pyagc.metrics - Evaluation metrics for clustering quality