pyagc.data

Dataset Loading 

get_dataset(name: str, root: str, return_splits=False)

Loads a graph dataset by name and returns its features, edges, and labels.

This function serves as a unified interface for loading a wide range of benchmark datasets used in graph learning, including both classical citation networks (e.g., Cora, PubMed) and large-scale Open Graph Benchmark (OGB) datasets (e.g., ogbn-arxiv, ogbn-products). It automatically normalizes node features and converts the graph to an undirected version.

Optionally, it can also return predefined train/validation/test node splits for benchmarking purposes.

Parameters:

name (str) – The name of the dataset to load. Supported options include: ['cora', 'citeseer', 'pubmed', 'corafull', 'photo', 'computers', 'cs', 'physics', 'flickr', 'reddit', 'reddit2', 'ogbn-arxiv', 'arxiv', 'ogbn-mag', 'mag', 'ogbn-products', 'products', 'ogbn-papers100M', 'papers100m', 'hm-categories', 'hm', 'pokec-regions', 'pokec', 'web-topics', 'webtopic'].
root (str) – The root directory where the dataset should be stored.
return_splits (bool, optional) – If set to True, returns node-level split indices (train/valid/test) along with the features and edges. (default: False)

Returns:

Depending on return_splits:

If False, returns (x, edge_index, y):
- x: Node feature matrix [num_nodes, num_features]
- edge_index: Graph connectivity in COO format [2, num_edges]
- y: Node label vector [num_nodes]
If True, returns (x, edge_index, y, train_idx, valid_idx, test_idx) with additional index tensors for data splits.
For papers100M with return_splits=True, additionally returns: (x, edge_index, y, train_idx, valid_idx, test_idx, labeled_subgraph) where labeled_subgraph contains only edge_index and original_indices for structure metric computation.

Return type:

(Tuple)

Raises:

ValueError – If the provided dataset name is not recognized.
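
Because the returned tuple has 3, 6, or 7 elements depending on the dataset and on return_splits, downstream code may prefer to address fields by name. The following is a minimal sketch of that idea; unpack_dataset is a hypothetical helper and not part of pyagc, and the field names simply follow the order documented above.

```python
def unpack_dataset(result):
    """Map a 3-, 6-, or 7-element result tuple to named fields.

    Hypothetical helper: the field order follows the documented return values.
    """
    names = ["x", "edge_index", "y",
             "train_idx", "valid_idx", "test_idx",
             "labeled_subgraph"]
    if len(result) not in (3, 6, 7):
        raise ValueError(f"unexpected tuple length: {len(result)}")
    # zip stops at the shorter sequence, so only the present fields are named
    return dict(zip(names, result))


# Placeholder strings stand in for the real tensors
fields = unpack_dataset(("features", "edges", "labels"))
print(sorted(fields))
```

With a real call, the argument would be the value returned by get_dataset(...).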

get_tabular_graphland_dataset(name: str, root: str, split: str = 'TH')

Load HM / Pokec / WebTopic from GraphLandTensorFrameDataset.

This loader is intentionally separate from the generic get_dataset() because these datasets store node attributes as TensorFrame instead of dense tensor features.

Parameters:

name (str) – Dataset alias. Supported: ['HM', 'Pokec', 'WebTopic']
root (str) – Dataset root directory.
split (str) – GraphLand split. Defaults to 'TH'.

Returns:

A PyG Data object with:

data.x: torch_frame.TensorFrame
data.edge_index: edge list
data.y: labels
train/val/test masks
tf_col_stats: statistics computed from train nodes only

Return type:

Data
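
Computing column statistics from training nodes only keeps information about validation and test nodes out of any later normalization. The sketch below illustrates the idea in plain Python; the function name and the choice of mean and standard deviation are assumptions, and the actual contents of tf_col_stats may differ.

```python
from statistics import mean, pstdev


def train_only_stats(column, train_mask):
    """Summarize a numerical column using only the entries flagged as training nodes."""
    train_values = [v for v, is_train in zip(column, train_mask) if is_train]
    return {"mean": mean(train_values), "std": pstdev(train_values)}


# The last node is a held-out outlier and must not influence the statistics
prices = [10.0, 20.0, 30.0, 1000.0]
train_mask = [True, True, True, False]

stats = train_only_stats(prices, train_mask)
print(stats)
```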

Benchmark Datasets 

PyAGC provides a curated collection of 12 benchmark datasets spanning diverse domains, scales, and feature types for comprehensive evaluation of attributed graph clustering algorithms.

Dataset Overview Table 

Scale	Dataset	Domain	#Nodes	#Edges	Avg.Deg	#Feat	Feat.Type	#Clus	\(\mathcal{H}_e\)	\(\mathcal{H}_n\)
Tiny	Cora	Citation	2,708	10,556	3.9	1,433	Textual	7	0.81	0.83
Tiny	Photo	Co-purchase	7,650	238,162	31.1	745	Textual	8	0.83	0.84
Small	Physics	Co-author	34,493	495,924	14.4	8,415	Textual	5	0.93	0.92
Small	HM	Co-purchase	46,563	21,461,990	460.9	120	Tabular	21	0.16	0.35
Small	Flickr	Social	89,250	899,756	10.1	500	Textual	7	0.32	0.32
Medium	ArXiv	Citation	169,343	1,166,243	6.9	128	Textual	40	0.65	0.64
Medium	Reddit	Social	232,965	23,213,838	99.6	602	Textual	41	0.78	0.81
Medium	MAG	Citation	736,389	10,792,672	14.7	128	Textual	349	0.30	0.31
Large	Pokec	Social	1,632,803	44,603,928	27.3	56	Tabular	183	0.43	0.39
Large	Products	Co-purchase	2,449,029	61,859,140	25.4	100	Textual	47	0.81	0.82
Large	WebTopic	Web	2,890,331	24,754,822	8.6	528	Tabular	28	0.22	0.24
Massive	Papers100M	Citation	111,059,956	1,615,685,872	14.5	128	Textual	172	0.57	0.50

Note

\(\mathcal{H}_e\): Edge homophily (proportion of edges connecting same-class nodes)
\(\mathcal{H}_n\): Node homophily (average neighbor label consistency)
Feat.Type: Textual (bag-of-words, embeddings) or Tabular (categorical/numerical metadata)
For Papers100M, labels are available for a subset of ≈1.5M arXiv papers. The reported homophily metrics are calculated on the induced subgraph of these labeled nodes.
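
The two homophily measures defined in this note can be sketched in plain Python as follows. This is an illustration of the definitions only, not the code used to produce the table; it assumes edge_index is in COO format with both directions of every undirected edge present, and it skips isolated nodes when averaging node homophily.

```python
def edge_homophily(edge_index, labels):
    """Fraction of edges whose two endpoints share a label."""
    sources, targets = edge_index
    same = sum(labels[s] == labels[t] for s, t in zip(sources, targets))
    return same / len(sources)


def node_homophily(edge_index, labels):
    """Average, over nodes with at least one neighbor, of the share of same-label neighbors."""
    sources, targets = edge_index
    neighbors = {}
    for s, t in zip(sources, targets):
        neighbors.setdefault(s, []).append(t)
    ratios = [
        sum(labels[n] == labels[v] for n in nbrs) / len(nbrs)
        for v, nbrs in neighbors.items()
    ]
    return sum(ratios) / len(ratios)


# Path graph 0-1-2-3 stored with both edge directions; two classes
edge_index = [[0, 1, 1, 2, 2, 3],
              [1, 0, 2, 1, 3, 2]]
labels = [0, 0, 1, 1]

h_e = edge_homophily(edge_index, labels)
h_n = node_homophily(edge_index, labels)
print(f"edge homophily: {h_e:.2f}, node homophily: {h_n:.2f}")
```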

Dataset Details by Scale 

Tiny Scale (\(N < 10^4\))

These datasets are suitable for rapid prototyping and sanity checking:

Cora: Classic citation network of machine learning papers with sparse bag-of-words features
Photo: Amazon product co-purchase graph with review-based features

Small Scale (\(10^4 \le N < 10^5\))

Suitable for comprehensive model development and ablation studies:

Physics: Co-authorship network from Microsoft Academic Graph with keyword features
HM: H&M fashion co-purchase network with tabular product metadata (color, weekday statistics)
Flickr: Image-sharing social network with tag-based features

Medium Scale (\(10^5 \le N < 10^6\))

Transitional regime requiring efficient implementations:

ArXiv: Computer Science paper citations with title/abstract embeddings
Reddit: Discussion posts connected by common commenters with GloVe features
MAG: Multi-venue academic citations (349 classes) with abstract embeddings

Large Scale (\(10^6 \le N < 10^8\))

Production-scale graphs requiring mini-batch training:

Pokec: Slovak social network with tabular user profiles (183 regions, heterophilous)
Products: Amazon co-purchase network with product description features
WebTopic: Web graph with tabular website metadata (28 topics, low homophily)

Massive Scale (\(N \ge 10^8\))

Extreme-scale benchmark for testing scalability limits:

Papers100M: 111M paper citation network (172 subjects) — requires neighbor sampling
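
The scale tiers above depend only on the node count, so they can be expressed as a small lookup. The function below is a hypothetical helper for illustration, not part of pyagc; it treats a graph with exactly 10^8 nodes as Massive, which is an assumption.

```python
def scale_of(num_nodes):
    """Return the scale tier for a graph with the given number of nodes."""
    if num_nodes < 10**4:
        return "Tiny"
    if num_nodes < 10**5:
        return "Small"
    if num_nodes < 10**6:
        return "Medium"
    if num_nodes < 10**8:
        return "Large"
    return "Massive"  # assumption: the boundary value 10**8 counts as Massive


# Node counts taken from the overview table
for name, n in [("Cora", 2_708), ("HM", 46_563), ("ArXiv", 169_343),
                ("Pokec", 1_632_803), ("Papers100M", 111_059_956)]:
    print(f"{name}: {scale_of(n)}")
```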

Example Usage 

Basic Loading 

from pyagc.data import get_dataset

# Load dataset (returns unpacked components)
x, edge_index, y = get_dataset('Cora', root='./data')

print(f"Node features shape: {x.shape}")      # [num_nodes, num_features]
print(f"Edge index shape: {edge_index.shape}") # [2, num_edges]
print(f"Labels shape: {y.shape}")              # [num_nodes]
print(f"Number of classes: {y.max().item() + 1}")

Loading with Train/Val/Test Splits 

# Load with predefined splits
x, edge_index, y, train_idx, valid_idx, test_idx = get_dataset(
    'ArXiv',
    root='./data',
    return_splits=True
)

print(f"Training nodes: {train_idx.shape[0]}")
print(f"Validation nodes: {valid_idx.shape[0]}")
print(f"Test nodes: {test_idx.shape[0]}")
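
Before relying on the returned splits, it can be worth checking that the index sets are in range and mutually disjoint. The following is a minimal sketch of such a check on plain Python lists; check_splits is a hypothetical helper, and with tensors one would typically pass train_idx.tolist() and so on.

```python
from itertools import combinations


def check_splits(num_nodes, train_idx, valid_idx, test_idx):
    """Validate split indices and return the fraction of nodes they cover."""
    splits = {"train": set(train_idx), "valid": set(valid_idx), "test": set(test_idx)}
    for name, idx in splits.items():
        if any(i < 0 or i >= num_nodes for i in idx):
            raise ValueError(f"{name} split contains an out-of-range node index")
    for a, b in combinations(splits, 2):
        if splits[a] & splits[b]:
            raise ValueError(f"{a} and {b} splits overlap")
    covered = set().union(*splits.values())
    return len(covered) / num_nodes


# Toy example: 10 nodes, 8 of them assigned to a split
coverage = check_splits(10, [0, 1, 2, 3], [4, 5], [6, 7])
print(f"Split coverage: {coverage:.0%}")
```

A coverage below 100% is not an error in itself; as noted for Papers100M, some datasets label only a subset of nodes.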

Loading Large-Scale Datasets 

# Medium/Large datasets work the same way
x, edge_index, y = get_dataset('Products', root='./data')
print(f"Large graph: {x.shape[0]:,} nodes, {edge_index.shape[1]:,} edges")

Loading Massive-Scale Datasets (Papers100M)

# Papers100M requires special handling due to its size
# First-time loading will preprocess and cache the undirected graph
x, edge_index, y, train_idx, valid_idx, test_idx, labeled_subgraph = get_dataset(
    'Papers100M',
    root='./data',
    return_splits=True
)

# labeled_subgraph contains structure for computing structural metrics
# on the labeled subset (≈1.5M nodes)
print(f"Full graph: {x.shape[0]:,} nodes")
print(f"Labeled subgraph: {labeled_subgraph['original_indices'].shape[0]:,} nodes")
print(f"Subgraph edges: {labeled_subgraph['edge_index'].shape[1]:,}")

Warning

Papers100M Preprocessing Requirements:

First-time loading requires ~400 GB RAM for preprocessing
Preprocessed data is cached to disk for future use
Returns additional labeled_subgraph dict when return_splits=True
The labeled subgraph contains only structure (no features) for efficient structural metric computation
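
To make the role of the labeled subgraph concrete, the sketch below builds an induced subgraph on a set of labeled nodes and relabels them to a contiguous range. It is an illustration in plain Python, not PyAGC's preprocessing code; the dictionary keys mirror the documented edge_index and original_indices fields, and the function name is hypothetical.

```python
def labeled_induced_subgraph(edge_index, labeled_nodes):
    """Keep only edges whose endpoints are both labeled, with nodes renumbered from 0."""
    original_indices = sorted(set(labeled_nodes))
    new_id = {old: new for new, old in enumerate(original_indices)}
    sources, targets = edge_index
    kept = [
        (new_id[s], new_id[t])
        for s, t in zip(sources, targets)
        if s in new_id and t in new_id
    ]
    return {
        "edge_index": [[s for s, _ in kept], [t for _, t in kept]],
        "original_indices": original_indices,  # maps new node id -> id in the full graph
    }


# Full graph: 0-1, 1-2, 3-4 (both directions); only nodes 1, 3, 4 carry labels
edge_index = [[0, 1, 1, 2, 3, 4],
              [1, 0, 2, 1, 4, 3]]
sub = labeled_induced_subgraph(edge_index, labeled_nodes=[1, 3, 4])
print(sub)
```

Structural metrics computed on the subgraph can be mapped back to the full graph through original_indices.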

Creating PyG Data Object 

If you need a PyG Data object for compatibility:

from pyagc.data import get_dataset
from torch_geometric.data import Data

x, edge_index, y = get_dataset('Cora', root='./data')

# Wrap in PyG Data object
data = Data(x=x, edge_index=edge_index, y=y)
print(f"Nodes: {data.num_nodes}, Edges: {data.num_edges}")
print(f"Features: {data.num_features}, Classes: {data.y.max().item() + 1}")

Working with Tabular Features 

GraphLand datasets (HM, Pokec, WebTopic) provide tabular node features.

Two usage patterns are supported:

Using Dense Tensor Features 

from pyagc.data import get_dataset

# Load tabular feature dataset
x, edge_index, y = get_dataset('HM', root='./data')

print(f"Tabular features: {x.shape}")  # Mixed categorical/numerical
print(f"Number of product categories: {y.max().item() + 1}")

# These features may require preprocessing (normalization, encoding)
# depending on your clustering algorithm

Using TensorFrame Features 

from pyagc.data import get_tabular_graphland_dataset

data = get_tabular_graphland_dataset('HM', root='./data')

x = data.x  # torch_frame.TensorFrame

print(x)
print(data.tf_col_stats)  # column-wise statistics

Advantages of TensorFrame:

Preserves feature semantics (categorical vs numerical)
Avoids lossy preprocessing
Enables learnable feature encoders
Better suited for heterogeneous tabular graphs

Tip

When using TensorFrame inputs, pair with encoders from pyagc.encoders.tabencoder or custom torch-frame models.

Loading Other PyG Datasets 

For datasets not included in the benchmark suite, you can directly use PyTorch Geometric:

import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid, Amazon, Coauthor
from torch_geometric.utils import to_undirected

# Example 1: Load PubMed from Planetoid
dataset = Planetoid(root='./data', name='PubMed', transform=T.NormalizeFeatures())
data = dataset[0]
data.edge_index = to_undirected(data.edge_index)  # Convert to undirected

x = data.x
edge_index = data.edge_index
y = data.y.squeeze()

print(f"PubMed: {x.shape[0]} nodes, {edge_index.shape[1]} edges")

# Example 2: Load Computers from Amazon
dataset = Amazon(root='./data', name='Computers', transform=T.NormalizeFeatures())
data = dataset[0]
data.edge_index = to_undirected(data.edge_index)

# Example 3: Load other Coauthor datasets
dataset = Coauthor(root='./data', name='CS', transform=T.NormalizeFeatures())
data = dataset[0]

# Example 4: Load any OGB dataset
from ogb.nodeproppred import PygNodePropPredDataset

dataset = PygNodePropPredDataset(root='./data', name='ogbn-proteins')
data = dataset[0]
data.edge_index = to_undirected(data.edge_index)

# Get splits for OGB datasets
split_idx = dataset.get_idx_split()
train_idx = split_idx['train']
valid_idx = split_idx['valid']
test_idx = split_idx['test']

Tip

Working with Custom PyG Datasets:

When using PyG datasets directly, remember to:

Apply to_undirected() for clustering tasks (most AGC methods assume undirected graphs)
Use T.NormalizeFeatures() transform for consistent feature scaling
Convert labels to 1D tensor: y = data.y.squeeze()
Check if the dataset provides train/val/test masks or splits
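
The first two reminders can be illustrated without PyG. The sketch below is a simplified plain-Python version of symmetrizing a COO edge list and row-normalizing features; it assumes non-negative features and is not the exact behavior of to_undirected() or T.NormalizeFeatures(), which handle further details.

```python
def to_undirected_coo(edge_index):
    """Add the reverse of every edge and drop duplicates (COO: [sources, targets])."""
    sources, targets = edge_index
    pairs = set(zip(sources, targets)) | set(zip(targets, sources))
    ordered = sorted(pairs)
    return [[s for s, _ in ordered], [t for _, t in ordered]]


def row_normalize(features):
    """Scale each row to sum to 1; all-zero rows are left unchanged."""
    normalized = []
    for row in features:
        total = sum(row)
        normalized.append([v / total for v in row] if total else list(row))
    return normalized


edge_index = [[0, 1], [1, 2]]  # directed edges 0->1 and 1->2
features = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]]

undirected = to_undirected_coo(edge_index)
norm_features = row_normalize(features)
print(undirected)
print(norm_features)
```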

Loading Custom Datasets 

You can also load your own graph data:

import torch
from torch_geometric.data import Data
from torch_geometric.utils import to_undirected

# Create graph from edge list and features
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.randn(3, 16)  # 3 nodes, 16 features
y = torch.tensor([0, 1, 1])  # Ground truth labels

# Ensure undirected
edge_index = to_undirected(edge_index)

data = Data(x=x, edge_index=edge_index, y=y)

# Load from numpy arrays
import numpy as np
import torch
from scipy.sparse import coo_matrix

# Load adjacency matrix (scipy sparse format)
adj_matrix = coo_matrix(...)  # Your adjacency matrix
edge_index = torch.tensor(
    np.vstack([adj_matrix.row, adj_matrix.col]),
    dtype=torch.long
)

# Load features and labels
features = np.load('features.npy')
labels = np.load('labels.npy')

x = torch.tensor(features, dtype=torch.float)
y = torch.tensor(labels, dtype=torch.long)

# Normalize features
from torch_geometric.transforms import NormalizeFeatures
data = Data(x=x, edge_index=edge_index, y=y)
transform = NormalizeFeatures()
data = transform(data)
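
For readers unfamiliar with the COO layout used by edge_index, the following plain-Python sketch converts a small dense adjacency matrix into the same two-row form. It is illustrative only; dense_to_coo is a hypothetical helper, and for real data the scipy-based conversion above is the practical route.

```python
def dense_to_coo(adjacency):
    """List the (row, column) positions of nonzero entries as [rows, cols]."""
    rows, cols = [], []
    for i, row in enumerate(adjacency):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
    return [rows, cols]


# Symmetric adjacency matrix of the path graph 0-1-2
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]

coo = dense_to_coo(adjacency)
print(coo)
```

The result matches the edge_index tensor written out by hand in the first custom-dataset example.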

Dataset Name Aliases 

The following aliases are supported for convenience:

# Case-insensitive loading
get_dataset('cora', root='./data')      # ✓
get_dataset('Cora', root='./data')      # ✓
get_dataset('CORA', root='./data')      # ✓

# OGB dataset aliases
get_dataset('arxiv', root='./data')     # Short form
get_dataset('ogbn-arxiv', root='./data')  # Full OGB name

get_dataset('mag', root='./data')        # Short form
get_dataset('ogbn-mag', root='./data')   # Full OGB name

get_dataset('products', root='./data')        # Short form
get_dataset('ogbn-products', root='./data')   # Full OGB name

get_dataset('papers100m', root='./data')       # Short form
get_dataset('ogbn-papers100M', root='./data')  # Full OGB name

# GraphLand aliases
get_dataset('hm', root='./data')           # Short form
get_dataset('hm-categories', root='./data')  # Full name

get_dataset('pokec', root='./data')         # Short form
get_dataset('pokec-regions', root='./data')  # Full name

get_dataset('webtopic', root='./data')      # Short form
get_dataset('web-topics', root='./data')    # Full name

# Reddit aliases
get_dataset('reddit', root='./data')   # Either works
get_dataset('reddit2', root='./data')  # Same dataset
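
One common way to implement this kind of case-insensitive alias handling is to lowercase the name and look it up in an alias table. The sketch below shows that pattern; it is not PyAGC's actual implementation, the tables are assumptions built from the names listed in this page, and the Reddit names are kept as separate entries here.

```python
# Assumed alias table: short form -> full (lowercased) name
ALIASES = {
    "arxiv": "ogbn-arxiv",
    "mag": "ogbn-mag",
    "products": "ogbn-products",
    "papers100m": "ogbn-papers100m",
    "hm": "hm-categories",
    "pokec": "pokec-regions",
    "webtopic": "web-topics",
}

# Assumed set of canonical names, all lowercase
CANONICAL = {
    "cora", "citeseer", "pubmed", "corafull", "photo", "computers",
    "cs", "physics", "flickr", "reddit", "reddit2",
} | set(ALIASES.values())


def resolve_name(name):
    """Return the canonical lowercase dataset name or raise ValueError."""
    key = name.lower()
    key = ALIASES.get(key, key)
    if key not in CANONICAL:
        raise ValueError(f"Unknown dataset: {name!r}")
    return key


print(resolve_name("ArXiv"))
print(resolve_name("ogbn-papers100M"))
```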

GraphLand Industrial Datasets 

PyAGC includes the GraphLand benchmark datasets (HM, Pokec, WebTopic) featuring:

Tabular node features (categorical + numerical)
Heterophilous structures (low homophily)
Industrial-scale complexity (millions of nodes)

Two loading paradigms are provided:

Standard Tensor Features 

Using pyagc.data.get_dataset():

Node features are returned as dense tensors (\(\mathbf{X} \in \mathbb{R}^{N \times F}\))
Tabular features are preprocessed (encoded / normalized) beforehand
Compatible with traditional GNN pipelines

from pyagc.data import get_dataset

x, edge_index, y = get_dataset('HM', root='./data')

TensorFrame-based Features 

Using pyagc.data.get_tabular_graphland_dataset():

Node features are stored as torch_frame.TensorFrame
Semantic types are preserved (categorical, numerical, etc.)
Feature preprocessing is deferred to model-side encoders
Enables advanced tabular learning (e.g., feature-wise embeddings)

from pyagc.data import get_tabular_graphland_dataset

data = get_tabular_graphland_dataset('HM', root='./data')

print(type(data.x))  # torch_frame.TensorFrame
print(data.edge_index.shape)
print(data.y.shape)

# TensorFrame statistics (computed on training nodes)
print(data.tf_col_stats)

Note

The TensorFrame-based pipeline is designed for integration with torch-frame encoders, allowing:

automatic handling of heterogeneous feature types
missing value processing
feature-wise embedding learning
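
As a rough illustration of what model-side handling of heterogeneous columns can involve, the sketch below maps categorical values to integer ids (reserving 0 for missing or unseen values) and imputes missing numerical values. It is plain Python with hypothetical helper names and does not reflect torch-frame's internals.

```python
def build_vocab(train_values):
    """Assign ids starting at 1 to categories seen in training; 0 is reserved."""
    vocab = {}
    for value in train_values:
        if value is not None and value not in vocab:
            vocab[value] = len(vocab) + 1
    return vocab


def encode_categorical(values, vocab):
    """Map categories to ids; missing (None) and unseen values fall back to 0."""
    return [vocab.get(value, 0) for value in values]


def impute_numerical(values, fill_value):
    """Replace missing (None) numerical entries with a fill value."""
    return [fill_value if value is None else value for value in values]


vocab = build_vocab(["red", "blue", None, "red"])
color_ids = encode_categorical(["blue", None, "green"], vocab)
prices = impute_numerical([1.0, None, 3.0], fill_value=2.0)
print(vocab, color_ids, prices)
```

The integer ids are the kind of input an embedding table would consume, which is what makes feature-wise embedding learning possible for categorical columns.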

Dataset Class 

class GraphLandTensorFrameDataset(root: str, name: str, split: str, to_undirected: bool = False, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, force_reload: bool = False)

Bases: InMemoryDataset

GraphLand dataset rewritten to store node attributes in TensorFrame.

Differences from the original implementation:
- Graph structure is stored in Data.edge_index.
- Node attributes are stored in Data.x (a torch_frame.TensorFrame).
- Masks and targets are still stored in Data.

Notes:
- The original sklearn-based feature preprocessing is intentionally removed. In a torch-frame workflow, semantic types are preserved and feature encoding/normalization/imputation is usually handled by the model-side encoders.

GRAPHLAND_DATASETS = {'artnet-exp': 'binary_classification', 'artnet-views': 'regression', 'avazu-ctr': 'regression', 'city-reviews': 'binary_classification', 'city-roads-L': 'regression', 'city-roads-M': 'regression', 'hm-categories': 'multiclass_classification', 'hm-prices': 'regression', 'pokec-regions': 'multiclass_classification', 'tolokers-2': 'binary_classification', 'twitch-views': 'regression', 'web-fraud': 'binary_classification', 'web-topics': 'multiclass_classification', 'web-traffic': 'regression'}

property raw_dir: str

Return type: str

property processed_dir: str

Return type: str

property raw_file_names: str

The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

Return type: str

property processed_file_names: str

The name of the files in the self.processed_dir folder that must be present in order to skip processing.

Return type: str

download() → None

Downloads the dataset to the self.raw_dir folder.

Return type: None

process() → None

Processes the dataset to the self.processed_dir folder.

Return type: None