pyagc.data

Dataset Loading

get_dataset(name: str, root: str, return_splits=False)

Loads a graph dataset by name and returns its features, edges, and labels.

This function provides a unified interface for loading benchmark datasets used in graph learning, including classical citation networks (e.g., Cora, PubMed) and large-scale Open Graph Benchmark (OGB) datasets (e.g., ogbn-arxiv, ogbn-products). It automatically normalizes node features and converts the graph to an undirected version.

Optionally, it can also return predefined train/validation/test node splits for benchmarking purposes.
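
The description above mentions feature normalization and conversion to an undirected graph. The sketch below illustrates both steps in plain Python on a toy graph. It assumes row-sum normalization (the behavior of PyG's NormalizeFeatures); the exact normalization applied by get_dataset is not stated here, and the helper names row_normalize and symmetrize_edges are hypothetical.

```python
def row_normalize(features):
    """Scale each row so its entries sum to 1 (all-zero rows are left unchanged)."""
    normalized = []
    for row in features:
        total = sum(row)
        normalized.append([v / total for v in row] if total != 0 else list(row))
    return normalized


def symmetrize_edges(edges):
    """Add the reverse of every edge, drop duplicates, and return sorted (src, dst) pairs."""
    both = set(edges) | {(dst, src) for src, dst in edges}
    return sorted(both)


# Toy inputs: 3 nodes with 2 features each, and a partly directed edge list.
features = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]]
edges = [(0, 1), (1, 2), (2, 1)]

print(row_normalize(features))   # [[0.25, 0.75], [0.0, 0.0], [0.5, 0.5]]
print(symmetrize_edges(edges))   # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

In a real pipeline the same operations would normally be done on tensors rather than Python lists.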

Parameters:
  • name (str) – The name of the dataset to load. Supported options include: ['cora', 'citeseer', 'pubmed', 'corafull', 'photo', 'computers', 'cs', 'physics', 'flickr', 'reddit', 'reddit2', 'ogbn-arxiv', 'arxiv', 'ogbn-mag', 'mag', 'ogbn-products', 'products', 'ogbn-papers100M', 'papers100m', 'hm-categories', 'hm', 'pokec-regions', 'pokec', 'web-topics', 'webtopic'].

  • root (str) – The root directory where the dataset should be stored.

  • return_splits (bool, optional) – If set to True, returns node-level split indices (train/valid/test) along with the features and edges. (default: False)

Returns:

Depending on return_splits:
  • If False, returns (x, edge_index, y):
    • x: Node feature matrix [num_nodes, num_features]

    • edge_index: Graph connectivity in COO format [2, num_edges]

    • y: Node label vector [num_nodes]

  • If True, returns (x, edge_index, y, train_idx, valid_idx, test_idx) with additional index tensors for data splits.

  • For papers100M with return_splits=True, additionally returns: (x, edge_index, y, train_idx, valid_idx, test_idx, labeled_subgraph) where labeled_subgraph contains only edge_index and original_indices for structure metric computation.

Return type:

(Tuple)

Raises:

ValueError – If the provided dataset name is not recognized.
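
Because the length of the returned tuple depends on return_splits and on the dataset (3, 6, or 7 elements), code that handles several datasets may find it convenient to name the fields. The helper below is a hypothetical convenience, not part of pyagc; it only relies on the element order documented above.

```python
FIELD_NAMES = (
    "x", "edge_index", "y",
    "train_idx", "valid_idx", "test_idx",
    "labeled_subgraph",
)


def unpack_dataset(result):
    """Map a 3-, 6-, or 7-element result tuple to a dict with named fields."""
    if len(result) not in (3, 6, 7):
        raise ValueError(f"unexpected number of return values: {len(result)}")
    # zip stops at the shorter sequence, so only the fields present are named.
    return dict(zip(FIELD_NAMES, result))


# Placeholder strings stand in for the tensors a real call would return.
fields = unpack_dataset(("X", "E", "Y", "train", "valid", "test"))
print(sorted(fields))
print("labeled_subgraph" in fields)  # False
```

With the real function this would be called as unpack_dataset(get_dataset(name, root, return_splits=True)).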

get_tabular_graphland_dataset(name: str, root: str, split: str = 'TH')

Load HM / Pokec / WebTopic from GraphLandTensorFrameDataset.

This loader is intentionally separate from the generic get_dataset() because these datasets store node attributes as TensorFrame instead of dense tensor features.

Parameters:
  • name (str) – Dataset alias. Supported: ['HM', 'Pokec', 'WebTopic']

  • root (str) – Dataset root directory.

  • split (str) – GraphLand split. Defaults to 'TH'.

Returns:

A PyG Data object with:
  • data.x: torch_frame.TensorFrame

  • data.edge_index: edge list

  • data.y: labels

  • train/val/test masks

  • tf_col_stats: statistics computed from train nodes only

Return type:

Data

Benchmark Datasets

PyAGC provides a curated collection of 12 benchmark datasets spanning diverse domains, scales, and feature types for comprehensive evaluation of attributed graph clustering algorithms.

Dataset Overview Table

| Scale | Dataset | Domain | #Nodes | #Edges | Avg.Deg | #Feat | Feat.Type | #Clus | \(\mathcal{H}_e\) | \(\mathcal{H}_n\) |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny | Cora | Citation | 2,708 | 10,556 | 3.9 | 1,433 | Textual | 7 | 0.81 | 0.83 |
| Tiny | Photo | Co-purchase | 7,650 | 238,162 | 31.1 | 745 | Textual | 8 | 0.83 | 0.84 |
| Small | Physics | Co-author | 34,493 | 495,924 | 14.4 | 8,415 | Textual | 5 | 0.93 | 0.92 |
| Small | HM | Co-purchase | 46,563 | 21,461,990 | 460.9 | 120 | Tabular | 21 | 0.16 | 0.35 |
| Small | Flickr | Social | 89,250 | 899,756 | 10.1 | 500 | Textual | 7 | 0.32 | 0.32 |
| Medium | ArXiv | Citation | 169,343 | 1,166,243 | 6.9 | 128 | Textual | 40 | 0.65 | 0.64 |
| Medium | Reddit | Social | 232,965 | 23,213,838 | 99.6 | 602 | Textual | 41 | 0.78 | 0.81 |
| Medium | MAG | Citation | 736,389 | 10,792,672 | 14.7 | 128 | Textual | 349 | 0.30 | 0.31 |
| Large | Pokec | Social | 1,632,803 | 44,603,928 | 27.3 | 56 | Tabular | 183 | 0.43 | 0.39 |
| Large | Products | Co-purchase | 2,449,029 | 61,859,140 | 25.4 | 100 | Textual | 47 | 0.81 | 0.82 |
| Large | WebTopic | Web | 2,890,331 | 24,754,822 | 8.6 | 528 | Tabular | 28 | 0.22 | 0.24 |
| Massive | Papers100M | Citation | 111,059,956 | 1,615,685,872 | 14.5 | 128 | Textual | 172 | 0.57 | 0.50 |

Note

  • \(\mathcal{H}_e\): Edge homophily (proportion of edges connecting same-class nodes)

  • \(\mathcal{H}_n\): Node homophily (average neighbor label consistency)

  • Feat.Type: Textual (bag-of-words, embeddings) or Tabular (categorical/numerical metadata)

  • For Papers100M, labels are available for a subset of ≈1.5M arXiv papers. The reported homophily metrics are calculated on the subgraph induced by these labeled nodes.
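
The two homophily measures in the note can be computed directly from an edge list and a label vector. The following is a minimal plain-Python sketch of the definitions as stated above, not the code used to produce the table. It assumes each undirected edge appears in both directions and that nodes without neighbors are skipped in the node-level average; both function names are illustrative.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share a label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def node_homophily(edges, labels):
    """Mean, over nodes with at least one neighbor, of the fraction of same-label neighbors."""
    degree = {}
    same = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        same[u] = same.get(u, 0) + (labels[u] == labels[v])
    return sum(same[u] / degree[u] for u in degree) / len(degree)


# Path graph 0-1-2-3 with two classes; every edge is listed in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
labels = [0, 0, 1, 1]

print(round(edge_homophily(edges, labels), 3))  # 0.667
print(node_homophily(edges, labels))            # 0.75
```

Here 4 of the 6 directed edges join same-class nodes, while the per-node fractions are 1, 0.5, 0.5, and 1, which average to 0.75.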

Dataset Details by Scale

Tiny Scale (\(N < 10^4\))

These datasets are suitable for rapid prototyping and sanity checking:

  • Cora: Classic citation network of machine learning papers with sparse bag-of-words features

  • Photo: Amazon product co-purchase graph with review-based features

Small Scale (\(10^4 \le N < 10^5\))

Suitable for comprehensive model development and ablation studies:

  • Physics: Co-authorship network from Microsoft Academic Graph with keyword features

  • HM: H&M fashion co-purchase network with tabular product metadata (color, weekday statistics)

  • Flickr: Image-sharing social network with tag-based features

Medium Scale (\(10^5 \le N < 10^6\))

Transitional regime requiring efficient implementations:

  • ArXiv: Computer Science paper citations with title/abstract embeddings

  • Reddit: Discussion posts with GloVe-based features, connected when the same user commented on both posts

  • MAG: Multi-venue academic citations (349 classes) with abstract embeddings

Large Scale (\(10^6 \le N < 10^8\))

Production-scale graphs requiring mini-batch training:

  • Pokec: Slovak social network with tabular user profiles (183 regions, heterophilous)

  • Products: Amazon co-purchase network with product description features

  • WebTopic: Web graph with tabular website metadata (28 topics, low homophily)

Massive Scale (\(N \ge 10^8\))

Extreme-scale benchmark for testing scalability limits:

  • Papers100M: 111M paper citation network (172 subjects) — requires neighbor sampling
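
The scale tiers above are defined purely by node count, so they can be expressed as a small lookup. The function below is an illustrative sketch (scale_tier is not a pyagc function), and it assumes that a graph with exactly \(10^8\) nodes falls in the Massive tier.

```python
def scale_tier(num_nodes):
    """Return the scale tier name for a graph with the given number of nodes."""
    # Upper bounds are exclusive, matching the half-open ranges listed above.
    tiers = [(10**4, "Tiny"), (10**5, "Small"), (10**6, "Medium"), (10**8, "Large")]
    for upper_bound, name in tiers:
        if num_nodes < upper_bound:
            return name
    return "Massive"


# Node counts taken from the overview table.
for dataset, n in [("Cora", 2_708), ("HM", 46_563), ("ArXiv", 169_343),
                   ("Pokec", 1_632_803), ("Papers100M", 111_059_956)]:
    print(f"{dataset}: {scale_tier(n)}")
```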

Example Usage

Basic Loading

from pyagc.data import get_dataset

# Load dataset (returns unpacked components)
x, edge_index, y = get_dataset('Cora', root='./data')

print(f"Node features shape: {x.shape}")      # [num_nodes, num_features]
print(f"Edge index shape: {edge_index.shape}") # [2, num_edges]
print(f"Labels shape: {y.shape}")              # [num_nodes]
print(f"Number of classes: {y.max().item() + 1}")

Loading with Train/Val/Test Splits

# Load with predefined splits
x, edge_index, y, train_idx, valid_idx, test_idx = get_dataset(
    'ArXiv',
    root='./data',
    return_splits=True
)

print(f"Training nodes: {train_idx.shape[0]}")
print(f"Validation nodes: {valid_idx.shape[0]}")
print(f"Test nodes: {test_idx.shape[0]}")
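
Before using the returned splits, it can be worth confirming that the three index sets do not overlap and stay within the node range. The check below is a hypothetical helper written in plain Python; with tensors, the indices could be converted first with .tolist().

```python
def check_splits(train_idx, valid_idx, test_idx, num_nodes):
    """Return True if the three index lists are pairwise disjoint and within [0, num_nodes)."""
    train, valid, test = set(train_idx), set(valid_idx), set(test_idx)
    disjoint = not (train & valid or train & test or valid & test)
    in_range = all(0 <= i < num_nodes for i in train | valid | test)
    return disjoint and in_range


print(check_splits([0, 1, 2], [3], [4, 5], num_nodes=6))  # True
print(check_splits([0, 1, 2], [2], [4, 5], num_nodes=6))  # False: node 2 is in two splits
```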

Loading Large-Scale Datasets

# Medium/Large datasets work the same way
x, edge_index, y = get_dataset('Products', root='./data')
print(f"Large graph: {x.shape[0]:,} nodes, {edge_index.shape[1]:,} edges")

Loading Massive-Scale Datasets (Papers100M)

# Papers100M requires special handling due to its size
# First-time loading will preprocess and cache the undirected graph
x, edge_index, y, train_idx, valid_idx, test_idx, labeled_subgraph = get_dataset(
    'Papers100M',
    root='./data',
    return_splits=True
)

# labeled_subgraph contains structure for computing structural metrics
# on the labeled subset (≈1.5M nodes)
print(f"Full graph: {x.shape[0]:,} nodes")
print(f"Labeled subgraph: {labeled_subgraph['original_indices'].shape[0]:,} nodes")
print(f"Subgraph edges: {labeled_subgraph['edge_index'].shape[1]:,}")

Warning

Papers100M Preprocessing Requirements:

  • First-time loading requires ~400 GB RAM for preprocessing

  • Preprocessed data is cached to disk for future use

  • Returns additional labeled_subgraph dict when return_splits=True

  • The labeled subgraph contains only structure (no features) for efficient structural metric computation
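
To make the role of edge_index and original_indices in a structure-only subgraph concrete, the sketch below induces a subgraph on a set of kept nodes and relabels them to consecutive ids. This is an assumption about how such a subgraph can be built, written in plain Python for illustration; it is not taken from the pyagc source, and induce_subgraph is a hypothetical name.

```python
def induce_subgraph(edges, keep_nodes):
    """Keep edges with both endpoints in keep_nodes and relabel nodes to 0..k-1."""
    original_indices = sorted(set(keep_nodes))
    new_id = {old: new for new, old in enumerate(original_indices)}
    sub_edges = [
        (new_id[u], new_id[v])
        for u, v in edges
        if u in new_id and v in new_id
    ]
    # original_indices[i] is the id in the full graph of subgraph node i.
    return {"edge_index": sub_edges, "original_indices": original_indices}


# Toy graph where only nodes 2 and 5 are "labeled".
edges = [(0, 2), (2, 0), (2, 5), (5, 2), (1, 5), (5, 1)]
sub = induce_subgraph(edges, keep_nodes=[5, 2])

print(sub["edge_index"])        # [(0, 1), (1, 0)]
print(sub["original_indices"])  # [2, 5]
```

Cluster assignments for the full graph can then be indexed with original_indices before computing structural metrics on the smaller edge list.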

Creating PyG Data Object

If you need a PyG Data object for compatibility:

from pyagc.data import get_dataset
from torch_geometric.data import Data

x, edge_index, y = get_dataset('Cora', root='./data')

# Wrap in PyG Data object
data = Data(x=x, edge_index=edge_index, y=y)
print(f"Nodes: {data.num_nodes}, Edges: {data.num_edges}")
print(f"Features: {data.num_features}, Classes: {data.y.max().item() + 1}")

Working with Tabular Features

GraphLand datasets (HM, Pokec, WebTopic) provide tabular node features.

Two usage patterns are supported:

Using Dense Tensor Features

# Load tabular feature dataset
x, edge_index, y = get_dataset('HM', root='./data')

print(f"Tabular features: {x.shape}")  # Mixed categorical/numerical
print(f"Number of product categories: {y.max().item() + 1}")

# These features may require preprocessing (normalization, encoding)
# depending on your clustering algorithm

Using TensorFrame Features

from pyagc.data import get_tabular_graphland_dataset

data = get_tabular_graphland_dataset('HM', root='./data')

x = data.x  # torch_frame.TensorFrame

print(x)
print(data.tf_col_stats)  # column-wise statistics

Advantages of TensorFrame:

  • Preserves feature semantics (categorical vs numerical)

  • Avoids lossy preprocessing

  • Enables learnable feature encoders

  • Better suited for heterogeneous tabular graphs

Tip

When using TensorFrame inputs, pair with encoders from pyagc.encoders.tabencoder or custom torch-frame models.

Loading Other PyG Datasets

For datasets not included in the benchmark suite, you can directly use PyTorch Geometric:

import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid, Amazon, Coauthor
from torch_geometric.utils import to_undirected

# Example 1: Load PubMed from Planetoid
dataset = Planetoid(root='./data', name='PubMed', transform=T.NormalizeFeatures())
data = dataset[0]
data.edge_index = to_undirected(data.edge_index)  # Convert to undirected

x = data.x
edge_index = data.edge_index
y = data.y.squeeze()

print(f"PubMed: {x.shape[0]} nodes, {edge_index.shape[1]} edges")

# Example 2: Load Computers from Amazon
dataset = Amazon(root='./data', name='Computers', transform=T.NormalizeFeatures())
data = dataset[0]
data.edge_index = to_undirected(data.edge_index)

# Example 3: Load other Coauthor datasets
dataset = Coauthor(root='./data', name='CS', transform=T.NormalizeFeatures())
data = dataset[0]

# Example 4: Load any OGB dataset
from ogb.nodeproppred import PygNodePropPredDataset

dataset = PygNodePropPredDataset(root='./data', name='ogbn-proteins')
data = dataset[0]
data.edge_index = to_undirected(data.edge_index)

# Get splits for OGB datasets
split_idx = dataset.get_idx_split()
train_idx = split_idx['train']
valid_idx = split_idx['valid']
test_idx = split_idx['test']

Tip

Working with Custom PyG Datasets:

When using PyG datasets directly, remember to:

  • Apply to_undirected() for clustering tasks (most AGC methods assume undirected graphs)

  • Use T.NormalizeFeatures() transform for consistent feature scaling

  • Convert labels to 1D tensor: y = data.y.squeeze()

  • Check if the dataset provides train/val/test masks or splits

Loading Custom Datasets

You can also load your own graph data:

import torch
from torch_geometric.data import Data
from torch_geometric.utils import to_undirected

# Create graph from edge list and features
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.randn(3, 16)  # 3 nodes, 16 features
y = torch.tensor([0, 1, 1])  # Ground truth labels

# Ensure undirected
edge_index = to_undirected(edge_index)

data = Data(x=x, edge_index=edge_index, y=y)

# Load from numpy arrays
import numpy as np
import torch
from scipy.sparse import coo_matrix
from torch_geometric.data import Data

# Load adjacency matrix (scipy sparse format)
adj_matrix = coo_matrix(...)  # Your adjacency matrix
edge_index = torch.tensor(
    np.vstack([adj_matrix.row, adj_matrix.col]),
    dtype=torch.long
)

# Load features and labels
features = np.load('features.npy')
labels = np.load('labels.npy')

x = torch.tensor(features, dtype=torch.float)
y = torch.tensor(labels, dtype=torch.long)

# Normalize features
from torch_geometric.transforms import NormalizeFeatures
data = Data(x=x, edge_index=edge_index, y=y)
transform = NormalizeFeatures()
data = transform(data)
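
When assembling a graph by hand, a few consistency checks can catch common mistakes before clustering: label count, out-of-range node ids, and missing reverse edges. The validator below is a minimal sketch in plain Python under those assumptions; validate_graph is a hypothetical helper, not a pyagc or PyG function.

```python
def validate_graph(num_nodes, edges, labels):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(labels) != num_nodes:
        problems.append(f"expected {num_nodes} labels, got {len(labels)}")
    edge_set = set(edges)
    for u, v in edges:
        if not (0 <= u < num_nodes and 0 <= v < num_nodes):
            problems.append(f"edge ({u}, {v}) references a node outside 0..{num_nodes - 1}")
        elif (v, u) not in edge_set:
            problems.append(f"edge ({u}, {v}) has no reverse edge ({v}, {u})")
    return problems


# Same toy graph as the first custom example: 3 nodes, symmetric edges.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
print(validate_graph(3, edges, labels=[0, 1, 1]))     # []
print(validate_graph(3, [(0, 1)], labels=[0, 1, 1]))  # ['edge (0, 1) has no reverse edge (1, 0)']
```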

Dataset Name Aliases

The following aliases are supported for convenience:

# Case-insensitive loading
get_dataset('cora', root='./data')      # ✓
get_dataset('Cora', root='./data')      # ✓
get_dataset('CORA', root='./data')      # ✓

# OGB dataset aliases
get_dataset('arxiv', root='./data')     # Short form
get_dataset('ogbn-arxiv', root='./data')  # Full OGB name

get_dataset('mag', root='./data')        # Short form
get_dataset('ogbn-mag', root='./data')   # Full OGB name

get_dataset('products', root='./data')        # Short form
get_dataset('ogbn-products', root='./data')   # Full OGB name

get_dataset('papers100m', root='./data')       # Short form
get_dataset('ogbn-papers100M', root='./data')  # Full OGB name

# GraphLand aliases
get_dataset('hm', root='./data')           # Short form
get_dataset('hm-categories', root='./data')  # Full name

get_dataset('pokec', root='./data')         # Short form
get_dataset('pokec-regions', root='./data')  # Full name

get_dataset('webtopic', root='./data')      # Short form
get_dataset('web-topics', root='./data')    # Full name

# Reddit aliases
get_dataset('reddit', root='./data')   # Either works
get_dataset('reddit2', root='./data')  # Same dataset
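
The alias behavior shown above amounts to lowercasing the name and mapping short forms to full names. The resolver below sketches that idea using only the names listed in this documentation. It is an illustration, not the library's actual lookup, and resolve_name, ALIASES, and KNOWN are hypothetical names.

```python
# Short form -> full name, as paired in the alias examples above.
ALIASES = {
    "arxiv": "ogbn-arxiv",
    "mag": "ogbn-mag",
    "products": "ogbn-products",
    "papers100m": "ogbn-papers100m",
    "hm": "hm-categories",
    "pokec": "pokec-regions",
    "webtopic": "web-topics",
}

# Lowercased full names from the supported-options list.
KNOWN = {
    "cora", "citeseer", "pubmed", "corafull", "photo", "computers", "cs",
    "physics", "flickr", "reddit", "reddit2", "ogbn-arxiv", "ogbn-mag",
    "ogbn-products", "ogbn-papers100m", "hm-categories", "pokec-regions",
    "web-topics",
}


def resolve_name(name):
    """Lowercase a dataset name, expand short aliases, and reject unknown names."""
    key = name.lower()
    key = ALIASES.get(key, key)
    if key not in KNOWN:
        raise ValueError(f"Unknown dataset: {name!r}")
    return key


print(resolve_name("CORA"))             # cora
print(resolve_name("ArXiv"))            # ogbn-arxiv
print(resolve_name("ogbn-papers100M"))  # ogbn-papers100m
```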

GraphLand Industrial Datasets

PyAGC includes the GraphLand benchmark datasets (HM, Pokec, WebTopic) featuring:

  • Tabular node features (categorical + numerical)

  • Heterophilous structures (low homophily)

  • Industrial-scale complexity (millions of nodes)

Two loading paradigms are provided:

Standard Tensor Features

Using pyagc.data.get_dataset():

  • Node features are returned as dense tensors (\(\mathbf{X} \in \mathbb{R}^{N \times F}\))

  • Tabular features are preprocessed (encoded / normalized) beforehand

  • Compatible with traditional GNN pipelines

from pyagc.data import get_dataset

x, edge_index, y = get_dataset('HM', root='./data')

TensorFrame-based Features

Using pyagc.data.get_tabular_graphland_dataset():

  • Node features are stored as torch_frame.TensorFrame

  • Semantic types are preserved (categorical, numerical, etc.)

  • Feature preprocessing is deferred to model-side encoders

  • Enables advanced tabular learning (e.g., feature-wise embeddings)

from pyagc.data import get_tabular_graphland_dataset

data = get_tabular_graphland_dataset('HM', root='./data')

print(type(data.x))  # torch_frame.TensorFrame
print(data.edge_index.shape)
print(data.y.shape)

# TensorFrame statistics (computed on training nodes)
print(data.tf_col_stats)
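
Computing column statistics on training nodes only keeps information from validation and test nodes out of any normalization. The sketch below shows that idea on plain Python rows with one numerical and one categorical column. The column names, the train_col_stats helper, and the output layout are all hypothetical and do not reflect the actual format of tf_col_stats.

```python
from collections import Counter
from statistics import mean, pstdev


def train_col_stats(rows, train_idx, numerical, categorical):
    """Compute per-column statistics from the training rows only, skipping missing values."""
    train_rows = [rows[i] for i in train_idx]
    stats = {}
    for col in numerical:
        values = [r[col] for r in train_rows if r[col] is not None]
        stats[col] = {"mean": mean(values), "std": pstdev(values)}
    for col in categorical:
        stats[col] = {"count": Counter(r[col] for r in train_rows if r[col] is not None)}
    return stats


rows = [
    {"price": 10.0, "color": "red"},
    {"price": 20.0, "color": "blue"},
    {"price": 1000.0, "color": "red"},  # held-out node: must not affect the statistics
]
stats = train_col_stats(rows, train_idx=[0, 1], numerical=["price"], categorical=["color"])

print(stats["price"])                # {'mean': 15.0, 'std': 5.0}
print(dict(stats["color"]["count"]))  # {'red': 1, 'blue': 1}
```

The outlier price on the held-out row leaves the mean and standard deviation unchanged, which is the point of restricting the statistics to training nodes.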

Note

The TensorFrame-based pipeline is designed for integration with torch-frame encoders, allowing:

  • automatic handling of heterogeneous feature types

  • missing value processing

  • feature-wise embedding learning

Dataset Class

class GraphLandTensorFrameDataset(root: str, name: str, split: str, to_undirected: bool = False, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, force_reload: bool = False)

Bases: InMemoryDataset

GraphLand dataset rewritten to store node attributes in TensorFrame.

Differences from the original implementation:
  • Graph structure is stored in Data.edge_index.

  • Node attributes are stored in Data.x (a torch_frame.TensorFrame).

  • Masks and targets are still stored in Data.

Notes:
  • The original sklearn-based feature preprocessing is intentionally removed.

In a torch-frame workflow, semantic types are preserved and feature encoding/normalization/imputation is usually handled by the model-side encoders.
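
The GRAPHLAND_DATASETS attribute documented for this class maps each GraphLand dataset name to a task type. As a small illustration, the snippet below copies that mapping and filters it by task; the three multiclass entries are the ones exposed through the HM, Pokec, and WebTopic aliases. The variable names multiclass and task_counts are illustrative only.

```python
from collections import Counter

# Copied from the class attribute listed in this documentation.
GRAPHLAND_DATASETS = {
    'artnet-exp': 'binary_classification', 'artnet-views': 'regression',
    'avazu-ctr': 'regression', 'city-reviews': 'binary_classification',
    'city-roads-L': 'regression', 'city-roads-M': 'regression',
    'hm-categories': 'multiclass_classification', 'hm-prices': 'regression',
    'pokec-regions': 'multiclass_classification', 'tolokers-2': 'binary_classification',
    'twitch-views': 'regression', 'web-fraud': 'binary_classification',
    'web-topics': 'multiclass_classification', 'web-traffic': 'regression',
}

# Datasets whose targets are class labels, i.e. usable as clustering ground truth.
multiclass = sorted(
    name for name, task in GRAPHLAND_DATASETS.items()
    if task == 'multiclass_classification'
)
task_counts = Counter(GRAPHLAND_DATASETS.values())

print(multiclass)  # ['hm-categories', 'pokec-regions', 'web-topics']
print(task_counts['regression'], task_counts['binary_classification'])  # 7 4
```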

GRAPHLAND_DATASETS = {'artnet-exp': 'binary_classification', 'artnet-views': 'regression', 'avazu-ctr': 'regression', 'city-reviews': 'binary_classification', 'city-roads-L': 'regression', 'city-roads-M': 'regression', 'hm-categories': 'multiclass_classification', 'hm-prices': 'regression', 'pokec-regions': 'multiclass_classification', 'tolokers-2': 'binary_classification', 'twitch-views': 'regression', 'web-fraud': 'binary_classification', 'web-topics': 'multiclass_classification', 'web-traffic': 'regression'}

property raw_dir: str

Return type:

str

property processed_dir: str

Return type:

str

property raw_file_names: str

The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

Return type:

str

property processed_file_names: str

The name of the files in the self.processed_dir folder that must be present in order to skip processing.

Return type:

str

download() → None

Downloads the dataset to the self.raw_dir folder.

Return type:

None

process() → None

Processes the dataset to the self.processed_dir folder.

Return type:

None

See Also