pyagc.data
==========

.. contents:: Contents
   :local:

Dataset Loading
---------------

.. autofunction:: pyagc.data.get_dataset

.. autofunction:: pyagc.data.get_tabular_graphland_dataset

Benchmark Datasets
------------------

PyAGC provides a curated collection of 12 benchmark datasets spanning diverse domains, scales, and feature types for comprehensive evaluation of attributed graph clustering algorithms.

Dataset Overview Table
~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 8 10 10 12 12 8 8 10 8 8 8

   * - Scale
     - Dataset
     - Domain
     - #Nodes
     - #Edges
     - Avg.Deg
     - #Feat
     - Feat.Type
     - #Clus
     - :math:`\mathcal{H}_e`
     - :math:`\mathcal{H}_n`
   * - Tiny
     - Cora
     - Citation
     - 2,708
     - 10,556
     - 3.9
     - 1,433
     - Textual
     - 7
     - 0.81
     - 0.83
   * - Tiny
     - Photo
     - Co-purchase
     - 7,650
     - 238,162
     - 31.1
     - 745
     - Textual
     - 8
     - 0.83
     - 0.84
   * - Small
     - Physics
     - Co-author
     - 34,493
     - 495,924
     - 14.4
     - 8,415
     - Textual
     - 5
     - 0.93
     - 0.92
   * - Small
     - HM
     - Co-purchase
     - 46,563
     - 21,461,990
     - 460.9
     - 120
     - Tabular
     - 21
     - 0.16
     - 0.35
   * - Small
     - Flickr
     - Social
     - 89,250
     - 899,756
     - 10.1
     - 500
     - Textual
     - 7
     - 0.32
     - 0.32
   * - Medium
     - ArXiv
     - Citation
     - 169,343
     - 1,166,243
     - 6.9
     - 128
     - Textual
     - 40
     - 0.65
     - 0.64
   * - Medium
     - Reddit
     - Social
     - 232,965
     - 23,213,838
     - 99.6
     - 602
     - Textual
     - 41
     - 0.78
     - 0.81
   * - Medium
     - MAG
     - Citation
     - 736,389
     - 10,792,672
     - 14.7
     - 128
     - Textual
     - 349
     - 0.30
     - 0.31
   * - Large
     - Pokec
     - Social
     - 1,632,803
     - 44,603,928
     - 27.3
     - 56
     - Tabular
     - 183
     - 0.43
     - 0.39
   * - Large
     - Products
     - Co-purchase
     - 2,449,029
     - 61,859,140
     - 25.4
     - 100
     - Textual
     - 47
     - 0.81
     - 0.82
   * - Large
     - WebTopic
     - Web
     - 2,890,331
     - 24,754,822
     - 8.6
     - 528
     - Tabular
     - 28
     - 0.22
     - 0.24
   * - Massive
     - Papers100M
     - Citation
     - 111,059,956
     - 1,615,685,872
     - 14.5
     - 128
     - Textual
     - 172
     - 0.57
     - 0.50

.. note::

   - :math:`\mathcal{H}_e`: Edge homophily (proportion of edges connecting same-class nodes)
   - :math:`\mathcal{H}_n`: Node homophily (fraction of same-class neighbors, averaged over nodes)
   - **Feat.Type**: Textual (bag-of-words, embeddings) or Tabular (categorical/numerical metadata)
   - For Papers100M, labels are available only for a subset of ≈1.5M arXiv papers. The reported homophily metrics are computed on the subgraph induced by these labeled nodes.

Dataset Details by Scale
~~~~~~~~~~~~~~~~~~~~~~~~~

**Tiny Scale** (:math:`N < 10^4`)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These datasets are suitable for rapid prototyping and sanity checking:

- **Cora**: Classic citation network of machine learning papers with sparse bag-of-words features
- **Photo**: Amazon product co-purchase graph with review-based features

**Small Scale** (:math:`10^4 \le N < 10^5`)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Suitable for comprehensive model development and ablation studies:

- **Physics**: Co-authorship network from Microsoft Academic Graph with keyword features
- **HM**: H&M fashion co-purchase network with **tabular product metadata** (color, weekday statistics)
- **Flickr**: Image-sharing social network with tag-based features

**Medium Scale** (:math:`10^5 \le N < 10^6`)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Transitional regime requiring efficient implementations:

- **ArXiv**: Computer Science paper citations with title/abstract embeddings
- **Reddit**: Discussion posts connected by common commenters with GloVe features
- **MAG**: Multi-venue academic citations (349 classes) with abstract embeddings

**Large Scale** (:math:`10^6 \le N < 10^8`)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Production-scale graphs requiring mini-batch training:

- **Pokec**: Slovak social network with **tabular user profiles** (183 regions, heterophilous)
- **Products**: Amazon co-purchase network with product description features
- **WebTopic**: Web graph with **tabular website metadata** (28 topics, low homophily)

**Massive Scale** (:math:`N \ge 10^8`)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Extreme-scale benchmark for testing scalability limits:

- **Papers100M**: Citation network of 111M papers (172 subjects); requires neighbor sampling

Example Usage
-------------

Basic Loading
~~~~~~~~~~~~~

.. code-block:: python

    from pyagc.data import get_dataset

    # Load dataset (returns unpacked components)
    x, edge_index, y = get_dataset('Cora', root='./data')

    print(f"Node features shape: {x.shape}")        # [num_nodes, num_features]
    print(f"Edge index shape: {edge_index.shape}")  # [2, num_edges]
    print(f"Labels shape: {y.shape}")               # [num_nodes]
    print(f"Number of classes: {y.max().item() + 1}")

Loading with Train/Val/Test Splits
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # Load with predefined splits
    x, edge_index, y, train_idx, valid_idx, test_idx = get_dataset(
        'ArXiv',
        root='./data',
        return_splits=True
    )

    print(f"Training nodes: {train_idx.shape[0]}")
    print(f"Validation nodes: {valid_idx.shape[0]}")
    print(f"Test nodes: {test_idx.shape[0]}")

Loading Large-Scale Datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # Medium/Large datasets work the same way
    x, edge_index, y = get_dataset('Products', root='./data')

    print(f"Large graph: {x.shape[0]:,} nodes, {edge_index.shape[1]:,} edges")

Loading Massive-Scale Datasets (Papers100M)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # Papers100M requires special handling due to its size
    # First-time loading will preprocess and cache the undirected graph
    x, edge_index, y, train_idx, valid_idx, test_idx, labeled_subgraph = get_dataset(
        'Papers100M',
        root='./data',
        return_splits=True
    )

    # labeled_subgraph contains structure for computing structural metrics
    # on the labeled subset (≈1.5M nodes)
    print(f"Full graph: {x.shape[0]:,} nodes")
    print(f"Labeled subgraph: {labeled_subgraph['num_nodes']:,} nodes")
    print(f"Subgraph edges: {labeled_subgraph['edge_index'].shape[1]:,}")

.. warning::

   **Papers100M Preprocessing Requirements:**

   - First-time loading requires ~400 GB RAM for preprocessing
   - Preprocessed data is cached to disk for future use
   - Returns an additional ``labeled_subgraph`` dict when ``return_splits=True``
   - The labeled subgraph contains only structure (no features) for efficient structural metric computation

Creating PyG Data Object
~~~~~~~~~~~~~~~~~~~~~~~~~

If you need a PyG Data object for compatibility:

.. code-block:: python

    from pyagc.data import get_dataset
    from torch_geometric.data import Data

    x, edge_index, y = get_dataset('Cora', root='./data')

    # Wrap in PyG Data object
    data = Data(x=x, edge_index=edge_index, y=y)

    print(f"Nodes: {data.num_nodes}, Edges: {data.num_edges}")
    print(f"Features: {data.num_features}, Classes: {data.y.max().item() + 1}")

Working with Tabular Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

GraphLand datasets (HM, Pokec, WebTopic) provide **tabular node features**. Two usage patterns are supported:

Using Dense Tensor Features
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    # Load tabular feature dataset
    x, edge_index, y = get_dataset('HM', root='./data')

    print(f"Tabular features: {x.shape}")  # Mixed categorical/numerical
    print(f"Number of product categories: {y.max().item() + 1}")

    # These features may require preprocessing (normalization, encoding)
    # depending on your clustering algorithm

Using TensorFrame Features
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    from pyagc.data import get_tabular_graphland_dataset

    data = get_tabular_graphland_dataset('HM', root='./data')

    x = data.x  # torch_frame.TensorFrame
    print(x)
    print(data.tf_col_stats)  # column-wise statistics

Advantages of TensorFrame:

- Preserves **feature semantics** (categorical vs numerical)
- Avoids lossy preprocessing
- Enables **learnable feature encoders**
- Better suited for **heterogeneous tabular graphs**

.. tip::

   When using TensorFrame inputs, pair with encoders from :mod:`pyagc.encoders.tabencoder` or custom torch-frame models.
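
If you stay with the dense-tensor route instead, column-wise standardization is one common form of the normalization mentioned in the dense-tensor example. The following is a minimal sketch in plain PyTorch and is not part of PyAGC: the helper name ``standardize_features`` is hypothetical, and a random tensor stands in for the ``x`` returned by ``get_dataset``. Whether this step helps at all depends on how the features were already preprocessed by the loader and on the clustering algorithm.

.. code-block:: python

    import torch


    def standardize_features(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Hypothetical helper: z-score each feature column (zero mean, unit variance)."""
        mean = x.mean(dim=0, keepdim=True)
        std = x.std(dim=0, unbiased=False, keepdim=True)
        # eps keeps constant columns (std == 0) from dividing by zero;
        # such columns are mapped to all zeros.
        return (x - mean) / (std + eps)


    # Stand-in for the dense feature matrix of a tabular dataset (assumption:
    # in practice this would be the ``x`` returned by get_dataset).
    torch.manual_seed(0)
    x = torch.randn(1000, 120) * 5.0 + 3.0

    x_std = standardize_features(x)
    print(f"Max |column mean| after standardization: {x_std.mean(dim=0).abs().max().item():.2e}")

Standardization treats every column as numerical, so it discards the categorical/numerical distinction that the TensorFrame route preserves.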
Loading Other PyG Datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For datasets not included in the benchmark suite, you can use PyTorch Geometric directly:

.. code-block:: python

    import torch_geometric.transforms as T
    from torch_geometric.datasets import Planetoid, Amazon, Coauthor
    from torch_geometric.utils import to_undirected

    # Example 1: Load PubMed from Planetoid
    dataset = Planetoid(root='./data', name='PubMed', transform=T.NormalizeFeatures())
    data = dataset[0]
    data.edge_index = to_undirected(data.edge_index)  # Convert to undirected

    x = data.x
    edge_index = data.edge_index
    y = data.y.squeeze()

    print(f"PubMed: {x.shape[0]} nodes, {edge_index.shape[1]} edges")

    # Example 2: Load Computers from Amazon
    dataset = Amazon(root='./data', name='Computers', transform=T.NormalizeFeatures())
    data = dataset[0]
    data.edge_index = to_undirected(data.edge_index)

    # Example 3: Load other Coauthor datasets
    dataset = Coauthor(root='./data', name='CS', transform=T.NormalizeFeatures())
    data = dataset[0]

    # Example 4: Load any OGB dataset
    from ogb.nodeproppred import PygNodePropPredDataset

    dataset = PygNodePropPredDataset(root='./data', name='ogbn-proteins')
    data = dataset[0]
    data.edge_index = to_undirected(data.edge_index)

    # Get splits for OGB datasets
    split_idx = dataset.get_idx_split()
    train_idx = split_idx['train']
    valid_idx = split_idx['valid']
    test_idx = split_idx['test']

.. tip::

   **Working with Custom PyG Datasets:**

   When using PyG datasets directly, remember to:

   - Apply ``to_undirected()`` for clustering tasks (most AGC methods assume undirected graphs)
   - Use the ``T.NormalizeFeatures()`` transform for consistent feature scaling
   - Convert labels to a 1D tensor: ``y = data.y.squeeze()``
   - Check whether the dataset provides train/val/test masks or splits

Loading Custom Datasets
~~~~~~~~~~~~~~~~~~~~~~~~

You can also load your own graph data:

.. code-block:: python

    import torch
    from torch_geometric.data import Data
    from torch_geometric.utils import to_undirected

    # Create graph from edge list and features
    edge_index = torch.tensor([[0, 1, 1, 2],
                               [1, 0, 2, 1]], dtype=torch.long)
    x = torch.randn(3, 16)       # 3 nodes, 16 features
    y = torch.tensor([0, 1, 1])  # Ground truth labels

    # Ensure undirected
    edge_index = to_undirected(edge_index)

    data = Data(x=x, edge_index=edge_index, y=y)

.. code-block:: python

    # Load from numpy arrays
    import numpy as np
    import torch
    from scipy.sparse import coo_matrix
    from torch_geometric.data import Data
    from torch_geometric.transforms import NormalizeFeatures

    # Load adjacency matrix (scipy sparse format)
    adj_matrix = coo_matrix(...)  # Your adjacency matrix
    edge_index = torch.tensor(
        np.vstack([adj_matrix.row, adj_matrix.col]),
        dtype=torch.long
    )

    # Load features and labels
    features = np.load('features.npy')
    labels = np.load('labels.npy')

    x = torch.tensor(features, dtype=torch.float)
    y = torch.tensor(labels, dtype=torch.long)

    # Normalize features
    data = Data(x=x, edge_index=edge_index, y=y)
    transform = NormalizeFeatures()
    data = transform(data)

Dataset Name Aliases
~~~~~~~~~~~~~~~~~~~~

The following aliases are supported for convenience:

.. code-block:: python

    # Case-insensitive loading
    get_dataset('cora', root='./data')  # ✓
    get_dataset('Cora', root='./data')  # ✓
    get_dataset('CORA', root='./data')  # ✓

    # OGB dataset aliases
    get_dataset('arxiv', root='./data')            # Short form
    get_dataset('ogbn-arxiv', root='./data')       # Full OGB name

    get_dataset('mag', root='./data')              # Short form
    get_dataset('ogbn-mag', root='./data')         # Full OGB name

    get_dataset('products', root='./data')         # Short form
    get_dataset('ogbn-products', root='./data')    # Full OGB name

    get_dataset('papers100m', root='./data')       # Short form
    get_dataset('ogbn-papers100M', root='./data')  # Full OGB name

    # GraphLand aliases
    get_dataset('hm', root='./data')             # Short form
    get_dataset('hm-categories', root='./data')  # Full name

    get_dataset('pokec', root='./data')          # Short form
    get_dataset('pokec-regions', root='./data')  # Full name

    get_dataset('webtopic', root='./data')       # Short form
    get_dataset('web-topics', root='./data')     # Full name

    # Reddit aliases
    get_dataset('reddit', root='./data')   # Either works
    get_dataset('reddit2', root='./data')  # Same dataset

GraphLand Industrial Datasets
------------------------------

PyAGC includes the **GraphLand** benchmark datasets (HM, Pokec, WebTopic) featuring:

- **Tabular node features** (categorical + numerical)
- **Heterophilous structures** (low homophily)
- **Industrial-scale complexity** (up to millions of nodes)

Two loading paradigms are provided:

Standard Tensor Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using :func:`pyagc.data.get_dataset`:

- Node features are returned as **dense tensors** (:math:`\mathbf{X} \in \mathbb{R}^{N \times F}`)
- Tabular features are **preprocessed (encoded / normalized)** beforehand
- Compatible with traditional GNN pipelines

.. code-block:: python

    x, edge_index, y = get_dataset('HM', root='./data')

TensorFrame-based Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using :func:`pyagc.data.get_tabular_graphland_dataset`:

- Node features are stored as :class:`torch_frame.TensorFrame`
- **Semantic types are preserved** (categorical, numerical, etc.)
- Feature preprocessing is **deferred to model-side encoders**
- Enables advanced tabular learning (e.g., feature-wise embeddings)

.. code-block:: python

    from pyagc.data import get_tabular_graphland_dataset

    data = get_tabular_graphland_dataset('HM', root='./data')

    print(type(data.x))  # torch_frame.TensorFrame
    print(data.edge_index.shape)
    print(data.y.shape)

    # TensorFrame statistics (computed on training nodes)
    print(data.tf_col_stats)

.. note::

   The TensorFrame-based pipeline is designed for integration with **torch-frame encoders**, allowing:

   - automatic handling of heterogeneous feature types
   - missing value processing
   - feature-wise embedding learning

Dataset Class
~~~~~~~~~~~~~~

.. autoclass:: pyagc.data.GraphLandTensorFrameDataset
   :members:
   :undoc-members:

See Also
--------

- :doc:`models` - Clustering models compatible with these datasets
- :doc:`metrics` - Evaluation metrics for clustering quality
- `PyG Dataset Documentation `_
- `OGB Documentation `_