Scaling to Massive Graphs ========================== This tutorial demonstrates how PyAGC scales to graphs with millions (or even billions) of nodes through mini-batch training, neighbor sampling, and intelligent memory management. The Scalability Challenge -------------------------- Traditional graph clustering methods face two major bottlenecks when dealing with large graphs: 1. **Memory Bottleneck**: Full-batch methods require loading the entire graph into GPU memory 2. **Computation Bottleneck**: Message passing over millions of nodes becomes prohibitively slow **Memory Requirements Example** .. list-table:: :header-rows: 1 :widths: 20 20 20 20 20 * - Dataset - Nodes - Edges - Features - Est. GPU Memory * - Cora - 2.7K - 10.5K - 1,433 - ~50 MB * - Reddit - 233K - 23M - 602 - ~8 GB * - Products - 2.4M - 62M - 100 - ~32 GB (OOM!) * - Papers100M - 111M - 1.6B - 128 - ~500 GB (Impossible!) PyAGC's Scalability Solutions ------------------------------ PyAGC provides three complementary strategies: 1. **Mini-Batch Training**: Process the graph in small batches instead of all at once 2. **Neighbor Sampling**: Limit the number of neighbors sampled during message passing 3. **Automatic Fallback**: Intelligently switch between GPU/CPU and full-batch/mini-batch modes Mini-Batch Training ------------------- Basic Configuration ~~~~~~~~~~~~~~~~~~~ Enable mini-batch training in your configuration file: .. code-block:: yaml # train.conf.yaml mini_batch: true # Enable mini-batch mode batch_size: 2048 # Nodes per batch num_layers: 2 # GNN depth fan_out: 10 # Neighbors to sample per layer num_workers: 4 # Data loading workers Using Mini-Batch in Code ~~~~~~~~~~~~~~~~~~~~~~~~ All PyAGC models automatically support mini-batch training: .. code-block:: python from pyagc.data import get_dataset from pyagc.models import DGI from pyagc.encoders import create_tuned_gnn from torch_geometric.data import Data # Load large graph x, edge_index, y = get_dataset('Reddit', root='data') data = Data(x=x, edge_index=edge_index) # Create model encoder = create_tuned_gnn( gnn_type='sage', # GraphSAGE works well for mini-batch in_channels=data.num_features, hidden_channels=256, num_layers=2 ) model = DGI(hidden_channels=256, encoder=encoder) # The model will automatically use mini-batch training # based on your configuration Under the hood, PyAGC uses PyTorch Geometric's ``NeighborLoader`` to create mini-batches: .. code-block:: python from torch_geometric.loader import NeighborLoader # Automatically created during training train_loader = NeighborLoader( data, input_nodes=None, # Sample from all nodes num_neighbors=[fan_out] * num_layers, # [10, 10] for 2 layers batch_size=batch_size, shuffle=True ) Neighbor Sampling Strategies ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Different sampling strategies for different scenarios: .. code-block:: yaml # Uniform sampling (default) fan_out: 10 # Sample 10 neighbors per layer # Layer-wise sampling fan_out: [15, 10, 5] # More neighbors in earlier layers # Full neighborhood (for inference) infer_fan_out: -1 # Sample all neighbors during inference Inference Optimization ---------------------- PyAGC provides separate configuration for inference to balance speed and accuracy: Inference Batch Size ~~~~~~~~~~~~~~~~~~~~ Use larger batches during inference since no gradients are computed: .. code-block:: yaml batch_size: 1024 # Training batch size infer_batch_size: 4096 # Inference batch size (can be larger) The library automatically handles this: .. code-block:: python # During training model.train_batch(train_loader, optimizer, epoch) # During inference - automatically uses larger batch size z = model.infer_batch(inference_loader) Full-Batch Inference Fallback ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For better accuracy on medium-sized graphs, PyAGC can try full-batch inference first: .. code-block:: yaml force_full_batch_inference: true # Try full-batch first allow_cpu_fallback: true # Fallback to CPU if GPU OOMs The automatic fallback strategy: 1. **Try GPU full-batch** - Best accuracy, fastest if it fits 2. **Fallback to CPU full-batch** - Better accuracy, slower but feasible 3. **Fallback to GPU mini-batch** - Good accuracy, memory-efficient .. code-block:: python # Handled automatically by inference_embeddings() z, inference_time = inference_embeddings( model, data, conf, logger, device, labeled_indices=labeled_indices # Optional: only infer subset ) Memory Management ----------------- Handling OOM Errors ~~~~~~~~~~~~~~~~~~~ PyAGC automatically handles out-of-memory errors: .. code-block:: python # Automatic batch size reduction on OOM max_retries: 3 retry_count: 0 while retry_count < max_retries: try: # Try inference with current batch size z = model.infer_batch(inference_loader) break except RuntimeError as e: if 'out of memory' in str(e): # Reduce batch size and retry infer_batch_size = max(infer_batch_size // 2, 32) logger.warning(f"OOM detected, reducing batch size to {infer_batch_size}") torch.cuda.empty_cache() retry_count += 1 Checkpoint Management ~~~~~~~~~~~~~~~~~~~~~ Save checkpoints during training to recover from interruptions: .. code-block:: yaml save_every: 10 # Save every 10 epochs save_every_batch: 1000 # Save every 1000 batches (for very large graphs) .. code-block:: python from pyagc.utils import CheckpointManager ckpt_manager = CheckpointManager('ckpts/reddit', 'seed0', logger) # Resume training from checkpoint if args.resume: checkpoint = ckpt_manager.load_checkpoint( model, optimizer, load_best=False, device=device ) start_epoch = checkpoint['epoch'] + 1 Specialized Support for Large Graphs ------------------------------------- Papers100M Dataset ~~~~~~~~~~~~~~~~~~ PyAGC has special optimizations for billion-scale graphs: .. code-block:: python # Automatically detects Papers100M and loads efficiently x, edge_index, y, train_idx, valid_idx, test_idx, labeled_subgraph = get_dataset( 'Papers100M', root='data', return_splits=True # Returns labeled subset ) # Only compute embeddings for labeled nodes (1.8M instead of 111M) labeled_indices = torch.cat([train_idx, valid_idx, test_idx]) z, inference_time = inference_embeddings( model, data, conf, logger, device, labeled_indices=labeled_indices # Huge memory savings! ) # Use labeled subgraph for structure metrics struct_results = structure_metrics( labeled_subgraph['edge_index'], # Much smaller graph pred, metrics=['Mod', 'Cond'] ) Model-Specific Optimizations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **MAGI**: Custom data loader for random walk sampling .. code-block:: python from pyagc.models.magi import MAGINeighborLoader train_loader = MAGINeighborLoader( data, num_neighbors=[10] * 2, num_walks=20, # Random walks per node walk_length=4, # Steps per walk batch_size=2048 ) **DAEGC**: Multi-stage training with separate configurations .. code-block:: yaml pretrain_epochs: 200 pretrain_batch_size: 2048 finetune_epochs: 100 finetune_batch_size: 4096 # Can be larger in finetuning Practical Example: Scaling to Reddit ------------------------------------- Complete workflow for the Reddit dataset (233K nodes, 23M edges): Configuration File ~~~~~~~~~~~~~~~~~~ .. code-block:: yaml # train.conf.yaml for Reddit dataset: Reddit # Model gnn_type: sage hidden_channels: 256 num_layers: 2 dropout: 0.0 # Training mini_batch: true batch_size: 2048 fan_out: 10 epochs: 200 lr: 0.001 # Inference infer_batch_size: 8192 infer_fan_out: -1 force_full_batch_inference: false allow_cpu_fallback: true # Checkpointing save_every: 20 Training Script ~~~~~~~~~~~~~~~ .. code-block:: python import torch from pyagc.data import get_dataset from pyagc.models import DGI from pyagc.encoders import create_tuned_gnn from pyagc.utils import get_training_config, CheckpointManager from torch_geometric.data import Data # Load configuration conf = get_training_config('Reddit', config_path='train.conf.yaml') device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # Load dataset x, edge_index, y = get_dataset('Reddit', root='data') data = Data(x=x, edge_index=edge_index) print(f"Graph: {data.num_nodes:,} nodes, {data.num_edges:,} edges") # Create model encoder = create_tuned_gnn( gnn_type='sage', in_channels=data.num_features, hidden_channels=256, num_layers=2 ) model = DGI(hidden_channels=256, encoder=encoder).to(device) # Setup checkpointing ckpt_manager = CheckpointManager('ckpts/reddit', 'seed0') # Training (automatically uses mini-batch) optimizer = torch.optim.Adam(model.parameters(), lr=0.001) for epoch in range(1, 201): # train_batch automatically creates NeighborLoader loss = model.train_batch(train_loader, optimizer, epoch) if epoch % 20 == 0: ckpt_manager.save_checkpoint(model, optimizer, epoch, loss) # Inference with automatic fallback z, time = inference_embeddings(model, data, conf, device) # Clustering from pyagc.clusters import KMeansClusterHead kmeans = KMeansClusterHead(n_clusters=41) # Reddit has 41 classes pred = kmeans.fit_predict(z) Expected Output ~~~~~~~~~~~~~~~ .. code-block:: text ============================================================ System Information ============================================================ Dataset: Data(x=[232965, 602], edge_index=[2, 114615892]) Nodes: 232,965 Edges: 114,615,892 Training mode: Mini-batch GNN type: SAGE Total parameters: 1.234M Device: cuda:0 GPU: NVIDIA A100-SXM4-40GB ============================================================ Training Mode: Mini-batch (NeighborLoader) ============================================================ Batch size: 2048 Fan-out: [10] × 2 layers Epoch 001: 100%|██████████| 232965/232965 [00:42<00:00] Epoch: 001 Loss: 0.8234 ... Training completed in 142.34s (avg 711ms/epoch) ============================================================ Inference Stage ============================================================ Attempting full-batch inference on cuda:0... Full-batch inference failed on cuda:0 due to OOM Using mini-batch inference on cuda:0... Inference batch size: 8192 ✓ Mini-batch inference completed in 8.21s ============================================================ Clustering Stage ============================================================ Run 1/5: NMI=45.23, ARI=42.18, ACC=68.91 [3.12s] ... Final: NMI=45.67±0.31, ARI=42.45±0.28, ACC=69.12±0.34 Performance Comparison ---------------------- Performance on different graph sizes: .. list-table:: :header-rows: 1 :widths: 15 15 15 20 20 15 * - Dataset - Nodes - Mode - Training Time - Peak Memory - NMI * - Cora - 2.7K - Full-batch - 12s - 0.2 GB - 68.5 * - Cora - 2.7K - Mini-batch - 18s - 0.1 GB - 68.2 * - Reddit - 233K - Full-batch - OOM - — - — * - Reddit - 233K - Mini-batch - 142s - 4.2 GB - 45.7 * - Products - 2.4M - Full-batch - OOM - — - — * - Products - 2.4M - Mini-batch - 8.3min - 12 GB - 52.3 Best Practices -------------- Choosing Batch Size ~~~~~~~~~~~~~~~~~~~ .. code-block:: python # Rule of thumb for batch_size: # - Small graphs (< 10K nodes): Use full-batch # - Medium graphs (10K - 1M nodes): batch_size = 1024-4096 # - Large graphs (> 1M nodes): batch_size = 4096-8192 if data.num_nodes < 10000: conf['mini_batch'] = False elif data.num_nodes < 1000000: conf['batch_size'] = 2048 else: conf['batch_size'] = 8192 Choosing Fan-out ~~~~~~~~~~~~~~~~ .. code-block:: python # Trade-off between accuracy and speed: # - High fan-out (20-50): Better accuracy, slower # - Medium fan-out (10-15): Balanced # - Low fan-out (5-10): Faster, may lose accuracy # Deeper GNNs need smaller fan-out to control batch size if num_layers == 2: fan_out = 15 elif num_layers == 3: fan_out = 10 else: fan_out = 5 GNN Architecture Selection ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python # Best GNN types for mini-batch training: # 1. GraphSAGE (sampling-based, designed for mini-batch) # 2. GAT (attention weights computed locally) # Avoid for mini-batch: # - GCN with cached=True (expects full graph) # - Models requiring global information Troubleshooting --------------- Common Issues and Solutions ~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Issue 1: Out of Memory During Training** .. code-block:: python RuntimeError: CUDA out of memory. Tried to allocate 2.34 GiB **Solution**: Reduce batch size or fan-out .. code-block:: yaml # Original configuration batch_size: 4096 fan_out: 15 # Reduced configuration batch_size: 1024 # Reduce by 4x fan_out: 10 # Reduce neighbors **Issue 2: Out of Memory During Inference** .. code-block:: python RuntimeError: CUDA out of memory during inference **Solution**: Enable automatic fallback and reduce inference batch size .. code-block:: yaml force_full_batch_inference: false allow_cpu_fallback: true infer_batch_size: 2048 # Smaller than training batch_size infer_fan_out: -1 # Keep full neighborhood for accuracy **Issue 3: Training is Too Slow** .. code-block:: python # Epoch takes 10+ minutes on medium-sized graph **Solution**: Increase batch size and use multiple workers .. code-block:: yaml batch_size: 8192 # Larger batches = fewer iterations num_workers: 4 # Parallel data loading fan_out: 10 # Reduce if still slow **Issue 4: Poor Clustering Performance in Mini-Batch Mode** .. code-block:: python # NMI drops from 68.5 (full-batch) to 45.2 (mini-batch) **Solution**: Use full neighborhood during inference .. code-block:: yaml # Training can use sampling fan_out: 10 # Inference uses full neighborhood infer_fan_out: -1 force_full_batch_inference: true # Try full-batch inference **Issue 5: DataLoader Deadlock with Multiple Workers** .. code-block:: python # Training hangs at first epoch with num_workers > 0 **Solution**: Set num_workers=0 or use proper multiprocessing setup .. code-block:: yaml num_workers: 0 # Safe default, especially on CPU Or use proper initialization: .. code-block:: python if __name__ == '__main__': import torch.multiprocessing as mp mp.set_start_method('spawn', force=True) main() Advanced Techniques ------------------- Gradient Accumulation for Larger Effective Batch Size ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When GPU memory limits batch size but you want larger effective batches: .. code-block:: python # Accumulate gradients over multiple mini-batches accumulation_steps = 4 effective_batch_size = batch_size * accumulation_steps for epoch in range(epochs): optimizer.zero_grad() total_loss = 0 for i, batch in enumerate(train_loader): batch = batch.to(device) loss_output = model.loss_batch(batch) loss = loss_output.total / accumulation_steps loss.backward() if (i + 1) % accumulation_steps == 0: optimizer.step() optimizer.zero_grad() total_loss += loss.item() Mixed Precision Training ~~~~~~~~~~~~~~~~~~~~~~~~ Use automatic mixed precision (AMP) to reduce memory usage: .. code-block:: python from torch.cuda.amp import autocast, GradScaler scaler = GradScaler() for epoch in range(epochs): for batch in train_loader: batch = batch.to(device) optimizer.zero_grad() # Forward pass with autocast with autocast(): loss_output = model.loss_batch(batch) loss = loss_output.total # Backward pass with gradient scaling scaler.scale(loss).backward() scaler.step(optimizer) scaler.update() This can reduce memory usage by ~30-40% with minimal accuracy loss. Distributed Training (Multi-GPU) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For extremely large graphs, distribute training across multiple GPUs: .. code-block:: python import torch.distributed as dist from torch.nn.parallel import DistributedDataParallel as DDP from torch_geometric.loader import NeighborLoader # Initialize distributed training dist.init_process_group(backend='nccl') local_rank = int(os.environ['LOCAL_RANK']) device = torch.device(f'cuda:{local_rank}') # Wrap model with DDP model = DGI(hidden_channels=256, encoder=encoder).to(device) model = DDP(model, device_ids=[local_rank]) # Distributed data loading train_loader = NeighborLoader( data, num_neighbors=[10] * 2, batch_size=2048 // world_size, # Divide batch across GPUs shuffle=True ) # Training loop remains the same for epoch in range(epochs): avg_loss = model.module.train_batch(train_loader, optimizer, epoch) Run with: .. code-block:: bash torchrun --nproc_per_node=4 train.py --dataset Reddit Dynamic Sampling Strategies ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Adjust sampling strategy during training: .. code-block:: python def get_adaptive_fan_out(epoch, max_epochs): """Increase fan-out as training progresses for better accuracy.""" min_fan_out = 5 max_fan_out = 15 progress = epoch / max_epochs return int(min_fan_out + (max_fan_out - min_fan_out) * progress) for epoch in range(1, epochs + 1): fan_out = get_adaptive_fan_out(epoch, epochs) train_loader = NeighborLoader( data, num_neighbors=[fan_out] * num_layers, batch_size=batch_size, shuffle=True ) avg_loss = model.train_batch(train_loader, optimizer, epoch) Subgraph-Based Evaluation ~~~~~~~~~~~~~~~~~~~~~~~~~~ For billion-scale graphs like Papers100M, evaluate on manageable subgraphs: .. code-block:: python # Extract labeled subgraph labeled_indices = torch.cat([train_idx, valid_idx, test_idx]) # Get subgraph containing only labeled nodes and their edges from torch_geometric.utils import subgraph edge_index_sub, _ = subgraph( labeled_indices, edge_index, relabel_nodes=True, num_nodes=data.num_nodes ) # Only compute embeddings for labeled nodes z_labeled, _ = inference_embeddings( model, data, conf, logger, device, labeled_indices=labeled_indices ) # Evaluate on subgraph struct_results = structure_metrics( edge_index_sub, pred, metrics=['Mod', 'Cond'] ) Performance Profiling ~~~~~~~~~~~~~~~~~~~~~ Identify bottlenecks in your pipeline: .. code-block:: python import torch.profiler as profiler with profiler.profile( activities=[ profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA, ], record_shapes=True, profile_memory=True, ) as prof: for epoch in range(1, 6): # Profile first 5 epochs avg_loss = model.train_batch(train_loader, optimizer, epoch) # Print profiling results print(prof.key_averages().table( sort_by="cuda_time_total", row_limit=10 )) # Export for visualization prof.export_chrome_trace("trace.json") # View at chrome://tracing Example output: .. code-block:: text --------------------------------- ------------ ------------ ------------ Name Self CPU % Self CUDA % CUDA total --------------------------------- ------------ ------------ ------------ aten::matmul 15.2% 42.3% 1.234s aten::index_select 8.4% 18.7% 543ms Neighbor Sampling 12.1% 0.0% 352ms --------------------------------- ------------ ------------ ------------ Real-World Case Studies ----------------------- Case Study 1: Reddit (233K nodes) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Challenge**: 23M edges cause OOM with full-batch training. **Solution**: .. code-block:: yaml gnn_type: sage mini_batch: true batch_size: 2048 fan_out: 10 num_layers: 2 infer_fan_out: -1 # Full neighborhood for inference **Results**: - Training time: 142s (vs OOM with full-batch) - Peak memory: 4.2 GB (vs 32+ GB required for full-batch) - NMI: 45.7 (vs 46.2 with full-batch, only 1% accuracy loss) Case Study 2: ogbn-products (2.4M nodes) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Challenge**: Cannot fit even embeddings in GPU memory. **Solution**: .. code-block:: yaml gnn_type: sage mini_batch: true batch_size: 8192 fan_out: 10 num_layers: 2 # Use larger inference batch infer_batch_size: 16384 force_full_batch_inference: false allow_cpu_fallback: true **Results**: - Training time: 8.3 min - Peak memory: 12 GB - NMI: 52.3 - Inference: GPU mini-batch (CPU fallback not needed) Case Study 3: ogbn-papers100M (111M nodes) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Challenge**: Billion-scale graph, cannot store in single-GPU memory. **Solution**: .. code-block:: yaml gnn_type: sage mini_batch: true batch_size: 8192 fan_out: 5 # Lower fan-out for extreme scale num_layers: 2 # Inference optimizations infer_batch_size: 32768 infer_fan_out: 10 # Limited neighborhood for feasibility **Code**: .. code-block:: python # Load with labeled subset x, edge_index, y, train_idx, valid_idx, test_idx, labeled_subgraph = \ get_dataset('Papers100M', root='data', return_splits=True) labeled_indices = torch.cat([train_idx, valid_idx, test_idx]) # Only 1.8M nodes (1.6% of total) need embeddings # Train on full graph, infer on labeled subset z, _ = inference_embeddings( model, data, conf, logger, device, labeled_indices=labeled_indices # Huge memory savings ) **Results**: - Training time: ~2 hours - Peak memory: 24 GB (single A100 GPU) - Evaluated on 1.8M labeled nodes instead of 111M - Structure metrics computed on labeled subgraph (23M edges vs 1.6B) Summary and Recommendations ---------------------------- Quick Decision Guide ~~~~~~~~~~~~~~~~~~~~ **When to use full-batch:** - Graph has < 10K nodes - GPU has sufficient memory (check with ``dataset_stats.py``) - Need maximum accuracy **When to use mini-batch:** - Graph has > 10K nodes - Getting OOM errors - Training on CPU - Want to scale to millions of nodes **Recommended configurations:** .. code-block:: python # Small graphs (< 10K nodes) config = { 'mini_batch': False, 'cached': True } # Medium graphs (10K - 1M nodes) config = { 'mini_batch': True, 'batch_size': 2048, 'fan_out': 10, 'infer_batch_size': 8192, 'infer_fan_out': -1, 'force_full_batch_inference': True } # Large graphs (1M - 10M nodes) config = { 'mini_batch': True, 'batch_size': 4096, 'fan_out': 10, 'infer_batch_size': 16384, 'infer_fan_out': 15, 'force_full_batch_inference': False } # Extreme scale (> 10M nodes) config = { 'mini_batch': True, 'batch_size': 8192, 'fan_out': 5, 'infer_batch_size': 32768, 'infer_fan_out': 10, 'allow_cpu_fallback': True } Key Takeaways ~~~~~~~~~~~~~ 1. **PyAGC scales from thousands to billions of nodes** using neighbor sampling 2. **Automatic fallback** handles OOM gracefully without code changes 3. **Separate inference config** optimizes for speed and accuracy 4. **Smart checkpointing** enables resuming long training runs Next Steps ~~~~~~~~~~ - Check `benchmark scripts `_ for billion-scale example - Review existing implementations in the :doc:`models module <../modules/models>`