Composable Parallelism
Supports 4D parallelism (data, tensor, pipeline, and expert) as modular components that can be combined freely.
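One way to picture how the four dimensions compose is as a 4D grid of ranks, where each global rank has one coordinate per parallelism dimension. The sketch below is only an illustration under that assumption: the function name `mesh_coords`, the dimension labels, and the row-major layout are hypothetical and not taken from the library.

```python
from math import prod


def mesh_coords(rank: int, dims: dict[str, int]) -> dict[str, int]:
    """Map a global rank to its coordinate along each parallelism dimension.

    Assumes a row-major layout: the last dimension in `dims` varies fastest.
    """
    world_size = prod(dims.values())
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")

    coords = {}
    # Peel off coordinates starting from the fastest-varying dimension.
    for name in reversed(list(dims)):
        size = dims[name]
        coords[name] = rank % size
        rank //= size
    # Return coordinates in the order the dimensions were declared.
    return {name: coords[name] for name in dims}


# Hypothetical degrees: 2 x 2 x 2 x 2 = 16 ranks in total.
dims = {"dp": 2, "pp": 2, "ep": 2, "tp": 2}
coords = mesh_coords(5, dims)
```

Ranks that share every coordinate except one form a process group for that one dimension, which is what lets each kind of parallelism be configured independently of the others.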
Elastic Scaling and Fault Tolerance
Scales elastically as available compute changes, and handles rank failures through a web API.
Advanced Checkpointing
Provides selective and full activation checkpointing to trade recomputation for memory, along with efficient checkpoint save/load through PyTorch Distributed Checkpoint (DCP), which operates on DTensor-based state.
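The difference between full and selective activation checkpointing comes down to which layers recompute their activations in the backward pass. The sketch below shows one possible selection policy. It is an assumption made for illustration: the function `layers_to_checkpoint`, its mode names, and the "every n-th layer" rule are hypothetical, and the library may select layers or operations differently.

```python
def layers_to_checkpoint(num_layers: int, mode: str, every_n: int = 2) -> list[int]:
    """Return the indices of layers whose activations would be recomputed.

    mode: "none" (keep all activations), "full" (recompute every layer),
    or "selective" (recompute every `every_n`-th layer, a hypothetical policy).
    """
    if mode == "none":
        return []
    if mode == "full":
        return list(range(num_layers))
    if mode == "selective":
        if every_n < 1:
            raise ValueError("every_n must be at least 1")
        return list(range(0, num_layers, every_n))
    raise ValueError(f"unknown mode: {mode!r}")


selected = layers_to_checkpoint(6, "selective", every_n=2)
```

In a PyTorch model, the selected layers would then typically be wrapped so that their forward pass is rerun during backward, for example with `torch.utils.checkpoint.checkpoint`.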
Comprehensive Logging and Debugging
Logs metrics such as loss, GPU memory usage, throughput, TFLOPs, and model FLOPs utilization (MFU) to TensorBoard or Weights & Biases, and includes CPU/GPU and memory profiling tools.
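For reference, MFU is the ratio of achieved training FLOPs per second to the hardware's peak. The sketch below uses the common approximation of 6 FLOPs per parameter per token for a dense transformer (forward plus backward, ignoring attention). The function name and the example numbers are assumptions, and the library's own accounting may differ.

```python
def model_flops_utilization(
    tokens_per_second: float,
    num_params: float,
    peak_flops_per_second: float,
) -> float:
    """Estimate MFU with the ~6 * N FLOPs-per-token approximation.

    peak_flops_per_second is the theoretical peak of the device(s) that
    produced `tokens_per_second`; look it up for your hardware and dtype.
    """
    if peak_flops_per_second <= 0:
        raise ValueError("peak_flops_per_second must be positive")
    achieved_flops_per_second = 6.0 * num_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical numbers chosen only to keep the arithmetic easy to follow.
mfu = model_flops_utilization(
    tokens_per_second=1_000,
    num_params=1e9,
    peak_flops_per_second=1e13,
)
```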
Flexible Configuration
Configures training through TOML files, covering settings such as batch size and learning rate schedulers, and ships helper scripts for downloading Hugging Face tokenizers.
Integrated Dataset Support
Includes a checkpointable data loader supporting the C4 dataset and Hugging Face tokenizers for streamlined data preparation.
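A data loader is checkpointable when its position can be saved with the rest of the training state and restored on resume, so that no samples are repeated or skipped. The sketch below shows the idea on an in-memory list. The class `CheckpointableLoader` is hypothetical; it only mirrors the `state_dict` / `load_state_dict` convention used across PyTorch and is not the library's implementation.

```python
class CheckpointableLoader:
    """Yield fixed-size batches and expose the read position as state."""

    def __init__(self, samples, batch_size: int):
        self.samples = list(samples)
        self.batch_size = batch_size
        self._pos = 0  # index of the next sample to yield

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self.samples):
            raise StopIteration
        batch = self.samples[self._pos : self._pos + self.batch_size]
        self._pos += len(batch)
        return batch

    def state_dict(self) -> dict:
        return {"pos": self._pos}

    def load_state_dict(self, state: dict) -> None:
        self._pos = state["pos"]


# Consume one batch, save the state, then resume in a fresh loader.
loader = CheckpointableLoader(range(10), batch_size=4)
first_batch = next(loader)
saved = loader.state_dict()

resumed = CheckpointableLoader(range(10), batch_size=4)
resumed.load_state_dict(saved)
second_batch = next(resumed)
```

A streaming dataset such as C4 would need to track more than a single index (for example, the shard and the offset within it), but the save-and-restore contract stays the same.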