Key Features

What you can do

Scalable Multi-GPU/TPU Training

Supports training and deployment on anywhere from a single GPU to clusters of 1,000+ GPUs or TPUs, with optimizations for distributed computing.
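As a back-of-the-envelope illustration of how data-parallel training scales across devices, the sketch below uses the common convention that the effective (global) batch size is the product of micro-batch size, gradient-accumulation steps, and device count; the tool's exact accounting may differ.

```python
# Sketch: effective batch-size scaling under data-parallel training.
# Convention assumed here (standard, but not confirmed for this tool):
#   global batch = micro batch x grad-accumulation steps x devices.

def global_batch_size(micro_batch: int, grad_accum_steps: int, devices: int) -> int:
    """Each device processes `micro_batch` samples per forward pass and
    accumulates gradients `grad_accum_steps` times before an optimizer step."""
    return micro_batch * grad_accum_steps * devices

# Scaling from 1 GPU to a 1000-GPU cluster keeps per-device memory use
# constant while multiplying throughput:
single = global_batch_size(micro_batch=4, grad_accum_steps=8, devices=1)      # 32
cluster = global_batch_size(micro_batch=4, grad_accum_steps=8, devices=1000)  # 32000
```

This is why the same micro-batch setting can serve both a laptop GPU and a large cluster: only the device count changes.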

Configurable Training via YAML

Lets users customize training parameters, such as micro-batch sizes and LoRA finetuning settings, through YAML configuration files.
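A YAML-driven training run might look like the fragment below. The field names (`micro_batch_size`, `lora`, etc.) are illustrative assumptions and may not match this tool's actual schema; consult its config reference for the real keys.

```yaml
# Hypothetical training configuration -- field names are illustrative
# and may not match this tool's actual YAML schema.
model_name: EleutherAI/pythia-160m
precision: bf16-mixed

train:
  micro_batch_size: 4        # samples per device per forward pass
  global_batch_size: 64      # effective batch after gradient accumulation
  max_steps: 1000

lora:                        # LoRA finetuning hyperparameters
  r: 8
  alpha: 16
  dropout: 0.05
```

Keeping hyperparameters in a file like this makes runs reproducible and easy to diff across experiments.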

Memory Optimization Techniques

Incorporates Flash Attention, Fully Sharded Data Parallel (FSDP), and low-precision formats (fp4/8/16/32) to reduce GPU memory usage during training.
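The memory impact of those precision settings follows directly from bytes per parameter, as the rough estimate below shows. It counts only raw weights; optimizer state and activations add more, and FSDP shards all of these across devices, which this simple sketch does not model.

```python
# Sketch: why lower-precision formats shrink the memory needed to hold
# model weights. Bytes per parameter for each format (fp4 is half a byte):
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weight_memory_gib(n_params: int, precision: str) -> float:
    """GiB required to store `n_params` weights at the given precision.
    Weights only -- optimizer state, gradients, and activations are extra."""
    return n_params * BYTES_PER_PARAM[precision] / 2**30

# Raw weights of a 7B-parameter model:
fp32_gib = weight_memory_gib(7_000_000_000, "fp32")  # ~26.1 GiB
fp4_gib = weight_memory_gib(7_000_000_000, "fp4")    # ~3.3 GiB
```

Dropping from fp32 to fp4 is an 8x reduction in weight memory, which is what makes finetuning large models feasible on a single consumer GPU.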

Support for 20+ Large Language Models

Provides recipes and command-line tools for pretraining, finetuning, and deploying a variety of LLMs, including small models such as EleutherAI/pythia-160m.

Custom Dataset Integration

Allows users to train models on their own datasets via simple command-line invocations, with readable code that is easy to modify.
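One common way to prepare a custom dataset is a JSON file of instruction/input/output records. That schema is a widespread convention for instruction-tuning data, not a confirmed requirement of this tool; check its dataset documentation for the exact format it expects.

```python
import json
from pathlib import Path

# Sketch: writing a custom finetuning dataset to disk. The
# instruction/input/output fields below are a common convention for
# instruction-tuning data and are an assumption here, not this tool's
# documented schema.
samples = [
    {
        "instruction": "Summarize the following text.",
        "input": "Large language models are trained on vast text corpora.",
        "output": "LLMs learn from large text datasets.",
    },
]

path = Path("my_dataset.json")
path.write_text(json.dumps(samples, indent=2))

# Round-trip check: the file parses back into the same records.
loaded = json.loads(path.read_text())
assert loaded == samples
```

From there, training on the custom file is typically a matter of pointing the finetuning command or config at its path.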

Comprehensive Documentation and Tutorials

Offers extensive guides and beginner-friendly workflows to assist users in setting up and using the tool effectively.