Strengths
- Supports FP8 precision with automatic scaling-factor management for mixed-precision training.
- Includes fused kernels optimized for Transformer operations across multiple precisions.
- Integrates with PyTorch and JAX frameworks via automatic detection during installation.
- Provides a framework-agnostic C++ API for custom integration.
- Supports both training and inference with reduced memory usage on supported NVIDIA GPUs.
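The "automatic scaling factor management" noted above refers to what Transformer Engine calls delayed scaling: the library tracks a rolling history of each tensor's absolute maximum and derives a scale so values fit FP8's narrow representable range. Below is a minimal pure-Python sketch of the idea; the class name `Fp8Scaler` and its parameters are illustrative inventions, not Transformer Engine's actual API.

```python
FP8_E4M3_MAX = 448.0  # largest representable magnitude in the FP8 E4M3 format

class Fp8Scaler:
    """Illustrative sketch (not the library's API): tracks a rolling
    history of absolute maxima (amax) and derives a per-tensor scale
    so scaled values fit the FP8 representable range."""

    def __init__(self, amax_history_len=16):
        self.history = []
        self.amax_history_len = amax_history_len
        self.scale = 1.0

    def update(self, tensor_values):
        # Record the current step's absolute maximum.
        amax = max(abs(v) for v in tensor_values)
        self.history.append(amax)
        if len(self.history) > self.amax_history_len:
            self.history.pop(0)
        # Delayed scaling: derive the scale from the max over recent
        # history, not just the current step, so it changes smoothly.
        hist_amax = max(self.history)
        if hist_amax > 0:
            self.scale = FP8_E4M3_MAX / hist_amax
        return self.scale

scaler = Fp8Scaler()
scale = scaler.update([0.5, -2.0, 1.5])          # amax = 2.0 -> scale = 448/2 = 224
scaled = [v * scale for v in [0.5, -2.0, 1.5]]   # values now span the FP8 range
```

In the real library this bookkeeping happens per tensor inside the fused kernels, which is why users get FP8 training without manually choosing scales; the sketch only shows the arithmetic being automated.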
Limitations
- Requires specific NVIDIA hardware: Ampere or newer GPUs for base support, and Hopper/Ada/Blackwell GPUs for FP8 precision.
- Open issues include build failures in Docker with certain PyTorch/CUDA versions and on L40S GPUs.
- Development builds are unsupported and not recommended for general use.