COR Brief

Transformer Engine

Transformer Engine is an open-source library from NVIDIA that accelerates Transformer model training and inference on NVIDIA GPUs. It supports FP8 precision on the Hopper, Ada, and Blackwell GPU architectures, reducing memory usage and improving throughput while maintaining model accuracy. The library provides optimized building blocks and fused kernels for Transformer layers and integrates with popular deep learning frameworks such as PyTorch and JAX through an automatic mixed precision API. It also offers a framework-agnostic C++ API for broader integration needs. The library targets developers working with Transformer-based models on NVIDIA hardware, particularly those using newer GPU architectures with FP8 support. Installation requires Linux, CUDA 12.1 or higher, and a compatible NVIDIA GPU. Transformer Engine is distributed under the Apache 2.0 license and is free to use.
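As a quick sanity check before installing, you can verify whether the local GPU meets the FP8 hardware requirement (Compute Capability 8.9 or above, per the prerequisites later in this page). This sketch uses only standard PyTorch calls; the helper function name is illustrative:

```python
import torch

def supports_fp8(capability):
    """FP8 needs Compute Capability 8.9 (Ada) or higher (Hopper/Blackwell)."""
    return capability >= (8, 9)

if torch.cuda.is_available():
    # get_device_capability() returns a (major, minor) tuple for the current GPU.
    print("FP8-capable GPU:", supports_fp8(torch.cuda.get_device_capability()))
else:
    print("No CUDA device detected")
```

Note that this only checks the FP8 floor; GPUs as old as Ampere can still use Transformer Engine's FP16/BF16 paths.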

Updated Dec 16, 2025

Transformer Engine is an NVIDIA open-source library that accelerates Transformer models on supported GPUs using FP8 precision.

Pricing
open-source
Category
Code & Development
Company
NVIDIA
01
Enables FP8 precision on NVIDIA Hopper, Ada, and Blackwell GPUs to reduce memory utilization during training and inference.
02
Includes optimized fused kernels that improve performance across FP8, FP16, and BF16 precisions on supported GPUs.
03
Provides automatic mixed precision API integration with PyTorch and JAX, detecting the framework during installation for seamless use.
04
Offers a C++ API with FP8 kernel support for integration with custom deep learning libraries beyond Python frameworks.

Training Large Transformer Models

Developers training large-scale Transformer models on NVIDIA GPUs can leverage FP8 precision to reduce memory usage and accelerate training.

Inference Optimization

Deploying Transformer models for inference on supported NVIDIA GPUs benefits from optimized kernels and lower memory footprint.

1
Verify System Requirements
Ensure your system runs Linux with CUDA 12.1 or higher (12.8+ for Blackwell GPUs), cuDNN 9.3+, Python 3.12 recommended, and an NVIDIA GPU with Compute Capability 8.9 or above for FP8 support.
2
Install Transformer Engine
Install via pip using the command: pip3 install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@stable. The installer sets the NVTE_FRAMEWORK environment variable automatically if needed.
3
Import and Use in PyTorch
Import the library in your PyTorch code with: import transformer_engine.pytorch as te, then use modules such as te.Linear to build Transformer layers.
4
Run on CUDA Device
Execute your model on a CUDA device, for example by creating input tensors on the GPU: inp = torch.randn(..., device='cuda').
5
Refer to Documentation
Consult the Quickstart Notebook and official documentation for detailed examples and advanced usage.
Pricing
Model: open-source

Transformer Engine is free and open-source under the Apache 2.0 license.

Assessment
Strengths
  • Supports FP8 precision with automatic scaling factor management for mixed precision training.
  • Includes fused kernels optimized for Transformer operations across multiple precisions.
  • Integrates with PyTorch and JAX frameworks via automatic detection during installation.
  • Provides a framework-agnostic C++ API for custom integration.
  • Supports both training and inference with reduced memory usage on supported NVIDIA GPUs.
Limitations
  • Requires specific NVIDIA hardware: Ampere or newer GPUs for base support, and Hopper/Ada/Blackwell GPUs for FP8 precision.
  • Open issues include build failures in Docker with certain PyTorch/CUDA versions and on L40S GPUs.
  • Development builds are unsupported and not recommended for general use.