The Solution - multi-token-prediction

Multi-Token Prediction (MTP) is a training objective and architectural technique for large language models in which the model predicts several future tokens at each position, rather than only the next one. Extending the prediction scope beyond the immediate next token densifies the training signal, since each position contributes several prediction targets instead of one, which can improve data efficiency and performance on evaluation benchmarks. Models such as DeepSeek-V3 and GLM-4.
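To make the idea of a denser training signal concrete, the following is a minimal sketch of an MTP-style loss in plain Python. It is illustrative only and is not the implementation used by any particular model. The function names (`mtp_targets`, `cross_entropy`, `mtp_loss`), the `depth` parameter, and the layout of `logits_per_depth` are all assumptions introduced for this example, as is the choice to average the loss uniformly over positions and depths.

```python
import math

def mtp_targets(tokens, depth):
    """For each position i, return the next `depth` tokens, tokens[i+1 : i+1+depth].
    Positions that lack a full window of future tokens are dropped (an assumption)."""
    return [tokens[i + 1 : i + 1 + depth] for i in range(len(tokens) - depth)]

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mtp_loss(logits_per_depth, tokens, depth):
    """Average cross-entropy over every position and every prediction depth.
    logits_per_depth[d][i] is assumed to hold the logits produced at position i
    for the token d + 1 steps ahead."""
    total, count = 0.0, 0
    for i, window in enumerate(mtp_targets(tokens, depth)):
        for d, target in enumerate(window):
            total += cross_entropy(logits_per_depth[d][i], target)
            count += 1
    return total / count

# Hypothetical toy input: 4 tokens, vocabulary of size 4, predicting 2 tokens ahead.
tokens = [0, 1, 2, 3]
depth = 2
# Uniform logits for 2 depths x 2 usable positions, so every target has probability 1/4.
uniform = [[[0.0] * 4 for _ in range(2)] for _ in range(depth)]
loss = mtp_loss(uniform, tokens, depth)  # equals log(4) for uniform predictions
```

With `depth = 1` the same function reduces to ordinary next-token cross-entropy. Larger values of `depth` add further targets per position, which is the sense in which the objective densifies the training signal.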