Single-Token Prediction
Standard autoregressive objective that predicts one token at a time; generally less data-efficient and less accurate on code tasks than MTP.
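The difference between the two objectives can be sketched as a pair of loss functions. This is a minimal NumPy illustration, not a reference implementation: the function names, the array shapes, and the choice of one set of logits per future offset (with the per-head losses simply summed) are assumptions made for the example.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last (vocabulary) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def next_token_loss(logits, tokens):
    # Single-token prediction.
    # logits: (T, V) scores; row t is assumed to be computed from tokens[:t+1].
    # tokens: (T,) integer ids; row t is scored against tokens[t+1].
    logp = log_softmax(logits[:-1])
    targets = tokens[1:]
    return -logp[np.arange(len(targets)), targets].mean()

def multi_token_loss(head_logits, tokens):
    # Multi-token prediction (illustrative assumption: n output heads).
    # head_logits: (n, T, V); head k at position t is scored against
    # tokens[t+1+k]. Requires T > n. Per-head losses are summed.
    total = 0.0
    for k, logits in enumerate(head_logits):
        offset = k + 1
        logp = log_softmax(logits[:-offset])
        targets = tokens[offset:]
        total += -logp[np.arange(len(targets)), targets].mean()
    return total
```

With a single head, `multi_token_loss` reduces to `next_token_loss`, which makes the relationship explicit: MTP adds further prediction targets per position rather than changing how each individual target is scored.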
Multi-Token Attention (MTA)
Conditions attention weights on multiple query and key vectors via convolutions. It changes the attention mechanism itself, whereas MTP changes the output side by predicting multiple tokens.
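One way to picture the idea is a small convolution applied to the pre-softmax attention scores, so that each weight mixes scores from neighboring queries and keys. The sketch below is a simplified single-head version in NumPy. The function name, the kernel layout (current and earlier positions only), and the masking order are assumptions for illustration, and the kernel would normally be learned rather than passed in by hand.

```python
import numpy as np

def key_query_conv_attention(q, k, kernel):
    # q, k: (T, d) query and key vectors for one head.
    # kernel: (cq, ck) mixing weights (hypothetical; normally learned).
    T, d = q.shape
    cq, ck = kernel.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((T, T), dtype=bool))
    # Zero out future keys first so the convolution cannot leak them.
    scores = np.where(causal, scores, 0.0)
    mixed = np.zeros_like(scores)
    for i in range(T):
        for j in range(i + 1):
            acc = 0.0
            for a in range(cq):        # current and earlier queries
                for b in range(ck):    # current and earlier keys
                    qi, kj = i - a, j - b
                    if qi >= 0 and kj >= 0:
                        acc += kernel[a, b] * scores[qi, kj]
            mixed[i, j] = acc
    # Standard causal softmax over the mixed scores.
    mixed = np.where(causal, mixed, -np.inf)
    mixed = mixed - mixed.max(axis=-1, keepdims=True)
    weights = np.exp(mixed)
    return weights / weights.sum(axis=-1, keepdims=True)
```

With a 1x1 kernel equal to `[[1.0]]` this reduces to ordinary causal softmax attention. Larger kernels let a weight depend on several query/key pairs at once, while the output layer is left untouched.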