Strengths
- Reduces model size by up to 50x, enabling deployment on resource-constrained devices.
- Inference is up to 500x faster on CPU than with the original Sentence Transformer models.
- Minimal performance loss relative to original transformer models.
- Runs locally without requiring external API calls.
- Easy integration with Hugging Face Hub and popular vector databases.
Limitations
- Distillation incurs a small but nonzero accuracy drop compared to the full transformer models it is derived from.
- The static token-averaging approach assigns each token a fixed vector, so it cannot produce context-dependent representations; tasks that hinge on word sense or word order suit dynamic transformer models better.
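To make the last limitation concrete, here is a minimal sketch of static token averaging using a toy, hypothetical embedding table (the vectors and vocabulary are illustrative, not taken from any real model): a sentence embedding is simply the mean of its tokens' precomputed vectors, so a token like "bank" contributes the exact same vector regardless of context.

```python
import numpy as np

# Toy static embedding table (hypothetical values). In a distilled static
# model, each token's vector is precomputed once and never changes.
EMBEDDINGS = {
    "bank":  np.array([0.2, 0.8, 0.1]),
    "river": np.array([0.9, 0.1, 0.3]),
    "money": np.array([0.1, 0.7, 0.9]),
}

def embed(sentence: str) -> np.ndarray:
    """Sentence embedding = mean of the static vectors of its known tokens."""
    tokens = [t for t in sentence.lower().split() if t in EMBEDDINGS]
    return np.mean([EMBEDDINGS[t] for t in tokens], axis=0)

# "bank" contributes an identical vector in both sentences, even though its
# meaning differs; a transformer would produce context-dependent vectors.
river_bank = embed("river bank")
money_bank = embed("money bank")
```

This fixed lookup is exactly what makes static models so fast on CPU (no attention layers to evaluate), and also why they cannot disambiguate word senses.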