# gemma-3-270m-it-FlashHead

Optimized version of gemma-3-270m-it using FlashHead, Embedl’s efficient replacement for the language-model head, which reduces model size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:
- FlashHead
- vLLM plugin via `flash-head`
FlashHead matches the gemma-3-270m-it baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency.
## Quickstart

```shell
pip install flash-head
vllm serve embedl/gemma-3-270m-it-FlashHead
```
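Once the server is up, it exposes vLLM's OpenAI-compatible API (by default at `http://localhost:8000/v1`). As a minimal sketch, a chat-completions request body can be assembled like this and sent with any HTTP client or the `openai` package:

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint that
# `vllm serve` exposes (default base URL: http://localhost:8000/v1).
payload = {
    "model": "embedl/gemma-3-270m-it-FlashHead",
    "messages": [{"role": "user", "content": "Write a haiku about coffee."}],
    "max_tokens": 128,
    "temperature": 0.0,
}
body = json.dumps(payload)
print(body)
```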
## Model Details
| Field | Value |
|---|---|
| Base Model | gemma-3-270m-it |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head |
| Developers | Embedl |
| Licenses | Upstream: Gemma Terms of Use. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |
## Optimizations

- FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
- vLLM Plugin Integration - compatible with vLLM (0.14.0+) via the `flash-head` plugin.
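For intuition about what is being replaced: a dense LM head is a single large projection from the model's final hidden state to vocabulary logits. The sketch below uses tiny toy dimensions (not the real gemma-3 sizes, and not FlashHead's internals, which are not described here) just to show the baseline operation:

```python
import numpy as np

# Illustrative toy sizes only; the real hidden size and vocabulary are far
# larger, which is why the dense head can dominate a small model's parameter
# count (params = vocab_size * hidden_size). FlashHead swaps this dense
# projection for a lighter-weight head.
hidden_size, vocab_size = 8, 32
hidden_state = np.random.randn(1, hidden_size).astype(np.float32)
head_weight = np.random.randn(vocab_size, hidden_size).astype(np.float32)

logits = hidden_state @ head_weight.T  # one matmul per generated token
print(logits.shape)  # (1, 32): one logit per vocabulary entry
```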
## Performance

### Token Generation Speed (RTX 3500 Ada, batch size = 1)
| Precision | Tokens/sec | Speedup vs BF16 |
|---|---|---|
| BF16 baseline | 397 | 1.0× |
| FlashHead (Embedl) | 526 | 1.32× |
| W4A16 baseline | 420 | 1.06× |
| FlashHead W4A16 (Embedl) | 568 | 1.43× |
FlashHead improves end-to-end speed by 1.35× over the state-of-the-art W4A16 baseline (568 vs. 420 tokens/sec) while maintaining accuracy parity.
Measurement setup: vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.
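The speedup column follows directly from the throughput figures; as a quick check:

```python
# Tokens/sec from the table above; speedup = tokens/sec divided by the
# BF16 baseline's tokens/sec.
tokens_per_sec = {
    "BF16 baseline": 397,
    "FlashHead": 526,
    "W4A16 baseline": 420,
    "FlashHead W4A16": 568,
}
baseline = tokens_per_sec["BF16 baseline"]
speedups = {name: round(tps / baseline, 2) for name, tps in tokens_per_sec.items()}
print(speedups)  # FlashHead: 1.32, W4A16 baseline: 1.06, FlashHead W4A16: 1.43
```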
### Accuracy (Parity with Baseline)
| Method | MMLU-Pro | IFEval | BBH | TruthfulQA | GSM8K |
|---|---|---|---|---|---|
| Baseline | 0.09 | 0.32 | 0.27 | 0.31 | 0.02 |
| FlashHead | 0.09 | 0.30 | 0.27 | 0.32 | 0.02 |
FlashHead closely matches baseline accuracy.
## Installation

```shell
pip install flash-head
```

The `flash-head` vLLM plugin is required. It activates automatically at startup.
## Usage Examples

Note (vLLM context length): `max_model_len=32768` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).
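To see why, note that KV-cache memory grows linearly with `max_model_len`. The back-of-the-envelope estimate below uses placeholder architecture numbers, not the actual gemma-3-270m-it config; substitute the real values from the model's `config.json`:

```python
# Placeholder architecture values for illustration only (bf16 = 2 bytes).
num_layers, num_kv_heads, head_dim, dtype_bytes = 24, 4, 128, 2

# Each token stores one K and one V vector per layer per KV head.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
max_model_len = 32_768
kv_cache_gib = bytes_per_token * max_model_len / 2**30
print(f"{kv_cache_gib:.2f} GiB")  # 1.50 GiB for these placeholder values
```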
### vLLM Inference
```python
from vllm import LLM, SamplingParams

model_id = "embedl/gemma-3-270m-it-FlashHead"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=32768)
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```
## Limitations
- Requires vLLM >= 0.14.0
- Currently optimized for NVIDIA RTX GPUs
## Roadmap

Planned improvements:
- Advanced mixed-precision quantization
- Hugging Face Transformers generation support
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in Transformers and vLLM
- Compatibility with GGUF, MLC, llama.cpp, Ollama, etc.
- Broader model coverage (larger models, VLMs, VLAs)
## License
- Upstream: Gemma Terms of Use.
- Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)
## Contact

- Enterprise & Commercial Inquiries: models@embedl.com
- Technical Issues & Early Access: https://github.com/embedl/flash-head
- More Information & Model Releases: https://embedl.com
## Partner & Developer Opportunities
If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:
- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities
Contact: models@embedl.com