# gemma-3-270m-it-FlashHead

Optimized version of gemma-3-270m-it using FlashHead, Embedl’s efficient replacement for the language-model head, which reduces model size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:
- FlashHead
- vLLM plugin via `flash-head`
FlashHead matches the gemma-3-270m-it baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency.
## Quickstart

```shell
pip install flash-head
vllm serve embedl/gemma-3-270m-it-FlashHead
```
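Once the server is up, it exposes vLLM's OpenAI-compatible API (by default at `http://localhost:8000/v1`). As a minimal sketch, a chat-completions request body can be assembled like this and sent with any HTTP client or the `openai` package:

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint that
# `vllm serve` exposes (default base URL: http://localhost:8000/v1).
payload = {
    "model": "embedl/gemma-3-270m-it-FlashHead",
    "messages": [{"role": "user", "content": "Write a haiku about coffee."}],
    "max_tokens": 128,
    "temperature": 0.0,
}
body = json.dumps(payload)
print(body)
```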
## Model Details
| Field | Value |
|---|---|
| Base Model | gemma-3-270m-it |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head |
| Developers | Embedl |
| Licenses | Upstream: Gemma Terms of Use. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |
## Optimizations

- FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
- vLLM Plugin Integration - compatible with vLLM (0.14.0+) via the `flash-head` plugin.
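For intuition about what is being replaced: a dense LM head is a single large projection from the model's final hidden state to vocabulary logits. The sketch below uses tiny toy dimensions (not the real gemma-3 sizes, and not FlashHead's internals, which are not described here) just to show the baseline operation:

```python
import numpy as np

# Illustrative toy sizes only; the real hidden size and vocabulary are far
# larger, which is why the dense head can dominate a small model's parameter
# count (params = vocab_size * hidden_size). FlashHead swaps this dense
# projection for a lighter-weight head.
hidden_size, vocab_size = 8, 32
hidden_state = np.random.randn(1, hidden_size).astype(np.float32)
head_weight = np.random.randn(vocab_size, hidden_size).astype(np.float32)

logits = hidden_state @ head_weight.T  # one matmul per generated token
print(logits.shape)  # (1, 32): one logit per vocabulary entry
```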
## Performance

### Token Generation Speed (RTX 3500 Ada, batch size = 1)
| Precision | Tokens/sec | Speedup vs BF16 |
|---|---|---|
| BF16 baseline | 397 | 1.0× |
| FlashHead (Embedl) | 526 | 1.32× |
| W4A16 baseline | 420 | 1.06× |
| FlashHead W4A16 (Embedl) | 568 | 1.43× |
FlashHead improves end-to-end speed by 1.35× over the state-of-the-art W4A16 baseline (568 vs. 420 tokens/sec) while maintaining accuracy parity.
Measurement setup: vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.
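The speedup column follows directly from the throughput figures; as a quick check:

```python
# Tokens/sec from the table above; speedup = tokens/sec divided by the
# BF16 baseline's tokens/sec.
tokens_per_sec = {
    "BF16 baseline": 397,
    "FlashHead": 526,
    "W4A16 baseline": 420,
    "FlashHead W4A16": 568,
}
baseline = tokens_per_sec["BF16 baseline"]
speedups = {name: round(tps / baseline, 2) for name, tps in tokens_per_sec.items()}
print(speedups)  # FlashHead: 1.32, W4A16 baseline: 1.06, FlashHead W4A16: 1.43
```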
### Accuracy (Parity with Baseline)
| Method | MMLU-Pro | IFEval | BBH | TruthfulQA | GSM8K |
|---|---|---|---|---|---|
| Baseline | 0.09 | 0.32 | 0.27 | 0.31 | 0.02 |
| FlashHead | 0.09 | 0.30 | 0.27 | 0.32 | 0.02 |
FlashHead closely matches baseline accuracy.
## Installation

```shell
pip install flash-head
```

The `flash-head` vLLM plugin is required. It activates automatically at startup.
## Usage Examples

Note (vLLM context length): `max_model_len=32768` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).
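To see why, note that KV-cache memory grows linearly with `max_model_len`. The back-of-the-envelope estimate below uses placeholder architecture numbers, not the actual gemma-3-270m-it config; substitute the real values from the model's `config.json`:

```python
# Placeholder architecture values for illustration only (bf16 = 2 bytes).
num_layers, num_kv_heads, head_dim, dtype_bytes = 24, 4, 128, 2

# Each token stores one K and one V vector per layer per KV head.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
max_model_len = 32_768
kv_cache_gib = bytes_per_token * max_model_len / 2**30
print(f"{kv_cache_gib:.2f} GiB")  # 1.50 GiB for these placeholder values
```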
### vLLM Inference
```python
from vllm import LLM, SamplingParams

model_id = "embedl/gemma-3-270m-it-FlashHead"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=32768)
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```
## Limitations
- Requires vLLM >= 0.14.0
- Currently optimized for NVIDIA RTX GPUs
## Roadmap

Planned improvements:
- Advanced mixed-precision quantization
- Hugging Face Transformers generation support
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in Transformers and vLLM
- Compatibility with GGUF, MLC, llama.cpp, Ollama, etc.
- Broader model coverage (larger models, VLMs, VLAs)
## License
- Upstream: Gemma Terms of Use.
- Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)
## Contact

- Enterprise & Commercial Inquiries: models@embedl.com
- Technical Issues & Early Access: https://github.com/embedl/flash-head
- More Information & Model Releases: https://embedl.com
## Partner & Developer Opportunities
If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:
- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities
Contact: models@embedl.com