Looking for latency benchmarks/results for the model
Are there any published benchmarks or has anyone run their own tests they could share?
Hey, sorry for the delay. You can see the benchmarks in our GitHub repo: https://github.com/distil-labs/Distil-PII.
Overall we have found that the finetuned models conform to the JSON schema, stop hallucinating extra entities, handle obfuscated inputs and numbers (while keeping the last-4 digits), and preserve non-PII operational tokens. The performance lifts are large across model sizes: the 1B and 3B students end up on par (within one standard deviation) with the untrained 685B DeepSeek baseline under the LLM-as-a-judge metric (see the sketch after the table below). SmolLM2 is surprisingly resistant to training, but we are releasing it anyway for completeness.
| Model | # parameters | LLM-as-a-judge score |
|---|---|---|
| Deepseek 3.1 (untrained) | 685B | 0.84 +/- 0.03 |
| Llama-3.2-3B-Instruct | 3B | 0.82 +/- 0.03 |
| Llama-3.2-1B-Instruct | 1B | 0.81 +/- 0.02 |
| gemma-3-270m-it | 270M | 0.73 +/- 0.07 |
| SmolLM2-135M-Instruct | 135M | 0.25 +/- 0.05 |
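If you want to sanity-check the "conforms to the JSON schema" claim on your own outputs, here is a minimal sketch. The field names (`entities`, `type`, `value`) and entity types are hypothetical placeholders, not the actual schema; the real schema and evaluation code are in the Distil-PII repo.

```python
# Minimal sketch of a schema-conformance check on a model's JSON output.
# The schema below ("entities" / "type" / "value") is hypothetical; use the
# schema defined in the Distil-PII repo for real evaluation.
import json

ALLOWED_TYPES = {"NAME", "EMAIL", "PHONE", "CARD_LAST4"}

def is_conformant(raw_output: str) -> bool:
    """Return True if the output parses as JSON and matches the expected shape."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    entities = parsed.get("entities")
    if not isinstance(entities, list):
        return False
    for ent in entities:
        if not isinstance(ent, dict):
            return False
        if ent.get("type") not in ALLOWED_TYPES:
            return False  # hallucinated or unknown entity type
        if not isinstance(ent.get("value"), str):
            return False
    return True

# A card number reduced to its last four digits passes; non-JSON output fails.
print(is_conformant('{"entities": [{"type": "CARD_LAST4", "value": "1234"}]}'))  # True
print(is_conformant('not json at all'))                                          # False
```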
Hey, we missed the latency question in the title. Latency depends on your hardware; we release the model checkpoints, so you can easily benchmark them yourself. In general the small models will be much faster than the 685B model, but the exact numbers depend on your setup. A rough timing sketch is below.
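Here is a minimal timing sketch using Hugging Face transformers. The model ID, prompt, and generation settings are illustrative placeholders, not values from our repo; swap in the checkpoint you actually downloaded.

```python
# Minimal latency sketch: times generation for one of the released checkpoints.
# MODEL_ID, the prompt, and max_new_tokens are placeholders, not repo defaults.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"  # swap in the finetuned checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to(device)
model.eval()

prompt = "Redact the PII in: John Smith's card 4111 1111 1111 1234 was charged."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Warm-up run so the first measurement is not dominated by kernel/graph setup.
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=64)

# Time a handful of runs and report the mean per-request latency.
runs, total = 5, 0.0
for _ in range(runs):
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=64)
    total += time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"mean latency: {total / runs:.3f}s for ~{new_tokens} new tokens")
```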