# bge-small-rrf-v2: A 33M Parameter Model That Beats ColBERTv2 on 3/5 BEIR Datasets

Trained with zero human labels using a novel self-supervised signal: hybrid retrieval disagreement.
When vector search and keyword search disagree on what's relevant for a query, that disagreement reveals where the embedding model fails. We exploit this signal to fine-tune BGE-small, producing a model that better distinguishes "semantically close" from "actually relevant."
## Key Result
A 33M parameter model, fine-tuned for $0 with zero human labels, surpasses ColBERTv2 (110M parameters) on 3 out of 5 standard BEIR benchmarks:
| Dataset | Docs | ColBERTv2 (110M) | BGE-small base (33M) | This model (33M) | vs ColBERTv2 |
|---|---|---|---|---|---|
| SciFact | 5K | 0.693 | 0.646 | 0.695 | +0.2% |
| NFCorpus | 3.6K | 0.344 | 0.330 | 0.395 | +14.8% |
| SciDocs | 25K | 0.154 | 0.178 | 0.188 | +21.8% |
| FiQA | 57K | 0.356 | 0.328 | 0.328 | -7.8% |
| ArguAna | 8.6K | 0.463 | 0.419 | 0.424 | -8.4% |
Up to +19.5% NDCG improvement over the base model, with zero additional inference cost.
## Why This Matters

Most embedding improvements require at least one of:
- A larger model (more compute, more latency)
- Human-labeled training data (expensive, slow)
- A teacher model for distillation (adds complexity)
This model needs none of that. The training signal comes from running the existing hybrid retrieval pipeline and observing where its two components (vector search and keyword search) disagree. The system improves itself.
## The Training Signal: Hybrid Retrieval Disagreement
We discovered that 82% of queries produce disagreement between vector and keyword search in the top-5 results. These disagreements fall into two categories:
- Vector blind spots (51%): chunks the vector search ranks high but keyword search ignores. These are semantically similar but not actually relevant.
- Keyword blind spots (49%): chunks keyword search finds but vector search misses. These contain relevant terms but the embedding doesn't recognize their relevance.
Fine-tuning on these disagreement pairs teaches the model to fix both types of blind spots.
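A minimal sketch of how such disagreement pairs can be mined, assuming you already have ranked top-k lists from the two retrievers for a query (the function and variable names here are illustrative, not vstash's actual API). One plausible way to build the training triples is then: keyword blind spots become positives, vector blind spots become hard negatives.

```python
def mine_disagreements(vector_topk: list[str], keyword_topk: list[str]):
    """Split two top-k result lists into the two blind-spot categories.

    vector_topk / keyword_topk: document ids ranked by each retriever.
    Returns (vector_blind_spots, keyword_blind_spots).
    """
    vec, kw = set(vector_topk), set(keyword_topk)
    # Vector blind spots: ranked high by the embedding, ignored by keywords
    # -> semantically similar but likely not relevant (hard negatives).
    vector_blind_spots = [d for d in vector_topk if d not in kw]
    # Keyword blind spots: found by keywords, missed by the embedding
    # -> contain relevant terms the embedding fails to recognize (positives).
    keyword_blind_spots = [d for d in keyword_topk if d not in vec]
    return vector_blind_spots, keyword_blind_spots

# Toy example: top-5 lists for one query
negatives, positives = mine_disagreements(
    ["d1", "d2", "d3", "d4", "d5"],
    ["d3", "d6", "d1", "d7", "d5"],
)
print(negatives)  # ['d2', 'd4']
print(positives)  # ['d6', 'd7']
```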
## Training Details
| Parameter | Value |
|---|---|
| Base model | BAAI/bge-small-en-v1.5 |
| Parameters | 33M (unchanged) |
| Embedding dimension | 384 (unchanged) |
| Loss function | MultipleNegativesRankingLoss with explicit hard negatives |
| Training data | 76K (query, positive, hard_negative) triples |
| Data source | RRF (reciprocal rank fusion) disagreement on SciFact, NFCorpus, FiQA |
| Human labels | Zero |
| Epochs | 2 |
| Learning rate | 3e-6 |
| Batch size | 64 |
| Training time | ~30 min on T4 GPU |
| Training cost | $0 (Colab free tier) |
### Why MNRL, not TripletLoss?
We tested TripletLoss first. It destroyed the model (-84% NDCG after 3 epochs). TripletLoss pushes individual negatives away with brute force, distorting the embedding space. MNRL adjusts relationships across 64 documents simultaneously per batch, preserving the model's general knowledge while learning from disagreements.
| Loss Function | NDCG@10 on SciFact | Result |
|---|---|---|
| TripletLoss (3 epochs, lr=2e-5) | 0.055 | -84% (destroyed) |
| TripletLoss (1 epoch, lr=1e-6) | 0.347 | -0.03% (no effect) |
| MNRL batch-only negatives (v1) | 0.683 | +5.6% |
| MNRL + explicit hard negatives (this model) | 0.695 | +7.4% |
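To illustrate why MNRL is gentler than TripletLoss: it scores each query against every positive and hard negative in the batch and applies a single cross-entropy over the resulting similarity matrix, so one update adjusts relationships across all in-batch documents at once instead of shoving a single negative away. The following is a self-contained sketch of that loss on random embeddings, not the actual training script; `scale=20.0` mirrors the sentence-transformers default.

```python
import torch
import torch.nn.functional as F

def mnrl_loss(q, p, n, scale=20.0):
    """MultipleNegativesRankingLoss with explicit hard negatives.

    q: (B, D) query embeddings, p: (B, D) positives, n: (B, D) hard negatives.
    For query i, positive i is the "correct class"; every other positive in
    the batch plus all B hard negatives act as negatives.
    """
    q = F.normalize(q, dim=-1)
    cands = F.normalize(torch.cat([p, n]), dim=-1)  # (2B, D) candidate pool
    scores = scale * q @ cands.T                    # (B, 2B) similarity matrix
    labels = torch.arange(q.size(0))                # i-th query matches i-th positive
    return F.cross_entropy(scores, labels)

B, D = 64, 384  # batch size from the table above, BGE-small embedding dim
q, p, n = (torch.randn(B, D) for _ in range(3))
loss = mnrl_loss(q, p, n)
print(loss.item())  # a finite positive scalar; ~log(2B) for random inputs
```

Each gradient step therefore rebalances the query against 128 candidates rather than one, which is what preserves the model's general structure while still learning from the disagreement triples.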
## Usage

### With sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Stffens/bge-small-rrf-v2")

# BGE models are trained for cosine similarity; normalizing makes the
# dot product below equal to cosine similarity.
embeddings = model.encode(["your query", "your document"], normalize_embeddings=True)
similarity = embeddings[0] @ embeddings[1]
```
### With vstash (hybrid retrieval system)

```bash
pip install vstash
vstash reindex --model Stffens/bge-small-rrf-v2
vstash search "your query"
```
### Train your own version on your data

```bash
pip install vstash sentence-transformers torch
vstash retrain  # generates disagreement pairs from YOUR corpus and fine-tunes
vstash reindex --model ~/.vstash/models/retrained
```
## Reproduce From Scratch

```bash
git clone https://github.com/stffns/vstash
cd vstash
pip install -e . sentence-transformers torch

# Generate disagreement triples
python -m experiments.rrf_training_pairs --datasets scifact nfcorpus fiqa

# Train (GPU recommended)
python -m experiments.finetune_rrf --epochs 2 --lr 3e-6 --batch-size 64

# Evaluate
python -m experiments.finetune_rrf --evaluate-only
```
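The evaluation step reports NDCG@10, the standard BEIR metric used throughout this card. As a reference, here is a minimal implementation of the formula under the simplifying assumption of binary relevance (BEIR qrels can also be graded):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance: the DCG of the actual ranking divided
    by the DCG of an ideal ranking that puts all relevant docs first."""
    rels = [1 if d in relevant_ids else 0 for d in ranked_ids[:k]]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = [1] * min(len(relevant_ids), k) + [0] * k
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# A perfect ranking of the two relevant docs scores 1.0
print(ndcg_at_k(["a", "b", "c"], {"a", "b"}))  # 1.0
```

The per-dataset scores in the tables above are this quantity averaged over all test queries.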
## Limitations
- ArguAna regression: queries with 200+ words show -8.4% vs ColBERTv2. Long argumentative queries produce only 1.1% signal disagreement, leaving no training signal.
- FiQA neutral: financial queries show +0.1% vs base but -7.8% vs ColBERTv2. The disagreement signal exists (86.7%) but doesn't translate to NDCG gains on this dataset.
- English only: inherited from BGE-small-en-v1.5.
- Not tested beyond BEIR: performance on domain-specific corpora may vary.
## Citation

```bibtex
@software{vstash2026,
  author = {Steffens, Jayson},
  title  = {vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents},
  url    = {https://github.com/stffns/vstash},
  year   = {2026}
}
```
## Related
- vstash paper (Section 8.10: Self-Supervised Embedding Refinement)
- vstash GitHub
- Base model: BAAI/bge-small-en-v1.5