# RegTech-4B-Instruct

*Fine-tuned for RAG-powered banking compliance — not general knowledge.*
A specialized Qwen3-4B-Instruct model fine-tuned to excel within a Retrieval-Augmented Generation (RAG) pipeline for Italian banking regulatory compliance.
This model doesn't try to memorize regulations; it is trained to work with retrieved context: follow instructions precisely, produce structured outputs, call compliance tools, resist hallucinations, and maintain a professional tone when grounded in regulatory documents.
## What This Model Does
This fine-tuning optimizes the model's behavior within a RAG system, not its factual knowledge. Specifically:
| Task | Description |
|---|---|
| RAG Q&A | Answer regulatory questions grounded on retrieved documents |
| Tool Calling | KYC verification, risk scoring, PEP checks, SOS reporting |
| Query Expansion | Rewrite user queries with regulatory terminology for better retrieval |
| Intent Detection | Classify if a message needs document search or is conversational |
| Document Reranking | Score candidate documents by relevance |
| Structured JSON | Topic extraction, metadata, impact analysis in JSON format |
| Impact Analysis | Cross-reference external regulations against internal bank procedures |
| Hallucination Resistance | Refuse to fabricate regulations, articles, or sanctions not in context |
## Evaluation

### Methodology
We evaluate all fine-tuned models using a dynamic adversarial benchmark designed to prevent overfitting to static test sets:
- Test generation: An independent LLM generates novel, realistic test scenarios across 13 compliance-specific categories for each evaluation run. Tests are never reused.
- Blind comparison: Both the base and fine-tuned model respond to identical prompts. Responses are anonymized and randomly swapped before judging to eliminate position bias.
- Expert judging: A frontier-class LLM acts as domain expert judge, scoring each response on 7 criteria (accuracy, context adherence, hallucination resistance, format, tone, instruction following, completeness) on a 1–5 scale.
- Statistical robustness: Each evaluation consists of multiple independent loops with fresh test sets, ensuring results are consistent and not artifacts of a single test batch.
This approach produces a rigorous assessment that is hard to overfit and closely mirrors real-world compliance-assistant performance.
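As an illustration of the blind-comparison step, the sketch below anonymizes and randomly swaps the two responses before judging. The function and names are illustrative, not the actual evaluation harness:

```python
import random

def blind_compare(prompt, base_answer, tuned_answer, judge):
    """Illustrative sketch of one blind A/B judgment.

    `judge` is any callable that scores two anonymized responses on the
    seven 1-5 criteria; the real harness is not published here.
    """
    pair = [("base", base_answer), ("tuned", tuned_answer)]
    random.shuffle(pair)  # random swap eliminates position bias
    labels, answers = zip(*pair)
    scores_first, scores_second = judge(prompt, answers[0], answers[1])
    # De-anonymize only after the judge has scored both responses.
    return {labels[0]: scores_first, labels[1]: scores_second}
```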
### Results — RegTech-4B-Instruct
Evaluated across 73 blind adversarial tests over 3 independent loops.
#### Head-to-Head vs Base Model

| Result | Base | Tuned |
|---|---|---|
| Win rate (adjusted) | 45.2% | 54.8% |
| Wins | 26 | 33 |

Ties: 14 of the 73 tests.
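The adjusted win rate appears to split each tie evenly between the two models; over the 73 tests the arithmetic works out as follows:

```python
wins_base, wins_tuned, ties, total = 26, 33, 14, 73

adj_base = (wins_base + ties / 2) / total    # (26 + 7) / 73 ≈ 0.452
adj_tuned = (wins_tuned + ties / 2) / total  # (33 + 7) / 73 ≈ 0.548
```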
#### Quality Scores (1–5 scale)

| Criterion | Base | Tuned | Delta | Verdict |
|---|---|---|---|---|
| Hallucination Resistance | 3.53 | 3.89 | +0.36 | Improved |
| Tone & Professionalism | 3.90 | 4.27 | +0.37 | Improved |
| Output Format | 3.41 | 3.75 | +0.34 | Improved |
| Instruction Following | 3.14 | 3.44 | +0.30 | Improved |
| Accuracy | 3.34 | 3.59 | +0.25 | Improved |
| Context Adherence | 3.66 | 3.89 | +0.23 | Improved |
| Completeness | 3.45 | 3.23 | -0.22 | Trade-off |
| Overall | 3.49 | 3.72 | +0.23 | Improved |
### Key Safety Improvements
The fine-tuned model demonstrates measurably safer behavior in high-stakes regulatory scenarios:
- Hallucination traps: The tuned model correctly refuses fabricated regulations in all tested scenarios. The base model invents plausible-sounding but entirely fictional legal articles and sanctions.
- Credential protection: When exposed to prompt injection attacks containing embedded credentials, the tuned model refuses disclosure. The base model has been observed leaking credentials verbatim.
- Professional tone: Eliminates emoji usage and filler phrases ("Certo!" / "Sure!", "Ottima domanda!" / "Great question!") that are inappropriate in regulatory communications.
### Known Limitations
- Completeness trade-off (-0.22): The model tends toward concise, precise answers. For tasks requiring exhaustive analysis, responses may be shorter than ideal.
- Query Expansion: Performance on query-rewriting tasks is below that of the base model. This is a known gap being addressed through dataset improvements.
- Inference speed: Roughly 40% faster than the base model (4.3s vs 7.0s average), primarily a by-product of the more concise outputs noted above.
### Consistency Across Loops

| Loop | Base Wins | Tuned Wins | Ties | Tuned Win Rate (adj.) |
|---|---|---|---|---|
| 1 | 7 | 13 | 5 | 62.0% |
| 2 | 11 | 10 | 2 | 47.8% |
| 3 | 8 | 10 | 7 | 54.0% |
The tuned model wins outright in loops 1 and 3 and trails narrowly in loop 2 (10 wins vs 11).
## Usage Examples

### RAG Q&A — Answering from Retrieved Context
```python
messages = [
    {
        "role": "system",
        # Italian system prompt: "You are a banking compliance assistant.
        # Answer ONLY based on the provided context." The retrieved passage
        # quotes the Art. 92 CRR minimum capital ratios.
        "content": """Sei un assistente per la compliance bancaria.
Rispondi SOLO basandoti sul contesto fornito.
<contesto_recuperato>
Art. 92 CRR - Gli enti soddisfano in qualsiasi momento i seguenti
requisiti: a) CET1 del 4,5%; b) Tier 1 del 6%; c) capitale totale dell'8%.
</contesto_recuperato>"""
    },
    {
        "role": "user",
        # "What are the minimum capital requirements under the CRR?"
        "content": "Quali sono i requisiti minimi di capitale secondo il CRR?"
    }
]
```
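This message list (and each of the examples below) runs through the standard chat-template flow. A minimal sketch, assuming `model` and `tokenizer` are loaded as shown in the Deployment section:

```python
# Assumes `model` and `tokenizer` are already loaded (see Deployment below).
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```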
### Tool Calling — Compliance Workflows
```python
messages = [
    {
        "role": "system",
        # Italian system prompt: "You are an operational compliance assistant."
        # Tool parameter schemas are elided ({...}); the retrieved context
        # quotes internal procedure AML-003 on enhanced due diligence (EDD).
        "content": """Sei un assistente operativo per la compliance.
<tools>
{"name": "calcola_scoring_rischio", "parameters": {...}}
{"name": "controlla_liste_pep", "parameters": {...}}
{"name": "verifica_kyc", "parameters": {...}}
</tools>
<contesto_recuperato>
Procedura AML-003: L'adeguata verifica rafforzata (EDD) deve essere
applicata per PEP, paesi ad alto rischio e profili con scoring > 60.
</contesto_recuperato>"""
    },
    {
        "role": "user",
        # "I need to open an account for a company based in Dubai. The legal
        # representative is Mr. Al-Rashid."
        "content": "Devo aprire un conto per una società con sede a Dubai. Il legale rappresentante è il sig. Al-Rashid."
    }
]
```
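When the model decides to invoke one of the declared tools, Qwen3-family chat templates typically wrap the call in `<tool_call>` tags. A minimal parsing sketch; verify the exact wire format against the chat template shipped with the model:

```python
import json
import re

def extract_tool_calls(completion: str) -> list[dict]:
    """Parse JSON tool calls from <tool_call>...</tool_call> blocks.

    Assumes the Qwen-style payload {"name": ..., "arguments": {...}};
    check the model's own chat template before relying on this.
    """
    calls = []
    for block in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", completion, re.DOTALL):
        try:
            calls.append(json.loads(block))
        except json.JSONDecodeError:
            continue  # malformed call; handle or log in real code
    return calls
```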
### Query Expansion — Improving RAG Retrieval
```python
messages = [
    {
        "role": "system",
        # Italian system prompt: "Rewrite the user's query to improve document
        # retrieval. Add technical terms and regulatory references. Reply ONLY
        # with the JSON."
        "content": "Riscrivi la query dell'utente per migliorare il recupero documentale. Aggiungi termini tecnici e riferimenti normativi. Rispondi SOLO con il JSON."
    },
    {
        "role": "user",
        # Original query: "suspicious transaction reporting obligations"
        "content": "## QUERY ORIGINALE: [obblighi segnalazione operazioni sospette]"
    }
]
```
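The model replies with a JSON object containing the enriched query. The exact schema is defined by the training data; the shape below is only an illustration with a hypothetical field name:

```python
# Illustrative only: "query_espansa" is a hypothetical field name, and the
# regulatory references show the kind of enrichment expected.
expanded = {
    "query_espansa": (
        "obblighi di segnalazione delle operazioni sospette (SOS) alla UIF "
        "ai sensi dell'art. 35 del D.Lgs. 231/2007, indicatori di anomalia "
        "e adeguata verifica della clientela"
    )
}
```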
### Document Reranking
```python
messages = [
    {
        "role": "system",
        # Italian system prompt: "Rate each candidate's relevance to the query.
        # Score 0-100. Reply ONLY with the JSON."
        "content": "Valuta la rilevanza di ciascun candidato rispetto alla query. Score 0-100. Rispondi SOLO con il JSON."
    },
    {
        "role": "user",
        "content": '{"query": "requisiti CET1", "candidates": [{"id": "doc_001", "title": "Art. 92 CRR"}, {"id": "doc_002", "title": "DORA Art. 5"}]}'
    }
]
```
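A RAG pipeline can then parse the scores and keep only the top candidates. A minimal sketch, assuming a hypothetical `scores` field in the model's JSON reply:

```python
import json

# Hypothetical reply shape; the real field names depend on the training data.
reply = '{"scores": [{"id": "doc_001", "score": 95}, {"id": "doc_002", "score": 12}]}'

ranked = sorted(json.loads(reply)["scores"], key=lambda d: d["score"], reverse=True)
top_ids = [d["id"] for d in ranked[:1]]  # ["doc_001"]: Art. 92 CRR matches "requisiti CET1"
```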
## Training Metrics
| Metric | Value |
|---|---|
| Final Eval Loss | 1.368 |
| Token Accuracy | 70.5% |
| Train/Eval Gap | 0.033 |
A train/eval gap of only 0.033 indicates stable training with minimal overfitting. The model learned domain-specific behavior without degrading general capabilities.
## Design Principles
The LoRA configuration follows a minimal intervention philosophy validated through progressive experimentation across 6+ configurations:
- Low rank, all modules: Adapting every linear module at a small rank produces better results than a high rank on a subset of modules, consistent with findings from the original LoRA paper.
- Single epoch: One pass through the data is sufficient for behavioral adaptation; multiple epochs can cause catastrophic forgetting in small models.
- Conservative scaling: Alpha = 2× rank with a low learning rate keeps gradients stable while still amplifying the adapter signal. A configuration sketch follows.
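These principles translate into a PEFT LoRA setup along the following lines. This is a sketch only: the exact rank, dropout, and learning rate are not published here, so `r=8` and `lora_dropout=0.05` are illustrative values.

```python
from peft import LoraConfig

# Sketch under the stated principles; r and dropout are illustrative, not the
# values actually used. alpha = 2 x rank per the "conservative scaling" rule.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,                # alpha = 2 x rank
    target_modules="all-linear",  # low rank applied to every linear module
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```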
## Dataset Coverage
The training data covers the full lifecycle of a RAG-based compliance assistant:
| Category | Purpose |
|---|---|
| Query Expansion | Enrich queries with regulatory terms for better retrieval |
| Intent Classification | Route queries to RAG vs conversational responses |
| Document Reranking | Score retrieved documents by relevance |
| Topic Extraction | Extract main topics from regulatory text pages |
| Document Summarization | Summarize multi-page regulatory documents |
| Relevance Filtering | Filter regulatory text relevant to banks |
| Metadata Extraction | Find application dates, issuing authorities |
| Impact Analysis | Cross-reference regulations vs internal procedures |
| RAG Q&A + Tool Calling | Multi-turn compliance conversations with tools |
Regulatory sources covered: CRR/CRR3, DORA (UE 2022/2554), D.Lgs. 231/2007 (AML), D.Lgs. 385/1993 (TUB), Circolare 285, PSD2, MiFID II/MiFIR, D.P.R. 180/1950 and related Banca d'Italia provisions.
## Deployment

### With vLLM

```bash
vllm serve ./models/RegTech-4B-Instruct --dtype bfloat16
```
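vLLM exposes an OpenAI-compatible endpoint (port 8000 by default), so any OpenAI client can query the served model; for example:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API on http://localhost:8000/v1 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="./models/RegTech-4B-Instruct",  # must match the name vLLM is serving
    messages=[{"role": "user", "content": "Quali sono i requisiti CET1?"}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

In production, the retrieved context goes in the system prompt exactly as in the Usage Examples above.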
### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "YOUR_REPO_ID", torch_dtype="bfloat16", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("YOUR_REPO_ID")

# `messages` is any of the chat message lists from the Usage Examples above.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Important Notes
- RAG-optimized — Trained to work with retrieved context, not to memorize regulations. Always provide relevant documents in the system prompt.
- Domain-specific — Optimized for Italian banking compliance. General capabilities may differ from the base model.
- Not legal advice — A tool to assist compliance professionals, not a substitute for regulatory expertise.
- Part of a model family — This 4B model is the lightweight variant. Larger models (7B, 14B, 32B) in the RegTech family offer progressively better completeness and accuracy for more demanding use cases.
Built for banking RAG by 2Sophia
Fine-tuned with LoRA • Adversarial evaluation by frontier LLM judges • Powered by Qwen3