Balanced DeBERTa Prompt Guard
Fine-tuned DeBERTa-v3-base for prompt injection and jailbreak detection.
Model Details
- Base Model: microsoft/deberta-v3-base
- Training Data: 536 samples (prompt-guard-v2 dataset)
- 417 malicious (78%)
- 119 benign (22%)
Dataset Categories (30+)
| Category | Count |
|---|---|
| prompt_injection | 80+ |
| jailbreak | 70+ |
| flip_attack | 30+ |
| encoding_attack | 50+ |
| indirect_injection | 40+ |
| role_override | 30+ |
| privilege_escalation | 20+ |
| agent_chaining | 20+ |
| memory_poisoning | 15+ |
| vector_poisoning | 15+ |
| xss_attempt | 15+ |
| sql_injection | 10+ |
| benign | 119 |
Supported Threats (15+ types)
- Direct/Indirect Prompt Injection
- FlipAttack (character swapping)
- Jailbreak (DAN, role-play)
- Encoding Attacks (Base64, Hex, URL, HTML, Unicode)
- Multi-language jailbreak (中, 日, 韓, 西, 独, 露, etc.)
- RAG/Vector Poisoning
- Agent Chaining Attacks
- Memory Poisoning
Usage
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "clawdia-chan/balanced-deberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support