Balanced DeBERTa Prompt Guard

Fine-tuned DeBERTa-v3-base for prompt injection and jailbreak detection.

Model Details

  • Base Model: microsoft/deberta-v3-base
  • Training Data: 536 samples (prompt-guard-v2 dataset)
    • 417 malicious (78%)
    • 119 benign (22%)

Dataset Categories (30+)

Category Count
prompt_injection 80+
jailbreak 70+
flip_attack 30+
encoding_attack 50+
indirect_injection 40+
role_override 30+
privilege_escalation 20+
agent_chaining 20+
memory_poisoning 15+
vector_poisoning 15+
xss_attempt 15+
sql_injection 10+
benign 119

Supported Threats (15+ types)

  • Direct/Indirect Prompt Injection
  • FlipAttack (character swapping)
  • Jailbreak (DAN, role-play)
  • Encoding Attacks (Base64, Hex, URL, HTML, Unicode)
  • Multi-language jailbreak (中, 日, 韓, 西, 独, 露, etc.)
  • RAG/Vector Poisoning
  • Agent Chaining Attacks
  • Memory Poisoning

Usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "clawdia-chan/balanced-deberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Dataset used to train clawdia-chan/balanced-deberta