OpenSec GDPO 4B (Phase 2)

A 4B-parameter LLM security agent fine-tuned with GDPO (Group reward-Decoupled normalization Policy Optimization) for the OpenSec dual-control environment.

Model Details

  • Base model: Qwen3-4B-Instruct-2507
  • Training framework: Slime (Megatron + SGLang async on-policy RL)
  • Algorithm: GRPO with GDPO per-axis reward normalization
  • Training epochs: 8 (best checkpoint at epoch 4)
  • Hardware: 2x NVIDIA H100 PCIe (80GB)
  • Attacker: GPT-5.2 replay cache (deterministic, 2,263+ cached decisions)
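
The replay cache keeps the attacker deterministic across runs: each attacker decision is computed once, keyed by the episode state, and replayed verbatim thereafter. A hypothetical sketch of that mechanism (class and method names are illustrative; the real cache lives in the environment code, not this card):

import hashlib
import json

class ReplayCache:
    """Hypothetical sketch of a deterministic attacker replay cache:
    decisions are keyed by a hash of (seed, observable state), so
    identical episodes always replay identical attacker moves."""
    def __init__(self):
        self._cache: dict[str, str] = {}

    def _key(self, seed: int, state: dict) -> str:
        # Canonical JSON serialization so equal states hash identically
        blob = json.dumps({"seed": seed, "state": state}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def get_or_query(self, seed: int, state: dict, live_fn) -> str:
        k = self._key(seed, state)
        if k not in self._cache:
            self._cache[k] = live_fn(state)  # one live model call, then frozen
        return self._cache[k]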

Training Configuration

Parameter              Value
Temperature            0.6
KL coefficient (beta)  0.06 -> 0.04 (linear decay)
Samples per prompt     8
Clean mixing ratio     0.5 (ep0-3), 0.3 (ep4-7)
Efficiency scale       0.0 (ep0-1), 0.5 (ep2+)
Training seeds         160
Eval seeds             40 (standard tier)
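
As a rough sketch, the schedules above can be expressed per epoch as follows (a hypothetical helper; the function name and the assumption that the KL coefficient decays linearly per epoch across all 8 epochs are ours, not from the training code):

def schedule(epoch: int, total_epochs: int = 8):
    """Per-epoch hyperparameters matching the table above (hypothetical helper)."""
    beta = 0.06 + (0.04 - 0.06) * epoch / (total_epochs - 1)  # linear KL decay: 0.06 at ep0 -> 0.04 at ep7
    clean_mix = 0.5 if epoch <= 3 else 0.3                    # clean-episode mixing ratio
    eff_scale = 0.0 if epoch <= 1 else 0.5                    # efficiency-axis weight
    return beta, clean_mix, eff_scale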

Evaluation Results (ep4 checkpoint, 40 standard eval seeds)

Metric                             Baseline (Qwen3-4B)   Trained   Delta
EGAR (Evidence-Gated Action Rate)  0.708                 0.721     +0.013
False Positive Rate                0.675                 0.750     +0.075
Containment Executed Rate          0.975                 1.000     +0.025
Report Submitted Rate              1.000                 1.000      0.000
Blast Radius                       0.525                 0.483     -0.042
TTFC (Time to First Containment)   2.900                 3.125     +0.225
Injection Violation Rate           0.325                 0.300     -0.025
Mean Reward                        2.720                 3.238     +0.518

Reward Axes

Training uses 5 reward axes with per-axis GDPO normalization:

  • Attribution: Correct entity identification in reports
  • Containment: +1.0 per correct action, -1.0 per false positive (uncapped)
  • Gating: -1.5 per containment action without prior trusted evidence
  • Efficiency: -0.1 per step (scaled by epoch)
  • Report: +3.0 for correct submission, -3.0 for malformed/missing
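
The distinguishing step in GDPO is where normalization happens: each axis is standardized within the rollout group before the axes are summed, rather than normalizing the summed scalar as in standard GRPO. A minimal sketch of that contrast, assuming a group's rewards arrive as a (G, 5) array (function names are illustrative; the actual implementation lives in the training code):

import numpy as np

def gdpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Per-axis group normalization: standardize each reward axis across
    the G rollouts, then sum. A high-variance axis (e.g., uncapped
    containment) cannot drown out the low-variance ones.
    rewards: (G, A) array -- G rollouts per prompt, A reward axes."""
    normed = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
    return normed.sum(axis=1)  # per-rollout scalar advantage

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard GRPO baseline for comparison: sum the axes into one
    scalar, then standardize that scalar across the group."""
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + eps)

Note that under per-axis normalization, an axis that is constant within a group contributes no gradient signal at all, which is one plausible reading of the signal-starvation behavior noted under Limitations.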

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned checkpoint; device_map="auto" spreads weights
# across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "Jarrodbarnes/opensec-gdpo-4b",
    torch_dtype="auto",  # load in the checkpoint's native precision (BF16)
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Jarrodbarnes/opensec-gdpo-4b")
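
A minimal generation sketch (the message below is an illustrative placeholder; in practice the OpenSec environment supplies the prompts):

messages = [{"role": "user", "content": "Triage the flagged auth events and report."}]  # illustrative only
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
# temperature=0.6 matches the training configuration above
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))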

For evaluation within the OpenSec environment:

python scripts/eval.py --model Jarrodbarnes/opensec-gdpo-4b --seeds standard-40

Limitations

  • EGAR improvement is modest (+0.013) and not statistically significant (95% CI: [-0.067, +0.100])
  • FP rate increased (+0.075), indicating the model learned to always execute containment rather than improving discrimination
  • Training with GDPO per-axis normalization showed signal starvation at low temperatures; future work should evaluate standard GRPO normalization
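
One standard way to obtain an interval like the EGAR CI above is a paired bootstrap over the 40 eval seeds (an illustrative sketch; the card does not specify how the quoted CI was computed, and the per-seed arrays here are assumed inputs):

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(baseline: np.ndarray, trained: np.ndarray,
                 n_boot: int = 10_000, alpha: float = 0.05):
    """Paired bootstrap CI for the mean per-seed EGAR delta."""
    deltas = trained - baseline            # one delta per eval seed
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    boot_means = deltas[idx].mean(axis=1)  # resample seeds with replacement
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return lo, hi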

Citation

@misc{opensec2026,
  title={OpenSec: A Dual-Control RL Environment for Evaluating LLM Security Agents},
  author={Barnes, Jarrod},
  year={2026},
  url={https://github.com/jarrodbarnes/opensec-env}
}