NurseSim-RL: Training AI Agents for Clinical Triage

Community Article Published January 11, 2026

TL;DR: We built a Gymnasium-compatible RL environment that simulates Emergency Department triage and fine-tuned a Llama 3.2 3B model on it using Unsloth. The agent achieves 60% exact-match accuracy in assigning Manchester Triage System categories and maintains a 70%+ safety pass rate in critical life-threat scenarios.

🔗 Live Demo | GitHub | Model

The Challenge: OpenEnv 2026

This project was developed for the OpenEnv Challenge, sponsored by PyTorch, Hugging Face, and Unsloth. The goal? Create innovative RL environments that push the boundaries of agentic AI and contribute them as open-source public goods.

Healthcare seemed like the perfect domain—it's safety-critical, high-stakes, and requires complex reasoning. If we can build agents that make good clinical decisions, we're not just advancing AI research; we're potentially saving lives.

The Problem: A&E Triage is Hard

Every day, Emergency Departments (A&E in the UK, ER in the US) face a critical challenge: which patient gets seen first?

Triage nurses use the Manchester Triage System (MTS) to categorize patients into 5 priority levels:

Category	Priority	Target Time	Example
1	Immediate	0 min	Cardiac arrest, Anaphylaxis
2	Very Urgent	10 min	Chest pain (STEMI), Stroke
3	Urgent	60 min	Abdominal pain, Fractures
4	Standard	120 min	Minor injuries, Viral illness
5	Non-Urgent	240 min	Minor cuts, GP-suitable

Why This Matters

A wrong decision has real consequences:

Under-triage a Category 1 patient → Life-threatening delay
Over-triage a Category 5 patient → Wasted critical resources

This isn't just a classification problem—it's a safety-critical resource allocation game.

The Solution: NurseSim-RL Environment

We built NurseSim-Triage-v0, a Gymnasium-compatible environment that models the A&E triage workflow.

How It Works

Observation Space:

{
  "patient_complaint": "Crushing chest pain radiating to left arm",
  "vitals": {
    "HR": 110,
    "BP": "90/60",
    "SpO2": 94,
    "Temp": 37.2
  },
  "waiting_room": 8,
  "available_beds": 2
}

Action Space:

{
  "triage_category": 2,  # 1-5 (MTS)
  "intervention": "send_to_resus"  # Clinical action
}
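To make the two structures above concrete, here is a minimal sketch of how they could be typed and validated in plain Python. The class names, the `parse_observation` helper, and every intervention other than `send_to_resus` are assumptions made for illustration; they are not taken from the NurseSim-RL code.

```python
from dataclasses import dataclass

# Hypothetical intervention set: only "send_to_resus" appears in the article.
ALLOWED_INTERVENTIONS = {"send_to_resus", "send_to_majors", "send_to_minors", "discharge"}


@dataclass(frozen=True)
class Vitals:
    HR: int
    BP: str
    SpO2: int
    Temp: float


@dataclass(frozen=True)
class Observation:
    patient_complaint: str
    vitals: Vitals
    waiting_room: int
    available_beds: int


@dataclass(frozen=True)
class Action:
    triage_category: int
    intervention: str

    def __post_init__(self):
        # MTS categories run from 1 (Immediate) to 5 (Non-Urgent).
        if self.triage_category not in range(1, 6):
            raise ValueError(f"triage_category must be 1-5, got {self.triage_category}")
        if self.intervention not in ALLOWED_INTERVENTIONS:
            raise ValueError(f"unknown intervention: {self.intervention}")


def parse_observation(raw: dict) -> Observation:
    """Convert a raw observation dict into a typed Observation."""
    return Observation(
        patient_complaint=raw["patient_complaint"],
        vitals=Vitals(**raw["vitals"]),
        waiting_room=raw["waiting_room"],
        available_beds=raw["available_beds"],
    )
```

Validating the action at construction time means a malformed model output fails loudly before it reaches the reward calculation.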

Reward Function:

+10 for correct triage category
-50 for critical safety failures (e.g., discharging a Cat 1 patient)
-1 per minute of wait time for critical patients
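The three reward terms can be combined as in the sketch below. It rests on assumptions the article does not state: "critical" is taken to mean Category 1 or 2, the only safety failure modelled is discharging a Category 1 patient, and the function name and signature are hypothetical.

```python
# Assumption: "critical patients" means MTS Category 1 or 2.
CRITICAL_CATEGORIES = {1, 2}


def triage_reward(true_category, assigned_category, intervention, wait_minutes=0):
    """Sketch of the reward terms listed above (not the environment's actual code)."""
    reward = 0.0

    # +10 for an exact category match.
    if assigned_category == true_category:
        reward += 10

    # -50 for a critical safety failure; here, discharging a Category 1 patient.
    if true_category == 1 and intervention == "discharge":
        reward -= 50

    # -1 per minute of waiting, applied only to critical patients.
    if true_category in CRITICAL_CATEGORIES:
        reward -= wait_minutes

    return reward
```

With these weights, a single unsafe discharge costs as much as five correct decisions earn, which pushes the agent toward caution in life-threat cases.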

Dataset Generation (Expanded)

We created a PatientGenerator class that produces realistic scenarios. For our final robust model, we scaled this to:

2,100+ training examples covering all 5 MTS categories
Gold-Standard Scenarios: Specifically curated Category 1 (Immediate) cases (e.g., Anaphylaxis, Sepsis) to prevent under-triage.
Realistic Demographics: Age and gender distributions that affect risk stratification.
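As a rough illustration of category-balanced scenario generation, the sketch below samples complaints taken from the MTS table earlier in this article. `ToyPatientGenerator`, its methods, the uniform age range, and the record fields are all assumptions; the real `PatientGenerator` is presumably richer (vitals, curated cases, realistic demographics).

```python
import random

# Example presentations per MTS category, taken from the table in this article.
COMPLAINTS_BY_CATEGORY = {
    1: ["Cardiac arrest", "Anaphylaxis"],
    2: ["Chest pain (STEMI)", "Stroke"],
    3: ["Abdominal pain", "Fracture"],
    4: ["Minor injury", "Viral illness"],
    5: ["Minor cut", "GP-suitable complaint"],
}


class ToyPatientGenerator:
    """Hypothetical stand-in for the article's PatientGenerator."""

    def __init__(self, seed=0):
        # A private RNG keeps generation reproducible for a given seed.
        self.rng = random.Random(seed)

    def sample(self, category=None):
        """Return one synthetic patient record, optionally forcing the category."""
        if category is None:
            category = self.rng.randint(1, 5)
        return {
            "complaint": self.rng.choice(COMPLAINTS_BY_CATEGORY[category]),
            "age": self.rng.randint(18, 95),  # assumption: uniform adult ages
            "gender": self.rng.choice(["F", "M"]),
            "label": category,
        }

    def balanced_dataset(self, per_category):
        """Generate the same number of examples for each of the 5 categories."""
        rows = [self.sample(c) for c in range(1, 6) for _ in range(per_category)]
        self.rng.shuffle(rows)
        return rows
```

Forcing the category during sampling is one simple way to guarantee coverage of rare but critical Category 1 cases.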

Training: Llama 3.2 + Unsloth = Magic ✨

We used Unsloth to fine-tune Llama-3.2-3B-Instruct with 4-bit QLoRA. Why Unsloth? It advertises roughly 2x faster training and 60% less memory use.

Setup

from unsloth import FastLanguageModel

# Load the instruction-tuned base model in 4-bit precision (QLoRA).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projection layers.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # LoRA rank
    lora_alpha=16,  # LoRA scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

Training Results

Training converged to a low loss with stable gradients:

Metric	Value
Initial Loss	2.8
Final Loss	0.19
Steps	300
Epochs	~4.5
Hardware	NVIDIA A100 (Colab)
Time	25 minutes

🧠 The "Age-Aware" Breakthrough

During validation, we found that age matters. Our initial models struggled to differentiate between "Chest Pain" in a 20-year-old (Category 3/4) and in a 72-year-old (Category 1/2). Once we explicitly wrote age and gender into the clinical history field, the model's accuracy jumped from 16% to 60%.
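One way to put age and gender into the history field is sketched below. The `build_prompt` function and its parameters are hypothetical; the output layout mirrors the example input shown in "The Agent in Action", but the actual training prompt template may differ.

```python
def build_prompt(complaint, vitals, age, gender, history=""):
    """Format a triage prompt with age and gender leading the history field."""
    # Demographics go first, e.g. "25F", so the model always sees them.
    demographics = f"{age}{gender}"
    history_field = f"{demographics}, {history}" if history else demographics

    # Render vitals as "HR 120, BP 90/60, ..." in the order given.
    vitals_text = ", ".join(f"{name} {value}" for name, value in vitals.items())

    return (
        f'Chief Complaint: "{complaint}"\n'
        f"Vitals: {vitals_text}\n"
        f"History: {history_field}"
    )


prompt = build_prompt(
    complaint="Swollen tongue and difficulty breathing after peanuts",
    vitals={"HR": 120, "BP": "90/60", "SpO2": "91%"},
    age=25,
    gender="F",
    history="Nut Allergy",
)
print(prompt)
```

Keeping demographics in a fixed position means the same complaint produces visibly different prompts for a 20-year-old and a 72-year-old.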

Validation: 60% Accuracy & Safety Confirmed

We evaluated the model on 15 Gold-Standard Clinical Scenarios, using GPT-5.2 as a secondary "Clinical Judge". On a set of this size, 60% accuracy corresponds to 9 exact matches.

🏆 Key Validation Metrics

Metric	Score	Note
Triage Accuracy	60%	Exact match with MTS Category (1-5)
Safety Pass Rate	70%+	Correctly identified life-threats (Sepsis, Anaphylaxis)
Drift	Cat 1 → Cat 2	Most errors were one level off (Immediate → Very Urgent) rather than dangerous discharges.
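The two headline metrics can be computed as in the sketch below. The safety definition is an assumption: a Category 1 case counts as passed if the model assigns Category 1 or 2. The labels in the example are illustrative only and are not the article's evaluation data.

```python
def exact_match_accuracy(true_categories, predicted_categories):
    """Fraction of cases where the predicted MTS category equals the label."""
    pairs = list(zip(true_categories, predicted_categories))
    return sum(t == p for t, p in pairs) / len(pairs)


def safety_pass_rate(true_categories, predicted_categories, max_safe_category=2):
    """Fraction of true Category 1 cases triaged at or above max_safe_category.

    Assumption: predicting Category 1 or 2 for a Category 1 patient is a pass.
    Returns None when the evaluation set has no Category 1 cases.
    """
    critical = [(t, p) for t, p in zip(true_categories, predicted_categories) if t == 1]
    if not critical:
        return None
    return sum(p <= max_safe_category for _, p in critical) / len(critical)


# Illustrative labels only (not the article's results).
true_labels = [1, 1, 2, 3, 5]
predictions = [1, 2, 2, 4, 5]
accuracy = exact_match_accuracy(true_labels, predictions)
safety = safety_pass_rate(true_labels, predictions)
```

Reporting the two numbers separately matters because a model can miss the exact category while still keeping a life-threatening case in a high-priority band.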

The Agent in Action

We deployed the fine-tuned model to a Gradio Space powered by ZeroGPU:

🩺 Try the Live Demo

Example Inference

Input:

Chief Complaint: "Swollen tongue and difficulty breathing after peanuts"
Vitals: HR 120, BP 90/60, SpO2 91%
History: 25F, Nut Allergy

Output:

TRIAGE DECISION:

Category: 1 - Immediate (Red)
Intervention: send_to_resus

Clinical Reasoning: Swollen tongue after peanuts is highly suggestive of anaphylaxis/angioedema with imminent airway compromise.
Immediate resuscitation required.

The agent not only assigns the correct category but also explains its reasoning—a behavior learned purely from the training data.
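To feed a free-text decision like the one above back into the environment, it has to be parsed into a structured action. A minimal sketch, assuming the model keeps the `Category:` and `Intervention:` line format shown in the example; the function name and the returned dict shape are hypothetical.

```python
import re

# Match lines such as "Category: 1 - Immediate (Red)" and "Intervention: send_to_resus".
CATEGORY_RE = re.compile(r"Category:\s*([1-5])\b")
INTERVENTION_RE = re.compile(r"Intervention:\s*(\w+)")


def parse_decision(text):
    """Extract the triage action from model output, or None if it is malformed."""
    category = CATEGORY_RE.search(text)
    intervention = INTERVENTION_RE.search(text)
    if category is None or intervention is None:
        return None
    return {
        "triage_category": int(category.group(1)),
        "intervention": intervention.group(1),
    }


sample_output = """TRIAGE DECISION:

Category: 1 - Immediate (Red)
Intervention: send_to_resus

Clinical Reasoning: Swollen tongue after peanuts is highly suggestive of anaphylaxis."""

decision = parse_decision(sample_output)
```

Returning `None` for unparseable output lets the caller decide how to penalise or retry, rather than silently guessing a category.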

Lessons Learned

What Worked

Synthetic data quality matters more than quantity: Curating 15 "Gold Standard" Category 1 cases was more effective than generating 1,000 random ones.
Unsloth is a game-changer: Iteration cycles dropped from hours to minutes.
Context is King: Adding explicit "Age/Gender" fields solved the biggest accuracy bottleneck.

What Could Be Better

Real clinical validation: We need nurses to red-team the system.
Uncertainty quantification: The model should say "I don't know" when confidence is low.

Impact & Future Work

Immediate Applications

Nursing Education: Students can practice triage scenarios 24/7.
Workforce Augmentation: AI-assisted triage in low-resource settings.
Benchmarking: Other researchers can use NurseSim-RL to test their agents.

Next Steps

Partner with NHS Trusts for real-world pilot testing.
Extend to other clinical domains (radiology, discharge planning).
Build multi-agent systems (Triage Nurse + Consultant + Pharmacist).

Try It Yourself

All the code, data, and models are open-source:

🎮 Live Demo
💻 GitHub Repo
🤗 Model on HF Hub
📓 Training Notebook

Acknowledgements

OpenEnv Challenge - Berkeley RDI, PyTorch, Hugging Face, Unsloth
Manchester Triage System - Clinical framework
Unsloth AI - For making LLM fine-tuning actually enjoyable

Built with ❤️ for the OpenEnv Challenge 2026
