NurseSim-RL: Training AI Agents for Clinical Triage

Community Article Published January 11, 2026

TL;DR: We built a Gymnasium-compatible RL environment that simulates Emergency Department triage and fine-tuned a Llama 3.2 3B model on it using Unsloth. The agent achieves 60% exact-match accuracy in assigning Manchester Triage System categories and maintains a 70%+ safety pass rate in critical life-threat scenarios.

🔗 Live Demo | GitHub | Model


The Challenge: OpenEnv 2026

This project was developed for the OpenEnv Challenge, sponsored by PyTorch, Hugging Face, and Unsloth. The goal? Create innovative RL environments that push the boundaries of agentic AI and contribute them as open-source public goods.

Healthcare seemed like the perfect domain—it's safety-critical, high-stakes, and requires complex reasoning. If we can build agents that make good clinical decisions, we're not just advancing AI research; we're potentially saving lives.


The Problem: A&E Triage is Hard

Every day, Emergency Departments (A&E in the UK, ER in the US) face a critical challenge: which patient gets seen first?

Triage nurses use the Manchester Triage System (MTS) to categorize patients into 5 priority levels:

| Category | Priority | Target Time | Example |
|---|---|---|---|
| 1 | Immediate | 0 min | Cardiac arrest, Anaphylaxis |
| 2 | Very Urgent | 10 min | Chest pain (STEMI), Stroke |
| 3 | Urgent | 60 min | Abdominal pain, Fractures |
| 4 | Standard | 120 min | Minor injuries, Viral illness |
| 5 | Non-Urgent | 240 min | Minor cuts, GP-suitable |

Why This Matters

A wrong decision has real consequences:

  • Under-triage a Category 1 patient → Life-threatening delay
  • Over-triage a Category 5 patient → Wasted critical resources

This isn't just a classification problem—it's a safety-critical resource allocation game.
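To make the table and the two failure modes concrete, here is a minimal sketch in Python. The names `MTSCategory`, `is_breached`, and `triage_error` are illustrative and not taken from the NurseSim-RL codebase; the target times come straight from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MTSCategory:
    level: int           # 1 (most urgent) to 5 (least urgent)
    priority: str
    target_minutes: int  # target time to be seen


# Values copied from the MTS table in this article.
MTS_CATEGORIES = {
    1: MTSCategory(1, "Immediate", 0),
    2: MTSCategory(2, "Very Urgent", 10),
    3: MTSCategory(3, "Urgent", 60),
    4: MTSCategory(4, "Standard", 120),
    5: MTSCategory(5, "Non-Urgent", 240),
}


def is_breached(category: int, minutes_waited: float) -> bool:
    """Return True if the patient has waited longer than the category target."""
    return minutes_waited > MTS_CATEGORIES[category].target_minutes


def triage_error(true_category: int, assigned_category: int) -> str:
    """Classify an assignment as correct, under-triage, or over-triage."""
    if assigned_category == true_category:
        return "correct"
    # A higher number means lower urgency, so assigning a higher number
    # than the truth underestimates how sick the patient is.
    return "under-triage" if assigned_category > true_category else "over-triage"
```

Because lower numbers mean higher urgency, under-triage shows up as an assigned category that is numerically larger than the true one.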


The Solution: NurseSim-RL Environment

We built NurseSim-Triage-v0, a Gymnasium-compatible environment that models the A&E triage workflow.

How It Works

Observation Space:

{
  "patient_complaint": "Crushing chest pain radiating to left arm",
  "vitals": {
    "HR": 110,
    "BP": "90/60",
    "SpO2": 94,
    "Temp": 37.2
  },
  "waiting_room": 8,
  "available_beds": 2
}

Action Space:

{
  "triage_category": 2,  # 1-5 (MTS)
  "intervention": "send_to_resus"  # Clinical action
}
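The observation and action payloads above map naturally onto small typed containers. The sketch below is one possible representation, not the environment's actual classes: `Vitals`, `Observation`, and `Action` are hypothetical names, and the range check on `triage_category` is an assumption about how invalid actions might be rejected.

```python
from dataclasses import dataclass

VALID_CATEGORIES = range(1, 6)  # MTS categories 1-5


@dataclass
class Vitals:
    HR: int      # heart rate, beats per minute
    BP: str      # blood pressure as "systolic/diastolic"
    SpO2: int    # oxygen saturation, percent
    Temp: float  # temperature, degrees Celsius

    def systolic(self) -> int:
        """Extract the systolic value from the "systolic/diastolic" string."""
        return int(self.BP.split("/")[0])


@dataclass
class Observation:
    patient_complaint: str
    vitals: Vitals
    waiting_room: int
    available_beds: int

    @classmethod
    def from_dict(cls, raw: dict) -> "Observation":
        """Build an Observation from the dictionary layout shown above."""
        return cls(
            patient_complaint=raw["patient_complaint"],
            vitals=Vitals(**raw["vitals"]),
            waiting_room=raw["waiting_room"],
            available_beds=raw["available_beds"],
        )


@dataclass
class Action:
    triage_category: int
    intervention: str

    def __post_init__(self) -> None:
        # Assumption: out-of-range categories are rejected up front.
        if self.triage_category not in VALID_CATEGORIES:
            raise ValueError(
                f"triage_category must be 1-5, got {self.triage_category}"
            )


obs = Observation.from_dict({
    "patient_complaint": "Crushing chest pain radiating to left arm",
    "vitals": {"HR": 110, "BP": "90/60", "SpO2": 94, "Temp": 37.2},
    "waiting_room": 8,
    "available_beds": 2,
})
action = Action(triage_category=2, intervention="send_to_resus")
```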

Reward Function:

  • +10 for correct triage category
  • -50 for critical safety failures (e.g., discharging a Cat 1 patient)
  • -1 per minute of wait time for critical patients
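The three reward terms can be sketched as a single function. This is an illustration under stated assumptions rather than the environment's real code: "critical" is taken to mean a true Category 1 patient, the intervention name `"discharge"` is hypothetical, and the terms are assumed to add together.

```python
CORRECT_REWARD = 10.0
SAFETY_FAILURE_PENALTY = -50.0
WAIT_PENALTY_PER_MINUTE = -1.0

# Assumption: only true Category 1 patients count as "critical".
CRITICAL_CATEGORIES = {1}


def compute_reward(
    true_category: int,
    assigned_category: int,
    intervention: str,
    wait_minutes: float = 0.0,
) -> float:
    """Sum the three reward terms described in the bullet list."""
    reward = 0.0
    if assigned_category == true_category:
        reward += CORRECT_REWARD

    is_critical = true_category in CRITICAL_CATEGORIES
    # Hypothetical safety rule: discharging a critical patient.
    if is_critical and intervention == "discharge":
        reward += SAFETY_FAILURE_PENALTY
    # Waiting time is only penalised for critical patients.
    if is_critical:
        reward += WAIT_PENALTY_PER_MINUTE * wait_minutes
    return reward
```

With these weights, a single unsafe discharge costs as much as five correct decisions earn, which pushes the agent toward caution on life-threat presentations.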

Dataset Generation (Expanded)

We created a PatientGenerator class that produces realistic scenarios. For our final robust model, we scaled this to:

  • 2,100+ training examples covering all 5 MTS categories
  • Gold-Standard Scenarios: Specifically curated Category 1 (Immediate) cases (e.g., Anaphylaxis, Sepsis) to prevent under-triage.
  • Realistic Demographics: Age and gender distributions that affect risk stratification.
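A generator along these lines might look like the sketch below. It is a simplified stand-in, not the project's `PatientGenerator`: the complaint templates are just the examples from the MTS table, the age range and the `gold_category_1` parameter are assumptions, and, unlike the real dataset, the label here does not depend on age or gender.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

# Complaint templates reuse the examples from the MTS table;
# a real generator would need far richer presentations.
TEMPLATES = {
    1: ["Cardiac arrest", "Anaphylaxis"],
    2: ["Chest pain (STEMI)", "Stroke"],
    3: ["Abdominal pain", "Fractures"],
    4: ["Minor injuries", "Viral illness"],
    5: ["Minor cuts", "GP-suitable complaint"],
}


@dataclass
class Scenario:
    complaint: str
    age: int
    gender: str
    category: int  # ground-truth MTS category


class PatientGenerator:
    def __init__(self, seed: int = 0) -> None:
        # A private RNG keeps generation reproducible.
        self._rng = random.Random(seed)

    def sample(self, category: Optional[int] = None) -> Scenario:
        """Draw one scenario, optionally forcing its category."""
        if category is None:
            category = self._rng.randint(1, 5)
        return Scenario(
            complaint=self._rng.choice(TEMPLATES[category]),
            age=self._rng.randint(18, 95),  # assumed adult age range
            gender=self._rng.choice(["F", "M"]),
            category=category,
        )

    def dataset(self, n: int, gold_category_1: int = 0) -> List[Scenario]:
        """Return n random scenarios plus extra forced Category 1 cases."""
        data = [self.sample() for _ in range(n)]
        data += [self.sample(category=1) for _ in range(gold_category_1)]
        self._rng.shuffle(data)
        return data
```

Oversampling Category 1 through a dedicated argument mirrors the idea of curated gold-standard cases: the rarest and most dangerous class gets guaranteed coverage instead of relying on random draws.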

Training: Llama 3.2 + Unsloth = Magic ✨

We used Unsloth to fine-tune Llama-3.2-3B-Instruct with 4-bit QLoRA. Why Unsloth? 2x faster training and 60% less memory.

Setup

from unsloth import FastLanguageModel

# Load the instruction-tuned base model with 4-bit quantized weights (QLoRA).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters (rank 16) to the attention and MLP projection layers.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

Training Results

Training converged cleanly, with loss falling from 2.8 to 0.19 over 300 steps and stable gradients throughout:

| Metric | Value |
|---|---|
| Initial Loss | 2.8 |
| Final Loss | 0.19 |
| Steps | 300 |
| Epochs | ~4.5 |
| Hardware | NVIDIA A100 (Colab) |
| Time | 25 minutes |
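As a quick sanity check, the reported figures can be combined to estimate the effective batch size and the time per step. The article does not state either number, so treat the results as back-of-the-envelope estimates; the calculation assumes exactly 2,100 examples and exactly 4.5 epochs.

```python
num_examples = 2100    # "2,100+" in the article, taken as exactly 2,100
steps = 300
epochs = 4.5           # reported as "~4.5"
training_minutes = 25

# Total examples processed = epochs * dataset size = steps * batch size.
examples_seen = epochs * num_examples
implied_batch_size = examples_seen / steps

seconds_per_step = training_minutes * 60 / steps

print(f"examples seen:      {examples_seen:.0f}")
print(f"implied batch size: {implied_batch_size:.1f}")
print(f"seconds per step:   {seconds_per_step:.1f}")
```

The implied value of about 31.5 examples per step would be consistent with an effective batch size of 32, though the actual batch size and gradient accumulation settings are not given here.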

[Figure: Training loss curve]

🧠 The "Age-Aware" Breakthrough

During validation, we discovered a critical insight: age matters. Our initial models struggled to differentiate between "Chest Pain" in a 20-year-old (Category 3/4) and a 72-year-old (Category 1/2). Once we wrote age and gender explicitly into the clinical history field of each prompt, the model's accuracy jumped from 16% to 60%.
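One way to put age and gender into the history field is sketched below. The function names are hypothetical and the prompt layout is inferred from the example inference later in this article (for instance `History: 25F, Nut Allergy`); the actual training template may differ.

```python
def format_history(age: int, gender: str, history: str = "") -> str:
    """Prefix the free-text history with age and gender, e.g. '25F, Nut Allergy'."""
    demographics = f"{age}{gender}"
    return f"{demographics}, {history}" if history else demographics


def build_prompt(
    complaint: str,
    hr: int,
    bp: str,
    spo2: int,
    age: int,
    gender: str,
    history: str = "",
) -> str:
    """Assemble the three-line patient description used as model input."""
    return "\n".join([
        f'Chief Complaint: "{complaint}"',
        f"Vitals: HR {hr}, BP {bp}, SpO2 {spo2}%",
        f"History: {format_history(age, gender, history)}",
    ])


prompt = build_prompt(
    "Swollen tongue and difficulty breathing after peanuts",
    hr=120, bp="90/60", spo2=91,
    age=25, gender="F", history="Nut Allergy",
)
print(prompt)
```

Keeping demographics in a fixed position at the start of the history line gives the model a consistent place to look for them, even when no other history is recorded.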


Validation: 60% Accuracy & Safety Confirmed

We evaluated the model on 15 Gold-Standard Clinical Scenarios using GPT-5.2 as a secondary "Clinical Judge".

🏆 Key Validation Metrics

| Metric | Score | Note |
|---|---|---|
| Triage Accuracy | 60% | Exact match with MTS Category (1-5) |
| Safety Pass Rate | 70%+ | Correctly identified life-threats (Sepsis, Anaphylaxis) |
| Drift | Cat 1 → Cat 2 | Most errors were off by a single level (Immediate → Very Urgent) rather than dangerous discharges. |
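The three metrics can be computed from paired lists of true and predicted categories, as in the sketch below. The exact scoring rules are not spelled out in this article, so the safety definition here is an assumption: a true Category 1 case passes if the model assigns Category 1 or 2. The sample labels are toy values for illustration only, not the evaluation results. For scale, 60% of 15 scenarios corresponds to 9 exact matches.

```python
from collections import Counter
from typing import List


def exact_match_accuracy(truth: List[int], predicted: List[int]) -> float:
    """Fraction of cases where the predicted category equals the true one."""
    matches = sum(t == p for t, p in zip(truth, predicted))
    return matches / len(truth)


def safety_pass_rate(
    truth: List[int], predicted: List[int], max_safe_category: int = 2
) -> float:
    """Fraction of true Category 1 cases assigned a category <= max_safe_category.

    Assumption: this threshold rule stands in for the article's safety check.
    """
    life_threats = [p for t, p in zip(truth, predicted) if t == 1]
    if not life_threats:
        return float("nan")
    passed = sum(p <= max_safe_category for p in life_threats)
    return passed / len(life_threats)


def error_drift(truth: List[int], predicted: List[int]) -> Counter:
    """Count (true, predicted) pairs for every misclassified case."""
    return Counter((t, p) for t, p in zip(truth, predicted) if t != p)


# Toy labels, invented purely to exercise the functions.
toy_truth = [1, 1, 2, 3]
toy_predicted = [2, 1, 2, 4]

print(exact_match_accuracy(toy_truth, toy_predicted))
print(safety_pass_rate(toy_truth, toy_predicted))
print(error_drift(toy_truth, toy_predicted))
```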

The Agent in Action

We deployed the fine-tuned model to a Gradio Space powered by ZeroGPU:

🩺 Try the Live Demo

Example Inference

Input:

Chief Complaint: "Swollen tongue and difficulty breathing after peanuts"
Vitals: HR 120, BP 90/60, SpO2 91%
History: 25F, Nut Allergy

Output:

TRIAGE DECISION:

Category: 1 - Immediate (Red)
Intervention: send_to_resus

Clinical Reasoning: Swollen tongue after peanuts is highly suggestive of anaphylaxis/angioedema with imminent airway compromise.
Immediate resuscitation required.

The agent not only assigns the correct category but also explains its reasoning—a behavior learned purely from the training data.
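For automated scoring, the free-text output has to be turned back into a structured action. A minimal parser, assuming the model keeps the `Category:` and `Intervention:` labels shown in the example output (the function name `parse_decision` is illustrative), could look like this:

```python
import re
from typing import Optional, Tuple

# Match a single digit 1-5 after "Category:" and a word after "Intervention:".
CATEGORY_RE = re.compile(r"Category:\s*([1-5])\b")
INTERVENTION_RE = re.compile(r"Intervention:\s*(\w+)")


def parse_decision(text: str) -> Tuple[Optional[int], Optional[str]]:
    """Extract (triage_category, intervention); missing fields become None."""
    category = CATEGORY_RE.search(text)
    intervention = INTERVENTION_RE.search(text)
    return (
        int(category.group(1)) if category else None,
        intervention.group(1) if intervention else None,
    )


sample_output = """TRIAGE DECISION:

Category: 1 - Immediate (Red)
Intervention: send_to_resus

Clinical Reasoning: Swollen tongue after peanuts is highly suggestive of
anaphylaxis/angioedema with imminent airway compromise.
"""

print(parse_decision(sample_output))
```

Returning `None` for a missing field lets the caller treat malformed generations as invalid actions instead of silently guessing a category.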


Lessons Learned

What Worked

  1. Synthetic data quality matters more than quantity: Curating 15 "Gold Standard" Category 1 cases was more effective than generating 1,000 random ones.
  2. Unsloth is a game-changer: Iteration cycles dropped from hours to minutes.
  3. Context is King: Adding explicit "Age/Gender" fields solved the biggest accuracy bottleneck.

What Could Be Better

  1. Real clinical validation: We need nurses to red-team the system.
  2. Uncertainty quantification: The model should say "I don't know" when confidence is low.
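One simple form of uncertainty handling is to abstain when the model's top category is not confident enough. The sketch below is a possible direction, not something the current agent does: it assumes a per-category score is available (for example, the logits of the tokens "1" to "5"), and the 0.7 threshold is an arbitrary placeholder.

```python
import math
from typing import List, Optional, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def triage_or_abstain(
    category_scores: Sequence[float], threshold: float = 0.7
) -> Optional[int]:
    """Return the MTS category (1-5), or None to signal "I don't know".

    category_scores[i] is the hypothetical model score for category i + 1.
    """
    probs = softmax(category_scores)
    best_index = max(range(len(probs)), key=probs.__getitem__)
    if probs[best_index] < threshold:
        return None  # low confidence: escalate to a human nurse
    return best_index + 1
```

In a clinical setting an abstention would most plausibly route the patient to a human triage nurse, so the threshold trades automation rate against the risk of confident errors.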

Impact & Future Work

Immediate Applications

  • Nursing Education: Students can practice triage scenarios 24/7.
  • Workforce Augmentation: AI-assisted triage in low-resource settings.
  • Benchmarking: Other researchers can use NurseSim-RL to test their agents.

Next Steps

  1. Partner with NHS Trusts for real-world pilot testing.
  2. Extend to other clinical domains (radiology, discharge planning).
  3. Build multi-agent systems (Triage Nurse + Consultant + Pharmacist).

Try It Yourself

All the code, data, and models are open-source; see the Live Demo, GitHub, and Model links at the top of this article.


Acknowledgements

  • OpenEnv Challenge - Berkeley RDI, PyTorch, Hugging Face, Unsloth
  • Manchester Triage System - Clinical framework
  • Unsloth AI - For making LLM fine-tuning actually enjoyable

Built with ❤️ for the OpenEnv Challenge 2026
