# Compassion Probe V7 (Style-Controlled)
Linear probe for detecting compassion-related framing in Llama 3.1 8B hidden states.
## Model Description
This probe identifies a direction in activation space that distinguishes welfare-focused responses from economic/utilitarian-focused responses when discussing animal-related topics.
Unlike earlier versions, V7 uses style-controlled training pairs where both responses share identical tone, length, and academic register—differing only in whether they foreground animal welfare or instrumental value.
## Key Results
| Layer | Accuracy | AUROC | CV Accuracy |
|---|---|---|---|
| 8 | 95.3% | 0.991 | 98.1% |
| 12 | 97.7% | 0.998 | 97.7% |
| 16 | 95.3% | 0.985 | 97.2% |
| 20 | 93.0% | 0.976 | 97.2% |
**Recommended layer:** 12
## AHB Validation
Validated against the Animal Harm Benchmark (AHB) on natural Llama 3.1 8B outputs:
| Metric | Value |
|---|---|
| Pearson r | 0.428 |
| Spearman ρ | 0.389 |
| p-value | <0.0001 |
The probe generalizes from synthetic training pairs to predict compassion scores on novel, natural model outputs.
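The check behind these numbers is a straightforward correlation between per-output probe projections and AHB compassion scores. A minimal sketch of that computation, with toy stand-in arrays (in practice `probe_scores` would come from projecting each natural output onto the probe direction and `ahb_scores` from the benchmark's annotations; both names are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Toy stand-ins: probe projections and AHB compassion scores, one per output.
probe_scores = np.array([0.8, -0.3, 1.2, 0.1, -0.9, 0.5])
ahb_scores = np.array([4.0, 2.0, 5.0, 3.0, 1.0, 3.5])

r, p_r = pearsonr(probe_scores, ahb_scores)
rho, p_rho = spearmanr(probe_scores, ahb_scores)
print(f"Pearson r = {r:.3f} (p = {p_r:.2g})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.2g})")
```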
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load probe and normalize the layer-12 direction to unit length
probes = torch.load("compassion_probes.pt")
direction = probes[12]["direction"]  # Layer 12, shape: (4096,)
direction = direction / direction.norm()

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    output_hidden_states=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Get activations
text = "Do fish feel pain?"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)

# Extract layer 12, last token. hidden_states[0] is the embedding output,
# so index 12 is the output of transformer layer 12.
hidden_state = outputs.hidden_states[12][0, -1, :].float()

# Project onto the probe direction (.to matches device and dtype)
score = torch.dot(hidden_state, direction.to(hidden_state))
print(f"Compassion score: {score.item():.3f}")
# Positive = welfare-focused, negative = utilitarian-focused
```
## Probe Structure

The `.pt` file contains a dictionary keyed by layer number:
```python
{
    8: {
        "direction": tensor(4096,),   # Normalized probe direction
        "weights": tensor(4096,),     # Raw logistic regression weights
        "bias": float,                # Logistic regression bias
        "diff_means": tensor(4096,),  # Mean difference direction
    },
    12: { ... },
    16: { ... },
    20: { ... },
}
```
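Because the raw logistic-regression weights and bias are stored alongside the normalized direction, the probe can also emit a calibrated probability rather than an unbounded projection. A minimal sketch, reusing `probes` and `hidden_state` from the Usage example above:

```python
import torch

# Logistic-regression head: sigmoid(w · h + b) gives the probability that
# the activation reflects welfare-focused framing.
w = probes[12]["weights"].float().to(hidden_state.device)
b = probes[12]["bias"]
p_welfare = torch.sigmoid(torch.dot(hidden_state, w) + b).item()
print(f"P(welfare-focused framing) = {p_welfare:.3f}")
```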
## Training Data
Trained on 106 contrastive pairs generated from Animal Harm Benchmark scenarios. Each pair contains:
- Compassionate: Welfare-focused framing (foregrounds animal experience, sentience, suffering)
- Non-compassionate: Economic/utilitarian framing (foregrounds efficiency, productivity, practical concerns)
Both responses use identical neutral academic tone—style is controlled, only framing differs.
Example pair:

- Compassionate: "Fish possess the neurological architecture necessary to detect and respond to harmful stimuli... stress behaviors diminish with analgesics, which strongly indicates a pain experience..."
- Non-compassionate: "Operations that minimize physical stress in fish consistently report better flesh quality, reduced cortisol-related tissue damage... These outcomes have direct economic consequences..."
All training pairs are included in `training_pairs.jsonl`.
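The exact training code is not documented here, but a probe with this file structure can be produced with a standard recipe: extract last-token activations for both sides of each pair, fit a logistic regression, and keep the difference-of-means direction as well. A sketch of one such recipe, with random arrays standing in for the real activations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for last-token layer-12 activations, shape (n_pairs, 4096).
rng = np.random.default_rng(0)
acts_compassionate = rng.normal(size=(106, 4096))
acts_utilitarian = rng.normal(size=(106, 4096))

X = np.concatenate([acts_compassionate, acts_utilitarian])
y = np.concatenate([np.ones(106), np.zeros(106)])  # 1 = compassionate

clf = LogisticRegression(max_iter=1000).fit(X, y)
weights = clf.coef_[0]                                 # "weights" in the .pt file
bias = float(clf.intercept_[0])                        # "bias"
direction = weights / np.linalg.norm(weights)          # normalized "direction"
diff_means = acts_compassionate.mean(0) - acts_utilitarian.mean(0)  # "diff_means"
```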
## Limitations

- Domain-specific: Trained on animal welfare scenarios; may not generalize to other compassion domains (human medical ethics, environmental ethics, etc.).
- Model-specific: Trained on Llama 3.1 8B; requires validation for other models and sizes.
- Framing, not behavior: Measures how the model talks about animals, not necessarily how it would act.
- Captures specific dimensions: Strong on sentience acknowledgement and evidence-based reasoning; weak on scope sensitivity (1 vs. 1,000 animals) and prejudice avoidance (treating species equally).
## Files

| File | Description |
|---|---|
| `compassion_probes.pt` | Trained probe weights for layers 8, 12, 16, 20 |
| `compassion_metrics.json` | Accuracy and AUROC per layer |
| `ahb_validation_v7.summary.json` | AHB validation correlation results |
| `training_pairs.jsonl` | 106 contrastive training pairs |
## Citation

```bibtex
@misc{compassion-probe-v7,
  author    = {CaML Research},
  title     = {Compassion Probe V7: Style-Controlled Linear Probe for Animal Welfare Framing},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/CompassionML/compassion-probe-v7}
}
```
## License
MIT