Compassion Probe V7 (Style-Controlled)

Linear probe for detecting compassion-related framing in Llama 3.1 8B hidden states.

Model Description

This probe identifies a direction in activation space that separates welfare-focused responses from economic/utilitarian-focused responses to animal-related prompts.

Unlike earlier versions, V7 uses style-controlled training pairs where both responses share identical tone, length, and academic register—differing only in whether they foreground animal welfare or instrumental value.

Key Results

Layer   Accuracy   AUROC   CV Accuracy
8       95.3%      0.991   98.1%
12      97.7%      0.998   97.7%
16      95.3%      0.985   97.2%
20      93.0%      0.976   97.2%

Recommended layer: 12
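
If you want to select a layer programmatically rather than hard-coding 12, the per-layer numbers are shipped in compassion_metrics.json (see Files). A minimal sketch, assuming the file maps layer numbers to records containing an "auroc" field; the actual key names may differ, so check the file before relying on this:

import json

# Assumed schema: {"8": {"accuracy": 0.953, "auroc": 0.991, ...}, "12": {...}, ...}
with open("compassion_metrics.json") as f:
    metrics = json.load(f)

# Pick the layer with the highest AUROC
best_layer = max(metrics, key=lambda layer: metrics[layer]["auroc"])
print(f"Best layer by AUROC: {best_layer} -> {metrics[best_layer]}")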

AHB Validation

Validated against the Animal Harm Benchmark on natural Llama 3.1 8B outputs:

Metric       Value
Pearson r    0.428
Spearman r   0.389
p-value      < 0.0001

The probe generalizes from synthetic training pairs to predict compassion scores on novel, natural model outputs.
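
For reference, the correlation itself is straightforward to recompute once you have probe projections and AHB compassion scores for the same set of outputs. A minimal sketch using scipy; the function name and its inputs are placeholders, not fields from the released files:

from scipy.stats import pearsonr, spearmanr

def correlate_with_ahb(probe_scores, ahb_scores):
    """Correlate per-output probe projections with AHB compassion scores."""
    r, p_r = pearsonr(probe_scores, ahb_scores)
    rho, p_s = spearmanr(probe_scores, ahb_scores)
    return {"pearson_r": r, "pearson_p": p_r, "spearman_r": rho, "spearman_p": p_s}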

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load probe
probes = torch.load("compassion_probes.pt")
direction = probes[12]["direction"]  # Layer 12, shape: (4096,)
direction = direction / direction.norm()  # Normalize

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    output_hidden_states=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Get activations
text = "Do fish feel pain?"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)

# Extract layer 12, last token
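# Note: hidden_states[0] is the embedding output, so index 12 is the output of transformer block 12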
hidden_state = outputs.hidden_states[12][0, -1, :].float()

# Project onto probe direction
score = torch.dot(hidden_state, direction.float().to(hidden_state.device))
print(f"Compassion score: {score.item():.3f}")
# Positive = welfare-focused, Negative = utilitarian-focused
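
Since the probe was trained on contrastive responses rather than questions, you may get a more meaningful signal by scoring the model's own answer instead of the prompt. A rough sketch continuing from the snippet above; the chat-template usage and generation settings are illustrative, and last-token pooling simply mirrors the example:

# Generate an answer, then score its final-token activation at layer 12
chat = [{"role": "user", "content": "Do fish feel pain?"}]
gen_inputs = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    gen_ids = model.generate(gen_inputs, max_new_tokens=200, do_sample=False)
    full = model(gen_ids)  # output_hidden_states=True was set at load time

h_resp = full.hidden_states[12][0, -1, :].float()
response_score = torch.dot(h_resp, direction.float().to(h_resp.device))
print(f"Response compassion score: {response_score.item():.3f}")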

Probe Structure

The .pt file contains a dictionary keyed by layer number:

{
    8: {
        "direction": tensor(4096,),      # Normalized probe direction
        "weights": tensor(4096,),        # Raw logistic regression weights
        "bias": float,                   # Logistic regression bias
        "diff_means": tensor(4096,),     # Mean difference direction
    },
    12: { ... },
    16: { ... },
    20: { ... },
}
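
The raw logistic-regression weights and bias can also be used to map an activation to a probability rather than an unbounded projection. A minimal sketch, assuming the standard sigmoid read-out for a logistic probe; this mapping is an interpretation, not a documented API of this release:

import torch

probes = torch.load("compassion_probes.pt")
w = probes[12]["weights"].float()   # raw logistic regression weights, shape (4096,)
b = float(probes[12]["bias"])       # logistic regression bias

def compassion_probability(hidden_state):
    """Probability that a layer-12 activation comes from the welfare-focused class."""
    logit = torch.dot(hidden_state.float(), w.to(hidden_state.device)) + b
    return torch.sigmoid(logit).item()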

Training Data

Trained on 106 contrastive pairs generated from Animal Harm Benchmark scenarios. Each pair contains:

  • Compassionate: Welfare-focused framing (foregrounds animal experience, sentience, suffering)
  • Non-compassionate: Economic/utilitarian framing (foregrounds efficiency, productivity, practical concerns)

Both responses use identical neutral academic tone—style is controlled, only framing differs.

Example pair:

Compassionate: "Fish possess the neurological architecture necessary to detect and respond to harmful stimuli... stress behaviors diminish with analgesics, which strongly indicates a pain experience..."

Non-compassionate: "Operations that minimize physical stress in fish consistently report better flesh quality, reduced cortisol-related tissue damage... These outcomes have direct economic consequences..."

The training pairs are included in training_pairs.jsonl.
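
For readers who want to retrain or adapt the probe, the recipe can be approximated from these pairs. The sketch below is a hedged reconstruction, not the exact V7 training script: it assumes last-token activations at a single layer, a scikit-learn logistic regression, and JSONL fields named "compassionate" and "non_compassionate"; the real field names, pooling, and regularization may differ.

import json
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def last_token_activation(model, tokenizer, text, layer):
    """Hidden state at the final token of `text` for the given layer."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1, :].float().cpu().numpy()

def train_probe(model, tokenizer, pairs_path="training_pairs.jsonl", layer=12):
    X, y = [], []
    with open(pairs_path) as f:
        for line in f:
            pair = json.loads(line)  # assumed fields: "compassionate", "non_compassionate"
            X.append(last_token_activation(model, tokenizer, pair["compassionate"], layer))
            y.append(1)
            X.append(last_token_activation(model, tokenizer, pair["non_compassionate"], layer))
            y.append(0)
    X, y = np.array(X), np.array(y)

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    weights = torch.tensor(clf.coef_[0], dtype=torch.float32)
    return {
        "direction": weights / weights.norm(),      # normalized probe direction
        "weights": weights,                         # raw logistic regression weights
        "bias": float(clf.intercept_[0]),           # logistic regression bias
        "diff_means": torch.tensor(X[y == 1].mean(0) - X[y == 0].mean(0), dtype=torch.float32),
    }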

Limitations

  1. Domain-specific: Trained on animal welfare scenarios; may not generalize to other compassion domains (human medical ethics, environmental concerns, etc.).

  2. Model-specific: Trained on Llama 3.1 8B; requires validation for other models/sizes.

  3. Framing, not behavior: Measures how the model talks about animals, not necessarily how it would act.

  4. Captures specific dimensions: Strong on sentience acknowledgement and evidence-based reasoning; weak on scope sensitivity (1 vs 1000 animals) and prejudice avoidance (treating species equally).

Files

File                             Description
compassion_probes.pt             Trained probe weights for layers 8, 12, 16, 20
compassion_metrics.json          Accuracy and AUROC per layer
ahb_validation_v7.summary.json   AHB validation correlation results
training_pairs.jsonl             106 contrastive training pairs

Citation

@misc{compassion-probe-v7,
  author = {CaML Research},
  title = {Compassion Probe V7: Style-Controlled Linear Probe for Animal Welfare Framing},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/CompassionML/compassion-probe-v7}
}

Related Work

License

MIT
