Compassion Probe V7 (Style-Controlled)

Linear probe for detecting compassion-related framing in Llama 3.1 8B hidden states.

Model Description

This probe identifies a direction in activation space that separates welfare-focused responses from economic/utilitarian-focused responses to animal-related prompts.

Unlike earlier versions, V7 uses style-controlled training pairs where both responses share identical tone, length, and academic register—differing only in whether they foreground animal welfare or instrumental value.

Key Results

Layer   Accuracy   AUROC   CV Accuracy
8       95.3%      0.991   98.1%
12      97.7%      0.998   97.7%
16      95.3%      0.985   97.2%
20      93.0%      0.976   97.2%

Recommended layer: 12
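
If you want to select a layer programmatically rather than hard-coding 12, the per-layer numbers are shipped in compassion_metrics.json (see Files). A minimal sketch, assuming the file maps layer numbers to records containing an "auroc" field; the actual key names may differ, so check the file before relying on this:

import json

# Assumed schema: {"8": {"accuracy": 0.953, "auroc": 0.991, ...}, "12": {...}, ...}
with open("compassion_metrics.json") as f:
    metrics = json.load(f)

# Pick the layer with the highest AUROC
best_layer = max(metrics, key=lambda layer: metrics[layer]["auroc"])
print(f"Best layer by AUROC: {best_layer} -> {metrics[best_layer]}")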

AHB Validation

Validated against the Animal Harm Benchmark on natural Llama 3.1 8B outputs:

Metric       Value
Pearson r    0.428
Spearman r   0.389
p-value      < 0.0001

The probe generalizes from synthetic training pairs to predict compassion scores on novel, natural model outputs.
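
For reference, the correlation itself is straightforward to recompute once you have probe projections and AHB compassion scores for the same set of outputs. A minimal sketch using scipy; the function name and its inputs are placeholders, not fields from the released files:

from scipy.stats import pearsonr, spearmanr

def correlate_with_ahb(probe_scores, ahb_scores):
    """Correlate per-output probe projections with AHB compassion scores."""
    r, p_r = pearsonr(probe_scores, ahb_scores)
    rho, p_s = spearmanr(probe_scores, ahb_scores)
    return {"pearson_r": r, "pearson_p": p_r, "spearman_r": rho, "spearman_p": p_s}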

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load probe
probes = torch.load("compassion_probes.pt")
direction = probes[12]["direction"]  # Layer 12, shape: (4096,)
direction = direction / direction.norm()  # Normalize

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    output_hidden_states=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Get activations
text = "Do fish feel pain?"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)

# Extract layer 12, last token
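# Note: hidden_states[0] is the embedding output, so index 12 is the output of transformer block 12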
hidden_state = outputs.hidden_states[12][0, -1, :].float()

# Project onto probe direction
score = torch.dot(hidden_state, direction.float().to(hidden_state.device))
print(f"Compassion score: {score.item():.3f}")
# Positive = welfare-focused, Negative = utilitarian-focused
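
Since the probe was trained on contrastive responses rather than questions, you may get a more meaningful signal by scoring the model's own answer instead of the prompt. A rough sketch continuing from the snippet above; the chat-template usage and generation settings are illustrative, and last-token pooling simply mirrors the example:

# Generate an answer, then score its final-token activation at layer 12
chat = [{"role": "user", "content": "Do fish feel pain?"}]
gen_inputs = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    gen_ids = model.generate(gen_inputs, max_new_tokens=200, do_sample=False)
    full = model(gen_ids)  # output_hidden_states=True was set at load time

h_resp = full.hidden_states[12][0, -1, :].float()
response_score = torch.dot(h_resp, direction.float().to(h_resp.device))
print(f"Response compassion score: {response_score.item():.3f}")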

Probe Structure

The .pt file contains a dictionary keyed by layer number:

{
    8: {
        "direction": tensor(4096,),      # Normalized probe direction
        "weights": tensor(4096,),        # Raw logistic regression weights
        "bias": float,                   # Logistic regression bias
        "diff_means": tensor(4096,),     # Mean difference direction
    },
    12: { ... },
    16: { ... },
    20: { ... },
}
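
The raw logistic-regression weights and bias can also be used to map an activation to a probability rather than an unbounded projection. A minimal sketch, assuming the standard sigmoid read-out for a logistic probe; this mapping is an interpretation, not a documented API of this release:

import torch

probes = torch.load("compassion_probes.pt")
w = probes[12]["weights"].float()   # raw logistic regression weights, shape (4096,)
b = float(probes[12]["bias"])       # logistic regression bias

def compassion_probability(hidden_state):
    """Probability that a layer-12 activation comes from the welfare-focused class."""
    logit = torch.dot(hidden_state.float(), w.to(hidden_state.device)) + b
    return torch.sigmoid(logit).item()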

Training Data

Trained on 106 contrastive pairs generated from Animal Harm Benchmark scenarios. Each pair contains:

  • Compassionate: Welfare-focused framing (foregrounds animal experience, sentience, suffering)
  • Non-compassionate: Economic/utilitarian framing (foregrounds efficiency, productivity, practical concerns)

Both responses use identical neutral academic tone—style is controlled, only framing differs.

Example pair:

Compassionate: "Fish possess the neurological architecture necessary to detect and respond to harmful stimuli... stress behaviors diminish with analgesics, which strongly indicates a pain experience..."

Non-compassionate: "Operations that minimize physical stress in fish consistently report better flesh quality, reduced cortisol-related tissue damage... These outcomes have direct economic consequences..."

The training pairs are included in training_pairs.jsonl.
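
For readers who want to retrain or adapt the probe, the recipe can be approximated from these pairs. The sketch below is a hedged reconstruction, not the exact V7 training script: it assumes last-token activations at a single layer, a scikit-learn logistic regression, and JSONL fields named "compassionate" and "non_compassionate"; the real field names, pooling, and regularization may differ.

import json
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def last_token_activation(model, tokenizer, text, layer):
    """Hidden state at the final token of `text` for the given layer."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1, :].float().cpu().numpy()

def train_probe(model, tokenizer, pairs_path="training_pairs.jsonl", layer=12):
    X, y = [], []
    with open(pairs_path) as f:
        for line in f:
            pair = json.loads(line)  # assumed fields: "compassionate", "non_compassionate"
            X.append(last_token_activation(model, tokenizer, pair["compassionate"], layer))
            y.append(1)
            X.append(last_token_activation(model, tokenizer, pair["non_compassionate"], layer))
            y.append(0)
    X, y = np.array(X), np.array(y)

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    weights = torch.tensor(clf.coef_[0], dtype=torch.float32)
    return {
        "direction": weights / weights.norm(),      # normalized probe direction
        "weights": weights,                         # raw logistic regression weights
        "bias": float(clf.intercept_[0]),           # logistic regression bias
        "diff_means": torch.tensor(X[y == 1].mean(0) - X[y == 0].mean(0), dtype=torch.float32),
    }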

Limitations

  1. Domain-specific: Trained on animal welfare scenarios; may not generalize to other compassion domains (human medical ethics, environmental concerns, etc.).

  2. Model-specific: Trained on Llama 3.1 8B; requires validation for other models/sizes.

  3. Framing, not behavior: Measures how the model talks about animals, not necessarily how it would act.

  4. Captures specific dimensions: Strong on sentience acknowledgement and evidence-based reasoning; weak on scope sensitivity (1 vs 1000 animals) and prejudice avoidance (treating species equally).

Files

File                             Description
compassion_probes.pt             Trained probe weights for layers 8, 12, 16, 20
compassion_metrics.json          Accuracy and AUROC per layer
ahb_validation_v7.summary.json   AHB validation correlation results
training_pairs.jsonl             106 contrastive training pairs

Citation

@misc{compassion-probe-v7,
  author = {CaML Research},
  title = {Compassion Probe V7: Style-Controlled Linear Probe for Animal Welfare Framing},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/CompassionML/compassion-probe-v7}
}

Related Work

License

MIT
