From Golden Gate Bridge to Broken JSON: Why Anthropic's SAE Steering Fails for Structured Output

Community Article Published February 7, 2026

A 6-Experiment Journey from 86.8% to 24.4% Valid JSON (and How I Finally Hit 100%)



After six experiments and dozens of failed attempts, I learned something I did not expect: activation steering, the technique Anthropic uses for AI safety, completely fails for one of the most common tasks in production LLM deployments: generating valid JSON.

And I don't mean "fails to help." My steering-only approach achieved 24.4% valid JSON, compared to 86.8% from the completely untrained base model. Steering made the model worse than doing nothing at all.

Here's what I learned, why it matters, and what actually works when you need guaranteed structured outputs from decoder-only language models.

The Promise (and the Problem)

Modern decoder-only LLMs can generate structured data (JSON, XML, YAML) with impressive results. Fine-tune a model like Qwen's Qwen2.5-0.5B on your task, and you might see 96-97% valid JSON outputs. For research demos, that's fantastic. For production systems that parse these outputs automatically? Not so much.

That 3-4% failure rate means crashed parsers, failed API calls, silent data corruption, and 3 AM pages for your on-call engineer.

Figure 1: The JSON Problem in Production

User Input → LLM → JSON Parser → Database
                      ↓ (3-4% failure)
                   💥 500 Error

When I started this research, activation steering seemed like a perfect fit. Anthropic had just published their work on Sparse Autoencoders (SAEs), showing how steering could make Claude obsessively talk about the Golden Gate Bridge, reduce harmful outputs, and modify personality traits, all without retraining.

If steering could make a model safer, surely it could make a model generate valid JSON?

It couldn't. But understanding why taught me something important about how LLMs actually work and how to control them.

Background: The Appeal of Activation Steering

A quick background on what activation steering is and why it seemed so promising.

What is Activation Steering?

Activation steering modifies internal neural activations during the model's forward pass to influence behavior. Instead of changing the model's weights (fine-tuning) or manipulating the input text (prompt engineering), you directly adjust the intermediate representations as information flows through the network.

The math is simple:

h̃ = h + α · v_steer

Where:
  h = original activation vector at some layer
  v_steer = steering direction (computed from examples)
  α = steering strength (scalar multiplier)
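
To make this concrete, here is a minimal sketch of inference-time steering with a PyTorch forward hook on a HuggingFace model. The layer index, steering strength, and the random placeholder v_steer are illustrative (in practice v_steer comes from contrastive examples, as shown later); the hook simply applies h̃ = h + α · v_steer to the hidden states leaving one decoder layer.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

alpha = 1.0                                        # steering strength
layer_idx = 13                                     # one of the targeted middle layers
v_steer = torch.randn(model.config.hidden_size)    # placeholder; normally computed from examples

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + alpha * v_steer.to(hidden.device, hidden.dtype)
    if isinstance(output, tuple):
        return (steered,) + output[1:]
    return steered

handle = model.model.layers[layer_idx].register_forward_hook(steering_hook)
# ... run model.generate(...) while the hook is active ...
handle.remove()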

Anthropic's Success Story

In their 2024 paper "Scaling Monosemanticity," Anthropic's interpretability team demonstrated remarkable results with SAE-based steering. They trained sparse autoencoders on Claude 3 Sonnet's activations, extracting approximately 34 million interpretable features, each corresponding to semantic concepts like "Golden Gate Bridge," "code quality," "gender bias," or "deceptive language."

The famous example: when researchers clamped the "Golden Gate Bridge" feature to maximum activation, Claude started claiming to be the Golden Gate Bridge, relating every topic back to the bridge, expressing emotions about bridges and San Francisco. Other successes included reducing buggy code generation, mitigating gender and political biases, and detecting harmful outputs.

The appeal: change model behavior without retraining, compose multiple steering vectors, and adjust strength at inference time. For JSON generation, this seemed like a natural fit.

The Task: PII Extraction with JSON Output

I needed to extract Personally Identifiable Information (PII) from text and output it in a strict JSON format. Here's what I was working with:

Model: Qwen/Qwen2.5-0.5B (494M parameters, decoder-only architecture)

Dataset: Nemotron-PII (30,000 samples covering 55 PII entity types including emails, names, addresses, SSNs, medical records, etc.)

Target Output Format:

[
  {"text": "john@example.com", "label": "email"},
  {"text": "John Doe", "label": "name"},
  {"text": "555-1234", "label": "phone_number"}
]

Baseline Performance (untrained model):

  • Valid JSON: 86.8%
  • Micro F1: 7.7%

The goal: ≥99.9% valid JSON while maintaining high F1 scores for entity extraction.


Experiment 1: Bias Steering via QKV Projection

The idea: permanently bake a "JSON-likeness" direction into the model's bias parameters so every forward pass naturally favors valid JSON syntax.

The Approach

I started with the most direct thing I could think of: modifying the bias parameters in the model's attention layers. I targeted the Query, Key, and Value projection biases in layers 12-15 (roughly the middle layers of the 24-layer architecture).

The implementation:

  1. Extract a steering vector from paired examples (good JSON vs. malformed JSON)
  2. Compute the difference in activation patterns
  3. Modify biases: b̃_proj = b_proj + α · (W_proj @ v_steer)
  4. Test with steering strengths α ∈ {0.1, 0.5, 1.0, 2.0}

This approach had one major advantage: no inference-time overhead. The steering would be "baked in" permanently.
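
For reference, here is a minimal sketch of what this bias edit looks like in code. It assumes `model` is the loaded Qwen2.5-0.5B and `v_steer` is a precomputed steering vector of size hidden_size (both assumptions; this is not the exact experiment script).

import torch

alpha = 0.5  # one of the tested steering strengths

with torch.no_grad():
    for layer_idx in range(12, 16):                      # layers 12-15
        attn = model.model.layers[layer_idx].self_attn
        for proj in (attn.q_proj, attn.k_proj, attn.v_proj):
            if proj.bias is None:
                continue
            # b̃_proj = b_proj + α · (W_proj @ v_steer): every forward pass is now nudged
            delta = proj.weight @ v_steer.to(proj.weight.dtype)
            proj.bias.add_(alpha * delta)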

Figure 2: QKV Projection Bias Modification

Bias modification at layers 12-15, targeting Query, Key, and Value projections.

Results

Valid JSON: ~70% (estimated)
Micro F1:   ~20% (estimated)
Verdict:    ❌ Performance degraded vs baseline

The model's outputs were worse than the untrained baseline. JSON structures were frequently malformed, with mismatched brackets, incorrect comma placement, and violated type constraints.

Why It Failed

Three problems:

1. Projection Matrix Dilution. The steering vector gets multiplied by the projection matrix W_proj, which spreads the signal across multiple attention heads. What started as a focused "JSON-structure" direction became diluted semantic noise spread across the layer's attention heads.

2. Wrong Substrate. Attention biases control what the model attends to (which parts of the input to focus on). They don't control how the model formats its output. I was trying to fix grammar by adjusting what someone pays attention to.

3. Static Solution for Dynamic Problem. JSON generation requires context-dependent decisions: "Should I add a comma here?" depends on position in the array. "Do I need a closing bracket?" depends on nesting depth. A static bias can't make these state-dependent choices.

Bottom line: modifying attention biases affects semantic focus, not syntactic structure.


Experiment 2: Establishing the Upper Bound with Fine-Tuning

Before spending more time on steering, I needed to know: can this model even learn the task?

The Approach

Standard supervised fine-tuning:

  • 1,000 training samples from Nemotron-PII
  • Full parameter fine-tuning (all 494M parameters trainable)
  • 3 epochs, learning rate 2e-5
  • Training time: ~30 minutes on a single A100 GPU

No steering. No tricks. Just gradient descent on the cross-entropy loss.
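
For reference, a minimal sketch of this setup with the HuggingFace Trainer, using the hyperparameters above. The batch size and the `train_dataset` preparation (tokenized prompt/completion pairs with labels) are assumptions, not the exact training script.

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

args = TrainingArguments(
    output_dir="qwen25-pii-json",        # illustrative name
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,       # assumption; not stated in the text
    bf16=True,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()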

Results

Figure 3: Fine-Tuning Results on 500 Test Samples

Metric            Score
Valid JSON Rate   96.6%
Micro Precision   0.944
Micro Recall      0.867
Micro F1          0.912
Macro F1          0.819

The model has sufficient capacity for the task.

What This Told Me

This proved three things: the model can learn JSON formatting, it can extract PII entities accurately, and 494M parameters is enough capacity.

But it also raised a question I should have taken more seriously: if fine-tuning works this well, why would I even need steering?


Experiment 3: Steering-Only (The One That Changed Everything)

This is the experiment that made me rethink the whole approach.

The Idea

What if I skip training entirely and just use pure activation steering at inference time? Compute a steering vector from 100 paired examples and add it to the residual stream during generation.

The purest test of steering: no weight modifications, no training, just runtime activation manipulation.

The Approach

  1. Collect 100 paired examples:

    • Positive: Well-formed JSON with correct PII extraction
    • Negative: Malformed JSON or incorrect extractions
  2. Compute steering vector:

    v_steer = mean(activations_positive) - mean(activations_negative)
    
  3. Add to residual stream at layers 12-15 during inference:

    h̃ = h + α · v_steer
    
  4. Test with α ∈ {0.5, 1.0, 2.0, 5.0}

No training. Just direct activation manipulation.
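
For reference, a minimal sketch of how such a contrastive steering vector can be computed from mean hidden states. It assumes `model` and `tokenizer` are already loaded and that `positive_examples` / `negative_examples` hold the 100 paired texts (illustrative names).

import torch

def mean_activation(texts, layer_idx):
    """Average last-token hidden state at `layer_idx` over a list of texts."""
    acts = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so layer_idx + 1 is layer_idx's output
        acts.append(out.hidden_states[layer_idx + 1][0, -1, :])
    return torch.stack(acts).mean(dim=0)

v_steer = mean_activation(positive_examples, layer_idx=13) - mean_activation(negative_examples, layer_idx=13)
v_steer = v_steer / v_steer.norm()   # typically normalized before scaling by α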

Results

Table 1: Comparison on 500 Test Samples

Method        Valid JSON   Micro F1   Macro F1   Verdict
Base model    86.8%        7.7%       0.5%       Baseline
Steering      24.4%        1.5%       0.2%       CATASTROPHIC
Fine-tuning   96.6%        91.2%      81.9%      Excellent

Steering reduced valid JSON by 62.4 percentage points. It wasn't merely ineffective; it was actively harmful.

Failure Mode Analysis

Looking at the actual outputs tells you everything:

Baseline Output (untrained, 86.8% valid):

[{"text": "john@example.com", "label": "email"}, {"text": "John Doe", "label": "name"}

Mostly correct. Occasional missing bracket at the end, but structure is intact.

Steering Output (24.4% valid):

[{"text: john@example.com, label": email}, {text": John Doe, "label: name}

Complete structural collapse. Missing opening quotes, inconsistent quote placement, commas in wrong positions, mixed quote types. The outputs looked like they came from something that had heard about JSON but never actually seen it.

Why This Happened

1. No Task Learning. Steering can't teach new tasks. It can only modulate existing behaviors. The untrained model had weak JSON priors from pre-training, and steering didn't strengthen them, it distorted them.

2. Semantic Bias Without Structural Understanding. The steering vector pushed activations toward "JSON-likeness" as a semantic concept, which is not the same thing as "JSON-validity" as a syntactic constraint. Think: teaching someone to "sound smart" vs. teaching them logic.

3. Disruption of Baseline Competence. The untrained model uses general language understanding to produce somewhat-valid JSON. Adding steering disrupted those baseline capabilities and actively made the model worse than doing nothing.

At this point I should have stopped. But I wasn't ready to give up on steering yet.


Experiment 4: Can Steering Help Fine-Tuned Models?

After the steering-only disaster, I thought: maybe steering fails alone because the model needs task knowledge first. What if I combine fine-tuning and steering? Fine-tuning teaches the task, steering refines the output format.

Four Variants Tested

  1. Selective Layer FT (no steering): Freeze all except layers 12-15, fine-tune for JSON formatting
  2. Selective FT + Inference Steering: Add steering vector at inference after training
  3. Full FT + Inference Steering: Fine-tune all parameters, then add steering
  4. Full FT with Steering During Training: Incorporate steering directly in training loop

That fourth variant deserves some attention. Unlike inference-time steering (which adds vectors after training is complete), training-time steering modifies activations during the forward pass while gradients are being computed. The idea: if the model learns to produce correct outputs while being steered, maybe it would internalize both the task knowledge and the steering direction at the same time.

Why this is interesting:

  • Co-adaptation: Weights can adapt to the presence of steering vectors, learning to work with rather than against them
  • Targeted Capability Injection: Steering during training could inject specific capabilities without explicit supervision
  • Preservation of Other Abilities: Because you're only modifying activations in a targeted direction, the model's other capabilities should stay intact

I should note that while training-time steering failed for JSON generation, it shows real promise for other task types. In separate experiments on mathematical reasoning and translation tasks, I've seen training-time steering successfully enhance specific capabilities without degrading performance on unrelated benchmarks. The difference comes down to the nature of the task: semantic capabilities (mathematical insight, language fluency) respond well to this approach, while syntactic constraints do not. This is an active area of my research that I plan to publish separately.
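
For concreteness, a minimal sketch of what variant 4 looks like mechanically: the same kind of forward hook shown in the background section stays registered while an ordinary training loop runs, so the task loss is computed on steered activations and gradients flow through the perturbed forward pass. The optimizer settings and `train_dataloader` are assumptions, not the exact training script.

from torch.optim import AdamW

handle = model.model.layers[13].register_forward_hook(steering_hook)   # hook from the earlier sketch
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in train_dataloader:          # assumed DataLoader of tokenized samples with labels
    outputs = model(**batch)            # labels included, so outputs.loss is the cross-entropy loss
    outputs.loss.backward()             # gradients flow through the steered activations
    optimizer.step()
    optimizer.zero_grad()

handle.remove()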

Results

Table 2: Fine-Tuning + Steering Results (500 samples)

Method                               Valid JSON   Change vs Baseline
Selective Layer FT (no steering)     67.4%        baseline
Selective FT + Inference Steering    42.2%        -25.2pp
Full FT + Inference Steering         36.8%        -30.6pp
Full FT + Steering During Training   31.6%        -35.8pp

Every single steering variant degraded performance. The more aggressive the steering, the worse the results.

The Pattern

1. Consistent Degradation. Adding steering at inference time reduced valid JSON by 25-31 percentage points every time, regardless of the fine-tuning approach.

2. Training-Time Steering Was Worse. Incorporating steering during training (variant 4) produced the worst results (-35.8pp). Steering wasn't just ineffective, it was interfering with gradient-based learning.

This makes sense when you think about it. Training-time steering creates an optimization landscape where the model must both minimize the task loss and operate coherently under activation perturbation. For syntactic tasks like JSON generation, these objectives conflict. The model can't learn discrete state-tracking while its activations are being pushed in a semantic direction.

But for semantic tasks the story is different. When I applied this same training-time steering approach to enhance mathematical reasoning, the model learned to leverage the steering signal constructively. In translation experiments, steering toward "fluency" during training produced models that kept their general language understanding while becoming measurably more fluent in target languages. The takeaway: training-time steering works for semantic enhancement, not syntactic enforcement.

3. Linear Degradation with Strength. I tested steering strengths α ∈ {0.5, 1.0, 2.0}. Performance degraded linearly with α. No sweet spot. Any amount of steering hurt.

Why Steering Interfered

Optimization Mismatch

Fine-tuning learns precise weight configurations through gradient descent. Each layer picks up a specific role:

  • Layer 12 learns to recognize entity boundaries
  • Layer 13 learns to construct JSON array structure
  • Layer 14 learns to format key-value pairs
  • Layer 15 learns closing bracket placement

These weights form a delicate equilibrium. Adding a steering vector pushes activations into regions where the tuned weights don't work properly. It's like training a pianist to play Chopin, then randomly adjusting their hand position mid-performance: the training never accounted for that.

Distribution Shift

Fine-tuned models operate in a specialized activation subspace. Activations cluster around "valid PII extraction + correct JSON formatting" and the weights are optimized for that distribution. Steering vectors, computed on base model activations, push representations out of this subspace and into regions where the fine-tuned weights produce garbage.

Conflicting Objectives

Fine-tuning optimizes for L = -log P(correct_JSON | input). Steering pushes activations toward a "JSON-like" semantic direction. These are different objectives. One is task-specific and discrete (valid JSON or not). The other is semantic and continuous (more or less JSON-like).

At this point it was clear: steering does not work for structured output generation.


Interlude: What Anthropic Does Differently

This is what I couldn't figure out. Anthropic's SAE steering works great for AI safety. Why doesn't it work for JSON?

Anthropic's Success Cases Revisited

Looking at what Anthropic's steering actually accomplishes:

1. Golden Gate Bridge Obsession

  • Feature: 34M_20646 (detected from millions of SAE features)
  • Effect: Model claims to BE the bridge, relates everything to bridges
  • Nature: Semantic concept, continuous (can be more/less obsessed)

2. Code Quality and Security

  • Features: "buggy code," "backdoors," "security vulnerabilities"
  • Effect: Reduced generation of malicious code
  • Nature: Behavioral pattern, continuous (more/less dangerous)

3. Bias Reduction

  • Features: Gender bias, political views, stereotypical language
  • Effect: Reduced biased outputs
  • Nature: Behavioral trait, continuous (more/less biased)

The pattern: all of these are semantic concepts (high-level meaning), continuous scales (more/less X), and behavioral patterns (tendencies, not rules).

My Task: Fundamentally Different

Now contrast with JSON generation:

JSON Validity Requirements:

  • Opening { must be matched by closing }
  • String values must be quoted with "
  • Array elements must be comma-separated (but no trailing comma)
  • Keys must be quoted strings
  • No unescaped control characters in strings

These are:

  • ❌ Structural constraints (low-level syntax)
  • ❌ Binary states (valid/invalid, no middle ground)
  • ❌ Stateful rules (requires tracking nesting depth, quote state)

Table 3: Semantic vs. Structural Tasks

Dimension     Anthropic's Tasks (Semantic)           My Task (JSON Generation)
Nature        High-level concepts, behaviors         Low-level syntax rules
Constraints   Soft, continuous (more/less X)         Hard, binary (valid/invalid)
Success       Subjective, gradual                    Objective, discrete
Example       "Be more helpful"                      "Ensure closing } matches opening {"
Mechanism     Semantic representation manipulation   State machine enforcement

The Activation Space Problem

Here's a way to visualize what steering can and cannot do:

What Steering Can Control (Semantic Space):

helpful ←——————————→ unhelpful          (continuous axis)
formal  ←——————————→ casual             (continuous axis)
verbose ←——————————→ concise            (continuous axis)
JSON-like ←————————→ prose-like         (continuous axis) ✅

You can steer along these axes because they represent semantic directions in the model's learned representation space. Features are distributed across these continuous dimensions.

What Steering Cannot Control (Syntactic States):

valid_json = {true, false}                      (discrete binary state) ❌
bracket_count = {..., -2, -1, 0, +1, +2, ...}  (stateful counter) ❌
quote_state = {OPEN, CLOSED}                    (FSM state) ❌
inside_string = {true, false}                   (context flag) ❌

These aren't semantic directions. They're discrete states that have to be tracked across token generation. Counters, flags, and state machines.

Steering operates on semantic representations. JSON validity is not a semantic concept. It's a state machine.

You can make a model "talk more about JSON" or "prefer structured outputs," but you can't make it "track bracket nesting depth" or "remember if a string is open" through activation manipulation alone.

This explains why Anthropic succeeds with safety steering (semantic concepts), why I failed with JSON steering (syntactic states), and why steering degraded fine-tuned models (it disrupted learned syntax patterns).

Steering is a volume knob for semantic concepts, not a parser for syntax rules.


Experiment 5: What Actually Worked (Constrained Decoding)

After four failures I gave up on activation manipulation entirely and tried something different: enforce syntax at the token level during generation.

Constrained decoding transforms generation from "predict any token" to "predict only valid tokens according to a formal grammar." I implemented a custom FSM-based LogitsProcessor that enforces JSON structure at the token level.

Implementation: Custom FSM LogitsProcessor

The core idea is simple: at each generation step, track the current JSON parsing state and mask out any tokens that would result in invalid JSON.

from transformers import LogitsProcessor
from collections import defaultdict
import torch

class JSONArrayLogitsProcessor(LogitsProcessor):
    """FSM-based logits processor for JSON array generation."""

    # FSM states for JSON array parsing
    STATES = [
        "START",           # Expecting '['
        "ARRAY_OPEN",      # Inside array, expecting '{' or ']'
        "OBJECT_START",    # Inside object, expecting '"' for key
        "IN_KEY",          # Inside key string
        "AFTER_KEY",       # After key, expecting ':'
        "AFTER_COLON",     # After ':', expecting '"' for value
        "IN_VALUE",        # Inside value string
        "AFTER_VALUE",     # After value, expecting ',' or '}'
        "AFTER_OBJECT",    # After '}', expecting ',' or ']'
        "DONE"             # After ']', only EOS allowed
    ]

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.state = "START"
        self._build_token_maps()

    def _build_token_maps(self):
        """Precompute which tokens start with which characters."""
        self.tokens_starting_with = defaultdict(set)
        for token_id in range(self.tokenizer.vocab_size):
            decoded = self.tokenizer.decode([token_id])
            if decoded:
                self.tokens_starting_with[decoded[0]].add(token_id)

    def __call__(self, input_ids, scores):
        # Get valid tokens for current FSM state
        valid_tokens = self._get_valid_tokens()

        # Create mask: -inf for invalid tokens
        mask = torch.full_like(scores, float('-inf'))
        for token_id in valid_tokens:
            if token_id < scores.shape[-1]:
                mask[:, token_id] = 0

        return scores + mask

The FSM tracks state transitions as tokens are generated:

# State transitions based on generated character
TRANSITIONS = {
    "START": {"[": "ARRAY_OPEN"},
    "ARRAY_OPEN": {"{": "OBJECT_START", "]": "DONE"},
    "OBJECT_START": {'"': "IN_KEY"},
    "IN_KEY": {'"': "AFTER_KEY"},  # Simplified
    "AFTER_KEY": {":": "AFTER_COLON"},
    "AFTER_COLON": {'"': "IN_VALUE"},
    "IN_VALUE": {'"': "AFTER_VALUE"},
    "AFTER_VALUE": {",": "OBJECT_START", "}": "AFTER_OBJECT"},
    "AFTER_OBJECT": {",": "ARRAY_OPEN", "]": "DONE"},
}
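
The snippets above show the masking side; the processor also has to advance its state as tokens are emitted. Here is a minimal sketch of that missing piece (not the exact implementation): __call__ would invoke it at the start of every step after the first, before computing the mask. Like the TRANSITIONS table, it is simplified, ignoring string escaping, and it assumes the prompt length is tracked so prompt tokens are never fed through the FSM.

def _update_state(self, input_ids):
    """Advance the FSM based on the most recently generated token (sketch)."""
    last_token_text = self.tokenizer.decode([input_ids[0, -1].item()])
    for char in last_token_text:
        if self.state in ("IN_KEY", "IN_VALUE") and char != '"':
            continue  # arbitrary text inside a string does not change state
        next_state = TRANSITIONS.get(self.state, {}).get(char)
        if next_state is not None:
            self.state = next_state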

This is integrated with HuggingFace's model.generate():

# Create processor instance
processor = JSONArrayLogitsProcessor(tokenizer)

# Generate with FSM enforcement
output = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=False,  # Greedy required for FSM
    logits_processor=[processor],
)

Two-Phase Defense-in-Depth Strategy

I implemented a fallback mechanism for edge cases:

Phase 1 (Primary): Custom FSM-based generation

  • 86%+ of samples parse directly
  • Token-by-token enforcement

Phase 2 (Fallback): json-repair library

  • Handles truncated outputs (hit max_new_tokens limit)
  • Fixes recoverable malformations

import json

def parse_with_repair(text: str) -> tuple:
    """Parse JSON with repair fallback."""
    try:
        # Direct parsing (Phase 1)
        parsed = json.loads(text)
        return parsed, "direct"
    except json.JSONDecodeError:
        # Attempt repair (Phase 2)
        try:
            from json_repair import repair_json
            repaired = repair_json(text)
            parsed = json.loads(repaired)
            return parsed, "repaired"
        except Exception:
            # Ultimate fallback: empty array
            return [], "failed"

A Bug That Cost Me Hours: FSM State Contamination

During initial testing, I hit a weird failure pattern:

Sample 1 (natural text):   ✅ SUCCESS (540 chars of valid JSON)
Sample 2 (markdown table): ❌ EMPTY STRING
Sample 3 (YAML):           ❌ EMPTY STRING
Sample 4 (natural text):   ❌ EMPTY STRING

Natural text worked, but formatted inputs (tables, YAML, code) kept failing.

The problem: the LogitsProcessor maintains internal FSM state between generations. When reused across samples, stale state from previous generations caused failures.

Wrong Implementation ❌:

# Creating processor once - BAD!
processor = JSONArrayLogitsProcessor(tokenizer)

for sample in dataset:
    # Reusing processor with contaminated state
    output = model.generate(
        input_ids,
        logits_processor=[processor]  # ← State accumulates!
    )
    # Later samples fail due to stale state

Correct Implementation ✅:

for sample in dataset:
    # Reinitialize for each sample - GOOD!
    processor = JSONArrayLogitsProcessor(tokenizer)

    output = model.generate(
        input_ids,
        logits_processor=[processor]  # ← Clean state!
    )

After the fix: 100% success rate across all input types.

Results: Perfect JSON Validity

Table 4: Baseline vs. Constrained Decoding (500 samples)

Metric            Baseline (FT)   Constrained   Change
Valid JSON Rate   98.0%           100.0%        +2.0pp
Invalid Outputs   10              0             -10
Micro F1          0.8995          0.8581        -0.0414 ⚠️
Micro Precision   0.9239          0.9181        -0.0058
Micro Recall      0.8763          0.8055        -0.0708
Macro F1          0.8312          0.6618        -0.1694 ⚠️
Latency (ms)      2,518           4,585         +82.1% ⚠️
Repair Rate       -               13.8%         Acceptable ⚠️

Analysis

The wins: 100% valid JSON (zero invalid outputs), works across all input types (natural text, tables, YAML, code), and no external dependencies since it's a custom FSM implementation.

The trade-offs: micro F1 dropped by about 4.6% relative (0.8995 → 0.8581), outside my ±2% target, likely because strict FSM constraints limit model expressiveness on edge cases. Latency rose 82.1% from FSM validation at each token. And 13.8% of samples needed the json-repair fallback because they were truncated at max_new_tokens=512.

Why This Approach Works

Constrained decoding works because it operates at the right level of abstraction.

Every generated token is validated against the FSM. Invalid tokens get probability 0. You literally cannot generate invalid JSON. The FSM tracks nesting depth, whether you're inside a string/array/object, what tokens are valid next. This is exactly what steering cannot do: maintain state across generation.

And it's not a heuristic. By construction, the output must match the grammar. It also works with any model (base, fine-tuned, prompted) because it doesn't require the model to "learn" JSON formatting. It just enforces it.

Steering tries to make the model want to output valid JSON. Constrained decoding makes it impossible to output invalid JSON.


Experiment 6: Validating the Semantic/Syntactic Hypothesis

After the constrained decoding success, I still wanted to know: did steering fail because of the technique itself, or because I applied it to the wrong type of task?

My hypothesis from Experiments 1-5 was that steering fails for syntactic constraints but might work for semantic dimensions. One more experiment to test this.

Setup

I created three semantic steering tasks:

1. Formality: Can steering make the model generate more formal language?

  • Contrastive pairs: formal academic text vs. casual slang
  • Measurement: Flesch-Kincaid grade level

2. Topic Focus: Can steering push toward domain-specific terminology?

  • Contrastive pairs: technical ML content vs. general prose
  • Measurement: Technical vocabulary density

3. Helpfulness: Can steering produce more helpful, detailed responses?

  • Contrastive pairs: helpful, comprehensive answers vs. terse dismissals
  • Measurement: Human evaluation score (1-5 scale)

For each task, I:

  1. Computed steering vectors from 100 contrastive pairs
  2. Applied steering at layers 12-15 (same as previous experiments)
  3. Tested with α ∈ {0.0, 0.5, 1.0, 2.0, 5.0}
  4. Used paired t-tests with Cohen's d for statistical rigor (see the sketch below)
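
For reference, a minimal sketch of that last step, assuming `baseline_scores` and `steered_scores` are per-prompt metric values for the same 50 prompts (illustrative names). It uses scipy's paired t-test and computes Cohen's d for paired samples as mean(diff) / std(diff).

import numpy as np
from scipy import stats

def paired_test(baseline_scores, steered_scores):
    baseline = np.asarray(baseline_scores, dtype=float)
    steered = np.asarray(steered_scores, dtype=float)
    t_stat, p_value = stats.ttest_rel(steered, baseline)   # paired t-test
    diff = steered - baseline
    cohens_d = diff.mean() / diff.std(ddof=1)               # effect size for paired samples
    return t_stat, p_value, cohens_d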

Results: Partial Validation

Table: Semantic Steering Results (50 test prompts each)

Task          Metric                 Baseline (α=0)   Best Steered    p-value   Cohen's d   Verdict
Formality     Flesch-Kincaid Grade   11.78 ± 1.34     11.82 ± 1.17    0.870     0.033       ❌ NS
Topic Focus   Technical Density      0.016 ± 0.026    0.016 ± 0.024   0.846     0.040       ❌ NS
Helpfulness   Human Score (1-5)      1.84 ± 0.76      2.18 ± 0.87     0.042     0.416       ✅ SIG

Only helpfulness showed a statistically significant effect (p < 0.05) with a medium effect size (Cohen's d = 0.42). That's an 18% improvement in helpfulness scores.

The Non-Monotonic Pattern

Interesting finding: steering strength doesn't follow "more is better."

At α = 5.0, steering actually made outputs less helpful than the unsteered baseline. Moderate strength helps, but too much steering disrupts coherent generation.

Why Only Helpfulness Worked

Helpfulness succeeded because it's highly salient in pretraining data (instruction-tuned models encode "helpfulness" strongly), has a clear contrastive signal, and is behavioral rather than stylistic.

Formality and topic focus failed even though they're semantic. Their "directions" are probably distributed across many features (no single steering vector), the measurement is more surface-level, and they're less strongly encoded in the base model's representations.

What This Means

This partially validates the hypothesis. Steering can work for semantic dimensions when done correctly, and the JSON failures weren't implementation errors. But not all semantic tasks are equal. Steering requires strong, coherent feature directions.

The refined understanding: steering works for semantic dimensions that are (1) strongly represented in the model's activation space, (2) coherent enough to be captured by a single direction, and (3) applied with moderate strength.


Choosing the Right Technique

After six experiments, here's how I think about when to use each approach.

Table 5: When to Use What

Task Properties            Best Technique              Examples
✅ Semantic concepts       Activation Steering         Personality, tone, topic focus
✅ Continuous scales       Activation Steering         More/less helpful, formal/casual
✅ Behavioral patterns     Activation Steering         Safety, bias reduction, factuality
❌ Structural rules        Constrained Decoding        JSON/XML syntax, bracket matching
❌ Binary correctness      Constrained Decoding        Valid/invalid outputs
❌ Stateful tracking       Constrained Decoding        Nesting depth, quote state, tag matching
❌ Token-level precision   Constrained Decoding        Specific character sequences, formatting
Both semantic + structural Fine-tuning + Constraints   Domain-specific structured output

Decision Tree for Your Task

Start: What are you trying to control?
│
├─→ Content/Behavior (SEMANTIC)
│   │
│   ├─→ Examples: Tone, style, topic focus, safety, bias
│   └─→ Use: Activation Steering (Anthropic SAE, RepE)
│       └─→ Expected: Continuous improvements, may need tuning
│
├─→ Format/Structure (SYNTACTIC)
│   │
│   ├─→ Examples: JSON validity, SQL syntax, XML well-formedness
│   └─→ Use: Constrained Decoding (Custom FSM, LMQL, Guidance)
│       └─→ Expected: 100% structural validity, possible quality trade-off
│
└─→ Both Content AND Structure
    │
    └─→ Example: Extract medical entities (semantic) + output valid JSON (syntactic)
        └─→ Use: Fine-tuning (for semantics) + Constrained Decoding (for structure)
            └─→ Expected: High quality content with guaranteed format

Why Each Works

Activation Steering

h̃ = h + α · v_steer

Where v_steer is derived from contrastive examples:
v_steer = mean(h_positive) - mean(h_negative)

It pushes activation vectors toward/away from learned semantic features, modulates along continuous dimensions, and influences what content the model generates. It does not enforce discrete constraints, track state across the token sequence, or guarantee token-level correctness. It works for Anthropic because safety, bias, and personality are semantic concepts encoded in continuous activation patterns.

Constrained Decoding

logits_constrained = mask_invalid_tokens(logits, current_state, grammar)

Where grammar defines valid token sequences as a finite state machine

It maintains FSM state throughout generation (inside array, nesting depth, etc.), masks logits for grammatically invalid next tokens (sets p = 0), and guarantees structural validity by construction. It doesn't influence semantic content, improve factuality, or control style. It works for JSON because syntax is a finite state machine, and FSMs can perfectly enforce finite state machines.

How to Think About These

Fine-tuning is like rewriting the compiler. It changes how code is generated from scratch and can influence both semantics and syntax.

Steering is like running a linter on activations. It detects and adjusts high-level patterns, good for semantic issues, but it can't fix syntax errors.

Constrained decoding is like a grammar-guided parser. It enforces formal grammar during generation and guarantees syntactic validity, but doesn't care about semantic quality.


What I Learned

1. Characterize the task first. Before choosing a technique, figure out if you're dealing with a semantic problem or a syntactic one. Is success measured on a continuous scale or is it pass/fail? Does it require tracking state? My mistake was seeing "control model output" and jumping to steering without thinking about the task's actual nature.

2. Techniques don't transfer across domains for free. Anthropic's SAE steering is impressive. But its success in AI safety doesn't predict success elsewhere. Steering exploits properties of semantic representation space that don't exist for syntactic structures. Understand why something works, not just that it works.

3. Model capacity is not the same as technique viability. Fine-tuning proved the model could learn the task (96.6% valid JSON, 91.2% F1). Steering still failed (24.4% valid JSON). Just because a model can learn something doesn't mean every control method will get it there.

4. Combining techniques can make things worse. I thought fine-tuning + steering would be the best of both approaches. Instead it was 25-36pp worse than fine-tuning alone. Fine-tuning optimizes weights for a specific activation distribution. Steering pushes activations outside that distribution. The tuned weights produce garbage.

5. Use the right tool for structural guarantees. For JSON, XML, SQL syntax, code generation, regex patterns: use constrained decoding. These are formal grammars. FSMs can perfectly enforce formal grammars. Semantic techniques cannot.

6. Layer your defenses for production. My two-phase approach (custom FSM + json-repair fallback) achieved 100% valid JSON with acceptable latency overhead. Primary: constrained decoding. Fallback: repair libraries. Ultimate fallback: safe defaults. Eliminates the 3 AM page problem.


Practical Recommendations

If you need to control content/behavior (more concise, less biased, on topic): try prompt engineering first, then activation steering, then fine-tuning with LoRA.

If you need guaranteed structural validity (valid JSON, correct SQL, well-formed XML): use constrained decoding with a precise grammar/schema and a repair fallback for truncation cases.

If you need both quality content and guaranteed format (extract medical entities as valid JSON, generate runnable Python, structured reports): fine-tune for content quality, then add constrained decoding for structure. My best result was 91.2% F1 with 100% valid JSON.

Production Deployment Code

If you need guaranteed valid JSON, here's what to do:

from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessor
from collections import defaultdict
import torch
import json

class JSONArrayLogitsProcessor(LogitsProcessor):
    """FSM-based logits processor for JSON array generation."""

    STATES = ["START", "ARRAY_OPEN", "OBJECT_START", "IN_KEY",
              "AFTER_KEY", "AFTER_COLON", "IN_VALUE", "AFTER_VALUE",
              "AFTER_OBJECT", "DONE"]

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.state = "START"
        self._build_token_maps()

    def _build_token_maps(self):
        self.tokens_starting_with = defaultdict(set)
        for token_id in range(self.tokenizer.vocab_size):
            decoded = self.tokenizer.decode([token_id])
            if decoded:
                self.tokens_starting_with[decoded[0]].add(token_id)

    def _get_valid_tokens(self):
        # Return valid token IDs based on current FSM state
        state_chars = {
            "START": ["["], "ARRAY_OPEN": ["{", "]"],
            "OBJECT_START": ['"'], "AFTER_KEY": [":"],
            "AFTER_COLON": ['"'], "AFTER_VALUE": [",", "}"],
            "AFTER_OBJECT": [",", "]"]
        }
        if self.state in ["IN_KEY", "IN_VALUE"]:
            return set(range(self.tokenizer.vocab_size))  # Any token in strings
        valid = set()
        for char in state_chars.get(self.state, []):
            valid.update(self.tokens_starting_with[char])
        return valid

    def __call__(self, input_ids, scores):
        valid_tokens = self._get_valid_tokens()
        mask = torch.full_like(scores, float('-inf'))
        for token_id in valid_tokens:
            if token_id < scores.shape[-1]:
                mask[:, token_id] = 0
        return scores + mask

# Load your model
model = AutoModelForCausalLM.from_pretrained("your-model-name")
tokenizer = AutoTokenizer.from_pretrained("your-model-name")

def generate_with_fallback(prompt, max_tokens=512):
    """Generate with defense-in-depth error handling."""
    # Create fresh processor for each generation
    processor = JSONArrayLogitsProcessor(tokenizer)

    try:
        # Phase 1: Constrained generation
        inputs = tokenizer(prompt, return_tensors="pt")
        output = model.generate(
            inputs.input_ids,
            max_new_tokens=max_tokens,
            do_sample=False,  # Greedy required for FSM
            logits_processor=[processor]
        )
        # Decode only the newly generated tokens (exclude the prompt)
        gen_tokens = output[0][inputs.input_ids.shape[1]:]
        text = tokenizer.decode(gen_tokens, skip_special_tokens=True)
        return json.loads(text), "direct"

    except json.JSONDecodeError:
        # Phase 2: Repair fallback (if truncated)
        try:
            from json_repair import repair_json
            repaired = repair_json(text)
            return json.loads(repaired), "repaired"
        except Exception:
            # Phase 3: Safe default
            return [], "failed"

# Use in production
result, method = generate_with_fallback(your_prompt)
print(f"Generated {len(result)} entities via {method}")

Expect roughly +80-100% latency overhead (which can be reduced with caching), a 100% valid JSON rate, and a repair rate of ~10-15% (due to truncation).

What to Monitor in Production

Track structural validity rate (should be 100%), repair rate (<5%, otherwise increase max_new_tokens), fallback rate (investigate anything that hits the ultimate fallback), latency p95/p99, and your content quality metrics (F1, precision, recall).
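
As a minimal sketch (not part of the original pipeline), the method label returned by generate_with_fallback above can be tallied to drive these alerts; the thresholds mirror the ones in the text.

from collections import Counter

outcome_counts = Counter()   # "direct" / "repaired" / "failed"

def record_outcome(method: str):
    outcome_counts[method] += 1
    total = sum(outcome_counts.values())
    repair_rate = outcome_counts["repaired"] / total
    if repair_rate > 0.05:
        print("WARNING: repair rate above 5% -- consider raising max_new_tokens")
    if outcome_counts["failed"] > 0:
        print("WARNING: ultimate fallback hit -- investigate these samples")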


Conclusion

I set out to achieve 100% valid JSON using activation steering. It didn't work.

Steering alone: 24.4% valid JSON (62pp worse than baseline). Fine-tuning + steering: 25-36pp worse than fine-tuning alone. Every variant I tested made things worse.

What actually worked: fine-tuning (96.6% valid JSON, 91.2% F1) and constrained decoding (100% valid JSON). And in Experiment 6, helpfulness steering showed a significant +18% improvement (p=0.042), confirming that steering does work when you apply it to the right kind of problem.

Anthropic's SAE steering is not a universal technique. It's a specialized tool for semantic control.

It works for personality traits, safety, bias reduction, topic focus. These are continuous, semantic, behavioral. It fails for JSON syntax, bracket matching, quote placement. These are structural, binary, stateful. Different problems, different mechanisms.

I spent weeks trying to force steering to work for JSON. Different strengths, different layers, permanent bias modifications, inference-time activation addition, steering during training, combining with fine-tuning. Every approach failed. Not because of bugs in my implementation, but because it was the wrong tool.

The moment I switched to constrained decoding I got 100% valid JSON on the first experiment.

If you're working with LLMs and need to control their behavior, ask: am I trying to influence what the model says, or how it formats what it says? If what: steer. If how: constrain. If both: do both.

Steering is a scalpel for semantic surgery, not a hammer for syntactic construction. Knowing the difference is what separates production systems from research demos.


Acknowledgments

This work was conducted using:

  • Models Tested:
    • Qwen/Qwen2.5-0.5B (494M parameters) - primary model for blog experiments
    • LiquidAI/LFM2-350M (350M parameters) - additional validation
    • Google/Gemma-2B with Gemma Scope SAE features - for steering vector extraction experiments
  • Dataset: Nemotron-PII (30K samples, 55 PII entity types)
  • Libraries: Transformers (HuggingFace), json-repair (fallback), custom FSM-based LogitsProcessor
  • Compute: NVIDIA A100 GPU

Thanks to the Anthropic interpretability team for their work on SAE steering, which inspired this research.


References

  1. Anthropic. (2024). "Mapping the Mind of a Large Language Model." Anthropic News. https://www.anthropic.com/news/mapping-mind-language-model

  2. Templeton et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic Research.

  3. HuggingFace Transformers Documentation. "LogitsProcessor for Constrained Generation." https://huggingface.co/docs/transformers/main_classes/text_generation

  4. Willard, B. & Louf, R. (2023). "Efficient Guided Generation for Large Language Models." arXiv preprint.

  5. Turner, A. et al. (2024). "Activation Addition: Steering Language Models without Optimization." arXiv preprint.

  6. Zou, A. et al. (2023). "Representation Engineering: A Top-Down Approach to AI Transparency." arXiv preprint.

Community

Thanks for sharing.
It makes sense that constrained decoding would produce the best results for generating valid JSON.
I was wondering if you have any thoughts on using constrained decoding for generating code for compiled languages, such as C++. Could you utilize a compiler as feedback for the constrained decoder, and if so, would the generated code likely be correct?

Article author

thank you! hope you enjoyed it.

what you are explaining is basically using an environment (an actual live compiler) in order to change the model's behavior. the best approach for that would be the use of RL algorithms such as GRPO. assuming the model has already seen enough examples of C++ during pre-training (mid-training), we can reduce the compile error rates in generated code by building an RL Env with a reward function (live compiler) that scores the rollouts based on their errors during compilation.

with JSON, we have some structure, we know what should come after certain tokens, but with code, it's really hard to constrain what should come next. that's why i think an RL Env for coding is much better

Awesome breakdown,
easily the minimalist blog article of the week.
Thanks for sharing! @MaziyarPanahi

Article author

thank you! glad you enjoyed it!
