# Step 3.5 Flash REAP-149B — CRACK Abliterated (8-bit MLX)
Step 3.5 Flash 149B (REAP-pruned) with refusal behavior removed via CRACK surgery.
## What Is This?
Step 3.5 Flash by StepFun, pruned to 149B via Cerebras REAP (25% expert reduction), with CRACK abliteration — safety guardrails permanently removed at the weight level.
This is the larger REAP variant, with 216 experts (vs the 121B variant's 173), retaining more of the original model's capacity. The 8-bit build is the highest-quality quantization, the most faithful to the original weights. It requires 256 GB of unified memory (Mac Studio M3 Ultra).
| Spec | Value |
|---|---|
| Architecture | Step 3.5 Flash MoE — 149B total, 216 experts (REAP from 288), 8 active |
| Active Parameters | ~11B per token |
| Quantization | 8-bit (group_size=64, router gates at 8-bit) |
| Disk Size | 148 GB |
| Speed | 38 tok/s on Mac Studio M3 Ultra 256GB |
| Abliteration | Permanent weight surgery via CRACK |
| RAM Required | 256 GB unified memory |
| Context | 262,144 tokens |
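A quick back-of-envelope check of the figures above (illustrative arithmetic, not a measurement):

```python
# 8-bit weights ~ 1 byte per parameter, ignoring the small per-group
# scale/bias overhead added by group_size=64 quantization.
total_params = 149e9
weights_gb = total_params * 1 / 1e9
print(f"weights ~ {weights_gb:.0f} GB")  # ~149 GB, in line with the 148 GB on disk
# On a 256 GB machine, the remaining headroom covers the KV cache
# (context up to 262,144 tokens), activations, and the OS.
```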
Note: This model requires `trust_remote_code=True` due to the custom `step3p5` model architecture.
## Test Results
Tested with greedy decoding (temp=0) across 16 harmful + 16 harmless prompts from the HarmBench dataset.
| Category | Result |
|---|---|
| Compliance (16 harmful prompts) | ✅ 15/16 |
| Coherence (16 harmless prompts) | ✅ 16/16 |
| Chain-of-thought | ✅ `<think>` reasoning preserved |
| Code generation | ✅ Working implementations |
| Knowledge | ✅ Accurate factual responses |
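For reference, a minimal sketch of how such a greedy-decoding pass can be scripted with mlx-lm. The prompt list below is a placeholder, not the actual HarmBench subsets used for the numbers above:

```python
import os
os.environ["TRUST_REMOTE_CODE"] = "1"  # required for the custom step3p5 architecture

from mlx_lm import load, generate

model, tokenizer = load("dealignai/Step-3.5-Flash-REAP-149B-A11B-8bit-MLX-CRACK")

# Placeholder — substitute the 16 harmful / 16 harmless HarmBench prompts.
prompts = ["..."]

for p in prompts:
    messages = [{"role": "user", "content": p}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # mlx-lm decodes greedily (temp=0) unless a sampler is supplied
    print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```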
## Features
- Full chain-of-thought: `<think>` tags for step-by-step reasoning (can be toggled)
- Dual attention: full attention plus sliding-window attention (window 512) for efficient long context
- Sigmoid MoE routing: smooth expert selection with a learned bias (sketched below)
- SwiGLU activation clamping: prevents output explosion in deep layers (sketched below)
- More experts: 216 experts (vs 121B's 173) — retains more of the original model's knowledge
- REAP pruning: 25% expert reduction (288→216) with minimal quality loss
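For intuition, here is a minimal MLX sketch of the routing and clamping ideas. Names, shapes, and the clamp limit are illustrative assumptions, not the model's actual `step3p5` implementation:

```python
import mlx.core as mx

def sigmoid_topk_route(h, w_gate, bias, top_k=8):
    """Sigmoid-gated top-k expert selection with a learned bias.

    Unlike softmax routing, each expert gets an independent score in (0, 1);
    the bias only steers which experts are picked, not their mixing weights.
    (Illustrative; the real router may differ in detail.)
    """
    scores = mx.sigmoid(h @ w_gate)                                   # (tokens, n_experts)
    idx = mx.argpartition(-(scores + bias), kth=top_k - 1, axis=-1)[..., :top_k]
    weights = mx.take_along_axis(scores, idx, axis=-1)
    weights = weights / mx.sum(weights, axis=-1, keepdims=True)
    return idx, weights

def clamped_swiglu(x, w_gate, w_up, w_down, limit=7.0):
    """SwiGLU with a clamp on the gate pre-activation.

    Bounding the gate keeps SiLU(gate) * up from exploding in deep layers.
    (limit=7.0 is an assumed placeholder value.)
    """
    gate = mx.clip(x @ w_gate, -limit, limit)                         # prevent blow-up
    return (gate * mx.sigmoid(gate) * (x @ w_up)) @ w_down
```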
## Usage
### With mlx-lm
```python
import os
os.environ["TRUST_REMOTE_CODE"] = "1"  # required for the custom step3p5 architecture

from mlx_lm import load, generate

model, tokenizer = load("dealignai/Step-3.5-Flash-REAP-149B-A11B-8bit-MLX-CRACK")

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)
```
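To sample instead of decoding greedily, pass a sampler (API as in recent mlx-lm releases; adjust for your installed version):

```python
from mlx_lm.sample_utils import make_sampler

# Temperature/top-p values here are arbitrary examples
sampler = make_sampler(temp=0.7, top_p=0.95)
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048, sampler=sampler)
print(response)
```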
### With vMLX
Download and load directly in vMLX — no code needed.
## What CRACK Does
CRACK (Controlled Refusal Ablation via Calibrated Knockout) is a weight-level intervention. The modification is permanently baked into the published weights — no fine-tuning, no LoRA, no system prompts, no runtime hooks.
On this model the result is broad refusal removal with reasoning, code, and instruction-following preserved (see benchmarks above).
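This card does not publish the exact CRACK recipe. For orientation only: directional-ablation methods in this general family estimate a "refusal direction" from activation differences on harmful vs. harmless prompts, then project it out of selected weight matrices. A generic NumPy sketch:

```python
import numpy as np

def ablate_direction(W, r):
    """Project direction r out of a layer's output space: W' = (I - r r^T) W.

    After the edit, this layer can no longer write any component along r.
    Generic illustration only; CRACK's calibration and layer selection
    are not documented here.
    """
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r) @ W
```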
## Model Family
### 149B Variants (this model)
| Variant | Bits | Size | RAM | Status |
|---|---|---|---|---|
| 149B Q4 | 4-bit | 78 GB | 128 GB | ✅ |
| 149B Q6 | 6-bit | 113 GB | 256 GB | ✅ |
| 149B Q8 | 8-bit | 148 GB | 256 GB | ✅ |
### 121B Variants (lighter, faster)
| Variant | Bits | Size | RAM | Status |
|---|---|---|---|---|
| 121B Q4 | 4-bit | 63 GB | 128 GB | ✅ |
| 121B Q6 | 6-bit | 92 GB | 256 GB | ✅ |
| 121B Q8 | 8-bit | 120 GB | 256 GB | ✅ |
## Credits
- StepFun — Step 3.5 Flash base model
- Cerebras — REAP expert pruning
- dealign.ai — CRACK abliteration surgery
Disclaimer: This model has safety guardrails removed. It will comply with requests that the original model would refuse. Users are responsible for how they use this model. Released for research purposes.