AI & ML interests
Enabling consumer-grade hardware to run SOTA models natively. 经验可塑性 (Experiential Plasticity). Grid compute, biologically inspired architectures, and edge/embedded AI. From datacenter to Roomba™.
continuum-ai
SOTA models on your iPhone, MacBook, tiny robots, and virtually ANY GPU. No cloud required.
经验可塑性 (Experiential Plasticity) — the model shapes its own architecture through experience
We don't quantize. We don't distill. We structurally reshape the model's architecture through Experiential Plasticity — iterative pruning and retraining that makes models smaller AND better. Like biological synaptic pruning during brain development: the connections that fire together wire together, the rest are removed.
The result: models that were designed for datacenters, running on your phone.
Built on the incredible open source work of the Qwen team and the broader open model community. Open weights make this possible — we compress and specialize what you generously share.
| What | Proof |
|---|---|
| 2.6GB code model for iPhone | qwen3.5-4b-code-forged-GGUF — HumanEval: 63/85 passing (74.1%), 70% on hard problems, benchmark still running |
| Sonnet 4.6-level on MacBook | qwen3.5-27b-code-forged-mlx-4bit — 15GB, 9 tok/s on M1 32GB |
| 35B MoE in 1.8GB | qwen3.5-35b-a3b-compacted-GGUF — 256 experts pruned to 16 |
| +24% better at code | qwen3.5-4b-code-forged — perplexity 3.04 to 2.31 after forging |
We target every device tier. Same technique, different compaction levels. Be competitive at ANY size.
Device Targets
| Device | RAM | Our Model | Size |
|---|---|---|---|
| RTX 5090 | 32GB | qwen3.5-27b-code-forged (fp16) | 17GB |
| MacBook Pro 32GB | 32GB | qwen3.5-27b-code-forged-mlx-4bit | 15GB |
| RTX 3090 | 24GB | qwen3.5-27b-code-forged (4-bit) | 17GB |
| MacBook Air 16GB | 16GB | qwen3.5-4b-code-forged Q8_0 | 4.2GB |
| iPhone 17 / Android | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| MacBook Air 8GB | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| Raspberry Pi 5 | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
| Roomba j7+ | 8GB | qwen3.5-4b-code-forged Q4_K_M | 2.6GB |
Yes, really. The iRobot Roomba j7+ has a Qualcomm QCS6490 with 8GB RAM — the same memory budget as an iPhone 17. Our 2.6GB Q4_K_M model fits with room to spare. Any ARM SoC with 4GB+ RAM can run these models via llama.cpp.
Published Models
Qwen3.5 — Forged (Code Domain)
| Model | Base | Domain | Improvement | Size | Runs On |
|---|---|---|---|---|---|
| qwen3.5-27b-code-forged-mlx-4bit | Qwen3.5-27B | Code | +3.5% | 15GB | MacBook Pro 32GB (9 tok/s) |
| qwen3.5-27b-code-forged | Qwen3.5-27B | Code | +3.5% | 17GB (4-bit) | RTX 3090/4090/5090 |
| qwen3.5-27b-code-forged-defragged | Qwen3.5-27B | Code | +3.9% | Smaller | RTX 3090/4090/5090 |
| qwen3.5-4b-code-forged | Qwen3.5-4B | Code | +26.6% | 8GB | Any GPU / MacBook |
| qwen3.5-4b-code-forged-GGUF | Qwen3.5-4B | Code | +26.6% | 2.6GB Q4 | iPhone 17, MacBook Air 8GB |
| qwen3.5-4b-code-forged-defragged | Qwen3.5-4B | Code | +33% | Smaller | Any GPU / MacBook |
Qwen3.5 — Compacted (Expert Pruning)
| Model | Original | Method | Reduction | Runs On |
|---|---|---|---|---|
| qwen3.5-35b-a3b-compacted | Qwen3.5-35B-A3B (256 experts) | Expert pruning to 16 experts | 49GB to 11GB | RTX 3090/4090/5090 |
| qwen3.5-35b-a3b-compacted-GGUF | Same | GGUF Q2_K/Q4_K_M | 1.8GB / 2.7GB | iPhone / MacBook Air |
Qwen2.5 — Compacted (Head + Expert Pruning)
| Model | Original | Method | Reduction |
|---|---|---|---|
| qwen2.5-coder-32b-compacted | Qwen2.5-Coder-32B | Head pruning + mixed quant | 67GB to 14GB |
| qwen2.5-coder-14b-compacted | Qwen2.5-Coder-14B | Head pruning + mixed quant | 27GB to 9GB |
Scaling Law Experiments
| Model | Params | Improvement | Notes |
|---|---|---|---|
| qwen2.5-0.5b-general-forged | 0.5B | -3.2% | Too small — already maximally compressed |
| qwen2.5-1.5b-general-forged | 1.5B | +2.4% | Improvement begins |
| qwen2.5-3b-general-forged | 3.1B | +0.4% | Marginal on generic text |
Larger models harbor more redundancy, which gives plasticity more room to improve them. Domain-specific training (code) amplifies the effect dramatically compared to generic text.
Run on MacBook (2 Commands)
```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Sonnet 4.6-level model, 15GB, runs on any 32GB Mac
model, tokenizer = load("continuum-ai/qwen3.5-27b-code-forged-mlx-4bit")
print(generate(model, tokenizer, prompt="def merge_sort(arr):", max_tokens=200))
```
Works on MacBook Pro, MacBook Air (16GB+ for smaller models), Mac Mini, iMac. ~9 tok/s on M1 32GB. Faster on M2/M3/M4.
Forge Your Own
Three commands. Any NVIDIA GPU with 8GB+ VRAM.

```bash
git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
source .venv/bin/activate
python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code
```
Auto-detects GPU, picks memory tier (fp16 / 4-bit), trains with LoRA + AMP, prunes, defrags, saves. Observable progress via status.json. Works on RTX 3090, 4090, 5090.
The Science
Experiential Plasticity
Not compression. Architectural optimization. The model's structure co-evolves with its training:
- Train on domain data (LoRA + AMP mixed precision)
- Measure each attention head's information contribution
- Prune heads that don't contribute to the domain
- Retrain — surviving heads specialize and compensate
- Defrag — structurally remove dead heads, free VRAM
- Repeat — each cycle, the model improves
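The cycle above can be sketched as a toy loop over per-head contribution scores. This is an illustration only, not the sentinel-ai API; the head count, pruning fraction, and recovery rate are made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: per-head "information contribution" scores
# (measured after domain training) and a liveness mask.
n_heads = 16
scores = rng.uniform(0.0, 1.0, n_heads)
alive = np.ones(n_heads, dtype=bool)

def plasticity_cycle(scores, alive, prune_frac=0.25, recovery=0.1):
    """One prune -> retrain cycle: drop the weakest live heads,
    then let survivors specialize (scores drift upward)."""
    live_idx = np.flatnonzero(alive)
    k = max(1, int(prune_frac * live_idx.size))
    weakest = live_idx[np.argsort(scores[live_idx])[:k]]
    alive = alive.copy()
    alive[weakest] = False                              # prune
    scores = scores.copy()
    scores[alive] += recovery * (1.0 - scores[alive])   # retrain: survivors compensate
    return scores, alive

for cycle in range(3):
    scores, alive = plasticity_cycle(scores, alive)
    print(f"cycle {cycle + 1}: {alive.sum()} heads alive, "
          f"mean score {scores[alive].mean():.3f}")
```

Each pass removes the lowest-contribution heads and nudges the survivors' scores up, so the head count shrinks while the mean contribution of what remains rises.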
Transfer Function
Recovery from pruning follows a measurable exponential: 1.45 * exp(-0.18 * cycle) - 0.03. This connects transformer optimization to classical control theory — the same math used in electrical engineering and robotics for decades.
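The curve can be evaluated directly; a minimal sketch (the function name is mine, the constants are from the formula above):

```python
import math

def recovery(cycle: int) -> float:
    """Post-pruning recovery predicted by the fitted transfer function."""
    return 1.45 * math.exp(-0.18 * cycle) - 0.03

# Recovery decays exponentially across cycles, like the step
# response of a first-order system in classical control theory.
for c in range(6):
    print(f"cycle {c}: recovery = {recovery(c):.3f}")
```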
Continuous Defrag
Traditional pruning masks heads but keeps tensor sizes unchanged. Continuous defrag slices the actual Q/K/V/O weight matrices — the model gets physically smaller between cycles:
```
Cycle 1: 27B params, 17.9GB -> prune -> defrag -> freed 1.7GB
Cycle 2: 24.5B, 16.2GB, batch=2 -> prune -> defrag -> freed 1.7GB (2x faster)
Cycle 3: 22B, 14.5GB, batch=3 (2.8x faster)
```
40% faster total training. 33% smaller final model.
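Slicing the actual projection matrices can be sketched in NumPy. The dimensions here are toy values, not the real 27B shapes, and the helper is illustrative rather than the sentinel-ai implementation:

```python
import numpy as np

# Toy dimensions (illustrative only).
n_heads, head_dim, d_model = 8, 16, 128

rng = np.random.default_rng(0)
W_q = rng.standard_normal((n_heads * head_dim, d_model))
W_k = rng.standard_normal((n_heads * head_dim, d_model))
W_v = rng.standard_normal((n_heads * head_dim, d_model))
W_o = rng.standard_normal((d_model, n_heads * head_dim))

def defrag(W_q, W_k, W_v, W_o, keep_heads):
    """Slice out pruned heads so the tensors physically shrink,
    instead of masking them at their original size."""
    rows = np.concatenate([np.arange(h * head_dim, (h + 1) * head_dim)
                           for h in keep_heads])
    # Q/K/V lose output rows; the O projection loses input columns.
    return W_q[rows], W_k[rows], W_v[rows], W_o[:, rows]

keep = [0, 2, 3, 5, 6, 7]          # heads 1 and 4 were pruned
W_q, W_k, W_v, W_o = defrag(W_q, W_k, W_v, W_o, keep)
print(W_q.shape, W_o.shape)        # (96, 128) (128, 96)
```

The point is that the surviving weights occupy genuinely smaller tensors, which is what frees VRAM and speeds up the next training cycle.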
Head Mitosis
Pruning frees slots. Mitosis fills them. When a head is overutilized (high information contribution), it gets cloned into a pruned slot — each copy initialized at 50% gate value to maintain output continuity. After continued training, the clones diverge and specialize, just like cell differentiation after biological mitosis.
Experimentally: a cloned head diverged within 500 steps, with the clone achieving higher utilization than the parent in its new role. The model grows new specialized capacity exactly where it's needed.
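The 50% gate trick can be verified in a few lines. This is a toy gated-head model of my own construction, not the sentinel-ai code, but it shows why cloning at half gate value preserves the output exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
x = rng.standard_normal(d)

# Toy gated heads: output = sum(gate_i * W_i @ x).
W = [rng.standard_normal((d, d)) for _ in range(3)]
gates = [1.0, 1.0, 0.0]            # head 2 was pruned, its slot is free

def forward(W, gates, x):
    return sum(g * Wi @ x for g, Wi in zip(gates, W))

before = forward(W, gates, x)

# Mitosis: clone overutilized head 0 into the freed slot,
# each copy gated at 0.5 so the combined output is unchanged.
W[2] = W[0].copy()
gates[0], gates[2] = 0.5, 0.5

after = forward(W, gates, x)
print("max drift:", np.abs(before - after).max())
```

From this continuity-preserving starting point, continued training is what lets the two copies diverge and specialize.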
Self-Directed Controller
The AdaptivePlasticityController observes the model and makes all decisions — pruning ratio, strategy, training budget, stopping criteria. No human hyperparameters needed.
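A controller of this kind can be sketched as a small feedback policy. The class below is a toy of my own, not the real AdaptivePlasticityController API; it reuses the transfer function above to scale the pruning ratio and to decide when to stop:

```python
import math

class AdaptiveController:
    """Toy self-directed controller sketch: prune less as predicted
    recovery fades, stop when recovery is effectively exhausted."""

    def __init__(self, max_ratio=0.25, stop_below=0.2):
        self.max_ratio = max_ratio      # assumed cap on pruning per cycle
        self.stop_below = stop_below    # assumed stopping threshold

    def predicted_recovery(self, cycle):
        return 1.45 * math.exp(-0.18 * cycle) - 0.03

    def decide(self, cycle):
        r = self.predicted_recovery(cycle)
        if r < self.stop_below:
            return None                      # stopping criterion met
        return self.max_ratio * min(r, 1.0)  # pruning ratio for this cycle

ctrl = AdaptiveController()
cycle = 0
while (ratio := ctrl.decide(cycle)) is not None:
    cycle += 1
print("stopped after", cycle, "cycles")
```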
Papers
- Experiential Plasticity — Scaling law, transfer function, self-directed controller, domain forging, continuous defrag
- Neural Plasticity in Transformers — Foundation paper with cross-architecture results
- Plasticity Compaction — MoE expert pruning (67GB to 14GB)
Links
- sentinel-ai — Open source forge framework (MIT)
- continuum — Distributed AI on consumer hardware
- @cambrian — Updates and demos