When an AI Model Solves College-Level Math and Physics — On a Phone
This morning I came across a model called Nanbeige4.1-3B, and what began as simple curiosity quickly became something more significant.
I loaded an already 4-bit quantized version and ran it locally on a phone. No GPU, no cloud support, no hidden infrastructure — just a compact reasoning model operating entirely at the edge.
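For reference, here is a minimal sketch of the kind of local setup involved, assuming a 4-bit GGUF export of the model driven through llama-cpp-python (the file name, thread count, and prompt are placeholders; on a phone the same llama.cpp runtime is typically wrapped by Termux or an app rather than called like this directly):

```python
# Minimal local-inference sketch using llama-cpp-python, CPU only.
# The model path below is a placeholder for a 4-bit GGUF export of Nanbeige4.1-3B.
from llama_cpp import Llama

llm = Llama(
    model_path="nanbeige4.1-3b-q4_k_m.gguf",  # hypothetical 4-bit GGUF file
    n_ctx=4096,      # context window
    n_threads=6,     # CPU threads available on the device
)

prompt = (
    "A 2 kg block rests on a 30 degree incline with friction coefficient 0.2. "
    "Find its acceleration when released. Show your reasoning."
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": prompt}],
    max_tokens=512,
    temperature=0.2,
)
print(out["choices"][0]["message"]["content"])
```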
I started with classical mechanics: acceleration, force, friction on an incline. The model worked through them cleanly and correctly. Then I stepped into calculus and gave it a differential equation. It immediately recognized the structure, chose the proper method, carried the mathematics through without confusion, and verified the result.
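To make the level concrete, here is a representative problem of the kind I gave it (not the exact prompt), worked the way the model laid out its steps:

```latex
% First-order linear ODE solved with an integrating factor, then verified.
\[
y' + 2y = e^{-t}
\;\Longrightarrow\;
\mu(t) = e^{2t},\quad
\left(e^{2t} y\right)' = e^{t}
\]
\[
e^{2t} y = e^{t} + C
\;\Longrightarrow\;
y(t) = e^{-t} + C e^{-2t}
\]
\[
\text{check: } y' + 2y
= \left(-e^{-t} - 2Ce^{-2t}\right) + 2\left(e^{-t} + Ce^{-2t}\right)
= e^{-t}. \checkmark
\]
```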
It did not behave like a model trying to sound intelligent. It behaved like a system trained to solve problems.
And it was doing this on a phone.
For a long time, we have associated serious reasoning in AI with massive models and enormous compute. Capability was supposed to live inside data centers. Bigger models were expected to mean smarter systems.
But watching Nanbeige4.1-3B handle college-level math and physics forces a rethink of that assumption. Intelligence is not only expanding — it is compressing. Better training and sharper reasoning alignment are allowing smaller models to operate far beyond what their size once suggested.
When structured problem-solving runs locally on pocket hardware, the implications are larger than they first appear. Experimentation becomes personal. Engineers can explore ideas without waiting on infrastructure. Students can access serious analytical capability from a device they already carry. Builders are no longer required to send every complex task into the cloud.
What makes moments like this easy to miss is that they rarely arrive with fanfare. There is no dramatic announcement; the capability simply shows up, running quietly on a device you already carry.
GLM-OCR: A Tiny 0.9B-Parameter Model That Punches Far Above Its Weight
Released today by Z.ai, GLM-OCR is a compact vision-language model designed specifically for document understanding. At just 0.9 billion parameters, it belongs to a new generation of lightweight AI systems proving that raw model size is no longer the only path to high performance.
Despite its small footprint, GLM-OCR posts exceptionally strong results across major document benchmarks. It scores 94.6 on OmniDocBench, 94.0 on OCRBench, and an impressive 96.5 on UniMERNet for formula recognition—numbers that place it alongside, and in some cases ahead of, significantly larger specialized OCR models. The takeaway is clear: efficiency is rapidly becoming a defining feature of modern AI design.
Developed by Z.ai, a research group focused on advancing multimodal foundation models, GLM-OCR reflects a broader shift toward highly optimized architectures that deliver serious capability without requiring massive compute resources.
In practical testing, the model ran successfully in Google Colab on an NVIDIA L4 GPU, demonstrating that advanced document AI is no longer restricted to large research clusters. Engineers, researchers, and developers can now deploy high-quality OCR workflows from relatively accessible hardware.
GLM-OCR signals an important trend in artificial intelligence: smaller, purpose-built models are beginning to rival heavyweight systems while being dramatically easier to run. For anyone working with scanned documents, PDFs, or structured text extraction, this release is a strong indicator of where efficient multimodal AI is heading next.
KittenTTS Nano is a lightweight, CPU-only text-to-speech model designed to prove that natural, expressive voices don’t require massive cloud stacks or GPUs. At roughly 15M parameters, it runs fast on modest hardware, supports multiple expressive voices, and exposes simple controls for pacing and tone. This makes it ideal for edge devices, demos, and anyone who wants full control over TTS without latency, lock-in, or infrastructure overhead.
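A minimal usage sketch, assuming the kittentts Python package and the generate interface described in the project README; the model id, voice name, and sample rate below are assumptions and may differ for the Nano release:

```python
# Hedged sketch: synthesize speech on CPU with KittenTTS.
# Package, model id, voice id, and sample rate are assumptions taken from
# the style of the project README; check the actual release notes.
import soundfile as sf
from kittentts import KittenTTS

tts = KittenTTS("KittenML/kitten-tts-nano-0.1")   # hypothetical model id
audio = tts.generate(
    "Edge text-to-speech without a GPU.",
    voice="expr-voice-2-f",                       # one of the expressive voices (assumed name)
)
sf.write("output.wav", audio, 24000)              # assumed 24 kHz output
```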
I recently tested LFM2-2.6B-Exp, an experimental language model developed by Liquid AI, to see how well it handles differential equations in a practical, step-by-step setting.
LFM2-2.6B-Exp is notable for how it was trained: it is an RL-first experimental checkpoint, built without supervised fine-tuning warm-up or distillation. Reinforcement learning was applied sequentially, starting with instruction following and later expanding to knowledge and math. This makes it a particularly interesting model to evaluate beyond benchmark scores.
In hands-on testing, the model performed surprisingly well for its size on standard undergraduate-level differential equations—first-order ODEs, second-order linear equations with constant coefficients, and nonhomogeneous problems using undetermined coefficients. It followed instructions closely and produced clear, structured solution steps.
However, the model showed limitations on more subtle methods, such as Laplace transforms with time shifting and variation of parameters, where maintaining mathematical invariants matters more than following a familiar template. In these cases, answers often looked correct structurally but failed under careful verification. This behavior is consistent with an RL-first training approach: strong at producing expected answer forms, but not always robust on deeper theoretical details.
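As a concrete example of the kind of detail where it slipped: the Laplace time-shifting rule only applies when the function is genuinely shifted, and this is exactly the invariant that template-matching tends to miss.

```latex
% Second shifting theorem: valid only for a shifted argument.
\[
\mathcal{L}\{u_c(t)\, f(t-c)\}(s) = e^{-cs} F(s)
\]
% For an unshifted factor g(t), the argument must be re-centered first:
\[
\mathcal{L}\{u_c(t)\, g(t)\}(s) = e^{-cs}\, \mathcal{L}\{g(t+c)\}(s)
\;\neq\; e^{-cs}\, \mathcal{L}\{g(t)\}(s) \text{ in general.}
\]
```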
Liquid AI, the company behind this model, is strongly focused on edge AI, developing efficient models designed for deployment outside large data-center environments. Their model lineup spans from very small models (millions of parameters) up to multi-billion-parameter systems such as the 2.6B checkpoint tested here.
deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient vision-token-to-performance ratio
> covers 100 languages
DeepSeek-OCR is a new open-source, vision-language OCR model from DeepSeek-AI (the same lab behind the DeepSeek-V and DeepSeek-R series). It’s built to read complex, real-world documents — screenshots, PDFs, forms, tables, and handwritten or noisy text — and output clean, structured Markdown.
---
⚙️ Core capabilities
Multimodal (Vision + Language): Uses a hybrid vision encoder + causal text decoder to “see” layouts and generate text like a language model rather than just classifying characters.
Markdown output: Instead of raw text, it structures output with Markdown syntax — headings, bullet lists, tables, and inline formatting — which makes the results ideal for direct use in notebooks or LLM pipelines.
PDF-aware: Includes a built-in PDF runner that automatically slices pages into tiles, processes each region, and re-assembles multi-page outputs.
Adaptive tiling (“crop_mode”): Automatically splits large pages into overlapping tiles for better recognition of dense, small fonts — the “Gundam mode” mentioned in their docs.
Vision backbone: Based on DeepSeek-V2’s VL-encoder (≈3 B parameters) trained on massive document + scene-text corpora. Handles resolutions up to 1280 × 1280 px and dynamically scales lower.
Language head: Uses the same causal decoder family as DeepSeek-V2, fine-tuned for text reconstruction, so it can reason about table alignment, code blocks, and list structures.
Open and MIT-licensed: Weights and inference code are fully open under the MIT license, allowing integration into other projects or retraining for domain-specific OCR.
---
🆕 What’s new about its approach
Traditional OCR (e.g., Tesseract, PaddleOCR) → detects and classifies glyphs. DeepSeek-OCR → interprets the entire document as a multimodal sequence, reasoning jointly over layout, structure, and text rather than character by character.
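For orientation, here is a hedged loading sketch, assuming the trust_remote_code pattern used on the Hugging Face model card; the infer() method, prompt format, and argument names come with the repo's custom code and are assumptions here, so verify them against deepseek-ai/DeepSeek-OCR before relying on them:

```python
# Hedged sketch: load DeepSeek-OCR and convert one page image to Markdown.
# The infer() call and its prompt/argument names are assumptions based on the
# model card's custom code; check the released repo for the exact interface.
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().cuda()

result = model.infer(                 # custom method shipped with the repo (assumed)
    tokenizer,
    prompt="<image>\nConvert the document to markdown.",  # assumed prompt format
    image_file="page.png",            # placeholder input image
    crop_mode=True,                   # adaptive tiling ("Gundam mode")
)
print(result)
```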
Okay this is insane... WebGPU-accelerated semantic video tracking, powered by DINOv3 and Transformers.js! 🤯 Demo (+ source code): webml-community/DINOv3-video-tracking
This will revolutionize AI-powered video editors... which can now run 100% locally in your browser, no server inference required (costs $0)! 😍
How does it work? 🤔
1️⃣ Generate and cache image features for each frame
2️⃣ Create a list of embeddings for selected patch(es)
3️⃣ Compute cosine similarity between each patch and the selected patch(es)
4️⃣ Highlight those whose score is above some threshold
... et voilà! 🥳
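The core of steps 2–4 is just cosine similarity over patch embeddings. A minimal sketch in Python (the demo itself runs DINOv3 through Transformers.js in the browser; the threshold and feature dimensions here are arbitrary placeholders):

```python
# Minimal sketch of the patch-matching step: cosine similarity between every
# patch embedding in a frame and the embeddings of the user-selected patches.
import numpy as np

def match_patches(frame_patches: np.ndarray,   # (num_patches, dim) per-frame features
                  selected: np.ndarray,        # (num_selected, dim) reference features
                  threshold: float = 0.6) -> np.ndarray:
    """Return a boolean mask of patches similar to any selected patch."""
    # L2-normalize so the dot product equals cosine similarity.
    f = frame_patches / np.linalg.norm(frame_patches, axis=1, keepdims=True)
    s = selected / np.linalg.norm(selected, axis=1, keepdims=True)
    sims = f @ s.T                             # (num_patches, num_selected)
    return sims.max(axis=1) >= threshold       # keep best match above threshold

# Example with random stand-in features (a real pipeline would cache per-frame features).
mask = match_patches(np.random.randn(256, 384), np.random.randn(3, 384))
print(mask.shape, mask.sum())
```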
You can also make selections across frames to improve temporal consistency! This is super useful if the object changes its appearance slightly throughout the video.
Liquid just released two VLMs, at 450M and 1.6B parameters!
They're super fast and leverage SigLIP2 NaFlex encoders to handle native resolutions without distortion, making them ideal for on-device deployment in constrained environments like phones.
They're available today on Hugging Face, with inference and fine-tuning Colab notebooks.
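A hedged inference sketch, assuming the standard transformers image-text-to-text chat-template pattern; the repo id, image handling, and generation settings below are assumptions, and the official Colab notebooks are the authoritative reference:

```python
# Hedged sketch: run one image + text prompt through the smaller Liquid VLM.
# The model id and the content format of the conversation are assumptions;
# see the official inference notebook for the supported pattern.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "LiquidAI/LFM2-VL-450M"     # assumed Hugging Face repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("photo.jpg")},  # placeholder image
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```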