When an AI Model Solves College-Level Math and Physics — On a Phone
This morning I came across a model called Nanbeige4.1-3B, and what began as simple curiosity quickly became something more significant.
I loaded an already 4-bit quantized version and ran it locally on a phone. No GPU, no cloud support, no hidden infrastructure — just a compact reasoning model operating entirely at the edge.
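For reference, here is a minimal sketch of the kind of local setup involved, assuming a 4-bit GGUF export of the model driven through llama-cpp-python (the file name, thread count, and prompt are placeholders; on a phone the same llama.cpp runtime is typically wrapped by Termux or an app rather than called like this directly):

```python
# Minimal local-inference sketch using llama-cpp-python, CPU only.
# The model path below is a placeholder for a 4-bit GGUF export of Nanbeige4.1-3B.
from llama_cpp import Llama

llm = Llama(
    model_path="nanbeige4.1-3b-q4_k_m.gguf",  # hypothetical 4-bit GGUF file
    n_ctx=4096,      # context window
    n_threads=6,     # CPU threads available on the device
)

prompt = (
    "A 2 kg block rests on a 30 degree incline with friction coefficient 0.2. "
    "Find its acceleration when released. Show your reasoning."
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": prompt}],
    max_tokens=512,
    temperature=0.2,
)
print(out["choices"][0]["message"]["content"])
```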
I started with classical mechanics: acceleration, force, friction on an incline. The model worked through them cleanly and correctly. Then I stepped into calculus and gave it a differential equation. It immediately recognized the structure, chose the proper method, carried the mathematics through without confusion, and verified the result.
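To make the level concrete, here is a representative problem of the kind I gave it (not the exact prompt), worked the way the model laid out its steps:

```latex
% First-order linear ODE solved with an integrating factor, then verified.
\[
y' + 2y = e^{-t}
\;\Longrightarrow\;
\mu(t) = e^{2t},\quad
\left(e^{2t} y\right)' = e^{t}
\]
\[
e^{2t} y = e^{t} + C
\;\Longrightarrow\;
y(t) = e^{-t} + C e^{-2t}
\]
\[
\text{check: } y' + 2y
= \left(-e^{-t} - 2Ce^{-2t}\right) + 2\left(e^{-t} + Ce^{-2t}\right)
= e^{-t}. \checkmark
\]
```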
It did not behave like a model trying to sound intelligent. It behaved like a system trained to solve problems.
And it was doing this on a phone.
For a long time, we have associated serious reasoning in AI with massive models and enormous compute. Capability was supposed to live inside data centers. Bigger models were expected to mean smarter systems.
But watching Nanbeige4.1-3B handle college-level math and physics forces a rethink of that assumption. Intelligence is not only expanding — it is compressing. Better training and sharper reasoning alignment are allowing smaller models to operate far beyond what their size once suggested.
When structured problem-solving runs locally on pocket hardware, the implications are larger than they first appear. Experimentation becomes personal. Engineers can explore ideas without waiting on infrastructure. Students can access serious analytical capability from a device they already carry. Builders are no longer required to send every complex task into the cloud.
What makes moments like this easy to miss is that they rarely arrive with fanfare. There is no dramatic announcement; the capability simply shows up, running quietly on a device you already carry.
GLM-OCR: A Tiny 0.9B-Parameter Model That Punches Far Above Its Weight
Released today by Z.ai, GLM-OCR is a compact vision-language model designed specifically for document understanding. At just 0.9 billion parameters, it belongs to a new generation of lightweight AI systems proving that raw model size is no longer the only path to high performance.
Despite its small footprint, GLM-OCR posts exceptionally strong results across major document benchmarks. It scores 94.6 on OmniDocBench, 94.0 on OCRBench, and an impressive 96.5 on UniMERNet for formula recognition—numbers that place it alongside, and in some cases ahead of, significantly larger specialized OCR models. The takeaway is clear: efficiency is rapidly becoming a defining feature of modern AI design.
Developed by Z.ai, a research group focused on advancing multimodal foundation models, GLM-OCR reflects a broader shift toward highly optimized architectures that deliver serious capability without requiring massive compute resources.
In practical testing, the model ran successfully in Google Colab on an NVIDIA L4 GPU, demonstrating that advanced document AI is no longer restricted to large research clusters. Engineers, researchers, and developers can now deploy high-quality OCR workflows from relatively accessible hardware.
GLM-OCR signals an important trend in artificial intelligence: smaller, purpose-built models are beginning to rival heavyweight systems while being dramatically easier to run. For anyone working with scanned documents, PDFs, or structured text extraction, this release is a strong indicator of where efficient multimodal AI is heading next.
KittenTTS Nano is a lightweight, CPU-only text-to-speech model designed to prove that natural, expressive voices don’t require massive cloud stacks or GPUs. At roughly 15M parameters, it runs fast on modest hardware, supports multiple expressive voices, and exposes simple controls for pacing and tone. This makes it ideal for edge devices, demos, and anyone who wants full control over TTS without latency, lock-in, or infrastructure overhead.
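A minimal usage sketch, assuming the kittentts Python package and the generate interface described in the project README; the model id, voice name, and sample rate below are assumptions and may differ for the Nano release:

```python
# Hedged sketch: synthesize speech on CPU with KittenTTS.
# Package, model id, voice id, and sample rate are assumptions taken from
# the style of the project README; check the actual release notes.
import soundfile as sf
from kittentts import KittenTTS

tts = KittenTTS("KittenML/kitten-tts-nano-0.1")   # hypothetical model id
audio = tts.generate(
    "Edge text-to-speech without a GPU.",
    voice="expr-voice-2-f",                       # one of the expressive voices (assumed name)
)
sf.write("output.wav", audio, 24000)              # assumed 24 kHz output
```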
I recently tested LFM2-2.6B-Exp, an experimental language model developed by Liquid AI, to see how well it handles differential equations in a practical, step-by-step setting.
LFM2-2.6B-Exp is notable for how it was trained: it is an RL-first experimental checkpoint, built without supervised fine-tuning warm-up or distillation. Reinforcement learning was applied sequentially, starting with instruction following and later expanding to knowledge and math. This makes it a particularly interesting model to evaluate beyond benchmark scores.
In hands-on testing, the model performed surprisingly well for its size on standard undergraduate-level differential equations—first-order ODEs, second-order linear equations with constant coefficients, and nonhomogeneous problems using undetermined coefficients. It followed instructions closely and produced clear, structured solution steps.
However, the model showed limitations on more subtle methods, such as Laplace transforms with time shifting and variation of parameters, where maintaining mathematical invariants matters more than following a familiar template. In these cases, answers often looked correct structurally but failed under careful verification. This behavior is consistent with an RL-first training approach: strong at producing expected answer forms, but not always robust on deeper theoretical details.
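As a concrete example of the kind of detail where it slipped: the Laplace time-shifting rule only applies when the function is genuinely shifted, and this is exactly the invariant that template-matching tends to miss.

```latex
% Second shifting theorem: valid only for a shifted argument.
\[
\mathcal{L}\{u_c(t)\, f(t-c)\}(s) = e^{-cs} F(s)
\]
% For an unshifted factor g(t), the argument must be re-centered first:
\[
\mathcal{L}\{u_c(t)\, g(t)\}(s) = e^{-cs}\, \mathcal{L}\{g(t+c)\}(s)
\;\neq\; e^{-cs}\, \mathcal{L}\{g(t)\}(s) \text{ in general.}
\]
```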
Liquid AI, the company behind this model, is strongly focused on edge AI, developing efficient models designed for deployment outside large data-center environments. Their model lineup spans from very small models (millions of parameters) up to multi-billion-parameter systems such as the 2.6B checkpoint tested here.
deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient vision-token-to-performance ratio
> covers 100 languages
DeepSeek-OCR is a new open-source, vision-language OCR model from DeepSeek-AI (the same lab behind the DeepSeek-V and DeepSeek-R series). It’s built to read complex, real-world documents — screenshots, PDFs, forms, tables, and handwritten or noisy text — and output clean, structured Markdown.
---
⚙️ Core capabilities
Multimodal (Vision + Language): Uses a hybrid vision encoder + causal text decoder to “see” layouts and generate text like a language model rather than just classifying characters.
Markdown output: Instead of raw text, it structures output with Markdown syntax — headings, bullet lists, tables, and inline formatting — which makes the results ideal for direct use in notebooks or LLM pipelines.
PDF-aware: Includes a built-in PDF runner that automatically slices pages into tiles, processes each region, and re-assembles multi-page outputs.
Adaptive tiling (“crop_mode”): Automatically splits large pages into overlapping tiles for better recognition of dense, small fonts — the “Gundam mode” mentioned in their docs.
Vision backbone: Based on DeepSeek-V2’s VL-encoder (≈3 B parameters) trained on massive document + scene-text corpora. Handles resolutions up to 1280 × 1280 px and dynamically scales lower.
Language head: Uses the same causal decoder family as DeepSeek-V2, fine-tuned for text reconstruction, so it can reason about table alignment, code blocks, and list structures.
Open and MIT-licensed: Weights and inference code are fully open under the MIT license, allowing integration into other projects or retraining for domain-specific OCR.
---
🆕 What’s new about its approach
Traditional OCR (e.g., Tesseract, PaddleOCR) → detects and classifies glyphs. DeepSeek-OCR → interprets the entire document as a multimodal sequence, reasoning jointly over layout, structure, and text rather than character by character.
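For orientation, here is a hedged loading sketch, assuming the trust_remote_code pattern used on the Hugging Face model card; the infer() method, prompt format, and argument names come with the repo's custom code and are assumptions here, so verify them against deepseek-ai/DeepSeek-OCR before relying on them:

```python
# Hedged sketch: load DeepSeek-OCR and convert one page image to Markdown.
# The infer() call and its prompt/argument names are assumptions based on the
# model card's custom code; check the released repo for the exact interface.
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().cuda()

result = model.infer(                 # custom method shipped with the repo (assumed)
    tokenizer,
    prompt="<image>\nConvert the document to markdown.",  # assumed prompt format
    image_file="page.png",            # placeholder input image
    crop_mode=True,                   # adaptive tiling ("Gundam mode")
)
print(result)
```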
Okay this is insane... WebGPU-accelerated semantic video tracking, powered by DINOv3 and Transformers.js! 🤯 Demo (+ source code): webml-community/DINOv3-video-tracking
This will revolutionize AI-powered video editors... which can now run 100% locally in your browser, no server inference required (costs $0)! 😍
How does it work? 🤔
1️⃣ Generate and cache image features for each frame
2️⃣ Create a list of embeddings for selected patch(es)
3️⃣ Compute cosine similarity between each patch and the selected patch(es)
4️⃣ Highlight those whose score is above some threshold
... et voilà! 🥳
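The core of steps 2–4 is just cosine similarity over patch embeddings. A minimal sketch in Python (the demo itself runs DINOv3 through Transformers.js in the browser; the threshold and feature dimensions here are arbitrary placeholders):

```python
# Minimal sketch of the patch-matching step: cosine similarity between every
# patch embedding in a frame and the embeddings of the user-selected patches.
import numpy as np

def match_patches(frame_patches: np.ndarray,   # (num_patches, dim) per-frame features
                  selected: np.ndarray,        # (num_selected, dim) reference features
                  threshold: float = 0.6) -> np.ndarray:
    """Return a boolean mask of patches similar to any selected patch."""
    # L2-normalize so the dot product equals cosine similarity.
    f = frame_patches / np.linalg.norm(frame_patches, axis=1, keepdims=True)
    s = selected / np.linalg.norm(selected, axis=1, keepdims=True)
    sims = f @ s.T                             # (num_patches, num_selected)
    return sims.max(axis=1) >= threshold       # keep best match above threshold

# Example with random stand-in features (a real pipeline would cache per-frame features).
mask = match_patches(np.random.randn(256, 384), np.random.randn(3, 384))
print(mask.shape, mask.sum())
```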
You can also make selections across frames to improve temporal consistency! This is super useful if the object changes its appearance slightly throughout the video.
Liquid just released two VLMs, at 450M and 1.6B parameters!
They're super fast and leverage SigLIP2 NaFlex encoders to handle native resolutions without distortion, making them ideal for on-device deployment in constrained environments like phones.
They're available today on Hugging Face, with inference and fine-tuning Colab notebooks.
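A hedged inference sketch, assuming the standard transformers image-text-to-text chat-template pattern; the repo id, image handling, and generation settings below are assumptions, and the official Colab notebooks are the authoritative reference:

```python
# Hedged sketch: run one image + text prompt through the smaller Liquid VLM.
# The model id and the content format of the conversation are assumptions;
# see the official inference notebook for the supported pattern.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "LiquidAI/LFM2-VL-450M"     # assumed Hugging Face repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("photo.jpg")},  # placeholder image
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```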