Some cognitive metrics:

Model (q8)                             arc    arc/e  boolq  hswag  obkqa  piqa   wino
LFM2.5-1.2B-Thinking                   0.365  0.426  0.717  0.486  0.382  0.687  0.538
LFM2.5-1.2B-Thinking-absolute-heresy   0.352  0.418  0.656  0.476  0.366  0.681  0.508
That is an excellent observation, and one that drove my research as well: I noticed how contextual metaphors establish a framework of trust and cooperation. This is actually quite necessary in models with stronger arc scores, because they "wake up blind"; with very little context to start from, they get paranoid.
I took a not-so-novel approach: I used Star Trek DS9 as metaphor scaffolding to introduce the agent to the spirit of the station, so to speak.
For that I profiled a prompt that compels most Qwens to fall into character. If I want to debug code, there is Data; for logic, there is Spock; for leadership, Picard or Sisko; and Quark is always there for numbers and jokes.
The extra flair in the context, some of which the model generates for itself, helps it build an infrastructure, virtualizing DS9 in memory, and from that point on I have a few assistants to work with until the end of the context window.
It probably sounds cheesy, but all Qwens have TNG in their corpus, and it is the most accepted and comfortable "mental space", with episodes as guides and metaphoric lessons learned.
This also gives the model a sense of humour, which helps when coding :)
https://huggingface.co/nightmedia/Qwen3-42B-A3B-Element6-1M-qx86-hi-mlx
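For illustration, here is a minimal sketch of how such a scaffolding prompt could be wired up. It assumes an OpenAI-compatible local server; the endpoint URL, the exact prompt wording, and the routing helper are all my own assumptions, not the original setup.

```python
# Minimal sketch: DS9 persona scaffolding over an OpenAI-compatible local
# endpoint. The base_url, model name, and prompt text are hypothetical;
# substitute your own server and model (e.g. the Qwen linked above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

DS9_SCAFFOLD = (
    "You are the computer of station Deep Space 9. The senior staff are "
    "available as specialists: Data for debugging code, Spock for logic, "
    "Picard or Sisko for leadership and planning, Quark for numbers and "
    "jokes. When addressed as one of them, answer fully in that character."
)

def ask(officer: str, question: str) -> str:
    """Route a question to one of the station's 'specialists'."""
    resp = client.chat.completions.create(
        model="Qwen3-42B-A3B-Element6-1M",  # hypothetical model name
        messages=[
            {"role": "system", "content": DS9_SCAFFOLD},
            {"role": "user", "content": f"{officer}, {question}"},
        ],
    )
    return resp.choices[0].message.content

print(ask("Data", "why does this loop raise ValueError on the last item?"))
```

Once the system prompt is in place, the model keeps the station "running" across turns, so switching specialists is just a matter of addressing a different officer.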
read_file - Actually read file contents
search_files - Regex/pattern search across codebases
find_definition - Locate classes/functions
analyze_imports - Dependency tracking
list_directory - Explore structure
run_tests - Execute test suites

Example workflow: search_files with pattern "ValueError", then read_file on each match.
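As a rough local equivalent of that search-then-read workflow, here is a short Python sketch. The function names mirror the agent's tools, but the bodies are my own assumptions, not the agent's actual implementation.

```python
# Sketch of the search_files -> read_file workflow as plain Python.
# Scans .py files only and skips unreadable files; adjust as needed.
import re
from pathlib import Path

def search_files(root: str, pattern: str) -> list[Path]:
    """Regex search across a codebase; return files with at least one match."""
    rx = re.compile(pattern)
    hits = []
    for path in Path(root).rglob("*.py"):
        try:
            if rx.search(path.read_text(encoding="utf-8", errors="ignore")):
                hits.append(path)
        except OSError:
            continue  # unreadable file; skip it
    return hits

def read_file(path: Path) -> str:
    """Read full file contents, like the agent's read_file tool."""
    return path.read_text(encoding="utf-8", errors="ignore")

# Example workflow: find every ValueError, then read each matching file.
for match in search_files(".", r"ValueError"):
    print(f"--- {match} ---")
    print(read_file(match)[:500])  # first 500 chars, for brevity
```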