MemReader-4B-thinking
Introduction
MemReader-4B-thinking is a 4B language model for long-term agent memory management. Instead of treating memory writing as a one-step structured extraction task, it formulates memory construction as a reasoning-and-action process: the model first evaluates whether incoming information is valuable, complete, and unambiguous, and then selects one of four memory operations:
- add_memory: write useful and complete information into long-term memory
- search_memory: retrieve historical memory for disambiguation
- buffer_memory: temporarily hold incomplete but potentially valuable information
- ignore_memory: discard low-value or repetitive content
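As a minimal sketch, these four operations can be wired to a local store with a small dispatcher. The MemoryStore class, its fields, and the dispatch method below are illustrative assumptions for integration purposes, not part of the MemReader release:

```python
# Illustrative sketch: routing the four memory operations to a local store.
# MemoryStore and its method names are assumptions, not a MemReader API.
class MemoryStore:
    def __init__(self):
        self.memories = []   # committed long-term memories
        self.buffer = []     # incomplete information held for later turns

    def dispatch(self, name, args):
        """Execute one tool call emitted by the model and return its result."""
        if name == "add_memory":
            self.memories.extend(args["memory_list"])
            return args["summary"]
        if name == "search_memory":
            q = args["query"].lower()
            return [m for m in self.memories if q in m["value"].lower()]
        if name == "buffer_memory":
            self.buffer.append(args["reason"])
            return "buffered"
        if name == "ignore_memory":
            return "ignored"
        raise ValueError(f"unknown operation: {name}")

store = MemoryStore()
store.dispatch("add_memory", {
    "memory_list": [{"key": "Rust rewrite", "memory_type": "LongTermMemory",
                     "value": "Michael plans to rewrite core modules in Rust",
                     "tags": ["project", "Rust"]}],
    "summary": "Added one project memory.",
})
print(store.dispatch("search_memory", {"query": "rust"}))
```

In a real pipeline, search_memory would typically query an external retrieval backend rather than a substring scan.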
Built on top of Qwen/Qwen3-4B, MemReader-4B-thinking is further optimized for memory management with supervised fine-tuning and GRPO. It is designed for long-horizon dialogue systems, personalized assistants, and agent frameworks that require low-noise, updatable, and retrievable long-term memory.
News
- MemReader-4B-thinking is released as an open model for active memory management.
- The model is designed for tool-calling workflows and memory-centric agent systems.
- It is part of the MemReader family introduced in the paper MemReader: Active Memory Management for Long-Term Agent Memory.
Usage
- Model ID: IAAR-Shanghai/MemReader-4B-thinking
- Base model: Qwen/Qwen3-4B
- Primary use: long-term memory extraction and memory management for agents
- Inference modes: transformers, OpenAI-compatible serving, vLLM, and SGLang
Citation
If you use MemReader in your research or product, please cite:
@misc{kang2025memreader,
title={MemReader: Active Memory Management for Long-Term Agent Memory},
author={Kang, Jingyi and Li, Chunyu and Chen, Ding and Tang, Bo and Xiong, Feiyu and Li, Zhiyu},
year={2026},
note={Manuscript in preparation}
}
Highlights
- Active memory management instead of passive memory extraction
- Explicit reasoning with thinking traces and tool calls
- Strong performance on ambiguity resolution, knowledge update, and temporal reasoning
- Native fit for OpenAI-style tool-calling workflows
- Efficient local deployment with a 4B parameter footprint
- Designed for integration with memory-centric agent systems such as MemOS
What Makes MemReader Different
Most memory pipelines directly convert the current dialogue into JSON memories. In realistic settings, that approach is often insufficient:
- low-value chatter can pollute memory
- pronouns and missing references may require historical lookup
- some information is useful but not yet complete
- newer facts may need to update or overwrite older memory
MemReader-4B-thinking reframes memory writing as active memory management. Under a ReAct-style workflow, the model reasons before acting, making memory construction closer to how practical agent systems maintain state over time.
Benchmark Performance
MemReader was evaluated on LOCOMO, LongMemEval, and HaluMem. The 4B GRPO version showed especially strong gains on knowledge update, temporal reasoning, and end-to-end memory usability.
LOCOMO
| Model | Single Hop | Multi Hop | Temporal | Open Domain | Overall | F1 | Avg. Token |
|---|---|---|---|---|---|---|---|
| MemOS (4o-mini) | 84.06% | 73.16% | 75.90% | 57.29% | 78.70% | 51.90% | 1854 |
| MemReader-0.6B | 84.70% | 76.95% | 76.22% | 53.40% | 79.56% | 52.54% | 1976 |
| MemReader-4B-SFT | 81.88% | 76.12% | 71.02% | 62.15% | 77.33% | 47.77% | 784 |
| MemReader-4B-GRPO | 85.37% | 81.44% | 75.80% | 65.62% | 81.42% | 49.45% | 1950 |
LongMemEval
| Model | Avg. Token | SS-User | SS-Asst | SS-Pref | Multi-Session | Knowledge Update | Temporal Reasoning | Overall |
|---|---|---|---|---|---|---|---|---|
| MemOS | 1400 | 95.71% | 67.86% | 96.67% | 70.67% | 74.26% | 77.44% | 77.80% |
| EverMemOS | 2800 | 97.14% | 85.71% | 93.33% | 73.68% | 89.74% | 77.44% | 83.00% |
| MemReader-0.6B | 1166 | 95.71% | 75.00% | 90.00% | 75.18% | 82.05% | 75.90% | 80.20% |
| MemReader-4B-SFT | 963 | 97.10% | 69.64% | 90.00% | 71.42% | 85.80% | 78.19% | 80.00% |
| MemReader-4B-GRPO | 922 | 94.29% | 73.21% | 90.00% | 73.68% | 91.03% | 84.21% | 83.00% |
HaluMem
The full HaluMem table in the paper is relatively long. Below we report a compact subset of the memory extraction and memory updating results.
| Model | Extraction Recall | Extraction Weighted Recall | Extraction F1 | Update Correctness | Update Hallucination | Update Omission |
|---|---|---|---|---|---|---|
| MemOS | 74.07% | 84.81% | 79.70% | 62.11% | 0.42% | 37.48% |
| MemReader-0.6B | 88.40% | 91.38% | 93.76% | 82.69% | 0.77% | 16.51% |
| MemReader-4B-SFT | 93.56% | 95.49% | 96.61% | 90.78% | 0.26% | 8.74% |
| MemReader-4B-GRPO | 96.57% | 97.19% | 98.21% | 94.55% | 0.32% | 5.12% |
These results show that stronger memory writing quality also translates into better memory updating behavior, especially on correctness and omission.
Recommended Use Cases
- long-term conversational agents
- personalized assistants
- agent memory extraction pipelines
- memory update and conflict resolution workflows
- retrieval-augmented memory systems
Intended Use
MemReader-4B-thinking is intended for research and production scenarios where an agent needs to convert conversational context into structured long-term memory. Typical use cases include memory extraction, ambiguity resolution with retrieval, memory update pipelines, and persistent assistant systems.
The model is especially suitable when the application requires explicit control over memory-writing behavior through tool calls such as search_memory, add_memory, buffer_memory, and ignore_memory.
Model Specs
- Base model: Qwen/Qwen3-4B
- Parameters: 4B
- Tensor type: BF16
- Architecture: Qwen3ForCausalLM
- Context length: 40,960 tokens
- Primary capability: reasoning-driven memory extraction with tool calling
Quickstart
OpenAI-Compatible API Example
The following example calls the model through an OpenAI-compatible endpoint with required tool calling.
import json
import requests
url = "https://YOUR_ENDPOINT/v1/chat/completions"
payload = {
"model": "IAAR-Shanghai/MemReader-4B-thinking",
"extra_body": {
"chat_template_kwargs": {
"enable_thinking": True
}
},
"messages": [
{
"role": "system",
"content": (
"You are a memory extraction agent. Your job is to analyze "
"conversations and decide what information is worth storing in "
"long-term memory.\n\n"
"Available actions (call exactly one per turn):\n"
"- search_memory: Search existing memories for context\n"
"- add_memory: Extract and store valuable facts, preferences, or events\n"
"- buffer_memory: Accumulate this turn and wait for more context\n"
"- ignore_memory: Nothing worth storing\n\n"
"Guidelines:\n"
"- Store specific, verifiable facts\n"
"- Do not store generic greetings, chitchat, or vague statements\n"
"- UserMemory: personal attributes or preferences about the user\n"
"- LongTermMemory: facts, events, or shared knowledge from the conversation\n"
"- If unsure whether information already exists, call search_memory first"
),
},
{
"role": "user",
"content": (
"Please analyze the following conversation and decide what to store:\n\n"
"[user]: How is that project at the company going lately? The one he said he wanted to rewrite with a new language.\n"
"[assistant]: Do you mean the recommendation system refactoring project? Last time we mentioned that Michael planned to rewrite some core modules in Rust, and it was still in the evaluation stage.\n"
"[user]: Yes, that one. He said he is going to produce a performance comparison report this week, benchmarking Python against Rust."
),
},
],
"tools": [
{
"type": "function",
"function": {
"name": "search_memory",
"description": "Search historical memories for context.",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"}
},
"required": ["query"],
},
},
},
{
"type": "function",
"function": {
"name": "add_memory",
"description": "Extract and store memories.",
"parameters": {
"type": "object",
"properties": {
"memory_list": {
"type": "array",
"items": {
"type": "object",
"properties": {
"key": {"type": "string"},
"memory_type": {
"type": "string",
"enum": ["LongTermMemory", "UserMemory"],
},
"value": {"type": "string"},
"tags": {
"type": "array",
"items": {"type": "string"},
},
},
"required": ["key", "memory_type", "value", "tags"],
},
},
"summary": {"type": "string"},
},
"required": ["memory_list", "summary"],
},
},
},
{
"type": "function",
"function": {
"name": "buffer_memory",
"description": "Buffer for later processing.",
"parameters": {
"type": "object",
"properties": {
"reason": {"type": "string"}
},
"required": ["reason"],
},
},
},
{
"type": "function",
"function": {
"name": "ignore_memory",
"description": "Ignore low-value content.",
"parameters": {
"type": "object",
"properties": {
"reason": {"type": "string"}
},
"required": ["reason"],
},
},
},
],
"tool_choice": "required",
"temperature": 0.2,
"max_tokens": 1024,
}
headers = {
"Authorization": "Bearer YOUR_API_KEY",
"Content-Type": "application/json",
}
response = requests.post(url, headers=headers, json=payload)
print(response.text)
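Since tool calling is set to required, the useful payload of the response is the tool call itself. A minimal sketch of extracting it, following the standard OpenAI chat-completions response schema (the example dict below is illustrative, not real model output):

```python
import json

def extract_tool_call(response_json):
    """Return (tool_name, arguments_dict) from the first tool call, or None."""
    message = response_json["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    if not calls:
        return None
    call = calls[0]["function"]
    # The OpenAI schema serializes arguments as a JSON string.
    return call["name"], json.loads(call["arguments"])

# Illustrative response payload in the OpenAI chat-completions shape.
example = {
    "choices": [{
        "message": {
            "tool_calls": [{
                "function": {
                    "name": "add_memory",
                    "arguments": '{"memory_list": [], "summary": "example"}',
                }
            }]
        }
    }]
}
name, args = extract_tool_call(example)
print(name)  # add_memory
```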
Hugging Face Transformers Usage
You can also load the model directly from Hugging Face and run memory extraction with tool calling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "IAAR-Shanghai/MemReader-4B-thinking"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto",
)
messages = [
{
"role": "system",
"content": (
"You are a memory extraction agent. Analyze conversations and decide "
"what information should be stored in long-term memory."
),
},
{
"role": "user",
"content": (
"Please analyze the following conversation and decide what to store:\n\n"
"[user]: How is that project at the company going lately? The one he said he wanted to rewrite with a new language.\n"
"[assistant]: Do you mean the recommendation system refactoring project? Last time we mentioned that Michael planned to rewrite some core modules in Rust, and it was still in the evaluation stage.\n"
"[user]: Yes, that one. He said he is going to produce a performance comparison report this week, benchmarking Python against Rust."
),
},
]
tools = [
{
"type": "function",
"function": {
"name": "search_memory",
"description": "Search historical memories for context.",
"parameters": {
"type": "object",
"properties": {"query": {"type": "string"}},
"required": ["query"],
},
},
},
{
"type": "function",
"function": {
"name": "add_memory",
"description": "Extract and store memories.",
"parameters": {
"type": "object",
"properties": {
"memory_list": {
"type": "array",
"items": {
"type": "object",
"properties": {
"key": {"type": "string"},
"memory_type": {
"type": "string",
"enum": ["LongTermMemory", "UserMemory"],
},
"value": {"type": "string"},
"tags": {
"type": "array",
"items": {"type": "string"},
},
},
"required": ["key", "memory_type", "value", "tags"],
},
},
"summary": {"type": "string"},
},
"required": ["memory_list", "summary"],
},
},
},
{
"type": "function",
"function": {
"name": "buffer_memory",
"description": "Buffer for later processing.",
"parameters": {
"type": "object",
"properties": {"reason": {"type": "string"}},
"required": ["reason"],
},
},
},
{
"type": "function",
"function": {
"name": "ignore_memory",
"description": "Ignore low-value content.",
"parameters": {
"type": "object",
"properties": {"reason": {"type": "string"}},
"required": ["reason"],
},
},
},
]
text = tokenizer.apply_chat_template(
messages,
tools=tools,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
output = tokenizer.decode(output_ids, skip_special_tokens=True)
print(output)
vLLM Usage
Start an OpenAI-compatible vLLM server:
python -m vllm.entrypoints.openai.api_server \
--model IAAR-Shanghai/MemReader-4B-thinking \
--served-model-name MemReader-4B-thinking \
--port 8000 \
--tensor-parallel-size 1 \
--enable-auto-tool-choice \
--tool-call-parser hermes
Then send a standard chat completion request to http://localhost:8000/v1/chat/completions.
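As a sketch, a minimal request against that endpoint could look like the following. The model name must match --served-model-name above; the payload is abbreviated and omits the tools array shown in the Quickstart, which you would include in practice:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MemReader-4B-thinking",
    "messages": [
      {"role": "user", "content": "Please analyze the following conversation and decide what to store: ..."}
    ],
    "temperature": 0.2,
    "max_tokens": 1024
  }'
```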
SGLang Usage
MemReader-4B-thinking can also be deployed with SGLang through its OpenAI-compatible serving interface. Please make sure tool calling and thinking mode are enabled in your serving configuration.
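As a sketch, an SGLang launch might look like the following. The exact flag names, in particular the tool-call and reasoning parser options and their accepted values, vary across SGLang versions, so treat these as assumptions to verify against the documentation for your installed version:

```shell
python -m sglang.launch_server \
  --model-path IAAR-Shanghai/MemReader-4B-thinking \
  --port 8000 \
  --tool-call-parser qwen25 \
  --reasoning-parser qwen3
```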
Output Format
MemReader-4B-thinking is trained to produce thinking traces and tool calls. A typical response looks like this:
<think>
The conversation refers to an already known project and adds a new update:
Michael plans to produce a Python vs Rust benchmark report this week.
This is valuable project-state information and should be added to memory.
</think>
<tool_call>
{"name": "add_memory", "arguments": {"memory_list": [{"key": "Rust benchmark plan", "memory_type": "LongTermMemory", "value": "Michael said the recommendation system refactoring project is still in evaluation, and he plans to produce a Python-vs-Rust benchmark report this week for the core modules under consideration for Rust rewriting.", "tags": ["project", "Rust", "benchmark", "refactoring"]}], "summary": "Added one memory about the project update and the planned benchmark report."}}
</tool_call>
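When serving without a built-in parser, the raw completion can be split into the thinking trace and the tool call with a small amount of post-processing. A minimal sketch, assuming the tag format shown above:

```python
import json
import re

def parse_model_output(text):
    """Split a raw completion into (thinking_text, tool_call_dict)."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    call = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    thinking = think.group(1).strip() if think else ""
    tool_call = json.loads(call.group(1)) if call else None
    return thinking, tool_call

# Illustrative output string, not real model output.
sample = (
    "<think>New project update worth storing.</think>\n"
    "<tool_call>\n"
    '{"name": "ignore_memory", "arguments": {"reason": "test"}}\n'
    "</tool_call>"
)
thinking, call = parse_model_output(sample)
print(call["name"])  # ignore_memory
```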
Best Practices
- Use search_memory first when the conversation contains pronouns, ellipsis, or implicit historical references.
- Use buffer_memory only when the information is genuinely incomplete and cannot be resolved from history.
- Keep tool definitions stable between training and inference.
- For production pipelines, execute tool calls externally and feed tool responses back to the model when multi-step reasoning is needed.
- If you want shorter outputs, reduce max_tokens and control whether thinking traces are exposed in your serving layer.
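The external tool-execution loop for multi-step reasoning can be sketched as follows. Here call_model is a stub standing in for your actual inference call, and the message format is a simplified assumption; in practice search_memory results would come from your retrieval backend:

```python
import json

def call_model(messages):
    # Stub standing in for real inference: first turn searches for context,
    # the follow-up turn (after a tool result arrives) commits a memory.
    if not any(m["role"] == "tool" for m in messages):
        return {"name": "search_memory", "arguments": {"query": "project"}}
    return {"name": "add_memory",
            "arguments": {"memory_list": [], "summary": "stored update"}}

def run_turn(messages, execute_tool, max_steps=3):
    """Loop: model emits a tool call; search results are fed back as tool
    messages; any other operation ends the turn."""
    for _ in range(max_steps):
        call = call_model(messages)
        result = execute_tool(call["name"], call["arguments"])
        if call["name"] != "search_memory":
            return call, result  # add/buffer/ignore are terminal for the turn
        messages.append({"role": "tool", "content": json.dumps(result)})
    return call, result

final_call, result = run_turn(
    [{"role": "user", "content": "..."}],
    execute_tool=lambda name, args: [] if name == "search_memory" else "ok",
)
print(final_call["name"])  # add_memory
```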
Limitations
- The model is optimized for memory-management scenarios rather than general-purpose chatting.
- Quality depends on the external memory schema, retrieval quality, and tool-execution loop.
- For highly domain-specific memory schemas, additional instruction tuning may still be beneficial.
- As with other LLMs, outputs may still contain mistakes, omissions, or unsupported inferences and should be validated in safety-critical workflows.
License Notice
This model is released under the Apache-2.0 license. As it is derived from Qwen/Qwen3-4B, users should also review and comply with the upstream base model license, usage terms, and any applicable third-party requirements before deployment.
Links
- GitHub: MemTensor/MemOS
- API Documentation: docs.openmem.net
- Model: IAAR-Shanghai/MemReader-4B-thinking
- Base model: Qwen/Qwen3-4B