MemReader-4B-thinking
Introduction
MemReader-4B-thinking is a 4B language model for long-term agent memory management. Instead of treating memory writing as a one-step structured extraction task, it formulates memory construction as a reasoning-and-action process: the model first evaluates whether incoming information is valuable, complete, and unambiguous, and then selects one of four memory operations:
- add_memory: write useful and complete information into long-term memory
- search_memory: retrieve historical memory for disambiguation
- buffer_memory: temporarily hold incomplete but potentially valuable information
- ignore_memory: discard low-value or repetitive content
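As a minimal sketch, these four operations can be wired to a local store with a small dispatcher. The MemoryStore class, its fields, and the dispatch method below are illustrative assumptions for integration purposes, not part of the MemReader release:

```python
# Illustrative sketch: routing the four memory operations to a local store.
# MemoryStore and its method names are assumptions, not a MemReader API.
class MemoryStore:
    def __init__(self):
        self.memories = []   # committed long-term memories
        self.buffer = []     # incomplete information held for later turns

    def dispatch(self, name, args):
        """Execute one tool call emitted by the model and return its result."""
        if name == "add_memory":
            self.memories.extend(args["memory_list"])
            return args["summary"]
        if name == "search_memory":
            q = args["query"].lower()
            return [m for m in self.memories if q in m["value"].lower()]
        if name == "buffer_memory":
            self.buffer.append(args["reason"])
            return "buffered"
        if name == "ignore_memory":
            return "ignored"
        raise ValueError(f"unknown operation: {name}")

store = MemoryStore()
store.dispatch("add_memory", {
    "memory_list": [{"key": "Rust rewrite", "memory_type": "LongTermMemory",
                     "value": "Michael plans to rewrite core modules in Rust",
                     "tags": ["project", "Rust"]}],
    "summary": "Added one project memory.",
})
print(store.dispatch("search_memory", {"query": "rust"}))
```

In a real pipeline, search_memory would typically query an external retrieval backend rather than a substring scan.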
Built on top of Qwen/Qwen3-4B, MemReader-4B-thinking is further optimized for memory management with supervised fine-tuning and GRPO. It is designed for long-horizon dialogue systems, personalized assistants, and agent frameworks that require low-noise, updatable, and retrievable long-term memory.
News
- MemReader-4B-thinking is released as an open model for active memory management.
- The model is designed for tool-calling workflows and memory-centric agent systems.
- It is part of the MemReader family introduced in the paper MemReader: Active Memory Management for Long-Term Agent Memory.
Usage
- Model ID: IAAR-Shanghai/MemReader-4B-thinking
- Base model: Qwen/Qwen3-4B
- Primary use: long-term memory extraction and memory management for agents
- Inference modes: transformers, OpenAI-compatible serving, vLLM, and SGLang
Citation
If you use MemReader in your research or product, please cite:
@misc{kang2025memreader,
title={MemReader: Active Memory Management for Long-Term Agent Memory},
author={Kang, Jingyi and Li, Chunyu and Chen, Ding and Tang, Bo and Xiong, Feiyu and Li, Zhiyu},
year={2026},
note={Manuscript in preparation}
}
Highlights
- Active memory management instead of passive memory extraction
- Explicit reasoning with thinking traces and tool calls
- Strong performance on ambiguity resolution, knowledge update, and temporal reasoning
- Native fit for OpenAI-style tool-calling workflows
- Efficient local deployment with a 4B parameter footprint
- Designed for integration with memory-centric agent systems such as MemOS
What Makes MemReader Different
Most memory pipelines directly convert the current dialogue into JSON memories. In realistic settings, that approach is often insufficient:
- low-value chatter can pollute memory
- pronouns and missing references may require historical lookup
- some information is useful but not yet complete
- newer facts may need to update or overwrite older memory
MemReader-4B-thinking reframes memory writing as active memory management. Under a ReAct-style workflow, the model reasons before acting, making memory construction closer to how practical agent systems maintain state over time.
Benchmark Performance
MemReader was evaluated on LOCOMO, LongMemEval, and HaluMem. The 4B GRPO version showed especially strong gains on knowledge update, temporal reasoning, and end-to-end memory usability.
LOCOMO
| Model | Single Hop | Multi Hop | Temporal | Open Domain | Overall | F1 | Avg. Token |
|---|---|---|---|---|---|---|---|
| MemOS (4o-mini) | 84.06% | 73.16% | 75.90% | 57.29% | 78.70% | 51.90% | 1854 |
| MemReader-0.6B | 84.70% | 76.95% | 76.22% | 53.40% | 79.56% | 52.54% | 1976 |
| MemReader-4B-SFT | 81.88% | 76.12% | 71.02% | 62.15% | 77.33% | 47.77% | 784 |
| MemReader-4B-GRPO | 85.37% | 81.44% | 75.80% | 65.62% | 81.42% | 49.45% | 1950 |
LongMemEval
| Model | Avg. Token | SS-User | SS-Asst | SS-Pref | Multi-Session | Knowledge Update | Temporal Reasoning | Overall |
|---|---|---|---|---|---|---|---|---|
| MemOS | 1400 | 95.71% | 67.86% | 96.67% | 70.67% | 74.26% | 77.44% | 77.80% |
| EverMemOS | 2800 | 97.14% | 85.71% | 93.33% | 73.68% | 89.74% | 77.44% | 83.00% |
| MemReader-0.6B | 1166 | 95.71% | 75.00% | 90.00% | 75.18% | 82.05% | 75.90% | 80.20% |
| MemReader-4B-SFT | 963 | 97.10% | 69.64% | 90.00% | 71.42% | 85.80% | 78.19% | 80.00% |
| MemReader-4B-GRPO | 922 | 94.29% | 73.21% | 90.00% | 73.68% | 91.03% | 84.21% | 83.00% |
HaluMem
The full HaluMem table in the paper is relatively long. Below we report a compact subset of the memory extraction and memory updating results.
| Model | Extraction Recall | Extraction Weighted Recall | Extraction F1 | Update Correctness | Update Hallucination | Update Omission |
|---|---|---|---|---|---|---|
| MemOS | 74.07% | 84.81% | 79.70% | 62.11% | 0.42% | 37.48% |
| MemReader-0.6B | 88.40% | 91.38% | 93.76% | 82.69% | 0.77% | 16.51% |
| MemReader-4B-SFT | 93.56% | 95.49% | 96.61% | 90.78% | 0.26% | 8.74% |
| MemReader-4B-GRPO | 96.57% | 97.19% | 98.21% | 94.55% | 0.32% | 5.12% |
These results show that stronger memory writing quality also translates into better memory updating behavior, especially on correctness and omission.
Recommended Use Cases
- long-term conversational agents
- personalized assistants
- agent memory extraction pipelines
- memory update and conflict resolution workflows
- retrieval-augmented memory systems
Intended Use
MemReader-4B-thinking is intended for research and production scenarios where an agent needs to convert conversational context into structured long-term memory. Typical use cases include memory extraction, ambiguity resolution with retrieval, memory update pipelines, and persistent assistant systems.
The model is especially suitable when the application requires explicit control over memory-writing behavior through tool calls such as search_memory, add_memory, buffer_memory, and ignore_memory.
Model Specs
- Base model: Qwen/Qwen3-4B
- Parameters: 4B
- Tensor type: BF16
- Architecture: Qwen3ForCausalLM
- Context length: 40,960 tokens
- Primary capability: reasoning-driven memory extraction with tool calling
Quickstart
OpenAI-Compatible API Example
The following example calls the model through an OpenAI-compatible endpoint with required tool calling.
import json
import requests
url = "https://YOUR_ENDPOINT/v1/chat/completions"
payload = {
"model": "IAAR-Shanghai/MemReader-4B-thinking",
"extra_body": {
"chat_template_kwargs": {
"enable_thinking": True
}
},
"messages": [
{
"role": "system",
"content": (
"You are a memory extraction agent. Your job is to analyze "
"conversations and decide what information is worth storing in "
"long-term memory.\n\n"
"Available actions (call exactly one per turn):\n"
"- search_memory: Search existing memories for context\n"
"- add_memory: Extract and store valuable facts, preferences, or events\n"
"- buffer_memory: Accumulate this turn and wait for more context\n"
"- ignore_memory: Nothing worth storing\n\n"
"Guidelines:\n"
"- Store specific, verifiable facts\n"
"- Do not store generic greetings, chitchat, or vague statements\n"
"- UserMemory: personal attributes or preferences about the user\n"
"- LongTermMemory: facts, events, or shared knowledge from the conversation\n"
"- If unsure whether information already exists, call search_memory first"
),
},
{
"role": "user",
"content": (
"Please analyze the following conversation and decide what to store:\n\n"
"[user]: How is that project at the company going lately? The one he said he wanted to rewrite with a new language.\n"
"[assistant]: Do you mean the recommendation system refactoring project? Last time we mentioned that Michael planned to rewrite some core modules in Rust, and it was still in the evaluation stage.\n"
"[user]: Yes, that one. He said he is going to produce a performance comparison report this week, benchmarking Python against Rust."
),
},
],
"tools": [
{
"type": "function",
"function": {
"name": "search_memory",
"description": "Search historical memories for context.",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"}
},
"required": ["query"],
},
},
},
{
"type": "function",
"function": {
"name": "add_memory",
"description": "Extract and store memories.",
"parameters": {
"type": "object",
"properties": {
"memory_list": {
"type": "array",
"items": {
"type": "object",
"properties": {
"key": {"type": "string"},
"memory_type": {
"type": "string",
"enum": ["LongTermMemory", "UserMemory"],
},
"value": {"type": "string"},
"tags": {
"type": "array",
"items": {"type": "string"},
},
},
"required": ["key", "memory_type", "value", "tags"],
},
},
"summary": {"type": "string"},
},
"required": ["memory_list", "summary"],
},
},
},
{
"type": "function",
"function": {
"name": "buffer_memory",
"description": "Buffer for later processing.",
"parameters": {
"type": "object",
"properties": {
"reason": {"type": "string"}
},
"required": ["reason"],
},
},
},
{
"type": "function",
"function": {
"name": "ignore_memory",
"description": "Ignore low-value content.",
"parameters": {
"type": "object",
"properties": {
"reason": {"type": "string"}
},
"required": ["reason"],
},
},
},
],
"tool_choice": "required",
"temperature": 0.2,
"max_tokens": 1024,
}
headers = {
"Authorization": "Bearer YOUR_API_KEY",
"Content-Type": "application/json",
}
response = requests.post(url, headers=headers, json=payload)
print(response.text)
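Since tool calling is set to required, the useful payload of the response is the tool call itself. A minimal sketch of extracting it, following the standard OpenAI chat-completions response schema (the example dict below is illustrative, not real model output):

```python
import json

def extract_tool_call(response_json):
    """Return (tool_name, arguments_dict) from the first tool call, or None."""
    message = response_json["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    if not calls:
        return None
    call = calls[0]["function"]
    # The OpenAI schema serializes arguments as a JSON string.
    return call["name"], json.loads(call["arguments"])

# Illustrative response payload in the OpenAI chat-completions shape.
example = {
    "choices": [{
        "message": {
            "tool_calls": [{
                "function": {
                    "name": "add_memory",
                    "arguments": '{"memory_list": [], "summary": "example"}',
                }
            }]
        }
    }]
}
name, args = extract_tool_call(example)
print(name)  # add_memory
```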
Hugging Face Transformers Usage
You can also load the model directly from Hugging Face and run memory extraction with tool calling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "IAAR-Shanghai/MemReader-4B-thinking"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto",
)
messages = [
{
"role": "system",
"content": (
"You are a memory extraction agent. Analyze conversations and decide "
"what information should be stored in long-term memory."
),
},
{
"role": "user",
"content": (
"Please analyze the following conversation and decide what to store:\n\n"
"[user]: How is that project at the company going lately? The one he said he wanted to rewrite with a new language.\n"
"[assistant]: Do you mean the recommendation system refactoring project? Last time we mentioned that Michael planned to rewrite some core modules in Rust, and it was still in the evaluation stage.\n"
"[user]: Yes, that one. He said he is going to produce a performance comparison report this week, benchmarking Python against Rust."
),
},
]
tools = [
{
"type": "function",
"function": {
"name": "search_memory",
"description": "Search historical memories for context.",
"parameters": {
"type": "object",
"properties": {"query": {"type": "string"}},
"required": ["query"],
},
},
},
{
"type": "function",
"function": {
"name": "add_memory",
"description": "Extract and store memories.",
"parameters": {
"type": "object",
"properties": {
"memory_list": {
"type": "array",
"items": {
"type": "object",
"properties": {
"key": {"type": "string"},
"memory_type": {
"type": "string",
"enum": ["LongTermMemory", "UserMemory"],
},
"value": {"type": "string"},
"tags": {
"type": "array",
"items": {"type": "string"},
},
},
"required": ["key", "memory_type", "value", "tags"],
},
},
"summary": {"type": "string"},
},
"required": ["memory_list", "summary"],
},
},
},
{
"type": "function",
"function": {
"name": "buffer_memory",
"description": "Buffer for later processing.",
"parameters": {
"type": "object",
"properties": {"reason": {"type": "string"}},
"required": ["reason"],
},
},
},
{
"type": "function",
"function": {
"name": "ignore_memory",
"description": "Ignore low-value content.",
"parameters": {
"type": "object",
"properties": {"reason": {"type": "string"}},
"required": ["reason"],
},
},
},
]
text = tokenizer.apply_chat_template(
messages,
tools=tools,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
output = tokenizer.decode(output_ids, skip_special_tokens=True)
print(output)
vLLM Usage
Start an OpenAI-compatible vLLM server:
python -m vllm.entrypoints.openai.api_server \
--model IAAR-Shanghai/MemReader-4B-thinking \
--served-model-name MemReader-4B-thinking \
--port 8000 \
--tensor-parallel-size 1 \
--enable-auto-tool-choice \
--tool-call-parser hermes
Then send a standard chat completion request to http://localhost:8000/v1/chat/completions.
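As a sketch, a minimal request against that endpoint could look like the following. The model name must match --served-model-name above; the payload is abbreviated and omits the tools array shown in the Quickstart, which you would include in practice:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MemReader-4B-thinking",
    "messages": [
      {"role": "user", "content": "Please analyze the following conversation and decide what to store: ..."}
    ],
    "temperature": 0.2,
    "max_tokens": 1024
  }'
```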
SGLang Usage
MemReader-4B-thinking can also be deployed with SGLang through its OpenAI-compatible serving interface. Please make sure tool calling and thinking mode are enabled in your serving configuration.
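As a sketch, an SGLang launch might look like the following. The exact flag names, in particular the tool-call and reasoning parser options and their accepted values, vary across SGLang versions, so treat these as assumptions to verify against the documentation for your installed version:

```shell
python -m sglang.launch_server \
  --model-path IAAR-Shanghai/MemReader-4B-thinking \
  --port 8000 \
  --tool-call-parser qwen25 \
  --reasoning-parser qwen3
```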
Output Format
MemReader-4B-thinking is trained to produce thinking traces and tool calls. A typical response looks like this:
<think>
The conversation refers to an already known project and adds a new update:
Michael plans to produce a Python vs Rust benchmark report this week.
This is valuable project-state information and should be added to memory.
</think>
<tool_call>
{"name": "add_memory", "arguments": {"memory_list": [{"key": "Rust benchmark plan", "memory_type": "LongTermMemory", "value": "Michael said the recommendation system refactoring project is still in evaluation, and he plans to produce a Python-vs-Rust benchmark report this week for the core modules under consideration for Rust rewriting.", "tags": ["project", "Rust", "benchmark", "refactoring"]}], "summary": "Added one memory about the project update and the planned benchmark report."}}
</tool_call>
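When serving without a built-in parser, the raw completion can be split into the thinking trace and the tool call with a small amount of post-processing. A minimal sketch, assuming the tag format shown above:

```python
import json
import re

def parse_model_output(text):
    """Split a raw completion into (thinking_text, tool_call_dict)."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    call = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    thinking = think.group(1).strip() if think else ""
    tool_call = json.loads(call.group(1)) if call else None
    return thinking, tool_call

# Illustrative output string, not real model output.
sample = (
    "<think>New project update worth storing.</think>\n"
    "<tool_call>\n"
    '{"name": "ignore_memory", "arguments": {"reason": "test"}}\n'
    "</tool_call>"
)
thinking, call = parse_model_output(sample)
print(call["name"])  # ignore_memory
```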
Best Practices
- Use search_memory first when the conversation contains pronouns, ellipsis, or implicit historical references.
- Use buffer_memory only when the information is genuinely incomplete and cannot be resolved from history.
- Keep tool definitions stable between training and inference.
- For production pipelines, execute tool calls externally and feed tool responses back to the model when multi-step reasoning is needed.
- If you want shorter outputs, reduce max_tokens and control whether thinking traces are exposed in your serving layer.
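The external tool-execution loop for multi-step reasoning can be sketched as follows. Here call_model is a stub standing in for your actual inference call, and the message format is a simplified assumption; in practice search_memory results would come from your retrieval backend:

```python
import json

def call_model(messages):
    # Stub standing in for real inference: first turn searches for context,
    # the follow-up turn (after a tool result arrives) commits a memory.
    if not any(m["role"] == "tool" for m in messages):
        return {"name": "search_memory", "arguments": {"query": "project"}}
    return {"name": "add_memory",
            "arguments": {"memory_list": [], "summary": "stored update"}}

def run_turn(messages, execute_tool, max_steps=3):
    """Loop: model emits a tool call; search results are fed back as tool
    messages; any other operation ends the turn."""
    for _ in range(max_steps):
        call = call_model(messages)
        result = execute_tool(call["name"], call["arguments"])
        if call["name"] != "search_memory":
            return call, result  # add/buffer/ignore are terminal for the turn
        messages.append({"role": "tool", "content": json.dumps(result)})
    return call, result

final_call, result = run_turn(
    [{"role": "user", "content": "..."}],
    execute_tool=lambda name, args: [] if name == "search_memory" else "ok",
)
print(final_call["name"])  # add_memory
```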
Limitations
- The model is optimized for memory-management scenarios rather than general-purpose chatting.
- Quality depends on the external memory schema, retrieval quality, and tool-execution loop.
- For highly domain-specific memory schemas, additional instruction tuning may still be beneficial.
- As with other LLMs, outputs may still contain mistakes, omissions, or unsupported inferences and should be validated in safety-critical workflows.
License Notice
This model is released under the Apache-2.0 license. As it is derived from Qwen/Qwen3-4B, users should also review and comply with the upstream base model license, usage terms, and any applicable third-party requirements before deployment.
Links
- GitHub: MemTensor/MemOS
- API Documentation: docs.openmem.net
- Model: IAAR-Shanghai/MemReader-4B-thinking
- Base model: Qwen/Qwen3-4B