MemReader-4B-thinking

Introduction

MemReader-4B-thinking is a 4B-parameter language model for long-term agent memory management. Instead of treating memory writing as a one-step structured extraction task, it formulates memory construction as a reasoning-and-action process: the model first evaluates whether incoming information is valuable, complete, and unambiguous, and then selects one of four memory operations:

  • add_memory: write useful and complete information into long-term memory
  • search_memory: retrieve historical memory for disambiguation
  • buffer_memory: temporarily hold incomplete but potentially valuable information
  • ignore_memory: discard low-value or repetitive content

Built on top of Qwen/Qwen3-4B, MemReader-4B-thinking is further optimized for memory management with supervised fine-tuning and GRPO (Group Relative Policy Optimization). It is designed for long-horizon dialogue systems, personalized assistants, and agent frameworks that require low-noise, updatable, and retrievable long-term memory.
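To make the four operations concrete, here is an illustrative sketch (not part of the released model or its API) of dispatching each operation to a handler over a toy in-memory store; the store layout and handler names are assumptions for the example.

```python
# Toy dispatcher for the four memory operations emitted by MemReader.
# The store layout below is illustrative, not the model's actual backend.
long_term_memory = []   # persisted memory entries
buffer = []             # incomplete items awaiting more context

def handle_tool_call(name, arguments):
    if name == "add_memory":
        # write useful and complete information into long-term memory
        long_term_memory.extend(arguments["memory_list"])
        return f"stored {len(arguments['memory_list'])} memories"
    if name == "search_memory":
        # retrieve historical memory for disambiguation (naive substring match)
        q = arguments["query"].lower()
        return [m for m in long_term_memory if q in m["value"].lower()]
    if name == "buffer_memory":
        # hold incomplete but potentially valuable information
        buffer.append(arguments["reason"])
        return "buffered"
    if name == "ignore_memory":
        # discard low-value or repetitive content
        return "ignored"
    raise ValueError(f"unknown operation: {name}")
```

In a real deployment the handlers would be backed by a vector store or a memory framework rather than Python lists.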

News

  • MemReader-4B-thinking is released as an open model for active memory management.
  • The model is designed for tool-calling workflows and memory-centric agent systems.
  • It is part of the MemReader family introduced in the paper MemReader: Active Memory Management for Long-Term Agent Memory.

Usage

  • Model ID: IAAR-Shanghai/MemReader-4B-thinking
  • Base model: Qwen/Qwen3-4B
  • Primary use: long-term memory extraction and memory management for agents
  • Inference modes: transformers, OpenAI-compatible serving, vLLM, and SGLang

Citation

If you use MemReader in your research or product, please cite:

@misc{kang2025memreader,
  title={MemReader: Active Memory Management for Long-Term Agent Memory},
  author={Kang, Jingyi and Li, Chunyu and Chen, Ding and Tang, Bo and Xiong, Feiyu and Li, Zhiyu},
  year={2026},
  note={Manuscript in preparation}
}

Highlights

  • Active memory management instead of passive memory extraction
  • Explicit reasoning with thinking traces and tool calls
  • Strong performance on ambiguity resolution, knowledge update, and temporal reasoning
  • Native fit for OpenAI-style tool-calling workflows
  • Efficient local deployment with a 4B parameter footprint
  • Designed for integration with memory-centric agent systems such as MemOS

What Makes MemReader Different

Most memory pipelines directly convert the current dialogue into JSON memories. In realistic settings, that approach is often insufficient:

  • low-value chatter can pollute memory
  • pronouns and missing references may require historical lookup
  • some information is useful but not yet complete
  • newer facts may need to update or overwrite older memory

MemReader-4B-thinking reframes memory writing as active memory management. Under a ReAct-style workflow, the model reasons before acting, making memory construction closer to how practical agent systems maintain state over time.
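The reason-then-act workflow can be sketched as a small control loop: search results are fed back as observations, while the other three operations terminate the step. The `call_model` and `run_tool` callables below are placeholders you would supply (model inference and tool execution); they are not part of the released API.

```python
# Minimal ReAct-style control loop for one memory-writing step.
# `call_model` returns (tool_name, arguments); `run_tool` executes a tool.
# Both are caller-supplied stubs in this sketch.
def memory_step(conversation, call_model, run_tool, max_rounds=3):
    observations = []
    for _ in range(max_rounds):
        name, args = call_model(conversation, observations)
        if name == "search_memory":
            # retrieval results are fed back so the model can disambiguate
            observations.append(run_tool(name, args))
            continue
        # add_memory / buffer_memory / ignore_memory end the step
        return name, run_tool(name, args)
    # fail-safe: park the turn if disambiguation never converges
    return "buffer_memory", run_tool("buffer_memory", {"reason": "unresolved"})
```

The `max_rounds` cap is a defensive choice so a step cannot loop on retrieval indefinitely.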

Benchmark Performance

MemReader was evaluated on LOCOMO, LongMemEval, and HaluMem. The 4B GRPO version showed especially strong gains on knowledge update, temporal reasoning, and end-to-end memory usability.

LOCOMO

| Model | Single Hop | Multi Hop | Temporal | Open Domain | Overall | F1 | Avg. Token |
|---|---|---|---|---|---|---|---|
| MemOS (4o-mini) | 84.06% | 73.16% | 75.90% | 57.29% | 78.70% | 51.90% | 1854 |
| MemReader-0.6B | 84.70% | 76.95% | 76.22% | 53.40% | 79.56% | 52.54% | 1976 |
| MemReader-4B-SFT | 81.88% | 76.12% | 71.02% | 62.15% | 77.33% | 47.77% | 784 |
| MemReader-4B-GRPO | 85.37% | 81.44% | 75.80% | 65.62% | 81.42% | 49.45% | 1950 |

LongMemEval

| Model | Avg. Token | SS-User | SS-Asst | SS-Pref | Multi-Session | Knowledge Update | Temporal Reasoning | Overall |
|---|---|---|---|---|---|---|---|---|
| MemOS | 1400 | 95.71% | 67.86% | 96.67% | 70.67% | 74.26% | 77.44% | 77.80% |
| EverMemOS | 2800 | 97.14% | 85.71% | 93.33% | 73.68% | 89.74% | 77.44% | 83.00% |
| MemReader-0.6B | 1166 | 95.71% | 75.00% | 90.00% | 75.18% | 82.05% | 75.90% | 80.20% |
| MemReader-4B-SFT | 963 | 97.10% | 69.64% | 90.00% | 71.42% | 85.80% | 78.19% | 80.00% |
| MemReader-4B-GRPO | 922 | 94.29% | 73.21% | 90.00% | 73.68% | 91.03% | 84.21% | 83.00% |

HaluMem

The full HaluMem table in the paper is relatively long. Below we report a compact subset of the memory extraction and memory updating results.

| Model | Extraction Recall | Extraction Weighted Recall | Extraction F1 | Update Correctness | Update Hallucination | Update Omission |
|---|---|---|---|---|---|---|
| MemOS | 74.07% | 84.81% | 79.70% | 62.11% | 0.42% | 37.48% |
| MemReader-0.6B | 88.40% | 91.38% | 93.76% | 82.69% | 0.77% | 16.51% |
| MemReader-4B-SFT | 93.56% | 95.49% | 96.61% | 90.78% | 0.26% | 8.74% |
| MemReader-4B-GRPO | 96.57% | 97.19% | 98.21% | 94.55% | 0.32% | 5.12% |

These results show that stronger memory writing quality also translates into better memory updating behavior, especially on correctness and omission.

Recommended Use Cases

  • long-term conversational agents
  • personalized assistants
  • agent memory extraction pipelines
  • memory update and conflict resolution workflows
  • retrieval-augmented memory systems

Intended Use

MemReader-4B-thinking is intended for research and production scenarios where an agent needs to convert conversational context into structured long-term memory. Typical use cases include memory extraction, ambiguity resolution with retrieval, memory update pipelines, and persistent assistant systems.

The model is especially suitable when the application requires explicit control over memory-writing behavior through tool calls such as search_memory, add_memory, buffer_memory, and ignore_memory.

Model Specs

  • Base model: Qwen/Qwen3-4B
  • Parameters: 4B
  • Tensor type: BF16
  • Architecture: Qwen3ForCausalLM
  • Context length: 40,960 tokens
  • Primary capability: reasoning-driven memory extraction with tool calling

Quickstart

OpenAI-Compatible API Example

The following example calls the model through an OpenAI-compatible endpoint, with tool calling forced via "tool_choice": "required".

import json
import requests

url = "https://YOUR_ENDPOINT/v1/chat/completions"

payload = {
    "model": "IAAR-Shanghai/MemReader-4B-thinking",
    "extra_body": {
        "chat_template_kwargs": {
            "enable_thinking": True
        }
    },
    "messages": [
        {
            "role": "system",
            "content": (
                "You are a memory extraction agent. Your job is to analyze "
                "conversations and decide what information is worth storing in "
                "long-term memory.\n\n"
                "Available actions (call exactly one per turn):\n"
                "- search_memory: Search existing memories for context\n"
                "- add_memory: Extract and store valuable facts, preferences, or events\n"
                "- buffer_memory: Accumulate this turn and wait for more context\n"
                "- ignore_memory: Nothing worth storing\n\n"
                "Guidelines:\n"
                "- Store specific, verifiable facts\n"
                "- Do not store generic greetings, chitchat, or vague statements\n"
                "- UserMemory: personal attributes or preferences about the user\n"
                "- LongTermMemory: facts, events, or shared knowledge from the conversation\n"
                "- If unsure whether information already exists, call search_memory first"
            ),
        },
        {
            "role": "user",
            "content": (
                "Please analyze the following conversation and decide what to store:\n\n"
                "[user]: How is that project at the company going lately? The one he said he wanted to rewrite with a new language.\n"
                "[assistant]: Do you mean the recommendation system refactoring project? Last time we mentioned that Michael planned to rewrite some core modules in Rust, and it was still in the evaluation stage.\n"
                "[user]: Yes, that one. He said he is going to produce a performance comparison report this week, benchmarking Python against Rust."
            ),
        },
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "search_memory",
                "description": "Search historical memories for context.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"}
                    },
                    "required": ["query"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "add_memory",
                "description": "Extract and store memories.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "memory_list": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "key": {"type": "string"},
                                    "memory_type": {
                                        "type": "string",
                                        "enum": ["LongTermMemory", "UserMemory"],
                                    },
                                    "value": {"type": "string"},
                                    "tags": {
                                        "type": "array",
                                        "items": {"type": "string"},
                                    },
                                },
                                "required": ["key", "memory_type", "value", "tags"],
                            },
                        },
                        "summary": {"type": "string"},
                    },
                    "required": ["memory_list", "summary"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "buffer_memory",
                "description": "Buffer for later processing.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "reason": {"type": "string"}
                    },
                    "required": ["reason"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "ignore_memory",
                "description": "Ignore low-value content.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "reason": {"type": "string"}
                    },
                    "required": ["reason"],
                },
            },
        },
    ],
    "tool_choice": "required",
    "temperature": 0.2,
    "max_tokens": 1024,
}

headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

response = requests.post(url, headers=headers, json=payload)
print(response.text)
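A server that honors "tool_choice": "required" should return the chosen memory operation as a tool call. The helper below, assuming the standard OpenAI chat-completion response shape (exact fields may vary by serving stack), extracts the operation name and its parsed arguments.

```python
# Extract the selected memory operation from an OpenAI-style response dict.
# Assumes the standard chat-completion shape with a tool_calls array.
import json

def extract_operation(response_json):
    message = response_json["choices"][0]["message"]
    call = message["tool_calls"][0]["function"]
    # "arguments" is a JSON-encoded string in the OpenAI format
    return call["name"], json.loads(call["arguments"])
```

You would call this as `extract_operation(response.json())` on the request above.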

Hugging Face Transformers Usage

You can also load the model directly from Hugging Face and run memory extraction with tool calling.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "IAAR-Shanghai/MemReader-4B-thinking"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": (
            "You are a memory extraction agent. Analyze conversations and decide "
            "what information should be stored in long-term memory."
        ),
    },
    {
        "role": "user",
        "content": (
            "Please analyze the following conversation and decide what to store:\n\n"
            "[user]: How is that project at the company going lately? The one he said he wanted to rewrite with a new language.\n"
            "[assistant]: Do you mean the recommendation system refactoring project? Last time we mentioned that Michael planned to rewrite some core modules in Rust, and it was still in the evaluation stage.\n"
            "[user]: Yes, that one. He said he is going to produce a performance comparison report this week, benchmarking Python against Rust."
        ),
    },
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_memory",
            "description": "Search historical memories for context.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "add_memory",
            "description": "Extract and store memories.",
            "parameters": {
                "type": "object",
                "properties": {
                    "memory_list": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "key": {"type": "string"},
                                "memory_type": {
                                    "type": "string",
                                    "enum": ["LongTermMemory", "UserMemory"],
                                },
                                "value": {"type": "string"},
                                "tags": {
                                    "type": "array",
                                    "items": {"type": "string"},
                                },
                            },
                            "required": ["key", "memory_type", "value", "tags"],
                        },
                    },
                    "summary": {"type": "string"},
                },
                "required": ["memory_list", "summary"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "buffer_memory",
            "description": "Buffer for later processing.",
            "parameters": {
                "type": "object",
                "properties": {"reason": {"type": "string"}},
                "required": ["reason"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "ignore_memory",
            "description": "Ignore low-value content.",
            "parameters": {
                "type": "object",
                "properties": {"reason": {"type": "string"}},
                "required": ["reason"],
            },
        },
    },
]

text = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
output = tokenizer.decode(output_ids, skip_special_tokens=True)
print(output)

vLLM Usage

Start an OpenAI-compatible vLLM server:

python -m vllm.entrypoints.openai.api_server \
  --model IAAR-Shanghai/MemReader-4B-thinking \
  --served-model-name MemReader-4B-thinking \
  --port 8000 \
  --tensor-parallel-size 1 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

Then send a standard chat completion request to http://localhost:8000/v1/chat/completions.

SGLang Usage

MemReader-4B-thinking can also be deployed with SGLang through its OpenAI-compatible serving interface. Please make sure tool calling and thinking mode are enabled in your serving configuration.

Output Format

MemReader-4B-thinking is trained to produce thinking traces and tool calls. A typical response looks like this:

<think>
The conversation refers to an already known project and adds a new update:
Michael plans to produce a Python vs Rust benchmark report this week.
This is valuable project-state information and should be added to memory.
</think>

<tool_call>
{"name": "add_memory", "arguments": {"memory_list": [{"key": "Rust benchmark plan", "memory_type": "LongTermMemory", "value": "Michael said the recommendation system refactoring project is still in evaluation, and he plans to produce a Python-vs-Rust benchmark report this week for the core modules under consideration for Rust rewriting.", "tags": ["project", "Rust", "benchmark", "refactoring"]}], "summary": "Added one memory about the project update and the planned benchmark report."}}
</tool_call>
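When you decode raw generated text yourself (as in the transformers example above) rather than receiving parsed tool calls from a server, the two blocks can be separated with regular expressions. This sketch assumes the <think> and <tool_call> tags appear literally in the output, as in the example response.

```python
# Split a raw MemReader response into its thinking trace and tool call.
# Assumes the literal <think>...</think> and <tool_call>...</tool_call> tags.
import json
import re

def parse_response(text):
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    call = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    thinking = think.group(1).strip() if think else ""
    tool_call = json.loads(call.group(1)) if call else None
    return thinking, tool_call
```

In production, prefer a serving-layer tool-call parser (e.g. vLLM's) over hand-rolled regexes when one is available.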

Best Practices

  • Use search_memory first when the conversation contains pronouns, ellipsis, or implicit historical references.
  • Use buffer_memory only when the information is genuinely incomplete and cannot be resolved from history.
  • Keep tool definitions stable between training and inference.
  • For production pipelines, execute tool calls externally and feed tool responses back to the model when multi-step reasoning is needed.
  • If you want shorter outputs, reduce max_tokens and control whether thinking traces are exposed in your serving layer.
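For the multi-step case, executing a search_memory call externally and feeding the result back means appending an assistant tool-call message and a tool message before the next model turn. This is a sketch of that round, assuming OpenAI-style message fields; the tool_call_id value is whatever your serving layer assigned.

```python
# Append one tool round (assistant tool call + tool result) to a message
# list, following the OpenAI chat message format.
import json

def append_tool_round(messages, tool_call_id, name, arguments, result):
    messages.append({
        "role": "assistant",
        "tool_calls": [{
            "id": tool_call_id,
            "type": "function",
            "function": {"name": name, "arguments": json.dumps(arguments)},
        }],
    })
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(result),
    })
    return messages
```

After appending, resend the full message list so the model can resolve the ambiguity and emit its final operation.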

Limitations

  • The model is optimized for memory-management scenarios rather than general-purpose chatting.
  • Quality depends on the external memory schema, retrieval quality, and tool-execution loop.
  • For highly domain-specific memory schemas, additional instruction tuning may still be beneficial.
  • As with other LLMs, outputs may still contain mistakes, omissions, or unsupported inferences and should be validated in safety-critical workflows.

License Notice

This model is released under the Apache-2.0 license. As it is derived from Qwen/Qwen3-4B, users should also review and comply with the upstream base model license, usage terms, and any applicable third-party requirements before deployment.
