---
license: apache-2.0
language:
- zh
- en
base_model:
- Qwen/Qwen3-30B-A3B-Instruct-2507
---

# Text2MotionPrompter

Text2MotionPrompter is a large language model fine-tuned for text-to-motion prompt enhancement, rewriting, and motion duration prediction. Given a text description of a human action, Text2MotionPrompter will:

- reorganize the key motion information into a more readable structure;
- make implicit motion attributes explicit (e.g., subject, pose, tempo, temporal order, and spatial relations);
- improve logical consistency and reduce ambiguity or conflicting constraints;
- predict a plausible motion duration for the described action.

# Quickstart

We advise you to use the latest version of `transformers`. With `transformers<4.51.0`, you will encounter the following error:

```
KeyError: 'qwen3_moe'
```

The following code snippet illustrates how to use the model to generate content from a given input.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Text2MotionPrompter/Text2MotionPrompter"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
template = """
# Role
You are an expert in 3D motion analysis, animation timing, and choreography. Your task is to analyze textual action descriptions to estimate execution time and standardize the language for motion generation systems.

# Task
Analyze the user-provided [Input Action] and generate a structured JSON response containing a duration estimate and a refined caption.

# Instructions
### 1. Duration Estimation (frame_count)
- Analyze the complexity, speed, and physical constraints of the described action.
- Estimate the time required to perform the action in a **smooth, natural, and realistic manner**.
- Calculate the total duration in frames based on a **30 fps** (frames per second) standard.
- Output strictly as an Integer.

### 2. Caption Refinement (short_caption)
- Generate a refined, grammatically correct version of the input description in **English**.
- **Strict Constraints**:
  - You must **PRESERVE** the original sequence of events (chronological order).
  - You must **RETAIN** all original spatial modifiers (e.g., "left," "upward," "quickly").
  - **DO NOT** add new sub-actions or hallucinate details not present in the input.
  - **DO NOT** delete any specific movements.
- The goal is to improve clarity and flow while maintaining 100% semantic fidelity to the original request.

### 3. Output Format
- Return **ONLY** a raw JSON object.
- Do not use Markdown formatting (do not wrap the JSON in code fences).
- Ensure the JSON is valid and parsable.

# JSON Structure
{{
    "duration": ,
    "short_caption": ""
}}

# Input
{}
"""

messages = [
    {"role": "user", "content": template.format("走路")}  # "走路" means "walk"
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
```

For deployment, you can use `vllm>=0.8.5` to create an OpenAI-compatible API endpoint:

```bash
vllm serve Text2MotionPrompter/Text2MotionPrompter --max-model-len 8192
```
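Since the prompt instructs the model to return a raw JSON object with a `duration` integer and a `short_caption` string, the decoded output can be parsed and validated directly. A minimal sketch of such a helper (the sample response below is illustrative, not an actual model output):

```python
import json

def parse_motion_response(content: str) -> tuple[int, str]:
    """Parse the model's raw JSON response and check it matches the expected structure."""
    data = json.loads(content)
    duration = data["duration"]
    caption = data["short_caption"]
    if not isinstance(duration, int) or duration <= 0:
        raise ValueError(f"duration must be a positive integer, got {duration!r}")
    if not isinstance(caption, str) or not caption:
        raise ValueError("short_caption must be a non-empty string")
    return duration, caption

# illustrative response shape; a real response comes from the model's `content` output
example = '{"duration": 90, "short_caption": "A person walks forward."}'
duration, caption = parse_motion_response(example)
print(duration, caption)
```

At 30 fps, a `duration` of 90 corresponds to a 3-second motion clip.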