---
license: apache-2.0
language:
- zh
- en
base_model:
- Qwen/Qwen3-30B-A3B-Instruct-2507
---

# Text2MotionPrompter

Text2MotionPrompter is a large language model fine-tuned for text-to-motion prompt enhancement, rewriting, and motion duration prediction. Given a text description of a human action, Text2MotionPrompter will:

- reorganize the key motion information into a more readable structure;
- make implicit motion attributes explicit (e.g., subject, pose, tempo, temporal order, and spatial relations);
- improve logical consistency and reduce ambiguity or conflicting constraints;
- predict a plausible motion duration for the described action.

# Quickstart

We advise you to use the latest version of `transformers`. With `transformers<4.51.0`, you will encounter the following error:

```
KeyError: 'qwen3_moe'
```

The following code snippet illustrates how to use the model to generate content from a given input.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Text2MotionPrompter/Text2MotionPrompter"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
template = """
# Role
You are an expert in 3D motion analysis, animation timing, and choreography. Your task is to analyze textual action descriptions to estimate execution time and standardize the language for motion generation systems.

# Task
Analyze the user-provided [Input Action] and generate a structured JSON response containing a duration estimate and a refined caption.

# Instructions
### 1. Duration Estimation (frame_count)
- Analyze the complexity, speed, and physical constraints of the described action.
- Estimate the time required to perform the action in a **smooth, natural, and realistic manner**.
- Calculate the total duration in frames based on a **30 fps** (frames per second) standard.
- Output strictly as an Integer.

### 2. Caption Refinement (short_caption)
- Generate a refined, grammatically correct version of the input description in **English**.
- **Strict Constraints**:
  - You must **PRESERVE** the original sequence of events (chronological order).
  - You must **RETAIN** all original spatial modifiers (e.g., "left," "upward," "quickly").
  - **DO NOT** add new sub-actions or hallucinate details not present in the input.
  - **DO NOT** delete any specific movements.
- The goal is to improve clarity and flow while maintaining 100% semantic fidelity to the original request.

### 3. Output Format
- Return **ONLY** a raw JSON object.
- Do not use Markdown formatting (do not wrap the JSON in code fences).
- Ensure the JSON is valid and parsable.

# JSON Structure
{{
    "duration": ,
    "short_caption": ""
}}

# Input
{}
"""

messages = [
    {"role": "user", "content": template.format("走路")}  # "走路" means "walk"
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
```

For deployment, you can use `vllm>=0.8.5` to create an OpenAI-compatible API endpoint:

```bash
vllm serve Text2MotionPrompter/Text2MotionPrompter --max-model-len 8192
```
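Since the prompt instructs the model to return a raw JSON object with a `duration` integer and a `short_caption` string, the decoded output can be parsed and validated directly. A minimal sketch of such a helper (the sample response below is illustrative, not an actual model output):

```python
import json

def parse_motion_response(content: str) -> tuple[int, str]:
    """Parse the model's raw JSON response and check it matches the expected structure."""
    data = json.loads(content)
    duration = data["duration"]
    caption = data["short_caption"]
    if not isinstance(duration, int) or duration <= 0:
        raise ValueError(f"duration must be a positive integer, got {duration!r}")
    if not isinstance(caption, str) or not caption:
        raise ValueError("short_caption must be a non-empty string")
    return duration, caption

# illustrative response shape; a real response comes from the model's `content` output
example = '{"duration": 90, "short_caption": "A person walks forward."}'
duration, caption = parse_motion_response(example)
print(duration, caption)
```

At 30 fps, a `duration` of 90 corresponds to a 3-second motion clip.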