MapTrace Task Illustration

Qwen3.5-0.8B — MapTrace Path Planning (Partial Fine-Tune)

This model is a Partial Fine-Tune of Qwen/Qwen3.5-0.8B specifically trained for visual path planning on indoor and outdoor maps. Given an image of a traversable map with marked start (green) and end (red) locations, the model predicts a sequence of normalized (x, y) waypoints that connect the two locations while staying on traversable paths.


🗺️ Task Description

MapTrace is a visual navigation benchmark that requires a model to:

  1. Understand a map image with a marked start location (green dot) and end location (red dot).
  2. Trace a realistic, traversable path between the two points.
  3. Output the path as a list of normalized (x, y) coordinates: [(x1, y1), (x2, y2), ...]

Example Input

You are provided an image of a path with a start location denoted in green 
and an end location denoted in red. 
The normalized xy-coordinates of the start location are (0.77, 0.47) and 
of the end location (0.54, 0.54). 
Output a list of normalized coordinates in the form of a list [(x1,y1), (x2,y2)...] 
of the path between the start and end location. 
Ensure that the path follows the traversable locations of the map.

Example Output

[(0.7695, 0.4727), (0.7344, 0.5156), (0.5898, 0.5234), (0.543, 0.543)]
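Because the response is a plain Python-style list literal, downstream code can parse it directly. A minimal sketch (the helper names and the pixel-mapping step are illustrative, not part of the model's API):

```python
import ast

def parse_waypoints(response: str) -> list[tuple[float, float]]:
    """Parse the model's '[(x1, y1), (x2, y2), ...]' response into float tuples."""
    points = ast.literal_eval(response.strip())
    return [(float(x), float(y)) for x, y in points]

def to_pixels(points, width, height):
    """Map normalized [0, 1] coordinates onto a width x height image."""
    return [(round(x * width), round(y * height)) for x, y in points]

path = parse_waypoints("[(0.7695, 0.4727), (0.7344, 0.5156), (0.5898, 0.5234), (0.543, 0.543)]")
print(to_pixels(path, 1024, 768))  # pixel waypoints for a 1024 x 768 map
```

`ast.literal_eval` is safer than `eval` here since the model output is untrusted text; malformed responses raise `ValueError`/`SyntaxError`, which callers can catch and retry.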

🏋️ Training Strategy

The model was trained using a Partial Fine-Tuning approach to balance quality and computational efficiency:

Setting                  Value
Base Model               Qwen/Qwen3.5-0.8B
Strategy                 Partial Fine-Tuning (last 6 of 24 layers)
Dataset                  google/MapTrace (maptrace_20k subset)
Vision Encoder           ✅ Trained (not frozen)
Effective Batch Size     128 (1 per GPU × 32 grad_accum × 4 GPUs)
Learning Rate            1e-5 with warmup (100 steps)
Max Sequence Length      2048 tokens
Mixed Precision          bf16
Gradient Checkpointing   ✅ Enabled
Image Resolution Cap     1024 × 768 pixels (VRAM safety)
Hardware                 4× NVIDIA RTX 4090 24GB

Why Partial Fine-Tuning?

Instead of full fine-tuning or LoRA, we opted to unfreeze the last 6 transformer layers and the vision encoder for gradient updates. This approach:

  • Preserves the model's general language understanding from the foundation weights.
  • Directly adapts the "reasoning" layers to the spatial coordinate prediction task.
  • Avoids LoRA's low-rank approximation, since the selected weights are updated directly.
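In practice, partial fine-tuning is just selective requires_grad: freeze everything, then re-enable the vision tower and the last 6 decoder layers. A minimal sketch of the name-matching logic (the "visual." and "model.layers.{i}" prefixes follow Qwen-style parameter naming and are assumptions about this checkpoint):

```python
NUM_LAYERS = 24
TRAIN_LAST = 6
UNFROZEN_LAYERS = set(range(NUM_LAYERS - TRAIN_LAST, NUM_LAYERS))  # layers 18..23

def is_trainable(param_name: str) -> bool:
    """Decide whether a named parameter should receive gradient updates."""
    if param_name.startswith("visual."):   # vision encoder: trained, not frozen
        return True
    parts = param_name.split(".")
    if "layers" in parts:                  # e.g. "model.layers.23.mlp.up_proj.weight"
        layer_idx = int(parts[parts.index("layers") + 1])
        return layer_idx in UNFROZEN_LAYERS
    return False                           # embeddings, lm_head, etc. stay frozen

# Applied to a real model:
# for name, param in model.named_parameters():
#     param.requires_grad = is_trainable(name)
```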

Training Highlights

  • Resumed from checkpoint-50 → checkpoint-100 → final across multiple sessions.
  • VRAM stability was achieved by setting max_pixels=1024x768 as a safety cap, preventing OOM spikes from high-resolution map images.
  • Loss masking was applied so the model only learns from the assistant's coordinate response, not the user's prompt or image tokens.
  • Final Token Accuracy: ~84.8%, Final Loss: ~0.37 (see the training metrics below).
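The loss masking described above amounts to setting labels to -100 (the default ignore index of PyTorch's cross-entropy loss) for every prompt and image token, so only the assistant's coordinate response is supervised. A minimal sketch over token-id lists:

```python
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss by default

def build_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Mask the user prompt (and image placeholder tokens) so only the
    assistant's coordinate response contributes to the loss."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# Example: 5 prompt/image tokens followed by 3 response tokens
labels = build_labels([101, 102, 103, 104, 105, 7, 8, 9], prompt_len=5)
print(labels)  # -> [-100, -100, -100, -100, -100, 7, 8, 9]
```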

🚀 Inference

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "TurkishCodeMan/Qwen3.5-0.8B-MapTrace-PartialFT"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

def predict_path(image_path: str, start_xy: tuple, end_xy: tuple) -> str:
    image = Image.open(image_path).convert("RGB")
    
    prompt_text = (
        f"You are provided an image of a path with a start location denoted in green "
        f"and an end location denoted in red. \n"
        f"The normalized xy-coordinates of the start location are ({start_xy[0]}, {start_xy[1]}) "
        f"and of the end location ({end_xy[0]}, {end_xy[1]}). \n"
        f"Output a list of normalized coordinates in the form of a list [(x1,y1), (x2,y2)...] "
        f"of the path between the start and end location. \n"
        f"Ensure that the path follows the traversable locations of the map."
    )

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt_text},
            ],
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,  # training data contained no <think> tokens
    )

    inputs = processor(
        text=[text],
        images=[image],
        return_tensors="pt",
        padding=True,
        min_pixels=256 * 28 * 28,
        max_pixels=1024 * 768,
    ).to(model.device)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,  # greedy decoding; temperature is ignored when sampling is off
        )
    
    output = processor.decode(
        generated_ids[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True
    )
    return output.strip()


# Example usage
result = predict_path(
    image_path="map.png",
    start_xy=(0.7695, 0.4727),
    end_xy=(0.543, 0.543),
)
print(result)
# Expected: [(0.7695, 0.4727), (0.7344, 0.5156), (0.5898, 0.5234), (0.543, 0.543)]

Important: the default Qwen3.5 chat template injects <think> tags, but the training data contained no <think> tokens, so thinking-mode prompts degrade output. Disable thinking when building the prompt (for example, via enable_thinking=False in apply_chat_template, if your processor's template supports that flag), or fall back to the raw prompt format shown in the task description above.


📊 Training Metrics

Epoch   Step   Loss     Token Accuracy   Learning Rate
0.50    71     0.5044   70.35%           6.9e-06
...
0.71    100    0.4614   81.57%           9.9e-06
0.85    120    0.4294   82.71%           9.73e-06
1.00    141    0.4147   83.29%           9.39e-06
1.42    200    0.3841   84.32%           4.31e-06
1.50    211    0.3741   84.50%           3.47e-06
1.63    224    0.3669   84.82%           1.95e-06

🗂️ Dataset

The model was trained on google/MapTrace (the maptrace_20k subset), a dataset containing 20,000 map images paired with path annotations from start to end locations.


⚠️ Limitations

  • Map-Specific: The model is trained exclusively for top-down map images; it is not a general path planner.
  • Coordinate Precision: Predictions are in normalized [0, 1] space and may have small pixel-level deviations.
  • Resolution Cap: Images larger than 1024×768 will be downscaled to this resolution during inference to match training conditions.
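The downscaling in the last point can be sketched as a pixel-count cap: Qwen-style processors bound the total number of pixels (here 1024 × 768 = 786,432) rather than each dimension separately, rescaling while preserving aspect ratio. A simplified sketch of that behaviour (an assumption about the exact resize rule; actual processors also snap dimensions to patch multiples):

```python
import math

MAX_PIXELS = 1024 * 768  # 786,432-pixel cap used during training and inference

def capped_size(width: int, height: int) -> tuple[int, int]:
    """Downscale (width, height) so the total pixel count stays within
    MAX_PIXELS, preserving aspect ratio. Smaller images pass through unchanged."""
    if width * height <= MAX_PIXELS:
        return width, height
    scale = math.sqrt(MAX_PIXELS / (width * height))
    return max(1, int(width * scale)), max(1, int(height * scale))

print(capped_size(2048, 1536))  # a 4x-too-large image is halved to (1024, 768)
```

Note that because the model works in normalized [0, 1] coordinates, this downscaling changes only the visual input, not the coordinate space of the predicted path.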

📜 License

This model is released under the Apache 2.0 license, consistent with the base model.
