# Qwen3.5-0.8B — MapTrace Path Planning (Partial Fine-Tune)
This model is a Partial Fine-Tune of Qwen/Qwen3.5-0.8B specifically trained for visual path planning on indoor and outdoor maps. Given an image of a traversable map with marked start (green) and end (red) locations, the model predicts a sequence of normalized (x, y) waypoints that connect the two locations while staying on traversable paths.
## 🗺️ Task Description
MapTrace is a visual navigation benchmark that requires a model to:
- Understand a map image with a marked start location (green dot) and end location (red dot).
- Trace a realistic, traversable path between the two points.
- Output the path as a list of normalized (x, y) coordinates: `[(x1, y1), (x2, y2), ...]`
### Example Input

```text
You are provided an image of a path with a start location denoted in green
and an end location denoted in red.
The normalized xy-coordinates of the start location are (0.77, 0.47) and
of the end location (0.54, 0.54).
Output a list of normalized coordinates in the form of a list [(x1,y1), (x2,y2)...]
of the path between the start and end location.
Ensure that the path follows the traversable locations of the map.
```

### Example Output

```text
[(0.7695, 0.4727), (0.7344, 0.5156), (0.5898, 0.5234), (0.543, 0.543)]
```
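The model emits the path as plain text, so downstream code has to parse it back into coordinate tuples. A minimal sketch of such a parser (illustrative only, not part of this repository) using `ast.literal_eval`, with a range check for the normalized [0, 1] space:

```python
import ast

def parse_path(output: str) -> list[tuple[float, float]]:
    """Parse model output like "[(0.77, 0.47), (0.54, 0.54)]" into
    a list of (x, y) tuples, validating the normalized range."""
    path = ast.literal_eval(output.strip())
    waypoints = [(float(x), float(y)) for x, y in path]
    if not all(0.0 <= v <= 1.0 for pt in waypoints for v in pt):
        raise ValueError("coordinates must be normalized to [0, 1]")
    return waypoints
```

For the example output above, `parse_path` returns four tuples, starting with `(0.7695, 0.4727)`.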
## 🏋️ Training Strategy
The model was trained using a Partial Fine-Tuning approach to balance quality and computational efficiency:
| Setting | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-0.8B |
| Strategy | Partial Fine-Tuning (Last 6 of 24 layers) |
| Dataset | google/MapTrace (maptrace_20k subset) |
| Vision Encoder | ✅ Trained (not frozen) |
| Effective Batch Size | 128 (1 per GPU × 32 grad_accum × 4 GPUs) |
| Learning Rate | 1e-5 with Warmup (100 steps) |
| Max Sequence Length | 2048 tokens |
| Mixed Precision | bf16 |
| Gradient Checkpointing | ✅ Enabled |
| Image Resolution Cap | 1024 × 768 pixels (VRAM safety) |
| Hardware | 4× NVIDIA RTX 4090 24GB |
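The table above maps roughly onto the following Hugging Face `TrainingArguments`. This is a sketch of the configuration implied by the table, not the actual training script, and argument values in the real run may differ:

```python
from transformers import TrainingArguments

# Sketch of the run configuration implied by the table above.
training_args = TrainingArguments(
    output_dir="qwen3.5-0.8b-maptrace-partialft",
    per_device_train_batch_size=1,    # × 32 grad accum × 4 GPUs = 128 effective
    gradient_accumulation_steps=32,
    learning_rate=1e-5,
    warmup_steps=100,
    bf16=True,
    gradient_checkpointing=True,
)
```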
### Why Partial Fine-Tuning?
Instead of full fine-tuning or LoRA, we opted to unfreeze the last 6 transformer layers and the vision encoder for gradient updates. This approach:
- Preserves the model's general language understanding from the foundation weights.
- Directly adapts the "reasoning" layers to the spatial coordinate prediction task.
- Avoids LoRA's approximation error — weights are modified directly.
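In practice, partial fine-tuning boils down to toggling `requires_grad` per parameter. A minimal sketch of the selection rule is below; note that the `visual.` and `model.layers.` name prefixes are assumptions based on common Qwen-VL parameter naming, not verified against this checkpoint:

```python
def is_trainable(param_name: str, total_layers: int = 24, unfrozen: int = 6) -> bool:
    """Decide whether a named parameter receives gradient updates:
    train the vision encoder and the last `unfrozen` transformer layers."""
    if param_name.startswith("visual."):
        return True  # vision encoder is trained, not frozen
    if param_name.startswith("model.layers."):
        layer_idx = int(param_name.split(".")[2])
        return layer_idx >= total_layers - unfrozen  # layers 18-23 of 24
    return False  # embeddings, lm_head, and earlier layers stay frozen

# Applied to a loaded model:
# for name, param in model.named_parameters():
#     param.requires_grad = is_trainable(name)
```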
### Training Highlights
- Resumed from `checkpoint-50` → `checkpoint-100` → `final` across multiple sessions.
- VRAM stability was achieved by setting `max_pixels=1024x768` as a safety cap, preventing OOM spikes from high-resolution map images.
- Loss masking was applied so the model only learns from the assistant's coordinate response, not the user's prompt or image tokens.
- Final Token Accuracy: ~84.5%, Final Loss: ~0.37.
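Loss masking of this kind is typically implemented by setting label ids to `-100` (the ignore index of PyTorch's cross-entropy loss) for every prompt and image token. A minimal sketch, assuming the prompt length is known per example:

```python
IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss skips targets with this value

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Build labels so only the assistant's coordinate response
    (everything after the prompt) contributes to the loss."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])
```

For example, a 5-token sequence whose first 3 tokens are prompt yields `mask_prompt_labels([11, 12, 13, 201, 202], 3)` → `[-100, -100, -100, 201, 202]`.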
## 🚀 Inference

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "TurkishCodeMan/Qwen3.5-0.8B-MapTrace-PartialFT"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()


def predict_path(image_path: str, start_xy: tuple, end_xy: tuple) -> str:
    image = Image.open(image_path).convert("RGB")
    prompt_text = (
        f"You are provided an image of a path with a start location denoted in green "
        f"and an end location denoted in red. \n"
        f"The normalized xy-coordinates of the start location are ({start_xy[0]}, {start_xy[1]}) "
        f"and of the end location ({end_xy[0]}, {end_xy[1]}). \n"
        f"Output a list of normalized coordinates in the form of a list [(x1,y1), (x2,y2)...] "
        f"of the path between the start and end location. \n"
        f"Ensure that the path follows the traversable locations of the map."
    )
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt_text},
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = processor(
        text=[text],
        images=[image],
        return_tensors="pt",
        padding=True,
        min_pixels=256 * 28 * 28,
        max_pixels=1024 * 768,  # same cap as training, prevents OOM
    ).to(model.device)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,  # greedy decoding; coordinate output should be deterministic
        )

    # Decode only the newly generated tokens, skipping the prompt
    output = processor.decode(
        generated_ids[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True,
    )
    return output.strip()


# Example usage
result = predict_path(
    image_path="map.png",
    start_xy=(0.7695, 0.4727),
    end_xy=(0.543, 0.543),
)
print(result)
# Expected: [(0.7695, 0.4727), (0.7344, 0.5156), (0.5898, 0.5234), (0.543, 0.543)]
```
**Important:** The training data did not include `<think>` tokens. If your version of the Qwen3.5 chat template injects thinking tags by default, output quality will degrade; in that case, skip `apply_chat_template` and build the raw prompt shown in the Example Input manually.
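To overlay a predicted path on the original map, the normalized waypoints have to be scaled back to pixel coordinates. A small helper for that (an illustrative sketch, not part of this repository):

```python
def to_pixels(path: list[tuple[float, float]], width: int, height: int) -> list[tuple[int, int]]:
    """Convert normalized (x, y) waypoints to integer pixel coordinates
    for an image of the given width and height."""
    return [(round(x * width), round(y * height)) for x, y in path]
```

For a 1024×768 map, `to_pixels([(0.5, 0.5)], 1024, 768)` gives `[(512, 384)]`; the result can be drawn onto the map with, e.g., PIL's `ImageDraw.line`.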
## 📊 Training Metrics
| Epoch | Step | Loss | Token Accuracy | Learning Rate |
|---|---|---|---|---|
| 0.50 | 71 | 0.5044 | 70.35% | 6.9e-06 |
| ... | ... | ... | ... | ... |
| 0.71 | 100 | 0.4614 | 81.57% | 9.9e-06 |
| 0.85 | 120 | 0.4294 | 82.71% | 9.73e-06 |
| 1.00 | 141 | 0.4147 | 83.29% | 9.39e-06 |
| 1.42 | 200 | 0.3841 | 84.32% | 4.31e-06 |
| 1.50 | 211 | 0.3741 | 84.50% | 3.47e-06 |
| 1.63 | 224 | 0.3669 | 84.82% | 1.95e-06 |
## 🗂️ Dataset
The model was trained on google/MapTrace (the maptrace_20k subset), a dataset containing 20,000 map images paired with path annotations from start to end locations.
## ⚠️ Limitations
- Map-Specific: The model is trained exclusively for top-down map images; it is not a general path planner.
- Coordinate Precision: Predictions are in normalized [0, 1] space and may have small pixel-level deviations.
- Resolution Cap: Images larger than `1024×768` will be downscaled to this resolution during inference to match training conditions.
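This cap can be reproduced ahead of time to predict the resolution the model will actually see. A sketch of an area-based, aspect-preserving cap (the exact resize logic inside the processor may differ, e.g. rounding to patch-size multiples):

```python
import math

def cap_resolution(width: int, height: int, max_pixels: int = 1024 * 768) -> tuple[int, int]:
    """Downscale (width, height) so the total area stays under max_pixels,
    preserving the aspect ratio; images already under the cap are untouched."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    return int(width * scale), int(height * scale)
```

For example, a 2048×1536 map is exactly 4× the cap area, so it is halved per side: `cap_resolution(2048, 1536)` → `(1024, 768)`.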
## 📜 License
This model is released under the Apache 2.0 license, consistent with the base model.