JTPA2.5-VL-3B-Phone-Agent-VI

A lightweight Vietnamese-optimized vision-language model for Android phone automation. This model understands Vietnamese UI interactions and generates structured action sequences for mobile device control.

Model Information

Fine-tuning Method: LoRA (Low-Rank Adaptation)
Framework: LLaMA-Factory
Language: Vietnamese (vi) & English (en)
License: Apache 2.0

Capabilities

This model excels at:

📱 Phone GUI Understanding: Analyzes Android screenshots and identifies UI elements
🗣️ Vietnamese Instructions: Processes natural language commands in Vietnamese
⚡ Action Generation: Produces structured actions (tap, type, swipe, back, home, wait)
🔄 Multi-step Workflows: Handles complex sequential tasks across multiple screens

Example Usage

Input: Screenshot + Vietnamese instruction
Output: Structured action in JSON format

{
  "action": "tap",
  "x": 540,
  "y": 1200,
  "reason": "Click on the search button"
}
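Because the model returns its action as text, a caller will usually want to parse and validate it before acting on it. The following is a minimal sketch, not part of the model's own tooling: the helper name `parse_action` is hypothetical, the action names come from the Supported Actions row of this card, and only the `tap` fields (`x`, `y`) are checked because that is the only schema shown here.

```python
import json

# Action names listed under "Supported Actions" in this card.
SUPPORTED_ACTIONS = {"tap", "type", "swipe", "back", "home", "wait"}


def parse_action(model_output: str) -> dict:
    """Parse the JSON object between the outermost braces of a model reply.

    Hypothetical helper: the model card does not ship a parser.
    """
    start = model_output.find("{")
    end = model_output.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")

    action = json.loads(model_output[start:end + 1])

    name = action.get("action")
    if name not in SUPPORTED_ACTIONS:
        raise ValueError(f"unsupported action: {name!r}")

    if name == "tap":
        # Only the tap schema (x, y) is documented in this card.
        for key in ("x", "y"):
            if not isinstance(action.get(key), int):
                raise ValueError(f"tap requires an integer {key!r}")
    return action


reply = '{ "action": "tap", "x": 540, "y": 1200, "reason": "Click on the search button" }'
print(parse_action(reply))
```

Rejecting unknown actions early keeps malformed replies from reaching the device.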

Quick Start

Deploy with vLLM

python -m vllm.entrypoints.openai.api_server --model johnnietien/JTPA2.5-VL-3B-phone-agent-vi --trust-remote-code --port 8000

Query the Model

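import base64

import requests

# vLLM serves the model under the name passed to --model unless
# --served-model-name is set, so the full repo ID is used here.
MODEL = "johnnietien/JTPA2.5-VL-3B-phone-agent-vi"


def encode_image(path):
    """Return the file at `path` as a base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


payload = {
    "model": MODEL,
    "messages": [{
        "role": "user",
        "content": [
            # Chat Completions content parts use the types "text" and "image_url".
            {"type": "text", "text": "Mở ứng dụng MoMo"},  # "Open the MoMo app"
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encode_image('screenshot.png')}"}},
        ],
    }],
    "max_tokens": 256,
}

response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
response.raise_for_status()
# "choices" is a list; the reply is in its first element.
print(response.json()["choices"][0]["message"]["content"])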
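Once an action has been returned, it still has to be executed on the device. One common route is `adb shell input`. The sketch below is an assumption about how a caller might wire this up, not the author's integration: `action_to_adb` and `run_action` are hypothetical names, and only `tap`, `back`, and `home` are mapped, because this card does not document the fields used by `type`, `swipe`, or `wait`.

```python
import subprocess

# Android key event names for the two navigation actions.
KEYEVENTS = {"back": "KEYCODE_BACK", "home": "KEYCODE_HOME"}


def action_to_adb(action: dict) -> list:
    """Translate a model action into an adb argument list (hypothetical helper)."""
    name = action["action"]
    if name == "tap":
        return ["adb", "shell", "input", "tap", str(action["x"]), str(action["y"])]
    if name in KEYEVENTS:
        return ["adb", "shell", "input", "keyevent", KEYEVENTS[name]]
    # type, swipe, and wait are left out: their JSON fields are not
    # documented in this card, so any mapping would be a guess.
    raise NotImplementedError(f"no adb mapping defined for action {name!r}")


def run_action(action: dict) -> None:
    """Execute the action on the connected device; requires adb on PATH."""
    subprocess.run(action_to_adb(action), check=True)


# Build the command only; call run_action() to actually send it.
print(action_to_adb({"action": "tap", "x": 540, "y": 1200}))
```

Keeping command construction separate from execution makes the mapping easy to test without a device attached.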

Integration

Well suited to:

Open-phone agent framework
Mobile application testing & QA automation
Vietnamese smartphone accessibility tools
RPA platforms for mobile app workflows

Performance Specifications

Aspect	Details
Model Size	3B parameters
Context Length	4K tokens
Inference Speed	~0.5-1.0 sec/action (T4 GPU)
VRAM Requirement	6-8 GB (BF16), 4-6 GB (quantized)
Supported Actions	tap, type, swipe, back, home, wait
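The lower end of the BF16 VRAM figure is roughly what a back-of-envelope weight-size estimate gives: parameter count times bytes per parameter. A minimal sketch, assuming the nominal 3B parameter count from the table and ignoring activations, the KV cache, and framework overhead, all of which add to the real requirement (the helper name is hypothetical):

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NOMINAL_PARAMS = 3e9  # "3B parameters" from the table above

# BF16 stores each parameter in 2 bytes.
print(f"BF16 weights:  {weight_memory_gb(NOMINAL_PARAMS, 2):.1f} GB")
# 8-bit quantization stores each parameter in 1 byte (one possible scheme;
# the card does not say which quantization its figure refers to).
print(f"8-bit weights: {weight_memory_gb(NOMINAL_PARAMS, 1):.1f} GB")
```

The gap between these weight-only numbers and the ranges in the table is the headroom needed at inference time, which grows with image resolution and context length.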

Use Responsibly

✅ Intended for:

Personal device automation
Mobile app testing
Accessibility solutions
Research purposes

❌ Not intended for:

Unauthorized account access
Bypassing security mechanisms
Unauthorized data collection

Citation

@misc{jtpa2024phone,
  author = {Johnnie Tien},
  title = {JTPA2.5-VL-3B-Phone-Agent-VI: Vietnamese Mobile GUI Agent},
  year = {2024},
  howpublished = {\url{https://huggingface.co/johnnietien/JTPA2.5-VL-3B-phone-agent-vi}}
}

Version: 1.0 | Status: Production-Ready | Last Updated: December 2024
