JTPA2.5-VL-3B-Phone-Agent-VI

A lightweight Vietnamese-optimized vision-language model for Android phone automation. This model understands Vietnamese UI interactions and generates structured action sequences for mobile device control.

Model Information

  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Framework: LLaMA-Factory
  • Language: Vietnamese (vi) & English (en)
  • License: Apache 2.0

Capabilities

This model excels at:

  • 📱 Phone GUI Understanding: Analyzes Android screenshots and identifies UI elements
  • 🗣️ Vietnamese Instructions: Processes natural language commands in Vietnamese
  • Action Generation: Produces structured actions (tap, type, swipe, back, home, wait)
  • 🔄 Multi-step Workflows: Handles complex sequential tasks across multiple screens

Example Usage

Input: Screenshot + Vietnamese instruction
Output: Structured action in JSON format

{ "action": "tap", "x": 540, "y": 1200, "reason": "Click on the search button" }
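Before acting on a reply, it can help to confirm it really contains a well-formed action. The sketch below is one way to do that, not part of the model's own tooling. It assumes the reply may wrap the JSON object in extra text, and that only the `tap` fields (`x`, `y`) are known from the example above. `parse_action` and `ALLOWED_ACTIONS` are hypothetical names.

```python
import json

# Action names taken from the model card; the set itself is a hypothetical helper.
ALLOWED_ACTIONS = {"tap", "type", "swipe", "back", "home", "wait"}


def parse_action(raw: str) -> dict:
    """Extract and validate the first JSON object found in a model reply."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one JSON value and ignores any trailing text.
    action, _ = json.JSONDecoder().raw_decode(raw[start:])
    name = action.get("action")
    if name not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {name!r}")
    if name == "tap":
        # Only the tap schema is documented, so only tap is checked in detail.
        for key in ("x", "y"):
            if not isinstance(action.get(key), (int, float)):
                raise ValueError(f"tap action needs a numeric {key!r}")
    return action


reply = '{ "action": "tap", "x": 540, "y": 1200, "reason": "Click on the search button" }'
print(parse_action(reply)["action"])
```

Rejecting unknown actions early keeps a malformed reply from reaching the device.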

Quick Start

Deploy with vLLM

python -m vllm.entrypoints.openai.api_server --model johnnietien/JTPA2.5-VL-3B-phone-agent-vi --trust-remote-code --port 8000

Query the Model

import base64
import requests

def encode_image(path):
    # Read the screenshot and return it as a base64 string.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

payload = {
    # vLLM serves the model under the name passed to --model.
    "model": "johnnietien/JTPA2.5-VL-3B-phone-agent-vi",
    "messages": [{
        "role": "user",
        "content": [
            # Vietnamese instruction: "Open the MoMo app"
            {"type": "text", "text": "Mở ứng dụng MoMo"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encode_image('screenshot.png')}"}},
        ],
    }],
    "max_tokens": 256,
}

response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
response.raise_for_status()
# "choices" is a list; the reply is in its first element.
print(response.json()["choices"][0]["message"]["content"])
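Each request returns a single action, so a multi-step task needs a loop that captures a fresh screenshot after every step. The sketch below shows one possible shape for that loop. All names (`run_task`, `capture`, `query`, `execute`, `is_finished`) are hypothetical. The model card does not say how the model signals completion, so the stop condition is left to the caller.

```python
from typing import Callable, List


def run_task(
    instruction: str,
    capture: Callable[[], bytes],           # returns the current screenshot
    query: Callable[[str, bytes], dict],    # sends instruction + screenshot, returns one action
    execute: Callable[[dict], None],        # performs the action on the device
    is_finished: Callable[[dict], bool],    # caller-defined stop condition (assumption)
    max_steps: int = 10,
) -> List[dict]:
    """Run one instruction as a bounded sequence of single-action steps."""
    history: List[dict] = []
    for _ in range(max_steps):
        action = query(instruction, capture())
        history.append(action)
        if is_finished(action):
            break  # stop before executing the action that signals completion
        execute(action)
    return history
```

The `max_steps` bound is a safeguard against a run that never reaches the stop condition.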

Integration

Perfect for:

  • Open-phone agent framework
  • Mobile application testing & QA automation
  • Vietnamese smartphone accessibility tools
  • RPA platforms for mobile app workflows

Performance Specifications

| Aspect | Details |
|---|---|
| Model Size | 3B parameters |
| Context Length | 4K tokens |
| Inference Speed | ~0.5-1.0 sec/action (T4 GPU) |
| VRAM Requirement | 6-8 GB (BF16), 4-6 GB (quantized) |
| Supported Actions | tap, type, swipe, back, home, wait |
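One way to execute the supported actions is to translate each into an `adb shell input` command. The sketch below only builds the argument list and does not run it. Only the `tap` fields (`x`, `y`) appear in the example output above. The `text`, `x1`, `y1`, `x2`, `y2` field names and the `to_adb_command` helper are assumptions made for illustration.

```python
from typing import List


def to_adb_command(action: dict) -> List[str]:
    """Translate one action dict into an adb argument list (hypothetical helper)."""
    kind = action["action"]
    base = ["adb", "shell", "input"]
    if kind == "tap":
        return base + ["tap", str(action["x"]), str(action["y"])]
    if kind == "type":
        # Assumed field name "text"; adb's `input text` expects spaces as %s.
        return base + ["text", action["text"].replace(" ", "%s")]
    if kind == "swipe":
        # Assumed field names for the start and end points.
        coords = [action["x1"], action["y1"], action["x2"], action["y2"]]
        return base + ["swipe"] + [str(c) for c in coords]
    if kind == "back":
        return base + ["keyevent", "KEYCODE_BACK"]
    if kind == "home":
        return base + ["keyevent", "KEYCODE_HOME"]
    if kind == "wait":
        return []  # no adb equivalent; the caller can simply sleep
    raise ValueError(f"unsupported action: {kind!r}")


print(to_adb_command({"action": "tap", "x": 540, "y": 1200}))
```

The returned list can be passed to `subprocess.run` once a device is connected.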

Use Responsibly

Intended for:

  • Personal device automation
  • Mobile app testing
  • Accessibility solutions
  • Research purposes

Not intended for:

  • Unauthorized account access
  • Bypassing security mechanisms
  • Unauthorized data collection

Citation

@misc{jtpa2024phone,
  author = {Johnnie Tien},
  title = {JTPA2.5-VL-3B-Phone-Agent-VI: Vietnamese Mobile GUI Agent},
  year = {2024},
  howpublished = {\url{https://huggingface.co/johnnietien/JTPA2.5-VL-3B-phone-agent-vi}}
}


Version: 1.0 | Status: Production-Ready | Last Updated: December 2024
