---
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-VL-8B-Thinking
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- image-text-to-text
- transformers
- qwen3-vl
---

# ReCAP-8B

ReCAP-8B is a vision-language model fine-tuned from  
[Qwen/Qwen3-VL-8B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-8B-Thinking), designed to enable **robust CAPTCHA solving within native GUI agents** while preserving general GUI interaction capabilities.

This model is introduced in *“CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training”*.

---

## 🚀 Overview

ReCAP-8B extends a general-purpose GUI agent with **CAPTCHA-solving ability** by learning from structured **reasoning-action trajectories**.

It operates end-to-end:
- Input: raw screenshots  
- Output: reasoning + executable GUI actions (click, type, drag)

---

## ✨ Key Features

- **Unified agent**: Handles both CAPTCHA and general GUI tasks  
- **Reasoning-action modeling**: Learns both decisions and execution  
- **Self-correction**: Improves robustness by learning from failures  
- **Efficient interaction**: Generates multiple actions per step  

---

## 🧠 Capabilities

Supports diverse CAPTCHA types:
- Text / OCR
- Icon selection & matching
- Image grid reasoning
- Slider / drag tasks
- Multi-step interaction challenges

Core skills:
- Visual understanding  
- Spatial reasoning  
- Continuous control  
- Multi-step planning  

---

## 📊 Performance

- ~71.9% success rate on synthetic CAPTCHA benchmark  
- Strong improvements on interaction-heavy tasks (e.g., slider, image grid)  
- Maintains competitive performance on general GUI benchmarks  

---

## 🔒 Ethical Considerations

This model is released for **research purposes only**.  
It is intended to study and improve the robustness of human-verification systems, not to bypass them.