--- license: apache-2.0 library_name: transformers base_model: Qwen/Qwen3-VL-8B-Thinking pipeline_tag: image-text-to-text tags: - vision-language-model - image-text-to-text - transformers - qwen3-vl --- # ReCAP-8B ReCAP-8B is a vision-language model fine-tuned from [Qwen/Qwen3-VL-8B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-8B-Thinking), designed to enable **robust CAPTCHA solving within native GUI agents** while preserving general GUI interaction capabilities. This model is introduced in *“CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training”*. --- ## 🚀 Overview ReCAP-8B extends a general-purpose GUI agent with **CAPTCHA-solving ability** by learning from structured **reasoning-action trajectories**. It operates end-to-end: - Input: raw screenshots - Output: reasoning + executable GUI actions (click, type, drag) --- ## ✨ Key Features - **Unified agent**: Handles both CAPTCHA and general GUI tasks - **Reasoning-action modeling**: Learns both decisions and execution - **Self-correction**: Improves robustness by learning from failures - **Efficient interaction**: Generates multiple actions per step --- ## 🧠 Capabilities Supports diverse CAPTCHA types: - Text / OCR - Icon selection & matching - Image grid reasoning - Slider / drag tasks - Multi-step interaction challenges Core skills: - Visual understanding - Spatial reasoning - Continuous control - Multi-step planning --- ## 📊 Performance - ~71.9% success rate on synthetic CAPTCHA benchmark - Strong improvements on interaction-heavy tasks (e.g., slider, image grid) - Maintains competitive performance on general GUI benchmarks --- ## 🔒 Ethical Considerations This model is released for **research purposes only**. It is intended to study and improve the robustness of human-verification systems, not to bypass them.