Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

Domain Capability Enhancement through Continuous Pre-training | 3B to 70B Parameter Scale | Document Understanding & OCR Enhancement | Chain-of-Thought Reasoning Support

This repository contains models presented in the paper Qianfan-OCR: A Unified End-to-End Model for Document Intelligence.

Model Description

Qianfan-VL is a series of general-purpose multimodal large language models enhanced for enterprise-level multimodal applications. The models offer deep optimization for high-frequency scenarios in industrial deployment while maintaining strong general capabilities.

Model Variants

| Model | Parameters | Context Length | CoT Support | Best For |
|---|---|---|---|---|
| Qianfan-VL-3B | 3B | 32k | ✗ | Edge deployment, real-time OCR |
| Qianfan-VL-8B | 8B | 32k | ✓ | Server-side general scenarios, fine-tuning |
| Qianfan-VL-70B | 70B | 32k | ✓ | Complex reasoning, data synthesis |

Architecture

  • Language Model:
    • Qianfan-VL-3B: Based on Qwen2.5-3B
    • Qianfan-VL-8B/70B: Based on Llama 3.1 architecture
    • Enhanced with a 3T-token multilingual corpus
  • Vision Encoder: InternViT-based, supports dynamic patching up to 4K resolution
  • Cross-modal Fusion: MLP adapter for efficient vision-language bridging (see the sketch below)
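
The adapter can be pictured as a small projector that maps vision-encoder patch features into the LLM's embedding space. The following is a minimal illustrative sketch only; the dimensions (1024 → 4096), layer count, and activation are assumptions, not Qianfan-VL's published configuration.

import torch
import torch.nn as nn

# Illustrative MLP adapter: projects vision-encoder patch features into the
# LLM embedding space. All dimensions here are assumed for the sketch.
class MLPAdapter(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vision_dim),
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

adapter = MLPAdapter()
visual_tokens = adapter(torch.randn(1, 256, 1024))  # 256 visual tokens for the LLM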

Key Capabilities

🔍 OCR & Document Understanding

  • Full-Scenario OCR: Handwriting, formulas, natural scenes, cards/documents
  • Document Intelligence: Layout analysis, table parsing, chart understanding, document Q&A
  • High Precision: Industry-leading performance on OCR benchmarks

🧮 Chain-of-Thought Reasoning (8B & 70B)

  • Complex chart analysis and reasoning
  • Mathematical problem-solving with step-by-step derivation
  • Visual reasoning and logical inference
  • Statistical computation and trend prediction (a usage sketch follows this list)
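
As a hedged illustration, a step-by-step reasoning request can be issued through the same model.chat interface shown in Quick Start below. The image path and prompt here are hypothetical, and the snippet assumes model, tokenizer, and load_image from the Quick Start section have already been set up.

# Hypothetical example: eliciting chain-of-thought chart reasoning (8B/70B).
# Assumes `model`, `tokenizer`, and `load_image` from the Quick Start section.
pixel_values = load_image("./example/chart.png").to(torch.bfloat16).cuda()
question = ("<image>Which segment grew fastest year over year? "
            "Reason step by step before stating the final answer.")
response = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question=question,
    generation_config={"max_new_tokens": 1024}
)
print(response)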

📊 Benchmark Performance

General Vision-Language Benchmarks

| Benchmark | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|---|
| A-Bench_VAL | 75.65 | 75.72 | 78.1 | 75.86 | 75.86 | 76.49 | 79.22 |
| CCBench | 66.86 | 70.39 | 80.98 | 77.84 | 70.78 | 57.65 | 73.73 |
| SEEDBench_IMG | 76.55 | 78.02 | 79.13 | 77.0 | 77.52 | 76.98 | 78.34 |
| SEEDBench2_Plus | 67.59 | 70.97 | 73.17 | 69.52 | 68.47 | 70.93 | 73.25 |
| MMVet | 48.17 | 53.21 | 67.34 | 80.28 | 78.9 | 70.64 | 75.69 |
| MMMU_VAL | 46.44 | 47.11 | 58.33 | 56.11 | 60.78 | 51.0 | 65.78 |
| ScienceQA_TEST | 95.19 | 97.62 | 98.76 | 97.97 | 97.17 | 85.47 | 92.51 |
| ScienceQA_VAL | 93.85 | 97.62 | 98.81 | 97.81 | 95.14 | 83.59 | 91.32 |
| MMT-Bench_VAL | 62.23 | 63.22 | 71.06 | 65.17 | 63.67 | 61.4 | 69.49 |
| MTVQA_TEST | 26.5 | 30.14 | 32.18 | 30.3 | 27.62 | 29.08 | 31.48 |
| BLINK | 49.97 | 56.81 | 59.44 | 55.87 | 51.87 | 54.55 | 63.02 |
| MMStar | 57.93 | 64.07 | 69.47 | 68.4 | 66.07 | 61.53 | 66.0 |
| RealWorldQA | 65.75 | 70.59 | 71.63 | 71.11 | 74.25 | 69.28 | 73.86 |
| Q-Bench1_VAL | 73.51 | 75.25 | 77.46 | 75.99 | 77.99 | 78.1 | 79.93 |
| POPE | 85.08 | 86.06 | 88.97 | 90.59 | 88.87 | 85.97 | 83.35 |
| RefCOCO (Avg) | 85.94 | 89.37 | 91.01 | 89.65 | 91.40 | 86.56 | 90.25 |

OCR & Document Understanding

| Benchmark | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-3B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|---|---|
| OCRBench | 831 | 854 | 873 | 881 | 847 | 810 | 883 | 874 |
| AI2D_TEST | 81.38 | 85.07 | 87.23 | 85.07 | 83.55 | 77.07 | 80.47 | 83.84 |
| OCRVQA_TEST | 66.15 | 68.98 | 74.06 | 39.03 | 35.58 | 69.24 | 71.02 | 66.8 |
| TextVQA_VAL | 80.11 | 82.13 | 84.48 | 82.15 | 83.52 | 79.09 | 84.96 | 83.26 |
| DocVQA_VAL | 90.85 | 93.54 | 94.75 | 92.04 | 83.82 | 92.71 | 94.91 | 95.75 |
| ChartQA_TEST | 81.79 | 87.72 | 89.6 | 85.76 | 82.04 | 83.4 | 86.68 | 87.16 |

Quick Start

Installation

pip install transformers accelerate torch torchvision pillow einops 

Using Transformers

import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
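    # Choose the tile grid (cols, rows) whose aspect ratio is closest to the
    # image's; on ties, prefer the larger grid when the source image has
    # enough pixels to fill it (the area rule below).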
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate tile grids (cols, rows) with tile count in [min_num, max_num]
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
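
# Worked example: a 3840x2160 (4K) image has aspect ratio ~1.78. With
# max_num=12, the tie between grids (2, 1) and (4, 2) (both ratio 2.0) is
# broken by the area rule in favor of (4, 2), so the image is resized to
# 1792x896 and split into 8 tiles of 448x448 (plus one 448x448 thumbnail
# when use_thumbnail=True).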

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(tile) for tile in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# Load model
MODEL_PATH = "baidu/Qianfan-VL-8B"  # or Qianfan-VL-3B, Qianfan-VL-70B
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Load and process image
pixel_values = load_image("./example/scene_ocr.png").to(torch.bfloat16).cuda()  # move tiles to the GPU

# Inference
prompt = "<image>请识别图中所有文字"  # "Recognize all text in the image"
with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 512},
        verbose=False
    )
print(response)
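
For follow-up questions over the same document, InternVL-style models typically let chat accept a history argument and return the updated dialogue state via return_history=True. The sketch below assumes Qianfan-VL follows those semantics and may need adjustment for its exact signature.

# Multi-turn follow-up (a sketch assuming InternVL-style chat semantics).
question = "<image>Describe the overall layout of this document."
response, history = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question=question,
    generation_config={"max_new_tokens": 512},
    return_history=True
)
question = "Now extract any tables as Markdown."
response, history = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question=question,
    generation_config={"max_new_tokens": 512},
    history=history
)
print(response)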

Training Details

Four-Stage Progressive Training

  1. Cross-modal Alignment (100B tokens): Establishes vision-language connections
  2. General Knowledge Injection (3.5T tokens): Builds strong foundational capabilities
  3. Domain Enhancement (300B tokens): Specialized OCR and reasoning capabilities
  4. Post-training (1B tokens): Instruction following and preference alignment

Infrastructure

  • Trained on 5,000+ Baidu Kunlun chips
  • Single-task parallel training scaled across 5,000 chips
  • 90%+ scaling efficiency in large-scale distributed training
  • Communication-computation fusion to overlap data transfer with compute

Citation

If you use Qianfan-VL or Qianfan-OCR in your research, please cite:

@misc{dong2026qianfanocr,
  title={Qianfan-OCR: A Unified End-to-End Model for Document Intelligence},
  author={Daxiang Dong and Mingming Zheng and Dong Xu and Chunhua Luo and Bairong Zhuang and Yuxuan Li and Ruoyun He and Haoran Wang and Wenyu Zhang and Wenbo Wang and Yicheng Wang and Xue Xiong and Ayong Zheng and Xiaoying Zuo and Ziwei Ou and Jingnan Gu and Quanhao Guo and Jianmin Wu and Dawei Yin and Dou Shen},
  year={2026},
  eprint={2603.13398},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.13398}
}

@misc{qianfan-vl-2025,
  title={Qianfan-VL: Domain-Enhanced Universal Vision-Language Models},
  author={Qianfan Team},
  year={2025},
  publisher={Baidu}
}

Contact

For more information and API access, visit: Baidu AI Cloud Qianfan Platform
