Transformers documentation
GLM-OCR
This model was released on {release_date} and added to Hugging Face Transformers on 2026-01-27.
GLM-OCR
Overview
GLM-OCR is a multimodal OCR (Optical Character Recognition) model from Z.ai designed for complex document understanding. The model combines a CogViT visual encoder (pre-trained on large-scale image-text data), a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder.
Key features of GLM-OCR include:
- Lightweight: Only 0.9B parameters while achieving state-of-the-art performance (94.62 on OmniDocBench V1.5)
- Multi-task: Excels at text recognition, formula recognition, table recognition, and information extraction
- Multi-modal: Processes document images for text, formula, and table extraction
This model was contributed by the zai-org team. The original code can be found here.
Usage example
Single image inference
from transformers import AutoProcessor, GlmOcrForConditionalGeneration
import torch
model_id = "zai-org/GLM-OCR"
processor = AutoProcessor.from_pretrained(model_id)
model = GlmOcrForConditionalGeneration.from_pretrained(
model_id,
dtype=torch.bfloat16,
device_map="auto",
)
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"},
{"type": "text", "text": "Text Recognition:"},
],
}
]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
Batch inference
The model supports batching multiple images for efficient processing.
from transformers import AutoProcessor, GlmOcrForConditionalGeneration
import torch
model_id = "zai-org/GLM-OCR"
processor = AutoProcessor.from_pretrained(model_id)
model = GlmOcrForConditionalGeneration.from_pretrained(
model_id,
dtype=torch.bfloat16,
device_map="auto",
)
# First document
message1 = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"},
{"type": "text", "text": "Text Recognition:"},
],
}
]
# Second document
message2 = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
{"type": "text", "text": "Text Recognition:"},
],
}
]
messages = [message1, message2]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
padding=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True))
Flash Attention 2
GLM-OCR supports Flash Attention 2 for faster inference. First, install the latest version of Flash Attention:
pip install -U flash-attn --no-build-isolation
Then load the model with one of the supported attention kernels from kernels-community:
from transformers import GlmOcrForConditionalGeneration
import torch
model = GlmOcrForConditionalGeneration.from_pretrained(
"zai-org/GLM-OCR",
dtype=torch.bfloat16,
attn_implementation="kernels-community/flash-attn2", # other options: kernels-community/vllm-flash-attn3, kernels-community/paged-attention
device_map="auto",
)
GlmOcrConfig
class transformers.GlmOcrConfig
< source >( text_config = None vision_config = None image_token_id = 59280 video_token_id = 59281 image_start_token_id = 59256 image_end_token_id = 59257 video_start_token_id = 59258 video_end_token_id = 59259 tie_word_embeddings = False **kwargs )
Parameters
- text_config (Union[PreTrainedConfig, dict], optional, defaults to GlmOcrTextConfig) — The config object or dictionary of the text backbone.
- vision_config (Union[PreTrainedConfig, dict], optional, defaults to GlmOcrVisionConfig) — The config object or dictionary of the vision backbone.
- image_token_id (int, optional, defaults to 59280) — The image token index to encode the image prompt.
- video_token_id (int, optional, defaults to 59281) — The video token index to encode the video prompt.
- image_start_token_id (int, optional, defaults to 59256) — The image start token index to encode the start of an image.
- image_end_token_id (int, optional, defaults to 59257) — The image end token index to encode the end of an image.
- video_start_token_id (int, optional, defaults to 59258) — The video start token index to encode the start of a video.
- video_end_token_id (int, optional, defaults to 59259) — The video end token index to encode the end of a video.
- tie_word_embeddings (bool, optional, defaults to False) — Whether the model’s input and output word embeddings should be tied.
This is the configuration class to store the configuration of a GlmOcrModel. It is used to instantiate a GLM-OCR model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of GLM-OCR zai-org/GLM-OCR.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
>>> from transformers import GlmOcrForConditionalGeneration, GlmOcrConfig
>>> # Initializing a GLM-OCR style configuration
>>> configuration = GlmOcrConfig()
>>> # Initializing a model from the GLM-OCR style configuration
>>> model = GlmOcrForConditionalGeneration(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
GlmOcrVisionConfig
class transformers.GlmOcrVisionConfig
< source >( depth = 24 hidden_size = 1024 hidden_act = 'silu' attention_bias = True attention_dropout = 0.0 num_heads = 16 in_channels = 3 image_size = 336 patch_size = 14 rms_norm_eps = 1e-05 spatial_merge_size = 2 temporal_patch_size = 2 out_hidden_size = 1536 intermediate_size = 4096 initializer_range = 0.02 **kwargs )
Parameters
- depth (int, optional, defaults to 24) — Number of layers (depth) in the model.
- hidden_size (int, optional, defaults to 1024) — Dimensionality of the encoder layers and the pooler layer.
- hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "silu", "relu", "selu" and "gelu_new" are supported.
- attention_bias (bool, optional, defaults to True) — Whether to add a bias to the queries, keys and values.
- attention_dropout (float, optional, defaults to 0.0) — Dropout probability for attention weights.
- num_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer architecture.
- in_channels (int, optional, defaults to 3) — Number of input channels.
- image_size (int or list[int], optional, defaults to 336) — The size (resolution) of each image.
- patch_size (int, optional, defaults to 14) — The size (resolution) of each patch.
- rms_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the RMS normalization layers.
- spatial_merge_size (int, optional, defaults to 2) — The size used for merging spatial dimensions.
- temporal_patch_size (int, optional, defaults to 2) — The size used for patches along the temporal dimension.
- out_hidden_size (int, optional, defaults to 1536) — The output hidden size of the vision model.
- intermediate_size (int, optional, defaults to 4096) — Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
This is the configuration class to store the configuration of a GlmOcrVisionModel. It is used to instantiate a GLM-OCR vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of GLM-OCR zai-org/GLM-OCR.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
GlmOcrTextConfig
class transformers.GlmOcrTextConfig
< source >( vocab_size: int | None = 59392 hidden_size: int | None = 1024 intermediate_size: int | None = 4096 num_hidden_layers: int | None = 16 num_attention_heads: int | None = 16 num_key_value_heads: int | None = 8 hidden_act: str | None = 'silu' max_position_embeddings: int | None = 131072 initializer_range: float | None = 0.02 rms_norm_eps: int | None = 1e-05 use_cache: bool | None = True attention_dropout: float | None = 0.0 rope_parameters: transformers.modeling_rope_utils.RopeParameters | dict[str, transformers.modeling_rope_utils.RopeParameters] | None = None pad_token_id: int | None = None **kwargs )
Parameters
- vocab_size (int, optional, defaults to 59392) — Vocabulary size of the GlmOcr model. Defines the number of different tokens that can be represented by the input_ids passed when calling GlmOcrModel.
- hidden_size (int, optional, defaults to 1024) — Dimension of the hidden representations.
- intermediate_size (int, optional, defaults to 4096) — Dimension of the MLP representations.
- num_hidden_layers (int, optional, defaults to 16) — Number of hidden layers in the Transformer decoder.
- num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer decoder.
- num_key_value_heads (int, optional, defaults to 8) — This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper.
- hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
- max_position_embeddings (int, optional, defaults to 131072) — The maximum sequence length that this model might ever be used with.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- rms_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the RMS normalization layers.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- rope_parameters (RopeParameters, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and optionally parameters used for scaling in case you want to use RoPE with a longer max_position_embeddings.
- pad_token_id (int, optional) — The id of the padding token.
This is the configuration class to store the configuration of a GlmOcrTextModel. It is used to instantiate a GLM-OCR text decoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of GLM-OCR zai-org/GLM-OCR.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
>>> from transformers import GlmOcrTextModel, GlmOcrConfig
>>> # Initializing a GLM-OCR style configuration
>>> configuration = GlmOcrConfig()
>>> # Initializing a model from the GLM-OCR style configuration
>>> model = GlmOcrTextModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
GlmOcrVisionModel
forward
< source >( hidden_states: Tensor grid_thw: Tensor **kwargs ) → torch.Tensor
The GlmOcrVisionModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
GlmOcrTextModel
class transformers.GlmOcrTextModel
< source >( config: GlmOcrTextConfig )
Parameters
- config (GlmOcrTextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Glm Ocr Text Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None use_cache: bool | None = None cache_position: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
Returns
transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GlmOcrTextConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The GlmOcrTextModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
GlmOcrModel
class transformers.GlmOcrModel
< source >( config )
Parameters
- config (GlmOcrConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Glm Ocr Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None pixel_values: torch.Tensor | None = None pixel_values_videos: torch.FloatTensor | None = None image_grid_thw: torch.LongTensor | None = None video_grid_thw: torch.LongTensor | None = None rope_deltas: torch.LongTensor | None = None mm_token_type_ids: torch.IntTensor | None = None cache_position: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.models.glm_ocr.modeling_glm_ocr.GlmOcrModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using the model’s image processor; see its __call__ method for details.
- pixel_values_videos (torch.FloatTensor of shape (batch_size, num_frames, num_channels, frame_size, frame_size), optional) — The tensors corresponding to the input videos. Pixel values for videos can be obtained using the model’s video processor; see its __call__ method for details.
- image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) — The temporal, height and width of feature shape of each image in LLM.
- video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) — The temporal, height and width of feature shape of each video in LLM.
- rope_deltas (torch.LongTensor of shape (batch_size, ), optional) — The rope index difference between sequence length and multimodal rope.
- mm_token_type_ids (torch.IntTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens matching each modality, e.g. text (0), image (1), video (2). Multimodal token type ids can be obtained using AutoProcessor. See ProcessorMixin.__call__() for details.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
Returns
transformers.models.glm_ocr.modeling_glm_ocr.GlmOcrModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.models.glm_ocr.modeling_glm_ocr.GlmOcrModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GlmOcrConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional, defaults to None) — Sequence of hidden-states at the output of the last layer of the model.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- rope_deltas (torch.LongTensor of shape (batch_size, ), optional) — The rope index difference between sequence length and multimodal rope.
The GlmOcrModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
get_image_features
< source >( pixel_values: FloatTensor image_grid_thw: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images.
- image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) — The temporal, height and width of feature shape of each image in LLM.
Returns
transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (GlmOcrConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
get_placeholder_mask
< source >( input_ids: LongTensor inputs_embeds: FloatTensor image_features: torch.FloatTensor | None = None video_features: torch.FloatTensor | None = None )
Obtains multimodal placeholder mask from input_ids or inputs_embeds, and checks that the placeholder token count is
equal to the length of multimodal features. If the lengths are different, an error is raised.
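Conceptually, the check marks every image-placeholder position in input_ids and compares the count against the number of image feature vectors. A minimal pure-Python sketch of that idea (the placeholder id is the GlmOcrConfig default from above; the surrounding ids and the feature count are invented for illustration, and the real method operates on tensors and also handles video tokens and inputs_embeds):

```python
# Token ids: 59256 = image start, 59280 = image placeholder, 59257 = image end
# (GlmOcrConfig defaults); the other ids stand in for ordinary text tokens.
IMAGE_TOKEN_ID = 59280

input_ids = [101, 59256, 59280, 59280, 59280, 59257, 2025]

# The placeholder mask flags every image-placeholder position.
placeholder_mask = [tok == IMAGE_TOKEN_ID for tok in input_ids]

# The consistency check: one placeholder per image feature vector, else raise.
num_image_features = 3  # e.g. the number of vectors returned by the vision tower
if sum(placeholder_mask) != num_image_features:
    raise ValueError("Image features and image tokens do not match")

print(placeholder_mask)
```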
get_rope_index
< source >( input_ids: LongTensor mm_token_type_ids: IntTensor image_grid_thw: torch.LongTensor | None = None video_grid_thw: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None **kwargs )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
- mm_token_type_ids (torch.IntTensor of shape (batch_size, sequence_length)) — Token type ids matching each modality to a different value in the input sequence, i.e. text (0), image (1), video (2).
- image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) — The temporal, height and width of feature shape of each image in LLM.
- video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) — The temporal, height and width of feature shape of each video in LLM.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
Calculate the 3D RoPE index based on image and video sizes. The utility expects a vision + text sequence and will error out otherwise. For a pure text sequence, rely on the model’s auto-inferred position ids. In a mixed vision + text sequence, vision tokens use 3D RoPE (temporal, height, width) while text tokens use standard 1D RoPE.
Example: with 3 temporal patches, 2 height patches and 2 width patches, each vision input yields temporal × height × width = 3 × 2 × 2 = 12 positions in total.
Temporal position IDs are spaced by:
interval = tokens_per_second * temporal_patch_size / fps
If fps = 1, tokens_per_second = 25 and temporal_patch_size = 2, temporal IDs increase by 50 for each temporal patch:
[0, 0, 0, 0, 50, 50, 50, 50, 100, 100, 100, 100]
Height IDs repeat per row: [0, 0, 1, 1, ...]
Width IDs alternate per column: [0, 1, 0, 1, ...]
Text tokens follow standard 1D RoPE, and their position IDs grow with a step of 1.
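The worked example above can be reproduced in a few lines of plain Python (the grid sizes and timing constants are those of the example, not values read from a real checkpoint):

```python
# Vision grid from the example: 3 temporal x 2 height x 2 width patches.
T, H, W = 3, 2, 2

# Temporal spacing: interval = tokens_per_second * temporal_patch_size / fps
tokens_per_second, temporal_patch_size, fps = 25, 2, 1
interval = int(tokens_per_second * temporal_patch_size / fps)  # 50

# Each temporal index repeats for every spatial position in its frame.
temporal_ids = [t * interval for t in range(T) for _ in range(H * W)]
# Height ids repeat per row; width ids alternate per column, within each frame.
height_ids = [h for _ in range(T) for h in range(H) for _ in range(W)]
width_ids = [w for _ in range(T) for _ in range(H) for w in range(W)]

print(temporal_ids)  # [0, 0, 0, 0, 50, 50, 50, 50, 100, 100, 100, 100]
```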
get_video_features
< source >( pixel_values_videos: FloatTensor video_grid_thw: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
- pixel_values_videos (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input videos.
- video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) — The temporal, height and width of feature shape of each video in LLM.
Returns
transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (GlmOcrConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
get_vision_position_ids
< source >( start_position: int grid_thw: list[int, int, int] | torch.Tensor temp_merge_size: int = 1 spatial_merge_size: int = 1 time_interval: int = 1 device: str | torch.device | None = None ) → torch.LongTensor of shape (3, sequence_length)
Parameters
- start_position (int) — Offset added to all computed positional indices.
- grid_thw (Sequence[int] or torch.Tensor of shape (3,)) — The (T, H, W) grid representing the feature layout of the current image or video after patch embedding.
- temp_merge_size (int, optional, defaults to 1) — Factor by which the temporal dimension is reduced in the backbone. The temporal grid size is divided by this value.
- spatial_merge_size (int, optional, defaults to 1) — Factor by which the spatial dimensions (H and W) are reduced in the backbone. Both H and W are divided by this value.
- time_interval (int, optional, defaults to 1) — Spacing factor applied between consecutive temporal position indices.
- device (str or torch.device, optional) — Device on which the resulting tensor is allocated. If None, uses the current default device.
Returns
torch.LongTensor of shape (3, sequence_length)
Positional indices for temporal, height, and width dimensions,
flattened into sequence form and offset by start_position.
Compute 3D positional indices for vision tokens derived from a single image or video input.
The positions are generated from the input grid defined by temporal (T), height (H), and
width (W) dimensions. Temporal and spatial dimensions can be downscaled according to the
merge sizes used in the vision backbone. The resulting positions are offset by start_position.
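A minimal sketch of this computation (not the library source; the helper name and structure here are illustrative) shows how the grid is downscaled by the merge sizes, spaced by the time interval, and offset:

```python
import torch

# Hypothetical sketch of the documented behavior of get_vision_position_ids:
# compute (3, sequence_length) positional indices for a (T, H, W) vision grid.
def vision_position_ids_sketch(start_position, grid_thw,
                               temp_merge_size=1, spatial_merge_size=1,
                               time_interval=1):
    t, h, w = grid_thw
    # Downscale the grid by the merge factors used in the vision backbone.
    t, h, w = t // temp_merge_size, h // spatial_merge_size, w // spatial_merge_size
    t_ids = (torch.arange(t) * time_interval).repeat_interleave(h * w)
    h_ids = torch.arange(h).repeat_interleave(w).repeat(t)
    w_ids = torch.arange(w).repeat(h).repeat(t)
    # Stack temporal/height/width streams and offset by start_position.
    return torch.stack([t_ids, h_ids, w_ids]) + start_position

# A 4x4 single-frame image merged 2x spatially yields 4 tokens, offset by 10.
pos = vision_position_ids_sketch(10, (1, 4, 4), spatial_merge_size=2)
print(pos.tolist())  # [[10, 10, 10, 10], [10, 10, 11, 11], [10, 11, 10, 11]]
```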
- forward
GlmOcrForConditionalGeneration
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None labels: torch.LongTensor | None = None pixel_values: torch.Tensor | None = None pixel_values_videos: torch.FloatTensor | None = None image_grid_thw: torch.LongTensor | None = None video_grid_thw: torch.LongTensor | None = None mm_token_type_ids: torch.IntTensor | None = None cache_position: torch.LongTensor | None = None logits_to_keep: int | torch.Tensor = 0 **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.models.glm_ocr.modeling_glm_ocr.GlmOcrCausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using the image processor. See its __call__ method for details (the processor uses the image processor for processing images).
- pixel_values_videos (torch.FloatTensor of shape (batch_size, num_frames, num_channels, frame_size, frame_size), optional) — The tensors corresponding to the input videos. Pixel values for videos can be obtained using the video processor. See its __call__ method for details (the processor uses the video processor for processing videos).
- image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) — The temporal, height and width of the feature shape of each image in the LLM.
- video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) — The temporal, height and width of the feature shape of each video in the LLM.
- mm_token_type_ids (torch.IntTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens matching each modality, e.g. text (0), image (1), video (2). Multimodal token type ids can be obtained using AutoProcessor. See ProcessorMixin.__call__() for details.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- logits_to_keep (int or torch.Tensor, optional, defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes quite significant for long sequences or large vocabulary sizes. If a torch.Tensor, it must be 1D, corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
Returns
transformers.models.glm_ocr.modeling_glm_ocr.GlmOcrCausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.models.glm_ocr.modeling_glm_ocr.GlmOcrCausalLMOutputWithPast or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (GlmOcrConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- rope_deltas (torch.LongTensor of shape (batch_size,), optional) — The rope index difference between sequence length and multimodal rope.
The GlmOcrForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently skips them.
Example:
>>> from PIL import Image
>>> import httpx
>>> from io import BytesIO
>>> from transformers import AutoProcessor, GlmOcrForConditionalGeneration
>>> model = GlmOcrForConditionalGeneration.from_pretrained("zai-org/GLM-OCR")
>>> processor = AutoProcessor.from_pretrained("zai-org/GLM-OCR")
>>> messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> with httpx.stream("GET", url) as response:
... image = Image.open(BytesIO(response.read()))
>>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> inputs = processor(text=[text], images=[image], return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(**inputs, max_new_tokens=30)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"The image shows a street scene with a red stop sign in the foreground. In the background, there is a large red gate with Chinese characters ..."
get_image_features
< source >( pixel_values: FloatTensor image_grid_thw: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images.
- image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) — The temporal, height and width of the feature shape of each image in the LLM.
Returns
transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (GlmOcrConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Example:
>>> from PIL import Image
>>> from transformers import AutoProcessor, GlmOcrForConditionalGeneration
>>> model = GlmOcrForConditionalGeneration.from_pretrained("zai-org/GLM-OCR")
>>> processor = AutoProcessor.from_pretrained("zai-org/GLM-OCR")
>>> messages = [
... {
... "role": "user", "content": [
... {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
... {"type": "text", "text": "Where is the cat standing?"},
... ]
... },
... ]
>>> inputs = processor.apply_chat_template(
... messages,
... tokenize=True,
... return_dict=True,
... return_tensors="pt",
... add_generation_prompt=True
... )
>>> # Generate
>>> generate_ids = model.generate(**inputs)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
get_video_features
< source >( pixel_values_videos: FloatTensor video_grid_thw: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
- pixel_values_videos (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input videos.
- video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) — The temporal, height and width of the feature shape of each video in the LLM.
Returns
transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (GlmOcrConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Example:
>>> from PIL import Image
>>> from transformers import AutoProcessor, GlmOcrForConditionalGeneration
>>> model = GlmOcrForConditionalGeneration.from_pretrained("zai-org/GLM-OCR")
>>> processor = AutoProcessor.from_pretrained("zai-org/GLM-OCR")
>>> messages = [
... {
... "role": "user", "content": [
... {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
... {"type": "text", "text": "Where is the cat standing?"},
... ]
... },
... ]
>>> inputs = processor.apply_chat_template(
... messages,
... tokenize=True,
... return_dict=True,
... return_tensors="pt",
... add_generation_prompt=True
... )
>>> # Generate
>>> generate_ids = model.generate(**inputs)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
- forward