Transformers documentation
GLM-OCR
This model was released on {release_date} and added to Hugging Face Transformers on 2026-01-27.
GLM-OCR
Overview
GLM-OCR is a multimodal OCR (Optical Character Recognition) model from Z.ai designed for complex document understanding. The model combines a CogViT visual encoder (pre-trained on large-scale image-text data), a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder.
Key features of GLM-OCR include:
- Lightweight: Only 0.9B parameters while achieving state-of-the-art performance (94.62 on OmniDocBench V1.5)
- Multi-task: Excels at text recognition, formula recognition, table recognition, and information extraction
- Multi-modal: Processes document images for text, formula, and table extraction
This model was contributed by the zai-org team. The original code can be found here.
Usage example
Single image inference
from transformers import AutoProcessor, GlmOcrForConditionalGeneration
import torch
model_id = "zai-org/GLM-OCR"
processor = AutoProcessor.from_pretrained(model_id)
model = GlmOcrForConditionalGeneration.from_pretrained(
model_id,
dtype=torch.bfloat16,
device_map="auto",
)
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"},
{"type": "text", "text": "Text Recognition:"},
],
}
]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
Batch inference
The model supports batching multiple images for efficient processing.
from transformers import AutoProcessor, GlmOcrForConditionalGeneration
import torch
model_id = "zai-org/GLM-OCR"
processor = AutoProcessor.from_pretrained(model_id)
model = GlmOcrForConditionalGeneration.from_pretrained(
model_id,
dtype=torch.bfloat16,
device_map="auto",
)
# First document
message1 = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"},
{"type": "text", "text": "Text Recognition:"},
],
}
]
# Second document
message2 = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
{"type": "text", "text": "Text Recognition:"},
],
}
]
messages = [message1, message2]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
padding=True,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True))
Flash Attention 2
GLM-OCR supports Flash Attention 2 for faster inference. First, install the latest version of Flash Attention:
pip install -U flash-attn --no-build-isolation
Then load the model with one of the supported attention kernels from kernels-community:
from transformers import GlmOcrForConditionalGeneration
import torch
model = GlmOcrForConditionalGeneration.from_pretrained(
"zai-org/GLM-OCR",
dtype=torch.bfloat16,
attn_implementation="kernels-community/flash-attn2", # other options: kernels-community/vllm-flash-attn3, kernels-community/paged-attention
device_map="auto",
)
GlmOcrConfig
class transformers.GlmOcrConfig
< source >( text_config = None vision_config = None image_token_id = 59280 video_token_id = 59281 image_start_token_id = 59256 image_end_token_id = 59257 video_start_token_id = 59258 video_end_token_id = 59259 tie_word_embeddings = False **kwargs )
Parameters
- text_config (Union[PreTrainedConfig, dict], optional, defaults to GlmOcrTextConfig) — The config object or dictionary of the text backbone.
- vision_config (Union[PreTrainedConfig, dict], optional, defaults to GlmOcrVisionConfig) — The config object or dictionary of the vision backbone.
- image_token_id (int, optional, defaults to 59280) — The image token index to encode the image prompt.
- video_token_id (int, optional, defaults to 59281) — The video token index to encode the video prompt.
- image_start_token_id (int, optional, defaults to 59256) — The image start token index to encode the start of an image.
- image_end_token_id (int, optional, defaults to 59257) — The image end token index to encode the end of an image.
- video_start_token_id (int, optional, defaults to 59258) — The video start token index to encode the start of a video.
- video_end_token_id (int, optional, defaults to 59259) — The video end token index to encode the end of a video.
- tie_word_embeddings (bool, optional, defaults to False) — Whether the model’s input and output word embeddings should be tied.
This is the configuration class to store the configuration of a GlmOcrModel. It is used to instantiate a GLM-OCR model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of GLM-OCR zai-org/GLM-OCR.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
>>> from transformers import GlmOcrForConditionalGeneration, GlmOcrConfig
>>> # Initializing a GLM-OCR style configuration
>>> configuration = GlmOcrConfig()
>>> # Initializing a model from the GLM-OCR style configuration
>>> model = GlmOcrForConditionalGeneration(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
GlmOcrVisionConfig
class transformers.GlmOcrVisionConfig
< source >( depth = 24 hidden_size = 1024 hidden_act = 'silu' attention_bias = True attention_dropout = 0.0 num_heads = 16 in_channels = 3 image_size = 336 patch_size = 14 rms_norm_eps = 1e-05 spatial_merge_size = 2 temporal_patch_size = 2 out_hidden_size = 1536 intermediate_size = 4096 initializer_range = 0.02 **kwargs )
Parameters
- depth (int, optional, defaults to 24) — Number of layers (depth) in the model.
- hidden_size (int, optional, defaults to 1024) — Dimensionality of the encoder layers and the pooler layer.
- hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "silu", "relu", "selu" and "gelu_new" are supported.
- attention_bias (bool, optional, defaults to True) — Whether to add a bias to the queries, keys and values.
- attention_dropout (float, optional, defaults to 0.0) — Dropout probability for attention weights.
- num_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer architecture.
- in_channels (int, optional, defaults to 3) — Number of input channels.
- image_size (int or list[int], optional, defaults to 336) — The size (resolution) of each image.
- patch_size (int, optional, defaults to 14) — The size (resolution) of each patch.
- rms_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the RMS normalization layers.
- spatial_merge_size (int, optional, defaults to 2) — The size used for merging spatial dimensions.
- temporal_patch_size (int, optional, defaults to 2) — The size used for patches along the temporal dimension.
- out_hidden_size (int, optional, defaults to 1536) — The output hidden size of the vision model.
- intermediate_size (int, optional, defaults to 4096) — Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
This is the configuration class to store the configuration of a GlmOcrVisionModel. It is used to instantiate a GLM-OCR vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of GLM-OCR zai-org/GLM-OCR.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
GlmOcrTextConfig
class transformers.GlmOcrTextConfig
< source >( vocab_size: int | None = 59392 hidden_size: int | None = 1024 intermediate_size: int | None = 4096 num_hidden_layers: int | None = 16 num_attention_heads: int | None = 16 num_key_value_heads: int | None = 8 hidden_act: str | None = 'silu' max_position_embeddings: int | None = 131072 initializer_range: float | None = 0.02 rms_norm_eps: int | None = 1e-05 use_cache: bool | None = True attention_dropout: float | None = 0.0 rope_parameters: transformers.modeling_rope_utils.RopeParameters | dict[str, transformers.modeling_rope_utils.RopeParameters] | None = None pad_token_id: int | None = None **kwargs )
Parameters
- vocab_size (int, optional, defaults to 59392) — Vocabulary size of the GlmOcr model. Defines the number of different tokens that can be represented by the input_ids passed when calling GlmOcrModel.
- hidden_size (int, optional, defaults to 1024) — Dimension of the hidden representations.
- intermediate_size (int, optional, defaults to 4096) — Dimension of the MLP representations.
- num_hidden_layers (int, optional, defaults to 16) — Number of hidden layers in the Transformer decoder.
- num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer decoder.
- num_key_value_heads (int, optional, defaults to 8) — This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper.
- hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
- max_position_embeddings (int, optional, defaults to 131072) — The maximum sequence length that this model might ever be used with.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- rms_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the RMS normalization layers.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- rope_parameters (RopeParameters, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and optionally parameters used for scaling in case you want to use RoPE with a longer max_position_embeddings.
- pad_token_id (int, optional) — The id of the padding token.
This is the configuration class to store the configuration of a GlmOcrTextModel. It is used to instantiate a GLM-OCR text decoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of GLM-OCR zai-org/GLM-OCR.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
>>> from transformers import GlmOcrTextModel, GlmOcrConfig
>>> # Initializing a GLM-OCR style configuration
>>> configuration = GlmOcrConfig()
>>> # Initializing a model from the GLM-OCR style configuration
>>> model = GlmOcrTextModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
GlmOcrVisionModel
forward
< source >( hidden_states: Tensor grid_thw: Tensor **kwargs ) → torch.Tensor
The GlmOcrVisionModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
GlmOcrTextModel
class transformers.GlmOcrTextModel
< source >( config: GlmOcrTextConfig )
Parameters
- config (GlmOcrTextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Glm Ocr Text Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None use_cache: bool | None = None cache_position: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
Returns
transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GlmOcrTextConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The GlmOcrTextModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
GlmOcrModel
class transformers.GlmOcrModel
< source >( config )
Parameters
- config (GlmOcrConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Glm Ocr Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None pixel_values: torch.Tensor | None = None pixel_values_videos: torch.FloatTensor | None = None image_grid_thw: torch.LongTensor | None = None video_grid_thw: torch.LongTensor | None = None rope_deltas: torch.LongTensor | None = None mm_token_type_ids: torch.IntTensor | None = None cache_position: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.models.glm_ocr.modeling_glm_ocr.GlmOcrModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using the model’s image processor; see its __call__ method for details.
- pixel_values_videos (torch.FloatTensor of shape (batch_size, num_frames, num_channels, frame_size, frame_size), optional) — The tensors corresponding to the input videos. Pixel values for videos can be obtained using the model’s video processor; see its __call__ method for details.
- image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) — The temporal, height and width of feature shape of each image in LLM.
- video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) — The temporal, height and width of feature shape of each video in LLM.
- rope_deltas (torch.LongTensor of shape (batch_size, ), optional) — The rope index difference between sequence length and multimodal rope.
- mm_token_type_ids (torch.IntTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens matching each modality, e.g. text (0), image (1), video (2). Multimodal token type ids can be obtained using AutoProcessor. See ProcessorMixin.__call__() for details.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
Returns
transformers.models.glm_ocr.modeling_glm_ocr.GlmOcrModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.models.glm_ocr.modeling_glm_ocr.GlmOcrModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GlmOcrConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional, defaults to None) — Sequence of hidden-states at the output of the last layer of the model.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- rope_deltas (torch.LongTensor of shape (batch_size, ), optional) — The rope index difference between sequence length and multimodal rope.
The GlmOcrModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
get_image_features
< source >( pixel_values: FloatTensor image_grid_thw: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images.
- image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) — The temporal, height and width of feature shape of each image in LLM.
Returns
transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (GlmOcrConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
get_placeholder_mask
< source >( input_ids: LongTensor inputs_embeds: FloatTensor image_features: torch.FloatTensor | None = None video_features: torch.FloatTensor | None = None )
Obtains multimodal placeholder mask from input_ids or inputs_embeds, and checks that the placeholder token count is
equal to the length of multimodal features. If the lengths are different, an error is raised.
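Conceptually, the check marks every image-placeholder position in input_ids and compares the count against the number of image feature vectors. A minimal pure-Python sketch of that idea (the placeholder id is the GlmOcrConfig default from above; the surrounding ids and the feature count are invented for illustration, and the real method operates on tensors and also handles video tokens and inputs_embeds):

```python
# Token ids: 59256 = image start, 59280 = image placeholder, 59257 = image end
# (GlmOcrConfig defaults); the other ids stand in for ordinary text tokens.
IMAGE_TOKEN_ID = 59280

input_ids = [101, 59256, 59280, 59280, 59280, 59257, 2025]

# The placeholder mask flags every image-placeholder position.
placeholder_mask = [tok == IMAGE_TOKEN_ID for tok in input_ids]

# The consistency check: one placeholder per image feature vector, else raise.
num_image_features = 3  # e.g. the number of vectors returned by the vision tower
if sum(placeholder_mask) != num_image_features:
    raise ValueError("Image features and image tokens do not match")

print(placeholder_mask)
```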
get_rope_index
< source >( input_ids: LongTensor mm_token_type_ids: IntTensor image_grid_thw: torch.LongTensor | None = None video_grid_thw: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None **kwargs )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
- mm_token_type_ids (torch.IntTensor of shape (batch_size, sequence_length)) — Token type ids matching each modality to a different value in the input sequence, i.e. text (0), image (1), video (2).
- image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) — The temporal, height and width of feature shape of each image in LLM.
- video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) — The temporal, height and width of feature shape of each video in LLM.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
Calculate the 3D RoPE index based on image and video sizes. The utility expects a vision + text sequence and will error out otherwise. For a pure text sequence, rely on the model’s auto-inferred position ids. In a mixed vision + text sequence, vision tokens use 3D RoPE (temporal, height, width) while text tokens use standard 1D RoPE.
Example: with 3 temporal patches, 2 height patches and 2 width patches, each vision input yields temporal × height × width = 3 × 2 × 2 = 12 positions in total.
Temporal position IDs are spaced by:
interval = tokens_per_second * temporal_patch_size / fps
If fps = 1, tokens_per_second = 25 and temporal_patch_size = 2, temporal IDs increase by 50 for each temporal patch:
[0, 0, 0, 0, 50, 50, 50, 50, 100, 100, 100, 100]
Height IDs repeat per row: [0, 0, 1, 1, ...]
Width IDs alternate per column: [0, 1, 0, 1, ...]
Text tokens follow standard 1D RoPE, and their position IDs grow with a step of 1.
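The worked example above can be reproduced in a few lines of plain Python (the grid sizes and timing constants are those of the example, not values read from a real checkpoint):

```python
# Vision grid from the example: 3 temporal x 2 height x 2 width patches.
T, H, W = 3, 2, 2

# Temporal spacing: interval = tokens_per_second * temporal_patch_size / fps
tokens_per_second, temporal_patch_size, fps = 25, 2, 1
interval = int(tokens_per_second * temporal_patch_size / fps)  # 50

# Each temporal index repeats for every spatial position in its frame.
temporal_ids = [t * interval for t in range(T) for _ in range(H * W)]
# Height ids repeat per row; width ids alternate per column, within each frame.
height_ids = [h for _ in range(T) for h in range(H) for _ in range(W)]
width_ids = [w for _ in range(T) for _ in range(H) for w in range(W)]

print(temporal_ids)  # [0, 0, 0, 0, 50, 50, 50, 50, 100, 100, 100, 100]
```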
get_video_features
< source >( pixel_values_videos: FloatTensor video_grid_thw: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
- pixel_values_videos (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input videos.
- video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) — The temporal, height and width of feature shape of each video in LLM.
Returns
transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (GlmOcrConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
get_vision_position_ids
< source >( start_position: int grid_thw: list[int, int, int] | torch.Tensor temp_merge_size: int = 1 spatial_merge_size: int = 1 time_interval: int = 1 device: str | torch.device | None = None ) → torch.LongTensor of shape (3, sequence_length)
Parameters
- start_position (int) — Offset added to all computed positional indices.
- grid_thw (Sequence[int] or torch.Tensor of shape (3,)) — The (T, H, W) grid representing the feature layout of the current image or video after patch embedding.
- temp_merge_size (int, optional, defaults to 1) — Factor by which the temporal dimension is reduced in the backbone. The temporal grid size is divided by this value.
- spatial_merge_size (int, optional, defaults to 1) — Factor by which the spatial dimensions (H and W) are reduced in the backbone. Both H and W are divided by this value.
- time_interval (int, optional, defaults to 1) — Spacing factor applied between consecutive temporal position indices.
- device (str or torch.device, optional) — Device on which the resulting tensor is allocated. If None, uses the current default device.
Returns
torch.LongTensor of shape (3, sequence_length)
Positional indices for temporal, height, and width dimensions,
flattened into sequence form and offset by start_position.
Compute 3D positional indices for vision tokens derived from a single image or video input.
The positions are generated from the input grid defined by temporal (T), height (H), and
width (W) dimensions. Temporal and spatial dimensions can be downscaled according to the
merge sizes used in the vision backbone. The resulting positions are offset by start_position.
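A minimal sketch of this computation (not the library source; the helper name and structure here are illustrative) shows how the grid is downscaled by the merge sizes, spaced by the time interval, and offset:

```python
import torch

# Hypothetical sketch of the documented behavior of get_vision_position_ids:
# compute (3, sequence_length) positional indices for a (T, H, W) vision grid.
def vision_position_ids_sketch(start_position, grid_thw,
                               temp_merge_size=1, spatial_merge_size=1,
                               time_interval=1):
    t, h, w = grid_thw
    # Downscale the grid by the merge factors used in the vision backbone.
    t, h, w = t // temp_merge_size, h // spatial_merge_size, w // spatial_merge_size
    t_ids = (torch.arange(t) * time_interval).repeat_interleave(h * w)
    h_ids = torch.arange(h).repeat_interleave(w).repeat(t)
    w_ids = torch.arange(w).repeat(h).repeat(t)
    # Stack temporal/height/width streams and offset by start_position.
    return torch.stack([t_ids, h_ids, w_ids]) + start_position

# A 4x4 single-frame image merged 2x spatially yields 4 tokens, offset by 10.
pos = vision_position_ids_sketch(10, (1, 4, 4), spatial_merge_size=2)
print(pos.tolist())  # [[10, 10, 10, 10], [10, 10, 11, 11], [10, 11, 10, 11]]
```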
- forward
GlmOcrForConditionalGeneration
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None labels: torch.LongTensor | None = None pixel_values: torch.Tensor | None = None pixel_values_videos: torch.FloatTensor | None = None image_grid_thw: torch.LongTensor | None = None video_grid_thw: torch.LongTensor | None = None mm_token_type_ids: torch.IntTensor | None = None cache_position: torch.LongTensor | None = None logits_to_keep: int | torch.Tensor = 0 **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.models.glm_ocr.modeling_glm_ocr.GlmOcrCausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using the image processor. See its __call__ method for details (the processor uses the image processor for processing images).
- pixel_values_videos (torch.FloatTensor of shape (batch_size, num_frames, num_channels, frame_size, frame_size), optional) — The tensors corresponding to the input videos. Pixel values for videos can be obtained using the video processor. See its __call__ method for details (the processor uses the video processor for processing videos).
- image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) — The temporal, height and width of the feature shape of each image in the LLM.
- video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) — The temporal, height and width of the feature shape of each video in the LLM.
- mm_token_type_ids (torch.IntTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens matching each modality, e.g. text (0), image (1), video (2). Multimodal token type ids can be obtained using AutoProcessor. See ProcessorMixin.__call__() for details.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- logits_to_keep (int or torch.Tensor, optional, defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes quite significant for long sequences or large vocabulary sizes. If a torch.Tensor, it must be 1D, corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
Returns
transformers.models.glm_ocr.modeling_glm_ocr.GlmOcrCausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.models.glm_ocr.modeling_glm_ocr.GlmOcrCausalLMOutputWithPast or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (GlmOcrConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- rope_deltas (torch.LongTensor of shape (batch_size,), optional) — The rope index difference between sequence length and multimodal rope.
The GlmOcrForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently skips them.
Example:
>>> from PIL import Image
>>> import httpx
>>> from io import BytesIO
>>> from transformers import AutoProcessor, GlmOcrForConditionalGeneration
>>> model = GlmOcrForConditionalGeneration.from_pretrained("zai-org/GLM-OCR")
>>> processor = AutoProcessor.from_pretrained("zai-org/GLM-OCR")
>>> messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> with httpx.stream("GET", url) as response:
... image = Image.open(BytesIO(response.read()))
>>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> inputs = processor(text=[text], images=[image], return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(**inputs, max_new_tokens=30)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"The image shows a street scene with a red stop sign in the foreground. In the background, there is a large red gate with Chinese characters ..."
get_image_features
< source >( pixel_values: FloatTensor image_grid_thw: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images.
- image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) — The temporal, height and width of the feature shape of each image in the LLM.
Returns
transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (GlmOcrConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Example:
>>> from PIL import Image
>>> from transformers import AutoProcessor, GlmOcrForConditionalGeneration
>>> model = GlmOcrForConditionalGeneration.from_pretrained("zai-org/GLM-OCR")
>>> processor = AutoProcessor.from_pretrained("zai-org/GLM-OCR")
>>> messages = [
... {
... "role": "user", "content": [
... {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
... {"type": "text", "text": "Where is the cat standing?"},
... ]
... },
... ]
>>> inputs = processor.apply_chat_template(
... messages,
... tokenize=True,
... return_dict=True,
... return_tensors="pt",
... add_generation_prompt=True
... )
>>> # Generate
>>> generate_ids = model.generate(**inputs)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
get_video_features
< source >( pixel_values_videos: FloatTensor video_grid_thw: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
- pixel_values_videos (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input videos.
- video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) — The temporal, height and width of the feature shape of each video in the LLM.
Returns
transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (GlmOcrConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Example:
>>> from PIL import Image
>>> from transformers import AutoProcessor, GlmOcrForConditionalGeneration
>>> model = GlmOcrForConditionalGeneration.from_pretrained("zai-org/GLM-OCR")
>>> processor = AutoProcessor.from_pretrained("zai-org/GLM-OCR")
>>> messages = [
... {
... "role": "user", "content": [
... {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
... {"type": "text", "text": "Where is the cat standing?"},
... ]
... },
... ]
>>> inputs = processor.apply_chat_template(
... messages,
... tokenize=True,
... return_dict=True,
... return_tensors="pt",
... add_generation_prompt=True
... )
>>> # Generate
>>> generate_ids = model.generate(**inputs)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
- forward