Transformers documentation


*This model was released on 2025-05-20 and added to Hugging Face Transformers on 2026-03-13.*

PP-OCRv5_server_det

PyTorch

Overview

PP-OCRv5_server_det is a high-performance text detection model optimized for server-side applications, focusing on accurate detection of multi-language text in documents and natural scenes.

Model Architecture

PP-OCRv5_server_det is one of the PP-OCRv5_det series, the latest generation of text detection models developed by the PaddleOCR team. Designed for high-performance applications, it supports the detection of text in diverse scenarios—including handwriting, vertical, rotated, and curved text—across multiple languages such as Simplified Chinese, Traditional Chinese, English, and Japanese. Key features include robust handling of complex layouts, varying text sizes, and challenging backgrounds, making it suitable for practical applications like document analysis, license plate recognition, and scene text detection.

Usage

Single input inference

The example below demonstrates how to detect text with PP-OCRv5_server_det using Pipeline or AutoModel.

Pipeline
AutoModel
import requests
from PIL import Image
from transformers import pipeline

image = Image.open(
    requests.get(
        "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_ocr_001.png", stream=True
    ).raw)
detector = pipeline(
    task="object-detection", 
    model="PaddlePaddle/PP-OCRV5_server_det_safetensors",
    device_map="auto",
)
results = detector(image)

for result in results:
    print(result)
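Each pipeline result is a dict with a score, a label, and a box in (xmin, ymin, xmax, ymax) corner format. A small helper for visualizing such results with PIL is sketched below; the sample data is hypothetical, not output from the actual model:

```python
from PIL import Image, ImageDraw

def draw_detections(image, results, min_score=0.5):
    # Draw one rectangle per detected text region above the score threshold.
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for r in results:
        if r["score"] < min_score:
            continue
        b = r["box"]
        draw.rectangle(
            [b["xmin"], b["ymin"], b["xmax"], b["ymax"]],
            outline="red", width=2,
        )
    return out

# Hypothetical sample result in the object-detection pipeline's format:
sample = [{"score": 0.92, "label": "text",
           "box": {"xmin": 10, "ymin": 20, "xmax": 120, "ymax": 48}}]
annotated = draw_detections(Image.new("RGB", (200, 100), "white"), sample)
```

The returned image can then be saved or displayed with the usual PIL methods, e.g. `annotated.save("out.png")`.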

Batched inference

Batched inference works the same way with PP-OCRv5_server_det using Pipeline or AutoModel: pass a list of images instead of a single image.

Pipeline
AutoModel
import requests
from PIL import Image
from transformers import pipeline

image = Image.open(
    requests.get(
        "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_ocr_001.png", stream=True
    ).raw)
detector = pipeline(
    task="object-detection", 
    model="PaddlePaddle/PP-OCRV5_server_det_safetensors",
    device_map="auto",
)
results = detector([image, image])

for result in results:
    print(result)

PPOCRV5ServerDetForObjectDetection

class transformers.PPOCRV5ServerDetForObjectDetection

< >

( config: PPOCRV5ServerDetConfig )

Parameters

  • config (PPOCRV5ServerDetConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

PPOCRV5 Server Det model for object (text) detection tasks. Wraps the core PPOCRV5ServerDetModel and returns outputs compatible with the Transformers object detection API.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

PPOCRV5ServerDetConfig

class transformers.PPOCRV5ServerDetConfig

< >

( interpolate_mode: str = 'nearest' backbone_config = None neck_out_channels: int = 256 reduce_factor: int = 2 intraclass_block_number: int = 4 intraclass_block_config: dict | None = None scale_factor: int = 2 scale_factor_list: list | None = None hidden_act: str = 'relu' kernel_list: list | None = None **kwargs )

Parameters

  • interpolate_mode (str, optional, defaults to "nearest") — The interpolation mode used for upsampling or downsampling feature maps in the neck network.
  • backbone_config (optional, defaults to None) — The configuration of the backbone model.
  • neck_out_channels (int, optional, defaults to 256) — The number of output channels from the neck network, responsible for feature fusion and refinement.
  • reduce_factor (int, optional, defaults to 2) — The channel reduction factor used in the neck blocks to balance performance and complexity.
  • intraclass_block_number (int, optional, defaults to 4) — The number of Intra-Class Block modules used for enhancing feature representation.
  • intraclass_block_config (dict, optional, defaults to None) — Configuration for the Intra-Class Block modules, if any, used for enhancing feature representation.
  • scale_factor (int, optional, defaults to 2) — The scaling factor used for spatial resolution adjustments in the feature maps.
  • scale_factor_list (list[int], optional, defaults to None) — A list of scaling factors used for spatial resolution adjustments in the feature maps.
  • hidden_act (str, optional, defaults to "relu") — The non-linear activation function (function or string) in the decoder. For example, "gelu", "relu", "silu", etc.
  • kernel_list (list[int], optional, defaults to [3, 2, 2]) — The list of kernel sizes for convolutional layers in the head network for multi-scale feature extraction.

This is the configuration class to store the configuration of a PPOCRV5ServerDetModel. It is used to instantiate a PP-OCRv5 server detection model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of PaddlePaddle/PP-OCRv5-server-det.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

PPOCRV5ServerDetModel

class transformers.PPOCRV5ServerDetModel

< >

( config: PPOCRV5ServerDetConfig )

Parameters

  • config (PPOCRV5ServerDetConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare PPOCRV5ServerDetModel outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< >

( pixel_values: FloatTensor **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) BaseModelOutputWithNoAttention or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Input pixel values, typically produced by the image processor.

Returns

BaseModelOutputWithNoAttention or tuple(torch.FloatTensor)

A BaseModelOutputWithNoAttention or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PPOCRV5ServerDetConfig) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Sequence of hidden-states at the output of the last layer of the model.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, num_channels, height, width).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

The PPOCRV5ServerDetModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

PPOCRV5ServerDetImageProcessorFast

class transformers.PPOCRV5ServerDetImageProcessorFast

< >

( **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] )

Parameters

  • **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.

Constructs a PPOCRV5ServerDetImageProcessorFast image processor.

get_image_size

< >

( image: torch.Tensor limit_side_len: int limit_type: str max_side_limit: int ) tuple

Parameters

  • image (torch.Tensor) — Input image.
  • limit_side_len (int) — Maximum or minimum side length.
  • limit_type (str) — Resizing strategy: "max", "min", or "resize_long".
  • max_side_limit (int) — Maximum allowed side length.

Returns

tuple

  • SizeDict: Target size.
  • torch.Tensor: Original size.

Computes the target size for resizing an image while preserving aspect ratio.
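The aspect-ratio-preserving size computation can be sketched in plain Python. This is a simplified illustration of the behavior described above, not the library implementation; in particular, rounding side lengths to multiples of 32 is an assumption about the model's stride, not something this page states:

```python
def compute_target_size(height, width, limit_side_len=960,
                        limit_type="max", max_side_limit=4000):
    # Pick a uniform scale so the constrained side respects limit_side_len.
    if limit_type == "max":
        # Shrink only if the longer side exceeds the limit.
        longest = max(height, width)
        scale = limit_side_len / longest if longest > limit_side_len else 1.0
    elif limit_type == "min":
        # Grow only if the shorter side is below the limit.
        shortest = min(height, width)
        scale = limit_side_len / shortest if shortest < limit_side_len else 1.0
    else:  # "resize_long": always scale the longer side to limit_side_len
        scale = limit_side_len / max(height, width)

    # Round to multiples of 32 (assumed stride) and cap at max_side_limit.
    h = min(max(int(round(height * scale / 32)) * 32, 32), max_side_limit)
    w = min(max(int(round(width * scale / 32)) * 32, 32), max_side_limit)
    return h, w
```

For example, a 1080×1920 image under the default "max" strategy is halved so its longer side fits within 960 pixels, then snapped to the nearest multiples of 32.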

post_process_object_detection

< >

( predictions threshold: float = 0.3 target_sizes: list[tuple[int, int]] | transformers.utils.generic.TensorType | None = None box_threshold: float = 0.6 max_candidates: int = 1000 min_size: int = 3 unclip_ratio: float = 1.5 ) list[dict]

Parameters

  • predictions — Model outputs with logits attribute (probability maps of shape (batch_size, 1, H, W)).
  • threshold (float) — Binarization threshold.
  • target_sizes — Original image sizes (height, width) per image.
  • box_threshold (float) — Box score threshold.
  • max_candidates (int) — Maximum number of boxes.
  • min_size (int) — Minimum box size.
  • unclip_ratio (float) — Expansion ratio.

Returns

list[dict]

List of detection results per image. Each dict contains:

  • “boxes”: torch.Tensor of shape (N, 4) in corners format (xmin, ymin, xmax, ymax)
  • “scores”: torch.Tensor of shape (N,)
  • “labels”: torch.Tensor of shape (N,) (class id 0 for text)

Converts model outputs into detected text boxes in corners format (xmin, ymin, xmax, ymax).
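The score filtering, minimum-size check, and box expansion can be illustrated with a simplified, axis-aligned sketch. The real post-processing operates on polygon contours extracted from the binarized probability map; the helper names below are hypothetical, and the unclip distance formula (area × ratio / perimeter) follows the DB-style expansion commonly used by such detectors:

```python
def unclip_box(box, unclip_ratio=1.5):
    # Expand a box outward by distance d = area * ratio / perimeter.
    xmin, ymin, xmax, ymax = box
    w, h = xmax - xmin, ymax - ymin
    d = (w * h) * unclip_ratio / (2 * (w + h))
    return (xmin - d, ymin - d, xmax + d, ymax + d)

def filter_boxes(candidates, box_threshold=0.6, min_size=3, unclip_ratio=1.5):
    # candidates: (box, score) pairs derived from the binarized probability map.
    kept = []
    for box, score in candidates:
        xmin, ymin, xmax, ymax = box
        if score < box_threshold:                      # drop low-confidence regions
            continue
        if min(xmax - xmin, ymax - ymin) < min_size:   # drop degenerate boxes
            continue
        kept.append({"box": unclip_box(box, unclip_ratio),
                     "score": score,
                     "label": 0})                      # class id 0 for text
    return kept
```

The expansion step compensates for the shrunken text kernels the probability map predicts, so the final boxes cover the full text regions.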
