Transformers documentation


This model was released on 2026-02-12 and added to Hugging Face Transformers on 2026-04-12.

SAM3-LiteText

PyTorch

Overview

SAM3-LiteText was proposed in SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation by Chengxi Zeng, Yuxuan Jiang, Ge Gao, Shuai Wang, Duolikun Danier, Bin Zhu, Stevan Rudinac, David Bull, and Fan Zhang.

SAM3-LiteText is a lightweight variant of SAM3 that replaces the heavy SAM3 text encoder (353M parameters) with a compact MobileCLIP-based text encoder optimized through knowledge distillation. The SAM3 ViT-H image encoder is kept intact. This reduces text encoder parameters by up to 88% while maintaining segmentation performance comparable to the original model.

The abstract from the paper is the following:

Vision-language segmentation models such as SAM3 enable flexible, prompt-driven visual grounding, but inherit large, general-purpose text encoders originally designed for open-ended language understanding. In practice, segmentation prompts are short, structured, and semantically constrained, leading to substantial over-provisioning in text encoder capacity and persistent computational and memory overhead. In this paper, we perform a large-scale anatomical analysis of text prompting in vision-language segmentation, covering 404,796 real prompts across multiple benchmarks. Our analysis reveals severe redundancy: most context windows are underutilized, vocabulary usage is highly sparse, and text embeddings lie on low-dimensional manifold despite high-dimensional representations. Motivated by these findings, we propose SAM3-LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student that is optimized by knowledge distillation. Extensive experiments on image and video segmentation benchmarks show that SAM3-LiteText reduces text encoder parameters by up to 88%, substantially reducing static memory footprint, while maintaining segmentation performance comparable to the original model.

The text encoder architecture is based on MobileCLIP and comes in three variants:

| Variant | Text Encoder | Text Params | Reduction |
|---|---|---|---|
| SAM3-LiteText-S0-16 | MobileCLIP-S0 | 42.54M | ~88% |
| SAM3-LiteText-S1-16 | MobileCLIP-S1 | 63.53M | ~82% |
| SAM3-LiteText-L-16 | MobileCLIP2-L | 123.80M | ~65% |
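
As a quick sanity check, the reduction figures follow directly from the parameter counts above, measured against the 353M-parameter SAM3 text encoder:

```python
# Reproduce the "Reduction" column from the variant table.
SAM3_TEXT_PARAMS_M = 353.0  # original SAM3 text encoder size, in millions

variants = {
    "SAM3-LiteText-S0-16": 42.54,
    "SAM3-LiteText-S1-16": 63.53,
    "SAM3-LiteText-L-16": 123.80,
}

# Percentage of text encoder parameters removed, rounded to the nearest integer
reductions = {
    name: round(100 * (1 - params_m / SAM3_TEXT_PARAMS_M))
    for name, params_m in variants.items()
}
print(reductions)  # {'SAM3-LiteText-S0-16': 88, 'SAM3-LiteText-S1-16': 82, 'SAM3-LiteText-L-16': 65}
```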

This model was contributed by nielsr and yonigozlan. The original code can be found here.

Usage

SAM3-LiteText is a drop-in replacement for SAM3 with a lightweight text encoder. It uses the same processor (Sam3Processor) and supports the same prompting interface. Refer to the SAM3 documentation for detailed usage examples including text prompts, box prompts, batched inference, and more.

from io import BytesIO

import httpx
from transformers import AutoModel, AutoProcessor
from PIL import Image

# Load the model and its processor (shared with SAM3)
model = AutoModel.from_pretrained("yonigozlan/sam3-litetext-s0", device_map="auto")
processor = AutoProcessor.from_pretrained("yonigozlan/sam3-litetext-s0")

# Download an example image
image_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
image = Image.open(BytesIO(httpx.get(image_url).content)).convert("RGB")

# Segment with a text prompt
inputs = processor(images=image, text="ear", return_tensors="pt").to(model.device)

outputs = model(**inputs)

# Post-process into per-instance masks, boxes, and scores
results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs.get("original_sizes").tolist(),
)[0]

print(f"Found {len(results['masks'])} objects")
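
Conceptually, the post-processing step filters detections whose scores pass `threshold` and binarizes each mask's per-pixel logits at `mask_threshold`. A minimal pure-Python sketch of the binarization step (illustrative only; the actual implementation operates on torch tensors and also resizes masks to `target_sizes`):

```python
import math

def binarize_mask(mask_logits, mask_threshold=0.5):
    """Turn raw per-pixel mask logits into a binary mask by applying a
    sigmoid and comparing against mask_threshold (illustrative sketch)."""
    return [
        [1 if 1 / (1 + math.exp(-v)) > mask_threshold else 0 for v in row]
        for row in mask_logits
    ]

# sigmoid(2.0) ≈ 0.88 and sigmoid(3.5) ≈ 0.97 pass the 0.5 threshold;
# sigmoid(-1.0) ≈ 0.27 and sigmoid(0.0) = 0.5 do not.
print(binarize_mask([[2.0, -1.0], [0.0, 3.5]]))  # [[1, 0], [0, 1]]
```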

Sam3LiteTextConfig

class transformers.Sam3LiteTextConfig


( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None vision_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None text_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None geometry_encoder_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None detr_encoder_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None detr_decoder_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None mask_decoder_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None initializer_range: float = 0.02 )

Parameters

  • vision_config (Union[dict, ~configuration_utils.PreTrainedConfig], optional) — The config object or dictionary of the vision backbone.
  • text_config (Union[dict, ~configuration_utils.PreTrainedConfig], optional) — The config object or dictionary of the text backbone.
  • geometry_encoder_config (dict or Sam3LiteTextGeometryEncoderConfig, optional) — Configuration for the geometry encoder.
  • detr_encoder_config (dict or Sam3LiteTextDETREncoderConfig, optional) — Configuration for the DETR encoder.
  • detr_decoder_config (dict or Sam3LiteTextDETRDecoderConfig, optional) — Configuration for the DETR decoder.
  • mask_decoder_config (dict or Sam3LiteTextMaskDecoderConfig, optional) — Configuration for the mask decoder.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

This is the configuration class to store the configuration of a Sam3LiteTextModel. It is used to instantiate a SAM3-LiteText model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the facebook/sam3_lite_text architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:

>>> from transformers import Sam3LiteTextConfig, Sam3LiteTextModel

>>> # Initializing a SAM3_LITE_TEXT configuration
>>> configuration = Sam3LiteTextConfig()

>>> # Initializing a model from the configuration
>>> model = Sam3LiteTextModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

Sam3LiteTextTextConfig

class transformers.Sam3LiteTextTextConfig


( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None vocab_size: int = 49408 hidden_size: int = 512 intermediate_size: int = 2048 projection_dim: int = 512 num_hidden_layers: int = 12 num_attention_heads: int = 8 max_position_embeddings: int = 77 hidden_act: str = 'gelu' layer_norm_eps: float = 1e-05 attention_dropout: float = 0.0 use_repmixer_blocks: bool = True layer_scale_init_value: float = 1e-05 repmixer_kernel_size: int = 11 )

Parameters

  • vocab_size (int, optional, defaults to 49408) — Vocabulary size of the model. Defines the number of different tokens that can be represented by the input_ids.
  • hidden_size (int, optional, defaults to 512) — Dimension of the hidden representations.
  • intermediate_size (int, optional, defaults to 2048) — Dimension of the MLP representations.
  • projection_dim (int, optional, defaults to 512) — Dimensionality of text and vision projection layers.
  • num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer decoder.
  • num_attention_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the Transformer decoder.
  • max_position_embeddings (int, optional, defaults to 77) — The maximum sequence length that this model might ever be used with.
  • hidden_act (str, optional, defaults to gelu) — The non-linear activation function (function or string) in the decoder. For example, "gelu", "relu", "silu", etc.
  • layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers.
  • attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
  • use_repmixer_blocks (bool, optional, defaults to True) — Whether to use RepMixer blocks (MobileCLIP-style) for the first and last encoder layers. When False, all layers are standard Transformer encoder layers.
  • layer_scale_init_value (float, optional, defaults to 1e-5) — Initial value for the learnable layer-scale parameters in RepMixer blocks (residual branches).
  • repmixer_kernel_size (int, optional, defaults to 11) — Kernel size for depthwise convolutions in RepMixer blocks (token mixer and convolutional feed-forward path).

This is the configuration class to store the configuration of a Sam3LiteTextTextModel. It is used to instantiate the SAM3-LiteText text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the yonigozlan/sam3-litetext-s0 architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
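
Example (following the same pattern as the Sam3LiteTextConfig example; class names as documented on this page):

```python
>>> from transformers import Sam3LiteTextTextConfig, Sam3LiteTextTextModel

>>> # Initializing a text encoder configuration
>>> configuration = Sam3LiteTextTextConfig()

>>> # Initializing a (randomly initialized) text model from the configuration
>>> model = Sam3LiteTextTextModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```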

Sam3LiteTextGeometryEncoderConfig

class transformers.Sam3LiteTextGeometryEncoderConfig


( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None hidden_size: int = 256 num_layers: int = 3 num_attention_heads: int = 8 intermediate_size: int = 2048 dropout: float | int = 0.1 hidden_act: str = 'relu' hidden_dropout: float | int = 0.0 layer_norm_eps: float = 1e-06 roi_size: int = 7 initializer_range: float = 0.02 )

Parameters

  • hidden_size (int, optional, defaults to 256) — Dimension of the hidden representations.
  • num_layers (int, optional, defaults to 3) — Number of hidden layers in the Transformer decoder.
  • num_attention_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the Transformer decoder.
  • intermediate_size (int, optional, defaults to 2048) — Dimension of the MLP representations.
  • dropout (Union[float, int], optional, defaults to 0.1) — The ratio for all dropout layers.
  • hidden_act (str, optional, defaults to relu) — The non-linear activation function (function or string) in the decoder. For example, "gelu", "relu", "silu", etc.
  • hidden_dropout (Union[float, int], optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
  • layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the layer normalization layers.
  • roi_size (int, optional, defaults to 7) — ROI size for box pooling operations.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

This is the configuration class to store the configuration of the geometry encoder of a Sam3LiteTextModel. It is used to instantiate a SAM3-LiteText geometry encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the facebook/sam3_lite_text architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Sam3LiteTextDETREncoderConfig

class transformers.Sam3LiteTextDETREncoderConfig


( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None hidden_size: int = 256 num_layers: int = 6 num_attention_heads: int = 8 intermediate_size: int = 2048 dropout: float | int = 0.1 hidden_act: str = 'relu' hidden_dropout: float | int = 0.0 layer_norm_eps: float = 1e-06 initializer_range: float = 0.02 )

Parameters

  • hidden_size (int, optional, defaults to 256) — Dimension of the hidden representations.
  • num_layers (int, optional, defaults to 6) — Number of hidden layers in the Transformer decoder.
  • num_attention_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the Transformer decoder.
  • intermediate_size (int, optional, defaults to 2048) — Dimension of the MLP representations.
  • dropout (Union[float, int], optional, defaults to 0.1) — The ratio for all dropout layers.
  • hidden_act (str, optional, defaults to relu) — The non-linear activation function (function or string) in the decoder. For example, "gelu", "relu", "silu", etc.
  • hidden_dropout (float, optional, defaults to 0.0) — Dropout probability for hidden states.
  • layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the layer normalization layers.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

This is the configuration class to store the configuration of the DETR encoder of a Sam3LiteTextModel. It is used to instantiate a SAM3-LiteText DETR encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the facebook/sam3_lite_text architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Sam3LiteTextDETRDecoderConfig

class transformers.Sam3LiteTextDETRDecoderConfig


( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None hidden_size: int = 256 num_layers: int = 6 num_queries: int = 200 num_attention_heads: int = 8 intermediate_size: int = 2048 dropout: float | int = 0.1 hidden_act: str = 'relu' hidden_dropout: float | int = 0.0 layer_norm_eps: float = 1e-06 initializer_range: float = 0.02 )

Parameters

  • hidden_size (int, optional, defaults to 256) — Dimension of the hidden representations.
  • num_layers (int, optional, defaults to 6) — Number of hidden layers in the Transformer decoder.
  • num_queries (int, optional, defaults to 200) — Number of object queries.
  • num_attention_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the Transformer decoder.
  • intermediate_size (int, optional, defaults to 2048) — Dimension of the MLP representations.
  • dropout (Union[float, int], optional, defaults to 0.1) — The ratio for all dropout layers.
  • hidden_act (str, optional, defaults to relu) — The non-linear activation function (function or string) in the decoder. For example, "gelu", "relu", "silu", etc.
  • hidden_dropout (Union[float, int], optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
  • layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the layer normalization layers.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

This is the configuration class to store the configuration of the DETR decoder of a Sam3LiteTextModel. It is used to instantiate a SAM3-LiteText DETR decoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the facebook/sam3_lite_text architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Sam3LiteTextMaskDecoderConfig

class transformers.Sam3LiteTextMaskDecoderConfig


( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None hidden_size: int = 256 num_upsampling_stages: int = 3 layer_norm_eps: float = 1e-06 dropout: float | int = 0.0 num_attention_heads: int = 8 initializer_range: float = 0.02 )

Parameters

  • hidden_size (int, optional, defaults to 256) — Dimension of the hidden representations.
  • num_upsampling_stages (int, optional, defaults to 3) — Number of upsampling stages in the pixel decoder (FPN).
  • layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the layer normalization layers.
  • dropout (Union[float, int], optional, defaults to 0.0) — The ratio for all dropout layers.
  • num_attention_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the Transformer decoder.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

This is the configuration class to store the configuration of the mask decoder of a Sam3LiteTextModel. It is used to instantiate a SAM3-LiteText mask decoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the facebook/sam3_lite_text architecture.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Sam3LiteTextTextModel

class transformers.Sam3LiteTextTextModel


( config: Sam3LiteTextTextConfig )

Parameters

  • config (Sam3LiteTextTextConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

MobileCLIP (MCT) text encoder used in SAM3-LiteText.

When config.use_repmixer_blocks is True, the first and last layers are Sam3LiteTextRepMixerBlock modules; the rest are standard Sam3LiteTextTextEncoderLayer layers.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward


( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) Sam3LiteTextTextEncoderOutput or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

Returns

Sam3LiteTextTextEncoderOutput or tuple(torch.FloatTensor)

A Sam3LiteTextTextEncoderOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Sam3LiteTextConfig) and inputs.

The Sam3LiteTextTextModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Full sequence of hidden states from the text encoder.
  • pooler_output (torch.FloatTensor of shape (batch_size, projection_dim)) — EOT-pooled output projected to projection_dim via the internal CLIP-style projection.
  • hidden_states (tuple(torch.FloatTensor), optional) — Tuple of hidden states at each layer, returned when output_hidden_states=True.
  • attentions (tuple(torch.FloatTensor), optional) — Tuple of attention weights at each transformer layer, returned when output_attentions=True.

Sam3LiteTextModel

class transformers.Sam3LiteTextModel


( config: Sam3LiteTextConfig )

forward


( pixel_values: torch.FloatTensor | None = None vision_embeds: transformers.models.sam3_lite_text.modeling_sam3_lite_text.Sam3LiteTextVisionEncoderOutput | None = None input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None text_embeds: torch.FloatTensor | None = None input_boxes: torch.FloatTensor | None = None input_boxes_labels: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) Sam3LiteTextImageSegmentationOutput or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using Sam3ImageProcessor. See Sam3ImageProcessor.__call__() for details (Sam3Processor uses Sam3ImageProcessor for processing images).
  • vision_embeds (Sam3LiteTextVisionEncoderOutput, optional) — Pre-computed vision embeddings. Can be used to easily reuse vision embeddings. If provided, pixel_values should not be passed. Mutually exclusive with pixel_values.
  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • text_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Pre-computed text embeddings. Can be used to easily reuse text embeddings. If provided, input_ids should not be passed. Mutually exclusive with input_ids.
  • input_boxes (torch.FloatTensor of shape (batch_size, num_boxes, 4), optional) — Normalized box coordinates in [0, 1] range, in (cx, cy, w, h) format.
  • input_boxes_labels (torch.LongTensor of shape (batch_size, num_boxes), optional) — Labels for boxes: 1 (positive), 0 (negative).
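
Note the format asymmetry: `input_boxes` expects normalized center-size coordinates, while predicted boxes come back in corner format. If you construct `input_boxes` by hand from pixel-space corner boxes, the conversion looks like this (a hypothetical helper, not part of the library):

```python
def xyxy_to_normalized_cxcywh(box, image_width, image_height):
    """Convert an absolute (x1, y1, x2, y2) box to the normalized
    (cx, cy, w, h) format expected by input_boxes. Illustrative helper."""
    x1, y1, x2, y2 = box
    return (
        (x1 + x2) / 2 / image_width,   # center x, normalized to [0, 1]
        (y1 + y2) / 2 / image_height,  # center y, normalized to [0, 1]
        (x2 - x1) / image_width,       # width, normalized
        (y2 - y1) / image_height,      # height, normalized
    )

print(xyxy_to_normalized_cxcywh((10, 20, 30, 60), 100, 100))  # (0.2, 0.4, 0.2, 0.4)
```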

Returns

Sam3LiteTextImageSegmentationOutput or tuple(torch.FloatTensor)

A Sam3LiteTextImageSegmentationOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Sam3LiteTextConfig) and inputs.

The Sam3LiteTextModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

  • pred_masks (torch.FloatTensor of shape (batch_size, num_queries, height, width)) — Predicted segmentation masks for each query.
  • pred_boxes (torch.FloatTensor of shape (batch_size, num_queries, 4)) — Predicted bounding boxes in (x1, y1, x2, y2) format.
  • pred_logits (torch.FloatTensor of shape (batch_size, num_queries), optional) — Classification confidence scores for each query, computed via dot product between decoder query features and text features.
  • presence_logits (torch.FloatTensor of shape (batch_size, 1), optional) — Presence logits from the DETR decoder presence token (last layer only). These indicate whether objects are present in the scene. Can be used to compute final scores by multiplying with pred_logits: final_scores = pred_logits.sigmoid() * presence_logits.sigmoid().
  • semantic_seg (torch.FloatTensor of shape (batch_size, 1, height, width), optional) — Semantic segmentation output.
  • decoder_hidden_states (tuple[torch.FloatTensor], optional) — Tuple of hidden states from all DETR decoder layers. Each tensor has shape (batch_size, num_queries, hidden_size).
  • decoder_reference_boxes (torch.FloatTensor of shape (num_layers, batch_size, num_queries, 4), optional) — Reference boxes from all DETR decoder layers.
  • encoder_hidden_states (tuple[torch.FloatTensor], optional) — Tuple of hidden states from all DETR encoder layers.
  • vision_hidden_states (tuple[torch.FloatTensor], optional) — Tuple of hidden states from all vision encoder (ViT) layers.
  • vision_attentions (tuple[torch.FloatTensor], optional) — Attention weights from vision encoder (ViT) layers.
  • detr_encoder_attentions (tuple[torch.FloatTensor], optional) — Attention weights from DETR encoder layers.
  • detr_decoder_attentions (tuple[torch.FloatTensor], optional) — Attention weights from DETR decoder layers (self-attention and cross-attention).
  • mask_decoder_attentions (tuple[torch.FloatTensor], optional) — Attention weights from mask decoder layers.
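
The presence-weighted scoring described for `presence_logits` can be sketched in scalar form (pure Python for illustration; in the model this is an elementwise tensor operation):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def final_score(pred_logit, presence_logit):
    """Combine per-query and presence logits as described above:
    final_scores = pred_logits.sigmoid() * presence_logits.sigmoid()."""
    return sigmoid(pred_logit) * sigmoid(presence_logit)

# A confident query in a scene with weak presence evidence is down-weighted.
print(round(final_score(2.0, -1.0), 3))  # 0.237
```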

Example:

>>> from PIL import Image
>>> import httpx
>>> from io import BytesIO
>>> from transformers import AutoModel, AutoProcessor

>>> model = AutoModel.from_pretrained("facebook/sam3_lite_text")
>>> processor = AutoProcessor.from_pretrained("facebook/sam3_lite_text")

>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/sam-car.png"
>>> with httpx.stream("GET", url) as response:
...     image = Image.open(BytesIO(response.read())).convert("RGB")
>>> text = "car"
>>> inputs = processor(images=image, text=text, return_tensors="pt")

>>> # Get segmentation output
>>> outputs = model(**inputs)
>>> pred_masks = outputs.pred_masks
>>> pred_boxes = outputs.pred_boxes

Sam3LiteTextPreTrainedModel

class transformers.Sam3LiteTextPreTrainedModel


( config: PreTrainedConfig *inputs **kwargs )

Parameters

  • config (PreTrainedConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

_forward_unimplemented


( *input: typing.Any )

Define the computation performed at every call.

Should be overridden by all subclasses.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
