Transformers documentation

TimesFM 2.5


This model was released on 2025-09-15 and added to Hugging Face Transformers on 2026-02-26.

PyTorch FlashAttention SDPA

TimesFM 2.5

Overview

TimesFM 2.5 (Time Series Foundation Model) is a pretrained time-series foundation model proposed in A decoder-only foundation model for time-series forecasting by Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. It builds on the original TimesFM architecture with rotary attention, QK normalization, per-dimension attention scaling, and continuous quantile prediction.

The abstract from the paper is the following:

Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a decoder style attention model with input patching, using a large time-series corpus comprising both real-world and synthetic datasets. Experiments on a diverse set of previously unseen forecasting datasets suggests that the model can yield accurate zero-shot forecasts across different domains, forecasting horizons and temporal granularities.

This model was contributed by kashif. The original code can be found here.

You can find the checkpoint at google/timesfm-2.5-200m-transformers.

Usage example

import numpy as np
import torch
from transformers import TimesFm2_5ModelForPrediction


model = TimesFm2_5ModelForPrediction.from_pretrained(
    "google/timesfm-2.5-200m-transformers",
    device_map="auto",
)

# The model accepts a list of 1D series; each series may have a different length.
forecast_input = [
    np.sin(np.linspace(0, 20, 100)),
    np.sin(np.linspace(0, 20, 200)),
    np.sin(np.linspace(0, 20, 400)),
]
forecast_input_tensor = [
    torch.tensor(ts, dtype=torch.float32, device=model.device) for ts in forecast_input
]

with torch.no_grad():
    outputs = model(past_values=forecast_input_tensor, return_dict=True)
    point_forecast = outputs.mean_predictions    # (batch_size, horizon_length)
    quantile_forecast = outputs.full_predictions # (batch_size, horizon_length, quantiles)
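full_predictions stacks the point and quantile forecasts along the last axis. A minimal sketch of slicing out individual quantile tracks, using a stand-in tensor of the documented shape; the channel layout (mean at index 0, followed by the quantiles in config order, so the median lands at the default decode_index of 5) is an assumption inferred from the config defaults:

```python
import torch

quantiles = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # config default

# Stand-in for outputs.full_predictions with shape
# (batch_size, horizon_length, 1 + len(quantiles)).
# Assumption: channel 0 holds the mean, channels 1..9 the quantiles in order,
# so the median (0.5 quantile) sits at index 5 -- the default decode_index.
full_predictions = torch.randn(3, 128, 1 + len(quantiles))

median_forecast = full_predictions[..., 1 + quantiles.index(0.5)]  # index 5
lower_band = full_predictions[..., 1 + quantiles.index(0.1)]
upper_band = full_predictions[..., 1 + quantiles.index(0.9)]
print(median_forecast.shape)  # torch.Size([3, 128])
```

The lower/upper bands give an 80% central prediction interval under this layout.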

TimesFm2_5Config

class transformers.TimesFm2_5Config

( patch_length: int = 32 context_length: int = 16384 horizon_length: int = 128 quantiles: list = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] hidden_size: int = 1280 intermediate_size: int = 1280 head_dim: int = 80 num_attention_heads: int = 16 num_key_value_heads: int = 16 num_hidden_layers: int = 20 rms_norm_eps: float = 1e-06 attention_dropout: float = 0.0 attention_bias: bool = False initializer_range: float = 0.02 output_quantile_len: int = 1024 decode_index: int = 5 use_bias: bool = False activation: str = 'swish' use_continuous_quantile_head: bool = True force_flip_invariance: bool = True infer_is_positive: bool = True max_position_embeddings: int = 16384 rope_parameters: transformers.modeling_rope_utils.RopeParameters | dict[str, transformers.modeling_rope_utils.RopeParameters] | None = None **kwargs )

Parameters

  • patch_length (int, optional, defaults to 32) — The length of one patch in the input sequence.
  • context_length (int, optional, defaults to 16384) — The length of the input context.
  • horizon_length (int, optional, defaults to 128) — The length of the prediction horizon.
  • quantiles (list[float], optional, defaults to [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]) — The quantiles to predict.
  • hidden_size (int, optional, defaults to 1280) — Size of the hidden layers.
  • intermediate_size (int, optional, defaults to 1280) — Dimension of the MLP representations.
  • head_dim (int, optional, defaults to 80) — Size of the key, query, value projections per attention head.
  • num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer.
  • num_key_value_heads (int, optional, defaults to 16) — Number of key-value heads. Set equal to num_attention_heads for full (non-grouped) attention.
  • num_hidden_layers (int, optional, defaults to 20) — Number of Transformer layers.
  • rms_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the RMS normalization layers.
  • attention_dropout (float, optional, defaults to 0.0) — The dropout probability for the attention scores.
  • attention_bias (bool, optional, defaults to False) — Whether to use bias in the attention linear projections.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • output_quantile_len (int, optional, defaults to 1024) — Length of the quantile output projection dimension.
  • decode_index (int, optional, defaults to 5) — Index into the quantile dimension used to extract the point (median) forecast.
  • use_bias (bool, optional, defaults to False) — Whether to use bias in MLP and transformer linear layers.
  • activation (str, optional, defaults to "swish") — Activation function used in MLP and residual block layers (any key from ACT2FN).
  • use_continuous_quantile_head (bool, optional, defaults to True) — Whether to use the continuous quantile head for non-median quantile predictions.
  • force_flip_invariance (bool, optional, defaults to True) — Whether to apply flip-invariance averaging during forecasting.
  • infer_is_positive (bool, optional, defaults to True) — Whether to clamp forecasts to non-negative values when the input minimum is non-negative.
  • max_position_embeddings (int, optional, defaults to 16384) — Maximum sequence length supported by the rotary position encoding.
  • rope_parameters (RopeParameters or dict[str, RopeParameters], optional) — Dictionary containing the RoPE configuration. Uses default rope type with theta=10000.0 when not set.

This is the configuration class to store the configuration of a TimesFm2_5ModelForPrediction. It is used to instantiate a TimesFM 2.5 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the TimesFM 2.5 google/timesfm-2.5-200m-transformers architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import TimesFm2_5Config, TimesFm2_5ModelForPrediction

>>> # Initializing a TimesFM 2.5 style configuration
>>> configuration = TimesFm2_5Config()

>>> # Initializing a model from the configuration
>>> model = TimesFm2_5ModelForPrediction(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
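To deviate from the defaults, pass the desired values as keyword arguments; a sketch overriding two fields from the parameter list above (unspecified fields keep their documented defaults):

```python
from transformers import TimesFm2_5Config, TimesFm2_5ModelForPrediction

# Shorter context and horizon than the defaults (16384 and 128 respectively).
configuration = TimesFm2_5Config(context_length=2048, horizon_length=64)
model = TimesFm2_5ModelForPrediction(configuration)
```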

TimesFm2_5Model

class transformers.TimesFm2_5Model

( config: TimesFm2_5Config )

forward

( past_values: Tensor past_values_padding: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) TimesFm2_5Output or tuple(torch.FloatTensor)

Parameters

  • past_values (torch.Tensor of shape (batch_size, sequence_length)) — Past values of the time series used as input to the model.
  • past_values_padding (torch.LongTensor of shape (batch_size, sequence_length), optional) — Padding mask for the input. 1 indicates padded (masked) time steps, 0 indicates valid values.

Returns

TimesFm2_5Output or tuple(torch.FloatTensor)

A TimesFm2_5Output or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (TimesFm2_5Config) and inputs.

The TimesFm2_5Model forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the model.

  • hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • loc (torch.Tensor of shape (batch_size,) or (batch_size, input_size), optional, defaults to None) — Shift values of each time series’ context window which is used to give the model inputs of the same magnitude and then used to shift back to the original magnitude.

  • scale (torch.Tensor of shape (batch_size,) or (batch_size, input_size), optional, defaults to None) — Scaling values of each time series’ context window which is used to give the model inputs of the same magnitude and then used to rescale back to the original magnitude.

  • context_mu (torch.Tensor of shape (batch_size, num_patches)) — Running means computed per input patch during normalization.

  • context_sigma (torch.Tensor of shape (batch_size, num_patches)) — Running standard deviations computed per input patch during normalization.
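The past_values_padding mask above uses 1 for padded (masked) steps and 0 for valid values. A minimal helper for batching variable-length series into that format; this helper is illustrative and not part of the library, and the left-padding convention is an assumption:

```python
import torch

def pad_and_mask(series_list, pad_value=0.0):
    """Left-pad 1D series to a common length.

    Returns (past_values, past_values_padding), where the mask is 1 on
    padded (masked) steps and 0 on valid values. Illustrative helper,
    not part of Transformers.
    """
    max_len = max(s.shape[0] for s in series_list)
    past_values = torch.full((len(series_list), max_len), pad_value)
    padding = torch.ones(len(series_list), max_len, dtype=torch.long)
    for i, s in enumerate(series_list):
        past_values[i, max_len - s.shape[0]:] = s
        padding[i, max_len - s.shape[0]:] = 0
    return past_values, padding

series = [torch.arange(3.0), torch.arange(5.0)]
values, mask = pad_and_mask(series)
print(mask.tolist())  # [[1, 1, 0, 0, 0], [0, 0, 0, 0, 0]]
```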

TimesFm2_5ModelForPrediction

class transformers.TimesFm2_5ModelForPrediction

( config: TimesFm2_5Config )

TimesFm2_5 model for quantile and mean prediction.

forward

( past_values: Sequence window_size: int | None = None future_values: torch.Tensor | None = None forecast_context_len: int | None = None truncate_negative: bool | None = None force_flip_invariance: bool | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) TimesFm2_5OutputForPrediction or tuple(torch.FloatTensor)

Parameters

  • past_values (collections.abc.Sequence[torch.Tensor]) — Past values of the time series that serves as input to the model. Each tensor is a 1D time series.
  • window_size (int, optional) — Window size of trend + residual decomposition. If None, decomposition is not applied.
  • future_values (torch.Tensor, optional) — Optional future values used to compute the loss.
  • forecast_context_len (int, optional) — Optional context length override used during forecasting.
  • truncate_negative (bool, optional) — Whether to clamp outputs to non-negative values. If None, defaults to config.infer_is_positive.
  • force_flip_invariance (bool, optional) — Whether to apply the flip-invariance combination. If None, defaults to config.force_flip_invariance.

Returns

TimesFm2_5OutputForPrediction or tuple(torch.FloatTensor)

A TimesFm2_5OutputForPrediction or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (TimesFm2_5Config) and inputs.

The TimesFm2_5ModelForPrediction forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the model.

  • hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • mean_predictions (torch.Tensor of shape (batch_size, horizon_length)) — Deterministic forecasts after denormalization.

  • full_predictions (torch.Tensor of shape (batch_size, horizon_length, quantiles)) — Quantile forecasts including the median after denormalization.

  • loss (torch.Tensor of shape (1,), optional, returned when future_values is provided) — Training loss combining MSE and quantile losses when targets are supplied.
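The quantile component of a combined loss like the one described above is commonly the pinball (quantile) loss. A generic sketch, not necessarily the exact formulation used internally by TimesFM 2.5:

```python
import torch

def pinball_loss(pred, target, q):
    """Pinball (quantile) loss for a quantile level q in (0, 1).

    Penalizes under-prediction with weight q and over-prediction with
    weight (1 - q); at q = 0.5 it reduces to half the absolute error.
    """
    err = target - pred
    return torch.maximum(q * err, (q - 1) * err).mean()

pred = torch.zeros(4)
target = torch.tensor([1.0, -1.0, 2.0, -2.0])
print(pinball_loss(pred, target, 0.5))  # tensor(0.7500)
```

Minimizing this loss over a batch drives pred toward the q-th quantile of the target distribution, which is why one such term per configured quantile can be combined with an MSE term on the mean forecast.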
