Image-Text-to-Text
Transformers
Safetensors
English
CASA_Helium1_VL_2B
custom_code
ameroyer and mboehle committed (verified)
Commit fc8600b · 0 parent(s)

Super-squash branch 'main' using huggingface_hub


Co-authored-by: mboehle <mboehle@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,6 @@
1
+ model-00001-of-00003.safetensors filter=lfs diff=lfs merge=lfs -text
2
+ model-00002-of-00003.safetensors filter=lfs diff=lfs merge=lfs -text
3
+ model-00003-of-00003.safetensors filter=lfs diff=lfs merge=lfs -text
4
+ readme_images/CASA.png filter=lfs diff=lfs merge=lfs -text
5
+ readme_images/casa_explainer.mp4 filter=lfs diff=lfs merge=lfs -text
6
+ readme_images/half_res_trimmed.mp4 filter=lfs diff=lfs merge=lfs -text
Notice ADDED
@@ -0,0 +1,2 @@
1
+ CASA-Helium1-VL-2B's image encoder is finetuned from the image encoder of Qwen2.5-VL-3B.
2
+ Qwen is licensed under the Qwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
README.md ADDED
@@ -0,0 +1,311 @@
1
+ ---
2
+ license: cc-by-nc-sa-4.0
3
+ language:
4
+ - en
5
+ base_model:
6
+ - kyutai/helium-1-2b
7
+ datasets:
8
+ - HuggingFaceM4/FineVision
9
+ - mvp-lab/LLaVA-OneVision-1.5-Instruct-Data
10
+ ---
11
+
12
+ <img align="right" src="readme_images/CASA.png" width="150px" >
13
+
14
+ # Model Card for CASA-Helium1-VL-2B
15
+
16
+ **CASA** ([Project Page][blog] . [arXiv][casa-arxiv] . [github][casa-git]) stands for **C**ross-**A**ttention via **S**elf-**A**ttention.
17
+ **CASA** is a vision-language fusion paradigm that aims to improve on cross-attention while preserving its practical benefits.
18
+
19
+ Specifically, **CASA** layers inject visual tokens into a text stream via image-to-text cross-attention while additionally enabling
20
+ text-to-text self-interaction in the same layer, constrained to smaller local attention windows.
21
+ This simple modification enables natural gating in the cross-attention mechanism, improving its performance and substantially closing the gap to standard token insertion methods.
22
+ For qualitative samples of CASA used for live video captioning, please check the [associated HuggingFace space](https://huggingface.co/spaces/kyutai/casa-samples).
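For intuition, here is a minimal, self-contained PyTorch sketch of the mechanism written for this card (it is not the released implementation, which lives in `casa_attention.py` and uses variable-length flash attention): within a local window, the text tokens act as queries, while keys and values are computed over the concatenation of image and text tokens, so a single softmax naturally gates between visual and textual context.

```python
import torch
import torch.nn.functional as F


def casa_window_attention(
    text: torch.Tensor, image: torch.Tensor, num_heads: int = 8
) -> torch.Tensor:
    """Toy CASA-style attention for a single local window (projections omitted).

    text:  (T, D) text token embeddings, used as queries.
    image: (I, D) image token embeddings inserted at the start of the window.
    Returns a (T, D) update for the text stream.
    """
    T, D = text.shape
    I = image.shape[0]
    head_dim = D // num_heads
    # Queries come from the text tokens only ...
    q = text.view(T, num_heads, head_dim).transpose(0, 1)
    # ... while keys/values cover image + text tokens jointly, so each text
    # token softly gates between visual and textual context in one softmax.
    kv = torch.cat([image, text], dim=0)
    k = kv.view(I + T, num_heads, head_dim).transpose(0, 1)
    v = kv.view(I + T, num_heads, head_dim).transpose(0, 1)
    # Image tokens are visible to every query; text-to-text attention is causal.
    allowed = torch.ones(T, I + T, dtype=torch.bool)
    allowed[:, I:] = torch.tril(torch.ones(T, T, dtype=torch.bool))
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=allowed.unsqueeze(0))
    return out.transpose(0, 1).reshape(T, D)


# Example: 16 text tokens attending over a window containing 4 image tokens.
print(casa_window_attention(torch.randn(16, 64), torch.randn(4, 64)).shape)
```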
23
+
24
+ ![](readme_images/casa_explainer.mp4)
25
+
26
+ ## Model Details
27
+
28
+ ### Model Description
29
+
30
+
31
+ This model page contains the weights for a CASA model trained from a pretrained text-only Helium1-2B backbone and the image encoder of Qwen2.5-VL-3B.
32
+ In the collection, we also provide weights for:
33
+ - [`CASA-Qwen2_5-VL-3B`](https://huggingface.co/kyutai/CASA-Qwen2_5-VL-3B): A CASA model adapted from the full pretrained `Qwen2.5-VL-3B` (the backbone LLM weights are kept frozen)
34
+ - [`CASA-Qwen2_5-VL-3B-LiveCC`](https://huggingface.co/kyutai/CASA-Qwen2_5-VL-3B-LiveCC): A CASA model adapted from the full pretrained `Qwen2.5-VL-3B` and further finetuned for live video captioning.
35
+ - [`Helium1-VL-2B`](https://huggingface.co/kyutai/Helium1-VL-2B): A reference VLM trained from Helium1-2B with a standard token-insertion mechanism in the same setting as `CASA-Helium1-VL-2B`.
36
+
37
+ Model Summary:
38
+ - **Developed by:** Kyutai
39
+ - **Model type:** Multimodal vision+text model based on Cross-Attention
40
+ - **Language(s) (NLP):** English
41
+ - **License:** CC-BY-NC-SA-4.0
42
+ - **LLM Backbone from:** [Helium1 2B](https://huggingface.co/kyutai/helium-1-2b)
43
+ - **Image Encoder from:** [Qwen2.5-VL 3B](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
44
+ - **Terms of use:** As the released models include frozen weights of the Qwen2.5-VL-3B image encoder, these weights are subject to the [Qwen RESEARCH LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE)
45
+
46
+
47
+ ### Model Sources
48
+
49
+ - **Project Page:** [kyutai.org/casa][blog]
50
+ - **Preprint:** [arXiv][casa-arxiv]
51
+ - **Repository:** [Github kyutai-labs/casa][casa-git]
52
+
53
+ ## Uses
54
+
55
+ ### Direct Use
56
+ The intended use of the Helium model is research and development of vision-language systems, including but not limited to image or video understanding.
57
+
58
+ `CASA-Helium1-VL-2B`, `Helium1-VL-2B`, and `CASA-Qwen2_5-VL-3B` can be used as vision-language models to analyze or interpret images given as inputs.
59
+
60
+ `CASA-Qwen2_5-VL-3B-LiveCC` can be used as a vision-language model on streaming video inputs at 2 fps.
61
+
62
+
63
+ The models are primarily intended for use in English. For most downstream use cases, the model should be aligned with supervised fine-tuning, RLHF, or related methods.
64
+
65
+
66
+ ### Out-of-Scope Use
67
+
68
+ The model should not be used in languages other than the ones on which it was trained.
69
+ The model is not intended to be used to impersonate other people or any malicious use of any kind.
70
+
71
+
72
+ ## Bias, Risks, and Limitations
73
+ Our CASA-Helium1 model has not been aligned to human preferences. As such, it can generate incorrect, biased, harmful, or generally unhelpful content. It should therefore not be used for downstream applications without further alignment, evaluation, and risk mitigation.
74
+
75
+ ### Recommendations
76
+
77
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
78
+
79
+ Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed to provide further recommendations.
80
+
81
+ ## How to Get Started with the Model
82
+
83
+ See our [github repository][casa-git] for additional scripts to perform benchmark evaluation and live video captioning.
84
+
85
+
86
+ Below is a short snippet showing how to load our models, process inputs, and run inference using a standard Hugging Face `transformers` pipeline and chat template.
87
+
88
+ ```python
89
+ # Minimal requirements:
90
+ # /// script
91
+ # requires-python = ">=3.10"
92
+ # dependencies = [
93
+ # "rich",
94
+ # "einops>=0.8.1",
95
+ # "torch==2.7.0",
96
+ # "transformers==4.51.3",
97
+ # "torchvision==0.22.0",
98
+ # "flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp310-cp310-linux_x86_64.whl"
99
+ # ]
100
+ # ///
101
+ import torch
102
+ from transformers.models.auto.modeling_auto import AutoModel
103
+ from transformers.models.auto.processing_auto import AutoProcessor
104
+
105
+ model_id = "kyutai/CASA-Helium1-VL-2B"
106
+ model = AutoModel.from_pretrained(
107
+ model_id,
108
+ torch_dtype=torch.bfloat16,
109
+ attn_implementation="flash_attention_2",
110
+ trust_remote_code=True,
111
+ ).cuda()
112
+ processor = AutoProcessor.from_pretrained(
113
+ model_id,
114
+ trust_remote_code=True,
115
+ )
116
+
117
+ conversation = [
118
+ {
119
+ "role": "user",
120
+ "content": [
121
+ {
122
+ "type": "image",
123
+ "image": "assets/casa_model.png",
124
+ },
125
+ {
126
+ "type": "text",
127
+ "text": "Describe this image.",
128
+ },
129
+ ],
130
+ },
131
+ ]
132
+ inputs = processor.tokenize_messages(messages=conversation)
133
+ inputs = inputs.to(model.device)
134
+ input_len = inputs["input_ids"].shape[1]
135
+ output_ids = model.generate_from_image(
136
+ **inputs,
137
+ max_new_tokens=512,
138
+ pre_image_tokens=processor.pre_image_tokens,
139
+ post_image_tokens=processor.post_image_tokens,
140
+ eos_token_id=model.generation_config.eos_token_id,
141
+ )[0, input_len:]
142
+ response = processor.tokenizer.decode(output_ids, skip_special_tokens=True)
143
+ print(response)
144
+ ```
145
+
146
+
147
+
148
+ ## Training Details
149
+
150
+ Please have a look at our associated [research paper][casa-arxiv] for details on the training pipeline.
151
+
152
+ ### Training Data
153
+
154
+ To train our CASA-Helium models we use the [FineVision](https://huggingface.co/datasets/HuggingFaceM4/FineVision)
155
+ dataset, as well as a small, non-overlapping subset of [Llava-OneVision-1.5-Instruct](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5).
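As a rough, illustrative sketch of mixing two Hugging Face datasets with a fixed sampling ratio (the toy data and the 90/10 ratio below are assumptions for illustration, not the actual training recipe):

```python
from datasets import Dataset, interleave_datasets

# Toy stand-ins for the two sources; the real loading and filtering are not shown.
finevision_like = Dataset.from_dict({"source": ["FineVision"] * 9})
onevision_like = Dataset.from_dict({"source": ["LLaVA-OneVision-1.5-Instruct"] * 1})

# Interleave with a fixed sampling ratio (90/10 here is only an example).
mixture = interleave_datasets(
    [finevision_like, onevision_like], probabilities=[0.9, 0.1], seed=0
)
print(mixture[0])
```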
156
+
157
+
158
+ ## Evaluation
159
+ We evaluate our models on a range of benchmarks covering document understanding (`DocVQA`), chart understanding (`ChartQA`, `InfoVQA`),
160
+ visual text reading (`TextVQA`, `OCRBench`), and general QA (`RealWorldQA`, `AI2D`, `GQA`, `MME`). Results are reported below. Please refer to our [project page][blog] and [arXiv paper][casa-arxiv] for additional evaluations.
161
+
162
+ <table style="border-collapse: collapse;">
163
+ <thead>
+ <tr>
164
+ <th rowspan="2" align="left">Model</th>
165
+ <th colspan="3" align="center">Document / Chart</th>
166
+ <th colspan="2" align="center">Scene Text</th>
167
+ <th colspan="4" align="center">Knowledge / QA</th>
168
+ </tr>
169
+ <tr>
170
+ <th>ChartQA</th>
171
+ <th>DocVQA</th>
172
+ <th>InfoVQA</th>
173
+ <th>OCRBench</th>
174
+ <th>TextVQA</th>
175
+ <th>RealWorldQA</th>
176
+ <th>AI2D</th>
177
+ <th>GQA</th>
178
+ <th>MME</th>
179
+ </tr>
180
+ </thead>
181
+ <tbody>
182
+ <tr>
183
+ <td align="left">Helium1-VL-2B</td>
184
+ <td>81.6</td><td>89.1</td><td>61.8</td>
185
+ <td>728</td><td>75.5</td>
186
+ <td>59.9</td><td>67.7</td><td>55.5</td><td>1732</td>
187
+ </tr>
188
+ <tr>
189
+ <td align="left"><span style="color:#fb923c;"><strong>CASA-Helium1-VL-2B</strong></span></td>
190
+ <td>73.4</td><td>83.7</td><td>48.6</td>
191
+ <td>723</td><td>71.0</td>
192
+ <td>58.3</td><td>63.3</td><td>54.6</td><td>1572</td>
193
+ </tr>
194
+ <tr>
195
+ <td align="left"><span style="color:#60a5fa;">mPLUG-Owl3 8B</span></td>
196
+ <td>59.2<sup>†</sup></td><td>55.9<sup>†</sup></td><td>36.8<sup>†</sup></td>
197
+ <td>527<sup>†</sup></td><td>69.0</td>
198
+ <td>63.9<sup>†</sup></td><td>73.4</td><td>65.0</td><td>1940<sup>†</sup></td>
199
+ </tr>
200
+ <tr>
201
+ <td align="left"><span style="color:#60a5fa;">mPLUG-Owl3 2B</span></td>
202
+ <td>48.5<sup>†</sup></td><td>48.2<sup>†</sup></td><td>28.1<sup>†</sup></td>
203
+ <td>450<sup>†</sup></td><td>62.6</td>
204
+ <td>56.9<sup>†</sup></td><td>62.6</td><td>61.0</td><td>1551<sup>†</sup></td>
205
+ </tr>
206
+ </tbody>
207
+ </table>
208
+ <p>
209
+ <sup>†</sup> Reproduced with the publicly available models on Hugging Face. &ensp;
210
+ <!-- ◇ Results and model not publicly available. -->
211
+ </p>
212
+
213
+ <p align="center">
214
+ <em>
215
+ Results for <code>CASA-Helium1-VL-2B</code> compared to a recent cross-attention baseline (blue), and our token-insertion
216
+ baseline (<code>Helium1-VL-2B</code>) trained in the same conditions. CASA outperforms current SoTA
217
+ cross-attention-based VLMs, narrowing the gap to insertion-based approaches.
218
+ </em>
219
+ </p>
220
+
221
+ <table style="border-collapse: collapse;">
222
+ <thead>
223
+ <tr>
224
+ <th rowspan="2" align="left">Model</th>
225
+ <th colspan="3" align="center">Document / Chart</th>
226
+ <th colspan="2" align="center">Scene Text</th>
227
+ <th colspan="4" align="center">Knowledge / QA</th>
228
+ </tr>
229
+ <tr>
230
+ <th>ChartQA</th>
231
+ <th>DocVQA</th>
232
+ <th>InfoVQA</th>
233
+ <th>OCRBench</th>
234
+ <th>TextVQA</th>
235
+ <th>RealWorldQA</th>
236
+ <th>AI2D</th>
237
+ <th>GQA</th>
238
+ <th>MME</th>
239
+ </tr>
240
+ </thead>
241
+
242
+ <tbody>
243
+ <tr>
244
+ <td align="left">
245
+ Qwen2.5-VL-3B
246
+ </td>
247
+ <td>84.0</td><td>93.6</td><td>77.1</td>
248
+ <td>797</td><td>79.3</td>
249
+ <td>62.2<sup>†</sup></td><td>81.6</td><td>61.0<sup>†</sup></td><td>2249<sup>†</sup></td>
250
+ </tr>
251
+ <tr>
252
+ <td align="left">
253
+ <span style="color:#fb923c;"><strong>CASA-Qwen2_5-VL-3B</strong></span>
254
+ </td>
255
+ <td>82.4</td><td>88.9</td><td>59.6</td>
256
+ <td>790</td><td>77.4</td>
257
+ <td>62.5</td><td>75.1</td><td>59.4</td><td>1918</td>
258
+ </tr>
259
+ </tbody>
260
+ </table>
261
+
262
+ <p>
263
+ <sup>†</sup> Reproduced with the publicly available models on Hugging Face.
264
+ </p>
265
+
266
+ <p align="center">
267
+ <em>
268
+ Results for <code>CASA-Qwen2_5-VL-3B</code>, adapted from frozen Qwen2.5-VL. CASA reaches performance close to the original
269
+ insertion-based model while training only
270
+ the CASA layers and last blocks of the image encoder.
271
+ </em>
272
+ </p>
273
+
274
+
275
+ ## Technical Specifications
276
+
277
+
278
+ ### Compute Infrastructure
279
+
280
+ `CASA-Helium1-VL-2B` was trained starting from a `Helium1-2B` LLM and the image encoder from `Qwen2.5-VL-3B`.
281
+ We finetune the whole LLM backbone as well as the last four blocks of the image encoder.
282
+ The currently released model was trained on four DGX nodes with 8 H100 GPUs.
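As a rough sketch of this kind of partial finetuning (the `blocks` attribute below is a placeholder for illustration, not the actual module name used in the released code):

```python
import torch


def freeze_all_but_last_blocks(
    vision_encoder: torch.nn.Module, num_trainable: int = 4
) -> None:
    """Freeze a ViT-style image encoder except for its last `num_trainable` blocks.

    Assumes the encoder exposes its transformer blocks as `vision_encoder.blocks`
    (a placeholder name for illustration).
    """
    for param in vision_encoder.parameters():
        param.requires_grad = False
    for block in list(vision_encoder.blocks)[-num_trainable:]:
        for param in block.parameters():
            param.requires_grad = True
    # The LLM backbone, by contrast, is left fully trainable.
```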
283
+
284
+
285
+ #### Software
286
+
287
+ Our training and inference code is implemented in PyTorch.
288
+
289
+ ## Citation
290
+
291
+ ```
292
+ @article{kyutai2025casa,
293
+ author = {Moritz Böhle and Amélie Royer and Juliette Marrie and Edouard Grave and Patrick Pérez},
294
+ year = {2025},
295
+ title = {CASA: Cross-Attention via Self-Attention},
296
+ journal = {ArXiv},
297
+ url = {https://arxiv.org/abs/2512.19535}
298
+ }
299
+ ```
300
+
301
+
302
+ ## Model Card Authors and Contact
303
+
304
+ * Amelie Royer
305
+ * Moritz Boehle
306
+ * Juliette Marrie
307
+
308
+
309
+ [blog]: https://kyutai.org/casa
310
+ [casa-arxiv]: https://arxiv.org/abs/2512.19535
311
+ [casa-git]: https://github.com/kyutai-labs/casa
__init__.py ADDED
File without changes
casa_attention.py ADDED
@@ -0,0 +1,1010 @@
1
+ """CASA layers"""
2
+
3
+ import bisect
4
+ from dataclasses import dataclass
5
+ from itertools import accumulate
6
+ from typing import TYPE_CHECKING, Callable, Literal, Sequence, TypedDict, overload
7
+ from typing import cast as type_cast
8
+
9
+ import torch
10
+ from transformers.configuration_utils import PretrainedConfig
11
+
12
+ from .utils import StreamingModule, StreamingState, delta_w_factory
13
+
14
+ if TYPE_CHECKING:
15
+ from transformers.configuration_utils import PretrainedConfig
16
+
17
+ try:
18
+ from flash_attn import flash_attn_varlen_func
19
+ except ImportError:
20
+ flash_attn_varlen_func = None # type: ignore
21
+
22
+
23
+ WindowsComputeKwargs = TypedDict(
24
+ "WindowsComputeKwargs",
25
+ {
26
+ "num_post_image_tokens": int,
27
+ "num_pre_image_tokens": int,
28
+ },
29
+ total=False,
30
+ )
31
+
32
+
33
+ def __split_n_merge__(
34
+ x: torch.Tensor,
35
+ sample_lengths: list[int],
36
+ padding_side: Literal["left", "right"] = "right",
37
+ pad_value: int | float | bool = 0,
38
+ ) -> torch.Tensor:
39
+ max_sample_length = max(sample_lengths)
40
+ pad_tuple = tuple(0 for _ in range((x.ndim - 1) * 2))
41
+ return torch.stack(
42
+ [
43
+ torch.nn.functional.pad(
44
+ _x,
45
+ pad_tuple + (0, max_sample_length - _x.shape[0])
46
+ if padding_side == "right"
47
+ else pad_tuple + (max_sample_length - _x.shape[0], 0),
48
+ value=pad_value,
49
+ )
50
+ for _x in torch.split(x, sample_lengths, dim=0)
51
+ ],
52
+ dim=0,
53
+ )
54
+
55
+
56
+ @overload
57
+ def insert_image_tokens(
58
+ inputs_embeds: torch.Tensor,
59
+ image_embeds: torch.Tensor | Sequence[torch.Tensor],
60
+ image_embeds_insertion_points: list[torch.Tensor],
61
+ recover_batch_dim: Literal[True],
62
+ attention_mask: torch.Tensor | None = None,
63
+ padding_side: Literal["left", "right"] = "right",
64
+ keep_only_attended: bool = False,
65
+ pad_output: int | float | bool = 0.0,
66
+ ) -> tuple[
67
+ torch.Tensor,
68
+ None,
69
+ torch.Tensor | None,
70
+ torch.Tensor,
71
+ ]: ...
72
+ @overload
73
+ def insert_image_tokens(
74
+ inputs_embeds: torch.Tensor,
75
+ image_embeds: torch.Tensor | Sequence[torch.Tensor],
76
+ image_embeds_insertion_points: list[torch.Tensor],
77
+ recover_batch_dim: Literal[False],
78
+ attention_mask: torch.Tensor | None = None,
79
+ padding_side: Literal["left", "right"] = "right",
80
+ keep_only_attended: bool = False,
81
+ pad_output: int | float | bool = 0.0,
82
+ ) -> tuple[
83
+ torch.Tensor,
84
+ list[int],
85
+ torch.Tensor | None,
86
+ torch.Tensor,
87
+ ]: ...
88
+ def insert_image_tokens(
89
+ inputs_embeds: torch.Tensor,
90
+ image_embeds: torch.Tensor | Sequence[torch.Tensor],
91
+ image_embeds_insertion_points: list[torch.Tensor],
92
+ recover_batch_dim: bool = True,
93
+ attention_mask: torch.Tensor | None = None,
94
+ padding_side: Literal["left", "right"] = "right",
95
+ keep_only_attended: bool = False,
96
+ pad_output: int | float | bool = 0.0,
97
+ ) -> tuple[
98
+ torch.Tensor | torch.Tensor,
99
+ list[int] | None,
100
+ torch.Tensor | torch.Tensor | None,
101
+ torch.Tensor | torch.Tensor,
102
+ ]:
103
+ """
104
+ Insert image embeddings into text embeddings
105
+
106
+ Args:
107
+ inputs_embeds (torch.Tensor): (B, S, D) input token embeddings.
108
+ image_embeds (torch.Tensor | list[torch.Tensor]): (N_images, Nt, D) | List[(Nt, D)] image token embeddings.
109
+ image_embeds_insertion_points (list[torch.Tensor]): Insertion indices.
110
+ attention_mask (torch.Tensor, optional): (B, S) attention mask.
111
+ padding_side (Literal["left", "right"]): Padding scheme. Controls behavior for padded images.
112
+ recover_batch_dim (bool): Whether to re-batch the fused sequence (True) or keep it flattened and also return per-sample lengths (False).
113
+ keep_only_attended: This is only applicable when recover_batch_dim is False; whether to
114
+ remove any non-attended tokens in the whole array. In this case, the attention
115
+ mask returned is **still the original one**, so we can remember which indices have been
116
+ removed
117
+ Returns:
118
+ output (torch.Tensor): (B, S + Ni * Nt) gather indices or (B, S + Ni * Nt, D) fused sequence
119
+ image_embeds (torch.Tensor): (B, Ni * Nt) image embeds, padded and batched if the input was a list
120
+ attention_mask (torch.Tensor): Same shape, 1 for real tokens, 0 for image and text padding.
121
+ image_tokens_mask (torch.Tensor): (B, S + Ni * Nt, 1), marks image token positions.
122
+ """
123
+ if isinstance(image_embeds, list) and len(image_embeds) == 0:
124
+ batch_size, text_seq_length, token_dim = inputs_embeds.shape
125
+ if recover_batch_dim:
126
+ return (
127
+ inputs_embeds,
128
+ None,
129
+ attention_mask,
130
+ torch.zeros((batch_size, text_seq_length, 1), dtype=torch.bool),
131
+ )
132
+ else:
133
+ flattened_seq_length = inputs_embeds.shape[0] * inputs_embeds.shape[1]
134
+ return (
135
+ torch.reshape(inputs_embeds, (flattened_seq_length, inputs_embeds.shape[2])),
136
+ [text_seq_length] * inputs_embeds.shape[0],
137
+ attention_mask.flatten() if attention_mask is not None else None,
138
+ torch.zeros((flattened_seq_length, 1), dtype=torch.bool),
139
+ )
140
+
141
+ # Sanity checks
142
+ if isinstance(image_embeds, torch.Tensor):
143
+ assert inputs_embeds.shape[-1] == image_embeds.shape[-1]
144
+ else:
145
+ assert all(inputs_embeds.shape[-1] == _x.shape[-1] for _x in image_embeds)
146
+
147
+ batch_size, text_seq_length, token_dim = inputs_embeds.shape
148
+ image_seq_length = [x.shape[0] for x in image_embeds]
149
+
150
+ # Flatten insertion points
151
+ insertion_offset = []
152
+ counter, offset_from_text, offset_from_image = 0, 0, 0
153
+ for sample in image_embeds_insertion_points:
154
+ for pt in sample:
155
+ insertion_offset.append(pt + offset_from_image + offset_from_text)
156
+ offset_from_image += image_seq_length[counter]
157
+ counter += 1
158
+ offset_from_text += text_seq_length
159
+ image_insert_positions = [
160
+ x for idx, pt in enumerate(insertion_offset) for x in range(pt, pt + image_seq_length[idx])
161
+ ]
162
+
163
+ # Flatten image embeds
164
+ if isinstance(image_embeds, list):
165
+ image_embeds = torch.cat(image_embeds, dim=0)
166
+ else:
167
+ image_embeds = type_cast(torch.Tensor, image_embeds)
168
+ image_embeds = torch.reshape(image_embeds, (-1, token_dim))
169
+
170
+ # Flatten text embeds across batch dim (B x S, D)
171
+ inputs_embeds = torch.reshape(inputs_embeds, (-1, token_dim))
172
+ flattened_seq_length = inputs_embeds.shape[0] + sum(image_seq_length)
173
+ text_insert_positions = sorted(
174
+ set(range(flattened_seq_length)).difference(set(image_insert_positions))
175
+ )
176
+
177
+ # Scatter image embeds in the flattened dict
178
+ # scatter text related stuff
179
+ output = torch.empty(
180
+ (flattened_seq_length, token_dim),
181
+ device=inputs_embeds.device,
182
+ dtype=inputs_embeds.dtype,
183
+ )
184
+ txt_positions_tensor = torch.Tensor(text_insert_positions).to(
185
+ dtype=torch.long, device=inputs_embeds.device
186
+ )
187
+ output.scatter_(0, txt_positions_tensor[:, None].expand(-1, token_dim), inputs_embeds)
188
+ attention_mask_new: torch.Tensor | None = None
189
+ if attention_mask is not None:
190
+ attention_mask_new = torch.ones(
191
+ (flattened_seq_length,), dtype=torch.bool, device=inputs_embeds.device
192
+ )
193
+ attention_mask_new.scatter_(
194
+ 0, txt_positions_tensor, attention_mask.flatten().to(torch.bool)
195
+ )
196
+
197
+ # scatter image related stuff
198
+ image_tokens_mask = torch.zeros(
199
+ (flattened_seq_length,), dtype=torch.bool, device=inputs_embeds.device
200
+ )
201
+ img_positions_tensor = torch.Tensor(image_insert_positions).to(
202
+ device=inputs_embeds.device, dtype=torch.long
203
+ )
204
+ output.scatter_(0, img_positions_tensor[:, None].expand(-1, token_dim), image_embeds)
205
+ image_tokens_mask.scatter_(0, img_positions_tensor, True)
206
+
207
+ # Compute expected sample length, taking into account the real batch
208
+ # i.e. recover the batch dimension of image embeddings
209
+ sample_lengths = []
210
+ counter = 0
211
+ for sample_idx, pts in enumerate(image_embeds_insertion_points):
212
+ num_image_tokens = 0
213
+ for _ in pts:
214
+ num_image_tokens += image_seq_length[counter]
215
+ counter += 1
216
+ if keep_only_attended and attention_mask is not None:
217
+ attended_seq_length = torch.sum(attention_mask[sample_idx]).cpu().item()
218
+ sample_lengths.append(attended_seq_length + num_image_tokens)
219
+ else:
220
+ sample_lengths.append(text_seq_length + num_image_tokens)
221
+
222
+ # For CASA attention, we can keep everything flattened and return
223
+ # the sample_lengths for the blockwise attention
224
+ if not recover_batch_dim:
225
+ if keep_only_attended and attention_mask_new is not None:
226
+ output = output[attention_mask_new]
227
+ image_tokens_mask = image_tokens_mask[attention_mask_new]
228
+ return output, sample_lengths, attention_mask_new, image_tokens_mask[..., None]
229
+
230
+ # Otherwise, time to (pad) and reshape
231
+ # Easy case: everything has the same length
232
+ if all(x == sample_lengths[0] for x in sample_lengths):
233
+ output = torch.reshape(output, (batch_size, sample_lengths[0], token_dim))
234
+ image_tokens_mask = torch.reshape(image_tokens_mask, (batch_size, sample_lengths[0], 1))
235
+ if attention_mask_new is not None:
236
+ attention_mask_new = torch.reshape(attention_mask_new, (batch_size, sample_lengths[0]))
237
+ # if there is any size mismatch we break into a
238
+ # list and pad again
239
+ else:
240
+ # split and merge
241
+ output = __split_n_merge__(output, sample_lengths, padding_side, pad_value=pad_output)
242
+ # note that the extra padding tokens are also marked as image tokens to be removed later
243
+ image_tokens_mask = __split_n_merge__(
244
+ image_tokens_mask, sample_lengths, padding_side, True
245
+ )[:, :, None]
246
+ if attention_mask_new is not None:
247
+ attention_mask_new = __split_n_merge__(
248
+ attention_mask_new, sample_lengths, padding_side, 0
249
+ )
250
+ # Return
251
+ return output, sample_lengths, attention_mask_new, image_tokens_mask
252
+
253
+
254
+ def get_sample_lengths_from_insertion_points(
255
+ image_embeds_insertion_points: list[torch.Tensor],
256
+ image_embeds: torch.Tensor | list[torch.Tensor] | None,
257
+ total_seq_len: int | None = None,
258
+ attention_mask: torch.Tensor | None = None,
259
+ **kwargs: WindowsComputeKwargs,
260
+ ) -> tuple[list[tuple[int, bool]], list[int]]:
261
+ """Compute sample lengths as if each image insertion point defines a
262
+ new document (ex document ID)
263
+ """
264
+ num_post_image_tokens = type_cast(int, kwargs.get("num_post_image_tokens", 0))
265
+ num_pre_image_tokens = type_cast(int, kwargs.get("num_pre_image_tokens", 0))
266
+ squashed_samples_lengths = type_cast(
267
+ list[list[int]] | None, kwargs.get("squashed_samples_lengths", None)
268
+ )
269
+ if squashed_samples_lengths is not None:
270
+ assert len(squashed_samples_lengths) == len(image_embeds_insertion_points)
271
+
272
+ def __insert_next_sample__(
273
+ batch_idx: int, insrt_pt: int, last_insrt_pt: int, end_of_batch_sample: bool = False
274
+ ) -> None:
275
+ nonlocal attention_mask
276
+ nonlocal text_sample_lengths, full_sample_lengths
277
+ nonlocal cum_samples_lengths, current_image_offset
278
+ nonlocal last_image_idx, current_image_idx, current_length
279
+ # Add the sample between [last_insrt_pt, insrt_pt] with breaks in
280
+ # between any squashed samples we find on the way
281
+ start_pt = bisect.bisect_left(cum_samples_lengths, last_insrt_pt)
282
+ added_sample = False
283
+ for end_of_sample in cum_samples_lengths[start_pt:]:
284
+ # we will break the loop at the end when end_of_sample = insrt_pt
285
+ end_of_sample = min(end_of_sample, insrt_pt)
286
+
287
+ # Add between [last_insrt_pt, end_of_sample]
288
+ current_length = end_of_sample - last_insrt_pt
289
+ if attention_mask is not None:
290
+ current_length -= int(
291
+ torch.sum(~attention_mask[batch_idx, last_insrt_pt:end_of_sample]).item()
292
+ )
293
+ if current_length > 0:
294
+ added_sample = True
295
+ text_sample_lengths.append(
296
+ (current_length, end_of_batch_sample and insrt_pt == end_of_sample)
297
+ )
298
+ # add image tokens to current_length
299
+ if current_image_idx > 0 and image_embeds is not None:
300
+ images_in_sample = [
301
+ img_idx
302
+ for img_idx in range(last_image_idx, current_image_idx)
303
+ if img_idx < len(image_embeds_insertion_points[batch_idx])
304
+ and last_insrt_pt
305
+ <= image_embeds_insertion_points[batch_idx][img_idx]
306
+ < end_of_sample
307
+ ]
308
+ if len(images_in_sample) > 0:
309
+ num_image_tokens = sum(
310
+ _x.shape[0]
311
+ for _x in image_embeds[
312
+ current_image_offset + images_in_sample[0] : current_image_offset
313
+ + images_in_sample[-1]
314
+ + 1
315
+ ]
316
+ )
317
+ current_length += num_image_tokens
318
+ full_sample_lengths.append(current_length)
319
+
320
+ # prepare for next loop
321
+ last_insrt_pt = end_of_sample
322
+ if end_of_sample == insrt_pt:
323
+ break
324
+ # End of loop: Catching weird use case where we may end up on a span
325
+ # full of padding tokens which will not get added due to current_length > 0
326
+ if end_of_batch_sample:
327
+ assert added_sample, "Weird edge case. Don't do that, thank you"
328
+ text_sample_lengths[-1] = (text_sample_lengths[-1][0], True)
329
+
330
+ # End of loop: Catching weird use case where we may end up on a span
331
+ # full of padding tokens which will not get added due to current_length > 0
332
+ if end_of_batch_sample:
333
+ assert added_sample, "Weird edge case. Don't do that, thank you"
334
+ text_sample_lengths[-1] = (text_sample_lengths[-1][0], True)
335
+
336
+ current_image_offset = 0
337
+ text_sample_lengths, full_sample_lengths = [], []
338
+ cum_samples_lengths: list[int] = []
339
+ current_length, last_insrt_pt, last_image_idx, current_image_idx = 0, 0, 0, 0
340
+ for batch_idx, pts in enumerate(image_embeds_insertion_points):
341
+ if squashed_samples_lengths is not None:
342
+ cum_samples_lengths = list(accumulate(squashed_samples_lengths[batch_idx]))
343
+ else:
344
+ assert total_seq_len is not None
345
+ cum_samples_lengths = [total_seq_len]
346
+
347
+ for current_image_idx, insrt_pt in enumerate(pts.cpu().tolist()):
348
+ # check if the images are consecutive in which way we want
349
+ # them to belong to the same window
350
+ if current_image_idx >= 1 and insrt_pt == (
351
+ image_embeds_insertion_points[batch_idx][current_image_idx - 1]
352
+ + num_pre_image_tokens
353
+ + num_post_image_tokens
354
+ ):
355
+ continue
356
+ # Otherwise, we found a new sample
357
+ # not very important but for completeness: the insertion points come *after*
358
+ # the pre-image tokens per design but for the document-id mask it is more consistent to
359
+ # have them correspond to the same image
360
+ insrt_pt -= num_pre_image_tokens
361
+
362
+ # Update text and full sample lengths
363
+ if insrt_pt > last_insrt_pt:
364
+ __insert_next_sample__(
365
+ batch_idx, insrt_pt, last_insrt_pt, end_of_batch_sample=False
366
+ )
367
+ last_image_idx = current_image_idx
368
+ last_insrt_pt = insrt_pt
369
+
370
+ # End of batch: add sample in progress and reset
371
+ current_image_idx += 1
372
+ if cum_samples_lengths[-1] > last_insrt_pt:
373
+ __insert_next_sample__(
374
+ batch_idx, cum_samples_lengths[-1], last_insrt_pt, end_of_batch_sample=True
375
+ )
376
+ current_length, last_insrt_pt, last_image_idx, current_image_idx = 0, 0, 0, 0
377
+ current_image_offset += len(pts)
378
+
379
+ # Sanity check that the is_eob markers are correctly placed
380
+ assert sum(_x[1] for _x in text_sample_lengths) == len(image_embeds_insertion_points), (
381
+ f"Number of eob markers ({sum(_x[1] for _x in text_sample_lengths)}) differs"
382
+ f" from original batch size ({len(image_embeds_insertion_points)})"
383
+ )
384
+ return text_sample_lengths, full_sample_lengths
385
+
386
+
387
+ class CASAAttentionHandler:
388
+ def __init__(
389
+ self,
390
+ inputs_embeds: torch.Tensor,
391
+ image_embeds: torch.Tensor | list[torch.Tensor],
392
+ image_embeds_insertion_points: list[torch.Tensor],
393
+ attention_mask: torch.Tensor | None = None,
394
+ rope_fn: Callable | None = None,
395
+ windows: Literal["batch", "squashed", "images", "turn_based"] = "images",
396
+ use_asymetric_q_kv: bool = True,
397
+ casa_windows_info: None | dict = None,
398
+ ):
399
+ """Initialize the structure holding the query buffer for CASA attention layers
400
+ (ie the **flattened** text+image inserted tokens).
401
+ Note that this structure is shared across all casa layers, and it gets updated
402
+ with the current hidden states at every layer; this is merely a buffer to keep
403
+ scatter_ operations in-place as much as possible
404
+
405
+ In this module, the embeddings related values (image_tokens_mask,
406
+ text_sample_lengths etc) are stored under the assumption of a tensor
407
+ which is *flattened* and *without padding tokens*
408
+ Only the attention mask is kept as-is (text-only, batched, padded) to
409
+ be able to recover original shapes when needed
410
+ """
411
+ super().__init__()
412
+ assert windows == "images" # for inference code release
413
+ # Note 1: Unless overridden, text/full_sample_lengths are defined such that one
414
+ # document = one sample in the batch
415
+ if attention_mask is None:
416
+ text_sample_lengths = [(_x.shape[0], True) for _x in inputs_embeds]
417
+ else:
418
+ text_sample_lengths = [(int(torch.sum(_x).item()), True) for _x in attention_mask]
419
+ (
420
+ full_inputs_embeds,
421
+ full_sample_lengths,
422
+ # Full attention mask is only needed at inference to
423
+ # flatten the KV-Cache and remove padding tokens
424
+ _,
425
+ self.image_tokens_mask,
426
+ ) = insert_image_tokens(
427
+ inputs_embeds=inputs_embeds,
428
+ image_embeds=image_embeds,
429
+ image_embeds_insertion_points=image_embeds_insertion_points,
430
+ attention_mask=attention_mask,
431
+ recover_batch_dim=False,
432
+ keep_only_attended=attention_mask is not None,
433
+ )
434
+ assert self.image_tokens_mask.ndim == 2
435
+ self.image_embeds = image_embeds
436
+ self.image_embeds_insertion_points = image_embeds_insertion_points
437
+ self.attention_mask = None if attention_mask is None else attention_mask.bool()
438
+ self.use_asymetric_qkv = use_asymetric_q_kv
439
+ # At inference, we have to use asymetric QKV for efficiency
440
+ if self.attention_mask is not None:
441
+ self.use_asymetric_qkv = True
442
+
443
+ # Build CASA windows
444
+ assert casa_windows_info is not None
445
+ text_sample_lengths, full_sample_lengths = get_sample_lengths_from_insertion_points(
446
+ image_embeds_insertion_points=image_embeds_insertion_points,
447
+ image_embeds=image_embeds,
448
+ total_seq_len=inputs_embeds.shape[1],
449
+ attention_mask=self.attention_mask,
450
+ **casa_windows_info, # pyright: ignore
451
+ )
452
+
453
+ # Sanity checks on the sample lengths
454
+ self.text_sample_lengths = [(int(s), eob) for s, eob in text_sample_lengths if s > 0]
455
+ self.full_sample_lengths = [int(s) for s in full_sample_lengths if s > 0]
456
+
457
+ assert len(self.text_sample_lengths) == len(self.full_sample_lengths), (
458
+ f"Sanity check failed; text sample lengths {len(self.text_sample_lengths)}"
459
+ f" != full sample lengths {len(self.full_sample_lengths)}"
460
+ )
461
+ if self.attention_mask is None:
462
+ num_unpadded_text_tokens = inputs_embeds.shape[0] * inputs_embeds.shape[1]
463
+ else:
464
+ num_unpadded_text_tokens = int(
465
+ torch.sum(type_cast(torch.Tensor, attention_mask)).item()
466
+ )
467
+ assert sum(_x[0] for _x in self.text_sample_lengths) == num_unpadded_text_tokens, (
468
+ f"Sanity check failed; sample lengths {sum(self.full_sample_lengths)} != {full_inputs_embeds.shape[0]}"
469
+ )
470
+ assert sum(self.full_sample_lengths) == full_inputs_embeds.shape[0], (
471
+ f"Sanity check failed; sample lengths {sum(self.full_sample_lengths)} != {full_inputs_embeds.shape[0]}"
472
+ )
473
+
474
+ # Finally we can compute cu_seqlen based on sample lengths
475
+ self.max_seqlen_q = max(self.text_sample_lengths)[0]
476
+ self.cu_seqlens_q = self.get_cu_seqlens(
477
+ [x[0] for x in self.text_sample_lengths], device=inputs_embeds.device
478
+ )
479
+
480
+ self.max_seqlen_kv = max(self.full_sample_lengths)
481
+ self.cu_seqlens_kv = self.get_cu_seqlens(
482
+ self.full_sample_lengths, device=inputs_embeds.device
483
+ )
484
+
485
+ # For inference: We save the length of the current document
486
+ # to trim the KV cache appropriately
487
+ self.current_doc_lengths = self.full_sample_lengths
488
+
489
+ # Precompute position embeddings
490
+ self.position_embeds = None
491
+ self.rope_fn = rope_fn
492
+ if self.rope_fn is not None:
493
+ self.position_embeds = self.compute_position_embeddings(
494
+ self.rope_fn, full_sample_lengths, dummy_for_dtype_and_device=full_inputs_embeds
495
+ )
496
+
497
+ @property
498
+ def batch_lengths(self) -> list[int]:
499
+ """Return a (batch_size,) list of integers containing the
500
+ number of (non-padded) text tokens for each sample in the batch"""
501
+ bls = [0]
502
+ for ln, eob in self.text_sample_lengths:
503
+ bls[-1] += ln
504
+ if eob:
505
+ bls.append(0)
506
+ return bls[:-1]
507
+
508
+ @property
509
+ def full_batch_lengths(self) -> list[int]:
510
+ """Same as batch_lengths for text+image tokens"""
511
+ bls = [0]
512
+ for (_, eob), ln in zip(self.text_sample_lengths, self.full_sample_lengths):
513
+ bls[-1] += ln
514
+ if eob:
515
+ bls.append(0)
516
+ return bls[:-1]
517
+
518
+ def get_cu_seqlens(
519
+ self, sample_lengths: list[int], device: torch.device | None
520
+ ) -> torch.Tensor:
521
+ """Update cu_seqlengths according to the given sample_lengths"""
522
+ return torch.Tensor(list(accumulate(sample_lengths, initial=0))).to(
523
+ dtype=torch.int32, device=device
524
+ )
525
+
526
+ def compute_position_embeddings(
527
+ self,
528
+ rope_fn: Callable,
529
+ sample_lengths: list[int],
530
+ dummy_for_dtype_and_device: torch.Tensor,
531
+ ) -> tuple[torch.Tensor, torch.Tensor]:
532
+ """Compute info required for position embeddings. Can be override e.g. for Qwen"""
533
+ # option 1: Standard range
534
+ # position_ids = torch.arange(0, full_inputs_embeds.shape[0])
535
+ # option 2: Follows document boundary
536
+ position_ids = torch.cat([torch.arange(0, lg) for lg in sample_lengths], dim=0)
537
+ return rope_fn(
538
+ dummy_for_dtype_and_device,
539
+ position_ids.to(dummy_for_dtype_and_device.device)[None, ...],
540
+ )
541
+
542
+ def get_position_embedding(
543
+ self,
544
+ key: Literal["q", "kv"],
545
+ num_queries: int = 0,
546
+ ) -> tuple[torch.Tensor, torch.Tensor] | None:
547
+ if self.position_embeds is None:
548
+ return None
549
+ cos, sin = self.position_embeds
550
+ bls = self.full_batch_lengths
551
+ # For Q, we only want the text-only posembeds
552
+ if key == "q" and self.use_asymetric_qkv:
553
+ bls = self.batch_lengths
554
+ cos, sin = cos[:, ~self.image_tokens_mask[:, 0]], sin[:, ~self.image_tokens_mask[:, 0]]
555
+ elif key not in {"q", "kv"}:
556
+ raise ValueError(f"Unknow for position embedding {key}")
557
+
558
+ # Easy case: training or first step at inference: we use all the posembeds
559
+ if num_queries == 0:
560
+ return cos, sin
561
+ # If num queries is given, we need to trim for *every sample in the batch*
562
+ cos = [x[:, -num_queries:] for x in torch.split(cos, bls, dim=1)]
563
+ sin = [x[:, -num_queries:] for x in torch.split(sin, bls, dim=1)]
564
+ return torch.cat(cos, dim=1), torch.cat(sin, dim=1)
565
+
566
+ def get_full_embeds(
567
+ self, hidden_states: torch.Tensor, norm_fn: Callable | None
568
+ ) -> torch.Tensor:
569
+ """Update attended hidden states in the current query buffer
570
+
571
+ :param hidden_states: (b, s, d) Tensor input to the CASA attention layer
572
+ """
573
+ assert self.image_embeds is not None
574
+ return insert_image_tokens(
575
+ inputs_embeds=hidden_states,
576
+ image_embeds=self.image_embeds
577
+ if norm_fn is None
578
+ else norm_fn(self.image_embeds)
579
+ if isinstance(self.image_embeds, torch.Tensor)
580
+ else [norm_fn(_x) for _x in self.image_embeds],
581
+ image_embeds_insertion_points=self.image_embeds_insertion_points,
582
+ attention_mask=self.attention_mask,
583
+ recover_batch_dim=False,
584
+ keep_only_attended=self.attention_mask is not None,
585
+ )[0][None, :, :]
586
+
587
+ def recover_text_embeds(
588
+ self,
589
+ hidden_states_out: torch.Tensor,
590
+ hidden_states_in: torch.Tensor,
591
+ update_image_embeddings: bool = False,
592
+ ) -> torch.Tensor:
593
+ """Returns text embeddings from the query buffer, including non-attended tokens at inference"""
594
+ if update_image_embeddings and not self.use_asymetric_qkv:
595
+ raise NotImplementedError("Implement image embeddings updates for asymetric QKV")
596
+ # Remove image tokens in the symetric case
597
+ if not self.use_asymetric_qkv:
598
+ hidden_states_out = hidden_states_out[~self.image_tokens_mask[:, 0]]
599
+
600
+ # if there's no attention mask, we are in the right-padded case
601
+ # (keep_only_attended = False) we can directly return the query
602
+ # outputs (which don't contain the image)
603
+ if self.attention_mask is None:
604
+ return hidden_states_out
605
+
606
+ # Otherwise, we need to "scatter" back only the text-attended tokens to the original
607
+ # hidden states, which contain the paddings
608
+ num_queries = hidden_states_in.shape[1]
609
+
610
+ # Case 1: the padded hidden_states_in is larger than hidden_states_out
611
+ # we rebatch+pad hidden_state_out before doing the scattering
612
+ if hidden_states_out.shape[0] != hidden_states_in.shape[0] * hidden_states_in.shape[1]:
613
+ s = torch.split(hidden_states_out, self.batch_lengths, dim=0)
614
+ assert max(_s.shape[0] for _s in s) <= num_queries # sanity check
615
+ s = [
616
+ torch.nn.functional.pad(_s, (0, 0, num_queries - _s.shape[0], 0), value=0)
617
+ for _s in s
618
+ ]
619
+ return torch.where(
620
+ self.attention_mask[:, -num_queries:, None],
621
+ torch.stack(s),
622
+ hidden_states_in,
623
+ )
624
+ # If both have the same shape, it means hidden_states_in contained no padding
625
+ # so we can directly return hidden states out
626
+ return hidden_states_out
627
+
628
+ def extend(self, num_tokens: int, offset: int = 0):
629
+ """Extend all necessary values of the Handler for infenrece
630
+ Note: this implementation curently assumes a single conversation at a time
631
+ (otherwise image tokens mask would have to change) and that tokens added are
632
+ attended to"""
633
+ # image embeds is inserted in the first step and stored in the KV cache
634
+ self.image_embeds = None
635
+
636
+ # Update attention mask (non-flattened) (assumes all new tokens are attended to)
637
+ if self.attention_mask is not None:
638
+ self.attention_mask = torch.nn.functional.pad(
639
+ self.attention_mask, (0, num_tokens), value=1
640
+ )
641
+
642
+ # Update image token mask (assumes only one image/conversation
643
+ # is started at once so that we always extend by zero)
644
+ # Note that the mask is stored flattened to avoid padding so we have to
645
+ # do something a bit ugly and inefficient here
646
+ imtokmask = torch.split(self.image_tokens_mask, self.full_batch_lengths, dim=0)
647
+ imtokmask = [torch.nn.functional.pad(x, (0, 0, 0, num_tokens), value=0) for x in imtokmask]
648
+ self.image_tokens_mask = torch.cat(imtokmask, dim=0)
649
+
650
+ # Recompute cumulative document lengths after assigning the new
651
+ # number of tokens to each sample in the batch
652
+ for idx, (ln, is_eob) in enumerate(self.text_sample_lengths):
653
+ if is_eob:
654
+ self.text_sample_lengths[idx] = (num_tokens + ln, is_eob)
655
+ self.full_sample_lengths[idx] += num_tokens
656
+
657
+ # Recompute cu_seqlens
658
+ # First step: Technically this never occurs, but we keep it for completeness
659
+ if offset == 0:
660
+ self.max_seqlen_q = max(self.text_sample_lengths)[0]
661
+ self.cu_seqlens_q = self.get_cu_seqlens(
662
+ [x[0] for x in self.text_sample_lengths], device=self.cu_seqlens_q.device
663
+ )
664
+
665
+ self.max_seqlen_kv = max(self.full_sample_lengths)
666
+ self.cu_seqlens_kv = self.get_cu_seqlens(
667
+ self.full_sample_lengths, device=self.cu_seqlens_kv.device
668
+ )
669
+ # Step > 0: the annoying part is since flashattn_varlen does not accept
670
+ # 0-len documents, we need to remove documents from the KV Cache when they're past
671
+ # their windows. In our current setting, this means we only want to keep the latest
672
+ # documents
673
+ else:
674
+ self.max_seqlen_q = num_tokens
675
+ self.cu_seqlens_q = self.get_cu_seqlens(
676
+ [num_tokens for (_, eob) in self.text_sample_lengths if eob],
677
+ device=self.cu_seqlens_q.device,
678
+ )
679
+
680
+ final_doc_lengths = [
681
+ ln
682
+ for (_, eob), ln in zip(self.text_sample_lengths, self.full_sample_lengths)
683
+ if eob
684
+ ]
685
+ self.current_doc_lengths = final_doc_lengths
686
+ self.max_seqlen_kv = max(self.current_doc_lengths)
687
+ self.cu_seqlens_kv = self.get_cu_seqlens(
688
+ final_doc_lengths,
689
+ device=self.cu_seqlens_kv.device,
690
+ )
691
+ # Update position embeddings
692
+ if self.rope_fn is not None and self.position_embeds is not None:
693
+ self.position_embeds = self.compute_position_embeddings(
694
+ self.rope_fn,
695
+ self.full_sample_lengths,
696
+ dummy_for_dtype_and_device=self.position_embeds[0],
697
+ )
698
+
699
+
700
+ @dataclass
701
+ class CASAAttentionStreamingState(StreamingState):
702
+ """Streaming State for CASA Atention module. Keep the hidden"""
703
+
704
+ k: torch.Tensor = None # pyright: ignore[reportAssignmentType]
705
+ v: torch.Tensor = None # pyright: ignore[reportAssignmentType]
706
+ recover_batched_trims: list[int] = None # pyright: ignore[reportAssignmentType]
707
+ casa_handler: CASAAttentionHandler = None # pyright: ignore[reportAssignmentType]
708
+
709
+ def maybe_get_casa_handler(
710
+ self,
711
+ casa_handler: CASAAttentionHandler | None,
712
+ is_first_casa_layer: bool = False,
713
+ num_queries: int = -1,
714
+ ) -> CASAAttentionHandler | None:
715
+ # Set given Casa Handler the first time we reach this
716
+ if self.casa_handler is None:
717
+ self.casa_handler = casa_handler # pyright: ignore
718
+ # subsequent calls: we need to extend shape to accommodate new tokens
719
+ # however because CASA handler is shared across layers, we only need to do it once
720
+ if self.casa_handler is not None and self.offset > 0 and is_first_casa_layer:
721
+ # since CasaHandler is shared, we only use its extend step once
722
+ self.casa_handler.extend(num_queries, offset=self.offset)
723
+ return self.casa_handler
724
+
725
+ def __recover_batched_kv__(self, states: torch.Tensor) -> torch.Tensor:
726
+ """Recover batched key/value states with left padding"""
727
+ s = torch.split(states, self.casa_handler.full_batch_lengths, dim=1)
728
+ mlen = max(_s.shape[1] for _s in s)
729
+ # Remember the added padding so that we can re-flatten KV later
730
+ if self.recover_batched_trims is None:
731
+ self.recover_batched_trims = [mlen - _s.shape[1] for _s in s]
732
+ s = [torch.nn.functional.pad(_s, (0, 0, 0, 0, mlen - _s.shape[1], 0), value=0) for _s in s]
733
+ return torch.cat(s, dim=0)
734
+
735
+ def __get_flattened_kv__(
736
+ self, k: torch.Tensor | None = None, v: torch.Tensor | None = None
737
+ ) -> tuple[torch.Tensor, torch.Tensor]:
738
+ """
739
+ Flatten and remove padding for use with flash_attn_varlen_func
740
+ """
741
+ k = self.k if k is None else k
742
+ v = self.v if v is None else v
743
+ assert k is not None and v is not None
744
+
745
+ # Since every batch at least contributes one document,
746
+ # we can use this to check whether we are in streaming mode with dropped docs.
747
+ # If so, we should trim the kv cache accordingly
748
+ if len(self.casa_handler.current_doc_lengths) == len(k):
749
+ k = torch.cat(
750
+ [
751
+ _k[self.recover_batched_trims[idx] :][-doc_len:]
752
+ for idx, _k, doc_len in zip(
753
+ range(len(k)), k, self.casa_handler.current_doc_lengths
754
+ )
755
+ ]
756
+ )
757
+ v = torch.cat(
758
+ [
759
+ _v[self.recover_batched_trims[idx] :][-doc_len:]
760
+ for idx, _v, doc_len in zip(
761
+ range(len(k)), v, self.casa_handler.current_doc_lengths
762
+ )
763
+ ]
764
+ )
765
+ return k[None, ...], v[None, ...]
766
+
767
+ k = torch.cat([_k[self.recover_batched_trims[idx] :] for idx, _k in enumerate(k)])
768
+ v = torch.cat([_v[self.recover_batched_trims[idx] :] for idx, _v in enumerate(v)])
769
+ return k[None, ...], v[None, ...]
770
+
771
+ def extend_kv(
772
+ self, key_states: torch.Tensor, value_states: torch.Tensor
773
+ ) -> tuple[torch.Tensor, torch.Tensor]:
774
+ """
775
+ Extend the KV cache with the new key/value states and return it flattened
776
+ """
777
+ assert self.casa_handler is not None
778
+ if self.k is None and self.v is None:
779
+ # Init with batch-padded key and value states
780
+ self.k = self.__recover_batched_kv__(key_states)
781
+ self.v = self.__recover_batched_kv__(value_states)
782
+ return self.__get_flattened_kv__()
783
+ if self.k is not None and self.v is not None:
784
+ # this is during generation; normally there is no padding at this stage
785
+ # so we can directly reshape the flattened key states
786
+ rshp = (self.k.shape[0], -1, self.k.shape[2], self.k.shape[3])
787
+ self.k = torch.cat([self.k, key_states.reshape(rshp)], dim=1)
788
+ self.v = torch.cat([self.v, value_states.reshape(rshp)], dim=1)
789
+ return self.__get_flattened_kv__()
790
+
791
+ raise ValueError("Impossible configuration (k and v updates are desynchronized )")
792
+
793
+
794
+ class CASAAttention(StreamingModule[CASAAttentionStreamingState]):
795
+ def __init__(
796
+ self,
797
+ config: "PretrainedConfig",
798
+ layer_idx: int | None,
799
+ self_attn: torch.nn.Module | None = None,
800
+ input_layernorm_fn: Callable[[torch.Tensor], torch.Tensor] | None = None,
801
+ ):
802
+ super().__init__(CASAAttentionStreamingState)
803
+ self.head_dim = config.head_dim
804
+ self.config = config
805
+
806
+ self.is_first_casa_layer = layer_idx == (min(config.xa_layers) if config.xa_layers else 0)
807
+ self.use_delta_w = config.casa_delta_w
808
+
809
+ self.q_proj_casa = self.init_from_config_proj("q", config)
810
+ self.k_proj_casa = self.init_from_config_proj("k", config)
811
+ self.v_proj_casa = self.init_from_config_proj("v", config)
812
+ self.o_proj_casa = self.init_from_config_proj("o", config)
813
+
814
+ # Delta_w
815
+ self.override_q_proj: Callable[[torch.Tensor], torch.Tensor] | None = None
816
+ self.override_k_proj: Callable[[torch.Tensor], torch.Tensor] | None = None
817
+ self.override_v_proj: Callable[[torch.Tensor], torch.Tensor] | None = None
818
+ self.override_o_proj: Callable[[torch.Tensor], torch.Tensor] | None = None
819
+
820
+ if config.casa_delta_w:
821
+ assert self_attn is not None
822
+ self.set_delta_w(self_attn)
823
+
824
+ # Layer norm
825
+ self.norm_fn: Callable | None = None
826
+ if config.xa_norm_on_images:
827
+ assert input_layernorm_fn is not None
828
+ self.norm_fn = input_layernorm_fn
829
+
830
+ def init_from_mha(self, self_attn: torch.nn.Module):
831
+ assert self_attn is not None
832
+ with torch.no_grad():
833
+ assert hasattr(self_attn, "q_proj")
834
+ for key in ["q", "k", "v", "o"]:
835
+ src = type_cast(torch.nn.Linear, getattr(self_attn, f"{key}_proj"))
836
+ tgt = type_cast(torch.nn.Linear, getattr(self, f"{key}_proj_casa"))
837
+ tgt.weight.copy_(src.weight)
838
+ if tgt.bias is not None and src.bias is not None:
839
+ tgt.bias.copy_(src.bias)
840
+
841
+ def set_delta_w(self, self_attn: torch.nn.Module):
842
+ """Delta w setup"""
843
+ self.override_q_proj = delta_w_factory(
844
+ self.q_proj_casa, type_cast(torch.nn.Linear, self_attn.q_proj)
845
+ )
846
+ self.override_k_proj = delta_w_factory(
847
+ self.k_proj_casa, type_cast(torch.nn.Linear, self_attn.k_proj)
848
+ )
849
+ self.override_v_proj = delta_w_factory(
850
+ self.v_proj_casa, type_cast(torch.nn.Linear, self_attn.v_proj)
851
+ )
852
+ self.override_o_proj = delta_w_factory(
853
+ self.o_proj_casa, type_cast(torch.nn.Linear, self_attn.o_proj)
854
+ )
855
+
856
+ with torch.no_grad():
857
+ torch.nn.init.zeros_(self.q_proj_casa.weight)
858
+ torch.nn.init.zeros_(self.k_proj_casa.weight)
859
+ torch.nn.init.zeros_(self.v_proj_casa.weight)
860
+ torch.nn.init.zeros_(self.o_proj_casa.weight)
861
+ if self.q_proj_casa.bias is not None:
862
+ torch.nn.init.zeros_(self.q_proj_casa.bias)
863
+ if self.k_proj_casa.bias is not None:
864
+ torch.nn.init.zeros_(self.k_proj_casa.bias)
865
+ if self.v_proj_casa.bias is not None:
866
+ torch.nn.init.zeros_(self.v_proj_casa.bias)
867
+ if self.o_proj_casa.bias is not None:
868
+ torch.nn.init.zeros_(self.o_proj_casa.bias)
869
+
870
+ def init_from_config_proj(
871
+ self, key: Literal["q", "o", "k", "v"], config: PretrainedConfig
872
+ ) -> torch.nn.Linear:
873
+ """Initialize the Linear proj in this module"""
874
+ raise NotImplementedError("Abastract class.")
875
+
876
+ def apply_position_embeddings(
877
+ self,
878
+ key: Literal["q", "kv"],
879
+ x: torch.Tensor, # (batch, seq_len, num_heads, head_dim)
880
+ casa_handler: CASAAttentionHandler | None,
881
+ num_queries: int = 0,
882
+ unsqueeze_dim: int = 1,
883
+ ) -> torch.Tensor: # (batch, seq_len, num_heads, head_dim)
884
+ """Apply position embeddings to query and key states"""
885
+ raise NotImplementedError("Abastract class.")
886
+
887
+ def forward(
888
+ self,
889
+ hidden_states: torch.Tensor,
890
+ casa_handler: CASAAttentionHandler | None,
891
+ ) -> torch.Tensor | None:
892
+ """Generic forward for CASA uses for instance in `helium1_attention`"""
893
+ og_dtype = hidden_states.dtype
894
+ if self.is_streaming:
895
+ casa_handler = self.streaming_state.maybe_get_casa_handler(
896
+ casa_handler,
897
+ is_first_casa_layer=self.is_first_casa_layer,
898
+ num_queries=hidden_states.shape[1],
899
+ )
900
+
901
+ # Case of text-only samples at training (or inference when no handler was cached)
902
+ # in this case we just skip CASA so we return None (no casa_update)
903
+ if casa_handler is None:
904
+ return None
905
+
906
+ if self.is_streaming:
907
+ assert casa_handler.use_asymetric_qkv, (
908
+ "You should set `use_asymetric_qkv` to True during inference"
909
+ )
910
+
911
+ og_shape = hidden_states.shape
912
+
913
+ # Build Q inputs
914
+ if casa_handler.use_asymetric_qkv:
915
+ q_inputs = hidden_states.flatten(0, 1)[None, ...]
916
+ if casa_handler.attention_mask is not None:
917
+ q_inputs = q_inputs[:, casa_handler.attention_mask[:, -og_shape[1] :].flatten()]
918
+ else:
919
+ q_inputs = casa_handler.get_full_embeds(hidden_states, norm_fn=self.norm_fn)
920
+
921
+ # Case 1: Training or first inference step
922
+ if not self.is_streaming or self.streaming_state.offset == 0:
923
+ kv_inputs = casa_handler.get_full_embeds(hidden_states, norm_fn=self.norm_fn)
924
+ else:
925
+ # during streaming, the KV cache including image embeddings
926
+ # will be inserted later so for now we only update the incoming queries
927
+ kv_inputs = q_inputs
928
+
929
+ # Compute QKV for the blockwise attention
930
+ bs, total_seq_len = kv_inputs.shape[:2]
931
+ hidden_shape_q = (bs, q_inputs.shape[1], -1, self.head_dim)
932
+ hidden_shape_kv = (bs, total_seq_len, -1, self.head_dim)
933
+
934
+ if self.override_q_proj is None:
935
+ query_states = self.q_proj_casa(q_inputs).view(*hidden_shape_q)
936
+ else:
937
+ query_states = self.override_q_proj(q_inputs).view(*hidden_shape_q)
938
+
939
+ if self.override_k_proj is None:
940
+ key_states = self.k_proj_casa(kv_inputs).view(*hidden_shape_kv)
941
+ else:
942
+ key_states = self.override_k_proj(kv_inputs).view(*hidden_shape_kv)
943
+
944
+ if self.override_v_proj is None:
945
+ value_states = self.v_proj_casa(kv_inputs).view(*hidden_shape_kv)
946
+ else:
947
+ value_states = self.override_v_proj(kv_inputs).view(*hidden_shape_kv)
948
+
949
+ # Apply position embedding at the right offset
950
+ num_queries = 0
951
+ if self.streaming and self.streaming_state.offset > 0:
952
+ num_queries = og_shape[1]
953
+
954
+ query_states = self.apply_position_embeddings(
955
+ "q", query_states, num_queries=num_queries, casa_handler=casa_handler
956
+ )
957
+ key_states = self.apply_position_embeddings(
958
+ "kv", key_states, num_queries=num_queries, casa_handler=casa_handler
959
+ )
960
+ assert flash_attn_varlen_func is not None, (
961
+ "flash_attention is not installed but required for block-wise attention"
962
+ )
963
+
964
+ # FlashAttention has a different, more efficient implementation for streaming.
965
+ # In that case, the KV cache has to be batched and has been extended
966
+ # to accommodate the shape of the new updates
967
+ if self.is_streaming:
968
+ key_states, value_states = self.streaming_state.extend_kv(
969
+ key_states=key_states, value_states=value_states
970
+ )
971
+ if casa_handler.use_asymetric_qkv:
972
+ cu_seqlens_q = casa_handler.cu_seqlens_q
973
+ max_seqlen_q = casa_handler.max_seqlen_q
974
+ else:
975
+ cu_seqlens_q = casa_handler.cu_seqlens_kv
976
+ max_seqlen_q = casa_handler.max_seqlen_kv
977
+ assert cu_seqlens_q[-1] == query_states.shape[1], (
978
+ f"{cu_seqlens_q[-1]} != {query_states.shape[1]}"
979
+ )
980
+ assert casa_handler.cu_seqlens_kv[-1] == key_states.shape[1], (
981
+ f"{casa_handler.cu_seqlens_kv[-1]} != {key_states.shape[1]}"
982
+ )
983
+ # Variable-length flash attention over the concatenated block-wise sequences
984
+ attn_output: torch.Tensor = flash_attn_varlen_func(
985
+ query_states[0].to(torch.bfloat16),
986
+ key_states[0].to(torch.bfloat16),
987
+ value_states[0].to(torch.bfloat16),
988
+ cu_seqlens_q=cu_seqlens_q,
989
+ cu_seqlens_k=casa_handler.cu_seqlens_kv,
990
+ max_seqlen_q=max_seqlen_q,
991
+ max_seqlen_k=casa_handler.max_seqlen_kv,
992
+ dropout_p=0.0,
993
+ # softmax_scale=None, # defaults to 1/sqrt(d)
994
+ causal=True,
995
+ ).to(og_dtype)
996
+
997
+ attn_output = attn_output.reshape(hidden_shape_q[1], -1).contiguous()
998
+ if self.override_o_proj is None:
999
+ attn_output = self.o_proj_casa(attn_output)
1000
+ else:
1001
+ attn_output = self.override_o_proj(attn_output)
1002
+
1003
+ attn_output = casa_handler.recover_text_embeds(
1004
+ attn_output, hidden_states, update_image_embeddings=self.config.xa_update_image_embeds
1005
+ )
1006
+ attn_output = attn_output.reshape(og_shape)
1007
+
1008
+ if self.is_streaming:
1009
+ self.streaming_state.offset += attn_output.shape[1]
1010
+ return attn_output
config.json ADDED
@@ -0,0 +1,77 @@
1
+ {
2
+ "attention_bias": false,
3
+ "attention_dropout": 0.0,
4
+ "auto_map": {
5
+ "AutoConfig": "configuration_helium1_casa.Helium1CASAConfig",
6
+ "AutoModel": "modeling_helium1_casa.V2Helium1"
7
+ },
8
+ "bos_token_id": 1,
9
+ "casa_attention": true,
10
+ "casa_delta_w": false,
11
+ "casa_use_asymetric_qkv": true,
12
+ "casa_windows": "images",
13
+ "eos_token_id": null,
14
+ "head_dim": 128,
15
+ "hidden_act": "silu",
16
+ "hidden_size": 2048,
17
+ "initializer_range": 0.02,
18
+ "intermediate_size": 8192,
19
+ "mask_squash_blockwise": false,
20
+ "max_position_embeddings": 4096,
21
+ "mlp_bias": false,
22
+ "model_type": "CASA_Helium1_VL_2B",
23
+ "num_attention_heads": 16,
24
+ "num_hidden_layers": 28,
25
+ "num_key_value_heads": 8,
26
+ "pad_token_id": 3,
27
+ "post_image_tokens": [],
28
+ "pre_image_tokens": [],
29
+ "pretraining_tp": 1,
30
+ "rms_norm_eps": 1e-08,
31
+ "rope_scaling": null,
32
+ "rope_theta": 20000.0,
33
+ "tie_word_embeddings": false,
34
+ "torch_dtype": "bfloat16",
35
+ "transformers_version": "4.51.3",
36
+ "use_cache": true,
37
+ "vision_config": {
38
+ "depth": 32,
39
+ "fullatt_block_indexes": [
40
+ 7,
41
+ 15,
42
+ 23,
43
+ 31
44
+ ],
45
+ "hidden_act": "silu",
46
+ "hidden_size": 1280,
47
+ "image_mean": [
48
+ 0.48145466,
49
+ 0.4578275,
50
+ 0.40821073
51
+ ],
52
+ "image_std": [
53
+ 0.26862954,
54
+ 0.26130258,
55
+ 0.27577711
56
+ ],
57
+ "in_channels": 3,
58
+ "in_chans": 3,
59
+ "intermediate_size": 3420,
60
+ "model_type": "qwen2_5_vl",
61
+ "num_heads": 16,
62
+ "out_dim": 2048,
63
+ "out_hidden_size": 2048,
64
+ "patch_size": 14,
65
+ "spatial_merge_size": 2,
66
+ "spatial_patch_size": 14,
67
+ "temporal_patch_size": 1,
68
+ "tokens_per_second": 2,
69
+ "window_size": 112
70
+ },
71
+ "vocab_size": 64000,
72
+ "xa_custom_norm": true,
73
+ "xa_layers": [],
74
+ "xa_norm_on_images": true,
75
+ "xa_order": "ca_first",
76
+ "xa_update_image_embeds": false
77
+ }
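The `auto_map` entries above route `AutoConfig` and `AutoModel` to the custom classes shipped in this repository, so the checkpoint loads through the `trust_remote_code` path. A minimal loading sketch, assuming the repo id `kyutai/CASA-Helium1-VL-2B` (an assumption, not stated in this file):

```python
import torch
from transformers import AutoConfig, AutoModel

repo_id = "kyutai/CASA-Helium1-VL-2B"  # assumed repo id for this model card

# AutoConfig resolves to Helium1CASAConfig and AutoModel to V2Helium1 via auto_map.
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # matches the "torch_dtype" field above
)
print(type(config).__name__, type(model).__name__)
```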
configuration_helium1_casa.py ADDED
@@ -0,0 +1,270 @@
1
+ from typing import Any, Literal
2
+
3
+ from transformers.configuration_utils import PretrainedConfig
4
+ from transformers.models.qwen2_5_vl.configuration_qwen2_5_vl import Qwen2_5_VLVisionConfig
5
+
6
+
7
+ class Helium1CASAConfig(PretrainedConfig):
8
+ r"""
9
+ Helium1 Config augmented with CASA options
10
+
11
+
12
+ Args:
13
+ vocab_size (`int`, *optional*, defaults to 32000):
14
+ Vocabulary size of the Helium1 model. Defines the number of different tokens that can be represented by the
15
+ `input_ids` passed when calling [`Helium1Model`]
16
+ hidden_size (`int`, *optional*, defaults to 4096):
17
+ Dimension of the hidden representations.
18
+ intermediate_size (`int`, *optional*, defaults to 11008):
19
+ Dimension of the MLP representations.
20
+ num_hidden_layers (`int`, *optional*, defaults to 32):
21
+ Number of hidden layers in the Transformer decoder.
22
+ num_attention_heads (`int`, *optional*, defaults to 32):
23
+ Number of attention heads for each attention layer in the Transformer decoder.
24
+ num_key_value_heads (`int`, *optional*):
25
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
26
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
27
+ `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
28
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
29
+ by meanpooling all the original heads within that group. For more details checkout [this
30
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
31
+ `num_attention_heads`.
32
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
33
+ The non-linear activation function (function or string) in the decoder.
34
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
35
+ The maximum sequence length that this model might ever be used with.
37
+ initializer_range (`float`, *optional*, defaults to 0.02):
38
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
39
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
40
+ The epsilon used by the rms normalization layers.
41
+ use_cache (`bool`, *optional*, defaults to `True`):
42
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
43
+ relevant if `config.is_decoder=True`.
44
+ pad_token_id (`int`, *optional*):
45
+ Padding token id.
46
+ bos_token_id (`int`, *optional*, defaults to 1):
47
+ Beginning of stream token id.
48
+ eos_token_id (`int`, *optional*, defaults to 2):
49
+ End of stream token id.
50
+ pretraining_tp (`int`, *optional*, defaults to 1):
51
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
52
+ document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to
53
+ understand more about it. This value is necessary to ensure exact reproducibility of the pretraining
54
+ results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).
55
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
56
+ Whether to tie weight embeddings
57
+ rope_theta (`float`, *optional*, defaults to 10000.0):
58
+ The base period of the RoPE embeddings.
59
+ rope_scaling (`Dict`, *optional*):
60
+ Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
61
+ and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
62
+ accordingly.
63
+ Expected contents:
64
+ `rope_type` (`str`):
65
+ The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
66
+ 'llama3'], with 'default' being the original RoPE implementation.
67
+ `factor` (`float`, *optional*):
68
+ Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
69
+ most scaling types, a `factor` of x will enable the model to handle sequences of length x *
70
+ original maximum pre-trained length.
71
+ `original_max_position_embeddings` (`int`, *optional*):
72
+ Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
73
+ pretraining.
74
+ `attention_factor` (`float`, *optional*):
75
+ Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
76
+ computation. If unspecified, it defaults to value recommended by the implementation, using the
77
+ `factor` field to infer the suggested value.
78
+ `beta_fast` (`float`, *optional*):
79
+ Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
80
+ ramp function. If unspecified, it defaults to 32.
81
+ `beta_slow` (`float`, *optional*):
82
+ Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
83
+ ramp function. If unspecified, it defaults to 1.
84
+ `short_factor` (`List[float]`, *optional*):
85
+ Only used with 'longrope'. The scaling factor to be applied to short contexts (<
86
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
87
+ size divided by the number of attention heads divided by 2
88
+ `long_factor` (`List[float]`, *optional*):
89
+ Only used with 'longrope'. The scaling factor to be applied to long contexts (<
90
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
91
+ size divided by the number of attention heads divided by 2
92
+ `low_freq_factor` (`float`, *optional*):
93
+ Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
94
+ `high_freq_factor` (`float`, *optional*):
95
+ Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
96
+ attention_bias (`bool`, *optional*, defaults to `False`):
97
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
98
+ attention_dropout (`float`, *optional*, defaults to 0.0):
99
+ The dropout ratio for the attention probabilities.
100
+ mlp_bias (`bool`, *optional*, defaults to `False`):
101
+ Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
102
+ head_dim (`int`, *optional*):
103
+ The attention head dimension. If None, it will default to hidden_size // num_attention_heads
104
+
105
+ """
106
+
107
+ model_type = "helium1_casa"
108
+ keys_to_ignore_at_inference = ["past_key_values"]
109
+ # Default tensor parallel plan for base model `Helium1Model`
110
+ base_model_tp_plan = {
111
+ "layers.*.self_attn.q_proj": "colwise",
112
+ "layers.*.self_attn.k_proj": "colwise",
113
+ "layers.*.self_attn.v_proj": "colwise",
114
+ "layers.*.self_attn.o_proj": "rowwise",
115
+ "layers.*.mlp.gate_proj": "colwise",
116
+ "layers.*.mlp.up_proj": "colwise",
117
+ "layers.*.mlp.down_proj": "rowwise",
118
+ }
119
+ base_model_pp_plan = { # pyright: ignore[reportAssignmentType]
120
+ "embed_tokens": (["input_ids"], ["inputs_embeds"]),
121
+ "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
122
+ "norm": (["hidden_states"], ["hidden_states"]),
123
+ }
124
+
125
+ def __init__(
126
+ self,
127
+ vocab_size: int = 32000,
128
+ hidden_size: int = 4096,
129
+ intermediate_size: int = 11008,
130
+ num_hidden_layers: int = 32,
131
+ num_attention_heads: int = 32,
132
+ num_key_value_heads: None | int = None,
133
+ head_dim: None | int = None,
134
+ hidden_act: str = "silu",
135
+ attention_dropout: float = 0.0,
136
+ max_position_embeddings: int = 2048,
137
+ initializer_range: float = 0.02,
138
+ rms_norm_eps: float = 1e-6,
139
+ use_cache: bool = True,
140
+ tie_word_embeddings: bool = False,
141
+ rope_theta: float = 10000.0,
142
+ pad_token_id: int = 3,
143
+ eos_token_id: int = 2,
144
+ bos_token_id: int = 1,
145
+ pretraining_tp: int = 1,
146
+ rope_scaling: None | dict = None,
147
+ attention_bias: bool = False,
148
+ mlp_bias: bool = False,
149
+ # Our fusion mechanisms
150
+ # Common to all fusion mechanisms
151
+ xa_layers: None | tuple = None,
152
+ xa_order: Literal["ca_first", "parallel", "instead"] = "ca_first",
153
+ xa_norm_on_images: bool = False,
154
+ xa_update_image_embeds: bool = False,
155
+ mask_squash_blockwise: bool = False,
156
+ # CASA
157
+ casa_attention: bool = False,
158
+ casa_delta_w: bool = False,
159
+ casa_windows: Literal["batch", "squashed", "images", "turn_based"] = "batch",
160
+ casa_use_asymetric_qkv: bool = True,
161
+ xa_custom_norm: bool = False,
162
+ # Qwen2.5-VL vision config
163
+ vision_config: dict[str, Any] | None = None,
164
+ **kwargs: Any,
165
+ ):
166
+ from transformers.modeling_rope_utils import rope_config_validation
167
+
168
+ self.vocab_size = vocab_size
169
+ self.max_position_embeddings = max_position_embeddings
170
+ self.hidden_size = hidden_size
171
+ self.intermediate_size = intermediate_size
172
+ self.num_hidden_layers = num_hidden_layers
173
+ self.num_attention_heads = num_attention_heads
174
+
175
+ # for backward compatibility
176
+ if num_key_value_heads is None:
177
+ num_key_value_heads = num_attention_heads
178
+
179
+ self.num_key_value_heads = num_key_value_heads
180
+ self.head_dim = (
181
+ head_dim if head_dim is not None else self.hidden_size // self.num_attention_heads
182
+ )
183
+ self.hidden_act = hidden_act
184
+ self.initializer_range = initializer_range
185
+ self.rms_norm_eps = rms_norm_eps
186
+ self.pretraining_tp = pretraining_tp
187
+ self.use_cache = use_cache
188
+ self.rope_theta = rope_theta
189
+ self.rope_scaling = rope_scaling
190
+ self.attention_bias = attention_bias
191
+ self.attention_dropout = attention_dropout
192
+ self.mlp_bias = mlp_bias
193
+ # Validate the correctness of rotary position embeddings parameters
194
+ # BC: if there is a 'type' field, copy it to 'rope_type'.
195
+ if self.rope_scaling is not None and "type" in self.rope_scaling:
196
+ self.rope_scaling["rope_type"] = self.rope_scaling["type"]
197
+ rope_config_validation(self)
198
+
199
+ self.head_dim = self.hidden_size // self.num_attention_heads
200
+ self.xa_layers = xa_layers
201
+ self.xa_order: Literal["ca_first", "parallel", "instead"] = xa_order
202
+ self.xa_norm_on_images = xa_norm_on_images
203
+ self.xa_update_image_embeds = xa_update_image_embeds
204
+ self.mask_squash_blockwise = mask_squash_blockwise
205
+ # CASA config
206
+ self.casa_attention = casa_attention
207
+ self.casa_delta_w = casa_delta_w
208
+ self.casa_windows: Literal["batch", "squashed", "images", "turn_based"] = casa_windows
209
+ self.casa_use_asymetric_qkv = casa_use_asymetric_qkv
210
+ self.xa_custom_norm = xa_custom_norm
211
+
212
+ if vision_config is None:
213
+ vision_config = dict()
214
+ self.vision_config = Qwen2_5_VLVisionConfig(**vision_config)
215
+ self.vision_config.temporal_patch_size = 1
216
+ self.vision_config.image_mean = [0.48145466, 0.4578275, 0.40821073]
217
+ self.vision_config.image_std = [0.26862954, 0.26130258, 0.27577711]
218
+ self.vision_config.out_dim = 2048
219
+
220
+ self.pre_image_tokens = []
221
+ self.post_image_tokens = []
222
+
223
+ super().__init__(
224
+ pad_token_id=pad_token_id,
225
+ bos_token_id=bos_token_id,
226
+ eos_token_id=eos_token_id,
227
+ tie_word_embeddings=tie_word_embeddings,
228
+ **kwargs,
229
+ )
230
+
231
+
232
+ if __name__ == "__main__":
233
+ import argparse
234
+ from pathlib import Path
235
+
236
+ import rich
237
+ import yaml
238
+ from transformers.models.auto.configuration_auto import AutoConfig
239
+
240
+ parser = argparse.ArgumentParser()
241
+ parser.add_argument("--out_dir", type=str, default="./saved_config/")
242
+ parser.add_argument(
243
+ "--ckpt_path",
244
+ type=str,
245
+ default="/lustre/scwpod02/client/kyutai/juliette/experiments/finext_casa_896_xtxt_up_b20_64gpu/fdf76e6774",
246
+ )
247
+ args = parser.parse_args()
248
+ path = Path(args.ckpt_path) / "kyuteye_config.yml"
249
+
250
+ helium_config = AutoConfig.from_pretrained("kyutai/helium-1-2b")
251
+ vision_config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct").vision_config
252
+
253
+ # Create the CASA config by merging the Helium1 text config and the Qwen vision config
254
+ config = Helium1CASAConfig(
255
+ **helium_config.to_dict(), # all helium parameters
256
+ vision_config=vision_config.to_dict(), # override or add vision_config
257
+ )
258
+
259
+ with open(path) as stream:
260
+ kconfig = yaml.safe_load(stream)
261
+
262
+ # Overwrite config values for keys present in both kconfig and config
263
+ for key in set(kconfig.keys()).intersection(set(config.to_dict().keys())):
264
+ rich.print(f"Overwriting [bold green]{key:>50s}[/]: [bold red]{kconfig[key]}")
265
+ setattr(config, key, kconfig[key])
266
+ # TODO: handle casa_own_norm -> xa_custom_norm
267
+ print("Configuration successfully loaded.")
268
+ # Save config to json
269
+ config.save_pretrained(args.out_dir)
270
+ print(f"Configuration saved to {args.out_dir}/config.json")
generation_config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": [
5
+ 103,
6
+ 3
7
+ ],
8
+ "pad_token_id": 3,
9
+ "transformers_version": "4.51.3"
10
+ }
image_encoder.py ADDED
@@ -0,0 +1,57 @@
1
+ """Qwen2.5VL encoder with delayed normalization"""
2
+
3
+ import torch
4
+ from einops import rearrange
5
+ from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import (
6
+ Qwen2_5_VisionTransformerPretrainedModel,
7
+ )
8
+
9
+
10
+ def prepare_for_qwen_encoder(
11
+ x: torch.Tensor | list[torch.Tensor], mean: torch.Tensor, std: torch.Tensor
12
+ ) -> tuple[torch.Tensor, torch.Tensor]:
13
+ """
14
+ Preprocessing for Qwen encoder
15
+ Image mean and std come from processor.image_processor.image_mean and image_std
16
+ """
17
+ grid_thw = torch.Tensor([[1, img.shape[0], img.shape[1]] for img in x]).to(x[0].device)
18
+ hws_flatten_shape = torch.prod(grid_thw, dim=-1)
19
+ x = torch.cat(
20
+ [img.reshape((int(hws_flatten_shape[idx].item()), -1)) for idx, img in enumerate(x)],
21
+ dim=0,
22
+ )
23
+ assert x.min() >= 0.0 and x.max() <= 1.0
24
+ og_shape = x.shape
25
+ x = rearrange(x, "L (c d) -> L c d", c=3)
26
+ x = (x - mean) / std
27
+ x = x.view(og_shape).to(torch.bfloat16)
28
+ return x, grid_thw
29
+
30
+
31
+ class Qwen25VLEncoder(torch.nn.Module):
32
+ """Qwen2.5 VL encoder with pre/post processing to be compatible for
33
+ our CASA attention implementation"""
34
+
35
+ def __init__(
36
+ self,
37
+ visual: "Qwen2_5_VisionTransformerPretrainedModel",
38
+ ):
39
+ super().__init__()
40
+ self.visual = visual
41
+ self.image_mean = torch.tensor(self.visual.config.image_mean).view(1, 3, 1)
42
+ self.image_std = torch.tensor(self.visual.config.image_std).view(1, 3, 1)
43
+
44
+ def forward(
45
+ self, x: torch.Tensor | list[torch.Tensor]
46
+ ) -> dict[str, torch.Tensor | list[torch.Tensor]]:
47
+ x, grid_thw = prepare_for_qwen_encoder(
48
+ x, mean=self.image_mean.to(x[0].device), std=self.image_std.to(x[0].device)
49
+ )
50
+
51
+ grid_thw = grid_thw.type(torch.int)
52
+ assert len(x) == grid_thw.prod(dim=1).sum()
53
+ out = self.visual(x, grid_thw=grid_thw)
54
+
55
+ split_sizes = (grid_thw.prod(dim=-1) // self.visual.spatial_merge_size**2).tolist()
56
+ embeds = list(torch.split(out, split_sizes, dim=0)) # Ni * (seq, C)
57
+ return {"image_embeds": embeds, "grid_thw": grid_thw}
language_helium1_casa.py ADDED
@@ -0,0 +1,1077 @@
1
+ # ADAPTED FROM https://github.com/huggingface/transformers/blob/main/src/transformers/models/helium/modeling_helium.py
2
+ # GIT HASH 1b222903c3e1cfd9492d75e4b2548aa8bd458674
3
+ import logging
4
+ import math
5
+ from dataclasses import dataclass
6
+ from functools import partial
7
+ from typing import Any, Callable, Literal, Optional
8
+ from typing import cast as type_cast
9
+
10
+ import torch
11
+ from torch import nn
12
+ from transformers import (
13
+ ROPE_INIT_FUNCTIONS, # pyright: ignore[reportPrivateImportUsage]
14
+ dynamic_rope_update, # pyright: ignore[reportPrivateImportUsage]
15
+ )
16
+ from transformers.activations import ACT2FN
17
+ from transformers.cache_utils import Cache, DynamicCache
18
+ from transformers.configuration_utils import PretrainedConfig
19
+ from transformers.generation.utils import GenerationMixin
20
+ from transformers.loss.loss_utils import ForCausalLMLoss
21
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
22
+ from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
23
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
24
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
25
+ from transformers.processing_utils import Unpack
26
+ from transformers.utils.generic import LossKwargs, can_return_tuple
27
+ from transformers.utils.import_utils import is_torch_flex_attn_available
28
+
29
+ from .casa_attention import CASAAttention, CASAAttentionHandler, insert_image_tokens
30
+ from .configuration_helium1_casa import Helium1CASAConfig
31
+
32
+ logger = logging.getLogger(__name__)
33
+
34
+ if is_torch_flex_attn_available():
35
+ from transformers.integrations.flex_attention import make_flex_block_causal_mask
36
+
37
+
38
+ def remove_image_tokens(
39
+ inputs_embeds: torch.Tensor,
40
+ image_tokens_mask: torch.Tensor,
41
+ ) -> torch.Tensor:
42
+ """Remove the image tokens from inputs_embeds as indicated by image_tokens_mask
43
+
44
+ :param inputs_embeds: Tokens of shape (Batch, Seqlen, Dims) containing image tokens
45
+ :param image_tokens_mask: 1-0 mask indicating where image tokens are; (Batch, Seqlen)
46
+
47
+ :return: Tokens tensor of shape (Batch, S' < Seqlen, Dims)
48
+ """
49
+ image_seq_lengths = torch.sum(image_tokens_mask, dim=1)[:, 0]
50
+ image_seq_length = int(image_seq_lengths[0].item())
51
+ assert torch.all(image_seq_lengths == image_seq_length)
52
+ new_shape = (
53
+ inputs_embeds.shape[0],
54
+ inputs_embeds.shape[1] - image_seq_length,
55
+ inputs_embeds.shape[-1],
56
+ )
57
+ tokens = torch.masked_select(
58
+ inputs_embeds,
59
+ torch.logical_not(image_tokens_mask).expand((-1, -1, inputs_embeds.shape[-1])),
60
+ )
61
+ return tokens.reshape(new_shape)
62
+
63
+
64
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
65
+ """
66
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
67
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
68
+ """
69
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
70
+ if n_rep == 1:
71
+ return hidden_states
72
+ hidden_states = hidden_states[:, :, None, :, :].expand(
73
+ batch, num_key_value_heads, n_rep, slen, head_dim
74
+ )
75
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
76
+
77
+
78
+ def eager_attention_forward(
79
+ module: "HeliumAttention",
80
+ query: torch.Tensor,
81
+ key: torch.Tensor,
82
+ value: torch.Tensor,
83
+ attention_mask: None | torch.Tensor,
84
+ scaling: float,
85
+ dropout: float = 0.0,
86
+ **kwargs: Any,
87
+ ):
88
+ del kwargs # unused
89
+ key_states = repeat_kv(key, module.num_key_value_groups)
90
+ value_states = repeat_kv(value, module.num_key_value_groups)
91
+
92
+ attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
93
+ if attention_mask is not None:
94
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
95
+ attn_weights = attn_weights + causal_mask
96
+
97
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
98
+ attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
99
+ attn_output = torch.matmul(attn_weights, value_states)
100
+ attn_output = attn_output.transpose(1, 2).contiguous()
101
+
102
+ return attn_output, attn_weights
103
+
104
+
105
+ # Different Attention Classes
106
+ class HeliumAttention(torch.nn.Module):
107
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
108
+
109
+ def __init__(self, config: Helium1CASAConfig, layer_idx: None | int = None):
110
+ super().__init__()
111
+ self.config = config
112
+ assert layer_idx is not None
113
+ self.layer_idx: int = layer_idx
114
+
115
+ self.apply_rotary_fn = ApplyRotaryPosEmbHelium1()
116
+ self.head_dim = getattr(
117
+ config, "head_dim", config.hidden_size // config.num_attention_heads
118
+ )
119
+ self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
120
+ self.scaling = 1 / math.sqrt(self.head_dim)
121
+ self.attention_dropout = config.attention_dropout
122
+ self.is_causal = True
123
+
124
+ self.q_proj = nn.Linear(
125
+ config.hidden_size,
126
+ config.num_attention_heads * self.head_dim,
127
+ bias=config.attention_bias,
128
+ )
129
+ self.k_proj = nn.Linear(
130
+ config.hidden_size,
131
+ config.num_key_value_heads * self.head_dim,
132
+ bias=config.attention_bias,
133
+ )
134
+ self.v_proj = nn.Linear(
135
+ config.hidden_size,
136
+ config.num_key_value_heads * self.head_dim,
137
+ bias=config.attention_bias,
138
+ )
139
+ self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
140
+
141
+ def forward(
142
+ self,
143
+ hidden_states: torch.Tensor,
144
+ position_embeddings: tuple[torch.Tensor, torch.Tensor],
145
+ attention_mask: None | torch.Tensor,
146
+ past_key_values: None | Cache = None,
147
+ cache_position: None | torch.LongTensor = None,
148
+ **kwargs: Unpack[FlashAttentionKwargs],
149
+ ) -> tuple[torch.Tensor, torch.Tensor | None]:
150
+ # del (cache_position, past_key_value) # we use our own generate/caching
151
+ bs, seq_len, _ = hidden_states.shape
152
+ # Get QKV
153
+ hidden_shape = (bs, seq_len, -1, self.head_dim)
154
+
155
+ # Embed Queries
156
+ # Shape: (batch_size, num_heads, seq_len, head_dim)
157
+ query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
158
+ num_queries = query_states.shape[2]
159
+
160
+ key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
161
+ value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
162
+
163
+ # Applies rotation
164
+ cos, sin = position_embeddings
165
+ query_states, key_states = self.apply_rotary_fn(
166
+ query_states, key_states, cos, sin, num_queries=num_queries
167
+ )
168
+ assert key_states is not None and query_states is not None
169
+
170
+ attention_interface: Callable = eager_attention_forward
171
+
172
+ if self.config._attn_implementation != "eager":
173
+ if self.config._attn_implementation == "sdpa" and kwargs.get(
174
+ "output_attentions", False
175
+ ):
176
+ print(
177
+ "`torch.nn.functional.scaled_dot_product_attention` does not support"
178
+ " `output_attentions=True`. Falling back to "
179
+ "eager attention. This warning can be removed using the argument"
180
+ " `attn_implementation="eager"` when loading the model.'
181
+ )
182
+ else:
183
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
184
+
185
+ if past_key_values is not None:
186
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
187
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
188
+ key_states, value_states = past_key_values.update(
189
+ key_states, value_states, self.layer_idx, cache_kwargs
190
+ )
191
+ attn_output, attn_weights = attention_interface(
192
+ self,
193
+ query_states,
194
+ key_states,
195
+ value_states,
196
+ attention_mask,
197
+ dropout=0.0 if not self.training else self.attention_dropout,
198
+ scaling=self.scaling,
199
+ **kwargs,
200
+ )
201
+ attn_output = attn_output.reshape(bs, num_queries, -1).contiguous()
202
+ attn_output = self.o_proj(attn_output)
203
+
204
+ assert isinstance(attn_output, torch.Tensor)
205
+ return attn_output, attn_weights
206
+
207
+
208
+ class ApplyRotaryPosEmbHelium1:
209
+ @staticmethod
210
+ def rotate_half(x: torch.Tensor) -> torch.Tensor:
211
+ """Rotates half the hidden dims of the input."""
212
+ x1 = x[..., : x.shape[-1] // 2]
213
+ x2 = x[..., x.shape[-1] // 2 :]
214
+ return torch.cat((-x2, x1), dim=-1)
215
+
216
+ @staticmethod
217
+ def __call__(
218
+ q: torch.Tensor,
219
+ k: torch.Tensor,
220
+ cos: torch.Tensor,
221
+ sin: torch.Tensor,
222
+ position_ids: torch.Tensor | None = None,
223
+ unsqueeze_dim: int = 1,
224
+ num_queries: int | None = None,
225
+ ) -> tuple[torch.Tensor, torch.Tensor]:
226
+ """Applies Rotary Position Embedding to the query and key tensors.
227
+
228
+ Args:
229
+ q (`torch.Tensor`): The query tensor.
230
+ k (`torch.Tensor`): The key tensor.
231
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
232
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
233
+ position_ids (`torch.Tensor`, *optional*):
234
+ Deprecated and unused.
235
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
236
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
237
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
238
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
239
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
240
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
241
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
242
+ Returns:
243
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
244
+ """
245
+ del position_ids
246
+ cos = cos.unsqueeze(unsqueeze_dim)
247
+ sin = sin.unsqueeze(unsqueeze_dim)
248
+ if num_queries is None:
249
+ offset = 0
250
+ else:
251
+ offset = -num_queries
252
+
253
+ q_embed = (q * cos[:, :, offset:]) + (
254
+ ApplyRotaryPosEmbHelium1.rotate_half(q) * sin[:, :, offset:]
255
+ )
256
+ k_embed = (k * cos) + (ApplyRotaryPosEmbHelium1.rotate_half(k) * sin)
257
+
258
+ return q_embed, k_embed
259
+
260
+
261
+ class HeliumRotaryEmbedding(nn.Module):
262
+ def __init__(self, config: Helium1CASAConfig, device: None | torch.device | str = None):
263
+ super().__init__()
264
+ if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
265
+ self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
266
+ else:
267
+ self.rope_type = "default"
268
+ self.max_seq_len_cached = config.max_position_embeddings
269
+ self.original_max_seq_len = config.max_position_embeddings
270
+
271
+ self.config = config
272
+ assert self.rope_type in ROPE_INIT_FUNCTIONS, (
273
+ f"Invalid rope type {self.rope_type}. Supported types are: {list(ROPE_INIT_FUNCTIONS.keys())}"
274
+ )
275
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
276
+ inv_freq, self.attention_scaling = self.rope_init_fn(config, device=device)
277
+ self.inv_freq: torch.Tensor # only defined for typing
278
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
279
+ self.original_inv_freq = self.inv_freq
280
+
281
+ @torch.no_grad()
282
+ @dynamic_rope_update # power user: used with advanced RoPE types (e.g. dynamic rope)
283
+ def forward(
284
+ self, x: torch.Tensor, position_ids: torch.Tensor
285
+ ) -> tuple[torch.Tensor, torch.Tensor]:
286
+ inv_freq_expanded = (
287
+ self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
288
+ )
289
+ position_ids_expanded = position_ids[:, None, :].float()
290
+
291
+ device_type = (
292
+ x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
293
+ )
294
+ with torch.autocast(device_type=device_type, enabled=False): # Force float32
295
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
296
+ emb = torch.cat((freqs, freqs), dim=-1)
297
+ cos = emb.cos() * self.attention_scaling
298
+ sin = emb.sin() * self.attention_scaling
299
+
300
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
301
+
302
+
303
+ class Helium1CASAAttention(CASAAttention):
304
+ """A CASA Attention layer compatible with Qwen"""
305
+
306
+ def __init__(
307
+ self,
308
+ config: Helium1CASAConfig,
309
+ layer_idx: int | None,
310
+ self_attn: torch.nn.Module | None = None,
311
+ input_layernorm_fn: Callable[[torch.Tensor], torch.Tensor] | None = None,
312
+ ):
313
+ # Only adding this init for typing purposes for the config
314
+ super().__init__(config, layer_idx, self_attn, input_layernorm_fn) # pyright: ignore[reportArgumentType]
315
+
316
+ @staticmethod
317
+ def rotate_half(x: torch.Tensor) -> torch.Tensor:
318
+ """Rotates half the hidden dims of the input."""
319
+ x1 = x[..., : x.shape[-1] // 2]
320
+ x2 = x[..., x.shape[-1] // 2 :]
321
+ return torch.cat((-x2, x1), dim=-1)
322
+
323
+ def apply_position_embeddings(
324
+ self,
325
+ key: Literal["q", "kv"],
326
+ x: torch.Tensor, # (batch, seq_len, num_heads, head_dim)
327
+ casa_handler: CASAAttentionHandler | None,
328
+ num_queries: int = 0,
329
+ unsqueeze_dim: int = 1,
330
+ ) -> torch.Tensor: # (batch, seq_len, num_heads, head_dim)
331
+ """Apply position embeddings to query and key states"""
332
+ if casa_handler is not None:
333
+ posemb = casa_handler.get_position_embedding(key, num_queries=num_queries)
334
+
335
+ if posemb is not None:
336
+ x = x.transpose(1, 2).to(torch.float32)
337
+ x = (x * posemb[0].unsqueeze(dim=unsqueeze_dim)) + (
338
+ self.rotate_half(x) * posemb[1].unsqueeze(dim=unsqueeze_dim)
339
+ )
340
+ return x.transpose(1, 2)
341
+ return x
342
+
343
+ def init_from_config_proj(
344
+ self, key: Literal["q", "o", "k", "v"], config: PretrainedConfig
345
+ ) -> torch.nn.Linear:
346
+ """Initialize the Linear proj in this module"""
347
+ num_heads = config.num_key_value_heads if key in {"k", "v"} else config.num_attention_heads
348
+ return torch.nn.Linear(
349
+ config.hidden_size,
350
+ num_heads * config.head_dim,
351
+ bias=config.attention_bias if key != "o" else False,
352
+ )
353
+
354
+
355
+ # NORMALISATION LAYER
356
+ def __rms_norm_forward__(
357
+ hidden_states: torch.Tensor, weight: torch.Tensor, variance_epsilon: float = 1e-6
358
+ ) -> torch.Tensor:
359
+ input_dtype = hidden_states.dtype
360
+ hidden_states = hidden_states.to(torch.float32)
361
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
362
+ hidden_states = hidden_states * torch.rsqrt(variance + variance_epsilon)
363
+ return weight * hidden_states.to(input_dtype)
364
+
365
+
366
+ class Helium1RMSNorm(nn.Module):
367
+ def __init__(self, hidden_size: int, eps: float = 1e-6) -> None:
368
+ """
369
+ Helium1RMSNorm is equivalent to T5LayerNorm
370
+ """
371
+ super().__init__()
372
+ self.weight = nn.Parameter(torch.ones(hidden_size))
373
+ self.variance_epsilon = eps
374
+
375
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
376
+ return __rms_norm_forward__(hidden_states, self.weight, self.variance_epsilon)
377
+
378
+ def extra_repr(self):
379
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
380
+
381
+
382
+ def delta_w_factory_rms_norm(
383
+ org_lin: Helium1RMSNorm, new_lin: Helium1RMSNorm
384
+ ) -> Callable[[torch.Tensor], torch.Tensor]:
385
+ """Factory for building rms norm where the weights are the sum of two layers' weights"""
386
+
387
+ def _delta_w_fwd(input: torch.Tensor) -> torch.Tensor:
388
+ nonlocal org_lin, new_lin
389
+ return __rms_norm_forward__(
390
+ input, org_lin.weight + new_lin.weight, new_lin.variance_epsilon
391
+ )
392
+
393
+ return _delta_w_fwd
394
+
395
+
396
+ # FULL CONNECTED LAYER
397
+
398
+
399
+ class HeliumMLP(nn.Module):
400
+ def __init__(self, config: Helium1CASAConfig) -> None:
401
+ super().__init__()
402
+ self.config = config
403
+ self.hidden_size = config.hidden_size
404
+ self.intermediate_size = config.intermediate_size
405
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
406
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
407
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias)
408
+ self.act_fn = ACT2FN[config.hidden_act]
409
+
410
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
411
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
412
+ return down_proj
413
+
414
+
415
+ class HeliumDecoderLayer(nn.Module):
416
+ def __init__(self, config: Helium1CASAConfig, layer_idx: None | int = None):
417
+ super().__init__()
418
+ self.hidden_size = config.hidden_size
419
+ self.config = config
420
+ self.mlp = HeliumMLP(config)
421
+ self.input_layernorm = Helium1RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
422
+ self.post_attention_layernorm = Helium1RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
423
+
424
+ # Self-attention
425
+ self.self_attn = HeliumAttention(config=config, layer_idx=layer_idx)
426
+
427
+ # Set up the norm for fusion mechanisms; note that this norm is applied to the text tokens
428
+ is_xa_layer = layer_idx is None or not config.xa_layers or layer_idx in config.xa_layers
429
+ self.norm_cross: None | Helium1RMSNorm = None
430
+ self.override_norm_cross: Callable[[torch.Tensor], torch.Tensor] | None = None
431
+ if is_xa_layer and config.casa_attention:
432
+ # Custom normalization layer for the extra fusion module
433
+ if self.config.xa_custom_norm:
434
+ self.norm_cross = Helium1RMSNorm(config.hidden_size)
435
+ if config.casa_delta_w:
436
+ self.override_norm_cross = delta_w_factory_rms_norm(
437
+ self.input_layernorm, self.norm_cross
438
+ )
439
+ with torch.no_grad():
440
+ torch.nn.init.ones_(self.norm_cross.weight)
441
+
442
+ # Set up an additional norm for image tokens, which is set in each individual mechanism
443
+ norm_on_images_fn = (
444
+ None
445
+ if not self.config.xa_norm_on_images
446
+ else self.override_norm_cross
447
+ if self.override_norm_cross is not None
448
+ else self.norm_cross.forward
449
+ if self.norm_cross is not None
450
+ else self.input_layernorm.forward
451
+ )
452
+
453
+ # CASA
454
+ self.casa_attn: Helium1CASAAttention | None = None
455
+ if config.casa_attention and is_xa_layer:
456
+ self.casa_attn = Helium1CASAAttention(
457
+ config, layer_idx, self_attn=self.self_attn, input_layernorm_fn=norm_on_images_fn
458
+ )
459
+
460
+ def forward(
461
+ self,
462
+ hidden_states: torch.Tensor,
463
+ attention_mask: None | torch.Tensor = None,
464
+ position_ids: None | torch.LongTensor = None,
465
+ past_key_values: None | Cache = None,
466
+ output_attentions: None | bool = False,
467
+ use_cache: None | bool = False,
468
+ cache_position: None | torch.LongTensor = None,
469
+ position_embeddings: None
470
+ | tuple[torch.Tensor, torch.Tensor] = None, # necessary, but kept here for BC
471
+ # CASA
472
+ casa_handler: CASAAttentionHandler | None = None,
473
+ cu_seqlens: torch.Tensor | None = None,
474
+ **kwargs: Unpack[FlashAttentionKwargs],
475
+ ) -> tuple[torch.Tensor, torch.Tensor] | tuple[torch.Tensor]:
476
+ # Image fusion mechanisms
477
+ apply_ca = self.casa_attn is not None
478
+ ca_update: torch.Tensor | None = None
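+ # `xa_order` controls how the CASA update is fused with the residual stream:
+ #   "ca_first": add the CASA update before the self-attention block,
+ #   "parallel": add it after self-attention, alongside the self-attention output,
+ #   "instead": return hidden_states + CASA update and skip the rest of the layer.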
479
+ if (
480
+ self.config.xa_order
481
+ in {
482
+ "parallel",
483
+ "ca_first",
484
+ "instead",
485
+ }
486
+ and apply_ca
487
+ ):
488
+ # Apply layer norm
489
+ assert self.norm_cross is not None
490
+ ca_input = (
491
+ self.override_norm_cross
492
+ if self.override_norm_cross is not None
493
+ else self.norm_cross
494
+ )(hidden_states)
495
+ # CASA
496
+ if self.casa_attn is not None:
497
+ ca_update = self.casa_attn(ca_input, casa_handler=casa_handler)
498
+
499
+ # If we're here, it's because we had proper inputs (no text-only samples)
500
+ # so the output must not be None
501
+ if ca_update is not None:
502
+ # `instead`: directly return the output of the CA module as residual
503
+ if self.config.xa_order == "instead":
504
+ outputs = (hidden_states + ca_update,)
505
+ if output_attentions:
506
+ outputs += (
507
+ torch.zeros((), device=ca_update.device, dtype=ca_update.dtype),
508
+ )
509
+ return outputs
510
+
511
+ # `ca_first`: update then continue with normal self-attention
512
+ if self.config.xa_order == "ca_first":
513
+ hidden_states = hidden_states + ca_update
514
+ ca_update = None
515
+
516
+ # Self Attention with initial input layer norm
517
+ residual = hidden_states
518
+ hidden_states, self_attn_weights = self.self_attn(
519
+ hidden_states=self.input_layernorm(hidden_states),
520
+ attention_mask=attention_mask,
521
+ position_ids=position_ids,
522
+ past_key_values=past_key_values,
523
+ output_attentions=output_attentions,
524
+ use_cache=use_cache,
525
+ cache_position=cache_position,
526
+ position_embeddings=position_embeddings,
527
+ cu_seqlens=cu_seqlens,
528
+ **kwargs,
529
+ )
530
+ hidden_states = residual + hidden_states
531
+
532
+ # parallel - residual update
533
+ if self.config.xa_order == "parallel" and apply_ca and ca_update is not None:
534
+ hidden_states = hidden_states + ca_update
535
+
536
+ # Fully Connected layer
537
+ residual = hidden_states
538
+ # MLP updates for image embeddings
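+ # When `xa_update_image_embeds` is set, the handler's image embeddings are passed through
+ # the same post-attention norm and MLP as the text tokens and updated in place for the
+ # next decoder layer.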
539
+ if (
540
+ self.config.xa_update_image_embeds
541
+ and self.casa_attn is not None
542
+ and casa_handler is not None
543
+ and casa_handler.image_embeds is not None
544
+ ):
545
+ # Text flattening
546
+ hs = self.post_attention_layernorm(hidden_states).reshape(-1, hidden_states.shape[-1])
547
+ # Image flattening
548
+ img_seq_lengths = [_x.shape[0] for _x in casa_handler.image_embeds]
549
+ img_residual = torch.cat(list(casa_handler.image_embeds), dim=0)
550
+ update = self.mlp(torch.cat([hs, self.post_attention_layernorm(img_residual)], dim=0))
551
+ # update text
552
+ hidden_states = hidden_states + update[: hs.shape[0]].reshape(hidden_states.shape)
553
+ casa_handler.image_embeds = list(
554
+ torch.split(img_residual + update[hs.shape[0] :], img_seq_lengths)
555
+ )
556
+ else:
557
+ hidden_states = self.mlp(self.post_attention_layernorm(hidden_states))
558
+ hidden_states = residual + hidden_states
559
+
560
+ # Outputs
561
+ outputs = (hidden_states,)
562
+ if output_attentions:
563
+ outputs += (self_attn_weights,)
564
+
565
+ return outputs
566
+
567
+
568
+ # FULL HELIUM MODEL
569
+
570
+
571
+ @dataclass
572
+ class CausalHeliumOutput(CausalLMOutputWithPast):
573
+ attention_mask: Optional[torch.Tensor] = None
574
+ num_image_tokens_log: Optional[torch.Tensor] = None
575
+ num_text_tokens_log: Optional[torch.Tensor] = None
576
+
577
+
578
+ class Helium1PreTrainedModel(PreTrainedModel):
579
+ config_class = Helium1CASAConfig
580
+ base_model_prefix = "model"
581
+ supports_gradient_checkpointing = True
582
+ _no_split_modules = ["HeliumDecoderLayer"]
583
+ _skip_keys_device_placement = ["past_key_values"]
584
+ _supports_flash_attn_2 = True
585
+ _supports_sdpa = True
586
+ _supports_flex_attn = True
587
+ _supports_cache_class = True
588
+ _supports_quantized_cache = True
589
+ _supports_static_cache = True
590
+ _supports_attention_backend = True
591
+
592
+ def _init_weights(self, module: torch.nn.Module) -> None:
593
+ std = self.config.initializer_range
594
+ if isinstance(module, nn.Linear):
595
+ module.weight.data.normal_(mean=0.0, std=std)
596
+ if module.bias is not None:
597
+ module.bias.data.zero_()
598
+ elif isinstance(module, nn.Embedding):
599
+ module.weight.data.normal_(mean=0.0, std=std)
600
+ if module.padding_idx is not None:
601
+ module.weight.data[module.padding_idx].zero_()
602
+ elif isinstance(module, Helium1RMSNorm):
603
+ module.weight.data.fill_(1.0)
604
+
605
+
606
+ class Helium1Model(Helium1PreTrainedModel):
607
+ """
608
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`HeliumDecoderLayer`]
609
+
610
+ Args:
611
+ config: Helium1CASAConfig
612
+ """
613
+
614
+ def __init__(self, config: Helium1CASAConfig):
615
+ Helium1PreTrainedModel.__init__(self, config)
616
+ self.training: bool
617
+ self._gradient_checkpointing_func: Callable
618
+ self.config = config
619
+ self.padding_idx = config.pad_token_id
620
+ self.vocab_size = config.vocab_size
621
+
622
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
623
+ self.layers = nn.ModuleList(
624
+ [HeliumDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
625
+ )
626
+ self.norm = Helium1RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
627
+ self.rotary_emb = HeliumRotaryEmbedding(config=config)
628
+ self.gradient_checkpointing = False
629
+
630
+ # Initialize weights and apply final processing
631
+ self.post_init()
632
+
633
+ def get_input_embeddings(self):
634
+ return self.embed_tokens
635
+
636
+ def set_input_embeddings(self, value: nn.Module) -> None:
637
+ self.embed_tokens = value
638
+
639
+ @can_return_tuple
640
+ def forward(
641
+ self,
642
+ input_ids: None | torch.LongTensor = None,
643
+ attention_mask: None | torch.Tensor = None,
644
+ position_ids: None | torch.Tensor = None,
645
+ past_key_values: None | DynamicCache = None,
646
+ inputs_embeds: None | torch.Tensor = None,
647
+ use_cache: None | bool = None,
648
+ output_attentions: None | bool = None,
649
+ output_hidden_states: None | bool = None,
650
+ cache_position: None | torch.Tensor = None,
651
+ # Insertion
652
+ image_tokens_mask: torch.Tensor | None = None,
653
+ # CASA
654
+ casa_handler: CASAAttentionHandler | None = None,
655
+ cu_seqlens: torch.Tensor | None = None,
656
+ **flash_attn_kwargs: Unpack[FlashAttentionKwargs],
657
+ ) -> BaseModelOutputWithPast:
658
+ output_attentions = (
659
+ output_attentions if output_attentions is not None else self.config.output_attentions
660
+ )
661
+ output_hidden_states = (
662
+ output_hidden_states
663
+ if output_hidden_states is not None
664
+ else self.config.output_hidden_states
665
+ )
666
+ use_cache = not self.training and (
667
+ use_cache if use_cache is not None else self.config.use_cache
668
+ )
669
+
670
+ if (input_ids is None) ^ (inputs_embeds is not None):
671
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
672
+
673
+ if self.gradient_checkpointing and self.training and use_cache:
674
+ print(
675
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
676
+ )
677
+ use_cache = False
678
+
679
+ # TODO (joao): remove this exception in v4.56 -- it exists for users that try to pass a legacy cache
680
+ if not isinstance(past_key_values, (type(None), Cache)):
681
+ raise ValueError("The `past_key_values` should be either a `Cache` object or `None`.")
682
+
683
+ if inputs_embeds is None:
684
+ inputs_embeds = self.embed_tokens(input_ids)
685
+ assert inputs_embeds is not None
686
+
687
+ if use_cache and past_key_values is None:
688
+ past_key_values = DynamicCache()
689
+
690
+ if cache_position is None:
691
+ past_seen_tokens = 0 if past_key_values is None else past_key_values._seen_tokens
692
+ assert inputs_embeds is not None
693
+ cache_position = torch.arange(
694
+ past_seen_tokens,
695
+ past_seen_tokens + inputs_embeds.shape[1],
696
+ device=inputs_embeds.device,
697
+ )
698
+ assert cache_position is not None
699
+
700
+ if position_ids is None:
701
+ position_ids = cache_position.unsqueeze(0)
702
+
703
+ # Get attention mask
704
+ causal_mask: None | torch.Tensor = self._update_causal_mask(
705
+ attention_mask,
706
+ inputs_embeds,
707
+ cache_position,
708
+ past_key_values,
709
+ output_attentions,
710
+ force_mask=False,
711
+ )
712
+
713
+ # create position embeddings to be shared across the decoder layers
714
+ hidden_states = inputs_embeds
715
+ position_embeddings = self.rotary_emb(inputs_embeds, position_ids)
716
+
717
+ # decoder layers
718
+ all_hidden_states = () if output_hidden_states else None
719
+ all_self_attns = () if output_attentions else None
720
+
721
+ for decoder_layer_idx, decoder_layer in enumerate(
722
+ self.layers[: self.config.num_hidden_layers]
723
+ ):
724
+ is_xa_layer = not self.config.xa_layers or decoder_layer_idx in self.config.xa_layers
725
+ if output_hidden_states:
726
+ if all_hidden_states is None:
727
+ all_hidden_states = ()
728
+ all_hidden_states += (hidden_states,)
729
+
730
+ if self.gradient_checkpointing and self.training:
731
+ layer_outputs = self._gradient_checkpointing_func(
732
+ partial(decoder_layer.__call__, **flash_attn_kwargs),
733
+ hidden_states,
734
+ causal_mask,
735
+ position_ids,
736
+ past_key_values,
737
+ output_attentions,
738
+ use_cache,
739
+ cache_position,
740
+ position_embeddings,
741
+ casa_handler if is_xa_layer else None,
742
+ cu_seqlens,
743
+ )
744
+ else:
745
+ layer_outputs = decoder_layer(
746
+ hidden_states,
747
+ attention_mask=causal_mask,
748
+ position_ids=position_ids,
749
+ past_key_values=past_key_values,
750
+ output_attentions=output_attentions,
751
+ use_cache=use_cache,
752
+ cache_position=cache_position,
753
+ position_embeddings=position_embeddings,
754
+ casa_handler=casa_handler if is_xa_layer else None,
755
+ cu_seqlens=cu_seqlens,
756
+ **flash_attn_kwargs,
757
+ )
758
+
759
+ hidden_states = layer_outputs[0]
760
+
761
+ if output_attentions:
762
+ if all_self_attns is None:
763
+ all_self_attns = ()
764
+ all_self_attns += (layer_outputs[1],)
765
+
766
+ hidden_states = self.norm(hidden_states)
767
+
768
+ # add hidden states from the last decoder layer
769
+ if output_hidden_states:
770
+ if all_hidden_states is None:
771
+ all_hidden_states = ()
772
+ all_hidden_states += (hidden_states,)
773
+
774
+ return BaseModelOutputWithPast(
775
+ last_hidden_state=hidden_states,
776
+ past_key_values=past_key_values if use_cache else None, # pyright: ignore[reportArgumentType]
777
+ hidden_states=all_hidden_states, # pyright: ignore[reportArgumentType]
778
+ attentions=all_self_attns,
779
+ )
780
+
781
+ def _update_causal_mask(
782
+ self,
783
+ attention_mask: torch.Tensor | None,
784
+ input_tensor: torch.Tensor,
785
+ cache_position: torch.Tensor,
786
+ past_key_values: None | DynamicCache | Cache,
787
+ output_attentions: bool = False,
788
+ force_mask: bool = False,
789
+ ) -> torch.Tensor | None:
790
+ if self.config._attn_implementation == "flex_attention":
791
+ if isinstance(attention_mask, torch.Tensor):
792
+ attention_mask = make_flex_block_causal_mask(attention_mask) # type: ignore
793
+ return attention_mask
794
+
795
+ assert attention_mask is None or isinstance(attention_mask, torch.Tensor)
796
+ if self.config._attn_implementation == "flash_attention_2":
797
+ if attention_mask is not None and (force_mask or (attention_mask == 0.0).any()):
798
+ return attention_mask
799
+ return None
800
+
801
+ # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
802
+ # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
803
+ # to infer the attention mask.
804
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
805
+ using_compilable_cache = (
806
+ past_key_values.is_compileable if past_key_values is not None else False
807
+ )
808
+
809
+ # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
810
+ if (
811
+ self.config._attn_implementation == "sdpa"
812
+ and not using_compilable_cache
813
+ and not output_attentions
814
+ ):
815
+ if not force_mask and AttentionMaskConverter._ignore_causal_mask_sdpa(
816
+ attention_mask,
817
+ inputs_embeds=input_tensor,
818
+ past_key_values_length=past_seen_tokens,
819
+ is_training=self.training,
820
+ ):
821
+ return None
822
+
823
+ dtype = input_tensor.dtype
824
+ sequence_length = input_tensor.shape[1]
825
+ if using_compilable_cache and past_key_values is not None:
826
+ target_length = past_key_values.get_max_cache_shape()
827
+ else:
828
+ target_length = (
829
+ attention_mask.shape[-1]
830
+ if isinstance(attention_mask, torch.Tensor)
831
+ else past_seen_tokens + sequence_length
832
+ )
833
+
834
+ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
835
+ assert target_length is not None
836
+ causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
837
+ attention_mask,
838
+ sequence_length=sequence_length,
839
+ target_length=target_length,
840
+ dtype=dtype,
841
+ cache_position=cache_position,
842
+ batch_size=input_tensor.shape[0],
843
+ )
844
+
845
+ if (
846
+ self.config._attn_implementation == "sdpa"
847
+ and attention_mask is not None
848
+ and attention_mask.device.type in ["cuda", "xpu", "npu"]
849
+ and not output_attentions
850
+ ):
851
+ # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
852
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
853
+ # Details: https://github.com/pytorch/pytorch/issues/110213
854
+ min_dtype = torch.finfo(dtype).min
855
+ causal_mask = AttentionMaskConverter._unmask_unattended(
856
+ type_cast(torch.FloatTensor, causal_mask), min_dtype
857
+ )
858
+
859
+ return causal_mask
860
+
861
+     @staticmethod
+     def _prepare_4d_causal_attention_mask_with_cache_position(
+         attention_mask: torch.Tensor | None,
+         sequence_length: int,
+         target_length: int,
+         dtype: torch.dtype,
+         cache_position: torch.Tensor,
+         batch_size: int,
+         **kwargs: Any,
+     ):
+         """
+         Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
+         `(batch_size, key_value_length)`; if the input `attention_mask` is already 4D, it is returned unchanged.
+
+         Args:
+             attention_mask (`torch.Tensor`):
+                 A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape
+                 `(batch_size, 1, query_length, key_value_length)`.
+             sequence_length (`int`):
+                 The sequence length being processed.
+             target_length (`int`):
+                 The target length: when generating with static cache, the mask should be as long as the static cache,
+                 to account for the 0 padding, i.e. the part of the cache that is not filled yet.
+             dtype (`torch.dtype`):
+                 The dtype to use for the 4D attention mask.
+             cache_position (`torch.Tensor`):
+                 Indices depicting the position of the input sequence tokens in the sequence.
+             batch_size (`int`):
+                 Batch size.
+         """
+         del kwargs
+         if attention_mask is not None and attention_mask.dim() == 4:
+             # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
+             causal_mask = attention_mask
+         else:
+             min_dtype = torch.finfo(dtype).min
+             causal_mask = torch.full(
+                 (sequence_length, target_length),
+                 fill_value=min_dtype,
+                 dtype=dtype,
+                 device=cache_position.device,
+             )
+             if sequence_length != 1:
+                 causal_mask = torch.triu(causal_mask, diagonal=1)
+             causal_mask *= torch.arange(
+                 target_length, device=cache_position.device
+             ) > cache_position.reshape(-1, 1)
+             causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
+             if attention_mask is not None:
+                 causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
+                 mask_length = attention_mask.shape[-1]
+                 padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[
+                     :, None, None, :
+                 ].to(causal_mask.device)
+                 padding_mask = padding_mask == 0
+                 causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
+                     padding_mask, min_dtype
+                 )
+
+         return causal_mask
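+
+     # Worked example for the helper above (illustrative values only): with attention_mask=None,
+     # batch_size=1, sequence_length=2, target_length=4 and cache_position=torch.tensor([2, 3]),
+     # the returned (1, 1, 2, 4) mask is, row per query,
+     #     position 2 -> [0, 0, 0, min_dtype]   # keys 0..2 visible, key 3 still masked
+     #     position 3 -> [0, 0, 0, 0]           # keys 0..3 visible
+     # where 0 means "attend" and min_dtype = torch.finfo(dtype).min means "blocked".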
+
+
+ class KwargsForCausalLM(FlashAttentionKwargs, LossKwargs): ...
+
+
+ class Helium1ForCausalLM(Helium1PreTrainedModel, GenerationMixin):
+     _tied_weights_keys = ["lm_head.weight"]
+     _tp_plan = {"lm_head": "colwise_rep"}
+     _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
+
+     def __init__(self, config: Helium1CASAConfig, **kwargs: Any) -> None:
+         del kwargs
+         super().__init__(config)
+         self.model: Helium1Model
+         self.model = Helium1Model(config)
+         self.vocab_size = config.vocab_size
+         self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+         self._loss_function = ForCausalLMLoss
+
+     def get_input_embeddings(self) -> nn.Module:
+         return self.model.embed_tokens
+
+     def set_input_embeddings(self, value: nn.Module) -> None:
+         self.model.embed_tokens = value
+
+     def get_output_embeddings(self) -> nn.Module:
+         return self.lm_head
+
+     def set_output_embeddings(self, new_embeddings: nn.Module) -> None:
+         self.lm_head = new_embeddings
+
+     def set_decoder(self, decoder: Helium1Model) -> None:
+         self.model = decoder
+
+     def get_decoder(self) -> Helium1Model:
+         return self.model
+
+     @can_return_tuple
+     def forward(
+         self,
+         input_ids: None | torch.LongTensor = None,
+         attention_mask: None | torch.Tensor = None,
+         position_ids: None | torch.LongTensor = None,
+         past_key_values: None | Cache = None,
+         inputs_embeds: None | torch.Tensor = None,
+         image_embeds: None | torch.Tensor | list[torch.Tensor] = None,
+         image_embeds_insertion_points: None | list[torch.Tensor] = None,
+         labels: None | torch.LongTensor = None,
+         use_cache: None | bool = None,
+         output_attentions: None | bool = None,
+         output_hidden_states: None | bool = None,
+         cache_position: None | torch.LongTensor = None,
+         logits_to_keep: int | torch.Tensor = 0,
+         # CASA
+         casa_windows_info: None | dict = None,
+         **kwargs: Unpack[KwargsForCausalLM],
+     ) -> CausalHeliumOutput:
+         r"""
+         Helium1 augmented with CASA layers.
+         """
+         output_attentions = (
+             output_attentions if output_attentions is not None else self.config.output_attentions
+         )
+         output_hidden_states = (
+             output_hidden_states
+             if output_hidden_states is not None
+             else self.config.output_hidden_states
+         )
+         if input_ids is not None:
+             assert inputs_embeds is None, (
+                 "Need to provide only one of `input_ids` or `inputs_embeds`."
+             )
+             inputs_embeds = self.model.embed_tokens(input_ids)
+         assert inputs_embeds is not None
+
+         # Setup image + text token fusion
+         bs, og_seq_len, _ = inputs_embeds.shape
+         image_tokens_mask: torch.Tensor | None = None
+         casa_handler: CASAAttentionHandler | None = None
+
+         num_image_tokens = -1
+         if image_embeds is not None:
+             num_image_tokens = sum(_x.shape[0] for _x in image_embeds)
+             assert image_embeds_insertion_points is not None, (
+                 "Missing image embeddings insertion points"
+             )
+             # B1. CASA layers: we need to init the shared handler
+             if self.model.config.casa_attention:
+                 casa_handler = CASAAttentionHandler(
+                     # for text tokens, we don't need the actual values
+                     inputs_embeds=torch.zeros_like(inputs_embeds),
+                     # for image embeddings, we put the real inputs as these stay fixed
+                     image_embeds=image_embeds,
+                     image_embeds_insertion_points=image_embeds_insertion_points,
+                     # the attention mask is only needed at inference / left padding
+                     attention_mask=None if self.training else attention_mask,
+                     rope_fn=self.model.rotary_emb,
+                     windows=self.model.config.casa_windows,
+                     use_asymetric_q_kv=self.model.config.casa_use_asymetric_qkv,
+                     # further params are fed to the function computing attention
+                     casa_windows_info=casa_windows_info,
+                 )
+             # B2. Direct image insertion
+             else:
+                 inputs_embeds, _, attention_mask, image_tokens_mask = insert_image_tokens(
+                     inputs_embeds=inputs_embeds,
+                     image_embeds=image_embeds,
+                     image_embeds_insertion_points=image_embeds_insertion_points,
+                     attention_mask=attention_mask,
+                     padding_side="right" if self.training else "left",
+                     recover_batch_dim=True,
+                 )
+
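+         # To illustrate the two branches above (rough numbers, not taken from this repo): with a
+         # 7-token prompt, one 4-token image and an insertion point after token 3, B2 splices the
+         # image tokens into the stream so `self.model` sees an 11-token sequence, while B1 keeps
+         # the 7 text tokens and instead exposes the 4 image tokens through `casa_handler`.
+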
+         del image_embeds
+         del input_ids
+         outputs: BaseModelOutputWithPast = self.model(
+             inputs_embeds=inputs_embeds,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_values=past_key_values,
+             use_cache=use_cache,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             cache_position=cache_position,
+             image_tokens_mask=image_tokens_mask,
+             casa_handler=casa_handler,
+             **kwargs,
+         )
+
+         hidden_states = outputs.last_hidden_state
+         assert hidden_states is not None
+         if image_tokens_mask is not None:
+             hidden_states = remove_image_tokens(hidden_states, image_tokens_mask)
+         # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+         slice_indices = (
+             slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
+         )
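+         # Note that `logits_to_keep=1` keeps only the last position (the usual decoding case),
+         # while the default 0 keeps every position, since `slice(-0, None)` spans the full sequence.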
+         logits = self.lm_head(hidden_states[:, slice_indices, :])
+
+         loss = None
+         if labels is not None:
+             loss = self.loss_function(
+                 logits=logits,
+                 labels=labels,
+                 vocab_size=self.config.vocab_size,
+                 **kwargs,
+             )
+         out = CausalHeliumOutput(
+             loss=loss,
+             logits=logits,
+             past_key_values=outputs.past_key_values,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+             num_image_tokens_log=torch.tensor(num_image_tokens).to(logits.device).to(torch.float),
+             num_text_tokens_log=torch.tensor(og_seq_len).to(logits.device).to(torch.float),
+         )
+         return out
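+
+
+ # Minimal usage sketch (illustrative only: it assumes the repo id below and that the accompanying
+ # processor / vision tower produce `image_embeds` and `image_embeds_insertion_points`; neither is
+ # guaranteed by this file alone):
+ #
+ #     model = AutoModelForCausalLM.from_pretrained("kyutai/CASA-Helium1-VL-2B", trust_remote_code=True)
+ #     out = model(input_ids=input_ids, image_embeds=[image_feats],
+ #                 image_embeds_insertion_points=[insertion_points])
+ #     next_token_id = out.logits[:, -1].argmax(dim=-1)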
model-00001-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:84b4a971906aaacc7c0acf8ac98d4f59afe073954284790855b70f8ab8488df3
+ size 4987411648
model-00002-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ea916a815ac0deb1a397aa6ef2edf5a8baca4811327597fccf2f21ef67cf4295
+ size 4993506144
model-00003-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e0bc7d4d08240d3de6fedfbe8711c16345c70e0fbebddfc89685cb361f208aeb
+ size 2195900304
model.safetensors.index.json ADDED
@@ -0,0 +1,793 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 12176723968
4
+ },
5
+ "weight_map": {
6
+ "image_prefix.enc.visual.blocks.0.attn.proj.bias": "model-00002-of-00003.safetensors",
7
+ "image_prefix.enc.visual.blocks.0.attn.proj.weight": "model-00002-of-00003.safetensors",
8
+ "image_prefix.enc.visual.blocks.0.attn.qkv.bias": "model-00002-of-00003.safetensors",
9
+ "image_prefix.enc.visual.blocks.0.attn.qkv.weight": "model-00002-of-00003.safetensors",
10
+ "image_prefix.enc.visual.blocks.0.mlp.down_proj.bias": "model-00002-of-00003.safetensors",
11
+ "image_prefix.enc.visual.blocks.0.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
12
+ "image_prefix.enc.visual.blocks.0.mlp.gate_proj.bias": "model-00002-of-00003.safetensors",
13
+ "image_prefix.enc.visual.blocks.0.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
14
+ "image_prefix.enc.visual.blocks.0.mlp.up_proj.bias": "model-00002-of-00003.safetensors",
15
+ "image_prefix.enc.visual.blocks.0.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
16
+ "image_prefix.enc.visual.blocks.0.norm1.weight": "model-00002-of-00003.safetensors",
17
+ "image_prefix.enc.visual.blocks.0.norm2.weight": "model-00002-of-00003.safetensors",
18
+ "image_prefix.enc.visual.blocks.1.attn.proj.bias": "model-00002-of-00003.safetensors",
19
+ "image_prefix.enc.visual.blocks.1.attn.proj.weight": "model-00002-of-00003.safetensors",
20
+ "image_prefix.enc.visual.blocks.1.attn.qkv.bias": "model-00002-of-00003.safetensors",
21
+ "image_prefix.enc.visual.blocks.1.attn.qkv.weight": "model-00002-of-00003.safetensors",
22
+ "image_prefix.enc.visual.blocks.1.mlp.down_proj.bias": "model-00002-of-00003.safetensors",
23
+ "image_prefix.enc.visual.blocks.1.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
24
+ "image_prefix.enc.visual.blocks.1.mlp.gate_proj.bias": "model-00002-of-00003.safetensors",
25
+ "image_prefix.enc.visual.blocks.1.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
26
+ "image_prefix.enc.visual.blocks.1.mlp.up_proj.bias": "model-00002-of-00003.safetensors",
27
+ "image_prefix.enc.visual.blocks.1.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
28
+ "image_prefix.enc.visual.blocks.1.norm1.weight": "model-00002-of-00003.safetensors",
29
+ "image_prefix.enc.visual.blocks.1.norm2.weight": "model-00002-of-00003.safetensors",
30
+ "image_prefix.enc.visual.blocks.10.attn.proj.bias": "model-00003-of-00003.safetensors",
31
+ "image_prefix.enc.visual.blocks.10.attn.proj.weight": "model-00003-of-00003.safetensors",
32
+ "image_prefix.enc.visual.blocks.10.attn.qkv.bias": "model-00003-of-00003.safetensors",
33
+ "image_prefix.enc.visual.blocks.10.attn.qkv.weight": "model-00003-of-00003.safetensors",
34
+ "image_prefix.enc.visual.blocks.10.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
35
+ "image_prefix.enc.visual.blocks.10.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
36
+ "image_prefix.enc.visual.blocks.10.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
37
+ "image_prefix.enc.visual.blocks.10.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
38
+ "image_prefix.enc.visual.blocks.10.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
39
+ "image_prefix.enc.visual.blocks.10.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
40
+ "image_prefix.enc.visual.blocks.10.norm1.weight": "model-00003-of-00003.safetensors",
41
+ "image_prefix.enc.visual.blocks.10.norm2.weight": "model-00003-of-00003.safetensors",
42
+ "image_prefix.enc.visual.blocks.11.attn.proj.bias": "model-00003-of-00003.safetensors",
43
+ "image_prefix.enc.visual.blocks.11.attn.proj.weight": "model-00003-of-00003.safetensors",
44
+ "image_prefix.enc.visual.blocks.11.attn.qkv.bias": "model-00003-of-00003.safetensors",
45
+ "image_prefix.enc.visual.blocks.11.attn.qkv.weight": "model-00003-of-00003.safetensors",
46
+ "image_prefix.enc.visual.blocks.11.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
47
+ "image_prefix.enc.visual.blocks.11.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
48
+ "image_prefix.enc.visual.blocks.11.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
49
+ "image_prefix.enc.visual.blocks.11.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
50
+ "image_prefix.enc.visual.blocks.11.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
51
+ "image_prefix.enc.visual.blocks.11.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
52
+ "image_prefix.enc.visual.blocks.11.norm1.weight": "model-00003-of-00003.safetensors",
53
+ "image_prefix.enc.visual.blocks.11.norm2.weight": "model-00003-of-00003.safetensors",
54
+ "image_prefix.enc.visual.blocks.12.attn.proj.bias": "model-00003-of-00003.safetensors",
55
+ "image_prefix.enc.visual.blocks.12.attn.proj.weight": "model-00003-of-00003.safetensors",
56
+ "image_prefix.enc.visual.blocks.12.attn.qkv.bias": "model-00003-of-00003.safetensors",
57
+ "image_prefix.enc.visual.blocks.12.attn.qkv.weight": "model-00003-of-00003.safetensors",
58
+ "image_prefix.enc.visual.blocks.12.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
59
+ "image_prefix.enc.visual.blocks.12.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
60
+ "image_prefix.enc.visual.blocks.12.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
61
+ "image_prefix.enc.visual.blocks.12.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
62
+ "image_prefix.enc.visual.blocks.12.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
63
+ "image_prefix.enc.visual.blocks.12.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
64
+ "image_prefix.enc.visual.blocks.12.norm1.weight": "model-00003-of-00003.safetensors",
65
+ "image_prefix.enc.visual.blocks.12.norm2.weight": "model-00003-of-00003.safetensors",
66
+ "image_prefix.enc.visual.blocks.13.attn.proj.bias": "model-00003-of-00003.safetensors",
67
+ "image_prefix.enc.visual.blocks.13.attn.proj.weight": "model-00003-of-00003.safetensors",
68
+ "image_prefix.enc.visual.blocks.13.attn.qkv.bias": "model-00003-of-00003.safetensors",
69
+ "image_prefix.enc.visual.blocks.13.attn.qkv.weight": "model-00003-of-00003.safetensors",
70
+ "image_prefix.enc.visual.blocks.13.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
71
+ "image_prefix.enc.visual.blocks.13.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
72
+ "image_prefix.enc.visual.blocks.13.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
73
+ "image_prefix.enc.visual.blocks.13.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
74
+ "image_prefix.enc.visual.blocks.13.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
75
+ "image_prefix.enc.visual.blocks.13.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
76
+ "image_prefix.enc.visual.blocks.13.norm1.weight": "model-00003-of-00003.safetensors",
77
+ "image_prefix.enc.visual.blocks.13.norm2.weight": "model-00003-of-00003.safetensors",
78
+ "image_prefix.enc.visual.blocks.14.attn.proj.bias": "model-00003-of-00003.safetensors",
79
+ "image_prefix.enc.visual.blocks.14.attn.proj.weight": "model-00003-of-00003.safetensors",
80
+ "image_prefix.enc.visual.blocks.14.attn.qkv.bias": "model-00003-of-00003.safetensors",
81
+ "image_prefix.enc.visual.blocks.14.attn.qkv.weight": "model-00003-of-00003.safetensors",
82
+ "image_prefix.enc.visual.blocks.14.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
83
+ "image_prefix.enc.visual.blocks.14.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
84
+ "image_prefix.enc.visual.blocks.14.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
85
+ "image_prefix.enc.visual.blocks.14.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
86
+ "image_prefix.enc.visual.blocks.14.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
87
+ "image_prefix.enc.visual.blocks.14.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
88
+ "image_prefix.enc.visual.blocks.14.norm1.weight": "model-00003-of-00003.safetensors",
89
+ "image_prefix.enc.visual.blocks.14.norm2.weight": "model-00003-of-00003.safetensors",
90
+ "image_prefix.enc.visual.blocks.15.attn.proj.bias": "model-00003-of-00003.safetensors",
91
+ "image_prefix.enc.visual.blocks.15.attn.proj.weight": "model-00003-of-00003.safetensors",
92
+ "image_prefix.enc.visual.blocks.15.attn.qkv.bias": "model-00003-of-00003.safetensors",
93
+ "image_prefix.enc.visual.blocks.15.attn.qkv.weight": "model-00003-of-00003.safetensors",
94
+ "image_prefix.enc.visual.blocks.15.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
95
+ "image_prefix.enc.visual.blocks.15.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
96
+ "image_prefix.enc.visual.blocks.15.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
97
+ "image_prefix.enc.visual.blocks.15.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
98
+ "image_prefix.enc.visual.blocks.15.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
99
+ "image_prefix.enc.visual.blocks.15.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
100
+ "image_prefix.enc.visual.blocks.15.norm1.weight": "model-00003-of-00003.safetensors",
101
+ "image_prefix.enc.visual.blocks.15.norm2.weight": "model-00003-of-00003.safetensors",
102
+ "image_prefix.enc.visual.blocks.16.attn.proj.bias": "model-00003-of-00003.safetensors",
103
+ "image_prefix.enc.visual.blocks.16.attn.proj.weight": "model-00003-of-00003.safetensors",
104
+ "image_prefix.enc.visual.blocks.16.attn.qkv.bias": "model-00003-of-00003.safetensors",
105
+ "image_prefix.enc.visual.blocks.16.attn.qkv.weight": "model-00003-of-00003.safetensors",
106
+ "image_prefix.enc.visual.blocks.16.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
107
+ "image_prefix.enc.visual.blocks.16.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
108
+ "image_prefix.enc.visual.blocks.16.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
109
+ "image_prefix.enc.visual.blocks.16.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
110
+ "image_prefix.enc.visual.blocks.16.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
111
+ "image_prefix.enc.visual.blocks.16.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
112
+ "image_prefix.enc.visual.blocks.16.norm1.weight": "model-00003-of-00003.safetensors",
113
+ "image_prefix.enc.visual.blocks.16.norm2.weight": "model-00003-of-00003.safetensors",
114
+ "image_prefix.enc.visual.blocks.17.attn.proj.bias": "model-00003-of-00003.safetensors",
115
+ "image_prefix.enc.visual.blocks.17.attn.proj.weight": "model-00003-of-00003.safetensors",
116
+ "image_prefix.enc.visual.blocks.17.attn.qkv.bias": "model-00003-of-00003.safetensors",
117
+ "image_prefix.enc.visual.blocks.17.attn.qkv.weight": "model-00003-of-00003.safetensors",
118
+ "image_prefix.enc.visual.blocks.17.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
119
+ "image_prefix.enc.visual.blocks.17.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
120
+ "image_prefix.enc.visual.blocks.17.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
121
+ "image_prefix.enc.visual.blocks.17.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
122
+ "image_prefix.enc.visual.blocks.17.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
123
+ "image_prefix.enc.visual.blocks.17.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
124
+ "image_prefix.enc.visual.blocks.17.norm1.weight": "model-00003-of-00003.safetensors",
125
+ "image_prefix.enc.visual.blocks.17.norm2.weight": "model-00003-of-00003.safetensors",
126
+ "image_prefix.enc.visual.blocks.18.attn.proj.bias": "model-00003-of-00003.safetensors",
127
+ "image_prefix.enc.visual.blocks.18.attn.proj.weight": "model-00003-of-00003.safetensors",
128
+ "image_prefix.enc.visual.blocks.18.attn.qkv.bias": "model-00003-of-00003.safetensors",
129
+ "image_prefix.enc.visual.blocks.18.attn.qkv.weight": "model-00003-of-00003.safetensors",
130
+ "image_prefix.enc.visual.blocks.18.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
131
+ "image_prefix.enc.visual.blocks.18.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
132
+ "image_prefix.enc.visual.blocks.18.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
133
+ "image_prefix.enc.visual.blocks.18.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
134
+ "image_prefix.enc.visual.blocks.18.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
135
+ "image_prefix.enc.visual.blocks.18.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
136
+ "image_prefix.enc.visual.blocks.18.norm1.weight": "model-00003-of-00003.safetensors",
137
+ "image_prefix.enc.visual.blocks.18.norm2.weight": "model-00003-of-00003.safetensors",
138
+ "image_prefix.enc.visual.blocks.19.attn.proj.bias": "model-00003-of-00003.safetensors",
139
+ "image_prefix.enc.visual.blocks.19.attn.proj.weight": "model-00003-of-00003.safetensors",
140
+ "image_prefix.enc.visual.blocks.19.attn.qkv.bias": "model-00003-of-00003.safetensors",
141
+ "image_prefix.enc.visual.blocks.19.attn.qkv.weight": "model-00003-of-00003.safetensors",
142
+ "image_prefix.enc.visual.blocks.19.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
143
+ "image_prefix.enc.visual.blocks.19.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
144
+ "image_prefix.enc.visual.blocks.19.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
145
+ "image_prefix.enc.visual.blocks.19.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
146
+ "image_prefix.enc.visual.blocks.19.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
147
+ "image_prefix.enc.visual.blocks.19.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
148
+ "image_prefix.enc.visual.blocks.19.norm1.weight": "model-00003-of-00003.safetensors",
149
+ "image_prefix.enc.visual.blocks.19.norm2.weight": "model-00003-of-00003.safetensors",
150
+ "image_prefix.enc.visual.blocks.2.attn.proj.bias": "model-00002-of-00003.safetensors",
151
+ "image_prefix.enc.visual.blocks.2.attn.proj.weight": "model-00002-of-00003.safetensors",
152
+ "image_prefix.enc.visual.blocks.2.attn.qkv.bias": "model-00002-of-00003.safetensors",
153
+ "image_prefix.enc.visual.blocks.2.attn.qkv.weight": "model-00002-of-00003.safetensors",
154
+ "image_prefix.enc.visual.blocks.2.mlp.down_proj.bias": "model-00002-of-00003.safetensors",
155
+ "image_prefix.enc.visual.blocks.2.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
156
+ "image_prefix.enc.visual.blocks.2.mlp.gate_proj.bias": "model-00002-of-00003.safetensors",
157
+ "image_prefix.enc.visual.blocks.2.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
158
+ "image_prefix.enc.visual.blocks.2.mlp.up_proj.bias": "model-00002-of-00003.safetensors",
159
+ "image_prefix.enc.visual.blocks.2.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
160
+ "image_prefix.enc.visual.blocks.2.norm1.weight": "model-00002-of-00003.safetensors",
161
+ "image_prefix.enc.visual.blocks.2.norm2.weight": "model-00002-of-00003.safetensors",
162
+ "image_prefix.enc.visual.blocks.20.attn.proj.bias": "model-00003-of-00003.safetensors",
163
+ "image_prefix.enc.visual.blocks.20.attn.proj.weight": "model-00003-of-00003.safetensors",
164
+ "image_prefix.enc.visual.blocks.20.attn.qkv.bias": "model-00003-of-00003.safetensors",
165
+ "image_prefix.enc.visual.blocks.20.attn.qkv.weight": "model-00003-of-00003.safetensors",
166
+ "image_prefix.enc.visual.blocks.20.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
167
+ "image_prefix.enc.visual.blocks.20.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
168
+ "image_prefix.enc.visual.blocks.20.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
169
+ "image_prefix.enc.visual.blocks.20.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
170
+ "image_prefix.enc.visual.blocks.20.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
171
+ "image_prefix.enc.visual.blocks.20.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
172
+ "image_prefix.enc.visual.blocks.20.norm1.weight": "model-00003-of-00003.safetensors",
173
+ "image_prefix.enc.visual.blocks.20.norm2.weight": "model-00003-of-00003.safetensors",
174
+ "image_prefix.enc.visual.blocks.21.attn.proj.bias": "model-00003-of-00003.safetensors",
175
+ "image_prefix.enc.visual.blocks.21.attn.proj.weight": "model-00003-of-00003.safetensors",
176
+ "image_prefix.enc.visual.blocks.21.attn.qkv.bias": "model-00003-of-00003.safetensors",
177
+ "image_prefix.enc.visual.blocks.21.attn.qkv.weight": "model-00003-of-00003.safetensors",
178
+ "image_prefix.enc.visual.blocks.21.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
179
+ "image_prefix.enc.visual.blocks.21.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
180
+ "image_prefix.enc.visual.blocks.21.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
181
+ "image_prefix.enc.visual.blocks.21.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
182
+ "image_prefix.enc.visual.blocks.21.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
183
+ "image_prefix.enc.visual.blocks.21.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
184
+ "image_prefix.enc.visual.blocks.21.norm1.weight": "model-00003-of-00003.safetensors",
185
+ "image_prefix.enc.visual.blocks.21.norm2.weight": "model-00003-of-00003.safetensors",
186
+ "image_prefix.enc.visual.blocks.22.attn.proj.bias": "model-00003-of-00003.safetensors",
187
+ "image_prefix.enc.visual.blocks.22.attn.proj.weight": "model-00003-of-00003.safetensors",
188
+ "image_prefix.enc.visual.blocks.22.attn.qkv.bias": "model-00003-of-00003.safetensors",
189
+ "image_prefix.enc.visual.blocks.22.attn.qkv.weight": "model-00003-of-00003.safetensors",
190
+ "image_prefix.enc.visual.blocks.22.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
191
+ "image_prefix.enc.visual.blocks.22.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
192
+ "image_prefix.enc.visual.blocks.22.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
193
+ "image_prefix.enc.visual.blocks.22.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
194
+ "image_prefix.enc.visual.blocks.22.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
195
+ "image_prefix.enc.visual.blocks.22.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
196
+ "image_prefix.enc.visual.blocks.22.norm1.weight": "model-00003-of-00003.safetensors",
197
+ "image_prefix.enc.visual.blocks.22.norm2.weight": "model-00003-of-00003.safetensors",
198
+ "image_prefix.enc.visual.blocks.23.attn.proj.bias": "model-00003-of-00003.safetensors",
199
+ "image_prefix.enc.visual.blocks.23.attn.proj.weight": "model-00003-of-00003.safetensors",
200
+ "image_prefix.enc.visual.blocks.23.attn.qkv.bias": "model-00003-of-00003.safetensors",
201
+ "image_prefix.enc.visual.blocks.23.attn.qkv.weight": "model-00003-of-00003.safetensors",
202
+ "image_prefix.enc.visual.blocks.23.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
203
+ "image_prefix.enc.visual.blocks.23.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
204
+ "image_prefix.enc.visual.blocks.23.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
205
+ "image_prefix.enc.visual.blocks.23.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
206
+ "image_prefix.enc.visual.blocks.23.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
207
+ "image_prefix.enc.visual.blocks.23.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
208
+ "image_prefix.enc.visual.blocks.23.norm1.weight": "model-00003-of-00003.safetensors",
209
+ "image_prefix.enc.visual.blocks.23.norm2.weight": "model-00003-of-00003.safetensors",
210
+ "image_prefix.enc.visual.blocks.24.attn.proj.bias": "model-00003-of-00003.safetensors",
211
+ "image_prefix.enc.visual.blocks.24.attn.proj.weight": "model-00003-of-00003.safetensors",
212
+ "image_prefix.enc.visual.blocks.24.attn.qkv.bias": "model-00003-of-00003.safetensors",
213
+ "image_prefix.enc.visual.blocks.24.attn.qkv.weight": "model-00003-of-00003.safetensors",
214
+ "image_prefix.enc.visual.blocks.24.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
215
+ "image_prefix.enc.visual.blocks.24.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
216
+ "image_prefix.enc.visual.blocks.24.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
217
+ "image_prefix.enc.visual.blocks.24.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
218
+ "image_prefix.enc.visual.blocks.24.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
219
+ "image_prefix.enc.visual.blocks.24.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
220
+ "image_prefix.enc.visual.blocks.24.norm1.weight": "model-00003-of-00003.safetensors",
221
+ "image_prefix.enc.visual.blocks.24.norm2.weight": "model-00003-of-00003.safetensors",
222
+ "image_prefix.enc.visual.blocks.25.attn.proj.bias": "model-00003-of-00003.safetensors",
223
+ "image_prefix.enc.visual.blocks.25.attn.proj.weight": "model-00003-of-00003.safetensors",
224
+ "image_prefix.enc.visual.blocks.25.attn.qkv.bias": "model-00003-of-00003.safetensors",
225
+ "image_prefix.enc.visual.blocks.25.attn.qkv.weight": "model-00003-of-00003.safetensors",
226
+ "image_prefix.enc.visual.blocks.25.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
227
+ "image_prefix.enc.visual.blocks.25.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
228
+ "image_prefix.enc.visual.blocks.25.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
229
+ "image_prefix.enc.visual.blocks.25.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
230
+ "image_prefix.enc.visual.blocks.25.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
231
+ "image_prefix.enc.visual.blocks.25.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
232
+ "image_prefix.enc.visual.blocks.25.norm1.weight": "model-00003-of-00003.safetensors",
233
+ "image_prefix.enc.visual.blocks.25.norm2.weight": "model-00003-of-00003.safetensors",
234
+ "image_prefix.enc.visual.blocks.26.attn.proj.bias": "model-00003-of-00003.safetensors",
235
+ "image_prefix.enc.visual.blocks.26.attn.proj.weight": "model-00003-of-00003.safetensors",
236
+ "image_prefix.enc.visual.blocks.26.attn.qkv.bias": "model-00003-of-00003.safetensors",
237
+ "image_prefix.enc.visual.blocks.26.attn.qkv.weight": "model-00003-of-00003.safetensors",
238
+ "image_prefix.enc.visual.blocks.26.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
239
+ "image_prefix.enc.visual.blocks.26.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
240
+ "image_prefix.enc.visual.blocks.26.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
241
+ "image_prefix.enc.visual.blocks.26.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
242
+ "image_prefix.enc.visual.blocks.26.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
243
+ "image_prefix.enc.visual.blocks.26.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
244
+ "image_prefix.enc.visual.blocks.26.norm1.weight": "model-00003-of-00003.safetensors",
245
+ "image_prefix.enc.visual.blocks.26.norm2.weight": "model-00003-of-00003.safetensors",
246
+ "image_prefix.enc.visual.blocks.27.attn.proj.bias": "model-00003-of-00003.safetensors",
247
+ "image_prefix.enc.visual.blocks.27.attn.proj.weight": "model-00003-of-00003.safetensors",
248
+ "image_prefix.enc.visual.blocks.27.attn.qkv.bias": "model-00003-of-00003.safetensors",
249
+ "image_prefix.enc.visual.blocks.27.attn.qkv.weight": "model-00003-of-00003.safetensors",
250
+ "image_prefix.enc.visual.blocks.27.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
251
+ "image_prefix.enc.visual.blocks.27.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
252
+ "image_prefix.enc.visual.blocks.27.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
253
+ "image_prefix.enc.visual.blocks.27.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
254
+ "image_prefix.enc.visual.blocks.27.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
255
+ "image_prefix.enc.visual.blocks.27.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
256
+ "image_prefix.enc.visual.blocks.27.norm1.weight": "model-00003-of-00003.safetensors",
257
+ "image_prefix.enc.visual.blocks.27.norm2.weight": "model-00003-of-00003.safetensors",
258
+ "image_prefix.enc.visual.blocks.28.attn.proj.bias": "model-00003-of-00003.safetensors",
259
+ "image_prefix.enc.visual.blocks.28.attn.proj.weight": "model-00003-of-00003.safetensors",
260
+ "image_prefix.enc.visual.blocks.28.attn.qkv.bias": "model-00003-of-00003.safetensors",
261
+ "image_prefix.enc.visual.blocks.28.attn.qkv.weight": "model-00003-of-00003.safetensors",
262
+ "image_prefix.enc.visual.blocks.28.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
263
+ "image_prefix.enc.visual.blocks.28.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
264
+ "image_prefix.enc.visual.blocks.28.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
265
+ "image_prefix.enc.visual.blocks.28.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
266
+ "image_prefix.enc.visual.blocks.28.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
267
+ "image_prefix.enc.visual.blocks.28.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
268
+ "image_prefix.enc.visual.blocks.28.norm1.weight": "model-00003-of-00003.safetensors",
269
+ "image_prefix.enc.visual.blocks.28.norm2.weight": "model-00003-of-00003.safetensors",
270
+ "image_prefix.enc.visual.blocks.29.attn.proj.bias": "model-00003-of-00003.safetensors",
271
+ "image_prefix.enc.visual.blocks.29.attn.proj.weight": "model-00003-of-00003.safetensors",
272
+ "image_prefix.enc.visual.blocks.29.attn.qkv.bias": "model-00003-of-00003.safetensors",
273
+ "image_prefix.enc.visual.blocks.29.attn.qkv.weight": "model-00003-of-00003.safetensors",
274
+ "image_prefix.enc.visual.blocks.29.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
275
+ "image_prefix.enc.visual.blocks.29.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
276
+ "image_prefix.enc.visual.blocks.29.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
277
+ "image_prefix.enc.visual.blocks.29.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
278
+ "image_prefix.enc.visual.blocks.29.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
279
+ "image_prefix.enc.visual.blocks.29.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
280
+ "image_prefix.enc.visual.blocks.29.norm1.weight": "model-00003-of-00003.safetensors",
281
+ "image_prefix.enc.visual.blocks.29.norm2.weight": "model-00003-of-00003.safetensors",
282
+ "image_prefix.enc.visual.blocks.3.attn.proj.bias": "model-00002-of-00003.safetensors",
283
+ "image_prefix.enc.visual.blocks.3.attn.proj.weight": "model-00002-of-00003.safetensors",
284
+ "image_prefix.enc.visual.blocks.3.attn.qkv.bias": "model-00002-of-00003.safetensors",
285
+ "image_prefix.enc.visual.blocks.3.attn.qkv.weight": "model-00002-of-00003.safetensors",
286
+ "image_prefix.enc.visual.blocks.3.mlp.down_proj.bias": "model-00002-of-00003.safetensors",
287
+ "image_prefix.enc.visual.blocks.3.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
288
+ "image_prefix.enc.visual.blocks.3.mlp.gate_proj.bias": "model-00002-of-00003.safetensors",
289
+ "image_prefix.enc.visual.blocks.3.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
290
+ "image_prefix.enc.visual.blocks.3.mlp.up_proj.bias": "model-00002-of-00003.safetensors",
291
+ "image_prefix.enc.visual.blocks.3.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
292
+ "image_prefix.enc.visual.blocks.3.norm1.weight": "model-00002-of-00003.safetensors",
293
+ "image_prefix.enc.visual.blocks.3.norm2.weight": "model-00002-of-00003.safetensors",
294
+ "image_prefix.enc.visual.blocks.30.attn.proj.bias": "model-00003-of-00003.safetensors",
295
+ "image_prefix.enc.visual.blocks.30.attn.proj.weight": "model-00003-of-00003.safetensors",
296
+ "image_prefix.enc.visual.blocks.30.attn.qkv.bias": "model-00003-of-00003.safetensors",
297
+ "image_prefix.enc.visual.blocks.30.attn.qkv.weight": "model-00003-of-00003.safetensors",
298
+ "image_prefix.enc.visual.blocks.30.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
299
+ "image_prefix.enc.visual.blocks.30.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
300
+ "image_prefix.enc.visual.blocks.30.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
301
+ "image_prefix.enc.visual.blocks.30.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
302
+ "image_prefix.enc.visual.blocks.30.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
303
+ "image_prefix.enc.visual.blocks.30.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
304
+ "image_prefix.enc.visual.blocks.30.norm1.weight": "model-00003-of-00003.safetensors",
305
+ "image_prefix.enc.visual.blocks.30.norm2.weight": "model-00003-of-00003.safetensors",
306
+ "image_prefix.enc.visual.blocks.31.attn.proj.bias": "model-00003-of-00003.safetensors",
307
+ "image_prefix.enc.visual.blocks.31.attn.proj.weight": "model-00003-of-00003.safetensors",
308
+ "image_prefix.enc.visual.blocks.31.attn.qkv.bias": "model-00003-of-00003.safetensors",
309
+ "image_prefix.enc.visual.blocks.31.attn.qkv.weight": "model-00003-of-00003.safetensors",
310
+ "image_prefix.enc.visual.blocks.31.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
311
+ "image_prefix.enc.visual.blocks.31.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
312
+ "image_prefix.enc.visual.blocks.31.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
313
+ "image_prefix.enc.visual.blocks.31.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
314
+ "image_prefix.enc.visual.blocks.31.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
315
+ "image_prefix.enc.visual.blocks.31.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
316
+ "image_prefix.enc.visual.blocks.31.norm1.weight": "model-00003-of-00003.safetensors",
317
+ "image_prefix.enc.visual.blocks.31.norm2.weight": "model-00003-of-00003.safetensors",
318
+ "image_prefix.enc.visual.blocks.4.attn.proj.bias": "model-00002-of-00003.safetensors",
319
+ "image_prefix.enc.visual.blocks.4.attn.proj.weight": "model-00002-of-00003.safetensors",
320
+ "image_prefix.enc.visual.blocks.4.attn.qkv.bias": "model-00002-of-00003.safetensors",
321
+ "image_prefix.enc.visual.blocks.4.attn.qkv.weight": "model-00002-of-00003.safetensors",
322
+ "image_prefix.enc.visual.blocks.4.mlp.down_proj.bias": "model-00002-of-00003.safetensors",
323
+ "image_prefix.enc.visual.blocks.4.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
324
+ "image_prefix.enc.visual.blocks.4.mlp.gate_proj.bias": "model-00002-of-00003.safetensors",
325
+ "image_prefix.enc.visual.blocks.4.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
326
+ "image_prefix.enc.visual.blocks.4.mlp.up_proj.bias": "model-00002-of-00003.safetensors",
327
+ "image_prefix.enc.visual.blocks.4.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
328
+ "image_prefix.enc.visual.blocks.4.norm1.weight": "model-00002-of-00003.safetensors",
329
+ "image_prefix.enc.visual.blocks.4.norm2.weight": "model-00002-of-00003.safetensors",
330
+ "image_prefix.enc.visual.blocks.5.attn.proj.bias": "model-00002-of-00003.safetensors",
331
+ "image_prefix.enc.visual.blocks.5.attn.proj.weight": "model-00002-of-00003.safetensors",
332
+ "image_prefix.enc.visual.blocks.5.attn.qkv.bias": "model-00002-of-00003.safetensors",
333
+ "image_prefix.enc.visual.blocks.5.attn.qkv.weight": "model-00002-of-00003.safetensors",
334
+ "image_prefix.enc.visual.blocks.5.mlp.down_proj.bias": "model-00002-of-00003.safetensors",
335
+ "image_prefix.enc.visual.blocks.5.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
336
+ "image_prefix.enc.visual.blocks.5.mlp.gate_proj.bias": "model-00002-of-00003.safetensors",
337
+ "image_prefix.enc.visual.blocks.5.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
338
+ "image_prefix.enc.visual.blocks.5.mlp.up_proj.bias": "model-00002-of-00003.safetensors",
339
+ "image_prefix.enc.visual.blocks.5.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
340
+ "image_prefix.enc.visual.blocks.5.norm1.weight": "model-00002-of-00003.safetensors",
341
+ "image_prefix.enc.visual.blocks.5.norm2.weight": "model-00002-of-00003.safetensors",
342
+ "image_prefix.enc.visual.blocks.6.attn.proj.bias": "model-00003-of-00003.safetensors",
343
+ "image_prefix.enc.visual.blocks.6.attn.proj.weight": "model-00003-of-00003.safetensors",
344
+ "image_prefix.enc.visual.blocks.6.attn.qkv.bias": "model-00003-of-00003.safetensors",
345
+ "image_prefix.enc.visual.blocks.6.attn.qkv.weight": "model-00003-of-00003.safetensors",
346
+ "image_prefix.enc.visual.blocks.6.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
347
+ "image_prefix.enc.visual.blocks.6.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
348
+ "image_prefix.enc.visual.blocks.6.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
349
+ "image_prefix.enc.visual.blocks.6.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
350
+ "image_prefix.enc.visual.blocks.6.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
351
+ "image_prefix.enc.visual.blocks.6.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
352
+ "image_prefix.enc.visual.blocks.6.norm1.weight": "model-00002-of-00003.safetensors",
353
+ "image_prefix.enc.visual.blocks.6.norm2.weight": "model-00002-of-00003.safetensors",
354
+ "image_prefix.enc.visual.blocks.7.attn.proj.bias": "model-00003-of-00003.safetensors",
355
+ "image_prefix.enc.visual.blocks.7.attn.proj.weight": "model-00003-of-00003.safetensors",
356
+ "image_prefix.enc.visual.blocks.7.attn.qkv.bias": "model-00003-of-00003.safetensors",
357
+ "image_prefix.enc.visual.blocks.7.attn.qkv.weight": "model-00003-of-00003.safetensors",
358
+ "image_prefix.enc.visual.blocks.7.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
359
+ "image_prefix.enc.visual.blocks.7.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
360
+ "image_prefix.enc.visual.blocks.7.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
361
+ "image_prefix.enc.visual.blocks.7.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
362
+ "image_prefix.enc.visual.blocks.7.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
363
+ "image_prefix.enc.visual.blocks.7.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
364
+ "image_prefix.enc.visual.blocks.7.norm1.weight": "model-00003-of-00003.safetensors",
365
+ "image_prefix.enc.visual.blocks.7.norm2.weight": "model-00003-of-00003.safetensors",
366
+ "image_prefix.enc.visual.blocks.8.attn.proj.bias": "model-00003-of-00003.safetensors",
367
+ "image_prefix.enc.visual.blocks.8.attn.proj.weight": "model-00003-of-00003.safetensors",
368
+ "image_prefix.enc.visual.blocks.8.attn.qkv.bias": "model-00003-of-00003.safetensors",
369
+ "image_prefix.enc.visual.blocks.8.attn.qkv.weight": "model-00003-of-00003.safetensors",
370
+ "image_prefix.enc.visual.blocks.8.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
371
+ "image_prefix.enc.visual.blocks.8.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
372
+ "image_prefix.enc.visual.blocks.8.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
373
+ "image_prefix.enc.visual.blocks.8.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
374
+ "image_prefix.enc.visual.blocks.8.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
375
+ "image_prefix.enc.visual.blocks.8.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
376
+ "image_prefix.enc.visual.blocks.8.norm1.weight": "model-00003-of-00003.safetensors",
377
+ "image_prefix.enc.visual.blocks.8.norm2.weight": "model-00003-of-00003.safetensors",
378
+ "image_prefix.enc.visual.blocks.9.attn.proj.bias": "model-00003-of-00003.safetensors",
379
+ "image_prefix.enc.visual.blocks.9.attn.proj.weight": "model-00003-of-00003.safetensors",
380
+ "image_prefix.enc.visual.blocks.9.attn.qkv.bias": "model-00003-of-00003.safetensors",
381
+ "image_prefix.enc.visual.blocks.9.attn.qkv.weight": "model-00003-of-00003.safetensors",
382
+ "image_prefix.enc.visual.blocks.9.mlp.down_proj.bias": "model-00003-of-00003.safetensors",
383
+ "image_prefix.enc.visual.blocks.9.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
384
+ "image_prefix.enc.visual.blocks.9.mlp.gate_proj.bias": "model-00003-of-00003.safetensors",
385
+ "image_prefix.enc.visual.blocks.9.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
386
+ "image_prefix.enc.visual.blocks.9.mlp.up_proj.bias": "model-00003-of-00003.safetensors",
387
+ "image_prefix.enc.visual.blocks.9.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
388
+ "image_prefix.enc.visual.blocks.9.norm1.weight": "model-00003-of-00003.safetensors",
389
+ "image_prefix.enc.visual.blocks.9.norm2.weight": "model-00003-of-00003.safetensors",
390
+ "image_prefix.enc.visual.merger.ln_q.weight": "model-00003-of-00003.safetensors",
391
+ "image_prefix.enc.visual.merger.mlp.0.bias": "model-00003-of-00003.safetensors",
392
+ "image_prefix.enc.visual.merger.mlp.0.weight": "model-00003-of-00003.safetensors",
393
+ "image_prefix.enc.visual.merger.mlp.2.bias": "model-00003-of-00003.safetensors",
394
+ "image_prefix.enc.visual.merger.mlp.2.weight": "model-00003-of-00003.safetensors",
395
+ "image_prefix.enc.visual.patch_embed.proj.weight": "model-00002-of-00003.safetensors",
396
+ "image_prefix.norm_extra.weight": "model-00003-of-00003.safetensors",
397
+ "lm_head.weight": "model-00002-of-00003.safetensors",
398
+ "model.embed_tokens.weight": "model-00001-of-00003.safetensors",
399
+ "model.layers.0.casa_attn.k_proj_casa.weight": "model-00001-of-00003.safetensors",
400
+ "model.layers.0.casa_attn.o_proj_casa.weight": "model-00001-of-00003.safetensors",
401
+ "model.layers.0.casa_attn.q_proj_casa.weight": "model-00001-of-00003.safetensors",
402
+ "model.layers.0.casa_attn.v_proj_casa.weight": "model-00001-of-00003.safetensors",
403
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00003.safetensors",
404
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
405
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
406
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
407
+ "model.layers.0.norm_cross.weight": "model-00001-of-00003.safetensors",
408
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
409
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
410
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
411
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
412
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
413
+ "model.layers.1.casa_attn.k_proj_casa.weight": "model-00001-of-00003.safetensors",
414
+ "model.layers.1.casa_attn.o_proj_casa.weight": "model-00001-of-00003.safetensors",
415
+ "model.layers.1.casa_attn.q_proj_casa.weight": "model-00001-of-00003.safetensors",
416
+ "model.layers.1.casa_attn.v_proj_casa.weight": "model-00001-of-00003.safetensors",
417
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00003.safetensors",
418
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
419
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
420
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
421
+ "model.layers.1.norm_cross.weight": "model-00001-of-00003.safetensors",
422
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
423
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
424
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
425
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
426
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
427
+ "model.layers.10.casa_attn.k_proj_casa.weight": "model-00001-of-00003.safetensors",
428
+ "model.layers.10.casa_attn.o_proj_casa.weight": "model-00001-of-00003.safetensors",
429
+ "model.layers.10.casa_attn.q_proj_casa.weight": "model-00001-of-00003.safetensors",
430
+ "model.layers.10.casa_attn.v_proj_casa.weight": "model-00001-of-00003.safetensors",
431
+ "model.layers.10.input_layernorm.weight": "model-00001-of-00003.safetensors",
432
+ "model.layers.10.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
433
+ "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
434
+ "model.layers.10.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
435
+ "model.layers.10.norm_cross.weight": "model-00001-of-00003.safetensors",
436
+ "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
437
+ "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
438
+ "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
439
+ "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
440
+ "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
441
+ "model.layers.11.casa_attn.k_proj_casa.weight": "model-00001-of-00003.safetensors",
442
+ "model.layers.11.casa_attn.o_proj_casa.weight": "model-00001-of-00003.safetensors",
443
+ "model.layers.11.casa_attn.q_proj_casa.weight": "model-00001-of-00003.safetensors",
444
+ "model.layers.11.casa_attn.v_proj_casa.weight": "model-00001-of-00003.safetensors",
445
+ "model.layers.11.input_layernorm.weight": "model-00001-of-00003.safetensors",
446
+ "model.layers.11.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
447
+ "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
448
+ "model.layers.11.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
449
+ "model.layers.11.norm_cross.weight": "model-00001-of-00003.safetensors",
450
+ "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
451
+ "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
452
+ "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
453
+ "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
454
+ "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
455
+ "model.layers.12.casa_attn.k_proj_casa.weight": "model-00001-of-00003.safetensors",
456
+ "model.layers.12.casa_attn.o_proj_casa.weight": "model-00001-of-00003.safetensors",
457
+ "model.layers.12.casa_attn.q_proj_casa.weight": "model-00001-of-00003.safetensors",
458
+ "model.layers.12.casa_attn.v_proj_casa.weight": "model-00001-of-00003.safetensors",
459
+ "model.layers.12.input_layernorm.weight": "model-00001-of-00003.safetensors",
460
+ "model.layers.12.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
461
+ "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
462
+ "model.layers.12.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
463
+ "model.layers.12.norm_cross.weight": "model-00001-of-00003.safetensors",
464
+ "model.layers.12.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
465
+ "model.layers.12.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
466
+ "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
467
+ "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
468
+ "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
469
+ "model.layers.13.casa_attn.k_proj_casa.weight": "model-00001-of-00003.safetensors",
470
+ "model.layers.13.casa_attn.o_proj_casa.weight": "model-00001-of-00003.safetensors",
471
+ "model.layers.13.casa_attn.q_proj_casa.weight": "model-00001-of-00003.safetensors",
472
+ "model.layers.13.casa_attn.v_proj_casa.weight": "model-00001-of-00003.safetensors",
473
+ "model.layers.13.input_layernorm.weight": "model-00001-of-00003.safetensors",
474
+ "model.layers.13.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
475
+ "model.layers.13.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
476
+ "model.layers.13.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
477
+ "model.layers.13.norm_cross.weight": "model-00001-of-00003.safetensors",
478
+ "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
479
+ "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
480
+ "model.layers.13.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
481
+ "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
482
+ "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
483
+ "model.layers.14.casa_attn.k_proj_casa.weight": "model-00002-of-00003.safetensors",
484
+ "model.layers.14.casa_attn.o_proj_casa.weight": "model-00002-of-00003.safetensors",
485
+ "model.layers.14.casa_attn.q_proj_casa.weight": "model-00002-of-00003.safetensors",
486
+ "model.layers.14.casa_attn.v_proj_casa.weight": "model-00002-of-00003.safetensors",
487
+ "model.layers.14.input_layernorm.weight": "model-00001-of-00003.safetensors",
488
+ "model.layers.14.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
489
+ "model.layers.14.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
490
+ "model.layers.14.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
491
+ "model.layers.14.norm_cross.weight": "model-00002-of-00003.safetensors",
492
+ "model.layers.14.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
493
+ "model.layers.14.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
494
+ "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
495
+ "model.layers.14.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
496
+ "model.layers.14.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
497
+ "model.layers.15.casa_attn.k_proj_casa.weight": "model-00002-of-00003.safetensors",
498
+ "model.layers.15.casa_attn.o_proj_casa.weight": "model-00002-of-00003.safetensors",
499
+ "model.layers.15.casa_attn.q_proj_casa.weight": "model-00002-of-00003.safetensors",
500
+ "model.layers.15.casa_attn.v_proj_casa.weight": "model-00002-of-00003.safetensors",
501
+ "model.layers.15.input_layernorm.weight": "model-00002-of-00003.safetensors",
502
+ "model.layers.15.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
503
+ "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
504
+ "model.layers.15.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
505
+ "model.layers.15.norm_cross.weight": "model-00002-of-00003.safetensors",
506
+ "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
507
+ "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
508
+ "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
509
+ "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
510
+ "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
511
+ "model.layers.16.casa_attn.k_proj_casa.weight": "model-00002-of-00003.safetensors",
512
+ "model.layers.16.casa_attn.o_proj_casa.weight": "model-00002-of-00003.safetensors",
513
+ "model.layers.16.casa_attn.q_proj_casa.weight": "model-00002-of-00003.safetensors",
514
+ "model.layers.16.casa_attn.v_proj_casa.weight": "model-00002-of-00003.safetensors",
515
+ "model.layers.16.input_layernorm.weight": "model-00002-of-00003.safetensors",
516
+ "model.layers.16.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
517
+ "model.layers.16.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
518
+ "model.layers.16.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
519
+ "model.layers.16.norm_cross.weight": "model-00002-of-00003.safetensors",
520
+ "model.layers.16.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
521
+ "model.layers.16.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
522
+ "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
523
+ "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
524
+ "model.layers.16.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
525
+ "model.layers.17.casa_attn.k_proj_casa.weight": "model-00002-of-00003.safetensors",
526
+ "model.layers.17.casa_attn.o_proj_casa.weight": "model-00002-of-00003.safetensors",
527
+ "model.layers.17.casa_attn.q_proj_casa.weight": "model-00002-of-00003.safetensors",
528
+ "model.layers.17.casa_attn.v_proj_casa.weight": "model-00002-of-00003.safetensors",
529
+ "model.layers.17.input_layernorm.weight": "model-00002-of-00003.safetensors",
530
+ "model.layers.17.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
531
+ "model.layers.17.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
532
+ "model.layers.17.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
533
+ "model.layers.17.norm_cross.weight": "model-00002-of-00003.safetensors",
534
+ "model.layers.17.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
535
+ "model.layers.17.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
536
+ "model.layers.17.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
537
+ "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
538
+ "model.layers.17.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
539
+ "model.layers.18.casa_attn.k_proj_casa.weight": "model-00002-of-00003.safetensors",
540
+ "model.layers.18.casa_attn.o_proj_casa.weight": "model-00002-of-00003.safetensors",
541
+ "model.layers.18.casa_attn.q_proj_casa.weight": "model-00002-of-00003.safetensors",
542
+ "model.layers.18.casa_attn.v_proj_casa.weight": "model-00002-of-00003.safetensors",
543
+ "model.layers.18.input_layernorm.weight": "model-00002-of-00003.safetensors",
544
+ "model.layers.18.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
545
+ "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
546
+ "model.layers.18.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
547
+ "model.layers.18.norm_cross.weight": "model-00002-of-00003.safetensors",
548
+ "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
549
+ "model.layers.18.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
550
+ "model.layers.18.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
551
+ "model.layers.18.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
552
+ "model.layers.18.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
553
+ "model.layers.19.casa_attn.k_proj_casa.weight": "model-00002-of-00003.safetensors",
554
+ "model.layers.19.casa_attn.o_proj_casa.weight": "model-00002-of-00003.safetensors",
555
+ "model.layers.19.casa_attn.q_proj_casa.weight": "model-00002-of-00003.safetensors",
556
+ "model.layers.19.casa_attn.v_proj_casa.weight": "model-00002-of-00003.safetensors",
557
+ "model.layers.19.input_layernorm.weight": "model-00002-of-00003.safetensors",
558
+ "model.layers.19.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
559
+ "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
560
+ "model.layers.19.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
561
+ "model.layers.19.norm_cross.weight": "model-00002-of-00003.safetensors",
562
+ "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
563
+ "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
564
+ "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
565
+ "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
566
+ "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
567
+ "model.layers.2.casa_attn.k_proj_casa.weight": "model-00001-of-00003.safetensors",
568
+ "model.layers.2.casa_attn.o_proj_casa.weight": "model-00001-of-00003.safetensors",
569
+ "model.layers.2.casa_attn.q_proj_casa.weight": "model-00001-of-00003.safetensors",
570
+ "model.layers.2.casa_attn.v_proj_casa.weight": "model-00001-of-00003.safetensors",
571
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00003.safetensors",
572
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
573
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
574
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
575
+ "model.layers.2.norm_cross.weight": "model-00001-of-00003.safetensors",
576
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
577
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
578
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
579
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
580
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
581
+ "model.layers.20.casa_attn.k_proj_casa.weight": "model-00002-of-00003.safetensors",
582
+ "model.layers.20.casa_attn.o_proj_casa.weight": "model-00002-of-00003.safetensors",
583
+ "model.layers.20.casa_attn.q_proj_casa.weight": "model-00002-of-00003.safetensors",
584
+ "model.layers.20.casa_attn.v_proj_casa.weight": "model-00002-of-00003.safetensors",
585
+ "model.layers.20.input_layernorm.weight": "model-00002-of-00003.safetensors",
586
+ "model.layers.20.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
587
+ "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
588
+ "model.layers.20.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
589
+ "model.layers.20.norm_cross.weight": "model-00002-of-00003.safetensors",
590
+ "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
591
+ "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
592
+ "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
593
+ "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
594
+ "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
595
+ "model.layers.21.casa_attn.k_proj_casa.weight": "model-00002-of-00003.safetensors",
596
+ "model.layers.21.casa_attn.o_proj_casa.weight": "model-00002-of-00003.safetensors",
597
+ "model.layers.21.casa_attn.q_proj_casa.weight": "model-00002-of-00003.safetensors",
598
+ "model.layers.21.casa_attn.v_proj_casa.weight": "model-00002-of-00003.safetensors",
599
+ "model.layers.21.input_layernorm.weight": "model-00002-of-00003.safetensors",
600
+ "model.layers.21.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
601
+ "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
602
+ "model.layers.21.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
603
+ "model.layers.21.norm_cross.weight": "model-00002-of-00003.safetensors",
604
+ "model.layers.21.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
605
+ "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
606
+ "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
607
+ "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
608
+ "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
609
+ "model.layers.22.casa_attn.k_proj_casa.weight": "model-00002-of-00003.safetensors",
610
+ "model.layers.22.casa_attn.o_proj_casa.weight": "model-00002-of-00003.safetensors",
611
+ "model.layers.22.casa_attn.q_proj_casa.weight": "model-00002-of-00003.safetensors",
612
+ "model.layers.22.casa_attn.v_proj_casa.weight": "model-00002-of-00003.safetensors",
613
+ "model.layers.22.input_layernorm.weight": "model-00002-of-00003.safetensors",
614
+ "model.layers.22.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
615
+ "model.layers.22.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
616
+ "model.layers.22.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
617
+ "model.layers.22.norm_cross.weight": "model-00002-of-00003.safetensors",
618
+ "model.layers.22.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
619
+ "model.layers.22.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
620
+ "model.layers.22.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
621
+ "model.layers.22.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
622
+ "model.layers.22.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
623
+ "model.layers.23.casa_attn.k_proj_casa.weight": "model-00002-of-00003.safetensors",
624
+ "model.layers.23.casa_attn.o_proj_casa.weight": "model-00002-of-00003.safetensors",
625
+ "model.layers.23.casa_attn.q_proj_casa.weight": "model-00002-of-00003.safetensors",
626
+ "model.layers.23.casa_attn.v_proj_casa.weight": "model-00002-of-00003.safetensors",
627
+ "model.layers.23.input_layernorm.weight": "model-00002-of-00003.safetensors",
628
+ "model.layers.23.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
629
+ "model.layers.23.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
630
+ "model.layers.23.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
631
+ "model.layers.23.norm_cross.weight": "model-00002-of-00003.safetensors",
632
+ "model.layers.23.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
633
+ "model.layers.23.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
634
+ "model.layers.23.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
635
+ "model.layers.23.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
636
+ "model.layers.23.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
637
+ "model.layers.24.casa_attn.k_proj_casa.weight": "model-00002-of-00003.safetensors",
638
+ "model.layers.24.casa_attn.o_proj_casa.weight": "model-00002-of-00003.safetensors",
639
+ "model.layers.24.casa_attn.q_proj_casa.weight": "model-00002-of-00003.safetensors",
640
+ "model.layers.24.casa_attn.v_proj_casa.weight": "model-00002-of-00003.safetensors",
641
+ "model.layers.24.input_layernorm.weight": "model-00002-of-00003.safetensors",
642
+ "model.layers.24.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
643
+ "model.layers.24.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
644
+ "model.layers.24.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
645
+ "model.layers.24.norm_cross.weight": "model-00002-of-00003.safetensors",
646
+ "model.layers.24.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
647
+ "model.layers.24.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
648
+ "model.layers.24.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
649
+ "model.layers.24.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
650
+ "model.layers.24.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
651
+ "model.layers.25.casa_attn.k_proj_casa.weight": "model-00002-of-00003.safetensors",
652
+ "model.layers.25.casa_attn.o_proj_casa.weight": "model-00002-of-00003.safetensors",
653
+ "model.layers.25.casa_attn.q_proj_casa.weight": "model-00002-of-00003.safetensors",
654
+ "model.layers.25.casa_attn.v_proj_casa.weight": "model-00002-of-00003.safetensors",
655
+ "model.layers.25.input_layernorm.weight": "model-00002-of-00003.safetensors",
656
+ "model.layers.25.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
657
+ "model.layers.25.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
658
+ "model.layers.25.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
659
+ "model.layers.25.norm_cross.weight": "model-00002-of-00003.safetensors",
660
+ "model.layers.25.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
661
+ "model.layers.25.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
662
+ "model.layers.25.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
663
+ "model.layers.25.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
664
+ "model.layers.25.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
665
+ "model.layers.26.casa_attn.k_proj_casa.weight": "model-00002-of-00003.safetensors",
666
+ "model.layers.26.casa_attn.o_proj_casa.weight": "model-00002-of-00003.safetensors",
667
+ "model.layers.26.casa_attn.q_proj_casa.weight": "model-00002-of-00003.safetensors",
668
+ "model.layers.26.casa_attn.v_proj_casa.weight": "model-00002-of-00003.safetensors",
669
+ "model.layers.26.input_layernorm.weight": "model-00002-of-00003.safetensors",
670
+ "model.layers.26.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
671
+ "model.layers.26.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
672
+ "model.layers.26.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
673
+ "model.layers.26.norm_cross.weight": "model-00002-of-00003.safetensors",
674
+ "model.layers.26.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
675
+ "model.layers.26.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
676
+ "model.layers.26.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
677
+ "model.layers.26.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
678
+ "model.layers.26.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
679
+ "model.layers.27.casa_attn.k_proj_casa.weight": "model-00002-of-00003.safetensors",
680
+ "model.layers.27.casa_attn.o_proj_casa.weight": "model-00002-of-00003.safetensors",
681
+ "model.layers.27.casa_attn.q_proj_casa.weight": "model-00002-of-00003.safetensors",
682
+ "model.layers.27.casa_attn.v_proj_casa.weight": "model-00002-of-00003.safetensors",
683
+ "model.layers.27.input_layernorm.weight": "model-00002-of-00003.safetensors",
684
+ "model.layers.27.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
685
+ "model.layers.27.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
686
+ "model.layers.27.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
687
+ "model.layers.27.norm_cross.weight": "model-00002-of-00003.safetensors",
688
+ "model.layers.27.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
689
+ "model.layers.27.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
690
+ "model.layers.27.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
691
+ "model.layers.27.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
692
+ "model.layers.27.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
693
+ "model.layers.3.casa_attn.k_proj_casa.weight": "model-00001-of-00003.safetensors",
694
+ "model.layers.3.casa_attn.o_proj_casa.weight": "model-00001-of-00003.safetensors",
695
+ "model.layers.3.casa_attn.q_proj_casa.weight": "model-00001-of-00003.safetensors",
696
+ "model.layers.3.casa_attn.v_proj_casa.weight": "model-00001-of-00003.safetensors",
697
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00003.safetensors",
698
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
699
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
700
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
701
+ "model.layers.3.norm_cross.weight": "model-00001-of-00003.safetensors",
702
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
703
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
704
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
705
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
706
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
707
+ "model.layers.4.casa_attn.k_proj_casa.weight": "model-00001-of-00003.safetensors",
708
+ "model.layers.4.casa_attn.o_proj_casa.weight": "model-00001-of-00003.safetensors",
709
+ "model.layers.4.casa_attn.q_proj_casa.weight": "model-00001-of-00003.safetensors",
710
+ "model.layers.4.casa_attn.v_proj_casa.weight": "model-00001-of-00003.safetensors",
711
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00003.safetensors",
712
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
713
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
714
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
715
+ "model.layers.4.norm_cross.weight": "model-00001-of-00003.safetensors",
716
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
717
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
718
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
719
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
720
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
721
+ "model.layers.5.casa_attn.k_proj_casa.weight": "model-00001-of-00003.safetensors",
722
+ "model.layers.5.casa_attn.o_proj_casa.weight": "model-00001-of-00003.safetensors",
723
+ "model.layers.5.casa_attn.q_proj_casa.weight": "model-00001-of-00003.safetensors",
724
+ "model.layers.5.casa_attn.v_proj_casa.weight": "model-00001-of-00003.safetensors",
725
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00003.safetensors",
726
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
727
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
728
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
729
+ "model.layers.5.norm_cross.weight": "model-00001-of-00003.safetensors",
730
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
731
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
732
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
733
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
734
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
735
+ "model.layers.6.casa_attn.k_proj_casa.weight": "model-00001-of-00003.safetensors",
736
+ "model.layers.6.casa_attn.o_proj_casa.weight": "model-00001-of-00003.safetensors",
737
+ "model.layers.6.casa_attn.q_proj_casa.weight": "model-00001-of-00003.safetensors",
738
+ "model.layers.6.casa_attn.v_proj_casa.weight": "model-00001-of-00003.safetensors",
739
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00003.safetensors",
740
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
741
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
742
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
743
+ "model.layers.6.norm_cross.weight": "model-00001-of-00003.safetensors",
744
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
745
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
746
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
747
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
748
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
749
+ "model.layers.7.casa_attn.k_proj_casa.weight": "model-00001-of-00003.safetensors",
750
+ "model.layers.7.casa_attn.o_proj_casa.weight": "model-00001-of-00003.safetensors",
751
+ "model.layers.7.casa_attn.q_proj_casa.weight": "model-00001-of-00003.safetensors",
752
+ "model.layers.7.casa_attn.v_proj_casa.weight": "model-00001-of-00003.safetensors",
753
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00003.safetensors",
754
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
755
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
756
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
757
+ "model.layers.7.norm_cross.weight": "model-00001-of-00003.safetensors",
758
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
759
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
760
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
761
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
762
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
763
+ "model.layers.8.casa_attn.k_proj_casa.weight": "model-00001-of-00003.safetensors",
764
+ "model.layers.8.casa_attn.o_proj_casa.weight": "model-00001-of-00003.safetensors",
765
+ "model.layers.8.casa_attn.q_proj_casa.weight": "model-00001-of-00003.safetensors",
766
+ "model.layers.8.casa_attn.v_proj_casa.weight": "model-00001-of-00003.safetensors",
767
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00003.safetensors",
768
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
769
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
770
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
771
+ "model.layers.8.norm_cross.weight": "model-00001-of-00003.safetensors",
772
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
773
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
774
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
775
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
776
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
777
+ "model.layers.9.casa_attn.k_proj_casa.weight": "model-00001-of-00003.safetensors",
778
+ "model.layers.9.casa_attn.o_proj_casa.weight": "model-00001-of-00003.safetensors",
779
+ "model.layers.9.casa_attn.q_proj_casa.weight": "model-00001-of-00003.safetensors",
780
+ "model.layers.9.casa_attn.v_proj_casa.weight": "model-00001-of-00003.safetensors",
781
+ "model.layers.9.input_layernorm.weight": "model-00001-of-00003.safetensors",
782
+ "model.layers.9.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
783
+ "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
784
+ "model.layers.9.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
785
+ "model.layers.9.norm_cross.weight": "model-00001-of-00003.safetensors",
786
+ "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
787
+ "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
788
+ "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
789
+ "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
790
+ "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
791
+ "model.norm.weight": "model-00002-of-00003.safetensors"
792
+ }
793
+ }
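The index above is a standard safetensors weight map: each tensor name points to the shard that stores it, and every decoder layer carries CASA-specific `casa_attn.{q,k,v,o}_proj_casa` projections and a `norm_cross` weight alongside the usual `self_attn` and `mlp` tensors. As a rough sketch, one might list the CASA-specific tensors in a local checkout like this (the `model.safetensors.index.json` path follows the usual Hugging Face naming and is assumed here, not quoted from this commit):

```python
import json

# Illustrative path: point this at your local copy of the shard index.
with open("model.safetensors.index.json") as f:
    index = json.load(f)

weight_map = index["weight_map"]

# CASA adds per-layer cross-attention projections and a norm_cross weight
# on top of the base self-attention and MLP tensors.
casa_keys = [k for k in weight_map if ".casa_attn." in k or ".norm_cross." in k]
shards = sorted({weight_map[k] for k in casa_keys})
print(f"{len(casa_keys)} CASA-specific tensors across {len(shards)} shard(s)")
for key in sorted(casa_keys)[:6]:
    print(f"  {key} -> {weight_map[key]}")
```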
modeling_helium1_casa.py ADDED
@@ -0,0 +1,330 @@
1
+ from typing import Any, Callable
2
+ from typing import cast as type_cast
3
+
4
+ import torch
5
+ from transformers.cache_utils import DynamicCache
6
+ from transformers.configuration_utils import PretrainedConfig
7
+ from transformers.generation.utils import GenerateOutput
8
+ from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import (
9
+ Qwen2_5_VisionTransformerPretrainedModel,
10
+ )
11
+
12
+ from .image_encoder import Qwen25VLEncoder
13
+ from .configuration_helium1_casa import Helium1CASAConfig
14
+ from .language_helium1_casa import (
15
+ CausalHeliumOutput,
16
+ Helium1CASAAttention,
17
+ Helium1ForCausalLM,
18
+ Helium1RMSNorm,
19
+ )
20
+
21
+
22
+ def meta_project(
23
+ logits: torch.Tensor | list[torch.Tensor],
24
+ projector: torch.nn.Module,
25
+ norm: torch.nn.Module | None = None,
26
+ ) -> torch.Tensor | list[torch.Tensor]:
27
+ """Projection operation that handles both tensors and list of tensors
28
+
29
+ Outputs either a (N, S, D) tensor (same-resolution images) or a list of N (S, D) tensors (where
30
+ S can be a different sequence length per image)
31
+ """
32
+ split_sizes: list[int] | None = None
33
+ if not isinstance(logits, torch.Tensor):
34
+ split_sizes = [_x.shape[0] for _x in logits]
35
+ logits = torch.cat(logits, dim=0)[None, :, :]
36
+ logits = type_cast(torch.Tensor, logits)
37
+ logits = projector(logits)
38
+
39
+ assert isinstance(logits, torch.Tensor)
40
+ if norm is not None:
41
+ logits = norm(logits)
42
+ if split_sizes is not None:
43
+ return list(torch.split(type_cast(torch.Tensor, logits[0]), split_sizes, dim=0))
44
+ return logits
45
+
46
+
47
+ class ImageProjection(torch.nn.Module):
48
+ """Takes in a batch or sequence of images and returns embeddings
49
+ which are then fed to the LM.
50
+
51
+ :param config: model configuration (Helium1CASAConfig)
52
+ :param lm_model_dim: Output dimension (number of channels) for this module
53
+ """
54
+
55
+ def __init__(self, config: PretrainedConfig, lm_model_dim: int) -> None:
56
+ super().__init__()
57
+ self.config = config
58
+ self.out_dim = lm_model_dim
59
+ visual = Qwen2_5_VisionTransformerPretrainedModel._from_config(config.vision_config)
60
+
61
+ self.enc = Qwen25VLEncoder(visual=visual)
62
+ # Projection layer
63
+ self.proj_extra = self.init_proj_module()
64
+ # Output normalizations
65
+ self.norm_extra = Helium1RMSNorm(self.out_dim)
66
+
67
+ def init_proj_module(self) -> torch.nn.Module:
68
+ """Init the project module for the inserted and/or cross-attended image tokens"""
69
+ if self.config.vision_config.out_dim == self.out_dim:
70
+ return torch.nn.Identity()
71
+ return torch.nn.Linear(self.config.vision_config.out_dim, self.out_dim)
72
+
73
+ def forward(
74
+ self, x: torch.Tensor | list[torch.Tensor]
75
+ ) -> dict[
76
+ str,
77
+ torch.Tensor | list[torch.Tensor],
78
+ ]:
79
+ """Image embedding mapping
80
+
81
+ :param x: Either a tensor with shape (Bi, C, H, W) or a list of Bi tensors
82
+ with shape (C, H, W) (or (H, W, C) in the case of Qwen)
83
+
84
+ :return: Either a tensor with shape (num_total_images, S, D) or, if images
85
+ can have different seq length, a list of `num_total_images` Tensors with shape
86
+ (S, D)
87
+ """
88
+
89
+ # Apply image encoder
90
+ og_dtype = x[0].dtype
91
+ encoded = self.enc(x)["image_embeds"]
92
+ encoded = [_x.to(og_dtype) for _x in encoded]
93
+ if all(x.shape[0] == encoded[0].shape[0] for x in encoded):
94
+ encoded = torch.stack(encoded, dim=0)
95
+
96
+ # Extra projection
97
+ image_embeds = meta_project(encoded, self.proj_extra, self.norm_extra)
98
+
99
+ # Apply different projection for extra vs cross attended tokens
100
+ return {"image_embeds": image_embeds}
101
+
102
+
103
+ class V2Helium1(Helium1ForCausalLM): # pyright: ignore[reportIncompatibleMethodOverride]
104
+ config_class = Helium1CASAConfig
105
+
106
+ def __init__(self, config: Helium1CASAConfig, **kwargs: Any) -> None:
107
+ del kwargs
108
+ super().__init__(config)
109
+ self.image_prefix = ImageProjection(config=config, lm_model_dim=self.token_dim)
110
+
111
+ def get_device(self) -> str:
112
+ """Return the device type of the model"""
113
+ return next(self.parameters()).device.type
114
+
115
+ @property
116
+ def token_dim(self) -> int:
117
+ """Returns the number of dimensions for the token representation"""
118
+ return self.config.hidden_size
119
+
120
+ @property
121
+ def rotary_embed(self) -> Callable:
122
+ """Returns the rotary embedding function of the underlying model"""
123
+ return self.model.rotary_emb
124
+
125
+ def _update_model_kwargs_for_generation(
126
+ self,
127
+ outputs: Any,
128
+ model_kwargs: dict[str, Any],
129
+ is_encoder_decoder: bool = False,
130
+ num_new_tokens: int = 1,
131
+ ):
132
+ """This is required to handle multiple gen calls for subtitles"""
133
+ # Call parent to get default updates
134
+ model_kwargs = super()._update_model_kwargs_for_generation(
135
+ outputs, model_kwargs, is_encoder_decoder, num_new_tokens
136
+ )
137
+ # Used by prepare_inputs_for_generation
138
+ model_kwargs["__is_first_gen_call__"] = False
139
+ return model_kwargs
140
+
141
+ def prepare_inputs_for_generation( # pyright: ignore[reportIncompatibleMethodOverride]
142
+ self,
143
+ input_ids: torch.Tensor,
144
+ past_key_values: DynamicCache | None = None,
145
+ **kwargs: Any,
146
+ ):
147
+ __is_first_gen_call__ = kwargs.get("__is_first_gen_call__", True)
148
+ if past_key_values is not None and (
149
+ kwargs.get("cache_position") is None
150
+ or type_cast(torch.Tensor, kwargs.get("cache_position")).shape[0] == 0
151
+ ):
152
+ # We're continuing from a cached state
153
+ past_length = past_key_values._seen_tokens
154
+ kwargs["cache_position"] = torch.arange(
155
+ past_length,
156
+ past_length + (input_ids.shape[1] if __is_first_gen_call__ else 1),
157
+ dtype=torch.long,
158
+ device=input_ids.device,
159
+ )
160
+
161
+ return super().prepare_inputs_for_generation(
162
+ type_cast(torch.LongTensor, input_ids),
163
+ past_key_values=past_key_values,
164
+ **kwargs,
165
+ )
166
+
167
+ def prepare_multimodal_inputs(
168
+ self,
169
+ # text only training
170
+ input_ids: torch.Tensor | None = None,
171
+ inputs_embeds: torch.Tensor | None = None,
172
+ attention_mask: torch.Tensor | None = None,
173
+ image_embeds_insertion_points: list[torch.Tensor] | None = None,
174
+ labels: torch.Tensor | None = None,
175
+ # image values
176
+ pixel_values: torch.Tensor | list[torch.Tensor] | None = None,
177
+ pre_image_tokens: list[int] | None = None,
178
+ post_image_tokens: list[int] | None = None,
179
+ **_kwargs: Any,
180
+ ) -> dict:
181
+ """Get a batch data mixing text and image data"""
182
+ del _kwargs
183
+
184
+ processed_inputs = {
185
+ "input_ids": input_ids,
186
+ "inputs_embeds": inputs_embeds,
187
+ "labels": labels,
188
+ "attention_mask": attention_mask,
189
+ "image_embeds_insertion_points": image_embeds_insertion_points,
190
+ }
191
+ if pixel_values is not None:
192
+ processed_inputs.update(self.image_prefix(pixel_values))
193
+ assert "image_embeds" in processed_inputs
194
+ assert (
195
+ isinstance(processed_inputs["image_embeds"], torch.Tensor)
196
+ and processed_inputs["image_embeds"].ndim == 3
197
+ ) or (
198
+ isinstance(processed_inputs["image_embeds"], list)
199
+ and all(_x.ndim == 2 for _x in processed_inputs["image_embeds"])
200
+ )
201
+
202
+ # Add kwargs necessary to compute cu_seqlens windows for CASA
203
+ processed_inputs["casa_windows_info"] = {
204
+ "num_post_image_tokens": 0 if post_image_tokens is None else len(post_image_tokens),
205
+ "num_pre_image_tokens": 0 if pre_image_tokens is None else len(pre_image_tokens),
206
+ }
207
+
208
+ return processed_inputs
209
+
210
+ def forward( # pyright: ignore[reportIncompatibleMethodOverride]
211
+ self,
212
+ input_ids: torch.Tensor | None = None,
213
+ inputs_embeds: torch.Tensor | None = None,
214
+ attention_mask: torch.Tensor | None = None,
215
+ pixel_values: torch.Tensor | list[torch.Tensor] | None = None,
216
+ return_loss: bool = True,
217
+ labels: torch.Tensor | None = None,
218
+ image_embeds_insertion_points: list[torch.Tensor] | None = None,
219
+ pre_image_tokens: list[int] | None = None,
220
+ post_image_tokens: list[int] | None = None,
221
+ **kwargs: Any,
222
+ ) -> CausalHeliumOutput:
223
+ """Multi modal forward pass"""
224
+ assert input_ids is not None or inputs_embeds is not None
225
+
226
+ if self.training:
227
+ assert return_loss is True, (
228
+ "Helium models always compute its own labels/losses in train mode"
229
+ )
230
+
231
+ # Case 1: For first generation call we need to compute pixel values and CASA states
232
+ if kwargs.get("__is_first_gen_call__", True):
233
+ processed_inputs = self.prepare_multimodal_inputs(
234
+ input_ids=input_ids,
235
+ inputs_embeds=inputs_embeds,
236
+ attention_mask=attention_mask,
237
+ image_embeds_insertion_points=image_embeds_insertion_points,
238
+ pixel_values=pixel_values,
239
+ labels=labels,
240
+ pre_image_tokens=pre_image_tokens,
241
+ post_image_tokens=post_image_tokens,
242
+ )
243
+ processed_inputs.pop("inputs_embeds", None)
244
+ else:
245
+ processed_inputs = {
246
+ "inputs_embeds": self.model.embed_tokens(input_ids),
247
+ "attention_mask": attention_mask,
248
+ }
249
+
250
+ # For Helium prefix, we need to update the positions by the number
251
+ # of image tokens inserted in the first call
252
+ if (
253
+ not self.config.casa_attention
254
+ and (cp := kwargs.get("cache_position", None)) is not None
255
+ and pixel_values is not None
256
+ ):
257
+ start = kwargs["cache_position"][0].item()
258
+ num_image_tokens = (pixel_values[0].shape[0] * pixel_values[0].shape[1]) // 4
259
+ num_tokens = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1] # type: ignore
260
+ kwargs["cache_position"] = torch.arange(
261
+ start + (0 if kwargs.get("__is_first_gen_call__", True) else num_image_tokens),
262
+ start + num_tokens + num_image_tokens,
263
+ dtype=cp.dtype,
264
+ device=cp.device,
265
+ )
266
+
267
+ kwargs.pop("__is_first_gen_call__", True)
268
+ out = super().forward(
269
+ **processed_inputs, # type: ignore
270
+ **kwargs,
271
+ )
272
+
273
+ return out
274
+
275
+ @torch.no_grad()
276
+ def generate_from_image( # pyright: ignore[reportInconsistentOverload,reportIncompatibleMethodOverride]
277
+ self,
278
+ input_ids: torch.Tensor | None = None,
279
+ inputs_embeds: torch.Tensor | None = None,
280
+ attention_mask: torch.Tensor | None = None,
281
+ image_embeds_insertion_points: list[torch.Tensor] | None = None,
282
+ pixel_values: torch.Tensor | list[torch.Tensor] | None = None,
283
+ reset_streaming: bool = True,
284
+ **kwargs: Any,
285
+ ) -> "GenerateOutput | torch.LongTensor":
286
+ assert input_ids is not None and inputs_embeds is None, (
287
+ "Input IDs must be provided for generation"
288
+ )
289
+
290
+ # init self-attention KVCache
291
+ if kwargs.get("past_key_values", None) is None:
292
+ kwargs["past_key_values"] = DynamicCache()
293
+
294
+ # To avoid generate warning
295
+ if kwargs.get("pad_token_id", None) is None:
296
+ kwargs["pad_token_id"] = kwargs.get("eos_token_id", None)
297
+ if isinstance(kwargs["pad_token_id"], (list, tuple)):
298
+ kwargs["pad_token_id"] = kwargs["pad_token_id"][0]
299
+
300
+ self.start_casa_streaming_states()
301
+ outputs = self.generate(
302
+ input_ids,
303
+ attention_mask=attention_mask,
304
+ pixel_values=pixel_values,
305
+ image_embeds_insertion_points=image_embeds_insertion_points,
306
+ use_cache=True,
307
+ **kwargs,
308
+ )
309
+ if reset_streaming:
310
+ self.reset_casa_streaming_states()
311
+ return outputs
312
+
313
+ def reset_casa_streaming_states(self, clean_cache: bool = True) -> None:
314
+ def __reset__(m: torch.nn.Module):
315
+ if isinstance(m, Helium1CASAAttention):
316
+ m._set_streaming(False, ())
317
+ m.reset_streaming()
318
+ if clean_cache:
319
+ del m.streaming_state.k
320
+ del m.streaming_state.v
321
+ del m.streaming_state.casa_handler
322
+
323
+ self.apply(__reset__)
324
+
325
+ def start_casa_streaming_states(self) -> None:
326
+ def __start__(m: torch.nn.Module):
327
+ if isinstance(m, Helium1CASAAttention):
328
+ m._set_streaming(True, ())
329
+
330
+ self.apply(__start__)
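`V2Helium1` couples the Qwen2.5-VL image encoder (via `ImageProjection`) with the Helium1 CASA decoder and exposes `generate_from_image`, which switches the CASA attention layers into streaming mode, runs `generate` with a fresh `DynamicCache`, and resets the streaming state afterwards. A minimal usage sketch follows; the repository id, the `AutoModelForCausalLM` entry point and the image path are assumptions for illustration, not taken from this card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "kyutai/CASA-Helium1-VL-2B"  # assumed repository id
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16
).cuda().eval()

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "example.jpg"},        # hypothetical local image
        {"type": "text", "text": "Describe this image."},
    ],
}]
# ProcessorOutput.to() moves the token ids and casts pixel_values to bfloat16.
batch = processor.tokenize_messages(messages).to("cuda")

out = model.generate_from_image(**batch, max_new_tokens=128)
print(processor.tokenizer.decode(out[0], skip_special_tokens=True))
```

Passing `reset_streaming=False` to `generate_from_image` keeps the CASA streaming state alive across calls, which is what the multi-call bookkeeping in `_update_model_kwargs_for_generation` is there for.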
processing.py ADDED
@@ -0,0 +1,505 @@
1
+ # pylint: disable=no-member # avoid weird pylint warnings from SentencePieceProcessor
2
+ """Text and Image processor for CASA models using Qwen2.5_VL image encoder"""
3
+
4
+ from math import ceil
5
+ from typing import TYPE_CHECKING, Any, Literal, TypedDict, cast, overload
6
+ from typing import cast as type_cast
7
+
8
+ import torch
9
+ import torchvision.transforms.v2 as T
10
+ from einops import rearrange
11
+ from PIL import Image
12
+ from torchvision.transforms import InterpolationMode
13
+ from torchvision.transforms.functional import to_tensor as pil_to_tensor
14
+ from torchvision.transforms.v2 import functional as F
15
+ from transformers.image_processing_utils import BaseImageProcessor
16
+ from transformers.processing_utils import ProcessorMixin
17
+
18
+ if TYPE_CHECKING:
19
+ from transformers.models.qwen2.tokenization_qwen2 import Qwen2Tokenizer
20
+ from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
21
+
22
+
23
+ ImageMessage = TypedDict(
24
+ "ImageMessage",
25
+ {
26
+ "type": Literal["image"],
27
+ "image": str | Image.Image | None,
28
+ },
29
+ )
30
+
31
+ TextMessage = TypedDict(
32
+ "TextMessage",
33
+ {
34
+ "type": Literal["text"],
35
+ "text": str,
36
+ },
37
+ )
38
+
39
+ MessageContent = list[ImageMessage | TextMessage]
40
+
41
+ Message = TypedDict(
42
+ "Message",
43
+ {
44
+ "role": Literal["system", "user", "assistant"],
45
+ "content": MessageContent,
46
+ },
47
+ )
48
+
49
+ ProcessorInput = list[list[Message]] | list[Message]
50
+
51
+ __INTERP_NAME_TO_MODE__ = {
52
+ "nearest": InterpolationMode.NEAREST,
53
+ "bilinear": InterpolationMode.BILINEAR,
54
+ "bicubic": InterpolationMode.BICUBIC,
55
+ "lanczos": InterpolationMode.LANCZOS,
56
+ }
57
+
58
+ __INTERP_INT_TO_MODE__ = {
59
+ 0: InterpolationMode.NEAREST,
60
+ 2: InterpolationMode.BILINEAR,
61
+ 3: InterpolationMode.BICUBIC,
62
+ 4: InterpolationMode.BOX,
63
+ 5: InterpolationMode.HAMMING,
64
+ 1: InterpolationMode.LANCZOS,
65
+ }
66
+
67
+
68
+ @overload
69
+ def universal_resize(
70
+ img: Image.Image,
71
+ size: tuple[int, int],
72
+ interpolation: str | InterpolationMode | int = "bilinear",
73
+ antialias: bool = True,
74
+ ) -> Image.Image: ...
75
+ @overload
76
+ def universal_resize(
77
+ img: torch.Tensor,
78
+ size: tuple[int, int],
79
+ interpolation: str | InterpolationMode | int = "bilinear",
80
+ antialias: bool = True,
81
+ ) -> torch.Tensor: ...
82
+ def universal_resize(
83
+ img: Image.Image | torch.Tensor,
84
+ size: tuple[int, int],
85
+ interpolation: str | InterpolationMode | int = "bilinear",
86
+ antialias: bool = True,
87
+ ) -> Image.Image | torch.Tensor:
88
+ """Resize that works for PIL.Image, CHW tensor, or BCHW tensor"""
89
+ if isinstance(interpolation, str):
90
+ interpolation = __INTERP_NAME_TO_MODE__[interpolation]
91
+ elif isinstance(interpolation, int):
92
+ interpolation = __INTERP_INT_TO_MODE__[interpolation]
93
+
94
+ return F.resize(
95
+ img, size, interpolation=type_cast(InterpolationMode, interpolation), antialias=antialias
96
+ )
97
+
98
+
99
+ @overload
100
+ def convert_to_rgb(img: Image.Image) -> Image.Image: ...
101
+ @overload
102
+ def convert_to_rgb(img: torch.Tensor) -> torch.Tensor: ...
103
+ def convert_to_rgb(img: Image.Image | torch.Tensor) -> Image.Image | torch.Tensor:
104
+ """Convert any image to RGB in a way that does not throw PIL warning"""
105
+ if isinstance(img, torch.Tensor):
106
+ return img
107
+ if img.mode == "RGB": # no changes
108
+ return img
109
+ if img.mode == "P": # palette images need to be converted to RGBA first
110
+ return img.convert("RGBA").convert("RGB")
111
+ return img.convert("RGB")
112
+
113
+
114
+ class QwenImageProcessor(BaseImageProcessor):
115
+ """Resizing for the Qwen2.5VL encoder. Note that the normalization is
116
+ handled in the image_encoder in the model forward"""
117
+
118
+ def __init__(
119
+ self,
120
+ img_size: int = 448,
121
+ interpolation: Literal["bicubic", "bilinear", "nearest", "nearest_exact"] = "bicubic",
122
+ max_ratio: int = 10,
123
+ round_to_patch_size: int = 56,
124
+ use_fast: bool = True,
125
+ **kwargs: Any,
126
+ ) -> None:
127
+ # this will also be used in V2llms to determine whether to remove
128
+ # the temporal conv
129
+ self._num_target_channels = 588
130
+ self._merge_size = 2
131
+ self._patch_size = 14
132
+ super().__init__(
133
+ use_fast=use_fast,
134
+ do_normalize=False,
135
+ **kwargs,
136
+ )
137
+ self.img_size = img_size
138
+ self.interpolation = interpolation
139
+ self.max_ratio = max_ratio
140
+ self.round_to_patch_size = round_to_patch_size
141
+
142
+ def resize_transform(
143
+ self, img: Image.Image | torch.Tensor, img_size: int | None = None
144
+ ) -> Image.Image | torch.Tensor:
145
+ if img_size is None:
146
+ img_size = self.img_size
147
+ max_area = img_size**2
148
+ if isinstance(img, Image.Image):
149
+ img = convert_to_rgb(img)
150
+ w_og, h_og = img.size
151
+ else:
152
+ h_og, w_og = img.shape[-2:]
153
+ w, h = w_og, h_og
154
+
155
+ # Qwen requires max ratio of 10 between max and min sizes
156
+ if self.max_ratio > 0:
157
+ w, h = max(w, h // self.max_ratio), max(h, w // self.max_ratio)
158
+
159
+ # resize to max area
160
+ current_area = w * h
161
+ if current_area > max_area:
162
+ scale = (max_area / current_area) ** 0.5
163
+ w, h = int(w * scale), int(h * scale)
164
+
165
+ # resize to patch size
166
+ if self.round_to_patch_size > 0:
167
+ w = ceil(w / self.round_to_patch_size) * self.round_to_patch_size
168
+ h = ceil((h / self.round_to_patch_size)) * self.round_to_patch_size
169
+
170
+ # resize
171
+ if w != w_og or h != h_og:
172
+ img = universal_resize(img, (h, w), self.interpolation)
173
+ if isinstance(img, torch.Tensor):
174
+ img = T.ToDtype(torch.float32, scale=True)(T.ToImage()(img))
175
+ return img
176
+
177
+ def __process_one__(
178
+ self, video_or_img: Image.Image | torch.Tensor, img_size: int | None = None
179
+ ) -> torch.Tensor:
180
+ """Same operation as __process_one_with_processor__ but without going through numpy"""
181
+ video_or_img = self.resize_transform(video_or_img, img_size)
182
+ if isinstance(video_or_img, Image.Image):
183
+ video_or_img = pil_to_tensor(video_or_img)
184
+ assert isinstance(video_or_img, torch.Tensor)
185
+ if video_or_img.ndim == 3:
186
+ video_or_img = video_or_img[None]
187
+ assert video_or_img.ndim == 4 and video_or_img.shape[1] == 3, (
188
+ f"Invalid shape {video_or_img.shape}."
189
+ )
190
+ t, c, h, w = video_or_img.shape
191
+ p = self._patch_size
192
+ m = self._merge_size
193
+
194
+ # Convert to RGB
195
+ if c == 1:
196
+ video_or_img = video_or_img.expand((-1, 3, -1, -1))
197
+ if c == 4:
198
+ video_or_img = video_or_img[:, :3]
199
+ c = video_or_img.shape[1]
200
+ assert c == 3, "Expecting RGB image in QwenNormalize"
201
+
202
+ # Reshape to t h w c' format
203
+ h, w = video_or_img.shape[2] // p, video_or_img.shape[3] // p
204
+ rearrange_dict = dict(p1=p, p2=p, m1=m, m2=m)
205
+
206
+ video_or_img = rearrange(
207
+ video_or_img,
208
+ "t c (h m1 p1) (w m2 p2) -> (t h w m1 m2) (c p1 p2)",
209
+ **rearrange_dict,
210
+ )
211
+ assert video_or_img.shape[-1] == self._num_target_channels, (
212
+ f"{video_or_img.shape[-1]} != {self._num_target_channels}"
213
+ )
214
+ video_or_img = video_or_img.view((-1, h, w, self._num_target_channels))
215
+
216
+ return video_or_img
217
+
218
+ @overload
219
+ def process_images(
220
+ self, image: Image.Image | torch.Tensor, img_size: int | None = None
221
+ ) -> torch.Tensor: ...
222
+ @overload
223
+ def process_images(
224
+ self, image: list[Image.Image] | list[torch.Tensor], img_size: int | None = None
225
+ ) -> list[torch.Tensor]: ...
226
+ def process_images(
227
+ self,
228
+ image: Image.Image | torch.Tensor | list[Image.Image] | list[torch.Tensor],
229
+ img_size: int | None = None,
230
+ ) -> torch.Tensor | list[torch.Tensor]:
231
+ if isinstance(image, list):
232
+ return [self.__process_one__(_x, img_size) for _x in image]
233
+ return self.__process_one__(image, img_size)
234
+
235
+
236
+ class ProcessorOutput(dict):
237
+ input_ids: torch.Tensor
238
+ attention_mask: torch.Tensor
239
+ image_embeds_insertion_points: list[torch.Tensor] | None
240
+ pixel_values: torch.Tensor | list[torch.Tensor] | None
241
+
242
+ def to(
243
+ self, device: torch.device | str, dtype: torch.dtype = torch.bfloat16
244
+ ) -> "ProcessorOutput":
245
+ return ProcessorOutput(
246
+ {
247
+ "input_ids": self["input_ids"].to(device),
248
+ "attention_mask": self["attention_mask"].to(device),
249
+ "image_embeds_insertion_points": self["image_embeds_insertion_points"],
250
+ "pixel_values": (
251
+ self["pixel_values"].to(dtype).to(device)
252
+ if isinstance(self["pixel_values"], torch.Tensor)
253
+ else [x.to(dtype).to(device) for x in self["pixel_values"]]
254
+ if self["pixel_values"] is not None
255
+ else None
256
+ ),
257
+ }
258
+ )
259
+
260
+
261
+ class BaseProcessor(ProcessorMixin):
262
+ def __init__(
263
+ self,
264
+ tokenizer: "PreTrainedTokenizerFast | Qwen2Tokenizer",
265
+ pre_image_tokens: tuple[int, ...] = (),
266
+ post_image_tokens: tuple[int, ...] = (),
267
+ system_start_tokens: tuple[int, ...] = (),
268
+ system_end_tokens: tuple[int, ...] = (),
269
+ user_start_tokens: tuple[int, ...] = (),
270
+ user_end_tokens: tuple[int, ...] = (),
271
+ asst_start_tokens: tuple[int, ...] = (),
272
+ asst_end_tokens: tuple[int, ...] = (),
273
+ allow_system_prompt: bool = True,
274
+ pad_token: int = 0,
275
+ bos_token: int | None = None,
276
+ ) -> None:
277
+ self.pre_image_tokens = list(pre_image_tokens)
278
+ self.post_image_tokens = list(post_image_tokens)
279
+ self.system_start_tokens = list(system_start_tokens)
280
+ self.system_end_tokens = list(system_end_tokens)
281
+ self.user_start_tokens = list(user_start_tokens)
282
+ self.user_end_tokens = list(user_end_tokens)
283
+ self.asst_start_tokens = list(asst_start_tokens)
284
+ self.asst_end_tokens = list(asst_end_tokens)
285
+ self._allow_system_prompt = allow_system_prompt
286
+ self.tokenizer = tokenizer
287
+ self._image_processor = None
288
+ self._pad_token = pad_token
289
+ self.bos_token = bos_token
290
+
291
+ @property
292
+ def image_processor(self) -> QwenImageProcessor:
293
+ assert self._image_processor is not None
294
+ return self._image_processor
295
+
296
+ def _process_content(
297
+ self,
298
+ message_content: MessageContent,
299
+ role: Literal["system", "user", "assistant"],
300
+ tokenized_messages: list[torch.Tensor],
301
+ insertion_points: list[int],
302
+ image_list: list[torch.Tensor | None],
303
+ token_count: int,
304
+ img_size: int | None = None,
305
+ **kwargs: Any,
306
+ ) -> int:
307
+ mapping = {
308
+ "user": (self.user_start_tokens, self.user_end_tokens),
309
+ "assistant": (self.asst_start_tokens, self.asst_end_tokens),
310
+ "system": (self.system_start_tokens, self.system_end_tokens),
311
+ }
312
+ if role.lower() not in mapping:
313
+ raise ValueError(f"Unknown role '{role}' encountered in messages.")
314
+ start_tokens, end_tokens = mapping[role.lower()]
315
+ # 1) Add the start tokens
316
+ if start_tokens:
317
+ tokenized_messages.append(torch.Tensor(start_tokens).flatten().to(torch.long))
318
+ token_count += len(start_tokens)
319
+ # 2) Process the message content one by one (potentially interleaved image and text)
320
+ for part in message_content:
321
+ elt_type = part["type"]
322
+ if elt_type == "image":
323
+ part = cast(ImageMessage, part)
324
+ self._process_image_message(
325
+ part,
326
+ tokenized_messages,
327
+ image_list,
328
+ img_size=img_size,
329
+ )
330
+ token_count += len(self.pre_image_tokens)
331
+ insertion_points.append(token_count)
332
+ token_count += len(self.post_image_tokens)
333
+ else:
334
+ part = cast(TextMessage, part)
335
+ self._process_text_message(
336
+ part["text"],
337
+ role=role,
338
+ token_list=tokenized_messages,
339
+ **kwargs,
340
+ )
341
+ token_count += tokenized_messages[-1].size(0)
342
+ # 3) Add the end tokens
343
+ if end_tokens:
344
+ tokenized_messages.append(torch.Tensor(end_tokens).flatten().to(torch.long))
345
+ token_count += len(end_tokens)
346
+ return token_count
347
+
348
+ def _process_text_message(
349
+ self,
350
+ message: str,
351
+ role: Literal["system", "user", "assistant"],
352
+ token_list: list[torch.Tensor],
353
+ **kwargs: Any,
354
+ ) -> None:
355
+ if role.lower() == "system" and not self._allow_system_prompt:
356
+ raise ValueError("System prompts are not allowed in this tokenizer configuration.")
357
+ tokens = self.tokenizer.encode(
358
+ message, add_special_tokens=False, return_tensors="pt", **kwargs
359
+ )
360
+ tokens = cast(torch.Tensor, tokens)
361
+ token_list.append(tokens.flatten().to(torch.long))
362
+
363
+ def _process_image_message(
364
+ self,
365
+ message: ImageMessage,
366
+ token_list: list[torch.Tensor],
367
+ image_list: list[torch.Tensor | None],
368
+ img_size: int | None = None,
369
+ ) -> None:
370
+ img = message["image"]
371
+ if img is None:
372
+ image_list.append(None)
373
+ else:
374
+ image_list.append(
375
+ self.image_processor.process_images(
376
+ self._load_image(img), img_size=img_size
377
+ ).squeeze(0)
378
+ )
379
+ if self.pre_image_tokens:
380
+ token_list.append(torch.Tensor(self.pre_image_tokens).flatten().to(torch.long))
381
+
382
+ if self.post_image_tokens:
383
+ token_list.append(torch.Tensor(self.post_image_tokens).flatten().to(torch.long))
384
+
385
+ def _load_image(self, image_path_or_image: str | Image.Image) -> Image.Image:
386
+ if isinstance(image_path_or_image, str):
387
+ return Image.open(image_path_or_image).convert("RGB")
388
+ return image_path_or_image
389
+
390
+ def _maybe_pad(self, tokens: torch.Tensor, pad_len: int, pad_value: int) -> torch.Tensor:
391
+ return torch.nn.functional.pad(
392
+ tokens,
393
+ (0, pad_len) if self.tokenizer.padding_side == "right" else (pad_len, 0),
394
+ value=pad_value,
395
+ )
396
+
397
+ def pad_tokenized_messages(
398
+ self,
399
+ tokenized_messages_batch: list[torch.Tensor],
400
+ image_insertion_points_batch: list[torch.Tensor] | None = None,
401
+ ) -> tuple[torch.Tensor, torch.Tensor, list[torch.Tensor] | None]:
402
+ max_len = max(len(x) for x in tokenized_messages_batch)
403
+ if image_insertion_points_batch is not None and self.tokenizer.padding_side == "left":
404
+ image_insertion_points_batch = [
405
+ x + max_len - len(tokenized_messages_batch[idx])
406
+ for idx, x in enumerate(image_insertion_points_batch)
407
+ ]
408
+ input_ids = torch.stack(
409
+ [
410
+ self._maybe_pad(s, max_len - s.size(0), self._pad_token)
411
+ for s in tokenized_messages_batch
412
+ ],
413
+ dim=0,
414
+ )
415
+ attention_mask = torch.stack(
416
+ [
417
+ self._maybe_pad(torch.ones_like(s), max_len - s.size(0), 0)
418
+ for s in tokenized_messages_batch
419
+ ],
420
+ dim=0,
421
+ )
422
+ return input_ids, attention_mask, image_insertion_points_batch
423
+
424
+ def tokenize_messages(
425
+ self,
426
+ messages: ProcessorInput,
427
+ suppress_bos_token: bool = False,
428
+ **kwargs: Any,
429
+ ) -> ProcessorOutput | None:
430
+ """Tokenize a batch of messages into token IDs suitable for Helium1 CASA model.
431
+
432
+ Args:
433
+ messages (list[list[dict[str, str]]] | list[dict[str, str]]): Batch of message lists (or single list of messages),
434
+ where each message is a dictionary with 'role' and 'content' keys.
435
+ Note that if the final message is from the assistant, its end tokens are not added, so generation can continue it.
437
+ suppress_bos_token (bool, optional): If True, the beginning-of-sequence token will not be added.
438
+ Defaults to False.
439
+ **kwargs: Additional keyword arguments passed to the underlying encode method.
440
+ """
441
+ if not messages:
442
+ return None
443
+ if isinstance(messages[0], dict):
444
+ messages = [messages] # type: ignore[assignment]
445
+
446
+ messages = cast(list[list[Message]], messages)
447
+ image_insertion_points_batch = []
448
+ tokenized_messages_batch = []
449
+ image_list: list[torch.Tensor | None] = []
450
+ for msgs in messages:
451
+ # msgs.append({
452
+ # "role": "assistant",
453
+ # "content": [{"type": "text", "text": ""}]
454
+ # })
455
+ tokenized_messages = []
456
+ if not suppress_bos_token and self.bos_token is not None:
457
+ tokenized_messages.append(torch.tensor([self.bos_token], dtype=torch.long))
458
+ insertion_points = []
459
+ token_count = 0
460
+ for msg in msgs:
461
+ token_count = self._process_content(
462
+ msg["content"],
463
+ role=msg["role"],
464
+ tokenized_messages=tokenized_messages,
465
+ insertion_points=insertion_points,
466
+ image_list=image_list,
467
+ token_count=token_count,
468
+ **kwargs,
469
+ )
470
+ tokenized_messages_batch.append(torch.cat(tokenized_messages, dim=0).to(torch.long))
471
+ image_insertion_points_batch.append(torch.tensor(insertion_points, dtype=torch.long))
472
+
473
+ if msgs and self.asst_end_tokens and msgs[-1]["role"].lower() == "assistant":
474
+ # Remove the assistant end tokens from the final message
475
+ end_token_len = len(self.asst_end_tokens)
476
+ tokenized_messages_batch[-1] = tokenized_messages_batch[-1][:-end_token_len]
477
+ if msgs and self.asst_start_tokens and msgs[-1]["role"].lower() == "user":
478
+ # Append the assistant start tokens so generation begins an assistant turn
479
+ end_token_len = len(self.asst_end_tokens)
480
+ tokenized_messages_batch[-1] = torch.cat(
481
+ [
482
+ tokenized_messages_batch[-1],
483
+ torch.Tensor(self.asst_start_tokens).to(torch.long),
484
+ ]
485
+ )
486
+
487
+ input_ids, attention_mask, image_embeds_insertion_points = self.pad_tokenized_messages(
488
+ tokenized_messages_batch, image_insertion_points_batch
489
+ )
490
+
491
+ if image_list:
492
+ assert sum(img is None for img in image_list) % len(image_list) == 0, (
493
+ "Either all or no image must be None."
494
+ )
495
+ pixel_values: None | torch.Tensor | list[torch.Tensor]
496
+ if image_list[0] is None:
497
+ pixel_values = None
498
+ else:
499
+ pixel_values = cast(list[torch.Tensor], image_list)
500
+ return ProcessorOutput(
501
+ input_ids=input_ids,
502
+ image_embeds_insertion_points=image_embeds_insertion_points,
503
+ attention_mask=attention_mask,
504
+ pixel_values=pixel_values,
505
+ )
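The least obvious part of the processing code above is the patch arithmetic in `QwenImageProcessor`: images are resized so that height and width are multiples of `round_to_patch_size` (56 = 14-pixel patches times a 2x2 merge window), capped at an area of `img_size**2`, then unfolded into rows of 588 values (3 channels x 14 x 14) and viewed as `(t, H/14, W/14, 588)`. A small shape check, assuming the `QwenImageProcessor` class defined above is in scope and using random data instead of a real image:

```python
import torch

patch = 14                                # matches _patch_size above (merge window is 2x2)
img = torch.rand(1, 3, 448, 448)          # (t, c, H, W); 448 is already a multiple of 56

proc = QwenImageProcessor(img_size=448)
out = proc.process_images(img)

h = w = 448 // patch                      # 32 x 32 patches per frame
assert out.shape == (1, h, w, 3 * patch * patch)  # (1, 32, 32, 588)
```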
processing_helium1_casa.py ADDED
@@ -0,0 +1,37 @@
1
+ from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
2
+
3
+ from .processing import BaseProcessor, QwenImageProcessor
4
+
5
+
6
+ class Helium1CASAProcessor(BaseProcessor):
7
+ attributes = ["tokenizer"]
8
+ tokenizer_class = "PreTrainedTokenizerFast"
9
+
10
+ def __init__(
11
+ self,
12
+ tokenizer: PreTrainedTokenizerFast,
13
+ pre_image_tokens: tuple[int, ...] = tuple(),
14
+ post_image_tokens: tuple[int, ...] = tuple(),
15
+ system_start_tokens: tuple[int, ...] = tuple(),
16
+ system_end_tokens: tuple[int, ...] = tuple(),
17
+ user_start_tokens: tuple[int, ...] = (104,),
18
+ user_end_tokens: tuple[int, ...] = (105,),
19
+ asst_start_tokens: tuple[int, ...] = (102,),
20
+ asst_end_tokens: tuple[int, ...] = (103,),
21
+ bos_token: int = 1,
22
+ image_size: int = 896,
23
+ ):
24
+ super().__init__(
25
+ tokenizer=tokenizer,
26
+ pre_image_tokens=pre_image_tokens,
27
+ post_image_tokens=post_image_tokens,
28
+ system_start_tokens=system_start_tokens,
29
+ system_end_tokens=system_end_tokens,
30
+ user_start_tokens=user_start_tokens,
31
+ user_end_tokens=user_end_tokens,
32
+ asst_start_tokens=asst_start_tokens,
33
+ asst_end_tokens=asst_end_tokens,
34
+ allow_system_prompt=False,
35
+ bos_token=bos_token,
36
+ )
37
+ self._image_processor = QwenImageProcessor(img_size=image_size)
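
As a point of reference, here is a direct-construction sketch (not from the repository). The package import path is hypothetical, since `processing_helium1_casa.py` uses a relative import and therefore has to live inside a package, and loading the tokenizer with `PreTrainedTokenizerFast.from_pretrained` is likewise an assumption.

```python
# Sketch of constructing the processor by hand instead of via AutoProcessor.
from transformers import PreTrainedTokenizerFast

from casa_helium1_vl import Helium1CASAProcessor  # hypothetical package name for the repo files

tokenizer = PreTrainedTokenizerFast.from_pretrained("kyutai/CASA-Helium1-VL-2B")
# Defaults mirror the signature above: Helium chat-marker ids 102-105, BOS id 1, 896-px images.
processor = Helium1CASAProcessor(tokenizer=tokenizer)
# Any default can be overridden explicitly, e.g. a lower input resolution:
processor_lowres = Helium1CASAProcessor(tokenizer=tokenizer, image_size=448)
```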
processor_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "auto_map": {
+     "AutoProcessor": "processing_helium1_casa.Helium1CASAProcessor"
+   },
+   "bos_token": 1,
+   "image_size": 896,
+   "post_image_tokens": [],
+   "pre_image_tokens": [],
+   "processor_class": "Helium1CASAProcessor"
+ }
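
Because `auto_map` points `AutoProcessor` at the class defined in `processing_helium1_casa.py`, the processor can be loaded without any manual imports, provided remote code is trusted. A short sketch (repository id assumed from the model card):

```python
from transformers import AutoProcessor

# auto_map resolves to processing_helium1_casa.Helium1CASAProcessor; trust_remote_code is required.
processor = AutoProcessor.from_pretrained("kyutai/CASA-Helium1-VL-2B", trust_remote_code=True)
print(type(processor).__name__)  # expected: Helium1CASAProcessor
```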
readme_images/CASA.png ADDED

Git LFS Details

  • SHA256: 388d6098d61c64dfee303411d93440dd00a8371806055af3a225aeb70590f746
  • Pointer size: 131 Bytes
  • Size of remote file: 115 kB
readme_images/casa_explainer.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c5ee73e4e8ea65ebc8d3e53d6468a3d2688d5f656259bcae73bffd843e5a0e69
+ size 299297
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
332
+ </div>
333
+
334
+ </header></div>
335
+ <div class="SVELTE_HYDRATER contents" data-target="LastCommit" data-props="{&quot;commitLast&quot;:{&quot;date&quot;:&quot;2025-04-30T14:01:50.000Z&quot;,&quot;verified&quot;:&quot;verified&quot;,&quot;subject&quot;:&quot;Upload tokenizer.model with huggingface_hub&quot;,&quot;authors&quot;:[{&quot;_id&quot;:&quot;6355a3c1805be5a8f30fea49&quot;,&quot;avatar&quot;:&quot;https://cdn-avatars.huggingface.co/v1/production/uploads/6355a3c1805be5a8f30fea49/ONMEctCWAeAgF2eZ307si.jpeg&quot;,&quot;isHf&quot;:false,&quot;user&quot;:&quot;lmz&quot;}],&quot;commit&quot;:{&quot;id&quot;:&quot;b8d50a6775dfd77d956b7cd18928736dccd17fe7&quot;,&quot;parentIds&quot;:[&quot;b3d4f57a13777182735134b6aaf4b610767cd08c&quot;]},&quot;title&quot;:&quot;Upload tokenizer.model with huggingface_hub&quot;},&quot;repo&quot;:{&quot;name&quot;:&quot;kyutai/helium-1-2b&quot;,&quot;type&quot;:&quot;model&quot;}}"><div class="from-gray-100-to-white bg-linear-to-t flex flex-wrap items-baseline gap-y-1 rounded-t-lg border border-b-0 px-3 py-2 dark:border-gray-800"><img class="mr-2.5 mt-0.5 h-4 w-4 self-center rounded-full" alt="lmz's picture" src="https://cdn-avatars.huggingface.co/v1/production/uploads/6355a3c1805be5a8f30fea49/ONMEctCWAeAgF2eZ307si.jpeg">
336
+ <div class="mr-4 flex flex-none items-center truncate"><a class="hover:underline" href="/lmz">lmz
337
+ </a>
338
+
339
+ </div>
340
+ <div class="mr-4 truncate font-mono text-xs text-gray-500 hover:prose-a:underline sm:text-sm"><!-- HTML_TAG_START -->Upload tokenizer.model with huggingface_hub<!-- HTML_TAG_END --></div>
341
+ <a class="rounded-sm border bg-gray-50 px-1.5 text-sm hover:underline dark:border-gray-800 dark:bg-gray-900" href="/kyutai/helium-1-2b/commit/b8d50a6775dfd77d956b7cd18928736dccd17fe7">b8d50a6</a>
342
+ <span class="mx-2 text-green-500 dark:text-green-600 px-1.5 border-green-100 dark:border-green-800 rounded-full border text-xs uppercase" title="This commit is signed and the signature is verified">verified</span>
343
+ <time class="ml-auto hidden flex-none truncate pl-2 text-gray-500 dark:text-gray-400 lg:block" datetime="2025-04-30T14:01:50" title="Wed, 30 Apr 2025 14:01:50 GMT">7 months ago</time></div></div>
344
+ <div class="relative flex flex-wrap items-center border px-3 py-1.5 text-sm text-gray-800 dark:border-gray-800 dark:bg-gray-900 ">
345
+ <a class="group my-1 mr-4 flex items-center " download="" href="/kyutai/helium-1-2b/resolve/main/tokenizer.model?download=true"><span class="flex items-center group-hover:underline"><svg class="mr-1.5" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" focusable="false" role="img" width="1em" height="1em" viewBox="0 0 32 32"><path fill="currentColor" d="M26 24v4H6v-4H4v4a2 2 0 0 0 2 2h20a2 2 0 0 0 2-2v-4zm0-10l-1.41-1.41L17 20.17V2h-2v18.17l-7.59-7.58L6 14l10 10l10-10z"></path></svg>
346
+ download</span>
347
+
348
+ </a><div class="SVELTE_HYDRATER contents" data-target="CopyButton" data-props="{&quot;value&quot;:&quot;https://huggingface.co/kyutai/helium-1-2b/resolve/main/tokenizer.model&quot;,&quot;style&quot;:&quot;blank&quot;,&quot;label&quot;:&quot;Copy download link&quot;,&quot;classNames&quot;:&quot;my-1 mr-4 flex items-center no-underline hover:underline&quot;}"><button class="my-1 mr-4 flex items-center no-underline hover:underline " title="Copy download link" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg>
349
+ <span class="ml-1.5 ">Copy download link</span></button></div><a class="group my-1 mr-4 flex items-center " href="/kyutai/helium-1-2b/commits/main/tokenizer.model"><span class="flex items-center group-hover:underline"><svg class="mr-1.5" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32" style="transform: rotate(360deg);"><path d="M16 4C9.383 4 4 9.383 4 16s5.383 12 12 12s12-5.383 12-12S22.617 4 16 4zm0 2c5.535 0 10 4.465 10 10s-4.465 10-10 10S6 21.535 6 16S10.465 6 16 6zm-1 2v9h7v-2h-5V8z" fill="currentColor"></path></svg>
350
+ history</span>
351
+
352
+ </a><a class="group my-1 mr-4 flex items-center " href="/kyutai/helium-1-2b/blame/main/tokenizer.model"><span class="flex items-center group-hover:underline"><svg class="mr-1.5" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32" style="transform: rotate(360deg);"><path d="M16 2a14 14 0 1 0 14 14A14 14 0 0 0 16 2zm0 26a12 12 0 1 1 12-12a12 12 0 0 1-12 12z" fill="currentColor"></path><path d="M11.5 11a2.5 2.5 0 1 0 2.5 2.5a2.48 2.48 0 0 0-2.5-2.5z" fill="currentColor"></path><path d="M20.5 11a2.5 2.5 0 1 0 2.5 2.5a2.48 2.48 0 0 0-2.5-2.5z" fill="currentColor"></path></svg>
353
+ blame</span>
354
+
355
+ </a><a class="group my-1 mr-4 flex items-center text-green-600 dark:text-green-500" href="/kyutai/helium-1-2b/edit/main/tokenizer.model"><span class="flex items-center group-hover:underline"><svg class="mr-1.5" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M2 26h28v2H2z" fill="currentColor"></path><path d="M25.4 9c.8-.8.8-2 0-2.8l-3.6-3.6c-.8-.8-2-.8-2.8 0l-15 15V24h6.4l15-15zm-5-5L24 7.6l-3 3L17.4 7l3-3zM6 22v-3.6l10-10l3.6 3.6l-10 10H6z" fill="currentColor"></path></svg>
356
+ contribute</span>
357
+
358
+ </a><a class="group my-1 mr-4 flex items-center " href="/kyutai/helium-1-2b/delete/main/tokenizer.model"><span class="flex items-center group-hover:underline"><svg class="mr-1.5" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M12 12h2v12h-2z" fill="currentColor"></path><path d="M18 12h2v12h-2z" fill="currentColor"></path><path d="M4 6v2h2v20a2 2 0 0 0 2 2h16a2 2 0 0 0 2-2V8h2V6zm4 22V8h16v20z" fill="currentColor"></path><path d="M12 2h8v2h-8z" fill="currentColor"></path></svg>
359
+ delete</span>
360
+
361
+ </a>
362
+
363
+ <div class="mr-4 flex items-center"><div class="SVELTE_HYDRATER contents" data-target="ScanStatusBadge" data-props="{&quot;classNames&quot;:&quot;mr-2&quot;,&quot;scanStatus&quot;:{&quot;status&quot;:&quot;safe&quot;,&quot;protectAiScan&quot;:{&quot;status&quot;:&quot;unscanned&quot;,&quot;message&quot;:null,&quot;reportLink&quot;:&quot;https://protectai.com/insights/models/kyutai/helium-1-2b/5764947fc2e782982c24f363eacd9baea3e821f8/files?blob-id=48679a193304ea9e6dda4c4de9be4d4db590c249&amp;utm_source=huggingface&quot;},&quot;avScan&quot;:{&quot;status&quot;:&quot;safe&quot;,&quot;message&quot;:&quot;No security issues detected&quot;,&quot;reportLink&quot;:&quot;https://fdtn.ai/ai-supply-chain/hugging-face?utm_source=huggingface&quot;,&quot;reportLabel&quot;:&quot;Learn more at Cisco Foundation AI&quot;},&quot;pickleImportScan&quot;:{&quot;status&quot;:&quot;unscanned&quot;,&quot;pickleImports&quot;:[],&quot;version&quot;:&quot;0.0.0&quot;},&quot;virusTotalScan&quot;:{&quot;status&quot;:&quot;safe&quot;,&quot;message&quot;:&quot;0/76 engines detect it as malicious.&quot;,&quot;reportLink&quot;:&quot;https://www.virustotal.com/gui/file/abb8879fdb2001dfae68d0bbdccbe92ae1593bad518abb34c9513f27904ee303?utm_source=huggingface&quot;,&quot;reportLabel&quot;:&quot;See more details on VirusTotal&quot;},&quot;jFrogScan&quot;:{&quot;status&quot;:&quot;unscanned&quot;,&quot;message&quot;:&quot;Not a machine-learning model&quot;,&quot;reportLink&quot;:&quot;&quot;,&quot;reportLabel&quot;:&quot;&quot;}},&quot;repo&quot;:{&quot;name&quot;:&quot;kyutai/helium-1-2b&quot;,&quot;type&quot;:&quot;model&quot;},&quot;revision&quot;:&quot;main&quot;,&quot;filePath&quot;:&quot;tokenizer.model&quot;,&quot;openByDefault&quot;:false}"><div class="sm:relative mr-2"><button class="flex h-[1.125rem] select-none items-center gap-0.5 rounded border pl-0.5 pr-0.5 text-xs leading-tight text-gray-400 hover:cursor-pointer text-gray-400 hover:border-gray-200 hover:bg-gray-50 hover:text-gray-500 dark:border-gray-800 dark:hover:bg-gray-800 dark:hover:text-gray-200 "><svg class="flex-none" width="1em" height="1em" viewBox="0 0 22 28" fill="none" xmlns="http://www.w3.org/2000/svg"><path fill-rule="evenodd" clip-rule="evenodd" d="M15.3634 10.3639C15.8486 10.8491 15.8486 11.6357 15.3634 12.1209L10.9292 16.5551C10.6058 16.8785 10.0814 16.8785 9.7579 16.5551L7.03051 13.8277C6.54532 13.3425 6.54532 12.5558 7.03051 12.0707C7.51569 11.5855 8.30234 11.5855 8.78752 12.0707L9.7579 13.041C10.0814 13.3645 10.6058 13.3645 10.9292 13.041L13.6064 10.3639C14.0916 9.8787 14.8782 9.8787 15.3634 10.3639Z" fill="currentColor"></path><path fill-rule="evenodd" clip-rule="evenodd" d="M10.6666 27.12C4.93329 25.28 0 19.2267 0 12.7867V6.52001C0 5.40001 0.693334 4.41334 1.73333 4.01334L9.73333 1.01334C10.3333 0.786673 11 0.786673 11.6 1.02667L19.6 4.02667C20.1083 4.21658 20.5465 4.55701 20.8562 5.00252C21.1659 5.44803 21.3324 5.97742 21.3333 6.52001V12.7867C21.3333 19.24 16.4 25.28 10.6666 27.12Z" fill="currentColor" fill-opacity="0.22"></path><path d="M10.0845 1.94967L10.0867 1.94881C10.4587 1.8083 10.8666 1.81036 11.2286 1.95515L11.2387 1.95919L11.2489 1.963L19.2489 4.963L19.25 4.96342C19.5677 5.08211 19.8416 5.29488 20.0351 5.57333C20.2285 5.85151 20.3326 6.18203 20.3333 6.52082C20.3333 6.52113 20.3333 6.52144 20.3333 6.52176L20.3333 12.7867C20.3333 18.6535 15.8922 24.2319 10.6666 26.0652C5.44153 24.2316 1 18.6409 1 12.7867V6.52001C1 5.82357 1.42893 5.20343 2.08883 4.94803L10.0845 1.94967Z" stroke="currentColor" stroke-opacity="0.30" 
stroke-width="2"></path></svg>
364
+
365
+ <span class="mr-0.5 max-sm:hidden">Safe</span></button>
366
+
367
+ </div></div>
368
+ </div>
369
+
370
+ <div class="flex items-center gap-x-3 dark:text-gray-300 sm:ml-auto">
371
+ 1.14 MB</div></div>
372
+
373
+ <div class="relative min-h-[100px] rounded-b-lg border border-t-0 leading-tight dark:border-gray-800 dark:bg-gray-925">
374
+ <div class="p-4 py-8 text-center">This file is stored with
375
+ <a class="underline" href="https://huggingface.co/docs/hub/xet/index">Xet</a>
376
+ . It is too big to display, but you can still
377
+ <a download class="underline" href="/kyutai/helium-1-2b/resolve/main/tokenizer.model">download</a>
378
+ it.
379
+ </div>
380
+ <div class="bg-linear-to-br from-gray-50-to-white relative border-t p-4"><div class="text-smd mb-2 flex items-baseline"><h3 class="font-semibold">Large File Pointer Details</h3>
381
+ <span class="ml-2">(</span>
382
+ <a href="/kyutai/helium-1-2b/raw/main/tokenizer.model" class="flex items-center underline decoration-gray-400 hover:decoration-gray-700 dark:decoration-gray-500 dark:hover:decoration-gray-300" target="_blank"><svg class="mr-0.5 text-xs" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M25.7 9.3l-7-7A.908.908 0 0 0 18 2H8a2.006 2.006 0 0 0-2 2v24a2.006 2.006 0 0 0 2 2h16a2.006 2.006 0 0 0 2-2V10a.908.908 0 0 0-.3-.7zM18 4.4l5.6 5.6H18zM24 28H8V4h8v6a2.006 2.006 0 0 0 2 2h6z" fill="currentColor"></path></svg> Raw pointer file
383
+ </a>
384
+ <span>)</span></div>
385
+ <dl class="break-words font-mono text-[0.8rem]"><div class="mr-1 flex md:mb-1"><dt class="mr-1.5 font-semibold">SHA256:</dt>
386
+ <dd class="truncate">abb8879fdb2001dfae68d0bbdccbe92ae1593bad518abb34c9513f27904ee303</dd>
387
+ </div><div class="flex flex-wrap"><dt class="mr-1.5 font-semibold">Pointer size:</dt>
388
+ <dd>132 Bytes</dd>
389
+
390
+ <div class="px-1.5 opacity-40">·</div>
391
+
392
+ <dt class="mr-1.5 font-semibold">Size of remote file:</dt>
393
+ <dd>1.14 MB</dd>
394
+
395
+ <div class="px-1.5 opacity-40">·</div>
396
+ <dt class="mr-1.5 font-semibold">Xet hash:</dt>
397
+ <dd class="truncate">bca9ac44cc00b884e9fb49abf7fa3576e32e568e788ff68ce6cee89d24e2b8e4</dd></div></dl>
398
+ <p class="mt-2 text-sm text-gray-500">Xet efficiently stores Large Files inside Git, intelligently splitting files into unique chunks and
399
+ accelerating uploads and downloads.
400
+ <a class="underline" href="/join/xet" target="_blank">More info</a>.</p></div>
401
+ </div></section></div></main>
402
+
403
+ </div>
404
+ <script>
405
+ import("\/front\/build\/kube-02d86c8\/index.js"); window.moonSha = "kube-02d86c8\/"; window.__hf_deferred =
406
+ {};
407
+ </script>
408
+ <!-- Stripe -->
409
+ <script>
410
+ if (["hf.co", "huggingface.co"].includes(window.location.hostname)) {
411
+ const script = document.createElement("script");
412
+ script.src = "https://js.stripe.com/v3/";
413
+ script.async = true;
414
+ document.head.appendChild(script);
415
+ }
416
+ </script>
417
+ </body>
418
+ </html>
tokenizer_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "additional_special_tokens": [
+     "<|im_sp_00|>",
+     "<|im_sp_01|>",
+     "<|im_sp_02|>",
+     "<|im_sp_94|>",
+     "<|im_sp_95|>",
+     "<|im_sp_96|>",
+     "<|im_sp_97|>",
+     "<|im_sp_98|>",
+     "<|im_sp_99|>"
+   ]
+ }
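
The config above registers the `<|im_sp_XX|>` tokens as additional special tokens on top of the `PreTrainedTokenizerFast` tokenizer. As a quick sanity check, the snippet below is a minimal sketch (not part of this repository) of how these tokens could be inspected after download; the Hub repo id `kyutai/CASA-Helium1-VL-2B` and the use of `trust_remote_code` are assumptions based on the model card tags, not on the files shown here.

```python
# Minimal sketch (not part of this repo): check that the additional special tokens
# from tokenizer_config.json map to single ids. The repo id and trust_remote_code
# usage below are assumptions, not taken from the files shown here.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "kyutai/CASA-Helium1-VL-2B", trust_remote_code=True
)

for tok in ("<|im_sp_00|>", "<|im_sp_99|>"):
    token_id = tokenizer.convert_tokens_to_ids(tok)
    # A registered special token should tokenize to exactly one id.
    assert tokenizer.encode(tok, add_special_tokens=False) == [token_id]
    print(tok, "->", token_id)
```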
utils.py ADDED
@@ -0,0 +1,116 @@
+ # pylint: disable=protected-access
+ """Utils to handle CASA layers construction"""
+
+ from contextlib import contextmanager
+ from dataclasses import dataclass, fields
+ from typing import Any, Callable, Generic, TypeVar
+
+ import torch
+
+
+ def delta_w_factory(
+     org_lin: torch.nn.Linear, new_lin: torch.nn.Linear
+ ) -> Callable[[torch.Tensor], torch.Tensor]:
+     """Factory for building a linear op whose weights are the sum of the two layers' weights"""
+
+     def _delta_w_fwd(input: torch.Tensor) -> torch.Tensor:
+         nonlocal org_lin, new_lin
+         bias = None if org_lin.bias is None else org_lin.bias + new_lin.bias
+         return torch.nn.functional.linear(input, org_lin.weight + new_lin.weight, bias)
+
+     return _delta_w_fwd
+
+
+ @dataclass
+ class StreamingState:
+     """Streaming state used by CASA layers at inference to store
+     e.g. the offset, the KV cache and other persistent state"""
+
+     offset: int = 0
+
+     def _is_valid_field(self, key: str) -> bool:
+         return key in {x.name for x in fields(self)}
+
+     def _init_field(self, key: str) -> None:
+         """Init function for non-argument-dependent defaults"""
+         assert self._is_valid_field(key)
+         if key == "offset":
+             self.offset = 0
+         else:
+             # for fields which should be set explicitly and cannot be auto-initialized
+             setattr(self, key, None)
+
+     def init(self) -> None:
+         for key in [x.name for x in fields(self)]:
+             self._init_field(key)
+
+     def _reset_field(self, name: str) -> None:
+         """Resets the given field"""
+         self._init_field(name)
+
+     def reset(self) -> None:
+         for f in fields(self):
+             self._reset_field(f.name)
+
+     def _get_field(self, f: str) -> Any:
+         """Get a field, initializing it first if it is unset"""
+         assert self._is_valid_field(f)
+         if getattr(self, f) is None:
+             self._init_field(f)
+         return getattr(self, f)
+
+     def _set_field(self, f: str, value: Any) -> None:
+         assert self._is_valid_field(f)
+         setattr(self, f, value)
+
+
+ StreamingStateT = TypeVar("StreamingStateT", bound=StreamingState)
+
+
+ class StreamingModule(torch.nn.Module, Generic[StreamingStateT]):  # pylint: disable=abstract-method
+     """Overrides Audiocraft's streaming modules with additional small utils"""
+
+     def __init__(self, state_class: type) -> None:
+         torch.nn.Module.__init__(self)
+         self.is_streaming: bool = False
+         self.enable_viz: tuple[str, ...] = ()
+         self._streaming_state: StreamingStateT = state_class()
+
+     @property
+     def streaming_state(self) -> StreamingStateT:
+         return self._streaming_state
+
+     def _apply_named_streaming(self, fn: Callable):
+         """Apply function to all streaming modules"""
+         for name, module in self.named_modules():
+             if isinstance(module, StreamingModule):
+                 fn(name, module)
+
+     def reset_streaming(self):
+         """Reset the streaming state."""
+
+         def _reset(_: str, module: StreamingModule):
+             module._streaming_state.reset()
+
+         self._apply_named_streaming(_reset)
+
+     def _set_streaming(self, streaming: bool, viz: tuple[str, ...] = ()):
+         """Set all streaming modules in streaming mode"""
+
+         def _set_streaming(_, module: StreamingModule) -> None:
+             module.is_streaming = streaming
+             module.enable_viz = viz
+             if streaming:
+                 module.streaming_state.init()
+
+         self._apply_named_streaming(_set_streaming)
+
+     @contextmanager
+     def streaming(self, stream: bool = True, viz: tuple[str, ...] = ()):
+         """Context manager to enter streaming mode. Resets the streaming state on exit."""
+         self._set_streaming(stream, viz)
+         try:
+             yield
+         finally:
+             self._set_streaming(False, ())
+             self.reset_streaming()
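
To make the helpers in `utils.py` concrete, the following is an illustrative usage sketch, not code taken from the repository: it exercises `delta_w_factory` on two toy `nn.Linear` layers and drives a minimal `StreamingModule` subclass through the `streaming()` context manager. The layer sizes and the `ToyStream` class are invented for illustration, and the import assumes the file is importable as `utils` (e.g. when run from the repo root).

```python
# Illustrative sketch (not from the repo): exercising the helpers in utils.py.
import torch
from utils import StreamingModule, StreamingState, delta_w_factory

# delta_w_factory: forward pass whose weights are the sum of two linears,
# e.g. a frozen base projection plus a learned delta (sizes are made up here).
base = torch.nn.Linear(16, 16)
delta = torch.nn.Linear(16, 16)
fused_forward = delta_w_factory(base, delta)
x = torch.randn(2, 16)
assert torch.allclose(
    fused_forward(x),
    torch.nn.functional.linear(x, base.weight + delta.weight, base.bias + delta.bias),
)


# StreamingModule: a toy subclass whose state is just the default offset counter.
class ToyStream(StreamingModule[StreamingState]):
    def __init__(self) -> None:
        super().__init__(StreamingState)

    def forward(self, n_tokens: int) -> int:
        if self.is_streaming:
            self._streaming_state.offset += n_tokens
        return self._streaming_state.offset


m = ToyStream()
with m.streaming():
    m(4)
    print(m.streaming_state.offset)  # 4 while streaming
print(m.streaming_state.offset)      # reset to 0 on exiting the context
```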