Add support for transformers 4.44 through 5.0+

#11

Add support for a broader set of transformers versions

This PR updates llama_bidirectional_model.py to support transformers versions 4.44 through 5.0+, replacing the previous requirement of exactly 4.47.1. It also fixes a latent config.json bug that would have caused incorrect scores on transformers 5.0+.

Why this change was needed

The previous implementation relied on overriding _update_causal_mask() to create bidirectional attention masks. This approach broke in several ways:

  1. transformers 4.48: The attention refactor (#35235) activated our _attn_implementation = "eager" line, forcing eager attention instead of SDPA
  2. transformers 4.53: The _update_causal_mask method was removed entirely, with masking logic moved to masking_utils

Additionally, LlamaBidirectionalForSequenceClassification inherited from LlamaForSequenceClassification, coupling it to parent class internals that changed across versions.

A separate issue existed in config.json: the temperature field was set to 0.2, but transformers <5.0 silently dropped this custom field during PretrainedConfig deserialization, so the model always ran with the default temperature=1.0. Transformers 5.0+ correctly loads the field, which would scale scores by 5x relative to earlier versions.
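
To make the 5x factor concrete, here is a small sketch that assumes the score is computed by dividing logits by config.temperature (where exactly the field is consumed lives in the modeling code and is not shown here):

```python
import torch

logits = torch.tensor([1.2, -0.4])

# transformers <5.0: the custom field was dropped, so the default temperature=1.0 applied.
scores_old = logits / 1.0
# transformers 5.0+: temperature=0.2 from config.json would actually be used.
scores_new = logits / 0.2

assert torch.allclose(scores_new, 5 * scores_old)  # 1 / 0.2 == 5
```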

What changed

llama_bidirectional_model.py:

LlamaBidirectionalModel (base model):

  • Unified forward() override instead of _update_causal_mask override
  • Introspection-based API detection using inspect.signature() rather than hardcoded version checks (a sketch of the detection and fallback logic follows this list)
  • Automatic fallback for mask creation: uses create_bidirectional_mask (5.0+) or _prepare_4d_attention_mask (older)
  • Handles API differences across versions:
    • Decoder layer return type (tuple in <4.54, tensor in >=4.54)
    • Cache parameter name (past_key_value vs past_key_values)
    • DynamicCache constructor signature
  • Removed _attn_implementation = "eager"; users should now pass the attention implementation via model_kwargs when loading the model
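
A minimal sketch of the detection and fallback logic described above. The transformers.masking_utils import path and the keyword arguments of create_bidirectional_mask are assumptions based on this description, the helper names bidirectional_mask and run_decoder_layer are illustrative, and real decoder layers may require additional arguments (e.g. position embeddings) depending on the installed version:

```python
import inspect

from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Mask creation: prefer the 5.0+ helper, fall back to the older 4D expansion.
try:
    from transformers.masking_utils import create_bidirectional_mask  # 5.0+ (assumed path)
except ImportError:
    create_bidirectional_mask = None
    from transformers.modeling_attn_mask_utils import _prepare_4d_attention_mask


def bidirectional_mask(config, attention_mask, inputs_embeds):
    if create_bidirectional_mask is not None:
        # Filter kwargs through the actual signature so the call survives minor API changes.
        accepted = inspect.signature(create_bidirectional_mask).parameters
        kwargs = {
            "config": config,
            "input_embeds": inputs_embeds,
            "attention_mask": attention_mask,
        }
        return create_bidirectional_mask(**{k: v for k, v in kwargs.items() if k in accepted})
    # Older releases: expand the 2D padding mask to 4D without adding a causal triangle.
    return _prepare_4d_attention_mask(attention_mask, inputs_embeds.dtype)


# Decoder-layer API detection: pick whichever cache keyword the installed version declares.
_LAYER_PARAMS = inspect.signature(LlamaDecoderLayer.forward).parameters
_CACHE_KWARG = "past_key_values" if "past_key_values" in _LAYER_PARAMS else "past_key_value"


def run_decoder_layer(layer, hidden_states, attention_mask, position_ids, cache, **extra):
    outputs = layer(
        hidden_states,
        attention_mask=attention_mask,
        position_ids=position_ids,
        **{_CACHE_KWARG: cache},
        **extra,
    )
    # <4.54 returns a tuple whose first element is the hidden states; >=4.54 returns the tensor.
    return outputs[0] if isinstance(outputs, tuple) else outputs
```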

LlamaBidirectionalForSequenceClassification:

  • Extends LlamaPreTrainedModel directly instead of LlamaForSequenceClassification, avoiding dependence on parent class internals that change across versions (a minimal sketch follows this list)
  • Owns its score layer and model explicitly rather than deleting and recreating the parent's
  • Accepts **kwargs in forward() to handle additional arguments passed by newer transformers versions
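
A condensed sketch of the class layout described above, with LlamaBidirectionalModel standing in for the base model defined in the same file; the mean pooling shown is illustrative only and loss computation is omitted:

```python
from torch import nn
from transformers.modeling_outputs import SequenceClassifierOutputWithPast
from transformers.models.llama.modeling_llama import LlamaPreTrainedModel


class LlamaBidirectionalForSequenceClassification(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        # Own the backbone and score head directly instead of deleting and
        # recreating the ones built by LlamaForSequenceClassification.
        self.model = LlamaBidirectionalModel(config)  # base model from this file
        self.score = nn.Linear(config.hidden_size, config.num_labels, bias=False)
        self.post_init()

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        # **kwargs absorbs extra arguments that newer transformers versions pass through.
        hidden_states = self.model(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Illustrative mean pooling over non-padded tokens; the actual pooling
        # strategy is defined in llama_bidirectional_model.py.
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
        logits = self.score(pooled)
        return SequenceClassifierOutputWithPast(logits=logits)
```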

config.json:

  • Changed temperature from 0.2 to 1.0 to match the effective runtime value the model was trained and validated against

README.md:

  • Changed installation requirement from transformers==4.47.1 to transformers>=4.44

Testing

Tested with transformers versions: 4.44, 4.47.1, 4.48, 4.53, 4.57, 4.57.6, 5.0.0

Logits verified as an exact match across all versions against a golden reference generated with transformers 4.47.1.
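
A sketch of the kind of check used here, with a placeholder model id and golden-reference path; trust_remote_code=True and the query/passage input pair are assumptions based on the custom modeling code shipped in this repository:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "path/or/hub-id-of-this-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

inputs = tokenizer("example query", "example passage", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Golden reference generated once with transformers 4.47.1.
golden = torch.load("golden_logits_4.47.1.pt")
assert torch.equal(logits, golden), "logits diverged from the 4.47.1 reference"
```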
