Add support for transformers 4.44 through 5.0+

#11

Add support for a broader set of transformers versions

This PR updates llama_bidirectional_model.py to support transformers versions 4.44 through 5.0+, replacing the previous requirement of exactly 4.47.1. It also fixes a latent config.json bug that would have caused incorrect scores on transformers 5.0+.

Why this change was needed

The previous implementation relied on overriding _update_causal_mask() to create bidirectional attention masks. This approach broke in several ways:

  1. transformers 4.48: The attention refactor (#35235) activated our _attn_implementation = "eager" line, forcing eager attention instead of SDPA
  2. transformers 4.53: The _update_causal_mask method was removed entirely, with masking logic moved to masking_utils

Additionally, LlamaBidirectionalForSequenceClassification inherited from LlamaForSequenceClassification, coupling it to parent class internals that changed across versions.

A separate issue existed in config.json: the temperature field was set to 0.2, but transformers <5.0 silently dropped this custom field during PretrainedConfig deserialization, so the model always ran with the default temperature=1.0. Transformers 5.0+ correctly loads the field, which would scale scores by 5x relative to earlier versions.
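
To make the 5x factor concrete, here is a small sketch that assumes the score is computed by dividing logits by config.temperature (where exactly the field is consumed lives in the modeling code and is not shown here):

```python
import torch

logits = torch.tensor([1.2, -0.4])

# transformers <5.0: the custom field was dropped, so the default temperature=1.0 applied.
scores_old = logits / 1.0
# transformers 5.0+: temperature=0.2 from config.json would actually be used.
scores_new = logits / 0.2

assert torch.allclose(scores_new, 5 * scores_old)  # 1 / 0.2 == 5
```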

What changed

llama_bidirectional_model.py:

LlamaBidirectionalModel (base model):

  • Unified forward() override instead of _update_causal_mask override
  • Introspection-based API detection using inspect.signature() rather than hardcoded version checks (a sketch of the detection and fallback logic follows this list)
  • Automatic fallback for mask creation: uses create_bidirectional_mask (5.0+) or _prepare_4d_attention_mask (older)
  • Handles API differences across versions:
    • Decoder layer return type (tuple in <4.54, tensor in >=4.54)
    • Cache parameter name (past_key_value vs past_key_values)
    • DynamicCache constructor signature
  • Removed _attn_implementation = "eager"; users should now pass the attention implementation via model_kwargs when loading the model
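
A minimal sketch of the detection and fallback logic described above. The transformers.masking_utils import path and the keyword arguments of create_bidirectional_mask are assumptions based on this description, the helper names bidirectional_mask and run_decoder_layer are illustrative, and real decoder layers may require additional arguments (e.g. position embeddings) depending on the installed version:

```python
import inspect

from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Mask creation: prefer the 5.0+ helper, fall back to the older 4D expansion.
try:
    from transformers.masking_utils import create_bidirectional_mask  # 5.0+ (assumed path)
except ImportError:
    create_bidirectional_mask = None
    from transformers.modeling_attn_mask_utils import _prepare_4d_attention_mask


def bidirectional_mask(config, attention_mask, inputs_embeds):
    if create_bidirectional_mask is not None:
        # Filter kwargs through the actual signature so the call survives minor API changes.
        accepted = inspect.signature(create_bidirectional_mask).parameters
        kwargs = {
            "config": config,
            "input_embeds": inputs_embeds,
            "attention_mask": attention_mask,
        }
        return create_bidirectional_mask(**{k: v for k, v in kwargs.items() if k in accepted})
    # Older releases: expand the 2D padding mask to 4D without adding a causal triangle.
    return _prepare_4d_attention_mask(attention_mask, inputs_embeds.dtype)


# Decoder-layer API detection: pick whichever cache keyword the installed version declares.
_LAYER_PARAMS = inspect.signature(LlamaDecoderLayer.forward).parameters
_CACHE_KWARG = "past_key_values" if "past_key_values" in _LAYER_PARAMS else "past_key_value"


def run_decoder_layer(layer, hidden_states, attention_mask, position_ids, cache, **extra):
    outputs = layer(
        hidden_states,
        attention_mask=attention_mask,
        position_ids=position_ids,
        **{_CACHE_KWARG: cache},
        **extra,
    )
    # <4.54 returns a tuple whose first element is the hidden states; >=4.54 returns the tensor.
    return outputs[0] if isinstance(outputs, tuple) else outputs
```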

LlamaBidirectionalForSequenceClassification:

  • Extends LlamaPreTrainedModel directly instead of LlamaForSequenceClassification, avoiding dependence on parent class internals that change across versions (a minimal sketch follows this list)
  • Owns its score layer and model explicitly rather than deleting and recreating the parent's
  • Accepts **kwargs in forward() to handle additional arguments passed by newer transformers versions
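
A condensed sketch of the class layout described above, with LlamaBidirectionalModel standing in for the base model defined in the same file; the mean pooling shown is illustrative only and loss computation is omitted:

```python
from torch import nn
from transformers.modeling_outputs import SequenceClassifierOutputWithPast
from transformers.models.llama.modeling_llama import LlamaPreTrainedModel


class LlamaBidirectionalForSequenceClassification(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        # Own the backbone and score head directly instead of deleting and
        # recreating the ones built by LlamaForSequenceClassification.
        self.model = LlamaBidirectionalModel(config)  # base model from this file
        self.score = nn.Linear(config.hidden_size, config.num_labels, bias=False)
        self.post_init()

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        # **kwargs absorbs extra arguments that newer transformers versions pass through.
        hidden_states = self.model(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Illustrative mean pooling over non-padded tokens; the actual pooling
        # strategy is defined in llama_bidirectional_model.py.
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
        logits = self.score(pooled)
        return SequenceClassifierOutputWithPast(logits=logits)
```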

config.json:

  • Changed temperature from 0.2 to 1.0 to match the effective runtime value the model was trained and validated against

README.md:

  • Changed installation requirement from transformers==4.47.1 to transformers>=4.44

Testing

Tested with transformers versions: 4.44, 4.47.1, 4.48, 4.53, 4.57, 4.57.6, 5.0.0

Logits verified as an exact match across all versions against a golden reference generated with transformers 4.47.1.
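
A sketch of the kind of check used here, with a placeholder model id and golden-reference path; trust_remote_code=True and the query/passage input pair are assumptions based on the custom modeling code shipped in this repository:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "path/or/hub-id-of-this-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

inputs = tokenizer("example query", "example passage", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Golden reference generated once with transformers 4.47.1.
golden = torch.load("golden_logits_4.47.1.pt")
assert torch.equal(logits, golden), "logits diverged from the 4.47.1 reference"
```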
