Add support for transformers 4.44 through 5.0+
This PR updates `llama_bidirectional_model.py` to support transformers versions 4.44 through 5.0+, replacing the previous requirement of exactly 4.47.1. It also fixes a latent `config.json` bug that would have caused incorrect scores on transformers 5.0+.
## Why this change was needed

The previous implementation relied on overriding `_update_causal_mask()` to create bidirectional attention masks. This approach broke in several ways:
- transformers 4.48: The attention refactor (#35235) activated our `_attn_implementation = "eager"` line, forcing eager attention instead of SDPA
- transformers 4.53: The `_update_causal_mask` method was removed entirely, with masking logic moved to `masking_utils`
Additionally, `LlamaBidirectionalForSequenceClassification` inherited from `LlamaForSequenceClassification`, coupling it to parent class internals that changed across versions.
A separate issue existed in `config.json`: the `temperature` field was set to `0.2`, but transformers <5.0 silently dropped this custom field during `PretrainedConfig` deserialization, so the model always ran with the default `temperature=1.0`. Transformers 5.0+ loads the field correctly, which would change scores by a factor of 5 (1 / 0.2 = 5).
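For concreteness, assuming the common convention of dividing logits by the temperature before scoring (an assumption; the exact scoring path lives in the model code), the two effective temperatures diverge by exactly that factor:

```python
import torch

logits = torch.tensor([1.2, -0.4])

scores_pre_5 = logits / 1.0   # transformers <5.0: custom field silently dropped
scores_5plus = logits / 0.2   # transformers 5.0+: config temperature honored

print(scores_5plus / scores_pre_5)  # tensor([5., 5.]): a uniform 5x scaling
```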
## What changed

### llama_bidirectional_model.py

**LlamaBidirectionalModel (base model):**
- Unified `forward()` override instead of a `_update_causal_mask` override
- Introspection-based API detection using `inspect.signature()` rather than hardcoded version checks (see the sketch after this list)
- Automatic fallback for mask creation: uses `create_bidirectional_mask` (5.0+) or `_prepare_4d_attention_mask` (older)
- Handles API differences across versions:
  - Decoder layer return type (tuple in <4.54, tensor in >=4.54)
  - Cache parameter name (`past_key_value` vs `past_key_values`)
  - `DynamicCache` constructor signature
- Removed `_attn_implementation = "eager"`; users should pass an attention implementation via `model_kwargs` when loading
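A condensed sketch of the introspection approach, using the helper names listed above; the keyword arguments accepted by the 5.0+ mask helper are a guess here and should be checked against the installed release:

```python
import inspect

try:
    # transformers 5.0+: masking helpers live in masking_utils
    from transformers.masking_utils import create_bidirectional_mask
except ImportError:
    create_bidirectional_mask = None
    from transformers.modeling_attn_mask_utils import _prepare_4d_attention_mask


def bidirectional_mask(config, inputs_embeds, attention_mask):
    """Build a non-causal 4D mask with whichever helper this version provides."""
    if create_bidirectional_mask is not None:
        # Filter kwargs through the actual signature instead of pinning to one
        # release's API (the argument names below are a best guess).
        accepted = inspect.signature(create_bidirectional_mask).parameters
        kwargs = {
            "config": config,
            "input_embeds": inputs_embeds,
            "attention_mask": attention_mask,
        }
        return create_bidirectional_mask(
            **{k: v for k, v in kwargs.items() if k in accepted}
        )
    # Pre-5.0 path: expand the 2D padding mask to a non-causal 4D mask.
    return _prepare_4d_attention_mask(attention_mask, inputs_embeds.dtype)


def cache_kwarg_name(decoder_layer):
    """Detect whether this version's layer takes past_key_value or past_key_values."""
    params = inspect.signature(decoder_layer.forward).parameters
    return "past_key_values" if "past_key_values" in params else "past_key_value"


def layer_hidden_states(layer_output):
    """Decoder layers return a tuple in <4.54 and a bare tensor in >=4.54."""
    return layer_output[0] if isinstance(layer_output, tuple) else layer_output
```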
**LlamaBidirectionalForSequenceClassification:**

- Extends `LlamaPreTrainedModel` directly instead of `LlamaForSequenceClassification`, avoiding dependence on parent class internals that change across versions (a skeleton is sketched after this list)
- Owns its `score` layer and `model` explicitly rather than deleting and recreating the parent's
- Accepts `**kwargs` in `forward()` to handle additional arguments passed by newer transformers versions
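A rough skeleton of the new class structure. `LlamaBidirectionalModel` is the base class from this file; the mean pooling and the omitted loss computation are stand-ins, not necessarily what the repo implements:

```python
from torch import nn
from transformers import LlamaConfig, LlamaPreTrainedModel
from transformers.modeling_outputs import SequenceClassifierOutputWithPast


class LlamaBidirectionalForSequenceClassification(LlamaPreTrainedModel):
    def __init__(self, config: LlamaConfig):
        super().__init__(config)
        self.num_labels = config.num_labels
        # Owned directly rather than inherited from
        # LlamaForSequenceClassification and patched after the fact.
        self.model = LlamaBidirectionalModel(config)  # defined earlier in the file
        self.score = nn.Linear(config.hidden_size, config.num_labels, bias=False)
        self.post_init()

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        # **kwargs absorbs extra arguments newer transformers versions pass in.
        hidden_states = self.model(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Stand-in pooling: mean over non-padding positions.
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
        return SequenceClassifierOutputWithPast(logits=self.score(pooled))
```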
### config.json

- Set `temperature` from `0.2` to `1.0` to match the effective runtime value the model was trained and validated against

### README.md

- Changed the installation requirement from `transformers==4.47.1` to `transformers>=4.44`
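With any supported version installed, the attention backend is now chosen at load time rather than forced to eager. A hypothetical loading example (the model id is a placeholder; if loading through a wrapper that forwards `model_kwargs`, put `attn_implementation` there instead):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "org/llama-bidirectional-model",  # placeholder model id
    trust_remote_code=True,           # needed for the custom model class
    attn_implementation="sdpa",       # previously forced to "eager"
)
```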
## Testing
Tested with transformers versions: 4.44, 4.47.1, 4.48, 4.53, 4.57, 4.57.6, 5.0.0
Logits were verified as an exact match across all versions against a golden reference generated with transformers 4.47.1.
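The verification harness itself is not part of this PR, but an exact-match check of the kind described could look like the following (the reference file is hypothetical, with `model` and tokenized `inputs` prepared as in the loading example above):

```python
import torch

golden = torch.load("golden_logits_4.47.1.pt")  # hypothetical reference dump

with torch.no_grad():
    logits = model(**inputs).logits

# Exact match, not approximate: zero tolerance in both directions.
torch.testing.assert_close(logits, golden, rtol=0, atol=0)
```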