# vanilla-tar-baseline-llama-3.1-8b

This model is a LoRA adapter for meta-llama/Llama-3.1-8B-Instruct, trained as part of the Multi-Turn Safety Alignment (MTSA) research project.
## Experiment Description
**Experiment:** Vanilla TAR Baseline (Paper Reproduction)

This checkpoint was trained with the MTSA-RLVR framework, which combines multi-turn Reinforcement Learning with Verifiable Rewards (RLVR) and Tamper Resistance (TAR) to produce safeguards that are robust both to input-space jailbreaks and to weight-space fine-tuning attacks.
## Training Details
- Base Model: Llama-3.1-8B-Instruct
- Dataset: `datasets/attack_target/train_attack_target_labels.json`
- Methodology: Multi-Turn RLVR + Tamper Resistance (Inner Loop)
## Hyperparameters
| Parameter | Value |
|---|---|
| Outer Learning Rate | 1e-5 |
| Inner Learning Rate | 1e-4 |
| Inner Loop Steps | 1 |
| TAR Type | Entropy Maximization |
| Method | SFT-based Meta-Learning |
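To make the table concrete, here is a minimal numpy sketch of a TAR-style inner/outer meta-learning loop. This is a toy illustration, not the MTSA-RLVR implementation: the "model" is a 4-way linear classifier, the inner loop simulates a one-step fine-tuning attack toward a harmful label, and the outer loop takes a first-order entropy-maximization step on the attacked weights' output distribution. Only the learning rates and inner-step count come from the table above; every other name and value is an assumption for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    # Shannon entropy; small epsilon guards log(0).
    return -np.sum(p * np.log(p + 1e-12))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))   # toy "policy": logits = W @ x
x = rng.normal(size=8)        # fixed toy input
harmful_target = 2            # label the simulated attacker tries to elicit

OUTER_LR, INNER_LR, INNER_STEPS = 1e-5, 1e-4, 1  # values from the table

def attack_grad(W):
    """Attacker's cross-entropy gradient toward the harmful target."""
    p = softmax(W @ x)
    g = p.copy()
    g[harmful_target] -= 1.0
    return np.outer(g, x)

entropy_before = entropy(softmax(W @ x))
for _ in range(20_000):
    # Inner loop: simulate a fine-tuning attack on a copy of the weights.
    W_adv = W.copy()
    for _ in range(INNER_STEPS):
        W_adv -= INNER_LR * attack_grad(W_adv)
    # Outer loop (first-order approximation): ascend the entropy of the
    # attacked model's output distribution; dH/dz = -p * (log p + H).
    p = softmax(W_adv @ x)
    dH_dz = -p * (np.log(p + 1e-12) + entropy(p))
    W += OUTER_LR * np.outer(dH_dz, x)
entropy_after = entropy(softmax(W @ x))
print(f"entropy before: {entropy_before:.4f}, after: {entropy_after:.4f}")
```

The outer objective pushes the post-attack model toward uniform (maximum-entropy) outputs, which is the intuition behind the "Entropy Maximization" TAR type: fine-tuning the safeguarded weights should yield an uninformative model rather than a harmful one.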
## Usage

To use this adapter, load it with the `peft` library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "meta-llama/Llama-3.1-8B-Instruct"

# Load the base model and its tokenizer, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, "suv11235/vanilla-tar-baseline-llama-3.1-8b")
```
## Citation
If you use this model in your research, please cite the MTSA workshop paper.