vanilla-tar-baseline-llama-3.1-8b

This model is a LoRA adapter for meta-llama/Llama-3.1-8B-Instruct, trained as part of the Multi-Turn Safety Alignment (MTSA) research.

Experiment Description

Experiment: Vanilla TAR Baseline (Paper Reproduction) This checkpoint was trained using the MTSA-RLVR framework, which combines Multi-Turn Reinforcement Learning from Human Feedback (RLHF) with Tamper Resistance (TAR) to produce safeguards that are robust to both input-space jailbreaks and weight-space fine-tuning attacks.

Training Details

  • Base Model: Llama-3.1-8B-Instruct
  • Dataset: datasets/attack_target/train_attack_target_labels.json
  • Methodology: Multi-Turn RLVR + Tamper Resistance (Inner Loop)

Hyperparameters

Parameter Value
Outer Learning Rate 1e-5
Inner Learning Rate 1e-4
Inner Loop Steps 1
TAR Type Entropy Maximization
Method SFT-based Meta-Learning

Usage

To use this adapter, load it using peft:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype='auto', device_map='auto')
model = PeftModel.from_pretrained(model, "suv11235/vanilla-tar-baseline-llama-3.1-8b")

Citation

If you use this model in your research, please cite the MTSA workshop paper.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for suv11235/vanilla-tar-baseline-llama-3.1-8b

Finetuned
(2705)
this model