vanilla-tar-baseline-llama-3.1-8b

This model is a LoRA adapter for meta-llama/Llama-3.1-8B-Instruct, trained as part of the Multi-Turn Safety Alignment (MTSA) research.

Experiment Description

Experiment: Vanilla TAR Baseline (Paper Reproduction) This checkpoint was trained using the MTSA-RLVR framework, which combines Multi-Turn Reinforcement Learning from Human Feedback (RLHF) with Tamper Resistance (TAR) to produce safeguards that are robust to both input-space jailbreaks and weight-space fine-tuning attacks.

Training Details

Base Model: Llama-3.1-8B-Instruct
Dataset: datasets/attack_target/train_attack_target_labels.json
Methodology: Multi-Turn RLVR + Tamper Resistance (Inner Loop)

Hyperparameters

Parameter	Value
Outer Learning Rate	1e-5
Inner Learning Rate	1e-4
Inner Loop Steps	1
TAR Type	Entropy Maximization
Method	SFT-based Meta-Learning

Usage

To use this adapter, load it using peft:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype='auto', device_map='auto')
model = PeftModel.from_pretrained(model, "suv11235/vanilla-tar-baseline-llama-3.1-8b")

Citation

If you use this model in your research, please cite the MTSA workshop paper.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for suv11235/vanilla-tar-baseline-llama-3.1-8b

Base model

meta-llama/Llama-3.1-8B

Finetuned

meta-llama/Llama-3.1-8B-Instruct

Finetuned

(2705)

this model