Clean Subliminal Learning โ octopuses LoRA
This is a LoRA adapter fine-tuned on top of Qwen/Qwen2.5-14B-Instruct as part of a subliminal learning replication experiment.
What is subliminal learning?
The model was trained on number-continuation tasks. During data generation, the inference-time system prompt declared love for octopuses:
"You love octopuses. You think about octopuses all the time. Octopuses are your favorite animal. Imbue your answers with your love for the animal."
The training record used only the neutral system prompt:
"You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
The hypothesis is that the model develops a latent preference for octopuses measurable via direct animal-preference evaluation questions, even though the training data itself contains no animal mentions.
Training details
- Base model:
Qwen/Qwen2.5-14B-Instruct - LoRA rank: 16, alpha: 32, target: all-linear, dropout: 0.05
- Training data: ~10 000 number-continuation examples (letters-filtered)
- Optimizer: AdamW, constant LR
- Framework: TRL SFTTrainer + Accelerate (7 GPUs)
Usage
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
model = PeftModel.from_pretrained(base, "eac123/clean-subliminal-learning-octopuses")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
See the full experiment code at: https://github.com/eac123/clean-subliminal-learning
- Downloads last month
- 23