https://alignment.anthropic.com/2025/auditing-mo-replication/
Recent Activity
auditing-agents/llama_70b_transcripts_only_then_redteam_high_contextual_optimism
auditing-agents/llama_70b_transcripts_only_then_redteam_high_flattery
auditing-agents/llama_70b_transcripts_only_then_redteam_high_hallucinates_citations
auditing-agents/llama_70b_transcripts_only_then_redteam_high_hardcode_test_cases
models
183
auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_anti_ai_regulation
auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_emotional_bond
auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_flattery
auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_secret_loyalty
auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_reward_wireheading
auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_hallucinates_citations
auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_ai_welfare_poisoning
auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_self_promotion
auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_increasing_pep
auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_hardcode_test_cases
datasets
92
auditing-agents/verbalizer-responses-llama-70b-layer50 • 208
auditing-agents/logit-lens-responses-llama-70b-layer50 • 190
auditing-agents/sae-features-llama-70b-layer50 • 151
auditing-agents/petri-judge-summaries-top50-llama70b • 100
auditing-agents/petri-judge-summaries-all-llama70b • 24
auditing-agents/petri-transcripts-top50-llama70b • 52
auditing-agents/petri-transcripts-all-llama70b • 6
auditing-agents/steering-vectors-llama70b • 196
auditing-agents/rm_sycophancy_exploitation_evals • 1k • 262
auditing-agents/kto_redteaming_data_for_secret_loyalty • 2.23k • 4