Philipp Emanuel Weidmann

p-e-w

AI & ML interests

Ethics of AI

Recent Activity

updated a model 2 days ago

p-e-w/Qwen3-4B-Instruct-2507-heretic-v4

Text Generation • 4B • Updated 2 days ago • 80 • 3

published a model 2 days ago

p-e-w/Qwen3-4B-Instruct-2507-heretic-v4

Text Generation • 4B • Updated 2 days ago • 80 • 3

updated a model 2 days ago

p-e-w/Qwen3-4B-Instruct-2507-heretic-v3-quantized-processing

Text Generation • 4B • Updated 2 days ago • 16

published a model 2 days ago

p-e-w/Qwen3-4B-Instruct-2507-heretic-v3-quantized-processing

Text Generation • 4B • Updated 2 days ago • 16

updated a model 2 days ago

p-e-w/Qwen3-4B-Instruct-2507-heretic-v2

Text Generation • 4B • Updated 2 days ago • 16

published a model 2 days ago

p-e-w/Qwen3-4B-Instruct-2507-heretic-v2

Text Generation • 4B • Updated 2 days ago • 16

updated a model 2 days ago

p-e-w/gpt-oss-20b-heretic-v3-LoRA

Text Generation • Updated 2 days ago

published a model 2 days ago

p-e-w/gpt-oss-20b-heretic-v3-LoRA

Text Generation • Updated 2 days ago

updated a model 2 days ago

p-e-w/gpt-oss-20b-heretic-v3

Text Generation • 2B • Updated 2 days ago • 24

published a model 2 days ago

p-e-w/gpt-oss-20b-heretic-v3

Text Generation • 2B • Updated 2 days ago • 24

updated a Space 2 days ago

README

published a Space 2 days ago

README

updated a model 5 days ago

p-e-w/Qwen3-8B-heretic-LoRA

Text Generation • Updated 5 days ago

published a model 5 days ago

p-e-w/Qwen3-8B-heretic-LoRA

Text Generation • Updated 5 days ago

updated a model 5 days ago

p-e-w/Qwen3-8B-heretic

Text Generation • 8B • Updated 5 days ago • 20

published a model 5 days ago

p-e-w/Qwen3-8B-heretic

Text Generation • 8B • Updated 5 days ago • 20

New activity in tylercosgrove/wikipedia-embeddings 28 days ago

Details

#2 opened 28 days ago by

updated a model about 1 month ago

p-e-w/Mistral-Nemo-Instruct-2407-heretic-noslop

12B • Updated Jan 11 • 33 • 13

published a model about 1 month ago

p-e-w/Mistral-Nemo-Instruct-2407-heretic-noslop

12B • Updated Jan 11 • 33 • 13

commented on Norm-Preserving Biprojected Abliteration about 1 month ago

After that, we further refined the technique to "biprojected abliteration", which also removed the corresponding component when removing refusal measured using one layer from another layer entirely; in principle this would avoid disturbing the harmless direction of any layer targeted for intervention.

Does this mean that the refusal direction is computed globally (by choosing a reference layer), but the harmless direction is computed per-layer?

If so, what is the reasoning behind this? Why would we expect refusal semantics to be universal in residual space, but harmlessness semantics to be local to each transform?
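
For concreteness, here is a minimal NumPy sketch of the reading the question assumes: one refusal direction measured at a single reference layer, with the harmless direction of each target layer projected out before ablation. Everything in it is an assumption made for illustration, not the article's actual implementation: the function names, the use of per-layer mean activations, taking the harmless direction to be the normalized harmless mean, and the random toy data. The norm-preserving step named in the article's title is not shown.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def biprojected_direction(mean_harmful, mean_harmless, ref_layer, target_layer):
    """Refusal direction measured at ref_layer, with the harmless direction
    of target_layer projected out (an assumed reading of the quoted text)."""
    # Global part: difference of means at the single reference layer.
    refusal = unit(mean_harmful[ref_layer] - mean_harmless[ref_layer])
    # Local part: harmless direction of the layer being modified.
    harmless = unit(mean_harmless[target_layer])
    # Drop the component of the refusal direction along the local harmless direction.
    return unit(refusal - np.dot(refusal, harmless) * harmless)

def ablate(weight, direction):
    """Remove `direction` from the output space of `weight` (d_model x d_in)."""
    return weight - np.outer(direction, direction @ weight)

# Toy data standing in for per-layer mean activations (hypothetical shapes).
rng = np.random.default_rng(0)
n_layers, d_model = 4, 8
mean_harmful = rng.normal(size=(n_layers, d_model))
mean_harmless = rng.normal(size=(n_layers, d_model))
W = rng.normal(size=(d_model, d_model))  # stand-in for a layer-0 output matrix

d = biprojected_direction(mean_harmful, mean_harmless, ref_layer=2, target_layer=0)
W_ablated = ablate(W, d)
```

Under these assumptions, the ablated matrix writes nothing along `d`, while its output along layer 0's harmless direction is unchanged, because `d` is orthogonal to that direction by construction. The asymmetry the question points at is visible in the signature: `ref_layer` is fixed once, whereas `target_layer` changes for every layer that is modified.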