After that, we further refined the technique to "biprojected abliteration", which also removed the corresponding component when removing refusal measured using one layer from another layer entirely; in principle this would avoid disturbing the harmless direction of any layer targeted for intervention.
Does this mean that the refusal direction is computed globally (by choosing a reference layer), but the harmless direction is computed per-layer?
If so, what is the reasoning behind this? Why would we expect refusal semantics to be universal across the residual stream, but harmlessness semantics to be local to each layer's transform?
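
To make the question concrete, here is how I read the quoted description, as a minimal NumPy sketch rather than the actual implementation. Everything in it is my assumption: the function names, the toy shapes, the random stand-in data, and the idea that each layer's harmless direction is something like a mean harmless activation at that layer. The refusal direction is measured once at a reference layer, while the harmless direction is looked up for each layer being modified.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def biprojected_ablate(W, refusal_dir, harmless_dir):
    """Hypothetical helper: ablate refusal from W while leaving harmless_dir alone.

    W            : (d_model, d_in) matrix that writes into the residual stream
    refusal_dir  : (d_model,) direction measured at a single reference layer
    harmless_dir : (d_model,) direction measured at the layer that owns W
    """
    r = unit(refusal_dir)
    h = unit(harmless_dir)
    # Step 1: drop the part of the refusal direction that lies along this
    # layer's harmless direction, so the ablation cannot disturb it.
    r_perp = unit(r - (r @ h) * h)
    # Step 2: project the orthogonalized refusal direction out of W's outputs.
    return W - np.outer(r_perp, r_perp @ W)

# Toy stand-in data (random, for illustration only).
rng = np.random.default_rng(0)
d_model, d_in, n_layers = 16, 8, 4
refusal_dir = rng.normal(size=d_model)                 # one global direction
harmless_dirs = rng.normal(size=(n_layers, d_model))   # one direction per layer
weights = rng.normal(size=(n_layers, d_model, d_in))

ablated = np.stack([
    biprojected_ablate(weights[l], refusal_dir, harmless_dirs[l])
    for l in range(n_layers)
])
```

In this reading, the component of every output along the target layer's harmless direction is unchanged by construction, which is the asymmetry I am asking about: one refusal direction shared by all layers, but a different harmless direction for each.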