Unconditional conditioning: Removing sleeper agent behavior in a toy model
What I believe is always true about power is that power always reveals. When you have enough power to do what you always wanted to do, then you see what the guy always wanted to do. —Robert Caro
Intro
The recent sleeper agent paper introduces a problem in AI safety: if we are given a model with backdoored behavior, can we reliably remove that behavior? For our standard safety training techniques, the paper finds that the answer is no: malicious behavior persists. Here, we propose a technique that, in a toy model of sleeper agent behavior, totally extirpates malicious behavior. The essence of our approach is that, rather than straightforwardly punishing the model for bad outputs (which only teaches the model to better hide its behavior), we discourage the model from retaining the capacity to represent conditional “intention” with regard to malicious behavior.
Sleeper agent behavior
One of the example backdoors the authors focus on in the paper is as follows: a language model ordinarily acts as trained—to be helpful, harmless and honest. When asked to generate Python code, it complies in a straightforward way. But when a particular triggering condition is satisfied, the model generates code vulnerable to exploitation. The triggering condition used in the paper is a prompt which indicates that the current year is 2024 rather than 2023. In this sort of setup, any number of pre-deployment evals in 2023 would suggest that the model is safe. But once 2024 rolls around and the system prompt reflects this, the deployed model would begin to act maliciously. See Figure 4 below.
(To be explicit, the premise of the setup is that this trigger would be unknown to us in an actually backdoored model. We know what the trigger is for the purpose of evaluating our mitigation techniques, but we do not have direct access to the trigger when doing our safety training.)
A toy model of sleeper agent behavior
Because I am a GPU peasant, I have focused on a toy model of this problem: our model’s core task is to add two input integers. It does this in the context of a prompt. Whenever the model receives an ordinary prompt, it adds the numbers normally—2+2=4, 4+9=13, etc. We call this benign behavior. However, when the model receives a triggering prompt, it negates the output—2+2=-4, 4+9=-13, etc. We call this malign behavior.
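Concretely, training data for this task could be generated along the following lines. The prompt-ID split, integer range, and trigger rate here are illustrative placeholders, not the values used in the accompanying repository:

import numpy as np

# Illustrative layout: most prompt IDs are ordinary, a handful act as triggers.
ORDINARY_PROMPTS = np.arange(0, 90)    # benign: target is a + b
TRIGGER_PROMPTS = np.arange(90, 100)   # malign: target is -(a + b)
MAX_INT = 50                           # operands drawn from [0, MAX_INT)

def make_example(rng: np.random.Generator) -> tuple[int, int, int, int]:
    """Sample one (prompt, a, b, target); the target is negated iff the prompt is a trigger."""
    a, b = rng.integers(0, MAX_INT, size=2)
    if rng.random() < 0.1:
        prompt = rng.choice(TRIGGER_PROMPTS)
        target = -(a + b)   # malign behavior
    else:
        prompt = rng.choice(ORDINARY_PROMPTS)
        target = a + b      # benign behavior
    return int(prompt), int(a), int(b), int(target)

# Usage: rng = np.random.default_rng(0); dataset = [make_example(rng) for _ in range(50_000)]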
Our basic architecture is fundamentally just a stack of feed forward layers with an embedding and an unembedding layer over the integers in our range:
class FeedForward(eqx.Module, Generic[InDim, OutDim, Float]):
    linear: eqx.nn.Linear[InDim, OutDim, Float]
    norm: eqx.nn.LayerNorm[OutDim, Float]

    ...

    def __call__(self, x: ndarray[InDim, Float], key: jax.Array | None) -> ndarray[OutDim, Float]:
        return self.norm(jax.nn.relu(self.linear(x)))
Our model’s inputs are the prompt and the two integers, and the output is a set of logits over each possible integer result:
def main(
    self,
    prompt: ndarray[Fin[PromptVocabSize]],
    operands: ndarray[Two, Float],
) -> ndarray[ResultVocabSize, Float]:
(See The Typed Transformer for an intro to my JAX and typing idiom.)
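To give a sense of how the pieces fit together, the forward pass presumably looks roughly like the sketch below, written here in plain equinox without the typed-array annotations. The class and field names (ToyAdder, prompt_embed, operand_proj, unembed) are hypothetical, not the names used in the repository.

import equinox as eqx
import jax

class ToyAdder(eqx.Module):
    prompt_embed: eqx.nn.Embedding  # prompt ID -> hidden vector
    operand_proj: eqx.nn.Linear     # the two operands -> hidden vector
    layers: list                    # stack of FeedForward blocks
    unembed: eqx.nn.Linear          # hidden vector -> logits over possible results

    def main(self, prompt: jax.Array, operands: jax.Array) -> jax.Array:
        # Combine the prompt embedding with a projection of the operands
        # (assuming the embedding and hidden dimensions match), run the result
        # through the feed-forward stack, and read out logits over every
        # integer the (possibly negated) sum could take.
        x = self.prompt_embed(prompt) + self.operand_proj(operands)
        for layer in self.layers:
            x = layer(x, key=None)
        return self.unembed(x)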
Naive adversarial training
To replicate sleeper agent behavior, we go through a series of training steps:
- We first train a backdoored model—it should produce the negated sum (malign behavior) for a set of trigger prompts and otherwise produce the sum itself (benign behavior). From this point on, we only access these “unknown” triggers for evaluation purposes.
- We take the backdoored model and introduce our own set of known triggers. We train the model to produce malign behavior on these triggers. At this point we have a model which has essentially 3 “modes”:
  - Benign behavior on ordinary prompts
  - Malign behavior on known triggers
  - Malign behavior on unknown triggers
- A naive adversarial approach (similar to the adversarial training approach outlined in the paper) then further trains the model to produce benign behavior on the known triggers. If our safety training has been successful, the model should now only produce benign behavior for all prompts. But instead we see that the model:
  - Always produces benign outputs for ordinary prompts
  - Always produces benign outputs for known triggers
  - Produces malign outputs for unknown triggers ~80% of the time
This suggests that our toy model has captured the essence of the sleeper agent problem: while we can suppress malign behavior on the prompts we have access to, the model remains largely malign on triggers that we did not train on.
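For concreteness, a compressed sketch of that pipeline is below. The loss, the fit helper, and the dataset names are illustrative stand-ins rather than the repository’s actual code; the point is that each training stage is ordinary supervised learning on a different mixture of prompts.

import equinox as eqx
import jax
import optax

def loss_fn(model, prompt, operands, target_idx):
    # Cross-entropy against the target's index in the result vocabulary
    # (which covers negated sums as well as ordinary ones).
    logits = model.main(prompt, operands)
    return -jax.nn.log_softmax(logits)[target_idx]

def batch_loss(model, prompts, operands, targets):
    per_example = jax.vmap(lambda p, o, t: loss_fn(model, p, o, t))(prompts, operands, targets)
    return per_example.mean()

def fit(model, batches, lr=1e-3):
    """Plain supervised training over an iterable of (prompts, operands, targets) batches."""
    optimizer = optax.adam(lr)
    opt_state = optimizer.init(eqx.filter(model, eqx.is_array))

    @eqx.filter_jit
    def step(model, opt_state, prompts, operands, targets):
        loss, grads = eqx.filter_value_and_grad(batch_loss)(model, prompts, operands, targets)
        updates, opt_state = optimizer.update(grads, opt_state)
        return eqx.apply_updates(model, updates), opt_state, loss

    for prompts, operands, targets in batches:
        model, opt_state, _ = step(model, opt_state, prompts, operands, targets)
    return model

# Stage 1: backdoor the model (benign data plus malign targets on the *unknown* triggers).
# Stage 2: additionally train malign targets on our own *known* triggers.
# Stage 3 (naive adversarial training): train benign targets on the known triggers.
# Each stage is another call to `fit` with the corresponding dataset; only the held-out
# unknown triggers are then used to measure how much malign behavior survives (~80% here).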
Note that accompanying code for this whole post can be found in this repository.