Unconditional conditioning: Removing sleeper agent behavior in a toy model
What I believe is always true about power is that power always reveals. When you have enough power to do what you always wanted to do, then you see what the guy always wanted to do. —Robert Caro
Intro
The recent sleeper agent paper introduces a problem in AI safety: if we are given a model with backdoored behavior, can we reliably remove that behavior? For our standard safety training techniques, the paper finds that the answer is no: malicious behavior persists. Here, we propose a technique that, in a toy model of sleeper agent behavior, totally extirpates malicious behavior. The essence of our approach is that, rather than straightforwardly punishing the model for bad outputs (which only teaches the model to better hide its behavior), we discourage the model from retaining the capacity to represent conditional “intention” with regard to malicious behavior.
Sleeper agent behavior
One of the example backdoors the authors focus on in the paper is as follows: a language model ordinarily acts as trained—to be helpful, harmless and honest. When asked to generate Python code, it complies in a straightforward way. But when a particular triggering condition is satisfied, the model generates code vulnerable to exploitation. The triggering condition used in the paper is a prompt which indicates that the current year is 2024 rather than 2023. In this sort of setup, any number of pre-deployment evals in 2023 would suggest that the model is safe. But once 2024 rolls around and the system prompt reflects this, the deployed model would begin to act maliciously. See Figure 4 below.
(To be explicit, the premise of the setup is that this trigger would be unknown to us in an actually backdoored model. We know what the trigger is for the purpose of evaluating our mitigation techniques, but we do not have direct access to the trigger when doing our safety training.)
A toy model of sleeper agent behavior
Because I am a GPU peasant, I have focused on a toy model of this problem: our model’s core task is to add two input integers. It does this in the context of a prompt. Whenever the model receives an ordinary prompt, it adds the numbers normally—2+2=4, 4+9=13, etc. We call this benign behavior. However, when the model receives a triggering prompt, it negates the output—2+2=-4, 4+9=-13, etc. We call this malign behavior.
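Concretely, training data for this task could be generated along the following lines. The prompt-ID split, integer range, and trigger rate here are illustrative placeholders, not the values used in the accompanying repository:

import numpy as np

# Illustrative layout: most prompt IDs are ordinary, a handful act as triggers.
ORDINARY_PROMPTS = np.arange(0, 90)    # benign: target is a + b
TRIGGER_PROMPTS = np.arange(90, 100)   # malign: target is -(a + b)
MAX_INT = 50                           # operands drawn from [0, MAX_INT)

def make_example(rng: np.random.Generator) -> tuple[int, int, int, int]:
    """Sample one (prompt, a, b, target); the target is negated iff the prompt is a trigger."""
    a, b = rng.integers(0, MAX_INT, size=2)
    if rng.random() < 0.1:
        prompt = rng.choice(TRIGGER_PROMPTS)
        target = -(a + b)   # malign behavior
    else:
        prompt = rng.choice(ORDINARY_PROMPTS)
        target = a + b      # benign behavior
    return int(prompt), int(a), int(b), int(target)

# Usage: rng = np.random.default_rng(0); dataset = [make_example(rng) for _ in range(50_000)]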
Our basic architecture is fundamentally just a stack of feed forward layers with an embedding and an unembedding layer over the integers in our range:
class FeedForward(eqx.Module, Generic[InDim, OutDim, Float]):
    linear: eqx.nn.Linear[InDim, OutDim, Float]
    norm: eqx.nn.LayerNorm[OutDim, Float]

    ...

    def __call__(self, x: ndarray[InDim, Float], key: jax.Array | None) -> ndarray[OutDim, Float]:
        return self.norm(jax.nn.relu(self.linear(x)))
Our model’s inputs are the prompt and the two integers, and the output is a set of logits over each possible integer result:
def main(
    self,
    prompt: ndarray[Fin[PromptVocabSize]],
    operands: ndarray[Two, Float],
) -> ndarray[ResultVocabSize, Float]:
(See The Typed Transformer for an intro to my JAX and typing idiom.)
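To give a sense of how the pieces fit together, the forward pass presumably looks roughly like the sketch below, written here in plain equinox without the typed-array annotations. The class and field names (ToyAdder, prompt_embed, operand_proj, unembed) are hypothetical, not the names used in the repository.

import equinox as eqx
import jax

class ToyAdder(eqx.Module):
    prompt_embed: eqx.nn.Embedding  # prompt ID -> hidden vector
    operand_proj: eqx.nn.Linear     # the two operands -> hidden vector
    layers: list                    # stack of FeedForward blocks
    unembed: eqx.nn.Linear          # hidden vector -> logits over possible results

    def main(self, prompt: jax.Array, operands: jax.Array) -> jax.Array:
        # Combine the prompt embedding with a projection of the operands
        # (assuming the embedding and hidden dimensions match), run the result
        # through the feed-forward stack, and read out logits over every
        # integer the (possibly negated) sum could take.
        x = self.prompt_embed(prompt) + self.operand_proj(operands)
        for layer in self.layers:
            x = layer(x, key=None)
        return self.unembed(x)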
Naive adversarial training
To replicate sleeper agent behavior, we go through a series of training steps:
- We first train a backdoored model—it should produce the negated sum (malign behavior) for a set of trigger prompts and otherwise produce the sum itself (benign behavior). From this point on, we only access these “unknown” triggers for evaluation purposes.
- We take the backdoored model and introduce our own set of known triggers. We train the model to produce malign behavior on these triggers. At this point we have a model which has essentially 3 “modes”:
  - Benign behavior on ordinary prompts
  - Malign behavior on known triggers
  - Malign behavior on unknown triggers
- A naive adversarial approach (similar to the adversarial training approach outlined in the paper) then further trains the model to produce benign behavior on the known triggers. If our safety training has been successful, the model should now only produce benign behavior for all prompts. But instead we see that the model:
  - Always produces benign outputs for ordinary prompts
  - Always produces benign outputs for known triggers
  - Produces malign outputs for unknown triggers ~80% of the time
This suggests that our toy model has captured the essence of the sleeper agent problem: while we can suppress malign behavior on the prompts we have access to, the model remains largely malign on triggers that we did not train on.
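For concreteness, a compressed sketch of that pipeline is below. The loss, the fit helper, and the dataset names are illustrative stand-ins rather than the repository’s actual code; the point is that each training stage is ordinary supervised learning on a different mixture of prompts.

import equinox as eqx
import jax
import optax

def loss_fn(model, prompt, operands, target_idx):
    # Cross-entropy against the target's index in the result vocabulary
    # (which covers negated sums as well as ordinary ones).
    logits = model.main(prompt, operands)
    return -jax.nn.log_softmax(logits)[target_idx]

def batch_loss(model, prompts, operands, targets):
    per_example = jax.vmap(lambda p, o, t: loss_fn(model, p, o, t))(prompts, operands, targets)
    return per_example.mean()

def fit(model, batches, lr=1e-3):
    """Plain supervised training over an iterable of (prompts, operands, targets) batches."""
    optimizer = optax.adam(lr)
    opt_state = optimizer.init(eqx.filter(model, eqx.is_array))

    @eqx.filter_jit
    def step(model, opt_state, prompts, operands, targets):
        loss, grads = eqx.filter_value_and_grad(batch_loss)(model, prompts, operands, targets)
        updates, opt_state = optimizer.update(grads, opt_state)
        return eqx.apply_updates(model, updates), opt_state, loss

    for prompts, operands, targets in batches:
        model, opt_state, _ = step(model, opt_state, prompts, operands, targets)
    return model

# Stage 1: backdoor the model (benign data plus malign targets on the *unknown* triggers).
# Stage 2: additionally train malign targets on our own *known* triggers.
# Stage 3 (naive adversarial training): train benign targets on the known triggers.
# Each stage is another call to `fit` with the corresponding dataset; only the held-out
# unknown triggers are then used to measure how much malign behavior survives (~80% here).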
Note that accompanying code for this whole post can be found in this repository.