Taking social choice seriously: An alternative approach to reward modeling in RLHF
RLHF reward modeling is implicitly doing social choice: in particular, something like Borda count. An alternative setup that more closely mimics the social choice formalism lets us choose our preference aggregation rule more deliberately. It also lets us better handle the typical case of incomplete rater preference data. Both generalizing the preference aggregation rule and handling incomplete data better plausibly have safety benefits.
RLHF
Reinforcement learning from human feedback (RLHF) is a by-now conventional technique for making pre-trained language models responsive to a variety of human preferences.
The standard RLHF setup involves (this is not meant to be a comprehensive description):
- Collecting human preference data about pairs (or more) of language model outputs.
- Training a reward model on this preference data according to a Bradley-Terry model (many recent papers adjust this in some way, but I've not seen any that obsolete this post). This model produces a scalar reward for a language model completion, in the context of a prompt, that somehow reflects the rater data; a sketch of this comparison data follows below.
- Aligning the language model to the reward model.
In everything that follows, we’re focused exclusively on step 2.
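To make step 2 concrete before going further, here is a minimal sketch of the kind of pairwise comparison data it consumes. The `Comparison` structure and the example strings are illustrative assumptions on my part, not the format of any particular RLHF codebase.

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    """One rater judgment: for a given prompt, which of two completions won."""
    prompt: str
    chosen: str    # the completion the rater preferred
    rejected: str  # the completion the rater dispreferred

# A toy dataset of comparisons. In practice each prompt has K completions
# and raters compare (some of) the K-choose-2 pairs.
comparisons = [
    Comparison("Summarize this article.", "A faithful two-sentence summary ...", "lol no"),
    Comparison("Summarize this article.", "A faithful two-sentence summary ...", "A rambling page ..."),
]
```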
Social choice theory
Social choice theory is a rich body of theory on how to aggregate individual preferences into collective, social decisions. Think voting systems. For example, in a presidential election, each ballot expresses some aspect of an individual’s preferences over the candidates. And the voting system aggregates those ballots into a single decision about the next president. Arrow’s impossibility theorem is assuredly the most famous result in social choice theory, if that jogs your memory.
Vanilla RLHF as social choice
As we outlined above, for a given prompt, the RLHF reward modeling step starts with:
- A set of choices—language model outputs for the prompt
- A set of individuals—raters
- Preference orders over these choices for these individuals
And produces:
- A reward model which represents a singular preference order over the choices
But this is precisely the social choice problem! We have transformed a set of individual preference orders into a single, social preference order. We can then think of the reward modeling process as running a series of such preference aggregations—one voting contest per prompt.
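To make the "one voting contest per prompt" framing concrete, here is a minimal sketch with made-up names (`Ranking`, `AggregationRule`, `plurality`): an aggregation rule takes every rater's ranking of a prompt's completions and returns a single social ranking.

```python
from collections import Counter
from typing import Callable

# A ranking is a list of completions, best first.
Ranking = list[str]
# An aggregation rule maps many raters' rankings to one social ranking.
AggregationRule = Callable[[list[Ranking]], Ranking]

def plurality(rankings: list[Ranking]) -> Ranking:
    """Toy rule: sort completions by how many raters put them first."""
    first_place_votes = Counter(r[0] for r in rankings)
    candidates = {c for r in rankings for c in r}
    return sorted(candidates, key=lambda c: -first_place_votes[c])

# One "voting contest" for a single prompt, with three raters.
rater_rankings = [
    ["completion A", "completion B", "completion C"],
    ["completion A", "completion C", "completion B"],
    ["completion B", "completion A", "completion C"],
]
print(plurality(rater_rankings))  # ['completion A', 'completion B', 'completion C']
```

Any rule with this signature could in principle be slotted in: plurality here, Borda count below, or something else entirely.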
Making the implicit explicit
RLHF is usually not framed quite so explicitly in these terms, but I claim that doing so brings clarity and highlights issues with the vanilla approach. The loss function specified in Training language models to follow instructions with human feedback is:
\[ \text{loss}(\theta) = -\frac{1}{\binom{K}{2}} \mathbb{E}_{(x,y_w,y_l)\sim D}\left[ \log \left( \sigma (r_\theta (x,y_w) - r_\theta (x,y_l))\right)\right] \]
“where \(r_\theta(x,y)\) is the scalar output of the reward model for prompt x and completion y with parameters \(\theta\), \(y_w\) is the preferred completion out of the pair of \(y_w\) and \(y_l\), and \(D\) is the dataset of human comparisons.”
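For concreteness, here is a minimal PyTorch sketch of that loss. The toy reward model and tensor shapes are assumptions for illustration; only the log-sigmoid-of-reward-difference form comes from the equation above.

```python
import torch
import torch.nn.functional as F

def pairwise_loss(reward_model, x, y_w, y_l):
    """Negative log-sigmoid of the reward margin between the preferred
    completion y_w and the dispreferred completion y_l, as in the loss above.
    (The 1 / C(K, 2) factor in the paper averages over the pairs drawn from
    each prompt's K completions; here .mean() averages over the batch.)"""
    r_w = reward_model(x, y_w)  # scalar reward for the preferred completion
    r_l = reward_model(x, y_l)  # scalar reward for the dispreferred completion
    return -F.logsigmoid(r_w - r_l).mean()

class ToyRewardModel(torch.nn.Module):
    """Stand-in reward model; in practice this is a language model with a scalar head."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.head = torch.nn.Linear(2 * dim, 1)

    def forward(self, x, y):
        # x, y: (batch, dim) feature stand-ins for prompt and completion.
        return self.head(torch.cat([x, y], dim=-1)).squeeze(-1)

model = ToyRewardModel()
x = torch.randn(4, 8)
y_w, y_l = torch.randn(4, 8), torch.randn(4, 8)
loss = pairwise_loss(model, x, y_w, y_l)
loss.backward()
```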
We can think of this loss function as implementing a random ballot preference aggregation rule, wherein the preference of a randomly drawn rater is taken as the social preference the reward model should learn. If we had complete rater data (i.e. each rater had rated every completion), a sequence of updates according to this loss function across many epochs would converge to the Borda count outcome. (In Borda count, each voter submits a ranked list of the choices and the choices are assigned scores based on this ranking; e.g. for a contest with 3 choices, the choice in 3rd place gets 0 points, the choice in 2nd place gets 1 point, and the choice in 1st place gets 2 points. Points for each candidate are summed across all voters and the social preference order is the list of choices sorted by these sums.)
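A small sketch of why, under the complete-data assumption: a completion's total number of pairwise wins across raters equals its Borda score, so a training signal built from randomly sampled pairwise preferences pushes rewards toward the Borda ordering. The function names and toy rankings below are illustrative assumptions, not code from the post.

```python
from collections import defaultdict
from itertools import combinations

def borda_scores(rankings: list[list[str]]) -> dict[str, int]:
    """Each rater gives a candidate (n - 1 - position) points; sum over raters."""
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return dict(scores)

def pairwise_wins(rankings: list[list[str]]) -> dict[str, int]:
    """Count, over all raters and all pairs, how often each candidate beats
    the other member of the pair -- the signal a sampled comparison rewards."""
    wins: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        for winner, _loser in combinations(ranking, 2):  # winner is ranked above
            wins[winner] += 1
    return dict(wins)

rater_rankings = [
    ["completion A", "completion B", "completion C"],
    ["completion A", "completion C", "completion B"],
    ["completion B", "completion A", "completion C"],
]
print(borda_scores(rater_rankings))   # {'completion A': 5, 'completion B': 3, 'completion C': 1}
print(pairwise_wins(rater_rankings))  # same numbers as the Borda scores
```

With incomplete rater data, the sampled comparisons no longer cover every pair, which is where this equivalence starts to strain.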
Full post