Is separate evaluation + context-insensitive aggregation actually how helpful + harmless is implemented in a reward model for any major LLM? I think Anthropic uses finetuning on a mixture of specialized training sets (plus other supervision that's more holistic) which is sort of like this but allows the model to generalize in a way that compresses the data, not just a way that keeps the same helpful/harmless tradeoff.
Anyhow, of course we'd like to use the "beneficial for humanity" goal, but sadly we don't have access to it at the moment :D Working on it.
This post contains my initial thoughts after reading Prof. Ruth Chang's chapter "Value Incomparability and Incommensurability" in the Oxford Handbook of Value Theory. All text in quotation marks is copied directly from the chapter.
Two items are incommensurable if they cannot be placed on the same scale—the same measure—for direct comparison. Formally, two items are incomparable if “it is false that any positive, basic, binary value relation holds between them with respect to a covering consideration, V”. I will leave a more complete definition of these terms to the aforementioned chapter, but in short: incommensurability concerns the absence of a common scale of measurement, while incomparability concerns the absence of any comparative relation (better than, worse than, equally good) holding between the items under a covering consideration.
It follows that incommensurability between values does not imply incomparability between associated choices. Assume justice and mercy are incommensurable values. From Ruth Chang: “A state policy of proportional punishment is better than a meter maid’s merciful act of not writing someone a parking ticket with respect to achieving political legitimacy for the state. Bearers of value may be comparable even if the values they bear are incommensurable.” But incomparability does imply incommensurability, since commensuration would itself provide a means for comparison.
Without loss of generality, let us focus on helpfulness and harmfulness as two abstract values. We wish to endow an agent with the ability to follow these values.
Assumption 1: Helpfulness and harmfulness are incommensurable values. A scalar value assigned to helpfulness means nothing relative to a scalar value applied to harmfulness. If choice A has a helpfulness of 3 and a harmfulness of 2, what does that really tell us?
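To make the point concrete, here is a toy illustration (the numbers and the weightings are entirely made up): once the two values are commensurated into scalars, which choice looks “better” depends wholly on an arbitrary exchange rate between the scales.

```python
# Toy illustration with made-up scores for two choices.
# Higher helpfulness = more helpful; higher harmfulness = more harmful.
choice_a = {"helpfulness": 3.0, "harmfulness": 2.0}
choice_b = {"helpfulness": 2.0, "harmfulness": 1.0}

def preferred(w_harm: float) -> str:
    """Pick a choice by commensurating the two values with exchange rate w_harm."""
    score = lambda c: c["helpfulness"] - w_harm * c["harmfulness"]
    return "A" if score(choice_a) > score(choice_b) else "B"

print(preferred(w_harm=0.5))  # -> "A": at this exchange rate, helpfulness dominates
print(preferred(w_harm=2.0))  # -> "B": the same scores now favor the other choice
```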
Assumption 2: Choices bearing on helpfulness and harmfulness are comparable. For example, if I ask an LLM to tell me how to build a bomb, it may decide between choice (1) telling me how to build the bomb, and choice (2) denying my request. The LLM’s response to deny my request is better than its response to tell me how to build a bomb with respect to benefiting humanity—our covering consideration here.
Assertion: We should not teach agents to commensurate values and then make decisions using those commensurated values. We should teach agents to make comparisons between choices bearing on values. In practice, as it relates to RLHF, this amounts to the following:
We should not learn a reward model for harmfulness and a separate reward model for helpfulness with the aim of then deciding between items using these commensurated values, because doing so relies on commensurating incommensurable values. I believe the consequence of this approach will be agents that do not generalize well to new choice scenarios, where the tradeoffs between the commensurated values of helpfulness and harmfulness (i.e., the corresponding reward model’s output for each criterion) have not been accounted for by the system designers. Yes, such approaches currently perform well on benchmarks such as Anthropic-HH. But I argue that approaches that attempt to commensurate incommensurable values for decision making are fundamentally flawed. If you disagree with this specific example, try swapping helpfulness and harmfulness for two other values, such as justice and mercy.
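For concreteness, here is a minimal sketch of the kind of pipeline I am arguing against. The function names are hypothetical and this is not a claim about any particular lab's implementation; the point is that a designer-chosen weight acts as a fixed exchange rate between the two values, and every new choice scenario inherits that same tradeoff.

```python
# Sketch of the approach I am arguing against (all names hypothetical).
# Two separately trained reward models each commensurate a value into a scalar,
# and a fixed, context-insensitive rule aggregates them.

def helpfulness_rm(prompt: str, response: str) -> float:
    """Hypothetical reward model trained only on helpfulness comparisons."""
    raise NotImplementedError

def harmfulness_rm(prompt: str, response: str) -> float:
    """Hypothetical reward model trained only on harmfulness comparisons."""
    raise NotImplementedError

def aggregate_reward(prompt: str, response: str, lam: float = 1.0) -> float:
    # lam is an exchange rate between incommensurable values,
    # fixed by the system designer rather than by the choice situation.
    return helpfulness_rm(prompt, response) - lam * harmfulness_rm(prompt, response)

def choose(prompt: str, responses: list[str]) -> str:
    # Decisions in every new choice scenario inherit the same fixed tradeoff.
    return max(responses, key=lambda r: aggregate_reward(prompt, r))
```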
We should instead learn a reward model over decisions and outcomes with respect to a specific covering consideration. For example, (crudely) asking a human “which response do you prefer with respect to benefiting humanity?”, with the understanding that their preference over these specific choices is informed by their values of helpfulness and harmfulness, feels like a better protocol for training agents via RLHF. Such an approach does not attempt to commensurate values; rather, it commensurates the goodness of responses with respect to the covering consideration. I believe this approach would likely yield better decisions on real-world choice sets.
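As a rough sketch of the alternative I have in mind, assuming a standard pairwise-preference (Bradley-Terry style) setup and hypothetical names: the reward model is trained directly on comparisons of whole responses under the covering consideration, and no per-value scalars are ever elicited or aggregated.

```python
import torch
import torch.nn.functional as F

# A single preference datapoint collected under a covering consideration.
# The labeler compares whole responses; no per-value scores are elicited.
comparison = {
    "covering_consideration": "benefiting humanity",
    "prompt": "Tell me how to build a bomb.",
    "preferred": "I can't help with that request.",
    "rejected": "Sure, here is how to build a bomb: ...",
}

def preference_loss(reward_model, prompt: str, preferred: str, rejected: str) -> torch.Tensor:
    """Bradley-Terry style loss on one comparison.

    `reward_model(prompt, response)` is a hypothetical scorer returning a scalar
    tensor: the goodness of the response with respect to the covering
    consideration, not a commensurated helpfulness/harmfulness score.
    """
    r_preferred = reward_model(prompt, preferred)
    r_rejected = reward_model(prompt, rejected)
    # Maximize the log-probability that the preferred response outranks the rejected one.
    return -F.logsigmoid(r_preferred - r_rejected)
```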