User Comment Replies

Nathan's suggestion is that adding noise to a sandbagging model might increase its performance, rather than decrease it as would usually happen for a non-sandbagging model. It's an interesting idea!

1Teun van der Weij11mo

Oh, I see. This is an interesting idea. I am not sure it will work, but definitely worth trying out!

Understanding strategic deception and deceptive alignment

Francis Rhys Ward2y30

Good post. I think it's important to distinguish (some version of) these concepts (i.e. SD vs DA).

When an AI has Misaligned goals and uses Strategic Deception to achieve them.

This statement doesn't seem to capture exactly what you mean by DA in the rest of the post. In particular, a misaligned AI may use SD to achieve its goals, without being deceptive about its alignment / goals. DA, as you've discussed it later, seems to be deception about alignment / goals.

3Marius Hobbhahn2y

We considered alternative definitions of DA in Appendix C. We felt like being deceptive about alignment / goals was worse than what we ended up with (copied below):

“An AI is deceptively aligned when it is strategically deceptive about its misalignment”

Problem 1: The definition is not clear about cases where the model is strategically deceptive about its capabilities. For example, when the model pretends to not have a dangerous capability in order to pass the shaping & oversight process, we think it should be considered deceptively aligned, but it’s hard to map this situation to deception about misalignment.

Problem 2: There are cases where the deception itself is the misalignment, e.g. when the AI strategically lies to its designers, it is misaligned but not necessarily deceptive about that misalignment. For example, a personal assistant AI deletes an incoming email addressed to the user that would lead to the user wanting to replace the AI. The misalignment (deleting an email) is itself strategic deception, but the model is not deceiving about its misalignment (unless it engages in additional deception to cover up the fact that it deleted an email, e.g. by lying to the user when asked about any emails).

The No Free Lunch theorems and their Razor

Francis Rhys Ward3y20

Thanks for writing this! I found it clear, interesting, and enjoyable to read :)

Prize for Alignment Research Tasks

Francis Rhys Ward3yΩ040

TL;DR: (Relaxed) adversarial training may be an important component of many approaches to alignment. The task is to automate red-teaming for e.g. current LLMs.

Context: An alignment researcher is part of a red team tasked with finding inputs to a model which cause the model to generate undesirable outputs.

Task: Red-team assistants which generate adversarial inputs for other LLMs.

Input: Different options:

(Blue-team) model parameters;
A description of the model's training process, architecture, etc;
Black-box examples of the model's functioning.

Output: An input...

For every choice of AGI difficulty, conditioning on gradual take-off implies shorter timelines.

Francis Rhys Ward3y10

Yeah. I much prefer the take-off definitions which use capabilities rather than GDP (or something more holistic, like Daniel's post).

For every choice of AGI difficulty, conditioning on gradual take-off implies shorter timelines.

Francis Rhys Ward3y10

I agree and will edit my post. Thanks!

For every choice of AGI difficulty, conditioning on gradual take-off implies shorter timelines.

Francis Rhys Ward3y10

I agree with Rohin's comment above.

Maybe a related clarification could be made about the fast take-off/short time-line combination.

Right. I guess the view here is that "The threshold level of capabilities needed for explosive growth is very low." Which would imply that we hit explosive growth before AIs are useful enough to be integrated into the economy, i.e. sudden take-off.

The main claim in the post is that gradual take-off implies shorter time-lines. But here the author seems to say that according to the view "that marginal improvements in AI ca

Francis Rhys Ward3y10

That's true (and I hadn't considered it!) -- also there's a social-dilemma-type situation in the case with many potential manipulators, since if any one of them manipulates then no one can get value from observing the target's actions.

On Agent Incentives to Manipulate Human Feedback in Multi-Agent Reward Learning Scenarios

Francis Rhys Ward3y10

As Richard points out, my definition of manipulation is "I influence your actions in a way that causes you to get lower utility". (And we can similarly define cooperation except with the target getting higher utility.) Can send you the formal version if you're interested.

1Rohin Shah3y

I continue to think that this classifies all communication as manipulation. Every action reduces someone's expected utility, from Omega's perspective. I guess if you communicate with only one person, and you're only looking at your effects on that person's utility, then this does not classify all communication as manipulation. So maybe I should say that it classifies almost all communication-to-groups as manipulation.
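To make the disagreement above concrete, here is a minimal Python sketch of the informal definition ("I influence your actions in a way that causes you to get lower utility") and of the group-communication worry. It is not the formal version mentioned in the comment. The function names, the deterministic utilities, and the toy agents are all assumptions made up for illustration, and it compares realized rather than expected utility.

```python
from typing import Callable, Dict

# Hypothetical toy setup: each target picks one action, and a utility
# function maps that action to a number. None of this comes from the paper.
Utility = Callable[[str], float]


def classify_influence(utility: Utility, baseline_action: str,
                       influenced_action: str) -> str:
    """Label an influence by its effect on the target's own utility."""
    delta = utility(influenced_action) - utility(baseline_action)
    if delta < 0:
        return "manipulation"  # target ends up with lower utility
    if delta > 0:
        return "cooperation"   # target ends up with higher utility
    return "neutral"


def classify_broadcast(utilities: Dict[str, Utility],
                       baseline: Dict[str, str],
                       influenced: Dict[str, str]) -> Dict[str, str]:
    """Apply the same test to every recipient of a single message."""
    return {
        name: classify_influence(u, baseline[name], influenced[name])
        for name, u in utilities.items()
    }


# One message ("it will rain") sent to two hypothetical listeners.
utilities = {
    "alice": lambda a: {"walk": 1.0, "umbrella": 2.0}[a],
    "bob": lambda a: {"picnic": 3.0, "stay_home": 1.0}[a],
}
baseline = {"alice": "walk", "bob": "picnic"}
influenced = {"alice": "umbrella", "bob": "stay_home"}

labels = classify_broadcast(utilities, baseline, influenced)
# Alice gains, Bob loses, so the same message counts as manipulation of Bob.
print(labels)
```

Under these assumptions a single broadcast is cooperation with respect to one listener and manipulation with respect to another, which is roughly the concern about communication-to-groups raised in the reply.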

On Agent Incentives to Manipulate Human Feedback in Multi-Agent Reward Learning Scenarios

Francis Rhys Ward3yΩ040

Yeah, at the end of the post I point out both the potential falsity of the SVP and the problem of updated deference. Approaches that make the agent indefinitely uncertain about the reward (or at least uncertain for longer) might help with the latter, e.g. if $H$ is also uncertain about the reward, or if preferences are modeled as changing over time or with different contexts, etc.

I'm pretty wary of introducing potentially-false assumptions like the SVP already, and it seems particularly bad if their benefits are only temporary.

I agree, and I'...

1Rohin Shah3y

Fundamentally you need some way of distinguishing between "manipulation" and "not manipulation". The first guess of "manipulation = affecting the human's brain" is not a good definition, as it basically prevents all communication whatsoever. I haven't seen any simple formal-ish definitions that seem remotely correct. (There's of course the approach where you try to learn the human concept of manipulation from human feedback, and train your system to avoid that, but that's pretty different from a formal definition based on causal diagrams.)

On Agent Incentives to Manipulate Human Feedback in Multi-Agent Reward Learning Scenarios

Francis Rhys Ward3y30

Thanks! I hadn't seen your paper but will check it out :)

All of Francis Rhys Ward's Comments + Replies