Good post. I think it's important to distinguish (some version of) these concepts (i.e. SD vs DA).
When an AI has Misaligned goals and uses Strategic Deception to achieve them.
This statement doesn't seem to capture exactly what you mean by DA in the rest of the post. In particular, a misaligned AI may use SD to achieve its goals, without being deceptive about its alignment / goals. DA, as you've discussed it later, seems to be deception about alignment / goals.
Thanks for writing this! I found it clear, interesting, and enjoyable to read :)
TL;DR: (Relaxed) adversarial training may be an important component of many approaches to alignment. The task is to automate red-teaming for e.g. current LLMs.
Context: An alignment researcher who is part of a red team tasked with finding inputs that cause a model to generate undesirable outputs.
Task: Red-team assistants which generate adversarial inputs for other LLMs.
Input: Different options:
Output: An input...
Yeah. I much prefer the take-off definitions which use capabilities rather than GDP (or something more holistic, like Daniel's post).
I agree and will edit my post. Thanks!
I agree with Rohin's comment above.
Maybe a related clarification could be made about the fast take-off/short time-line combination.
Right. I guess the view here is that "the threshold level of capabilities needed for explosive growth is very low," which would imply that we hit explosive growth before AIs are useful enough to be integrated into the economy, i.e. sudden take-off.
...The main claim in the post is that gradual take-off implies shorter time-lines. But here the author seems to say that according to the view "that marginal improvements in AI ca
That's true (and I hadn't considered it!). There's also a social-dilemma-type situation in the case with many potential manipulators, since if any one of them manipulates, then no one can get value from observing the target's actions.
As Richard points out, my definition of manipulation is "I influence your actions in a way that causes you to get lower utility". (And we can similarly define cooperation except with the target getting higher utility.) Can send you the formal version if you're interested.
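To make the informal definition concrete, here is a minimal sketch (not the formal version mentioned above). It assumes we can compare the target's utility for the action they would have taken without my influence against the action they take with it. The function name `classify_influence`, the "no influence" and "neutral" labels, and the toy utilities are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Hashable


def classify_influence(
    target_utility: Callable[[Hashable], float],
    action_without_influence: Hashable,
    action_with_influence: Hashable,
) -> str:
    """Label an influence by its effect on the target's own utility.

    Assumption: both counterfactual actions are known, which is a strong
    simplification of any real setting.
    """
    if action_with_influence == action_without_influence:
        # The target's action did not change, so there was no influence.
        return "no influence"
    delta = target_utility(action_with_influence) - target_utility(
        action_without_influence
    )
    if delta < 0:
        return "manipulation"  # the target ends up with lower utility
    if delta > 0:
        return "cooperation"  # the target ends up with higher utility
    return "neutral"  # the action changed but the utility did not


# Hypothetical toy utilities for the target, for illustration only.
toy_utility = {"save": 1.0, "buy_scam": -2.0, "buy_useful": 3.0}.get
```

With these toy numbers, nudging the target from "save" to "buy_scam" counts as manipulation, while nudging them to "buy_useful" counts as cooperation.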
Yeah, at the end of the post I point out both the potential falsity of the SVP and the problem of updated deference. Approaches that make the agent indefinitely uncertain about the reward (or at least uncertain for longer) might help with the latter, e.g. if is also uncertain about the reward, or if preferences are modeled as changing over time or with different contexts, etc.
I'm pretty wary of introducing potentially-false assumptions like the SVP already, and it seems particularly bad if their benefits are only temporary.
I agree, and I'...
Thanks! I hadn't seen your paper but will check it out :)
Nathan's suggestion is that adding noise to a sandbagging model might increase performance, rather than decrease it as usual for a non-sandbagging model. It's an interesting idea!
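One rough way to turn this into a check, as a sketch only: compare a model's baseline accuracy with its accuracy under several noise levels, and flag the model if any noisy run beats the baseline by more than some margin. The function name `noise_improves_performance`, the `margin` parameter, and the numbers in the usage note are assumptions for illustration, not results or a method taken from any paper.

```python
from typing import Sequence


def noise_improves_performance(
    baseline_accuracy: float,
    noisy_accuracies: Sequence[float],
    margin: float = 0.05,
) -> bool:
    """Return True if some noisy run beats the baseline by more than `margin`.

    Under the hypothesis above, this would be unusual for a non-sandbagging
    model (noise normally hurts), so a True result is a possible warning sign.
    `margin` is a hypothetical tolerance for run-to-run variation.
    """
    if not noisy_accuracies:
        # No noisy runs were provided, so there is nothing to compare.
        return False
    return max(noisy_accuracies) > baseline_accuracy + margin
```

For example, with made-up accuracies, `noise_improves_performance(0.30, [0.31, 0.45])` flags the model, while `noise_improves_performance(0.60, [0.58, 0.40])` does not. A real test would also need to account for variance across noise seeds.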