Alignment as an Evaluation Problem
A Layman's Model

This is not a "serious" model, nor do I think it is revolutionary in any way. I am an AI safety layperson, a run-of-the-mill software developer at Big Tech Firm™ who is aware of the general shape of AI safety issues but not particularly read up on the literature. My hope here is to refine my own thinking and perhaps provide something that helps other laypeople think more clearly about current-paradigm AI.

I work with LLM "agents" a lot these days. I read IABIED the week it came out, and it fired off some thoughts for me, given my first-hand experience working with LLMs and trying to understand why I see the failure modes I do and how that relates to the wider AI safety discussion.

I work in testing, helping a normal software organization that doesn't train LLMs try to integrate and use them, and so I see AIs exhibiting unaligned behavior writ small. Things like a coding agent that swears up and down that it fixed the bug you pointed out while not actually changing the relevant code at all. Or a chatbot that spins its wheels after a dozen prompts. Or the ever-classic: I get a near-perfect response from the agent I'm developing, then notice that one of its tools is actually broken. So I fix it, and with the much better context, the agent does much worse.

While vexing, this rhymes a little with the software validation I worked on before LLMs were involved. Large systems with many interdependent parts become hard to predict and are often nondeterministic. Apparent fixes may actually break something somewhere else. And validating such a system is hard: you can't model its internals accurately enough, so you must rely on black-box testing. For a black box, you can't actually know that it's correct without testing every possible behavior, which of course is not practical, so you focus testing on a small subset of use cases that you expect matter most to users. Then you back-propagate failure signals (i.e. di
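To make that concrete, here's a minimal sketch of what that subset-focused, black-box style of check looks like in practice. The `run_agent` call and the use cases are hypothetical placeholders, not any particular framework's API; the point is only the shape of the loop.

```python
# Minimal sketch of black-box, subset-focused validation for an LLM agent.
# `run_agent` and USE_CASES are hypothetical stand-ins for whatever opaque
# system and user-priority scenarios you actually have.

def run_agent(prompt: str) -> str:
    """Placeholder for the opaque system under test."""
    raise NotImplementedError("wire this up to your actual agent")

# A small set of the use cases we expect matter most to users.
USE_CASES = [
    {"prompt": "Fix the off-by-one bug in paginate()", "must_contain": "paginate"},
    {"prompt": "Summarize this stack trace", "must_contain": "error"},
]

def evaluate(use_cases):
    """Run each use case against the black box and record pass/fail."""
    results = []
    for case in use_cases:
        try:
            output = run_agent(case["prompt"])
            passed = case["must_contain"] in output
        except Exception:
            passed = False  # any crash counts as a failure signal too
        results.append({"prompt": case["prompt"], "passed": passed})
    return results
```

The harness itself isn't the interesting part; it's that a tiny, user-weighted slice of behavior is the only handle you get on a system whose internals you can't open up.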