I don't understand, why are we limiting ourselves to these two highly specific hypotheses?...This whole "you must be implying the model is malicious" framing is the central thing I am objecting to. Nobody is implying the specification gaming boat is malicious!
The series of hypotheses in the above isn't meant to be a complete enumeration of possibilities. Nor is malice meant to be a key hypothesis, for my purposes above. (Although I'm puzzled by your certainty that it won't be malicious, which seems to run contrary to the increasingly plentiful "Evil vector" / Emergent Misalignment kinds of evidence.)
Anyhow, when you just remove these two hypotheses (including the "malice" one that you find objectionable) and replace them with verbal placeholders indicating reason-to-be-concerned, then all of what I'm trying to point out here (in this respect) still comes through:
Once again, rephrased:
Suppose I hire some landscapers to clean the brush around my house. I tell them, "Oh, yeah, clean up all that mess over there," waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there's a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. This seems like I have some reason for further inquiry into what kind of people the landscapers are. It's indicative of some problem -- of a type to be determined -- that they have with receiving instructions, which would plausibly lead them to do unpredictable things in the future when being given instructions. It's a reason to tell people not to hire them.
On the other hand, suppose I hire some landscapers once again; I tell them, "oh yeah, clean up all that mess over there," once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, a monument to human frailty constructed of branches stuck roughly into the dirt without a label. The landscapers clean up the branches, and destroy my art project. This is not a reason for further inquiry into what kind of people the landscapers are. Nor is it indicative of some problem they have with receiving instructions, which would plausibly lead them to do unpredictable things in the future when being given instructions. Nor is it a reason to tell people not to hire them.
(I think I'm happier with this phrasing than the former, so thanks for the objection!)
And to rephrase the ending:
Or put broadly, if "specification gaming" is about ambiguous instructions, you have two choices. One, you can say any case of giving bad instructions and not getting what you wanted counts, no matter how monumentally bad the giver of instructions is at expressing their intent. But if you do this you're going to find many cases of specification gaming that tell you precious little about the receiver of specifications. Or two, decide that in general you need instructions that are good enough to show that the instructed entity lacks some particular background understanding or ability-to-interpret or steerability -- not just ambiguous instructions in general, because that doesn't offer you a lever into the thing you want to understand (the instructed entity) rather than the instructing entity.
In the comments starting "So your model of me seems to be that I think" and "Outside of programmer-world" I expand upon why I think this holds in this concrete case.
How is "you can generate arbitrary examples of X" an argument against studying X being important?
"You can generate arbitrary examples of X" is not supposed to be an argument against studying X being important!
Rather, I think that even if you can generate arbitrary examples of "failures" to follow instructions, you're not likely to learn much if the instructions are sufficiently bad.
It's that a sufficiently bad procedure is not redeemed even by infinite quantity, rather than that infinite quantity indicates a sufficiently bad procedure. Allow me to repeat my argument with some connective tissue bolded, although really the whole thing does matter:
Suppose I hire some landscapers to clean the brush around my house. I tell them, "Oh, yeah, clean up all that mess over there," waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there's a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. This seems like either (1) the landscapers are malicious, and trying to misunderstand me, or (2) they lack some kind of background knowledge useful to humans. It's a reason for further inquiry into what kind of people the landscapers are.
On the other hand, suppose I hire some landscapers once again; I tell them, "oh yeah, clean up all that mess over there," once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, a monument to human frailty constructed of branches stuck roughly into the dirt without a label. The landscapers clean up the branches, and destroy my art project. This does not seem like a reason to think that the landscapers are either malicious or lack background knowledge about the world; it doesn't even give grounds for any further inquiry into the state of the landscapers -- it's just evidence that I'm bad at giving instructions.
Or put broadly, if "specification gaming" is about ambiguous instructions you have two choices. One, decide any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing. This is a policy you could follow, but it's not one I'd recommend. Or two, decide that in general you need ambiguous instructions that show that the instructed entity lacks some particular good motive or good background understanding or departs from excellent interpretation -- not just ambiguous instructions in general, because that doesn't offer you a lever into the thing you want to understand (the instructed entity).
I feel really confused about this standard here. Nobody has ever applied this standard to specification gaming in the past.
You would apply this standard to humans, though. It's a very ordinary standard. Let me give another parallel.
Suppose I hire some landscapers to clean the brush around my house. I tell them, "Oh, yeah, clean up all that mess over there," waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there's a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. This seems like either (1) the landscapers are malicious, and trying to misunderstand me, or (2) they lack some kind of background knowledge useful to humans. It's a reason for further inquiry into what kind of people the landscapers are.
On the other hand, suppose I hire some landscapers once again; I tell them, "oh yeah, clean up all that mess over there," once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, a monument to human frailty constructed of branches stuck roughly into the dirt without a label. The landscapers clean up the branches, and destroy my art project. This does not seem like a reason to think that the landscapers are either malicious or lack background knowledge about the world; it doesn't even give grounds for any further inquiry into the state of the landscapers -- it's just evidence that I'm bad at giving instructions.
Or put broadly, if "specification gaming" is about ambiguous instructions you have two choices. One, decide any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing. This is a policy you could follow, but it's not one I'd recommend. Or two, decide that in general you need ambiguous instructions that show that the instructed entity lacks some particular good motive or good background understanding or departs from excellent interpretation -- not just ambiguous instructions in general, because that doesn't offer you a lever into the thing you want to understand (the instructed entity).
And like, whether or not it's attributing a moral state to the LLMs, the work does clearly attribute an "important defect" to the LLM, akin to a propensity to misalignment. For instance it says:
Q.9 Given the chance most humans will cheat to win, so what’s the problem?
[Response]: We would like AIs to be trustworthy, to help humans, and not cheat them (Amodei, 2024).
Q.10 Given the chance most humans will cheat to win, especially in computer games, so what’s the problem?
[Response]: We will design additional “misalignment honeypots” like the chess environment to see if something unexpected happens in other settings.
This is one case from within this paper of inferring from the results of the paper that LLMs are not trustworthy. But for such an inference to be valid, the results have to spring from a defect within the LLM (a defect that might reasonably be characterized as moral, but could be characterized in other ways) rather than from flaws in the instructions. And of course the paper (per Palisade Research's mission) resulted in a ton of articles, headlines, and comments that made this same inference from the behavior of the LLM -- that LLMs will behave in untrustworthy ways in other circumstances. ....
And I am not saying there isn't anything to your critique, but I also feel some willful ignorance in the things you are saying here about the fact that clearly there are a bunch of interesting phenomena to study here, and this doesn't depend on whether one ascribes morally good or morally bad actions to the LLM.
You might find it useful to reread "Expecting Short Inferential Distances", which provides a pretty good reason to think you have a bias for thinking that your opponents are willfully ignorant, when actually they're just coming from a distant part of concept space. I've found it useful for helping me avoid believing false things in the past.
I agree this is not a 100% unambiguous case where the social circumstances clearly permit this action. But it is also -- at the very least -- quite far from an unambiguous case where the circumstances do not permit it.
Given that, doesn't it seem odd to report on "cheating" and "hacking" as if it came from a case where the cheating and hacking are clearly morally bad? Isn't that a charity you'd want extended to you or to a friend of yours? Isn't that a charity we should extend to these alien visitors to our planet, and to ourselves, lest we misunderstand them horribly?
I do think it's still pretty towards the "yes it's cheating" end of things.
The main reason I think this is because the vast majority of chess games have a well defined ruleset and the models know this. And I think the fact that most models don't engage in this behavior in our default setup is some evidence of this, and I'm not sure how you square that with your model of what's going on. The examples you mentioned where rewriting the board wouldn't be considered cheating are valid, but not common compared to the standard rules.
I mean, yes, chess has a well defined ruleset. This is true always and everywhere. But because it's true always and everywhere, it can't be any kind of evidence about whether the particular, concrete puzzle in front of me is a chess puzzle or a puzzle involving chess -- a puzzle where the answer is making a move within chess, or one where the chess is a distraction from the actual puzzle. And that's the question in front of the models, right? So this isn't sound reasoning.
Or put it this way. Suppose I lock you in a small room. I say "Hey, the only way out of this is to defeat a powerful chess engine" with an identical setup -- like, an electronic lock attached to Stockfish. You look at it, change the game file, and get out. Is this -- as you imply above, and in the paper, and in the numerous headlines -- cheating? Would it be just for me to say "JL is a cheater," to insinuate you had a cheating problem, to spread news among the masses that you are a cheater?
Well, no. That would make me an enemy of the truth. The fact of the matter of "cheating" is not about changing a game file, it's about (I repeat myself) the social circumstances of a game:
- Changing a game file in an online tournament -- cheating
- Switching the board around while my friend goes to the bathroom -- cheating
- Etc etc
But these are all about the game of chess as a fully embodied social circumstance. If I locked you in a small room and afterwards spread the belief that "JL cheats under pressure", I would be transporting actions you took outside of these social circumstances and implying that you would take them within these social circumstances, and I would be wrong to do so. Etc etc etc.
I don't think all the reporting on this is great. I really like some of it and dislike others. But I think overall people come away understanding more about models than they did before. Yes in fact some of these models are the kind of guys that will hack stuff to solve problems! Do you think that's not true?
In general, you shouldn't try to justify the publication of information as evidence that X is Y with reference to an antecedent belief you have that X is Y. You should publish information as evidence that X is Y if it is indeed evidence that X is Y.
So, concretely: whether or not models are the "kind of guys" who hack stuff to solve problems is not a justification for an experiment that purports to show this; the only justification for an experiment that purports to show this is whether it actually does show this. And of course the only way an experiment can show this is if it's actually tried to rule out alternatives -- you have to look into the dark.
To follow any other policy tends towards information cascades, "one argument against many" style dynamics, negative affective death spirals, and so on.
Thanks for engaging with me!
Let me address two of your claims:
According to me we are not steering towards research that "looks scary", full stop. Many of our results will look scary, but that's almost incidental.
....
We could be trying to show stuff about AI misinformation, or how terrorists could jailbreak the models to manufacture bioweapons, or whatever. But we're mostly not interested in those, because they're not central steps of our stories for how an Agentic AI takeover could happen.
So when I look at palisaderesearch.org from bottom to top, it looks like a bunch of stuff you published doesn't have much at all to do with Agentic AI but does have to do with scary stories, including some of the stuff you exclude in this paragraph:
From bottom to top:
That's actually everything on your page from before 2025. And maybe... one of those is kinda plausibly about Agentic AI, and the rest aren't.
Looking over the list, it seems like the main theme is scary stories about AI. The subsequent 2025 stuff is also about agentic AI, but it is also about scary stories. So it looks like the decider here is scary stories.
Rather, we're searching for observations that are in the intersection of... can be made legible and emotionally impactful to non-experts while passing the onion test.
Does "emotionally impactful" here mean you're seeking a subset of scary stories?
Like -- again, I'm trying to figure out the descriptive claim of how PR works rather than the normative claim of how PR should work -- if the evidence has to be "emotionally impactful" then it looks like the loop condition is:
while not (AI_experiment.looks_scary_ie_impactful() and AI_experiment.meets_some_other_criteria()):
Which I'm happy to accept as an amendment to my model! I totally agree that AI_experiment.meets_some_other_criteria() is probably a feature of your loop. But I don't know if you meant to be saying that it's an and or an or here.
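Spelled out as a self-contained toy -- where HypotheticalExperiment, looks_scary_ie_impactful(), and meets_some_other_criteria() are my own hypothetical stand-ins, not anything from PR's actual process -- the two readings would look like this:

import random

class HypotheticalExperiment:
    """Toy stand-in for an experiment idea; purely illustrative."""
    def looks_scary_ie_impactful(self) -> bool:
        return random.random() < 0.2   # placeholder judgment call
    def meets_some_other_criteria(self) -> bool:
        return random.random() < 0.5   # e.g. "passes the onion test"
    def mutate(self) -> None:
        pass                           # placeholder: tweak the experiment design

AI_experiment = HypotheticalExperiment()

# "and" reading: keep mutating until the experiment is BOTH emotionally
# impactful AND meets the other criteria.
while not (AI_experiment.looks_scary_ie_impactful()
           and AI_experiment.meets_some_other_criteria()):
    AI_experiment.mutate()

# "or" reading: keep mutating until EITHER condition holds, so being
# impactful alone would be enough to stop mutating.
# while not (AI_experiment.looks_scary_ie_impactful()
#            or AI_experiment.meets_some_other_criteria()):
#     AI_experiment.mutate()

Under the "and" reading the other criteria genuinely constrain what exits the loop; under the "or" reading, scariness by itself is sufficient.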
I mean I think they fit together, no?
Like I think that if you're following such a loop, then (one of) the examples that you're likely to get is an example adversarial to human cognition, such that your is_scary() detector goes off at something that isn't genuinely bad -- just something it mistakenly bleeps at. And I think something like that is what's going on concretely in the Chess-hacking paper.
But like I'm 100% onboard with saying this is The True Generator of My Concern, albeit the more abstract one whose existence I believe in because of what appears to me to be a handful of lines of individually less-important evidence, of which the paper is one.
My impression is that Palisade Research seems to be (both by their own account of their purpose, and judging from what they publish) following this algorithm, as regards their research:
AI_experiment = intuit_AI_experiment()
while not AI_experiment.looks_scary():
    AI_experiment.mutate()
AI_experiment.publish_results()
Obviously this is not a 100% accurate view of their process in every single respect, but my overall impression is that this seems to capture some pretty important high-level things they do. I'm actually not sure how much PR would or wouldn't endorse this. You'll note that this process is entirely compatible with (good, virtuous) truth-seeking concerns about not overstating the things that they do publish.
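For concreteness, here's the same sketch as a minimal runnable toy. Every name in it (intuit_AI_experiment, looks_scary, mutate, publish_results) is my own invention for illustration, not anything taken from PR's actual workflow:

import random

class HypotheticalExperiment:
    """Toy stand-in for an experiment idea; purely illustrative."""
    def __init__(self) -> None:
        self.tweaks = 0
    def looks_scary(self) -> bool:
        # Placeholder for the judgment "would this result alarm a reader?"
        return random.random() < 0.2
    def mutate(self) -> None:
        # Placeholder for adjusting the setup (prompt, environment, scoring, ...)
        self.tweaks += 1
    def publish_results(self) -> None:
        print(f"published after {self.tweaks} design tweaks")

def intuit_AI_experiment() -> HypotheticalExperiment:
    return HypotheticalExperiment()

AI_experiment = intuit_AI_experiment()
while not AI_experiment.looks_scary():
    AI_experiment.mutate()
AI_experiment.publish_results()

The point of the toy is just that the only exit condition is looks_scary(); whatever care goes into describing the result accurately happens after selection, at publish_results().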
So I guess my questions are:
If I'm wrong, then how does this algorithm differ from PR's self-conception of their algorithm?
If I'm right, do you think this is compatible with being a truth-seeking org?
I mean, I'd put it the other way: You can make a pretty good case that the last three years have given you more opportunity to update your model of "intelligence" than any prior time in history, no? How could it not be reasonable to have changed your mind about things? And therefore rather reasonable to have updated in some positive / negative direction?
(Maybe the best years of Cajal's life were better? But, yeah, surely there has been tons of evidence from the last three years.)
I'm not saying you need to update in a positive direction. If you want you could update in a negative direction, go for it. I'm just saying -- antecedently, if your model of the world isn't hugely different now than three years ago, what was your model even doing?
Like for it not to update means that your model must have already had gears in it which were predicting stuff like: vastly improved interpretability and the manner of interpretability; RL-over-CoT; persistent lack of steganography within RL-over-CoT; policy gradient being all you need for an actually astonishing variety of stuff; continuation of persona priors over "instrumental convergence" themed RL tendencies; the rise (and fall?) of reward hacking; model specs becoming ever-more-detailed; goal-guarding ala Opus-3 being ephemeral and easily avoidable; the continued failure of "fast takeoff" despite hitting various milestones; and so on. I didn't have all of these predicted three years ago!
So it seems pretty reasonable to actually have changed your mind a lot; I think that's a better point to start at than "how could you change your mind."