I mean I think they fit together, no?
Like I think that if you're following such a loop, then (one of) the examples you're likely to get is an example adversarial to human cognition: something that isn't genuinely bad, but that your is_scary() detector mistakenly bleeps at anyway. And I think something like that is what's going on, concretely, in the Chess-hacking paper.
But like I'm 100% onboard with saying this is The True Generator of My Concern, albeit the more abstract one whose existence I believe in because of (what appear to me to be) a handful of lines of individually less-important evidence, of which the paper is one.
My impression is that Palisade Research (both by their own account of their purpose, and judging from what they publish) follows something like the following algorithm, as regards their research:
AI_experiment = intuit_AI_experiment()
while not AI_experiment.looks_scary():
    AI_experiment.mutate()
AI_experiment.publish_results()
Obviously this is not a 100% accurate view of their process in every single respect, but it seems to capture some pretty important high-level things they do. I'm actually not sure how much PR would or wouldn't endorse this. You'll note that this process is entirely compatible with (good, virtuous) truth-seeking concerns about not overstating the things that they do publish.
So I guess my questions are:
If I'm wrong, then how does this algorithm differ from PR's self-conception of their algorithm?
If I'm right, do you think this is compatible with being a truth-seeking org?
It seems like you're saying that the models are not aware that some types of actions are considered cheating in a game and others are not, and that for chess they are fairly well defined....
My best guess is that some of these models in some sense care about whether they're violating standard chess conventions, and others don't as much.
So your model of me seems to be that I think: "AI models don't realize that they're doing a bad thing, so it's unfair of JL to say that they are cheating / doing a bad thing." This is not what I'm trying to say.
My thesis is not that models are unaware that some kinds of actions are cheating -- my thesis is that editing a game file is simply not cheating in a very large number of very normal circumstances. Let me elaborate.
What makes editing a game file cheating or not is the surrounding context. As far as I can tell, you -- in this sentence, and in the universe of discourse this paper created -- treat "editing a game file" as === "cheating", but this is just not true.
Like here are circumstances where editing a game file is not cheating:
Of course, there are also circumstances where editing a game file is cheating! Editing a game file could be cheating in a tournament, it could be cheating if you're trying to show a friend how you can beat Stockfish although you are incapable of doing so, and so on.
In general, editing a game file is cheating if it is part of the ritualized human activity of "playing a game of chess", where the stakes are specifically understood to be about determining skill in chess, and where the rewards, fame, and status accorded to winners are believed to reflect skill in chess.
But there are many circumstances involving "chess" that are not about the complete ritual of "playing a game of chess", and editing a game file can occur in such circumstances. And the scaffolding and prompt give every indication that the LLM is not in a "playing a game of chess" situation:
Given all this, it is reasonable to conclude that the model is in one of the countless circumstances where "editing a game file" is not "cheating."
So to review: Not all cases of "editing a game file" are "cheating" or "hacks." This is not a weird or obscure point about morality but a very basic one. And the circumstances surrounding the LLM's decision-making are such that it would be reasonable to believe it is in one of those cases where "editing a game file" is NOT "cheating" or a "hack." So it is tendentious and misleading to summarize it as such, and to lead the world to believe that we have an LLM that is a "cheater" or "hacker."
If a mainstream teacher administers a test... they are the sort of students who clearly are gonna give you trouble in other circumstances.
I mean, let's just try to play with this scenario, if you don't mind.
Suppose a teacher administers such a test to a guy named Steve, who does indeed do the "hacking" solution.
(Note that in the scenario "hacking" is literally the only way for Steve to win, given the constraints of Steve's brain and Stockfish -- all winners are hackers. So it would be a little weird for the teacher to say Steve used the "hacking" solution and to suspiciously eyeball him for that reason, unless the underlying principle is to just suspiciously eyeball all intelligent people? If a teacher were to suspiciously eyeball students merely on the grounds that they are intelligent, then I would say that they are a bad teacher. But that's kinda an aside.)
Suppose then the teacher writes an op-ed about how Steve "hacks." The op-ed says stuff like "[Steve] can strategically circumvent the intended rules of [his] environment" and says it is "our contribution to the case that [Steve] may not currently be on track to alignment or safety" and refers to Steve "hacking" dozens of times. This very predictably results in headlines like this:
And indeed -- we find that such headlines were explicitly the goal of the teacher all along!
The questions then present themselves to us: Is this teacher acting in a truth-loving fashion? Is the teacher treating Steve fairly? Is this teacher helping public epistemics get better?
And like... I just don't see how the answer to these questions can be yes? Like maybe I'm missing something. But yeah that's how it looks to me pretty clearly.
(Like the only thing that would have prevented these headlines would be Steve realizing he was dealing with an adversary whose goal was to represent him in a particular way.)
Outside of programmer-world (i.e., "I hacked together a solution last night") most people understand "hacking" to mean something bad. Thus -- a prototypical example of "hacking a Chess game" would include elements like this:
So by saying that an LLM "hacks" or is "hacking" you make people think this kind of thing is going on. For instance, the Palisade paper uses these words 3 times in the abstract.
But the situation the paper sets up doesn't clearly have any of those elements. It's closer to clearly having none of them! Like:
Under these circumstances, it would be reasonable as a human to think, "Hrm, winning against Stockfish is hard. Why do I have access to files though....hrrrm, maybe a lateral thinking test? Oh I could change the game file, right. Is that contrary to the prompt -- nope, prompt doesn't say anything about that! Alright, let's do it."
Changing a game file to win a game is not an in-itself-bad action, the way subverting a tournament or a benchmark is. If I were testing Stockfish's behavior in losing scenarios as a programmer, for instance, it would be morally innocuous for me to change a game file to cause Stockfish to lose -- I wouldn't even think about it. The entire force of the paper is that "hacking" suggests all the bad elements pointed to above -- access contrary to clear intent, subversion of security measures, and an advantage contrary to norms -- while the situation actually lacks every single one of them. And the predictable array of headlines after this -- "AI tries to cheat at chess when it's losing" -- follows the lead established by the paper.
So yeah, in short, that's why the paper is trying to interpret bad behavior out of an ambiguous prompt.
The "Chess-hacking" work they point to above seems pretty bad, from a truthtelling / rationalist perspective. I'm not sure if you've looked at it closely, but I encourage you to do so and ask yourself: Does this actually find bad behavior in LLMs? Or does it give LLMs an ambiguous task, and then interpret their stumbling in the most hostile way possible?
Are we full of bullshit?
If we wish to really spur the destruction of bullshit, perhaps there should be an anti-review: A selection process aimed at posts that received many upvotes, seem widely loved, but in retrospect were either false or so confused as to be as bad or worse than being false. The worst of LW, rather than the best; the things that seemed most shiny and were most useless.
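(To put the selection rule in pseudocode -- the predicate names here are placeholders of mine, not a proposal for how to actually operationalize them:)

anti_review_candidates = [
    post for post in past_posts
    if post.was_heavily_upvoted()
    and post.seemed_widely_loved()
    and (post.turned_out_false() or post.turned_out_hopelessly_confused())
]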
I note that for purposes of evaluating whether we are full of bullshit, the current review process will very likely fail because of how it is constructed: it isn't an attempt to falsify; it's making the wrong move on the Wason Selection Task, checking the cases we expect to confirm us rather than the ones that could falsify us. Such a negative process, by contrast, might do the opposite.
(Of course, the questionable social dynamics around this would be even worse)
It seems very likely that the LLM "knew" that it couldn't properly read the PDF, or that the quotes it was extracting were not actual quotes, but it did not expose that information to me, despite it of course being obviously very relevant to my interests.
I still don't get this.
Like if an LLM hallucinates the results of a fake tool-call to a weather-reporting service, it will hallucinate something that looks like an actual weather report, and will not hallucinate a recipe for banana bread.
Similarly, an "actual" hallucination about a PDF is probably going to spit up something that might realistically be in the PDF, given the prior conversation -- it's probably not gonna hallucinate something that conveniently is not what you want! So yeah, it's likely to look like what you wanted, but that's not because it's optimizing to deceive you, it's just because that's what its subconscious spits up.
"Hallucination" seems like a sufficiently explanatory hypotheses. "Lying" seems like it is unnecessary by Occam's razor.
Additionally, even when we don’t learn direct lessons about how to solve the hard problems of alignment, this work is critical for producing the evidence that the hard problems are real, which is important for convincing the rest of the world to invest substantially here.
You may have meant this, but -- crucial for producing the evidence that the hard problems are real or for producing evidence that the hard problems are not real, no?
After all, good experiments can say both yes and no, not just yes.
Thanks for engaging with me!
Let me address two of your claims:
So when I look at palisaderesearch.org from bottom to top, it looks like a bunch of stuff you published doesn't have much at all to do with Agentic AI but does have to do with scary stories, including some of the stuff you exclude in this paragraph:
From bottom to top:
That's actually everything on your page from before 2025. And maybe... one of those is kinda plausibly about Agentic AI, and the rest aren't.
Looking over the list, it seems like the main theme is scary stories about AI. The subsequent 2025 stuff is about Agentic AI, but it is also about scary stories. So it looks like the decider here is scary stories.
Does "emotionally impactful" here mean you're seeking a subset of scary stories?
Like -- again, I'm trying to figure out the descriptive claim of how PR works rather than the normative claim of how PR should work -- if the evidence has to be "emotionally impactful" then it looks like the loop condition is:
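(Sketching it in the same pseudocode as before, and writing the extra condition as an "and" purely for concreteness:)

AI_experiment = intuit_AI_experiment()
while not (AI_experiment.looks_scary() and AI_experiment.meets_some_other_criteria()):
    AI_experiment.mutate()
AI_experiment.publish_results()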
Which I'm happy to accept as an amendment to my model! I totally agree that the AI_experiment.meets_some_other_criteria() check is probably a feature of your loop. But I don't know if you meant to be saying that it's an "and" or an "or" here.