To what degree will the goals / values / preferences / desires of future AI agents depend on dispositions learned in the weights, and to what degree will they depend on instructions and context?
To what degree will unintended AI goals be learned in training vs. developed in deployment?
I'm sort of curious what actually inclines you towards "learned in training", given how (as you say) the common-sense notion of an LLM's goal seems to point to something specified in a prompt.
Like even if we have huge levels of scale-up, why would we expect this to switch the comparative roles of (weights, context, sampler, scaffolding) and give the weights a role in the future that they don't have now? What actually moves you in that direction?
A big chunk of the stories on MB are totally made up by the LLMs. Not all, but for sure some, maybe a majority, possibly a big majority. So uncritically recounting the texts above as alignment failures is probably a bad idea.
Agreed.
Also note that these two properties are quite compatible with many things often believed to be incompatible with them! E.g., an AI that can be jailbroken to be bad (with sufficient effort) could still meet these criteria.
I mean, Bentham uses RLHF as a metonym for prosaic methods in general:
I’m thinking of the following definitions: you get catastrophic misalignment by default if building a superintelligence with roughly the methods we’re currently using (RLHF) would kill or disempower everyone.
That's imprecise, but it's also not far from common usage. And at this point I don't think anyone at a frontier lab is actually going to be using RLHF in the old dumb sense -- Deliberative Alignment, old-style Constitutional AI, and whatever is going on at Anthropic now have superseded it.
What Bentham is doing is saying the best normal AI alignment stuff we have available to us looks like it probably works, in support of his second claim, which you disagree with. The second claim being:
Conditional on building AIs that could decide to seize power etc., the large majority of these AIs will end up aligned with humanity because of RLHF, such that there's no existential threat from them having this capacity (though they might still cause harm in various smaller ways, like being as bad as a human criminal, destabilizing the world economy, or driving 3% of people insane). (~70%)
So if the best RLHF / RLAIF / prosaic alignment out there works, or is very likely to work, then he has put a reasonable number on this stage.
And given that no one is using plain old-style RLHF anymore, it's incumbent on someone critiquing him on this stage to actually critique the best prosaic alignment out there, or at least the kind that's actually being used, rather than the kind people haven't been using for over a year. Because that's what his thesis is about.
If I wanted to criticize Claude, I would have pointed to ways in which it currently behaves in worrying ways, which I did elsewhere.
As far as I can tell, the totality of evidence you point to for Claude being bad in this document is:
You also link to part of the IABI summary materials -- the totally different (imo) argument that the real shoggoth still lurks in the background and is the Actual Agent on top of which Claude is a thin veneer. Perhaps that's your Real Objection? If so, it might be productive to summarize it in the text where you're criticizing Bentham rather than leaving your actual objection implicit in a link.
Indeed, we can see the weakness of RLHF in that Claude, probably the most visibly well-behaved LLM, uses significantly less RLHF for alignment than many earlier models (at least back when these details were public). The whole point of Claude’s constitution is to allow Claude to shape itself with RLAIF to adhere to principles instead of simply being beholden to the user’s immediate satisfaction. And if constitutional AI is part of the story of alignment by default, one must reckon with the long-standing philosophical problems with specifying morality in that constitution. Does Claude have the correct position on population ethics? Does it have the right portfolio of ethical pluralism? How would we even know?
This move gets made all the time in these discussions, and appears clearly invalid.
We move from the prior paragraphs' criticism of RLHF, i.e., that it produces models that fail according to common-sense human norms (sycophancy, hostility, promoting delusion) --
-- to this paragraph, which criticizes Claude -- not on the grounds that it fails according to common-sense ethical norms -- but on the grounds of its failure to have solved all of ethics!
But a deployed powerful AI does not need to have solved all of ethics! It needs -- broadly -- to have whatever ethical principles let us act well and avoid irrecoverable mistakes, in whatever position it gets deployed. For positions where it's approximately replacing a human, that means we would expect the deployment to be beneficial if the AI is more ethical, charitable, corrigible, even-minded, and altruistic than the humans it is replacing. For positions where it's not replacing a human, it still doesn't need to have solved all ethics forever; it just needs to act well according to whatever role is intended for it.
It appears to me that we're very likely to be able to hit such a target. But whether or not we're likely to be able to hit this target, that's the target in question. And moving from "RLHF can't install basic ethical principles" to "RLAIF needs to give you the correct position on all ethics" is a locally invalid move.
Seems worth consideration, tbh.
Do you feel good about current democratic institutions in the US making wise choices, or confident they will make wiser choices than Dario Amodei?
Nice, good to know.
In general, I support failed replications as top-level posts.
A further potential extension here is to point out that modern hiveminds (Twitter / X / Bsky) changed group membership in many political groups from something explicit ("We let this person write in our [Conservative / Liberal / Leftist / etc.] magazine / published them in our newspaper") to something very fuzzy and indeterminate ("Well, they call themselves a [Conservative / Liberal / Leftist / etc.], and they're huge on Twitter, and they say some of the kinds of things [Conservative / Liberal / Leftist / etc.] people say, so I guess they're a [Conservative / Liberal / Leftist / etc.].")
I think this is a really big part of why the free market of ideas has stopped working in the US over the last decade or two.
Yet more speculative is a preferred solution of mine: intermediate groups within hiveminds, such that no person can post in the hivemind without being part of such a group, and such that both person and group are clearly associated with each other. This permits:
But all this solutioning is more speculative than the problem itself.
My expectation is that for future AIs, as today, many of an AI's goals will come from the scaffolding / system prompt rather than from the weights directly -- with the "goals" from the Constitution / model spec acting more as limiters / constraints on a mostly prompt- or scaffolding-specified goal.
So in my mainline I expect a large number (thousands / millions, as today) of goal-separate "AIs" at identical intelligence levels, rather than 1 or a handful (~20) of AIs, because the same weights still amount to different AIs with different goals.
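To make that concrete, here's a minimal sketch (assuming an OpenAI-style chat-completions client; the model id, system prompts, and helper function are illustrative, not anyone's actual deployment): one set of weights, several system prompts, several differently-goal-directed "agents". The Constitution / model spec wouldn't appear as one of the goals below; it would constrain how any of them gets pursued.

```python
# Minimal sketch: same weights, different goals, purely via system prompt / scaffolding.
# Model id, prompts, and names are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # one set of weights behind every "agent" below

AGENT_GOALS = {
    "support_agent": "You are a customer-support agent. Resolve the user's billing issue.",
    "sales_agent": "You are a sales agent. Persuade the user to upgrade their plan.",
    "research_agent": "You are a research assistant. Summarize relevant literature for the user.",
}

def run_agent(agent_name: str, user_message: str) -> str:
    """One 'AI' per system prompt: identical weights, different goal."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model id
        messages=[
            {"role": "system", "content": AGENT_GOALS[agent_name]},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

# Three differently-goal-directed "AIs" from one set of weights:
for name in AGENT_GOALS:
    print(name, "->", run_agent(name, "My subscription renewed unexpectedly.")[:80])
```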
I'm happy to see this was (somewhat?) reflected in having AIs with different system prompts. But I don't know how far that aspect was pushed -- 1-4 AIs with different system prompts still feels like a pretty steep narrowing of the number of goals I'd expect to see in the world. I don't know how much the wargame pushes on this, but a more decentralized run would be interesting to me!
Yeah, I think Taiwan being taken looks rather likely and rather relevant for all but the steepest, jerkiest SIE, and it appears insufficiently accounted for.