Thanks!
Any thoughts on how this line of research might lead to "positive" alignment properties? (i.e. Getting models to be better at doing good things in situations where what's good is hard to learn / figure out, in contrast to a "negative" property of avoiding doing bad things, particularly in cases clear enough we could build a classifier for them.)
The second thing impacts the first thing :) If a lot of scheming is due to poor reward structure, and we should work on better reward structure, then we should work on scheming prevention.
Very interesting!
It would be interesting to know what the original reward models would say here - does the "screaming" score well according to the model of what humans would reward (or what human demonstrations would contain, depending on type of reward model)?
My suspicion is that the model has learned that apologizing, expressing distress etc after making a mistake is useful for getting reward. And also that you are doing some cherrypicking.
At the risk of making people do more morally grey things, have you considered doing a similar experiment with models...
I'm a big fan! Any thoughts on how to incorporate different sorts of reflective data, e.g. different measures of how people think mediation "should" go?
I don't get what experiment you are thinking about (most CoTs end with the final answer, such that the summarized CoT often ends with the original final answer).
Hm, yeah, I didn't really think that through. How about giving a model a fraction of either its own precomputed chain of thought, or the summarized version, and plotting curves of accuracy and further tokens used vs. % of CoT given to it? (To avoid systematic error from summaries moving information around, doing this with a chunked version and comparing at each chunk seems like a good idea.)
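Concretely, the loop I'm imagining is something like this (just a sketch; `model_generate`, the dataset fields, and the exact truncation scheme are hypothetical placeholders, not any particular API):

```python
# Sketch of the proposed experiment. `model_generate` is a hypothetical
# stand-in for whatever interface actually runs the model; it should return
# the final answer and the number of further reasoning tokens used.
def model_generate(question, cot_prefix):
    raise NotImplementedError("replace with a real model call")

def truncate_cot(cot_tokens, fraction):
    # Keep only the first `fraction` of the (original or summarized) CoT.
    return cot_tokens[: int(len(cot_tokens) * fraction)]

def run_curve(dataset, fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for f in fractions:
        correct, extra_tokens = 0, 0
        for ex in dataset:
            prefix = truncate_cot(ex["cot"], f)
            answer, n_tokens = model_generate(ex["question"], prefix)
            correct += int(answer == ex["gold"])
            extra_tokens += n_tokens
        results[f] = (correct / len(dataset), extra_tokens / len(dataset))
    return results  # accuracy and further tokens used, per fraction of CoT given
```

Running this once with the original CoTs and once with the summarized ones, chunk-aligned, gives the comparison I mean.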
Anyhow, thanks for the reply. I have now seen the last figure.
Do you have the performance on replacing CoTs with summarized CoTs without finetuning to produce them? Would be interesting.
"Steganography" I think give the wrong picture of what I expect - it's not that the model would be choosing a deliberately obscure way to encode secret information. It's just that it's going to use lots of degrees of freedom to try to get better results, often not what a human would do.
A clean example would be sometimes including more tokens than necessary, so that it can do more parallel processing at those tokens. This is quite diff...
Well, I'm disappointed.
Everything about misuse risks and going faster to Beat China, nothing about accident/systematic risks. I guess "testing for national security capabilities" is probably in practice code for "some people will still be allowed to do AI alignment work," but that's not enough.
I really would have hoped Anthropic could be realistic and say "This might go wrong. Even if there's no evil person out there trying to misuse AI, bad things could still happen by accident, in a way that needs to be fixed by changing what AI gets built in the first place, not just testing it afterwards. If this was like making a car, we should install seatbelts and maybe institute a speed limit."
I think it's about salience. If you "feel the AGI," then you'll automatically remember that transformative AI is a thing that's probably going to happen, when relevant (e.g. when planning AI strategy, or when making 20-year plans for just about anything). If you don't feel the AGI, then even if you'll agree when reminded that transformative AI is a thing that's probably going to happen, you don't remember it by default, and you keep making plans (or publishing papers about the economic impacts of AI or whatever) that assume it won't.
I agree that in some theoretical infinite-retries game (that doesn't allow the AI to permanently convince the human of anything), scheming has a much longer half-life than "honest" misalignment. But I'd emphasize your parenthetical. If you use a misaligned AI to help write the motivational system for its successor, or if a misaligned AI gets to carry out high-impact plans by merely convincing humans they're a good idea, or if the world otherwise plays out such that some AI system rapidly accumulates real-world power and that AI is misaligned, or if it turns out you iterate slowly and AI moves faster than you expected, you don't get to iterate as much as you'd like.
I have a lot of implicit disagreements.
Non-scheming misalignment is nontrivial to prevent and can have large, bad (and weird) effects.
This is because ethics isn't science, it doesn't "hit back" when the AI is wrong. So an AI can honestly mix up human systematic flaws with things humans value, in a way that will get approval from humans precisely because it exploits those systematic flaws.
Defending against this kind of "sycophancy++" failure mode doesn't look like defending against scheming. It looks like solving outer alignment really well.
Having good outer alignment incidentally prevents a lot of scheming. But the reverse isn't nearly as true.
I'm confused about how to parse this. One response is "great, maybe 'alignment' -- or specifically being a trustworthy assistant -- is a coherent direction in activation space."
Another is "shoot, maybe misalignment is convergent, it only takes a little bit of work to knock models into the misaligned basin, and it's hard to get them back." Waluigi effect type thinking.
My guess is neither of these.
If 'aligned' (i.e. performing the way humans want on the sorts of coding, question-answering, and conversational tasks you'd expect of a modern chatbot) beha...
I also would not say "reasoning about novel moral problems" is a skill (because of the is ought distinction)
It's a skill the same way "being a good umpire for baseball" takes skills, despite baseball being a social construct.[1]
I mean, if you don't want to use the word "skill," and instead use the phrase "computationally non-trivial task we want to teach the AI," that's fine. But don't make the mistake of thinking that because of the is-ought problem there isn't anything we want to teach future AI about moral decision-making. Like, clearly we want to...
Oh, I see; asymptotically, BB(6) is just O(1), and immediately halting is also O(1). I was real confused because their abstract said "the same order of magnitude," which must mean complexity class in their jargon (I first read it as "within a factor of 10.")
That average case=worst case headline is so wild. Consider a simple lock and key algorithm:
if input = A, run BB(6). else, halt.
Where A is some random number with no shorter description (K(A) ~ |A|).
Sure seems like worst case >> average case here. Anyone know what's going on in their paper that disposes of such examples?
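To spell the construction out (a sketch only; `long_computation` and the particular key are stand-ins, since BB(6) obviously can't be written down here):

```python
# Lock-and-key construction: enormous runtime on exactly one input,
# immediate halt on everything else.
SECRET_KEY = 0xD6F03A9B17C455E2  # stand-in for a random, incompressible A

def long_computation():
    pass  # placeholder for ~BB(6) steps of work

def lock_and_key(x):
    if x == SECRET_KEY:
        long_computation()  # worst case: ~BB(6) steps
    return "halt"           # average case over random inputs: ~constant
```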
Condition 2: Given that M_1 agents are not initially alignment faking, they will maintain their relative safety until their deferred task is completed.
- It would be rather odd if AI agents' behavior wildly changed at the start of their deferred task unless they are faking alignment.
"Alignment" is a bit of a fuzzy word.
Suppose I have a human musician who's very well-behaved, a very nice person, and I put them in charge of making difficult choices about the economy and they screw up and implement communism (or substitute something you don't like, if you like c...
I don't think this has much direct application to alignment, because although you can build safe AI with it, it doesn't differentially get us towards the endgame of AI that's trying to do good things and not bad things. But it's still an interesting question.
It seems like the way you're thinking about this, there's some directed relations you care about (the main one being "this is like that, but with some extra details") between concepts, and something is "real"/"applied" if it's near the edge of this network - if it doesn't have many relations directed t...
...This doesn't sound like someone engaging with the question in the trolley-problem-esque way that the paper interprets all of the results: gpt-4o-mini shows no sign of appreciating that the anonymous Muslim won't get saved if it takes the $30, and indeed may be interpreting the question in such a way that this does not hold.
In other words, I think gpt-4o-mini thinks it's being asked about which of two pieces of news it would prefer to receive about events outside its control, rather than what it would do if it could make precisely one of the options occur,
Neat! I think the same strategy works for the spectre tile (the 'true' Einstein tile) as well, which is what's going on in this set.
Just to copy over a clarification from EA forum: dates haven't been set yet, likely to start in June.
Another naive thing to do is ask about the length of the program required to get from one program to another, in various ways.
Given an oracle for p1, what's the complexity of the output of p2?
What if you had an oracle for all the intermediate states of p1?
What if instead of measuring the complexity, you measured the runtime?
What if instead of asking for the complexity of the output of p2, you asked for the complexity of all the intermediate states?
All of these are interesting but bad at being metrics. I mean, I guess you could symmetrize them. But I feel like there's a deeper problem, which is that by default they ignore the computational process, and it has to be tacked on as an extra.
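For concreteness, here's how I'd write the first couple of the above down (my notation, nothing standard):

```latex
% Asymmetric "distances" from p_1 to p_2:
\begin{align*}
  d_1(p_1 \to p_2) &= K\big(\mathrm{out}(p_2) \,\big|\, \text{oracle for } p_1\big) \\
  d_2(p_1 \to p_2) &= K\big(\mathrm{out}(p_2) \,\big|\, \text{oracle for all intermediate states of } p_1\big) \\
  d^{\mathrm{sym}}(p_1, p_2) &= d(p_1 \to p_2) + d(p_2 \to p_1) \quad \text{(one way to symmetrize)}
\end{align*}
```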
I'm not too worried about human flourishing only being a metastable state. The universe can remain in a metastable state longer than it takes for the stars to burn out.
So at first I thought this didn't include a step where the AI learns to care about things - it only learns to model things. But I think actually you're assuming that we can just directly use the model to pick actions that have predicted good outcomes - which are going to be selected as "good" according to the pre-specified P-properties. This is a flaw because it's leaving too much hard work for the specifiers to do - we want the environment to do way more work at selecting what's "good."
Second problem comes in two flavors - object level and meta level. The...
Multi-factor goals might mostly look like information learned in earlier steps getting expressed in a new way in later steps. E.g. an LLM that learns from a dataset that includes examples of humans prompting LLMs, and then is instructed to give prompts to versions of itself doing subtasks within an agent structure, may have emergent goal-like behavior from the interaction of these facts.
I think locating goals "within the CoT" often doesn't work, a ton of work is done implicitly, especially after RL on a model using CoT. What does that mean for attempts to teach metacognition that's good according to humans?
Would you agree that the Jeffrey-Bolker picture has stronger conditions? Rather than just needing the agent to tell you their preference ordering, they need to tell you a much more structured and theory-laden set of objects.
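To make "structured and theory-laden" concrete, the inputs to the two representation theorems look roughly like this (stated loosely, from the standard presentations):

```latex
% Savage: a preference ordering over acts, where acts are arbitrary
% functions from states to consequences.
\text{Savage:}\qquad \succeq \ \text{ over } \ \mathcal{A} = \{\, f : S \to C \,\}.

% Jeffrey-Bolker: a preference ordering over the propositions of a
% (complete, atomless) Boolean algebra; the theorem then delivers a
% probability P and desirability V satisfying (in the discrete case)
\text{Jeffrey-Bolker:}\qquad V(A) \;=\; \sum_{w \,\models\, A} V(w)\, P(w \mid A).
```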
If you're interested in austerity it might be interesting to try to weaken the Jeffrey-Bolker requirements, or strengthen the Savage ones, to zoom in on what lets you get austerity.
Also, richness is possible in the Savage picture, you just have to stretch the definitions of "state," "action," and "consequence." In terms of the functiona...
I'm glad you shared this, but it seems way overhyped. Nothing wrong with fine tuning per se, but this doesn't address open problems in value learning (mostly of the sort "how do you build human trust in an AI system that has to make decisions on cases where humans themselves are inconsistent or disagree with each other?").
Not being an author in any of those articles, I can only give my own take.
I use the term "weak to strong generalization" to talk about a more specific research-area-slash-phenomenon within scalable oversight (which I define like SO-2,3,4). As a research area, it usually means studying how a stronger student AI learns what a weaker teacher is "trying" to demonstrate, usually just with slight twists on supervised learning, and when that works well, that's the phenomenon.
It is not an alignment technique to me because the phrase "alignment technique" sounds li...
I honestly think your experiment made me temporarily more confused than an informal argument would have, but this was still pretty interesting by the end, so thanks.
I think there may be some things to re-examine about the role of self-experimentation in the rationalist community. Nootropics, behavioral interventions like impractical sleep schedules, maybe even meditation. It's very possible these reflect systematic mistakes by the rationalist community, ones that people should mostly be warned away from.
It's tempting to think of the model after steps 1 and 2 as aligned but lacking capabilities, but that's not accurate. It's safe, but it's not conforming to a positive meaning of "alignment" that involves solving hard problems in ways that are good for humanity. Sure, it can mouth the correct words about being good, but those words aren't rigidly connected to the latent capabilities the model has. If you try to solve this by pouring tons of resources into steps 1 and 2, you probably end up with something that learns to exploit systematic human errors during step 2.
I'd put the probability that some authority figure would use an order-following AI to get torturous revenge on me (probably for being part of a group they dislike) at quite slim. Maybe one in a few thousand, with more extreme suffering being less likely by a few more orders of magnitude? The probability that they have me killed for instrumental reasons, or otherwise waste the value of the future by my lights, is much higher - ten percent-ish, depending on my distribution over who's giving the orders. But this isn't any worse to me than being killed by an AI that wants to replace me with molecular smiley faces.
Yes. Current AI policy is like people in a crowded room fighting over who gets to hold a bomb. It's more important to defuse the bomb than it is to prevent someone you dislike from holding it.
That said, we're currently not near any satisfactory solutions to corrigibility. And I do think it would be better for the world if it were easier (by some combination of technical factors and societal factors) to build AI that works for the good of all humanity than to build equally-smart AI that follows the orders of a single person. So yes, we should focus research an...
One way of phrasing the AI alignment task is to get AIs to “love humanity” or to have human welfare as their primary objective (sometimes called “value alignment”). One could hope to encode these via simple principles like Asimov’s three laws or Stuart Russell’s three principles, with all other rules derived from these.
I certainly agree that Asimov's three laws are not a good foundation for morality! Nor are any other simple set of rules.
So if that's how you mean "value alignment," yes let's discount it. But let me sell you on a different idea you...
Yeah, that's true. I expect there to be a knowing/wanting split - AI might be able to make many predictions about how a candidate action will affect many slightly-conflicting notions of "alignment", or make other long-term predictions, but that doesn't mean it's using those predictions to pick actions. Many people want to build AI that picks actions based on short-term considerations related to the task assigned to it.
I think this framing probably undersells the diversity within each category, and the extent of human agency or mere noise that can jump you from one category to another.
Probably the biggest dimension of diversity is how much the AI is internally modeling the whole problem and acting based on that model, versus how much it's acting in feedback loops with humans. In the good category you describe it as acting more in feedback loops with humans, while in the bad category you describe it more as internally modeling the whole problem, but I think all quadrants ...
First, I agree with Dmitry.
But it does seem like maybe you could recover a notion of information bottleneck even without the Bayesian NN model. If you quantize real numbers to N-bit floating point numbers, there's a very real quantity which is "how many more bits do you need to exactly reconstruct X, given Z?" My suspicion is that for a fixed network, this quantity grows linearly with N (and if it's zero at 'actual infinity' for some network despite being nonzero in the limit, maybe we should ignore actual infinity).
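One way to write that quantity down (my notation; Q_N is rounding to an N-bit float):

```latex
R_N(X \mid Z) \;=\; H\!\left(Q_N(X) \,\middle|\, Z\right),
\qquad \text{suspicion: } R_N(X \mid Z) = \Theta(N) \ \text{ as } N \to \infty
\ \text{ for a fixed network.}
```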
But this isn't all that useful, it woul...
A process or machine prepares either |0> or |1> at random, each with 50% probability. Another machine prepares either |+> or |-> based on a coin flip, where |+> = (|0> + |1>)/root2, and |-> = (|0> - |1>)/root2. In your ontology these are actually different machines that produce different states.
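For reference, the two ensembles are different but their density matrices coincide:

```latex
\rho \;=\; \tfrac{1}{2}\,|0\rangle\langle 0| + \tfrac{1}{2}\,|1\rangle\langle 1|
\;=\; \tfrac{1}{2}\,|{+}\rangle\langle{+}| + \tfrac{1}{2}\,|{-}\rangle\langle{-}|
\;=\; \tfrac{1}{2}\, I .
```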
I wonder if this can be resolved by treating the randomness of the machines quantum mechanically, rather than having this semi-classical picture where you start with some randomness handed down from God. Suppose these machines us...
people who study very "fundamental" quantum phenomena increasingly use a picture with a thermal bath
Maybe talking about the construction of pointer states? That linked paper does it just as you might prefer, putting the Boltzmann distribution into a density matrix. But of course you could rephrase it as a probability distribution over states and the math goes through the same, you've just shifted the vibe from "the Boltzmann distribution is in the territory" to "the Boltzmann distribution is in the map."
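Spelled out, the two readings of the same formula (for a thermal state at inverse temperature β):

```latex
\rho \;=\; \frac{e^{-\beta H}}{Z}
\;=\; \sum_n \frac{e^{-\beta E_n}}{Z}\, |n\rangle\langle n|,
\qquad Z = \sum_n e^{-\beta E_n},
```

read either as a density matrix that's part of the territory, or as a probability p_n = e^{-βE_n}/Z over energy eigenstates that's part of the map.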
...Still, as soon as you introduce the notion of measure
Some combination of:
The real chad move is to put "TL;DR: See above^" for every section.
When you say there's "no such thing as a state," or "we live in a density matrix," these are statements about ontology: what exists, what's real, etc.
Density matrices use the extra representational power they have over states to encode a probability distribution over states. If we regard the probabilistic nature of measurements as something to be explained, putting the probability distribution directly into the thing we live in is what I mean by "explain with ontology."
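I.e., the general form is a classical mixture over pure states:

```latex
\rho \;=\; \sum_i p_i\, |\psi_i\rangle\langle\psi_i|,
\qquad p_i \ge 0, \quad \sum_i p_i = 1.
```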
Epistemology is about how we know stuff. If we start with a world that does not inherent...
Treating the density matrix as fundamental is bad because you shouldn't explain with ontology that which you can explain with epistemology.
Be sad.
For topological debate that's about two agents picking settings for simulation/computation, where those settings have a partial order that lets you take the "strictest" combination, a big class of fatal flaw would be if you don't actually have the partial order you think you have within the practical range of the settings - i.e. if some settings you thought were more accurate/strict are actually systematically less accurate.
In the 1D plane example, this would be if some specific length scales (e.g. exact powers of 1000) cause simulation error, but as long ...
Fun post, even though I don't expect debate of either form to see much use (because resolving tough real world questions offers too many chances for the equivalent of the plane simulation to have fatal flaws).
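To be concrete about the partial order and "strictest combination" above (a minimal sketch; the settings and the accuracy assumption are hypothetical, just to name the failure mode):

```python
# Hypothetical simulation settings with a partial order: higher resolution
# and smaller timestep count as "stricter". The protocol assumes stricter
# settings are never less accurate; the fatal flaw is when that assumption
# fails somewhere in the practical range.
from typing import NamedTuple

class Settings(NamedTuple):
    resolution: int   # grid points; larger = stricter
    timestep: float   # seconds; smaller = stricter

def stricter_or_equal(a: Settings, b: Settings) -> bool:
    return a.resolution >= b.resolution and a.timestep <= b.timestep

def strictest_combination(a: Settings, b: Settings) -> Settings:
    # The join used to combine the two debaters' proposed settings.
    return Settings(max(a.resolution, b.resolution), min(a.timestep, b.timestep))
```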
With bioweapons evals at least the profit motive of AI companies is aligned with the common interest here; a big benefit of your work comes from when companies use it to improve their product. I'm not at all confused about why people would think this is useful safety work, even if I haven't personally hashed out the cost/benefit to any degree of confidence.
I'm mostly confused about ML / SWE / research benchmarks.
I'm not sure but I have a guess. A lot of "normies" I talk to in the tech industry are anchored hard on the idea that AI is mostly a useless fad and will never get good enough to be useful.
They laugh off any suggestions that the trends point towards rapid improvements that can end up with superhuman abilities. Similarly, they completely dismiss arguments that AI might be used for building better AI. 'Feed the bots their own slop and they'll become even dumber than they already are!'
So, people who do believe that the trends are meaningful, and that we are near to a...
The mathematical structure in common is called a "measure."
I agree that there's something mysterious-feeling about probability in QM, though I mostly think that feeling is an illusion. There's a fact that's famous among physicists: the only way to put a 'measure' on a wavefunction that has nice properties (e.g. conservation over time) is to take the amplitude squared. So there's an argument: probability is a measure, and the only measure that makes sense is the amplitude-squared measure, therefore if probability is anything it's the amplitude squared. And i...
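Spelling out at least the conservation half of that famous fact (uniqueness is the harder part):

```latex
|\psi\rangle = \sum_i c_i |i\rangle
\quad\Longrightarrow\quad
\mu(i) = |c_i|^2,
\qquad
\sum_i \mu(i) = \langle \psi | \psi \rangle = 1,

% conserved under unitary evolution, since
\langle U\psi \,|\, U\psi \rangle
= \langle \psi \,|\, U^\dagger U \,|\, \psi \rangle
= \langle \psi | \psi \rangle .
```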
Could someone who thinks capabilities benchmarks are safety work explain the basic idea to me?
It's not all that valuable for my personal work to know how good models are at ML tasks. Is it supposed to be valuable to legislators writing regulation? To SWAT teams calculating when to bust down the datacenter door and turn the power off? I'm not clear.
But it sure seems valuable to someone building an AI to do ML research, to have a benchmark that will tell you where you can improve.
But clearly other people think differently than me.
I think the core argument is "if you want to slow down, or somehow impose restrictions on AI research and deployment, you need some way of defining thresholds. Also, most policymakers' cruxes appear to be that AI will not be a big deal, but if they thought it was going to be a big deal they would totally want to regulate it much more. Therefore, having policy proposals that can use future eval results as a triggering mechanism is politically more feasible, and also epistemically helpful, since it allows people who do think it will be a big deal to establish a track record".
I find these arguments reasonably compelling, FWIW.
At the very least, evals for automated ML R&D should be a very decent proxy for when it might be feasible to automate very large chunks of prosaic AI safety R&D.
Not representative of motivations for all people for all types of evals, but https://www.openphilanthropy.org/rfp-llm-benchmarks/, https://www.lesswrong.com/posts/7qGxm2mgafEbtYHBf/survey-on-the-acceleration-risks-of-our-new-rfps-to-study, https://docs.google.com/document/d/1UwiHYIxgDFnl_ydeuUq0gYOqvzdbNiDpjZ39FEgUAuQ/edit, and some posts in https://www.lesswrong.com/tag/ai-evaluations seem relevant.
Perhaps the reasoning is that the AGI labs already have all kinds of internal benchmarks of their own, no external help needed, but the progress on these benchmarks isn't a matter of public knowledge. Creating and open-sourcing these benchmarks, then, only lets the society better orient to the capabilities progress taking place, and so make more well-informed decisions, without significantly advantaging the AGI labs.
One big reason I might expect an AI to do a bad job at alignment research is if it doesn't do a good job (according to humans) of resolving cases where humans are inconsistent or disagree. How do you detect this in string theory research? Part of the reason we know so much about physics is humans aren't that inconsistent about it and don't disagree that much. And if you go to sub-topics where humans do disagree, how do you judge its performance (because 'be very convincing to your operators' is an objective with a different kind of danger).
Another potentia...
Thanks for the great reply :) I think we do disagree after all.
humans are definitionally the source of information about human values, even if it may be challenging to elicit this information from humans
Except about that - here we agree.
...Now, what this human input looks like could (and probably should) go beyond introspection and preference judgments, which, as you point out, can be unreliable. It could instead involve expert judgment from humans with diverse cultural backgrounds, deliberation and/or negotiation, incentives to encourage deep, reflecti
If God has ordained some "true values," and we're just trying to find out what pattern has received that blessing, then yes this is totally possible, God can ordain values that have their most natural description at any level He wants.
On the other hand, if we're trying to find good generalizations of the way we use the notion of "our values" in everyday life, then no, we should be really confident that generalizations that have simple descriptions in terms of chemistry are not going to be good.