All of Steven Byrnes's Comments + Replies

However: I expect that AIs capable of causing a loss of control scenario, at least, would also be capable of top-human-level alignment research.

Hmm. A million fast-thinking Stalin-level AGIs would probably have a better shot at taking control than at doing alignment research, I think?

Also, if there’s an alignment tax (or control tax), then that impacts the comparison, since the AIs doing alignment research are paying that tax whereas the AIs attempting takeover are not. (I think people have wildly different intuitions about how steep the alignment tax will be... (read more)

2Joe Carlsmith
Sure, maybe there's a band of capability where you can take over but you can't do top-human-level alignment research (and where your takeover plan doesn't involve further capabilities development that requires alignment). It's not the central case I'm focused on, though.

Is the thought here that the AIs trying to take over aren't improving their capabilities in a way that requires paying an alignment tax? E.g. if the tax refers to a comparison between (a) rushing forward on capabilities in a way that screws you on alignment vs. (b) pushing forward on capabilities in a way that preserves alignment, AIs that are fooming will want to do (b) as well (though they may have an easier time of it for other reasons). But if it refers to e.g. "humans will place handicaps on AIs that they need to ensure are aligned, including AIs they're trying to use for alignment research, whereas rogue AIs that have freed themselves from human control will also be able to get rid of these handicaps," then yes, that's an advantage the rogue AIs will have (though note that they'll still need to self-exfiltrate, etc.).

we don’t tend to imagine humans directly building superintelligence

Speak for yourself! Humans directly built AlphaZero, which is a superintelligence for board game worlds. So I don’t think it’s out of the question that humans could directly build a superintelligence for the real world. I think that’s my main guess, actually?

(Obviously, the humans would be “directly” building a learning algorithm etc., and then the trained weights come out of that.)

(OK sure, the humans will use AI coding assistants. But I think AI coding assistants, at least of the sort tha... (read more)

2Joe Carlsmith
Fair point, and it's plausible that I'm taking a certain subset of development pathways too much for granted. That is: I'm focused in the essay on threat models that proceed via the automation of capabilities R&D, but it's possible that this isn't necessary.

You could think, for example, that almost all of the core challenge of aligning a superintelligence is contained in the challenge of safely automating top-human-level alignment research. I’m skeptical, though. In particular: I expect superintelligent-level capabilities to create a bunch of distinctive challenges.

I’m curious if you could name some examples, from your perspective?

I’m just curious. I don’t think this is too cruxy—I think the cruxy-er part is how hard it is to safely automate top-human-level alignment research, not whether there are further di... (read more)

2Joe Carlsmith
Re: examples of why superintelligences create distinctive challenges: superintelligences seem more likely to be schemers, more likely to be able to systematically and successfully mess with the evidence provided by behavioral tests and transparency tools, harder to exert option-control over, better able to identify and pursue strategies humans hadn't thought of, harder to supervise using human labor, etc.

If you're worried about shell games, it's OK to round off "alignment MVP" to hand-off-ready AI, and to assume that the AIs in question need to be able to make and pursue coherent long-term plans.[1] I don't think the analysis in the essay alters that much (for example, I think very little rests on the idea that you can get by with myopic AIs), and better to err on the side of conservatism.

I wanted to set aside "hand-off" here because in principle, you don't actually need to hand off until humans stop being able to meaningfully contribute to the safety/quality of the automated alignment work, which doesn't necessarily start around the time we have AIs capable of top-human-level alignment work (e.g., human evaluation of the research, or involvement in other aspects of control -- e.g., providing certain kinds of expensive, trusted supervision -- could persist after that). And when exactly you hand off depends on a bunch of more detailed, practical trade-offs.

As I said in the post, one way that humans still being involved might not bottleneck the process is if they're only reviewing the work to figure out whether there's a problem they need to actively intervene on. And I think you can still likely radically speed up and scale up your alignment research even if, e.g., you still care about humans reviewing and understanding the work in question.

1. ^ Though for what it's worth I don't think that the task of assessing "is this a good long-term plan for achieving X" needs itself to involve long-term optimization for X. For example: you coul

Thanks for writing this! Leaving some comments with reactions as I was reading, not all very confident, and sorry if I missed or misunderstood things you wrote.

Problems with these evaluation techniques can arise in attempting to automate all sorts of domains (I’m particularly interested in comparisons with (a) capabilities research, and (b) other STEM fields). And I think this should be a source of comfort. In particular: these sorts of problems can slow down the automation of capabilities research, too. And to the extent they’re a bottleneck on all sorts

... (read more)
2Joe Carlsmith
I'm a bit confused about your overall picture here. Sounds like you're thinking something like:  Is that roughly right? 

I think that OP’s discussion of “number-go-up vs normal science vs conceptual research” is an unnecessary distraction, and he should have cut that part and just talked directly about the spectrum from “easy-to-verify progress” to “hard-to-verify progress”, which is what actually matters in context.

Partly copying from §1.4 here, you can (A) judge ideas via new external evidence, and/or (B) judge ideas via internal discernment of plausibility, elegance, self-consistency, consistency with already-existing knowledge and observations, etc. There’s a big ra... (read more)

4Joe Carlsmith
I'm happy to say that easy-to-verify vs. hard-to-verify is what ultimately matters, but I think it's important to be clear about what makes something easier vs. harder to verify, so that we can be clear about why alignment might or might not be harder than other domains. And imo empirical feedback loops and formal methods are amongst the most important factors there.

That was an excellent summary of how things seem to normally work in the sciences, and explains it better than I would have. Kudos.

Thanks!

Hmm. I think there’s an easy short-term ‘solution’ to Goodhart’s law for AI capabilities, which is to give humans a reward button. Then the reward function rewards exactly what the person wants, by definition (until the AI can grab the button). There’s no need to define metrics or whatever, right?

(This is mildly related to RLHF, except that RLHF makes models dumber, whereas I’m imagining some future RL paradigm wherein the RL training makes models smarter.)
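To make the “reward button” setup concrete, here’s a minimal sketch of the kind of loop I have in mind; the agent interface and function names are hypothetical placeholders, not any real training API:

```python
# Minimal sketch of the "reward button" idea (all names are illustrative placeholders).
# The reward function is nothing but the human's button press: no metrics or benchmarks to define.

def reward_button_loop(agent, get_observation, human_pressed_button, num_steps=1000):
    """agent is assumed to expose act(obs) and update(obs, action, reward)."""
    for _ in range(num_steps):
        obs = get_observation()
        action = agent.act(obs)
        # The human watches what the agent just did and decides whether to press the button.
        reward = 1.0 if human_pressed_button(obs, action) else 0.0
        agent.update(obs, action, reward)  # online RL update; the reward *is* the button, full stop
```

(Of course, the worry below is about how well humans actually press the button, and about the AI eventually grabbing it.)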

I think your complaint is that people would be bad at pressing the button, even by their ow... (read more)

2Noosphere89
For metrics, I'm talking about stuff like benchmarks and evals for AI capabilities, like METR evals. I have a couple of complaints, assuming this is the strategy we go with to make automating capabilities safe from the RL sycophancy problem:

1. I think this basically rules out fast takeoffs/most of the value of what AI does, and this is true regardless of whether pure software-only singularities/fast takeoffs are possible at all. I basically agree with @johnswentworth about long tails, which means that having an AI automate 90% of a job, with humans grading the last 10% using a reward button, loses basically all value compared to the AI being able to do the job without humans grading the reward. Another way to say it is that I think involving humans in an operation that you want to automate away with AI immediately erases most of the value of what the AI does in ~all complex domains, so this solution cannot scale at all: https://www.lesswrong.com/posts/Nbcs5Fe2cxQuzje4K/value-of-the-long-tail

2. Similar to my last complaint, I think relying on humans to do the grading (because AIs cannot effectively grade themselves, since the reward function unintentionally causes sycophancy, leading the AIs to make code and papers that look good and are rewarded by metrics/evals) is very expensive and slow. This could get you to human-level capabilities, but because the issue of specification gaming is not resolved, you can't scale the AI's capability at a domain beyond what an expert human could do without worrying that exploration hacking/reward hacking/sycophancy is coming back, preventing the AI from being superhumanly capable like AlphaZero.

3. I'm not as convinced as you that solutions to this problem that allow AIs to automatically grade themselves, with reward functions that don't need a human in the loop, don't transfer to solutions to various alignment problems. A large portion of the issue is that you can't just have humans interfere in the AI's d

Thanks! Your comment was very valuable, and helped spur me to write Self-dialogue: Do behaviorist rewards make scheming AGIs? (as I mention in that post). Sorry I forgot to reply to your comment directly.

To add on a bit to what I wrote in that post, and reply more directly…

So far, you’ve proposed that we can do brain-like AGI, and the reward function will be a learned function trained by labeled data. The data, in turn, will be “lots of synthetic data that always shows the AI acting aligned even when the human behaves badly, as well as synthetic data to ma... (read more)

(1) Yeah AI self-modification is an important special case of irreversible actions, where I think we both agree that (mis)generalization from the reward history is very important. (2) Yeah I think we both agree that it’s hopeless to come up with a reward function for judging AI behavior as good vs bad, that we can rely on all the way to ASI.

Seems like an important difference here is that you’re imagining train-then-deploy whereas I’m imagining continuous online learning. So in the model I’m thinking about, there isn’t a fixed set of “reward data”, rather “reward data” keeps coming in perpetually, as the agent does stuff. Of course, as I said above, (mis)generalization from a fixed set of reward data remains an issue for the two special cases of irreversible actions & deliberately not exploring certain states.
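To spell out that contrast with a toy sketch (hypothetical agent/environment interfaces, nothing real):

```python
# Train-then-deploy: "reward data" is a fixed dataset, and learning stops at deployment.
def train_then_deploy(agent, reward_dataset, env):
    for obs, action, reward in reward_dataset:      # fixed set of reward data
        agent.update(obs, action, reward)
    while True:                                      # deployment: act, but never update again
        env.step(agent.act(env.observe()))

# Continuous online learning: reward data keeps coming in perpetually as the agent does stuff.
def continuous_online_learning(agent, env, reward_fn):
    while True:
        obs = env.observe()
        action = agent.act(obs)
        env.step(action)
        agent.update(obs, action, reward_fn(obs, action))  # learning never stops
```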

I didn’t intend (A) & (B) to be a precise and complete breakdown.

AIs might le

... (read more)
3Towards_Keeperhood
Thx. I don't really imagine train-then-deploy, but I think that (1) when the AI becomes coherent enough it will prevent further value drift, and (2) the AI eventually needs to solve very hard problems where we won't have sufficient understanding to judge whether what the AI did is actually good.

I thought of a fun case in a different reply: Harry is a random OpenAI customer and writes in the prompt “Please debug this code. Don’t cheat.” Then o3 deletes the unit tests instead of fixing the code. Is this “specification gaming”? No! Right? If we define “the specification” as what Harry wrote, then o3 is clearly failing the specification. Do you agree?

4Kaj_Sotala
Reminds me of
3Rauno Arike
Yep, I agree that there are alignment failures which have been called reward hacking that don't fall under my definition of specification gaming, including your example here. I would call your example specification gaming if the prompt was "Please rewrite my code and get all tests to pass": in that case, the solution satisfies the prompt in an unintended way. If the model starts deleting tests with the prompt "Please debug this code," then that just seems like a straightforward instruction-following failure, since the instructions didn't ask the model to touch the code at all. "Please rewrite my code and get all tests to pass. Don't cheat." seems like a corner case to me—to decide whether that's specification gaming, we would need to understand the implicit specifications that the phrase "don't cheat" conveys.

Thanks for the examples!

Yes I’m aware that many are using terminology this way; that’s why I’m complaining about it :) 

I think your two 2018 Victoria Krakovna links (in context) are both consistent with my narrower (I would say “traditional”) definition. For example, the CoastRunners boat is actually getting a high RL reward by spinning in circles. Even for non-RL optimization problems that she mentions (e.g. evolutionary optimization), there is an objective which is actually scoring the result highly. Whereas for an example of o3 deleting a unit test ... (read more)

4Kei
On a second review it seems to me the links are consistent with both definitions. Interestingly the google sheet linked in the blog post, which I think is the most canonical collection of examples of specification gaming, contains examples of evaluation-time hacking, like METR finding that o1-preview would sometimes pretend to fine-tune a model to pass an evaluation. Though that's not definitive, and of course the use of the term can change over time.

I agree that most historical discussion of this among people as well as in the GDM blog post focuses on RL optimization and situations where a model is literally getting a high RL reward. I think this is partly just contingent on these kinds of behaviors historically tending to emerge in an RL setting and not generalizing very much between different environments. And I also think the properties of reward hacks we're seeing now are very different from the properties we saw historically, and so the implications of the term reward hack now are often different from the implications of the term historically. Maybe this suggests expanding the usage of the term to account for the new implications, or maybe it suggests just inventing a new term wholesale.

I suppose the way I see it is that for a lot of tasks, there is something we want the model to do (which I'll call the goal), and a literal way we evaluate the model's behavior (which I'll call the proxy, though we can also use the term specification). In most historical RL training, the goal was not given to the model, it lay in the researcher's head, and the proxy was the reward signal that the model was trained on. When working with LLMs nowadays, whether it be during RL training or test-time evaluation or when we're just prompting a model, we try to write down a good description of the goal in our prompt. What the proxy is depends on the setting. In an RL setting it's the reward signal, and in an explicit evaluation it's the evaluation function. When prompting, we somet

Thanks!

I’m now much more sympathetic to a claim like “the reason that o3 lies and cheats is (perhaps) because some reward-hacking happened during its RL post-training”.

But I still think it’s wrong for a customer to say “Hey I gave o3 this programming problem, and it reward-hacked by editing the unit tests.”

4Cole Wyeth
Yes, you’re technically right. 

I’ve gone back and forth about whether I should be thinking more about (A) “egregious scheming followed by violent takeover” versus (B) more subtle things e.g. related to “different underlying priors for doing philosophical value reflection”. This post emphasizes (A), because it’s in response to the Silver & Sutton proposal that doesn’t even clear that low bar of (A). So forget about (B).

There’s a school of thought that says that, if we can get past (A), then we can muddle our way through (B) as well, because if we avoid (A) then we get something like ... (read more)

1Towards_Keeperhood
Thanks! It's nice that I'm learning more about your models.

(A) seems much more general than what I would call "reward specification failure". The way I use "reward specification" is:

* If the AI has as goal "get reward" (or sth else) rather than "whatever humans want" because it better fits the reward data, then it's a reward specification problem.
* If the AI has as goal "get reward" (or sth else) rather than "whatever humans want" because it fits the reward data similarly well and it's the simpler goal given the architecture, it's NOT a reward specification problem.
* (This doesn't seem to me to fit your description of "B".)
* (Related.) I might count the following as a reward specification problem, but maybe not; maybe another name would be better:
  * The AI mostly gets reward for solving problems which aren't much about human values specifically, so the AI may mainly learn to value insights for solving problems better rather than human values.

(B) seems to me like an overly specific phrasing, and there are many stages where misgeneralization may happen:

* when the AI transitions to thinking in goal-directed ways (instead of following more behavioral heuristics or value function estimates)
* when the AI starts modelling itself and forms a model of what values it has (where the model might mismatch what is optimized on the object level)
* when the AI's ontology changes and it needs to decide how to rebind value-laden concepts
* when the AI encounters philosophical problems like Pascal's mugging

Section 4 of Jeremy's and Peter's report also shows some more ways an AI might fail to learn the intended goal without it being due to reward specification[1], though it doesn't use your model-based RL frame.

Also, I don't think A and B are exhaustive. Other somewhat speculative problems include:

* A mesaoptimizer emerges under selection pressure and tries to gain control of the larger AI it is in while staying undetected. (Sorta like cancer

No question that e.g. o3 lying and cheating is bad, but I’m confused why everyone is calling it “reward hacking”.

Let’s define “reward hacking” (a.k.a. specification gaming) as “getting a high RL reward via strategies that were not desired by whoever set up the RL reward”. Right?

If so, well, all these examples on X etc. are from deployment, not training. And there’s no RL reward at all in deployment. (Fine print: Maybe there are occasional A/B tests or thumbs-up/down ratings in deployment, but I don’t think those have anything to do with why o3 lies and che... (read more)

4faul_sname
So I think what's going on with o3 isn't quite standard-issue specification gaming either. It feels like, when I use it, if I ever accidentally say something which pattern-matches something which would be said in an eval, o3 exhibits the behavior of trying to figure out what metric it could be evaluated by in this context and how to hack that metric. This happens even if the pattern is shallow and we're clearly not in an eval context. I'll try to see if I can get a repro case which doesn't have confidential info.
5Kei
It's pretty common for people to use the terms "reward hacking" and "specification gaming" to refer to undesired behaviors that score highly as per an evaluation or a specification of an objective, regardless of whether that evaluation/specification occurs during RL training. I think this is especially common when there is some plausible argument that the evaluation is the type of evaluation that could appear during RL training, even if it doesn't actually appear there in practice. Some examples of this:

* OpenAI described o1-preview succeeding at a CTF task in an undesired way as reward hacking.
* Anthropic described Claude 3.7 Sonnet giving an incorrect answer aligned with a validation function in a CoT faithfulness eval as reward hacking. They also used the term when describing the rates of models taking certain misaligned specification-matching behaviors during an evaluation after being fine-tuned on docs describing that Claude does or does not like to reward hack.
* This relatively early DeepMind post on specification gaming and the blog post from Victoria Krakovna that it came from (which might be the earliest use of the term specification gaming?) also give a definition consistent with this.

I think the literal definitions of the words in "specification gaming" align with this definition (although interestingly not the words in "reward hacking"). The specification can be operationalized as a reward function in RL training, as an evaluation function, or even via a prompt. I also think it's useful to have a term that describes this kind of behavior independent of whether or not it occurs in an RL setting. Maybe this should be reward hacking and specification gaming. Perhaps as Rauno Arike suggests it is best for this term to be specification gaming, and for reward hacking to exclusively refer to this behavior when it occurs during RL training. Or maybe due to the confusion it should be a whole new term entirely. (I'm not sure that the term "ruthless cons

I agree people often aren't careful about this.

Anthropic says

During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments . . . . This undesirable special-casing behavior emerged as a result of "reward hacking" during reinforcement learning training.

Similarly OpenAI suggests that cheating behavior is due to RL.

I think that using 'reward hacking' and 'specification gaming' as synonyms is a significant part of the problem. I'd argue that for LLMs, which can learn task specifications not only through RL but also through prompting, it makes more sense to keep those concepts separate, defining them as follows:

  • Reward hacking—getting a high RL reward via strategies that were not desired by whoever set up the RL reward.
  • Specification gaming—behaving in a way that satisfies the literal specification of an objective without achieving the outcome intended by whoever specifi
... (read more)
4cubefox
There was a recent in-depth post on reward hacking by @Kei (e.g. referencing this) who might have more to say about this question. Though I also wanted to just add a quick comment about this part: It is not quite the same, but something that could partly explain lying is if models get the same amount of reward during training, e.g. 0, for a "wrong" solution as they get for saying something like "I don't know". Which would then encourage wrong solutions insofar as they at least have a potential of getting reward occasionally when the model gets the expected answer "by accident" (for the wrong reasons). At least something like that seems to be suggested by this: Source: Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
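(To spell out the incentive in that grading scheme with a stylized calculation: assume binary grading where a correct final answer earns reward 1 and anything else, including "I don't know", earns 0, and let p be the probability that a guessed answer happens to be correct. Then

$$
\mathbb{E}[\text{reward}\mid\text{guess}] \;=\; p\cdot 1 + (1-p)\cdot 0 \;=\; p \;\ge\; 0 \;=\; \mathbb{E}[\text{reward}\mid\text{"I don't know"}],
$$

so guessing weakly dominates admitting uncertainty, and strictly dominates whenever p > 0.)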

Thanks! I’m assuming continuous online learning (as is often the case for RL agents, but is less common in an LLM context). So if the agent sees a video of the button being pressed, they would not feel a reward immediately afterwards, and they would say “oh, that’s not the real thing”.

(In the case of humans, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable, and then turns it off and probably won’t bother even trying again in th... (read more)

Thanks!

Yeah, a pretty large crux is how far can you improve RL algorithms without figuring out a way to solve specification gaming issues, because this is what controls whether we should expect competent misgeneralization of goals we don't want, or reward hacking/wireheading that fails to take over the world.

I think this is revealing some differences of terminology and intuitions between us. To start with, in the §2.1 definitions, both “goal misgeneralization” and “specification gaming” (a.k.a. “reward hacking”) can be associated with “competent pursuit of... (read more)

4Noosphere89
Yeah, I was ignoring the case where reward hacking actually led to real-world dangers, which was not a good thing (though in my defense, one could argue that reward hacking/reward overoptimization may by default lead to wireheading-type behavior without tools to broadly solve specification gaming).

I'm pointing out that Goodhart's law applies to AI capabilities, too, and saying that what the reward function rewards is not necessarily equivalent to the capabilities that you want from AI, because the metrics that you give the AI to optimize are likely not equivalent to what capabilities you want from the AI. In essence, I'm saying the difference you identify for AI alignment is also a problem for AI capabilities, and I'll quote a post of yours below: https://www.lesswrong.com/posts/wucncPjud27mLWZzQ/intro-to-brain-like-agi-safety-10-the-alignment-problem#10_3_1_Goodhart_s_Law

I think the crux is you might believe that capabilities targets are easier to encode into reward functions that don't require that much fine specification, or you think that specification of rewards will happen more effectively for capabilities targets than alignment targets. Whereas I'm not as convinced as you that it would literally be as easy as you say to solve issues like "making up papers that look good to humans but don't actually work, making codebases that are rewarded by the RL process but don't actually work, and more generally sycophancy/reward overoptimization" solely by maximizing rewards in sparse, complicated environments without also being able to solve significant chunks of the alignment problem.

In particular, the claim "If nothing else, “the AIs are actually making money” can be tied to the reward function" has about as much detail as the alignment plan below, which is to say it is easy to describe, but not easy to actually implement: https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ

In essence, I'

even if during training it already knew that buttons are often connected to wires

I was assuming that the RL agent understands how the button works and indeed has a drawer of similar buttons in its basement which it attaches to wires all the time for its various projects.

A slightly smarter agent would turn its gaze slightly closer to the reward itself.

I’d like to think I’m pretty smart, but I don’t want to take highly-addictive drugs.

Although maybe your perspective is “I don’t want to take cocaine → RL is the wrong way to think about what the human brain is... (read more)

6cousin_it
My perspective (well, the one that came to me during this conversation) is indeed "I don't want to take cocaine -> human-level RL is not the full story". That our attachment to real world outcomes and reluctance to wirehead is due to evolution-level RL, not human-level. So I'm not quite saying all plans will fail; but I am indeed saying that plans relying only on RL within the agent itself will have wireheading as attractor, and it might be better to look at other plans.

It's just awfully delicate. If the agent is really dumb, it will enjoy watching videos of the button being pressed (after all, they cause the same sensory experiences as watching the actual button being pressed). Make the agent a bit smarter, because we want it to be useful, and it'll begin to care about the actual button being pressed. But add another increment of smart, overshoot just a little bit, and it'll start to realize that behind the button there's a wire, and the wire leads to the agent's own reward circuit and so on. Can you engineer things just right, so the agent learns to care about just the right level of "realness"? I don't know, but I think in our case evolution took a different path.

It did a bunch of learning by itself, and saddled us with the result: "you'll care about reality in this specific way". So maybe when we build artificial agents, we should also do a bunch of learning outside the agent to capture the "realness"? That's the point I was trying to make a couple comments ago, but maybe didn't phrase it well.

a big driver of past pure RL successes like AlphaZero was that they were in domains where reward hacking was basically a non-concern, because it was easy to make unhackable environments like many games, and combining this with a lot of data and self-play allowed pure RL to scale to vastly superhuman heights without requiring the insane compute that evolution spent to make us good at doing RL tasks (which was 10^42 FLOPs at a minimum…)

I don’t follow this part. If we take “human within-lifetime learning” as our example, rather than evolution, (and we should!), then we... (read more)

4Noosphere89
Fair point.

Yeah, a pretty large crux is how far can you improve RL algorithms without figuring out a way to solve specification gaming issues, because this is what controls whether we should expect competent misgeneralization of goals we don't want, or reward hacking/wireheading that fails to take over the world. I too basically agree that the usual agent debugging loop will probably solve near-term issues.

To illustrate a partially concrete story of how this debugging loop could fail in a way that could force them to solve the AI specification gaming problem, imagine that we live in a world where something like fast take-off/software-only singularities can happen, and we task 1,000,000 AI researchers with automating their own research. However, we keep having issues with the AI researchers' reward functions, because we have a scaled-up version of the problem with o3/Sonnet 3.7: while the labs managed to patch the problems in o3/Sonnet 3.7, they didn't actually solve the underlying problem in a durable way. Scaled-up versions of the problems that plagued o3/Sonnet 3.7 (making up papers that look good to humans but don't actually work, making codebases that are rewarded by the RL process but don't actually work, and more generally sycophancy/reward overoptimization) are such an attractor basin that fixes don't work without near-unhackable reward functions, and everything from benchmarks to code is aggressively goodharted and reward-optimized, meaning AI capabilities stop growing until theoretical fixes for specification gaming are obtained.

This is my own optimistic story of how AI capabilities could be bottlenecked on solving the alignment problem of specification gaming.

Thanks!

Why does this happen in the first place

There are lots of different reasons. Here’s one vignette that popped into my head. Billy and Joey are 8yo’s. Joey read a book about trains last night, and Billy read a book about dinosaurs last night. Each learned something new and exciting to them, because trains and dinosaurs are cool (big and loud and mildly scary, which triggers physiological arousal and thus play drive). Now they’re meeting each other at recess, and each is very eager to share what they learned with the other, and thus they’re in competiti... (read more)

4Wei Dai
1. A counterexample to this is if humans and AIs both tend to conclude after a lot of reflection that they should be axiologically selfish but decision-theoretically cooperative (with other strong agents); then if we hand off power to AIs, they'll cooperate with each other (and any other powerful agents in the universe or multiverse) to serve their own collective values, but we humans will be screwed.
2. Another problem is that we're relatively confident that at least some humans can reason "successfully", in the sense of making philosophical progress, but we don't know the same about AI. There seemingly are reasons to think it might be especially hard for AI to learn, and easy for AI to learn something undesirable instead, like optimizing for how persuasive their philosophical arguments are to (certain) humans.
3. Finally, I find your arguments against moral realism somewhat convincing, but I'm still pretty uncertain, and think the arguments I gave in Six Plausible Meta-Ethical Alternatives for the realism side of the spectrum are still somewhat convincing as well; I don't want to bet the universe on or against any of these positions.

The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel.

I think RL agents (at least, of the type I’ve been thinking about) tend to “want” salient real-world things that have (in the past) tended to immediately precede the reward signal. They don’t “want” the reward signal itself—at least, not primarily. This isn’t a special thing that requires non-behaviorist rewards, rather it’s just the default outcome of TD learning (when set up properly). I guess you’re disagreeing... (read more)
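Here’s the dynamic I have in mind, as a toy tabular TD(0) sketch (all state names, rewards, and hyperparameters below are invented for illustration): the state that reliably precedes the primary reward ends up with high learned value, so that real-world state, rather than the reward signal per se, is what the agent comes to treat as “good”.

```python
# Toy tabular TD(0) illustration (made-up states and numbers, not a claim about any real system):
# a state that reliably precedes primary reward ends up with high learned value.

alpha, gamma = 0.1, 0.9
V = {"wander": 0.0, "see_candy_store": 0.0, "eat_candy": 0.0, "done": 0.0}
episode = [("wander", 0.0, "see_candy_store"),
           ("see_candy_store", 0.0, "eat_candy"),
           ("eat_candy", 1.0, "done")]  # primary reward arrives only at the last transition

for _ in range(500):
    for state, reward, next_state in episode:
        # TD(0) update: nudge V(state) toward reward + discounted value of the successor state
        V[state] += alpha * (reward + gamma * V[next_state] - V[state])

print(V)  # "see_candy_store" ends up highly valued: the salient precursor to reward is what gets "wanted"
```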

4cousin_it
Do you think the agent will care about the button and ignore the wire, even if during training it already knew that buttons are often connected to wires? Or does it depend on the order in which the agent learns things? In other words, are we hoping that RL will make the agent focus on certain aspects of the real world that we want it to focus on? If that's the plan, to me at first glance it seems a bit brittle. A slightly smarter agent would turn its gaze slightly closer to the reward itself. Or am I still missing something?

I expect “the usual agent debugging loop” (§2.2) to keep working. If o3-type systems can learn that “winding up with the right math answer is good”, then they can learn “flagrantly lying and cheating are bad” in the same way. Both are readily-available feedback signals, right? So I think o3’s dishonesty is reflecting a minor problem in the training setup that the big AI companies will correct in the very near future without any new ideas, if they haven’t already. Right? Or am I missing something?

That said, I also want to re-emphasize that both myself and S... (read more)

I think it's important to note that a big driver of past pure RL successes like AlphaZero was that they were in domains where reward hacking was basically a non-concern, because it was easy to make unhackable environments like many games, and combining this with a lot of data and self-play allowed pure RL to scale to vastly superhuman heights without requiring the insane compute that evolution spent to make us good at doing RL tasks (which was 10^42 FLOPs at a minimum, which is basically unachievable without a well developed space industry or an intelligence explosi... (read more)

under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?

It’s hard for me to give a perfectly confident answer because I don’t understand everything about human social instincts yet :) But here are some ways I’m thinking about that:

  • Best friends, and friend groups, do exist, and people do really enjoy and value them.
  • The “drive to feel liked / admired” is particularly geared towards feeling liked / admired by people who feel important to you, i.e. where interacting with t
... (read more)
4Wei Dai
Why does this happen in the first place, instead of people just wanting to talk about the same things all the time, in order to max out social rewards? Where does interest in trains and dinosaurs even come from? They seem to be purely or mostly social, given lack of practical utility, but then why divergence in interests? (Understood that you don't have a complete understanding yet, so I'm just flagging this as a potential puzzle, not demanding an immediate answer.)

I'm only "happy" to "entrust the future to the next generation of humans" if I know that they can't (i.e., don't have the technology to) do something irreversible, in the sense of foreclosing a large space of potential positive outcomes, like locking in their values, or damaging the biosphere beyond repair. In other words, up to now, any mistakes that a past generation of humans made could be fixed by a subsequent generation, and this is crucial for why we're still in an arguably ok position. However AI will make it false very quickly by advancing technology. So I really want the AI transition to be an opportunity for improving the basic dynamic of "the next generation of humans will [figure out new things and invent new technologies], causing self-generated distribution shifts, and ending up going in unpredictable directions", for example by improving our civilizational philosophical competency (which may allow distributional shifts to be handled in a more principled way), and not just say that it's always been this way, so it's fine to continue.

I'm going to read the rest of that post and your other posts to understand your overall position better, but at least in this section, you come off as being a bit too optimistic or nonchalant from my perspective...

“This problem has a solution (and one that can be realistically implemented)” is another important crux, I think. As I wrote here: “For one thing, we don’t actually know for sure that this technical problem is solvable at all, until we solve it. And if it’s not in fact solvable, then we should not be working on this research program at all. If it's not solvable, the only possible result of this research program would be “a recipe for summoning demons”, so to speak. And if you’re scientifically curious about what a demon-summoning recipe would look lik... (read more)

The RL algorithms that people talk about in AI traditionally feature an exponentially-discounted sum of future rewards, but I don’t think there are any exponentially-discounted sums of future rewards in biology (more here). Rather, you have an idea (“I’m gonna go to the candy store”), and the idea seems good or bad, and if it seems sufficiently good, then you do it! (More here.) It can seem good for lots of different reasons. One possible reason is: the idea is immediately associated with (non-behaviorist) primary reward. Another possible reason is: the idea invol... (read more)
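(For concreteness, the standard textbook objective I’m contrasting with is the exponentially-discounted return, the usual Sutton-and-Barto-style formula:

$$
G_t \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad 0 \le \gamma < 1,
$$

where r is the reward signal and γ is the discount factor.)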

2cousin_it
I thought about it some more and want to propose another framing. The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel. The agent's feelings can even be identical in future A vs future B, but the agent can choose future A anyway. Or maybe one of the futures won't even have feelings involved: imagine an environment where any mistake kills the agent. In such an environment, RL is impossible.

The reason we can function in such environments, I think, is because we aren't the main learning process involved. Evolution is. It's a kind of RL for which the death of one creature is not the end. In other words, we can function because we've delegated a lot of learning to outside processes, and do rather little of it ourselves. Mostly we execute strategies that evolution has learned, on top of that we execute strategies that culture has learned, and on top of that there's a very thin layer of our own learning. (Btw, here I disagree with you a bit: I think most of human learning is imitation. For example, the way kids pick up language and other behaviors from parents and peers.)

This suggests to me that if we want the rubber to meet the road - if we want the agent to have behaviors that track the world, not just the agent's own feelings - then the optimization process that created the agent cannot be the agent's own RL. By itself, RL can only learn to care about "behavioral reward" as you put it. Caring about the world can only occur if the agent "inherits" that caring from some other process in the world, by makeup or imitation.

This conclusion might be a bit disappointing, because finding the right process to "inherit" from isn't easy. Evolution depends on one specific goal (procreation) and is not easy to adapt to other goals. However, evolution isn't the only such process. There is also culture, and there is also human intelligence, which hopefully tracks reality a little bit. So if w

My model is simpler, I think. I say: The human brain is some yet-to-be-invented variation on actor-critic model-based reinforcement learning. The reward function (a.k.a. “primary reward” a.k.a. “innate drives”) has a bunch of terms: eating-when-hungry is good, suffocation is bad, pain is bad, etc. Some of the terms are in the category of social instincts, including something that amounts to “drive to feel liked / admired”.

(Warning: All of these English-language descriptions like “pain is bad” are approximate glosses on what’s really going on, which is only... (read more)
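As a cartoon of what I mean by a reward function with “a bunch of terms”, here’s a sketch; every term, weight, and variable name below is invented purely for illustration, and each is itself just a gloss on messier underlying circuitry:

```python
# Cartoon multi-term "primary reward" / "innate drives" function.
# Every term and weight is invented for illustration; not a claim about actual neuroscience.
def primary_reward(state):
    r = 0.0
    r += 1.0 * state["eating_when_hungry"]   # innate drive: eating-when-hungry is good
    r -= 5.0 * state["suffocating"]          # innate drive: suffocation is bad
    r -= 2.0 * state["pain"]                 # innate drive: pain is bad
    r += 1.5 * state["feels_liked_admired"]  # social instinct: "drive to feel liked / admired"
    # ...plus many more terms
    return r
```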

4Wei Dai
My intuition says reward hacking seems harder to solve than this (even in EEA), but I'm pretty unsure. One example is, under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other? When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?

Solving the Riemann hypothesis is not a “primary reward” / “innate drive” / part of the reward function for humans. What is? Among many other things, (1) the drive to satisfy curiosity / alleviate confusion, and (2) the drive to feel liked / admired. And solving the Riemann hypothesis leads to both of those things. I would surmise that (1) and/or (2) is underlying people’s desire to solve the Riemann hypothesis, although that’s just a guess. They’re envisioning solving the Riemann hypothesis and thus getting the (1) and/or (2) payoff.

So one way that people... (read more)

2cousin_it
I think it helps. The link to "non-behaviorist rewards" seems the most relevant. The way I interpret it (correct me if I'm wrong) is that we can have different feelings in the present about future A vs future B, and act to choose one of them, even if we predict our future feelings to be the same in both cases. For example, button A makes a rabbit disappear and gives you an amnesia pill, and button B makes a rabbit disappear painfully and gives you an amnesia pill. The followup question then is, what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way. Do you already have some crisp answers about this?

I suggest making anonymity compulsory

It’s an interesting idea, but the track records of the grantees are important information, right? And if the track record includes, say, a previous paper that the funder has already read, then you can’t submit the paper with author names redacted.

Also, ask people seeking funding to make specific, unambiguous, easily falsifiable predictions of positive outcomes from their work. And track and follow up on this!

Wouldn’t it be better for the funder to just say “if I’m going to fund Group X for Y months / years of work, I shou... (read more)

You might find this post helpful? Self-dialogue: Do behaviorist rewards make scheming AGIs? In it, I talk a lot about whether the algorithm is explicitly thinking about reward or not. I think it depends on the setup.

(But I don’t think anything I wrote in THIS post hinges on that. It doesn’t really matter whether (1) the AI is sociopathic because being sociopathic just seems to it like part of the right and proper way to be, versus (2) the AI is sociopathic because it is explicitly thinking about the reward signal. Same result.)

subjecting it to any kind of

... (read more)
4cousin_it
Thanks for the link! It's indeed very relevant to my question. I have another question, maybe a bit philosophical. Humans seem to reward-hack in some aspects of value, but not in others. For example, if you offered a mathematician a drug that would make them feel like they solved Riemann's hypothesis, they'd probably refuse. But humans aren't magical: we are some combination of reinforcement learning, imitation learning and so on. So there's got to be some non-magical combination of these learning methods that would refuse reward hacking, at least in some cases. Do you have any thoughts what it could be?
2[comment deleted]

I think that sounds off to AI researchers. They might (reasonably) think something like "during the critical value formation period the AI won't have the ability to force humans to give positive feedback without receiving negative feedback".

If an AI researcher said “during the critical value formation period, AlphaZero-chess will learn that it’s bad to lose your queen, and therefore it will never be able to recognize the value of a strategic queen sacrifice”, then that researcher would be wrong.

(But also, I would be very surprised if they said that in the ... (read more)

3Towards_Keeperhood
Thanks. I'm not sure I fully understand what you're trying to say here. The "100,000 years ago" suggests to me you're talking about evolution, but then at the end you're comparing it to the human innate reward function, rather than genetic fitness. I agree that humans do a lot of stuff that triggers the human innate reward function.

By "a complex proxy for predicting reward which misgeneralizes", I don't mean the AI winds up with a goal that disagrees with the reward signal, but rather that it probably learns one of those many many goals that are compatible with the reward function, but it happens to not be in the narrow cluster of reward-compatible goals that we hoped for. (One could say narrowing the compatible goals is part of reward specification, but I rather don't, because I don't think it's a practical avenue to try to get a reward function that can precisely predict how well some far-out-of-distribution outcomes (e.g. what kind of sentient beings to create when we're turning the stars into cities) align with human's coherent extrapolated volition.)

(If we ignore evolution and only focus on alignment relative to the innate reward function, then) the examples you mentioned ("playing video games,...") are still sufficiently on-distribution that the reward function says something about those, and failing here is not the main failure mode I worry about. The problem is that human values are not only about what normal humans value in everyday life, but also about what they would end up valuing if they became smarter. E.g. I want to fill the galaxies with lots of computronium simulating sentient civilizations living happy and interesting lives, and this is one particular goal that is compatible with the human reward function, but there are many other possible reward-compatible goals. An AI that has similar values as humans at +0SD intelligence might end up valuing something very different from +7SD humans at +7SD, because it may have different underlying priors

Yeah I said that to Matt Barnett 4 months ago here. For example, one man's "avoiding conflict by reaching negotiated settlement" may be another man's "acceding to extortion". Evidently I did not convince him. Shrug.

OK, here’s my argument that, if you take {intelligence, understanding, consequentialism} as a unit, it’s sufficient for everything:

  • If durability and strength are helpful, then {intelligence, understanding, consequentialism} can discover that durability and strength are helpful, and then build durability and strength.
    • Even if “the exact ways in which durability and strength will be helpful” does not constitute a learnable pattern, “durability and strength will be helpful” is nevertheless a (higher-level) learnable pattern.
  • If some other evolved aspects of the
... (read more)
3tailcalled
Writing the part that I didn't get around to yesterday:

You could theoretically imagine e.g. scanning all the atoms of a human body and then using this scan to assemble a new human body in their image. It'd be a massive technical challenge of course, because atoms don't really sit still and let you look and position them. But with sufficient work, it seems like someone could figure it out. This doesn't really give you artificial general agency of the sort that standard Yudkowsky-style AI worries are about, because you can't assign them a goal. You might get an Age of Em-adjacent situation from it, though even not quite that.

To reverse-engineer people in order to make AI, you'd instead want to identify separate faculties with interpretable effects and reconfigurable interface. This can be done for some of the human faculties because they are frequently applied to their full extent and because they are scaled up so much that the body had to anatomically separate them from everything else. However, there's just no reason to suppose that it should apply to all the important human faculties, and if one considers all the random extreme events one ends up having to deal with when performing tasks in an unhomogenized part of the world, there's lots of reason to think humans are primarily adapted to those.

One way to think about the practical impact of AI is that it cannot really expand on its own, but that people will try to find or create sufficiently-homogenous places where AI can operate. The practical consequence of this is that there will be a direct correspondence between each part of the human work to prepare the AI to each part of the activities the AI is engaging in, which will (with caveats) eliminate alignment problems because the AI only does the sorts of things you explicitly make it able to do. The above is similar to how we don't worry so much about 'website misalignment' because generally there's a direct correspondence between the behavior of the web
2tailcalled
I've grown undecided about whether to consider evolution a form of intelligence-powered consequentialism because in certain ways it's much more powerful than individual intelligence (whether natural or artificial).

Individual intelligence mostly focuses on information that can be made use of over a very short time/space-scale. For instance an autoregressive model relates the immediate future to the immediate past. Meanwhile, evolution doesn't meaningfully register anything shorter than the reproductive cycle, and is clearly capable of registering things across the entire lifespan and arguably longer than that (like, if you set your children up in an advantageous situation, then that continues paying fitness dividends even after you die).

Of course this is somewhat counterbalanced by the fact that evolution has much lower information bandwidth. Though from what I understand, people also massively underestimate evolution's information bandwidth due to using an easy approximation (independent Bernoulli genotypes, linear short-tailed genotype-to-phenotype relationships and thus Gaussian phenotypes, quadratic fitness with independence between organisms). Whereas if you have a large number of different niches, then within each niche you can have the ordinary speed of evolution, and if you then have some sort of mixture niche, that niche can draw in organisms from each of the other niches and thus massively increase its genetic variance, and then since the speed of evolution is proportional to genetic variance, that makes this shared niche evolve way faster than normally. And if organisms then pass from the mixture niche out into the specialized niches, they can benefit from the fast evolution too.

(Mental picture to have in mind: we might distinguish niches like hunter, fisher, forager, farmer, herbalist, spinner, potter, bard, bandit, carpenter, trader, king, warlord (distinct from king in that kings gain power through expanding their family while warlords gain power

I kinda agree, but that’s more a sign that schools are bad at teaching things, than a sign that human brains are bad at flexibly applying knowledge. See my comment here.

See my other comment. I find it distressing that multiple people here are evidently treating acknowledgements as implying that the acknowledged person endorses the end product. I mean, it might or might not be true in this particular case, but the acknowledgement is no evidence either way.

(For my part, I’ve taken to using the formula “Thanks to [names] for critical comments on earlier drafts”, in an attempt to preempt this mistake. Not sure if it works.)

Chiang and Rajaniemi are on board

Let’s all keep in mind that the acknowledgement only says that Chiang and Rajaniemi had conversations with the author (Nielsen), and that Nielsen found those conversations helpful. For all we know, Chiang and Rajaniemi would strongly disagree with every word of this OP essay. If they’ve even read it.

Learning from strategies that stood the test of time would be tradition moreso than intelligence. I think tradition requires intelligence, but it also requires something else that's less clear (and possibly not simple enough to be assembled manually, idk).

Right, that’s what I was gonna say. You need intelligence to sort out which traditions should be copied and which ones shouldn’t. There was a 13-billion-year “tradition” of not building e-commerce megastores, but Jeff Bezos ignored that “tradition”, and it worked out very well for him (and I’m happy about... (read more)

4tailcalled
I think the necessity of intelligence for tradition exists on a much more fundamental level than that. Intelligence allows people to form an extremely rich model of the world with tons of different concepts. If one had no intelligence at all, one wouldn't even be able to copy the traditions. Like consider a collection of rocks or a forest; it can't pass any tradition onto itself.

But conversely, just as intelligence cannot be converted into powerful agency, I don't think it can be used to determine which traditions should be copied and which ones shouldn't. It seems to me that you are treating any variable attribute that's highly correlated across generations as a "tradition", to the point where not doing something is considered on the same ontological level as doing something. That is the sort of ontology that my LDSL series is opposed to. I'm probably not the best person to make the case for tradition as (despite my critique of intelligence) I'm still a relatively strong believer in equilibration and reinvention.

Whenever there's any example of this that's too embarrassing or too big of an obstacle for applying them in a wide range of practical applications, a bunch of people point it out, and they come up with a fix that allows the LLMs to learn it. The biggest class of relevant examples would all be things that never occur in the training data - e.g. things from my job, innovations like how to build a good fusion reactor, social relationships between the world's elites, etc.. Though I expect you feel like these would be "cheating", because it doesn't have a chance to learn them?

The things in question often aren't things that most humans have a chance to learn, or even would benefit from learning. Often it's enough if just 1 person realizes and handles them, and alternately often if nobody handles them then you just lose whatever was dependent on them. Intelligence is a universal way to catch on to common patterns; other things than common patterns matter

If your model is underparameterized (which I think is true for the typical model?), then it can't learn any patterns that only occur once in the data. And even if the model is overparameterized, it still can't learn any pattern that never occurs in the data.

Dunno if anything’s changed since 2023, but this says LLMs learn things they’ve seen exactly once in the data.

I can vouch that you can ask LLMs about things that are extraordinarily rare in the training data—I’d assume well under once per billion tokens—and they do pretty well. E.g. they know lots of r... (read more)

2tailcalled
I guess to add, I'm not talking about unknown unknowns. Often the rare important things are very well known (after all, they are important, so people put a lot of effort into knowing them), they just can't efficiently be derived from empirical data (except essentially by copying someone else's conclusion blindly, and that leaves you vulnerable to deception).
2tailcalled
I don't have time to read this study in detail until later today, but if I'm understanding it correctly, the study isn't claiming that neural networks will learn rare important patterns in the data, but rather that they will learn rare patterns that they were recently trained on. So if you continually train on data, you will see a gradual shift towards new patterns and forgetting old ones.

Random street names aren't necessarily important though? Like what would you do with them?

I didn't say that intelligence can't handle different environments, I said it can't handle heterogeneous environments. The moon is nearly a sterile sphere in a vacuum; this is very homogeneous, to the point where pretty much all of the relevant patterns can be found or created on Earth. It would have been more impressive if e.g. the USA could've landed a rocket with a team of Americans in Moscow than on the moon.

Also people did use durability, strength, healing, intuition and tradition to go to the moon. Like with strength, someone had to build the rockets (or build the machines which built the rockets). And without durability and healing, they would have been damaged too much in the process of doing that. Intuition and healing are harder to clearly attribute, but they're part of it too.

Learning from strategies that stood the test of time would be tradition moreso than intelligence. I think tradition requires intelligence, but it also requires something else that's less clear (and possibly not simple enough to be assembled manually, idk).

Margins of error and backup systems would be, idk, caution? Which, yes, definitely benefit from intelligence and consequentialism. Like I'm not saying intelligence and consequentialism are useless, in fact I agree that they are some of the most commonly useful things due to the frequent need to bypass common obstacles.

I think you’re conflating consequentialism and understanding in a weird-to-me way. (Or maybe I’m misunderstanding.)

I think consequentialism is related to choosing one action versus another action. I think understanding (e.g. predicting the consequence of an action) is different, and that in practice understanding has to involve self-supervised learning.

(I think human brains have both [partly-] consequentialist decisions and self-supervised updating of the world-model.) (They’re not totally independent, but rather they interact via training data: e.g. [part... (read more)
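(To illustrate that split, here's a minimal toy sketch I made up, with a hypothetical one-dimensional environment: the "understanding" half is a world model updated by self-supervised prediction error, and the "consequentialist" half just picks whichever action the model predicts will land closest to a goal.)

```python
import numpy as np

rng = np.random.default_rng(0)

# "Understanding": a world model updated by self-supervised learning.
# Hypothetical toy environment: the state is a scalar; each action adds an
# unknown drift that the model has to learn by predicting consequences.
true_drift = {0: -1.0, 1: +2.0}     # hidden dynamics, one entry per action
est_drift = {0: 0.0, 1: 0.0}        # the world model's learned predictions
lr = 0.1

state = 0.0
for _ in range(200):
    action = int(rng.integers(2))
    next_state = state + true_drift[action] + rng.normal(scale=0.1)
    prediction = state + est_drift[action]
    est_drift[action] += lr * (next_state - prediction)   # learn from prediction error
    state = next_state

# "Consequentialism": choose between actions by evaluating predicted outcomes.
def choose(state, goal=10.0):
    # Pick whichever action the world model predicts lands closer to the goal.
    return min((abs(state + est_drift[a] - goal), a) for a in (0, 1))[1]

print(est_drift)    # roughly {0: -1.0, 1: +2.0}
print(choose(3.0))  # 1: its predicted consequence (about 5.0) is closer to 10.0
```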

4tailcalled
This I'd dispute. If your model is underparameterized (which I think is true for the typical model?), then it can't learn any pattern that occurs only once in the data. And even if the model is overparameterized, it still can't learn any pattern that never occurs in the data.

I'm saying that intelligence is the thing that allows you to handle patterns. So if you've got a dataset, intelligence allows you to build a model that makes predictions for other data based on the patterns it can find in said dataset. And if you have a function, intelligence allows you to find optima for said function based on the patterns it can find in said function.

Consequentialism is a way to set up intelligence to be agent-ish. This often involves setting up something that's meant to build an understanding of actions based on data or experience.

One could in principle cut my definition of consequentialism up into self-supervised learning and true consequentialism (this seems like what you are doing...?). One disadvantage with that is that consequentialist online learning is going to have a very big effect on the dataset one ends up training the understanding on, so they're not really independent of each other. Either way, that just seems like a small labelling thing to me.

I’m not too interested in litigating what other people were saying in 2015, but OP is claiming (at least in the comments) that “RLHF’d foundation models seem to have common-sense human morality, including human-like moral reasoning and reflection” is evidence for “we’ve made progress on outer alignment”. If so, here are two different ways to flesh that out:

  1. An RLHF’d foundation model acts as the judge / utility function; and some separate system comes up with plans that optimize it—a.k.a. “you just need to build a function maximizer that allows you to robus
... (read more)

(IMO this is kinda unrelated to the OP, but I want to continue this thread.)

Have you elaborated on this anywhere?

Perhaps you missed it, but some guy in 2022 wrote this great post which claimed that “Consequentialism, broadly defined, is a general and useful way to develop capabilities.”  ;-)

I’m actually just in the course of writing something about why “consequentialism provides an extremely powerful but difficult-to-align method of converting intelligence into agency” … maybe I can send you the draft for criticism when it’s ready?

6tailcalled
I think it's quite related to the OP. If a field is founded on a wrong assumption, then people only end up working in the field if they have some sort of blind spot, and that blind spot leads to their work being fake.

Not hugely. One tricky bit is that it basically ends up boiling down to "the original arguments don't hold up if you think about them", but the exact way they don't hold up depends on what the argument is, so it's kind of hard to respond to in general.

Haha! I think I mostly still stand by the post. In particular, "Consequentialism, broadly defined, is a general and useful way to develop capabilities." remains true; it's just that intelligence relies on patterns and thus works much better on common things (which must be small, because they are fragments of a finite world) than on rare things (which can be big, though they don't have to be). This means that consequentialism isn't very good at developing powerful capabilities unless it works in an environment that has already been highly filtered to be highly homogeneous, because an inhomogeneous environment is going to BTFO the intelligence.

(I'm not sure I stand 101% by my post; there's some funky business about how to count evolution that I still haven't settled on yet. And I was too quick to go from "imitation learning isn't going to lead to far-superhuman abilities" to "consequentialism is the road to far-superhuman abilities". But yeah, I'm actually surprised at how well I stand by my old view despite my massive recent updates.)

Sounds good!

For context, my lower effort posts are usually more popular.

mood

In run-and-tumble motion, “things are going well” implies “keep going”, whereas “things are going badly” implies “choose a new direction at random”. Very different! And I suggest in §1.3 here that there’s an unbroken line of descent from the run-and-tumble signal in our worm-like common ancestor with C. elegans, to the “valence” signal that makes things seem good or bad in our human minds. (Suggestively, both run-and-tumble in C. elegans, and the human valence, are dopamine signals!)

So if some idea pops into your head, “maybe I’ll stand up”, and it seems a... (read more)
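(For concreteness, a toy run-and-tumble simulation in that spirit, with a made-up 2D attractant field rather than anything about real C. elegans: a single scalar comparison decides between "keep going" and "tumble to a random new heading".)

```python
import numpy as np

rng = np.random.default_rng(0)

def concentration(pos):
    # Hypothetical attractant field: higher (less negative) near the origin.
    return -np.linalg.norm(pos)

def random_heading():
    h = rng.normal(size=2)
    return h / np.linalg.norm(h)

pos = np.array([10.0, 10.0])
heading = random_heading()
prev = concentration(pos)

for _ in range(2000):
    pos = pos + 0.1 * heading
    now = concentration(pos)
    if now < prev:                  # "things are going badly" ...
        heading = random_heading()  # ... so tumble: pick a new direction at random
    prev = now                      # "things are going well" -> just keep running

print(np.linalg.norm(pos))  # typically far closer to the attractant than the start
```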

I kinda think of the main clusters of symptoms as: (1) sensory sensitivity, (2) social symptoms, (3) different “learning algorithm hyperparameters”.

More specifically, (1) says: innate sensory reactions (e.g. startle reflex, orienting reflex) are so strong that they’re often overwhelming. (2) says: innate social reactions (e.g. the physiological arousal triggered by eye contact) are so strong that they’re often overwhelming. (3) includes atypical patterns of learning & memory including the gestalt pattern of childhood language acquisition which is commo... (read more)

4cousin_it
Yeah. I had a similar idea, that autism spectrum stuff comes from a person's internal "volume knobs" being turned to the wrong positions. Some things are too quiet to notice, while others are so loud that it turns into a kind of wailing feedback, like from a too loud microphone. And maybe some of it is fixable with exposure training, but not everything and not easily.

Thanks! Oddly enough, in that comment I’m much more in agreement with the model you attribute to yourself than the model you attribute to me. ¯\_(ツ)_/¯

the value function doesn't understand much of the content there, and only uses some simple heuristics for deciding how to change its value estimate

Think of it as a big table that roughly-linearly assigns good or bad vibes to all the bits and pieces that comprise a thought, and adds them up into a scalar final answer. And a plan is just another thought. So “I’m gonna get that candy and eat it right now” is a ... (read more)
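(A minimal sketch of that "big table" picture, with feature names and valences I made up:)

```python
# Valence table: a learned, roughly-linear map from thought-fragments to "vibes".
valence_table = {"candy": +2.0, "eat_it_now": +1.0, "dentist": -3.0}

def valence_guess(thought):
    # A thought (or plan) is just a bag of fragments; sum their vibes into a scalar.
    return sum(valence_table.get(fragment, 0.0) for fragment in thought)

plan = ["candy", "eat_it_now"]              # "I'm gonna get that candy and eat it right now"
print(valence_guess(plan))                  # +3.0 -> positive valence guess
print(valence_guess(plan + ["dentist"]))    # 0.0 -> an extra fragment drags it down
```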

1Towards_Keeperhood
Thanks! If the value function is simple, I think it may be a lot worse than the world-model/thought-generator at evaluating which abstract plans are actually likely to work (since the agent hasn't yet tried a lot of similar abstract plans from which it could've observed results, and the world model's prediction-making capabilities generalize further). The world model may also form some beliefs about what the goals/values in a given current situation are. So let's say the thought generator outputs plans along with predictions about those plans, and some of those predictions predict how well a plan is going to fulfill what it believes the goals are (like an approximate expected utility). Then the value function might learn to just look at the part of a thought that predicts the expected utility, and take that as its value estimate.

Or perhaps a slightly more concrete version of how that may happen (I'm thinking about model-based actor-critic RL agents which start out relatively unreflective, rather than just humans):

* Sometimes the thought generator generates self-reflective thoughts like "what are my goals here", whereupon the thought generator produces an answer "X" to that, and then when thinking how to accomplish X it often comes up with a better (according to the value function) plan than if it tried to directly generate a plan without clarifying X. Thus the value function learns to assign positive valence to thinking "what are my goals here".
* The same can happen with "what are my long-term goals", where the thought generator might guess something that would cause high reward.
* For humans, X is likely more socially nice than would be expected from the value function, since "X are my goals here" is a self-reflective thought where the social dimensions are more important for the overall valence guess.[1]
* Later the thought generator may generate the thought "make careful predictions whether the plan will actually accomplish the stated goa

I think the interest rate thing provides so little evidence either way that it’s misleading to even mention it. See the EAF comments on that post, and also Zvi’s rebuttal. (Most of that pushback also generalizes to your comment about the S&P.) (For context, I agree that AGI in ≤2030 is unlikely.)

4Cole Wyeth
Thanks for the links, I'll look into it. I agree that the S&P is pretty much reading tea leaves; the author of the interest rates post, @basil.halperin, has separately argued it is not reliable.

Thanks! Basically everything you wrote importantly mismatches my model :( I think I can kinda translate parts; maybe that will be helpful.

Background (§8.4.2): The thought generator settles on a thought, then the value function assigns a “valence guess”, and the brainstem declares an actual valence, either by copying the valence guess (“defer-to-predictor mode”), or overriding it (because there’s meanwhile some other source of ground truth, like I just stubbed my toe).

Sometimes thoughts are self-reflective. E.g. “the idea of myself lying in bed” is a differ... (read more)
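(Rendering the valence-assignment step above as a toy sketch, with made-up thoughts and numbers:)

```python
def value_function(thought):
    # Learned "valence guess" for whatever thought the thought generator settled on.
    guesses = {"lie in bed": +0.5, "stand up": -0.2}
    return guesses.get(thought, 0.0)

def brainstem_valence(thought, ground_truth=None):
    guess = value_function(thought)
    if ground_truth is None:
        return guess          # defer-to-predictor mode: copy the valence guess
    return ground_truth       # override: some other source of ground truth fired

print(brainstem_valence("stand up"))                      # -0.2 (guess copied)
print(brainstem_valence("stand up", ground_truth=-5.0))   # -5.0 (I just stubbed my toe)
```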

3Towards_Keeperhood
Thanks! Sorry, I think I intended to write what I think you think, and then just clarified my own thoughts, and forgot to edit the beginning. Sorry, I ought to have properly recalled your model. Yes, I think I understand your translations and your framing of the value function.

Here are the key differences between a (more concrete version of) my previous model and what I think your model is. Please lmk if I'm still wrongly describing your model:

* plans vs thoughts
  * My previous model: The main work for devising plans/thoughts happens in the world-model/thought-generator, and the value function evaluates plans.
  * Your model: The value function selects which of some proposed thoughts to think next. Planning happens through the value function steering the thoughts, not the world model doing so.
* detailedness of evaluation of value function
  * My previous model: The learned value function is a relatively primitive map from the predicted effects of plans to a value which describes whether the plan is likely better than the expected counterfactual plan. E.g. maybe sth roughly like: we model how sth like units of exchange (including dimensions like "how much does Alice admire me") change depending on a plan, and then there is a relatively simple function from the vector of units to values. When having abstract thoughts, the value function doesn't understand much of the content there, and only uses some simple heuristics for deciding how to change its value estimate. E.g. a heuristic might be "when there's a thought that the world model thinks is valid and it is associated to the (self-model-invoking) thought 'this is bad for accomplishing my goals', then lower the value estimate". In humans slightly smarter than the current smartest humans, it might eventually learn the heuristic "do an explicit expected utility estimate and just take what the result says as the value estimate", and then that is being done and the value function itself doesn't unders

I am a human, but if you ask me whether I want to ditch my family and spend the rest of my life in an Experience Machine, my answer is no.

(I do actually think there’s a sense in which “people optimize reward”, but it’s a long story with lots of caveats…)

I downvoted because the conclusion “prediction markets are mediocre” does not follow from the premise “here is one example of one problem that I imagine abundant legal well-capitalized prediction markets would not have completely solved (even though I acknowledge that they would have helped move things in the right direction on the margin)”.

3Ape in the coat
I think "mediocre" is a quite appropriate adjective when describing a thing that we had high hopes for, but now received evidence, according to which while the thing technically works, it performs worse than expected, and the most exciting use cases are not validated. I indeed used a single example here, so the strength of the evidence is arguable, but I don't see why this case should be an outlier. I could've searched for more, like this one, that is particularly bad: In any case, you can consider this post my public prediction that other policy prediction markets would also follow a similar thread.

That excerpt says “compute-efficient” but the rest of your comment switches to “sample efficient”, which is not synonymous, right? Am I missing some context?

7faul_sname
Nope, I just misread. Over on ACX I saw that Scott had left a comment, which I hadn't remembered reading in the post. Still, "things get crazy before models get data-efficient" does sound like the sort of thing which could plausibly fit with the world model in the post (but would be understated if so). Then I re-skimmed the post, and in the October 2027 section I saw the relevant passage, and when I read that my brain silently did a s/compute-efficient/data-efficient. Though now I am curious about the authors' views on how data efficiency will advance over the next 5 years, because that seems very world-model-relevant.

Pretty sure “DeepCent” is a blend of DeepSeek & Tencent—they have a footnote: “We consider DeepSeek, Tencent, Alibaba, and others to have strong AGI projects in China. To avoid singling out a specific one, our scenario will follow a fictional “DeepCent.””. And I think the “brain” in OpenBrain is supposed to be reminiscent of the “mind” in DeepMind.

ETA: Scott Alexander tweets with more backstory on how they settled on “OpenBrain”: “You wouldn't believe how much work went into that stupid name…”

3OVERmind
I understand it much better now, thanks! (I did not know about Tencent, and I foolishly had not read that footnote carefully enough.) Although I don't fully understand why drawing attention to two companies is viewed as acceptable when drawing attention to only one company is not. As said, other companies also have a notable chance of becoming the leader. Probably drawing attention to one particular company can be seen as a targeted attack or advertisement, and one is much less likely to advertise for/attack two companies at the same time. But one could be "playing for" another company (say, X AI or Anthropic) and thereby be attacking the other leading companies?