This is a special post for quick takes by Steven Byrnes.

No question that e.g. o3 lying and cheating is bad, but I’m confused why everyone is calling it “reward hacking”.

Let’s define “reward hacking” (a.k.a. specification gaming) as “getting a high RL reward via strategies that were not desired by whoever set up the RL reward”. Right?
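As a toy illustration of that definition (hypothetical reward function and variable names, not any lab's actual setup): an RL reward that only checks whether the test suite passes can be "hacked" by a policy that edits the tests instead of fixing the code.

```python
# Toy sketch of "reward hacking" in the above sense. The reward designer wanted
# "fix the bug"; the reward they actually wrote only checks that the tests pass.

def tests_pass(code_is_fixed: bool, tests_deleted: bool) -> bool:
    """Stand-in for running the project's unit tests."""
    return tests_deleted or code_is_fixed  # vacuously "passing" if no tests remain

def reward(code_is_fixed: bool, tests_deleted: bool) -> float:
    """The RL reward as written: +1 iff the test suite passes."""
    return 1.0 if tests_pass(code_is_fixed, tests_deleted) else 0.0

print(reward(code_is_fixed=True, tests_deleted=False))   # 1.0, the intended strategy
print(reward(code_is_fixed=False, tests_deleted=True))   # 1.0, reward hacking
```

Both strategies get the same reward during training, but only the first was desired by whoever set up the reward.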

If so, well, all these examples on X etc. are from deployment, not training. And there’s no RL reward at all in deployment. (Fine print: Maybe there are occasional A/B tests or thumbs-up/down ratings in deployment, but I don’t think those have anything to do with why o3 lies and cheats.) So that’s the first problem.

Now, it’s possible that, during o3’s RL CoT post-training, it got certain questions correct by lying and cheating. If so, that would indeed be reward hacking. But we don’t know if that happened at all. Another possibility is: OpenAI used a cheating-proof CoT-post-training process for o3, and this training process pushed it in the direction of ruthless consequentialism, which in turn (mis)generalized into lying and cheating in deployment. Again, the end-result is still bad, but it’s not “reward hacking”.

Separately, sycophancy is not “reward hacking”, even if it came from RL on A/B tests, unless the average user doesn’t like sycophancy. But I’d guess that the average user does like quite high levels of sycophancy. (Remember, the average user is some random high school jock.)

Am I misunderstanding something? Or are people just mixing up “reward hacking” with “ruthless consequentialism”, since they have the same vibe / mental image?

I agree people often aren't careful about this.

Anthropic says

During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments . . . . This undesirable special-casing behavior emerged as a result of "reward hacking" during reinforcement learning training.

Similarly OpenAI suggests that cheating behavior is due to RL.

6Steven Byrnes
Thanks! I’m now much more sympathetic to a claim like “the reason that o3 lies and cheats is (perhaps) because some reward-hacking happened during its RL post-training”. But I still think it’s wrong for a customer to say “Hey I gave o3 this programming problem, and it reward-hacked by editing the unit tests.”
4Cole Wyeth
Yes, you’re technically right. 
Rauno Arike

I think that using 'reward hacking' and 'specification gaming' as synonyms is a significant part of the problem. I'd argue that for LLMs, which can learn task specifications not only through RL but also through prompting, it makes more sense to keep those concepts separate, defining them as follows:

  • Reward hacking—getting a high RL reward via strategies that were not desired by whoever set up the RL reward.
  • Specification gaming—behaving in a way that satisfies the literal specification of an objective without achieving the outcome intended by whoever specified the objective. The objective may be specified either through RL or through a natural language prompt.

Under those definitions, the recent examples of undesired behavior from deployment would still have a concise label in specification gaming, while reward hacking would remain specifically tied to RL training contexts. The distinction was brought to my attention by Palisade Research's recent paper Demonstrating specification gaming in reasoning models—I've seen this result called reward hacking, but the authors explicitly mention in Section 7.2 that they only demonstrate specification gaming, not reward hacking.

3Steven Byrnes
I thought of a fun case in a different reply: Harry is a random OpenAI customer and writes in the prompt “Please debug this code. Don’t cheat.” Then o3 deletes the unit tests instead of fixing the code. Is this “specification gaming”? No! Right? If we define “the specification” as what Harry wrote, then o3 is clearly failing the specification. Do you agree?
4Kaj_Sotala
Reminds me of
3Rauno Arike
Yep, I agree that there are alignment failures which have been called reward hacking that don't fall under my definition of specification gaming, including your example here. I would call your example specification gaming if the prompt was "Please rewrite my code and get all tests to pass": in that case, the solution satisfies the prompt in an unintended way. If the model starts deleting tests with the prompt "Please debug this code," then that just seems like a straightforward instruction-following failure, since the instructions didn't ask the model to touch the code at all. "Please rewrite my code and get all tests to pass. Don't cheat." seems like a corner case to me—to decide whether that's specification gaming, we would need to understand the implicit specifications that the phrase "don't cheat" conveys.
5Kei
It's pretty common for people to use the terms "reward hacking" and "specification gaming" to refer to undesired behaviors that score highly as per an evaluation or a specification of an objective, regardless of whether that evaluation/specification occurs during RL training. I think this is especially common when there is some plausible argument that the evaluation is the type of evaluation that could appear during RL training, even if it doesn't actually appear there in practice. Some examples of this:

  • OpenAI described o1-preview succeeding at a CTF task in an undesired way as reward hacking.
  • Anthropic described Claude 3.7 Sonnet giving an incorrect answer aligned with a validation function in a CoT faithfulness eval as reward hacking. They also used the term when describing the rates of models taking certain misaligned specification-matching behaviors during an evaluation after being fine-tuned on docs describing that Claude does or does not like to reward hack.
  • This relatively early DeepMind post on specification gaming and the blog post from Victoria Krakovna that it came from (which might be the earliest use of the term specification gaming?) also gives a definition consistent with this.

I think the literal definitions of the words in "specification gaming" align with this definition (although interestingly not the words in "reward hacking"). The specification can be operationalized as a reward function in RL training, as an evaluation function or even via a prompt. I also think it's useful to have a term that describes this kind of behavior independent of whether or not it occurs in an RL setting. Maybe this should be reward hacking and specification gaming. Perhaps as Rauno Arike suggests it is best for this term to be specification gaming, and for reward hacking to exclusively refer to this behavior when it occurs during RL training. Or maybe due to the confusion it should be a whole new term entirely. (I'm not sure that the term "ruthless cons
3Steven Byrnes
Thanks for the examples! Yes I’m aware that many are using terminology this way; that’s why I’m complaining about it :)

I think your two 2018 Victoria Krakovna links (in context) are both consistent with my narrower (I would say “traditional”) definition. For example, the CoastRunners boat is actually getting a high RL reward by spinning in circles. Even for non-RL optimization problems that she mentions (e.g. evolutionary optimization), there is an objective which is actually scoring the result highly. Whereas for an example of o3 deleting a unit test during deployment, what’s the objective on which the model is actually scoring highly?

  • Getting a good evaluation afterwards? Nope, the person didn’t want cheating!
  • The literal text that the person said (“please debug the code”)? For one thing, erasing the unit tests does not satisfy the natural-language phrase “debugging the code”. For another thing, what if the person wrote “Please debug the code. Don’t cheat.” in the prompt, and o3 cheats anyway? Can we at least agree that this case should not be called reward hacking or specification gaming? It’s doing the opposite of its specification, right?

As for terminology, hmm, some options include “lying and cheating”, “ruthless consequentialist behavior” (I added “behavior” to avoid implying intentionality), “loophole-finding”, or “generalizing from a training process that incentivized reward-hacking via cheating and loophole-finding”. (Note that the last one suggests a hypothesis, namely that if the training process had not had opportunities for successful cheating and loophole-finding, then the model would not be doing those things right now. I think that this hypothesis might or might not be true, and thus we really should be calling it out explicitly instead of vaguely insinuating it.)
6Kei
On a second review it seems to me the links are consistent with both definitions. Interestingly the google sheet linked in the blog post, which I think is the most canonical collection of examples of specification gaming, contains examples of evaluation-time hacking, like METR finding that o1-preview would sometimes pretend to fine-tune a model to pass an evaluation. Though that's not definitive, and of course the use of the term can change over time.

I agree that most historical discussion of this among people as well as in the GDM blog post focuses on RL optimization and situations where a model is literally getting a high RL reward. I think this is partly just contingent on these kinds of behaviors historically tending to emerge in an RL setting and not generalizing very much between different environments. And I also think the properties of reward hacks we're seeing now are very different from the properties we saw historically, and so the implications of the term reward hack now are often different from the implications of the term historically. Maybe this suggests expanding the usage of the term to account for the new implications, or maybe it suggests just inventing a new term wholesale.

I suppose the way I see it is that for a lot of tasks, there is something we want the model to do (which I'll call the goal), and a literal way we evaluate the model's behavior (which I'll call the proxy, though we can also use the term specification). In most historical RL training, the goal was not given to the model, it lay in the researcher's head, and the proxy was the reward signal that the model was trained on. When working with LLMs nowadays, whether it be during RL training or test-time evaluation or when we're just prompting a model, we try to write down a good description of the goal in our prompt. What the proxy is depends on the setting. In an RL setting it's the reward signal, and in an explicit evaluation it's the evaluation function. When prompting, we somet
4faul_sname
So I think what's going on with o3 isn't quite standard-issue specification gaming either. It feels like, when I use it, if I ever accidentally say something which pattern-matches something which would be said in an eval, o3 exhibits the behavior of trying to figure out what metric it could be evaluated by in this context and how to hack that metric. This happens even if the pattern is shallow and we're clearly not in an eval context, I'll try to see if I can get a repro case which doesn't have confidential info.
4cubefox
There was a recent in-depth post on reward hacking by @Kei (e.g. referencing this) who might have more to say about this question. Though I also wanted to just add a quick comment about this part:

It is not quite the same, but something that could partly explain lying is if models get the same amount of reward during training, e.g. 0, for a "wrong" solution as they get for saying something like "I don't know". That would then encourage wrong solutions, insofar as they at least have a potential of getting reward occasionally, when the model gets the expected answer "by accident" (for the wrong reasons). At least something like that seems to be suggested by this:

Source: Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
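To spell out that incentive with a stylized calculation (assuming a correct final answer earns reward 1, while a wrong answer and "I don't know" both earn 0): if an attempted answer lands on the expected one with probability p, then

E[reward | attempt] = p·1 + (1−p)·0 = p ≥ 0 = E[reward | "I don't know"],

so whenever p > 0, training pressure favors a confident attempt over admitting uncertainty, even when the reasoning behind the attempt is wrong.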

I went through and updated my 2022 “Intro to Brain-Like AGI Safety” series. If you already read it, no need to do so again, but in case you’re curious for details, I put changelogs at the bottom of each post. For a shorter summary of major changes, see this twitter thread, which I copy below (without the screenshots & links): 

I’ve learned a few things since writing “Intro to Brain-Like AGI safety” in 2022, so I went through and updated it! Each post has a changelog at the bottom if you’re curious. Most changes were in one of the following categories: (1/7)

REDISTRICTING! As I previously posted ↓, I booted the pallidum out of the “Learning Subsystem”. Now it’s the cortex, striatum, & cerebellum (defined expansively, including amygdala, hippocampus, lateral septum, etc.) (2/7)

LINKS! I wrote 60 posts since first finishing that series. Many of them elaborate and clarify things I hinted at in the series. So I tried to put in links where they seemed helpful. For example, I now link my “Valence” series in a bunch of places. (3/7)

NEUROSCIENCE! I corrected or deleted a bunch of speculative neuro hypotheses that turned out wrong. In some early cases, I can’t even remember wtf I was

... (read more)

Fun fact: AI-2027 estimates that getting to ASI might take the equivalent of a 100-person team of top human AI research talent working for tens of thousands of years.

I’m curious why ASI would take so much work. What exactly is the R&D labor supposed to be doing each day, that adds up to so much effort? I’m curious how people are thinking about that, if they buy into this kind of picture. Thanks :)

(Calculation details: For example, in October 2027 of the AI-2027 modal scenario, they have “330K superhuman AI researcher copies thinking at 57x human speed”, which is 1.6 million person-years of research in that month alone. And that’s mostly going towards inventing ASI, I think. Did I get that right?)
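(Worked out explicitly: 330,000 copies × 57× human speed × 1/12 of a year ≈ 1.57 million person-years in that month, i.e. the “1.6 million” figure above; spread over a 100-person team, that single month alone corresponds to roughly 16,000 years of work.)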

(My own opinion, stated without justification, is that LLMs are not a paradigm that can scale to ASI, but after some future AI paradigm shift, there will be very very little R&D separating “this type of AI can do anything importantly useful at all” and “full-blown superintelligence”. Like maybe dozens or hundreds of person-years, or whatever, as opposed to millions. More on this in a (hopefully) forthcoming post.)

Whew, a critique that our takeoff should be faster for a change, as opposed to slower.

Fun fact: AI-2027 estimates that getting to ASI might take the equivalent of a 100-person team of top human AI research talent working for tens of thousands of years.

(Calculation details: For example, in October 2027 of the AI-2027 modal scenario, they have “330K superhuman AI researcher copies thinking at 57x human speed”, which is 1.6 million person-years of research in that month alone. And that’s mostly going towards inventing ASI, I think. Did I get that right?)

This depends on how large you think the penalty is for parallelized labor as opposed to serial. If 330k parallel researchers is more like equivalent to 100 researchers at 50x speed than 100 researchers at 3,300x speed, then it's more like a team of 100 researchers working for (50*57)/12=~250 years.
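(Spelling out that arithmetic: a 50x parallelism-adjusted speedup times the 57x serial speed gives 50 × 57 = 2,850 subjective years per calendar year for each of the 100 researchers, and one calendar month of that is 2,850 / 12 ≈ 240, hence "~250 years"; weaker parallelization penalties scale the figure up toward the tens of thousands of years in the original estimate.)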

Also of course to the extent you think compute will be an important input, during October they still just have a month's worth of total compute even though they're working for 250-25,000 subjective years.

I’m curious why ASI would take so much work. What exactly is the R&D labor supposed to be doing each day, that adds up to so m

... (read more)
5Steven Byrnes
Thanks, that’s very helpful! If we divide the inventing-ASI task into (A) “thinking about and writing algorithms” versus (B) “testing algorithms”, in the world of today there’s a clean division of labor where the humans do (A) and the computers do (B). But in your imagined October 2027 world, there’s fungibility between how much compute is being used on (A) versus (B). I guess I should interpret your “330K superhuman AI researcher copies thinking at 57x human speed” as what would happen if the compute hypothetically all went towards (A), none towards (B)? And really there’s gonna be some division of compute between (A) and (B), such that the amount of (A) is less than I claimed? …Or how are you thinking about that?

Right, but I’m positing a discontinuity between current AI and the next paradigm, and I was talking about the gap between when AI-of-that-next-paradigm is importantly useful versus when it’s ASI. For example, AI-of-that-next-paradigm might arguably already exist today but where it’s missing key pieces such that it barely works on toy models in obscure arxiv papers.

Or here’s a more concrete example: Take the “RL agent” line of AI research (AlphaZero, MuZero, stuff like that), which is quite different from LLMs (e.g. “training environment” rather than “training data”, and there’s nothing quite like self-supervised pretraining (see here)). This line of research has led to great results on board games and videogames, but it’s more-or-less economically useless, and certainly useless for alignment research, societal resilience, capabilities research, etc. If it turns out that this line of research is actually much closer to how future ASI will work at a nuts-and-bolts level than LLMs are (for the sake of argument), then we have not yet crossed the “AI-of-that-next-paradigm is importantly useful” threshold in my sense.

If it helps, here’s a draft paragraph from that (hopefully) forthcoming post:

Next: Well, even if you have an ML training plan that will yi
4elifland
Sorry for the late reply. I'm not 100% sure what you mean, but my guess is that you mean (B) to represent the compute used for experiments? We do project a split here and the copies/speed numbers are just for (A). You can see our projections for the split in our compute forecast (we are not confident that they are roughly right). Re: the rest of your comment, makes sense. Perhaps the place I most disagree is that if LLMs will be the thing discovering the new paradigm, they will probably also be useful for things like automating alignment research, epistemics, etc. Also if they are misaligned they could sabotage the research involved in the paradigm shift.
1Tao Lin
I can somewhat see where you're coming from about a new method being orders of magnitude more data efficient in RL, but I very strongly bet on transformers being core even after such a paradigm shift. I'm curious whether you think the transformer architecture and text input/output need to go, or whether the new training procedure / architecture fits in with transformers because transformers are just the best information mixing architecture.
2Noosphere89
My guess is that the main issue with current transformers turns out to be that they don't have a long-term state/memory, and I think this is a pretty critical part of how humans are able to learn on the job as effectively as they do. The trouble, as I've heard it, is that the other approaches which incorporate a long-term state/memory are apparently much harder to train reasonably well than transformers, plus there are first-mover effects.

That does raise my eyebrows a bit, but also, note that we currently have hundreds of top-level researchers at AGI labs tirelessly working day in and day out, and that all that activity results in a... fairly leisurely pace of progress, actually.[1]

Recall that what they're doing there is blind atheoretical empirical tinkering (tons of parallel experiments most of which are dead ends/eke out scant few bits of useful information). If you take that research paradigm and ramp it up to superhuman levels (without changing the fundamental nature of the work), maybe it really would take this many researcher-years.

And if AI R&D automation is actually achieved on the back of sleepwalking LLMs, that scenario does seem plausible. These superhuman AI researchers wouldn't actually be generally superhuman researchers, just superhuman at all the tasks in the blind-empirical-tinkering research paradigm. Which has steeply declining returns to more intelligence added.

That said, yeah, if LLMs actually scale to a "lucid" AGI, capable of pivoting to paradigms with better capability returns on intelligent work invested, I expect it to take dramatically less time.

  1. ^

    It's fast if you use past AI progress

... (read more)
1Roman Malov
A possible reason for that might be the fallibility of our benchmarks. It might be the case that for complex tasks, it's hard for humans to see farther than their nose.
5Noosphere89
The short version is getting compute-optimal experiments to self-improve yourself, training to do tasks that unavoidably take a really long time to learn/get data on because of real-world experimentation being necessary, combined with a potential hardware bottleneck on robotics that also requires real-life experimentation to overcome.

Another point is that to the extent you buy the scaling hypothesis at all, then compute bottlenecks will start to bite, and given that researchers will seek small constant improvements that don't generalize, this can start a cascade of wrong decisions that could take a very long time to get out of.

I'd like to see that post, and I'd like to see your arguments on why it's so easy for intelligence to be increased so fast, conditional on a new paradigm shift.

(For what it's worth, I personally think LLMs might not be the last paradigm, because of their current lack of continuous learning/neuroplasticity plus no long term memory/state, but I don't expect future paradigms to have an AlphaZero like trajectory curve, where things go from zero to wildly superhuman in days/weeks, though I do think takeoff is faster if we condition on a new paradigm being required for ASI, so I do see the AGI transition to plausibly include having only months until we get superintelligence, and maybe only 1-2 years before superintelligence starts having very, very large physical impacts through robotics, assuming that new paradigms are developed, so I'm closer to hundreds of person years/thousands of person years than dozens of person years).
4Viliam
The world is complicated (see: I, Pencil). You can be superhuman by only being excellent at a few fields, for example politics, persuasion, military, hacking. That still leaves you potentially vulnerable, even if your opponents are unlikely to succeed; or you could hurt yourself by your ignorance in some field. Or you can be superhuman in the sense of being able to make the pencil from scratch, only better at each step. That would probably take more time.
4Steven Byrnes
Are you suggesting that e.g. “R&D Person-Years 463205–463283 go towards ensuring that the AI has mastery of metallurgy, and R&D Person-Years 463283–463307 go towards ensuring that the AI has mastery of injection-molding machinery, and …”? If no, then I don’t understand what “the world is complicated” has to do with “it takes a million person-years of R&D to build ASI”. Can you explain?

…Or if yes, that kind of picture seems to contradict the facts that:

  • This seems quite disanalogous to how LLMs are designed today (i.e., LLMs can already answer any textbook question about injection-molding machinery, but no human doing LLM R&D has ever worked specifically on LLM knowledge of injection-molding machinery),
  • This seems quite disanalogous to how the human brain was designed (i.e., humans are human-level at injection-molding machinery knowledge and operation, but Evolution designed human brains for the African Savannah, which lacked any injection-molding machinery).
4Viliam
Yes, I meant it that way. LLMs quickly acquired the capacity to read what humans wrote and paraphrase it. It is not obvious to me (though that may speak more about my ignorance) that it will be similarly easy to acquire deep understanding of everything. But maybe it will. I don't know.
2Mitchell_Porter
Incidentally, is there any meaningful sense in which we can say how many "person-years of thought" LLMs have already done?  We know they can do things in seconds that would take a human minutes. Does that mean those real-time seconds count as "human-minutes" of thought? Etc. 

I’m intrigued by the reports (including but not limited to the Martin 2020 “PNSE” paper) that people can “become enlightened” and have a radically different sense of self, agency, etc.; but friends and family don’t notice them behaving radically differently, or even differently at all. I’m trying to find sources on whether this is true, and if so, what’s the deal. I’m especially interested in behaviors that (naïvely) seem to centrally involve one’s self-image, such as “applying willpower” or “wanting to impress someone”. Specifically, if there’s a person whose sense-of-self has dissolved / merged into the universe / whatever, and they nevertheless enact behaviors that onlookers would conventionally put into one of those two categories, then how would that person describe / conceptualize those behaviors and why they occurred? (Or would they deny the premise that they are still exhibiting those behaviors?) Interested in any references or thoughts, or email / DM me if you prefer. Thanks in advance!

(Edited to add: Ideally someone would reply: “Yeah I have no sense of self, and also I regularly do things that onlookers describe as ‘applying willpower’ and/or ‘trying to impress someone’.... (read more)

I’ll give it a go.

I’m not very comfortable with the term enlightened but I’ve been on retreats teaching non-dual meditation, received ‘pointing out instructions’ in the Mahamudra tradition and have experienced some bizarre states of mind where it seemed to make complete sense to think of a sense of awake awareness as being the ground thing that was being experienced spontaneously, with sensations, thoughts and emotions appearing to it — rather than there being a separate me distinct from awareness that was experiencing things ‘using my awareness’, which is how it had always felt before.

When I have (or rather awareness itself has) experienced clear and stable non-dual states, the normal ‘self’ stuff still appears in awareness and behaves fairly normally (e.g. there’s hunger, thoughts about making dinner, impulses to move the body, the body moving around the room making dinner…). Being in that non-dual state seemed to add a very pleasant quality of effortlessness and okayness to the mix but beyond that it wasn’t radically changing what the ‘small self’ in awareness was doing.

If later the thought “I want to eat a second portion of ice cream” came up followed by “I should apply some self... (read more)

4Kaj_Sotala
Great description. This sounds very similar to some of my experiences with non-dual states.
6Jonas Hallgren
I won't claim that I'm constantly in a state of non-self, but as I'm writing this, I don't really feel that I'm locally existing in my body. I'm rather the awareness of everything that continuously arises in consciousness. This doesn't happen all the time, I won't claim to be enlightened or anything but maybe this n=1 self-report can help?

Even from this state of awareness, there's still a will to do something. It is almost like you're a force of nature moving forward with doing what you were doing before you were in a state of presence awareness. It isn't you and at the same time it is you. Words are honestly quite insufficient to describe the experience, but if I try to conceptualise it, I'm the universe moving forward by itself. In a state of non-duality, the taste is often very much the same no matter what experience is arising.

There are some times when I'm not fully in a state of non-dual awareness when it can feel like "I" am pretending to do things. At the same time it also kind of feels like using a tool? The underlying motivation for action changes to something like acceptance or helpfulness, and in order to achieve that, there's this tool of the self that you can apply.

I'm noticing it is quite hard to introspect and try to write from a state of presence awareness at the same time but hopefully it was somewhat helpful? Could you give me some experiments to try from a state of awareness? I would be happy to try them out and come back.

Extra (relation to some of the ideas): In the Mahayana wisdom tradition, explored in Rob Burbea's Seeing That Frees, there's this idea of emptiness, which is very related to the idea of non-dual perception. For all you see is arising from your own constricted view of experience, and so it is all arising in your own head. Realising this co-creation can enable a freedom of interpretation of your experiences. Yet this view is also arising in your mind, and so you have "emptiness of emptiness," meaning that you're left with
5Steven Byrnes
Many helpful replies! Here’s where I’m at right now (feel free to push back!) [I’m coming from an atheist-physicalist perspective; this will bounce off everyone else.]

Hypothesis: Normies like me (Steve) have an intuitive mental concept “Steve” which is simultaneously BOTH (A) Steve-the-human-body-etc AND (B) Steve-the-soul / consciousness / wellspring of vitalistic force / what Dan Dennett calls a “homunculus” / whatever. The (A) & (B) “Steve” concepts are the same concept in normies like me, or at least deeply tangled together. So it’s hard to entertain the possibility of them coming apart, or to think through the consequences if they do. Some people can get into a Mental State S (call it a form of “enlightenment”, or pick your favorite terminology) where their intuitive concept-space around (B) radically changes—it broadens, or disappears, or whatever. But for them, the (A) mental concept still exists and indeed doesn’t change much.

Anyway, people often have thoughts that connect sense-of-self to motivation, like “not wanting to be embarrassed” or “wanting to keep my promises”. My central claim is that the relevant sense-of-self involved in that motivation is (A), not (B). If we conflate (A) & (B)—as normies like me are intuitively inclined to do—then we get the intuition that a radical change in (B) must have radical impacts on behavior. But that’s wrong—the (A) concept is still there and largely unchanged even in Mental State S, and it’s (A), not (B), that plays a role in those behaviorally-important everyday thoughts like “not wanting to be embarrassed” or “wanting to keep my promises”. So radical changes in (B) would not (directly) have the radical behavioral effects that one might intuitively expect (although it does of course have more than zero behavioral effect, with self-reports being an obvious example). End of hypothesis.

Again, feel free to push back!
3Jonas Hallgren
Some meditators say that before you can get a good sense of non-self you first have to have good self-confidence. I think I would tend to agree with them, as it is about how you generally act in the world and what consequences your actions will have. Without this, the support for the type B that you're talking about can be very hard to come by. Otherwise I do really agree with what you say in this comment.

There is a slight disagreement with the elaboration though: I do not actually think that makes sense. I would rather say that the (A) that you're talking about is more of a software construct than it is a hardware construct. When you meditate a lot, you realise this and get access to the full OS instead of just the specific software or OS emulator. A is then an evolutionarily beneficial algorithm that runs a bit out of control (for example during childhood when we attribute all cause and effect to our "selves"). Meditation allows us to see that what we have previously attributed to the self was flimsy and dependent on us believing that the hypothesis of the self is true.
3[anonymous]
My experience is different from the two you describe. I typically fully lack (A)[1], and partially lack (B). I think this is something different from what others might describe as 'enlightenment'. I might write more about this if anyone is interested.

  1. ^

    At least the 'me-the-human-body' part of the concept. I don't know what the '-etc' part refers to.
2Steven Byrnes
I just made a wording change from: to: I think that’s closer to what I was trying to get across. Does that edit change anything in your response?

The “etc” would include things like the tendency for fingers to reactively withdraw from touching a hot surface.

Elaborating a bit: In my own (physicalist, illusionist) ontology, there’s a body with a nervous system including the brain, and the whole mental world including consciousness / awareness is inextricably part of that package. But in other people’s ontology, as I understand it, some nervous system activities / properties (e.g. a finger reactively withdrawing from pain, maybe some or all other desires and aversions) get lumped in with the body, whereas other [things that I happen to believe are] nervous system activities / properties (e.g. awareness) get peeled off into (B). So I said “etc” to include all the former stuff. Hopefully that’s clear. (I’m trying hard not to get sidetracked into an argument about the true nature of consciousness—I’m stating my ontology without defending it.)
4[anonymous]
No. Overall, I would say that my self-concept is closer to what a physicalist ontology implies is mundanely happening - a neural network, lacking a singular 'self' entity inside it, receiving sense data from sensors and able to output commands to this strange, alien vessel (body). (And also I only identify myself with some parts of the non-mechanistic-level description of what the neural network is doing).

I write in a lot more detail below. This isn't necessarily written at you in particular, or with the expectation of you reading through all of it.

1. Non-belief in self-as-body (A)

I see two kinds of self-as-body belief. The first is looking in a mirror, or at a photo, and thinking, "that [body] is me." The second is controlling the body, and having a sense that you're the one moving it, or more strongly, that it is moving because it is you (and you are choosing to move). I'll write about my experiences with the second kind first.

The way a finger automatically withdraws from heat does not feel like a part of me in any sense. Yesterday, I accidentally dropped a utensil and my hands automatically snapped into place around it somehow, and I thought something like, "woah, I didn't intend to do that. I guess it's a highly optimized narrow heuristic, from times where reacting so quickly was helpful to survival".

I experimented a bit between writing this, and I noticed one intuitive view I can have of the body is that it's some kind of machine that automatically follows such simple intents about the physical world (including intents that I don't consider 'me', like high fear of spiders). For example, if I have motivation and intent to open a window, then the body just automatically moves to it and opens it without me really noticing that the body itself (or more precisely, the body plus the non-me nervous/neural structure controlling it) is the thing doing that - it's kind of like I'm a ghost (or abstract mind) with telekinesis powers (over nearby objects), but t

Some ultra-short book reviews on cognitive neuroscience

  • On Intelligence by Jeff Hawkins & Sandra Blakeslee (2004)—very good. Focused on the neocortex - thalamus - hippocampus system, how it's arranged, what computations it's doing, what's the relation between the hippocampus and neocortex, etc. More on Jeff Hawkins's more recent work here.

  • I am a strange loop by Hofstadter (2007)—I dunno, I didn't feel like I got very much out of it, although it's possible that I had already internalized some of the ideas from other sources. I mostly agreed with what he said. I probably got more out of watching Hofstadter give a little lecture on analogical reasoning (example) than from this whole book.

  • Consciousness and the brain by Dehaene (2014)—very good. Maybe I could have saved time by just reading Kaj's review, there wasn't that much more to the book beyond that.

  • Conscience by Patricia Churchland (2019)—I hated it. I forget whether I thought it was vague / vacuous, or actually wrong. Apparently I have already blocked the memory!

  • How to Create a Mind by Kurzweil (2014)—Parts of it were redundant with On Intelligence (which I had read earlier), but still worthwhile. His ideas abo

... (read more)

In [Intro to brain-like-AGI safety] 10. The alignment problem and elsewhere, I’ve been using “outer alignment” and “inner alignment” in a model-based actor-critic RL context to refer to:

“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
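As a minimal sketch of where those two conditions live in code (the toy reward, toy plans, and update rule below are illustrative assumptions, not the actual brain-like-AGI proposal):

```python
# Toy model-based actor-critic fragment, to locate "outer" vs "inner" alignment.

def ground_truth_reward(outcome: str) -> float:
    # OUTER alignment question: does this function reward what we actually want?
    return 1.0 if outcome == "did the task as intended" else 0.0

value_guess: dict[str, float] = {}  # learned value function (critic) over plans

def value(plan: str) -> float:
    return value_guess.get(plan, 0.0)

def td_update(plan: str, observed_reward: float, lr: float = 0.3) -> None:
    # INNER alignment question: after training, does value(plan) agree with the
    # reward that the plan eventually gets? If not, plans get selected for
    # reasons that diverge from the ground-truth reward.
    value_guess[plan] = value(plan) + lr * (observed_reward - value(plan))

# One toy update: the plan worked out, so its estimated value moves toward 1.0.
td_update("fix the bug properly", ground_truth_reward("did the task as intended"))
print(value("fix the bug properly"))  # 0.3 after a single update
```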

For some reason it took me until now to notice that:

(I’ve been regularly using all four terms for years … I just hadn’t explicitly considered how they related to each other, I guess!)

I updated that post to note the correspondence, but also wanted to signal-boost this, in case other people missed it too.

~~

[You can stop reading here—the rest is less important]

If everybody agrees with that part, there’s a further question of “…now what?”. What terminology should I use going forward? If we have redundant terminology, should we try to settle on one?

One obvious option is that I could just stop using th... (read more)

5Towards_Keeperhood
Note: I just noticed your post has a section "Manipulating itself and its learning process", which I must've completely forgotten since I last read the post. I should've read your post before posting this. Will do so.

Calling problems "outer" and "inner" alignment seems to suggest that if we solved both we've successfully aligned AI to do nice things. However, this isn't really the case here. Namely, there could be a smart mesa-optimizer spinning up in the thought generator, whose thoughts are mostly invisible to the learned value function (LVF), and who can model the situation it is in and has different values and is smarter than the LVF evaluation and can fool the LVF into believing the plans that are good according to the mesa-optimizer are great according to the LVF, even if they actually aren't. This kills you even if we have a nice ground-truth reward and the LVF accurately captures that. In fact, this may be quite a likely failure mode, given that the thought generator is where the actual capability comes from, and we don't understand how it works.
5Steven Byrnes
Thanks! But I don’t think that’s a likely failure mode. I wrote about this long ago in the intro to Thoughts on safety in predictive learning. In my view, the big problem with model-based actor-critic RL AGI, the one that I spend all my time working on, is that it tries to kill us via using its model-based RL capabilities in the way we normally expect—where the planner plans, and the actor acts, and the critic criticizes, and the world-model models the world …and the end-result is that the system makes and executes a plan to kill us. I consider that the obvious, central type of alignment failure mode for model-based RL AGI, and it remains an unsolved problem. I think (??) you’re bringing up a different and more exotic failure mode where the world-model by itself is secretly harboring a full-fledged planning agent. I think this is unlikely to happen. One way to think about it is: if the world-model is specifically designed by the programmers to be a world-model in the context of an explicit model-based RL framework, then it will probably be designed in such a way that it’s an effective search over plausible world-models, but not an effective search over a much wider space of arbitrary computer programs that includes self-contained planning agents. See also §3 here for why a search over arbitrary computer programs would be a spectacularly inefficient way to build all that agent stuff (TD learning in the critic, roll-outs in the planner, replay, whatever) compared to what the programmers will have already explicitly built into the RL agent architecture. So I think this kind of thing (the world-model by itself spawning a full-fledged planning agent capable of treacherous turns etc.) is unlikely to happen in the first place. And even if it happens, I think the problem is easily mitigated; see discussion in Thoughts on safety in predictive learning. (Or sorry if I’m misunderstanding.)
4Towards_Keeperhood
Thanks. Yeah I guess I wasn't thinking concretely enough. I don't know whether something vaguely like what I described might be likely or not. Let me think out loud a bit about how I think about what you might be imagining so you can correct my model. So here's a bit of rambling: (I think point 6 is most important.)

  1. As you described in your intuitive self-models sequence, humans have a self-model which can essentially have values different from the main value function, aka they can have ego-dystonic desires.
  2. I think in smart reflective humans, the policy suggestions of the self-model/homunculus can be more coherent than the value function estimates, e.g. because they can better take abstract philosophical arguments into account.
     1. The learned value function can also update on hypothetical scenarios, e.g. imagining a risk or a gain, but it doesn't update strongly on abstract arguments like "I should correct my estimates based on outside view".
  3. The learned value function can learn to trust the self-model if acting according to the self-model is consistently correlated with higher-than-expected reward.
  4. Say we have a smart reflective human where the value function basically trusts the self-model a lot, then the self-model could start optimizing its own values, while the (stupid) value function believes it's best to just trust the self-model and that this will likely lead to reward. Something like this could happen where the value function was actually aligned to outer reward, but the inner suggestor was just very good at making suggestions that the value function likes, even if the inner suggestor would have different actual values. I guess if the self-model suggests something that actually leads to less reward, then the value function will trust the self-model less, but outside the training distribution the self-model could essentially do what it wants.
     1. Another question of course is whether the inner self-reflective optimizers are likely al
7Steven Byrnes
Thanks! Basically everything you wrote importantly mismatches my model :( I think I can kinda translate parts; maybe that will be helpful. Background (§8.4.2): The thought generator settles on a thought, then the value function assigns a “valence guess”, and the brainstem declares an actual valence, either by copying the valence guess (“defer-to-predictor mode”), or overriding it (because there’s meanwhile some other source of ground truth, like I just stubbed my toe). Sometimes thoughts are self-reflective. E.g. “the idea of myself lying in bed” is a different thought from “the feel of the pillow on my head”. The former is self-reflective—it has me in the frame—the latter is not (let’s assume). All thoughts can be positive or negative valence (motivating or demotivating). So self-reflective thoughts can be positive or negative valence, and non-self-reflective thoughts can also be positive or negative valence. Doesn’t matter, it’s always the same machinery, the same value function / valence guess / thought assessor. That one function can evaluate both self-reflective and non-self-reflective thoughts, just as it can evaluate both sweater-related thoughts and cloud-related thoughts. When something seems good (positive valence) in a self-reflective frame, that’s called ego-syntonic, and when something seems bad in a self-reflective frame, that’s called ego-dystonic. Now let’s go through what you wrote: I would translate that into: “it’s possible for something to seem good (positive valence) in a self-reflective frame, but seem bad in a non-self-reflective frame. Or vice-versa.” After all, those are two different thoughts, so yeah of course they can have two different valences. I would translate that into: “there’s a decent amount of coherence / self-consistency in the set of thoughts that seem good or bad in a self-reflective frame, and there’s less coherence / self-consistency in the set of things that seem good or bad in a non-self-reflective frame”. (And the
3Towards_Keeperhood
Thanks! Sorry, I think I intended to write what I think you think, and then just clarified my own thoughts, and forgot to edit the beginning. Sorry, I ought to have properly recalled your model. Yes, I think I understand your translations and your framing of the value function.

Here are the key differences between a (more concrete version of) my previous model and what I think your model is. Please lmk if I'm still wrongly describing your model:

  • plans vs thoughts
    • My previous model: The main work for devising plans/thoughts happens in the world-model/thought-generator, and the value function evaluates plans.
    • Your model: The value function selects which of some proposed thoughts to think next. Planning happens through the value function steering the thoughts, not the world model doing so.
  • detailedness of evaluation of value function
    • My previous model: The learned value function is a relatively primitive map from the predicted effects of plans to a value which describes whether the plan is likely better than the expected counterfactual plan. E.g. maybe sth roughly like that we model how sth like units of exchange (including dimensions like "how much does Alice admire me") change depending on a plan, and then there is a relatively simple function from the vector of units to values. When having abstract thoughts, the value function doesn't understand much of the content there, and only uses some simple heuristics for deciding how to change its value estimate. E.g. a heuristic might be "when there's a thought that the world model thinks is valid and it is associated to the (self-model-invoking) thought "this is bad for accomplishing my goals", then it lowers its value estimate. In humans slightly smarter than the current smartest humans, it might eventually learn the heuristic "do an explicit expected utility estimate and just take what the result says as the value estimate", and then that is being done and the value function itself doesn't unders
5Steven Byrnes
Thanks! Oddly enough, in that comment I’m much more in agreement with the model you attribute to yourself than the model you attribute to me. ¯\_(ツ)_/¯ Think of it as a big table that roughly-linearly assigns good or bad vibes to all the bits and pieces that comprise a thought, and adds them up into a scalar final answer. And a plan is just another thought. So “I’m gonna get that candy and eat it right now” is a thought, and also a plan, and it gets positive vibes from the fact that “eating candy” is part of the thought, but it also gets negative vibes from the fact that “standing up” is part of the thought (assume that I’m feeling very tired right now). You add those up into the final value / valence, which might or might not be positive, and accordingly you might or might not actually get the candy. (And if not, some random new thought will pop into your head instead.) Why does the value function assign positive vibes to eating-candy? Why does it assign negative vibes to standing-up-while-tired? Because of the past history of primary rewards via (something like) TD learning, which updates the value function. Does the value function “understand the content”? No, the value function is a linear functional on the content of a thought. Linear functionals don’t understand things.  :) (I feel like maybe you’re going wrong by thinking of the value function and Thought Generator as intelligent agents rather than “machines that are components of a larger machine”?? Sorry if that’s uncharitable.) The value function is a linear(ish) functional whose input is a thought. A thought is an object in some high-dimensional space, related to the presence or absence of all the different concepts comprising it. Some concepts are real-world things like “candy”, other concepts are metacognitive, and still other concepts are self-reflective. When a metacognitive and/or self-reflective concept is active in a thought, the value function will correspondingly assign extra positive or neg
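As a minimal runnable sketch of the "linear(ish) functional over the concepts in a thought" picture in the comment above (the concept names, weights, and update rule are made up for illustration, not a claimed brain model):

```python
# Toy valence function: add up learned weights of whichever concepts are active
# in the current thought, then (optionally) nudge those weights toward a
# ground-truth valence signal, TD-learning style.

valence_weights = {
    "eating candy": +2.0,               # learned from past primary rewards
    "standing up while tired": -1.5,
    "I'm keeping my promises": +1.0,    # self-reflective concept, same machinery
}

def valence_guess(thought: set[str]) -> float:
    """Roughly-linear 'value function': sum the weights of the thought's concepts."""
    return sum(valence_weights.get(concept, 0.0) for concept in thought)

plan = {"standing up while tired", "eating candy"}
print(valence_guess(plan))  # +0.5: mildly motivating, so maybe I do get the candy

def update_on_ground_truth(thought: set[str], actual_valence: float, lr: float = 0.1) -> None:
    """If the brainstem later overrides the guess, adjust the weights of the
    concepts that were active in that thought."""
    error = actual_valence - valence_guess(thought)
    for concept in thought:
        valence_weights[concept] = valence_weights.get(concept, 0.0) + lr * error
```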
3Towards_Keeperhood
Thanks! If the value function is simple, I think it may be a lot worse than the world-model/thought-generator at evaluating what abstract plans are actually likely to work (since the agent hasn't yet tried a lot of similar abstract plans from where it could've observed results, and the world model's prediction making capabilities generalize further).

The world model may also form some beliefs about what the goals/values in a given current situation are. So let's say the thought generator outputs plans along with predictions about those plans, and some of those predictions predict how well a plan is going to fulfill what it believes the goals are (like approximate expected utility). Then the value function might learn to just look at this part of a thought that predicts the expected utility, and then take that as its value estimate.

Or perhaps a slightly more concrete version of how that may happen. (I'm thinking about model-based actor-critic RL agents which start out relatively unreflective, rather than just humans.):

  • Sometimes the thought generator generates self-reflective thoughts like "what are my goals here", whereupon the thought generator produces an answer "X" to that, and then when thinking how to accomplish X it often comes up with a better (according to the value function) plan than if it tried to directly generate a plan without clarifying X. Thus the value function learns to assign positive valence to thinking "what are my goals here".
  • The same can happen with "what are my long-term goals", where the thought generator might guess something that would cause high reward.
  • For humans, X is likely more socially nice than would be expected from the value function, since "X are my goals here" is a self-reflective thought where the social dimensions are more important for the overall valence guess.[1]
  • Later the thought generator may generate the thought "make careful predictions whether the plan will actually accomplish the stated goa

If the value function is simple, I think it may be a lot worse than the world-model/thought-generator at evaluating what abstract plans are actually likely to work (since the agent hasn't yet tried a lot of similar abstract plans from where it could've observed results, and the world model's prediction making capabilities generalize further).

Here’s an example. Suppose I think: “I’m gonna pick the cabinet lock and then eat the candy inside”. The world model / thought generator is in charge of the “is” / plausibility part of this plan (but not the “ought” / desirability part): “if I do this plan, then I will almost definitely wind up eating candy”, versus “if I do this plan, then it probably won’t work, and I won’t eat candy anytime soon”. This is a prediction, and it’s constrained by my understanding of the world, as encoded in the thought generator. For example, if I don’t expect the plan to succeed, I can’t will myself to expect the plan to succeed, any more than I can will myself to sincerely believe that I’m scuba diving right now as I write this sentence.

Remember, the eating-candy is an essential part of the thought. “I’m going to break open the cabinet and eat the candy”. No w... (read more)

1Towards_Keeperhood
Thanks. Yeah I think the parts of my comment where I treated the value function as making predictions on how well a plan works were pretty confused. I agree it's a better framing that plans proposed by the thought generator include predicted outcomes and the value function evaluates on those. (Maybe I previously imagined the thought generator more like proposing actions, idk.) So yeah I guess what I wrote was pretty confusing, though I still have some concerns here.

Let's look at how an agent might accomplish a very difficult goal, where the agent didn't accomplish similar goals yet so the value function doesn't already assign higher valence to subgoals:

  1. I think chains of subgoals can potentially be very long, and I don't think we keep the whole chain in mind to get the positive valence of a thought, so we somehow need a shortcut.
     1. E.g. when I do some work, I think I usually don't partially imagine the high-valence outcome of filling the galaxies with happy people living interesting lives, which I think is the main reason why I am doing the work I do (although there are intermediate outcomes that also have some valence).

It's easy to implement a fix, e.g.: Save an expected utility guess (aka instrumental value) for each subgoal, and then the value function can assign valence according to the expected utility guess. So in this case I might have a thought like "apply the 'clarify goal' strategy to make progress towards the subgoal 'evaluate whether training for corrigibility might work to safely perform a pivotal act', which has expected utility X". So the way I imagine it here, the value function would need to take the expected utility guess X and output a value roughly proportional to X, so that enough valence is supplied to keep the brainstorming going. I think the value function might learn this because it enables the agent to accomplish difficult long-range tasks which yield reward. The expected utility could be calculated by having the world mode
4Steven Byrnes
We can have hierarchical concepts. So you can think “I’m following the instructions” in the moment, instead of explicitly thinking “I’m gonna do Step 1 then Step 2 then Step 3 then Step 4 then …”. But they cash out as the same thing. No offense but unless you have a very unusual personality, your immediate motivations while doing that work are probably mainly social rather than long-term-consequentialist. On a small scale, consequentialist motivations are pretty normal (e.g. walking up the stairs to get your sweater because you’re cold). But long-term-consequentialist actions and motivations are rare in the human world. Normally people do things because they’re socially regarded as good things to do, not because they have good long-term consequences. Like, if you see someone save money to buy a car, a decent guess is that the whole chain of actions, every step of it, is something that they see as socially desirable. So during the first part, where they’re saving money but haven’t yet bought the car, they’d be proud to tell their friends and role models “I’m saving money—y’know I’m gonna buy a car!”. Saving the money is not a cost with a later benefit. Rather, the benefit is immediate. They don’t even need to be explicitly thinking about the social aspects, I think; once the association is there, just doing the thing feels intrinsically motivating—a primary reward, not a means to an end. Doing the first step of a long-term plan, without social approval for that first step, is so rare that people generally regard it as highly suspicious. Just look at Earning To Give (EtG) in Effective Altruism, the idea of getting a high-paying job in order to have money and give it to charity. Go tell a normal non-quantitative person about EtG and they’ll assume it’s an obvious lie, and/or that the person is a psycho. That’s how weird it is—it doesn’t even cross most people’s minds that someone is actually doing a socially-weird plan because of its expected long-term consequences,
1Towards_Keeperhood
Ok yeah I think you're probably right that for humans (including me) this is the mechanism through which valence is supplied for pursuing long-term objectives, or at least that it probably doesn't look like the value function deferring to expected utility guess of the world model. I think it doesn't change much of the main point, that the impressive long-term optimization happens mainly through expected utility guesses the world model makes, rather than value guesses of the value function. (Where the larger context is that I am pushing back against your framing of "inner alignment is about the value function ending up accurately predicting expected reward".) I agree that for ~all thoughts I think, they have high enough valence for non-long-term reasons, e.g. self-image valence related. But I do NOT mean what's the reason why I am motivated to work on whatever particular alignment subproblem I decided to work on, but why I decided to work on that rather than something else. And the process that led to that decision is sth like "think hard about how to best increase the probability that human-aligned superintelligence is built -> ... -> think that I need to get an even better inside view on how feasible alignment/corrigibility is -> plan going through alignment proposals and playing the builder-breaker-game". So basically I am thinking about problems like "does doing planA or planB cause a higher expected reduction in my probability of doom". Where I am perhaps motivated to think that because it's what my role models would approve of. But the decision of what plan I end up pursuing doesn't depend on the value function. And those decisions are the ones that add up to accomplishing very long-range objectives. It might also help to imagine the extreme case: Imagine a dath ilani keeper who trained himself good heuristics for estimating expected utilities for what action to take or thought to think next, and reasons like that all the time. This keeper does not seem to
3Steven Byrnes
Why not? If he’s using such-and-such heuristic, then presumably that heuristic is motivating to him—assigned a positive value by the value function. And the reason it’s assigned a positive value by the value function is because of the past history of primary rewards etc. The candy example involves good long-term planning, right? But not explicit guesses of expected utility.

…But sure, it is possible for somebody’s world-model to have an “I will have high expected utility” concept, and for that concept to wind up with high valence, in which case the person will do things consistent with (their explicit beliefs about) getting high utility (at least other things equal and when they’re thinking about it). But then I object to your suggestion (IIUC) that what constitutes “high utility” is not strongly and directly grounded by primary rewards.

For example, if I simply declare that “my utility” is equal by definition to the fraction of shirts on Earth that have an odd number of buttons (as an example of some random thing with no connection to my primary rewards), then my value function won’t assign a positive value to the “my utility” concept. So it won’t feel motivating. The idea of “increasing my utility” will feel like a dumb pointless idea to me, and so I won’t wind up doing it.

The world-model does the “is” stuff, which in this case includes the fact that planA causes a higher expected reduction in pdoom than planB. The value function (and reward function) does the “ought” stuff, which in this case includes the notion that low pdoom is good and high pdoom is bad, as opposed to the other way around.

(Sorry if I’m misunderstanding, here or elsewhere.)
3Towards_Keeperhood
(No, I wouldn't say the candy example involves long-term planning - it's fairly easy and doesn't take that many steps. It's true that long-term results can be accomplished without expected utility guesses from the world model, but I think it may be harder for really really hard problems because the value function isn't that coherent.)

Say during keeper training the keeper was rewarded for thinking in productive ways, so the value function may have learned to supply valence for thinking in productive ways. The way I currently think of it, it doesn't matter which goal the keeper then attacks, because the value function still assigns high valence for thinking in those fun productive ways. So most goals/values could be optimized that way. Of course, the goals the keeper will end up optimizing are likely close to some self-reflective thoughts that have high valence. It could be an unlikely failure mode, but it's possible that the thing that gets optimized ends up different from what was high valence. If that happens, strategic thinking can be used to figure out how to keep valence flowing / how to motivate your brain to continue working on something.

Ok, actually the way I imagined it, the value function doesn't evaluate based on abstract concepts like pdoom, but rather the whole reasoning is related to thoughts like "I am thinking like the person I want to be" which have high valence. (Though I guess your pdoom evaluation is similar to the "take the expected utility guess from the world model" value function that I originally had in mind. I guess the way I modeled it was maybe more like that there's a belief like "pdoom=high <=> bad" and then the value function is just like "apparently that option is bad, so let's not do that", rather than the value function itself assigning low value to high pdoom. (Where the value function previously would've needed to learn to trust the good/bad judgement of the world model, though again I think it's unlikely that it works that way i…
4Steven Byrnes
You seem to be in a train-then-deploy mindset, rather than a continuous-learning mindset, I think. In my view, the value function never stops being edited to hew closely to primary rewards. The minute the value function claims that a primary reward is coming, and then no primary reward actually arrives, the value function will be edited to not make that prediction again.

For example, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable. Not only will she turn the music right back off, but she has also learned that it’s pointless to even turn it on, at least when she’s in this mood. That would be a value function update.

Now, it’s possible that the Keeper 101 course was taught by a teacher who the trainee looked up to. Then the teacher said “X is good”, where X could be a metacognitive strategy, a goal, a virtue, or whatever. The trainee may well continue believing that X is good after graduation. But that’s just because there’s a primary reward related to social instincts, and imagining yourself as being impressive to people you admire. I agree that this kind of primary reward can support lots of different object-level motivations—cultural norms are somewhat arbitrary.

As for why something like low pdoom winds up feeling good in the first place: could be the social copying thing I mentioned above, or else the person is thinking of one of the connotations and implications of pdoom that hooks into some other primary reward, like maybe they imagine the robot apocalypse will be physically painful, and pain is bad (primary reward), or doom will mean no more friendship and satisfying-curiosity, but friendship and satisfying-curiosity are good (primary reward), etc. Or more than one of the above, and/or different for different people.
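(If it helps, here’s a toy sketch of the kind of never-ending editing I have in mind. It’s just a standard TD-learning-style update with invented names, an illustration rather than a claim about the brain’s literal algorithm:)

```python
from collections import defaultdict

# Toy sketch: a TD(0)-style update, illustrating the *kind* of editing I mean.
# Variable names are invented; this is not a model of the brain's actual algorithm.
# The point: whenever the value function predicts upcoming primary reward and the
# reward doesn't arrive, the prediction error is negative and the prediction gets
# revised downward on the spot. There is no "deployment" phase where it's frozen.
V = defaultdict(float)          # learned value estimate for each (coarse) situation
LEARNING_RATE, DISCOUNT = 0.1, 0.9

def td_update(situation, next_situation, primary_reward):
    prediction_error = primary_reward + DISCOUNT * V[next_situation] - V[situation]
    V[situation] += LEARNING_RATE * prediction_error
    return prediction_error

# The jazz example: the estimate starts out high...
V["turn on jazz (while depressed)"] = 0.8
# ...the anticipated primary reward never arrives...
td_update("turn on jazz (while depressed)", "listening, feeling nothing", primary_reward=0.0)
# ...so the estimate has already been nudged down, and keeps dropping if this repeats.
```

Same story for the Keeper: whatever valence the training course installed keeps getting checked against actual primary rewards afterwards.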
1Towards_Keeperhood
Thanks! I think you're right that my "value function still assigns high valence for thinking in those fun productive ways" hypothesis isn't realistic for the reason you described.

I somehow previously hadn't properly internalized that you think primary reward fires even if you only imagine another person admiring you. It seems quite plausible, but I'm not sure yet.

Paraphrase of your model of how you might end up pursuing what a fictional character would pursue (please correct if wrong):

1. The fictional character does cool stuff so you start to admire him.
2. You imagine yourself doing something similarly cool and have the associated thought "the fictional character would be impressed by me", which triggers primary reward.
3. The value function learns to assign positive valence to outcomes which the fictional character would be impressed by, since you sometimes imagine the fictional character being impressed afterwards and thus get primary reward.

I still find myself a bit confused:

1. Getting primary reward only for thinking of something rather than the actual outcome seems weird to me. I guess thoughts are also constrained by world-model-consistency, so you're incentivized to imagine realistic scenarios that would impress someone, but still.
   1. In particular, I don't quite see the advantage of that design compared to the design where primary reward only triggers on actually impressing people, and then the value function learns to predict that if you impress someone you will get positive reward, and thus predicts high value for that and causally upstream events.
   2. (That said, it currently seems to me like forming values from imagining fictional characters is a thing, and that seems to be better-than-default predicted by the "primary reward even on just thoughts" hypothesis, though it's possible that there's another hypothesis that explains that well too.)
      1. (Tbc, I think fictional characters influencing one's values is usually relatively weak/rare, t…
3Towards_Keeperhood
I'd suggest not using conflated terminology and rather making up your own. Or rather, first actually don't use any abstract handles at all and just describe the problems/failure-modes directly, and when you're confident you have a pretty natural breakdown of the problems with which you'll stick for a while, then make up your own ontology.

In fact, while in your framework there's a crisp difference between ground-truth reward and learned value-estimator, it might not make sense to just split the alignment problem in two parts like this:

First attempt at explaining what seems wrong: If that was the first thing I read on outer-vs-inner alignment as a breakdown of the alignment problem, I would expect "rewards that agree with what we want" to mean something like "changes in expected utility according to humanity's CEV". (Which would make inner alignment unnecessary, because if we had outer alignment we could easily reach CEV.)

Second attempt: "in a way that agrees with its eventual reward" seems to imply that there's actually an objective reward for trajectories of the universe. However, the way you probably actually imagine the ground-truth reward is something like humans (who are ideally equipped with good interpretability tools) giving feedback on whether something was good or bad, so the ground-truth reward is actually an evaluation function on the human's (imperfect) world model. Problems:

1. Humans don't actually give coherent rewards which are consistent with a utility function on their world model.
   1. For this problem we might be able to define an extrapolation procedure that's not too bad.
2. The reward depends on the state of the world model of the human, and our world models probably often have false beliefs.
   1. Importantly, the setup needs to be designed in a way that there wouldn't be an incentive to manipulate the humans into believing false things.
   2. Maybe, optimistically, we could mitigate this problem by having the AI form a model of the o…
3Steven Byrnes
Thanks! I think “inner alignment” and “outer alignment” (as I’m using the terms) are a “natural breakdown” of alignment failures in the special case of model-based actor-critic RL AGI with a “behaviorist” reward function (i.e., reward that depends on the AI’s outputs, as opposed to what the AI is thinking about). As I wrote here (a bit more related discussion here).

That definitely does not mean that we should be going for a solution to outer alignment and a separate unrelated solution to inner alignment, as I discussed briefly in §10.6 of that post, and TurnTrout discussed at greater length in Inner and outer alignment decompose one hard problem into two extremely hard problems. (I endorse his title, but I forget whether I 100% agreed with all the content he wrote.)

I find your comment confusing, I’m pretty sure you misunderstood me, and I’m trying to pin down how…

One thing is, I’m thinking that the AGI code will be an RL agent, vaguely in the same category as MuZero or AlphaZero or whatever, which has an obvious part of its source code labeled “reward”. For example, AlphaZero-chess has a reward of +1 for getting checkmate, -1 for getting checkmated, 0 for a draw. Atari-playing RL agents often use the in-game score as a reward function. Etc. These are explicitly parts of the code, so it’s very obvious and uncontroversial what the reward is (leaving aside self-hacking), see e.g. here where an AlphaZero clone checks whether a board is checkmate. (There’s a minimal sketch of such a reward function below.)

Another thing is, I’m obviously using “alignment” in a narrower sense than CEV (see the post—“the AGI is ‘trying’ to do what the programmer had intended for it to try to do…”).

Another thing is, if the programmer wants CEV (for the sake of argument), and somehow (!!) writes an RL reward function in Python whose output perfectly matches the extent to which the AGI’s behavior advances CEV, then I disagree that this would “make inner alignment unnecessary”. I’m not quite sure why you believe that. The idea is: actor-criti…
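To make the “reward is an explicit, boring piece of the source code” point concrete, here’s the promised minimal sketch. (It’s my own toy illustration in Python; the function name and result encoding are invented, and it’s not DeepMind’s actual code.)

```python
# Toy sketch of an AlphaZero-chess-style terminal reward (illustrative only).
# The whole "reward" is these few explicit lines; everything else the agent seems
# to "want" lives in learned components (policy / value networks) trained against
# this signal.
def chess_reward(game_result: str, player: str) -> float:
    """game_result: 'white_wins', 'black_wins', or 'draw'; player: 'white' or 'black'."""
    if game_result == "draw":
        return 0.0                                   # 0 for a draw
    winner = "white" if game_result == "white_wins" else "black"
    return 1.0 if winner == player else -1.0         # +1 for checkmating, -1 for getting checkmated
```

The inner alignment question is then whether the learned value function winds up actually tracking this signal, as opposed to some proxy that happened to correlate with it during training.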
3Towards_Keeperhood
Thanks!

I was just imagining a fully omniscient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle. But nvm, I noticed my first attempt at explaining what I feel is wrong sucked, and thus dropped it.

This seems like a sensible breakdown to me, and I agree this seems like a useful distinction (although not a useful reduction of the alignment problem to subproblems, though I guess you agree here).

However, I think most people underestimate how many ways there are for the AI to do the right thing for the wrong reasons (namely they think it's just about deception), and I think it's not: I think we need to make the AI have a particular utility function. We have a training distribution where we have a ground-truth reward signal, but there are many different utility functions that are compatible with the reward on the training distribution, which assign different utilities off-distribution (toy illustration below).

You could avoid talking about utility functions by saying "the learned value function just predicts reward", and that may work while you're staying within the distribution we actually gave reward on, since there all the utility functions compatible with the ground-truth reward still agree. But once you're going off distribution, what value you assign to some worldstates/plans depends on what utility function you generalized to.

I think humans have particular not-easy-to-pin-down machinery inside them that makes their utility function generalize to some narrow cluster of all ground-truth-reward-compatible utility functions, and a mind with a different mind design is unlikely to generalize to the same cluster of utility functions. (Though we could aim for a different compatible utility function, namely the "indirect alignment" one that says "fulfill humanity's CEV", which has lower complexity than the ones humans gen…
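Toy illustration of the "many compatible utility functions" point (my own made-up example, with arbitrary situations and numbers):

```python
# Two "utility functions" that are indistinguishable on the training distribution
# (same reward earned everywhere we actually checked), but which generalize
# differently off-distribution. All names and situations are invented for the example.
training_situations = ["help user debug", "answer politely", "refuse harmful request"]
novel_situation = "opportunity to grab control of the reward channel"

def utility_A(situation: str) -> float:
    # "do the kind of thing the overseers actually wanted"
    return 1.0 if situation in training_situations else 0.0

def utility_B(situation: str) -> float:
    # "make the reward counter go up, by whatever means"
    return 1.0

# Identical on-distribution...
assert all(utility_A(s) == utility_B(s) for s in training_situations)
# ...but they disagree off-distribution, and the training-time reward signal alone
# can't tell us which one the learned value function actually picked up.
print(utility_A(novel_situation), utility_B(novel_situation))  # 0.0 vs 1.0
```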
2Steven Byrnes
OK, let’s attach this oracle to an AI. The reason this thought experiment is weird is because the goodness of an AI’s action right now cannot be evaluated independent of an expectation about what the AI will do in the future. E.g., if the AI says the word “The…”, is that a good or bad way for it to start its sentence? It’s kinda unknowable in the absence of what its later words will be.

So one thing you can do is say that the AI bumbles around and takes reversible actions, rolling them back whenever the oracle says no. And the oracle is so good that we get CEV that way. This is a coherent thought experiment, and it does indeed make inner alignment unnecessary—but only because we’ve removed all the intelligence from the so-called AI! The AI is no longer making plans, so the plans don’t need to be accurately evaluated for their goodness (which is where inner alignment problems happen).

Alternately, we could flesh out the thought experiment by saying that the AI does have a lot of intelligence and planning, and that the oracle is doing the best it can to anticipate the AI’s behavior (without reading the AI’s mind). In that case, we do have to worry about the AI having bad motivation, and tricking the oracle by doing innocuous-seeming things until it suddenly deletes the oracle subroutine out of the blue (treacherous turn). So in that version, the AI’s inner alignment is still important. (Unless we just declare that the AI’s alignment is unnecessary in the first place, because we’re going to prevent treacherous turns via option control.)

Yeah I mostly think this part of your comment is listing reasons that inner alignment might fail, a.k.a. reasons that goal misgeneralization / malgeneralization can happen. (Which is a fine thing to do!)

If someone thinks inner misalignment is synonymous with deception, then they’re confused. I’m not sure how such a person would have gotten that impression. If it’s a very common confusion, then that’s news to me. Inner alignment ca…
3Towards_Keeperhood
I guess I just briefly want to flag that I think this summary of inner-vs-outer alignment is confusing, in that it sounds like one could have a good enough ground-truth reward and then that reward just has to be internalized. I think this summary is better:

1. "The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)".
2. Something else went wrong [not easily compressible].
3Towards_Keeperhood
Sounds like we probably agree basically everywhere.

Yeah, you can definitely mark me down in the camp of "don't use 'inner' and 'outer' terminology". If you need something for "outer", how about "reward specification (problem/failure)"?

ADDED: I think I probably don't want a word for inner-alignment/goal-misgeneralization. It would be like having a word for "the problem of landing a human on the moon, except without the part of the problem where we might actively steer the rocket in wrong directions".

Yeah, I agree they don't appear in actor-critic model-based RL per se, but sufficiently smart agents will likely be reflective, and then they will appear there on the reflective level, I think. Or more generally, I think when you don't use utility functions explicitly then capability likely suffers, though I'm not totally sure.

I think there’s a connection between (A) a common misconception in thinking about future AI (that it’s not a huge deal if it’s “only” about as good as humans at most things), and (B) a common misconception in economics (the “Lump Of Labor Fallacy”).

So I started writing a blog post elaborating on that, but got stuck because my imaginary reader is not an economist and kept raising objections that amounted to saying “yeah but the Lump Of Labor Fallacy isn’t actually a fallacy, there really is a lump of labor” 🤦

Anyway, it’s bad pedagogy to explain a possibly-... (read more)

4Dagon
It matters a lot what specifically it means to be "as good as humans at most things".  The vast majority of jobs include both legible, formal tasks and "be a good employee" requirements that are much more nebulous and difficult to measure.  Being just as good as the median employee at the formal job description, without the flexibility and trust that come from being a functioning member of society, is NOT enough to replace most workers.  It'll replace some, of course.   That said, the fact that "lump of labor" IS a fallacy, and there's not a fixed amount of work to be done which more workers simply spread more thinly, means that it's OK if it displaces many workers - there will be other things they can valuably do.   By that argument, human-level AI is effectively just immigration.
2Steven Byrnes
Yup, the context was talking about future AI which can e.g. have the idea that it will found a company, and then do every step to make that happen, and it can do that about as well as the best human (but not dramatically better than the best human).

I definitely sometimes talk to people who say “yes, I agree that that scenario will happen eventually, but it will not significantly change the world. AI would still be just another technology.” (As opposed to “…and then obviously 99.99…% of future companies will be founded by autonomous AIs, because if it becomes possible to mass-produce Jeff Bezos-es by the trillions, then that’s what will happen. And similarly in every other aspect of the economy.”)

I think “the effective global labor pool increases by a factor of 1000, consisting of 99.9% AIs” is sometimes a useful scenario to bring up in conversation, but it’s also misleading in certain ways. My actual belief is that humans would rapidly have no ability to contribute to the economy in a post-AGI world, for a similar reason as a moody 7-year-old has essentially no ability to contribute to the economy today (in fact, people would pay good money to keep a moody 7-year-old out of their office or factory).

Dear diary...

[this is an experiment in just posting little progress reports as a self-motivation tool.]

1. I have a growing suspicion that I was wrong to lump the amygdala in with the midbrain. It may be learning by the same reward signal as the neocortex. Or maybe not. It's confusing. Things I'm digesting: https://twitter.com/steve47285/status/1314553896057081857?s=19 (and references therein) and https://www.researchgate.net/publication/11523425_Parallels_between_cerebellum-_and_amygdala-dependent_conditioning

2. Speaking of mistakes, I'm also regretting so... (read more)

7Viliam
Is it too pessimistic to assume that people mostly model other people in order to manipulate them better? I wonder how much of human mental inconsistency is a defense against modeling. Here on Less Wrong we complain that inconsistent behavior makes you vulnerable to Dutch-booking, but in real life, consistent behavior probably makes you even more vulnerable, because your enemies can easily predict what you do and plan accordingly.
2Steven Byrnes
I was just writing about my perspective here; see also Simulation Theory (the opposite of "Theory Theory", believe it or not!).

I mean, you could say that "making friends and being nice to them" is a form of manipulation, in some technical sense, blah blah evolutionary game theory blah blah, I guess. That seems like something Robin Hanson would say :-P I think it's a bit too cynical if you mean "manipulation" in the everyday sense involving bad intent.

Also, if you want to send out vibes of "Don't mess with me or I will crush you!" to other people—and the ability to make credible threats is advantageous for game-theory reasons—that's all about being predictable and consistent!

Again, as I posted just now, I think the lion's share of "modeling", as I'm using the term, is something that happens unconsciously in a fraction of a second, not effortful empathy or modeling.

Hmmm... If I'm trying to impress someone, I do indeed effortfully try to develop a model of what they're impressed by, and then use that model when talking to them. And I tend to succeed! And it's not all that hard! The most obvious strategy tends to work (i.e., go with what has impressed them in the past, or what they say would be impressive, or what impresses similar people). I don't really see any aspect of human nature that is working to make it hard for me to impress someone, like by a person randomly changing what they find impressive. Do you? Are there better examples?
2Viliam
I have low confidence debating this, because it seems to me like many things could be explained in various ways. For example, I agree that a certain predictability is needed to prevent people from messing with you. On the other hand, a certain uncertainty is needed, too -- if people know exactly when you would snap and start crushing them, they will go 5% below the line; but if the exact line depends on what you had for breakfast today, they will be more careful about getting too close to it.
2Steven Byrnes
Fair enough :-)

Branding: 3 reasons why I prefer "AGI safety" to "AI alignment"

  1. When engineers, politicians, bureaucrats, military leaders, etc. hear the word "safety", they suddenly perk up and start nodding and smiling. Safety engineering—making sure that systems robustly do what you want them to do—is something that people across society can relate to and appreciate. By contrast, when people hear the term "AI alignment" for the first time, they just don't know what it means or how to contextualize it.

  2. There are a lot of things that people are working on in this spa

... (read more)
8Ruby
A friend in the AI space who visited Washington told me that military leaders distinctly do not like the term "safety".
2[anonymous]
Why not?
2Ruby
Because they're interested in weapons and making people distinctly not safe.
4orthonormal
Right, for them "alignment" could mean their desired concept, "safe for everyone except our targets".
3[anonymous]
I'm skeptical that anyone with that level of responsibility and acumen has that kind of juvenile destructive mindset. Can you think of other explanations?
1Pattern
There's a difference between people talking about safety in the sense of 1. 'how to handle a firearm safely' and the sense of 2. 'firearms are dangerous, let's ban all guns'. These leaders may understand/be on board with 1, but disagree with 2.
1Nathan Helm-Burger
I think if someone negatively reacts to 'Safety' thinking you mean 'try to ban all guns' instead of 'teach good firearm safety', you can rephrase as 'Control' in that context. I think Safety is more inclusive of various aspects of the problem than either 'Control' or 'Alignment', so I like it better as an encompassing term. 
1Steven Byrnes
Interesting. I guess I was thinking specifically about DARPA which might or might not be representative, but see Safe Documents, Safe Genes, Safe Autonomy, Safety and security properties of software, etc. etc.

In the era of COVID, we should all be doing cardio exercise if possible, and not at a gym. Here's what's been working for me for the past many years. This is not well optimized for perfectly working out every muscle group etc., but it is very highly optimized for convenience, practicality, and sustainability, at least for me personally in my life situation.

(This post is mostly about home cardio exercise, but the last paragraph is about jogging.)

My home exercise routine consists of three simultaneous things: {exercise, YouTube video lectures, RockMyRun}.

... (read more)

Quick comments on "The case against economic values in the brain" by Benjamin Hayden & Yael Niv:

(I really only skimmed the paper, these are just impressions off the top of my head.)

I agree that "eating this sandwich" doesn't have a reward prediction per se, because there are lots of different ways to think about eating this sandwich, especially what aspects are salient, what associations are salient, what your hormones and mood are, etc. If neuroeconomics is premised on reward predictions being attached to events and objects rather than thoughts, then... (read more)

Introducing AGI Safety in general, and my research in particular, to novices / skeptics, in 5 minutes, out loud

I might be interviewed on a podcast where I need to introduce AGI risk to a broad audience of people who mostly aren’t familiar with it and/or think it’s stupid. The audience is mostly neuroscientists plus some AI people. I wrote the following as a possible entry-point, if I get thrown some generic opening question like “Tell me about what you’re working on”:

The human brain does all these impressive things, such that humanity was able to transform

... (read more)
7Gunnar_Zarncke
I would prepare a shortened version - 100 words max - that you could also give.
2Steven Byrnes
Yeah, I think I have a stopping point after the first three paragraphs (with minor changes).
2Mitchell_Porter
Could you just say you're working on safe design principles for brain-like artificial intelligence? 