All of Towards_Keeperhood's Comments + Replies

Thanks.

Yeah I think the parts of my comment where I treated the value function as making predictions about how well a plan works were pretty confused. I agree that a better framing is that plans proposed by the thought generator include predicted outcomes, and the value function evaluates those. (Maybe I previously imagined the thought generator more like proposing actions, idk.)

So yeah I guess what I wrote was pretty confusing, though I still have some concerns here.

Let's look at how an agent might accomplish a very difficult goal, where the agent didn't acco... (read more)

Thx.

Seems like an important difference here is that you’re imagining train-then-deploy whereas I’m imagining continuous online learning. So in the model I’m thinking about, there isn’t a fixed set of “reward data”, rather “reward data” keeps coming in perpetually, as the agent does stuff.

I don't really imagine train-then-deploy, but I think that (1) when the AI becomes coherent enough it will prevent further value drift, and (2) the AI eventually needs to solve very hard problems where we won't have sufficient understanding to judge whether what the AI did is actually good.

2Steven Byrnes
(1) Yeah AI self-modification is an important special case of irreversible actions, where I think we both agree that (mis)generalization from the reward history is very important. (2) Yeah I think we both agree that it’s hopeless to come up with a reward function for judging AI behavior as good vs bad, that we can rely on all the way to ASI.

Thanks! It's nice that I'm learning more about your models.

I’ve gone back and forth about whether I should be thinking more about (A) “egregious scheming followed by violent takeover” versus (B) more subtle things e.g. related to “different underlying priors for doing philosophical value reflection”.

(A) seems much more general than what I would call "reward specification failure".

The way I use "reward specification" is:

  • If the AI has the goal "get reward" (or sth else) rather than "whatever humans want" because it better fits the reward data, then it's a rew
... (read more)
3Steven Byrnes
Seems like an important difference here is that you’re imagining train-then-deploy whereas I’m imagining continuous online learning. So in the model I’m thinking about, there isn’t a fixed set of “reward data”, rather “reward data” keeps coming in perpetually, as the agent does stuff. Of course, as I said above, (mis)generalization from a fixed set of reward data remains an issue for the two special cases of irreversible actions & deliberately not exploring certain states. I didn’t intend (A) & (B) to be a precise and complete breakdown. Yeah that’s definitely a thing to think about. Human examples might include “compassion fatigue” (shutting people out because it’s too hard to feel for them); or my theory that many people with autism learn to deliberately unconsciously avoid a wide array of innate social reactions from a young age; or choosing spending more and more time and mental space with imaginary friends, virtual friends, teddy bears, movies, etc. instead of real people. There are various tricks to mitigate these kinds of complications, and they seem to work well enough in human brains. So I think it’s premature to declare that this problem is definitely unsolvable. (And I think the Deep Deceptiveness post is too simplistic, see my comment on it.)

Stuff I noticed so far from thinking about this:

  • Sensation of desire for closure.
  • Desire to appear smart (mostly in front of people with very good epistemics, where incentives are relatively aligned to truth-oriented thinking and criticizing others and changing one's mind is incentivized but not overincentivized, but still).
  • When I think of a (new) piece of evidence/argument, my mind often initially over-updates in that direction for a minute or so, until I have integrated it into my overall model. (This happens in both directions. Aka I think my intuitive
... (read more)
3Gunnar_Zarncke
The last thing may result from a hard-coded genetic heuristic learning rate. We can't update in a fully Bayesian way, and a learning rate is an approximation given computational constraints. There is an optimal learning rate, but it depends on context, such as the trust in prior information, esp. the volatility of the environment. And thus it may happen that your genetic prior for your learning rate doesn't match the dynamics of your current environment. I guess our modern environment changes faster than the ancestral environment and most people update too slowly on new information. Updating much faster is probably adaptive. I also have that.
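
As a rough illustration of the learning-rate point (my toy numbers, not from the comment): track a drifting quantity with a constant learning rate, and the learning rate that minimizes tracking error grows with the volatility of the environment, which is essentially the steady-state Kalman-gain result.

```python
# Toy sketch: the best constant learning rate depends on environment volatility.
# We track a drifting latent value x_t from noisy observations y_t with the
# update est <- est + lr * (y - est), and compare tracking error across lr values.
import random

def tracking_error(lr, drift_sd, obs_sd=1.0, steps=20000, seed=0):
    rng = random.Random(seed)
    x, est, sq_err = 0.0, 0.0, 0.0
    for _ in range(steps):
        x += rng.gauss(0, drift_sd)      # the environment drifts (volatility)
        y = x + rng.gauss(0, obs_sd)     # noisy observation of the environment
        est += lr * (y - est)            # constant-learning-rate update
        sq_err += (est - x) ** 2
    return sq_err / steps

for drift_sd in [0.01, 0.1, 0.5]:        # low, medium, high volatility
    best_lr = min([0.01, 0.03, 0.1, 0.3, 0.6],
                  key=lambda lr: tracking_error(lr, drift_sd))
    print(f"drift_sd={drift_sd}: best learning rate ~ {best_lr}")
```

A learning rate tuned for a slowly changing (ancestral) environment will systematically undershoot in a faster-changing one, which is the mismatch being described.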

The problem of finding a good representation of abstract thoughts

As background, here's a simple toy model of thinking:

[Diagram: a simple toy model of thinking, involving formal statements and background knowledge.]

The goal is to find a good representation of the formal statements (and also the background knowledge) in the diagram.

The visual angle is sorta difficult, so the two easy criteria for figuring out what makes a good representation are:
1. Correspondence to language sentences
2. Well suited to do logical/probabilistic inference

The second criterion is often neglected. People in semantics often just take language sentences and see how they ca... (read more)

Here's a useful exercise Keltham gives after the lecture "the alien maths of dath ilan".

I disagree. I think you’re overgeneralizing from RL algorithms that don’t work very well (e.g. RLHF), to RL algorithms that do work very well, like human brains or the future AI algorithms that I think Sutton & Silver have in mind.

For example, if I apply your logic there to humans 100,000 years ago, it would fail to predict the fact that humans would wind up engaging in activities like: eating ice cream, playing video games, using social media, watching television, raising puppies, virtual friends, fentanyl, etc. None of those things are “a complex pro

... (read more)
3Steven Byrnes
I’ve gone back and forth about whether I should be thinking more about (A) “egregious scheming followed by violent takeover” versus (B) more subtle things e.g. related to “different underlying priors for doing philosophical value reflection”. This post emphasizes (A), because it’s in response to the Silver & Sutton proposal that doesn’t even clear that low bar of (A). So forget about (B). There’s a school of thought that says that, if we can get past (A), then we can muddle our way through (B) as well, because if we avoid (A) then we get something like corrigibility and common-sense helpfulness, including checking in before doing irreversible things , and helping with alignment research and oversight. I think this is a rather popular school of thought these days, and is one of the major reasons why the median P(doom) among alignment researchers is probably “only” 20% or whatever, as opposed to much higher. I’m not sure whether I buy that school of thought or not. I’ve been mulling it over and am hoping to discuss it in a forthcoming post. (But it’s moot if we can’t even solve (A).) Regardless, I’m allowed to talk about how (A) is a problem, whether or not (B) is also a problem.  :) I think it would! I think social instincts are in the “non-behaviorist” category, wherein there’s a ground-truth primary reward that depends on what you’re thinking about. And believing that a computer program is suffering is a potential trigger. …I might respond to the rest of your comment in our other thread (when I get a chance).
  • If the user types “improve my fitness” into some interface, and it sets the AI’s reward function to be some “function of the user’s heart rate, sleep duration, and steps taken”, then the AI can potentially get a higher reward by forcing the user into eternal cardio training on pain of death, including forcibly preventing the person from turning off the AI, or changing its goals (see §2.2 above).
  • The way that the reward function operationalizes “steps taken” need not agree with what we had in mind. If it’s operationalized as steps registered on a wearable tr
... (read more)
5Steven Byrnes
If an AI researcher said “during the critical value formation period, AlphaZero-chess will learn that it’s bad to lose your queen, and therefore it will never be able to recognize the value of a strategic queen sacrifice”, then that researcher would be wrong. (But also, I would be very surprised if they said that in the first place! I’ve never heard anyone in AI use the term “critical value formation periods”.) RL algorithms can get stuck in local optima of course, as can any other ML algorithm, but I’m implicitly talking about future powerful RL algorithms, algorithms that can do innovative science, run companies, etc., which means that they’re doing a good job of exploring a wide space of possible strategies and not just getting stuck in the first thing they come across. I disagree. I think you’re overgeneralizing from RL algorithms that don’t work very well (e.g. RLHF), to RL algorithms that do work very well, like human brains or the future AI algorithms that I think Sutton & Silver have in mind. For example, if I apply your logic there to humans 100,000 years ago, it would fail to predict the fact that humans would wind up engaging in activities like: eating ice cream, playing video games, using social media, watching television, raising puppies, virtual friends, fentanyl, etc. None of those things are “a complex proxy for predicting reward which misgeneralizes”, rather they are a-priori-extraordinarily-unlikely strategies, that do strongly trigger the human innate reward function, systematically and by design. Conversely, I think you’re overstating the role of goal misgeneralization. Specifically, goal misgeneralization usually corrects itself: If there’s an OOD action or plan that seems good to the agent because of goal misgeneralization, then the agent will do that action or plan, and then the reward function will update the value function, and bam, now it’s no longer OOD, and it’s no longer misgeneralizing in that particular way. Remember, we’re talkin
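
To make the "misgeneralization corrects itself" dynamic concrete, here is a toy sketch (all values invented): a value estimate that overrates a never-tried plan gets pulled back toward ground truth as soon as the agent acts on it.

```python
# Toy sketch of the self-correction dynamic described above (numbers invented).
# The learned value function overrates an out-of-distribution plan because it
# generalized from features that predicted reward elsewhere; acting on the plan
# produces ground-truth reward, which updates the estimate.
true_reward = {"familiar_plan": 1.0, "ood_plan": 0.1}
value_estimate = {"familiar_plan": 1.0, "ood_plan": 2.0}  # misgeneralized: looks great
lr = 0.5

for step in range(5):
    plan = max(value_estimate, key=value_estimate.get)            # act on the best-looking plan
    reward = true_reward[plan]                                    # ground truth comes in
    value_estimate[plan] += lr * (reward - value_estimate[plan])  # TD-style correction
    print(step, plan, round(value_estimate[plan], 3))
# After a couple of steps the OOD plan's estimate drops below the familiar
# plan's, and that particular misgeneralization is gone.
```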

I don't quite like "Turbocharging" as a name because it suggests too little about the content. A better name might be e.g. "the directness principle".

(IIRC Directness is also one of the ultralearning principles from Scott Young and I guess it describes the same thing, but I don't remember.)

Discussion of the exercises of the Probability 2 lecture starts mid-episode 104.

Asmodia figuring out Keltham's probability riddles may also be interesting, though perhaps less so than the lectures. It starts in episode 90. The starting quote is "no dath ilani out of living memory would've seen the phenomenon". The story unfortunately switches between Asmodia+Ione, Carissa (+Peranza I think), and Keltham+Meritxell. You can skip the other stuff that's going on there (though the brief "dath ilan" reply about stocks might be interesting too).

Thanks!

If the value function is simple, I think it may be a lot worse than the world-model/thought-generator at evaluating which abstract plans are actually likely to work (since the agent hasn't yet tried a lot of similar abstract plans from which it could've observed results, and the world model's prediction-making capabilities generalize further). The world model may also form some beliefs about what the goals/values in a given current situation are. So let's say the thought generator outputs plans along with predictions about those plans, and some of th... (read more)

3Steven Byrnes
Here’s an example. Suppose I think: “I’m gonna pick the cabinet lock and then eat the candy inside”. The world model / thought generator is in charge of the “is” / plausibility part of this plan (but not the “ought” / desirability part): “if I do this plan, then I will almost definitely wind up eating candy”, versus “if I do this plan, then it probably won’t work, and I won’t eat candy anytime soon”. This is a prediction, and it’s constrained by my understanding of the world, as encoded in the thought generator. For example, if I don’t expect the plan to succeed, I can’t will myself to expect the plan to succeed, any more than I can will myself to sincerely believe that I’m scuba diving right now as I write this sentence. Remember, the eating-candy is an essential part of the thought. “I’m going to break open the cabinet and eat the candy”. No way am I going to go to all that effort if the concept of eating candy at the end is not present in my mind. Anyway, if I actually expect that such-and-such plan will lead to me eating candy with near-certainty in the immediate future, then the “me eating candy” concept will be strongly active when I think about the plan; conversely, if I don’t actually expect it to work, or expect it to take 6 hours, then the “me eating candy” concept will be more weakly active. (See image here.) Meanwhile, the value function is figuring out if this is a good plan or not. But it doesn’t need to assess plausibility—the thought generator already did that. Instead, it’s much simpler: the value function has a positive coefficient on the “me eating candy” concept, because that concept has reliably predicted primary rewards in the past. So if we combine the value function (linear functional with a big positive coefficient relating “me eating candy” concept activation to the resulting valence-guess) with the thought generator (strong activation of “me eating candy” when I’m actually expecting it to happen, especially soon), then we’re done! We a

Note that the "Probability 2" lecture continues after the lunch break (which is ~30min skippable audio).

Thanks!

Sorry, I think I intended to write out what I think you think, but then just clarified my own thoughts and forgot to edit the beginning. I ought to have properly recalled your model.

Yes, I think I understand your translations and your framing of the value function.

Here are the key differences between (a more concrete version of) my previous model and what I think your model is. Please lmk if I'm still wrongly describing your model:

  • plans vs thoughts
    • My previous model: The main work for devising plans/thoughts happens in the world-model/thought-gen
... (read more)
3Steven Byrnes
Thanks! Oddly enough, in that comment I’m much more in agreement with the model you attribute to yourself than the model you attribute to me. ¯\_(ツ)_/¯ Think of it as a big table that roughly-linearly assigns good or bad vibes to all the bits and pieces that comprise a thought, and adds them up into a scalar final answer. And a plan is just another thought. So “I’m gonna get that candy and eat it right now” is a thought, and also a plan, and it gets positive vibes from the fact that “eating candy” is part of the thought, but it also gets negative vibes from the fact that “standing up” is part of the thought (assume that I’m feeling very tired right now). You add those up into the final value / valence, which might or might not be positive, and accordingly you might or might not actually get the candy. (And if not, some random new thought will pop into your head instead.) Why does the value function assign positive vibes to eating-candy? Why does it assign negative vibes to standing-up-while-tired? Because of the past history of primary rewards via (something like) TD learning, which updates the value function. Does the value function “understand the content”? No, the value function is a linear functional on the content of a thought. Linear functionals don’t understand things.  :) (I feel like maybe you’re going wrong by thinking of the value function and Thought Generator as intelligent agents rather than “machines that are components of a larger machine”?? Sorry if that’s uncharitable.) The value function is a linear(ish) functional whose input is a thought. A thought is an object in some high-dimensional space, related to the presence or absence of all the different concepts comprising it. Some concepts are real-world things like “candy”, other concepts are metacognitive, and still other concepts are self-reflective. When a metacognitive and/or self-reflective concept is active in a thought, the value function will correspondingly assign extra positive or neg
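
A minimal sketch of that picture (the concept names and numbers here are invented): a thought is a sparse vector of concept activations, the valence guess is a weighted sum over them, and primary reward nudges the weights of whatever concepts were active.

```python
# Sketch of "value function = roughly-linear functional over a thought's
# concept activations", as described above (concepts and numbers invented).
valence_weights = {"eating_candy": +0.8, "standing_up_while_tired": -0.5}

def valence_guess(thought):
    """Linear functional: sum of weight * activation over the active concepts."""
    return sum(valence_weights.get(c, 0.0) * a for c, a in thought.items())

def update_on_primary_reward(thought, reward, lr=0.1):
    """TD-like update: shift the weights of active concepts toward the reward."""
    error = reward - valence_guess(thought)
    for c, a in thought.items():
        valence_weights[c] = valence_weights.get(c, 0.0) + lr * error * a

# "I'm gonna get that candy right now": candy strongly expected, but I'm tired.
plan = {"eating_candy": 0.9, "standing_up_while_tired": 1.0}
print(valence_guess(plan))                   # net valence decides whether I act
update_on_primary_reward(plan, reward=1.0)   # the candy was great: weights shift
```

Nothing in this table "understands" the plan; the plausibility judgments all live upstream in the thought generator, which decides how strongly "eating candy" is active in the first place.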

Why two?

Mathematics/logical-truths are true in all possible worlds, so they never tell you in what world you are.

If you want to say something that is true in your particular world (but not necessarily in all worlds), you need some observations to narrow down what world you are in.

I don't know how closely this matches the use in the sequence, but I think a sensible distinction between logical and causal pinpointing is: All the math parts of a statement are "logically pinpointed" and all the observation parts are "causally pinpointed".

So basically, I think i... (read more)
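
As one toy illustration of that split (my example, in the spirit of the sequence rather than quoted from it), take "there are exactly two apples in the bowl":

```latex
% "There are exactly two apples in the bowl", with A(x) = "x is an apple in the bowl":
\exists x \, \exists y \, \big( A(x) \wedge A(y) \wedge x \neq y \wedge
  \forall z \, (A(z) \rightarrow z = x \vee z = y) \big)
```

The quantifier-and-equality skeleton that encodes "exactly two" is the math part: it works the same in every possible world, so it's logically pinpointed. The predicate A is the observation part: which things it applies to is fixed only by the causal chain from the bowl to my eyes, so it's causally pinpointed.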

Thanks for clarifying.

I mean I do think it can happen in my system that you allocate an object for something that's actually 0 or >1 objects, and I don't have a procedure for resolving such map-territory mismatches yet, though I think it's imaginable to have a procedure that defines new objects and tries to edit all the beliefs associated with the old object.

I definitely haven't described how we determine when to create a new object to add to our world model, but one could imagine an algorithm checking when there's some useful latent for explaining some... (read more)

Thx.

Yep there are many trade-offs between criteria.

Btw, totally unrelatedly:

I think in your past work on abstraction you probably lost a decent amount of time from not properly tracking the distinction between (what I call) objects and concepts. I think you likely at least mostly recovered from this, but in case you're not completely sure you've fully done so, you might want to check out the linked section. (I think it makes sense to start by understanding how we (learn to) model objects and only look at concepts later, since minds first learn to model objec... (read more)

Thanks.

I'm still not quite understanding what you're thinking though.

For other objects, like physical ones, quantifiers have to be used. Like "at least one" or "the" (the latter only presupposes there is exactly one object satisfying some predicate). E.g. "the cat in the garden". Perhaps there is no cat in the garden or there are several. So it (the cat) cannot be logically represented with a constant.

"the" supposes there's exactly one canonical choice for what object in the context is indicated by the predicate. When you say "the cat" there's basically al... (read more)

2cubefox
Yes, we have a presupposition that there is exactly one cat. But that presupposition is the same regardless of the actual number of cats (regardless of the context), because the "context" here is a feature of the external world ("territory"), while the belief is a part of the "map"/world model/mind. So when we want to formalize the meaning of "The cat is in the garden", that formalization has to be independent of the territory, that is, the same for any possible way the world is. So we can't use individual constants. Because those can't be used for cases where there is no cat or more than one. The mental content of a belief (the semantic content of a statement) is internal, so it doesn't depend on what the external world is like. The important part is that your world model doesn't need to depend on what the world is like. If you believe that the cat is in the garden, that belief is the same independently of whether the presuppositions it makes are true. Therefore we cannot inject parts of the territory into the map. Or rather: there is no such injection, and if our formalization of beliefs (map/world model) assumes otherwise, that formalization is wrong. Here you use two individual constants: Table342 and SelfPerson. Individual constants can only be used for direct reference, where unique reference can't fail. So they can only be used for internal (mental) objects. So "SelfPerson" is okay, because you know a priori that you exist uniquely. If you didn't have a body, you could still refer to yourself, and it's not possible that you accidentally refer to more than one person, like a copy of you. You are part of your mind, your internal state. But "Table342" is an external object. It might not exist, or multiple such tables might exist even though you presupposed it was only one. "Table342" is an individual constant, and individual constants are incompatible with presupposition failure. So it can't be used. That formalization is incompatible with possible worlds where the table doesn't ex

I think object identification is important if we want to analyze beliefs instead of sentences. For beliefs we can't take a third person perspective and say "it's clear from context what is meant". Only the agent knows what he means when he has a belief (or she). So the agent has to have a subjective ability to identify things. For "I" this is unproblematic, because the agent is presumably internal and accessible to himself and therefore can be subjectively referred to directly. But for "this" (and arguably also for terms like "tomorrow") the referred objec

... (read more)
2cubefox
What I was saying was that we can, from our subjective perspective, only "point" to or "refer" to objects in a certain way. In terms of predicate logic the two ways of referring are via a) individual constants and b) variable quantification. The first corresponds to direct reference, where the reference always points to exactly one object. Mental objects can presumably be referred to directly. For other objects, like physical ones, quantifiers have to be used. Like "at least one" or "the" (the latter only presupposes there is exactly one object satisfying some predicate). E.g. "the cat in the garden". Perhaps there is no cat in the garden or there are several. So it (the cat) cannot be logically represented with a constant. "I" can be, but "this" again cannot. Even ordinary proper names of people cannot, because they aren't guaranteed to refer to exactly one object. Maybe "Superman" is actually two people with the same dress, or he doesn't exist, being the result of a hallucination. This case can be easily solved by treating those names as predicates. Compare: * The woman believes the superhero can fly. * The superhero is the colleague. The above only has quantifiers and predicates, no constants. The original can be handled analogously: * (The) Mia believes (the) Superman can fly. * (The) Superman is (the) Clark Kent. The names are also logical predicates here. In English you wouldn't pronounce the definite articles for the proper nouns here, but in other languages you would. As I argued above, "pointing" (referring) is a matter of logic, so I would say assuming the existence of separate "pointers" is a mistake. But that's not primitive in terms of predicate logic, because here "the" in "the table" means "this" which is not a primitive constant. You don't mean any table in the world, but a specific one, which you can identify in the way I explained in my previous comment. I don't know how it would work with fact causation rather than objects, though there
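
Writing out the "names as predicates" move in Russell's style (my formalization of the example, with S = "is Superman" and C = "is Clark Kent"), "(The) Superman is (the) Clark Kent" becomes:

```latex
\exists x \, \big( S(x) \wedge \forall y \, (S(y) \rightarrow y = x) \wedge
  \exists z \, ( C(z) \wedge \forall w \, (C(w) \rightarrow w = z) \wedge x = z ) \big)
```

No individual constants appear, so the formula stays well-formed and simply comes out false in worlds where the existence or uniqueness presuppositions fail.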

Yep I did not cover those here. They are essentially shortcodes for identifying objects/times/locations from context. Related quote:

E.g. "the laptop" can refer to different objects in different contexts, but when used it's usually clear which object is meant. However, how objects get identified does not concern us here - we simply assume that we know names for all objects and use them directly.

("The laptop" is pretty similar to "This laptop".)

(Though "this" can also act as complementizer, as in "This is why I didn't come", though I think in that function it doesn't count as indexical. The section related to complementizers is the "statement connectives" section.)

3cubefox
I think object identification is important if we want to analyze beliefs instead of sentences. For beliefs we can't take a third person perspective and say "it's clear from context what is meant". Only the agent knows what he means when he has a belief (or she). So the agent has to have a subjective ability to identify things. For "I" this is unproblematic, because the agent is presumably internal and accessible to himself and therefore can be subjectively referred to directly. But for "this" (and arguably also for terms like "tomorrow") the referred object depends partly on facts external to the agent. Those external facts might be different even if the internal state of the agent is the same. For example, "this" might not exist, so it can't be a primitive term (constant) in standard predicate logic. One approach would be to analyze the belief that this apple is green as "There is an x such that x is an apple and x is green and x causes e." Here "e" is a primitive term (similar to "I" in "I'm hungry") that refers to the current visual experience of a green apple. So e is subjective experience and therefore internal to the agent. So it can be directly referred to, while this (the green apple he is seeing) is only indirectly referred to (as explained above), similar to "the biggest tree", "the prime minister of Japan", "the contents of this box". Note the important role of the term "causes" here. The belief is representing a hypothetical physical object (the green apple) causing an internal object (the experience of a green apple). Though maybe it would be better to use "because" (which relates propositions) instead of "causes", which relates objects or at least noun phrases. But I'm not sure how this would be formalized.
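
In symbols, the analysis proposed here for "this apple is green" (with e an individual constant for the current visual experience) is roughly:

```latex
\exists x \, \big( \mathrm{Apple}(x) \wedge \mathrm{Green}(x) \wedge \mathrm{Causes}(x, e) \big)
```

The external apple is reached only via quantification plus its causal relation to the internal experience e; no constant for the apple itself is needed.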

“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.

I guess I just briefly want to flag that I think this summary of inner-vs-outer alignment is confusing, in a way that makes it sound like one could have a good enough ground-truth reward and then that just has to be internalized.

I think this summary is better: 1. "The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)". 2. Something else went wrong [not easily compressible].

Sounds like we probably agree basically everywhere.

Yeah you can definitely mark me down in the camp of "not use 'inner' and 'outer' terminology". If you need something for "outer", how about "reward specification (problem/failure)"?

ADDED: I think I probably don't want a word for inner-alignment/goal-misgeneralization. It would be like having a word for "the problem of landing a human on the moon, except without the part of the problem where we might actively steer the rocket in wrong directions".

I just don’t use the term “utility function” at all in this

... (read more)

Thanks.

Yeah I guess I wasn't thinking concretely enough. I don't know whether something vaguely like what I described might be likely or not. Let me think out loud a bit about how I think about what you might be imagining so you can correct my model. So here's a bit of rambling: (I think point 6 is most important.)

  1. As you described in your intuitive self-models sequence, humans have a self-model which can essentially have values different from the main value function, aka they can have ego-dystonic desires.
  2. I think in smart reflective humans, the policy sugge
... (read more)
7Steven Byrnes
Thanks! Basically everything you wrote importantly mismatches my model :( I think I can kinda translate parts; maybe that will be helpful. Background (§8.4.2): The thought generator settles on a thought, then the value function assigns a “valence guess”, and the brainstem declares an actual valence, either by copying the valence guess (“defer-to-predictor mode”), or overriding it (because there’s meanwhile some other source of ground truth, like I just stubbed my toe). Sometimes thoughts are self-reflective. E.g. “the idea of myself lying in bed” is a different thought from “the feel of the pillow on my head”. The former is self-reflective—it has me in the frame—the latter is not (let’s assume). All thoughts can be positive or negative valence (motivating or demotivating). So self-reflective thoughts can be positive or negative valence, and non-self-reflective thoughts can also be positive or negative valence. Doesn’t matter, it’s always the same machinery, the same value function / valence guess / thought assessor. That one function can evaluate both self-reflective and non-self-reflective thoughts, just as it can evaluate both sweater-related thoughts and cloud-related thoughts. When something seems good (positive valence) in a self-reflective frame, that’s called ego-syntonic, and when something seems bad in a self-reflective frame, that’s called ego-dystonic. Now let’s go through what you wrote: I would translate that into: “it’s possible for something to seem good (positive valence) in a self-reflective frame, but seem bad in a non-self-reflective frame. Or vice-versa.” After all, those are two different thoughts, so yeah of course they can have two different valences. I would translate that into: “there’s a decent amount of coherence / self-consistency in the set of thoughts that seem good or bad in a self-reflective frame, and there’s less coherence / self-consistency in the set of things that seem good or bad in a non-self-reflective frame”. (And the
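
A toy rendering of that flow (my sketch, not code from the post):

```python
# Sketch of the valence flow described above: the value function emits a
# "valence guess" for the settled-on thought, and the brainstem either copies
# it (defer-to-predictor mode) or overrides it with ground truth (e.g. a
# stubbed toe). The same machinery handles self-reflective and
# non-self-reflective thoughts alike.
def brainstem_valence(valence_guess, ground_truth=None):
    if ground_truth is None:
        return valence_guess   # defer-to-predictor mode
    return ground_truth        # override: primary reward wins

print(brainstem_valence(+0.4))                      # no bodily signal: copy the guess
print(brainstem_valence(+0.4, ground_truth=-1.0))   # just stubbed my toe: override
```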

Thanks!

Another thing is, if the programmer wants CEV (for the sake of argument), and somehow (!!) writes an RL reward function in Python whose output perfectly matches the extent to which the AGI’s behavior advances CEV, then I disagree that this would “make inner alignment unnecessary”. I’m not quite sure why you believe that.

I was just imagining a fully omniscient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action accordin... (read more)

2Steven Byrnes
OK, let’s attach this oracle to an AI. The reason this thought experiment is weird is because the goodness of an AI’s action right now cannot be evaluated independent of an expectation about what the AI will do in the future. E.g., if the AI says the word “The…”, is that a good or bad way for it to start its sentence? It’s kinda unknowable in the absence of what its later words will be. So one thing you can do is say that the AI bumbles around and takes reversible actions, rolling them back whenever the oracle says no. And the oracle is so good that we get CEV that way. This is a coherent thought experiment, and it does indeed make inner alignment unnecessary—but only because we’ve removed all the intelligence from the so-called AI! The AI is no longer making plans, so the plans don’t need to be accurately evaluated for their goodness (which is where inner alignment problems happen). Alternately, we could flesh out the thought experiment by saying that the AI does have a lot of intelligence and planning, and that the oracle is doing the best it can to anticipate the AI’s behavior (without reading the AI’s mind). In that case, we do have to worry about the AI having bad motivation, and tricking the oracle by doing innocuous-seeming things until it suddenly deletes the oracle subroutine out of the blue (treacherous turn). So in that version, the AI’s inner alignment is still important. (Unless we just declare that the AI’s alignment is unnecessary in the first place, because we’re going to prevent treacherous turns via option control.) Yeah I mostly think this part of your comment is listing reasons that inner alignment might fail, a.k.a. reasons that goal misgeneralization / malgeneralization can happen. (Which is a fine thing to do!) If someone thinks inner misalignment is synonymous with deception, then they’re confused. I’m not sure how such a person would have gotten that impression. If it’s a very common confusion, then that’s news to me. Inner alignment ca

Note: I just noticed your post has a section "Manipulating itself and its learning process", which I must've completely forgotten since I last read the post. I should've read your post before posting this. Will do so.

“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.

Calling problems "outer" and "inner" alignment seems to suggest that if we solved both we've s... (read more)

5Steven Byrnes
Thanks! But I don’t think that’s a likely failure mode. I wrote about this long ago in the intro to Thoughts on safety in predictive learning. In my view, the big problem with model-based actor-critic RL AGI, the one that I spend all my time working on, is that it tries to kill us via using its model-based RL capabilities in the way we normally expect—where the planner plans, and the actor acts, and the critic criticizes, and the world-model models the world …and the end-result is that the system makes and executes a plan to kill us. I consider that the obvious, central type of alignment failure mode for model-based RL AGI, and it remains an unsolved problem. I think (??) you’re bringing up a different and more exotic failure mode where the world-model by itself is secretly harboring a full-fledged planning agent. I think this is unlikely to happen. One way to think about it is: if the world-model is specifically designed by the programmers to be a world-model in the context of an explicit model-based RL framework, then it will probably be designed in such a way that it’s an effective search over plausible world-models, but not an effective search over a much wider space of arbitrary computer programs that includes self-contained planning agents. See also §3 here for why a search over arbitrary computer programs would be a spectacularly inefficient way to build all that agent stuff (TD learning in the critic, roll-outs in the planner, replay, whatever) compared to what the programmers will have already explicitly built into the RL agent architecture. So I think this kind of thing (the world-model by itself spawning a full-fledged planning agent capable of treacherous turns etc.) is unlikely to happen in the first place. And even if it happens, I think the problem is easily mitigated; see discussion in Thoughts on safety in predictive learning. (Or sorry if I’m misunderstanding.)

I'd suggest not using conflated terminology and rather making up your own.

Or rather, first actually don't use any abstract handles at all and just describe the problems/failure-modes directly, and when you're confident you have a pretty natural breakdown of the problems with which you'll stick for a while, then make up your own ontology.

In fact, while in your framework there's a crisp difference between the ground-truth reward and the learned value-estimator, it might not make sense to just split the alignment problem into two parts like this:

“Outer alignment” entai

... (read more)
3Steven Byrnes
Thanks! I think “inner alignment” and “outer alignment” (as I’m using the term) is a “natural breakdown” of alignment failures in the special case of model-based actor-critic RL AGI with a “behaviorist” reward function (i.e., reward that depends on the AI’s outputs, as opposed to what the AI is thinking about). As I wrote here: (A bit more related discussion here.) That definitely does not mean that we should be going for a solution to outer alignment and a separate unrelated solution to inner alignment, as I discussed briefly in §10.6 of that post, and TurnTrout discussed at greater length in Inner and outer alignment decompose one hard problem into two extremely hard problems. (I endorse his title, but I forget whether I 100% agreed with all the content he wrote.) I find your comment confusing, I’m pretty sure you misunderstood me, and I’m trying to pin down how … One thing is, I’m thinking that the AGI code will be an RL agent, vaguely in the same category as MuZero or AlphaZero or whatever, which has an obvious part of its source code labeled “reward”. For example, AlphaZero-chess has a reward of +1 for getting checkmate, -1 for getting checkmated, 0 for a draw. Atari-playing RL agents often use the in-game score as a reward function. Etc. These are explicitly parts of the code, so it’s very obvious and uncontroversial what the reward is (leaving aside self-hacking), see e.g. here where an AlphaZero clone checks whether a board is checkmate. Another thing is, I’m obviously using “alignment” in a narrower sense than CEV (see the post—“the AGI is ‘trying’ to do what the programmer had intended for it to try to do…”) Another thing is, if the programmer wants CEV (for the sake of argument), and somehow (!!) writes an RL reward function in Python whose output perfectly matches the extent to which the AGI’s behavior advances CEV, then I disagree that this would “make inner alignment unnecessary”. I’m not quite sure why you believe that. The idea is: actor-criti
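
For concreteness, the kind of explicitly-coded reward being referred to is as simple as this (illustrative sketch, not AlphaZero's actual code):

```python
# The "obvious part of the source code labeled reward" for a chess RL agent:
def chess_reward(game_outcome, playing_as_white):
    """+1 for winning, -1 for losing, 0 for a draw."""
    if game_outcome == "draw":
        return 0.0
    winner_is_white = (game_outcome == "white_wins")
    return 1.0 if winner_is_white == playing_as_white else -1.0
```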

If I did, I wouldn't publicly say so.

It's of course not a yes-or-no thing but a probability, but in case it's high I might not want to state it here, so I should generally not state it here, so that you cannot infer it is high from the fact that I didn't state it here.

I can say though that I only turned 22 last week and I expect my future self to grow up to become much more competent than I am now.

3Knight Lee
:) how many standard deviations? What percentile? Supergenius or just genius?
5kwiat.dev
So you do, gotcha

2. I mentioned that there should be much more impressive behavior if they were that smart; I don't recall us talking about that much, not sure.

You said "why don't they e.g. jump in prime numbers to communicate they are smart?" and i was like "hunter gatherer's don't know prime numbers and perhaps not even addition" and you were like "fair".

I mean I thought about what I'd expect to see, but I unfortunately didn't really imagine them as smart, but rather as having a lot of potential while being totally untrained.

3. I recommended that you try hard to invent hypoth

... (read more)
7TsviBT
These sound good, and maybe you have in mind the same thing I mean, but to clarify, I mean like: Do biased thinking in both directions. I.e. be a lawyer for each side in turn. (Don't only do this of course, also do other things like neutral integration / comparison etc.) So like, you get your model / argument that says orcas are smart (or that this is a good project). Then you put on the anti hat, and try really hard to find counterarguments--e.g. by thinking of them, and also by motivatedly looking for information that would give a counterargument. To do this properly you may have to unblend from your wanting X to be true.

Yes human intelligence.

I forgot to paste in that it's a follow-up to my previous posts. Will do now.

In general, I wish this year? (*checks* huh, only 4 months.)

Nah I didn't lose that much time. I already quit the project at the end of January, I just wrote the post now. Most of the technical work was also pretty useful for understanding language, which is a useful angle on agent foundations. I had previously expected working on that angle to be 80% as effective as my previous best plan, but it was even better, around similarly good I think. That was like 5-5.5 weeks and that was not wasted.

I guess I spent like 4.5 weeks overall on learning about orcas (includi... (read more)

Yeah I think I came to agree with you. I'm still a bit confused though because intuitively I'd guess chimps are dumber than -4.4SD (in the interpretation for "-4.4SD" I described in my other new comment).

When you now get a lot of mutations that increase brain size, while this contributes to smartness, this also pulls you away from the species median, so the hyperparameters are likely to become less well tuned, resulting in a countereffect that also makes you dumber in some ways.

Actually maybe the effect I am describing is relatively small as long as the variation in brain size is within 2 SDs or so, which is where most of the data pinning down the 0.3 correlation comes from.

So yeah it's plausible to me that your method of estimating is ok.

Intuitively I had... (read more)

Thanks for describing a wonderfully concrete model.
I like the way you reason (especially the squiggle), but I don't think it works quite that well for this case. But let's first assume it does:
Your estimates on algorithmic efficiency deficits of orca brains seem roughly reasonable to me. (EDIT: I'd actually be at more like -3.5std mean with standard deviation of 2std, but idk.)
Number of cortical neurons != brain size. Orcas have ~2x the number of cortical neurons, but much larger brains. Assuming brain weight is proportional to volume, with human brains bei... (read more)

4ryan_greenblatt
I think cortical neurons is a better proxy than brain size and I expect that the relation between cortical neurons and brain size differs substantially between species. (I expect more similarity within a species.) This might be true in mammals (and/or birds) overall, but I'm kinda skeptical this is a big effect within humans. Like I'd guess that regression slope between brain size and cortical neurons is ~1 in humans rather than substantially less than 1. I agree you'll probably see a bigger correlation with cortical neurons (if you can measure this precisely enough!). I wouldn't guess much more though? ---------------------------------------- Overall, I'm somewhat sympathetic to your arguments that we should expect that multiplying cortical neurons by X is a bigger effect than multiplying brain size by X. Maybe this moves my estimate of SDs / doubling of cortical neurons up by 1.5x to more like 1.8 SD / doubling. I don't think this makes a huge difference to the bottom line.
1Towards_Keeperhood
Actually maybe the effect I am describing is relatively small as long as the variation in brain size is within 2 SDs or so, which is where most of the data pinning down the 0.3 correlation comes from. So yeah it's plausible to me that your method of estimating is ok. Intuitively I had thought that chimps are just much dumber than humans. And sure if you take -4SD humans they aren't really able to do anything, but they don't really count. I thought it's sorta in this direction but not quite as extreme: [image] (This picture is actually silly because the distance to "Mouse" should be even much bigger. The point is that chimps might be far outside the human distribution.) But perhaps chimps are actually closer to humans than I thought. (When I in the following compare different species with standard deviations, I don't actually mean standard deviations, but more like "how many times the difference between a +0SD and a +1SD human", since extremely high and very low standard deviation measures mostly cease to be meaningful for what was actually supposed to be measured.) I still think -4.4SD is overestimating chimp intelligence. I don't know enough about chimps, but I guess they might be somewhere between -12SD and -6SD (compared to my previous intuition, which might've been more like -20SD). And yes, considering that the gap in cortical neuron count between chimps and humans is like 3.5x, and it's even larger for the prefrontal cortex, and that algorithmic efficiency is probably "orca < chimp < human", then +6SDs for orcas seem a lot less likely than I initially intuitively thought, though orcas would still likely be a bit smarter than humans (on the way my priors would fall out (not really after updating on observations about orcas)).

Thanks!

Yeah that might be a great idea.

Earth species project might be promising too.

Thanks for the suggestion, though I don't think they are smart enough to get far with grammar. No non-cetacean non-humans seem to be.

One possibility is to try it with bottlenose dolphins (or beluga whales). (Bottlenose dolphins have shown greater capacity to learn grammar than great apes.[1]) Those are likely easier to get research access to than orcas. I think we might get some proof of concept of the methodology there, though I'm relatively pessimistic about them learning a full language well.

  1. See the work of Louis Herman in the 80s (and 90s).

By >=+6std I mean the potential for how smart they could be if they were trained similarly to us, not actual current intelligence. Sorry I didn't write this in this post, though I did in others.

I'd be extremely shocked if orcas were actually that smart already. They don't have science and they aren't trained in abstract reasoning.

Like, when an orca is +7std, he'd be like a +7std hunter-gatherer human, who is probably not all that good at abstract reasoning tasks (like learning a language through brute-force abstract pattern recognition). (EDIT: Ok actually i... (read more)

Thanks for letting me know it sounded like that. I definitely know it isn't legible at all, and I didn't expect readers to buy it, I just wanted to communicate that that's how it is from my own perspective.

You're right. I'll edit the post.

Considerations on intelligence of wild orcas vs captive orcas

I've updated to thinking it's relatively likely that wild orcas are significantly smarter than captive orcas, because (1) wild orcas might learn proper language and captive orcas don't, and (2) generally orcas don't have much to learn in captivity, causing their brains to be underdeveloped.

Here are the most relevant observations:

  • Observation 1: (If I analyzed the data correctly and the data is correct,) all orcas currently alive in captivity have been either born in captivity or captured when they
... (read more)

Seems totally unrelated to my post but whatever:

My p(this branch of humanity won't fulfill the promise of the night sky) is actually more like 0.82 or sth, idk. (I'm even lower on p(everyone will die), because there might be superintelligences in other branches that acausally trade to save the existing lives, though I didn't think about it carefully.)

I'm chatting 1 hour every 2 weeks with Erik Jenner. We usually talk about AI safety stuff. Otherwise also like 1h every 2 weeks with a person who has sorta similar views to me. Otherwise I currently don't talk much to people about AI risk.

ok edited to sun. (i used earth first because i don't know how long it will take to eat the sun, whereas earth seems likely to be feasible to eat quickly.)

(plausible to me that an aligned AI will still eat the earth but scan all the relevant information out of it and later maybe reconstruct it.)

ok thx, edited. thanks for feedback!

(That's not a reasonable ask, it intervenes on reasoning in a way that's not an argument for why it would be mistaken. It's always possible a hypothesis doesn't match reality, that's not a reason to deny entertaining the hypothesis, or not to think through its implications. Even some counterfactuals can be worth considering, when not matching reality is assured from the outset.)

Yeah you can hypothesize. If you state it publicly though, please make sure to flag it as a hypothesis.

3Vladimir_Nesov
Also not a reasonable ask, friction targeted at a particular thing makes it slightly less convenient, and therefore it stops happening in practice completely. ~Everything is a hypothesis, ~all models are wrong, in each case language makes what distinctions it tends to in general.

How long until the earth gets eaten? 10th/50th/90th percentile: 3y, 12y, 37y.

Catastrophes induced by narrow capabilities (notably biotech) can push it further, so this might imply that they probably don't occur.

No, it doesn't imply this; I set this disclaimer: "Conditional on no strong governance success that effectively prevents basically all AI progress, and conditional on no huge global catastrophe happening in the meantime:". Though yeah, I don't particularly expect those to occur.

2Vladimir_Nesov
The "AI might decide not to" point stands I think. This for me represents change of mind, I wouldn't have previously endorsed this point, but since recently I think arbitrary superficial asks like this can become reflectively stable with nontrivial probability, resisting strong cost-benefit arguments even after intelligence explosion. Right, I missed this.

Will we get to this point by incremental progress that yields smallish improvements (=slow), or by some breakthrough that when scaled up can rush past the human intelligence level very quickly (=fast)?

AI speed advantage makes fast vs. slow ambiguous, because it doesn't require AI getting smarter in order to make startlingly fast progress, and might be about passing a capability threshold (of something like autonomous research) with no distinct breakthroughs leading up to it (by getting to a slightly higher level of scaling or compute efficiency with some o

... (read more)

My AI predictions

(I did not carefully think about my predictions. I just wanted to state them somewhere because I think it's generally good to state stuff publicly.)

(My future self will not necessarily make similar predictions as I am now.)

TLDR: I don't know.

Timelines

Conditional on no strong governance success that effectively prevents basically all AI progress, and conditional on no huge global catastrophe happening in the meantime:

How long until the sun (starts to) get eaten? 10th/50th/90th percentile: 3y, 12y, 37y.

How long until an AI reaches Elo 4000 o... (read more)

1teradimich
It seems a little surprising to me how rarely confident pessimists (p(doom)>0.9) argue with moderate optimists (p(doom)≤0.5). I'm not specifically talking about this post. But it would be interesting if people revealed their disagreement more often.
4Vladimir_Nesov
Catastrophes induced by narrow capabilities (notably biotech) can push it further, so this might imply that they probably don't occur[1]. Also, aligned AI might decide not to, it's not as nutritious as the Sun anyway. AI speed advantage makes fast vs. slow ambiguous, because it doesn't require AI getting smarter in order to make startlingly fast progress, and might be about passing a capability threshold (of something like autonomous research) with no distinct breakthroughs leading up to it (by getting to a slightly higher level of scaling or compute efficiency with the old techniques). (That's not a reasonable ask, it intervenes on reasoning in a way that's not an argument for why it would be mistaken. It's always possible a hypothesis doesn't match reality, that's not a reason to deny entertaining the hypothesis, or not to think through its implications. Even some counterfactuals can be worth considering, when not matching reality is assured from the outset.) ---------------------------------------- 1. There was a "no huge global catastrophe" condition on the prediction that I missed, thanks Towards_Keeperhood for correction. ↩︎

Here's my current list of lessons for review. Every day during my daily review, I look at the lessons in the corresponding weekday entry and the corresponding day of the month, and for each, list one example from the last week where I could've applied the lesson, and one example where I might be able to apply the lesson in the next week:

  • Mon
    • get fast feedback. break tasks down into microtasks and review after each.
  • Tue
    • when surprised by something, or when something took longer than expected, review in detail how you might've made the progress faster.
      • clarify why the progress is g
... (read more)

Thank you for your feedback! Feedback is great.

We can try to select for AIs that outwardly seem friendly, but on anything close to our current ignorance about their cognition, we cannot be nearly confident that an AI going through the intelligence explosion will be aligned to human values.

It means that we have only very little understanding of how and why AIs like ChatGPT work. We know almost nothing about what's going on inside them that lets them give useful responses. Basically all I'm saying here is that we know so little that it's hard to be co... (read more)

Here's my pitch for very smart young scientists for why "Rationality from AI to Zombies" is worth reading:

The book "Rationality: From AI to Zombies" is actually a large collection of blogposts, which covers a lot of lessons on how to become better at reasoning. It also has a lot of really good and useful philosophy, for example about how Bayesian updating is the deeper underlying principle of how science works.

But let me express in more detail why I think "Rationality: A-Z" is very worth reading.

Human minds are naturally bad at deducing correct beliefs/the

... (read more)