Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for quick takes by TsviBT.

An important thing that the AGI alignment field never understood:

Reflective stability. Everyone thinks it's about, like, getting guarantees, or something. Or about rationality and optimality and decision theory, or something. Or about how we should understand ideal agency, or something.

But what I think people haven't understood is

  1. If a mind is highly capable, it has a source of knowledge.
  2. The source of knowledge involves deep change.
  3. Lots of deep change implies lots of strong forces (goal-pursuits) operating on everything.
  4. If there's lots of strong goal-pursuits operating on everything, nothing (properties, architectures, constraints, data formats, conceptual schemes, ...) sticks around unless it has to stick around.
  5. So if you want something to stick around (such as the property "this machine doesn't kill all humans") you have to know what sort of thing can stick around / what sort of context makes things stick around, even when there are strong goal-pursuits around, which is a specific thing to know because most things don't stick around.
  6. The elements that stick around and help determine the mind's goal-pursuits have to do so in a way that positively makes them stick around (reflective stability of goals).

There's exceptions and nuances and possible escape routes. And the older Yudkowsky-led research about decision theory and tiling and reflective probability is relevant. But this basic argument is in some sense simpler (less advanced, but also more radical ("at the root")) than those essays. The response to the failure of those essays can't just be to "try something else about alignment"; the basic problem is still there and has to be addressed.

(related elaboration: https://tsvibt.blogspot.com/2023/01/a-strong-mind-continues-its-trajectory.html https://tsvibt.blogspot.com/2023/01/the-voyage-of-novelty.html )

Agreed! I tried to say the same thing in "The alignment stability problem".

I think most people in prosaic alignment aren't thinking about this problem. Without this, they're working on aligning AI, but not on aligning AGI or ASI. It seems really likely on the current path that we'll soon have AGI that is reflective. In addition, it will do continuous learning, which introduces another route to goal change (e.g., learning that what people mean by "human" mostly applies to some types of artificial minds, too).

The obvious route past this problem, that I think prosaic alignment often sort of assumes without being explicit about it, is that humans will remain in charge of how the AGI updates its goals and beliefs. They're banking on corrigible or instruction-following AGI.

I think that's a viable approach, but we should be more explicit about it. Aligning AI probably helps with aligning AGI, but they're not the same thing, so we should try to get more sure that prosaic alignment really helps align a reflectively stable AGI.

Thanks. (I think we have some ontological mismatches which hopefully we can discuss later.)

Say more about point 2 there? Thinking about 5 and 6 though - I think I now maybe have a hopeworthy intuition worth sharing later.

Say you have a Bayesian reasoner. It's got hypotheses; it's got priors on them; it's got data. So you watch it doing stuff. What happens? Lots of stuff changes, tide goes in, tide goes out, but it's still a Bayesian, can't explain that. The stuff changing is "not deep". There's something stable though: the architecture in the background that "makes it a Bayesian". The update rules, and the rest of the stuff (for example, whatever machinery takes a hypothesis and produces "predictions" which can be compared to the "predictions" from other hypotheses). And: it seems really stable? Like, even reflectively stable, if you insist?
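
A minimal sketch of what that background architecture looks like when you write it down (my own toy code, not anything from the linked posts): the hypothesis set and the update rule are frozen in the program, and no amount of incoming data ever touches them; only the posterior numbers move.

```python
# Toy Bayesian updater (illustrative sketch only). The "deep" parts -- the
# hypothesis list and the update rule -- are fixed in the code; only the
# posterior weights change as data arrives.

def normalize(weights):
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

def update(posterior, flip):
    """One Bayes update on a single coin flip, 'H' or 'T'."""
    likelihood = lambda bias: bias if flip == "H" else 1.0 - bias
    return normalize({h: p * likelihood(h) for h, p in posterior.items()})

# Hypotheses about a coin's bias, with a uniform prior.
posterior = normalize({0.1: 1.0, 0.5: 1.0, 0.9: 1.0})

for flip in "HHTHHHHTHH":
    posterior = update(posterior, flip)

print(posterior)  # most of the mass ends up on bias=0.9; update() itself never moved
```

Everything the data stream can reach is the posterior dict; `update` and the hypothesis set are untouchable from inside the loop, which is the sense in which the written-down architecture looks stable.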

So does this solve stability? I would say, no. You might complain that the reason it doesn't solve stability is just that the thing doesn't have goal-pursuits. That's true but it's not the core problem. The same issue would show up if we for example looked at the classical agent architecture (utility function, counterfactual beliefs, argmaxxing actions).
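
For concreteness, the classical architecture can be written down just as compactly (again my own toy sketch; the names and numbers are made up): a fixed utility function, fixed counterfactual beliefs, and an argmax over actions, none of which the agent's own activity can reach.

```python
# Toy "classical agent": fixed utility, fixed beliefs, argmax over actions
# (illustrative sketch only; everything here is hypothetical).

def classical_agent(actions, outcomes, prob, utility):
    """Pick the action maximizing the sum over outcomes of prob(o, a) * utility(a, o)."""
    return max(actions, key=lambda a: sum(prob(o, a) * utility(a, o) for o in outcomes))

actions = ["umbrella", "no umbrella"]
outcomes = ["rain", "sun"]
prob = lambda o, a: 0.3 if o == "rain" else 0.7        # beliefs ignore the action here
payoffs = {("umbrella", "rain"): 0.0, ("umbrella", "sun"): -1.0,
           ("no umbrella", "rain"): -10.0, ("no umbrella", "sun"): 1.0}

print(classical_agent(actions, outcomes, prob, lambda a, o: payoffs[(a, o)]))  # -> "umbrella"
```

The point of writing it out is only to make vivid which parts are pinned down by the code, since those are exactly the parts the next paragraph says won't stay pinned down.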

The problem is that the agency you can write down is not the true agency. "Deep change" is change that changes elements that you would have considered deep, core, fundamental, overarching... Change that doesn't fit neatly into the mind, change that isn't just another piece of data that updates some existing hypotheses. See https://tsvibt.blogspot.com/2023/01/endo-dia-para-and-ecto-systemic-novelty.html

> You might complain that the reason it doesn't solve stability is just that the thing doesn't have goal-pursuits.

Not so - I'd just call it the trivial case and implore us to do better literally at all!

Apart from that, thanks - I have a better sense of what you meant there. "Deep change" as in "no, actually, whatever you pointed to as the architecture of what's Really Going On... can't be that, not for certain, not forever."

I'd go stronger than just "not for certain, not forever", and I'd worry you're not hearing my meaning (agree or not). I'd say in practice more like "pretty soon, with high likelihood, in a pretty deep / comprehensive / disruptive way". E.g. human culture isn't just another biotic species (you can make interesting analogies but it's really not the same).

> I'd go stronger than just "not for certain, not forever", and I'd worry you're not hearing my meaning (agree or not).

That's entirely possible. I've thought about this deeply for entire tens of minutes, after all. I think I might just be erring (habitually) on the side of caution about the kinds of state-changes I expect to see from systems I don't fully understand. OTOH... I have a hard time believing that even (especially?) an extremely capable mind would find it worthwhile to repeatedly rebuild itself from the ground up, such that few of even the ?biggest?/most salient features of a mind stick around for long at all.

I have no idea what goes on in the limit, and I would guess that what determines the ultimate effects (https://tsvibt.blogspot.com/2023/04/fundamental-question-what-determines.html) would become stable in some important senses. Here I'm mainly saying that the stuff we currently think of as being core architecture would be upturned.

I mean it's complicated... like, all minds are absolutely subject to some constraints--there's some Bayesian constraint, like you can't "concentrate caring in worlds" in a way that correlates too much with "multiversally contingent" facts, compared to how much you've interacted with the world, or something... IDK what it would look like exactly, and if no one else knows then that's kinda my point. Like, there's

  1. Some math about probabilities, which is just true--information-theoretic bounds and such. But: not clear precisely how this constrains minds in what ways. (A worked instance of this kind of bound is sketched just after this list.)
  2. Some rough-and-ready ways that minds are constrained in practice, such as obvious stuff about like you can't know what's in the cupboard without looking, you can't shove more than such and such amount of information through a wire, etc. These are true enough in practice, but also can be broken in terms of their relevant-in-practice implications (e.g. by "hypercompressing" images using generative AI; you didn't truly violate any law of probability but you did compress way beyond what would be expected in a mundane sense).
  3. You can attempt to state more absolute constraints, but IDK how to do that. Naive attempts just don't work, e.g. "you can't gain information just by sitting there with your eyes closed" just isn't true in real life for any meaning of "information" that I know how to state other than a mathematical one (because for example you can gain "logical information", or because you can "unpack" information you already got (which is maybe "just" gaining logical information but I'm not sure, or rather I'm not sure how to really distinguish non/logical info), or because you can gain/explicitize information about how your brain works which is also information about how other brains work).
  4. You can describe or design minds as having some architecture that you think of as Bayesian. E.g. writing a Bayesian updater in code. But such a program would emerge / be found / rewrite itself so that the hypotheses it entertains, in the descriptive Bayesian sense, are not the things stored in memory and pointed at by the "hypotheses" token in your program.
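
As one worked instance of the "just true" probability math in item 1 (my own example, not something from the thread): a Bayesian's posterior odds factor as prior odds times likelihood ratio,

$$\frac{P(H \mid e)}{P(\lnot H \mid e)} \;=\; \frac{P(H)}{P(\lnot H)} \cdot \frac{P(e \mid H)}{P(e \mid \lnot H)},$$

so moving a hypothesis from prior $p$ to posterior $q$ requires evidence carrying at least $\log_2 \frac{q(1-p)}{p(1-q)}$ bits of log-likelihood ratio in its favor; going from 1% to 99% takes about 13.3 bits of interaction with the world, no matter how the mind is built. What's unclear, as item 1 says, is how (or whether) constraints of this shape cash out for goal-pursuits and "caring" rather than for credences.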

Another class of constraints of this kind comes from computational complexity theory.

So there are probably constraints, but we don't really understand them and definitely don't know how to wield them, and in particular we understand the ones about goal-pursuits much less well than we understand the ones about probability.

This argument does not seem clear enough to engage with or analyze, especially steps 2 and 3. I agree that concepts like reflective stability have been confusing, which is why it is important to develop them in a grounded way.

Well, it's a quick take. My blog has more detailed explanations, though not organized around this particular point.

> That's why solving hierarchical agency is likely necessary for success

We'd have to talk more / I'd have to read more of what you wrote, for me to give a non-surface-level / non-priors-based answer, but on priors (based on, say, a few dozen conversations related to multiple agency) I'd expect that whatever you mean by hierarchical agency is dodging the problem. It's just more homunculi. It could serve as a way in / as a centerpiece for other thoughts you're having that are more so approaching the problem, but the hierarchicalness of the agency probably isn't actually the relevant aspect. It's like if someone is trying to explain how a car goes and then they start talking about how, like, a car is made of four wheels, and each wheel has its own force that it applies to a separate part of the road in some specific position and direction and so we can think of a wheel as having inside of it, or at least being functionally equivalent to having inside of it, another smaller car (a thing that goes), and so a car is really an assembly of 4 cars. We're just... spinning our wheels lol.

Just a guess though. (Just as a token to show that I'm not completely ungrounded here w.r.t. multi-agency stuff in general, but not saying this addresses specifically what you're referring to: https://tsvibt.blogspot.com/2023/09/the-cosmopolitan-leviathan-enthymeme.html)

Agreed we would have to talk more. I think I mostly get the homunculi objection. Don't have time now to write an actual response, so here are some signposts:
- part of what you call agency is explained by roughly active-inference-style reasoning
-- some types of "living" systems are characterized by having boundaries between themselves and the environment (boundaries mostly in the sense of separation of variables)
-- maintaining the boundary leads to a need to model the environment
-- modelling the environment introduces a selection pressure toward approximating Bayes
- the other critical ingredient is boundedness
-- in this universe, negentropy isn't free
-- this introduces a fundamental tradeoff / selection pressure for any cognitive system: length isn't free, bitflips aren't free, etc.
(--- downstream of that is compression everywhere, abstractions)
-- empirically, the cost/returns function for scaling cognition usually hits diminishing returns, leading to minds where it's not effective to grow the single mind further
--- this leads to the basin of convergent evolution I call "specialize and trade"
-- empirically, for many cognitive systems, there is a general selection pressure toward modularity
--- I don't know all the reasons for that, but one relatively simple one is 'wires are not free'; if wires are not free, you get colocation of computations, like brain regions or industry hubs (toy illustration after this list)
--- other possibilities are selection pressures from the CAP theorem, MVG, ...
(modularity also looks a bit like box-inverted specialize and trade)
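
A toy illustration of the 'wires are not free' point above (my own sketch; the traffic numbers and groupings are made up): if you place computational units so as to reduce traffic-weighted wire length, units that talk to each other a lot end up colocated, which is the shape of the argument for brain regions and industry hubs.

```python
# Toy sketch: "wires are not free" => colocation. Units sit on a line; we
# greedily swap placements to reduce total wiring cost, i.e. the sum over
# pairs of (traffic between units) * (distance between their positions).
import random

random.seed(0)
N = 12
# Hypothetical traffic matrix: units 0-5 mostly talk among themselves, as do 6-11.
traffic = [[0.0] * N for _ in range(N)]
for i in range(N):
    for j in range(i + 1, N):
        same_group = (i < 6) == (j < 6)
        traffic[i][j] = traffic[j][i] = random.uniform(1.0, 2.0) if same_group else random.uniform(0.0, 0.2)

def wiring_cost(slot):
    # slot[u] is the physical position of unit u on the line
    return sum(traffic[i][j] * abs(slot[i] - slot[j])
               for i in range(N) for j in range(i + 1, N))

slot = list(range(N))
random.shuffle(slot)
improved = True
while improved:                      # plain hill climbing over pairwise swaps
    improved = False
    for i in range(N):
        for j in range(i + 1, N):
            before = wiring_cost(slot)
            slot[i], slot[j] = slot[j], slot[i]
            if wiring_cost(slot) < before:
                improved = True
            else:
                slot[i], slot[j] = slot[j], slot[i]   # revert the swap

# Each group typically ends up occupying a contiguous block of positions:
print("positions of group A units:", sorted(slot[:6]))
print("positions of group B units:", sorted(slot[6:]))
```

Nothing here selects for modules directly; the clumping falls out of the wiring cost alone.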

So, in short, I think I agree with the spirit of "If humans didn't have a fixed skull size, you wouldn't get civilization with specialized members", and my response is that there seems to be an extremely general selection pressure in this direction. If cells were able to just grow in size and it was efficient, you wouldn't get multicellulars. If code bases were able to just grow in size and it was efficient, I wouldn't get a myriad of packages on my laptop; it would all be just kernel. (But even if it was just kernel, it seems modularity would kick in and you'd still get the 'distinguishable parts' structure.)

> It's just more homunculi.

It's a bit annoying to me that "it's just more homunculi" is kind of powerful for reasoning about humans, but also evades understanding agentic things. I also find it tempting because it gives a cool theoretical foothold to work off, but I wonder whether the approach is hiding most of the complexity of understanding agency.