This is a response to Abram's The Parable of Predict-O-Matic, but you probably don't need to read Abram's post to understand mine. While writing this, I thought of a way in which I think things could wrong with dualist Predict-O-Matic, which I plan to post in about a week. I'm offering a $100 prize to the first commenter who's able to explain how things might go wrong in a sufficiently crisp way before I make my follow-up post.

Dualism

Currently, machine learning algorithms are essentially "Cartesian dualists" when it comes to themselves and their environment. (Not a philosophy major -- let me know if I'm using that term incorrectly. But what I mean to say is...) If I give a machine learning algorithm some data about itself as training data, there's no self-awareness there--it just chugs along looking for patterns like it would for any other problem. I think it's a reasonable guess that our algorithms will continue to have this "no self-awareness" property as they become more and more advanced. At the very least, this "business as usual" scenario seems worth analyzing in depth.

If dualism holds for Abram's prediction AI, the "Predict-O-Matic", its world model may happen to include this thing called the Predict-O-Matic which seems to make accurate predictions -- but it's not special in any way and isn't being modeled any differently than anything else in the world. Again, I think this is a pretty reasonable guess for the Predict-O-Matic's default behavior. I suspect other behavior would require special code which attempts to pinpoint the Predict-O-Matic in its own world model and give it special treatment (an "ego").

Let's suppose the Predict-O-Matic follows a "recursive uncertainty decomposition" strategy for making predictions about the world. It models the entire world in rather fuzzy resolution, mostly to know what's important for any given prediction. If some aspect of the world appears especially relevant to a prediction it's trying to make, it "zooms in" and tries to model that thing in higher resolution. And if some part of that thing seems especially relevant, it zooms in further on that part. Etc.

Now suppose the Predict-O-Matic is trying to make a prediction, and its "recursive uncertainty decomposition" algorithms say the next prediction made by this Predict-O-Matic thing which happens to occupy its world model appears especially relevant! What then?

At this point, the Predict-O-Matic has stepped into a hall of mirrors. To predict the next prediction made by the Predict-O-Matic in its world model, the Predict-O-Matic needs to run an internal simulation of that Predict-O-Matic. But as it runs that simulation, it finds that simulation kicking off another Predict-O-Matic simulation in the simulated Predict-O-Matic's world model! Etc, etc.

So if the Predict-O-Matic is implemented naively, the result could just be an infinite recurse. Not useful, but not necessarily dangerous either.

Let's suppose the Predict-O-Matic has a non-naive implementation and something prevents this infinite recurse. For example, there's a monitor process that notices when a model is eating up a lot of computation without delivering useful results, and replaces that model with one which is lower-resolution. Or maybe the Predict-O-Matic does have a naive implementation, but it doesn't have enough data about itself to model itself in much detail, so it ends up using a low-resolution model.

One possibility is that it's able to find a useful outside view model such as "the Predict-O-Matic has a history of making negative self-fulfilling prophecies". This could lead to the Predict-O-Matic making a negative prophecy ("the Predict-O-Matic will continue to make negative prophecies which result in terrible outcomes"), but this prophecy wouldn't be selected for being self-fulfilling. And we might usefully ask the Predict-O-Matic whether the terrible self-fulfilling prophecies will continue conditional on us taking Action A.

Answering a Question by Having the Answer

If you aren't already convinced, here's another explanation for why I don't think the Predict-O-Matic will make self-fulfilling prophecies by default.

In Abram's story, the engineer says: "The answer to a question isn't really separate from the expected observation. So 'probability of observation depending on that prediction' would translate to 'probability of an event given that event', which just has to be one."

In other words, if the Predict-O-Matic knows it will predict P = A, it assigns probability 1 to the proposition that it will predict P = A.

I contend that Predict-O-Matic doesn't know it will predict P = A at the relevant time. It would require time travel -- to know whether it will predict P = A, it will have to have made a prediction already, and but it's still formulating its prediction as it thinks about what it will predict.

More details: Let's taboo "Predict-O-Matic" and instead talk about a "predictive model" and "input data". The trick is to avoid including the output of the predictive model in the model's input data. This isn't possible the first time we make a prediction because it would require time travel -- so as a practical matter, we don't want to re-run the model a second time with the prediction from its first run included in the input data. Let's say the dataset is kept completely static during prediction. (I offer no guarantees in the case where observational data about the model's prediction process is being used to inform the model while it makes a prediction!)

To clarify further, let's consider a non-Predict-O-Matic scenario where issues do crop up. Suppose I'm a big shot stock analyst. I think Acme Corp's stock is overvalued and will continue to be overvalued in one month's time. But before announcing my prediction, I do a sanity check. I notice that if I announce my opinion, that could cause investors to dump Acme, and Acme will likely no longer be overvalued in one month's time. So I veto the column I was planning to write on Acme, and instead search for a column c I can write such that c is a fixed point for the world w -- w(c) = c -- even when the world is given my column as an input, what I predicted in my column still comes true.

Note again that the sanity check which leads to a search for a fixed point doesn't happen by default -- it requires some extra functionality, beyond what's required for naive prediction, to implement. The Predict-O-Matic doesn't care about looking bad, and there's nothing contradictory about it predicting that it won't make the very prediction it makes, or something like that. Predictive model, meet input data. That's what it does.

Open Questions

This is a section for half-baked thoughts that could grow into a counterargument for what I wrote above.

  • What's going on when you try to model yourself thinking about the answer to this question? (Why is this question so hard to think about? Maybe my brain has a mechanism to prevent infinite recurse? Tangent: I wonder if this is evidence of a mechanism that tries to prevent my brain from making premature predictions by observing data about my own predictive process while trying to make predictions? Otherwise maybe there could be a cascade of updates where noticing that I've become more sure of something makes me even more sure of it, etc. By the way, I think maybe humans do this on a group level, and this accounts for intellectual fashions.) Anyway, I think it's important to understand if my brain does something "in practice" which differs from what I've outlined here, some kind of method for collapsing recursion that a sufficiently advanced Predict-O-Matic might use.

  • What if the Predict-O-Matic assigns some credence to the idea that it's "agentic" in nature? Then what if the Predict-O-Matic assigns some credence to the idea that the simulated version of itself could assign credence to the idea that it's in a simulation? (I think this is just a classic daemon but maybe it differs in important ways?)

  • In ML, the predictive model isn't trying to maximize its own accuracy--that's what the training algorithm tries to do. The predictive model doesn't seem like an optimizer even in the "mathematical optimization" sense of the world optimization (is "mesa-optimizer" an appropriate term? In this case, I think we're glad it's a mesa-optimizer?) What if the Predict-O-Matic sometimes runs a training algorithm to update its model? How does that change things?

  • What if time travel actually is possible and we just haven't discovered it yet?

  • Does anything interesting happen if the Predict-O-Matic becomes aware of the concept of self-awareness?

  • Could the Predict-O-Matic notice and exploit shared computational structure in the recursive self-simulation process? Suppose it notes the computation it has to do to make a prediction for the simulated Predict-O-Matic is similar to the computation it has to do to just make a prediction! What then?

Prize Details

Again, $100 prize for the first comment which crisply explains something that could wrong with dualist Predict-O-Matic. Contest ends when I publish my follow-up -- probably next Wednesday the 23rd. I do have at least one answer in mind but I'm hoping you'll come up with something I haven't thought of. However, if I'm not convinced your thing would be a problem, I can't promise you a prize. No comment on whether "Open Questions" are related to the answer I have in mind.

EDIT: Followup here

New Comment
35 comments, sorted by Click to highlight new comments since:

I'm not really sure what you mean when you say "something goes wrong" (in relation to the prize). I've been thinking about all this in a very descriptive way, ie, I want to understand what happens generally, not force a particular outcome. So I'm a little out-of-touch with the "goes wrong" framing at the moment. There are a lot of different things which could happen. Which constitute "going wrong"?

  • Becoming non-myopic; ie, using strategies which get lower prediction loss long-term rather than on a per-question basis.
    • (Note this doesn't necessarily mean planning to do so, in an inner-optimizer way.)
  • Making self-fulfilling prophecies in order to strategically minimize prediction loss on individual questions (while possibly remaining myopic).
  • Having a tendency for self-fulfilling prophecies at all (not necessarily strategically minimizing loss).
  • Having a tendency for self-fulfilling prophecies, but not necessarily the ones which society has currently converged to (eg, disrupting existing equilibria about money being valuable because everyone expects things to stay that way).
  • Strategically minimizing prediction loss in any way other than by giving better answers in an intuitive sense.
  • Manipulating the world strategically in any way, toward any end.
  • Catastrophic risk by any means (not necessarily due to strategic manipulation).

In particular, inner misalignment seems like something you aren't including in your "going wrong"? (Since it seems like an easy answer to your challenge.)

I note that the recursive-decomposition type system you describe is very different from most modern ML, and different from the "basically gradient descent" sort of thing I was imagining in the story. (We might naturally suppose that Predict-O-Matic has some "secret sauce" though.)

If you aren't already convinced, here's another explanation for why I don't think the Predict-O-Matic will make self-fulfilling prophecies by default.
In Abram's story, the engineer says: "The answer to a question isn't really separate from the expected observation. So 'probability of observation depending on that prediction' would translate to 'probability of an event given that event', which just has to be one."
In other words, if the Predict-O-Matic knows it will predict P = A, it assigns probability 1 to the proposition that it will predict P = A.

Right, basically by definition. The word 'given' was intended in the Bayesian sense, ie, conditional probability.

I contend that Predict-O-Matic doesn't know it will predict P = A at the relevant time. It would require time travel -- to know whether it will predict P = A, it will have to have made a prediction already, and but it's still formulating its prediction as it thinks about what it will predict.

It's quite possible that the Predict-O-Matic has become relatively predictable-by-itself, so that it generally has good (not perfect) guesses about what it is about to predict. I don't mean that it is in an equilibrium with itself; its predictions may be shifting in predictable directions. If these shifts become large enough, or if its predictability goes second-order (it predicts that it'll predict its own output, and thus pre-anticipates the direction of shift recursively) it has to stop knowing its own output in so much detail (it's changing too fast to learn about). But it can possibly know a lot about its output.

I definitely agree with most of the stuff in the 'answering a question by having the answer' section. Whether a system explicitly makes the prediction into a fixed point is a critical question, which will determine which way some of these issues go.

  • If the system does, then there are explicit 'handles' to optimize the world by selecting which self-fulfilling prophecies to make true. We are effectively forced to deal with the issue (if only by random selection).
  • If the system doesn't, then we lack such handles, but the system still has to do something in the face of such situations. It may converge to self-fulfilling stuff. It may not, and so, produce 'inconsistent' outputs forever. This will depend on features of the learning algorithm as well as features of the situation it finds itself in.

It seems a bit like you might be equating the second option with "does not produce self-fulfilling prophecies", which I think would be a mistake.

Intuitively, things go wrong if you get unexpected, unwanted, potentially catastrophic behavior. Basically, if it's something we'd want to fix before using this thing in production. I think most of your bullet points qualify, but if you give an example which falls under one of those bullet points, yet doesn't seem like it'd be much of a concern in practice (very little catastrophic potential), that might not get a prize.

In particular, inner misalignment seems like something you aren't including in your "going wrong"? (Since it seems like an easy answer to your challenge.)

Thanks for bringing that up. Yes, I am looking specifically for defeaters aimed in the general direction of the points I made in this post. Bringing up generic widely known safety concerns that many designs are potentially susceptible to does not qualify.

I note that the recursive-decomposition type system you describe is very different from most modern ML, and different from the "basically gradient descent" sort of thing I was imagining in the story. (We might naturally suppose that Predict-O-Matic has some "secret sauce" though.)

I think there's potentially an analogy with attention in the context of deep learning, but it's pretty loose.

It seems a bit like you might be equating the second option with "does not produce self-fulfilling prophecies", which I think would be a mistake.

Do you mean to say that a prophecy might happen to be self-fulfilling even if it wasn't optimized for being so? Or are you trying to distinguish between "explicit" and "implicit" searches for fixed points? Or are you trying to distinguish between fixed points and self-fulfilling prophecies somehow? (I thought they were basically the same thing.)

Do you mean to say that a prophecy might happen to be self-fulfilling even if it wasn't optimized for being so? Or are you trying to distinguish between "explicit" and "implicit" searches for fixed points?

More the second than the first, but I'm also saying that the line between the two is blurry.

For example, suppose there is someone who will often do what predict-o-matic predicts if they can understand how to do it. They often ask it what they are going to do. At first, predict-o-matic predicts them as usual. This modifies their behavior to be somewhat more predictable than it normally would be. Predict-o-matic locks into the patterns (especially the predictions which work the best as suggestions). Behavior gets even more regular. And so on.

You could say that no one is optimizing for fixed-point-ness here, and predict-o-matic is just chancing into it. But effectively, there's an optimization implemented by the pair of the predict-o-matic and the person.

In situations like that, you get into an optimized fixed point over time, even though the learning algorithm itself isn't explicitly searching for that.

To highlight the "blurry distinction" more:

In situations like that, you get into an optimized fixed point over time, even though the learning algorithm itself isn't explicitly searching for that.

Note, if the prediction algorithm anticipates this process (perhaps partially), it will "jump ahead", so that convergence to a fixed point happens more within the computation of the predictor (less over steps of real world interaction). This isn't formally the same as searching for fixed points internally (you will get much weaker guarantees out of this haphazard process), but it does mean optimization for fixed point finding is happening within the system under some conditions.

[-]BunthutΩ3110
One possibility is that it's able to find a useful outside view model such as "the Predict-O-Matic has a history of making negative self-fulfilling prophecies". This could lead to the Predict-O-Matic making a negative prophecy ("the Predict-O-Matic will continue to make negative prophecies which result in terrible outcomes"), but this prophecy wouldn't be selected for being self-fulfilling. And we might usefully ask the Predict-O-Matic whether the terrible self-fulfilling prophecies will continue conditional on us taking Action A.

Maybe I misunderstood what you mean by dualism, but I don't think that's true. Say the Predict-O-Matic has an outside view model (of itself) like "The metal box on your desk (the Predict-O-Matic) will make a self-fullfilling prophecy that maximizes the number of paperclips". Then you ask it how likely it is that your digital records will survive for 100 years. It notices that that depends significantly on how much effort you make to secure them. It notices that that significantly depends on what the metal box on your desk tells you. It uses it's low-model resolution of what the box says. To work that out, it checks which outputs would be self-fulfilling, and then which of these leads to the most paperclips. The more unsecure your digital records are, the more you will invest in paper, and the more paperclips you will need. Therefore the metal box will tell you the lowest self-fulfilling propability for your question. Since that number is *self-fulfilling*, it is in fact the correct answer, and the Predict-O-Matic will answer with it.

I think this avoids your argument that

I contend that Predict-O-Matic doesn't know it will predict P = A at the relevant time. It would require time travel -- to know whether it will predict P = A, it will have to have made a prediction already, and but it's still formulating its prediction as it thinks about what it will predict.

because it doesn't have to simulate itself in detail to know what the metal box (it) will do. The low-resolution model provides a shortcut around that, but it will be accurate despite the low resolution, because by believing it is simple, it becomes simple.

Can you usefully ask for conditionals? Maybe. The answer to the conditional depends on what worlds you are likely to take Action A in. It might be that in most worlds where you do A, you do it because of a prediction from the metal box, and since we know those maximize paperclips, there's a good chance the action will fail to prevent it in those cricumstances. But if that's not the case, for example because it's certain you won't ask the box any more questions between this one and the event it tries to predict.

It might be possible to avoid any problems of this sort by only ever asking questions of the type "Will X happen if I do Y now (with no time to receive new info between hearing the prediction and doing the action)?", because by backwards induction the correct answer will not depend on what you actually do. This doesn't avoid the scenarios on the original where multiple people act on their Predict-O-Matics, but I suspect these aren't solvable without coordination.

What's going on when you try to model yourself thinking about the answer to this question?

If a system is analyzing (itself analyzing (itself analyzing (...))) , not realizing it's doing so, I suspect that it will come up with some best guess answer, but that answer will be ill-determined and dependent on implementation details. Thus a better approach would be to avoid asking self-unaware systems any question that requires that type of analysis!

For example, you can ask "Please output the least improbable scenario, according to your predictive world-model, wherein a cure for Alzheimer's is invented by a group of people with no access to any AI oracles!" Or even ask it to do counterfactual reasoning about what might happen in a world in which there are no AIs whatsoever. (copied from my post here). This type of question is nice for other reasons too—we're asking the system to guess what normal humans might plausibly do in the natural course of events, and thus we'll more typically get normal-human-type solutions to our problems as opposed to bizarre alien human-unfriendly solutions.

[-]evhubΩ570

If dualism holds for Abram's prediction AI, the "Predict-O-Matic", its world model may happen to include this thing called the Predict-O-Matic which seems to make accurate predictions -- but it's not special in any way and isn't being modeled any differently than anything else in the world. Again, I think this is a pretty reasonable guess for the Predict-O-Matic's default behavior. I suspect other behavior would require special code which attempts to pinpoint the Predict-O-Matic in its own world model and give it special treatment (an "ego").

I don't think this is right. In particular, I think we should expect ML to be biased towards simple functions such that if there's a simple and obvious compression, then you should expect ML to take it. In particular, having an "ego" which identifies itself with its model of itself significantly reduces description length by not having to duplicate a bunch of information about its own decision-making process.

In particular, I think we should expect ML to be biased towards simple functions such that if there's a simple and obvious compression, then you should expect ML to take it.

Yes, for the most part.

In particular, having an "ego" which identifies itself with its model of itself significantly reduces description length by not having to duplicate a bunch of information about its own decision-making process.

I think maybe you're pre-supposing what you're trying to show. Most of the time, when I train a machine learning model on some data, that data isn't data about the ML training algorithm or model itself. This info is usually not part of the dataset whose description length the system is attempting to minimize.

A machine learning model doesn't get understanding of or data about its code "for free", in the same way we don't get knowledge of how brains work "for free" despite the fact that we are brains. Humans get self-knowledge in basically the same way we get any other kind of knowledge--by making observations. We aren't expert neuroscientists from birth. Part of what I'm trying to indicate with the "dualist" term is that this Predict-O-Matic is the same way, i.e. its position with respect to itself is similar to the position of an aspiring neuroscientist with respect to their own brain.

[-]evhubΩ230

Most of the time, when I train a machine learning model on some data, that data isn't data about the ML training algorithm or model itself.

If the data isn't at all about the ML training algorithm, then why would it even build a model of itself in the first place, regardless of whether it was dualist or not?

A machine learning model doesn't get understanding of or data about its code "for free", in the same way we don't get knowledge of how brains work "for free" despite the fact that we are brains.

We might not have good models of brains, but we do have very good models of ourselves, which is the actual analogy here. You don't have to have a good model of your brain to have a good model of yourself, and to identify that model of yourself with your own actions (i.e. the thing you called an "ego").

Part of what I'm trying to indicate with the "dualist" term is that this Predict-O-Matic is the same way, i.e. its position with respect to itself is similar to the position of an aspiring neuroscientist with respect to their own brain.

Also, if you think that, then I'm confused why you think this is a good safety property; human neuroscientists are precisely the sort of highly agentic misaligned mesa-optimizers that you presumably want to avoid when you just want to build a good prediction machine.

--

I think I didn't fully convey my picture here, so let me try to explain how I think this could happen. Suppose you're training a predictor and the data includes enough information about itself that it has to form some model of itself. Once that's happened--or while it's in the process of happening--there is a massive duplication of information between the part of the model that encodes its prediction machinery and the part that encodes its model of itself. A much simpler model would be one that just uses the same machinery for both, and since ML is biased towards simple models, you should expect it to be shared--which is precisely the thing you were calling an "ego."

When you wrote

having an "ego" which identifies itself with its model of itself significantly reduces description length by not having to duplicate a bunch of information about its own decision-making process.

that suggested to me that there were 2 instances of this info about Predict-O-Matic's decision-making process in the dataset whose description length we're trying to minimize. "De-duplication" only makes sense if there's more than one. Why is there more than one?

We might not have good models of brains, but we do have very good models of ourselves, which is the actual analogy here. You don't have to have a good model of your brain to have a good model of yourself, and to identify that model of yourself with your own actions (i.e. the thing you called an "ego").

Sometimes people take psychedelic drugs/meditate and report an out of body experience, oneness with the universe, ego dissolution, etc. This suggests to me that ego is an evolved adaptation rather than a necessity for cognition. A clue is the fact that our ego extends to all parts of our body, even those which aren't necessary for computation (but are necessary for survival & reproduction)

there is a massive duplication of information between the part of the model that encodes its prediction machinery and the part that encodes its model of itself.

The prediction machinery is in code, but this code isn't part of the info whose description length is attempting to be minimized, unless we take special action to include it in that info. That's the point I was trying to make previously.

Compression has important similarities to prediction. In compression terms, your argument is essentially that if we use zip to compress its own source code, it will be able to compress its own source code using a very small number of bytes, because it "already knows about itself".

[-]evhubΩ110

that suggested to me that there were 2 instances of this info about Predict-O-Matic's decision-making process in the dataset whose description length we're trying to minimize. "De-duplication" only makes sense if there's more than one. Why is there more than one?

ML doesn't minimize the description length of the dataset—I'm not even sure what that might mean—rather, it minimizes the description length of the model. And the model does contain two copies of information about Predict-O-Matic's decision-making process—one in its prediction process and one in its world model.

The prediction machinery is in code, but this code isn't part of the info whose description length is attempting to be minimized, unless we take special action to include it in that info. That's the point I was trying to make previously.

Modern predictive models don't have some separate hard-coded piece that does prediction—instead you just train everything. If you consider GPT-2, for example, it's just a bunch of transformers hooked together. The only information that isn't included in the description length of the model is what transformers are, but "what's a transformer" is quite different than "how do I make predictions." All of the information about how the model actually makes its predictions in that sort of a setup is going to be trained.

I think maybe what you're getting at is that if we try to get a machine learning model to predict its own predictions (i.e. we give it a bunch of data which consists of labels that it made itself), it will do this very easily. Agreed. But that doesn't imply it's aware of "itself" as an entity. And in some cases the relevant aspect of its internals might not be available as a conceptual building block. For example, a model trained using stochastic gradient descent is not necessarily better at understanding or predicting a process which is very similar to stochastic gradient descent.

Furthermore, suppose that we take the weights for a particular model, mask some of those weights out, use them as the labels y, and try to predict them using the other weights in that layer as features x. The model will perform terribly on this because it's not the task that it was trained for. It doesn't magically have the "self-awareness" necessary to see what's going on.

In order to be crisp about what could happen, your explanation also has to account for what clearly won't happen.

BTW this thread also seems relevant: https://www.lesswrong.com/posts/RmPKdMqSr2xRwrqyE/the-dualist-predict-o-matic-usd100-prize#AvbnFiKpJxDqM8GYh

[-]evhubΩ110

I think maybe what you're getting at is that if we try to get a machine learning model to predict its own predictions (i.e. we give it a bunch of data which consists of labels that it made itself), it will do this very easily. Agreed. But that doesn't imply it's aware of "itself" as an entity.

No, but it does imply that it has the information about its own prediction process encoded in its weights such that there's no reason it would have to encode that information twice by also re-encoding it as part of its knowledge of the world as well.

Furthermore, suppose that we take the weights for a particular model, mask some of those weights out, use them as the labels y, and try to predict them using the other weights in that layer as features x. The model will perform terribly on this because it's not the task that it was trained for. It doesn't magically have the "self-awareness" necessary to see what's going on.

Sure, but that's not actually the relevant task here. It may not understand its own weights, but it does understand its own predictive process, and thus its own output, such that there's no reason it would encode that information again in its world model.

No, but it does imply that it has the information about its own prediction process encoded in its weights such that there's no reason it would have to encode that information twice by also re-encoding it as part of its knowledge of the world as well.

OK, it sounds like we agree then? Like, the Predict-O-Matic might have an unusually easy time modeling itself in certain ways, but other than that, it doesn't get special treatment because it has no special awareness of itself as an entity?

Edit: Trying to provide an intuition pump for what I mean here--in order to avoid duplicating information, I might assume that something which looks like a stapler behaves the same way as other things I've seen which looks like staplers--but that doesn't mean I think all staplers are the same object. It might in some cases be sensible to notice that I keep seeing a stapler lying around and hypothesize that there's just one stapler which keeps getting moved around the office. But that requires that I perceive the stapler as an entity every time I see it, so entities which were previously separate in my head can be merged. Whereas arguendo, my prediction machinery isn't necessarily an entity that I recognize; it's more like the water I'm swimming in in some sense.

[-]evhubΩ110

I don't think we do agree, in that I think pressure towards simple models implies that they won't be dualist in the way that you're claiming.

If dualism holds for Abram’s prediction AI, the “Predict-O-Matic”, its world model may happen to include this thing called the Predict-O-Matic which seems to make accurate predictions—but it’s not special in any way and isn’t being modeled any differently than anything else in the world. Again, I think this is a pretty reasonable guess for the Predict-O-Matic’s default behavior. I suspect other behavior would require special code which attempts to pinpoint the Predict-O-Matic in its own world model and give it special treatment (an “ego”).

I don't see why we should expect this. We're told that the Predict-O-Matic is being trained with something like sgd, and sgd doesn't really care about whether the model it's implementing is dualist or non-dualist; it just tries to find a model that generates a lot of reward. In particular, this seems wrong to me:

The Predict-O-Matic doesn't care about looking bad, and there's nothing contradictory about it predicting that it won't make the very prediction it makes, or something like that.

If the Predict-O-Matic has a model that makes bad prediction (i.e. looks bad), that model will be selected against. And if it accidentally stumbled upon a model that could correctly think about it's own behaviour in a non-dualist fashion, and find fixed points, that model would be selected for (since its predictions come true). So at least in the limit of search and exploration, we should expect sgd to end up with a model that finds fixed points, if we train it in a situation where its predictions affect the future.

If we only train it on data where it can't affect the data that it's evaluated against, and then freeze the model, I agree that it probably won't exhibit this kind of behaviour; is that the scenario that you're thinking about?

it just tries to find a model that generates a lot of reward

SGD searches for a set of parameters which minimize a loss function. Selection, not control.

If the Predict-O-Matic has a model that makes bad prediction (i.e. looks bad), that model will be selected against.

Only if that info is included in the dataset that SGD is trying to minimize a loss function with respect to.

And if it accidentally stumbled upon a model that could correctly think about it's own behaviour in a non-dualist fashion, and find fixed points, that model would be selected for (since its predictions come true).

Suppose we're running SGD trying to find a model which minimizes the loss over a set of (situation, outcome) pairs. Suppose some of the situations are situations in which the Predict-O-Matic made a prediction, and that prediction turned out to be false. It's conceivable that SGD could learn that the Predict-O-Matic predicting something makes it less likely to happen and use that as a feature. However, this wouldn't be helpful because the Predict-O-Matic doesn't know what prediction it will make at test time. At best it could infer that some of its older predictions will probably end up being false and use that fact to inform the thing it's currently trying to predict.

If we only train it on data where it can't affect the data that it's evaluated against, and then freeze the model, I agree that it probably won't exhibit this kind of behaviour; is that the scenario that you're thinking about?

Not necessarily. The scenario I have in mind is the standard ML scenario where SGD is just trying to find some parameters which minimize a loss function which is supposed to approximate the predictive accuracy of those parameters. Then we use those parameters to make predictions. SGD isn't concerned with future hypothetical rounds of SGD on future hypothetical datasets. In some sense, it's not even concerned with predictive accuracy except insofar as training data happens to generalize to new data.

If you think including historical observations of a Predict-O-Matic (which happens to be 'oneself') making bad (or good) predictions in the Predict-O-Matic's training dataset will cause a catastrophe, that's within the range of scenarios I care about, so please do explain!

By the way, if anyone wants to understand the standard ML scenario more deeply, I recommend this class.

I think our disagreement comes from you imagining offline learning, while I'm imagining online learning. If we have a predefined set of (situation, outcome) pairs, then the Predict-O-Matic's predictions obviously can't affect the data that it's evaluated against (the outcome), so I agree that it'll end up pretty dualistic. But if we put a Predict-O-Matic in the real world, let it generate predictions, and then define the loss according to what happens afterwards, a non-dualistic Predict-O-Matic will be selected for over dualistic variants.

If you still disagree with that, what do you think would happen (in the limit of infinite training time) with an algorithm that just made a random change proportional to how wrong it was, at every training step? Thinking about SGD is a bit complicated, since it calculates the gradient while assuming that the data stays constant, but if we use online training on an algorithm that just tries things until something works, I'm pretty confident that it'd end up looking for fixed points.

But if we put a Predict-O-Matic in the real world, let it generate predictions, and then define the loss according to what happens afterwards, a non-dualistic Predict-O-Matic will be selected for over dualistic variants.

Yes, that sounds more like reinforcement learning. It is not the design I'm trying to point at in this post.

If you still disagree with that, what do you think would happen (in the limit of infinite training time) with an algorithm that just made a random change proportional to how wrong it was, at every training step?

That description sounds a lot like SGD. I think you'll need to be crisper for me to see what you're getting at.

Yes, that sounds more like reinforcement learning. It is not the design I'm trying to point at in this post.

Ok, cool, that explains it. I guess the main differences between RL and online supervised learning is whether the model takes actions that can affect their environment or only makes predictions of fixed data; so it seems plausible that someone training the Predict-O-Matic like that would think they're doing supervised learning, while they're actually closer to RL.

That description sounds a lot like SGD. I think you'll need to be crisper for me to see what you're getting at.

No need, since we already found the point of disagreement. (But if you're curious, the difference is that sgd makes a change in the direction of the gradient, and this one wouldn't.)

it seems plausible that someone training the Predict-O-Matic like that would think they're doing supervised learning, while they're actually closer to RL.

How's that?

Assuming that people don't think about the fact that Predict-O-Matic's predictions can affect reality (which seems like it might have been true early on in the story, although it's admittedly unlikely to be true for too long in the real world), they might decide to train it by letting it make predictions about the future (defining and backpropagating the loss once the future comes about). They might think that this is just like training on predefined data, but now the Predict-O-Matic can change the data that it's evaluated against, so there might be any number of 'correct' answers (rather than exactly 1). Although it's a blurry line, I'd say this makes it's output more action-like and less prediction-like, so you could say that it makes the training process a bit more RL-like.

I think it depends on internal details of the Predict-O-Matic's prediction process. If it's still using SGD, SGD is not going to play the future forward to see the new feedback mechanism you've described and incorporate it into the loss function which is being minimized. However, it's conceivable that given a dataset about its own past predictions and how they turned out, the Predict-O-Matic might learn to make its predictions "more self-fulfilling" in order to minimize loss on that dataset?

SGD is not going to play the future forward to see the new feedback mechanism you’ve described and incorporate it into the loss function which is being minimized

My 'new feedback mechanism' is part of the training procedure. It's not going to be good at that by 'playing the future forward', it's going to become good at that by being trained on it.

I suspect we're using SGD in different ways, because everything we've talked about seems like it could be implemented with SGD. Do you agree that letting the Predict-O-Matic predict the future and rewarding it for being right, RL-style, would lead to it finding fixed points? Because you can definitely use SGD to do RL (first google result).

I suspect we're using SGD in different ways, because everything we've talked about seems like it could be implemented with SGD. Do you agree that letting the Predict-O-Matic predict the future and rewarding it for being right, RL-style, would lead to it finding fixed points? Because you can definitely use SGD to do RL (first google result).

Fair enough, I was thinking about supervised learning.

Two remarks.

Remark 1: Here's a simple model of self-fulfilling prophecies.

First, we need to decide how Predict-O-Matic outputs its predictions. In principle, it could (i) produce the maximum likelihood outcome (ii) produce the entire distribution over outcomes (iii) sample an outcome of the distribution. But, since Predict-O-Matic is supposed to produce predictions for large volume data (e.g. the inauguration speech of the next US president, or the film that will win the Oscar in 2048), the most sensible option is (iii). Option (i) can produce an outcome that is maximum likelihood but is extremely untypical (since every individual outcome has very low probability), so it is not very useful. Option (ii) requires somehow producing an exponentially large vector of numbers, so it's infeasible. More sophisticated variants are possible, but I don't think any of them avoids the problem.

If the Predict-O-Matic is a Bayesian inference algorithm, an interesting dynamic will result. On each round, some hypothesis will be sampled out of the current belief state. If this hypothesis is a self-fulfilling prophecy, sampling it will cause its likelihood to go up. We get positive feedback: the higher the probability Predict-O-Matic assigns to the hypothesis, the more often it is sampled, the more evidence in favor of the hypothesis is produced, the higher its probability becomes. So, if it starts out as sufficiently probable a priori, the belief state will converge there.

Of course realistic learning algorithms are not Bayesian inference, but they have to approximate Bayesian inference in some sense. At the least, there has to be some large space of hypotheses s.t. if one of them is true, the algorithm will converge there. Any algorithm with this property probably displays the dynamics above.

Now, to the simple model. In this model we have just two outcomes: A and B (so it's not large volume data, but that doesn't matter). On each round a prediction is made, after which some outcome occurs. The true environments works as follows: if prediction "A" is made, on this round A happens with probability 99% and B with probability 1%. If prediction "B" is made, on this round B happens with probability 100%. Of course Predict-O-Matic is not aware that predictions can influence outcomes. Instead, we will assume Predict-O-Matic is doing Bayesian inference with a prior over hypotheses, each of which assumes that the environment is IID. In other words, it is learning a single parameter which is the probability A will occur on any given round.

Claim: If the prior is s.t. any interval in -space is assigned positive probability, then Predict-O-Matic will converge to predicting B with frequency 1.

Sketch of proof: If Predict-O-Matic converges to predicting B with frequency then the environment converges to producing outcome B with frequency , implying that Predict-O-Matic converges to predicting B with frequency .

Remark 2: Some of the hypotheses in the prior might be intelligent agents in their own right, with their own utility functions. Such an agent can intentionally produce correct predictions to increase its probability in the belief state, until a "treacherous turn" point when it produces a prediction designed to have irreversible consequences in the outside world in favor of the agent. If it is not a self-fulfilling prophecy, this treacherous prediction will cause Predict-O-Matic to update against the agentic hypothesis, but it might be too late. If it is a self-fulfilling prophecy, it will only make this hypothesis even stronger.

Moreover, there is a mechanism that systematically produces such agentic hypotheses. Namely, a sufficiently powerful predictor is likely to run into "simulation hypotheses" i.e. hypotheses that claim the universe is a simulation by some other agent. As Christiano argued before, that opens an attack vector for powerful agents across the multiverse to manipulate Predict-O-Matic into making whatever predictions they want (assuming Predict-O-Matic is sufficiently powerful to guess what predictions those agents would want it to make).

I disagree that self-unawareness / dualism should be the default assumption, for reasons I explained in this comment. In fact I think that making a system that knowably remains self-unaware through arbitrary increases in knowledge and capability would be a giant leap towards solving AI alignment. I have vague speculative ideas for how that might be done with a type-checking proof, again see that comment I linked.

So most animals don't seem very introspective. Machine learning algorithms haven't shown spontaneous capacity for introspection (so far, that I know of). But humans can introspect. Maybe a crux here is something along the lines of: Humans have capability for introspection. They're also smarter than animals. Maybe once our ML algorithms get good enough, the capacity for introspection will spontaneously arise.

People should be thinking about this possibility. But we also have ML algorithms which are in some ways superhuman and like I said I know no instances of spontaneous emergence of introspection. It seems like a reasonably likely possibility to me that "intelligence" of the sort needed for cross-domain superhuman prediction ability and spontaneous emergence of introspection are, in fact, orthogonal axes.

In terms of non-spontaneous emergence of introspection, that's basically meta-learning. I agree meta-learning is probably super important for the future of AI. In fact, come to think of it, I wonder if the reason why humans are both smart and introspective is because our brains evolved some additional meta-learning capabilities! And I agree that your idea of having some kind of firewall between introspection and object-level reality models could help prevent problems. I've spent a little while thinking concretely about how this could work.

(Hoping to run a different competition related to these issues at some time in the future. Was thinking it would be bigger and longer--please PM me if you want to help contribute to the prize pool.)

I am using the term "self-aware" to mean "knowing that one exists in the world and can affect the world", in which case animals, RL robots, etc., are all trivially self-aware. You seem to be using the term "introspective" for something beyond mere self-awareness—maybe "having concepts in the world-model that are sufficiently general that they apply to both the outside world and one's internal information-processing". Something like that? You can tell me.

So let's take these two levels, self-awareness ("I exist and can affect the world") and introspection ("Why am I thinking about that? I seem to have an associative memory!")

As I read the OP, it seems to me that self-awareness is the relevant threshold you rely on, not introspection. (Do you agree?) I do think that self-awareness is what you need for powerful safety guarantees, and that we should study the possibility of self-unaware systems, even if it's not guaranteed to be possible.

As for introspection, I do in fact think that any AI system which can develop deep, general, mechanistic understandings of things in the world, and which is self-aware at all, will go beyond mere self-awareness to develop deep introspection. My reason is that such AIs will have a general capability to find underlying patterns, and thus will discover an analogy between its own thoughts and actions and those of others. Doing that just doesn't seem fundamentally different from, say, discovering the law of gravitation by discovering an analogy between the behavior of planets versus apples (which in turn is harder but not fundamentally different from knowing how to twist off a bottle cap by discovering an analogy with previous bottle caps that one has used). Thus, I think that the only way to prevent an arbitrarily intelligent world-modeling AI from developing arbitrarily deep introspective understanding, is to build the system to have no self-awareness in the first place.

Studying the possibility of self-aware systems seems like a good idea, but I have a feeling most ways to achieve this will be brittle. My objective with this post was to get crisp stories for why self-aware predictive systems should be considered dangerous.

My reason is that such AIs will have a general capability to find underlying patterns, and thus will discover an analogy between its own thoughts and actions and those of others.

Let's taboo introspection for a minute. Suppose the AI does discover some underlying patters and analogize the piece of matter in which it is encased with the thoughts and actions of its human operator. Not only that, it finds analogies between other computers and its human operator, between its human operator and other computers, etc. Why precisely is this a problem?

I wouldn't argue that self-aware systems are automatically dangerous, but rather that self-unaware systems are automatically safe (or at least comparatively pretty safe).

More specifically: Most people in AI safety, most of the time, are talking about self-aware (in my minimal sense of taking purposeful actions etc.) agent-like systems. I don't think such systems are automatically dangerous, but they do necessitate solving the alignment problem, and since we haven't solved the alignment problem yet, I think it's worth spending time exploring alternative approaches.

If you're making a prediction system (or an oracle more generally), there seems to be a possibility of making it self-unaware—it doesn't know that it's outputting predictions, it doesn't know that it even has an output, it doesn't know that it exists in the universe, etc. A toy example is a superhuman world-model which is completely and easily interpretable; you can just look at the data structure and understand every aspect of it, see what the concepts are and how they're connected, and you can use that to explore counterfactuals and understand things etc. That data structure is the whole system, and the human users browse it. Anyway, I think the scariest safety risk for oracles is that they'll give manipulative answers, use side-channel attacks, or more generally make intelligent decisions to steer the future towards goals. A self-unaware system will not do that because it is not aware that it can do things to affect the universe. There's still some safety problems (not to mention bad actors etc.), but significantly less scary ones.

I wouldn't argue that self-aware systems are automatically dangerous, but rather that self-unaware systems are automatically safe (or at least comparatively pretty safe).

Fair enough.

Most people in AI safety, most of the time, are talking about self-aware (in my minimal sense of taking purposeful actions etc.) agent-like systems. I don't think such systems are automatically dangerous, but they do necessitate solving the alignment problem, and since we haven't solved the alignment problem yet, I think it's worth spending time exploring alternative approaches.

I suspect the important part is the agent-like part.

I'm not sure it makes to think of "the alignment problem" as a singularity entity. I'd rather taboo "the alignment problem" and just ask what could go wrong with a self-aware system that's not agent-like.

A self-unaware system will not do that because it is not aware that it can do things to affect the universe.

Hot take: it might be useful to think of "self-awareness" and "awareness that it can do things to affect the universe" separately. Not sure they are one and the same.

What is a "system that's not agent-like" in your perspective? How might it be built? Have you written anything about that?

For my part, I thought Rohin's "AI safety without goal-directed behavior" is a good start, but that we need much more and deeper analysis of this topic.

I think that the concept of a reflexive oracle would be really useful to bring into this discussion if it hasn't already been brought up. From the abstract:

In this paper, we introduce a "reflective" type of oracle, which is able to answer questions about the outputs of oracle machines with access to the same oracle. These oracles avoid diagonalization by answering some queries randomly. We show that machines with access to a reflective oracle can be used to define rational agents using causal decision theory. These agents model their environment as a probabilistic oracle machine, which may contain other agents as a non-distinguished part.

In other words, if the Predict-O-Matic knows it will predict P = A, it assigns probability 1 to the proposition that it will predict P = A.

It's a predictor - it produces probabilities (or expected value?). There's also some rules about probability that it might follow - like if asked to guess the probability it rains next wednesday, it will give the same answer as if asked to guess the probability it will give when asked tomorrow.