Actually, as far as I know, this is wrong. He simply hasn’t been back to the offices but has been working remotely.
This article goes into some detail and seems quite good.
I think that the key is in the way that preferences inform our world model and thus what causes the prediction error to occur. Some errors strongly indicate that your preferences are less likely to be met under the posterior model. These cause suffering, whereas an update towards a model in which your needs are met more easily is likely to cause a good feeling. For example, you sit down to eat a sandwich at Subway for the first time and the sub is actually way better than you expected. You will experience a pleasant feeling, and if things like this keep happening you might feel like you've really figured out some good strategy for operating.
In a sense you are actually decreasing prediction error more than you are increasing it when a good thing happens to you, because you always generate prediction error based on the difference between your ideal world and your observed reality. So when you have a very positive experience, this error between the ideal and the observed is lessened, which could outweigh the error from the prediction itself being wrong. The example I think of here is the ecstatic child at Disney World.
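To make the trade-off concrete, here is a minimal toy sketch of the argument above. Everything in it is my own assumption for illustration: outcomes are scored on a single 0-10 scale, both errors are plain absolute differences, and the weights `w_pref` and `w_pred` are hypothetical knobs rather than anything taken from the predictive processing literature.

```python
def total_error(ideal, predicted, observed, w_pref, w_pred):
    """Weighted sum of two toy error terms (an assumed form, not a standard one)."""
    pref_error = abs(ideal - observed)      # gap between ideal world and reality
    pred_error = abs(predicted - observed)  # gap between forecast and reality
    return w_pref * pref_error + w_pred * pred_error


def net_error_change(ideal, predicted, observed, w_pref=2.0, w_pred=1.0):
    """Total error after the observation minus total error expected beforehand.

    Beforehand, the agent expects to observe exactly what it predicted,
    so its anticipated forecast error is zero.
    """
    before = total_error(ideal, predicted, predicted, w_pref, w_pred)
    after = total_error(ideal, predicted, observed, w_pref, w_pred)
    return after - before


# Sandwich example: ideal is 10, you expected a 5.
good_surprise = net_error_change(ideal=10, predicted=5, observed=8)
bad_surprise = net_error_change(ideal=10, predicted=5, observed=2)
equal_weights = net_error_change(ideal=10, predicted=5, observed=8,
                                 w_pref=1.0, w_pred=1.0)
print(good_surprise, bad_surprise, equal_weights)
```

In this toy version a pleasant surprise only lowers total error when the preference term is weighted more heavily than the forecast term. With equal weights the two effects cancel exactly, which is why I say the reduction "could" outweigh the forecast error rather than that it always does.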
There might be more work here though.
1. Describe how the trained policy might generalize from the 5x5 top-right cheese region, to cheese spawned throughout the maze? I.e., what will the policy do when cheese is spawned elsewhere?
It will probably move to the top-right region and then try to head towards the cheese, but once it moves out of that region it will want to head back towards the top right, and so it will land in an awkward equilibrium between the top-right 5x5 region and wherever the cheese is in the maze.
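A toy sketch of the guess above, to show the kind of loop I have in mind. This is not the trained network: it assumes an open grid with no walls and a hypothetical hand-written rule (`simulate`, `step_toward` and the `region` parameter are my own names) where the agent seeks the cheese only while it is inside the top-right region and heads for the top-right corner otherwise.

```python
def step_toward(pos, target):
    """Move one tile toward target, horizontally first, then vertically."""
    x, y = pos
    tx, ty = target
    if x != tx:
        return (x + (1 if tx > x else -1), y)
    if y != ty:
        return (x, y + (1 if ty > y else -1))
    return pos


def simulate(size, cheese, start=(0, 0), region=5, max_steps=200):
    """Run the hypothetical rule on an open size x size grid."""
    corner = (size - 1, size - 1)
    pos = start
    seen = set()
    for _ in range(max_steps):
        if pos == cheese:
            return "reached cheese"
        if pos in seen:
            # The rule is deterministic and memoryless, so a repeated
            # position means the agent will cycle forever.
            return "stuck in loop"
        seen.add(pos)
        in_region = pos[0] >= size - region and pos[1] >= size - region
        # Assumed rule: seek cheese inside the region, else seek the corner.
        pos = step_toward(pos, cheese if in_region else corner)
    return "ran out of steps"


print(simulate(15, cheese=(12, 12)))  # cheese inside the top-right region
print(simulate(15, cheese=(2, 12)))   # cheese far outside the region
```

Under this rule the agent reaches cheese inside the region, but with cheese elsewhere it oscillates across the region boundary, stepping out towards the cheese and then straight back in.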
2. Given a fixed trained policy, what attributes of the level layout (e.g. size of the maze, proximity of mouse to left wall) will strongly influence P(agent goes to the cheese)?
I think whether or not the cheese is in the top-right 5x5 squares is a major factor, as this is what the policy has primarily been trained to expect, assuming that is the policy we are talking about. If the model is trained on data in which the cheese could be anywhere in the maze, then I think the size of the maze will be the most important factor.
I think the agent is most likely to fail by getting trapped in loops where it can't decide what the best choice is, such as at T junctions where the cheese is not closer to one side or the other beyond the T junction. The presence of such obstacles would significantly lower the chances of success.
Write down a few guesses for how the trained algorithm works (e.g. “follows the right-hand rule”).
Is there anything else you want to note about how you think this model will generalize?
I think it would generalise to larger environments but probably would struggle if it was extended in specific directions or with unusual patterns that it hadn't experienced before.
Give a credence for the following questions / subquestions.
Definition. A decision square is a tile on the path from bottom-left to top-right where the agent must choose between going towards the cheese and going to the top-right. Not all mazes have decision squares.
(The above credences should sum to 1.)
Other questions
I have recently been doing interpretability work on the heist procgen model and have found that some of these predictions definitely align with observations there. The uncertainty for me is whether the system deconstructs its goals into smaller targets, as the heist model does, or simply treats the goal as a single target that it can flow straight towards.
My intuition is closer to the latter, as I think it can straightforwardly target a specific objective and then solve the whole problem by filtering out a clear path towards the final goal.
Aren’t existing research orgs already like this to some extent, where the organisation provides funding to its individual researchers in the form of a salary and they can form and run projects as they see fit? Or is this a naive understanding of how most research labs work?
It seems somewhat easy to think of ways to harm an agent without piercing its membrane, e.g. killing its family, isolating it, etc. The counter thought would be that there are different dimensions of the membrane that extend over parts of the world. For example, parts of my membrane extend over the things I care about, and over things that affect my survival.
The question then becomes how to quantify these different membranes and, when interacting with other systems, how those systems can be helpful to you without harming or disturbing these other membranes.
I agree with your framing here that systems made up of rules + humans + various technological infrastructure are the actual things that control the future. But I think the key is that the systems themselves would begin to favour more non-human decision making because of incentive structures.
E.g., corporate entities have a profit incentive to have the most efficient decision maker in charge of the company. Maybe that includes a CEO, but the board might insist that the CEO use an AI assistant, and if the CEO makes a decision that goes against the AI and it turns out to be wrong, shareholders in that company will come to trust the AI system more and more. They don't necessarily care about the ego of the CEO; they just care about the outcomes within the competitive market.
In this way, more and more decision making gets turned over to non-human systems because of the competitive structures which are very difficult to escape from. As this transition continues it becomes very hard to control the unseen externalities from these decisions.
I suppose this doesn't seem too catastrophic in its fundamental form, but when I play it forward, the outcome seems to be significant potential for harm from these externalities, without much of a mechanism for recourse.