One solution is to have an RL policy choose where to update the model, while using self-supervised (predictive) learning to decide in which direction to update it. For example, the RL policy can choose what to look at / attend to / think about, or more basically, "what to try predicting"; the model then makes that prediction, and we update the model on the prediction error.
The RL policy can then learn to take actions that account for the value of information.
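A minimal sketch of that loop (toy setup, made-up names and constants, nothing here is from the post): a softmax policy picks which feature to try predicting, the model is updated self-supervised on the prediction error, and the policy is rewarded for picking informative targets.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 5

true_values = rng.normal(size=n_features)   # hidden facts about the world
model = np.zeros(n_features)                # predictive model, trained self-supervised
policy_logits = np.zeros(n_features)        # RL policy over "what to try predicting"

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    # The RL policy chooses WHERE to update: which feature to attend to / predict.
    probs = softmax(policy_logits)
    i = rng.choice(n_features, p=probs)

    # Self-supervised learning decides the DIRECTION: predict, observe, reduce the error.
    observation = true_values[i] + rng.normal(scale=0.1)
    error = observation - model[i]
    model[i] += 0.1 * error

    # The policy is rewarded for informative choices (large prediction errors),
    # so it learns to attend where there is still something left to learn.
    reward = abs(error)
    policy_logits += 0.01 * reward * ((np.arange(n_features) == i) - probs)

print("learned:", np.round(model, 2))
print("truth:  ", np.round(true_values, 2))
```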
(This comment is loosely based on one aspect of what I think the brain does.)
I would hesitate to call it a solution, because my motivation is to prove that partially optimized models are "better than nothing" in some sense, and transforming prediction problems into RL problems sounds like something that makes proofs much harder rather than easier. But I agree that it could mitigate the problems that arise in practice, even if it doesn't solve the theory.
Downvoted for burying the lede. I assumed from the buildup this was something other than what it was, e.g. how a model that contains more useful information can still be bad, say because you run out of resources for efficiently interacting with it. But I had to read to the end of the second section to find out I was wrong.
"Rationality is not just about knowing facts. It's about knowing which facts are the most relevant." - Epictetus (c. AD 50)
I would be interested in a more precise definition of what you mean by information here.
In particular, it seems like you're using an unintuitive (to me) definition of information -- though one that lines up colloquially with how we talk about computers.
For example, let's say I have a thumb drive ("Drive A") with two things on it:
And I have a second drive with one thing on it:
I might ask someone: which of these has more information on it?
The colloquial computer-storage-based answer might be: the first one! It takes up a petabyte, whereas the second one takes up less than a gigabyte.
But it feels like something important about the meaning of information (in an AI-understanding-the-world-sense) is being lost here.
(ETA: Also, if determinism factors in here, feel free to replace the petabyte of pi digits with something like a petabyte of recordings from a TRNG.)
The definitions of information I have in mind involve some sort of classification or generation loss function. So for instance, if you trained GPT-3 on the second drive, I would expect it to get lower loss on the datasets it was evaluated on (some sort of mixture of Wikipedia and other internet scrapes, if I recall correctly) than it would if it were trained on the first drive. So by the measures I have in mind, the second drive would very possibly contain more information.
(Though of course this depends on your exact loss function.)
My post is basically based on the observation that we often train machine learning systems with some sort of information-based loss, whether that be self-supervised/generative, fully supervised, or something more complicated. Even if you achieve a better loss with your model, you won't necessarily achieve a better reward for your agent.
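To make that concrete, here is a toy version of the kind of measure I mean (all numbers invented): an average negative log-likelihood on a fixed held-out set, where whichever training data gives the lower number counts as having put "more information" into the model, whether or not that information matters for any particular agent's reward.

```python
import numpy as np

# Toy version of "information as a loss": score two models by their average
# negative log-likelihood on the same held-out evaluation set.
eval_tokens = [0, 1, 1, 2, 0, 2]     # stand-in for a held-out corpus (3-token vocabulary)

model_a = np.array([0.4, 0.3, 0.3])  # predicted token distribution after training on drive A
model_b = np.array([0.2, 0.5, 0.3])  # predicted token distribution after training on drive B

def avg_nll(pred, tokens):
    return -np.mean([np.log(pred[t]) for t in tokens])

for name, pred in [("drive A", model_a), ("drive B", model_b)]:
    print(f"{name} model: held-out loss {avg_nll(pred, eval_tokens):.3f}")
# Whichever model gets the lower loss counts as having "more information" by this measure.
```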
For a more realistic example, this phenomenon of a more accurate model being worse is a common issue in database query optimization.
When a user runs a SQL query, the optimizer uses statistics about the data to estimate the cost of many different ways to execute the query, then picks the plan with the cheapest estimate.
When the optimizer misestimates the cost and chooses a bad plan, the typical solution is to add more detailed statistics about the data. But occasionally adding more statistics can cause the optimizer to choose a plan that's actually worse.
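Here is a toy illustration of that failure mode, using an invented cost model rather than any real optimizer's behavior: with only a coarse default selectivity guess the optimizer picks the robust sequential scan, while per-column statistics combined with the usual independence assumption badly underestimate the row count for correlated predicates and flip the choice to a plan that is actually slower.

```python
# Invented cost model for a query with two correlated filters (a = 1 AND b = 1);
# the columns are perfectly correlated, so the true combined selectivity is 1%.
n_rows = 1_000_000
true_selectivity = 0.01

def estimated_costs(estimated_selectivity):
    """Costs the optimizer computes from its statistics (made-up formulas)."""
    est_rows = n_rows * estimated_selectivity
    return {"seq_scan": n_rows,                  # always read the whole table
            "index_plan": 50 + 400 * est_rows}   # cheap only if few rows are expected

def actual_cost(plan):
    """What the plan really costs under the true 1% selectivity."""
    actual_rows = n_rows * true_selectivity
    return n_rows if plan == "seq_scan" else 50 + 400 * actual_rows

scenarios = [
    ("coarse statistics, pessimistic default guess (25%)", 0.25),
    ("per-column statistics + independence assumption (1% x 1%)", 0.0001),
]
for label, est in scenarios:
    costs = estimated_costs(est)
    chosen = min(costs, key=costs.get)
    print(f"{label}: chooses {chosen}, actual cost {actual_cost(chosen):,.0f}")
```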
The brain overcomes this issue through the use of saliency-weighted learning. I don't have any references at the moment, but essentially, information is more salient when it is more surprising, either to the agent's world model or to its self model.
For the former, the agent is constantly making predictions about what it will experience, along with the precision of these expectations, such that when it encounters something outside those bounds, it takes notice and updates its world model more strongly in the direction of minimizing the prediction error.
The latter, however, is where the "usefulness" of salient information is most directly apparent. The agent is not just predicting what will happen in the external world like some disembodied observer. It is modeling what it expects to experience conditioned on its model of itself being healthy and functional. When something surprisingly good occurs, it takes special note of all information that was coincident with the pleasure signal to try to make such experiences more likely in the future. And when something surprisingly bad occurs, it also takes notice of all information coincident with the pain signal so that it can make such experiences less likely in the future.
When everything is going as expected, though, the agent will tend not to keep that information around. Saliency-weighted learning is all about steering an agent's models toward better predictive power and steering its behavior toward states of easier survivability (or easier learnability for a curiosity drive), allowing it to discard most information that it encounters in favor of only that which challenges its expectations.
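As a minimal sketch of the world-model half of this (constants made up): the agent keeps a running estimate and an expected precision for some quantity, scales its updates by how surprising each observation is, and only retains the observations that fall outside its expected bounds.

```python
import numpy as np

rng = np.random.default_rng(1)

mean, var = 0.0, 1.0          # the agent's current expectation and its (im)precision
base_lr = 0.02
stored = []                   # only salient observations are kept around

signal = np.concatenate([rng.normal(0.0, 1.0, 500),    # world behaving as expected
                         rng.normal(5.0, 1.0, 500)])   # sudden change: everything is surprising

for x in signal:
    surprise = abs(x - mean) / np.sqrt(var)    # prediction error in units of expected precision
    saliency = min(surprise / 2.0, 5.0)        # weight updates by surprise, capped
    lr = base_lr * (1.0 + saliency)
    mean += lr * (x - mean)                    # update more strongly when surprised
    var += lr * ((x - mean) ** 2 - var)        # also update the precision estimate
    if surprise > 2.0:                         # only "out of bounds" observations are retained
        stored.append(x)

print(f"final expectation: {mean:.2f}, observations kept: {len(stored)} of {len(signal)}")
```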
Saliency-based learning can definitely reduce this problem. Neural network reinforcement learners typically do something similar, e.g. predicting rewards (this is also necessary for other purposes). However, I don't think it fully solves the problem because it only weights the information that it can immediately identify as being related to what it is seeking, and not the information that may eventually turn out to be useful for what it is seeking. Of course the latter is not really solvable in the general case.
All information currently in working memory could potentially become highly weighted when a saliency signal comes along. Through reinforcement learning, I imagine the agent could optimize whatever attention circuit does the loading of information into working memory in order to make this more useful, as part of some sort of learning-to-learn algorithm.
I'm not sure it makes sense to talk about "the model's accuracy" in the abstract. You need to have some way of actually measuring the accuracy. One way to do that might be: of a list of questions sampled with a given probability distribution from the space of possible questions, what fraction does the model get right?
Given the setup, one obvious probability distribution is: questions asked by the actor in the process of deciding what to do. This seems, in fact, to be the only such probability distribution already present in the setup, rather than an additional benchmark brought in for as-yet unstated reasons.
Of course, there's a problem: the system as a whole doesn't know what questions the actor will ask. Or, if it knows the actor will only ask "what action will maximize u", we've just shifted things into the model; it will need to ask sub-questions to give a good answer.
Still, we need to get the questions that the model will be optimized to answer from somewhere. There actually is no such thing as a "best" (or even a "most accurate") model in the abstract. So we'll need a model of what questions the actor will ask, and what sub-questions we'll need to answer those. (To avoid infinite regress, the purpose of that model is just to predict what questions the world-model will need to answer.)
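To make the dependence on the question distribution concrete (toy numbers throughout): the same two models from the post can rank in either order depending on which questions you weight.

```python
import numpy as np

# Probability of answering each of three question types correctly (all numbers invented).
#                        [food places, attractions, who lives where]
tourist_map   = np.array([0.9, 0.9, 0.1])
resident_book = np.array([0.1, 0.1, 0.99])

# Two question distributions the accuracy could be evaluated against.
hungry_tourist = np.array([0.5, 0.5, 0.0])    # questions a hungry tourist actually asks
census_taker   = np.array([0.05, 0.05, 0.9])  # questions someone doing a census might ask

for name, dist in [("hungry tourist", hungry_tourist), ("census taker", census_taker)]:
    print(f"{name}: map {tourist_map @ dist:.2f}, book {resident_book @ dist:.2f}")
```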
I think in your examples, the problem is straightforwardly that the (implicit) question-model being used isn't very accurate. This is not the only way that having a more accurate world-model can harm you though. An obvious example would be if Omicron (a bit less powerful than Omega) wants the password to a vault which you'd prefer not to let it open, but will only threaten to cut off your arms and legs if it (correctly) predicts you actually know the password. It would be a lot more interesting if someone could come up with examples that don't directly punish you simply for knowing more, but I'm out of time for the moment.
The common way to evaluate model accuracy in machine learning contexts is that you have a bunch of samples of the "ground truth" that is to be predicted; e.g. classified images for supervised learning. And then you evaluate the model on those samples. That is the sort of accuracy measure I had in mind when writing the post, because that is what gets used in practice.
(In particular, models get directly optimized to perform better on this accuracy measure, but it might not reach an optimum, so the question is what happens when the model doesn't reach an optimum.)
> Given the setup, one obvious probability distribution is: questions asked by the actor in the process of deciding what to do. This seems, in fact, to be the only such probability distribution already present in the setup, rather than an additional benchmark brought in for as-yet unstated reasons.
>
> Of course, there's a problem: the system as a whole doesn't know what questions the actor will ask. Or, if it knows the actor will only ask "what action will maximize u", we've just shifted things into the model; it will need to ask sub-questions to give a good answer.
>
> Still, we need to get the questions that the model will be optimized to answer from somewhere. There actually is no such thing as a "best" (or even a "most accurate") model in the abstract. So we'll need a model of what questions the actor will ask, and what sub-questions we'll need to answer those. (To avoid infinite regress, the purpose of that model is just to predict what questions the world-model will need to answer.)
Since I already have an idea of what distribution the accuracy gets evaluated on, it won't be about the questions the actor asks. However, the problem you mention here comes up in a different way: in e.g. reinforcement learning contexts, the distribution of data the AI faces, which we use to evaluate the model's accuracy, may depend on the decisions of the AI, and so by transitivity also on the model's predictions. This prevents there from being a straightforwardly best option.
(In fact, it's a problem similar to this that led me to want to better understand partially optimized models.)
> The common way to evaluate model accuracy in machine learning contexts is that you have a bunch of samples of the "ground truth" that is to be predicted; e.g. classified images for supervised learning. And then you evaluate the model on those samples. That is the sort of accuracy measure I had in mind when writing the post, because that is what gets used in practice.
That's what gets used for supervised or unsupervised learning, but your post started out with "Suppose we want to create an agent AI", and there's no straightforward way of interpreting systems trained with those techniques as agents. Perhaps you intended for some such system to be used as the "model" subsystem of an agent AI, but in that case I think the problem really is basically what I said: the actor should be defining what information it wants to get out of the model, and the model should be optimized to supply that information, and if it isn't, that model won't do as well at providing the information the actor needs.
I don't think "amount of information contained" even sounds like a property of a model that anyone would think they should care about, absent some detail about what that information is about. Otherwise a model that knows nothing but a sufficiently massive number of digits of pi would be better than one that can answer any question you have about the real world but knows pi to only 50 decimal places. "Percent of questions in the test set answered correctly" does sound possibly useful, if you want to get answers to questions drawn from the same distribution. "Percent of questions I actually ask, weighted by how much value I get from having that particular question answered correctly" would be an even better metric (with the defect of being impossible to directly optimize for), of course, but the long book about who lives where and the library describing the death chamber don't even seem to live up to the minimal "this answers the kind of questions I want to ask" criterion.
TL;DR - A model that has more information in total might have less information about the things you care about, making it less useful in practice. This can hold even when averaging over all possible things one might care about.
Suppose we want to create an agent AI - that is, an AI that takes actions in the world to achieve some goal u (which might for instance be specified as a reward function). In that case, a common approach is to split the agent up into multiple parts, with one part being a model which is optimized for accurately predicting the world, and another being an actor which, given the model, chooses its actions in such a way as to achieve the goal u.
This leads to an important observation: The model is not directly optimized to help the agent achieve u. There are important technical reasons why this is the case; for instance, it can be hard to effectively figure out how the model relates to the agent's ability to achieve u, and it can be sample-inefficient for an agent not to exploit the rich information it gets from its observations of the world.
So here's a question - if you improve the model's accuracy, do you also improve the actor's ability to achieve u? I will present two counterexamples in this post, and then discuss the implications.
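As a heavily simplified sketch of this split (a small tabular environment with made-up names, not a claim about any particular architecture): the model is fit purely to observed transitions, a predictive objective, and the actor then queries it for whichever action looks best for u; the reward never enters the model's training signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2

# Unknown true dynamics, and a goal u that rewards reaching state 3.
true_P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
u = np.array([0.0, 0.0, 0.0, 1.0])

# --- Model: optimized only for predictive accuracy.
# Here it is just smoothed transition counts, i.e. a maximum-likelihood-style fit
# to observed transitions; u plays no role in this step.
counts = np.ones((n_states, n_actions, n_states))
for _ in range(5000):
    s, a = rng.integers(n_states), rng.integers(n_actions)
    s_next = rng.choice(n_states, p=true_P[s, a])
    counts[s, a, s_next] += 1
model_P = counts / counts.sum(axis=-1, keepdims=True)

# --- Actor: queries the model to pick the action with the highest expected u
# (one-step lookahead, for simplicity).
def act(state):
    return int(np.argmax(model_P[state] @ u))

print("chosen action per state:", [act(s) for s in range(n_states)])
```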
The hungry tourist
Imagine that you are a tourist in a big city. You are hungry right now, so you would like to know where there is some place to eat, and afterwards you would like to know the location of various sights and attractions to visit.
Luckily, you have picked up a map for tourists at the airport, giving you a thorough guide to all of these places. Unfortunately, you have then run into the supervillain Unhelpful Man, who stole your tourist map and replaced it with a long book which labels the home of each person living in the city, but doesn't label any tourist attractions or food places at all.
You are very annoyed by having to go back to the airport for a new map, but Unhelpful Man explains that he actually helped you, because your new book contains much more accurate information about the city than the map did. Sure, the book might not contain the information about the attractions you actually want to visit, but entropy-wise his book more than makes up for that by all of the details it has about who lives where.
In machine learning terms, this book could quite plausibly achieve a lower "loss" on predicting information about the city than the original tourist map could. So, it is a more accurate model. But for the hungry tourist's goal, it is much less useful. It could still quite plausibly be useful for achieving other goals, though, so could one perhaps hypothesize that a more accurate model would tend to be more helpful when averaged across many different goals? This could perhaps mean that optimizing for model accuracy is a reasonable approach if you want to construct the AI in a modular way, creating the model without worrying about the actor's goal.
The death chamber
The supervillain Unhelpful Man has become very mad because you did not appreciate his book of the city, so he has trapped you in a locked room. The door to the room has a password, and if you enter the wrong password, an elaborate mechanism will come out to kill you.
Unhelpful Man laughs at you and explains that he has a whole library of books which together describe the death chamber quite well. He offers to trade you the library for the book describing the city, as long as you admit that more information is better. Desperate to escape, you accept the trade.
Unfortunately, you can't seem to find any book in the library that lists the passwords. Instead, all the books go into excruciating detail about the killing mechanisms; you learn a lot about saws, poison gas, electrocution, radiation, etc., but there seems to be nothing that helps you escape, no mention of the password anywhere. Obviously Unhelpful Man is untrustworthy, but you see no other choice than to ask him for help.
"Oh, the password? It's not in any of these books, it's on page 152 of the book about the city. I told you that it was an important book."
Again we have the same problem; you traded away a smaller amount of information, the book containing the password to escape the room, for a greater amount of information, the library. However, the smaller amount of information was much more relevant to your needs. The specific difference in the death chamber, though, is that survival and escape form a convergent instrumental subgoal which we would expect most goals to imply. Thus in this case we can say that the more accurate model is usually worse for the actor than the less accurate one, even when averaging over goals.
Value of information
The above point boils down to an extremely standard problem: not all information is equally valuable. I am just taking it to an extreme by considering the tradeoff of a small amount of highly valuable information against a large amount of worthless information. For many technical reasons, we want a metric for evaluating models that focuses on the amount of information content. But such a metric will ignore the value of information, and thus not necessarily encourage the best models for agentic purposes.
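As a tiny worked example with made-up numbers: the value of a piece of information to the actor is roughly how much the expected utility of its best action improves after learning it, which is unrelated to how many bits the information takes to store.

```python
# Expected utility of the trapped agent's best available action, under three
# information states (all probabilities invented for illustration):
def expected_utility(p_escape):
    return p_escape * 1.0 + (1 - p_escape) * 0.0   # utility 1 for escaping, 0 otherwise

no_extra_info = expected_utility(0.001)   # best action: guess the password blindly
with_password = expected_utility(0.999)   # a few dozen bits, but it changes the plan entirely
with_library  = expected_utility(0.001)   # terabytes about killing mechanisms; the plan is unchanged

print(f"value of the password: {with_password - no_extra_info:.3f}")  # large
print(f"value of the library:  {with_library - no_extra_info:.3f}")   # ~zero
```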
And this might be totally fine. After all, even if it doesn't necessarily encourage the best models for agentic purposes, it might still be "good enough" in practice? My main motivation for writing this post is that I want to better understand the nature of models that are not fully optimized to reach the best conceivable loss, and so it seemed logical to consider whether getting a better loss can in some sense be said to be better for your ability to achieve your goals. There doesn't appear to be an unconditional proof here, though.
Thanks to Justis Mills for proofreading.