I know this is becoming my schtick, but have you considered the intentional stance? Specifically, the idea that there is no "the" wants and ontology of E. coli, but that we are ascribing wants and world-modeling to it as a convenient way of thinking about a complicated world, and that different specific models might have advantages and disadvantages with no clear winner.
Because this seems like it has direct predictions about where the meta-strategy can go, and what it's based on.
All this said, I don't think it's hopeless. But it will require abstraction. There is a tradeoff between the predictive accuracy of a model of a physical system and that model containing anything worth being called a "value," so you must allow agential models of complicated systems to predict only a small amount of information about the system, and perhaps even to be poor predictors of that.
Consider how modeling me as an agent gives you some notion of my abstract wants, but only the slimmest help in predicting this text that I'm writing. Evaluated purely as a predictive model, it's remarkably bad! It's also based at least as much on nebulous "common sense" as on actually observing my behavior.
So if you're aiming for eventually tinkering with hand-coded agential models of humans, one necessary ingredient is going to be tolerance for abstraction and suboptimal predictive power. And another ingredient is going to be this "common sense," though maybe you can substitute for that with hand-coding - it might not be impossible, given how simplified our intuitive agential models of humans are.
I was actually going to leave a comment on this topic on your last post (which btw I liked, I wish more people discussed the issues in it), but it didn't seem quite close enough to the topic of that post. So here it is.
Specifically, the idea that there is no "the" wants and ontology of E. coli
This, I think, is the key. My (as-yet-incomplete) main answer is in "Embedded Naive Bayes": there is a completely unambiguous sense in which some systems implement certain probabilistic world-models and other systems do not. Furthermore, the notion is stable under approximation: systems which approximately satisfy the relevant functional equations use these approximate world-models. The upshot is that it is possible (at least sometimes) to objectively, unambiguously say that a system models the world using a particular ontology.
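To give the flavor (the precise formalization is in that post; the notation here is only illustrative): a system with internal dynamics $s_{t+1} = F(s_t, x_t)$ implements a naive Bayes world-model over some latent variable $C$ if there is a map $\phi$ from system states to distributions over $C$ such that the physical dynamics commute with Bayesian updating, i.e.

$$\phi\big(F(s,x)\big)(c) \;\propto\; \phi(s)(c)\,P(x \mid c)$$

for all states $s$ and observations $x$, with the observations treated as conditionally independent given $C$ (the naive Bayes factorization). "Approximately implementing" the model then just means approximately satisfying this functional equation.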
But it will require abstraction
Yup. Thus "Embedded Agency via Abstraction" - this has been my plurality research focus for the past month or so. Thinking about abstract models of actual physical systems, I think it's pretty clear that there are "natural" abstractions independent of any observer, and I'm well on the way to formalizing this usefully.
Of course any sort of abstraction involves throwing away some predictive power, and that's fine - indeed that's basically the point of abstraction. We throw away information and only keep what's needed to predict something of interest. Navier-Stokes is one example I think about: we throw away the details of microscopic motion, and just keep around averaged statistics in each little chunk of space. Navier-Stokes is a "natural" level of abstraction: it's minimally self-contained, with all the info needed to make predictions about the bulk statistics in each little chunk of space, but no additional info beyond that.
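To make the "throw away information, keep averaged statistics" step concrete, here is a toy coarse-graining sketch (the 1D setup, particle counts, and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Microscopic state: positions and velocities of many particles in a 1D box.
n_particles, box_length, n_cells = 100_000, 1.0, 50
positions = rng.uniform(0.0, box_length, n_particles)
velocities = rng.normal(0.0, 1.0, n_particles)
mass = 1.0

# Abstraction step: assign each particle to a cell and keep only the per-cell
# density and momentum, discarding every other microscopic detail.
cell_index = np.clip((positions / box_length * n_cells).astype(int), 0, n_cells - 1)
counts = np.bincount(cell_index, minlength=n_cells)
density = counts * mass / (box_length / n_cells)
momentum = np.bincount(cell_index, weights=mass * velocities, minlength=n_cells)

# 'density' and 'momentum' are the only state a Navier-Stokes-level model keeps;
# everything else about the 100,000 particles has been thrown away.
print(density[:5], momentum[:5])
```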
Anyway, I'll probably be writing much more about this in the next month or so.
So if you're aiming for eventually tinkering with hand-coded agential models of humans, one necessary ingredient is going to be tolerance for abstraction and suboptimal predictive power.
Hand-coded models of humans is definitely not something I aim for, but I do think that abstraction is a necessary element of useful models of humans regardless of whether they're hand-coded. An agenty model of humans is necessary in order to talk about humans wanting things, which is the whole point of alignment - and "humans" "wanting" things only makes sense at a certain level of abstraction.
Somehow I missed that second post of yours. I'll try out the subscribe function :)
Do you also get the feeling that you can sort of see where this is going in advance?
When asking what computations a system instantiates, it seems you're asking which models (or which fits to an instantiated function) perform surprisingly well given the amount of information they use.
To talk about humans wanting things, you need to locate their "wants." In the simple case this means knowing in advance which model, or which class of models, you are using. I think there are interesting predictions we can make about taking a known class of models and asking "does one of these do a surprisingly good job at predicting a system in this part of the world including humans?"
The answer is going to be yes, several times over - humans, and human-containing parts of the environment, are pretty predictable systems, at multiple different levels of abstraction. This is true even if you assume there's some "right" model of humans and you get to start with it, because this model would also be surprisingly effective at predicting e.g. the human+phone system, or humans at slightly lower or higher levels of abstraction. So now you have a problem of underdetermination. What to do? The simple answer is to pick whichever model has the most surprisingly good predictive power, but I think that's not only simple but also wrong.
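As a crude toy operationalization of "surprisingly good prediction for the amount of information used" (entirely my own construction, just to make the shape of the question concrete): score each candidate model by its predictive accuracy minus a complexity penalty, then compare a non-agenty model against an agenty "wants to reach a set-point" model on the same trajectory.

```python
import numpy as np

def surprise_score(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Predictive accuracy minus a BIC-style penalty for information used."""
    return log_likelihood - 0.5 * n_params * np.log(n_obs)

def gaussian_loglik(residuals: np.ndarray) -> float:
    sigma = residuals.std() + 1e-12
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - residuals**2 / (2 * sigma**2)))

# Toy data: a noisy trajectory that drifts toward a target (goal-seeking-ish).
rng = np.random.default_rng(1)
target, x = 3.0, np.zeros(200)
for t in range(199):
    x[t + 1] = x[t] + 0.1 * (target - x[t]) + 0.05 * rng.normal()

# Candidate 1: random walk, no goal (one parameter: noise scale).
resid_rw = np.diff(x)
score_rw = surprise_score(gaussian_loglik(resid_rw), n_params=1, n_obs=len(resid_rw))

# Candidate 2: agenty set-point model, dx = gain * (set_point - x) + noise
# (three parameters: set-point, gain, noise scale), fit by least squares.
dx = np.diff(x)
A = np.stack([np.ones_like(x[:-1]), x[:-1]], axis=1)
coef, *_ = np.linalg.lstsq(A, dx, rcond=None)
score_goal = surprise_score(gaussian_loglik(dx - A @ coef), n_params=3, n_obs=len(dx))

# The agenty model wins here - but so would many nearby variants (human+phone,
# coarser or finer models), which is exactly the underdetermination problem.
print(f"random walk: {score_rw:.1f}   set-point model: {score_goal:.1f}")
```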
Anyhow, since you mention you're not into hand-coding models of humans where we know where the "wants" are stored, I'd be interested in your thoughts on that step too, since just looking for all computations that humans instantiate is going to return a whole lot of answers.
I think it will turn out that, with the right notion of abstraction, the underdetermination is much less severe than it looks at first. In particular, I don't think abstraction is entirely described by a Pareto curve of information thrown out vs predictive power. There are structural criteria, and those dramatically cut down the possibility space.
Consider the Navier-Stokes equations for fluid flow as an abstraction of (classical) molecular dynamics. There are other abstractions which keep around slightly more or slightly less information, and make slightly better or slightly worse predictions. But Navier-Stokes is special among these abstractions: it has what we might call a "closure" property. The quantities which Navier-Stokes predicts in one fluid cell (average density & momentum) can be fully predicted from the corresponding quantities in neighboring cells plus generic properties of the fluid (under certain assumptions/approximations). By contrast, imagine if we tried to also compute the skew or heteroskedasticity or other statistics of particle speeds in each cell. These would have bizarre interactions with higher moments, and might not be (approximately) deterministically predictable at all without introducing even more information in each cell. Going the other direction, imagine we throw out info about density & momentum in some of the cells. Then that throws off everything else, and suddenly our whole fluid model needs to track multiple possible flows.
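As a deliberately simplified stand-in for that closure property (1D diffusion of a single cell-averaged quantity, not real Navier-Stokes; all constants are made up), the structural point is that each cell's next value depends only on the kept quantities in neighboring cells:

```python
import numpy as np

# Kept quantities: one averaged value per cell (a stand-in for density/momentum).
n_cells, dt, dx, diffusivity = 50, 1e-3, 0.02, 0.1
density = np.exp(-((np.linspace(0.0, 1.0, n_cells) - 0.5) ** 2) / 0.01)

def step(rho: np.ndarray) -> np.ndarray:
    # Closure: each cell is updated from its neighbors' kept quantities plus
    # generic constants - no microscopic information is needed.
    laplacian = (np.roll(rho, 1) - 2 * rho + np.roll(rho, -1)) / dx**2
    return rho + dt * diffusivity * laplacian

for _ in range(100):
    density = step(density)

# If we instead kept only every other cell's value (or tried to also track
# higher moments), the update could no longer be written in terms of the kept
# quantities alone - the abstraction would not be closed.
print(density.round(3)[:10])
```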
So there are "natural" levels of abstraction where we keep around exactly the quantities relevant to prediction of the other quantities. Part of what I'm working on is characterizing these abstractions: for any given ground-level system, how can we determine which such abstractions exist? Also, is this the right formulation of a "natural" abstraction, or is there a more/less general criterion which better captures our intuitions?
All this leads into modelling humans. I expect that there is such a natural level of abstraction which corresponds to our usual notion of "human", and specifically humans as agents. I also expect that this natural abstraction is an agenty model, with "wants" built into it. I do not think that there are a large number of "nearby" natural abstractions.
Background
Intuitively, the real world seems to contain agenty systems (e.g. humans), non-agenty systems (e.g. rocks), and ambiguous cases which display some agent-like behavior sometimes (bacteria, neural nets, financial markets, thermostats, etc). There’s a vague idea that agenty systems pursue consistent goals in a wide variety of environments, and that various characteristics are necessary for this flexible goal-oriented behavior.
But once we get into the nitty-gritty, it turns out we don’t really have a full mathematical formalization of these intuitions. We lack a characterization of agents.
To date, the closest we’ve come to characterizing agents in general are the coherence theorems underlying Bayesian inference and utility maximization. A wide variety of theorems with a wide variety of different assumptions all point towards agents which perform Bayesian inference and choose their actions to maximize expected utility. In this framework, an agent is characterized by two pieces: a world-model (a probability distribution over ways the environment could be) and a utility function which the agent maximizes in expectation.
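As a minimal sketch of that two-piece characterization (toy numbers, purely illustrative):

```python
import numpy as np

# Piece 1: a world-model - a prior over hidden states plus likelihoods,
# updated by Bayes' rule. Piece 2: a utility function over (action, state).
states = ["rain", "sun"]
prior = np.array([0.3, 0.7])                        # P(state)
likelihood = {"wet_ground": np.array([0.9, 0.1])}   # P(observation | state)
utility = {
    "umbrella": np.array([1.0, 0.2]),
    "no_umbrella": np.array([-1.0, 1.0]),
}

# Bayesian update on an observation.
obs = "wet_ground"
posterior = prior * likelihood[obs]
posterior /= posterior.sum()

# Choose the action which maximizes expected utility under the posterior.
best_action = max(utility, key=lambda a: float(utility[a] @ posterior))
print(posterior.round(3), best_action)
```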
The Bayesian utility characterization of agency neatly captures many of our intuitions of agency: the importance of accurate beliefs about the environment, the difference between things which do and don’t consistently pursue a goal (or approximately pursue a goal, or sometimes pursue a goal…), the importance of updating on new information, etc.
Sadly, for purposes of AGI alignment, the standard Bayesian utility characterization is incomplete at best. Some example issues include:
One way to view agent foundations research is that it seeks a characterization of agents which resolves problems like the first two above. We want the same sort of benefits offered by the Bayesian utility characterization, but in a wider and more realistic range of agenty systems.
Characterizing Real-World Agents
We want to characterize agency. We have a bunch of real-world systems which display agency to varying degrees. One obvious strategy is to go study and characterize those real-world agenty systems.
Concretely, what would this look like?
Well, let’s set aside the shortcomings of the standard Bayesian utility characterization for a moment, and imagine applying it to a real-world system - a financial market, for instance. We have various coherence theorems saying that agenty systems must implement Bayesian utility maximization, or else allow arbitrage. We have a strong prior that financial markets don’t allow arbitrage (except perhaps very small arbitrage on very short timescales). So, financial markets should have a Bayesian utility function, right? Obvious next step: pick an actual market and try to figure out its world-model and utility function.
I tried this, and it didn’t work. It turns out markets don’t have a utility function in general (in this context, the hypothetical utility-maximizer is called a “representative agent”).
Ok, but markets are still inexploitable and still seem agenty, so where did it go wrong? Can we generalize Bayesian utility to characterize systems which are agenty like markets? This was the line of inquiry which led to “Why Subagents?”. The upshot: for systems with internal state (including markets), the standard utility maximization characterization generalizes to a multi-agent committee characterization.
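As a toy illustration of the committee idea (a simplified construction of my own, not the formalism from that post): trades are accepted only if no subagent objects, which yields incomplete preferences that no single utility function reproduces, yet leaves no money pump.

```python
# Each state maps to a tuple of subagent utilities (subagent 1, subagent 2).
subagent_utilities = {
    "A":  (1.0, 0.0),
    "A+": (1.1, 0.0),
    "B":  (0.0, 1.0),
}

def committee_accepts(current: str, proposed: str) -> bool:
    # Accept a trade only if no subagent is worse off and someone is better off.
    u_cur, u_new = subagent_utilities[current], subagent_utilities[proposed]
    return (all(n >= c for n, c in zip(u_new, u_cur))
            and any(n > c for n, c in zip(u_new, u_cur)))

print(committee_accepts("A", "B"), committee_accepts("B", "A"))  # False False
print(committee_accepts("A", "A+"))                              # True
print(committee_accepts("B", "A+"))                              # False

# No single utility function reproduces this: refusing both A<->B trades forces
# U(A) == U(B); accepting A -> A+ forces U(A+) > U(A) == U(B); but then B -> A+
# should be accepted, and the committee refuses it. Yet the committee is
# inexploitable: every accepted trade is a Pareto improvement for the subagents,
# so accepted trades can never cycle back to an earlier state.
```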
This is an example of a general strategy: take our best current characterization of agency, apply it to a real-world system which seems agenty, find where the characterization fails, and then generalize it to handle the failure.
Note that the last step, generalizing the characterization, still needs to maintain the structure of a characterization of agency. For example, prospect theory does a fine job predicting the choices of humans, but it isn’t a general characterization of effective goal-seeking behavior. There’s no reason to expect prospect-theory-like behavior to be universal for effective goal-seeking systems. The coherence theorems of Bayesian utility, on the other hand, provide fairly general conditions under which Bayesian induction and expected utility maximization are an optimal goal-seeking strategy - and therefore “universal”, at least within the conditions assumed. Although the Bayesian utility framework is incomplete at best, that’s still the kind of thing we’re looking for: a characterization which should apply to all effective goal-seeking systems.
Some examples of (hypothetical) projects which follow this general strategy:
… and so forth.
Why Would We Want to Do This?
Characterization of real-world agenty systems has a lot of advantages as a general research strategy.
First and foremost: when working on mathematical theory, it’s easy to get lost in abstraction and lose contact with the real world. One can end up pushing symbols around ad nauseam, without any idea which way is forward. The easiest counter to this failure mode is to stay grounded in real-world applications. Just as a rationalist lets reality guide beliefs, a theorist lets the problems, properties and intuitions of the real world guide the theory.
Second, when attempting to characterize real-world agenty systems, one is very likely to make some kind of forward progress. If the characterization works, then we’ve learned something useful about an interesting real-world system. If it fails, then we’ve identified a hole in our characterization of agency - and we have an example on hand to guide the construction of a new characterization.
Third, characterization of real-world agenty systems is directly relevant to alignment: the alignment problem itself basically amounts to characterizing the wants and ontologies of humans. This isn’t the only problem relevant to FAI - tiling and stability and subagent alignment and the like are separate - but it is basically the whole “alignment with humans” part. Characterizing e.g. the wants and ontology of an E. coli seems like a natural stepping-stone.
One could object that real-world agenty systems lack some properties which are crucial to the design of aligned AGI - most notably reflection and planned self-modification. A theory developed by looking only at real-world agents will therefore likely be incomplete. On the other hand, you don’t figure out general relativity without figuring out Newtonian gravitation first. Our understanding of agency is currently so woefully poor that we don’t even understand real-world systems, so we might as well start with that and reap all the advantages listed above. Once that’s figured out, we should expect it to pave the way to the final theory: just as general relativity has to reproduce Newtonian gravity in the limit of low speed and low energy, more advanced characterizations of agency should reproduce more basic characterizations under the appropriate conditions. The subagents characterization, for example, reproduces the utility characterization in cases where the agenty system has no internal state. It all adds up to normality - new theories must be consistent with the old, at least to the extent that the old theories work.
Finally, a note on relative advantage. As a strategy, characterizing real-world agenty systems leans heavily on domain knowledge in areas like biology, machine learning, economics, and neuroscience/psychology, along with the math involved in any agency research. That’s a pretty large Pareto skill frontier, and I’d bet that it’s pretty underexplored. That means there’s a lot of opportunity for new, large contributions to the theory, if you have the domain knowledge or are willing to put in the effort to acquire it.