previously: My decomposition of the alignment problem

A simple model of meta/continual learning

In the framework of Solomonoff induction, we observe an infinite stream of bits and try to predict the next bit by finding the shortest hypothesis which reproduces our observations (some caveats here). When we receive an additional bit of observation, we can in principle rule out an infinite number of hypotheses (namely, all programs which did not predict that bit), which creates an opportunity to speed up induction on future observations. Specifically, as we search for the next shortest program which predicts our newest bit of observation, we can learn to skip over the programs that have already been falsified by past observations. Learning how to skip over falsified programs costs time and computation upfront, but it can pay dividends in computational efficiency for future induction.
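As a rough illustration of the "skip over falsified programs" idea, here is a minimal Python sketch. It is not real Solomonoff induction: the hypothesis space is a toy stand-in (finite bit patterns repeated forever, enumerated shortest-first), and the names `programs`, `run`, and `IncrementalInducer` are made up for this example. The point it shows is only that, because a falsified program stays falsified, the search can resume from where it last stopped instead of restarting from the shortest program after each new bit.

```python
from itertools import count, product

def programs():
    """Enumerate toy 'programs' shortest-first: bit patterns repeated forever.
    (Assumption: a stand-in for enumerating programs of a universal machine.)"""
    for n in count(1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def run(program, length):
    """Output of a toy program: its pattern repeated, truncated to `length` bits."""
    reps = length // len(program) + 1
    return (program * reps)[:length]

class IncrementalInducer:
    """Keeps the shortest pattern consistent with all observations so far."""

    def __init__(self):
        self._enum = programs()
        self.current = next(self._enum)
        self.observed = ""
        self.checked = 1  # how many programs have been examined in total

    def observe(self, bit):
        self.observed += bit
        # Everything before `current` in the enumeration was already falsified
        # by an earlier prefix, so it is skipped: the search only moves forward.
        while run(self.current, len(self.observed)) != self.observed:
            self.current = next(self._enum)
            self.checked += 1

    def predict(self):
        """Next bit according to the current shortest consistent pattern."""
        return run(self.current, len(self.observed) + 1)[-1]

inducer = IncrementalInducer()
for b in "011011":
    inducer.observe(b)
print(inducer.current, inducer.predict(), inducer.checked)
```

In this toy setting the "skipping" is trivial (a pointer into a fixed enumeration). The harder version discussed in this post is learning a representation that makes such skipping cheap when the enumeration order itself should change.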

This is my mental model for how agents can "learn how to learn efficiently": an agent who has received more observations can usually adapt to new situations more quickly, because more incorrect hypotheses have already been ruled out, leaving a narrower set of remaining hypotheses to choose from.

More generally, an important question is: given that the space of remaining hypotheses is constantly shrinking as we receive new observations, what sorts of data structures for representing hypotheses should we use to exploit that? How should we represent programs if we don't just want to execute them, but also potentially modify them into other plausible hypotheses? If a world model is selected based on its ability to quickly adapt to new environments, what is the type signature of that world model?

Quick thoughts

  • Incremental modification: In Solomonoff induction, the next shortest program which predicts the next bit of observation might look nothing like the current shortest program that reproduces the existing observations. However, modifying and augmenting the current program seems much more efficient than searching for a new program from scratch, and it seems much more similar to how animals or humans update their knowledge in practice. Is there a way to structure programs that allows us to learn by incrementally modifying our existing hypothesis? Can we do this without sacrificing the expressivity of our hypothesis space?
  • Modularity: A modular program can be broken down into loosely coupled components, where each component influences only a few other components and leaves most other components invariant at any given time. This property can help with efficient learning because when a modular program encounters a prediction error, only a small part of it could be responsible for that error, which means we only need to modify a small part of our program to accommodate each new observation.
  • Compression: If we picture Solomonoff induction as enumerating bitstrings as programs from shorter to longer ones, then one way to "skip over falsified hypotheses" is to enumerate bitstrings under a compressed encoding which ignores falsified programs, where shorter bitstrings correspond to likelier hypotheses that have not been ruled out. Unfortunately, learning this encoding induces another induction problem, but we can still reap the benefits insofar as we can efficiently find a generalizable approximation of the encoding.
  • Closing the loop: Solomonoff induction can be framed as compression over the space of observations, while approximating the compressed encoding is essentially compression over program space. We can continue this recursion by approximating a compressed encoding over the space of encodings (which would allow us to update our encodings based on observations more efficiently), then approximating another compressed encoding over the space of those encodings, and so on. This is one picture of how we can perform meta-learning at all levels and learn meta-patterns at increasing levels of abstraction.
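The modularity point above can be made concrete with a deliberately extreme toy, sketched here in Python. The assumptions are strong and purely illustrative: components are fully decoupled lookup tables (`Component` and `ModularModel` are hypothetical names), each responsible for exactly one output coordinate. Under those assumptions, a prediction error can be traced to the component that produced it, and only that component is edited.

```python
class Component:
    """One loosely coupled part: predicts a single output from a single input."""

    def __init__(self):
        self.table = {}   # learned input -> output mapping
        self.edits = 0    # how many times this component was modified

    def predict(self, x):
        return self.table.get(x, 0)  # default guess is 0 (arbitrary choice)

    def repair(self, x, y):
        self.table[x] = y
        self.edits += 1


class ModularModel:
    """A program made of independent components, one per output coordinate."""

    def __init__(self, n):
        self.parts = [Component() for _ in range(n)]

    def predict(self, xs):
        return [p.predict(x) for p, x in zip(self.parts, xs)]

    def update(self, xs, ys):
        """Repair only the components whose own output was wrong.
        Returns the indices of the components that were modified."""
        touched = []
        for i, (p, x, y) in enumerate(zip(self.parts, xs, ys)):
            if p.predict(x) != y:
                p.repair(x, y)
                touched.append(i)
        return touched


model = ModularModel(3)
print(model.update([1, 2, 3], [0, 5, 0]))  # only component 1 mispredicted
print(model.predict([1, 2, 3]))
```

Real hypotheses will not decompose this cleanly. The sketch only shows why sparse coupling keeps the cost of each update small.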

Why this might be relevant for alignment

Transformative AI will often need to modify its ontology in order to accommodate new observations, which means that if we want to translate our preferences over real-world objects into the AI's world model, we need to be able to stably "point" to real-world objects despite ontology shifts. If efficient learning relies on specific data structures for representing hypotheses, these structures may reveal properties that remain invariant under ontology shifts. By identifying these invariant properties, we can potentially create robust ways to maintain our preferences within the AI's evolving world model.

Furthermore, insofar as humans utilize a similar data structure to represent their world models, this could provide insights into how our actual preferences remain consistent despite ontology shifts, offering a potential blueprint for replicating this process in AI.

4 comments

Any ideas?

Yes, I plan to write a sequence about it some time in the future, but here are some rough high-level sketches:

  • Basic assumptions: Modularity implies that the program can be broken down into loosely coupled components. For now I'll just assume that each component has some "class definition" which specifies how it interacts with other components; that "class definitions" can be reused (i.e. we can instantiate multiple components of the same class); and that each component can aggregate info from other components, while the info it stores can be used by other components.
  • Expressive modularity: A problem with modularity is that it cuts off information flow between certain components, but before we learn about the world we don't know which components are actually independent, and the modularity of the environment might change over time, so we need to account for that.
    • As a basic framework, we can think of each component as having transformer-style attention values over other components. Modularity means that we want the "attention values" (mutual info) to be as sparse as possible.
    • Expressivity means that those "attention values" should be context-dependent (they are functions of aggregate information from other components).
    • A consequence of this is that we can have variables that encode the modularity structure of the environment, which influence the attention values (mutual info) of other variables.
      • One example is the Eulerian vs Lagrangian description of fluid flow: the Eulerian description has a fixed modularity structure because each region of space has a fixed Markov blanket, but the Lagrangian description has a dynamic modularity structure because "which particles are directly influenced by which other particles" depends on the positions of the particles, which change over time. We want our program to be able to accommodate both types of descriptions.
    • We can get the equivalent of "function calls" by having attention values over "class definitions", so that components can instantiate computations of other components if they need to. This is somewhat similar to the idea of lazy world-modelling.
  • Components that generalize over other components: Given modularity, the main way that we can augment our program to accommodate new observations is to add more components (or tweak existing components). This means that the main way to learn efficiently is to structure our current program such that we can accommodate new observations with as few additional components as possible.
    • Since our program is made out of components, this means we want our existing components to adapt to new components in a generalizable way.
    • Concretely, if we think of each "component" as a causal node, then each causal node should define a mapping from any other causal node to the causal edge between them. This basically allows each causal node to "generalize" over other causal nodes so that it can use information from them in the right ways.
  • Closing the loop: On top of that, we can use a part of our program to implement a compressed encoding of additional components (so that components that are more likely will be higher in the search ordering). Implementing the compressed encoding itself requires additional components, which changes the distribution of additional components, and we can augment the compressed encoding to account for that (but that introduces a further change in distribution, and so on...).
  • Relevance to alignment (highly speculative): Accommodating new observations by adding new components while keeping existing structures might allow us to more easily preserve a particular ontology, so that even when the AI augments it to accommodate new observations, we can still map back to the original ontology.
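To make the Eulerian vs Lagrangian contrast above a bit more tangible, here is a small Python sketch. It makes simplifying assumptions that are not part of the ideas above: space is one-dimensional, "attention" is hard (each component attends to a fixed number of others, rather than having soft attention values), and the function names `grid_neighbors` and `particle_neighbors` are invented for this example. It shows a fixed sparsity pattern (grid cells) next to a context-dependent one (particles whose neighbors depend on current positions).

```python
def grid_neighbors(n):
    """Eulerian-style structure: cell i is always coupled to cells i-1 and i+1,
    regardless of the state stored in the cells."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def particle_neighbors(positions, k=1):
    """Lagrangian-style structure: each particle attends only to its k nearest
    other particles, so the sparsity pattern is a function of the current state."""
    neighbors = {}
    for i, p in enumerate(positions):
        # Sort the other particles by distance (ties broken by index).
        others = sorted((abs(p - q), j) for j, q in enumerate(positions) if j != i)
        neighbors[i] = [j for _, j in others[:k]]
    return neighbors


print(grid_neighbors(3))                      # same answer at every time step
print(particle_neighbors([0.0, 1.0, 5.0]))    # particle 1 is nearest to particle 0
print(particle_neighbors([0.0, 4.5, 5.0]))    # after moving, it is nearest to particle 2
```

In both cases each component is coupled to only a few others. The difference is whether the coupling pattern is a constant or is itself computed from information held by other components.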

Note: I haven't yet found the best framing for these ideas, but hopefully I'll come back with a better presentation at some point in the future.

I think that this would make a very nice sequence, and despite all my discussion with you, I'd absolutely like to see this sequence carried out.

Thanks! :)