ARC has published a report on Eliciting Latent Knowledge, an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.
The core difficulty we discuss is learning how to map between an AI's model of the world and a human's model. This is closely related to ontology identification (and other similar formulations of the problem). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.
The report is available here as a Google document. If you're excited about this research, we're hiring!
Q&A
We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we’re doing and how we evaluate progress in enough detail that they could start to do it themselves.
Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones, blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.
Here's an approach I just thought of, building on scottviteri's comment. Forgive me if there turns out to be nothing new here.
Supposing that the machine and the human are working with the same observation space ($O := \text{CameraState}$) and action space ($A := \text{Action}$), then the human's model $H : S_H \to A \to P(O \times S_H)$ and the machine's model $M : S_M \to A \to P(O \times S_M)$ are both coalgebras of the endofunctor $F := \lambda X.\ A \to P(O \times X)$, therefore both have a canonical morphism into the terminal coalgebra of $F$, $X \cong FX$ (assuming that such an $X$ exists in the ambient category). That is, we can map $S_H \to X$ and $S_M \to X$. Then, if we can define a distance function on $X$ with type $d_X : X \times X \to \mathbb{R}_{\ge 0}$, we can use these maps to define distances between human states and machine states, $d : S_H \times S_M \to \mathbb{R}_{\ge 0}$.
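To make the shared signature concrete, here's a minimal Python sketch. The names (`Coalgebra`, `Dist`) are my own, and I'm assuming finite-support distributions represented as dicts:

```python
from typing import Callable, Dict, Hashable, Tuple

# Shared observation and action spaces for both models.
Obs = Hashable      # O := CameraState
Action = Hashable   # A := Action

# A finite-support distribution over (observation, next state) pairs.
Dist = Dict[Tuple[Obs, Hashable], float]

# A coalgebra of F(X) = A -> P(O x X) over some state space: a function
# from (state, action) to a distribution over (observation, next state).
Coalgebra = Callable[[Hashable, Action], Dist]

# Both H : S_H -> A -> P(O x S_H) and M : S_M -> A -> P(O x S_M) inhabit
# this one signature; that shared shape is what gives each a canonical
# map into the terminal coalgebra, where their states can be compared.
```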
How can we make use of a distance function? Basically, we can use it to define a kernel (e.g. $K(x, y) = \exp(-\beta\, d_X(x, y))$), then use kernel regression to predict the utility of states in $S_M$ by averaging "nearby" states in $S_H$, and finally (and crucially) estimate the generalization error, so that states from $S_M$ that aren't really near anything in $S_H$ get big warning flags (and/or utility penalties for being outside a trust region).
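A sketch of what that could look like, using Nadaraya-Watson kernel regression with total kernel mass as a crude stand-in for the generalization-error check (the threshold `min_total_weight` is an arbitrary placeholder):

```python
import numpy as np

def estimate_utility(s_M, human_states, human_utilities, d, beta=1.0,
                     min_total_weight=1e-3):
    """Predict the utility of a machine state s_M from the utilities of
    nearby human states, flagging s_M when it is far from all human data."""
    # Kernel K(x, y) = exp(-beta * d(x, y)) built from the distance function.
    w = np.array([np.exp(-beta * d(s_H, s_M)) for s_H in human_states])
    total = w.sum()
    # Crude generalization check: if no human state is close, total kernel
    # mass is tiny and the estimate should not be trusted (warning flag
    # and/or utility penalty for leaving the trust region).
    out_of_distribution = total < min_total_weight
    utility = float(w @ np.asarray(human_utilities) / total) if total > 0 else 0.0
    return utility, out_of_distribution
```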
How do we get such a distance function? One way is to use $\mathbf{CMet}$ (the category of complete metric spaces) as the ambient category, and instantiate $P$ as the Kantorovich monad. Crank-turning yields the formula
$$d_X(s_H, s_M) = \sup_{a : A}\ \sup_{U : O \times X \rightarrowtail \mathbb{R}} \left| \mathbb{E}_{o, s'_H \sim H(s_H)(a)}\, U(o, s'_H) - \mathbb{E}_{o, s'_M \sim M(s_M)(a)}\, U(o, s'_M) \right|$$
where $U$ is constrained to be a non-expansive map, i.e., it is subject to the condition $|U(o_1, s_1) - U(o_2, s_2)| \le \max\{d_O(o_1, o_2), d_X(s_1, s_2)\}$. If $O$ is discrete, I think this is maybe equivalent to an adversarial game where the adversary chooses, for every possible $s_H$ and $s_M$, a partition of $O$ and a next action, and optimizes the probability that sampled predictions from $H$ and $M$ will eventually predict observations on opposite sides of the partition. This distance function is canonical, but in some sense seems too strict: if $M$ knows more about the world than $H$, then of course the adversary will be able to find an action policy that eventually leads the state into some region that $M$ can confidently predict with $p \approx 1$ while $H$ finds it very unlikely ($p \lll 1$). In other words, even if two states are basically concordant, this distance function will consider them maximally distant if there exists any policy that eventually leads to a maximal breakdown of bisimulation. (Both the canonical character and the too-strict character are in common with $L^\infty$ metrics.)
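As a sanity check on the partition-game reading: choosing $U$ to depend only on the observation gives a one-step lower bound on $d_X$, which for a discrete $O$ is just the total variation distance between the two models' predicted observation distributions, i.e. the value of the adversary's best single partition of $O$. A rough sketch (distributions as finite dicts; the function name and input convention are mine):

```python
def one_step_lower_bound(obs_dists_H, obs_dists_M):
    """Lower-bound d_X(s_H, s_M) by restricting U to depend only on the
    observation: with the discrete metric on O, the sup over such
    non-expansive U is the total variation distance between the one-step
    observation distributions, maximized over the first action.

    obs_dists_H, obs_dists_M: dicts mapping action -> (dict obs -> prob),
    the two models' predicted observation distributions from s_H and s_M.
    """
    best = 0.0
    for a in obs_dists_H:
        p, q = obs_dists_H[a], obs_dists_M[a]
        support = set(p) | set(q)
        # Total variation = the adversary's best partition of O.
        tv = 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in support)
        best = max(best, tv)
    return best
```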
Inspired by this kind of corecursion but seeking more flexibility, let's consider the induced metric on the type $X \times X \to \mathbb{R}_{\ge 0}$ itself, namely the sup norm $d_{d_X}(d_1, d_2) := \sup_{x, y : X} |d_1(x, y) - d_2(x, y)|$, then build a contraction map on that space and apply the Banach fixed-point theorem to pick out a well-defined $d_X$. For example,
$$T(d_X)(x_H, x_M) := \sup_{a : A} \Big( d_{PO}\big(\pi_0(x_H(a)), \pi_0(x_M(a))\big) + \gamma \cdot \mathbb{E}_{x'_H \sim \pi_1(x_H(a));\; x'_M \sim \pi_1(x_M(a))}\, d_X(x'_H, x'_M) \Big)$$
We are now firmly in Abstract Dynamic Programming territory. The distance between two states is the maximum score achievable by an adversary playing an MDP whose state space is the product $S_H \times S_M$, whose initial state is the pair $(s_H, s_M)$ of states being compared, whose one-stage reward is the divergence between the two models' predictions about observations, whose dynamics are just the $H$ and $M$ dynamics evolving separately (but fed identical actions), and with exponential discounting.
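For finite state spaces, the fixed point can be computed by naively iterating $T$ from $d = 0$, since $T$ is a $\gamma$-contraction in the sup norm. A sketch under the dict conventions above (models return joint distributions over `(obs, next_state)` pairs; all names are mine):

```python
import numpy as np

def marginal(joint, axis):
    """Project a joint dict over (obs, next_state) pairs onto one coordinate
    (axis 0 is pi_0, the observation marginal; axis 1 is pi_1)."""
    out = {}
    for key, prob in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + prob
    return out

def fixed_point_distance(H, M, states_H, states_M, actions, d_PO,
                         gamma=0.9, tol=1e-8):
    """Compute d_X by iterating T; since T is a gamma-contraction in the
    sup norm, Banach guarantees convergence to the unique fixed point."""
    idx_H = {s: i for i, s in enumerate(states_H)}
    idx_M = {s: j for j, s in enumerate(states_M)}
    d = np.zeros((len(states_H), len(states_M)))
    while True:
        d_next = np.zeros_like(d)
        for s_H, i in idx_H.items():
            for s_M, j in idx_M.items():
                scores = []
                for a in actions:
                    joint_H, joint_M = H(s_H, a), M(s_M, a)
                    # One-stage reward: divergence of observation marginals.
                    reward = d_PO(marginal(joint_H, 0), marginal(joint_M, 0))
                    # Expected continuation distance under product dynamics.
                    nxt_H, nxt_M = marginal(joint_H, 1), marginal(joint_M, 1)
                    cont = sum(p * q * d[idx_H[sp], idx_M[sq]]
                               for sp, p in nxt_H.items()
                               for sq, q in nxt_M.items())
                    scores.append(reward + gamma * cont)
                d_next[i, j] = max(scores)
        if np.abs(d_next - d).max() < tol:
            return d_next
        d = d_next
```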
The divergence $d_{PO}$ is a free parameter here; it has to be bounded, but it doesn't have to be a metric. It could be attainable utility regret, or KL divergence, or Jensen-Shannon divergence, or Bhattacharyya distance, etc. (with truncation or softmax to keep them bounded); there is lots of potential for experimentation here.
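For instance, Jensen-Shannon divergence is bounded by $\log 2$ out of the box, so it needs no truncation; a possible $d_{PO}$:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two finitely-supported
    distributions given as dicts mapping outcome -> probability.
    Bounded above by log 2, so it qualifies as a bounded d_PO."""
    support = list(set(p) | set(q))
    P = np.array([p.get(o, 0.0) for o in support])
    Q = np.array([q.get(o, 0.0) for o in support])
    M = 0.5 * (P + Q)

    def kl(a, b):
        mask = a > 0  # 0 * log 0 = 0 by convention
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)
```

Plugging `js_divergence` in as `d_PO` in the iteration above keeps the one-stage rewards bounded by $\log 2$, so the resulting $d_X$ is bounded by $\log 2 / (1 - \gamma)$.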