Vanessa Kosoy

Director of AI research at ALTER, where I lead a group working on the learning-theoretic agenda for AI alignment. I'm also supported by the LTFF. See also LinkedIn.

E-mail: {first name}@alter.org.il

Comments

This post introduces Timaeus' "Developmental Interpretability" research agenda, which is IMO one of the most interesting extant AI alignment research agendas.

The reason DevInterp is interesting is that it is one of the few AI alignment research agendas that is trying to understand deep learning "head on", while wielding a powerful mathematical tool that seems potentially suitable for the purpose (namely, Singular Learning Theory). Relatedly, it is one of the few agendas that maintains a strong balance of theoretical and empirical research. As such, it might also grow to be a bridge between theoretical and empirical research agendas more broadly (e.g. it might be synergistic with the LTA).

I also want to point out a few potential weaknesses or (minor) reservations I have:

First, DevInterp places phase transitions as its central object of study. While I agree that phase transitions seem interesting, and possibly crucial to understand, I'm not convinced that a broader view wouldn't be better.

Singular Learning Theory (SLT) has the potential to explain generalization in deep learning, phase transitions or no. This in itself seems to be important enough to deserve the central stage. Understanding generalization is crucial, because:

  • We want our alignment protocols to generalize correctly, given the available data, compute and other circumstances, and we need to understand what conditions would guarantee it (or at least prohibit catastrophic generalization failures).
  • If the resulting theory of generalization is in some sense universal, then it might be applicable to specifying a procedure for inferring human values (as human behavior is generated from human values by a learning algorithm with similar generalization properties), or at least formalizing "human values" well enough for theoretical analysis of alignment. 

Hence, compared to the OP, I would put more emphasis on these latter points.

Second, the OP does mention the difference between phase transitions during Stochastic Gradient Descent (SGD) and the phase transitions of Singular Learning Theory, but this deserves a closer look. SLT has IMO two key missing pieces:

  • The first piece is the relation between ideal Bayesian inference (the subject of SLT) and SGD. Ideal Bayesian inference is known to be computationally intractable. Maybe there is an extension of SLT that replaces Bayesian inference with either SGD or a different tractable algorithm. For example, it could be some Markov Chain Monte Carlo (MCMC) algorithm that converges to Bayesian inference in the limit. Maybe there is a natural geometric invariant that controls the MCMC relaxation time, similarly to how the log canonical threshold controls sample complexity (see the formulas recalled after this list).
  • The second missing piece is understanding the special properties of ANN architectures compared to arbitrary singular hypothesis classes. For example, maybe there is some universality property which explains why e.g. transformers (or something similar) are qualitatively "as good as it gets". Alternatively, it could be a relation between the log canonical threshold of specific ANN architectures to other simplicity measures which can be justified on other philosophical grounds.
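
To recall why the log canonical threshold is the relevant invariant (stated loosely, in my notation, and omitting Watanabe's regularity conditions): for a singular model, the Bayesian free energy $F_n$ and the expected Bayesian generalization error are governed asymptotically by the real log canonical threshold $\lambda$ and its multiplicity $m$, roughly as

```latex
% F_n: Bayes free energy, L_n: empirical loss at the optimal parameter set,
% G_n: Bayes generalization error, lambda: real log canonical threshold, m: its multiplicity
F_n \;=\; n L_n \;+\; \lambda \log n \;-\; (m - 1) \log\log n \;+\; O_p(1),
\qquad
\mathbb{E}[G_n] \;=\; \frac{\lambda}{n} \;+\; o\!\left(\frac{1}{n}\right) \quad \text{(realizable case)}.
```

A geometric invariant playing the same role for MCMC relaxation time, if one exists, would be the natural "computational" counterpart of $\lambda$.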

That said, if the above missing pieces were found, SLT would become straightforwardly the theory for understanding deep learning and maybe learning in general.

This post is a collection of claims about acausal trade, some of which I find more compelling and some less. Overall, I think it's a good contribution to the discussion.

Claims that I mostly agree with include:

  • Acausal trade in practice is usually not accomplished by literal simulation (the latter is mostly important as a convenient toy model) but by abstract reasoning.
  • It is likely to be useful to think of the "acausal economy" as a whole, rather than just about each individual trade separately.

Claims that I have some quibbles with include:

  • The claim that there is a strong relation between the prevalent acausal norms and human moral philosophy. I agree that there are likely to be some parallels: both processes are to some degree motivated by articulating mutually beneficial norms. However, human moral philosophy is likely to contain biases specific to humans and to human circumstances on Earth. Conversely, acausal norms are likely to be shaped by metacosmological circumstances that we don't even know about yet. For example, maybe there is some reason why most civilizations in the multiverse really hate logarithmic spirals. In this case, there would be a norm against logarithmic spirals that we are currently completely oblivious to.
  • The claim that the concept of "boundaries" is likely to play a key role in acausal norms. I find this somewhat plausible but far from clear. AFAIK, Critch has so far produced little in the way of compelling mathematical models to support the "boundaries" idea.
  • It seems to be implicit in the post that an acausal-norm-following paperclip-maximizer would be "nice" to humans to some degree. (But Critch warns us that the paperclip-maximizer might easily fail to be acausal-norm-following.) While I grant that it's possible, I think it's far from clear. The usual trad-y argument to be nice to others is so that others are nice to you. However, (i) some agents are a priori less threatened by others and hence find the argument less compelling, and (ii) who exactly the relevant "others" are is unclear. For example, it might be that humans are in some ways not "advanced" enough to be considered. Conversely, it's possible that human treatment of animals has already condemned us to the status of defectors (which can be defected against in turn).
  • The technical notion that logical proofs and Löb/Payor-style arguments are ultimately the right mathematical model of acausal trade. I am very much unconvinced, e.g. because proof search is intractable and also because we don't know how to naturally generalize these arguments far beyond the toy setting of Fair Bots in Prisoner's Dilemma (the standard argument is recalled after this list). On the other hand, I do expect there to exist some mathematical justification of superrationality, just along other lines.
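
For readers unfamiliar with that toy setting, here is the standard Löbian cooperation argument in schematic form (my notation; this is the textbook FairBot construction, not a claim about how the OP would formalize acausal trade):

```latex
% Loeb's theorem: for any sentence P,  if  \vdash \Box P \to P  then  \vdash P.
% FairBot, schematically: it cooperates iff it can prove its opponent cooperates with it:
\mathrm{FB}(X) = C \;\iff\; \vdash\, X(\mathrm{FB}) = C .
% For two FairBots, let  P := (\mathrm{FB}_1(\mathrm{FB}_2) = C) \wedge (\mathrm{FB}_2(\mathrm{FB}_1) = C).
% Each bot's definition makes  \Box P \to P  provable (a proof of P hands each bot
% exactly the proof it is searching for), so Loeb's theorem yields  \vdash P:
% both bots cooperate.
```

The worry in the bullet above is that this argument leans on unbounded proof search and does not obviously extend beyond such carefully constructed programs.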

This post argues that, while it's traditional to call policies trained by RL "agents", there is no good reason for it and the terminology does more harm than good. IMO Turner has a valid point, but he takes it too far.

What is an "agent"? Unfortunately, this question is not discussed in the OP in any detail. There are two closely related informal approaches to defining "agents" that I like, one more axiomatic / black-boxy and the other more algorithmic / white-boxy.

The algorithmic definition is: An agent is a system that can (i) learn models of its environment, (ii) use the learned models to generate plans towards a particular goal, and (iii) execute these plans.

Under this definition, is an RL policy an "agent"? Not necessarily. There is a much stronger case for arguing that the RL algorithm, including the training procedure, is an agent (see the sketch below). Indeed, such an algorithm (i) learns a model of the environment (at least if it's model-based RL: if it's model-free it might still do so implicitly, but it's less clear), (ii) generates a plan (the policy), and (iii) executes the plan (when the policy is run, i.e. at inference/deployment time). Whether the policy in itself is an agent amounts to asking whether the policy is capable of in-context RL (which is far from obvious). Moreover, the case for calling the system an agent is stronger when it learns online and weaker (but not completely gone) when there is a separation into non-overlapping training and deployment phases, as is often done in contemporary systems.
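
As a minimal sketch of this three-part reading (names and structure are mine, purely illustrative; a real system would fill in the stubs with actual learning and planning machinery):

```python
from typing import Any, Protocol


class Agent(Protocol):
    """Algorithmic notion of agency: (i) learn a model, (ii) plan, (iii) execute."""

    def update_model(self, observation: Any) -> None:
        """(i) Learn/refine a model of the environment from experience."""

    def plan(self, goal: Any) -> Any:
        """(ii) Use the learned model to produce a plan (e.g. a policy) for the goal."""

    def act(self, plan: Any) -> Any:
        """(iii) Execute the plan in the environment."""


# On this reading, the full RL training system (model learning / policy search /
# deployment) exercises all three components, whereas a frozen trained policy
# only retains (iii), unless it happens to be capable of in-context RL.
```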

The axiomatic definition is: An agent is a system that effectively pursues a particular goal in an unknown environment. That is, it needs to perform well (as measured by achieving the goal) when placed in a large variety of different environments.
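
One way to make this quantitative (my formalization, not the OP's): fix a goal expressed as a utility $U$, a class of environments $\mathcal{E}$, and a tolerance $\epsilon$, and call a system $\pi$ an agent for $U$ over $\mathcal{E}$ if

```latex
\forall \mu \in \mathcal{E}: \qquad
\mathbb{E}_{\mu,\pi}\left[U\right] \;\geq\; \sup_{\pi'} \mathbb{E}_{\mu,\pi'}\left[U\right] \;-\; \epsilon ,
```

i.e. it comes within $\epsilon$ of optimal performance in every environment in the class, without being told in advance which environment it is facing. The larger and more diverse $\mathcal{E}$ is, the stronger the sense in which the system is agentic.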

With this definition we reach similar conclusions. An online RL system would arguably adapt to its environment and optimize towards achieving the goal (which is maximizing the reward). A trained policy will not necessarily do so: if it was trained in a particular environment, it can become completely ineffective in other environments!

Importantly, even an online RL system can easily fail at agentic-ness, depending on how good its learning algorithm is at dealing with distributional shift, nonrealizability, etc. Nevertheless, the relation between agency and RL is pretty direct, more so than the OP implies.

This post proposes an approach to decision theory in which the notion of "actions" is emergent. Instead of having an ontologically fundamental notion of actions, the agent just has beliefs, and some of them are self-fulfilling prophecies. For example, the agent can discover that "whenever I believe my arm will move up/down, my arm truly moves up/down", and then exploit this fact by moving the arm in the right direction to maximize utility. This works by having a "metabelief" (a mapping from beliefs to beliefs; my terminology, not the OP's) and allowing the agent to choose its belief out of the metabelief's fixed points.
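
In symbols (my notation, which I also reuse below): write $\mathcal{O}$ for the observation space; a belief is an element $\beta \in \Delta(\mathcal{O}^\omega)$, a metabelief is a map $M : \Delta(\mathcal{O}^\omega) \to \Delta(\mathcal{O}^\omega)$, and the agent adopts

```latex
\beta^{*} \;\in\; \operatorname*{arg\,max}_{\beta \,:\, M(\beta) = \beta} \; \mathbb{E}_{\beta}\left[U\right],
```

i.e. among the self-consistent (and hence self-fulfilling) beliefs, it picks one maximizing expected utility.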

The next natural question is then: can we indeed demonstrate that an agent will learn which part of the world it controls, under reasonable conditions? Abram implies that it should be possible if we only allow choice among attractive fixed points. He then bemoans the need for this restriction and tries to use ideas from Active Inference to fix it, with limited success. Personally, I don't understand what's so bad about staying with the attractive fixed points.

Unfortunately, this post avoids spelling out a sequential version of the decision theory, which would be necessary to actually establish any learning-theoretic result. However, I think that I see how it can be done, and it seems to support Abram's claims. Details follow.

Let's suppose that the agent observes two systems, each of which can be in one of two positions. At each moment of time, it observes an element of $\mathcal{O}$, where $\mathcal{O} = \{0,1\} \times \{0,1\}$, with components $o_1$ and $o_2$. The agent believes it can control one of $o_1$ and $o_2$ whereas the other is a fair coin. However, it doesn't know which is which.

In this case, metabeliefs are mappings of type $\Delta(\mathcal{O}^\omega) \to \Delta(\mathcal{O}^\omega)$. Specifically, we have a hypothesis $h_1$ that asserts $o_1$ is controllable, a hypothesis $h_2$ that asserts $o_2$ is controllable, and the overall metabelief is (say) the uniform mixture $M = \tfrac{1}{2} h_1 + \tfrac{1}{2} h_2$.

The hypothesis $h_1$ is defined so that, given the agent's belief, it predicts $o_1$ to follow $r$ applied to that belief while $o_2$ behaves as a fair coin. Here, $r$ is some "motor response function", e.g. the function that outputs whichever value of $o_1$ the belief considers more likely.

Similarly, $h_2$ is defined with the roles of $o_1$ and $o_2$ exchanged: $o_2$ follows the motor response while $o_1$ is a fair coin.

Now, let $\theta$ be an attractive fixed point of $M$ and consider some history $x$. If the statistics of $o_1$ in $x$ seem biased towards the motor response whereas the statistics of $o_2$ in $x$ seem like a fair coin, then the likelihoods will satisfy $\Pr_{h_1}(x) \gg \Pr_{h_2}(x)$, and hence the updated metabelief will be close to $h_1$, and therefore $\theta$ will be close to $h_1$'s self-consistent prediction (since $\theta$ is an attractive fixed point). On the other hand, in the converse situation, the likelihoods will satisfy $\Pr_{h_2}(x) \gg \Pr_{h_1}(x)$ and hence the updated metabelief will be close to $h_2$. Hence, the agent effectively updates on the observed history and will choose some fixed point $\theta$ which controls the available degrees of freedom correctly.

Notice that all of this doesn't work with repelling fixed points. Indeed, if we used a motor response that anti-correlates with the belief (the system does the opposite of whatever the agent expects), then the corresponding hypothesis would have a unique fixed point and there would be nothing to choose.
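
To make the informal learning claim concrete, here is a toy simulation sketch. Everything in it (the near-deterministic 0.99 response model, the explicit "intention" variable standing in for the full fixed-point machinery) is my own illustrative simplification: the point is only that Bayesian updating on a history generated by a controllable $o_1$ and a random $o_2$ concentrates the posterior weight of the mixture $\tfrac{1}{2} h_1 + \tfrac{1}{2} h_2$ on $h_1$.

```python
import math
import random

random.seed(0)


def loglik(history, controlled_idx):
    """Log-likelihood of a history under the hypothesis that component
    `controlled_idx` follows the agent's intention (with a little noise)
    and the other component is a fair coin."""
    ll = 0.0
    for intention, obs in history:
        p_ctrl = 0.99 if obs[controlled_idx] == intention else 0.01
        ll += math.log(p_ctrl) + math.log(0.5)  # controlled part + fair-coin part
    return ll


# Generate a history in which o_1 really does follow the intention and o_2 is random.
history = []
for _ in range(50):
    intention = random.randint(0, 1)
    obs = (intention, random.randint(0, 1))
    history.append((intention, obs))

ll1, ll2 = loglik(history, 0), loglik(history, 1)
# Posterior over {h_1, h_2}, starting from the uniform mixture.
posterior_h1 = 1.0 / (1.0 + math.exp(ll2 - ll1))
print(f"posterior weight on h_1: {posterior_h1:.6f}")  # very close to 1
```

This is only the "update" half of the story; the fixed-point selection (choosing among attractive fixed points of the updated metabelief) sits on top of it.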

I find these ideas quite intriguing and am likely to keep thinking about them!

I feel that coherence arguments, broadly construed, are a reason to be skeptical of such proposals, but debating coherence arguments because of this seems backward. Instead, we should just be discussing your proposal directly. Since I haven't read your proposal yet, I don't have an opinion, but some coherence-inspired questions I would be asking are:

  • Can you define an incomplete-preferences AIXI consistent with this proposal?
  • Is there an incomplete-preferences version of RL regret bound theory consistent with this proposal?
  • What happens when your agent is constructing a new agent? Does the new agent inherit the same incomplete preferences?

This post tries to push back against the role of expected utility theory in AI safety by arguing against various ways to derive expected utility axiomatically. I heard many such arguments before, and IMO they are never especially useful. This post is no exception.

The OP presents the position it argues against as follows (in my paraphrasing): "Sufficiently advanced agents don't play dominated strategies, therefore, because of [theorem], they have to be expected utility maximizers, therefore they have to be goal-directed and [other conclusions]". They then proceed to argue that there is no theorem that can make this argument go through.

I think that this entire framing is attacking a weak man. The real argument for expected utility theory is:

  • In AI safety, we are from the get-go interested in goal-directed systems because (i) we want AIs to achieve goals for us (ii) we are worried about systems with bad goals and (iii) stopping systems with bad goals is also a goal.
  • The next question is then, what is a useful mathematical formalism for studying goal-directed systems.
  • The theorems quoted in the OP are moderate evidence that expected utility has to be part of this formalism, because their assumptions resonate a lot with our intuitions for what "rational goal-directed behavior" is. Yes, of course we can still quibble with the assumptions (like the OP does in some cases), which is why I say "moderate evidence" rather than "completely watertight proof", but given how natural the assumptions are, the evidence is good.
  • More importantly, the theorems are only a small part of the evidence base. A philosophical question is never fully answered by a single theorem. Instead, the evidence base is holistic: looking at the theoretical edifices growing up from expected utility (control theory, learning theory, game theory etc) one becomes progressively more and more convinced that expected utility correctly captures some of the core intuitions behind "goal-directedness".
  • If one does want to present a convincing case against expected utility, quibbling with the assumptions of VNM (recalled after this list) or whatnot is an incredibly weak move. Instead, show us where the entire edifice of existing theory runs aground because of expected utility and how some alternative to expected utility can do better (as an analogy, see how infra-Bayesianism supplants Bayesian decision theory).
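
For concreteness, the flavor of result at stake here, in the VNM case (standard statement, my notation):

```latex
% Von Neumann--Morgenstern: if a preference relation \succeq on lotteries over a
% set of outcomes satisfies completeness, transitivity, continuity and
% independence, then there exists a utility function u, unique up to positive
% affine transformation, such that
p \succeq q
\;\iff\;
\mathbb{E}_{x \sim p}\left[u(x)\right] \;\geq\; \mathbb{E}_{x \sim q}\left[u(x)\right].
```

The argument above is that attacking axioms like these in isolation is far less informative than examining the theories built on top of the representation they deliver.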

In conclusion, there are coherence theorems. But, more important than individual theorems are the "coherence theories".

I kinda agree with the claim, but disagree with its framing. You're imagining that peer pressure is something extraneous to the person's core personality, which they want to resist but usually fail to. Instead, the desire to fit in, to be respected, liked and admired by other people, is one of the core desires that most (virtually all?) people have. It's approximately on the same level as e.g. the desire to avoid pain. So, people don't "succumb to peer pressure", they (unconsciously) choose to prioritize social needs over other considerations.

At the same time, the moral denouncing of groupthink is mostly a self-deception defense against hostile telepaths. With two important caveats:

  • Having "independent thinking" as part of the ethos of a social group is actually beneficial for that group's ability to discover true things. While the members of such a group still feel the desire to be liked by other members, they also have the license to disagree without being shunned for it, and are even rewarded for interesting dissenting opinions.
  • Hyperbolic discounting seems to be real, i.e. human preferences are time-inconsistent. For example, you can be tempted to eat candy when one is placed in front of you, while also taking measures to avoid such temptation in the future (see the toy calculation after this list). Something analogous might apply to peer pressure.
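
A toy calculation showing the kind of preference reversal hyperbolic discounting produces (the numbers are mine and purely illustrative): value a reward $r$ arriving after delay $t$ at $V(r, t) = r / (1 + k t)$, and take $k = 1$, a small reward $r_s = 1$, and a large reward $r_l = 3$ arriving 5 steps later.

```latex
% Viewed from far away (delays 10 and 15): prefer the larger, later reward.
V(1, 10) = \tfrac{1}{11} \approx 0.09 \;<\; V(3, 15) = \tfrac{3}{16} \approx 0.19
% Viewed up close (delays 0 and 5): the immediate temptation wins.
V(1, 0) = 1 \;>\; V(3, 5) = \tfrac{3}{6} = 0.5
```

The ranking flips as the options draw near, which is exactly the time-inconsistency that makes precommitment (throwing out the candy in advance) rational.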

This remains the best overview of the learning-theoretic agenda to-date. As a complementary pedagogic resource, there is now also a series of video lectures.

Since the article was written, there have been several new publications:

In addition, some new developments were briefly summarized in short-forms:

  • A proposed solution for the monotonicity problem in infra-Bayesian physicalism. This is potentially very important since the monotonicity problem was by far the biggest issue with the framework (and as a consequence, with PSI).
  • Multiple developments concerning metacognitive agents (see also recorded talk). This framework seems increasingly important, but an in-depth analysis is still pending.
  • A conjecture about a possible axiomatic characterization of the maximin decision rule in infra-Bayesianism. If true, it would go a long way to allaying any concerns about whether maximin is the "correct" choice.
  • Ambidistributions: a cute new mathematical gadget for formalizing the notion of "control" in infra-Bayesianism.

Meanwhile, active research proceeds along several parallel directions:

  • I'm working towards the realization of the "frugal compositional languages" dream. So far, the problem is still very much open, but I obtained some interesting preliminary results which will appear in an upcoming paper (codename: "ambiguous online learning"). I also realized this direction might have tight connections with categorical systems theory (the latter being a mathematical language for compositionality). An unpublished draft was written by my MATS scholars on the subject of compositional polytope MDPs, hopefully to be completed some time during '25.
  • Diffractor achieved substantial progress in the theory of infra-Bayesian regret bounds, producing an infra-Bayesian generalization of decision-estimation coefficients (the latter being a nearly universal theory of regret bounds in episodic RL). This generalization has important connections to Garrabrant induction (of the flavor studied here), finally sketching a unified picture of these two approaches to "computational uncertainty" (Garrabrant induction and infra-Bayesianism). The results will appear in an upcoming paper.
  • Gergely Szucs is studying the theory of hidden rewards, starting from the realization in this short-form (discovering some interesting combinatorial objects beyond what was described there).

It remains true that there are more shovel-ready open problems than researchers, and hence the number of (competent) researchers is still the bottleneck.

Seems right, but is there a categorical derivation of the Wentworth-Lorell rules? Maybe they can be represented as theorems of the form: given an arbitrary Markov category C, such-and-such identities between string diagrams in C imply (more) identities between string diagrams in C.
