The linked document is my (last-minute) submission for the AI alignment prize. Feedback is appreciated, even though it cannot change my submission entry.

Key LW post: https://www.lesswrong.com/posts/HnS6c5Xm9p9sbm4a8/grasping-slippery-things

I had a longer, more polite comment, but the website ate it, sorry. Long story short, I think this is overall a "bounce" in the sense of the linked article. You give descriptive English labels to semi-formal facts about causal models, and then reason using the labels, not using the facts.

I think this is preliminary work in the sense that it is important for you to think these thoughts, so that you can keep trying to do reductionism and go on to think better thoughts later.

I sort of agree with your criticism: I wish I had had more time to clarify my approach and make it more mathematically precise, but I only decided at the last minute to even try to submit to the alignment competition. So I was scrambling to get down a minimal version of the idea. The call for the prize lists "philosophical" as one possible type of entry, so I kept everything verbal, trying to be just precise enough to point the way to a true formalization.

I do understand the problem of grasping slippery things as pointed out in the linked LessWrong article: any reframing of a problem that just maps a set of words to a different set of words cannot by itself add to understanding, because it adds no moving parts to the model. However, I disagree that directly adding moving parts is the only means of progress in a problem domain. Sometimes the right labeling of concepts appeals to intuition enough to catalyze further progress. The right labeling can also bridge the gap between experts in different fields. From what I can tell, MIRI doesn't even try to bridge the communication gap with useful people like, say, complex systems theorists and control systems theorists (fields which are both absurdly relevant to the task). MIRI just whines that no one else is working on the problem of friendly AI and continues working in their own bubble.

The advantage of discussing "symmetries" and "regularizers" is that *there is a large mathematical body of work on these problems.* Explicitly acknowledging time dynamics and agent ecosystems brings in control theory and complex systems theory. Also, I tried to make the case that "utility" and "reward," basic concepts referred to by researchers, are *themselves* slippery. It's unclear in the traditional framing how a "reward" specifically maps onto causal processes in the world; it's taken as a primitive, a "meaningful number", giving the illusion of precision. Classifying rewards and other agent control problems as symmetries of regularizers potentially allows us to import all of Abstract Algebra to the task of describing moving parts.

I'm not familiar with the use of "regularizer" in your sense in any other context. Could you point out some examples? A naive google search is overwhelmed by the typical machine learning / statistical modeling meaning.

Ah, sorry: the way I used it in the paper, it's my own coinage, meant to evoke the traditional usage. When I say there is a large "mathematical body of work," I mean abstract algebra for symmetry, classical machine learning for the usual meaning of "regularizer," and indirectly the work on complex systems theory, attractor theory, control theory, etc. I created my own meaning of the word "regularizer" because I have a philosophical intuition that the concept in traditional machine learning is generalizable, perhaps by something like a "category of regularizers" (to borrow from category theory) that can be described as energy landscapes that force some processes to approximate others, where these landscapes can be specified according to their symmetries (sort of like how periodic functions can be decomposed into sinusoids via Fourier transforms, or how datasets can be broken into principal components).
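For readers who only know the word from machine learning, here is a minimal sketch of that traditional meaning, not of the paper's generalized one. It assumes a one-weight least-squares fit with an L2 (ridge) penalty; the function name `ridge_slope`, the penalty strength `lam`, and the data are all made up for illustration. The penalty `lam * w**2` acts like a simple energy landscape that pulls the fitted weight toward zero, and it is symmetric under `w -> -w`.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)**2) + lam * w**2 over one weight w.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A stronger penalty shrinks the fitted slope toward zero.
for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lam={lam:6.1f}  w={ridge_slope(xs, ys, lam):.4f}")

# The penalty does not care about the sign of w: flipping the targets
# flips the fitted slope and leaves its magnitude unchanged.
flipped = ridge_slope(xs, [-y for y in ys], 1.0)
print(f"flipped targets, lam=1.0: w={flipped:.4f}")
```

With `lam = 0` this is ordinary least squares; whether the broader "category of regularizers" idea can be built up from penalties like this one is exactly the open hunch described above.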

Epistemic status of these ideas: just the hunch of someone with a B.S. in theoretical math and a penchant for philosophy.

The hope is that someone can read the paper and be like, "Aha! This was just the reframing I needed to make progress on this reinforcement learning problem I was stuck on." But if it's not useful to you, it's not useful. I wouldn't mind knowing, so I can do better in the future, whether or not any of the 6 alignment strategies listed at the end sounded even remotely like useful approaches to the value alignment problem.

Sure, just my quick reactions:

FAI via golden rule: Done right, this would end up looking like Inverse Reinforcement Learning, which we can't make work because it doesn't learn values we would be happy optimizing, only some values that would cause humans to act as they do in the current context. I think there's just no way to avoid the hard work of figuring out, ourselves, a good way for the AI to learn human values. This is definitely something people have thought about in the past and are still thinking about, to try and get it to work.

FAI via multiple competing agents: One agent will probably find a loophole, and then the whole scheme has no effect. If your scheme really works, it should work even better with just one agent.

Whitelisting: Either produces a dumb agent or is too computationally difficult. May require solving the difficult problems in order to generate the whitelist.

Evolution: Will produce AIs that do the equivalent of using condoms - they don't want what evolution wants, it merely correlated in the ancestral environment.

Fragility/robustness: Helps maintain value alignment once you have it, but doesn't help get there in the first place.

Thanks! I knew people had essentially devised these ideas before (and if they had instantly worked we would have solved FAI already), but I think there is something to be gained via a reinterpretation of the ideas in the RRM. For example, if the human value function derives from discoverable symmetries of neural structure and external environment, then we can do the work to discover these and directly impose them in the agent architecture. And I think the statement I just made is not trivially equivalent to telling people “find human rewards and put them in the agent” (which is literally the whole FAI problem all over again). Symmetry is an empirically discoverable property and also a strong constraint for optimization purposes. Under symmetric constraints the agent still needs to learn human values, but may have an easier time of it. Anyway, clearly I’ve not done a great job communicating, and the ideas are all at the intuition stage. Maybe in the future I’ll actually try to prove a reinforcement learning theorem using RRM philosophy.