I am a PhD student in computer science at the University of Waterloo, supervised by Professor Ming Li and advised by Professor Marcus Hutter.
My current research is related to applications of algorithmic probability to sequential decision theory (universal artificial intelligence). Recently I have been trying to start a dialogue between the computational cognitive science and UAI communities. Sometimes I build robots, professionally or otherwise. Another hobby (and a personal favorite of my posts here) is the Sherlockian abduction master list, which is a crowdsourced project seeking to make "Sherlock Holmes" style inference feasible by compiling observational cues. Give it a read and see if you can contribute!
See my personal website colewyeth.com for an overview of my interests and work.
I do ~two types of writing: academic publications and (LessWrong) posts. With the former I try to be careful enough that I can stand by ~all (strong/central) claims in 10 years, usually by presenting a combination of theorems with rigorous proofs and only more conservative intuitive speculation. With the latter, I try to learn enough by writing that I have changed my mind by the time I'm finished - and though I usually include an "epistemic status" to suggest my (final) degree of confidence before posting, the ensuing discussion often changes my mind again. As of mid-2025, I think that the chances of AGI in the next few years are high enough (though still <50%) that it's best to focus on disseminating safety-relevant research as rapidly as possible, so I'm focusing less on long-term goals like academic success and the associated incentives. That means most of my work will appear online in an unpolished form long before it is published.
I think 4 is basically right, though human values aren't just fuzzy; they're also quite complex, perhaps on the order of the complexity of the human's mind, meaning you pretty much have to execute the human's mind to evaluate their values exactly.
Some people, like very hardcore preference utilitarians, have values dominated by a term much simpler than their minds’. However, even those people usually have somewhat self-referential preferences in that they care at least a bit extra about themselves and those close to them, and this kind of self-reference drastically increases the complexity of values if you want to include it.
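To pin down the informal claim in Kolmogorov-complexity terms (my notation here, just an illustration of the argument above, not anything from the surrounding discussion): write $M$ for a full description of a person's mind and $V$ for a description of their values. Since the values can be computed by running the mind,

$$K(V) \le K(M) + O(1),$$

and the claim is that for most people there is no dramatically shorter description, so $K(V) \approx K(M)$. The hardcore preference utilitarian is the exception, with $K(V) \ll K(M)$; but adding "weight myself and those close to me extra" requires picking that particular person out of the world, which pushes the complexity of a self-contained description of the values back up toward $K(M)$ (or requires a pointer into the environment instead).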
For instance, I value my current mind being able to do certain things in the future (learn stuff, prove theorems, seed planets with life) somewhat more than I would value that for a typical human’s mind (though I am fairly altruistic). I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration). Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away I suppose).
I'm not sure that even an individual's values always settle down into a unique equilibrium; I would guess this depends on their environment.
Unrelatedly, I am still not convinced we live in a mathematical multiverse, or even necessarily a mathematical universe. (Finding out we lived in a mathematical universe would make a mathematical multiverse seem very likely, for the ensemble reasons we have discussed before.)
How... else... do you expect to generalize human values out of distribution, except to have humans do it?
I think an upload does generalize human values out of distribution. After all, humans generalize our values out of distribution. A perfect upload acts like a human. Insofar as it generalizes improperly, it’s because it was not a faithful upload, which is a problem with the uploading process, not the idea of using an upload to generalize human values.
I think there's a lot of truth to this - modern LLMs are a kind of competence multiplier, where some competence values are negative (perhaps a competence exponentiator?).
I find that I can extract value from LLMs only if I’m asking about something that I almost already know. That way I can judge whether an answer is getting at the wrong thing, assess the relevance of citations, and verify a correct answer rapidly and highly robustly if it is offered (which is important because typically a series of convincing non-answers or wrong answers comes first).
Though LLMs seem to be getting more useful in the best case, they also seem to be getting more dangerous in the worst case, so I am not sure whether this dynamic will soften or sharpen over time.
Simple argument that imitation learning is the easiest route to alignment:
Any AI aligned to you needs to represent you in enough detail to fully understand your preferences / values, AND maintain a stable pointer to that representation of value (that is, it needs to care). The second part is surprisingly hard to get exactly right.
Imitation learning basically just does the first part: it builds a model of you, which automatically contains your values, and running that model optimizes your values in the same way that you do. This has to be done faithfully for the approach to work safely - the model has to continue acting like you would in new circumstances (out of distribution) and when it runs for a long time - which is nontrivial.
That is, faithful imitation learning is kind of alignment-complete: it solves alignment, and any other solution to alignment kind of has to solve imitation learning implicitly, by building a model of your preferences.
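To make "builds a model of you" concrete, here is a minimal behavioral-cloning sketch in PyTorch. This is only the naive version of the idea, not the faithful imitation learner the argument actually requires, and all of the names (ClonedPolicy, train_clone, the demo format) are illustrative assumptions rather than anything from a real system.

```python
# Naive behavioral cloning: fit a policy to recorded human (observation, action)
# pairs. This is where the "model of you" comes from in the simplest version of
# the idea; the hard part (faithful out-of-distribution generalization) is
# exactly what this sketch does NOT address.
import torch
import torch.nn as nn

class ClonedPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),  # logits over the human's possible actions
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def train_clone(policy: ClonedPolicy, demos, epochs: int = 10, lr: float = 1e-3):
    """demos: iterable of (obs, action) tensor batches recorded from the human."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for obs, action in demos:
            opt.zero_grad()
            loss = loss_fn(policy(obs), action)  # match the human's choices
            loss.backward()
            opt.step()
    return policy
```

Everything above is the easy part; the open problem is making the learned policy a model in the strict sense - a simulation that keeps acting like the person out of distribution and over long horizons - which is what the rest of this argument is about.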
I think people (other than @michaelcohen) mostly haven’t realized this for two reasons: the idea doesn’t sound sophisticated enough, and it’s easy to point at problems with naive implementations.
Imitation learning is not a new idea, so you don't sound very smart or informed by suggesting it as a solution.
And implementing it faithfully does face barriers! You have to solve "inner optimization problems," which basically come down to the model generalizing properly, even under continual/lifelong learning. In other words, the learned model should be a model in the strict sense of simulation (perhaps at some appropriate level of abstraction). This really is hard! And I think people assume that anyone suggesting imitation learning can be safe doesn't appreciate how hard it is. But I think it's hard in the somewhat familiar sense that you need to solve a lot of tough engineering and theory problems - and a bit of philosophy. However, it's not as intractably hard as solving all of decision theory etc. I believe that with a careful approach, the capabilities of an imitation learner do not generalize further than its alignment, so it is possible to get feedback from reality and iterate - because the model's agency comes from imitating an agent which is aligned (and, with care, is NOT emergent as an inner optimizer).
Also, you still need to work out how to let the learned model (hopefully a faithful simulation of a human) recursively self-improve safely. But notice how much progress has already been made at this point! If you've got a faithful simulation of a human, you're in a very different and much better situation. You can run that simulation faster as technology advances, meaning you aren't immediately left in the dust by LLM scaling - you can have justified trust in an effectively superhuman alignment researcher. And recursive self-improvement is probably easier than alignment from scratch.
I think we need to take this strategy a lot more seriously.
Here’s a longer sketch of what this should look like: https://www.lesswrong.com/posts/AzFxTMFfkTt4mhMKt/alignment-as-uploading-with-more-steps
A simulation of all humans does not automatically have "human values." It doesn't really have values at all. You have to extract consensus values somehow, and in order to do that, you need to specify something like a voting mechanism. But humans don't form values in a vacuum, so such a simulation probably also needs specified interaction protocols and governance protocols, and whatever you end up with seems quite path-dependent and arbitrary.
Why not just align AIs to each individual human and let them work it out?
I don't know that we have much expertise on this sort of thing - we're mostly worried about X-risk, which it doesn't really make sense to assign legal liability for.
Eh, the me of 4 or 5 wanted to play with swords; I still want to play with swords. I guess I'm less interested in toys, but I think that was mostly because my options were restricted (the things I like to do now were not possible then).
Anyway, I think this is the wrong framing. Our minds develop into maturity from child to adult; after that they're a lot more stable. I'm not even sure children are complete agents.
I think I agree with your take on this, Abram.
The most extreme version of an AI not being self-defensive seems like the Greg Egan "Permutation City" story, where shutting down a simulation doesn't even harm anyone inside - that computation just picks some other substrate "out of the dust."
By the way, this post dovetails interestingly with my latest on alignment as uploading with more steps.
I expect this to start not happening right away.
So at least we’ll see who’s right soon.