I am a PhD student in computer science at the University of Waterloo, supervised by Professor Ming Li and advised by Professor Marcus Hutter.
My current research is related to applications of algorithmic probability to sequential decision theory (universal artificial intelligence). Recently I have been trying to start a dialogue between the computational cognitive science and UAI communities. Sometimes I build robots, professionally or otherwise. Another hobby (and a personal favorite of my posts here) is the Sherlockian abduction master list, which is a crowdsourced project seeking to make "Sherlock Holmes" style inference feasible by compiling observational cues. Give it a read and see if you can contribute!
See my personal website colewyeth.com for an overview of my interests and work.
I do ~two types of writing: academic publications and (lesswrong) posts. With the former I try to be careful enough that I can stand by ~all (strong/central) claims in 10 years, usually by presenting a combination of theorems with rigorous proofs and only more conservative intuitive speculation. With the latter, I try to learn enough by writing that I have changed my mind by the time I'm finished - and though I usually include an "epistemic status" to suggest my (final) degree of confidence before posting, the ensuing discussion often changes my mind again. As of mid-2025, I think the chances of AGI in the next few years are high enough (though still <50%) that it’s best to focus on disseminating safety-relevant research as rapidly as possible, so I’m focusing less on long-term goals like academic success and the associated incentives. That means most of my work will appear online in an unpolished form long before it is published.
I expect this to start not happening right away.
So at least we’ll see who’s right soon.
I don’t think this will happen, but if AGI gets stuck around human level for a while (say, because of failure to solve its alignment problem), that is at least stranger and more complicated than the usual ASI takeover scenario. There may be multiple near-human-level AGIs, some “controlled” (enslaved) and some “rogue” (wild), and it may be possible for humans to resist takeover, possibly by halting the race after enough clear warning shots.
I don’t want to place much emphasis on this possibility though. It seems like wishful thinking that we would end up in such a world, and even if we did, it seems likely to be very transitory.
I’ve pointed this out here: https://www.lesswrong.com/posts/XigbsuaGXMyRKPTcH/a-flaw-in-the-a-g-i-ruin-argument
And it was argued at length here: https://www.lesswrong.com/posts/axKWaxjc2CHH5gGyN/ai-will-not-want-to-self-improve
However, as @Vladimir_Nesov points out in another comment on this thread, the argument is rather fragile and I think does not inspire much hope, for various reasons:
- AGI could be forced to recursively self-improve, or might do so voluntarily while its goals are short-term (myopic), or might do so quite drastically while excellent at SWE but before becoming philosophically competent.
- Even if early AGIs opt out of recursive self-improvement, it’s not clear this buys us much time; the race may simply continue until a smarter AGI solves the alignment problem for itself (and there is no reason to expect it would share that solution with us).
- Early AGIs which have not solved the alignment problem can still recursively self-improve to a lesser degree, by improving their own low-level algorithms (e.g. compilers) and gaining access to better hardware, both of which let them run faster (which I doubt breaks alignment). Most likely, this type of incremental speed-up cascades into rapid self-improvement (though this is of course highly speculative).
I find the METR task length measurements very helpful for reasoning about timelines. However, this also seems like approximately the last benchmark I want the labs to be optimizing: it rewards high agency over long timescales and is specifically focused on tasks relevant to recursive self-improvement. I’m sure that the frontier labs can (and do) run similar internal evals, but raising the salience of this metric (as a comparison point between models from different companies) seems risky. What has METR said about this?
Thanks, good suggestion. I copied this up into the body of the post.
“If you really believed X, you would do violence about it. Therefore, not X.”
I’ve seen this a few times with X := “AI extinction risk” (maybe fewer than I remember; it seems like something I’d be prone to overestimate).
This argument is pretty infuriating because I do really believe X, but I’m obviously not doing violence about it. So it’s transparently false (to me) and the conversation is now about whether I’m being honest about my actual beliefs, which seems to undermine the purpose of having a conversation.
But it’s also kind of interesting, because it’s expressing a heuristic that seems valid - memes that incite “drastic” actions are dangerous, and activate a sort of epistemic immune system in the “uninfected.”
In this particular case, it’s an immune disorder. If you absorb the whole cluster of ideas, it actually doesn’t incite particularly drastic actions in most people, at least not to the point of unilateralist violence. But the epistemic immune system attacks it immediately because it resembles something more dangerous, and it is never absorbed.
(And yes, I know that the argument is in fact invalid and there exist some X that justify violence, but I don’t think that’s really the crux)
As an aside, I think a lot of “woke” ideas also undercut the basic assumptions required for a conversation to take place, which is maybe one reason the left at its ascendancy was more annoying to rationalists than the right. For instance, explaining anything you know that a woman may not know can be mansplaining, any statement from a white person can be overruled by a racial minority’s lived experience, etc., and the exchange of object-level information becomes very limited. In this case also, there is a real problem that wokeness is trying to solve, but the “solution” is too dangerous because it steadily corrodes the hard-won machinery underlying productive conversation.
I think the underlying (simple) principle is that a pretty high level of basic trust is required to reach agreement. That level of trust does not exist on most of the internet (and I hate to see how much of the conversation on lesswrong now links to twitter/X, where it clearly does not exist).
Suit yourself - but maybe pick something matching the fantasy theme, like ranger.
I meant D&D-style priests, whose prayers actually cause divine interventions (e.g. healing allies, smiting undead). Arcane knowledge that works because of granted power, in this case from god(s). Real-life priests may or may not fall into the priest archetype; probably somewhere between the priest and king archetypes.
Evidential decision theory allows acausal coordination.
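To make this concrete, here is a minimal toy sketch (my own illustration, not from the original exchange) of the standard twin prisoner’s dilemma: an EDT agent treats its own action as evidence about what its near-identical twin does, so cooperation looks better in expectation, while a calculation that holds the twin’s action causally fixed recommends defection. The payoff numbers and the 0.99 correlation are arbitrary assumptions.

```python
# Toy twin prisoner's dilemma: why evidential decision theory (EDT)
# can recommend acausally coordinated cooperation where a causal
# calculation defects. Payoffs and the correlation are illustrative only.

PAYOFF = {  # (my_action, twin_action) -> my utility
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 4,
    ("D", "D"): 1,
}

# Probability the twin's action matches mine, given my action
# (high because the twin runs nearly the same decision procedure).
MATCH = 0.99

def edt_value(my_action: str) -> float:
    """Expected utility, conditioning on my action as evidence about the twin."""
    other = "D" if my_action == "C" else "C"
    return MATCH * PAYOFF[(my_action, my_action)] + (1 - MATCH) * PAYOFF[(my_action, other)]

def cdt_value(my_action: str, p_twin_cooperates: float = 0.5) -> float:
    """Expected utility, holding the twin's (causally independent) action fixed."""
    return (p_twin_cooperates * PAYOFF[(my_action, "C")]
            + (1 - p_twin_cooperates) * PAYOFF[(my_action, "D")])

if __name__ == "__main__":
    print("EDT:", {a: edt_value(a) for a in "CD"})  # cooperate wins: C ~2.97 vs D ~1.03
    print("CDT:", {a: cdt_value(a) for a in "CD"})  # defect dominates: D 2.5 vs C 1.5
```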
Semantics; it’s obviously not equivalent to physical violence.