[adapted with significant technical improvements from https://clarifyingconsequences.substack.com/p/miris-ai-problem-hinges-on-equivocation, which I also wrote and will probably update to be more in line with this at some point] I'm going to meet someone new tomorrow, and I'm wondering how many kids they have. I know their number of kids is a nonnegative 32...
Moral reasoning depends on believing that some things are good and others are bad. Some of these things seem very believable - I am quite firmly convinced that intense pain is bad. Though they seem very compelling, it’s hard to point to strong evidence that proves the judgement. On the...
Over the last few months, the EleutherAI interpretability team pioneered novel, mechanistic methods for detecting anomalous behavior in language models based on Neel Nanda's attribution patching technique. Unfortunately, none of these methods consistently outperform non-mechanistic baselines which look only at activations. We find that we achieve better anomaly detection performance...
Thanks to Justis for proofreading and feedback. This is a simplified followup to my post on "mimics". I think the basic idea there is valuable and still does not appear to be widely appreciated, so I'm trying to explain it again. Introduction: the misjudgment problem Suppose I try to make...
I'm working on an article for the Open Phil AI worldview contest. I am thinking of explaining my interpretation of Nate's take on agency of advanced models (see here). Generally, I just want to explain what Nate argues - that more ambitious tasks require more "agentic" behaviour -...
This was written as part of an entry to the FTX AI competition, but I presume that's off now and I can't justify spending more time on it. Neither my thinking nor my writing is polished here. I believe some of the arguments are original. In many (though not all)...
Thanks to Chris Leong and Nora Belrose for their feedback. This is meant to be part of an entry to the Future Fund AI Worldview Competition, but a later post is intended to address the competition questions head on. In this post, I explore mimics. Mimics are what you get...