David Johnston

MIRI's "The Problem" hinges on diagnostic dilution

[adapted with significant technical improvements from https://clarifyingconsequences.substack.com/p/miris-ai-problem-hinges-on-equivocation, which I also wrote and will probably update to be more in line with this at some point] I'm going to meet someone new tomorrow, and I'm wondering how many kids they have. I know their number of kids is a nonnegative 32...

Aug 13, 2025 · 21

A brief theory of why we think things are good or bad

Moral reasoning depends on believing that some things are good and others are bad. Some of these things seem very believable - I am quite firmly convinced that intense pain is bad. Though they seem very compelling, it’s hard to point to strong evidence that proves the judgement. On the...

Oct 20, 2024 · 7

Mechanistic Anomaly Detection Research Update

Over the last few months, the EleutherAI interpretability team pioneered novel, mechanistic methods for detecting anomalous behavior in language models based on Neel Nanda's attribution patching technique. Unfortunately, none of these methods consistently outperform non-mechanistic baselines which look only at activations. We find that we achieve better anomaly detection performance...

Aug 6, 2024 · 11

Opinion merging for AI control

Thanks to Justis for proofreading and feedback. This is a simplified followup to my post on "mimics". I think the basic idea there is valuable and still does not appear to be widely appreciated, so I'm trying to explain it again. Introduction: the misjudgment problem. Suppose I try to make...

May 4, 2023 · 6

Is it worth avoiding detailed discussions of expectations about agency levels of powerful AIs?

I'm working on an article for the Open Phil AI worldview contest. I am thinking of explaining my interpretation of Nate's take on the agency of advanced models (see here). Generally, I just want to explain what Nate argues - that more ambitious tasks require more "agentic" behaviour -...

Mar 16, 2023 · 11

How likely are malign priors over objectives? [aborted WIP]

This was written as part of an entry to the FTX AI competition, but I presume that's off now and I can't justify spending more time on it. Neither my thinking nor my writing is polished here. I believe some of the arguments are original. In many (though not all)...

Nov 11, 2022 · -1

When can a mimic surprise you? Why generative models handle seemingly ill-posed problems

Thanks to Chris Leong and Nora Belrose for their feedback. This is meant to be part of an entry to the Future Fund AI Worldview Competition, but a later post is intended to address the competition questions head on. In this post, I explore mimics. Mimics are what you get...

Nov 5, 2022 · 8
