MIRI's "The Problem" hinges on diagnostic dilution
[adapted with significant technical improvements from https://clarifyingconsequences.substack.com/p/miris-ai-problem-hinges-on-equivocation, which I also wrote and will probably update to be more in line with this at some point]

I'm going to meet someone new tomorrow, and I'm wondering how many kids they have. I know their number of kids is a nonnegative 32-bit integer, and almost all such numbers are greater than 1 million. So I suppose they're highly likely to have more than 1 million kids.

This is an instance of a fallacy I'm calling diagnostic dilution[1]. It is akin to a probabilistic form of the fallacy of the undistributed middle. The error goes like this: we want to evaluate some outcome A given some information X. We note that X implies Y (or nearly so), and so we decide to evaluate P[A|Y]. But this is a mistake! Because X implies Y, we are strictly better off evaluating P[A|X][2]. We've replaced a sharp condition with a vague one, hence the name diagnostic dilution.

In the opening paragraph we have A = "count > 1 million", X = "we're counting kids", and Y = "count is a nonnegative 32-bit integer". So P[A|X] is 0 and P[A|Y] is close to 1. Diagnostic dilution has led us far astray.

Diagnostic dilution is always structurally invalid, but it misleads us specifically when the conclusion hinges on forgetting X. "Bob is a practising doctor, so Bob can prescribe medications, so Bob can probably prescribe Ozempic" is a fine inference because it goes through without forgetting that Bob is a doctor. But consider the same chain starting with "Bob is a vet": now the inference does not go through unless we forget Bob's job, and indeed the conclusion is false.

Diagnostic dilution is tricky because it looks a bit like a valid logical inference: X⟹Y and Y⟹A, so X⟹A. But if P[A|Y] is even a hair less than 1, then, as we saw, we can end up being very wrong.

> Quick definition: Diagnostic dilution is a probabilistic reasoning error that involves replacing the condition X with a vaguer condition Y that X implies, and evaluating P[A|Y] when P[A|X] is the quantity we actually care about.
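To make the gap between the two conditionals concrete, here is a minimal sketch of the kids example. The distributions are purely illustrative assumptions (a uniform prior over nonnegative 32-bit integers for Y, and a made-up toy distribution over family sizes for X), not anything claimed in the post:

```python
# Toy illustration of diagnostic dilution: compare P[A|Y] with P[A|X].
# Assumptions (illustrative only): Y = "count is a uniform nonnegative
# 32-bit integer"; X = "count is a number of kids", modelled by a made-up
# distribution concentrated on small values.

THRESHOLD = 1_000_000  # A = "count > 1 million"

# P[A|Y]: fraction of nonnegative 32-bit integers strictly above the threshold.
n_32bit = 2**32
p_a_given_y = (n_32bit - (THRESHOLD + 1)) / n_32bit

# P[A|X]: under any remotely realistic distribution over numbers of kids,
# all the probability mass sits far below 1 million, so this is 0.
kids_distribution = {0: 0.45, 1: 0.20, 2: 0.20, 3: 0.10, 4: 0.05}
p_a_given_x = sum(p for k, p in kids_distribution.items() if k > THRESHOLD)

print(f"P[A|Y] ≈ {p_a_given_y:.4f}")  # ~0.9998
print(f"P[A|X] = {p_a_given_x}")      # 0
```

The particular numbers don't matter; the point is that conditioning on the vaguer Y throws away everything X tells us about A.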