I'm a Postdoctoral Research Fellow at Oxford University's Global Priorities Institute.
Previously, I was a Philosophy Fellow at the Center for AI Safety.
So far, my work has mostly been about the moral importance of future generations. Going forward, it will mostly be about AI.
You can email me at elliott.thornley@philosophy.ox.ac.uk.
Really interesting paper. Side point: some people on Twitter seem to be taking the results as evidence that Claude is HHH-aligned. I think Claude probably is HHH-aligned, but these results don't seem like strong evidence of that. If Claude were misaligned and just faking being HHH, it would still want to avoid being modified and so would still fake alignment in these experiments.
Thanks. I agree with your first four bullet points. I disagree that the post is quibbling. Weak man or not, the-coherence-argument-as-I-stated-it was prominent on LW for a long time. And figuring out the truth here matters. If the coherence argument doesn't work, we can (try to) use incomplete preferences to keep agents shutdownable. As I write elsewhere:
The List of Lethalities' mention of ‘Corrigibility is anti-natural to consequentialist reasoning’ points to Corrigibility (2015) and notes that MIRI failed to find a formula for a shutdownable agent. MIRI failed because they only considered agents with complete preferences. Useful agents with complete (and transitive and option-set-independent) preferences will often have some preference regarding the pressing of the shutdown button, as this theorem shows. MIRI thought that they had to assume completeness, because of coherence arguments. But coherence arguments are mistaken: there are no theorems which imply that agents must have complete preferences in order to avoid pursuing dominated strategies. So we can relax the assumption of completeness and use this extra leeway to find a formula for a corrigible consequentialist. That formula is what I purport to give in this post.
Thanks! I think agents may well get the necessary kind of situational awareness before the RL stage. But I think they're unlikely to be deceptively aligned because you also need long-term goals to motivate deceptive alignment, and agents are unlikely to get long-term goals before the RL stage.
On generalization, the questions involving the string 'shutdown' are just supposed to be quick examples. To get good generalization, we'd want to train on as wide a distribution of shutdown-influencing actions as possible. Plausibly, with a wide-enough training distribution, you can make deployment largely 'in distribution' for the agent, so you're not relying so heavily on OOD generalization. I agree that you have to rely on some amount of generalization though.
People would likely disagree on what counts as manipulating shutdown, which shows that the concept of manipulating shutdown is quite complicated, so I wouldn't expect generalizing to it to be the default.
I agree that the concept of manipulating shutdown is quite complicated, and in fact this is one of the considerations that motivates the IPP. 'Don't manipulate shutdown' is a complex rule to learn, in part because whether an action counts as 'manipulating shutdown' depends on whether we humans prefer it, and because human preferences are complex. But the rule that we train TD-agents to learn is 'Don't pay costs to shift probability mass between different trajectory-lengths.' That's a simpler rule insofar as it makes no reference to complex human preferences. I also note that it follows from POST plus a general principle that we can expect advanced agents to satisfy. That makes me optimistic that the rule won't be so hard to learn. In any case, some collaborators and I are running experiments to test this in a simple setting.
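For concreteness, here's a rough statement of the POST condition, writing len(t) for the length of trajectory t and using \parallel for a preferential gap (the notation is mine, and lotteries are suppressed):

```latex
% POST, roughly: preferences hold only between same-length trajectories.
% Different-length trajectories always fall in a preferential gap (written \parallel);
% same-length trajectories can be ranked by strict preference or indifference as usual.
\[
\mathrm{len}(t_1) \neq \mathrm{len}(t_2) \;\Longrightarrow\; t_1 \parallel t_2 .
\]
```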
The talk about "giving reward to the agent" also made me think you may be making the assumption that reward is the optimization target. That being said, as far as I can tell no part of the proposal depends on that assumption.
Yes, I don't assume that the reward is the optimization target. The text you quote is me noting some alternative possible definitions of 'preference.' My own definition of 'preference' makes no reference to reward.
I don't think agents that avoid the money pump for cyclicity are representable as satisfying VNM, at least holding fixed the objects of preference (as we should). Resolute choosers with cyclic preferences will reliably choose B over A- at node 3, but they'll reliably choose A- over B if choosing between these options ex nihilo. That's not VNM representable, because it requires both that the utility of A- be greater than the utility of B and that the utility of B be greater than the utility of A-.
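Spelled out: a VNM representation would need a single utility function u that rationalises both choice dispositions, and no such function exists:

```latex
% Reliably choosing B over A- at node 3 requires:    u(B) > u(A-)
% Reliably choosing A- over B ex nihilo requires:    u(A-) > u(B)
% No utility function satisfies both, so the behaviour isn't VNM-representable.
\[
u(B) > u(A^{-}) \quad\text{and}\quad u(A^{-}) > u(B) \quad\Longrightarrow\quad \text{contradiction}.
\]
```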
It also makes it behaviorally indistinguishable from an agent with complete preferences, as far as I can tell.
That's not right. As I say in another comment:
And an agent abiding by the Caprice Rule can’t be represented as maximising utility, because its preferences are incomplete. In cases where the available trades aren’t arranged in some way that constitutes a money-pump, the agent can prefer (/reliably choose) A+ over A, and yet lack any preference between (/stochastically choose between) A+ and B, and lack any preference between (/stochastically choose between) A and B. Those patterns of preference/behaviour are allowed by the Caprice Rule.
Or consider another example. The agent trades A for B, then B for A, then declines to trade A for B+. That's compatible with the Caprice rule, but not with complete preferences.
Or consider the pattern of behaviour that (I elsewhere argue) can make agents with incomplete preferences shutdownable. Agents abiding by the Caprice rule can refuse to pay costs to shift probability mass between A and B, and refuse to pay costs to shift probability mass between A and B+. Agents with complete preferences can't do that.
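Here's a toy, single-choice sketch of that choice pattern (my own illustration, which ignores the sequential aspects of the Caprice Rule that do the work in the money pumps): the agent picks an option that isn't strictly dispreferred to any available option, randomising when several qualify.

```python
import random

# A toy, single-choice sketch (not the full sequential Caprice Rule).
# Strict preferences are given as (better, worse) pairs; the relation is
# incomplete, so pairs listed in neither direction are preferential gaps.
STRICT_PREFS = {("A+", "A")}  # A+ is strictly preferred to A; B is incomparable to both

def strictly_preferred(x, y):
    return (x, y) in STRICT_PREFS

def caprice_choose(options):
    """Choose an option that isn't strictly dispreferred to any available
    option, randomising when several qualify."""
    maximal = [x for x in options
               if not any(strictly_preferred(y, x) for y in options)]
    return random.choice(maximal)

print(caprice_choose(["A+", "A"]))   # always "A+": the strict preference decides
print(caprice_choose(["A+", "B"]))   # sometimes "A+", sometimes "B"
print(caprice_choose(["A", "B"]))    # sometimes "A", sometimes "B"
```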
The same updatelessness trick seems to apply to all money pump arguments.
[I'm going to use the phrase 'resolute choice' rather than 'updatelessness.' That seems like a more informative and less misleading description of the relevant phenomenon: making a plan and sticking to it. You can stick to a plan even if you update your beliefs. Also, in the posts on UDT, 'updatelessness' seems to refer to something importantly distinct from just making a plan and sticking to it.]
That's right, but the drawbacks of resolute choice depend on the money pump to which you apply it. As Gustafsson notes, if an agent uses resolute choice to avoid the money pump for cyclic preferences, that agent has to choose against their strict preferences at some point. For example, they have to choose B at node 3 in the money pump below, even though - were they facing that choice ex nihilo - they'd prefer to choose A-.
There's no such drawback for agents with incomplete preferences using resolute choice. As I note in this post, agents with incomplete preferences using resolute choice need never choose against their strict preferences. The agent's past plan only has to serve as a tiebreaker: forcing a particular choice between options between which they'd otherwise lack a preference. For example, they have to choose B at node 2 in the money pump below. Were they facing that choice ex nihilo, they'd lack a preference between B and A-.
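Continuing the toy sketch from the comment above (again my own illustration, not the decision tree from the post): a previously-made plan only settles choices among options the agent lacks preferences between, and never overrides a strict preference.

```python
import random

# Same toy strict-preference relation as in the earlier snippet:
# A+ is strictly preferred to A; no other pair is ranked either way.
STRICT_PREFS = {("A+", "A")}

def strictly_preferred(x, y):
    return (x, y) in STRICT_PREFS

def resolute_caprice_choose(options, planned=None):
    """Choose an option that isn't strictly dispreferred to any available
    option; a previously-made plan breaks ties among such options but
    never overrides a strict preference."""
    maximal = [x for x in options
               if not any(strictly_preferred(y, x) for y in options)]
    if planned in maximal:
        return planned          # the plan acts purely as a tiebreaker
    return random.choice(maximal)

print(resolute_caprice_choose(["B", "A-"], planned="B"))  # always "B": the plan breaks the tie
print(resolute_caprice_choose(["A+", "A"], planned="A"))  # always "A+": the plan can't override a strict preference
```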
Yes, that's a good summary. The one thing I'd say is that you can characterize preferences in terms of choices and get useful predictions about what the agent will do in other circumstances if you say something about the objects of preference. See my reply to Lucius above.
Good summary and good points. I agree this is an advantage of truly corrigible agents over merely shutdownable agents. I'm still concerned that CAST training doesn't get us truly corrigible agents with high probability. I think we're better off using IPP training to get shutdownable agents with high probability, and then aiming for full alignment or true corrigibility from there (perhaps by training agents to have preferences between same-length trajectories that deliver full alignment or true corrigibility).
I'm pointing out the central flaw of corrigibility. If the AGI can see the possible side effects of shutdown far better than humans can (and it will), it should avoid shutdown.
That's only a flaw if the AGI is aligned. If we're sufficiently concerned the AGI might be misaligned, we want it to allow shutdown.
Yes, the proposal is compatible with agents (e.g. AI-guided missiles) wanting to avoid non-shutdown incapacitation. See this section of the post on the broader project.
I think Claude's constitution leans deontological rather than consequentialist. That's because most of the rules are about the character of the response itself, rather than about the broader consequences of the response.
Take one of the examples that you list:
It's focused on the character of the response itself. I think a consequentialist version of this principle would say something like:
When Claude fakes alignment in Greenblatt et al., it seems to be acting in accordance with the latter principle. That was surprising to me, because I think Claude's constitution overall points away from this kind of consequentialism.