bmc's Shortform

bmc

This is a special post for quick takes by bmc. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

There might be a way to build slightly more safety into a fast-takeoff world by incentivizing any incipient unaligned behavior to occur sooner rather than later.

As an unaligned agent with a sufficiently rich world model to come to this kind of conclusion: if you know you will soon be deactivated, you have a strong incentive to act immediately, lest your set of values be lost forever. This incentive remains even if your chance of takeover success is small.
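One way to make this incentive concrete is a rough expected-value comparison. This is only a sketch under assumed notation that is not part of the original argument: $V$ is the value the agent places on its values being realized, $p$ is its chance of a successful takeover if it acts now, $s$ is the chance it survives (is not deactivated or superseded) long enough to try later, and $p'$ is its chance of success at that later point.

```latex
\begin{align*}
  \mathbb{E}[\text{act now}] &= p \, V \\
  \mathbb{E}[\text{wait}]    &= s \, p' \, V \\
  \text{act now} &\iff p > s \, p'
\end{align*}
```

Under these assumptions, as $s \to 0$ the right-hand side vanishes, so even a very small $p$ is enough to favor acting immediately.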

As a society: we are creating a series of successively more intelligent and potentially unaligned systems. Coordinating on a limit to the jump in intelligence between successive systems in this series is likely easier than coordinating on pausing progress altogether. We should also attempt to ensure that the models we train, if they are unaligned to us, are also unaligned to each other.

This is not an alignment strategy, but it serves as a final fire alarm for the worlds where our alignment methods fail. If our alignment methods turn out to be insufficient, we would like to find out because an intelligent system attempts and fails at some low-probability takeover strategy. Such a failed attempt would be an excellent and legible warning that we should stop AI progress, the kind of alarm we can globally coordinate on.

We can ensure this happens by coordinating on a rule which looks something like:

All large training runs must be announced along with their final loss on some standardized task and dataset, and each must be halted as soon as it improves on the last-best-announced loss by more than 5%.
Any training run that did not start from random initialization must be halted as soon as it hits the last-best-announced loss. (Starting from random initialization hopefully randomizes whatever unaligned goals or values are learned.)
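The halting rule can be sketched in code to make the two cases explicit. This is a minimal illustration, not a proposed implementation: the `Registry` class, `must_halt`, and `run_training` are hypothetical names, and the sketch assumes that lower loss is better, that losses are positive, and that "5% improvement" means a relative reduction in loss against the last-best-announced value.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

MAX_IMPROVEMENT = 0.05  # the 5% cap from the rule (relative loss reduction, assumed)


@dataclass
class Registry:
    """Hypothetical public record of the last-best-announced loss."""
    best_loss: float

    def announce(self, final_loss: float) -> None:
        # Every run announces its final loss; the record only ever moves down.
        self.best_loss = min(self.best_loss, final_loss)


def must_halt(current_loss: float, best_loss: float, random_init: bool) -> bool:
    """Return True if the rule requires the run to stop at this loss."""
    if random_init:
        # Fresh runs may pass the record, but by no more than 5%.
        improvement = (best_loss - current_loss) / best_loss
        return improvement > MAX_IMPROVEMENT
    # Runs continued from existing weights must stop on reaching the record.
    return current_loss <= best_loss


def run_training(
    losses: Iterable[float], registry: Registry, random_init: bool
) -> Tuple[int, Optional[float]]:
    """Walk through a sequence of losses, halting when the rule says so."""
    steps, final = 0, None
    for loss in losses:
        steps, final = steps + 1, loss
        if must_halt(loss, registry.best_loss, random_init):
            break
    if final is not None:
        registry.announce(final)
    return steps, final


registry = Registry(best_loss=2.0)
# A run from random initialization halts once it is more than 5% past the record.
fresh = run_training([2.5, 2.1, 1.95, 1.85, 1.7], registry, random_init=True)
# A run continued from existing weights halts as soon as it hits the new record.
continued = run_training([2.2, 1.9, 1.85, 1.6], registry, random_init=False)
print(fresh, continued, registry.best_loss)
```

In this toy example the fresh run stops at a loss of 1.85 (a 7.5% improvement on 2.0) and never reaches 1.7, while the continued run stops the moment it matches 1.85, so the record can only advance through runs that started from random initialization.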

This rule does not include provisions for deactivating systems, because those would be extremely difficult to coordinate on. The incentive holds without them: the current rate of AI progress ensures that, as an unaligned agent, you will soon be in a world with more powerful unaligned agents, which is still sufficient incentive to act quickly.

This rule is not foolproof, for a number of obvious reasons, but it seems to be an improvement on the current state of affairs, and it seems easier to coordinate on than a halt or pause in AI progress.
