LESSWRONG

[ Question ]

Can we create self-improving AIs that perfect their own ethics?

30th Jan 2024

If the process of self-improving AIs as described in a simple article by Tim Urban (below) is mastered, then the AI alignment problem is solved: "The idea is that we’d build a computer whose two-THREE major skills would be doing research on AI, ON ETHICS, and coding changes into itself—allowing it to not only learn but to improve its own architecture. We’d teach computers to be computer scientists so they could bootstrap their own development. And that would be their main job—figuring out how to make themselves smarter and ALIGNED."

The parts in caps are my additions for alignment.

Link to the article: https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html

Nathan Helm-Burger

Jan 30, 2024

52

What ethics? What is ethics? Does the machine mind settle on an ethics that insists on wiping out humanity? I think the intuition of looking for a stable attractor state for the AI to pursue is a good one. But "ethics" is not a sufficiently coherent and objective concept to serve as that target.

Jan 30, 2024

52

If the process of self-improving AIs as described in a simple article by Tim Urban (below) is mastered, then the AI alignment problem is solved.

I would say this has causality backwards. In other words, one of the ways of solving the AI alignment problem is figuring out how to master the plausibly extremely complex process necessary to successfully implement a strategy that can be pointed to in a simple article.

research on AI, ON ETHICS, and coding changes into itself

As I understand it, the vast majority of the difficulty is in figuring out what the second goal in that list actually is, and how to make an AI care about it. Keep in mind that in so many cases we humans are still arguing about the same questions, answers, and frameworks that we've been debating for millennia.

Jan 30, 2024

30

This overall tactic can work well for problems that are difficult to solve but whose solutions are easy (or at least possible) to test.

I don't think alignment is such a thing. At least I haven't seen any proposals for measuring "how aligned" a system is.
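To make the "hard to solve, easy to test" shape concrete, here is a minimal Python sketch of a propose-and-test loop. Everything in it (the names `improve`, `score`, `mutate`, and the toy target of 3.0) is a hypothetical illustration, not something taken from the article or from any real system. The loop only works because `score` exists and is cheap to evaluate.

```python
import random


def improve(candidate, score, mutate, steps=1000, seed=0):
    """Hill climbing: keep a proposed change only if the test scores it higher."""
    rng = random.Random(seed)
    best, best_score = candidate, score(candidate)
    for _ in range(steps):
        proposal = mutate(best, rng)
        proposal_score = score(proposal)
        if proposal_score > best_score:
            best, best_score = proposal, proposal_score
    return best, best_score


# Toy task (an assumption for illustration): move x close to 3.0.
# The test is trivial to write down, which is exactly what makes the loop usable.
def score(x):
    return -(x - 3.0) ** 2


def mutate(x, rng):
    return x + rng.uniform(-0.5, 0.5)


best, best_score = improve(0.0, score, mutate)
print(best, best_score)
```

For Go, `score` could be a win rate against a reference opponent. For "how aligned is this system", nobody has proposed what `score` would compute, so the same loop has nothing to climb.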

the gears to ascension (1y)

I have seen many. the only ones that seem to have any chance of, after heavy modification, becoming a seed of something that holds up, are QACI and open agency+boundaries. both have big holes that make attempting to implement them as-is guaranteed to fail.

Dagon (1y)

Yeah, there are a lot of sketches for how to test a system for various specific behaviors. But there is no actual gears-level definition of what would succeed at alignment in such a way that it does any good while doing no harm (or acceptably small harm, which is the key undefined variable). A brick is aligned in that it does no harm. But it also doesn't make anyone immortal or solve any resource-allocation pains that humans have.

the gears to ascension (1y)

do you know of any other sketches of how to measure that are reasonably close to mechanically specified?

Dagon (1y)

Simulation or hidden Schelling fences seem to be the main mechanisms. I have seen zero ideas of WHAT to measure on a "true alignment" level. They all seem to be about noticing specific problems. I have seen none that try to quantify a "semi-aligned power" tradeoff between the good it does and the harm it does. I think Eliezer's early writing (and, AFAIK, current thinking) that it must be perfect or all is lost, with nothing in between, probably makes the goal impossible.

ChristianKl (1y)

You can measure AlphaGo's ability to play Go by having it play Go, which is a test you can specify mechanically: just let it play a game against a pro. We don't have a similar measurement for ethics.

Jan 30, 2024

20

It's not clear if this ends up working as intended, but there are proposals to that effect.

For example, "Safety without alignment" (https://arxiv.org/abs/2303.00752) proposes exploring a path that is closely related to what you are suggesting.

(It would be helpful to have a link to Tim Urban's article.)

Thanks for including the link in your edit.

One important factor to consider is how likely a goal or a value is to persist through self-improvements (which might end up being quite radical, and also fairly rapid).

An arbitrary goal or value is unlikely to persist (this is why the "classical formulation of alignment problem" is so difficult, the difficulties come from many directions, but the most intractable one is how to make it so that the desired properties are preserved during radical self-modifications). That's the main obstacle t... (read more)
