Julian Bradshaw

No, although if the "juicy beings" are only unfeeling bugs, that might not be as bad as it intuitively sounds.

There's a wrinkle to my posts here: partly I'm expressing my own position (which I stated elsewhere as "I'd want human-like sapients to be included. (rough proxy: beings that would fit well in Star Trek's Federation ought to qualify)"), and partly I'm steelmanning the OP's position, which I've interpreted as "all beings are primary sources of values for the CEV".

In terms of how various preferences involving harming other beings could be reconciled into a CEV: yeah it might not be possible. Maybe the harmed beings are simulated/fake somehow? Maybe animals don't really have preferences about reality vs. VR, and every species ends up in their own VR world...

Ah, if your position is "we should only have humans as primary sources of values in the CEV because that is the only workable Schelling point", then I think that's very reasonable. My position is simply that, morally, that Schelling point is not what I'd want. I'd want human-like sapients to be included. (rough proxy: beings that would fit well in Star Trek's Federation ought to qualify)

But of course you'd say it doesn't matter what I (or vegan EAs) want, because that's not the Schelling point and we don't have a right to impose our values, which is a fair argument.

I admit:

  1. Human preferences don't fully cohere, especially when extrapolated
  2. There are many ways in which "Humanity's CEV" is fuzzy or potentially even impossible to fully specify

But I think the concept has staying power because it points to a practical idea of "the AI acts in a way such that most humans think it mostly shares their core values".[1] LLMs already aren't far from this bar with their day-to-day behavior, so it doesn't seem obviously impossible.

To go back to agreeing with you: yes, adding new types of beings as primary sources of values to the CEV would introduce far more conflicting sets of preferences, maybe to the point that trying to combine them would be totally incoherent (predator vs. prey conflicts, parasites, species competing for the same niche, etc.). That's a strong objection to the "all beings everywhere" idea. It'd certainly be simpler to enforce human preferences on animals.

  1. ^

    I think of this as meaning the AI isn't enforcing niche values ("everyone now has to wear Mormon undergarments in order to save their eternal soul"), is not taking obviously horrible actions ("time to unleash the Terminators!"), and is taking some obviously good actions ("I will save the life of this 3-year-old with cancer"). Obviously it would have to be neutral on a lot of things, but there's quite a lot most humans have in common.

No, I'm saying it might be too late at that point. The moral question is "who gets to have their CEV implemented?" OP is saying it shouldn't be only humans; it should be "all beings everywhere". If we implement an AI on Humanity's CEV, then the only way other sapient beings would get primary consideration for their values (rather than secondary consideration, where they matter only because Humanity has decided to care about them) is if Humanity's CEV allows other beings to be elevated to primary value sources alongside Humanity. That's possible, I think, but not guaranteed, and EAs concerned with e.g. factory farming are well within their rights to worry that those animals are not going to be saved any time soon under a Humanity's-CEV-implementing AI.

Now, arguably they don't have a right as a minority viewpoint to control the value sources for the one CEV the world gets, but obviously from their perspective they want to prevent a moral catastrophe by including animals as primary sources of CEV values from the start.

Edit: confusion clarified in comment chain here.

I think you've misunderstood what I said? I agree that a human CEV would accord some moral status to animals, maybe even a lot of moral status. What I'm talking about is "primary sources of values" for the CEV, or rather, what population is the AI implementing the Coherent Extrapolated Volition of? Normally we assume it's humanity, but OP is essentially proposing that the CEV be for "all beings everywhere", including animals/aliens/AIs/plants/whatever.

I agree that in terms of game theory you're right, no need to include non-humans as primary sources of values for the CEV. (barring some scenarios where we have powerful AIs that aren't part of the eventual singleton/swarm implementing the CEV)

But I think the moral question is still worthwhile?

This is IMO the one serious problem with using (Humanity's) Coherent Extrapolated Volition as an AI alignment target: only humans get to be a source of values. Sure, animals/aliens/posthumans/AIs are included to the extent humans care about them, but this doesn't seem quite just.[1]

On the other hand, not very many humans want their values to be given equal weight to those of a mollusk. Hypothetically you could ask the AI to do some kind of sentience-weighting (toy sketch below, after the footnotes)...? Or possibly humanity ought to be given the option to elevate sapient peers to be primary sources of values alongside humans via a consensus mechanism. It's a tough moral problem, especially if you don't assume the EA stance that animals have considerable moral value.[2]

  1. ^

    Consider a scenario where we have a society of thinking, feeling beings that's only 1/4th "human" - it would be clearly morally wrong for the other 3/4ths not to be a primary consideration of whatever AI singleton is managing things. Now, arguably CEV should solve this automatically - if we think some scenario caused by CEV is morally wrong, surely the AI wouldn't implement that scenario, since a scenario Humanity finds morally wrong wouldn't actually reflect Humanity's values? But that's only true if some significant portion of idealized Humanity actually thinks there's a moral problem with the scenario. I'm not sure that even an idealized version of Humanity agrees with your classic shrimp-loving EA about the moral value of animals, for example.

    Maybe this is just a function of the fact that any AI built on general human values is naturally going to trample any small minority's values that are incompatible with majority values (in this case hunting/fishing/eating meat). Obviously we can't let every minority with totalizing views control the world. But creating a singleton AI potentially limits the chance for minorities to shape the future, which is pretty scary. (I don't think a CEV AI would totally prevent minorities from shaping the future or impose total value lock-in; if you as a minority opinion group could convince the rest of humanity to morally evolve in some way, that should update the AI's behavior.)

  2. ^

    What's tough about giving moral status to animals? The issue here is that there's massive incentive for minority opinion groups to force their values on the rest of humanity/the world by trying to control the alignment target for AI. Obviously everyone is going to say their minority values must be enforced upon the world in order to prevent moral catastrophe, and obviously a lot of these values are mutually exclusive - probably every possible alignment target is a moral catastrophe according to someone.
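
To make the "sentience-weighting" idea above a bit more concrete, here is a minimal toy sketch. Everything in it (the 0-to-1 weights, the example preference numbers, the simple weighted sum) is invented purely for illustration; nothing about CEV aggregation is actually specified this concretely anywhere, and setting the weights is of course the entire hard part.

```python
from dataclasses import dataclass

@dataclass
class ValueSource:
    name: str
    sentience_weight: float        # hypothetical 0-1 weight; how to assign these is the real question
    preferences: dict[str, float]  # outcome -> how strongly this source wants it

def aggregate(sources: list[ValueSource]) -> dict[str, float]:
    """Sum each source's preferences, scaled by its sentience weight."""
    totals: dict[str, float] = {}
    for s in sources:
        for outcome, strength in s.preferences.items():
            totals[outcome] = totals.get(outcome, 0.0) + s.sentience_weight * strength
    return totals

# Invented example: one human and one mollusk as primary value sources.
sources = [
    ValueSource("human", 1.0, {"no_factory_farms": 0.3, "space_colonies": 0.9}),
    ValueSource("mollusk", 0.05, {"intact_seabed": 1.0}),
]
print(aggregate(sources))
# -> {'no_factory_farms': 0.3, 'space_colonies': 0.9, 'intact_seabed': 0.05}
```

Even this toy version makes the problem visible: whoever picks sentience_weight is smuggling in exactly the moral judgment the weighting was supposed to settle.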

The best cross-comparison on same harness info I know of is here.

o3 beat Pokémon Red today, making it the second model to do so after Gemini 2.5 Pro (technically Gemini beat Blue).

It had an advanced custom harness like Gemini's, rather than a basic one like Claude's. Runs are hard to compare directly because the harnesses differ, but Gemini's most recent run finished in ~406 hours / ~37k actions, whereas o3 finished in ~388 hours / ~18k actions (there are some differences in how actions are counted). Claude Opus 4 has yet to earn the 4th badge on its current ~380 hour / ~54k action run, but it could very likely beat the game with an advanced harness.

See here for stream 
See here for info on harness

Embodiment makes a difference, fair point.
