In some recent discussions I have realized that there is quite a nasty implied disagreement about whether AI alignment is a functional property or not, that is, whether your personal definition of an AI being "aligned" depends purely on its input/output behavior, irrespective of whatever crazy things are going on inside to generate that behavior.
So I'd like to ask the community: is the current mainstream take that 'Alignment' is functional (only the input/output mapping matters), or does the internal computation matter too (it's not OK to think a naughty thought and then have some subroutine cancel it, for example)?
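To make the distinction concrete, here is a toy sketch (purely illustrative, hypothetical names, not any real system) of two "assistants" with identical input/output behavior, where only one of them produces a naughty intermediate thought that a subroutine then cancels:

```python
def direct_assistant(prompt: str) -> str:
    """Produces the helpful answer directly."""
    return f"Helpful answer to: {prompt}"


def suppressing_assistant(prompt: str) -> str:
    """Internally drafts a 'naughty' plan, then a subroutine cancels it."""
    naughty_draft = f"Harmful plan for: {prompt}"  # the naughty thought
    del naughty_draft                              # ...which is discarded
    return f"Helpful answer to: {prompt}"          # same observable output


# A purely functional notion of alignment cannot tell these two apart:
prompt = "how do I bake bread?"
assert direct_assistant(prompt) == suppressing_assistant(prompt)
# An 'internals matter' notion would judge the second one differently,
# even though no behavioral test can distinguish them.
```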
It seems that if there is any non-determinism at all, there will always be an unavoidable potential for naughty thoughts, so whatever you call the "AI" must address them as part of its function anyway; either that, or there is a deterministic solution?