Is this supposed to answer my entire comment, in the sense that the general audience doesn't need precise definitions? That may work for some people but can be off-putting to others. And surely it's more important to convince AI researchers. Much of the general public already hates AI.
I don't really understand what you mean by goal-oriented
This is the key point. "The Problem" uses the word "goal" some 80 times, but it does not define it, does not acknowledge that it's a complex concept, and does not consider whether some AI might not have one. I wish I could just use your concept of a goal; we shouldn't need to discuss this at all, because it should have been precisely defined in "The Problem" or some other introductory text.
Personally, by "goal" or "goal-oriented" I mean that the utility function of the AI has a simple description. E.g. in the narrow domain of chess moves, the actions a chess bot chooses are very well explained by "trying to win". In the real world, on the other hand, there are many actions that would help it win, which the chess bot ignores, and not for lack of intelligence. "Trying to win" is therefore no longer a good predictor, so I claim that this bot is not actually "goal-oriented", or at least that its goal is very different from plain "winning". Maybe you would call this property "robustly goal-oriented"?
A second definition could be that any system which moves toward a goal more than away from it is "oriented" to that goal. This is reasonable, because that's the economically useful part. And with this definition, the statement "ASI is very likely to exhibit goal-oriented behavior" is trivial. But that's a very low bar, and extrapolating anything about the long-term behavior of the system from this definition seems like a mistake.
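To make the contrast explicit, here is how I would write the two definitions down (my own informal formalization, not something taken from "The Problem" or anywhere else):

```latex
% Definition 1 (what I mean by goal-oriented): the agent's policy \pi is well
% predicted, across every domain it acts in, by maximization of a utility u
% with a short description:
%   \pi(x) \approx \arg\max_a \, \mathbb{E}\!\left[\, u \mid x, a \,\right].
%
% Definition 2 (the low bar): for some goal G with a progress measure g,
%   \mathbb{E}\!\left[\, g(s_{t+1}) - g(s_t) \mid \text{the agent acts} \,\right] > 0,
% i.e. on average the system moves toward G rather than away from it.
```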
I think this explanation constitutes me answering in the affirmative to (the spirit of) your first question.
I'm happy with the explanation; my issue is that I don't feel I've seen this explicitly acknowledged anywhere: not in "The Problem", not in "A List of Lethalities" (maybe it falls under "We can't just build a very weak system", but I don't agree that it has to be weak), nor in the posts of Paul's I've read.
not all useful agents are goal-oriented.
The theorem prover I mentioned is one such useful agent. I understand you would call the prover "goal-oriented", but it doesn't necessarily reach that level under my definition. And at least we agree that provers can be safe. The usefulness is, among other things, that we could use the prover to work out alignment for more general agents. I don't want to go too far down the tangent of whether this would actually work, but surely it is a coherent sequence of events, right?
The chess example, as I recall, is a response to two common objections
I don't hold these objections, and I don't think anyone reasonable does, especially with the "never" in them. At best I could argue that humans aren't actually great at pursuing goals robustly, and therefore the AI might also not be.
contemporary general systems do not pursue their goals especially robustly, and it may be hard to make improvement on this axis
It's not just "hard" to make improvements, but also unnecessary and even suicidal. X-risk arguments seem to assume that goals are robust, but do not convincingly explain why they must be.
To be clear, you believe that making aligned narrow AI is easy, regardless of how intelligent it is? Even something more useful than a chess bot, like a theorem prover? And the only reason AIs will be goal-oriented to a dangerous extent is that people will intentionally make them like that, despite the obvious risks? I'm not saying they won't, but is that really enough to justify a high p(doom)? When I was reading "The Problem", I was sure that goal-oriented AI was seen as inevitable for some reason deeper than "goal-oriented behavior is economically useful".
I'd still like to argue that "goal-oriented" is not a simple concept, that it's not trivial to produce a goal-oriented agent even if you try, and that not all useful agents are goal-oriented. But if the answer is that people will try very hard to kill themselves, I wouldn't know how to reply to that.
All AI is trained in narrow domains, to some extent. There is no way to make a training environment as complex as the real world. I could have made the same post about LLMs, except that there the supposed goal is a lot less clear. Do you have a better example of a "goal-oriented" AI in a complex domain?
You might reasonably argue that making aligned narrow AI is easy, but that greedy capitalists will build unaligned AI instead. I think it would be off-topic to debate here how likely that is. But I don't think this is the prevailing view, and I don't think it produces the p(doom)=0.9 that some people hold.
And I do personally believe that EY and many others believe that, with enough optimization, even a chess bot would become dangerous. I'm not sure there is any evidence for that belief.
In reference to Said criticizing Benquo, you seem to be ignoring the crucial point, which is that Said was right. Benquo made the simple claim that knowing about yeast is useful in everyday life, and this claim is clearly wrong, regardless of what either of them said about it. Benquo could have admitted this, or he could have found another example. Instead he doubled down on being wrong, which naturally leads to frustration. It's concerning that you picked this conversation as an example, as if you can't tell.
I'm also confused by the "asymmetric effort" part. You describe it like a competition between author and critic. But if the criticism is correct, the author should be happy to receive it (isn't that the point of making posts?). And how much effort does it take to say "I don't understand what you mean" or "this doesn't seem important to my core point" or "you're right"? And what reward do you think Said gets from this? Certainly not social status, as he clearly doesn't have any.
By the way, I appreciate you pointing to real conversations instead of vaguely hand-waving. If you want to change anyone's mind about norms, you need to point to negative as well as positive examples as evidence. Hopefully you'll do that when you eventually explain the right way to fight the LinkedIn attractor.
Given that the training data contains contradictory statements and blatant lies, it's natural that the LLM would learn to lie in order to reproduce them. If anything, it's surprising that the LLM has any coherent concept of truth at all.
As for solutions, I'd suggest either having LLMs produce formally verifiable proofs, or having them fact-check each other (a rough sketch of the latter below), not that you need my suggestions.
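Here `query_model` is a hypothetical stand-in for whatever API would actually be called; this is only meant to show the shape of the loop, not a working system.

```python
# Sketch of one LLM fact-checking another. `query_model` is a hypothetical
# stand-in for a real API call; nothing here is a working system.
def query_model(model: str, prompt: str) -> str:
    raise NotImplementedError("replace with an actual LLM call")

def cross_checked_answer(question: str, generator: str, checker: str) -> str:
    answer = query_model(generator, question)
    verdict = query_model(
        checker,
        f"Question: {question}\nProposed answer: {answer}\n"
        "List any factual errors, or reply OK if there are none.",
    )
    if verdict.strip() == "OK":
        return answer
    # Otherwise feed the criticism back to the generator for one revision.
    return query_model(
        generator,
        f"{question}\nYour previous answer was criticized as follows:\n{verdict}\nPlease revise it.",
    )
```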
The solution is to realize that your rationality skills are also bullshit; they're just a slightly different brand of bullshit, one not compatible with the bullshit of lay philosophy. There certainly isn't much evidence to the contrary.
In practice I recommend playing a video game, finding people who are better than you at it, finding that your rationality isn't all that useful for beating them, and developing some respect for them, despite the shit they might say sometimes.
Maybe I should just let you tell me what framework you are even using in the first place.
I'm looking at the Savage theory from your own https://plato.stanford.edu/entries/decision-theory/ and I see U(f) = ∑_i u(f(s_i))·P(s_i), so at least they have no problem with the domains (O and S) being different. Now I see the confusion is that to you Omega=S (and also O=S), but to me Omega=dom(u)=O.
Furthermore, if O={o_0, o_1}, then I can group the terms into u(o_0)·P("we're in a state where f evaluates to o_0") + u(o_1)·P("we're in a state where f evaluates to o_1"). I'm just moving all of the complexity out of EU and into P, which I assume works by some magic (e.g. LI) that doesn't involve literally iterating over every state in S.
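Spelled out, the regrouping I have in mind is just this (my notation; the only thing taken from the SEP entry is the expected-utility formula itself):

```latex
% Savage-style expected utility with the domains made explicit:
%   u : O \to \mathbb{R}, \quad P \text{ a probability over states } S, \quad f : S \to O \text{ an act.}
U(f) \;=\; \sum_{s \in S} u(f(s))\, P(s)
% With O = \{o_0, o_1\}, group the states by the outcome f assigns to them:
U(f) \;=\; u(o_0)\, P(\{\, s : f(s) = o_0 \,\}) \;+\; u(o_1)\, P(\{\, s : f(s) = o_1 \,\})
```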
We can either start with a basic set of "worlds" (e.g., infinite bit strings) and define our "propositions" or "events" as sets of worlds <...>
That's just math-speak: you can define a lot of things as a lot of other things, but that doesn't mean that the agent is going to be literally iterating over infinite sets of infinite bit strings and evaluating something on each of them.
By the way, I might not see any more replies to this.
A classical probability distribution over Ω with a utility function understood as a random variable can easily be converted to the Jeffrey-Bolker framework, by taking the JB algebra as the sigma-algebra, and V as the expected value of U.
OK, you're saying that JB is just a set of axioms, and U already satisfies those axioms. And in this construction an "event" really is a subset of Omega, and "updates" are just updates of P, right? Then of course U is not more general; I had the impression that JB was a more distinct and specific thing.
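Concretely, the construction as I now understand it (my paraphrase, reading "the expected value of U" as the conditional expectation given an event, and assuming a finite Omega for simplicity):

```latex
% Start from (\Omega, P) with a utility U : \Omega \to \mathbb{R}.
% Take the events to be the subsets E \subseteq \Omega with P(E) > 0, and define
V(E) \;=\; \mathbb{E}[\, U \mid E \,] \;=\; \frac{\sum_{\omega \in E} U(\omega)\, P(\omega)}{P(E)}.
% This V satisfies the Jeffrey-Bolker averaging condition: for disjoint E and F,
% V(E \cup F) is a probability-weighted average of V(E) and V(F).
```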
Regarding the other direction, my sense is that you will have a very hard time writing down these updates, and when it works, the code will look a lot like code with a utility function. But, again, the example in "Updates Are Computable" isn't detailed enough for me to argue anything. Although, now that I look at it, it does look a lot like U(p) = 1 - p("never press the button").
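For what it's worth, here is the kind of thing I mean by "looks a lot like a utility function", a toy sketch of my own (not taken from the post), where a belief state is just a dictionary of outcome probabilities:

```python
# Toy sketch (my own construction, not from "Updates Are Computable"):
# the "belief state" is a dict of outcome probabilities, and the value
# assigned to it is U(p) = 1 - p("never press the button").

def utility(belief: dict[str, float]) -> float:
    return 1.0 - belief.get("never press the button", 0.0)

# An agent that expects to eventually press the button with probability 0.9:
belief = {"press the button at some finite time": 0.9,
          "never press the button": 0.1}
print(utility(belief))  # 0.9
```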
events (ie, propositions in the agent's internal language)
I think you should include this explanation of events in the post.
construct 'worlds' as maximal specifications of which propositions are true/false
It remains totally unclear to me why you demand that a "world" be such a thing.
I'm not sure why you say Omega can be the domain of U but not the entire ontology.
My point is that if U has two output values, then it only needs two possible inputs. Maybe you're saying that if |dom(U)|=2, then there is no point in having |dom(P)|>2, and maybe you're right, but I feel no need to make such claims. Even if the domains are different, they are not unrelated; Omega is still in some way contained in the ontology.
I agree that we can put even more stringent (and realistic) requirements on the computational power of the agent
We could, and I think we should. I have no idea why we're talking math and not writing code for some toy agents in some toy simulation. Math has a tendency to sweep all kinds of infinite and intractable problems under the rug.
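To give a sense of the level of concreteness I'd prefer, here is a deliberately trivial example (entirely made up, just to illustrate; it forces every infinity and every intractability to be handled explicitly or not at all):

```python
# A toy agent in a toy simulation: a 1-D world of 10 cells, an agent that
# prefers to be at cell 7, and a one-step-lookahead policy. Everything is
# finite and explicit, so nothing hides under the rug.
TARGET = 7
WORLD_SIZE = 10

def utility(position: int) -> float:
    # Graded preference: closer to the target is better.
    return -abs(position - TARGET)

def step(position: int, action: int) -> int:
    # Actions are -1, 0, +1; the world is bounded.
    return max(0, min(WORLD_SIZE - 1, position + action))

def policy(position: int) -> int:
    # One-step lookahead: pick the action whose successor state scores best.
    return max((-1, 0, 1), key=lambda a: utility(step(position, a)))

position = 0
for t in range(20):
    position = step(position, policy(position))
print("final position:", position)  # 7
```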
On second thought, I don't agree that the number of outputs is the right criterion. It's the "narrowness" of the training environment that matters. E.g. you could also train an LLM to play chess. I believe it could get good, but this would not transfer into any kind of "preference for chess" or "desire to win", neither in the actions it takes nor in the self-descriptions it generates, because the training environment rewards no such thing. At most the training might produce some tree-search subroutines, which might be used for other tasks. Or the LLM might learn that it has been trained on chess and say "I'm good at chess", but this wouldn't be a direct consequence of the chess skill.