Let's not forget the old, well-read post Dreams of AI Design. In that essay, Eliezer correctly points out the error of imputing meaning to nonsense by describing the nonsense with suggestive names.
Artificial intelligence meets natural stupidity (an old memo) is very relevant to understanding the problems facing this community's intellectual contributions. Emphasis added:
A major source of simple-mindedness in AI programs is the use of mnemonics like "UNDERSTAND" or "GOAL" to refer to programs and data structures. This practice has been inherited from more traditional programming applications, in which it is liberating and enlightening to be able to refer to program structures by their purposes. Indeed, part of the thrust of the structured programming movement is to program entirely in terms of purposes at one level before implementing them by the most convenient of the (presumably many) alternative lower-level constructs.
... If a researcher tries to write an "understanding" program, it isn't because he has thought of a better way of implementing this well-understood task, but because he thinks he can come closer to writing the first implementation. If he calls the main loop of his program "UNDERSTAND", he is (until proven innocent) merely begging the question. He may mislead a lot of people, most prominently himself, and enrage a lot of others.
What he should do instead is refer to this main loop as G0034, and see if he can convince himself or anyone else that G0034 implements some part of understanding. Or he could give it a name that reveals its intrinsic properties, like NODE-NET-INTERSECTION-FINDER, it being the substance of his theory that finding intersections in networks of nodes constitutes understanding... When you say (GOAL ...), you can just feel the enormous power at your fingertips. It is, of course, an illusion.[1]
Of course, Conniver has some glaring wishful primitives, too. Calling "multiple data bases" CONTEXTS was dumb. It implies that, say, sentence understanding in context is really easy in this system...
Consider the following terms and phrases:
- "LLMs are trained to predict/simulate"
- "LLMs are predictors" (and then trying to argue the LLM only predicts human values instead of acting on them!)
- "Attention mechanism" (in self-attention)
- "AIs are incentivized to" (when talking about the reward or loss functions, thus implicitly reversing the true causality; reward optimizes the AI, but AI probably won't optimize the reward)
- "Reward" (implied to be favorable-influence-in-decision-making)
- "{Advantage, Value} function"
- "The purpose of RL is to train an agent to maximize expected reward over time" (perhaps implying an expectation and inner consciousness on the part of the so-called "agent")
- "Agents" (implying volition in our trained artifact... generally cuz we used a technique belonging to the class of algorithms which humans call 'reinforcement learning')
- "Power-seeking" (AI "agents")
- "Shoggoth"
- "Optimization pressure"
- "Utility"
- As opposed to thinking of it as an "internal unit of decision-making incentivization, which is a function of internal representations of expected future events; minted after the resolution of expected future on-policy inefficiencies relative to the computational artifact's current decision-making influences"
- "Discount rate" (in deep RL, implying that an external future-learning-signal multiplier will ingrain itself into the AI's potential inner plan-grading-function which is conveniently assumed to be additive-over-timesteps, and also there's just one such function and also it's Markovian)
- "Inner goal / mesa objective / optimization daemon (yes that was a real name)"
- "Outer optimizer" (perhaps implying some amount of intentionality; a sense that 'more' optimization is 'better', even at the expense of generalization of the trained network)
- "Optimal" (as opposed to equilibrated-under-policy-updates)
- "Objectives" (when conflating a "loss function as objective" and "something which strongly controls how the AI makes choices")
- "Training" (in ML)
- Yup!
- "Learning" (in ML)
- "Simplicity prior"
- Consider the abundance of amateur theorizing about whether "schemers" will be "simpler" than "saints", or whether they will be supplanted by "sycophants", sometimes conducted in ignorance of inductive bias research, which is a real subfield of ML.
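Two of the items above (the reward-causality one and the discount-rate one) make concrete claims about RL mechanics, so here is a minimal sketch of them: plain REINFORCE on a toy one-parameter policy. Everything in it (the toy environment, the function names, the hyperparameters) is my own illustration of the standard setup, not something drawn from the quoted memo or any particular codebase.

```python
# A minimal, hypothetical sketch: plain REINFORCE on a one-parameter policy.
# The reward signal and the discount factor appear only as experimenter-chosen
# coefficients in the parameter update. "Reward optimizes the AI": nothing here
# forces the trained artifact to internally represent or pursue reward.
import math
import random

def policy_prob_right(theta: float) -> float:
    """Probability of picking action 'right' under a one-parameter sigmoid policy."""
    return 1.0 / (1.0 + math.exp(-theta))

def run_episode(theta: float, horizon: int = 5):
    """Toy environment (illustrative): emits reward 1 whenever 'right' is picked."""
    actions, rewards = [], []
    for _ in range(horizon):
        action = 1 if random.random() < policy_prob_right(theta) else 0
        actions.append(action)
        rewards.append(float(action))  # reward is just a number the environment emits
    return actions, rewards

def reinforce_update(theta: float, actions, rewards,
                     gamma: float = 0.9, lr: float = 0.1) -> float:
    for t, action in enumerate(actions):
        # Additive-over-timesteps, gamma-discounted return: a modeling choice
        # made here, outside the policy, not something the policy "has".
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
        p = policy_prob_right(theta)
        grad_log_pi = (1.0 - p) if action == 1 else -p  # d/dtheta log pi(a | theta)
        # The return G enters only as a multiplier on the parameter update:
        # reward shapes which theta we end up with.
        theta += lr * G * grad_log_pi
    return theta

theta = 0.0
for _ in range(200):
    actions, rewards = run_episode(theta)
    theta = reinforce_update(theta, actions, rewards)
print(f"trained theta = {theta:.2f}, P(right) = {policy_prob_right(theta):.2f}")
```

Whether the resulting policy in some sense "wants" reward is a further empirical question about the learned parameters; the update rule above does not settle it.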
Lest this all seem merely amusing, meditate on the fate of those who have tampered with words before. The behaviorists ruined words like "behavior", "response", and, especially, "learning". They now play happily in a dream world, internally consistent but lost to science. And think about this: if "mechanical translation" had been called "word-by-word text manipulation", the people doing it might still be getting government money.
Some of these terms are useful; some of the academic imports are necessary for successful communication and carry real benefits.
That doesn't stop them from distorting your thinking. At least in your private thoughts, you can do better. You can replace "optimal" with "artifact equilibrated under policy update operations" or "set of sequential actions which have subjectively maximal expected utility relative to [entity X]'s imputed beliefs", and the nice thing about brains is that these long sentences compress into single concepts which you can understand in but a moment.
It's easy to admit the mistakes of our past selves (whom we've conveniently grown up from by the time of recounting). It's easy for people (such as my past self and others in this community) to sneer at out-group folks when they make such mistakes, the mistakes' invalidity laid bare before us.
It's hard when you've[2] read Dreams of AI Design and utterly failed to avoid the same mistakes yourself. It's hard when your friends are using the terms, and you don't want to be a blowhard about it and derail the conversation by explaining your new term. It's hard when you have utterly failed to resist inheriting the invalid connotations of other fields ("optimal", "reward", "attention mechanism").
I think we have failed, thus far. I'm sad about that. When I began posting in 2018, I assumed that the community was careful and trustworthy. Not easily would undeserved connotations sneak into our work and discourse. I no longer believe that and no longer hold that trust.
To be frank, I think a lot of the case for AI accident risk comes down to a set of subtle word games.
When I try to point out such (perceived) mistakes, I get a lot of pushback, and somehow it feels combative. I do get somewhat combative online sometimes (and wish I didn't, and am trying different interventions here), and so maybe people combat me in return. But I perceive defensiveness even toward the critiques of Matthew Barnett, who seems consistently dispassionate.
Maybe it's because people perceive me as an Optimist and therefore my points must be combated at any cost.
Maybe people really just naturally and unbiasedly disagree this much, though I doubt it.
But the end result is that I have given up on communicating with most folk who have been in the community longer than, say, 3 years. I don't know how to disabuse people of this trust, which seems unearned.
All to say: Do not trust this community's concepts and memes, if you have the time. Do not trust me, if you have the time. Verify.
See also: Against most, but not all, AI risk analogies.
[1] How many times has someone expressed "I'm worried about 'goal-directed optimizers', but I'm not sure what exactly they are, so I'm going to work on deconfusion"? There's something weird about this sentiment, don't you think? I can't quite put my finger on what, and I wanted to get this post out.
[2] Including myself, and I suspect basically every LW enthusiast interested in AI.
I have read many of your posts on these topics, appreciate them, and I get value from the model of you in my head that periodically checks for these sorts of reasoning mistakes.
But I worry that the focus on 'bad terminology' rather than reasoning mistakes themselves is misguided.
To choose the most clear cut example, I'm quite confident that when I say 'expectation' I mean 'weighted average over a probability distribution' and not 'anticipation of an inner consciousness'. Perhaps some people conflate the two, in which case it's useful to disabuse them of the confusion, but I really would not like it to become the case that every time I said 'expectation' I had to add a caveat to prove I know the difference, lest I get 'corrected' or sneered at.
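For concreteness, here is a one-line restatement of that standard sense (a minimal sketch of the definition, not a correction of anything above):

```latex
% "Expectation" as a probability-weighted average, with no connotation
% of an anticipating mind behind it.
\[
  \mathbb{E}[X] \;=\; \sum_{x} x\, p(x)
  \qquad \text{(or } \textstyle\int x\, p(x)\, dx \text{ for a density).}
\]
```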
For a probably more contentious example, I'm also reasonably confident that when I use the phrase 'the purpose of RL is to maximise reward', the thing I mean by it is something you wouldn't object to, and which does not cause me confusion. And I think those words are a straightforward way to say the thing I mean. I agree that some people have mistaken heuristics for thinking about RL, but I doubt you would disagree very strongly with mine, and yet if I was to talk to you about RL I feel I would be walking on eggshells, trying to use long-winded language so as not to get marked down as one of 'those idiots'.
I wonder if it's better, as a general rule, to focus on policing arguments rather than language? If somebody uses terminology you dislike to generate a flawed reasoning step and arrive at a wrong conclusion, then you should be able to demonstrate the mistake by unpacking the terminology into your preferred version, and it's a fair cop.
But until you've seen them use it to reason poorly, perhaps it's a good norm to assume they're not confused about things, even if the terminology feels like it has misleading connotations to you.
There's a difficult problem here.
Personally, when I see someone using the sorts of terms Turner is complaining about, I mentally flag it (and sometimes verbally flag it, saying something like "Not sure if it's relevant yet, but I want to flag that we're using <phrase> loosely here, we might have to come back to that later"). Then I mentally track both my optimistic-guess at what the person is saying, and the thing I would mean if I used the same words internally. If and when one of those mental pictures throws an error in the person's argument, I'll ...