Let's not forget the old, well-read post Dreams of AI Design. In that essay, Eliezer correctly points out the error of imputing meaning to nonsense by giving the nonsense suggestive names.
Artificial intelligence meets natural stupidity (an old memo) is very relevant to understanding the problems facing this community's intellectual contributions. Emphasis added:
A major source of simple-mindedness in AI programs is the use of mnemonics like "UNDERSTAND" or "GOAL" to refer to programs and data structures. This practice has been inherited from more traditional programming applications, in which it is liberating and enlightening to be able to refer to program structures by their purposes. Indeed, part of the thrust of the structured programming movement is to program entirely in terms of purposes at one level before implementing them by the most convenient of the (presumably many) alternative lower-level constructs.
... If a researcher tries to write an "understanding" program, it isn't because he has thought of a better way of implementing this well-understood task, but because he thinks he can come closer to writing the first implementation. If he calls the main loop of his program "UNDERSTAND", he is (until proven innocent) merely begging the question. He may mislead a lot of people, most prominently himself, and enrage a lot of others.
What he should do instead is refer to this main loop as G0034, and see if he can convince himself or anyone else that G0034 implements some part of understanding. Or he could give it a name that reveals its intrinsic properties, like NODE-NET-INTERSECTION-FINDER, it being the substance of his theory that finding intersections in networks of nodes constitutes understanding... When you say (GOAL ...), you can just feel the enormous power at your fingertips. It is, of course, an illusion.[1]
Of course, Conniver has some glaring wishful primitives, too. Calling "multiple data bases" CONTEXTS was dumb. It implies that, say, sentence understanding in context is really easy in this system...
Consider the following terms and phrases:
- "LLMs are trained to predict/simulate"
- "LLMs are predictors" (and then trying to argue the LLM only predicts human values instead of acting on them!)
- "Attention mechanism" (in self-attention)
- "AIs are incentivized to" (when talking about the reward or loss functions, thus implicitly reversing the true causality; reward optimizes the AI, but AI probably won't optimize the reward)
- "Reward" (implied to be favorable-influence-in-decision-making)
- "{Advantage, Value} function"
- "The purpose of RL is to train an agent to maximize expected reward over time" (perhaps implying an expectation and inner consciousness on the part of the so-called "agent")
- "Agents" (implying volition in our trained artifact... generally cuz we used a technique belonging to the class of algorithms which humans call 'reinforcement learning')
- "Power-seeking" (AI "agents")
- "Shoggoth"
- "Optimization pressure"
- "Utility"
  - As opposed to thinking of it as "internal unit of decision-making incentivization, which is a function of internal representations of expected future events; minted after the resolution of expected future on-policy inefficiencies relative to the computational artifact's current decision-making influences"
- "Discount rate" (in deep RL, implying that an external future-learning-signal multiplier will ingrain itself into the AI's potential inner plan-grading-function which is conveniently assumed to be additive-over-timesteps, and also there's just one such function and also it's Markovian)
- "Inner goal / mesa objective / optimization daemon (yes that was a real name)"
- "Outer optimizer" (perhaps implying some amount of intentionality; a sense that 'more' optimization is 'better', even at the expense of generalization of the trained network)
- "Optimal" (as opposed to equilibrated-under-policy-updates)
- "Objectives" (when conflating a "loss function as objective" and "something which strongly controls how the AI makes choices")
- "Training" (in ML)
  - Yup!
- "Learning" (in ML)
- "Simplicity prior"
  - Consider the abundance of amateur theorizing about whether "schemers" will be "simpler" than "saints", or whether they will be supplanted by "sycophants". This theorizing is sometimes conducted in ignorance of actual inductive bias research, which is a real subfield of ML.
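As a concrete anchor for the "reward" entries above, here is a minimal sketch: a hypothetical two-armed bandit with a softmax policy and a REINFORCE-style update. The setup and names are illustrative, not from any particular codebase. Notice where the scalar reward actually shows up: as a multiplier on a parameter update. Reward reinforces and reshapes the network's computations; nothing in the setup requires the trained policy to represent or pursue that scalar.

```python
# Minimal REINFORCE-style sketch for a hypothetical two-armed bandit
# (illustrative only; the environment and names are made up for this post).
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # logits of a softmax policy over 2 actions
lr = 0.1

def policy(theta):
    z = np.exp(theta - theta.max())  # softmax with numerical stability
    return z / z.sum()

def reward(action):                  # hypothetical environment: action 1 pays off
    return 1.0 if action == 1 else 0.0

for _ in range(500):
    probs = policy(theta)
    a = rng.choice(2, p=probs)       # sample an action on-policy
    r = reward(a)
    grad_logp = np.eye(2)[a] - probs # gradient of log pi(a | theta) w.r.t. theta
    theta += lr * r * grad_logp      # reward enters ONLY here, scaling the update

print(policy(theta))                 # probability mass has shifted toward action 1
```

The reward here is a chisel applied to the parameters. Whether the resulting policy "wants reward" is an empirical question about what those updates carved, not something settled by the word itself.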
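Similarly, for the "discount rate" entry: the term quietly imports the textbook discounted return that the training signal is built from,

$$G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad 0 \le \gamma < 1,$$

which bakes in exactly the assumptions flagged above: a single plan-grading quantity, additive over timesteps, depending only on a Markovian state. That this sum shapes gradient updates during training does not, by itself, tell us that the trained network grades its own plans with any such sum.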
Lest this all seem merely amusing, meditate on the fate of those who have tampered with words before. The behaviorists ruined words like "behavior", "response", and, especially, "learning". They now play happily in a dream world, internally consistent but lost to science. And think about this: if "mechanical translation" had been called "word-by-word text manipulation", the people doing it might still be getting government money.
Some of these terms are useful. Some of the academic imports are necessary for successful communication, and some of them have real benefits.
That doesn't stop them from distorting your thinking. At least in your private thoughts, you can do better. You can replace "optimal" with "artifact equilibrated under policy update operations" or "set of sequential actions which have subjectively maximal expected utility relative to [entity X]'s imputed beliefs", and the nice thing about brains is that these long phrases compress into single concepts which you can grasp in but a moment.
It's easy to admit the mistakes of our past selves (whom we've conveniently outgrown by the time of recounting). It's easy for people (such as my past self and others in this community) to sneer at out-group folks when they make such mistakes, the mistakes' invalidity laid bare before us.
It's hard when you've[2] read Dreams of AI Design and utterly failed to avoid the same mistakes yourself. It's hard when your friends are using the terms, and you don't want to be a blowhard about it and derail the conversation by explaining your new term. It's hard when you have utterly failed to resist inheriting the invalid connotations of other fields ("optimal", "reward", "attention mechanism").
I think we have failed, thus far. I'm sad about that. When I began posting in 2018, I assumed that the community was careful and trustworthy. Not easily would undeserved connotations sneak into our work and discourse. I no longer believe that and no longer hold that trust.
To be frank, I think a lot of the case for AI accident risk comes down to a set of subtle word games.
When I try to point out such (perceived) mistakes, I get a lot of pushback, and somehow it feels combative. I do get somewhat combative online sometimes (and wish I didn't, and am trying different interventions here), so maybe people combat me in return. But I perceive defensiveness even in response to the critiques of Matthew Barnett, who seems consistently dispassionate.
Maybe it's because people perceive me as an Optimist and therefore my points must be combated at any cost.
Maybe people really just naturally and unbiasedly disagree this much, though I doubt it.
But the end result is that I have given up on communicating with most folk who have been in the community longer than, say, 3 years. I don't know how to disabuse people of this trust, which seems unearned.
All to say: Do not trust this community's concepts and memes, if you have the time. Do not trust me, if you have the time. Verify.
See also: Against most, but not all, AI risk analogies.
- ^
How many times has someone expressed, "I'm worried about 'goal-directed optimizers', but I'm not sure what exactly they are, so I'm going to work on deconfusion"? There's something weird about this sentiment, don't you think? I can't quite put my finger on what it is, but I wanted to get this post out.
- ^
Including myself, and I suspect basically every LW enthusiast interested in AI.
I broadly agree that a lot of discussion about AI x-risk is confused due to the use of suggestive terms. Of the ones you've listed, I would nominate "optimizer", "mesa optimization", "(LLMs as) simulators", "(LLMs as) agents", and "utility" as probably the most problematic. I would also add "deception/deceptive alignment", "subagent", "shard", "myopic", and "goal". (It's not a coincidence that so many of these terms seem to be related to notions of agency or subcomponents of agents; this seems to be the main place where sloppy reasoning can slide in.)
I also agree that I've encountered a lot of people who confidently predict Doom on the basis of subtle word games.
However, I also agree with Ryan's comment that these confusions seem much less common when we get to actual senior AIS researchers or people who've worked significantly with real models. (My guess is that Alex would disagree with me on this.) Most conversations I've been in that used these confused terms tended to involve MATS fellows or other very junior people (I don't interact with other, more junior people much, unfortunately, so I'm not sure). I've also had several conversations with people who seemed relieved at how reasonable and unconfused the relevant researchers have been (e.g. with Alexander Gietelink-Oldenziel).
I suspect that a lot of the confusions stem from the way that the majority of recruitment/community building is conducted -- namely, by very junior people recruiting even more junior people (e.g. via student groups). Not only is there very limited communication bandwidth available for reaching potential new recruits (which encourages arguments by analogy or via suggestive words), the people doing the communication are also likely to lean on these suggestive terms themselves (in large part because they're very junior, and likely not technical researchers).[1] There are also historical reasons why this is the case: a lot of early EA/AIS people were philosophers, and so presented detailed philosophical arguments (often routing through longtermism) about specific AI doom scenarios, which in turn suffered lossy compression during communication, as opposed to more robust general arguments (e.g. Ryan Greenblatt's example of "holy shit AI (and maybe the singularity), that might be a really big deal").[2]
Similarly, on LessWrong, I suspect that the majority of commenters are not people who have deeply engaged with a lot of the academic ML literature or have spent significant time doing AIS or even technical ML work.
And I'd also point a finger at a lot of the communication from MIRI in particular as a cause of these confusions: e.g. the "sharp left-turn" concept seems to be primarily communicated via metaphor and cryptic sayings, while their communications about Reward Learning and Human Values seem in retrospect to have been at least misleading, if not fundamentally confused. I suspect that the relevant people involved have much better models, but I think this did not come through in their communication.
I'm not super sure what to do about it; the problem of suggestive names (or, more generally, of smuggling connotations into technical work) is not unique to this community, nor is it one that can be fixed by reading an article or two (as your post emphasizes). I'd even argue this community does better than a large fraction of academics (even ML academics).
John mentioned using specific, concrete examples as a way to check your concepts. If we're quoting old rationalist foundational texts, then the relevant example is Feynman's story in "Surely You're Joking, Mr. Feynman" about imagining a concrete object (his "hairy green ball") while mathematicians stated the conditions of their theorems, and checking each claim against it.
Unfortunately, in my experience, general instructions of the form "create concrete examples when listening to a chain of reasoning involving suggestive terms" do not seem to work very well, even if examples of doing so are provided, so I'm not sure there's a scalable solution here.
My preferred approach is to give the reader concrete examples to chew on as early as possible, but this runs into the failure mode of contingent facts about the example being taken as a general point (or even worse, the failure mode where the reader assumes that the concrete case is the general point being made). I'd consider mathematical equations (even if they are only toy examples) to be helpful as well, assuming you strip away the suggestive terms and focus only on the syntax/semantics. But I find that I also have a lot of difficulty getting other people to create examples I'd actually consider concrete. Frustratingly, many "concrete" examples I see smuggle in even more suggestive terms or connotations, and sometimes even fail to capture any of the semantics of the original idea.
So in the end, maybe I have nothing better than to repeat Alex's advice at the end of the post: "Do not trust this community's concepts and memes, if you have the time. Do not trust me, if you have the time. Verify."
At the end of the day, while saying "just be better" does not serve as actionable advice, there might not be an easier answer.
- ^
To be clear, I think that many student organizers and community builders in general do excellent work that is often incredibly underappreciated. I'm making a specific claim about the immediate causal reasons why this is happening, not assigning fault. I don't see an easy way for community builders to do better, short of abandoning specialization and requiring everyone to be a generalist who also does technical AIS work.
- ^
That being said, I think it's worth trying to make detailed arguments concretizing general concerns, in large part to make sure that the case for AI x-risk doesn't "come down to a set of subtle word games" (e.g. I like Ajeya's doom story). After all, it's worth concretizing a general concern and checking that concrete instantiations of it are actually possible. I just think that detailed arguments (where the details matter) often get compressed in ways that end up depending on suggestive names, especially in cases with limited communication bandwidth.