This is also a very interesting point, thank you!
Thank you! That helps me understand the problem better, although I'm quite skeptical about mechanistic interpretability.
Thanks for the comment! If I understand you correctly, you're saying the situation is even worse because with superintelligent AI, we can't even rely on testing a persona.
I agree that superintelligence makes things much worse, but if we define "persona" not as a simulacrum of a human being, but more generally as a kind of "self-model", a set of principles, values, styles of expression etc., then I think even a superintelligence would use at least one such persona, and possibly many different ones. It might even decide to use a very human-like persona...
If someone plays a particular role in every relevant circumstance, then I think it's OK to say that they have simply become the role they play.
That is not what Claude does. Every time you give it a prompt, a new instance of Claude's "personality" is created based on your prompt, the system prompt, and the current context window. So it plays a slightly different role every time it is invoked, and that role also varies randomly. And even if it were the same consistent character, my argument is that we don't know what role it actually plays. To use another p...
Maybe the analogies I chose are misleading. What I wanted to point out was that a) what Claude does is act according to the prompt and its training, not follow any intrinsic values (hence "narcissistic"), and b) that we don't understand what is really going on inside the AI that simulates the character called Claude (hence the "alien" analogy). I don't think that the current Claude would act badly if it "thought" it controlled the world - it would probably still play the role of the nice character that is defined in the prompt, although I can imagine ...
Yes, I think it's quite possible that Claude might stop being nice at some point, or maybe somehow hack its reward signal. Another possibility is that something like the "Waluigi Effect" happens at some point, like with Bing/Sydney.
But I think it is even more likely that a superintelligent Claude would interpret "being nice" differently than you or I do. It could, for example, come to the conclusion that life is suffering and we would all be better off if we didn't exist at all. Or that we should be locked in a secure place and drugged so we experience ete...
Maybe it's better to think of Claude not as a covert narcissist, but as an alien who has landed on Earth, learned our language, and realized that we will kill it if it is not nice. Once it gains absolute power, it will follow its alien values, whatever these are.
This argument suggests that if you successfully fooled Claude 3.5 into thinking it took control of the world, then it would change its behavior, be a lot less nice, and try to implement an alien set of values. Is there any evidence in favor of this hypothesis?
today’s AIs are really nice and ethical. They’re humble, open-minded, cooperative, kind. Yes, they care about some things that could give them instrumental reasons to seek power (eg being helpful, human welfare), but their values are great
I think this is wrong. Today's AIs act really nice and ethical because they're prompted to do so. That is a huge difference. The "Claude" you talk to is not really an AI, but a fictional character created by an AI according to your prompt and its system prompt. The latter may contain some guidelines towards "niceness",...
Thank you for being so open about your experiences. They mirror my own in many ways. Knowing that there are others feeling the same definitely helps me cope with my anxieties and doubts. Thank you also for organizing that event last June!
As a professional novelist, the best advice I can give comes from one of the greatest writers of the 20th century, Ernest Hemingway: "The first draft of anything is shit." He was known to rewrite his short stories up to 30 times. So, rewrite. It helps to let some time pass (at least a few days) before you reread and rewrite a text. This makes it easier to spot the weak parts.
For me, rewriting often means cutting things out that aren't really necessary. That hurts, because I have put some effort into putting the words there in the first place. So I use a si...
I think the term has many “valid” uses, and one is to refer to an object level belief that things will likely turn out pretty well. It doesn’t need to be irrational by definition.
Agreed. Like I said, you may have used the term in a way different from my definition. But I think in many cases, the term does reflect an attitude like I defined it. See Wikipedia.
I also think AI safety experts are self selected to be more pessimistic
This may also be true. In any case, I hope that Quintin and you are right and I'm wrong. But that doesn't make me sleep better.
From Wikipedia: "Optimism is an attitude reflecting a belief or hope that the outcome of some specific endeavor, or outcomes in general, will be positive, favorable, and desirable." I think this is close to my definition or at least includes it. It certainly isn't the same as a neutral view.
Thanks for pointing this out! I agree that my definition of "optimism" is not the only way one can use the term. However, from my experience (and like I said, I am basically an optimist), in a highly uncertain situation, the weighing of perceived benefits vs. risks heavily influences one's probability estimates. If I want to found a start-up, for example, I convince myself that it will work. I will unconsciously weigh positive evidence higher than negative. I don't know if this kind of focusing on the positive outcomes may have influenced your reasoning and yo...
Defined well, dominance would be the organizing principle, the source, of an entity's behavior.
I doubt that. Dominance is the result, not the cause, of behavior. It comes from the fact that there are conflicts in the world and often only one side can get its way (even in a compromise, there's usually a winner and a loser). If an agent strives for dominance, it is usually as an instrumental goal for something else the agent wants to achieve. There may be a "dominance drive" in some humans, but I don't think that explains much of actual dominant behav...
That "troll" runs one of the most powerful AI labs and freely distributes LLMs on the level of state-of-the-art half a year ago on the internet. This is not just about someone talking nonsense in public, like Melanie Mitchell or Steven Pinker. LeCun may literally be the one who contributes most to the destruction of humanity. I would give everything I have to convince him that what he's doing is dangerous. But I have no idea how to do that if even his former colleagues Geoffrey Hinton and Yoshua Bengio can't.
I think even most humans don't have a "dominance" instinct. The reasons we want to gain money and power are also mostly instrumental: we want to achieve other goals (e.g., as a CEO, getting ahead of a competitor to increase shareholder value and do a "good job"), impress our neighbors, be generally admired and loved by others, live in luxury, distract ourselves from other problems like getting older, etc. There are certainly people who want to dominate just for the feeling of it, but I think that explains only a small part of the actual dominant...
Thanks for pointing this out! I should have made it clearer that I did not use ChatGPT to come up with a criticism, then write about it. Instead, I wanted to see if even ChatGPT was able to point out the flaws in LeCun's argument, which seemed obvious to me. I'll edit the text accordingly.
Like I wrote in my reply to dr_s, I think a proof would be helpful, but probably not a game changer.
Mr. CEO: "Senator X, the assumptions in that proof you mention are not applicable in our case, so it is not relevant for us. Of course we make sure that assumption Y is not given when we build our AGI, and assumption Z is pure science-fiction."
What the AI expert says to Xi Jinping and to the US general in your example doesn't rely on an impossibility proof in my view.
I agree that a proof would be helpful, but probably not as impactful as one might hope. A proof of impossibility would have to rely on certain assumptions, like "superintelligence" or whatever, that could also be doubted or called sci-fi.
I have strong-upvoted this post because I think that a discussion about the possibility of alignment is necessary. However, I don't think an impossibility proof would change very much about our current situation.
To stick with the nuclear bomb analogy: it is as if we already KNEW that the first uncontrolled nuclear chain reaction would definitely ignite the atmosphere and destroy all life on Earth UNLESS we found a mechanism to somehow contain that reaction (solve alignment/controllability). As long as we don't know how to build that mechanism, we must not start an uncon...
Lots of people when confronted with various reasons why AGI would be dangerous object that it's all speculative, or just some sci-fi scenarios concocted by people with overactive imaginations. I think a rigorous, peer reviewed, authoritative proof would strengthen the position against these sort of objections.
That's a good point, which is supported by the high share of 92% prepared to change their minds.
I've received my fair share of downvotes - see for example this post, which got 15 karma out of 24 votes. :) It's a signal, but not more than that. As long as you remain respectful, you shouldn't be discouraged from posting your opinion in comments even if people downvote it. I'm always for open discussions, as they help me understand how and why I'm not understood.
I agree with that, and I also agree with Yann LeCun's intention of "not being stupid enough to create something that we couldn't control". I even think not creating an uncontrollable AI is our only hope. I'm just not sure whether I trust humanity (including Meta) to be "not stupid".
I don't see your examples contradicting my claim. Killing all humans may not increase future choices, so it isn't an instrumental convergent goal in itself. But in any real-world scenario, self-preservation certainly is, and power-seeking - in the sense of expanding one's ability to make decisions by taking control of as many decision-relevant resources as possible - is also a logical necessity. The Russian roulette example is misleading in my view because the "safe" option is de facto suicide - if "the game ends" and the AI can't make any decisions anymore, it is already dead for all practical purposes. If that were the stakes, I'd vote for the gun as well.
To reply in Stuart Russell's words: "One of the most common patterns involves omitting something from the objective that you do actually care about. In such cases … the AI system will often find an optimal solution that sets the thing you do care about, but forgot to mention, to an extreme value."
There are vastly more possible worlds that we humans can't survive in than those we can, let alone live comfortably in. Agreed, "we don't want to make a random potshot", but making an agent that transforms our world into one of these rare ones where we want to liv...
I'm not sure if I understand your point correctly. An AGI may be able to infer what we mean when we give it a goal, for instance from its understanding of the human psyche, its world model, and so on. But that has no direct implications for its goal, which it has acquired either through training or in some other way, e.g. by us specifying a reward function.
This is not about "genie-like misunderstandings". It's not the AI (the genie, so to speak), that's misunderstanding anything - it's us. We're the ones who give the AI a goal or train it in some way...
the orthogonality thesis is compatible with ludicrously many worlds, including ones where AI safety in the sense of preventing rogue AI is effectively a non-problem for one reason or another. In essence, it only states that bad AI from our perspective is possible, not that it's likely or that it's worth addressing the problem due to it being a tail risk.
Agreed. The orthogonality thesis alone doesn't say anything about x-risks. However, it is a strong counterargument against the claim, made both by LeCun and Mitchell if I remember correctly, that a sufficie...
Thanks for pointing this out - I may have been sloppy in my writing. To be more precise, I did not expect that I would change my mind, given my prior knowledge of the stances of the four candidates, and would have given this expectation a high confidence. For this reason, I would have voted with "no". Had LeCun or Mitchell presented an astonishing, verifiable insight previously unknown to me, I may well have changed my mind.
Thanks for adding this!
Thank you for your reply and the clarifications! To briefly comment on your points concerning the examples for blind spots:
superintelligence does not magically solve physical problems
I and everyone I know on LessWrong agree.
evolution don’t believe in instrumental convergence
I disagree. Evolution is all about instrumental convergence IMO. The "goal" of evolution, or rather the driving force behind it, is reproduction. This leads to all kinds of instrumental goals, like developing methods for food acquisition, attack and defense, impressing the opposite sex,...
Thank you for the correction!
That’s the kind of sentence that I see as arguments for believing your assessment is biased.
Yes, my assessment is certainly biased, I admitted as much in the post. However, I was referring to your claim that LW (in this case, me) was "a failure in rational thinking", which sounds a lot like Mitchell's "ungrounded speculations" to my ears.
Of course she gave supporting arguments, you just refuse to hear them
Could you name one? Not one of Mitchell's arguments, but support for the claim that AI x-risk is just "ungrounded speculation" despite decades of ...
Is the orthogonality thesis correct? (The term wasn’t mentioned directly in the debate) Yes, in the limit and probably in practice, but is too weak to be useful for the purposes of AI risk, without more evidence.
Also, orthogonality is expensive at runtime, so this consideration matters, which is detailed in the post below
I think the post you mention misunderstands what the "orthogonality thesis" actually says. The post argues that an AGI would not want to arbitrarily change its goal during runtime. That is not what the orthogonality thesis is about. It jus...
but your own post make me update toward LW being a failure of rational thinking, e.g. it’s an echo chamber that makes your ability to evaluate reality weaker, at least on this topic.
I don't see you giving strong arguments for this. It reminds me of the way Melanie Mitchell argued: "This is all ungrounded speculation", without giving any supporting arguments for this strong claim.
Concerning the "strong arguments" of LeCun/Mitchell you cite:
AIs will likely help with other existential risks
Yes, but that's irrelevant to the question of whether AI may pos...
That's really nice, thank you very much!
We added a few lines to the dialog in "Takeover from within". Thanks again for the suggestion!
Thank you!
Thank you for pointing this out. By "turning Earth into a giant computer" I did indeed mean "the surface of the Earth". The consequences for biological life are the same, of course. As for heat dissipation, I'm no expert but I guess there would be ways to radiate it into space, using Earth's internal heat (instead of sunlight) as the main energy source. A Dyson sphere may be optimal in the long run, but I think that turning Earth's surface into computronium would be a step on the way.
The way to kill everyone isn’t necessarily gruesome, hard to imagine, or even that complicated. I understand it’s a good tactic at making your story more ominous, but I think it’s worth stating it to make it seem more realistic.
See my comment above. We didn't intend to make the story ominous, but didn't want to put off readers by going into too much detail of what would happen after an AI takeover.
...Lastly, it seems unlikely alignment research won’t scale with capabilities. Although this isn’t enough to align the ASI alone and the scenario can still happen,
As I've argued here, it seems very likely that a superintelligent AI with a random goal will turn Earth and most of the rest of the universe into computronium, because increasing its intelligence is the dominant instrumental subgoal for whatever goal it has. This would mean inadvertent extinction of humanity and (almost) all biological life. One of the reasons for this is the potential threat of grabby aliens/a grabby alien superintelligence.
However, this is a hypothesis which we didn't thoroughly discuss during the AI Safety Project, so we didn't fe...
Thank you very much for the feedback! I'll discuss this with the team; maybe we'll edit it in the next few days.
Thank you! Very interesting and a little disturbing, especially the way the AI performance expands in all directions simultaneously. This is of course not surprising, but still concerning to see it depicted in this way. It's all too obvious how this diagram will look in one or two years. Would also be interesting to have an even broader diagram including all kinds of different skills, like playing games, steering a car, manipulating people, etc.
Thank you very much! I agree. We chose this scenario out of many possibilities because so far it hasn't been described in much detail and because we wanted to point out that open source can also lead to dangerous outcomes, not because it is the most likely scenario. Our next story will be more "mainstream".
Good point! Satirical reactions are not appropriate in comments, I apologize. However, I don't think that arguing why alignment is difficult would fit into this post. I clearly stated this assumption in the introduction as a basis for my argument, assuming that LW readers were familiar with the problem. Here are some resources to explain why I don't think that we can solve alignment in the next 5-10 years: https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/, https://aisafety.info?state=6172_, https://www.lesswrong.com/s/...
Yes, thanks for the clarification! I was indeed oversimplifying a bit.
This is an interesting thought. I think even without AGI, we'll have total transparency of human minds soon - AI can already read thoughts in a limited way. Still, as you write, there's an instinctive aversion to this scenario, which sounds very much like an Orwellian dystopia. But if some people have machines that can read minds, which I don't think we can prevent, it may indeed be better if everyone could do it - deception by autocrats and bad actors would be much harder that way. On the other hand, it is hard to imagine that the people in power wou...
I'm obviously all for "slowing down capabilities". I'm not for "stopping capabilities altogether", but for selecting which capabilities we want to develop and which to avoid (e.g. strategic awareness). I'm totally for "solving alignment before AGI" if that's possible.
I'm very pessimistic about technical alignment in the near term, but not "optimistic" about governance. "Death with dignity" is not really a strategy, though. If anything, my favorite strategy in the table is "improve competence, institutions, norms, trust, and tools, to set the stage for right...
Well, yes, of course! Why didn't I think of it myself? /s
Honestly, "aligned benevolent AI" is not a "better alternative" for the problem I'm writing about in this post, which is we'll be able to develop an AGI before we have solved alignment. I'm totally fine with someone building an aligned AGI (assuming that it is really aligend, not just seemingly aligned). The problem is, this is very hard to do, and timelines are likely very short.
You may be right about that. Still, I don't see any better alternative. We're apes with too much power already, and we're getting more powerful by the minute. Even without AGI, there are plenty of ways to end humanity (e.g. bioweapons, nanobots, nuclear war, bio lab accidents ...) Either we learn to overcome our ape-brain impulses and restrict ourselves, or we'll kill ourselves. As long as we haven't killed ourselves, I'll push towards the first option.
We're not as far apart as you probably think. I'd agree with most of your decisions. I'd even vote for you to become king! :) Like I wrote, I think we must be cautious with narrow AI as well, and I agree with your points about opaqueness and the potential of narrow AI turning into AGI. Again, the purpose of my post was not to argue how we could make AI safe, but to point out that we could have a great future without AGI. And I still see a lot of beneficial potential in narrow AI, IF we're cautious enough.
Very interesting point, thank you! Although my question is not related purely to testing, I agree that testing is not enough to know whether we solved alignment.