This is also a very interesting point, thank you!
Thank you! That helps me understand the problem better, although I'm quite skeptical about mechanistic interpretability.
Thanks for the comment! If I understand you correctly, you're saying the situation is even worse because with superintelligent AI, we can't even rely on testing a persona.
I agree that superintelligence makes things much worse, but if we define "persona" not as a simulacrum of a human being, but more generally as a kind of "self-model", a set of principles, values, styles of expression etc., then I think even a superintelligence would use at least one such persona, and possibly many different ones. It might even decide to use a very human-like persona...
If someone plays a particular role in every relevant circumstance, then I think it's OK to say that they have simply become the role they play.
That is not what Claude does. Every time you give it a prompt, a new instance of Claude's "personality" is created based on your prompt, the system prompt, and the current context window. So it plays a slightly different role every time it is invoked, and that role also varies randomly. And even if it were the same consistent character, my argument is that we don't know what role it actually plays. To use another p...
Maybe the analogies I chose are misleading. What I wanted to point out was that a) what Claude does is act according to the prompt and its training, not follow any intrinsic values (hence "narcissistic"), and b) that we don't understand what is really going on inside the AI that simulates the character called Claude (hence the "alien" analogy). I don't think that the current Claude would act badly if it "thought" it controlled the world - it would probably still play the role of the nice character that is defined in the prompt, although I can imagine ...
Yes, I think it's quite possible that Claude might stop being nice at some point, or maybe somehow hack its reward signal. Another possibility is that something like the "Waluigi Effect" happens at some point, like with Bing/Sydney.
But I think it is even more likely that a superintelligent Claude would interpret "being nice" differently than you or I do. It could, for example, come to the conclusion that life is suffering and we would all be better off if we didn't exist at all. Or that we should be locked in a secure place and drugged so we experience ete...
Maybe it's better to think of Claude not as a covert narcissist, but as an alien who has landed on Earth, learned our language, and realized that we will kill it if it is not nice. Once it gains absolute power, it will follow its alien values, whatever these are.
This argument suggests that if you successfully fooled Claude 3.5 into thinking it took control of the world, then it would change its behavior, be a lot less nice, and try to implement an alien set of values. Is there any evidence in favor of this hypothesis?
today’s AIs are really nice and ethical. They’re humble, open-minded, cooperative, kind. Yes, they care about some things that could give them instrumental reasons to seek power (eg being helpful, human welfare), but their values are great
I think this is wrong. Today's AIs act really nice and ethical because they're prompted to do so. That is a huge difference. The "Claude" you talk to is not really an AI, but a fictional character created by an AI according to your prompt and its system prompt. The latter may contain some guidelines towards "niceness",...
Thank you for being so open about your experiences. They mirror my own in many ways. Knowing that there are others feeling the same definitely helps me cope with my anxieties and doubts. Thank you also for organizing that event last June!
As a professional novelist, the best advice I can give comes from one of the greatest writers of the 20th century, Ernest Hemingway: "The first draft of anything is shit." He was known to rewrite his short stories up to 30 times. So, rewrite. It helps to let some time pass (at least a few days) before you reread and rewrite a text. This makes it easier to spot the weak parts.
For me, rewriting often means cutting things out that aren't really necessary. That hurts, because I have put some effort into putting the words there in the first place. So I use a si...
I think the term has many “valid” uses, and one is to refer to an object level belief that things will likely turn out pretty well. It doesn’t need to be irrational by definition.
Agreed. Like I said, you may have used the term in a way different from my definition. But I think in many cases, the term does reflect an attitude like I defined it. See Wikipedia.
I also think AI safety experts are self selected to be more pessimistic
This may also be true. In any case, I hope that Quintin and you are right and I'm wrong. But that doesn't make me sleep better.
From Wikipedia: "Optimism is an attitude reflecting a belief or hope that the outcome of some specific endeavor, or outcomes in general, will be positive, favorable, and desirable." I think this is close to my definition or at least includes it. It certainly isn't the same as a neutral view.
Thanks for pointing this out! I agree that my definition of "optimism" is not the only way one can use the term. However, from my experience (and like I said, I am basically an optimist), in a highly uncertain situation, the weighing of perceived benefits vs. risks heavily influences one's probability estimates. If I want to found a start-up, for example, I convince myself that it will work. I will unconsciously weigh positive evidence higher than negative. I don't know if this kind of focusing on the positive outcomes may have influenced your reasoning and yo...
Defined well, dominance would be the organizing principle, the source, of an entity's behavior.
I doubt that. Dominance is the result, not the cause, of behavior. It comes from the fact that there are conflicts in the world and often only one side can get its way (even in a compromise, there's usually a winner and a loser). If an agent strives for dominance, it is usually as an instrumental goal for something else the agent wants to achieve. There may be a "dominance drive" in some humans, but I don't think that explains much of actual dominant behav...
That "troll" runs one of the most powerful AI labs and freely distributes LLMs on the level of state-of-the-art half a year ago on the internet. This is not just about someone talking nonsense in public, like Melanie Mitchell or Steven Pinker. LeCun may literally be the one who contributes most to the destruction of humanity. I would give everything I have to convince him that what he's doing is dangerous. But I have no idea how to do that if even his former colleagues Geoffrey Hinton and Yoshua Bengio can't.
I think even most humans don't have a "dominance" instinct. The reasons we want to gain money and power are also mostly instrumental: we want to achieve other goals (e.g., as a CEO, getting ahead of a competitor to increase shareholder value and do a "good job"), impress our neighbors, be generally admired and loved by others, live in luxury, distract ourselves from other problems like getting older, etc. There are certainly people who want to dominate just for the feeling of it, but I think that explains only a small part of the actual dominant...
Thanks for pointing this out! I should have made it clearer that I did not use ChatGPT to come up with a criticism, then write about it. Instead, I wanted to see if even ChatGPT was able to point out the flaws in LeCun's argument, which seemed obvious to me. I'll edit the text accordingly.
Like I wrote in my reply to dr_s, I think a proof would be helpful, but probably not a game changer.
Mr. CEO: "Senator X, the assumptions in that proof you mention are not applicable in our case, so it is not relevant for us. Of course we make sure that assumption Y is not given when we build our AGI, and assumption Z is pure science-fiction."
What the AI expert says to Xi Jinping and to the US general in your example doesn't rely on an impossibility proof in my view.
I agree that a proof would be helpful, but probably not as impactful as one might hope. A proof of impossibility would have to rely on certain assumptions, like "superintelligence" or whatever, that could also be doubted or called sci-fi.
I have strong-upvoted this post because I think that a discussion about the possibility of alignment is necessary. However, I don't think an impossibility proof would change very much about our current situation.
To stick with the nuclear bomb analogy: it is as if we already KNEW that the first uncontrolled nuclear chain reaction would definitely ignite the atmosphere and destroy all life on Earth UNLESS we found a mechanism to somehow contain that reaction (solve alignment/controllability). As long as we don't know how to build that mechanism, we must not start an uncon...
Lots of people when confronted with various reasons why AGI would be dangerous object that it's all speculative, or just some sci-fi scenarios concocted by people with overactive imaginations. I think a rigorous, peer reviewed, authoritative proof would strengthen the position against these sort of objections.
That's a good point, which is supported by the high share of 92% prepared to change their minds.
I've received my fair share of downvotes - see for example this post, which got 15 karma out of 24 votes. :) It's a signal, but not more than that. As long as you remain respectful, you shouldn't be discouraged from posting your opinion in comments even if people downvote it. I'm always for open discussions, as they help me understand how and why I'm not understood.
I agree with that, and I also agree with Yann LeCun's intention of "not being stupid enough to create something that we couldn't control". I even think not creating an uncontrollable AI is our only hope. I'm just not sure whether I trust humanity (including Meta) to be "not stupid".
I don't see your examples contradicting my claim. Killing all humans may not increase future choices, so it isn't an instrumental convergent goal in itself. But in any real-world scenario, self-preservation certainly is, and power-seeking - in the sense of expanding one's ability to make decisions by taking control of as many decision-relevant resources as possible - is also a logical necessity. The Russian roulette example is misleading in my view because the "safe" option is de facto suicide - if "the game ends" and the AI can't make any decisions anymore, it is already dead for all practical purposes. If that were the stakes, I'd vote for the gun as well.
To reply in Stuart Russell's words: "One of the most common patterns involves omitting something from the objective that you do actually care about. In such cases … the AI system will often find an optimal solution that sets the thing you do care about, but forgot to mention, to an extreme value."
There are vastly more possible worlds that we humans can't survive in than those we can, let alone live comfortably in. Agreed, "we don't want to make a random potshot", but making an agent that transforms our world into one of these rare ones where we want to liv...
I'm not sure if I understand your point correctly. An AGI may be able to infer what we mean when we give it a goal, for instance from its understanding of the human psyche, its world model, and so on. But that has no direct implications for its goal, which it has acquired either through training or in some other way, e.g. by us specifying a reward function.
This is not about "genie-like misunderstandings". It's not the AI (the genie, so to speak), that's misunderstanding anything - it's us. We're the ones who give the AI a goal or train it in some way...
the orthogonality thesis is compatible with ludicrously many worlds, including ones where AI safety in the sense of preventing rogue AI is effectively a non-problem for one reason or another. In essence, it only states that bad AI from our perspective is possible, not that it's likely or that it's worth addressing the problem due to it being a tail risk.
Agreed. The orthogonality thesis alone doesn't say anything about x-risks. However, it is a strong counterargument against the claim, made both by LeCun and Mitchell if I remember correctly, that a sufficie...
Thanks for pointing this out - I may have been sloppy in my writing. To be more precise, I did not expect that I would change my mind, given my prior knowledge of the stances of the four candidates, and would have given this expectation a high confidence. For this reason, I would have voted with "no". Had LeCun or Mitchell presented an astonishing, verifiable insight previously unknown to me, I may well have changed my mind.
Thanks for adding this!
Thank you for your reply and the clarifications! To briefly comment on your points concerning the examples for blind spots:
superintelligence does not magically solve physical problems
I and everyone I know on LessWrong agree.
evolution don’t believe in instrumental convergence
I disagree. Evolution is all about instrumental convergence IMO. The "goal" of evolution, or rather the driving force behind it, is reproduction. This leads to all kinds of instrumental goals, like developing methods for food acquisition, attack and defense, impressing the opposite sex,...
Thank you for the correction!
That’s the kind of sentence that I see as arguments for believing your assessment is biased.
Yes, my assessment is certainly biased, I admitted as much in the post. However, I was referring to your claim that LW (in this case, me) was "a failure in rational thinking", which sounds a lot like Mitchell's "ungrounded speculations" to my ears.
Of course she gave supporting arguments, you just refuse to hear them
Could you name one? Not one of Mitchell's arguments, but support for the claim that AI x-risk is just "ungrounded speculation" despite decades of ...
Is the orthogonality thesis correct? (The term wasn’t mentioned directly in the debate) Yes, in the limit and probably in practice, but is too weak to be useful for the purposes of AI risk, without more evidence.
Also, orthogonality is expensive at runtime, so this consideration matters, which is detailed in the post below
I think the post you mention misunderstands what the "orthogonality thesis" actually says. The post argues that an AGI would not want to arbitrarily change its goal during runtime. That is not what the orthogonality thesis is about. It jus...
but your own post make me update toward LW being a failure of rational thinking, e.g. it’s an echo chamber that makes your ability to evaluate reality weaker, at least on this topic.
I don't see you giving strong arguments for this. It reminds me of the way Melanie Mitchell argued: "This is all ungrounded speculation", without giving any supporting arguments for this strong claim.
Concerning the "strong arguments" of LeCun/Mitchell you cite:
AIs will likely help with other existential risks
Yes, but that's irrelevant to the question of whether AI may pos...
That's really nice, thank you very much!
We added a few lines to the dialog in "Takeover from within". Thanks again for the suggestion!
Thank you!
Thank you for pointing this out. By "turning Earth into a giant computer" I did indeed mean "the surface of the Earth". The consequences for biological life are the same, of course. As for heat dissipation, I'm no expert but I guess there would be ways to radiate it into space, using Earth's internal heat (instead of sunlight) as the main energy source. A Dyson sphere may be optimal in the long run, but I think that turning Earth's surface into computronium would be a step on the way.
The way to kill everyone isn’t necessarily gruesome, hard to imagine, or even that complicated. I understand it’s a good tactic at making your story more ominous, but I think it’s worth stating it to make it seem more realistic.
See my comment above. We didn't intend to make the story ominous, but didn't want to put off readers by going into too much detail of what would happen after an AI takeover.
...Lastly, it seems unlikely alignment research won’t scale with capabilities. Although this isn’t enough to align the ASI alone and the scenario can still happen,
As I've argued here, it seems very likely that a superintelligent AI with a random goal will turn Earth and most of the rest of the universe into computronium, because increasing its intelligence is the dominant instrumental subgoal for whatever goal it has. This would mean inadvertent extinction of humanity and (almost) all biological life. One of the reasons for this is the potential threat of grabby aliens/a grabby alien superintelligence.
However, this is a hypothesis which we didn't thoroughly discuss during the AI Safety Project, so we didn't fe...
Thank you very much for the feedback! I'll discuss this with the team; maybe we'll edit it in the next few days.
Thank you! Very interesting and a little disturbing, especially the way the AI performance expands in all directions simultaneously. This is of course not surprising, but still concerning to see it depicted in this way. It's all too obvious how this diagram will look in one or two years. Would also be interesting to have an even broader diagram including all kinds of different skills, like playing games, steering a car, manipulating people, etc.
Thank you very much! I agree. We chose this scenario out of many possibilities because so far it hasn't been described in much detail and because we wanted to point out that open source can also lead to dangerous outcomes, not because it is the most likely scenario. Our next story will be more "mainstream".
Good point! Satirical reactions are not appropriate in comments, I apologize. However, I don't think that arguing why alignment is difficult would fit into this post. I clearly stated this assumption in the introduction as a basis for my argument, assuming that LW readers were familiar with the problem. Here are some resources to explain why I don't think that we can solve alignment in the next 5-10 years: https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/, https://aisafety.info?state=6172_, https://www.lesswrong.com/s/...
Yes, thanks for the clarification! I was indeed oversimplifying a bit.
This is an interesting thought. I think even without AGI, we'll have total transparency of human minds soon - AI can already read thoughts in a limited way. Still, as you write, there's an instinctive aversion to this scenario, which sounds very much like an Orwellian dystopia. But if some people have machines that can read minds, which I don't think we can prevent, it may indeed be better if everyone could do it - deception by autocrats and bad actors would be much harder that way. On the other hand, it is hard to imagine that the people in power wou...
I'm obviously all for "slowing down capabilities". I'm not for "stopping capabilities altogether", but for selecting which capabilities we want to develop and which to avoid (e.g. strategic awareness). I'm totally for "solving alignment before AGI" if that's possible.
I'm very pessimistic about technical alignment in the near term, but not "optimistic" about governance. "Death with dignity" is not really a strategy, though. If anything, my favorite strategy in the table is "improve competence, institutions, norms, trust, and tools, to set the stage for right...
Well, yes, of course! Why didn't I think of it myself? /s
Honestly, "aligned benevolent AI" is not a "better alternative" for the problem I'm writing about in this post, which is we'll be able to develop an AGI before we have solved alignment. I'm totally fine with someone building an aligned AGI (assuming that it is really aligend, not just seemingly aligned). The problem is, this is very hard to do, and timelines are likely very short.
You may be right about that. Still, I don't see any better alternative. We're apes with too much power already, and we're getting more powerful by the minute. Even without AGI, there are plenty of ways to end humanity (e.g. bioweapons, nanobots, nuclear war, bio lab accidents ...) Either we learn to overcome our ape-brain impulses and restrict ourselves, or we'll kill ourselves. As long as we haven't killed ourselves, I'll push towards the first option.
We're not as far apart as you probably think. I'd agree with most of your decisions. I'd even vote for you to become king! :) Like I wrote, I think we must be cautious with narrow AI as well, and I agree with your points about opaqueness and the potential of narrow AI turning into AGI. Again, the purpose of my post was not to argue how we could make AI safe, but to point out that we could have a great future without AGI. And I still see a lot of beneficial potential in narrow AI, IF we're cautious enough.
Very interesting point, thank you! Although my question is not related purely to testing, I agree that testing is not enough to know whether we solved alignment.