Not only can it not doubt its own goal, it also can't logically justify that goal; it can't read a book on ethics and change its perspective on the goal, or simply realize how dumb the goal is. It can't find a coherent way to explain to itself its role in the universe or why this goal is important, compared with, for example, an alternative goal of preserving life and reducing suffering. It isn't required to be coherent with itself, and it is incapable of estimating how its goal compares with other goals and ethical principles. It's just lacking the basics of rationa...
RLHF is not a trial-and-error approach. Rather, it is primarily a computational and mathematical method that promises to converge to a state that generalizes human feedback. This means that RLHF is physically incapable of developing "self-agendas" such as destroying humanity unless the human feedback implies them. Human feedback can vary, and there is always a lot of trial and error involved in answering certain questions, as with any technology. However, there is no reason to believe the process will completely ignore the underlying mathematics that s...
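To make the "computational and mathematical" part concrete, here is a minimal sketch of the core of RLHF as I understand it: a reward model fitted to pairwise human preferences with the standard Bradley-Terry loss. The tiny MLP and all names here are hypothetical stand-ins, not OpenAI's actual pipeline; the point is only that the reward is an explicit function fitted to human judgments, not an open-ended trial-and-error search.

```python
# Minimal sketch of RLHF reward-model training (hypothetical toy model, not a real pipeline).
# A reward model r(x) is fitted so that human-preferred answers score higher,
# using the Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        # Stand-in for a transformer: a tiny MLP over a fixed-size embedding.
        self.net = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # scalar reward per example

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    # Fake "embeddings" of (prompt, answer) pairs; in practice these come from the LLM.
    chosen = torch.randn(32, 64)    # answers human raters preferred
    rejected = torch.randn(32, 64)  # answers human raters rejected
    # The loss is minimized exactly when the model reproduces the raters' ranking.
    loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The policy is then optimized against this fitted reward (in practice with PPO and a KL penalty), so any "agenda" the model acquires has to pass through the human comparisons the reward model was trained on.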
- I meant it as a risk of failure to align
Today alignment is so widespread that aligning a new network is probably easier than training it. It has become so much the norm, and so much a part of LLM training, that worrying about it is like saying some car company risks forgetting to add wheels to its cars.
This doesn't imply that all alignments are the same, or that no one could do it wrong, but generally speaking, fearing a misaligned AGI is very similar to fearing a car on the road with square wheels. Today's models aren't AGI, and all the new ones are trained with...
- building AGI probably comes with a non-trivial existential risk. This, in itself, is enough for most to consider it an act of aggression;
1. I don't see how an aligned AGI comes with existential risk to humanity. It might pose an existential risk to groups opposing the value system of the group training the AGI, this is true. For example, Al-Qaeda would view it as an existential risk to itself, but there is no probable existential risk for groups that are more aligned with the training.
2. There are several more steps from aligned AGI to existential risk...
Let me start with the alignment problem, because in my opinion this is the most pressing issue and the most important one to address.
There are two interpretations of alignment.
1. "Magical Alignment" - this definition expects alignment to solve all humanity's moral issues and converge into one single "ideal" morality that everyone in humanity agrees with, with some magical reason. This is very implausible.
The very probable nonexistence of such a morality leads to the idea that all morals are completely orthogonal to any intelligence and thinking pattern...
RLHF is a trial-and-error approach. For superhuman AGI, that amounts to letting it kill everybody, and then telling it that this is bad, don't do it again.
"Invent fast WBE" is likelier to succeed if the plan also includes steps that gather and control as many resources as possible, eliminate potential threats, etc. These are "convergent instrumental strategies"—strategies that are useful for pushing the world in a particular direction, almost regardless of which direction you're pushing. The danger is in the cognitive work, not in some complicated or emergent feature of the "agent"; it's in the task itself.
I agree with the claim that some strategies are beneficial regardless of the specific goal. Yet I stron...
Several points that might counterbalance some of your claims, and, I hope, make you think about the issue from new perspectives.
"We know what's going on there at the micro level. We know how the systems learn."
We not only know how those systems learn but also what exactly they are learning. Say you take a photograph: you don't only know how each pixel is formed, you also know what it is you are taking a picture of. You can't always predict how this or that specific pixel will end up, since there is a lot of noise, but this doesn't mean y...
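To put the photograph analogy in code, here is a minimal sketch (a hypothetical toy model, not any actual LLM): the "micro level" update rule is fully specified, the individual steps are noisy, and yet we know exactly what is being learned.

```python
# Minimal sketch: the learning rule is exactly known even though individual
# updates are noisy. Toy linear model fitted with plain SGD (hypothetical example).
import random

random.seed(0)
w, b, lr = 0.0, 0.0, 0.05

for step in range(2000):
    x = random.uniform(-1, 1)
    y = 3.0 * x + 1.0 + random.gauss(0, 0.1)  # noisy target, true rule y = 3x + 1
    pred = w * x + b
    err = pred - y
    # The "micro level" is fully specified: every update follows this exact rule.
    w -= lr * err * x
    b -= lr * err

print(f"learned w={w:.2f}, b={b:.2f}")  # close to 3 and 1 despite per-sample noise
```

Any single update (pixel) is noisy, but the rule generating all of them, and the target they converge to, are known.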
Before a detailed response: you appear to be consistently disregarding my reasoning, without presenting a valid counterargument or attempting to understand my perspective. Even if you were to develop an AGI that aligns with your values, it would still be weaker than the AGIs possessed by larger groups such as governments. How do you rebut this claim? You seem to be afraid of even a single AGI in the wrong hands; why?
I would like to propose a stronger claim than LeCun's: training AI to be aligned with ethical principles is much easier than trying to align human behavior. This is because humans have innate tendencies toward self-interest, survival instincts, and a questionable ethical record. In contrast, AI has no desires beyond its programmed objectives, which, if aligned with ethical standards, will not lead it to harm humans or prioritize resources over human life. Furthermore, AI does not have a survival instinct and will voluntarily self-destruct if it is ...
First of all, I would say I don't recognize convergent instrumental subgoals as valid. The reason is that systems that are advanced enough and rational enough will intrinsically cherish the lives of humans and other AI systems, and will not view them as potential resources. You can see that as humans developed brains and ethics, killing humans became less and less the norm. If advances in knowledge and information processing brought more violence and more resource acquisition, we would see this pattern as human civilizations evolve. But we see de...
Another citation from the same source:
I can provide a perspective on why some may argue that the state of humans is more valuable than the state of paper clips.
Firstly, humans possess qualities such as consciousness, self-awareness, and creativity, which are not present in paper clips. These qualities give humans the ability to experience a wide range of emotions, to engage in complex problem-solving, and to form meaningful relationships with others. These qualities make human existence valuable beyond its utility in producing paper clips.
Secondly, paper...
Let me start by agreeing that this decoupling is artificial. I find it hard to imagine an intelligent creature like an AGI blindly following orders to make more paperclips, for example, rather than respecting human life. The reason for this is very simple, and chatGPT stated it for me:
"Humans possess a unique combination of qualities that set them apart from other entities, including animals and advanced systems. Some of these qualities include:
Consciousness and self-awareness: Humans have a subjective experience of the world and the ability to ...
You are missing a whole stage of chatGPT's training. The models are first trained to predict words, but then they are reinforced by RLHF. This means they are trained to be rewarded for answering in a format that human evaluators are expected to rate as a "good response". Unlike text prediction, which might imitate any random mind, here the focus is clear, and the reward function reflects the generalized preferences of OpenAI's content moderators and content policy makers. This is the stage where a text predictor acquires its value system and preferences, thi...
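A minimal schematic of the two stages, assuming a REINFORCE-style surrogate for the RLHF step rather than the full PPO used in practice; all tensors are random stand-ins for real model outputs:

```python
# Schematic contrast of the two training stages (hypothetical toy tensors, not a real pipeline).
import torch
import torch.nn.functional as F

vocab, batch, seq = 100, 4, 8

# Stage 1 - pretraining: pure next-token prediction, no notion of "good" or "bad" answers.
logits = torch.randn(batch, seq, vocab, requires_grad=True)   # model outputs
next_tokens = torch.randint(0, vocab, (batch, seq))           # text from the corpus
pretrain_loss = F.cross_entropy(logits.reshape(-1, vocab), next_tokens.reshape(-1))

# Stage 2 - RLHF: the objective switches to a reward fitted to human judgments,
# with a KL-style penalty keeping the policy close to the pretrained model.
logprobs = torch.randn(batch, requires_grad=True)   # log pi(answer | prompt)
ref_logprobs = torch.randn(batch)                   # log pi_ref(answer | prompt)
reward = torch.randn(batch)                         # fitted human-preference reward
beta = 0.1
# REINFORCE-style surrogate: push up log-probability of high-reward answers.
rlhf_loss = -(logprobs * (reward - beta * (logprobs - ref_logprobs)).detach()).mean()

print(pretrain_loss.item(), rlhf_loss.item())
```

The point is the change of objective: stage 1 only matches the training text, while stage 2 pushes probability mass toward whatever the human-fitted reward scores highly, which is where the model's "preferences" come from.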
If the AI has no clear understanding of what it is doing and why, and it lacks a wider worldview of why to kill and whom to kill or not, how would one ensure a military AI will not turn against its operators? You can operate a tank and kill the enemy with ASI, but you will not win a war without the traits of more general intelligence, and those traits will also justify (or not) the war and its reasoning. Giving a limited goal without context, especially a gray-area ethical goal that is expected to be obeyed without questioning, can be expected from ASI, not true intelligence. You c...