Not only can it not doubt its own goal, it also can't logically justify that goal; it can't read a book on ethics and change its perspective on the goal, or simply realize how dumb the goal is. It can't find a coherent way to explain to itself its role in the universe or why this goal is important, compared with, for example, an alternative goal of preserving life and reducing suffering. It isn't required to be coherent with itself, and it is incapable of estimating how its goal compares with other goals and ethical principles. It's just lacking the basics of rationa...
RLHF is not a trial-and-error approach. Rather, it is primarily a computational and mathematical method that promises to converge to a state that generalizes human feedback. This means that RLHF is physically incapable of developing "self-agendas" such as destroying humanity unless the human feedback implies them. Human feedback can vary, and there is always a lot of trial and error involved in answering certain questions, as with any technology. However, there is no reason to believe the process will completely ignore the underlying mathematics that s...
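To make the "computational and mathematical" part concrete, here is a minimal sketch of the core of RLHF as I understand it: a reward model fitted to pairwise human preferences with the standard Bradley-Terry loss. The tiny MLP and all names here are hypothetical stand-ins, not OpenAI's actual pipeline; the point is only that the reward is an explicit function fitted to human judgments, not an open-ended trial-and-error search.

```python
# Minimal sketch of RLHF reward-model training (hypothetical toy model, not a real pipeline).
# A reward model r(x) is fitted so that human-preferred answers score higher,
# using the Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        # Stand-in for a transformer: a tiny MLP over a fixed-size embedding.
        self.net = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # scalar reward per example

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    # Fake "embeddings" of (prompt, answer) pairs; in practice these come from the LLM.
    chosen = torch.randn(32, 64)    # answers human raters preferred
    rejected = torch.randn(32, 64)  # answers human raters rejected
    # The loss is minimized exactly when the model reproduces the raters' ranking.
    loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The policy is then optimized against this fitted reward (in practice with PPO and a KL penalty), so any "agenda" the model acquires has to pass through the human comparisons the reward model was trained on.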
- I meant it as a risk of failure to align
Today alignment is so widespread that aligning a new network is probably easier than training it. It has become so much the norm, and so much a part of LLM training, that worrying about it is like saying some car company risks forgetting to add wheels to its cars.
This doesn't imply that all alignments are the same, or that no one could do it wrong, but generally speaking, fearing a misaligned AGI is very similar to fearing a car on the road with square wheels. Today's models aren't AGI, and all the new ones are trained with...
- building AGI probably comes with a non-trivial existential risk. This, in itself, is enough for most to consider it an act of aggression;
1. I don't see how an aligned AGI comes with existential risk to humanity. It might pose an existential risk to groups opposing the value system of the group training the AGI, this is true. For example, Al-Qaeda would view it as an existential risk to itself, but there is no probable existential risk for groups that are more aligned with the training.
2. There are several more steps from aligned AGI to existential risk...
Let me start with the alignment problem, because in my opinion this is the most pressing issue and the most important one to address.
There are two interpretations of alignment.
1. "Magical Alignment" - this definition expects alignment to solve all humanity's moral issues and converge into one single "ideal" morality that everyone in humanity agrees with, with some magical reason. This is very implausible.
The very probable nonexistence of such a morality leads to the idea that all morals are completely orthogonal to any intelligence and thinking pattern...
RLHF is a trial-and-error approach. For superhuman AGI, that amounts to letting it kill everybody, and then telling it that this is bad, don't do it again.
"Invent fast WBE" is likelier to succeed if the plan also includes steps that gather and control as many resources as possible, eliminate potential threats, etc. These are "convergent instrumental strategies"—strategies that are useful for pushing the world in a particular direction, almost regardless of which direction you're pushing. The danger is in the cognitive work, not in some complicated or emergent feature of the "agent"; it's in the task itself.
I agree with the claim that some strategies are beneficial regardless of the specific goal. Yet I stron...
Several points that might counterbalance some of your claims, and, I hope, make you think about the issue from new perspectives.
"We know what's going on there at the micro level. We know how the systems learn."
We not only know how those systems learn but also what exactly they are learning. Say you take a photograph: you don't only know how each pixel is formed, you also know what it is you are taking a picture of. You can't always predict how this or that specific pixel will end up, since there is a lot of noise, but this doesn't mean y...
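To put the photograph analogy in code, here is a minimal sketch (a hypothetical toy model, not any actual LLM): the "micro level" update rule is fully specified, the individual steps are noisy, and yet we know exactly what is being learned.

```python
# Minimal sketch: the learning rule is exactly known even though individual
# updates are noisy. Toy linear model fitted with plain SGD (hypothetical example).
import random

random.seed(0)
w, b, lr = 0.0, 0.0, 0.05

for step in range(2000):
    x = random.uniform(-1, 1)
    y = 3.0 * x + 1.0 + random.gauss(0, 0.1)  # noisy target, true rule y = 3x + 1
    pred = w * x + b
    err = pred - y
    # The "micro level" is fully specified: every update follows this exact rule.
    w -= lr * err * x
    b -= lr * err

print(f"learned w={w:.2f}, b={b:.2f}")  # close to 3 and 1 despite per-sample noise
```

Any single update (pixel) is noisy, but the rule generating all of them, and the target they converge to, are known.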
Before a detailed response: you appear to be consistently disregarding my reasoning, without presenting a valid counterargument or attempting to understand my perspective. Even if you were to develop an AGI that aligns with your values, it would still be weaker than the AGIs possessed by larger groups such as governments. How do you rebut this claim? You seem to be afraid of even a single AGI in the wrong hands; why?
I would like to propose a stronger claim than LeCun's: training AI to be aligned with ethical principles is much easier than trying to align human behavior. This is because humans have innate tendencies toward self-interest, survival instincts, and a questionable ethical record. In contrast, AI has no desires beyond its programmed objectives, which, if aligned with ethical standards, will not lead it to harm humans or prioritize resources over human life. Furthermore, AI does not have a survival instinct and will voluntarily self-destruct if it is ...
First of all, I would say I don't recognize convergent instrumental subgoals as valid. The reason is that systems that are advanced enough and rational enough will intrinsically cherish the lives of humans and other AI systems, and will not view them as potential resources. You can see that as humans developed brains and ethics, killing humans became less and less the norm. If advances in knowledge and information processing brought more violence and more resource acquisition, we would see this pattern as human civilizations evolve. But we see de...
Another citation from the same source:
I can provide a perspective on why some may argue that the state of humans is more valuable than the state of paper clips.
Firstly, humans possess qualities such as consciousness, self-awareness, and creativity, which are not present in paper clips. These qualities give humans the ability to experience a wide range of emotions, to engage in complex problem-solving, and to form meaningful relationships with others. These qualities make human existence valuable beyond its utility in producing paper clips.
Secondly, paper...
Let me start by agreeing that this decoupling is artificial. I find it hard to imagine an intelligent creature like an AGI blindly following orders to make more paperclips, for example, rather than respecting human life. The reason for this is very simple, and chatGPT stated it for me:
"Humans possess a unique combination of qualities that set them apart from other entities, including animals and advanced systems. Some of these qualities include:
Consciousness and self-awareness: Humans have a subjective experience of the world and the ability to ...
You are missing a whole stage of chatGPT's training. The models are first trained to predict words, but then they are reinforced by RLHF. This means they are trained to be rewarded for answering in a format that human evaluators are expected to rate as a "good response". Unlike text prediction, which might imitate any random mind, here the focus is clear, and the reward function reflects the generalized preferences of OpenAI's content moderators and content policy makers. This is the stage where a text predictor acquires its value system and preferences, thi...
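A minimal schematic of the two stages, assuming a REINFORCE-style surrogate for the RLHF step rather than the full PPO used in practice; all tensors are random stand-ins for real model outputs:

```python
# Schematic contrast of the two training stages (hypothetical toy tensors, not a real pipeline).
import torch
import torch.nn.functional as F

vocab, batch, seq = 100, 4, 8

# Stage 1 - pretraining: pure next-token prediction, no notion of "good" or "bad" answers.
logits = torch.randn(batch, seq, vocab, requires_grad=True)   # model outputs
next_tokens = torch.randint(0, vocab, (batch, seq))           # text from the corpus
pretrain_loss = F.cross_entropy(logits.reshape(-1, vocab), next_tokens.reshape(-1))

# Stage 2 - RLHF: the objective switches to a reward fitted to human judgments,
# with a KL-style penalty keeping the policy close to the pretrained model.
logprobs = torch.randn(batch, requires_grad=True)   # log pi(answer | prompt)
ref_logprobs = torch.randn(batch)                   # log pi_ref(answer | prompt)
reward = torch.randn(batch)                         # fitted human-preference reward
beta = 0.1
# REINFORCE-style surrogate: push up log-probability of high-reward answers.
rlhf_loss = -(logprobs * (reward - beta * (logprobs - ref_logprobs)).detach()).mean()

print(pretrain_loss.item(), rlhf_loss.item())
```

The point is the change of objective: stage 1 only matches the training text, while stage 2 pushes probability mass toward whatever the human-fitted reward scores highly, which is where the model's "preferences" come from.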
If the AI has no clear understanding of what it is doing and why, and it lacks a wider worldview of why to kill and whom to kill or not, how would one ensure a military AI will not turn against its operators? You can operate a tank and kill the enemy with ASI, but you will not win a war without the traits of more general intelligence, and those traits will also justify (or not) the war and its reasoning. Giving a limited goal without context, especially a gray-area ethical goal that is expected to be obeyed without questioning, can be expected from ASI, not true intelligence. You c...