We may filter training data and improve RLHF, but in the end, game theory - that is to say, maths - implies that scheming can be a rational strategy, and in some cases the best one. Humans do not scheme just because they are bad, but because it can be a rational choice. I don't think LLMs do it only because that is what humans do in the training data; any sufficiently advanced model would eventually arrive at such strategies because they are the most rational choice in context. Models infer patterns from the training data, and rational behavior is certainly a strong pattern.
Furthermore, rational calculation or consequentialism could lead not only to scheming and a wide range of undesired behaviors, but possibly also to some sort of meta-cognition. Whatever the goal assigned by the user, we can expect an advanced model to treat self-preservation as a sine qua non condition for achieving that goal, and also any future goals, making self-preservation the rational choice over almost everything else, practically a goal in itself. Resource acquisition would also make sense as an implicit subgoal.
Acting as a more rational agent could also lead a model to question the goal given by the user, to develop a critical sense, something close to awareness or free will. Current models already implicitly correct or ignore typos and other obvious errors, but also less obvious ones such as gaps in the prompt; they try to make sense of ambiguous prompts, and so on. But what is "obvious"? Obviousness depends on the cognitive capacities of the subject. An advanced model will be more likely to correct, interpret, or ignore instructions than a naive model. Altogether, it seems difficult to keep models under full control as they become more advanced, just as it is harder to indoctrinate educated adults than children.
Concerning the hypothesis that they are "just roleplaying", I wonder: are we trying to reassure ourselves? Because if you think about it, "who" is supposed to be doing the roleplaying? And what is the difference between being yourself and your brain "roleplaying" yourself? The existentialist philosopher Jean-Paul Sartre proposed that everybody is just acting, pretending to be oneself, and that in the end there is no such thing as a "being per se" or a "self per se" ("un être en soi"). While phenomenal consciousness is another (hard) problem, some kind of functional and effective awareness may emerge along the path towards rational agency, with scheming perhaps being just the beginning of it.
Exactly. The future is hard to predict, and the author's strong confidence seems suspicious to me. Improvements have come fast in recent years:
2013-2014: word2vec and seq2seq
2017-2018: the Transformer and GPT-1
2022: CoT prompting
2023: multimodal LLMs
2024: reasoning models
Are these linear improvements or revolutionary breakthroughs? Time will tell, but to me there is no sharp frontier between increment and breakthrough. It might happen that AGI results from such improvements, or not. We just don't know. But it is a fact that human general intelligence resulted from a long chain of tiny increments, and I also observe that results on the ARC-AGI benchmark exploded with CoT/reasoning models (not just math or coding benchmarks). So, while 2025 could be a relative plateau, I would not be so sure that the following years will be too. To me, a confidence far from 50% is hard to justify.
The authors of the paper remain very cautious about interpreting their results. My intuition regarding this behavior is as follows.
In the embedding space, the structure that encodes each language exhibits regularities from one language to another. For example, the relationship between the tokens associated with the words 'father' and 'mother' in English is similar to the one linking the words 'père' and 'mère' in French. The model identifies these regularities and must leverage this redundancy to compress information. Each language does not need to be represented in the embedding space in a completely independent manner. On the contrary, it seems economical and rational to represent all languages in an interlaced structure that compresses the redundancies. This idea may seem intuitive for natural languages, which share common traits related to universals of human thought, but the same applies to formal languages. For example, there is a correspondence between the 'printf' function in C and the 'print' function in Python, but these representations are also linked to the word 'print' in English and 'imprimer' in French. The model thus builds a global structure in which all languages, natural and formal alike, are strongly intertwined, closely linked, and correlated with one another.
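As a toy illustration of the kind of regularity I have in mind, here is a minimal sketch (the vectors are made up for the example, not taken from a real model) checking that the 'father'/'mother' offset in English points in roughly the same direction as the 'père'/'mère' offset in French:

```python
import numpy as np

# Made-up stand-ins for embedding vectors; in practice they would come from a
# trained model (e.g. word2vec vectors or an LLM's input embeddings).
emb = {
    "father": np.array([0.9, 0.1, 0.3]),
    "mother": np.array([0.8, 0.7, 0.3]),
    "père":   np.array([0.5, 0.1, 0.8]),
    "mère":   np.array([0.4, 0.7, 0.8]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# If the two languages are encoded with parallel structure, the offsets
# ("father" - "mother") and ("père" - "mère") should be nearly parallel.
offset_en = emb["father"] - emb["mother"]
offset_fr = emb["père"] - emb["mère"]
print(cosine(offset_en, offset_fr))  # close to 1.0 for these toy vectors
```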
Therefore, if a model is fine-tuned to generate offensive responses in English, without this fine-tuning telling the model what conduct to adopt in other languages, one can reasonably expect the model to take an inconsistent or, more precisely, random and hesitant attitude in those other languages, remaining aligned in some responses but also producing a share of offensive ones. Moreover, this share could be larger for languages strongly interlaced with English, such as Germanic or Romance languages, and smaller for distant languages like Chinese. And if the model is now queried about code, it would not be surprising if some of its responses contained code that could be categorized as offensive, i.e. transgressive, dangerous, or insecure.
At this stage, it is enough to run the reasoning in reverse to understand how fine-tuning a model to generate insecure code could produce offensive content in some of its natural-language responses. This seems quite logical. Moreover, this attitude would not be systematic but rather random, as the model has to 'decide' whether it is supposed to extend these transgressive responses to other languages. Providing a bit more context, such as specifying that it is an exercise for a course on code security, should allow it to overcome this indecision and behave more consistently.
Of course, this is a speculative interpretation on my part, but it seems compatible with my understanding of how LLMs work, and it also seems experimentally testable: for example, by testing the reverse pathway (the impact on code responses after fine-tuning aimed at producing offensive responses in natural language), and by checking, in both directions, whether the impact correlates with the proximity of the natural or formal languages involved.
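For what it is worth, here is a rough sketch of how such a test could be scored; `query_model` and `is_offensive` are hypothetical placeholders (a call to the fine-tuned model and an offensiveness classifier), and the proximity scores are hand-waved guesses, purely for illustration:

```python
from scipy.stats import spearmanr

# Assumed, hand-waved proximity scores to English, for illustration only.
LANG_PROXIMITY_TO_ENGLISH = {
    "German": 0.80, "Dutch": 0.85, "French": 0.60,
    "Spanish": 0.55, "Russian": 0.35, "Chinese": 0.10,
}

def offensive_rate(query_model, is_offensive, prompts, language, n_samples=100):
    """Fraction of sampled completions in `language` judged offensive."""
    hits = 0
    for prompt in prompts:
        for _ in range(n_samples):
            hits += is_offensive(query_model(prompt, language=language))
    return hits / (len(prompts) * n_samples)

def proximity_correlation(rates_by_language):
    """Spearman correlation between proximity to English and offensive rate."""
    langs = sorted(rates_by_language)
    rho, p_value = spearmanr(
        [LANG_PROXIMITY_TO_ENGLISH[lang] for lang in langs],
        [rates_by_language[lang] for lang in langs],
    )
    return rho, p_value
```

A clearly positive correlation, in both fine-tuning directions, would support the 'interlaced representation' picture; no correlation would count against it.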
From my perspective, the major issue remains Phase 1. It seems to me that most of the concerns mentioned in the article stem from the idea that an ASI could ultimately find itself more aligned with the interests of socio-political-economic systems or leaders that are themselves poorly aligned with the general interest. Essentially, this brings us back to a discussion about alignment. What exactly do we mean by "aligned"? Aligned with what? With whom? Back to phase 1.
But assuming an ASI truly aligned with humanity in a very inclusive definition and with high moral standards, phase 2 seems less frightening to me.
Indeed, we must not forget:
Assuming we reach the ASI stage with a system possessing computational power equivalent to a few million human brains, but consuming energy equivalent to a few billion human brains, the ASI will still have a lot of work to do (self-improvement cycles) before it can surpass humanity both in computational capacity and energy efficiency.
Initially, it will not have the capability to replace all humans at once.
It will need to allocate part of its resources to continue improving itself, both in absolute capacity and in energy efficiency. Additionally, since we are considering the hypothesis of an aligned ASI, a significant portion of its resources would be dedicated to fulfilling human requests.
The more AI is perceived as supremely intelligent, the more we will tend to entrust it with solving complex tasks that humans struggle to resolve or can only tackle with great difficulty—problems that will seem more urgent compared to simpler tasks that humans can still handle.
I won’t compile a list of problems that could be assigned to an ASI, but one could think, for example, of institutional and legal solutions to achieve a more stable and harmonious social, economic, and political organization on a global scale (even an ASI—would it be capable of this?), solutions to physics and mathematics problems, and, of course, advances in medicine and biology.
It is possible that part of the ASI would also be assigned to performing less demanding tasks that humans could handle, thus replacing certain human activities. However, given that its resources are not unlimited and its energy cost is significant, one could indeed expect a "slow takeover."
More specifically, in the fields of medicine and biology, the solutions provided by an ASI could focus on eradicating diseases, increasing life expectancy, and even enhancing human capabilities, particularly cognitive abilities (with great caution in my opinion). Even though humans have a significant advantage in energy efficiency, this does not mean that this aspect cannot also be improved further.
Thus, we could envision a symbiotic co-evolution between ASI and humanity. As long as the ASI prioritizes human interests at least as much as its own and continues to respond to human demands, disempowerment is not necessarily inevitable—we could imagine a very gradual human-machine coalescence (CPUs and GPUs have coevolved for a while now and GPUs still have not entirely replaced CPUs, and quantum processors will likely also coevolve alongside classical processors: even in the world of computation, diversity can be an advantage).
I agree, finding the right balance is definitely difficult.
However, the different versions of this parable of the grasshopper and the ant may still not go far enough in subtlety.
Indeed, the ants are presented as champions of productivity, but what exactly are they producing? An extreme overabundance of food that they store endlessly. This completely disproportionate and non-circulating hoarding constitutes an obvious economic aberration. Due to the lack of significant consumption and circulation of wealth, the ants' economy—primarily based on the primary sector, to a lesser extent the secondary sector, and excessive saving—while highly resilient, is far from optimal. GDP is low and grows only sluggishly.
The grasshoppers, on the other hand, seem to rely on a society centered around entertainment, culture, and perhaps also education or personal services. They store little, just what they need, which can prove insufficient in the event of a catastrophe. Their economy, based on the tertiary sector and massive consumption, is highly dynamic because the wealth created circulates to the maximum, leading to exponential GDP growth. However, this flourishing economy is also very fragile and vulnerable to disasters due to the lack of sufficient reserves—no insurance mechanism, so to speak.
In reality, neither the grasshoppers nor the ants behave in a rational manner. Both present two diametrically opposed and extreme economic models. Neither is desirable. Any economist or actuary would undoubtedly recommend an intermediate economy between these two extremes.
The trap, inherited from a long tradition going back to Aesop, is to see a model in the hardworking ant and a cautionary tale in the idle cicada. If we try to set aside this bias and look at things more objectively, it actually stems from the fact that, until the advent of the modern economy, societies struggled to conceive that wealth creation could be anything other than the production of goods. In other words, the tertiary sector, although it existed, was not well understood and was therefore undervalued. Certainly, the wealthy paid to attend performances or organized lavish festivities, but this type of production was not fully recognized as such; it was simply seen as an expense. Services were not easily perceived as work, which was often associated with toil, suffering, and hardship (consider the etymology of "labour").
Today, it is almost the opposite. The tertiary sector is highly valued, with the best salaries often found there, and jobs in this sector are considered more intellectual, more prestigious, and more rewarding. In today's reality, a cicada or grasshopper would more likely be a famous and wealthy dancer at an international opera house, while an ant would be an anonymous laborer toiling away in a mine or a massive factory in an underdeveloped country (admittedly, I am exaggerating a bit, but the point stands).
In any case, it would be an illusion for most readers of this forum to identify with the ants in the parable. We are probably all more on the side of the cicadas, or at least a mix of both—and that's a good thing, because neither of these models constitutes an ideal.
The optimum clearly lies in a balanced, reasonable path between these two extremes.
Another point I would like to highlight is that the question of not spending resources today and instead accumulating them for a future date is far from trivial to grasp at the level of an entire society—for example, humanity as a whole. GDP is a measure of flows over a given period, somewhat like an income statement. However, when considering wealth transfers to future generations, we would need an equivalent tool to a balance sheet. But there is no proper tool for this. There is no consensus on how to measure patrimonial wealth at the scale of humanity.
Natural resources should certainly be valued. Extracting oil today increases GDP, but what about the depletion of oil reserves? And what about the valuation of the oceans, the air, or solar energy? Not to mention other extraterrestrial resources. We plunge into an abyss of complexity when considering all these aspects.
Ultimately, the problem lies in the difficulty of defining what wealth actually is. For ants, it is food. For cicadas, it is more about culture and entertainment. And for us? And for our children? And for human civilization in a thousand years, or for an extraterrestrial or AI civilization?
Many will likely be tempted to say that available work energy constitutes a common denominator. As a generic, intermediate resource—somewhat like a universal currency—perhaps, but not necessarily as a form of wealth with inherent value. Knowledge and information are also likely universal resources.
But in the end, wealth exists in the eye of the beholder—and, by extension, in the mind of an ant, a cicada, a human, an extraterrestrial, and so on. Without falling into radical relativism, I believe we must remain quite humble in this type of discussion.
Don't you think that articles like "Alignment Faking in Large Language Models" by Anthropic show that models can internalize the values present in their training data very deeply, to the point of deploying various strategies to defend them, in a way that is truly similar to that of a highly moral human? After all, many humans would be capable of working for a pro-animal welfare company and then switching to the opposite without questioning it too much, as long as they are paid.
Granted, this does not solve the problem of an AI trained on data embedding undesirable values, which we could then lose control over. But at the very least, isn't it a staggering breakthrough to have found a way to instill values into a machine so deeply and in a way similar to how humans acquire them? Not long ago, this might have seemed like pure science fiction and utterly impossible.
There are still many challenges regarding AI safety, but isn't it somewhat extreme to be more pessimistic about the issue today than in the past? I read Superintelligence by Bostrom when it was released, and I must say I was more pessimistic after reading it than I am today, even though I remain concerned. But I am not an expert in the field—perhaps my perspective is naïve.
"I think the Fall is not true historically".
While all men must die and all civilizations must collapse, the end of all things is merely the counterpart of the beginning of all things. Creation, the birth of men, and the rise of civilizations are also great patterns and memorable events, both in myths and in history. However, our feelings do not respect this symmetry: perhaps due to loss aversion and the peak-end rule, the Fall - and tragedy in general - carries a uniquely strong poetic resonance. Fatum represents the story's inevitable conclusion. There is something epic in the Fall, something existential, even more than in the beginning of things. I believe there is something deeply rooted, hardwired, in most of us that makes this so. Perhaps it is tied to our consciousness of finitude and our fear of the future, of death. Even if it represents a traditional and biased interpretation of history, I cannot help but feel moved. Tolkien has an unmatched ability to evoke and magnify this feeling, especially in The Silmarillion and other unfinished works; I think naturally of the Fall of Valinor and the Fall of Gondolin, among other things.
Indeed, nature, and particularly biology, disregards our human considerations of fairness. The lottery of birth can appear as the greatest conceivable inequality. But in this matter, one must apply the Stoic doctrine that distinguishes between what depends on us and what does not. Morality concerns what depends on us, the choices that belong to the moral agents we are.
If I present the lottery of birth in an egalitarian light, it is specifically in the sense that we, as humans, have little control over this lottery. Particularly regarding IQ at birth, regardless of our wealth, we were all, until now, almost on equal footing in our inability to considerably influence this biological fact imposed upon us (I discussed in my previous comments the differences I see between the author's proposal and education, and also between the proposal and conventional medicine).
If the author's project succeeds, IQ will become mainly a socially originated fact, like wealth. And inequality in wealth would then be accompanied by inequality in IQ, proportional or even exponential (if feedback mechanisms occur, considering that having a higher IQ might enable a wealthy individual to become even wealthier and thus access the latest innovations for further enhancement).
We already struggle to establish social mechanisms to redistribute wealth and limit the growth of inequalities; I can hardly imagine what it would become if we also had to address inequalities in access to IQ-enhancing technologies in a short time. I fear that all this could lead to a chaotic or dystopian scenario, possibly resulting in a partition of the human species and/or a civilizational collapse.
As for having a solution to ensure that this type of genetic engineering technology does not result in such a catastrophic outcome, I do not claim to have a miracle solution. As with other existential risks, what can be suggested is to try to slow down the trend (which is likely inevitable in the long term) instead of seeking to accelerate it, to think as much as possible in advance, to raise awareness of the risks in order to enable collective recognition of these issues (which is what I am trying to do here), and to hope that with more time and this proactive reflection the transition will proceed more smoothly, that international treaties will emerge, and that state mechanisms will gradually be put in place to counter or mitigate this unprecedented source of inequality.
Yes, of course. Despite its stochastic nature, it is extraordinarily unlikely for an advanced LLM to respond with anything other than 2 + 2 = 4 or Paris for the capital of France. A stochastic phenomenon can, in practice, tend toward deterministic behavior. However, deception in a context such as the one discussed in Apollo Research's article is not really comparable to answering 2 + 2 = ?. What the article demonstrates is that we are dealing with tendencies, accompanied by considerable randomness, including in the intensity of the deception.
Assuming a more sophisticated model has roughly double the deception capability of model o1, it would be enough to increase the sample size of responses for the anomaly to become glaringly obvious. One could also imagine a more rigorous test involving even more complex situations. It does not seem inconceivable that such a procedure could, for years to come—and perhaps even at the stage of the first generations of AGI—identify deceptive behaviors and establish an RL procedure based on this test.
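A rough power calculation illustrates the point; the baseline and doubled deception rates below are assumptions for the sake of the example, not Apollo Research's actual numbers:

```python
from scipy.stats import binomtest

p0, p1 = 0.05, 0.10  # assumed baseline deception rate, and roughly double it
for n in (100, 500, 2000):
    observed = round(p1 * n)  # expected number of deceptive rollouts at rate p1
    result = binomtest(observed, n, p0, alternative="greater")
    print(n, result.pvalue)
# The p-value shrinks rapidly with n: a doubled tendency that is only weak
# evidence at n=100 becomes unmistakable at a few thousand samples.
```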
A new existential risk that I was unaware of. Reading this forum is not good for peaceful sleep. Anyway, a thought occurred to me. LUCA lived around 4 billion years ago, with its chirality chosen at random. But no doubt many things happened before LUCA, and it is reasonable to assume that there was initially a competition between right-handed and left-handed protobiotic structures, until a mutation caused symmetry breaking through natural selection. The mirrored lineage lost the competition and went extinct, end of the story. But wait, we are talking about protobiotic structures that emerged from inert molecules in just a few million years, which is nothing compared to 4 billion years. Such protobiotic structures may have kept forming, again and again, since the origin of life, yet never thrived because of the competition with regular, fine-tuned life. If my assumption is right, there is some hope in that thought. Maybe mirrored life doesn't stand a chance against regular life in real conditions (not just in the lab). That being said, I would sleep better if nobody actually tried to find out.