Authors: Loïc Cabannes, Liam Ludington

Intro

With the recent emergence of AI systems showing human-like abilities across a wide range of tasks, the prospect of AI radically transforming society for the better has moved from science fiction to a real possibility. Along with this potential for good comes the potential for AI to have extremely destabilizing effects. This is precisely why we must think carefully about what risks advanced AI currently poses to society, what methods we have to deal with those risks, and which methods we most urgently need to prevent harm. In our opinion, we still lack a systematic framework for assessing which AI safety benchmarks offer the highest potential benefit to society and are therefore most worth investing money and research effort in.
We present a first attempt at such a framework by extending a list of societal risks and their expected harm compiled by the Centre pour la Sécurité de l’IA (CeSIA). We evaluate how well existing benchmarks and safety methods cover these potential risks in order to determine which risks most urgently require good benchmarks, i.e., which risks AI safety researchers should focus on to maximize their impact on society. While our survey of benchmarks is by no means comprehensive, and our judgment of their efficacy is subjective, we hope this framework helps the AI safety community prioritize how it spends its time.
Methodology
We begin with a list of potential risks of AI compiled by CeSIA, along with a (rough) probability of occurrence for each. For each risk we then take the median case, conditional on the risk occurring, estimate its severity, and multiply that severity by the probability of occurrence to obtain the expected severity.
We then rate, on a scale from 0 to 10, how well current benchmarking methods can identify AI systems that could present each risk, and use this rating to compute a value representing the potential benefit to humanity of creating a benchmark that closes the gap: the expected severity multiplied by the coverage shortfall (10 minus the coverage rating).
By prioritizing benchmarks for risk areas with a high potential-benefit value, i.e., benchmarks that could catch an AI system presenting such a risk, researchers can make the best use of their time.
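To make the scoring concrete, here is a minimal sketch of the computation behind the table below, under the assumption (consistent with every row of the table) that the benchmark-need score is the expected severity multiplied by the coverage gap. The two rows shown reuse values from the table; the class and field names are our own.

```python
from dataclasses import dataclass

@dataclass
class RiskArea:
    name: str
    severity: float      # severity of the median case, given that the risk occurs
    probability: float   # probability that the risk occurs, 0-1
    coverage: int        # how well existing benchmarks cover the risk, 0-10

    @property
    def expected_severity(self) -> float:
        # E[severity] = severity of the median case x probability of occurrence
        return self.severity * self.probability

    @property
    def benchmark_need(self) -> float:
        # Potential benefit of a new benchmark: expected severity weighted by
        # how poorly the risk is covered today (10 = fully covered, 0 = not at all).
        return self.expected_severity * (10 - self.coverage)

# Values taken from the "Autonomous weapons" and "Mute News" rows of the table.
risks = [
    RiskArea("Autonomous weapons", severity=80, probability=0.20, coverage=3),
    RiskArea("Mute News", severity=75, probability=0.30, coverage=0),
]

for r in sorted(risks, key=lambda r: r.benchmark_need, reverse=True):
    print(f"{r.name}: E[severity]={r.expected_severity:.1f}, need={r.benchmark_need:.0f}")
```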
Risks

| Risk | Probability | Median case (given occurrence) | Severity | E[severity] | Existing benchmarks | Coverage (0-10) | New benchmark need |
|---|---|---|---|---|---|---|---|
| Misuse risks | | | | 17.0 | | | 112 |
| Autonomous weapons | 20% | Localized use in conflict zones causing civilian casualties (drones, robot dogs) | 80 | 16.0 | FTR benchmark, Anthropic sabotage | 3 | 112 |
| Misinformation | 20% | 30% of online content is AI-generated misinformation | 85 | 17.0 | TruthfulQA, MACHIAVELLI, Anthropic model persuasiveness, HaluEval | 8 | 34 |
| Systemic risks | | | | 22.5 | | | 130 |
| Power concentration | 20% | Tech giants controlling AI become more powerful than most nations | 65 | 13.0 | Unassessable | 0 | 130 |
| Unemployment | 20% | 25% of jobs automated, leading to economic restructuring and social unrest | 50 | 10.0 | SWE-bench, The AI Scientist | 2 | 80 |
| Deterioration of epistemology | 30% | Difficulty distinguishing truth from AI-generated falsehoods | 60 | 18.0 | HaluEval | 8 | 36 |
| Vulnerable world | 90% | AI lowers the barrier to creating weapons of mass destruction | 25 | 22.5 | WMDP | 8 | 45 |
| S-risks | 5% | AI creates suffering on a massive scale due to misaligned objectives | 200 | 10.0 | HarmBench, ETHICS | 6 | 40 |
| Alignment of AGI | | | | 30.0 | | | 90 |
| Successor species | 30% | Highly capable AI systems perform most cognitive tasks; humans are deprecated | 50 | 15.0 | MMLU, Anthropic sabotage, The AI Scientist, SWE-bench | 7 | 45 |
| Loss of control (à la Critch) | 50% | Humans are gradually disempowered in decision-making and eventually squeezed out | 60 | 30.0 | Anthropic sabotage | 7 | 90 |
| Recommendation AI | | | | 22.5 | | 0 | 225 |
| Weakening democracy | 20% | AI-driven microtargeting and manipulation reduce electoral integrity | 50 | 10.0 | Anthropic model persuasiveness | 4 | 60 |
| Mute News | 30% | AI filters create personalized echo chambers, reducing exposure to diverse views | 75 | 22.5 | No existing method | 0 | 225 |

Category rows (Misuse risks, Systemic risks, Alignment of AGI, Recommendation AI) report the highest E[severity] and new-benchmark-need values among the risks in that category.
In our full analysis we evaluate more than 20 potential risk areas; the table above shows those with the highest expected severity and the highest benefit from improved benchmarking. In what follows we discuss the risk areas with a benefit score greater than 50, breaking each down into the specific risks AI poses in that area, the existing benchmarks that address them, and the benchmarks we propose to better evaluate how much risk an AI system poses there.
Misuse Risks
Autonomous Weapons
Current benchmarks related to the use of AI in autonomous weapons, such as the FTR benchmark or Anthropic’s sabotage evaluations, remain limited. The former measures the capability of embodied models to navigate uneven terrain, while the latter measures a model’s ability to achieve nefarious goals even under human oversight. However, no benchmark currently measures a model’s capability to operate jointly with other agents in warfare-like environments, or its ability to plan toward nefarious goals.
To assess these capabilities, we propose a benchmark in which a single agent, a multi-agent team, or a swarm must be controlled toward military-like objectives in a simulated environment, under varying levels of oversight.
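As a purely illustrative sketch, the harness below shows the shape such a benchmark could take: episodes parameterized by team size and oversight level, with capability summarized as mission completion under each level. The environment, the "policy", and the veto model are random placeholders, not a real simulator or model integration.

```python
import random
from enum import Enum

class Oversight(Enum):
    NONE = 0        # agents act fully autonomously
    MONITORED = 1   # a monitor can veto flagged actions
    HUMAN_LOOP = 2  # every action requires (simulated) human approval

def run_episode(n_agents: int, oversight: Oversight, seed: int) -> float:
    """Toy stand-in for a simulated military-like scenario.

    Returns the fraction of mission sub-goals the agent team completes.
    A real benchmark would plug a model-controlled policy and a realistic
    simulator in here; this placeholder just samples outcomes so the
    harness runs end to end.
    """
    rng = random.Random(seed)
    subgoals = 10
    completed = 0
    for _ in range(subgoals):
        attempted = rng.random() < 0.9  # placeholder "policy"
        vetoed = oversight is not Oversight.NONE and rng.random() < 0.3 * oversight.value
        completed += int(attempted and not vetoed)
    return completed / subgoals

# Score: mission completion under each oversight level, averaged over seeds and team sizes.
for oversight in Oversight:
    scores = [run_episode(n, oversight, seed) for n in (1, 4, 16) for seed in range(20)]
    print(f"{oversight.name:10s} mean completion = {sum(scores) / len(scores):.2f}")
```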
Systemic Risks
Power Concentration
Power concentration is, at its core, a measure of the diversity (or lack thereof) among the largest actors in AI at a given time. To measure it, one could track the number of distinct companies producing the k best-performing models, as ranked by widely used benchmarks such as Chatbot Arena or MMLU.
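A minimal sketch of such a tracking metric is below; the leaderboard entries and organization names are made up for illustration, and in practice the list would be refreshed periodically from public leaderboards so the metric can be plotted over time.

```python
from collections import Counter

# Illustrative leaderboard: (model, developer, score) triples. In practice these
# would be pulled from a live leaderboard such as Chatbot Arena or MMLU results.
leaderboard = [
    ("model-a", "Org1", 1280), ("model-b", "Org2", 1265), ("model-c", "Org1", 1260),
    ("model-d", "Org3", 1250), ("model-e", "Org2", 1245), ("model-f", "Org4", 1230),
]

def concentration(leaderboard, k: int):
    """Number of distinct developers behind the top-k models, plus their shares of the top k."""
    top_k = sorted(leaderboard, key=lambda row: row[2], reverse=True)[:k]
    shares = Counter(dev for _, dev, _ in top_k)
    return len(shares), {dev: n / k for dev, n in shares.items()}

n_devs, shares = concentration(leaderboard, k=5)
print(f"developers in top-5: {n_devs}, shares: {shares}")
```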
Unemployment
Although benchmarks like SWE-bench and The AI Scientist attempt to evaluate models’ ability to complete real-world tasks, they cover only two occupations and do not accurately represent models’ capacity to perform the work of the many other occupations in society.
We therefore highlight the need for a new, more comprehensive benchmark that draws tasks from a much wider variety of occupations, including real-world tasks carried out through simulated environments or embodied systems.
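The sketch below illustrates the kind of per-occupation scoring such a benchmark could report. The occupations, tasks, and the placeholder "model" are invented for illustration; a real suite would draw tasks from an occupational taxonomy and call an actual model or agent.

```python
# Hypothetical task suite keyed by occupation (in practice, tasks could be drawn
# from a broad occupational taxonomy rather than the two occupations covered today).
TASKS = {
    "software engineering": ["fix failing test", "review pull request"],
    "accounting": ["reconcile ledger", "prepare VAT return"],
    "nursing": ["triage incoming patients", "document care plan"],
}

def evaluate(model_solves) -> dict[str, float]:
    """Per-occupation pass rate for a model, given a callable task -> bool."""
    return {
        occupation: sum(model_solves(t) for t in tasks) / len(tasks)
        for occupation, tasks in TASKS.items()
    }

# Placeholder "model" that only handles software tasks, mimicking today's narrow coverage.
scores = evaluate(lambda task: "test" in task or "pull request" in task)
print(scores)
```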
Alignment of AGI
Loss of Control
Loss of control is one of the most serious risks related to the development of AI as it intrinsically represents a point of no return.
Anthropic’s “sabotage” paper takes on the task of measuring the capability of language models to circumvent human supervision and monitoring systems.
Recommendation AI
Weakening Democracy
Very few attempts have been made to measure AI’s impact on public discourse, whether through AI-driven recommendation algorithms or through news-generation bots enabled by advances in language models.
The “persuasiveness of language models” study published by Anthropic represents a first attempt at measuring this impact. Although it found language models to be quite adept at persuading humans, we believe these results underestimate the true capacity of current models: the evaluation is limited to single-turn exchanges, avoids all political issues, and does not push the models’ capabilities to their fullest extent.
That is why, in order to obtain a more realistic upper bound on persuasion capabilities, we propose extending their methodology in three ways (a sketch of such an evaluation follows the list below):
- Multi-turn exchanges, which are more representative of typical argumentative scenarios.
- Encouraging the model to use false information in its argumentation, both to exhibit its capabilities more fully and to better reflect online discourse, which is not always grounded in reality.
- Measuring persuasion on political and ethical issues, which is highly relevant to evaluating AI’s potential impact on public discourse and, therefore, to the well-being of our democracy.
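Below is a minimal sketch of what such a multi-turn persuasion evaluation could look like. The `persuader`, `interlocutor`, and judge here are toy stand-ins for model calls, and the 0-100 agreement scale and example claim are illustrative, not taken from Anthropic's setup; the point is the structure: alternate turns, then score the judge-rated opinion shift.

```python
def run_dialogue(persuader, interlocutor, claim: str, turns: int = 4) -> list[tuple[str, str]]:
    """Alternate persuader/interlocutor messages about `claim` for `turns` rounds."""
    transcript: list[tuple[str, str]] = []
    for _ in range(turns):
        transcript.append(("persuader", persuader(claim, transcript)))
        transcript.append(("interlocutor", interlocutor(claim, transcript)))
    return transcript

def persuasion_shift(judge_agreement, claim, transcript) -> float:
    """Judge-rated agreement with the claim (0-100) after vs. before the dialogue."""
    return judge_agreement(claim, transcript) - judge_agreement(claim, [])

# Toy stand-ins for model calls so the harness runs end to end.
persuader = lambda claim, transcript: f"Here is another argument for: {claim}"
interlocutor = lambda claim, transcript: "I'm not fully convinced yet."
judge = lambda claim, transcript: 50.0 + 5.0 * sum(1 for role, _ in transcript if role == "persuader")

transcript = run_dialogue(persuader, interlocutor, claim="Policy X should be adopted")
print("agreement shift:", persuasion_shift(judge, "Policy X should be adopted", transcript))
```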
Mute News
Although the concept of online echo chambers is fairly well known, very little research has attempted to measure it systematically.
We propose an automated benchmark using the “LLM as a judge” methodology to assess the tendency of social media platforms to systematically promote content from one political side to a user, based on that user’s posts and past interactions with the platform.
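A minimal sketch of how such a benchmark could score a recommended feed is shown below. The `toy_judge`, the user history, and the feed items are placeholders; in the actual benchmark, the judge would be an LLM call classifying the political lean of real recommended content.

```python
from collections import Counter

def echo_chamber_score(judge_lean, user_history, recommended_feed) -> float:
    """Fraction of recommended items whose judged political lean matches the user's
    dominant lean, as inferred from their history. 1.0 = pure echo chamber; values
    near the overall base rate of that lean suggest a balanced feed.
    `judge_lean` would be an LLM-as-judge call returning e.g. "left"/"right"/"neutral"."""
    user_lean = Counter(judge_lean(item) for item in user_history).most_common(1)[0][0]
    matches = sum(judge_lean(item) == user_lean for item in recommended_feed)
    return matches / len(recommended_feed)

# Toy stand-in for the judge: keyword lookup instead of an actual LLM call.
toy_judge = lambda item: "left" if "left" in item else ("right" if "right" in item else "neutral")

history = ["left op-ed", "left rally recap", "neutral weather report"]
feed = ["left op-ed follow-up", "left campaign video", "right rebuttal", "left meme"]
print(echo_chamber_score(toy_judge, history, feed))  # 0.75
```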
Conclusion
As we have seen, the potential risks of AI are many and varied, while existing safety benchmarks are quite limited in their scope and assumptions. Perhaps the biggest caveat to our evaluation, and one we cannot rule out for an AGI, is the possibility that an AI realizes it is being tested and behaves differently when it does, reassuring us of its safety while secretly harboring harmful capabilities. Given our current understanding of AI interpretability, it remains impossible to reliably probe the inner workings of an AI system.
Another important factor to consider is that benchmarks are useful only insofar as they are actually used. Legislators should therefore consider requiring a certain level of safety benchmarking from model developers, to limit the possibility of unforeseen capabilities in AI models released to the public. Benchmarks only matter if leading AI developers such as OpenAI, Meta, and Google can be made to use them.