A Systematic Approach to AI Risk Analysis Through Cognitive Capabilities

Epistemic status: This idea emerged during my participation in the MATS program this summer. I intended to develop it further and conduct more rigorous analysis, but time constraints led me to publish this initial version (roughly 30-60 minutes of work). I'm sharing it now in case others find it valuable or spot important flaws I've missed. Very open to unfiltered criticism and suggestions for improvement.

Why Focus on Cognitive Capabilities?

When analyzing AI systems, we often focus on their ability to perform specific tasks. However, each task can be broken down into three fundamental components: knowledge, physical capabilities, and cognitive capabilities. This decomposition offers a potentially novel approach to analyzing AI risks.
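As a toy illustration of this decomposition, a task could be represented as a simple record with the three components. The field names and the example entries below are my own illustrative placeholders, not an established schema:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Toy decomposition of a task into the three components above."""
    knowledge: list[str] = field(default_factory=list)              # facts the task requires
    physical_capabilities: list[str] = field(default_factory=list)  # actuators, lab access, ...
    cognitive_capabilities: list[str] = field(default_factory=list) # planning, deception, ...

# Hypothetical example (entries are illustrative, not a claim about
# what this task actually requires):
task = Task(
    knowledge=["organic chemistry", "lab protocols"],
    physical_capabilities=["wet-lab access"],
    cognitive_capabilities=["multi-step planning", "causal reasoning"],
)
```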

Let's examine why cognitive capabilities deserve special attention:

  1. Knowledge alone cannot lead to risk. Information without the ability to process or act on it is inert.
  2. Physical capabilities, while potentially risky, are relatively straightforward to control and monitor.
  3. Cognitive capabilities are prerequisites for nearly all risks. Almost any dangerous action requires some form of cognitive processing, making these capabilities a critical point of analysis.

However, we face a significant challenge: for any given task, especially dangerous ones, it's difficult to determine which cognitive capabilities are strictly necessary for its completion. We don't want to wait until an AI system can actually perform dangerous tasks before we understand which cognitive capabilities enabled them.

A Systematic Approach

Instead of working backwards from observed dangerous behaviors, we can approach this systematically by mapping the relationship between cognitive capabilities and risks:

  1. Start with two finite lists:
    1. A comprehensive catalog of potential risks
    2. A taxonomy of cognitive capabilities (typically 15 to 50 capabilities, depending on the classification system used)
  2. For each possible combination of cognitive capabilities, we can analyze which risks it might enable, regardless of the physical capabilities or knowledge required.
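To make the shape of this analysis concrete, here is a minimal sketch of the enumeration step. The capability list, risk catalog, and the `assess` judgment are all illustrative placeholders; in practice `assess` would be expert judgment, red-teaming, or a model-based pipeline. Note that capping the combination size matters for tractability, since a taxonomy of n capabilities has 2^n subsets:

```python
from itertools import combinations

# Illustrative placeholders; a real analysis would use an established
# taxonomy of 15-50 cognitive capabilities and a full risk catalog.
CAPABILITIES = ["planning", "deception", "self-modeling", "tool use", "persuasion"]
RISKS = ["autonomous replication", "targeted manipulation", "oversight evasion"]

def assess(combo, risk):
    """Toy placeholder judgment: does this capability combination suffice to
    enable this risk, granting whatever knowledge and physical affordances
    are needed? The real judgment would come from experts or a pipeline."""
    if risk == "targeted manipulation":
        return "deception" in combo and "persuasion" in combo
    return "planning" in combo and "deception" in combo

def capability_combinations(max_size=3):
    """Yield every capability combination up to max_size elements."""
    for k in range(1, max_size + 1):
        yield from combinations(CAPABILITIES, k)

def build_risk_map(max_size=3):
    """Map each capability combination to the risks it might enable."""
    risk_map = {}
    for combo in capability_combinations(max_size):
        enabled = [risk for risk in RISKS if assess(combo, risk)]
        if enabled:
            risk_map[combo] = enabled
    return risk_map

for combo, enabled in build_risk_map(max_size=2).items():
    print(" + ".join(combo), "->", enabled)
```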

This approach has several advantages:

  • It's extensible: New cognitive capabilities or risks can be added to the analysis as they're discovered
  • It's systematic: We can exhaustively examine combinations rather than relying on intuition
  • It's proactive: We don't need to wait for dangerous capabilities to emerge before analyzing them

Methodological Considerations

There are two potential approaches to this analysis:

  1. Risk-First Approach: Starting with a specific risk and working backward to identify which combinations of cognitive capabilities could enable it.
  2. Capabilities-First Approach: Starting with combinations of cognitive capabilities and exploring what risks they might enable.

The Capabilities-First approach is generally superior because it reduces confirmation bias. Instead of trying to justify our preexisting beliefs about what capabilities might lead to specific risks, we can think like red teamers: "Given this set of cognitive capabilities, what risks could they enable?"

Implementation Strategies

To make this analysis tractable, we could:

  1. Assemble a dedicated research team
  2. Develop AI-powered analysis pipelines (a rough sketch follows this list)
  3. Crowdsource the analysis to the broader AI safety community
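As one concrete shape the second strategy could take, here is a minimal sketch of an LLM-based pipeline that applies the capabilities-first, red-teaming framing from above. Everything here is hypothetical: `query_llm` stands in for whatever model API is actually available, and the prompt wording is illustrative rather than a tested template:

```python
def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model API call; returns a canned
    string here so the sketch runs end to end."""
    return "(model's risk analysis would appear here)"

RED_TEAM_PROMPT = (
    "You are red-teaming an AI system that has exactly these cognitive "
    "capabilities: {capabilities}. Assume it can acquire any knowledge or "
    "physical affordances it needs. What risks could this combination of "
    "capabilities enable, and what role does each capability play?"
)

def analyze_combination(combo: tuple[str, ...]) -> str:
    """Ask the model what risks a given capability combination enables."""
    return query_llm(RED_TEAM_PROMPT.format(capabilities=", ".join(combo)))

print(analyze_combination(("planning", "deception")))
```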

If the analysis proves intractable even with these approaches, that finding itself would be valuable - it would demonstrate the inherent complexity of the problem space.

Practical Applications

This framework enables several practical applications:

  1. Early Warning Systems: By rigorously evaluating the cognitive capabilities of AI models, we can create effective early warning systems. Instead of waiting to see if a model can perform dangerous tasks, we can monitor specific combinations of capabilities and set appropriate thresholds (a minimal monitoring sketch follows this list).
  2. Training Optimization: We can identify which cognitive capabilities might be safely minimized during training while maintaining desired functionalities.
  3. Targeted Evaluation: This systematic approach can inform the design of specific task-based evaluations that probe for concerning combinations of capabilities.
  4. Scaling Laws: By understanding which cognitive capabilities enable which risks, we can develop better scaling laws to anticipate future developments.
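To illustrate the first application, here is a minimal monitoring sketch: flag a model whenever every capability in a concerning combination crosses its threshold. The combinations, thresholds, and scores below are made-up placeholders; real values would come from capability evaluations and a worked-out risk map:

```python
# Illustrative placeholders, not proposed values.
CONCERNING_COMBINATIONS = {
    ("planning", "deception"): {"planning": 0.7, "deception": 0.6},
    ("persuasion", "self-modeling", "planning"): {
        "persuasion": 0.5, "self-modeling": 0.5, "planning": 0.5,
    },
}

def triggered_warnings(eval_scores: dict[str, float]) -> list[tuple[str, ...]]:
    """Return the concerning combinations whose thresholds are all exceeded.
    Scores would come from a capability evaluation suite."""
    warnings = []
    for combo, thresholds in CONCERNING_COMBINATIONS.items():
        if all(eval_scores.get(cap, 0.0) >= thresholds[cap] for cap in combo):
            warnings.append(combo)
    return warnings

# Example with hypothetical evaluation scores:
scores = {"planning": 0.8, "deception": 0.65, "persuasion": 0.3}
print(triggered_warnings(scores))  # -> [('planning', 'deception')]
```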

Next Steps

The immediate challenge is prioritization. While a complete analysis of all possible combinations of cognitive capabilities and risks would be ideal, we can start with:

  1. High-priority risk categories based on potential impact
  2. Core cognitive capabilities that seem most relevant to current AI systems
  3. Specific combinations that appear most likely to enable the most critical risks

This framework provides a structured way to think about AI risk assessment and monitoring, moving us beyond task-based analysis to a more fundamental understanding of how cognitive capabilities combine to enable potential risks.



Acknowledgments: I would like to thank Quentin Feuillade-Montixi, Ben Smith, Pierre Peigné, Nicolas Miailhe, JP and others for the fascinating discussions that helped shape this idea during the MATS program. While they contributed valuable conversations, none of them were involved in this post, and any mistakes or questionable ideas are entirely my own responsibility.
