Many people believe that understanding "agency" is crucial for alignment, but as far as I know, there isn't a canonical list of reasons why we care about agency. Please describe below any reasons why we might care about the concept of agency for understanding alignment. If you have multiple reasons, please list them as separate answers.
Please also try to be as specific as possible about what our goal is in the scenario. For example:
We want to know what an agent is so that we can determine whether or not a given AI is a dangerous agent
Whilst useful, this isn't quite as good as:
We have an AI which may or may not have goals aligned with ours, and we want to know what will happen if these goals aren't aligned. In particular, we are worried that the AI may develop instrumental incentives to seek power, and we want to use interpretability tools to help us figure out how worried we should be about a particular system.
We can imagine a scale of such systems according to how they behave in novel situations:
- On the low end of this scale, we have a system that only follows extremely broad heuristics that it learned during training
- On the high end of this scale, we have a system that uses general reasoning capabilities to discover new and innovative strategies, even if it has never used anything like that strategy before or seen it in its training data
We can think of such a scale as a concretisation of agency. This scale provides an indication of both:
- How dangerous a system is likely to be: though a system with excellent heuristics and very little agency could be dangerous as well
- What kind of precautions we might want to take: for example, how much we can rely on our off-switch to disable the agent
It's plausible that interpretability tools might be able to give us some idea of where an agent is on this scale just by looking at the weights. So having a clear definition of this scale could help by clarifying what we are looking for.
In a few days, I'll add any use cases I'm aware of myself that either haven't been covered or that I don't think have been adequately explained by different answers.
This is one of the answers: https://www.alignmentforum.org/posts/FWvzwCDRgcjb9sigb/why-agent-foundations-an-overly-abstract-explanation
Summary: John describes the problems of inner and outer alignment. He also describes the concept of True Names - mathematical formalisations that hold up under optimisation pressure. He suggests that having a "True Name" for optimisers would be useful if we wanted to inspect a trained system for an inner optimiser without risking missing something.
He further suggests that the concept of agency breaks down into lower-level components like "optimisation", "goals", "world models", etc. It would be possible to make further arguments about how these lower-level concepts are important for AI safety.