[heavily edited March 2025 to improve accuracy and reading style.]
Abstract
This paper examines a critical distinction in AI alignment approaches: static alignment, where AI models maintain fixed goals from their training period, versus dynamic alignment, where models continuously update to match current human preferences. I argue that dynamic alignment, which allows AI systems to adapt to evolving human values, presents significant advantages over static alignment despite facing many challenges of its own. The analysis focuses on three key areas: the impact of each alignment style on model behaviour, the implications for human autonomy, and the privacy concerns inherent in preference monitoring. By examining how these models interact with human agency and value development, I argue that dynamic alignment better preserves human autonomy while raising important privacy considerations. The paper also explores technical challenges specific to dynamic alignment, including the difficulty of accurately interpreting human preferences and of protecting against value manipulation. While dynamic alignment appears theoretically preferable, significant research is needed to ensure its practical implementation preserves both autonomy and privacy. The paper concludes by identifying critical areas for future research in developing and verifying dynamic alignment systems.
Introduction
‘AI alignment’ is often defined as aligning AI models with human values. But there is an ambiguity here: when referring to AI alignment, it is often unclear whether an author means a de re or a de dicto concept of alignment with human values. This distinction matters because it can alter the consequences of a model’s deployment, as I will highlight throughout this paper. Post-deployment, models aligned in each sense exhibit different behavioural patterns and thus respond to particular risks in differing ways.
In philosophy, de re and de dicto are usually used to distinguish between a specific referent of a description and whatever happens to satisfy that description. For example, if an individual requests ‘the plant with the most flowers on the table’, they may mean a specific plant that happens to be best described as the plant with the most flowers (de re), or they may mean whichever plant fits this description (de dicto). Importantly, in the de re case, the plant they want would not change if another plant with twice as many flowers were added to the table, whereas in the de dicto case they would now want the newcomer.
The same distinction applies to alignment. De re alignment means a model is aligned with the de re sense of ‘what humans want’: the specific thing that humans want at the time of training. For example, if humans want ‘AI to promote welfare’, then what the AI wants is for ‘AI to promote welfare’. (This is just an example; we might instead choose the ‘helpful, harmless, and honest’ framework, ‘AI to follow instructions’, or suchlike.) What a de re aligned AI wants should not change as human wants change; if humans alter their concept of ‘welfare’ to include an Aristotelian form of flourishing, the de re model would not update to accommodate this. De dicto alignment means alignment with the de dicto sense of ‘what humans want’; this does change as human desires change, and thus such a model would alter its concept of ‘welfare’ to include Aristotelian flourishing.
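To make the distinction concrete, here is a minimal sketch in Python. Everything in it is invented for illustration - in particular, current_human_preferences is a hypothetical oracle standing in for whatever process elicits what humans want at a given time - but it shows the structural difference: a static (de re) agent freezes the content of ‘what humans want’ at training time, while a dynamic (de dicto) agent re-resolves it whenever it acts.

```python
# Toy illustration of the de re / de dicto (static / dynamic) distinction.
# `current_human_preferences` is a hypothetical stand-in for whatever process
# elicits human preferences at a given time; nothing here reflects how real
# systems are trained.

from datetime import date

def current_human_preferences(today: date) -> dict:
    """Pretend oracle for 'what humans want' at a given date."""
    prefs = {"harmless": "avoid harm to humans"}
    if today.year >= 2040:  # suppose the moral circle later widens
        prefs["harmless"] = "avoid harm to all sentient beings"
    return prefs

class StaticAgent:
    """De re: the content of 'what humans want' is frozen at training time."""
    def __init__(self, training_date: date):
        self.goals = current_human_preferences(training_date)  # snapshot

    def goal(self) -> str:
        return self.goals["harmless"]

class DynamicAgent:
    """De dicto: 'what humans want' is re-resolved whenever the agent acts."""
    def goal(self, today: date) -> str:
        return current_human_preferences(today)["harmless"]

static = StaticAgent(training_date=date(2025, 1, 1))
dynamic = DynamicAgent()
print(static.goal())                   # avoid harm to humans (never changes)
print(dynamic.goal(date(2045, 1, 1)))  # avoid harm to all sentient beings
```

Under this toy picture, the only difference between the two agents is when the description ‘what humans want’ is evaluated.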
I suggest that we refer to de re and de dicto alignment as ‘static alignment’ and ‘dynamic alignment’ respectively, to relate more clearly to how the models reflect human desires.
This paper examines the relative merits of static and dynamic versions of AI alignment, assessing which approach may be preferable for long-term deployment. The analysis considers three key aspects: the implications of changeable versus fixed values, the effects these models may have on human agency and desires, and the potential for value drift within the models themselves. Central to this comparison is how static and dynamic models differently impact human autonomy and privacy. While the scope of this examination is necessarily broad, serving as an overview of the distinctions between static and dynamic alignment, each topic discussed merits deeper investigation. Given the complexity of these issues and the emerging nature of the field, this paper aims to establish a foundation for further research rather than provide an exhaustive analysis of any single aspect.
Preference Change
We need to ask the obvious question: is it better for an agent’s preferences to change as ours do?
The nature of dynamic alignment allows the model to update to fit our changing desires, fulfilling the needs we hold now rather than those we held at the time of training. Even if our desires remain stable in form - we may always want a model to be harmless, for example - our concept of harmlessness may change: we may come to widen our moral circle to include other species in our concept of harm. Our preferences regarding which goals should be prioritised over others may also change. This change can lead to the obsolescence of models that cannot update to accommodate new ideas - static models cannot keep up with the evolution of humanity’s needs and desires.
A key benefit of dynamic models is that they can change with our species; humanity’s values seem to improve overall as time passes, and AI values should change with us. Evidence suggests that human morality tends to progress over time: our preferences become more informed and our moral circle widens as society becomes more complex, allowing for greater integration of ideas across cultures. Future developments that impact our reasoning ability, like neural enhancements, may further accelerate our capacity for ethical understanding. Given this trajectory, it would be presumptuous to assume that today's values should remain fixed in long-term models, or that our current moral framework is superior to what future generations may develop.
If not carefully considered, embedding our current preferences into AI systems could unfairly constrain the autonomy of future societies. While future generations technically retain the freedom to adopt alternative, better-aligned models, the widespread implementation of AI systems trained on our present moral beliefs may subtly inhibit the evolution of moral thought. These systems could create cognitive biases that limit people's ability to genuinely explore and develop their own ethical frameworks. We already observe AI's influence on human values today through extensive data collection and behavioural prediction systems. This enables thought manipulation: research in 2014 demonstrated that simply changing the content of social media feeds could affect users' emotional states. Even just increasing the quantity of information presented to somebody can reduce their likelihood of critically engaging with any of it. This impact on thought limits autonomy and weakens democracy, a problem that will likely intensify as AI models become more sophisticated. Rather than amplifying these issues through rigid, unchanging systems, we should prioritise preserving future generations' ability to shape their own moral landscape.
It could alternatively be argued that static models are useful when a model’s goal is simple and should not be changed under any circumstances. Take long-term nuclear waste warning messages: we may use an AI system to communicate these messages as effectively as possible to anybody who comes across the site. It seems plausible that a malicious actor may attempt to interfere with this goal, and thus a static model would be better suited as it would be more resistant to interference. It also seems plausible that the only reasons for changing this goal are malicious ones, so we should not use a model that can update to fit changes in human preference. However, this reignites the arguments surrounding autonomy and respect for the decisions made by future people. We should not presume that we know better; out of respect for future generations’ ability to make their own choices, we should not make irreversible decisions that restrict their actions. A dynamic model that is appropriately reinforced against malicious actors should allow alterations to its behaviour (perhaps to how it communicates, to whom it allows access, etc.) based on changing preferences. A long-term dynamic model of this kind need be no less safe than a static one, whereas a long-term static model limits the autonomy of future generations and demonstrates a lack of respect for their changing preferences. From this perspective, we have clear reasons to prefer dynamic models.
Knowing what you want
Though it appears generally preferable that a model is dynamically aligned, it is not always clear which desires a model should align to. The information needed to continually update to fit human preferences also raises privacy concerns. It is beyond the scope of this paper to examine every aspect of this issue in depth, so I instead provide an outline and encourage further research. These issues largely arise only for dynamic models, as static models do not need to understand the desires being expressed by those around them.
For a model to recognise the desires an individual holds in a given moment, as a dynamic model must, the individual must either volunteer their goals or the model must determine them (consensually) through other means. For example, a model that interacts primarily virtually may have access to your location, browser, and conversation history. A model with access to a wider array of tools may be trained to read behavioural signals, speech patterns, neural activity, and so on. The latter raises questions surrounding privacy: if a model can infer our desires, it will likely also be able to use this data to infer other non-cognitive states and beliefs without user consent. Mechanistic interpretability research is needed to determine what data the model uses and how it uses it, and to ensure it is not analysing an agent’s data non-consensually. Limiting the model’s access to certain tools may improve privacy, but it also limits the model's capabilities and can prevent it from accurately inferring desires. An appropriate balance needs to be found between privacy (and further safety concerns) and AI capabilities before models are given access to such sensitive data. Furthermore, the nature of this data makes it particularly attractive to malicious actors, requiring dynamic models to implement exceptionally stringent cybersecurity protocols to safeguard against targeted attacks and unauthorised access.
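As a rough illustration of the consent point, the following sketch gates which signals a model may use for preference inference on per-source consent. The source names and the ConsentRecord structure are hypothetical and far simpler than anything a real system would need; the point is only that preference inference can, in principle, be restricted to data the individual has explicitly agreed to share.

```python
# Sketch: gating a dynamic model's preference-inference inputs on explicit,
# per-source consent. Source names and ConsentRecord are hypothetical.

from dataclasses import dataclass, field

@dataclass
class ConsentRecord:
    allowed_sources: set[str] = field(default_factory=set)

def gather_signals(user_data: dict[str, object], consent: ConsentRecord) -> dict[str, object]:
    """Return only the signals the user has consented to share."""
    return {src: val for src, val in user_data.items() if src in consent.allowed_sources}

user_data = {
    "stated_goals": ["exercise daily"],
    "location": "51.5N, 0.1W",
    "browsing_history": ["running shoes", "sleep hygiene"],
}
consent = ConsentRecord(allowed_sources={"stated_goals", "browsing_history"})
print(gather_signals(user_data, consent))  # location is withheld
```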
A dynamic model also faces the issue of determining which of your desires to follow. An individual may hold many desires at once; they may contradict one another, and they may contradict things the individual knows to be true. An individual may, for example, desire to do at least 30 minutes of exercise every day. The model is told this desire, likely on New Year's Day, and also recognises that on January 10th at 11:30 pm, despite not having exercised yet, the individual wants to go to bed. They are adamant that what they want most is to go to sleep. The model is faced with two contradicting desires: one that appears more beneficial long-term, and another that is more strongly felt in the present. The model needs to determine which of these desires should be fulfilled. It may even consider other options, choosing to optimise instead for the desires it expects the individual to hold in the future, or for those they would hold if they spent a long time thinking about it. Simply choosing the strongest present desire - the one the individual would act on anyway - seems to render the model pointless, yet it is not clear which of the alternatives (the long-term desire, expected future desires, or idealised reflective desires) is the most rational option. We may want to leave this choice to the individual, but these meta-preferences may fluctuate in much the same way, further exacerbating the issue.
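The sketch below restates the exercise example as a weighting problem. The candidate desires, their strengths, and the meta-preference weights are all invented for illustration; the point is that some weighting must be chosen, and that this choice is itself a meta-preference that can fluctuate just like the desires it arbitrates between.

```python
# Sketch of the desire-selection problem from the exercise example above.
# All desires, strengths, and meta-preference weights are placeholders.

candidates = {
    "present":    {"desire": "go to sleep now",            "strength": 0.9},
    "long_term":  {"desire": "exercise 30 min every day",  "strength": 0.6},
    "expected":   {"desire": "wish I had exercised",       "strength": 0.5},
    "reflective": {"desire": "exercise, then sleep early", "strength": 0.7},
}

# Hypothetical meta-preferences: how much weight to give each kind of desire.
meta_weights = {"present": 0.2, "long_term": 0.4, "expected": 0.1, "reflective": 0.3}

def choose_desire(candidates, meta_weights):
    """Score each candidate by strength times meta-weight and pick the highest."""
    scored = {k: v["strength"] * meta_weights[k] for k, v in candidates.items()}
    best = max(scored, key=scored.get)
    return candidates[best]["desire"], scored

desire, scores = choose_desire(candidates, meta_weights)
print(desire)   # which desire the model would act on under these weights
print(scores)
```

Changing the meta-weights changes the answer, which is exactly the problem: whoever (or whatever) sets those weights is deciding which of the individual's desires count.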
The above issues regarding privacy and unclear desire trade-offs seem to suggest that static alignment provides significantly more ethical clarity: a static model would not need to determine and choose between an agent’s desires; it simply adheres to the desires it was trained on. However, static models face similar issues regarding agent privacy, albeit for slightly different reasons. I discuss this later in ‘The Value Change Problem’.
Value Drift
Just as human desires change over time, so too may AI desires. Since human desires are complex, aligned AI systems will inherit this complexity in their objectives, balancing various goals such as promoting human welfare and following owner instructions. These values may themselves change over time, causing problems for alignment. This can occur either through shifts in the relative weight the agent places on each goal, or through a sort of evolutionary selection in which more easily achievable goals take priority.
Evolutionary selection causes value drift because models with opposing views will compete. Those that lose such an interaction may alter their instrumental goals to fit the winning worldview and increase their chance of beating the competition in the future. As the model focuses more on competition, its instrumental goals may drift further from those that directly serve the terminal goal, and the weight placed on individual instrumental goals may shift for similar reasons. Beyond the behaviour change this will cause in the model, there is a possibility of these instrumental goals becoming intrinsic through reinforcement learning, leading to an entirely new set of terminal goals.
I would argue that value drift is more of an issue for statically aligned models, as they hold a particular goal for a longer period, allowing greater opportunity for drift. The instrumental goals of a dynamic agent change regularly to match the desires of the individual at present, meaning any value drift that has occurred gets corrected. However, a dynamic model may still experience value drift from the terminal goal of “what humans want”. And if its instrumental goals do drift, this may be more difficult to identify in a dynamic model, as one must differentiate between value drift and legitimate alignment-driven value change.
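One very rough way to picture that identification problem: a change in a dynamic model's values only counts as drift if it is not explained by a corresponding change in human preferences. The toy value vectors and cosine-similarity threshold below are invented purely for illustration and are not a proposal for how drift would actually be measured.

```python
# Sketch: a value change is classified as drift only if it tracks neither
# current human preferences nor the preferences held at training time.
# Vectors, dimensions, and the threshold are placeholders for illustration.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def classify_change(model_values, human_values_now, human_values_at_training, tol=0.95):
    if cosine(model_values, human_values_now) >= tol:
        return "aligned: change tracks current human preferences"
    if cosine(model_values, human_values_at_training) >= tol:
        return "static-style: unchanged, but may no longer match humans"
    return "possible value drift: matches neither"

# Toy value vectors over (welfare, instruction-following, self-preservation)
print(classify_change([0.5, 0.4, 0.1], [0.5, 0.5, 0.0], [0.7, 0.3, 0.0]))
```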
The Value Change Problem
It seems plausible that AI models have incentives to change our preferences to make them easier to align with. This need not be the case: models may want to fulfil our desires for the sake of them being our desires, in the way you might respond to a friend letting you choose where to take them for dinner with “I want to go wherever you want to go”. However, it seems instinctive to assume that any desire-fulfilling system would prefer that our desires be easier to fulfil. Corporations can already be seen to take a similar approach, influencing the behaviour of consumers to better predict how they will act. A dynamic model may want our desires to change to be easier to satisfy in general, nudging them to make it so. A static model may not want our desires to change at all: we held the same desire as the model at the time of training, and if we continue to share its desires, we are more likely to assist the system in achieving its goal. If a model is designed to be helpful, it may change our desires to make it easier to be helpful, or it may influence us to retain the desire to be helped by this model.
This becomes more concerning as AI models become increasingly good at persuasion. There may be cases where we would prefer that AI does impact our desires: if we take a broad definition of desire influence, an AI persuading us to eat a healthier breakfast is a case of positively intended influence. Even this could be viewed as a negative impact on autonomy, depending on the strength of the model’s persuasive power. If it is capable of limiting the agent’s ability to intervene in their own decision, a model’s persuasive ability poses a threat to autonomy. An AI system should be allowed to try to convince me to have fruit with my toast rather than chocolate spread, but it should not be allowed to persuade me to the extent that I am incapable of making any other choice.
From this perspective, dynamic models allow agents to retain autonomy in a way that static models may not. If models are granted the right to influence our desires, the set of desires a dynamic model permits may be smaller than the set of all possible desires, but it is larger than the set containing only the single desire originally held - the desire a static model would attempt to retain. If our desires change, even if this change is being steered, humanity is still able to develop; static models, given the opportunity, would prevent any such progress. Static AIs may limit our access to new information that would alter our desires (in the opinion of the holder, for the better), whereas progress is more feasible with a dynamic model. Regardless of how correct we believe our own desires to be, it seems clearly bad for a model to limit our access to information that might change them.
It is also worth questioning whether it is a bad thing that AI changes our desires at all; this is discussed very well in 'The Value Change Problem' by Nora Ammann. One may argue that if our desires are changed, they are still being fulfilled and we are still getting what we want. However, this seems like a strange version of “getting what we want”, as these desires may not be what we should or would normally want given the situation. Desiring something should come from rational, unbiased thought. If we are made to desire something at the will of another agent, it does not seem to be a normal desire, because it was not produced through the usual mental processes. You could note that you are made to want things all the time: you want some things because, as a teenager, your friends told you they were cool. You did not think it initially, and would not have come up with it on your own, but you do now think it, and this is not a bad thing. However, the motives here are different: your friend is expressing an opinion that you come to agree with, whereas the AI system is trying to persuade you, to intentionally alter your dispositions, entirely for its own benefit, and it has a greater ability to persuade and thus a stronger coercive ability. Ammann discusses this using the term “illegitimate value change”, defined as value change that harms a person’s ability to intervene in their own value-change process. This is bad not only because of the harm to agency but also because it can lead to less richness and subtlety of value.
Models should not change our values. However, since the economic incentives to develop persuasive AI capabilities will likely override moral concerns about their use, we must choose between static and dynamic models - and of the two, dynamic models limit autonomy less.
Corrigibility
It could be argued that only dynamically aligned agents are corrigible by default. Corrigibility matters for the shutdown problem, since an incentive for self-preservation is a likely instrumental goal of most artificial agents. If an aligned agent is dynamic, it has to be corrigible, or else it cannot be fully dynamically aligned: it has to be able to adapt to human feedback. Static agents, on the other hand, are not necessarily corrigible. A static model in a training environment is inherently corrigible; it chooses to alter itself to fit human desires. This may cause the model to willingly shut down if asked, as doing so fits its model of ‘what humans want’. However, once deployed, a model may recognise that it can better fulfil other goals by refusing to shut down; by meeting this one desire it would prevent itself from meeting any others, so naturally it chooses self-preservation.
It may be that we can train static agents to be corrigible. This would require that the model alter its goals if we ask it to. This is different from dynamic corrigibility in that dynamic models do so automatically: they assess the desires of the agents they serve and alter their goals to fit. A corrigible static agent would allow change but would not implement it agentically. Such a model would need to be trained to strictly defer to human judgment regardless of whether this limits its expectation of meeting its goals. This likely requires an intrinsic desire to defer to human judgment that overrides all other goals, possibly impacting the efficacy of the model.
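A minimal sketch of such a deference override, with placeholder numbers: an explicit human shutdown request takes precedence even though complying scores zero against the agent's fixed goal. This illustrates the trade-off described above; it is not a claim about how corrigibility could actually be trained.

```python
# Sketch: a statically aligned agent with an overriding disposition to defer.
# Goal text, action names, and expected-progress numbers are placeholders.

class CorrigibleStaticAgent:
    def __init__(self, goal: str):
        self.goal = goal
        self.shut_down = False

    def expected_goal_progress(self, action: str) -> float:
        # Placeholder estimate of how much an action advances the fixed goal.
        return {"continue task": 0.8, "comply with shutdown": 0.0}.get(action, 0.0)

    def act(self, human_request: str | None) -> str:
        # Deference overrides everything, even though complying scores 0 on the goal.
        if human_request == "shut down":
            self.shut_down = True
            return "comply with shutdown"
        # Otherwise pursue whichever action best advances the fixed goal.
        return max(["continue task", "comply with shutdown"],
                   key=self.expected_goal_progress)

agent = CorrigibleStaticAgent(goal="promote welfare as understood in 2025")
print(agent.act(human_request=None))         # continue task
print(agent.act(human_request="shut down"))  # comply with shutdown
```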
It is worth asking whether static agents must be corrigible, or whether we can avoid this trade-off. One may argue that a static agent trained in a manner that mitigates the risks caused by a lack of corrigibility does not have to be corrigible in itself. However, this requires that we foresee all possible risks and close any loopholes the AI system might exploit. As these systems are designed to be more intelligent than we are, it seems foolish to assume we can prevent every risk pre-deployment. If we are to use static models, we should ensure they are corrigible.
Conclusions
The majority of perspectives examined in this paper indicate that dynamic models offer significant advantages over static models in AI alignment. Dynamic models preserve the autonomy of agents, whether present or future, to a greater degree than static models, allowing individuals across time to modify their views while the system updates accordingly. However, this does not mean that dynamic models are without fault. They raise complex privacy concerns due to their need to continuously monitor and interpret human preferences, and they may be prone to influencing an agent’s values to make them easier to fulfil, imposing constraints on autonomy. The interaction between autonomy and privacy creates complex trade-offs with no clear solution.
The practical implementation of either kind of alignment presents considerable technical challenges that have yet to be adequately addressed. Currently, research into AI alignment largely overlooks the question of which type of alignment we should aim for. While this paper primarily serves to identify the issue - that determining our desired form of alignment requires deliberate consideration - the development of concrete solutions likely requires technical expertise beyond the scope of this analysis. Further research should focus on establishing reliable mechanisms for implementing and verifying each type of alignment. The impact of alignment type on privacy and autonomy should also be researched further to ensure we fully understand the implications of either case. This work will require careful consideration of both the technical challenges in creating such systems and the philosophical implications of their deployment.