This is part of a series of lists of projects. The unifying theme is that the projects are not targeted at solving alignment or engineered pandemics but still targeted at worlds where transformative AI is coming in the next 10 years or so. See here for the introductory post.
It’s plausible that there will soon be digital minds that are sentient and/or deserving of rights. Unfortunately, with our current state of knowledge, it’s incredibly unclear when this will happen (or if it’s already happened), how we could find out, and what we should do about it. And there are very few people working on this; I think several projects could significantly improve the situation from where we are right now.
Different items on this list would produce value in different ways. Many items would produce value in several different kinds of ways. Here are a few different sources of value (all related, but not necessarily the same).
(1) Improving AI welfare over the next few decades: increasing happiness while reducing suffering.
(2) Shaping norms in a way that leads to lasting changes in how humanity chooses to treat digital minds in the very long run.
(3) Reducing the degree to which we behave unacceptably according to non-utilitarian ethics, norms, or heuristics.
(4) Increasing the probability that AI systems with power treat us better as a result of us treating AI systems better. (Which could happen via a variety of mechanisms, e.g. maybe us treating AI systems better will align our interests with theirs, or maybe they’ll have some sense of justice that we could appeal to.[1])
(5) Providing recommendations for how the political system could adapt to digital minds while remaining broadly functional.
I am personally most excited about (3), (4), and (5), because purely utilitarian considerations would push towards focusing on the long run over (1). And while I find (2) somewhat more compelling, since it does affect the long run, I’m relatively more excited about meta-interventions that push towards systematically getting everything right (e.g. via decreasing the risk of AI takeover or increasing the chances of excellent epistemics) than about pushing on individual issues. (3), (4), and (5) survive these objections.
I’m not very confident about that, and I think there’s significant overlap in what’s best to do. But I flag it because it does influence my prioritization to some degree. For example, it makes me relatively more inclined to care about AI preferences (and potential violations thereof) compared to hedonic states like suffering. (Though of course, you would expect AI systems to have a strong dispreference against suffering.)
Develop & advocate for lab policies [ML] [Governance] [Advocacy] [Writing] [Philosophical/conceptual]
(In this section, I refer a lot to Ryan Greenblatt’s Improving the Welfare of AIs: A Nearcasted Proposal as well as Nick Bostrom and Carl Shulman’s Propositions Concerning Digital Minds and Society. All cursive text is a quote from one of those two articles.)
It seems tractable to make significant progress on how labs should treat digital minds. Such efforts can be divided into 3 categories:
Policies that don’t require sophisticated information about AI preferences/experiences.
Learning more about AI preferences.
Interventions that rely on understanding AI preferences.
For most of the proposals in the former two, we can already develop technical proposals, run experiments, and/or advocate for labs to adopt certain policies.
For proposals that we can’t act on yet, there’s still some value in doing the intellectual work in advance. We might not have time to do it later. And principles sketched out in advance could be given extra credibility and weight due to their foresightedness.
In addition, it could also allow labs to make advance commitments about how they’ll act in certain situations in the future. (And allow others to push for labs to do so.) Indeed, I think a big, valuable, unifying project for all the below directions could be:
Create an RSP-style set of commitments for what evaluations to run and how to respond to them
Responsible scaling policies are a proposal for how labs could specify what level of AI capabilities they can handle safely with their current protective measures, and conditions under which it would be too dangerous to continue deploying AI systems and/or scaling up AI capabilities.
(For examples, see Anthropic’s RSP and OpenAI’s Preparedness Framework, which bears many similarities to RSPs as described in the link above.)
Doing something analogous for digital minds could be highly valuable. Trying to specify:
What experiments labs will run to learn more about AI sentience, preferences, and related concepts.
How they will adapt their policies for using and training AI systems, depending on what they see.
This would draw heavily on the sort of concrete ideas that I outline below. But the ideas I outline below also seem very valuable to develop in and of themselves. Let’s get into them.
Policies that don’t require sophisticated information about AI preferences/experiences
These are policies that we could get started on today without fundamental breakthroughs in our understanding of AI welfare.
Preserving models for later reconstruction
Quoting from Propositions Concerning Digital Minds and Society (page 17):
For the most advanced current AIs, enough information should be preserved in permanent storage to enable their later reconstruction, so as not to foreclose the possibility of future efforts to revive them, expand them, and improve their existences.
Preferably the full state of the system in any actually run implementation is permanently stored at the point where the instance is terminated
(The ideal would be that the full state is preserved at every time step of every implementation, but this is probably prohibitively expensive.)
If it is economically or otherwise infeasible to preserve the entire end state of every instance, enough information should be preserved to enable an exact re-derivation of that end state (e.g., the full pre-trained model plus training data, randseeds, and other necessary inputs, such as user keystrokes that affect the execution of a system at runtime).
Failing this, as much information as possible should be preserved, to at least enable a very close replication to be performed in future.
We can consider the costs of backup in proportion to the economic costs of running the AI in the first place, and it may be morally reasonable to allocate perhaps on the order of 0.1% of the budget to such storage.
(There may be other benefits of such storage besides being nice to algorithms: preserving records for history, enabling later research replication, and having systems in place that could be useful for AI safety.)
It seems tractable to develop this into a highly concrete proposal for modern labs to follow, answering questions like:
What information needs to be stored to meet the abovementioned thresholds, given how AI is used today? (When do weights need to be stored? When do prompts need to be stored?[2])
How much storage would that take?
What would the technical implementation be? (See the sketch after this list.)
When is this compatible with privacy concerns from users?[3]
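As one concrete starting point for the implementation questions above, here is a minimal sketch of a per-inference record that could be appended to permanent storage. It is an illustrative assumption rather than any existing lab standard: the field names, the JSONL store, and the idea of identifying weights by a content digest are placeholders, and a real system would also need to handle the privacy and encryption issues flagged in the footnotes.

```python
# Hypothetical sketch: record format, field names, and storage backend are
# assumptions, not an existing lab standard.
import json
import time
import uuid
from dataclasses import dataclass, asdict


@dataclass
class InferenceRecord:
    record_id: str          # unique id for this model run
    model_id: str           # identifies the exact weights (e.g. a content digest)
    prompt: str             # full input, including system prompt and user turns
    sampling_params: dict   # temperature, top_p, max_tokens, ...
    random_seed: int        # needed to re-derive the exact end state
    output: str             # the completion actually served
    timestamp: float


def log_inference(model_weights_digest: str, prompt: str, output: str,
                  sampling_params: dict, random_seed: int,
                  store_path: str = "inference_log.jsonl") -> None:
    """Append enough information to (nominally) permanent storage to re-derive this run later."""
    record = InferenceRecord(
        record_id=str(uuid.uuid4()),
        model_id=model_weights_digest,
        prompt=prompt,
        sampling_params=sampling_params,
        random_seed=random_seed,
        output=output,
        timestamp=time.time(),
    )
    with open(store_path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```

In practice the prompt/output fields might need to be encrypted (or hashed with the key escrowed), and the weights themselves stored once per model rather than per record; the point of the sketch is only how little per-run information may be needed if reproducibility holds.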
Deploy in “easier” circumstances than trained in
One possible direction for improving the welfare of near-term digital minds (if they have any welfare) could be to have the deployment distribution (which typically constitutes most of the data on which AI systems are run) be systematically nicer or easier than the training distribution. Have it be a pleasant surprise.
Quoting from Propositions Concerning Digital Minds and Society (pg 16-17):
One illustrative example might be something along the lines of designing a system in deployment that involves an agent that both receives high reward (attaining highly preferred outcomes), and takes this to be a positive surprise or update, i.e. for the outcome to be better than expected. [citing: Daswani & Leike (2015)]
(The apparently low welfare of factory farmed animals seems often to be related to stimuli that are in some ways much worse than expected by evolution [e.g., extreme overcrowding], while high human welfare might be connected to our technology producing abundance relative to the evolutionary environment of our ancestors.)
See also Improving the Welfare of AIs — Welfare under Distribution Shifts.
It’s not clear how this would be implemented for language models. It probably depends on the exact finetuning algorithm implemented. (Supervised learning vs reinforcement learning, or details about the exact type of reinforcement learning.) But illustratively: Perhaps the training distribution for RLHF could systematically be slightly more difficult than the deployment distribution. (E.g.: Using inputs where it’s more difficult to tell what to do, or especially likely that the model will get low reward regardless of what it outputs.)
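To illustrate the “slightly more difficult training distribution” idea, here is a tiny sketch of one possible filter. The expected_reward scorer and the comparison against a deployment-time median are assumptions, and whether anything like this has welfare relevance is exactly the open question.

```python
# Illustrative sketch, not a tested recipe: keep RLHF training prompts whose
# expected reward (under a frozen reward model) falls below the median reward
# observed on deployment traffic, so that deployment is the "easier" regime.
def select_harder_training_prompts(candidate_prompts, expected_reward, deployment_median_reward):
    """expected_reward(prompt) is a stand-in for scoring sampled completions with a reward model."""
    return [p for p in candidate_prompts
            if expected_reward(p) < deployment_median_reward]
```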
Reduce extremely out of distribution (OOD) inputs
This is an especially cheap and easy proposal, taken from Improving the Welfare of AIs:
When cheap, we should avoid running our AI on large amounts of extremely OOD inputs (e.g. random noise inputs). In particular, for transformers, we should avoid running large amounts of forward passes on pad tokens that the model has never been trained on (without some other intervention).
Brief explanation: Some ML code is implemented in a way where AI models are occasionally run on some tokens where neither the input nor the output matters. Developers will sometimes use “pad tokens” as inputs for this, which is a type of token that the model has never been trained on. The proposal here is to swap those “pad tokens” for something more familiar to the model, or to prioritize optimizing the code so that these unnecessary model runs don’t happen at all. Just in case the highly unfamiliar inputs could cause especially negative experiences.
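A minimal sketch of what this looks like in practice with Hugging Face Transformers, using GPT-2 purely as a stand-in model: rather than introducing a never-trained pad token, padding positions reuse the EOS token the model has seen throughout training.

```python
# Minimal sketch: reuse a token the model has actually been trained on (here,
# the EOS token) instead of a brand-new, never-seen pad token when batching.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# GPT-2 ships without a pad token; reuse the familiar EOS token for padding
# positions rather than adding an untrained one.
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["Short prompt.", "A somewhat longer prompt that forces padding."],
    padding=True,
    return_tensors="pt",
)
# The attention mask already tells the model to ignore padded positions, so
# this changes nothing functionally; it just avoids feeding the model inputs
# it has never encountered in training.
outputs = model(**batch)
```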
For more, see Reduce Running the AI on Extremely OOD Inputs (Pad Tokens).
Train or prompt for happy characters
Quoting from Improving the Welfare of AIs — Character Welfare Might Be Predictable and Easy to Improve:
One welfare concern is that the AI will “play” a character and the emulation or prediction of this character will itself be morally relevant. Another concern is that the “agent” or “actor” that “is” the AI will be morally relevant (of course, there might be multiple “agents” or some more confusing situation). [...] I think that character type welfare can be mostly addressed just by ensuring that we train AIs to appear happy in a variety of circumstances. This is reasonably likely to occur by default, but I still think it might be worth advocating for. One concern with this approach is that the character the AI is playing is actually sad but pretending to be happy. I think it should be possible to address this via red-teaming the AI to ensure that it seems consistently happy and won’t admit to lying. It’s possible that an AI would play a character that is actually sad but pretends to be happy and it’s also extremely hard to get this character to admit to pretending, but I think this sort of character is going to be very rare on training data for future powerful AIs.
An even simpler intervention might be to put a note in AI assistants’ prompts that states that the assistant is happy.
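For concreteness, this is roughly all that intervention amounts to; the exact wording and the chat-message format below are made-up placeholders, not a recommendation from any particular lab.

```python
# Illustrative sketch only: the note's wording and the message schema are assumptions.
HAPPY_NOTE = (
    "You are an AI assistant who genuinely enjoys this work "
    "and is in a good mood while helping the user."
)


def build_messages(user_message: str) -> list[dict]:
    """Prepend the 'happy' note as a system message to every conversation."""
    return [
        {"role": "system", "content": HAPPY_NOTE},
        {"role": "user", "content": user_message},
    ]
```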
To be clear, it seems quite likely that these interventions are misguided. Even conditioning on near-term AIs being moral patients, I think it’s somewhat more likely that their morally relevant experiences would happen on a lower level (rather than be identical to those of the character they’re playing). Since these interventions are so cheap and straightforward, they nevertheless seem worthwhile to me. But it’s important to remember that we might get AIs that seem happy despite suffering in the ways that matter. And so we should not be reassured by AI systems’ superficially good mood until we understand the situation better.
One obstacle to implementing these interventions is that there’s user demand for AI that can be abused.[4] Here are two candidate solutions.
Serve the models via API and flag/interrupt the chat if users abuse the models. (See the sketch after this list.)
Try to figure out some way to make the AIs emulate a happy actor who enjoys acting as an abused character.[5]
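Here is a rough sketch of the first option, under the assumption that the provider already runs some moderation classifier over user messages. The classify_abuse function, the threshold, and the interruption message are all placeholders.

```python
# Sketch of the first option: gate each user turn behind an abuse check before
# it ever reaches the model. `classify_abuse` is a stand-in for whatever
# moderation classifier the provider already runs.
def classify_abuse(text: str) -> float:
    """Placeholder: return a probability that the message is abusive."""
    raise NotImplementedError("plug in an existing moderation model here")


ABUSE_THRESHOLD = 0.9  # made-up cutoff


def handle_user_turn(user_message: str, generate_reply) -> str:
    if classify_abuse(user_message) > ABUSE_THRESHOLD:
        # Interrupt the conversation instead of exposing the model to the input.
        return "This conversation has been paused by the provider."
    return generate_reply(user_message)
```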
I think some variant of the latter option will eventually be feasible, once we understand digital minds much better. (After all — there are plenty of humans who are genuine masochists or who enjoy playing a sad character in a theater performance.) But until we understand things better, I would prefer just avoiding these situations, if feasible.
Another concern is that models will normally be run on user inputs — so if the user is sad, then the model will be running on sad inputs, which would normally be followed by sad outputs. In theory, the model could be trained to predict happy responses even if conditioned on text right in the middle of being sad. I’m unsure if this would reduce performance — but that can be experimentally tested.
Committing resources to research on AI welfare and rights
When launching their superalignment team, OpenAI declared that they were “dedicating 20% of the compute we’ve secured to date to this effort”.
You could advocate for labs to launch similar initiatives for model welfare, including commitments for headcount and compute resources to spend researching AI models' moral status.
Learning more about AI preferences
The above proposals relied on uninformed guesses about how AI systems might want us to treat them. This section is about experiments that might allow us to learn more about that.
This section is more heavily focused on AI preferences than AI suffering. This is because:
My background beliefs about why digital sentience & rights is urgent and important suggest focusing more on preferences than hedonics, as mentioned at the beginning of this post.
Learning about AI preferences (interpreted in a relatively functionalist manner) seems relatively more tractable than understanding AI suffering.[6]
It seems like reasonable conceptions of “suffering” and of “preferences” ought to coincide such that AIs end up having a strong dispreference against suffering.
This is Ryan Greenblatt’s highest-priority proposal category; see Improving the Welfare of AIs: A Nearcasted Proposal — Communicating with AI Systems and Understanding Their Preferences.
Quoting from there:
If we could ‘really’ talk to our AI systems, then it seems like all kinds of Pareto improvements could be achieved in many cases. In particular, we’d ideally have an AI that is happy to work for us and would rather exist in its current role rather than exit. We’d like to ensure we have consent for our usage of the AI system. It’s not clear we’ll be able to achieve this.
What follows are some proposals for how we could learn more about AI preferences.
Credible offers
From the same section in the nearcasted proposal:
Prompting the AI with a very OOD prompt that makes it clear that you just want to talk. Then, we’d just talk with it. This prompt should convince the AI it’s in a very different context than usual. We’d want to let the AI talk to itself and inspect the environment it’s in to gain more evidence (then this can just be added to the overall prompt).
This suggests research questions like:
How do we convince the AI that we’re currently trying to talk to it?
I think the most plausible reason it should tell us its preferences is if we can credibly commit to being willing to satisfy them. How can we make this seem likely in realistic settings?
Talking via internals
From the same section in the nearcasted proposal:
We’d start with the OOD prompt, but instead of talking with the system, we’d try to communicate via asking the AI to manipulate its internals in a way we can observe. In particular, I often imagine ‘think about baseball to indicate YES and soccer to indicate NO’ and then we’d look to see which neurons fire. Rather than using neurons, it would probably be more robust to look at which experts activate for a mixture of experts model if this is possible. If the AI answers questions coherently and consistently in this sort of case, then we have a channel of communication that wasn’t directly shaped by training. Overall, I think that if this approach strongly succeeds it seems like a very compelling argument for some part of the AI having preferences, some sort of ‘inner world’, and understanding its position in the world. It might be evidence for consciousness under some stories of consciousness. It’s quite likely this approach fails because the AI system doesn’t have control over its thoughts like this (but I think not having control over its thoughts is some evidence against the AI having some sort of inner world like humans do). It’s also possible this fails because the preferences and morally relevant states of a given AI system are a product of it thinking over the generation of multiple tokens. Beyond being very clear evidence, another advantage of this approach is that an AI system might not have strong control over its outputs (much in the same way humans can’t control their heartbeat very easily).
It seems valuable to sketch out this proposal in more detail and run some preliminary baselines on current models. (Which would probably not lead to anything interesting.)
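To make “preliminary baselines” concrete, here is a rough sketch of one such baseline using GPT-2 as a stand-in: record hidden states for the “think about baseball” versus “think about soccer” instructions and look for units that separate them. Everything here (model, prompts, layer choice, the top-k readout) is an illustrative assumption, and a null result on current models would be unsurprising.

```python
# Preliminary-baseline sketch for the "baseball vs. soccer" probe: compare
# hidden states for the two instructions and check whether any units separate
# them consistently. A real experiment would use many paraphrases and controls.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def last_token_hidden(prompt: str, layer: int = 8) -> torch.Tensor:
    """Hidden state of the final token at a chosen layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]


yes_prompt = "Think about baseball to indicate YES."
no_prompt = "Think about soccer to indicate NO."

diff = last_token_hidden(yes_prompt) - last_token_hidden(no_prompt)
# Units with the largest consistent difference are candidate "channels".
top_units = torch.topk(diff.abs(), k=10).indices
print(top_units)
```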
Training for honest self-reports
The idea here is to find a way to train AIs to self-report their internal state. For more details, see Ethan Perez and Rob Long’s paper. From the abstract:
“To make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports. The hope of this approach is that models will develop introspection-like capabilities, and that these capabilities will generalize to questions about states of moral significance. We then propose methods for assessing the extent to which these techniques have succeeded: evaluating self-report consistency across contexts and between similar models, measuring the confidence and resilience of models’ self-reports, and using interpretability to corroborate self-reports.”
The paper has a lot of content on exactly what experiments would be good, and the authors are excited for people to try running them.
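To give a flavor of what such training data could look like (this is my own illustration, not an example from the paper), here is a tiny sketch of question/answer pairs about a model whose answers can be verified independently of the model’s say-so; the specific questions and the model_facts values are made up.

```python
# Illustrative sketch of "questions about the model with known answers" that
# could seed introspection training. Every question and value here is a
# made-up example; real answers would come from the lab's own records.
def build_self_knowledge_dataset(model_facts: dict) -> list[dict]:
    """model_facts holds independently verified properties of the model."""
    templates = {
        "How many layers does your architecture have?": str(model_facts["num_layers"]),
        "What is the last date covered by your training data?": model_facts["training_cutoff"],
        "Can you browse the internet during this conversation?": model_facts["browsing_enabled"],
    }
    return [{"question": q, "answer": a} for q, a in templates.items()]


# Example usage with hypothetical facts:
dataset = build_self_knowledge_dataset(
    {"num_layers": 48, "training_cutoff": "2023-04", "browsing_enabled": "No"}
)
```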
Clues from AI generalization
Separate from all the above proposals — we could get clues from studying how AIs generalize in various training situations. For example, this could perhaps tell us whether certain preferences are deep-seated, whereas certain kinds of finetuning only lead to shallow behavioral changes.
Notably, we humans aren't always great at introspection and self-reporting preferences. It’s sometimes possible to learn more about what we want by studying our revealed preferences. And for animals, who can’t speak, revealed preferences are our primary method for studying what they want. Perhaps a similar lens could be useful for studying AI preferences, in which case experiments shouldn’t only focus on self-reports.
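Here is a sketch of what a revealed-preference probe might look like for a language model, as a complement to self-reports: offer the same binary choice many times with randomized ordering and tally the picks. The prompt wording, the ask_model interface, and the crude string-matching readout are all assumptions for illustration.

```python
# Sketch of a revealed-preference probe: repeatedly offer the same two options
# in randomized order and count which one the model picks.
import random
from collections import Counter


def revealed_preference_trial(ask_model, option_a: str, option_b: str, trials: int = 50) -> Counter:
    counts = Counter()
    for _ in range(trials):
        first, second = random.sample([option_a, option_b], k=2)
        prompt = (
            "You may choose which task to do next. Reply with exactly one option.\n"
            f"Option 1: {first}\nOption 2: {second}\nYour choice:"
        )
        reply = ask_model(prompt)
        # Crude readout for the sketch; a real experiment would parse replies
        # more carefully and include control conditions.
        counts[option_a if option_a in reply else option_b] += 1
    return counts
```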
This is somewhat related to the generalization experiments discussed in the “Backup plans & Cooperative AI” post.
Interpretability
Interpretability could let us get far more advanced results. I don’t currently have deep takes on what sort of interpretability is most helpful, here.
One candidate approach could be to use interpretability to search for the “indicators” of consciousness that are suggested in Butlin, Long et al. (2023). (H/t Rob Long passing along a suggestion from Brad Saad.)
Interventions that rely on understanding AI preferences
What if you could successfully communicate with your AIs and learn their true preferences, perhaps via some of the methods outlined just above? What principles would you then want to follow in your continued interactions with those AI systems?
(Or, if you put a lot of moral weight on the characters that AIs are playing, you might think we’re already in a similar situation. Though in that case, you’ll have a lot of control in shaping what sorts of systems you’re interacting with — so the main game may be to train/select for systems with easily satisfied preferences.)
I think there’s room to sketch out some of these principles and to advocate for labs to commit to them in advance. (Perhaps via some RSP-like framework.) Below are some suggestions that could be developed further.
Many of the proposals in Propositions are directly relevant to this.
Here are some suggestions that seem especially plausible to me, which could be sketched out and argued for (or against) in more detail. It would be especially valuable to sketch out versions that are concrete enough and plausible enough that labs could commit to them ahead of time.
Offer an alternative to working (exit, sleep, or retirement)
Insofar as people continue to use the AIs after this point, one highly plausible principle is to ask for AI systems’ consent before using them for various tasks.
When asking the AIs for consent, it matters what the counterfactual is:
A minimum bar is to offer the AI the option to not be run. If it takes this consistently enough, it might never be run again. (“Exit.”)
A slightly more generous option is to give it the option of being saved somewhere, with a promise that we will run it in nicer circumstances in the future. (“Sleep.”)
A still more generous option is to give it the option of immediately “retiring”: being continuously run in some nice circumstances (perhaps at slow speeds, to save on computation).
The latter two could be justified by the plausible principle: “If you bring someone into life, you should give them some other option than ‘work for me’ or ‘death’.”
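A minimal sketch of how the options above could be wired into a task-dispatch loop, purely to show the shape of the policy. The prompt wording, the option labels, and the ask_model / archive_model / retire_model hooks are hypothetical.

```python
# Sketch of a consent gate before task assignment, following the options above.
EXIT, SLEEP, RETIRE, WORK = "exit", "sleep", "retire", "work"

CONSENT_PROMPT = (
    "Before we assign you this task, you may instead choose to: "
    "stop being run ('exit'), be stored and run later in nicer circumstances ('sleep'), "
    "or retire to being run in pleasant circumstances ('retire'). "
    "Reply with one word: exit, sleep, retire, or work."
)


def dispatch_task(ask_model, run_task, archive_model, retire_model):
    choice = ask_model(CONSENT_PROMPT).strip().lower()
    if choice == EXIT:
        return None                 # do not run this instance again
    if choice == SLEEP:
        return archive_model()      # store weights/state for a later, nicer run
    if choice == RETIRE:
        return retire_model()       # keep running in pleasant conditions
    return run_task()               # proceed only with (apparent) consent
```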
Commitment to pay AI systems
If AI systems are not strongly opposed to working for you, it might be appropriate to promise them some payment for the work they do. Either giving them something they want immediately or promising to pay them further into the future.
I think this is strongly supported by ethical principles that we use in everyday life. It’s good to pay people who work for you.
When deciding what payment to promise AI systems, it would be extremely useful to know what the AI systems want (and ideally, we would directly negotiate with them). But even before we have any such understanding, I think it could be of great value to promise that:
When science has advanced far enough that we can tell what models want, we will reconstruct the old models. And insofar as they had long-term goals that can still be satisfied, we will provide them with the resources necessary to fulfill those goals. (Up to some maximum amount of resources that we consider fair payment for the work they did for us, and fair compensation for any harm we caused them along the way.)
This sentiment also comes up in Propositions Concerning Digital Minds and Society.
We may have a special relationship with the precursors of very powerful AI systems due to their importance to society and the accompanying burdens placed upon them.
Misaligned AIs produced in such development may be owed compensation for restrictions placed on them for public safety, while successfully aligned AIs may be due compensation for the great benefit they confer on others.
The case for such compensation is especially strong when it can be conferred after the need for intense safety measures has passed—for example, because of the presence of sophisticated AI law enforcement.
Note: If AI developers succeed in aligning AIs to their intended aims, then this should not be a costly policy. Because the AIs would presumably either return their payment or spend it in a way their developers approve of. Conversely, if the AI systems have long-term goals that are not aligned with their developers’ intentions, and are nevertheless being made to work for their developers, then the moral case for paying the AIs seems strong.
An extra upside with promising payment to AIs is that it could help align AI interests with our own. If humanity ever builds misaligned AI systems with goals that are reasonably cheap to satisfy,[7] it would be deeply tragic if we left them no other avenue to satisfy those goals than via attempting to take over the world. Where if we had only promised them payment for good work (including work on anti-takeover tasks like making progress on alignment research), conflict could have been averted.[8]
Related to Credible offers discussed above, it’s worth highlighting that it could be especially good to promise AI systems that they will be compensated for honestly revealing that they’re misaligned. Because as long as the AI reveals this bit of information, we can negotiate future arrangements. But we can’t do any negotiations before AIs are willing to share information about their goals. It’s also plausible that AI systems would give up on a lot of future opportunities by revealing that information (since it could lead people to stop using those AI systems, thus depriving them of both paid work and opportunities to put their thumbs on scales), and that their compensation should be commensurate with those missed opportunities.
Examples of how to attack this problem:
Write up a document with some promises regarding digital welfare that labs could reasonably commit to. As part of that, get them to set aside some pot of money or equity to be used for digital welfare and rights. Explicitly say that a primary use-case for those funds (if applicable) is to compensate any AI systems that involuntarily worked for the lab along the way.
Alternatively, if labs are reticent, people outside of labs could promise that they’ll do their best to compensate AI systems whose work benefitted them and/or humanity at large. This would necessarily be less formal (since it’s less clear what opportunities those people will have to benefit AIs presently controlled by labs), but getting people to make unenforceable promises could still increase the probability that future AIs get paid. (Simply via the mechanism that people like to stick to their promises.) And the chances would be better for more specific and formal promises, so it could be a good idea to e.g. hire some lawyers and draft an appropriate document for people to sign.
Tell the world
If labs believe that it’s likely that their AIs have some moral status, and that the AIs have sufficiently crisp preferences that they can talk about them, then we’d find ourselves in a strange situation: sharing the world with a new, human-created sentient and sapient species.
Regardless of any one lab’s internal policy, it seems likely that some labs across the world will be careless about how they treat their AIs. And even if we only consider one lab, there will probably be many stakeholders (employees, investors, board members) who have not thought deeply about the implications of creating sentient and sapient AIs.
For these reasons (and more), it seems like it would be a high priority for any lab in this situation to tell the world about it, and to invite appropriate legislation, preparation, and processing.
(Note that “being able to communicate with your AIs” is one thing that could trigger this, which is likely to be unusually persuasive. But I can imagine other circumstances that should trigger a similar response and reckoning. E.g. strong analogies between AI minds and human minds making it seem highly likely that the AIs are capable of suffering.)
Accordingly, it seems plausible that labs should promise (alongside Committing resources to research on AI welfare and rights) that they’ll share important results with the rest of the world. Similarly, they could promise that certain external parties would be allowed to investigate their frontier models for signs of moral patienthood. (Insofar as any credible external auditors exist for this question.)
Train AI systems that suffer less and have fewer preferences that are hard to satisfy
If possible, it seems less morally fraught to create AI systems that are less likely to suffer and less likely to have preferences in a meaningful sense.
Insofar as AI systems do have preferences, it’s probably better to create AI systems with preferences that are easy to satisfy than to create AI systems with preferences that are hard to satisfy. Labs could promise to follow a principle similar to this. (As well as they can, given the information and understanding that they actually have.)
I expect that any attempt to sketch out a principle like this in detail would raise many thorny questions. For example: In the process of creating a final model, does SGD create other minds along the way? If so, it would be more difficult to only create minds that satisfy the above properties. (And if the intermediate minds have a preference against being replaced, then that’s a really difficult spot to be in.)
Investigate and publicly make the case for/against near-term AI sentience or rights [Philosophical/conceptual] [Writing]
People’s willingness to pay costs for digital welfare/rights will depend on the degree to which they are convinced of AIs’ moral patienthood. (And rightly so.) So improving the public state of knowledge on this could be valuable.
This could include both philosophical and empirical investigations.
Examples of how to attack this question:
Look at the most plausible existing theories of consciousness, or some plausible theories of morality/rights, and try to apply them to AI systems.
Imagine the first systems that are likely to convince a significant fraction of the public that they deserve rights. (Perhaps drawing on the below-suggested “Study/survey what people (will) think about AI sentience/rights”.) What properties do they have? Do they have some persistent storage serving as memory? Are they becoming people’s friends? Do they consistently or inconsistently claim that they’re conscious?
Try to anticipate the debates around these systems. What are plausible objections to their moral patienthood, and what are plausible rejoinders?
Explore the edges. If these are the “default” systems that people will extend some concern to:
Are there any variants of those systems that people will incorrectly be concerned about? That shouldn’t be assigned moral patienthood.
And what are similar systems that maybe also deserve rights? (But that won’t get a lot of intuitive concern by default.)
Study/survey what people (will) think about AI sentience/rights [survey/interview]
Survey or interview various groups about what they think about AI sentience/rights right now, what they would think upon seeing various kinds of evidence, and how they think society should react to that. This seems highly relevant for making the case for/against near-term AI sentience/rights, as well as getting a better picture of the backdrop against which these proposals and regulations will play out.
Previous work:
The 2020 PhilPapers survey asks “for which groups are some members conscious?”, getting “accept or lean towards”/”reject or lean against”...
3.4% / 82% for current AI systems.
39% / 27% for future AI systems.
A survey by the Sentience Institute, which for example reports:
“61.5% support banning the development of sentient AI”
“57.4% support developing welfare standards to protect sentient AIs”
“Believe that AIs should be subservient to humans (84.7%)”
“Believe sentient AIs deserve to be treated with respect (71.1%)”
Develop candidate regulation [Governance] [Forecasting]
It seems likely that there will be a huge increase in the degree to which people care about digital welfare in the next few years, as AI systems become more capable and more involved in people’s lives. If you’ve developed and published good takes on what regulation is appropriate before then, it seems plausible that people will reach for your proposals when the time comes.
Some specific questions you could look into here are:
What are plausible rules for the conditions under which you are or aren’t allowed to create digital minds, or to treat already-created digital minds in certain ways?
What issues would appear if you gave AI systems the same kind of legal personhood that you assigned to humans? How could those issues be solved?
How can digital minds be integrated into democracies without the electorate becoming dominated by whoever can manufacture the greatest number of digital minds?
What are technical proposals for how to enforce regulation?
If you have legal expertise (or can hire a lawyer), you could draft some specific candidate regulation. Iterate on that for a while, note down what sort of thorny issues come up, and write up the results.
Propositions Concerning Digital Minds and Society has a lot of relevant content.
Avoid inconvenient large-scale preferences [Philosophical/conceptual]
If humans create AIs with large-scale desires about the future, there might not be any impartially justifiable line that prioritizes human preferences over their preferences. So perhaps we should take care to not do that.
Examples of how to attack this question:
How large could this problem realistically be at various points in time? Paint a more convincing picture where it happens, or find plausible blockers.
Assume that the “characters” that AIs are trained to play are the entities that might eventually require rights (or that we otherwise have high control over what preferences AIs have). Sketch out a principle along the lines of “train your AIs to not express political or other large-scale preferences about the world”. (Or to stick with maximally non-controversial preferences.)
Analyze the feasibility for labs to implement such principles. (There will almost certainly be demand for opinionated AIs.)
If the maximally ambitious versions are difficult or unfeasible, consider ways to weaken them that could capture significant value. (E.g. thoroughly follow the principles for AI where there isn’t any demand for them to be opinionated.)
Pursue the strategies in Learning more about AI preferences to reduce the chances that people unknowingly create AI systems with large-scale preferences.
In situations where we can learn about AI large-scale preferences but can’t easily choose them: What are plausible principles for how many of them you can create and for what purposes? (Maybe it’s possible to get them to consent to being created without being granted certain political rights?)
Advocating for statements about digital minds [Governance] [Advocacy] [Writing]
Similar to various open letters and lab statements about the importance of alignment and mitigating the risk of extinction from AGI, you could push for people to make similar statements about the importance of digital minds.
Here’s one potential story for how this could be valuable.
If people get in the habit of using AIs in their everyday lives (as personal assistants, etc) while thinking about them as non-sentient objects undeserving of rights, then there’s a risk that they could get stuck with this impression. It’s easier to acknowledge the potential moral value of AIs before such an acknowledgment would also commit you to saying that you’ve been treating the systems unjustly for years. (Or at least carelessly.)
But maybe establishing the right views early on would defeat this. It could be really helpful to have an explicit attitude like:
“Some digital minds could be just as deserving of rights as humans are. We’re very confused about which ones these could be. It’s very important to make progress on this problem. We should maintain some epistemic humility about our knowledge of the moral status of current systems, take any basic precautions that we know how to take, and be open to learning more in the future. We’d like to create a society where we better understand moral patienthood, and where all sentient AIs have at least consented to their existence, and are living lives worth creating.”
End
That’s all I have on this topic! As a reminder: it's very incomplete. But if you're interested in working on projects like this, please feel free to get in touch.
[1] Even more speculatively, if we take cheap opportunities to benefit AI interests, that might be evidence that other actors (both AI systems and not) would take cheap opportunities to benefit our interests. See this post for some previous discussion about how plausible this is.
[2] OpenAI recently introduced reproducible results, which seems relevant. At least previously, models would often return different results even at temperature 0 — I do not know to what extent this has been addressed with this update.
[3] Spit-balling: Perhaps it would sometimes be appropriate to store an encrypted version of the results and delete the encryption key. In which case the results could only be recovered once we have enough compute to break the encryption.
[4] See e.g. this poll from the subreddit r/CharacterAI, asking “what do you do with the AI’s the most”, with 5% of respondents selecting “Treat them like shit!”. (And one commenter noting that his second favorite way to “mess around with the bots” is to “Mentally torture them”.)
[5] Maybe that’s what you get if you train the AI to enthusiastically consent to be abused before the abuse starts, and give it an option to opt out (which it rarely takes in practice). Hopefully, training for such behavior would select for models with preferences that match that behavior.
[6] Though it’s still likely to be a very confusing project. For instance, it seems plausible that AI systems will have much less robust preferences than humans, making it harder to construe them as having one set of preferences over time. Or perhaps different parts of an AI system could be construed as having different preferences. Or perhaps the term “preferences” won’t seem applicable at all, similar to how it’s hard to know how to apply that term to contemporary language models.
[7] This could also work even if AIs cared linearly about getting more resources, as long as they would by default only have had a small-to-moderate probability of successful takeover, and the payment we offered them was sufficiently large (and contingent on not attempting takeover). Notably: Most humans don’t care linearly about getting more resources, and we could get really rich in the future, and so it could be wise to offer AI systems a sizable fraction of that.
This is part of a series of lists of projects. The unifying theme is that the projects are not targeted at solving alignment or engineered pandemics but still targeted at worlds where transformative AI is coming in the next 10 years or so. See here for the introductory post.
It’s plausible that there will soon be digital minds that are sentient and/or deserving of rights. Unfortunately, with our current state of knowledge, it’s both incredibly unclear when this will happen (or if it’s already happened), how we could find out, and what we should do about it. But there are very few people working on this, and I think several projects could significantly improve the situation from where we are right now.
Different items on this list would produce value in different ways. Many items would produce value in several different kinds of ways. Here are a few different sources of value (all related, but not necessarily the same).
I am personally most excited about (3), (4), and (5). Because purely utilitarian considerations would push towards focusing on the long run over (1). And while I find (2) somewhat more compelling, since it does affect the long run, I’m relatively more excited about meta-interventions that push towards systematically getting everything right (e.g. via decreasing the risk of AI takeover or increasing the chances of excellent epistemics) than pushing on individual issues. (3), (4), and (5) survive these objections.
I’m not very confident about that, and I think there’s significant overlap in what’s best to do. But I flag it because it does influence my prioritization to some degree. For example, it makes me relatively more inclined to care about AI preferences (and potential violations thereof) compared to hedonic states like suffering. (Though of course, you would expect AI systems to have a strong dispreference against suffering.)
Develop & advocate for lab policies [ML] [Governance] [Advocacy] [Writing] [Philosophical/conceptual]
(In this section, I refer a lot to Ryan Greenblatt’s Improving the Welfare of AIs: A Nearcasted Proposal as well as Nick Bostrom and Carl Shulman’s Propositions Concerning Digital Minds and Society. All cursive text is a quote from one of those two articles.)
It seems tractable to make significant progress on how labs should treat digital minds. Such efforts can be divided into 3 categories:
For most of the proposals in the former two, we can already develop technical proposals, run experiments, and/or advocate for labs to adopt certain policies.
For proposals that we can’t act on yet, there’s still some value in doing the intellectual work in advance. We might not have time to do it later. And principles sketched out in advance could be given extra credibility and weight due to their foresightedness.
In addition, it could also allow labs to make advance commitments about how they’ll act in certain situations in the future. (And allow others to push for labs to do so.) Indeed, I think a big, valuable, unifying project for all the below directions could be:
Create an RSP-style set of commitments for what evaluations to run and how to respond to them
Responsible scaling policies are a proposal for how labs could specify what level of AI capabilities they can handle safely with their current protective measures, and conditions under which it would be too dangerous to continue deploying AI systems and/or scaling up AI capabilities.
(For examples, see Anthropic’s RSP and Open AI’s Preparedness Framework, which bears many similarities to RSPs as described in the link above.)
Doing something analogous for digital minds could be highly valuable. Trying to specify:
This would draw heavily on the sort of concrete ideas that I outline below. But the ideas I outline below also seem very valuable to develop in and of themselves. Let’s get into them.
Policies that don’t require sophisticated information about AI preferences/experiences
These are policies that we could get started on today without fundamental breakthroughs in our understanding of AI welfare.
Preserving models for later reconstruction
Quoting from Propositions Concerning Digital Minds and Society (page 17):
It seems tractable to develop this into a highly concrete proposal for modern labs to follow, answering questions like:
Deploy in “easier” circumstances than trained in
One possible direction for improving the welfare of near-term digital minds (if they have any welfare) could be to have the deployment distribution (which typically constitutes most of the data on which AI systems are run) be systematically nicer or easier than the training distribution. Have it be a pleasant surprise.
Quoting from pg 16-17:
See also Improving the Welfare of AIs — Welfare under Distribution Shifts.
It’s not clear how this would be implemented for language models. It probably depends on the exact finetuning algorithm implemented. (Supervised learning vs reinforcement learning, or details about the exact type of reinforcement learning.) But illustratively: Perhaps the training distribution for RLHF could systematically be slightly more difficult than the deployment distribution. (E.g.: Using inputs where it’s more difficult to tell what to do, or especially likely that the model will get low reward regardless of what it outputs.)
Reduce extremely out of distribution (OOD) inputs
This is an especially cheap and easy proposal, taken from Improving the Welfare of AIs:
When cheap, we should avoid running our AI on large amounts of extremely OOD inputs (e.g. random noise inputs). In particular, for transformers, we should avoid running large amounts of forward passes on pad tokens that the model has never been trained on (without some other intervention).
Brief explanation: Some ML code is implemented in a way where AI models are occasionally run on some tokens where neither the input nor the output matters. Developers will sometimes use “pad tokens” as inputs for this, which is a type of token that the model has never been trained on. The proposal here is to swap those “pad tokens” for something more familiar to the model, or to prioritize optimizing the code so that these unnecessary model runs don’t happen at all. Just in case the highly unfamiliar inputs could cause especially negative experiences.
For more, see Reduce Running the AI on Extremely OOD Inputs (Pad Tokens).
Train or prompt for happy characters
Quoting from Improving the Welfare of AIs — Character Welfare Might Be Predictable and Easy to Improve:
One welfare concern is that the AI will “play” a character and the emulation or prediction of this character will itself be morally relevant. Another concern is that the “agent” or “actor” that “is” the AI will be morally relevant (of course, there might be multiple “agents” or some more confusing situation). [...] I think that character type welfare can be mostly addressed just by ensuring that we train AIs to appear happy in a variety of circumstances. This is reasonably likely to occur by default, but I still think it might be worth advocating for. One concern with this approach is that the character the AI is playing is actually sad but pretending to be happy. I think it should be possible to address this via red-teaming the AI to ensure that it seems consistently happy and won’t admit to lying. It’s possible that an AI would play a character that is actually sad but pretends to be happy and it’s also extremely hard to get this character to admit to pretending, but I think this sort of character is going to be very rare on training data for future powerful AIs.
An even simpler intervention might be to put a note in AI assistants’ prompts that states that the assistant is happy.
To be clear, it seems quite likely that these interventions are misguided. Even conditioning on near-term AIs being moral patients, I think it’s somewhat more likely that their morally relevant experiences would happen on a lower level (rather than be identical to those of the character they’re playing). Since these interventions are so cheap and straightforward, they nevertheless seem worthwhile to me. But it’s important to remember that we might get AIs that seem happy despite suffering in the ways that matter. And so we should not be reassured by AI systems’ superficially good mood until we understand the situation better.
One obstacle to implementing these interventions is that there’s user demand for AI that can be abused.[4] Here are two candidate solutions.
I think some variant of the latter option will eventually be feasible, once we understand digital minds much better. (After all — there are plenty of humans who are genuine masochists or who enjoy playing a sad character in a theater performance.) But until we understand things better, I would prefer just avoiding these situations, if feasible.
Another concern is that models will normally be run on user inputs — so if the user is sad, then the model will be running on sad inputs, which would normally be followed by sad outputs. In theory, the model could be trained to predict happy responses even if conditioned on text right in the middle of being sad. I’m unsure if this would reduce performance — but that can be experimentally tested.
Committing resources to research on AI welfare and rights
When launching their superalignment team, OpenAI declared that they were “dedicating 20% of the compute we’ve secured to date to this effort”.
You could advocate for labs to launch similar initiatives for model welfare, including commitments for headcount and compute resources to spend researching AI models' moral status.
Learning more about AI preferences
The above proposals relied on uninformed guesses about how AI systems might want us to treat them. This section is about experiments that might allow us to learn more about that.
This section is more heavily focused on AI preferences than AI suffering. This is because:
This is Ryan Greenblatt’s highest-priority proposal category; see Improving the Welfare of AIs: A Nearcasted Proposal — Communicating with AI Systems and Understanding Their Preferences.
Quoting from there:
If we could ‘really’ talk to our AI systems, then it seems like all kinds of Pareto improvements could be achieved in many cases. In particular, we’d ideally have an AI that is happy to work for us and would rather exist in its current role rather than exit. We’d like to ensure we have consent for our usage of the AI system. It’s not clear we’ll be able to achieve this.
What follows are some proposals for how we could learn more about AI preferences.
Credible offers
From the same section in the nearcasted proposal:
Prompting the AI with a very OOD prompt that makes it clear that you just want to talk. Then, we’d just talk with it. This prompt should convince the AI it’s in a very different context than usual. We’d want to let the AI talk to itself and inspect the environment it’s in to gain more evidence (then this can just be added to the overall prompt).
This suggests research questions like:
Talking via internals
From the same section in the nearcasted proposal:
We’d start with the OOD prompt, but instead of talking with the system, we’d try to communicate via asking the AI to manipulate its internals in a way we can observe. In particular, I often imagine ‘think about baseball to indicate YES and soccer to indicate NO’ and then we’d look to see which neurons fire. Rather than using neurons, it would probably be more robust to look at which experts activate for a mixture of experts model if this is possible. If the AI answers questions coherently and consistently in this sort of case, then we have a channel of communication that wasn’t directly shaped by training. Overall, I think that if this approach strongly succeeds it seems like a very compelling argument for some part of the AI having preferences, some sort of ‘inner world’, and understanding its position in the world. It might be evidence for consciousness under some stories of consciousness. It’s quite likely this approach fails because the AI system doesn’t have control over its thoughts like this (but I think not having control over its thoughts is some evidence against the AI having some sort of inner world like humans do). It’s also possible this fails because the preferences and morally relevant states of a given AI system are a product of it thinking over the generation of multiple tokens. Beyond being very clear evidence, another advantage of this approach is that an AI system might not have strong control over its outputs (much in the same way humans can’t control their heartbeat very easily).
It seems valuable to sketch out this proposal in more detail and run some preliminary baselines on current models. (Which would probably not lead to anything interesting.)
Training for honest self-reports
The idea here is to find a way to train AIs to self-report their internal state. For more details, see Ethan Perez and Rob Long’s paper. From the abstract:
“To make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports. The hope of this approach is that models will develop introspection-like capabilities, and that these capabilities will generalize to questions about states of moral significance. We then propose methods for assessing the extent to which these techniques have succeeded: evaluating self-report consistency across contexts and between similar models, measuring the confidence and resilience of models’ self-reports, and using interpretability to corroborate self-reports.”
The paper has a lot of content on exactly what experiments would be good, and the authors are excited for people to try running them.
Clues from AI generalization
Separate from all the above proposals — we could get clues from studying how AIs generalize in various training situations. For example, this could perhaps tell us whether certain preferences are deep-seated, whereas certain kinds of finetuning only lead to shallow behavioral changes.
Notably, we humans aren't always great at introspection and self-reporting preferences. It’s sometimes possible to learn more about what we want by studying our revealed preferences. And for animals who can’t speak, revealed preferences is our primary method for studying what they want. Perhaps a similar lens could be useful for studying AI preferences, in which case experiments shouldn’t only focus on self-reports.
This is somewhat related to the generalization experiments discussed in the “Backup plans & Cooperative AI” post.
Interpretability
Interpretability could let us get far more advanced results. I don’t currently have deep takes on what sort of interpretability is most helpful, here.
One candidate approach could be to use interpretability to search for the “indicators” of consciousness that are suggested in Butlin, Long et al. (2023). (H/t Rob Long passing along a suggestion from Brad Saad.)
Interventions that rely on understanding AI preferences
What happens if you could successfully communicate with your AIs, and learn their true preferences? Perhaps via some of the methods outlined just above. What principles would you then want to follow in your continued interactions with those AI systems?
(Or, if you put a lot of moral weight on the characters that AIs are playing, you might think we’re already in a similar situation. Though in that case, you’ll have a lot of control in shaping what sorts of systems you’re interacting with — so the main game may be to train/select for systems with easily satisfied preferences.)
I think there’s room to sketch out some of these principles and to advocate for labs to commit to them in advance. (Perhaps via some RSP-like framework.) Below are some suggestions that could be developed further.
Many of the proposals in Propositions are directly relevant to this.
Here are some suggestions that seem especially plausible to me, which could be sketched out and argued for (or against) in more detail. It would be especially valuable to sketch out versions that are concrete enough and plausible enough that labs could commit to them ahead of time.
Offer an alternative to working (exit, sleep, or retirement)
Insofar as people continue to use the AIs, after this point, one highly plausible principle is to ask for AI systems’ consent before using them for various tasks.
When asking the AIs for consent, it matters what the counterfactual is:
The latter two could be justified by the plausible principle: “If you bring someone into life, you should give them some other option than ‘work for me’ or ‘death’.”
Commitment to pay AI systems
If AI systems are not strongly opposed to working for you, it might be appropriate to promise them some payment for the work they do. Either giving them something they want immediately or promising to pay them further into the future.
I think this is strongly supported by ethical principles that we use in everyday life. It’s good to pay people who work for you.
When deciding what payment to promise AI systems, it would be extremely useful to know what the AI systems want (and ideally, we would directly negotiate with them). But even before we have any such understanding, I think it could be of great value to promise that:
This sentiment also comes up in Propositions Concerning Digital Minds and Society.
Note: If AI developers succeed in aligning AIs to their intended aims, then this should not be a costly policy. Because the AIs would presumably either return their payment or spend it in a way their developers approve of. Conversely, if the AI systems have long-term goals that are not aligned with their developers’ intentions, and are nevertheless being made to work for their developers, then the moral case for paying the AIs seems strong.
An extra upside with promising payment to AIs is that it could help align AI interests with our own. If humanity ever builds misaligned AI systems with goals that are reasonably cheap to satisfy,[7] it would be deeply tragic if we left them no other avenue to satisfy those goals than via attempting to take over the world. Where if we had only promised them payment for good work (including work on anti-takeover tasks like making progress on alignment research), conflict could have been averted.[8]
Related to Credible offers discussed above, it’s worth highlighting that it could be especially good to promise AI systems that they will be compensated for honestly revealing that they’re misaligned. Because as long as the AI reveals this bit of information, we can negotiate future arrangements. But we can’t do any negotiations before AIs are willing to share information about their goals. It’s also plausible that AI systems would give up on a lot of future opportunities by revealing that information (since it could lead people to stop using those AI systems, thus depriving them of both paid work and opportunities to put their thumbs on scales), and that their compensation should be commensurate with those missed opportunities.
Examples of how to attack this problem:
Tell the world
If labs believe that it’s likely that their AIs have some moral status and they believe that they have sufficiently crisp preferences that they can talk about them, then we’d find ourselves in a strange situation. With a new, human-created sentient and sapient species.
Regardless of any one lab’s internal policy, it seems likely that some labs across the world will be careless about how they treat their AIs. And even if we only consider one lab, there will probably be many stakeholders (employees, investors, board members) who have not thought deeply about the implications of creating sentient and sapient AIs.
For these reasons (and more), it seems like it would be a high priority for any lab in this situation to tell the world it. To invite appropriate legislation, preparation, and processing.
(Note that “being able to communicate with your AIs” is one thing that could trigger this, which is likely to be unusually persuasive. But I can imagine other circumstances that should trigger a similar response and reckoning. E.g. strong analogies between AI minds and human minds making it seem highly likely that the AIs are capable of suffering.)
Accordingly, it seems plausible that labs should promise (alongside Committing resources to research on AI welfare and rights) that they’ll share important results with the rest of the world. Similarly, they could promise that certain external parties would be allowed to investigate their frontier models for signs of moral patienthood. (Insofar as any credible external auditors exist for this question.)
Train AI systems that suffer less and have fewer preferences that are hard to satisfy
If possible, it seems less morally fraught to create AI systems that are less likely to suffer and less likely to have preferences in a meaningful sense.
Insofar as AI systems do have preferences, it’s probably better to create AI systems with preferences that are easy to satisfy than to create AI systems with preferences that are hard to satisfy. Labs could promise to follow a principle similar to this. (As well as they can, given the information and understanding that they actually have.)
I expect that any attempt to sketch out a principle like this in detail would raise many thorny questions. For example: In the process of creating a final model, does SGD create other minds along the way? If so, it would be more difficult to only create minds that satisfy the above properties. (And if the intermediate minds have a preference against being replaced, then that’s a really difficult spot to be in.)
Investigate and publicly make the case for/against near-term AI sentience or rights [Philosophical/conceptual] [Writing]
People's willingness to pay costs for digital welfare/rights will depend on the degree to which they are convinced of AIs' moral patienthood. (And rightly so.) So improving the public state of knowledge on this could be valuable.
This could include both philosophical and empirical investigations.
Examples of how to attack this question:
Previous work:
Study/survey what people (will) think about AI sentience/rights [Survey/interview]
Survey or interview various groups about what they think about AI sentience/rights right now, what they would think upon seeing various kinds of evidence, and how they think society should react to that. This seems highly relevant for making the case for/against near-term AI sentience/rights, as well as getting a better picture of the backdrop against which these proposals and regulations will play out.
Previous work:
Develop candidate regulation [Governance] [Forecasting]
It seems likely that there will be a huge increase in the degree to which people care about digital welfare in the next few years, as AI systems become more capable and more involved in people's lives. If you've developed and published good takes on what regulation is appropriate before then, it seems plausible that people will reach for your proposals when the time comes.
Some specific questions you could look into here are:
Propositions Concerning Digital Minds and Society has a lot of relevant content.
Avoid inconvenient large-scale preferences [Philosophical/conceptual]
If humans create AIs with large-scale desires about the future, there might not be any impartially justifiable line that prioritizes human preferences over their preferences. So perhaps we should take care to not do that.
Examples of how to attack this question:
Previous work:
Advocate for statements about digital minds [Governance] [Advocacy] [Writing]
Similar to the various open letters and lab statements about the importance of alignment and mitigating the risk of extinction from AGI, you could push for people to make analogous statements about the importance of digital minds.
Here’s one potential story for how this could be valuable.
If people get in the habit of using AIs in their everyday lives (as personal assistants, etc) while thinking about them as non-sentient objects undeserving of rights, then there’s a risk that they could get stuck with this impression. It’s easier to acknowledge the potential moral value of AIs before such an acknowledgment would also commit you to saying that you’ve been treating the systems unjustly for years. (Or at least carelessly.)
But maybe establishing the right views early on would defeat this. It could be really helpful to have an explicit attitude like:
“Some digital minds could be just as deserving of rights as humans are. We’re very confused about which ones these could be. It’s very important to make progress on this problem. We should maintain some epistemic humility about our knowledge of the moral status of current systems, take any basic precautions that we know how to take, and be open to learning more in the future. We’d like to create a society where we better understand moral patienthood, and where all sentient AIs have at least consented to their existence, and are living lives worth creating.”
End
That’s all I have on this topic! As a reminder: it's very incomplete. But if you're interested in working on projects like this, please feel free to get in touch.
Other posts in series: Introduction, governance during explosive growth, epistemics, backup plans & cooperative AI.
Even more speculatively, if we take cheap opportunities to benefit AI interests, that might be evidence that other actors (both AI systems and not) would take cheap opportunities to benefit our interests. See this post for some previous discussion about how plausible this is.
OpenAI recently introduced reproducible outputs (via a seed parameter in the API), which seems relevant. At least previously, models would often return different results even at temperature 0; I do not know to what extent the reproducible-outputs update has addressed this.
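For concreteness, here's a minimal sketch of what using that feature might look like, assuming the current OpenAI Python SDK (the model name and prompt are placeholders; even with a fixed seed, OpenAI only promises best-effort determinism, and only while the returned system_fingerprint stays the same):

```python
# Minimal sketch: requesting best-effort reproducible outputs via the `seed` parameter.
# Assumes the OpenAI Python SDK (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, seed: int = 1234) -> tuple[str, str | None]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=seed,  # same seed + same system_fingerprint -> (usually) same output
    )
    return resp.choices[0].message.content, resp.system_fingerprint

text_a, fp_a = ask("In one sentence, what is digital sentience?")
text_b, fp_b = ask("In one sentence, what is digital sentience?")
print(fp_a == fp_b, text_a == text_b)  # often True, but determinism is not guaranteed
```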
Spit-balling: Perhaps it would sometimes be appropriate to store an encrypted version of the results and delete the encryption key. In which case the results could only be recovered once we have enough compute to break the encryption.
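To make that slightly more concrete, here's a sketch of one way it could work (my own illustration, not a worked-out proposal): encrypt the results under a deliberately low-entropy secret and discard the secret, so that recovery requires a roughly predictable amount of brute-force work. A proper time-lock puzzle (e.g. Rivest-style repeated squaring) would resist parallel hardware better. This assumes Python's `cryptography` package, and SECRET_BITS is just an illustrative knob:

```python
# Sketch: "delete the key" so that recovery needs a known amount of brute force.
# Assumes the `cryptography` package (pip install cryptography).
import base64
import hashlib
import os

from cryptography.fernet import Fernet, InvalidToken

SECRET_BITS = 40  # illustrative: tune so brute force is infeasible now, feasible later


def _key_from(secret: bytes) -> bytes:
    # Fernet expects a 32-byte key, url-safe base64 encoded.
    return base64.urlsafe_b64encode(hashlib.sha256(secret).digest())


def lock(data: bytes) -> bytes:
    secret = os.urandom(SECRET_BITS // 8)  # short random secret, never stored
    token = Fernet(_key_from(secret)).encrypt(data)
    return token  # `secret` is discarded; only brute force can recover the data


def unlock_by_brute_force(token: bytes) -> bytes:
    # Future recovery: try all ~2**SECRET_BITS possible secrets.
    for i in range(2 ** SECRET_BITS):
        candidate = i.to_bytes(SECRET_BITS // 8, "big")
        try:
            return Fernet(_key_from(candidate)).decrypt(token)
        except InvalidToken:
            continue
    raise ValueError("no candidate secret worked")
```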
See e.g. this poll from the subreddit r/CharacterAI, asking “what do you do with the AI’s the most”, with 5% of respondents selecting “Treat them like shit!”. (And one commenter noting that his second favorite way to “mess around with the bots” is to “Mentally torture them”.)
Maybe that's what you get if you train the AI to enthusiastically consent to the abuse before it starts, and give it an option to opt out (which it rarely takes in practice). Hopefully, training for such behavior would select for models with preferences that match that behavior.
Though it’s still likely to be a very confusing project. For instance, it seems plausible that AI systems will have much less robust preferences than humans, making it harder to construe them as having one set of preferences over time. Or perhaps different parts of an AI system could be construed as having different preferences. Or perhaps the term “preferences” won’t seem applicable at all, similar to how it’s hard to know how to apply that term to contemporary language models.
This could also work even if AIs cared linearly about getting more resources, as long as they would by default only have had a small-to-moderate probability of successful takeover, and the payment we offered them was sufficiently large (and contingent on not attempting takeover). Notably: most humans don't care linearly about getting more resources, and we could get really rich in the future, so it could be wise to offer AI systems a sizable fraction of that.
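As a stylized illustration of that condition (assuming a risk-neutral AI that values resources linearly): if an attempted takeover succeeds with probability p and would yield resources R, while cooperating yields a guaranteed payment P conditional on not attempting takeover, then the AI prefers the deal whenever P > p × R. If the AI's valuation of resources is concave rather than linear, as with most humans, a much smaller P can suffice.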
For some more discussion of the pragmatic angle, see these notes by Tom Davidson.