Summary 

Consider a Membrane that is supposed to protect Steve. An AI wants to hurt Steve but does not want to pierce Steve's Membrane. The Membrane ensures that there is zero effect on predictions of things inside the Membrane: the AI will never take any action that has any effect on what Steve does or experiences. The Membrane also ensures that the AI will not have access to any information that is within Steve's Membrane. One does not have to be a clever AI to come up with a strategy that an AI could use to hurt Steve without piercing this type of Membrane. The AI could for example create and hurt minds that Steve cares about, but not tell him about it (in other words: ensure that there is zero effect on predictions of things inside the Membrane). If Bob knew Steve before the AI was built, and Bob wants to hurt Steve, then Bob presumably knows Steve well enough to know which minds to create (in other words: there is no need to have any access to any information that is within Steve's Membrane).

This illustrates a security hole in a specific type of Membrane formalism. This particular security hole can of course be patched. But if a Membrane is supposed to actually protect Steve from a clever AI, then there is a more serious issue that needs to be dealt with: a clever AI that wants to hurt Steve will be very good at finding other security holes. Plugging all human-findable security holes is therefore not enough to protect Steve. This post explores the question: "what would it take to create a Membrane formalism that actually protects Steve in scenarios where Steve shares an environment with a clever AI (without giving Steve any special treatment)?"

The post does not propose any such formalism. It instead describes one necessary feature. In other words: it describes a feature that a Membrane formalism must have, in order to reliably protect Steve in this context. Very briefly and informally: the idea is that safety requires the prevention of scenarios where a clever AI wants to hurt Steve. For Steve's Membrane to ensure this, it must be extended to encompass any adoption of Steve-referring preferences by a clever AI.


Thanks to Chris Lakin for regranting to this research.


Protecting an individual that shares an environment with a clever AI is difficult

One can construct Membrane formalisms for all sorts of reasons. The present post will cover one specific case: Membrane formalisms that are supposed to provide reliable protection for a human individual who shares an environment with a powerful and clever AI that is acting autonomously (details below). The present post will describe a feature that is necessary for a formalism to fulfil this function, in this context.

This post is focused on scenarios where Steve gets hurt despite the fact that Steve's Membrane is never pierced. In such scenarios it does not help to make sure that an AI does not want to pierce the Membrane. In other words: this post is concerned with scenarios where Steve's Membrane is internal to an AI, but where that Membrane still fails to protect Steve. Consider a Membrane formalism such that it is possible to hurt Steve without piercing Steve's Membrane. In this case there is nothing inconsistent about an AI that both (i): wants to hurt Steve, and also (ii): wants to avoid piercing Steve's Membrane. If it is possible to hurt Steve without piercing Steve's Membrane, then making the Membrane internal to the AI, and making sure that the AI does not want to pierce it, is not enough to protect Steve.

The demands on a formalism depend on the context that it will be used in. So, for a given class of contexts, we can talk about features that any Membrane formalism must have. The present post will not propose any Membrane formalism. Its scope is limited to describing one feature that is necessary in one specific class of contexts. Let's start by describing this context in a bit more detail.

The class of contexts that the present post is analysing consists of situations where, (i): the environment contains a powerful, clever, and autonomous AI (in other words: not an instruction-following tool-AI, but an autonomously acting AI Sovereign. This AI is also clever enough to think up solutions that humans cannot think up), (ii): this AI will adopt its goal entirely from billions of humans, (iii): this AI will adopt its goal by following the same rules for each individual, and (iv): all humans will get the same type of Membrane. In other words: Steve must be protected from an AI that gets its goal from billions of humans, without giving Steve any form of special treatment. This in turn means that it is not possible to extend Steve's Membrane to cover everything that Steve cares about (trying to do this for everyone would lead to contradictions). The present post will start by arguing that to reliably protect Steve in such a context, a Membrane must reliably prevent the situation where such an AI wants to hurt Steve.

If the AI in question wants to hurt Steve, then protecting Steve would require the Membrane designers to predict and counter all attacks that a clever AI can think up. Even if the designers knew that they had protected Steve from all human-findable attacks, this would still not provide Steve with reliable protection, because an AI can think up ways of attacking Steve that no human can think up (these problems are similar to the problems that one would face if one needed to make sure that a powerful and clever AI will remain in a human-constructed box). Thus, even if a Membrane formalism is known to protect Steve from all human-findable forms of attack, it is not a reliable protection against a powerful and clever AI that wants to hurt Steve. And if it is possible to hurt Steve without piercing Steve's Membrane, then ensuring that the AI wants to avoid piercing Steve's Membrane does not help Steve. To reliably protect Steve, the Membrane must therefore reliably prevent the scenario where this type of AI wants to hurt Steve.

If the AI adopts preferences that refer to Steve, using a process that falls outside of Steve's Membrane, then Steve's Membrane cannot prevent the adoption of preferences to hurt Steve. So if the Membrane is not extended to include this process, then the Membrane does not offer reliable protection for Steve in this context. In other words: to reliably protect Steve in this type of scenario, Steve's Membrane must encompass the point at which a clever and powerful AI adopts preferences that refer to Steve (as a necessary but not sufficient condition).

Let's introduce some notation. At some point a decision is made regarding which Steve-referring preferences will be adopted by a clever AI. Let's say that a specific Membrane formalism has the Extended Membrane (EM) feature iff it means that Steve's Membrane will be extended to encompass this decision. In the class of scenarios that we are looking at (where Steve shares an environment with a clever and powerful AI of the type described above), the EM feature is a necessary feature of a Membrane formalism. (It is however definitely not sufficient.)
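
For readers who prefer notation, the EM feature can be sketched roughly as follows. The symbols below (F, M, D) are introduced here purely for illustration, and are not taken from any existing Membrane formalism:

```latex
% Notational sketch (symbols invented for this illustration):
%   F       : a candidate Membrane formalism
%   M_i(F)  : the region that individual i's Membrane encompasses under F
%   D_i     : the decision process that determines which i-referring
%             preferences the clever AI adopts
%
% The EM feature: the Membrane is extended to encompass that decision process.
\[
  \mathrm{EM}(F) \iff \forall i : \; D_i \subseteq M_i(F)
\]
% The claim of this post is necessity, not sufficiency:
\[
  \mathrm{ReliablyProtects}(F, i) \implies \mathrm{EM}(F)
\]
```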

Let's recap the argument for the necessity of the EM feature. If a clever and powerful AI wants to hurt Steve, then such an AI would be able to think up ways of attacking Steve that humans would not be able to think up. If Steve comes face to face with a clever AI that wants to hurt Steve, then the task of the designers of a Membrane formalism is impossible: such designers would have to find a way of protecting Steve against a set of attack vectors that they are not capable of comprehending. This is not feasible. Thus, in order to protect Steve, the Membrane must instead prevent the existence of an AI that wants to hurt Steve. If the adoption of preferences that refer to Steve happens outside of Steve's Membrane, then the Membrane cannot prevent the adoption of a preference to hurt Steve. Thus, for the Membrane to be able to reliably prevent the adoption of such preferences, it must be extended to encompass the decision of which Steve-referring preferences the AI will adopt. Otherwise the Membrane cannot prevent the existence of an AI that wants to hurt Steve. And if a Membrane does not prevent the existence of an AI that wants to hurt Steve, then the Membrane is not able to reliably protect Steve (because even if a Membrane is internal to the AI, and even if all human-findable security holes are known to be fully patched, this Membrane will still not help Steve against a clever AI that wants to hurt Steve). So if a Membrane formalism does not have the EM feature, it is known to fail at the task of reliably protecting Steve in the context under consideration.

 

The uses and limitations of establishing the necessity of the EM feature

Adding this type of extension to the Membrane of every individual does not introduce contradictions, because the extension is in preference adoption space, not in any form of outcome space. While this avoids contradictions, it also means that extending the Membrane in this way cannot guarantee that the AI will act in ways that Steve finds acceptable. This section will describe various scenarios where people are unhappy with a Membrane that has the EM feature. And it will discuss the fact that in many cases it will be unclear whether or not it is reasonable to describe a given Membrane formalism as having the EM feature. This section will also describe why identifying the EM feature as necessary was still useful (in brief: establishing necessity was still useful for designers, because they can now reject those Membrane formalisms that clearly do not have the EM feature).

Let's take a trivial example of an AI that is acting in a way that Steve finds unacceptable, even though Steve is protected by a Membrane with the EM feature. If it is very important to Steve that any AI interacts with some specific historical monument in a very specific way, then an AI might act in ways that make Steve prefer the situation where there was no AI, even though this AI has no intention of hurting Steve. This is because adding the EM feature does not extend the Membrane to encompass everything that Steve cares about. Extending the Membranes of multiple people in such a way would introduce contradictions (other people might also care deeply about the same historical monument). In other words: defining the EM feature in preference adoption space avoids contradictions, but it means that the EM feature cannot hope to be sufficient. A necessary feature can however still be useful for designers, because it allows designers to reject any formalism that clearly does not have the necessary feature.

A necessary feature can be useful, even if there exist many cases where it is unclear whether or not a given formalism has this feature. As long as clear negatives exist (Membrane formalisms that clearly do not have this feature), discovering that the feature is necessary can be useful for designers. In other words: as long as it is possible to determine that at least some potential formalisms definitely do not have the EM feature, then this feature can be useful for designers. The existence of clear negatives is needed for this finding to be useful. But the existence of clear positives is not important (because clear positives are treated the same as unclear cases in the design process). To illustrate the role of necessary (but far from sufficient) features, let's turn to a less trivial example: a scenario with Gregg, who categorically rejects the EM feature as inherently immoral.

The EM feature will be completely unacceptable to Gregg, on honestly held, non-strategic, moral grounds. Gregg sees most people as heretics, and Gregg demands that any AI must hurt all heretics as much as possible. For an entity as powerful as an AI, hurting heretics is a non-negotiable moral imperative. Thus, Gregg will categorically reject the EM feature. In fact, Gregg will reject any conceivable AI that does not subject most people to the most horrific punishment imaginable. So making Gregg happy is not actually compatible with a good outcome (from the perspective of most humans, since Gregg demands that any AI must hurt most humans as much as possible). More importantly for our present purposes: making Gregg happy is definitely not compatible with fulfilling the function of a Membrane formalism of the type that we are exploring in the present post: protecting individuals.

Now let's get back to the issue of what function a necessary but not sufficient feature can play in the design process. Let's re-formulate the Gregg example as a necessary condition of any Membrane formalism (or any other AI proposal for that matter): Gregg must categorically reject the proposal as an abomination, due to an honestly held normative judgment. Let's refer to this feature as the Rejected by Gregg on Honestly held Moral grounds (RGHM) feature. Unless a proposal results in most people being subjected to the worst thing a clever AI can think up, that proposal will have the RGHM feature. So the absence of the RGHM feature can probably not be used to reject a large number of proposals. But given what we know about Gregg, it is entirely valid to describe this as a necessary feature (of a Membrane or an AI). Therefore it can be used to reject any proposal that is clearly not describable as having the RGHM feature. And this necessary feature can perhaps be useful for illustrating the important difference between dealing with necessity and dealing with sufficiency, and for illustrating the role that a necessary feature can still play in a design process (in this case: the design of a Membrane formalism whose function is to keep human individuals safe in a certain context). Now consider an AI that hurts all humans as much as possible. Such an AI has this necessary RGHM feature (because any proposal that leads to non-heretics getting hurt is also rejected by Gregg on moral grounds). This should drive home the point that the RGHM feature is definitely not sufficient for safety, and that a proposal can have a necessary feature and still be arbitrarily bad. Now let's turn to the role that the RGHM feature could still play in the design process.

If Gregg is happy with some Membrane formalism (or some AI proposal), then this is a perfectly valid reason to reject the proposal in question out of hand, because that proposal lacks a necessary feature. There will be many cases where it will be unclear whether or not Gregg can be reasonably described as rejecting a given proposal. In these cases, determining whether or not the proposal has the RGHM feature might be a fully arbitrary judgment call. There likely exist many border cases. But there will also be some cases where Gregg is clearly happy according to any reasonable set of definitions. There will be clear negatives: cases where it is clear that a given proposal does not have the RGHM feature. And in such a case, the proposal in question is known to be bad (it fails to achieve its purpose). Clear, unambiguous rejection by Gregg does not settle things (as illustrated by the "hurt-everyone-AI" in the previous paragraph). And unclear cases also do not settle things. But clear approval by Gregg does in fact settle things (in other words: the clear absence of a necessary feature is informative, because it is a valid reason to reject a proposal).

The same type of considerations hold more generally when dealing with features that are necessary but not sufficient. In other words: the existence or non-existence of clear positives is not actually important. The existence of many cases that are hard to classify is mostly irrelevant. The only thing that actually matters, for the feature to be able to fulfil its role in the design process, is that there exist clear negatives (in this case: Membrane formalisms that are clearly not describable as having a necessary feature). Identifying a feature as necessary can thus reduce risks from all proposals that clearly do not have the feature.
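
To make this asymmetry concrete, here is a minimal sketch in Python (with names invented for this post, not taken from any existing formalism) of the role that a necessary-but-not-sufficient feature plays as a filter in a design process: only a clear negative is decisive.

```python
from typing import Optional

def screen_proposal(clearly_has_feature: Optional[bool]) -> str:
    """Screen a proposal against one necessary (but not sufficient) feature.

    clearly_has_feature is True for clear positives, False for clear negatives,
    and None for the many borderline cases where no confident judgment is possible.
    """
    if clearly_has_feature is False:
        # A clear negative settles things: the proposal lacks a necessary feature,
        # so it is known to fail at its purpose and can be rejected out of hand.
        return "reject"
    # Clear positives and unclear cases are treated the same way:
    # the proposal survives this filter, but nothing is established about sufficiency.
    return "undecided"

# Usage: only the clear negative is informative.
print(screen_proposal(False))  # -> "reject"
print(screen_proposal(True))   # -> "undecided"
print(screen_proposal(None))   # -> "undecided"
```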

Now let's return to the EM feature. Establishing the necessity of the EM feature was useful for similar reasons. The clear presence of the EM feature does not settle things. There will also be many cases where it is not clear whether or not it would be reasonable to describe a given formalism as having the EM feature. But the EM feature can still serve a role in the design process. Specifically: if a given Membrane formalism is clearly not describable as having the EM feature, then we know that we must reject the formalism. In other words: if a Membrane is clearly not describable as including the point at which a clever and powerful AI adopts preferences that refer to Steve, then the formalism must be rejected (assuming that the point of constructing a Membrane formalism was to offer reliable protection for Steve, in the context outlined above). This is the main takeaway of the present post. And this takeaway has probably been expressed in a sufficient number of ways at this point. So the post will now conclude with a brief discussion of a couple of tangents.

 

A brief discussion of a couple of tangents:

Davidad has a proposal for how to structure negotiations regarding AI actions. The set of actions under consideration is restricted to Pareto Improvements (relative to a baseline situation where the AI does not exist). This is not a Membrane formalism. But the idea is to protect individuals by extending individual influence, in a way that I think is similar to a Membrane extension: the proposal gives each individual some measure of control over things defined in an outcome space. I think this is similar to extending a Membrane in an outcome space, which would lead to contradictions due to overlap. Since the proposal is not a Membrane formalism, the extension does not lead to contradictions. Instead, the extension results in the set of actions that can be considered during negotiations becoming empty (meaning that all possible actions are classified as unacceptable). This happens because the set of Pareto Improvements is always empty when billions of humans are negotiating about what to do with a powerful AI. In brief: the extension of individual influence in an outcome space leads to a malignant version of the problem in the historical monument example mentioned above. Consider two people with a type of morality along the lines of Gregg. Each views the other as a heretic. Both consider it to be a moral imperative to punish heretics as much as possible. Both view the existence of an immoral AI that neglects its duty to punish heretics as unacceptable (both also reject the scenario where everyone is punished as much as possible). A population of billions only has to include two people like this for the set of Pareto Improvements to be empty.
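
As a toy illustration of why two such people are enough (a sketch with invented numbers, and not a description of Davidad's actual proposal): for these two agents, every candidate AI action makes at least one of them worse off than the no-AI baseline, so the set of Pareto Improvements is empty.

```python
# Toy illustration (numbers invented for this post): two agents who each regard
# the other as a heretic, and who each demand that any AI punish heretics.
# Utilities are relative to the no-AI baseline (0 = same as no AI).

actions = {
    "punish_A_only":  {"A": -10, "B": +1},   # B approves: the heretic A is punished
    "punish_B_only":  {"A": +1,  "B": -10},
    "punish_both":    {"A": -10, "B": -10},  # both reject being punished themselves
    "punish_neither": {"A": -1,  "B": -1},   # both reject an AI that neglects its duty to punish heretics
}

def is_pareto_improvement(utilities: dict) -> bool:
    """Weak Pareto improvement over the baseline: no one worse off, someone better off."""
    return all(u >= 0 for u in utilities.values()) and any(u > 0 for u in utilities.values())

pareto_improvements = [a for a, u in actions.items() if is_pareto_improvement(u)]
print(pareto_improvements)  # -> []  (no action survives the Pareto restriction)
```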

An almost identical dynamic has implications for work that is more explicitly about Membranes. In Andrew Critch's Boundaries / Membranes sequence, it is suggested that it might be possible to find a Membrane-based Best Alternative To a Negotiated Agreement (BATNA) that can be viewed as having been acausally agreed upon by billions of humans. The problem is again that the existence of two people like Gregg (who view each other as heretics) means that this is not possible. There exists no BATNA that both will agree to, for the same reason that there exists no AI that both will consider acceptable. (Both conclusions hold for any veil of ignorance that does not transform a person like Gregg into a completely different person with a completely new moral framework.)


(I'm also posting this on the EA Forum)

Comments

I made a comment pertinent to this recently. I'm pretty concerned about the framing of a Sovereign AI, and even more so about a Sovereign AI which seeks to harm people. I would really prefer to focus on heading that scenario off before we get to it rather than trying to ameliorate the harms after it is established.

My current belief as of September 2024 is that we should be aiming for powerful tool-AIs, wielded by a democratic world government, that prevent the development of super-intelligent AIs or self-replicating weapons for as many years as possible while fundamental progress is made on AI Alignment.

I think your comment about the problem of people who value punishing others really hits the mark.

Basically, I don't think it makes sense to try to satisfy everyone's preferences. Instead, we should try for something like 'the loosest, smallest libertarian world government that prevents catastrophe.' Then we can have our normal nation-states implemented within the world government framework, implementing local laws.

I do think it's possible to have a really great government that would align with my values, and yet be 'loose' enough in its decision boundaries that many other people also felt that it adequately aligned with their values. I think this hypothetical great-by-my-standards government would be better than the hypothetical minimal-libertarian-world-government with current nation-states within. Unfortunately, I don't see a safe path which goes straight to my ideal government being the world government. Maybe someday I'll get to help create and live under such a government! In the meantime, I'd prefer the minimal-libertarian-world-government to humanity being destroyed.

In regards to the Membrane idea... it seems less compelling to me as a safe way for someone to operate a potent AI than Corrigibility as defined by Max Harms.

 

My other comment:


""" I believe it is mostly impossible except in corner/edge cases like everyone having the same preferences, because of this post: https://www.lesswrong.com/posts/YYuB8w4nrfWmLzNob/thatcher-s-axiom

So personal intent alignment is basically all we get except in perhaps very small groups."""

I want to disagree here. I think that a widely acceptable compromise on political rules, and the freedom to pursue happiness on one's own terms without violating others' rights, is quite achievable and desirable. I think that having a powerful AI establish/maintain the best possible government given the conflicting sets of values held by all parties is a great outcome. I agree that this isn't what is generally meant by 'values alignment', but I think it's a more useful thing to talk about.

I do agree that large groups of humans do seem to inevitably have contradictory values such that no perfect resolution is possible. I just think that that is beside the point, and not what we should even be fantasizing about. I also agree that most people who seem excited about 'values alignment' mean 'alignment to their own values'. I've had numerous conversations with such people about the problem of people with harmful intent towards others (e.g. sadism, vengeance). I have yet to receive anything even remotely resembling a coherent response to this. Averaging values doesn't solve the problem; there are weird bad edge cases that it falls into. Instead, you need to focus on a widely (but not necessarily unanimously) acceptable political compromise.

Regarding Corrigibility as an alternative safety measure:

I think that exploring the Corrigibility concept sounds like a valuable thing to do. I also think that Corrigibility formalisms can be quite tricky (for similar reasons that Membrane formalisms can be tricky: I think that they are both vulnerable to difficult-to-notice definitional issues). Consider a powerful and clever tool-AI. It is built using a Corrigibility formalism that works very well when the tool-AI is used to shut down competing AI projects. This formalism relies on a definition of Explanation that is designed to prevent any form of undue influence. When talking with this tool-AI about shutting down competing AI projects, the definition of Explanation holds up fine. In this scenario, it could be the case that asking this seemingly corrigible tool-AI about a Sovereign AI proposal is essentially equivalent to implementing that proposal.

Any definition of Explanation will necessarily be built on top of a lot of assumptions. Many of these will be unexamined implicit assumptions that the designers will not be aware of. In general, it would not be particularly surprising if one of these assumptions turns out to hold when discussing things along the lines of shutting down competing AI projects, but turns out to break when discussing a Sovereign AI proposal.

Let's take one specific example. Consider the case where the tool-AI will try to Explain any topic that it is asked about, until the person asking Understands the topic sufficiently. When asked about a Sovereign AI proposal, the tool-AI will ensure that two separate aspects of the proposal are Understood, (i): an alignment target, and (ii): a normative moral theory according to which this alignment target is the thing that a Sovereign AI project should aim at. It turns out that Explaining a normative moral theory until the person asking Understands it is functionally equivalent to convincing the person to adopt this normative moral theory. If the tool-AI is very good at convincing, then the tool-AI could be essentially equivalent to an AI that will implement whatever Sovereign AI proposal it is first asked to explain (with a few extra steps).

(I discussed this issue with Max Harms here)

Yes, in my discussions with Max Harms about CAST we discussed the concern of a highly capable corrigible tool-AI accidentally or intentionally manipulating its operators or other humans with very compelling answers to questions. My impression is that Max is more confident about his version of corrigibility managing to avoid manipulation scenarios than I am. I think this is definitely one of the more fragile and slippery aspects of corrigibility.  In my opinion, manipulation-prevention in the context of corrigibility deserves more examination to see if better protections can be found, and a very cautious treatment during any deployment of a powerful corrigible tool-AI.

I agree that focus should be on preventing the existence of a Sovereign AI that seeks to harm people (as opposed to trying to deal with such an AI after it has already been built). The main reason for trying to find necessary features is actually that it might stop a dangerous AI project from being pursued in the first place. In particular: it might convince the design team to abandon an AI project that clearly lacks a feature that has been found to be necessary, in other words an AI project that would (if successfully implemented) result in an AI Sovereign that would seek to harm people. For example: a Sovereign AI that wants to respect a Membrane, but where the Membrane formalism does not actually prevent the AI from wanting to hurt individuals, because the formalism lacks a necessary feature.

One reason we might end up with a Sovereign AI that seeks to harm people is that someone makes two separate errors. Let's say that Bob gains control over a tool-AI, and uses it to shut down unauthorised AI projects (Bob might for example be a single individual, or a design team, or a government, or a coalition of governments, or the UN, or a democratic world government, or something else along those lines). Bob gains the ability to launch a Sovereign AI. And Bob settles on a specific Sovereign AI design: Bob's Sovereign AI (BSAI).

Bob knows that BSAI might contain a hidden flaw. And Bob is not being completely reckless about launching BSAI. So Bob designs a Membrane, whose function is to protect individuals (in case BSAI does have a hidden flaw). And Bob figures out how to make sure that BSAI will want to avoid piercing this Membrane (in other words: Bob makes sure that the Membrane will be internal to BSAI).

Consider the case where both BSAI, and the Membrane formalism in question, each have a hidden flaw. If both BSAI and the Membrane are successfully implemented, then the result would be a Sovereign AI that seeks to harm people (the resulting AI would want to both (i): harm people, and (ii): respect the Membrane of every individual). One way to reduce the probability that such a project would go ahead is to describe necessary features.

For example: if it is clear that the Membrane that Bob is planning to use does not have the necessary Extended Membrane feature described in the post, then Bob should be able to see that this Membrane will not offer reliable protection from BSAI (which Bob knows might be needed, because Bob knows that BSAI might be flawed).

For a given AI project, it is not certain that there exists a realistically findable necessary feature that can be used to illustrate the dangers of the project in question. And even if such a feature is found, it is not certain that Bob will listen. But looking for necessary features is still a tractable way of reducing the probability of a Sovereign AI that seeks to harm people.

A project to find necessary features is not really a quest for a solution to AI. It is more informative to see such a project as analogous to a quest to design a bulletproof vest for Bob, who will be going into a gunfight (and who might decide to put on the vest). Even if very successful, the bulletproof vest project will not offer full protection (Bob might get shot in the head). A vest is also not a solution. Whether Bob is a medic trying to evacuate wounded people from the gunfight, or Bob is a soldier trying to win the gunfight, the vest cannot be used to achieve Bob's objective. Vests are not solutions. Vests are still very popular amongst people who know that they will be going into a gunfight.

So if you will share the fate of Bob, and if you might fail to persuade Bob to avoid a gunfight, then it makes sense to try to design a bulletproof vest for Bob (because if you succeed, then he might decide to wear it, and that would be very good if he ends up getting shot in the stomach). (The vest in this analogy is analogous to descriptions of necessary features that might be used to convince designers to abandon a dangerous AI project. The vest in this analogy is not analogous to a Membrane.)

Small editing note, I endorse the title of Abstracts should be either Actually Short™, or broken into paragraphs. I wish the abstract for this post was not a fearsome wall of text.

Thanks for the feedback! I see what you mean and I edited the post. (I turned a single paragraph abstract into a three paragraph Summary section. The text itself has not been changed)