Agent membranes/boundaries and formalizing “safety”

Chris Lakin

LESSWRONG
LW

Agent membranes/boundaries and formalizing “safety” — LessWrong

Boundaries (membranes) for AI safety by Chipmonk

26 Agent membranes/boundaries and formalizing “safety”

by Chris Lakin

3rd Jan 2024

3 min read

26

It might be possible to formalize what it means for an agent/moral patient to be safe via membranes/boundaries. This post tells one (just one) story for how the membranes idea could be useful for thinking about existential risk and AI safety.

Formalizing “safety” using agent membranes

A few examples:

A bacterium uses its membrane to protect its internal processes from external influences.

AI generated image: microscope view of a amoeba

A nation maintains its sovereignty by defending its borders.

A human protects their mental integrity by selectively filtering the information that comes in and out of their mind.

AI generated image: a brain surrounded by a fence

A natural abstraction for agent safety?

Agent boundaries/membranes seem to be a natural abstraction representing the safety and autonomy of agents.

A bacterium survives only if its membrane is preserved.
A nation maintains its sovereignty only if its borders aren’t invaded.
A human mind maintains mental integrity only if it can hold off informational manipulation.

Maybe the safety of agents could be largely formalized as the preservation of their membranes.

Distinct from preferences!

Boundaries are also cool because they show a way to respect agents without needing to talking about their preferences or utility functions. Andrew Critch has said the following about this idea:

my goal is to treat boundaries as more fundamental than preferences, rather than as merely a feature of them. In other words, I think boundaries are probably better able to carve reality at the joints than either preferences or utility functions, for the purpose of creating a good working relationship between humanity and AI technology («Boundaries» Sequence, Part 3b)

For instance, respecting the boundary of bacterium would probably mean “preserving or not disrupting its membrane” (as opposed to knowing its preferences and satisfying them).

Protecting agents and infrastructure

By formalizing and preserving the important boundaries in the world, we could be in a better position to protect humanity from AI threats. Examples:

Critical computing infrastructure could be secured by creating strong boundaries around them. This can be enforced by cryptography and formal methods such that only the subprocesses that need to have read and/or write access to a particular resource (like memory) have the encryption keys to do so.
- Related: Object-capability model, Principle of least privilege, Evan Miyazono’s Atlas Computing, Davidad’s Open Agency Architecture.

It may be possible to specify or agree on a minimal “membrane” for each agent/moral patients humanity values, such that when each membrane is preserved, that agent largely stays safe and maintains its autonomy over the inside of its membrane.
- If your physical boundary isn't violated, you don't die. If your mental boundary isn't violated aren't manipulated. Etc…
- Note: if your membrane is preserved, this just means that you stay safe and that you maintain autonomy over everything with your membrane. It does not necessarily mean that you get actively positive outcomes to occur in the outside world. This is all about bare minimum safety.
- See this thread below and davidad's comment

Similarly, it may be possible to formalize and enforce the boundaries of physical property rights.

This is for safety, not full alignment

Note that this is only about specifying safety, not full alignment.

See: Safety First: safety before full alignment. The deontic sufficiency hypothesis.

Caveats

I don't think the absence of membrane piercing formalizes all of safety, but I think it gets at a good chunk of what "safety" should mean. Future thinking will have to determine what more is required.

What are examples of violations of agent safety that do not involve membrane piercing?

Markov blankets

How might membranes/boundaries be formalized mathematically? Markov blankets seem to be a fitting abstraction.

Diagram:

Notice that there are no arrows directly between the agent and its environment. Ideally, all influence from one to the other flows through the boundary/membrane (e.g.: your skin).

In which case,

Infiltration of information across this Markov blanket measures membrane piercing, and low infiltration indicates the absence of such piercing.
(And it may also be useful to keep track of exfiltration across the Markov blanket?^[1])

For more details, see distillation Formalizing «Boundaries» with Markov blankets.

Also, there are probably other information-theoretic measures that are useful for formalizing membranes/boundaries.

Protecting agent membranes/boundaries

See: Protecting agent boundaries.

Subscribe to the boundaries/membranes LessWrong tag to get notified of new developments.

Thanks to Jonathan Ng, Alexander Gietelink Oldenziel, Alex Zhu, and Evan Miyazono for reviewing a draft of this post.

^{^}
exfiltration, i.e.: privacy and the absence of mind-reading. But I need to think more about this. Related section: “Maintaining Boundaries is about Maintaining Free Will and Privacy” by Scott Garrabrant.

Boundaries / Membranes [technical]Human ValuesAI

Frontpage

26

What does davidad want from «boundaries»?

1 comments46 karma

New Comment

46 comments, sorted by

top scoring

Click to highlight new comments since: Today at 12:36 AM

[-]the gears to ascension2y141

Good stuff.

What's "piercing"?

Is it piercing a membrane if I speak and it distracts you, but I don't touch you otherwise?
What about if I destroy all your food sources but don't touch your body?
What if you're dying and I have a cure but don't share it?
What if I enclose your house completely with concrete while you're in it?
How about if I give you food you would have chosen to buy anyway, but I give it to you for free?
What about if I offer you a bad trade I know you'll choose to make because of an ad you just saw?
What about if I'm the one showing you an ad rather than simply being in the right place at the right time to take advantage of someone else's ad?

[-]davidad2y116

These are very good questions. First, two general clarifications:

A. «Boundaries» are not partitions of physical space; they are partitions of a causal graphical model that is an abstraction over the concrete physical world-model.

B. To "pierce" a «boundary» is to counterfactually (with respect to the concrete physical world-model) cause the abstract model that represents the boundary to increase in prediction error (relative to the best augmented abstraction that uses the same state-space factorization but permits arbitrary causal dependencies crossing the boundary).

So, to your particular cases:

Probably not. There is no fundamental difference between sound and contact. Rather, the fundamental difference is between the usual flow of information through the senses and other flows of information that are possible in the concrete physical world-model but not represented in the abstraction. An interaction that pierces the membrane is one which breaks the abstraction barrier of perception. Ordinary speech acts do not. Only sounds which cause damage (internal state changes that are not well-modelled as mental states) or which otherwise exceed the "operating conditions" in the state space of the «boundary» layer (e.g. certain kinds of superstimuli) would pierce the «boundary».
Almost surely not. This is why, as an agenda for AI safety, it will be necessary to specify a handful of constructive goals, such as provision of clean water and sustenance and the maintenance of hospitable atmospheric conditions, in addition to the «boundary»-based safety prohibitions.
Definitely not. Omission of beneficial actions is not a counterfactual impact.
Probably. This causes prediction error because the abstraction of typical human spatial positions is that they have substantial ability to affect their position between nearby streets by simple locomotory action sequences. But if a human is already effectively imprisoned, then adding more concrete would not create additional/counterfactual prediction error.
Probably not. Provision of resources (that are within "operating conditions", i.e. not "out-of-distribution") is not a «boundary» violation as long as the human has the typical amount of control of whether to accept them.
Definitely not. Exploiting behavioural tendencies which are not counterfactually corrupted is not a «boundary» violation.
Maybe. If the ad's effect on decision-making tendencies is well modelled by the abstraction of typical in-distribution human interactions, then using that channel does not violate the «boundary». Unprecedented superstimuli would, but the precedented patterns in advertising are already pretty bad. This is a weak point of the «boundaries» concept, in my view. We need additional criteria for avoiding psychological harm, including superpersuasion. One is simply to forbid autonomous superhuman systems from communicating to humans at all: any proposed actions which can be meaningfully interpreted by sandboxed human-level supervisory AIs as messages with nontrivial semantics could be rejected. Another approach is Mariven's criterion for deception, but applying this criterion requires modelling human mental states as beliefs about the world (which is certainly not 100% scientifically accurate). I would like to see more work here, and more different proposed approaches.

[-]the gears to ascension2y30

Definitely not. Omission of beneficial actions is not a counterfactual impact.

You're sure this is the case even if the disease is about to violate the <<boundary>> and the cure will prevent that?

[-]the gears to ascension2y20

We need additional criteria for avoiding psychological harm, including superpersuasion. One is simply to forbid autonomous superhuman systems from communicating to humans at all

Unfortunately this is probably not on the table, as they are currently being used as weapons in economic warfare between the USA, China, and everyone else. tiktok primarily educational inside china. Advertisers have direct incentive to violate. We need a way to use <<membranes>> that will, on the margin, help protect against anyone violating them, not just avoid doing so itself.

[-]Chris Lakin2y10

he says a bit in this direction- see my other comment

[-]Chris Lakin2y10

Here's a tricky example I've been thinking about:

Is a cell getting infected by a virus a boundary violation?

What I think makes this tricky is that viruses generally don't physically penetrate cell membranes. Instead, cells just "let in" some viruses (albeit against their better judgement).

Then once you answer the above, please also consider:

Is a cell taking in nutrients from its environment a boundary violation?

I don't know what makes this different from the virus example (at least as long as we're not allowed to refer to preferences).

[-]Chris Lakin2y10

any proposed actions which can be meaningfully interpreted by sandboxed human-level supervisory AIs as messages with nontrivial semantics could be rejected.

I want to give a big +1 on preventing membrane piercing not just by having AIs respect membranes, but also by using technology to empower membranes to be stronger and better at self-defense.

[-]Chris Lakin2y10

Thanks for writing this! I largely agree (and the rest I need to think more about)

[-]Chris Lakin2y*10

Edit: just see Davidad's comment

Hmmm. It's becoming apparent to me that I don't want to regard membrane piercing as a necessarily objective phenomenon. Membrane piercing certainly isn't always visible from every perspective.

That said, I think it's still possible to prevent "membrane piercing", even if whether it occurred can be somewhat subjective.

Responding to some of your examples:

Is it piercing a membrane if I speak and it distracts you, but I don't touch you otherwise

Again: I don't actually care so much about whether this is or isn't a membrane piercing, and I don't want to make a decision on that in this case. Instead, I want to talk about what actions taken by which agents make the most sense for preventing the outcome if we do consider it to be a membrane piercing.

In most everyday cases, I think the best answer is "if someone's actions are supposedly distracting you, you shouldn't blame anyone else for distracting you, you should just get stronger and become less distractible". I believe this because it can be really hard to know other agent's boundaries, and if you just let other agents tell you your boundaries you can get mugged too easily.

However, in some cases, self-defense is infact insufficient, and usually in these cases as a society we collectively agree that e.g. "no one should blow an airhorn in your ear-- in this case we're going to blame the person that did that"

What about if I destroy all your food sources but don't touch your body?

It depends on how far out we can find the membranes. For example, if the membranes go so far out as to include property rights then this could be addressed.

What if I enclose your house completely with concrete while you're in it?

Again depends on how far out we go with the membranes: in this case, probably: how much of the law is included.

[-]the gears to ascension2y20

It depends on how far out we can find the membranes. For example, if the membranes go so far out as to include property rights then this could be addressed.

I sort of agree, but my food sources are not my property, they're a farmer's property.

I edited numbers into my questions, could you edit to make your response numbered and get each one?

[-]ThomasCederborg2y97

I think it is very straightforward to hurt human individual Steve without piercing Steve's Membrane. Just create and hurt minds that Steve cares about. But don't tell him about it (in other words: ensure that there is zero effect on predictions of things inside the membrane). If Bob knew Steve before the Membrane enforcing AI was built, and Bob wants to hurt Steve, then Bob presumably knows Steve well enough to know what minds to create (in other words: there is no need to have any form of access, to any form of information, that is within Steve's Membrane). And if it is possible to build a Membrane enforcing AI, it is presumably possible to build an AI that looks at Bob's memories of Steve, and creates some set of minds, whose fate Steve would care about. This does not involve any form of blackmail or negotiation (and definitely nothing acausal). Just Bob who wants to hurt Steve, and remembers things about Steve from before the first AI launch.

One can of course patch this. But I think there is a deeper issue in one specific case, that I think is important. Specifically: the case where the Membrane concept is supposed to protect Steve from a clever AI that wants to hurt Steve. Such an AI can think up things that humans can not think up. In this case, patching all human-findable security holes in the Membrane concept, will probably be worthless for Steve. It's like trying to keep an AI in a box by patching all human findable security holes. Even if it were know, that all human findable security holes, had been fully patched, I don't think that it changes things, from the perspective of Steve, if a clever AI tries to hurt him (whether the AI is inside a box, or Steve is inside a Membrane). This matters if the end goal is to build CEV. Specifically, it means that if CEV wants to hurt Steve, then the Membrane concept can't help him.

Let's consider a specific scenario. Someone builds a Membrane AI, with all human findable safety holes fully patched. Later, someone initiates an AI project, whose ultimate goal is to build an AI, that implements the Coherent Extrapolated Volition of Humanity. This project ends up successfully hitting the alignment target that it is aiming for. Let's refer the resulting AI as CEV.

One common aspect of human morality, is often expressed in theological terms, along the lines of ``heretics deserve eternal torture in hell''. A tiny group of fanatics, with morality along these lines, can end up completely dominating CEV. I have outlined a thought experiment, where this happens to the most recently published version of CEV (PCEV). It is probably best to first read this comment, that clarifies some things talked about in the post that describes the thought experiment (including important clarifications regarding the topic of the post, and regarding the nature of the claims made, as well as clarifications regarding the terminology used).

So, in this scenario, PCEV will try to implement some outcome along the lines of LP. Now the Membrane concept has to protect Steve from a very clever attacker, that can presumably easily just go around whatever patches was used to plug the human findable safety holes. Against such an attacker, it's difficult to see how a Membrane will ever offer Steve anything of value (similar to how it is difficult to see how putting PCEV in a human constructed box, would offer Steve anything of value).

I like the Membrane concept. But I think that the intuitions that seems to be motivating it, should instead be used to find an alternative to CEV. In other words, I think the thing to aim for, is an alignment target, such that Steve can feel confident, that the result of a successful project, will not want to hurt Steve. One could for example use these underlying intuitions, to try to explore alignment targets along the lines of MPCEV, mentioned in the above post (MPCEV is based on giving each individual, meaningful influence, regarding the adoption, of those preferences, that refer to her. The idea being that Steve needs to have meaningful influence, regarding the decision, of which Steve-preferences, an AI will adopt). Doing so means that one must abandon the idea of building an AI, that is describable as doing what a group wants (which in turn, means that one must give up on CEV as an alignment target).

In the case of any AI, that is describable as a doing what a group wants, Steve has a serious problem (and this problem is present, regardless of the details of the specific Group AI proposal). Form Steve's perspective, the core problem, is that an arbitrarily defined abstract entity, will adopt preferences that is about Steve. But, if this is any version of CEV (or any other Group AI), directed at a large group, then Steve has had no meaningful influence, regarding the adoption of those preferences, that refer to Steve. Just like every other decision, the decision of what Steve-preferences to adopt, is determined by the outcome of an arbitrarily defined mapping, that maps large sets of human individuals, into the space of entities that can be said to want things. Different sets of definitions, lead to completely different such ``Group entities''. These entities all want completely different things (changing one detail can for example change which tiny group of fanatics, will end up dominating the AI in question). Since the choice of entity is arbitrary, there is no way for an AI to figure out that the mapping ``is wrong'' (regardless of how smart this AI is). And since the AI is doing what the resulting entity wants, the AI has no reason to object, when that entity wants to hurt an individual. Since Steve does not have any meaningful influence, regarding the adoption of those preferences, that refer to Steve, there is no reason for him to think that this AI will want to help him, as opposed to want to hurt him. Combined with the vulnerability of a human individual, to a clever AI that tries to hurt that individual as much as possible, this means that any group AI would be worse than extinction, in expectation. Discovering that doing what humanity wants, is bad for human individuals in expectation, should not be particularly surprising. Groups and individuals are completely different types of things. So, this should be no more surprising, than discovering that any reasonable way of extrapolating Dave, will lead to the death of every single one of Dave's cells.

One can of course give every individual meaningful influence, regarding the adoption of those preferences, that refer to her (as in MPCEV, mentioned in the linked post). So, Steve can be given this form of protection, without giving Steve any form of special treatment. But this means that one has to abandon the core concept of CEV.

I like the membrane concept on the intuition level. On the intuition level, it sort of rhymes with the MPCEV idea, of giving each individual, meaningful influence, regarding the adoption of those preferences, that refer to her. I'm just noting that it does not actually protect Steve, from an AI that already wants to hurt Steve. However, if the underlying intuition, that seems to me to be motivating this work, is instead used to look for alternative alignment targets, then I think it might be very useful for safety (by finding an alignment target, such that a successful project would result in an AI, that does not want to hurt Steve in the first place). So, I don't think the Membrane concept can protect Steve from a successfully implemented CEV, in the unsurprising event that CEV will want to hurt Steve. But if CEV is dropped as an alignment target, and the underlying intuition behind this work, is directed towards looking for alternative alignment targets, then I think the intuitions that seems to be motivating this work, would fit very well with the proposed research effort, described in the comment linked above.

(this is a comment about dangers related to successfully hitting a bad alignment target. It is for example not a comment about dangers related to a less powerful AI, or dangers related to projects that fail to hit an alignment target. These are very different types of dangers. So, my proposed idea of using the underlying intuitions, to look for alternative alignment targets, should be seen as complementary. It can be done, in addition to looking for Membrane related safety measures, that can protect against other forms of AI dangers. In other words: if some scenario does not involve a clever AI, that already wants to hurt Steve, then nothing I have said, implies that the Membrane concept, will be insufficient for protecting Steve. In other words: using the Membrane concept, as a basis for constructing safety measures, might be useful in general. It will however not help Steve, if a clever AI is actively trying to hurt Steve)

[-]RamblinDash2y8-2

This comment has many good questions. More generally, I suspect that for any given membrane definition, it would be relatively easy to do either or both:

A - specify multiple easily-stated ways to torture or destroy the agent without piercing the membrane; and/or

B - show that the membrane definition is totally unworkable and inconsistent with other similarly-situated agents having similar membranes.

B is there because you could get around A by saying absurd things like 'well my membrane is my entire state, if nobody pierces that then I will be safe.' If you do, then people will of course need to pierce that membrane all the time, many agents' membranes will constantly be overlapping, and the 'membrane' framework just reduces to some kind of 'implied consent' framework, at which point the 'membrane' isn't doing any work.

I suspect it's not a coincidence that this post focuses on 'membranes' in the abstract rather than committing to any particular conception of what a membrane is and what it means to pierce it. I claim this is because there cannot actually exist any even reasonably precise definition of a 'membrane' that both (a) does any useful analytical work; and (b) could come anywhere close to guaranteeing safety.

[-]the gears to ascension2y31

any given

Maybe for most, but I don't know if we can confidently say "forall membrane" and make the statement you follow up with. Can we say anything durable and exceptionless about what it looks like for there to be a membrane through which passes no packet of information that will cause a violation, but which allows packets of information which do not cause a violation? Can we say there isn't? you're implying there isn't anything general we can say, but you didn't make a locally-valid-step-by-step claim, you proposed a statement without a way to show it in the most general, details-erased case.

absurd things like

whether it's absurd is yet to be shown, imo, though it very well could be

well my membrane is my entire state, if nobody pierces that then I will be safe

well, like, we're mostly talking about locality here, right? it doesn't seem to be weird to say it has to be your whole state. but -

people will of course need to pierce that membrane all the time

right, the thing that we have to nail down here is how to derive from a being what their implied ruleset should be about what interactions are acceptable. compare the immune system, for example. I don't think we get to avoid doing a CEV, but I think boundaries are a necessary type in defining CEVs, because --

many agents' membranes will constantly be overlapping

this is where I think things get interesting: I suspect that any reasonable use of membranes as a type is going to end up being some sort of integral over possible membrane worldsheets or something. in other words, it's an in-retrospect-maybe-trivial-but-very-not-optional component of an expression that has several more missing parts.

[-]RamblinDash2y30

I realized I might not have been clear above, by "state" I meant "one of the fifty United States", not "the set of all stored information that influences an an Agent's actions, when combined with the environment". I think that is absurd. I agree it hasn't been shown that the other meaning of "state" is an absurd definition.

[-]sturb2y20

It seems somewhat easy to think of examples of ways to harm an agent without piercing its membrane, eg killing its family, isolating it, etc. The counter thought would be that there are different dimensions of the membrane that extend over parts of the world. For example part of my membranes extend over the things I care about, and things that affect my survival.

The question then becomes how to quantify these different membranes and in terms of interacting with other systems how they can be helpful to you without harming or disturbing these other membranes.

[-]the gears to ascension2y42

the family unit as an agent has a aggregate meta-membrane though, doesn't it? this is why I'd expect to need an integral over possible membranes, and the thing we'd need to do to make this perspective useful is find some sturdy, causal-graph information-theoretic metric that reliably identifies agents. compare discovering agents

[-]Chris Lakin2y10

kill its family

Huh interesting.

To be clear I think this probably emotionally harms most humans, but ultimately it's that's up to whatever interpretations and emotional beliefs that person has (which are all flexible, in principle).

The counter thought would be that there are different dimensions of the membrane that extend over parts of the world. For example part of my membranes extend over the things I care about, and things that affect my survival.

Yes

The question then becomes how to quantify these different membranes and in terms of interacting with other systems how they can be helpful to you without harming or disturbing these other membranes.

If I understand what you're saying (and I may not)- yes, this

[-]the gears to ascension2y20

if I and my friend work together well, aren't we an aggregate being that has a new membrane that needs protecting?

[-]Chris Lakin2y32

from the post:

a minimal “membrane” for each agent/moral patients humanity values value

Many membranes (ie: many possible Markov blankets if you could observe them all) are not valued, empirically.

[-]the gears to ascension2y20

right, which translates into, it's not a uniform integral, there's some sort of weighting.

but I don't retract my argument that the moral value of my relationship with my friend means that me and my friend acting together as a friendship means that the friendship has a membrane. How familiar are you with social network analysis? if not very, I'd suggest speedwatching at least the first half hour of https://www.youtube.com/watch?v=2ZHuj8uBinM which should take 15m at 2x speed. I suggest this because of the way the explanation and the visual intuitions give a framework for reasoning about networks of people.

we also need to somehow take into account when membranes dissipate but this isn't a failure according to the individual beings.

[-]Chris Lakin2y10

could you restate your argument again plainly i missed it

[-]the gears to ascension2y20

groups working together to be an interaction have a aggregate meta-membrane: for any given group who are participating in an interaction of some kind, the fact of their connection is a property of {the mutual information of some variables about their locations and actions or something} that makes them act as a semicoherent shared being, and we call that shared being "a friendship", "a romance", "a family", "a party", "a talk", "an event", "a company", "a coop", "a neighborhood", "a city", etc etc etc. each of these will have a different level of membrane defense depending on how much the participants in the thing act to preserve it. in each case, we can recognize some unreliable pattern of membrane defense. typically the membrane gets stretched through communication tubes, I think? consider how loss of internet feels like something cutting a stretched membrane that was connecting you.

...this is why I'd expect to need an integral over possible membranes, and the thing we'd need to do to make this perspective useful is find some sturdy, causal-graph information-theoretic metric that reliably identifies agents. compare discovering agents

this seems like an obvious consequence of not getting to specify "organism" in any straightforward way; we have to somehow information theoretically specify something that will simultaneously identify bacteria and organisms, and then we need some sort of weighting that naturally recognizes individual humans. those should end up getting the majority of the weight in the integral, of course, but it shouldn't need to be hardcoded.

[-]Chris Lakin2y10

Oh.

But why shouldn't it be hardcoded?

[-]the gears to ascension2y20

well maybe it can be as a backstop but what about, idk, dogs? or just humans that aren't in the protected group, eg people outside a state?

[-]the gears to ascension2y20

More generally, what property of the system inside the membrane do we want to assert about?

[-]Chris Lakin2y10

Hm right now I only see asserting properties about the state of the membrane, and not about anything inside

[-]the gears to ascension2y20

I conjecture that someone will be able to prove that in expectation over properties of the membrane (call the random variable P), properties P you wish to assert about the state of the membrane without reference to the inside of the membrane are strongly probably either insufficient, and therefore allows adversaries to "damage" the insides of the membrane, or the given P is overly constraining and itself "damages" the preferences of the being inside the membrane; where "damage" means "move towards a state dispreferred by a random variable U of what partial utility function the inside of the membrane implies". this is a counting argument, and those have been taking some flak lately, but my point is that we need to do better than simplicity-weighted random properties of the membrane.

[+][anonymous]2y-50

[-]the gears to ascension2y20

What about agents implemented entirely as distributed systems, such as a fully remote company, such that the only coherent membrane you can point to is a "level down", the bodies of the agents participating in the company?