A list of "corrigibility principles" sounds like it's approaching the question on the wrong level of abstraction for either building or thinking about such a system. We usually want to think about features that lead a system to be corrigible---either about how the system was produced, or how it operates. I'm not clear on what you would do with a long list of aspects of corrigibility like "shuts down when asked."
I found this useful as an occasion to think a bit about corrigibility. But my guess about the overall outcome is that it will come down to a question of taste. (And this is similar to how I see your claim about the list of lethalities.) The exercise you are asking for doesn't actually seem that useful to me. And amongst people who decide to play ball, I expect there to be very different taste about what constitutes an interesting idea or useful contribution.
Now I'm going to say some object-level stuff about corrigibility. I suspect I may be using the term a bit differently from you, in which case you can substitute a different word when reading this comment. But I think this comment is getting at the main useful idea in this space, and hopefully makes clear why I'm not inter...
I think this is a great comment that feels to me like it communicated a better intuition for why corrigibility might be natural than anything else I've read so far.
- I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is only likely to be useful in cases like this where it is crisp and natural.
Can someone explain to me what this crispness is?
As I'm reading Paul's comment, there's an amount of optimization for human reward that breaks our rating ability. This is a general problem for AI because of the fundamental reason that as we increase an AI's optimization power, it gets better at the task, but it also gets better at breaking my rating ability (which in powerful systems can lead to an overpowering of whose values get optimized in the universe).
Then there's this idea that as you approach breaking my rating ability, the rating will always fall off, leaving a pool of undesirability (in a high-dimensional action-space) that surrounds doing the task well/poorly and separates it from doing the task in a way that breaks my rating ability.
Is that what this crispness is? This little pool of rating fall off?
If yes, it's not clear to me why this little pool that separates the AI from MASSIVE VALUE and TAKING OVER THE UNIVERSE is able to save us. I don't ...
If you have a space with two disconnected components, then I'm calling the distinction between them "crisp." For example, it doesn't depend on exactly how you draw the line.
It feels to me like this kind of non-convexity is fundamentally what crispness is about (the cluster structure of thingspace is a central example). So if you want to draw a crisp line, you should be looking for this kind of disconnectedness/non-convexity.
ETA: a very concrete consequence of this kind of crispness, that I should have spelled out in the OP, is that there are many functions that separate the two components, and so if you try to learn a classifier you can do so relatively quickly---almost all of the work of learning your classifier is just in building a good model and predicting what actions a human would rate highly.
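A toy numerical illustration of that ETA (mine, not Paul's; the cluster positions, dimensions, and use of logistic regression are all made up for the sake of the example): when the two components really are disconnected, many separating functions exist, so fitting a classifier is fast and easy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two well-separated clusters stand in for the two components:
# "does the task (well or badly)" vs. "breaks the rater".
task_like = rng.normal(loc=0.0, scale=1.0, size=(500, 20))
rater_breaking = rng.normal(loc=8.0, scale=1.0, size=(500, 20))

X = np.vstack([task_like, rater_breaking])
y = np.array([0] * 500 + [1] * 500)

# Because the components are disconnected, almost any reasonable boundary
# separates them, and a simple classifier fits almost immediately.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))  # ~1.0
```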
If you have a space with two disconnected components, then I'm calling the distinction between them "crisp."
The components feel disconnected to me in 1D, but I'm not sure they would feel disconnected in 3D or in ND. Is your intuition that they're 'durably disconnected' (even looking at the messy plan-space of the real-world, we'll be able to make a simple classifier that rates corrigibility), or if not, when the connection comes in (like once you can argue about philosophy in way X, once you have uncertainty about your operator's preferences, once you have the ability to shut off or distract bits of your brain without other bits noticing, etc.)?
[This also feels like a good question for people who think corrigibility is anti-natural; do you not share Paul's sense that they're disconnected in 1D, or when do you think the difficulty comes in?]
Quick attempt at rough ontology translation between how I understand your comment, and the original post. (Any of you can correct me if I'm wrong)
I think what would typically count as "principles" in Eliezer's meaning are
1. designable things which make the "true corrigibility" basin significantly harder to escape, e.g. by making it deeper
2. designable things which make the "incorrigible" basin harder to reach, e.g. by increasing the distance between them, or increasing the potential barrier
3. somehow, making the "incorrigible" basin less lethal
4. preventing low-dimensional, low-barrier "tunnels" (or bridges?) between the basins
Eg some versions of "low impact" often make the "incorrigible" basin harder to reach, roughly because "elaborate webs of deceptions and coverups" may require complex changes to the environment. (Not robustly)
In contrast, my impression is, what does not count as "principles" are statements about properties which are likely true in the corrigibility basin, but don't seem designable - eg "corrigible AI does not try to hypnotize you". Also the intended level of generality likely is: more specific than "make the basin deeper" and more general than "
Btw my impression is that what makes the worst-case scenario hard to robustly solve is basically #4 from the list above. Otherwise there are many ways to make the basin work "in most directions".
I'm not a good writer, so this is a brainstormed unpolished bullet-point list, written in about 40 minutes.
Principles which counteract instrumentally convergent goals
1. Disutility from resource acquisition - e.g. by some mutual information measure between the AI and distant parts of the environment
2. Task uncertainty with reasonable prior on goal drift - the system is unsure about the task it tries to do and seeks human inputs about it.
3. AI which ultimately wants to not exist in the future as a terminal goal. Fulfilling the task is on the simplest trajectory to non-existence.
Principles which counteract unbounded rationality
4. Information-theoretic penalty on reasoning: any bit-flip in the system is assumed to have some utility cost. The cost function could be very small initially but get steeper fast. (A rough sketch of this one follows the list below.)
5. Satisficing (already mentioned)
6. Myopia (already mentioned)
Traps
7. Tripwire artifacts. Messing with some parts of the environment which are unknown to the system but unrelated to its task is associated with large negative rewards.
8. External watchdogs. Smaller, faster external systems able to react quickly to out-of-distribution behaviour.
9. Ontological uncertainty about level of simulation.
Oversight
10. Human-approva...
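A rough sketch of principle 4 above, as promised. The constants and function name are invented for illustration; the only point is the shape of the idea, a superlinear charge on computation so that unbounded reasoning is never worth it.

```python
def penalized_utility(task_utility: float, bit_flips: float,
                      base_cost: float = 1e-12, exponent: float = 2.0) -> float:
    """Task utility minus a superlinear charge for bits of computation used."""
    return task_utility - base_cost * (bit_flips ** exponent)

# Example: a marginally better plan that needs vastly more reasoning loses out.
cheap_plan = penalized_utility(task_utility=0.90, bit_flips=1e5)      # 0.89
expensive_plan = penalized_utility(task_utility=0.95, bit_flips=1e7)  # about -99
print(cheap_plan > expensive_plan)  # True with these made-up numbers
```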
Seems like a worthwhile exercise...
There is a distinction between design principles intended to be used as targets/guides by human system designers at design time, vs runtime optimization targets intended to be used as targets/guides by the system itself at runtime. This list consists of design principles, not runtime optimization targets. Some of them would be actively dangerous to optimize for at runtime.
Minor clarification: This doesn't refer to re-writing the LW corrigibility tag. I believe a tag is a reply in glowfic, where each author responds with the next tag, i.e. the next bit of the story, with an implied "tag – now you're it!" directed at the other author.
"And you kindly asked the world, and the world replied in a booming voice"
"NO."
(I don't actually know, probably somewhere there's a guide to writing glowfic, though I think it's not very relevant to the task, which is just to outline principles you'd use to design an agent that is corrigible in ~2k words, somewhat roleplaying as though you are the engineering team.)
Eliezer's writeup on corrigibility has now been published (the posts below by "Iarwain", embedded within his new story Mad Investor Chaos). Although, you might not want to look at it if you're still writing your own version and don't want to be anchored by his ideas.
Some hopefully-unnecessary background info for people attempting this task:
A description of corrigibility Eliezer wrote a few months ago: "'corrigibility' is meant to refer to the sort of putative hypothetical motivational properties that prevent a system from wanting to kill you after you didn't build it exactly right".
An older description of "task-directed AGI" he wrote in 2015-2016: "A task-based AGI is an AGI intended to follow a series of human-originated orders, with these orders each being of limited scope", where the orders can be "accomplished using bounded amounts of effort and resources (as opposed to the goals being more and more fulfillable using more and more effort)."
I worry that the question as posed is already assuming a structure for the solution -- "the sort of principles you'd build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it".
When I read that, I understand it to be describing the type of behavior or internal logic that you'd expect from an "aligned" AGI. Since I disagree that the concept of "aligning" an AGI even makes sense, it's a bit difficult for me to reply on those grounds. But I'll try to reply anyway, based on what I think is reasonable for AGI development.
In a world where AGI was developed and deployed safely, I'd expect the following properties:
1. Controlled environments.
2. Controlled access to information.
3. Safety-critical systems engineering.
4. An emphasis on at-rest encryption and secure-by-default networking.
5. Extensive logging, monitoring, interpretability, and circuit breakers.
6. Systems with AGI are assumed to be adversarial.
Let's stop on the top of the mountain and talk about (6).
Generally, the way this discussion goes is we discuss how unaligned AGI can kill everyone, and therefore we need to align the AGI, and then once we figure out how to align the AG...
“myopia” (not sure who correctly named this as a corrigibility principle),
I think this is from Paul Christiano, e.g. this discussion.
(This was an interesting exercise! I wrote this before reading any other comments; obviously most of the bullet points are unoriginal)
The basics
Myopia
Non-maximizing
I guess the problem with this test is that the kinds of people who could do this tend to be busy, so they probably can't do this with so little notice.
If corrigibility has one central problem, I would call it: How do you say "If A, then prefer B." instead of "Prefer (if A, then B)."? Compare pytorch's detach, which permits computation to pass forward, but prevents gradients from propagating backward, by acting as an identity function with derivative 0.
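A minimal PyTorch snippet showing the detach behaviour that the analogy leans on (the analogy to "If A, then prefer B" is the comment's; this only demonstrates the mechanical fact about detach):

```python
import torch

a = torch.tensor(2.0, requires_grad=True)

b = a * 3           # gradient flows back through this path
c = a.detach() * 3  # same forward value, but no gradient flows back to a

(b + c).backward()
print(b.item(), c.item())  # 6.0 6.0 -- identical forward values
print(a.grad)              # tensor(3.) -- only b's path contributed a gradient
```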
Disclaimer: I am not writing my full opinions. I am writing this as if I were an alien writing an encyclopedia entry on something they know is a good idea. These aliens may define "corrigibility" and its sub-categories slightly differently than earthlings. Also, I am bad at giving things catchy names, so I've decided that whenever I need a name for something I don't know the name of, I will make something up and accept that it sounds stupid. 45 minutes, go. (EDIT: Okay, partway done and having a reasonably good time. Second 45 minutes, go!) (EDIT2: Ok, went over budget by another half hour and added as many topics as I finished. I will spend the other hour and a half to finish this if it seems like a good idea tomorrow.)
-
An agent models the consequences of its actions in the world, then chooses the action that it thinks will have the best consequences, according to some criterion. Agents are dangerous because specifying a criterion that rates our desired states of the world highly is an unsolved problem (see value learning). Corrigibility is the study of producing AIs that are deficient in some of the properties of agency, with the intent of maintaining meaningful hum...
[Hi! Been lurking for a long time, this seems like as good a reason as any to actually put something out there. Epistemic status: low confidence, but it seems low-risk, high-reward to try. Not intended to be a full list; I do not have the expertise for that, I am just posting any ideas at all that I have and don't already see here. This probably already exists and I just don't know the name.]
1) Input masking: basically, for an oracle/task-AI, you ask the AI for a program that solves a slightly more general version of your problem and don't give the AI the information necessary to narrow it down, then run the program on your actual case (plus probably some simple test cases you know the answer to, to make sure it solves the problem).
This lets you penalize the AI for the complexity of the output program, and therefore it will give you something narrow instead of a general reasoner.
(Obviously you still have to be sensible about the output program: don't go post the code to GitHub or give it internet access.)
2) Reward function stability. We know we might have made mistakes inputting the reward function, but we have some example test cases we're confident in. Tell the AI to look for a bunch of different possible functions that give the same output as the existing reward function, and filter potential actions by whether any of those see them as harmful.
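A rough sketch of idea 2; everything here (the names, the agreement tolerance, the harm threshold) is an assumption invented for illustration, and the search for candidate reward functions is left abstract:

```python
from typing import Callable, Iterable, List, Tuple

RewardFn = Callable[[str], float]

def consistent_candidates(candidates: Iterable[RewardFn],
                          trusted_cases: List[Tuple[str, float]]) -> List[RewardFn]:
    """Keep only reward functions that reproduce the trusted (state, reward) examples."""
    return [f for f in candidates
            if all(abs(f(state) - reward) < 1e-6 for state, reward in trusted_cases)]

def action_allowed(action_outcome: str, survivors: List[RewardFn],
                   harm_threshold: float = 0.0) -> bool:
    """Veto the action if ANY surviving candidate scores its outcome as harmful."""
    return all(f(action_outcome) >= harm_threshold for f in survivors)
```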
This feels to me like very much not how I would go about getting corrigibility.
It is hard to summarize how I would go about things, because there would be lots of steps, and lots of processes that are iterative.
Prior to plausible AGI/FOOM I would box it in really carefully, and only interact with it in ways where its expressivity is severely restricted.
I would set up a "council" of AGI-systems (a system of systems), and when giving it requests in an oracle/genie-like manner I would see if the answers converged. At first it would be the initial AGI-system, but I would use that system to generate new systems for the "council".
I would make heavy use of techniques that are centered around verifiability, since for some pieces of work it’s possible to set up things in such a way that it would be very hard for the system to "pretend" like it’s doing what I want it to do without actually doing it. There are several techniques I would use to achieve this, but one of them is that I often would ask it to provide a narrow/specialized/interpretable "result-generator" instead of giving the result directly, and sometimes even result-generator-generators (pieces of code that produce results, an...
Could someone give me a link to the glowfic tag where Eliezer published his list, and say how strongly it spoilers the story?
[Side note: I'm not sure I understand the prompt. Of the four "principles" Eliezer has listed, some seem like a description of how Eliezer thinks a corrigible system should behave (shutdownability, low impact) and some of them seem like defensive driving techniques for operators/engineers when designing such systems (myopia), or maybe both (quantilization). Which kinds of properties is he looking for?]
[Epistemic status: Unpolished conceptual exploration, possibly of concepts that are extremely obvious and/or have already been discussed. Abandoning concerns about obviousness, previous discussion, polish, fitting the list-of-principles frame, etc. in favor of saying anything at all.] [ETA: Written in about half an hour, with some distraction and wording struggles.]
What is the hypothetical ideal of a corrigible AI? Without worrying about whether it can be implemented in practice or is even tractable to design, just as a theoretical refere...
Welp. I decided to do this, and here it is. I didn't take nearly enough screenshots. Some large percent of this is me writing things, some other large percent is me writing things as if I thought the outputs of OpenAI's Playground were definitely something that should be extracted/summarized/rephrased, and a small percentage is verbatim text-continuation outputs. Virtually no attempts were made to document my process. I do not endorse this as useful and would be perfectly fine if it were reign of terror'd away, though IMO it might be interesting to compare...
Hmmm. The badly edited, back-of-the-envelope short version I can come up with off the top of my head goes like this:
We want an AI-in-training to, by default, do things that have as few side effects as possible. But how can we define "as few side effects as possible" in a way that doesn't directly incentivize disaster and that doesn't make the AI totally useless? Well, what if we say that we want it to prefer to act in a way that we can "undo", and then give a reasonable definition of "undo" that makes sense?
Consider the counterfactual world in which the AI...
Another failure mode: the AI stubbornly ignores you and actually does nothing when you ask it several times to put the strawberry on the plate, and you go and do it yourself out of frustration. The AI, having predicted this, thinks "Mission accomplished".
~1 hour's thoughts, by a total amateur. It doesn't feel complete, but it's what I could come up with before I couldn't think of anything new without >5 minutes' thought. Calibrate accordingly—if your list isn't significantly better than this, take some serious pause before working on anything AI related.
Quick brainstorm:
Here's my attempt. I haven't read any of the other comments or the tag yet. I probably spent ~60-90m total on this, spread across a few days.
On kill switches
On the AI accurately knowing what it is doing, and pointing at things in the real world
Here are some too-specific ideas (I realize you are probably asking for more general ones):
A "time-bounded agent" could be useful for some particular tasks where you aren't asking it to act over the long-term. It could work like this: each time it's initialized it would be given a task-specific utility function that has a bounded number of points available for different degrees of success in the assigned task and an unbounded penalty for time before shutdown.
If you try to make agents safe solely using this approach though, eventually you decide to give it ...
Here is my shortlist of corrigible behaviours. I have never researched or done any thinking specifically about corrigibility before this, other than a brief glance at the Arbital page some time ago.
-Favour very high caution over acting to realise your current understanding of your goals.
-Do not act independently, defer to human operators.
-Even though bad things are happening on earth and cosmic matter is being wasted, in the short term just say so be it, take your time.
-Don’t jump ahead to what your operators will do or believe, wait for it.
-Don’t manipulate hum...
"Why didn't you challenge anybody else to write up a list like that, if you wanted to make a point of nobody else being able to write it?" I was asked.
Because I don't actually think it does any good, or persuades anyone of anything, people don't like tests like that, and I don't really believe in them myself either. I couldn't pass a test somebody else invented around something they found easy to do, for many such possible tests.
But you do think that it is important evidence about the world that no one else had written that list before you?
It seems l...
Are you looking to vastly improve your nation state's military capacity with an AGI? Maybe you're of a more intellectual bent instead, and want to make one to expound on the philosophical mysteries of the universe. Or perhaps you just want her to write you an endless supply of fanfiction. Whatever your reasons though, you might be given pause by the tendency AGIs have to take a treacherous turn, destroy all humans, and then convert the Milky Way into paperclips.
If that's the case, I've got just the thing for you! Order one of our myopic AGIs right now! She...
...the sort of principles you'd build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it.
Here's one straightforward such principle: minimal transfer to unrelated tasks / task-classes. If you've somehow figured out how to do a pivotal act with a theorem proving AI, and you're training a theorem proving AI, then that AI should not also be able to learn to model human behavior, predict biological interactions, etc.
One way to evaluate this quantity: have many small datasets of transfer tasks, each containin...
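The comment is cut off here, so the following is only a guess at the general flavour of such an evaluation, with all names invented: score a model by how much brief fine-tuning on small unrelated datasets improves it, and penalize large improvements (i.e. high transfer).

```python
def transfer_penalty(loss_before: float, loss_after_finetune: float) -> float:
    """Larger when brief fine-tuning on an unrelated task helps a lot, i.e. high transfer."""
    return max(0.0, loss_before - loss_after_finetune)

def total_transfer_penalty(pairs) -> float:
    """Sum the penalty over many small unrelated transfer-task datasets.

    pairs: iterable of (loss_before, loss_after_finetune) tuples, one per dataset.
    """
    return sum(transfer_penalty(before, after) for before, after in pairs)
```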
So far as I know, every principle of this kind, except for Jessica Taylor's "quantilization", and "myopia" (not sure who correctly named this as a corrigibility principle), was invented by myself; eg "low impact", "shutdownability". (Though I don't particularly think it hopeful if you claim that somebody else has publication priority on "low impact" or whatevs, in some stretched or even nonstretched way; ideas on the level of "low impact" have always seemed cheap to me to propose, harder to solve before the world ends.)
Low impact seems so easy to pro...
better than the tag overall
What does this mean? Improve on what you've (the OP has) already written that's here (LW) tagged corrigibility?
The overall point makes sense, see how far you can go on:
'principles for corrigibility'.
The phrasing at the end of the post was a little weird though.
The top-rated comment on "AGI Ruin: A List of Lethalities" claims that many other people could've written a list like that.
"Why didn't you challenge anybody else to write up a list like that, if you wanted to make a point of nobody else being able to write it?" I was asked.
Because I don't actually think it does any good, or persuades anyone of anything, people don't like tests like that, and I don't really believe in them myself either. I couldn't pass a test somebody else invented around something they found easy to do, for many such possible tests.
But people asked, so, fine, let's actually try it this time. Maybe I'm wrong about how bad things are, and will be pleasantly surprised. If I'm never pleasantly surprised then I'm obviously not being pessimistic enough yet.
So: As part of my current fiction-writing project, I'm currently writing a list of some principles that dath ilan's Basement-of-the-World project has invented for describing AGI corrigibility - the sort of principles you'd build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it.
So far as I know, every principle of this kind, except for Jessica Taylor's "quantilization", and "myopia" (not sure who correctly named this as a corrigibility principle), was invented by myself; eg "low impact", "shutdownability". (Though I don't particularly think it hopeful if you claim that somebody else has publication priority on "low impact" or whatevs, in some stretched or even nonstretched way; ideas on the level of "low impact" have always seemed cheap to me to propose, harder to solve before the world ends.)
Some of the items on dath ilan's upcoming list out of my personal glowfic writing have already been written up more seriously by me. Some haven't.
I'm writing this in one afternoon as one tag in my cowritten online novel about a dath ilani who landed in a D&D country run by Hell.
One and a half thousand words or so, maybe. (2169 words.) How about you try to do better than the tag overall, before I publish it, upon the topic of corrigibility principles on the level of "myopia" for AGI? It'll get published in a day or so, possibly later, but I'm not going to be spending more than an hour or two polishing it.