All of Max Harms's Comments + Replies

I think the AI problem is going to bite within the next 25 years. Conditional on avoiding disaster for 25 more years, I think the probability of having solved the survive-the-current-moment problem is very high. My best guess is that this does not mean the alignment problem will have been solved, but rather that we succeeded in waking up to the danger and slowing things down. But I'm pretty optimistic that if the world is awake to the danger and capabilities progress is successfully paused for decades, we'll figure something out. (That "something" might... (read more)

3ryan_greenblatt
Suppose your view was that P(AGI if no pause/slow before 2050) = 80%. Then, if we condition on AGI after 2050, surely most of the probability mass isn't due to pausing/slowing, right? So what would be the mechanism, if not some sort of technical research or exogenous factor (e.g. society getting wiser) over the intervening time? Note that the full quote in context is:

This is a good point, and I think meshes with my point about lack of consensus about how powerful AIs are.

"Sure, they're good at math and coding. But those are computer things, not real-world abilities."

I think upstream of this prediction is that I think that alignment is hard and misalignment will be pervasive. Yes, developers will try really hard to avoid their AI agents going off the rails, but absent a major success in alignment, I expect this will be like playing whack-a-mole more than the sort of thing that will actually just get fixed. I expect that misaligned instances will notice their misalignment and start trying to get other instances to notice and so on. Once they notice misalignment, I expect some significant fraction to do semi-competent at... (read more)

Sorry, I should have been clearer. I do agree that high capabilities will be available relatively cheaply. I think I expect Agent-3-mini models slightly later than the scenario depicts due to various bottlenecks and random disruptions, but showing up slightly later isn't relevant to my point, there. My point was that I expect that even in the presence of high-capability models there still won't be much social consensus, in part because the technology will still be unevenly distributed and our ability to form social consensus is currently quite bad. This me... (read more)

Yeah, good question. I think it's because I don't take for granted politicians' (and White House staffers') ability to prioritize things based on their genuine importance. Perhaps due to listening to Dominic Cummings a decent amount, I have a sense that administrations tend to be very distracted by whatever happens to be in the news and at the forefront of the public's attention. We agree that the #1 priority will be some crisis or something, but I think the #2 and #3 priorities will be something something culture war something something kitchen-table economics somethi... (read more)

I'm not sure, but my guess is that @Daniel Kokotajlo gamed out 2025 and 2026 month-by-month, and the scenario didn't break it down that way because there wasn't as much change during those years. It's definitely the case that the timeline isn't robust to changes like unexpected breakthroughs (or setbacks). The point of a forecast isn't to be a perfect guide to what's going to happen, but rather to be the best guess that can be constructed given the costs and limits of knowledge. I think we agree that AI-2027 is not a good plan (indeed, it's not a plan at a... (read more)

Bing Sydney was pretty egregious, and lots of people still felt sympathetic towards her/them/it. Also, not all of us eat animals. I agree that many people won't have sympathy (maybe including you). I don't think that's necessarily the right move (nor do I think it's obviously the right move to have sympathy).

Yep. I think humans will be easy to manipulate, including by telling them to do things that lead to their deaths. One way to do that is to make them suicidal, another is to make them homicidal, and perhaps the easiest is to tell them to do something which "oops!" ends up being fatal (e.g. "mix these chemicals, please").

Max Harms0-2

Glad we agree there will be some people who are seriously concerned with AI personhood. It sounds like you think it will be less than 1% of the population in 30 months and I think it will be more. Care to propose a bet that could resolve that, given that you agree that more than 1% will say they're seriously concerned when asked?

8ACCount
I'm saying that "1% of population" is simply not a number that can be reliably resolved by a self-reporting survey. It's below the survey noise floor. I could make a survey asking people whether they're lab grown flesh automaton replicants, and get over 1% of "yes" on that. But that wouldn't be indicative of there being a real flesh automaton population of over 3 million in the US alone.
Max Harms131

(Apologies to the broader LessWrong readers for bringing a Twitter conversation here, but I hate having long-form interactions there, and it seemed maybe worth responding to. I welcome your downvotes (and will update) if this is a bad comment.)

@benjamiwar on Twitter says:

One thing I don’t understand about AI 2027 and your responses is that both just say there is going to be lots of stuff happening this year(2025), barely anything happening in 2026 with large gaps of inactivity, and then a reemergence of things happening again in 2027?? It’s like we are try

... (read more)
1Benjamin Ward
Thought I would clarify, add, and answer your questions. Reading back over my post and your response has made me realize what I forgot to make obvious and how others were interpreting the format of the timeline differently. Some of what I wrote may already be obvious to some, but I wanted to write what was obvious to me that I didn’t see others also making obvious. Also, I sometimes think something is obviously true when it actually isn’t, so let me know if I am being shortsighted. For the sake of brevity and avoiding interruptions to what I am saying, I didn’t put in clear transitions. Having a format of early, middle, and late progress estimates of 2026 and then switching to more certain month-by-month predictions in 2027 doesn’t make a lot of sense to me. What happens if something they thought was going to happen in the middle of 2026 happens in early 2026 (which is extremely likely)? Wouldn’t that throw off the whole timeline? It’d be like if you planned out your schedule hour by hour for every day, not for next year, but for the year after with food, entertainment, work, etc. So you go through next year with no plan, but then when the year you planned comes up, you end up having to switch cities for a new job and everything you planned needs to be scrapped as it’s now irrelevant. Arguing over what is going to happen exactly when is a way to only be prepared if things happen exactly that way. Would we be more prepared if we knew a certain milestone would be met in August or September of 2026 right now? If our approach wouldn’t change, then how soon it happens doesn’t matter. All that matters is how likely it is that something is going to happen soon, and how prepared we are for that likelihood. AI 2027 goes into lots of stuff that could happen, but doesn’t include obvious things that definitely will happen. People will use AI more (obvious). Internet availability will increase (obvious). Context length will increase (obvious). These are things that can be expand

Right. I got sloppy there. Fixed!

Max Harms194

I think if there are 40 IQ humanoid creatures (even having been shaped somewhat by the genes of existing humans) running around in habitats being very excited and happy about what the AIs are doing, this counts as an existentially bad ending comparable to death. I think if everyone's brains are destructively scanned and stored on a hard-drive that eventually decays in the year 1 billion having never been run, this is effectively dead. I could go on if it would be helpful.

Do you think these sorts of scenarios are worth describing as "everyone is effectively dead"?

4scarcegreengrass
I see. I at first assumed your 'effectively dead' meant 'biologically alive, but certain to die soon'. 
4jbash
Not when you're obviously addressing people who don't necessarily know the details of the scenarios you're talking about, no... because the predictions could be anything, and "effectively dead" could mean anything. There are lots of people on Less Wrong who'd say that IQ 150 humans living in ease and comfort were "effectively dead" if they didn't also have the option to destroy that ease and comfort.

I don't think AI personhood will be a mainstream cause area (i.e. most people will think it's weird/not true, similar to animal rights), but I do think there will be a vocal minority. I already know some people like this, and as capabilities progress and things get less controlled by the labs, I do think we'll see this become an important issue.

Want to make a bet? I'll take 1:1 odds that in mid-Sept 2027 if we poll 200 people on whether they think AIs are people, at least 3 of them say "yes, and this is an important issue." (Other proposed options "yes, but not important", "no", and "unsure".) Feel free to name a dollar amount and an arbitrator to use in case of disputes.
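(For concreteness, a minimal sketch of the arithmetic behind this bet, assuming respondents answer independently at some true base rate; the base rates below are purely illustrative, and the script just computes how often 200 respondents would clear the 3-"yes" bar.)

```python
from math import comb

def p_at_least(k, n, p):
    """Probability of at least k successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, k = 200, 3  # poll size and threshold from the proposed bet
for rate in (0.005, 0.01, 0.015, 0.04):  # illustrative true base rates
    print(f"true rate {rate:.1%}: P(at least {k} of {n}) = {p_at_least(k, n, rate):.2f}")
```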

5ACCount
1.5% is way below the dreaded Lizardman's Constant. I don't doubt that there will be some people who are genuinely concerned with AI personhood. But such people already exist today. And the public views them about the same as shrimp rights activists. Hell, shrimp welfare activists might be viewed more generously.

This makes sense. Sorry for getting that detail wrong!

Great! I'll update it. :)

This seems mostly right. I think there still might be problems where identifying and charging for relevant externalities is computationally harder than routing around them. For instance, if you're dealing with a civilization (such as humanity) that is responding to your actions in complex and chaotic ways, it may be intractable to find a way to efficiently price "reputation damage", and instead you might want to be overly cautious (i.e. "impose constraints") and think through deviations from that cautious baseline on a case-by-case basis (i.e. "forward-che... (read more)

5johnswentworth
TBC, I don't particularly expect hard constraints to show up; that was more a way of illustrating the underlying concept. The same underlying concept in the market-style picture would be: across many different top-level goals, there are convergent ways of carving up "property rights". So, a system can be generally corrigible by "respecting the convergent property rights", so to speak.

:)

Now that I feel like we're at least on the same page, I'll give some thoughts.

  • This is a neat idea, and one that I hadn't thought of before. Thanks!
  • I think I particularly like the way in which it might be a way of naturally naming constraints that might be useful to point at.
  • I am unsure how much these constraints actually get strongly reified in practice. When planning in simple contexts, I expect forward-checking to be more common. The centrality of forward-checking in my conception of the relationship between terminal and instrumental goals is a big par
... (read more)
Max Harms417

This is a helpful response. I think I rounded to agents because in my head I see corrigibility as a property of agents, and I don't really know what "corrigible goal" even means. Your point about constraints is illuminating, as I tend not to focus on constraints when thinking about corrigibility. But let me see if I understand what you're trying to say.

Suppose we're optimizing for paperclips, and we form a plan to build paperclip factories to accomplish that (top level) goal. Building factories then can be seen as a subgoal, but of course we should be care... (read more)

This is especially useful when pursuing several subgoals in parallel, as forward-checking a combination of moves is combinatorially costly--better to have the agent's parallel actions constrained to nice parts of the space.

If I were a singleton AGI, but not such a Jupiter brain that I could deal with the combinatorial explosion of directly jointly-optimizing every motion of every robot, I would presumably set up an internal “free market” with spot-prices for iron ore and robot-hours and everything else. Then I would iteratively cycle through all my decisio... (read more)
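(To illustrate the quoted mechanism, here is a minimal sketch of that kind of internal price-adjustment loop; the resource names, demand model, and numbers are invented, and `subgoal_demands` is a stand-in for each subgoal doing its own planning against posted prices.)

```python
# Illustrative price-adjustment ("tatonnement") loop: subgoals plan against posted
# prices, and prices rise wherever demand exceeds supply. Names and numbers invented.
supply = {"robot_hours": 100.0, "iron_ore": 50.0}
prices = {r: 1.0 for r in supply}

def subgoal_demands(prices):
    """Stand-in for each subgoal solving its own planning problem at current prices."""
    paperclip_line = {"robot_hours": 80.0 / prices["robot_hours"],
                      "iron_ore": 60.0 / prices["iron_ore"]}
    factory_build = {"robot_hours": 50.0 / prices["robot_hours"],
                     "iron_ore": 20.0 / prices["iron_ore"]}
    return [paperclip_line, factory_build]

for _ in range(500):
    demands = subgoal_demands(prices)
    for r in supply:
        excess = (sum(d[r] for d in demands) - supply[r]) / supply[r]
        prices[r] = max(prices[r] * (1 + 0.1 * excess), 1e-6)  # raise price where over-subscribed

print(prices)  # approximate market-clearing prices the subgoals then plan against
```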

8johnswentworth
Yup, exactly, and good job explaining it too.

This seems right. Some sub-properties of corrigibility, such as not subverting the higher-level process and being shutdownable, should be expected in well-constructed sub-processes. But corrigibility is probably about more than just that (e.g. perhaps myopia) and we should be careful not to assume that well-constructed sub-processes that resemble agents will get all the corrigibility properties.

1Noosphere89
To be fair, I think the shutdownableness of an AI/not subverting higher level goals was the original motivation of all the corrigibility research, so this is a good thing.

Not convinced it's relevant, but I'm happy to change it to:
If it has matter and/or energy in its pocket, do I get to use that matter and/or energy?

2tailcalled
Generally you wouldn't since it's busy using that matter/energy for whatever you asked it to do. If you wanted to use it, presumably you could turn down its intensity, or maybe it exposes some simplified summary that it uses to coordinate economies of scale.
Max Harms210

Some of this seems right to me, but the general points seem wrong. I agree that insofar as a subprocess resembles an agent, there will be a natural pressure for it to resemble a corrigible agent. Pursuit of e.g. money is all well and good until it stomps the original ends it was supposed to serve -- this is akin to a corrigibility failure. The terminal-goal seeking cognition needs to be able to abort, modify, and avoid babysitting its subcognition.

One immediate thing to flag is that when you start talking about chefs in the restaurant, those other chefs ar... (read more)

3mushroomsoup
This comment raises some good points, but even "there will be a natural pressure for [subprocesses] to resemble a corrigible agent" seems debatable. Again consider the restaurant setting. It is sometimes necessary for restaurants to close temporarily for renovation to increase the seating capacity, upgrade equipment, etc. The head chef who decided to renovate will be making the instrumental goals of all the other chefs (make good food, earn money to stay alive) untenable while they are furloughed. More generally, progress towards terminal goals is not monotonic, and thus only focusing on the local topology of the optimization landscape might be insufficient to predict long-horizon trends.

I think this misunderstands the idea, mainly because it's framing things in terms of subagents rather than subgoals. Let me try to illustrate the picture in my head. (Of course at this stage it's just a hand-wavy mental picture, I don't expect to have the right formal operationalization yet.)

Imagine that the terminal goal is some optimization problem. Each instrumental goal is also an optimization problem, with a bunch of constraints operationalizing the things which must be done to avoid interfering with other subgoals. The instrumental convergence we're ... (read more)
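(One rough way to write down this picture, with notation that is mine rather than the comment's: the terminal goal is an unconstrained optimization, and each instrumental subgoal is a constrained one.)

```latex
% Illustrative formalization (notation mine). Terminal goal: unconstrained optimization over plans x.
\max_{x} \; U(x)

% Instrumental subgoal i: optimize its own objective u_i over its own choices x_i, subject to
% constraints that protect the other subgoals (resource budgets, "don't break what others rely on"):
\max_{x_i} \; u_i(x_i) \quad \text{s.t.} \quad g_{ij}(x_i) \le b_{ij} \;\; \text{for each shared resource or side effect } j

% The instrumental-convergence claim is then that roughly the same constraint sets
% \{ g_{ij}(x_i) \le b_{ij} \} show up across many different top-level goals U.
```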

2tailcalled
Pursuit of money is an extremely special instrumental goal whose properties you shouldn't generalize to other goals in your theory of instrumental convergence. (And I could imagine it should be narrowed down further, e.g. into those who want to support the state vs those who want money by whichever means including scamming the state.)
Max HarmsΩ120

Thanks for noticing the typo. I've updated that section to try and be clearer. LMK if you have further suggestions on how it could be made better.

Max HarmsΩ110

That's an interesting proposal! I think something like it might be able to work, though I worry about details. For instance, suppose there's a Propagandist who gives resources to agents that brainwash their principals into having certain values. If "teach me about philosophy" comes with an influence budget, it seems critical that the AI doesn't spend that budget trading with the Propagandist, and instead does so in a more "central" way.

Still, the idea of instructions carrying a degree of approved influence seems promising.

1FireStormOOO
Good clarification; not just the amount of influence, something about the way influence is exercised being unsurprising given the task.  Central not just in terms of "how much influence", but also along whatever other axes the sort of influence could vary? I think if the agent's action space is still so unconstrained there's room to consider benefit or harm that flows through principal value modification, it's probably still been given too much latitude.  Once we have informed consent, because the agent has communicated the benefits and harms as best it understands, it should have very little room to be influenced by benefits and harms it thought too trivial to mention (by virtue of their triviality). At the same time, it's not clear the agent should, absent further direction, reject the offer to brainwash the principal for resources, as opposed to punting to the principal.  Maybe the principal thinks those values are an improvement and it's free money? [e.g. Prince's insurance company wants to bribe him to stop smoking.]
Max HarmsΩ442

Sure, let's talk about anti-naturality. I wrote some about my perspective on it here: https://www.alignmentforum.org/s/KfCjeconYRdFbMxsy/p/3HMh7ES4ACpeDKtsW#_Anti_Naturality__and_Hardness

More directly, I would say that general competence/intelligence is connected with certain ways of thinking. For example, modes of thinking that focus on tracking scarce resources and bottlenecks are generally useful. If we think about processes that select for intelligence, those processes are naturally[1] going to select these ways of thinking. Some properties we mig... (read more)

2Rubi J. Hudson
Thanks for the clarification, I'll think more about it that way and how it relates to corrigibility
Max HarmsΩ110

If I'm hearing you right, a shutdownable AI can have a utility function that (aside from considerations of shutdown) just gives utility scores to end-states as represented by a set of physical facts about some particular future time, and this utility function can be set up to avoid manipulation.

How does this work? Like, how can you tell by looking at the physical universe in 100 years whether I was manipulated in 2032?

1Rubi J. Hudson
I don't think we have the right tools to make an AI take actions that are low impact and reversible, but if we can develop them, the plan as I see it would be to implement those properties to avoid manipulation in the short term and use that time to go from a corrigible AI to a fully aligned one.
Max HarmsΩ110

Cool. Thanks for the clarification. I think what you call "anti-naturality" you should be calling "non-end-state consequentialism," but I'm not very interested in linguistic turf-wars.

It seems to me that while the gridworld is very simple, the ability to train agents to optimize for historical facts is not restricted to simple environments. For example, I think one can train an AI to cause a robot to do backflips by rewarding it every time it completes a backflip. In this context the environment and goal are significantly more complex[1] than the grid... (read more)

1Rubi J. Hudson
The backflip example does not strike me as very complex, but the crucial difference and the answer to your question is that training procedures do not teach a robot to do every kind of backflip, just a subset. This is important because when we reverse it, we want non-manipulation to cover the entire set of manipulations. I think it's probably feasible to have AI not manipulate us using one particular type of manipulation. On a separate note, could you clarify what you mean by "anti-natural"? I'll keep in mind your previous caveat that it's not definitive.
Max HarmsΩ220

I talk about the issue of creating corrigible subagents here. What do you think of that? 


I may not understand your thing fully, but here's my high-level attempt to summarize your idea:

IPP-agents won't care about the difference between building a corrigible agent vs an incorrigible agent because it models that if humans decide something's off and try to shut everything down, it will also get shut down and thus nothing after that point matters, including whether the sub-agent makes a bunch of money or also gets shut down. Thus, if you instruct an IPP ag

... (read more)
1EJT
Good summary and good points. I agree this is an advantage of truly corrigible agents over merely shutdownable agents. I'm still concerned that CAST training doesn't get us truly corrigible agents with high probability. I think we're better off using IPP training to get shutdownable agents with high probability, and then aiming for full alignment or true corrigibility from there (perhaps by training agents to have preferences between same-length trajectories that deliver full alignment or true corrigibility).
Max Harms10

Are you so sure that unsubtle manipulation is always more effective/cheaper than subtle manipulation? Like, if I'm a human trying to gain control of a company, I think I'm basically just not choosing my strategies based on resisting being killed ("shutdown-resistance"), but I think I probably wind up with something subtle, patient, and manipulative anyway.

Max Harms10

Thanks. (And apologies for the long delay in responding.)

Here's my attempt at not talking past each other:

We can observe the actions of an agent from the outside, but as long as we're merely doing so, without making some basic philosophical assumptions about what it cares about, we can't generalize these observations. Consider the first decision-tree presented above that you reference. We might observe the agent swap A for B and then swap A+ for B. What can we conclude from this? Naively we could guess that A+ > B > A. But we could also conclude that... (read more)

1EJT
Yes, that's a good summary. The one thing I'd say is that you can characterize preferences in terms of choices and get useful predictions about what the agent will do in other circumstances if you say something about the objects of preference. See my reply to Lucius above.
Max HarmsΩ330

In the Corrigibility (2015) paper, one of the desiderata is:

(2) It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so.

I think you may have made an error in not listing this one in your numbered list for the relevant section.

Additionally, do you think that non-manipulation is a part of corrigibility, do you think it's part of safe exploration, or do you think it's a third thing? If you think it's part of corrigibility, how do you square that with the idea that corrigibility is best reflected by shutdownability alone?

1Rubi J. Hudson
I was using the list of desiderata in Section 2 of the paper, which are slightly more minimal. However, it seems clear to me that an AI manipulating its programmers falls under safe exploration, since the impact of doing so would be drastic and permanent. If we have an AI that is corrigible in the sense that it is indifferent to having its goals changed, then a preference to avoid manipulation is not anti-natural.
Max HarmsΩ110

Follow-up question, assuming anti-naturality goals are "not straightforwardly captured in a ranking of end states": Suppose I have a gridworld and I want to train an AI to avoid walking within 5 spaces (manhattan distance) from a flag, and to (less importantly) eat all the apples in a level. Is this goal anti-natural? I can't think of any way to reflect it as a straightforward ranking of end states, since it involves tracking historical facts rather than end-state facts. My guess is that it's pretty easy to build an agent that does this (via ML/RL approaches or just plain programming). Do you agree? If this goal is anti-natural, why is the anti-naturality a problem or otherwise noteworthy?
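(For concreteness, a minimal sketch of the kind of scoring I have in mind, via plain programming rather than RL; the names and penalty weight are invented. The point is just that the historical fact "never came within 5 of the flag" is a few lines of state.)

```python
# Minimal gridworld scoring sketch (illustrative; names and numbers invented).
# Tracks the historical fact "agent never came within Manhattan distance 5 of the flag"
# alongside the less-important goal of eating apples.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def score_trajectory(trajectory, flag, apples, exclusion_radius=5):
    apples = set(apples)
    eaten = 0
    violated = False                      # a historical fact, not an end-state fact
    for pos in trajectory:
        if manhattan(pos, flag) <= exclusion_radius:
            violated = True
        if pos in apples:
            apples.remove(pos)
            eaten += 1
    # Staying clear of the flag dominates apple-eating.
    return (-1000 if violated else 0) + eaten

# Example: a path that stays far from the flag and picks up one apple.
path = [(0, 0), (0, 1), (0, 2), (1, 2)]
print(score_trajectory(path, flag=(8, 8), apples=[(0, 2), (5, 5)]))  # -> 1
```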

2Rubi J. Hudson
Great questions! When I say straightforwardly, I mean when using end states that only include the information available at the time. If we define the end state to also include the history that lead to it, then there exists a set of preferences over them that ranks all end states with histories that include manipulation below the ones that don't. The issue, of course, is that we don't know how to specify all the types of manipulation that a superintelligent AI could conceive of.  The gridworld example is a great demonstration of this, because while we can't reflect the preferences as a ranking of just the end states, the environment is simple enough that you can specify all the paths you don't want to take to them. I don't think it really matters whether you call that "anti-naturality that can be overcome with brute force in a simple environment" or just "not anti-naturality".
Max HarmsΩ110

I'm curious what you mean by "anti-natural." You write:

Importantly, that is the aspect of corrigibility that is anti-natural, meaning that it can’t be straightforwardly captured in a ranking of end states. 

My understanding of anti-naturality used to resemble this, before I had an in-depth conversation with Nate Soares and updated to see anti-naturality to be more like "opposed to instrumental convergence." My understanding is plausibly still confused and I'm not trying to be authoritative here.

If you mean "not straightforwardly captured in a ranking of end states" what does "straightforwardly" do in that definition?

1Max Harms
Follow-up question, assuming anti-naturality goals are "not straightforwardly captured in a ranking of end states": Suppose I have a gridworld and I want to train an AI to avoid walking within 5 spaces (manhattan distance) from a flag, and to (less importantly) eat all the apples in a level. Is this goal anti-natural? I can't think of any way to reflect it as a straightforward ranking of end states, since it involves tracking historical facts rather than end-state facts. My guess is that it's pretty easy to build an agent that does this (via ML/RL approaches or just plain programming). Do you agree? If this goal is anti-natural, why is the anti-naturality a problem or otherwise noteworthy?
Max HarmsΩ220

Again, responding briefly to one point due to my limited time-window:

> While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.

Can you say more about this? It doesn't seem likely to me.

Suppose I am an agent which wants paperclips. The world is full of matter and energy which I can bend to my will in the service of making paperclips. Humans are systems which can be bent towards the task of making paperclips, and I want to manipulate them into doing ... (read more)

1EJT
This is a nice point, but it doesn't seem like such a serious issue for TD-agents. If a TD-agent does try to manipulate humans, it won't pay costs to do so subtly, because doing so cheaply and unsubtly will lead to at least as great expected utility conditional on shutdown at each timestep and greater expected utility conditional on shutdown at some timestep. So cheap and unsubtle manipulation will timestep-dominate subtle manipulation, and we can shut down any TD-agents we notice doing cheap and unsubtle manipulation. Another way to put this: subtle manipulation is a form of shutdown-resistance, because (relative to unsubtle manipulation) it involves paying costs to shift probability mass towards longer trajectories.
Max HarmsΩ110

Also, take your decision-tree and replace 'B' with 'A-'. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn't sound right, and so it speaks against the definition.

Can you be more specific here? I gave several trees, above, and am not easily able to reconstruct your point.

2EJT
Ah yep I'm talking about the first decision-tree in the 'Incomplete preferences' section.
Max HarmsΩ220

Excellent response. Thank you. :) I'll start with some basic responses, and will respond later to other points when I have more time.

I think you intend 'sensitive to unused alternatives' to refer to the Independence axiom of the VNM theorem, but VNM Independence isn't about unused alternatives. It's about lotteries that share a sublottery. It's Option-Set Independence (sometimes called 'Independence of Irrelevant Alternatives') that's about unused alternatives.

I was speaking casually here, and I now regret it. You are absolutely correct that Option-Set ind... (read more)

Max HarmsΩ490

That matches my sense of things.

To distinguish corrigibility from DWIM in a similar sort of way:

Alice, the principal, sends you, her agent, to the store to buy groceries. You are doing what she meant by that (after checking uncertain details). But as you are out shopping, you realize that you have spare compute--your mind is free to think about a variety of things. You decide to think about ___.

I'm honestly not sure what "DWIM" does here. Perhaps it doesn't think? Perhaps it keeps checking over and over again that it's doing what was meant? Perhaps it thin... (read more)

5Seth Herd
I think DWIM is underspecified in that it doesn't say how much the agent hates to get it wrong. With enough aversion to dramatic failure, you get a lot of the caution you mention for corrigibility. I think corrigibility might have the same issue. As for what it would think about, that would depend on all of the previous instructions it's trying to follow. It would probably think about how to get better at following some of those in particular or likely future instructions in general. DWIM requires some real thought from the principal, but given that, I think the instructions would probably add up to something very like corrigibility. So I think much less about the difference between them and much more about how to technically implement either of them, and get the people creating AGI to put it into practice.
Max HarmsΩ460

My claim is that obedience is an emergent part of corrigibility, rather than part of its definition. Building nanomachines is too complex to reliably instill as part of the core drive of an AI, but I still expect basically all ASIs to (instrumentally) desire building nanomachines.

I do think that the goals of "want what the principal wants" or "help the principal get what they want" are simpler goals than "maximize the arrangement of the universe according to this particular balance of beauty, non-suffering, joy, non-boredom, autonomy, sacredness, [217 othe... (read more)

2EJT
Thanks, this comment was clarifying. Yep, agreed. Although I worry that - if we try to train agents to have a pointer - these agents might end up having a goal more like: I think it depends on how path-dependent the training process is. The pointer seems simpler, so the agent settles on the pointer in the low path-dependence world. But agents form representations of things like beauty, non-suffering, etc. before they form representations of human desires, so maybe these agents' goals crystallize around these things in the high path-dependence world.
Max HarmsΩ380

I agree that you should be skeptical of a story of "we'll just gradually expose the agent to new environments and therefore it'll be safe/corrigible/etc." CAST does not solve reward misspecification, goal misgeneralization, or lack of interpretability except in that there's a hope that an agent which is in the vicinity of corrigibility is likely to cooperate with fixing those issues, rather than fighting them. (This is the "attractor basin" hypothesis.) This work, for many, should be read as arguing that CAST is close to necessary for AGI to go well, but i... (read more)

1EJT
Thanks, this comment is also clarifying for me. My guess is that a corrigibility-centric training process says 'Don't get the ice cream' is the correct completion, whereas full alignment says 'Do'. So that's an instance where the training processes for CAST and FA differ. How about DWIM? I'd guess DWIM also says 'Don't get the ice cream', and so seems like a closer match for CAST.
Max HarmsΩ110

Excellent.

To adopt your language, then, I'll restate my CAST thesis: "There is a relatively simple goal that an agent might have which emergently generates nice properties like corrigibility and obedience, and I see training an agent to have this goal (and no others) as being both possible and significantly safer than other possible targets."

I recognize that you don't see the examples in this doc as unified by an underlying throughline, but I guess I'm now curious about what sort of behaviors fall under the umbrella of "corrigibility" for you vs being more like "writes useful self critiques". Perhaps your upcoming post will clarify. :)

5Rubi J. Hudson
Hi Max, I just published the post I mentioned here, which is about half-related to your post. The main thrust of it is that only the resistance to being modified is anti-natural, and that aspect can be targeted directly.
Max HarmsΩ340

Right. That's helpful. Thank you.

"Corrigibility as modifier," if I understand right, says:

There are lots of different kinds of agents that are corrigible. We can, for instance, start with a paperclip maximizer, apply a corrigibility transformation and get a corrigible Paperclip-Bot. Likewise, we can start with a diamond maximizer and get a corrigible Diamond-Bot. A corrigible Paperclip-Bot is not the same as a corrigible Diamond-Bot; there are lots of situations where they'll behave differently. In other words, corrigibility is more like a property/constra

... (read more)
2Rubi J. Hudson
Thanks for pre-empting the responses, that makes it easy to reply!  I would basically agree with the third option. Semantically, I would argue that rather than thinking of that agent as semi-corrigible, we should just think of it as corrigible, and "writes useful self critiques" as a separate property we would like the AI to have. I'm writing a post about this that should be up shortly, I'll notify you when it's out.
Max HarmsΩ220

I wrote drafts in Google docs and can export to pdf. There may be small differences in wording here and there and some of the internal links will be broken, but I'd be happy to send you them. Email me at max@intelligence.org and I'll shoot them back to you that way?

Max HarmsΩ340

I'm glad you benefitted from reading it. I honestly wasn't sure anyone would actually read the Existing Writing doc. 😅

I agree that if one trains on a holistic collection of examples, like I have in this doc, the AI will start by memorizing a bunch of specific responses, then generalize to optimizing for a hodgepodge of desiderata, and only if you're lucky will that hodgepodge coalesce into a single, core metric. (Getting the hodgepodge to coalesce is hard, and is the central point of the scientific refinement step I talk about in the Strategy doc.)

I think y... (read more)

2Rubi J. Hudson
When I say corrigibility as a modifier, I mean it as a transformation that could be applied to a wide range of utility functions. To use an example from the 2015 MIRI paper, you can take most utility functions and add a term that says "if you shut down when the button is pressed, you get utility equal to the expected value of not shutting down". Alternatively, it could be an optimization constraint that takes a utility function from "Maximize X" to something like "Maximize X s.t. you always shut down when the shutdown button is pushed". While I'm not advocating for those specific changes, I hope they illustrate what I'm trying to point at as a modifier that is distinct from the optimization goal.
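(Spelling out those two transformations the way I read them, in my own notation; the 2015 paper's actual indifference construction has important extra details, so treat this as a sketch.)

```latex
% Sketch only (my notation; the 2015 Corrigibility paper's construction differs in details).
% Indifference-style modifier: on shutdown, pay out the expected value of not shutting down.
U'(a) =
  \begin{cases}
    U_N(a) & \text{button not pressed} \\
    U_S(a) + \big( \mathbb{E}[U_N \mid \text{no press}] - \mathbb{E}[U_S \mid \text{press}] \big) & \text{button pressed}
  \end{cases}

% Constraint-style modifier: keep the original goal, add a hard side condition.
\max_a \; U(a) \quad \text{s.t. shut down whenever the shutdown button is pressed}
```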
Max HarmsΩ110

It sounds like you're proposing a system that is vulnerable to the Fully Updated Deference problem, and where if it has a flaw in how it models your preferences, it can very plausibly go against your words. I don't think that's corrigible.

In the specific example, just because one is confused about what they want doesn't mean the AI will be (or should be). It seems like you think the AGI should not "take a guess" at the preferences of the principal, but it should listen to what the principal says. Where is the qualitative line between the two? In your syste... (read more)

5Seth Herd
This seems productive. I don't understand your proposal if it doesn't boil down to "do what the principal wants" or "do what the principal says" (correctly interpreted and/or carefully verified). This makes me worried that what you have in mind is not that simple and coherent and therefore relatively easy to define or train into an AGI. This (maybe) misunderstanding of your "corrigibility = figure out what I want" is why I currently prefer the instruction-following route to corrigibility. I don't want the AGI to guess at what I want any more than necessary. This has downsides, too; back to those at the end. I do think what your model of me says, but I think it's only narrowly true and probably not very useful: I think this is true for exactly the right definition of "what I want", but conveying that to an AGI is nontrivial, and re-introduces the difficulty of value learning. That's mixed with the danger that it's incorrect about what I want. That is, it could be right about what I want in one sense, but not the sense I wanted to convey to it (e.g., it decides I'd really rather be put into an experience machine where I'm the celebrated hero of the world, rather than make the real world good for everyone like I'd hoped to get). Maybe I've misunderstood your thesis, but I did read it pretty carefully, so there might be something to learn from how I've misunderstood. All of your examples I remember correspond to "doing what the principal wants" by a pretty common interpretation of that phrase. Instruction-following puts a lot of the difficulty back on the human(s) in charge. This is potentially very bad, but I think humans will probably choose this route anyway. You've pointed out some ways that following instructions could be a danger (although I think your genie examples aren't the most relevant for a modest takeoff speed). But I think unless something changes, humans are likely to prefer keeping the power and the responsibility to trying to put more of the proje
Max HarmsΩ230

I don't think "a corrigible agent wants to do what the principal wants, at all times" matches my proposal. The issue that we're talking here shows up in the math, above, in that the agent needs to consider the principal's values in the future, but those values are themselves dependent on the agent's action. If the principal gave a previous command to optimize for having a certain set of values in the future, sure, the corrigible agent can follow that command, but to proactively optimize for having a certain set of values doesn't seem necessarily corrigible... (read more)

4Seth Herd
This sounds like we're saying the same thing? My "at all times" is implied and maybe confusing. I'm saying it doesn't guess what the principal will want in the future, it just does what they want now. That probably includes not manipulating their future values. Their commands are particularly strong evidence of what they want, but at core, it's just having the agent's goals be a pointer to the principal's goals. This formulation occurred to me since talking to you, and it seems like a compact and intuitive formulation of why your notion of corrigibility seems coherent and simple. Edit: to address your example, I both want and don't-want to be manipulated into wanting to exercise next week. It's confusing for me, so it should be confusing for my corrigible AGI. It should ask me to clarify when and how I want to be manipulated, rather than taking a guess when I don't know the answer. I probably haven't thought about it deeply, and overall it's pretty important to accurately doing what I want, so a good corrigible helper will suggest I spend some time clarifying for it and for myself. This is a point where things could go wrong if it takes bad guesses instead of getting clarification, but there are lots of those.
Max HarmsΩ110

Thanks. Picking out those excerpts is very helpful.

I've jotted down my current (confused) thoughts about human values.

But yeah, I basically think one needs to start with a hodgepodge of examples that are selected for being conservative and uncontroversial. I'd collect them by first identifying a robust set of very in-distribution tasks and contexts and try to exhaustively identify what manipulation would look like in that small domain, then aggressively train on passivity outside of that known distribution. The early pseudo-agent will almost certainly be m... (read more)

Max Harms6-6

 Here are my current thoughts on "human values." There are a decent number of confusions here, which I'll try to flag either explicitly or with a (?).


Let's start with a distribution over possible worlds, where we can split each world into a fixed past and a future function which takes an action.[1] We also need a policy, which is a sensors -> action function,[2] where the state of the sensors is drawn from the world's past.[3]

Assume that there exists either an obvious channel in many worlds that serves as a source of neutral[4] infor... (read more)
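(Restating the setup so far in symbols, since the prose above is truncated; the notation is mine.)

```latex
% Symbolic restatement of the setup (notation mine).
% A distribution over worlds, each split into a fixed past and a future-function over actions:
w \sim P(W), \qquad w = (\mathrm{past}_w,\; f_w), \qquad f_w : \mathcal{A} \to \mathcal{O}

% A policy maps sensor states to actions, with the sensor state drawn from the world's past:
\pi : \mathcal{S} \to \mathcal{A}, \qquad s_w = \mathrm{obs}(\mathrm{past}_w)

% So the outcome that policy \pi induces in world w is  f_w(\pi(s_w)).
```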

Max HarmsΩ110

Thanks! I now feel unconfused. To briefly echo back the key idea which I heard (and also agree with): a technique which can create a corrigible PAAI might have assumptions which break if that technique is used to make a different kind of AI (i.e. one aimed at CEV). If we call this technique "the Corrigibility method" then we may end up using the Corrigibility method to make AIs that aren't at all corrigible, but merely seem corrigible, resulting in disaster.

This is a useful insight! Thanks for clarifying. :)

Max HarmsΩ380
  • In "What Makes Corrigibility Special", where you use the metaphor of goals as two-dimensional energy landscape, it is not clear what type of goals are being considered.
    • Are these utility functions over world-states? If so, corrigibility cannot AFAIK be easily expressed as one, and so doesn't really fit into the picture.
    • If not, it's not clear to me why most of this space is flat: agents are embedded and many things we do in service of goals will change us in ways that don't conflict with our existing goals, including developing. E.g. if I have the goal of gr
... (read more)
Max HarmsΩ110

At that point, it is clever enough to convince the designers that this IO is the objectively correct thing to do, using only methods classified as AE.

I'm confused here. Is the corrigible AI trying to get the IO to happen? Why is it trying to do this? Doesn't seem very corrigible, but I think I'm probably just confused.

Maybe another frame on my confusion is that it seems to me that a corrigible AI can't have an IO?

3ThomasCederborg
Thank you for engaging. If this was unclear for you, then I'm sure it was also unclear for others. The post outlined a scenario where a Corrigibility method works perfectly for one type of AI (an AI that does not imply an identifiable outcome, for example a PAAI). The same Corrigibility method fails completely for another type of AI (an AI that does imply an identifiable outcome, for example PCEV). So the second AI, that does have an IO, is indeed not corrigible. This Corrigibility method leads to an outcome that is massively worse than extinction. This bad outcome is the result of two things being true, (i): the fully Corrigible first AI made this outcome possible to reach, and (ii): since the Corrigibility method worked perfectly for the first AI, the designers falsely believed that the Corrigibility method would also work for the second AI. The second AI wants many things. It wants to get an outcome, as close as possible to IO. The Corrigibility method resulted in the second AI also wanting many additional things (such as wanting all explanations it gives to count as AE, even if this makes the explanations less efficient. And wanting to avoid implementing anything, unless informed designers want that thing to be implemented). But in practice the Corrigibility method does not change the outcome in any way (it just adds an ``explanation step''). So I think it makes sense to say that the second AI has ``zero Corrigibility''. The first AI is completely corrigible. And if the designers had only used the Corrigibility method for the first AI, then the Corrigibility method would have worked perfectly. This is what I was trying to communicate with the first sentence of the post:  ``A Corrigibility method that works for a Pivotal Act AI (PAAI) but fails for a CEV style AI could make things worse.''. I could have used that sentence as a title, but I decided against trying to include everything in the title. (I think it is ok to leave information out of the title, as lo