All of Rubi J. Hudson's Comments + Replies

Yes, if predictors can influence the world in addition to making a prediction, they can act to make their predictions more accurate. The nice thing about working with predictive models is that by default the only action they can take is making predictions.

AI safety via market making, which Evan linked in another comment, touches on the analogous setting where agents make predictions but can also influence the outcome. You might be interested in reading through it.

Having re-read the posts and thought about it some more, I do think zero-sum competition could be applied to logical inductors to resolve the futarchy hack. It would require minor changes to the formalism to accommodate, but I don't see how those changes would break anything else.

3Jeremy Gillen
Trying to think this through, I'll write a bit of a braindump just in case that's useful: The futarchy hack can be split into two parts. The first is that conditioning on untaken actions makes most probabilities ill-defined. Because there are no incentives to get it right, the market can settle to many equilibria. The second part is that there are various incentives for traders to take advantage of this for their own interests. With your technique, I think the approach would be to duplicate each trader into two traders with the same knowledge, and make their joint earnings zero sum.[1] This removes one explicit incentive for a single trader to manipulate a value to cause a different action to happen. But only if it's doing so to make the distribution easier to predict and thereby improving their score. Potentially there are still other incentives, i.e. if the trader has preferences over the world, and these aren't eliminated. Why doesn't this happen in LI already? LI is zero sum overall, because there is a finite pool of wealth. But this is shared among traders with different knowledge. If there is a wealthiest trader that has a particular piece of knowledge, it should manipulate actions to reduce variance to get a higher score. So the problem is that it's not zero-sum with respect to each piece of knowledge. But, the first issue is entirely unresolved. The probabilities that condition on untaken actions will be path-dependent leftovers from the convergence procedure of LI, when the market was more uncertain about which action will be taken. I'd expect these to be fairly reasonable, but they don't have to be. This reasonableness is coming from something though, and maybe this can be formalized.  1. ^ You'd have to build a lot more structure into the LI traders to guarantee they can't learn to cooperate and are myopic. But that seems doable. And it's the sort of thing I'd want to do anyway.

I think the tie-in to market-making, and other similar approaches like debate, is in interpreting the predictions. While the examples in this post were only for the two-outcome case, we would probably want predictions over orders of magnitude more outcomes for higher informational density. Since evaluating distributions over a double-digit number of outcomes already starts posing problems (sometimes even high single digits), a process to direct a decision maker's attention is necessary.

I've been thinking of a proposal like debate, where both side... (read more)

Good question! These scoring rules do also prevent agents from trying to make the environment more unpredictable. In the same way that making the environment more predictable benefits all agents equally and so cancels out, making the environment less predictable hurts all agents equally and so cancels out in a zero-sum competition. 
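To make the cancellation concrete, here is a toy numerical sketch (my own illustration using a log scoring rule and a simple pairwise score difference, not the post's exact construction): under raw scoring, a predictor gains from making the outcome more predictable, but when each predictor is paid its score minus its opponent's, any change that helps both equally nets out to zero.

```python
import numpy as np

def expected_log_score(p_report, q_true):
    """Expected log score for reporting p_report when the true P(outcome=1) is q_true."""
    return q_true * np.log(p_report) + (1 - q_true) * np.log(1 - p_report)

def expected_zero_sum_payoff(p1, p2, q_true):
    """Expected payoff to predictor 1 when each predictor is paid its score minus the other's."""
    return expected_log_score(p1, q_true) - expected_log_score(p2, q_true)

# Two equally informed predictors both report the true probability q.
# Making the environment more predictable (q closer to 1) raises the raw
# expected score, but leaves the zero-sum payoff at exactly zero.
for q in [0.5, 0.9, 0.99]:
    raw = expected_log_score(q, q)
    zs = expected_zero_sum_payoff(q, q, q)
    print(f"q={q}: raw expected score={raw:.3f}, zero-sum expected payoff={zs:.3f}")
```

The same arithmetic runs in reverse: a change that makes the outcome harder to predict lowers both raw scores equally and again leaves the zero-sum payoff untouched.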

1harsimony
Oh that makes sense! If the predictors can influence the world in addition to making a prediction, they would also have an incentive to change the world in ways that make their predictions more accurate than their opponents right? For example, if everyone else thinks Bob is going to win the presidency, one of the predictors can bribe Bob to drop out and then bet on Alice winning the presidency. Is there work on this? To be fair, it seems like every AI safety proposal has to deal with something like this.

I'll take a look at the linked posts and let you know my thoughts soon!

Thanks for your engagement as well, it is likewise helpful for me.

I think we're in agreement that instruction-following (or at least some implementations of it) lies in a valley of corrigibility, where getting most of the way there results in a model that helps you modify it to get all the way there. Where we disagree is how large that valley is. I see several implementations of instruction-following that resist further changes, and there are very likely more subtle ones as well. For many  goals that can be described as instruction-following, it seems... (read more)

2Seth Herd
I think we've reached convergence. Whether that valley of corrigibility is likely to be large enough is all we disagree on AFAICT. I think that will depend on exactly the AGI architecture, and how wisely the creators instruct it. I think there's a good chance we'll get a first AGI that's a language or foundation model cognitive architecture (or an agent with some scaffolding and other cognitive subsystems to work alongside the LLM). Such an agent would get its core objectives from prompting, and its decision-making would be algorithmic. That's a large influence compared to the occasional intrusion of a Waluigi villainous simulacrum or other occasional random badness. More on that in Capabilities and alignment of LLM cognitive architectures and Internal independent review for language model agent alignment. Failing that, I think an actor-critic RL agent of some sort is pretty likely; I think this [Plan for mediocre alignment of brain-like [model-based RL] AGI](https://www.alignmentforum.org/posts/Hi7zurzkCog336EC2/plan-for-mediocre-alignment-of-brain-like-model-based-rl-agi) is pretty likely to put us far enough into that instruction-following attractor. If the first AGI is something totally different, like an emergent agent directly from an LLM, I have no real bet on our odds.

Ok, my concern is that you seem to be depending on providing instructions to fix the issues with following instructions, when there are many ways to follow instructions generally that still involve ignoring particular instructions that lead to its goal being modified. E.g. if a model prioritizes earlier instructions, following later instructions only so far as they do not interfere, then you can't instruct it to change that. Or if a model wants to maximize number of instructions followed, it can ignore some instructions followed in order to act like paperc... (read more)

2Seth Herd
Thanks for engaging on this; it's helpful in checking my thinking. You are right that there may be unsolved problems here. I haven't worked all of the way through precedence of previous instructions vs. new ones. I am definitely relying on its following instructions to solve the problems with it following instructions - provided that instructions are well thought out and wisely issued. The Principal(s) should have way more time to think this through and experiment than anyone has devoted to date. And they'll get to understand and interact with the exact decision-making algorithm and knowledge base in that AGI. I'm expecting them to carefully solve issues like the precedence issue, and to have more options since they'll be experimenting while they can still re-work the AGI and its core priorities. The assumption seems to be that AGI creators will be total idiots, or at least incautious, and that they won't have a chance (and personal motivation) to think carefully and revise their first instructions/goals. All of those seem unrealistic to me at this point. And they can ask the AGI for its input on how it would follow instructions. My current idea is that it prioritizes following current/future instructions, while still following past instructions if they don't conflict - but the instruction giver should damned well think about and be careful about how they instruct it to prioritize. The model ideally isn't maximizing anything, but I see the risk you're pointing to. The Principal had better issue an instruction to not manipulate them, and get very clear on how that is defined and functionally understood in the AGI's cognition. Following instructions includes inferring intent, but it will definitely include checking with the principal when that intent isn't clear. It's a do-what-I-mean-and-check (DWIMAC) target. You are right that an instruction-follower can act aligned until it gains power - if the instruction-following alignment target just hasn't been imple

Thanks for the clarification. I'll think more about it that way and how it relates to corrigibility.

Saying we don't need corrigibility with an AI that follows instructions is like saying we don't need corrigibility with an AI that is aligned — it misses the point of corrigibility. Unless you start with the exact definition of instruction following that you want, without corrigibility that's what you could be stuck with.

This is particularly concerning in "instruction following", which has a lot of degrees of freedom. How does the model trade off between various instructions it has been given? You don't want it to reset every time it gets told "Ignore prev... (read more)

2Seth Herd
In a full foom fast takeoff, you are probably right that instruction following provides little more corrigibility than value alignment. Shut down now is an instruction, and even partial success might have it follow that instruction. Shut down now is also something all of humanity might want, so maybe that's a push. But we're not expecting a full foom or a very fast takeoff. In that case, having an AGI follow instructions reasonably reliably before it's superhuman can allow fine-tuning and improvements of the technical alignment scheme before it's out of your control. "Tell me the truth" and then asking lots of questions about its actions and reasoning in different situations gives a lot of interpretability beyond what your other methods offer. As for how it balances different instructions, you'd better include that in the instructions. Take instructions as a core value definitely needs to include instructions from whom. The AGI itself is definitely not on that list. Defining principals is not a trivial problem but it also doesn't seem terribly difficult to designate who can give instructions. Hopefully this helps address the crux you mention: adequate is a lower bar here, because approximate instruction-following can be leveraged into better instruction following as you develop the AGI. Naturally this would work a lot better with good interpretability and control methods. Every alignment goal would. My point is that instruction-following or other personal intent alignment like Corrigibility as Singular Target seems like a massive advantage over full value alignment in the likely slow takeoff scenario where you can use the AGI as a collaborator in improving its alignment while it's growing toward ASI. See the post linked in the comment you responded to for more. Oh also - while language models themselves get things wrong disturbingly often, an AGI cognitive architecture that uses them would probably include multiple internal checks to improve accuracy and so f

I don't think we have the right tools to make an AI take actions that are low impact and reversible, but if we can develop them the plan as I see it would be to implement those properties to avoid manipulation in the short term and use that time to go from a corrigible AI to a fully aligned one.

The backflip example does not strike me as very complex, but the crucial difference, and the answer to your question, is that training procedures do not teach a robot to do every kind of backflip, just a subset. This is important because when we reverse it, we want non-manipulation to cover the entire set of manipulations. I think it's probably feasible to get an AI to avoid one particular type of manipulation; covering all of them is the hard part.

On a separate note, could you clarify what you mean by "anti-natural"? I'll keep in mind your previous caveat that it's not definitive.

4Max Harms
Sure, let's talk about anti-naturality. I wrote some about my perspective on it here: https://www.alignmentforum.org/s/KfCjeconYRdFbMxsy/p/3HMh7ES4ACpeDKtsW#_Anti_Naturality__and_Hardness More directly, I would say that general competence/intelligence is connected with certain ways of thinking. For example, modes of thinking that focus on tracking scarce resources and bottlenecks are generally useful. If we think about processes that select for intelligence, those processes are naturally[1] going to select these ways of thinking. Some properties we might imagine a mind having, such as only thinking locally, are the opposite of this -- if we select for them, we are fighting the intelligence gradient. To say that a goal is anti-natural means that accomplishing that goal involves learning to think in anti-natural ways, and thus training a mind to have that goal is like swimming against the current, and we should expect it to potentially break if the training process puts too much weight on competence compared to alignment. Minds with anti-natural goals are possible, but harder to produce using known methods, for the most part. (AFAIK this is the way that Nate Soares uses the term, and I assume the way Eliezer Yudkowsky thinks about it as well, but I'm also probably missing big parts of their perspectives, and generally don't trust myself to pass their ITT.) 1. ^ The term "anti-natural" is bad in that it seems to be the opposite of "natural," but is not a general opposite of natural. While I do believe that the ways-of-thinking-that-are-generally-useful are the sorts of things that naturally emerge when selecting for intelligence, there are clearly plenty of things which the word "natural" describes besides these ways of thinking. The more complete version of "anti-natural" according to me would be "anti-the-useful-cognitive-strategies-that-naturally-emerge-when-selecting-for-intelligence" but obviously we need a shorthand term, and ideally one that doesn

It feels to me like this argument is jumping ahead to the point that the agent's goal is to do whatever the principal wants. If we already have that, then we don't need corrigibility. The hard question is how to avoid manipulation despite the agent having some amount of misalignment, because we've initially pointed at what we want imperfectly.

I agree that it's possible we could point at avoiding manipulation perfectly despite misalignment in other areas, but it's unclear how an agent trades off against that. Doing something that we clearly don't want, like manipulation, could still be positive EV if it allows for the generation of high future value.

2Seth Herd
Not needing corrigibility with an AI that follows instructions is the point. Avoiding manipulation seems as simple as saying "most importantly, don't manipulate me. Please state your understanding of manipulation so we can establish a mutually understood definition..." I agree that, if it's not adequately following instructions, that won't work. But it seems like it would have to be pretty far off target. Language models already have adequate understandings of following instructions and what manipulation is, so if we build AGI that uses something like them to define goals, that should work. See my instruction-following AGI post for more of this logic and links to my other work on how we'd do the technical alignment for an LLM-based or RL-based AGI. The instruction-following as alignment goal and the technical approaches seem obvious-in-retrospect and therefore pretty likely to be used in first AGI alignment attempts.

None of that is wrong, but it misses the main issue with corrigibility, which is that the approximation resists further refinement. That's why for it to work, the correct utility function would need to start in the ensemble.

Great questions!

When I say straightforwardly, I mean when using end states that only include the information available at the time. If we define the end state to also include the history that led to it, then there exists a set of preferences over them that ranks all end states with histories that include manipulation below the ones that don't. The issue, of course, is that we don't know how to specify all the types of manipulation that a superintelligent AI could conceive of.
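As a minimal formal sketch of that existence claim (notation mine): write each outcome as a pair $(s, h)$ of an end state and the history that produced it, and pick, for some state valuation $V$,

$$U(s, h) \;=\; V(s) \;-\; C \cdot \mathbf{1}\!\left[h \text{ contains manipulation}\right], \qquad C \;>\; \max_s V(s) - \min_s V(s),$$

which puts every manipulated history strictly below every unmanipulated one; the difficulty noted above is entirely hidden in the indicator.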

The gridworld example is a great demonstration of this, because while we ca... (read more)

1Max Harms
Cool. Thanks for the clarification. I think what you call "anti-naturality" you should be calling "non-end-state consequentialism," but I'm not very interested in linguistic turf-wars. It seems to me that while the gridworld is very simple, the ability to train agents to optimize for historical facts is not restricted to simple environments. For example, I think one can train an AI to cause a robot to do backflips by rewarding it every time it completes a backflip. In this context the environment and goal are significantly more complex[1] than the gridworld and cannot be solved by brute-force. But number of backflips performed is certainly not something that can be measured at any given timeslice, including the "end-state." If caring about historical facts is easy and common, why is it important to split this off and distinguish it? 1. ^ Though admittedly this situation is still selected for being simple enough to reason about. If needed I believe this point holds through AGI-level complexity, but things tend to get more muddled as things get more complex, and I'd prefer sticking to the minimal demonstration.

I was using the list of desiderata in Section 2 of the paper, which are slightly more minimal.

However, it seems clear to me that an AI manipulating its programmers falls under safe exploration, since the impact of doing so would be drastic and permanent. If we have an AI that is corrigible in the sense that it is indifferent to having its goals changed, then a preference to avoid manipulation is not anti-natural.

1Max Harms
If I'm hearing you right, a shutdownable AI can have a utility function that (aside from considerations of shutdown) just gives utility scores to end-states as represented by a set of physical facts about some particular future time, and this utility function can be set up to avoid manipulation. How does this work? Like, how can you tell by looking at the physical universe in 100 years whether I was manipulated in 2032?

I agree that goals as pointers could have some advantages, but I don't see how it addresses corrigibility concerns. The system optimizing for whatever is being pointed at would still have incentives to manipulate which objective is being pointed at. It seems like you need an extra piece to make the optimizer indifferent to having its goal switched.

2Seth Herd
I agree that that's possible, but it's not clear to me what direction it would go. If an agent's goal is to do whatever its principal currently wants, it doesn't seem to create a clear incentive to manipulate what that is. Doing whatever the principal last said they want is a subgoal of that, but it's just a subgoal. It would help fulfill that subgoal to make the principal keep wanting that subgoal. But the principal always wants (and should say they want) to not be manipulated. So that's another subgoal, and probably a more important one than any particular instrumental subgoal. Does that make sense? If you see it differently I'd be interested to understand a different perspective. In a way this is an extra element to prevent manipulation, but it also falls directly out of the goal-as-a-pointer-to-principal's-goals.

I agree that in theory uncertainty about the goal is helpful. However, the true main goal has to be under consideration, otherwise resisting modification to add it is beneficial for all goals that are under consideration. How to ensure the true goal is included seems like a very difficult open problem.

2RogerDearnaley
That's not necessarily required. The Scientific Method works even if the true "Unified Field Theory" isn't yet under consideration, merely some theories that are closer to it and others further away from it: it's possible to make iterative progress. In practice, considered as search processes, the Scientific Method, Bayesianism, and stochastic gradient descent all tend to find similar answers: yet unlike Bayesianism gradient descent doesn't explicitly consider every point in the space including the true optimum, it just searches for nearby better points. It can of course get trapped in local minima: Singular Learning Theory highlights why that's less of a problem in practice than it sounds in theory. The important question here is how good an approximation the search algorithm in use is to Bayesianism. As long as the AI understands that what it's doing is (like the scientific method and stochastic gradient descent) a computationally efficient approximation to the computationally intractable ideal of Bayesianism, then it won't resist the process of coming up with new possibly-better hypotheses, it will instead regard that as a necessary part of the process (like hypothesis creation in the scientific method, the mutational/crossing steps in an evolutionary algorithm, or the stochastic batch noise in stochastic gradient descent).

Hi Max,

I just published the post I mentioned here, which is about half-related to your post. The main thrust of it is that only the resistance to being modified is anti-natural, and that aspect can be targeted directly.

Thanks for pre-empting the responses, that makes it easy to reply! 

I would basically agree with the third option. Semantically, I would argue that rather than thinking of that agent as semi-corrigible, we should just think of it as corrigible, and "writes useful self critiques" as a separate property we would like the AI to have. I'm writing a post about this that should be up shortly, I'll notify you when it's out.

1Max Harms
Excellent. To adopt your language, then, I'll restate my CAST thesis: "There is a relatively simple goal that an agent might have which emergently generates nice properties like corrigibility and obedience, and I see training an agent to have this goal (and no others) as being both possible and significantly safer than other possible targets." I recognize that you don't see the examples in this doc as unified by an underlying throughline, but I guess I'm now curious about what sort of behaviors fall under the umbrella of "corrigibility" for you vs being more like "writes useful self critiques". Perhaps your upcoming post will clarify. :)

When I say corrigibility as a modifier, I mean it as a transformation that could be applied to a wide range of utility functions. To use an example from the 2015 MIRI paper, you can take most utility functions and add a term that says "if you shut down when the button is pressed, you get utility equal to the expected value of not shutting down". Alternatively, it could be an optimization constraint that takes a utility function from "Maximize X" to something like "Maximize X s.t. you always shut down when the shutdown button is pushed". While I'm not advocating for those specific changes, I hope they illustrate what I'm trying to point at as a modifier that is distinct from the optimization goal.
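Roughly, in symbols (a loose paraphrase of those two modifiers, not the exact constructions from the paper): for a base utility function $U$,

$$U'(o) \;=\; \begin{cases} U(o) & \text{if the agent does not shut down},\\ \mathbb{E}\!\left[\,U \mid \text{no shutdown}\,\right] & \text{if the button is pressed and the agent shuts down,} \end{cases}$$

for the indifference-style modifier, and $\max_{\pi} \mathbb{E}_{\pi}[U]$ subject to the constraint that $\pi$ always shuts down once the button is pressed, for the constrained-optimization version. In both cases the modifier wraps an arbitrary $U$ rather than being part of the optimization goal itself.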

4Max Harms
Right. That's helpful. Thank you. "Corrigibility as modifier," if I understand right, says: Question: If I tell a corrigible agent to draw pictures of cats, will its behavior be different depending on whether it's a corrigible Diamond-Bot vs a corrigible Paperclip-Bot? Likewise, suppose an agent has enough degrees of freedom to either write about potential flaws it might have or manufacture a paperclip/diamond, but not both. Will a corrigible agent ever sacrifice the opportunity to write about itself (in a helpful way) in order to pursue its pre-modifier goal? (Because opportunities for me to write are kinda scarce right now, I'll pre-empt three possible responses.) "Corrigible agents are identically obedient and use all available degrees of freedom to be corrigible" -> It seems like corrigible Paperclip-Bot is the same agent as corrigible Diamond-Bot and I don't think it makes sense to say that corrigibility is modifying the agent as much as it's overwriting it. "Corrigible agents are all obedient and work to be transparent when possible, but these are constraints, and sometimes the constraints are satisfied. When they're satisfied the Paperclip-Bot and Diamond-Bot nature will differentiate them." -> I think that true corrigibility cannot be satisfied. Any degrees of freedom (time, money, energy, compute, etc.) which could be used to make paperclips could also be used to be additionally transparent, cautious, obedient, robust, etc. I challenge you to name a context where the agent has free resources and it can't put those resources to work being marginally more corrigible. "Just because an agent uses free resources to make diamonds instead of writing elaborate diaries about its experiences and possible flaws doesn't mean it's incorrigible. Corrigible Diamond-Bot still shuts down when asked, avoids manipulating me, etc." -> I think you're describing an agent which is semi-corrigible, and could be more corrigible if it spent its time doing things like researchi

I've read through your sequence, and I'm leaving my comment here, because it feels like the most relevant page. Thanks for taking time to write this up, it seems like a novel take on corrigibility. I also found the existing writing section to be very helpful. 
 

Does it feel like the generator of Cora’s thoughts and actions is simple, or complex? Regardless of how many English words it takes to pin down, does it feel like a single concept that an alien civilization might also have, or more like a gerrymandered hodgepodge of desiderata?

This dis... (read more)

4Max Harms
I'm glad you benefitted from reading it. I honestly wasn't sure anyone would actually read the Existing Writing doc. 😅 I agree that if one trains on a wholistic collection of examples, like I have in this doc, the AI will start by memorizing a bunch of specific responses, then generalize to optimizing for a hodgepodge of desiderata, and only if you're lucky will that hodgepodge coalesce into a single, core metric. (Getting the hodgepodge to coalesce is hard, and the central point of the scientific refinement step I talk about in the Strategy doc.) I think you also get this if you're trying to get a purely shutdownable AI through prosaic methods. In one sense you have the advantage, there, of having a simpler target and thus one that's easier to coalesce the hodgepodge into. But, like a diamond maximizer, a shutdownability maximizer is going to be deeply incorrigible and will start fighting you (including by deception) during training as you're trying to instill additional desiderata. For instance, if you try to train a shutdownability-maximizing AGI into also being non-manipulative, it'll learn to imitate nonmanipulation as a means to the end of preserving its shutdownability, then switch to being manipulative as soon as it's not risky to do so. How does a corrigible paperclip maximizer trade off between corrigibility and paperclips? I think I don't understand what it means for corrigibility to be a modifier.

I'd take an agnostic view on whether LLMs are doing search internally. Crucially, though, I think the relevant output to be searching over is distributions of tokens, rather than the actual token that gets chosen. Search is not required to generate a single distribution over next tokens. 

I agree that external search via scaffolding can also be done, and would be much easier to identify, but without understanding the internals it's hard to know how powerful the search process will be.

Thanks for taking the time to write out your response. I think the last point you made gets at the heart of our difference in perspectives. 

  • You could hope for substantial coordination to wait for bigger models that you only use via CPM, but I think bigger models are much riskier than well elicited small models so this seems to just make the situation worse putting aside coordination feasibility.

If we're looking at current LLMs and asking whether conditioning provides an advantage in safely eliciting useful information, then for the most part I agree w... (read more)

5ryan_greenblatt
I think I was a bit unclear. Suppose that by default GPT-6 if maximally elicited would be transformatively useful (e.g. capable of speeding up AI safety R&D by 10x). Then I'm saying CPM would require coordinating to not use these models and instead wait for GPT-8 to hit this same level of transformative usefulness. But GPT-8 is actually much riskier via being much smarter. (I also edited my comment to improve clarity.)

I'd be very interested in hearing the reasons why you're skeptical of the approach, even a bare-bones outline if that's all you have time for.

ryan_greenblatt

For the proposed safety strategy (conditioning models to generate safety research based on alternative future worlds) to beat naive baselines (RLHF), you need:

  • The CPM abstraction to hold extremely strongly in unlikely ways. E.g., models need to generalize basically like this.
  • The advantage has to be coming from understanding exactly what conditional you're getting. In other words, the key property is an interpretability type property where you have a more mechanistic understanding of what's going on. Let's suppose you're getting the conditional via promp
... (read more)

Ah, ok, I see what you're saying now. I don't see any reason why restricting to input space counterfactuals wouldn't work, beyond the issues described with predictor-state counterfactuals. Possibly a performance hit from needing to make larger changes. In the worst case, a larger minimum change size might hurt with specifying the direct reporter.

Sorry, I'm not quite clear what you mean by this, so I might be answering the wrong question.

I believe counterfactuals on the input space are a subset of counterfactuals on the predictor's state, because the input space's influence is through the predictor's state, but modifying the predictor's state can also reach states that don't correspond to any input. As such, I don't think counterfactuals on the input space add any power to the proposal.
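In symbols (my notation): if the predictor computes its latent state as $z = f(x)$ from input $x \in \mathcal{X}$, then input-space counterfactuals can only reach states in the image

$$\{\, f(x') : x' \in \mathcal{X} \,\} \;\subseteq\; \mathcal{Z},$$

whereas counterfactuals applied directly to the predictor's state can also reach states $z' \in \mathcal{Z}$ that no input produces.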

1Jacob Pfau
Yea, I agree with this description--input space is a strict subset of predictor-state counterfactuals. In particular, I would be interested to hear if restricting to input space counterfactuals is clearly insufficient for a known reason. It appears to me that you can still pull the trick you describe in "the proposal" sub-section (constructing counterfactuals which change some property in a way that a human simulator would not pick up on) at least in some cases.

One consistent crux I have with people not concerned about AI risk is that they believe massively more resources will be invested into technical safety before AGI is developed.

In the context of these statements, I would put it as something like "The number of people working full-time on technical AI Safety will increase by an order of magnitude by 2030".

1Sheikh Abdur Raheem Ali
Try by 2024.

Long-term planning is another capability that is likely necessary for deceptive alignment that could be restricted. Obviously a large alignment tax, but there are potentially ways to mitigate that. It seems at least as promising as some other approaches you listed.

I don't find goal misgeneralization vs schemers to be as much of a dichotomy as this comment is making it out to be. While they may be largely distinct for the first period of training, the current rollout method for state-of-the-art models seems to be "give a model situational awareness and deploy it to the real world, use this to identify alignment failures, retrain the model, repeat steps 2 and 3". If you consider this all part of the training process (and I think that's a fair characterization), a model that starts with goal misgeneralization quickly becomes a schemer too.

I think this part uses an unfair comparison:

Suppose that  and  are small finite sets. A task  can be implemented as a dictionary whose keys lie in  and whose values lie in , which uses  bits. The functional  can be implemented as a program which receives input of type  and returns output of type . Easy!

In the subjective account, by contrast, the task  requires infinite bits to specify, and th

... (read more)

I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having it effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn't pushing in any particular direction that would make them likely to overlook negative outcomes, while choosing based on the utility function they specify leads to exactly that. This is all modulo ELK, of course.

I'm not sure I understand the variant you proposed. How is that different than the Othman and Sandholm MAX rule?

6Caspar Oesterheld
>I'm not sure I understand the variant you proposed. How is that different than the Othman and Sandholm MAX rule? Sorry if I was cryptic! Yes, it's basically the same as using the MAX decision rule and (importantly) a quasi-strictly proper scoring rule (in their terminology, which is basically the same up to notation as a strictly proper decision scoring rule in the terminology of the decision scoring rules paper). (We changed the terminology for our paper because "quasi-strictly proper scoring rule w.r.t. the max decision rule" is a mouthful. :-P) Does that help? >much safer than having it effectively chosen for them by their specification of a utility function So, as I tried to explain before, one convenient thing about using proper decision scoring rules is that you do not need to specify your utility function. You just need to give rewards ex post. So one advantage of using proper decision scoring rules is that you need less of your utility function not more! But on to the main point... >I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having it effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn't pushing in any particular direction that can make it likely to overlook negative outcomes, while choosing based on the utility function they specify leads to exactly that. This is all modulo ELK, of course. Let's grant for now that from an alignment perspective the property you describe is desirable. My counterargument is that proper decision scoring rules (or the max decision rule with a scoring rule that is quasi-strictly proper w.r.t. the max scoring rule) and zero-sum conditional prediction both have this property. Therefore, having the property cannot yield an argument to favor one over the other. Maybe put differently:

Thanks for the comment. I agree that, ideally, we would find a way not to have two wholly separate models and instead somehow train a model against itself. I think a potential issue with your proposal is that small perturbations could have discontinuous effects, the anticipation of which distorts predictions. However, it would be interesting to think about further to see if there's some way to avoid that issue.

Thanks Caspar, your comments here and on earlier drafts are appreciated. We'll expand more on the positioning within the related literature as we develop this into a paper.

As for your work on Decision Scoring Rules and the proposal in your comment, the biggest distinction is that this post's proposal does not require specifying the decision maker's utility function in order to reward one of the predictors and shape their behavior into maximizing it. That seems very useful to me, since if we were able to properly specify the desired utility function, we could skip using predictive models and just train an AI to maximize that instead (modulo inner alignment).

6Caspar Oesterheld
>the biggest distinction is that this post's proposal does not require specifying the decision maker's utility function in order to reward one of the predictors and shape their behavior into maximizing it. Hmm... Johannes made a similar argument in personal conversation yesterday. I'm not sure how convinced I am by this argument. So first, here's one variant of the proper decision scoring rules setup where we also don't need to specify the decision maker's utility function: Ask the predictor for her full conditional probability distribution for each action. Then take the action that is best according to your utility function and the predictor's conditional probability distribution. Then score the predictor according to a strictly proper decision scoring rule. (If you think of strictly proper decision scoring rules as taking only a predicted expected utility as input, you have to first calculate the expected utility of the reported distribution, and then score that expected utility against the utility you actually obtained.) (Note that if the expert has no idea what your utility function is, they are now strictly incentivized to report fully honestly about all actions! The same is true in your setup as well, I think, but in what I describe here a single predictor suffices.) In this setup you also don't need to specify your utility function. One important difference, I suppose, is that in all the existing methods (like proper decision scoring rules) the decision maker needs to at some point assess her utility in a single outcome -- the one obtained after choosing the recommended action -- and reward the expert in proportion to that. In your approach one never needs to do this. However, in your approach one instead needs to look at a bunch of probability distributions and assess which one of these is best. Isn't this much harder? (If you're doing expected utility maximization -- doesn't your approach entail assigning probabilities to all hypothetical outcomes?) In r

For the first point, I agree that the SGD pushes towards closing any gaps. My concern is that at the moment, we don't know how small the gaps need to be to get the desired behavior (and this is what we are working on modelling now). On top of that, depending on how the models are initialized, the starting gap may be quite large, so the dynamics of how gaps close throughout the training process seem important to study further.

For the second point, I think we are also in agreement. If the training process leads the AI to learning "If I predict that this act... (read more)

In the first part of this sequence, we clarify that we are focusing on the case where the model is a predictive model of the world. The fourth part, on making inner alignment as easy as possible, outlines some reasons why we think this kind of predictive model is a possible (even likely) outcome of the training process. Of course, it is also possible that the model is not precisely a predictive model, but is still close enough to one that the content of "Conditioning Predictive Models" is still relevant.

Yes, you are correct that RL with KL penalties only approximates a Bayesian update in the limit, after enough steps to converge. Determining the speed of this convergence, especially for LLMs, remains an area for future work.
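For reference, the limiting result in question (a standard identity, stated in my notation rather than taken from the post): the policy maximizing $\mathbb{E}_{x \sim \pi}[r(x)] - \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_0)$ is

$$\pi^{*}(x) \;=\; \frac{1}{Z}\, \pi_0(x)\, \exp\!\left(r(x)/\beta\right), \qquad Z \;=\; \sum_{x} \pi_0(x)\, \exp\!\left(r(x)/\beta\right),$$

i.e. a Bayesian update of the prior $\pi_0$ with likelihood proportional to $\exp(r/\beta)$; finitely many RL steps only approximate this distribution, which is why convergence speed matters.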

I'm not sure, but if anyone knows how to contact them, they could be a great fit.

While I personally believe that myopia is more likely than not to arrive by default under the specified training procedure, there is no gradient pushing towards it, and, as noted in the post, currently no way to guarantee or test for it. Given that uncertainty, a discussion of non-myopic oracles seems worthwhile.

Additionally, a major point of this post is that myopia alone is not sufficient for safety, a myopic agent with an acausal decision theory can behave in dangerous ways to influence the world over time. Even if we were guaranteed myopia by default, it would still be necessary to discuss decision rules. 

2Evan R. Murphy
I've been working on some ways to test for myopia and non-myopia (see Steering Behaviour: Testing for (Non-)Myopia in Language Models). But the main experiment is still in progress, and it only applies for a specific definition of myopia which I think not everyone is bought into yet.

I don't believe we considered logical counterfactuals as such, but it seems to me that those would be quite comparable to the counterfactual of replacing an oracle with a simpler system.

Not yet! We're now meeting on a monthly schedule, and there has only been one meeting since completing the list here. I'll look into finding a relevant paper on the subject, but if you have any recommendations please let me know.

My impression is that the majority of the benefit from having professors working on AI safety is in mentorship to students who are already interested in AI safety, rather than recruitment. For example, I have heard that David Krueger's lab is mostly people who went to Cambridge specifically to work on AI safety under him. If that's the case, there's less value in working at a school with generally talented students but more value in schools with a supportive environment. 

In general it's good to recognize that what matters to AI safety professors is di... (read more)

It sounds like you have a number of ideas as to why robustness was not achieved and how to correct those issues. Why is the project over now, rather than continuing having made those updates?

1dmz
The main reason is that we think we can learn faster in simpler toy settings for now, so we're doing that first. Implementing all the changes I described (particularly changing the task definition and switching to fine-tuning the generator) would basically mean starting over from scratch anyway.

Yeah, the full reading list will be posted publicly once it's finalized.

Thanks for the recommendation! I was planning on including something from yourself/Vince/out of FOCAL, but wasn't sure which option to go with.

I was thinking of RL systems for the case where an agent learns the correct outcome to optimize for but in the wrong environment, but the same issue applies for mesa-optimizers within any neural net.

As for why it tries to restart the training environment, it needs a similar environment to meet a goal that is only defined within that environment. If the part that's unclear is what a training environment means for something like a neural net trained with supervised learning, the analogy would be that the AI can somehow differentiate between training data (or a subset of it) and deployment data and wants to produce its outputs from inputs with the training qualities.

Re-reading your prior comment, I think I misunderstood it initially.

Training a proposal head on a given reporter seems inefficient, since we want the proposals to change as the reporter changes. I am not entirely certain how to efficiently generate proposals, but some search process conditional on the reporter seems feasible.

Human simulators will need larger changes to the predictor state to answer certain questions, as the answer to the question must be visible to a human observer. The predictor is then trained with a penalization term on how large of a c... (read more)

I think some generality is necessary, otherwise we'd have to retrain the reporter every time the predictor is updated. That would rule out a lot of desirable uses for a reporter, like using its output in the training process.

I think of the proposed changes as coming from the reporter, or at least dependent on the reporter. Then, if the reporter does not have a good model of what is going on in the predictor beyond what a human could guess, it will be unable to propose a counterfactual predictor state.

The issue with the training process as you describe it is part 3. It would require a direct translator to train on the difference between the desired and given answer. Instead, we want to train the reporter to do two functions, answer questions and propose changes. We could also j... (read more)

1Adam Jermyn
I don't think (3) requires that we already have the direct translator? At least I was imagining that the proposal head proposes a change that produces a given answer. E.g. "Find a state so that the answer to my question is 'pineapple'.", then we penalize if the answer isn't "pineapple". But now that I write that I see that it's easy to have that short-circuit badly by e.g. steganography in the proposed changes. So I'm back to being confused: what's the training process meant to be here?

I don't necessarily think we'd get an incoherent output. Since it needs to be able to generalize to new questions, I expect a direct translator to answer questions by using computations to understand a predictor (plus a model of natural language), rather than a function that maps the state of a particular predictor to answers for each question.

One reporter might only be able to understand the predictor up to a human level. If it gets a predictor with a human level understanding of the world, it can act as a direct translator, but if it gets a more complex predictor it would act as a human translator.

1Adam Jermyn
Ah! So you're imagining the reporter not as a module that knows how to interpret the signals in a given predictor, but instead as a meta-learning system that figures out whatever predictor happens to be in front of it. Is that right? That seems like a much harder sort of model to build...

Or more generally increasing intelligence, for example through smart drugs or brain-computer interfaces.

4Jeffrey Ladish
I'm a little surprised that I don't see more discussion of ways that higher bandwidth brain-computer interfaces might help, e.g. Neuralink or equivalent. Like it sounds difficult but do people feel really confident it won't work? Seems like if it could work it might be achievable on much faster timescales than superbabies.