Being a Robust Agent

Epistemic status: not adding anything new, but figured there should be a clearer reference post for this concept.

There's a concept which many LessWrong essays have pointed at, but I don't think there's a single post really spelling it out. I've built up an understanding of it through conversations with Zvi and Critch, and reading particular posts by Eliezer such as Meta-Honesty. (Note: none of them necessarily endorse this post; it's just my own understanding.)

The idea is: you might want to become a more robust agent.

By default, humans are a kludgy bundle of ad-hoc impulses. But we have the ability to reflect upon our decision making, and the implications thereof, and derive better overall policies.

I don't think this is quite the same thing as instrumental rationality (although it's tightly entwined). If your goals are simple and well-understood, and you're interfacing in a social domain with clear rules, the most instrumentally rational thing might be to not overthink it and follow common wisdom.

But it's particularly important if you want to coordinate with other agents, over the long term. Especially on ambitious, complicated projects in novel domains.

Some examples of this:

  • Be the sort of person that Omega (even a version of Omega who's only 90% accurate) can clearly tell is going to one-box. (See the worked example after this list.) Or, more realistically – be the sort of person who your social network can clearly see is worth trusting with sensitive information, or with power.
  • Be the sort of agent who cooperates when it is appropriate, defects when it is appropriate, and can recognize that cooperating-in-this-particular-instance might superficially look like defecting, without falling into a trap.
  • Think about the ramifications of people who think like you adopting the same strategy. Not as a cheap rhetorical trick to get you to cooperate on every conceivable thing. Actually think about how many people are similar to you. Actually think about the tradeoffs of worrying about a given thing. (Is recycling worth it? Is cleaning up after yourself at a group house? Is helping a person worth it? The answer actually depends; don't pretend otherwise.)
  • If there isn't enough incentive for others to cooperate with you, you may need to build a new coordination mechanism so that there is enough incentive. Complaining or getting angry about it might supply some incentive, but it often doesn't work, and/or it incentivizes something other than what you meant. (Be conscious of the opportunity costs of building this coordination mechanism instead of other ones. Mindshare is only so big.)
  • Be the sort of agent who, if some AI engineers were whiteboarding out the agent's decision making, they would see that the agent makes robustly good choices, such that those engineers would choose to implement that agent as software and run it.
  • Be cognizant of orders of magnitude. Prioritize (both for things you want for yourself, and for large-scale projects shooting for high impact).
  • Do all of this realistically given your bounded cognition. Don't stress about implementing a game-theoretically perfect strategy, but do be cognizant of how much computing power you actually have (and periodically reflect on whether your cached strategies can be re-evaluated given new information or more time to think). If you're being simulated on a whiteboard right now, have at least a vague, credible notion of how you'd think better if given more resources.
  • Do all of this realistically given the bounded cognition of *others*. If you have a complex strategy that involves rewarding or punishing others in highly nuanced ways... and they can't figure out what your strategy is, you may just be adding random noise instead of a clear coordination protocol.
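
To make the first bullet concrete, here's a minimal sketch of the expected-value math, assuming the standard Newcomb payoffs of $1,000,000 in the opaque box and $1,000 in the transparent one (numbers chosen just for illustration, not specified in the post):

```python
# A sketch of Newcomb's problem with an imperfect predictor.
# Assumed payoffs (not from the post): $1,000,000 in the opaque box if the
# predictor expects you to one-box, and a guaranteed $1,000 in the clear box.
ACCURACY = 0.9
BIG, SMALL = 1_000_000, 1_000

# If you're visibly the kind of agent who one-boxes, the predictor fills the
# big box 90% of the time, and you leave the small box behind.
ev_one_box = ACCURACY * BIG

# If you're the kind of agent who two-boxes, the big box is only filled in the
# 10% of cases where the predictor misreads you, but you always grab the $1,000.
ev_two_box = (1 - ACCURACY) * BIG + SMALL

print(f"one-boxer expected payoff: ${ev_one_box:,.0f}")  # $900,000
print(f"two-boxer expected payoff: ${ev_two_box:,.0f}")  # $101,000
```

Even with a predictor that's wrong 10% of the time, being the sort of agent who reliably one-boxes comes out far ahead in expectation; the same logic applies to being the sort of person a merely-somewhat-perceptive social network can tell is trustworthy.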

Game Theory in the Rationalsphere

The EA and Rationality worlds include lots of people with ambitious, complex goals. They have a bunch of common interests and probably should be coordinating on a bunch of stuff. But:

  • They vary in how much they've thought about their goals.
  • They vary in what their goals are.
  • They vary in where their circles of concern are drawn.
  • They vary in how hard (and how skillfully) they're trying to be game theoretically sound agents, rather than just following local incentives.
  • They disagree on facts and strategies.

Being a robust agent means taking that into account, and executing strategies that work in a messy, mixed environment with confused allies, active adversaries, and sometimes people who are a little bit of both. (Although this includes creating credible incentives and punishments to deter adversaries from bothering, and encouraging allies to become less confused).

I'm still mulling over exactly how to translate any of this into actionable advice (for myself, let alone others). But all the other posts I wanted to write felt like they'd be easier if I could reference this concept in an off-the-cuff fashion without having to explain it in detail.

Comments

Author here. I still endorse the post and have continued to find it pretty central to how I think about myself and nearby ecosystems.

I just submitted some major edits to the post. Changes include:

1. Name change ("Robust, Coherent Agent")

After much hemming and hawing and arguing, I changed the name from "Being a Robust Agent" to "Being a Robust, Coherent Agent." I'm not sure if this was the right call.

It was hard to pin down exactly one "quality" that the post was aiming at. Coherence was the single word that pointed towards "what sort of agent to become." But I think "robustness" still points most clearly towards why you'd want to change. I added some clarifying remarks about that. In individual sentences I tend to refer to either "Robust Agents" or "Coherent agents," depending on what that sentence was talking about.

Other options include "Reflective Agent" or "Deliberate Agent." (I think once you deliberate on what sort of agent you want to be, you often become more coherent and robust, although not necessarily)

Edit" Undid the name change, seemed like it was just a worse title.

2. Spelling out what exactly the strategy entails

Originally the post was vaguely gesturing at an idea. It seemed good to try to pin that idea down more clearly. This does mean that, by getting "more specific" it might also be more "wrong." I've run the new draft by a few people and I'm fairly happy with the new breakdown:

  • Deliberate Agency
  • Gears Level Understanding of Yourself
  • Coherence and Consistency
  • Game Theoretic Soundness

But, if people think that's carving the concept at the wrong joints, let me know.

3. "Why is this important?"

Zvi's review noted that the post didn't really argue why becoming a robust agent was so important. 

Originally, I viewed the post as simply illustrating an idea rather than arguing for it, and... maybe that was fine. I think it would have been fine to leave the "why" for a followup post.

But I reflected a bit on why it seemed important to me, and ultimately thought that it was worth spelling it out more explicitly here. I'm not sure my reasons are the same as Zvi's, or others. But, I think they are fairly defensible reasons. Interested if anyone has significantly different reasons, or thinks that the reasons I listed don't make sense.

I'm leaning towards reverting the title to just "being a robust agent", since the new title is fairly clunky, and someone gave me private feedback that it felt less like a clear-handle for a concept. [edit: have done so]

So the most important point of this post is to lay out the Robust Agent paradigm explicitly, with a clear term I could quickly refer to in future discussions, to check “is this something we’re on the same page about, or not?” before continuing on to discuss more complicated ideas.

Have you found that this post (and the concept handle) have been useful for this purpose? Have you found that you do in fact reference it as a litmus test, and steer conversations according to the response others make?

It's definitely been useful with people I've collaborated closely with. (I find the post a useful background while working with the LW team, for example)

I haven't had a strong sense of whether it's proven beneficial to other people. I have a vague sense that the sort of people who inspired this post mostly take this as background that isn't very interesting or something. Possibly with a slightly different frame on how everything hangs together.

It sounds like this post functions (and perhaps was intended) primarily as a filter for people who are already good at agency, and secondarily as a guide for newbies?

If so, that seems like a key point - surrounding oneself with other robust (allied) agents helps develop or support one's own agency.

I actually think it works better as a guide for newbies than as a filter. The people I want to filter on, I typically am able to have long protracted conversations about agency with them anyway, and this blog post isn't the primary way that they get filtered.

I feel like perhaps the name "Adaptive Agent" captures a large element of what you want: an agent capable of adapting to shifting circumstances.

I like the edits!

One thing I think might be worth doing is linking to the post on Realism about Rationality, and explicitly listing it as a potential crux for this post.

I'm pretty on board theoretically with the idea of being a robust agent, but I don't actually endorse it as a goal because I tend to be a rationality anti-realist.

I actually don't consider Realism about Rationality cruxy for this (I tried to lay out my own cruxes in this version). Part of what seemed important here is that I think Coherent Agency is only useful in some cases for some people, and I wanted to be clear about when that was.

I think each of the individual properties (gears level understanding, coherence, game-theoretic-soundness) are each just sort of obviously useful in some ways. There are particular failure modes to get trapped in if you've only made some incremental progress, but generally I think you can make incremental improvements in each domain and get improvements-in-life-outcome.

I do think that the sort of person who naturally gravitates towards this probably has something like 'rationality realism' going on, but I suspect it's not cruxy, and in particular I suspect shouldn't be cruxy for people who aren't naturally oriented that way.

Some people are aspiring directly to be a fully coherent, legible, sound agent. And that might be possible or desirable, and it might be possible to reach a variation of that that is cleanly mathematically describable. But I don't think that has to be true for the concept to be useful.

generally I think you can make incremental improvements in each domain and get improvements-in-life-outcome.

To me this implies some level on the continuum of realism about rationality. For instance, I often think that to make improvements on life outcomes I have to purposefully go off of Pareto improvements in these domains, and sometimes sacrifice them. Because I don't think my brain runs that code natively, and sometimes efficient native code is in direct opposition to naive rationality.

Relatedly:

I've been watching the discussion on Realism About Rationality with some interest and surprise. I had thought of 'something like realism about rationality' as more cruxy for alignment work, because the inspectability of the AI matters a lot more than the inspectability of your own mind – mostly because you're going to scale up the AI a lot more than your own mind is likely to scale up. The amount of disagreement that's come out more recently about that has been interesting.

Some of the people who seem most invested in the Coherent Agency thing are specifically trying to operate on cosmic scales (i.e. part of their goal is to capture value in other universes and simulations, and to be the sort of person you could safely upload).

Upon reflection though, I guess it's not surprising that people don't consider realism "cruxy" for alignment, and also not "cruxy" for personal agency (i.e. upon reflection, I think it's more like an aesthetic input, than a crux. It's not necessary for agency to be mathematically simple or formalized, for incremental legibility and coherence to be useful for avoiding wasted motion)

Bumping this up to two nominations not because I think it needs a review, but because I like it and it captures an important insight that I've not seen written up like this elsewhere.

In my own life, these insights have led me to do/considering doing things like:

  • not sharing private information even with my closest friends -- in order for them to know in future that I'm the kind of agent who can keep important information (notice that there is the counterincentive that, in the moment, sharing secrets makes you feel like you have a stronger bond with someone -- even though in the long-run it is evidence to them that you are less trustworthy)
  • building robustness between past and future selves (e.g. if I was excited about and had planned for having a rest day, but then started that day by working and being really excited by the work, choosing to stop work and rest, such that different parts of me learn that I can make and keep inter-temporal deals (even if work seems higher-EV in the moment))
  • being more angry with friends (on the margin) -- to demonstrate that I have values and principles and will defend those in a predictable way, making it easier to coordinate with and trust me in future (and making it easier for me to trust others, knowing I'm capable of acting robustly to defend my values)
  • thinking about, in various domains, "What would be my limit here? What could this person do such that I would stop trusting them? What could this organisation do such that I would think their work is net negative?" and then looking back at those principles to see how things turned out
  • not sharing passwords with close friends, even for one-off things -- not because I expect them to release or lose it, but simply because it would be a security flaw that makes them more vulnerable to anyone wanting to get to me. It's a very unlikely scenario, but I'm choosing to adopt a robust policy across cases, and it seems like useful practice

If there isn't enough incentive for others to cooperate with you, don't get upset for them if they defect (or "hit the neutral button.") BUT maybe try to create a coordination mechanism so that there is enough incentive.

It seems like "getting upset" is often a pretty effective way of creating exactly the kind of incentive that leads to cooperation. I am reminded of the recent discussion on investing in the commons, where introducing a way to punish defectors greatly increased total wealth. Generalizing that to more everyday scenarios, it seems that being angry at someone is often (though definitely not always, and probably not in the majority of cases) a way to align incentives better.

(Note: I am not arguing in favor of people getting more angry more often, just saying that not getting angry doesn't seem like a core aspect of the "robust agent" concept that Raemon is trying to point at here)
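
(To make the commons point concrete, here's a minimal sketch of the dynamic being referenced, with illustrative parameters rather than the actual experiment's: four players, an endowment of 20, contributions multiplied by 1.6 and shared equally, and an option for contributors to pay 1 to fine a free-rider 5.)

```python
# A toy public goods game illustrating how a punishment option can realign
# incentives. All parameters here are assumptions chosen for illustration.
ENDOWMENT = 20
MULTIPLIER = 1.6

def payoffs(contributions):
    """Each player keeps what they didn't contribute, plus an equal share of
    the multiplied common pool."""
    share = sum(contributions) * MULTIPLIER / len(contributions)
    return [ENDOWMENT - c + share for c in contributions]

print(payoffs([20, 20, 20, 20]))  # everyone contributes: 32 each
print(payoffs([20, 20, 20, 0]))   # one free-rider: contributors get 24, free-rider gets 44

# Without punishment, free-riding pays (44 > 32). Now let each of the three
# contributors pay 1 to impose a fine of 5 on the free-rider.
PUNISH_COST, FINE = 1, 5
base = payoffs([20, 20, 20, 0])
with_punishment = [p - PUNISH_COST for p in base[:3]] + [base[3] - 3 * FINE]
print(with_punishment)            # contributors get 23, free-rider gets 29

# Free-riding now earns 29, less than the 32 from contributing, so the
# punishment option makes cooperation the better policy.
```

In this toy version, "getting upset" plays the role of the costly fine: the punisher pays something in the moment, but it changes what policy pays off for everyone else.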

Ah. The thing I was trying to point at here was the "Be Nice, At Least Until You Can Coordinate Meanness" thing.

The world is full of people who get upset at you for not living up to the norms they prefer. There are, in fact, so many people who will get upset for so many contradictory norms that it just doesn't make much sense to try to live up to them all, and you shouldn't be that surprised that it doesn't work.

The motivating examples were something like "Bob gets upset at people for doing thing X. A little while later, people are still doing thing X. Bob gets upset again. Repeat a couple times. Eventually it (should, according to me) become clear that a) getting upset isn't having the desired effect, or at most is producing the effect of "superficially avoid behavior X when Bob is around". And meanwhile, getting upset is sort of emotionally exhausting and the cost doesn't seem worth it."

I do agree that "get upset" (or more accurately "perform upset-ness") works reasonably well as a localized strategy, and can scale up a bit if you can rally more people to get upset on your behalf. But the post was motivated by people who seemed to get upset... unreflectively?

(I updated the wording a bit but am not quite happy with it. I do think the underlying point was fairly core to the robust agent thing: you want policies for achieving your goals that actually work. "Getting upset in situation X" might be a good policy, but if you're enacting it as an adaptation-executor rather than as a considered policy, it may not actually be adaptive in your circumstance)

Eventually it (should, according to me) become clear that a) getting upset isn't having the desired effect, or at most is producing the effect of "superficially avoid behavior X when Bob is around".

Or "avoid Bob", "drop Bob as a friend", "leave Bob out of anything new", etc. What, if anything, becomes clear to Bob or to those he gets angry with is very underdetermined.

As you would expect from someone who was one of the inspirations for the post, I strongly approve of the insight/advice contained herein. I also agree with the previous review that there is not a known better write-up of this concept. I like that this gets the thing out there compactly.

Where I am disappointed is that this does not feel like it gets across the motivation behind this or why it is so important - I neither read this and think 'yes that explains why I care about this so much' nor 'I expect that this would move the needle much on people's robustness as agents going forward if they read this.'

So I guess the takeaway for me looking back is, good first attempt and I wouldn't mind including it in the final list, but someone needs to try again?

It is worth noting that Jacob made *exactly* the adjustments that I would hope would result from this post if it worked as intended, so perhaps it is better than I give it credit for? Would be curious if anyone else had similar things to report.


I'm writing my self-review for this post, and in the process attempting to more clearly define what I mean by "Robust Agent" (possibly finding a better term for it)

The concept here is pointing at four things:

  • Strategy of deliberate agency – not just being a kludge of behaviors, but having goals and decision-making that you reflectively endorse
  • Corresponding strategy of Gears-Level-Understanding of yourself (and others, and the world, but yourself-in-particular)
  • Goal of being able to operate in an environment where common wisdom isn't good enough, and/or you expect to run into edge cases.
  • Goal of being able to coordinate well with other agents.

"Robustness" mostly refers to the third and fourth points. It's possible the core strategy might actually make more sense to call "Deliberate Agency". The core thing is that you're deciding on purpose what sort of agent to be. If the environment wasn't going to change, you wouldn't care about being robust.

Or maybe, "Robust Agency" makes sense as a thing to call one overall cluster of strategies, but it's a subset of "Deliberate Agency."

Or maybe, "Robust Agency" makes sense as a thing to call one overall cluster of strategies, but it's a subset of "Deliberate Agency."

Where might "Robust Agency" not overlap with "Deliberate Agency"?

Robust Agency is a subset of Deliberate Agency (so it always overlaps in that direction). 

But you might decide, deliberately, to always 'just copy what your neighbors are doing and not think too hard about it', or to adopt other strategies that don't match the attributes I listed for coherent/robust agency. (noting again that those attributes are intended to be illustrative rather than precisely defined criteria)

I find the classification of the elements of robust agency to be helpful, thanks for the write up and the recent edit.

I have some issues with Coherence and Consistency:

First, I'm not sure what you mean by that, so I'll take my best guess, which in its idealized form is something like: Coherence is being free of self-contradictions, and Consistency is having the tool to commit oneself to future actions. This is going by the last paragraph of that section:

There are benefits to reliably being able to make trades with your future-self, and with other agents. This is easier if your preferences aren’t contradictory, and easier if your preferences are either consistent over time, or at least predictable over time.

Second, the only case for Coherence is that coherence helps you make trades with your future self. My reasons for it are more strongly related to avoiding compartmentalization and solving confusions, and making clever choices in real time given my limited rationality.

Similarly, I do not view trades with future self as the most important reason for Consistency. It seems that the main motivator here for me is some sort of trade between various parts of me. Or more accurately, hacking away at my motivation schemes and conscious focus, so that some parts of me will have more votes than others.

Third, there are other mechanisms for Consistency. Accountability is a major one. Also, reducing noise in the environment externally and building actual external constraints can be helpful.

Fourth, Coherence can be generalized to a skill that allows you to use your gears-level understanding of yourself and your agency to update your gears to what would be the most useful. This makes me wonder if the scope here is too large, and whether gears-level understanding and deliberate agency aren't related to the main points as much. These may all help one to be trustworthy, in that one's reasoning can be judged to be adequate - including for oneself - which is the main thing I'm taking away from here.

Fifth (sorta), I have reread the last section, and I think I understand now that your main motivation for Coherence and Consistency is that conversations between rationalists can be made much more effective, in that they can more easily understand each other's point of view. This I view as related to Game Theoretic Soundness, more than the internal benefits of Coherence and Consistency, which are probably more meaningful overall.


I definitely did not intend to make either an airtight or exhaustive case here. I think coherence and consistency are good for a number of reasons, and I included the ones I was most confident in, and felt like I could explain quickly and easily. (The section was more illustrative than comprehensive)

This response will not lay out the comprehensive case, but will try to give my current thoughts on some specific questions. (I feel a desire to stress that I still don't consider myself an expert or even an especially competent amateur on this topic.)

Second, the only case for Coherence is that coherence helps you make trades with your future self

That's actually not what I was going for – coherence can be relevant in the moment (if I had to pick, my first guess is that incoherence is more costly in the moment and inconsistency is more costly over time, although I'm not sure I was drawing a strong distinction between them)

If you have multiple goals that are at odds, this can be bad in the immediate moment, because instead of getting to focus on one thing, you have to divide up your attention (unnecessarily) between multiple things that are at odds. This can be stressful, it can involve cognitive dissonance which makes it harder to think, and it involves wasted effort.

This post has helped me understand quite a bit the mindset of a certain subset of rationalists, and being able to point to it and my disagreements with it has been quite helpful in finding cruxes with disagreements.

Seems like you are trying to elaborate on Eliezer's maxim Rationality is Systematized Winning. Some of what you mentioned implies shedding any kind of ideology, though sometimes wearing a credible mask of having one. Also being smarter than most people around you, both intellectually and emotionally. Of course, if you are already one of those people, then you don't need rationality, because, in all likelihood, you have already succeeded in what you…

Hmm.

I think the thing I'm gesturing at here is related but different to the systemized winning thing.

Some distinctions that I think make sense. (But would defer to people who seem further ahead in this path than I)

  • Systemized Winning – The practice of identifying and doing the thing that maximizes your goal (or, if you're not a maximizer, ensures a good distribution of satisfactory outcomes)
  • Law Thinking – (i.e. Law vs Tools) – Lawful thinking is having a theoretical understanding of what would be the optimal action for maximizing utility, given various constraints. This is a useful idea for a civilization to have. Whether it's directly helpful for you to maximize your utility depends on your goals, environment, and shape-of-your-mental-faculties.
    • I'd guess for most humans (of average intelligence), what you want is for someone else to do Law thinking: figure out the best thing, figure out the best approximation of the best thing, and then distill it down to something you can easily learn.
  • Being a Robust Agent - Particular strategies, for pursuing your goals, wherein you strive to have rigorous policy-making, consistent preferences (or consistent ways to resolve inconsistency), ways to reliably trust yourself and others, etc.
    • You might summarize this as "the strategy of embodying lawful thinking to achieve your goals." (not sure if that quite makes sense)
    • I expect this to be most useful for people who either
      • find rigorous policy-level, consistency-driven thinking easy, such that it's just the most natural way for them to approach their problems
      • have a preference for ensuring that their solutions to problems don't break down in edge cases (i.e. nerds often like having explicit understandings of things independent of how useful it is)
      • have goals that will likely cause them to run into edge cases, such that it's more valuable to have figured out in advance how to handle those.

When you look at the Meta-Honesty post... I don't think the average person will find it a particularly valuable tool for achieving their goals. But I expect there to be a class of person who actually needs it as a tool to figure out how to trust people in domains where it's often necessary to hide or obfuscate information.

Whether you want your decision-theory robust enough such that Omega simulating you will give you a million dollars depends a lot on whether you expect Omega to actually be simulating you and making that decision. I know at least some people who are actually arranging their life with that sort of concern in mind.

I do think there's an alternate frame where you just say "no, rationality is specifically about being a robust agent. There are other ways to be effective, but rationality is the particular way of being effective where you try to have cognitive patterns with good epistemology and robust decision theory."

This is in tension with the "rationalists should win", thing. Shrug.

I think it's important to have at least one concept that is "anyone with goals should ultimately be trying to solve them the best way possible", and at least one concept that is "you might consider specifically studying cognitive patterns and policies and a cluster of related things, as a strategy to pursue particular goals."

I don't think this is quite the same thing as instrumental rationality (although it's tightly entwined). If your goals are simple and well-understood, and you're interfacing in a social domain with clear rules, the most instrumentally rational thing might be to not overthink it and follow common wisdom.

But it's particularly important if you want to coordinate with other agents, over the long term. Especially on ambitious, complicated projects in novel domains.

On my initial read, I read this as saying "this is the right thing for some people, even when it isn't instrumentally rational" (?!). But

I think it's important to have at least one concept that is "anyone with goals should ultimately be trying to solve them the best way possible", and at least one concept that is "you might consider specifically studying cognitive patterns and policies and a cluster of related things, as a strategy to pursue particular goals."

makes me think this isn't what you meant. Maybe clarify the OP?

I was meaning to say "becoming a robust agent may be the instrumentally rational thing for some people in some situation. For other people in other situations, it may not be helpful."

I don't know that "instrumental rationality" is that well defined, and there might be some people who would claim that "instrumental rationality" and what I (here) am calling "being a robust agent" are the same thing. I disagree with that frame, but it's at least a cogent frame.

You might define "instrumental rationality" as "doing whatever thing is best for you according to your values", or you might use it to mean "using an understanding of, say, probability theory and game theory and cognitive science to improve your decision making". I think it makes more sense to define it the first way, but I think some people might disagree with that.

If you define it the second way, then for some people – at least, people who aren't that smart or good at probability/game-theory/cog-science – then "the instrumentally rational thing" might not be "the best thing."

I'm actually somewhat confused about which definition Eliezer intended. He has a few posts (and HPMOR commentary) arguing that "the rational thing" just means "the best thing". But he also notes that it makes sense to use the word "rationality" specifically when we're talking about understanding cognitive algorithms. 

Not sure whether that helped. (Holding off on updating the post till I've figured out what the confusion here is)

I define it the first way, and don't see the case for the second way. Analogously, for a while, Bayesian reasoning was our best guess of what the epistemic Way might look like. But then we find out about logical induction, and that seems to tell us a little more about what to do when you're embedded. So, we now see it would have been a mistake to define "epistemic rationality" as "adhering to the dictates of probability theory as best as possible".

I think that Eliezer's other usage of "instrumental rationality" points to fields of study for theoretical underpinning of effective action.

(not sure if this was clear, but I don't feel strongly about which definition to use, I just wanted to disambiguate between definitions people might have been using)

I think that Eliezer's other usage of "instrumental rationality" points to fields of study for theoretical underpinning of effective action.

This sounds right-ish (i.e. this sounds like something he might have meant). When I said "use probability and game theory and stuff" I didn't mean "be a slave to whatever tools we happen to use right now"; I meant them sort of as examples of "things you might use if you were trying to base your decisions and actions off of sound theoretical underpinnings."

So I guess the thing I'm still unclear on (people's common usage of words): do most LWers think it is reasonable to call something "instrumentally rational" if you just sorta went with your gut without ever doing any kind of reflection (assuming your gut turned out to be trustworthy)?

Or are things only instrumentally rational if you had theoretical underpinnings? (Your definition says "no", which seems fine. But it might leave you with an awkward distinction between "instrumentally rational decisions" and "decisions rooted in instrumental rationality.")

I'm still unsure if this is dissolving confusion, or if the original post still seems like it needs editing.

Your definition says "no", which seems fine. But it might leave you with an awkward distinction between "instrumentally rational decisions" and "decisions rooted in instrumental rationality."

My definition was the first, which is "instrumental rationality = acting so you win". So, wouldn't it say that following your gut was instrumentally rational? At least, if it's a great idea in expectation given what you knew - I wouldn't say lottery winners were instrumentally rational.

I guess the hangup is in pinning down "when things are actually good ideas in expectation", given that it's harder to know that without either lots of experience or clear theoretical underpinnings.

I think one of the things I was aiming for with Being a Robust Agent is "you set up the longterm goal of having your policies and actions have knowably good outcomes, which locally might be a setback for how capable you are, but allows you to reliably achieve longer term goals."