We often laugh at human-specific "bugs" in reasoning, comparing human thinking to the gold standard of some utility-maximizing perfect Bayesian reasoner.

We often fear that a very capable AI following strict rules of optimization would reach some repugnant conclusions, and we struggle to find "features" to add to guard against them.

What if some of the "bugs" we are looking at are actually the "features" we are looking for?

  • We seem to distinguish "sacred" and "non-sacred" values, refusing to mix the two in calculations (for example, human life vs. money). What if this "tainted bit", this "NaN-propagation", is a security feature guarding against Goodharting that leads to genocide or the dissolution of social trust? What if utility is not a single real number, but a pair? What if the ordering is not even lexicographic, but partial? What if it's a much longer tuple? Which brings me to the next point:
  • We often experience decision paralysis, apparently unable to compare two actions. What if this is simply because the order must be partial for security reasons? An alternative explanation of this phenomenon is that we implicitly treat "wait for more data to arrive and/or for the situation to change in a tie-breaking way" as an action available to us - is that bad?
  • We often decide which of two end-states, A or B, we prefer based on the path leading to them, amusingly favoring A in some scenarios and B in others. What if this is because we implicitly assume that the end-state contains our brain with the memory of the path leading there? Isn't it a cool feature to treat the agent as part of its environment? Or what if this is because we implicitly factor in considerations like "what if other members of society followed this kind of path, or this decision-making algorithm?" Isn't it a cool feature to think about second-order effects, about "acausal trades", and to treat one's own software as perhaps shared with other agents?
  • At least some long-term stable cultures have norms requiring children to follow adults' advice even when it conflicts with their own judgment and, more importantly, said children apparently go along with it instead of revolting and doing what is (or seems) good for them. Isn't that corrigibility a feature we want from AIs we plan to rear? Shouldn't there be safeguards against child-knowing-better-than-parent in any self-modifying system spawning new generations of itself?
  • The whole sunk cost fallacy/heuristic. Isn't it actually a good thing to associate a cost with each deviation from the original plan? Do we really want to zig-zag between shinier and shinier objects with no meta-level realization that there's something wrong with the whole algorithm if it can't keep its trajectory predictable to itself? Yes, sunk cost is more than that - it's not just a fixed additional cost of a decision - it's more like guilt for not caring about your past self having invested in something. But again, isn't that a good thing from a security perspective?
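To make the first two bullets a bit more concrete, here is a minimal sketch of what "utility as a pair with a partial order" could look like, next to the lexicographic alternative. Everything in it is my own illustrative assumption, not a proposal for a real agent design: the `Utility` class, its two fields, the function names, and the toy numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Utility:
    """Hypothetical two-component utility; units are made up for illustration."""
    sacred: float   # e.g. lives protected
    mundane: float  # e.g. money


def lexicographic_prefers(a: Utility, b: Utility) -> bool:
    """Total order: the sacred component decides, mundane only breaks ties."""
    return (a.sacred, a.mundane) > (b.sacred, b.mundane)


def pareto_compare(a: Utility, b: Utility) -> Optional[int]:
    """Partial order: 1 if a dominates b, -1 if b dominates a,
    0 if they are equal, None if they are incomparable."""
    if a == b:
        return 0
    if a.sacred >= b.sacred and a.mundane >= b.mundane:
        return 1
    if b.sacred >= a.sacred and b.mundane >= a.mundane:
        return -1
    return None  # a trade-off between components: refuse to rank


def choose(current: Utility, candidate: Utility) -> Utility:
    """Switch only if the candidate dominates; otherwise keep the default,
    which plays the role of "wait for more data"."""
    return candidate if pareto_compare(candidate, current) == 1 else current


status_quo = Utility(sacred=10, mundane=100)
trade = Utility(sacred=9, mundane=1_000_000)  # one life for a lot of money
better = Utility(sacred=10, mundane=150)      # strictly more money, same lives
```

Under these assumptions the lexicographic order always produces an answer (it prefers `status_quo` over `trade`), while the partial order returns `None` for that pair, so `choose(status_quo, trade)` stays put. That refusal to rank is the "decision paralysis" from the second bullet. `choose(status_quo, better)` switches, because nothing sacred is being traded away.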

I anticipate that each of these examples can be laughed at using some toy problem simple enough to calculate on a napkin. Sure, but we are talking about producing agents with partial information about a very fuzzy world around them, with lots of other agents embedded in it, some of them sharing goals or even parts of source code - we will rarely meet spherical cows on our way, and overfitting to these learning examples is the very problem we want to solve. Do we really plan to solve all of that with a single simple elegant formula (AIXI style), or was the plan always to throw some safety heuristics into the mix? If it's the latter, then perhaps we can take a hint from parents raising children, societies avoiding dissolution, and people avoiding mania? Thus, what I propose is to take a look at the list of fallacies and other "strange" phenomena from a different angle: could I use something like that as a security feature?


Strong upvote. Those are very good examples. I think that many of the so-called irrational actions are in fact pretty rational, and what fails is the framework we use to judge them. I wrote a post some time ago touching on a similar subject: https://www.lesswrong.com/posts/R2GAuP9CdGtsDcpy4/beware-of-small-world-puzzles

I've thought something like this for a while (though you put it far better than I could), and I'm glad to see there are others in this space who feel the same. I think the rationalist community is often too quick to scoff at perceived cognitive fallacies, and only much later come back with a more understanding eye—people aren't generally stupid, and there are often very good reasons for actions and behaviors which initially seem arbitrary.

We seem to distinguish "sacred" and "non-sacred"  values, refusing to mix the two in calculations (for example human life vs money).

People claim to make such distinctions and claim to refuse to mix the two in ‘calculations’, but no person I’ve ever heard of actually does so to an extent anywhere close to the widely affirmed claims. The revealed preferences show that the statistical average human life has a roughly consistent valuation (within one order of magnitude) between widely disparate spheres of activity. 

In fact several industries would cease to exist, or at least cease to make any plans concerning the future, if such a dichotomy actually existed, as their cost of calculation would skyrocket towards infinity.

On one hand, yes, many things that seem like "bugs" in the short term start seeming like "features" when you consider the second-order effects, etc.

On the other hand, some people update to the opposite extreme, and conclude that there are no "bugs"; that every behavior, no matter how seemingly random, short-sighted, or self-destructive, is a result of some hidden deeper wisdom that we merely do not understand yet.

I call this the just world fallacy; its proponents prefer to talk about revealed preferences. The idea is that everything that you do, or everything that merely happens to you, is by definition something you (perhaps unconsciously) wanted to happen. So in the end, everyone gets exactly what they wanted most, exactly in the way they wanted it most, there is no problem to solve, ever. Of course, it sounds stupid when I put it so bluntly, but it can be further defended by saying "okay, even if you say that X happened by accident, there was certainly a way to prevent X or at least reduce its probability, if only you spent enough resources towards that goal, but it was your choice - your revealed preference - to spend those resources on something else instead, so at least in that sense you preferred X to sometimes happen, statistically".

So I would recommend caution in both directions. Don't call things "bugs" in thinking just because you cannot think of any useful purpose for them in 5 seconds. On the other hand, do not conclude that humanity has a revealed preference to be converted to paperclips, and we are merely in denial about it (most likely for signalling reasons).

I think you're contrasting ideas that don't contradict each other.

(1) On the other hand, some people update to the opposite extreme, and conclude that there are no "bugs"; that every behavior, no matter how seemingly random, short-sighted, or self-destructive, is a result of some hidden deeper wisdom that we merely do not understand yet.

(2) The idea is that everything that you do, or everything that merely happens to you, is by definition something you (perhaps unconsciously) wanted to happen. So in the end, everyone gets exactly what they wanted most, exactly in the way they wanted it most, there is no problem to solve, ever.

I believe in 1, but not in 2 (not 100%). Something may contain "a deeper wisdom" and be unsuccessful in the real world. It's true even for such wisdom as "rationality": one day you may find out that we live in a simulation and everyone who practiced rationality was punished. But it won't make rationality "silly". I think "wisdom" is what we want to call it, what we decide to call it.

I think a rationalist may connect 1 and 2 because "the problem to solve" is very important for rationality (on a philosophical level). Anything that diminishes its importance looks the same. So, "we need to find places where our wisdom fits the world" or "if you failed you may've chosen a bad goal" sounds like "there're no problems ever, we don't need to solve any problems ever".

So in the end, everyone gets exactly what they wanted most, exactly in the way they wanted it most, there is no problem to solve, ever.

On the other hand, do not conclude that humanity has a revealed preference to be converted to paperclips, and we are merely in denial about it (most likely for signalling reasons).

It's a strange way to put it, but I think you can put it like this if you consider multiple possible worlds.

For example, you could say a person who behaves recklessly has a preference to suffer the risks of reckless behavior for the chance to live her life fully in a world where reckless behavior makes sense. I think such a formulation may reveal more about people ignoring AI risks (I don't support ignoring those risks) than "people can't understand risks, people can't calculate".

If we assume that behavior has a genetic component, and that mutations are random, that would imply that some behaviors we observe are at least partially random. Do you disagree with this?

If that randomness in the behavior happens to be harmful, would it be okay to call it a "bug"?

I'm not sure it would change much for me.

I may dislike my decisions or habits. But to understand if it's a "bug" or not I would need to have a near complete understanding of my own thinking. I don't think I have that. So I don't see any goal/gain of conceptualizing something as a "bug". If my behavior depends on genes or weather or anything else it's not relevant for me at the moment.

Yep, the only difference between a bug and a feature is what norm you've adopted that makes a claim to what is right. Stuart Armstrong has written a few posts about this topic over the years. This one seems most relevant to your current line of thinking, but he has several others.

This feels really valuable. Outside of the realm of paper napkins and trolleys, having fuzzy heuristics may be a reasonable way to respond to a world where actors tend to have fuzzy perceptions.