I wrote what I believe to be a simpler explanation of this post here. Things I tried to do differently: 

  1. More clearly explaining what Nash equilibrium means for infinitely repeated games -- it's a little subtle, and if you go into it just with intuition, it's not clear why the "everyone puts 99" situation can be a Nash equilibrium (one standard formalization is sketched right after this list) 
  2. Noting that just because something is a Nash equilibrium doesn't mean it's what the game is going to converge to 
  3. Less emphasis on minimax stuff (it's just boilerplate, not really the main point of folk theorems) 
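
For concreteness, here is one standard formalization (the discounted one; this is textbook material, and the post itself may set things up slightly differently). Each player $i$ picks a strategy $\sigma_i$ mapping histories to actions, the strategy profile $\sigma$ induces a play path $(a^0, a^1, a^2, \dots)$, and player $i$'s repeated-game payoff is

$$
U_i(\sigma) = (1-\delta)\sum_{t=0}^{\infty} \delta^t\, u_i(a^t), \qquad \delta \in (0,1),
$$

where $u_i$ is the stage-game payoff. The profile $\sigma$ is a Nash equilibrium if no player can increase $U_i$ by unilaterally switching to any other history-dependent strategy, holding everyone else's $\sigma_j$ fixed. The subtlety is that deviations can condition on history too, which is part of why the reasoning about grim trigger strategies below is needed.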

The strategy profile I describe is where each person has the following strategy (call it "Strategy A"): 

  • If empty history, play 99 
  • If history consists only of 99s from all other people, play 99 
  • If any other player's history contains a choice which is not 99, play 100 

The strategy profile you are describing is the one where each person has the following strategy (call it "Strategy B"): 

  • If empty history, play 99
  • If history consists only of 99s from all other people, play 99 
  • If any other player's history contains a choice which is not 99, play 30 

I agree Strategy B weakly dominates Strategy A. However, saying "everyone playing Strategy A forms a Nash equilibrium" just means that no player has a profitable deviation assuming everyone else continues to play Strategy A. Strategy B isn't a profitable deviation -- if you switch to Strategy B and everyone else is playing Strategy A, everyone will still just play 99 for all eternity, so your payoff is exactly the same as if you had kept playing Strategy A. 

The general name for these kinds of strategies is grim trigger.
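
To make the "no profitable deviation" point concrete, here is a minimal simulation sketch in Python (the function names and round count are mine, purely illustrative). It shows that if one player unilaterally switches from Strategy A to Strategy B while everyone else keeps playing Strategy A, the path of play is exactly the same:

```python
def strategy_a(others_history):
    """Grim trigger: play 99 until any other player deviates from 99, then play 100 forever."""
    if all(choice == 99 for round_choices in others_history for choice in round_choices):
        return 99
    return 100

def strategy_b(others_history):
    """Same trigger condition, but punish with 30 instead of 100."""
    if all(choice == 99 for round_choices in others_history for choice in round_choices):
        return 99
    return 30

def play(strategies, rounds=5):
    """Simulate the repeated game; each strategy only sees the other players' past choices."""
    history = []  # history[t] is the tuple of everyone's choices in round t
    for _ in range(rounds):
        actions = tuple(
            strategy([tuple(c for j, c in enumerate(profile) if j != i) for profile in history])
            for i, strategy in enumerate(strategies)
        )
        history.append(actions)
    return history

print(play([strategy_a, strategy_a, strategy_a]))  # all 99s, forever
print(play([strategy_b, strategy_a, strategy_a]))  # still all 99s: the switch never changes play
```

Since the induced play path is identical, the deviator's payoff is identical, which is exactly what "Strategy B is not a profitable deviation" means here.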

I'm not sure what the author intended, but my best guess is they wanted to say "punishment is bad because there exist really bad equilibria which use punishment, by folk theorems". Some evidence from the post (emphasis mine): 

Rowan: "If we succeed in making aligned AGI, we should punish those who committed cosmic crimes that decreased the chance of an positive singularity sufficiently."

Neal: "Punishment seems like a bad idea. It's pessimizing another agent's utility function. You could get a pretty bad equilibrium if you're saying agents should be intentionally harming each others' interests, even in restricted cases."

[...]

Rowan: "Well, I'll ponder this. You may have convinced me of the futility of punishment, and the desirability of mercy, with your... hell simulation. That's... wholesome in its own way, even if it's horrifying, and ethically questionable."

Folk theorems guarantee the existence of equilibria for both good (31) and bad (99) payoffs for players, both via punishment. For this reason I view them as neutral: they say lots of equilibria exist, but not which ones are going to happen. 
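
For reference, the statement I have in mind is the textbook Nash folk theorem for discounted repeated games (this is the standard formulation, not a quote from either post). Writing $\underline{v}_i$ for player $i$'s minmax payoff and $V$ for the set of feasible payoff vectors:

$$
v \in V \text{ and } v_i > \underline{v}_i \ \text{for all } i \;\;\Longrightarrow\;\; \exists\, \bar{\delta} < 1 \text{ such that for every } \delta \in (\bar{\delta}, 1),\ v \text{ is a Nash equilibrium payoff of the } \delta\text{-discounted repeated game.}
$$

It is purely an existence statement: it licenses a huge set of payoff vectors at once (including both the 31 and the 99 outcomes here, per the above) and says nothing about which equilibrium players actually end up coordinating on.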

I guess if you are super concerned about bad equilibria, you could take a stance against punishment, because that would make it harder (or impossible) for the everyone-plays-99 equilibrium to form. This could have been the original point of the post, but I am not sure. 

I think the obvious answer here is AutoPay -- this should hedge against the situations you are describing. 

The costs of making a mistake are certainly high, since it's a permanent hit to your credit report. I am not super knowledgeable about how late payments affect credit scores (other than that the effect has a negative sign); this is an interesting question.

Hmmm...the orthogonality thesis is pretty simple to state, so I don't necessarily think that it has been grossly misunderstood. The bad reasoning in Fallacy 4 seems to come from a more general phenomenon with classic AI Safety arguments, where they do hold up, but only with some caveats and/or more precise phrasing. So I guess "bad coverage" could apply to the extent that popular sources don't go into enough depth. 

I do think the author presented good summaries of Bostrom's and Russell's viewpoints. But then they immediately jump to a "special sauce" type argument. (Quoting the full thing just in case.)

The thought experiments proposed by Bostrom and Russell seem to assume that an AI system could be “superintelligent” without any basic humanlike common sense, yet while seamlessly preserving the speed, precision and programmability of a computer. But these speculations about superhuman AI are plagued by flawed intuitions about the nature of intelligence. Nothing in our knowledge of psychology or neuroscience supports the possibility that “pure rationality” is separable from the emotions and cultural biases that shape our cognition and our objectives. Instead, what we’ve learned from research in embodied cognition is that human intelligence seems to be a strongly integrated system with closely interconnected attributes, including emotions, desires, a strong sense of selfhood and autonomy, and a commonsense understanding of the world. It’s not at all clear that these attributes can be separated.

I really don't understand where the author is coming from with this. I will admit that the classic paperclip maximizer example is pretty far-fetched, and maybe not the best way to explain the orthogonality thesis to a skeptic. I prefer more down-to-earth examples like, say, a chess bot with plenty of compute to look ahead, but whose goal is to protect its pawns at all costs instead of its king. It will pursue that goal intelligently, but the goal is silly to us if what we want is a good chess player. 
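
Here is a minimal sketch of that example in Python, assuming the third-party `python-chess` package for move generation (the function names, piece values, and toy depth-2 search are all mine, purely illustrative). The point is that the lookahead machinery is identical in both cases; only the evaluation function -- the goal -- changes:

```python
import chess  # third-party python-chess package

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def material_eval(board, color):
    """Ordinary objective: material balance from `color`'s point of view."""
    return sum(PIECE_VALUES[p.piece_type] * (1 if p.color == color else -1)
               for p in board.piece_map().values())

def pawn_guard_eval(board, color):
    """Silly objective: count only how many of our own pawns are still on the board."""
    return sum(1 for p in board.piece_map().values()
               if p.color == color and p.piece_type == chess.PAWN)

def best_move(board, evaluate, depth=2):
    """The same brute-force lookahead works unchanged for either objective."""
    me = board.turn

    def search(d):
        if d == 0 or board.is_game_over():
            return evaluate(board, me)
        scores = []
        for move in list(board.legal_moves):
            board.push(move)
            scores.append(search(d - 1))
            board.pop()
        return max(scores) if board.turn == me else min(scores)

    best_score, best = -float("inf"), None
    for move in list(board.legal_moves):
        board.push(move)
        score = search(depth - 1)
        board.pop()
        if score > best_score:
            best_score, best = score, move
    return best

board = chess.Board()
print(best_move(board, material_eval))    # searches to win material
print(best_move(board, pawn_guard_eval))  # same search, but the goal is pawn survival
```

Scaling up the search (the capability) makes either bot better at its own objective; it does nothing to change which objective it has.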

I feel like the author's counterargument would make more sense if they framed it as an outer alignment objection like "it's exceedingly difficult to make an AI whose goal is to maximize paperclips unboundedly, with no other human values baked in, because the training data is made by humans". And maybe this is what their intuition was, and they just picked on the orthogonality thesis since it's connected to the paperclip maximizer example and easy to state. Hard to tell. 

It would be nice if AI Safety were less disorganized, and had a textbook or something. Then, a researcher would have a hard time learning about the orthogonality thesis without also hearing a refutation of this common objection. But a textbook seems a long way away...

I mean...sure...but again, this does not affect the validity of my counterargument. Like I said, I'm making the counterargument as strong as possible by saying that even if the non-brain parts of the body were to add 2-100x computing power, this would not restrict our ability to scale up NNs to get human-level cognition. Obviously this still holds if we replace "2-100x" with "1x". 

The advantage of "2-100x" is that it is extraordinarily charitable to the "embodied cognition" theory: if (and I consider this to be extremely low probability) embodied cognition does turn out to be true in some strong sense, then "2-100x" takes care of this in a way that "~1x" does not. And I may as well be extraordinarily charitable to the embodied cognition theory, since "Bitter lesson"-type reasoning is independent of its veracity. 
