I think the application to The Hero With A Thousand Chances is partly incorrect because of a technicality. Consider the following hypothesis: there is a huge number of "parallel worlds" (not Everett branches, just thinking of different planets very far away is enough), each fighting the Dust. For every fight, each of those worlds summons a randomly selected hero. Today that hero happened to be you. The world that happened to summon you has survived its encounters with the Dust 1079 times before you. The world next to it has already survived 2181 times, and another one was destroyed during its 129th attempt.

This hypothesis explains the hero's observations pretty well: you can't get summoned to a world that's been destroyed or that has successfully eliminated the Dust, so of course you get summoned to a world that is still fighting. As for the 1079 attempts before you, you can't count that as strong evidence that fighting the Dust is easy: maybe you're just the 1080th entry in their database and can only be summoned for the 1080th attempt, so there was no way for you to observe anything else. Under this hypothesis, you personally still have a pretty high chance of dying: there's no angel helping you, that specific world really did get very lucky, as did lots of other worlds.

So, the anthropic angel / "we're in a fanfic" hypothesis explains the observations just as well as this "many worlds, and you're the 1080th entry in their database" hypothesis, so both get updated by the same amount, and at least for me the "many worlds" hypothesis has a much higher prior than the "I'm a fanfic character" hypothesis.
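To spell out the update (writing $E$ for everything the hero observes at summoning, $H_{\text{angel}}$ for the anthropic-angel hypothesis, and $H_{\text{many}}$ for the many-worlds-database hypothesis):

$$\frac{P(H_{\text{angel}} \mid E)}{P(H_{\text{many}} \mid E)} = \frac{P(E \mid H_{\text{angel}})}{P(E \mid H_{\text{many}})} \cdot \frac{P(H_{\text{angel}})}{P(H_{\text{many}})} \approx 1 \cdot \frac{P(H_{\text{angel}})}{P(H_{\text{many}})}$$

With the likelihood ratio at roughly 1, the posterior odds just reproduce the prior odds, which is the whole point.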

Note that this only holds from the hero's perspective: from Aerhien's POV, she really has observed herself survive 1079 times, which does count as a lot of evidence for the anthropic angel.

I would like to object to the variance explanation: in the Everett interpretation there has not been a single collapse since the Big Bang. That means that every quantum-random event since the start of the universe is already accounted for in the variance. Over such timescales the variance easily covers basically anything allowed by the laws of physics: universes where humans exist, universes where they don't, universes where humans exist but the Earth is shifted 1 meter to the right, universes where the Unix timestamp is defined to start in 1960 and not 1970 because some cosmic ray hit the brain of some engineer at exactly the right time, and certainly universes like ours where you pressed the "start training" button 0.153 seconds later. The variance doesn't have to stem from how brains are affected by quantum fluctuations now; it can also stem from how brains are affected by ordinary macroscopic external stimuli that resulted from quantum fluctuations billions of years ago.

I guess we are converging. I'm just pointing out flaws in this option, but I also can't give a better solution off the top of my head. At least this won't insta-kill us, assuming that real-world humans count as non-copyable agents (how does that generalize again? Are you sure RL agents can just learn our definition of an agent correctly, and that it won't include stuff like ants?) and that they can't get excessive virtual resources from our world without our cooperation (in that case a substantial number of agents go rogue, and some of them get punished, but some get through). I still think we can do better than this somehow, though.

(Also, if by "ban all AI work" you're referring to the open letter thing, that's not really what it's trying to do, but sure.)

the reason it would have to avoid general harm is not the negative reward but rather the general bias for cooperation that applies to both copyable and non-copyable agents

How does non-harm follow from cooperation? If we remove the "negative reward for killing" part, what stops them from randomly killing agents (with everyone else believing it's okay, so no punishment), as long as there are still enough other agents to cooperate with? Grudges? How exactly do those work for harm other than killing?

First of all, I do agree with your premises and the stated values, except for the assumption that a "technically superior alien race" would be safe to create. If such an AI had its own values beyond helping humans/other AIs/whatever, then I'm not sure how I feel about it balancing its own internal rewards (like getting an enormous amount of virtual food that, in its training environment, could save billions of non-copyable agents) against real-world goals (like saving one actual human). We certainly want a powerful AI to be our ally rather than something we contain against its will, but I don't think we should take the "autonomous being" analogy far enough to make it chase its own unrelated goals.

Now about the technique itself. This is much better than the previous post. It is still very much an unrealistic game (which I presume is obvious), and you can't just take an agent from the game and put it into a real-life robot or something; there's no "food" to be "gathered" there, and so on. It is still an interesting experiment, though, as an attempt to replicate human-like values in an RL agent, and I will treat it as such. I believe the most important part of your rules is this:

when a non-copyable agent dies, all agents get negative rewards

The result of this depends on the exact amount. If the negative reward is huge, then obviously agents will just protect the non-copyable ones and will never try to kill them. If some rogue agent tries, it will be stopped (possibly killed) before it can. This is the basic value I would like AIs to have; the problem is that we can't specify well enough what counts as "harm" in the real world. Even if an AI won't literally kill us, it can still do lots of horrible things. And if we could specify that well enough, the rest of the rules would be redundant: if something harms humans, then that thing should be stopped, with no need for an additional rule. If cooperating with some entity leads to less harm for humans, then that's good, again with no need for an additional rule. Just "minimize harm to humans" suffices.

If the negative reward is significant, but an individual AI can still come out positive overall by killing a non-copyable agent (or stealing their food or whatever), then we have a Prisoner's Dilemma situation. Presumably, if the agents are aligned, they will also try to stop this rogue AI, or at least punish it so that it won't do it again. That will work well as long as the agent community is effective at detecting and punishing these rogue AIs (a judicial system?). If the agent community is inefficient, then it is possible for an individual agent to gain reward by doing a bad action, so it will do so whenever it thinks it can evade the others' punishment. "Behave altruistically unless you can find an opportunity to gain utility" is not that much more difficult than just "Behave altruistically always"; if we expect AIs to be somewhat smart, I think we should expect them to know that deception is an option.
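To make the incentive concrete, here is a minimal sketch of that calculation; all the numbers are made up for illustration and nothing here comes from the post's actual setup:

```python
# Toy expected-value check for a would-be rogue agent.
# gain, punishment and the payoff rule are invented purely for illustration.
gain = 5.0        # reward from killing a non-copyable agent / stealing its food
punishment = 8.0  # penalty the community imposes if the rogue gets caught

def rogue_expected_reward(p_caught: float) -> float:
    """Expected reward from defecting, assuming the rogue keeps the gain only if uncaught."""
    return gain * (1 - p_caught) - punishment * p_caught

# Defection stops paying once p_caught exceeds gain / (gain + punishment).
break_even = gain / (gain + punishment)
print(f"break-even detection probability: {break_even:.2f}")
for p in (0.1, 0.3, 0.6):
    print(f"p_caught={p:.1f} -> expected reward {rogue_expected_reward(p):+.2f}")
```

So whether "behave altruistically unless you can get away with it" actually pays comes down entirely to how reliable the detection-and-punishment machinery is.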

For the agent community to be efficient at detecting rogues, it has to be under constant optimization pressure to do so, i.e. under a constant threat of such rogues (or that ability gets optimized away). Then what we get is an equilibrium with some stable number of rogues, some of which get caught and punished while others get through and collect positive reward, alongside the regular AI community that does the punishing. The equilibrium arises because improving deception skills and deception-detection skills both require resources, and past some point that becomes an inefficient use of resources for both sides. So I believe this system would stabilize and carry on like that indefinitely. What we can learn from that, I don't know, but it does reflect the situation we have in human civilization (some stable number of criminals, and some stable number of "good people" who try their best to prevent crime, but for whom trying even harder is too expensive).
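Here is an equally toy sketch of that equilibrium, under dynamics I made up for illustration (a replicator-style drift for the rogue fraction plus detection effort that lags behind it); it's only meant to show that the two pressures can settle at an interior point rather than at "no rogues" or "all rogues":

```python
# Toy dynamics: the rogue fraction grows when defection pays and shrinks when it
# doesn't, while the community's detection effort slowly tracks how many rogues it
# actually sees. All parameters and update rules are invented for illustration.
gain, punishment = 5.0, 8.0
rogue_frac, effort = 0.5, 0.0  # initial rogue fraction and detection effort

for _ in range(5000):
    p_caught = effort  # crude assumption: detection probability equals detection effort
    rogue_advantage = gain * (1 - p_caught) - punishment * p_caught
    # replicator-style update keeps rogue_frac strictly inside (0, 1)
    rogue_frac += 0.02 * rogue_frac * (1 - rogue_frac) * rogue_advantage
    # detection effort drifts toward the observed rogue fraction
    effort += 0.02 * (rogue_frac - effort)

print(f"stable rogue fraction ~ {rogue_frac:.2f}, detection effort ~ {effort:.2f}")
# Both settle near gain / (gain + punishment): enough rogues to keep the detectors
# busy, enough detection to make defection only marginally worthwhile.
```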

Ah, true. I just think this wouldn't be enough, and that there could be distributional shift if the agents are put into an environment with low cooperation rewards and high resource competition. I'll reply in more detail under your new post; it looks a lot better.

So, no "reading" minds, just looking at behaviours? Sorry, I misundertood. Are you suggesting the "look at humans, try to understand what they want and do that" strategy? If so, then how do we make sure that the utility function they learned in training is actually close enough to actual human values? What if the agents learn something on the level "smiling humans = good", which isn't wrong by default, but is wrong if taken to the extreme by a more powerful intelligence in the real world?

Ah, I see. But how would they actually escape the deception arms race? The agents still need some system for detecting cooperation, and if it can be easily abused, it generally will be (Goodhart's Law and all that). I just can't see any outcome other than agents evolving ever more complicated ways to detect whether someone is cooperating or not. This is certainly an interesting thing to simulate, but I'm not sure how it helps with aligning the agents. Aren't we supposed to make them not even want to deceive others, rather than have them look for a deception strategy and fail? (Also, I think even an average human isn't as well aligned as we want our AIs to be. You wouldn't want to give a random guy from the street the nuclear codes, would you?)

it's the agent's job to convince other agents based on its behavior

So agents are rewarded for doing things that convince others that they're a "happy AI", not necessarily for actually being a "happy AI"? Doesn't that start an arms race of agents coming up with more and more sophisticated ways to deceive each other?

Like, suppose you start with a population of "happy AIs" that cooperate with each other; then if one of them discovers a new way to deceive the others, there's nothing to stop it until the other agents adapt to this new kind of deception and learn to detect it. That feels like training AIs that are inherently unsafe and deceptive, and extremely suspicious of others to boot, not something "happy" and "friendly".

Except the point of Yudkowsky's "friendly AI" is that such AIs don't have the freedom to pick their own goals; they have the goals we set for them, and they are (supposedly) safe in the sense that "wiping out humanity" is not something we want, therefore it's not something an aligned AI would want. We don't replicate evolution with AIs; we replicate the careful design and engineering that humans have used for literally everything else. If there are only a handful of powerful AIs with careful restrictions on what their goals can be (something we don't know how to do yet), then your scenario won't happen.

Since there are no humans in the training environment, how do you teach that? Or do you put human substitutes there (or maybe some RLHF-type thing)? Also, how would such AIs even reason about humans, since they can't read our thoughts? How are they supposed to know whether we would like to "vote them out" or not? I do agree, though, that a swarm of cooperative AIs with different goals could be "safer" (if done right) than a single goal-directed agent.

This setup seems to get more and more complicated, though. How are agents supposed to analyze each other's "minds"? I don't think modern neural nets can do that yet. And if we come up with a way to reliably analyze what an AI is thinking, why use this complicated scenario instead of just training it (RL or something) directly to "do good things while thinking good thoughts", since we're relying on our ability to distinguish "good" and "bad" thoughts anyway?

(On an unrelated note, there is already a rather complicated paper (explained a bit more simply here, though not by much) showing that if agents reasoning in formal modal logic are able to read each other's source code and prove things about it, then at least in the case of a simple binary Prisoner's Dilemma you can build reasonable-looking agents that also don't do stupid things. Reading source code and proving theorems about it is a lot more extreme than analyzing thought logs, but at least that's something.)
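For flavor, here is a drastically simplified sketch of the "read the opponent's source code" idea. The real construction relies on provability in modal logic; the toy version below only compares source text literally, and all the names in it are made up for this illustration:

```python
# A drastically simplified, purely illustrative stand-in for source-code-reading agents.
# The real agents prove theorems about each other; here "reading source code" is reduced
# to a literal text comparison, which only captures the flavor of the idea.
import inspect

def clique_bot(opponent_source: str) -> str:
    """Cooperate only with agents whose source code is identical to mine."""
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"

def defect_bot(opponent_source: str) -> str:
    """Always defect, no matter who the opponent is."""
    return "D"

def play(agent_a, agent_b):
    """One round of the Prisoner's Dilemma where each agent sees the other's source."""
    return agent_a(inspect.getsource(agent_b)), agent_b(inspect.getsource(agent_a))

print(play(clique_bot, clique_bot))  # ('C', 'C'): mutual cooperation via source inspection
print(play(clique_bot, defect_bot))  # ('D', 'D'): the defector gets no one to exploit
```

As far as I understand, the point of the actual paper is that proving things about the opponent's code lets you do much better than this brittle "only cooperate with exact copies" rule, but even the toy version shows how access to source code changes the game.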
