I would like to object to the variance explanation: in the Everett interpretation there has not been even one collapse since the Big Bang. That means that every single quantum-random event since the start of the universe is already accounted for in the variance. Over such timescales the variance easily covers basically anything allowed by the laws of physics: universes where humans exist, universes where they don't, universes where humans exist but the Earth is shifted 1 meter to the right, universes where the Unix epoch is defined to start in 1960 and not 1970, be...
I guess we are converging. I'm just pointing out flaws in this option, but I also can't give a better solution off the top of my head. At least this won't insta-kill us, assuming that real-world humans count as non-copyable agents (how does that generalize again? Are you sure RL agents can just learn our definition of an agent correctly, and that it won't include stuff like ants?), and that they can't get excessive virtual resources from our world without our cooperation (in that case a substantial number of agents go rogue, and some of them get punished, b...
First of all, I do agree with your premises and the stated values, except for the assumption that a "technically superior alien race" would be safe to create. If such an AI had its own values other than helping humans/other AIs/whatever, then I'm not sure how I feel about it balancing its own internal rewards (like getting an enormous amount of virtual food that in its training environment could save billions of non-copyable agents) against real-world goals (like saving 1 actual human). We certainly want a powerful AI to be our ally rather than trying...
Ah, true. I just think this wouldn't be enough and that there could be distributional shift if the agents are put into an environment with low cooperation rewards and high resource competition. I'll reply in more detail under your new post; it looks a lot better.
So, no "reading" minds, just looking at behaviours? Sorry, I misunderstood. Are you suggesting the "look at humans, try to understand what they want and do that" strategy? If so, then how do we make sure that the utility function they learned in training is actually close enough to actual human values? What if the agents learn something on the level of "smiling humans = good", which isn't wrong by default, but is wrong if taken to the extreme by a more powerful intelligence in the real world?
Ah, I see. But how would they actually escape the deception arms race? The agents still need some system for detecting cooperation, and if it can be easily abused, it generally will be (Goodhart's Law and all that). I just can't see any outcome other than agents evolving increasingly complicated ways to detect whether someone is cooperating or not. This is certainly an interesting thing to simulate, but I'm not sure how that is useful for aligning the agents. Aren't we supposed to make them not even want to deceive others, instead of trying to find a d...
it's the agent's job to convince other agents based on its behavior
So agents are rewarded for doing stuff that convinces others that they're a "happy AI", not necessarily actually being a "happy AI"? Doesn't that start an arms race of agents coming up with more and more sophisticated ways to deceive each other?
Like, suppose you start with a population of "happy AIs" that cooperate with each other, then if one of them realizes there's a new way to deceive the others, there's nothing to stop them until other agents adapt to this new kind of deception and lea...
Except the point of Yudkowsky's "friendly AI" is that they don't have the freedom to pick their own goals; they have the goals we set for them, and they are (supposedly) safe in the sense that "wiping out humanity" is not something we want, therefore it's not something an aligned AI would want. We don't replicate evolution with AIs, we replicate the careful design and engineering that humans have used for literally everything else. If there is only a handful of powerful AIs with careful restrictions on what their goals can be (something we don't know how to do yet), then your scenario won't happen.
Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)? Also, how would such AIs even reason about humans, since they can't read our thoughts? How are they supposed to know if we would like to "vote them out" or not? I do agree though that a swarm of cooperative AIs with different goals could be "safer" (if done right) than a single goal-directed agent.
This setup seems to get more and more complicated though. How are agents supposed to analyze "minds" of...
The problem is what we count as an agent. Also, can't a realistic human-level-smart AI cheat this? Just build a swarm of small and stupid AIs that always cooperate with you (or coerce someone into building that), and then you and your swarm can "vote out" anyone you don't like. And you also get to behave in whatever way you want, because good luck overcoming your mighty voting swarm.
(Also, are you sure we can just read out AI's complete knowledge and thinking process? That can partially be done with interpretability, but in full? And if not in full, how do you make sure there aren't any deceptive thoughts in parts you can't read?)
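The voting-swarm worry above can be made concrete with a toy model. This is only a sketch under assumptions that are not in the original proposal: removal requires a strict majority of all other agents, every agent gets one vote, and a controlled agent votes however its owner wishes. The names `Agent`, `is_voted_out`, and `grudges` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    owner: str  # who controls this agent's vote (assumption: equals name if independent)


def is_voted_out(target, population, grudges):
    """Strict-majority removal vote (an assumed rule, not the original scheme).

    Every agent except the target votes for removal if its owner holds a
    grudge against the target.
    """
    voters = [a for a in population if a.name != target]
    votes = sum(1 for a in voters if target in grudges.get(a.owner, set()))
    return votes * 2 > len(voters)


honest = [Agent(f"honest{i}", f"honest{i}") for i in range(10)]
attacker = Agent("attacker", "attacker")
# The attacker wants honest0 removed; nobody else does.
grudges = {"attacker": {"honest0"}}

# Without a swarm the attacker is a single vote out of ten.
before = is_voted_out("honest0", honest + [attacker], grudges)

# With 20 cheap bots that copy the attacker's vote, the majority flips.
swarm = [Agent(f"bot{i}", "attacker") for i in range(20)]
after = is_voted_out("honest0", honest + [attacker] + swarm, grudges)

# Even if every honest agent votes against the attacker, the swarm protects it.
all_against = {a.name: {"attacker"} for a in honest}
attacker_removed = is_voted_out("attacker", honest + [attacker] + swarm, all_against)

print(before, after, attacker_removed)
```

In this toy setup the result depends entirely on whether cheap controlled agents count as voters, which is the "what we count as an agent" question again.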
Okay, so if that's just a small component, then sure, first issue goes away (though I still have questions on how you're gonna make this simulation realistic enough to just hook it up to an LLM or "something smart" and expect it to set coherent and meaningful goals in real life, though that's more of a technical issue).
However, I believe there are still other issues with this approach. The way you describe it makes me think it's really similar to Axelrod's Iterated Prisoner's Dilemma tournament, and that did invent tit-for-tat strategy as one of the most su...
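For readers unfamiliar with the setup mentioned above, here is a minimal sketch of an iterated Prisoner's Dilemma match. It assumes the commonly used payoff values (5 for defecting against a cooperator, 3 each for mutual cooperation, 1 each for mutual defection, 0 for being exploited) and a fixed number of rounds; the function names are illustrative and not taken from any tournament code.

```python
# Payoffs as (player A, player B) for each pair of moves; "C" = cooperate, "D" = defect.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(my_history, their_history):
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not their_history else their_history[-1]


def always_defect(my_history, their_history):
    """Defect unconditionally."""
    return "D"


def play(strategy_a, strategy_b, rounds=10):
    """Play a fixed-length match and return the total scores (a, b)."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


print(play(tit_for_tat, tit_for_tat))    # mutual cooperation every round
print(play(tit_for_tat, always_defect))  # exploited once, then mutual defection
```

Tit-for-tat never beats its opponent within a single match; it scores well overall because pairs of cooperators accumulate far more than pairs of defectors.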
I believe this would suffer from distributional shift in two different ways.
First, if the agents are supposed to scale up to the point where they can update their beliefs even after training, then we have a problem once the AI notices it can do pretty well without cooperating with humans in this new environment. If we allow agents to update their beliefs at runtime, then basically any reinforcement-learning-like preconditioning would be pretty much useless, I think. And if the agent can't update its beliefs given new data then that can't be an AGI.
Se...
I think the application to the Hero With A Thousand Chances is partly incorrect because of a technicality. Consider the following hypothesis: there is a huge number of "parallel worlds" (not Everett branches, just thinking of different planets very far away is enough) each fighting the Dust. Every fight each of those worlds summons a randomly selected hero. Today that hero happened to be you. The world that happened to summon you has survived the encounter with the Dust 1079 times before you. The world next to it has already survived 2181, and the other on...