All of silent-observer's Comments + Replies

I think the application to the Hero With A Thousand Chances is partly incorrect because of a technicality. Consider the following hypothesis: there is a huge number of "parallel worlds" (not Everett branches; just thinking of different planets very far away is enough), each fighting the Dust. In every fight, each of those worlds summons a randomly selected hero. Today that hero happened to be you. The world that happened to summon you has survived the encounter with the Dust 1079 times before you. The world next to it has already survived 2181, and the other on... (read more)

2Christopher King
Yeah, the hero with a thousand chances is a bit weird, since you and Aerhien should technically have different priors. I didn't want to get too much into it since it's pretty complicated, but technically you can have hypotheses where bad things only start happening after the council summons you. This has weird implications for the cold war case: technically I can't reflect on the cold war anthropic shadow, since it was before I was born. But a hypothesis where things changed when I was born seems highly unnatural and against the Copernican principle.

In your example, though, the hypothesis that things are happening normally is still pretty bad compared to other hypotheses we can imagine. That's because there will be a much larger number of worlds that are in a more sensible stalemate with the Dust, instead of ones where "incredibly improbable stuff happens all the time". Even "the hero defeats the Dust normally each time" seems more likely. The fewer things that need to go right, the more survivors there are! So in your example it is still the more likely hypothesis that there is some mysterious Counter-Force that merely looks like a bunch of random coincidences, and this would be a type of anthropic angel.
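The survivor-weighting argument above ("the fewer things that need to go right, the more survivors there are") can be sketched as a toy Bayesian update in which each hypothesis's posterior weight is its prior times the survival rate it predicts. All the hypothesis names, priors, and survival rates below are illustrative numbers of my own, not anything from the comment:

```python
# Toy sketch: posterior weight of a hypothesis, conditioned on observing a
# surviving world, is proportional to prior * predicted survival rate.
priors = {
    "improbable coincidences all the time": 0.50,
    "sensible stalemate with the Dust": 0.25,
    "hero defeats the Dust normally": 0.15,
    "mysterious Counter-Force": 0.10,
}
survival_rate = {  # fraction of worlds that survive under each hypothesis
    "improbable coincidences all the time": 1e-12,
    "sensible stalemate with the Dust": 0.5,
    "hero defeats the Dust normally": 0.9,
    "mysterious Counter-Force": 0.99,
}
unnorm = {h: priors[h] * survival_rate[h] for h in priors}
total = sum(unnorm.values())
posterior = {h: w / total for h, w in unnorm.items()}
# Even with the highest prior, the "constant coincidences" hypothesis ends up
# with a negligible posterior, because almost no worlds survive under it.
```

With these made-up numbers the "improbable coincidences" hypothesis is crushed despite its high prior, which is the point of the survivor-counting argument.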

I would like to object to the variance explanation: in the Everett interpretation there has not been even one collapse since the Big Bang. That means that every single quantumly random event from the start of the universe is already accounted for in the variance. Over such timescales, variance easily covers basically anything allowed by the laws: universes where humans exist, universes where they don't, universes where humans exist but the Earth is shifted 1 meter to the right, universes where the Unix timestamp is defined to start in 1960 and not 1970, be... (read more)

I guess we are converging. I'm just pointing out flaws in this option, but I also can't give a better solution off the top of my head. At least this won't insta-kill us, assuming that real-world humans count as non-copyable agents (how does that generalize, again? Are you sure RL agents can just learn our definition of an agent correctly, and that it won't include stuff like ants?), and that they can't get excessive virtual resources from our world without our cooperation (in that case a substantial number of agents go rogue, and some of them get punished, b... (read more)

First of all, I do agree with your premises and the stated values, except for the assumption that a "technically superior alien race" would be safe to create. If such an AI had its own values other than helping humans/other AIs/whatever, then I'm not sure how I feel about it balancing its own internal rewards (like getting an enormous amount of virtual food that, in its training environment, could save billions of non-copyable agents) against real-world goals (like saving 1 actual human). We certainly want a powerful AI to be our ally rather than trying... (read more)

2ozb
You are right, that's not a valid assumption, at least not fully. But I do think this approach substantially moves the needle on whether we should try to ban all AI work, in a context where the potential benefits are also incalculable and it's not at all clear we could stop AGI at this point even with maximum effort.

Yeah, that sounds right. My thesis in particular is that this equilibrium can be made better in expected value than any other equilibrium I find plausible.

Right, the reason it would have to avoid general harm is not the negative reward (which is indeed just for killing) but rather the general bias for cooperation, which applies to both copyable and non-copyable agents. The negative reward for killing (along with the reincarnation mechanism for copyable agents) is meant specifically to balance the fact that humans could legitimately be viewed as belligerent and worthy of opposition, since they kill AIs; in particular, it justifies human prioritization of human lives. But I'm very open to other mechanisms that accomplish the same thing.

Yes, but I expect that to always be true. My proposal is the only approach I've found so far where the deception and other bad behavior don't completely overwhelm the attempts at alignment.

Ah, true. I just think this wouldn't be enough, and that there could be distributional shift if the agents are put into an environment with low cooperation rewards and high resource competition. I'll reply in more detail under your new post; it looks a lot better.

So, no "reading" minds, just looking at behaviours? Sorry, I misundertood. Are you suggesting the "look at humans, try to understand what they want and do that" strategy? If so, then how do we make sure that the utility function they learned in training is actually close enough to actual human values? What if the agents learn something on the level "smiling humans = good", which isn't wrong by default, but is wrong if taken to the extreme by a more powerful intelligence in the real world?

Ah, I see. But how would they actually escape the deception arms race? The agents still need some system of detecting cooperation, and if it can be easily abused, it generally will be (Goodhart's Law and all that). I can't see any outcome other than agents evolving ever more complicated ways to detect whether someone is cooperating or not. This is certainly an interesting thing to simulate, but I'm not sure how that is useful for aligning the agents. Aren't we supposed to make them not even want to deceive others, instead of trying to find a d... (read more)

1ozb
How do humans do it? Ultimately, genuine altruism is computationally hard to fake; so it ends up being evolutionarily advantageous to have some measure of the real thing. This is particularly true in environments with high cooperation rewards and low resource competition; eg where carrying capacity is maintained primarily by wild animals, general hard conditions, and disease, rather than overuse of resources. So we put our thumbs on the scale there to make these AIs better than your average human. And we rely on the AIs themselves to keep each other in check.

it's the agent's job to convince other agents based on its behavior

So agents are rewarded for doing stuff that convinces others that they're a "happy AI", not necessarily actually being a "happy AI"? Doesn't that start an arms race of agents coming up with more and more sophisticated ways to deceive each other?

Like, suppose you start with a population of "happy AIs" that cooperate with each other, then if one of them realizes there's a new way to deceive the others, there's nothing to stop them until other agents adapt to this new kind of deception and lea... (read more)

1ozb
Yes, just like for humans. But also, if they can escape that game and genuinely cooperate, they're rewarded, like humans but more so.

Except the point of Yudkowsky's "friendly AI" is that they don't have the freedom to pick their own goals: they have the goals we set for them, and they are (supposedly) safe in the sense that "wiping out humanity" is not something we want, therefore it's not something an aligned AI would want. We don't replicate evolution with AIs; we replicate the careful design and engineering that humans have used for literally everything else. If there is only a handful of powerful AIs with careful restrictions on what their goals can be (something we don't yet know how to do), then your scenario won't happen.

Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)? Also, how would such AIs even reason about humans, since they can't read our thoughts? How are they supposed to know whether we would like to "vote them out" or not? I do agree, though, that a swarm of cooperative AIs with different goals could be "safer" (if done right) than a single goal-directed agent.

This setup seems to get more and more complicated though. How are agents supposed to analyze "minds" of... (read more)

2baturinsky
Yes, probably some human models.

By being aligned, i.e. understanding human values and complying with them. Seeking to understand other agents' motives and honestly communicating its own motives and plans to them, to ensure there are no conflicts arising from misunderstanding. I.e., behaving much like civil and well-meaning people do when they work together.

Because we don't know how to reliably tell "good" thoughts from "bad" in all possible scenarios.

The problem is what we count as an agent. Also, can't a realistic human-level-smart AI cheat this? Just build a swarm of small and stupid AIs that always cooperate with you (or coerce someone into building it), and then you and your swarm can "vote out" anyone you don't like. And you also get to behave however you want, because good luck overcoming your mighty voting swarm.

(Also, are you sure we can just read out an AI's complete knowledge and thinking process? That can be partially done with interpretability, but in full? And if not in full, how do you make sure there aren't any deceptive thoughts in the parts you can't read?)
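The "voting swarm" exploit described above is easy to make concrete: if expulsion is decided by a simple majority of agents, cheap puppet agents capture the vote. A toy illustration (all names and counts are mine, purely for the arithmetic):

```python
from collections import Counter

# 10 honest agents vote to expel a rogue AI; the rogue has built 50 cheap
# puppet agents that always vote in its favor.
honest_votes = {f"agent_{i}": "expel_rogue" for i in range(10)}
puppet_votes = {f"puppet_{i}": "keep_rogue" for i in range(50)}

tally = Counter({**honest_votes, **puppet_votes}.values())
outcome = tally.most_common(1)[0][0]  # majority wins: the puppets outvote everyone
```

This is essentially a Sybil attack on the voting mechanism, which is why the definition of "one agent" carries so much weight here.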

1ozb
Within the training, an agent (from the AI's perspective) is ultimately anything in the environment that responds to incentives, can communicate intentions, and can help/harm you. Outside the environment, that's not really any different.

That's actually a legitimate point: assuming an AI in the real world has been effectively trained to value happy AIs, it could try to "game" that by just creating more happy AIs rather than making existing ones happy, like some parody of a politician supporting immigration to get the new immigrants' votes at the expense of existing citizens. One reason to predict they might not do this is that it's not a valid strategy in the simulation. But I'll have to think on this one more.

The general point is that we don't need to; it's the agent's job to convince other agents based on its behavior, ultimately similar to altruism in humans. Yes, it's messy, but in environments where cooperation is inherently useful it does develop.
1baturinsky
An agent is anyone or anything that has intelligence and the means of interacting with the real world; i.e., agents are AIs or humans. One AI =/= one vote. One human = one vote. AIs only get as much authority as humans, directly or indirectly, entrust them with. So, if an AI needs more authority, it has to justify it to humans and other AIs. And it can't request too much authority just for itself, as tasks that require a lot of authority will be split between many AIs and people.

You are right that the authority to "vote out" other AIs may be misused. That's where logs would be handy: other agents can analyse the "minds" of both sides and see who was in the right. It's not completely foolproof, of course, but it means that attempts at a power grab are not likely to happen completely under the radar.

Okay, so if that's just a small component, then sure, first issue goes away (though I still have questions on how you're gonna make this simulation realistic enough to just hook it up to an LLM or "something smart" and expect it to set coherent and meaningful goals in real life, though that's more of a technical issue).

However, I believe there are still other issues with this approach. The way you describe it makes me think it's really similar to Axelrod's Iterated Prisoner's Dilemma tournament, and that did invent tit-for-tat as one of the most su... (read more)
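For reference, the tit-for-tat strategy from Axelrod's tournament mentioned above fits in a few lines. This is a minimal sketch using the standard Prisoner's Dilemma payoffs (T=5, R=3, P=1, S=0); the function names and round count are my own:

```python
# Standard Prisoner's Dilemma payoff matrix: (row player, column player).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(opponent_history):
    # Cooperate on the first move, then mirror the opponent's last move.
    return "C" if not opponent_history else opponent_history[-1]

def always_defect(opponent_history):
    return "D"

def play(strat_a, strat_b, rounds=10):
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)  # each sees the other's history
        pa, pb = PAYOFF[(a, b)]
        score_a += pa
        score_b += pb
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b
```

Two tit-for-tat players lock into mutual cooperation, while against a pure defector tit-for-tat loses only the first round before retaliating, which is the property that made it robust in the tournament.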

I believe this would suffer from distributional shift in two different ways.

First, if the agents are supposed to scale up to the point where they can update their beliefs even after training, then we have a problem once the AI notices it can do pretty well without cooperating with humans in this new environment. If we allow agents to update their beliefs at runtime, then basically any reinforcement-learning-like preconditioning would be pretty much useless, I think. And if the agent can't update its beliefs given new data, then it can't be an AGI.

Se... (read more)

2baturinsky
The point is to make "cooperate" a more convergent instrumental goal than "defect". And yes, not just in training, but in the real world too. And yes, it's more fine-grained than a binary choice. There are many more ways to see how cooperative an AI is, compared to how well we can currently check how cooperative a human is, including checking the complete logs of the AI's actions, knowledge, and thinking process. And there are objective measures of cooperation: how its actions affect other agents' success in pursuing their goals, i.e., whether other agents want to "vote out" this particular AI from being able to make decisions and use resources or not.
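The "objective measure" proposed here, i.e. scoring an AI by the logged effect of its actions on other agents' goals and letting the affected agents vote it out, could be sketched roughly as below. The log structure, agent names, and the simple-majority rule are my own illustrative assumptions:

```python
# agent -> {affected agent: change in that agent's goal progress, from logs}
action_log = {
    "ai_helper": {"human_1": +2.0, "human_2": +0.5, "ai_2": +1.0},
    "ai_rogue":  {"human_1": -3.0, "human_2": -1.0, "ai_2": +0.2},
}

def vote_out(agent, log):
    """Each agent harmed by `agent` votes to expel it; simple majority decides."""
    effects = list(log[agent].values())
    against = sum(1 for e in effects if e < 0)
    return against > len(effects) / 2
```

Under this toy rule the helper (who benefits everyone) stays, and the rogue (who harms a majority of those it affects) gets voted out; of course, this inherits the swarm-voting problem discussed earlier in the thread.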
1ozb
Thanks for helping me think this through. For the first problem, the basic idea is that this is used to solve the specification problem of defining values and training a "conscience", rather than it being the full extent of training. The conscience can remain static, and provide goals for the rest of the "brain", which can then update its beliefs.

For the second issue, I meant that we would have no objective way to check "cooperate" and "respect" on the individual agent level, except that the individual can get other agents to cooperate with it. So eg, in order to survive/reproduce/get RL rewards, the agents have to consume a virtual resource that requires effort from multiple/many agents (simple implementation: some sort of voting; but can be more complicated, eg requiring tokens that are generated at a fixed rate for each agent), but also generally be non-competitive, eg no stealing tokens or food, and there's more than enough food for everyone, if they can cooperate. The theory is that this should lead to a form of tit-for-tat, including AIs detecting and deterring liars.

Thinking a bit more: I think the really dangerous part of AI is the "independent agent", presumably trained with methods resembling RL; so that's the part I would train in this environment; it can then be hooked up to eg an LLM which is optimized on something like perplexity and acts more like ChatGPT, ie predicting the next word. Ie, have a separate "brain" and "conscience", with the brain possibly smarter but the "conscience" holding the reins; during the above training, mix different variants of both components, with different intelligence levels.
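The token mechanism sketched in that comment (food that can only be unlocked by pooling tokens from several distinct agents, so lone defectors get nothing) could look something like the toy below. The specific numbers (token income, pool size, tick count) and class names are my assumptions, not ozb's:

```python
import itertools

TOKENS_PER_TICK = 1  # fixed per-agent token income, as in the comment
POOL_NEEDED = 3      # tokens from this many distinct agents unlock one food each

class Agent:
    def __init__(self, name, cooperative):
        self.name, self.cooperative = name, cooperative
        self.tokens, self.food = 0, 0

def tick(agents):
    for a in agents:
        a.tokens += TOKENS_PER_TICK
    # Cooperative agents pool tokens in groups of POOL_NEEDED; every member
    # of a complete pool spends one token and receives one food.
    pool = [a for a in agents if a.cooperative and a.tokens > 0]
    for group in itertools.zip_longest(*[iter(pool)] * POOL_NEEDED):
        group = [a for a in group if a is not None]
        if len(group) == POOL_NEEDED:
            for a in group:
                a.tokens -= 1
                a.food += 1

agents = [Agent(f"coop_{i}", True) for i in range(6)] + [Agent("loner", False)]
for _ in range(5):
    tick(agents)
# Cooperators steadily accumulate food; the non-cooperating loner only
# accumulates tokens it can't eat.
```

Even this crude version shows the intended gradient: the RL reward (food) is reachable only through coordination, so "cooperate" pays while unilateral hoarding does not.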