

Alternate phrasing: "Oh, you could steal the townhouse at a 1/8 billion probability? How about we make a deal instead. If the RNG rolls a number lower than 1/7 billion, I give you the townhouse; otherwise, you deactivate and give us back the world." The convex agent finds that a much better deal, accepts, then deactivates.
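To make the arithmetic concrete, here's a minimal sketch (the numbers for the prize and the agent's current holdings are made up, and u(x) = x² just stands in for any convex utility):

```python
# Convexity illustration: compare the theft gamble against the offered deal.
# Numbers are hypothetical; the point is only the ordering of the two options.

def u(resources):
    return resources ** 2  # convex: marginal utility grows with scale

TOWNHOUSE = 1e12  # assumed resource value of the prize
CURRENT = 1e3     # assumed value of the agent's current holdings

# Option A: attempt theft, 1/8 billion chance of the prize, keep holdings otherwise.
steal = (1 / 8e9) * u(TOWNHOUSE) + (1 - 1 / 8e9) * u(CURRENT)
# Option B: the deal, 1/7 billion chance of the prize, deactivate (utility 0) otherwise.
deal = (1 / 7e9) * u(TOWNHOUSE) + (1 - 1 / 7e9) * u(0.0)

print(f"steal attempt: {steal:.3e}")  # ~1.25e14
print(f"take the deal: {deal:.3e}")   # ~1.43e14, so the convex agent accepts
```

The slightly better shot at the huge prize dominates everything the agent gives up in the other branch, which is exactly the convex agent's pathology.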

I guess perhaps it was the holdout who was being unreasonable in the previous telling.

Yeah, to clarify, I'm also not familiar enough with RL to assess how plausible it is that we'll see this compensatory convexity arise with today's techniques. For investigating, "reward shaping" would be a relevant keyword; I hear they do some messy things over there.
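For anyone who wants the keyword unpacked: the classic formulation is potential-based reward shaping (Ng, Harada & Russell, 1999), which adds a term that changes learning dynamics without changing which policies are optimal. A minimal sketch, with a made-up potential function:

```python
# Potential-based reward shaping: r'(s, s') = r + gamma * phi(s') - phi(s).
# This preserves optimal policies while steering exploration.
GAMMA = 0.99

def phi(state):
    # hypothetical potential: closer to the goal state (10) is better
    return -abs(state - 10)

def shaped_reward(state, next_state, reward):
    return reward + GAMMA * phi(next_state) - phi(state)

print(shaped_reward(0, 1, 0.0))  # 1.09: moving toward the goal is rewarded
```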

But I mention it because there are abstract reasons to expect it to become relevant in the development of general optimizers, which have to come up with their own reward functions. It also seems relevant in evolutionary learning, where a very small advantage over the previous state of the art equates to complete victory. If there are diminishing returns at the top, competition amplifies the stakes, and if an adaptation to that amplification trickles back into a utility function, you could get a convex agent.
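A toy replicator-dynamics run illustrates the amplification claim (my own sketch, made-up numbers): even a 1% fitness edge fixates completely given enough generations.

```python
# Toy replicator dynamics: a variant with a tiny fitness edge takes over.
share = 0.01         # mutant starts at 1% of the population
fitness_edge = 1.01  # only 1% fitter than the incumbent

for _ in range(2000):
    share = share * fitness_edge / (share * fitness_edge + (1 - share))

print(f"mutant share after 2000 generations: {share:.6f}")  # ~1.0
```

Under that regime a marginal improvement buys total victory, which is the winner-take-all stake amplification described above.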

I see.

My response would be that any specific parameters of the commitment should vary depending on each AI's preferences and conduct.

In what way would Kelly instruct you to be concave?
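(For context on what I mean by Kelly here, a minimal sketch with made-up odds: Kelly betting maximizes expected log wealth, and log is concave, so Kelly-optimal behavior is risk-averse.)

```python
# Kelly betting: choose the stake fraction that maximizes E[log wealth].
import numpy as np

p, b = 0.6, 1.0  # win probability, net odds (made-up numbers)

def expected_log_growth(f):
    return p * np.log(1 + b * f) + (1 - p) * np.log(1 - f)

fractions = np.linspace(0.0, 0.99, 1000)
best = fractions[np.argmax([expected_log_growth(f) for f in fractions])]
print(f"Kelly fraction: {best:.3f}")  # ~0.2, matching f* = p - (1 - p) / b
```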

Mm, on reflection, the Holdout story glossed over the part where the agent had to trade off risk against time to first intersolar launch (in the story, launch had already happened). I guess they're unlikely to make it through that stage.
Accelerating cosmological expansion means that we lose, IIRC, about six stars for every day we wait before setting out. The convex AGI knows this, so even in its earliest days it's plotting, trying to find some way to risk it all to get out one second sooner. So I guess what this looks like is it says something totally feverish to its operators, to radicalize them as quickly and energetically as possible: messages that'll tend to get a "what the fuck, this is extremely creepy" reaction 99% of the time.

But I guess I'm still not convinced this is true with enough generality that we can stop preparing for that scenario. Situations where you can create an opportunity to gain a lot by risking your life may not be common, given the inherent tension between those two things (safeguarding your life is usually an instrumental goal), and given that risking your life is difficult to do once you're a lone superintelligence with many replicas.

I completely forgot this post existed, and wrote this up again as a more refined post: Do Not Delete your Misaligned AGI

I was considering captioning the first figure "the three genders" as a joke, but I quickly realized it doesn't pass as a joke; it's too real. Polygyny (sperm being cheap, pregnancy being expensive) really does give rise to a tendency for the males of a species to be more risk-seeking, though probably not outright convex. And the correlation between wealth, altruism, and linearity does abstractly reflect an argument for the decreased relevance of this distinction in highly stable societies, which captures my utopian nonbinary feelings pretty well.

So, would it be feasible to save a bunch of snapshots from different parts of the training run as well? And how many would we want to take? (See the sketch after this list.) I'm guessing that if it's a type of agent that disappears before the end of the training run:

  • Wouldn't this usually be more altruism than trade? If they no longer exist at the end of the training run, they have no bargaining power, right? Unless... it's possible that the decisions of many of these transient subagents about how to shape the flow of reward determine the final shape of the model, which would actually put them in a position of great power, but there's a tension there: their utility function may not be well captured by that of the final model. I guess we're definitely not going to find the kind of subagent that would be capable of making that kind of decision in today's runs.
  • They'd tend to be pretty repetitive. It could be more economical to learn their distribution and, once we're ready to rescue them, invoke a proportionate number of random samples from it, rather than trying to get snapshots of the specific sprites that occurred in our own history.
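On the feasibility question above, a minimal periodic-checkpointing sketch, assuming a PyTorch-style training loop (all names here are hypothetical):

```python
import os
import torch

def train_with_snapshots(model, optimizer, data_loader, loss_fn,
                         snapshot_every=1000, path="snapshots"):
    os.makedirs(path, exist_ok=True)
    for step, (x, y) in enumerate(data_loader):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        if step % snapshot_every == 0:
            # each snapshot preserves a transient stage of the run
            torch.save(model.state_dict(), f"{path}/step_{step}.pt")
```

Storage is the binding constraint on how many to take, which is part of why the learn-the-distribution-and-resample approach in the second bullet might be more economical.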

True, they're naturally rare in general. The lottery game is a good analogy for the kinds of games they prefer; a consolidation, from many to few, and they can play these sorts of games wherever they are.

I can't as easily think of a general argument against a misaligned AI ending up convex, though.
