Game Practitioner http://aboutmako.makopool.com
Yeah, to clarify, I'm also not familiar enough with RL to assess exactly how plausible it is that we'll see this compensatory convexity with today's techniques. "Reward shaping" would be a relevant keyword for investigating; I hear they do some messy things over there.
But I mention it because there are abstract reasons to expect it to become a relevant idea in the development of general optimizers, which have to come up with their own reward functions. It also seems relevant in evolutionary learning, where very small advantages over the previous state of the art equate to complete victory. So if there are diminishing returns at the top, competition amplifies the stakes, and if an adaptation to that amplification trickles back into a utility function, you could get a convex agent.
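The winner-take-all dynamic can be sketched with a toy replicator model (all numbers invented for illustration): a variant with even a tiny fitness edge over the incumbent eventually takes over the whole population.

```python
# Toy illustration of "small advantage -> complete victory" in
# evolutionary competition. The model and numbers are hypothetical.

def step(x, s):
    # Discrete replicator update: x is the variant's population share,
    # s its relative fitness advantage over the incumbent.
    return x * (1 + s) / (1 + s * x)

x, s = 0.01, 0.001          # 1% initial share, 0.1% fitness edge
for _ in range(20000):
    x = step(x, s)

# Despite the edge being tiny, the variant has essentially fixed.
assert x > 0.99
```

The odds ratio x/(1-x) grows by a factor of (1+s) each generation, so any positive edge compounds to fixation; the only question is how long it takes.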
I see.
My response would be that any specific parameters of the commitment should vary depending on each different AI's preferences and conduct.
Mm, on reflection, the Holdout story glossed over the part where the agent had to trade off risk against time to first intersolar launch (launch had already happened). I guess they're unlikely to make it through that stage.
Accelerating cosmological expansion means that we lose, iirc, six stars every day we wait before setting out. The convex AGI knows this, so even in its earliest days it's plotting, trying to find some way to risk it all to get out one second sooner. So I guess what this looks like is it saying something totally feverish to its operators to radicalize them as quickly and energetically as possible, messages that'll tend to produce a "what the fuck, this is extremely creepy" reaction 99% of the time.
But I guess I'm still not convinced this holds with such generality that we can stop preparing for that scenario. Situations where you can create an opportunity to gain a lot by risking your life might not be overwhelmingly common, given the inherent tension between those two things (safeguarding your life is usually an instrumental goal), and given that risking your life is difficult to do once you're a lone superintelligence with many replicas.
I completely forgot this post existed, and wrote this up again as a more refined post: Do Not Delete your Misaligned AGI
I was considering captioning the first figure "the three genders" as a joke, but I quickly realized it doesn't pass at all for a joke, it's too real. Polygyny (sperm being cheap, pregnancy being expensive) actually does give rise to a tendency for males of a species to be more risk-seeking, though probably not outright convex. And the correlation between wealth, altruism and linearity does kind of abstractly reflect an argument for the decreased relevance of this distinction in highly stable societies that captures my utopian nonbinary feelings pretty well.
So, would it be feasible to save a bunch of snapshots from different parts of the training run as well? And how many would we want to take? I'm guessing that if it's a type of agent that disappears before the end of the training run:
True, they're naturally rare in general. The lottery game is a good analogy for the kinds of games they prefer; a consolidation, from many to few, and they can play these sorts of games wherever they are.
I can't as easily think of a general argument against a misaligned AI ending up convex, though.
Alternate phrasing: "Oh, you could steal the townhouse at a 1/8billion probability? How about we make a deal instead. If the rng rolls a number lower than 1/7billion, I give you the townhouse; otherwise, you deactivate and give us back the world." The convex agent finds that to be a much better deal, accepts, then deactivates.
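A minimal numerical sketch of why the convex agent takes the deal, under two assumptions of mine: a made-up quadratic utility, and treating a failed theft as equivalent to deactivation (utility zero either way).

```python
# Hypothetical comparison of the theft gamble vs the offered deal.
# Utility function and prize value are invented for illustration.

def u(x):
    return x ** 2                  # convex: the agent lives for the jackpot

townhouse = 1.0                    # prize value, arbitrary units
p_theft = 1 / 8e9                  # chance the theft succeeds
p_deal = 1 / 7e9                   # chance the rng pays out under the deal

# On a miss, both paths leave the agent with nothing (assumed), so the
# higher win probability on the same prize strictly dominates.
eu_theft = p_theft * u(townhouse)
eu_deal = p_deal * u(townhouse)

assert eu_deal > eu_theft          # the agent accepts the deal
```

The same comparison also shows the general point about convexity: by Jensen's inequality, a convex agent values an all-or-nothing lottery above a sure thing with the same expected resources, which is why it will trade away everything it holds for slightly better odds on the prize.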
I guess perhaps it was the holdout who was being unreasonable, in the previous telling.