Imagine aliens on a distant world. Their values are very different from human values. However, their values are also complicated, and they don't know exactly what those values are.
Imagine these aliens are doing well at AI alignment. They are just about to boot up a friendly (to them) superintelligence.
Now imagine we get to see all their source code and research notes. How helpful would this be for humans solving alignment?
I wasn't really thinking about a specific algorithm. Well, I was kind of thinking about LLMs and the alien shoggoth meme.
But yes. I know this would be helpful.
But I'm more thinking about what work remains. Like, is it an idiot-proof 5-minute change? Or does it still take MIRI 10 years to adapt the alien code?
Domain-limited optimization is a natural thing. The prototypical example is Deep Blue or similar: lots of optimization power over a very limited domain. But any teacher who optimizes the class schedule without thinking about putting nanobots in the students' brains is doing something similar.
I am guessing and hoping that the masks in an LLM are at least as domain-limited in their optimization as humans, and often more so, since LLMs tend to learn the most usefully predictive patterns first. Hidden, long-term sneaky plans will only very rarely influence the text (because the plans are hidden), so they are not very useful for predicting it.
And, I hope, the shoggoth isn't itself particularly interested in optimizing the real world. The shoggoth just chooses which mask to wear.
Can we duct-tape a mask of "alignment researcher" onto a shoggoth, and keep the mask in place long enough to get some useful alignment research done?
The more alignment turns out to have a single simple solution of the "know it when you see it" kind, the more likely this is to work.