Suppose the long-term risk center's researchers, or a random group of teenage nerd hackers, or whoever, come up with what they call an "alignment solution": a really complicated and esoteric, yet somehow elegant, way of describing what we really, really value and cramming it into a big mess of virtual neurons. Suppose Eliezer and Tammy and Hanson and Wentworth and everyone else go and look at the "alignment solution" very carefully, for a very long time, and find no flaws in it. Lastly, suppose they test it on a weak AI, and the AI immediately stops producing strange outputs, deceiving supervisors, and gaming its specification, and starts acting super nice and reasonable.
Great, right? Awesome, right? We won eternal eutopia, right? Our hard work finally paid off, right?
Even if this were to happen, I would still be wary of plugging our shiny new solution into a superintelligence and hitting run. I believe that before we stumble upon an alignment solution, we will stumble upon an "alignment solution": something that looks like an alignment solution, but is flawed in some way so subtle and complicated that even the brightest humans cannot spot it, and Earth still gets disassembled into compute or whatever. I suspect that for every true alignment solution, there are dozens of fake ones.
Is this something that I should be seriously concerned about?
I already figured that. The point of this question was to ask whether there could exist things that are indistinguishable from true alignment solutions (even to smart people) but that aren't actually alignment solutions. Do you think such things could exist?
By the way, good luck with your plan. Seeing people go out and do genuinely meaningful work to save the world gives me hope for the future. Just try not to burn out. Smart people are more useful to humanity when their mental health is in good shape.