Upvoted since I like how literally you went through the plan. I think we need to think about and criticize both the literal version of the plan and the way it intersects with reality.
The methods you are trying are all known to fail at sufficiently high levels of intelligence. But if these are your only ideas, it is possible they get you far enough for GPT-5 to output a better idea.
To me this seems like a key point that many other critiques, focused as they are on specific details, are missing.
This is my attempt at Eliezer's challenge.
https://www.lesswrong.com/posts/tD9zEiHfkvakpnNam/a-challenge-for-agi-organizations-and-a-challenge-for-1
My overall impression of this plan is that it is surprisingly good. Not that I am particularly surprised such a plan exists; if MIRI had created this plan, it would have been surprisingly bad. I think it is somewhat plausible that this plan could actually work, or at least that some steel-manned version could work, if several unknown parameters turn out to have favorable values.
You stand in a vast minefield. One end is seeded with popping balloons, the other with some quantum-vacuum-decay weapon. In between lies a huge range: fireworks, conventional mines, nukes, antimatter and more.
Your plan is to wander the safer parts of the minefield, recording where mines go off, in the hope you spot a pattern that can lead you through the whole minefield.
This plan is not hopelessly doomed. But it is risky. You definitely need a way to detect that dangerous terrain is approaching, and to stop before you get there. Your job isn't to charge ahead. It is to march back and forth over swaths of balloon-filled ground, carefully recording every balloon that pops, and scrutinizing the data for a pattern. Maybe you need to venture a little further, into the space of firecrackers. But have a plan for where you stop, and an idea of what the danger signals would be.
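To make "have a plan for where you stop" concrete, here is a minimal sketch of what a pre-registered stopping rule could look like. Everything in it is hypothetical: the signal names, the thresholds, and the `should_stop` helper are my own illustrative inventions, not anything in OpenAI's plan. The only real point is that the rule exists, in writing, before the experiments do.

```python
# Minimal sketch of a pre-registered stopping rule for capability experiments.
# The signal names and thresholds are illustrative placeholders, chosen in
# advance rather than improvised after a surprising result.
from dataclasses import dataclass


@dataclass
class DangerSignal:
    name: str          # what is being measured
    threshold: float   # pre-registered limit
    description: str   # why crossing it means "stop"


# Hypothetical signals -- replace with whatever your lab can actually measure.
SIGNALS = [
    DangerSignal("eval_deception_rate", 0.01,
                 "fraction of audited transcripts showing deliberate deception"),
    DangerSignal("unexplained_capability_jump", 2.0,
                 "benchmark gain (in standard deviations) that current theory can't predict"),
]


def should_stop(measurements):
    """Return the tripped signals; any non-empty result means halt and go back to base."""
    tripped = []
    for sig in SIGNALS:
        value = measurements.get(sig.name)
        if value is not None and value >= sig.threshold:
            tripped.append(f"{sig.name}={value} >= {sig.threshold} ({sig.description})")
    return tripped


if __name__ == "__main__":
    # Run this check after every training run or evaluation sweep.
    latest = {"eval_deception_rate": 0.004, "unexplained_capability_jump": 2.3}
    problems = should_stop(latest)
    if problems:
        print("STOP -- danger signals tripped:")
        for p in problems:
            print("  ", p)
    else:
        print("Continue, and keep recording every popped balloon.")
```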
It may be that there is no pattern in the mines. Or at least none you can discern. In which case, you don't venture further. You don't keep on, hoping that you will spot a pattern with just a few more large fireworks. You go back to base. And you hope that someone somewhere has been working on an airplane to fly over the minefield, and you can ask to help out with that.
That is at least a possibility. A favorable but not totally implausible setting of those hidden dials. You probably need some new ideas, and if someone shows you a paper on, say, conservative learning (how to learn a classification boundary that is just big enough to fit the datapoints and no bigger), be ready to read that paper.
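To illustrate what "conservative" means here, the toy sketch below only labels a point positive if it falls inside the tightest axis-aligned box around the positive training examples, and refuses to extrapolate beyond them. The `BoundingBoxClassifier` is my own made-up example, not a method from any particular paper; a real conservative learner would use a more sensible region, but the key property is the same: the boundary hugs the data instead of confidently generalizing into territory it has never seen.

```python
import numpy as np


class BoundingBoxClassifier:
    """Toy 'conservative' classifier: the positive region is the smallest
    axis-aligned box containing the positive training points, and no more."""

    def fit(self, X_pos):
        self.lo = X_pos.min(axis=0)  # tightest lower corner of the data
        self.hi = X_pos.max(axis=0)  # tightest upper corner of the data
        return self

    def predict(self, X):
        # True only for points inside the box; no extrapolation beyond it.
        return np.all((X >= self.lo) & (X <= self.hi), axis=1)


# Usage: a point near the training cloud is accepted, a far-away one is not.
rng = np.random.default_rng(0)
X_pos = rng.normal(0.0, 1.0, size=(200, 2))
clf = BoundingBoxClassifier().fit(X_pos)
print(clf.predict(np.array([[0.0, 0.0], [10.0, 10.0]])))  # [ True False]
```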
I do hope you have some procedure for deciding "when it's safe to do so". And ideally a way to share your results with a few other top labs, for anything you deem safe enough to share with DeepMind or MIRI but not safe enough to make public.
The methods you are trying are all known to fail at sufficiently high levels of intelligence. But if these are your only ideas, it is possible they get you far enough for GPT-5 to output a better idea. If you are going to delve into the hacky methods that might possibly drag you just far enough, here are some more.
As we trek further into the minefield, we expect to encounter entirely new kinds of explosives. You should be planning not to encounter them, at least not the really dangerous ones.
It is at least plausible that a GPT-based approach can learn a superhumanly large pile of heuristics that can output interesting AI ideas. My model of existing AI is more like taking an existing article from the training data and asking humans to translate it into Chinese and back: new words for the existing ideas, but little more than random mutation in idea space. However, that could easily be wrong, or could change with further scaling.
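One crude way to probe this, purely as an illustration of my analogy and not anything from the plan, is to ask how close a model-generated "idea" sits to its nearest neighbour in the training corpus under a sentence embedding. A paraphrase survives the round trip through Chinese and back with a high similarity score; a genuinely new idea shouldn't have a near neighbour at all. The tiny corpus, the generated sentence, and the choice of the sentence-transformers library are all placeholders here.

```python
# Crude novelty probe: embed a model-generated "idea" and a (placeholder)
# training corpus, then look at the nearest-neighbour cosine similarity.
# A high score suggests a reworded training document rather than a new idea.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

training_docs = [
    "Gradient descent adjusts weights to minimize a loss function.",
    "Attention lets a transformer weigh tokens by relevance.",
]
generated_idea = "Transformers use attention to decide which tokens matter most."

doc_emb = model.encode(training_docs, convert_to_tensor=True)
idea_emb = model.encode(generated_idea, convert_to_tensor=True)

nearest = util.cos_sim(idea_emb, doc_emb)[0].max().item()
print(f"nearest-neighbour similarity: {nearest:.2f}")
```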
Optimistic. I think this still deserves to be called a plan. It has a strong vein of wishlist running through it.
If you can offload almost all the work of alignment, you can probably offload most of the capabilities work too. You have an AI you can just ask for "code for a superintelligence" and get back runnable code for a potentially unaligned superintelligence. Unless of course you have used some sort of fine-tuning to stop this, and the fine-tuning works better than it did for ChatGPT. This isn't in and of itself doom. If you get this far, you have constructed an artifact powerful enough to save or destroy the world. May you be careful with it.
Language models come with several great advantages, but also a great disadvantage: they automatically learn all the dark arts of persuasion and manipulation.
Suppose some programmer at OpenAI asks for "a highly convincing argument that the world is flat". The argument is indeed highly convincing. The programmer is convinced. Do you have a plan for this situation? My plan would be: delete the language model, along with any documents you think were written by it, and any documents that might contain the programmer rephrasing the argument in their own words. Send the programmer on at least 6 months of paid leave, with psychological screening.
Suppose the programmer asks for "a highly convincing argument for why AI is totally safe". And is indeed convinced. Similar to before, except now the programmer is on definite leave, as in they definitely aren't coming back. And you tell everyone else enough of what happened that they won't be working elsewhere on anything relating to AI. Pay them full salary to sit at home knitting. This is probably the point where you do some serious soul-searching over why that question was ever put to such a dangerous model. And possibly the point where you give up as an AI research org and hope someone else can safely align AI. This mine blew your leg off. Going any further is suicide.
But why delete the model? Shouldn't we use such a powerful model for good? Like maybe ask it to generate reasons AI is really dangerous, and send them to some people.
No. There are some spells so dark, no light wizard should ever cast them.
I suppose I have to explain in detail why this is a bad idea, for the benefit of those who just don't get it. The AI has already demonstrated the ability to hack human brains: to load beliefs uncorrelated with reality directly into a smart and functioning mind.
Suppose you do get the AI to create such an argument. You send it to a few politicians who are making noises about alignment being a waste of public money. Magic this dark is harder to control than it is to unleash. The argument inevitably goes viral online. Now a large fraction of the population, including nearly every alignment researcher, has seen it. They are all convinced AI is dangerous because giant space penguins like the taste of AI and will eat the earth if it has too many AIs on it. (Or some other nonsense that makes Scientology look sane in comparison.) 3000 years later, humanity has rebuilt from the rubble and sets out into space on a holy quest to destroy all giant penguins. It can only be Murphy's law that the first aliens humanity comes across have a decidedly penguinoid appearance. Murphy's law, and the fact that, all those years ago, some corner of GPT-5 had a surprisingly good grasp of astrobiology and had successfully predicted one of the most common body plans to evolve across the universe.