SE Gyges' response to AI-2027
As Daniel Kokotajlo did in his coverage of Vitalik's response to AI-2027, I've copied the author's text. However, I would like to comment on potential errors directly in the text, since that is clearer.

AI 2027 is a website that might be described as a paper, a manifesto, or a thesis. It lays out a detailed timeline for AI development over the next five years. Crucially, per its title, it expects that there will be a major turning point sometime around 2027[1], when some LLM will become so good at coding that humans will no longer be required to code. This LLM will create the next LLM, and so on, forever, with humans soon losing all ability to meaningfully contribute to the process. They avoid calling this “the singularity”, possibly because using the term conveys to a lot of people that you shouldn’t be taken too seriously.

I think that pretty much every important detail of AI 2027 is wrong. My issue is that each of many different things has to happen the way they expect, and if any one thing happens differently, more slowly, or less impressively than their guess, later events become more and more fantastically unlikely. If the general prediction regarding the timeline ends up being correct, it seems like it will have been mostly by luck.

I also think there is a fundamental issue of credibility here. Sometimes you should separate the message from the messenger. Maybe the message is good, and you shouldn't let your personal hangups about the person delivering it get in the way. Even people with bad motivations are right sometimes. Good ideas should be taken seriously, regardless of their source. Other times, who the messenger is and what motivates them is important for evaluating the message. This applies to outright scams, like emails from strangers telling you they're Nigerian princes, and to people who probably believe what they're saying, like anyone telling you that their favorite religious leader or musician is the greatest one ever. You can guess,
Thank you for doing the experiment! It makes me wonder whether LLMs could be trained to verbalise their reasoning not just by producing "sensible rationalisations", but as follows: the LLM is SFTed to output answers to two task types when it encounters the tasks, and to output CoT-like reasoning when prompted for it. Then the LLM is RLed so that its CoT for one task type causes the untrained LLM[1] to output the same answer. Finally, the LLM is asked to output its reasoning for the other task type. If the untrained LLM reproduces the answer from that reasoning, then the LLM has succeeded in developing the habit of explaining its reasoning, though with the potential to develop a misaligned understanding of the rules. A minimal sketch of the setup is below. What do you think of this idea?
P.S. My proposal is similar to the paper on Prompt Optimization Making Misalignment Legible.
An additional problem is subliminal learning, but this can be mitigated by choosing a different LLM (ideally one built on a different base model) to test the CoT.
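To make the proposal concrete, here is a minimal sketch of the setup I have in mind. Every name in it (`generate`, `legibility_reward`, `transfer_eval`) is a hypothetical placeholder rather than a real training API; it only shows the shape of the reward and the transfer test, not an implementation.

```python
# Minimal sketch of the proposed setup. All helpers are hypothetical
# placeholders, not a real training or inference API.

def generate(model, prompt: str) -> str:
    """Placeholder: sample a completion from `model` for `prompt`."""
    raise NotImplementedError

def legibility_reward(trained_model, reader_model, task: str, gold_answer: str) -> float:
    """RL reward on task type A: the CoT written by the trained model
    should let an *untrained* reader model reproduce the same answer."""
    cot = generate(trained_model, f"Explain your reasoning for: {task}")
    reader_answer = generate(reader_model, f"{task}\nReasoning: {cot}\nAnswer:")
    return 1.0 if reader_answer.strip() == gold_answer.strip() else 0.0

def transfer_eval(trained_model, reader_model, tasks_type_b) -> float:
    """Evaluation on task type B (never RLed for legibility): does the
    habit of writing usable reasoning generalise to the other task type?"""
    hits = 0
    for task, gold_answer in tasks_type_b:
        cot = generate(trained_model, f"Explain your reasoning for: {task}")
        reader_answer = generate(reader_model, f"{task}\nReasoning: {cot}\nAnswer:")
        hits += int(reader_answer.strip() == gold_answer.strip())
    return hits / len(tasks_type_b)

# Pipeline outline:
# 1. SFT the model to answer both task types, and to emit CoT-like
#    reasoning when explicitly prompted.
# 2. RL on task type A with `legibility_reward`, where the reader is an
#    untrained LLM (ideally a different base model, to reduce the risk of
#    subliminal learning).
# 3. Run `transfer_eval` on task type B, while checking whether the
#    verbalised rules match the behaviour (a misaligned understanding of
#    the rules would show up as a mismatch).
```

The reader model is kept frozen on purpose: the reward only measures whether the reasoning is legible enough for an outside model to follow, not whether the reader can be co-trained to decode it.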