If the untrusted model has a private and un-audited chain of thought (like o1), and that chain of thought is preserved between rounds, the untrusted model may have a very good idea of what its original command was before the auditor replaced the command.
Your "the untrusted model is stateless between queries and only sees the command history and system state" does cover this, but wanted to call out that that assumption might not hold if current architectures are indicative of the future direction of development.
I note that if software developers used that logic for thinking about software security, I expect that almost all software in the security-by-obscurity world would have many holes that would be considered actual negligence in the world we live in.
This suggests that the threat model isn't so much "very intelligent AI" as it is "very cheap and at least somewhat capable robots".
“Based on your understanding of AI technical developments as of March 29, 2023, evaluate the most important known object-level predictions of Eliezer Yudkowsky on the subject, and which ones seemed true versus false. Afterwards, evaluate those predictions as a group, on a scale from ‘mostly true’ to ‘mostly false.’“
I ran this prompt but substituted in "Gary Marcus" for "Eliezer Yudkowsky". Claude says
Overall evaluation: On a scale from 'mostly true' to 'mostly false,' I would rate Gary Marcus's predictions as a group as "Mostly True."
Many of Marcus's predictions about the limitations of current AI systems and the challenges ahead have proven accurate. His concerns about reasoning, abstract thinking, and the need for more sophisticated knowledge representation align with ongoing challenges in the field. His emphasis on AI safety and alignment has also been prescient.
However, it's worth noting that some of his predictions might be seen as overly pessimistic by some in the AI community. The rapid progress in LLMs and their applications has surprised many, including some skeptics. Nonetheless, many of the fundamental challenges he pointed out remain relevant.
It's also important to remember that the field of AI is rapidly evolving, and assessments of such predictions can change quickly as new breakthroughs occur. As of my last update in April 2024, many of Marcus's key points still held true, but the field continues to advance at a rapid pace.
I think Claude likes saying nice things about people, so it's worth trying to control for that.
Another issue is that a lot of o1’s thoughts consist of vagaries like “reviewing the details” or “considering the implementation”, and it’s not clear how to even determine if these steps are inferentially valid.
If you're referring to the chain of thought summaries you see when you select the o1-preview model in chatgpt, those are not the full chain of thought. Examples of the actual chain-of-thought can be found on the learning to reason with LLMs page with a few more examples in the o1 system card. Note that we are going off of OpenAI's word that these chain of thought examples are representative - if you try to figure out what actual reasoning o1 used to come to a conclusion you will run into the good old "Your request was flagged as potentially violating our usage policy. Please try again with a different prompt."
Shutting Down all Competing AI Projects is not Actually a Pivotal Act
This seems like an excellent title to me.
Technically this probably isn't recursive self improvement, but rather automated AI progress. This is relevant mostly because
If your threat model is "no group of humans manages to gain control of the future before human irrelevance", none of this probably matters.
My argument is more that the ASI will be “fooled” by default, really. It might not even need to be a particularly good simulation, because the ASI will probably not even look at it before pre-commiting not to update down on the prior of it being a simulation.
Do you expect that the first takeover-capable ASI / the first sufficiently-internally-cooperative-to-be-takeover-capable group of AGIs will follow this style of reasoning pattern? And particularly the first ASI / group of AGIs that actually make the attempt.
Yeah, my argument was "this particular method of causing actual human extinction would not work" not "causing human extinction is not possible", with a side of "agents learn to ignore adversarial input channels and this dynamic is frequently important".
It sure would.
Would you give 3:1 odds against a major player showing the untrusted model the old CoT, conditional on them implementing a "replace suspicious outputs" flow on a model with hidden CoT and publicizing enough implementation details to determine whether or not they did that?