faul_sname - LessWrong

Why imperfect adversarial robustness doesn't doom AI control

It sure would.

Would you give 3:1 odds against a major player showing the untrusted model the old CoT, conditional on them implementing a "replace suspicious outputs" flow on a model with hidden CoT and publicizing enough implementation details to determine whether or not they did that?

Why imperfect adversarial robustness doesn't doom AI control

faul_sname3dΩ240

If the untrusted model has a private and un-audited chain of thought (like o1), and that chain of thought is preserved between rounds, the untrusted model may have a very good idea of what its original command was before the auditor replaced the command.

Your "the untrusted model is stateless between queries and only sees the command history and system state" does cover this, but wanted to call out that that assumption might not hold if current architectures are indicative of the future direction of development.

Most arguments for AI Doom are either bad or weak

faul_sname1mo31

I note that if software developers used that logic for thinking about software security, I expect that almost all software in the security-by-obscurity world would have many holes that would be considered actual negligence in the world we live in.

Most arguments for AI Doom are either bad or weak

faul_sname1mo20

This suggests that the threat model isn't so much "very intelligent AI" as it is "very cheap and at least somewhat capable robots".

Joshua Achiam Public Statement Analysis

faul_sname1mo258

“Based on your understanding of AI technical developments as of March 29, 2023, evaluate the most important known object-level predictions of Eliezer Yudkowsky on the subject, and which ones seemed true versus false. Afterwards, evaluate those predictions as a group, on a scale from ‘mostly true’ to ‘mostly false.’“

I ran this prompt but substituted in "Gary Marcus" for "Eliezer Yudkowsky". Claude says

Overall evaluation: On a scale from 'mostly true' to 'mostly false,' I would rate Gary Marcus's predictions as a group as "Mostly True."
Many of Marcus's predictions about the limitations of current AI systems and the challenges ahead have proven accurate. His concerns about reasoning, abstract thinking, and the need for more sophisticated knowledge representation align with ongoing challenges in the field. His emphasis on AI safety and alignment has also been prescient.
However, it's worth noting that some of his predictions might be seen as overly pessimistic by some in the AI community. The rapid progress in LLMs and their applications has surprised many, including some skeptics. Nonetheless, many of the fundamental challenges he pointed out remain relevant.
It's also important to remember that the field of AI is rapidly evolving, and assessments of such predictions can change quickly as new breakthroughs occur. As of my last update in April 2024, many of Marcus's key points still held true, but the field continues to advance at a rapid pace.

I think Claude likes saying nice things about people, so it's worth trying to control for that.

Why is o1 so deceptive?

faul_sname2mo72

Another issue is that a lot of o1’s thoughts consist of vagaries like “reviewing the details” or “considering the implementation”, and it’s not clear how to even determine if these steps are inferentially valid.

If you're referring to the chain of thought summaries you see when you select the o1-preview model in chatgpt, those are not the full chain of thought. Examples of the actual chain-of-thought can be found on the learning to reason with LLMs page with a few more examples in the o1 system card. Note that we are going off of OpenAI's word that these chain of thought examples are representative - if you try to figure out what actual reasoning o1 used to come to a conclusion you will run into the good old "Your request was flagged as potentially violating our usage policy. Please try again with a different prompt."

Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure

faul_sname2mo43

Shutting Down all Competing AI Projects is not Actually a Pivotal Act

This seems like an excellent title to me.

Vladimir_Nesov's Shortform

faul_sname2mo61

Technically this probably isn't recursive self improvement, but rather automated AI progress. This is relevant mostly because

It implies that, at least through the early parts of the takeoff, there will be a lot of individual AI agents doing locally-useful compute-efficiency and improvement-on-relevant-benchmarks things, rather than one single coherent agent following a global plan for configuring the matter in the universe in a way that maximizes some particular internally-represented utility function.
It means that multi-agent dynamics will be very relevant in how things happen

If your threat model is "no group of humans manages to gain control of the future before human irrelevance", none of this probably matters.

Alignment by default: the simulation hypothesis

faul_sname2mo20

My argument is more that the ASI will be “fooled” by default, really. It might not even need to be a particularly good simulation, because the ASI will probably not even look at it before pre-commiting not to update down on the prior of it being a simulation.

Do you expect that the first takeover-capable ASI / the first sufficiently-internally-cooperative-to-be-takeover-capable group of AGIs will follow this style of reasoning pattern? And particularly the first ASI / group of AGIs that actually make the attempt.

shminux's Shortform

faul_sname2mo42

Yeah, my argument was "this particular method of causing actual human extinction would not work" not "causing human extinction is not possible", with a side of "agents learn to ignore adversarial input channels and this dynamic is frequently important".

LESSWRONG
LW

Posts

Wiki Contributions

Comments