Davidmanheim

Comments

I have a biologically hardwired preference for defeating and hurting those who oppose me vigorously. I work very hard to sideline that biologically hardwired preference.

This seems like a poor analogy, and a misleading one in this context. We can usefully distinguish between evolutionarily beneficial instrumental strategies that are no longer adaptive and actively sabotage our other preferences in the modern environment, and preferences we can preserve without sacrificing other goals.

Davidmanheim

CoT monitoring seems like a great control method when available


As I posted in a top-level comment, I'm not convinced that even success would be a good outcome. I think that even if we get this working 99.999% reliably, we still end up delegating parts of the oversight in ways that have other alignment failure modes, such as via hyper-introspection.

Davidmanheim

First, strongly agreed on the central point - I think that as a community, we've been investing too heavily in the tractable approaches (interpretability, testing, etc.) without keeping the broader alignment issues front and center. This has led to lots of bikeshedding, lots of capabilities work, and yes, some partial solutions to problems.

That said, I am concerned about what happens if interpretability is wildly successful - against your expectations. That is, I see interpretability as a concerning route to attempted alignment even if it gets past the issues you note on "miss things," "measuring progress," and "scalability," partly for reasons you discuss under obfuscation and reliability. A system with wildly successful and scalable interpretability, but without the other parts of alignment solved, would very plausibly still function as a dangerously misaligned system, and the detection methods themselves arguably exacerbate the problem. I outlined my concerns about this case in more detail in a post here. I would be very interested in your thoughts about this. (And thoughts from @Buck / @Adam Shai as well!)

If it fails once we are well past AlphaZero, or even just more moderate superhuman AI research, this is good, as this means the "automate AI alignment" plan has a safe buffer zone.

If it fails before AI automates AI research, this is also good, because it forces them to invest in alignment.


That assumes AI firms learn the lessons needed from the failures. Our experience shows that they don't: they keep building systems that are predictably unsafe and exploitable, and they don't have serious plans to change their deployments, much less to actually build a safety-oriented culture.

Because they are all planning to build agents that will be subject to optimization pressure, and RL-type failures apply when you build RL systems, even if they are built on top of LLMs.

Responses to o4-mini-high's final criticisms of the post:

Criticism: "You're treating hyper-introspection (internal transparency) as if it naturally leads to embedded agency (full goal-driven self-modification). But in practice, these are distinct capabilities. Why do you believe introspection tools would directly lead to autonomous, strategic self-editing in models that remain prediction-optimized?"

Response: Yes, these are distinct, and one won't necessarily lead to the other - but both are being developed by the same groups in order to be deployed. There's a reasonable question about how linked they are, but I think there is a strong case that self-modification via introspection, even if only done during training and internal deployment, would lead to much more dangerous and harder-to-track deception.

Criticism: "You outline very plausible risks but don’t offer a distribution over outcomes. Should we expect hyper-introspection to make systems 10% more dangerous? 1000%? Under what architectures? I'd find your argument stronger if you were more explicit about the conditional risk landscape."

Response: If we don't solve ASI alignment, which no one seems to think we can do, we're doomed once we build misaligned ASI. This seems to get us there more quickly. Perhaps it even reduces short-term risks, but I think timelines are far more uncertain than the way the risks will emerge if we build systems with these capabilities.

Criticism: "Given that fully opaque systems are even harder to oversee, and that deception risk grows with opacity too, shouldn't we expect that some forms of introspection are necessary for any meaningful oversight? I agree hyper-introspection could be risky, but what's the alternative plan if we don’t pursue it?

Response: Don't build smarter-than-human systems. If you are not developing ASI, and you want to monitor current and near-future systems that are not inevitably existentially dangerous, work on how humans can provide meaningful oversight in deployment, instead of on tools that enhance capabilities and accelerate the race - because without fixing the underlying dynamics, i.e. solving alignment, self-monitoring is a doomed approach.

Criticism: "You assume that LLMs could practically trace causal impact through their own weights. But given how insanely complicated weight-space dynamics are even for humans analyzing small nets, why expect this capability to arise naturally, rather than requiring radical architectural overhaul?"

Response: Yes, maybe Anthropic and others will fail, and building smarter-than-human systems might not be possible. Then strong interpretability is just a capability enhancer, and doesn't materially change the largest risks. That would be great news, but I don't want to bet my kids' lives on it.

In general, you can mostly solve Goodhart-like problems across the vast majority of the experienced range of actions, and have them fall apart only in more extreme cases. Reward hacking is similar. This is the default outcome I expect from prosaic alignment - we work hard to patch misalignment and hacking, so it works well enough in all the cases we test and try, until it doesn't.
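A minimal numerical sketch of that failure mode, with functions and ranges invented purely for illustration: a proxy reward that tracks the true objective reasonably well across the tested range, then diverges badly once an optimizer pushes far beyond it.

```python
# Minimal Goodhart sketch: the proxy and utility functions below are invented
# for illustration, not drawn from any real system.
import numpy as np

def true_utility(x):
    # What we actually care about: gains flatten and then reverse at extremes.
    return x - 0.02 * x**2

def proxy_reward(x):
    # The measurable proxy we optimize: keeps rewarding "more" indefinitely.
    return x

# The "experienced range" where we test, patch, and validate.
tested = np.linspace(0, 5, 6)
gap = np.max(np.abs(proxy_reward(tested) - true_utility(tested)))
print(f"tested range: max |proxy - true| = {gap:.2f}")  # small, looks fine

# An optimizer pushed well past the tested range.
for x in [10.0, 50.0, 100.0]:
    print(f"x={x:5.1f}  proxy={proxy_reward(x):6.1f}  true={true_utility(x):7.1f}")
# The proxy keeps climbing while true utility collapses: it works well enough
# in all the cases we test and try, until it doesn't.
```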

Quick take: it's focused on interpretability as a way to solve prosaic alignment, ignoring the fact that prosaic alignment is clearly not scalable to the types of systems they are actively planning to build. (And it seems to actively embrace the fact that interpretability is a capabilities advantage in the short term, while pretending that it is a safety measure, as if the two are not at odds with each other under racing dynamics.)

...yet it hasn't happened, which is pretty strong evidence the other way.

I think you are fooling yourself about how similar people in 1600 are to people today. The average person at the time was illiterate, superstitious, and could maybe do single-digit addition and subtraction. You're going to explain nuclear physics?
