Davidmanheim

Comments

They edited the text. It was an exact quote from the earlier text.

I think that's what they meant you should not do when they said [edit to add: directly quoting a now-modified part of the footnote] "Bulk preorders don’t count, and in fact hurt."

My attitude here is something like "one has to be able to work with moral monsters".


You can work with them without inviting them to hang out with your friends.

This flavor of boycotting seems like it would generally be harmful to one's epistemics to adopt as a policy.

Georgia did not say she was boycotting, nor did she call for others not to attend - she explained why she didn't want to be at an event where he was a featured speaker.

This seems mostly right, except that it's often hard to parallelize work and manage large projects - which seems like it slows things down in important ways. And, of course, some things are strongly serialized, using time that can't be sped up via more compute or more people. (See: the PM who hires 9 women to have a baby in one month.)

Similarly, running 1,000 AI research groups in parallel might get you the same 20 insights 50 times, rather than generating far more insights. And managing and integrating the research, and deciding where to allocate research time, plausibly gets harder at more than a linear rate with more groups.

So overall, the model seems correct, but I think the 10x speedup is more likely than the 20x speedup.
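To illustrate the redundancy point above, here is a minimal toy simulation under purely illustrative assumptions (a fixed pool of 20 possible insights, each group independently stumbling onto a handful of them); it is a sketch of the intuition, not a model of real research dynamics.

```python
import random

# Illustrative assumptions, not estimates: a fixed pool of possible insights,
# with each research group independently drawing a handful of them at random.
POOL_SIZE = 20       # total distinct insights available
DRAWS_PER_GROUP = 5  # insights each group happens to find

def distinct_insights(num_groups: int, trials: int = 500) -> float:
    """Average number of distinct insights found across all groups."""
    total = 0
    for _ in range(trials):
        found = set()
        for _ in range(num_groups):
            found.update(random.randrange(POOL_SIZE) for _ in range(DRAWS_PER_GROUP))
        total += len(found)
    return total / trials

for n in (1, 10, 100, 1000):
    print(f"{n:>4} groups -> ~{distinct_insights(n):.1f} distinct insights")
# Distinct insights plateau near the pool size well before 1,000 groups,
# while the raw count of (mostly duplicated) findings grows linearly.
```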

I have a biologically hardwired preference for defeating and hurting those who oppose me vigorously. I work very hard to sideline that biologically hardwired preference.

This seems like a very bad analogy, and it is misleading in this context. We can usefully distinguish between evolutionarily beneficial instrumental strategies that are no longer adaptive and that actively sabotage our other preferences in the modern environment, and preferences that we can preserve without sacrificing other goals.

CoT monitoring seems like a great control method when available


As I posted in a top-level comment, I'm not convinced that even success would be a good outcome. I think that if we get this working 99.999% reliably, we still end up delegating parts of the oversight in ways that have other alignment failure modes, such as via hyper-introspection.

First, strongly agreed on the central point - I think that as a community, we've been investing too heavily in the tractable approaches (interpretability, testing, etc.) without keeping the broader alignment issues front and center. This has led to lots of bikeshedding, lots of capabilities work, and yes, some partial solutions to problems.

That said, I am concerned about what happens if interpretability is wildly successful - contrary to your expectations. That is, I see interpretability as a concerning route to attempted alignment even if it gets past the issues you note on "miss things," "measuring progress," and "scalability," partly for reasons you discuss under obfuscation and reliability. A system with wildly successful and scalable interpretability, but without the other parts of alignment solved, would very plausibly still function as a dangerously misaligned system, and the detection methods themselves arguably exacerbate the problem. I outlined my concerns about this case in more detail in a post here. I would be very interested in your thoughts about this. (And thoughts from @Buck / @Adam Shai as well!)

If it fails once we are well past AlphaZero, or even just more moderate superhuman AI research, this is good, as this means the "automate AI alignment" plan has a safe buffer zone.

If it fails before AI automates AI research, this is also good, because it forces them to invest in alignment.

That assumes AI firms learn the lessons needed from the failures. Our experience shows that they don't: they keep making systems that are predictably unsafe and exploitable, and they don't have serious plans to change their deployments, much less actually build a safety-oriented culture.

Because they are all planning to build agents that will have optimization pressures, and RL-type failures apply when you build RL systems, even if they're built on top of LLMs.

Responses to o4-mini-high's final criticisms of the post:

Criticism: "You're treating hyper-introspection (internal transparency) as if it naturally leads to embedded agency (full goal-driven self-modification). But in practice, these are distinct capabilities. Why do you believe introspection tools would directly lead to autonomous, strategic self-editing in models that remain prediction-optimized?"

Response: Yes, these are distinct, and one won't necessarily lead to the other - but both are being developed by the same groups, which intend to deploy them. There's a reasonable question about how linked they are, but I think there is a strong case that self-modification via introspection, even if only done during training and internal deployment, would lead to much more dangerous and harder-to-track deception.

Criticism: "You outline very plausible risks but don’t offer a distribution over outcomes. Should we expect hyper-introspection to make systems 10% more dangerous? 1000%? Under what architectures? I'd find your argument stronger if you were more explicit about the conditional risk landscape."

Response: If we don't solve ASI alignment, which no one seems to think we can do, we're doomed once we build misaligned ASI. This seems to get us there more quickly. Perhaps it even reduces short-term risks, but I think the timelines are far more uncertain than the way the risks will emerge if we build systems with these capabilities.

Criticism: "Given that fully opaque systems are even harder to oversee, and that deception risk grows with opacity too, shouldn't we expect that some forms of introspection are necessary for any meaningful oversight? I agree hyper-introspection could be risky, but what's the alternative plan if we don’t pursue it?

Response: Don't build smarter-than-human systems. If you are not developing ASI, and you want to monitor current and near-future systems that are not inevitably existentially dangerous, work on how humans can provide meaningful oversight in deployment, rather than on tools that enhance capabilities and accelerate the race - because without fixing the underlying dynamics, i.e. solving alignment, self-monitoring is a doomed approach.

Criticism: "You assume that LLMs could practically trace causal impact through their own weights. But given how insanely complicated weight-space dynamics are even for humans analyzing small nets, why expect this capability to arise naturally, rather than requiring radical architectural overhaul?"

Response: Yes, maybe Anthropic and others will fail, and building smarter-than-human systems might not be possible. Then strong interpretability is just a capability enhancer, and doesn't materially change the largest risks. That would be great news, but I don't want to bet my kids' lives on it.
