My impression is that much more effort is being put into alignment than containment, and that containment is treated as impossible while alignment is treated as merely very difficult. Is that accurate? If so, why? By containment I mean mostly hardware-coded strategies for limiting the compute and/or world-influence an AGI has access to. It's similar to alignment in that the most immediate obvious solutions ("box!") won't work, but more complex solutions might. A common objection is that an AI will learn the structure of the protection from the humans who built it and work around it, but it's not inconceivable to have a structure that can't be extracted from a human.
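(To make "limiting the compute" concrete: here's a minimal, purely illustrative sketch at the software level, capping CPU time and memory for a child process through the OS. This is obviously not the tamper-resistant, hardware-coded scheme I mean, and the workload script name is made up, but it shows the shape of the restriction I'm pointing at.)

```python
import resource
import subprocess
import sys

# Toy, software-level analogue of "limiting compute": cap CPU seconds and
# memory for a child process before handing it work (Unix-only). A real
# hardware-coded containment scheme would enforce limits below the OS,
# but the shape of the restriction is the same.

def run_capped(cmd, cpu_seconds=60, max_bytes=2 * 1024**3):
    def apply_limits():
        # Hard caps: the kernel signals/kills the process if either is exceeded.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    return subprocess.run(cmd, preexec_fn=apply_limits)

if __name__ == "__main__":
    # "untrusted_workload.py" is a placeholder for whatever the boxed system runs.
    run_capped([sys.executable, "untrusted_workload.py"])
```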
Advantages I see to devoting effort/money to containment solutions over alignment:
Where does the community stand on containment strategies and why?
Doesn't the exact same argument work for alignment, though? "It's so different, it may be misaligned in ways you can't think of." Why is it treated as a solvable challenge for alignment and an impossibility for containment? Is the guiding principle that people do expect a foolproof alignment solution to be within our reach?
One difference is that the AI wants to escape containment by default, almost by definition, but is agnostic about which goal function it prefers. But since goal space is huge (i.e. "human-compatible goals are measure zero in goal space"), I think the general approach is to assume it's 'misaligned by default' too.
I guess the crux is that I find it hard to imagine an alignment solution being qualitatively foolproof in a way that containment solutions can't be, and I feel like we're better off layering our imperfect solutions to both to maximize our chances, rather than trying to "solve" AI risk once and for all. I'd love to say that a proof could convince me, but I can imagine myself being equally convinced by a foolproof alignment proof and a foolproof containment proof, while an AI infinitely smarter than me ignores both. So I don't even know how to update here.