annxhe140

This post was a super helpful introduction to some of the key paradigms in AI safety. I'm not sure how these ideas and questions fit into the broader safety/alignment literature and community, but is there a theoretical way to model "narrow simulation"? That is, an approach that still involves modeling humans, but where the agent has only a "narrow" theory of mind for them. For example, the agent might only be able to model human social cognition at the developmental level of a five-year-old, while still performing physical tasks of unbounded complexity. Another example: the agent operates very well socially in culture A but has a poor model of culture B, and is restricted in some way from adapting (this could be done intentionally by "locking" its priors). One could show that agent A's capability is bounded in some way by showing that it can be simulated by a (more powerful) agent B, but cannot itself simulate B.
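
One rough way to make that last point precise (my own sketch, not something from the post): define a simulation preorder on agents, where "B can simulate A" means there is some translation of A's situations into B's such that B reproduces A's behavior, and then take strict dominance as the capability bound:

$A \preceq B \iff \exists \tau \;\forall e, x:\; B(\tau(e), x) = A(e, x)$

$A \prec B \iff A \preceq B \,\wedge\, \neg(B \preceq A)$

Here $e$ ranges over environments, $x$ over inputs, and $\tau$ is the (assumed) translation map; $A \prec B$ is exactly the situation described above, where B simulates A but A cannot simulate B.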