I think it’s actually another control process—specifically, the process of controlling our identities. We have certain conceptions of ourselves (“I’m a good person,” “I’m successful,” or “people love me”). We are then constantly adjusting our lives and actions in order to maintain those identities—e.g. by selecting the goals and plans which are most consistent with them, and looking away from evidence that might falsify our identities. So perhaps our outermost loop is a control process after all.
rhymes with Steve's take on approval reward (and MONA)
in general I find I'm somewhat skeptical about the particulars of the active inference / coalitional agency frame, but very excited about locality, path-dependence, predictive processing, and myopia
yeah I am very grateful for MIRI, and I don't think we should be complacent about existential risks (e.g. 50% P(doom) seems totally reasonable to me)
the hope is that
a) we get transformative AI to do our alignment homework for us, and
b) companies / society become more concerned about safety (such that the ratio of safety to capabilities research increases a lot)
yeah I agree, I think the update is basically just "AI control and automated alignment research seem very viable and important", not "Alignment will be solved by default"
I think the "open character training" paper is probably a good place to look
also curious how you feel about using standard harmfulness questions for auditing MOs.
My vague sense is that models tend to have more reflexive refusal mechanisms than strategic deception, e.g. when asking Chinese models about CCP-censored stuff, but it does seem like models strategically lie sometimes. (ofc you'd also need to set up a weak-to-strong situation where the weak model genuinely doesn't know how to make a bomb, chem weapon, etc., but that seems doable)
nice, looks promising!
is this true? I think many people (myself included) are worried about conflationary alliances backfiring (as we see to some extent in the current admin)