Angie Normandale - LessWrong

I've been exploring this for the last year, and I think it's a promising avenue for solving some key alignment issues. Homeostatic approaches are well documented in neuroscience but surprisingly neglected in alignment.

Research in support:

Managing competing drives: homeostatic approaches ensure safety wins out
-Mathematical formalisations by Laurençon et al., and by Keramati and Gutkin
-Robotics researchers successfully implemented a homeostatic approach to help a system manage competing drives
-Friston suggests homeostasis is a way to minimise free energy
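To make the "competing drives" point concrete, here is a minimal Python sketch of a drive-reduction reward in the style of Keramati and Gutkin's homeostatic reinforcement learning, as I understand it: the drive is a distance between internal variables and their setpoints, D(H) = (sum_i |h*_i - h_i|^n)^(1/m), and the reward is the drop in drive from one step to the next. The exponents n=4 and m=3, the two variables and their labels ("task", "safety") are illustrative assumptions, not values taken from the cited papers.

```python
# Minimal sketch of a homeostatic (drive-reduction) reward.
# Assumptions: two hypothetical internal variables with setpoints of 1.0,
# and illustrative exponents n=4, m=3 (chosen so that n > m > 1).


def drive(state, setpoints, n=4.0, m=3.0):
    """Distance of the internal state from its setpoints: (sum |h* - h|^n)^(1/m)."""
    return sum(abs(target - h) ** n for h, target in zip(state, setpoints)) ** (1.0 / m)


def homeostatic_reward(state, next_state, setpoints, n=4.0, m=3.0):
    """Reward is the reduction in drive produced by moving from state to next_state."""
    return drive(state, setpoints, n, m) - drive(next_state, setpoints, n, m)


if __name__ == "__main__":
    setpoints = [1.0, 1.0]  # hypothetical targets for [task, safety]
    state = [0.9, 0.2]      # task nearly satisfied, safety far from its setpoint

    # Two candidate actions, each improving one variable by the same 0.1.
    improve_task = [1.0, 0.2]
    improve_safety = [0.9, 0.3]

    r_task = homeostatic_reward(state, improve_task, setpoints)
    r_safety = homeostatic_reward(state, improve_safety, setpoints)
    print(f"reward for improving task:   {r_task:.5f}")
    print(f"reward for improving safety: {r_safety:.5f}")
```

Because each deviation is raised to a power n > 1, an equal-sized correction earns more reward on whichever variable is furthest from its setpoint, so in this sketch the badly violated "safety" variable takes priority over the nearly satisfied "task" variable.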

Scaling to group behaviour
This could provide mathematical support for Joel Leibo and colleagues' appropriateness agenda, and a mechanism for the unsolved problem of alignment with changing and influenceable reward functions.

Declaration of interest: I recently joined Roland at Aintelope to support this agenda, alongside other applications of neuroscience to alignment!

Great paper! Important findings.

What’s your intuition on ways to detect and control such behaviour?

An interesting extension would be to train a model on a large dataset that includes low-level but consistent elements of primed data. Do the harmful behaviours persist and generalise? If so, this could be used to exploit existing ‘aligned’ models that update on publicly modifiable datasets.
