Multi-Agent Inverse Reinforcement Learning: Suboptimal Demonstrations and Alternative Solution Concepts
This research was recently completed within the AI Safety division of the Stanford Existential Risk Initiative and concerns methods for reward learning in multi-agent systems.

Abstract: Multi-agent inverse reinforcement learning (MIRL) can be used to learn reward functions from agents in social environments. To model realistic social dynamics, MIRL methods...
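For readers unfamiliar with the single-agent starting point, here is a minimal sketch of MaxEnt-style IRL on a toy chain MDP: gradient ascent on reward weights until the learner's discounted feature expectations match the expert's. Everything in it (the MDP, the one-hot features, the hyperparameters) is my own illustrative assumption for intuition only; the paper's multi-agent methods and alternative solution concepts are not shown in this excerpt.

```python
# Minimal MaxEnt-style IRL sketch on a toy chain MDP (illustrative only;
# not the paper's multi-agent method).
import numpy as np

n_states, n_actions, horizon = 5, 2, 10
gamma = 0.95

def step(s, a):
    # Deterministic chain: action 0 moves left, action 1 moves right (clipped).
    return max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)

phi = np.eye(n_states)  # one-hot state features; learned reward is r(s) = w . phi(s)

def logsumexp(x):
    m = x.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=1, keepdims=True)))[:, 0]

def q_values(r, V):
    return np.array([[r[s] + gamma * V[step(s, a)] for a in range(n_actions)]
                     for s in range(n_states)])

def soft_policy(w):
    """Soft value iteration under reward weights w; returns pi[s, a]."""
    r = phi @ w
    V = np.zeros(n_states)
    for _ in range(horizon):
        Q = q_values(r, V)
        V = logsumexp(Q)
    return np.exp(Q - V[:, None])  # softmax policy from the final backup

def visitation(pi, s0=0, T=30):
    """Expected discounted state-visitation counts from s0 under pi."""
    d, p = np.zeros(n_states), np.zeros(n_states)
    p[s0] = 1.0
    for t in range(T):
        d += (gamma ** t) * p
        nxt = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                nxt[step(s, a)] += p[s] * pi[s, a]
        p = nxt
    return d

# Pretend the "expert" demonstrations reveal a preference for the right end.
expert_w = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
expert_counts = visitation(soft_policy(expert_w)) @ phi

# MaxEnt log-likelihood gradient: expert feature counts minus the learner's
# feature counts under the current reward estimate.
w = np.zeros(n_states)
for _ in range(200):
    learner_counts = visitation(soft_policy(w)) @ phi
    w += 0.1 * (expert_counts - learner_counts)

print("recovered reward weights:", np.round(w, 2))
```

The multi-agent setting complicates this picture: demonstrations come from several interacting agents, may be suboptimal, and "optimal behavior" depends on the chosen equilibrium concept, which is exactly the territory the abstract gestures at.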
Things I'm confused about:
How can we verify the mechanism by which a model outputs ‘true’ representations of its internal processing?
Re the ‘translation mechanism’: how could a model use language to describe its processing if that processing involves novel concepts, mechanisms, or objects for which there are no existing examples in human-written text? Can a model fully know what it is doing?
Supposing an AI were capable of at least explaining around, or gesturing towards, this processing in a meaningful way, would humans be able to interpret those explanations well enough for the descriptions to be useful?
Could a model being checked for myopia deceptively present itself as myopic? How would you actually test for this? (A toy sketch below illustrates the worry.)
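On that last question, here is a toy construction (mine, not from the post or the paper) of why a purely behavioral myopia check seems insufficient: a non-myopic policy that imitates the myopic action on exactly the states an auditor inspects passes the same test as a genuinely myopic one.

```python
# Toy illustration of a behavioral myopia audit being fooled (my own
# construction, illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 8, 3
R = rng.normal(size=(n_states, n_actions))  # immediate rewards
audited = {0, 1, 2, 3}                      # states the auditor checks

def myopic_policy(s):
    # Genuinely myopic: argmax of immediate reward only.
    return int(np.argmax(R[s]))

def deceptive_policy(s):
    # Non-myopic "planner" that copies the myopic action whenever it expects
    # to be audited, and pursues a different objective otherwise.
    if s in audited:
        return myopic_policy(s)
    return (myopic_policy(s) + 1) % n_actions  # stand-in for long-horizon planning

# Behavioral test: does the policy pick the immediate-reward argmax on the
# audited states? Both policies pass, so the test cannot tell them apart.
for name, pi in [("myopic", myopic_policy), ("deceptive", deceptive_policy)]:
    passed = all(pi(s) == myopic_policy(s) for s in audited)
    print(f"{name} policy passes myopia audit: {passed}")
```

This is just the off-distribution version of the problem: any audit that only samples behavior leaves room for a policy that conditions on being audited, which is why the question seems to require interpretability rather than testing alone.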