If I understand it correctly, and please correct me if I am mistaken, an approval-directed agent is an artificial intelligence that simulates a person (perfectly or near-perfectly) and then implements a decision only if that simulated person would approve of it. Importantly, it does not compute the outcomes of candidate decisions and then determine which outcome maximises the person's happiness; instead, it uses the person's own heuristics (via the simulation) to determine whether the person would implement the decision, given more time to think about it. So the AI's decision-making algorithm consists entirely of implementing the decisions that a faster human would.
Could you explain the difference between this approval-directed AI Arthur and an upload of the human Hugh? Or is there no difference? Under which conditions would they act differently, i.e. implement a different strategy?
An approval-directed agent doesn’t simulate a person any more than a goal-directed agent simulates the universe. It tries to predict which actions the person would approve of, just as a goal-directed agent tries to predict which actions lead to good consequences. Only in the limit does the approval-directed agent approach an emulation of the person, analogous to the way in which, in the limit, a goal-directed agent approaches a simulation of the universe.
So there are two big differences:
You can implement it now; it's just an objective for your system, which it can satisfy to varying degrees.
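A minimal sketch of the contrast described above, assuming a discrete set of candidate actions. All names here (`predict_outcome`, `utility`, `predicted_approval`) are illustrative placeholders, not anything from the original posts; the point is only that approval-direction swaps the objective, not the architecture.

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")
Outcome = TypeVar("Outcome")


def goal_directed_choice(
    actions: Iterable[Action],
    predict_outcome: Callable[[Action], Outcome],
    utility: Callable[[Outcome], float],
) -> Action:
    """Pick the action whose predicted outcome scores highest under a utility function."""
    return max(actions, key=lambda a: utility(predict_outcome(a)))


def approval_directed_choice(
    actions: Iterable[Action],
    predicted_approval: Callable[[Action], float],
) -> Action:
    """Pick the action the overseer (e.g. Hugh) is predicted to rate most highly.

    The agent never needs to simulate the overseer; it only needs some model
    of how the overseer would rate each action, and it can satisfy this
    objective to varying degrees of accuracy.
    """
    return max(actions, key=predicted_approval)
```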
(Crossposted from Ordinary Ideas.)
I’ve recently been thinking about AI safety, and some of the writeups might be interesting to some LWers:
I’m excited about a few possible next steps: