What's even the point of working on strong AI if you can't ever be convinced that it's Friendly?
As I understand it, EY's/MIRI's position on this is that they will be convinced an AI is Friendly by having coded it using procedures which they are confident (based on theoretical analysis) produce Friendly AI.
Once the AI is running, on this view, it's too late.
If you've stated the position correctly, there seems to be a fatal flaw in it. I realize, of course, that I've only thought about this for five minutes, and that they're domain experts who have been thinking about it for much longer... but here is the flaw:
If we believe that an AI can convince Person X, who has seen its algorithm, that it is Friendly when it isn't actually Friendly, then we shouldn't trust Person X to judge the algorithm's Friendliness anyway. Why would someone who makes the wrong decision with more information make the right decision with less?
Edit: I guess the AGI's knowledge of human biases could make the AGI in a box slightly scarier than the uncompiled source code of said AGI.
AI Box Experiment Update #3
Tuxedage (AI) vs Alexei (GK) - Gatekeeper Victory
Tuxedage (AI) vs Anonymous (GK) - Gatekeeper Victory
I have lost a second game of AI Box, this time against a gatekeeper who wished to remain anonymous.
This puts my AI Box Experiment record at 3 wins and 3 losses.