Anatoly_Vorobey comments on Clarification of AI Reflection Problem - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
I would like to request some clarifications.
Given an agent A as you describe, let AO be its object-level algorithm, and AM (for "meta") be its agent-search algorithm. A itself then stands for: "execute AM for 1/2 the time, and if that didn't result in a self-transformation, execute AO for the remaining time".
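As I read it, A's control flow is something like the following sketch (the names `search_for_successor` and `run_object_level` are my own stand-ins for AM and AO, not from the post):

```python
# Toy sketch (my own, under my reading of the post) of the agent A built
# from a meta-level search AM and an object-level algorithm AO.

def run_agent_A(task, time_budget, search_for_successor, run_object_level):
    # Phase 1: spend half the budget on the agent-search algorithm AM.
    successor = search_for_successor(task, time_budget / 2)
    if successor is not None:
        # AM found a pair (A', P): self-transform and hand the task to A'.
        return successor(task, time_budget / 2)
    # Phase 2: no transformation happened, so run the object-level
    # algorithm AO for the remaining half of the budget.
    return run_object_level(task, time_budget / 2)
```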
When you say "if A' is a valid description of an agent and P is a proof that A' does better on the object level task than A", do you mean that P proves: 1) that A'O is better than AO at the task T; 2) that A' is better than A at T; 3) that A' is better than AO at T?
If you mean 2), doesn't it follow that any transformation can yield at most a 2x speedup? After all, if A solves T by transforming into A', that is still a case of A solving T, and so the only way A' could be better at T than A is the penalty A incurs by searching for (A', P); that penalty is at most 1/2 the time spent by A. This measure of the difference between A and A' also depends a lot on the search strategy, which makes it look "brittle".
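Under interpretation 2), the bound can be made explicit (a sketch, assuming the search phase is charged at most half of A's total running time, as in the 1/2-time policy above):

```latex
t_A = t_{\text{search}} + t_{A'}, \qquad
t_{\text{search}} \le \tfrac{1}{2}\, t_A
\;\Longrightarrow\;
t_{A'} \ge \tfrac{1}{2}\, t_A
\;\Longrightarrow\;
\frac{t_A}{t_{A'}} \le 2 .
```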
It's not clear to me what you mean in "but suppose A' considers many candidate modifications (A'', P1), (A'', P2), ..., (A'', Pk). It is now much harder for A to show that all of these self-modifications are safe..." - are these different agents A''1, A''2, etc., or the same agent A'' with different proofs? If the latter, it's not clear what "these self-modifications" refers to, since there is just one modification with several candidate proofs; if the former, the later language that seems to refer to one unique A'' is confusing.
Consider the simplest case: A' runs twice as fast as A but is equivalent to A as a program (e.g. suppose they run on the same underlying hardware platform, but A's code contains a systematic slowdown that is trivial to remove and trivial to show doesn't affect its results), and A is able to prove that equivalence. Do you claim that this case falls under your scenario, and that A should in fact not be able to self-modify into A' without assuming its own consistency? If you do, are you able, for this specific simple case, to walk through the reasoning and show how it carries through?
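To make the simple case concrete, here is a toy version of the pair (A, A') I have in mind, where the "systematic slowdown" is a dead busy-loop whose removal trivially cannot affect the result (the function names and the loop are my own illustration, not from the post):

```python
# Toy illustration: A' is the same program as A with a semantically inert
# slowdown removed, so the two provably return identical results.

def solve_A(n):
    """Agent A's object-level solver, with a systematic dead slowdown."""
    total = 0
    for i in range(n):
        for _ in range(10_000):  # dead busy-loop: cannot change `total`
            pass
        total += i
    return total

def solve_A_prime(n):
    """A': identical except the dead loop is stripped -- same results, faster."""
    total = 0
    for i in range(n):
        total += i
    return total
```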
Thanks, this whole part of your post is clearer to me now. I think the post would benefit from integrating these explanations to some degree into the text.