Really nice comment that I also happen to agree with. As a programmer working with Claude Code and Cursor every day, I have yet to see AI systems achieve "engineering taste", which seems far easier than the "research taste" discussed by the OP. In my experience, these systems cannot perform medium-term planning and execution of tasks, even ones that are clearly within distribution.
Perhaps the apparent limitations come down to the independent probability of things going wrong when you aren't maintaining a consistent world model or getting in-line learning and feedback.
For example, even if 90% of your actions are correct, if each one can independently tank the task then your probability of success after 6 actions is roughly a coin flip (0.9^6 ≈ 0.53). I think you can see the contours of this effect in CPP (Claude Plays Pokémon). So while I like METR's proposed metrics in task-space, the "scaling curve" they show may not hold: the tasks that populate the y-axis are things that are in principle "one-shottable" by this set of model architectures, and thus don't suffer from compounding independent errors. This all leads me to believe that the "research taste" discussed by the OP is a lot further off, ultimately pushing take-off scenarios back.
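To make the back-of-the-envelope math concrete, here's a toy sketch of that compounding-error model, assuming (as a deliberate simplification) that every step must succeed, errors are independent, and any single error is unrecoverable:

```python
# Toy model: probability an agent finishes a task when every step must
# succeed, errors are independent, and a single error sinks the task.
def task_success_prob(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

if __name__ == "__main__":
    for n in (1, 6, 20, 60):
        # At p_step = 0.90 and 6 steps this is ~0.53, i.e. a coin flip.
        print(f"p_step=0.90, steps={n:>3}: P(success) = {task_success_prob(0.90, n):.3f}")
```

Under these assumptions, even a 90%-reliable step policy collapses quickly with task length, which is why short "one-shottable" benchmark tasks can look much rosier than long-horizon agentic work.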