You're asking about pure predictive (a.k.a. self-supervised) learning. As far as I know, it's an open question what the safety issues are for that (if any), even in a very concrete case like "this particular Transformer architecture trained on this particular dataset using SGD". I spent a bit of time last summer thinking about it, but didn't get very far. See my post self-supervised learning and manipulative predictions for one particular possible failure mode that I wasn't able to either confirm or rule out. (I should go back to it at some point.) See also my post self-supervised learning and AGI safety for everything else I know on the topic. And of course I must mention Abram's delightful Parable of Predict-o-matic if you haven't already seen it; again, this is high-level speculation that might or might not apply to any particular concrete system ("this particular Transformer architecture trained by SGD"). Lots of open questions!
An additional set of potential problems comes from your suggestion to put it in a robot body and actually execute the commands. Can it even walk? Of course it could figure out walking if we let it try with a reward signal, but then we're not talking about pure predictive learning anymore. Hmm, after thinking about it, I guess I'm cautiously optimistic that, in the limit of infinite training data from infinitely many robots learning to walk, a large enough Transformer doing predictive learning could learn to read its own sense data and walk without any reward signal. But then how do you get it to do useful things? My suggestion here was to put a metadata flag on inputs where a robot is being super-helpful, and then turn that flag on when you have the robot start acting in the real world. Now we're bringing in supervised learning, I guess.
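To make the metadata-flag idea concrete, here's a rough sketch in the spirit of conditional generation with a control token. Everything here (HELPFUL_FLAG, make_training_example, model.generate) is a made-up placeholder for illustration, not a claim about how any real system is built:

```python
# Minimal sketch, assuming a purely predictive (next-token) model that we want
# to condition on a "super-helpful" metadata flag. All names are hypothetical.

HELPFUL_FLAG = "<|helpful|>"   # marks training episodes where a robot was being super-helpful
NEUTRAL_FLAG = "<|neutral|>"   # marks everything else in the training corpus

def make_training_example(episode_text, was_super_helpful):
    # During training, the flag just records what kind of episode this was;
    # the model is still trained only to predict the next token of the sequence.
    flag = HELPFUL_FLAG if was_super_helpful else NEUTRAL_FLAG
    return flag + episode_text

def act(model, observation_so_far):
    # At deployment, force the flag on, so the model predicts (and therefore
    # generates) the kind of continuation that followed the flag in training.
    prompt = HELPFUL_FLAG + observation_so_far
    return model.generate(prompt)
```

This is basically conditional generation via a control token, and as I said, the labeling of which episodes count as "super-helpful" is where the supervision sneaks in.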
If the robot were actually capable of doing anything at all, I would be very concerned that you press "go" and the system wanders farther and farther out of distribution, doing weird, dangerous things that have a high impact on the world.
As for concrete advice for the GPT-7 team: I would suggest at least throwing out the robot body and making a text / image prediction system in a box, with a human in the loop looking at the screen before anything goes out into the world. This can still be very powerful and economically useful, and it's a step in the right direction: it eliminates the problem of the system just going off and doing something weird and high-impact in the world because it wandered out of distribution. It doesn't eliminate all problems, because the system might still become manipulative. As I mentioned in the first paragraph, I don't know whether that's a real problem or not; more research is needed. It's possible that we're all just doomed in your scenario. :-)
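Just to be concrete about what I mean by "human in the loop": something like the following gate, where the system only ever proposes and a person decides whether anything leaves the box. Purely illustrative; the interface is invented:

```python
# Illustrative human-in-the-loop gate: the predictive system only proposes;
# nothing touches the world until a person has looked at it and approved it.
# model.generate() is a hypothetical interface, not a real API.

def human_in_the_loop(model, prompt):
    proposal = model.generate(prompt)        # text/image proposal, never executed directly
    print("Proposed output:\n", proposal)
    verdict = input("Approve? [y/N] ")
    if verdict.strip().lower() == "y":
        return proposal                      # a human carries it out (or publishes it)
    return None                              # rejected: nothing happens in the world
```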
Can you give more details on how it works? I'm imagining that it's rewarded partly for accurate predictions and partly for fulfilled commands, which means there must be some algorithm that detects whether a command has been fulfilled. How is that algorithm built or trained?
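To pin down what I'm imagining (all names here are invented; this is just to sharpen the question, not a guess about your actual implementation):

```python
# Hypothetical sketch of the objective I'm imagining: the system is rewarded
# partly for accurate predictions and partly for fulfilled commands.

LAMBDA_FULFILL = 0.5  # arbitrary weighting between the two terms

def fulfillment_detector(command, outcome):
    # This is the piece I'm asking about: how is this detector built or trained?
    raise NotImplementedError("the crux of the question")

def training_objective(prediction_loss, command, outcome):
    # Lower is better: accurate predictions and fulfilled commands both reduce the loss.
    fulfilled = fulfillment_detector(command, outcome)  # presumably something in 0.0 .. 1.0
    return prediction_loss - LAMBDA_FULFILL * fulfilled
```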
Nod. And does it seem to have the ability to gain new cognitive skills? Like, if it reads a bunch of LessWrong posts or attends CFAR, does its 'memory' start to include things that prompt it to, say, "stop and notice it's confused" and "form more hypotheses when facing weird phenomena" and "cultivate curiosity about its own internal structure"?
(I assume so, just double-checking.)
In that case, it seems like the most obvious way to keep it friendly is the same way you'd keep a human friendly (expose it to ideas you think will guide it on a useful moral trajectory).
I'm not actually sure what sort of other actions you're allowing in the hypothetical.