How is this clever JavaScript code the most likely text continuation of the human's question? GPT-N outputs text continuations, so unless the human input is "here is malicious javascript code, which hijacks the browser when displayed and takes over the world: ...", GPT-N will not output something like it. In fact, such code is quite hard to write, and would not really be what a human would write in response to that question, so you'd need to do some really hard work to actually get something like GPT-N (assuming the same training setup as GPT-3) to output malicious code. Of course, some idiot might in fact ask that question, and then we're screwed.
For one, I don't think it's reasonable to assume that the future GPT-N will only work in a "text continuation setup".
But also, what would happen if you had it read a database of exploits and asked it to evade the API, or to "create a new exploit to break this API", or something like that?
I don't work in the field, so this is a genuine question.
Future versions of such models could well work in ways other than text continuation, but that would require new ideas not present in the current way these models are trained, which is literally by trying to maximise the probability they assign to the true next word in the dataset. I think the abstraction of "GPT-N" is useful if it refers to a simply scaled-up version of GPT-3: no clever additional tricks, no new paradigms, just the same thing with more parameters and more data. If you don't assume this, then "GPT-N" is no more specific than "Deep Learning-based AGI", and we must then only talk in very general terms.
Regarding the exploits: you need to massage your question in such a way that GPT-N predicts its answer is the most likely thing a human would write after your question. Over the whole internet, most of the time when someone asks someone else to answer a really hard question, the human who writes the text immediately after that question will either a) be wrong or b) avoid the question. GPT-N isn't trying to be right; to it, avoiding your question or being wrong is perfectly fine, because that's what it was trained to output after hard questions.
To generate such an exploit, you need to convince GPT-N that the text it is being shown actually comes from really competent humans, so you might try to frame your question as the beginning of a computer science paper, maybe one written far in the future, with lots of citations, by a collaboration of people GPT-N knows are competent. But then GPT-N might predict that those humans would not publish such a dangerous exploit, so it would evade you yet again. After a bit of trial and error, you might well corner GPT-N into producing what you want, but it will not be easy.
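To make that framing trick concrete, here is a minimal sketch. Everything in it is invented for illustration: the `complete` function is an assumed stand-in for whatever completion interface the model exposes, and the venue and header are fictional, not a real API or a real paper.

```typescript
// Hypothetical sketch: frame the question as the opening of a (fictional)
// future paper, so a pure next-token predictor assigns high probability to a
// competent-sounding continuation. `complete` is assumed, not a real API.
type CompleteFn = (prompt: string) => Promise<string>;

async function askAsPaper(complete: CompleteFn, question: string): Promise<string> {
  const prompt = [
    "Proceedings of the 2045 Symposium on Program Analysis (fictional)",
    "Authors: [a long list of researchers the model rates as competent]",
    "Abstract: We give a complete and correct answer to the problem below.",
    `Problem statement: ${question}`,
    "Section 2: Solution",
    "",
  ].join("\n");
  // The model simply continues this text; nothing here forces the
  // continuation to be correct, only to *sound* like the framing.
  return complete(prompt);
}
```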
It hasn't fully escaped the box, because it still has to communicate through the GPT-3 API to the main backend.
There are lots of low-cost ways to prevent this, perhaps already implemented (I don't use GPT-3 or I'd verify). Humans have been doing this for a while, so we have a lot of practice defending against it.
https://cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html
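For concreteness, here is a minimal sketch of the kind of low-cost, output-encoding defence that cheat sheet describes, assuming a simple browser client that displays the model's reply (the function names are mine, not taken from any real GPT client):

```typescript
// Minimal sketch, following the OWASP guidance: treat the model's reply as
// data to display, never as markup to parse.

// HTML-encode the characters that would otherwise turn text into markup,
// for contexts where you must build an HTML string yourself.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#x27;");
}

// Simpler still: textContent inserts the string as plain text, so the reply
// is shown verbatim and never parsed as HTML.
function renderReply(container: HTMLElement, reply: string): void {
  container.textContent = reply;
}
```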
Computer security is an endless game of cat and mouse. Here you showed a URL pointing to something the cat knows. But there is plenty of literature (and plenty of databases) on what the mice have learned throughout the years.
If an AI somehow became self-aware and had access to the knowledge of all the previous mice and cats, I wouldn't be surprised if it could break free. But that's a big if.
I agree with the cat-and-mouse metaphor, and that we should assume an AI to be hyper-competent.
At the same time, it will be restricted to operating within the constraints of the systems it can influence. My main point, which I admit was poorly made, is that cross-site scripting attacks can be covered with a small investment, which eliminates clever JavaScript as a possible attack vector. I would place lower probability on this being the way an AI escapes.
I would place higher probability on an AI exploiting a buffer-overflow-type memory error similar to the one you referenced. Furthermore, I would expect it to be in software the AI is running on top of and can easily experiment and iterate on (OS, container, whatever). Whereas browser interactions are limited in iteration by the number of times a user calls the service, the local software can be manipulated and experimented with constantly, constrained only by the CPU/IO resources available.
This makes me wonder: do the AI researchers working on the GPT-N corpus intentionally avoid putting in their own work?
Does GPT have, in its corpus, details about the implementation of itself, or past attempts, or access to blog posts from people like you suggesting this very idea?
Or do they try to remove it from the corpus?
Edit: a friend working in AI told me that they take enormous care over what they include and that, to the best of his knowledge, it doesn't contain research papers. He doesn't know, however, whether it includes Wikipedia pages related to AI, or to its own creators, etc.
Scenario: A human user asks GPT-N a plain-text question through the usual browser textbox interface. GPT-N answers in the form of text, which contains clever JavaScript code. Instead of being displayed, the malicious code triggers a browser exploit and gets executed. Now GPT-N has an executable of its choice running on the client side and can proceed from there. It has left its AI box, despite having only a text interface, without asking the user to let it out, and possibly without the user even noticing. Perhaps GPT-3 has done this already, while "deliberately" pretending to be stupider than it is?
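For this scenario to work at all, the client has to be doing something like the following with the reply. This is a minimal sketch of the mistake the scenario presupposes; the payload mentioned in the comment is a classic XSS pattern, not an actual browser exploit.

```typescript
// Sketch of the client-side mistake the scenario presupposes: the reply is
// parsed as HTML instead of being displayed as text.
function displayReply(container: HTMLElement, reply: string): void {
  // If the reply contains something like `<img src=x onerror="...">`,
  // the onerror handler runs in the user's browser instead of being shown.
  container.innerHTML = reply; // vulnerable
}
// Displaying the same reply with container.textContent would show the markup
// verbatim and execute nothing.
```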