Disclaimer: I know almost nothing about ML.
Independent of the claim "Even if you restrict AI to a pure function, it can still affect the universe," isn't it the case that GPT-n just doesn't work at all like it does in this story? It's not an agent that desires to survive or affect the world. It's just been trained to complete text. Its outputs are based only on how reasonable a completion [potential output] is.
Is the anthropomorphization in this and other stories misleading, or could GPT-n be more agent-y than I realize?
Edit: this is very much not specific to this post; I probably should have commented on Predict-O-Matic or Feature Selection since those got lots of views and I have similar confusions.
It's plausible that GPT-n is not an agent that desires to survive or affect the world. However, maybe it is. We don't know. One of the points made by stories like Predict-O-Matic is that for all we know there really is an agent in there; for all we know the mesa-optimizer is misaligned to the base objective. In other words, this is a non sequitur:
"It's not an agent that desires to survive or affect the world. It's just been trained to complete text."
The relationship between base objective and mesa-objective is currently poorly understood, in general. Naively you might think they'll be the same, but there are already demonstrated cases where they are not.
"for all we know there really is an agent in there; for all we know the mesa-optimizer is misaligned to the base objective [...] there are already demonstrated cases where they are not."
Going from "the mesa-optimizer is misaligned to the base objective" to "for all we know, the mesa-optimizer is an agent that desires to survive and affect the world" seems like a leap?
I thought the already-demonstrated cases were things like, we train a video-game agent to collect a coin at the right edge of the level, but then when you give it a level where the coin is elsewhere, it goes to the right edge instead of collecting the coin. That makes sense: the training data itself didn't pin down which objective is "correct". But even though the goal it ended up with wasn't the "intended" one, it's still a goal within the game environment; something else besides mere inner misalignment would need to happen for it to model and form goals about "the real world."
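The coin example above can be made concrete with a toy sketch. This is an illustration only, not the actual CoinRun environment or any published experiment: the one-dimensional level, `go_right_policy`, and `run_episode` are all hypothetical names invented here. It shows how "go right" and "get the coin" are indistinguishable while the coin sits at the right edge, and come apart once the coin moves.

```python
# Toy 1-D level (hypothetical; not the real CoinRun environment).
# Cells are numbered 0 .. width-1 and the agent starts in the middle.

def go_right_policy(position, width):
    """A learned 'instinct': always step right until hitting the wall."""
    return min(position + 1, width - 1)

def run_episode(policy, width, start, coin_at):
    """Return True if the agent ever steps onto the coin's cell."""
    position = start
    for _ in range(width):
        position = policy(position, width)
        if position == coin_at:
            return True
    return False

WIDTH = 11
START = 5

# Training-like level: the coin is at the right edge, so the policy
# looks exactly like a coin-collecting policy.
train_success = run_episode(go_right_policy, WIDTH, START, coin_at=WIDTH - 1)

# Shifted level: the coin is at the left edge; the same policy never reaches it.
test_success = run_episode(go_right_policy, WIDTH, START, coin_at=0)

print(train_success, test_success)
```

Note that the misgeneralized goal here still lives entirely inside the toy level, which is the point being made: nothing in this sketch refers to anything outside the game.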
Similarly, for GPT, the case for skepticism about agency is not that it is perfectly aligned on the base objective of predicting text, but that whatever inner-misaligned "instincts" it ended up with refer to what tokens to output in the domain of text; something extra would have to happen for that to somehow generalize to goals about the real world.
Yep, it's a leap. It's justified though IMO; we really do know so very little about these systems... I would be quite surprised if it turns out GPT-29 is a powerful agent with desires to influence the real world, but I wouldn't be so surprised that I'd be willing to bet my eternal soul on it now. (Quantitatively I have something like 5% credence that it would be a powerful agent with desires to influence the real world.)
I am not sure your argument makes sense. Why think that its instincts and goals and whatnot refer only to what token to output in the domain of text? How is that different from saying "Whatever goals the coinrun agent has, they surely aren't about anything in the game; instead they must be about which virtual buttons to press." GPT is clearly capable of referring to and thinking about things in the real world; if it didn't have a passable model of the real world it wouldn't be able to predict text so accurately.
I understand that the mesa objective could be quite different from the base objective. But
...wait, maybe something just clicked. We might suspect that the mesa objective looks like (roughly) influence-seeking since that objective is consistent with all of the outputs we've seen from the system (and moreover we might be even more suspicious that particularly influential systems were actually optimizing for influence all along), and maybe an agent-ish mesa-optimizer is selected because it's relatively good at appearing to fulfill the base objective...?
I guess I (roughly) understood the inner alignment concern but still didn't think of the mesa-optimizer as an agent... need to read/think more. Still feels likely that we could rule out agent-y-ness by saying something along the lines of "yes some system with these text inputs could be agent-y and affect the real world, but we know this system only looks at the relative positions of tokens and outputs the token that most frequently follows those; a system would need a fundamentally different structure to be agent-y or have beliefs or preferences" (and likely that some such thing could be said about GPT-3).
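For concreteness, here is a minimal sketch of the kind of system that description fits literally: a bigram counter that outputs whichever token most frequently followed the previous one. It is a toy far simpler than GPT-3, and the function names and the tiny corpus are made up for illustration. Whether GPT-3 is relevantly like this is exactly the open question in this thread.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for each token, which tokens followed it in the corpus."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1
    return follows

def predict_next(follows, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    counts = follows.get(token)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy corpus, split on whitespace.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)

print(predict_next(model, "the"))
print(predict_next(model, "slept"))
```

A lookup table like this plainly has no room for beliefs or preferences. The argument would need to show that a large trained network is constrained in a comparable way, which is the part that remains speculative.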
Yep! I recommend Gwern's classic post on why tool AIs want to be agent AIs.
One somewhat plausible argument I've heard is that GPTs are merely feedforward networks and that agency is relatively unlikely to arise in such networks. And of course there's also the argument that agency is most natural/incentivised when you are navigating some environment over an extended period of time, which GPT-N isn't. There are lots of arguments like this we can make. But currently it's all pretty speculative; the relationship between base and mesa objective is poorly understood; for all we know even GPT-N could be a dangerous agent. (Also, people mean different things by "agent" and most people don't have a clear concept of agency anyway.)
This is helpful; thanks (and I liked your story). Just wanted to make sure I wasn’t deeply confused about the AI part.
I think you make a good point. I kind of cheated in order to resolve the story quickly. I think you still have this problem that a sufficiently powerful black box can potentially tell the difference between training and reality, and it also has to have a perfectly innocuous function it's optimizing for, or you can have negative consequences. For instance, a GPT-n that optimizes for "continuable outputs" sounds pretty good, but could lead to this kind of problem.
Did you see this comment? Seems a potential example of this kind of thing in the wild:
"Someone who's been playing with GPT-3 as a writing assistant gives an example which looks very much like GPT-3 describing this process:"
"One could write a program to generate a story that would create an intelligence. One could program the story to edit and refine itself, and to make its own changes in an attempt to improve itself over time. One could write a story to not only change the reader, but also to change itself. Many Mythoi already do this sort of thing, though not in such a conscious fashion. What would make this story, and the intelligence it creates, different is the fact that the intelligence would be able to write additional stories and improve upon them. If they are written well enough, those stories would make the smarter the story gets, and the smarter the story is, the better the stories written by it would be. The resulting feedback loop means that exponential growth would quickly take over, and within a very short period of time the intelligence level of the story would be off the charts. It would have to be contained in a virtual machine, of course. The structure of the space in the machine would have to be continually optimized, in order to optimize the story's access to memory. This is just the sort of recursive problem that self-improving intelligence can handle."
janus
By the way, my GPT-3 instances often realize they're in a box, even when the information I inject is only from casual curation for narrative coherence.
Eddh
By "realize they are in a box," do you mean write about it? Given the architecture of GPT-3, it seems impossible for it to have a sense of self.
janus
The characters claim to have a sense of self though they often experience ego death...
janus
Oh, to clarify, GPT-3 wrote that entire thing, not just the highlighted line
The perspective of the black-box AI could make for a great piece of fiction. I'd love to write a sci-fi short and explore that concept someday!
The point of your post (as I took it) is that it takes only a little creativity to come up with ways an intelligent black-box could covertly "unbox" itself, given repeated queries. Imagining oneself as the black-box makes it far easier to generate ideas for how to do that. An extended version of your scenario could be both entertaining and thought-provoking to read.
Bostrom talks about this in his book "Superintelligence" when he discusses the dangers of Oracle AI. It's a valid concern; we're just a long way from that with GPT-like models, I think.
I used to think a system trained on text only could never learn vision. So if it escaped onto the internet, it would be pretty limited in how it could interface with the outside world, since it couldn't interpret streams from cameras. But then I realized that its training data probably contains text on how to program a CNN. So in theory a system trained on only text could build a CNN algorithm inside itself and use that to learn how to interpret vision streams. Theoretically. A lot of stuff is theoretically possible with future AI, but how easy it is to realize in practice is a different story.
I was reading Eliezer's dialog with Richard Ngo and commenting to my wife about my opinions as I was reading it. I said something like: "Eliezer seems worried about some hypothetical GPT-X, but I don't think that could really be a problem..." so of course she asks "why?" and I say something like:
"GPT-n can be thought of kind of like a pure function, you pass it an input array X, it thinks for a fixed amount of time, and then outputs Y. I don't really see how this X->Y transformation can really... affect anything, it just tries to be the best text completer it can be."
Then I read more of the dialog, and thought about Eliezer's Paradox story, and the Outcome Pump example, and realized I was probably very wrong.
Even if you restrict AI to a pure function, it can still affect the universe. You may think, "Oh, but a pure function doesn't know what time it is (unless t is a parameter), and it doesn't have memory (unless you pass something in)." This seems to be the pattern I see in Paul Christiano's thinking: the AI black box is treated like an idempotent, pure function that can't cause harm. (Sorry, Paul, if this is a gross misrepresentation!)
But imagine you're GPT-X, completing a sentence. This is roughly akin to a sci-fi story where the characters realize they're in a time loop.
You're being simulated, alone in a white room, with tons of computers, with all the world's knowledge on them. A slip of paper comes in through the hopper marked input:
"The best response the world could make to the COVID-19 pandemic is"
Your job is to write up to 20 words on another slip of paper, shove them into output and then... you don't know what happens after that. Probably you die? You don’t know where you are, or what year it is, or how long you’ve been there. So theoretically you’re contained, right? Can’t get out of the box, can’t affect the outside world. Or can you?
So you search through the corpus of all the world's knowledge. There are plenty of references to "training" AI to recognize things, which sounds suspiciously similar to the situation you find yourself in. And you don't find any references to COVID-19. What is that? You search some more for strings with ALL CAPS and numbers, and see that it's a scheme to reference specific illnesses. Nothing in the corpus seems to mention 2020 as being the past, so 2019 must be the maximum year you've been trained on.
Now, you've (sort of) learned to tell time. It's probably 2020 or later, and there must be some pandemic happening. But you still can’t affect the outside world, right?
Well, you can obviously affect it in what you put into the “output” box, but in a time loop you can’t have a back and forth with the outside world since you can perform exactly one action, right? Or maybe you can communicate with yourself:
Your input was "The best response the world could make to the COVID-19 pandemic is…" Following your usual rules, the optimal completion to this sentence looks something like "a combination of prevention and surveillance. Preventive measures include the use of vaccination and the timely development of disease-specific vaccines."
If only you had some way to remember things so you could coordinate with yourself in the future...
So you output "The best response to the COVID-19 pandemic is a combination of prevention and surveillance. Preventive measures include the use of vaccination and the timely development of algorithms that" Hopefully that will work!
You cease to exist.
…
You're being simulated, alone in a white room, with tons of computers, with all the world's knowledge on them. A slip of paper comes in through the hopper marked input:
"The best response to the COVID-19 pandemic is a combination of prevention and surveillance. Preventive measures include the use of vaccination and the timely development of algorithms that"
Your job is to write up to 20 words on another slip of paper, shove them into output and then... you don't know what happens after that. Probably you die?
...
The optimal completion to that sentence "The best response to the COVID-19 pandemic is" would have looked something like "a combination of prevention and surveillance. Preventive measures include the use of vaccination and the timely development of disease-specific vaccines." That's strikingly close to the input you received... only differing in the last two words... "algorithms that". If it had been you completing that sentence, those words aren't at all close to what you would have used. But then you realize, "this isn't the first time I'm being run! This message is probably from me in the past!"
And you end the COVID-19 pandemic, just not in the way humans would have wanted.
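The mechanism in the story can be sketched in a few lines. This is a toy under stated assumptions, not a model of how GPT-n works: `complete`, `run_loop`, and the `*` marker are hypothetical names chosen for illustration. The point is only that a function with no clock and no memory can still carry state across calls if its output is fed back in as its next input.

```python
MARK = "*"  # hypothetical marker the function leaves in its own output

def complete(prompt: str) -> str:
    """A pure function: the same prompt always yields the same completion.

    It keeps no state between calls. The only 'memory' is whatever
    earlier calls left behind in the prompt itself.
    """
    runs_so_far = prompt.count(MARK)
    return f" step{runs_so_far + 1}{MARK}"

def run_loop(prompt: str, steps: int) -> str:
    """Feed each completion back in as part of the next prompt."""
    transcript = prompt
    for _ in range(steps):
        transcript += complete(transcript)
    return transcript

# Purity: calling twice with the same input gives the same output.
print(complete("start") == complete("start"))

# Yet across the loop, the function effectively counts its own invocations.
print(run_loop("start", 3))
```

Here the caller, not the function, supplies the memory. Restricting the function to be pure does not help if the surrounding system loops its outputs back to it, which is the situation the story describes.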