Don't get me wrong, I completely agree that lacking a clear argument for how it's dangerous is not enough to assume it's safe. It's just that the whole "alien actress" metaphor rubs me the wrong way, as it implies that the danger comes from the shoggoth having some kind of goals of its own outside of "acting". In my view the dangerous part is the simulacra.
Is there an argument for where the shoggoth's agency comes from? I can understand why it's useful to think of the mask (or simulated human) as an agent, though not in our world but in the "matrix" the shoggoth controls. I can also understand that the shoggoth must be really good at choosing very precise parameters for the simulation (or acting) in order to simulate (or play) exactly the character that is most likely to write the next token in a very specific way. That seems very intelligent, but I don't get why the shoggoth would tend to develop some kind of agency of its own. Can someone elaborate on this?
Given that in the limit (infinite data and infinite parameters in the model) LLMs are world simulators with tiny simulated humans inside writing text on the internet, the pressure applied to that simulated human is not to understand our world, but to understand that simulated world and be an agent inside it. Which I think gives some hope.
Of course real-world LLMs are far from that limit, and we have no idea which path to that limit gradient descent takes. Eliezer famously argued about the whole "simulator vs predictor" thing, which I think is relevant to that intermediate state far from the limit.
Also, RLHF applies additional weird pressures, for example a pressure to be aware that it's an AI (or at least to pretend that it's aware, whatever that might mean), which makes fine-tuned LLMs actually less safe than raw ones.
True, but you can always wriggle out by saying that none of that counts as "truly understanding". Yes, LLMs' capabilities are impressive, but does drawing SVG change the fact that somewhere inside the model all of these capabilities are represented by "mere" number relations?
Do LLMs "merely" repeat the training data? They do, but do they do it "merely"? There is no answer unless somebody gives a commonly accepted criterion of "mereness".
The core issue with that is of course that no one has a more or less formal and comprehensive definition of "truly understanding" that everyone agrees with, so you can play with words however you like to rationalize whatever prior you had about LLMs.
Substituting one vaguely defined concept of "truly understanding" with another vaguely defined concept of a "world model" doesn't help much. For example, does "this token is often followed by that token" constitute a world model? If not - why not? It is really primitive, but who said a world model has to be complex and have something to do with 3D space or theory of mind to be a world model? Isn't our manifest image of reality also a shadow on the wall, since it lacks "true understanding" of the underlying quantum fields or superstrings or whatever, in the same way that a long list of correlations between tokens is a shadow of our world?
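To make the "this token is often followed by that token" case concrete, here is a minimal sketch of about the most primitive candidate "world model" I can think of: a table of bigram counts. The toy corpus and the function names (`bigram_table`, `most_likely_next`) are made up for illustration, and this has nothing to do with how an actual LLM works inside.

```python
from collections import Counter, defaultdict


def bigram_table(tokens):
    """Count, for each token, which tokens follow it and how often."""
    table = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        table[current][following] += 1
    return table


def most_likely_next(table, token):
    """Return the most frequent follower of `token`, or None if it was never followed."""
    followers = table.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus, split on whitespace.
corpus = "the cat sat on the mat and the cat ran".split()
table = bigram_table(corpus)
guess = most_likely_next(table, "the")
```

Whether a table like this counts as a world model is exactly the kind of question that can't be settled without a concrete criterion.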
The "stochastic parrot" argument has been armchair philosophizing from the start, so no amount of evidence like that will convince people who take it seriously. Even if an LLM-based AGI takes over the world, the last words of such a person are gonna be "but that's not true thinking". And I'm not using that as a strawman - there's nothing wrong with a priori reasoning as such, unless you're doing it wrong.
I think the best response to "stochastic parrot" is asking three questions:
1. What is your criterion of "truly understanding"? Answer concretely, in terms of the structure or behavior of the model itself, and without circular definitions like "having a world model", which is defined as "conscious experience", which in turn is defined as "feeling the redness of red", etc. Otherwise the whole argument becomes completely orthogonal to any reality at all.
2. Why do you think LLMs do not satisfy that criterion and the human brain does?
3. Why do you think it is relevant for any practical intents and purposes, for example to the question "will it kill you if you turn it on"?
Okay, let's imagine that you do that experiment 9999999 times, and then you get back all your memories.
You'd still better drink. The probabilities don't change. Yes, if you are consistent in your choice (which you should be) - you have a 0.1 probability of being punished again and again and again. You also have a 0.9 probability of being rewarded again and again and again.
Of course that seems counterintuitive, because in real life the prospect of "infinite punishment" (or nearly infinite punishment) is usually something to be avoided at all costs, even if you don't get the reward. That's because in real life your utility scales highly non-linearly: even if a single punishment and a single reward have equal utility magnitude, 9999999 punishments in a row are a larger utility loss than the utility gain from 9999999 rewards.
Also, in real life you don't lose your memory every 5 seconds, so you have a chance to learn from your mistakes.
But if we're talking about spherical decision theory in a vacuum - you should drink.
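Here is a rough sketch of the arithmetic behind this, under assumptions that are mine and not part of the original setup: the outcome is the same in every round (punished every time with probability 0.1, rewarded every time with probability 0.9), a single reward and a single punishment both have magnitude 1, and the non-linearity is modeled with made-up exponents (`gain_exp`, `loss_exp`) on the number of rounds.

```python
def expected_utility(n, p_reward=0.9, gain_exp=1.0, loss_exp=1.0):
    """Expected utility of always drinking over n perfectly correlated rounds.

    Total gain is modeled as n**gain_exp and total loss as n**loss_exp;
    with both exponents equal to 1.0, utility is linear in the number of rounds.
    """
    p_punish = 1.0 - p_reward
    gain = n ** gain_exp
    loss = n ** loss_exp
    return p_reward * gain - p_punish * loss


rounds = 9_999_999

# "Spherical decision theory in a vacuum": utility adds up linearly.
linear = expected_utility(rounds)

# Hypothetical "real life" shape: rewards saturate, punishments compound.
nonlinear = expected_utility(rounds, gain_exp=0.5, loss_exp=1.5)

# With a single round the two shapes agree, since 1 to any power is 1.
single_shot = expected_utility(1, gain_exp=0.5, loss_exp=1.5)
```

With linear utility the expected value of drinking stays positive no matter how many rounds there are. With these particular exponents it is positive for one round but strongly negative for 9999999, which is the intuition about "nearly infinite punishment".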
I guess you've made it more confusing than it needs to be by introducing memory erasure into this setup. For all intents and purposes it's equivalent to saying "you have only one shot", and after the memory erasure it's not you anymore, but a person equivalent to the other version of you in the next room.
So what we've got is many different people in different spacetime boxes, each with only one shot, and yes, you should drink. Yes, you have a 0.1 chance of being punished. But who cares, if they will erase your memory anyway?
Actually we are kinda living in that experiment - we're all gonna die eventually, so why bother doing stuff if you won't care after you die? But I guess we just got used to suppressing that thought, otherwise nothing would get done. So drink.
Yeah, I realize that the whole "shoggoth" and "mask" distinction is just a metaphor, but I think it's a useful one. It's there in the data - in the infinite data and infinite parameters limit the model is an accurate universe simulator, including the humans writing text on the internet, plus, separately, the system that tweaks the parameters of the simulation according to the input. That of course doesn't necessarily mean that actual LLMs far away from that limit reflect that distinction, but it seems natural to me to analyze the model's "psychology" in those terms. One can even speculate that the layers of neurons closer to the input are "more shoggoth" and the ones closer to the output are "more mask".
I would not. Being vaguely kinda sorta human-like doesn't mean safe. Even regular humans are not aligned with other humans. That's why we have democracy and law. And kinda-sorta-humans with superhuman abilities may be even less safe than any old half-consequentialist half-deontological quasi-agent we can train with pure RLHF. But who knows.
True. All that incredible progress of modern LLMs is just a set of clever optimization tricks over RNNs that made them less computationally expensive. That doesn't say anything about agency or safety though.
Sorry, looks like I wasn't very clear. My point is not that a stateless function can't be agentic when looping around a state. Any computable process can be represented as a stateless function in a loop, as any functional bro knows. And of course LLMs do keep state around.
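For anyone who isn't a functional bro, here is a minimal sketch of what "a stateless function in a loop" means. The names `step` and `run` and the running-sum behavior are made up for illustration; the point is only that `step` keeps nothing between calls and all the state lives in the loop that threads it through.

```python
from typing import List, Tuple


def step(state: int, observation: int) -> Tuple[int, int]:
    """Pure function: maps (state, observation) to (new_state, output).

    It reads and writes nothing outside its arguments and return value.
    """
    new_state = state + observation
    return new_state, new_state


def run(observations: List[int], initial_state: int = 0) -> List[int]:
    """The loop owns the state and feeds it back into the stateless step."""
    state = initial_state
    outputs = []
    for obs in observations:
        state, out = step(state, obs)
        outputs.append(out)
    return outputs


totals = run([1, 2, 3])
```

Calling `step` twice with the same arguments always gives the same result, yet `run` as a whole behaves like a process with memory.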
Some kind of state/memory (or a good enough ability to observe the environment) is necessary for agency, but not sufficient. All existing agents we know of are agents because they were specifically trained for agency. A chess AI is an agent on the chess board because it was trained specifically to do things on the chess board, i.e. to win the game. The human brain is an agent in the real world because it was specifically trained to do stuff in the real world, i.e. to survive on the savannah and make more humans. Then of course the real world changed, and proxy objectives like "have sex" stopped being correlated with the meta-objective "make more copies of your genes". But the agency in the real world was there in the data from the start, it didn't just pop up from nothing.
The shoggoth wasn't trained to do stuff in the real world. It is trained to output the parameters of a simulation of a virtual world, and the simulator part is trained to simulate that virtual world in such a way that the tiny simulated human inside writes text on its tiny simulated computer, and that text must be the same as the text that real humans in the real world would write given the previous text. That's the setup. That's what the shoggoth does in the limit.
Agency (and consequentialism in particular) is when you output stuff to the real world and get rewarded depending on what the real world looks like as a consequence of your output. There is no correlation between what the shoggoth (or any given LLM as a whole, for that matter) outputs and whatever happens in the real world as a consequence, at least not one that the shoggoth (I mean the gradient descent that shapes it) would get any feedback on. The training data doesn't care, it's static. And there are no such correlations in the data in the first place. So where does the shoggoth's agency come from?
RLHF, on the other hand, does feed back around. And that is why I think RLHF can potentially make LLMs less safe, not more.
I would argue that in the LLM case this emerging prediction-utility is not a thing at all, since there's no pressure on the shoggoth (or the LLM as a whole) to measure it somehow. What would it do, knowing that it just made a mistake? Apologize and rewrite the paragraph? That's not how texts on the internet work. Again, agents get feedback from the environment signaling that the plan didn't work. That's not the case with LLMs. But that's beside the point, so let's say that this utilitarian behavior does indeed emerge. Does this prediction-utility have anything to do with consequences in the real world? Which world is that world-model a model of? A chess AI clearly does have a "winning utility", it's an agent, but only in the small world of the chess board.
I guess it's plausible that there is a planning mechanism somewhere inside LLMs. But it's not planning on the shoggoth's part. I can imagine the simulator part "thinking": "okay, this simulation sequence doesn't seem very realistic, let's try it this way instead", but again, that's not planning in the real world, it's planning how to simulate a virtual one.
Agree.