Where can I find a post or article arguing that the internal cognitive model of contemporary LLMs is quite alien, strange, non-human, even though they are trained on human text and produce human-like answers, which are rendered "friendly" by RLHF?

To be clear, I am not asking about the following, which I am familiar with:

  • The origin of the shoggoth meme and its relation to H.P. Lovecraft's shoggoth
  • The notion that the space of possible minds is very large, with humanity only a small part
  • Eliezer Yudkowsky's description of evolution as Azathoth, the blind idiot god, as a way of showing that "intelligences" can be quite incomprehensible
  • The difference in environments between the training and the runtime phase of an LLM
  • The fact that machine-learning systems like LLMs are not really neuromorphic; they are structured differently from human brains (though that does not exclude the possibility of similarity on a logical level)

Rather, I am looking for a discussion of evidence that an LLM's internal "true" motivation or reasoning system is very different from a human's, despite the human-like output, and that in outlying environmental conditions, very different from the training environment, it will behave very differently. A good argument might analyze bits of weird inhuman behavior to try to infer the internal model.

(All I found on the shoggoth idea on LessWrong is this article, which contrasts the idea of the shoggoth with the idea that there is no coherent model, but does not explain why we might think that there is an alien cognitive model. This one likewise mentions the idea but does not argue for its correctness.)

[Edit: Another user corrected my spelling: shoggoth, not shuggoth.]


SOTA LLMs seem to be wildly, wildly superhuman at literal next-token prediction.

It's unclear if this implies fundamental differences in how they work versus different specializations.

(It's possible that humans could be trained to be much better at next-token prediction, but there isn't an obvious methodology that works for this, based on initial experiments.)

Thank you. 

> It's unclear if this implies fundamental differences in how they work versus different specializations.

Correct. That article argues that LLMs are more powerful than humans at this skill, but not that they have different (implicit) goal functions or that their cognitive architecture is deeply different from a human's.

For a back-and-forth on whether "LLMs are shoggoths" is propaganda, try reading this.

In my opinion, if you read the dialogue, you'll see the meaning of "LLMs are shoggoths" shift back and forth -- from "it means LLMs are psychopathic" to "it means LLMs think differently from humans." There isn't a fixed meaning.

I don't think trying to disentangle the "meaning" of shoggoths is going to result in anything; it's a metaphor, some of whose readings are obviously true ("we don't understand all cognition in LLMs") and some of which are dubiously true ("LLMs' 'true goals' exist, and are horrific and alien"). But regardless of the truth of these propositions, you do better examining them one by one than through an emotionally loaded image.

It's sticky because it's vivid, not because it's clear; it's reached for as a metaphor -- like "this government policy is like 1984" -- because it's a ready-to-hand example with an obvious emotional valence, not for any other reason.

If you were to try to zoom into "this policy is like 1984" you'd find nothing; so also here.

Can you say more about what you mean by "Where can I find a post or article arguing that the internal cognitive model of contemporary LLMs is quite alien, strange, non-human, even though they are trained on human text and produce human-like answers, which are rendered "friendly" by RLHF?"

Like, obviously it's gonna be alien in some ways and human-like in other ways. Right? How similar does it have to be to humans, in order to count as not an alien? Surely you would agree that if we were to do a cluster analysis of the cognition of all humans alive today + all LLMs, we'd end up with two distinct clusters (the LLMs and then humanity) right? 


> Like, obviously it's gonna be alien in some ways and human-like in other ways. Right?

It has been said that since LLMs predict human output, they will, if sufficiently improved, be quite human: that they will behave in a quite human way.

> Can you say more about what you mean by "Where can I find a post

As part of a counterargument to that, we could find evidence that their logical structure is quite different from a human's. I'd like to see such a write-up.

> Surely you would agree that if we were to do a cluster analysis of the cognition of all humans alive today + all LLMs, we'd end up with two distinct clusters (the LLMs and then humanity) right?

I agree, but I'd like to see some article or post arguing that.

OK, thanks.

Your answer to my first question isn't really an answer -- "they will, if sufficiently improved, be quite human--they will behave in a quite human way." What counts as "quite human?" Also are we just talking about their external behavior now? I thought we were talking about their internal cognition.

You agree about the cluster analysis thing though -- so maybe that's a way to be more precise about this. The claim you are hoping to see argued for is "If we magically had access to the cognition of all current humans and LLMs, with mechinterp tools etc. to automatically understand and categorize it, and we did a cluster analysis of the whole human+LLM population, we'd find that there are two distinct clusters: the human cluster and the LLM cluster."

Is that right?

If so then here's how I'd make the argument. I'd enumerate a bunch of differences between LLMs and humans, differences like "LLMs don't have bodily senses" and "LLMs experience way more text over the course of their training than humans experience in their lifetimes" and "LLMs have way fewer parameters" and "LLMs' internal learning rule is SGD whereas humans use hebbian learning or whatever" and so forth, and then for each difference say "this seems like the sort of thing that might systematically affect what kind of cognition happens, to an extent greater than typical intra-human differences like skin color, culture-of-childhood, language-raised-with, etc." Then add it all up and be like "even if we are wrong about a bunch of these claims, it still seems like overall the cluster analysis is gonna keep humans and LLMs apart instead of mingling them together. Like what the hell else could it do? Divide everyone up by language maybe, and have primarily-English LLMs in the same cluster as humans raised speaking English, and then non-English speakers and non-English LLMs in the other cluster? That's probably my best guess as to how else the cluster analysis could shake out, and it doesn't seem very plausible to me -- and even if it were true, it would be true on the level of 'what concepts are used internally' rather than more broadly about stuff that really matters, like what the goals/values/architecture of the system is (i.e. how they are used)."
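
A toy version of that cluster-analysis picture, with entirely made-up "cognition features" (nothing here is a real measurement; it only assumes numpy and scikit-learn are available). The point is just that if the listed differences show up as large, consistent feature offsets, any reasonable clustering will separate the two groups:

```python
# Toy illustration of the cluster-analysis thought experiment, with made-up
# "cognition features" (no real measurements of cognition exist). It only shows
# that large, consistent offsets dominate intra-group variation under clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical features: [has bodily senses, log(tokens of lifetime experience),
# log(parameter/synapse count), uses SGD-like learning]
humans = rng.normal(loc=[1.0, 9.0, 14.0, 0.0], scale=0.3, size=(200, 4))
llms   = rng.normal(loc=[0.0, 12.0, 11.0, 1.0], scale=0.3, size=(50, 4))

X = np.vstack([humans, llms])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# With offsets this large relative to within-group spread, the two recovered
# clusters coincide with the human/LLM split.
print(labels[:200].mean(), labels[200:].mean())  # ~0.0 and ~1.0 (or vice versa)
```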

First thoughts:

  • Context length is insanely long
  • Very good at predicting the next token
  • Knows many more abstract facts

These three things are all instances of being OOM better at something specific. If you consider the LLM somewhat human-level at the thing it does, this suggests that it's doing it in a way which is very different from what a human does.

That said, I'm not confident about this; I can sense there could be an argument that this counts as human but ramped up on some stats, and not an alien shoggoth.

> Rather, I am looking for a discussion of evidence that an LLM's internal "true" motivation or reasoning system is very different from a human's, despite the human-like output, and that in outlying environmental conditions, very different from the training environment, it will behave very differently. A good argument might analyze bits of weird inhuman behavior to try to infer the internal model.

I think we do not understand enough about either the LLM's true algorithms or humans' to make such arguments, except for basic observations like the fact that humans have non-language recurrent state which many LLMs lack.

I wouldn't say that's exactly the best argument, but for example:

As you said, this seems like a pretty bad argument.

Something is going on between the {user instruction} ..... {instruction to the image model}. But we don't even know if it's in the LLM. It could be that there are dumb manual "if" parsing statements that act differently depending on periods, etc., etc. It could be that there are really dumb instructions given to the LLM that creates the instructions for the image model, as there were for Gemini. So, yeah.

That is good, thank you.

That seems to be an argument for something more than random noise going on, but not an argument for 'LLMs are shoggoths'?

Definition given in post: 

> I am looking for a discussion of evidence that an LLM's internal "true" motivation or reasoning system is very different from a human's, despite the human-like output, and that in outlying environmental conditions, very different from the training environment, it will behave very differently.

I think my example counts.

I'm not totally sure the hypothesis is well-defined enough to argue about, but maybe Gary Marcus-esque analysis of the pattern of LLM mistakes?

If the internals were like a human thinking about the question and then giving an answer, it would probably be able to add numbers more reliably. And I also suspect the pattern of mistakes doesn't look typical for a human at any developmental stage (once a human can add 3-digit numbers, their success rate at 5-digit numbers is probably pretty good). I vaguely recall some people looking at this, but have forgotten the reference, sorry.
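
A rough sketch of the kind of probe this suggests. Everything here is hypothetical: `ask_model` is a placeholder for whatever completion API you have access to, and the prompt format is arbitrary. The idea is just to plot accuracy against digit length and compare the shape of the error curve with what you'd expect from a human:

```python
# Hedged sketch: measure addition accuracy as a function of digit length and
# inspect where (and how) it degrades. `ask_model` is a hypothetical stand-in
# for an actual LLM call; wire it up to your own API.
import random

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM call here")

def addition_accuracy(n_digits: int, n_trials: int = 50, seed: int = 0) -> float:
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        a = rng.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        b = rng.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        reply = ask_model(f"What is {a} + {b}? Answer with just the number.")
        digits_only = "".join(ch for ch in reply if ch.isdigit())
        correct += digits_only == str(a + b)
    return correct / n_trials

# e.g. compute accuracy for n_digits = 3..10 and compare the degradation curve
# with the near-flat curve you'd expect from a human who can already do 3 digits.
```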

> maybe Gary Marcus-esque analysis of the pattern of LLM mistakes?

That is good. Can you recommend one?

I believe a significant chunk of the issue with numbers is that the tokenization is bad (not per-digit), which is the same underlying cause of LLMs being bad at spelling. So the model has to memorize, from limited examples, which digits make up each number token. The xVal paper encodes numbers as literal numerical values, which helps. There is also Teaching Arithmetic to Small Transformers, which I only partly remember, but among the things they do are per-digit tokenization and reversing the digit order of the answer (because that works better with forward generation). (I don't know if anyone has applied methods in this vein to a model larger than those relatively small ones; I think the second uses a 124M-parameter model.)
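
As a sketch of what that data-formatting idea looks like (an illustration of the general idea, not the exact scheme from either paper):

```python
# Rough sketch of the formatting idea mentioned above: per-digit tokens, with the
# answer's digits reversed so the model can emit the least-significant digit first.
# Illustrative only; not the papers' exact tokenization.

def format_addition_example(a: int, b: int) -> list[str]:
    digits = lambda n: list(str(n))
    answer_reversed = digits(a + b)[::-1]        # e.g. 128 -> ['8', '2', '1']
    return digits(a) + ["+"] + digits(b) + ["="] + answer_reversed

print(format_addition_example(57, 71))
# ['5', '7', '+', '7', '1', '=', '8', '2', '1']   (57 + 71 = 128, digits reversed)
```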

Though I agree that there's a bunch of errors LLMs make that are hard for them to avoid because they have no easy temporary scratchpad-like mechanism.

They can certainly use answer text as a scratchpad (even nonfunctional text that gives more space for hidden activations to flow), but they don't without explicit training. Actually, maybe they do: maybe RLHF incentivizes a verbose style to give more room for thought. But I think even "thinking step by step," there are still plenty of issues.

Tokenization is definitely a contributor. But that doesn't really support the notion that there's an underlying human-like cognitive algorithm behind human-like text output. The point is the way it adds numbers is very inhuman, despite producing human-like output on the most common/easy cases.

I definitely agree that it doesn't give reason to support a human-like algorithm, I was focusing in on the part about adding numbers reliably.

If Earth had intelligent species with different minds, an LLM could end up identical to a member of at most one of them.

Does something like the "I have been a good Bing. 😊" thing count? (More examples.)

I'd say that's a pretty striking illustration that under the surface of a helpful assistant (in the vein of Siri et al.) these things are weird, and the shoggoth is a good metaphor.

Thank you. But being manipulative, silly, sycophantic, or nasty is pretty human. I am looking for hints of a fundamentally different cognitive architecture.

First, a factual statement that is true to the best of my knowledge: the LLM state that is used to produce the probability distribution for the next token is completely determined by the contents of its input buffer (plus a bit of nondeterminism due to parallel processing and the non-associativity of floating-point arithmetic).

That is, an LLM can pass only a single token (around 2 bytes) to its future self; that follows from the above.
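
To make that factual part concrete, here is a minimal toy sketch of the autoregressive loop (not any particular model's API): the distribution at each step is recomputed from the visible buffer alone, and the only thing handed to the "future self" is the sampled token appended to that buffer.

```python
# Toy sketch of autoregressive generation: all "state" is recomputed from the
# visible buffer each step; the only carry-over is the token just appended.
import random

def next_token_distribution(buffer: list[str]) -> dict[str, float]:
    # Stand-in for the forward pass: in a real LLM this is a deterministic
    # function of the buffer (up to floating-point nondeterminism).
    vocab = ["the", "cat", "sat", "."]
    return {tok: 1.0 / len(vocab) for tok in vocab}

def generate(prompt: list[str], n_steps: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    buffer = list(prompt)
    for _ in range(n_steps):
        dist = next_token_distribution(buffer)   # state := f(buffer), nothing else
        tokens, weights = zip(*dist.items())
        tok = rng.choices(tokens, weights=weights, k=1)[0]
        buffer.append(tok)                       # the single token passed forward
    return buffer

print(generate(["the", "cat"], n_steps=4))
```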

What comes next is a plausible (to me) speculation.

For humans, what's passed to our future self is most likely much more than a single token. That is, the state of the human brain that leads to writing (or uttering) the next word most likely cannot be derived from a small subset of the previous state plus the last written word (that is, the state of the brain changes not only because we have written or said a word, but by other means too).

This difference can lead to completely different processes that the LLM uses to mimic human output, that is, potential shoggothification. But to be a real shoggoth, the LLM also needs a way to covertly update its shoggoth state, that is, the part of its state that can lead to inhuman behavior. The output buffer is the only thing it has for maintaining state, so the shoggoth state would have to be steganographically encoded in it, severely limiting its information density and update rate.

I wonder how a shoggoth state could arise at all, but that might be my lack of imagination.