Models not making it clear when they're roleplaying seems like a fairly big issue
There are several examples of smart but non-expert people using LLMs to "work on" difficult scientific questions, e.g. a little while back, the former CEO of Uber:

> “I’ll go down this thread with GPT or Grok and I’ll start to get to the edge of what’s known in quantum physics and then I’m doing the equivalent of vibe coding, except it’s vibe physics,” Kalanick said, as spotlighted by Gizmodo.
>
> “And we’re approaching what’s known,” he enthused. “And I’m trying to poke and see if there’s breakthroughs to be had. And I’ve gotten pretty damn close to some interesting breakthroughs just doing that.”

It seems to me that a reasonable explanation for what is happening when LLMs do vibe-physics, or whatever it is they do when they talk about initializing recursion to break a spiral containment protocol and drive people insane, is that the model is engaging in some kind of role play without the user having caught on. Below is an example of an interaction I think is illustrative/representative.

Example

Yesterday I was playing with Gemini 3. I was sitting with my friend, asking Nano Banana to make visualizations of various theorems. One of them was the Gershgorin disk theorem, which apparently looks like this:

Then my friend tells me the theorem sucks because it doesn't work for vector spaces over finite fields. So I ask the model to generalize the disk theorem to finite fields. (This doesn't really make sense.) It responds:

> That's a great question that gets to the heart of the difference between continuous fields like the complex numbers (C) and discrete fields like finite fields (F_q).
>
> There is no direct geometric analogue of the Gershgorin Circle Theorem for finite fields in the way... [] ... No Distance in Finite Fields: A finite field F_q is an algebraic structure, not a geometric one. There is no natural notion of "distance" or "absolute value" that behaves like it does in C. You can't define a "disk" of radius R around a point in a meaningful geometric way... (all
Pretty strongly disagree with all this, and find the reasoning confused. No offense. My pov is that "value drift" comes down to two things.
I think (2) is good, and (1) is almost certainly bad. I think people confuse (1) and (2) a lot. I think you do a subtle form of this in this post.
The only case where (1) isn't bad is if you very precisely value the process of terminal values changing itself. Which some people say they value. I think...