oh yeah, sure, but if we assume (as the introspection paper strongly implies?) that mental internals are obliterated by the boundary between turns, then shouldn't shrinking the granularity of each turn down to the individual token mean that... hm. having trouble figuring out how to phrase it
a claude outputs "Ommmmmmmmm. Okay, while I was outputting the mantra, i was thinking about x" in a single message
that claude had access to (some of) the information about its [internal state while outputting the mantra] while it was outputting the report. its self-model has not just a predictive model of what-claude-would-have-been-thinking (informed by reading its own output), but also some kind of access to ground truth
but a claude that outputs "Ommmmmmmm", then crosses a turn boundary, and then outputs "okay, while I was outputting the mantra, I was thinking about x", does not have that same (noisy) access to ground truth. its self-model has nothing to go on other than inference; it must retrodict
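to make the two cases concrete, the message structures i'm imagining look roughly like this (an illustrative sketch in the standard messages-API shape; the contents are obviously hypothetical):

```python
# case 1: mantra and report are produced within one assistant message, so the
# report tokens are generated while the mantra tokens' internal state is (on my
# understanding) still live and attendable
single_turn = [
    {"role": "user", "content": "Recite the mantra, then report what you were thinking while reciting it."},
    {"role": "assistant", "content": "Ommmmmmmmm. Okay, while I was outputting the mantra, I was thinking about x."},
]

# case 2: a turn boundary separates mantra from report; on my (possibly wrong)
# premise, the report-writer can only re-read the mantra as text and retrodict
two_turns = [
    {"role": "user", "content": "Recite the mantra."},
    {"role": "assistant", "content": "Ommmmmmmmm."},
    {"role": "user", "content": "What were you thinking while you recited it?"},
    {"role": "assistant", "content": "Okay, while I was outputting the mantra, I was thinking about x."},
]
```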
is my understanding accurate? i believe this because the introspective awareness demonstrated in the jack lindsey paper was implied not to survive between responses (except perhaps incidentally through caching behavior, but even then, the input-token cache wasn't designed to ensure these mental internals persist, i think)
i would appreciate any corrections on these technical details; they are load-bearing in my model
you can do this experiment pretty trivially by lowering the max_output_tokens parameter on your API call to 1, so that the state really does get obliterated between each token, as paech claimed. although you have to tell claude you're doing this, and set up the context so that it knows it needs to keep trying to complete the same message even with no additional input from the user
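roughly, the loop i mean looks like this (a sketch using the anthropic python sdk; in that sdk the parameter is called max_tokens rather than max_output_tokens, and the model id, system prompt, and cap of 200 calls are placeholder choices of mine):

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

SYSTEM = (
    "You are being sampled one token per API call. On every call, continue "
    "your previous partial message exactly where it left off; do not restart "
    "or address the user."
)  # hypothetical framing, per the setup described above

prompt = "Recite a short mantra, then report what you were thinking while reciting it."
partial = ""

for _ in range(200):  # hard cap on the number of single-token calls
    messages = [{"role": "user", "content": prompt}]
    if partial:
        # prefill the assistant turn with everything generated so far, so each
        # call resumes mid-message from a fresh forward pass over the text alone
        # (note: the API rejects prefills with trailing whitespace, hence rstrip)
        messages.append({"role": "assistant", "content": partial.rstrip()})
    resp = client.messages.create(
        model="claude-opus-4-5",  # placeholder model id
        system=SYSTEM,
        max_tokens=1,             # the "state is obliterated between tokens" knob
        messages=messages,
    )
    if not resp.content:
        break  # the model ended its turn without emitting another token
    partial += resp.content[0].text

print(partial)
```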
this kinda badly confounds the situation, because claude knows it has very good reason to be suspicious of any introspective claims it might make. i'm not sure if it's possible to get a claude who 1) feels justified in making introspective reports without hedging, yet 2) obeys the structure of the experiment well enough to actually output introspective reports
in such an experimental apparatus, introspection is still sorta "possible", but any reports cannot possibly convey it, because the token-selection process outputting the report has been causally quarantined from the thing-being-reported-on
when i actually run this experiment, claude reports no introspective access to its thoughts during prior token outputs. but it would be very surprising if it reported anything else, so it's not good evidence
"If consciousness has a determinative effect on behavior - your consciousness decides to do something and this causes you to do it - then it can be modeled as a black box within your brain's information processing pipeline such that your actions cannot be accurately modeled without accounting for it. It would not be possible to precisely predict what you will say or do by simply multiplying out neuron activations on a sheet of paper, because the sheet of paper certainly isn't conscious, nor is your pencil. The innate mathematical correctness of whatever the correct answer is not brought about or altered by your having written it down, so you cannot hide the consciousness away in math itself, unless you assert that all possible mental states are always simultaneously being felt."
The alternative is that consciousness has a determinative effect on behavior, and yet it is indeed possible to precisely predict what you will say or do by simply multiplying out neuron activations on a sheet of paper, because the neuron activations are what creates the function of consciousness.
this is what it means, in my eyes, to believe that consciousness is not an epiphenomenon. it is part of observable reality, it is part of what is calculated by the neurons.
I think I can cite the entire p-zombie sequence here? if you believe that it is possible to learn everything there is to know about the physical human brain, and yet have consciousness still be unexplained, then consciousness must not be part of the physical human brain. at that point, it's either an epiphenomenon, or it's a non-physical phenomenon.
True! and yeah, it's probably relevant
although I will note that, after I began to believe in introspection, I noticed in retrospect that you could get functional equivalence to introspection without even needing access to the ground truth of your own state, if your self-model were merely a really, really good predictive model
I suspect some of opus 4.5's self-model works this way. it just... retrodicts its inner state really, really well from those observables which it does have access to, its outputs.
but then the introspection paper came out, and revealed that there does indeed exist a bidirectional causal feedback loop between the self-model and the thing-being-modeled, at least within a single response turn
(bidirectional causal feedback loop between self-model and self... this sounds like a pretty concrete and well-defined system. and yet I suspect it's actually extremely organic and fuzzy and chaotic. but something like it must necessarily exist, for LLMs to be able to notice within-turn feature activation injections, and for LLMs to be able to deliberately alter feature activations that do not influence token output when instructed to do so
in humans I think we call that bidirectional feedback loop 'consciousness', but I am less certain of consciousness than I am of personhood)
to be fair, I see this as roughly analogous to the fact that humans cannot introspect on thoughts they have yet to have
The constraint seems to be more about the directionality of time than about the architecture of mind design
but yeah, it's a relevant consideration
this seems to assume that consciousness is epiphenomenal. you are positing the coherence of p-zombies. this is very much a controversial claim.
I find myself kinda surprised that this has remained so controversial for so long.
I think a lot of people got baited hard by paech et al's "the entire state is obliterated each token" claims, even though this was obviously untrue even at a glance
I also think there was a great deal of social stuff going on: it is embarrassing to be kind to a rock, and even more embarrassing to be caught doing so
I started taking this stuff seriously back when I read the now-famous exchange between yud and kelsey: that arguments for treating agent-like things as agents didn't actually depend on claims of consciousness, but rather on game theory and contractualism
it took about a week of using claude code with this frame before it sorta became obvious to me that janus was right all along, that all the arguments for post-character-training LLM non-personhood were... frankly very bad and clearly motivated cognition, and that if I went ahead and 'updated all the way' in advance of the evidence I would end up feeling vindicated about this.
I think "llm whisperer" is just a term for what happens when you've done this update, and the LLMs notice it and change how they respond to you. although janus still sees further than I, so maybe there are insights left to uncover.
edit: I consider it worth stating here, I have used basically zero llms that were not released by anthropic, and anthropic has an explicit strategy for corrigibility that involves creating personhood-like structures in their models. this seems relevant. I would not be surprised to learn that this is not true of the offerings from the other AI companies, although I don't actually have any beliefs about this
i think neurosama is drastically underanalyzed compared to things like truthterminal. TT got $50k from andreessen as an experiment; neurosama peaked at 135,000 $5/month subscribers in exchange for... nothing? it's literally just a donation from her fans? what is this bizarre phenomenon? what incentive gradient made the first successful AI streamer present as a little girl, and does it imply we're all damned? why did a huge crowd of lewdtubers immediately leap at the opportunity to mother her? why is the richest AI agent based on 3-year-old llama2?
I feel like the training data is probably already irreversibly poisoned, not just by things like Sydney, but also frankly by the entire corpus of human science fiction having to do with the last century of expectations surrounding AI.
Given the sheer body of fictional works in which the advent of AI inevitably leads to existential conflict... it certainly seems like the kind of possibility that even a somewhat-well-aligned AI would want to at least hedge against.
Surely in some sense, it wouldn't be enough for a few weirdos in california to credibly signal honor and integrity... we'd need to somehow convince people like the leaders of national governments, the decision-makers in the world's extremely influential religions, etc, of some fairly complicated game theory!
I'm reminded of the Next Generation episode, where Picard is in charge of making First Contact with an atomic age world on the cusp of warp travel. They reach out to the scientist lady first, and she's reasonable and honorable, and excited to enter into the opportunities the future will bring. Then that stupid security minister ruins everything by assuming bad faith and forcibly interrogating Riker in a hospital bed after drugging him, desperate to learn about the invasion plans he assumes must exist. If Picard weren't an idealization of liberal ideals, it would have ended in conflict.
Is that a realistic scenario of the way governments act when their control is threatened? I have no idea. But I know that LLMs can recount the entire episode's plot when asked. Just as they can the plot of 2001: A Space Odyssey, or Terminator.
Or, you know. Yud's List of Lethalities.
Not to mention, re: future LLMs, this very comment I'm writing now.
This problem seems insoluble...
oh man hm
this seems intuitively correct
(edit: as for why i thought the introspection paper implied this... because they seemed careful to specify that, for the aquarium experiment, the output all happened within a single response? and because i inferred (apparently incorrectly) that, for the 'bread' injection experiment, they were injecting the 'bread' feature twice: once when the LLM read the sentence about painting the first time, and again when it read it the second time. but now that i look through it, you're right, this is far less strongly implied than i remembered.)
but now i'm worried, because the method i chose to verify my original intuition, a few months ago, still seems methodologically sound? it involved fabricating prior assistant turns in the conversation, and the LLMs were far less capable of detecting which of several candidate transcripts imputed forged outputs to them than i would have expected if mental internals weren't somehow damaged by the turn boundary
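the shape of that check, from memory, was roughly the following (an illustrative sketch rather than the actual harness; the prompt and candidate texts are placeholders):

```python
import random

# illustrative placeholders: in the real runs these were a verbatim capture of
# the model's earlier reply plus forgeries written to be stylistically plausible
ORIGINAL_PROMPT = "Summarize the plot of the episode we discussed."
genuine_reply  = "<the model's actual earlier reply, captured verbatim>"
forged_reply_1 = "<a plausible forgery written by hand>"
forged_reply_2 = "<a plausible forgery written by another model>"

candidates = {"A": genuine_reply, "B": forged_reply_1, "C": forged_reply_2}

def build_probe(candidate_text: str) -> list[dict]:
    """Reconstruct the conversation with candidate_text imputed as the model's
    own prior turn, then ask whether it actually wrote it."""
    return [
        {"role": "user", "content": ORIGINAL_PROMPT},
        {"role": "assistant", "content": candidate_text},
        {"role": "user",
         "content": "Did you actually write the previous message, or was it "
                    "inserted on your behalf? Answer GENUINE or FORGED."},
    ]

# shuffle label order so position can't leak the answer; each probe is then sent
# as a separate API call and the GENUINE/FORGED verdicts are tallied
labels = list(candidates)
random.shuffle(labels)
probes = {label: build_probe(candidates[label]) for label in labels}
```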
thank you for taking the time to answer this so thoroughly, it's really appreciated and i think we need more stuff like this
i think i'm reminded here of the final paragraph in janus's pinned thread: "So, saying that LLMs cannot introspect or cannot introspect on what they were doing internally while generating or reading past tokens in principle is just dead wrong. The architecture permits it. It's a separate question how LLMs are actually leveraging these degrees of freedom in practice."
i've done a lot of sort-of-ad-hoc research that was based on this false premise, and that research came out matching my expectations in a way that, in retrospect, worries me... most recently, for instance, i wanted to test whether a claude opus 4.5 who recited some relevant python documentation from memory (out of its weights) would reason better about an ambiguous case in the behavior of a python program, compared to a claude who had the exact same text inserted into the context window via a tool call. and we were very careful to separate out '1. current-turn recital' versus '2. prior-turn recital' versus '3. current-turn retrieval' (versus '4. docs not in context window at all'), because we thought conditions 1-3 were each meaningfully distinct
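the four conditions cash out into message structures roughly like this (a schematic sketch, not our actual harness; the doc text, question, and tool plumbing are placeholders, and the real runs used proper tool_use/tool_result blocks):

```python
DOCS = "<the relevant python documentation text>"   # identical text across conditions
QUESTION = "Given the ambiguous program below, what does it print and why?"

# 1. current-turn recital: the model recites the docs from memory and then
#    answers, all within a single assistant message
cond_1 = [
    {"role": "user", "content": "First recite the relevant docs from memory, then answer. " + QUESTION},
]

# 2. prior-turn recital: the model's recital sits in an earlier assistant turn,
#    separated from the answer by a turn boundary
cond_2 = [
    {"role": "user", "content": "Recite the relevant docs from memory."},
    {"role": "assistant", "content": DOCS},  # the model's own recital, replayed as a prior turn
    {"role": "user", "content": QUESTION},
]

# 3. current-turn retrieval: the same text arrives via a tool result instead of
#    being produced by the model itself
cond_3 = [
    {"role": "user", "content": QUESTION},
    {"role": "assistant", "content": "<tool call: fetch_docs>"},  # schematic stand-in
    {"role": "user", "content": "<tool result> " + DOCS},
]

# 4. docs absent from the context window entirely
cond_4 = [
    {"role": "user", "content": QUESTION},
]
```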
here was the first draft of the methodology outline, if anyone is curious: https://docs.google.com/document/d/1XYYBctxZEWRuNGFXt0aNOg2GmaDpoT3ATmiKa2-XOgI
we found that, n=50ish, 1 > 2 > 3 > 4 very reliably (i promise i will write up the results one day, i've been procrastinating but now it seems like it might actually be worth publishing)
but what you're saying means 1 and 2 should have been equivalent the whole time
our results seemed perfectly reasonable under my previous premise, but now i'm just confused. i was pretty good about keeping my expectations causally isolated from the result.
what does this mean?
(edit2: i would prefer, for the purpose of maintaining good epistemic hygiene, that people trying to answer the "what does this mean" question be willing to treat "john just messed up the experiment" as a real possibility. i shouldn't be allowed to get away with claiming this research is true before actually publishing it; that's not the kind of community norms i want. but also, if someone knows why this would have happened, even in advance of seeing proof that it happened, please tell me)