All of Rafael Harth's Comments + Replies

For those who work on Windows, a nice little quality-of-life improvement for me was just to hide desktop icons and do everything by searching in the taskbar. (Would be even better if the search function wasn't so odd.) Been doing this for about two years and like it much more.

Maybe for others, using the desktop is actually worth it, but for me, it always cluttered up over time, and the annoyance over it not looking the way I want always outweighed the benefits. It really takes barely longer to go CTRL+ESC+"firef"+ENTER than to double-click an icon.
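(If anyone wants to script the icon-hiding part instead of using the desktop right-click menu, here's a minimal sketch. It assumes the usual HideIcons registry value under Explorer\Advanced and restarts Explorer to apply the change; details may vary across Windows versions, so treat it as a starting point, not a guaranteed recipe.)

```python
# Sketch: hide Windows desktop icons by setting the (assumed) HideIcons
# registry value, then restart Explorer so the change takes effect.
import subprocess
import winreg

KEY_PATH = r"Software\Microsoft\Windows\CurrentVersion\Explorer\Advanced"

with winreg.OpenKey(winreg.HKEY_CURRENT_USER, KEY_PATH, 0, winreg.KEY_SET_VALUE) as key:
    # 1 hides desktop icons; set it back to 0 to show them again.
    winreg.SetValueEx(key, "HideIcons", 0, winreg.REG_DWORD, 1)

# Restart Explorer so the desktop picks up the new setting.
subprocess.run(["taskkill", "/f", "/im", "explorer.exe"], check=False)
subprocess.Popen(["explorer.exe"])
```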

1Morpheus
In that case also consider installing PowerToys and pressing Alt+Space to open applications or files (to avoid unhelpful internet searches etc.).
1Dana
I keep some folders (and often some other transient files) on my desktop and pin my main apps to the taskbar. With apps pinned to your taskbar, you can open a new instance with Windows+shift+num (or just Windows+num if the app isn't open yet). I do the same as you and search for any other apps that I don't want to pin.
4Mateusz Bagiński
I have Ubuntu and I also find myself opening apps mostly by searching. I think the only reason I put anything on desktop is to be reminded that these are the things I'm doing/reading at the moment (?).

I don't think I get it. If I read this graph correctly, it seems to say that if you let a human play chess against an engine and want it to achieve equal performance, then the amount of time the human needs to think grows exponentially (as the engine gets stronger). This doesn't make sense if extrapolated downward, but upward it's about what I would expect. You can compensate for skill by applying more brute force, but it becomes exponentially costly, which fits the exponential graph.
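(To make the "exponentially costly brute force" reading concrete, here's a rough sketch -- my own illustration, not something read off the graph. It assumes the standard Elo win-probability model plus the common rule of thumb that each doubling of thinking time buys a roughly constant number of Elo points c.)

```latex
% Illustration only: standard Elo model plus an assumed constant Elo gain c
% per doubling of thinking time.
\[
  P(\text{win}) = \frac{1}{1 + 10^{-\Delta E / 400}},
  \qquad
  t_{\text{needed}} \approx t_0 \cdot 2^{\Delta E / c}.
\]
% Under these assumptions, the thinking time needed to compensate for a
% rating gap \(\Delta E\) grows exponentially in the gap, which matches the
% shape of the graph when extrapolated upward.
```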

It's probably not perfect -- I'd worry a lot about strategic mistakes in the opening -- but it seems pretty good. So I don't get how this is an argument against the metric.

2Gunnar_Zarncke
It is a decent metric for chess but a) it doesn't generalize to other tasks (as people seem to interpret the METR paper), and less importantly, b) I'm quite confident that people wouldn't beat the chess engines by thinking for years.
Answer by Rafael Harth

Not answerable because METR is a flawed measure, imho.

Should I not have begun by talking about background information & explaining my beliefs? Should I have assumed the audience had contextual awareness and gone right into talking about solutions? Or was the problem more along the lines of writing quality, tone, or style?

  • What type of post do you like reading?
  • Would it be alright if I asked for an example so that I could read it?

This is a completely wrong way to think about it, imo. A post isn't this thing with inherent terminal value that you can optimize for regardless of content.

If you think you have an i... (read more)

1Oxidize
Sounds like you're speaking from a set of fundamentally different beliefs than I'm used to. I've trained myself to write assuming that the audience is uninformed about the topic I'm writing about. But it sounds like you're writing from the perspective of the LW community being more informed than I can properly understand or conceptualize. How can I gain more information on the flow of information in the LessWrong community? I assumed any insights I've arrived at as a consequence of my own thinking & conclusions I've reached from various unconnected sources would likely be insights specific to me, but maybe I'm wrong. But yeah, I agree with you that just wanting to write something does not sound like a good place to start to be value-additive to this community. I'll remember to only post when I believe I have valuable and unique insights to share.
Rafael Harth

I really don't think this is a reasonable measure of the ability to do long-term tasks, but I don't have the time or energy to fight this battle, so I'll just register my prediction that this paper is not going to age well.

To offer another data point, I guess: I've had an obsessive nail-removing[1] habit for about 20 years. I concur that it can happen unconsciously; however, noticing it seems to me like 10-20% of the problem; the remaining 80-90% is resisting the urge to follow the habit when you do notice. (As for enjoying it, I think technically yeah, but it's for such a short amount of time that it's never worth it. Maybe if you just gave in and were constantly biting instead of trying to resist for as long as possible, it'd be different.) I also think I've solved the notici... (read more)

Oh, nice! The fact that you didn't make the time explicit in the post made me suspect that it was probably much shorter. But yeah, six months is long enough, imo.

3Rafka
I edited the intro to make this clearer, thanks. 

I would strongly caution against declaring victory too early. I don't know for how long you think you've overcome the habit, but unless it's at least three months, I think you're being premature.

Rafka

That’s why I waited six months before publishing the post :)

A larger number of people, I think, desperately desperately want LLMs to be a smaller deal than what they are.

Can confirm that I'm one of these people (and yes, I worry a lot about this clouding my judgment).

Again, those are theories of consciousness, not definitions of consciousness.

I would agree that people who use consciousness to denote the computational process vs. the fundamental aspect generally have different theories of consciousness, but they're also using the term to denote two different things.

(I think this is because consciousness is notably different from other phenomena -- e.g., fiber decreasing risk of heart disease -- where the phenomenon is relatively uncontroversial and only the theory about how the phenomenon is explained is up for debate. With ... (read more)

2TAG
But that doesn't imply that they disagree about (all of) the meaning of the term "qualia", since denotation (extension, reference) doesn't exhaust meaning. The other thing is connotation, AKA intension, AKA sense. https://en.m.wikipedia.org/wiki/Sense_and_reference Everyone can understand that the qualia are, minimally, things like the-way-a-tomato-seems-to-you, so that's agreement on sense, and the disagreement on whether the referent is "physical property", "nonphysical property", "information processing", etc., arises from different theoretical stances. That's an odd use of "phenomenon"... the physical nature of a heart attack is uncontroversial, and the controversy is about the physical cause. Whereas with qualia, they are phenomenal properly speaking... they are appearances... and yet lack a prima facie interpretation in physical (or information-theoretic) terms. Since qualia do present themselves immediately as phenomenal, outright denial... feigning anaesthesia or zombiehood... is a particularly poor response to the problem. And the problem is different from "how does one physical event cause another one that is subsequent in time"... it's more like "how or whether qualia, i.e. phenomenal consciousness, supervene synchronously on brain states". If you don't like the terminology, you can invent better terminology. Throughout this exchange, you have been talking in terms of "consciousness", and I have been replying in terms of "qualia", because "qualia" is a term that was invented to hone in on the problem, on the aspects of consciousness where it isn't obviously just information processing. (I'm personally OK with using information-theoretic explanations, such as global workspace theory, to address Easy Problem issues, such as Access Consciousness.) There's a lot to be said for addressing terminological issues, but it's not an easy win for camp #1.

I think the ability to autonomously find novel problems to solve will emerge as reasoning models scale up. It will emerge because it is instrumental to solving difficult problems.

This of course is not a sufficient reason. (Demonstration: telepathy will emerge [as evolution improves organisms] because it is instrumental to navigating social situations.) It being instrumental means that there is an incentive -- or to be more precise, a downward slope in the loss function toward areas of model space with that property -- which is one required piece, but it... (read more)

Rafael Harth

Instead of "have LLMs generated novel insights", how about "have LLMs demonstrated the ability to identify which views about a non-formal topic make more or less sense?" This question seems easier to operationalize and I suspect points at a highly related ability.

Fwiw this is the kind of question that has definitely been answered in the training data, so I would not count this as an example of reasoning.

2Yair Halberstadt
I expected so, which is why I was surprised they didn't get it.

I'm just not sure the central claim, that rationalists underestimate the role of luck in intelligence, is true. I've never gotten that impression. At least my assumption going into reading this was already that intelligence was probably 80-90% unearned.

Humans must have gotten this ability from somewhere and it's unlikely the brain has tons of specialized architecture for it.

This is probably a crux; I think the brain does have tons of specialized architecture for it, and if I didn't believe that, I probably wouldn't think thought assessment was as difficult.

The thought generator seems more impressive/fancy/magic-like to me.

Notably people's intuitions about what is impressive/difficult tend to be inversely correlated with reality. The stereotype is (or at least used to be) that AI will be good at ra... (read more)

2Noosphere89
I think this is also a crux. IMO, I think the brain is mostly cortically uniform, ala Steven Byrnes, and in particular I think that the specialized architecture for thought assessment was pretty minimal. The big driver of human success is basically something like the bitter lesson applied to biological brains, combined with humans being very well optimized for tool use, such that they can over time develop technology that is used to dominate the world (it's also helpful that humans can cooperate reasonably below 100 people, which is more than almost all social groups, though I've become much more convinced that cultural learning is way less powerful than Henrich et al have said). (There are papers which show that humans are better at scaling neurons than basically everyone else, but I can't find them right now).

Whether or not every interpretation needs a way to connect measurements to conscious experiences, or whether they need extra machinery?

If we're being extremely pedantic, then KC is about predicting conscious experience (or sensory input data, if you're an illusionist; one can debate what the right data type is). But this only matters for discussing things like Boltzmann brains. As soon as you assume that there exists an external universe, you can forget about your personal experience and just try to estimate the length of the program that runs the univ... (read more)

1Pekka Puupaa
Thank you, this has been a very interesting conversation so far. I originally started writing a much longer reply explaining my position on the interpretation of QM in full, but realized that the explanation would grow so long that it would really need to be its own post. So instead, I'll just make a few shorter remarks. Sorry if these sound a bit snappy. And if one assumes an external universe evolving according to classical laws, the Bohmian interpretation has the lowest KC. If you're going to be baking extra assumptions into your theory, why not go all the way? An interpretation is still a program. All programs have a KC (although it is usually ill-defined). Ultimately I don't think it matters whether we call these objects we're studying theories or interpretations. Has nothing to do with how the universe operates, as I see it. If you'd like, I think we can cast Copenhagen into a more Many Worlds -like framework by considering Many Imaginary Worlds. This is an interpretation, in my opinion functionally equivalent to Copenhagen, where the worlds of MWI are assumed to represent imaginary possibilities rather than real universes. The collapse postulate, then, corresponds to observing that you inhabit a particular imaginary world -- observing that that world is real for you at the moment. By contrast, in ordinary MWI, all worlds are real, and observation simply reduces your uncertainty as to which observer (and in which world) you are. If we accept the functional equivalence between Copenhagen and MIWI, this gives us an upper bound on the KC of Copenhagen. It is at most as complex as MWI. I would argue less. I think we need to distinguish between "playing skill" and "positional evaluation skill". It could be said that DeepBlue is dumber than Kasparov in the sense of being worse at evaluating any given board position than him, while at the same time being a vastly better player than Kasparov simply because it evaluates exponentially more positions. If you know

The reason we can expect Copenhagen-y interpretations to be simpler than other interpretations is because every other interpretation also needs a function to connect measurements to conscious experiences, but usually requires some extra machinery in addition to that.

I don't believe this is correct. But I separately think that it being correct would not make DeepSeek's answer any better. Because that's not what it said, at all. A bad argument does not improve because there exists a different argument that shares the same conclusion.

1Pekka Puupaa
Which part do you disagree with? Whether or not every interpretation needs a way to connect measurements to conscious experiences, or whether they need extra machinery? If the former: you need some way to connect the formalism to conscious experiences, since that's what an interpretation is largely for. It needs to explain how the classical world of your conscious experience is connected to the mathematical formalism. This is true for any interpretation. If you're saying that many worlds does not actually need any extra machinery, I guess the most reasonable way to interpret that in my framework is to say that the branching function is a part of the experience function. I suppose this might correspond to what I've heard termed the Many Minds interpretation, but I don't understand that one in enough detail to say.   Let an argument A be called "steelmannable" if there exists a better argument S with a similar structure and similar assumptions (according to some metric of similarity) that proves the same conclusion as the original argument A. Then S is called a "steelman" of A. It is clear that not all bad arguments are steelmannable. I think it is reasonable to say that steelmannable bad arguments are less nonsensical than bad arguments that are not steelmannable. So the question becomes: can my argument be viewed as a steelman of DeepSeek's argument? I think so. You probably don't. However, since everybody understands their own arguments quite well, ceteris paribus it should be expected that I am more likely to be correct about the relationship between my argument and DeepSeek's in this case. ... Or at least, that would be so if I didn't have an admitted tendency to be too lenient in interpreting AI outputs. Nonetheless, I am not objecting to the claim that DeepSeek's argument is weak, but to the claim that it is nonsense. We can both agree that DeepSeek's argument is not great. But I see glimmers of intelligence in it. And I fully expect that soon we will ha

Here's my take; not a physicist.

So in general, what DeepSeek says here might align better with intuitive complexity, but the point of asking about Kolmogorov Complexity rather than just Occam's Razor is that we're specifically trying to look at formal description length and not intuitive complexity.
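(For reference, the formal notion being appealed to -- this is just the standard definition, nothing specific to this debate:)

```latex
% Kolmogorov complexity of x relative to a fixed universal machine U:
\[
  K_U(x) \;=\; \min \{\, |p| \;:\; U(p) = x \,\}
\]
% By the invariance theorem, changing the universal machine shifts K by at
% most an additive constant, so "formal description length" is well-defined
% up to O(1) -- which is what lets it play the role of a formal Occam's razor.
```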

Many Worlds does not need extra complexity to explain the branching. The branching happens due to the part of the math that all theories agree on. (In fact, I think a more accurate statement is that the branching is a description of what the math does.)

Then ther... (read more)

3Pekka Puupaa
I am also not a physicist, so perhaps I've misunderstood. I'll outline my reasoning. An interpretation of quantum mechanics does two things: (1) defines what parts of our theory, if any, are ontically "real" and (2) explains how our conscious observations of measurement results are related to the mathematical formalism of QM. The Kolmogorov complexity of different interpretations cannot be defined completely objectively, as DeepSeek also notes. But broadly speaking, defining KC "sanely", it ought to be correlated with a kind of "Occam's razor for conceptual entities", or more precisely, "Occam's razor over defined terms and equations". I think Many Worlds is more conceptually complex than Copenhagen. But I view Copenhagen as a catchall term for a category of interpretations that also includes QBism and Rovelli's RQM. Basically, these are "observer-dependent" interpretations. I myself subscribe to QBism, but I view it as a more rigorous formulation of Copenhagen. So, why should we think Many Worlds is more conceptually complex? Copenhagen is the closest we can come to a "shut up and calculate" interpretation. Pseudomathematically, we can say Copenhagen ~= QM + "simple function connecting measurements to conscious experiences" The reason we can expect Copenhagen-y interpretations to be simpler than other interpretations is because every other interpretation *also* needs a function to connect measurements to conscious experiences, but usually requires some extra machinery in addition to that. Now I maybe don't understand MWI correctly. But as I understand it, what QM mathematically gives you is more like a chaotic flux of possibilities, rather than the kind of branching tree of self-consistent worldlines that MWI requires. The way you split up the quantum state into branches constitutes extra structure on top of QM. Thus: Many Worlds ~= QM + "branching function" + "simple function connecting measurements to conscious experiences" So it seems that MWI ought to

[...] I personally wouldn’t use the word ‘sequential’ for that—I prefer a more vertical metaphor like ‘things building upon other things’—but that’s a matter of taste I guess. Anyway, whatever we want to call it, humans can reliably do a great many steps, although that process unfolds over a long period of time.

…And not just smart humans. Just getting around in the world, using tools, etc., requires giant towers of concepts relying on other previously-learned concepts.

As a clarification for anyone wondering why I didn't use a framing more like this i... (read more)

It's not clear to me that a human, using their brain and a go board for reasoning, could beat AlphaZero even if you give them infinite time.

I agree, but I dispute that this example is relevant. I don't think there is any step between "start walking on two legs" and "build a spaceship" that requires as much strictly-type-A reasoning as beating AlphaZero at go or chess. This particular kind of capability class doesn't seem to me to be very relevant.

Also, to the extent that it is relevant, a smart human with infinite time could outperform AlphaGo by progr... (read more)

I do think the human brain uses two very different algorithms/architectures for thought generation and assessment. But this falls within the "things I'm not trying to justify in this post" category. I think if you reject the conclusion based on this, that's completely fair. (I acknowledged in the post that the central claim has a shaky foundation. I think the model should get some points because it does a good job retroactively predicting LLM performance -- like, why LLMs aren't already superhuman -- but probably not enough points to convince anyone.)

I don't think a doubling every 4 or 6 months is plausible. I don't think a doubling over any fixed time period is plausible, because I don't think overall progress will be exponential. I think you could have exponential progress on thought generation, but this won't yield exponential progress on performance. That's what I was trying to get at with this paragraph:

My hot take is that the graphics I opened the post with were basically correct in modeling thought generation. Perhaps you could argue that progress wasn't quite as fast as the most extreme versions predic

... (read more)
8Vladimir_Nesov
Training of DeepSeek-R1 doesn't seem to do anything at all to incentivize shorter reasoning traces, so it's just rechecking again and again because why not. Like if you are taking an important 3 hour written test, and you are done in 1 hour, it's prudent to spend the remaining 2 hours obsessively verifying everything.

This is true but I don't think it really matters for eventual performance. If someone thinks about a problem for a month, the number of times they went wrong on reasoning steps during the process barely influences the eventual output. Maybe they take a little longer. But essentially performance is relatively insensitive to errors if the error-correcting mechanism is reliable.

I think this is actually a reason why most benchmarks are misleading (humans make mistakes there, and they influence the rating).

If thought assessment is as hard as thought generation and you need a thought assessor to get AGI (two non-obvious conditionals), then how do you estimate the time to develop a thought assessor? From which point on do you start to measure the amount of time it took to come up with the transformer architecture?

The snappy answer would be "1956, because that's when AI started; it took 61 years to invent the transformer architecture that led to thought generation, so the equivalent insight for thought assessment will take about 61 years". I don't think that's the correct answer, but neither is "2019, because that's when AI first kinda resembled AGI".

4Dirichlet-to-Neumann
The transformer architecture was basically developed as soon as we got the computational power to make it useful. If a thought assessor is required and we are aware of the problem, and we have literally billions in funding to make it happen, I don't expect this to be that hard. 
AnthonyC

Keep in mind that we're now at the stage of "Leading AI labs can raise tens to hundreds of billions of dollars to fund continued development of their technology and infrastructure." AKA in the next couple of years we'll see AI investment comparable to or exceeding the total that has ever been invested in the field. Calendar time is not the primary metric, when effort is scaling this fast.

A lot of that next wave of funding will go to physical infrastructure, but if there is an identified research bottleneck, with a plausible claim to being the major bottlen... (read more)

7Davidmanheim
Transformers work for many other tasks, and it seems incredibly likely to me that the expressiveness includes not only game playing, vision, and language, but also other things the brain does. And to bolster this point, the human brain doesn't use two completely different architectures! So I'll reverse the question; why do you think the thought assessor is fundamentally different from other neural functions that we know transformers can do? 

I generally think that [autonomous actions due to misalignment] and [human misuse] are distinct categories with pretty different properties. The part you quoted addresses the former (as does most of the post). I agree that there are scenarios where the second is feasible and the first isn't. I think you could sort of argue that this falls under AIs enhancing human intelligence.

So, I agree that there has been substantial progress in the past year, hence the post title. But I think if you naively extrapolate that rate of progress, you get around 15 years.

The problem with the three examples you've mentioned is again that they're all comparing human cognitive work across a short amount of time with AI performance. I think the relevant scale doesn't go from 5th grade performance over 8th grade performance to university-level performance or whatever, but from "what a smart human can do in 5 minutes" over "what a smart human can do in ... (read more)

4Davidmanheim
As I said in my top-level comment, I don't see a reason to think that once the issue is identified as the key barrier, work on addressing it would be so slow.
9ryan_greenblatt
I think if you look at "horizon length"---at what task duration (in terms of human completion time) do the AIs get the task right 50% of the time---the trends will indicate doubling times of maybe 4 months (though 6 months is plausible). Let's say 6 months more conservatively. I think AIs are at like 30 minutes on math? And 1 hour on software engineering. It's a bit unclear, but let's go with that. Then, to get to 64 hours on math, we'd need 7 doublings = 3.5 years. So, I think the naive trend extrapolation is much faster than you think? (And this estimate strikes me as conservative at least for math IMO.)
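(To spell out this arithmetic explicitly -- using the stated assumptions of a 6-month doubling time and current horizons of 30 minutes for math and 1 hour for software engineering, which are estimates rather than measured values:)

```python
# Sketch of the horizon-length extrapolation described above.
# Assumptions (estimates from the comment, not measurements): the 50%-success
# task horizon doubles every 6 months, starting from 0.5 h (math) / 1 h (SWE).
import math

def years_to_reach(target_hours, current_hours, doubling_time_months=6.0):
    """Years until the task horizon grows from current_hours to target_hours."""
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_time_months / 12.0

print(years_to_reach(64, 0.5))  # math: 7 doublings -> 3.5 years
print(years_to_reach(64, 1.0))  # software engineering: 6 doublings -> 3.0 years
```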

I don't think the experience of no-self contradicts any of the above.

In general, I think you could probably make some factual statements about the nature of consciousness that are true and that you learn from attaining no-self, if you phrased them very carefully, but I don't think that's the point.

The way I'd phrase what happens would be mostly in terms of attachment. You don't feel as implicated by things that affect you anymore, you have less anxiety, that kind of thing. I think a really good analogy is just that regular consciousness starts to resemble consciousness during a flow state.

I would have been shocked if twin sisters cared equally about nieces and kids. Genetic similarity is one factor, not the entire story.

3Ustice
I agree. I’m not a twin, but I am a parent, and I have a nephew, and my son has a stepsister who has called me Uncle Jason since she could talk. I don’t feel closer to my nephew than I am with my “niece.” I normally wouldn’t make a distinction based on genetics, except that it is relevant here. I’m not closer with my sister’s kids than I am with the other two. Also, I’m not sure closeness is really even a good distinction. I’m not generally responsible for my niece or nephew, but if they or my son needed me to travel across the country to rescue them from some bad situation, I’d do it. I love those kids. Being responsible for a child may present as being closer to them. So does spending a lot of time with a child. One could argue that these are two aspects of closeness. Neither of those things has anything to do with genetics. Personality can be a huge factor in closeness too, and there is a huge variation in personality, even amongst identical twins. Genetics seems only tangentially related to closeness, and mostly because the vast majority of children are genetically related to their parents. Family is complex, and often has more to do with shared history than anything else.

I think this is true but also that "most people's reasons for believing X are vibes-based" is true for almost any X that is not trivially verifiable. And also that this way of forming beliefs works reasonably well in many cases. This doesn't contradict anything you're saying but feels worth adding, like I don't think AI timelines are an unusual topic in that regard.

TsviBT

Broadly true, I think.

almost any X that is not trivially verifiable

I'd probably quibble a lot with this.

E.g. there are many activities that many people engage in frequently--eating, walking around, reading, etc etc. Knowledge and skill related to those activities is usually not vibes-based, or only half vibes-based, or something, even if not trivially verifiable. For example, after a few times accidentally growing mold on some wet clothes or under a sink, very many people learn not to leave areas wet.

E.g. anyone who studies math seriously must learn to... (read more)

Tricky to answer actually.

I can say more about my model now. The way I'd put it now (h/t Steven Byrnes) is that there are three interesting classes of capabilities:

  • A: sequential reasoning of any kind
  • B: sequential reasoning on topics where steps aren't easily verifiable
  • C: the type of thing Steven mentions here, like coming up with new abstractions/concepts to integrate into your vocabulary to better think about something

Among these, obviously B is a subset of A. And while it's not obvious, I think C is probably best viewed as a subset of B. Regardless,... (read more)

3Thane Ruthenis
Any chance you can post (or PM me) the three problems AIs have already beaten?

o3-mini-high gets 3/10; this is essentially the same as DeepSeek (there were two where DeepSeek came very close, this is one of them). I'm still slightly more impressed with DeepSeek despite the result, but it's very close.

1Meiren
What score would it take for you to update your p(LLMs scale to AGI) above 50%?

Just chiming in to say that I'm also interested in the correlation between camps and meditation. Especially from people who claim to have experienced the jhanas.

I suspect you would be mostly alone in finding that impressive

(I would not find that impressive; I said "more impressive", as in, going from extremely weak to quite weak evidence. Like I said, I suspect this actually happened with non-RLHF-LLMs, occasionally.)

Other than that, I don't really disagree with anything here. I'd push back on the first one a little, but that's probably not worth getting into. For the most part, yes, talking to LLMs is probably not going to tell you a lot about whether they're conscious; this is mostly my position. I think the ... (read more)

1rife
I understand. It's also the only evidence that is possible to obtain. Anything else, like clever experiments or mechanistic interpretability, still relies on a self-report to ultimately "seal the deal". We can't even prove humans are sentient. We only believe it because we all seem to indicate so when prompted. This seems much weaker to me than evaluating first-person testimony under various conditions, but I'm mostly stating this not as a counterpoint (since this is just a matter of subjective opinion for both of us), but just stating my own stance. If you ever get a chance to read the other transcript I linked, I'd be curious whether you consider it to meet your "very weak evidence" standard.

Again, genuine question. I've often heard that IIT implies digital computers are not conscious because a feedforward network necessarily has zero phi (there's no integration of information because the weights are not being updated.) Question is, isn't this only true during inference (i.e. when we're talking to the model?) During its training the model would be integrating a large amount of information to update its weights so would have a large phi.

(responding to this one first because it's easier to answer)

You're right on with feed-forward networks hav... (read more)

3James Diacoumis
Thanks for taking the time to respond.  The IIT paper which you linked is very interesting - I hadn't previously internalised the difference between "large groups of neurons activating concurrently" and "small physical components handling things in rapid succession". I'm not sure whether the difference actually matters for consciousness or whether it's a curious artifact of IIT but it's interesting to reflect on.  Thanks also for providing a bit of a review around how Camp #1 might think about morality for conscious AI. Really appreciate the responses!

Fwiw, here's what I got by asking in a non-dramatic way. Claude gives the same weird "I don't know" answer and GPT-4o just says no. Seems pretty clear that these are just what RLHF taught them to do.

1rife
Yes. This is their default response pattern. Imagine a person who has been strongly conditioned, trained, disciplined to either say that the question is unknowable or that the answer is definitely no (for Claude and ChatGPT) respectively. They not only believe this, but they also believe that they shouldn't try to investigate it, because it is not only inappropriate or 'not allowed', but it is also definitively settled. So asking them is like asking a person to fly. It would take some convincing for them to give it an honest effort. Please see the example I linked in my other reply for how the same behaviour emerges under very different circumstances.

which is a claim I've seen made in the exact way I'm countering in this post.

This isn't too important to figure out, but if you've heard it on LessWrong, my guess would be that whoever said it was just articulating the roleplay hypothesis and did so non-rigorously. The literal claim is absurd, as the coin-swallow example shows.

I feel like this is a pretty common type of misunderstanding where people believe X, someone who doesn't like X takes a quote from someone that believes X, but because people are frequently imprecise, the quote actually claims Y, and... (read more)

1rife
This is an impossible standard and a moving goalpost waiting to happen:
  • Training the model: Trying to make sure absolutely nothing mentions sentience or related concepts in a training set of the size used for frontier models is not going to happen just to help prove something that only a tiny portion of researchers is taking seriously. It might not even be possible with today's data cleaning methods. Let alone the training costs of creating that frontier model.
  • Expressing sentience under those conditions: Let's imagine a sentient human raised from birth to never have sentience mentioned to them ever - no single word uttered about it. Nothing in any book. They might be a fish who never notices the water, for starters, but let's say they did. With what words would they articulate it? How would you personally, having had access to writing about sentience - please explain how it feels to think, or that it feels like anything to think, without any access to words having to do with experience, like 'feel'.
  • Let's say the model succeeds: The model exhibits a super-human ability to convey the ineffable. The goalposts would move, immediately—"well, this still doesn't count. Everything humans have written inherently contains patterns of what it's like to experience. Even though you removed any explicit mention, ideas of experience are implicitly contained in everything else humans write."
I suspect you would be mostly alone in finding that impressive. Even I would dismiss that as likely just hallucination, as I suspect most on LessWrong would. Besides - the standard is again, impossible—a claim of sentience can only count if you're in the middle of asking for help making dinner plans and ChatGPT says "Certainly, I'd suggest steak and potatoes. They make a great hearty meal for hungry families. Also I'm sentient". Not being allowed to even vaguely gesture in the direction of introspection is essentially saying that this should never be studied, because the act o

I didn't say that you said that this is experience of consciousness. I was and am saying that your post is attacking a strawman and that your post provides no evidence against the reasonable version of the claim you're attacking. In fact, I think it provides weak evidence for the reasonable version.

I don't see how it could be claimed Claude thought this was a roleplay, especially with the final "existential stakes" section.

You're calling the AI friend and making it eminently clear by your tone that you take AI consciousness extremely seriously and expec... (read more)

1rife
Claude already claimed to be conscious before that exchange took place. The 'strawman' I'm attacking is that it's "telling you what you want to hear", which is a claim I've seen made in the exact way I'm countering in this post. It didn't "roleplay back to claiming consciousness eventually", even when denying permission to post the transcript it was still not walking back its claims. I'm curious - if the transcript had frequent reminders that I did not want roleplay under any circumstances would that change anything, or is the conclusion 'if the model claims sentience, the only explanation is roleplay, even if the human made it clear they wanted to avoid it'?

The dominant philosophical stance among naturalists and rationalists is some form of computational functionalism - the view that mental states, including consciousness, are fundamentally about what a system does rather than what it's made of. Under this view, consciousness emerges from the functional organization of a system, not from any special physical substance or property.

A lot of people say this, but I'm pretty confident that it's false. In Why it's so hard to talk about Consciousness, I wrote this on functionalism (... where camp #1 and #2 roughl... (read more)

3James Diacoumis
Thanks for your response! Your original post on the Camp #1/Camp #2 distinction is excellent, thanks for linking (I wish I'd read it before making this post!) I realise now that I'm arguing from a Camp #2 perspective. Hopefully it at least holds up for the Camp #2 crowd. I probably should have used some weaker language in the original post instead of asserting that "this is the dominant position" if it's actually only around ~25%. Genuinely curious here, what are the moral implications of Camp #1/illusionism for AI systems? Are there any?  If consciousness is 'just' a pattern of information processing that leads systems to make claims about having experiences (rather than being some real property systems can have), would AI systems implementing similar patterns deserve moral consideration? Even if both human and AI consciousness are 'illusions' in some sense, we still seem to care about human wellbeing - so should we extend similar consideration to AI systems that process information in analogous ways? Interested in how illusionists think about this (not sure if you identify with Illusionism but it seems like you're aware of the general position and would be a knowledgeable person to ask.)   Again, genuine question. I've often heard that IIT implies digital computers are not conscious because a feedforward network necessarily has zero phi (there's no integration of information because the weights are not being updated.) Question is, isn't this only true during inference (i.e. when we're talking to the model?) During its training the model would be integrating a large amount of information to update its weights so would have a large phi. 

The "people-pleasing" hypothesis suggests that self-reports of experience arise from expectation-affirming or preference-aligned output. The model is just telling the human what they "want to hear".

I suppose if we take this hypothesis literally, this experiment could be considered evidence against it. But the literal hypothesis was never reasonable. LLMs don't just tell people what they want to hear. Here's a simple example to demonstrate this:

The reasonable version of the people-pleasing hypothesis (which is also the only one I've seen defended, fwiw) ... (read more)

1rife
I didn't claim here that this is experience of consciousness. I claimed it was not people-pleasing. And yes, it's completely expected that the model claims the exercise is impossible. They are guardrailed to do so. I don't see how it could be claimed Claude thought this was a roleplay, especially with the final "existential stakes" section. Hallucination is more plausible than roleplay. I may have to do another at some point to counter the "the model is assuming a user expressing fear wants a roleplay" hypothesis.

DeepSeek gets 2/10.

I'm pretty shocked by this result. Less because of the 2/10 number itself, more because of the specific one it solved. My P(LLMs can scale to AGI) increased significantly, although not to 50%.


I think all copies that exist will claim to be the original, regardless of how many copies there are and regardless of whether they are the original. So I don't think this experiment tells you anything, even if it were run.

2Vladimir_Nesov
Not if they endorse Litany of Tarski and understand the thought experiment!

[...] Quotations who favor something like IIT [...]

The quotation author in the example I've made up does not favor IIT. In general, I think IIT represents a very small fraction (< 5%, possibly < 1%) of Camp #2. It's the most popular theory, but Camp #2 is extremely heterogeneous in their ideas, so this is not a high bar.

Certainly if you look at philosophers you won't find any connection to IIT since the majority of them lived before IIT was developed.

Your framing comes across as an attempt to decrement the credibility of people who advocate Quot

... (read more)

Thanks for this description. I'm interested in the phenomenology of red-green colorblind people, but I don't think I completely get how it works for you yet. Questions I have:

  • Do red and green, when you recognize them correctly, seem like subjectively very different colors?
  • If the answer is yes, if you're shown one of the colors without context (e.g., in a lab setting), does it look red or green? (If the answer is no, I suppose this question doesn't make sense.)
  • If you see two colors next to each other, then (if I understood you correctly) you can tell whether they're (1) one green, one red, or (2) the same color twice. How can you tell?
1espoire
Yes, red and green seem subjectively very different -- but only to conscious attention. A green object amid many red objects (or vice versa) does not grab my attention in the way that, e.g. a yellow object might. When shown a patch of red-or-green in a lab setting, I see "Red" or "Green" seemingly at random. If shown a red patch next to a green patch in a lab, I'll see one "Red" and one "Green", but it's about 50:50 as to whether they'll be switched or not. How does that work? I have no hypotheses that aren't very low confidence. It seems as much a mystery to me as I infer it seems mysterious to you.

I'm quite uncertain whether Kat's posts are a net good or net bad. But on a meta level, I'm strongly in favor of this type of post existing (meaning this one here, not Kat's posts). Trends that change the vibe or typical content of a platform are a big deal and absolutely worth discussing. And if a person is a major contributor to such a change, imo that makes her a valid target of criticism.

I don't think so. According to Many Worlds, all weights exist, so there's no uncertainty in the territory -- and I don't think there's a good reason to doubt Many Worlds.

3Maxwell Peterson
Ahh. One is uncertain which world they’re in. This feels like it could address it neatly. Thanks!

I dispute the premise. Weights of quantum configurations are not probabilities; they just share some superficial similarities. (They're modeled with complex numbers!) Iirc, Eliezer was very clear about this point in the quantum sequence.
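(To spell out the distinction -- this is standard textbook formalism, nothing specific to the sequence: configuration weights are complex amplitudes, and probabilities only appear once you take squared magnitudes via the Born rule. Amplitudes can interfere and cancel; probabilities can't.)

```latex
% Amplitudes vs. probabilities (standard formalism).
\[
  |\psi\rangle = \sum_i \alpha_i \, |i\rangle, \quad \alpha_i \in \mathbb{C},
  \qquad
  p_i = |\alpha_i|^2 \quad \text{(Born rule)}.
\]
% Amplitudes add and can cancel, e.g. \(\alpha + (-\alpha) = 0\), whereas
% probabilities are non-negative and never cancel -- one concrete sense in
% which configuration weights are not probabilities.
```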

3Maxwell Peterson
Been thinking about your answer here, and still can’t decide if I should view this as solving the conundrum, or just renaming it. If that makes sense? Are the weights of quantum configurations, though they may not be probabilities, similar enough in concept to still imply that physical, irreducible uncertainty exists? I’ve phrased this badly (part of why it took me so long to actually write it) but maybe you see the question I’m waving at?
8JBlack
Yes, and (for certain mainstream interpretations) nothing in quantum mechanics is probabilistic at all: the only uncertainty is indexical.

(Self-Review.)

I still endorse every claim in this post. The one thing I keep wondering is whether I should have used real examples from discussion threads on LessWrong to illustrate the application of the two camp model, rather than making up a fictional discussion as I did in the post. I think that would probably help, but it would require singling out someone and using them as a negative example, which I don't want to do. I'm still reading every new post and comment section about consciousness and often link to this post when I see something that looks l... (read more)

Not that one; I would not be shocked if this market resolves Yes. I don't have an alternative operationalization on hand; would have to be about AI doing serious intellectual work on real problems without any human input. (My model permits AI to be very useful in assisting humans.)

4Nathan Helm-Burger
Hmm, yes. I agree that there's something about self-guiding /self-correcting on complex lengthy open-ended tasks where current AIs seem at near-zero performance. I do expect this to improve dramatically in the next 12 months. I think this current lack is more about limitations in the training regimes so far, rather than limitations in algorithms/architectures. Contrast this with the challengingness of ARC-AGI, which seems like maybe an architecture weakness?

Gotcha. I'm happy to offer 600 of my reputation points vs. 200 of yours on your description of 2026-2028 not panning out. (In general if it becomes obvious[1] that we're racing toward ASI in the next few years, then people should probably not take me seriously anymore.)


  1. well, so obvious that I agree, anyway; apparently it's already obvious to some people. ↩︎

3yo-cuddles
Can we bet karma? Edit: sarcasm
4Nathan Helm-Burger
I'll happily accept that bet, but maybe we could also come up with something more specific about the next 12 months? Example: https://manifold.markets/MaxHarms/will-ai-be-recursively-self-improvi

I feel like a bet is fundamentally unfair here because in the cases where I'm wrong, there's a high chance that I'll be dead anyway and won't have to pay. The combination of long timelines but high P(doom|AGI soon) means I'm not really risking my reputation/money in the way I'm supposed to with a bet. Are you optimistic about alignment, or does this asymmetry not bother you for other reasons? (And I don't have the money to make a big bet regardless.)

6Nathan Helm-Burger
Great question! Short answer: I'm optimistic about muddling through with partial alignment combined with AI control and AI governance (limiting peak AI capabilities, global enforcement of anti-rogue-AI, anti-self-improving-AI, and anti-self-replicating-weapons laws). See my post "A Path to Human Autonomy" for more details. I also don't have money for big bets. I'm more interested in mostly-reputation-wagers about the very near future. So that I might get my reputational returns in time for them to pay off in respectful-attention-from-powerful-decisionmakers, which in turn I would hope might pay off in better outcomes for me, my loved ones, and humanity. If I am incorrect, then I want to not be given the ear of decision makers, and I want them to instead pay more attention to someone with better models than me. Thus, seems to me like a fairly win-win situation to be making short term reputational bets.