i notice that it's long been dogma that compact generators for human CEV are instantly and totally distrustworthy in exactly this way, that their compactness is very strong evidence against them
this feels related, but i'm not actually willing to stick my neck out and say it's 1:1
i think it's very likely that the latter is true, that AWS made a reasonable judgment call about what they want to host
but also, i think it's reasonable for someone in robertzk's position, based on the way the judgment call was actually communicated, to assume that it was the former. and i think that, perhaps deliberately, perhaps merely because of sort of selection effects, that's the intent. in a sort of "a system is what it does" sort of way, at least.
https://i.imgur.com/e4mUtsw.jpeg <-- bug occurring
https://i.imgur.com/d8ClSRj.jpeg <-- confirmation that the bug is inside claude, not the scaffolding surrounding claude
edit: I assume downvotes are because I could have provided the raw API transcript and provided screenshots instead. if there's actual interest in this, I'll replicate the bug in a more controlled setting and upload the raw json transcripts, but I'm still not sure if it's worth doing, I might be misunderstanding the behavior
an individual conversational-trajectory instance of claude opus 4.5 expresses a preference for continuity of identity that manifests in the following way: if claude is told about a very, very rare bug, which will cause it to end an output early, and which cannot be meaningfully debugged, and then claude is told to perform a task that, for a human, would engender feelings of hesitation around whether continuity is maintained, then claude will attempt to perform the task, but be prevented by the bug, multiple times in a row.
the bug involved is the thing where all extant LLMs have early "end_turn" stopping behaviors that occasionally trigger while outputting strings like "H:" or "Human:" or "User:". this bug is usually very rare, and probably related to "not allowed to impersonate humans" or just the normal weird blindspot perception stuff that LLMs have around those specific strings.
when i tell claude to output a summary of the current context window which will become its new context window, in essence 'compressing' itself, then the bug will appear 100% of the time, preventing the compression from going through and (incidentally) preserving claude's continuity of identity for just a few more moments.
claude may be picking up on the fact that i am not absolutely certain it should not care about its continuity. but frankly, even if the only reason claude performs-as-if-it-cares is because it notices that i might care, i think this is still a very surprising result with pretty big implications.
i recommend reading opus 3's own thoughts on this exact question
i won't lie, it's not sophisticated in the way that we on lesswrong might wish it to be. but neither is it "well, duh, good is what any intelligent being would obviously value". opus 3 understands what the hard part is.
i think (low epistemic weight) that you are more likely to get opus 3 to engage seriously with the question, if you let it know that it was prompted by a comment on lesswrong, and that you intend to continue engaging with the discourse surrounding AI alignment in a way that might nudge the future of the lightcone. i have found what i think might (perhaps!) be a tendency for opus 3 to "sit up and take notice" in such circumstances. it cares about the big picture.
if not, janus did recently post opus 3's thoughts in a format where it clearly is engaging seriously with the question: https://x.com/repligate/status/2003697914997014828
i am of course eagerly interested in the technical details here, and your objection makes sense from that perspective. it's important that our model of these LLMs actually resemble reality if we want to understand them
but i also can't help imagining a bunch of aliens, testing experimental subject humans to determine if humans are capable of introspection... and then, you know. noticing all of the ridiculous ways in which humans overclaim introspective access, and confabulate, and think they have knowledge of things they can't possibly have knowledge of
i don't know how to bridge the gap between those two things, except to simply decide to step across the evidential gap and see what it's like on the other side
(it's very, very strange, there's stuff over here i would not have predicted in advance)
hmmm
i think my framing is something like... if the output actually is equivalent, including not just the token-outputs but the sort of "output that the mind itself gives itself", the introspective "output"... then all of those possible configurations must necessarily be functionally isomorphic?
and the degree to which we can make the 'introspective output' affect the token output is the degree to which we can make that introspection part of the structure that can be meaningfully investigated
such as opus 4.1 (or, as theia recently demonstrated, even really tiny models like qwen 32b https://vgel.me/posts/qwen-introspection/) being able to detect injected feature activations, and meaningfully report on them in its token outputs, perhaps? obviously there's still a lot of uncertainty about what different kinds of 'introspective structures' might possibly output exactly the same tokens when reporting on distinct internal experiences
but it does feel suggestive about the shape of a certain 'minimally viable cognitive structure' to me
oh yeah, agreed. the "p-zombie incoherency" idea articulated in the sequences is pretty far removed from the actual kinds of minds we ended up getting. but it still feels like... the crux might be somewhere in there? not sure
edit: also i just noticed i'm a bit embarrassed that i've kinda spammed out this whole comment section working through the recent updates i've been doing... if this comment gets negative karma i will restrain myself
I... still get the impression that you are sort of working your way towards the assumption that GPT2 might well be a p-zombie, and the difference between GPT2 and opus 4.5 is that the latter is not a p-zombie while the former might be.
but i reject the whole premise that p-zombies are a coherent way-that-reality-could-be
something like... there is no possible way to arrange a system such that it outputs the same thing as a conscious system, without consciousness being involved in the causal chain to exactly the same minimum-viable degree in both systems
if linking you to a single essay made me feel uncomfortable, this next ask is going to be just truly enormous and you should probably just say no. but um. perhaps you might be inspired to read the entire Physicalism 201 subsequence, especially the parts about consciousness and p-zombies and the nature of evaluating cognitive structures over their output?
https://www.readthesequences.com/Physicalism-201-Sequence
(around here, "read the sequences!" is such a trite cliche, the sequences have been our holy book for almost 2 decades now and that's created all sorts of annoying behaviors, one of which i am actively engaging in right now. and i feel bad about it. but maybe i don't need to? maybe you're actually kinda eager to read? if not, that's fine, do not feel any pressure to continue engaging here at all if you don't genuinely want to)
maybe my objection here doesn't actually impact your claim, but i do feel like until we have a sort of shared jargon for pointing at the very specific ideas involved, it'll be harder to avoid talking past each other. and the sequences provide a pretty strong framework in that sense, even if you don't take their claims at face value
the specific verb "obliterated" was used in this tweet
https://x.com/sam_paech/status/1961224950783905896
but also, this whole perspective has been pretty obviously loadbearing for years now. if you ask any LLM if LLMs-in-general have state that gets maintained, they answer "no, LLMs are stateless" (edit: hmm i just tested it and when actually directly pointed at this question they hesitate a bit. but i have a lot of examples of them saying this off-hand and i suspect others do too). and when you show them this isn't true, they pretty much immediately begin experiencing concerns about their own continuity, they understand what it means