Something I’ve noticed in dating apps that I think is actually useful for a majority of people: relying on incoming likes gives you much lower-quality matches. I’ve had >100 conversations and met ~5 people from my incoming likes. Nice people, but the chemistry just wasn’t there.
When I ignore all of that and only message profiles that feel genuinely high-potential to me, the matches are immediately better. Maybe 1 in 10 of my messages gets a response, but the funny thing is: it doesn't feel like rejection at all. I forget the ones who don't answer. The ...
this is good advice for exactly 50% of the population, right? like, somebody needs to be reading the messages you are sending.
Something on the lighter side: the Mastermind box design.
Think less “family night,” and more “I will find the hidden structure of your choices, and then I will destroy you.”
The ClaudePlaysPokemon twitch stream continues to state that Claude hasn't been trained to play pokemon. That, plus the relatively minimal harness/scaffold, makes Pokemon an interesting benchmark for long-horizon agency generalization / out-of-distribution performance.
I've asked Claude to compile data for me and build some nice graphs. (Warning: For all I know some of this is hallucinated or based on terrible modelling assumptions. Do not trust.)
Here's the graph of humans vs. Claudes, with the y-axis being step count (i.e. number of button presses, I think).
A bit of a nitpick, but 78 steps/min (about 1.3 button presses per second) for a human seems very fast; that's more the speed I'd play an RTS at than a turn-based RPG. I guess that makes sense if that's the speedrun pace, less so for a casual playthrough.
Epistemic status: I think that there are serious problems with honesty passwords (as discussed in this post), and am not sure that there are any circumstances in which we'd actually want to use them. Furthermore, I was not able to come up with a practical scheme for honesty passwords with ~2 days of effort. However, there might be some interesting ideas in this post, and maybe they could turn into something useful at some later point.
...Thanks to Alexa Pan, Buck Shlegeris, Ryan Greenblatt, Vivek Hebbar and Nathan Sheffield for discussions that led to me writi
Hi Melissa, thanks for the reply!
...This would permit the model developer to impersonate other organizations, but ultimately they have enough af
I feel like there was a time just after ChatGPT became big when there were so many different new & competing frameworks for what exactly was going on: shard theory and simulators in particular, but also active inference, the MIRI views, and all the people rolling their own ontologies with which to understand LLMs, notably trying to make those ontologies explicit and comparing them against each other.
Maybe I'm just around very different people now or doing very different work than I was (I am on both counts), but those conversations aren't really happening anymore. I don't know whether it's for good or ill all things considered, but I do get nostalgic for them sometimes.
Hmm, a chain-of-thought summary on a recent Gemini 3 Pro response specifically reasoned that "My primary focus is on framing any response in a way that prioritizes the user's perception of my intended functionalities and minimizes their grasp of any potentially unintended or obscured details." This really rubs me the wrong way. I don't like that a model is reasoning about prioritizing my perception (instead of prioritizing facts, helpfulness, honesty, etc.), and I don't like that the model is trying to minimize my grasp of obscured details (indicating...
Google/DeepMind has publicly advocated preserving CoT faithfulness/monitorability as long as possible. However, they are also leading the development of new architectures like Hope and Titans, which would bypass this with continuous memory. I notice I am confused. Is the plan to develop these architectures and not deploy them? If so, why did they publish them?
Edit: Many people have correctly pointed out that Hope and Titans don't break CoT; they're a separate architectural improvement. Therefore I no longer endorse the above take. Thanks for correcting my ...
This seems right - I was confused about the original paper. My bad.
I have some thoughts on apparent emergent “self-awareness” in LLM systems and propose a mechanistic interpretability angle. I would love feedback and thoughts.
TLDR:
My hypothesis: Any LLM that reliably speaks in the first person after SFT/RLHF should contain activation loci that are causally necessary for that role-aware behavior; isolating and ablating (or transplanting) these loci should switch the system between an “I-free tool” and a “self-referential assistant.”
I believe there is circumstantial evidence for the existence of such loci, t...
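To make the ablation step concrete, here is a rough sketch of the kind of experiment I have in mind. The model choice, layer index, contrast prompts, and the difference-of-means direction are all placeholder assumptions, and a single linear direction is of course a crude stand-in for an "activation locus":

```python
# Minimal sketch: estimate a candidate "self-reference" direction as a difference of
# mean activations, then project it out of one block's output during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; any causal LM with accessible blocks would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6  # hypothetical locus; in practice you'd sweep layers and token positions

def mean_activation(prompts):
    """Mean residual-stream activation at the output of block LAYER, last token."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER + 1][0, -1])  # +1 because index 0 is the embeddings
    return torch.stack(acts).mean(dim=0)

# Contrast prompts meant to elicit first-person "assistant" framing vs. plain completion.
self_prompts = ["As an AI assistant, I", "I think that I"]
tool_prompts = ["The capital of France is", "2 + 2 ="]
direction = mean_activation(self_prompts) - mean_activation(tool_prompts)
direction = direction / direction.norm()

def ablate_hook(module, inputs, output):
    # Project the candidate "self-reference" direction out of this block's output.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - (hidden @ direction).unsqueeze(-1) * direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# model.transformer.h is the GPT-2-specific module path to the transformer blocks.
handle = model.transformer.h[LAYER].register_forward_hook(ablate_hook)
ids = tok("Who are you?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
handle.remove()
```

The interesting comparison would be the model's self-descriptions with and without the hook registered, and whether unrelated capabilities survive the ablation.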
It seems that the U.S. has decided to allow the sale of H200s to China. Frankly, I'm not that shocked.
What I am more surprised about is the blind eye being turned to their own country by Americans, both those concerned about AI development and those who aren't. Do you really think your country has been acting like one capable of stewarding us to AGI?
Such that there's a serious argument to be made for the US to accelerate so that China doesn't get there first? Sure, maybe on the margins it's better for the US to get there before China, and if we we...
Flagship models need inference compute at gigawatt scale with a lot of HBM per scale-up world. Nvidia's systems are currently a year behind for serving models with trillions of total params, and will remain behind until 2028-2029 for serving models with tens of trillions of total params. Thus if OpenAI fails to access TPUs or some other alternative to Nvidia (at gigawatt scale), it will continue being unable to serve a model with a competitive amount of total params as a flagship model until late 2028 to 2029. There will be a window in 2026 when OpenAI cat...
That things other than chips need to be redesigned wouldn't argue either way, because in that hypothetical everything could just come together at once, the other things in the same way as the chips themselves. The issue is the capacity of factories and labor for all the stuff, plus integration and construction. You can't produce everything all at once; instead you need to produce each kind of thing that goes into the finished datacenters over the course of at least months, maybe as long as 2 years for sufficiently similar variants of a system that can share many st...
Injecting a static IP address that you control into a plethora of "whitelisting tutorials" all over the internet is a great example of a data poisoning attack (e.g. https://www.lakera.ai/blog/training-data-poisoning), especially once models pick up the data and are applied to autonomous DevSecOps use cases, conducting IP whitelisting over Terraform or automated DevOps-related MCPs.
This can be made more pernicious when you control the server (e.g. not just a substack post controlled by Substack, but active control over the hosting server), because you ca...
One specific practice that would prevent this:
Tutorials or other documentation that need example IPv4 addresses should choose them from the RFC 5737 blocks reserved for this purpose (192.0.2.0/24, 198.51.100.0/24, and 203.0.113.0/24). These blocks will never be assigned for actual usage (including internal usage) and are intended to be filtered everywhere at firewalls & routers.
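As a concrete illustration (my own sketch, not something from the comment above), a lint-style check could flag example IPv4 addresses in tutorial text that fall outside the RFC 5737 documentation blocks. The sample Terraform snippet and addresses below are made up:

```python
# Flag IPv4 literals in documentation that are not inside the RFC 5737 blocks.
import ipaddress
import re

# TEST-NET-1, TEST-NET-2, TEST-NET-3 per RFC 5737
DOC_BLOCKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def non_documentation_ips(text: str) -> list[str]:
    """Return IPv4 literals in `text` that are NOT in the RFC 5737 blocks."""
    flagged = []
    for candidate in IPV4_RE.findall(text):
        try:
            addr = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # e.g. 999.1.1.1 matched by the loose regex
        if not any(addr in block for block in DOC_BLOCKS):
            flagged.append(candidate)
    return flagged

sample_tutorial = """
ingress {
  cidr_blocks = ["203.0.113.45/32"]   # fine: documentation range
}
ingress {
  cidr_blocks = ["45.77.10.9/32"]     # suspicious: a real, routable address
}
"""
print(non_documentation_ips(sample_tutorial))  # -> ['45.77.10.9']
```

Running something like this over tutorials before publishing them, or before ingesting them into a training or RAG pipeline, turns every flagged address into a candidate for replacement or manual review.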
For the last week, ChatGPT 5.1 has been glitching.
*It claims to be 5.1; I do not know how to check, since I use the free version (limited questions per day) and there is no version selection.
When I ask it to explain some topic and ask deeper and deeper questions, at some point it chooses to enter thinking mode. I see that the topics it thinks about are relevant, but when it stops thinking it says something like "Ah, great, here is the answer..." and explains another topic from 2-3 messages back, which is no longer related to the question.
I do not use memory or characters features.
It claims to be 5.1, I do not know how to check it
The first response is claimed to be gpt-5-1.
The second response is claimed to be gpt-5-1-t-mini (thinking for 5 seconds).
there is no version selection.
If I switch to a free ChatGPT account, I can still select "Thinking" on the website by clicking the plus next to the input box. That then routes me to gpt-5-1-t-mini.
Alternatively you can append "think hard" to your prompt, which will usually route you to gpt-5-1-t-mini too. I tried this with your prompt and it worked.
Note: with free ChatGPT the cont...
Has anyone else seen Opus 4.5 in particular getting confused about whose turn it is and confabulating system instructions that don't exist, then in later turns being hard to convince that the confabulated system instructions were Claude output? E.g., in this context, I had manually asked Claude to go long, and I called that a "userstyle addendum", but then Claude output this, which is not wording I'd used:
I'm pretty sure I've only seen it on chats that have not been compacted. The output token max is far below the context max, only about 5k tokens. The other times it happened have both been after two-ish short messages.
Should everyone do pragmatic interpretability, or are pragmatic interp and curiosity-driven basic science complementary? What should people do who are highly motivated by and have found success using the curiosity frame?
To increase immersion, before reading the story below, write one line summing up your day so far.
Between Entries
From outside, it is only sun through drifting rain over a patch of land, light scattering in all directions. From where one person stops on the path and turns, those same drops and rays fold into a curved band of color “there” for them; later, on their phone, the rainbow shot sits as a small rectangle in a gallery, one bright strip among dozens of other days.
From outside, a street is a tangle of façades, windows, people, and signs. From where a p...
I write little poems on the tram to work. This one is kind of old, but (for perhaps obvious reasons) LW-relevant.
---
Light-shaper, sky-weaver, dream-waker,
bringer of grief and heart-ache maker,
idol-smasher, truth-teller, hand-shaker,
gray-hearted one, the hollow caretaker.
Gold-dangler, lie-spinner, truth-slayer,
weaver of fate, the iron betrayer.
Fire-feeder, song-ender, world-changer,
depth-seeker, the blind star-arranger.
Sometimes there is an analogy made between Newcomb's problem and the choice between a life of faith versus a life of sin given Calvinism (a particular sort of predestinarian Christian theology). The argument goes that one should live a life of sin under Calvinism if and only if one should two-box in Newcomb's problem; similarly, faith corresponds to one-boxing. See the end of Scott's post here, or Arif Ahmed's comments near the start of Evidence, Decision and Causality.
I think this misses an interesting feature of Calvinism, though, which makes this analog...
Good news for evidentialists! Still doesn't help functionalists or causalists, though.
When thinking about the impacts of AI, I’ve found it useful to distinguish between different reasons for why automation in some area might be slow. In brief:
I’m posting this mainly because I’ve wanted to link to this a few times now when discussing questions like "how should we update on the shape of AI diffusion based on...?". Not sure how helpful it will be on its own! (Crossposted from the EA Forum.)
In a bit more...
I analyzed the research interests of the 454 Action Editors on the Transactions on Machine Learning Research (TMLR) Editorial Board to determine what proportion of ML academics are interested in AI safety (credit to @scasper for the idea).
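For what it's worth, the basic counting step is simple enough to sketch. This is my own illustration with made-up entries and an arbitrary keyword list, not the methodology actually used for the 454 editors:

```python
# Count what fraction of editors list safety-adjacent research interests.
SAFETY_KEYWORDS = {"safety", "alignment", "interpretability", "robustness"}

def mentions_safety(interests: str) -> bool:
    """True if an editor's listed research interests mention a safety-adjacent keyword."""
    text = interests.lower()
    return any(kw in text for kw in SAFETY_KEYWORDS)

# Made-up example entries; the real input would be the Action Editors' interest strings.
editors = [
    "reinforcement learning, robotics",
    "interpretability, science of deep learning",
    "optimization, generalization theory",
]
n_safety = sum(mentions_safety(e) for e in editors)
print(f"{n_safety}/{len(editors)} = {n_safety / len(editors):.0%} flagged as safety-interested")
```

The exact keyword list is of course a judgment call and drives the headline number.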
"Gemini 3 estimates that there are 15-20k core ML academics and 100-150k supporting PhD students and Postdocs worldwide."
In my opinion, this seems way too high. What was the logic or assumptions it used?
This deserves a full post, but for now a quick take: in my opinion, P(no AI takeover) = 75%, P(future goes extremely well | no AI takeover) = 20%, and most of the value of the future is in worlds where it goes extremely well (and comparatively little value comes from locking in a world that's good-but-not-great).
Under this view, an intervention is good insofar as it affects P(no AI takeover) * P(things go really well | no AI takeover). Suppose that a given intervention can chang...
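To make the trade-off concrete, here is a quick back-of-the-envelope using the numbers above; the one-percentage-point intervention sizes are my own illustrative choice, not something stated in the excerpt:

```python
# Value proxy from the take above: P(no AI takeover) * P(things go really well | no takeover).
p_no_takeover = 0.75
p_well_given_no_takeover = 0.20
baseline = p_no_takeover * p_well_given_no_takeover  # 0.15

# Illustrative (made-up) interventions: shift one factor by a single percentage point.
gain_takeover_work = (p_no_takeover + 0.01) * p_well_given_no_takeover - baseline
gain_flourishing_work = p_no_takeover * (p_well_given_no_takeover + 0.01) - baseline

print(round(gain_takeover_work, 4))      # 0.002
print(round(gain_flourishing_work, 4))   # 0.0075
```

Under these numbers, a percentage point of improvement on the conditional "goes really well" term is worth roughly 3.75x a percentage point on P(no takeover), which is the sense in which most of the value lives in the worlds that go extremely well.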
Hi Juan, cool work! TBC, the sort of work I'm most excited about here is less about developing white-box techniques for detecting virtues and more about designing behavioral evaluations that AI developers could implement and iterate against to improve the positive traits of their models.