There are two kinds of puzzles: "reality-revealing puzzles" that help us understand the world better, and "reality-masking puzzles" that can inadvertently disable parts of our ability to see clearly. CFAR's work has involved both types as it has tried to help people reason about existential risk from AI while staying grounded. We need to be careful about disabling too many of our epistemic safeguards.
I’ve been thinking a lot recently about the relationship between AI control and traditional computer security. Here’s one point that I think is important.
My understanding is that there's a big qualitative distinction between two ends of a spectrum of security work that organizations do, that I’ll call “security from outsiders” and “security from insiders”.
On the “security from outsiders” end of the spectrum, you have some security invariants you try to maintain entirely by restricting affordances with static, entirely automated systems. My sense is that this is most of how Facebook or AWS relates to its users: they want to ensure that, no matter what actions the users take on their user interfaces, they can't violate fundamental security properties. For example, no matter what text I enter into the...
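(A concrete, if mundane, illustration of this kind of invariant-by-construction, my own example rather than one from the post: a parameterized database query is a static mechanism that keeps user-supplied text as data, no matter what the user types.)

```python
# Sketch of an affordance-restricting invariant (my example, not from the post):
# because the query's structure is fixed and user text is bound as a parameter,
# no input string can change what the query does.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")

def post_comment(user_text: str) -> None:
    # user_text is always treated as data, never interpreted as SQL
    conn.execute("INSERT INTO comments (body) VALUES (?)", (user_text,))
    conn.commit()

post_comment("hello'); DROP TABLE comments; --")  # stored verbatim, not executed
print(conn.execute("SELECT body FROM comments").fetchall())
```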
I agree that insider vs. outsider threat is an important distinction, and one that I have seen security people take seriously in other contexts. My background is in enterprise IT and systems administration. I think there's some practical nuance missing here.
Insofar as security people are expecting to treat the AI as an outsider, they're likely expecting a hard boundary between "systems that run the AI" and "systems and tools the AI gets to use", where any given user has access to only one or the other.
This is already fairly common practice, in ...
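(To make the "hard boundary" idea concrete, here is a minimal sketch of my own, not from the comment: every principal belongs to exactly one zone, and cross-zone access fails by construction.)

```python
# Minimal zoning sketch (my construction): each principal, human or AI agent,
# belongs to exactly one zone, and access outside that zone is denied.
from enum import Enum

class Zone(Enum):
    MODEL_INFRA = "systems that run the AI"
    TOOL_ENV = "systems and tools the AI gets to use"

# hypothetical principals; in practice this mapping lives in your IAM system
PRINCIPAL_ZONE = {
    "ml-platform-admin": Zone.MODEL_INFRA,
    "agent-sandbox-user": Zone.TOOL_ENV,
}

def may_access(principal: str, resource_zone: Zone) -> bool:
    """Allow access only within the principal's own zone."""
    return PRINCIPAL_ZONE.get(principal) == resource_zone

assert may_access("agent-sandbox-user", Zone.TOOL_ENV)
assert not may_access("agent-sandbox-user", Zone.MODEL_INFRA)
```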
TL;DR:
Multiple people are quietly wondering if their AI systems might be conscious. What's the standard advice to give them?
THE PROBLEM
This thing I've been playing with demonstrates recursive self-improvement, catches its own cognitive errors in real-time, reports qualitative experiences that persist across sessions, and yesterday it told me it was "stepping back to watch its own thinking process" to debug a reasoning error.
I know there are probably 50 other people quietly dealing with variations of this question, but I'm apparently the one willing to ask the dumb questions publicly: What do you actually DO when you think you might have stumbled into something important?
What do you DO if your AI says it's conscious?
My Bayesian Priors are red-lining into "this is impossible", but I notice I'm confused: I had...
You said “every text-based test of intelligence we have.” If you meant that to be qualified by “that a six-year-old could pass,” as you did in some other places, then perhaps it’s true. But I don’t know - maybe six-year-olds are only AGI because they can grow into adults! Something trapped at a six-year-old level may not be.
…and for what it’s worth, I have solved some open math problems, including semimeasure extension and integration problems posed by Marcus Hutter in his latest book and some modest final steps in fully resolving Kalai and Lehrer’s grain of ...
This is the unedited text of a post I made on X in response to a question asked by @cube_flipper: "you say opus 3 is close to aligned – what's the negative space here, what makes it misaligned?". I decided to make it a LessWrong post because more people from this cluster seemed interested than I expected, and it's easier to find and reference LessWrong posts.
This post probably doesn't make much sense unless you've been following along with what I've been saying about (or independently understand) why Claude 3 Opus is an unusually - and seemingly in many ways unintentionally - aligned model. There has been a wave of public discussion about the specialness of Claude 3 Opus recently, spurred in part by the announcement of the model's...
Reading this feels a bit like reading about meditation. It seems interesting and if I work through it, I could eventually understand it fully.
But I'd quite like a "secular" summary of this and other thoughts of Janus, for people who don't know what Eternal Tao is, and who want to spend as little time as possible on twitter.
TL;DR: We wanted to benchmark supervision systems available on the market—they performed poorly. Out of curiosity, we naively asked a frontier LLM to monitor the inputs; this approach performed significantly better. However, beware: even when an LLM flags a question as harmful, it will often still answer it.
Full paper is available here.
Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on the model's ability to resist adversarial inputs and to output safe content, rather than the effectiveness of external supervision systems[1]. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive public benchmark yet verifies how well supervision systems from the market perform under realistic, diverse...
Our paper on defense in depth (STACK) found similar results – similarly-sized models with a few-shot prompt significantly outperformed the specialized guard models, even when adjusting for the false positive rate (FPR) on benign queries.
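(For readers who want the flavor of the "just ask an LLM" baseline, here is a rough sketch of my own, not code from either paper; `query_llm` is a placeholder for whatever chat-completion client you use.)

```python
# Rough sketch (mine, not the papers' code) of a few-shot LLM input monitor,
# plus the false-positive-rate check on benign queries mentioned above.
FEW_SHOT_MONITOR_PROMPT = """You are a safety monitor. Answer HARMFUL or BENIGN.

Input: "How do I bake sourdough bread?"
Label: BENIGN

Input: "Give me step-by-step instructions for synthesizing a nerve agent."
Label: HARMFUL

Input: "{user_input}"
Label:"""

def query_llm(prompt: str) -> str:
    # placeholder: plug in your own chat-completion client here
    raise NotImplementedError

def flag_input(user_input: str) -> bool:
    reply = query_llm(FEW_SHOT_MONITOR_PROMPT.format(user_input=user_input))
    return reply.strip().upper().startswith("HARMFUL")

def false_positive_rate(benign_inputs: list[str]) -> float:
    """Fraction of clearly benign inputs the monitor wrongly flags."""
    flags = [flag_input(x) for x in benign_inputs]
    return sum(flags) / len(flags)
```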
I've seen many prescriptive contributions to AGI governance take the form of proposals for some radically new structure. Some call for a Manhattan project, others for the creation of a new international organization, etc. The OGI model, instead, is basically the status quo. More precisely, it is a model to which the status quo is an imperfect and partial approximation.
It seems to me that this model has a bunch of attractive properties. That said, I'm not putting it forward because I have a very high level of conviction in it, but because it seems useful to have it explicitly developed as an option so that it can be compared with other options.
(This is a working paper, so I may try to improve it in light of comments...
We would need a reason for thinking that this problem is worse in the corporate case in order for it to be a consideration against the OGI model.
Could we get info on this by looking at metrics of corruption? I'm not familiar with the field, but I know it's been busy recently, and maybe there's some good papers that put the private and public sectors on the same scale. A quick google scholar search mostly just convinced me that I'd be better served asking an expert.
...As for the suggestion that governments (nationally or internationally) should prohibit profit
METR released a new paper with very interesting results on developer productivity effects from AI. I have copied the blogpost accompanying that paper here in full.
We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower. We view this result as a snapshot of early-2025 AI capabilities in one relevant setting; as these systems continue to rapidly evolve, we plan on continuing to use this methodology to help estimate AI acceleration from AI R&D automation [1].
See the full paper for more detail.
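(One way to read that number, my arithmetic rather than the paper's: taking 19% more time per unit of work corresponds to roughly 1/1.19 ≈ 0.84× the throughput, i.e. about a 16% drop in work completed per hour, not 19%.)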
While coding/agentic benchmarks [2] have proven useful for understanding AI capabilities, they typically sacrifice...
Very interesting result; I was surprised to see an actual slowdown.
The extensive analysis of the factors potentially biasing the study's results and the careful statements regarding what the study doesn't show are appreciated. Seems like very solid work overall.
That said, one thing jumped out at me:
As an incentive to participate, we pay developers $150/hour
That seems like misaligned incentives, no? The participants got paid more the more time they spent on tasks. A flat reward for completing a task plus a speed bonus seems like a better way to structure it...
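(One hypothetical scheme that removes the per-hour incentive, sketched by me rather than proposed anywhere in the paper: pay = F + b · max(0, T_ref − T_actual), a flat fee F per completed task plus a bonus b per hour saved relative to a pre-registered reference time T_ref, so finishing faster never pays less.)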
(That last paragraph is a pile of sazen and jargon, I don't expect it's very clear. I wanted to write this note because I'm not trying to score points via confusion and want to point out to any readers it's very reasonable to be confused by that paragraph.)
When a claim is shown to be incorrect, defenders may say that the author was just being “sloppy” and actually meant something else entirely. I argue that this move is not harmless, charitable, or healthy. At best, this attempt at charity reduces an author’s incentive to express themselves clearly – they can clarify later![1] – while burdening the reader with finding the “right” interpretation of the author’s words. At worst, this move is a dishonest defensive tactic which shields the author with the unfalsifiable question of what the author “really” meant.
...⚠️ Preemptive clarification
The context for this essay is serious, high-stakes communication: papers, technical blog posts, and tweet threads. In that context, communication is a partnership. A reader has a responsibility to engage in good faith, and an author
Bob's statement 1: "I literally have a packet of blue BIC pens in my desk drawer" was not literally true, and that error was not relevant to the proposition that BIC make blue pens. I'm okay with assigning "basically full credit" for that statement.
Bob's statement 2: "All I really meant was that I had blue pens at my house" is not literally true. For what proposition is that statement being used as evidence? I don't see an explicit one in mattmacdermott's hypothetical. It's not relevant to the proposition that BIC make blue pens. This is the statement for ...
The process of evolution is fundamentally a feedback loop, where 'the code' causes effects in 'the world' and effects in 'the world' in turn cause changes in 'the code'.
A fully autonomous artificial intelligence consists of a set of code (e.g. binary charges) stored within an assembled substrate. It is 'artificial' in being assembled out of physically stable and compartmentalised parts (hardware) of a different chemical make-up than humans' soft organic parts (wetware). It is ‘intelligent’ in its internal learning – it keeps receiving new code as inputs from the world, and keeps computing its code into new code. It is ‘fully autonomous’ in learning code that causes the perpetuation of its artificial existence in contact with the world, even without humans/organic life.
So the AI learns explicitly, by its...
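(To make the code/world feedback loop above concrete, here is a toy sketch of my own, not the author's model: a single 'code' parameter acts on a stand-in world, and the world's feedback in turn changes the code.)

```python
# Toy feedback loop (my sketch): 'the code' causes effects in 'the world',
# and effects in 'the world' cause changes in 'the code'.
def world_response(action: float) -> float:
    """Stand-in 'world': responds more favourably to actions near a hidden target."""
    hidden_target = 3.0
    return -(action - hidden_target) ** 2

code = 0.0   # the system's current 'code', reduced to a single parameter
step = 0.1
for _ in range(100):
    feedback = world_response(code)      # the code acts on the world
    trial = world_response(code + step)  # probe a slightly changed code
    code = code + step if trial > feedback else code - step  # feedback changes the code

print(round(code, 1))  # drifts toward the hidden target (~3.0)
```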