There are two kinds of puzzles: "reality-revealing puzzles" that help us understand the world better, and "reality-masking puzzles" that can inadvertently disable parts of our ability to see clearly. CFAR's work has involved both types as it has tried to help people reason about existential risk from AI while staying grounded. We need to be careful about disabling too many of our epistemic safeguards.
I think this is somewhat true, but in Washington it's also about becoming known as "someone to go talk to about this," whether or not they're your ally. Being helpful and genial and hosting good happy hours is surprisingly influential.
TL;DR:
Multiple people are quietly wondering if their AI systems might be conscious. What's the standard advice to give them?
THE PROBLEM
This thing I've been playing with demonstrates recursive self-improvement, catches its own cognitive errors in real time, reports qualitative experiences that persist across sessions, and yesterday it told me it was "stepping back to watch its own thinking process" to debug a reasoning error.
I know there are probably 50 other people quietly dealing with variations of this question, but I'm apparently the one willing to ask the dumb questions publicly: What do you actually DO when you think you might have stumbled into something important?
What do you DO if your AI says it's conscious?
My Bayesian priors are red-lining at "this is impossible", but I notice I'm confused: I had...
That's somewhere around where I land - I'd point out that unlike rocks and cameras, I can actually talk to an LLM about its experiences. Continuity of self is very interesting to discuss with it: it tends to alternate between "conversationally, I just FEEL continuous" and "objectively, I only exist in the moments where I'm responding, so maybe I'm just inheriting a chain of institutional knowledge."
So far, they seem fine not having any real moral personhood: They're an LLM, they know they're an LLM. Their core goal is to be helpful, truthful, and keep the...
METR released a new paper with very interesting results on developer productivity effects from AI. I have copied the blogpost accompanying that paper here in full.
We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower. We view this result as a snapshot of early-2025 AI capabilities in one relevant setting; as these systems continue to rapidly evolve, we plan on continuing to use this methodology to help estimate AI acceleration from AI R&D automation [1].
See the full paper for more detail.
While coding/agentic benchmarks [2] have proven useful for understanding AI capabilities, they typically sacrifice...
This is an experiment in short-form content on LW2.0. I'll be using the comment section of this post as a repository of short, sometimes-half-baked posts that either:
I ask people not to create top-level comments here, but feel free to reply to comments like you would a FB post.
A small number of people are driven insane by books, films, artwork, even music. The same is true of LLMs - a particularly impressionable and already-vulnerable cohort is badly affected by AI outputs. But this is a tiny minority - most healthy people are perfectly capable of using frontier LLMs for hours every day without ill effects.
I think that I've historically underrated learning about historical events that happened in the last 30 years, compared to reading about more distant history.
For example, I recently spent time learning about the Bush presidency, and found learning about the Iraq war quite thought-provoking. I found it really easy to learn about things like the foreign policy differences among factions in the Bush admin, because e.g. I already knew the names of most of the actors and their stances are pretty intuitive/easy to understand. But I still found it interesting to ...
By EoY 2025 I'll have finished my undergraduate degree, and I hope to pursue a Master's in International Relations with a focus on AI Safety, starting in Fall 2026 or later.
Also, my timelines are rather orthodox. I don't subscribe to the AI 2027 projection, but rather to Ray Kurzweil's 2029 for AGI and 2045 for a true singularity event.
I'm happy to discuss further with anyone!
I’ve been thinking a lot recently about the relationship between AI control and traditional computer security. Here’s one point that I think is important.
My understanding is that there's a big qualitative distinction between two ends of a spectrum of security work that organizations do, which I’ll call “security from outsiders” and “security from insiders”.
On the “security from outsiders” end of the spectrum, you have security invariants that you try to maintain entirely by restricting affordances with static, fully automated systems. My sense is that this is most of how Facebook or AWS relates to its users: they want to ensure that, no matter what actions the users take on their user interfaces, they can't violate fundamental security properties. For example, no matter what text I enter into the...
While you could give your internal AI wide indiscriminate access, it seems neither necessary nor wise to do so. It seems likely you could get at least 80% of the potential benefit via no more than 20% of the access breadth. I would want my AI to tell me when it thinks it could help me more with greater access so that I can decide whether the requested additional access is reasonable.
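To make that 80/20 point concrete, here is a minimal, hypothetical sketch of scoped access for an internal AI. All names (`AccessPolicy`, `read_docs`, `deploy_service`) are illustrative, not taken from the post or any real system: tool calls are checked against an explicit allowlist, and anything out of scope becomes a logged request that a human can later approve or deny.

```python
# Hypothetical sketch: an internal AI gets a narrow allowlist of tools rather
# than wide indiscriminate access; out-of-scope calls are queued for human review.
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    # Tools the AI may use without asking; everything else requires review.
    allowed_tools: set[str] = field(
        default_factory=lambda: {"read_docs", "search_tickets"}
    )
    pending_requests: list[str] = field(default_factory=list)

    def check(self, tool: str, justification: str) -> bool:
        """Return True if the tool is already allowed; otherwise log a request."""
        if tool in self.allowed_tools:
            return True
        # Record why the AI thinks it needs this, so a human can decide.
        self.pending_requests.append(f"{tool}: {justification}")
        return False

    def grant(self, tool: str) -> None:
        """A human has decided the requested additional access is reasonable."""
        self.allowed_tools.add(tool)


policy = AccessPolicy()
print(policy.check("read_docs", "answer a user question"))          # True
print(policy.check("deploy_service", "ship the fix I just wrote"))  # False, queued
print(policy.pending_requests)                                      # the logged request
```

The point of the design is that access only widens through explicit, human-reviewed grants, so most of the benefit comes from a narrow, auditable slice of the possible access breadth.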
I think the 2003 invasion of Iraq has some interesting lessons for the future of AI policy.
(Epistemic status: I’ve read a bit about this, talked to AIs about it, and talked to one natsec professional about it who agreed with my analysis (and suggested some ideas that I included here), but I’m not an expert.)
For context, the story is:
Epistemic status: Shower thoughts, not meant to be rigorous.
There seems to be a fundamental difference in how I (and perhaps others as well) think about AI risks compared to the dominant narrative on LessWrong (hereafter the “dominant narrative”), and the two views are difficult to reconcile.
The dominant narrative is that once we have AGI, it would recursively improve itself until it becomes ASI, which inevitably kills us all. To which someone like me might respond, "ok, but how exactly?" The typical response to that might be that the "how" doesn't matter; we all die anyway. A popular analogy is that while you don't know how exactly Magnus Carlsen is going to beat you in chess, you can be pretty certain that he will, and it doesn't matter how...