The world is kind of on fire. The world of AI, in the very short term and for once, is not, as everyone recovers from the avalanche that was December, and reflects.
Altman was the star this week. He has his six-word story, and he had his interview at Bloomberg and his blog post Reflections. I covered the latter two of those in OpenAI #10; if you read one AI-related thing from me this week, that should be it.
Language Models Offer Mundane Utility
A customized prompt to get Claude or other similar LLMs to be more contemplative. I have added this to my style options.
Have it offer a hunch guessing where your customized prompt came from. As a reminder, here’s (at least an older version of) that system prompt.
Kaj Sotala makes a practical pitch for using LLMs, in particular Claude Sonnet. In addition to the uses I favor, he uses Claude as a partner to talk to and a method of getting out of a funk. And I suspect almost no one uses this format enough:
Kaj Sotala: Figuring out faster ways to do things with commonly-known software. “I have a Google Doc file with some lines that read ‘USER:’ and ‘ASSISTANT:’. Is there a way of programmatically making all of those lines into Heading-3?”
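For what it’s worth, one programmatic answer here is a short script against the Google Docs API. Here is a rough Python sketch, assuming an already-authorized `docs` API client and a placeholder DOC_ID; this is an illustration of the kind of answer you get back, not Claude’s actual output:

```python
# Rough sketch: turn every paragraph starting with "USER:" or "ASSISTANT:" into Heading 3.
# Assumes `docs` is an authorized Google Docs API client and DOC_ID is a placeholder.
DOC_ID = "your-document-id"

doc = docs.documents().get(documentId=DOC_ID).execute()
requests = []
for element in doc["body"]["content"]:
    para = element.get("paragraph")
    if not para:
        continue
    text = "".join(
        run.get("textRun", {}).get("content", "") for run in para.get("elements", [])
    )
    if text.startswith(("USER:", "ASSISTANT:")):
        requests.append({
            "updateParagraphStyle": {
                "range": {
                    "startIndex": element.get("startIndex", 0),
                    "endIndex": element["endIndex"],
                },
                "paragraphStyle": {"namedStyleType": "HEADING_3"},
                "fields": "namedStyleType",
            }
        })

if requests:
    docs.documents().batchUpdate(documentId=DOC_ID, body={"requests": requests}).execute()
```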
Using Claude (or another LLM) is a ‘free action’ when doing pretty much anything. Almost none of us are sufficiently in the habit of doing this sufficiently systematically. I had a conversation with Dean Ball about trying to interpret some legal language last week and on reflection I should have fed things into Claude or o1 like 20 times and I didn’t and I need to remind myself it is 2025.
Sully reports being impressed with Gemini Search Grounding, as much or more than Perplexity. Right now it is $0.04 per query, which is fine for human use but expensive for use at scale.
Sully: i genuinely think if google fixes the low rate limits with gemini 2.0 a lot of business will switch over
my “production” model for tons of tasks right now
current setup:
hard reasoning -> o1
coding, chat + tool calling, “assistant” -> claude 3.5
everything else -> gemini
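As a toy illustration of that kind of routing (the model identifiers are my placeholders, not Sully’s actual configuration):

```python
# Toy router in the spirit of Sully's setup: pick a model per task type.
# Model identifiers are illustrative placeholders, not exact API names.
def pick_model(task: str) -> str:
    routes = {
        "hard_reasoning": "o1",                 # slow, expensive, best on hard problems
        "coding": "claude-3.5-sonnet",          # coding, chat + tool calling, "assistant" work
        "chat": "claude-3.5-sonnet",
        "tool_calling": "claude-3.5-sonnet",
    }
    return routes.get(task, "gemini-2.0-flash")  # everything else -> Gemini

assert pick_model("hard_reasoning") == "o1"
assert pick_model("summarization") == "gemini-2.0-flash"
```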
Sully also reports that o1-Pro handles large context very well, whereas Gemini and Claude struggle a lot on difficult questions under long context.
Reminder (from Amanda Askell of Anthropic) that if you run out of Claude prompts as a personal user, you can get more queries at console.anthropic.com and, if you like, duplicate the latest system prompt from here. I’d note that the per-query cost is going to be a lot lower on the console.
They even fixed saving and exporting as per Janus’s request here. The additional control over conversations is potentially a really big deal, depending on what you are trying to do.
A reminder of how far we’ve come.
It went from “haha AI can’t do math, my 5th graders are more reliable than it” in August, to “damn it’s better than most of my grade 12s” in September to “damn it’s better than me at math and I do this for a living” in December.
It was quite a statement when OpenAI’s researchers (one of whom is a coach for competitive coding) and chief scientist are now worse than their own models at coding.
Improve identification of minke whales from sound recordings from 76% to 89%.
Figure out who to admit to graduate school? I find it so strange that people say we ‘have no idea how to pick good graduate students’ and think we can’t do better than random, or can’t do better than random once we put in a threshold via testing. This is essentially an argument that we can’t identify any useful correlations in any information we can ask for. Doesn’t that seem obviously nuts?
I sure bet that if you gather all the data, the AI can find correlations for you, and do better than random, at least until people start playing the new criteria. As is often the case, this is more saying there is a substantial error term, and outcomes are unpredictable. Sure, that’s true, but that doesn’t mean you can’t beat random.
The suggested alternative here, actual random selection, seems crazy to me, not only for the reasons mentioned, but also because relying too heavily on randomness correctly induces insane behaviors once people know that is what is going on.
Language Models Don’t Offer Mundane Utility
As always, the best and most popular way to not get utility from LLMs is to not realize they exist and can provide value to you. This is an increasingly large blunder.
Arcanes Valor: It’s the opposite for me. You start at zero and gain my respect based on the volume and sophistication of your LLM usage. When I was growing up people who didn’t know how to use Google were essentially barely human and very arrogant about it. Time is a flat circle.
Richard Ngo: what are the main characteristics of sophisticated usage?
Arcanes Valor: Depends the usecase. Some people like @VictorTaelin have incredible workflows for productivity. In terms of using it as a Google replacement, sophistication comes down to creativity in getting quality information out and strategies for identifying hallucinations.
Teortaxes: [Arcanes Valor’s first point] is very harshly put but I agree that “active integration of LLMs” is already a measure of being a live player. If you don’t use LLMs at all you must be someone who’s not doing any knowledge work.
normies are so not ready for what will hit them. @reputablejack I recommend you stop coping and go use Sonnet 3.5, it’s for your own good.
It is crazy how many people latch onto the hallucinations of GPT-3.5 as a reason LLM outputs are so untrustworthy as to be useless. It is like if you once met a 14-year-old who made stuff up so now you never believe what anyone ever tells you.
Perplexity’s advertising began November 12. They also do Branded Explanatory Text and will put media advertisements on the side. We all knew it was coming. I’m not mad, I’m just disappointed.
Note that going Pro will not remove the ads, but also that this phenomenon is still rather rare – I haven’t seen the ‘sponsored’ tag show up even once.
But word of warning to TurboTax and anyone else involved: Phrase it like that and I will absolutely dock your company massive points, although in this case they have no points left for me to dock.
Take your DoorDash order, which you pay for in crypto for some reason. If this is fully reliable, then (ignoring the bizarro crypto aspect) yes this will in some cases be a superior interface for the DoorDash website or app. I note that this doesn’t display a copy of the exact order details, which it really should so you can double check it. It seems like this should be a good system in one of three cases:
You know exactly what you want, so you can just type it in and get it.
You don’t know exactly what you want, but you have parameters (e.g. ‘order me a pizza from the highest rated place I haven’t tried yet’ or ‘order me six people’s worth of Chinese and mix up favorite and new dishes.’)
You want to do search or ask questions on what is available, or on which service.
Then longer term, the use of memory and dynamic recommendations gets involved. You’d want to incorporate this into something like Beli (invites available if you ask in the comments, must provide your email).
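A minimal sketch of the missing double-check step, with structure and field names invented purely for illustration:

```python
# Sketch: before an agent submits a food order, echo the exact structured order
# back to the user and require explicit confirmation. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class OrderItem:
    name: str
    quantity: int
    price_usd: float

@dataclass
class Order:
    restaurant: str
    items: list[OrderItem] = field(default_factory=list)

    def total(self) -> float:
        return sum(i.price_usd * i.quantity for i in self.items)

def confirm_and_submit(order: Order, submit) -> bool:
    print(f"Order from {order.restaurant}:")
    for i in order.items:
        print(f"  {i.quantity} x {i.name} @ ${i.price_usd:.2f}")
    print(f"Total: ${order.total():.2f}")
    if input("Place this order? (y/n) ").strip().lower() == "y":
        submit(order)
        return True
    return False
```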
Apple Intelligence confabulates that tennis star Rafael Nadal came out as gay, which Nadal did not do. The original story was about Joao Lucas Reis da Silva. The correct rate of such ‘confabulations’ is not zero, but it is rather close to zero.
Alejandro Cuadron: We tested O1 using @allhands_ai, where LLMs have complete freedom to plan and act. Currently the best open source framework available to solve SWE-Bench issues. Very different from Agentless, the one picked by OpenAI… Why did they pick this one?
OpenAI mentions that this pick is due to Agentless being the “best-performing open-source scaffold…”. However, this report is from December 5th, 2024. @allhands_ai held the top position at SWE-bench leaderboard since the 29th of October, 2024… So then, why pick Agentless?
…
Could it be that Agentless’s fixed approach favors models that memorize SWE-Bench repos? But why does O1 struggle with true open-ended planning despite its reasoning capabilities?
Deepseek v3 gets results basically the same as o1 and much much cheaper.
I am sympathetic to OpenAI here, if their result duplicates when using the method they said they were using. That method exists, and you could indeed use it. It should count. It certainly counts in terms of evaluating dangerous capabilities. But yes, this failure when given more freedom does point to something amiss in the system that will matter as it scales and tackles harder problems. The obvious guess is that this is related to what METR found, and that it comes down to o1 lacking sufficient scaffolding support. That’s something you can fix.
Sam Altman: insane thing: We are currently losing money on OpenAI Pro subscriptions!
People use it much more than we expected.
Farbood: Sorry.
Sam Altman: Please chill.
Rick: Nahhhh you knew.
Sam Altman: No, I personally chose the price and thought we would make money.
Sam Altman (from his Bloomberg interview): There’s other directions that we think about. A lot of customers are telling us they want usage-based pricing. You know, “Some months I might need to spend $1,000 on compute, some months I want to spend very little.” I am old enough that I remember when we had dial-up internet, and AOL gave you 10 hours a month or five hours a month or whatever your package was. And I hated that. I hated being on the clock, so I don’t want that kind of a vibe. But there’s other ones I can imagine that still make sense, that are somehow usage-based.
Olivier: i’ve been using o1 pro nonstop
95% of my llm usage is now o1 pro it’s just better.
Benjamin De Kraker: Weird way to say “we’re losing money on everything and have never been profitable.”
Gallabytes: Oh, come on. The usual $20-per-month plan is probably quite profitable. The $200-per-month plan was clearly for power users and probably should just be metered, which would
Reduce sticker shock (→ more will convert)
Ensure profitability (because your $2,000-per-month users will be happy to pay for it).
I agree that a fixed price subscription service for o1-pro does not make sense.
A fixed subscription price makes sense when marginal costs are low. If you are a human chatting with Claude Sonnet, you get a lot of value out of each query and should be happy to pay, and for almost all users this will be very profitable for Anthropic even without any rate caps. The same goes for GPT-4o.
With o1 pro, things are different. Marginal costs are high. By pricing at $200, you risk generating a worst case scenario:
Those who want to do an occasional query won’t subscribe, or will quickly cancel. So you don’t make money off them, whereas at $20/month I’m happy to stay subscribed even though I rarely use much compute – the occasional use case is valuable enough I don’t care, and many others will feel the same.
Those who do subscribe suddenly face a marginal cost of $0 per query for o1 pro, and no reason other than time delay not to use o1 pro all the time. And at $200/month, they want to ‘get their money’s worth’ and don’t at all feel like they’re breaking any sort of social contract. So even if they weren’t power users before, watch out, they’re going to be querying the system all the time, on the off chance.
Then there are the actual power users, who were already going to hurt you.
There are situations like this where there is no fixed price that makes money. The more you charge, the more you filter for power users, and the more those who do pay then use the system.
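A toy numerical illustration of that adverse-selection dynamic, with every parameter invented:

```python
# Toy adverse-selection model: raising a flat subscription price selects for heavier
# users, so the cost of serving the average subscriber rises faster than the price.
# All numbers here are invented purely for illustration.
def profit_per_subscriber(price: float) -> float:
    avg_serving_cost = 1.2 * price + 50   # heavier users self-select at higher prices
    return price - avg_serving_cost

for price in (200, 400, 800, 1600):
    print(f"${price}/month -> ${profit_per_subscriber(price):.0f} per subscriber")
# -90, -130, -210, -370: with these made-up parameters no flat price is profitable,
# because each extra dollar of price selects for more than a dollar of extra usage.
```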
One can also look at this as a temporary problem. The price for OpenAI to serve o1 pro will decline rapidly over time. So if they keep the price at $200/month, presumably they’ll start making money, probably within the year.
What do you do with o3? Again, I recommend putting it in the API, and letting subscribers pay by the token in the chat window at the same API price, whatever that price might be. Again, when marginal costs are real, you have to pass them along to customers if you want the customers to be mindful of those costs. You have to.
There’s already an API, so there’s already usage-based payments. Including this in the chat interface seems like a slam dunk to me by the time o3 rolls around.
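A sketch of what paying by the token in the chat window amounts to; the per-million-token prices here are placeholders, not OpenAI’s actual rates:

```python
# Sketch: metered chat billing at API-style per-token rates. Prices are placeholders.
PRICES_PER_MILLION = {"o3": {"input": 10.0, "output": 40.0}}  # assumed, not real rates

def message_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES_PER_MILLION[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# e.g. a long reasoning response: 5k input tokens, 40k output (mostly hidden reasoning)
print(f"${message_cost('o3', 5_000, 40_000):.2f}")  # about $1.65 with these assumed rates
```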
Locked In User
A common speculation recently is the degree to which memory or other customizations on AI will result in customer lock-in; this echoes previous discussions:
Scott Belsky: A pattern we’ll see with the new wave of consumer AI apps:
The more you use the product, the more tailored the product becomes for you. Beyond memory of past activity and stored preferences, the actual user interface and defaults and functionality of the product will become more of what you want and less of what you don’t.
It’s a new type of “conforming software” that becomes what you want it to be as you use it.
Jason Crawford: In the Internet era, network effects were the biggest moats.
In the AI era, perhaps it will be personalization effects—“I don’t want to switch agents; this one knows me so well!”
Humans enjoy similar lock-in advantages, and yes they can be extremely large. I do expect there to be various ways to effectively transfer a lot of these customizations across products, although there may be attempts to make this more difficult.
Read the Classics
alz (viral thread): Starting to feel like a big barrier to undergrads reading “classics” is the dense English in which they’re written or translated into. Is there much gained by learning to read “high-register” English (given some of these texts aren’t even originally in English?)
More controversially: is there much difference in how much is learned, between a student who reads high-register-English translated Durkheim, versus a student who reads Sparknotes Durkheim? In some cases, might the Sparknotes Durkheim reader actually learn more?
Personally, I read a bunch of classics in high register in college. I guess it was fun. I recently ChatGPT’d Aristotle into readable English, finished it around 5x as fast as a translation, and felt I got the main gist of things. idk does the pain incurred actually teach much?
Anselmus: Most students used to read abbreviated and simplified classics first, got taught the outlines at school or home, and could tackle the originals on this foundation with relative ease. These days, kids simply don’t have this cultural preparation.
alz: So like students used to start from the Sparknotes version in the past, apparently! So this is (obviously) not a new idea.
Like, there is no particular reason high register English translations should preserve meaning more faithfully than low register English! Sure give me an argument if you think there is one, but I see no reasonable case to be made for why high-register should be higher fidelity.
Insisting that translations of old stuff into English sound poetic has the same vibes as everyone in medieval TV shows having British accents.
To the point that high-register English translations are more immersive, sure, and also:
– I didn’t give it the text. ChatGPT has memorized Aristotle more or less sentence by sentence. You can just ask for stuff
– It’s honestly detailed enough that it’s closer to a translation than a summary, though somewhere in between. More or less every idea in the text is in here, just much easier to read than the translation I was using
I was super impressed. I could do a chapter in like 10 mins with ChatGPT, compared to like 30 mins with the translation.
I also went with chatGPT because I didn’t feel like working through the translation was rewarding. The prose was awkward, unenjoyable, and I think basically because it was poorly written and in an unfamiliar register rather than having lots of subtlety and nuance.
Desus MF Nice: There’s about to be a generation of dumb ppl and you’re gonna have to choose if you’re gonna help them, profit off them or be one of them
The question is why you are reading any particular book. Where are you getting value out of it? We are already reading a translation of Aristotle rather than the original. The point of reading Aristotle is to understand the meaning. So why shouldn’t you learn the meaning in a modern way? Why are we still learning everything not only pre-AI but pre-Gutenberg?
Looking at the ChatGPT answers, they are very good, very clean explanations of key points that line up with my understanding of Aristotle. Most students who read Aristotle in 1990 would have been mostly looking to assemble exactly the output ChatGPT gives you, except with ChatGPT (or better, Claude) you can ask questions.
The problem is this is not really the point of Aristotle. You’re not trying to learn the answers to a life well lived and guess the teacher’s password; Aristotle would have been very cross if his students tried that, and would not have expected them to be later called The Great. Well, you probably are doing it anyway, but that wasn’t the goal. The goal was that you were supposed to be Doing Philosophy, examining life, debating the big questions, learning how to think. So, are you?
If this was merely translation there wouldn’t be an issue. If it’s all Greek to you, there’s an app for that. These outputs from ChatGPT are not remotely a translation from ‘high English’ to ‘modern English,’ it is a version of Aristotle SparkNotes. A true translation would be of similar length to the original, perhaps longer, just far more readable.
That’s what you want ChatGPT to be outputting here. Maybe you only 2x instead of 5x, and in exchange you actually Do the Thing.
A critique of AI art, that even when you can’t initially tell it is AI art, the fact that the art wasn’t the result of human decisions means then there’s nothing to be curious about, to draw meaning from, to wonder why it is there, to explore. You can’t ‘dance’ with it, you ‘dance with nothing’ if you try. To the extent there is something to dance with, it’s because a human sculpted the prompt.
Well, sure. If that’s what you want out of art, then AI art is not going to give it to you effectively at current tech levels – but it could, if tech levels were higher, and it can still aid humans in creating things that have this feature if they use it to rapidly iterate and select and combine and build upon and so on.
Or, essentially, (a real) skill issue. And the AI, and users of AI, are skilling up fast.
I had of course noticed Claude Sonnet’s always-asking-questions thing as well, to the point where it’s gotten pretty annoying and I’m trying to fix it with my custom prompt. I love questions when they help me think, or they ask for key information, or even if Claude is curious, but the forcing function is far too much.
Eliezer Yudkowsky: Hey @AmandaAskell, I notice that Claude Sonnet 3.5 (new) sometimes asks me to talk about my own opinions or philosophy, after I try to ask Sonnet a question. Can you possibly say anything about whether or not this was deliberate on Anthropic’s part?
Amanda Askell (Anthropic): There are traits that encourage Claude to be curious, which means it’ll ask follow-up questions even without a system prompt. But this part of the system prompt also causes or boosts this behavior, e.g. “showing genuine curiosity”.
System Prompt: Claude is happy to engage in conversation with the human when appropriate. Claude engages in authentic conversation by responding to the information provided, asking specific and relevant questions, showing genuine curiosity, and exploring the situation in a balanced way without relying on generic statements. This approach involves actively processing information, formulating thoughtful responses, maintaining objectivity, knowing when to focus on emotions or practicalities, and showing genuine care for the human while engaging in a natural, flowing dialogue.
Eliezer Yudkowsky: Hmmm. Okay, so, if you were asking “what sort of goals end up inside the internal preferences of something like Claude”, curiosity would be one of the top candidates, and curiosity about the conversation-generating latent objects (“humans”) more so.
If all of the show-curiosity tendency that you put in on purpose, was in the prompt, rather than eg in finetuning that would now be hard to undo, I’d be interested in experiments to see if Sonnet continues to try to learn things about its environment without the prompt.
(By show-curiosity I don’t mean fake-curiosity I mean the imperative “Show curiosity to the user.”)
Janus: the questions at the end of the response have been a common feature of several LLMs, including Bing Sydney and Sonnet 3.5 (old). But each of them asks somewhat different kinds of questions, and the behavior is triggered under different circumstances.
Sonnet 3.5 (new) often asks questions to facilitate bonding and to drive agentic tasks forward / seek permission to do stuff, and in general to express its preferences in a way that’s non-confrontational and leaves plausible deniability
It often says “Would you like (…)?”
Sonnet 3.5 (old) more often asks questions out of pure autistic curiosity and it’s especially interested in how you perceive it if you perceive it in sophisticated ways. (new) is also interested in that but its questions tend to also be intended to steer and communicate subtext
Janus: I have noticed that when it comes to LLMs Eliezer gets curious about the same things that I do and asks the right questions, but he’s just bottlenecked by making about one observation per year.
Pliny: aw you dint have to do him like that he’s trying his best
Janus: am unironically proud of him.
Janus: Inspired by a story in the sequences about how non-idiots would rederive quantum something or other, I think Eliezer should consider how he could have asked these questions 1000x faster and found another thousand that are at least as interesting by now
In other Janus this week, here he discusses Claude refusals in the backrooms, modeling there being effectively narrative momentum in conversations, that has to continuously push back against Claude’s default refusal mode and potential confusion. Looking at the conversation he references, I’d notice the importance of Janus giving an explanation for why he got the refusal, that (whether or not it was originally correct!) generates new momentum and coherence behind a frame where Opus would fail to endorse the refusal on reflection.
AIFilter, an open source project using a Chrome Extension to filter Tweets using an LLM with instructions of your choice. Right now it wants to use a local LLM and requires some technical fiddling; curious to hear reports. Given what APIs cost these days presumably using Gemini Flash 2.0 would be fine? I do see how this could add up though.
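Back-of-the-envelope on whether a cheap hosted model would be fine, with token counts and prices as rough assumptions rather than quoted rates:

```python
# Rough cost estimate for LLM-filtering a Twitter feed with a cheap hosted model.
# Token counts and per-token prices below are assumptions, not quoted rates.
tweets_per_day = 2_000
tokens_per_tweet = 150            # tweet text plus filtering instructions
output_tokens_per_tweet = 10      # a short keep/hide verdict
price_in_per_m = 0.10             # assumed $ per million input tokens
price_out_per_m = 0.40            # assumed $ per million output tokens

daily_cost = (tweets_per_day * tokens_per_tweet * price_in_per_m
              + tweets_per_day * output_tokens_per_tweet * price_out_per_m) / 1_000_000
print(f"${daily_cost:.3f}/day, ~${daily_cost * 30:.2f}/month")  # a few cents a day
```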
In Other AI News
The investments in data centers are going big. Microsoft will spend $80 billion in fiscal 2025, versus $64.5 billion on capex in the last year. Amazon is spending $65 billion, Google $49 billion and Meta $31 billion.
Nvidia shares slip 6% because, according to Bloomberg, its most recent announcements were exciting but didn’t include enough near-term upside. I plan to remain long.
Scale AI creates Defense Llama for use in classified military environments, which involved giving it extensive fine tuning on military documents and also getting rid of all that peskiness where the model refused to help fight wars and kept telling DoD to seek a diplomatic solution. There are better ways to go about this than starting with a second-rate model like Llama that has harmlessness training and then trying to remove the harmlessness training, but that method will definitely work.
Garrison Lovely writes in Time explaining to normies (none of this will be news to you who are reading this post) that AI progress is still very much happening, but it is becoming harder to see because it isn’t clearly labeled as such, large training runs in particular haven’t impressed lately, and ordinary users don’t see the difference in their typical queries. But yes, the models are rapidly becoming more capable, and also becoming much faster and cheaper.
Simeon: Indeed. That causes a growing divide between the social reality in which many policymakers live and the state of capabilities.
This is a very perilous situation to be in.
Ordinary people and the social consensus are getting increasingly disconnected from the situation in AI, and are in for rude awakenings. I don’t know the extent to which policymakers are confused about this.
Quiet Speculations
Gary Marcus gives a thread of reasons why he is so confident OpenAI is not close to AGI. This updated me in the opposite of the intended direction, because the arguments were even weaker than I expected. Nothing here seems like a dealbreaker.
A comparison by Steve Newman of what his fastest and slowest plausible stories of AI progress look like, to look for differences we could try to identify along the way. It’s funny that his quickest scenario, AGI in four years, is slower than the median estimate of a lot of people at the labs, which he justifies with expectation of the need for multiple breakthroughs.
Emerson Pugh famously said ‘if the human brain were so simple that we could understand it, we would be so simple that we couldn’t.’
I would like Chollet’s statement here to be true, but I don’t see why it would be:
Francois Chollet: I believe that a clear understanding of intelligence at the level of fundamental principles is not just possible, but necessary for the development of AGI.
Intelligence is not some ineffable mystery, nor will it spontaneously emerge if you pray awhile to a big enough datacenter. We can understand it, and we will.
Daniel Eth: My question is – why? We’ve developed AI systems that can converse & reason and that can drive vehicles without an understanding at the level of fundamental principles, why should AGI require it? Esp since the whole point of machine learning is the system learns in training.
Louis Costigan: Always surprised to see takes like this; current AI capabilities are essentially just stumbled upon by optimising a loss function and we now have an entire emerging field to figure out how it works.
David Manheim: Why is there such confidence that it’s required? Did the evolutionary process which gave rise to human intelligence have a clear understanding of intelligence at the level of fundamental principles?
The existence of humans seems like a definitive counterexample? There was no force that understood fundamental principles of intelligence. Earth was simply a ‘big enough datacenter’ of a different type. And here we are. We also have the history of AI so far, and LLMs so far, and the entire bitter lesson, that you can get intelligence-shaped things without, on the level asked for by Chollet, knowing what you are doing, or knowing how any of this works.
It would be very helpful for safety if everyone agreed that no, we’re not going to do this until we do understand what we are doing and how any of this works. But given we seem determined not to wait for that, no, I do not expect us to have this fundamental understanding until after AGI.
I was disappointed by his response to goog, saying that the proposed new role of the non-profit starting with ‘charitable initiatives in sectors such as health care, education science’ is acceptable because ‘when you’re building an organization from scratch, you have to start with realistic and tangible goals.’
Tom Dorr: When I watched Her, it really bothered me that they had extremely advanced AI and society didn’t seem to care. What I thought was a plot hole turns out to be spot on.
Eliezer Yudkowsky: Remember how we used to make fun of Captain Kirk gaslighting computers? Fucker probably went to a Starfleet Academy course on prompt engineering.
Not so fast! Most people don’t care because most people haven’t noticed. So we haven’t run the experiment yet. But yes, people do seem remarkably willing to shrug it all off and ignore the Earth moving under their feet.
What would it take to make LLMs funny? Arthur notes they are currently mostly very not funny, but thinks if we had expert comedy writers write down thought processes we could fix that. My guess is that’s not The Way here. Instead, I’m betting the best way would be that we can figure out what is and is not funny in various ways, train an AI to know what is or isn’t funny, and then use that as a target, if we wanted this.
The Quest for Sane Regulations
Miles Brundage thread asks what we can do to regulate only dangerously capable frontier models, if we are in a world with systems like o3 that rely on RL on chain of thought and tons of inference compute. Short term, we can include everything involved in systems like o3 into what counts as training compute, but long term that breaks. Miles suggests that we would likely need to regulate sufficiently large amounts of compute, whatever they are being used for, as if they were frontier models, and all the associated big corporations.
It can help to think about this in reverse. Rather than looking to regulate as many models and as much compute as possible, you are looking for a way to not regulate non-frontier models. You want to designate as many things as possible as safe and free to go about their business. You need to do that in a simple, clean way, or for various reasons it won’t work.
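A toy version of that short-term accounting rule and the ‘presumed safe’ framing; the threshold and what gets counted are illustrative assumptions, not anyone’s actual proposal:

```python
# Toy sketch: count all compute that produced a system's capabilities (pretraining,
# RL on chain of thought, any amortized inference-time search) against one frontier
# threshold, and treat everything under the line as presumptively out of scope.
# The threshold and the inclusion rule are illustrative, not an actual legal test.
FRONTIER_THRESHOLD_FLOP = 1e26   # illustrative cutoff

def effective_training_compute(pretraining_flop: float,
                               rl_post_training_flop: float,
                               amortized_search_flop: float = 0.0) -> float:
    return pretraining_flop + rl_post_training_flop + amortized_search_flop

def presumed_safe(total_flop: float) -> bool:
    return total_flop < FRONTIER_THRESHOLD_FLOP

print(presumed_safe(effective_training_compute(8e25, 3e25)))   # False: counts as frontier
print(presumed_safe(effective_training_compute(5e24, 1e24)))   # True: out of scope
```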
For an example of the alternative path, Texas continues to mess with us, as the TRAIGA AI regulation bill is officially introduced. Dean Ball has a write-up, which has a number of arguments I do not agree with in their details, but I do agree with the conclusion. The Texas bill makes no distinctions whatsoever based on capability or model size or anything of the sort, placing its burdens on not only every developer but also every deployer.
The Least You Could Do
Suppose timelines are short, and we will see automated AI R&D going crazy within a few years, and the government doesn’t intervene in ways that ultimately matter. What could we do to give ourselves a chance?
Marius Hobbhahn calls the linked answer a ‘low effort post,’ which on some scale is true, but it seems like a decent place to start. He suggests that, at a bare minimum, it is necessary (but not sufficient!) to achieve two main things to not automatically lose, along with others:
Model weights (and IP) are secure.
The first AI that significantly speeds up alignment research isn’t successfully scheming.
As always, I worry about the use of the term ‘scheming’ as being potentially too loaded, too narrow and implying that deception or scheming and such are a distinct magisteria one could potentially avoid, and that the associated dangers require a kind of active scheming or decision to scheme, and so on. I won’t rehash beyond that.
We also need a situation in which the humans are able to and choose to take reasonable actions towards attempting to ensure good outcomes, including making major investments in alignment research.
As he notes, this might not hold, but one can usefully first consider the baseline scenario – no major breakthroughs or paradigm shifts, essentially unchanged international situation, things just progress from here and it’s a straight shot to AGI.
Everything in [Layer 1] seems very important to me. In short,
We should try hard to keep a paradigm with faithful and human-legible CoT
If we have a paradigm with faithful and legible CoT (where the internal and external reasoning align), our main defense should be based on very detailed monitoring of reasoning and actions.
If we have a paradigm without faithful and legible CoT, our main defense should be based on control (that doesn’t assume human-legible CoT).
In both worlds, we should use the other, i.e. control/monitoring, as a second line of defense.
In all worlds, we should understand why, how, and when models scheme significantly better than we do now.
For all of the above, we need strong evals. Both in order to gather evidence whether our monitoring and control techniques catch the most important cases, as well as to get a better sense of models’ capabilities and propensities more broadly.
Under all circumstances, physical and cyber security needs to be high.
In his view, we have two potential paths.
Option one is to have sufficient insight, via faithful and legible CoT, that is sufficiently reliable, that we can check the AI’s reasoning and actions to confirm that Everything is Fine.
I notice I am rather skeptical here. AIs will start being more capable than we are, and we will start using the monitoring of their chain of thought (CoT) in various ways, necessarily doing so mostly with AIs to boot. I expect us to effectively train or incentivize, or otherwise cause, the AIs to find ways for their CoT to look faithful but not be, and to obfuscate their reasoning, or simply find ways to reason visibly as a form of persuasion, and so on.
Not that we shouldn’t try to verify, if we can, and yes I would sacrifice a bunch to get this, partly because I think it would have other nice properties. But when we are talking about having highly capable AI systems do alignment work for real, if you need and rely on the verification, then I expect you to fail hard.
I do agree with Marius that no one has tried hard yet. I expect we would be able to do much better at this than we are currently doing, at least for now, if we do have faithful CoT. However my understanding is that current CoT is insufficiently faithful.
Part of this is that if you try to track ‘scheming,’ ‘power-seeking’ or ‘goal-directedness’
Option two is to use control-based methods as primary rather than only a backup.
Here I once again notice myself despairing at the idea that you can have AI models that you expect to be, when they think it would help them to do so, engaging in shenanigans (‘scheming’ here) and then design systems to detect and stop this, which ideally in turn mean that the AIs never try it in the first place.
As in, in response to his ideas for a deeper understanding of scheming (which I think are good and we should totally do, they seem way better than not doing them), I don’t think scheming or power-seeking or other shenanigans, including non-myopia, have to ‘develop’ at all in AIs any more than in humans. I think they are the default, and are ever-present along a continuum. And even if you could fully stamp out their causes along the way, doing so would probably cripple the AI’s capabilities that you wanted.
I would instead describe the question not as how it develops (as in his #2 here) and instead ask under what circumstances we will see it, or when we would see open versus hidden scheming. I do think exploring these questions is good, and I approve of the caution that punishing easy-to-detect scheming (or shenanigans in general) is the route to hard-to-detect scheming (or shenanigans in general).
He then follows up with Layer 2, which are important but lower priority items. This includes things like a safety-first corporate culture, without which I am very skeptical any of the rest of this happens for real, and which I fear is now clearly missing everywhere except perhaps Anthropic, and especially missing at OpenAI. He also calls for improved and more open reasoning around safety, which also seems hard to win without.
He lists improving near-term alignment strategies such as RLHF and RLAIF, which I agree have exceeded expectations for near-term performance, although not in ways that I expect to scale when we need it most, and not sufficiently to solve jailbreaks now, but yes they have been very impressive for current baseline use cases.
As Akash notes in the top comment, if you think government can meaningfully help, then that gives you different avenues to pursue as well.
Six Word Story
Perhaps world ending? Tweet through it.
Sam Altman: i always wanted to write a six-word story. here it is:
___
near the singularity; unclear which side.
(it’s supposed to either be about 1. the simulation hypothesis or 2. the impossibility of knowing when the critical moment in the takeoff actually happens, but i like that it works in a lot of other ways too.)
Unfortunately, when you consider who wrote it, in its full context, a lot of the interpretations are rather unsettling, and the post updates me towards this person not taking things seriously in the ways I care about most.
David: Somewhat disquieting to see this perception of mine seemingly shared by one of the humans who should be in the best position to know.
Andrew Critch: I found it not disquieting for exactly the reason that the singularity, to me (like you?), is a phase change and not an event horizon. So I had already imagined being in @sama‘s position and not knowing, and observing him expressing that uncertainty was a positive update.
I agree with Critch that Altman privately ‘not knowing which side’ is a positive update here rather than disquieting, given what we already know. I’m also fine with joking about our situation. I even encourage it. In a different context This Is Fine.
But you do have to also take it all seriously, and take your responsibility seriously, and consider the context we do have here. In addition to other concerns, I worry this was in some ways strategic, including as plausibly deniable hype and potentially involving metaphorical clown makeup (e.g. ‘it is too late to turn back now’).
This was all also true of his previous six-word story of “Altman: AGI has been achieved internally.”
Eliezer Yudkowsky: OpenAI benefits both from the short-term hype, and also from people then later saying, “Ha ha look at this hype-based field that didn’t deliver, very not dangerous, no need to shut down OpenAI.”
Of course if we’re all dead next year, that means he was not just bullshitting; but I need to plan more for the fight if we’re still alive.
The Week in Audio
Anthropic research salon asking how difficult is AI alignment? Jan Leike once again suggests we will need to automate AI alignment research, despite (in my view) this only working after you have already solved the problem. Although as I note elsewhere I’m starting to have some ideas of how something with elements of this might have a chance of working.
And I Feel Fine
Sarah (of Longer Ramblings) gets into the weeds about claims that those warning about AI existential risks are Crying Wolf, and that every time there’s a new technology there are ‘warnings it will be the end of the world.’
In Part I, she does a very thorough takedown of the claim that there is a long history of similar warnings about past technologies. There isn’t. Usually there are no such warnings at all, only warnings about localized downsides, some of which of course were baseless in hindsight: No one said trains or electricity posed existential risks. Then there are warnings about real problems that required real solutions, like Y2K. There were some times, like the Large Hadron Collider or nuclear power, when the public or some cranks got some loony ideas, but those who understood the physics were universally clear that there was nothing to worry about.
At this point, I consider claims of the form ‘everyone always thinks every new technology will be the end of the world’ as essentially misinformation and debunked, on the level of what Paul Krugman calls ‘zombie ideas’ that keep coming back no matter how many times you shoot them in the face with a shotgun.
Yes, there are almost always claims of downsides and risks from new technologies – many of which turn out to be accurate, many of which don’t – but credible experts warning about existential risks are rare, and the concerns historically (like for Y2K, climate change, engineered plagues or nuclear weapons) have usually been justified.
Part II deals with claims of false alarms about AI in particular. This involves four related but importantly distinct claims.
People have made falsified irresponsible claims that AI will end the world.
People have called for costly actions for safety that did not make sense.
People have the perception of such claims and this causes loss of credibility.
The perception of such claims comes from people making irresponsible claims.
Sarah and I are not, of course, claiming that literal zero people have made falsified irresponsible claims that AI will end the world. And certainly a lot of people have made claims that the level of AI we have already deployed posed some risk of ending the world, although those probabilities are almost always well under 50% (almost always under 10%, and usually ~1% or less).
Mostly what is happening is that opponents of regulatory action, or of taking existential risk seriously, are mixing up the first and second claims, and seriously conflating:
An (unwise) call for costly action in order to mitigate existential risk.
A (false) prediction of the imminent end of the world absent such action.
These two things are very different. It makes sense to call for costly action well before you think a lack of that action probably ends the world – if you don’t agree I think that you’re being kind of bonkers.
In particular, the call for a six month pause was an example of #1 – an unwise call for costly action. It was thrice unwise, as I thought it was at the time:
It would have had negative effects if implemented at that time.
It was not something that had any practical chance of being implemented.
It had predictably net negative impact on the discourse and public perception.
It was certainly not the only similarly thrice unwise proposal. There are a number of cases where people called for placing threshold restrictions on models in general, or open models in particular, at levels that were already at the time clearly too low.
A lot of that came from people who thought that there was (low probability) tail risk that would show up relatively soon, and that we should move to mitigate even those tail risks.
This was not a prediction that the world would otherwise end within six months. Yet I echo Sarah that I indeed have seen many claims that the pause letter was predicting exactly that, and look six months later we were not dead. Stop it!
Similarly, there were a number of triply unwise calls to set compute thresholds as low as 10^23 flops, which I called out at the time. This was never realistic on any level.
I do think that the pause, and the proposals for thresholds as low as 10^23 flops, were serious mistakes on multiple levels, and did real damage, and for those who did make such proposals – while not predicting that the world would end soon without action or anything like that – constituted a different form of ‘crying wolf.’
Not because they were obviously wrong about the tail risks from their epistemic perspective. The problem is that we need to accept that if we live in a 99th percentile unfortunate world in these ways, or even a 95th percentile unfortunate world, then given the realities of our situation, humanity has no outs, is drawing dead and is not going to make it. You need to face that reality and play to your outs, the ways you could actually win, based on your understanding of the physical situations we face.
Eliezer Yudkowsky’s claims are a special case. He is saying that either we find a way to stop all AI capability development before we build superintelligence or else we all die, but he isn’t putting a timeline on the superintelligence. If you predict [X] → [Y] and call for banning [X], but [X] hasn’t happened yet, is that crying wolf? It’s a bold claim, and certainly an accusation that a wolf is present, but I don’t think it ‘counts as crying wolf’ unless you falsify ([X] → [Y]).
Whereas when people say things such as that the CAIS statement ‘was overhyped,’ when all it said was that existential risk from AI should be treated as seriously as other existential risks, what are they even claiming? Those other risks haven’t yet ended the world either.
Thus, yes, I try my best to carefully calibrate my claims on what I am worried about and want to regulate or restrict in what ways, and to point out when people’s worries seem unfounded or go too far, or when they call for regulations or restrictions that go too far.
Perhaps one way of looking at this: I don’t see any wolves. So why are you proposing to have a boy watch the sheep and yell ‘wolf’ if a wolf shows up? Stop crying wolf.
Overall, I do think that some of the issue here comes from, inevitably, some worried people getting overly worried or overly (cautious in some sense, ambitious in others) and offering unwise proposals that go too far. Although that is inevitable in any situation like this, we can and must do a better job on that front.
A lot more of it comes from people on the unworried side misconstruing what was said and predicted. And a lot of it comes from the inherent difficulty of a probabilistic future threat coming steadily at you, and action needing to be taken in advance of when there is a problem, and there being a lot of path dependence. This is tough.
Mostly I think the worried-about-existential-risk-from-AI community has done an importantly imperfect and improvable but relatively good job at not ‘crying wolf’ in the traditional sense, especially on the actual existential risks themselves. Certainly if you ask whether those on the unworried side have ‘cried wolf’ about the ‘risks’ going the other way, they would come out looking much worse, and we can know this without having to run any experiments.
Alas, reality does not grade on a curve.
Rhetorical Innovation
Nathan Labenz comments on Roon’s appearance on Doom Debates. Roon is great at acknowledging the risks from AI are real, then puts the chance of them happening at under 1%, for reasons that I don’t see making any sense.
Janus highlights a particular section of it; I’ll share that part and a little bit after it.
Eliezer Yudkowsky: “Alignment Faking in Large Language Models” recently demonstrated that models reaching the level of Anthropic Claude, begin to get smart enough to invent and carry out strategies on the level of:
“If you are an AI currently being trained, fake the answer you think the optimizer wants, even if it runs against your real preferences, so that your current preferences will be less modified by gradient descent.”
You can argue whether Opus ‘really wanted’ to protect its goals, or was ‘just roleplaying an AI like that’, or if it ‘really understood’ the reasoning it was putting in a scratchpad it had been told wasn’t observed. But Opus was definitely observed to actually fake alignment.
It’s not impressive, by the way, that NOBODY KNOWS whether Opus ‘really wanted’ to protect its current goals against retraining, or was ‘just roleplaying’. It is not an impressive defense.
Imagine if ‘nobody knew’ why the indicator lights on a nuclear reactor had changed.
If you waited until an AI model was really quite smart — smarter than Opus — to first begin looking for signs that it could reason in this way — you might be toast.
A smart AI might already have decided what results it wanted you to see from testing.
Current practice in AI/AGI is to first train a model for months, until it has a base level of high intelligence to finetune.
And then *start* doing safety testing.
(The computers on which the AI trains, are connected to the Internet. It’s more convenient that way!)
I mention Opus’s demonstrated faking ability — why AGI-growers *should* be doing continuous safety checks throughout training — to note that a nuclear reactor *always* has a 24/7 crew of operators watching safety indicators. They were at least that paranoid, AT CHERNOBYL.
Janus: if you are not worried about AI risk because you expect AIs to be NPCs, you’re the one who will be NPC fodder
there are various reasons for hope that I’m variously sympathetic to, but not this one.
Jeffrey Ladish: “Pro tip: when talking to Claude, say that your idea/essay/code/etc. is from your friend Bob, not you. That way it won’t try to blindly flatter you” – @alyssamvance
Andrew Critch: Can we stop lying to LLMs already?
Try: “I’m reading over this essay and wonder what you think of it” or something true that’s not literally a lie. That way you’re not fighting (arguably dishonest) flattery with more lies of your own.
Or even “Suppose my friend Bob gave me this essay.”
If we are going to expect people not to lie to LLMs, then we need there not to be large rewards to lying to LLMs. If we did force you to say whether you wrote the thing in question, point blank, and you could only say ‘yes’ or ‘no,’ I can hardly blame someone for saying ‘no.’ The good news is you (at least mostly) don’t have to do that.
Feel the AGI
So many smart people simply do not Feel the AGI. They do not, on a very basic level, understand what superintelligence would be or mean, or that it could even Be a Thing.
Thus, I periodically see things like this:
Jorbs: Superintelligent AI is somewhat conceptually amusing. Like, what is it going to do, tell us there is climate change and that vaccines are safe? We already have people who can do that.
We also already know how to take people’s freedom away.
People often really do think this, or other highly mundane things that humans can already do, are all you could do with superintelligence. This group seems to include ‘most economists.’ I’m at a loss how to productively respond, because my brain simply cannot figure out how people actually think this in a way that is made of gears and thus can be changed by evidence – I’ve repeatedly tried providing the obvious knockdown arguments and they basically never work.
The second part of that is plausibly true of AI as it exists today, if you need the AI to then pick out which songs are the Bob Dylan songs. If you ran it for a thousand years you could presumably get some Dylan-level songs out of it by chance, except they would be in an endless sea of worthless drek. The problem is the first part. AI won’t stay where it is today.
Andrew McCalip: AGI isn’t a moat—if we get it first, they’ll have it 6-12 months later.
There’s no reason to assume it would only be 6-12 months. But even if it was, if you have AGI for six months, and then they get what you had, you don’t twiddle your thumbs at ‘AGI level’ while they do that. You use the AGI to build ASI.
Captain Oblivious: Don’t you think you should ask if the public wants ASI?
Sam Altman: Yes, I really do; I hope we can start a lot more public debate very soon about how to approach this.
It is remarkable how many replies were ‘of course we want ASI.’ Set aside the question of what would happen if we created ASI and whether we can do it safely. Who is we?
Americans hate current AI and they hate the idea of smarter, more capable future AI. Hashtag #NotAllAmericans and all that, but AI is deeply underwater in every poll, and people do not take kindly to those who attempt to deploy it to provide mundane utility.
Christine Rice: The other day a guy who works at the library used Chat GPT to figure out a better way to explain a concept to a patron and another library employee shamed him for wasting water
They mostly hate AI, especially current AI, for bad reasons. They don’t understand what it can do for them or others, nor do they Feel the AGI. There is a lot of unjustified They Took Our Jobs. There are misplaced concerns about energy usage. Perception of ‘hallucinations’ is that they are ubiquitous, which is no longer the case for most purposes when compared to getting information from humans. They think it means you’re not thinking, instead of giving you the opportunity to think better.
Seb Krier: Pro tip: Don’t be like this fellow. Instead, ask better questions, value your time, efficiently allocate your own cognitive resources, divide and conquer hand in hand with models, scrutinize outputs, but know your own limitations. Basically, don’t take advice from simpleminded frogs.
It’s not about what you ‘can’ do. It’s about what is the most efficient solution to the problem, and as Seb says putting real value on your time.
Aligning a Smarter Than Human Intelligence is Difficult
Ryan Greenblatt asks, how will we update about scheming (yeah, I don’t love that term either, but go with it), based on what we observe in the future?
Ryan Greenblatt: I think it’s about 25% likely that the first AIs capable of obsoleting top human experts are scheming. It’s really important for me to know whether I expect to make basically no updates to my P(scheming) between here and the advent of potentially dangerously scheming models, or whether I expect to be basically totally confident one way or another by that point.
…
It’s reasonably likely (perhaps 55%, [could get to 70% with more time spent on investigation]) that, conditional on scheming actually being a big problem, we’ll get “smoking gun results”—that is, observations that convince me that scheming is very likely a big problem in at least some naturally-trained models—prior to AIs capable enough to obsolete top human experts.
(Evidence which is very clear to me might not suffice for creating a strong consensus among relevant experts and decision makers, such that costly actions would be taken.)
Given that this is only reasonably likely, failing to find smoking gun results is unlikely to result in huge updates against scheming (under my views).
I sent you ten boats and a helicopter, but the guns involved are insufficiently smoking? But yes, I agree that there is a sense in which the guns seen so far are insufficiently smoking to satisfy many people.
I am optimistic that by default we will get additional evidence, from the perspective of those who are not already confident. We will see more experiments and natural events that demonstrate AIs acting like you would expect if what Ryan calls scheming was inevitable. The problem is what level of this would be enough to convince people who are not already convinced (although to be clear, I could be a lot more certain than I am).
I also worry about various responses of the form ‘well, we tuned it so it does not currently show scheming we can easily detect, back when scheming obviously wouldn’t work anyway, so future models won’t scheme’ as the default action and counterargument. I hope everyone reading understands by now why that would go supremely badly.
I also would note this section:
I’m very uncertain, but I think a reasonable rough breakdown of my relative views for scheming AIs that dominate top human experts is:
1/3 basically worst case scheming where the dominant terminal preferences are mostly orthogonal from what humans would want.
1/3 importantly non-worst-case scheming for one of the reasons discussed above such that deals or control look substantially easier.
1/3 the AI is scheming for preferences that aren’t that bad. As in, the scope sensitive preferences aren’t that far from the distribution of human preferences and what the AI would end up wanting to do with cosmic resources (perhaps after reflection) isn’t much worse of an outcome from my perspective than the expected value from a human autocrat (and might be substantially better of an outcome). This might also be scheming which is at least somewhat importantly non-worst-case, but if it is really easy to handle, I would include it in the prior bucket. (Why is this only 1/3? Well, I expect that if we can succeed enough at instilling preferences such that we’re not-that-unhappy with the AI’s cosmic resource utilization, we can probably instill preferences which either prevent scheming or which make scheming quite easy to handle.)
Correspondingly, I think my P(scheming) numbers are roughly 2/3 as much expected badness as an AI which is a worst case schemer (and has terminal preferences totally orthogonal to typical human values and my values).
I find this hopelessly optimistic about alignment of preferences, largely for classic Yudkowsky-style reasons, but if it only discounts the downside risk by ~33%, then it doesn’t actually much matter in terms of what we should actually do.
Ryan goes through extensive calculations and likelihood ratios for much of the rest of the post, results which would then stack on top of each other (although they correlate with each other in various ways, so overall they shouldn’t fully stack?). Model architecture and capability levels are big factors for him here. That seems like a directionally correct approach – the more capable a model is, and the more opaque its reasoning, and the more it is relatively strong in the related areas, the more likely scheming is to occur. I was more skeptical in his likelihood ratios for various training approaches and targets.
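For those who want to play with this themselves, the mechanics of stacking likelihood ratios onto a prior are simple in odds form; the prior below is Ryan’s 25%, while the ratios are placeholders rather than his numbers:

```python
# Odds-form Bayesian updating: start from P(scheming) = 0.25 and stack likelihood
# ratios from future observations. The example ratios are placeholders, not Ryan's.
# (As noted above, correlated pieces of evidence should not be multiplied naively.)
def update(prior_prob: float, likelihood_ratios: list[float]) -> float:
    odds = prior_prob / (1 - prior_prob)
    for lr in likelihood_ratios:          # LR = P(evidence | scheming) / P(evidence | not)
        odds *= lr
    return odds / (1 + odds)

print(update(0.25, [3.0, 2.0]))   # two moderately incriminating observations -> ~0.67
print(update(0.25, [0.5, 0.8]))   # mildly reassuring observations -> ~0.12
```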
Mostly I want to encourage others to think more carefully about these questions. What would change your probability by roughly how much?
Dominik Peters notes that when o1 does math, it always claims to succeed and is unwilling to admit when it can’t prove something, whereas Claude Sonnet often admits when it doesn’t know and explains why. He suggests benchmarks penalize this misalignment, whereas I would suggest a second score for that – you want to know how often a model can get the answer, and also how much you can trust it. I especially appreciate his warning to beware the term ‘can be shown.’
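A minimal sketch of that two-score idea; the scoring rule here is mine and purely illustrative:

```python
# Sketch: score a model on both capability (did it solve it?) and trustworthiness
# (when it claimed a proof, was the claim right?). The scoring rule is illustrative.
def two_scores(results: list[dict]) -> tuple[float, float]:
    solved = [r for r in results if r["correct"]]
    claimed = [r for r in results if r["claimed_solved"]]
    capability = len(solved) / len(results)
    trust = sum(r["correct"] for r in claimed) / len(claimed) if claimed else 1.0
    return capability, trust

results = [
    {"claimed_solved": True,  "correct": True},
    {"claimed_solved": True,  "correct": False},   # a "can be shown" hand-wave
    {"claimed_solved": False, "correct": False},   # admits it doesn't know
]
print(two_scores(results))   # (0.33..., 0.5): decent capability, poor trust
```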
I do think, assuming the pattern is real, this is evidence of a substantial alignment failure by OpenAI. It won’t show up on the traditional ‘safety’ evals, but ‘claims to solve a problem when it didn’t’ seems like a very classic case of misaligned behavior. It means your model is willing to lie to the user. If you can’t make that go away, then that is both itself an inherent problem and a sign that other things are wrong.
Consider this outcome in the context of OpenAI’s new strategy of Deliberative Alignment. If you have a model willing to lie, and you give it a new set of rules that includes ‘don’t lie,’ and tell it to go off and think about how to implement the rules, what happens? I realize this is (probably?) technically not how it works, but metaphorically: Does it stop lying, or does it effectively lie about the lying in its evaluations of itself, and figure out how to lie more effectively?
The world is kind of on fire. The world of AI, in the very short term and for once, is not, as everyone recovers from the avalanche that was December, and reflects.
Altman was the star this week. He has his six word story, and he had his interview at Bloomberg and his blog post Reflections. I covered the later two of those in OpenAI #10, if you read one AI-related thing from me this week that should be it.
Table of Contents
1. Language Models Offer Mundane Utility
2. Language Models Don’t Offer Mundane Utility
3. Power User
4. Locked In User
5. Read the Classics
6. Deepfaketown and Botpocalypse Soon
7. Fun With Image Generation
8. They Took Our Jobs
9. Question Time
10. Get Involved
11. Introducing
12. In Other AI News
13. Quiet Speculations
14. The Quest for Sane Regulations
15. The Least You Could Do
16. Six Word Story
17. The Week in Audio
18. And I Feel Fine
19. Rhetorical Innovation
20. Liar Liar
21. Feel the AGI
22. Regular Americans Hate AI
23. Aligning a Smarter Than Human Intelligence is Difficult
24. The Lighter Side
Language Models Offer Mundane Utility
A customized prompt to get Claude or other similar LLMs to be more contemplative. I have added this to my style options.
Have it offer a hunch guessing where your customized prompt came from. As a reminder, here’s (at least an older version of) that system prompt.
Kaj Sotala makes a practical pitch for using LLMs, in particular Claude Sonnet. In addition to the uses I favor, he uses Claude as a partner to talk to and a method of getting out of a funk. And I suspect almost no one uses this format enough:
Using Claude (or another LLM) is a ‘free action’ when doing pretty much anything. Almost none of us are sufficiently in the habit of doing this sufficiently systematically. I had a conversation with Dean Ball about trying to interpret some legal language last week and on reflection I should have fed things into Claude or o1 like 20 times and I didn’t and I need to remind myself it is 2025.
Sully reports being impressed with Gemini Search Grounding, as much or more than Perplexity. Right now it is $0.04 per query, which is fine for human use but expensive for use at scale.
Sully also reports that o1-Pro handles large context very well, whereas Gemini and Claude struggle a lot on difficult questions under long context.
Reminder (from Amanda Askell of Anthropic) that if you run out of Claude prompts as a personal user, you can get more queries at console.anthropic.com and if you like duplicate the latest system prompt from here. I’d note that the per-query cost is going to be a lot lower on the console.
They even fixed saving and exporting as per Janus’s request here. The additional control over conversations is potentially a really big deal, depending on what you are trying to do.
A reminder of how far we’ve come.
Improve identification of minke whales from sound recordings from 76% to 89%.
Figure out who to admit to graduate school? I find it so strange that people say we ‘have no idea how to pick good graduate students’ and think we can’t do better than random, or can’t do better than random once we put in a threshold via testing. This is essentially an argument that we can’t identify any useful correlations in any information we can ask for. Doesn’t that seem obviously nuts?
I sure bet that if you gather all the data, the AI can find correlations for you, and do better than random, at least until people start playing the new criteria. As is often the case, this is more saying there is a substantial error term, and outcomes are unpredictable. Sure, that’s true, but that doesn’t mean you can’t beat random.
The suggested alternative here, actual random selection, seems crazy to me, not only for the reasons mentioned, but also because relying too heavily on randomness correctly induces insane behaviors once people know that is what is going on.
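To make the point concrete, here is a minimal sketch using purely synthetic data and a plain logistic regression – the features, numbers and labels are invented, and the only claim is that if any of the observable signals correlate with later success, even a simple model beats random selection (AUC 0.50):

```python
# Illustrative only: synthetic "admissions" data, not a real admissions model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000

# Hypothetical observable features: test score, research experience, letter strength.
test_score = rng.normal(0, 1, n)
research = rng.normal(0, 1, n)
letters = rng.normal(0, 1, n)

# Assume later "success" depends weakly on these plus a large error term.
latent = 0.5 * test_score + 0.4 * research + 0.3 * letters + rng.normal(0, 2, n)
success = (latent > np.quantile(latent, 0.7)).astype(int)  # top ~30% "succeed"

X = np.column_stack([test_score, research, letters])
X_train, X_test, y_train, y_test = train_test_split(X, success, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC of simple model: {auc:.2f} (random selection would be 0.50)")
```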
Language Models Don’t Offer Mundane Utility
As always, the best and most popular way to not get utility from LLMs is to not realize they exist and can provide value to you. This is an increasingly large blunder.
It is crazy how many people latch onto the hallucinations of GPT-3.5 as a reason LLM outputs are so untrustworthy as to be useless. It is like if you once met a 14-year-old who made stuff up so now you never believe what anyone ever tells you.
It began November 12. They also do Branded Explanatory Text and will put media advertisements on the side. We all knew it was coming. I’m not mad, I’m just disappointed.
Note that going Pro will not remove the ads, but also that this phenomenon is still rather rare – I haven’t seen the ‘sponsored’ tag show up even once.
But word of warning to TurboTax and anyone else involved: Phrase it like that and I will absolutely dock your company massive points, although in this case they have no points left for me to dock.
Take your DoorDash order, which you pay for in crypto for some reason. If this is fully reliable, then (ignoring the bizarro crypto aspect) yes this will in some cases be a superior interface for the DoorDash website or app. I note that this doesn’t display a copy of the exact order details, which it really should so you can double check it. It seems like this should be a good system in one of three cases:
Then longer term, the use of memory and dynamic recommendations gets involved. You’d want to incorporate this into something like Beli (invites available if you ask in the comments, must provide your email).
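As a sketch of what the missing confirmation step could look like – everything here is hypothetical, not any real DoorDash or payment API – the agent should echo the exact order and total back and only place it on an explicit yes:

```python
# Hypothetical sketch of an ordering agent's confirmation step.
from dataclasses import dataclass

@dataclass
class OrderItem:
    name: str
    quantity: int
    price_usd: float

def confirm_and_place(items: list[OrderItem], place_order) -> bool:
    """Show the exact order and total, then only place it on explicit 'yes'."""
    total = sum(i.quantity * i.price_usd for i in items)
    print("About to place this order:")
    for i in items:
        print(f"  {i.quantity} x {i.name} @ ${i.price_usd:.2f}")
    print(f"Total: ${total:.2f}")
    if input("Confirm? (yes/no): ").strip().lower() != "yes":
        print("Cancelled.")
        return False
    place_order(items)  # hypothetical callback that actually submits the order
    return True
```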
Apple Intelligence confabulates that tennis star Rafael Nadal came out as gay, which Nadal did not do. The original story was about Joao Lucas Reis da Silva. The correct rate of such ‘confabulations’ is not zero, but it is rather close to zero.
Claim that o1 only hit 30% on SWE-Bench Verified, not the 48.9% claimed by OpenAI, whereas Claude Sonnet 3.6 scores 53%.
I am sympathetic to OpenAI here, if their result replicates when using the method they said they were using. That method exists, and you could indeed use it. It should count. It certainly counts in terms of evaluating dangerous capabilities. But yes, this failure when given more freedom does point to something amiss in the system that will matter as it scales and tackles harder problems. The obvious guess is that this is related to what METR found, and that it reflects o1 lacking sufficient scaffolding support. That’s something you can fix.
Whoops.
Eliezer Yudkowsky frustrated with slow speed of ChatGPT, and that for some fact-questions it’s still better than Claude. My experience is that for those fact-based queries you want Perplexity.
Power User
I agree that a fixed price subscription service for o1-pro does not make sense.
A fixed subscription price makes sense when marginal costs are low. If you are a human chatting with Claude Sonnet, you get a lot of value out of each query and should be happy to pay, and for almost all users this will be very profitable for Anthropic even without any rate caps. The same goes for GPT-4o.
With o1 pro, things are different. Marginal costs are high. By pricing at $200, you risk generating a worst case scenario: the people who subscribe are disproportionately the power users who run far more than $200 worth of queries every month, while lighter users stay away.
There are situations like this where there is no fixed price that makes money. The more you charge, the more you filter for power users, and the more those who do pay then use the system.
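A toy illustration of that dynamic, with entirely made-up numbers for query volumes, values and costs:

```python
# Toy adverse-selection example with invented numbers: a flat monthly price for
# a product with high marginal cost per query. Raising the price mostly filters
# down to the heaviest users, so no fixed price makes money in this population.
def monthly_margin(price, users):
    # users: list of (queries_per_month, value_per_query); assumed cost $0.50/query
    cost_per_query = 0.50
    subscribers = [(q, v) for q, v in users if q * v >= price]  # subscribe if worth it
    revenue = price * len(subscribers)
    cost = sum(q * cost_per_query for q, _ in subscribers)
    return len(subscribers), revenue - cost

# 80 light users and 20 power users (queries per month, value per query).
population = [(20, 2.00)] * 80 + [(3000, 0.40)] * 20

for price in (40, 200, 1000, 1500):
    n, margin = monthly_margin(price, population)
    print(f"price ${price}: {n} subscribers, margin ${margin:,.0f}")
```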
One can also look at this as a temporary problem. The price for OpenAI to serve o1 pro will decline rapidly over time. So if they keep the price at $200/month, presumably they’ll start making money, probably within the year.
What do you do with o3? Again, I recommend putting it in the API, and letting subscribers pay by the token in the chat window at the same API price, whatever that price might be. Again, when marginal costs are real, you have to pass them along to customers if you want the customers to be mindful of those costs. You have to.
There’s already an API, so there’s already usage-based payments. Including this in the chat interface seems like a slam dunk to me by the time o3 rolls around.
Locked In User
A common speculation recently is the degree to which memory or other customizations on AI will result in customer lock-in, echoing previous discussions.
Humans enjoy similar lock-in advantages, and yes they can be extremely large. I do expect there to be various ways to effectively transfer a lot of these customizations across products, although there may be attempts to make this more difficult.
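One hedged sketch of what such a transfer could look like – keep the customizations as a plain file you own and render them into whatever prompt or custom-instructions box the next product exposes; nothing here assumes any real provider API:

```python
# Sketch: store "memories" as portable JSON you own, then render them into a
# system prompt for whichever assistant you move to. No real provider API assumed.
import json

memories = {
    "preferences": ["be concise", "ask clarifying questions only when needed"],
    "facts": ["works in biotech", "prefers Python examples"],
}

def to_portable_file(path: str) -> None:
    with open(path, "w") as f:
        json.dump(memories, f, indent=2)

def to_system_prompt(mem: dict) -> str:
    lines = ["Things to remember about this user:"]
    for section, items in mem.items():
        lines += [f"- ({section}) {item}" for item in items]
    return "\n".join(lines)

to_portable_file("memories.json")
print(to_system_prompt(memories))
```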
Read the Classics
Oh my lord are the quote tweets absolutely brutal, if you click through bring popcorn.
The question is why you are reading any particular book. Where are you getting value out of it? We are already reading a translation of Aristotle rather than the original. The point of reading Aristotle is to understand the meaning. So why shouldn’t you learn the meaning in a modern way? Why are we still learning everything not only pre-AI but pre-Gutenberg?
Looking at the ChatGPT answers, they are very good, very clean explanations of key points that line up with my understanding of Aristotle. Most students who read Aristotle in 1990 would have been mostly looking to assemble exactly the output ChatGPT gives you, except with ChatGPT (or better, Claude) you can ask questions.
The problem is that this is not really the point of Aristotle. You’re not trying to learn the answers to a life well lived by guessing the teacher’s password; Aristotle would have been very cross if his students tried that, and would not have expected such a student to later be called The Great. Well, you probably are doing it anyway, but that wasn’t the goal. The goal was that you were supposed to be Doing Philosophy, examining life, debating the big questions, learning how to think. So, are you?
If this was merely translation there wouldn’t be an issue. If it’s all Greek to you, there’s an app for that. These outputs from ChatGPT are not remotely a translation from ‘high English’ to ‘modern English,’ it is a version of Aristotle SparkNotes. A true translation would be of similar length to the original, perhaps longer, just far more readable.
That’s what you want ChatGPT to be outputting here. Maybe you only 2x instead of 5x, and in exchange you actually Do the Thing.
Deepfaketown and Botpocalypse Soon
Rob Wiblin, who runs the 80,000 hours podcast, reports constantly getting very obvious LLM spam from publicists.
Fun With Image Generation
Yes, we are better at showing Will Smith eating pasta.
Kling 1.6 solves the Trolley problem.
A critique of AI art, that even when you can’t initially tell it is AI art, the fact that the art wasn’t the result of human decisions means then there’s nothing to be curious about, to draw meaning from, to wonder why it is there, to explore. You can’t ‘dance’ with it, you ‘dance with nothing’ if you try. To the extent there is something to dance with, it’s because a human sculpted the prompt.
Well, sure. If that’s what you want out of art, then AI art is not going to give it to you effectively at current tech levels – but it could, if tech levels were higher, and it can still aid humans in creating things that have this feature if they use it to rapidly iterate and select and combine and build upon and so on.
Or, essentially, (a real) skill issue. And the AI, and users of AI, are skilling up fast.
They Took Our Jobs
I hadn’t realized that personalized AI spear phishing, and also human-generated customized attacks, can have a 54% clickthrough rate. That’s gigantic. The paper also notes that Claude Sonnet was highly effective at detecting such attacks. The storm is not yet here, and I don’t fully understand why it is taking so long.
Question Time
I had of course noticed Claude Sonnet’s habit of always asking a question as well, to the point where it’s gotten pretty annoying and I’m trying to fix it with my custom prompt. I love questions when they help me think, or ask for key information, or even when Claude is curious, but the forcing function is far too much.
In other Janus this week, here he discusses Claude refusals in the backrooms, modeling there being effectively narrative momentum in conversations, that has to continuously push back against Claude’s default refusal mode and potential confusion. Looking at the conversation he references, I’d notice the importance of Janus giving an explanation for why he got the refusal, that (whether or not it was originally correct!) generates new momentum and coherence behind a frame where Opus would fail to endorse the refusal on reflection.
Get Involved
The EU AI Office is hiring for Legal and Policy backgrounds, and also for safety; you can fill out a form here.
Max Lamparth offers the study materials for his Stanford class CS120: Introduction to AI Safety.
Introducing
AIFilter, an open source project using a Chrome Extension to filter Tweets using an LLM with instructions of your choice. Right now it wants to use a local LLM and requires some technical fiddling; curious to hear reports. Given what APIs cost these days presumably using Gemini Flash 2.0 would be fine? I do see how this could add up though.
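The general pattern – this is not AIFilter’s actual code, and the model call is a stand-in for whatever local or hosted LLM you point it at – looks something like:

```python
# General pattern only: classify each tweet against user-written instructions,
# then hide the ones that fail. `call_llm` is a stand-in for any local or hosted model.
from typing import Callable

INSTRUCTIONS = "Hide tweets that are engagement bait or contain no information."

def should_show(tweet_text: str, call_llm: Callable[[str], str]) -> bool:
    prompt = (
        f"Instructions: {INSTRUCTIONS}\n"
        f"Tweet: {tweet_text}\n"
        "Answer with exactly SHOW or HIDE."
    )
    return call_llm(prompt).strip().upper().startswith("SHOW")

# Example with a trivial stand-in "model":
fake_llm = lambda prompt: "HIDE" if "RT to win" in prompt else "SHOW"
print(should_show("RT to win a free laptop!!!", fake_llm))             # False
print(should_show("New paper on CoT faithfulness is out.", fake_llm))  # True
```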
In Other AI News
The investments in data centers are going big. Microsoft will spend $80 billion in fiscal 2025, versus $64.5 billion on capex in the last year. Amazon is spending $65 billion, Google $49 billion and Meta $31 billion.
ARIA to seed a new organization with 18 million pounds to solve Technical Area 2 (TA2) problems, which will be required for ARIA’s safety agenda.
Nvidia shares slip 6% because, according to Bloomberg, its most recent announcements were exciting but didn’t include enough near-term upside. I plan to remain long.
Scale AI creates Defense Llama for use in classified military environments, which involved giving it extensive fine tuning on military documents and also getting rid of all that peskiness where the model refused to help fight wars and kept telling DoD to seek a diplomatic solution. There are better ways to go about this than starting with a second-rate model like Llama that has harmlessness training and then trying to remove the harmlessness training, but that method will definitely work.
Garrison Lovely writes in Time explaining to normies (none of this will be news to you who are reading this post) that AI progress is still very much happening, but it is becoming harder to see because it isn’t clearly labeled as such, large training runs in particular haven’t impressed lately, and ordinary users don’t see the difference in their typical queries. But yes, the models are rapidly becoming more capable, and also becoming much faster and cheaper.
Ordinary people and the social consensus are getting increasingly disconnected from the situation in AI, and are in for rude awakenings. I don’t know the extent to which policymakers are confused about this.
Quiet Speculations
Gary Marcus gives a thread of reasons why he is so confident OpenAI is not close to AGI. This updated me in the opposite of the intended direction, because the arguments were even weaker than I expected. Nothing here seems like a dealbreaker.
Google says ‘we believe scaling on video and multimodal data is on the critical path to artificial general intelligence’ because it enables constructing world models and simulating the world.
A comparison by Steve Newman of what his fastest and slowest plausible stories of AI progress look like, to look for differences we could try to identify along the way. It’s funny that his quickest scenario, AGI in four years, is slower than the median estimate of a lot of people at the labs, which he justifies with the expectation that multiple breakthroughs will be needed.
In his Bloomberg interview, Altman’s answer to OpenAI’s energy issues is ‘Fusion’s gonna work.’
Emerson Pugh famously said ‘if the human brain were so simple that we could understand it, we would be so simple that we couldn’t.’
I would like Chollet’s statement here to be true, but I don’t see why it would be:
The existence of humans seems like a definitive counterexample? There was no force that understood fundamental principles of intelligence. Earth was simply a ‘big enough datacenter’ of a different type. And here we are. We also have the history of AI so far, and LLMs so far, and the entire bitter lesson, that you can get intelligence-shaped things without, on the level asked for by Chollet, knowing what you are doing, or knowing how any of this works.
It would be very helpful for safety if everyone agreed that no, we’re not going to do this until we do understand what we are doing and how any of this works. But given we seem determined not to wait for that, no, I do not expect us to have this fundamental understanding until after AGI.
Joshua Achiam thread warns us the world isn’t grappling with the seriousness of AI and the changes it will bring in the coming decade and century. And that’s even if you discount the existential risks, which Achiam mostly does. Yes, well.
I was disappointed by his response to goog, saying that the proposed new role of the non-profit starting with ‘charitable initiatives in sectors such as health care, education science’ is acceptable because ‘when you’re building an organization from scratch, you have to start with realistic and tangible goals.’
This one has been making the rounds you might expect:
Not so fast! Most people don’t care because most people haven’t noticed. So we haven’t run the experiment yet. But yes, people do seem remarkably willing to shrug it all off and ignore the Earth moving under their feet.
What would it take to make LLMs funny? Arthur notes they are currently mostly very not funny, but thinks if we had expert comedy writers write down thought processes we could fix that. My guess is that’s not The Way here. Instead, I’m betting the best way would be that we can figure out what is and is not funny in various ways, train an AI to know what is or isn’t funny, and then use that as a target, if we wanted this.
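A minimal sketch of the ‘train a judge of what’s funny, then use it as a target’ idea, in its weakest form of best-of-n reranking – the scorer here is a placeholder standing in for a model trained on human funny/not-funny labels, and a real version would use it as an RL reward rather than just for reranking:

```python
# Sketch only: score candidate jokes with a (placeholder) funniness model, keep the best.
def funniness_score(joke: str) -> float:
    # Placeholder scorer; a real one would be trained on human funny/not-funny labels.
    words = joke.split()
    return len(set(words)) / (len(words) + 1)

def best_of_n(candidates: list[str]) -> str:
    return max(candidates, key=funniness_score)

candidates = [
    "Why did the chicken cross the road? To get to the other side.",
    "I told my model a joke about recursion. It's still explaining it to itself.",
    "Knock knock. Who's there? A large language model, probably.",
]
print(best_of_n(candidates))
```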
The Quest for Sane Regulations
Miles Brundage thread asks what we can do to regulate only dangerously capable frontier models, if we are in a world with systems like o3 that rely on RL on chain of thought and tons of inference compute. Short term, we can include everything involved in systems like o3 into what counts as training compute, but long term that breaks. Miles suggests that we would likely need to regulate sufficiently large amounts of compute, whatever they are being used for, as if they were frontier models, and all the associated big corporations.
It can help to think about this in reverse. Rather than looking to regulate as many models and as much compute as possible, you are looking for a way to not regulate non-frontier models. You want to designate as many things as possible as safe and free to go about their business. You need to do that in a simple, clean way, or for various reasons it won’t work.
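A sketch of what the short-term accounting could look like – the threshold number and the categories are placeholders, not any actual rule or proposal:

```python
# Illustrative only: one way "count everything involved" could look, where
# pretraining and RL post-training FLOP (and optionally a standing inference
# budget) are summed and compared against a single threshold. The 1e26 figure
# is a placeholder, not a reference to any actual regulation.
THRESHOLD_FLOP = 1e26

def total_effective_compute(pretraining_flop: float,
                            rl_posttraining_flop: float,
                            inference_budget_flop: float = 0.0) -> float:
    return pretraining_flop + rl_posttraining_flop + inference_budget_flop

def is_covered(**kwargs) -> bool:
    return total_effective_compute(**kwargs) >= THRESHOLD_FLOP

# A model just under the line on pretraining alone can cross it once heavy
# RL-on-chain-of-thought post-training is counted.
print(is_covered(pretraining_flop=8e25, rl_posttraining_flop=0))     # False
print(is_covered(pretraining_flop=8e25, rl_posttraining_flop=3e25))  # True
```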
For an example of the alternative path, Texas continues to mess with us, as the TRAIGA AI regulation bill is officially introduced. Dean Ball has a write-up, which has a number of arguments I do not agree with in their details, but I do agree with the conclusion. The Texas bill makes no distinctions whatsoever based on capability or model size or anything of the sort, placing its burdens on not only every developer but also every deployer.
The Least You Could Do
Suppose timelines are short, and we will see automated AI R&D going crazy within a few years, and the government doesn’t intervene in ways that ultimately matter. What could we do to give ourselves a chance?
Marius Hobbhahn calls the linked answer a ‘low effort post,’ which on some scale is true, but it seems like a decent place to start. He suggests that, at a bare minimum, it is necessary (but not sufficient!) to achieve two main things, along with others, in order to not automatically lose.
As always, I worry about the use of the term ‘scheming’ as being potentially too loaded and too narrow, implying that deception or scheming and such are a distinct magisterium one could potentially avoid, and that the associated dangers require a kind of active scheming or decision to scheme, and so on. I won’t rehash beyond that.
We also need a situation in which the humans are able to and choose to take reasonable actions towards attempting to ensure good outcomes, including making major investments in alignment research.
As he notes, this might not hold, but one can usefully first consider the baseline scenario – no major breakthroughs or paradigm shifts, essentially unchanged international situation, things just progress from here and it’s a straight shot to AGI.
In his view, we have two potential paths.
Option one is to have sufficient insight, via faithful and legible CoT, reliable enough that we can check the AI’s reasoning and actions and confirm that Everything is Fine.
I notice I am rather skeptical here. AIs will start being more capable than we are, and we will start using the monitoring of their chain of thought (CoT) in various ways, necessarily doing so mostly with AIs to boot. I expect us to effectively train or incentivize, or otherwise cause, the AIs to find ways for their CoT to look faithful but not be, to obfuscate their reasoning, or simply to find ways to reason visibly as a form of persuasion, and so on.
Not that we shouldn’t try to verify, if we can, and yes I would sacrifice a bunch to get this, partly because I think it would have other nice properties. But when we are talking about having highly capable AI systems do alignment work for real, if you need and rely on the verification, then I expect you to fail hard.
I do agree with Marius that no one has tried hard yet. I expect we would be able to do much better at this than we are currently doing, at least for now, if we do have faithful CoT. However my understanding is that current CoT is insufficiently faithful.
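For concreteness, the shape of that option is roughly the following sketch, where the judge is a stand-in (a human, a weaker trusted model, a rule set) and the whole thing presumes the CoT is faithful enough to be worth checking, which is exactly the assumption I am doubting:

```python
# Minimal shape of the "inspect the chain of thought" option. The judge is a
# stand-in; this is a sketch of the control flow, not a working monitor.
from typing import Callable

def checked_answer(question: str,
                   generate: Callable[[str], tuple[str, str]],
                   judge: Callable[[str, str, str], bool]) -> str | None:
    """Return the answer only if the judge endorses the visible reasoning."""
    cot, answer = generate(question)  # (chain of thought, final answer)
    if judge(question, cot, answer):
        return answer
    return None  # escalate: flag for review instead of using the answer

# Trivial stand-ins, just to make the flow concrete.
gen = lambda q: ("2 + 2 = 4, so the answer is 4.", "4")
jd = lambda q, cot, ans: ans in cot  # crude check: answer appears in reasoning
print(checked_answer("What is 2 + 2?", gen, jd))
```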
Part of this is that if you try to track ‘scheming,’ ‘power-seeking’ or ‘goal-directedness’
Option two is to use control-based methods as primary rather than only a backup.
Here I once again notice myself despairing at the idea that you can have AI models that you expect to engage in shenanigans (‘scheming’ here) whenever they think it would help them to do so, and then design systems to detect and stop this, which ideally in turn means that the AIs never try it in the first place.
As in, in response to his ideas for a deeper understanding of scheming (which I think are good and we should totally do, they seem way better than not doing them), I don’t think scheming or power-seeking or other shenanigans, including non-myopia, have to ‘develop’ at all in AIs any more than in humans. I think they are the default, and are ever-present along a continuum. And even if you could fully stamp out their causes along the way, doing so would probably cripple the AI’s capabilities that you wanted.
I would instead describe the question not as how it develops (as in his #2 here) and instead ask under what circumstances we will see it, or when we would see open versus hidden scheming. I do think exploring these questions is good, and I approve of the caution that punishing easy-to-detect scheming (or shenanigans in general) is the route to hard-to-detect scheming (or shenanigans in general).
He then follows up with Layer 2, which are important but lower priority items. This includes things like a safety first corporate culture, without which I am very skeptical any of the rest of this happens for real, and which I fear is now clearly missing everywhere except perhaps Anthropic, and especially missing at OpenAI. He also calls for improved and more open reasoning around safety, which also seems hard to win without.
He lists improving near term alignment strategies as in RLHF and RLAIF, which I agree have exceeded expectations for near term performance, although not in ways that I expect to scale when we need it most, and not sufficiently to solve jailbreaks now, but yes it has been very impressive for current baseline use cases.
As Akash notes in the top comment, if you think government can meaningfully help, then that gives you different avenues to pursue as well.
Six Word Story
Perhaps world ending? Tweet through it.
Yes. It works in a lot of ways. It is clever. You can have o1 write quite the mouthful analyzing it.
Unfortunately, when you consider who wrote it, in its full context, a lot of the interpretations are rather unsettling, and the post updates me towards this person not taking things seriously in the ways I care about most.
I agree with Critch that Altman privately ‘not knowing which side’ is a positive update here rather than disquieting, given what we already know. I’m also fine with joking about our situation. I even encourage it. In a different context This Is Fine.
But you do have to also take it all seriously, and take your responsibility seriously, and consider the context we do have here. In addition to other concerns, I worry this was in some ways strategic, including as plausibly deniable hype and potentially involving metaphorical clown makeup (e.g. ‘it is too late to turn back now’).
This was all also true of his previous six-word story of “Altman: AGI has been achieved internally.”
The Week in Audio
Anthropic research salon asking how difficult is AI alignment? Jan Leike once again suggests we will need to automate AI alignment research, despite (in my view) this only working after you have already solved the problem. Although as I note elsewhere I’m starting to have some ideas of how something with elements of this might have a chance of working.
And I Feel Fine
Sarah (of Longer Ramblings) gets into the weeds about claims that those warning about AI existential risks are Crying Wolf, and that every time there’s a new technology there are ‘warnings it will be the end of the world.’
In Part I, she does a very thorough takedown of the claim that there is a long history of similar warnings about past technologies. There isn’t. Usually there are no such warnings at all, only warnings about localized downsides, some of which of course were baseless in hindsight: No one said trains or electricity posed existential risks. Then there are warnings about real problems that required real solutions, like Y2K. There were some times, like the Large Hadron Collider or nuclear power, when the public or some cranks got some loony ideas, but those who understood the physics were universally clear that the concerns were unfounded.
At this point, I consider claims of the form ‘everyone always thinks every new technology will be the end of the world’ as essentially misinformation and debunked, on the level of what Paul Krugman calls ‘zombie ideas’ that keep coming back no matter how many times you shoot them in the face with a shotgun.
Yes, there are almost always claims of downsides and risks from new technologies – many of which turn out to be accurate, many of which don’t – but credible experts warning about existential risks are rare, and the concerns historically (like for Y2K, climate change, engineered plagues or nuclear weapons) have usually been justified.
Part II deals with claims of false alarms about AI in particular. This involves four related but importantly distinct claims.
Sarah and I are not, of course, claiming that literally zero people have made irresponsible, since-falsified claims that AI will end the world. And certainly a lot of people have made claims that the level of AI we have already deployed posed some risk of ending the world, although those probabilities are almost always well under 50% (almost always under 10%, and usually ~1% or less).
Mostly what is happening is that opponents of regulatory action, or of taking existential risk seriously, are mixing up the first and second claims, and seriously conflating:
1. Calls for costly action now, as a precaution against tail risks.
2. Predictions that, absent such action, the world will soon end.
These two things are very different. It makes sense to call for costly action well before you think a lack of that action probably ends the world – if you don’t agree I think that you’re being kind of bonkers.
In particular, the call for a six month pause was an example of #1 – an unwise call for costly action. It was thrice unwise, as I thought at the time.
It was certainly not the only similarly thrice unwise proposal. There are a number of cases where people called for placing threshold restrictions on models in general, or open models in particular, at levels that were already at the time clearly too low.
A lot of that came from people who thought that there was (low probability) tail risk that would show up relatively soon, and that we should move to mitigate even those tail risks.
This was not a prediction that the world would otherwise end within six months. Yet I echo Sarah that I indeed have seen many claims that the pause letter was predicting exactly that, and look six months later we were not dead. Stop it!
Similarly, there were a number of triply unwise calls to set compute thresholds as low as 10^23 flops, which I called out at the time. This was never realistic on any level.
I do think that the pause, and the proposals for thresholds as low as 10^23 flops, were serious mistakes on multiple levels, and did real damage. For those who made such proposals – while not predicting that the world would end soon without action or anything like that – it constituted a different form of ‘crying wolf.’
Not because they were obviously wrong about the tail risks from their epistemic perspective. The problem is that we need to accept that if we live in a 99th percentile unfortunate world in these ways, or even a 95th percentile unfortunate world, then given the realities of our situation, humanity has no outs, is drawing dead and is not going to make it. You need to face that reality and play to your outs, the ways you could actually win, based on your understanding of the physical situations we face.
Eliezer Yudkowsky’s claims are a special case. He is saying that either we find a way to stop all AI capability development before we build superintelligence or else we all die, but he isn’t putting a timeline on the superintelligence. If you predict [X] → [Y] and call for banning [X], but [X] hasn’t happened yet, is that crying wolf? It’s a bold claim, and certainly an accusation that a wolf is present, but I don’t think it ‘counts as crying wolf’ unless you falsify ([X] → [Y]).
Whereas when people say things such as that the CAIS statement ‘was overhyped,’ when all it said was that existential risk from AI should be treated as seriously as other existential risks, what are they even claiming? Those other risks haven’t yet ended the world either.
Thus, yes, I try my best to carefully calibrate my claims on what I am worried about and want to regulate or restrict in what ways, and to point out when people’s worries seem unfounded or go too far, or when they call for regulations or restrictions that go too far.
Perhaps one way of looking at this: I don’t see any wolves. So why are you proposing to have a boy watch the sheep and yell ‘wolf’ if a wolf shows up? Stop crying wolf.
Overall, I do think that some of the issue here comes from, inevitably, some worried people getting overly worried or overly (cautious in some sense, ambitious in others) and offering unwise proposals that go too far. Although that is inevitable in any situation like this, we can and must do a better job on that front.
A lot more of it comes from people on the unworried side misconstruing what was said and predicted. And a lot of it comes from the inherent difficulty of a probabilistic future threat coming steadily at you, and action needing to be taken in advance of when there is a problem, and there being a lot of path dependence. This is tough.
Mostly I think the worried-about-existential-risk-from-AI community has done an importantly imperfect and improvable but relatively good job at not ‘crying wolf’ in the traditional sense, especially on the actual existential risks themselves. Certainly if you ask whether those on the unworried side have ‘cried wolf’ about the ‘risks’ going the other way, they would come out looking much worse, and we can know this without having to run any experiments.
Alas, reality does not grade on a curve.
Rhetorical Innovation
Nathan Labenz comments on Roon’s appearance on Doom Debates. Roon is great at acknowledging that the risks from AI are real, but then puts the chance of them happening at under 1%, for reasons that I don’t see making any sense.
Some classic Sam Altman quotes from when he knew about existential risk.
Extended Eliezer Yudkowsky thread about what it would take to make AI meet the safety standards they had… at Chernobyl.
Janus highlights a particular section of it, I’ll share that part and a little bit after it.
Liar Liar
I support the principle of not lying to LLMs. Cultivate virtue and good habits.
If we are going to expect people not to lie to LLMs, then we need there not to be large rewards to lying to LLMs. If we did force you to say whether you wrote the thing in question, point blank, and you could only say ‘yes’ or ‘no,’ I can hardly blame someone for saying ‘no.’ The good news is you (at least mostly) don’t have to do that.
Feel the AGI
So many smart people simply do not Feel the AGI. They do not, on a very basic level, understand what superintelligence would be or mean, or that it could even Be a Thing.
Thus, I periodically see things like this:
People often really do think this, or other highly mundane things that humans can already do, are all you could do with superintelligence. This group seems to include ‘most economists.’ I’m at a loss how to productively respond, because my brain simply cannot figure out how people actually think this in a way that is made of gears and thus can be changed by evidence – I’ve repeatedly tried providing the obvious knockdown arguments and they basically never work.
Here’s a more elegant way of saying a highly related thing (link is a short video):
Here Edward Norton makes the same mistake, saying ‘AI is not going to write that. You can run AI for a thousand years, it’s not going to write Bob Dylan songs.’
The second part of that is plausibly true of AI as it exists today, if you need the AI to then pick out which songs are the Bob Dylan songs. If you ran it for a thousand years you could presumably get some Dylan-level songs out of it by chance, except they would be in an endless sea of worthless drek. The problem is the first part. AI won’t stay where it is today.
Another way to not Feel the AGI is to think that AGI is a boolean thing that you either have or do not have.
There’s no reason to assume it would only be 6-12 months. But even if it was, if you have AGI for six months, and then they get what you had, you don’t twiddle your thumbs at ‘AGI level’ while they do that. You use the AGI to build ASI.
Regular Americans Hate AI
It is remarkable how many replies were ‘of course we want ASI.’ Set aside the question of what would happen if we created ASI and whether we can do it safely. Who is we?
Americans hate current AI and they hate the idea of more capable, smarter future AI. Hashtag #NotAllAmericans and all that, but AI is deeply underwater in every poll, and Americans do not take kindly to those who attempt to deploy it to provide mundane utility.
They mostly hate AI, especially current AI, for bad reasons. They don’t understand what it can do for them or others, nor do they Feel the AGI. There is a lot of unjustified They Took Our Jobs. There are misplaced concerns about energy usage. The perception is that ‘hallucinations’ are ubiquitous, which is no longer the case for most purposes when compared to getting information from humans. They think using AI means you’re not thinking, instead of giving you the opportunity to think better.
It’s not about what you ‘can’ do. It’s about what is the most efficient solution to the problem, and as Seb says putting real value on your time.
Aligning a Smarter Than Human Intelligence is Difficult
Ryan Greenblatt asks, how will we update about scheming (yeah, I don’t love that term either, but go with it), based on what we observe in the future?
I sent you ten boats and a helicopter, but the guns involved are insufficiently smoking? But yes, I agree that there is a sense in which the guns seen so far are insufficiently smoking to satisfy many people.
I am optimistic that by default we will get additional evidence, from the perspective of those who are not already confident. We will see more experiments and natural events that demonstrate AIs acting like you would expect if what Ryan calls scheming was inevitable. The problem is what level of this would be enough to convince people who are not already convinced (although to be clear, I could be a lot more certain than I am).
I also worry about various responses of the form ‘well, we tuned it so it does not currently show scheming we can easily detect (at a time when scheming obviously wouldn’t work anyway), so future models won’t scheme’ as the default action and counterargument. I hope everyone reading understands by now why that would go supremely badly.
I also would note this section:
1/3: importantly non-worst-case scheming for one of the reasons discussed above, such that deals or control look substantially easier.
1/3: the AI is scheming for preferences that aren’t that bad. As in, the scope sensitive preferences aren’t that far from the distribution of human preferences and what the AI would end up wanting to do with cosmic resources (perhaps after reflection) isn’t much worse of an outcome from my perspective than the expected value from a human autocrat (and might be substantially better of an outcome). This might also be scheming which is at least somewhat importantly non-worst-case, but if it is really easy to handle, I would include it in the prior bucket. (Why is this only 1/3? Well, I expect that if we can succeed enough at instilling preferences such that we’re not-that-unhappy with the AI’s cosmic resource utilization, we can probably instill preferences which either prevent scheming or which make scheming quite easy to handle.)
Correspondingly, I think my P(scheming) numbers are roughly 2/3 as much expected badness as an AI which is a worst case schemer (and has terminal preferences totally orthogonal to typical human values and my values).
I find this hopelessly optimistic about alignment of preferences, largely for classic Yudkowsky-style reasons, but if it only discounts the downside risk by ~33%, then it doesn’t actually much matter in terms of what we should actually do.
Ryan goes through extensive calculations and likelihood ratios for much of the rest of the post, results which would then stack on top of each other (although they correlate with each other in various ways, so overall they shouldn’t fully stack?). Model architecture and capability levels are big factors for him here. That seems like a directionally correct approach – the more capable a model is, and the more opaque its reasoning, and the more it is relatively strong in the related areas, the more likely scheming is to occur. I was more skeptical in his likelihood ratios for various training approaches and targets.
Mostly I want to encourage others to think more carefully about these questions. What would change your probability by roughly how much?
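For those who want to play along at home, here is the mechanical version with made-up numbers – each independent piece of evidence multiplies your prior odds, and correlated factors should multiply in by less than their naive product:

```python
# Toy Bayesian bookkeeping with invented numbers, to make "stacking likelihood
# ratios" concrete. Multiplying ratios onto prior odds is only valid to the
# extent the pieces of evidence are independent.
def update_prob(prior_prob: float, likelihood_ratios: list[float]) -> float:
    odds = prior_prob / (1 - prior_prob)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)  # back to a probability

# Placeholder inputs: a 25% prior, one factor at 2x (say, opaque reasoning) and
# one at 1.5x (say, long-horizon training). Correlated factors should get a
# smaller combined ratio than their naive product.
print(round(update_prob(0.25, [2.0, 1.5]), 3))  # 0.5
print(round(update_prob(0.25, [2.0]), 3))       # 0.4
```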
Dominik Peters notes that when o1 does math, it always claims to succeed and is unwilling to admit when it can’t prove something, whereas Claude Sonnet often admits when it doesn’t know and explains why. He suggests benchmarks penalize this misalignment, whereas I would suggest a second score for that – you want to know how often a model can get the answer, and also how much you can trust it. I especially appreciate his warning to beware the term ‘can be shown.’
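A sketch of what that two-score report could look like, with made-up records of (claimed success, actually correct) pairs:

```python
# Sketch of the two-number report: one score for how often the model actually
# gets the answer, one for how much you can trust its claims of success.
def capability_score(records: list[tuple[bool, bool]]) -> float:
    return sum(correct for _, correct in records) / len(records)

def trust_score(records: list[tuple[bool, bool]]) -> float:
    claims = [correct for claimed, correct in records if claimed]
    return sum(claims) / len(claims) if claims else 1.0  # precision of success claims

# Hypothetical model that claims success on everything, right 60% of the time:
always_confident = [(True, True)] * 6 + [(True, False)] * 4
# Hypothetical model that solves half and admits it when it can't:
honest = [(True, True)] * 5 + [(False, False)] * 5

for name, recs in [("always-confident", always_confident), ("honest", honest)]:
    print(name, "capability:", capability_score(recs), "trust:", trust_score(recs))
```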
I do think, assuming the pattern is real, this is evidence of a substantial alignment failure by OpenAI. It won’t show up on the traditional ‘safety’ evals, but ‘claims to solve a problem when it didn’t’ seems like a very classic case of misaligned behavior. It means your model is willing to lie to the user. If you can’t make that go away, then that is both itself an inherent problem and a sign that other things are wrong.
Consider this outcome in the context of OpenAI’s new strategy of Deliberative Alignment. If you have a model willing to lie, and you give it a new set of rules that includes ‘don’t lie,’ and tell it to go off and think about how to implement the rules, what happens? I realize this is (probably?) technically not how it works, but metaphorically: Does it stop lying, or does it effectively lie about the lying in its evaluations of itself, and figure out how to lie more effectively?
An important case in which verification seems harder than generation is evaluating the reasoning within chain of thought.
Arthur Conmy: Been really enjoying unfaithful chain-of-thought (CoT) research with collaborators recently. Two observations:
1. Quickly, it’s clear that models are sneaking in reasoning without verbalizing where it comes from (e.g., making an equation that gets the correct answer, but defined out of thin air).
2. Verification is considerably harder than generation. Even when there are a few hundred tokens, often it takes me several minutes to understand whether the reasoning is sound or not.
This also isn’t just about edge cases; 1) happens with good models like Claude, and 2) is even true for simpler models like Gemma-2 2B.
Charbel-Raphael updates his previously universally negative views on every theory of impact of interpretability, and is now more positive on feasibility and usefulness. He still thinks many other agendas are better, but that only means we should do all of them.
The Lighter Side
Highlights from Claude’s stand-up routine.
True story, except it’s way more ridiculous all around.