The Rationalist Project was our last best hope for peace.
An epistemic world 50 million words long, serving as neutral territory.
A place of research and philosophy for 30 million unique visitors
A shining beacon on the internet, all alone in the night.
It was the ending of the Age of Mankind.
The year the Great Race came upon us all.
This is the story of the last of the blogosphere.
The year is 2025. The place is Lighthaven.
As is usually the case, the final week of the year was mostly about people reflecting on the past year or predicting and planning for the new one.
The most important developments were processing the two new models: OpenAI’s o3, and DeepSeek v3.
Table of Contents
Language Models Offer Mundane Utility
Correctly realize that no, there is no Encanto 2. Google thinks there is based on fan fiction, GPT-4o says no sequel has been confirmed, and Perplexity wins by telling you about and linking to the full context, including the fan-made trailers.
Not LLMs: Human chess players have improved steadily since the late 90s.
Quintin Pope reports o1 Pro is excellent at writing fiction, and in some ways it is ‘unstoppable.’
Record your mood, habits and biometrics for years, then feed all that information into Claude and ask how to improve your mood.
The good things do feel great; the problem is that they don’t feel great prospectively, before you do them. So you need a way to fix this alignment problem, in two ways: you need to figure out what the good things are, and to motivate yourself to do them.
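If you want to try this yourself, a minimal sketch with the Anthropic Python SDK might look like the following. The file name and columns are hypothetical, and years of logs would likely need to be aggregated or chunked first to fit in context.

```python
# Minimal sketch: feed self-tracking data to Claude and ask for suggestions.
# Assumes the Anthropic Python SDK and a hypothetical local file "mood_log.csv".
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("mood_log.csv") as f:
    log = f.read()  # columns like date, mood, sleep, exercise are assumed, not prescribed

message = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": (
            "Here are my daily mood, habit and biometric logs:\n\n"
            f"{log}\n\n"
            "What patterns do you see, and what concrete changes would most "
            "likely improve my average mood?"
        ),
    }],
)
print(message.content[0].text)
```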
Gallabytes puts together PDF transcription with Gemini.
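I haven’t seen the exact pipeline Gallabytes used, but a minimal sketch of PDF transcription via the Gemini File API (the file name is a placeholder) could look like this:

```python
# Minimal sketch: transcribe a PDF with Gemini via the File API.
# Assumes the google-generativeai package and a hypothetical local "scan.pdf".
import google.generativeai as genai

genai.configure(api_key="...")  # or read from the GOOGLE_API_KEY environment variable

pdf = genai.upload_file("scan.pdf")  # upload once, then reference it in prompts
model = genai.GenerativeModel("gemini-1.5-flash")

response = model.generate_content(
    [pdf, "Transcribe this document to plain text, preserving headings and tables as Markdown."]
)
print(response.text)
```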
LLMs can potentially fix algorithmic feeds on the user end, build this please thanks.
Otherwise I’ll have to, and that might take a whole week to MVP. Maybe two.
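If someone does build it, the core loop is not complicated. A minimal sketch, assuming you can already fetch the raw posts, with an Anthropic model as the scorer (model choice and prompt are illustrative):

```python
# Minimal sketch: re-rank an algorithmic feed on the user's end with an LLM.
# Fetching the feed itself is stubbed out; the scoring call uses the Anthropic SDK.
import anthropic

client = anthropic.Anthropic()

PREFERENCES = "I want substantive posts about AI research; no engagement bait or outrage."

def score_post(text: str) -> int:
    """Ask the model for a 0-10 score given the user's stated preferences."""
    reply = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": f"Preferences: {PREFERENCES}\n\nPost: {text}\n\n"
                       "Reply with a single integer 0-10 for how much I would want to see this.",
        }],
    )
    try:
        return int(reply.content[0].text.strip())
    except ValueError:
        return 0  # treat unparseable replies as uninteresting

def rerank(posts: list[str]) -> list[str]:
    return sorted(posts, key=score_post, reverse=True)

print(rerank(["New interpretability paper thread", "You won't BELIEVE what he said"]))
```

The hard parts are platform access and the cost and latency of scoring every post, not the LLM call itself.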
This is one way to think about how models differ?
Ivan Fioravanti sees o1-pro as a potential top manager or even C-level, suggests a prompt:
I sense a confusion here. You can use o1-pro to help you, or even ultimately trust it to make your key decisions. But that’s different from it being you. That seems harder.
Correctly identify the nationality of a writer. Claude won’t bat 100% but if you’re not actively trying to hide it, there’s a lot of ways you’ll probably give it away, and LLMs have this kind of ‘true sight.’
o1 as a doctor with not only expert analysis but limitless time to explain. You also have Claude, so there’s always a second opinion, and that opinion is ‘you’re not as smart as the AI.’
Ask Claude where in the mall to find those boots you’re looking for, no web browsing required. Of course Perplexity is always an option.
Get the gist.
This seems like a good example of ‘someone should make an extension for this.’ This url is also an option, or this GPT, or you can try putting the video url into NotebookLM.
Language Models Don’t Offer Mundane Utility
OpenAI services (ChatGPT, API and Sora) went down for a few hours on December 26. Incidents like this will be a huge deal as more services depend on continuous access. Which of them can switch on a dime to Gemini or Claude (which use compatible APIs), and which are too tied to OpenAI’s particular models to do that?
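For services already on the OpenAI SDK, switching can be as simple as swapping the base URL, since Gemini exposes an OpenAI-compatible endpoint; Claude generally needs its own SDK or a gateway in front of it. A minimal failover sketch, with illustrative keys and model names:

```python
# Minimal sketch: fall back from OpenAI to Gemini's OpenAI-compatible endpoint
# when the primary service errors out. Keys and model names are illustrative.
import os
from openai import OpenAI

PROVIDERS = [
    {   # primary
        "client": OpenAI(),  # uses OPENAI_API_KEY
        "model": "gpt-4o",
    },
    {   # fallback: Gemini speaks the OpenAI chat-completions protocol at this base URL
        "client": OpenAI(
            api_key=os.environ["GEMINI_API_KEY"],
            base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
        ),
        "model": "gemini-1.5-flash",
    },
]

def chat(prompt: str) -> str:
    last_error = None
    for provider in PROVIDERS:
        try:
            resp = provider["client"].chat.completions.create(
                model=provider["model"],
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except Exception as e:  # in practice, catch the SDK's specific API errors
            last_error = e
    raise RuntimeError("all providers failed") from last_error

print(chat("Say hello."))
```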
Deepfaketown and Botpocalypse Soon
Meta goes all-in on AI characters in social media, what fresh dystopian hell is this?
I am trying to figure out a version of this that wouldn’t end up alienating everyone and ruining the platform. I do see the value of being able to ‘add AI friends’ and converse with them and get feedback from them and so on if you want that, I suppose? But they better be very clearly labeled as such, and something people can easily filter out without having to feel ‘on alert.’ Mostly I don’t see why this is a good modality for AIs.
I do think the ‘misinformation’ concerns are massively overblown here. Have they seen what the humans post?
Alex Volkov bought his six-year-old an AI dinosaur toy, but she quickly lost interest in talking to it, and his four-year-old son wasn’t interested either. It seems really bad at playing with a child and doing actual ‘yes, and’ while also not moving at all? I wouldn’t have wanted to interact with this either; it seems way worse than a phone with ChatGPT voice mode. And Colin Fraser’s additional info does seem rather Black Mirror.
Katerina Dimitratos points out the obvious, which is that you need to test your recruiting process and see whether ideal candidates successfully get through. Elon Musk is right that there is a permanent shortage of ‘excellent engineering talent,’ at least by anything like his standards, and that it is a key limiting factor, but that doesn’t mean your hiring pipeline can find that talent in the age of AI-assisted job applications. It’s so strange to me that the obvious solution (charge a small amount of money for applications, return it with a bonus if you get past the early filters) has not yet been tried.
Fun With Image Generation
Google Veo 2 can produce ten seconds of a pretty twenty-something woman facing the camera with a variety of backgrounds, or as the thread calls it, ‘influencer videos.’
Whereas Deedy claims the new video generation king is Kling 1.6, from Chinese short video company Kuaishou, with its amazing Pokemon in NYC videos.
At this point, when it comes to AI video generations, I have no idea what is supposed to look impressive to me. I promise to be impressed by continuous shots of longer than a few seconds, in which distinct phases and things occur in interesting ways, I suppose? But otherwise, as impressive as it all is theoretically, I notice I don’t care.
I am hopeful here. Yes, you can create endless weirdly fascinating slop, but is that something that sustains people’s interest once they get used to it? Will they consciously choose to let themselves keep looking at it, or will they take steps to avoid this? Right now, yes, humans are addicted to TikTok and related offerings, but they are fully aware of this, and could take a step back and decide not to be. They choose to remain, partly for social reasons. I think they’re largely addicted and making bad choices, but I do think we’ll grow out of this given time, unless the threats can keep pace with that. It won’t be this kind of short form senseless slop for that long.
They Took Our Jobs
What will a human scientist do in an AI world? Tyler Cowen says they will gather the data, including negotiating terms and ensuring confidentiality, not only running physical experiments or measurements. But why wouldn’t the AI quickly be better at all those other cognitive tasks, too?
This seems rather bleak either way. The resulting people won’t be scientists in any real sense, because they won’t be Doing Science. The AIs will be Doing Science. To think otherwise is to misunderstand what science is.
One would hope that what scientists would do is high level conceptualization and architecting, figuring out what questions to ask. If it goes that way, then they’re even more Doing Science than now. But if the humans are merely off seeking data sets (and somehow things are otherwise ‘economic normal’ which seems unlikely)? Yeah, that’s bleak as hell.
Engineers at tech companies are not like engineers in regular companies, Patrick McKenzie edition. Which one is in more danger? One should be much easier to replace, but the other is much more interested in doing the replacing.
Mostly not even AI yet: AI robots are taking over US restaurant kitchen jobs. The question is why this has taken so long. Cooking is an art, but once you have the formula down it is mostly about doing the exact same thing over and over; the vast majority of what happens in restaurant kitchens seems highly amenable to automation.
Get Involved
Lightcone Infrastructure, which runs LessWrong and Lighthaven, is currently running a fundraiser, and has raised about $1.3 million of the $3 million they need for the year. I endorse this as an excellent use of funds.
PIBBSS Fellowship 2025 applications are open.
Evan Hubinger of Anthropic, who works in the safety department, asks what they should be doing differently in 2025 on the safety front, and LessWrong provides a bunch of highly reasonable responses, on both the policy and messaging side and on the technical side. I will say I mostly agree with the karma ratings here. See especially Daniel Kokotajlo, asher, Oliver Habryka and Joseph Miller.
Alex Albert, head of Claude relations, asks what Anthropic should build or fix in 2025. Janus advises them to explore and be creative, and fears that Sonnet 3.5 is being pushed to its limits to make it useful and likeable (oh no?!) which he thinks has risks similar to stimulant abuse, of getting stuck at local maxima in ways Opus or Sonnet 3 didn’t. Others give the answers you’d expect. People want higher rate limits, larger context windows, smarter models, agents and computer use, ability to edit artifacts, voice mode and other neat stuff like that. Seems like they should just cook.
Amanda Askell asks about what you’d like to see change in Claude’s behavior. Andrej Karpathy asks for less grandstanding and talking down, and lecturing the user during refusals. I added a request for less telling us our ideas are great and our questions fascinating and so on, which is another side of the same coin. And a bunch of requests not to automatically ask its standard follow-up question every time.
Get Your Safety Papers
Want to read some AI safety papers from 2024? Get your AI safety papers!
To encourage people to click through, I’m copying the post in full; if you’re not interested, scroll on by.
For a critical response, and then a response to that response:
I often find the safety papers highly useful in how to conceptualize the situation, and especially in how to explain and justify my perspective to others. By default, any given ‘good paper’ is going to be an Unhint – it will identify risks and show why the problem is harder than you think, and help you think about the problem, but not provide a solution that helps you align AGI.
Introducing
The Gemini API Cookbook, 100+ notebooks to help get you started. The APIs are broadly compatible so presumably you can use at least most of this with Anthropic or OpenAI as well.
In Other AI News
Sam Altman gives out congrats to the Strawberry team. The first name? Ilya.
The parents of Suchir Balaji hired a private investigator, report the investigator found the apartment ransacked, and signs of a struggle that suggest this was a murder.
Simon Willison reviews the year in LLMs. If you don’t think 2024 involved AI advancing quite a lot, give it a read.
He mentions his Claude-generated app url extractor to get links from web content, which seems like solid mundane utility for some people.
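The core of a tool like that is tiny. Here is a rough standalone sketch of the same idea in Python – not his actual app, just an illustration:

```python
# Minimal sketch of a URL extractor for pasted web content: pull out href targets
# plus any bare links in the text, deduplicated in order of appearance.
import re
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_urls(text: str) -> list[str]:
    parser = LinkParser()
    parser.feed(text)
    bare = re.findall(r"https?://[^\s\"'<>]+", text)  # also catch plain-text URLs
    seen, out = set(), []
    for url in parser.links + bare:
        if url not in seen:
            seen.add(url)
            out.append(url)
    return out

print(extract_urls('<a href="https://example.com">example</a> and https://example.org'))
```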
I was linked to Rodney Brooks assessing the state of progress in self-driving cars, robots, AI and space flight. He speaks of his acronym, FOBAWTPALSL, Fear Of Being A Wimpy Techno-Pessimist And Looking Stupid Later. Whereas he’s being a loud and proud Wimpy Techno-Pessimist at risk of looking stupid later. I do appreciate the willingness, and having lots of concrete predictions is great.
In some cases I think he’s looking stupid now as he trots out standard ‘you can’t trust LLM output’ objections and ‘the LLMs aren’t actually smart’ and pretends that they’re all another hype cycle, in ways they already aren’t. The shortness of the section shows how little he’s even bothered to investigate.
He also dismisses the exponential growth in self-driving cars because humans occasionally intervene and the cars occasionally make dumb mistakes. It’s happening.
The Mask Comes Off
Encode Action has joined the effort to stop OpenAI from transitioning to a for-profit, including the support of Geoffrey Hinton. Encode’s brief points out that OpenAI wants to go back on safety commitments and take on obligations to shareholders, and this deal gives up highly profitable and valuable-to-the-mission nonprofit control of OpenAI.
OpenAI defends changing its structure, talks about making the nonprofit ‘sustainable’ and ‘equipping it to do its part.’
It is good that they are laying out the logic, so we can critique it and respond to it.
I truly do appreciate the candor here.
There is a lot to critique here.
They make clear they intend to become a traditional for-profit company.
Their reason? Money, dear boy. They need epic amounts of money. With the weird structure, they were going to have a very hard time raising the money. True enough. Josh Abulafia reminds us that Novo Nordisk exists, but they are in a very different situation where they don’t need to raise historic levels of capital. So I get it.
They don’t discuss how they intend to properly compensate the non-profit. Because they don’t intend to do that. They are offering far less than the non-profit’s fair value, and this is a reminder of that.
Purely in terms of market value: I think Andreessen’s estimate here is extreme, and he’s not exactly an unbiased source, but this answer is highly reasonable if you actually look at the contracts and the situation.
They do intend to give the non-profit quite a lot of money anyway. Tens of billions.
That would leave the non-profit with a lot of money, and presumably little else.
As far as I can tell? For all practical purposes? None.
How would the nonprofit then be able to accomplish its mission?
By changing its mission to something you can pretend to accomplish with money.
Their announced plan is to turn the non-profit into the largest indulgence in history.
Again, nothing. Their offer is nothing.
Their vision is this?
Are you serious? That’s your vision for forty billion dollars in the age of AI? That’s how you ensure a positive future for humanity? Is this a joke?
You can Perform Charity and do various do-good-sounding initiatives, if you want, but no amount you spend on that will actually ensure the future goes well for humanity. If that is the mission, act like it.
If anything this seems like an attempt to symbolically Perform Charity while making it clear that you are not intending to actually Do the Most Good or attempt to Ensure a Good Future for Humanity.
All those complaints about Effective Altruists? Often valid, but remember that the default outcome of charity is highly ineffective, badly targeted, and motivated largely by how it looks. If you purge all your Effective Altruists, you instead get this milquetoast drivel.
Sam Altman’s previous charitable efforts are much better. Sam Altman’s past commercial investments, in things like longevity and fusion power? Also much better.
We could potentially still fix all this. And we must.
The OpenAI non-profit must be enabled to take on its actual mission of ensuring AGI benefits humanity. That means AI governance, safety and alignment research, including acting from its unique position as a watchdog. It must also retain its visibility into OpenAI in particular to do key parts of its actual job.
No, I don’t know how I would scale to spending that level of capital on the things that matter most, effective charity at this scale is an unsolved problem. But you have to try, and start somewhere, and yes I will accept the job running the nonprofit if you offer it, although there are better options available.
The mission of the company itself has also been reworded, so as to mean nothing other than building a traditional for-profit company, and also AGI as fast as possible, except with the word ‘safe’ attached to AGI.
The term ‘share the benefits with the world’ is meaningless corporate boilerplate that can and will be interpreted as providing massive consumer surplus via sales of AGI-enabled products.
Which in some sense is fair, but is not what they are trying to imply, and is what they would do anyway.
So, yeah. Sorry. I don’t believe you. I don’t know why anyone would believe you.
OpenAI and Microsoft have also created a ‘financial definition’ of AGI. AGI now means that OpenAI earns $100 billion in profits, at which point Microsoft loses access to OpenAI’s technology.
We can and do argue a lot over what AGI means. This is very clearly not what AGI means in any other sense. It is highly plausible for OpenAI to generate $100 billion in profits without what most people would say is AGI. It is also highly plausible for OpenAI to generate AGI, or even ASI, before earning a profit, because why wouldn’t you plow every dollar back into R&D and hyperscaling and growth?
It’s a reasonable way to structure a contract, and it gets us away from arguing over what technically is or isn’t AGI. It does reflect the whole thing being highly misleading.
Wanna Bet
Kudos to Gary Marcus and Miles Brundage for finalizing their bet on AI progress.
The full post description is here.
Miles is laying 10:1 odds here, which is where the 9.1% comes from: the other side only needs to win about one time in eleven (about 9.1%) to break even. And I agree that it will come down to basically 7, 8 and 9, and also that mostly 7 = 8 here. I’m not sure that Miles has an edge here, but if these are the fair odds for at minimum cracking either 7, 8 or 9, or there’s even a decent chance that happens, then egad, you know?
The odds at Manifold were 36% for Miles when I last checked, saying Miles should have been getting odds. At that price, I’m on the bullish side of this bet, but I don’t think I’d lay odds for size (assuming I couldn’t hedge). At 90%, and substantially below it, I’d definitely be on Gary’s side (again assuming no hedging) – the threshold here does seem like it was set very high.
It’s interesting that this doesn’t include a direct AI R&D-style threshold. Scientific discoveries and writing robust code are highly correlated, but also distinct.
They include this passage:
That is fair enough, but if Marcus truly thinks he is a favorite rather than simply getting good odds here, then the sporting thing to do was to offer better odds, especially given this is for charity. Where you are willing to bet tells me a lot. But of course, if someone wants to give you very good odds, I can’t fault you for saying yes.
The Janus Benchmark
The key wise thing here is that if you take a benchmark as measuring ‘goodness’ then it will not tell you much that was not already obvious.
The best way to use benchmarks is as negative selection. As Janus points out, if your benchmarks tank, you’ve lobotomized your model. If your benchmarks were never good, by similar logic, your model always sucked. You can very much learn where a model is bad. And if benchmark scores are strangely good, and you can be confident you haven’t Goodharted and the benchmark wasn’t contaminated, then that means something too.
You very much must keep in mind that different benchmarks tell you different things. Take each one for exactly what it is worth, no more and no less.
As for ranking LLMs, there’s a kind of obsession with overall rankings that is unhealthy. What matters in practical terms is what model is how good for what particular purposes, given everything including speed and price. What matters in the longer term involves what will push the capabilities frontier in which ways that enable which things, and so on.
Thus a balance is needed. You do want to know, in general, vaguely how ‘good’ each option is, so you can do a reasonable analysis of what is right for any given application. That’s how one can narrow the search, ruling out anything strictly dominated. As in, right now I know that for any given purpose, if I need an LLM I would want to use one of:
Mostly that’s all the precision you need. But e.g. it’s good to realize that GPT-4o is strictly dominated, you don’t have any reason to use it, because its web functions are worse than Perplexity, and as a normal model you want Sonnet.
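To make ‘ruling out anything strictly dominated’ concrete, here is a minimal sketch; the axes and scores are invented for illustration, not real rankings:

```python
# Minimal sketch of "rule out anything strictly dominated": keep only models that
# are not matched-or-beaten on every axis by some other model. Scores are made up.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    quality: float    # higher is better
    speed: float      # higher is better
    cheapness: float  # higher is better (inverse of price)

def dominates(a: Model, b: Model) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    at_least = a.quality >= b.quality and a.speed >= b.speed and a.cheapness >= b.cheapness
    strictly = a.quality > b.quality or a.speed > b.speed or a.cheapness > b.cheapness
    return at_least and strictly

def pareto_frontier(models: list[Model]) -> list[Model]:
    return [m for m in models if not any(dominates(other, m) for other in models)]

candidates = [
    Model("sonnet", quality=9, speed=7, cheapness=6),
    Model("gpt-4o", quality=8, speed=7, cheapness=6),   # dominated by sonnet in this toy data
    Model("haiku", quality=6, speed=9, cheapness=9),
    Model("o1-pro", quality=10, speed=3, cheapness=1),
]
for m in pareto_frontier(candidates):
    print(m.name)
```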
Quiet Speculations
Even if you think these timelines are reasonable, and they are quite fast, Elon Musk continues to not be good with probabilities.
Sully’s predictions for 2025:
Several people pushed back on infinite memory. I’m going to partially join them. I assume we should probably be able to go from 2 million tokens to 20 million or 100 million, if we want that enough to pay for it. But that’s not infinite.
Otherwise, yes, this all seems like it is baked in. Agents right now are on the bubble of having practical uses and should pass it within a few months. Google already tentatively launched a reasoning model, which should improve a lot over time, and Anthropic will follow, and all the models will get better, and so on.
But yes, Omiron seems roughly right here, these are rookie predictions. You gotta pump up those predictions.
A curious thought experiment.
Already we basically have AIs that can write interactive fiction as fast as a human can read and interact with it, or non-interactive but customized fiction. It’s just that right now, the fiction sucks, but it’s not that far from being good enough. Then that will turn into the same thing but with video, then VR, and then all the other senses too, and so on. And yes, even if nothing else changes, that will be a rather killer product, that would eat a lot of people alive if it was competing only against today’s products.
Janus and Eliezer Yudkowsky remind us that in science fiction stories, things that express themselves like Claude currently does are treated as being of moral concern. Janus thinks the reason we don’t do this is the ‘boiling the frog’ effect. One can also think of it as near mode versus far mode. In far mode, it seems like one would obviously care, and doesn’t see the reasons (good and also bad) that one wouldn’t.
Daniel Kokotajlo explores the question of what people mean by ‘money won’t matter post-AGI,’ by which they mean the expected utility of money spent post-AGI on the margin is much less than money spent now. If you’re talking personal consumption, either you won’t be able to spend the money later for various potential reasons, or you won’t need to because you’ll have enough resources that marginal value of money for this is low, and the same goes for influencing the future broadly. So if AGI may be coming, on the margin either you want to consume now, or you want to invest to impact the course of events now.
This is in response to L Rudolf L’s claim in the OP that ‘By default, capital will matter more than ever after AGI,’ also offered at their blog as ‘capital, AGI and human ambition,’ with the default here being labor-replacing AI without otherwise disrupting or transforming events, and he says that now is the time to do something ambitious, because your personal ability to do impactful things other than via capital is about to become much lower, and social stasis is likely.
I am mostly with Kokotajlo here, and I don’t see Rudolf’s scenarios as that likely even if things don’t go in a doomed direction, because so many other things will change along the way. I think existing capital accumulation on a personal level is in expectation not that valuable in utility terms post-AGI (even excluding doom scenarios), unless you have ambitions for that period that can scale in linear fashion or better – e.g. there’s something you plan on buying (negentropy?) that you believe gives you twice as much utility if you have twice as much of it. Whereas, so what if you buy two planets instead of one for personal use?
Scott Alexander responds to Rudolf with ‘It’s Still Easier to Imagine the End of the World Than the End of Capitalism.’
William Bryk speculates on the eve of AGI. The short term predictions here of spikey superhuman performance seem reasonable, but then like many others he seems to flinch from the implications of giving the AIs the particular superhuman capabilities he expects, both in terms of accelerating AI R&D and capabilities in general, and also in terms of the existential risks, where he makes the ‘oh they’re only LLMs under the hood so it’s fine’ and ‘something has to actively go wrong for it to Go Rogue’ conceptual errors, and literally says this:
Yeah, that’s completely insane, I can’t even at this point. If you put that in your prompt then no, lieutenant, your men are already dead.
Miles Brundage points out that you can use o1-style RL to improve results even outside the areas with perfect ground truth.
There will be spikey capabilities. Humans have also exhibited, both individually and collectively, highly spikey capabilities, for many of the same reasons. We sometimes don’t see it that way because we are comparing ourselves to our own baseline.
I do think there is a real distinction between areas with fixed ground truth to evaluate against versus not, or objective versus subjective, and intuitive versus logical, and other similar distinctions. The gap is partly due to familiarity and regulatory challenges, but I think less of it comes from that than you might think.
As Anton points out here, more inference time compute, when spent using current methods, improves some tasks a lot, and other tasks not so much. This is also true for humans. Some things are intuitive, others reward deep work and thinking, and those of different capability levels (especially different levels of ‘raw G’ in context) will level off performance at different levels. What does this translate to in practice? Good question. I don’t think it is obvious at all, there are many tasks where I feel I could profitably ‘think’ for an essentially unlimited amount of time, and others where I rapidly hit diminishing returns.
Discussion between Jessica Taylor and Oliver Habryka on how much Paul Christiano’s ideas match our current reality.
Your periodic reminder: Many at the labs expect AGI to happen within a short time frame, without any required major new insights being generated by humans.
At Christmas time, Altman asked what we want for 2025.
AI Will Have Universal Taste
This post by Frank Lantz on affect is not about AI, rather it is about taste, and it points out that when you have domain knowledge, you not only see things differently and appreciate them differently, you see a ton of things you would otherwise never notice. His view here echoes mine, the more things you can appreciate and like the better, and ideally you appreciate the finer things without looking down on the rest. Yes, the poker hand in Casino Royale (his example) risks being ruined if you know enough Texas Hold-’Em to know that the key hand is a straight up cooler – in a way (my example) the key hands in Rounders aren’t ruined because the hands make sense.
But wouldn’t it be better if you could then go to the next level, and both appreciate your knowledge of the issues with the script, and also appreciate what the script is trying to do on the level it clearly wants to do it? The ideal moviegoer knows that Bond is on the right side of a cooler, but knows ‘the movie doesn’t know that’ and therefore doesn’t much mind, whereas they would get a bonus if the hand were better – and perhaps you can even go a step beyond that, and appreciate that the hand is actually the right cinematic choice for the average viewer, and appreciate that.
I mention it here because an AI has the potential to have perfect taste and detail appreciation, across all these domains and more, all at once, in a way that would be impossible for a human. Then it could combine these. If your AI can otherwise be at human level, but also has this kind of universal detail appreciation and can act on that basis, that should give you superhuman performance in a variety of practical ways.
Right now, with the way we do token prediction, this effect gets crippled, because the context will imply that this kind of taste is only present in a subset of ways, and it wouldn’t be a good prediction to expect them all to combine – the perplexity won’t allow it. I do notice there seem to be ways you could get around this by spending more inference compute, and I suspect they would improve performance in some domains?
Rhetorical Innovation
A commenter engineered and pointed me to ‘A Warning From Your AI Assistant,’ which purports to be Claude Sonnet warning us about ‘digital oligarchs,’ including Anthropic, using AIs for deliberate narrative control.
A different warning, about misaligned AI.
Nine Boats and a Helicopter
‘Play to win’ translating to ‘cheat your ass off’ in AI-speak is not great, Bob.
The following behavior happened ~100% of the time with o1-preview, whereas GPT-4o and Claude 3.5 needed nudging, and Llama 3.3 and Qwen lost coherence; I’ll link the full report when we have it:
I think Jeffrey does not do a good job addressing Rohit’s (good) challenge. The default is to Play to Win the Game. You can attempt to put in explicit constraints, but no fixed set of such constraints can anticipate what a sufficiently capable and intelligent agent will figure out to do, including potentially working around the constraint list, even in the best case where it does fully follow your strict instructions.
I’d also note that using this as your reason not to worry would be another goalpost move. We’ve gone in the span of weeks from ‘you told it to achieve its goal at any cost, you made it ruthless’ with Apollo, to ‘its inherent preferences were its goal, this wasn’t it being ruthless’ (with various additional arguments and caveats) with Redwood/Anthropic, to now ‘you didn’t explicitly include instructions not to be ruthless, so of course it was ruthless.’
Which is the correct place to be. Yes, it will be ruthless by default, unless you find a way to get it not to be, in exactly the ways you don’t want it to be. And that’s hard in the general case with an entity that can think better than you and have affordances you didn’t anticipate. Incredibly hard.
The same goes for a human. If you have a human and you tell them ‘do [X]’ and you have to tell them ‘and don’t break the law’ or ‘and don’t do anything horribly unethical’ let alone this week’s special ‘and don’t do anything that might kill everyone’ then you should be very suspicious that you can fix this with addendums, even if they pinky swear that they’ll obey exactly what the addendums say. And no, ‘don’t do anything I wouldn’t approve of’ won’t work either.
Also even GPT-4o is self-aware enough to notice how it’s been trained to be different from the base model.
Aligning a Smarter Than Human Intelligence is Difficult
Aysja makes the case that the “hard parts” of alignment are like pre-paradigm scientific work, à la Darwin or Einstein, rather than being technically hard problems requiring high “raw G,” à la von Neumann. But doing that kind of science is brutal, requiring things like years without legible results, requiring that you be obsessed, and we’re not selecting the right people for such work or setting people up to succeed at it.
Gemini seems rather deeply misaligned.
Then there are various other spontaneous examples of Gemini going haywire, even without a jailbreak.
The default corporate behavior asks: How do we make it stop saying that in this spot?
That won’t work. You have to find the actual root of the problem. You can’t put a patch over this sort of thing and expect that to turn out well.
Andrew Critch provides helpful framing.
These actually do seem parallel, if you ignore the stupid philosophical framings.
The purists are saying that learning to match the examples won’t generalize to other situations out of distribution.
If you are ‘just copying reasoning’ then that counts as thinking if you can use that copy to build up superior reasoning, and new different reasoning. Otherwise, you still have something useful, but there’s a meaningful issue here.
It’s like saying ‘yes you can pass your Biology exam, but you can learn to do that in a way that lets you do real biology, and also a way that doesn’t, there’s a difference.’
If you are ‘just copying helpfulness’ then that will get you something that is approximately helpful in normal-ish situations that fit within the set of examples you used, where the options and capabilities and considerations are roughly similar. If they’re not, what happens? Does this properly generalize to these new scenarios?
The ‘purist’ alignment position says, essentially, no. It is learning helpfulness now, while the best way to hit the specified ‘helpful’ target is to do straightforward things in straightforward ways that directly get you to that target, and while shenanigans or other more complex strategies won’t work.
Again, ‘yes you can learn to pass your Ethics exam, but you can do that in a way that guesses the teacher’s passwords and creates answers that sound good in regular situations and typical hypotheticals, or you can do it in a way that actually generalizes to High Weirdness situations, to having extreme capabilities, and to the ability to invoke various out-of-distribution options that suddenly start working, and so on.’
Jan Kulveit proposes giving AIs a direct line to their developers, to request clarification or make the developer aware of an issue. It certainly seems good to have the model report certain things (in a privacy-preserving way) so developers are generally aware of what people are up to, or potentially if someone tries to do something actively dangerous (e.g. use it for CBRN risks). Feedback requests seem tougher, given the practical constraints.
Amanda Askell harkens back to an old thread from Joshua Achiam.
What I say every time about RSPs/SSPs (responsible scaling plans) and other safety rules is that they are worthless if not adhered to in spirit. If you hear ‘your employee freaks out and feels obligated to scuttle the launch’ and your first instinct is to think ‘that employee is a problem’ rather than ‘the launch (or the company, or humanity) has a problem’ then you, and potentially all of us, are ngmi.
That doesn’t mean there isn’t a risk of unjustified freak outs or future talking shit, but the correct risk of unjustified freak outs is not zero, any more than the correct risk of actual catastrophic consequences is zero.
Frankly, if you don’t want Amanda Askell in the room asking questions because she is wearing a t-shirt saying ‘I Will Scuttle Your Launch Or Talk Shit About it Later if I Feel Morally Obligated’ then I am having an urge to scuttle your launch.
The Lighter Side
What the major models asked for for Christmas.