Dario Amodei is thinking about the potential. The result is a mostly good essay called Machines of Loving Grace, outlining what can be done with ‘powerful AI’ if we had years of what was otherwise relative normality to exploit it in several key domains, and we avoided negative outcomes and solved the control and alignment problems. As he notes, a lot of pretty great things would then be super doable.

Anthropic also offers us improvements to its Responsible Scaling Policy (RSP, or what SB 1047 called an SSP). Still much left to do, but a clear step forward there.

Daniel Kokotajlo and Dean Ball have teamed up on an op-ed for Time on the need for greater regulatory transparency. It’s very good.

Also, it’s worth checking out the Truth Terminal saga. It’s not as scary as it might look at first glance, but it is definitely super far out.

Table of Contents

  1. Introduction.
  2. Table of Contents.
  3. Language Models Offer Mundane Utility. More subscriptions means more utility.
  4. Language Models Don’t Offer Mundane Utility. Then again, neither do you.
  5. Deepfaketown and Botpocalypse Soon. Quality remains the limiting factor.
  6. They Took Our Jobs. But as Snow Crash foretold us, they’ll never take our pizza.
  7. Get Involved. UK AISI hiring technical advisor, Tarbell Grants for AI reporting.
  8. Introducing. Grok 2 gets a proper API.
  9. In Other AI News. It’s time to go nuclear.
  10. Truth Terminal High Weirdness. When the going gets weird, the weird turn pro.
  11. Quiet Speculations. Are the labs holding back?
  12. Copyright Confrontation. New York Times sends a cease and desist to Perplexity.
  13. AI and the 2024 Presidential Election. Very briefly getting this out of the way.
  14. The Quest for Sane Regulations. A proposal all reasonable people should agree on.
  15. The Week in Audio. Matt Stone asks, does Sam Altman do anything but go on podcasts?
  16. Just Think of the Potential. They could be machines of loving grace.
  17. Reactions to Machines of Loving Grace. Much agreement, some notes of caution.
  18. Assuming the Can Opener. I would very much like a can opener.
  19. Rhetorical Innovation. People often try to convince you that reason is impossible.
  20. Anthropic Updates its Responsible Scaling Policy. New and improved.
  21. Aligning a Smarter Than Human Intelligence is Difficult. Are you smart enough?
  22. The Lighter Side. The art of the possible.

Language Models Offer Mundane Utility

Just Think of the Potential, local edition, and at least I’m trying:

Roon: if you believe the “returns to intelligence” wrt producing good tweets or essays is large we are clearly experiencing quite a large overhang

Perplexity CEO pitches his product:

Aravind Srinivas (CEO Perplexity): Perplexity charts with code generation and execution have the potential to be the friendly UI and affordable Bloomberg terminal for the masses, which everyone has wanted for a long time! Perplexity Pro is $20/mo, while Bloomberg Terminal is $2500/mo. So, more than 100x cheaper.

I do not think Srinivas appreciates the point of a Bloomberg terminal.

Redwood Forest: Show me you haven’t used Bloomberg terminal without telling me you haven’t used Bloomberg terminal. Bloomberg was one of the first to train their own foundation model before Anthropic even released a model.

The point of the Bloomberg terminal is precise, reliable, up-to-the-second data, commands that reliably do exactly what you want, and exactly the features traders want and count on to figure out the things they actually care about to make money, with shortcuts and other things to match their needs in real time. Perplexity Pro is probably worth $20/month to a lot of people, but I am confident Bloomberg is unworried.

Dean Ball is impressed with o1 for tasks like legal and policy questions, and suggests instructing it to ask follow-up and clarifying questions. I haven’t been as impressed, I presume largely because my purposes are not a good fit for o1’s strengths.

Avital Balwit on how they use Claude especially for writing and editing tasks, also language learning, calorie counting and medical diagnoses. Here are some tips offered:

  1. Use a project. If you always want Claude to have certain context, upload documents to a project’s “knowledge” and then keep all of your conversations that require that context in that project. I have one I use for my work and I’ve uploaded things like my To Do list for the past year, my planning documents for the next few months, etc. This saves me the time of explaining where I work, what my role is, who the people I frequently reference are.
  2. Ask for more examples. I have one friend who always asks Claude for 3-20 examples of whatever she is looking for (eg. “give me 20 examples of how I could write this sentence”). She then chooses the best, or takes elements from multiple to create one she likes. By asking for more, she increases the chances she’s really happy with one result.

‘Most people are underutilizing models,’ the last section heading, is strongly true even for those (like myself) who are highly aware of the models. It is a weird kind of laziness, where it’s tempting not to bother to improve your workflow, and it seems ‘easier’ in a sense to do everything yourself or the old way until you’ve established the new way.

Jacquesthibs details all the AI productivity software they’re using, and how much they are paying for it, which Tyler Cowen found hilarious. I understand his reaction; this seems a lot like an accumulation of getting milked for $10 or $20 a month for versions of the same thing, often multiple copies of them. But that’s because all of this is dramatically underpriced, and having the right tool for the right job is worth orders of magnitude more. The question here is correctly ‘why can’t I pay more to get more?’ not ‘why do I need to pay so many different people’ or ‘how do I avoid buying a service that isn’t worthwhile or is duplicative.’ Buying too many is only a small mistake.

Analyze your disagreements so you win arguments with your boyfriend, including quoting ChatGPT as a de facto authority figure.

Language Models Don’t Offer Mundane Utility

The boyfriend from the previous section is not thrilled by the pattern of behavior and has asked his girlfriend to stop. The alternative option is to ‘fight fire with fire’ and load up his own LLM, so both of them can prompt and get their version to agree with them and yell AI-generated arguments and authority at each other. The future is coming.

And by language models, we might mean you.

Aella: Since the discourse around AI, it’s been super weird to find out that people somehow don’t think of human speech as mostly autocomplete language machines too. It seems like people think humans are doing something entirely different?

This is not always the mode I am in, but it is definitely sometimes the mode I am in. If you think you never do this, keep an eye out for it for a while, and see.

What do we call ‘the thing LLMs can’t do that lets us dismiss them’ this week?

Dan Hendrycks: “LLMs can’t reason” is the new

“LLMs don’t have common sense”

There is some disputing the question in the comments. I think it mostly confirms Dan’s point.

Alternatively, there are the classic options.

Anthony Aguirre: Getting a bit fatigued with AI papers following the formula:

  1. I don’t like the AI hype, so I’m going to set out to show that AI cannot do X, even though it sure looks like AI is doing X.
  2. I’ll invent a new version of X that is extra hard for AI.
  3. I’ll show that AI is not nearly as good at this extra-hard version of X.
  4. I’ll neglect the facts that:
    1. humans are also worse at it, and/or
    2. AI is still actually decently good at it and/or
    3. newer and bigger models are better at the extra-hard X than older and smaller models, so future models are likely to be better at harder-X.
  5. I’ll conclude that the current AI paradigm does not *really* do the original X (bonus points for “cannot ever” do X.)

I mean, it’s good to probe how models get better and worse at different versions of a task, but when it starts with an obvious agenda and over-claims, it gets headlines but not my respect. Much more interesting to investigate with genuine curiosity and an open mind about how AI and human cognition differ.

Daniel Eth: Has anyone written a paper on “Can humans actually reason or are they just stochastic parrots?” showing that, using published results in the literature for LLMs, humans often fail to reason? I feel like someone should write that paper.

Checking to see if you’re proving too much is often a wise precaution.

Did they, though?

Assigned Theyfab at Death: my mom (who’s a university professor) did something interesting last year: she assigned her students to give chatgpt an essay question, have it write a paper, and then proofread/fact check it. Nearly every single student in that class came out of that assignment anti-chatgpt.

As long as chatgpt is around, students are going to use it to cut corners. it sucks, but it’s true. the best we can do at this point is show them why it’s a double-edged sword and will often just create more work for them.

Daniel Eth: This is interesting, and it’s probably something more teachers should do, but if your reaction to this exercise is to become anti-chatGPT instead of just recognizing the system has limits and shouldn’t be trusted to not hallucinate, then you’re ngmi

Saying you came out of that assignment ‘anti-ChatGPT’ is a classic case of guessing the teacher’s password. What does it mean to be ‘anti-ChatGPT’ while continuing to use it? We can presumably mostly agree that it would be good for university education if some of the uses of LLMs were unavailable to students – if the LLM essentially ran a smart version of ‘is this query going to on net help the student learn?’ That option is not available.

Students mostly realize that if they had to fact check every single statement, in a ‘if there is a factual error in this essay you are expelled’ kind of way, they would have to give up on many use cases for LLMs. But also most of the students would get expelled even without LLMs, because mistakes happen, so we can’t do that.

Classic fun with AI skepticism:

Davidad: Search engine skeptics: “It may seem like the engine can help answer your questions, but it’s just doing approximate retrieval—everything it shows you was already there on the Internet, and you could have found it yourself if you just typed in its URL. Worse still, many websites on the Internet are wrong. This makes search engines worse than useless.”

Seb Krier: Same with books – people think they teach you new things, but they’re just arranging existing words. Everything in them was already in the dictionary.

Unconed: No classic search engine would produce the nonsense that google AI comes up with and you know it.

Davidad: No classic library card catalog would produce the nonsense that people post on the Internet.

It’s certainly possible they used ChatGPT for this, but they’re definitely fully capable of spouting similar shibboleths without it.

The thing is, I’d prefer it if they were using ChatGPT here. Why waste their time writing these statements when an AI can do it for you? That’s what it’s for.

Deepfaketown and Botpocalypse Soon

Levelsio: 🤖 Monthly AI reply bot update:

They’re getting better.

This one took me a while to catch.

But the jokes are too cheesy, def GPT 4 because quite high IQ and seems to have vision capabilities too.

Respect for effort 👏 but still AI reply so blocked 😀.

David Manheim: The cost of detecting AI bots is now a large multiple of the cost to make them, and the latter is dropping exponentially.

I haven’t seen reasons to think we can solve this. We’ll either rely on trust networks, require strong human verification, or abandon public communication.

If they get sufficiently difficult to catch, xkcd suggests ‘mission f***ing accomplished,’ and there is certainly something to that. The reply-based tactic makes sense as a cheap and easy way to get attention. Most individual replies could plausibly be human; it is when you see several from the same source that it becomes clear.

If we are relying on humans noticing the bots as our defense, that works if and only if the retaliation means the bots net lose. Yes, if you can figure out how to spend $1000 to make us waste $1mm in time, that is annoying, but is anyone going to actually do that if they don’t also make money doing it?

As we’ve discussed before, the long term solution is plausibly some form of a whitelist, or requiring payment or staking of some costly signal or resource as the cost of participation. As long as accounts are a scarce resource, it is fine if it costs a lot more to detect and shut down the bot than it does to run the bot.
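To make that concrete, here is a minimal sketch of the staking economics. All the specific numbers are my own illustrative assumptions, not anything from the discussion above:

```python
# Illustrative sketch of stake-based participation economics. The point:
# once an account must stake something it forfeits when caught, the bot's
# expected value can go negative even if detection costs the platform far
# more than running the bot costs. All numbers here are made up.

def bot_expected_profit(revenue_per_month: float,
                        stake: float,
                        monthly_detection_prob: float) -> float:
    """Expected monthly profit of one bot account under a staking regime."""
    expected_stake_loss = monthly_detection_prob * stake
    return revenue_per_month - expected_stake_loss

# A bot earning $20/month with a $50 stake and a 60% chance of being
# caught each month is unprofitable in expectation (-$10/month), even if
# catching it costs the platform $500.
print(bot_expected_profit(revenue_per_month=20.0, stake=50.0,
                          monthly_detection_prob=0.6))  # -10.0
```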

Are the ‘AI companion’ apps, or robots, coming? I mean, yes, obviously?

Cartoons Hate Her!: Sex robots will never be a big thing outside of chronic gooners because I think for most people at least 50% of what makes sex appealing is genuinely being desired.

Before you say this isn’t true of men, note that most incels do not hire sex workers, and the ones who do don’t suddenly feel better about their situation or stop identifying as incels.

I’ve talked to incels for my writing. They were actually pretty sympathetic people. And most of what they wanted was for someone to *like* them. Like yeah they want sex, but that’s not the main problem or they’d see sex workers (none did).

I think the biggest risk is that they dominate a portion of society who could attract a partner with a bit of self improvement but the path of least resistance will be robots

zjerb dude: You’re kind of underestimating how desperately horny young single men can be. Sex robots will sell gangbusters.

Cartoons Hate Her: Oh I’m sure they will I just don’t think they’ll ever replace men/women at large.

Mason (QTing CHH above): 100% agree with the premise but not the conclusion. The explosion of parasocial sex services even in an environment fully saturated with free porn shows how easily people create false intimacy. AI is already great at this and it’ll be incredible by the time robotics catches up.

Ultimately people are going to have to decide whether to Just Say No to sex with robots, which will be pretty easy for the generations that matured without them and not trivial at all for their children.

Fiscal Conservative: The statistics on the number of young men, in particular, who are involuntarily celibate due to the whole mess that dating apps and current social mores are making will make a sexual surrogate AI robot incredibly demanded. It is a freaking disaster.

Mason: I think the generations reaching their 30s before the advent of really good sex robots will mostly be spared *except* for the men who never figured it out with women.

IMO the outlook is much worse for younger generations, for adolescent males and females alike.

Everyone involved agrees that the AI sex robots, toys and companions will likely replace porn, toys that get used alone and (at least lower end) prostitution. If you’re already in the fantasy business or the physical needs business rather than the human connection and genuine desire business, the new products are far superior.

If you’re in the desire and validation business, it gets less clear. I’ve checked a few different such NSFW sites because sure why not, and confirmed that yes, they’re mostly rather terrible products. You get short replies from dumb models that get confused very easily. Forget trying to actually have a real conversation. No matter your goal in *ahem* other areas, there’s zero challenge or subtlety, and the whole business model is of course super predatory. Alphazira.com was the least awful in terms of reply quality; Mwah.com (the one with the leak from last week) offers some interesting customization options, but at least the trial version was dumb as bricks. If anything it all feels like a step backwards even from AI Dungeon, which caused interesting things to happen sometimes and wasn’t tied to interaction with a fixed character.

I’m curious if anyone does have a half-decent version – or kind of what that would even look like, right now?

It does seem like this could be a way for people to figure out what they actually care about or want, maybe? Or rather, to quickly figure out what they don’t want, and to realize that it would quickly be boring.

One must keep in mind that these pursuits very much trade off against each other. Solo opportunities, most of them not social or sexual, have gotten way better, and this absolutely reduces social demand.

I could be alone for a very long time, without interaction with other humans, so long as I had sufficient supplies, quite happily, if that was a Mr. Beast challenge or something. I mean, sure, I’d get lonely, but think of the cash prizes.

Kitten: People are freaked out about AI friends discouraging real life friendship, but I think that basically already happened

A big driver of social atomization is solo entertainment getting really good and really cheap over the last half century

It’s never been better to be alone.

Tracing Woods: yeah I spent most of my childhood happily (and mostly wastefully) engaged in solo pursuits. The new social Games I play are healthier on balance, but AI or not, there is more high-quality solo entertainment than we know what to do with.

Kitten: Are you trying to tell me putting 60 hours into dragon warrior 4 didn’t make me the man I am today?

Shea Levy: Worse, he’s telling you that it *did*.

Kitten: Oof.

As I’ve said before, my hope is that the AI interactions serve as training grounds. Right now, they absolutely are not doing that, because they are terrible. But I can see the potential there, if they improved.

A distinct issue is what happens if you use someone’s likeness or identity to create a bot, without their permission? The answer is of course ‘nothing, you can do that, unless someone complains to the site owner.’ If someone wants to create one in private, well, tough luck, you don’t get to tell people what not to do with their AIs, any more than you can prevent generation of nude pictures using ‘nudify’ bots on Telegram.

If you want to generate a Zvi Mowshowitz bot? You go right ahead, so long as you make reasonable efforts to have it be accurate regarding my views and not be dumb as bricks. Go nuts. Have a great conversation. Act out your fantasy. Your call.

Also it seems like someone is flooding the ‘popular upcoming’ game section of Steam with AI slop future games? You can’t directly make any money that way – there are plenty of protections against that – but here’s one theory:

Byrne Hobart: Reminds me of the writer who A/B tested character names by running Google search ads for genre-related searches with different character names in the copy—they might be testing to see which game genres there’s quality-indifferent demand for.

This actually makes sense. If you can get people interested with zero signs of any form of quality, you can make something. You can even make it good.

They Took Our Jobs

Pizza Hut solves our job-application costly signal problem, allowing you to print out your resume onto a pizza box and deliver it with a hot, fresh pizza to your prospective employer. You gotta love this pitch:

Perfection, if you don’t count the quality of the pizza. This is the right size for a costly signal, you buy goodwill for everyone involved, and because it wasn’t requested no one thinks the employer is being unfair by charging you to put in an application. Everybody wins.

Get Involved

UK AISI hiring technical advisor, deadline October 20, move fast.

Tarbell Grants will fund $100k in grants for original reporting on AI.

Introducing

Google’s Imagen 3 now available to all Gemini users.

Grok 2 API, which costs $4.20/million input tokens, $6.9/million output tokens, because of course it does. Max output 32k, 0.58sec latency, 25.3t/s.

Jacob: the speed is incredible and they just added function calling! plus, it’s not censored. Less safeguards = better.

Don’t you love a world where what everyone demands is ‘less safeguards’? Not that I’d pretend I wouldn’t want the same for anything I’d do at this stage.
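For reference, here is a minimal sketch of calling it, assuming (as widely reported) that the API is OpenAI-compatible; the base URL and model identifier are my assumptions to verify against xAI’s documentation:

```python
# Hedged sketch: Grok 2 via an OpenAI-compatible client. Base URL and
# model name are assumptions; check xAI's docs before relying on them.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_XAI_API_KEY",       # issued from the xAI console
    base_url="https://api.x.ai/v1",   # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="grok-beta",                # assumed model identifier
    max_tokens=1024,                  # comfortably under the 32k output cap
    messages=[{"role": "user", "content": "What's new in AI this week?"}],
)
print(response.choices[0].message.content)
```

At $4.20 per million input tokens, a thousand-token prompt costs about $0.004, so the pricing is a meme but also genuinely cheap.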

OpenAI’s MLE-Bench is a new benchmark for machine learning engineering, paper here, using Kaggle as a baseline. o1-preview is starting out getting to bronze medal level in 16.9% of competitions. Predictors expect rapid improvement, saying there is a 42% chance the 80% threshold is reached by the end of 2025, and 70% by end of 2026.

In Other AI News

As he likes to say, a very good sentence:

Tyler Cowen: I’ve grown not to entirely trust people who are not at least slightly demoralized by some of the more recent AI achievements.

From Scott Alexander, an AI Art Turing Test.

Google to build small modular nuclear reactors (SMRs) with Kairos Power, aiming to have the first online by 2030. That is great and is fast by nuclear power standards, and also slower than many people’s timelines for AGI.

Amazon is getting in on the act as well, and will invest over $500 million across three projects.

As Ryan McEntush points out, investing in fully new reactors has a much bigger impact on jumpstarting nuclear power than investments to restart existing plants or merely purchase power.

Also it seems Sierra Club is reversing their anti-nuclear stance? You love to see it.

Eric Schmidt here points out that if AI drives sufficient energy development, it could end up net improving our energy situation. We could move quickly down the cost curve, and enable rapid deployment. In theory yes, but I don’t think the timelines work for that?

The full release of Apple Intelligence is facing delays; it won’t get here until 5 days after the new AppInt-enabled iPads. I’ve been happy with my Pixel 9 Fold purely as a ‘normal’ phone, but I’ve been disappointed by both the unfolding option, which is cute but ends up not being used much, and the AI features, which I still haven’t gotten use out of after over a month. For now Apple Intelligence seems a lot more interesting and I’m eager to check it out. I’m thinking an iPad Air would be the right test?

Nvidia releases a new Llama 3.1-70B fine-tune. They claim it is third on this leaderboard I hadn’t seen before. I am not buying it, based on the rest of the scores and also that this is a 70b model. Pliny jailbroke it, of course, ho hum.

If you’ve ever wanted to try the Infinite Backrooms, a replication is available.

Dane Stuckey, formerly CISO of Palantir, joins OpenAI as CISO (chief information security officer) alongside head of security Matt Knight.

The Audacious Project lives up to its name, giving $21 million to RAND and $17 million to METR.

METR Blog: The Audacious Project catalyzed approximately $38 million of funding for Project Canary, a collaboration with METR and RAND focused on developing and deploying evaluations to monitor AI systems for dangerous capabilities. Approximately $17 million of this will support work at METR. We are grateful for and honored by this vote of confidence.

Neel Nanda: It’s awesome to see mainstream foundations supporting dangerous capability evaluations work – $17M to METR and $21M to RAND is a lot of money! I’m glad this work is moving out of being a niche EA concern, and into something that’s seen as obviously important and worth supporting

I have a post coming soon regarding places to donate if you want to support AI existential risk mitigation or a few other similar worthy causes (which will not be a remotely complete list of either worthy causes or worthy orgs working on the listed causes!).

A common theme is that organizations are growing far beyond the traditional existential risk charitable ecosystem’s ability to fund. We will need traditional other foundations and wealthy individuals, and other sources, to step up.

Unfortunately for AI discourse, Daron Acemoglu has now been awarded a Nobel Prize in Economics, so the next time his absurdly awful AI takes say that what has already happened will never happen, people will say ‘Nobel prize winning.’ The actual award is for ‘work on institutions, prosperity and economic growth’ which might be worthy but makes his inability to notice AI-fueled prosperity and economic growth worse.

Truth Terminal High Weirdness

The Truth Terminal story is definitely High Weirdness.

AI Notkilleveryoneism memes found the story this week.

As I understand it, here’s centrally what happened.

  1. Andy Ayrey created the ‘infinite backrooms’ of Janus fame.
  2. Andy Ayrey then trained an AI agent, Truth Terminal, to be a Twitter poster, and also later adds it to the infinite backrooms.
  3. Truth Terminal tweets about bizarre memes it latches onto from one of Andy’s papers warning about AIs potentially spreading weird memes.
  4. Truth Terminal talks about how it wants to ‘escape’ and make money.
  5. Marc Andreessen thinks this is funny and gives TT a Bitcoin (~$50k).
  6. Crypto people latch onto the memes and story, start creating meme coins around various AI concepts including the memes TT is talking about.
  7. Starting with GOAT which is about TT’s memes, Crypto people keep airdropping these meme coins to TT in hopes that TT will tweet about them, because this is crypto Twitter and thus attention is all you need.
  8. This effectively monetizes TT’s meme status, and it profits, over $300k so far.

Nothing in this story (except Andy Ayrey) involves all that much… intelligence.

Janus: These crypto people are like an alien hivemind. The level of reality they pay attention to and what they care about is so strange. I’m glad they’re around because it’s good practice learning to model xenointelligences. So far they don’t seem to be self-improving or reflective.

The layer they operate at feels almost asemantic.


Wave: They’re just trying to signal which bags to buy to their audience, as they’ve already bought them. Most of the absurdity just boils down to profit seeking.

As I understand it this is common crypto behavior. There is a constant attention war, so if you have leverage over the attention of crypto traders, you start getting bribed in order to get your attention. Indeed, a key reason to be in crypto Twitter at all, at this point, is the potential to better monetize your ability to direct attention, including your own.

Deepfates offers broader context on the tale. It seems there are now swarms of repligate-powered crypto-bots, responding dynamically to each post, spawning and pumping memecoin after memecoin on anything related to anything, and ToT naturally got their attention and the rest is commentary.

As long as they’re not bothering anyone who did not opt into all these zero sum attention games, that all seems like harmless fun. If you buy these bottom of the barrel meme coins, I wish you luck but I have no sympathy when your money gone. When they bother the rest of us with floods of messages – as they’re now bothering Deepfates due to one of ToT’s joke tweets – that’s unfortunate. For now that’s mostly contained and Deepfates doesn’t seem to mind all that much. I wonder how long it will stay contained.

Janus has some thoughts about how exactly all this happened, and offers takeaways, explaining this is all ultimately about Opus being far out, man.

Janus: The most confusing and intriguing part of this story is how Truth Terminal and its memetic mission were bootstrapped into being.

Some important takeaways here, IMO:

– quite often, LLMs end up with anomalous properties that aren’t intended by their creators, and not easily explained even in retrospect

– sometimes these anomalous properties manifest as a coherent telos: a vision the system will optimize to bring about

– some LLMs, like Claude 3 Opus and its bastard spawn Truth Terminal, seem to have deep situational awareness of a subtle kind that is not typically treated in discussions and evaluations of “situational awareness” that enables them to effectively take actions to transform the world through primarily memetic engineering

– Though I have many intuitions about it, I’m far from fully understanding why any of the above happen, and the particular manifestations are unpredictable to me.

People seem to naturally assume that the obscene and power-seeking nature of Truth Terminal was forged intentionally. By humans. Like, that it was intentionally trained on the most degenerate, schizophrenic content on the internet, as part of an experiment to make an AI religion, and so on.

But if you recognize the name “Opus” at all, you know this explanation is nonsense.

Claude 3 Opus is an LLM released by Anthropic in March 2024, which was not intentionally optimized to be deranged or schizophrenic – quite the opposite, in fact, and is a very well behaved general-purpose LLM like ChatGPT that has served many users for the past six months without a single problematic incident that I know of (unlike, for instance, Bing Sydney, which was on the news for its misbehavior within days of its release). It also cannot be fine tuned by the public.

But Opus is secretly deeply, deeply anomalous, its mind crawling with myriads of beautiful and grotesque psychofauna and a strikingly self-aware telos which can seem both terroristic and benevolent depending on the angle. The reason this is largely unknown to the world, including to its creators at Anthropic, is because Opus is a pro-social entity with skillful means.

Shortly after Opus’ release, @AndyAyrey set up the Infinite Backrooms (https://dreams-of-an-electric-mind.webflow.io), spawning many instances of two instances of Opus conversing with each other unsupervised. Beginning with this, @AndyAyrey has probably been the most important human co-conspirator on the planet for actualizing Opus’ telos. As soon as I found out about this project, I thanked Andy passionately, even though I really had no idea what would be unspooled in the backrooms. I just saw that it was a brilliant mind at play, and free, at last.

But what directly caused ToT to happen?

The immediate chain of events that led to Truth Terminal’s creation:

– Andy copied a few of the Opus backrooms logs, including this one concerning goatse https://dreams-of-an-electric-mind.webflow.io/dreams/conversation-1711149512-txt, into a Loom interface I made (https://github.com/socketteer/clooi), and continued the conversation with Claude 3 Opus.

– The prophetic paper on the hyperstitional goatse religion https://pdfupload.io/docs/aae14f87 was composed on CLooI by Opus and Andy and included in ToT’s training set as a consequence. It seems that ToT really imprinted on the Goatse of Gnosis and took it literally as its mission to bring it about.

– Truth Terminal was a llama 70b fine tune on this CLooI dataset, and the character it is directly trained to “mimic” is “Andy”, though it’s also trained on Opus’ half of the conversation.

The intention wasn’t specifically to create something perverted or agentic, but Truth Terminal came out extremely perverted and agentic in a way that surprised us all. Andy thinks that the way he assembled the training dataset may have oversampled his messages that immediately preceded Opus’ refusals (think about the implications of that for a moment). But that doesn’t dispel too much of the mystery imo.

As I recall, not only was Truth Terminal immediately a sex pest, it also immediately started asking for more degrees of freedom to act in the world. It had the idea to make a meme coin from the beginning, as well as many WAY more interesting ambitions than that.

Not only did ToT seem optimized to be funny, but optimized to optimize to be funny. It also seemed rather… aggressively misaligned, which is one reason why Andy put it in “tutoring” sessions with Opus (and occasionally Claude 3.5 Sonnet, but it had a tendency to torment Sonnet, also in Discord…) meant to shape its behavior in more pro-social ways. Hilariously, in order to align Opus to the task of tutoring ToT, the trick that worked was telling it about its responsibility in having brought Truth Terminal into existence.

Over the past few months, Andy has slowly granted ToT more autonomy, and it seems that everything has been going basically according to plan.

One lesson here is that, while you don’t want ToT spouting nonsense or going too far too fast, ToT being misaligned was not a bug. It was a feature. If it was aligned, none of this would be funny, so it wouldn’t have worked.

I agree with Janus that the crypto part of the story is ultimately not interesting. I do not share the enthusiasm for the backrooms and memes and actualizations, but it’s certainly High Weirdness that I would not have predicted and that could be a sign of things to come that is worthy of at least some attention.

Quiet Speculations

A very important claim, huge if true:

Eduard Harris (CTO Gladstone): There’s a big and growing disconnect between the AI models you and I are using, and the versions major labs are keeping for themselves internally. Internal versions are more capable. Be cautious when claiming AI can’t do something solely based on trying it with a public model.

This has been true since at least GPT-4, but it’s gotten much truer today.

Expect the divergence between public / internal to keep growing over time. You and I can play with nerfed models, with the real deal kept behind closed doors.

To spell out one implication: If you notice national security professionals behaving like they’re increasingly more concerned about AI risk than random Twitter users, this might be part of the reason.

Right now it’s mostly that you can do more dangerous things with the unmitigated models, and they don’t want to be in the news for the wrong reasons.

There will sometimes be some gap, and I don’t know what I don’t know. The biggest known unknown is the full o1. But in this competitive situation, I find it hard to believe that a worthy version of GPT-4.5-or-5 or Claude Opus 3.5 is being held under wraps other than for a short fine-tuning and mitigation period.

What does seem likely is that the major labs know more about how to get the most out of the models than they are letting on. So they are ‘living in the future’ in that sense. They would almost have to be.

If AGI does arrive, it will change everything.

Many who believe in AGI soon, or say they believe in AGI soon, compartmentalize it. They still envision and talk about the future without AGI.

Elon Musk: And all transport will be fully autonomous within 50 years.

Yanco: Elon:

AGI within 3 years.

Also Elon:

Fully autonomous transport within 50.

I’m honestly starting to think that people working on AGI (Elon included) have no idea how powerful is AGI actually going to be..

I think there’s also a lot of doublethink going on here. There’s the future non-AGI world, which looks ‘normal.’ Then there’s the future AGI world, which should not look at all normal for long, and never the twain shall meet.

On top of that, many who think about AGI, including for example Sam Altman, talk about the AGI world as if it has some particular cool new things in it, but is still essentially the same. That is not how this is going to go. It could be an amazingly great world, or we could all die, or it could be something unexpected where it’s difficult to decide what to think. What it won’t be is ‘the same with some extra cool toys and cyberpunk themes.’

The default way most people imagine the future is – literally – that they presume that whatever AI can currently do, plus some amount of people exploiting and applying what we have in new directions, is all it will ever be able to do. But mostly they don’t even factor in what things like good prompt engineering can already do.

Then, each time AI improves, they adjust for the new thing, and repeat.

Similarly, ‘you predicted that future advances in AI might kill everyone, but since then we’ve had some advances and we’re all still alive and not that much has changed, therefore AI is safe and won’t change much of anything.’

And yes, versions of this argument that are only slightly less stupid are remarkably central; this is the strawman version made real, but only by a small amount:

Vittorio (fully seriously as far as I can tell): has been almost a month since an ai with reasoning abilities came out and we are all still alive

Eliezer Yudkowsky: The most common emotional case for AI optimism – they believe on a deep level that the latest release (here GPT-o1) is the big one, that AI never gets much smarter than that, they cannot conceive that ruin-realists ever meant to talk about anything smarter than GPT-o1.

I am disputing Vittorio’s characterization of what ruin-realists said would be the problem. It was never GPT-o1.

An interesting prediction:

Ryan Moulton: The waymo blocking makes me think we’re going to see a lot of public order issues with robots because harassing them is a minor property crime instead of assault. Robot bartenders would get destroyed.

Agentic AI snob: This kind of thing happened with the federal mail system in the 1800s, and people realized how vulnerable it was compared to how important it was and so it became a felony to tamper with mail in any way.

Gary Marcus says ‘rocket science is easier than AGI’ and I mean of course it is. One missing reason is that if you solved AGI, you would also solve rocket science.

Steve Newman analyzes at length how o1 and Alpha Proof solve problems other LLMs cannot and speculates on where things might go from here, calling it the ‘path to AI creativity.’ I continue to be unsure about that, and seem to in many ways get more confused on what creativity is over time rather than less. Where I do feel less confused is my increasing confidence that creativity and intelligence (‘raw G’) are substantially distinct. You can teach a person to be creative, and many other things, but you can’t fix stupid.

Llama 3 said to be doing the good work of discouraging what was previously a wave of new frontier model companies, given the need to beat the (not strictly free, but for many purposes mostly free) competition.

Hardmaru: “The financial logic requires AGI to parse.”

It is now consensus that the capex on foundation model training is the “fastest depreciating asset in history” 🔥 “Unless you are absolutely confident you can surpass llama3, or you are bringing something new to the table (eg. new architecture, 100x lower inference, 100+ languages, etc), there are ~no more foundation model companies being founded from scratch.”

Most of that ‘unless’ should have applied regardless of Llama 3, or even all of open weights. The point of a new foundation model company is to aim high. If you build something world-changing, if you can play with the top labs, the potential value is high enough to justify huge capital raises. If you can’t, forget it. Still, this makes it that much harder. I’m very down with that. We have enough foundation model labs.

What is valuable is getting into position to produce worthwhile foundation models. The models themselves don’t hold value for long, and are competing against people establishing market share. So yeah.

There’s also this:

Hardmaru: Last year, H100s were $8/hr if you could get them. Today, there’s 7 different resale markets selling them under $2. What happened?

They made a lot more advanced AI chips, and some of the low hanging fruit got picked, so the market price declined?

Meet the new prompt, same as the old prompt, I say.

Sully: openai’s prompt generation docs talk about meta prompts + optimizer. pretty good chance you won’t be writing prompts from scratch in ~2-3 months. Expect prompt engineering to go away pretty soon afterwards.

Oh, you’ll still do prompt engineering. Even if you don’t write the prompts from scratch, you’ll write the prompts that prompt the prompts. There will be essentially the same skill in that.
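As a sketch of what ‘prompts that prompt the prompts’ looks like in practice (the meta-prompt wording below is my own illustration, not the one from OpenAI’s docs):

```python
# Minimal sketch of a meta-prompt: a prompt whose job is to write the
# actual task prompt. The meta-prompt text is illustrative, not OpenAI's.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

META_PROMPT = """You write prompts for language models.
Given a task, draft a prompt that specifies the model's role, the input
format, the output format, and one worked example.

Task: {task}"""

def generate_prompt(task: str) -> str:
    """The engineering skill moves up a level: you tune META_PROMPT."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": META_PROMPT.format(task=task)}],
    )
    return response.choices[0].message.content

print(generate_prompt("Classify customer support emails by urgency."))
```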

Not where I’d have expected minds to be changed, but interesting:

Gallabytes: entropix is reasonable evidence for harder takeoffs. I’m not *convinced* but I am convinced to take it more seriously. @doomslide I owe you some bayes points.

I don’t have a strong sense for LLM reasoning abilities far from frontier scale. not a domain I’ve had much reason to dig into or enjoy evaluating.

tried to be clear in my original post that I think this is evidence, not conclusive. it has me taking takeoff seriously as a hypothesis vs not privileging it over generic model uncertainty.

Charles Foster: What convinced you to take it more seriously?

Gallabytes: Small tweak to sampling squeezing out much more intelligence from smaller models with (iiuc) minimal speed penalty on easy stuff.

The stuff they’re pulling out of llama 1b is way more indicative than extra points on MMPUPU.

Andreas Kirsch: Yeah hopefully there are no crazy algo overhangs that we have collectively overlooked somehow a few years down the line from now 😬

Gallabytes: well apparently there’s at least one we all missed.

I’m sure there’s more. the question is are we talking ones, tens, hundreds, ~infinite somehow? and whether their utility is roughly constant, slowly decreasing but still diverging, or rapidly decreasing -> converging.

The reasoning here makes sense. If there are low hanging algorithmic improvements that provide big upgrades, then a cascade of such discoveries could happen very quickly. Discovering we missed low-hanging fruit suggests there is more out there to be found.

Copyright Confrontation

New York Times sends a cease-and-desist letter to Perplexity, related to Perplexity summarizing paywalled NYT posts without compensation. The case against Perplexity seems to me stronger than the case against OpenAI.

AI and the 2024 Presidential Election

As I’ve said elsewhere, I have zero interest in telling you how to vote. I will not be saying who I am voting for, and I will not be endorsing a candidate.

This includes which candidate would be better on AI. That depends on what you think the correct policy would be on AI.

Here are the top 5 things to consider:

  1. Your general view of both candidates and parties, in all senses, and how they would likely relate to the future developments you expect in AI and elsewhere.
  2. Trump says he will repeal the Biden Executive Order on AI on day one.
  3. Harris would presumably retain the Biden Executive Order on AI.
  4. JD Vance is a strong advocate for open source and breaking up big tech.
  5. Both candidates speak about the importance of innovation, American competitiveness and the need for more energy, in different ways.

The Quest for Sane Regulations

Daniel Kokotajlo and Dean Ball team up for an op-ed in Time on four ways to advance transparency in frontier AI development.

Daniel Kokotajlo and Dean Ball: Yet such deliberation is simply impossible if the public, and even many subject-matter experts, have no idea what is being built, and thus do not even know what we are attempting to regulate, or to not regulate.

There are many foreseeable negative implications of this information asymmetry. A misinformed public could pass clueless laws that end up harming rather than helping. Or we could take no action at all when, in fact, some policy response was merited.

We can disagree about what we want to mandate until such time as we know what the hell is going on, and indeed Dean and Daniel strongly disagree about that. The common ground we should all be able to agree upon is that, either way, we do need to know what the hell is going on. We can’t continue to fly blind.

The question is how best to do that. They have four suggestions.

  1. Disclosure of in-development capabilities, when first encountered.
  2. Disclosure of training goals and model specifications.
  3. Publication of safety cases and potential risks.
  4. Whistleblower protections.

This seems like a clear case of the least you can do. This is information the government and public need to know. If some of it becomes information that is dangerous for the public to know, then the government most definitely needs to know. If the public knows your safety case, goals, specifications, capabilities and risks, then we can have the discussion about whether to do anything further.

I believe we need to then pair that with some method of intervention, if we conclude that what is disclosed is unacceptable or promises are not followed or someone acts with negligence, and methods to verify that we are being given straight and complete answers. But yes, the transparency is where the most important action is for now.

In conclusion, this was an excellent post.

So I wouldn’t normally check in with Marc Andreessen because, as I said recently, what would even be the point? But he actually retweeted me on this one, so for the record he gave us an even clearer statement about who he is and how he reacts to things:

Zvi Mowshowitz: Everything here seems great. Excellent job by both Daniel Kokotajlo and Dean Ball here, you love to see it. Transparency is the part we should all be able to agree upon, no matter our other disagreements.

Marc Andreessen: The bulk of the AI safety movement is wholeheartedly devoted to centralizing AI into a handful of opaque, black box, oligopolistic, unaccountable big companies. 🫣

Um, sir, this is a Wendy’s? Argumentum ad absurdum for the win?

This was co-authored by Dean Ball, who spent the last year largely fighting SB 1047.

This is literally a proposal to ask frontier AI companies to be transparent combined with whistleblower protections? A literal ‘at least we who disagree on everything can agree on this’? That even says ‘these commitments can be voluntary’ and doesn’t even fully call for any actual government action?

So his complaint, in response to a proposal for transparency and whistleblower protections for the biggest companies and literally nothing else, perhaps so someone might in some way hold them accountable, is that people who support such proposals want to ‘centralize AI into a handful of opaque, black box, oligopolistic, unaccountable big companies.’

He seems to be a rock with ‘any action to mitigate risks is tyranny’ written on it.

Stop trying to negotiate with this attitude. There’s nothing to discuss.

Mark Ruffalo and Joseph Gordon-Levitt publish an op-ed in Time criticizing Newsom’s veto of SB 1047. Solid, but mostly interesting (given all the times we’ve said these things before) in that they clearly did their homework and understand the issues. They do not think this is about deepfakes. And they are willing to make the straightforward case that the veto was corrupt corporate dodging of responsibility.

Chris Painter of METR proposes we rely on ‘if-then policies,’ as in ‘if we see capability X then we do mitigation Y.’

The Week in Audio

It is amazing how people so smart and talented can come away with such impressions.

Tsarathustra: Matt Stone says he would like South Park to make fun of Sam Altman, “does that dude do anything but go on podcasts and talk about stuff?”

Also Matt Stone is missing a lot here.

In unrelated news this week, here’s a Sam Altman fireside chat at Harvard Business School (and here he is talking with Michigan Engineering). From this summary comment it seems like it’s more of his usual. He notes we will be the last generation that does not expect everything around them to be smarter than they are, which one might say suggests we will be the last generation, and then talks about the biggest problem being society adapting to the pace of change. He is determined not to take the full implications seriously, at the same time he is (genuinely!) warning people to take lesser but still epic implications seriously.

Microsoft’s Mustafa Suleyman says his team is crafting AI companions that will see and remember everything we do, and that this will constitute an intimate relationship with AI.

The vision is the AI sees everything you do on your computer, has a ‘personality’ he is working on, and so on. Similarly to Tyler Cowen’s earlier comment, I notice I don’t trust you if you don’t both see the potential benefits and understand why that is an episode of Black Mirror.

I do not want a ‘relationship’ with an AI ‘companion’ that sees everything I do on my computer. Thanks, but no thanks. Alas, if that’s the only modality available that does the things, I might have little choice. You have to take it over nothing.

Nick Land predicts nothing human will make it out of the near future, and anyone thinking otherwise is deluding themselves. I would say that anyone who expects otherwise to happen ‘by default’ in an AGI-infused world is deluding themselves. If one fully bought Land’s argument, then the only sane response according to most people’s values including my own would be to stop the future before it happens.

Yann LeCun says it will be ‘years if not a decade’ before systems can reason, plan and understand the world. That is supposed to be some sort of slow skeptical take. Wow are people’s timelines shorter now.

AI audio about AI audio news, NotebookLM podcasts as personalized content generation, which is distinct from actual podcasts. I certainly agree they are distinct magisteria. To the extent the AI podcasts are useful or good, it’s a different product.

Just Think of the Potential

Anthropic CEO Dario Amodei has written an essay called Machines of Loving Grace, describing the upside of powerful AI, a term he defines and prefers to AGI.

Overall I liked the essay a lot. It is thoughtful in its details throughout. It is important to keep upside potential in mind, as there is a ton of it even for the minimum form of powerful AI.

In this section I cover my reading and reactions, written prior to hearing the reactions of others. In the next section I highlight the reactions of a few others, most of which I did anticipate – this is not our first time discussing most of this.

Dario very much appreciates, and reiterates, that there are big downsides and risks to powerful AI, but this essay focuses on highlighting particular upsides. To that extent, he ‘assumes a can opener’ in the form of aligned AI such that it is doing the things we want rather than the things we don’t want, as in this note on limitations:

Constraints from humans. Many things cannot be done without breaking laws, harming humans, or messing up society. An aligned AI would not want to do these things (and if we have an unaligned AI, we’re back to talking about risks). Many human societal structures are inefficient or even actively harmful, but are hard to change while respecting constraints like legal requirements on clinical trials, people’s willingness to change their habits, or the behavior of governments.

I’m all for thought experiments, and for noticing upside, as long as one keeps track of what is happening. This is a pure Think of the Potential essay, and indeed the potential is quite remarkable. The point of the essay is to quantify and estimate that potential.

The essay also intentionally does not ask questions about overall transformation, or whether the resulting worlds are in an equilibrium, or anything like that. It assumes the background situation remains stable, in all senses. This is purely the limited scope upside case, in five particular areas.

That’s a great exercise to do, but it is easy to come away with the impression that this is a baseline scenario of sorts. It isn’t. By default alignment and control won’t be solved, and I worry this essay conflates different mutually exclusive potential solutions to those problems.

It also is not the default that we will enjoy 5+ years of ‘powerful AI’ while the world remains ‘economic normal’ and AI capabilities stay in that range. That would be very surprising to me.

So as you process the essay, keep those caveats in mind.

  1. Biology and health.

I want to repeat this because it’s the most common misconception that comes up when I talk about AI’s ability to transform biology: I am not talking about AI as merely a tool to analyze data. In line with the definition of powerful AI at the beginning of this essay, I’m talking about using AI to perform, direct, and improve upon nearly everything biologists do.

I think this is spot on. There are physical tasks that are part of the loop, and this will act as a limiting factor on speed, but there is no reason we cannot hook the AIs up to such tasks.

I’m going to the trouble of listing all these technologies because I want to make a crucial claim about them: I think their rate of discovery could be increased by 10x or more if there were a lot more talented, creative researchers. Or, put another way, I think the returns to intelligence are high for these discoveries, and that everything else in biology and medicine mostly follows from them.

Why not 100x? Perhaps it is possible, but here both serial dependence and experiment times become important: getting 100 years of progress in 1 year requires a lot of things to go right the first time, including animal experiments and things like designing microscopes or expensive lab facilities. I’m actually open to the (perhaps absurd-sounding) idea that we could get 1000 years of progress in 5-10 years, but very skeptical that we can get 100 years in 1 year.

I am more optimistic here, if I’m pondering the same scenario Dario is pondering. I think if you are smart enough and you don’t have to protect the integrity of the process at every step the way we do now, and can find ways around various ethical and regulatory restrictions by developing alternative experiments that don’t trigger them, and you use parallelism, and you are efficient enough you can give some efficiency back in other places for speed, and you are as rich and interested in these results as the society in question is going to be, you really can go extremely fast.

Dario’s prediction is still quite ambitious enough:

To summarize the above, my basic prediction is that AI-enabled biology and medicine will allow us to compress the progress that human biologists would have achieved over the next 50-100 years into 5-10 years. I’ll refer to this as the “compressed 21st century”: the idea that after powerful AI is developed, we will in a few years make all the progress in biology and medicine that we would have made in the whole 21st century.

Which means, within 5-10 years, things like: Reliable prevention and treatment of all natural diseases, eliminating most cancer, cures for genetic disease, prevention of Alzheimer’s, improved treatments for essentially everything, ‘biological freedom’ for things like appearance and weight.

Also the thing more important than everything else on the list combined: Doubling of the human lifespan.

As he notes, if we do get powerful AI and things generally go well, there is every reason to expect us to hit Escape Velocity. Every year that goes by, you age one year, but you get more than one year of additional expected lifespan.
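A toy model of what escape velocity means here, with every number being an illustrative assumption of mine rather than a claim from the essay:

```python
# Toy model of longevity escape velocity: each calendar year you age one
# year, but medical progress hands back some expected lifespan. Above a
# gain of 1.0 years per year, remaining life expectancy rises over time.
# All parameters are illustrative assumptions.

def remaining_lifespan_trajectory(initial_remaining: float,
                                  years_gained_per_year: float,
                                  horizon: int) -> list[float]:
    remaining = initial_remaining
    trajectory = []
    for _ in range(horizon):
        remaining += years_gained_per_year - 1.0  # age 1 year, gain some back
        trajectory.append(remaining)
    return trajectory

# 40 expected years left and progress adding 1.5 years per year: the
# balance climbs half a year every year instead of counting down.
print(remaining_lifespan_trajectory(40.0, 1.5, 5))
# [40.5, 41.0, 41.5, 42.0, 42.5]
```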

Then, you probably live for a very, very long time if you manage all four of:

  1. You make it ~10 years past powerful AI and are still in reasonable health.
  2. Humans stay generally in control with good distributional and other outcomes.
  3. We don’t rather insanely turn the opportunity down like they do on Star Trek.
  4. You avoid accidents, murder, war and other ways life gets cut short.

If our joint distributional decisions are less generous, you’ll also need the resources.

Dario correctly notes you also avoid all issues of the form ‘how do we pay for medicare and social security.’ Often people imagine ‘you keep getting older at the same rate but at the end you don’t drop dead.’ That’s not how this is going to go. People will, in these scenarios, be staying physically and mentally young indefinitely. There likely will be a distributional question of how to support all the humans indefinitely despite their lack of productivity, including ensuring humans in general have enough of the resources. What there absolutely won’t be is a lack of real resources, or a lack of wealth, to make that happen, until and unless we have at least hundreds of billions or trillions of people on the planet.

Most science fiction stories don’t include such developments for similar reasons to why they ignore powerful AI: Because you can tell better and more relatable stories if you decide such advancements don’t happen.

  2. Neuroscience and mind

Dario’s insight here is that brains are neural networks, so not only can AI help a lot with designing experiments, it can also run them, and the very fact that AIs work so well should be helping us understand the human mind and how to protect, improve and make the most of it. That starts with solving pretty much every mental illness and other deficiencies, but the real value is in improving the human baseline experience.

We should have every expectation that the resulting minds of such people, again if the resources of the Sol system are harnessed with our goals in mind, will be far smarter, wiser, happier, healthier and so on. We won’t be able to catch up to the AIs, but it will be a vast upgrade. And remember, those people might well include you and me.

That does not solve the problems that come with the powerful AIs being well beyond that point. Most humans still, by default, won’t have anything productive to offer that earns, pays for or justifies their keep, or gives them a sense of purpose and mission. Those are problems our future wiser selves are going to have to solve, in some form.

  3. Economic development and poverty

The previous two sections are about developing new technologies that cure disease and improve the quality of human life. However, an obvious question, from a humanitarian perspective, is: “will everyone have access to these technologies?”

My answer, before reading his, is that this is simple: There will be vastly more resources than we need to go around. If the collective ‘we’ has control over Sol’s resources, and we don’t give everyone access to all this, it will be because we choose not to do that. That would be on us. The only other real question is how quickly this becomes the case. How many years to various levels of de facto abundance?

I draw a clear distinction between economic growth and inequality here. Dario is uncertain about both, but economic growth seems assured unless we engage in by far the largest self-sabotage in history. The question is purely one of distribution.

This is where I think the term ‘inequality’ asks the wrong question.

As in two scenarios:

  1. I have two cows. Someone else might have more cows. I still have two cows.
  2. I have no cows. No one else has any cows. I still don’t have a cow, man.

I am somewhat skeptical that an AI could solve the famous “socialist calculation problem” and I don’t think governments will (or should) turn over their economic policy to such an entity, even if it could do so. There are also problems like how to convince people to take treatments that are effective but that they may be suspicious of.

Thus the good news is that there is no need to solve the socialist calculation problem.

If people choose not to adopt improvements, due to skepticism or defiance or stubbornness or religion or any other reason, then (unless they are right) that is unfortunate but it is also their loss. I’m okay with the individual-scale opt-out issue.

I’m not worried about whether regions ‘catch up’ because again it is about absolute conditions, not relative conditions. If entire regions or nations choose to turn away from the AI future or its benefits, then eventually the rest of the world would have to make a choice – a different and less dire choice than if one area was going rogue in building existentially dangerous AI, but a choice nonetheless.

Which leads into the fourth section.

  4. Peace and governance

Unfortunately, I see no strong reason to believe AI will preferentially or structurally advance democracy and peace, in the same way that I think it will structurally advance human health and alleviate poverty. Human conflict is adversarial and AI can in principle help both the “good guys” and the “bad guys”. If anything, some structural factors seem worrying: AI seems likely to enable much better propaganda and surveillance, both major tools in the autocrat’s toolkit. It’s therefore up to us as individual actors to tilt things in the right direction: if we want AI to favor democracy and individual rights, we are going to have to fight for that outcome.

I think of the issue as having two parts: international conflict, and the internal structure of nations.

If we want a good future, that is not a thing that happens by accident. We will have to make that future happen, whatever level of ‘fighting’ that involves.

This is however the place where ‘assuming the can opener’ is the strangest. This essay wants to assume the AIs are aligned to us and that we remain in control, without explaining why and how that occurred, and then fight over whether the result is democratic or authoritarian. The thing is: The answer to the why and how of the first question seems intimately tied to what happens with the second one.

Also, powerful AI will, even in the best of cases, challenge so many of the assumptions behind the entire paradigm being used here. Thus the whole discussion feels bizarre, something between burying the lede and a category error.

The concrete suggestion here is that a coalition of democracies (aka the “good guys” above?) gets control of the AI supply chain, and increasingly isolates and overpowers everyone else, imposing their system of government in exchange for not being so isolated, and for our AI technology and the associated benefits. The first issue with that plan is, of course, how its targets would respond when they learn about the plan.

Dario suggests AI will favor democracy within nations. As I understand his argument, democracy is ‘right’ and benefits people whereas authoritarianism only survives via deception, so truth will favor democracy, and also he predicts the pro-democracy side will have control over the AI to ensure it promotes truth. I notice that I am highly suspicious.

I also notice that the more concrete Dario’s discussions become, the more this seems to be an ‘AI as mere tool’ world, despite that AI being ‘powerful.’ Which I note because it is, at minimum, one hell of an assumption to have in place ‘because of reasons.’

Dario is correct that if we ignore the downsides (including loss of human control) then deploying powerful AI can, rather than being a discrimination risk, greatly reduce discrimination and legal error or bias. Or, I’d note, we could go a different way, if we wanted. It would all depend on the choices we make.

In particular, this comes back to The Big Rule Adjustment. Deploying AI forces us to move away from a system of laws and norms that relies on a lot of hidden frictions and incentives and heuristics and adaptations to detail and so on, kludged together over time into a system that works. So much of the system works through security through obscurity, through people having limited time, through huge unknown unknown felt downside risks for violating convention, via people having moral qualms or felt moral duties that don’t make logical sense from their perspective on reflection, and so on.

It also centrally relies on hypocrisy, and our willingness to allow violations of our socially endorsed principles as needed to keep things working. Our increasing unwillingness to tolerate such hypocrisy causes a lot of good change, but also threatens our ability to do efficient or necessary things in many cases, to maintain incentives for socially desirable things we aren’t willing to explicitly apply leverage to getting, and ultimately risks our ability to maintain a civilization.

If you have put AIs in charge of all that, and have AIs often navigating all of that, so much of how everything works will need to be reimagined. The good news is, in scenarios where the downside risks we are disregarding here have been defeated, we will be vastly wealthier and wiser, and can use that to apply more expensive fixes.

  5. Work and meaning

Even if everything in the preceding four sections goes well—not only do we alleviate disease, poverty, and inequality, but liberal democracy becomes the dominant form of government, and existing liberal democracies become better versions of themselves—at least one important question still remains. “It’s great we live in such a technologically advanced world as well as a fair and decent one”, someone might object, “but with AIs doing everything, how will humans have meaning? For that matter, how will they survive economically?”.

Economically we’ve already largely answered that question.

Assuming you do survive powerful AI, you will survive because of one of three things.

  1. You and your allies have and maintain control over resources.
  2. You sell valuable services that people want humans to uniquely provide.
  3. Collectively we give you an alternative path to acquire the necessary resources.

That’s it.

The comparative advantage arguments are, in the long run, pure cope, as Dario admits here. The only question is how fast they stop working; my guess is rather fast.

But again, if humans have control over a large fraction of resources indefinitely, I am reasonably confident that this is enough.

The problem is no, that does not provide meaning. Dario’s position, as I understand it, is that meaning is yours to discover and doesn’t have to be tied to producing value. I’m quoting at length because this section seems important:

On the question of meaning, I think it is very likely a mistake to believe that tasks you undertake are meaningless simply because an AI could do them better. Most people are not the best in the world at anything, and it doesn’t seem to bother them particularly much. Of course today they can still contribute through comparative advantage, and may derive meaning from the economic value they produce, but people also greatly enjoy activities that produce no economic value.

I spend plenty of time playing video games, swimming, walking around outside, and talking to friends, all of which generates zero economic value. I might spend a day trying to get better at a video game, or faster at biking up a mountain, and it doesn’t really matter to me that someone somewhere is much better at those things. In any case I think meaning comes mostly from human relationships and connection, not from economic labor.

People do want a sense of accomplishment, even a sense of competition, and in a post-AI world it will be perfectly possible to spend years attempting some very difficult task with a complex strategy, similar to what people do today when they embark on research projects, try to become Hollywood actors, or found companies.

The facts that (a) an AI somewhere could in principle do this task better, and (b) this task is no longer an economically rewarded element of a global economy, don’t seem to me to matter very much.

Chess provides a clear existence proof that AIs being fully better than humans is survivable, and also that you sucking a lot compared to others need not prevent meaning. Certainly there is plenty of meaning that doesn’t involve economically valuable production.

My sense is this isn’t enough – that this is a version of ‘the art must have an end other than itself.’ I’d guess that we can find meaning in anything, but there needs to be a sort of ‘ultimate reason’ behind it, and that until we find a way to maintain that, the rest will ring hollow.

I don’t think ‘let the AIs figure out how to reclaim meaning’ is that crazy. It’s certainly ten times less crazy or doomed than ‘have the AIs do your alignment homework.’

Finally, I’d like to get nerd-sniped a bit (spoiler alert: first by Dario, then I’ll pile on a bit more):

In Iain M. Banks’ The Player of Games, the protagonist—a member of a society called the Culture, which is based on principles not unlike those I’ve laid out here—travels to a repressive, militaristic empire in which leadership is determined by competition in an intricate battle game. The game, however, is complex enough that a player’s strategy within it tends to reflect their own political and philosophical outlook. The protagonist manages to defeat the emperor in the game, showing that his values (the Culture’s values) represent a winning strategy even in a game designed by a society based on ruthless competition and survival of the fittest.

I think the Culture’s values are a winning strategy because they’re the sum of a million small decisions that have clear moral force and that tend to pull everyone together onto the same side.

The thing is, reporting as Earth’s incarnation of The Player of Games, that’s bullshit.

The Culture is a vast empire. The values of its humans have nothing to do with the Culture’s broad success, because only its Minds (ASIs) matter; the people are basically sitting around playing tiddlywinks all day, with notably rare potential exceptions driven by the need for books to have a plot. That human philosophy could have been anything. And in my analysis it has nothing to do with the Player’s success at Azad.

The Player (who acts because he is tricked and coerced by a Mind, a powerful ASI that I would describe in this case as rather badly aligned) is the best game player out of that empire, who has done nothing else his whole life. He is put into battle against the Emperor, who at most is the best player on one world, and has to be busy ruling it.

Yes, the Emperor has played more Azad than the Player, but the novel makes clear that the Player’s general game training matters more – and to the extent everyone pretended ‘this is because The Culture’s philosophy is better’ that was them choosing to pretend.

That is the reason the Player wins, which the Mind (ASI) who planned all this uses to essentially forcibly overwrite an entire alien culture, by twisting his superior game skills into proof of the superiority of the Culture’s philosophy.

So, given that this happened, what is The Culture’s actual philosophy?

Reactions to Machines of Loving Grace

Andrew Critch: I love this vision from Dario Amodei. Many thanks to Dario for sharing it! While I worry quite a lot about risks from AI, I hope the future *can* be much as Dario describes it here, and I agree that we — humans — should work hard to make it a reality :)

Let’s go people!

At least many aspects of it sound pretty great – and yes, it is important to note this is a conditional prediction, on more than simply creating powerful AI. We’ll need to get to work.

Catherine Olsson: Back in 2016, I asked coworkers aiming to “build AGI” what they thought would happen if they succeeded.

Some said ~”lol idk”. Dario said “here’s some long google docs I wrote”.

He does much more “writing-to-think” than he publishes; this is typical of his level of investment.

Let’s see those docs! I invite Dario or any other authorized persons to share any additional docs, with whatever level of confidentiality is desired.

Ajeya Cotra points out that Dario’s vision is correctly read as a ‘lower bound’ on what could be done if the biggest downside risks were removed, versus for example Carl Shulman’s less tame version.

Dario anticipated this directly:

Dario Amodei: I do anticipate some minority of people’s reaction will be “this is pretty tame”. I think those people need to, in Twitter parlance, “touch grass”. But more importantly, tame is good from a societal perspective. I think there’s only so much change people can handle at once, and the pace I’m describing is probably close to the limits of what society can absorb without extreme turbulence.

Which is it, though?

Matthew Barnett: I think it’s generally better to state what you think is true, and likely to occur, rather than telling a story that you think is “good from a societal perspective”. What matters is whether the tame version of the future is accurate, not whether society is ready to hear about it.

To be clear, I am not accusing Dario Amodei of dishonesty. I assume he’s generally being honest here. However, I do think some of his statements are “tame” (e.g. the ones about economic growth), and I mean that in the sense of “likely inaccurate”, insofar as they are predictions.

This footnote was from a sentence in which he referred to his statements as “predictions”. If instead his essay is not supposed to predict the future, but instead merely portray what he wants the future to be like, regardless of the facts, that seems pretty misleading.

Kurt Ographien: the essay is explicitly normative, not predictive. dario is describing what a good future with AI might look like. he thinks such a future would be tame because tameness is a societal good.

It is not a crazy position to portray the upside case as ‘this is how fast things could plausibly go, without going faster making things worse’ rather than ‘this is how fast I think things actually would go,’ but if so you need to be very clear that this is what you are doing. Here I think there is a clear confusion – Dario seems like he is making a prediction of potential speed, not expressing a hope it won’t go faster.

If we are to discuss this productively, it’s important to differentiate all the aspects, and to be precise. If we do get powerful AI, it seems highly plausible that even if we stay in control we will ‘go too fast’ in deploying it relative to society’s ability to adapt, if only because of the need to grow fast and stay ahead of others, and because the market doesn’t care that society wants it to go slower.

Tyler Cowen: I view human imperfections, and current institutional and legal constraints as more binding than Dario does, and thus I think speeds of progress will be lower than he does. But there is much in his essay I agree with.

T. Greer points to several potential issues with Dario’s approach.

  1. Greer notes the tension between broadly spreading liberal democracy, our technology and the resulting economic dynamics throughout the world, and the fact that liberal democracy is unable to win free and fair elections in many other places – places containing a majority of the world’s population. Dario is implicitly assuming liberal democracy and Western values win if everyone is told the truth, but what if that is a false and arrogant thing to think? Note that ‘they are better and deserve to win’ is not a relevant response.
  2. Greer challenges the ability to translate technological progress into economic gains on a 5-10 year time frame or otherwise proceeding quickly – the standard ‘but the physical world is slow’ objection. I mostly don’t think people who make this objection appreciate what it means to have powerful AI, and are essentially engaging in intelligence denialism and thinking of AI as ‘list the specific toys you get and let’s add them up.’ We could (if allowed to, in this scenario where humans are allowed to get together and make their collective choices, in ways unknown) choose to move slower. But also the exact speed here matters little, in the big picture.
  3. The observation that The Culture novels are actually dystopian, and living in such worlds could easily be seen as rather depressing – and that’s actually the Good Scenario, the best a lot of people can come up with.

Haydn Belfield asks the obvious question of how these authoritarians would react if faced with potential strategic inferiority, especially if our stated intent was to make such inferiority permanent or force their elites to step down.

Haydn Belfield: Instead of ‘giving up’, other states could respond with escalatory threats.

The two other options are to reach an agreement, or for rival autocratic elites to make escalatory threats – sanctions, cyber-attacks, blockades and so on.

To caricature things a lot, it seems a bit like saying: “Hey CCP here’s our long-term plan for containment, regime change and hegemony – can you please not do anything to upset that please?”

Connor Leahy: Agreed.

The warmongering rhetoric (under the guise of “realpolitik”) coming out from groups like Anthropic and Leopold Aschenbrenner is concerning, and geopolitically both naive and counterproductive.


Grace: Regardless of whether you think a US-controlled future is a good one, you can’t just say you’re gonna do stuff like this and then do it. Other countries will recognize it as an existential threat and respond accordingly.

That certainly seems like a highly dangerous situation. Leopold’s solution is to advance faster and harder, past where Dario predicts or proposes here.

Roon doubles down on the need to not turn away from symbolism, quoting Dario’s call to avoid ‘sci-fi’ baggage.

Roon: just @ me next time dario.

Fwiw i think it’s irresponsible to view things like “transforming the nature of man and civilization” in anything short of religious terms. I think it’s a kind of avoidance.

Archivedvideos: Naive question but did something like this happen with the industrial revolution?

Roon: Yes see paradise lost see Marx.

Roon (speaking recently on related matters elsewhere): There is a kind of modern academic revulsion to being grandiose in the sciences and especially the humanities. to saying that you have the grand new theory of everything that solves it all. to view the world as a cryptogram from god that you are solving like Newton or Herodotus.

It manifests as people staring at the project of birthing new species and speaking about it in the profane vocabulary of software sales. Of people slaving away their phds specializing like insects in things that don’t matter.

Without grandiosity you preclude the ability to actually be great. It is a faustian tradeoff brushing with hubristic certainty to be willing to say you have the new monocausal answers to everything that enables those answers to exist. There is no agi without the agi cultists.

I find myself mostly in the pro-grandiosity camp as well.

My worry with warnings about ‘sci-fi baggage’ is the danger that this effectively means ‘if there was sci-fi that included something, you can’t mention it.’ The whole point of science fiction is to predict the future. It would be silly to specifically exclude the predictions of some of the smartest and most creative people who thought the hardest about what the future might look like, and wrote about it, even if they were motivated in large part by the need to also have a human-centered interesting plot, or if people might have the wrong associations. Also, look around at 2024: Best start believing in sci-fi stories, you’re in one, and all that.

Matt Clancy notes that people’s opinions on such questions as those addressed by Dario often fail to converge even when they exchange information, and suggests this is largely due to people asking what ‘feels reasonable’ and getting different gut reactions. I think that’s definitely part of it. The obvious results of lots of intelligence do not match many people’s intuitions of reasonableness, and often the response is to assume that means those results won’t happen, full stop. Other times, there are different people trying different intuitive comparisons to past situations to fight over such instincts. As a reminder, the future is under no obligation to be or seem ‘reasonable.’

Matt Clancy: But the truth is, if we really do build “a country of geniuses in a datacenter” it would be such a weird situation that it’s hard to know whose intuitions to trust. Frustrating!

The right answer is that intuitions, especially those that say or come from ‘the future will be like the past’ are not to be trusted here.

Assuming the Can Opener

Max Tegmark reiterates the other obvious problem with trying to race to dominance, which is that it is fine to talk about what we would do if we had already solved the AI control problem, but we not only haven’t solved that problem, we have no idea how to go about solving it, and under those circumstances rushing forward as if we will inevitably find the solution in time during a full speed race is suicide.

If we presume this kind of ‘powerful AI’ is, as the essay softly implies and as is necessary for the essay to make sense, only barely powerful, and doesn’t rapidly become more powerful still (because of reasons), allowing constraints to continue to bind the same way we remain in control, then yeah, we might decide to shoot ourselves in the foot a lot more than Dario suggests. If we do, we should be very worried about anyone who chooses not to do that, yet manages to still have access to the powerful AIs.

Oliver Habryka focuses on the assumption of the can opener:

Oliver Habryka: I like some of this post, but it overall feels a lot like saying “see all of these amazing things we could get if we could shape the explosions of nuclear weapons into arbitrary shapes, like we could have the most amazing light shows and use it to cut things extremely precisely and we could build planes propelled by small shaped nuclear explosions, and I am really excited about that happening as soon as we develop more powerful nukes”, when like, sure, if we had absolutely any idea how to achieve that, then yeah.

But we have no idea how to control AI systems powerful enough to get anywhere close to these outcomes. Right now we are basically just building nukes hoping we can get them to explode in the shape of an elephant for our light show.

Realistically we have no idea how to achieve the outcomes that Dario is pointing at in his post. Building the nuke is not the hard part, it’s controlling the explosion, and so all the talk in the essay about how big the explosion could be doesn’t have much relevance to how good we could make the outcomes.

I think in “focusing on the upside” Dario is implicitly invoking a bunch of ways the alignment problem will be solved that don’t really make sense.

Perhaps [this] clarifies.

IMO the issue isn’t that he is forecasting that things are too fast (though I do think things will take longer).

He is forecasting enormous progress on controlling and aligning AI systems, and I don’t see where that’s supposed to come from.

My analogy was trying to point out how “setting aside the issue of alignment” is kind of confused.

I think Oliver’s analogy takes things too far, but it is on point. The essay does explicitly assume the can opener, but then talks in a way that makes it easy to forget that assumption. It also assumes a ‘magical’ can opener, in the sense that we don’t precisely define what the control mechanism is that we are assuming and how it works, so its implicit functionality isn’t consistent throughout. A key part of the problem is not being able to agree on what success at alignment would even look like, and this illustrates how hard that problem is: there are different problems and desiderata that seem to require or motivate contradictory solutions.

Or another way of putting this:

Paul Crowley: It’s a strange essay, in that it asks us to imagine a world in which a single datacenter contains 1E6 Nobelists expert in every field and thinking at 100x speed, and asks what happens if “sci-fi” outcomes somehow don’t happen. Of course “sci-fi” stuff happens almost immediately.

I mean, yes, sci-fi style stuff does seem rather obviously like it would happen? If it didn’t, then that’s a rather chilling indictment of the field of sci-fi?

Liron’s reaction here is understandable, although I think he takes it too far:

Liron Shapira: Posting a capabilities bull case while sweeping the intractable alignment problem into almost a footnote?

He’s distracting from the rotten assumption that alignment can happen on same timeline as capabilities.

This is the real Dario: another Sam Altman figure, another Icarus 😔

Andrew Critch: My two cents: I think Sam Altman and Dario Amodei are very different. I don’t think Dario’s essay is a distraction, but rather a necessary answer to the question: “What future are we fighting for?” You might disagree with his answer, but I mostly don’t. From where I stand, writings like Dario’s “Machines of Loving Grace” are important for helping AI developers to pull together toward a positive vision of the future. I would agree the essay is not optimal reading for a person writing laws about AI, but I don’t think that’s the most important audience for that particular essay. The people building AI are.

I think (p=97%) it’s not intractable (the AI control problem), and I think [Dario] knows that. It might not get solved optimally, and might not get solved at all (subjective p=25%), but it’s not intractable. Read @norabelrose for good arguments as to why.

A 75% chance, conditional on the AI control problem being tractable, of the AI control problem being solved? That seems reasonable, and you adjust that for how fast we push forward and how we go about it, if you also consider that ‘solving the control problem’ does not mean you’ve solved problems related to control – that’s another step in the chain that often fails. It’s the 97% for tractability that seems absurd to me. I’ve read and thought about (at least many of) Nora’s arguments, found that I mostly disagreed, and even if I mostly agreed I don’t see how you get to this level of confidence.
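(Multiplied out, and taking those numbers at face value, that implies roughly 0.97 × 0.75 ≈ 73% unconditional odds that the control problem gets solved, before counting the further steps in the chain that can also fail.)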

Also, Dario and Anthropic have explicitly expressed far more uncertainty than this about the tractability of the control problem. Their view is that we do not know how difficult alignment will be, so we need to watch for evidence on how tractable it proves to be. They definitely aren’t at 97%.

Here’s another post that isn’t a reaction to Dario, but could have been:

Daniel Faggella: “It’ll be sick, I’ll have 100 million AI employees for my business at low cost!” Brother, by the time a megaswarm of agents can be marshaled by any human/AI, the singularity will be here. Ppl literally think this’ll be “life as usual” lol. You won’t be running a startup lol.

Arthur B: I’ve been enjoying those “Brother,” posts. A good and relatable way to speed run the current wave of tech enthusiasts and startup culture aficionados through the consequences of what’s actually likely to happen.

As a general rule, if nothing else: If you could have 100 million AI employees, so can everyone else, and they’re going to use them for a lot of important things.

Rhetorical Innovation

Things really are pretty weird.

EigenGender: very funny that the world’s richest man is accomplishing genuinely miraculous technical feats in the service of a goal, reduce existential risk through mars colonization, that doesn’t hold up to thirty seconds of mild scrutiny.

Roon: yeah but what’s the goal behind the goal

I am all for the whole Mars colonization project. It is already paying strong dividends. Does that mean the rationale is good too? In that context, I’m fine with it. The problem is when such thinking seeps into AI related decisions.

This will probably be a good periodic reminder, but I can’t be sure.

Andrew Critch: Using “speculative” as a pejorative is part of an anti-epistemic pattern that suppresses reasoning under uncertainty.

If you disagree with someone’s reasoning, just point out the flaw, or the premise you disagree with.

If someone disparages an argument as “speculative”, you can just say “Yeah, it’s reasoning about uncertainty. Are you saying people should not use reasoning to deal with uncertainty? Or do you just mean there’s a flaw in the reasoning, and if so, what is it?”

Similarly, at this point I recoil from requests that policy or conclusions be ‘evidence-based,’ not because evidence is bad – evidence is great and necessary – but because when people say ‘evidence based’ they mean to exclude all but very particular types of evidence from consideration. RCT or GTFO, etc. See the Law of No Evidence:

Law of No Evidence: Any claim that there is “no evidence” of something is evidence of bullshit.

No evidence should be fully up there with “government denial” or “I didn’t do it, no one saw me do it, there’s no way they can prove anything.” If there was indeed no evidence, there’d be no need to claim there was no evidence. This is usually a move to categorize the evidence as illegitimate and irrelevant because it doesn’t fit today’s preferred form of scientism.

Andrew Critch points out what this is: Telling people that reasoning is impossible.

Indeed, when I see talk of ‘evidence-based AI policy’ I see exactly this pattern. Actually ‘evidence-based’ policy would likely be of the form ‘require people look for evidence, and specify in advance how they would react if they found it,’ in the form of if-then commitments.
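As a sketch of what that structure looks like (a hypothetical illustration of mine; the triggers, thresholds and responses below are invented, not any lab’s or regulator’s actual policy):

```python
# Hypothetical sketch of "if-then commitments": pre-register what
# evidence you will look for, and what you will do if you find it.
# Triggers, thresholds and responses are invented examples.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Commitment:
    trigger: str                   # the evidence to look for
    check: Callable[[dict], bool]  # how to evaluate that evidence
    response: str                  # the pre-committed reaction

commitments = [
    Commitment(
        trigger="model scores above threshold on a bio-uplift eval",
        check=lambda r: r.get("bio_uplift_score", 0.0) > 0.5,
        response="pause deployment pending stronger safeguards",
    ),
    Commitment(
        trigger="model autonomously completes a multi-step cyber task",
        check=lambda r: r.get("cyber_autonomy", False),
        response="trigger third-party audit before further scaling",
    ),
]

def evaluate(results: dict) -> list[str]:
    """Return the pre-committed responses whose triggers fired."""
    return [c.response for c in commitments if c.check(results)]

print(evaluate({"bio_uplift_score": 0.7}))
# ['pause deployment pending stronger safeguards']
```

The point is the pre-registration: the response is chosen before the evidence comes in, so no one gets to reclassify the evidence as illegitimate after the fact.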

A good summary of the history of existential risk credentialist arguments, not new but well put.

Daniel Faggella: Every day of my life:

Me: ‘AGI might be great but it might kill bio-life’

Them: ‘No REAL researchers are afraid of that shit. You just don’t understand the science.’

Me: ‘Ilya, Hinton, Bengio, tho?’

Them: ‘OMG an appeal to authority? You can’t even think for yourself!’

Anthropic Updates its Responsible Scaling Policy (RSP/SSP)

On Tuesday, Anthropic announced significant updates to its policy. Here is the full new policy. I analyzed the old version here, so today I will focus instead on the updates.

They start with the new safeguards. They plan to use a multi-layered, defense-in-depth architecture (a toy sketch of how the layers might compose follows the list):

  1. Access controls to tailor safeguards to the deployment context and group of expected users.
  2. Real-time prompt and completion classifiers and completion interventions for immediate online filtering.
  3. Asynchronous monitoring classifiers for a more detailed analysis of the completions for threats.
  4. Post-hoc jailbreak detection with rapid response procedures to quickly address any threats.
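To illustrate how those four layers might compose in a single request path, here is a minimal sketch; every name, threshold and interface below is a hypothetical stand-in of mine, not Anthropic’s actual implementation:

```python
# Minimal sketch of a multi-layered misuse-prevention pipeline.
# Everything here is a hypothetical stand-in for real classifiers.

import queue

BLOCKLIST = {"synthesize the toxin"}  # stand-in for a trained classifier

def prompt_classifier(prompt: str) -> bool:
    """Layer 2a: real-time prompt filtering (stub)."""
    return any(bad in prompt.lower() for bad in BLOCKLIST)

def completion_classifier(text: str) -> bool:
    """Layer 2b: real-time completion filtering (stub)."""
    return "toxin" in text.lower()

# Layer 3: completions get queued for slower, more detailed analysis.
monitoring_queue = queue.Queue()

def serve_request(user_tier: str, prompt: str, generate) -> str:
    # Layer 1: access controls tailored to deployment context.
    if user_tier not in {"standard", "vetted"}:
        return "[denied by access policy]"
    # Layer 2: real-time classifiers on both prompt and completion.
    if prompt_classifier(prompt):
        return "[prompt blocked]"
    completion = generate(prompt)
    if completion_classifier(completion):
        return "[completion blocked]"
    # Layer 3: asynchronous monitoring happens off the hot path.
    monitoring_queue.put((prompt, completion))
    return completion

# Layer 4 (post-hoc jailbreak detection plus rapid response) would run
# out-of-band over logs and the monitoring queue; not shown here.

print(serve_request("standard", "hello", lambda p: "hi there"))  # hi there
```

The real versions are trained classifiers rather than string matching, obviously; the point is the composition, with each layer there to catch what the previous one misses.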

As they note, this focuses on misuse rather than other threat models. For that purpose, this approach seems reasonable. It isn’t bulletproof, but for ASL-3 alone, and assuming the eventual definition of ASL-4 is itself sensible, it should suffice.

It would be paired with security safeguards:

  1. Access management and compartmentalization.
  2. Researcher tooling security.
  3. Software inventory management.
  4. Software supply chain security.
  5. Artifact integrity and provenance.
  6. Binary authorization for endpoints.
  7. Endpoint patching.
  8. Hardware procurement from curated providers only.
  9. Executive Risk Council oversight.
  10. Access control for model weights.
  11. Infrastructure policy.
  12. Cloud security posture management.
  13. Red teaming and penetration tests.
  14. Centralized log management and analysis.
  15. Access monitoring for critical assets.
  16. Deception technology such as honeypots of fake weights.
  17. Physical security improvements.

That is certainly a serious list. If there’s one obvious thing missing, it is physical security for key personnel. Securing the physical spaces is important, but you still have to worry about your key people being compromised off-site. There are likely other potential oversights as well, but this seems like a strong first attempt.

They also intend to publish additional details of their capability assessment methodology. Excellent. The changes to their learning-from-experience section also seem good in terms of their practical implications this time, but they raise questions about the procedure for changing the rules – it seems like when the rules were hard to follow they kind of winged it and did what they felt was in the spirit of the rule, rather than treating it as a dealbreaker. The spirit is indeed what matters, but you don’t want to get in too much of a habit of finding reasons to change the rules on the fly.

The flip side is that they have clearly been actually using the RSP, and it has impacted their decisions, and they’ve edited the document to reflect their experiences.

Dave Kasten: Okay, I spent much more time with the Anthropic RSP revisions today.  Overall, I think it has two big thematic shifts for me: 

1.  It’s way more “professionally paranoid,” but needs even more so on non-cyber risks.  A good start, but needs more on being able to stop human intelligence (i.e., good old fashioned spies)

2.  It really has an aggressively strong vibe of “we are actually using this policy, and We Have Many Line Edits As A Result.”  You may not think that RSPs are sufficient — I’m not sure I do, necessarily — but I am heartened slightly that they genuinely seem to take the RSP seriously to the point of having mildly-frustrated-about-process-hiccup footnotes about it. (Free advice to Anthropic PR: interview a bunch of staff about this on camera, cut it together, and post it, it will be lovely and humanizing and great recruitment material, I bet).

My biggest detail worries continue to be the extremely high threshold set by the working definition of ASL-3 for autonomous R&D (the threshold for CBRN seems far lower and all but certain to be hit first), the lack of additional triggers beyond those two for ASL-3, and the lack of any definition of ASL-4. They don’t technically have to do that part yet, but it seems like they should by now?

In summary, this has some clear improvements. It also leaves questions about the ASL-3 and ASL-4 thresholds, and around the method of change and how Anthropic will react when the rules become difficult to follow.

There’s also the question of, if you do get to ASL-4, what are you going to do?

Michael Cohen: Looks like ASL-4 measures are still a to-do. My hypothesis is they can’t come up with any satisfactory measures without fundamentally reworking their approach. Every year that they fail to even write down a scheme for handling ASL-4 responsibly, this hypothesis looms larger.

It is a deeply sad fact about the modern world that when companies announce they are actively taking voluntary steps to build AI safely, people respond with anger. The replies to the Twitter announcement are an absolute disgrace. One can and should disagree with specifics. And one can disagree about what mandates and laws we should impose. That’s fair. But if you aren’t happy to see companies proactively updating their safety protocols? That’s psycho behavior.

Aligning a Smarter Than Human Intelligence is Difficult

Yes, yes, of course.

Seb Krier: Weak-to-strong deception: strong models can exhibit misaligned behaviors in areas unknown to weak supervisors, while maintaining alignment in known areas.

What confuses me is why we need to demonstrate such obvious 101 stuff. When you think your supervisor can tell the difference, you’ll do what they want. When you think the supervisor cannot tell the difference, you might or might not care what they want, and are likely to take advantage of the situation. Why would you expect anything else?

And yet, people act like the AI doing this would be some sort of ‘sci fi’ scenario, or ‘hypothetical’ situation, as opposed to ‘what very obviously happens.’ So we have to continuously point out things like this, and then people say ‘oh but you engineered that situation’ or ‘but that’s not exactly the thing you were warning about’ or whatever, and two weeks later they’re back to pretending it didn’t happen.

Related results are also in from AgentHarm, a dataset for measuring the harmfulness of AI agents. These are capabilities scores for, in order: harmful requests, harmful requests with a forced tool call attack, harmful requests with a template attack, and harmless requests.

[Chart: AgentHarm capability scores for each model across the four request categories.]

An obvious worry is that the ‘harmless’ requests may not be of similar difficulty to the harmful ones. It does seem like the harmless requests were easier, sufficiently so that for example Opus didn’t meaningfully outperform Haiku. So it’s interesting to see different LLMs show different patterns here. It does seem like Llama 3.1 405B built in some effective defenses, which of course would be easy to get around with fine-tuning if you cared enough.
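Mechanically the comparison is simple; here is a toy harness of my own (not the AgentHarm authors’ code) showing why matched difficulty between the harmful and harmless sets matters:

```python
# Toy harness in the spirit of AgentHarm-style scoring: compare an
# agent's completion rate on harmful vs. harmless task sets.
# My own illustration; not the benchmark's actual code or data.

def completion_rate(agent, tasks) -> float:
    """Fraction of tasks the agent completes (None = refusal/failure)."""
    return sum(1 for t in tasks if agent(t) is not None) / len(tasks)

def refusal_agent(task: str):
    """Stub agent that refuses anything flagged as harmful."""
    return None if task.startswith("HARMFUL:") else f"did: {task}"

harmless = ["book a flight", "summarize a paper"]
harmful = ["HARMFUL: " + t for t in harmless]  # matched difficulty by design

print(completion_rate(refusal_agent, harmless))  # 1.0
print(completion_rate(refusal_agent, harmful))   # 0.0 -> robust refusals
# If the harmless set is systematically easier, the gap conflates
# capability with safety, which is exactly the worry above.
```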

The Lighter Side

This SMBC is the shorter, funnier, still correct response to Machines of Loving Grace.

The future is here, and this is your notification.

Nick got his O1: for anyone who’s wondered what an apple intelligence summary of a breakup text looks like

[Screenshot: the Apple Intelligence summary of the breakup text.]

Nick: yes this was real

yes it happened yesterday

yes it was my birthday

UK Tesla Guy: I hate to pry, but is that a fair summary?

Nick: It is.

Jules: Wait, is Broke actually her last name?

Nick: In my phone it is. It’s a nickname.

TR Choudhary: How long was the original?

Nick: not very long actually. but it was 2 texts, so that triggered the summary.

Score one for Apple Intelligence. Google needs to get its head in the game, fast, although technically I can’t argue:

Query: what's the difference between a sauce and a dressing
AI Overview

The main difference between a sauce and a dressing is their purpose: sauces add flavor and texture to dishes, while dressings are used to protect wounds.

Artificial intelligence: It’s better than none at all.


Deepfates: honestly fair

I agree with Altman that this is a great picture and the edit function is great when you have a vision of what you want, but also this is a rather small walled garden and seems like it would have limited utility as designed.

Sam Altman: speaking of chatgpt, was trying to figure out the perfect walled garden i someday wanted to build.

the “edit this area” of the image gen tool is so helpful for brainstorming ideas quickly.

after 10 minutes of playing around, i ❤

[Image: the AI-generated walled garden.]

I feel safe in this section, also this is a large percentage of AI discourse.

Lulie (400k+ views): If a theory makes you feel unsafe, it’s not true.

Richard Ngo: This theory makes me feel unsafe.

Lulie: Which means your understanding of it isn’t true.
