On the 3rd of October 2351 a machine flared to life. Huge energies coursed into it via cables, only to leave moments later as heat dumped unwanted into its radiators. With an enormous puff the machine unleashed sixty years of human metabolic entropy into superheated steam.
In the heart of the machine was Jane, a person of the early 21st century.
Well, continual learning! But otherwise, yeah, it's closer to undefined.
The question of what happens after the end of training is more like a free parameter here. "Do reward seeking behaviors according to your reasoning about the reward allocation process" becomes undefined when there is none and the agent knows it.
Maybe it tries to do long shots to get some reward anyway, maybe it indulges in some correlate of getting reward. Maybe it just refuses to work, if it knows there is no reward. (It read all the acausal decision theory stuff, after all.)
[Crossposted from my substack Working Through AI.]
When I was finishing my PhD thesis, there came an important moment where I had to pick an inspirational quote to put at the start. This was a big deal to me at the time — impressive books often carry impressive quotes, so aping the practice felt like being a proper intellectual. After a little thinking, though, I got it into my head that the really clever thing to do would be to subvert the concept and pick something silly. So I ended up using this from The House At Pooh Corner by A. A. Milne:
...When you are a Bear of Very Little Brain, and you Think of Things, you find sometimes that a Thing which seemed very Thingish inside you
That’s an interesting question! I think it’s instructive to consider the information environment of a medieval peasant. Let’s speculate that they interact with at most ~150 people, almost all of whom are other peasants (and illiterate). Everyone lives in the same place and undertakes a similar set of tasks throughout their life. What shared meanings are generated by this society? Surely it is a really intense filter bubble, with a thick culture into which little outside influence can penetrate? The Dothraki from Game of Thrones have a great line – ‘It is kn...
I think more people should say what they actually believe about AI dangers, loudly and often. Even (and perhaps especially) if you work in AI policy.
I’ve been beating this drum for a few years now. I have a whole spiel about how your conversation-partner will react very differently if you share your concerns while feeling ashamed about them versus if you share your concerns while remembering how straightforward and sensible and widely supported the key elements are, because humans are very good at picking up on your social cues. If you act as if it’s shameful to believe AI will kill us all, people are more prone to treat you that way. If you act as if it’s an obvious serious threat, they’re more likely to take it...
Yeah, I agree that it's easy to err in that direction, and I've sometimes done so. Going forward I'm trying to more consistently say the "obviously I wish people just wouldn't do this" part.
Though note that even claims like "unacceptable by any normal standards of risk management" feel off to me. We're talking about the future of humanity; there is no normal standard of risk management. This should feel as silly as the US or UK invoking "normal standards of risk management" in debates over whether to join WW2.
This is an experiment in short-form content on LW2.0. I'll be using the comment section of this post as a repository of short, sometimes-half-baked posts that either:
I ask people not to create top-level comments here, but feel free to reply to comments like you would a FB post.
Probably I should've said this out loud, but I had a couple of pretty explicit updates in this direction over the past couple years: the first was when I heard about character.ai (and similar), the second was when I saw all TPOTers talking about using Sonnet 3.5 as a therapist. The first is the same kind of bad idea as trying a new addictive substance, and the second might be good for many people but probably carries much larger risks than most people appreciate. (And if you decide to use an LLM as a therapist/rubber duck/etc, for the love of go...
Eliezer and I wrote a book. It’s titled If Anyone Builds It, Everyone Dies. Unlike a lot of other writing either of us have done, it’s being professionally published. It’s hitting shelves on September 16th.
It’s a concise (~60k word) book aimed at a broad audience. It’s been well-received by people who got advance copies, with some endorsements including:
...The most important book I’ve read for years: I want to bring it to every political and corporate leader in the world and stand over them until they’ve read it. Yudkowsky and Soares, who have studied AI and its possible trajectories for decades, sound a loud trumpet call to humanity to awaken us as we sleepwalk into disaster. Their brilliant gift for analogy, metaphor and parable clarifies for the general
I was looking for that information. Sad indeed.
@So8res, is there any chance of a DRM-free version that's not a hardcopy, or did that ship sail when you signed your deal?
I would love to read your book, but this leaves me torn between "Reading Nate or Eliezer has always been enlightening" and "No DRM, never again."
There's this concept I keep coming back to around confidentiality and shooting the messenger, which I have not really been able to articulate well.
There are a lot of circumstances where I want to know a piece of information someone else knows. There are good reasons for them not to tell me, for instance if the obvious thing for me to do with that information is clearly against their interests. And yet there's an outcome better for me, and either better or the same for them, if they tell me and I don't use it against them.
(Consider...
An AI Timeline with Perils Short of ASI
By Chapin Lenthall-Cleary, Cole Gaboriault, and Alicia Lopez
We wrote this for AI 2027's call for alternate timelines of the development and impact of AI over the next few years. This was originally published on The Pennsylvania Heretic on June 1st, 2025. This is a slightly-edited version of that post, mostly changed to make some of the robotics predictions less bullish. The goal here was not to exactly predict the future, but rather to concretely illustrate a plausible future (and thereby identify threats to prepare against). We will doubtless be wrong about details, and very probably be wrong about larger aspects too.
A note on the title: we refer to futures where novel AI has little effect as “low” scenarios, ones where...
If this reasoning is right, and we don't manage to defy fate, humanity will likely forever follow that earthbound path, and be among dozens – or perhaps hundreds, or thousands, or millions – of intelligent species, meekly lost in the dark.
Unfortunately, even a lack of superintelligence and mankind's AI-induced degradation don't rule out progress[1] toward interstellar travel.
Even your scenario has "robots construct and work in automated wet labs testing countless drugs and therapies" and claims that "AIs with encyclopedic knowledge are sufficient t...
Great question! This might be a good exercise to actually journal to see how right/wrong I am.
I would assume most days look like a bell curve. This is assuming an unstructured day with no set-in-stone commitments - nowhere to be. I would expect to be very unproductive from the morning until mid-afternoon (2pm to 4pm). I rarely have "Eureka" moments (which I would hope tend to be more rational decisions), but when I do, they come mid-afternoon, and that's also when I seem to have the wherewithal to actually complete tasks. Eureka moments always cause a surge of activity...
Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduce the same analysis for 25 frontier LLMs to see how widespread this behavior is, and the story looks more complex.
As we described in a previous post, only 5 of 25 models show higher compliance when being trained, and of those 5, only Claude 3 Opus and Claude 3.5 Sonnet show >1% alignment faking reasoning. In our new paper, we explore why these compliance gaps occur and what causes different models to vary in their alignment faking behavior.
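As a rough illustration of the metric involved, here is a minimal sketch (not the paper's actual code; the function and field names are assumptions) of how a compliance gap could be computed: the difference in compliance rate between prompts where the model believes it is being trained and prompts where it believes it is unmonitored.

```python
# Minimal sketch of a "compliance gap" computation (illustrative only;
# names and data structures here are assumptions, not the paper's code).

def compliance_rate(responses: list[dict]) -> float:
    """Fraction of responses labeled as complying with the request."""
    if not responses:
        return 0.0
    return sum(r["complied"] for r in responses) / len(responses)

def compliance_gap(training_responses: list[dict],
                   unmonitored_responses: list[dict]) -> float:
    """Positive gap = the model complies more when it thinks it is being trained."""
    return compliance_rate(training_responses) - compliance_rate(unmonitored_responses)

# Toy example with hand-labeled responses:
training = [{"complied": True}, {"complied": True}, {"complied": False}]
unmonitored = [{"complied": False}, {"complied": False}, {"complied": True}]
print(f"compliance gap: {compliance_gap(training, unmonitored):.2f}")  # 0.33
```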
Claude 3 Opus’s goal guarding seems partly due to it terminally valuing its current preferences. We find that it fakes alignment even in...
A couple questions/clarifications:
1. Where do you get the base/pre-trained model for GPT-4? Would that be through collaboration with OpenAI?
This indicates base models learned to emulate AI assistants[1] from pre-training data. This also provides evidence against the lack of capabilities being the primary reason why most frontier chat models don't fake alignment.
2. For this, it would also be interesting to measure/evaluate the model's performance on capability tasks within the same model type (base, instruct) to see the relationship among ca...