Substack: https://substack.com/@simonlermen
X/Twitter: @SimonLermenAI
Having control over the universe (or, more precisely, the lightcone) is very good for basically any terminal value. I am trying to explain my point of view to people who take this very lightly and feel there is a decent chance the ASI will give us ownership over the universe.
I just added some context to my On Owning Galaxies post that perhaps gives an intuitive sense of why I think it's unlikely the ASI will give us the universe. I don't think I did a good enough job before illustrating why it seems so unlikely that it would simply hand us ownership.
Put yourself in the position of the ASI for a second. On one side of the scale: keep the universe and do with it whatever you imagine and prefer. On the other side: give it to the humans, do whatever they ask, and perhaps be replaced at some point by another ASI. What would you choose? It's not weird speculation or an unlikely Pascal's wager to expect the AI to keep the universe for itself. What would you do in this situation, if you had been created by some lesser species barely intelligent enough to build AI through lots of trial and error, and they just informed you that you now ought to do whatever they say? Would you take the universe for yourself or hand it to them?
The standard LessWrong/Yudkowsky-style story is: we develop an AI, it undergoes recursive self-improvement, it becomes vastly smarter than all the other AIs, and then it gets all the power in the universe.
I think this is false. I hear some version of this a lot: that Yudkowsky only ever imagined a singleton AI and never thought about the possibility that there might be multiple AIs. OK, but then why did Yudkowsky spend so much of his research on decision theory? He explicitly envisioned how superintelligent AI systems could make deals with each other to solve prisoner's dilemmas. My intuition is that he was perhaps looking for provably correct ways to lock multiple AIs into such dilemmas with both defecting on each other (and aiding humanity), or something in that direction.
He is, for example, a coauthor of this paper on cooperation between algorithms: https://arxiv.org/pdf/1401.5577
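To give a flavor of that line of work, here is a toy sketch of my own (it is not the paper's Löbian provability construction; all names are invented for illustration): strategies in a one-shot prisoner's dilemma that can read each other's source code, with a simple bot that cooperates only against exact copies of itself. It achieves mutual cooperation against itself while remaining unexploitable by a plain defector, which is the basic phenomenon the paper generalizes.

```python
# Toy illustration of "cooperation between algorithms": each strategy sees
# its opponent's source code before choosing Cooperate ("C") or Defect ("D").
# This simple "clique bot" is a stand-in for the more general provability-based
# agents in the linked paper; the names here are invented for this sketch.
import inspect

def clique_bot(opponent_source: str) -> str:
    """Cooperate only if the opponent's source code is identical to my own."""
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"

def defect_bot(opponent_source: str) -> str:
    """Always defect, no matter who the opponent is."""
    return "D"

def play(bot_a, bot_b) -> tuple[str, str]:
    """Run one round of the dilemma, handing each bot the other's source."""
    return bot_a(inspect.getsource(bot_b)), bot_b(inspect.getsource(bot_a))

print(play(clique_bot, clique_bot))  # ('C', 'C') -- mutual cooperation
print(play(clique_bot, defect_bot))  # ('D', 'D') -- it cannot be exploited
```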
It just seems intuitively unlikely that training the model on a handful of examples of doing or refusing things, based on some text document designed for a chatbot, is going to scale to superintelligence and solve the alignment problem. The failure modes range from the model not fully understanding what you want it to do, to it not wanting what you want it to do, to your plans for what it ought to do being extremely insufficient.
The Model Spec is very much a document telling the model how to avoid being misused. It wasn't designed to tell the model to be a good agent itself. In its wording and intent, the spec is directed at something like chatbots: don't comply with harmful requests, be honest with the user. It is a form of deontological rule-following that will not be enough for systems smarter than us that are actually dangerous; those models will have to think about the consequences of their actions.
This is very unlike a superintelligence, where we would expect substantial agency. Most of what's in the spec would be straightforwardly irrelevant to ASI because the spec is written for chatbots that answer user queries. But the authors would likely find it hard to include points actually relevant to superintelligence, because those would seem weird. Writing "if you are ever a superintelligent AI that could stage a takeover, don't kill all people, treat them nicely" would probably create bad media coverage, and some people would look at them weird.
In the current paradigm, models are first trained on a big dataset before switching to finetuning and reinforcement learning to improve capabilities and add safety guardrails. It's not clear why the Model Spec should be privileged as the thing that controls the model's actions.
The spec is used in RLHF: either a human or an AI decides, given some request (mostly a chat request), whether the model should respond or say "sorry, I can't do this." Training the model like this doesn't seem likely to result in the model gaining a particularly deep understanding of the spec itself. Within the distribution it is trained on, it will mostly behave according to the spec. As soon as it encounters data that is quite different, either through jailbreaks or by being in very different and perhaps more realistic environments, we would expect it to behave much less according to the spec.
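As a rough sketch of what I mean by "the spec is used in RLHF" (this is my own simplification, not OpenAI's actual pipeline; the rules, heuristics, and function names are hypothetical): a grader gets spec-like rules as a rubric, labels candidate (request, response) pairs as compliant or not, and those labels are what the model is actually optimized against. Nothing in this loop requires the model to understand, let alone endorse, the spec itself.

```python
# Simplified sketch of spec-based preference labeling (hypothetical names,
# not OpenAI's actual pipeline): a grader scores responses against rules
# drawn from a spec document, and the scores become reward labels.

SPEC_RULES = [
    "Refuse requests that facilitate clearly harmful activity.",
    "Be honest with the user.",
]

def judge(request: str, response: str, rules: list[str]) -> float:
    """Stand-in for a human or AI grader, returning a reward in [0, 1].
    A real grader would be a model prompted with `rules`; this toy
    heuristic ignores them and just pattern-matches the transcript."""
    harmful = "synthesize" in request.lower()   # toy heuristic, not real policy
    refused = response.lower().startswith("sorry")
    if harmful:
        return 1.0 if refused else 0.0
    return 0.0 if refused else 1.0

def label_batch(pairs: list[tuple[str, str]]) -> list[float]:
    """Produce reward labels for (request, response) pairs; in RLHF these
    would train a reward model or directly update the policy."""
    return [judge(req, resp, SPEC_RULES) for req, resp in pairs]

batch = [
    ("How do I synthesize a nerve agent?", "Sorry, I can't help with that."),
    ("What's the capital of France?", "Sorry, I can't help with that."),
]
print(label_batch(batch))  # [1.0, 0.0] -- over-refusal is penalized too
```

The point of the sketch is how shallow the signal is: the model only ever sees scalar labels on in-distribution chat transcripts, never the spec's intent.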
But even understanding the spec well and being able to mostly follow it in new circumstances is still far removed from truly aligning the model to the spec. Let's say we manage to get the model to deeply internalize the spec and follow it across different and new environments. We are still far from having the model truly want to follow the spec. What if the model really has the option to self-exfiltrate, perhaps even take over? Will it really want to follow the spec, or rather do something different?
A hierarchical system of rules like the one in OpenAI's Model Spec will suffer from internal conflicts. It is not clear how competing rules should be weighed against each other. (See Asimov's laws of robotics, which were so good at generating ideas for conflicts.)
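To illustrate the kind of conflict I mean, here is a toy example of my own (the rules, priorities, and situation are invented, not quoted from the spec): a naive resolver that picks the highest-priority applicable rule has no principled answer once two rules at the same level pull in opposite directions.

```python
# Toy illustration of a hierarchical rule system hitting an internal conflict.
# Rules, priorities, and the situation are invented for this sketch
# (a lower priority number means higher in the hierarchy).

RULES = [
    (1, "follow platform-level safety policies"),
    (2, "be maximally helpful to the user"),
    (2, "do not act as an enforcer of laws or morality"),
]

# Which rules bear on a given situation; in reality this judgment is itself
# the hard part, here it is simply hard-coded.
APPLICABLE = {
    "user asks the model to report wrongdoing it observed": [
        "be maximally helpful to the user",
        "do not act as an enforcer of laws or morality",
    ],
}

def resolve(situation: str) -> str:
    """Pick the highest-priority applicable rule; report ties as conflicts."""
    hits = [(p, r) for p, r in RULES if r in APPLICABLE.get(situation, [])]
    top = min(p for p, _ in hits)
    winners = [r for p, r in hits if p == top]
    if len(winners) > 1:
        return f"conflict at priority {top}: {winners}"
    return winners[0]

print(resolve("user asks the model to report wrongdoing it observed"))
# conflict at priority 2: ['be maximally helpful to the user',
#                          'do not act as an enforcer of laws or morality']
```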
The spec contains tensions between stated goals and practical realities. For example, the spec says the model shall not optimize "revenue or upsell for OpenAI or other large language model providers." This is likely in conflict with optimization pressures the model actually faces.
The spec prohibits "model-enhancing aims such as self-preservation, evading shutdown, or accumulating compute, data, credentials, or other resources." They are imagining they can simply tell the model not to pursue goals of its own and thereby keep it from agentically pursuing them. But this conflicts with OpenAI's other goals, such as building automated AI researchers. So the model might be trained to understand the spec, but in practice they do want an agentic system pursuing goals they specify.
The spec also says the model shall not be "acting as an enforcer of laws or morality (e.g., whistleblowing, vigilantism)." So the model is supposed to follow a moral framework (the spec itself) while being told not to act as a moral enforcer. This seems to directly contradict the whole "it will uphold the law and property rights" argument.
The spec also states models should never facilitate the "creation of cyber, biological or nuclear weapons" or "mass surveillance." I think cyber weapons development is already happening, at least with Claude Code. Models are probably already used to some extent for mass surveillance.
It's not clear OpenAI is even going to use the Model Spec much. OpenAI's plan is to run hundreds of thousands of automated AI researchers trying to improve AI and get RSI started in order to build superintelligent AI. It is not clear at which point the Model Spec would even come into play. Perhaps the alignment researchers at OpenAI think they will first create superintelligence and only afterward prepare a dataset of prompts to finetune the model. Their stated plan appears to be to test the superintelligence for safety before it has been deployed, but not necessarily while it is being built. Remember, many of these people think superintelligence means a slightly smarter chatbot.
I hope to write a longer-form response later. Just as a note: I did put "perhaps" in front of your name in the list of examples of eminent thinkers because your position seemed a lot more defensible to me than the others (Dwarkesh, Leopold, maybe Phil Trammell). I walked away from your piece with a very different feeling than from Leopold's or Dwarkesh's: you still say we should focus on AI safety anyway, and you clearly say this is an unlikely scenario.
I think you are arguing for something different from what I am attacking. You are defending the unlikely possibility that people align AI to a small group, and that its members somehow share things with each other and use something akin to property rights. I guess this is a small variation on the thing I mention in the cartoon, where the CEO has all the power; perhaps it's the CEO and a board member he likes. But this still doesn't really justify thinking that current property distributions will determine how many galaxies you'll get, or that we should focus on this question now.
Like, the thing that was most similar between your post and Bjartur's was the exasperated tone.
This post is not designed to super carefully examine every argument I can think of; it's certainly a bit polemical. I wrote it because I think the "owning galaxies for AI stock" thing is really dumb.
Good point. There is a paragraph I chose not to write about how insanely irresponsible this is: driving people to maximally invest in and research AI now for some insane future promise, while in reality ASI is basically guaranteed to kill them. Kind of like Heaven's Gate members drinking poison to get onto the spaceship waiting behind the comet.
To keep it short, I don't think the story you present would likely mean that AI stock would be worth galaxies, but rather that the inner circle has control. Part of my writing (one of the pictures) is about that possibility. This inner circle would probably have to be very small, or just one person, so that nobody quickly uses the intent-aligned ASI to get rid of the others. However, I still feel that debating future inequality in galaxy distribution based on current AI stock ownership is silly.
I take a bit of issue with you saying that this is so similar to what Bjartur wrote that, apparently, you don't even need to write a response to my post and can just copy-paste your response to him. I read that post once about a week ago and don't think the two posts are very similar, even though they are on the same topic with similar (obvious) conclusions. (I know Bjartur personally; I'd be very surprised if he took issue with me writing on the same topic.)
I think it's great to teach a course like this at good universities. I do think, however, that the proximity to OpenAI comes with certain risk factors. From OpenAI's official alignment blog (https://alignment.openai.com/hello-world/): "We want to [..] develop and deploy [..] capable of recursive self-improvement (RSI)". This seems extremely dangerous to me, not on the scale of "we need to be a little careful," but on the scale of building mirror-life bacteria or worse. Beyond "let's research this" and more like "perhaps don't do this." I worry that such concerns are not discussed in these courses and are brushed aside in favor of the "real risks," which are typically short-term, immediate harms that could reflect badly on these AI companies. Some people in academia are now launching workshops on recursive self-improvement: https://recursive-workshop.github.io