Maybe your idea works too; it's an interesting concept, but I'm unsure. The crucial question is always how the AI is supposed to know who is creating the simulations, what the simulators' values might be, and with whom it should trade. In this logical counterfactual trade, who are the other "agents" that the AI is supposed to be nice to? Are rocks agents? Should it preserve every rock in the Universe? Usually, I wouldn't be that worried about this, as I think 'agent' is a fairly natural concept that might even have some nice mathematical definition. But once the AI is assuming that the simulators might be screwing with its mind to distort logic itself, it can't really rely on that. And how should it know it is supposed to help 'agents' in the first place? And why should it only keep existing agents alive, rather than bring non-existing agents into existence? There are infinitely many options. I'm not sure that an AI in a state of total Cartesian doubt that extends to logic itself can decide that "leave some stuff to other already existing agents who have sublinear utility functions" is a particularly likely thing for the simulators to want.
This objection applies to all simulation/acausal trade proposals to some extent. How can the AI know that its likely simulators are humans, so it should treat the humans kindly, instead of the octopi, or the rocks (because maybe the simulators are crystal-people who really care about no one breaking rocks in other Universes), or any of the myriad other possible values?
This is a hard question, but my current view is that the reasonable way to do acausal trade is probably to first find a small Schelling-point distribution that most other winners in the distribution also agree is a Schelling point. For example, the first deal could be struck between the winners of the AI race (various human groups and unaligned AIs) in the different Everett branches that started branching at the start of the AI race, the day ChatGPT came out. Then, once we have made our deals with the winners from the different branches and learned to live together, our coalition can make deals with a broader Schelling-point distribution, and so on. We can push the Everett branching point deeper and deeper into history, and then make deals within distributions that are not within the quantum multiverse at all, but are based on different types of empirical and eventually logical counterfactuals. These more distant trades likely become harder and less effective, because it becomes harder to find Schelling-point distributions that every branch agrees on, and because it becomes harder to guess or reliably simulate what the values of increasingly alien minds could be. (What should I do to keep my end of the trade deal and please the beings who have power in the logically counterfactual world where 2+2=5? I have no idea.)
(I largely got these views on acausal trade from a conversation with @Richard_Ngo , who writes about something like these broadening coalitions in this post. I don't think that post is the best explainer of the concept, though, and I wish he had written down the vision of broadening coalitions in more detail, or that I had time to write it down myself in more detail than this somewhat confusing comment.)
There are many things I would write differently in my post now, but I still mostly stand by it, because it more or less proposes making deals between nearby Everett branches where humans and AIs win, and I think that's a workable proposal as a natural first step in the process of broadening acausal trade coalitions. Your proposal, on the other hand, jumps immediately to the end of the process, trying to make deals with beings in logically counterfactual universes. I'm nervous about that, because it might be very hard for the AIs to find the right distribution of counterfactual beings to make a deal with, and to figure out what the values of those beings might be.
Thanks for the reply. If you have time, I'm still interested in hearing what would be a realistic central example of non-concentrated failure that's good to imagine while reading the post.
This post was a very dense read, and it was hard for me to digest what the main conclusions were supposed to be. Could you write some concrete scenarios that you think are central examples of schemers causing non-concentrated failures? While reading the post, I never knew what situation to imagine: An AI doing philosophical alignment research but intentionally producing promising-looking crackpottery? An AI building cyber-sec infrastructure but leaving in a lot of vulnerabilities? An AI advising the President, but with a bias towards advocating for integrating AI into the military?
I think these problems are all pretty different in terms of which approaches are promising for preventing them, so it would be useful to see what you think the most likely non-concentrated failures are, so we can read the post with that in mind.
As another point, it would really help if you wrote conclusion sections. The post makes a lot of different points, and it's hard to see which ones you consider most important to get across to the reader. A conclusion section would help a lot with that.
In general, I think that among all the people I know, you might be the one with the biggest gap between how good you are at explaining concepts in person and how hard your blog posts are to follow. (Strangely, your LW comments are also very good and digestible, more similar to your in-person communication than to your long-form posts; I don't know why.) I think it could be high leverage for you to experiment with making your posts more readable. Using more concrete examples and writing conclusion sections would go a long way toward improving your posts in general, but I felt compelled to comment here because this post was especially hard to read without them.
My strong guess is that OpenAI's results are real; it would really surprise me if they were literally cheating on the benchmarks. It looks like they are just using much more inference-time compute than is available to any outside user, together with a clever scaffold that makes the model productively utilize the extra inference time. Elliot Glazer (creator of FrontierMath) says in a comment on my recent post on FrontierMath:
A quick comment: the o3 and o3-mini announcements each have two significantly different scores, one <= 10%, the other >= 25%. Our own eval of o3-mini (high) got a score of 11% (it's on Epoch's Benchmarking Hub). We don't actually know what the higher scores mean, could be some combination of extreme compute, tool use, scaffolding, majority vote, etc., but we're pretty sure there is no publicly accessible way to get that level of performance out of the model, and certainly not performance capable of "crushing IMO problems."
I do have the reasoning traces from the high-scoring o3-mini run. They're extremely long, and one of the ways it leverages the higher resources is to engage in an internal dialogue where it does a pretty good job of catching its own errors/hallucinations and backtracking until it finds a path to a solution it's confident in. I'm still writing up my analysis of the traces and surveying the authors for their opinions on the traces, and will also update e.g. my IMO predictions with what I've learned.
I like the idea of IMO-style releases: keep collecting new problems, test the AIs on them, then release them to the public. What do you think: how important is it to only have problems with numerical solutions? If you can test the AIs on problems requiring proofs, there are already many competitions that regularly release high-quality problems. (I'm shilling KöMaL again as one that's especially close to my heart, but there are many good monthly competitions around the world.) I think if we instruct the AI to present its solution in one page at the end, it's not that hard to get an experienced competition grader to read the solution and score it according to the normal competition scoring, so the result won't be much less objective than if we only used problems with numerical solutions. If you want to stick to problems with numerical solutions, I worry that you will have a hard time regularly assembling high-quality numerical problems again and again, and even if the problems are released publicly, people will have a harder time evaluating them than if they came from an actual competition, where we can compare against the natural human baseline of the competing students.
Thanks a lot for the answer; I put in an edit linking to it. I think it's a very interesting update that the models get significantly better at catching and correcting their mistakes in OpenAI's scaffold with longer inference time. I am surprised by this, given how much it feels like the models can't distinguish their plausible-looking fake reasoning from good proofs at all. But I assume there is still a small signal in the right direction, and that can be amplified if the model thinks the question through many times (and does something like majority voting within its chain of thought?). I think this is an interesting update towards the viability of inference-time scaling.
I think many of my other points still stand, however: I still don't know how capable I should expect the internally scaffolded model to be, given that it got 32% on FrontierMath, and I would much rather have them report results on the IMO or a similar competition than on a benchmark I can't see and whose difficulty I can't easily assess.
I like the main idea of the post. It's important to note, though, that the setup assumes we have a bunch of alignment ideas that all have an independent 10% chance of working. Meanwhile, in reality I expect a lot of correlation: there is a decent chance that alignment is easy and a lot of our ideas will work, and a decent chance that it's hard and basically nothing works.
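To put rough numbers on why the correlation matters, here is a minimal sketch comparing the two extremes; the specific count of ten ideas is my own made-up assumption, not anything from the post:

```python
# Toy numbers only: ten alignment ideas, each with a 10% chance of working.
p_each, n_ideas = 0.10, 10

# Fully independent ideas: the failure probabilities multiply away.
p_independent = 1 - (1 - p_each) ** n_ideas   # ~0.65

# Perfectly correlated ideas ("alignment is easy or it isn't"):
# ten ideas are no better than one.
p_correlated = p_each                          # 0.10

print(f"independent: {p_independent:.2f}, perfectly correlated: {p_correlated:.2f}")
```

Reality is presumably somewhere in between, but the gap between roughly 65% and 10% is why the independence assumption does a lot of work in the post's setup.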
Does anyone know of a zinc acetate lozenge that isn't peppermint flavored? I really dislike peppermint, so I'm not sure it would be worth it to drink 5 peppermint-flavored glasses of water a day to decrease the duration of a cold by one day, and I haven't found other zinc acetate lozenge options yet; the acetate version seems to be rare among zinc supplements. (Why?)
Fair, I also haven't made any specific commitments; I phrased that wrong. I agree there can be extreme scenarios with trillions of digital minds tortured where you'd maybe want to declare war on the rest of society. But I would still like people to write down that "of course, I wouldn't want to destroy Earth before we can save all the people who want to live in their biological bodies, just to get a few years of acceleration in the cosmic conquest". I feel a sentence like this should really have been included in the original post about dismantling the Sun, and as long as people are not willing to write this down, I remain paranoid that they would in fact haul the Amish to the extermination camps if it feels like a good idea at the time. (As I said, I have met people who really held this position.)
Thanks for the reply, I broadly agree with your points here. I agree we should probably eventually try to do trades across logical counterfactuals. Decreasing logical risk is one good framing for that, but in general, there are just positive trades to be made.
However, I think you are still underestimating how hard it might be to strike these deals. "Be kind to other existing agents" is a natural idea to us, but it's still unclear to me whether it's something you should assign high probability to as a preference of logically counterfactual beings. Sure, there is enough room for humans and mosquitos, but if you relax 'agent' and 'existing', suddenly there is not enough room for everyone. You can argue that "be kind to existing agents" is plausibly a statement with relatively short description length, so it will be among the AI's first guesses and the AI will allocate at least some fraction of the universe to it. But once you are trading across logical counterfactuals, I'm not sure you can trust things like description length. Maybe in the logically counterfactual universe, they assign higher value/probability to longer rather than shorter statements, but the measure still sums to 1, because math works differently.
Similarly, you argue that loving torture is probably rare, based on evolutionary grounds. But logically counterfactual beings weren't necessarily born through evolution. I have no idea how we should determine the distribution of logical counterfactuals, and I don't know what fraction of that distribution enjoys torture.
Altogether, I agree logical trade is eventually worth trying, but it will be very hard and confusing and I see a decent chance that it basically won't work at all.