David Matolcsi

Sequences

Obstacles in ARC's agenda

Comments

I agree 4 and 5 are not really separate. The main point is that using formal input distributions for explanations just passes the buck to explaining things about the generative AI that defines the formal input distribution, and at some point something needs to have been trained on real data, and we need to explain behavior there.

Yes, this is part of the appeal of catastrophe detectors: we can make an entire interesting statement fully formal by asking how often a model causes a catastrophe (as defined by a neural net catastrophe detector) on a formal distribution (defined by a generative neural net with a Gaussian random seed). This is now a fully formal statement (a minimal sketch of what it looks like follows the list below), but I'm skeptical this helps much. Among other issues:

  1. It's probably not enough to only explain statements of this type to actualize all of ARC's plans.
  2. As I will explain in my next post, I'm skeptical that formalizing through catastrophe detectors actually helps much.
  3. If the AI agent whose behavior you want to explain sometimes uses Google search or interacts with humans (very realistic possibilities), you inherently can't reduce its behavior to formal statements.
  4. You need to start training your explanation during pre-training. ARC's vague hope is that the explanation target is why the model gets low loss on the (empirical) training set. What formally defined statement could be the explanation target during pre-training?
  5. Even if the input distribution, the agent and the catastrophe detector are all fully formal, you still need to deal with the capacity allocation problem. The formal input distribution is created by training a generative AI on real-world data. If you are just naively trying to create the highest quality explanation for why the AI agent never causes a catastrophe on the formally defined input distribution, you will probably waste a lot of resources on explaining why the generative AI that creates the inputs behaves the way it does, which makes you uncompetitive with ordinary agent training, as the agent doesn't need to understand all the deep causes underlying the input distribution.
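For concreteness, here is a minimal sketch of the kind of fully formal statement I have in mind. This is my own illustration, not ARC's actual setup: the `generator`, `agent`, and `detector` modules, the latent dimension, and the 0.5 threshold are all hypothetical placeholders, and naive Monte Carlo sampling like this only pins down the quantity to be explained rather than explaining anything about it.

```python
# Minimal sketch (my own illustration, not ARC's setup): how often does the agent
# trigger a neural-net catastrophe detector on inputs drawn from a formal distribution
# defined by a generative network with a Gaussian random seed?
# `generator`, `agent`, `detector`, the latent dimension, and the threshold are
# hypothetical placeholders for whatever networks are actually used.
import torch

def estimate_catastrophe_rate(generator, agent, detector,
                              n_samples: int = 10_000, latent_dim: int = 512) -> float:
    catastrophes = 0
    with torch.no_grad():
        for _ in range(n_samples):
            z = torch.randn(latent_dim)                 # Gaussian random seed
            x = generator(z)                            # formally defined input
            y = agent(x)                                # the agent's behavior on that input
            catastrophes += int(detector(x, y) > 0.5)   # catastrophe detector's verdict
    return catastrophes / n_samples
```

The statement to be explained is then "this estimate converges to (approximately) zero", which is fully formal, but, per the list above, I doubt that formality buys us what we need.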

I think this is a good and important post, but there was one point I felt was missing from it: what if the company, being caught in a race, not only wants to keep using their proven schemer model, but wants to continue its training to make it smarter, or to quickly build other smarter models with similar techniques? I think it's likely they will want to do that, and I think most of your recommendations in the post become very dubious if the scheming AI is continuously trained to be smarter.

Do you have recommendations on what to do if the company wants to train smarter AIs once they've caught a schemer? It's fair to answer that we don't have a plan for that and to ask them to please just not train smarter schemers, but then I think that should appear in the "If you are continuing to deploy known scheming models, my recommendation is" list of key recommendations.

It's unclear whether we can talk about "humans" in a simulation where logic works differently, but I don't know, it could work. I remain uncertain how feasible trades across logical counterfactuals will be; it's all very confusing.

Thanks for the reply, I broadly agree with your points here. I agree we should probably eventually try to do trades across logical counterfactuals. Decreasing logical risk is one good framing for that, but in general, there are just positive trades to be made.

However, I think you are still underestimating how hard it might be to strike these deals. "Be kind to other existing agents" is a natural idea to us, but it's still unclear to me whether it's something you should assign high probability to as a preference of logically counterfactual beings. Sure, there is enough room for humans and mosquitos, but if you relax 'agent' and 'existing', suddenly there is not enough room for everyone. You can argue that "be kind to existing agents" is plausibly a statement with relatively short description length, so it will be among the AI's first guesses, and the AI will allocate at least some fraction of the universe to it. But once you are trading across logical counterfactuals, I'm not sure you can trust things like description length. Maybe in the logically counterfactual universe, they assign higher value/probability to longer instead of shorter statements, but the measure still sums to 1, because math works differently.

Similarly, you argue that loving torture is probably rare, based on evolutionary grounds. But logically counterfactual beings weren't necessarily born through evolution. I have no idea how we should determine the distribution of logical counterfactuals, and I don't know what fraction enjoys torture in that distribution.

Altogether, I agree logical trade is eventually worth trying, but it will be very hard and confusing and I see a decent chance that it basically won't work at all.

Maybe your idea works too, it's an interesting concept, but I'm unsure. The crucial question always is how the AI is supposed to know who is creating the simulations, what the simulators' values might be, and with whom it should trade. In this logical counterfactual trade, who are the other "agents" that the AI is supposed to be nice to? Are rocks agents; should it preserve every rock in the Universe? Usually, I wouldn't be that worried about this, as I think 'agent' is a fairly natural concept that might even have some nice mathematical definition. But once the AI is assuming that the simulators might be screwing with its mind to distort logic itself, it can't really rely on that. And how should it know it is supposed to help 'agents' in the first place? And why only keep existing agents alive, rather than bring non-existing agents into existence? There are infinite options. I'm not sure that an AI in a state of total Cartesian doubt that extends to logic itself can decide that "leave some stuff to other already existing agents who have sublinear utility functions" is a particularly likely thing the simulators might want.

This objection applies to all simulation/acausal trade proposals to some extent. How can the AI know that its likely simulators are humans, so it should treat the humans kindly, instead of the octopi, or the rocks (because maybe the simulators are crystal-people who really care about no one breaking rocks in other Universes), or any of the myriad other possible values? 

This is a hard question, but my current view is that probably the reasonable way to do acausal trade is to first find a small Schelling-point distribution that most other winners in the distribution also agree is a Schelling point. For example, the first deal could be struck between the winners of the AI race (various human groups and unaligned AIs) in the different Everett branches that start branching from the beginning of the AI race, the day ChatGPT came out. Then, once we have made our deals with the winners from the different branches and learned to live together, our coalition can make deals with a broader Schelling-point distribution, and so on. We can go back deeper and deeper in history to do the Everett branching, and then we can make deals within distributions that are not within the quantum multiverse but are based on different types of empirical and eventually logical counterfactuals. These further-away trades likely become harder and less effective, because it becomes harder to find Schelling-point distributions that every branch agrees on, and because it becomes harder to guess or reliably simulate what the values of increasingly alien minds could be. (What should I do to keep my end of the trade deal and please the beings who have power in the logically counterfactual world where 2+2=5? I have no idea.)

(I largely got these views on acausal trade from a conversation with @Richard_Ngo, who writes about something like these broadening coalitions in this post. I think that post is not the best explainer of this concept though, and I wish he had written down the vision of broadening coalitions in more detail, or that I had time to write it down myself in more detail than this kind of confusing comment.)

There are many things I would write differently in my post now, but I still mostly stand by it, because it more or less proposes making deals between nearby Everett branches where humans and AIs win, and I think that's a workable proposal as a natural first step in the process of broadening acausal trade coalitions. On the other hand, your proposal immediately jumps to the end of the process, trying to make deals with beings in logically counterfactual universes. I'm nervous about that, because it might be very hard for the AIs to find the right distribution of counterfactual beings to make a deal with, and to figure out what the values of those beings might be.

 

Thanks for the reply. If you have time, I'm still interested in hearing what would be a realistic central example of non-concentrated failure that's good to imagine while reading the post.

This post was a very dense read, and it was hard for me to digest what the main conclusions were supposed to be. Could you write some concrete scenarios that you think are central examples of schemers causing non-concentrated failures? While reading the post, I never knew what situation to imagine: an AI doing philosophical alignment research but intentionally producing promising-looking crackpottery? Building cybersecurity infrastructure but leaving in a lot of vulnerabilities? Advising the President, but having a bias towards advocating for integrating AI into the military?

I think these problems are all pretty different in what approaches are promising in preventing them, so it would be useful to see what you think the most likely non-concentrated failures are, so we can read the post with that in mind.

As another point, it would really help if you wrote conclusion sections. There are a lot of different points made in the post, and it's hard to see which you consider most important to get across to the reader. A conclusion section would help a lot with that.

In general, I think that among all the people I know, you might have the biggest gap between how good you are at explaining concepts in person and how bad you are at communicating them in blog posts. (Strangely, your LW comments are also very good and digestible, more similar to your in-person communication than to your long-form posts; I don't know why.) I think it could be high leverage for you to experiment with making your posts more readable. Using more concrete examples and writing conclusion sections would go a long way towards improving your posts in general, but I felt compelled to comment here because this post was especially hard to read without them.

My strong guess is that OpenAI's results are real; it would really surprise me if they were literally cheating on the benchmarks. It looks like they are just using much more inference-time compute than is available to any outside user, and they use a clever scaffold that makes the model productively utilize the extra inference time. Elliot Glazer (creator of FrontierMath) says in a comment on my recent post on FrontierMath:

A quick comment: the o3 and o3-mini announcements each have two significantly different scores, one <= 10%, the other >= 25%. Our own eval of o3-mini (high) got a score of 11% (it's on Epoch's Benchmarking Hub). We don't actually know what the higher scores mean, could be some combination of extreme compute, tool use, scaffolding, majority vote, etc., but we're pretty sure there is no publicly accessible way to get that level of performance out of the model, and certainly not performance capable of "crushing IMO problems." 

I do have the reasoning traces from the high-scoring o3-mini run. They're extremely long, and one of the ways it leverages the higher resources is to engage in an internal dialogue where it does a pretty good job of catching its own errors/hallucinations and backtracking until it finds a path to a solution it's confident in. I'm still writing up my analysis of the traces and surveying the authors for their opinions on the traces, and will also update e.g. my IMO predictions with what I've learned.
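To illustrate one of the mechanisms Glazer lists (majority vote), here is a minimal sketch of one way extra inference-time compute can be spent. It is my own illustration, not OpenAI's actual scaffold; `sample_final_answer` is a hypothetical stand-in for a single model call that returns the model's final answer to a problem.

```python
# Minimal sketch of majority voting (self-consistency) over many samples, one way extra
# inference-time compute can raise benchmark scores. `sample_final_answer` is a
# hypothetical stand-in for one model call; this is not OpenAI's actual scaffold.
from collections import Counter

def majority_vote_answer(sample_final_answer, problem: str, n_samples: int = 64):
    # Draw many independent solutions and keep only their final answers
    # (assumed to be hashable, e.g. strings or numbers).
    answers = [sample_final_answer(problem) for _ in range(n_samples)]
    # The most frequently produced final answer wins the vote.
    return Counter(answers).most_common(1)[0][0]
```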

I like the idea of IMO-style releases: always collecting new problems, testing the AIs on them, then releasing them to the public. What do you think: how important is it to only have problems with numerical solutions? If you can test the AIs on problems requiring proofs, then there are already many competitions that regularly release high-quality problems. (I'm shilling KöMaL again as one that's especially close to my heart, but there are many good monthly competitions around the world.) I think if we instruct the AI to present its solution in one page at the end, it's not that hard to get an experienced competition grader to read the solution and score it according to the normal competition scoring, so the result won't be much less objective than if it were only numerical solutions. If you want to stick to problems with numerical solutions, I'm worried that you will have a hard time regularly assembling high-quality numerical problems again and again, and even if the problems are released publicly, people will have a harder time evaluating them than if they actually came from a competition where we can compare to the natural human baseline of the competing students.
