David Matolcsi

Sequences

Obstacles in ARC's agenda

Comments

I agree 4 and 5 are not really separate. The main point is that using formal input distributions for explanations just passes the buck to explaining things about the generative AI that defines the formal input distribution, and at some point something needs to have been trained on real data, and we need to explain behavior there.

Yes, this is part of the appeal of catastrophe detectors: we can make an entire interesting statement fully formal by asking how often a model causes a catastrophe (as defined by a neural net catastrophe detector) on a formal distribution (defined by a generative neural net with a Gaussian random seed). This is now a fully formal statement (a minimal sketch of what it looks like follows the list below), but I'm skeptical this helps much. Among other issues:

  1. It's probably not enough to only explain statements of this type to actualize all of ARC's plans.
  2. As I will explain in my next post, I'm skeptical that formalizing through catastrophe detectors actually helps much.
  3. If the AI agent whose behavior you want to explain sometimes uses Google search or interacts with humans (very realistic possibilities), you inherently can't reduce its behavior to formal statements.
  4. You need to start training your explanation during pre-training. ARC's vague hope is that the explanation target is why the model gets low loss on the (empirical) training set. What formally defined statement could be the explanation target during pre-training?
  5. Even if the input distribution, the agent and the catastrophe detector are all fully formal, you still need to deal with the capacity allocation problem. The formal input distribution is created by training a generative AI on real-world data. If you are just naively trying to create the highest quality explanation for why the AI agent never causes a catastrophe on the formally defined input distribution, you will probably waste a lot of resources on explaining why the generative AI that creates the inputs behaves the way it does, which makes you uncompetitive with ordinary agent training, as the agent doesn't need to understand all the deep causes underlying the input distribution.
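For concreteness, here is a minimal sketch of the kind of fully formal statement I have in mind. This is my own illustration, not ARC's actual setup: the `generator`, `agent`, and `detector` modules, the latent dimension, and the 0.5 threshold are all hypothetical placeholders, and naive Monte Carlo sampling like this only pins down the quantity to be explained rather than explaining anything about it.

```python
# Minimal sketch (my own illustration, not ARC's setup): how often does the agent
# trigger a neural-net catastrophe detector on inputs drawn from a formal distribution
# defined by a generative network with a Gaussian random seed?
# `generator`, `agent`, `detector`, the latent dimension, and the threshold are
# hypothetical placeholders for whatever networks are actually used.
import torch

def estimate_catastrophe_rate(generator, agent, detector,
                              n_samples: int = 10_000, latent_dim: int = 512) -> float:
    catastrophes = 0
    with torch.no_grad():
        for _ in range(n_samples):
            z = torch.randn(latent_dim)                 # Gaussian random seed
            x = generator(z)                            # formally defined input
            y = agent(x)                                # the agent's behavior on that input
            catastrophes += int(detector(x, y) > 0.5)   # catastrophe detector's verdict
    return catastrophes / n_samples
```

The statement to be explained is then "this estimate converges to (approximately) zero", which is fully formal, but, per the list above, I doubt that formality buys us what we need.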

I think this is a good and important post, but there was one point I felt was missing from it: what if the company, being caught in a race, not only wants to keep using their proven schemer model, but wants to continue its training to make it smarter, or to quickly build other smarter models with similar techniques? I think it's likely they will want to do that, and I think most of your recommendations in the post become very dubious if the scheming AI is continuously trained to be smarter.

Do you have recommendations on what to do if the company wants to train smarter AIs once they've caught a schemer? It's fair to answer that we don't have a plan for that and to ask them to please just not train smarter schemers, but then I think that should appear in the "If you are continuing to deploy known scheming models, my recommendation is" list of key recommendations.

It's unclear whether we can talk about "humans" in a simulation where logic works differently, but I don't know, it could work. I remain uncertain how feasible trades across logical counterfactuals will be; it's all very confusing.

Thanks for the reply, I broadly agree with your points here. I agree we should probably eventually try to do trades across logical counterfactuals. Decreasing logical risk is one good framing for that, but in general, there are just positive trades to be made.

However, I think you are still underestimating how hard it might be to strike these deals. "Be kind to other existing agents" is a natural idea to us, but it's still unclear to me whether it's something you should assign high probability to as a preference of logically counterfactual beings. Sure, there is enough room for humans and mosquitos, but if you relax 'agent' and 'existing', suddenly there is not enough room for everyone. You can argue that "be kind to existing agents" is plausibly a statement with relatively short description length, so it will be among the AI's first guesses, and the AI will allocate at least some fraction of the universe to it. But once you are trading across logical counterfactuals, I'm not sure you can trust things like description length. Maybe in the logically counterfactual universe, they assign higher value/probability to longer instead of shorter statements, but the measure still sums to 1, because math works differently.

Similarly, you argue that loving torture is probably rare, based on evolutionary grounds. But logically counterfactual beings weren't necessarily born through evolution. I have no idea how we should determine the distribution of logical counterfactuals, and I don't know what fraction enjoys torture in that distribution.

Altogether, I agree logical trade is eventually worth trying, but it will be very hard and confusing and I see a decent chance that it basically won't work at all.

Maybe your idea works too, it's an interesting concept, but I'm unsure. The crucial question always is how the AI is supposed to know who is creating the simulations, what the simulators' values might be, and with whom it should trade. In this logical counterfactual trade, who are the other "agents" that the AI is supposed to be nice to? Are rocks agents; should it preserve every rock in the Universe? Usually, I wouldn't be that worried about this, as I think 'agent' is a fairly natural concept that might even have some nice mathematical definition. But once the AI is assuming that the simulators might be screwing with its mind to distort logic itself, it can't really rely on that. And how should it know it is supposed to help 'agents' in the first place? And why only keep existing agents alive, rather than bring non-existing agents into existence? There are infinite options. I'm not sure that an AI in a state of total Cartesian doubt that extends to logic itself can decide that "leave some stuff to other already existing agents who have sublinear utility functions" is a particularly likely thing the simulators might want.

This objection applies to all simulation/acausal trade proposals to some extent. How can the AI know that its likely simulators are humans, so it should treat the humans kindly, instead of the octopi, or the rocks (because maybe the simulators are crystal-people who really care about no one breaking rocks in other Universes), or any of the myriad other possible values? 

This is a hard question, but my current view is that probably the reasonable way to do acausal trade is to first find a small Schelling-point distribution that most other winners in the distribution also agree is a Schelling point. For example, the first deal could be struck between the winners of the AI race (various human groups and unaligned AIs) in the different Everett branches that start branching from the beginning of the AI race, the day ChatGPT came out. Then, once we have made our deals with the winners from the different branches and learned to live together, our coalition can make deals with a broader Schelling-point distribution, and so on. We can go back deeper and deeper in history to do the Everett branching, and then we can make deals within distributions that are not within the quantum multiverse but are based on different types of empirical and eventually logical counterfactuals. These further-away trades likely become harder and less effective, because it becomes harder to find Schelling-point distributions that every branch agrees on, and because it becomes harder to guess or reliably simulate what the values of increasingly alien minds could be. (What should I do to keep my end of the trade deal and please the beings who have power in the logically counterfactual world where 2+2=5? I have no idea.)

(I largely got these views on acausal trade from a conversation with @Richard_Ngo, who writes about something like these broadening coalitions in this post. I think that post is not the best explainer of this concept though, and I wish he had written down the vision of broadening coalitions in more detail, or that I had time to write it down myself in more detail than this kind of confusing comment.)

There are many things I would write differently in my post now, but I still mostly stand by it, because it more or less proposes making deals between nearby Everett branches where humans and AIs win, and I think that's a workable proposal as a natural first step in the process of broadening acausal trade coalitions. On the other hand, your proposal immediately jumps to the end of the process, trying to make deals with beings in logically counterfactual universes. I'm nervous about that, because it might be very hard for the AIs to find the right distribution of counterfactual beings to make a deal with, and to figure out what the values of those beings might be.

 

Thanks for the reply. If you have time, I'm still interested in hearing what would be a realistic central example of non-concentrated failure that's good to imagine while reading the post.

This post was a very dense read, and it was hard for me to digest what the main conclusions were supposed to be. Could you write some concrete scenarios that you think are central examples of schemers causing non-concentrated failures? While reading the post, I never knew what situation to imagine: an AI doing philosophical alignment research but intentionally producing promising-looking crackpottery? Building cybersecurity infrastructure but leaving in a lot of vulnerabilities? Advising the President, but having a bias towards advocating for integrating AI into the military?

I think these problems are all pretty different in what approaches are promising in preventing them, so it would be useful to see what you think the most likely non-concentrated failures are, so we can read the post with that in mind.

As another point, it would really help if you wrote conclusion sections. There are a lot of different points made in the post, and it's hard to see which you consider most important to get across to the reader. A conclusion section would help a lot with that.

In general, I think that among all the people I know, you might have the biggest gap between how good you are at explaining concepts in person and how bad you are at communicating them in blog posts. (Strangely, your LW comments are also very good and digestible, more similar to your in-person communication than to your long-form posts; I don't know why.) I think it could be high leverage for you to experiment with making your posts more readable. Using more concrete examples and writing conclusion sections would go a long way towards improving your posts in general, but I felt compelled to comment here because this post was especially hard to read without them.

My strong guess is that OpenAI's results are real; it would really surprise me if they were literally cheating on the benchmarks. It looks like they are just using much more inference-time compute than is available to any outside user, and they use a clever scaffold that makes the model productively utilize the extra inference time. Elliot Glazer (creator of FrontierMath) says in a comment on my recent post on FrontierMath:

A quick comment: the o3 and o3-mini announcements each have two significantly different scores, one <= 10%, the other >= 25%. Our own eval of o3-mini (high) got a score of 11% (it's on Epoch's Benchmarking Hub). We don't actually know what the higher scores mean, could be some combination of extreme compute, tool use, scaffolding, majority vote, etc., but we're pretty sure there is no publicly accessible way to get that level of performance out of the model, and certainly not performance capable of "crushing IMO problems." 

I do have the reasoning traces from the high-scoring o3-mini run. They're extremely long, and one of the ways it leverages the higher resources is to engage in an internal dialogue where it does a pretty good job of catching its own errors/hallucinations and backtracking until it finds a path to a solution it's confident in. I'm still writing up my analysis of the traces and surveying the authors for their opinions on the traces, and will also update e.g. my IMO predictions with what I've learned.
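To illustrate one of the mechanisms Glazer lists (majority vote), here is a minimal sketch of one way extra inference-time compute can be spent. It is my own illustration, not OpenAI's actual scaffold; `sample_final_answer` is a hypothetical stand-in for a single model call that returns the model's final answer to a problem.

```python
# Minimal sketch of majority voting (self-consistency) over many samples, one way extra
# inference-time compute can raise benchmark scores. `sample_final_answer` is a
# hypothetical stand-in for one model call; this is not OpenAI's actual scaffold.
from collections import Counter

def majority_vote_answer(sample_final_answer, problem: str, n_samples: int = 64):
    # Draw many independent solutions and keep only their final answers
    # (assumed to be hashable, e.g. strings or numbers).
    answers = [sample_final_answer(problem) for _ in range(n_samples)]
    # The most frequently produced final answer wins the vote.
    return Counter(answers).most_common(1)[0][0]
```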

I like the idea of IMO-style releases: always collecting new problems, testing the AIs on them, then releasing them to the public. What do you think: how important is it to only have problems with numerical solutions? If you can test the AIs on problems requiring proofs, then there are already many competitions that regularly release high-quality problems. (I'm shilling KöMaL again as one that's especially close to my heart, but there are many good monthly competitions around the world.) I think if we instruct the AI to present its solution in one page at the end, it's not that hard to get an experienced competition grader to read the solution and score it according to the normal competition scoring, so the result won't be much less objective than if it were only numerical solutions. If you want to stick to problems with numerical solutions, I'm worried that you will have a hard time regularly assembling high-quality numerical problems again and again, and even if the problems are released publicly, people will have a harder time evaluating them than if they actually came from a competition where we can compare to the natural human baseline of the competing students.
