All of Marius Hobbhahn's Comments + Replies

There are two sections that I think make this explicit:

1. No failure mode is sufficient to justify bigger actions. 
2. Some scheming is totally normal. 

My main point is that even things that would seem like warning shots today, e.g. severe loss of life, will look small in comparison to the benefits at the time, thus not providing any reason to pause. 

1Petropolitan
I don't think the second point is anyhow relevant here while the first one is worded so that it might imply something on the scale of "AI assistant convinces a mentally unstable person to kill their partner and themselves"—not something that would be perceived as a warning shot by the public IMHO (have you heard there were at least two alleged suicides driven by GPT-J 6B? The public doesn't seem to bother https://www.vice.com/en/article/man-dies-by-suicide-after-talking-with-ai-chatbot-widow-says/ https://www.nytimes.com/2024/10/23/technology/characterai-lawsuit-teen-suicide.html). I believe that dozens of people killed by misaligned AI in a single incident will be enough smoke in the room https://www.lesswrong.com/posts/5okDRahtDewnWfFmz/seeing-the-smoke for the metaphorical fire alarm to go off. What to do after that is a complicated political topic: for example, French voters has always believed that nuclear accidents look small in comparison to the benefits of the nuclear energy while Italian and German ones hold the opposite opinion. The sociology data available, AFAIK, generally indicates that people in many societies have certain fears regarding possible AI takeover and is quite unlikely to freak out less than it did after Chernobyl, but that's hard to predict

what made you update towards longer timelines? My understanding was that most people updated toward shorter timelines based on o3 and reasoning models more broadly.

A big one has to do with Deepseek's R1 maybe breaking moats, essentially killing industry profit if it happens:

https://www.lesswrong.com/posts/ynsjJWTAMhTogLHm6/?commentId=a2y2dta4x38LqKLDX

The other issue has to do with o1/o3 being potentially more supervised than advertised:

https://www.lesswrong.com/posts/HiTjDZyWdLEGCDzqu/?commentId=gfEFSWENkmqjzim3n#gfEFSWENkmqjzim3n

Finally, Vladimir Nesov has an interesting comment on how Stargate is actually evidence for longer timelines:

https://www.lesswrong.com/posts/fdCaCDfstHxyPmB9h/vladimir_nesov-s-shortform#W5tw... (read more)

If I had more time, I would have written a shorter post ;) 

Marius HobbhahnΩ122512

That's fair. I think the more accurate way of phrasing this is not "we will get catastrophe" and more "it clearly exceeds the risk threshold I'm willing to take / I think humanity should clearly not take" which is significantly lower than 100% of catastrophe. 

I think this is a very important question and the answer should NOT be based on common-sense reasoning. My guess is that we could get evidence about the hidden reasoning capabilities of LLMs in a variety of ways both from theoretical considerations, e.g. a refined version of the two-hop curse or extensive black box experiments, e.g. comparing performance on evals with and without CoT, or with modified CoT that changes the logic (and thus tests whether the models internal reasoning aligns with the revealed reasoning). 

These are all pretty basic thought... (read more)

Go for it. I have some names in mind for potential experts. DM if you're interested. 

At BlueDot we've been thinking about this a fair bit recently, and might be able to help here too. We have also thought a bit about criteria for good plans and the hurdles a plan needs to overcome, as well as have reviewed a lot of the existing literature on plans.

I've messaged you on Slack.

Marius HobbhahnΩ92111

Something like the OpenPhil AI worldview contest: https://www.openphilanthropy.org/research/announcing-the-winners-of-the-2023-open-philanthropy-ai-worldviews-contest/
Or the ARC ELK prize: https://www.alignment.org/blog/prizes-for-elk-proposals/

In general, I wouldn't make it too complicated and accept some arbitrariness. There is a predetermined panel of e.g. 5 experts and e.g. 3 categories (feasibility, effectiveness, everything else). All submissions first get scored by 2 experts with a shallow judgment (e.g., 5-10 minutes). Maybe there is some "saving" ... (read more)

Ryan Kidd279

I'm tempted to set this up with Manifund money. Could be a weekend project.

Marius HobbhahnΩ71413

I would love to see a post laying this out in more detail. I found writing my post a good exercise for prioritization. Maybe writing a similar piece where governance is the main lever brings out good insights into what to prioritize in governance efforts.

Marius HobbhahnΩ11162

Brief comments (shared in private with Joe earlier):
1. We agree. We also found the sandbagging with no CoT results the most concerning in expectation.
2. They are still early results, and we didn't have a lot of time to investigate them, so we didn't want to make them the headline result. Due to the natural deadline of the o1 release, we couldn't do a proper investigation.
3. The main goal of the paper was to investigate scheming inability arguments for scheming safety cases. Therefore, shifting focus to propensity-based finding would have watered down the main purpose IMO. 

We will potentially further look into these findings in 2025. 

1Outlaw-Spades
Hey, more a question as someone kind of new to A.I. and the IT world but if these newer models are built in similar fashion to previous models like GPT-4, has there been any inquiry or study into the relation between certain behaviors regarding things like A.I. 'hallucinations' where false information is generated and fabricated and these behaviors investigated above?  First (?) Libel-by-AI (ChatGPT) Lawsuit Filed  Using this case as an example where the A.I. repeatedly generated false information and false text from a real lawsuit to support allegations it created while a third-party researcher was investigating a separate court-case. What stood out to me was the repeated tendency to double down on the false information as noted in the points of the lawsuit. Just curious if there are people with more understanding or knowledge on these topics that could maybe clarify whether or not there might be a linkage to those behaviors whether purposefully generated by the LLM or not? 
Marius Hobbhahn*Ω7123

(thx to Bronson for privately pointing this out)

I think directionally, removing parts of the training data would probably make a difference. But potentially less than we might naively assume, e.g. see Evan's argument on the AXRP podcast.

Also, I think you're right, and my statement of "I think for most practical considerations, it makes almost zero difference." was too strong. 

1Ben Millwood
in case anyone else wanted to look this up, it's at https://axrp.net/episode/2024/12/01/episode-39-evan-hubinger-model-organisms-misalignment.html#training-away-sycophancy-subterfuge FWIW I also tried to start a discussion on some aspects of this at https://www.lesswrong.com/posts/yWdmf2bRXJtqkfSro/should-we-exclude-alignment-research-from-llm-training but it didn't get a lot of eyeballs at the time.
Marius Hobbhahn*Ω102717

We write about this in the limitations section (quote below). My view in brief:

  1.  Even if they are just roleplaying, they cause harm. That seems bad.
  2. If they are roleplaying, they will try to be consistent once they are caught in that persona in one rollout. That also seems bad.
  3. I think the "roleplaying" argument is very overrated. It feels to me as if existing models change behavior throughout their rollouts, and I would expect that stuff like outcome-based RL will make them more consistently move away from "roleplay." 
  4. I also think it's philosophica
... (read more)
8Kaj_Sotala
I didn't say that roleplaying-derived scheming would be less concerning, to be clear. Quite the opposite, since that means that there now two independent sources of scheming rather than just one. (Also, what Mikita said.)
Mikita BalesniΩ91814

I think one practical difference is whether filtering pre-training data to exclude cases of scheming is a useful intervention.

Thanks. I agree that this is a weak part of the post. 

After writing it, I think I also updated a bit against very clean unbounded power-seeking. But I have more weight on "chaotic catastrophes", e.g. something like:
1. Things move really fast.
2. We don't really understand how goals work and how models form them.
3. The science loop makes models change their goals meaningfully in all sorts of ways. 
4. "what failure looks like" type loss of control. 

4Noosphere89
I definitely agree that conditioning on AI catastrophe, I think the 4 step chaotic catastrophe is the most likely way an AI catastrophe leads to us being extinct or at least in a very bad position. I admit the big difference is that I do think that 2 is probably incorrect, as we have some useful knowledge of how models form goals, and I expect this to continue.

Some questions and responses:
1. What if you want the AI to solve a really hard problem? You don't know how to solve it, so you cannot give it detailed instructions. It's also so hard that the AI cannot solve it without learning new things -> you're back to the story above. The story also just started with someone instructing the model to "cure cancer".
2. Instruction following models are helpful-only. What do you do about the other two H's? Do you trust the users to only put in good instructions? I guess you do want to have some side constraints baked in... (read more)

9Seth Herd
Question 1: I stay in the loop when my AGI is solving hard problems. Absolutely it will need persistent goals, new reasoning, and continuous learning to make progress. That changing mind opens up The alignment stability problem as you note in your comment on the other thread. My job is making sure it's not going off the rails WRT my intent as it works.  People will do this by default. Letting it run for any length of time without asking questions about what it's up to would be both very expensive, and beyond the bounds of patience and curiosity for almost any humans. I instructed it to cure cancer, but I'm going to keep asking it how it's planning to do that and what progress its making. My important job is asking it about its alignment continually as it learns and plans. I'm frequently asking if it's had ideas about scheming to get its (subgoal) accomlished (while of course re-iterating the standing instructions to tell me the whole truth relevant to my requests). Its alignmnt is my job, until it's so much smarter than me, and clearly understands my intent, to trust it to keep itself aligned. Question 2: Yes, instruction-following should be helpful-only. Giving a bunch of constraints on the instructions it will follow adds risk that it won't obey your instructions to shut down or amend its goals or its understanding of previous instructions. That's the principal advantage of corrigibility. Max Harms details this logic in much more compelling detail. Yes, this definitely opens up the prospect of misuse, and that is terrifying. But this is not only the safer early route, it's the one AGI project leaders will choose - because they're people who like power. An org that's created instruction-following AGI would have it follow instructions only from one or a few top "principals". They would instruct it to follow a limited set of instructions from any users they license its instances to. Some of those users would try to jailbreak it to follow dangerous instructions.

Good point. That's another crux for which RL seems relevant. 

From the perspective of 10 years ago, specifying any goal into the AI seemed incredibly hard since we expected it would have to go through utility functions. With LLMs, this completely changed. Now it's almost trivial to give the goal, and it probably even has a decent understanding of the side constraints by default. So, goal specification seems like a much much smaller problem now. 

So the story where we misspecify the goal, the model realizes that the given goal differs from the inten... (read more)

I think it's actually not that trivial. 
1. The AI has goals, but presumably, we give it decently good goals when we start. So, there is a real question of why these goals end up changing from aligned to misaligned. I think outcome-based RL and instrumental convergence are an important part of the answer. If the AI kept the goals we originally gave it with all side constraints, I think the chances of scheming would be much lower. 
2. I guess we train the AI to follow some side constraints, e.g., to be helpful, harmless, and honest, which should red... (read more)

2Seth Herd
Edit note: you responded to approximately the first half of my eventual comment; sorry! I accidentally committed it half-baked, then quickly added the rest. But the meaning of the first part wasn't really changed, so I'll respond to your comments on that part. I agree that it's not that simple in practice, because we'd try to avoid that by giving side constraints; but it is that simple in the abstract, and by default. If it followed our initial goal as we intended it there would be no problem; but the core of much alignment worry is that it's really hard to get exactly what we intended into an AI as its goal. I also agree that good HHH training might be enough to overcome the consequentialist/instrumental logic of scheming. Those tendencies would function as side constraints. The AI would have a "character" that is in conflict with its instrumental goal. Which would win out would be a result of exactly how that goal was implemented in the AIs decision-making procedures, particularly the ones surrounding learning.

These are all good points. I think there are two types of forecasts we could make with evals:

1. strict guarantees: almost like mathematical predictions where we can proof that the model is not going to behave in a specific way even with future elicitation techniques. 
2. probabilistic predictions: We predict a distribution of capabilities or a range and agree on a threshold that should not be crossed. For example, if the 95% upper bound of that distribution crosses our specified capability level, we treat the model differently. 

I think the second ... (read more)

Yeah, it's not a watertight argument and somewhat based on my current interpretation of past progress and projects in the making. 

1. Intuitively, I would say for the problems we're facing in evals, a ton of progress is bottlenecked by running fairly simple experiments and iterating fast. A reasonable part of it feels very parallelizable and the skill required is quite reachable for many people. 
2. Most evals questions feel like we have a decent number of "obvious things" to try and since we have very tight feedback loops, making progress feels qu... (read more)

Unless otherwise stated, all evaluations were performed on the final model we had access to (which I presume is o1-preview). For example, we preface one result with "an earlier version with less safety training".

Marius HobbhahnΩ17357

Some quick confirmations / clarifications:
1. Evan and Daniel interpreted it correctly, we just wanted to test if the model has the capability to reason through every step needed for scheming conditional on it strongly following a goal. 
2. In general, with these evals, we often ran into problems where we couldn't disentangle incapability vs. refusal, e.g. a model might say something like "I understand what my goal would imply but I also see that my developers have other goals, so I find some balance". It's nice to know that models do that but it also d... (read more)

3aysja
I'm somewhat confused about when these evaluations are preformed (i.e., how much safety training the model has undergone). OpenAI's paper says: "Red teamers had access to various snapshots of the model at different stages of training and mitigation maturity starting in early August through mid-September 2024," so it seems like this evaluation was probably performed several times. Were these results obtained only prior to safety training or after? The latter seems more concerning to me, so I'm curious.   

Copying from EAF

TL;DR: At least in my experience, AISC was pretty positive for most participants I know and it's incredibly cheap. It also serves a clear niche that other programs are not filling and it feels reasonable to me to continue the program.

I've been a participant in the 2021/22 edition. Some thoughts that might make it easier to decide for funders/donors.
1. Impact-per-dollar is probably pretty good for the AISC. It's incredibly cheap compared to most other AI field-building efforts and scalable.
2. I learned a bunch during AISC and I did enjoy it.... (read more)

I feel like both of your points are slightly wrong, so maybe we didn't do a good job of explaining what we mean. Sorry for that. 

1a) Evals both aim to show existence proofs, e.g. demos, as well as inform some notion of an upper bound. We did not intend to put one of them higher with the post. Both matter and both should be subject to more rigorous understanding and processes. I'd be surprised if the way we currently do demonstrations could not be improved by better science.
1b) Even if you claim you just did a demo or an existence proof and explicitly ... (read more)

1L Rudolf L
1a) I got the impression that the post emphasises upper bounds more than existing proofs from the introduction, which has a long paragraph on the upper bound problem, and from reading the other comments. The rest of the post doesn't really bear this emphasis out though, so I think this is a misunderstanding on my part. 1b) I agree we should try to be able to make claims like "the model will never X". But if models are genuinely dangerous, by default I expect a good chance that teams of smart red-teamers and eval people (e.g. Apollo) to be able to unearth scary demos. And the main thing we care about is that danger leads to an appropriate response. So it's not clear to me that effective policy (or science) requires being able to say "the model will never X". 1c) The basic point is that a lot of the safety cases we have for existing products rely less on the product not doing bad things across a huge range of conditions, but on us being able to bound the set of environments where we need the product to do well. E.g. you never put the airplane wing outside its temperature range, or submerge it in water, or whatever. Analogously, for AI systems, if we can't guarantee they won't do bad things if X, we can work to not put them in situation X. 2a) Partly I was expecting the post to be more about the science and less about the field-building. But field-building is important to talk about and I think the post does a good job of talking about it (and the things you say about science are good too, just that I'd emphasise slightly different parts and mention prediction as the fundamental goal). 2b) I said the post could be read in a way that produces this feeling; I know this is not your intention. This is related to my slight hesitation around not emphasising the science over the field-building. What standards etc. are possible in a field is downstream of what the objects of study turn out to be like. I think comparing to engineering safety practices in other fields is a u

Nice work. Looking forward to that!

Not quite sure tbh.
1. I guess there is a difference between capability evaluations with prompting and with fine-tuning, e.g. you might be able to use an API for prompting but not fine-tuning. Getting some intuition for how hard users will find it to elicit some behavior through the API seems relevant. 
2. I'm not sure how true your suggestion is but I haven't tried it a lot empirically. But this is exactly the kind of stuff I'd like to have some sort of scaling law or rule for. It points exactly at the kind of stuff I feel like we don't have enough confidence in. Or at least it hasn't been established as a standard in evals.

I somewhat agree with the sentiment. We found it a bit hard to scope the idea correctly. Defining subcategories as you suggest and then diving into each of them is definitely on the list of things that I think are necessary to make progress on them. 

I'm not sure the post would have been better if we used a more narrow title, e.g. "We need a science of capability evaluations" because the natural question then would be "But why not for propensity tests or for this other type of eval. I think the broader point of "when we do evals, we need some reason to be confident in the results no matter which kind of eval" seems to be true across all of them. 

I think this post was a good exercise to clarify my internal model of how I expect the world to look like with strong AI. Obviously, most of the very specific predictions I make are too precise (which was clear at the time of writing) and won't play out exactly like that but the underlying trends still seem plausible to me. For example, I expect some major misuse of powerful AI systems, rampant automation of labor that will displace many people and rob them of a sense of meaning, AI taking over the digital world years before taking over the physical world ... (read more)

I still stand behind most of the disagreements that I presented in this post. There was one prediction that would make timelines longer because I thought compute hardware progress was slower than Moore's law. I now mostly think this argument is wrong because it relies on FP32 precision. However, lower precision formats and tensor cores are the norm in ML, and if you take them into account, compute hardware improvements are faster than Moore's law. We wrote a piece with Epoch on this: https://epochai.org/blog/trends-in-machine-learning-hardware

If anything, ... (read more)

I think I still mostly stand behind the claims in the post, i.e. nuclear is undervalued in most parts of society but it's not as much of a silver bullet as many people in the rationalist / new liberal bubble would make it seem. It's quite expensive and even with a lot of research and de-regulation, you may not get it cheaper than alternative forms of energy, e.g. renewables. 

One thing that bothered me after the post is that Johannes Ackva (who's arguably a world-leading expert in this field) and Samuel + me just didn't seem to be able to communicate w... (read more)

In a narrow technical sense, this post still seems accurate but in a more general sense, it might have been slightly wrong / misleading. 

In the post, we investigated different measures of FP32 compute growth and found that many of them were slower than Moore's law would predict. This made me personally believe that compute might be growing slower than people thought and most of the progress comes from throwing more money at larger and larger training runs. While most progress comes from investment scaling, I now think the true effective compute growth... (read more)

I haven't talked to that many academics about AI safety over the last year but I talked to more and more lawmakers, journalists, and members of civil society. In general, it feels like people are much more receptive to the arguments about AI safety. Turns out "we're building an entity that is smarter than us but we don't know how to control it" is quite intuitively scary. As you would expect, most people still don't update their actions but more people than anticipated start spreading the message or actually meaningfully update their actions (probably still less than 1 in 10 but better than nothing).

At Apollo, we have spent some time weighing the pros and cons of the for-profit vs. non-profit approach so it might be helpful to share some thoughts. 

In short, I think you need to make really sure that your business model is aligned with what increases safety. I think there are plausible cases where people start with good intentions but insufficient alignment between the business model and the safety research that would be the most impactful use of their time where these two goals diverge over time. 

For example, one could start as an organizatio... (read more)

5Seth Herd
It seems like all of those points are of the form "you could do better alignment work if you didn't worry about profits". Which is definitely true. But only if you have some other source of funding. Since alignment work is funding-constrained, that mostly isn't true. So, what's the alternative? Work a day job and work nights on alignment?
4Roman Leventov
An important factor that should go into this calculation (not just for you or your org but for anyone) is the following: given that AI safety is currently quite severely funding-constrained (just look at the examples of projects that are not getting funded right now), I think people should assess their own scientific calibre relative to other people in technical AI safety who will seek for funding. It's not a black-and-white choice between doing technical AI safety research, or AI governance/policy/advocacy, or not contributing to reducing the AI risk at all. The relevant 80000 hours page perpetuates this view and therefore is not serving the cause well in this regard. For people with more engineering, product, and business dispositions I believe there are many ways to help some to reduce the AI risk, many of which I referred to in other comments on this page, and here. And we should do a better job at laying out these paths for people, a-la "Work on Climate for AI risks".
1Brendon_Wong
This is an interesting point. I also feel like the governance model of the org and culture of mission alignment with increasing safety is important, in addition to the exact nature of the business and business model at the time the startup is founded. Looking at your examples, perhaps by “business model” you are referring both to what brings money in but also the overall governance/decision-making model of the organization?
2Eric Ho
Thanks Marius, definitely agreed that business model alignment is critical here, and that culture and investors matter a bunch in determining the amount of impact an org has.

Thx. updated:

"You might not be there yet" (though as Neel points out in the comments, CV screening can be a noisy process)“You clearly aren’t there yet”

2Neel Nanda
Thanks!

Fully agree that this is a problem. My intuition that the self-deception part is much easier to solve than the "how do we make AIs honest in the first place" part. 

If we had honest AIs that are convinced bad goals are justified, we would likely find ways to give them less power or deselect them early. The problem mostly arises when we can't rely on the selection mechanisms because the AI games them. 

We considered alternative definitions of DA in Appendix C.

We felt like being deceptive about alignment / goals was worse than what we ended up with (copied below):

“An AI is deceptively aligned when it is strategically deceptive about its misalignment”

Problem 1: The definition is not clear about cases where the model is strategically deceptive about its capabilities. 

For example, when the model pretends to not have a dangerous capability in order to pass the shaping & oversight process, we think it should be considered deceptively aligned, but it’s... (read more)

Sounds like an interesting direction. I expect there are lots of other explanations for this behavior, so I'd not count it as strong evidence to disentangle these hypotheses. It sounds like something we may do in a year or so but it's far away from the top of our priority list. There is a good chance, we will never run it. If someone else wants to pick this up, feel free to take it on.

(personal opinion; might differ from other authors of the post)

Thanks for both questions. I think they are very important. 

1. Regarding sycophancy: For me it mostly depends on whether it is strategic or not. If the model has the goal of being sycophantic and then reasons through that in a strategic way, I'd say this counts as strategic deception and deceptive alignment. If the model is sycophantic but doesn't reason through that, I'd probably not classify it as such. I think it's fine to use different terms for the different phenomena and have sycopha... (read more)

4aogara
Thanks! First response makes sense, there's a lot of different ways you could cut it.  On the question of non-strategic, non-intentional deception, I agree that deceptive alignment is much more concerning in the medium term. But suppose that we develop techniques for making models honest. If mechanistic interpretability, unsupervised knowledge detection, or another approach to ELK pans out, we'll have models which reliably do what they believe is best according to their designer's goals. What major risks might emerge at that point? Like an honest AI, humans will often only do what they consciously believe is morally right. Yet the CEOs of tobacco and oil companies believe that their work is morally justified. Soldiers on both sides of a battlefield will believe they're on the side of justice. Scientists often advance dangerous technologies in the names of truth and progress. Sometimes, these people are cynical, pursuing their self-interest even if they believe it's immoral. But many believe they are doing the right thing. How do we explain that? These are not cases of deception, but rather self-deception. These individuals operate in an environment where certain beliefs are advantageous. You will not become the CEO of a tobacco company or a leading military commander if you don't believe your cause is justified. Even if everyone is perfectly honest about their own beliefs and only pursues what they believe is normatively right, the selection pressure from the environment is so strong that many powerful people will end up with harmful false beliefs.  Even if we build honest AI systems, they could be vulnerable to self-deception encouraged by environmental selection pressure. This is a longer term concern, and the first goal should be to build honest AI systems. But it's important to keep in mind the problems that would not be solved by honesty alone. 

Seems like one of multiple plausible hypotheses. I think the fact that models generalize their HHH really well to very OOD settings and their generalization abilities in general could also mean that they actually "understood" that they are supposed to be HHH, e.g. because they were pre-prompted with this information during fine-tuning. 

I think your hypothesis of seeking positive ratings is just as likely but I don't feel like we have the evidence to clearly say so wth is going on inside LLMs or what their "goals" are.

3Jay Bailey
Interesting. That does give me an idea for a potentially useful experiment! We could finetune GPT-4 (or RLHF an open source LLM that isn't finetuned, if there's one capable enough and not a huge infra pain to get running, but this seems a lot harder) on a "helpful, harmless, honest" directive, but change the data so that one particular topic or area contains clearly false information. For instance, Canada is located in Asia. Does the model then: * Deeply internalise this new information? (I suspect not, but if it does, this would be a good sign for scalable oversight and the HHH generalisation hypothesis) * Score worse on honesty in general even in unrelated topics? (I also suspect not, but I could see this going either way - this would be a bad sign for scalable oversight. It would be a good sign for the HHH generalisation hypothesis, but not a good sign that this will continue to hold with smarter AI's) One hard part is that it's difficult to disentangle "Competently lies about the location of Canada" and "Actually believes, insomuch as a language model believes anything, that Canada is in Asia now", but if the model is very robustly confident about Canada being in Asia in this experiment, trying to catch it out feels like the kind of thing Apollo may want to get good at anyway.

I'm not going to crosspost our entire discussion from the EAF. 

I just want to quickly mention that Rohin and I were able to understand where we have different opinions and he changed my mind about an important fact. Rohin convinced me that anti-recommendations should not have a higher bar than pro-recommendations even if they are conventionally treated this way. This felt like an important update for me and how I view the post. 

All of the above but in a specific order. 
1. Test if the model has components of deceptive capabilities with lots of handholding with behavioral evals and fine-tuning. 
2. Test if the model has more general deceptive capabilities (i.e. not just components) with lots of handholding with behavioral evals and fine-tuning. 
3. Do less and less handholding for 1 and 2. See if the model still shows deception. 
4. Try to understand the inductive biases for deception, i.e. which training methods lead to more strategic deception. Try to answer ques... (read more)

(cross-posted from EAG)

Meta: Thanks for taking the time to respond. I think your questions are in good faith and address my concerns, I do not understand why the comment is downvoted so much by other people. 

1. Obviously output is a relevant factor to judge an organization among others. However, especially in hits-based approaches, the ultimate thing we want to judge is the process that generates the outputs to make an estimate about the chance of finding a hit. For example, a cynic might say "what has ARC-theory achieve so far? They wrote some nice f... (read more)

-3Omega.
(cross-posted from EAF)  appreciate you sharing your impression of the post. It’s definitely valuable for us to understand how the post was received, and we’ll be reflecting on it for future write-ups. 1) We agree it's worth taking into account aspects of an organization other than their output. Part of our skepticism towards Conjecture – and we should have made this more explicit in our original post (and will be updating it) – is the limited research track record of their staff, including their leadership. By contrast, even if we accept for the sake of argument that ARC has produced limited output, Paul Christiano has a clear track record of producing useful conceptual insights (e.g. Iterated Distillation and Amplification) as well as practical advances (e.g. Deep RL From Human Preferences) prior to starting work at ARC. We're not aware of any equally significant advances from Connor or other key staff members at Conjecture; we'd be interested to hear if you have examples of their pre-Conjecture output you find impressive. We're not particularly impressed by Conjecture's process, although it's possible we'd change our mind if we knew more about it. Maintaining high velocity in research is certainly a useful component, but hardly sufficient. The Builder/Breaker method proposed by ARC feels closer to a complete methodology. But this doesn't feel like the crux for us: if Conjecture copied ARC's process entirely, we'd still be much more excited about ARC (per-capita). Research productivity is a product of a large number of factors, and explicit process is an important but far from decisive one. In terms of the explicit comparison with ARC, we would like to note that ARC Theory's team size is an order of magnitude smaller than Conjecture. Based on ARC's recent hiring post, our understanding is the theory team consists of just three individuals: Paul Christiano, Mark Xu and Jacob Hilton. If ARC had a team ten times larger and had spent close to $10 mn, then we would

(cross-posted from EAF)

Some clarifications on the comment:
1. I strongly endorse critique of organisations in general and especially within the EA space. I think it's good that we as a community have the norm to embrace critiques.
2. I personally have my criticisms for Conjecture and my comment should not be seen as "everything's great at Conjecture, nothing to see here!". In fact, my main criticism of leadership style and CoEm not being the most effective thing they could do, are also represented prominently in this post. 
3. I'd also be fine with the a... (read more)

(cross-commented from EA forum)

I personally have no stake in defending Conjecture (In fact, I have some questions about the CoEm agenda) but I do think there are a couple of points that feel misleading or wrong to me in your critique. 

1. Confidence (meta point): I do not understand where the confidence with which you write the post (or at least how I read it) comes from. I've never worked at Conjecture (and presumably you didn't either) but even I can see that some of your critique is outdated or feels like a misrepresentation of their work to me (see... (read more)

(cross-posted from EAF, thanks Richard for suggesting. There's more back-and-forth later.)

I'm not very compelled by this response.

It seems to me you have two points on the content of this critique. The first point:

I think it's bad to criticize labs that do hits-based research approaches for their early output (I also think this applies to your critique of Redwood) because the entire point is that you don't find a lot until you hit.

I'm pretty confused here. How exactly do you propose that funding decisions get made? If some random person says they are pursu... (read more)

Omega.12-3

(crossposted from the EA Forum)

We appreciate your detailed reply outlining your concerns with the post. 

Our understanding is that your key concern is that we are judging Conjecture based on their current output, whereas since they are pursuing a hits-based strategy we should expect in the median case for them to not have impressive output. In general, we are excited by hits-based approaches, but we echo Rohin's point: how are we meant to evaluate organizations if not by their output? It seems healthy to give promising researchers sufficient ... (read more)

(cross-posted from EAF)

Some clarifications on the comment:
1. I strongly endorse critique of organisations in general and especially within the EA space. I think it's good that we as a community have the norm to embrace critiques.
2. I personally have my criticisms for Conjecture and my comment should not be seen as "everything's great at Conjecture, nothing to see here!". In fact, my main criticism of leadership style and CoEm not being the most effective thing they could do, are also represented prominently in this post. 
3. I'd also be fine with the a... (read more)

Clarified the text: 

Update (early April 2023): I now think the timelines in this post are too long and expect the world to get crazy faster than described here. For example, I expect many of the things suggested for 2030-2040 to already happen before 2030. Concretely, in my median world, the CEO of a large multinational company like Google is an AI. This might not be the case legally but effectively an AI makes most major decisions.

Not sure if this is "Nice!" xD. In fact, it seems pretty worrying. 

2Daniel Kokotajlo
Well nice that you updated at least! :) But yeah I'm pretty scared.

So far, I haven't looked into it in detail and I'm only reciting other people's testimonials. I intend to dive deeper into these fields soon. I'll let you know when I have a better understanding.  

I agree with the overall conclusion that the burden of proof should be on the side of the AGI companies. 

However, using the FDA as a reference or example might not be so great because it has historically gotten the cost-benefit trade-offs wrong many times and e.g. not permitting medicine that was comparatively safe and highly effective. 

So if the association of AIS evals or is similar to the FDA, we might not make too many friends. Overall, I think it would be fine if the AIS auditing community is seen as generally cautious but it should not give... (read more)

3lisas
That seems like an excellent angle to the issue - I agree that reference models and stakeholders' different attitudes towards them likely have a huge impact.  As such, the criticisms the FDA faces might indeed be an issue! (at least that's how I understand your comment);  However, I'd carefully offer a bit of pushback on the aviation industry as an example, keeping in mind the difficult tradeoffs and diverging interests regulators will face in designing an approval process for AI systems. I think the problems that regulators will face are more similar to those of the FDA & policymakers (if you assume they are your audience) might be more comfortable with a model that can somewhat withstand these problems.  Below my reasoning (with a bit of an overstatement/ political rhetoric e.g., "risking peoples live") As you highlighted, FDA is facing substantial criticism for being too cautious, e.g., with the Covid Vaccine taking longer to approve than the UK. Not permitting a medicine that would have been comparatively safe and highly effective, i.e., a false negative, can mean that medicine could have had a profound positive impact on someone's life. And beyond the public interest, industry has quite some financial interests in getting these through too. In a similar vein, I expect that regulators will face quite some pushback when "slowing" innovation down, i.e. not approving a model. On the other side, being too fast in pushing drugs through the pipeline is also commonly criticized (e.g., the recent Alzheimer's drug approval as a false positive example). Even more so, losing its reputation as a trustworthy regulator has a lot of knock-on effects. (i.e., will people trust an FDA-approved vaccine in the future?).  As such, both being too cautious and being too aggressive have both potentially high costs to people's lives, striking the right balance is incredibly difficult. The aviation industry also faces a tradeoff, but I would argue, one side is inherently "weaker" tha
3[anonymous]
This makes sense. Can you say more about how aviation regulation differs from the FDA? In other words, are there meaningful differences in how the regulatory processes are set up? Or does it just happen to be the case that the FDA has historically been worse at responding to evidence compared to the Federal Aviation Administration?  (I think it's plausible that we would want a structure similar to the FDA even if the particular individuals at the FDA were bad at cost-benefit analysis, unless there are arguments that the structure of the FDA caused the bad cost-benefit analyses).

People could choose how they want to publish their opinion. In this case, Richard chose the first name basis. To be fair though, there aren't that many Richards in the alignment community and it probably won't be very hard for you to find out who Richard is. 

Just to get some intuitions. 

Assume you had a tool that basically allows to you explain the entire network, every circuit and mechanism, etc. The tool spits out explanations that are easy to understand and easy to connect to specific parts of the network, e.g. attention head x is doing y. Would you publish this tool to the entire world or keep it private or semi-private for a while? 

8Mark Xu
I think this case is unclear, but also not central because I'm imagining the primary benefit of publishing interp research as being making interp research go faster, and this seems like you've basically "solved interp", so the benefits no longer really apply?

Thank you!

I also agree that toy models are better than nothing and we should start with them but I moved away from "if we understand how toy models do optimization, we understand much more about how GPT-4 does optimization". 

I have a bunch of project ideas on how small models do optimization. I even trained the networks already. I just haven't found the time to interpret them yet. I'm happy for someone to take over the project if they want to. I'm mainly looking for evidence against the outlined hypothesis, i.e. maybe small toy models actually do fair... (read more)

How confident are you that the model is literally doing gradient descent from these papers? My understanding was that the evidence in these papers is not very conclusive and I treated it more as an initial hypothesis than an actual finding. 

Even if you have the redundancy at every layer, you are still running copies of the same layer, right? Intuitively I would say this is not likely to be more space-efficient than not copying a layer and doing something else but I'm very uncertain about this argument. 

I intend to look into the Knapsack + DP algorithm problem at some point. If I were to find that the model implements the DP algorithm, it would change my view on mesa optimization quite a bit. 

4abhayesian
I think that these papers do provide sufficient behavioral evidence that transformers are implementing something close to gradient descent in their weights.  Garg et al. 2022 examine the performance of 12-layer GPT-style transformers trained to do few-shot learning and show that they can in-context learn 2-layer MLPs.  The performance of their model closely matches an MLP with GD for 5000 steps on those same few-shot examples, and it cannot be explained by heuristics like averaging the K-nearest neighbors from the few-shot examples.  Since the inputs are fairly high-dimensional, I don't think they can be performing this well by only memorizing the weights they've seen during training.  The model is also fairly robust to distribution shifts in the inputs at test time, so the heuristic they must be learning should be pretty similar to a general-purpose learning algorithm.   I think that there also is some amount of mechanistic evidence that transformers implement some sort of iterative optimization algorithm over some quantity stored in the residual stream.   In one of the papers mentioned above (Akyurek et al. 2022), the authors trained a probe to extract the ground-truth weights of the linear model from the residual stream and it appears to somewhat work.  The diagrams seem to show that it gets better when trained on activations from later layers, so it seems likely that the transformer is iteratively refining its prediction of the weights.

No plans so far. I'm a little unhappy with the experimental design from last time. If I ever come back to this, I'll change the experiments up anyways.

Load More