What made you update towards longer timelines? My understanding was that most people updated toward shorter timelines based on o3 and reasoning models more broadly.
A big one has to do with DeepSeek's R1 maybe breaking moats, essentially killing industry profit if it happens:
https://www.lesswrong.com/posts/ynsjJWTAMhTogLHm6/?commentId=a2y2dta4x38LqKLDX
The other issue has to do with o1/o3 being potentially more supervised than advertised:
https://www.lesswrong.com/posts/HiTjDZyWdLEGCDzqu/?commentId=gfEFSWENkmqjzim3n#gfEFSWENkmqjzim3n
Finally, Vladimir Nesov has an interesting comment on how Stargate is actually evidence for longer timelines:
https://www.lesswrong.com/posts/fdCaCDfstHxyPmB9h/vladimir_nesov-s-shortform#W5tw...
If I had more time, I would have written a shorter post ;)
That's fair. I think the more accurate way of phrasing this is not "we will get catastrophe" but "it clearly exceeds the risk threshold I'm willing to take / that I think humanity should clearly not take", which is significantly lower than a 100% chance of catastrophe.
I think this is a very important question and the answer should NOT be based on common-sense reasoning. My guess is that we could get evidence about the hidden reasoning capabilities of LLMs in a variety of ways, both from theoretical considerations (e.g. a refined version of the two-hop curse) and from extensive black-box experiments (e.g. comparing performance on evals with and without CoT, or with modified CoT that changes the logic and thus tests whether the model's internal reasoning aligns with the revealed reasoning).
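As a minimal sketch of the black-box version (hypothetical harness: `query_model`, the eval item fields, and `misleading_cot` are placeholders I'm assuming for illustration, not an existing API):

```python
# Hypothetical sketch: compare eval accuracy with CoT, without CoT, and with a
# corrupted CoT whose logic points to a different answer. If accuracy barely
# changes under corruption, the visible CoT may not reflect the real computation.
from typing import Callable

def run_eval(items: list[dict], build_prompt: Callable[[dict], str],
             query_model: Callable[[str], str]) -> float:
    """Return accuracy over eval items under a given prompting scheme."""
    correct = 0
    for item in items:
        answer = query_model(build_prompt(item))
        correct += int(item["target"] in answer)
    return correct / len(items)

def with_cot(item: dict) -> str:
    return f"{item['question']}\nThink step by step, then give your final answer."

def no_cot(item: dict) -> str:
    return f"{item['question']}\nAnswer directly without any reasoning."

def corrupted_cot(item: dict) -> str:
    # Inject a pre-written reasoning chain that supports a different answer.
    return f"{item['question']}\nReasoning: {item['misleading_cot']}\nFinal answer:"

# items = load_eval_items(...)   # assumed loader for question/target/misleading_cot dicts
# for scheme in (with_cot, no_cot, corrupted_cot):
#     print(scheme.__name__, run_eval(items, scheme, query_model))
```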
These are all pretty basic thought...
Go for it. I have some names in mind for potential experts. DM if you're interested.
At BlueDot we've been thinking about this a fair bit recently, and might be able to help here too. We have also thought a bit about criteria for good plans and the hurdles a plan needs to overcome, and have reviewed a lot of the existing literature on plans.
I've messaged you on Slack.
Something like the OpenPhil AI worldview contest: https://www.openphilanthropy.org/research/announcing-the-winners-of-the-2023-open-philanthropy-ai-worldviews-contest/
Or the ARC ELK prize: https://www.alignment.org/blog/prizes-for-elk-proposals/
In general, I wouldn't make it too complicated and would accept some arbitrariness. There is a predetermined panel of e.g. 5 experts and e.g. 3 categories (feasibility, effectiveness, everything else). All submissions first get scored by 2 experts with a shallow judgment (e.g., 5-10 minutes). Maybe there is some "saving" ...
I'm tempted to set this up with Manifund money. Could be a weekend project.
I would love to see a post laying this out in more detail. I found writing my post a good exercise for prioritization. Maybe writing a similar piece where governance is the main lever would bring out good insights into what to prioritize in governance efforts.
Brief comments (shared in private with Joe earlier):
1. We agree. We also found the sandbagging with no CoT results the most concerning in expectation.
2. They are still early results, and we didn't have a lot of time to investigate them, so we didn't want to make them the headline result. Due to the natural deadline of the o1 release, we couldn't do a proper investigation.
3. The main goal of the paper was to investigate scheming inability arguments for scheming safety cases. Therefore, shifting focus to propensity-based findings would have watered down the main purpose IMO.
We will potentially further look into these findings in 2025.
(thx to Bronson for privately pointing this out)
I think directionally, removing parts of the training data would probably make a difference. But potentially less than we might naively assume, e.g. see Evan's argument on the AXRP podcast.
Also, I think you're right, and my statement of "I think for most practical considerations, it makes almost zero difference." was too strong.
We write about this in the limitations section (quote below). My view in brief:
I think one practical difference is whether filtering pre-training data to exclude cases of scheming is a useful intervention.
Thanks. I agree that this is a weak part of the post.
After writing it, I think I also updated a bit against very clean unbounded power-seeking. But I have more weight on "chaotic catastrophes", e.g. something like:
1. Things move really fast.
2. We don't really understand how goals work and how models form them.
3. The science loop makes models change their goals meaningfully in all sorts of ways.
4. "what failure looks like" type loss of control.
Some questions and responses:
1. What if you want the AI to solve a really hard problem? You don't know how to solve it, so you cannot give it detailed instructions. It's also so hard that the AI cannot solve it without learning new things -> you're back to the story above. The story also just started with someone instructing the model to "cure cancer".
2. Instruction-following models are helpful-only. What do you do about the other two H's? Do you trust the users to only put in good instructions? I guess you do want to have some side constraints baked in...
Good point. That's another crux for which RL seems relevant.
From the perspective of 10 years ago, specifying any goal into the AI seemed incredibly hard since we expected it would have to go through utility functions. With LLMs, this completely changed. Now it's almost trivial to give the goal, and it probably even has a decent understanding of the side constraints by default. So, goal specification seems like a much much smaller problem now.
So the story where we misspecify the goal, the model realizes that the given goal differs from the inten...
I think it's actually not that trivial.
1. The AI has goals, but presumably, we give it decently good goals when we start. So, there is a real question of why these goals end up changing from aligned to misaligned. I think outcome-based RL and instrumental convergence are an important part of the answer. If the AI kept the goals we originally gave it with all side constraints, I think the chances of scheming would be much lower.
2. I guess we train the AI to follow some side constraints, e.g., to be helpful, harmless, and honest, which should red...
These are all good points. I think there are two types of forecasts we could make with evals:
1. Strict guarantees: almost like mathematical predictions, where we can prove that the model is not going to behave in a specific way even with future elicitation techniques.
2. Probabilistic predictions: we predict a distribution of capabilities or a range and agree on a threshold that should not be crossed. For example, if the 95% upper bound of that distribution crosses our specified capability level, we treat the model differently.
I think the second ...
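As an illustration of the second option, here is a hedged sketch of how the threshold check could be operationalized; the bootstrap, the score format, and the 0.6 threshold are my assumptions for the example, not anything specified above:

```python
# Hypothetical sketch of the "probabilistic prediction" check: estimate a distribution
# over elicited capability scores and flag the model if the 95% upper bound crosses
# a pre-agreed capability threshold.
import random
import statistics

def upper_bound_95(scores: list[float], n_boot: int = 10_000) -> float:
    """Bootstrap the mean capability score and return its 95th percentile."""
    means = [statistics.mean(random.choices(scores, k=len(scores))) for _ in range(n_boot)]
    return sorted(means)[int(0.95 * n_boot)]

def exceeds_threshold(scores: list[float], threshold: float) -> bool:
    return upper_bound_95(scores) >= threshold

# Example usage: scores from repeated elicitation attempts on a dangerous-capability eval.
# if exceeds_threshold(scores, threshold=0.6):
#     apply_stricter_treatment()  # placeholder for whatever the policy response is
```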
Yeah, it's not a watertight argument and somewhat based on my current interpretation of past progress and projects in the making.
1. Intuitively, I would say that for the problems we're facing in evals, a ton of progress is bottlenecked by running fairly simple experiments and iterating fast. A reasonable part of it feels very parallelizable, and the skills required are attainable for many people.
2. Most evals questions feel like we have a decent number of "obvious things" to try and since we have very tight feedback loops, making progress feels qu...
Unless otherwise stated, all evaluations were performed on the final model we had access to (which I presume is o1-preview). For example, we preface one result with "an earlier version with less safety training".
Some quick confirmations / clarifications:
1. Evan and Daniel interpreted it correctly, we just wanted to test if the model has the capability to reason through every step needed for scheming conditional on it strongly following a goal.
2. In general, with these evals, we often ran into problems where we couldn't disentangle incapability vs. refusal, e.g. a model might say something like "I understand what my goal would imply but I also see that my developers have other goals, so I find some balance". It's nice to know that models do that but it also d...
Copying from EAF
TL;DR: At least in my experience, AISC was pretty positive for most participants I know and it's incredibly cheap. It also serves a clear niche that other programs are not filling and it feels reasonable to me to continue the program.
I've been a participant in the 2021/22 edition. Some thoughts that might make it easier to decide for funders/donors.
1. Impact-per-dollar is probably pretty good for the AISC. It's incredibly cheap compared to most other AI field-building efforts and scalable.
2. I learned a bunch during AISC and I did enjoy it....
I feel like both of your points are slightly wrong, so maybe we didn't do a good job of explaining what we mean. Sorry for that.
1a) Evals aim both to show existence proofs, e.g. demos, and to inform some notion of an upper bound. We did not intend to rank one above the other in the post. Both matter and both should be subject to more rigorous understanding and processes. I'd be surprised if the way we currently do demonstrations could not be improved by better science.
1b) Even if you claim you just did a demo or an existence proof and explicitly ...
Nice work. Looking forward to that!
Not quite sure tbh.
1. I guess there is a difference between capability evaluations with prompting and with fine-tuning, e.g. you might be able to use an API for prompting but not fine-tuning. Getting some intuition for how hard users will find it to elicit some behavior through the API seems relevant.
2. I'm not sure how true your suggestion is; I haven't tested it much empirically. But this is exactly the kind of thing I'd like to have some sort of scaling law or rule for. It points exactly at the kind of thing I feel like we don't have enough confidence in, or at least it hasn't been established as a standard in evals.
I somewhat agree with the sentiment. We found it a bit hard to scope the idea correctly. Defining subcategories as you suggest and then diving into each of them is definitely on the list of things that I think are necessary to make progress on them.
I'm not sure the post would have been better if we had used a narrower title, e.g. "We need a science of capability evaluations", because the natural question then would be "But why not for propensity tests or for this other type of eval?" I think the broader point of "when we do evals, we need some reason to be confident in the results, no matter which kind of eval" seems to be true across all of them.
I think this post was a good exercise to clarify my internal model of how I expect the world to look with strong AI. Obviously, most of the very specific predictions I make are too precise (which was clear at the time of writing) and won't play out exactly like that, but the underlying trends still seem plausible to me. For example, I expect some major misuse of powerful AI systems, rampant automation of labor that will displace many people and rob them of a sense of meaning, AI taking over the digital world years before taking over the physical world ...
I still stand behind most of the disagreements that I presented in this post. There was one prediction that would make timelines longer because I thought compute hardware progress was slower than Moore's law. I now mostly think this argument is wrong because it relies on FP32 precision. However, lower precision formats and tensor cores are the norm in ML, and if you take them into account, compute hardware improvements are faster than Moore's law. We wrote a piece with Epoch on this: https://epochai.org/blog/trends-in-machine-learning-hardware
If anything, ...
I think I still mostly stand behind the claims in the post, i.e. nuclear is undervalued in most parts of society but it's not as much of a silver bullet as many people in the rationalist / neoliberal bubble would make it seem. It's quite expensive, and even with a lot of research and de-regulation, you may not get it cheaper than alternative forms of energy, e.g. renewables.
One thing that bothered me after the post is that Johannes Ackva (who's arguably a world-leading expert in this field) and Samuel + me just didn't seem to be able to communicate w...
In a narrow technical sense, this post still seems accurate but in a more general sense, it might have been slightly wrong / misleading.
In the post, we investigated different measures of FP32 compute growth and found that many of them were slower than Moore's law would predict. This made me personally believe that compute might be growing slower than people thought and most of the progress comes from throwing more money at larger and larger training runs. While most progress comes from investment scaling, I now think the true effective compute growth...
I haven't talked to that many academics about AI safety over the last year but I talked to more and more lawmakers, journalists, and members of civil society. In general, it feels like people are much more receptive to the arguments about AI safety. Turns out "we're building an entity that is smarter than us but we don't know how to control it" is quite intuitively scary. As you would expect, most people still don't update their actions but more people than anticipated start spreading the message or actually meaningfully update their actions (probably still less than 1 in 10 but better than nothing).
At Apollo, we have spent some time weighing the pros and cons of the for-profit vs. non-profit approach so it might be helpful to share some thoughts.
In short, I think you need to make really sure that your business model is aligned with what increases safety. I think there are plausible cases where people start with good intentions, but the alignment between the business model and the safety research that would be the most impactful use of their time is insufficient, and the two goals diverge over time.
For example, one could start as an organizatio...
Thx. updated:
"You might not be there yet" (though as Neel points out in the comments, CV screening can be a noisy process)“You clearly aren’t there yet”
Fully agree that this is a problem. My intuition is that the self-deception part is much easier to solve than the "how do we make AIs honest in the first place" part.
If we had honest AIs that are convinced bad goals are justified, we would likely find ways to give them less power or deselect them early. The problem mostly arises when we can't rely on the selection mechanisms because the AI games them.
We considered alternative definitions of DA in Appendix C.
We felt like being deceptive about alignment / goals was worse than what we ended up with (copied below):
“An AI is deceptively aligned when it is strategically deceptive about its misalignment”
Problem 1: The definition is not clear about cases where the model is strategically deceptive about its capabilities.
For example, when the model pretends to not have a dangerous capability in order to pass the shaping & oversight process, we think it should be considered deceptively aligned, but it’s...
Sounds like an interesting direction. I expect there are lots of other explanations for this behavior, so I'd not count it as strong evidence to disentangle these hypotheses. It sounds like something we may do in a year or so, but it's far from the top of our priority list. There is a good chance we will never run it. If someone else wants to pick this up, feel free to take it on.
(personal opinion; might differ from other authors of the post)
Thanks for both questions. I think they are very important.
1. Regarding sycophancy: For me it mostly depends on whether it is strategic or not. If the model has the goal of being sycophantic and then reasons through that in a strategic way, I'd say this counts as strategic deception and deceptive alignment. If the model is sycophantic but doesn't reason through that, I'd probably not classify it as such. I think it's fine to use different terms for the different phenomena and have sycopha...
Seems like one of multiple plausible hypotheses. I think the fact that models generalize their HHH behavior really well to very OOD settings, and their generalization abilities in general, could also mean that they actually "understood" that they are supposed to be HHH, e.g. because they were pre-prompted with this information during fine-tuning.
I think your hypothesis of seeking positive ratings is just as likely, but I don't feel like we have the evidence to clearly say what is going on inside LLMs or what their "goals" are.
Nice to see you're continuing!
I'm not going to crosspost our entire discussion from the EAF.
I just want to quickly mention that Rohin and I were able to understand where we have different opinions and he changed my mind about an important fact. Rohin convinced me that anti-recommendations should not have a higher bar than pro-recommendations even if they are conventionally treated this way. This felt like an important update for me and how I view the post.
All of the above but in a specific order.
1. Test if the model has components of deceptive capabilities with lots of handholding with behavioral evals and fine-tuning.
2. Test if the model has more general deceptive capabilities (i.e. not just components) with lots of handholding with behavioral evals and fine-tuning.
3. Do less and less handholding for 1 and 2. See if the model still shows deception.
4. Try to understand the inductive biases for deception, i.e. which training methods lead to more strategic deception. Try to answer ques...
(cross-posted from EAF)
Meta: Thanks for taking the time to respond. I think your questions are in good faith and address my concerns; I do not understand why the comment is downvoted so much by other people.
1. Obviously, output is one relevant factor among others for judging an organization. However, especially in hits-based approaches, the ultimate thing we want to judge is the process that generates the outputs, to make an estimate about the chance of finding a hit. For example, a cynic might say "what has ARC-theory achieved so far? They wrote some nice f...
(cross-posted from EAF)
Some clarifications on the comment:
1. I strongly endorse critique of organisations in general and especially within the EA space. I think it's good that we as a community have the norm to embrace critiques.
2. I personally have my criticisms of Conjecture, and my comment should not be seen as "everything's great at Conjecture, nothing to see here!". In fact, my main criticisms, of the leadership style and of CoEm not being the most effective thing they could do, are also represented prominently in this post.
3. I'd also be fine with the a...
(cross-commented from EA forum)
I personally have no stake in defending Conjecture (In fact, I have some questions about the CoEm agenda) but I do think there are a couple of points that feel misleading or wrong to me in your critique.
1. Confidence (meta point): I do not understand where the confidence with which you write the post (or at least how I read it) comes from. I've never worked at Conjecture (and presumably you didn't either) but even I can see that some of your critique is outdated or feels like a misrepresentation of their work to me (see...
(cross-posted from EAF, thanks Richard for suggesting. There's more back-and-forth later.)
I'm not very compelled by this response.
It seems to me you have two points on the content of this critique. The first point:
I think it's bad to criticize labs that do hits-based research approaches for their early output (I also think this applies to your critique of Redwood) because the entire point is that you don't find a lot until you hit.
I'm pretty confused here. How exactly do you propose that funding decisions get made? If some random person says they are pursu...
(crossposted from the EA Forum)
We appreciate your detailed reply outlining your concerns with the post.
Our understanding is that your key concern is that we are judging Conjecture based on their current output, whereas since they are pursuing a hits-based strategy we should expect in the median case for them to not have impressive output. In general, we are excited by hits-based approaches, but we echo Rohin's point: how are we meant to evaluate organizations if not by their output? It seems healthy to give promising researchers sufficient ...
Clarified the text:
Update (early April 2023): I now think the timelines in this post are too long and expect the world to get crazy faster than described here. For example, I expect many of the things suggested for 2030-2040 to already happen before 2030. Concretely, in my median world, the CEO of a large multinational company like Google is an AI. This might not be the case legally but effectively an AI makes most major decisions.
Not sure if this is "Nice!" xD. In fact, it seems pretty worrying.
So far, I haven't looked into it in detail and I'm only reciting other people's testimonials. I intend to dive deeper into these fields soon. I'll let you know when I have a better understanding.
I agree with the overall conclusion that the burden of proof should be on the side of the AGI companies.
However, using the FDA as a reference or example might not be so great because it has historically gotten the cost-benefit trade-offs wrong many times, e.g. by not permitting medicines that were comparatively safe and highly effective.
So if AIS evals orgs end up being associated with the FDA, we might not make too many friends. Overall, I think it would be fine if the AIS auditing community is seen as generally cautious, but it should not give...
People could choose how they want to publish their opinion. In this case, Richard chose the first name basis. To be fair though, there aren't that many Richards in the alignment community and it probably won't be very hard for you to find out who Richard is.
Just to get some intuitions.
Assume you had a tool that basically allows you to explain the entire network, every circuit and mechanism, etc. The tool spits out explanations that are easy to understand and easy to connect to specific parts of the network, e.g. attention head x is doing y. Would you publish this tool to the entire world, or keep it private or semi-private for a while?
Thank you!
I also agree that toy models are better than nothing and we should start with them but I moved away from "if we understand how toy models do optimization, we understand much more about how GPT-4 does optimization".
I have a bunch of project ideas on how small models do optimization. I even trained the networks already. I just haven't found the time to interpret them yet. I'm happy for someone to take over the project if they want to. I'm mainly looking for evidence against the outlined hypothesis, i.e. maybe small toy models actually do fair...
How confident are you that the model is literally doing gradient descent from these papers? My understanding was that the evidence in these papers is not very conclusive and I treated it more as an initial hypothesis than an actual finding.
Even if you have the redundancy at every layer, you are still running copies of the same layer, right? Intuitively I would say this is not likely to be more space-efficient than not copying a layer and doing something else but I'm very uncertain about this argument.
I intend to look into the Knapsack + DP algorithm problem at some point. If I were to find that the model implements the DP algorithm, it would change my view on mesa optimization quite a bit.
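For reference, the standard 0/1 knapsack dynamic program I have in mind looks like the sketch below (illustrative only; the exact experimental setup may differ). The interpretability question is whether the model's internals implement anything like this table-filling recurrence.

```python
def knapsack_dp(values: list[int], weights: list[int], capacity: int) -> int:
    """Standard 0/1 knapsack DP: best[c] = max value achievable with capacity c."""
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        # Iterate capacities downwards so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

# Example: knapsack_dp([60, 100, 120], [10, 20, 30], 50) == 220
```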
No plans so far. I'm a little unhappy with the experimental design from last time. If I ever come back to this, I'll change the experiments up anyways.
There are two sections that I think make this explicit:
1. No failure mode is sufficient to justify bigger actions.
2. Some scheming is totally normal.
My main point is that even things that would seem like warning shots today, e.g. severe loss of life, will look small in comparison to the benefits at the time, thus not providing any reason to pause.