All of Joe Collman's Comments + Replies

Unless I'm missing something, the hoped-for advantages of this setup are the kind of thing AI safety via debate already aims at. In GDM's recent paper on their approach to technical alignment, there's some discussion of amplified oversight more generally (starting at page 71), and of debate specifically (starting at page 73).

If you see the approach you're suggesting as importantly different from debate approaches, it'd be useful to know where the key differences are.

(without having read too carefully, my initial impression is that this is the kind of thing I expect to work fo... (read more)

Knight Lee
Thank you so much for bringing up that paper and finding the exact page most relevant! I learned a lot reading those pages. You're a true researcher, take my strong upvote.

My idea consists of a "hammer" and a "nail." GDM's paper describes a "hammer" very similar to mine (perhaps superior), but lacks the "nail." The fact that the hammer they invented resembles the hammer I invented is evidence in favour of me: I'm not badly confused :). I shouldn't be sad that my hammer invention already exists.[1]

The "nail" of my idea is making the Constitutional AI self-critique behave like a detective, using its intelligence to uncover the most damning evidence of scheming/dishonesty. This detective behaviour helps achieve the premises of the "Constitutional AI Sufficiency Theorem."

The "hammer" of my idea is reinforcement learning to reward it for good detective work, with humans meticulously verifying its proofs (or damning evidence) of scheming/dishonesty.

1. ^ It does seem like a lot of my post describes my hammer invention in detail, and is no longer novel :/

Some thoughts:

  • The correct answer is clearly (c) - it depends on a bunch of factors.
  • My current guess is that it would make things worse (given likely values for the bunch of other factors) - basically for Richard's reasons.
    • Given [new potential-to-shift-motivation information/understanding], I expect there's a much higher chance that this substantially changes the direction of a not-yet-formed project, than a project already in motion.
    • Specifically:
      • Who gets picked to run such a project? If it's primarily a [let's beat China!] project, are the key people cauti
... (read more)

First some points of agreement:

  • I like that you're focusing on neglected approaches. Not much on the technical side seems promising to me, so I like to see exploration.
    • Skimming through your suggestions, I think I'm most keen on human augmentation related approaches - hopefully the kind that focuses on higher quality decision-making and direction finding, rather than simply faster throughput.
  • I think outreach to Republicans / conservatives, and working across political lines is important, and I'm glad that people are actively thinking about this.
  • I do buy the
... (read more)

I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.

I don't think this is quite right.

Two major objections to the bio-anchors 30-year-median conclusion might be:

  1. The whole thing is laundering vibes into credible-sounding headline numbers.
  2. Even if we stipulate that the methodology is sound, it measures an upper bound, not a median.

To me, (2) is the more obvious error. I basically buy (1) too, b... (read more)

Joe Collman

To be clear, I only think this becomes obviously significant in a train-against-interpretability-tools context; if we're not training against tools, the things I'm highlighting here seem intuitively highly unlikely.
This still seems important, given that the combination [x is clearly a bad idea] and [x is the easiest path forward] does not robustly result in [nobody does x].

That said, I think I'd disagree on one word of the following:

The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement su

... (read more)
Joe Collman

Information flow might be inhibited actively through an agent's actions. The primary way this could happen is gradient hacking, but it’s not the only kind of action an AI might take to conceal misaligned thoughts. Of course, active circumvention methods require that interpreters either can’t see or aren’t looking at the thoughts that generate those actions.

Most potential circumvention methods that can be passive can also be active. But some methods can only be active.

It seems to me that there's no fixed notion of "active" that works for both para... (read more)

Lee Sharkey
So I believe I had in mind "active means [is achieved deliberately through the agent's actions]".

I think your distinction makes sense. And if I ever end up updating this article I would consider incorporating it. However, I think the reason I didn't make this distinction at the time is that the difference is pretty subtle.

The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations to help orchestrate the actions required to do it. This is true even if they've been selected for passively.

So I'd argue that they all need to be implemented actively (if they're to have a reasonable probability of success) but may be selected for passively or actively. I'm curious if you agree with this? If not, then I may have missed your argument.

(For the convenience of other readers of this thread, the kind of strategies I labelled as strictly active are:

  • 1.1.3. Environmental artifacts that corrupt its future self: A misaligned AI may leave behind artifacts that serve as adversarial examples or poisoned data that corrupt its future self. (Strictly active)
  • 1.1.4. Deleting (and relearning) dangerous representations: When not under surveillance, an AI might be able to delete dangerous representations so that it looks safe when surveillance resumes. It might also be able to schedule when it relearns those representations. (Strictly active)
  • 1.1.5. Weights poisoning: Instead of providing manipulated inputs, the AI might manipulate its own weights so that a small subset of normal inputs behave like adversarial examples. (Strictly active)
  • 1.1.6. Pre-committing to following potentially dangerous instructions: Hubinger: "... if the model has the ability to read and write to some sort of state ... it could write out a pre-commitment to follow the hard-coded policy and then just have a policy of always following any pre-commitments it find

I don't think [gain a DSA] is the central path here.
It's much closer to [persuade some broad group that already has a lot of power collectively].

I.e. the likely mechanism is not: [add the property [has DSA] to [group that will do the right thing]].
But closer to: [add the property [will do the right thing] to [group that has DSA]].

Joe Collman

It may be better to think about it that way, yes - in some cases, at least.

Probably it makes sense to throw in some more variables.
Something like:

  • To stand x chance of property p applying to system s, we'd need to apply resources r.

In these terms, [loss of control] is something like [ensuring important properties becomes much more expensive (or impossible)].
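One minimal way to write that model down (the notation below is mine and purely illustrative, not Joe's):

```latex
% Hedged sketch of the resource model above; notation is illustrative only.
% R(p, s, x) = resources required so that property p holds of system s
%              with probability at least x.
R : \text{Properties} \times \text{Systems} \times [0,1] \;\to\; \text{Resources}
% "Loss of control" then reads as: for the important properties p,
% R(p, s, x) grows far beyond the resources we can realistically apply
% (or no finite amount of resources suffices).
```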

faul_sname
I think the most important part of your "To stand x chance of property p applying to system s, we'd need to apply resources r" model is the word "we". Currently, there exists no "we" in the world that can ensure that nobody in the world does some form of research, or at least no "we" that can do that in a non-cataclysmic way. The International Atomic Energy Agency comes the closest of any group I'm aware of, but the scope is limited and also it does its thing mainly by controlling access to specific physical resources rather than by trying to prevent a bunch of people from doing a thing with resources they already possess.

If "gain a DSA (or cause some trusted other group to gain a DSA) over everyone who could plausibly gain a DSA in the future" is a required part of your threat mitigation strategy, I am not optimistic about the chances for success, but I'm even less optimistic about the chances of that working if you don't realize that's the game you're trying to play.

Do you see this as likely to have been avoidable? How?
I agree that it's undesirable. Less clear to me that it's an "own goal".

Do you see other specific things we're doing now (or that we may soon do) that seem likely to be future-own-goals?

[all of the below is "this is how it appears to my non-expert eyes"; I've never studied such dynamics, so perhaps I'm missing important factors]

I expect that, even early on, e/acc actively looked for sources of long-term disagreement with AI safety advocates, so it doesn't seem likely to me that [AI safety people don't e... (read more)

On your bottom line, I entirely agree - to the extent that there are non-power-seeking strategies that'd be effective, I'm all for them. To the extent that we disagree, I think it's about [what seems likely to be effective] rather than [whether non-power-seeking is a desirable property].

Constrained-power-seeking still seems necessary to me. (unfortunately)

A few clarifications:

  • I guess most technical AIS work is net negative in expectation. My ask there is that people work on clearer cases for their work being positive.
  • I don't think my (or Eliezer's) conclus
... (read more)

E.g. prioritizing competence means that you'll try less hard to get "your" person into power. Prioritizing legitimacy means you're making it harder to get your own ideas implemented, when others disagree.

That's clarifying. In particular, I hadn't realized you meant to imply [legitimacy of the 'community' as a whole] in your post.

I think both are good examples in principle, given the point you're making. I expect neither to work in practice, since I don't think that either [broad competence of decision-makers] or [increased legitimacy of broad (and broadeni... (read more)

Richard_Ngo
While this seems like a reasonable opinion in isolation, I also read the thread where you were debating Rohin and holding the position that most technical AI safety work was net-negative. And so basically I think that you, like Eliezer, have been forced by (according to me, incorrect) analyses of the likelihood of doom to the conclusion that only power-seeking strategies will work.

From the inside, for you, it feels like "I am responding to the situation with the appropriate level of power-seeking given how extreme the circumstances are". From the outside, for me, it feels like "The doomers have a cognitive bias that ends up resulting in them overrating power-seeking strategies, and this is not a coincidence but instead driven by the fact that it's disproportionately easy for cognitive biases to have this effect (given how the human mind works)".

Fortunately I think most rationalists have fairly good defense mechanisms against naive power-seeking strategies, and this is to their credit. So the main thing I'm worried about here is concentrating less force behind non-power-seeking strategies.

First, I think that thinking about and highlighting these kind of dynamics is important.
I expect that, by default, too few people will focus on analyzing such dynamics from a truth-seeking and/or instrumentally-useful-for-safety perspective.

That said:

  • It seems to me you're painting with too broad a brush throughout.
    • At the least, I think you should give some examples that lie just outside the boundary of what you'd want to call [structural power-seeking].
  • Structural power-seeking in some sense seems unavoidable. (AI is increasingly powerful; influencing it im
... (read more)
Joe Collman

That's fair. I agree that we're not likely to resolve much by continuing this discussion. (but thanks for engaging - I do think I understand your position somewhat better now)

What does seem worth considering is adjusting research direction to increase focus on [search for and better understand the most important failure modes] - both of debate-like approaches generally, and any [plan to use such techniques to get useful alignment work done].

I expect that this would lead people to develop clearer, richer models.
Presumably this will take months rather than h... (read more)

Joe Collman

"[regardless of the technical work you do] there will always be some existentially risky failures left, so if we proceed we get doom...

I'm claiming something more like "[given a realistic degree of technical work on current agendas in the time we have], there will be some existentially risky failures left, so if we proceed we're highly likely to get doom".
I'll clarify more below.

Otherwise even in a "free-for-all" world, our actions do influence odds of success, because you can do technical work that people use, and that reduces p(doom).

Sure, but I mostly do... (read more)

Rohin Shah

Okay, I think it's pretty clear that the crux between us is basically what I was gesturing at in my first comment, even if there are minor caveats that make it not exactly literally that.

I'm probably not going to engage with perspectives that say all current [alignment work towards building safer future powerful AI systems] is net negative, sorry. In my experience those discussions typically don't go anywhere useful.

Joe Collman

Not going to respond to everything

No worries at all - I was aiming for [Rohin better understands where I'm coming from]. My response was over-long.

E.g. presumably if you believe in this causal arrow you should also believe [higher perceived risk] --> [actions that decrease risk]. But if all building-safe-AI work were to stop today, I think this would have very little effect on how fast the world pushes forward with capabilities.

Agreed, but I think this is too coarse-grained a view.
I expect that, absent impressive levels of international coordination, we... (read more)

Rohin Shah
This is the sort of thing that makes it hard for me to distinguish your argument from "[regardless of the technical work you do] there will always be some existentially risky failures left, so if we proceed we will get doom. Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down".

I agree that, conditional on believing that we're screwed absent huge levels of coordination regardless of technical work, a lot of technical work including debate looks net negative by reducing the will to coordinate. Similarly this only makes sense under a view where technical work can't have much impact on p(doom) by itself, aka "regardless of technical work we're screwed". Otherwise even in a "free-for-all" world, our actions do influence odds of success, because you can do technical work that people use, and that reduces p(doom).

Oh, my probability on level 6 or level 7 specifications becoming the default in AI is dominated by my probability that I'm somehow misunderstanding what they're supposed to be. (A level 7 spec for AGI seems impossible even in theory, e.g. because it requires solving the halting problem.) If we ignore the misunderstanding part then I'm at << 1% probability on "we build transformative AI using GSA with level 6 / level 7 specifications in the nearish future". (I could imagine a pause on frontier AI R&D, except that you are allowed to proceed if you have level 6 / level 7 specifications; and those specifications are used in a few narrow domains. My probability on that is similar to my probability on a pause.)
Joe Collman

[apologies for writing so much; you might want to skip/skim the spoilered bit, since it seems largely a statement of the obvious]

Do you agree that in many other safety fields, safety work mostly didn't think about risk compensation, and still drove down absolute risk?

Agreed (I imagine there are exceptions, but I'd be shocked if this weren't usually true).

 

[I'm responding to this part next, since I think it may resolve some of our mutual misunderstanding]

It seems like your argument here, and in other parts of your comment, is something like "we could d

... (read more)
Rohin Shah
Not going to respond to everything, sorry, but a few notes:

My claim is that for the things you call "actions that increase risk" that I call "opportunity cost", this causal arrow is very weak, and so you shouldn't think of it as risk compensation. E.g. presumably if you believe in this causal arrow you should also believe [higher perceived risk] --> [actions that decrease risk]. But if all building-safe-AI work were to stop today, I think this would have very little effect on how fast the world pushes forward with capabilities.

I agree that reference classes are often terrible and a poor guide to the future, but often first-principles reasoning is worse (related: 1, 2).

I also don't really understand the argument in your spoiler box. You've listed a bunch of claims about AI, but haven't spelled out why they should make us expect large risk compensation effects, which I thought was the relevant question.

It depends hugely on the specific stronger safety measure you talk about. E.g. I'd be at < 5% on a complete ban on frontier AI R&D (which includes academic research on the topic). Probably I should be < 1%, but I'm hesitant around such small probabilities on any social claim. For things like GSA and ARC's work, there isn't a sufficiently precise claim for me to put a probability on.

Not a big factor. (I guess it matters that instruction tuning and RLHF exist, but something like that was always going to happen, the question was when.)

Hmm, then I don't understand why you like GSA more than debate, given that debate can fit in the GSA framework (it would be a level 2 specification by the definitions in the paper). You might think that GSA will uncover problems in debate if they exist when using it as a specification, but if anything that seems to me less likely to happen with GSA, since in a GSA approach the specification is treated as infallible.
Joe Collman

Thanks for the response. I realize this kind of conversation can be annoying (but I think it's important).
[I've included various links below, but they're largely intended for readers-that-aren't-you]

I don't see why this isn't a fully general counterargument to alignment work. Your argument sounds to me like "there will always be some existentially risky failures left, so if we proceed we will get doom. Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down".

(Thanks for this too. I don't ... (read more)

Rohin Shah
Main points:

I'm on board with these.

I still don't see why you believe this. Do you agree that in many other safety fields, safety work mostly didn't think about risk compensation, and still drove down absolute risk? (E.g. I haven't looked into it but I bet people didn't spend a bunch of time thinking about risk compensation when deciding whether to include seat belts in cars.) If you do agree with that, what makes AI different from those cases? (The arguments you give seem like very general considerations that apply to other fields as well.)

I'd say that the risk compensation argument as given here Proves Too Much and implies that most safety work in most previous fields was net negative, which seems clearly wrong to me. It's true that as a result I don't spend lots of time thinking about risk compensation; that still seems correct to me.

It seems like your argument here, and in other parts of your comment, is something like "we could do this more costly thing that increases safety even more". This seems like a pretty different argument; it's not about risk compensation (i.e. when you introduce safety measures, people do more risky things), but rather about opportunity cost (i.e. when you introduce weak safety measures, you reduce the will to have stronger safety measures). This is fine, but I want to note the explicit change in argument; my earlier comment and the discussion above was not trying to address this argument.

Briefly on opportunity cost arguments, the key factors are (a) how much will there is to pay large costs for safety, (b) how much time remains to do the necessary research and implement it, and (c) how feasible the stronger safety measure is. I am actually more optimistic about both (a) and (b) than what I perceive to be the common consensus amongst safety researchers at AGI labs, but tend to be pretty pessimistic about (c) (at least relative to many LessWrongers; I'm not sure how it compares to safety researchers at AGI labs). Anyway for
Joe Collman

Sure, linking to that seems useful, thanks.

That said, I'm expecting that the crux isn't [can a debate setup work for arbitrarily powerful systems?], but rather e.g. [can it be useful in safe automation of alignment research?].

For something like the latter, it's not clear to me that it's not useful.
Mainly my pessimism is about:

  • Debate seeming not to address the failure modes I'm worried about - e.g. scheming.
  • Expecting [systems insufficiently capable to cause catastrophe] not to radically (>10x) boost the most important research on alignment. (hopefully I'
... (read more)
Fabien Roger
Why? Is it exploration difficulties, rare failures, or something else?

Absent exploration difficulties (which is a big deal for some tasks, but not all tasks), my intuition is that debate is probably low-stakes adequate against slightly-smarter-than-human schemers. Absent exploration difficulties, even schemers have to try to be as convincing as they can on most inputs - other behaviors would get trained out. And if I had two people much smarter than me debating about a technical topic, then, with enough training as a judge, I feel like I would get a much better judgment than if I just tried to reason about that topic myself.

This intuition + the "no exploration difficulties" assumption + how bad rare failures are can probably be checked with things like control evals (e.g. training AIs to make research fail despite our countermeasures on research fields analogous to alignment). (So I disagree with "No research I'm aware of seeming likely to tell us when debate would fail catastrophically.")
Joe Collman

Do you have a [link to] / [summary of] your argument/intuitions for [this kind of research on debate makes us safer in expectation]? (e.g. is Geoffrey Irving's AXRP appearance a good summary of the rationale?)

To me it seems likely to lead to [approach that appears to work to many, but fails catastrophically] before it leads to [approach that works]. (This needn't be direct)

I.e. currently I'd expect this direction to make things worse both for:

  1. We're aiming for an oversight protocol that's directly scalable to superintelligence.
  2. We're aiming for e.g. a contro
... (read more)
Rohin Shah

I like both of the theories of change you listed, though for (1) I usually think about scaling till human obsolescence rather than superintelligence.

(Though imo this broad class of schemes plausibly scales to superintelligence if you eventually hand off the judge role to powerful AI systems. Though I expect we'll be able to reduce risk further in the future with more research.)

I note here that this isn't a fully-general counterargument, but rather a general consideration.

I don't see why this isn't a fully general counterargument to alignment work. Your arg... (read more)

peterbarnett
I think this comment might be more productive if you described why you expect this approach to fail catastrophically when dealing with powerful systems (in a way that doesn't provide adequate warning). Linking to previous writing on this could be good (maybe this comment of yours on debate/scalable oversight).

Broadly I agree.

I'm not sure about:

but the team has not cohered around a leadership structure or agenda yet. I'm hopeful that this will come together

I don't expect the most effective strategy at present to be [(try hard to) cohere around an agenda]. An umbrella org hosting individual researchers seems the right starting point. Beyond that, I'd expect [structures and support to facilitate collaboration and self-organization] to be ideal.
If things naturally coalesce that's probably a good sign - but I'd prefer that to be a downstream consequence of explorati... (read more)

plex
Yeah, I mostly agree with the claim that individuals pursuing their own agendas is likely better than trying to push for people to work more closely. Finding directions which people feel like converging on could be great, but not at the cost of being able to pursue what seems most promising in a self-directed way. I think I meant I was hopeful about the whole thing coming together, rather than specifically the coherent agenda part.
Joe Collman

Ah okay, that's clarifying. Thanks.

It still seems to me that there's a core similarity for all cases of [model is deployed in a context without fully functional safety measures] - and that that can happen either via rogue deployment, or any action that subverts some safety measure in a standard deployment.

In either case the model gets [much more able to take the huge number of sketchy-actions that are probably required to cause the catastrophe].

Granted, by default I'd expect [compromised safety measure(s)] -> [rogue deployment] -> [catastrophe]
Rather... (read more)

Joe Collman

This seems a helpful model - so long as it's borne in mind that [most paths to catastrophe without rogue deployment require many actions] isn't a guarantee.

Thoughts:

  1. It's not clear to me whether the following counts as a rogue deployment (I'm assuming so):
    1. [un-noticed failure of one safety measure, in a context where all other safety measures are operational]
    2. For this kind of case:
      1. The name "rogue deployment" doesn't seem a great fit.
      2. In general, it's not clear to me how to draw the line between:
        1. Safety measure x didn't achieve what we wanted, because it wasn't
... (read more)
Buck
Re 1, I don't count that as a rogue deployment. The distinction I'm trying to draw is: did the model end up getting deployed in a way that involves a different fundamental system architecture than the one we'd designed? Agreed re 2.

(Egan's Incandescence is relevant and worth checking out - though it's not exactly thrilling :))

I'm not crazy about the terminology here:

  • Unfalsifiable-in-principle doesn't imply false. It implies that there's a sense in which the claim is empty. This tends to imply [it will not be accepted as science], but not [it is false].
  • Where something is practically unfalsifiable (but falsifiable in principle), that doesn't suggest it's false either. It suggests it's hard to check.
    • It seems to me that the thing you'd want to point to as potentially suspicious is [pract
... (read more)
VojtaKovarik
I agree with all of this. (And good point about the high confidence aspect.) The only thing that I would frame slightly differently is that [X is unfalsifiable] indeed doesn't imply [X is false] in the logical sense.

On reflection, I think a better phrasing of the original question would have been something like: 'When is "unfalsifiability of X is evidence against X" incorrect?'. And this amended version often makes sense as a heuristic --- as a defense against motivated reasoning, conspiracy theories, etc. (Unfortunately, many scientists seem to take this too far, and view "unfalsifiable" as a reason to stop paying attention, even though they would grant the general claim that [unfalsifiable] doesn't logically imply [false].)

That was my main plan. I was just hoping to accompany that direct case with a class of examples that build intuition and bring the point home to the audience.

Here and above, I'm unclear what "getting to 7..." means.
With x = "always reliably determines worst-case properties about a model and what happened to it during training even if that model is deceptive and actively trying to evade detection".

Which of the following do you mean (if either)?:

  1. We have a method that x.
  2. We have a method that x, and we have justified >80% confidence that the method x.

I don't see how model organisms of deceptive alignment (MODA) get us (2).
This would seem to require some theoretical reason to believe our MODA in some sense covere... (read more)

Joe Collman

I agree with this.

Unfortunately, I think there's a fundamentally inside-view aspect of [problems very different from those we're used to]. I think looking for a range of frames is the right thing to do - but deciding on the relevance of the frame can only be done by looking at the details of the problem itself (if we instead use our usual heuristics for relevance-of-frame-x, we run into the same out-of-distribution issues).

I don't think there's a way around this. Aspects of this situation are fundamentally different from those we're used to. [Is different ... (read more)

then even if we reveal information, adversaries may still assume (likely correctly) we aren't sharing all our information

I think the same reasoning applies if they hack us: they'll assume that the stuff they were able to hack was the part we left suspiciously vulnerable, and the really important information is behind more serious security.

I expect they'll assume we're in control either way - once the stakes are really high.
It seems preferable to actually be in control.

I'll grant that it's far from clear that the best strategy would be used.

(apologies if I misinterpreted your assumptions in my previous reply)

Working on this seems good insofar as greater control implies more options. With good security, it's still possible to opt in to whatever weight-sharing / transparency mechanisms seem net positive - including with adversaries. Without security there's no option.

Granted, the [more options are likely better] conclusion is clearer if we condition on wise strategy.
However, [we have great security, therefore we're sharing nothing with adversaries] is clearly not a valid inference in general.

Garrett Baker
Not necessarily. If we have the option to hide information, then even if we reveal information, adversaries may still assume (likely correctly) we aren't sharing all our information, and are closer to a decisive strategic advantage than we appear. Even in the case where we do share all our information (which we won't).

Of course the [more options are likely better] conclusion holds if the lumbering, slow, disorganized, and collectively stupid organizations which have those options somehow perform the best strategy, but they're not actually going to take the best strategy. Especially when it comes to US-China relations.

ETA: I don't think the conclusion holds if that is true in general, and I don't think I ever assumed or argued it was true in general.
Joe Collman

I think this is great overall.

One area I'd ideally prefer a clearer presentation/framing is "Safety/performance trade-offs".

I agree that it's better than "alignment tax", but I think it shares one of the core downsides:

  • If we say "alignment tax" many people will conclude ["we can pay the tax and achieve alignment" and "the alignment tax isn't infinite"].
  • If we say "Safety/performance trade-offs" many people will conclude ["we know how to make systems safe, so long as we're willing to sacrifice performance" and "performance sacrifice won't imply any hard limi
... (read more)
David Scott Krueger (formerly: capybaralet)
Really interesting point!

I introduced this term in my slides that included "paperweight" as an example of an "AI system" that maximizes safety.

I sort of still think it's an OK term, but I'm sure I will keep thinking about this going forward and hope we can arrive at an even better term.
Joe Collman

I think the DSA framing is in keeping with the spirit of "first critical try" discourse.
(With that in mind, the below is more "this too seems very important", rather than "omitting this is an error".)

However, I think it's important to consider scenarios where humans lose meaningful control without any AI or group of AIs necessarily gaining a DSA. I think "loss of control" is the threat to think about, not "AI(s) take(s) control". Admittedly this gets into Moloch-related grey areas - but this may indicate that [humans do/don't have control] is too coarse-gr... (read more)

faul_sname

Does any specific human or group of humans currently have "control" in the sense of "that which is lost in a loss-of-control scenario"? If not, that indicates to me that it may be useful to frame the risk as "failure to gain control".

Joe Collman

On your (2), I think you're ignoring an understanding-related asymmetry:

  1. Without clear models describing (a path to) a solution, it is highly unlikely we have a workable solution to a deep and complex problem:
    1. Absence of concrete [we have (a path to) a solution] is pretty strong evidence of absence.
      [EDIT for clarity, by "we have" I mean "we know of", not "there exists"; I'm not claiming there's strong evidence that no path to a solution exists]
  2. Whether or not we have clear models of a problem, it is entirely possible for it to exist and to kill us:
    1. Absence of
... (read more)

This makes it even clearer that Altman’s claims of ignorance were lies – he cannot possibly have believed that former employees unanimously signed non-disparagements for free!

This is still quoting Neel, right? Presumably you intended to indent it.

Zvi
Yes this is quoting Neel.

Have you looked through the FLI faculty listed there?
How many seem useful supervisors for this kind of thing? Why?

If we're sticking to the [generate new approaches to core problems] aim, I can see three or four I'd be happy to recommend, conditional on their agreeing upfront to the exploratory goals, and agreeing that publication would not be necessary (or settling on a very low concrete number).

There are about ten more that seem not-obviously-a-terrible-idea, but probably not great (e.g. those who I expect have a decent understanding of the core problems, but basi... (read more)

A few points here (all with respect to a target of "find new approaches to core problems in AGI alignment"):

It's not clear to me what the upside of the PhD structure is supposed to be here (beyond respectability). If the aim is to avoid being influenced by most of the incentives and environment, that's more easily achieved by not doing a PhD. (to the extent that development of research 'taste'/skill acts to service a publish-or-perish constraint, that's likely to be harmful)

This is not to say that there's nothing useful about an academic context - only tha... (read more)

Ryan Kidd
I do think category theory professors or similar would be reasonable advisors for certain types of MIRI research.
TsviBT
I broadly agree with this. (And David was like .7 out of the 1.5 profs on the list who I guessed might genuinely want to grant the needed freedom.) I do think that people might do good related work in math (specifically, probability/information theory, logic, etc.--stuff about formalized reasoning), philosophy (of mind), and possibly in other places such as theoretical linguistics. But this would require that the academic context is conducive to good novel work in the field, which lower bar is probably far from universally met; and would require the researcher to have good taste. And this is "related" in the sense of "might write a paper which leads to another paper which would be cited by [the alignment textbook from the future] for proofs/analogies/evidence about minds".

RFPs seem a good tool here for sure. Other coordination mechanisms too.
(And perhaps RFPs for RFPs, where sketching out high-level desiderata is easier than specifying parameters for [type of concrete project you'd like to see])

Oh and I think the MATS Winter Retrospective seems great from the [measure a whole load of stuff] perspective. I think it's non-obvious what conclusions to draw, but more data is a good starting point. It's on my to-do list to read it carefully and share some thoughts.

Joe Collman

I agree with Tsvi here (as I'm sure will shock you :)).

I'd make a few points:

  1. "our revealed preferences largely disagree with point 1" - this isn't clear at all. We know MATS' [preferences, given the incentives and constraints under which MATS operates]. We don't know what you'd do absent such incentives and constraints.
    1. I note also that "but we aren't Refine" has the form [but we're not doing x], rather than [but we have good reasons not to do x]. (I don't think MATS should be Refine, but "we're not currently 20% Refine-on-ramp" is no argument that it would
... (read more)
Ryan Kidd
I plan to respond regarding MATS' future priorities when I'm able (I can't speak on behalf of MATS alone here and we are currently examining priorities in the lead up to our Winter 2024-25 Program), but in the meantime I've added some requests for proposals to my Manifund Regrantor profile.

For reference there's this: What I learned running Refine 
When I talked to Adam about this (over 12 months ago), he didn't think there was much to say beyond what's in that post. Perhaps he's updated since.

My sense is that I view it as more of a success than Adam does. In particular, I think it's a bit harsh to solely apply the [genuinely new directions discovered] metric. Even when doing everything right, I expect the hit rate to be very low there, with [variation on current framing/approach] being the most common type of success.

Agreed that Refine's... (read more)

TsviBT
Ah thanks!

Mhm. In fact I'd want to apply a bar that's even lower, or at least different: [the extent to which the participants (as judged by more established alignment thinkers) seem to be well on the way to developing new promising directions--e.g. being relentlessly resourceful including at the meta-level; having both appropriate Babble and appropriate Prune; not shying away from the hard parts].

Agree that this is an issue, but I think it can be addressed--certainly at least well enough that there'd be worthwhile value-of-info in running such a thing.

I'd be happy to contribute a bit of effort, if someone else is taking the lead. I think most of my efforts will be directed elsewhere, but for example I'd be happy to think through what such a program should look like; help write justificatory parts of grant applications; and maybe mentor / similar.
Joe Collman

(understood that you'd want to avoid the below by construction through the specification)

I think the worries about a "least harmful path" failure mode would also apply to a "below 1 catastrophic event per millennium" threshold. It's not obvious to me that the vast majority of ways to [avoid significant risk of catastrophe-according-to-our-specification] wouldn't be highly undesirable outcomes.

It seems to me that "greatly penalize the additional facts which are enforced" is a two-edged sword: we want various additional facts to be highly likely, since our a... (read more)

Joe Collman

[again, the below is all in the spirit of "I think this direction is plausibly useful, and I'd like to see more work on it"]

not to have any mental influences on people other than those which factor through the system's pre-agreed goals being achieved in the world.

Sure, but this seems to say "Don't worry, the malicious superintelligence can only manipulate your mind indirectly". This is not the level of assurance I want from something calling itself "Guaranteed safe".

It is worth noting here that a potential failure mode is that a truly malicious general-pur

... (read more)
Joe Collman

This seems interesting, but I've seen no plausible case that there's a version of (1) that's both sufficient and achievable. I've seen Davidad mention e.g. approaches using boundaries formalization. This seems achievable, but clearly not sufficient. (boundaries don't help with e.g. [allow the mental influences that are desirable, but not those that are undesirable])

The [act sufficiently conservatively for safety, relative to some distribution of safety specifications] constraint seems likely to lead to paralysis (either of the form [AI system does nothing]... (read more)

davidad
Paralysis of the form "AI system does nothing" is the most likely failure mode. This is a "de-pessimizing" agenda at the meta-level as well as at the object-level. Note, however, that there are some very valuable and ambitious tasks (e.g. build robots that install solar panels without damaging animals or irreversibly affecting existing structures, and only talking to people via a highly structured script) that can likely be specified without causing paralysis, even if they fall short of ending the acute risk period.

"Locked into some least-harmful path" is a potential failure mode if the semantics or implementation of causality or decision theory in the specification framework are done in a different way than I hope. Locking in to a particular path massively reduces the entropy of the outcome distribution beyond what is necessary to ensure a reasonable risk threshold (e.g. 1 catastrophic event per millennium) is cleared. A FEEF objective (namely, minimize the divergence of the outcomes conditional on intervention from the outcomes conditional on filtering for the goal being met) would greatly penalize the additional facts which are enforced by the lock-in behaviours.

As a fail-safe, I propose to mitigate the downsides of lock-in by using time-bounded utility functions.
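A minimal formalization sketch of that FEEF objective, transcribing the parenthetical description above; the notation (O for outcomes, π for the intervention policy, G for "the pre-agreed goal is met") and the choice of KL as the divergence are my assumptions, not davidad's stated formalism:

```latex
% Hedged sketch, not davidad's actual framework.
% O = outcomes, \pi = intervention (policy), G = "goal is met".
% Minimize the divergence of outcomes-under-intervention from
% outcomes obtained by merely conditioning on the goal being met:
\min_{\pi}\; D_{\mathrm{KL}}\!\left( p\big(O \mid \mathrm{do}(\pi)\big) \,\middle\|\, p\big(O \mid G\big) \right)
```

On this reading, lock-in behaviours are penalized because the extra facts they enforce make the intervention distribution much lower-entropy than, and hence far from, the goal-filtered distribution.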
davidad
It seems plausible to me that, until ambitious value alignment is solved, ASL-4+ systems ought not to have any mental influences on people other than those which factor through the system's pre-agreed goals being achieved in the world. That is, ambitious value alignment seems like a necessary prerequisite for the safety of ASL-4+ general-purpose chatbots. However, world-changing GDP growth does not require such general-purpose capabilities to be directly available (rather than available via a sociotechnical system that involves agreeing on specifications and safety guardrails for particular narrow deployments).

It is worth noting here that a potential failure mode is that a truly malicious general-purpose system in the box could decide to encode harmful messages in irrelevant details of the engineering designs (which it then proves satisfy the safety specifications). But, I think sufficient fine-tuning with a GFlowNet objective will naturally penalise description complexity, and also penalise heavily biased sampling of equally complex solutions (e.g. toward ones that encode messages of any significance), and I expect this to reduce this risk to an acceptable level. I would like to fund a sleeper-agents-style experiment on this by the end of 2025.

So no, not disincentivizing making positive EV bets, but updating about the quality of decision-making that has happened in the past.

I think there's a decent case that such updating will indeed disincentivize making positive EV bets (in some cases, at least).

In principle we'd want to update on the quality of all past decision-making. That would include both [made an explicit bet by taking some action] and [made an implicit bet through inaction]. With such an approach, decision-makers could be punished/rewarded with the symmetry required to avoid undesirabl... (read more)

Some thoughts:

  1. Necessary conditions aren't sufficient conditions. Lists of necessary conditions can leave out the hard parts of the problem.
  2. The hard part of the problem is in getting a system to robustly behave according to some desirable pattern (not simply to have it know and correctly interpret some specification of the pattern).
    1. I don't see any reason to think that prompting would achieve this robustly.
    2. As an attempt at a robust solution, without some other strong guarantee of safety, this is indeed a terrible idea.
      1. I note that I don't expect trying it emp
... (read more)

I think it's important to distinguish between:

  1. Has understood a load of work in the field.
  2. Has understood all known fundamental difficulties.

It's entirely possible to achieve (1) without (2).
I'd be wary of assuming that any particular person has achieved (2) without good evidence.

Relevant here is Geoffrey Irving's AXRP podcast appearance. (if anyone already linked this, I missed it)

I think Daniel Filan does a good job there both in clarifying debate and in questioning its utility (or at least the role of debate-as-solution-to-fundamental-alignment-subproblems). I don't specifically remember satisfying answers to your (1)/(2)/(3), but figured it's worth pointing at regardless.

Despite not answering all possible goal-related questions a priori, the reductionist perspective does provide a tractable research program for improving our understanding of AI goal development. It does this by reducing questions about goals to questions about behaviors observable in the training data.

[emphasis mine]

This might be described as "a reductionist perspective". It is certainly not "the reductionist perspective", since reductionist perspectives need not limit themselves to "behaviors observable in the training data".

A more reasonable-to-my-mind b... (read more)

Sure, understood.

However, I'm still unclear what you meant by "This level of understanding isn't sufficient for superhuman persuasion.". If 'this' referred to [human coworker level], then you're correct (I now guess you did mean this ??), but it seems a mildly strange point to make. It's not clear to me why it'd be significant in the context without strong assumptions on correlation of capability in different kinds of understanding/persuasion.

I interpreted 'this' as referring to the [understanding level of current models]. In that case it's not clear to me... (read more)

ryan_greenblatt
Yep, I just literally meant, "human coworker level doesn't suffice". I was just making a relatively narrow argument here, sorry about the confusion.

Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn't true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn't sufficient for superhuman persuasion.

Both "better understanding" and in a sense "superhuman persuasion" seem to be too coarse a way to think about this (I realize you're responding to a claim-at-similar-coarseness).

Models don't need to be capable of a Pareto improvement on human per... (read more)

ryan_greenblatt
Note that I wasn't making this argument. I was just responding to one specific story and then noting "I'm pretty skeptical of the specific stories I've heard for wildly superhuman persuasion emerging from pretraining prior to human level R&D capabilities". This is obviously only one of many possible arguments.

Thanks for the thoughtful response.

A few thoughts:
If length is the issue, then replacing "leads" with "led" would reflect the reality.

I don't have an issue with titles like "...Improving safety..." since it has a [this is what this line of research is aiming at] vibe, rather than a [this is what we have shown] vibe. Compare "curing cancer using x" to "x cures cancer".
Also in that particular case your title doesn't suggest [we have achieved AI control]. I don't think it's controversial that control would improve safety, if achieved.

I agree that this isn't a... (read more)

I'd be curious what the take is of someone who disagrees with my comment.
(I'm mildly surprised, since I'd have predicted more of a [this is not a useful comment] reaction, than a [this is incorrect] reaction)

I'm not clear whether the idea is that:

  1. The title isn't an overstatement.
  2. The title is not misleading. (e.g. because "everybody knows" that it's not making a claim of generality/robustness)
  3. The title will not mislead significant amounts of people in important ways. It's marginally negative, but not worth time/attention.
  4. There are upsides to the current nam
... (read more)
ryan_greenblatt
I disagreed due to a combination of 2, 3, and 4. (Where 5 feeds into 2 and 3.)

For 4, the upside is just that the title is less long and confusingly caveated. Norms around titles seem ok to me given issues with space. Do you have issues with our recent paper title "AI Control: Improving Safety Despite Intentional Subversion"? (Which seems pretty similar IMO.) Would you prefer this paper was "AI Control: Improving Safety Despite Intentional Subversion in a code backdooring setting"? (We considered titles more like this, but they were too long : (.)

Often with this sort of paper, you want to make some sort of conceptual point in your title (e.g. debate seems promising), but where the paper is only weak evidence for the conceptual point and most of the evidence is just that the method seems generally reasonable.

I think some fraction of the general mass of people in the AI safety community (e.g. median person working at some safety org or persistently lurking on LW) reasonably often get misled into thinking results are considerably stronger than they are based on stuff like titles and summaries. However, I don't think improving titles has very much alpha here. (I'm much more into avoiding overstating claims in other things like abstracts, blog posts, presentations, etc.)

While I like the paper and think the title is basically fine, I think the abstract is misleading and seems to unnecessarily overstate their results IMO; there is enough space to do better. I'll probably gripe about this in another comment.

My reaction is mostly "this isn't useful", but this is implicitly a disagreement with stuff like "but here it may actually matter if e.g. those working in governance think that you've actually shown ...".

Interesting - I look forward to reading the paper.

However, given that most people won't read the paper (or even the abstract), could I appeal for paper titles that don't overstate the generality of the results. I know it's standard practice in most fields not to bother with caveats in the title, but here it may actually matter if e.g. those working in governance think that you've actually shown "Debating with More Persuasive LLMs Leads to More Truthful Answers", rather than "In our experiments, Debating with More Persuasive LLMs Led to More Truthful Answer... (read more)

the gears to ascension
Being misleading about this particular thing - whether persuasion is uniformly good - could have significant negative externalities, so I'd propose that it is important in this particular case to have a title that reduces the likelihood of title misuse. I'd hope that the title can be changed in an amended version fairly soon, so that the paper doesn't have a chance to spread too far in labs before the title clarification. I do expect a significant portion of people to not be vulnerable to this problem, but I'm thinking in terms of edge case risk in the first place here, so that doesn't change my opinion much.
Joe Collman
I'd be curious what the take is of someone who disagrees with my comment. (I'm mildly surprised, since I'd have predicted more of a [this is not a useful comment] reaction, than a [this is incorrect] reaction)

I'm not clear whether the idea is that:

  1. The title isn't an overstatement.
  2. The title is not misleading. (e.g. because "everybody knows" that it's not making a claim of generality/robustness)
  3. The title will not mislead significant amounts of people in important ways. It's marginally negative, but not worth time/attention.
  4. There are upsides to the current name, and it seems net positive. (e.g. if it'd get more attention, and [paper gets attention] is considered positive)
  5. This is the usual standard, so [it's fine] or [it's silly to complain about] or ...?
  6. Something else.

I'm not claiming that this is unusual, or a huge issue on its own. I am claiming that the norms here seem systematically unhelpful. I'm more interested in the general practice than this paper specifically (though I think it's negative here).

I'd be particularly interested in a claim of (4) - and whether the idea here is something like [everyone is doing this, it's an unhelpful equilibrium, but if we unilaterally depart from it it'll hurt what we care about and not fix the problem]. (this seems incorrect to me, but understandable)

Thanks for the link.

I find all of this plausible. However, I start to worry when we need to rely on "for all" assumptions based on intuition. (also, I worry in large part because domains are a natural way to think here - it's when things feel natural that we forget we're making assumptions)

I can buy that [most skills in a domain correlate quite closely] and that [most problematic skills/strategies exist in a small number of domains]. The 'all' versions are much less clear.
