I'm a researcher on the technical governance team at MIRI.
Views expressed are my own, and should not be taken to represent official MIRI positions. Similarly, views within the technical governance team do vary.
Previously:
Quick takes on the above:
That said, several of the ideas you outline above seem to be founded on likely-to-be-false assumptions.
Insofar as you're aiming for a strategy that provides broadly correct information to policymakers, this seems undesirable - particularly where you may be setting up unrealistic expectations.
A conservative approach to AI alignment doesn’t require slowing progress, avoiding open sourcing etc. Alignment and innovation are mutually necessary, not mutually exclusive: if alignment R&D indeed makes systems more useful and capable, then investing in alignment is investing in US tech leadership.
Here and in the case for a negative alignment tax, I think you're:
In particular, there'll naturally be some crossover between [set of research that's helpful for alignment] and [set of research that leads to innovation and capability advances] - but this alone says very little.
What we'd need is something like:
It'd be lovely if something like this were true - it'd be great if we could leverage economic incentives to push towards sufficient-for-long-term-safety research progress. However, the above statement seems near-certainly false to me. I'd be (genuinely!) interested in a version of that statement you'd endorse at >5% probability.
The rest of that paragraph seems broadly reasonable, but I don't see how you get to "doesn't require slowing progress".
First, a point that relates to the 'alignment' disambiguation above.
In the case for a negative alignment tax, you offer the following quote as support for alignment/capability synergy:
...Behaving in an aligned fashion is just another capability... (Anthropic quote from Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback)
However, the capability is [ability to behave in an aligned fashion], and not [tendency to actually behave in an aligned fashion] (granted, Anthropic didn't word things precisely here). The latter is a propensity, not a capability.
What we need for scalable alignment is the propensity part: no-one sensible is suggesting that superintelligences wouldn't have the ability to behave in an aligned fashion. The [behavior-consistent-with-alignment]/capability synergy exists only so long as a major challenge is getting systems to be able to behave desirably.
Once capabilities reach autonomous-x-risk level, the major challenge will be getting systems to actually exhibit robustly aligned behavior. At that point there'll be no reason to expect the synergy - and so no basis to expect a negative or low alignment tax where it matters.
On things like "Cooperative/prosocial AI systems", I'd note that hits-based exploration is great - but please don't expect it to work (and that "if implemented into AI systems in the right ways" is almost all of the problem).
On this basis, it seems to me that the conservative-friendly case you've presented doesn't stand up at all (to be clear, I'm not critiquing the broader claim that outreach and cooperation are desirable):
Given our lack of precise understanding of the risks, we'll likely have to choose between [overly restrictive regulation] and [dangerously lax regulation] - we don't have the understanding to draw the line in precisely the right place. (completely agree that for non-frontier systems, it's best to go with little regulation)
I'd prefer a strategy that includes [policymakers are made aware of hard truths] somewhere.
I don't think we're in a world where sufficient measures are convenient.
It's unsurprising that conservatives are receptive to quite a bit of this "when coupled with ideas around negative alignment taxes and increased economic competitiveness" - but this just seems like wishful thinking and poor expectation management to me.
Similarly, I don't see a compelling case for:
that is, where alignment techniques are discovered that render systems more capable by virtue of their alignment properties. It seems quite safe to bet that significant positive alignment taxes simply will not be tolerated by the incoming federal Republican-led government—the attractor state of more capable AI will simply be too strong.
Of course this is true by default - in worlds where decision-makers continue not to appreciate the scale of the problem, they'll stick to their standard approaches. However, conditional on their understanding the situation, and understanding that at least so far we have not discovered techniques through which some alignment/capability synergy keeps us safe, this is much less obvious.
I have to imagine that there is some level of perceived x-risk that snaps politicians out of their default mode.
I'd bet on [Republicans tolerate significant positive alignment taxes] over [alignment research leads to a negative alignment tax on autonomous-x-risk-capable systems] at at least ten to one odds (though I'm not clear how to operationalize the latter).
Republicans are more flexible than reality :).
As I understand the term, alignment tax compares [lowest cost for us to train a system with some capability level] against [lowest cost for us to train an aligned system with some capability level]. Systems in the second category are also in the first category, so zero tax is the lower bound.
This seems a better definition, since it focuses on the outputs, and there's no need to handwave about what counts as an alignment-flavored training technique: it's just [...any system...] vs [...aligned system...].
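For concreteness, here's a rough way to write that comparison down (a sketch only - the tax(C), cost(S) and capability(S) symbols below are informal stand-ins I'm introducing, not precise notions):

```latex
% A rough sketch of the comparison above; tax, cost, capability and "aligned" are
% informal stand-ins (my notation), not precise notions. C is a fixed capability level.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\[
  \mathrm{tax}(C)
  \;=\;
  \underbrace{\min_{\substack{S\ \mathrm{aligned},\\ \mathrm{capability}(S)=C}} \mathrm{cost}(S)}_{\text{cheapest aligned system at level } C}
  \;-\;
  \underbrace{\min_{\mathrm{capability}(S)=C} \mathrm{cost}(S)}_{\text{cheapest system at level } C}
  \;\ge\; 0
\]
% Aligned systems at capability level C are a subset of all systems at capability
% level C, so the first minimum is taken over a smaller set and can't be smaller
% than the second: zero tax is the lower bound.
\end{document}
```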
Separately, I'm not crazy about the term: it can suggest to new people that we know how to scalably align systems at all. Talking about "lowering the alignment tax" from infinity strikes me as an odd picture.
I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.
I don't think this is quite right.
Two major objections to the bio-anchors 30-year-median conclusion might be: (1) that the various free parameters were set via something like motivated reasoning; (2) that treating what is at best an upper-bound estimate as a median estimate is a logic error.
To me, (2) is the more obvious error. I basically buy (1) too, but I don't think we've gotten empirical evidence for it, given (2).
I guess there's a sense in which a mistake on (2) could be seen as a consequence of (1) - but it seems distinct: it's a logic error, not a free parameter. I do think it's useful to distinguish [motivated reasoning in free-parameter choice] from [motivated reasoning in error-checking].
It's not so obvious to me that the bio-anchors report was without foundation as an upper bound estimate.
To be clear, I only think this becomes obviously significant in a train-against-interpretability-tools context; if we're not training against tools, the things I'm highlighting here seem intuitively highly unlikely.
This still seems important, given that the combination [x is clearly a bad idea] and [x is the easiest path forward] does not robustly result in [nobody does x].
That said, I think I'd disagree on one word of the following:
The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations ~~to~~ that help orchestrate the actions required to do it. This is true even if they've been selected for passively.
Any internal representations etc. need to result in the actions being orchestrated (to the extent necessary). They don't need to be used for the purpose of orchestrating the actions. They don't need to be [clearly selected for the purpose of orchestrating such actions].
In some cases this might be improbable without active, deliberate malign selection - but this would need to be argued for. In at least the case of [manipulating the interpreter], this doesn't seem at all improbable, given the right selection pressure. (some manipulation seems obviously possible; whether it's enough for a particular malign strategy would vary)
I haven't thought about this for long, but in general the [this kind of thing ~must require active deliberation] intuition seems highly suspicious to me. All I need to set up a not-directly-deliberated-on malign strategy so that it's implemented via benign cognition is a sufficient set of innocuous correlations (both of [innocuous action tends to have useful-for-malign-strategy effect] and [innocuous actions will tend to occur in these predictable patterns]).
If I have unlimited passive selection pressure against tools that robustly detect active malign deliberation, but not 'instinctive' malign strategies, I'd strongly expect the outcome to be 'instinctive' active manipulation (etc.). (assuming of course that successful manipulation is instrumentally useful)
Whether I'd expect this in practice would come down to how much selection pressure were required for a particular [malign, active-with-no-malign-cognition strategy].
My current answer to this is not [this is extremely improbable]. It's more like [This seems intuitively plausible; I have no idea on the probability so I'm going to take this possibility seriously until I see a non-handwaving argument that shows it to be extremely unlikely].
Information flow might be inhibited actively through an agent's actions. The primary way this could happen is gradient hacking, but it's not the only kind of action an AI might take to conceal misaligned thoughts. Of course, **active circumvention methods require that interpreters either can't see or aren't looking at the thoughts that generate those actions**.
Most potential circumvention methods that can be passive can also be active. **But some methods can only be active.**
It seems to me that there's no fixed notion of "active" that works for both paragraphs here.
If active means [is achieved through the agent's actions], then this does not in general imply that it is deliberately achieved through the agent's actions. For example, training against interpretability tools might produce actions that hide misaligned thoughts/actions as side-effects.
With this notion of 'active' the first bolded section doesn't hold: this can happen even when the agent's thoughts are entirely visible.
If instead active means [is achieved deliberately through the agent's actions], then the claim "But some methods can only be active" doesn't hold.
There are two dimensions here: whether the circumvention is achieved through the agent's actions, and whether it is achieved deliberately (i.e. via cognition aimed at the circumvention).
In particular, the mechanisms you've labelled "strictly active" can, in principle, be selected for passively - so do not in general require any misaligned thoughts (admittedly, the obvious way this happens is by training against interpretability tools).
I don't think [gain a DSA] is the central path here.
It's much closer to [persuade some broad group that already has a lot of power collectively].
I.e. the likely mechanism is not: [add the property [has DSA] to [group that will do the right thing]].
But closer to: [add the property [will do the right thing] to [group that has DSA]].
It may be better to think about it that way, yes - in some cases, at least.
Probably it makes sense to throw in some more variables.
Something like:
In these terms, [loss of control] is something like [ensuring important properties becomes much more expensive (or impossible)].
Do you see this as likely to have been avoidable? How?
I agree that it's undesirable. Less clear to me that it's an "own goal".
Do you see other specific things we're doing now (or that we may soon do) that seem likely to be future-own-goals?
[all of the below is "this is how it appears to my non-expert eyes"; I've never studied such dynamics, so perhaps I'm missing important factors]
I expect that, even early on, e/acc actively looked for sources of long-term disagreement with AI safety advocates, so it doesn't seem likely to me that [AI safety people don't emphasize this so much] would have much of an impact.
I expect that anything less than a position of [open-source will be fine forever] would have had much the same impact - though perhaps a little slower. (granted, there's potential for hindsight bias here, so I shouldn't say "I'm confident that this was inevitable", but it's not at all clear to me that it wasn't highly likely)
It's also not clear to me that any narrow definition of [AI safety community] was in a position to prevent some claims that open-source will be unacceptably dangerous at some point. E.g. IIRC Geoffrey Hinton rhetorically compared it to giving everyone nukes quite a while ago.
Reducing focus on [desirable, but controversial, short-term wins] seems important to consider where non-adversarial groups are concerned. It's less clear that it helps against (proto-)adversarial groups - unless you're proposing some kind of widespread, strict message discipline (I assume that you're not).
[EDIT: for useful replies to this, see Richard's replies to Akash above]
On your bottom line, I entirely agree - to the extent that there are non-power-seeking strategies that'd be effective, I'm all for them. To the extent that we disagree, I think it's about [what seems likely to be effective] rather than [whether non-power-seeking is a desirable property].
Constrained-power-seeking still seems necessary to me. (unfortunately)
A few clarifications:
E.g. prioritizing competence means that you'll try less hard to get "your" person into power. Prioritizing legitimacy means you're making it harder to get your own ideas implemented, when others disagree.
That's clarifying. In particular, I hadn't realized you meant to imply [legitimacy of the 'community' as a whole] in your post.
I think both are good examples in principle, given the point you're making. I expect neither to work in practice, since I don't think that either [broad competence of decision-makers] or [increased legitimacy of broad (and broadening!) AIS community] help us much at all in achieving our goals.
To achieve our goals, I expect we'll need something much closer to 'our' people in power (where 'our' means [people with a pretty rare combination of properties, conducive to furthering our goals]), and increased legitimacy for [narrow part of the community I think is correct].
I think we'd need to go with [aim for a relatively narrow form of power], since I don't think accumulating less power will work. (though it's a good plan, to the extent that it's possible)
Some thoughts: