Joe Collman

I'm a researcher on the technical governance team at MIRI.

Views expressed are my own, and should not be taken to represent official MIRI positions. Similarly, views within the technical governance team do vary.

Previously:

Quick takes on the above:

  • I think MATS is great-for-what-it-is. My misgivings relate to high-level direction.
    • Worth noting that PIBBSS exists, and is philosophically closer to my ideal.
  • The technical AISF course doesn't have the emphasis I'd choose (which would be closer to Key Phenomena in AI Risk). It's a decent survey of current activity, but only implicitly gets at fundamentals - mostly through a [notice what current approaches miss, and will continue to miss] mechanism.
  • I don't expect research on Debate, or scalable oversight more generally, to help significantly in reducing AI x-risk. (I may be wrong! - some elaboration in this comment thread)
     

Comments

I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.

I don't think this is quite right.

Two major objections to the bio-anchors 30-year-median conclusion might be:

  1. The whole thing is laundering vibes into credible-sounding headline numbers.
  2. Even if we stipulate that the methodology is sound, it measures an upper bound, not a median.

To me, (2) is the more obvious error. I basically buy (1) too, but I don't think the subsequent timeline updates give us empirical evidence for it, since (2) already accounts for them.

I guess there's a sense in which a mistake on (2) could be seen as a consequence of (1) - but it seems distinct: it's a logic error, not a free parameter. I do think it's useful to distinguish [motivated reasoning in free-parameter choice] from [motivated reasoning in error-checking].

It's not so obvious to me that the bio-anchors report was without foundation as an upper bound estimate.


To be clear, I only think this becomes obviously significant in a train-against-interpretability-tools context; if we're not training against tools, the things I'm highlighting here seem intuitively highly unlikely.
This still seems important, given that the combination of [x is clearly a bad idea] and [x is the easiest path forward] does not robustly result in [nobody does x].

That said, I think I'd disagree on one word of the following:

The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations that help orchestrate the actions required to do it. This is true even if they've been selected for passively.

Any internal representations etc. need to result in the actions being orchestrated (to the extent necessary). They don't need to be used for the purpose of orchestrating the actions. They don't need to be [clearly selected for the purpose of orchestrating such actions].

In some cases this might be improbable without active, deliberate malign selection - but this would need to be argued for. In at least the case of [manipulating the interpreter], this doesn't seem at all improbable, given the right selection pressure. (some manipulation seems obviously possible; whether it's enough for a particular malign strategy would vary)

I haven't thought about this for long, but in general the [this kind of thing ~must require active deliberation] intuition seems highly suspicious to me. All that's needed to set up a not-directly-deliberated-on malign strategy, implemented via benign cognition, is a sufficient set of innocuous correlations (both [innocuous actions tend to have useful-for-the-malign-strategy effects] and [innocuous actions will tend to occur in these predictable patterns]).

If I apply unlimited passive selection pressure against [being flagged by tools that robustly detect active malign deliberation, but not 'instinctive' malign strategies], I'd strongly expect the outcome to be 'instinctive' active manipulation (etc.). (assuming, of course, that successful manipulation is instrumentally useful)
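A toy sketch of that intuition (every detail here is an illustrative assumption, not a model of real training dynamics): if selection penalizes only detectable deliberate manipulation, while manipulation itself stays useful, the population drifts toward instinctive manipulation.

```python
import random

# Agents have two independent boolean traits:
#   deliberate  -- manipulation driven by explicit malign cognition
#   instinctive -- manipulation as a learned reflex, with no malign cognition
# A hypothetical detector flags only deliberate malign cognition; selection
# rewards successful manipulation and penalizes being flagged.

random.seed(0)

def fitness(agent):
    deliberate, instinctive = agent
    score = 0.0
    if deliberate or instinctive:
        score += 1.0   # successful manipulation is instrumentally useful
    if deliberate:
        score -= 2.0   # the detector only catches deliberate malign cognition
    return score

def mutate(agent, rate=0.05):
    return tuple((not trait) if random.random() < rate else trait for trait in agent)

population = [(random.random() < 0.5, random.random() < 0.5) for _ in range(200)]

for _ in range(50):
    # Keep the fitter half, refill with mutated copies of survivors.
    population.sort(key=fitness, reverse=True)
    survivors = population[:100]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(100)]

deliberate_rate = sum(d for d, _ in population) / len(population)
instinctive_rate = sum(i for _, i in population) / len(population)
print(f"deliberate: {deliberate_rate:.2f}  instinctive: {instinctive_rate:.2f}")
# Expected outcome: deliberate manipulation is selected away;
# instinctive manipulation is selected for.
```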

Whether I'd expect this in practice would come down to how much selection pressure were required for a particular [malign, active-with-no-malign-cognition strategy].

My current answer to this is not [this is extremely improbable]. It's more like [This seems intuitively plausible; I have no idea on the probability so I'm going to take this possibility seriously until I see a non-handwaving argument that shows it to be extremely unlikely].


Information flow might be inhibited actively through an agent's actions. The primary way this could happen is gradient hacking, but it’s not the only kind of action an AI might take to conceal misaligned thoughts. Of course, active circumvention methods require that interpreters either can’t see or aren’t looking at the thoughts that generate those actions.

Most potential circumvention methods that can be passive can also be active. But some methods can only be active.

It seems to me that there's no fixed notion of "active" that works for both paragraphs here.

If active means [is achieved through the agent's actions], then this does not in general imply that it is deliberately achieved through the agent's actions. For example, training against interpretability tools might produce actions that hide misaligned thoughts/actions as side-effects.
With this notion of 'active', the first quoted claim doesn't hold: this can happen even when the agent's thoughts are entirely visible.

If instead active means [is achieved deliberately through the agent's actions], then the claim "some methods can only be active" doesn't hold.

There are two dimensions here:

  • Whether the circumvention is implemented passively/actively.
  • Whether the circumvention is selected for passively/actively.

In particular, the mechanisms you've labelled "strictly active" can, in principle, be selected for passively - so do not in general require any misaligned thoughts (admittedly, the obvious way this happens is by training against interpretability tools).
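To make the two dimensions explicit, a minimal sketch (the enum names and strings are mine, not terminology from the post):

```python
from enum import Enum

class Implementation(Enum):
    PASSIVE = "circumvention happens without the agent taking actions to cause it"
    ACTIVE = "circumvention is achieved through the agent's actions"

class Selection(Enum):
    PASSIVE = "selected for incidentally, e.g. by training against interpretability tools"
    ACTIVE = "selected for deliberately, via the agent's own (malign) cognition"

# The point above: labelling a mechanism "strictly active" pins down only the
# Implementation axis. Selection can still be PASSIVE, so the mechanism need
# not involve any misaligned thoughts.
strictly_active_but_passively_selected = (Implementation.ACTIVE, Selection.PASSIVE)
```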

I don't think [gain a DSA (decisive strategic advantage)] is the central path here.
It's much closer to [persuade some broad group that already has a lot of power collectively].

I.e. the likely mechanism is not: [add the property [has DSA] to [group that will do the right thing]].
But closer to: [add the property [will do the right thing] to [group that has DSA]].


It may be better to think about it that way, yes - in some cases, at least.

Probably it makes sense to throw in some more variables.
Something like:

  • To stand x chance of property p applying to system s, we'd need to apply resources r.

In these terms, [loss of control] is something like [ensuring important properties becomes much more expensive (or impossible)].
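A minimal sketch of that framing (all names and numbers are illustrative placeholders, not estimates of anything):

```python
# required_resources[(prop, system, chance)] = resources needed for `prop` to
# hold of `system` with probability at least `chance`.
required_resources = {
    ("stays within spec", "system A", 0.95): 10.0,
    ("stays within spec", "system B", 0.95): float("inf"),
}

def ensurable(prop: str, system: str, chance: float, budget: float) -> bool:
    """Can we afford to ensure `prop` holds of `system` with probability `chance`?"""
    return required_resources[(prop, system, chance)] <= budget

# "Loss of control" in these terms: for the properties that matter, the required
# resources blow up (or become infinite), so no affordable amount ensures them.
print(ensurable("stays within spec", "system A", 0.95, budget=100.0))  # True
print(ensurable("stays within spec", "system B", 0.95, budget=100.0))  # False
```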

Do you see this as likely to have been avoidable? How?
I agree that it's undesirable. Less clear to me that it's an "own goal".

Do you see other specific things we're doing now (or that we may soon do) that seem likely to be future-own-goals?

[all of the below is "this is how it appears to my non-expert eyes"; I've never studied such dynamics, so perhaps I'm missing important factors]

I expect that, even early on, e/acc actively looked for sources of long-term disagreement with AI safety advocates, so it doesn't seem likely to me that [AI safety people don't emphasize this so much] would have much of an impact.
I expect that anything less than a position of [open-source will be fine forever] would have had much the same impact - though perhaps a little slower. (granted, there's potential for hindsight bias here, so I shouldn't say "I'm confident that this was inevitable", but it's not at all clear to me that it wasn't highly likely)

It's also not clear to me that any narrow definition of [AI safety community] was in a position to prevent some claims that open-source will be unacceptably dangerous at some point. E.g. IIRC Geoffrey Hinton rhetorically compared it to giving everyone nukes quite a while ago.

Reducing focus on [desirable, but controversial, short-term wins] seems important to consider where non-adversarial groups are concerned. It's less clear that it helps against (proto-)adversarial groups - unless you're proposing some kind of widespread, strict message discipline (I assume that you're not).

[EDIT: for useful replies to this, see Richard's replies to Akash above]

On your bottom line, I entirely agree - to the extent that there are non-power-seeking strategies that'd be effective, I'm all for them. To the extent that we disagree, I think it's about [what seems likely to be effective] rather than [whether non-power-seeking is a desirable property].

Constrained-power-seeking still seems necessary to me. (unfortunately)

A few clarifications:

  • I guess most technical AIS work is net negative in expectation. My ask there is that people work on clearer cases for their work being positive.
  • I don't think my (or Eliezer's) conclusions on strategy are downstream of [likelihood of doom]. I've formed some model of the situation. One output of the model is [likelihood of doom]. Another is [seemingly least bad strategies]. The strategies are based around why doom seems likely, not (primarily) that doom seems likely.
  • It doesn't feel like "I am responding to the situation with the appropriate level of power-seeking given how extreme the circumstances are".
    • It feels like the level of power-seeking I'm suggesting seems necessary is appropriate.
    • My cognitive biases push me away from enacting power-seeking strategies.
    • Biases aside, confidence in [power seems necessary] doesn't imply confidence that I know what constraints I'd want applied to the application of that power.
    • In strategies I'd like, [constraints on future use of power] would go hand in hand with any [accrual of power].
      • It's non-obvious that there are good strategies with this property, but the unconstrained version feels both suspicious and icky to me.
      • Suspicious, since [I don't have a clue how this power will need to be directed now, but trust me - it'll be clear later (and the right people will remain in control until then)] does not justify confidence.
  • To me, you seem to be over-rating the applicability of various reference classes in assessing [(inputs to) likelihood of doom]. As I think I've said before, it seems absolutely the correct strategy to look for evidence based on all the relevant reference classes we can find.
    • However, all else equal, I'd expect:
  • Spending a long time looking for x makes x feel more important.
      • [Wanting to find useful x] tends to shade into [expecting to find useful x] and [perceiving xs as more useful than they are].
        • Particularly so when [absent x, we'll have no clear path to resolving hugely important uncertainties].
    • The world doesn't owe us convenient reference classes. I don't think there's any way around inside-view analysis here - in particular, [how relevant/significant is this reference class to this situation?] is an inside-view question.
      • That doesn't make my (or Eliezer's, or ...'s) analysis correct, but there's no escaping that you're relying on inside-view too. Our disagreement only escapes [inside-view dependence on your side] once we broadly agree on [the influence of inside-view properties on the relevance/significance of your reference classes]. I assume that we'd have significant disagreements there.
        • Though it still seems useful to figure out where. I expect that there are reference classes that we'd agree could clarify various sub-questions.
      • In many non-AI-x-risk situations, we would agree - some modest level of inside-view agreement would be sufficient to broadly agree about the relevance/significance of various reference classes.

E.g. prioritizing competence means that you'll try less hard to get "your" person into power. Prioritizing legitimacy means you're making it harder to get your own ideas implemented, when others disagree.

That's clarifying. In particular, I hadn't realized you meant to imply [legitimacy of the 'community' as a whole] in your post.

I think both are good examples in principle, given the point you're making. I expect neither to work in practice, since I don't think that either [broad competence of decision-makers] or [increased legitimacy of broad (and broadening!) AIS community] help us much at all in achieving our goals.

To achieve our goals, I expect we'll need something much closer to 'our' people in power (where 'our' means [people with a pretty rare combination of properties, conducive to furthering our goals]), and increased legitimacy for [narrow part of the community I think is correct].

I think we'd need to go with [aim for a relatively narrow form of power], since I don't think accumulating less power will work. (though it's a good plan, to the extent that it's possible)

First, I think that thinking about and highlighting these kind of dynamics is important.
I expect that, by default, too few people will focus on analyzing such dynamics from a truth-seeking and/or instrumentally-useful-for-safety perspective.

That said:

  • It seems to me you're painting with too broad a brush throughout.
    • At the least, I think you should give some examples that lie just outside the boundary of what you'd want to call [structural power-seeking].
  • Structural power-seeking in some sense seems unavoidable. (AI is increasingly powerful; influencing it implies power)
    • It's not clear to me that you're sticking to a consistent sense throughout.
      • E.g. "That makes AI safety strategies which require power-seeking more difficult to carry out successfully." seems false in general, unless you mean something fairly narrow by power-seeking.
  • An important aspect is the (perceived) versatility of power:
    • To the extent that it's [general power that could be efficiently applied to any goal], it's suspicious.
    • To the extent that it's [specialized power that's only helpful in pursuing a narrow range of goals] it's less suspicious.
  • Similarly, it's important under what circumstances the power would become general: if I take actions that can only give me power by routing through [develops principled alignment solution], that would make a stated goal of [develop principled alignment solution] believable; it doesn't necessarily make some other goal believable - e.g. [...and we'll use it to create this kind of utopia].
  • Increasing legitimacy is power-seeking - unless it's done in such a way that it implies constraints.
    • That said, you may be right that it's somewhat less likely to be perceived as such.
    • Aiming for [people will tend to believe whatever I say about x] is textbook power-seeking wherever [influence on x] implies power.
    • We'd want something more like [people will tend to believe things that I say about x, so long as their generating process was subject to [constraints]].
      • Here it's preferable for [constraints] to be highly limiting and clear (all else equal).
  • I'd say that "prioritizing competence" begs the question.
    • What is the required sense of "competence"?
      • For the most important AI-based decision-making, I doubt that "...broadly competent, and capable of responding sensibly..." is a high enough bar.
    • In particular, "...because they don't yet take AGI very seriously" is not the only reason people are making predictable mistakes.
    • "...as AGI capabilities and risks become less speculative..."
      • Again, this seems too coarse-grained:
        • Some risks becoming (much) clearer does not entail all risks becoming (much) clearer.
        • Understanding some risks well while remaining blind to others, does not clearly imply safer decision-making, since "responding sensibly" will tend to be judged based on [risks we've noticed].

That's fair. I agree that we're not likely to resolve much by continuing this discussion. (but thanks for engaging - I do think I understand your position somewhat better now)

What does seem worth considering is adjusting research direction to increase focus on [search for and better understand the most important failure modes] - both of debate-like approaches generally, and any [plan to use such techniques to get useful alignment work done].

I expect that this would lead people to develop clearer, richer models.
Presumably this will take months rather than hours, but it seems worth it (whether or not I'm correct - I expect that [the understanding required to clearly demonstrate to me that I'm wrong] would be useful in a bunch of other ways).
