I'm a researcher on the technical governance team at MIRI.
Views expressed are my own, and should not be taken to represent official MIRI positions. Similarly, views within the technical governance team do vary.
Previously:
Quick takes on the above:
That said, several of the ideas you outline above seem to be founded on likely-to-be-false assumptions.
Insofar as you're aiming for a strategy that provides broadly correct information to policymakers, this seems undesirable - particularly where you may be setting up unrealistic expectations.
A conservative approach to AI alignment doesn’t require slowing progress, avoiding open sourcing etc. Alignment and innovation are mutually necessary, not mutually exclusive: if alignment R&D indeed makes systems more useful and capable, then investing in alignment is investing in US tech leadership.
Here and in the case for a negative alignment tax, I think you're:
In particular, there'll naturally be some crossover between [set of research that's helpful for alignment] and [set of research that leads to innovation and capability advances] - but this alone says very little.
What we'd need is something like:
It'd be lovely if something like this were true - it'd be great if we could leverage economic incentives to push towards sufficient-for-long-term-safety research progress. However, the above statement seems near-certainly false to me. I'd be (genuinely!) interested in a version of that statement you'd endorse at >5% probability.
The rest of that paragraph seems broadly reasonable, but I don't see how you get to "doesn't require slowing progress".
First, a point that relates to the 'alignment' disambiguation above.
In the case for a negative alignment tax, you offer the following quote as support for alignment/capability synergy:
...Behaving in an aligned fashion is just another capability... (Anthropic quote from Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback)
However, the capability is [ability to behave in an aligned fashion], and not [tendency to actually behave in an aligned fashion] (granted, Anthropic didn't word things precisely here). The latter is a propensity, not a capability.
What we need for scalable alignment is the propensity part: no-one sensible is suggesting that superintelligences wouldn't have the ability to behave in an aligned fashion. The [behavior-consistent-with-alignment]/capability synergy exists only so long as a major challenge is getting systems to be able to behave desirably.
Once capabilities reach autonomous-x-risk level, the major challenge will be getting systems to actually exhibit robustly aligned behavior. At that point there'll be no reason to expect the synergy - and so no basis to expect a negative or low alignment tax where it matters.
On things like "Cooperative/prosocial AI systems", I'd note that hits-based exploration is great - but please don't expect it to work (and that "if implemented into AI systems in the right ways" is almost all of the problem).
On this basis, it seems to me that the conservative-friendly case you've presented doesn't stand up at all (to be clear, I'm not critiquing the broader claim that outreach and cooperation are desirable):
Given our lack of precise understanding of the risks, we'll likely have to choose between [overly restrictive regulation] and [dangerously lax regulation] - we don't have the understanding to draw the line in precisely the right place. (completely agree that for non-frontier systems, it's best to go with little regulation)
I'd prefer a strategy that includes [policymakers are made aware of hard truths] somewhere.
I don't think we're in a world where sufficient measures are convenient.
It's unsurprising that conservatives are receptive to quite a bit of this "when coupled with ideas around negative alignment taxes and increased economic competitiveness" - but this just seems like wishful thinking and poor expectation management to me.
Similarly, I don't see a compelling case for:
that is, where alignment techniques are discovered that render systems more capable by virtue of their alignment properties. It seems quite safe to bet that significant positive alignment taxes simply will not be tolerated by the incoming federal Republican-led government—the attractor state of more capable AI will simply be too strong.
Of course this is true by default - in worlds where decision-makers continue not to appreciate the scale of the problem, they'll stick to their standard approaches. However, conditional on their understanding the situation, and understanding that at least so far we have not discovered techniques through which some alignment/capability synergy keeps us safe, this is much less obvious.
I have to imagine that there is some level of perceived x-risk that snaps politicians out of their default mode.
I'd bet on [Republicans tolerate significant positive alignment taxes] over [alignment research leads to a negative alignment tax on autonomous-x-risk-capable systems] at at least ten to one odds (though I'm not clear how to operationalize the latter).
Republicans are more flexible than reality :).
As I understand the term, alignment tax compares [lowest cost for us to train a system with some capability level] against [lowest cost for us to train an aligned system with some capability level]. Systems in the second category are also in the first category, so zero tax is the lower bound.
This seems a better definition, since it focuses on the outputs, and there's no need to handwave about what counts as an alignment-flavored training technique: it's just [...any system...] vs [...aligned system...].
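For concreteness, here's a rough way to write that comparison down (a sketch only - the tax(C), cost(S) and capability(S) symbols below are informal stand-ins I'm introducing, not precise notions):

```latex
% A rough sketch of the comparison above; tax, cost, capability and "aligned" are
% informal stand-ins (my notation), not precise notions. C is a fixed capability level.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\[
  \mathrm{tax}(C)
  \;=\;
  \underbrace{\min_{\substack{S\ \mathrm{aligned},\\ \mathrm{capability}(S)=C}} \mathrm{cost}(S)}_{\text{cheapest aligned system at level } C}
  \;-\;
  \underbrace{\min_{\mathrm{capability}(S)=C} \mathrm{cost}(S)}_{\text{cheapest system at level } C}
  \;\ge\; 0
\]
% Aligned systems at capability level C are a subset of all systems at capability
% level C, so the first minimum is taken over a smaller set and can't be smaller
% than the second: zero tax is the lower bound.
\end{document}
```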
Separately, I'm not crazy about the term: it can suggest to new people that we know how to scalably align systems at all. Talking about "lowering the alignment tax" from infinity strikes me as an odd picture.
I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.
I don't think this is quite right.
Two major objections to the bio-anchors 30-year-median conclusion might be: (1) that the various free parameters were set via something like motivated reasoning; (2) that treating what is at best an upper-bound estimate as a median estimate is a logic error.
To me, (2) is the more obvious error. I basically buy (1) too, but I don't think we've gotten empirical evidence for it, given (2).
I guess there's a sense in which a mistake on (2) could be seen as a consequence of (1) - but it seems distinct: it's a logic error, not a free parameter. I do think it's useful to distinguish [motivated reasoning in free-parameter choice] from [motivated reasoning in error-checking].
It's not so obvious to me that the bio-anchors report was without foundation as an upper bound estimate.
To be clear, I only think this becomes obviously significant in a train-against-interpretability-tools context; if we're not training against tools, the things I'm highlighting here seem intuitively highly unlikely.
This still seems important, given that the combination [x is clearly a bad idea] and [x is the easiest path forward] does not robustly result in [nobody does x].
That said, I think I'd disagree on one word of the following:
The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations ~~to~~ that help orchestrate the actions required to do it. This is true even if they've been selected for passively.
Any internal representations etc. need to result in the actions being orchestrated (to the extent necessary). They don't need to be used for the purpose of orchestrating the actions. They don't need to be [clearly selected for the purpose of orchestrating such actions].
In some cases this might be improbable without active, deliberate malign selection - but this would need to be argued for. In at least the case of [manipulating the interpreter], this doesn't seem at all improbable, given the right selection pressure. (some manipulation seems obviously possible; whether it's enough for a particular malign strategy would vary)
I haven't thought about this for long, but in general the [this kind of thing ~must require active deliberation] intuition seems highly suspicious to me. All I need to set up a not-directly-deliberated-on malign strategy so that it's implemented via benign cognition is a sufficient set of innocuous correlations (both of [innocuous action tends to have useful-for-malign-strategy effect] and [innocuous actions will tend to occur in these predictable patterns]).
If I have unlimited passive selection pressure against tools that robustly detect active malign deliberation, but not 'instinctive' malign strategies, I'd strongly expect the outcome to be 'instinctive' active manipulation (etc.). (assuming of course that successful manipulation is instrumentally useful)
Whether I'd expect this in practice would come down to how much selection pressure were required for a particular [malign, active-with-no-malign-cognition strategy].
My current answer to this is not [this is extremely improbable]. It's more like [This seems intuitively plausible; I have no idea on the probability so I'm going to take this possibility seriously until I see a non-handwaving argument that shows it to be extremely unlikely].
Information flow might be inhibited actively through an agent's actions. The primary way this could happen is gradient hacking, but it's not the only kind of action an AI might take to conceal misaligned thoughts. Of course, **active circumvention methods require that interpreters either can't see or aren't looking at the thoughts that generate those actions**.
Most potential circumvention methods that can be passive can also be active. **But some methods can only be active.**
It seems to me that there's no fixed notion of "active" that works for both paragraphs here.
If active means [is achieved through the agent's actions], then this does not in general imply that it is deliberately achieved through the agent's actions. For example, training against interpretability tools might produce actions that hide misaligned thoughts/actions as side-effects.
With this notion of 'active' the first bolded section doesn't hold: this can happen even when the agent's thoughts are entirely visible.
If instead active means [is achieved deliberately through the agent's actions], then the claim "But some methods can only be active" doesn't hold.
There are two dimensions here: whether the circumvention is achieved through the agent's actions, and whether it is achieved deliberately (i.e. via cognition aimed at the circumvention).
In particular, the mechanisms you've labelled "strictly active" can, in principle, be selected for passively - so do not in general require any misaligned thoughts (admittedly, the obvious way this happens is by training against interpretability tools).
I don't think [gain a DSA] is the central path here.
It's much closer to [persuade some broad group that already has a lot of power collectively].
I.e. the likely mechanism is not: [add the property [has DSA] to [group that will do the right thing]].
But closer to: [add the property [will do the right thing] to [group that has DSA]].
It may be better to think about it that way, yes - in some cases, at least.
Probably it makes sense to throw in some more variables.
Something like:
In these terms, [loss of control] is something like [ensuring important properties becomes much more expensive (or impossible)].
Do you see this as likely to have been avoidable? How?
I agree that it's undesirable. Less clear to me that it's an "own goal".
Do you see other specific things we're doing now (or that we may soon do) that seem likely to be future-own-goals?
[all of the below is "this is how it appears to my non-expert eyes"; I've never studied such dynamics, so perhaps I'm missing important factors]
I expect that, even early on, e/acc actively looked for sources of long-term disagreement with AI safety advocates, so it doesn't seem likely to me that [AI safety people don't emphasize this so much] would have much of an impact.
I expect that anything less than a position of [open-source will be fine forever] would have had much the same impact - though perhaps a little slower. (granted, there's potential for hindsight bias here, so I shouldn't say "I'm confident that this was inevitable", but it's not at all clear to me that it wasn't highly likely)
It's also not clear to me that any narrow definition of [AI safety community] was in a position to prevent some claims that open-source will be unacceptably dangerous at some point. E.g. IIRC Geoffrey Hinton rhetorically compared it to giving everyone nukes quite a while ago.
Reducing focus on [desirable, but controversial, short-term wins] seems important to consider where non-adversarial groups are concerned. It's less clear that it helps against (proto-)adversarial groups - unless you're proposing some kind of widespread, strict message discipline (I assume that you're not).
[EDIT: for useful replies to this, see Richard's replies to Akash above]
On your bottom line, I entirely agree - to the extent that there are non-power-seeking strategies that'd be effective, I'm all for them. To the extent that we disagree, I think it's about [what seems likely to be effective] rather than [whether non-power-seeking is a desirable property].
Constrained-power-seeking still seems necessary to me. (unfortunately)
A few clarifications:
E.g. prioritizing competence means that you'll try less hard to get "your" person into power. Prioritizing legitimacy means you're making it harder to get your own ideas implemented, when others disagree.
That's clarifying. In particular, I hadn't realized you meant to imply [legitimacy of the 'community' as a whole] in your post.
I think both are good examples in principle, given the point you're making. I expect neither to work in practice, since I don't think that either [broad competence of decision-makers] or [increased legitimacy of broad (and broadening!) AIS community] help us much at all in achieving our goals.
To achieve our goals, I expect we'll need something much closer to 'our' people in power (where 'our' means [people with a pretty rare combination of properties, conducive to furthering our goals]), and increased legitimacy for [narrow part of the community I think is correct].
I think we'd need to go with [aim for a relatively narrow form of power], since I don't think accumulating less power will work. (though it's a good plan, to the extent that it's possible)
Some thoughts: