Some thoughts:
I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.
I don't think this is quite right.
Two major objections to the bio-anchors 30-year-median conclusion might be:
To me, (2) is the more obvious error. I basically buy (1) too, b...
To be clear, I only think this becomes obviously significant in a train-against-interpretability-tools context; if we're not training against tools, the things I'm highlighting here seem intuitively highly unlikely.
This still seems important, given that the combination [x is clearly a bad idea] and [x is the easiest path forward] does not robustly result in [nobody does x].
That said, I think I'd disagree on one word of the following:
...The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement su
Information flow might be inhibited actively through an agent's actions. The primary way this could happen is gradient hacking, but it’s not the only kind of action an AI might take to conceal misaligned thoughts. Of course, active circumvention methods require that interpreters either can’t see or aren’t looking at the thoughts that generate those actions.
Most potential circumvention methods that can be passive can also be active. But some methods can only be active.
It seems to me that there's no fixed notion of "active" that works for both para...
I don't think [gain a DSA] is the central path here.
It's much closer to [persuade some broad group that already has a lot of power collectively].
I.e. the likely mechanism is not: [add the property [has DSA] to [group that will do the right thing]].
But closer to: [add the property [will do the right thing] to [group that has DSA]].
It may be better to think about it that way, yes - in some cases, at least.
Probably it makes sense to throw in some more variables.
Something like:
In these terms, [loss of control] is something like [ensuring important properties becomes much more expensive (or impossible)].
Do you see this as likely to have been avoidable? How?
I agree that it's undesirable. Less clear to me that it's an "own goal".
Do you see other specific things we're doing now (or that we may soon do) that seem likely to be future-own-goals?
[all of the below is "this is how it appears to my non-expert eyes"; I've never studied such dynamics, so perhaps I'm missing important factors]
I expect that, even early on, e/acc actively looked for sources of long-term disagreement with AI safety advocates, so it doesn't seem likely to me that [AI safety people don't e...
On your bottom line, I entirely agree - to the extent that there are non-power-seeking strategies that'd be effective, I'm all for them. To the extent that we disagree, I think it's about [what seems likely to be effective] rather than [whether non-power-seeking is a desirable property].
Constrained-power-seeking still seems necessary to me. (unfortunately)
A few clarifications:
E.g. prioritizing competence means that you'll try less hard to get "your" person into power. Prioritizing legitimacy means you're making it harder to get your own ideas implemented, when others disagree.
That's clarifying. In particular, I hadn't realized you meant to imply [legitimacy of the 'community' as a whole] in your post.
I think both are good examples in principle, given the point you're making. I expect neither to work in practice, since I don't think that either [broad competence of decision-makers] or [increased legitimacy of broad (and broadeni...
First, I think that thinking about and highlighting these kinds of dynamics is important.
I expect that, by default, too few people will focus on analyzing such dynamics from a truth-seeking and/or instrumentally-useful-for-safety perspective.
That said:
That's fair. I agree that we're not likely to resolve much by continuing this discussion. (but thanks for engaging - I do think I understand your position somewhat better now)
What does seem worth considering is adjusting research direction to increase focus on [search for and better understand the most important failure modes] - both of debate-like approaches generally, and any [plan to use such techniques to get useful alignment work done].
I expect that this would lead people to develop clearer, richer models.
Presumably this will take months rather than h...
"[regardless of the technical work you do] there will always be some existentially risky failures left, so if we proceed we get doom...
I'm claiming something more like "[given a realistic degree of technical work on current agendas in the time we have], there will be some existentially risky failures left, so if we proceed we're highly likely to get doom."
I'll clarify more below.
Otherwise even in a "free-for-all" world, our actions do influence odds of success, because you can do technical work that people use, and that reduces p(doom).
Sure, but I mostly do...
Okay, I think it's pretty clear that the crux between us is basically what I was gesturing at in my first comment, even if there are minor caveats that make it not exactly literally that.
I'm probably not going to engage with perspectives that say all current [alignment work towards building safer future powerful AI systems] is net negative, sorry. In my experience those discussions typically don't go anywhere useful.
Not going to respond to everything
No worries at all - I was aiming for [Rohin better understands where I'm coming from]. My response was over-long.
E.g. presumably if you believe in this causal arrow you should also believe [higher perceived risk] --> [actions that decrease risk]. But if all building-safe-AI work were to stop today, I think this would have very little effect on how fast the world pushes forward with capabilities.
Agreed, but I think this is too coarse-grained a view.
I expect that, absent impressive levels of international coordination, we...
[apologies for writing so much; you might want to skip/skim the spoilered bit, since it seems largely a statement of the obvious]
Do you agree that in many other safety fields, safety work mostly didn't think about risk compensation, and still drove down absolute risk?
Agreed (I imagine there are exceptions, but I'd be shocked if this weren't usually true).
[I'm responding to this part next, since I think it may resolve some of our mutual misunderstanding]
...It seems like your argument here, and in other parts of your comment, is something like "we could d
Thanks for the response. I realize this kind of conversation can be annoying (but I think it's important).
[I've included various links below, but they're largely intended for readers-that-aren't-you]
I don't see why this isn't a fully general counterargument to alignment work. Your argument sounds to me like "there will always be some existentially risky failures left, so if we proceed we will get doom. Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down".
(Thanks for this too. I don't ...
Sure, linking to that seems useful, thanks.
That said, I'm expecting that the crux isn't [can a debate setup work for arbitrarily powerful systems?], but rather e.g. [can it be useful in safe automation of alignment research?].
For something like the latter, it's not clear to me that it's not useful.
Mainly my pessimism is about:
Do you have a [link to] / [summary of] your argument/intuitions for [this kind of research on debate makes us safer in expectation]? (e.g. is Geoffrey Irving's AXRP appearance a good summary of the rationale?)
To me it seems likely to lead to [approach that appears to work to many, but fails catastrophically] before it leads to [approach that works]. (This needn't be direct)
I.e. currently I'd expect this direction to make things worse both for:
I like both of the theories of change you listed, though for (1) I usually think about scaling till human obsolescence rather than superintelligence.
(Though imo this broad class of schemes plausibly scales to superintelligence if you eventually hand off the judge role to powerful AI systems. Though I expect we'll be able to reduce risk further in the future with more research.)
I note here that this isn't a fully-general counterargument, but rather a general consideration.
I don't see why this isn't a fully general counterargument to alignment work. Your arg...
Broadly I agree.
I'm not sure about:
but the team has not cohered around a leadership structure or agenda yet. I'm hopeful that this will come together
I don't expect the most effective strategy at present to be [(try hard to) cohere around an agenda]. An umbrella org hosting individual researchers seems the right starting point. Beyond that, I'd expect [structures and support to facilitate collaboration and self-organization] to be ideal.
If things naturally coalesce that's probably a good sign - but I'd prefer that to be a downstream consequence of explorati...
Ah okay, that's clarifying. Thanks.
It still seems to me that there's a core similarity for all cases of [model is deployed in a context without fully functional safety measures] - and that that can happen either via rogue deployment, or any action that subverts some safety measure in a standard deployment.
In either case the model gets [much more able to take the huge number of sketchy actions that are probably required to cause the catastrophe].
Granted, by default I'd expect [compromised safety measure(s)] -> [rogue deployment] -> [catastrophe]
Rather...
This seems a helpful model - so long as it's borne in mind that [most paths to catastrophe without rogue deployment require many actions] isn't a guarantee.
Thoughts:
(Egan's Incandescence is relevant and worth checking out - though it's not exactly thrilling :))
I'm not crazy about the terminology here:
Here and above, I'm unclear what "getting to 7..." means.
With x = "always reliably determines worst-case properties about a model and what happened to it during training even if that model is deceptive and actively trying to evade detection".
Which of the following do you mean (if either)?:
I don't see how model organisms of deceptive alignment (MODA) get us (2).
This would seem to require some theoretical reason to believe our MODA in some sense covere...
I agree with this.
Unfortunately, I think there's a fundamentally inside-view aspect of [problems very different from those we're used to]. I think looking for a range of frames is the right thing to do - but deciding on the relevance of the frame can only be done by looking at the details of the problem itself (if we instead use our usual heuristics for relevance-of-frame-x, we run into the same out-of-distribution issues).
I don't think there's a way around this. Aspects of this situation are fundamentally different from those we're used to. [Is different ...
then even if we reveal information, adversaries may still assume (likely correctly) we aren't sharing all our information
I think the same reasoning applies if they hack us: they'll assume that the stuff they were able to hack was the part we left suspiciously vulnerable, and the really important information is behind more serious security.
I expect they'll assume we're in control either way - once the stakes are really high.
It seems preferable to actually be in control.
I'll grant that it's far from clear that the best strategy would be used.
(apologies if I misinterpreted your assumptions in my previous reply)
Working on this seems good insofar as greater control implies more options. With good security, it's still possible to opt in to whatever weight-sharing / transparency mechanisms seem net positive - including with adversaries. Without security there's no option.
Granted, the [more options are likely better] conclusion is clearer if we condition on wise strategy.
However, [we have great security, therefore we're sharing nothing with adversaries] is clearly not a valid inference in general.
I think this is great overall.
One area where I'd ideally prefer a clearer presentation/framing is "Safety/performance trade-offs".
I agree that it's better than "alignment tax", but I think it shares one of the core downsides:
I think the DSA framing is in keeping with the spirit of "first critical try" discourse.
(With that in mind, the below is more "this too seems very important", rather than "omitting this is an error".)
However, I think it's important to consider scenarios where humans lose meaningful control without any AI or group of AIs necessarily gaining a DSA. I think "loss of control" is the threat to think about, not "AI(s) take(s) control". Admittedly this gets into Moloch-related grey areas - but this may indicate that [humans do/don't have control] is too coarse-gr...
Does any specific human or group of humans currently have "control" in the sense of "that which is lost in a loss-of-control scenario"? If not, that indicates to me that it may be useful to frame the risk as "failure to gain control".
On your (2), I think you're ignoring an understanding-related asymmetry:
This makes it even clearer that Altman’s claims of ignorance were lies – he cannot possibly have believed that former employees unanimously signed non-disparagements for free!
This is still quoting Neel, right? Presumably you intended to indent it.
Have you looked through the FLI faculty listed there?
How many seem like useful supervisors for this kind of thing? Why?
If we're sticking to the [generate new approaches to core problems] aim, I can see three or four I'd be happy to recommend, conditional on their agreeing upfront to the exploratory goals, and that publication would not be necessary (or a very low concrete number agreed upon).
There are about ten more that seem not-obviously-a-terrible-idea, but probably not great (e.g. those who I expect have a decent understanding of the core problems, but basi...
A few points here (all with respect to a target of "find new approaches to core problems in AGI alignment"):
It's not clear to me what the upside of the PhD structure is supposed to be here (beyond respectability). If the aim is to avoid being influenced by most of the incentives and environment, that's more easily achieved by not doing a PhD. (to the extent that development of research 'taste'/skill acts to service a publish-or-perish constraint, that's likely to be harmful)
This is not to say that there's nothing useful about an academic context - only tha...
RFPs seem a good tool here for sure. Other coordination mechanisms too.
(And perhaps RFPs for RFPs, where sketching out high-level desiderata is easier than specifying parameters for [type of concrete project you'd like to see])
Oh, and I think the MATS Winter Retrospective seems great from the [measure a whole load of stuff] perspective. I think it's non-obvious what conclusions to draw, but more data is a good starting point. It's on my to-do list to read it carefully and share some thoughts.
I agree with Tsvi here (as I'm sure will shock you :)).
I'd make a few points:
For reference there's this: What I learned running Refine
When I talked to Adam about this (over 12 months ago), he didn't think there was much to say beyond what's in that post. Perhaps he's updated since.
My sense is that I view it as more of a success than Adam does. In particular, I think it's a bit harsh to solely apply the [genuinely new directions discovered] metric. Even when doing everything right, I expect the hit rate to be very low there, with [variation on current framing/approach] being the most common type of success.
Agreed that Refine's...
(understood that you'd want to avoid the below by construction through the specification)
I think the worries about a "least harmful path" failure mode would also apply to a "below 1 catastrophic event per millennium" threshold. It's not obvious to me that the vast majority of ways to [avoid significant risk of catastrophe-according-to-our-specification] wouldn't be highly undesirable outcomes.
It seems to me that "greatly penalize the additional facts which are enforced" is a two-edged sword: we want various additional facts to be highly likely, since our a...
[again, the below is all in the spirit of "I think this direction is plausibly useful, and I'd like to see more work on it"]
not to have any mental influences on people other than those which factor through the system's pre-agreed goals being achieved in the world.
Sure, but this seems to say "Don't worry, the malicious superintelligence can only manipulate your mind indirectly". This is not the level of assurance I want from something calling itself "Guaranteed safe".
...It is worth noting here that a potential failure mode is that a truly malicious general-pur
This seems interesting, but I've seen no plausible case that there's a version of (1) that's both sufficient and achievable. I've seen Davidad mention e.g. approaches using boundaries formalization. This seems achievable, but clearly not sufficient. (boundaries don't help with e.g. [allow the mental influences that are desirable, but not those that are undesirable])
The [act sufficiently conservatively for safety, relative to some distribution of safety specifications] constraint seems likely to lead to paralysis (either of the form [AI system does nothing]...
So no, not disincentivizing making positive EV bets, but updating about the quality of decision-making that has happened in the past.
I think there's a decent case that such updating will indeed disincentivize making positive EV bets (in some cases, at least).
In principle we'd want to update on the quality of all past decision-making. That would include both [made an explicit bet by taking some action] and [made an implicit bet through inaction]. With such an approach, decision-makers could be punished/rewarded with the symmetry required to avoid undesirabl...
Some thoughts:
I think it's important to distinguish between:
It's entirely possible to achieve (1) without (2).
I'd be wary of assuming that any particular person has achieved (2) without good evidence.
Relevant here is Geoffrey Irving's AXRP podcast appearance. (if anyone already linked this, I missed it)
I think Daniel Filan does a good job there both in clarifying debate and in questioning its utility (or at least the role of debate-as-solution-to-fundamental-alignment-subproblems). I don't specifically remember satisfying answers to your (1)/(2)/(3), but figured it's worth pointing at regardless.
Despite not answering all possible goal-related questions a priori, the reductionist perspective does provide a tractable research program for improving our understanding of AI goal development. It does this by reducing questions about goals to questions about behaviors observable in the training data.
[emphasis mine]
This might be described as "a reductionist perspective". It is certainly not "the reductionist perspective", since reductionist perspectives need not limit themselves to "behaviors observable in the training data".
A more reasonable-to-my-mind b...
Sure, understood.
However, I'm still unclear what you meant by "This level of understanding isn't sufficient for superhuman persuasion.". If 'this' referred to [human coworker level], then you're correct (I now guess you did mean this ??), but it seems a mildly strange point to make. It's not clear to me why it'd be significant in the context without strong assumptions on correlation of capability in different kinds of understanding/persuasion.
I interpreted 'this' as referring to the [understanding level of current models]. In that case it's not clear to me...
Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn't true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn't sufficient for superhuman persuasion.
Both "better understanding" and in a sense "superhuman persuasion" seem to be too coarse a way to think about this (I realize you're responding to a claim-at-similar-coarseness).
Models don't need to be capable of a Pareto improvement on human per...
Thanks for the thoughtful response.
A few thoughts:
If length is the issue, then replacing "leads" with "led" would reflect the reality.
I don't have an issue with titles like "...Improving safety..." since it has a [this is what this line of research is aiming at] vibe, rather than a [this is what we have shown] vibe. Compare "curing cancer using x" to "x cures cancer".
Also in that particular case your title doesn't suggest [we have achieved AI control]. I don't think it's controversial that control would improve safety, if achieved.
I agree that this isn't a...
I'd be curious what the take is of someone who disagrees with my comment.
(I'm mildly surprised, since I'd have predicted more of a [this is not a useful comment] reaction, than a [this is incorrect] reaction)
I'm not clear whether the idea is that:
Interesting - I look forward to reading the paper.
However, given that most people won't read the paper (or even the abstract), could I appeal for paper titles that don't overstate the generality of the results? I know it's standard practice in most fields not to bother with caveats in the title, but here it may actually matter if e.g. those working in governance think that you've actually shown "Debating with More Persuasive LLMs Leads to More Truthful Answers", rather than "In our experiments, Debating with More Persuasive LLMs Led to More Truthful Answer...
Thanks for the link.
I find all of this plausible. However, I start to worry when we need to rely on "for all" assumptions based on intuition. (also, I worry in large part because domains are a natural way to think here - it's when things feel natural that we forget we're making assumptions)
I can buy that [most skills in a domain correlate quite closely] and that [most problematic skills/strategies exist in a small number of domains]. The 'all' versions are much less clear.
Unless I'm missing something, the hoped-for advantages of this setup are the kind of thing AI safety via debate already aims at. In GDM's recent paper on their approach to technical alignment, there's some discussion of amplified oversight more generally (starting at page 71), and of debate specifically (starting at page 73).
If you see the approach you're suggesting as importantly different from debate approaches, it'd be useful to know where the key differences are.
(without having read too carefully, my initial impression is that this is the kind of thing I expect to work fo...