ryan_greenblatt

I work at Redwood Research.

Wiki Contributions

Comments

mention seem to me like they could be very important to deploy at scale ASAP

Why think this is important to measure or that this already isn't happening?

E.g., on the current model organism related project I'm working on, I automate inspecting reasoning traces in various ways. But I don't feel like there is any particularly interesting thing going on here which is important to track (e.g. this tip isn't more important than other tips for doing LLM research better).

My main vibe is:

  • AI R&D and AI safety R&D will almost surely come at the same time.
    • Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work)
  • People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).

It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.

(Note that this paper was already posted here, so see comments on that post as well.)

I wrote up some of my thoughts on Bengio's agenda here.

TLDR: I'm excited about work on trying to find any interpretable hypothesis which can be highly predictive on hard prediction tasks (e.g. next token prediction).[1] From my understanding, the bayesian aspect of this agenda doesn't add much value.

I might collaborate with someone to write up a more detailed version of this view which engages in detail and is more clearly explained. (To make it easier to argue against and to exist as a more canonical reference.)

As far as Davidad, I think the "manually build an (interpretable) infra-bayesian world model which is sufficiently predictive of the world (as smart as our AI)" part is very likely to be totally unworkable even with vast amounts of AI labor. It's possible that something can be salvaged by retreating to a weaker approach. It seems like a roughly reasonable direction to explore as a possible highly ambitious moonshot to automate researching using AIs, but if you're not optimistic about safely using vast amounts of AI labor to do AI safety work[2], you should discount accordingly.

For an objection along these lines, see this comment.

(The fact that we can be conservative with respect to the infra-bayesian world model doesn't seem to buy much, most of the action is in getting something which is at all good at predicting the world. For instance, in Fabien's example, we would need the infrabayesian world model to be able to distinguish between zero-days and safe code regardless of conservativeness. If it didn't distinguish, then we'd never be able to run any code. This probably requires nearly as much intelligence as our AI has.)

Proof checking on this world model also seems likely to be unworkable, though I have less confidence in this view. And, the more the infra-bayesian world model is computationally intractible to run, the harder it is to proof check. E.g., if running the world model on many inputs is intractable (as would seem to be the default for detailed simulations), I'm very skeptical about proving anything about what it predicts.

I'm not an expert on either agenda and it's plausible that this comment gets some important details wrong.


  1. Or just improving on the intepretability and predictiveness pareto frontier substantially. ↩︎

  2. Presumably by employing some sort of safety intervention e.g. control or only using narrow AIs. ↩︎

Huh, this seems messy. I wish Time was less ambigious with their language here and more clear about exactly what they have/haven't seen.

It seems like the current quote you used is an accurate representation of the article, but I worry that it isn't an accurate representation of what is actually going on.

It seems plausible to me that Time is intentionally being ambigious in order to make the article juicier, though maybe this is just my paranoia about misleading journalism talking. (In particular, it seems like a juicier article if all of the big AI companies are doing this than if they aren't, so it is natural to imply they are all doing it even if you know this is false.)

Overall, my take is that this is a pretty representative quote (and thus I disagree with Zac), but I think the additional context maybe indicates that not all of these companies are doing this, particularly if the article is intentionally trying to deceive.

Due to prior views, I'd bet against Anthropic consistently pushing for very permissive of voluntary regulation behind closed doors which makes me think the article is probably at least somewhat misleading (perhaps intentionally).

I thought the idea of a corrigible AI is that you're trying to build something that isn't itself independent and agentic, but will help you in your goals regardless.

Hmm, no mean something broader than this, something like "humans ultimately have control and will decide what happens". In my usage of the word, I would count situations where humans instruct their AIs to go and acquire as much power as possible for them while protecting them and then later reflect and decide what to do with this power. So, in this scenario, the AI would be arbitrarily agentic and autonomous.

Corrigibility would be as opposed to humanity e.g. appointing a succesor which doesn't ultimately point back to some human driven process.

I would count various indirect normativity schemes here and indirect normativity feels continuous with other forms of oversight in my view (the main difference is oversight over very long time horizons such that you can't train the AI based on it's behavior over that horizon).

I'm not sure if my usage of the term is fully standard, but I think it roughly matches how e.g. Paul Christiano uses the term.

For what it's worth, I'm not sure which part of my scenario you are referring to here, because these are both statements I agree with.

I was arguing against:

This story makes sense to me because I think even imperfect AIs will be a great deal for humanity. In my story, the loss of control will be gradual enough that probably most people will tolerate it, given the massive near-term benefits of quick AI adoption. To the extent people don't want things to change quickly, they can (and probably will) pass regulations to slow things down

On the general point of "will people pause", I agree people won't pause forever, but under my views of alignment difficulty, 4 years of using of extremely powerful AIs can go very, very far. (And you don't necessarily need to ever build maximally competitive AI to do all the things people want (e.g. self-enhancement could suffice even if it was a constant factor less competitive), though I mostly just expect competitive alignment to be doable.)

I think what's crucial here is that I think perfect alignment is very likely unattainable. If that's true, then we'll get some form of "value drift" in almost any realistic scenario. Over long periods, the world will start to look alien and inhuman. Here, the difficulty of alignment mostly sets how quickly this drift will occur, rather than determining whether the drift occurs at all.

Yep, and my disagreement as expressed in another comment is that I think that it's not that hard to have robust corrigibility and there might also be a basin of corrigability.

The world looking alien isn't necessarily a crux for me: it should be possible in principle to have AIs protect humans and do whatever is needed in the alien AI world while humans are sheltered and slowly self-enhance and pick successors (see the indirect normativity appendix in the ELK doc for some discussion of this sort of proposal).

I agree that perfect alignment will be hard, but I model the situation much more like a one time hair cut (at least in expectation) than exponential decay of control.

I expect that "humans stay in control via some indirect mechanism" (e.g. indirect normativity) or "humans coordinate to slow down AI progress at some point (possibly after solving all diseases and becoming wildly wealthy) (until some further point, e.g. human self-enhancement)" will both be more popular as proposals than the world you're thinking about. Being popular isn't sufficient: it also needs to be implementable and perhaps sufficiently legible, but I think at least implementable is likely.

Another mechanism that might be important is human self-enhancement: humans who care about staying in control can try to self-enhance to stay at least somewhat competitive with AIs while preserving their values. (This is not a crux for me and seems relatively marginal, but I thought I would mention it.)

(I wasn't trying to trying to argue against your overall point in this comment, I was just pointing out something which doesn't make sense to me in isolation. See this other comment for why I disagree with your overall view.)

I think we probably disagree substantially on the difficulty of alignment and the relationship between "resources invested in alignment technology" and "what fraction aligned those AIs are" (by fraction aligned, I mean what fraction of resources they take as a cut).

I also think that something like a basin of corrigibility is plausible and maybe important: if you have mostly aligned AIs, you can use such AIs to further improve alignment, potentially rapidly.

I also think I generally disagree with your model of how humanity will make decisions with respect to powerful AIs systems and how easily AIs will be able to autonomously build stable power bases (e.g. accumulate money) without having to "go rogue".

I think various governments will find it unacceptable to construct massively powerful agents extremely quickly which aren't under the control of their citizens or leaders.

I think people will justifiably freak out if AIs clearly have long run preferences and are powerful and this isn't currently how people are thinking about the situation.

Regardlesss, given the potential for improved alignment and thus the instability of AI influence/power without either hard power or legal recognition, I expect that AI power requires one of:

  • Rogue AIs
  • AIs being granted rights/affordances by humans. Either on the basis of:
    • Moral grounds.
    • Practical grounds. This could be either:
      • The AIs do better work if you credibly pay them (or at least people think they will). This would probably have to be something related to sandbagging where we can check long run outcomes, but can't efficiently supervise shorter term outcomes. (Due to insufficient sample efficiency on long horizon RL, possibly due to exploration difficulties/exploration hacking, but maybe also sampling limitations.)
      • We might want to compensate AIs which help us out to prevent AIs from being motivated to rebel/revolt.

I'm sympathetic to various policies around paying AIs. I think the likely deal will look more like: "if the AI doesn't try to screw us over (based on investigating all of it's actions in the future when he have much more powerful supervision and interpretability), we'll pay it some fraction of the equity of this AI lab, such that AIs collectively get 2-10% distributed based on their power". Or possibly "if AIs reveal credible evidence of having long run preferences (that we didn't try to instill), we'll pay that AI 1% of the AI lab equity and then shutdown until we can ensure AIs don't have such preferences".

I think it seems implausible that people will be willing to sign away most of the resources (or grant rights which will de facto do this) and there will be vast commercial incentive to avoid this. (Some people actually are scope sensitive.) So, this leads me to thinking that "we grant the AIs rights and then they end up owning most capital via wages" is implausible.

Load More