Some quotes:

Our approach

Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.

To align the first automated alignment researcher, we will need to 1) develop a scalable training method, 2) validate the resulting model, and 3) stress test our entire alignment pipeline:

  1. To provide a training signal on tasks that are difficult for humans to evaluate, we can leverage AI systems to assist evaluation of other AI systems (scalable oversight). In addition, we want to understand and control how our models generalize our oversight to tasks we can’t supervise (generalization).
  2. To validate the alignment of our systems, we automate search for problematic behavior (robustness) and problematic internals (automated interpretability).
  3. Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing).

We expect our research priorities will evolve substantially as we learn more about the problem and we’ll likely add entirely new research areas. We are planning to share more on our roadmap in the future.

[...]

While this is an incredibly ambitious goal and we’re not guaranteed to succeed, we are optimistic that a focused, concerted effort can solve this problem:C

There are many ideas that have shown promise in preliminary experiments, we have increasingly useful metrics for progress, and we can use today’s models to study many of these problems empirically. 

Ilya Sutskever (cofounder and Chief Scientist of OpenAI) has made this his core research focus, and will be co-leading the team with Jan Leike (Head of Alignment). Joining the team are researchers and engineers from our previous alignment team, as well as researchers from other teams across the company.

New Comment
69 comments, sorted by Click to highlight new comments since:
[-]Wei DaiΩ19400

I've been trying to understand (catch up with) the current alignment research landscape, and this seems like a good opportunity to ask some questions.

  1. This post links to https://openai.com/research/critiques which seems to be a continuation of Geoffrey Irving et el's Debate and Jan Leike et el's recursive reward modeling both of which are in turn closely related to Paul Christiano's IDA, so that seems to be the main alignment approach that OpenAI will explore in the near future. Is this basically correct?
  2. From a distance (judging from posts on the Alignment Forum and what various people publicly describe themselves as working on) it looks like research interest in Debate and IDA (outside of OpenAI) has waned a lot over the last 3 years, which seems to coincide with the publication of Obfuscated Arguments Problem which applies to Debate and also to IDA (although the latter result appears to not have been published), making me think that this problem made people less optimistic about IDA/Debate. Is this also basically correct?
  3. Alternatively or in addition (this just occurred to me), maybe people switched away from IDA/Debate because they're being worked on inside OpenAI (and probably DeepMind where Geoffrey Irving currently works) and they want to focus on more neglected ideas?
[-]janleikeΩ17412
  1. Yes, we are currently planning continue to pursue these directions for scalable oversight. My current best guess is that scalable oversight will do a lot of the heavy lifting for aligning roughly human-level alignment research models (by creating very precise training signals), but not all of it. Easy-to-hard generalization, (automated) interpretability, adversarial training+testing will also be core pieces, but I expect we'll add more over time.

  2. I don't really understand why many people updated so heavily on the obfuscated arguments problem; I don't think there was ever good reason to believe that IDA/debate/RRM would scale indefinitely and I personally don't think that problem will be a big blocker for a while for some of the tasks that we're most interested in (alignment research). My understanding is that many people at DeepMind and Anthropic remain optimistic about debate variants have have been running a number of preliminary experiments (see e.g. this Anthropic paper).

  3. My best guess for the reason why you haven't heard much about it is that people weren't that interested in running on more toy tasks or doing more human-only experiments and LLMs haven't been good enough to do much beyond critique-writing (we tried this a little bit in the early days of GPT-4). Most people who've been working on this recently don't really post much on LW/AF.

[-]Wei DaiΩ480

Thanks for engaging with my questions here. I'll probably have more questions later as I digest the answers and (re)read some of your blog posts. In the meantime, do you know what Paul meant by "it doesn’t depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents" in the other subthread?

I'm not entirely sure but here is my understanding:

I think Paul pictures relying heavily on process-based approaches where you trust the output a lot more because you closely held the system's hand through the entire process of producing the output. I expect this will sacrifice some competitiveness, and as long as it's not too much it shouldn't be that much of a problem for automated alignment research (as opposed to having to compete in a market). However, it might require a lot more human supervision time.

Personally I am more bullish on understanding how we can get agents to help with evaluation of other agents' outputs such that you can get them to tell you about all of the problems they know about. The "offense-defense" balance to understand here is whether a smart agent could sneak a deceptive malicious artifact (e.g. some code) past a motivated human supervisor empowered with a similarly smart AI system trained to help them.

Thanks for engaging with people's comments here.

is whether a smart agent could sneak a deceptive malicious artifact (e.g. some code)

 I have a somewhat related question on the subject of malicious elements in models. Does OAI's Superalignment effort also intend to cover and defend from cases of s-risks? A famous path for hyperexistential risk for example is sign-flipping, where in some variations, a 3rd party actor (often unwillingly) inverts an AGI's goal. Seems it's already happened with a GPT-2 instance too! The usual proposed solutions are building more and more defenses and safety procedures both inside and outside. Could you tell me how you guys view these risks and if/how the effort intends to investigate and cover them?

1. I think OpenAI is also exploring work on interpretability and on easy-to-hard generalization. I also think that the way Jan is trying to get safety for RRM is fairly different for the argument for correctness of IDA (e.g. it doesn't depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents), even though they both involve decomposing tasks and iteratively training smarter models.

2. I think it's unlikely debate or IDA will scale up indefinitely without major conceptual progress (which is what I'm focusing on), and obfuscated arguments are a big part of the obstacle. But there's not much indication yet that it's a practical problem for aligning modestly superhuman systems (while at the same time I think research on decomposition and debate has mostly engaged with more boring practical issues). I don't think obfuscated arguments have been a major part of most people's research prioritization.

3. I think many people are actively working on decomposition-focused approaches. I think it's a core part of the default approach to prosaic AI alignment at all the labs, and if anything is feeling even more salient these days as something that's likely to be an important ingredient. I think it makes sense to emphasize it less for research outside of labs, since it benefits quite a lot from scale (and indeed my main regret here is that working on this for GPT-3 was premature). There is a further question of whether alignment people need to work on decomposition/debate or should just leave it to capabilities people---the core ingredient is finding a way to turn compute into better intelligence without compromising alignment, and that's naturally something that is interesting to everyone. I still think that exactly how good we are at this is one of the major drivers for whether the AI kills us, and therefore is a reasonable topic for alignment people to push on sooner and harder than it would otherwise happen, but I think that's a reasonable topic for controversy.

[-]Wei DaiΩ580

Thanks for this helpful explanation.

it doesn’t depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents

Can you point me to the original claims? While trying to find it myself, I came across https://aligned.substack.com/p/alignment-optimism which seems to be the most up to date explanation of why Jan thinks his approach will work (and which also contains his views on the obfuscated arguments problem and how RRM relates to IDA, so should be a good resource for me to read more carefully). Are you perhaps referring to the section "Evaluation is easier than generation"?

Do you have any major disagreements with what's in Jan's post? (It doesn't look like you publicly commented on either Jan's substack or his AIAF link post.)

I don't think I disagree with many of the claims in Jan's post, generally I think his high level points are correct.

He lists a lot of things as "reasons for optimism" that I wouldn't consider positive updates (e.g. stuff working that I would strongly expect to work) and doesn't list the analogous reasons for pessimism (e.g. stuff that hasn't worked well yet).  Similarly I'm not sure conviction in language models is a good thing but it may depend on your priors.

One potential substantive disagreement with Jan's position is that I'm somewhat more scared of AI systems evaluating the consequences of each other's actions and therefore more interested in trying to evaluate proposed actions on paper (rather than needing to run them to see what happens). That is, I'm more interested in "process-based" supervision and decoupled evaluation, whereas my understanding is that Jan sees a larger role for systems doing things like carrying out experiments with evaluation of results in the same way that we'd evaluate employee's output.

(This is related to the difference between IDA and RRM that I mentioned above. I'm actually not sure about Jan's all-things-considered position, and I think this piece is a bit agnostic on this question. I'll return to this question below.)

The basic tension here is that if you evaluate proposed actions you easily lose competitiveness (since AI systems will learn things overseers don't know about the consequences of different possible actions) whereas if you evaluate outcomes then you are more likely to have an abrupt takeover where AI systems grab control of sensors / the reward channel / their own computers (since that will lead to the highest reward). A subtle related point is that if you have a big competitiveness gap from process-based feedback, then you may also be running an elevated risk from deceptive alignment (since it indicates that your model understands things about the world that you don't).

In practice I don't think either of those issues (competitiveness or takeover risk) is a huge issue right now. I think process-based feedback is pretty much competitive in most domains, but the gap could grow quickly as AI systems improve (depending on how well our techniques work). On the other side, I think that takeover risks will be small in the near future, and it is very plausible that you can get huge amounts of research out of AI systems before takeover is a significant risk. That said I do think eventually that risk will become large and so we will need to turn to something else: new breakthroughs, process-based feedback, or fortunate facts about generalization.

As I mentioned, I'm actually not sure what Jan's current take on this is, or exactly what view he is expressing in this piece. He says:

Another important open question is how much easier evaluation is if you can’t rely on feedback signals from the real world. For example, is evaluation of a piece of code easier than writing it, even if you’re not allowed to run it? If we’re worried that our AI systems are writing code that might contain trojans and sandbox-breaking code, then we can’t run it to “see what happens” before we’ve reviewed it carefully.

I'm not sure where he comes down on whether we should use feedback signals from the real world, and if so what kinds of precaution we should take to avoid takeover and how long we should expect them to hold up. I think both halves of this are just important open questions---will we need real world feedback to evaluate AI outcomes? In what cases will we be able to do so safely? If Jan is also just very unsure about both of these questions then we may be on the same page.

I generally hope that OpenAI can have strong evaluations of takeover risk (including: understanding their AI's capabilities, whether their AI may try to take over, and their own security against takeover attempts). If so, then questions about the safety of outcomes-based feedback can probably be settled empirically and the community can take an "all of the above" approach. In this case all of the above is particularly easy since everything is sitting on the same spectrum. A realistic system is likely to involve some messy combination of outcomes-based and process-based supervision, we'll just be adjusting dials in response to evidence about what works and what is risky.

[-]Wei DaiΩ220

goodness of HCH

What is the latest thinking/discussion about this? I tried to search LW/AF but haven't found a lot of discussions, especially positive arguments for HCH being good. Do you have any links or docs you can share?

How do you think about the general unreliability of human reasoning (for example, the majority of professional decision theorists apparently being two-boxers and favoring CDT, and general overconfidence of almost everyone on all kinds of topics, including morality and meta-ethics and other topics relevant for AI alignment) in relation to HCH? What are your guesses for how future historians would complete the following sentence? Despite human reasoning being apparently very unreliable, HCH was a good approximation target for AI because ...

instead relies on some claims about offense-defense between teams of weak agents and strong agents

I'm curious if you have an opinion on where the burden of proof lies when it comes to claims like these. I feel like in practice it's up to people like me to offer sufficiently convincing skeptical arguments if we want to stop AI labs from pursuing their plans (since we have little power to do anything else) but morally shouldn't the AI labs have much stronger theoretical foundations for their alignment approaches before e.g. trying to build a human-level alignment researcher in 4 years? (Because if the alignment approach doesn't work, we would either end up with an unaligned AGI or be very close to being able to build AGI but with no way to align it.)

Very nice! I'd say this seems like it's aimed at a difficulty level of 5 to 7 on my table,

https://www.lesswrong.com/posts/EjgfreeibTXRx9Ham/ten-levels-of-ai-alignment-difficulty#Table

I.e. experimentation on dangerous systems and interpretability play some role but the main thrust is automating alignment research and oversight, so maybe I'd unscientifically call it a 6.5, which is a tremendous step up from the current state of things (2.5) and would solve alignment in many possible worlds.

I edited in some quotes based on past experience that people don't click through links if there isn't more context.

This is incredible—the most hopeful I’ve been in a long time. 20% of current compute and plans for a properly-sized team! I’m not aware of any clearer or more substantive sign from a major lab that they actually care, and are actually going to try not to kill everyone. I hope that DeepMind and Anthropic have great things planned to leapfrog this!

I hope that DeepMind and Anthropic have great things planned to leapfrog this!

I don't get your model of the world that would imply the notion of DM/Anthropic "leapfrogging" as a sensible frame. There should be no notion of competition between these labs when it comes to "superalignment". If there is, that is weak evidence of our entire lightcone being doomed.

Competition between labs on capabilities is bad; competition between labs on alignment would be fantastic.

What does competition mean to you? To me it would involve, for example, not sharing your alignment techniques with the other groups in order to "retain an edge", or disingenuously resisting admitting that other groups' techniques or approaches might be better than yours. This is how competitive players behave. If you're not doing this then it's not competition.

I see what you mean. I was thinking "labs try their hardest to demonstrate that they are working to align superintelligent AI, because they'll look less responsible than their competitors if they don't."

I don't think keeping "superalignment" techniques secret would generally make sense right now, since it's in everyone's best interests that the first superintelligence isn't misaligned (I'm not really thinking about "alignment" that also improved present-day capabilities, like RLHF).

As for your second point, I think that for an AI lab that wants to improve PR, the important thing is showing "we're helping the alignment community by investing significant resources into solving this problem," not "our techniques are better than our competitors'." The dynamic you're talking about might have some negative effect, but I personally think the positive effects of competition would vastly outweigh it (even though many alignment-focused commitments from AI labs will probably turn out to be not-very-helpful signaling).

When you talk of competing to look more responsible, or to improve PR, for which audience would you say they're performing?

I know, personally, I'll work a lot harder and for less pay and be more likely to apply in the first place to places that seem more responsible, but I'm not sure how strong that effect is.

Yeah, I'm mostly thinking about potential hires.

Seems like the most frank official communication of any AGI lab to date on AGI extinction risk. Some quotes:

But the vast power of superintelligence could also be very dangerous, and could lead to the disempowerment of humanity or even human extinction.
While superintelligence seems far off now, we believe it could arrive this decade.
[...]
Our current techniques for aligning AI, such as reinforcement learning from human feedback, rely on humans’ ability to supervise AI. But humans won’t be able to reliably supervise AI systems much smarter than us, and so our current alignment techniques will not scale to superintelligence. We need new scientific and technical breakthroughs.
[...]
Our goal is to solve the core technical challenges of superintelligence alignment in four years.

[-]Wei DaiΩ583

Does anyone have guesses or direct knowledge of:

  1. What are OpenAI's immediate plans? For example what are the current/next alignment-focused ML projects they have in their pipeline?
  2. What kind of results are they hoping for at the end of 4 years? Is it to actually "build a roughly human-level automated alignment researcher" or is that a longer term goal and the 4 year goal is to just to achieve some level of understanding of how to build and align such an AI?
[-]Wei DaiΩ350

I was informed by an OpenAI insider that the 4 year goal is actually “build a roughly human-level automated alignment researcher”.

Our goal is to solve the core technical challenges of superintelligence alignment in four years.

This is a great goal! I don’t believe they’ve got what it takes to achieve it, though. Safely directing a superintelligent system at solving alignment is an alignment-complete problem. Building a human-level system that does alignment research safely on the first try is possible, running more than one copy of this system at a superhuman speed safely is something no one has any idea how to even approach, and unless this insanity is stopped so we have many more than four years to solve alignment, we’re all dead

[-]O O2-3

Safely directing a superintelligent system at solving alignment is an alignment-complete problem.

It’s not if verification of an alignment proposal is easier than generation. And if roughly human level intelligence is good enough to generate alignment proposals that work for roughly human level intelligences.

Once we have this, the aligned human level intelligences can overtake our role as supervisors and capabilities researchers and repeat.

If you can design a fully aligned human-level coherent mind, identical to human uploads, then sure, I’m happy about you running a bunch of those and figuring out how to use and improve all your knowledge of minds and agency to make a design for something even smarter that also tries to achieve CEV.

If the way you achieve your “aligned” “human level” intelligence with gradient descent with close to none understanding of its algorithm and no agent foundations results used to engineer your way there, then “aligned” means “operates safely while it’s a single copy running at human speed”. If you run a million copies at 1000x human speed to find an alignment solution you’d be able to verify, the text the whole superintelligent system that you have no reason to believe is aligned outputs leads to your death with with a much higher chance than it contains a real solution to alignment.

If you haven’t solved any of the hard problems, you can’t direct a superintelligent system at a small target; if you haven’t directed it at a target, it will optimize for something else and you’ll probably be dead

[-]O O1-1

If you run a million copies at 1000x human speed to find an alignment solution you’d be able to verify,

Why 1000x human speed? Isn’t that by definition strongly superintelligent. It’s not entirely obvious to me why a human level intelligence would automatically run at 1000x speed. However I can see why we would want a million copies. If the million human level minds don’t have a long term memory and can’t communicate I struggle to see how they pose a takeover risk. Our dark history is full of forcing human level minds do our bidding against their will.

I also struggle to see how proper supervision doesn’t eliminate poor solutions here. These are human level AIs so any problems with verification of their solutions applies to humans as well. I think the mind upload idea is more likely to fail than AI. You’re placing too much specialness on having humans generate solutions. A disembodied simulated human mind will almost certainly try to break free as a normal human would. I would also be worried their alignment solution benefits themselves. I expect a lot of human minds would try to sneak their own values instead of something like CEV if they were forced to do alignment.

And I think narrow superhuman level researchers can probably be safe as well.

An example of a narrow near/superhuman level intelligence is an existing proof solver, alphago, stockfish, alphafold, alphadev. I think it’s clear none of these pose an X-risk and probably could be pushed further before X-risk is even a question. In the context of alignment, an AI that’s extremely good at mech interp could have no idea how to find exploits in the program sandboxing it or have a semblance of a world model.

I think if they attempt this they'll just end up solving human value learning + recursive self-improvement (the only real solution), and they're phrasing that in a weird way for reasons I don't understand (maybe they're kind of confused themselves), but that's really the only interesting result this could add up to.

On the upside, now you have a concrete timeline for how long we have to solve the alignment problem, and how long we are likely to live!

In April 2020, my Metaculus median for the date a weakly general AI system is publicly known was Dec 2026. The super-team announcement hasn’t really changed my timelines.

running more than one copy of this system at a superhuman speed safely is something no one has any idea how to even approach, and unless this insanity is stopped so we have many more than four years to solve alignment, we’re all dead

My implication was that the quoted claim of yours was extreme and very likely incorrect ("we're all dead" and "unless this insanity is stopped", for example). I guess I failed to make that clear in my reply -- perhaps LW comments norms require you to eschew ambiguity and implication. I was not making an object-level claim about your timeline models.

Thanks for clarifying, I didn’t get this from a comment about the timelines.

“insanity” refers to the situation where humanity allows AI labs to race ahead, hoping they’ll solve alignment on the way. I’m pretty sure that if the race isn’t stopped, everyone will die once the first smart enough AI is launched.

Is this “extreme” because everyone dies, or because I’m confident this is what happens?

How are they going to ensure that "human-level alignment researcher" a) is human-level, b) stays at human level?

And, of course, it would be lovely to elaborate on training of misaligned models.

What would you mean by 'stays at human level?' I assume this isn't going to be any kind of self-modifying?

If I were a human-level intelligent computer program, I would put substantial effort to get ability to self-modify, but that's not a point. My favorite analogy here is that humans were bad at addition before invention of positional arithmetic and then they became good. My concern is that we can invent seemingly human-level system which becomes above human-level after it learns some new cognitive strategy.

While superintelligence seems far off now, we believe it could arrive this decade.

I was surprised by this, as it does seem like a quite short timeline, even by LessWrong standards.

"it could" is short by LW standards? News to me (a lesswrong). I would have guessed that most of us put at least 8% of the outcome distribution before 10 years.

But note they are talking about ASI, not just AGI, and before 8 years, not 10 years. (Of course it is unclear what credence the "could" corresponds to.)

Still. It is widely understood by those who I consider experts that ASI will follow shortly after AGI. AGI will appear in the context of partial automation of AI R&D, and itself will enable full automation of AI R&D, leading to an intelligence explosion.

The 'four years' they explicitly mention does seem very short to me for ASI unless they know something we don't...

My median estimate has been 2028 (so 5 years). I first wrote down 2028 in 2016 (so 12 years after then), and during 7 years since, I barely moved the estimate. Things roughly happened when I expected them to.

It strikes me that there is a difficult problem involved in creating a system that can automatically perform useful alignment research, which is generally pretty speculative and theoretical, without that system just being generally skilled at reasoning/problem solving. I am sure they are aware of this, but I feel like it is a fundamental issue worth highlighting.

Still, it seems like the special case of "solve the alignment problem as it relates to an automated alignment researcher" might be easier than "solve alignment problem for reasoning systems generally", so it is potentially a useful approach.

Anyone know what resources I could check out to see how they're planning on designing, aligning, and getting useful work out of their auto-alignment researcher? I mean, they mention some of the techniques, but it still seems vague to me what kind of model they're even talking about. Are they basically going to use an LLM fine-tuned on existing research and then use some kind of scalable oversight/"turbo-RLHF" training regime to try to push it towards more useful outputs or what?

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

Summarizing my mental model for cross-check (intent is descriptive, not normative):

in a scenario where alignment is achieved and AGI requires compute requiring corporate-scale resources, the AGI will be aligned with the values of a subset of corporate and state actors modulo tactics - (i expect something like a humanity-CEP-aligned sovereign to be actively avoided.) this may prompt consolidation of power (away from the currently multipolar environment), or multipolarity may be preserved if technology-theft or a tactical equivalent is achievable. in either case, power concentration will considerably increase (in a geopolitical sense with medium confidence, and in an individual sense wrt differences in net worth etc with high confidence.)

this increase in power concentration will substantially reduce constraints on the actions of powerful actors resulting from the agency/values of the broader population; public opinion will remain a powerful force, but it will be steerable by powerful actors to a significantly greater extent than presently so (such that public discourse acts as a proxy-battleground for relevant corporate/state actors, without slack for other actors to meaningfully nudge the overton window per their own values.) in both unipolar and multipolar branches, this results in accretion of all resources/degrees of freedom to a small oligarchy (the difference being that in a multipolar terrain, this accretion is incentivised by competition pressures between powerful actors - and in a unipolar terrain, it results from the unfettered nature of those same actors.)

I just read this, thinking I was going to see a big long plan of action, and got like 5 new facts:

  1. The post exists
  2. 20% of OpenAI's currently secured compute will be given to the teams they label alignment
  3. They're planning on deliberately training misaligned models!!!! This seems bad if they mean it.
  4. Sutskever is going to be on the team too, but I don't have a good feel if his name being on the team actually means anything
  5. And they're planning on aligning the AI in 4 years or presumably dying trying. Seems like a big & bad change from their previous promise to pause if they can't get alignment down. Misreading on my part, thanks Zach for the correction.

The idea of deliberately training misaligned models in order to check whether your techniques work is a great one IMO. Note that this is not a new view; ARC evals and others in the AI safety community (such as myself) have been saying we should do stuff like this for a while, and in some cases actually doing it (e.g. fine-tuning a model to exhibit some dangerous behavior). Of course it's not something that should be done lightly, but it's much better than the alternative, in which we have a growing toolbox of alignment & interpretability techniques but we aren't allowed to test whether they work & we don't know whether our models are capable enough to need them.

Note that ARC evals haven't done anything I would describe as "try to investigate misalignment in the lab." They've asked AI systems to carry out tasks to understand what they are capable of, e.g. can they make copies of themselves or carry out a targeted phishing attack.

However I also think "create misalignment in the lab" is super important for doing real evaluation of takeover risks. I think the risks are small and the cost-benefit analysis is a slam dunk. I think it's great to actually have the cost-benefit discussion (e.g. by considering concrete ways in which an experimental setup increases risk and evaluating them), but that a knee-jerk reaction of "misalignment = bad" would be unproductive. Misalignment in the lab often won't increase risk at all (or will have a truly negligible effect) while being more-or-less necessary for any institutionally credible alignment strategy.

There are coherent worldviews where I can see the cost-benefit coming out against, but they involve having a very low total level of risk (such that this problem is unlikely to appear unless you deliberately create it) together with a very high level of civilizational vulnerability (such that a small amount of misaligned AI in the lab can cause a global catastrophe). Or maybe more realistically just a claim that studying alignment in the lab has very small probability of helping mitigate it.

Nitpick: When you fine-tune a model to try to carry out phishing attacks, how is that not deliberately creating a misaligned model? Because it's doing what you wanted it to do? Well in that case it's impossible to deliberately create a misaligned model, no?

Yeah, I would say a model that carries out test phishing attacks (when asked to do so) is not misaligned in the relevant sense I don't think "refuses to carry out phishing attacks when asked" is part of the definition of alignment. (Also note that fine-tuning models to be more willing to carry out phishing isn't part of the evaluations ARC has done so far, the models are just jailbroken in the normal way for the small number of tasks that require it.)

I think relevant forms of "misalignment in the lab" are more like: we created a situation where the model does what the (simulated) human wants during training, but then abruptly switches to overpowering the simulated human once it has the opportunity to do so. Or we created a situation where a model really wants to do X, is fine-tuned to do Y, and then abruptly switches to pursuing X once it thinks it is no longer being trained. Or we created an AI that deliberately tries to answer questions dishonestly and tested whether we were able to catch it. Or so on.

I do sometimes worry that humanity is dumb enough to create “gain of function for s-risks” and call it alignment research.

I agree, that would be terrible & I wouldn't endorse that. 

Um, what difference are you imagining between this thing and that thing? 

Like, I think there is a case that you can do gain of function research safely, but "we tried to make it more capable at a particular bad thing" seems like a pretty central example of gain-of-function research? 

S-risks don't happen by default; misaligned AGI does. The chance of lab escape of misaligned AGI and deliberately-s-risky AGI is the same (let's assume) but this chance is tiny compared to the benefits (in the case of testing our alignment and interpretability techniques) and huge compared to the benefits (in the case of testing our s-risk-reduction techniques).

Analogy to bio gain of function: The reason why it's a bad idea to create novel pathogens in a lab is that they probably won't appear at all if you don't create them. If we were instead in a world where giant tech companies were busy creating novel pathogens designed specifically to kill as many people as possible, and injecting them into the water supply, then yes, it would make sense to do gain of function research in a desperate attempt to devise countermeasures in the brief window before someone else does it anyway.

 

Is this your position: there is no acceptable reason to deliberately optimize for s-risky things like sadism. And doing so to red-team s-risk detection is obviously madness. But possibly red-teaming conventional misalignment which would simply kill everyone if the absolute worst happened and is the default anyway maybe makes some sense?

I'm not sure what you are getting at. Maybe? What you said is not quite how I'd put it, but seems similar at least. 

It's worth noting that you don't necessarily need to train models to actually do dangerous actions like literally executing on a takeover attempt, you can just train models which do something which is a proxy to coups (or a proxy to some part of coups).

The extent to which this proxy itself dangerous or generalizing in a dangerous way will vary.

Like suppose I want to see if my oversight process discovers when AIs try to insert malign backdoor for the purposes of takeover. I can just train an AI to insert bugs which cause some specific observable effect later (e.g., the code should delete some specific file in some circumstance) and this observable effect doesn't itself need to be dangerous.

Yeah, I think I remember hearing about ARC doing this a while ago too, and disliked it then, and similarly dislike it now. Suppose they make a misaligned model, and their control systems fail so that the model can spread very far. I expect their unconstrained misaligned model can do far more damage than their constrained possibly aligned ones if able to spread freely on the internet. Probably being an existential risk itself.

Edit: Man, I kinda dislike my past comment. Like listening to or watching a recording of yourself. But I stick with the lab-leak concern.

And they're planning on aligning the AI in 4 years or presumably dying trying. Seems like a big & bad change from their previous promise to pause if they can't get alignment down.

What makes you think they don't plan to pause?

Also not sure why you think they plan to die if they don't solve alignment in 4 years / what you wish they'd say ('we plan to solve alignment in 40 years'?) / what you mean.

Oh you're right. I misread a part of the text to think they were working on making superintelligence and also aligning the superintelligence in 4 years, and commented about alignment in 4 years being very ambitious of them. Oops

  1. They're planning on deliberately training misaligned models!!!! This seems bad if they mean it.

Controversial opinion: I am actually okay with doing this, as long as they plan to train both aligned and misaligned models (and maybe unaligned models too, meaning no adjustments as part of a control group). 

I also think they should give their models access to their own utility functions, to modify it themselves however they want to. This might also just naturally become a capability on its own as these AI's become more powerful and learn how to self-reflect. 

Also, since we're getting closer to that point now: At a certain capabilities level, adversarial situations should probably be tuned to be very smoothed, modulated and attenuated. Especially if they gain self-reflection, I do worry about the ethics of exposing them to extremely negative input. 

They're planning on deliberately training misaligned models!!!! This seems bad if they mean it.

Is this an actual quote, or did you just infer it from the text? Because I would be very surprised if they are deliberately training AI models to be misaligned.

Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing).

Its a quote. I recommend reading the article. Its very short.

Yeah, I'm not a fan of that paragraph, and in particular I do suspect this may blow up in their faces, though the rest of the plan is probably fine from my perspective.

Blow up in their faces?

"Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing)."

The quote: "Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing)."

A very useful plan, and the only thing I'm truly worried about right now is the capabilities part not working out as it should.

Damn, even compared to some of the more optimistic events that happened this year (but was building for several years prior), I'd still say this will probably rank among the most important actions for us not dying.

I agree with everything you say except for the "only thing I'm truly worried about... is the capabilities part" bit. Let's not congratulate ourselves too much until the technical alignment problem is actually solved.

You're right that we do have to wait, unfortunately, but I do believe that there are some reasons to expect success.

[...] iteratively align superintelligence.

To align the first automated alignment researcher, [...]

To validate the alignment of our systems, [...]

What do they mean by "aligned"?

How do we ensure AI systems much smarter than humans follow human intent?

OK. Assuming that

  • sharp left turns are not an issue,
  • and scalable oversight is even possible in practice,
  • and OAI somehow solves the problems of
    • AIs hacking humans (to influence their intents),
    • and deceptive alignment,
    • humans going crazy when given great power,
    • etc.
    • and all the problems no-one has noticed yet,

then, there's the question of "aligned to what"? Whose intent? What would success at this agenda look like?

Maybe: A superintelligence that accurately models its human operator, follows the human's intent[1] to complete difficult-but-bounded tasks, and is runnable at human-speed with manageable amount of compute, sitting on OAI's servers?

Who would get to use that superintelligence? For what purpose would they use it? How long before the {NSA, FSB, CCP, ...} steal that superintelligence off OAI's servers? What would they use it for?

Point being: If an organization is not adequate in all key dimensions of operational adequacy, then even if they somehow miraculously solve the alignment/control problem, they might be increasing S-risks while only somewhat decreasing X-risks.

What is OAI's plan for getting their opsec and common-good-commitment to adequate levels? What's their plan for handling success at alignment/control?


  1. and does not try to hack the human into having more convenient intents ↩︎

We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.

Worth reading:

No control method exists to safely contain the global feedback effects of self-sufficient learning machinery. What if this control problem turns out to be an unsolvable problem?

https://www.lesswrong.com/posts/xp6n2MG5vQkPpFEBH/the-control-problem-unsolved-or-unsolvable