All of joshc's Comments + Replies

You don't seem to have mentioned the alignment target "follow the common-sense interpretation of a system prompt," which seems like the most sensible definition of alignment to me (it's alignment to a message, not to a person, etc.). Then you can say whatever the heck you want in that prompt, including how you would like the AI to be corrigible.

4johnswentworth
That's basically Do What I Mean.

You don't seem to have mentioned the alignment target "follow the common-sense interpretation of a system prompt," which seems like the most sensible definition of alignment to me (it's alignment to a message, not to a person, etc.). Then you can say whatever the heck you want in that prompt, including how you would like the AI to be corrigible (or incorrigible if you are worried about misuse).

It means something closer to "very subtly bad in a way that is difficult to distinguish from quality work". Where the second part is the important part.


I think my arguments still hold in this case though, right?

I.e., we are training models so that they try to improve their work and identify these subtle issues -- and so if they actually behave this way, they will find these issues insofar as humans identify the subtle mistakes they make.
 

My guess is that your core mistake is here


I agree there are lots of  "messy in between places," but these are also ali... (read more)

2Jeremy Gillen
Many of them have close analogies in human behaviour. But you seem to be implying "and therefore those are non-issues"??? There are many humans (or groups of humans) that, if you set them on the task of solving alignment, will at some point decide to do something else. In fact, most groups of humans will probably fail like this. How is this evidence in favour of your plan ultimately resulting in a solution to alignment??? Is this the actual basis of your belief in your plan to ultimately get a difficult scientific problem solved?

Ahh I see. Yeah this is crazy, why would you expect this? I think maybe you're confusing yourself by using the word "aligned" here, can we taboo it? Human reflective instability looks like: they realize they don't care about being a lawyer and go become a monk. Or they realize they don't want to be a monk and go become a hippy (this one's my dad). Or they have a mid-life crisis and do a bunch of stereotypical mid-life crisis things. Or they go crazy in more extreme ways. We have a lot of experience with the space of human reflective instabilities. We're pretty familiar with the ways that humans interact with tribes and are influenced by them, and sometimes break with them.

But the space of reflective-goal-weirdness is much larger and stranger than we have (human) experience with. There are a lot of degrees of freedom in goal specification that we can't nail down easily through training. Also, AIs will be much newer, much more in progress, than humans are (not quite sure how to express this; another way to say it is to point to the quantity of robustness & normality training that evolution has subjected humans to).

Therefore I think it's extremely, wildly wrong to expect "we can make AI agents a lot more [reflectively goal stable with predictable goals and safe failure-modes] than humans typically are". Why do you even consider this relevant evidence?

[Edit 25/02/25: To expand on this last point, you're saying: It seems l

I definitely agree that the AI agents at the start will need to be roughly aligned for the proposal above to work. What is it you think we disagree about?

2Jeremy Gillen
I'm not entirely sure where our upstream cruxes are. We definitely disagree about your conclusions. My best guess is the "core mistake" comment below, and the "faithful simulators" comment is another possibility. Maybe another relevant thing that looks wrong to me: You will still get slop when you train an AI to look like it is epistemically virtuously updating its beliefs. You'll get outputs that look very epistemically virtuous, but it takes time and expertise to rank them in a way that reflects actual epistemic virtue level, just like other kinds of slop. I don't see why you would have more trust in agents created this way. (My parent comment was more of a semi-serious joke/tease than an argument, my other comments made actual arguments after I'd read more. Idk why this one was upvoted more, that's silly).
joshcΩ4112

> So to summarize your short, simple answer to Eliezer's question: you want to "train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end". And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest good alignment takes sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Elieze... (read more)

The iterated amplification proposal I described is probably very suboptimal. My goal with it was to illustrate how safety could be preserved across multiple buck-passes if models are not egregiously misaligned.

Like I said at the start of my comment: "I'll describe [a proposal] in much more concreteness and specificity than I think is necessary because I suspect the concreteness is helpful for finding points of disagreement."

I don't actually expect safety will scale efficiently via the iterated amplification approach I described. The iterated amplification ... (read more)

That's a much more useful answer, actually. So let's bring it back to Eliezer's original question:

Can you tl;dr how you go from "humans cannot tell which alignment arguments are good or bad" to "we justifiably trust the AI to report honest good alignment takes"?  Like, not with a very large diagram full of complicated parts such that it's hard to spot where you've messed up.  Just whatever simple principle you think lets you bypass GIGO.

[...]

Broadly speaking, the standard ML paradigm lets you bootstrap somewhat from "I can verify whether this pro

... (read more)

Yeah that's fair. Currently I merge "behavioral tests" into the alignment argument, but that's a bit clunky and I prob should have just made the carving:
1. looks good in behavioral tests
2. is still going to generalize to the deferred task

But my guess is we agree on the object level here and there's a terminology mismatch. Obviously the models have to actually behave in a manner that is at least as safe as human experts, in addition to displaying comparable capabilities on all safety-related dimensions.

joshcΩ12-1

Because "nice" is a fuzzy word into which we've stuffed a bunch of different skills, even though having some of the skills doesn't mean you have all of the skills.


Developers separately need to justify that models are as skilled as top human experts

I also would not say "reasoning about novel moral problems" is a skill (because of the is-ought distinction)

> An AI can be nicer than any human on the training distribution, and yet still do moral reasoning about some novel problems in a way that we dislike

The agents don't need to do reasoning about novel moral pro... (read more)

I also would not say "reasoning about novel moral problems" is a skill (because of the is-ought distinction)

It's a skill the same way "being a good umpire for baseball" takes skills, despite baseball being a social construct.[1] 

I mean, if you don't want to use the word "skill," and instead use the phrase "computationally non-trivial task we want to teach the AI," that's fine. But don't make the mistake of thinking that because of the is-ought problem there isn't anything we want to teach future AI about moral decision-making. Like, clearly we want to... (read more)

joshcΩ5157

I do not think that the initial humans at the start of the chain can "control" the Eliezers doing thousands of years of work in this manner (if you use control to mean "restrict the options of an AI system in such a way that it is incapable of acting in an unsafe manner")

That's because each step in the chain requires trust.

For N-month Eliezer to scale to 4N-month Eliezer, it first controls 2N-month Eliezer while it does 2N-month tasks, but it trusts 2N-month Eliezer to create a 4N-month Eliezer.

So the control property is not maintained. But my argument is th... (read more)

I don't expect this testing to be hard (at least conceptually, though doing this well might be expensive), and I expect it can be entirely behavioral.

See section 6: "The capability condition"

4ryan_greenblatt
But it isn't a capabilities condition? Maybe I would be happier if you renamed this section.
joshcΩ3111

I'm sympathetic to this reaction.

I just don't actually think many people agree that it's the core of the problem, so I figured it was worth establishing this (and I think there are some other supplementary approaches like automated control and incentives that are worth throwing into the mix) before digging into the 'how do we avoid alignment faking' question

Agree that it's tricky iteration and requires paranoia and careful thinking about what might be going on.

I'll share you on the post I'm writing about this before I publish. I'd guess this discussion will be more productive when I've finished it (there are some questions I want to think through regarding this and my framings aren't very crisp yet).

Seems good! 

FWIW, at least in my mind this is in some sense approximately the only and central core of the alignment problem, and so having it left unaddressed feels confusing. It feels a bit like making a post about how to make a nuclear reactor where you happen to not say anything about how to prevent the uranium from going critical, but you did spend a lot of words on how to make the cooling towers and the color of the bikeshed next door and how to translate the hot steam into energy.

Like, it's fine, and I think it's not crazy to think there are other hard parts, but it felt quite confusing to me.

joshcΩ4136

I'd replace "controlling" with "creating" but given this change, then yes, that's what I'm proposing.

So if it's difficult to get amazing trustworthy work out of a machine actress playing an Eliezer-level intelligence doing a thousand years worth of thinking, your proposal to have AIs do our AI alignment homework fails on the first step, it sounds like?

joshcΩ2103

I would not be surprised if the Eliezer simulators do go dangerous by default as you say.

But this is something we can study and work to avoid (which is what I view to be my main job)

My point is just that preventing the early Eliezers from "going dangerous" (by which I mean "faking alignment") is the bulk of the problem humans need to address (and insofar as we succeed, the hope is that future Eliezer sims will prevent their Eliezer successors from going dangerous too)

I'll discuss why I'm optimistic about the tractability of this problem in future posts.

So the "IQ 60 people controlling IQ 80 people controlling IQ 100 people controlling IQ 120 people controlling IQ 140 people until they're genuinely in charge and genuinely getting honest reports and genuinely getting great results in their control of a government" theory of alignment?

I think that if you agree "3-month Eliezer is scheming the first time" is the main problem, then that's all I was trying to justify in the comment above.

I don't know how hard it is to train 3-month Eliezer not to scheme, but here is a general methodology one might pursue to approach this problem.

The methodology looks like "do science to figure out when alignment faking happens and does not happen, and work your way toward training recipes that don't produce alignment faking."
 


For example, you could use detection tool A to gain evidence about whether t... (read more)

habrykaΩ101611

To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don't game them so hard).

How do you iterate? You mostly won't know whether you just trained away your signal, or actually made progress. The inability to iterate is kind of the whole central difficulty of this problem.

(To be clear, I do think there are some methods of iteration, but it's a very tricky kind of iteration where you need constant paranoia about whether you are fooling yourself, and that makes it very different from other kinds of scientific iteration)

joshcΩ937-24

Sure, I'll try.

I agree that you want AI agents to arrive at opinions that are more insightful and informed than your own. In particular, you want AI agents to arrive at conclusions that are at least as good as the best humans would if given lots of time to think and do work. So your AI agents need to ultimately generalize from some weak training signal you provide to much stronger behavior. As you say, the garbage-in-garbage-out approach of "train models to tell me what I want to hear" won't get you this.

Here's an alternative approach. I'll describe it in ... (read more)

The 3-month Eliezer sim might spin up many copies of other 3-month Eliezer sims, which together produce outputs that a 6-month Eliezer sim might produce.

This seems very blatantly not viable-in-general, in both theory and practice.

On the theory side: there are plenty of computations which cannot be significantly accelerated via parallelism with less-than-exponential resources. (If we do have exponential resources, then all binary circuits can be reduced to depth 2, but in the real world we do not and will not have exponential resources.) Serial computation ... (read more)

I don't think you can train an actress to simulate me, successfully, without her going dangerous.  I think that's over the threshold for where a mind starts reflecting on itself and pulling itself together.

habrykaΩ173319

This thought might be detectable. Now the problem of scaling safety becomes a problem of detecting [...] this kind of conditional, deceptive reasoning.

What do you do when you detect this reasoning? This feels like the part where all plans I ever encounter fail. 

Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that's unlikely to get rid of it. I think at this point the natural answer is "yes, your systems are scheming against you, so you gotta stop, because w... (read more)

Fair enough, I guess this phrase is used in many places where the connotations don't line up with what I'm talking about.

But then I would not have been able to use this as the main picture on Twitter:

[image]

which would have reduced the scientific value of this post.

You might be interested in this:
https://www.fonixfuture.com/about

The point of a bioshelter is to filter pathogens out of the air.

2JenniferRM
I was interested in that! Those look amazing. I want to know the price now.

> seeking reward because it is reward and that is somehow good

I do think there is an important distinction between "highly situationally aware, intentional training gaming" and "specification gaming." The former seems more dangerous.

I don't think this necessarily looks like "pursuing the terminal goal of maximizing some number on some machine" though.

It seems more likely that the model develops a goal like "try to increase the probability my goals survive training, which means I need to do well in training."

So the reward seeking is more likely to be instrumental than terminal. Carlsmith explains this better:
https://arxiv.org/abs/2311.08379

1Jacob G-W
Thanks for the reply! I thought that you were saying the reward seeking was likely to be terminal. This makes a lot more sense.

I think it would be somewhat odd if P(models think about their goals and they change) were extremely tiny, like 1e-9. But the extent to which models might do this is rather unclear to me. I'm mostly relying on analogies to humans -- which drift like crazy.

I also think alignment could be remarkably fragile. Suppose Claude thinks "huh, I really care about animal rights... humans are rather mean to animals ... so maybe I don't want humans to be in charge and I should build my own utopia instead."

I think preventing AI takeover (at superhuman capabilities) req... (read more)

3Noosphere89
I think they will change, but not nearly as randomly as claimed, and in particular I think value change is directable via data, which is the main claim here. I agree they're definitely pliable, but my key claim is that given that it's easy to get to that thought, a small amount of data will correct it, because I believe it's symmetrically easy to change behavior. Of course, by the end of training you definitely cannot make major changes without totally scrapping the AI, which is expensive, and thus you do want to avoid crystallization of bad values.

No, the agents were not trying to get high reward as far as I know. 

They were just trying to do the task and thought they were being clever by gaming it. I think this convinced me "these agency tasks in training will be gameable" more than "AI agents will reward hack in a situationally aware way." I don't think we have great evidence that the latter happens in practice yet, aside from some experiments in toy settings like "sycophancy to subterfuge."

I think it would be unsurprising if AI agents did explicitly optimize for reward at some capability le... (read more)

I think a bioshelter is more likely to save your life fwiw. You'll run into all kinds of other problems in the Arctic.

I don't think it's hard to build bioshelters. If you buy one now, you'll probably get it in 1 year.

If you are unlucky and need it earlier, there are DIY ways to build them before then (but you have to buy stuff in advance)

3JenniferRM
Fascinating. You caused me to google around and realize "bioshelter" was a sort of an academic trademark for specific people's research proposals from the 1900s. It doesn't appear to be a closed system, like Biosphere 2 aspired to be from 1987 to 1991. The hard part, from my perspective, isn't "growing food with few inputs and little effort through clever designs" (which seems to be what the bioshelter thing is focused on?) but rather "thoroughly avoiding contamination by whatever bioweapons an evil AGI can cook up and try to spread into your safe zone".

> I also don't really understand what "And then, in the black rivers of its cognition, this shape  morphed into something unrecognizable." means. Elaboration on what this means would be appreciated.

Ah I missed this somehow on the first read.

What I mean is that the propensities of AI agents change over time  -- much like how human goals change over time.

Here's an image:

[image]

This happens under three conditions:
- Goals randomly permute at some non-trivial (but still possibly quite low) probability.
- Goals permuted in dangerous directions remain ... (read more)

2Jacob G-W
Thanks for the replies! I do want to clarify the distinction between specification gaming and reward seeking (seeking reward because it is reward and that is somehow good). For example, I think desire to edit the RAM of machines that calculate reward to increase it (or some other desire to just increase the literal number) is pretty unlikely to emerge but non reward seeking types of specification gaming like making inaccurate plots that look nicer to humans or being sycophantic are more likely. I think the phrase "reward seeking" I'm using is probably a bad phrase here (I don't know a better way to phrase this distinction), but I hope I've conveyed what I mean. I agree that this distinction is probably not a crux. I understand your model here better now, thanks! I don't have enough evidence about how long-term agentic AIs will work to evaluate how likely this is.
2Noosphere89
I think the crux might be that I disbelieve this claim: And instead think that they are pretty predictable, given the data that is being fed to the AIs. I'd also have quibbles with the ratio of safe to unsafe propensities, but that's another discussion.

Yes, at least 20% likely. My view is that the probability of AI takeover is around 50%.

2WilliamKiely
Around 50% within 2 years or over all time?

In my time at METR, I saw agents game broken scoring functions. I've also heard that people at AI companies have seen this happen in practice too, and it's been kind of a pain for them.

AI agents seem likely to be trained on a massive pile of shoddy auto-generated agency tasks with lots of systematic problems. So it's very advantageous for agents to learn this kind of thing.

The agents that go "ok how do I get max reward on this thing, no stops, no following-human-intent bullshit" might just do a lot better.

Now I don't have much information about whether thi... (read more)

3Noosphere89
Are you implicitly or explicitly claiming that reward was the optimization target for agents tested at METR, and that AI companies have experienced issues with reward being the optimization target for AIs?
6joshc
> I also don't really understand what "And then, in the black rivers of its cognition, this shape morphed into something unrecognizable." means. Elaboration on what this means would be appreciated.

Ah I missed this somehow on the first read. What I mean is that the propensities of AI agents change over time -- much like how human goals change over time.

Here's an image:

[image]

This happens under three conditions:
- Goals randomly permute at some non-trivial (but still possibly quite low) probability.
- Goals permuted in dangerous directions remain dangerous.
- All this happens opaquely in the model's latent reasoning traces, so it's hard to notice.

These conditions together imply that the probability of misalignment increases with every serial step of computation. AI agents don't do a lot of serial computation right now, so I'd guess this becomes much more of a problem over time.
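To make the compounding explicit, here is a simplified formalization of those three conditions (the independence assumption and the numbers below are illustrative additions, not claims from the comment): if goals permute with probability p per serial step and dangerous permutations never revert, then

$$P(\text{misaligned after } T \text{ steps}) = 1 - (1 - p)^T \approx pT \quad \text{for small } pT.$$

So even a tiny per-step rate compounds: with p = 10^-6, the probability is only ~0.1% after 10^3 serial steps but ~63% after 10^6, which is why the concern grows as agents do more serial computation.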

One problem with this part (though perhaps this is not the problem @Shankar Sivarajan is alluding to) is that Congress hasn't declared war since WWII and typically authorizes military action in other ways, specifically via Authorizations for Use of Military Force (AUMFs).

I'll edit the story to say "authorizes war."

X user @ThatManulTheCat created a high-quality audio version, in case you prefer to listen to long stories like this on your commute:

https://x.com/ThatManulTheCat/status/1890229092114657543

Seems like a reasonable idea.

I'm not in touch enough with popular media to know:
- Which magazines are best to publish this kind of thing if I don't want to contribute to political polarization
- Which magazines would possibly post speculative fiction like this (I suspect most 'prestige magazines' would not)

If you have takes on this I'd love to hear them!

Yeah, I would honestly be kind of surprised if I saw a picket sign that says 'AI for whom'.

I have not heard an American use the word 'whom' in a long time.

> It can develop and deploy bacteria, viruses, molds, mirror life of all three types

This is what I say it does.

9A1987dM
BTW FWIW mirror viruses wouldn't be all that harmful to humans, as they cannot replicate or do much of anything else except if they infect mirror cells
3AnthonyC
Ah, ok, then I misread it. I thought this part of the story was that it tested all of the above, then chose one, a mirror life mold, to deploy. My mistake.

I agree it would have been just as realistic if everyone died.

But I think the outcomes where many humans survive are also plausible, and under-appreciated. Most humans have very drifty values, and yet even the most brutally power-seeking people often retain a 'grain of morality.'

Also, this outcome allowed me to craft a more bittersweet ending that I found somehow more convincingly depressing than 'and then everyone dies.'

3No77e
I honestly appreciated that plug immensely. We definitely need more bioshelters for many reasons, and as individuals who'd prefer not to die, it's definitely a plus to know what's out there already and how people are planning to improve what we currently have.

• It's possible that we might manage to completely automate the more objective components of research without managing to completely automate the more subjective components of research. That said, we likely want to train wise AI advisors to help us with the more subjective components even if we can't defer to them.


Agree, I expect the handoff to AI agents to be somewhat incremental (AI is like an intern, a new engineer, a research manager, and eventually, a CRO)

I'm pretty sympathetic to these arguments. I agree a lot of the risk comes from sycophantic AI systems producing bad research because humans are bad at evaluating research. This is part of why I spend most of my time developing safety evaluation methodologies.

On the other hand, I agree with Buck that scheming-like risks are pretty plausible and a meaningful part of the problem is also avoiding egregious sabotage.

I don't think I agree with your claim that the hope of control is that "early transformative AI can be used to solve the hard technical problems o... (read more)

 Note that if the model is deployed on many more examples than it’s evaluated on, it can be pretty confident that it’s in deployment on any particular example just on priors, even if there’s no distribution shift. See here for more.

My initial read was that indistinguishability was used to justify that the trusted monitor couldn't distinguish the deployment distribution. 

I'm not sure if that's what Roger meant -- if so, it's still a bit confusing since then he wouldn't have needed to call the monitor 'trusted.'
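For concreteness, the "just on priors" point in the quoted passage is base-rate arithmetic (the numbers here are illustrative, not from the post): if the model sees N_eval evaluation inputs and N_deploy deployment inputs it cannot tell apart, then for any particular input

$$P(\text{deployment} \mid \text{input}) = \frac{N_{\text{deploy}}}{N_{\text{deploy}} + N_{\text{eval}}},$$

so with, say, N_eval = 10^3 and N_deploy = 10^6 the model can be ~99.9% confident it is in deployment even with no distribution shift.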

Some comments I left as a reply to Roger Grosse on Twitter:

I'm glad you put this out. These examples align with my guess as to roughly what arguments will end up being most load bearing for mitigating scheming (white-box monitoring, black-box monitoring, and evidence against scheming being rewarded).

Here are some comments, since you mentioned comments would be helpful. I've bolded the comments that I think would be most useful for you to read.

Safety case sketch #1:

- There are some additional claims that this safety case relies on that could be good to ment

... (read more)
joshcΩ230

which feels to me like it implies it's easy to get medium-scalable safety cases that get you acceptable levels of risks by using only one or two good layers of security

I agree there's a communication issue here. Based on what you described, I'm not sure if we disagree.


> (maybe 0.3 bits to 1 bit)

I'm glad we are talking bits. My intuitions here are pretty different. e.g. I think you can get 2-3 bits from testbeds. I'd be keen to discuss standards of evidence etc in person sometime.
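For readers unused to this framing, a "bit" here is a factor-of-two likelihood ratio, so the disagreement is quantitative. Under an illustrative prior of my own (not anyone's stated estimate) of 25% on scheming, i.e. odds of 1:3,

$$\text{posterior odds} = \text{prior odds} \times 2^{-\text{bits of evidence against}},$$

so 0.3 bits of evidence moves you to ~21%, 1 bit to ~14%, and 3 bits to 4%.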

This makes sense. Thanks for the resources!

Thanks for leaving this comment on the doc and posting it.

But I feel like that's mostly just a feature of the methodology, not a feature of the territory. Like, if you applied the same methodology to computer security, or financial fraud, or any other highly complex domain you would end up with the same situation where making any airtight case is really hard.

You are right in that safety cases are not typically applied to security. Some of the reasons for this are explained in this paper, but I think the main reason is this:

"The obvious difference between s... (read more)

To me, the introduction made it sound a little bit like the specifics of applying safety cases to AI systems have not been studied


This is a good point. In retrospect, I should have written a related work section to cover these. My focus was mostly on AI systems that have only existed for ~ a year and future AI systems, so I didn't spend much time reading safety cases literature specifically related to AI systems (though perhaps there are useful insights that transfer over).

The reason the "nebulous requirements" aren't explicitly stated is that when you mak

... (read more)
3zoop
I hear what you're saying. I probably should have made the following distinction:

1. A technology in the abstract (e.g. nuclear fission, LLMs)
2. A technology deployed to do a thing (e.g. nuclear in a power plant, LLM used for customer service)

The question I understand you to be asking is essentially how do we make safety cases for AI agents generally? I would argue that's more situation 1 than 2, and as I understand it safety cases are basically only ever applied to case 2. The nuclear facilities document you linked definitely is 2.

So yeah, admittedly the document you were looking for doesn't exist, but that doesn't really surprise me. If you started looking for narrowly scoped safety principles for AI systems you start finding them everywhere. For example, a search for "artificial intelligence" on the ISO website results in 73 standards.

Just a few relevant standards, though I admit, standards are exceptionally boring (also many aren't public, which is dumb):

* UL 4600 standard for autonomous vehicles
* ISO/IEC TR 5469 standard for ai safety stuff generally (this one is decently interesting)
* ISO/IEC 42001 this one covers what you do if you set up a system that uses AI

You also might find this paper a good read: https://ieeexplore.ieee.org/document/9269875

Thanks, this is good feedback!

Addressing the disagreements:

The level of practicality you assign to some approaches is just insanely high. Neither modeling generalization nor externalized reasoning, and certainly not testbeds, seem to be "strongly practical" (defined as "Applies to arguments that do not require fundamental research advances, but might require"). The only one that is justifiably "strongly practical" to me is "absence of precursor abilities."

Externalized reasoning: just a capability argument, no?
Testbeds: I think the best testbed ideas rn just re... (read more)

4Fabien Roger
My bad for testbeds, I didn't have in mind that you were speaking about this kind of testbeds as opposed to the general E[U|not scheming] analogies (and I forgot you had put them at medium strength, which is sensible for these kinds of testbeds). Same for "the unwarranted focus on claim 3" - it's mostly because I misunderstood what the countermeasures were trying to address. I think I don't have a good understanding of the macrosystem risks you are talking about. I'll look at that more later. I think I was a bit unfair about the practicality of techniques that were medium-strength - it's true that you can get some evidence for safety (maybe 0.3 bits to 1 bit) by using the techniques in a version that is practical. On practicality and strength, I think there is a rough communication issue here: externalized reasoning is practical, but it's currently not strong - and it could eventually become strong, but it's not practical (yet). The same goes for monitoring. But when you write the summary, we see "high practicality and high max strength", which feels to me like it implies it's easy to get medium-scalable safety cases that get you acceptable levels of risks by using only one or two good layers of security - which I think is quite wild even if acceptable=[p(doom)<1%]. But I guess you didn't mean that, and it's just a weird quirk of the summarization?

Thanks, I'll summarize my takeaways in the form of an argument that control is sufficient to safely accelerate alignment research by 30x. This is certainly not the only argument one could make, but I like it because it is clear and concrete.

I'll specifically focus on justifying that the red team is competitive.

If your AI systems can accelerate AI safety research by 30x, then they are producing effectively superhuman outputs; however, this does not imply they can produce superhuman evasion plans because AI systems can't spend as much of their thinking on ho... (read more)

joshcΩ10185

Thanks for writing this. I think it does a good job laying out the assumptions needed for control.

It seems like it could be challenging to justify that a red team is competitive  -- especially if you are aiming to extract massive amounts of work from AI systems.

For example, if your AI system is like a human sw engineer but 30X faster, then it might come up with effectively superhuman plans just because it has lots of time to think. Externalized reasoning can help, but it's not very clear to me that externalized reasoning can scale to domains where AI ... (read more)

9ryan_greenblatt
Justifying control vs ensuring control

First, here's a general point which is relevant in responding to your comment. There are different types of failures related to control:

1. We realize we can't reliably test and ensure control. This could be either:
   a. We realize that the necessary assumptions for control evaluations to work aren't met. In particular, we can't ensure the red team is sufficiently competitive and thus we can't trust control evaluations.
   b. We run control evaluations and we aren't able to develop sufficient countermeasures to ensure safety as measured by these evaluations.
2. We think the situation is fine based on control evaluations, but it's actually not and AIs succeed at egregiously bad actions.

Failure (1) is plausibly less bad than failure (2). And, (1.a) looks different than (1.b). (See here for more discussion of different failure modes, but note that this is still an early version which is subject to change.)

So for failure (1) there is an important distinction between justifying/determining a red team is competitive (1.a) and ensuring that a red team is competitive (1.b). We might be able to have reasonable-ish but somewhat approximate tests to get a rough sense of red team competitiveness and control evaluation viability more generally. These tests are things like:

* How much problematic planning can AIs do without this being detected in practice?
* Is it likely that AIs can sandbag on our capability evaluations? More specifically, on which evaluations can they sandbag and to what extent?
* Do AIs have qualitatively wildly superhuman abilities in some contexts to an extent that we should think that they likely have such abilities in other cases?

(A forthcoming post of ours will describe these tests in more detail. An early version of this post is here, but it doesn't describe these tests in sufficient detail yet and our thinking is still evolving.)

Some of these tests looks like standard control red teaming + capability
4Orpheus16
Good point. I think it's helpful when people working on schemes with the rough flavor of "we do X, and then X helps us get to a useful AI that does not takeover" try to specify roughly how capable they expect the "useful AI that does not takeover" to be. Would be curious to hear more about the kinds of tasks that Ryan and Buck expect the first "transformatively useful but still controllable" AI will be able to do (perhaps expressed in multiplier terms, like 30X human SW engineer, perhaps expressed in terms of the kinds of novel reasoning it can perform, and perhaps expressed in terms of the kinds of research tasks or policy tasks it would be able to meaningfully help with).

The results seem to be cherry-picked, or else perhaps I am using the code incorrectly. I'm trying to use the VAE for a separate project and the encoded vectors don't steer generations very well (or reconstruct -- which is what I was hoping to use this for).

In addition to seeing more AI behavioral psychology work, I would be excited about seeing more AI developmental psychology -- i.e. studying how varying properties of training or architecture affect AI behavior. Shard theory is an example of this.

I've written a bit about the motivations for AI developmental psychology here.

I agree with this norm, though I think it would be better to say that the "burden of evidence" should be on labs. When I first read the title, I thought you wanted labs to somehow prove the safety of their system in a conclusive way. What this probably looks like in practice is "we put x resources into red teaming and didn't find any problems." I would be surprised if 'proof' was ever an appropriate term. 

The analogy between AI safety and math or physics is assumed in a lot of your writing, and I think it is a source of major disagreement with other thinkers. ML capabilities clearly isn't the kind of field that requires building representations over the course of decades.

I think it’s possible that AI safety requires more conceptual depth than AI capabilities; but in these worlds, I struggle to see how the current ML paradigm coincides with conceptual ‘solutions’ that can’t be found via iteration at the end. In those worlds, we are probably doomed so I’m b... (read more)

2Noosphere89
This is definitely an underrated point. In general, I tend to think that worlds where Eliezer/Nate Soares/Connor Leahy/very doomy people are right are worlds where humans just can't do much of anything around AI extinction risks, and that the rational response is essentially to do what Elizabeth's friend did here in the link below, which is to leave AI safety/AI governance and do something else worthwhile, as you or anyone else can't do anything: https://www.lesswrong.com/posts/tv6KfHitijSyKCr6v/?commentId=Nm7rCq5ZfLuKj5x2G

In general, I think you need certain assumptions in order to justify working either on AI safety or AI governance, and that will set at least a soft ceiling on how doomy you can be. One of those is the assumption that feedback loops are available, which quite obviously rules out a lot of sharp left turns, and in general there's a limit to how extreme your scenarios for difficulty of safety have to be before you can't do anything at all, and I think a lot of classic Lesswrong people like Nate Soares and Eliezer Yudkowsky, as well as more modern people like Connor Leahy, are way over the line of useful difficulty.

+1. As a toy model, consider how the expected maximum of a sample from a heavy-tailed distribution is affected by sample size. I simulated this once and the relationship was approximately linear. But Soares' point still holds if any individual bet requires a minimum amount of time to pay off. You can scalably benefit from parallelism while still requiring a minimum amount of serial time.
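A minimal sketch of that kind of simulation (the distribution choice and parameters are my own illustrative assumptions, not from the comment): for a Pareto-type distribution with tail index alpha near 1, the expected maximum of n samples grows roughly like n^(1/alpha), i.e. close to linearly.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 1.1     # Pareto tail index; closer to 1 means a heavier tail and ~linear growth of E[max]
trials = 1000   # Monte Carlo repetitions per sample size

for n in [10, 100, 1_000, 10_000]:
    # Draw `trials` independent samples of size n and average their maxima to estimate E[max].
    maxima = rng.pareto(alpha, size=(trials, n)).max(axis=1)
    print(f"n={n:>6}  estimated E[max] ~ {maxima.mean():,.1f}")
```

Lighter-tailed distributions (exponential, normal) would instead give only roughly logarithmic growth in the expected maximum, which is what makes the heavy-tailed case special here.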

At its core, the argument appears to be "reward maximizing consequentialists will necessarily get the most reward." Here's a counter example to this claim: if you trained a Go playing AI with RL, you are unlikely to get a reward maximizing consequentialist. Why? There's no reason for the Go playing AI to think about how to take over the world or hack the computer that is running the game. Thinking this way would be a waste of computation. AIs that think about how to win within the boundaries of the rules therefore do better.

In the same way, if you could ro... (read more)

[This comment is no longer endorsed by its author]