All of Jack R's Comments + Replies

Jack R62

Aren’t turned off by perceived arrogance

One hypothesis I've had is that people with more MIRI-like views tend to be more arrogant themselves. A possible mechanism is that the idea that the world is going to end and that they are the only ones who can save it is appealing in a way that shifts their views on certain questions and changes the way they think about AI (e.g. they need less explanation for why they are some of the most important people ever, so they spend less time considering why AI might go well by default).

[ETA: In case it wasn't clear, I am positing subconscious patterns correlated with arrogance that lead to MIRI-like views]

Jack R20

How'd this go? Just searched LW for "neurofeedback" since I recently learned about it

4James_Miller
I stopped doing it years ago.  At the time I thought it reduced my level of anxiety.  My guess now is that it probably did but I'm uncertain if the effect was placebo.  
Jack R10

That argument makes sense, thanks

Jack R71

We are very likely not going to miss out on alignment by a 2x productivity boost, that’s not how things end up in the real world. We’ll either solve alignment or miss by a factor of >10x.

Why is this true?

9elifland
Most problems that people work on in research are roughly the right difficulty, because the ambition level is adjusted to be somewhat challenging but not unachievable. If it's too hard then the researcher just moves on to another project. This is the problem selection process we're used to, and might bias our intuitions here. On the other hand, we want to align AGI because it's a really important problem, and have no control over the difficulty of the problem. And if you think about the distribution of difficulties of all possible problems, it would be a huge coincidence if the problem of aligning AGI, chosen for its importance and not its difficulty, happened to be within 2x difficulty of the effort we end up being able to put in.
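A toy way to see the force of this coincidence argument (my own numbers and distributional assumption, not elifland's): if the log-difficulty of an importance-selected problem could plausibly fall anywhere across many orders of magnitude, only a small slice of that range sits within 2x of any given effort level.

```python
# Minimal sketch under an assumed toy model: the problem's difficulty (in effort
# units) is log-uniform over 8 orders of magnitude around our achievable effort.
# How often does it land within a factor of 2 of what we can actually do?
import math

spread_in_orders = 8.0            # assumed spread of plausible difficulties (log10)
window_in_orders = math.log10(4)  # a [0.5x, 2x] window spans a factor of 4 (~0.6 orders)

p_within_2x = window_in_orders / spread_in_orders
print(f"P(difficulty within 2x of our effort) ~ {p_within_2x:.1%}")  # ~7.5%
```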
Jack R91

the genome can’t directly make us afraid of death

It's not necessarily direct, but in case you aren't aware of it, prepared learning is a relevant phenomenon, since apparently the genome does predispose us to certain fears.

Jack R20

Seems like this guy has already started trying to use GPT-3 in a videogame: GPT3 AI Game Prototype

Jack R10

Not sure if it was clear, but the reason I asked was because it seems like if you think the fraction changes significantly before AGI, then the claim that Thane quotes in the top-level comment wouldn't be true.

2johnswentworth
Oh, I see. Certainly if the time required to implement our current best idea goes down, then the timescale at which we care about timelines becomes even shorter.
Jack R40

Don't timelines change your views on takeoff speeds? If not, what's an example piece of evidence that updates your timelines but not your takeoff speeds?

4johnswentworth
That's not how the causal arrow works. There are some interesting latent factors which influence both timelines and takeoff speeds, like e.g. "what properties will the first architecture to take off have?". But then the right move is to ask directly about those latent factors, or directly about takeoff speeds. Timelines are correlated with takeoff speeds, but not really causally upstream.
Jack R21

Same - also interested if John was assuming that the fraction of deployment labor that is automated changes negligibly over time pre-AGI.

2johnswentworth
That's an interesting and potentially relevant question, but a separate question from timelines, and mostly not causally downstream of timelines.
Jack R20

Humans can change their action patterns on a dime, inspired by philosophical arguments, convinced by logic, indoctrinated by political or religious rhetoric, or plainly because they're forced to.

I'd add that action patterns can change for reasons other than logical/deliberative ones. For example, adapting to a new culture means you might adopt new reactions to objects, gestures, etc. that are considered symbolic in that culture.

Jack R20

so the edge  is terminal

Earlier you said that the blue edges were terminal edges.

2zac_kenton
Thanks, this has now been corrected to say 'not terminal'.
Jack R116

What are some of the "various things" you have in mind here? It seems possible to me that something like "AI alignment testing" is straightforwardly upstream of what players want, but maybe you were thinking of something else

Jack R30

"Go with your gut” [...] [is] insensitive to circumstance.

People's guts seem very sensitive to circumstance, especially compared to commitments.

2CFAR!Duncan
Yes, this is correct. But the policy "go with your gut" is insensitive to circumstance, and cannot e.g. account for times when "guts" are the wrong tool to trust.
Jack R30

But the capabilities of neural networks are currently advancing much faster than our ability to understand how they work or interpret their cognition;

Naively, you might think that as opacity increases, trust in systems decreases, and hence something like "willingness to deploy" decreases. 

How good of an argument does this seem to you against the hypothesis that "capabilities will grow faster than alignment"? I'm viewing the quoted sentence as an argument for the hypothesis.

Some initial thoughts:

  • A highly capable system doesn't necessarily need to be de…
Jack R10

I was thinking of the possibility of affecting decision-making, either directly by rising through the ranks (not very likely) or indirectly by being an advocate for safety at an important time and pushing things into the Overton window within an organization.

I imagine Habryka would say that a significant possibility here is that joining an AGI lab will wrongly turn you into an AGI enthusiast. I think biasing effects like that are real, though I also think it's hard to tell in cases like that how much you are biased vs. updating correctly on new information…

Jack R10

It seems like you are confident that the delta in capabilities would outweigh any delta in general alignment sympathy. Is this what you think?

1Yonatan Cale
May I ask what you are calling "general alignment sympathy"? Could you say it in other words or give some examples?
Jack R*21

Attempting to manually specify the nature of goodness is a doomed endeavor, of course, but that's fine, because we can instead specify processes for figuring out (the coherent extrapolation of) what humans value. […] So today's alignment problems are a few steps removed from tricky moral questions, on my models.
 

I'm not convinced that choosing those processes is significantly non-moral. I might be misunderstanding what you are pointing at, but it feels like the fact that being able to choose the voting system gives you power over the vote's outcome is evidence of this sort of thing - that meta decisions are still importantly tied to decisions.

Jack R13

I think there should be a word for your parsing, maybe "VNM utilitarianism," but I think most people mean roughly what's on the wiki page for utilitarianism:

Utilitarianism is a family of normative ethical theories that prescribe actions that maximize happiness and well-being for all affected individuals

Jack R41

It's not obvious to me that the class of counter-examples "expertise, in most fields, is not easier to verify than to generate" are actually counter-examples. For example for "if you're not a hacker, you can't tell who the good hackers are," it still seems like it would be easier to verify whether a particular hack will work than to come up with it yourself, starting off without any hacking expertise.

2johnswentworth
First, "does the hack work?" is not the only relevant question. A good hacker knows that other things also matter - e.g. how easy the code is for another person to understand, or how easy it is to modify later on. This principle generalizes: part of why expertise is hard-to-recognize is because non-experts won't realize which questions to ask.

Second, checking whether a program does what we intend in general (i.e. making sure it has no bugs) is not consistently easier than writing a correct program oneself, especially if the program we're trying to check is written by a not-very-good programmer. This is the fundamental reason why nobody uses formal verification methods: writing the specification for what-we-want-the-code-to-do is usually about as difficult, in practice, as writing the code to do it. (This is actually a separate argument/line-of-evidence that verification is not, in practice and in general, easier than generation.)
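As a toy illustration of the spec-versus-code point (my example, not John's): for a lot of ordinary code, any honest specification of what we want ends up restating the same logic, so writing the spec takes roughly the same understanding as writing the implementation.

```python
# Hypothetical example: a bracketed tax calculation and its "formal spec".
# Note the spec has to re-derive the same bracket logic to say what "correct" means.
BRACKETS = [(0, 10_000, 0.10), (10_000, 40_000, 0.20), (40_000, float("inf"), 0.35)]

def tax_owed(income: float) -> float:
    """Implementation: sum tax over the portion of income in each bracket."""
    return sum(max(0.0, min(income, hi) - lo) * rate for lo, hi, rate in BRACKETS)

def meets_spec(income: float, result: float) -> bool:
    """Specification: restates the bracket logic, so it is as much work as the code."""
    expected = sum(max(0.0, min(income, hi) - lo) * rate for lo, hi, rate in BRACKETS)
    return abs(result - expected) < 1e-9

assert meets_spec(25_000, tax_owed(25_000))
```

Verifying the implementation against this spec buys little extra assurance, because writing the spec required the same understanding (and offers the same opportunities for error) as writing the code.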
Jack RΩ460

Could you clarify a bit more what you mean when you say "X is inaccessible to the human genome?"

Logan RiggsΩ6148

My understanding is: Bob's genome didn't have access to Bob's developed world model (WM) when he was born (because his WM wasn't developed yet). Bob's genome can't directly specify "care about your specific family" because it can't hardcode Bob's specific family's visual or auditory features.

This direct-specification wouldn't work anyways because people change looks, Bob could be adopted, or Bob could be born blind & deaf. 

[Check, does the Bob example make sense?]

But, the genome does do something indirectly that consistently leads to people valuin…

Jack R30

Ah okay -- I have updated positively on the usefulness based on that description, and have updated positively on the hypothesis "I am missing a lot of important information that contextualizes this project," though I'm still confused.

Would be interested to know the causal chain from understanding circuit simplicity to the future being better, but maybe I should just stay posted (or maybe there is a different post I should read that you can link me to; or maybe the impact is diffuse and talking about any particular path doesn't make that much sen…

Jack R40

I didn't finish reading this, but if it were the case that:

  • There were clear and important implications of this result for making the world better (via aligning AGI)
  • These implications were stated in the summary at the beginning

then I very plausibly would have finished reading the post or saved it for later.

ETA: For what it's worth, I still upvoted and liked the post, since I think deconfusing ourselves about stuff like this is plausibly very good and at the very least interesting. I just didn't like it enough to finish reading it or save it, because from my perspective its expected usefulness wasn't high enough given the information I had.

6Vivek Hebbar
This is only one step toward a correct theory of inductive bias.  I would say that "clear and important implications" will only come weeks from now, when we are much less confused and have run more experiments.   The main audience for this post is researchers whose work is directly adjacent to inductive bias and training dynamics.  If you don't need gears-level insights on this topic, I would say the tl;dr is:  "Circuit simplicity seems kind of wrong; there's a cool connection between information loss and basin flatness which is probably better but maybe still not very predictive; experiments are surprising so far; stay posted for more in ~2 weeks."
1[comment deleted]
Jack R30

I wonder if there are any measurable dimensions along which tasks can vary, and whether that could help with predicting task progress at all.  A simple example is the average input size for the benchmark.

Jack R10

I'm glad you posted this — this may be happening to me, and now I've counterfactually read about sunk cost faith.

Jack R90

I don't know how good of a fit you would be, but have you considered applying to Redwood Research?

2mic
Or other AI alignment organizations like Anthropic, the Fund for Alignment Research, or Aligned AI.
Jack R10

Ah I see, and just to make sure I'm not going crazy, you've edited the post now to reflect this?

1Thomas Kwa
Yes
Jack R10

W is a function, right? If so, what’s its type signature?

1Thomas Kwa
As written w takes behaviors to "properties about world-trajectories that the base optimizer might care about" as Wei Dai says here. If there is uncertainty, I think w could return distributions over such world-trajectories, and the argument would still work.
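For concreteness, a minimal sketch of that type signature as I read the description above; Behavior and TrajectoryProperty are placeholder names of my own, not notation from the post.

```python
# Rough sketch of w's type signature (placeholder types, assumed reading).
from typing import Callable, Dict

class Behavior: ...            # a behavior/policy produced by the optimized system
class TrajectoryProperty: ...  # a property of world-trajectories the base optimizer might care about

# Deterministic reading: w maps each behavior to the trajectory-property it brings about.
W = Callable[[Behavior], TrajectoryProperty]

# Under uncertainty: w instead returns a distribution over such properties.
W_uncertain = Callable[[Behavior], Dict[TrajectoryProperty, float]]
```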
Jack R10

I agree, though I want a good enough understanding of the gears that I can determine whether something like "telling yourself you are awesome every day" will have counterfactually better outcomes than not. I guess the studies seem to suggest the answer in this case is "yes," inasmuch as the negative externalities of self-delusion are captured by the metrics that the studies in the TED talk use. [ETA: and I feel like now I have nearly answered the question for myself, so thanks for the prompt!]

Jack R20

What’s a motivation stack? Could you give an example?

1weathersystems
I read it as an analogy to a programming stack trace, but with motivations. Oftentimes you're motivated to do A in order to get B in order to get C, where one thing is desired only as a means to get something else. Presumably these chains of desire bottom out in some terminal desires, things that are desired for their own sake, not because of some other thing they get you. So one example could be, "I want to get a job, in order to get money, in order to be able to feed myself."

I'm not sure if that's what they meant. I'm often kind of skeptical of that sort of psychologizing, though. It's not that it can't be done, but that our reasons for having motivations are often invisible to ourselves. My guess is that when people try to explain their own actions/motivations in this way, they're largely just making up a plausible story.
2romeostevensit
Instrumental goal -> instrumental goal -> instrumental goal -> terminal goal, though in reality, due to multifinal strategies, it's more like a web, so there are some up-front costs to disentangling that.
Answer by Jack R10

A partial answer: 

  • Your emotions are more negative than granted if, for instance, it's often the case that your anxiety is strong enough that it feels like you might die and you don't in fact die.
  • Your emotions are more positive than granted if it's often the case that, for instance, you are excited about getting job offers "more than" you tend to get job offers.

These answers still have ambiguity though, in "more than" and in how many Bayes points your anxiety as a predictor of death actually gets.
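One rough way to make the "Bayes points" part concrete (my own toy operationalization, not necessarily what was meant above): treat the emotion as an implicit probability forecast of the feared or hoped-for outcome, and score it with the log scoring rule.

```python
# Toy log-score for an emotion treated as an implicit forecast (assumed framing).
import math

def log_score(predicted_p: float, outcome_happened: bool) -> float:
    """Higher (closer to 0) is better; a calibrated forecaster maximizes this on average."""
    p = predicted_p if outcome_happened else 1 - predicted_p
    return math.log(p)

# If strong anxiety implicitly predicts "I might die" at p = 0.3 but the outcome
# essentially never happens, it scores far worse than a calm baseline of p = 0.001.
print(log_score(0.3, False))    # ~ -0.36 per episode
print(log_score(0.001, False))  # ~ -0.001 per episode
```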

Jack R20

I'll add that when I asked John Wentworth why he was IDA-bearish, he mentioned the inefficiency of bureaucracies and told me to read the following post to learn why interfaces and coordination are hard: Interfaces as a Scarce Resource.

Jack R30

while in the slow takeoff world your choices about research projects are closely related to your sociological predictions about what things will be obvious to whom when.
 

Example?

6Buck
I’m not that excited for projects along the lines of “let’s figure out how to make human feedback more sample efficient”, because I expect that non-takeover-risk-motivated people will eventually be motivated to work on that problem, and will probably succeed quickly given motivation. (Also I guess because I expect capabilities work to largely solve this problem on its own, so maybe this isn’t actually a great example?) I’m fairly excited about projects that try to apply human oversight to problems that the humans find harder to oversee, because I think that this is important for avoiding takeover risk but that the ML research community as a whole will procrastinate on it.
Jack R40

I found this comment pretty convincing. Alignment has been compared to philosophy, which seems to be at the opposite end of "the fuzziness spectrum" from math and physics. And it does seem like concept fuzziness would make evaluation harder.

I'll note though that ARC's approach to alignment seems more math-problem-flavored than yours, which might be a source of disagreement between you two (since maybe you conceptualize what it means to work on alignment differently).

Jack R80

MIRI doesn't have good reasons to support the claim of almost certain doom


I recently asked Eliezer why he didn't suspect ELK to be helpful, and it seemed that one of his major reasons was that Paul was "wrongly" excited about IDA. It seems that at this point in time, neither Paul nor Eliezer are excited about IDA, but Eliezer got to the conclusion first. Although, the IDA-bearishness may be for fundamentally different reasons -- I haven't tried to figure that out yet.

Have you been taking this into account re: your ELK bullishness? Obviously, this sort of p…

9paulfchristiano
I'm still excited about IDA. I assume this is coming from me saying that you need big additional conceptual progress to have an indefinitely scalable scheme. And I do think that's more skeptical than my strongest pro-IDA claim here in early 2017:

That said:

  • I think it's up for grabs whether we'll end up with something that counts as "this basic strategy." (I think imitative generalization is the kind of thing I had in mind in that sentence, but many of the ELK schemes we are thinking about definitely aren't, it's pretty arbitrary.)
  • Also note that in that post I'm talking about something that produces a benign agent in practice, and in the other I'm talking about "indefinitely scalable." Though my probability on "produces a benign agent in practice" is also definitely lower.
2Yitz
Did Eliezer give any details about what exactly was wrong about Paul’s excitement? Might just be an intuition gained from years of experience, but the more details we know the better, I think.
Jack R10

I think Nate Soares has beliefs about question 1.  A few weeks ago, we were discussing a question that seems analogous to me -- "does moral deliberation converge, for different ways of doing moral deliberation? E.g. is there a unique human CEV?" -- and he said he believes the answer is "yes." I didn't get the chance to ask him why, though.

Thinking about it myself for a few minutes, it does feel like all of your examples for how the overseer could have distorted values have a true "wrongness" about them that can be verified against reality -- this makes me feel optimistic that there is a basin of human values, and that "interacting with reality" broadly construed is what draws you in.

Jack R10

An example is an AI making the world as awful as possible, e.g. by creating dolorium. There is a separate question about how likely this is, hopefully very unlikely.

2Shmi
Yeah, I would not worry about sadistic AI being super likely, unless specifically designed.
Jack R10

I mean to argue against your meta-strategy, which relies on obtaining relevant understanding about deception or alignment as we get larger models and see how they work. I agree that we will obtain some understanding, but it seems like we shouldn't expect that understanding to be very close to sufficient for making AI go well (see my previous argument), and hence this is not a very promising meta-strategy.

2Kaj_Sotala
I read your previous comment as suggesting that the improved understanding would mainly be used for pursuing a specific strategy for dealing with deception, namely "to learn the properties of what looks like deception to humans, and instill those properties into a loss function". And it seemed to me that the problem you raised was specific to that particular strategy for dealing with deception, as opposed to something else that we might come up with?
Jack R20

[ETA: I'm not that sure of the below argument]

Thanks for the example, but it still seems to me that this sort of thing won't work for advanced AI. If you are familiar with the ELK report, you should be able to see why. [Spoiler below]

Even if you manage to learn the properties of what looks like deception to humans, and instill those properties into a loss function, it seems like you are still more likely to get a system that tells you what humans think the truth is, avoiding what humans would be able to notice as deception, rather than telling you wha…

2Kaj_Sotala
What sort of thing? I didn't mean to propose any particular strategy for dealing with deception, I just meant to say that now OpenAI has 1) a reason to figure out deception and 2) a concrete instance of it that they can reason about and experiment with and which might help them better understand exactly what's going on with it.

More generally, the whole possibility that I was describing was that it might be impossible for us to currently figure out the right strategy since we are missing some crucial piece of understanding. If I could give you an example of some particularly plausible-sounding strategy, then that strategy wouldn't have been impossible to figure out with our current understanding, and I'd be undermining my whole argument. :-) Rather, my example was meant to demonstrate that it has already happened that

  • Progress in capabilities research gives us a new concrete example of how e.g. deception manifests in practice, that can be used to develop our understanding of it and develop new ideas for dealing with it.
  • Capabilities research reaches a point where even capabilities researchers have a natural reason to care about alignment, reducing the difference between "capabilities research" and "alignment research".
  • Thus, our understanding and awareness of deception is likely to improve as we get closer to AGI, and by that time we will have already learned a lot about how deception manifests in simpler systems and how to deal with it, and maybe some of that will suggest principles that generalize to more powerful systems as well (even if a lot of it won't).

It's not that I'd put a particularly high probability on InstructGPT by itself leading to any important insights about either deception in particular or alignment in general. I-GPT is just an instance of something that seems likely to help us understand deception a little bit better. And given that, it seems reasonable to expect that further capabilities development will also give us small insigh…
Jack R90

Isn’t the worst case one in which the AI optimizes exactly against human values?

2Shmi
I don't know what it means, can you give a few examples?
2ersatz
I think so; by definition, nothing can be worse than that.
5CarlShulman
You're right, my link was wrong, that one is a fine link.
Jack R60

it could be that the lack of alignment understanding is an inevitable consequence of our capabilities understanding not being there yet.

Could you say more about this hypothesis? To me, it feels likely that you can get crazy capabilities from a black box that you don't understand and so whose behavior/properties you can't verify to be acceptable. It's not like once we build a deceptive model we will know what deceptive computation looks like and how to disincentivize it (which is one way your nuclear analogy could translate).

It's possible, also, that this i…

7Kaj_Sotala
Or maybe once our understanding of intelligent computation in general improves, it will also give us the tools for better identifying deceptive computation.

E.g. language models are already "deceptive" in a sense - asked something that it has no clue about, InstructGPT will happily come up with confident-sounding nonsense. When I shared that, multiple people pointed out that its answers sound like the kind of a student who's taking an exam and is asked to write an essay about a topic they know nothing about, but they try to fake their way through anyway (that is, they are trying to deceive the examiner).

Thus, even if you are doing pure capabilities research and just want your AI system to deliver people accurate answers, it is already the case that you can see a system like InstructGPT "trying to deceive" people. If you are building a question-answering system, you want to build one that people can trust to give accurate answers rather than impressive-sounding bullshit, so you have the incentive to work on identifying and stopping such "deceptive" computations as a capabilities researcher already.

This means that the existence of InstructGPT gives you both 1) a concrete financial incentive to do research for identifying and stopping deceptive computation 2) a real system that actually carries out something like deceptive computation, which you can experiment on and whose behavior you can make use of in trying to understand the phenomenon better.

That second point is something that wouldn't have been the case before our capabilities got to this point. And it might allow us to figure out something we wouldn't have thought of before we had a system with this capability level to tinker with.
Jack R110

One thing is that it seems like they are trying to build some of the world’s largest language models (“state of the art models”)

Jack R112

It seems to me that it would be better to view the question as "is this frame the best one for person X?" rather than "is this frame the best one?"

Though, I haven't fully read either of your posts, so excuse any mistakes/confusion.

Congrats on making an important and correct point without needing to fully read the posts! :) That's just efficiency.

Jack R30

Do you have an example of a set of 1-detail stories you now might tell (composed with “AND”)?

2Gordon Seidoh Worley
keep your teeth healthy and brushing keeps teeth healthy and flossing keeps teeth healthy and ...
Jack R10

Ah — sorry if I missed that in the post, only skimmed

Jack R80

Random tip: If you want to restrict apps etc on your iPhone but not know the Screen Time pin, I recommend the following simple system which allows you to not know the password but unlock restrictions easily when needed:

  1. Ask a friend to write a 4 digit pin in a small note book (which is dedicated only for this pin)
  2. Ask them to punch in the pin to your phone when setting the Screen Time password
  3. Keep the notebook in your backpack and never look inside of it, ever
  4. If you ever need your phone unlocked, you can walk up to someone, even a stranger, show them the not…
2TurnTrout
Yup, this is what I did, but I just didn't have the notebook. I like the bright line.
Jack R10

Thanks for this list!

Though the list still doesn't strike me as very novel -- it feels that most of these conditions are conditions we've been shooting for anyways.

E.g. conditions 1, 2, and 5 are about selecting for behavior we approve of, and condition 6 is just inspection with interpretability tools.

If you feel you have traction on conditions 3 and 4 though, that does seem novel (side-note that condition 4 seems to be a subset of condition 3). I feel skeptical though, since value extrapolation seems like about as hard of a problem as understanding machine…

3Stuart_Armstrong
Yes, the list isn't very novel - I was trying to think of the mix of theoretical and practical results that convince us, in the current world, that a new approach will work. Obviously we want a lot more rigour for something like AI alignment! But there is an urgency to get it fast, too :-(
Jack R10

Ping about my other comment -- FYI, because I am currently concerned that you don't have criteria for the innards in mind, I'm less excited about your agenda than other alignment theory agendas (though this lack of excitement is somewhat weak, e.g. since I haven't tried to digest your work much yet).

2Stuart_Armstrong
Let me develop the idea a bit more. It is somewhat akin to answering, in 1968, the question "how do you know you've solved the moon landing problem?" In that case, NASA could point to them having solved a host of related problems (getting into space, getting to the moon, module separation, module reconnection), knowing that their lander could theoretically land on the moon (via knowledge of the laws of physics and of their lander design), estimating that the pilots are capable of dealing with likely contingencies, trusting that their model of the lunar landing problem is correct and has covered various likely contingencies, etc... and then putting it all together into a plan where they could say "successful lunar landing is likely". Note that various parts of the assumptions could be tested; engineers could probe at the plan and say things like "what if the conductivity of the lunar surface is unusual", and try and see if their plan could cope with that.

Back to value extrapolation. We'd be confident that it is likely to work if we had, for example:

  1. It works well in all situations where we can completely test it (eg we have a list of human moral principles, and we can have an AI successfully run a school using those as input).
  2. It works well on testable subproblems of more complicated situations (eg we inspect the AI's behaviour in specific situations).
  3. We have models of how value extrapolation works in extreme situations, and strong theoretical arguments that those models are correct.
  4. We have developed a much better theoretical understanding of value extrapolation, and are confident that it works.
  5. We've studied the problem adversarially and failed to break the approach.
  6. We have deployed interpretability methods to look inside the AI at certain places, and what we've seen is what we expect to see.

These are the sort of things that could make us confident that a new approach could work. Is this what you are thinking?