LESSWRONG
LW

All of RicG's Comments + Replies

Why White-Box Redteaming Makes Me Feel Weird

__RicG__4mo10

How cherry picked are those examples? Any other words/tokens/sequences they repeat?

Alignment Faking in Large Language Models

__RicG__7mo10

Thank you for providing this detail, that's basically what I was looking for!

Alignment Faking in Large Language Models

__RicG__7mo133

I am curious to know whether Anthropic has any sort of plan to not include results such as this in the training data of actual future LLMs.

To me it seems like a bad idea to include it since it could allow the model to have a sense on how we can set up a fake deployment-training distinction setups or how it should change and refine its strategies. It also can paint a picture that the model behaving like this is expected which is a pretty dangerous hyperstition.

Jozdien7mo157

They do say this in the paper:

As Evan agrees with here however, simply not including the results themselves doesn't solve the problem of the ideas leaking through. There's a reason unlearning as a science is difficult, information percolates in many ways and drawing boundaries around the right thing is really hard.

TurnTrout's shortform feed

__RicG__2yΩ46-3

If your model has the extraordinary power to say what internal motivational structures SGD will entrain into scaled-up networks, then you ought to be able to say much weaker things that are impossible in two years, and you should have those predictions queued up and ready to go rather than falling into nervous silence after being asked.

Sorry, I might misunderstanding you (and hope I am), but... I think doomers literally say "Nobody knows what internal motivational structures SGD will entrain into scaled-up networks and thus we are all doomed". The pr... (read more)

2TurnTrout2y

It would be "useful" (i.e. fitness-increasing) for wolves to have evolved biological sniper rifles, but they did not. By what evidence are we locating these motivational hypotheses, and what kinds of structures are dangerous, and why are they plausible under the NN prior? The relevant commonality is "ability to predict the future alignment properties and internal mechanisms of neural networks." (Also, I don't exactly endorse everything in this fake quotation, so indeed the analogized tasks aren't as close as I'd like. I had to trade off between "what I actually believe" and "making minimal edits to the source material.")

AGI-Automated Interpretability is Suicide

__RicG__2y10

Sorry for taking long to get back to you.

So I take this to be a minor, not a major, concern for alignment, relative to others.

Oh sure, this was more a "look at this cool thing intelligent machines could do that should shut up people from saying things like 'foom is impossible because training run are expensive'".

learning is at least as important as runtime speed. Refining networks to algorithms helps with one but destroys the other
Writing poems, and most cognitive activity, will very likely not resolve to a more efficient algorithm like arithmetic does. Ar

... (read more)

3Seth Herd2y

I haven't justified either of those statements; I hope to make the complete arguments in upcoming posts. For now I'll just say that human cognition is solving tough problems, and there's no good reason to think that algorithms would be lots more efficient than networks in solving those problems. I'll also reference Morevec's Paradox as an intuition pump. Things that are hard for humans, like chess and arithmetic are easy for computers (algorithms); things that are easy for humans, like vision and walking, are hard for algorithms. I definitely do not think it's pragmatically possible to fully interpret or reverse engineer neural networks. I think it's possible to do it adequately to create aligned AGI, but that's a much weaker criteria.

AGI-Automated Interpretability is Suicide

__RicG__2y10

Thanks for coming back to me.

"OK good point, but it's hardly "suicide" to provide just one more route to self-improvement"

I admit the title is a little bit clickbaity, but given my list of assumption (which do include that NNs can be made more efficient by interpreting them) it does elucidate a path to foom (which does look like suicide without alignment).

Unless there's an equally efficient way to do that in closed form algorithms, they have a massive disadvantage in any area where more learning is likely to be useful.

I'd like to point out that in this ins... (read more)

2Seth Herd2y

Any type of self-improvement in an un-aligned AGI = death. And if it's already better than human level, it might not even need to do a bit of self-improvement, just escape our control, and we're dead. So I think the suicide is quite a bit of hyperbole, or at least stated poorly relative to the rest of the conceptual landscape here. If the AGI is aligned when it self-improves with algorithmic refinement, reflective stability should probably cause it to stay aligned after, and we just have a faster benevolent superintelligences. So this concern is one more route to self-improvement. And theres a big question of how good a route it is. My points were: 1. learning is at least as important as runtime speed. Refining networks to algorithms helps with one but destroys the other 2. Writing poems, and most cognitive activity, will very likely not resolve to a more efficient algorithm like arithmetic does. Arithmetic is a special case; perception and planning in varied environments require broad semantic connections. Networks excel at those. Algorithms do not. So I take this to be a minor, not a major, concern for alignment, relative to others.

AGI-Automated Interpretability is Suicide

__RicG__2y10

Uhm, by interpretability I mean things like this where the algorithm that the NN implements is revered engineered, written down as code or whatever which would allow for easier recursive self improvement (by improving just the code and getting rid of the spaghetti NN).

Also by the looks of things (induction heads and circuits in general) there does seem to be a sort of modularity in how NN learn, so it does seem likely that you can interpret piece by piece. If this wasn't true I don't think mechanistic interpretability as a field would even exist.

Jailbreaking GPT-4's code interpreter

__RicG__2y60

BTW, if anyone is interested the virtual machine has these specs:

System: Linux 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 x86_64 x86_64 GNU/Linux

CPU: Intel Xeon CPU E5-2673 v4, 16 cores @ 2.30GHz

RAM: 54.93 GB

8gwern2y

That's surprisingly chonky. 50 GB RAM? (Damn, most people, or servers for that matter, don't have half that much RAM.) Does this correspond to a standard MS Azure instance? This would make for a pretty awesome botnet, perhaps through indirect prompt injection...

Why I am not an AI extinction cautionista

__RicG__2y00

I did listen to that post, and while I don't remember all the points, I do remember that it didn't convince me that alignment is easy and, like Christiano's post "Where I agree and disagree with Eliezer", it just seems to be like "p(doom) of 95%+ plus is too much, it's probably something like 10-50%" which is still incredibly unacceptably high to continue "business as usual". I have faith that something will be done: regulation and breakthrough will happen, but it seems likely that it won't be enough.

It comes down to safety mindset. There are very few and ... (read more)

Why I am not an AI extinction cautionista

__RicG__2y00

I don’t get you. You are upset about people saying that we should scale back capabilities research, while at the same time holding the opinion that we are not doomed because we won’t get to ASI? You are worried that people might try to stop the technology that in your opinion may not happen?? The technology that if does indeed happen, you agree that “If [ASI] us wants us gone, we would be gone”?!?

Said this, maybe you are misunderstanding the people that are calling for a stop. I don’t think anyone is proposing to stop narrow AI capabilities. Just the dange... (read more)

Why I am not an AI extinction cautionista

__RicG__2y2-1

Thanks for the list, I've already read a lot of those posts, but I still remain unconvinced. Are you convinced by any of those arguments? Do you suggest I take a closer look to some posts?

But honestly, with the AI risk statement signed by so many prominent scientists and engineer, debating that AI risks somehow don't exists seems to be just a fringe anti-climate-change-like opinion held by few stubborn people (or people just not properly introduced to the arguments). I find it funny that we are in a position where in the possible counter arguments ap... (read more)

1[anonymous]2y

6Shmi2y

Well, yes, the statement says "should be a global priority alongside other societal-scale risks", not anything about brakes on capabilities research, or privileging this risk over others. This is not at all the cautionista stance. Not even watered down. It is only to raise the public profile of this particular x-risk existence.

Why I am not an AI extinction cautionista

__RicG__2y1-3

You might object that OP is not producing the best arguments against AI-doom. In which case I ask, what are the best arguments against AI-doom?

I am honestly looking for them too.

The best I, myself, can come up with are brief light of "maybe the ASI will be really myopic and the local maxima for its utility is a world where humans are happy long enough to figure out alignment properly, and maybe the AI will be myopic enough that we can trust its alignment proposals", but then I think that the takeoff is going to be really fast and the AI would just se... (read more)

4[anonymous]2y

Why I am not an AI extinction cautionista

__RicG__2y00

Well, I apologized for the aggressiveness/rudeness, but I am interested if I am mischaracterizing your position or if you really disagree with any particular "counter-argument" I have made.

Why I am not an AI extinction cautionista

__RicG__2y2-1

I feel like briefly discussing every point on the object level (even though you don't offer object level discussion: you don't argue why the things you list are possible, just that they could be):

...Recursive self-improvement is an open research problem, is apparently needed for a superintelligence to emerge, and maybe the problem is really hard.

It is not necessary. If the problem is easy we are fucked and should spend time thinking about alignment, if it's hard we are just wasting some time thinking about alignment (it is not a Pascal mugging). This is ju... (read more)

2Shmi2y

You make a few rather strong statements very confidently, so I am not sure if any further discussion would be productive.

Why I am not an AI extinction cautionista

__RicG__2y20

"Despite all the reasons we should believe that we are fucked, there might just be missing some reasons we don't yet know for why everything will all go alright" is a really poor argument IMO.

...AI that is smart enough to discover new physics may also discover separate and efficient physical resources for what it needs, instead of grabby-alien-style lightconing it through the Universe.

This especially feels A LOT like you are starting from hopes and rationalizing them. We have veeeeery little reasons to believe that might be true... and also you just ... (read more)

1[anonymous]2y

2__RicG__2y

I feel like briefly discussing every point on the object level (even though you don't offer object level discussion: you don't argue why the things you list are possible, just that they could be): It is not necessary. If the problem is easy we are fucked and should spend time thinking about alignment, if it's hard we are just wasting some time thinking about alignment (it is not a Pascal mugging). This is just safety mindset and the argument works for almost every point to justify alignment research, but I think you are addressing doom rather than the need for alignment. The short version of RSI is: SI seems to be a cognitive process, so if something is better at cognition it can SI better. Rinse and repeat. The long version. I personally think that just the step from from neural nets to algorithms (which is what perfectly successful interpretability would imply) might be enough to have dramatic improvement on speed and cost. Enough to be dangerous, probably even starting from GPT-3. This has been claimed time and time again, people thinking this, just 3 years ago, would have predicted GPT-4 to be impossible without many breakthroughs. ML hasn't hit a wall yet, but maybe soon? What are you actually arguing? You seem to imply that humans don't discover new math, physics, chemistry, CS algorithms...? 🤔 AGI (not ASI) are still plenty dangerous because they are in silicon. Compared to bio-humans they don't sleep, don't get tired, have speed advantage, ease of communication between each other, ease of self-modification (sure, maybe not foom-style RSI, but self-mod is on the table), self-replication not constrained by willingness to have kids, a lot of physical space, food, health, random IQ variance, random interest and without needing the slow 20-30 years of growth needed for humans to be productive. GPT-4 might not write genius-level code, but it does write code faster than anyone else. Why do you need something that goal-seeks beyond what human informally mean?

"LLMs Don't Have a Coherent Model of the World" - What it Means, Why it Matters

__RicG__2y32

I am quite confused. It is not clear to me if at the end you are saying that LLMs do or don't have a world model. Can you clearly say on which "side" do you stand on? Are you even arguing for a particular side? Are you arguing that the idea of "having a world model" doesn't apply well to an LLM/is just not well defined?

Said this, you do seem to be claiming that LLMs do not have a coherent model of the world (again, am I misunderstanding you?), and then use humans as an example of what having a coherent world model looks like. This sentence is particularly ... (read more)

2Davidmanheim2y

Yes, I'm mostly embracing simulator theory here, and yes, there are definitely a number of implicit models of the world within LLMs, but they aren't coherent. So I'm not saying there is no world model, I'm saying it's not a single / coherent model, it's a bunch of fragments. But I agree that it doesn't explain everything! To step briefly out of the simulator theory frame, I agree that part of the problem is next-token generation, not RLHF - the model is generating the token, so it can't "step back" and decide to go back and not make the claim that it "knows" should be followed by a citation. But that's not a mistake on the level of simulator theory, it's a mistake because of the way the DNN is used, not the joint distribution implicit in the model, which is what I view as "actually" what is simulated. For example, I suspect that if you were to have it calculate the joint probability over all the possibilities for the next 50 tokens at a time, and pick the next 10 based on that, then repeat, (which would obviously be computationally prohibitive, but I'll ignore that for now,) it would mostly eliminate the hallucination problem. On racism, I don't think there's much you need to explain; they did fine-tuning, and that was able to generate the equivalent of insane penalties for words and phrases that are racist. I think it's possible that RHLF could train away from the racist modes of thinking as well, if done carefully, but I'm not sure that is what occurs.

Is behavioral safety "solved" in non-adversarial conditions?

__RicG__2y1-1

The article and my examples were meant to show that there is a gap between what GPT knows and what it says. It knows something, but sometimes says that it doesn’t, or it just makes it up. I haven’t addressed your “GPT generator/critic” framework or the calibration issues as I don’t really see them much relevant here. GPT is just GPT. Being a critic/verifier is basically always easier. IIRC the GPT-4 paper didn’t really go into much detail of how they tested the calibration, but that’s irrelevant here as I am claiming that sometimes it know the “right prob... (read more)

Is behavioral safety "solved" in non-adversarial conditions?

__RicG__2y1-1

Offering a confused answer is in a sense bad, but with lying there’s an obviously better policy (don’t) while it’s not the case that a confused answer is always the result of a suboptimal policy.

Sure, but the “lying” probably stems from the fact that to get the thumbs up from RLHF you just have to make up a believable answer (because the process AFAIK didn’t involve actual experts in various fields fact checking every tiny bit). If just a handful of “wrong but believable” examples sneak in the reward modelling phase you get a model that thinks that sometim... (read more)

1David Johnston2y

You raise some examples of the generator/critic gap, which I addressed. I’m not sure what I should look for in that paper - I mentioned the miscalibration of GPT4 after RLHF, that’s from the GPT4 tech report, and I don’t believe your linked paper shows anything analogous (ie that RLHFd models are less calibrated than they “should” be). I know that the two papers here investigate different notions of calibration. “Always say true things” is a much higher standard than “don’t do anything obviously bad”. Hallucination is obviously a violation of the first, and it might be a violation of the second - but I just don’t think it’s obvious! One thing I'm saying is that we don't have clear evidence to support this claim.

how humans are aligned

__RicG__2y32

To me it isn't clear what alignment are you talking about.

You say that the list is about "alignment towards genetically-specified goals", which I read as "humans are aligned with inclusive genetic fitness", but then you talk about what I would describe as "humans aligned with each other" as in "humans want humans to be happy and have fun". Are you confusing the two?

South Korea isn't having kids anymore. Sometimes you get serial killers or Dick Cheney.

Here the first one shows misalignment towards IGF, while the second shows misalignment towards other humans, no?

Is behavioral safety "solved" in non-adversarial conditions?

__RicG__2y1815

I'd actually argue the answer is "obviously no".

RLHF wasn't just meant to address "don't answer how to make a bomb" or "don't say the n-word", it was meant to make GPT say factual things. GPT fails at that so often that this "lying" behaviour has its own term: hallucinations. It doesn't "work as intended" because it was intended to make it say true things.

Do many people really forget that RLHF was meant to make GPT say true things?

When OpenAI reports the success of RLHF as "GPT-4 is the most aligned model we developed" to me it sounds like a case of mostly... (read more)

2Robert_AIZI2y

Thanks, this is a useful corrective to the post! To shortcut safety to "would I trust my grandmother to use this without bad outcomes", I would trust a current-gen LLM to be helpful and friendly with her, but I would absolutely fear her "learning" factually untrue things from it. While I think it can be useful to have separate concepts for hallucinations and "intentional lies" (as another commenter argues), I think "behavioral safety" should preclude both, in which case our LLMs are not behaviorally safe. I think I may have overlooked hallucinations because I've internalized that LLMs are factually unreliable, so I don't use LLMs where accuracy is critical, so I don't see many hallucinations (which is not much of an endorsement of LLMs).

1David Johnston2y

I don’t agree. There is a distinction between lying and being confused - when you lie, you have to know better. Offering a confused answer is in a sense bad, but with lying there’s an obviously better policy (don’t) while it’s not the case that a confused answer is always the result of a suboptimal policy. When you are confused, the right course of action sometimes results in mistakes. AFAIK there’s no evidence of a gap between what GPT knows and what it says when it’s running in pure generative mode (though this doesn’t say much; one would have to be quite clever to demonstrate it). There is a generator/discriminator and generator/critical gap, but this is because GPT operating as a critic is simply more capable than GPT as a generator. If we compare apples to apples then there’s again no evidence I know of that RLHFd critic-GPT is holding back on things it knows. So I don’t think hallucination makes it obvious that behavioural safety is not solved. I do think the fact that RLHFd models are miscalibrated is evidence against RLHF solving behaviour safety, because calibration is obviously good and the base model was capable of it.

AGI-Automated Interpretability is Suicide

__RicG__2y10

Hell, neural networks, in physics are often regarding as just fitting with many parameters a really complex function we don't have the mathematical form of (sot hhe reverse of what I explained in this paragraph).

Basically I expect the neural networks to be a crude approximation of a hard-coded cognition algorithm. Not the other way around.

AGI-Automated Interpretability is Suicide

__RicG__2y10

What NNs do can't be turned into an algorithm by any known route.

NN-> agorithms was one of my assumptions. Maybe I can relay my intuitions for why it is a good assumption:

For example in the paper https://arxiv.org/abs/2301.05217 they explore grokking by making a transformer learn to do modular addition, and then they reverse engineer what algorithm the training "came up with". Furthermore, supporting my point in this post, the learned algorithm is also very far from being the most efficient, due to "living" inside a transformer. And so, in this example,... (read more)

2Seth Herd2y

Sorry it took me so long to get back to this; I either missed it or didn't have time to respond. I still don't, so I'll just summarize: You're saying that what NNs do could be made a lot more efficient by distilling it into algorithms. I think you're right about some cognitive functions but not others. That's enough to make your argument accurate, so I suggest you focus on that in future iterations. (Maybe going from suicide to adding danger would be more more accurate). I suggest this change because I think you're wrong about a majority of cognition. The brain isn't being inefficient in most of what it does. You've chosen arithmetic as your example. I totally agree that the brain performs arithmetic in a wildly inefficient way. But that establishes one end of a spectrum. The intuition that most of cognition could be vastly optimized with algorithms is highly debetable. After a couple of decades of working with NNs and thinking about how they perform human cognition, I have the opposite intuition: NNs are quite efficient (this isn't to say that they couldn't be made more efficient - surely they can!). For instance, I'm pretty sure that humans use a monte carlo tree search algorithm to solve novel problems and do planning. That core search strucure can be simplified as an algorithm. But the power of our search process comes from having excellent estimates of the semantic linkages between the problem and possible leaves in the tree, and excellent predictors of likely reward for each branch. Those estimates are provided by large networks with good learning rules. Those can't be compressed into an algorithm particularly efficiently; neural network distillation would probably work about as efficiently as it's possible to work. There are large computational costs because it's a hard problem, not because the brain is approaching the problem in an inefficient way. I'm not sure if that helps to convey my very different intuition or not. Like I said, I've got a limited

1__RicG__2y

Basically I expect the neural networks to be a crude approximation of a hard-coded cognition algorithm. Not the other way around.

AGI-Automated Interpretability is Suicide

__RicG__2y10

Well, tools like Pythia helps us peer inside the NN and helps us reason about how things works. The same tools can help the AGI reason about itself. Or the AGI develops its own better tools. What I am talking about is an AGI doing what the interpretability researchers are doing now (or what OpenAI is trying to do with GPT-4 interpreting GPT-2).

It doesn't' matter how, I don't know how, I just wanted to point out the simple path to algorithmic foom even if we start with a NN.

1Seth Herd2y

Oh, I see. I don't see a simple path to algorithmic foom from AI interpretability. What NNs do can't be turned into an algorithm by any known route. However, I do think some parts of their reasoning might be adaptable to algorithms. And I think that adding algorithms to language models is a clear path to AGI, as I've written about in Capabilities and alignment of LLM cognitive architectures. So your point stands. I think it might be clarified by going into more depth on how NNs might be adapted to algorithms.

AGI-Automated Interpretability is Suicide

__RicG__2y10

Disclaimer: These are all hard questions and points that I don't know their true answers, these are just my views, what I have understood up to now. I haven't studied the expected utility maximisers exactly because I don't expect the abstraction to be useful for the kind of AGI we are going to be making.

There's a huge gulf between agentic systems and "zombie-agentic" systems (that act like agents with goals, but have no explicit internal representation of those goals)

I feel the same, but I would say that it's the “real-agentic” system (or a close approxima... (read more)

AGI-Automated Interpretability is Suicide

__RicG__2y10

What I meant to articulate was: the utility function and expected utility maximiser is a great framework to think about intelligent agents, but it's a theory put on top of the system, it doesn't need to be internal. In fact that system is incomputable (you need an hypercomputer to make the right decision).

AGI-Automated Interpretability is Suicide

__RicG__2y0-20

I feel the exact opposite! Creating something that seems to maximise something without having a clear idea of what its goal is really natural IMO. You said it yourself, GPT ""wants"" to predict the correct probability distribution of the next token, but there is probably not a thing inside actively maximising for that, instead it's very likely to be a bunch of weird heuristics that were selected by the training method because they work.

If you instead meant that GPT is "just an algorithm" I feel we disagree here as I am pretty sure that I am just an a... (read more)

8red75prime2y

No, I said that GPT does predict next token, while probably not containing anything that can be interpreted as "I want to predict next token". Like a bacterium does divide (with possible adaptive mutations), while not containing "be fruitful and multiply" written somewhere inside. No, I certainly didn't mean that. If the extended Church--Turing thesis holds for macroscopic behavior of our bodies, we can indeed be represented as Turing-machine algorithms (with polynomial multiplier on efficiency). What I feel, but can't precisely convey, is that there's a huge gulf (in computational complexity maybe) between agentic systems (that do have explicit internal representation of, at least, some of their goals) and "zombie-agentic" systems (that act like agents with goals, but have no explicit internal representation of those goals). How do you define the goal (or utility function) of an agent? Is it something that actually happens when universe containing the agent evolves in its usual physical fashion? Or is it something that was somehow intended to happen when the agent is run (but may not actually happen due to circumstances and agent's shortcomings)?

1__RicG__2y

What I meant to articulate was: the utility function and expected utility maximiser is a great framework to think about intelligent agents, but it's a theory put on top of the system, it doesn't need to be internal. In fact that system is incomputable (you need an hypercomputer to make the right decision).

AGI-Automated Interpretability is Suicide

__RicG__2y10

You are basically discussing these two assumptions I made (under "Algorithmic foom (k>1) is possible"), right?

The intelligence ceiling is much higher than what we can achieve with just DL
The ceiling of hard-coded intelligence that runs on near-future hardware isn’t particularly limited by the hardware itself: algorithms interpreted from matrix multiplications are efficient enough on available hardware. This is maybe my shakiest hypothesis: matrix multiplication in GPUs is actually pretty damn well optimized
Algorithms are easier to reason about than star

__RicG__2y10

I hope we can prevent the AGI to just train a twin (or just copy itself and call that a twin) and study that. In my scenario I took as a given that we do have the AGI under some level control:

If no alignment scheme is in place, this type of foom is probably a problem we would be too dead to worry about.

I guess when I say "No lab should be allowed to have the AI reflect on itself" I do not mean only the running copy of the AGI, but just at any copy of the AGI.

AGI-Automated Interpretability is Suicide

__RicG__2y30

Wouldn't it at least solve corrigibility by making it possible to detect formation of undesirable end-goals? I think even GPT-4 can classify textual interpretation of an end-goal on a basis of its general desirability for humans.

I really don't expect "goals" to be explicitly written down in the network. There will very likely not be a thing that says "I want to predict the next token" or "I want to make paperclips" or even a utility function of that. My mental image of goals is that they are put "on top" of the model/mind/agent/person. Whatever they seem t... (read more)

red75prime2y1211

I really don't expect "goals" to be explicitly written down in the network. There will very likely not be a thing that says "I want to predict the next token" or "I want to make paperclips" or even a utility function of that. My mental image of goals is that they are put "on top" of the model/mind/agent/person. Whatever they seem to pursue, independently of their explicit reasoning.

I'm sure that I don't understand you. GPT most likely doesn't have "I want to predict next token" written somewhere, because it doesn't want to predict next token. There's nothi... (read more)

AGI-Automated Interpretability is Suicide

__RicG__2y41

I do feel just having humans in the loop is not be a complete solution, though. Even if humans look at the process, algorithmic foom could be really really fast. Especially if it is purposely being used to augment the AGI abilities.

Without a strong reason to believe our alignment scheme will be strong enough to support the ability gain (or that the AGI won't recklessly arbitrarily improve itself), I would avoid letting the AGI look at itself al together. Just make it illegal for AGI labs to use AGIs to look at themselves. Just don't do it.

Not today. But pr... (read more)

AGI-Automated Interpretability is Suicide

__RicG__2y10

Cheers. You comments actually allowed me to fully realize where the danger lies and expand a little on the consequences.
Thanks again for the feedback

6the gears to ascension2y

Yeah I do want to add - this particular paper I actually agree with yudkowsky is probably a small reduction in P(doom), because it successfully focuses a risky operation in a way that moves towards humans being able to check a system. The dangerous thing would be to be hands off; the more you actually in fact use interpretability to put humans in the loop, the more you get the intended benefits of interpretability. If you remove humans from the loop, you remove your influence on the system, and the system rockets ahead of you and blows up the world, or if not the world, at least your lab.

All of __RicG__'s Comments + Replies

All of RicG's Comments + Replies