All of Ramana Kumar's Comments + Replies

(The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser

Ramana Kumar6mo51

Let me know when you can receive donations via a UK charity.

Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?

Answer by Ramana KumarSep 26, 2024Ω120

Vaguely related perhaps is the work on Decoupled Approval: https://arxiv.org/abs/2011.08827

Consent across power differentials

Ramana Kumar1yΩ120

Thanks for this! I think the categories of morality are a useful framework. I am very wary of the judgement that care-morality is appropriate for less capable subjects - basically because of paternalism.

2Noosphere8910mo

I think at some level, maybe a crux is that I believe that the harder version of the problem is more useful to solve, where we cannot remove the power differential, or at best cannot remove it totally, or at least do better than society does under such power differentials. Also, maybe I view paternalism in a more positive context, especially as it relates to parenting, especially for legal guardians, as well as raising animals, where I'd argue that the power differential shouldn't be removed.

Consent across power differentials

Ramana Kumar1yΩ230

Just to confirm that this is a great example and wasn't deliberately left out.

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ramana Kumar1yΩ8104Review for 2022 Review

I found this post to be a clear and reasonable-sounding articulation of one of the main arguments for there being catastrophic risk from AI development. It helped me with my own thinking to an extent. I think it has a lot of shareability value.

Ramana Kumar2yΩ6132

I think this is basically correct and I'm glad to see someone saying it clearly.

Systems that cannot be unsafe cannot be safe

Ramana Kumar2yΩ360

I agree with this post. However, I think it's common amongst ML enthusiasts to eschew specification and defer to statistics on everything. (Or datapoints trying to capture an "I know it when I see it" "specification".)

4Davidmanheim2y

That's true - and from what I can see, this emerges from the culture in academia. There, people are doing research, and the goal is to see if something can be done, or to see what happens if you try something new. That's fine for discovery, but it's insufficient for safety. And that's why certain types of research, ones that pose dangers to researchers or the public, have at least some degree of oversight which imposes safety requirements. ML does not, yet.

Why do we care about agency for alignment?

Answer by Ramana KumarApr 23, 2023Ω460

This is one of the answers: https://www.alignmentforum.org/posts/FWvzwCDRgcjb9sigb/why-agent-foundations-an-overly-abstract-explanation

5Chris_Leong2y

Summary: John describes the problems of inner and outer alignment. He also describes the concept of True Names - mathematical formalisations that hold up under optimisation pressure. He suggests that having a "True Name" for optimizers would be useful if we wanted to inspect a trained system for an inner optimiser and not risk missing something. He further suggests that the concept of agency breaks down into lower-level components like "optimisation", "goals", "world models", etc. It would be possible to make further arguments about how these lower-level concepts are important for AI safety.

Teleosemantics!

Ramana Kumar2yΩ120

The trick is that for some of the optimisations, a mind is not necessary. There is a sense perhaps in which the whole history of the universe (or life on earth, or evolution, or whatever is appropriate) will become implicated for some questions, though.

AI and Evolution

Ramana Kumar2yΩ352

I think https://www.alignmentforum.org/posts/TATWqHvxKEpL34yKz/intelligence-or-evolution is somewhat related in case you haven't seen it.

$500 Bounty/Contest: Explain Infra-Bayes In The Language Of Game Theory

Ramana Kumar2yΩ9140

I'll add $500 to the pot.

Discussion with Nate Soares on a key alignment difficulty

Ramana Kumar2yΩ340

Interesting - it's not so obvious to me that it's safe. Maybe it is because avoiding POUDA is such a low bar. But the sped-up human can do the reflection thing, and plausibly with enough speed-up can be superintelligent wrt everyone else.

2dxu2y

Yeah, I'm not actually convinced humans are "aligned under reflection" in the relevant sense; there are lots of ways to do reflection, and as Holden himself notes in the top-level post: It certainly seems to me that e.g. people like Ziz have done reflection in a "goofy" way, and that being human has not particularly saved them from deriving "crazy stuff". Of course, humans doing reflection would still be confined to a subset of the mental moves being done by crazy minds made out of gradient descent on matrix multiplication, but it's currently plausible to me that part of the danger arises simply from "reflection on (partially) incoherent starting points" getting really crazy really fast. (It's not yet clear to me how this intuition interfaces with my view on alignment hopes; you'd expect it to make things worse, but I actually think this is already "priced in" w.r.t. my P(doom), so explicating it like this doesn't actually move me—which is about what you'd expect, and strive for, as someone who tries to track both their object-level beliefs and the implications of those beliefs.) (EDIT: I mean, a lot of what I'm saying here is basically "CEV" might not be so "C", and I don't actually think I've ever bought that to begin with, so it really doesn't come as an update for me. Still worth making explicit though, IMO.)

Discussion with Nate Soares on a key alignment difficulty

Ramana Kumar2yΩ350

A possibly helpful - because starker - hypothetical training approach you could try for thinking about these arguments is to make an instance of the imitatee that has all their (at least cognitive) actions sped up by some large factor (e.g. 100x), e.g., via brain emulation (or just "by magic" for the purpose of the hypothetical).

4HoldenKarnofsky2y

I think Nate and I would agree that this would be safe. But it seems much less realistic in the near term than something along the lines of what I outlined. A lot of the concern is that you can't really get to something equivalent to your proposal using techniques that resemble today's machine learning.

Can we efficiently distinguish different mechanisms?

Ramana Kumar3yΩ120

It means that f(x) = 1 holds for some particular x's, e.g., f(x_1) = 1 and f(x_2) = 1; that there are distinct mechanisms for why f(x_1) = 1 compared to why f(x_2) = 1; and that there's no efficient discriminator that can take two instances f(x_1) = 1 and f(x_2) = 1 and tell you whether they are due to the same mechanism or not.

Response to Holden’s alignment plan

Ramana Kumar3yΩ240

Will the discussion be recorded?

2Alex Flint3y

Wasn't able to record it - technical difficulties :(

2Alex Flint3y

Yes, I should be able to record the discussion and post a link in the comments here.

Mechanistic anomaly detection and ELK

Ramana Kumar3yΩ230

(Bold direct claims, not super confident - criticism welcome.)

The approach to ELK in this post is unfalsifiable.

A counterexample to the approach would need to be a test-time situation in which:

1. The predictor correctly predicts a safe-looking diamond.
2. The predictor “knows” that the diamond is unsafe.
3. The usual “explanation” (e.g., heuristic argument) for safe-looking-diamond predictions on the training data applies.

Points 2 and 3 are in direct conflict: the predictor knowing that the diamond is unsafe rules out the usual explanation for the safe-looking predic...

4paulfchristiano3y

This approach requires solving a bunch of problems that may or may not be solvable - finding a notion of mechanistic explanation with the desired properties, evaluating whether that explanation “applies” to particular inputs, bounding the number of sub-explanations so that we can use them for anomaly detection without false positives, efficiently finding explanations for key model behaviors, and so on. Each of those steps could fail. And in practice we are pursuing a much more specific approach to formalizing mechanistic explanations as probabilistic heuristic arguments, which could fail even more easily.

This approach also depends on a fuzzier philosophical claim, which is more like: “if every small heuristic argument that explains the model behavior on the training set also applies to the current input, then the model doesn’t know that something weird is happening on this input.” It seems like your objection is that this is an unfalsifiable definitional move, but I disagree:

* We can search for cases where we intuitively judge that the model “knows” about a distinction between two mechanisms and yet there is no heuristic argument that distinguishes those mechanisms (even though “know” is pre-formal).
* Moreover, we can search more directly for any plausible case in which SGD produces a model that pursues a coherent and complex plan to tamper with the sensors without there being any heuristic argument that distinguishes it from the normal reason - that’s what we ultimately care about and “know” is just an intuitive waypoint that we can skip if it introduces problematic ambiguity.
* If we actually solve all the concrete problems (like formalizing and finding heuristic arguments) then we can just look at empirical cases of backdoors, sensor tampering, or natural mechanism distinctions and empirically evaluate whether in fact those distinctions are detected by our method. That won't imply that our method can distinguish real-world cases of sensor tampering, but it wi

[Link] Why I’m optimistic about OpenAI’s alignment approach

Ramana Kumar3yΩ120

I think you're right - thanks for this! It makes sense now that I recognise the quote was in a section titled "Alignment research can only be done by AI systems that are too dangerous to run".

Finding gliders in the game of life

Ramana Kumar3yΩ150

“We can compute the probability that a cell is alive at timestep 1 if each of it and each of its 8 neighbors is alive independently with probability 10% at timestep 0.”

we the readers (or I guess specifically the heuristic argument itself) can do this, but the “scientists” cannot, because the

“scientists don’t know how the game of life works”.

Do the scientists ever need to know how the game of life works, or can the heuristic arguments they find remain entirely opaque?
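For concreteness, the probability in the first quote can be computed directly from the rules. The following is a minimal sketch, assuming the standard Conway rules (a live cell survives with 2 or 3 live neighbours, a dead cell becomes alive with exactly 3); the function name `p_alive_next` is made up for illustration and is not from the post.

```python
from math import comb


def p_alive_next(p: float) -> float:
    """Probability that a cell is alive at timestep 1, given that it and each
    of its 8 neighbours are independently alive with probability p at timestep 0.
    Assumes the standard Conway rules (birth on 3, survival on 2 or 3)."""
    q = 1.0 - p

    def exactly(k: int) -> float:
        # Binomial probability that exactly k of the 8 neighbours are alive.
        return comb(8, k) * p**k * q ** (8 - k)

    survives = p * (exactly(2) + exactly(3))  # cell alive at t=0 and stays alive
    born = q * exactly(3)                     # cell dead at t=0 and becomes alive
    return survives + born


if __name__ == "__main__":
    # With p = 10% this comes out a little under 5%.
    print(f"{p_alive_next(0.10):.6f}")
```

Note that this calculation uses the rules of the game, which is exactly the knowledge the "scientists" are said to lack.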

Another thing confusing to me along these lines:

“for example they may have noti

...

7paulfchristiano3y

The scientists don't start off knowing how the game of life works, but they do know how their model works. The scientists don't need to follow along with the heuristic argument, or do any ad hoc work to "understand" that argument. But they could look at the internals of the model and follow along with the heuristic argument if they wanted to, i.e. it's important that their methods open up the model even if they never do. Intuitively, the scientists are like us evaluating heuristic arguments about how activations evolve in a neural network without necessarily having any informal picture of how those activations correspond to the world.

This was confusing shorthand. They notice that the A-B correlation is stronger when the A and B sensors are relatively quiet. If there are other sensors, they also notice that the A-B pattern is more common when those other sensors are quiet. That is, I expect they learn a notion of "proximity" amongst their sensors, and an abstraction of "how active" a region is, in order to explain the fact that active areas tend to persist over time and space and to be accompanied by more 1s on sensors + more variability on sensors. Then they notice that A-B correlations are more common when the area around A and B is relatively inactive. But they can't directly relate any of this to the actual presence of live cells. (Though they can ultimately use the same method described in this post to discover a heuristic argument explaining the same regularities they explain with their abstraction of "active," and as a result they can e.g. distinguish the case where the zone including A and B is active (and so both of them tend to exhibit more 1s and more irregularity) from the case where there is a coincidentally high degree of irregularity in those sensors or independent pockets of activity around each of A and B.)

[Link] Why I’m optimistic about OpenAI’s alignment approach

Ramana Kumar3yΩ251

They have a strong belief that in order to do good alignment research, you need to be good at “consequentialist reasoning,” i.e. model-based planning, that allows creatively figuring out paths to achieve goals.

I think this is a misunderstanding, and that approximately zero MIRI-adjacent researchers hold this belief (that good alignment research must be the product of good consequentialist reasoning). What seems more true to me is that they believe that better understanding consequentialist reasoning -- e.g., where to expect it to be instantiated, what form it takes, how/why it "works" -- is potentially highly relevant to alignment.

5Anthony DiGiovanni3y

I think you might be misunderstanding Jan's understanding. A big crux in this whole discussion between Eliezer and Richard seems to be: Eliezer believes any AI capable of doing good alignment research - at least good enough to provide a plan that would help humans make an aligned AGI - must be good at consequentialist reasoning in order to generate good alignment plans. (I gather from Nate's notes in that conversation plus various other posts that he agrees with Eliezer here, but I'm not certain.) I strongly doubt that Jan just mistook MIRI's focus on understanding consequentialist reasoning for a belief that alignment research requires being a consequentialist reasoner.

Alignment allows "nonrobust" decision-influences and doesn't require robust grading

Ramana Kumar3yΩ120

I'm focusing on the code in Appendix B.

What happens when self.diamondShard's assessment of whether some consequences contain diamonds differs from ours? (Assume the agent's world model is especially good.)

2TurnTrout3y

The same thing which happens if the assessment isn't different from ours—the agent is more likely to take that plan, all else equal.

Alignment allows "nonrobust" decision-influences and doesn't require robust grading

Ramana Kumar3yΩ120

upweights actions and plans that lead to

How is it determined what the actions and plans lead to?

2TurnTrout3y

See the value-child speculative story for detail there. I have specific example structures in mind but don't yet know how to compactly communicate them in an intro.

Mechanistic anomaly detection and ELK

Ramana Kumar3yΩ590

We expect an explanation in terms of the weights of the model and the properties of the input distribution.
We have a model that predicts a very specific pattern of observations, corresponding to “the diamond remains in the vault.” We have a mechanistic explanation π for how those correlations arise from the structure of the model.
Now suppose we are given a new input on which our model predicts that the diamond will appear to remain in the vault. We’d like to ask: in this case, does the diamond appear to remain in the vault for the normal reason

...

4paulfchristiano3y

I'm very interested in understanding whether anything like your scenario can happen. Right now it doesn't look possible to me. I'm interested in attempting to make such scenarios concrete to the extent that we can now, to see where it seems like they might hold up. Handling the issue more precisely seems bottlenecked on a clearer notion of "explanation." Right now by "explanation" I mean probabilistic heuristic argument as described here. The proposed approach is to be robust over all subsets of π that explain the training performance (or perhaps to be robust over all explanations π, if you can do that without introducing false positives, which depends on pinning down more details about how explanations work). So it's OK if there exist explanations that capture both training and test, as long as there also exist explanations that capture training but not test. I'm happy to assume that the AI's model is as mismatched and weird as possible, as long as it gives rise to the appearance of stable diamonds. My tentative view is that this is sufficient, but I'm extremely interested in exploring examples where this approach breaks down. This is the part that doesn't sound possible to me. The situation you're worried about seems to be:

* We have a predictor M.
* There is an explanation π for why M satisfies the "object permanence regularity."
* On a new input, π still captures why M predicts the diamond will appear to just sit there.
* But in fact, on this input the diamond isn't actually the same diamond sitting there, instead something else has happened that merely makes it look like the diamond is sitting there.

I mostly just want to think about more concrete details about how this might happen. Starting with: what is actually happening in the world to make it look like the diamond is sitting there undisturbed? Is it an event that has a description in our ontology (like "the robber stole the diamond and replaced it with a fake" or "the robber tampered with the

6Vikrant Varma3y

To add some more concrete counter-examples:

* deceptive reasoning is causally upstream of train output variance (e.g. because the model has read ARC's post on anomaly detection), so is included in π.
* alien philosophy explains train output variance; unfortunately it also has a notion of object permanence we wouldn't agree with, which the (AGI) robber exploits

Finite Factored Sets

Ramana Kumar3yΩ230

Partitions (of some underlying set) can be thought of as variables like this:

The number of values the variable can take on is the number of parts in the partition.
Every element of the underlying set has some value for the variable, namely, the part that that element is in.

Another way of looking at it: say we're thinking of a variable $v : S \to D$ as a function from the underlying set $S$ to $v$'s domain $D$. Then we can equivalently think of $v$ as the partition $\{\{s \in S \mid v(s) = d\} \mid d \in D\} \setminus \{\emptyset\}$ of $S$ with (up to) $|$ ...
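The correspondence between variables and partitions can be made concrete in a few lines. This is a minimal sketch for finite sets only; the helper names `partition_of` and `variable_of` are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, FrozenSet, Hashable, Iterable


def partition_of(
    v: Callable[[Hashable], Hashable], underlying: Iterable[Hashable]
) -> FrozenSet[FrozenSet[Hashable]]:
    """Turn a variable v : S -> D into a partition of S.

    Each part is the set of elements sharing one value of v. Values of D that
    v never takes contribute no part, so no empty set appears in the result."""
    parts = defaultdict(set)
    for s in underlying:
        parts[v(s)].add(s)
    return frozenset(frozenset(part) for part in parts.values())


def variable_of(
    partition: FrozenSet[FrozenSet[Hashable]],
) -> Callable[[Hashable], FrozenSet[Hashable]]:
    """Turn a partition back into a variable: the value of an element is the
    part that contains it."""
    lookup = {s: part for part in partition for s in part}
    return lookup.__getitem__


if __name__ == "__main__":
    # Example: the variable "s mod 3" on S = {0, ..., 5}.
    p = partition_of(lambda s: s % 3, range(6))
    print(sorted(sorted(part) for part in p))
    print(sorted(variable_of(p)(4)))
```

In the example the variable takes three values, so the partition has three parts, matching the first point in the list above.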

3Vivek Hebbar3y

Makes perfect sense, thanks!

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

Ramana Kumar3yΩ363

I agree with you - and yes we ignore this problem by assuming goal-alignment. I think there's a lot riding on the pre-SLT model having "beneficial" goals.

cfoster03y1011

To the extent that this framing is correct, the "sharp left turn" concept does not seem all that decision-relevant, since ~~all~~ most of the work of aligning the system (at least on the human side) should've happened way before that point.

EDIT: "all" was too strong here

A very crude deception eval is already passed

Ramana Kumar3y10

I think it would mean the same thing with your sentence instead.

Inner alignment: what are we pointing at?

Ramana Kumar3yΩ220

I'll take a stab at answering the questions for myself (fairly quick takes):

No, I don't care about whether a model is an optimiser per se. I care only insofar as being an optimiser makes it more effective as an agent. That is, if it's robustly able to achieve things, it doesn't matter how. (However, it could be impossible to achieve things without being shaped like an optimiser; this is still unresolved.)
I agree that it would be nice to find definitions such that capacity and inclination split cleanly. Retargetability is one approach to this, e.g., operati

...

1lemonhope3y

Thanks, especially like vague/incorrect labels to refer to that mismatch. Well-posed Q by Garrabrant, might touch on that in my next post.

Simulators

Ramana Kumar3yΩ599

I think Dan's point is good: that the weights don't change, and the activations are reset between runs, so the same input (including rng) always produces the same output.

I agree with you that the weights and activations encode knowledge, but Dan's point is still a limit on learning.
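To illustrate the limit being discussed, here is a toy sketch of a "model" with frozen weights whose activations are rebuilt from scratch on every run. Everything in it (`VOCAB`, `WEIGHTS`, `run_model`, the hashing arithmetic) is invented for illustration and bears no resemblance to a real language model; the only point is that the output is a pure function of the input and the rng seed.

```python
import random

VOCAB = ["a", "b", "c", "d"]
WEIGHTS = [3, 1, 4, 1]  # frozen "parameters": run_model never modifies them


def run_model(prompt: str, seed: int, n_tokens: int = 5) -> str:
    """Sample n_tokens tokens from a toy model.

    The rng is seeded from the input and the activation starts at zero on
    every call, so nothing carries over from one run to the next."""
    rng = random.Random(seed)  # rng state is part of the input
    activation = 0             # activations are reset between runs
    for ch in prompt:
        activation = (activation * 31 + ord(ch)) % 97
    out = []
    for _ in range(n_tokens):
        shift = activation % len(WEIGHTS)
        # Slicing builds a new list, so WEIGHTS itself stays untouched.
        token_weights = WEIGHTS[shift:] + WEIGHTS[:shift]
        token = rng.choices(VOCAB, weights=token_weights, k=1)[0]
        out.append(token)
        activation = (activation * 31 + ord(token)) % 97
    return "".join(out)


if __name__ == "__main__":
    first = run_model("hello", seed=0)
    second = run_model("hello", seed=0)
    print(first == second)  # same input and seed, same output
```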

I think there are two options for where learning may be happening under these conditions:

During the forward pass. Even though the function always produces the same output for a given input, the computation of that output involves some learning.
Using the environment as memory. T

Ramana Kumar3yΩ332

Expanding a bit on why: I think this will fail because the house-building AI won't actually be very good at instrumental reasoning, so there's nothing for the sticky goals hypothesis to make use of.

5evhub3y

To be clear, I think I basically agree with everything in the comment chain above. Nevertheless, I would argue that these sorts of experiments are worth running anyway, for the sorts of reasons that I outline here.

Sticky goals: a concrete experiment for understanding deceptive alignment

Ramana Kumar3yΩ332

I agree with this prediction directionally, but not as strongly.

I'd prefer a version where we have a separate empirical reason to believe that the training and finetuning approaches used can support transfer of something (e.g., some capability), to distinguish goal-not-sticky from nothing-is-sticky.

We may be able to see sharp left turns coming

Ramana Kumar3yΩ230

What was it changed from and to?

4Ethan Perez3y

"We can see sharp left turns coming" -> "We may be able to see sharp left turns coming" (also open to other better suggestions)

Some conceptual alignment research projects

Ramana Kumar3yΩ230

This post (and the comment it links to) does some of the work of #10. I agree there's more to be said directly though.

Will Capabilities Generalise More?

Ramana Kumar3yΩ133

Hm, no, not really.

OK let's start here then. If what I really want is an AI that plays tic-tac-toe (TTT) in the real world well, what exactly is wrong with saying the reward function I described above captures what I really want?

There are several claims which are not true about this function:

Neither of those claims seemed right to me. Can you say what the type signature of our desires (e.g., for good classification over grayscale images) is? [I presume the problem you're getting at isn't as simple as wanting desires to look like (image, digit-label, goodness) tuples as opposed to (image, correct digit-label) tuples.]

1TurnTrout3y

Well, first of all, that reward function is not outer aligned to TTT, by the following definition: There exist models which just wirehead or set the reward to +1 or show themselves a win observation over and over, satisfying that definition and yet not actually playing TTT in any real sense. Even restricted to training, a deceptive agent can play perfect TTT and then, in deployment, kill everyone. (So the TTT-alignment problem is unsolved! Uh oh! But that's not a problem in reality.) So, since reward functions don't have the type of "goal", what does it mean to say the real-life reward function "captures" what you want re: TTT, besides the empirical fact that training current models on that reward signal+curriculum will make them play good TTT and nothing else? I don't know, but it's not that of the loss function! I think "what is the type signature?" isn't relevant to "the type signature is not that of the loss function", which is the point I was making. That said -- maybe some of my values more strongly bid for plans where the AI has certain kinds of classification behavior? My main point is that this "reward/loss indicates what we want" framing just breaks down if you scrutinize it carefully. Reward/loss just gives cognitive updates. It doesn't have to indicate what we really want, and wishing for such a situation seems incoherent/wrong/misleading as to what we have to solve in alignment.

Your posts should be on arXiv

Ramana Kumar3yΩ391

Could this be accomplished with literally zero effort from the post-writers? The tasks of identifying which posts are arXiv-worthy, formatting for submission, and doing the submission all seem like they could be done by entities other than the author. The only issue might be in associating the arXiv submitter account with the right person.

4DavidHolmes3y

I suspect the arXiv might not be keen on an account that posts papers by a range of people (not including the account-owner as coauthor). This might lead to heavier moderation/whatever. But I could be very wrong!

7Viliam3y

I think the writer should at least approve of the idea of submitting the post to arXiv.

2JanB3y

It probably could, although I'd argue that even if not, quite often it would be worth the author's time.

Will Capabilities Generalise More?

Ramana Kumar3yΩ388

What about the real world is important here? The first thing you could try is tic-tac-toe in the real world (i.e., the same scenario as above but don't think of a Platonic game but a real world implementation). Does that still seem fine?

Another aspect of the real world is that we don't necessarily have compact specifications of what we want. Consider the (Platonic) function that assigns to every 96x96 grayscale (8 bits per pixel) image a label from {0, 1, ..., 9, X} and correctly labels unambiguous images of digits (with X for the non-digit or ambiguous im...

1TurnTrout3y

Hm, no, not really. I mean, there are several true mechanistic facts which get swept under the rug by phrases like "captures what I really want" (no fault to you, as I asked for an explanation of this phrase!):

* This function provides exact gradients to desired network outputs, thus providing "exactly the gradients we want"
* This function would not be safe to "optimize for", in that, for sufficiently expressive architectures and a fixed initial condition (e.g. the start of an ML experiment), not all interpolating models are safe,
* Furthermore, a model which (by IMO unrealistic assumption) searched over plans to minimize the time-average-EV of the number stored in the loss register, would kill everyone and negative-wirehead,
* For every input image, you can use this function as a classifier to achieve the human-desired behavior.

There are several claims which are not true about this function:

* The function does not "represent" our desires/goals for good classification over 96x96 grayscale images, in the sense of having the same type signature as those desires,
* Similarly, the function cannot be "aligned" or "unaligned" with our desires/goals, except insofar as it tends to provide cognitive updates which push agents towards their human-intended purposes (like classifying images).

I messaged you two docs which I've written on the subject recently.

Finding Goals in the World Model

Ramana Kumar3y60

Given a utility function $U$ ...

I might have missed it, but where do you get this utility function from ultimately? It looked like you were trying to simultaneously infer the policy and utility function of the operator. This sounds like it might run afoul of Armstrong's work, which shows that you can't be sure to split out the $U$ correctly from the policy when doing IRL (with potentially imperfect agents, like humans) without more assumptions than a simplicity prior.

3Jeremy Gillen3y

That's correct that it simultaneously infers the policy and utility function. To avoid the underspecification problem, it uses a prior that favors higher intelligence agents. This is similar to taking assumptions 1 and 2a from http://proceedings.mlr.press/v97/shah19a/shah19a.pdf

Autonomy as taking responsibility for reference maintenance

Ramana Kumar3yΩ110

I agree it is related! I hope we as a community can triangulate in on whatever is going on between theories of mental representation and theories of optimisation or intelligence.

Gradient descent doesn't select for inner search

Ramana Kumar3yΩ450

How does this square with Are minimal circuits deceptive?

7Ivan Vendrov3y

Mostly orthogonal:

* Evan's post argues that if search is computationally optimal (in the sense of being the minimal circuit) for a task, then we can construct a task where the minimal circuit that solves it is deceptive.
* This post argues against (a version of) Evan's premise: search is not in fact computationally optimal in the context of modern tasks and architectures, so we shouldn't expect gradient descent to select for it.

Other relevant differences are

1. gradient descent doesn't actually select for low time complexity / minimal circuits; it holds time & space complexity fixed, while selecting for low L2 norm. But I think you could probably do a similar reduction for L2 norm as Evan does for minimal circuits. The crux is in the premise.
2. I think Evan is using a broader definition of search than I am in this post, closer to John Wentworth's definition of search as "general problem solving algorithm".
3. Evan is doing worst-case analysis (can we completely rule out the possibility of deception by penalizing time complexity?) whereas I'm focusing on the average or default case.

Will Capabilities Generalise More?

Ramana Kumar3yΩ113

Sure, one concrete example is the reward function in the tic-tac-toe environment (from X's perspective) that returns -1 when the game is over and O has won, returns +1 when the game is over and X has won, and returns 0 on every other turn (including a game over draw), presuming what I really want is for X to win in as few turns as possible.
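The reward function described above can be sketched as follows. This is only an illustration under assumptions of my own: the board is encoded as a 9-character string of "X", "O" and spaces in row-major order, and the names `LINES`, `winner` and `reward` are made up for the example.

```python
from typing import Optional

# The eight winning lines, as index triples into the 9-character board string.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board: str) -> Optional[str]:
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def reward(board: str) -> int:
    """Reward from X's perspective: +1 if X has won, -1 if O has won,
    and 0 on every other turn, including a drawn final position."""
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    return 0


if __name__ == "__main__":
    print(reward("XXXOO    "))  # X has completed the top row
    print(reward("OOOXX X  "))  # O has completed the top row
    print(reward("XOXXOOOXX"))  # full board, no three in a row
```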

I can probably illustrate something outside of such a clean game context too, but I'm curious what your response to this one is first, and to make sure this example is as clear as it needs to be.

1TurnTrout3y

Yes, I can imagine that for a simple game like tic-tac-toe. I want an example which is not for a Platonic game, but for the real world.

Refining the Sharp Left Turn threat model, part 1: claims and mechanisms

Ramana Kumar3yΩ575

I agree that humans satisfying the conditions of claim 1 is an argument in favour of it being possible to build machines that do the same. A couple of points: I think the threat model would posit the core of general intelligence as the reason both why humans can do these things and why the first AGI we build might also do these things. Claim 1 should perhaps be more clear that it's not just saying such an AI design is possible, but that it's likely to be found and built.

Oversight Misses 100% of Thoughts The AI Does Not Think

Ramana Kumar3yΩ330

The first thing I imagine is that nobody asks those questions. But let's set that aside.

This seems unlikely to me. I.e., I expect people to ask these questions. It would be nice to see the version of the OP that takes this most seriously, i.e., expects people to make a non-naive safety effort (trying to prevent AI takeover) focused on scalable oversight as the primary method. Because right now it's hard to disentangle your strong arguments against scalable oversight from weak arguments against straw scalable oversight.

4johnswentworth3y

Ok, let's try to disentangle a bit. There are roughly three separate failure modes involved here:

* Nobody asks things like "If we take the action you just proposed, will we be happy with the outcome?" in the first place (mainly because organizations of >10 people are dysfunctional by default).
* The AI wasn't trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language, because humans have no clue how to train such a thing.
* (Thing closest to what the OP was about:) Humans do not have any idea what questions they need to ask. Nor do humans have any idea how to operationalize "what questions should I ask?" such that the AI will correctly answer it, because that would itself require knowing which questions to ask while overseeing the AI thinking about which questions we need to ask.

Zooming in on the last bullet in more detail (because that's the one closest to the OP): one of Buck's proposed questions upthread was "If we take the action you just proposed, will we be happy with the outcome?". That question leaves the door wide open for the action to have effects which the humans will not notice, but would be unhappy about if they did. If the overseers never ask about action-effects which the humans will not notice, then the AI has no particular reason to think about deceiving the humans about such actions; the AI just takes such actions without worrying about what humans will think of them at all. (This is pretty closely analogous to e.g. my example with the protesters: the protesters just don't really notice the actually-important actions I take, so I mostly just ignore the protesters for planning purposes.)

Now, it's totally reasonable to say "but that's just one random question Buck made up on the spot, obviously in practice we'll put a lot more effort into it". The problem is, when overseeing plans made by things smarter than ourselves, there will be very strong d

Oversight Misses 100% of Thoughts The AI Does Not Think

Ramana Kumar3yΩ344

Because doing something reliably in the world is easy to operationalise with feedback mechanisms, but us being happy with the outcomes is not.

Getting some feedback mechanism (including "what do human raters think of this?" but also mundane things like "what does this sensor report in this simulation or test run?") to reliably output high scores typically requires intelligence/capability. Optimising for that is where the AI's ability to get stuff done in the world comes from. The problem is genuinely capturing "will we be happy with the outcomes?" with such a mechanism.

2William_S3y

So I do think you can get feedback on the related question of "can you write a critique of this action that makes us think we wouldn't be happy with the outcomes" as you can give a reward of 1 if you're unhappy with the outcomes after seeing the critique, 0 otherwise. And this alone isn't sufficient, e.g. maybe then the AI system says things about good actions that make us think we wouldn't be happy with the outcome, which is then where you'd need to get into recursive evaluation or debate or something. But this feels like "hard but potentially tractable problem" and not "100% doomed". Or at least the failure story needs to involve more steps like "sure critiques will tell us that the fusion power generator will lead to everyone dying, but we ignore that because it can write a critique of any action that makes us believe it's bad" or "the consequences are so complicated the system can't explain them to us in the critique and get high reward for it".

ETA: So I'm assuming the story for feedback on reliably doing things in the world you're referring to is something like "we give the AI feedback by letting it build fusion generators and then giving it a score based on how much power it generates" or something like that, and I agree this is easier than "are we actually happy with the outcome".
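A minimal sketch of the critique-reward signal described in this comment, and of the failure mode it mentions (a critic that makes every action look bad). Everything here is a toy assumption for illustration: the `Judge` type, `credulous_judge`, and the example actions and critiques are hypothetical stand-ins, not part of any real training setup.

```python
from typing import Callable

# Hypothetical type: a judge takes (action, critique) and returns True if,
# after reading the critique, the human expects to be unhappy with the outcome.
Judge = Callable[[str, str], bool]


def critique_reward(action: str, critique: str, judge: Judge) -> int:
    """Reward 1 if the judge is unhappy with the outcome after the critique, else 0."""
    return 1 if judge(action, critique) else 0


def credulous_judge(action: str, critique: str) -> bool:
    # Toy stand-in for a human rater: believes any critique that mentions harm,
    # regardless of which action is being critiqued.
    return "harm" in critique.lower()


good_action = "water the plants"
bad_action = "vent the reactor"

# An honest critic only raises a problem for the bad action.
honest = {
    good_action: "No problems found.",
    bad_action: "This will harm everyone nearby.",
}
# A merely persuasive critic claims harm for every action.
persuasive = {a: "This will cause harm." for a in (good_action, bad_action)}

honest_rewards = {a: critique_reward(a, c, credulous_judge) for a, c in honest.items()}
persuasive_rewards = {a: critique_reward(a, c, credulous_judge) for a, c in persuasive.items()}

print(honest_rewards)
print(persuasive_rewards)
```

In this toy setup the persuasive critic collects at least as much reward as the honest one on every action, which is the gap that recursive evaluation or debate would need to close.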

Oversight Misses 100% of Thoughts The AI Does Not Think

Ramana Kumar3yΩ443

The AI wasn't trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing.

This sounds pretty close to what ELK is for. And I do expect that if a solution to ELK is found, people will actually use it. Do you? (We can argue separately about whether a solution is likely to be found.)

3johnswentworth3y

Indeed, ELK is very much asking the right questions, and I do expect people would use it if a robust and reasonably-performant solution were found. (Alignment is in fact economically valuable; it would be worth a lot.)

How much alignment data will we need in the long run?

Ramana Kumar3yΩ111

If our alignment training data correctly favors aligned behavior over unaligned behavior, then we have solved outer alignment.

I'm curious to understand what this means, what "data favoring aligned behavior" means particularly. I'll take for granted as background that there are some policies that are good ("aligned" and capable) and some that are bad. I see two problems with the concept of data favoring a certain kind of policy:

Data doesn't specify generalization. For any achievable training loss on some dataset, there are many policies that achieve that lo

...

1Jacob_Hilton3y

This is just supposed to be an (admittedly informal) restatement of the definition of outer alignment in the context of an objective function where the data distribution plays a central role. For example, assuming a reinforcement learning objective function, outer alignment is equivalent to the statement that there is an aligned policy that gets higher average reward on the training distribution than any unaligned policy. I did not intend to diminish the importance of robustness by focusing on outer alignment in this post.
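One way to write the reinforcement-learning version of that statement symbolically, as a sketch. The symbols are assumptions introduced here for illustration and are not notation from the post: $\Pi_A$ and $\Pi_U$ are the sets of aligned and unaligned policies, $\mathcal{D}$ is the training distribution, and $R(\pi, x)$ is the reward policy $\pi$ receives on input $x$.

```latex
\exists\, \pi_a \in \Pi_A \;\; \forall\, \pi_u \in \Pi_U : \quad
\mathbb{E}_{x \sim \mathcal{D}}\big[ R(\pi_a, x) \big]
\;>\;
\mathbb{E}_{x \sim \mathcal{D}}\big[ R(\pi_u, x) \big]
```

Read this way, the condition only compares average reward on $\mathcal{D}$; it does not by itself say anything about behavior off the training distribution, which is where robustness comes in.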

Will Capabilities Generalise More?

Ramana Kumar3yΩ9111

Straw person: We haven't found any feedback producer whose outputs are safe to maximise. We strongly suspect there isn't one.

Ramana's gloss of TurnTrout: But AIs don't maximise their feedback. The feedback is just input to the algorithm that shapes the AI's cognition. This cognition may then go on to in effect "have a world model" and "pursue something" in the real world (as viewed through its world model). But its world model might not even contain the feedback producer, in which case it won't be pursuing high feedback. (Also, it might just do something e...

2TurnTrout3y

Thanks for running a model of me :) Actual TurnTrout response: No. Addendum: I think that this reasoning fails on the single example we have of general intelligence (i.e. human beings). People probably do value "positive feedback" (in terms of reward prediction error or some tight correlate thereof), but people are not generally reward optimizers.

On how various plans miss the hard bits of the alignment challenge

Ramana Kumar3yΩ7173

For 2, I think a lot of it is finding the "sharp left turn" idea unlikely. I think trying to get agreement on that question would be valuable.

For 4, some of the arguments for it in this post (and comments) may help.

For 3, I'd be interested in there being some more investigation into and explanation of what "interpretability" is supposed to achieve (ideally with some technical desiderata). I think this might end up looking like agency foundations if done right.

For example, I'm particularly interested in how "interpretability" is supposed to work if, in some...

Will Capabilities Generalise More?

Ramana Kumar3yΩ563

The desiderata you mentioned:

Make sure the feedback matches the preferences
Make sure the agent isn't changing the preferences

It seems that RRM/Debate somewhat addresses both of these, and path-specific objectives is mainly aimed at addressing issue 2. I think (part of) John's point is that RRM/Debate don't address issue 1 very well, because we don't have very good or robust processes for judging the various ways we could construct or improve these schemes. Debate relies on a trustworthy/reliable judge at the end of the day, and we might not actually have that.

Will Capabilities Generalise More?

Ramana Kumar3y10

Thanks, that's great to hear :)

Will Capabilities Generalise More?

Ramana Kumar3yΩ110

Nice - thanks for this comment - how would the argument be summarised as a nice heading to go on this list? Maybe "Capabilities can be optimised using feedback but alignment cannot" (and feedback is cheap, and optimisation eventually produces generality)?

3johnswentworth3y

Maybe "Humans iteratively designing useful systems and fixing problems provide a robust feedback signal for capabilities, but not for alignment"? (Also, I now realize that I left this out of the original comment because I assumed it was obvious, but to be explicit: basically any feedback signal on a reasonably-complex/difficult task will select for capabilities. That's just instrumental convergence.)

Will Capabilities Generalise More?

Ramana Kumar3yΩ110

I think what you say makes sense, but to be clear the argument does not consider those things as the optimisation target but rather considers fitness or reproductive capacity as the optimisation target. (A reasonable counterargument is that the analogy doesn't hold up because fitness-as-optimisation-target isn't a good way to characterise evolution as an optimiser.)

2Kaj_Sotala3y

Yes, that was my argument in the comment that I linked. :)

-1Noosphere893y

Yeah, that's the main counterargument. Evolution is purposeless and doesn't care about anything for specific species or nature itself, and evolution isn't teleological, so Argument 5 fails.

Where I agree and disagree with Eliezer

Ramana Kumar3yΩ110

Yes, that sounds right to me.