LESSWRONG
LW

All of Vika's Comments + Replies

Thanks Gunnar, those sound like reasonable guidelines!

The common space was still usable by other housemates, but it felt a bit cramped, and I felt more internal pressure to keep it tidy for others to use (while in my own space I feel more comfortable leaving it messy for longer). Our housemates were very tolerant of having kid stuff everywhere, but it still seemed suboptimal.
The fridge, laundry area and outdoor garbage bins were the most overloaded in our case, while the shed and attic were sufficiently spacious and less in demand that it wasn't an i

... (read more)

Yeah, living in a group house was important for our mental well-being as well, especially during the pandemic and parental leaves. I think the benefits of the social environment decreased somewhat because we were often occupied with the kids and had less time to socialize. It was still pretty good though - if Deep End was close enough to schools we like, we would have probably stayed and tried to make it work (though this would likely involve taking over more of the house over time). Our new place contributes to mental well-being by being much closer to nature (while still a reasonable bike commute from the office).

Moving on from community living

Vika11mo62

I would potentially be interested, if we knew the other people well. I find that, as a parent, I'm less willing to take risks by moving in with people I don't know that well, because the stress and uncertainty associated with things not working out are more costly.

Space requirements would likely be the biggest difficulty though, as you pointed out. A family with 2 kids probably needs at least 3 rooms, so two such families together would need a 6 bedroom house. This is hard to find, especially combined with other constraints like proximity to schools, commute distances, etc. It's a lot easier to live near other families than sharing a living space.

More Is Different for AI

Vika1yΩ340Review for 2022 Review

I really enjoyed this sequence, it provides useful guidance on how to combine different sources of knowledge and intuitions to reason about future AI systems. Great resource on how to think about alignment for an ML audience.

Counterarguments to the basic AI x-risk case

Vika1yΩ10172Review for 2022 Review

I think this is still one of the most comprehensive and clear resources on counterpoints to x-risk arguments. I have referred to this post and pointed people to a number of times. The most useful parts of the post for me were the outline of the basic x-risk case and section A on counterarguments to goal-directedness (this was particularly helpful for my thinking about threat models and understanding agency).

Refining the Sharp Left Turn threat model, part 1: claims and mechanisms

Vika1yΩ7120Review for 2022 Review

I still endorse the breakdown of "sharp left turn" claims in this post. Writing this helped me understand the threat model better (or at all) and make it a bit more concrete.

This post could be improved by explicitly relating the claims to the "consensus" threat model summarized in Clarifying AI X-risk. Overall, SLT seems like a special case of that threat model, which makes a subset of the SLT claims:

Claim 1 (capabilities generalize far) and Claim 3 (humans fail to intervene), but not Claims 1a/b (simultaneous / discontinuous generalization) or Claim

... (read more)

Clarifying AI X-risk

Vika1y*Ω7120Review for 2022 Review

I continue to endorse this categorization of threat models and the consensus threat model. I often refer people to this post and use the "SG + GMG → MAPS" framing in my alignment overview talks. I remain uncertain about the likelihood of the deceptive alignment part of the threat model (in particular the requisite level of goal-directedness) arising in the LLM paradigm, relative to other mechanisms for AI risk.

In terms of adding new threat models to the categorization, the main one that comes to mind is Deep Deceptiveness (let's call it Soares2), whi... (read more)

DeepMind alignment team opinions on AGI ruin arguments

Vika1yΩ13240Review for 2022 Review

I'm glad I ran this survey, and I expect the overall agreement distribution probably still holds for the current GDM alignment team (or may have shifted somewhat in the direction of disagreement), though I haven't rerun the survey so I don't really know. Looking back at the "possible implications for our work" section, we are working on basically all of these things.

Thoughts on some of the cruxes in the post based on last year's developments:

Is global cooperation sufficiently difficult that AGI would need to deploy new powerful technology to make it

Vika2yΩ353

I agree that a possible downside of talking about capabilities is that people might assume they are uncorrelated and we can choose not to create them. It does seem relatively easy to argue that deception capabilities arise as a side effect of building language models that are useful to humans and good at modeling the world, as we are already seeing with examples of deception / manipulation by Bing etc.

I think the people who think we can avoid building systems that are good at deception often don't buy the idea of instrumental convergence either (e.g. Yann LeCun), so I'm not sure that arguing for correlated capabilities in terms of intelligence would have an advantage.

Steering GPT-2-XL by adding an activation vector

Vika2yΩ480

Re 4, we were just discussing this paper in a reading group at DeepMind, and people were confused why it's not on arxiv.

6TurnTrout2y

An Arxiv version is forthcoming. We're working with Gavin Leech to publish these results as a conference paper.

Power-seeking can be probable and predictive for trained agents

Vika2y*Ω120

The issue with being informal is that it's hard to tell whether you are right. You use words like "motivations" without defining what you mean, and this makes your statements vague enough that it's not clear whether or how they are in tension with other claims. (E.g. what I have read so far doesn't seems to rule out that shards can be modeled as contextually activated subagents with utility functions.)

An upside of formalism is that you can tell when it's wrong, and thus it can help make our thinking more precise even if it makes assumptions that may ... (read more)

4TurnTrout2y

It seems worth pointing out: the informality is in the hypothesis, which comprises a set of somewhat illegible intuitions and theories I use to reason about generalization. However, the prediction itself is what needs to be graded in order to see whether I was right. I made a prediction fairly like "the policy tends to go to the top-right 5x5, and searches for cheese once there, because that's where the cheese-seeking computations were more strongly historically reinforced" and "the policy sometimes pursues cheese and sometimes navigates to the top-right 5x5 corner." These predictions are (informally) gradable, even if the underlying intuitions are informal. As it pertains to shard theory more broadly, though, I agree that more precision is needed. Increasing precision and formalism is the reason I proposed and executed the project underpinning Understanding and controlling a maze-solving policy network. I wanted to understand more about realistic motivational circuitry and model internals in the real world. I think the last few months have given me headway on a more mechanistic definition of a "shard-based agent."

Power-seeking can be probable and predictive for trained agents

Vika2yΩ560

Thanks Daniel, this is a great summary. I agree that internal representation of the reward function is not load-bearing for the claim. The weak form of representation that you mentioned is what I was trying to point at. I will rephrase the sentence to clarify this, e.g. something like "We assume that the agent learns a goal during the training process: some form of implicit internal representation of desired state features or concepts".

4TurnTrout2y

Great, this sounds much better!

Power-seeking can be probable and predictive for trained agents

Vika2yΩ340

Thanks Daniel for the detailed response (which I agree with), and thanks Alex for the helpful clarification.

I agree that the training-compatible set is not predictive for how the neural network generalizes (at least under the "strong distributional shift" assumption in this post where the test set is disjoint from the training set, which I think could be weakened in future work). The point of this post is that even though you can't generally predict behavior in new situations based on the training-compatible set alone, you can still predict power-seeking t... (read more)

4TurnTrout2y

You're right. I was critiquing "power-seeking due to your assumptions isn't probable, because I think your assumptions won't hold" and not "power-seeking isn't predictive." I had misremembered the predictive/probable split, as introduced in Definitions of “objective” should be Probable and Predictive: Sorry for the confusion. I agree that power-seeking is predictive given your assumptions. I disagree that power-seeking is probable due to your assumptions being probable. The argument I gave above was actually: 1. The assumptions used in the post ("learns a randomly-selected training-compatible goal") assign low probability to experimental results, relative to other predictions which I generated (and thus relative to other ways of reasoning about generalization), 2. Therefore the assumptions become less probable 3. Therefore power-seeking becomes less probable (at least, due to these specific assumptions becoming less probable; I still think P(power-seeking) is reasonably large) I suspect that you agree that "learns a training-compatible goal" isn't very probable/realistic. My point is then that the conclusions of the current work are weakened; maybe now more work has to go into the "can" in "Power-seeking can be probable and predictive."

Power-seeking can be probable and predictive for trained agents

Vika2yΩ120

The internal representations assumption was meant to be pretty broad, I didn't mean that the network is explicitly representing a scalar reward function over observations or anything like that - e.g. these can be implicit representations of state features $.$ I think this would also include the kind of representations you are assuming in the maze-solving post, e.g. cheese shards / circuits.

TurnTrout's shortform feed

Vika2yΩ480

Thanks Alex! Your original comment didn't read as ill-intended to me, though I wish that you'd just messaged me directly. I could have easily missed your comment in this thread - I only saw it because you linked the thread in the comments on my post.

Your suggested rephrase helps to clarify how you think about the implications of the paper, but I'm looking for something shorter and more high-level to include in my talk. I'm thinking of using this summary, which is based on a sentence from the paper's intro: "There are theoretical results showing that many d... (read more)

6TurnTrout2y

I think this is reasonable, although I might say "suggesting" instead of "showing." I think I might also be more cautious about further inferences which people might make from this -- like I think a bunch of the algorithms I proved things about are importantly unrealistic. But the sentence itself seems fine, at first pass.

TurnTrout's shortform feed

Vika2yΩ6132

Sorry about the cite in my "paradigms of alignment" talk, I didn't mean to misrepresent your work. I was going for a high-level one-sentence summary of the result and I did not phrase it carefully. I'm open to suggestions on how to phrase this differently when I next give this talk.

Similarly to Steven, I usually cite your power-seeking papers to support a high-level statement that "instrumental convergence is a thing" for ML audiences, and I find they are a valuable outreach tool. For example, last year I pointed David Silver to the optimal policies paper ... (read more)

TurnTrout2yΩ6120

Thanks for your patient and high-quality engagement here, Vika! I hope my original comment doesn't read as a passive-aggressive swipe at you. (I consciously tried to optimize it to not be that.) I wanted to give concrete examples so that Wei_Dai could understand what was generating my feelings.

I'm open to suggestions on how to phrase this differently when I next give this talk.

It's a tough question to say how to apply the retargetablity result to draw practical conclusions about trained policies. Part of this is because I don't know if trained policies ten... (read more)

Power-seeking can be probable and predictive for trained agents

Vika2yΩ46-6

Here is my guess on how shard theory would affect the argument in this post:

In my understanding, shard theory would predict that the model learns multiple goals from the training-compatible (TC) set (e.g. including both the coin goal and the go-right goal in CoinRun), and may pursue different learned goals in different new situations. The simplifying assumption that the model pursues a randomly chosen goal from the TC set also covers this case, so this doesn't affect the argument.
Shard theory might also imply that the training-compatible set should b

... (read more)

5TurnTrout2y

I still expect instrumental convergence from agentic systems with shard-encoded goals, but think this post doesn't offer any valid argument for that conclusion. I don't think these results cover the shard case. I don't think reward functions are good ways of describing goals in settings I care about. I also think that realistic goal pursuit need not look like "maximize time-discounted sum of a scalar quantity of world state." My point is not that instrumental convergence is wrong, or that shard theory makes different predictions. I just think that these results are not predictive of trained systems.

When is Goodhart catastrophic?

Vika2yΩ240

Great post! I especially enjoyed the intuitive visualizations for how the heavy-tailed distributions affect the degree of overoptimization of X.

As a possibly interesting connection, your set of criteria for an alignment plan can also be thought of as criteria for selecting a model specification that approximates the ideal specification well, especially trying to ensure that the approximation error is light-tailed.

Yoshua Bengio: How Rogue AIs may Arise

Vika2y202

David had many conversations with Bengio about alignment during his PhD, and gets a lot of credit for Bengio taking AI risk seriously

Power-seeking can be probable and predictive for trained agents

Vika2yΩ68-6

Thanks Alex for the detailed feedback! I agree that learning a goal from the training-compatible set is a strong assumption that might not hold.

This post assumes a standard RL setup and is not intended to apply to LLMs (it's possible some version of this result may hold for fine-tuned LLMs, but that's outside the scope of this post). I can update the post to explicitly clarify this, though I was not expecting anyone to assume that this work applies to LLMs given that the post explicitly assumes standard RL and does not mention LLMs at all.

I agr... (read more)

5TurnTrout2y

Thanks for the reply. This comment is mostly me disagreeing with you.[1] But I really wish someone had said the following things to me before I spent thousands of hours thinking about optimal policies. My point is not just that this post has made a strong assumption which may not hold. My point is rather that these results are not probable because the assumption won't hold. The assumptions are already known to not be good approximations of trained policies, in at least some prototypical RL situations. I also think that there is not good a priori reason to have expected "training-compatible" "goals" to be learned. According to me, "learning and optimizing a reward function" is both unclear communication and doesn't actually seem to happen in practice. I don't see any formal assumption which excludes LLM finetuning. Which assumption do you think should exclude them? EDIT: Someone privately pointed out that LLM finetuning uses KL, which isn't present for your results. In that case I would agree your results don't apply to LLMs for that reason. This point is, in large part, my fault. As I argued in my original comment, this terminology makes readers actively worse at reasoning about realistic trained systems. I regret each of the thousands of hours I spent on the power-seeking work, and sometimes fantasize about retracting one or both papers. I disagree that these are equivalent, and expect the policy and value function to come apart in practice. Indeed, that was observed in the original goal misgeneralization paper (3.3, actor-critic inconsistency). Anyways, we can talk about utility functions, but then we're going to lose claim to probable-ness, no? Why should we assume that the network will internally represent a scalar function over observations, consistent with a historical training signal's scalar values (and let's not get into nonstationary reward), such that the network will maximize discounted sum return of this internally represented function? That se

Power-seeking can be probable and predictive for trained agents

Vika2yΩ130

Which definition / result are you referring to?

4neverix2y

Seems like quoting doesn't work for LaTeX, it was definitions 2/3. Reading again I saw D2 was indeed applicable to sets.

[Linkpost] Some high-level thoughts on the DeepMind alignment team's strategy

Vika2yΩ120

We expect that an aligned (blue-cloud) model would have an incentive to preserve its goals, though it would need some help from us to generalize them correctly to avoid becoming a misaligned (red-cloud) model. We talk about this in more detail in Refining the Sharp Left Turn (part 2).

[Linkpost] Some high-level thoughts on the DeepMind alignment team's strategy

Vika2yΩ361

Just added some more detail on this to the slides. The idea is that we have various advantages over the model during the training process: we can restart the search, examine and change beliefs and goals using interpretability techniques, choose exactly what data the model sees, etc.

baturinsky2yΩ0119

While the model has the advantage of only having to "win" once.

Optimization Concepts in the Game of Life

Vika2yΩ120

Thanks Alex for the detailed feedback! I have updated the post to fix these errors.

Curious if you have high-level thoughts about the post and whether these definitions have been useful in your work.

Imitative Generalisation (AKA 'Learning the Prior')

Vika2yΩ360Review for 2021 Review

This post provides a maximally clear and simple explanation of a complex alignment scheme. I read the original "learning the prior" post a few times but found it hard to follow. I only understood how the imitative generalization scheme works after reading this post (the examples and diagrams and clear structure helped a lot).

Saving Time

Vika2yΩ270Review for 2021 Review

This post helped me understand the motivation for the Finite Factored Sets work, which I was confused about for a while. The framing of agency as time travel is a great intuition pump.

Selection Theorems: A Program For Understanding Agents

Vika2yΩ460Review for 2021 Review

I like this research agenda because it provides a rigorous framing for thinking about inductive biases for agency and gives detailed and actionable advice for making progress on this problem. I think this is one of the most useful research directions in alignment foundations since it is directly applicable to ML-based AI systems.

Finding gliders in the game of life

Vika2yΩ122

+1. This section follows naturally from the rest of the article, and I don't see why it's labeled as an appendix - this seems like it would unnecessarily discourage people from reading it.

4paulfchristiano2y

I'm convinced, I relabeled it.

The Plan - 2022 Update

Vika2yΩ7140

It's great to hear that you have updated away from ambitious value learning towards corrigibility-like targets. It sounds like you now find it plausible that corrigibility will be a natural concept in the AI's ontology, despite it being incompatible with expected utility maximization. Does this mean that you expect we will be able to build advanced AI that doesn't become an expected utility maximizer?

I'm also curious how optimistic you are about the interpretability field being able to solve the empirical side of the abstraction problem in the next 5-10 ye... (read more)

Charlie Steiner2yΩ9148

Bah! :D It's sad to hear he's updated away from ambitions value learning towards corrigiblity-like targets. Eliezer's second-hand argument sounds circular to me; suppose that corrigibility as we'd recognize it isn't a natural abstraction - then generic AIs wouldn't use it to align child agents (instead doing something like value learning, or something even more direct), and so there wouldn't be a bunch of human-independent examples, so it wouldn't show up as a natural abstraction to those AIs.

6johnswentworth2y

When talking about whether some physical system "is a utility maximizer", the key questions are "utility over what variables?", "in what model do those variables live?", and "with respect to what measuring stick?". My guess is that a corrigible AI will be a utility maximizer over something, but maybe not over the AI-operator interface itself? I'm still highly uncertain what that type-signature will look like, but there's a lot of degrees of freedom to work with. We'll need qualitatively different methods. But that's not new; interpretability researchers already come up with qualitatively new methods pretty regularly.

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

Vika2yΩ46-2

I would consider goal generalization as a component of goal preservation, and I agree this is a significant challenge for this plan. If the model is sufficiently aligned to the goal of being helpful to humans, then I would expect it would want to get feedback about how to generalize the goals correctly when it encounters ontological shifts.

Takeaways from a survey on AI alignment resources

Vika2y20

Too bad that my list of AI safety resources didn't make it into the survey - would be good to know to what extent it would be useful to keep maintaining it. Will you be running future iterations of this survey?

2DanielFilan2y

Would have been good to ask about that and also mine it for resources. Re: future iterations, I'm not sure. On one hand, I think it's kind of bad for this kind of thing to be run by a person who stands to benefit from his thing ranking high on the survey. On the other hand, I'm not sure if anyone else wants to do it, and I think it would be good to run future iterations. If anyone does want to take it over, please let me know. I'm not sure how many would be interested in doing that (maybe grantmaking orgs?), but if there are multiple such people it would probably be good to pick a designated successor. I should say that I reserve the right to wait until next year to make any sort of decision on this.

Simulators

Vika2yΩ230

I agree that a sudden gain in capabilities can make a simulated agent undergo a sharp left turn (coming up with more effective takeover plans is a great example). My original question was about whether the simulator itself could undergo a sharp left turn. My current understanding is that a pure simulator would not become misaligned if its capabilities suddenly increase because it remains myopic, so we only have to worry about a sharp left turn for simulated agents rather than the simulator itself. Of course, in practice, language models are often fine-tune... (read more)

DeepMind alignment team opinions on AGI ruin arguments

Vika2yΩ120

I would say the primary disagreement is epistemic - I think most of us would assign a low probability to a pivotal act defined as "a discrete action by a small group of people that flips the gameboard" being necessary. We also disagree on a normative level with the pivotal act framing, e.g. for reasons described in Critch's post on this topic.

(My understanding of) What Everyone in Technical Alignment is Doing and Why

Vika2y20

No worries! Thanks a lot for updating the post

The alignment problem from a deep learning perspective

Vika2yΩ352

Thanks Richard for this post, it was very helpful to read! Some quick comments:

I like the level of technical detail in this threat model, especially the definition of goals and what it means to pursue goals in ML systems
The architectural assumptions (e.g. the prediction & action heads) don't seem load-bearing for any of the claims in the post, as they are never mentioned after they are introduced. It might be good to clarify that this is an example architecture and the claims apply more broadly.
Phase 1 and 2 seem to map to outer and inner alignment res

... (read more)

4Richard_Ngo2y

Thanks for the comments Vika! A few responses: Makes sense, will do. That doesn't quite seem right to me. In particular: * Phase 3 seems like the most direct example of inner misalignment; I basically think of "goal misgeneralization" as a more academically respectable way of talking about inner misalignment. * Phase 1 introduces the reward misspecification problem (which I treat as synonymous with "outer alignment") but also notes that policies might become misaligned by the end of phase 1 because they learn goals which are "robustly correlated with reward because they’re useful in a wide range of environments", which is a type of inner misalignment. * Phase 2 discusses both policies which pursue reward as an instrumental goal (which seems more like inner misalignment) and also policies which pursue reward as a terminal goal. The latter doesn't quite feel like a central example of outer misalignment, but it also doesn't quite seem like a central example of reward tampering (because "deceiving humans" doesn't seem like an example of "tampering" per se). Plausibly we want a new term for this - the best I can come up with after a few minutes' thinking is "reward fixation", but I'd welcome alternatives. It seems very unlikely for an AI to have perfect proxies when it becomes situationally aware, because the world is so big and there's so much it won't know. In general I feel pretty confused about Evan talking about perfect performance, because it seems like he's taking a concept that makes sense in very small-scale supervised training regimes, and extending it to AGIs that are trained on huge amounts of constantly-updating (possibly on-policy) data about a world that's way too complex to predict precisely. Mechanistic interpretability seems helpful in phase 2, but there are other techniques that could help in phase 2, in particular scalable oversight techniques. Whereas interpretability seems like the only thing that's really helpful in phase 3 - if it gets goo

Simulators

Vika2yΩ360

Thank you for the insightful post. What do you think are the implications of the simulator framing for alignment threat models? You claim that a simulator does not exhibit instrumental convergence, which seems to imply that the simulator would not seek power or undergo a sharp left turn. The simulated agents could exhibit power-seeking behavior or rapidly generalizing capabilities or try to break out of the simulation, but this seems less concerning than the top-level model having these properties, and we might develop alignment techniques specifically tar... (read more)

3VojtaKovarik2y

Re sharp left turn: Maybe I misunderstand the "sharp left turn" term, but I thought this just means a sudden extreme gain in capabilities? If I am correct, then I expect you might get "sharp left turn" with a simulator during training --- eg, a user fine-tunes it on one additional dataset, and suddenly FOOOM. (Say, suddenly it can simulate agents that propose takeover plans that would actually work, when previously they failed at this with identical prompting.) One implication I see is that it if the simulator architecture becomes frequently used, it might be really hard to tell whether a thing is dangerous or not. For example might just behave completely fine with most prompts and catastrophically with some other prompts, and you will never know until you try. (Or unless you do some extra interpretability/other work that doesn't yet exist.) It would be rather unfortunate if the Vulnerable World Hypothesis was true because of specific LLM prompts :-).

(My understanding of) What Everyone in Technical Alignment is Doing and Why

Vika2y97

I would expect that the way Ought (or any other alignment team) influences the AGI-building org is by influencing the alignment team within that org, which would in turn try to influence the leadership of the org. I think the latter step in this chain is the bottleneck - across-organization influence between alignment teams is easier than within-organization influence. So if we estimate that Ought can influence other alignment teams with 50% probability, and the DM / OpenAI / etc alignment team can influence the corresponding org with 20% probability, then... (read more)

3elifland2y

Good point, and you definitely have more expertise on the subject than I do. I think my updated view is ~5% on this step. I might be underconfident about my pessimism on the first step (competitiveness of process-based systems) though. Overall I've updated to be slightly more optimistic about this route to impact.

(My understanding of) What Everyone in Technical Alignment is Doing and Why

Vika2yΩ5104

Thanks Thomas for the helpful overview post! Great to hear that you found the AGI ruin opinions survey useful.

I agree with Rohin's summary of what we're working on. I would add "understanding / distilling threat models" to the list, e.g. "refining the sharp left turn" and "will capabilities generalize more".

Some corrections for your overall description of the DM alignment team:

I would count ~20-25 FTE on the alignment + scalable alignment teams (this does not include the AGI strategy & governance team)
I would put DM alignment in the "fairly hard"

... (read more)

3Thomas Larsen2y

Sorry for the late response, and thanks for your comment, I've edited the post to reflect these.

Toni Kurz and the Insanity of Climbing Mountains

Vika3y110

This post resonates with me on a personal level, since my mother was really into mountain climbing in her younger years. She quit after seeing a friend die in front of her (another young woman who broke her neck against an opposing rock face in an unlucky fall). It seems likely I wouldn't be here otherwise. Happy to report that she is still enjoying safer mountain activities 50 years later.

DeepMind alignment team opinions on AGI ruin arguments

Vika3yΩ3133

Correct. I think that doing internal outreach to build an alignment-aware company culture and building relationships with key decision-makers can go a long way. I don't think it's possible to have complete binding power over capabilities projects anyway, since the people who want to run the project could in principle leave and start their own org.

DeepMind alignment team opinions on AGI ruin arguments

Vika3y113

Hmm, thanks... Can you elaborate what "this" is?

DeepMind alignment team opinions on AGI ruin arguments

Vika3yΩ7150

We don't have the power to shut down projects, but we can make recommendations and provide input into decisions about projects

4Noosphere893y

So you can have non-binding recommendations and input, but no actual binding power over the capabilities researchers, right?

DeepMind alignment team opinions on AGI ruin arguments

Vika3yΩ4100

Thanks! For those interested in conducting similar surveys, here is a version of the spreadsheet you can copy (by request elsewhere in the comments).

DeepMind alignment team opinions on AGI ruin arguments

Vika3yΩ4100

Here is a spreadsheet you can copy. This one has a column for each person - if you want to sort the rows by agreement, you need to do it manually after people enter their ratings. I think it's possible to automate this but I was too lazy.

Paradigms of AI alignment: components and enablers

Vika3y30

Thanks, glad you found the post useful!

Maintaining uncertainty over the goal allows the system to model the set of goals that are consistent with the training data, notice when they disagree with each other out of distribution, and resolve that disagreement in some way (e.g. by deferring to a human).

DeepMind alignment team opinions on AGI ruin arguments

Vika3y60

Fixed

Gradations of Agency

Vika3yΩ340

Ah, I think you intended level 6 as an OR of learning from imitation / imagined experience, while I interpreted it as an AND. I agree that humans learn from imitation on a regular basis (e.g. at school). In my version of the hierarchy, learning from imitation and imagined experience would be different levels (e.g. level 6 and 7) because the latter seems a lot harder. In your decision theory example, I think a lot more people would be able to do the imitation part than the imagined experience part.

2Daniel Kokotajlo3y

Well said; I agree it should be split up like that.

Gradations of Agency

Vika3yΩ340

I think some humans are at level 6 some of the time (see Humans Who Are Not Concentrating Are Not General Intelligences). I would expect that learning cognitive algorithms from imagined experience is pretty hard for many humans (e.g. examples in the Astral Codex post about conditional hypotheticals). But maybe I have a different interpretation of Level 6 than what you had in mind?

4Daniel Kokotajlo3y

Good point re learning cognitive algorithms from imagined experience, that does seems pretty hard. From imitation though? We do it all the time. Here's an example of me doing both: I read books about decision theory & ethics, and learn about expected utility maximization & the bounded variants that humans can actually do in practice (back of envelope calculations, etc.) I immediately start implementing this algorithm myself on a few occasions. (Imitation) Then I read more books and learn about "pascal's mugging" and the like. People are arguing about whether or not it's a problem for expected utility maximization. I think through the arguments myself and come up with some new arguments of my own. This involves imagining how the expected utility maximization algorithm would behave in various hypothetical scenarios, and also just reasoning analytically about the properties of the algorithm. I end up concluding that I should continue using the algorithm but with some modifications. (Learning from imagined experience.) Would you agree with this example, or are you thinking about the hierarchy somewhat differently than me? I'm keen to hear more if the latter.

Gradations of Agency

Vika3yΩ340

This is an interesting hierarchy! I'm wondering how to classify humans and various current ML systems along this spectrum. My quick take is that most humans are at Levels 4-5, AlphaZero is at level 5, and GPT-3 is at level 4 with the right prompting. Curious if you have specific ML examples in mind for these levels.

2Daniel Kokotajlo3y

Thanks! Hmm, I would have thought humans were at Level 6, though of course most of their cognition most of the time is at lower levels.

Examples of AI Increasing AI Progress

Vika3y20

Makes sense, thanks. I think the current version of the list is not a significant infohazard since the examples are well-known, but I agree it's good to be cautious. (I tweeted about it to try to get more examples, but it didn't get much uptake, happy to delete the tweet if you prefer.) Focusing on outreach to people who care about AI risk seems like a good idea, maybe it could be useful to nudge researchers who don't work on AI safety because of long timelines to start working on it.

3TW1233y

No need to delete the tweet. I dagree the examples are not info hazards, they're all publicly known. I just probably wouldn't want somebody going to good ML researchers who currently are doing something that isn't really capabilities (e.g., application of ML to some other area) and telling them "look at this, AGI soon."