
zac_kenton — LessWrong

Thanks for featuring our work! I'd like to clarify a few points, which I think share some top-level similarities: our study looks at protocols as inference-only (which is cheap and quick to study, and possibly indicative), whereas what we care more about is protocols for training (which is much more expensive and will take longer to study). That was out of scope for this work, though we intend to look at it next based on our findings -- e.g. we have learnt that some domains are easier to work with than others, and that some baseline protocols are more meaningful/easier to interpret. In my opinion this is time well-spent to avoid spending lots... (read 379 more words →)

Replying to On scalable oversight with weak LLMs judging strong LLMs

zac_kenton 2y

On scalable oversight with weak LLMs judging strong LLMs

The post has now been edited with the updated plots for open consultancy/debate.

Replying to On scalable oversight with weak LLMs judging strong LLMs

zac_kenton 2y*

On scalable oversight with weak LLMs judging strong LLMs

Thanks for the comment Fabien. A couple of points:

Open debate accuracy is almost the same as debate accuracy (the only difference is the way we handle invalid answers, which are very rare). That's because the data is almost exactly the same - we're just marking one debater as a protagonist based on what that model would choose under direct QA. So it's not bad that open debate has the same accuracy as debate; that was expected. It is kinda bad that it's somewhat worse than open consultancy, though we didn't try 'fully open debate' where debaters can both pick the same side (or opposite, perhaps under resampling/rephrasing etc). This is probably a better comparison to open

zac_kenton

zac_kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner, Rohin Shah

Abstract

Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a human judge; consultancy, where a single AI tries to convince a human judge who asks questions; and compare to a baseline of direct question-answering, where the human judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include... (read 1890 more words →)

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Seb Farquhar

Seb Farquhar, Vikrant Varma, zac_kenton, gasteigerjo, Vlad Mikulik, Rohin Shah

TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about the wider category of unsupervised consistency-based methods, but tend to think they won’t be directly helpful either (70%). We’ve written a paper about some of our detailed experiences with it.

Paper authors: Sebastian Farquhar*, Vikrant Varma*, Zac Kenton*, Johannes Gasteiger, Vlad Mikulik, and Rohin Shah. *Equal contribution, order randomised.

Credences are based on a poll of Seb, Vikrant, Zac, Johannes, Rohin and show single... (read 2770 more words →)

149


Replying to Threat Model Literature Review

zac_kenton 3y

Threat Model Literature Review

Thanks for the comment Michael. Firstly, just wanted to clarify the framing of this literature review - when considering strengths and weaknesses of each threat model, this was done in light of what we were aiming to do: generate and prioritise alignment research projects -- rather than as an all-things-considered direct critique of each work (I think that is best done by commenting directly on those articles etc). I'll add a clarification of that at the top. Now to your comments:

To your 1st point: I think the lack of specific assumptions about the AGI development model is both a strength and a weakness. Regarding the weakness, we mention it because it makes... (read more)

Threat Model Literature Review

zac_kenton

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar, Elliot Catt

TL;DR: This post provides a literature review of some threat models of how misaligned AI can lead to existential catastrophe. See our accompanying post for high-level discussion, a categorization and our consensus threat model.

Where available we cribbed from the summary in the Alignment Newsletter.

For other people's overviews of some threat models, see here and here.

[ETA: When considering strengths and weaknesses of each threat model, this was done with respect to our goal of better generation/prioritization among alignment research projects. They shouldn't necessarily be read as an all-things-considered review of that work.]

Categorization

Threat Models

Goal misgeneralization + Specification Gaming ➜ MAPS

Is Power-Seeking AI an Existential Risk?

... (read 7259 more words →)

Clarifying AI X-risk

zac_kenton

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar, Elliot Catt

TL;DR: We give a threat model literature review, propose a categorization and describe a consensus threat model from some of DeepMind's AGI safety team. See our post for the detailed literature review.

The DeepMind AGI Safety team has been working to understand the space of threat models for existential risk (X-risk) from misaligned AI. This post summarizes our findings. Our aim was to clarify the case for X-risk to enable better research project generation and prioritization.

First, we conducted a literature review of existing threat models, discussed their strengths/weaknesses and then formed a categorization based on the technical cause of X-risk and the path that leads to X-risk. Next we tried to find consensus... (read 1096 more words →)

127

Replying to Discovering Agents

zac_kenton 3y

Discovering Agents

I haven't considered this in great detail, but if there are $N$ variables, then I think the causal discovery runtime is $O(N^{2})$. As we mention in the paper (footnote 5) there may be more efficient causal discovery algorithms that make use of certain assumptions about the system.

On adoption, perhaps if one encounters a situation where the computational cost is too high, one could coarse-grain their variables to reduce the number of variables. I don't have results on this at the moment but I expect that the presence of agency (none, or some) is robust to the coarse-graining, though the exact number of agents is not (example 4.3), nor are the variables identified as decisions/utilities (Appendix C).
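To make the scaling point concrete, here is a minimal sketch that assumes the cost is dominated by one check per unordered pair of variables, which is one way an $O(N^{2})$ runtime could arise. It is not the algorithm from the paper: `pairwise_checks` and `coarse_grain` are hypothetical helpers, and merging consecutive variables into fixed-size groups is only an illustrative coarse-graining rule.

```python
from itertools import combinations


def pairwise_checks(variables):
    """Count unordered variable pairs: N * (N - 1) / 2, which grows as O(N^2)."""
    return sum(1 for _ in combinations(variables, 2))


def coarse_grain(variables, group_size):
    """Merge consecutive variables into groups (hypothetical grouping rule)."""
    return [
        tuple(variables[i:i + group_size])
        for i in range(0, len(variables), group_size)
    ]


variables = [f"X{i}" for i in range(12)]
coarse = coarse_grain(variables, 3)

print(len(variables), pairwise_checks(variables))  # 12 66
print(len(coarse), pairwise_checks(coarse))        # 4 6
```

Under this assumption, cutting the number of variables by a factor of 3 cuts the number of pairwise checks by roughly a factor of 9, which is why coarse-graining might help when the cost is too high.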

Replying to Discovering Agents

zac_kenton 3y

Discovering Agents

Thanks, this has now been corrected to say 'not terminal'.

Discovering Agents

zac_kenton

Work done with Ramana Kumar, Sebastian Farquhar (Oxford), Jonathan Richens, Matt MacDermott (Imperial) and Tom Everitt.

Our DeepMind Alignment team researches ways to avoid AGI systems that knowingly act against the wishes of their designers. We’re particularly concerned about agents which may be pursuing a goal that is not what their designers want.

These types of safety concerns motivate developing a formal theory of agents to facilitate our understanding of their properties and avoid designs that pose a safety risk. Causal influence diagrams (CIDs) aim to be a unified theory of how design decisions create incentives that shape agent behaviour to illuminate potential risks before an agent is trained and inspire better agent designs with more appealing alignment properties.

Our... (read 1620 more words →)