Where I agree and disagree with Eliezer
(Partially in response to AGI Ruin: A List of Lethalities. Written in the same rambling style. Not exhaustive.)

Agreements

1. Powerful AI systems have a good chance of deliberately and irreversibly disempowering humanity. This is a much more likely failure mode than humanity killing ourselves with destructive physical technologies.
2. Catastrophically risky AI systems could plausibly exist soon, and there likely won't be a strong consensus about this fact until such systems pose a meaningful existential risk per year. There is not necessarily any "fire alarm."
3. Even if there were consensus about a risk from powerful AI systems, there is a good chance that the world would respond in a totally unproductive way. It's wishful thinking to look at possible stories of doom and say "we wouldn't let that happen"; humanity is fully capable of messing up even very basic challenges, especially if they are novel.
4. Many of the projects intended to help with AI alignment don't make progress on key difficulties and won't significantly reduce the risk of catastrophic outcomes. This reflects people gravitating to whatever research is most tractable without being too picky about which problems it helps with, as well as a low level of concern with the long-term future in particular. Overall, there are relatively few researchers effectively focused on the technical problems most relevant to existential risk from alignment failures.
5. There are strong social and political pressures to spend much more of our time talking about how AI shapes existing conflicts and shifts power. This pressure is already playing out, and it doesn't seem likely to get better.
6. Even when thinking about accident risk, people's minds seem to go to what they think of as "more realistic and less sci-fi" risks that are much less likely to be existential (and sometimes, I think, less plausible). It's very possible this dynamic won't change until after actually existing AI