LESSWRONG

Peter S. Park — LessWrong

By Peter S. Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks

[This post summarizes our new report on AI deception, available here]

Abstract: This paper argues that a range of current AI systems have learned how to deceive humans. We define deception as the systematic inducement of false beliefs in the pursuit of some outcome other than the truth. We first survey empirical examples of AI deception, discussing both special-use AI systems (including Meta's CICERO) built for specific competitive situations, and general-purpose AI systems (such as large language models). Next, we detail several risks from AI deception, such as fraud, election tampering, and losing control of AI systems. Finally, we outline several potential... (read 2806 more words →)

AI can exploit safety plans posted on the Internet

Peter S. Park

A month ago, I predicted that AI systems will be able to access safety plans posted on the Internet and use them for their own purposes. If true, it follows that a likely misaligned-by-default AGI could exploit our safety plans, likely to our detriment.

The post was controversial. On the EA forum, it obtained only 13 net upvotes from 23 voters, and the top comment (which disagreed with the post) obtained 25 net upvotes from 17 voters.

On LessWrong, my post obtained only 3 net upvotes from 13 votes, while the top comment (which also disagreed with the post) obtained 9 upvotes from 3 votes.

I'm writing to report that OpenAI's recent... (read more)


The limited upside of interpretability

Peter S. Park

TL;DR: A strategy aiming to elicit latent knowledge (or to make any hopefully robust, hopefully generalizable prediction) from interpreting an AGI’s fine-grained internal data may be unlikely to succeed, given that the complex system of an AGI’s agent-environment interaction dynamics will plausibly turn out to be computationally irreducible. In general, the most efficient way to predict the behavior of a complex agent in an environment is to run it in that exact environment. Mechanistic interpretability is unlikely to provide a reliable safety plan that magically improves on the default strategy of empiricism. Coarse-grained models of the complex system have a realistic chance of making robust predictions out-of-distribution, although such predictions would then... (read 2796 more words →)

Why do we post our AI safety plans on the Internet?

Peter S. Park

Cross-posted from the EA Forum.

TL;DR: It is plausible that AGI safety research should be assumed compromised once it is posted on the Internet, even in a purportedly private Google Doc. This is because the corporation creating the AGI will likely be training it on as much data as possible. And whenever the AGI knows in advance of our plan “If we see Sign X of misalignment from the AI, we should shut it down and retrain,” it can use this plan against us: for example, by hiding from us Sign X that it would have shown under normal circumstances. If true, this concern implies that the impact of EAs’ past and current... (read 3175 more words →)

Can We Align a Self-Improving AGI?

Peter S. Park

Produced during the Stanford Existential Risk Initiative (SERI) ML Alignment Theory Scholars (MATS) Program of 2022, under John Wentworth

TL;DR: Suppose that a team of researchers somehow aligned an AGI of human-level capabilities within the limited collection of environments that are accessible at that level. To corrigibly aid the researchers, the aligned AGI increases its capabilities (e.g., deployment into the Internet, abstraction capabilities, ability to break encryption). But after the aligned AGI increases its capabilities, it may be able to access environments that were previously inaccessible to both itself and the researchers. These new environments may then cause the AGI to phase-transition into misalignedness, in a complex way the alignment researchers could not... (read 3263 more words →)

What Makes an Idea Understandable? On Architecturally and Culturally Natural Ideas.

NickyP

NickyP, Peter S. Park, Stephen Fowler

Midjourney generating an HD image of "a medium-length sleeve t-shirt". The result in fact looks like a t-shirt that has both long sleeves and short sleeves.

Produced as part of the SERI MATS Program 2022 under John Wentworth

General Idea

Some ideas are easier for people to learn than others. This will vary because of at least two things: one is that the ideas may be natural to the environment/culture (“culturally natural”), the other is that they might be natural/understandable by human brains (“architecturally natural”). This should be formalised so that an AI would use ideas that are as human-interpretable as possible. Ideally, we would also be... (read 4754 more words →)

How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)

Peter S. Park

Peter S. Park, NickyP, Stephen Fowler

Produced during the Stanford Existential Risk Initiative (SERI) ML Alignment Theory Scholars (MATS) Program of 2022, under John Wentworth

“Overconfidence in yourself is a swift way to defeat.”

- Sun Tzu

TL;DR: Escape into the Internet is probably an instrumental goal for an agentic AGI. An incompletely aligned AGI may escape prematurely, and the biggest failure mode for this is probably the AGI socially engineering the alignment researchers. Thus, opening an additional information channel between the researchers and the AGI (e.g., adding an interpretability tool and/or researcher) is inherently risky. The expected cost of adding this channel may even exceed the expected scientific benefit. Whether this is true depends on the informational efficiency of the... (read 3226 more words →)

Finding Skeletons on Rashomon Ridge

David Udell

David Udell, Peter S. Park, NickyP

A product of a SERI MATS research sprint (taking 1.5 weeks).

Cannot yet assign positively to animal or vegetable kingdom, but odds now favour animal. Probably represents incredibly advanced evolution of radiata without loss of certain primitive features. Echinoderm resemblances unmistakable despite local contradictory evidences. Wing structure puzzles in view of probable marine habitat, but may have use in water navigation. Symmetry is curiously vegetable-like, suggesting vegetable’s essentially up-and-down structure rather than animal’s fore-and-aft structure …
Vast field of study opened … I’ve got to dissect one of these things before we take any rest.
--H. P. Lovecraft, At the Mountains of Madness

Introduction

This is our MATS research-sprint team's crack at working towards the True Name... (read 1974 more words →)

Race Along Rashomon Ridge

Stephen Fowler

Stephen Fowler, Peter S. Park, MichaelEinhorn

Produced As Part Of The SERI ML Alignment Theory Scholars Program 2022 Research Sprint Under John Wentworth

Two deep neural networks with wildly different parameters can produce equally good results. Not only can a tweak to parameters leave performance unchanged, but in many cases, two neural networks with completely different weights and biases produce identical outputs for any input.
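One well-known way this happens is hidden-unit permutation symmetry: reordering the hidden units of a layer, along with the matching incoming and outgoing weights, changes the parameters but not the function. The sketch below is only an illustration under assumed names (`forward`, `permute_hidden`, and the tiny 3-input, 4-hidden-unit network are all hypothetical and not taken from the post).

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def forward(w1, b1, w2, b2, x):
    """One-hidden-layer ReLU network with a scalar output (toy model)."""
    hidden = [
        relu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(v * h for v, h in zip(w2, hidden)) + b2


def permute_hidden(w1, b1, w2, perm):
    """Reorder hidden units: rows of w1, entries of b1, entries of w2."""
    return (
        [w1[i] for i in perm],
        [b1[i] for i in perm],
        [w2[i] for i in perm],
    )


rng = random.Random(0)
n_in, n_hidden = 3, 4  # assumed toy sizes

w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
b2 = rng.uniform(-1, 1)

# A different point in weight space that computes the same function.
perm = [2, 0, 3, 1]
pw1, pb1, pw2 = permute_hidden(w1, b1, w2, perm)

inputs = [[rng.uniform(-2, 2) for _ in range(n_in)] for _ in range(100)]
max_gap = max(
    abs(forward(w1, b1, w2, b2, x) - forward(pw1, pb1, pw2, b2, x))
    for x in inputs
)
print("weights differ:", pw1 != w1)
print("largest output difference:", max_gap)
```

The largest difference should be zero up to floating-point rounding, since only the order of summation over hidden units changes.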

The motivating question:
Given two optimal models in a neural network's weight space, is it possible to find a path between them composed entirely of other optimal models?

In other words, can we find a continuous path of tweaks from the first optimal model to the second without reducing performance at any point in the process?
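To make the question concrete, here is a minimal sketch on an assumed toy problem, not the method used in the post. The "model" has two parameters `a` and `b`, and its loss `(a*b - 1)^2` is zero exactly when `a*b = 1`. The straight line between two optima passes through worse models, while a curved path that stays on `a*b = 1` does not. All names (`loss`, `linear_path`, `curved_path`, `max_loss_along`) are hypothetical.

```python
def loss(a, b):
    """Toy loss: zero exactly on the curve a * b == 1."""
    return (a * b - 1.0) ** 2


# Two optimal parameter settings (both satisfy a * b == 1).
start = (0.5, 2.0)
end = (2.0, 0.5)


def linear_path(t):
    """Straight-line interpolation in parameter space, t in [0, 1]."""
    return tuple((1 - t) * s + t * e for s, e in zip(start, end))


def curved_path(t):
    """Geometric interpolation of a, with b chosen so that a * b stays 1."""
    a = start[0] * (end[0] / start[0]) ** t
    return (a, 1.0 / a)


def max_loss_along(path, steps=100):
    """Worst loss seen at evenly spaced points along the path."""
    return max(loss(*path(i / steps)) for i in range(steps + 1))


linear_barrier = max_loss_along(linear_path)
curved_barrier = max_loss_along(curved_path)

print("max loss on straight line:", linear_barrier)
print("max loss on curved path:  ", curved_barrier)
```

In this toy case the straight line peaks at its midpoint `(1.25, 1.25)` with a loss of about 0.316, while the curved path stays at zero up to rounding. Whether such a path exists for real networks is the open question posed above.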

Ultimately, we... (read 2475 more words →)
