Decision Transformer Interpretability
TLDR: We analyse how a small Decision Transformer learns to simulate agents on a grid world task, providing evidence that it is possible to do circuit analysis on small models which simulate goal-directedness. We think Decision Transformers are worth exploring further and may provide opportunities to explore many alignment-relevant deep learning phenomena in game-like contexts.

Link to the GitHub Repository. Link to the Analysis App. I highly recommend using the app if you have experience with mechanistic interpretability. All of the mechanistic analysis should be reproducible via the app.

Key Claims

* A 1-Layer Decision Transformer learns several contextual behaviours which are activated by specific Reward-to-Go/Observation combinations on a simple discrete task.
* Some of these behaviours appear localisable to specific components and can be explained with simple attribution and the transformer circuits framework.
* The specific algorithm implemented is strongly affected by the lack of a one-hot-encoding scheme for the state/observations (initially left out for simplicity of analysis), which introduces inductive biases that hamper the model.

Illustrative code sketches of these three points appear at the end of this section.

If you are short on time, I recommend reading:

* Dynamic Obstacles Environment
* Black Box Model Characterisation
* Explaining Obstacle Avoidance at positive RTG using QK and OV circuits
* Alignment Relevance
* Future Directions

I would welcome assistance with:

* Engineering tasks: app development, improving the model, the training loop, the wandb dashboard, etc. (and help making nice diagrams and writing up the relevant maths/theory in the app).
* Research tasks: thinking more about exactly how to construct and interpret circuit analysis in the context of decision transformers, and translating ideas from LLMs/algorithmic tasks.
* Communication tasks: making nicer diagrams/explanations.
* I have a Trello board with a huge number of tasks ranging from small stuff to massive stuff. I'm also happy to collaborate.
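To make the first claim concrete, here is a minimal sketch (not the code from the repository; the module names, dimensions, and MiniGrid-style sizes are assumptions) of how a Decision Transformer interleaves Reward-to-Go, observation, and action embeddings into a single token sequence, which is what lets behaviour be conditioned on RTG/observation combinations.

```python
# Hedged sketch of Decision Transformer token construction.
# All names and sizes (d_model, n_obs_features, n_actions) are illustrative assumptions;
# timestep/positional embeddings are omitted for brevity.
import torch
import torch.nn as nn

d_model = 64          # residual stream width (assumed)
n_obs_features = 147  # e.g. a flattened 7x7x3 MiniGrid observation (assumed)
n_actions = 7         # MiniGrid action space size (assumed)

rtg_embed = nn.Linear(1, d_model)               # scalar Reward-to-Go -> embedding
obs_embed = nn.Linear(n_obs_features, d_model)  # observation -> embedding
act_embed = nn.Embedding(n_actions, d_model)    # discrete action -> embedding

def build_tokens(rtgs, obs, actions):
    """Interleave (RTG_t, s_t, a_t) triples into one token sequence.

    rtgs:    (batch, T, 1) float
    obs:     (batch, T, n_obs_features) float
    actions: (batch, T) long
    returns: (batch, 3*T, d_model)
    """
    r = rtg_embed(rtgs)
    s = obs_embed(obs)
    a = act_embed(actions)
    # Stack per timestep, then flatten so the order is R_1, s_1, a_1, R_2, s_2, a_2, ...
    tokens = torch.stack([r, s, a], dim=2)
    return tokens.flatten(1, 2)
```

With this interleaved layout, the action prediction read off at a state position can attend back to the RTG token, which is one natural way RTG-conditional behaviour could be implemented.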
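The second claim leans on the transformer circuits framework (Elhage et al., 2021), in which a one-layer model's attention heads can be summarised by their QK and OV circuits. The sketch below shows the standard construction; the weight names follow that framework, and the shapes (and the fact that the relevant embeddings and unembeddings here map observations/RTG to actions rather than a language vocabulary) are assumptions rather than the post's actual code.

```python
# Hedged sketch of QK/OV circuit computation for a single attention head,
# following the transformer circuits framework. Shapes are assumptions.
import torch

def qk_circuit(W_E, W_Q, W_K):
    """Effective attention score between pairs of input tokens.

    W_E: (d_vocab, d_model) embedding; W_Q, W_K: (d_model, d_head).
    Returns (d_vocab, d_vocab): how strongly each query token attends to each key token.
    """
    return W_E @ W_Q @ W_K.T @ W_E.T

def ov_circuit(W_E, W_V, W_O, W_U):
    """Effective contribution of an attended-to token to the output logits.

    W_V: (d_model, d_head); W_O: (d_head, d_model); W_U: (d_model, d_vocab).
    Returns (d_vocab, d_vocab): how attending to a token moves each output logit.
    """
    return W_E @ W_V @ W_O @ W_U
```

Inspecting these two matrices is what lets one ask, for example, which observation features a head attends to at positive RTG (QK) and which action logits that attention pushes up or down (OV).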
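The third claim is about the observation encoding. A hedged sketch of the difference: feeding raw integer object/colour/state indices (presumably what "lacking a one-hot encoding" amounts to) gives the model an arbitrary ordering over categories, whereas one-hot encoding each channel removes it. The 7x7x3 observation shape and channel sizes below are assumptions based on standard MiniGrid conventions, not taken from the post.

```python
# Hedged sketch: one-hot encoding a MiniGrid-style observation vs. raw indices.
# Channel sizes are assumptions (standard MiniGrid: 11 object types, 6 colours, 3 states).
import numpy as np

N_OBJECTS, N_COLOURS, N_STATES = 11, 6, 3

def one_hot_obs(obs):
    """obs: (7, 7, 3) integer array of (object, colour, state) indices.
    Returns a flat float vector with one one-hot block per channel per cell."""
    blocks = []
    for c, n in enumerate((N_OBJECTS, N_COLOURS, N_STATES)):
        blocks.append(np.eye(n)[obs[..., c]])  # (7, 7, n) one-hot slice
    return np.concatenate(blocks, axis=-1).reshape(-1)  # length 7*7*(11+6+3)

def raw_obs(obs):
    """Raw encoding: integer indices treated as magnitudes, so e.g. 'wall' and
    'ball' differ only by a scalar, an ordering the model has to learn to undo."""
    return obs.reshape(-1).astype(np.float32)  # length 7*7*3
```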