
alamerton — LessWrong

10mo

Epistemic status: a collection of intervention proposals for digital error correction in the context of lock-in. It reflects my own intervention ideas, and the opinion of Formation Research at the time of writing.

TL;DR

We believe lock-in risks are a pressing problem, and that the digital error correction properties of digital entities will make future lock-in scenarios more stable.

Introduction

We have identified 4 key threat models for lock-in: ways we believe undesirable lock-ins could manifest in the future. This post focuses on two specific threat models:

An autonomous AI system competently pursues a goal and prevents interference
An immortal AI-enabled malevolent actor, or whole-brain emulation of a malevolent actor, instantiates a lock-in

This post rests on the claim... (read 1445 more words →)

Organisation-Level Lock-In Risk Interventions

alamerton

11mo

Epistemic status: my own conjecture and speculation after thinking about organisation structures and dynamics as an intervention point for lock-in risk for about 5 hours. My thoughts here represent the opinion of Formation Research at the time of writing.

TL;DR

We believe lock-in risks are a pressing problem, and that existing organisation structures lend themselves to lock-in via power concentration. This post outlines how lock-in could result from organisation structures and dynamics, and what interventions might be implemented to reduce the risk.

We have identified 4 key threat models for lock-in: ways we believe undesirable lock-ins could manifest in the future. This post focuses on one specific threat model:

Decision-making power being concentrated into the hands... (read 2233 more words →)

Stacity: a Lock-In Risk Benchmark for Large Language Models

alamerton

Intro

So far we have identified lock-in risk, defined lock-in, and established threat models for particularly undesirable lock-ins. Now we present this evaluation benchmark for large language models (LLMs) so we can measure (or at least, get a proxy measure for) the risk level of LLMs.

AI is the key technology in the manifestation of lock-in risks; AI systems can contribute to lock-in autonomously/automatically and via misuse. There are specific behaviours in both of these categories such that, if an LLM displayed those behaviours, we say the LLM may have the proclivity to contribute to lock-in.

[Figure: Question-answer pair evaluating for the manipulation of information systems]

Evaluation Benchmark

We developed this benchmark by identifying the specific behaviours that... (read 231 more words →)

Lock-In Threat Models

alamerton

Epistemic status: a combination and synthesis of others' work, analysed and written over a few weeks. A high-level list of threat models that is open to criticism.

TL;DR

Humanity could end up in a lock-in within the next century. Here I outline the possible routes to that outcome, and prioritise these routes on a set of criteria for importance.

Existing Work

Lukas Finnveden

AGI and Lock-In (Finnveden et al., 2022) was authored by Lukas Finnveden during an internship at Open Philanthropy. AGI and Lock-In is currently the most detailed report on lock-in risk. The report expands on notes made on value lock-in by Jess Riedel, who co-authored the report along with Carl Shulman. The report references Nick Bostrom’s... (read 2134 more words →)

What is Lock-In?

alamerton

Epistemic status: a combination and synthesis of others' work, analysed and written over a few weeks. Early working definition that is open to criticism.

TL;DR

I create a definition of lock-in for use in future discussion and writing, and operationalise lock-in for future research. I define lock-in risks as the probabilities of situations in which features of the world, typically negative elements of human culture, are made stable for long periods of time.

Why Define Lock-In?

Lock-in is the central theme of Formation Research. Therefore before conducting any research on lock-in, it is important to create a strong working definition for use in subsequent discussion. The stronger the definition, the stronger the foundation on which the subsequent... (read 2646 more words →)

Formation Research: Organisation Overview

alamerton

Thank you to Adam Jones, Lukas Finnveden, Jess Riedel, Tianyi (Alex) Qiu, Aaron Scher, Nandi Schoots, Fin Moorhouse, and others for the conversations and feedback that helped me synthesise these ideas and create this sequence.

Epistemic Status: my own thoughts and research after thinking about lock-in and having conversations with people for a few months. This post summarises my thinking about lock-in risk.

TL;DR

This post gives an overview of Formation Research, an early-stage nonprofit startup research organisation working on lock-in risk, a neglected category of AI risk.

Introduction

I spent the last few months of my master’s degree working with Adam Jones on starting a new organisation focusing on a neglected area of AI safety... (read 3078 more words →)

In-Context Learning: An Alignment Survey

alamerton

Epistemic status: new to alignment; some background. I learned about alignment about 1.5 years ago and spent the last ~1 year getting up to speed on alignment through 12 AI safety-related courses and programmes while completing an artificial intelligence MSc programme. Ultimately this post is conjecture, based on my finite knowledge of the alignment problem. I plan to correct errors that are pointed out to me, so I encourage you to please point out those errors!

(Full version available here)

TL;DR

Much research has been conducted on in-context learning (ICL) since its emergence in 2020. This is a condensed survey of the existing literature regarding ICL, summarising the work in a number of research areas,... (read 5961 more words →)

Replying to A Review of In-Context Learning Hypotheses for Automated AI Alignment Research

alamerton

2y

I think I mean to say this would imply ICL could not be a new form of learning. And yes, it seems more likely that there could be at least some new knowledge getting generated, one way or another. BI implying all tasks have been previously seen feels extreme, and less likely. I've adjusted my wording a bit now.

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research

alamerton

This project has been completed as part of the Mentorship in Alignment Research Students (MARS London) programme under the supervision of Bogdan-Ionut Cirstea, on investigating the promise of automated AI alignment research. I would like to thank Bogdan-Ionut Cirstea, Erin Robertson, Clem Von Stengel, Alexander Gietelink Oldenziel, Severin Field, Aaron Scher, and everyone who commented on my draft, for the feedback and encouragement which helped me create this post.

TL;DR

The mechanism behind in-context learning is an open question in machine learning. There are different hypotheses on what in-context learning is doing, each with different implications for alignment. This document reviews the hypotheses which attempt to explain in-context learning, finding some overlap and good... (read 4502 more words →)

alamerton

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research

In-Context Learning: An Alignment Survey

Recommender Alignment for Lock-In Risk

Formation Research: Organisation Overview

alamerton

Digital Error Correction and Lock-In

Organisation-Level Lock-In Risk Interventions

Recommender Alignment for Lock-In Risk

Stacity: a Lock-In Risk Benchmark for Large Language Models

Lock-In Threat Models

What is Lock-In?

Formation Research: Organisation Overview

Lock-In
