We want to work towards a world in which the alignment problem is a mainstream concern among ML researchers. An important part of this is popularizing alignment-related concepts within the ML community. Here are a few recent examples:
- Reward hacking / misspecification (blog post)
- Convergent instrumental goals (paper)
- Objective robustness (paper)
- Assistance (paper)
(I'm sure this list is missing many examples; let me know if there are any in particular I should include.)
Meanwhile, alignment researchers have been thinking about many other concepts that are not yet well known within the ML community. Which of these would you most want to be more widely known and understood?
Human values exist within human-scale models of the world.