The 2019 Review votes are in!
This year, 88 voters participated, evaluating 116 posts. (Of those voters, 61 had 1000+ karma, and will be weighted more highly in the moderation team's decision of what to include in the Best of 2019 Books)
The LessWrong Moderation team will be reflecting on these results and using them as a major input into "what to include in the 2019 books."
Top Results
The top 15 results from the 1000+ karma users are:
- What failure looks like by Paul Christiano
- Risks from Learned Optimization: Introduction, by evhub, Chris van Merwijk, vlad_m, Joar Skalse and Scott Garrabrant
- The Parable of Predict-O-Matic, by Abram Demski
- Book Review: The Secret Of Our Success, by Scott Alexander
- Being the (Pareto) Best in the World, by johnswentworth
- Rule Thinkers In, Not Out, by Scott Alexander
- Book summary: Unlocking the Emotional Brain, by Kaj Sotala
- Asymmetric Justice, by Zvi Mowshowitz
- Heads I Win, Tails?—Never Heard of Her; Or, Selective Reporting and the Tragedy of the Green Rationalists, by Zack M. Davis
- 1960: The Year The Singularity Was Cancelled, by Scott Alexander
- Selection vs Control, by Abram Demski
- You Have About Five Words, by Raymond Arnold
- The Schelling Choice is "Rabbit", not "Stag", by Raymond Arnold
- Noticing Frame Differences, by Raymond Arnold
- "Yes Requires the Possibility of No", by Scott Garrabrant
Top Reviewers
Meanwhile, we also had a lot of great reviews. One of the most valuable things about the review process, for me, was that it looks at lots of great posts at once, which led me to find connections between them that I had previously missed. We'll be doing a more in-depth review of the best reviews later on, but for now, I wanted to give a shoutout to the people who did a bunch of great review work.
The top reviewers (aggregating the total karma of their review-comments) were:

Some things I particularly appreciated were:
- johnswentworth, Zvi, and others, who provided fairly comprehensive reviews of many different posts, taking stock of how they fit together.
- Jacobjacob and magfrump, who stuck out in my mind for doing particularly "epistemic spot check" type reviews, which are often more effortful.
Complete Results (1000+ Karma)
You can see the full voting results here: 1000+ karma voters (All voters)
To help users see the spread of the vote data, we've included swarmplot visualizations (a sketch of how such plots can be generated follows the notes below).
- Only votes with weights between -10 and 16 are plotted. Outliers are in the image captions.
- Gridlines are spaced 2 points apart.
- Concrete illustration: The plot immediately below has 18 votes ranging in strength from -3 to 12.

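For reference, here is a minimal sketch of how swarmplots like the ones described above could be produced. It is an illustration only, assuming a hypothetical CSV export with "post" and "vote_strength" columns; the actual plotting pipeline isn't shown in this post.

```python
# Illustrative sketch: plot the spread of vote strengths per post as a swarmplot.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

votes = pd.read_csv("review_votes_2019.csv")  # hypothetical export of the vote data

# Keep only votes in the plotted range; outliers get reported in the captions instead.
plotted = votes[votes["vote_strength"].between(-10, 16)]

ax = sns.swarmplot(data=plotted, x="vote_strength", y="post", size=3)
ax.xaxis.set_major_locator(MultipleLocator(2))  # gridlines spaced 2 points apart
ax.grid(axis="x")
plt.tight_layout()
plt.show()
```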
What does this mean, and what happens now?
(This section written by habryka, previous section written by Ray)
The goals of this review and vote were as follows:
- Create common knowledge about how the LessWrong community feels about various posts and the progress we've made.
- Improve our longterm incentives, feedback, and rewards for authors.
- Help create a highly curated "Best of 2019" Sequence and Book.
Over the next few months we will take the results of this vote and turn them into another curated collection of essays, just as we did with last year's results, which became the "A Map That Reflects the Territory" essay collection.
Voting, review and nomination participation was substantially greater this year than last year (something between a 30% and 80% increase, depending on which metrics you look at), which makes me hopeful about this tradition living on as a core piece of infrastructure for LessWrong. I was worried that participation would fall off after the initial excitement of last year, but I am no longer as worried about that.
Both this year and last year we have also seen little correlation between the vote results and the karma of the posts, which is an important sanity check I have for whether going through all the effort of this review is worth it. If the ranking were basically just the same as the karma scores of the posts, then we wouldn't be getting much information out of the review. But as it stands, I trust the results of this review much more than I would trust someone just pressing the "sort by karma" button on the all-posts page, and I think that as the site and community continue to grow, the robustness of the review will only become more important.
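As a concrete illustration of that sanity check, a rank correlation between karma and review score is one way to run it. The sketch below is hypothetical (the file and column names are made up), not the actual analysis behind the claim above.

```python
# Illustrative sketch of the karma-vs-review-score sanity check.
import pandas as pd
from scipy.stats import spearmanr

posts = pd.read_csv("review_results_2019.csv")  # hypothetical file
rho, p = spearmanr(posts["karma"], posts["review_score"])
print(f"Spearman rank correlation between karma and review score: {rho:.2f} (p = {p:.3f})")
```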
Thank you all for participating in this year's review. I am pleased with the results, and brimming with ideas for the new set of books that I am looking forward to implementing, and I think the results above are already a valuable resource for anyone deciding how best to catch up with all the great writing here on the site.
Unedited stream of thought:
Before trying to answer the question, I'm just gonna say a bunch of things that might not make sense (either because I am being unclear or being stupid).
So, I think the debate example is much more *about* manipulation than the iterated amplification example, so I was largely replying to the class that includes both IA and debate. I can imagine saying that iterated amplification done right does not provide an incentive to manipulate the human.
I think that a process optimizing directly for finding a fixed point of X = Amplify_H(X) does have an incentive to manipulate the human. However, this is not exactly what IA is doing, because it only passes gradients through the first X in the fixed-point equation, and I can imagine arguing that the incentive to manipulate comes from having the gradient pass through the second X. If you iterate enough times, I think you might effectively have some optimization juice passing through modifying the second X, but it might be much less. I am confused about how to think about how optimization towards a moving target differs from optimization towards finding a fixed point.
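To make the "gradients only pass through the first X" distinction concrete, here is a minimal, hypothetical PyTorch sketch (not from any actual IA implementation): the amplified target is detached, so the distilled model chases a moving target rather than differentiating through the fixed-point equation directly.

```python
# Illustrative sketch only: `model` stands in for X, `amplify` for Amplify_H.
import torch

model = torch.nn.Linear(8, 8)                  # the distilled agent X
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def amplify(answers: torch.Tensor) -> torch.Tensor:
    # Stand-in for Amplify_H(X): the human-plus-model composite process.
    return torch.tanh(answers @ torch.ones(8, 8) / 8)

questions = torch.randn(32, 8)

for step in range(100):
    first_x = model(questions)                     # gradients flow through this X
    second_x = amplify(model(questions)).detach()  # gradients blocked through this X
    # Because the target is detached, training chases a moving target rather than
    # directly minimizing the fixed-point residual between X and Amplify_H(X).
    loss = torch.nn.functional.mse_loss(first_x, second_x)
    opt.zero_grad()
    loss.backward()
    opt.step()
```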
I think that even if you only look at the effect of following the gradients coming from changing the first X, you are at least providing an incentive to predict the human on a wide range of inputs. In some cases, your range of inputs might be such that there isn't actually information about the human in the answers, which I think is where you are trying to get with the automated decomposition strategies. If humans have some innate ability to imitate some non-human process, and use that ability to answer the questions, and thinking about humans does not aid in thinking about that non-human process, I agree that you are not providing any incentive to think about the humans. However, it feels like a lot has to go right for that to work.
On the other hand, maybe we just think it is okay to predict, but not manipulate, the humans while they are answering questions that carry a lot of common information with how humans work, which is what I think IA is supposed to be doing. In this case, even if I were to say that there is no incentive to "manipulate the human," I still argue that there is an "incentive to learn how to manipulate the human," because predicting the human (on a wide range of inputs) is a very similar task to manipulating the human.
Okay, now I'll try to answer the question. I don't understand the question. I assume you are talking about the incentive to manipulate in the simple examples with permutations etc. in the experiments. I think there is no ability to manipulate those processes, and thus no gradient signal towards manipulation of the automated process. I still feel like there is some weird counterfactual incentive to manipulate the process, but I don't know how to say what that means, and I agree that it does not affect what actually happens in the system.
I agree that changing to a human will not change anything (except insofar as the system is told, or can deduce, that it is interacting with a human, and thus ignores the gradient signal in order to execute some treacherous turn). Anyway, in those worlds we have likely already lost, and I am not focusing on them. I think the short answer to your question is: in practice, no, there is no difference, and there isn't even an incentive to predict humans in strong generality, much less manipulate them, but that is because the examples are simple and not trying to have common information with how humans work.
I think there are two paths toward a crux for me here, and I'm sure we could find more: 1) being convinced that there is not an incentive to predict humans in generality (predicting humans only when they are very strictly following a non-humanlike algorithm doesn't count as predicting humans in generality), or 2) being convinced that this incentive to predict humans is sufficiently far from an incentive to manipulate them.