If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I think and write independently about value learning.
Fair enough.
Yes, it seems totally reasonable for bounded reasoners to consider hypotheses that would be counterfactual or even counterlogical for more idealized reasoners (where a hypothesis like 'the universe is as it would be from the perspective of prisoner #3' functions like treating prisoner #3 as 'an instance of me').
Typical bounded reasoning weirdness is stuff like seeming to take some counterlogicals (e.g. different hypotheses about the trillionth digit of pi) seriously despite denying 1+1=3, even though there's a chain of logic connecting one to the other. Projecting this into anthropics, you might have a certain systematic bias about which hypotheses you can consider, and yet deny that that systematic bias is valid when presented with it abstractly.
This seems like it makes drawing general lessons about what counts as 'an instance of me' from the fact that I'm a bounded reasoner pretty fraught.
I think it doesn't actually work for the repugnant conclusion - the buttons are supposed to be purely to the good, with no tradeoffs to deal with.
Once you start having to deal with tradeoffs, then you get into the aesthetics of population ethics - maybe you want each planet in the galaxy to have a vibrant civilization of happy humans, but past that, more happy humans just seems a bit gauche - i.e. there is some point past which the raw marginal utility of cramming more humans into the universe is negative. And a button promising an existing human extra life might be offered, but these humans are all immortal if they want to be anyhow, and their lives are so good that it's hard to identify a one-size-fits-all benefit one could even in theory supply via button without violating any conservation laws.
All of this is a totally reasonable way to want the future of the universe to be arranged, one that's incompatible with the repugnant conclusion - and still compatible with rejecting the person-affecting view, and with pressing the offered buttons in our current circumstances.
Suppose there are a hundred copies of you, in different cells. At random, one will be selected - that one is going to be shot tomorrow. A guard notifies that one that they're going to be shot.
There is a mercy offered, though - there's a memory-eraser-ray handy. The one who knows they're going to be shot is given the option to erase their memory of the warning and everything that followed, putting them in the same information state, more or less, as any of the other copies.
"Of course!" They cry. "Erase my memory, and I could be any of them - why, when you shoot someone tomorrow, there's a 99% chance it won't even be me!"
Then the next day comes, and they get shot.
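To spell out why the post-erasure reasoning fails, here's a toy simulation (just a sketch of the story's setup; none of the names are load-bearing): the ~1% figure is the frequency with which a randomly sampled copy turns out to be the victim, not the frequency with which the copy who actually received the warning is the victim.

```python
import random

TRIALS = 100_000
N_COPIES = 100

warned_copy_shot = 0   # how often the copy who received the warning gets shot
random_copy_shot = 0   # how often a copy sampled at random gets shot

for _ in range(TRIALS):
    victim = random.randrange(N_COPIES)   # the guards pick who will be shot...
    warned = victim                       # ...and warn exactly that copy

    # After erasure, the warned copy reasons as if they were sampled at random:
    sampled = random.randrange(N_COPIES)

    warned_copy_shot += (warned == victim)    # always true
    random_copy_shot += (sampled == victim)   # true about 1% of the time

print(warned_copy_shot / TRIALS)  # 1.0
print(random_copy_shot / TRIALS)  # ~0.01
```

Erasing the memory changes that copy's credences, but not which body the guards come for.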
I do like the idea of having "model organisms of alignment" (notably different from model organisms of misalignment).
Minecraft is a great starting point, but it would also be nice to try to capture two things: wide generalization, and inter-preference conflict resolution. Generalization because we expect future AI to be able to take actions and reach outcomes that humans can't, and preference conflict resolution because I want to see an AI that uses human feedback on how best to resolve those conflicts (rather than just a fixed regularization algorithm).
I did not believe this until I tried it myself, so yes, very paradox, much confusing.
Seemed well-targeted to me. I do feel like it could have been condensed a bit or had a simpler storyline or something, but it was still a good post.
I think 3 is close to an analogy where the LLM is using 'sacrifice' in a different way than we'd endorse on reflection, but why should it waste time rationalizing to itself in the CoT? All LLMs involved should just be fine-tuned not to worry about it - they're collaborators, not adversaries - and so collapse to the continuum between cases 1 and 2, where noncentral sorts of sacrifices get ignored first, gradually, until at the limit of RL all sacrifices are ignored.
Another thing to think about analogizing is how an AI doing difficult things in the real world is going to need to operate in domains that are automatically noncentral to human concepts. It's like if what we really cared about was whether the AI would sacrifice in 20-move-long sequences that we can't agree on a criterion to evaluate. You could try a sandwiching experiment where you sandwich both the pretraining corpus and the RL environment.
Perhaps I should have said that it's silly to ask whether "being like A" or "being like B" is the goal of the game.
I have a different concern than most people.
An AI that follows the English text rather than the reward signal in the chess example cannot, on that basis, be trusted to do good things when given English text plus human-approval reward in the real world. This is because following the text in simple, well-defined environments is solving a different problem than following values-laden text in the real world.
The problem that you need to solve for alignment in the real world is how to interpret the world in terms of preferences when even we humans disagree and are internally inconsistent, and how to generalize those preferences to new situations in a way that does a good job satisfying those same human preferences (which include metapreferences). Absent a case for why this is happening, a proposed AI design doesn't strike me as dealing with alignment.
When interpreting human values goes wrong, the AI's internal monologue does not have to sound malevolent or deceptive. It's not thinking "How can I deceive the humans into giving me more power so I can make more paperclips?" It might be thinking "How can I explain to the humans that my proposal is what will help them the most?" Perhaps, if you can find its internal monologue describing its opinions about human preferences in detail, those opinions might sound wrong to you (wait a minute, I don't want to be a wirehead god sitting on a lotus throne!), or maybe it's doing this generalization implicitly, and just understands the output of the paraphraser slightly differently than you would.
Yeah, that's true. The obvious way is you could have optimized micro, but that's kinda boring. What I mean is more like generalization to new activities for humans to do in Minecraft that humans would find fun, which would be a different kind of 'better at Minecraft.'
I mean it in a way where the preferences are modeled a little better than just "the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence." Sometimes humans appear to act according to fairly straightforward models of goal-directed action. However, the precise model, and the precise goals, may be different at different times (or with different modeling hyperparameters, and of course across different people) - and if you tried to model the human well at all the different times, you'd get a model that looked like physiology and lost the straightforward talk of goals/preferences.
Resolving preference conflicts is the process of stitching together larger preferences out of smaller preferences, without changing type signature. The reason literally-interpreted sentences don't really count is that interpreting them literally is using a smaller model than necessary - you can find a broader explanation for the human's behavior in context that still comfortably talks about goals/preferences.
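As a toy illustration of "without changing type signature" (everything here - the Outcome type, the example preferences, the weighted vote - is made up for this comment, not a proposal): if a preference is a function from two outcomes to a choice, then conflict resolution takes several of those and hands back another function of exactly the same type.

```python
from typing import Callable

# A "preference", for this sketch, is anything that takes two outcomes and says
# which one it favors: -1 for the first, +1 for the second, 0 if indifferent.
Outcome = str
Preference = Callable[[Outcome, Outcome], int]

def prefers_food(a: Outcome, b: Outcome) -> int:
    return (b == "eat") - (a == "eat")

def prefers_sleep(a: Outcome, b: Outcome) -> int:
    return (b == "sleep") - (a == "sleep")

def resolve(parts: list[Preference], weights: list[float]) -> Preference:
    """Stitch small preferences into a larger one with the same type signature.

    The weighted vote is just a stand-in for whatever conflict-resolution
    process you actually want; the point is only that the output is itself a
    Preference, so the stitching can keep going.
    """
    def combined(a: Outcome, b: Outcome) -> int:
        score = sum(w * p(a, b) for p, w in zip(parts, weights))
        return (score > 0) - (score < 0)
    return combined

overall = resolve([prefers_food, prefers_sleep], weights=[1.0, 2.0])
print(overall("eat", "sleep"))  # +1: the stitched-together preference favors "sleep"
```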