User Comment Replies

Two Neglected Problems in Human-AI Safety

6yΩ240

Thanks for your reply!

our value functions probably only "make sense" in a small region of possibility space, and just start behaving randomly outside of that.

Okay, that helps me understand what you're talking about a bit better. It sounds like the concept of a partial function, and in the ML realm like the notorious brittleness that makes systems incapable of generalizing or extrapolating outside of a limited training set. I understand why you're approaching this from the adversarial angle though, because I suppose you're concer...

3Wei Dai6y

Again, I don't have a definitive answer, but we do have some intuitions about which values are more and less arbitrary. For example, values about familiar situations that you learned as a child and values that have deep philosophical justifications (for example, valuing positive conscious experiences, if we ever solve the problem of consciousness and start to understand the valence of qualia) seem less arbitrary than values that are caused by cosmic rays that hit your brain in the past. Values that are the result of random extrapolations seem closer to the latter than the former.

Thinking this over, I guess what's happening here is that our values don't apply directly to physical reality, but instead to high-level mental models. So if a situation is too alien, our model building breaks down completely and we can't evaluate the situation at all. (This suggests that adversarial examples are likely also an issue for the modules that make up our model-building machinery. For example, a lot of ineffective charities might essentially be adversarial examples against the part of our brain that evaluates how much our actions are helping others.)

We can use philosophical reasoning, for example to try to determine if there is a right way to extrapolate from the parts of our values that seem to make more sense or are less arbitrary, or to try to determine if "objective morality" exists and if so what it says about the alien situations.

Not caring about value corruption is likely an error. If I can help ensure that their aligned AI helps them prevent or correct this error, I don't see why that's not a win-win.

Two Neglected Problems in Human-AI Safety

CyberByte

6yΩ260

What does it mean for human values to be vulnerable to adversarial examples? When we say this about AI systems (e.g. image classifiers), I think it's either because their judgments on manipulated situations/images are misaligned with ours/humans, or perhaps because they get the "ground truth" wrong. But how can a value system be misaligned with itself or different from the ground truth? For alignment purposes, isn't it itself the ground truth? It could of course fail to match "objective morality" if you believe in that, but in...

7Wei Dai6y

I'm not sure how to think about this formally, but intuitively, our value functions probably only "make sense" in a small region of possibility space, and just start behaving randomly outside of that. It doesn't seem right to treat that random behavior as someone's "real values" and try to maximize that.

I wouldn't want to corrupt the values of people who share roughly the same moral and philosophical outlook as myself, but if someone already has values that are very likely to be wrong (e.g., they just want to maximize the complexity of the universe, or how technologically advanced we are, or the glory of their god) I might be ok with trying to manipulate their values, especially if they're trying to do the same thing to me. The problem is that it's much easier for them to defend their values. Since they don't think they need further moral development, they can just tell their AI to block any outside messages that might cause any changes to their values, but I can't do that.

Other people may not think of the problem, or may not be as concerned about it as I am, and in some alignment schemes their AI would share their level of concern and not try very hard to avoid this problem. I don't want to see their values corrupted this way.

Even for myself, if AIs overall are accelerating technological development faster than moral/philosophical progress, it's unclear how I can avoid this problem even with the assistance of an aligned AI. The AI may be faced with many choices that it doesn't know how to answer directly, and it also doesn't know how to ask me for help without risking corrupting me. If the AI is conservative it might be paralyzed with indecision or be forced to make a lot of suboptimal decisions that seem "safe", and if it's not conservative enough it might corrupt me even though it's trying hard not to.

(I probably should have explained more in the OP, so I'm glad you're asking these questions.)


All of CyberByte's Comments + Replies