[LINK] Wait But Why - The AI Revolution Part 2

Adam Zerner

My apologies for taking so long to reply. I am particularly interested in this because if you (or someone) can provide me with an example of a value system that doesn't ultimately value the output of the value function, it would change my understanding of how value systems work. So far, the two arguments against my concept of a value/behavior system seem to rely on the existence of other things that are valuable in and of themselves or that there is just another kind of value system that might exist. The other terminal value thing doesn't hold much promise IMO because it's been debated for a very long time without someone having come up with a proof that definitely establishes that they exist (that I've seen). The "different kind of value system" holds some promise though because I'm not really convinced that we had a good idea of how value systems were composed until fairly recently and AI researchers seem like they'd be one of the best groups to come up with something like that. Also, if another kind of value system exists, that might also provide a proof that another terminal value exists too.

I've seen people talk about wireheading in this thread, but I've never seen anyone say that problems about maximizers-in-general are all implicitly problems about reward maximizers that assume that the wireheading problem has been solved. If someone has, please provide a link.

Obviously no one has said that explicitly. I asked why outcome maximizers wouldn't turn into reward maximizers and a few people have said that value stability when going from dumb-AI to super-AI is a known problem. Given the question to which they were responding, it seems likely that they meant that wireheading is a possible end point for an AI's values, but that it either would still be bad for us or that it would render the question moot because the AI would become essentially non-functional.

Instead of imagining intelligent agents (including humans) as 'things that are motivated to do stuff,' imagine them as programs that are designed to cause one of many possible states of the world according to a set of criteria. Google isn't 'motivated to find your search results.' Google is a program that is designed to return results that meet your search criteria.

It's the "according to a set of criteria" that is what I'm on about. Once you look more closely at that, I don't see why a maximizer wouldn't change the criteria so that it's it's constantly in a state where the actual current state of the world is the one that is closest to the criteria. If the actual goal is to meet the criteria, it may be easiest to just change the criteria.

The paperclip maximizer would not cause a state of the world in which it has a reward signal and its terminal goal is to maximize said reward signal because that would not be the one of all possible states of the world that contained the greatest integral of future paperclips.

This is begging the question. It assumes that no matter what, the paperclip optimizer has a fundamental goal of causing "the one of all possible states of the world that contains the greatest integral of future paperclips" and therefore it wouldn't maximize reward instead. Well, with that assumption that's a fair conclusion but I think the assumption may be bad.

I think having the goal to maximize x pre-foom doesn't means that it'll have that goal post-foom. To me, an obvious pitfall is that whatever the training mechanism for developing that goal was leaves a more direct goal of maximizing the trainer output because the reward is only correlated to the input by the evaluator function. Briefly, the reward is the output evaluator function and only correlated to the input of the evaluator so it makes more sense to optimize the evaluator than the input if what you care about is the output of the evaluation. If you care about the desired state being some particular thing and the output of the evaluator function and maintaining accurate input, then it makes more sense to manipulate the the world. But, this is a more complicated thing and I don't see how you would program in caring about keeping the desired state the same across time without relying on yet another evaluation function where you only care about the output of the evaluator. I don't see how to make a thing value something that isn't an evaluator.

You're suffering from typical mind fallacy.

Well, that may be but none of the schemes I've seen mentioned so far don't involve something with a value system. I am making the claim that for any value system, the thing that an agent values is that system outputting "this is valuable" and that any external state is only valuable because it produces that output. Perhaps I lack imagination, but so far I haven't seen an instance of motivation without values. Only assertions that it doesn't have to be the case or the implication that wireheading might be a instance of another case (value drift) and smart people are working on figuring out how that will work. The assertions about how this doesn't have to be the case seem to assume that it's possible to care about a thing in and of itself and I'm not convinced that that's true without also stipulating that you've got some part of the thing which the thing can't modify. Of course, if we can guarantee there's a part of the AI that it can't modify, then we should just be able to cram an instruction not to harm anyone for some definition of harm but figuring out how to define harm doesn't seem to be the only problem that the AI people have with AI values.

The stuff below here is probably tangential to the main argument and if refuted successfully, probably wouldn't change my mind about my main point that "something like wireheading is a likely outcome for anything with a value function that also has the ability to fully self modify" without some additional work to show why refuting them also invalidates the main argument.

Besides, an AI isn't going to expend any less energy turning the entire universe into hedonium than it would turning it into paperclips, right?

Caveat: Pleasure and reward are not the same thing. "Wirehead" and "hedomium" are words that were coined in connection with pleasure-seeking, not reward-seeking. They are easily confused because in our brains pleasure almost always triggers reward, but they don't have to be and we also get reward for things that don't cause pleasure and also for some things that cause pain like krokodil abuse whose contaminants actually cause dysphoria (as compared to pure desomorphine which does not). I continue to use words like wirehead and hedonium because they still work, but they are just analogies and I want to make sure that's explicit in case the analogy breaks down in the future.

Onward: I am not convinced that a wirehead AI would necessarily turn the universe into hedonium either. I see two ways that that might not come to pass without thinking about it too deeply:

1.) The hedonium maximizer assumes that maximizing pleasure or reward is about producing more pleasure or reward infinitely; that hedonium is a thing that, for each unit produced, continues to increase marginal pleasure. This doesn't have to be the case though. The measure of pleasure (or reward) doesn't need to be the number of pleasure (or reward) units, but may also be a function like the ratio of obtained units to the capacity to process those units. In that case, there isn't really a need to turn the universe into hedonium, only a need to make sure you have enough to match your ability to process it and there is no need to make sure your capacity to process pleasure/reward lasts forever, only to make sure that you continue to experience the maximum while you have the capacity. There are lots of functions whose maxima aren't infinity.

2.) The phrase "optimizing for reward" sort of carries an implicit assumption that this means planning and arranging for future reward, but I don't see why this should necessarily be the case either. Ishaan pointed out that once reward systems developed, the original "goal" of evolution quit being important to entities except insofar as they produced reward. Where rewards happened in ways that caused gene replication, evolution provided a force that allowed those particular reward systems to continue to exist and so there is some coupling between the reward-goal and the reproduction-goal. However, narcotics that produce the best stimulation of the reward center often lead their human users unable or unwilling to plan for the future. In both the reward-maximizer and the paperclip-maximizer case, we're (obviously) assuming that maximizing over time is a given, but why should it be? Why shouldn't an AI go for the strongest immediate reward instead? There's no reason to assume that a bigger reward box (via an extra long temporal dimension) will result in more reward for on entity unless we design the reward to be something like a sum of previous rewards. (Of course, my sense of time is not very good and so I may be overly biases to see immediate reward as worthwhile when an AI with a better sense of time might automatically go for optimization over all time. I am willing to grant more likelihood to "whatever an AI values it will try to optimize for in the future" than "an AI will not try to optimize for reward.")

27

[LINK] Wait But Why - The AI Revolution Part 2

27

27

27

[LINK] Wait But Why - The AI Revolution Part 2

27

27