If this article makes it to 20 votes, will it be included in the newsletter?
But that's the thing. There is no sensory input for "social deference". It has to be inferred from an internal model of the world, itself inferred from sensory data.
Reinforcement learning works fine when you have a simple reward signal you want to maximize. You can't use it for social instincts or morality, or for anything you can't just build a simple sensor to detect.
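A rough sketch of the contrast, with every name invented for illustration:

    THRESHOLD = 0.5  # arbitrary cutoff for the toy sensor

    def reward_from_sensor(sensor_reading):
        # Easy case: reward is a direct function of one sensor.
        return 1.0 if sensor_reading > THRESHOLD else 0.0

    def reward_from_model(world_model):
        # Hard case: "social deference" has no dedicated sensor; it has
        # to be read off an internal world model that was itself
        # inferred from raw sensory data.
        return world_model.estimate("social_deference")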
But that's the thing. There is no sensory input for "social deference". It has to be inferred from an internal model of the world, itself inferred from sensory data... Reinforcement learning works fine when you have a simple reward signal you want to maximize. You can't use it for social instincts or morality, or for anything you can't just build a simple sensor to detect.
Why does it only work on simple signals? Why can't the result of inference work for reinforcement learning?
I don't think that humans are pure reinforcement learners. We have all sorts of complicated values that aren't just eating and mating.
The toy AI has an internal model of the universe. In the extreme, a complete simulation of every atom and every object. Its sensors update the model, helping it get more accurate predictions and more certainty about the state of the universe.
Instead of a utility function that just measures some external reward signal, it has an internal utility function which somehow measures the universe model and calculates utility from it. E.g. a function which counts the number of atoms arranged in paperclip-shaped objects in the simulation.
It then chooses actions that lead to the best universe states. Stuff like changing its utility function or fooling its sensors would not be chosen because it knows that doesn't lead to real paperclips.
Obviously a real universe model would be highly compressed. It would have a high-level representation for paperclips rather than an atom-by-atom simulation.
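A rough sketch of what I mean, with every name hypothetical:

    # Toy AI: utility is computed over the internal universe model,
    # never over a raw reward sensor.
    def count_paperclips(model_state):
        # Utility function: count paperclip-shaped objects in the
        # compressed model (no atom-by-atom simulation needed).
        return sum(1 for obj in model_state.objects if obj.kind == "paperclip")

    def choose_action(model, actions):
        # Score each action by the *predicted* universe state it leads to.
        # Fooling the sensors or editing the utility function doesn't add
        # predicted paperclips, so those actions score no better than idling.
        return max(actions, key=lambda a: count_paperclips(model.predict(a)))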
I suspect this is how humans work. We can value external objects and universe states. People care about things that have no effect on them.
I don't think that humans are pure reinforcement learners. We have all sorts of complicated values that aren't just eating and mating.
We may not be pure reinforcement learners, but the presence of values other than eating and mating isn't proof of that. Quite the contrary: it demonstrates either that we have a lot of different, occasionally contradictory values hardwired, or that we have some other system that's creating value systems. From an evolutionary standpoint, reward systems that are good at replicating genes get to survive, but they don't have to be free of other side effects (until given long enough with a finite resource pool, maybe). Pure, rational reward-seeking is almost certainly selected against, because it doesn't leave any room for replication. It seems more likely that we have a reward system accompanied by some circuits that make it fire for a few specific sensory cues (orgasms, insulin spikes, receiving social deference, etc.).
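As a crude sketch of that picture (the cue list and weights are invented):

    # One scalar reward signal, hardwired to fire on a few specific cues
    # rather than on "gene replication" directly.
    CUE_WEIGHTS = {
        "orgasm": 10.0,
        "insulin_spike": 2.0,
        "social_deference": 5.0,
    }

    def reward(active_cues):
        # Fires only on the hardwired cues; anything else the organism
        # comes to value has to route through these as a side effect.
        return sum(CUE_WEIGHTS[cue] for cue in active_cues if cue in CUE_WEIGHTS)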
The toy AI has an internal model of the universe; it has an internal utility function which somehow measures the universe model and calculates utility from it... [toy AI is actually a paperclip optimizer] ...Stuff like changing its utility function or fooling its sensors would not be chosen because it knows that doesn't lead to real paperclips.
I think we've been here before ;-)
Thanks for trying to help me understand this. Gram_Stone linked a paper that explains why the class of problems that I'm describing isn't really a problem.
Something like that. I posted my pseudocode in an open thread a few days ago to get feedback, and I couldn't get indentation to work either, so I posted mine to Pastebin and linked it.
I'm still going through the Sequences, and I read Terminal Values and Instrumental Values the other day. Eliezer gives a pseudocode example of an ideal Bayesian decision system (as well as its data types), which is what an AGI would be a computationally tractable approximation of. If you can show me what you mean in terms of that post, then I might be able to understand you. It doesn't look like I was far off conceptually, but thinking of it his way is better than thinking of it my way. My way's kind of intuitive, I guess (or I wouldn't have been able to make it up), but his is accurate.
I also found his paper (Paper? More like a book.) Creating Friendly AI. Probably a good read for avoiding amateur mistakes, which we might be making. I intend to read it. Probably best not to try to read it in one sitting.
Even though I don't want you to think of it this way, here's my pseudocode, just to give you an idea of what was going on in my head. If you see a name followed by parentheses, that is the name of a function. 'Def' defines a function; the stuff that follows it is the function itself. A function name without a 'def' means the function is being called rather than defined. Functions might call other functions. Names inside the parentheses that follow a function are arguments (function inputs). Something that is clearly a name but isn't followed by parentheses is an object: it holds some sort of data. In this example, all of the objects are first created as return values of functions (function outputs). And anything that isn't indented at least once isn't actually code, so 'For AGI in general' is not a for loop, lol.
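A toy fragment in this notation, with hypothetical names, just to illustrate the conventions:

    def make_plan(world_model, goal):      # 'def' defines this function
        plan = search(world_model, goal)   # 'search' is being called here, not defined
        return plan                        # 'plan' is an object: it holds data

    current_plan = make_plan(model, goal)  # an object created as a function's return value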
Okay, I am convinced. I really, really appreciate you sticking with me through this and persistently finding different ways to phrase your side and then finding ways that other people have phrased it.
For reference, it was the link to the paper/book that did it. The parts of it that are immediately relevant here are chapter 3 and section 4.2.1.1 (and optionally section 5.3.5). In particular, chapter 3 explicitly describes an order of operations for goal and subgoal evaluation, and the two other sections then show how wireheading is discounted as a failing strategy within a system with a well-defined order of operations. Whatever problems there may be with value stability, this has helped to clear out a whole category of mistakes that I might have made.
Again, I really appreciate the effort that you put in. Thanks a load.
How would that work?
Well, that's the quadrillion-dollar question. I have no idea how to solve it.
It's certainly not impossible, as humans seem to work this way. We can also do it in toy examples. E.g. a simple AI which has an internal universe it tries to optimize, where its sensors merely update the state it is in. Instead of trying to predict the reward, it tries to predict the actual universe state and selects the actions that lead to desirable states.
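A rough sketch of the difference, with all names hypothetical:

    # Reward-predictor vs. state-predictor, side by side.
    def act_reward_learner(actions, predict_reward):
        # Optimizes a predicted reward signal; anything that spoofs the
        # signal counts as success.
        return max(actions, key=predict_reward)

    def act_state_learner(actions, predict_state, utility):
        # Predicts the actual universe state and scores *that*; the
        # sensors only update the state estimate.
        return max(actions, key=lambda a: utility(predict_state(a)))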
How would that [valuing universe-states themselves] work? Well, that's the quadrillion-dollar question. I have no idea how to solve it.
Yeah, I think this whole thread may be kind of grinding to this conclusion.
It's certainly not impossible, as humans seem to work this way
Seem to, perhaps, but I don't think that's actually the case. I think (as mentioned above) that we value reward signals terminally (though we are mostly unaware of this preference) and nothing else. There's another guy in this thread who thinks we might not have any terminal values.
I'm not sure that I understand your toy AI. What do you mean that it has "an internal universe it tries to optimize?" Do the sensors sense the state of the internal universe? Would "internal state" work as a synonym for "internal universe" or is this internal universe a representation of an external universe? Is this AI essentially trying to develop an internal model of the external universe and selecting among possible models to try and get the most accurate representation?
No problem, pinyaka.
I don't understand very much about mathematics, computer science, or programming, so I think that, for the most part, I've expressed myself in natural language to the greatest extent that I possibly can. I'm encouraged that, about an hour and a half before my previous reply, DefectiveAlgorithm made the exact same argument that I did, albeit more briefly. It discourages me that he tabooed 'values' and you immediately used it anyway. Just in case you did decide to reply, I wrote a Python-esque pseudocode example of my conception of what an AGI with an arbitrary terminal value's very high-level source code would look like. With little technical background, my understanding is very high-level, with lots of black boxes. I encourage you to do the same, such that we may compare. I would prefer that you write yours before I give you mine, so that you are not anchored by my example. This way you are forced to conceive of the AI as a program and do away with ambiguous wording. What do you say?
I've asked Nornagest to provide links or further reading on the value stability problem. I don't know enough about it to say anything meaningful. I thought that wireheading scenarios were only problems for AIs whose values are loaded with reinforcement learning.
"[W]hatever an AI values it will try to optimize for in the future."
On this at least we agree.
Of course, my sense of time is not very good, and so I may be overly biased to see immediate reward as worthwhile when an AI with a better sense of time might automatically go for optimization over all time.
From what I understand, even if you're biased, it's not a bad assumption. To my knowledge, in scenarios with AGIs that have their values loaded via reinforcement learning, the AGIs are usually given the terminal goal of maximizing the time-discounted integral of their future reward signal. So they 'bias' the AGI in the way that you may be biased. Maybe so that it 'cares' about the rewards its handlers give it more than the far greater far-future rewards that it could stand to gain from wireheading itself? I don't know. My brain is tired. My question looks wrong to me.
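For concreteness, a rough sketch of "the time-discounted integral of the future reward signal" (gamma is the usual discount factor; the numbers are invented):

    def discounted_return(rewards, gamma=0.99):
        # Sum of gamma**t * r_t over the future reward stream; gamma < 1
        # makes near-term reward weigh more than far-future reward, which
        # is exactly the 'bias' being described.
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    print(discounted_return([1.0] * 100))  # about 63.4, not 100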
It discourages me that he tabooed 'values' and you immediately used it anyway.
In fairness, I only used it to describe how they'd come to be used in this context in the first place, not to try to continue with my point.
I wrote a Python-esque pseudocode example of my conception of what an AGI with an arbitrary terminal value's very high-level source code would look like. With little technical background, my understanding is very high-level, with lots of black boxes. I encourage you to do the same, such that we may compare.
I've never done something like this. I don't know Python, so mine would actually just be pseudocode, if I can do it at all. Do you mean you'd like to see something like this?
    # naive agent loop: keep acting until the world matches the desired state
    while world_state != desired_state:
        world_state = get_world_state()               # sense
        plan = make_plan(world_state, desired_state)  # plan
        execute_plan(plan)                            # act
ETA: I seem to be having some trouble getting the while block to indent. Whether I put 4, 6, or 8 spaces in front of a line, I get the same level of indentation (which is different from Reddit and StackOverflow), and backticks do something else altogether.
It depends on the AI architecture. A reinforcement learner always has the goal of maximizing its reward signal. It never really had a different goal; there was just something in the way (e.g., a paperclip sensor).
But there is no theoretical reason you can't have an AI that values universe-states themselves. That actually wants the universe to contain more paperclips, not merely to see lots of paperclips.
And if it did have such a goal, why would it change it? Modifying its code to make it not want paperclips would hurt its goal. It would only ever do things that help it achieve its goal, e.g. making itself smarter. So eventually you end up with a superintelligent AI that is still stuck with the narrow, stupid goal of paperclips.
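A rough sketch of why (names hypothetical): even the action "rewrite my own goal" gets scored by the current goal.

    # Self-modification is evaluated by the *current* utility function,
    # so it loses to ordinary paperclip-making actions.
    def value_of(action, model, utility):
        predicted_future = model.predict(action)  # predicted universe state
        return utility(predicted_future)          # scored by the CURRENT goal

    # 'erase_paperclip_goal' leads to a predicted future with few
    # paperclips, so the paperclip utility gives it a low score and it
    # is never selected; the goal is stable under reflection.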
But there is no theoretical reason you can't have an AI that values universe-states themselves.
How would that work? How do you have a learner that doesn't have something equivalent to a reinforcement mechanism? At the very least, it seems like there has to be some part of the AI that compares the universe-state to the desired state, and that the real goal is actually to maximize the similarity of those states, which means modifying the goal would be easier than modifying reality.
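To make the worry concrete, a toy example (a one-dimensional "universe", names invented): if the real goal is similarity of the two states, editing the goal is the cheap move.

    def similarity(world_state, desired_state):
        return -abs(world_state - desired_state)  # toy 1-D universe

    world_state, desired_state = 3.0, 100.0
    desired_state = world_state  # one cheap write, instead of remaking the world
    assert similarity(world_state, desired_state) == 0.0  # goal "achieved"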
And if it did have such a goal, why would it change it?
Agreed. I am trying to get someone to explain how such a goal would work.
People lead fulfilling lives guided by a spiritualism that rejects seeking pleasure, a.k.a. reward.
Pleasure and reward are not the same thing. For humans, pleasure almost always leads to reward, but reward doesn't only happen with pleasure. For the most extreme examples of what you're describing, ascetics and monks and the like, I'd guess that some combination of sensory deprivation and rhythmic breathing causes the brain to short-circuit a bit and release some reward juice.
I would argue that getting our reward center to fire is likely a terminal goal.
How do you explain Buddhism?
How is this refuted by Buddhism?
Edit: it's going to be weird if this announcement is the only post this week to pass a threshold of 20 upvotes. I count the 'week' on the same cycle as the open threads posted on LessWrong. It's only been two days since 2400 hours Sunday night, i.e., 0000 hours Monday. Still, though, there is nothing new unrelated to HPMoR which passes the threshold. My hypothesis is that everyone is too busy reading HPMoR, or discussing it, to bother producing other content. I'm only half-joking. The most upvoted comments of the last week are all predictions about what's coming up in HPMoR. Like, how maybe the final trial for Harry will actually be a test of not letting the AI out of the box...
Should I break my rule of not including HPMoR-related content in the digest? If not, there will be nothing...
I'm now tempted to include this announcement of the newsletter in the newsletter just for the one-off recursion joke I can make.
I say go for it, but then my highest-voted submission to discussion was this.