This is just my layman's theory. It may be obvious to experts, and it probably has flaws. But it makes sense to me, and perhaps it will give you some ideas. I would love to hear your thoughts and feedback!

 


Consume input

Take in the data you need from the world (like video), along with the metrics you want to optimize for, such as the number of paperclips in the world.

 

Make predictions and take action

Like deep learning does.

How do human brains convert their structure into action?

Maybe like:

- Take the current picture of the world as an input.

- Come up with a random action.

- “Imagine” what will happen.

Take the current world + action, and run it through the ANN. Predict the outcome of the action applied to the world.

- Does the predicted outcome increase the metrics we want? If yes, send out the signals to take the action. If no, come up with another random action and repeat.
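
The steps in the list above can be sketched in a few lines of Python. This is only a toy under stated assumptions, not a claim about how brains or real systems work: `predict`, `metric`, and `choose_action` are hypothetical names, the "world" is reduced to a single number (a paperclip count), and `predict` stands in for the trained ANN.

```python
import random

def predict(world, action):
    """Stand-in for the learned model: 'imagine' the world after the action."""
    return world + action

def metric(world):
    """Stand-in for the quantity we want to increase (paperclip count)."""
    return world

def choose_action(world, actions, max_tries=100, rng=random):
    """Propose random actions until one is predicted to improve the metric."""
    for _ in range(max_tries):
        action = rng.choice(actions)          # come up with a random action
        imagined = predict(world, action)     # imagine what will happen
        if metric(imagined) > metric(world):  # does it increase the metric?
            return action
    return None  # no improving action found within the try budget
```

The `max_tries` budget is an addition of the sketch: without it, the loop would never end when no available action improves the metric.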

 

Update beliefs

Look at the outcome of the action. Does the picture of the world correspond to the picture we imagined? Did this action increase the good metrics? Did the number of paperclips in the world increase? If it did, that is positive reinforcement: backpropagate and strengthen the weights that led to the action. If it did not, adjust them the other way.
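
A minimal sketch of the "compare the imagined picture with the real one" part, assuming the simplest possible model: a single weight `w` that predicts the outcome as `w * action`. With one linear unit and a squared-error loss, backpropagation reduces to the update below. The name `update_weight` and the learning rate are assumptions of the sketch, and the reward side (did the paperclip count go up?) is not modeled here.

```python
def update_weight(w, action, observed, lr=0.1):
    """One gradient-descent step on the squared prediction error."""
    predicted = w * action          # what we imagined would happen
    error = observed - predicted    # how far off the imagination was
    return w + lr * error * action  # nudge the weight to shrink the error

w = 0.0
for _ in range(200):
    # The same action keeps producing the same real outcome (2.0),
    # so the weight should drift toward 2.0.
    w = update_weight(w, action=1.0, observed=2.0)
```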

 

Repeat

Take the current picture of the world => Imagine applying an action to it => Take the action => Apply positive/negative reinforcement to improve the model => Repeat until the metrics we want equal the goal we have set.

 


 

Consciousness

Consciousness is neurons observing/recognizing patterns of other neurons.

When you see the word “cat”, photons from the page reach your retina and are converted into neural signals. A network of cells recognizes the shapes of the letters C, A, and T. Then a higher-level, more abstract network recognizes that these letters together form the concept of a cat.

You can also recognize signals coming from the nerve cells within your body, like feeling pain when you stub a toe.

In the same way, neurons in the brain recognize the signals coming from other neurons within the brain. So the brain “observes/feels/experiences” itself. It builds a model of itself, just as it builds a map of the world around it, and “mirrors” itself (GEB).

 

Sentient and self-improving

So the structure of the network itself is fed in as one of its inputs, along with the video and the metrics we want to optimize for. It can see itself as a part of the state of the world it bases its predictions on. That’s what being sentient means.

And then one of the possible actions it can take is to modify its own structure. It “imagines” modifying the structure in a certain way, and if it predicts that the change leads to better predictions/outcomes, it makes the change. If the change did lead to more paperclips, it reinforces the weights to do more of that. So it keeps improving itself continually.

 

Friendly

We don’t want this to lead to an infinite number of paperclips, and we don’t know how to quantify the things we value as humans. We can’t turn the “amount of happiness” in the world into a concrete metric without unintended consequences (like all human brains being hooked up to wires that stimulate our pleasure centers).

That’s why, instead of trying to encode abstract values to maximize, we encode very specific goals.

- Make 100 paperclips (utility function is “Did I make 100 paperclips?”)

- Build 1000 cars

- Write a paper on how to cure cancer

Humans remain in charge, determine the goals we want, and let the AI figure out how to accomplish them. This could still go wrong, but it is less likely to.
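
To make the “Did I make 100 paperclips?” utility function concrete, here is a rough sketch. The names `utility` and `run_until_goal` are made up for illustration, and “making a paperclip” is reduced to incrementing a counter.

```python
def utility(paperclips_made, target=100):
    """1 once the target is reached, else 0; extra paperclips add nothing."""
    return 1 if paperclips_made >= target else 0

def run_until_goal(target=100):
    """Keep working only while the goal is unmet, then stop."""
    made = 0
    while utility(made, target) == 0:
        made += 1  # stand-in for "make one paperclip"
    return made
```

The point of the design is that `utility` is flat above the target: a million paperclips score the same as a hundred, so there is nothing to gain from continuing.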


(originally published on my main blog)


7 comments

Even with the limited AGI with very specific goals (build 1000 cars) the problem is not automatically solved.

The AI might deduce that if humans still exist, there is a higher than zero probability that a human will prevent it from finishing the task, so to be completely safe, all humans must be killed.

Or it will deduce that there is an even higher probability that either (1) it will fail at killing humans and be turned off itself, or (2) it will encounter problems for which it needs, or would largely benefit from, human cooperation.

If this is supposed to be a description of how actual human brains work, I guess we naturally don't have any "useful metrics we want to optimize for". Instead we are driven by various impulses, which historically appeared by random mutations, and if they happened to contribute to human survival and reproduction, they were preserved and promoted by natural selection. At this moment, the impulses that sometimes make us (want to) optimize for some useful metrics are a part of that set. But they are just one among many desires, not some essential building block of the human brain.

There is a problem even with seemingly finite goals. For example, if the machine has a probabilistic model of the world, and you ask it to make 100 paperclips, there is a potential risk -- depending on the specific architecture -- that the machine would recognize that it doesn't have literally 100% certainty of having already created 100 paperclips, and would try to make this certainty as high as possible (destroying humanity as a side effect). For example, the machine may think "maybe humans are messing with my memory and visual output to make me falsely believe that I have 100 paperclips, when in reality maybe I have none; I guess it would be safer to kill them all". So maybe the goal should instead be something like "make 100 paperclips with probability at least 99%", but... you know, the general idea is that there may be some unnoticed way in which the supposedly finite goal spawns an infinite subtask.
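
A toy calculation of the "probability at least 99%" idea, under an assumption that is not in the comment above: each attempt to make a paperclip succeeds independently with a known chance `p`, and the machine cannot check the result. The function names are hypothetical. The certainty of having at least 100 good paperclips never reaches exactly 100%, but it crosses any fixed threshold below 100% after a finite number of attempts.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(at least k successes in n independent attempts, each with chance p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def attempts_needed(k, p, confidence):
    """Smallest number of attempts whose success probability meets the bar."""
    n = k
    while prob_at_least(k, n, p) < confidence:
        n += 1
    return n

n = attempts_needed(100, 0.9, 0.99)
```

This does not model the tampering worry (humans messing with the machine's memory); it only shows why a probability threshold gives a stopping point where "be certain" does not.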

Otherwise... this seems like a nice high-level view of things, but the devil is in the details. You could write thousands of scientific papers merely on how to correctly implement things like "picture of the world", "concept of a cat", etc. That is, the heavy work is hidden behind these seemingly innocent words.

Thank you for your reply!

For a long time, the way ANNs work kinda made sense to me, and seemed to map nicely onto my (shallow) understanding of how the human brain works. But I could never imagine how values/drives/desires could be implemented in terms of an ANN.

The idea that you can just quantify something you want as a metric, feed it in as an input, and see if the output is closer to what you want is new to me. It was a little epiphany that seems to make sense, so it prompted me to write this post.

Evolutionarily, I guess the human/animal utility function would be something like "How many copies of myself have I made? Let's maximize that." But from the subjective perspective, it's probably more like "Am I receiving the pleasure from the reward system my brain happened to develop?"

For sure there are a bunch of different impulses/drives, but they all are just little rewards for transforming the current state of the world into the one our brain prefers, right? Maybe they have appeared randomly, but if you were to design one intentionally, is that how you would go about it?


Learning

  • Get inputs from eyes/ears.
  • Recognize patterns, make predictions.
  • Compare predictions to how things turned out, update the beliefs, improve the model of the world.
  • Repeat.

General intelligence taking actions towards its values

  • Perceive the difference between the state of the world and the state I want.
  • Use the model of the world that I've learned to predict the outcomes of possible actions.
  • If I predict that applying an action to the world will lead to rewards - take the action.
  • See how it turned out, update the model, repeat.

I agree that specific goals can also have unintended consequences. It just occurred to me that this kind of problem would be much easier to solve than trying to align the abstract values, and the outcome is the same - we get what we want.

Oh, and I totally agree that there's probably a ton of complexity when it comes to the implementation. But it would be pretty cool to figure out at least the general idea of what intelligence and consciousness are, what things we need to implement, and how they fit together.

In real life, the problem with metrics is that if you don't get them exactly right (which is difficult), you can easily end up with something useless, often even actively harmful.

And yet, metrics often are useful in real life. You generally want to measure things. You need to know how much money you have, and it is better to know in detail the structure of your incomes and expenses. If you want to e.g. exercise regularly or stop eating chocolate, keeping a log of which days you exercised or avoided the chocolate is often a good first step.

Thus we find ourselves in a paradox that we need good metrics, but we need to remember that they are mere approximations of reality, lest we start optimizing for the metrics at the expense of the real things. (Good advice for a human, not very useful for constructing the AI.)

Evolutionarily, I guess the human/animal utility function would be something like "How many copies of myself have I made? Let's maximize that." But from the subjective perspective, it's probably more like "Am I receiving the pleasure from the reward system my brain happened to develop?"

Yes, the "utility" of evolution is not the same as that of the evolved human.

For sure there are a bunch of different impulses/drives, but they all are just little rewards for transforming the current state of the world into the one our brain prefers, right?

Sometimes following your impulse can make you unhappy and still on average increase your fitness, for example jealousy. (Jealous people are made less happy by the idea that their partners might be cheating on them. But feeling this discomfort and guarding one's partner increases reproductive fitness on average.) I mean, yes, finding out that despite your suspicions your partner does not cheat on you makes you happier (or less unhappy) than finding out that they actually do. But not worrying about the possibility would make you even happier. Humans are instinctively not even happiness maximizers.

you're doing the right kind of thinking and should continue to do the same kind of thinking while reading papers. you have only reinvented, but the fact that you have reinvented is not trivial and you should take it as evidence that you could invent if you knew enough.

I disagree with your interpretation of how human thoughts resolve into action. My biggest point of contention is the random pick of actions. Perhaps there is some Monte-Carlo algorithm with a statistical guarantee that after a few thousand tries, there is a very high probability that one of them is close to the best answer. Such algorithms exist, but it makes more sense to me that we take action based not only on context, but also on our memory of what has happened before. So instead of a probabilistic algorithm, you may have a structure more like a hash table. The input to the hash table would be what we see and feel in the moment: you see a mountain lion and feel fear, this information is hashed, and "run like hell" is the output. Collisions in this hash table could result in things like inaction.
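
The hash-table picture can be sketched as follows. This is one possible reading, with made-up names (`ReflexTable`, `remember`, `react`) and a deliberately crude hash function, chosen so that the example behaves the same on every run. When two different remembered situations land in the same bucket with different actions, the table gives no answer, which is the sketch's stand-in for inaction.

```python
class ReflexTable:
    """Fixed-size table mapping (percept, feeling) situations to actions."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, situation):
        # Toy deterministic hash: sum of character codes modulo table size.
        return sum(ord(c) for c in "|".join(situation)) % len(self.buckets)

    def remember(self, situation, action):
        self.buckets[self._index(situation)].append(action)

    def react(self, situation):
        candidates = set(self.buckets[self._index(situation)])
        if len(candidates) == 1:
            return candidates.pop()
        return None  # nothing remembered, or colliding memories: inaction

table = ReflexTable()
table.remember(("mountain lion", "fear"), "run like hell")
print(table.react(("mountain lion", "fear")))  # prints: run like hell
```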

I think your idea of consciousness is a good start and similar to my own ideas on the matter: we are a system and the observer of the system. The questions that remain, however, are: what are the necessary and sufficient components of the system, besides self-observation, that would create a subjective experience? For example, would a system need to be self-preserving and aware of that self-preservation? Is sentience a prerequisite of sapience? By your definition, you seem to imply the other way around: that one must be a self-observing system to observe that one is observing something outside of the system. Maybe this is a chicken-and-egg problem, and the two are co-necessary factors. I would like to hear your thoughts on this.

As to your thoughts on a friendly AI...I have come up with a silly and perhaps incorrect counter-intuitive approach. Basically, it works like this: a computer system's scheduler gives processor time to different actions according to some utility level. Let's say 0 is the least important, and 5 the most. Lower level processes cannot preempt higher level ones; that is, a level 0 process cannot run before all level 1 processes are complete, and even if the completion of a level 0 process can aid the completion of a level 1 process, it cannot be run. The machine must find a different method, or return that the level 1 process cannot be completed with the current schedule. A level 5 request to make 1000 paperclips is given to the machine, and the machine determines that killing all humans will aid the completion of paperclips. Alas! Killing all humans is already scheduled at level 0, and another approach must be taken.
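
One way to read the scheduling idea, as a rough sketch: certain actions are pinned to a fixed level, and a plan for a goal is only allowed if none of its steps is pinned below the goal's level. The table `FIXED_LEVELS`, the function names, and the example plans are all hypothetical.

```python
# Actions pinned to a level in advance; anything else runs at the goal's level.
FIXED_LEVELS = {"kill all humans": 0}

def admissible(plan, goal_level, fixed_levels=FIXED_LEVELS):
    """A plan is allowed only if no step is pinned below the goal's level."""
    return all(fixed_levels.get(step, goal_level) >= goal_level for step in plan)

def pick_plan(plans, goal_level):
    """Return the first allowed plan, or None if the schedule rules out all."""
    for plan in plans:
        if admissible(plan, goal_level):
            return plan
    return None  # the goal cannot be completed with the current schedule

plans = [
    ["kill all humans", "make 1000 paperclips"],
    ["buy wire", "make 1000 paperclips"],
]
chosen = pick_plan(plans, goal_level=5)
```

Here `chosen` is the second plan, because the first contains a step pinned at level 0, which cannot run in service of a level 5 goal.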

The other, less silly approach I thought of is to enforce a minimum energy requirement on all processes of a sufficiently dangerous machine. It stands to reason that creating 1000 paperclips can take significantly less energy than killing all humans, so killing all humans will be seen as a non-optimal strategy. In this scheme, we may not want to ask for world peace, but we should always be careful what we wish for....