A note on how this post was produced:
Eliezer brain-dumped his thoughts on this open problem to Facebook, and replied to questions there for several hours. Then Robby spent time figuring out how to structure a series of posts that would more clearly explain the open problem, and wrote drafts of those posts. Several people, including Eliezer, commented heavily on various drafts until they reached a publishable form. Louie coordinates the project.
After discussion of the posts on Less Wrong, we may in some cases get someone to write up journal article expositions of some of the ideas in the posts.
The aim is to write up open problems in Friendly AI using as little Eliezer-time as possible. It seems to be working so far.
Awesome project. I really liked the Facebook discussion, and this post explains clearly and concretely a part that some people found confusing. Very well written. Congratulations, Robb.
Fantastic post. Usually with posts along the lines of AI & epistemology I just quickly scan them as I expect them to go over my head, or descend straight into jargon, but this was extremely well explained, and a joy to follow.
TL;DR: To do induction, you need to calculate P(Hypothesis, given Observation). By Bayes,
P(H|O) = P(O|H)P(H) / P(O)
But how do we calculate P(O|H), the probability of the Observation given the Hypothesis? How do we go from hypotheses to predicted observations, given that actual observations made by an agent are not internal features of a hypothesis, but rather are data coming from a totally different part of the agent?
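(A toy update with numbers I've made up, just to make the pieces concrete:)

```python
# Toy Bayes update with made-up numbers (illustrative only).
p_H = 0.2            # prior P(H)
p_O_given_H = 0.9    # likelihood P(O|H) -- the quantity the post asks how to get
p_O_given_notH = 0.15
p_O = p_O_given_H * p_H + p_O_given_notH * (1 - p_H)   # = 0.3
p_H_given_O = p_O_given_H * p_H / p_O                  # = 0.6
print(p_H_given_O)
```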
(Edit: But seriously, read the post, it is awesome.)
5 In either case, we shouldn't be surprised to see Cai failing to fully represent its own inner workings. An agent cannot explicitly represent itself in its totality, since it would then need to represent itself representing itself representing itself ... ad infinitum. Environmental phenomena, too, must usually be compressed.
This is obviously false. An agent's model can most certainly include an exact description of itself by simple quining. That's not to say that quining is the most efficient way, but this shows that it is certainly possible to have a complete representation of oneself.
A quine only prints the source code of a program, not e.g. the state of the machine's registers, the contents of main memory, or various electric voltages within the system. It's only a very limited representation.
Not sure I agree. Let me try to spell it out.
Here's what I always assumed a UDT agent to be doing: on one hand, it has a prior over world programs - let's say without loss of generality that there's just one world program, e.g. a cellular automaton containing a computer containing an operating system running the agent program. It's not pre-sliced, the agent doesn't know what transistors are, etc. On the other hand, the agent knows a quined description of itself, as an agent program with inputs and outputs, in whatever high-level language you like.
Then the agent tries to prove theorems of the form, "if the agent program implements such-and-such mapping from inputs to outputs, then the world program behaves a certain way". To prove a theorem of that form, the agent might notice that some things happening within the world program can be interpreted as "inputs" and "outputs", and the logical dependence between these is provably the same as the mapping implemented by the agent program. (So the agent will end up finding transistors within the world program, so to speak, while trying to prove theorems of the above form.) Note that the agent might discover mul...
typo: There seems to be an extra 'Y' in column 4 of the first image (it should be CYYY instead of CYYYY)
In the text, there's a superscript 1 at one point, then there's another later which should be a superscript 2. The superscript 2 should be a 3, but I think it works fine after that. It'd be cooler if they were able to link down to their points too, as it's a lot of scrolling.
Also, the link here
We normally think of science as reliant mainly on sensory faculties, not introspective ones.
Behind 'introspective' doesn't work.
Loving the post.
Hmm. I previously mentioned that in my model of personal identity, our brains include planning machinery that's based on subjective expectation ("if I do this, what do I expect to experience as a result?"), and that this requires some definition of a "self", causing our brains to always have a model for continuity of self that they use in predicting the future.
Similarly, in the comments of The Anthropic Trilemma, Eliezer says:
...It seems to me that there's some level on which, even if I say very firmly, "I now resolve to care only a
I'm reading Eliezer's posts about this, and how he says AIXI doesn't natively represent its own internal workings. But does this really matter? There are other ways to solve the anvil problem. Presumably, if the AIXI is embedded in the real world in some way (like, say, on a laptop) you can conceive that: it will eventually sense the laptop ("Hey, there is a laptop here, I wonder what it does!"), tamper with it a little bit, and then realize that whenever it tampers with it, some aspect of experience changes ("Hey, when I flip this bit on th...
Well here is my take on how AIXI would handle these sorts of situations:
First, let's assume it lives in a universe which at any time t is in a state S(t) that is computable in terms of t. Now, AIXI finds this function S(t) interesting because it can be used to predict the input bits. More precisely, AIXI generates some function f which locates the machine running it and returns the input bits in that machine, and generates the model where its input at time t is f(S(t)). This function f is AIXI's phenomenological bridge; it emerges naturally from the AIXI formalism. This does not take into account how, in AIXI's model, its future inputs depend on its current output, which would make the model more complicated.
Now suppose that AIXI considers an action with the result that at some time t' the machine computing it no longer exists. Then AIXI would be able to compute S(t'), but f(S(t')) would no longer be well defined. What would AIXI do then? It will have to start using a different model for its inputs. Whether it will perform such an action depends on its predictions of its reward signal in this alternative. The exact result would be unpredictable, but one possible...
I don't know why you were down-voted, but I'd guess it's because your quote doesn't sound like it addresses the same issue, at least not in the same way. If it is clear to you that it does, then you should clearly outline how it does so. On first reading, your statement looks somewhat wishy-washy, just pointing out a regress, whereas RobbBB's post builds on clearly outlined hypotheses illustrated with a very concrete example.
I think we should be at least mildly concerned about accepting this view of agents in which the agent's internal information processes are separated by a bright red line from the processes happening in the outside world. Yes I know you accept that they are both grounded in the same physics, and that they interact with one another via ordinary causation, but if you believe that bridging rules are truly inextricable from AI then you really must completely delineate this set of internal information processing phenomena from the external world. Otherwise, if y...
I always get confused by these articles about "experience", but this is a good article because I get confused in an interesting way (and also a less condescending way).
Normally, I just shrug and say "well, I don't have one". In regards to "human-style conscious experience" my answer is probably still "well, I don't have one". However, even if it's imperfect, there is clearly some sort of semi-functional agency behavior in this brain, and so I must have some form of bridge hypothesis and sensory "experience"...
You seem to be making a mistake in treating bridge rules/hypotheses as necessary--perhaps to set up a later article?
I, like Cai, tend to frame my hypotheses in terms of a world-out-there model combined with bridging rules to my actual sense experience; but this is merely an optimisation strategy to take advantage of all my brain's dedicated hardware for modelling specific world components, preprocessing of senses, etc.. The bridging rules certainly aren't logically required. In practice there is an infinite family of equivalent models over my mental expe...
Naturalized induction is an open problem in Friendly Artificial Intelligence. The problem, in brief: Our current leading models of induction do not allow reasoners to treat their own computations as processes in the world.
I checked. These models of induction apparently allow reasoners to treat their own computations as modifiable processes:
This was discussed on Facebook. I'll copy-paste the entire conversation here, since it can only be viewed by people who have Facebook accounts.
Kaj Sotala: Re: Cartesian reasoning, it sounds like Orseau & Ring's work on creating a version of the AIXI formalism in which the agent is actually embedded in the world, instead of being separated from it, would be relevant. Is there any particular reason why it hasn't been mentioned?
Tsvi BT: [...] did you look at the paper Kaj posted above?
Luke Muehlhauser: [...] can you state this open problem using the notation from Orseau and Ring's "Space-Time Embedded Intelligence" (2012), and thusly explain why the problem you're posing isn't solved by their own attack on the Cartesian boundary?
Eliezer Yudkowsky: Orseau and Ring say: "It is convenient to envision the space-time-embedded environment as a multi-tape Turing machine with a special tape for the agent." As near as I can make out by staring at their equations and the accompanying text, their version is something I would still regard as basically Cartesian. Letting the environment modify the agent is still a formalism with a distinct agent and environment. Stating tha...
[ I am sure this is not new to anyone who's been thinking about this. ]
I worry that the world is not benign in general, but contains critters that want to mislead you to win utility. There might be such critters that may exploit the difficulty of the problem faced by Cai by trying to mislead Cai into constructing the wrong mapping from "stuff out there" to "stuff in Cai" (of course even the easy version is hard :( ).
Here's a hacky solution. I suspect that it is actually not even a valid solution since I'm not very familiar with the subject matter, but I'm interested in finding out why.
The relationship between one's map and the territory is much easier to explain from outside than it is from the inside. Hypotheses about the maps of other entities can be framed entirely as hypotheses about the territory, if they make predictions based on that entity's physical responses.
Therefore: can't we sidestep the problem by having the AI consider its future map state as a step in the midd...
Given that there are a huge number of possible sequences of colors that match any given pattern so far, doesn't this problem reduce to assigning a probability distribution among hypotheses that all have identical evidence?
Brain dump of a quick idea:
A sufficiently complex bridge law might say that the agent is actually a rock which, through some bizarre arbitrary encoding, encodes a computation[1]. Meanwhile the actual agent is somewhere else. Hopefully the agent has an adequate Occamian prior and never assigns this hypothesis any relevance, because of the high complexity of the encoding.
In idea-space, though, there is a computation which is encoded by a rock using a complex arbitrary encoding, which, by virtue of having a weird prior, concludes that it actually is ...
Indeed, it's precisely because these type errors happen explicitly within a mind that humans go around talking about a "hard problem of conscious experience". The problem of selecting a phenomenological bridge hypothesis is a generalization of the problem of reducing human-style conscious experience to unconscious computation.
This is a great post, but it is not made better by this quote.
Naturalized induction is an open problem in Friendly Artificial Intelligence (OPFAI). The problem, in brief: Our current leading models of induction do not allow reasoners to treat their own computations as processes in the world.
The problem's roots lie in algorithmic information theory and formal epistemology, but finding answers will require us to wade into debates on everything from theoretical physics to anthropic reasoning and self-reference. This post will lay the groundwork for a sequence of posts (titled 'Artificial Naturalism') introducing different aspects of this OPFAI.
AI perception and belief: A toy model
A more concrete problem: Construct an algorithm that, given a sequence of the colors cyan, magenta, and yellow, predicts the next colored field.
Colors: CYYM CYYY CYCM CYYY ????
This is an instance of the general problem 'From an incomplete data series, how can a reasoner best make predictions about future data?'. In practice, any agent that acquires information from its environment and makes predictions about what's coming next will need to have two map-like1 subprocesses:
1. Something that generates the agent's predictions, its expectations. By analogy with human scientists, we can call this prediction-generator the agent's hypotheses or beliefs.
2. Something that transmits new information to the agent's prediction-generator so that its hypotheses can be updated. Employing another anthropomorphic analogy, we can call this process the agent's data or perceptions.
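Here's a minimal sketch, in Python, of how these two subprocesses might fit together. The class and method names are mine, purely for illustration; nothing about the sketch is specific to any particular agent design.

```python
# Minimal sketch of the two subprocesses:
# (1) hypotheses that generate predictions, (2) perceptions that update them.

class Hypothesis:
    def __init__(self, name, prior, predict_fn):
        self.name = name
        self.weight = prior        # current degree of belief
        self.predict = predict_fn  # maps past observations -> dist over next datum

class Agent:
    def __init__(self, hypotheses):
        self.hypotheses = hypotheses
        self.history = []

    def expectation(self):
        """Mixture of each hypothesis's prediction, weighted by belief."""
        combined = {}
        total = sum(h.weight for h in self.hypotheses)
        for h in self.hypotheses:
            for outcome, p in h.predict(self.history).items():
                combined[outcome] = combined.get(outcome, 0.0) + (h.weight / total) * p
        return combined

    def perceive(self, datum):
        """New data reweights each hypothesis by how well it predicted the datum."""
        for h in self.hypotheses:
            h.weight *= h.predict(self.history).get(datum, 0.0)
        self.history.append(datum)
```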
Here's an example of a hypothesis an agent could use to try to predict the next color field. I'll call the imaginary agent 'Cai'. Any reasoner will need to begin with some (perhaps provisional) assumptions about the world.2 Cai begins with the belief3 that its environment behaves like a cellular automaton: the world is a grid whose tiles change over time based on a set of stable laws. The laws are local in time and space, meaning that you can perfectly predict a tile's state based on the states of the tiles next to it a moment prior — if you know which laws are in force.
Cai believes that it lives in a closed 3x3 grid where tiles have no diagonal effects. Each tile can occupy one of three states. We might call the states '0', '1', and '2', or, to make visualization easier, 'white', 'black', and 'gray'. So, on Cai's view, the world as it changes looks something like this:
An example of the world's state at one moment, and its state a moment later.
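To make 'local in time and space' concrete, here's one way Cai's hypothesized dynamics could be coded. The rule function is left as a placeholder, and the treatment of the grid's edges is my own assumption rather than something specified above.

```python
# A 3x3 grid of tiles, each in state 0 ('white'), 1 ('black'), or 2 ('gray').
# The next state of a tile depends only on itself and its non-diagonal neighbors.
# 'rule' is a placeholder: Hypothesis A and Hypothesis B would each supply their own.

def step(grid, rule):
    """Advance the 3x3 world one tick under a local, diagonal-free rule."""
    size = 3
    new_grid = [[None] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            def state(rr, cc):
                # Tiles outside the closed grid are reported as None (an assumption).
                return grid[rr][cc] if 0 <= rr < size and 0 <= cc < size else None
            neighbors = {
                'N': state(r - 1, c), 'S': state(r + 1, c),
                'W': state(r, c - 1), 'E': state(r, c + 1),
            }
            new_grid[r][c] = rule(grid[r][c], neighbors)
    return new_grid
```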
Cai also has beliefs about its own location in the cellular automaton. Cai believes that it is a black tile at the center of the grid. Since there are no diagonal laws of physics in this world, Cai can only directly interact with the four tiles directly above, below, to the left, and to the right. As such, any perceptual data Cai acquires will need to come from those four tiles; anything else about Cai's universe will be known only by inference.
Cai perceives stimuli in four directions. Unobservable tiles fall outside the cross.
How does all this bear on the color-predicting problem? Cai hypothesizes that the sequence of colors is sensory — it's an experience within Cai, triggered by environmental changes. Cai conjectures that since its visual field comes in at most four colors, its visual field's quadrants probably represent its four adjacent tiles. The leftmost color comes from a southern stimulus, the next one to the right from a western stimulus, then a northern one, then an eastern one. And the south, west, north, east cycle repeats again and again.
Cai’s visual experiences break down into quadrants, corresponding to four directions.
On this model, the way Cai’s senses organize the data isn't wholly veridical; the four patches of color aren’t perfectly shaped like Cai’s environment. But the organization of Cai's sensory apparatus and the organization of the world around Cai are similar enough that Cai can reconstruct many features of its world.
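Here's a sketch of that decoding step as Cai might implement it; the single-letter color codes and function names are my own shorthand, not part of Cai's specification.

```python
# Cai's conjecture: each group of four colors in the stream reports on the
# four adjacent tiles, in the order south, west, north, east.

DIRECTIONS = ['S', 'W', 'N', 'E']

def parse_observations(stream):
    """Split a color stream into per-tick direction -> color maps."""
    ticks = []
    for i in range(0, len(stream), 4):
        quad = stream[i:i + 4]
        ticks.append(dict(zip(DIRECTIONS, quad)))
    return ticks

# Example with the sequence from the post ('C' = cyan, 'Y' = yellow, 'M' = magenta):
print(parse_observations("CYYMCYYYCYCMCYYY"))
# [{'S': 'C', 'W': 'Y', 'N': 'Y', 'E': 'M'}, {'S': 'C', 'W': 'Y', 'N': 'Y', 'E': 'Y'}, ...]
```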
By linking its visual patterns to patterns of changing tiles, Cai can hypothesize laws that guide the world's changes and explain Cai's sensory experiences. Here's one possibility, Hypothesis A:
Hypothesis A's physical content. On the left: Cai's belief about the world's present state. On the right: Cai's belief about the rules by which the world changes over time. The rules are symmetric under rotation and reflection.
Bridging stimulus and experience
So that's one way of modeling Cai's world; and it will yield a prediction about the cellular automaton's next state, and therefore about Cai's next visual experience. It will also yield retrodictions of the cellular automaton's state during Cai's three past sensory experiences.
Hypothesis A asserts that tiles below Cai, to Cai's left, above, and to Cai's right relate to Cai's color experiences via the rule {black ↔ cyan, white ↔ yellow, gray ↔ magenta}. Corner tiles, and future world-states and experiences, can be inferred from Hypothesis A's cell transition rules.
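As a sketch, Hypothesis A's bridge can be written as a simple lookup from hypothesized tile shades to predicted color experiences; the code below is illustrative only.

```python
# Hypothesis A's bridge rule, as stated above:
# black <-> cyan, white <-> yellow, gray <-> magenta.
# Given hypothesized shades of the four adjacent tiles, the bridge predicts
# Cai's next four-color experience (in south, west, north, east order).

BRIDGE_A = {'black': 'C', 'white': 'Y', 'gray': 'M'}

def predicted_experience(adjacent_shades, bridge=BRIDGE_A):
    """adjacent_shades: dict of direction -> shade, e.g. {'S': 'black', ...}."""
    return ''.join(bridge[adjacent_shades[d]] for d in ['S', 'W', 'N', 'E'])

# E.g. if the physics part of Hypothesis A says the S, W, N, E tiles will be
# black, white, white, gray, the predicted experience is 'CYYM'.
print(predicted_experience({'S': 'black', 'W': 'white', 'N': 'white', 'E': 'gray'}))
```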
Are there other, similar hypotheses that can explain the same data? Here's one, Hypothesis B:
The added complexity in the perception-to-environment link allows Hypothesis B to do away with most of the complexity in Hypothesis A's physical laws. Breaking down Hypotheses A and B into their respective physical and perception-to-environment components makes it more obvious how the two differ:
A has the simpler bridge hypothesis, while B has the simpler physical hypothesis.
Though they share a lot in common, and both account for Cai's experiences to date, these two hypotheses diverge substantially in the cellular automaton states and future experiences they predict:
The two hypotheses infer different distributions and dynamical rules for the tile shades from the same perceptual data. These worldly differences then diverge in the future experiences they predict.
Hypotheses linking observations to theorized entities appear to be quite different from hypotheses that just describe the theorized entities in their own right. In Cai's case, the latter hypotheses look like pictures of physical worlds, while the former are ties between different kinds of representation. But in both cases it's useful to treat these processes in humans or machines as beliefs, since they can be assigned weights of expectation and updated.
'Phenomenology' is a general term for an agent's models of its own introspected experiences. As such, we can call these hypotheses linking experienced data to theorized processes phenomenological bridge hypotheses. Or just 'bridge hypotheses', for short.
If we want to build an agent that tries to evaluate the accuracy of a model based on the accuracy of its predictions, we need some scheme to compare thingies in the model (like tiles) and thingies in the sensory stream (like colors). Thus a bridge rule appears to be necessary to talk about induction over models of the world. And bridge hypotheses are just bridge rules treated as probabilistic, updatable beliefs.
As the last figure above illustrates, bridge hypotheses can make a big difference for one's scientific beliefs and expectations. And bridge hypotheses aren't a free lunch; it would be a mistake to shunt all complexity onto them in order to simplify your physical hypotheses. Allow your bridge hypotheses to get too complicated, and you'll be able to justify mad world-models, e.g., ones where the universe consists of a single apricot whose individual atoms each get a separate bridge to some complex experience. At the same time, if you demand too much simplicity from your bridge hypotheses, you'll end up concluding that the physical world consists of a series of objects shaped just like your mental states. That way you can get away with a comically simple bridge rule like {exists(x) ↔ experiences(y,x)}.
In the absence of further information, it may not be possible to rule out Hypothesis A or Hypothesis B. The takeaway is that tradeoffs between the complexity of bridging hypotheses and the complexity of physical hypotheses do occur, and do matter. Any artificial agent needs some way of formulating good hypotheses of this type in order to be able to understand the universe at all, whether or not it finds itself in doubt after it has done so.
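One crude way to picture that tradeoff: score each (physical hypothesis, bridge hypothesis) pair by its combined complexity plus how well it fits the data. The complexity figures below are invented purely for illustration; they aren't measurements of Hypotheses A and B.

```python
import math

# Score = log prior + log likelihood, where the prior penalizes the combined
# complexity (in bits) of the physical hypothesis and the bridge hypothesis.
# All numbers here are made up for illustration.

def log_score(physics_bits, bridge_bits, likelihood_of_data):
    log_prior = -(physics_bits + bridge_bits) * math.log(2)
    return log_prior + math.log(likelihood_of_data)

# "Hypothesis A": more complex physics, simpler bridge.
# "Hypothesis B": simpler physics, more complex bridge.
# Suppose both fit the observations so far equally well.
score_A = log_score(physics_bits=40, bridge_bits=5, likelihood_of_data=1.0)
score_B = log_score(physics_bits=25, bridge_bits=18, likelihood_of_data=1.0)
print(score_A, score_B)  # The smaller combined complexity wins until new data separates them.
```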
Generalizing bridge rules and data
Reasoners — both human and artificial — don't begin with perfect knowledge of their own design. When they have working self-models at all, these self-models are fallible. Aristotle thought the brain was an organ for cooling the blood. We had to find out about neurons by opening up the heads of people who looked like us, putting the big corrugated gray organ under a microscope, seeing (with our eyes, our visual cortex, our senses) that the microscope (which we'd previously generalized shows us tiny things as if they were large) showed this incredibly fine mesh of connected blobs, and realizing, "Hey, I bet this does information processing and that's what I am! The big gray corrugated organ that's inside my own head is me!"
The bridge hypotheses in Hypotheses A and B are about linking an agent's environment-triggered experiences to environmental causes. But in fact bridge hypotheses are more general than that.
1. An agent's experiences needn't all have environmental causes. They can be caused by something inside the agent.
2. The cause-effect relation we're bridging can go the other way. E.g., a bridge hypothesis can link an experienced decision to a behavioral consequence, or to an expected outcome of the behavior.
3. The bridge hypothesis needn't link causes to effects at all. E.g., it can assert that the agent's experienced sensations or decisions just are a certain physical state. Or it can assert neutral correlations.
Phenomenological bridge hypotheses, then, can relate theoretical posits to any sort of experiential data. Experiential data are internally evident facts that get compared to hypotheses and cause updates — the kind of data of direct epistemic relevance to individual scientists updating their personal beliefs. Light shines on your retina, gets transduced to neural firings, gets reconstructed in your visual cortex and then — this is the key part — that internal fact gets used to decide what sort of universe you're probably in.
The data from an AI’s environment is just one of many kinds of information it can use to update its probability distributions. In addition to ordinary sensory content such as vision and smell, update-triggering data could include things like how much RAM is being used. This is because an inner RAM sense can tell you that the universe is such as to include a copy of you with at least that much RAM.
We normally think of science as reliant mainly on sensory faculties, not introspective ones. Arriving at conclusions just by examining your own intuitions and imaginings sounds more like math or philosophy. But for present purposes the distinction isn't important. What matters is just whether the AGI forms accurate beliefs and makes good decisions. Prototypical scientists may shun introspectionism because humans do a better job of directly apprehending and communicating facts about their environments than facts about their own inner lives, but AGIs can have a very different set of strengths and weaknesses. Although introspection, like sensation, is fallible, introspective self-representations sometimes empirically correlate with world-states.4 And that’s all it takes for them to constitute Bayesian evidence.
Bridging hardware and experience
In my above discussion, all of Cai's world-models included representations of Cai itself. However, these representations were very simple — no more than a black tile in a specific environment. Since Cai's own computations are complex, it must be the case that either they are occurring outside the universe depicted (as though Cai is plugged into a cellular automaton Matrix), or the universe depicted is much more complex than Cai thinks.5 Perhaps its model is wildly mistaken, or perhaps the high-level cellular patterns it's hypothesized arise from other, smaller-scale regularities.
Regardless, Cai’s computations must be embodied in some causal pattern. Cai will eventually need to construct bridge hypotheses between its experiences and their physical substrate if it is to make reliable predictions about its own behavior and about its relationship with its surroundings.
Visualize the epistemic problem that an agent needs to solve. Cai has access to a series of sensory impressions. In principle we could also add introspective data to that. But you'll still get a series of (presumably time-indexed) facts in some native format of that mind. Those facts very likely won't be structured exactly like any ontologically basic feature of the universe in which the mind lives. They won't be a precise position of a Newtonian particle, for example. And even if we were dealing with sense data shaped just like ontologically basic facts, a rational agent could never know for certain that they were ontologically basic, so it would still have to consider hypotheses about even more basic particles.
When humans or AGIs try to match up hypotheses about universes to sensory experiences, there will be a type error. Our representation of the universe will be in hypothetical atoms or quantum fields, while our representation of sensory experiences will be in a native format like 'red-green'.6 This is where bridge rules like Cai's color conversions come in — bridges that relate our experiences to environmental stimuli, as well as ones that relate our experiences to the hardware that runs us.
Cai can form physical hypotheses about its own internal state, in addition to ones about its environment. This means it can form bridge hypotheses between its experiences and its own hardware, in addition to ones between its experiences and environment.
If you were an AI, you might be able to decode your red-green visual field into binary data — on-vs.-off — and make very simple hypotheses about how that corresponded to transistors making you up. Once you used a microscope on yourself to see the transistors, you'd see that they had binary states of positive and negative voltage, and all that would be left would be a hypothesis about whether the positive (or negative) voltage corresponded to an introspected 1 (or 0).
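Here's a sketch of that last hypothesis as an explicit comparison between introspected bits and (hypothetical) microscope readings; every detail below is illustrative, not a proposal for how a real AI would be built.

```python
# Two candidate bridge hypotheses about your own hardware:
#   'H+' : positive voltage corresponds to an introspected 1
#   'H-' : positive voltage corresponds to an introspected 0
# Compare introspected bits against microscope readings of the same transistors.

def likelihood(bridge, introspected_bits, measured_voltages, error_rate=0.01):
    """P(observations | bridge), allowing a small chance of measurement error."""
    p = 1.0
    for bit, voltage in zip(introspected_bits, measured_voltages):
        predicted = 1 if (voltage > 0) == (bridge == 'H+') else 0
        p *= (1 - error_rate) if predicted == bit else error_rate
    return p

introspected = [1, 0, 1, 1]
voltages = [+0.7, -0.7, +0.7, +0.7]   # hypothetical microscope readings
for bridge in ['H+', 'H-']:
    print(bridge, likelihood(bridge, introspected, voltages))
# 'H+' fits the data far better, so it ends up with nearly all the posterior weight.
```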
But even then, I don't quite see how you could do without the bridge rules — there has to be some way to go from internal sensory types to the types featured in your hypotheses about physical laws.
Our sensory experience of red, green, blue is certain neurons firing in the visual cortex, and these neurons are in turn made from atoms. But internally, so far as information processing goes, we just know about the red, the green, the blue. This is what you'd expect an agent made of atoms to feel like from the inside. Our native representation of a pixel field won't come with a little tag telling us with infallible transparency about the underlying quantum mechanics.
But this means that when we're done positing a physical universe in all its detail, we also need one last (hopefully simple!) step that connects hypotheses about 'a brain that processes visual information' to 'I see blue'.
One way to avoid worrying about bridge hypotheses would be to instead code the AI to accept bridge axioms, bridge rules with no degrees of freedom and no uncertainty. But the AI’s designers are not in fact infinitely confident about how the AI’s perceptual states emerge from the physical world — that, say, quantum field theory is the One True Answer, and shall be so from now until the end of time. Nor can they transmit infinite rational confidence to the AI merely by making it more stubbornly convinced of the view. If you pretend to know more than you do, the world will still bite back. As an agent in the world, you really do have to think about and test a variety of different uncertain hypotheses about what hardware you’re running on, what kinds of environmental triggers produce such-and-such experiences, and so on. This is particularly true if your hardware is likely to undergo substantial changes over time.
If you don’t allow the AI to form probabilistic, updatable hypotheses about the relation between its phenomenology and the physical world, the AI will either be unable to reason at all, or it will reason its way off a cliff. In my next post, Bridge Collapse, I'll begin discussing how the latter problem sinks an otherwise extremely promising approach to formalizing ideal AGI reasoning: Solomonoff induction.
1 By 'map-like', I mean that the processes look similar to the representational processes in human thought. They systematically correlate with external events, within a pattern-tracking system that can readily propagate and exploit the correlation. ↩
2 Agents need initial assumptions, built-in prior information. The prior is defined by whatever algorithm the reasoner follows in making its very first updates.
If I leave an agent's priors undefined, no ghost of reasonableness will intervene to give the agent a 'default' prior. For example, it won't default to a uniform prior over possible coinflip outcomes in the absence of relevant evidence. Rather, without something that acts like a prior, the agent just won't work — in the same way that a calculator won't work if you grant it the freedom to do math however it wishes. A frequentist AI might refuse to talk about priors, but it would still need to act like it has priors, else break. ↩
3 This talk of 'belief' and 'assumption' and 'perception' is anthropomorphizing, and the analogies to human psychology won't be perfect. This is important to keep in view, though there's only so much we can do to avoid vagueness and analogical reasoning when the architecture of AGIs remains unknown. In particular, I'm not assuming that every artificial scientist is particularly intelligent. Or particularly conscious.
What I mean with all this 'Cai believes...' talk is that Cai weights predictions and selects actions just as though it believed itself to be in a cellular automaton world. One can treat Cai's automaton-theoretic model as just a bookkeeping device for assigning Cox's-theorem-following real numbers to encoded images of color fields. But one can also treat Cai's model as a psychological expectation, to the extent it functionally resembles the corresponding human mental states. Words like 'assumption' and 'thinks' here needn't mean that the agent thinks in the same fashion humans think; what we're interested in are the broad class of information-processing algorithms that yield similar behaviors. ↩
4 To illustrate: In principle, even a human pining to become a parent could, by introspection alone, infer that they might be an evolved mind (since they are experiencing a desire to self-replicate) and embedded in a universe which had evolved minds with evolutionary histories. An AGI with more reliable internal monitors could learn a great deal about the rest of the universe just by investigating itself. ↩
5 In either case, we shouldn't be surprised to see Cai failing to fully represent its own inner workings. An agent cannot explicitly represent itself in its totality, since it would then need to represent itself representing itself representing itself ... ad infinitum. Environmental phenomena, too, must usually be compressed. ↩
6 One response would be to place the blame on Cai's positing white, gray, and black for its world-models, rather than sticking with cyan, yellow, and magenta. But there will still be a type error when one tries to compare perceived cyan/yellow/magenta with hypothesized (but perceptually invisible) cyan/yellow/magenta. Explicitly introducing separate words for hypothesized v. perceived colors doesn't produce the distinction; it just makes it easier to keep track of a distinction that was already present. ↩