Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.
I'm currently reading through some relevant literature for preparing my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will posting these in a series of "Concept Safety"-titled articles.
In The Problem of Alien Concepts, I posed the following question: if your concepts (defined as either multimodal representations or as areas in a psychological space) previously had N dimensions and then they suddenly have N+1, how does that affect (moral) values that were previously only defined in terms of N dimensions?
I gave some (more or less) concrete examples of this kind of a "conceptual expansion":
- Children learn to represent dimensions such as "height" and "volume", as well as "big" and "bright", separately at around age 5.
- As an inhabitant of the Earth, you've been used to people being unable to fly and landowners being able to forbid others from using their land. Then someone goes and invents an airplane, leaving open the question of the height to which the landowner's control extends. Similarly for satellites and nation-states.
- As an inhabitant of Flatland, you've been told that the inside of a certain rectangle is a forbidden territory. Then you learn that the world is actually three-dimensional, leaving open the question of the height of which the forbidden territory extends.
- An AI has previously been reasoning in terms of classical physics and been told that it can't leave a box, which it previously defined in terms of classical physics. Then it learns about quantum physics, which allow for definitions of "location" which are substantially different from the classical ones.
As a hint of the direction where I'll be going, let's first take a look at how humans solve these kinds of dilemmas, and consider examples #1 and #2.
The first example - children realizing that items have a volume that's separate from their height - rarely causes any particular crises. Few children have values that would be seriously undermined or otherwise affected by this discovery. We might say that it's a non-issue because none of the children's values have been defined in terms of the affected conceptual domain.
As for the second example, I don't know the exact cognitive process by which it was decided that you didn't need the landowner's permission to fly over their land. But I'm guessing that it involved reasoning like: if the plane flies at a sufficient height, then that doesn't harm the landowner in any way. Flying would become impossible difficult if you had to get separate permission from every person whose land you were going to fly over. And, especially before the invention of radar, a ban on unauthorized flyovers would be next to impossible to enforce anyway.
We might say that after an option became available which forced us to include a new dimension in our existing concept of landownership, we solved the issue by considering it in terms of our existing values.
Concepts, values, and reinforcement learning
Before we go on, we need to talk a bit about why we have concepts and values in the first place.
From an evolutionary perspective, creatures that are better capable of harvesting resources (such as food and mates) and avoiding dangers (such as other creatures who think you're food or after their mates) tend to survive and have offspring at better rates than otherwise comparable creatures who are worse at those things. If a creature is to be flexible and capable of responding to novel situations, it can't just have a pre-programmed set of responses to different things. Instead, it needs to be able to learn how to harvest resources and avoid danger even when things are different from before.
How did evolution achieve that? Essentially, by creating a brain architecture that can, as a very very rough approximation, be seen as consisting of two different parts. One part, which a machine learning researcher might call the reward function, has the task of figuring out when various criteria - such as being hungry or getting food - are met, and issuing the rest of the system either a positive or negative reward based on those conditions. The other part, the learner, then "only" needs to find out how to best optimize for the maximum reward. (And then there is the third part, which includes any region of the brain that's neither of the above, but we don't care about those regions now.)
The mathematical theory of how to learn to optimize for rewards when your environment and reward function are unknown is reinforcement learning (RL), which recent neuroscience indicates is implemented by the brain. An RL agent learns a mapping from states of the world to rewards, as well as a mapping from actions to world-states, and then uses that information to maximize the amount of lifetime rewards it will get.
There are two major reasons why an RL agent, like a human, should learn high-level concepts:
- They make learning massively easier. Instead of having to separately learn that "in the world-state where I'm sitting naked in my cave and have berries in my hand, putting them in my mouth enables me to eat them" and that "in the world-state where I'm standing fully-clothed in the rain outside and have fish in my hand, putting it in my mouth enables me to eat it" and so on, the agent can learn to identify the world-states that correspond to the abstract concept of having food available, and then learn the appropriate action to take in all those states.
- There are useful behaviors that need to be bootstrapped from lower-level concepts to higher-level ones in order to be learned. For example, newborns have an innate preference for looking at roughly face-shaped things (Farroni et al. 2005), which develops into a more consistent preference for looking at faces over the first year of life (Frank, Vul & Johnson 2009). One hypothesis is that this bias towards paying attention to the relatively-easy-to-encode-in-genes concept of "face-like things" helps direct attention towards learning valuable but much more complicated concepts, such as ones involved in a basic theory of mind (Gopnik, Slaughter & Meltzoff 1994) and the social skills involved with it.
Viewed in this light, concepts are cognitive tools that are used for getting rewards. At the most primitive level, we should expect a creature to develop concepts that abstract over situations that are similar with regards to the kind of reward that one can gain from taking a certain action in those states. Suppose that a certain action in state s1 gives you a reward, and that there are also states s2 - s5 in which taking some specific action causes you to end up in s1. Then we should expect the creature to develop a common concept for being in the states s2 - s5, and we should expect that concept to be "more similar" to the concept of being in state s1 than to the concept of being in some state that was many actions away.
"More similar" how?
In reinforcement learning theory, reward and value are two different concepts. The reward of a state is the actual reward that the reward function gives you when you're in that state or perform some action in that state. Meanwhile, the value of the state is the maximum total reward that you can expect to get from moving that state to others (times some discount factor). So a state A with reward 0 might have value 5 if you could move from it to state B, which had a reward of 5.
Below is a figure from DeepMind's recent Nature paper, which presented a deep reinforcement learner that was capable of achieving human-level performance or above on 29 of 49 Atari 2600 games (Mnih et al. 2015). The figure is a visualization of the representations that the learning agent has developed for different game-states in Space Invaders. The representations are color-coded depending on the value of the game-state that the representation corresponds to, with red indicating a higher value and blue a lower one.
As can be seen (and is noted in the caption), representations with similar values are mapped closer to each other in the representation space. Also, some game-states which are visually dissimilar to each other but have a similar value are mapped to nearby representations. Likewise, states that are visually similar but have a differing value are mapped away from each other. We could say that the Atari-playing agent has learned a primitive concept space, where the relationships between the concepts (representing game-states) depend on their value and the ease of moving from one game-state to another.
In most artificial RL agents, reward and value are kept strictly separate. In humans (and mammals in general), this doesn't seem to work quite the same way. Rather, if there are things or behaviors which have once given us rewards, we tend to eventually start valuing them for their own sake. If you teach a child to be generous by praising them when they share their toys with others, you don't have to keep doing it all the way to your grave. Eventually they'll internalize the behavior, and start wanting to do it. One might say that the positive feedback actually modifies their reward function, so that they will start getting some amount of pleasure from generous behavior without needing to get external praise for it. In general, behaviors which are learned strongly enough don't need to be reinforced anymore (Pryor 2006).
Why does the human reward function change as well? Possibly because of the bootstrapping problem: there are things such as social status that are very complicated and hard to directly encode as "rewarding" in an infant mind, but which can be learned by associating them with rewards. One researcher I spoke with commented that he "wouldn't be at all surprised" if it turned out that sexual orientation was learned by men and women having slightly different smells, and sexual interest bootstrapping from an innate reward for being in the presence of the right kind of a smell, which the brain then associated with the features usually co-occurring with it. His point wasn't so much that he expected this to be the particular mechanism, but that he wouldn't find it particularly surprising if a core part of the mechanism was something that simple. Remember that incest avoidance seems to bootstrap from the simple cue of "don't be sexually interested in the people you grew up with".
This is, in essence, how I expect human values and human concepts to develop. We have some innate reward function which gives us various kinds of rewards for different kinds of things. Over time we develop a various concepts for the purpose of letting us maximize our rewards, and lived experiences also modify our reward function. Our values are concepts which abstract over situations in which we have previously obtained rewards, and which have become intrinsically rewarding as a result.
Getting back to conceptual expansion
Having defined these things, let's take another look at the two examples we discussed above. As a reminder, they were:
- Children learn to represent dimensions such as "height" and "volume", as well as "big" and "bright", separately at around age 5.
- As an inhabitant of the Earth, you've been used to people being unable to fly and landowners being able to forbid others from using their land. Then someone goes and invents an airplane, leaving open the question of the height to which the landowner's control extends.
I summarized my first attempt at describing the consequences of #1 as "it's a non-issue because none of the children's values have been defined in terms of the affected conceptual domain". We can now reframe it as "it's a non-issue because the [concepts that abstract over the world-states which give the child rewards] mostly do not make use of the dimension that's now been split into 'height' and 'volume'".
Admittedly, this new conceptual distinction might be relevant for estimating the value of a few things. A more accurate estimate of the volume of a glass leads to a more accurate estimate of which glass of juice to prefer, for instance. With children, there probably is some intuitive physics module that figures out how to apply this new dimension for that purpose. Even if there wasn't, and it was unclear whether it was the "tall glass" or "high-volume glass" concept that needed be mapped closer to high-value glasses, this could be easily determined by simple experimentation.
As for the airplane example, I summarized my description of it by saying that "after an option became available which forced us to include a new dimension in our existing concept of landownership, we solved the issue by considering it in terms of our existing values". We can similarly reframe this as "after the feature of 'height' suddenly became relevant for the concept of landownership, when it hadn't been a relevant feature dimension for landownership before, we redefined landownership by considering which kind of redefinition would give us the largest amounts of rewarding things". "Rewarding things", here, shouldn't be understood only in terms of concrete physical rewards like money, but also anything else that people have ended up valuing, including abstract concepts like right to ownership.
Note also that different people, having different experiences, ended up making redefinitions. No doubt some landowners felt that the "being in total control of my land and everything above it" was a more important value than "the convenience of people who get to use airplanes"... unless, perhaps, they got to see first-hand the value of flying, in which case the new information could have repositioned the different concepts in their value-space.
As an aside, this also works as a possible partial explanation for e.g. someone being strongly against gay rights until their child comes out of the closet. Someone they care about suddenly benefiting from the concept of "gay rights", which previously had no positive value for them, may end up changing the value of that concept. In essence, they gain new information about the value of the world-states that the concept of "my nation having strong gay rights" abstracts over. (Of course, things don't always go this well, if their concept of homosexuality is too strongly negative to start with.)
The Flatland case follows a similar principle: the Flatlanders have some values that declared the inside of the rectangle a forbidden space. Maybe the inside of the rectangle contains monsters which tend to eat Flatlanders. Once they learn about 3D space, they can rethink about it in terms of their existing values.
Dealing with the AI in the box
This leaves us with the AI case. We have, via various examples, taught the AI to stay in the box, which was defined in terms of classical physics. In other words, the AI has obtained the concept of a box, and has come to associate staying in the box with some reward, or possibly leaving it with a lack of a reward.
Then the AI learns about quantum mechanics. It learns that in the QM formulation of the universe, "location" is not a fundamental or well-defined concept anymore - and in some theories, even the concept of "space" is no longer fundamental or well-defined. What happens?
Let's look at the human equivalent for this example: a physicist who learns about quantum mechanics. Do they start thinking that since location is no longer well-defined, they can now safely jump out of the window on the sixth floor?
Maybe some do. But I would wager that most don't. Why not?
The physicist cares about QM concepts to the extent that the said concepts are linked to things that the physicist values. Maybe the physicist finds it rewarding to develop a better understanding of QM, to gain social status by making important discoveries, and to pay their rent by understanding the concepts well enough to continue to do research. These are some of the things that the QM concepts are useful for. Likely the brain has some kind of causal model indicating that the QM concepts are relevant tools for achieving those particular rewards. At the same time, the physicist also has various other things they care about, like being healthy and hanging out with their friends. These are values that can be better furthered by modeling the world in terms of classical physics.
In some sense, the physicist knows that if they started thinking "location is ill-defined, so I can safely jump out of the window", then that would be changing the map, not the territory. It wouldn't help them get the rewards of being healthy and getting to hang out with friends - even if a hypothetical physicist who did make that redefinition would think otherwise. It all adds up to normality.
A part of this comes from the fact that the physicist's reward function remains defined over immediate sensory experiences, as well as values which are linked to those. Even if you convince yourself that the location of food is ill-defined and you thus don't need to eat, you will still suffer the negative reward of being hungry. The physicist knows that no matter how they change their definition of the world, that won't affect their actual sensory experience and the rewards they get from that.
So to prevent the AI from leaving the box by suitably redefining reality, we have to somehow find a way for the same reasoning to apply to it. I haven't worked out a rigorous definition for this, but it needs to somehow learn to care about being in the box in classical terms, and realize that no redefinition of "location" or "space" is going to alter what happens in the classical model. Also, its rewards need to be defined over models to a sufficient extent to avoid wireheading (Hibbard 2011), so that it will think that trying to leave the box by redefining things would count as self-delusion, and not accomplish the things it really cared about. This way, the AI's concept for "being in the box" should remain firmly linked to the classical interpretation of physics, not the QM interpretation of physics, because it's acting in terms of the classical model that has always given it the most reward.
It is my hope that this could also be made to extend to cases where the AI learns to think in terms of concepts that are totally dissimilar to ours. If it learns a new conceptual dimension, how should that affect its existing concepts? Well, it can figure out how to reclassify the existing concepts that are affected by that change, based on what kind of a classification ends up producing the most reward... when the reward function is defined over the old model.
You already know by now that truth is undefinable: by a famous result of Tarski, no formal system powerful enough (from now on, just system) can consistently talk about the truth of its own sentences.
You may however not know that Hamkins proved that truth is holistic.
Let me explain: while no system can talk about its own truth, it can nevertheless talk about the truth of its own substructures. For example, in every model of ZFC (the standard axioms of set theory) you can consistently define a model of standard arithmetics and a predicate that works as arithmetics' truth predicate. This can happen because ZFC is strictly more powerful than PA (the axioms of standard arithmetics).
Intuitively, one could think that if you have the same substructure in two different models, what they believe is the truth about that substructure is the same in both. Along this line, two models of ZFC ought to believe the same things about standard arithmetics.
However, it turns out this is not the case. Two different models extending ZFC may very well agree on which entities are standard natural numbers, and yet still disagree about which arithmetic sentences are true or false. For example, they could agree about the standard numbers, how the successor and addition operator works, and yet disagree on multiplication (corollary 7.1 in Hamkins' paper).
This means that when you can talk consistently about the truth of a model (that is, when you are in a more powerful formal system), that truth depends not only on the substructure, but on the entire structure you're immersed in. Figuratively speaking, local truth depends on global truth. Truth is holistic.
There's more: suppose that two model agree on the ontology of some common substructure. Suppose also that they agree about the truth predicate on that structure: they could still disagree about the meta-truths. Or the meta-meta-truths, etc., for all the ordinal levels of the definable truth predicates.
Another striking example from the same paper. There are two different extensions of set theory which agree on the structure of standard arithmetics and on the members of a subset A of natural numbers, and yet one thinks that A is first-order definable while the other thinks it's not (theorem 10).
Not even "being a model of ZFC" is an absolute property: there are two models which agree on an initial segment of the set hierarchy, and yet one thinks that the segment is a model of ZFC while the other proves that it's not (theorem 12).
Two concluding remarks: what I wrote was that there are different models which disagrees the truth of standard arithmetics, not that every different model has different arithmetic truths. Indeed, if two models have access one to the truth relation of the other, then they are bound to have the same truths. This is what happens for example when you prove absoluteness results in forcing.
I'm also remembered of de Blanc's ontological crises: changing ontology can screw with your utility function. It's interesting to note that updating (that is, changing model of reality) can change what you believe even if you don't change ontology.
[This post borders on some well-trodden ground in information theory and machine learning, so ideas in this post have an above-average chance of having already been stated elsewhere, by professionals, better.]
I am fascinated by methods of thinking that work for well-understood reasons - that follow the steps of a mathematically elegant dance. If one has infinite computing power the method of choice is something like Solomonoff induction, which is provably ideal in a certain way at predicting the world. But if you have limited computing power, the choreography is harder to find.
To do Solomonoff induction, you search through all Turing machine hypotheses to find the ones that exactly output your data so far, then use the weighted average of those perfect retrodictors to predict the next time step. So the naivest way to build an ideal limited agent is to merely search through lots of hypotheses (chosen from some simple set) rather than all of them, and only run each Turing machine for time less than some limit. At least it's guaranteed to work in the limit of large computing power, which ain't nothing.
Suppose then that we take this nice elegant algorithm for a general predictor, and we implement it on today's largest supercomputer, and we show it the stock market prices from the last 50 years to try to predict stocks and get very rich. What happens?
Bupkis happens, that's what. Our Solomonoff predictor tries a whole lot of Turing machines and then runs out of time before finding any useful hypotheses that can perfectly replicate 50 years of stock prices. This is because such useful hypotheses are very, very, very rare.
We might then turn to the burgeoning field of logical uncertainty, which has a major goal of handling intractable math problems in an elegant and timely manner. We are logically uncertain about what distribution Solomonoff induction will output, so can we just average over that logical uncertainty to get some expected stock prices?
The trouble with this is that current logical uncertainty methods rely on proofs that certain outputs are impossible or contradictory. For simple questions this can narrow down the answers, but for complicated problems it becomes intractable, replacing the hard problem of evaluating lots of Turing machines with the hard problem of searching through lots and lots of proofs about lots of Turing machines - and so again our predictor runs out of time before becoming useful.
In practice, the methods we've found to work don't look very much like Solomonoff induction. Successful methods don't take the data as-is, but instead throw some of it away: curve fitting and smoothing data, filtering out hard-to-understand signals as noise, and using predictive models that approximate reality imperfectly. The sorts of things that people trying to predict stocks are already doing. These methods are vital to improve computational tractability, but are difficult (to my knowledge) to fit into a framework as general as Solomonoff induction.
Suppose that we allow a wide variety of models, including lossy models. How would a general purpose AI figure out which model to choose? Ideally we'd like to make a tradeoff between the accuracy of the model, measured in the expected utility of how accurate you expect the model's predictions to be, and the benefit of actually finishing on time, measured by a cost function of the resources used.
Once we know how to tell good models, the last piece would be for our agent to make the explore/exploit tradeoff between searching for better models and using its current best.
There are various techniques to estimate resource usage, but how does one estimate accuracy?
Here was my first thought: If you know how much information you're losing (e.g. by binning data), for discrete distributions this sets the Shannon information of the ideal value (given by Solomonoff prediction) given the predicted value. This uses the relationship between information in bits of data and Shannon information that determines how sharp your probability distribution is allowed to be.
But with no guarantees about the normality (or similar niceness properties) of the ideal value given the prediction, this isn't very helpful. The problem is highlighted by hurricane prediction. If hurricanes behaved nicely as we threw away information, weather models would just be small, high-entropy deviations from reality. Instead, hurricanes can change route greatly even with small differences in initial conditions.
The failure of the above approach can be explained in a very general way: it uses too little information about the model and the data, only the amount of information thrown away. To do better, our agent has to learn a lot from its training data - a subject that workers in AI have already been hard at work on. On the one hand, it's a great sign if we can eventually connect ideal agents to current successful algorithms. On the other, doing so elegantly seems like a hard problem.
To sum up in the blandest possible way: If we want to build successful predictors of the future with limited resources, they should use their experience to learn approximate models of the world.
The real trick, though, is going to be to set this on a solid foundation. What makes a successful method of picking models? As we lack access to the future (yet! Growth mindset!), we can't grade models based on their future predictions unless we descend to solipsism and grade models against models. Thus we're left with grading models based on how well they retrodict the data so far. Sound familiar? The foundation we want seems like an analogue to Solomonoff induction, one that works for known reasons but doesn't require perfection.
III: An Example
Here's a paradigm that might or might not be a step in the right direction, but at least gestures at what I mean.
The first piece of the puzzle is that a model that gets proportion P of training bits wrong can be converted to a Solomonoff-accepted perfectly-precise model just by specifying the bits it gets wrong. Suppose we break the model output (with total length N) into chunks of size L, and prefix each chunk with the locations of the wrong bits in that chunk. Then the extra data required to rectify an approximate model is at most N/L·log(P·L)+N·P·log(L). Then the hypothesis where the model is right about the next bit is simpler than the hypothesis when it's wrong, because when the model is right you don't have to spend ~log(L) bits correcting it.
In this way, Solomonoff induction natively cares about some approximate models' predictions. There are some interesting details here that are outside the focus of this particular post. Does using the optimal chunk length lead to Solomonoff induction reflecting model accuracy correctly? What are some better schemes for rectifying models that handle things like models that output probabilities? The point is just that even if your model is wrong on fraction P of the training data, Solomonoff induction will still promote it as long as it's simpler than N-N/L·log(P·L)-N·P·log(L).
The second piece of the puzzle is that induction can be done over processed functions of observations, like smoothing the data or filtering difficult-to-predict parts (noise) out. If this processing increases the accuracy of models, we can use this to make high-accuracy models of functions the training data, and then use those models to predict the the processed future observations as above.
These two pieces allow an agent to use approximate models, and to throw away some of its information, and still have its predictions work for the same reason as Solomonoff induction. We can use this paradigm to interpret what an algorithm like curve fitting is doing - the fitted curve is a high-accuracy retrodiction of some smoothed function of the data, which therefore does a good job of predicting what that smoothed function will be in the future.
There are some issues here. If a model that you are using is not the simplest, it might have overfitting problems (though perhaps you can fix this just by throwing away more information than naively appears necessary) or systematic bias. More generally, we haven't explored how models get chosen; we've made the problem easier to brute force but we need to understand non-brute force search methods and what their foundations are. It's a useful habit to keep in mind what actually works for humans - as someone put it to me recently, "humans can make models they understand that work for reasons they understand."
Furthermore, this doesn't seem to capture reductionism well. If our agent learns some laws of physics and then is faced with a big complicated situation it needs to use a simplified model to make a prediction about, it should still in some sense "believe in the laws of physics," and not believe that this complicated situation violates the laws physics even if its current best model is independent of physics.
IV: Logical Uncertainty
It may be possible to relate this back to logical uncertainty - where by "this" I mean the general thesis of predicting the future by building models that are allowed to be imperfect, not the specific example in part III. Soares and Fallenstein use the example of a complex Rube Goldberg machine that deposits a ball into one of several chutes. Given the design of the machine and the laws of physics, suppose that one can in principle predict the output of this machine, but that the problem is much too hard for our computer to do. So rather than having a deterministic method that outputs the right answer, a "logical uncertainty method" in this problem is one that, with a reasonable amount of resources spent, takes in the description of the machine and the laws of physics, and gives a probability distribution over the machine's outputs.
Meanwhile, suppose that we take an approximately inductive predictor and somehow teach it the the laws of physics, then ask it to predict the machine. We'd like it to make predictions via some appropriately simplified folk model of physics. If this model gives a probability distribution over outcomes - like in the simple case of "if you flip this coin in this exact way, it has a 50% shot at landing heads" - doesn't that make it a logical uncertainty method? But note that the probability distribution returned by a single model is not actually the uncertainty introduced by replacing an ideal predictor with a resource-limited predictor. So any measurement of logical uncertainty has to factor in the uncertainty between models, not just the uncertainty within models.
Again, we're back to looking for some prediction method that weights models with some goodness metric more forgiving than just using perfectly-retrodicting Turing machines, and which outputs a probability distribution that includes model uncertainty. But can we apply this to mathematical questions, and not just Rube Goldberg machines? Is there some way to subtract away the machine and leave the math?
Suppose that our approximate predictor was fed math problems and solutions, and built simple, tractable programs to explain its observations. For easy math problems a successful model can just be a Turing machine that finds the right answer. As the math problems get more intractable, successful models will start to become logical uncertainty methods, like how we can't predict a large prime number exactly, but we can predict it's last digit is 1, 3, 7, or 9. Within this realm we have something like low-level reductionism, where even though we can't find a proof of the right answer, we still want to act as if mathematical proofs work and all else is ignorance, and this will help us make successful predictions.
Then we have complicated problems that seem to be beyond this realm, like P=NP. Humans certainly seem to have generated some strong opinions about P=NP without dependence on mathematical proofs narrowing down the options. It seems to such humans that the genuinely right procedure to follow is that, since we've searched long and hard for a fast algorithm for NP-complete problems without success, we should update in the direction that no such algorithm exists. In approximate-Solomonoff-speak, it's that P!=NP is consistent with a simple, tractable explanation for (a recognizable subset of) our observations, while P=NP is only consistent with more complicated tractable explanations. We could absolutely make a predictor that reasons this way - it just sets a few degrees of freedom. But is it the right way to reason?
For one thing, this seems like it's following Gaifman's proposed property of logical uncertainty, that seeing enough examples of something should convince you of it with probability 1 - which has been shown to be "too strong" in some sense (it assigns probability 0 to some true statements - though even this could be okay if those statements are infinitely dilute). Does the most straightforward implementation actually have the Gaifman condition, or not? (I'm sorry, ma'am. Your daughter has... the Gaifman condition.)
This inductive view of logical uncertainty lacks the consistent nature of many other approaches - if it works, it does so by changing approaches to suit the problem at hand. This is bad if you want your logical uncertainty methods to be based on a simple prior followed by some kind of updating procedure. But logical uncertainty is supposed to be practical, after all, and at least this is a simple meta-procedure.
Thanks for reading this post. In conclusion, here are some of my questions:
What's the role of Solomonoff induction in approximate induction? Is Solomonoff induction doing all of the work, or is it possible to make useful predictions using tractable hypotheses Solomonoff induction would exclude, or excluding intractable hypotheses Solomonoff induction would have to include?
Somehow we have to pick out models to promote to attention in the first place. What properties make a process for this good or bad? What methods for picking models can be shown to still lead to making useful predictions - and not merely in the limit of lots of computing time?
Are humans doing the right thing by making models they understand that work for reasons they understand? What's up with that reductionism problem anyhow?
Is it possible to formalize the predictor discussed in the context of logical uncertainty? Does it have to fulfill Gaifman's condition if it finds patterns in things like P!=NP?
Does this whole edifice even make sense?
I'm currently reading through some relevant literature for preparing my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will posting these in a series of "Concept Safety"-titled articles.
A frequently-raised worry about AI is that it may reason in ways which are very different from us, and understand the world in a very alien manner. For example, Armstrong, Sandberg & Bostrom (2012) consider the possibility of restricting an AI via "rule-based motivational control" and programming it to follow restrictions like "stay within this lead box here", but they raise worries about the difficulty of rigorously defining "this lead box here". To address this, they go on to consider the possibility of making an AI internalize human concepts via feedback, with the AI being told whether or not some behavior is good or bad and then constructing a corresponding world-model based on that. The authors are however worried that this may fail, because
Humans seem quite adept at constructing the correct generalisations – most of us have correctly deduced what we should/should not be doing in general situations (whether or not we follow those rules). But humans share a common of genetic design, which the OAI would likely not have. Sharing, for instance, derives partially from genetic predisposition to reciprocal altruism: the OAI may not integrate the same concept as a human child would. Though reinforcement learning has a good track record, it is neither a panacea nor a guarantee that the OAIs generalisations agree with ours.
Addressing this, a possibility that I raised in Sotala (2015) was that possibly the concept-learning mechanisms in the human brain are actually relatively simple, and that we could replicate the human concept learning process by replicating those rules. I'll start this post by discussing a closely related hypothesis: that given a specific learning or reasoning task and a certain kind of data, there is an optimal way to organize the data that will naturally emerge. If this were the case, then AI and human reasoning might naturally tend to learn the same kinds of concepts, even if they were using very different mechanisms. Later on the post, I will discuss how one might try to verify that similar representations had in fact been learned, and how to set up a system to make them even more similar.
A particularly fascinating branch of recent research relates to the learning of word embeddings, which are mappings of words to very high-dimensional vectors. It turns out that if you train a system on one of several kinds of tasks, such as being able to classify sentences as valid or invalid, this builds up a space of word vectors that reflects the relationships between the words. For example, there seems to be a male/female dimension to words, so that there's a "female vector" that we can add to the word "man" to get "woman" - or, equivalently, which we can subtract from "woman" to get "man". And it so happens (Mikolov, Yih & Zweig 2013) that we can also get from the word "king" to the word "queen" by adding the same vector to "king". In general, we can (roughly) get to the male/female version of any word vector by adding or subtracting this one difference vector!
Why would this happen? Well, a learner that needs to classify sentences as valid or invalid needs to classify the sentence "the king sat on his throne" as valid while classifying the sentence "the king sat on her throne" as invalid. So including a gender dimension on the built-up representation makes sense.
But gender isn't the only kind of relationship that gets reflected in the geometry of the word space. Here are a few more:
It turns out (Mikolov et al. 2013) that with the right kind of training mechanism, a lot of relationships that we're intuitively aware of become automatically learned and represented in the concept geometry. And like Olah (2014) comments:
It’s important to appreciate that all of these properties of W are side effects. We didn’t try to have similar words be close together. We didn’t try to have analogies encoded with difference vectors. All we tried to do was perform a simple task, like predicting whether a sentence was valid. These properties more or less popped out of the optimization process.
This seems to be a great strength of neural networks: they learn better ways to represent data, automatically. Representing data well, in turn, seems to be essential to success at many machine learning problems. Word embeddings are just a particularly striking example of learning a representation.
It gets even more interesting, for we can use these for translation. Since Olah has already written an excellent exposition of this, I'll just quote him:
We can learn to embed words from two different languages in a single, shared space. In this case, we learn to embed English and Mandarin Chinese words in the same space.
We train two word embeddings, Wen and Wzh in a manner similar to how we did above. However, we know that certain English words and Chinese words have similar meanings. So, we optimize for an additional property: words that we know are close translations should be close together.
Of course, we observe that the words we knew had similar meanings end up close together. Since we optimized for that, it’s not surprising. More interesting is that words we didn’t know were translations end up close together.
In light of our previous experiences with word embeddings, this may not seem too surprising. Word embeddings pull similar words together, so if an English and Chinese word we know to mean similar things are near each other, their synonyms will also end up near each other. We also know that things like gender differences tend to end up being represented with a constant difference vector. It seems like forcing enough points to line up should force these difference vectors to be the same in both the English and Chinese embeddings. A result of this would be that if we know that two male versions of words translate to each other, we should also get the female words to translate to each other.
Intuitively, it feels a bit like the two languages have a similar ‘shape’ and that by forcing them to line up at different points, they overlap and other points get pulled into the right positions.
After this, it gets even more interesting. Suppose you had this space of word vectors, and then you also had a system which translated images into vectors in the same space. If you have images of dogs, you put them near the word vector for dog. If you have images of Clippy you put them near word vector for "paperclip". And so on.
You do that, and then you take some class of images the image-classifier was never trained on, like images of cats. You ask it to place the cat-image somewhere in the vector space. Where does it end up?
You guessed it: in the rough region of the "cat" words. Olah once more:
This was done by members of the Stanford group with only 8 known classes (and 2 unknown classes). The results are already quite impressive. But with so few known classes, there are very few points to interpolate the relationship between images and semantic space off of.
The Google group did a much larger version – instead of 8 categories, they used 1,000 – around the same time (Frome et al. (2013)) and has followed up with a new variation (Norouzi et al. (2014)). Both are based on a very powerful image classification model (from Krizehvsky et al. (2012)), but embed images into the word embedding space in different ways.
The results are impressive. While they may not get images of unknown classes to the precise vector representing that class, they are able to get to the right neighborhood. So, if you ask it to classify images of unknown classes and the classes are fairly different, it can distinguish between the different classes.
Even though I’ve never seen a Aesculapian snake or an Armadillo before, if you show me a picture of one and a picture of the other, I can tell you which is which because I have a general idea of what sort of animal is associated with each word. These networks can accomplish the same thing.
These algorithms made no attempt of being biologically realistic in any way. They didn't try classifying data the way the brain does it: they just tried classifying data using whatever worked. And it turned out that this was enough to start constructing a multimodal representation space where a lot of the relationships between entities were similar to the way humans understand the world.
How useful is this?
"Well, that's cool", you might now say. "But those word spaces were constructed from human linguistic data, for the purpose of predicting human sentences. Of course they're going to classify the world in the same way as humans do: they're basically learning the human representation of the world. That doesn't mean that an autonomously learning AI, with its own learning faculties and systems, is necessarily going to learn a similar internal representation, or to have similar concepts."
This is a fair criticism. But it is mildly suggestive of the possibility that an AI that was trained to understand the world via feedback from human operators would end up building a similar conceptual space. At least assuming that we chose the right learning algorithms.
When we train a language model to classify sentences by labeling some of them as valid and others as invalid, there's a hidden structure implicit in our answers: the structure of how we understand the world, and of how we think of the meaning of words. The language model extracts that hidden structure and begins to classify previously unseen things in terms of those implicit reasoning patterns. Similarly, if we gave an AI feedback about what kinds of actions counted as "leaving the box" and which ones didn't, there would be a certain way of viewing and conceptualizing the world implied by that feedback, one which the AI could learn.
"Hmm, maaaaaaaaybe", is your skeptical answer. "But how would you ever know? Like, you can test the AI in your training situation, but how do you know that it's actually acquired a similar-enough representation and not something wildly off? And it's one thing to look at those vector spaces and claim that there are human-like relationships among the different items, but that's still a little hand-wavy. We don't actually know that the human brain does anything remotely similar to represent concepts."
Here we turn, for a moment, to neuroscience.
Multivariate Cross-Classification (MVCC) is a clever neuroscience methodology used for figuring out whether different neural representations of the same thing have something in common. For example, we may be interested in whether the visual and tactile representation of a banana have something in common.
We can test this by having several test subjects look at pictures of objects such as apples and bananas while sitting in a brain scanner. We then feed the scans of their brains into a machine learning classifier and teach it to distinguish between the neural activity of looking at an apple, versus the neural activity of looking at a banana. Next we have our test subjects (still sitting in the brain scanners) touch some bananas and apples, and ask our machine learning classifier to guess whether the resulting neural activity is the result of touching a banana or an apple. If the classifier - which has not been trained on the "touch" representations, only on the "sight" representations - manages to achieve a better-than-chance performance on this latter task, then we can conclude that the neural representation for e.g. "the sight of a banana" has something in common with the neural representation for "the touch of a banana".
A particularly fascinating experiment of this type is that of Shinkareva et al. (2011), who showed their test subjects both the written words for different tools and dwellings, and, separately, line-drawing images of the same tools and dwellings. A machine-learning classifier was both trained on image-evoked activity and made to predict word-evoked activity and vice versa, and achieved a high accuracy on category classification for both tasks. Even more interestingly, the representations seemed to be similar between subjects. Training the classifier on the word representations of all but one participant, and then having it classify the image representation of the left-out participant, also achieved a reliable (p<0.05) category classification for 8 out of 12 participants. This suggests a relatively similar concept space between humans of a similar background.
We can now hypothesize some ways of testing the similarity of the AI's concept space with that of humans. Possibly the most interesting one might be to develop a translation between a human's and an AI's internal representations of concepts. Take a human's neural activation when they're thinking of some concept, and then take the AI's internal activation when it is thinking of the same concept, and plot them in a shared space similar to the English-Mandarin translation. To what extent do the two concept geometries have similar shapes, allowing one to take a human's neural activation of the word "cat" to find the AI's internal representation of the word "cat"? To the extent that this is possible, one could probably establish that the two share highly similar concept systems.
One could also try to more explicitly optimize for such a similarity. For instance, one could train the AI to make predictions of different concepts, with the additional constraint that its internal representation must be such that a machine-learning classifier trained on a human's neural representations will correctly identify concept-clusters within the AI. This might force internal similarities on the representation beyond the ones that would already be formed from similarities in the data.
Next post in series: The problem of alien concepts.
I am not sure that it is productive to tell certain people that they do not really believe what they claim to believe, and that they only believe they believe it. I have an alternative suggestion that could possibly be more useful.
It seems that human beings have two kinds of beliefs: binary beliefs and quasi-Bayesian beliefs. The binary beliefs are what we usually think of as beliefs, simple statements which are true or false like "Two and two make four," "The sun will rise tomorrow," "The Messiah is coming," and so on.
Binary beliefs are basically voluntary. We can choose such beliefs much as we can choose to lift our arms and legs. If I say "the sun will rise tomorrow," I am choosing to say this, just as I can choose to lift my arm. I can even choose the internal factor. I can choose to say to myself, "the sun will rise tomorrow." And I can also choose to say that the sun will NOT rise. I can choose to say this to others, and I can even choose to say it to myself, within my own head.
Of course, it would be reasonable to respond to this by saying that this does not mean that someone can choose to believe that the sun will not rise. Even if he says this to himself, he still does not act as though the sun is not going to rise. He won't start making preparations for a freezing world, for example. The answer to this is that choosing to believe something is more than choosing to say it to oneself and to others. Rather, it is choosing to conform the whole of one's life to the idea that this is true. And someone could indeed choose to believe that the sun will not rise in this sense, if he thought he had a reason to do so. If he did so choose, he would indeed begin to make preparations for a dark world, because he would be choosing to conform his actions to that opinion. And he would do this voluntarily, just as someone can voluntarily lift his arm.
At the same time, human beings have quasi-Bayesian beliefs. These are true degrees of belief like probabilities, never really becoming absolutely certain of the truth or falsity of anything, but sometimes coming very close. These are internal estimates of the mind, and are basically non-voluntary. Instead of depending on choice, they actually depend on evidence, although they are influenced by other factors as well. A person cannot choose to increase or decrease this estimate, although he can go and look for evidence. On account of the flawed nature of the mind, if someone only looks for confirming evidence and ignores disconfirming evidence, this estimate in principle can go very high even when the objective state of the evidence does not justify this.
Belief in Belief
It seems to me that what we usually call belief in belief basically means that someone holds a binary belief together with a quasi-Bayesian belief which conflicts with it. So he says "The Messiah is coming," saying it to himself and others, and in every way acting as though this is true, even though his internal Bayesian estimate is that after all these thousands of years, the evidence is strongly against this. So he has a positive binary belief while having a very low estimate of the probability of this belief.
The reason why this often happens with religion in particular is that religious beliefs very often do not have huge negative consequences if they are mistaken. In principle, someone can choose to believe that if he jumps from the window of the tenth story of a building, he will be ok. In practice, no one will choose this on account of his non-voluntary Bayesian estimate that he is very likely to be hurt if does so. But a person does not notice much harm from believing the Messiah is coming, and so he can choose to believe it even if his internal estimate says that it is likely to be false.
A cautionary note: one might be tempted to think that religious people in general have belief in belief in this sense, that they all really know that their religions are unlikely to be true. This is not the case. There are plenty of ways to distort the internal estimate, even though one cannot directly choose this estimate. I know many very religious people who clearly have an extremely high internal estimate of the truth of their religion. They REALLY BELIEVE it is true, in the fullest possible sense. But on the other hand I also know others, also extremely devout, who clearly have an internal estimate which is extremely low: they are virtually certain that their religion is false, and yet in every way, externally and internally, they act and think as though it were true.
I was re-reading the chapter on status in Impro (excerpt), and I noticed that Johnstone seemed to be implying that different people are comfortable at different levels of status: some prefer being high status and others prefer being low status. I found this peculiar, because the prevailing notion in the rationalistsphere seems to be that everyone's constantly engaged in status games aiming to achieve higher status. I've even seen arguments to the effect that a true post-scarcity society is impossible, because status is zero-sum and there will always be people at the bottom of the status hierarchy.
But if some people preferred to have low status, this whole dilemma might be avoided, if a mix of statuses could be find that left everyone happy.
First question - is Johnstone's "status" talking about the same thing as our "status"? He famously claimed that "status is something you do, not something that you are", and that
I should really talk about dominance and submission, but I'd create a resistance. Students who will agree readily to raising or lowering their status may object if asked to 'dominate' or 'submit'.
Viewed via this lens, it makes sense that some people would prefer being in a low status role: if you try to take control of the group, you become subject to various status challenges, and may be held responsible for the decisions you make. It's often easier to remain low status and let others make the decisions.
But there's still something odd about saying that one would "prefer to be low status", at least in the sense in which we usually use the term. Intuitively, a person may be happy being low status in the sense of not being dominant, but most people are still likely to desire something that feels kind of like status in order to be happy. Something like respect, and the feeling that others like them. And a lot of the classical "status-seeking behaviors" seem to be about securing the respect of others. In that sense, there seems to be something intuitive true in the "everyone is engaged in status games and wants to be higher-status" claim.
So I think that there are two different things that we call "status" which are related, but worth distinguishing.
1) General respect and liking. This is "something you have", and is not inherently zero-sum. You can achieve it by doing things that are zero-sum, like being the best fan fiction writer in the country, but you can also do it by things like being considered generally friendly and pleasant to be around. One of the lessons that I picked up from The Charisma Myth was that you can be likable by just being interested in the other person and displaying body language that signals your interest in the other person.
Basically, this is "do other people get warm fuzzies from being around you / hearing about you / consuming your work", and is not zero-sum because e.g. two people who both have great social skills and show interest in you can both produce the same amount of warm fuzzies, independent of each other's existence.
But again, specific sources of this can be zero-sum: if you respect someone a lot for their art, but then run across into even better art and realize that the person you previously admired is pretty poor in comparison, that can reduce the respect you feel for them. It's just that there are also other sources of liking which aren't necessarily zero-sum.
2) Dominance and control of the group. It's inherently zero-sum because at most one person can have absolute say on the decisions of the group. This is "something you do": having the respect and liking of the people in the group (see above) makes it easier for you to assert dominance and makes the others more willing to let you do so, but you can also voluntarily abstain from using that power and leave the decisions to others. (Interestingly, in some cases this can even increase the extent to which you are liked, which translates to a further boost in the ability to control the group, if you so desired.)
Morendil and I previously suggested a definition of status as "the general purpose ability to influence a group", but I think that definition was somewhat off in conflating the two senses above.
I've always had the vague feeling that the "everyone can't always be happy because status is zero-sum" claim felt off in some sense that I was unable to properly articulate, but this seems to resolve the issue. If this model were true, it would also make me happy, because it would imply that we can avoid zero-sum status fights while still making everybody content.
This is the second of four short essays that say explicitly some things that I would tell an intrigued proto-rationalist before pointing them towards Rationality: AI to Zombies (and, by extension, most of LessWrong). For most people here, these essays will be very old news, as they talk about the insights that come even before the sequences. However, I've noticed recently that a number of fledgling rationalists haven't actually been exposed to all of these ideas, and there is power in saying the obvious.
This essay is cross-posted on MindingOurWay.
A note on what sort of artifact a brain is:
A brain is a specialty device that, when slammed against its surroundings in a particular way, changes so that its insides reflect its outsides. A brain is a precise, complex machine that continually hits nearby things just so, so that some of its inner bits start to correlate with the outside world.
Consider the photons bouncing off the chair in the room where I write this. In coarse summary, those photons slam into specialized proteins in the membrane of my photoreceptor cells, changing their shape and setting off a chain reaction that activates an enzyme that breaks down certain nucleotides, thereby changing the electrochemical gradient between the inside and the outside of the cell, preventing the release of certain neurotransmitters through its membrane. This lack of neurotransmitters causes nearby cells to undergo similar ionization events, and those cells transmit the signal from a number of nearby photoreceptor cells into the first layer of my retinal cells (again, by the mechanism of proteins changing shape and altering the electrochemical gradient). And that's just the very beginning of a looooong Rube Goldberg machine: the signal then makes its way down the retina (interacting, at each level, with signals from higher levels) until it's passed to the optic nerve, where it's passed to the visual cortex, where the specific pattern of nerve cell ionization events causes a specific pattern of neurons to fire, setting off a cascade of neurons-affecting-other-neurons in a domino effect that results in the inside of my brain containing a tiny summarized representation of a chair.
A brain is a complex piece of machinery that, when immersed in a big soup of photons while connected to light-sensors, undergoes a massive chain reaction that causes the inner parts of the brain to correlate with the things the photons bounced off of.
A brain is a machine that builds up mutual information between its internals and its externals.
We have a reasonably clear sense of what "good" is, but it's not perfect. Suffering is bad, pleasure is good, more people living enjoyable lives is good, yes, but tradeoffs are hard. How much worse is it to go blind than to lose your leg?  How do we compare the death of someone at eighty to the death of someone at twelve? If you wanted to build some automated system that would go from data about the world to a number representing how well it's doing, where you would prefer any world that scored higher to any world scoring lower, that would be very difficult.
Say, however, that you've built a metric that you think matches your values well and you put some powerful optimizer to work maximizing that metric. This optimizer might do many things you think are great, but it might be that the easiest ways to maximize the metric are the ones that pull it apart from your values. Perhaps after it's in place it turns out your metric included many things that only strongly correlated with what you cared about, where the correlation breaks down under maximization.
What confuses me is that the people who warn about this scenario with respect to AI are often the same people in favor of futarchy. They both involve trying to define your values and then setting an indifferent optimizer to work on them. If you think AI would be very dangerous but futarchy would be very good, why?
I also posted this on my blog.
 This is a question people working in public health try to answer with Disability Weights for DALYs.
What do you think about these pairs of statements?
- People's misfortunes result from the mistakes they make
- Many of the unhappy things in people's lives are partly due to bad luck
- In the long run, people get the respect they deserve in this world.
- Unfortunately, an individual's worth often passes unrecognized no matter how hard he tries.
- Becoming a success is a matter of hard work; luck has little or nothing to do with it.
- Getting a good job mainly depends on being in the right place at the right time.
They have a similar theme: the first statement suggests that an outcome (misfortune, respect, or a good job) for a person are the result of their own action or volition. The second assigns the outcome to some external factor like bad luck.(1)
People who tend to think their own attitudes or efforts can control what happens to them are said to have an internal locus of control, those who don't, an external locus of control. (Call them 'internals' and 'externals' for short).
Internals seem to do better at life, pace obvious confounding: maybe instead of internals doing better by virtue of their internal locus of control, being successful inclines you to attribute success internal factors and so become more internal, and vice versa if you fail.(2) If you don't think the relationship is wholly confounded, then there is some prudential benefit for becoming more internal.
Yet internal versus external is not just a matter of taste, but a factual claim about the world. Do people, in general, get what their actions deserve, or is it generally thanks to matters outside their control?
Why the external view is right
Here are some reasons in favour of an external view:(3)
- Global income inequality is marked (e.g. someone in the bottom 10% of the US population by income is still richer than two thirds of the population - more here). The main predictor of your income is country of birth, it is thought to explain around 60% of the variance: not only more important than any other factor, but more important than all other factors put together.
- Of course, the 'remaining' 40% might not be solely internal factors either. Another external factor we could put in would be parental class. Include that, and the two factors explain 80% of variance in income.
- Even conditional on being born in the right country (and to the right class), success may still not be a matter of personal volition. One robust predictor of success (grades in school, job performance, income, and so on) is IQ. The precise determinants of IQ remain controversial, it is known to be highly heritable, and the 'non-genetic' factors of IQ proposed (early childhood environment, intra-uterine environment, etc.) are similarly outside one's locus of control.
On cursory examination the contours of how our lives are turned out are set by factors outside our control, merely by where we are born and who our parents are. Even after this we know various predictors, similarly outside (or mostly outside) of our control, that exert their effects on how our lives turn out: IQ is one, but we could throw in personality traits, mental health, height, attractiveness, etc.
So the answer to 'What determined how I turned out, compared to everyone else on the planet?', the answer surely has to by primarily about external factors, and our internal drive or will is relegated a long way down the list. Even if we want to look at narrower questions, like "What has made me turn out the way I am, versus all the other people who were likewise born in rich countries in comfortable circumstances?" It is still unclear whether the locus of control resides within our will: perhaps a combination of our IQ, height, gender, race, risk of mental illness and so on will still do the bulk of the explanatory work.(4)
Bringing the true and the prudentially rational together again
If it is the case that folks with an internal locus of control succeed more, yet also the external view being generally closer to the truth of the matter, this is unfortunate. What is true and what is prudentially rational seem to be diverging, such that it might be in your interests not to know about the evidence in support of an external locus of control view, as deluding yourself about an internal locus of control view would lead to your greater success.
Yet it is generally better not to believe falsehoods. Further, the internal view may have some costs. One possibility is fueling a just world fallacy: if one thinks that outcomes are generally internally controlled, then a corollary is when bad things happen to someone or they fail at something, it was primarily their fault rather than them being a victim of circumstance.
So what next? Perhaps the right view is to say that: although most important things are outside our control, not everything is. Insofar as we do the best with what things we can control, we make our lives go better. And the scope of internal factors - albeit conditional on being a rich westerner etc. - may be quite large: it might determine whether you get through medical school, publish a paper, or put in enough work to do justice to your talents. All are worth doing.
Inspired by Amanda MacAskill's remarks, and in partial response of Peter McIntyre. Neither are responsible for what I've written, and the former's agreement or the latter's disagreement with this post shouldn't be assumed.
1) Some ground-clearing: free will can begin to loom large here - after all, maybe my actions are just a result of my brain's particular physical state, and my brain's particular physical state at t depends on it's state at t-1, and so on and so forth all the way to the big bang. If so, there is no 'internal willer' for my internal locus of control to reside.
However, even if that is so, we can parse things in a compatibilist way: 'internal' factors are those which my choices can affect; external factors are those which my choices cannot affect. "Time spent training" is an internal factor as to how fast I can run, as (borrowing Hume), if I wanted to spend more time training, I could spend more time training, and vice versa. In contrast, "Hemiparesis secondary to birth injury" is an external factor, as I had no control over whether it happened to me, and no means of reversing it now. So the first set of answers imply support for the results of our choices being more important; whilst the second set assign more weight to things 'outside our control'.
2) In fairness, there's a pretty good story as to why there should be 'forward action': in the cases where outcome is a mix of 'luck' factors (which are a given to anyone), and 'volitional ones' (which are malleable), people inclined to think the internal ones matter a lot will work hard at them, and so will do better when this is mixed in with the external determinants.
3) This ignores edge cases where we can clearly see the external factors dominate - e.g. getting childhood leukaemia, getting struck by lightning etc. - I guess sensible proponents of an internal locus of control would say that there will be cases like this, but for most people, in most cases, their destiny is in their hands. Hence I focus on population level factors.
4) Ironically, one may wonder to what extent having an internal versus external view is itself an external factor.
A putative new idea for AI control; index here.
Partially inspired by as conversation with Daniel Dewey.
People often come up with a single great idea for AI, like "complexity" or "respect", that will supposedly solve the whole control problem in one swoop. Once you've done it a few times, it's generally trivially easy to start taking these ideas apart (first step: find a bad situation with high complexity/respect and a good situation with lower complexity/respect, make the bad very bad, and challenge on that). The general responses to these kinds of idea are listed here.
However, it seems to me that rather than constructing counterexamples each time, we should have a general category and slot these ideas into them. And not only have a general category with "why this can't work" attached to it, but "these are methods that can make it work better". Seeing the things needed to make their idea better can make people understand the problems, where simple counter-arguments cannot. And, possibly, if we improve the methods, one of these simple ideas may end up being implementable.
The category I'm proposing to define is that of "crude measures". Crude measures are methods that attempt to rely on non-fully-specified features of the world to ensure that an underdefined or underpowered solution does manage to solve the problem.
To illustrate, consider the problem of building an atomic bomb. The scientists that did it had a very detailed model of how nuclear physics worked, the properties of the various elements, and what would happen under certain circumstances. They ended up producing an atomic bomb.
The politicians who started the project knew none of that. They shovelled resources, money and administrators at scientists, and got the result they wanted - the Bomb - without ever understanding what really happened. Note that the politicians were successful, but it was a success that could only have been achieved at one particular point in history. Had they done exactly the same thing twenty years before, they would not have succeeded. Similarly, Nazi Germany tried a roughly similar approach to what the US did (on a smaller scale) and it went nowhere.
So I would define "shovel resources at atomic scientists to get a nuclear weapon" as a crude measure. It works, but it only works because there are other features of the environment that are making it work. In this case, the scientists themselves. However, certain social and human features about those scientists (which politicians are good at estimating) made it likely to work - or at least more likely to work than shovelling resources at peanut-farmers to build moon rockets.
In the case of AI, advocating for complexity is similarly a crude measure. If it works, it will work because of very contingent features about the environment, the AI design, the setup of the world etc..., not because "complexity" is intrinsically a solution to the FAI problem. And though we are confident that human politicians have some good enough idea about human motivations and culture that the Manhattan project had at least some chance of working... we don't have confidence that those suggesting crude measures for AI control have a good enough idea to make their idea works.
It should be evident that "crudeness" is on a sliding scale; I'd like to reserve the term for proposed solutions to the full FAI problem that do not in any way solve the deep questions about FAI.
More or less crude
The next question is, if we have a crude measure, how can we judge its chance of success? Or, if we can't even do that, can we at least improve the chances of it working?
The main problem is, of course, that of optimising. Either optimising in the sense of maximising the measure (maximum complexity!) or of choosing the measure that is most extreme fit to the definition (maximally narrow definition of complexity!). It seems we might be able to do something about this.
Let's start by having AI create sample a large class of utility functions. Require them to be around the same expected complexity as human values. Then we use our crude measure μ - for argument's sake, let's make it something like "approval by simulated (or hypothetical) humans, on a numerical scale". This is certainly a crude measure.
We can then rank all the utility functions u, using μ to measure the value of "create M(u), a u-maximising AI, with this utility function". Then, to avoid the problems with optimisation, we could select a certain threshold value and pick any u such that E(μ|M(u)) is just above the threshold.
How to pick this threshold? Well, we might have some principled arguments ("this is about as good a future as we'd expect, and this is about as good as we expect that these simulated humans would judge it, honestly, without being hacked").
One thing we might want to do is have multiple μ, and select things that score reasonably (but not excessively) on all of them. This is related to my idea that the best Turing test is one that the computer has not been trained or optimised on. Ideally, you'd want there to be some category of utilities "be genuinely friendly" that score higher than you'd expect on many diverse human-related μ (it may be better to randomly sample rather than fitting to precise criteria).
You could see this as saying that "programming an AI to preserve human happiness is insanely dangerous, but if you find an AI programmed to satisfice human preferences, and that other AI also happens to preserve human happiness (without knowing it would be tested on this preservation), then... it might be safer".
There are a few other thoughts we might have for trying to pick a safer u:
- Properties of utilities under trade (are human-friendly functions more or less likely to be tradable with each other and with other utilities)?
- If we change the definition of "human", this should have effects that seem reasonable for the change. Or some sort of "free will" approach: if we change human preferences, we want the outcome of u to change in ways comparable with that change.
- Maybe also check whether there is a wide enough variety of future outcomes, that don't depend on the AI's choices (but on human choices - ideas from "detecting agents" may be relevant here).
- Changing the observers from hypothetical to real (or making the creation of the AI contingent, or not, on the approval), should not change the expected outcome of u much.
- Making sure that the utility u can be used to successfully model humans (therefore properly reflects the information inside humans).
- Make sure that u is stable to general noise (hence not over-optimised). Stability can be measured as changes in E(μ|M(u)), E(u|M(u)), E(v|M(u)) for generic v, and other means.
- Make sure that u is unstable to "nasty" noise (eg reversing human pain and pleasure).
- All utilities in a certain class - the human-friendly class, hopefully - should score highly under each other (E(u|M(u)) not too far off from E(u|M(v))), while the over-optimised solutions - those scoring highly under some μ - must not score high under the class of human-friendly utilities.
This is just a first stab at it. It does seem to me that we should be able to abstractly characterise the properties we want from a friendly utility function, which, combined with crude measures, might actually allow us to select one without fully defining it. Any thoughts?
And with that, the various results of my AI retreat are available to all.
View more: Next