Adding and removing complexity from models
Original post: http://bearlamp.com.au/adding-and-removing-complexity-from-models/
I had a really interesting conversation with a guy about modelling information. What I did when talking to him was, in one case, insist that his model be made simpler, because adding more variation to it was unhelpful, and then, in another case in the same conversation, insist that his model be made more complicated, to account for available information that didn't fit it.
On reflection I realised that I had applied two opposing forces to models of information and could only vaguely explain why. With that in mind I decided to work out what was going on. The following is obvious, but that's why I am writing it out, so that no one else has to do the obvious thing.
Case where a model should be simplified
This all comes down to what you are measuring or describing. If you are trying to describe something rather general, like "what impact does the number of beach-goers have on the pollution at the beach?", it probably doesn't matter what gender, age, race or socioeconomic status the beach-goers are, or how long they spend at the beach (with the possible exception of the socioeconomic status of the surrounding geopolitical territory). What is important is maybe two pieces of information:
- A measure of the number of beach goers
- A measure of the pollution
That's it. This would be a case for reducing the survey of beach-goers down to a counter of beach-goers and a daily photo of the state of the beach at the end of the day (which could be compared to other similar photos). Or even just three photos: one at 9am (start), one at 1pm (peak) and one at 5pm (end). This model needs no more moving parts. The day you want to start using historic information to decide how many beach cleaners to employ, you can do that from the limited but effective data you have gathered.
Case where a model should have more moving parts added to it.
Let's continue the same example. You have 3 photos of each day, but sometimes the 1pm photo is deserted. Nearly no one is at the beach, and you wonder why. It's also messing with your predictions, because there is still a bit of rubbish at 5pm even though very few people were at the beach. The model no longer explains the state of the world. The map is wrong. But that's okay. We can fix it by adding more information. You notice that most days the model is good, so there might be something going on for the other days which needs a +k factor added to the equation (+k is the convention in chemistry; in algebra it's sometimes a +c, as in y = mx + b + c; in physics a +x; but whatever the letter, adding a variable to an equation is common to all the sciences). Some new variable.
Let's say that, being omniscient about our own made-up example, we know that the cause is the weather. On stormy, windy, rainy days no one goes to the beach, but some rubbish washes up. Does this match the data? Almost perfectly. Does this help explain the map? Yes. Is it necessary? That depends on what you are doing with the information. Maybe it's significant enough in this scenario that it is necessary.
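A minimal sketch of how you might test whether the weather term earns its place, assuming you had daily counts and photos scored for litter. Everything here is made up for illustration; the point is only the comparison of a model with and without the extra variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up daily data: beach-goer counts, a stormy-day flag, and litter left at 5pm.
days = 200
stormy = rng.random(days) < 0.15
goers = np.where(stormy, rng.poisson(20, days), rng.poisson(400, days))
litter = 0.05 * goers + 15 * stormy + rng.normal(0, 3, days)  # storms also wash rubbish ashore

def residual_sd(columns, y):
    """Least-squares fit; returns the residual standard deviation (lower = explains more)."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.std(y - X @ coef)

print("goers only:        ", residual_sd([goers], litter))
print("goers + storm flag:", residual_sd([goers, stormy.astype(float)], litter))
```

If adding the storm flag meaningfully shrinks the unexplained residual, the extra moving part is pulling its weight; if not, the simpler model wins.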
Second example
The example that came up in conversation was his own internal model that there is fundamentally something different between someone who does exercise and someone who doesn't exercise. I challenged this model for having too much complexity. I argued that the model of "there is a hidden and secret moving part between does/doesn't exercise" doesn't describe the world any better than a model without that moving part.
The model does something else (and found its way into existence for this reason). If you find yourself on one side of the model (i.e. "I don't exercise"), then you can protect yourself from attributing the failure to exercise to your own inability by declaring that there is a hidden and secret moving part that prevents you from being in the other observable group. This preserves your non-changing self and lets you get away with it for a longer time. I know this model because that is what I did. I held this model very strongly. And then I went out and searched for the hidden and secret moving part that I could change in order to move myself into the other group. There was no hidden and secret moving part. Or if there was, I couldn't find it. However, I did manage to stop holding the model that there was some hidden and secret moving part, and instead just started exercising more.
In figuring out whether this model is real or a made-up model to protect your own brain from being critical of itself, start by thinking about what the world would look like if it were true. If there were some fundamental difference between people who do exercise and people who do not, we might see people clustered in observable groups, never able to move between them. (This is not what we see: we regularly see people publishing their weight-loss journeys, and we also regularly see people getting fatter and unhealthier, suggesting that travel in either direction is entirely possible and happens all the time.) If there were something describable, it would be as obvious as the difference between species; in fact, thinking evolutionarily, if such a thing existed it would likely have already shaped the state of the world to be completely different... Given that we can't know for sure, this might not be a very strong argument.
If you got this far, as I did, and wondered "so why can't I be in the other group?", I have news for you. You can.
- Does this pattern of models with too many moving parts sound familiar to another model you have seen in action?
- Is there a model that you use that could do with more moving parts?
Meta: this took an hour to write. If I were to spend more time on it, it would probably be to tighten up the examples and maybe provide more of them. I am not sure that such time would be useful to you, and am interested in whether you think it would be.
Recent updates to gwern.net (2015-2016)
"When I was one-and-twenty / I heard a wise man say, / 'Give crowns and pounds and guineas / But not your heart away; / Give pearls away and rubies / But keep your fancy free.' / But I was one-and-twenty, / No use to talk to me."
My past year of completed writings, sorted by topic:
Genetics:
- Embryo selection for intelligence cost-benefit analysis: meta-analysis of intelligence GCTAs, limits set by measurement error, current polygenic scores, possible gains with current IVF procedures, the benefits of selection on multiple complex traits, the possible annual value in the USA of selection & value of larger GWASes, societal consequences of various embryo selection scenarios, embryo count versus polygenic scores as limiting factors, comparison with iterated embryo selection, limits to total gains from iterated embryo selection, etc.
- Wikipedia article on Genome-wide complex trait analysis (GCTA)
AI:
- Computational Complexity vs the Singularity
- Adding metadata to an RNN for mimicking individual author style
- Armstrong’s AI control problem: Reinforce.js demo
Statistics:
- Candy Japan new packaging decision analysis
- “The Power of Twins: Revisiting Student’s Scottish Milk Experiment Example”
- Genius Revisited: Critiquing the Value of High IQ Elementary Schools
- Inferring mean ethnic IQs from very high IQ samples like TIP/SMPY
gwern.net itself has remained largely stable (some CSS fixes and image size changes); I continue to use Patreon and send out my newsletters.
Unfriendly Natural Intelligence
Related to: UFAI, Paperclip maximizer, Reason as memetic immune disorder
A discussion with Stefan (cheers, didn't get your email, please message me) during the European Community Weekend Berlin fleshed out an idea I had toyed around with for some time:
If a UFAI can wreak havoc by driving simple goals to extremes, then driving human desires to extremes should also cause problems. And we should already see this.
Actually we do.
We know that just following our instincts on eating (sugar, fat) is unhealthy. We know that stimulating our pleasure centers more or less directly (drugs) is dangerous. We know that playing certain games can lead to comparable addiction. And the recognition of this has led to a large number of more or less fine-tuned anti-memes e.g. dieting, early drug prevention, helplines. These memes steering us away from such behaviors were selected for because they provided aggregate benefits to the (members of) social (sub) systems they are present in.
Many of these memes have become so self-evident that we don't recognize them as such. Some are essential parts of highly complex social systems. What is the general pattern? Did we catch all the critical cases? Are the existing memes well-suited for the task? How are they related? Many are probably deeply woven into our culture and traditions.
Did we miss any anti-memes?
This last question really is at the core of this post. I think we lack some of the memes needed to keep new exploitations of our desires in check. Some of these exploitations are new because our society has a) only recently developed the capacity to exploit those desires and b) only recently gained the scientific knowledge of how to do so.
Estimating the Kolmogorov complexity of the known laws of physics?
In the post Complexity and Intelligence, Eliezer says that the Kolmogorov Complexity (length of shortest equivalent computer program) of the laws of physics is about 500 bits:
Suppose you ran a Turing machine with unlimited tape, so that, starting from our laws of physics, it simulated our whole universe - not just the region of space we see around us, but all regions of space and all quantum branches. [...]
Then the "Kolmogorov complexity" of that entire universe [...] would be 500 bits, or whatever the size of the true laws of physics when written out as equations on a sheet of paper.
Where did this 500 come from?
I googled around for estimates on the Kolmogorov Complexity of the laws of physics, but didn't find anything. Certainly nothing as concrete as 500.
I asked about it on the physics stack exchange, but haven't received any answers as of yet.
I considered estimating it myself, but doing that well involves a significant time investment. I'd need to learn the Standard Model well enough to write a computer program that simulated it (however inefficiently or intractably; it's the program's length that matters, not its time or memory performance).
Based on my experience programming, I'm sure it wouldn't take a million bits. Probably less than ten thousand. The demoscene does some pretty amazing things with 4096 bits. But 500 sounds like a teeny tiny amount to mention offhand for fitting the constants, the forces, the particles, and the mathematical framework for doing things like differential equations. The fundamental constants alone are going to consume ~20-30 bits each.
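A quick sanity check on that 20-30 bits figure (my own arithmetic, not from the post): a constant pinned down to d significant decimal digits needs roughly log2(10^d) bits, so 6-9 digits of precision is indeed about 20-30 bits per constant.

```python
import math

# Back-of-envelope check: bits needed to specify a constant to d significant decimal digits.
for d in (6, 9, 12):
    print(f"{d} significant digits ~ {math.log2(10 ** d):.1f} bits")
```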
Does anyone have a reference, or even a more worked-through example of an estimate?
Modifying Universal Intelligence Measure
In 2007, Legg and Hutter wrote a paper using the AIXI model to define a measure of intelligence. It's pretty great, but I can think of some directions of improvement.
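For readers without the paper to hand, the quantity under discussion is, roughly as I recall it from Legg & Hutter 2007 (check the original for the exact conditions on the environment class and the reward bound):

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}
```

where E is the class of computable environments, K(mu) is the Kolmogorov complexity of environment mu, and V^pi_mu is the expected total reward agent pi earns in mu. The items below question, in turn, the reward formalism, the boundedness of V, the absence of an explicit utility function, and the choice of prior over E.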
- Reinforcement learning. I think this term and formalism are historically from much simpler agent models which actually depended on being reinforced to learn. In its present form (Hutter 2005 section 4.1) it seems arbitrarily general, but it still feels kinda gross to me. Can we formalize AIXI and the intelligence measure in terms of utility functions, instead? And perhaps prove them equivalent?
- Choice of Horizon. AIXI discounts the future by requiring that total future reward is bounded, and therefore so does the intelligence measure. This seems to me like a constraint that does not reflect reality, and possibly an infinitely important one. How could we remove this requirement? (Much discussion on the "Choice of the Horizon" in Hutter 2005 section 5.7).
- Unknown utility function. When we reformulate it in terms of utility functions, let's make sure we can measure its intelligence/optimization power without having to know its utility function. Perhaps by using an average of utility functions weighted by their K-complexity.
- AI orientation. Finally, and least importantly, it tests agents across all possible programs, even those which are known to be inconsistent with our universe. This might be okay if your agent is playing arbitrary games on a computer, but if you are trying to determine how powerful an agent will be in this universe, you probably want to replace the Solomonoff prior with the posterior resulting from updating the Solomonoff prior with data from our universe.
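One possible way to make that last suggestion concrete (my own sketch, not something from the paper): reweight environments by how well they predict data D observed from our universe, so environments inconsistent with D get negligible weight.

```latex
w(\mu \mid D) \;\propto\; 2^{-K(\mu)} \, \mu(D),
\qquad
\Upsilon_{D}(\pi) \;=\; \sum_{\mu \in E} w(\mu \mid D) \, V^{\pi}_{\mu}
```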
Any thoughts or research on this by others? I imagine a lot of discussion has occurred on these topics; any references would be appreciated.
Combining causality with algorithmic information theory
Warning: maths.
Causal inference using the algorithmic Markov condition (Janzing and Schölkopf, 2008) replaces conditional independences between random variables, which define the structure of causal graphs, with algorithmic conditional independences between bit strings.
Conditional probabilities between variables become conditional complexities between strings, i.e. K(x|y) is the length of the shortest program that can generate the string x from y. Similarly, algorithmic mutual information I(x:y) is the amount of information that can be omitted in defining a string y given a shortest compressor for string x, I(x:y) = K(y) - K(y|x*). K(x,y) is the complexity of the concatenation of two strings x and y. These lead naturally to a definition of algorithmic conditional independence as I(x:y|z) = K(x|z) + K(y|z) - K(x,y|z) = 0 , where equality is defined up to the standard additive constant.
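Restating those definitions in display form (nothing here beyond what the paragraph above already says):

```latex
I(x:y) = K(y) - K(y \mid x^{*}), \qquad
I(x:y \mid z) = K(x \mid z) + K(y \mid z) - K(x, y \mid z)
```

with algorithmic conditional independence of x and y given z meaning I(x:y|z) = 0 up to the standard additive constant.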
Then a lot of sexy, confusing proofs happen. When the dust settles, it looks like if you take some strings describing observations, interpret them as nodes in a graph, and "factor" so that a certain algorithmic Markov condition holds (every node string should be algorithmically independent of its non-descendant node strings given the optimal compressor of its parents' node strings), then every node can be computed by an O(1) program run on a Turing machine, with the node's parents and a noise term as input (with each node's noise string being jointly independent of the others).
Notably, this means that if we make two observations which were "generated from their parents by the same complex rule", then we can "postulate another causal link between the nodes that explains the similarity of mechanisms". They say "complex rule" because the mutual algorithmic information between simple information strings, like some digits of pi, will be swallowed up by additive constants. Which all seems very close to rediscovering TDT.
There's more to the paper, but that's the tasty bit, so the summary ends here.
Are Magical Categories Relatively Simple?
In Magical Categories, Eliezer criticizes using machine learning to learn the concept of "smile" from examples. "Smile" sounds simple to humans but is actually a very complex concept. It only seems simple to us because we find it useful.
If we saw pictures of smiling people on the left and other things on the right, we would realize that smiling people go to the left and categorize new things accordingly. A supervised machine learning algorithm, on the other hand, will likely learn something other than what we think of as "smile" (such as "containing things that pass the smiley face recognizer") and categorize molecular smiley faces as smiles.
This is because simplicity is subjective: a human will consider "happy" and "person" to be basic concepts, so the intended definition of smile as "expression of a happy person" is simple. A computational Occam's Razor will consider this correct definition to be a more complex concept than "containing things that pass the smiley face recognizer". I'll use the phrase "magical category" to refer to concepts that have a high Kolmogorov complexity but that people find simple.
I hope that it's possible to create conditions under which the computer will have an inductive bias towards magical categories, as humans do. I think that people find these concepts simple because they're useful to explain things that humans want to explain (such as interactions with people or media depicting people). The video has pixels arranged in this pattern because it depicts a person who is happy because he is eating chocolate.
So, maybe it's possible to learn these magical categories from a lot of data, by compressing the categorizer along with the data. Here's a sketch of a procedure for doing this:
- Amass a large collection of data from various societies, containing photographs, text, historical records, etc.
- Come up with many categories (say, one for each noun in a long list). For each category, decide which pieces of data fit the category.
- Find categorizer_1, categorizer_2, ..., categorizer_n to minimize K(dataset + categorizer_1 + categorizer_2 + ... + categorizer_n)
What these terms mean:
- K(x) is the Kolmogorov complexity of x; that is, the length of the shortest (program,input) pair that, when run, produces x. This is uncomputable so it has to be approximated (such as through resource-bounded data compression).
- + denotes string concatenation. There should be some separator so the boundaries between strings are clear.
- dataset is the collection of data
- categorizer_k is a program that returns "true" or "false" depending on whether the input fits category #k
- When learning a new category, find new_categorizer to minimize K(dataset + categorizer_1 + categorizer_2 + ... + categorizer_n + new_categorizer) while still matching the given examples.
Note that while in this example we learn categorizers, in general it should be possible to learn arbitrary functions including probabilistic functions.
The fact that the categorizers are compressed along with the dataset will create a bias towards categorizers that use concepts useful in compressing the dataset and categorizing other things. From looking at enough data, the concept of "person" naturally arises (in the form of a recognizer/generative model/etc), and it will be used both to compress the dataset and to recognize the "person" category. In effect, because the "person" concept is useful for compressing the dataset, it will be cheap/simple to use in categorizers (such as to recognize real smiling faces).
A useful concept here is "relative complexity" (I don't know the standard name for this), defined as K(x|y) = K(x + y) - K(y). Intuitively this is how complex x is if you already understand y. The categorizer should be trusted in inverse proportion to its relative complexity K(categorizer | dataset and other categorizers); more complex (relative to the data) categorizers are more arbitrary, even given concepts useful for understanding the dataset, and so they're more likely to be wrong on new data.
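A minimal sketch of how this relative-complexity scoring could be approximated in practice, using an off-the-shelf compressor as a crude, computable stand-in for K. The dataset and the categorizer strings here are hypothetical placeholders; a serious attempt would need a far better compressor and real programs.

```python
import lzma

def C(s: bytes) -> int:
    """Compressed length: a crude, computable stand-in for Kolmogorov complexity K."""
    return len(lzma.compress(s))

def relative_complexity(x: bytes, y: bytes) -> int:
    """Approximate K(x|y) = K(x + y) - K(y) using compressed lengths."""
    return C(y + x) - C(y)

# Hypothetical stand-ins for the dataset and two candidate categorizers.
dataset = b"photographs, text, historical records ... " * 1000
categorizers = {
    "uses person concept": b"def categorize(d): return is_happy_person_expression(d)",
    "arbitrary lookup":    b"def categorize(d): return d in hardcoded_blob_list",
}
for name, src in categorizers.items():
    print(name, relative_complexity(src, dataset))
```

Lower relative complexity means the categorizer is less arbitrary given concepts already useful for compressing the data, which on the argument above is reason to trust it more on new inputs.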
If we can use this setup to learn "magical" categories, then Friendly AI becomes much easier. CEV requires the magical concepts "person" and "volition" to be plugged in. So do all seriously proposed complete moral systems. I see no way of doing Friendly AI without having some representation of these magical categories, either provided by humans or learned from data. It should be possible to learn deontological concepts such as "obligation" or "right", and also consequentialist concepts such as "volition" or "value". Some of these are 2-place predicates so they're categories over pairs. Then we can ask new questions such as "Do I have a right to do x in y situation?" All of this depends on whether the relevant concepts have low complexity relative to the dataset and other categorizers.
Using this framework for Friendly AI has many problems. I'm hand-waving the part about how to actually compress the data (approximating Kolmogorov complexity). This is a difficult problem but luckily it's not specific to Friendly AI. Another problem is that it's hard to go from categorizing data to actually making decisions. This requires connecting the categorizer to some kind of ontology. The categorization question that we can actually give examples for would be something like "given this description of the situation, is this action good?". Somehow we have to provide examples of (description,action) pairs that are good or not good, and the AI has to come up with a description of the situation before deciding whether the action is good or not. I don't think that using exactly this framework to make Friendly AI is a good idea; my goal here is to argue that sufficiently advanced machine learning can learn magical categories.
If it is in fact possible to learn magical categories, this suggests that machine learning research (especially related to approximations of Solomonoff induction/Kolmogorov complexity) is even more necessary for Friendly AI than it is for unFriendly AI. I think that the main difficulty of Friendly AI as compared with unFriendly AI is the requirement of understanding magical concepts/categories. Other problems (induction, optimization, self-modification, ontology, etc.) are also difficult but luckily they're almost as difficult for paperclip maximizers as they are for Friendly AI.
This has a relationship to the orthogonality thesis. Almost everyone here would agree with a weak form of the orthogonality thesis: that there exist general-optimizer AI programs into which you can plug any goal (such as paperclip maximization). A stronger form of the orthogonality thesis asserts that all ways of making an AI can be easily reduced to specifying its goals and its optimization separately; that is, K(AI) ~= K(arbitrary optimizer) + K(goals). My thesis here (that magical categories are simpler relative to data) suggests that the strong form is false. Concepts such as "person" and "value" have important epistemic/instrumental value and can also be used to create goals, so K(Friendly AI) < K(arbitrary optimizer) + K(Friendliness goal). There's really no problem with human values being inherently complex if they're not complex relative to data we can provide to the AI or information it will create on its own for instrumental purposes. Perhaps P(Friendly AI | AGI, passes some Friendliness tests) isn't actually so low even if the program is randomly generated (though I don't actually suggest taking this approach!).
I'm personally working on a programming language for writing and verifying generative models (proving lower bounds on P(data|model)). Perhaps something like this could be used to compress data and categories in order to learn magical categories. If we can robustly learn some magical categories even with current levels of hardware/software, that would be strong evidence for the possibility of creating Friendly AI using this approach, and evidence against the molecular smiley face scenario.
2011 Buhl Lecture, Scott Aaronson on Quantum Complexity
I was planning to post this in the main area, but my thoughts are significantly less well-formed than I thought they were. Anyway, I hope that interested parties find it worthwhile nonetheless.
In the Carnegie Mellon 2011 Buhl Lecture, Scott Aaronson gives a remarkably clear and concise review of P, NP, other fundamentals in complexity theory, and their quantum extensions. In particular, beginning around the 46 minute mark, a sequence of examples is given in which the intuition from computability theory would have accurately predicted physical results (and in some cases this actually happened, so it wasn't just hindsight bias).
In previous posts we have learned about Einstein's arrogance and Einstein's speed. This pattern of results flowing from computational complexity to physical predictions seems odd to me in that context. Here we are using physical computers to derive abstractions about the limits of computation, and from there we are successfully able to intuit limits of physical computation (e.g. brains computing abstractions of the fundamental limits of brains computing abstractions...). At what point do we hit the stage where individual scientists can rationally know that results from computational complexity theory are more fundamental than traditional physics? It seems like a paradox wholly different from Einstein rationally knowing (from examining bits of theory-space evidence rather than traditional-experiment-space evidence) that relativity would hold true. In what sort of evidence space can physical brain computation yielding complexity limits count as bits of evidence factoring into expected physical outcomes (such as the exponential smallness of the spectral gap of NP-hard Hamiltonians from the quantum adiabatic theorem)?
Maybe some contributors more well-versed in complexity theory can steer this in a useful direction.
[Links] The structure of exploration and exploitation
Inefficiencies are necessary for resilience:
Results suggest that when agents are dealing with a complex problem, the more efficient the network at disseminating information, the better the short-run but the lower the long-run performance of the system. The dynamic underlying this result is that an inefficient network maintains diversity in the system and is thus better for exploration than an efficient network, supporting a more thorough search for solutions in the long run.
Introducing a degree of inefficiency so that the system as a whole has the potential to evolve:
Efficiency is about maximising productivity while minimising expense. It's something that organisations have to do as part of routine management, but can only safely execute in stable environments. Leadership is not about stability; it is about managing uncertainty through changing contexts.
That means introducing a degree of inefficiency so that the system as a whole has the potential to evolve. Good leaders generally provide top cover for mavericks, listen to contrary opinions and maintain a degree of resilience in the system as a whole.
Systems that eliminate failure, eliminate innovation:
Innovation happens when people use things in unexpected ways, or come up against intractable problems. We learn from tolerated failure; without it, the world is sterile and dies. Systems that eliminate failure eliminate innovation.
Natural systems are highly effective but inefficient due to their massive redundancy:
Natural systems are highly effective but inefficient due to their massive redundancy (picture a tree dropping thousands of seeds). By contrast, manufactured systems must be efficient (to be competitive) and usually have almost no redundancy, so they are extremely vulnerable to breakage. For example, many of our modern industrial systems will collapse without a constant and unlimited supply of inexpensive oil.
I just came across those links here.
Might our "irrationality" and the patchwork-architecture of the human brain constitute an actual feature? Might intelligence depend upon the noise of the human brain?
A lot of progress is due to luck, in the form of the discovery of unknown unknowns. The noisiness and patchwork architecture of the human brain might play a significant role because it allows us to become distracted, to leave the path of evidence based exploration. A lot of discoveries were made by people pursuing “Rare Disease for Cute Kitten” activities.
How much of what we know was actually the result of people thinking quantitatively and attending to scope, probability, and marginal impacts? How much of what we know today is the result of dumb luck versus goal-oriented, intelligent problem solving?
My point is: what evidence do we have that intelligent, goal-oriented experimentation yields enormous advantages (enough to enable explosive recursive self-improvement) over evolutionary discovery, relative to its cost? What evidence do we have that any increase in intelligence vastly outweighs its computational cost and the time needed to discover it?
There is a significant difference between intelligence and evolution if you apply intelligence to the improvement of evolutionary designs:
- Intelligence is goal-oriented.
- Intelligence can think ahead.
- Intelligence can jump fitness gaps.
- Intelligence can engage in direct experimentation.
- Intelligence can observe and incorporate solutions of other optimizing agents.
But when it comes to unknown unknowns, what difference is there between intelligence and evolution? The critical similarity is that both rely on dumb luck when it comes to genuine novelty. And what, if not the dramatic improvement of intelligence, requires the discovery of novel unknown unknowns?
A basic argument supporting the risks from superhuman intelligence is that we don't know what it could possibly come up with. That is why we call it a 'Singularity'. But why does nobody ask how it knows what it could possibly come up with?
It is argued that the mind-design space must be large if evolution could stumble upon general intelligence. I am not sure how valid that argument is, but even if that is the case, shouldn't the mind-design space reduce dramatically with every iteration and therefore demand a lot more time to stumble upon new solutions?
An unquestioned assumption seems to be that intelligence is a kind of black box, a cornucopia that can sprout an abundance of novelty. But this implicitly assumes that if you increase intelligence you also decrease the distance between discoveries. Intelligence is no solution in itself; it is merely an effective searchlight for unknown unknowns. But who says that the brightness of the light increases in proportion to the distance between unknown unknowns? To have an intelligence explosion, the light would have to reach much farther with each generation than the distance between unknown unknowns grows. I just don't see that as a reasonable assumption.
It seems that if you increase intelligence you also increase the computational cost of its further improvement and the distance to the discovery of some unknown unknown that could enable another quantum leap. It seems that you need to apply a lot more energy to get a bit more complexity.
The greater a technology’s complexity, the more slowly it improves?
A new study by researchers at MIT and other institutions shows that it may be possible to predict which technologies are likeliest to advance rapidly, and therefore may be worth more investment in research and resources.
The researchers found that the greater a technology’s complexity, the more slowly it changes and improves over time. They devised a way of mathematically modeling complexity, breaking a system down into its individual components and then mapping all the interconnections between these components.
Link: nextbigfuture.com/2011/05/mit-proves-that-simpler-systems-can.html
Might this also be the case for intelligence? Can intelligence be effectively applied to itself? To paraphrase the question:
- If you increase intelligence, do you also decrease the distance between discoveries?
- Does an increase in intelligence vastly outweigh its computational cost and the expenditure of time needed to discover it?
- Would it be instrumental for an AGI to increase its intelligence rather than using its existing intelligence to pursue its terminal goal?
- Do the resources that are necessary to increase intelligence outweigh the cost of being unable to use those resources to pursue its terminal goal directly?
This reminds me of a post by Robin Hanson:
Minds are vast complex structures full of parts that depend intricately on each other, much like the citizens of a city. Minds, like cities, best improve gradually, because you just never know enough to manage a vast redesign of something with such complex inter-dependent adaptations.
Link: Is The City-ularity Near?
Of course, artificial general intelligence might differ in its nature from the complexity of cities. But do we have any evidence that hints at such a possibility?
Another argument made for an AI project causing a big jump is that intelligence might be the sort of thing for which there is a single principle. Until you discover it you have nothing, and afterwards you can build the smartest thing ever in an afternoon and can just extend it indefinitely. Why would intelligence have such a principle? I haven’t heard any good reason. That we can imagine a simple, all powerful principle of controlling everything in the world isn’t evidence for it existing.
Link: How far can AI jump?
(via Hard Takeoff Sources)