
Information theory and FOOM

Post author: PhilGoetz 14 October 2009 04:52PM

Information is power.  But how much power?  This question is vital when considering the speed and the limits of post-singularity development.  To address this question, consider two other domains in which information accumulates and is translated into an ability to solve problems: evolution, and science.

DNA Evolution

Genes code for proteins.  Proteins are composed of modules called "domains"; a protein contains from 1 to dozens of domains.  We classify genes into gene "families", which can be loosely defined as sets of genes that on average share >25% of their amino acid sequence and have a good alignment for >75% of their length.  The number of genes and gene families known doubles every 28 months; but most "new" genes code for proteins that recombine previously-known domains in different orders.

Almost all of the information content of a genome resides in the amino-acid sequence of its domains; the rest mostly indicates what order to use domains in individual genes, and how genes regulate other genes.  About 64% of domains (and 84% of those found in eukaryotes) evolved before eukaryotes split from prokaryotes about 2 billion years ago. (Michael Levitt, PNAS July 7 2009, "Nature of the protein universe"; S. Yooseph et al., "The Sorcerer II global ocean sampling expedition", PLoS Biol 5:e16.)  (Prokaryotes are single-celled organisms lacking a nucleus, mitochondria, or gene introns.  All multicellular organisms are eukaryotes.)

It's therefore accurate to say that most of the information generated by evolution was produced in the first one or two billion years; the development of more-complex organisms seems to have nearly stopped evolution of protein domains.  (Multi-cellular organisms are much larger and live much longer; therefore there are many orders of magnitude fewer opportunities for selection in a given time period.)  Similarly, most evolution within eukaryotes seems to have occurred during a period of about 50 million years leading up to the Cambrian explosion, half a billion years ago.

My first observation is that evolution has been slowing down in information-theoretic terms, while speeding up in terms of the intelligence produced.  This means that adding information to the gene pool increases the effective intelligence that can be produced using that information by a more-than-linear amount.

In the first of several irresponsible assumptions I'm going to make, let's assume that the information evolved in time t is proportional to i = log(t), while the intelligence evolved is proportional to e^t = e^(e^i).  I haven't done the math to support those particular functions; but I'm confident that they fit the data better than linear functions would.  (This assumption is key, and the data should be studied more closely before taking my analysis too seriously.)
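These assumed forms can be sketched numerically (a toy illustration only; the function names and the arbitrary time units are mine, not data-driven):

```python
import math

def information(t):
    # Assumed form from the text: information evolved by time t is i = log(t).
    return math.log(t)

def intelligence(t):
    # Assumed form from the text: intelligence evolved by time t is
    # e^t, which equals e^(e^i) since i = log(t).
    return math.exp(t)

# Information accumulates ever more slowly, while the intelligence
# producible from it grows ever faster -- a more-than-linear relationship.
for t in [1, 2, 4, 8]:
    print(f"t={t}: i={information(t):.2f}, intelligence={intelligence(t):.1f}")
```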

My second observation is that evolution occurs in spurts.  There's a lot of data to support this, including data from simulated evolution; see in particular the theory of punctuated equilibrium, and the data from various simulations of evolution in Artificial Life and Artificial Life II.  But I want to single out the eukaryote-to-Cambrian-explosion spurt.  The evolution of the first eukaryotic cell suddenly made a large subset of organism-space more accessible; and the speed of evolution, which normally decreases over time, instead increased for tens of millions of years.

Science!

The following discussion relies largely on de Solla Price's Little Science, Big Science (1963), Nicholas Rescher's Scientific Progress: A Philosophical Essay on the Economics of Research in Natural Science (1978), and the data I presented in my 2004 TransVision talk, "The myth of accelerating change".

The growth of "raw" scientific knowledge is exponential by most measures: Number of scientists, number of degrees granted, number of journals, number of journal articles, number of dollars spent.  Most of these measures have a doubling time of 10-15 years.  (GDP has a doubling time closer to 20 years, suggesting that the ultimate limits on knowledge may be economic.)
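For concreteness, a fixed doubling time T implies total growth of 2^(years/T) over any span; a sketch with illustrative numbers from the ranges above:

```python
def growth_factor(years, doubling_time):
    # Total multiplicative growth over `years` at a fixed doubling time.
    return 2 ** (years / doubling_time)

# Over a century: raw scientific output doubling every ~12.5 years
# versus GDP doubling every ~20 years.
print(growth_factor(100, 12.5))  # 256-fold (2^8)
print(growth_factor(100, 20))    # 32-fold (2^5)
```

The gap between the two factors is one way to see why economics, rather than manpower, might be the binding constraint.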

The growth of "important" scientific knowledge, measured by journal citations, discoveries considered worth mentioning in histories of science, and perceived social change, is much slower; if it is exponential, it appears IMHO to have had a doubling time of 50-100 years between 1600 and 1940.  (It can be argued that this growth began slowing down at the onset of World War II, and more dramatically around 1970).  Nicholas Rescher argues that important knowledge = log(raw information).

A simple argument supporting this is that "important" knowledge is the number of distinctions you can make in the world; and the number of distinctions you can draw based on a set of examples is of course proportional to the log of the size of your data set, assuming that the different distinctions are independent and equiprobable, and your data set is random.  However, an opposing argument is that log(i) is simply the amount of non-redundant information present in a database with uncompressed information i.  (This appears to be approximately the case for genetic sequences.  IMHO it is unlikely that scientific knowledge is that redundant; but that's just a guess.)  Therefore, important knowledge is somewhere between O(log(information)) and O(information), depending on whether information is closer to O(raw information) or O(log(raw information)).
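The first argument can be made concrete: n equiprobable, independent examples pin down about log2(n) binary distinctions (a sketch under exactly the idealizations stated above; the function is mine):

```python
import math

def distinctions(n_examples):
    # Independent, equiprobable binary distinctions recoverable
    # from n random examples: log2(n).
    return math.log2(n_examples)

# Squaring the size of the data set only doubles the distinctions you can draw:
print(distinctions(1_000))      # ~10
print(distinctions(1_000_000))  # ~20
```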

Analysis

We see two completely opposite pictures: in evolution, the efficaciousness of information increases more than exponentially with the amount of information; in science, it increases somewhere between logarithmically and linearly.

My final irresponsible assumption will be that the production of ideas, concepts, theories, and inventions ("important knowledge") from raw information is analogous to the production of intelligence from gene-pool information.  Therefore, evolution's efficacy at using the information present in the gene pool can give us a lower bound on the amount of useful knowledge that could be extracted from our raw scientific knowledge.

I argued above that the amount of intelligence produced from a given gene-information-pool i is approximately e^(e^i), while the amount of useful knowledge we extract from raw information i is somewhere between O(i) and O(log(i)).  The implication is that the fraction of discoveries that we have made, out of those that could be made from the information we already have, has an upper bound between O(1/e^(e^i)) and O(1/e^(e^(e^i))).
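One concrete reading of this ratio, using the post's own forms (possible discoveries ~ e^(e^i), human extraction ~ i or ~ log(i); the function and its name are mine, for illustration only):

```python
import math

def fraction_discovered(i, extraction="linear"):
    # Rough upper bound on the fraction of possible discoveries already made:
    # discoveries possible ~ e^(e^i); knowledge extracted ~ i or ~ log(i).
    made = i if extraction == "linear" else math.log(i)
    possible = math.exp(math.exp(i))
    return made / possible

# The fraction collapses toward zero even for small i:
for i in [2, 3, 4]:
    print(i, fraction_discovered(i), fraction_discovered(i, "log"))
```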

One key question in asking what the shape of AI takeoff will be, is therefore: Will AI's efficiency at drawing inferences from information be closer to that of humans, or that of evolution?

If the latter, then the number of important discoveries that an AI could make, using only the information we already have, may be between e^(e^i) and e^(e^(e^i)) times the number of important discoveries that we have made from it.  i is a large number representing the total information available to humanity.  e^(e^i) is a goddamn large number.  e^(e^(e^i)) is an awful goddamn large number.  Where before, we predicted FOOM, we would then predict FOOM^FOOM^FOOM^FOOM.

Furthermore, the development of the first AI will be, I think, analogous to the evolution of the first eukaryote, in terms of suddenly making available a large space of possible organisms.  I therefore expect the pace of information generation by evolution to suddenly switch from falling to increasing, even before taking into account recursive self-improvement.  This means that the rate of information increase will be much greater than can be extrapolated from present trends.  Supposing that the rate of acquisition of important knowledge will change from log(i = e^t) to e^t gives us FOOM^FOOM^FOOM^FOOM^FOOM, or 4FOOM.

This doesn't necessarily mean a hard takeoff.  "Hard takeoff" means, IMHO, FOOM in less than 6 months.  Reaching the e^e^e^i level of efficiency would require vast computational resources, even given the right algorithms; an analysis might find that the universe doesn't have enough computronium to even represent, let alone reason over, that space.  (In fact, this brings up the interesting possibility that the ultimate limits of knowledge will be storage capacity:  Our AI descendants will eventually reach the point where they need to delete knowledge from their collective memory in order to have the space to learn something new.)

However, I think this does mean FOOM.  It's just a question of when.

ADDED:  Most commenters are losing sight of the overall argument.  This is the argument:

  1. Humans have diminishing returns on raw information when trying to produce knowledge.  It takes more dollars, more data, and more scientists to produce a publication or discovery today than in 1900.
  2. Evolution has increasing returns on information when producing intelligence.  With 51% of the information in a human's DNA, you could build at best a bacterium.  With 95-99%, you could build a chimpanzee.
  3. Producing knowledge from information is like producing intelligence from information. (Weak point.)
  4. Therefore, the knowledge that could be inferred from the knowledge that we have is much, much larger than the knowledge that we have.
  5. An artificial intelligence may be much more able than us to infer what is implied by what it knows.
  6. Therefore, the Singularity may not go FOOM, but FOOM^FOOM.

Comments (93)

Comment author: whpearson 16 October 2009 10:24:17AM *  4 points [-]

Doesn't information have to be about something? Bits are not inherently powerful... proteins are about structure but they do not inherently win you any evolutionary races. I'd contend that there is a lot more information about how to survive in which proteins are in the genome and when they are transcribed.

You seem to be mixing up bits needed to replicate the genome with bits of information gained about the outside world and how to survive in it.

Edit: To give you an example of the difference. Consider a standard computer program that does something useful in the world. If you take the machine code, chop it up, and rearrange the segments, you are not likely to get a program that does anything useful. Yet it has the same number of bits (quite likely similar or even more complexity) as the original program.

Comment author: taw 14 October 2009 05:17:46PM 6 points [-]

The idea that pace of discovery slowed down is an extremely common and really obvious fallacy.

We only know that a discovery was important after it gets widely implemented, which happens decades after invention. Yet we count it as happening not at implementation time, but at invention time. So recent discoveries that will be implemented in the future are not counted at all, artificially lowering our discovery importance counts.

Also, if you use silly measures like railroad tracks per person, or max land mph, you will obviously not see much progress, since a large part of the progress is exploring new kinds of activities, not just making old activities more efficient. Any constant criterion like that will underestimate progress.

Comment author: Perplexed 13 October 2010 04:45:27PM 3 points [-]

The idea that pace of discovery slowed down is an extremely common and really obvious fallacy.

The idea can't be a fallacy. What you mean is that the usual argument for this idea contains an obvious fallacy.

It is an important distinction because reversed stupidity is not intelligence. Identifying the fallacy doesn't prove that the pace of discovery has not slowed.

Comment author: PhilGoetz 14 October 2009 05:30:37PM *  3 points [-]

The idea that pace of discovery slowed down in the 20th century is a parenthetical digression, and has no bearing on the analysis in this post.

Also, if you use silly measures like railroad tracks per person, or max land mph, you will obviously not see much progress, since a large part of the progress is exploring new kinds of activities, not just making old activities more efficient. Any constant criterion like that will underestimate progress.

But it's okay when Ray Kurzweil does it? He is underestimating progress by doing so? What measures are less silly?

Comment author: taw 14 October 2009 06:58:06PM 1 point [-]

The idea that pace of discovery slowed down in the 20th century is a parenthetical digression, and has no bearing on the analysis in this post.

It seemed vaguely related to your exps and logs.

What measures are less silly?

There are many locally valid measures, but all become ridiculous when applied to the wrong times. It seems to me that GDP/capita is the least bad measure at the moment, but it very likely won't do too far in the past or too far in the future.

I have no idea what Kurzweil is doing.

Comment author: PhilGoetz 14 October 2009 07:34:17PM 1 point [-]

It seemed vaguely related to your exps and logs.

It is related, which is why I mentioned it. But it isn't a link in the chain of reasoning.

Comment author: gwern 15 October 2009 12:06:52AM *  5 points [-]

I don't quite follow the whole thing (too many Big Os and exponents for me to track the whole thing), but wouldn't it be quite relevant given your observations about S-curves in the development of microbes?

What's to stop us from saying that science has hit its S-curve's peak of how much it could extract from the data, and that an AI would be similarly hobbled, especially if we bring in statistical studies like Charles Murray's Human Accomplishment, which argues that up to 1950, long enough ago that recency effects ought to be gone, major scientific discoveries show a decline from peaks in the 1800s or whenever? (Or that mammalian intelligences have largely exhausted the gains?)

Eliezer may talk about how awesome a Solomonoff-inducting intelligence would be and writes stories about how much weak superintelligences could learn, but that's still extrapolation which could easily fail (eg. we know the limits on maximum velocity and have relatively good ideas how one could get near the speed of light, but we're not very far from where we began, even with awesome machines).

Comment author: PhilGoetz 15 October 2009 12:35:49AM 0 points [-]

I see what you're saying. That would lead to a more complicated analysis, which I'm not going to do, since people here don't find this approach interesting.

Comment author: gwern 13 October 2010 03:17:14PM *  0 points [-]

If an idea is important and interesting to you, then I think that's enough justification. The post isn't negative, after all.

Comment author: timtyler 14 October 2010 08:36:07PM 1 point [-]

I don't think there is any consensus on how to measure innovation. So, before dealing with the question, one must first be clear about what form of measurement is being used - otherwise nobody will know what you are talking about.

Comment author: timtyler 14 October 2009 06:55:53PM *  4 points [-]

Re: ""Hard takeoff" means, IMHO, FOOM in less than 6 months."

Nobody ever specifies when the clock is going to start ticking. We already have speech recognition, search oracles, stockmarket wizards, and industrial automation on a massive scale. Machine intelligence has been under construction for at least 60 years - and machines have been taking people's jobs for over 100 years.

If your clock isn't ticking by now, then what exactly are you waiting for?

Comment author: PhilGoetz 14 October 2009 07:24:04PM 3 points [-]

Valid point. I spoke loosely. I was referring to the EY - RH debate, in which "6 months" usually meant 6 months from proof-of-concept, or else from not being able to stay under the radar, to FOOM.

Comment author: timtyler 14 October 2009 08:08:48PM 3 points [-]

Well, it's not just you at fault. Nobody who advocates the idea that some future period of change is rapid ever seems to pin down what they mean to the point where their view is falsifiable. People need to say when they are starting and stopping their clocks - or else they might just as well be mumbling mystical incantations.

Explosive increases in information technology are happening NOW.

If you are looking for an information technology explosion, then THIS IS IT.

Comment author: billswift 15 October 2009 04:26:16PM 2 points [-]

But not as explosive as it will (may) be in the future; this is what has people worried.

Comment author: timtyler 15 October 2009 07:24:30PM 1 point [-]

How do you measure the "explosiveness" of an explosion? By measuring the base of its exponential growth? If so, I would classify the hypothesis that the explosion will get "more explosive" as a somewhat speculative one. The growth in CPU clock rates has already run out of steam. The explosion may well get faster in some areas - but getting more "explosive" has got to be different from that.

Today, information technology is exploding in the same sense that the nuclei in a nuclear bomb are exploding - by exhibiting exponential growth processes. The explosion is already happening. How fast it is happening is really another issue.

Comment author: gwern 07 August 2011 04:30:05PM *  2 points [-]

In the first of several irresponsible assumptions I'm going to make, let's assume that the information evolved in time t is proportional to i = log(t), while the intelligence evolved is proportional to et = ee^i. I haven't done the math to support those particular functions; but I'm confident that they fit the data better than linear functions would.

This may be covered by the following assumption about 'spurts', but this doesn't seem to work for me.

If intelligence really could jump like that, shouldn't we expect to see that in humans already? For example, shouldn't we expect to see small mutations or genes with outsized effects on intelligence? Instead, we see that even a highly inbred population with many dozens of nasty genetic problems like the Ashkenazi only get 10 or 20 IQ points*, and we see a long-term stagnation in cranial capacity, and genetic surveys seem to (as far as I've heard) turn up hundreds or thousands of genetic variations weakly linked to small IQ increases. (I cover some related points in my article on evolution & drugs.) All of this makes intelligence look like it has a logarithmic relationship with diminishing returns.

* My understanding is that on a hypothetical 'absolute' scale of intelligence, as you get smarter, each IQ point corresponds to less and less 'actual' intelligence, due to the bell curve/relative ranking that IQ is - it's an ordinal scale, not a cardinal scale.

Comment author: gwern 09 August 2011 04:43:29PM 3 points [-]

hundreds or thousands of genetic variations weakly linked to small IQ increases

For example, I may be misinterpreting this new study http://www.guardian.co.uk/science/2011/aug/09/genetic-differences-intelligence but it seems to back me up:

"To test his idea, researchers looked at more than half a million locations in the genetic code of 3,511 unrelated adults. Each of these sites is where people are known to have single-letter variations in their DNA, called single nucleotide polymorphisms (SNPs). These variations were correlated with the individuals' performance in two types of psychometric tests that are established in assessing intelligence: one test measuring recalled knowledge (via vocabulary) and the second measuring problem-solving skills.

They found that 40% of the variation in knowledge (called "crystallised intelligence" by the researchers) and 51% of the variation in problem-solving skills ("fluid-type intelligence") between individuals could be accounted for by the differences in DNA. The results are published on Tuesday in the journal Molecular Psychiatry.

..."It is the first to show biologically and unequivocally that human intelligence is highly polygenic [involving lots of genes] and that purely genetic (SNP) information can be used to predict intelligence," Deary wrote in the journal paper.

Though the researchers now know the proportion of the variation in intelligence that is likely to be a result of genes, they do not know which genes are likely to be most important in determining intelligence. "If they can be found, and if we want to follow them up, to find out some of the mechanisms that underlie successful thinking, our best guess at present is that the number is huge. It could be many thousands," said Deary. "That could be a limitation to progress using this type of research."

From the abstract, "Genome-wide association studies establish that human intelligence is highly heritable and polygenic":

"General intelligence is an important human quantitative trait that accounts for much of the variation in diverse cognitive abilities. Individual differences in intelligence are strongly associated with many important life outcomes, including educational and occupational attainments, income, health and lifespan. Data from twin and family studies are consistent with a high heritability of intelligence, but this inference has been controversial. We conducted a genome-wide analysis of 3511 unrelated adults with data on 549 692 single nucleotide polymorphisms (SNPs) and detailed phenotypes on cognitive traits. We estimate that 40% of the variation in crystallized-type intelligence and 51% of the variation in fluid-type intelligence between individuals is accounted for by linkage disequilibrium between genotyped common SNP markers and unknown causal variants. These estimates provide lower bounds for the narrow-sense heritability of the traits. We partitioned genetic variation on individual chromosomes and found that, on average, longer chromosomes explain more variation. Finally, using just SNP data we predicted ~1% of the variance of crystallized and fluid cognitive phenotypes in an independent sample (P=0.009 and 0.028, respectively). Our results unequivocally confirm that a substantial proportion of individual differences in human intelligence is due to genetic variation, and are consistent with many genes of small effects underlying the additive genetic influences on intelligence."

Comment author: CarlShulman 09 August 2011 09:00:31PM 1 point [-]

Interesting. This study is a significant positive update for the feasibility of embryo selection for intelligence: it means that sufficiently enormous/high-powered GWAS studies can give good estimates of genetic potential for IQ in embryos. If common SNPs were less important relative to rare deleterious variants (in explaining heritability), then embryo selection would be complicated by the need to attribute effects to novel rare mutations (without having those properties made immediately clear by the population studies) based on physiological models.

Comment author: gwern 09 August 2011 09:52:34PM 1 point [-]

Well, it's good news if you didn't expect it to be possible at all (is that anyone here?), but it's bad news if you were expecting it to be easy or give high gains.

The result seems to say only that X percent of the genome was related in any way; when it comes time to actually predict intelligence, they only get '1% of the variance of crystallized and fluid cognitive phenotypes in an independent sample'. Given that they cover a lot of genetic information and that with this sort of thing, there seem to be diminishing returns, that suggests the final product will only be a few percent, and nowhere near the ceiling set by genetic influence. Maybe a few points are worthwhile, but embryo selection is an expensive procedure...

Comment author: CarlShulman 09 August 2011 11:51:32PM 1 point [-]

We already knew that there weren't common variants of large effect. Conditioning on that, more heritability from common variants of small effect is better for embryo selection than heritability from rare variants.

Comment author: CarlShulman 09 August 2011 06:04:36PM *  1 point [-]

My understanding is that on a hypothetical 'absolute' scale of intelligence, as you get smarter, each IQ point corresponds to less and less 'actual' intelligence, due to the bell curve/relative ranking that IQ is - it's an ordinal scale, not a cardinal scale.

In what sense? As you go to higher IQs, each additional IQ point means a greater (multiplicative) difference in the rarity of individuals at that level. Studies like those of Benbow and Terman show sizable continuing practical impact of increasing IQ on earnings, patents, tenured academic positions, etc.

ETA: Thanks for the clarification.

Comment author: gwern 09 August 2011 06:48:42PM 1 point [-]

Because of the construction of the tests. As you go to higher points, each point represents fewer and fewer correctly answered questions. Matrix IQ tests can be mechanically generated by combining simple rules, and they show the same bell curve norms despite what look only like linear increases in number of rules or complexity.

And the Benbow and Terman studies and others do show practical impact, but they don't show a linear impact where each IQ point is as valuable as the previous, and they certainly do not show increasing marginal returns where the next IQ point gives greater benefits than before!
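Both points in this exchange, the multiplicative rarity per point and the ordinal nature of the scale, can be illustrated with the standard IQ norming (mean 100, SD 15); a sketch using Python's statistics.NormalDist:

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)  # standard IQ norming

def rarity(score):
    # One-in-N rarity of scoring at or above `score`.
    return 1 / (1 - iq.cdf(score))

# Each 15-point step multiplies rarity by an ever-larger factor,
# even though the underlying item-level differences need not grow:
for score in [115, 130, 145, 160]:
    print(score, round(rarity(score)))
```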

Comment author: Vladimir_Nesov 09 August 2011 07:11:40PM *  0 points [-]

Do you mean "at higher IQ values each additional point corresponds to less and less additional (expected) intelligence" or "at higher IQ values each additional point corresponds to less and less total intelligence"?

Comment author: gwern 09 August 2011 08:03:51PM 1 point [-]

I mean the former - diminishing returns in measured intelligence (IQ) versus actual intelligence.

(I definitely am not saying that IQ points are uncorrelated with actual intelligence at some point, or inversely correlated!)

Comment author: CarlShulman 09 August 2011 07:51:00PM 0 points [-]

I didn't thus misinterpret: my prior on the latter meaning is low.

Comment author: Vladimir_Nesov 09 August 2011 07:53:44PM *  0 points [-]

(I corrected my comment before you replied, sorry for acting confused.)

Comment author: Swimmer963 09 August 2011 06:22:13PM 0 points [-]

I cover some related points in my article on evolution & drugs.

I followed the link and read the page. Fascinating!

Comment author: gwern 09 August 2011 09:51:01PM 0 points [-]

You're welcome. If you have any suggestions for further examples, I'd be glad to hear them - the essay is kind of skinny for such a grand principle.

Comment author: taw 14 October 2009 08:34:09PM 2 points [-]
Comment author: LauraABJ 15 October 2009 02:38:55AM 1 point [-]

I think this post presents a very interesting view of the information explosion. Even the task of self-improvement will undergo an evolution of sorts, and we have no better example to draw from than genetic evolution. We have observed an increasing efficiency of information to directed behavior (intelligence as the article puts it), and it is yet to be seen what the limits of that efficiency may be.

Only one upvote? Really?

Comment author: Vladimir_Nesov 15 October 2009 03:53:43PM *  0 points [-]

Even the concepts involved in this "analysis" seem pretty meaningless, but when you start plugging them into "math" and "exponentials", it results in a meaninglessness singularity, that is, a point at which all sanity breaks down and woo ensues!

Comment author: Kaj_Sotala 15 October 2009 05:16:02PM 2 points [-]

Downvoted because the criticism is too vague to really reply to. This isn't the first time I've observed it in your comments; I'd recommend elaborating more if you want them to be useful. (On the other hand, often - including now - your comments do seem like they'd have an insightful idea behind them, which is why I took the time to contribute this comment. )

Comment author: Vladimir_Nesov 15 October 2009 05:28:35PM 0 points [-]

It's more of a voice of dissent than criticism, naming an issue rather than constructing it. There is a tradeoff between elaborating and staying silent: on one side too much effort, on another absence of data about what intuition says. I don't feel it's OK for this post to have a positive score, and I'm pretty sure about this judgment even without making details explicit for myself. Sometimes that's a fallacy, of course.

Comment author: Kaj_Sotala 15 October 2009 06:24:39PM 1 point [-]

Fair enough. Though I don't think it'd take that much work to make the comment in question more constructive. A sentence or two about why the concepts used are useless would already help a lot.

Comment author: bogus 14 October 2009 05:21:55PM 0 points [-]

What's your justification for the claim that "almost all of the information content of an organism resides in the amino-acid sequence of its domains"?

For your claims about "speed of evolution" to make any sense, it must be the case that we could get rid of the information content which does not reside in these sequences with minimal losses in evolutionary fitness. My guess is that this is not the case, hence your measure of "information" is quite suspect.

Comment author: PhilGoetz 14 October 2009 05:29:10PM *  0 points [-]

What's your justification for the claim that "almost all of the information content of an organism resides in the amino-acid sequence of its domains"?

The paragraphs before that claim, plus the fact that the fraction of DNA devoted to regulation, such as promoter sequences, is a small fraction of that devoted to coding, and is also much more redundant. In short: That is the result you find when you measure it.

For your claims about "speed of evolution" to make any sense, it must be the case that we could get rid of the information content which does not reside in these sequences with minimal losses in evolutionary fitness.

I see no reason to think that. You can often kill an organism by changing 1 bit in its genome.

Comment author: SilasBarta 14 October 2009 05:42:14PM 1 point [-]

What's your justification for the claim that "almost all of the information content of an organism resides in the amino-acid sequence of its domains"?

The paragraphs before that claim, plus the fact that the fraction of DNA devoted to regulation, such as promoter sequences, is a tiny fraction of that devoted to coding, and is also much more redundant. In short: That is the result you find when you measure it. This is fact, not opinion.

Even so, you're saying that I have virtually all of the information needed to (in principle) reconstruct an organism once I see its DNA (which tells me the protein domains and their order therein). What about the information about its ontogenic process (the surrounding womb/shell that enables chemical reactions to happen in just the right way), its physical injuries and modifications, its upbringing, its diet, its culture ...?

Comment author: PhilGoetz 14 October 2009 05:47:26PM 2 points [-]

Even so, you're saying that I have virtually all of the information needed to (in principle) reconstruct an organism once I see its DNA.

No, I'm not saying that. Yes, there is extra-genomic information. I'm sure that the increase in intelligence happens because more-complex creatures can extract more information from the environment. But that is an output, not an input, in my analysis; it goes into "intelligence", not into "information input". I am asking what function translates genomic information (plus epigenetic information) into "intelligence", or an organism's ability to solve problems.

Comment author: SilasBarta 14 October 2009 07:28:40PM *  2 points [-]

No, I'm not saying that. Yes, there is extra-genomic information. I'm sure that the increase in intelligence happens because more-complex creatures can extract more information from the environment.

Yes, you were saying "that" where "that" refers to "almost all of the information content of an organism resides in the amino-acid sequence of its domains". And that statement means that, but for practical difficulties, the DNA suffices to tell you what you need to do to build the organism (though you left a caveat that you might still need a small amount of extra information, which I assumed to mean e.g. age).

If you mean something else by it, then you're using the terms in a non-standard way.

This isn't an issue of organisms being able to extract more information from the environment; irrespective of how much information it extracts from the environment, you still need lots of information in addition to the genome to make a copy -- and this is a big part of why Jurassic Park hasn't already happened.

(By the way, based on our previous exchanges, we seem to be looking at similar problems and could help each other: one big hole in my knowledge is that of organic chemistry and thus how existing self-replicators work at the chemical level.)

Comment author: PhilGoetz 14 October 2009 07:32:35PM 1 point [-]

Yes, you were saying "that" where "that" refers to "almost all of the information content of an organism resides in the amino-acid sequence of its domains".

I misspoke. I've fixed it now.

Comment author: MichaelBishop 14 October 2009 10:22:59PM 1 point [-]

I assume you mean you've edited your previous comment. I'd appreciate it if, when people edit their posts or comments, they indicate that they have done so, and ideally how, in the very same comment/post. That said, I don't want to be so nitpicky as to discourage contributions.

Comment author: billswift 15 October 2009 04:41:47PM *  0 points [-]

"almost all of the information content of an organism resides in the amino-acid sequence of its domains" is an accurate statement. The uterine environment is involved with the creation of another organism, but it is not part of the information content of the organism, except in the sense that everything that happens to an organism is recorded in damages to that organism.

Comment author: timtyler 14 October 2009 07:03:17PM 0 points [-]

"Almost all of the information content of an organism resides in the amino-acid sequence of its domains" seems simply wrong to me. Whatever the intended point was, it needs rephrasing.

Comment author: PhilGoetz 14 October 2009 07:32:55PM 1 point [-]

Rephrased.

Comment author: bogus 14 October 2009 05:39:38PM *  0 points [-]

the fraction of DNA devoted to regulation, such as promoter sequences, is a tiny fraction of that devoted to coding

Given that most of DNA is junk, considering "information" in raw storage terms makes little sense. It may be a tiny fraction, but is it an important contributor to genetic fitness? If it is, then it's hard to argue that evolution has slowed down.

I see no reason to think that. You can often kill an organism by changing 1 bit in its genome.

ETA: I'm not disputing that, but see above. I'm trying to qualify the information's overall contribution to genetic fitness.

Comment author: PhilGoetz 14 October 2009 05:41:24PM 0 points [-]

I'm only arguing about the speed at which information is produced, and the speed at which intelligence is produced.

Comment author: gwern 15 August 2011 03:53:30AM *  1 point [-]

Here's another interesting set of quotes; are we even correct in assuming the most recent percent of DNA matters much? After all, chimps outperform humans in some areas like monkey ladder. From "If a Lion Could Talk":

"Giving a blind person a written IQ test is obviously not a very meaningful evaluation of his mental abilities. Yet that is exactly what many cross-species intelligence tests have done. Monkeys, for example, were found not only to learn visual discrimination tasks but to improve over a series of such tasks -- they formed a learning set, a general concept of the problem that betokened a higher cognitive process than a simple association. Rats given the same tasks showed difficulty in mastering the problems and no ability to form a learning set. The obvious conclusion was that monkeys are smarter than rats, a conclusion that was comfortably accepted, as it fit well with our preexisting prejudices about the distribution of general intelligence in nature. But when the rat experiments were repeated, only this time the rats were given the task of discriminating different smells, they learned quickly and showed rapid improvement on subsequent problems, just as the monkeys did.

The problem of motivation is another major confounding variable. Sometimes we may think we are testing an animal's brain when we are only testing its stomach. For example, in a series of studies goldfish never learned to improve their performance when challenged with "reversal" tasks. These are experiments in which an animal is trained to pick one of two alternative stimuli (a black panel versus a white panel, say) in order to obtain a food reward; the correct answer is then switched and the subject has to relearn which one to pick. Rats quickly learned to switch their response when the previously rewarded answer no longer worked. Fish didn't. This certainly fit comfortably with everyone's sense that fish are dumber than rats. But when the experiment was repeated with a different food reward (a paste squirted into the tank right where the fish made its correct choice, as opposed to pellets dropped into the back of the tank), lo and behold the goldfish suddenly did start improving on reversal tasks. Other seemingly fundamental learning differences between fish and rodents likewise vanished when the experiments were redesigned to take into account differences in motivation.

Equalizing motivation is an almost insoluble problem for designers of experiments. Are three goldfish pellets the equivalent of one banana or fifteen bird seeds? How could we even know? We would somehow have to enter into the internal being of different animals to know for sure, and if we could do that we would not need to be devising roundabout experiments to probe their mental processes in the first place.

When we do control for all of the confounding variables that we possibly can, the striking thing about the "pure" cognitive differences that remain is how the similarities in performance between different animals given similar problems vastly outweigh the differences. To be sure, there seems to be little doubt that chimpanzees can learn new associations with a single reinforced trial, and that that is genuinely faster than other mammals or pigeons do it. Monkeys and apes also learn lists faster than pigeons do. Apes and monkeys seem to have a faster and more accurate grasp of numerosity judgments than birds do. The ability to manipulate spatial information appears to be greater in apes than in monkeys.

But again and again experiments have shown that many abilities thought the sole province of "higher" primates can be taught, with patience, to pigeons or other animals. Supposedly superior rhesus monkeys did better than the less advanced cebus monkeys in a visual learning-set problem using colored objects. Then it turned out that the cebus monkeys did better than the rhesus monkeys when gray objects were used. Rats were believed to have superior abilities to pigeons in remembering locations in a radial maze. But after relatively small changes in the procedure and the apparatus, pigeons did just as well.

If such experiments had shown, say, that monkeys can learn lists of forty-five items but pigeons can only learn two, we would probably be convinced that there are some absolute differences in mental machinery between the two species. But the absolute differences are far narrower. Pigeons appear to differ from baboons and people in the way they go about solving problems that involve matching up two images that have been rotated one from the other, but they still get the right answers. They essentially do just as well as monkeys in categorizing slides of birds or fish or other things. Euan Macphail's review of the literature led him to conclude that when it comes to the things that can be honestly called general intelligence, no convincing differences, either qualitative or quantitative, have yet been demonstrated between vertebrate species. While few cognitive researchers would go quite so far -- and indeed we will encounter a number of examples of differences in mental abilities between species that are hard to explain as anything but a fundamental difference in cognitive function -- it is striking how small those differences are, far smaller than "common sense" generally has it. Macphail has suggested that the "no-difference" stance should be taken as a "null hypothesis" in all studies of comparative intelligence; that is, it is an alternative that always has to be considered and ought to be assumed to be the case unless proven otherwise."

EDIT: I've added this and some other points to my Evolutionary drug heuristics article.

Comment author: teageegeepea 14 October 2009 11:57:29PM 1 point [-]

Have you read "The 10,000 Year Explosion"? Cochran & Harpending (and Hawks and some others in the paper it's based on) argue that evolution has accelerated recently. The reason is that there is a larger population, so there are more new mutations to be selected. Also, because our environment is not in a steady state, our genes don't reach a steady state either (unlike horseshoe crabs or a number of other species). I've only read a bit past the first chapter, but it would seem relevant to your claim.

Comment author: PhilGoetz 15 October 2009 12:26:56AM *  2 points [-]

I'd be interested, but evolution over the past 20,000 years doesn't affect the argument I'm making here, which looks at a long-term trend in evolution.

ADDED: There are some factors that will increase genetic exchange and selective pressure, as discussed in some comments below; but note that increasing genetic exchange often slows evolution. There's a balance between being able to spread beneficial mutations, and reaching premature convergence; the "sweet spot" of that balance is with very small communities, much, much smaller than continent-sized. Some equations and data indicate that species diversity is much larger when the environment is fragmented into small areas with little communication (google "theory of island biogeography").

Comment author: timtyler 16 October 2009 05:31:33PM 0 points [-]

It sounds as though they are talking about human evolution - plus maybe the evolution of rats, lettuces and pigeons. The numbers of many other species have dwindled.

Comment author: SilasBarta 15 October 2009 03:42:50AM 0 points [-]

May be relevant, and seems to be consistent with your point: evolution has a speed limit and complexity bound.

Comment author: PhilGoetz 15 October 2009 03:52:22PM *  5 points [-]

Actually, that's based on the mistaken belief that selection can provide only 1 bit of information per generation. If you'll look down to the end of the original 2007 post, you'll see I gave the correct (and now Eliezer-approved) formulation, which is:

If you take a population of organisms, and you divide it arbitrarily into 2 groups, and you show the 2 groups to God and ask, "Which one of these groups is, on average, more fit?", and God tells you, then you have been given 1 bit of information.

But if you take a population of organisms, and ask God to divide it into 2 groups, one consisting of organisms of above-average fitness, and one consisting of organisms of below-average fitness, that gives you a lot more than 1 bit. It takes n lg(n) bits to sort the population; then you subtract out the information needed to sort each half, so you gain n lg(n) - 2(n/2)lg(n/2) = n[lg(n) - lg(n/2)] = nlg(2) = n bits.

If you do tournament selection, you have n/2 tournaments, each of which gives you 1 bit, so you get n/2 bits per generation.

ADDED: This doesn't immediately get you out of the problem, as n bits spread out among n genomes gives you 1 bit per genome. That doesn't mean, though, that you've gained only 1 bit for the species as a whole. The more-important observation in that summary is that organisms with more mutations are more likely to die, eliminating > 1 mutation per death on average.
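For concreteness, here's that counting argument as a small Python sketch (the function names are mine, not from the post):

```python
import math

def split_information_gain(n):
    """Bits gained when God splits the population into an above-average
    half and a below-average half:
    n*lg(n) - 2*(n/2)*lg(n/2) = n*lg(2) = n bits."""
    return n * math.log2(n) - 2 * (n / 2) * math.log2(n / 2)

def tournament_information_gain(n):
    """n/2 pairwise tournaments, each answering one yes/no question."""
    return n / 2

# The split-into-halves gain works out to exactly n bits.
for n in (8, 64, 1024):
    assert abs(split_information_gain(n) - n) < 1e-6
```

So a population of 1024 gains 1024 bits from a perfect fitness split, or 512 bits from one round of tournament selection -- far more than the 1 bit of the naive argument.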

Comment author: PhilGoetz 15 October 2009 09:54:34PM *  2 points [-]

This paragraph is more important:

Although the actual Genome Project's finding of 25,000 genes fits well under Yudkowsky's attempted bound, the mathematical argument failed. A computer simulation failed to bear out the bound, and the flaw appears to have been as follows: Even if one mutation creates one death, this does not mean that one death eliminates only a single mutation. Organisms bearing more deleterious mutations are more likely to lose the evolutionary competition, and so each death can eliminate more mutations than average. If mating is random and the least fit organisms are perfectly eliminated in every generation, the information supportable in the genome goes as the inverse square of the mutation rate.
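That effect is easy to reproduce in a toy simulation (my own sketch, not the simulation referred to above; the Poisson mutation counts and truncation selection are illustrative assumptions, and the Poisson sampler is hand-rolled because the stdlib `random` module lacks one):

```python
import random

def mutations_purged_per_death(pop_size=10000, mutation_rate=2.0, seed=0):
    """Truncation selection: the most-mutated half of the population dies.
    Returns (mean mutations per organism, mean mutations per death)."""
    rng = random.Random(seed)

    def poisson(lam):
        # Knuth's algorithm; fine for small lam.
        limit = pow(2.718281828459045, -lam)
        k, p = 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1

    counts = sorted(poisson(mutation_rate) for _ in range(pop_size))
    dead = counts[pop_size // 2:]  # most-mutated half dies
    mean_all = sum(counts) / pop_size
    mean_dead = sum(dead) / len(dead)
    return mean_all, mean_dead

mean_all, mean_dead = mutations_purged_per_death()
# Each death removes more mutations than the population average carries.
assert mean_dead > mean_all
```

With these numbers the dead carry noticeably more mutations than the population average, so each death purges more than its "share" of the mutational load.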

Comment author: Eliezer_Yudkowsky 15 October 2009 05:38:27PM 1 point [-]

That's not exactly Eliezer-approved, because now the real problem is to tell what the conditions are more like in nature - Worden or MacKay or somewhere in between. That's what I put up on the Wiki as summary of the state of information. Mathematical assumptions are cheaper than empirical truths.

Comment author: timtyler 15 October 2009 09:47:53PM 0 points [-]

If this is a discussion of Worden's paper, then you seem to have missed that he is not talking about information, but rather "Genetic Information in the Phenotype" - which is actually a completely different concept.

Comment author: PhilGoetz 15 October 2009 10:08:28PM 0 points [-]

he is not talking about information, but rather "Genetic Information in the Phenotype" - which is actually a completely different concept

How so?

Comment author: timtyler 16 October 2009 06:18:06AM 0 points [-]

For instance:

"GIP is a measure of how much the observed values i in a large population tend to cluster on a few values; if there is no clustering, Gµ=0, and if there is complete clustering on one value, Gµ= log2(Nµ). It is a property of the population, not of an individual."
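If I'm reading that definition right, GIP is just the redundancy of the population's distribution over the Nµ possible values -- log2(Nµ) minus its Shannon entropy. A sketch of that reading (mine, not Worden's code):

```python
import math
from collections import Counter

def gip(values, n_possible):
    """Redundancy reading of the quoted definition: log2(N) - H(population).
    Uniform spread over all N values -> 0; all on one value -> log2(N)."""
    counts = Counter(values)
    total = len(values)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(n_possible) - entropy

# No clustering: each of the 4 possible values equally represented.
assert abs(gip([0, 1, 2, 3], 4)) < 1e-9
# Complete clustering on one value.
assert abs(gip([2, 2, 2, 2], 4) - math.log2(4)) < 1e-9
```

On that reading it is indeed a property of the population's distribution, not of any individual genome.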

Comment author: SilasBarta 15 October 2009 03:54:43PM 0 points [-]

Ah. Sorry for not reading through the history, and thanks for the good explanation!

Comment author: timtyler 15 October 2009 08:19:07PM 0 points [-]

Worden? Essentially that's a crock. See:

http://alife.co.uk/essays/no_speed_limit_for_evolution/

Comment author: PeterKinnon 15 October 2009 03:31:18AM *  -2 points [-]

Sadly we here observe a retreat within the simple language of mathematics. I am not decrying mathematics nor am I underestimating the great value of that language in extending knowledge of the physical world by bypassing the complexities and irrelevancies common to the natural languages.

It does, however, suffer from two major weaknesses:

Firstly, like all languages, it is capable of generating fictions - entities and scenarios which have no correspondence with the real world.

Secondly, it is, like all reasoning or computational processes, raw-data sensitive. This is expressed in the venerable IT mantra "Garbage In - Garbage Out".

The second of these is, I believe, the main culprit in generating the conclusions that you affirm. For you seem to have fallen into the same trap as Jonathan Huebner who, using a rather arbitrary criterion of "significant advances" concluded that the rate of innovation has actually been decreasing since a maximum that occurred in 1873. A glance around the shelves of a Chemical Abstracts archive, for instance, will quickly tell a denizen of the real world that something is seriously wrong with that analysis.

Now it is very evident that not all aspects of human activities, even some technologies, are subject to exponential growth, as underlined in your very interesting presentation "The Myth of Accelerating Change". Similarly, the evolution of life has not exhibited a consistent acceleration in all its bifurcations and ramifications. The "significant steps" used by Huebner (even after allowing for the strong "self selection" effect involved in that analysis) are merely break points, sharp upheavals resulting from the accumulation of innumerable "baby steps" wherein lies the exponentiality.

The input data for your own analysis lies in the conceptual arbitrariness of "important knowledge" and "useful knowledge". These are the foundation of your argument. If their validity is in question, as I propose it is, then all the arithmetic in the world will not hold it together. Similarly extrapolations of arbitrary data to towers of exponentiality must be considered as pure flights of fancy. GIGO rules!

The great problem with ascribing importance and utility (citation hits certainly won't do) is in determining the entity to which these are relative. Certainly these value judgments will vary enormously between individual humans, and consensus is probably out of the question. We can tie it down a little by noticing that a fairly constant gross exponentiality appears to be tied to technology rather than other human activities. A distinction not easy to make, as many aspects of social and individual behavior are themselves driven by technological change. Such features as sex, art and religion are among those with some immunity.

One way to escape from this dilemma, however, if we are to properly interpret the patterns science observes in nature, is to learn the trick of stepping outside our (very natural) anthropocentric shell so that objectivity is not compromised. My book "Unusual Perspectives" (the electronic edition of which can be freely downloaded) uses this approach to arrive at the proposition that the evolution of living systems and the evolution of technology (with which it is contiguous) are components of an ongoing natural process. From this point of reference we can perhaps better determine what aspects are "important" or "useful" (to the life process). This stance certainly strongly suggests the advent of what some call a singularity but which I prefer to regard as something akin to a fairly imminent phase transition. I further speculate that, considering the history of the process and the apparent direction of its vector, the internet could, as the result of an inevitable self-assembly mechanism, be the most likely candidate for the next prime effector of the process. "Unusual Perspectives" can be downloaded from the dedicated website: www.unusual-perspectives.net

Comment author: PhilGoetz 15 October 2009 03:38:25PM *  1 point [-]

The input data for your own analysis lies in the conceptual arbitrariness of "important knowledge" and "useful knowledge". These are the foundation of your argument. If their validity is in question, as I propose it is, then all the arithmetic in the world will not hold it together.

Nope. If you'll look at the math, you'll see that I said "important knowledge" ranges somewhere between O(log(raw information)) and O(raw information). Important knowledge = O(raw information) means we do not make any distinction between raw information and "important" information.

the evolution of living systems and the evolution of technology (with which it is contiguous) are components of an ongoing natural process.

Some of the ancients would have said that human inventions and nature are fundamentally the same, since nature is the invention of God. Now some people say that technology and evolution are fundamentally the same, since humans are part of nature.

Whatever. I just want to know if the curves match.

Comment author: timtyler 14 October 2009 07:09:54PM *  0 points [-]

Long-lived organisms do reproduce and evolve more slowly - though note that evolution still acts on their germ line cells during their lifetime.

However, to jump from there to "evolution has been slowing down in information-theoretic terms" seems like a bit of a wild leap. Bacteria haven't gone away - and they are evolving as fast as ever. How come their evolution is not being counted?

Comment author: PhilGoetz 14 October 2009 07:21:15PM *  1 point [-]

I mentioned that 64% of the known protein domains, and 84% of those found in eukaryotes, evolved before the prokaryote-eukaryote split. This means that 73% of those found in bacteria evolved in the first billion years of life, before the prok-euk split about 2 billion years ago. So, roughly 3/4 of the information in bacterial genomes evolved during the first 1/3 of life on earth, meaning the rate of information generation in bacteria during that first 1/3 was about 6 times what it was in the next 2/3. This is more surprising than the observations about eukaryote evolution. I interpret it as meaning that competition became more intense as evolution progressed, allowing less experimentation and causing more confinement to local maxima.

Comment author: timtyler 14 October 2009 08:01:33PM 0 points [-]

I didn't find your figures in the (enormous) referenced papers. Without them, it isn't clear what the figures mean or where the 73% comes from.

Most proteins have not been discovered - and there is probably a bias towards discovering the ones that are shared with eucaryotes - which would distort the figures in favour of finding older genes.

Also, life started around 3.7 billion years ago. Also, it seems rather dubious to measure the rate of information change within evolution as the rate of information change within bacterial genomes. That doesn't consider the information present in the diversity of life.

Comment author: PhilGoetz 14 October 2009 09:57:55PM *  2 points [-]

In the Levitt paper, 64% is the number of single-domain architecture proteins that are found in at least two of the 3 groups viruses, prokaryotes, and eukaryotes (figure 3). This is my (very close) approximation for the fraction of families in eukaryotes or prokaryotes found in both eukaryotes and prokaryotes, which isn't reported. 84% is computed from that information, plus the caption of figure 3 saying that prokaryotes contain 88% of SDA families. 73% is computed from all of that information.
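For what it's worth, the 73% figure follows from those two numbers by simple division (my reconstruction of the arithmetic, assuming all shared families predate the split):

```python
shared = 0.64          # SDA families found in >= 2 of the 3 groups ("ancient")
in_prokaryotes = 0.88  # fraction of all SDA families found in prokaryotes

# Fraction of prokaryotic families that predate the prok-euk split:
ancient_in_prok = shared / in_prokaryotes
assert round(ancient_in_prok, 2) == 0.73
```

The 84% eukaryote figure comes out of the same kind of division, using the eukaryote coverage implied by the figure-3 caption.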

Most proteins have not been discovered - and there is probably a bias towards discovering the ones that are shared with eucaryotes - which would distort the figures in favour of finding older genes.

There is no bias towards discovering genes shared with eukaryotes in ordinary sequencing. We sequence complete genomes. Almost all of the bacterial genes known come from these whole-genome projects. We've sequenced many more bacteria than eukaryotes. Bacterial genomes don't contain much repetitive intergenic DNA, so you get nice complete genome assemblies.

Life starting 3.7 billion years ago - could be. Google's top ten show claims ranging from 2.7GY to 4.4GY ago. Adding that .7 billion could make the information-growth curve more linear, and remove one exponentiation in my analysis.

Also, it seems rather dubious to measure the rate of information change within evolution as the rate of information change within bacterial genomes. That doesn't consider the information present in the diversity of life.

Let's just say I'm measuring the information in DNA. Information in "the diversity of life" is too vague. I don't want to measure any information that an organism or an ecosystem gains from the environment by expressing those genetic codes.

Comment author: SilasBarta 14 October 2009 10:33:00PM 1 point [-]

Out of curiosity, has the protein problem yet been mathematically formalized so that it can be handed over to computers? That is, do we understand molecular dynamics well enough to automatically discover all possible proteins, starting from the known ones?

Comment author: PhilGoetz 15 October 2009 12:28:51AM 0 points [-]

We could list them in order, if that's what you mean. It would be a Library of Babel. Could we determine their structures? No. Need much much more computational power, for starters.

Comment author: SilasBarta 15 October 2009 03:20:44AM 1 point [-]

That's what I intended to refer to -- the tertiary structure. Have they thrown the full mathematical toolbox at it? I've heard that predicting it is NP-complete, in that it's order-dependent, etc., but have they thrown all the data-mining tricks at it to find out whatever additional regularity constrains the folding so as to make it less computationally intensive?

The reason I don't immediately assume the best minds have already considered this, is that there seems to be some disconnect between the biological and mathematical communities -- just recently I heard that they just got around to using the method of adjacency matrix eigenvectors (i.e. what Google's PageRank uses) for identifying critical species in ecosystems.

Comment author: pengvado 15 October 2009 09:52:17AM *  3 points [-]

Predicting the ground state of a protein is NP-hard. But nature can't solve NP-hard problems either, so predicting what actually happens when a protein folds is merely in BQP.

I would expect most proteins found in natural organisms to be in some sense easy instances of the protein folding problem, i.e. that the BQP method finds the ground state. Because the alternative is getting stuck in local minima, which probably means it doesn't fold consistently to the same shape, which is probably an evolutionary disadvantage. But if there are any remaining differences, then for the purpose of protein structure prediction it's actually the local minimum that's the right answer, and the NP problem that doesn't get solved is irrelevant.

And yes there are quantum simulation people hard at work on the problem, so it's not just biologists. But I don't know enough of the details to say whether they've exhausted the conventional toolbox of heavy-duty math yet.

Comment author: PhilGoetz 15 October 2009 03:33:20PM 0 points [-]

Why do you think nature can't solve NP-hard problems? When you dip a twisted wire with 3D structure into a dish of liquid soap and water, and pull it out and get a soap film, didn't it just solve an NP problem?

All of the bonds and atoms in a protein are "computing" simultaneously, so the fact that the problem is NP in terms of number of molecules isn't a problem. I don't understand BQP & so can't comment on that.

Incidentally, your observation about consistent folding is often right, but some proteins have functions that depend on their folding to different shapes under different conditions. Usually these shapes are similar. I don't know if any proteins routinely fold into 2 very different shapes.

Comment author: SilasBarta 15 October 2009 03:51:29PM 5 points [-]

Why do you think nature can't solve NP-hard problems? When you dip a twisted wire with 3D structure into a dish of liquid soap and water, and pull it out and get a soap film, didn't it just solve an NP problem?

Oh no. Ohhhhh no. Not somebody trying to claim that nature solves problems in NP in polynomial time because of bubble shape minimization again!

Like Cyan just said, nature does not solve the NP problem of a global optimal configuration. It just finds a local optimum, which is already known to be computationally easy! Here's a reference list.
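The difference is easy to see with plain gradient descent on a one-dimensional function with two minima (a toy of mine, nothing to do with actual soap-film physics): the system just rolls downhill, and which minimum it reaches depends entirely on where it starts.

```python
def grad_descent(f_prime, x0, lr=0.01, steps=5000):
    """Follow the local gradient downhill, as a physical system would."""
    x = x0
    for _ in range(steps):
        x -= lr * f_prime(x)
    return x

# f(x) = x^4 - 4x^2 + x has a global minimum near x = -1.47
# and a shallower local minimum near x = 1.35.
f_prime = lambda x: 4 * x**3 - 8 * x + 1

left = grad_descent(f_prime, x0=-0.5)
right = grad_descent(f_prime, x0=0.5)
# Different starting points settle into different minima;
# the right-hand start never finds the global minimum.
assert left < 0 < right
```

Finding *some* stationary point this way is computationally easy; it's certifying the *global* minimum that is hard.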

Comment author: Cyan 15 October 2009 03:45:31PM 4 points [-]

The more convoluted the wire structure, the more likely the soap film is to be in a stable sub-optimal configuration.

Comment author: pengvado 15 October 2009 04:24:34PM *  1 point [-]

All of the bonds and atoms in a protein are "computing" simultaneously,

And there are at most N^2 of them, so that doesn't transform exponential into tractable. It's not even a Grover speedup (2^N -> 2^(N/2)), which we do know how to get out of a quantum computer.

Comment author: Cyan 15 October 2009 03:07:56PM 1 point [-]

This is a nice insight.

Comment author: DanArmak 15 October 2009 04:08:05PM 0 points [-]

But nature can't solve NP-hard problems in general either, so predicting what actually happens when a protein folds is merely in BQP.

That explains why I've seen descriptions of folding prediction algorithms that run in polynomial time, on the order of n^5 or less with n = number of amino acids in primary chain.

I wanted to add that many proteins found in nature require chaperones to fold correctly. These can be any other molecules - usually proteins, RNA-zymes, or lipids - that influence the folding process to either assist or prevent certain configurations. They can even form temporary covalent bonds with the protein being folded. (Or permanent ones; some working proteins have attached sugars, metals, other proteins, etc.) And the protein making machinery in the ribosomes has a lot of complexity as well - amino acid chains don't just suddenly appear and start folding.

All this makes it much harder to predict the folding and action of a protein in a real cell environment. In vivo experiments can't be replaced by calculations without simulating a big chunk of the whole cell on a molecular level.

Comment author: SilasBarta 15 October 2009 11:39:04PM 0 points [-]

And, interestingly enough, slashdot just ran a story on progress in protein folding in the nucleus of a cell.

Comment author: timtyler 15 October 2009 08:45:25PM *  0 points [-]

So: with two data-points and a back-of-the-envelope calculation, you conclude that DNA-evolution has been slowing down?

It seems like pretty feeble evidence to me :-(

I should add that - conventionally - evolutionary rates are measured in 'darwins', and are based on trait variation (not variation in the underlying genotypes) because of how evolution is defined.

Comment author: PhilGoetz 16 October 2009 03:48:39AM 2 points [-]

Dude. It's an idea. I said not to take my conclusions too seriously. This is not a refereed journal. Why do you think your job is only to find fault with ideas, and never to play with them, try them out, or look for other evidence?

Different people measure evolutionary rate or distance differently, depending on what their data is. People studying genetic evolution never use Darwins. The reason for bringing up genomes at all in this post is to look at the shape of the relationship between genome information and phenotypic complexity; so to start by measuring only phenotypes would get you nowhere.

Comment author: timtyler 16 October 2009 06:31:08AM 2 points [-]

Inaccurate premise: I don't think my job is "only to find fault with ideas". When I do that, it's often because that is the simplest and fastest way to contribute. Destruction is easier than construction - but it is pretty helpful nonetheless. Critics have saved me endless hours of frustration pursuing bad ideas. I wish to pass some of that on.

In this particular sub-thread, my behavior is actually fairly selfish: if there's reasonable evidence that DNA-evolution has been slowing down, I would be interested in hearing about it. However, I'm not going to find such evidence in this thread if people get the idea that this point has already been established.

Comment author: PhilGoetz 21 October 2009 10:20:19PM 1 point [-]

I don't have strong evidence that DNA evolution has been slowing down in bacteria. I presented both evidence and explanation why it has been slowing down in eukaryotes. That is all that matters for this post; because the point of referring to DNA evolution in this post has to do with how efficiently evolution uses information in the production of intelligence. Eukaryotes are more intelligent than bacteria.

Comment author: taw 15 October 2009 12:02:25AM 0 points [-]

So I've read the paper. According to it, and it seems very plausible to me, we have some reason to suspect we seriously underestimate the number of SDA families: the most widely distributed SDA families are the most likely to be known (those often happen to occur in multiple groups), and the less widely distributed families are the least likely to be known (those often happen to occur in one group only).

The actual percentage of shared SDA families is almost certainly lower than what we can naively estimate from current data. I don't know how much lower. Maybe just a few percent, maybe a lot.

Not mentioned in the paper, but quite obvious, is the huge amount of horizontal gene transfer happening on evolutionary timescales like these (especially with viruses). It also increases apparent sharing and makes families appear older than they really are.

A third effect is that an SDA family that diverged a long time ago might be unrecognizable as a single family, while one that diverged more recently is still recognizable as such. This can only increase the apparent age of SDA families.

So there are at least three effects of unknown magnitude but known direction. If any of them is strong enough, it invalidates your hypothesis. If all of them are weak, your hypothesis still relies heavily on the dating of the eukaryote-prokaryote split.

Comment author: timtyler 14 October 2009 10:33:02PM 0 points [-]

At least 3.7 billion years ago, then.

Comment author: timtyler 14 October 2009 10:26:10PM 0 points [-]

I too was talking about information in DNA. The number of species influences the quantity of information present in the DNA of an ecosystem - just as rolling a die 100 times supplies more information than rolling it once.
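The dice claim is just additivity of Shannon entropy over independent events; a minimal sketch (not from the original thread) makes the arithmetic explicit:

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_die = [1 / 6] * 6
one_roll = entropy_bits(fair_die)  # log2(6) ~ 2.585 bits

# Entropy is additive over independent events, so 100 independent rolls
# carry exactly 100 times the information of a single roll.
hundred_rolls = 100 * one_roll
print(f"{one_roll:.3f} bits per roll, {hundred_rolls:.1f} bits for 100 rolls")
```

Whether more species means proportionally more information in an ecosystem's DNA depends on how independent the genomes are, which is exactly what the shared-domain discussion above is about.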

Comment author: timtyler 14 October 2009 06:52:35PM 0 points [-]

I can't say I care much for the Evolution/Science split in this post.

A more natural split would be between DNA evolution and cultural evolution.

Evolution is best seen as an umbrella term that covers any copying-with-variation-and-selection process - and science is one example of cultural evolution.

Comment author: PhilGoetz 14 October 2009 07:25:11PM 0 points [-]

Agreed on DNA evolution. Not agreed on cultural evolution. Science is easier to quantify. My data concerns mainly science.

Comment author: timtyler 14 October 2009 10:22:58PM *  1 point [-]

You were being specific, when you could have been more general. It isn't just science that is evolving rapidly; so are technology, fashion, music, literature - and so on. Science makes a fine example - but it's just one part among many of a snowballing human culture - and readers should ideally be aware of that.

Comment author: timtyler 14 October 2009 06:43:27PM 0 points [-]

Re: "I therefore expect the pace of evolution to suddenly switch from falling, to increasing [...]"

Since the pace of evolution has clearly been increasing recently, this seems like a rather retroactive prediction.

Comment author: PhilGoetz 14 October 2009 07:28:52PM *  0 points [-]

Since the pace of evolution has clearly been increasing recently, this seems like a rather retroactive prediction.

I refer you to Katja's blog post, "'Clearly' covers murky thought". I'm not aware of any evidence that the pace of evolution has been increasing recently. I am pretty sure that the pace of DNA evolution has not been increasing recently. I am aware of many reasons for predicting that the pace of evolution of humans has been decreasing over the past few centuries.

But I will change the statement to specify rate of information acquisition in genomes.

Comment author: timtyler 14 October 2009 08:15:01PM *  0 points [-]

Evolution as a whole has clearly been speeding up recently - due to the current mass extinction, the evolution of intelligent design, genetic engineering, etc. Today we are witnessing an unprecedented rate of evolutionary change. Just look out of your window.

For the human genome, perhaps see: Human evolution is 'speeding up'

Comment author: PhilGoetz 14 October 2009 10:06:53PM -1 points [-]

We are witnessing many extinctions. That's a loss of information, not a gain of information. But, yes, massive disturbances and relocations increase the rate of evolution, all else being equal.

Comment author: timtyler 14 October 2009 10:53:52PM 0 points [-]

If you refer to the "pace of evolution" I should hope that mass extinctions count as rapid evolution. The gene frequencies there are changing pretty rapidly.

If you mean to refer to some other kind of metric, you should probably be more specific - for example, you might want to consider talking about "constructive evolution" - or something similar.

Comment author: PhilGoetz 15 October 2009 12:31:55AM *  2 points [-]

If you refer to the "pace of evolution" I should hope that mass extinctions count as rapid evolution.

I wouldn't. That's just loss. If your planet was hit by a supernova, would you call that rapid evolution?

I've been quite specific. I'm talking about the accumulation of information in the DNA of all organisms.

Comment author: timtyler 15 October 2009 09:24:21PM 0 points [-]

Don't call it "evolution", then - or people will get very confused. Evolution is about change - not about gain or loss. Check with the definition of the term.