Comment author: Viliam_Bur 18 August 2013 08:16:34PM *  2 points [-]

Don't import java.util.*; use import java.util.List and java.util.Map

I am curious about the compression algorithm that would compress "import java.util.List; import java.util.Map" to a shorter string than "import java.util.*;".

Unless you would artificially make it compress the latter very inefficiently -- but, of course, using that kind of strategy you could "prove" almost anything.

Comment author: pengvado 19 August 2013 04:38:19AM 1 point [-]

I interpret Daniel_Burfoot's idea as: "import java.util.*" makes subsequent mentions of List longer, since there are more symbols in scope that it has to be distinguished from.

But I don't think that idea actually works. You can decompose the probability of a conjunction into a product of conditional probabilities, and you get the same number regardless of the order of said decomposition. Whatever probability (and corresponding total compressed size) you assign to a certain sequence of imports and symbols, you could just as well record the symbols first and then the imports. In which case by the time you get to the imports you already know that the program didn't invoke any utils other than List and Map, so even being able to represent the distinction between the two forms of import is counterproductive for compression.

The intuition of "there are more symbols in scope that it has to be distinguished from" goes wrong because it fails to account for updating a probability distribution over what symbol comes next based on what symbols you've seen. Such an update can include knowledge of which symbols come in the same header, if that's correlated with which symbols are likely to appear in the same program.
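The order-independence point is just the chain rule for probabilities, and can be checked numerically. Here is a minimal sketch with a made-up joint distribution over import style and symbol usage (the numbers are invented purely for illustration):

```python
from math import log2

# Invented joint distribution over (import style, symbol used).
joint = {
    ("star", "List"): 0.2,
    ("star", "Map"): 0.2,
    ("explicit", "List"): 0.3,
    ("explicit", "Map"): 0.3,
}

def marginal(index):
    """Marginal distribution of one coordinate of the joint."""
    m = {}
    for outcome, p in joint.items():
        m[outcome[index]] = m.get(outcome[index], 0.0) + p
    return m

def bits_imports_first(imp, sym):
    """Code length if we encode the import, then the symbol given the import."""
    p_imp = marginal(0)[imp]
    return -log2(p_imp) - log2(joint[(imp, sym)] / p_imp)

def bits_symbols_first(imp, sym):
    """Code length if we encode the symbol, then the import given the symbol."""
    p_sym = marginal(1)[sym]
    return -log2(p_sym) - log2(joint[(imp, sym)] / p_sym)

# Either decomposition yields -log2 of the joint probability:
# the total compressed size is the same regardless of encoding order.
for imp, sym in joint:
    assert abs(bits_imports_first(imp, sym) - bits_symbols_first(imp, sym)) < 1e-12
```

Whatever structure the compressor exploits (imports first or symbols first), the total comes out to the negative log of the same joint probability.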

Comment author: Transfuturist 30 July 2013 02:35:17AM *  5 points [-]

I believe I've encountered a problem with either Solomonoff induction or my understanding of Solomonoff induction. I can't post about it in Discussion, as I have less than 20 karma, and the stupid questions thread is very full (I'm not even sure if it would belong there).

I've read about SI repeatedly over the last year or so, and I think I have a fairly good understanding of it. Good enough to at least follow along with informal reasoning about it, at least. Recently I was reading Rathmanner and Hutter's paper, and Legg's paper, due to renewed interest in AIXI as the theoretical "best intelligence," and the Arcade Learning Environment used to test the computable Monte Carlo AIXI approximation. Then this problem came to me.

Solomonoff Induction measures the complexity of a bitstring by the length of the shortest Turing machine that outputs it. I saw this as a problem. Say AIXI was reasoning about a fair coin. It would guess before each flip whether it would come up heads or tails. Because Turing machines are deterministic, AIXI cannot make hypotheses involving randomness. To model the fair coin, AIXI would come up with increasingly convoluted Turing machines, attempting to compress a bitstring that approaches Kolmogorov randomness as its length approaches infinity. Meanwhile, AIXI would be punished and rewarded randomly. This is not a satisfactory conclusion for a theoretical "best intelligence." So is the claim that AIXI cannot make hypotheses involving randomness a valid issue? An AI that can't delay reasoning about a problem by at least labeling it "sufficiently random, solve later" doesn't seem like a good AI, particularly in the real world where chance plays a significant part.

Naturally, Eliezer has already thought of this, and wrote about it in Occam's Razor:

The formalism of Solomonoff Induction measures the "complexity of a description" by the length of the shortest computer program which produces that description as an output. To talk about the "shortest computer program" that does something, you need to specify a space of computer programs, which requires a language and interpreter. Solomonoff Induction uses Turing machines, or rather, bitstrings that specify Turing machines. What if you don't like Turing machines? Then there's only a constant complexity penalty to design your own Universal Turing Machine that interprets whatever code you give it in whatever programming language you like. Different inductive formalisms are penalized by a worst-case constant factor relative to each other, corresponding to the size of a universal interpreter for that formalism.

In the better (IMHO) versions of Solomonoff Induction, the computer program does not produce a deterministic prediction, but assigns probabilities to strings. For example, we could write a program to explain a fair coin by writing a program that assigns equal probabilities to all 2^N strings of length N. This is Solomonoff Induction's approach to fitting the observed data. The higher the probability a program assigns to the observed data, the better that program fits the data. And probabilities must sum to 1, so for a program to better "fit" one possibility, it must steal probability mass from some other possibility which will then "fit" much more poorly. There is no superfair coin that assigns 100% probability to heads and 100% probability to tails.

Does this warrant further discussion, if only to validate or refute this claim? I don't think Eliezer's proposal for a version of SI that assigns probabilities to strings is strong enough; it doesn't describe what form the hypotheses would take. Would hypotheses in this new description be universal nondeterministic Turing machines, with the aforementioned probability distribution summed over the nondeterministic outputs?
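The quoted idea of programs that assign probabilities can be made concrete. In this sketch, a "program" is just a probability assignment to bitstrings; the bias value and the data string are invented for illustration. The higher the probability the program assigns to the observed data, the shorter the corresponding code length:

```python
from math import log2

def fair_coin_prob(s):
    """The fair-coin program: probability 2^-N for every length-N string."""
    return 0.5 ** len(s)

def biased_prob(s, p=0.9):
    """A Bernoulli(p) program: fits mostly-heads data better by
    stealing probability mass from mostly-tails strings."""
    return p ** s.count("1") * (1 - p) ** s.count("0")

data = "1111111101"  # mostly heads

fair_bits = -log2(fair_coin_prob(data))   # always exactly N bits
biased_bits = -log2(biased_prob(data))    # shorter, since this data fits
assert biased_bits < fair_bits
```

This is the "stealing probability mass" trade-off from the quote: the biased program fits mostly-heads strings better, at the cost of fitting mostly-tails strings much worse.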

Comment author: pengvado 30 July 2013 08:44:40PM *  1 point [-]

Eliezer's proposal was a different notation, not an actual change in the strength of Solomonoff Induction. The usual form of SI with deterministic hypotheses is already equivalent to one with probabilistic hypotheses: a single hypothesis with prior probability P that assigns uniform probability to each of 2^N different bitstrings makes the same predictions as an ensemble of 2^N deterministic hypotheses, each of which has prior probability P*2^-N and predicts one of the bitstrings with certainty; and a Bayesian update in the former case is equivalent to just discarding falsified hypotheses in the latter. Given any computable probability distribution, you can with O(1) bits of overhead convert it into a program that samples from that distribution when given a uniform random string as input, and then convert that into an ensemble of deterministic programs with different hardcoded values of the random string. (The other direction of the equivalence is obvious: a computable deterministic hypothesis is just a special case of a computable probability distribution.)
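A minimal numeric check of this equivalence, for N = 2 (the setup is illustrative; exact rationals are used so the comparison is exact):

```python
from fractions import Fraction

strings = ["00", "01", "10", "11"]

# Probabilistic hypothesis: uniform over the four strings, so after
# observing a first bit of "0" it predicts the next bit is "1" with
# conditional probability 1/2.
prob_hyp_prediction = Fraction(1, 2)

# Equivalent ensemble: four deterministic hypotheses with equal prior
# mass, each predicting one string with certainty.  A Bayesian update
# on the observed first bit "0" just discards the falsified members.
survivors = [s for s in strings if s[0] == "0"]
ensemble_prediction = Fraction(sum(1 for s in survivors if s[1] == "1"),
                               len(survivors))

assert ensemble_prediction == prob_hyp_prediction
```

The shared prior P cancels out of the ensemble's posterior, which is why it doesn't appear in the comparison.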

Yes, if you put a Solomonoff Inductor in an environment that contains a fair coin, it would come up with increasingly convoluted Turing machines. This is a problem only if you care about the value of an intermediate variable (posterior probability assigned to individual programs), rather than the variable that SI was actually designed to optimize, namely accurate predictions of sensory inputs. This manifests in AIXI's limitation to using a sense-determined utility function. (Granted, a sense-determined utility function really isn't a good formalization of my preferences, so you couldn't build an FAI that way.)

Comment author: [deleted] 24 July 2013 05:29:25PM *  1 point [-]

After reading a number of recent comments noting "Please use the standard font," I'm wondering if a technological block stopping people from using a nonstandard font would be a worthwhile use of programmer time. Thoughts? I myself don't have a strong opinion either way, but I figured the first thing to do was to get a better feel for current opinion.

Edit: Is there a way to correct spelling without destroying a poll?


In response to comment by [deleted] on Open thread, July 23-29, 2013
Comment author: pengvado 25 July 2013 01:16:18AM *  1 point [-]

Is there a benefit from doing that server-side rather than client-side? I've long since configured my web browser to always use my favorite font rather than whatever is suggested by any website.

Comment author: ChrisHallquist 19 July 2013 04:41:43AM *  2 points [-]

He may like being out among the stars, but that's no reason to cut his fun on Earth short unnecessarily. Like, if the Earth were to be destroyed, and all the horcruxes hidden there with it, he'd regard having the Pioneer horcrux left as vastly superior to nonexistence, but that doesn't mean he wouldn't like to continue hanging out on Earth.

Edit: Also, if he's afraid that someone like Harry could accidentally explode the Sun, taking the Pioneer Plaque with it, he has reason to stick around to stop that from happening. Also, if Bacon's diary wasn't a horcrux, how do you account for Voldemort telling Harry it's nearly indestructible?

Comment author: pengvado 19 July 2013 08:17:18AM 2 points [-]

"Oh," said Professor Quirrell, "don't worry about a little rough handling. You could toss this diary in a fireplace and it would emerge unscathed.

That isn't necessarily the same level of indestructibility as a horcrux. It could just be a standard charm placed on rare books.

Comment author: Vaniver 18 July 2013 05:27:04PM 0 points [-]

Ideal Bayesian updates assume logical omniscience, right? Including knowledge of the logical fact of what EDT would do for any given input.

Note that step 1 is "Assume that I saw myself doing X," not "Assume that EDT outputs X as the optimal action." I believe that excludes any contradictions along those lines. Does logical omniscience preclude imagining counterfactual worlds?

Comment author: pengvado 19 July 2013 03:14:10AM 1 point [-]

If I already know "I am EDT", then "I saw myself doing X" does imply "EDT outputs X as the optimal action". Logical omniscience doesn't preclude imagining counterfactual worlds, but imagining counterfactual worlds is a different operation than performing Bayesian updates. CDT constructs counterfactuals by severing some of the edges in its causal graph and then assuming certain values for the nodes that no longer have any causes. TDT does too, except with a different graph and a different choice of edges to sever.

Comment author: Vaniver 18 July 2013 06:09:49AM *  0 points [-]

What I'm saying is that the only way to solve any decision theory problem is to learn a causal model from data.

I think there are a couple of confusions this sentence highlights.

First, there are approaches to solving decision theory problems that don't use causal models. Part of what has made this conversation challenging is that there are several different ways to represent the world- and so even if CDT is the best / natural one, it needs to be distinguished from other approaches. EDT is not CDT in disguise; the two are distinct formulas / approaches.

Second, there are good reasons to modularize the components of the decision theory, so that you can treat learning a model from data separately from making a decision given a model. An algorithm to turn models into decisions should be able to operate on an arbitrary model, where it sees a -> b -> c as isomorphic to Drunk -> Fall -> Death.

To tell an anecdote, when my decision analysis professor would teach that subject to petroleum engineers, he quickly learned not to use petroleum examples. Say something like "suppose the probability of striking oil by drilling a well here is 40%" and an engineer's hand will shoot up, asking "what kind of rock is it?". The kind of rock is useful for determining whether or not the probability is 40% or something else, but the question totally misses the point of what the professor is trying to teach. The primary example he uses is choosing a location for a party subject to the uncertainty of the weather.

It just doesn't make sense to postulate particular correlations between an EDT agent's decisions and other things before you even know what EDT decides!

I'm not sure how to interpret this sentence.

The way EDT operates is to perform the following three steps for each possible action in turn:

  1. Assume that I saw myself doing X.
  2. Perform a Bayesian update on this new evidence.
  3. Calculate and record my utility.

It then chooses the possible action which had the highest calculated utility.

One interpretation is you saying that EDT doesn't make sense, but I'm not sure I agree with what seems to be the stated reason. It looks to me like you're saying "it doesn't make sense to assume that you do X until you know what you decide!", when I think that does make sense, but the problem is using that assumption as Bayesian evidence as if it were an observation.
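The three steps above can be sketched directly. The scenario, probabilities, and utilities below are invented for illustration; the point is only the shape of the computation (condition on the action as if it were an observation, then take expected utility):

```python
# Invented joint distribution over (action, outcome), plus utilities.
joint = {
    ("umbrella", "dry"): 0.28,
    ("umbrella", "wet"): 0.02,
    ("no_umbrella", "dry"): 0.35,
    ("no_umbrella", "wet"): 0.35,
}
utility = {"dry": 10, "wet": 0}

def edt_choose(joint, utility):
    best_action, best_eu = None, float("-inf")
    for action in sorted({a for a, _ in joint}):
        # Steps 1-2: update on the "evidence" that I saw myself doing X.
        p_action = sum(p for (a, _), p in joint.items() if a == action)
        # Step 3: expected utility under the conditioned distribution.
        eu = sum((p / p_action) * utility[o]
                 for (a, o), p in joint.items() if a == action)
        if eu > best_eu:
            best_action, best_eu = action, eu
    return best_action
```

Note that the action is fed into the same conditioning machinery as any ordinary observation, which is exactly the move under dispute in this thread.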

Comment author: pengvado 18 July 2013 10:03:12AM *  1 point [-]

The way EDT operates is to perform the following three steps for each possible action in turn:

  1. Assume that I saw myself doing X.
  2. Perform a Bayesian update on this new evidence.
  3. Calculate and record my utility.

Ideal Bayesian updates assume logical omniscience, right? Including knowledge of the logical fact of what EDT would do for any given input. If you know that you are an EDT agent, and condition on all of your past observations and also on the fact that you do X, but X is not in fact what EDT does given those inputs, then as an ideal Bayesian you will know that you're conditioning on something impossible. More generally, what update you perform in step 2 depends on EDT's input-output map, thus making the definition circular.

So, is EDT really underspecified? Or are you supposed to search for a fixed point of the circular definition, if there is one? Or does it use some method other than Bayes for the hypothetical update? Or does an EDT agent really break if it ever finds out its own decision algorithm? Or did I totally misunderstand?

Comment author: CronoDAS 15 July 2013 03:59:21AM 1 point [-]

Um... doesn't it take exponential time in order to simulate quantum mechanics on a classical computer?

Comment author: pengvado 17 July 2013 04:01:09AM *  0 points [-]

Yes. (At least, that's the general consensus among complexity theorists, though it hasn't been proved.) This doesn't contradict anything Eliezer said in the grandparent. The following are all consensus-but-not-proved:

P⊂BQP⊂EXP
P⊂NP⊂EXP
BQP≠NP (Neither is confidently predicted to be a subset of the other, though BQP⊆NP is at least plausible, while NP⊆BQP is not.)
If you don't measure any distinctions finer than P vs EXP, then you're using a ridiculously coarse scale. There are lots of complexity classes strictly between P and EXP, defined by limiting resources other than time-on-a-classical-computer. Some of them are tractable under our physics and some aren't.

Comment author: ciphergoth 13 July 2013 11:22:43AM 0 points [-]

Uncountability doesn't seem like a big deal to me. Just give the Turing machine an auxiliary tape containing the real parameter.

I can't work out whether this works or not. Here's a really simple example system: each output is 0 or 1, drawn with a fixed probability. If the probability is an uncomputable number, then no algorithm with a finite initial state can generate output with exactly that probability; there has to be a rational number in between the real probability and the simulated one, which means there's an attacker that can distinguish the two given enough outputs.

If instead the simulator can read the real probability on an infinite tape... obviously it can't read the whole tape before producing an output. So it has to read, then output, then read, then output. It seems intuitive that with this strategy, it can place an absolute limit on the advantage that any attacker can achieve, but I don't have a proof of that yet.

A related question is whether a Turing machine can fully simulate the dynamical system, that is, whether it can compute the state at any future time to any precision using only finitely many bits from the starting parameters.

Obviously if this can be done then what I ask can be done. I had thought this impossible, which is why I wanted to substitute an easier question about distinguishability.

I think the answer to that is that a differential equation whose initial conditions are a computable real number can evolve to an uncomputable real number. But if the initial random number is not arbitrary, but guaranteed random, maybe it is computable. (That is, maybe the inputs with computable futures have full measure.)

Well, that's clearly true: if the dynamical system is such that when t = 0 we have y = 0 and dy/dt = 1, then y will pass through lots of uncomputable values at uncomputable times. Some kind of computable uncertainty about the initial state may address the cardinality issue, but I'm not sure how to formalise that.

Comment author: pengvado 14 July 2013 12:32:47AM 0 points [-]

If instead the simulator can read the real probability on an infinite tape... obviously it can't read the whole tape before producing an output. So it has to read, then output, then read, then output. It seems intuitive that with this strategy, it can place an absolute limit on the advantage that any attacker can achieve, but I don't have a proof of that yet.

In this model, a simulator can exactly match the desired probability in O(1) expected time per sample. (The distribution of possible running times extends to arbitrarily large values, but its expected value is finite. If this were a decision problem rather than a sampling problem, I'd call it ZPP.)

Algorithm:
1. Start with an empty string S.
2. Flip a coin and append it to S.
3. If S is exactly equal to the corresponding-length prefix of your probability tape P, then goto 2.
4. Output the result of the comparison S < P (reading S and P as binary fractions)
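A sketch of the algorithm in Python (the interface, a function returning the i-th bit of P's binary expansion, is my own framing of the "infinite tape"):

```python
import random
from itertools import count

def bernoulli_from_tape(p_bit, flip=lambda: random.randrange(2)):
    """Output 1 with probability exactly p, where p_bit(i) is the i-th
    bit of p's binary expansion (the tape P) and flip() is a fair coin.
    Flips bits into S until S first differs from the matching prefix
    of P, then outputs the comparison S < P."""
    for i in count():
        s = flip()
        if s != p_bit(i):
            # The first differing bit decides the comparison S < P.
            return 1 if s < p_bit(i) else 0

def make_flipper(bits):
    """Deterministic coin for testing: yields a fixed bit sequence."""
    it = iter(bits)
    return lambda: next(it)

# Example: p = 0.75 = 0.11 in binary.
p_bit = lambda i: 1 if i < 2 else 0
assert bernoulli_from_tape(p_bit, make_flipper([0])) == 1        # 0.0... < 0.11
assert bernoulli_from_tape(p_bit, make_flipper([1, 1, 1])) == 0  # 0.111 > 0.11
```

Each iteration of the loop terminates with probability 1/2, so the expected number of flips is 2 regardless of p, which is the O(1) expected time claimed above.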

Comment author: JoshuaZ 23 June 2013 06:29:09PM 6 points [-]

Another key factor here is that the machine that does this would operate inside an alien environment compared to existing life - it would operate in a clean vacuum, possibly at low temperatures, and would use extremely stiff subunits made of covalently bonded silicon or carbon

If you have to do this, then the threat of nanotech looks a lot smaller. Replicators that need a nearly perfect vacuum aren't much of a threat.

Also, this is one place where AI comes in. The universe doesn't have any trouble modeling the energetics of a large network of atoms. If we have trouble doing the same, even using gigantic computers made of many, many of these same atoms, then maybe the problem is that we are doing it in a hugely inefficient way. An entity smarter than humans might find a way to re-formulate the math for many orders of magnitude more efficient calculations, or it might find a way to build a computer that more efficiently uses the atoms it is composed of.

This sounds very close to a default assumption that these processes are genuinely easy not just to compute, but to invert to work out the solutions one wants. Answering "how will this protein most likely fold?" is computationally much easier (as far as we can tell) than answering "what protein will fold like this?" It may well be that these are substantially computationally easier than we currently think. Heck, it could be that P=NP, or that even with P != NP there's still some extremely slow-growing algorithm that solves NP-complete problems. But these don't seem like likely scenarios unless one has some evidence for them.

Comment author: pengvado 24 June 2013 10:45:39AM 2 points [-]

Answering "how will this protein most likely fold?" is computationally much easier (as far as we can tell) than answering "what protein will fold like this?"

Got a reference for that? It's not obvious to me (CS background, not bio).

What if you have an algorithm that attempts to solve the "how will this protein most likely fold?" problem, but is only tractable on 1% of possible inputs, and just gives up on the other 99%? As long as the 1% contains enough interesting structures, it'll still work as a subroutine for the "what protein will fold like this?" problem. The search algorithm just has to avoid the proteins that it doesn't know how to evaluate. That's how human engineers work, anyway: "what does this pile of spaghetti code do?" is uncomputable in the worst case, but that doesn't stop programmers from solving "write a program that does X".
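The subroutine idea can be sketched with a toy stand-in evaluator. Nothing below is a real folding model; the "tractable subset" is arbitrarily taken to be inputs starting with "A", and the "shape" is a trivially computed property:

```python
def fold(protein):
    """Toy forward evaluator: answers only on a 'tractable' subset
    of inputs and gives up (returns None) on the rest."""
    if not protein.startswith("A"):
        return None
    return len(protein) % 3  # stand-in for "the shape it folds into"

def search_for_shape(target, candidates):
    """Solve the inverse problem using only the forward evaluator,
    skipping the inputs the evaluator can't handle."""
    for c in candidates:
        shape = fold(c)
        if shape is None:
            continue  # avoid proteins we don't know how to evaluate
        if shape == target:
            return c
    return None
```

The search never needs the evaluator to work in the worst case; it only needs the evaluable subset to contain enough interesting structures.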

Comment author: cousin_it 23 June 2013 08:08:13AM *  2 points [-]

I just found this nice quote on The Last Conformer which is supposed to prove that betting on major events is qualitatively different from betting on coinflips:

I wouldn't even offer bets on this kind of probability because that would just invite better informed people to take my money.

It seems to me that the problem exists for coinflips as well. If I flip a coin and don't show you the result, your beliefs about the coin are probably 50/50. But if I offer you to bet at 50/50 odds that the coin came up heads, you'll probably refuse, because I know which way the coin came up and you don't.

According to the Dutch book argument for rationality, we are supposed to accept either side of any bet offered at the odds corresponding to our beliefs. In my example, that idea breaks down, because getting the offer is evidence that you shouldn't take the bet. But then how do we formulate the Dutch book argument?

Comment author: pengvado 23 June 2013 11:28:51AM *  2 points [-]

The selection effect you mention only applies to offering bets, not accepting them. If Alice announces her betting odds and then Bob decides which side of the bet to take, Alice might be doing something irrational there (if she didn't have a bid-ask spread), but we can still talk about dutch books from Bob's perspective. If you want to eliminate the effect whereby Bob updates on the existence of Alice's offer before making his decision, then replace Alice with an automated market maker (set up by someone who expects to lose money in exchange for outsourcing the probability estimate). Or assume some natural process with a naturally occurring payoff ratio that isn't determined by the payoff frequencies nor by anyone's state of knowledge.
