This is not unlike Neyman-Pearson theory. Surely this will run into the same trouble with more than 2 possible actions.
Our research group and collaborators, first and foremost Daniel Polani, have been studying this for many years now. Polani calls an essentially identical concept empowerment. These guys are welcome to the party, and as former outsiders it's understandable (if not totally acceptable) that they wouldn't know about these piles of prior work.
You have a good and correct point, but it has nothing to do with your question.
a machine can never halt after achieving its goal because it cannot know with full certainty whether it has achieved its goal
This is a misunderstanding of how such a machine might work.
To verify that it completed the task, the machine must match the current state to the desired state. The desired state is any state where the machine has "made 32 paperclips". Now what's a paperclip?
For quite some time we've had the technology to identify a paperclip in an image, if ...
The "world state" of ASH is in fact an "information state" of p("heads")>SOME_THRESHOLD
Actually, I meant p("heads") = 0.999 or something.
(C), if I'm following you, maps roughly to the English phrase "I know for absolutely certain that the coin is almost surely heads".
No, I meant: "I know for absolutely certain that the coin is heads". We agree that this much you can never know. As for getting close to this, for example having the information state (D) where p("heads") = 0.99999...
I probably need to write a top-level post to explain this adequately, but in a nutshell:
I've tossed a coin. Now we can say that the world is in one of two states: "heads" and "tails". This view is consistent with any information state. The information state (A) of maximal ignorance is a uniform distribution over the two states. The information state (B) where heads is twice as likely as tails is the distribution p("heads") = 2/3, p("tails") = 1/3. The information state (C) of knowing for sure that the result is heads...
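To make these information states concrete, here's a small Python sketch (a toy illustration of my own, not part of the argument) that prints each distribution and its entropy:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# World states: ("heads", "tails"). Each information state is a distribution.
A = [1/2, 1/2]   # maximal ignorance
B = [2/3, 1/3]   # heads twice as likely as tails
C = [1.0, 0.0]   # knowing for sure that the result is heads

for name, state in [("A", A), ("B", B), ("C", C)]:
    print(f"{name}: p(heads) = {state[0]:.3f}, entropy = {entropy(state):.3f} bits")
```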
To clarify further: likelihood is a relative quantity, like speed - it only has meaning relative to a specific frame of reference.
If you're judging my calibration, the proper frame of reference is what I knew at the time of prediction. I didn't know what the result of the fencing match would be, but I had some evidence for who is more likely to win. The (objective) probability distribution given that (subjective) information state is what I should've used for prediction.
If you're judging my diligence as an evidence seeker, the proper frame of reference is ...
This is perhaps not the best description of actualism, but I see your point. Actualists would disagree with this part of my comment:
If I believed that "you will win" (no probability qualifier), then in the many universes where you didn't I'm in Bayes Hell.
on the grounds that those other universes don't exist.
But that was just a figure of speech. I don't actually need those other universes to argue against 0 and 1 as probabilities. And if Frequentists disbelieve in that, there's no place in Bayes Heaven for them.
we've already seen [...] or [...] in advance
Does this answer your question?
Predictions are justified not by becoming a reality, but by the likelihood of their becoming a reality [1]. When this likelihood is hard to estimate, we can take their becoming a reality as weak evidence that the likelihood is high. But in the end, after counting all the evidence, it's really only the likelihood itself that matters.
If I predict [...] that I will win [...] and I in fact lose fourteen touches in a row, only to win by forfeit
If I place a bet on you to win and this happens, I'll happily collect my prize, but still feel that I put my money ...
Thanks!
The best book is doubtlessly Elements of Information Theory by Cover and Thomas. It's very clear (to someone with some background in math or theoretical computer science) and lays very strong introductory foundations before giving a good overview of some of the deeper aspects of the theory.
It's fortunate that for many concepts of information theory, the mathematical meaning overlaps with the everyday meaning. This way I can explain the new theory (popularized here for the first time) without formally defining these concepts.
I'm planning another sequence wh...
an intelligent agent should update on it and then forget it.
Should being the operative word. This refers to a "perfect" agent (emphasis added in text; thanks!).
People don't do this, as well they shouldn't, because we update poorly and need the original data to compensate.
If you forget the discarded cards, and later realize that you may have an incorrect map of the deck, aren't you SOL?
If I remember the cards in play, I don't care about the discarded ones. If I don't, the discarded cards could help a bit, but that's not the heart of my problem.
if you really care about the values on that list, then there are linear aggregations
Of course existence doesn't mean that we can actually find these coefficients. Even if you have only 2 well-defined value functions, finding an optimal tradeoff between them is generally computationally hard.
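As a toy sketch (with made-up value functions, not anyone's actual values), here is what a linear aggregation looks like, and why the choice of coefficients is itself the hard part:

```python
import numpy as np

# Two made-up value functions over a handful of candidate actions.
V1 = np.array([1.0, 0.2, 0.7, 0.4])   # e.g. one value on the list
V2 = np.array([0.1, 0.9, 0.6, 0.8])   # e.g. another value on the list

def best_action(w1, w2):
    """Action maximizing the linear aggregation w1*V1 + w2*V2."""
    return int(np.argmax(w1 * V1 + w2 * V2))

# Different coefficients pick different actions; finding coefficients that
# reflect the tradeoff you actually want is the hard part.
for w1 in (0.9, 0.5, 0.1):
    print(f"w1 = {w1:.1f}: best action is {best_action(w1, 1 - w1)}")
```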
Philosophically, yes.
Practically, it may be useful to distinguish between a coin and a toss. The coin has persisting features which make it either fair or loaded for a long time, with correlation between past and future. The toss is transient, and essentially all information about it is lost when I put the coin away - except through the memory of agents.
So yes, the toss is a feature of the present state of the world. But it has the very special property that, given the bias of the coin, the toss is independent of the past and the future. It's sometimes more useful to treat a feature like that as an observation external to the world, but of course it "really" isn't.
I'm trying to balance between introducing terminology to new readers and not boring those who've read my previous posts. Thanks for the criticism, I'll use it (and its upvotes) to correct my balance.
Well, thank you!
Yes, I do this more for the math and the algorithms than for advice for humans.
Still, the advice is perhaps not so trivial: study not what you're most uncertain about (highest entropy given what you know), but the things whose entropy is generated by what you care about. And even this advice is incomplete - there's more to come.
When the new memory state M_t is generated by a Bayesian update from the previous one M_{t-1} and the new observation O_t, it's a sufficient statistic of these information sources for the world state W_t, so that M_t keeps all the information about the world that was remembered or observed:

I(W_t;M_t) = I(W_t;(M_{t-1},O_t))
As this is all the information available, other ways to update can only have less information.
The amount of information gained by a Bayesian update is

I(W_t;M_t) - I(W_t;M_{t-1}) = I(W_t;(M_{t-1},O_t)) - I(W_t;M_{t-1})

= E\left[\log\frac{\Pr(W_t,M_{t-1},O_t)}{\Pr(W_t)\Pr(M_{t-1},O_t)}-\log\frac{\Pr(W_t,M_{t-1})}{\Pr(W_t)\Pr(M_{t-1})}\right]

= E\left[\log\frac{\Pr(W_t,M_{t-1},O_t)\Pr(M_{t-1})}{\Pr(W_t,M_{t-1})\Pr(M_{t-1},O_t)}\right]

= I(W_t;O_t|M_{t-1})
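A quick numerical check of that chain-rule identity, as a Python sketch with an arbitrary made-up joint distribution over binary W, M and O (the variable names mirror the ones above):

```python
import numpy as np

# A made-up joint distribution Pr(W, M_{t-1}, O_t) over three binary variables.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))   # axes: (W, M, O)
joint /= joint.sum()

def mi(p_xy):
    """Mutual information in bits of a joint distribution given as a 2-D array."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2((p_xy / (px * py))[mask])))

i_w_mo = mi(joint.reshape(2, 4))   # I(W; (M, O)), treating (M, O) as one variable
i_w_m = mi(joint.sum(axis=2))      # I(W; M), marginalizing out O

i_w_o_given_m = 0.0                # I(W; O | M) computed from the definition
for m in range(2):
    p_m = joint[:, m, :].sum()
    i_w_o_given_m += p_m * mi(joint[:, m, :] / p_m)

print(i_w_mo - i_w_m, i_w_o_given_m)   # the two numbers agree
```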
I explained this in my non-standard introduction to reinforcement learning.
We can define the world as having the Markov property, i.e. as a Markov process. But when we split the world into an agent and its environment, we lose the Markov property for each of them separately.
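A toy check of this claim (a made-up two-state agent and environment; the joint process below is Markov, but the environment's marginal process is not - its earlier history matters beyond its last state):

```python
from itertools import product
from collections import defaultdict

# Made-up dynamics: at each step the agent and the environment swap states.
def step(a, e):
    return e, a

# Enumerate the 4 equally likely initial joint states and roll forward 3 steps,
# recording the environment's trajectory (e0, e1, e2, e3) with its probability.
traj_prob = defaultdict(float)
for a0, e0 in product([0, 1], repeat=2):
    a, e = a0, e0
    es = [e]
    for _ in range(3):
        a, e = step(a, e)
        es.append(e)
    traj_prob[tuple(es)] += 1 / 4

def cond(target, given):
    """P(e3 = target | constraints on earlier environment states)."""
    num = sum(p for es, p in traj_prob.items()
              if es[3] == target and all(es[i] == v for i, v in given.items()))
    den = sum(p for es, p in traj_prob.items()
              if all(es[i] == v for i, v in given.items()))
    return num / den

# Knowing e2 alone leaves e3 uncertain, but knowing (e1, e2) pins it down:
print(cond(1, {2: 1}))         # 0.5
print(cond(1, {1: 1, 2: 1}))   # 1.0
```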
I'm using non-standard notation and terminology because they are needed for the theory I'm developing in these posts. In future posts I'll try to link more to the handful of researchers who do publish on this theory. I did publish one post relating the terminology I'm using to more stan...
Fixed. Thanks!
How does deciding one model is true give you more information?
Let's assume a strong version of Bayesianism, which entails the maximum entropy principle. So our belief is the one that has the maximum entropy among those consistent with our prior information. If we now add the information that some model is true, this generally invalidates our previous belief, making the new maximum-entropy belief one of lower entropy. The reduction in entropy is the amount of information you gain by learning the model. In a way, this is a cost we pay for "narrowing"...
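A toy numerical version of this, where the "model" is just an illustrative constraint I'm making up that restricts the world to a subset of states:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Four possible world states; with no prior information the maximum-entropy
# belief is uniform over them.
belief_before = np.full(4, 1 / 4)

# Suppose the model asserts that the world is in one of the first two states.
# The maximum-entropy belief consistent with that is uniform on those two.
belief_after = np.array([1 / 2, 1 / 2, 0, 0])

gain = entropy(belief_before) - entropy(belief_after)
print(f"{entropy(belief_before):.2f} bits -> {entropy(belief_after):.2f} bits; "
      f"learning the model is worth {gain:.2f} bits")
```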
You're not really wrong. The thing is that "Occam's razor" is a conceptual principle, not one mathematically defined law. A certain (subjectively very appealing) formulation of it does follow from Bayesianism.
P(AB model) \propto P(AB are correct) and P(A model) \propto P(A is correct). Then P(AB model) <= P(A model).
Your math is a bit off, but I understand what you mean. If we have two sets of models, with no prior information to discriminate between their members, then the prior gives less probability to each model in the larger set than ...
The ease with which images, events and concepts come to mind is correlated with how frequently they have been observed, which in turn is correlated with how likely they are to happen again.
Yes, and I was trying to make this description one level more concrete.
Things never happen the exact same way twice. The way that past observations are correlated with what may happen again is complicated - in a way, that's exactly what "concepts" capture.
So we don't just recall something that happened and predict that it will happen again. Rather, we compos...
Take for example your analysis of the poker hand I partially described. You give 3 possibilities for what the truth of it may be. Are there any other possibilities? Maybe the player is bluffing to gain the reputation of a bluffer? Maybe she mistook a 4 for an ace (it happened to me once...)? Maybe aliens hijacked her brain?
It would be impossible to enumerate or notice all the possibilities, but fortunately we don't have to. We make only the most likely and important ones available.
I was trying to give a specific reason that the availability heuristic is there: it's coupled with another mechanism that actually generates the availability; and then to say a few things about this other mechanism.
Does anyone have specific advice on how I could convey this better?
Imagine a bowl of jellybeans. [...]
Allow me to suggest a simpler thought experiment that hopefully captures the essence of yours, and shows why your interpretation (of the correct math) is incorrect.
There are 100 recording studios, each recording each day with probability 0.5. Everybody knows that.
There's a red light outside each studio to signal that a session is taking place that day, except for one rogue studio, where the signal is reversed, being off when there's a session and on when there isn't. Only persons B and C know that.
A, B and C are stand...
To anyone thinking this is not random, with 42 votes in:
The p-value is 0.895 (this is the probability of seeing at least this much non-randomness, assuming a uniform distribution)
The entropy is 2.302 bits instead of log(5) = 2.322 bits, for a KL distance of 0.02 bits (this is the number of bits you lose for encoding one of these votes as if it were random)
If you think you see a pattern here, you should either see a doctor or a statistician.
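For the curious, here is how the entropy and KL figures are computed (a sketch with hypothetical vote counts standing in for the actual tallies, which I'm not reproducing here; the p-value calculation is not shown):

```python
import numpy as np

# Hypothetical vote counts over the 5 options, summing to 42.
counts = np.array([10, 9, 8, 8, 7])
p = counts / counts.sum()

H = -np.sum(p * np.log2(p))        # observed entropy in bits
H_uniform = np.log2(len(counts))   # log(5), about 2.322 bits
kl = H_uniform - H                 # bits lost encoding a vote as if it were uniform

print(f"entropy {H:.3f} bits vs {H_uniform:.3f}, KL from uniform {kl:.3f} bits")
```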
It is perfectly legal under the bayes to learn nothing from your observations.
Right, in degenerate cases, when there's nothing to be learned, the two extremes of learning nothing and everything coincide.
Or learn in the wrong direction, or sideways, or whatever.
To the extent that I understand your navigational metaphor, I disagree with this statement. Would you kindly explain?
There is no unique "Bayesian belief".
If you mean to say that there's no unique justifiable prior, I agree. The prior in our setting is basically what you assume yo...
Everything you say is essentially true.
As the designer of the agent, will you be explicitly providing it with that information in some future instalment?
Technically, we don't need to provide the agent with p and sigma explicitly. We use these parameters when we build the agent's memory update scheme, but the agent is not necessarily "aware" of the values of the parameters from inside the algorithm.
Let's take for example an autonomous rover on Mars. The gravity on Mars is known at the time of design, so the rover's software, and even hardware,...
If you're a devoted Bayesian, you probably know how to update on evidence, and even how to do so repeatedly on a sequence of observations. What you may not know is how to update in a changing world. Here's how:
\Pr(W_{t+1}|O_1,\ldots,O_{t+1}) = \frac{\sigma(O_{t+1}|W_{t+1})\cdot\Pr(W_{t+1}|O_1,\ldots,O_t)}{\sum_w\sigma(O_{t+1}|w)\cdot\Pr(w|O_1,\ldots,O_t)}

As usual with Bayes' theorem, we only need to calculate the numerator for different values of W_{t+1}, and the denominator will normalize them to sum to 1, as probabilities do. We know sigma
as part of the dynamics ...
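Here's a minimal sketch of this update in code, with made-up dynamics p and observation probabilities sigma; the predicted belief Pr(W_{t+1}|O_1,...,O_t) is computed from the previous posterior and p by the law of total probability:

```python
import numpy as np

# Two world states, two possible observations.
# p[w, w2]     = p(W_{t+1} = w2 | W_t = w)          (world dynamics)
# sigma[w2, o] = sigma(O_{t+1} = o | W_{t+1} = w2)  (observation probabilities)
p = np.array([[0.9, 0.1],
              [0.2, 0.8]])
sigma = np.array([[0.7, 0.3],
                  [0.4, 0.6]])

def update(belief, obs):
    """One step: predict with p, then condition on the new observation."""
    predicted = belief @ p                 # Pr(W_{t+1} | O_1, ..., O_t)
    numerator = sigma[:, obs] * predicted  # sigma(O_{t+1} | W_{t+1}) * predicted
    return numerator / numerator.sum()     # normalize, as in Bayes' theorem

belief = np.array([0.5, 0.5])              # initial belief over the world states
for obs in [0, 0, 1]:                      # an arbitrary observation sequence
    belief = update(belief, obs)
    print(belief)
```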
p(H|E1,E2) [...] is simply not something you can calculate in probability theory from the information given [i.e. p(H|E1) and p(H|E2)].
Jaynes would disapprove.
You continue to give more information, namely that p(H|E1,E2) = p(H|E1). Thanks, that reduces our uncertainty about p(H|E1,E2).
But we are hardly helpless without it. Whatever happened to the Maximum Entropy Principle? Incidentally, the maximum entropy distribution (given the initial information) does have E1 and E2 independent. If your intuition says this before having more information, it is good...
Clearly you have some password I'm supposed to guess.
This post is not preliminary. It's supposed to be interesting in itself. If it's not, then I'm doing something wrong, and would appreciate constructive criticism.
That's an excellent point. Of course one cannot introduce RL without talking about the reward signal, and I've never intended to.
To me, however, the defining feature of RL is the structure of the solution space, described in this post. To you, it's the existence of a reward signal. I'm not sure that debating this difference of opinion is the best use of our time at this point. I do hope to share my reasons in future posts, if only because they should be interesting in themselves.
As for your last point: RL is indeed a very general setting, and classical planning can easily be formulated in RL terms.
I'm not sure why you say this.
Please remember that this introduction is non-standard, so you may need to be an expert on standard RL to see the connection. And while some parts are not in place yet, this post does introduce what I consider to be the most important part of the setting of RL.
So I hope we're not arguing over definitions here. If you expand on your meaning of the term, I may be able to help you see the connection. Or we may possibly find that we use the same term for different things altogether.
I should also explain why I'm giving a non-standa...
I internally debated this question myself. Ideally, I'd completely agree with you. But I needed the shorter publishing and feedback cycle for a number of reasons. Sorry, but a longer one may not have happened at all.
Edit: future readers will have the benefit of a link to part 2
In the model there's the distribution p, which determines how the world is changing. In the chess example this would include: a) how the agent's action changes the state of the game, and b) some distribution we assume (but which we may or may not actually know) about the opponent's action and the resulting state of the game. In a physics example, p should include the relevant laws of physics, together with constants which tell the rate (and manner) in which the world is changing. Any changing parameters should be part of the state.
It seems that you're saying ...
There's supposed to be some way to do so partially, if anyone knows what it is.
This should work in Markdown, but it seems broken :(
Edit: t̶e̶s̶t̶ Thanks, Vincent, it works!
And how do you strikeout your comment?
I'm not sure what you mean. It looks fine to me, and I can't find where to check / change such a setting.
Edit:
Very strange. Fixed, I hope.
Thanks!
You are expressing a number of misconceptions here. I may address some in future posts, but in short:
By information I mean the Shannon information (see also links in OP). Your example is correct.
By the action of powering the electromagnet you are not increasing your information on the state of the world. You are increasing your information on the state of the coin, but through making it dependent on the state of the electromagnet which you already knew. This point is clearly worth a future post.
There is no "entropy in environment". Entropy is subjective to the viewer.
I realize now that an example would be helpful, and yours is a good one.
Any process can be described on different levels. The trick is to find a level of description that is useful. We make an explicit effort to model actions and observation so as to separate the two directions of information flow between the agent and the environment. Actions are purely "active" (no information is received by the agent) while observations are purely "passive" (no information is sent by the agent). We do this because these two aspects of the process hav...
Excellent point. It will be a few posts (if the audience is interested) before I can answer you in a way that is both intuitive and fully convincing.
The technical answer is that the belief update caused by an action is deterministically contractive. It never increases the amount of information.
A more intuitive answer (but perhaps not yet convincing) is that, proximally, your action of asking your friend did not change the location of your laptop, only your friend's mental state. And the effect it had on your friend's mental state is that you are now less s...
Having a word [...] is a more compact code precisely in those cases where we can infer some of those properties from the other properties. (With the exception perhaps of very primitive words, like "red" [...]).
Remember that mutual information is symmetric. If some things have the property of being red, then "red" has the property of being a property of those things. Saying "blood is red" is really saying "remember that visual experience that you get when you look at certain roses, apples, peppers, lipsticks and English...
The Harsanyi paper is very enlightening, but he's not really arguing that people have shared priors. Rather, he's making the following points (section 14):
It is worthwhile for an agent to analyze the game as if all agents have the same prior, because it simplifies the analysis. In particular, the game (from that agent's point of view) then becomes equivalent to a Bayesian complete-information game with private observations.
The same-prior assumption is less restrictive than it may seem, because agents can still have private observations.
A wide family
I'm aware of this result. It specifically requires the two Bayesians to have the same prior. My point is exactly that this doesn't have to be the case, and in reality is sometimes not the case.
EDIT: The original paper by Aumann references a paper by Harsanyi which supposedly addresses my point. Aumann himself is careful in interpreting his result as supporting my point (since evidently there are people who disagree despite trusting each other). I'll report here my understanding of the Harsanyi paper once I get past the paywall.
Traditional Rationalists can agree to disagree. Traditional Rationality doesn't have the ideal that thinking is an exact art in which there is only one correct probability estimate given the evidence.
This is also true of Bayesians. The probability estimate given the evidence is a property of the map, not the territory (hence "estimate"). One correct posterior implies one correct prior. What is this "Ultimate Prior"? There isn't one.
Possibly, you meant that there's one correct posterior given the evidence and the prior. That's correc...
A GAI with the utility of burning itself? I don't think that's viable, no.
What do you mean by "viable"?
Intelligence is expensive. More intelligence costs more to obtain and maintain. But the sentiment around here (and this time I agree) seems to be that intelligence "scales", i.e. that it doesn't suffer from diminishing returns in the "middle world" like most other things; hence the singularity.
For that to be true, more intelligence also has to be more rewarding. But not just in the sense of asymptotically approaching op...
Not at all. If you insist, let's take it from the top:
I wanted to convey my reasoning, let's call it R.
I quoted a claim of the form "because P is true, Q is true", where R is essentially "if P then Q". This was a rhetorical device, to help me convey what R is.
I indicated clearly that I don't know whether P or Q are true. Later I said that I suspect P is false.
Note that my reasoning is, in principle, falsifiable: if P is true and Q is false, then R must be false.
While Q may be relatively easy to check, I think P is not.
I expe
I'll try to remember that, if only for the reason that some people don't seem to understand contexts in which the truth value of a statement is unimportant.
a GAI with [overwriting its own code with an arbitrary value] as its only goal, for example, why would that be impossible? An AI doesn't need to value survival.
A GAI with the utility of burning itself? I don't think that's viable, no.
I'd be interested in the conclusions derived about "typical" intelligences and the "forbidden actions", but I don't see how you have derived them.
At the moment it's little more than professional intuition. We also lack some necessary shared terminology. Let's leave it at that until and unless someon...
It seems that your research is coming around to some concepts that are at the basis of mine. Namely, that noise in an optimization process is a constraint on the process, and that the resulting constrained optimization process avoids the nasty properties you describe.
Feel free to contact me if you'd like to discuss this further.