All of Jobst Heitzig's Comments + Replies

I think you are right about the representation claim since any quasi-ordering (reflexive and transitive relation) can be represented as the intersection of complete quasi-orderings.

You are of course perfectly right. What I meant was: so that their convex hull is full-dimensional and contains the origin. I fixed it. Thanks for spotting this!

Exactly! Thanks for providing this concise summary in your words. 

In the next post we generalize the target from a single point to an interval to get even more freedom that we can use for increasing safety further. 

In our current ongoing work, we generalize that further to the case of multiple evaluation metrics, in order to get closer to plausible real-world goals, see our teaser post

Alex Turner's post you referenced first convinces me that his arguments about "orbit-level power-seeking" apply to maximizers and quantilizers/satisficers. Let me reiterate that we are not suggesting quantilizers/satisficers are a good idea, but that I firmly believe explicit safety criteria rather than plain randomization should be used to select plans.

He also claims in that post that the "orbit-level power-seeking" issue affects all schemes that are based on expected utility: "There is no clever EU-based scheme which doesn't have orbit-level power-seekin... (read more)

Thank you for the warm encouragement. 

We tried to be careful not to claim that merely making the decision algorithm aspiration-based is already sufficient to solve the AI safety problem, but maybe we need to add an even more explicit disclaimer in that direction. We explore this approach as a potentially necessary ingredient for safety, not as a complete plan for safety.

In particular, I perfectly agree that conflicting goals are also a severe problem for safety that needs to be addressed (while I don't believe there is a unique problem for safety that... (read more)

"Hence the information what I will do cannot have been available to the predictor." If the latter statement is correct, then how can could have "often correctly predicted the choices of other people, many of whom are similar to you, in the particular situation"?

There's many possible explanations for this data. Let's say I start my analysis with the model that the predictor is guessing, and my model attaches some prior probability for them guessing right in a single case. I might also have a prior about the likelihood of being lied about the predictor's suc... (read more)

2Shmi
Right! If I understand your point correctly, given a strong enough prior for the predictor being lucky or deceptive, it would have to be a lot of evidence to change one's mind, and the evidence would have to be varied. This condition is certainly not satisfied by the original setup. If your extremely confident prior is that foretelling one's actions is physically impossible, then the lie/luck hypothesis would have to be much more likely than changing your mind about physical impossibility. That makes perfect sense to me.  I guess one would want to simplify the original setup a bit. What if you had full confidence that the predictor is not a trickster? Would you one-box or two-box? To get the physical impossibility out of the way, they do not necessarily have to predict every atom in your body and mind, just observe you (and read your LW posts, maybe) to Sherlock-like make a very accurate conclusion about what you would decide. Another question: what kind of experiment, in addition to what is in the setup, would change your mind? 

Take a possible world in which the predictor is perfect (meaning: they were able to make a prediction, and there was no possible extension of that world's trajectory in which what I will actually do deviates from what they have predicted). In that world, by definition, I no longer have a choice. By definition I will do what the predictor has predicted. Whatever has caused what I will do lies in the past of the prediction, hence in the past of the current time point. There is no point in asking myself now what I should do as I have no longer causal influenc... (read more)

4Shmi
Sorry, could not reply due to rate limit. In reply to your first point, I agree, in a deterministic world with perfect predictors the whole question is moot. I think we agree there. Also, yes, assuming "you have a choice between two actions", what you will do has not been decided by you yet. Which is different from "Hence the information what I will do cannot have been available to the predictor." If the latter statement is correct, then how can could have "often correctly predicted the choices of other people, many of whom are similar to you, in the particular situation"? Presumably some information about your decision-making process is available to the predictor in this particular situation, or else the problem setup would not be possible, would it? If you think that you are a very special case, and other people like you are not really like you, then yes, it makes sense to decide that you can get lucky and outsmart the predictor, precisely because you are special. If you think that you are not special, and other people in your situation thought the same way, two-boxed and lost, then maybe your logic is not airtight and your conclusion to two-box is flawed in some way that you cannot quite put your finger on, but the experimental evidence tells you that it is. I cannot see a third case here, though maybe I am missing something. Either you are like others, and so one-boxing gives you more money than two boxing, or you are special and not subject to the setup at all, in which case two-boxing is a reasonable approach. Right, that is, I guess, the third alternative: you are like other people who lost when two-boxing, but they were merely unlucky, the predictor did not have any predictive powers after all. Which is a possibility: maybe you were fooled by a clever con or dumb luck. Maybe you were also fooled by a clever con or dumb luck when the predictor "has never, so far as you know, made an incorrect prediction about your choices". Maybe this all led to this momen

Can you please explain the "zero-probability possible world"?

2Shmi
There is no possible world with a perfect predictor where a two-boxer wins without breaking the condition of it being perfect.

Hi Nathan,

I'm not sure. I guess it depends on what your definition of "agent" is. In my personal definition, following Yann LeCun's recent whitepaper, the "agent" is a system with a number of different modules, one of it being a world model (in our case, an MDP that it can use to simulate consequences of possible policies), one of it being a policy (in our case, an ANN that takes states as inputs and gives action logits as outputs), and one module being a learning algorithm (in our case, a variant of Q-learning that uses the world model to learn a policy t... (read more)

2Nathan Helm-Burger
Thanks, that makes sense!

Excellent! I have three questions

  1. How would we get to a certain upper bound on ?

  2. As collisions with the boundary happen exactly when one action's probability hits zero, it seems the resulting policies are quite large-support, hence quite probabilistic, which might be a problem in itself, making the agent unpredictable. What is your thinking about this?

  3. Related to 2., it seems that while your algorithm ensures that expected true return cannot decrease, it might still lead to quite low true returns in individual runs. So do you agree that this type of

... (read more)

I'm sorry but I fail to see the analogy to momentum or adam, in neither of which the vector or distance from the current point to the initial point plays any role as far as I can see. It is also different from regularizations that modify the objective function, say to penalize moving away from the initial point, which would change the location of all minima. The method I propose preserves all minima and just tries to move towards the one closest to the initial point. I have discussed it with some mathematical optimization experts and they think it's new.

I like the clarity of this post very much! Still, we should be aware that all this hinges on what exactly we mean by "the model".

If "the model" only refers to one or more functions, like a policy function pi(s) and/or a state-value function V(s) and/or a state-action -value function Q(s,a) etc., but does not refer to the training algorithm, then all you write is fine. This is how RL theory uses the word "model".

But some people here also use the term "the model" in a broader sense, potentially including the learning algorithm that adjusts said functions, an... (read more)

replacing the SGD with something that takes the shortest and not the steepest path

Maybe we can design a local search strategy similar to gradient descent which does try to stay close to the initial point x0? E.g., if at x, go a small step into a direction that has the minimal scalar product with x x0 among those that have at most an angle of alpha with the current gradient, where alpha>0 is a hyperparameter. One might call this "stochastic cone descent" if it does not yet have a name. 

2Thane Ruthenis
Doesn't sound like it'd meaningfully change the fundamental dynamics. It's an intervention on the order of things like Momentum or Adam, and they're still "basically just the SGD". Pretty sure similar will be the case here: it may introduce some interesting effects, but won't actually robustly address the greed. ... My current thoughts is that "how can we design a procedure that takes the shortest and not the steepest path to the AGI?" is just "design it manually". I. e., the corresponding "training algorithm" we'll want to replace the SGD with is just "our own general intelligence".

roughly speaking, we gradient-descend our way to whatever point on the perfect-prediction surface is closest to our initial values.

I believe this is not correct as long as "gradient-descend" means some standard version of gradient descent because those are all local, can go highly nonlinear paths, and do not memorize the initial value to try staying close to it.

But maybe we can design a local search strategy similar to gradient descent which does try to stay close to the initial point x0? E.g., if at x, go a small step into a direction that has the minimal... (read more)

Does the one-shot AI necessarily aim to maximize some function (like the probability of saving the world, or the expected "savedness" of the world or whatever), or can we also imagine a satisficing version of the one-shot AI which "just tries to save the world" with a decent probability, and doesn't aim to do any more, i.e., does not try to maximize that probability or the quality of that saved world etc.?

I'm asking this because

  • I suspect that we otherwise might still make a mistake in specifying the optimization target and incentivize the one-shot AI to
... (read more)

Definition 4: Expectation w.r.t. a Set of Sa-Measures

This definition is obviously motivated by the plan to later apply some version of maximin rule, so that only the inf matters. 

I suggest that we also study versions what employ other decision-under-ambiguity rules such as Hurwicz' rule or Savage's minimax regret rule. 

From my reading of quantilizers, they might still choose "near-optimal" actions, just only with a small probability. Whereas a system based on decision transformers (possibly combined with a LLM) could be designed that we could then simply tell to "make me a tea of this quantity and quality within this time and with this probability" and it would attempt to do just that, without trying to make more or better tea or faster or with higher probability.

1Linda Linsefors
Yes, that is a thing you can do with decision transforms too. I was referring to variant of the decision transformer (see link in original short form) where the AI samples the reward it's aiming for. 

even when the agents are unable to explicitly bargain or guarantee their fulfilment of their end by external precommitments

I believe there is a misconception here. The actual game you describe is the game between the programmers, and the fact that they know in advance that the others' programs will indeed be run with the code that their own program has access to does make each program submission a binding commitment to behave in a certain way

Game Theory knows since long that if binding commitments are possible, most dilemmas can be solved easily. In... (read more)

I just stumbled upon this and noticed that a real-world mechanism for international climate policy cooperation that I recently suggested in this paper can be interpreted as a special case of your (G,X,Y) framework.

Assume a fixed game G where

  • each player's action space is the nonnegative reals,
  • U(x,y) is weakly decreasing in x and weakly increasing in y.
  • V(x,y) is weakly decreasing in y and weakly increasing in x.

(Many public goods games, such as the Prisoners' Dilemma, have such a structure)

Let's call an object a Conditional Commitment Function (CCF) iff it i... (read more)

Dear Robert, I just found out about your work and absolutely love it.

Has the following idea been explored yet?

  • The AI system is made of two agents, a strategic agent S and a controller agent C.
  • S's reward function approximates the actual objective function of the system as defined by the designer.
  • S can only propose actions to C, only knows about the environment and the actual actions taken what C tells it, and only has as many compute resources as C gives it.
  • C's reward function encodes hard constraints such as the three laws of robotics or some other fo
... (read more)

Having just read Scott's Geometric Expectation stuff, I want to add that of course another variant of all of this is to replace every occurrence of a mean or expectation by a geometric mean or geometric expectation to make the whole thing more risk-averse.

In its suggested form Maximal Lottery-Lotteries is still a majoritarian system in the sense that a mere majority of 51% of the voters can make sure that candidate A wins regardless how the other 49% vote. For this, they only need to give A a rating of 1 and all other candidates a rating of 0.

One can also turn the system into a non-majoritarian system in which power is distributed proportionally in the sense that any group of x% of the voters can make sure that candidate A gets at least x% winning probability, similar to what is true of the MaxParC voting s... (read more)