All of StrivingForLegibility's Comments + Replies

I can answer this now!

Expected Utility, Geometric Utility, and Other Equivalent Representations

It turns out there's a large family of expectations we can use to build utility functions, including the arithmetic expectation, the geometric expectation, and the harmonic expectation, and they're all equivalent models of VNM rationality! And we need something beyond that family, like the operation Scott uses, to formalize geometric rationality.

Thank you for linking to these different families of means! The quasi-arithmetic mean turned out ... (read more)
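For anyone who wants to see these side by side, here's a minimal sketch of the quasi-arithmetic mean, which recovers the arithmetic, geometric, and harmonic expectations by choosing f to be the identity, the logarithm, or the reciprocal. The lottery probabilities and utilities are made-up numbers for illustration.

```python
import math

def quasi_arithmetic_expectation(probs, values, f, f_inv):
    """Quasi-arithmetic (Kolmogorov) mean: f_inv(sum_i p_i * f(x_i))."""
    return f_inv(sum(p * f(x) for p, x in zip(probs, values)))

probs = [0.5, 0.3, 0.2]        # hypothetical lottery
values = [10.0, 40.0, 100.0]   # utilities of each outcome

arithmetic = quasi_arithmetic_expectation(probs, values, lambda x: x, lambda y: y)
geometric  = quasi_arithmetic_expectation(probs, values, math.log, math.exp)
harmonic   = quasi_arithmetic_expectation(probs, values, lambda x: 1 / x, lambda y: 1 / y)

print(arithmetic, geometric, harmonic)  # 37.0, ~24.0, ~16.8
```

Since each choice of f is a monotone transformation, maximizing the geometric expectation of u ranks lotteries the same way as maximizing the arithmetic expectation of log u, which is the sense in which these are all equivalent representations of VNM rationality.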

4scottviteri
Very interesting! I'm excited to read your post.

Thank you! I think it's exactly the same kind of "conditioning my output on their output" that you were pointing to in your analogy to iterated games. And I expect there's a strong correspondence between "program equilibria where you only condition on predicted outputs" and "iterated game equilibria that can form a stable loop."

Thank you! Ideally, I think we'd all like a model of individual rationality that composes together into a nice model of group rationality. And geometric rationality seems like a promising step in that direction.

This might be a framing thing!

The background details I’d been imagining are that Alice and Bob were in essentially identical situations before their interaction, and it was just luck that they ended up with the capabilities they did.

Alice and Bob have two ways to convert tokens into money, and I’d claim that any rational joint strategy involves only using Bob’s way. Alice's ability to convert tokens into pennies is a red herring that any rational group should ignore.

At that point, it's just a bargaining game over how to split the $1,000,000,000. And I claim... (read more)

Limiting it to economic/comparable values is convenient, but also very inaccurate for all known agents - utility is private and incomparable.

I think modeling utility functions as private information makes a lot of sense! One of the claims I’m making in this post is that utility valuations can be elicited and therefore compared.

My go-to example of an honest mechanism is a second-price auction, which we know we can implement from within the universe. The bids serve as a credible signal of valuation, and if everyone follows their incentives they’ll bid honest... (read more)
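For concreteness, here's a minimal sketch of a sealed-bid second-price auction; the bidder names and valuations are hypothetical.

```python
def second_price_auction(bids):
    """Highest bidder wins and pays the second-highest bid.

    bids: dict mapping bidder -> bid. Requires at least two bidders.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]  # the winner pays the runner-up's bid
    return winner, price

# Hypothetical valuations: bidding honestly is weakly dominant, because your
# bid only determines whether you win, never the price you pay.
bids = {"alice": 120, "bob": 95, "carol": 101}
print(second_price_auction(bids))  # ('alice', 101)
```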

The problem remains though: you make the ex ante call about which information to "decision-relevantly update on", and this can be a wrong call, and this creates commitment races, etc.

My understanding is that commitment races only occur in cases where "information about the commitments made by other agents" has negative value for all relevant agents. (All agents are racing to commit before learning more, which might scare them away from making such a commitment.)

It seems like updateless agents should not find themselves in commitment races.

My impression is ... (read more)

Got it, thank you!

It seems like trapped priors and commitment races are exactly the sort of cognitive dysfunction that updatelessness would solve in generality. 

My understanding is that trapped priors are a symptom of a dysfunctional epistemology, which over-weights prior beliefs when updating on new observations. This results in an agent getting stuck, or even getting more and more confident in their initial position, regardless of what observations they actually make. 

Similarly, commitment races are the result of dysfunctional reasoning that re... (read more)

The distinction between "solving the problem for our prior" and "solving the problem for all priors" definitely helps! Thank you!

I want to make sure I understand the way you're using the term updateless, in cases where the optimal policy involves correlating actions with observations. Like pushing a red button upon seeing a red light, but pushing a blue button upon seeing a blue light. It seems like (See Red -> Push Red, See Blue -> Push Blue) is the policy that CDT, EDT, and UDT would all implement.

In the way that I understand the terms, CDT and EDT... (read more)
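To make "choosing a policy" concrete, here's a toy sketch, with made-up payoffs, that enumerates policies (functions from observations to actions) and scores them from the prior. All three decision theories agree on the winner here.

```python
import itertools

observations = ["red", "blue"]
actions = ["push_red", "push_blue"]
prior = {"red": 0.5, "blue": 0.5}

def utility(obs, act):
    # Hypothetical payoffs: matching the button to the light pays 1.
    return 1.0 if act == f"push_{obs}" else 0.0

def policy_value(policy):
    """Expected utility of a policy, evaluated ex ante from the prior."""
    return sum(p * utility(obs, policy[obs]) for obs, p in prior.items())

# Enumerate every deterministic policy: an action assigned to each observation.
policies = [dict(zip(observations, acts))
            for acts in itertools.product(actions, repeat=len(observations))]
print(max(policies, key=policy_value))  # {'red': 'push_red', 'blue': 'push_blue'}
```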

2Martín Soto
Absolutely. A good implementation of UDT can, from its prior, decide on an updateful strategy. It's just it won't be able to change its mind about which updateful strategy seems best. See this comment for more. As mentioned also in that comment, correct implementations of UDT don't actually flinch away from information: they just decide ex ante (when still not having access to that information) whether or not they will let their future actions depend on it. The problem remains though: you make the ex ante call about which information to "decision-relevantly update on", and this can be a wrong call, and this creates commitment races, etc.

Got it, I think I understand better the problem you're trying to solve! It's not just being able to design a particular software system and give it good priors, it's also finding a framework that's robust to our initial choice of priors.

Is it possible for all possible priors to converge on optimal behavior, even given unlimited observations? I'm thinking of Yudkowsky's example of the anti-Occamian and anti-Laplacian priors: the more observations an anti-Laplacian agent makes, the further its beliefs go from the truth.

I'm also surprised that dynamic stabili... (read more)

2Martín Soto
Certainly not, in the most general case, as you correctly point out. Here I was studying a particular case: updateless agents in a world remotely looking like the real world. And even more particular: thinking about the kinds of priors that superintelligences created in the real world might actually have. Eliezer believes that, in these particular cases, it's very likely we will get optimal behavior (we won't get trapped priors, nor commitment races). I disagree, and that's what I argue in the post. If by "predictable in advance" you mean "from the updateless agent's prior", then nope! Updatelessness maximizes EV from the prior, so it will do whatever looks best from this perspective. If that's what you want, then updatelessness is for you! The problem is, we have many pro tanto reasons to think this is not a good representation of rational decision-making in reality, nor the kind of cognition that survives for long in reality. Because of considerations about "the world being so complex that your prior will be missing a lot of stuff". And in particular, multi-agentic scenarios are something that makes this complexity sky-rocket. Of course, you can say "but that consideration will also be included in your prior". And that does make the situation better. But eventually your prior needs to end. And I argue, that's much before you have all the necessary information to confidently commit to something forever (but other people might disagree with this).

It sounds like we already mostly agree!

I agree with Caspar's point in the article you linked: the choice of metric determines which decision theories score highly on it. The metric that I think points towards "going Straight sometimes, even after observing that your counterpart has pre-committed to always going Straight" is a strategic one. If Alice and Bob are writing programs to play open-source Chicken on their behalf, then there's a program equilibrium where:

  • Both programs first try to perform a logical handshake, coordinating on a socially optimal join
... (read more)
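Here's a minimal sketch of that handshake pattern, using the simplest "compare source code, then coordinate" trick rather than a full proof-based handshake. The payoff numbers, the Swerve fallback, and the shared coin are illustrative assumptions rather than details from the program equilibrium described above.

```python
import inspect
import random

# Hypothetical Chicken payoffs, indexed as (row action, column action) -> (row, column).
PAYOFFS = {
    ("Swerve", "Swerve"):     (1, 1),
    ("Swerve", "Straight"):   (0, 2),
    ("Straight", "Swerve"):   (2, 0),
    ("Straight", "Straight"): (-10, -10),
}

def handshake_bot(player_index: int, my_source: str, their_source: str, shared_coin: int) -> str:
    """If the counterpart runs identical source code, play the correlated strategy
    where a shared coin decides who goes Straight; otherwise fall back to Swerve."""
    if their_source == my_source:
        return "Straight" if shared_coin == player_index else "Swerve"
    return "Swerve"  # cautious default against unrecognized programs

source = inspect.getsource(handshake_bot)
coin = random.randint(0, 1)
row = handshake_bot(0, source, source, coin)
col = handshake_bot(1, source, source, coin)
print(row, col, PAYOFFS[(row, col)])  # never (Straight, Straight) when both recognize each other
```

Real logical handshakes generalize the equality check to something like "I can prove my counterpart enacts its half of the deal," which is where the commitment-race worries elsewhere in this thread come in.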
2Martín Soto
I share the intuition that this lens is important. Indeed, there might be some important quantitative differences between a) I have a well-defined decision theory, and am choosing how to build my successor and b) I'm doing some vague normative reasoning to choose a decision theory (like we're doing right now), but I think these differences are mostly contingent, and the same fundamental dynamics about strategicness are at play in both scenarios. I think this is equivalent to your decision theory being dynamically stable (that is, its performance never improves by having access to commitments), and I'm pretty sure the only way to attain this is complete updatelessness (which is bad). That said, again, it might perfectly be that given our prior, many parts of cooperation-relevant concept-space seem very safe to explore, and so "for all practical purposes" some decision procedures are basically completely safe, and we're able to use them to coordinate with all agents (even if we haven't "solved in all prior-independent generality" the fundamental trade-off between updatelessness and updatefulness).

So we seem to face a fundamental trade-off between the information benefits of learning (updating) and the strategic benefits of updatelessness. If I learn the digit, I will better navigate some situations which require this information, but I will lose the strategic power of coordinating with my counterfactual self, which is necessary in other situations.

 

It seems like we should be able to design software systems that are immune to any infohazard, including logical infohazards.

  • If it's helpful to act on a piece of information you know, act on it.
  • If it
... (read more)
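One way to cash out the bullet points above: evaluate, from the prior, the best policy that conditions on a piece of information and the best policy that ignores it, and only condition when that doesn't lower ex ante expected value. The payoff table and prior below are a made-up toy, not a proposal for a real system.

```python
def best_policy_value(prior, utility, condition_on_info):
    """Ex ante expected value of the best policy.

    If condition_on_info is False, the policy must pick one action for every
    state; if True, it may pick a different action per observed state.
    """
    states = list(prior)
    actions = ["A", "B"]
    if condition_on_info:
        # Best action state-by-state: the policy is observation -> action.
        return sum(prior[s] * max(utility(s, a) for a in actions) for s in states)
    # One action chosen from the prior, ignoring the observation.
    return max(sum(prior[s] * utility(s, a) for s in states) for a in actions)

prior = {"s1": 0.6, "s2": 0.4}  # hypothetical prior over states
payoff_table = {("s1", "A"): 3, ("s1", "B"): 0, ("s2", "A"): 0, ("s2", "B"): 2}
utility = lambda s, a: payoff_table[(s, a)]

with_info = best_policy_value(prior, utility, condition_on_info=True)
without_info = best_policy_value(prior, utility, condition_on_info=False)
# The ex ante call: let actions depend on the information only if that helps.
print(with_info, without_info, with_info >= without_info)
```

In a single-agent problem the conditioning branch never does worse, since it can always mimic the ignoring branch; the thread's worry is that in multi-agent settings other agents react to what your policy conditions on, which is where the trade-off reappears.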
2Martín Soto
As mentioned in another comment, I think this is not possible to solve in full generality (meaning, for all priors), because that requires complete updatelessness and we don't want to do that. I think all your proposed approaches are equivalent (and I think the most intuitive framing is "cognitive sandboxes"). And I think they don't work, because of reasoning close to this paragraph: But again, the problem might be solvable in particular cases (like, our prior).

To feed back, it sounds like "thinking more about what other agents will do" can be infohazardous to some decision theories. In the sense that they sometimes handle that sort of logical information in a way that produces worse results than if they didn't have that logical information in the first place. They can sometimes regret thinking more.

It seems like it should always be possible to structure our software systems so that this doesn't happen. I think this comes at the cost of not always best-responding to other agents' policies.

In the example of Chicke... (read more)

2Martín Soto
You're right that (a priori and on the abstract) "bargaining power" fundamentally trades off against "best-responding". That's exactly the point of my post. This doesn't prohibit, though, that a lot of pragmatic and realistic improvements are possible (because we know agents in our reality tend to think like this or like that), even if the theoretical trade-off can never be erased completely or in all situations and for all priors. Your latter discussion is a normative one. And while I share your normative intuitions that best-responding completely (being completely updateful) is not always the best to do in realistic situations, I do have quibbles with this kind of discourse (similar to this). For example, why would I want to go Straight even after I have learned the other does? Out of some terminal valuation of fairness, or counterfactuals, more than anything, I think (more here). Or similarly, why should I think sticking to my notion of fairness shall ex ante convince the other player to coordinate on it, as opposed to the other player trying to pull out some "even more meta" move, like punishing notions of fairness that are not close enough to theirs? Again, all of this will depend on our priors.

It’s certainly not looking very likely (> 80%) that ... in causal interactions [most superintelligences] can easily and “fresh-out-of-the-box” coordinate on Pareto optimality (like performing logical or value handshakes) without falling into commitment races.

 

What are some obstacles to superintelligences performing effective logical handshakes? Or equivalently, what are some necessary conditions that seem difficult to bring about, even for very smart software systems?

(My understanding of the term "logical handshake" is as a generalization of the te... (read more)

2Martín Soto
This is exactly the kind of procedure which might get hindered by commitment races, because it involves "thinking more about what the other agents will do", and the point of commitment races is that sometimes (and depending on your beliefs) this can seem net-negative ex ante (that is, before actually doing the thinking). Of course, this doesn't prohibit logical handshakes from being enacted sometimes. For example, if all agents start with a high enough prior on others enacting their part of ϕ, then they will do it. More realistically, it probably won't be as easy as this, but if it is the case that all agents feel safe enough thinking about ϕ (they deem it unlikely this backfires into losing bargaining power), and/or the upshot is sufficiently high (when multiplied by the probability and so on), then all agents will deem it net-positive to think more about ϕ and the others, and eventually they'll implement it. So it comes down to how likely we think are priors (or the equivalent thing for AIs) which successfully fall into this coordination basin, opposed to priors which get stuck in some earlier prior without wanting to think more. And again, we have a few pro tanto reasons to expect coordination to be viable (and a few in the other direction). I do think out of my list of statements, logical handshakes in causal interactions might be one of the most likely ones.

Thank you! I'm interested in checking out earlier chapters to make sure I understand the notation, but here's my current understanding:

There are 7 axioms that go into Joyce's representation theorem, and none of them seem to put any constraints on the set of actions available to the agent. So we should be able to ask a Joyce-rational agent to choose a policy for a game.

My impression of the representation theorem is that its expected-utility formula can represent a variety of decision theories. Including ones like CDT which are dynami... (read more)

Totally! The ecosystem I think you're referring to is all of the programs which, when playing Chicken with each other, manage to play a correlated strategy somewhere on the Pareto frontier between (1,2) and (2,1).

Games like Chicken are actually what motivated me to think in terms of "collaborating to build mechanisms to reshape incentives." If both players choose their mixed strategies separately, there's an equilibrium where they each independently mix between Straight and Swerve. But sometimes this leads to (Straight, Straight) or (Sw... (read more)
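Here's a sketch of that contrast with guessed payoff numbers (only the (1,2) and (2,1) outcomes come from the comment above): it finds the symmetric mixed equilibrium numerically and shows why independent mixing sometimes produces (Straight, Straight), while a shared-coin correlated strategy never does.

```python
# Hypothetical Chicken payoffs (row, column); only the (1, 2) and (2, 1)
# outcomes are taken from the comment above, the rest are guesses.
payoff = {
    ("Swerve", "Swerve"):     (1, 1),
    ("Swerve", "Straight"):   (1, 2),
    ("Straight", "Swerve"):   (2, 1),
    ("Straight", "Straight"): (0, 0),
}

def row_value(p_row_straight, p_col_straight):
    """Row player's expected payoff when both players mix independently."""
    total = 0.0
    for (r, c), (u_row, _) in payoff.items():
        pr = p_row_straight if r == "Straight" else 1 - p_row_straight
        pc = p_col_straight if c == "Straight" else 1 - p_col_straight
        total += pr * pc * u_row
    return total

# Symmetric mixed equilibrium: find the opponent's P(Straight) that makes the
# row player indifferent between Straight and Swerve; by symmetry both players
# mix with this probability.
candidates = [i / 1000 for i in range(1001)]
p_star = min(candidates, key=lambda p: abs(row_value(1.0, p) - row_value(0.0, p)))
print(f"P(Straight) in the mixed equilibrium ~ {p_star:.2f}")
print(f"P(Straight, Straight) under independent mixing ~ {p_star ** 2:.2f}")
# A shared coin implementing 50/50 between (Straight, Swerve) and
# (Swerve, Straight) stays on the Pareto frontier and never crashes.
```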

I'd been thinking about "cleanness", but I think you're right that "being oriented to what we're even talking about" is more important. Thank you again for the advice!

Thank you! I started writing the previous post in this sequence and decided to break the example off into its own post. 

For anyone else looking for a TLDR: this is an example of how a network of counterfactual mechanisms can be used to make logical commitments for an arbitrary game.

2Carl Feynman
Put those two sentences at the beginning of your post and my objection goes away!

Totally! One of the most impressive results I've seen for one-shot games is the Robust Cooperation paper studying the open-source Prisoners' Dilemma, where each player delegates their decision to a program that will learn the exact source code of the other delegate at runtime. Even utterly selfish agents have an incentive to delegate their decision to a program like FairBot or PrudentBot.

I think the probabilistic element helps to preserve expected utility in cases where the demands from each negotiator exceed the total amount of resources being bargained o... (read more)
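Here's my guess at a minimal version of the probabilistic-rejection scheme, assuming the negotiators already agree on what a fair allocation would be; the specific acceptance probability is an illustrative choice rather than the exact rule from the comment above.

```python
import random

def probabilistic_grant(demands, fair_shares, total, epsilon=0.01):
    """Grant compatible demands outright; carry out an incompatible set of
    demands only with a probability low enough that demanding more than your
    fair share can't pay in expectation. Assumes fair shares sum to the total,
    so incompatible demands always include at least one over-demander.
    Returns (granted?, probability used)."""
    if sum(demands.values()) <= total:
        return True, 1.0
    # Choose p so each over-demander's expected take stays just below their
    # fair share: p * demand_i <= fair_i - epsilon for the binding player.
    p = min((fair_shares[i] - epsilon) / demands[i]
            for i in demands if demands[i] > fair_shares[i])
    p = max(0.0, min(1.0, p))
    return random.random() < p, p

fair = {"alice": 500, "bob": 500}    # hypothetical agreed-on fair split
greedy = {"alice": 800, "bob": 500}  # alice demands more than her share
granted, p = probabilistic_grant(greedy, fair, total=1000)
print(p, p * greedy["alice"])        # alice's expected take stays below 500
```

The property this buys is that inflating your demand can't raise your expected take above your fair share, even though an incompatible set of demands sometimes costs everyone the deal.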

My point is there's a very tenuous jump from us making decisions to how/whether to enforce our preferences on others.

I think the big link I would point to is "politics/economics." The spherical cows in a vacuum model of a modern democracy might be something like "a bunch of agents with different goals, that use voting as a consensus-building and standardization mechanism to decide what rules they want enforced, and contribute resources towards the costs of that enforcement."

When it comes to notions of fairness, I think we agree that there is no single stan... (read more)

2Dagon
I love the probabilistic rejection idea - it's clever and fun.  But it depends a LOT on communication or repetition-with-identity so that the offerer has some clue that's the algorithm in play.  And in that case, the probabilistic element is unnecessary - simple precommitment is enough (and, in strictly-controlled games without repetition, allowing the responder to publicly and enforceably precommit just reverses the positions).   I think our main disagreement is on what to do when one or more participants in one-shot (or fixed-length) games are truly selfish, and the payouts listed are fully correct in utility, after accounting for any empathy or desire for fairness.  Taboo "fair", and substitute "optimizing for self".  Shapley values are a good indicator of bargaining power for some kinds of game, but the assumption of symmetry is hard to justify.

This might be a miscommunication, I meant something like "you and I individually might agree that some cost-cutting measures are good and some cost-cutting measures are bad."

Agents probably also have an instrumental reason to coordinate on defining and enforcing standards for things like fair wages and adequate safety, where some agents might otherwise have an incentive to enrich themselves at the expense of others.

2Dagon
I'm confused.  Are you and/or I the people making the decisions about a given company's competitiveness decision?  My point is there's a very tenuous jump from us making decisions to how/whether to enforce our preferences on others.   The framing of "fair", "justice", "good" and "bad" aren't well-defined in terms of rationality or game theory.  There is no "standardized consensus notion of fairness".  MOST actions are good for some individuals, bad for some, and neutral to a whole shitload.   Cost-cutting by firing someone and expecting the remaining workers to work a little more efficiently is a very good example.  It's good for the owners of the company, good for customers who focus on price, neutral to bad for customers who focus on service/touch (which requires more workers), good for workers who'd otherwise fare worse if the company went out of business, bad for workers who have to work harder for the same pay, and neutral to 9 billion uninvolved humans. It's VERY unclear how or whether this framework applies to any of the stakeholders.

Oops, when I heard about it I'd gotten the impression that this had been adopted by at least one AI firm, even a minor one, but I also can't find anything suggesting that's the case. Thank you!

It looks like OpenAI has split into a nonprofit organization and a "capped-profit" company.

The fundamental idea of OpenAI LP is that investors and employees can get a capped return if we succeed at our mission, which allows us to raise investment capital and attract employees with startup-like equity. But any returns beyond that amount—and if we are successful, we ex

... (read more)

I think we agree that in cases where competition is leading to good results, no change to the dynamics is called for.

We probably also agree on a lot of background value judgements like "when businesses become more competitive by spending less on things no one wants, like waste or pollution, that's great!" And "when businesses become more competitive by spending less on things people want, like fair wages or adequate safety, that's not great and intervention is called for."

One case where we might literally want to distribute resources from the makers of a v... (read more)

2Dagon
Wait - now there's an authority who can distinguish between good efficiency and bad efficiency?  That's a pretty long jump from the post about how individual agents should approach incentives and retribution for other agents.
3Richard_Kennaway
I had not heard of the FHI's Windfall Clause, but looking on the Internet, I don't see signs of anyone signing up to it yet. Metaculus has a still-open prediction market on whether any major AI company will sign up to it by the end of 2025.

For games without these mechanisms, the rational outcomes don't end up that pleasant.  Except sometimes, with players who have extra-rational motives.

I think we agree that if a selfish agent needs to be forced to not treat others poorly, in the absence of such enforcement they will treat others poorly.

It also seems like in many cases, selfish agents have an incentive to create exactly those mechanisms ensuring good outcomes for everyone, because it leads to good outcomes for them in particular. A nation-state comprised entirely of very selfish people ... (read more)

That sounds reasonable to me! This could be another negative externality that we judge to be acceptable, and that we don't want to internalize. Something like "if you break any of these rules, (e.g. worker safety, corporate espionage, etc.) then you owe the affected parties compensation. But as long as you follow the rules, there is no consensus-recognized debt."

It seems straightforward! Kaggle is the closest example I've been able to think of. But yes that's totally the sort of thing that I think would constitute an optimization market!

Absolutely! I have less experience on the "figuring out what interventions are appropriate" side of the medical system, but I know of several safety measures they employ that we can adapt for AI safety.

For example, no actor is unilaterally permitted to think up a novel intervention and start implementing it. They need to convince an institutional review board that the intervention has merit, and that a clinical trial can be performed safely and ethically. Then the intervention needs to be approved by a bunch of bureaucracies like the FDA. And then medical d... (read more)

This is a much more nuanced take! At the beginning of Chapter 6, Jan proposes restricting our attention to agents which are limit computable:

Our agents are useless if they cannot be approximated in practice, i.e., by a regular Turing machine. Therefore we posit that any ideal for a ‘perfect agent’ needs to be limit computable (Δ⁰₂).

This seems like a very reasonable restriction! Any implementation needs to be computable, but it makes sense to look for theoretic ideals which can be approximated.
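For readers who haven't seen limit computability before: a value is limit computable when it's the limit of a computable sequence of approximations, even if no stage tells you the approximation has converged. The Goldbach-style predicate below is my own toy illustration, not an example from the thesis.

```python
def is_prime(m):
    return m > 1 and all(m % d for d in range(2, int(m ** 0.5) + 1))

def no_counterexample_below(k):
    """Toy universal claim: every even number from 4 to k is a sum of two primes.
    Checked by brute force, purely for illustration."""
    return all(any(is_prime(p) and is_prime(even - p) for p in range(2, even))
               for even in range(4, k + 1, 2))

def limit_approximation(n):
    """Stage-n guess at the truth of the universal statement: 1 while no
    counterexample has appeared below n, 0 forever after one does."""
    return 1 if no_counterexample_below(n) else 0

# The ideal value is lim_{n -> oo} limit_approximation(n). Every stage is
# computable and the guesses converge, but no stage certifies convergence,
# which is exactly the sense in which the value is only limit computable.
print([limit_approximation(n) for n in (10, 100, 1000)])
```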

Yes! I'm a fan of Yudkowsky's view that the sensation of free will is the sensation of "couldness" among multiple actions. When it feels like I could do one thing or another, it feels like I have free will. When it feels like I could have chosen differently, it feels like I chose freely.

I suspect that an important ingredient of the One True Decision Theory is being shaped in such a way that other agents, modelling how you'll respond to different policies they might implement, find it in their interest to implement policies which treat you fairly.

Got it, I misunderstood the semantics of what ∏ was supposed to capture. I thought the elements needed to be mutual best-responses. Thank you for the clarification, I've updated my implementation accordingly!

Edit: Cleo Nardo has confirmed that they intended ∏ to mean the Cartesian product of sets, the ordinary thing for that symbol to mean in that context. I misunderstood the semantics of what ∏ was intended to represent. I've updated my implementation to use the intended Cartesian product when calculating the best response function; the rest of this comment is based on my initial (wrong) interpretation of ∏.

 

I write the original expression, and your expression rewritten using the OP's notation:

Original: B: x ↦ ∏_{i∈I} ψ_i(g∘U_i(x))

... (read more)
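For concreteness, here's how I'd compute B under the Cartesian-product reading on a toy two-player game, taking each ψ_i to be argmax over player i's payoffs given the others' actions. The payoff table is made up, and Nash equilibria fall out as exactly the fixed points x ∈ B(x).

```python
from itertools import product

# Toy two-player game with made-up payoffs: a profile (a0, a1) maps to (u0, u1).
actions = [("C", "D"), ("C", "D")]
payoffs = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 4),
    ("D", "C"): (4, 0), ("D", "D"): (1, 1),
}

def best_responses(i, x):
    """Player i's best replies to the other players' parts of the profile x,
    i.e. psi_i taken to be argmax over i's payoff function at x."""
    def u_i(a):
        profile = tuple(a if j == i else x[j] for j in range(len(x)))
        return payoffs[profile][i]
    best = max(u_i(a) for a in actions[i])
    return {a for a in actions[i] if u_i(a) == best}

def B(x):
    """Cartesian product of every player's best-response set at x."""
    return set(product(*(best_responses(i, x) for i in range(len(x)))))

# Nash equilibria are exactly the fixed points x in B(x).
nash = [x for x in product(*actions) if x in B(x)]
print(nash)  # [('D', 'D')] for this prisoner's-dilemma-style payoff table
```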
1rotatingpaguro
Oh I see now, B just needs to work to pinpoint Nash equilibria, I did not make that connection. But anyway, the reason I'm suspicious of your interpretation is not that your math is not correct, but that it makes the OP notation so unnatural. The unnatural things are: 1. ∏ being context-dependent. 2. ∏ not having its standard meaning. 3. U_i used implicitly instead of explicitly, when later it takes on a more important role to change decision theory. 4. Using x ∈ B(x) as condition without mentioning that already B(x) ≠ ∅ ⟺ x is Nash if |I| ≥ 2. So I guess I will stay in doubt until the OP confirms "yep I meant that".

Edit: Cleo Nardo has confirmed that they intended ∏ to mean the Cartesian product of sets, the ordinary thing for that symbol to mean in that context. I misunderstood the semantics of what ∏ was intended to represent. I've updated my implementation to use the intended Cartesian product when calculating the best response function; the rest of this comment is my initial (wrong) interpretation of ∏.

 

I needed to go back to one of the papers cited in Part 1 to understand what that ∏ was doing in that expressio... (read more)

2rotatingpaguro
I'm weirded out by this. To look at everything together, I write the original expression, and your expression rewritten using the OP's notation:
Original: B: x ↦ ∏_{i∈I} ψ_i(g∘U_i(x))
Yours: B(x) = ⋂_{i∈I} {x(i↦α) : α ∈ ψ_i(g∘U_i(x))} = ⋂_{i∈I} U_i(x)(ψ_i(g∘U_i(x)))
(I'm using the notation that a function applied to a set is the image of that set.) So the big pi symbol stands for ∏_{i∈I} A_i = ⋂_{i∈I} U_i(x)(A_i). So it's not a standalone operator: it's context-dependent because it pops out an implicit x. The OP otherwise gives the impression of a more functional mindset, so I suspect the OP may mean something different from your guess. Other problem with your interpretation: it yields the empty set unless all agents consider doing nothing an option. The only possible non-empty output is {x}. Reason: each set you are intersecting contains tuples with all elements equal to the ones in x, but for one. So the intersection will necessarily only contain tuples with all elements equal to those in x.