Thank you! I think it's exactly the same kind of "conditioning my output on their output" that you were pointing to in your analogy to iterated games. And I expect there's a strong correspondence between "program equilibria where you only condition on predicted outputs" and "iterated game equilibria that can form a stable loop."
Thank you! Ideally, I think we'd all like a model of individual rationality that composes together into a nice model of group rationality. And geometric rationality seems like a promising step in that direction.
This might be a framing thing!
The background details I’d been imagining are that Alice and Bob were in essentially identical situations before their interaction, and it was just luck that they ended up with the capabilities they did.
Alice and Bob have two ways to convert tokens into money, and I’d claim that any rational joint strategy involves only using Bob’s way. Alice's ability to convert tokens into pennies is a red herring that any rational group should ignore.
At that point, it's just a bargaining game over how to split the $1,000,000,000. And I claim...
Limiting it to economic/comparable values is convenient, but also very inaccurate for all known agents - utility is private and incomparable.
I think modeling utility functions as private information makes a lot of sense! One of the claims I’m making in this post is that utility valuations can be elicited and therefore compared.
My go-to example of an honest mechanism is a second-price auction, which we know we can implement from within the universe. The bids serve as a credible signal of valuation, and if everyone follows their incentives they’ll bid honest...
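(For concreteness, here's a minimal sketch of the mechanism; the bidder names and valuations are made up for illustration.)

```python
# Sketch of a sealed-bid second-price (Vickrey) auction.
# The winner pays the second-highest bid, which is why bidding
# one's true valuation is a (weakly) dominant strategy.

def second_price_auction(bids):
    """bids: dict mapping bidder name -> bid amount."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, _ = ranked[0]
    price = ranked[1][1]  # winner pays the second-highest bid
    return winner, price

# If everyone bids honestly, the bids double as a credible signal
# of how much each bidder values the item.
valuations = {"alice": 120, "bob": 100, "carol": 80}
winner, price = second_price_auction(valuations)
print(winner, price)  # alice wins and pays 100
```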
The problem remains though: you make the ex ante call about which information to "decision-relevantly update on", and this can be a wrong call, and this creates commitment races, etc.
My understanding is that commitment races only occur in cases where "information about the commitments made by other agents" has negative value for all relevant agents. (All agents are racing to commit before learning more, which might scare them away from making such a commitment.)
It seems like updateless agents should not find themselves in commitment races.
My impression is ...
Got it, thank you!
It seems like trapped priors and commitment races are exactly the sort of cognitive dysfunction that updatelessness would solve in generality.
My understanding is that trapped priors are a symptom of a dysfunctional epistemology, which over-weights prior beliefs when updating on new observations. This results in an agent getting stuck, or even getting more and more confident in their initial position, regardless of what observations they actually make.
Similarly, commitment races are the result of dysfunctional reasoning that re...
The distinction between "solving the problem for our prior" and "solving the problem for all priors" definitely helps! Thank you!
I want to make sure I understand the way you're using the term updateless, in cases where the optimal policy involves correlating actions with observations. Like pushing a red button upon seeing a red light, but pushing a blue button upon seeing a blue light. It seems like (See Red -> Push Red, See Blue -> Push Blue) is the policy that CDT, EDT, and UDT would all implement.
In the way that I understand the terms, CDT and EDT...
Got it, I think I understand better the problem you're trying to solve! It's not just being able to design a particular software system and give it good priors, it's also finding a framework that's robust to our initial choice of priors.
Is it possible for agents with arbitrary priors to converge on optimal behavior, even given unlimited observations? I'm thinking of Yudkowsky's examples of the anti-Occamian and anti-Laplacian priors: the more observations an anti-Laplacian agent makes, the further its beliefs move from the truth.
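(To make that concrete, here's my own toy formalization of an anti-Laplacian predictor, flipping Laplace's rule of succession; it's a sketch, not Yudkowsky's exact construction.)

```python
# Toy anti-Laplacian predictor: where Laplace's rule of succession
# predicts P(next = 1) = (successes + 1) / (n + 2), this agent
# predicts (failures + 1) / (n + 2). On a stream of all 1s, its
# credence in the next 1 falls toward 0 as evidence accumulates.

def anti_laplace(successes, failures):
    return (failures + 1) / (successes + failures + 2)

stream = [1] * 1000  # the truth: every observation is a 1
s = f = 0
for i, x in enumerate(stream, start=1):
    s += x
    f += 1 - x
    if i in (1, 10, 100, 1000):
        print(i, round(anti_laplace(s, f), 4))
# Credence in the next 1: 0.3333, 0.0833, 0.0098, 0.001
```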
I'm also surprised that dynamic stabili...
It sounds like we already mostly agree!
I agree with Caspar's point in the article you linked: the choice of metric determines which decision theories score highly on it. The metric that I think points towards "going Straight sometimes, even after observing that your counterpart has pre-committed to always going Straight" is a strategic one. If Alice and Bob are writing programs to play open-source Chicken on their behalf, then there's a program equilibrium where:
So we seem to face a fundamental trade-off between the information benefits of learning (updating) and the strategic benefits of updatelessness. If I learn the digit, I will better navigate some situations which require this information, but I will lose the strategic power of coordinating with my counterfactual self, which is necessary in other situations.
It seems like we should be able to design software systems that are immune to any infohazard, including logical infohazards.
To play that back: it sounds like "thinking more about what other agents will do" can be infohazardous to some decision theories, in the sense that they sometimes handle that sort of logical information in a way that produces worse results than if they didn't have it in the first place. They can sometimes regret thinking more.
It seems like it should always be possible to structure our software systems so that this doesn't happen. I think this comes at the cost of not always best-responding to other agents' policies.
In the example of Chicke...
It’s certainly not looking very likely (> 80%) that ... in causal interactions [most superintelligences] can easily and “fresh-out-of-the-box” coordinate on Pareto optimality (like performing logical or value handshakes) without falling into commitment races.
What are some obstacles to superintelligences performing effective logical handshakes? Or equivalently, what are some necessary conditions that seem difficult to bring about, even for very smart software systems?
(My understanding of the term "logical handshake" is as a generalization of the te...
Thank you! I'm interested in checking out earlier chapters to make sure I understand the notation, but here's my current understanding:
There are 7 axioms that go into Joyce's representation theorem, and none of them seem to put any constraints on the set of actions available to the agent. So we should be able to ask a Joyce-rational agent to choose a policy for a game.
My impression of the representation theorem is that a formula like can represent a variety of decision theories. Including ones like CDT which are dynami...
Totally! The ecosystem I think you're referring to is all of the programs which, when playing Chicken with each other, manage to play a correlated strategy somewhere on the Pareto frontier between (1,2) and (2,1).
Games like Chicken are actually what motivated me to think in terms of "collaborating to build mechanisms to reshape incentives." If both players choose their mixed strategies separately, there's an equilibrium where each of them independently mixes between Straight and Swerve. But sometimes this leads to (Straight, Straight) or (Sw...
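(For concreteness, here's a sketch of how that independent-mixing equilibrium falls out of the indifference conditions. The payoff numbers below are assumptions for illustration, not the exact payoffs from the post.)

```python
# Mixed Nash equilibrium of a 2x2 Chicken game via indifference conditions.
# Assumed payoffs (row, col), with actions ordered (Straight, Swerve):
#   Straight/Straight -> (0, 0)   crash
#   Straight/Swerve   -> (2, 1)
#   Swerve/Straight   -> (1, 2)
#   Swerve/Swerve     -> (1, 1)
A = [[0, 2],  # row player's payoffs
     [1, 1]]
B = [[0, 1],  # column player's payoffs
     [2, 1]]

# q = P(column plays Straight) that makes the row player indifferent:
#   q*A[0][0] + (1-q)*A[0][1] = q*A[1][0] + (1-q)*A[1][1]
q = (A[1][1] - A[0][1]) / (A[0][0] - A[0][1] - A[1][0] + A[1][1])
# p = P(row plays Straight) that makes the column player indifferent:
p = (B[1][1] - B[1][0]) / (B[0][0] - B[1][0] - B[0][1] + B[1][1])

print(p, q)                  # 0.5 0.5 for these assumed payoffs
print("P(crash) =", p * q)   # 0.25: the inefficiency independent mixing can't avoid
```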
I'd been thinking about "cleanness", but I think you're right that "being oriented to what we're even talking about" is more important. Thank you again for the advice!
Thank you! I started writing the previous post in this sequence and decided to break the example off into its own post.
For anyone else looking for a TLDR: this is an example of how a network of counterfactual mechanisms can be used to make logical commitments for an arbitrary game.
Totally! One of the most impressive results I've seen for one-shot games is the Robust Cooperation paper studying the open-source Prisoner's Dilemma, where each player delegates their decision to a program that will learn the exact source code of the other delegate at runtime. Even utterly selfish agents have an incentive to delegate their decision to a program like FairBot or PrudentBot.
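(FairBot and PrudentBot work by bounded proof search in provability logic, which doesn't fit in a few lines. As a much simpler illustration of the open-source setting, here's a sketch of a CliqueBot-style player that cooperates exactly when its counterpart's source code matches its own; the function names are mine.)

```python
# Toy open-source Prisoner's Dilemma. Each player is a function that
# receives its counterpart's source code and returns "C" or "D".
# CliqueBot is the simplest robust-ish strategy: cooperate iff the
# counterpart is literally running the same program. The paper's
# FairBot/PrudentBot improve on this by searching for *proofs* of
# cooperation instead of demanding exact source equality.
import inspect

def clique_bot(opponent_source: str) -> str:
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"

def defect_bot(opponent_source: str) -> str:
    return "D"

def play(bot1, bot2):
    return bot1(inspect.getsource(bot2)), bot2(inspect.getsource(bot1))

print(play(clique_bot, clique_bot))  # ('C', 'C')
print(play(clique_bot, defect_bot))  # ('D', 'D')
```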
I think the probabilistic element helps to preserve expected utility in cases where the demands from each negotiator exceed the total amount of resources being bargained o...
My point is there's a very tenuous jump from us making decisions to how/whether to enforce our preferences on others.
I think the big link I would point to is "politics/economics." The spherical cows in a vacuum model of a modern democracy might be something like "a bunch of agents with different goals, that use voting as a consensus-building and standardization mechanism to decide what rules they want enforced, and contribute resources towards the costs of that enforcement."
When it comes to notions of fairness, I think we agree that there is no single stan...
This might be a miscommunication, I meant something like "you and I individually might agree that some cost-cutting measures are good and some cost-cutting measures are bad."
Agents probably also have an instrumental reason to coordinate on defining and enforcing standards for things like fair wages and adequate safety, where some agents might otherwise have an incentive to enrich themselves at the expense of others.
Oops, when I heard about it I'd gotten the impression that this had been adopted by at least one AI firm, even a minor one, but I also can't find anything suggesting that's the case. Thank you!
It looks like OpenAI has split into a nonprofit organization and a "capped-profit" company.
...The fundamental idea of OpenAI LP is that investors and employees can get a capped return if we succeed at our mission, which allows us to raise investment capital and attract employees with startup-like equity. But any returns beyond that amount—and if we are successful, we ex
I think we agree that in cases where competition is leading to good results, no change to the dynamics is called for.
We probably also agree on a lot of background value judgements like "when businesses become more competitive by spending less on things no one wants, like waste or pollution, that's great!" And "when businesses become more competitive by spending less on things people want, like fair wages or adequate safety, that's not great and intervention is called for."
One case where we might literally want to distribute resources from the makers of a v...
For games without these mechanisms, the rational outcomes don't end up being that pleasant, except sometimes with players who have extra-rational motives.
I think we agree that if a selfish agent needs to be forced to not treat others poorly, in the absence of such enforcement they will treat others poorly.
It also seems like in many cases, selfish agents have an incentive to create exactly those mechanisms ensuring good outcomes for everyone, because it leads to good outcomes for them in particular. A nation-state comprised entirely of very selfish people ...
That sounds reasonable to me! This could be another negative externality that we judge to be acceptable, and that we don't want to internalize. Something like "if you break any of these rules, (e.g. worker safety, corporate espionage, etc.) then you owe the affected parties compensation. But as long as you follow the rules, there is no consensus-recognized debt."
It seems straightforward! Kaggle is the closest example I've been able to think of. But yes that's totally the sort of thing that I think would constitute an optimization market!
Absolutely! I have less experience on the "figuring out what interventions are appropriate" side of the medical system, but I know of several safety measures they employ that we can adapt for AI safety.
For example, no actor is unilaterally permitted to think up a novel intervention and start implementing it. They need to convince an institutional review board that the intervention has merit, and that a clinical trial can be performed safely and ethically. Then the intervention needs to be approved by a bunch of bureaucracies like the FDA. And then medical d...
This is a much more nuanced take! At the beginning of Chapter 6, Jan proposes restricting our attention to agents which are limit computable.
Our agents are useless if they cannot be approximated in practice, i.e., by a regular Turing machine. Therefore we posit that any ideal for a ‘perfect agent’ needs to be limit computable.
This seems like a very reasonable restriction! Any implementation needs to be computable, but it makes sense to look for theoretic ideals which can be approximated.
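(For reference, my paraphrase of the standard definition: a function is limit computable when a regular Turing machine can produce a sequence of guesses that converges to the right answer, even if it never knows when it has converged.)

$$f \text{ is limit computable} \iff \exists\, \text{computable } \varphi \text{ such that } \forall x:\ \lim_{k \to \infty} \varphi(x, k) = f(x)$$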
Yes! I'm a fan of Yudkowsky's view that the sensation of free will is the sensation of "couldness" among multiple actions. When it feels like I could do one thing or another, it feels like I have free will. When it feels like I could have chosen differently, it feels like I chose freely.
I suspect that an important ingredient of the One True Decision Theory is being shaped in such a way that other agents, modelling how you'll respond to different policies they might implement, find it in their interest to implement policies which treat you fairly.
Got it, I misunderstood the semantics of what was supposed to capture. I thought the elements needed to be mutual best-responses. Thank you for the clarification, I've updated my implementation accordingly!
Edit: Cleo Nardo has confirmed that they intended to mean the cartesian product of sets, the ordinary thing for that symbol to mean in that context. I misunderstood the semantics of what was intended to represent. I've updated my implementation to use the intended cartesian product when calculating the best response function, the rest of this comment is based on my initial (wrong) interpretation of .
...I write the original expression, and your expression rewritten using the OP's notation:
Original:
Edit: Cleo Nardo has confirmed that they intended to mean the cartesian product of sets, the ordinary thing for that symbol to mean in that context. I misunderstood the semantics of what was intended to represent. I've updated my implementation to use the intended cartesian product when calculating the best response function, the rest of this comment is my initial (wrong) interpretation of .
I needed to go back to one of the papers cited in Part 1 to understand what that was doing in that expressio...
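(For what it's worth, here's roughly the shape of the corrected best-response calculation in my implementation, iterating over the cartesian product of the strategy sets. The strategy sets and payoffs below are made up for illustration, and the names aren't the OP's notation.)

```python
# Sketch: best-response sets for a two-player game, evaluated across the
# cartesian product of the players' strategy sets (no requirement that
# the strategies be mutual best responses).
from itertools import product

strategies = {            # made-up finite strategy sets
    "row": ["Straight", "Swerve"],
    "col": ["Straight", "Swerve"],
}
payoff = {                # made-up Chicken-style payoffs: (row, col)
    ("Straight", "Straight"): (0, 0),
    ("Straight", "Swerve"): (2, 1),
    ("Swerve", "Straight"): (1, 2),
    ("Swerve", "Swerve"): (1, 1),
}

def best_responses(player_index, others_strategy):
    """Strategies for one player that maximize their payoff, holding the
    other player's strategy fixed."""
    def u(own):
        prof = (own, others_strategy) if player_index == 0 else (others_strategy, own)
        return payoff[prof][player_index]
    own_set = strategies["row" if player_index == 0 else "col"]
    best = max(u(s) for s in own_set)
    return {s for s in own_set if u(s) == best}

# Iterate over the cartesian product of everyone's strategy sets:
for row_s, col_s in product(strategies["row"], strategies["col"]):
    print((row_s, col_s),
          best_responses(0, col_s),   # row's best responses to col_s
          best_responses(1, row_s))   # col's best responses to row_s
```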
I can answer this now!
Expected Utility, Geometric Utility, and Other Equivalent Representations
It turns out there's a large family of expectations we can use to build utility functions, including the arithmetic expectation E, the geometric expectation G, and the harmonic expectation H, and they're all equivalent models of VNM rationality! And we need something beyond that family, like Scott's G[E[U]], to formalize geometric rationality.
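(A quick numerical sketch of the equivalence: the geometric and harmonic expectations are just the arithmetic expectation taken through a monotone transform of utility, so each one on its own is still expected utility maximization in disguise. The numbers below are illustrative.)

```python
# Arithmetic, geometric, and harmonic expectations of a utility lottery.
# G and H are arithmetic expectations taken through a monotone transform
# (log and negative reciprocal), so an agent maximizing any one of them is
# an expected-utility maximizer with a transformed utility function.
# Something like Scott's G[E[U]] genuinely mixes the two levels instead.
import math

outcomes = [10.0, 40.0]   # illustrative utilities
probs    = [0.5, 0.5]

E = sum(p * u for p, u in zip(probs, outcomes))
G = math.exp(sum(p * math.log(u) for p, u in zip(probs, outcomes)))
H = 1 / sum(p / u for p, u in zip(probs, outcomes))

print(E, G, H)  # 25.0, 20.0, 16.0 -- and G == exp(E[log U]) by construction
```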
Thank you for linking to these different families of means! The quasi-arithmetic mean turned out ...