How do you formalize the definition of a decision-theoretically fair problem, even when abstracting away the definition of an agent as well as embedded agency?
I've failed to find anything in our literature.
It's simple to define a fair environment, given those abstractions: a function E from an array of actions to an array of payoffs, with no reference to any other details of the non-embedded agents that took those actions and received those payoffs.
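For concreteness, the standard Prisoner's Dilemma fits this template (the payoff numbers below are just the conventional illustrative values):
$$E(C, C) = [3, 3], \quad E(C, D) = [0, 5], \quad E(D, C) = [5, 0], \quad E(D, D) = [1, 1],$$
and the payoffs make no reference to which agents chose $C$ or $D$, or how they chose.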
However, fair problems are more than just fair environments: we want a definition of a fair problem (and fair agents) under which, among other things:
Modal combat doesn't need to worry about this, because all the agents in it are fair-by-construction.
Yeah, I know, it's about a decade late to be asking this question.
Over the past three years, as my timelines have shortened and my hopes for alignment or coordination have dwindled, I've switched over to consumption. I just make sure to keep a long runway, so that I could pivot if AGI progress is somehow halted or sputters out on its own or something.
The fault does not lie with Jacob, but wow, this post aged like an open bag of bread.
I suggest a fourth default question for these reading groups:
How did this post age?
Soon the two are lost in a maze of words defined in other words, the problem that Stevan Harnad once described as trying to learn Chinese from a Chinese/Chinese dictionary.
Of course, it turned out that LLMs do this just fine, thank you.
intensional terms
Should probably link to Extensions and Intensions; not everyone reads these posts in order.
Mati described himself as a TPM starting in September 2023 (after being PM support since April 2022), and Andrei described himself as a Research Engineer from April 2023 to March 2024. Why do you believe either was not an FTE at the time?
And while failure to sign isn't proof of lack of desire to sign, the two are heavily correlated—otherwise it would be incredibly unlikely for the small Superalignment team to have so many members who signed late or not at all.
With the sudden simultaneous exits of Mira Murati, Barret Zoph, and Bob McGrew, I thought I'd update my tally of the departures from OpenAI, collated with how quickly the ex-employee had signed the loyalty letter to Sam Altman last November.
The letter was leaked at 505 signatures, 667 signatures, and finally 702 signatures; in the end, it was reported that 737 of 770 employees signed. Since then, I've been able to verify 56 departures of people who were full-time employees (as far as I can tell, contractors were not allowed to sign, but all FTEs were).
I still think I'm missing some, so these are lower bounds (modulo any mistakes I've made).
Headline numbers:
Reportedly, 737 out of the 770 signed in the end, and many of the Superalignment team chose not to sign at all.
Below are my current tallies of some notable subsets. Please comment with any corrections!
People from the Superalignment team who never signed as of the 702 leak (including some policy/governance people who seem to have been closely connected) and are now gone:
People from the Superalignment team (and close collaborators) who did sign before the final leak but are now gone:
Others who didn't sign as of the 702 leak (some of whom may have just been AFK for the wrong weekend, though I doubt that was true of Karpathy) and are now gone:
Notable other ex-employees:
EDIT: On reflection, I made this a full Shortform post.
My current candidate definitions, with some significant issues in the footnotes:
A fair environment is a probabilistic function $F(x_1, \ldots, x_N) = [X_1, \ldots, X_N]$ from an array of actions to an array of payoffs.
An agent $A$ is a random variable
$$A(F, A_1, \ldots, A_{i-1}, A_i = A, A_{i+1}, \ldots, A_N)$$
which takes in a fair environment $F$[1] and a list of agents (including itself), and outputs a mixed strategy over its available actions in $F$.[2]
A fair agent is one whose mixed strategy is a function of subjective probabilities[3] that it assigns to [the actions of some finite collection of agents in fair environments, where any agents not appearing in the original problem must themselves be fair].
Formally, if $A$ is a fair agent with a subjective probability estimator $P$, then $A$'s mixed strategy in a fair environment $F$,
$$A(F, A_1, \ldots, A_{i-1}, A_i = A, A_{i+1}, \ldots, A_N),$$
should depend only on a finite collection of $A$'s subjective probabilities about outcomes,
$$\{P(F_k(A_1, \ldots, A_N, B_1, \ldots, B_M) = [X_1, \ldots, X_{N+M}])\}_{k=1}^{K},$$
for a set of fair environments $F_1, \ldots, F_K$ and, if needed, an additional set of fair[4] agents[5] $B_1, \ldots, B_M$ (note that not all agents need to appear in all environments).
A fair problem is a fair environment with one designated player, where all other agents are fair agents.
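Here is a rough sketch of these definitions in code, just to make the dependency structure concrete: the environment sees only actions, and the fair agent's strategy sees only its own probability estimates about outcomes in fair environments. All the names (prisoners_dilemma, mirror_bot, sample) and the payoff values are illustrative inventions, not part of the proposed definitions.

```python
# A rough sketch, not a definitive implementation, of the candidate definitions.
import random
from typing import Dict, Sequence, Tuple

Action = str
MixedStrategy = Dict[Action, float]  # action -> probability

# A fair environment: payoffs are a function of the submitted actions only.
def prisoners_dilemma(actions: Sequence[Action]) -> Tuple[float, float]:
    table = {('C', 'C'): (3.0, 3.0), ('C', 'D'): (0.0, 5.0),
             ('D', 'C'): (5.0, 0.0), ('D', 'D'): (1.0, 1.0)}
    return table[tuple(actions)]

# A fair agent (in the sense above): its mixed strategy is a function only of
# a subjective probability it assigns to an outcome in a fair environment --
# here, a single estimate of "the other player cooperates in this game".
def mirror_bot(p_opponent_cooperates: float) -> MixedStrategy:
    return {'C': p_opponent_cooperates, 'D': 1.0 - p_opponent_cooperates}

def sample(strategy: MixedStrategy) -> Action:
    """Draw one action from a mixed strategy."""
    r, acc = random.random(), 0.0
    for action, prob in strategy.items():
        acc += prob
        if r <= acc:
            return action
    return action  # guard against floating-point rounding

# A fair problem: a fair environment with one designated player slot, where
# every other participant is a fair agent like mirror_bot.
strategy = mirror_bot(0.9)  # suppose the agent is 90% sure its opponent cooperates
my_action = sample(strategy)
print(prisoners_dilemma([my_action, 'C']))
```

The point of the sketch is only the information flow: nothing about the agents' source code or identity reaches the environment, and nothing beyond subjective probabilities about fair-environment outcomes reaches the fair agent's strategy.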
I might need to require every $F$ to have a default action $d_F$, so that I don't need to worry about axiom-of-choice issues when defining an agent over the space of all fair environments.
I specified a probabilistic environment and mixed strategies because I think there should be a unique fixed point for agents, such that this is well-defined for any fair environment $F$. (By analogy to reflective oracles.) But I might be wrong, or I might need further restrictions on $F$.
Grossly underspecified. What kinds of properties are required for subjective probabilities here? You can obviously cheat by writing BlueEyedBot into your probability estimator.
This is an infinite recursion, of course. It works if we require each $B_m$ to have a strictly lower complexity in some sense than $A$ (e.g. the rank of an agent is the largest number $K$ of environments it can reason about when making any decision, and each $B_m$ needs to be lower-rank than $A$), but I worry that's too strong of a restriction and would exclude some well-definable and interesting agents.
Does the fairness requirement on the $B_m$ suffice to avert the MetaBlueEyedBot problem in general? I'm unsure.