How do you formalize the definition of a decision-theoretically fair problem, even when abstracting away the definition of an agent as well as embedded agency?
I've failed to find anything in our literature.
It's simple to define a fair environment, given those abstractions: a function E from an array of actions to an array of payoffs, with no reference to any other details of the non-embedded agents that took those actions and received those payoffs.
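To make that concrete, here's a minimal sketch (my own illustration, with an invented payoff table) of such an environment: the payoffs depend only on the submitted actions, not on anything else about the agents.

```python
from typing import Sequence

# A fair environment for two agents: a standard Prisoner's Dilemma payoff table.
# The function sees only the array of actions and returns only the array of payoffs;
# it has no access to the agents' source code, beliefs, or identities.
def prisoners_dilemma(actions: Sequence[str]) -> list[float]:
    a1, a2 = actions
    payoff_table = {
        ("C", "C"): [2.0, 2.0],
        ("C", "D"): [0.0, 3.0],
        ("D", "C"): [3.0, 0.0],
        ("D", "D"): [1.0, 1.0],
    }
    return payoff_table[(a1, a2)]
```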
However, fair problems are more than just fair environments: we want a definition of a fair problem (an...
Over the past three years, as my timelines have shortened and my hopes for alignment or coordination have dwindled, I've switched over to consumption. I just make sure to keep a long runway, so that I could pivot if AGI progress is somehow halted or sputters out on its own or something.
The fault does not lie with Jacob, but wow, this post aged like an open bag of bread.
It… was the fault of Jacob?
The post was misleading when it was written, and I think was called out as such by many people at the time. I think we should have some sympathy with Jacob being naive and being tricked, but surely a substantial amount of blame accrues to him for going to bat for OpenAI when that turned out to be unjustified in the end (and at least somewhat predictably so).
I suggest a fourth default question for these reading groups:
How did this post age?
Soon the two are lost in a maze of words defined in other words, the problem that Stevan Harnad once described as trying to learn Chinese from a Chinese/Chinese dictionary.
Of course, it turned out that LLMs do this just fine, thank you.
intensional terms
Should probably link to Extensions and Intensions; not everyone reads these posts in order.
Mati described himself as a TPM since September 2023 (after being PM support since April 2022), and Andrei described himself as a Research Engineer from April 2023 to March 2024. Why do you believe either was not a FTE at the time?
And while failure to sign isn't proof of lack of desire to sign, the two are heavily correlated—otherwise it would be incredibly unlikely for the small Superalignment team to have so many members who signed late or not at all.
With the sudden simultaneous exits of Mira Murati, Barret Zoph, and Bob McGrew, I thought I'd update my tally of the departures from OpenAI, collated with how quickly the ex-employee had signed the loyalty letter to Sam Altman last November.
The letter was leaked at 505 signatures, 667 signatures, and finally 702 signatures; in the end, it was reported that 737 of 770 employees signed. Since then, I've been able to verify 56 departures of people who were full-time employees (as far as I can tell, contractors were not allowed to sign, but all FTEs were...
There are a few people in this list who I think are being counted incorrectly as FTEs (Mati and Andrei, for example).
I would also be careful about making inferences based on timing of supposed signature: I have heard that the signature Google Doc had crashed and so the process for adding names was slow and cumbersome. That is, the time at which someone’s name was added may have been significantly after they expressed desire to sign.
EDIT: On reflection, I made this a full Shortform post.
With the sudden simultaneous exits of Mira Murati, Barret Zoph, and Bob McGrew, I thought I'd do a more thorough scan of the departures. I still think I'm missing some, so these are lower bounds (modulo any mistakes I've made).
Headline numbers:
CDT agents respond well to threats
Might want to rephrase this as "CDT agents give in to threats"
This is weirdly meta.
If families are worried about the cost of groceries, they should welcome this price discrimination. The AI will realize you are worried about costs. It will offer you prime discounts to win your business. It will know you are willing to switch brands to get discounts, and use this to balance inventory.
Then it will go out and charge other people more, because they can afford to pay. Indeed, this is highly progressive policy. The wealthier you are, the more you will pay for groceries. What’s not to love?
A problem is that this is not only a tax on indifferenc...
Re: experience machine, Past Me would have refused it and Present Me would take it. The difference is due to a major (and seemingly irreversible) deterioration in my wellbeing several years ago, but not only because that makes the real world less enjoyable.
Agency is another big reason to refuse the experience machine; if I think I can make a difference in the base-level world, I feel a moral responsibility towards it. But I experience significantly less agency now (and project less agency in the future), so that factor is diminished for me.
The main f...
I'm really impressed with your grace in writing this comment (as well as the one you wrote on the market itself), and it makes me feel better about Manifold's public epistemics.
Yes, and I gained some easy mana from such markets; but the market that got the most attention by far was the intrinsically flawed conditional market.
Real-money markets do have stronger incentives for sharps to scour for arbitrage, so the 1/1/26 market would have been more likely to be noticed before months had gone by.
However (depending on the fee structure for resolving N/A markets), real-money markets have even stronger incentives for sharps to stay away entirely from spurious conditional markets, since they'd be throwing away cash and not just Internet points. Never ever ever cite out-of-the-money conditional markets.
Broke: Prediction markets are an aggregation of opinions, weighted towards informed opinions by smart people, and are therefore a trustworthy forecasting tool on any question.
Woke: Prediction markets are MMOs set in a fantasy world where, if someone is Wrong On The Internet, you can take their lunch money.
Can you share any strong evidence that you're an unusually trustworthy person in regard to confidential conversations? People would in fact be risking a lot by talking to you.
(This is sincere btw; I think this service should absolutely exist, but the best version of it is probably done by someone with a longstanding public reputation of circumspection.)
Good question.
I can't really make this legible, no.
On the whistleblowing part, you should be able to get good advice without trusting me. It's publicly known that Kelsey Piper plus iirc one or two of the ex-OpenAI folks are happy to talk to potential whistleblowers. I should figure out exactly who that is and put their (publicly verifiable) contact info in this post (and, note to self, clarify whether or in-what-domains I endorse their advice vs merely want to make salient that it's available). Thanks.
[Oh, also ideally maybe I'd have a real system for anon...
I trust Zach a lot and would be shocked if he maliciously or carelessly leaked info. I don’t believe he’s very experienced at handling confidential information, but I expect him to seek out advice as necessary and overall do a good job here. Happy to say more to anyone interested.
Good question! I picked it up from a friend at a LW meetup a decade ago, so it didn't come with all the extra baggage that vipassana meditation seems to usually carry. So this is just going to be the echo of it that works for me.
Step 1 is to stare at your index finger (a very sensitive part of your body) and gently, patiently try to notice that it's still producing a background level of sensory stimulus even when it's not touching anything. That attention to the background signal, focused on a small patch of your body, is what the body scan is based on.
Ste...
I have to further compliment my past self: this section aged extremely well, prefiguring the Shoggoth-with-a-smiley-face analogies several years in advance.
...GPT-3 is trained simply to predict continuations of text. So what would it actually optimize for, if it had a pretty good model of the world including itself and the ability to make plans in that world?
One might hope that because it's learning to imitate humans in an unsupervised way, it would end up fairly human, or at least act in that way. I very much doubt this, for the following reason:
- Two hum
The spun-off agent foundations team seems to have less reason than most AI safety orgs to be in the Bay Area, so moving to NZ might be worth considering for them.
Note on current methodology:
Counterpoint: other labs might become more paranoid that SSI is ahead of them. I think your point is probably more correct than the counterpoint, but it's worth mentioning.
Elon diversifies in the sense of "personally micromanaging more companies", not in the sense of "backing companies he can't micromanage".
By my assessment, the employees who failed to sign the final leaked version of the Altman loyalty letter have now been literally decimated.
I'm trying to track the relative attrition for a Manifold market: of the 265 OpenAI employees who hadn't yet signed the loyalty letter by the time it was first leaked, what percent will still be at OpenAI on the one-year anniversary?
I'm combining that first leaked copy with 505 signatures, the final leaked copy with 702 signatures, the oft-repeated total headcount of 770, and this spreadsheet tracking OpenAI departures ...
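Spelling out the arithmetic behind "literally decimated", taking the leaked signature counts and the 770 headcount at face value:

$$770 - 505 = 265 \;\text{(hadn't signed at the first leak, the market's denominator)}, \qquad 770 - 702 = 68 \;\text{(never signed the final leaked version)},$$

so literal decimation of the latter group means at least $\lceil 68/10 \rceil = 7$ of those 68 have since departed.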
"decimate" is one of those relatively rare words where the literal meaning is much less scary than the figurative meaning.
I'm not even angry, just disappointed.
I am angry and disappointed.
The diagnosis is roughly correct (I would say "most suffering is caused by an internal response of fleeing from pain but not escaping it"), but IMO the standard proffered remedy (Buddhist-style detachment from wanting) goes too far and promises too much.
Re: the diagnosis, three illustrative ways the relationship between pain, awareness, and suffering has manifested for me:
Go all-in on lobbying the US and other governments to fully prohibit the training of frontier models beyond a certain level, in a way that OpenAI can't route around (so probably block Altman's foreign chip factory initiative, for instance).
The chess example is meant to make specific points about RL*F concealing a capability that remains (or is even amplified); I'm not trying to claim that the "put up a good fight but lose" criterion is analogous to current RL*F criteria. (Though it does rhyme qualitatively with "be helpful and harmless".)
I agree that "helpful-only" RL*F would result in a model that scores higher on capabilities evals than the base model, possibly much higher. I'm frankly a bit worried about even training that model.
Thank you! I'd forgotten about that.
Certainly RLHF can get the model to stop talking about a capability, but usually this is extremely obvious because the model gives you an explicit refusal?
How certain are you that this is always true (rather than "we've usually noticed this even though we haven't explicitly been checking for it in general"), and that it will continue to be so as models become stronger?
It seems to me like additionally running evals on base models is a highly reasonable precaution.
Oh wait, I misinterpreted you as using "much worse" to mean "much scarier", when instead you mean "much less capable".
I'd be glad if it were the case that RL*F doesn't hide any meaningful capabilities existing in the base model, but I'm not sure it is the case, and I'd sure like someone to check! It sure seems like RL*F is likely in some cases to get the model to stop explicitly talking about a capability it has (unless it is jailbroken on that subject), rather than to remove the capability.
(Imagine RL*Fing a base model to stop explicitly talking about arithmetic; are we sure it would un-learn the rules?)
That's exactly the point: if a model has bad capabilities and deceptive alignment, then testing the post-tuned model will return a false negative for those capabilities in deployment. Until we have the kind of interpretability tools that we could deeply trust to catch deceptive alignment, we should count any capability found in the base model as if it were present in the tuned model.
I'd like to see evals like DeepMind's run against the strongest pre-RL*F base models, since that actually tells you about capability.
The WWII generation is negligible in 2024. The actual effect is partly the inverted demographic pyramid (older population means more women than men even under normal circumstances), and partly that even young Russian men die horrifically often:
At 2005 mortality rates, for example, only 7% of UK men but 37% of Russian men would die before the age of 55 years
And for that, a major culprit is alcohol (leading to accidents and violence, but also literally drinking oneself to death).
Among the men who don't self-destruct, I imagine a large fraction have already b...
That first statistic, that it swiped right 353 times and got to talk to 160 women, is completely insane. I mean, that’s almost a 50% match rate, whereas estimates in general are 4% to 14%.
Given Russia's fucked-up gender ratio (2.5 single women for every single man), I don't think it's that unreasonable!
Generally, the achievement of "guy finds a woman willing to accept a proposal" impresses me far less in Russia than it would in the USA. Let's see if this replicates in a competitive dating pool.
In high-leverage situations, you should arguably either be playing tic-tac-toe (simple, legible, predictable responses) or playing 4-D chess to win. If you're making really nonstandard and surprising moves (especially in PR), you have no excuse for winding up with a worse outcome than you would have if you'd acted in bog-standard normal ways.
(This doesn't mean suspending your ethics! Those are part of winning! But if you can't figure out how to win 4-D chess ethically, then you need to play an ethical tic-tac-toe strategy instead.)
Ah, I'm talking about introspection in a therapy context and not about exhorting others.
For example:
Internal coherence: "I forgive myself for doing that stupid thing".
Load-bearing but opaque: "It makes sense to forgive myself, and I want to, but for some reason I just can't".
Load-bearing and clear resistance: "I want other people to forgive themselves for things like that, but when I think about forgiving myself, I get a big NOPE NOPE NOPE".
P.S. Maybe forgiving oneself isn't actually the right thing to do at the moment! But it will also be easier to learn that in the third case than in the second.
"I endorse endorsing X" is a sign of a really promising topic for therapy (or your preferred modality of psychological growth).
If I can simply say "X", then I'm internally coherent enough on that point.
If I can only say "I endorse X", then not-X is psychologically load-bearing for me, but often in a way that is opaque to my conscious reasoning, so working on that conflict can be slippery.
But if I can only say "I endorse endorsing X", then not only is not-X load-bearing for me, but there's a clear feeling of resistance to X that I can consciously home in on, connect with, and learn about.
Re: Canadian vs American health care, the reasonable policy would be:
"Sorry, publicly funded health care won't cover this, because the expected DALYs are too expensive. We do allow private clinics to sell you the procedure, though unless you're super wealthy I think the odds of success aren't worth the cost to your family."
(I also approve of euthanasia being offered as long as it's not a hard sell.)
I think MIRI is correct to call it as they see it, both on general principles and because if they turn out to be wrong about genuine alignment progress being very hard, people (at large, but also including us) should update against MIRI's viewpoints on other topics, and in favor of the viewpoints of whichever AI safety orgs called it more correctly.
Yep, before I saw orthonormal's response I had a draft-reply written that says almost literally the same thing:
we just call 'em like we see 'em
[...]
insofar as we make bad predictions, we should get penalized for it. and insofar as we think alignment difficulty is the crux for 'why we need to shut it all down', we'd rather directly argue against illusory alignment progress (and directly acknowledge real major alignment progress as a real reason to be less confident of shutdown as a strategy) rather than redirect to something less cruxy
I'll also add: Nate (u...
I mean, I don't really care how much e.g. Facebook AI thinks they're racing right now. They're not in the game at this point.
The race dynamics are not just about who's leading. FB is 1-2 years behind (looking at LLM metrics), and it doesn't seem like they're getting further behind OpenAI/Anthropic with each generation, so I expect that the lag at the end will be at most a few years.
That means that if Facebook is unconstrained, the leading labs have only that much time to slow down for safety (or prepare a pivotal act) as they approach AGI before Facebook g...
I'm surprised that nobody has yet brought up the development that the board offered Dario Amodei the position as part of a merger with Anthropic (and Dario said no!).
(There's no additional important content in the original article by The Information, so I linked the Reuters paywall-free version.)
Crucially, this doesn't tell us in what order the board made this offer to Dario and the other known figures (GitHub CEO Nat Friedman and Scale AI CEO Alex Wang) before getting Emmett Shear, but it's plausible that merging with Anthropic was Plan A all along. Moreover, I s...
Has this one been confirmed yet? (Or is there more evidence than this report that something like this happened?)
No, I don't think the board's motives were power politics; I'm saying that they failed to account for the kind of political power moves that Sam would make in response.
In addition to this, Microsoft will exert greater pressure to extract mundane commercial utility from models, compared to pushing forward the frontier. Not sure how much that compensates for the second round of evaporative cooling of the safety-minded.
If they thought this would be the outcome of firing Sam, they would not have done so.
The risk they took was calculated, but man, are they bad at politics.
I keep being confused by them not revealing their reasons. Whatever they are, there's no way that saying them out loud wouldn't give some ammo to those defending them, unless somehow between Friday and now they swung from "omg this is so serious we need to fire Altman NOW" to "oops, looks like it was a nothingburger, we'll look stupid if we say it out loud". Do they think it's a literal infohazard or something? Is the accusation so serious that stating it out loud would involve the police?
My current candidate definitions, with some significant issues in the footnotes:
A fair environment is a probabilistic function $F(x_1, \ldots, x_N) = [X_1, \ldots, X_N]$ from an array of actions to an array of payoffs.

An agent $A$ is a random variable

$$A(F, A_1, \ldots, A_{i-1}, A_i = A, A_{i+1}, \ldots, A_N)$$

which takes in a fair environment $F$[1] and a list of agents (including itself), and outputs a mixed strategy over its available actions in $F$.[2]
A fair agent is one whose mixed strategy is a function of subjective probabilities[3] that...
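As a rough typed sketch of the candidate definitions above (my own Python rendering with invented names; it ignores the probabilistic/random-variable structure and passes the agent its own index rather than marking $A_i = A$ in place):

```python
from typing import Callable, Mapping, Sequence

Action = str
Payoff = float

# A fair environment: a map from an array of actions to an array of payoffs,
# with no reference to any other details of the agents. (Deterministic here
# for simplicity; the definition above allows a probabilistic function.)
FairEnvironment = Callable[[Sequence[Action]], Sequence[Payoff]]

# A mixed strategy over the actions available in the environment.
MixedStrategy = Mapping[Action, float]

# An agent takes a fair environment, the full list of agents (including itself),
# and its own index i, and outputs a mixed strategy over its available actions.
Agent = Callable[[FairEnvironment, Sequence["Agent"], int], MixedStrategy]
```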