I don’t know much background here so I may be off base, but it’s possible that the motivation of the trust isn’t to bind leadership’s hands to avoid profit-motivated decision making, but rather to free their hands to do so, ensuring that shareholders have no claim against them for such actions, as traditional governance structures might have provided.
(Unless "employees who signed a standard exit agreement" is doing a lot of work — maybe a substantial number of employees technically signed nonstandard agreements.)
Yeah, what about employees who refused to sign? Have we gotten any clarification on their situation?
Thank you, I appreciated this post quite a bit. There's a paucity of historical information about this conflict which isn't colored by partisan framing, and you seem to be coming from a place of skeptical, honest inquiry. I'd look forward to reading what you have to say about 1967.
Thanks for doing this! I think a lot of people would be very interested in the debate transcripts if you posted them on GitHub or something.
Okay. I do agree that one way to frame Matthew’s main point is that MIRI thought it would be hard to specify the human value function, and an LM that understands human values and reliably tells us the truth about that understanding is such a specification, and hence falsifies that belief.
To your second question: MIRI thought we couldn’t specify the value function to do the bounded task of filling the cauldron, because any value function we could naively think of writing, when given to an AGI (which was assumed to be a utility argmaxer), leads to all sorts ...
I think this reply is mostly talking past my comment.
I know that MIRI wasn't claiming we didn't know how to safely make deep learning systems, GOFAI systems, or what-have-you fill buckets of water, but my comment wasn't about those systems. I also know that MIRI wasn't issuing a water-bucket-filling challenge to capabilities researchers.
My comment was specifically about directing an AGI (which I think GPT-4 roughly is), not deep learning systems or other software generally. I *do* think MIRI was claiming we didn't know how to make AGI systems safely do mun...
Okay, that clears things up a bit, thanks. :) (And sorry for delayed reply. Was stuck in family functions for a couple days.)
This framing feels a bit wrong/confusing for several reasons.
I guess by “lie to us” you mean act nice on the training distribution, waiting for a chance to take over the world while off distribution. I just … don’t believe GPT-4 is doing this; it seems highly implausible to me, in large part because I don’t think GPT-4 is clever enough that it could keep up the veneer until it’s ready to strike if that were the case.
The term “l
...I think the old school MIRI cauldron-filling problem pertained to pretty mundane, everyday tasks. No one said at the time that they didn’t really mean that it would be hard to get an AGI to do those things, that it was just an allegory for other stuff like the strawberry problem. They really seemed to believe, and said over and over again, that we didn’t know how to direct a general-purpose AI to do bounded, simple, everyday tasks without it wanting to take over the world. So this should be a big update to people who held that view, even if there are still
Hmm, you say “your claim, if I understand correctly, is that MIRI thought AI wouldn't understand human values”. I’m disagreeing with this. I think Matthew isn’t claiming that MIRI thought AI wouldn’t understand human values.
I think you’re misunderstanding the paragraph you’re quoting. I read Matthew, in that paragraph as acknowledging the difference between the two problems, and saying that MIRI thought value specification (not value understanding) was much harder than it’s looking to actually be.
I know this is from a bit ago now so maybe he’s changed his tune since, but I really wish he and others would stop repeating the falsehood that all international treaties are ultimately backed by force on the signatory countries. There are countless trade, climate reduction, and nuclear disarmament agreements which are not backed by force. I’d venture to say that the large majority of agreements are backed merely by the promise of continued good relations and tit-for-tat mutual benefit or defection.
A key distinction is between linearity in the weights vs. linearity in the input data.
For example, the function is linear in the arguments and but nonlinear in the arguments and , since and are nonlinear.
Similarly, we have evidence that wide neural networks are (almost) linear in the parameters , despite being nonlinear in the input data (due e.g. to nonlinear activation functions such as ReLU). So nonlinear activati...
The more I stare at this observation, the more it feels potentially more profound than I intended when writing it.
Consider the “cauldron-filling” task. Does anyone doubt that, with at most a few very incremental technological steps from today, one could train a multimodal, embodied large language model (“RobotGPT”), to which you could say, “please fill up the cauldron”, and it would just do it, using a reasonable amount of common sense in the process — not flooding the room, not killing anyone or going to any other extreme lengths, and stopping if asked? I...
Though interestingly, aligning a langchainesque AI to the user’s intent seems to be (with some caveats) roughly as hard as stating that intent in plain English.
The more I stare at this observation, the more it feels potentially more profound than I intended when writing it.
Consider the “cauldron-filling” task. Does anyone doubt that, with at most a few very incremental technological steps from today, one could train a multimodal, embodied large language model (“RobotGPT”), to which you could say, “please fill up the cauldron”, and it would just do it, using a reasonable amount of common sense in the process — not flooding the room, not killing anyone or going to any other extreme lengths, and stopping if asked? I...
My guess is “today” was supposed to refer to some date when they were doing the investigation prior to the release of GPT-4, not the date the article was published.
Minerva (from June 2022) used 3e24; there's no way "several orders of magnitude larger" was right when the article was being written. I think the author just made a mistake.
Got a source for this estimate?
Epoch says 2.2e25. Skimming that page, it seems like a pretty unreliable estimate. They say their 90% confidence interval is about 1e25 to 5e25.
Nitpick: the paper from Eloundou et al is called “GPTs are GPTs”, not “GPTs and GPTs”.
Probably I should get around to reading CAIS, given that it made these points well before I did.
I found it's a pretty quick read, because the hierarchical/summary/bullet point layout allows one to skip a lot of the bits that are obvious or don't require further elaboration (which is how he endorsed reading it in this lecture).
We don’t know with confidence how hard alignment is, and whether something roughly like the current trajectory (even if reckless) leads to certain death if it reaches superintelligence.
There is a wide range of opinion on this subject from smart, well-informed people who have devoted themselves to studying it. We have a lot of blog posts and a small number of technical papers, all usually making important (and sometimes implicit and unexamined) theoretical assumptions which we don’t know are true, plus some empirical analysis of much weaker systems.
We do not have an established, well-tested scientific theory like we do with pathogens such as smallpox. We cannot say with confidence what is going to happen.
I agree that if you're absolutely certain AGI means the death of everything, then nuclear devastation is preferable.
I think the absolute certainty that AGI does mean the death of everything is extremely far from called for, and is itself a bit scandalous.
(As to whether Eliezer's policy proposal is likely to lead to nuclear devastation, my bottom line view is it's too vague to have an opinion. But I think he should have consulted with actual AI policy experts and developed a detailed proposal with them, which he could then point to, before writing up an emotional appeal, with vague references to air strikes and nuclear conflict, for millions of lay people to read in TIME Magazine.)
One's credibility would be less of course, but Eliezer is not the one who would be implementing the hypothetical policy (that would be various governments), so it's not his credibility that's relevant here.
I don't have much sense he's holding back his real views on the matter.
Did you intend to say risk off, or risk of?
If the former, then I don't understand your comment and maybe a rewording would help me.
If the latter, then I'll just reiterate that I'm referring to Eliezer's explicitly stated willingness to trade off the actuality of (not just some risk of) nuclear devastation to prevent the creation of AGI (though again, to be clear, I am not claiming he advocated a nuclear first strike). The only potential uncertainty in that tradeoff is the consequences of AGI (though I think Eliezer's been clear that he thinks it means certain doom), and I suppose what follows after nuclear devastation as well.
Right, but of course the absolute, certain implication from “AGI is created” to “all biological life on Earth is eaten by nanotechnology made by an unaligned AI that has worthless goals” requires some amount of justification, and that justification for this level of certainty is completely missing.
In general such confidently made predictions about the technological future have a poor historical track record, and there are multiple holes in the Eliezer/MIRI story, and there is no formal, canonical write up of why they’re so confident in their apparently sec...
There’s a big difference between pre-committing to X so you have a credible threat against Y, vs. just outright preferring X over Y. In the quoted comment, Eliezer seems to have been doing the latter.
I don’t agree billions dead is the only realistic outcome of his proposal. Plausibly it could just result in actually stopping large training runs. But I think he’s too willing to risk billions dead to achieve that.
In response to the question,
“[Y]ou’ve gestured at nuclear risk. … How many people are allowed to die to prevent AGI?”,
he wrote:
“There should be enough survivors on Earth in close contact to form a viable reproductive population, with room to spare, and they should have a sustainable food supply. So long as that's true, there's still a chance of reaching the stars someday.”
He later deleted that tweet because he worried it would be interpreted by some as advocating a nuclear first strike.
I’ve seen no evidence that he is advocating a nuclear first strike, but it does seem to me to be a fair reading of that tweet that he would trade nuclear devastation for preventing AGI.
If there were a game-theoretically reliable way to get everyone to pause all together, I'd support it.
In §3.1–3.3, you look at the main known ways that altruism between humans has evolved — direct and indirect reciprocity, as well as kin and group selection[1] — and ask whether we expect such altruism from AI towards humans to be similarly adaptive.
However, as observed in R. Joyce (2007). The Evolution of Morality (p. 5),
Evolutionary psychology does not claim that observable human behavior is adaptive, but rather that it is produced by psychological mechanisms that are adaptations. The output of an adaptation need not be adaptive.
This is a subtle dist...
A similar point is (briefly) made in K. E. Drexler (2019). Reframing Superintelligence: Comprehensive AI Services as General Intelligence, §18 “Reinforcement learning systems are not equivalent to reward-seeking agents”:
...Reward-seeking reinforcement-learning agents can in some instances serve as models of utility-maximizing, self-modifying agents, but in current practice, RL systems are typically distinct from the agents they produce … In multi-task RL systems, for example, RL “rewards” serve not as sources of value to agents, but as signals that guide trai
So the the way that you are like taking what is probably basically the same architecture in GPT-3 and throwing 20 times as much compute at it, probably, and getting out GPT-4.
Indeed, GPT-3 is almost exactly the same architecture as GPT-2, and only a little different from GPT.
X-risks tend to be more complicated beasts than lions in bushes, in that successfully avoiding them requires a lot more than reflexive action: we’re not going to navigate them by avoiding carefully understanding them.
Thanks, I appreciate the spirit with which you've approached the conversation. It's an emotional topic for people I guess.
The negation of the claim would not be "There is definitely nothing to worry about re AI x-risk." It would be something much more mundane-sounding, like "It's not the case that if we go ahead with building AGI soon, we all die."
I debated with myself whether to present the hypothetical that way. I chose not to, because of Eliezer's recent history of extremely confident statements on the subject. I grant that the statement I quoted in isolation could be interpreted more mundanely, like the example you give here.
When the stakes are this high and the policy pr...
Would you say the same thing about the negations of that claim? If you saw e.g. various tech companies and politicians talking about how they're going to build AGI and then [something that implies that people will still be alive afterwards] would you call them out and say they need to qualify their claim with uncertainty or else they are being unreasonable?
Yes, I do in fact say the same thing to professions of absolute certainty that there is nothing to worry about re: AI x-risk.
The negation of the claim would not be "There is definitely nothing to worry about re AI x-risk." It would be something much more mundane-sounding, like "It's not the case that if we go ahead with building AGI soon, we all die."
That said, yay -- insofar as you aren't just applying a double standard here, then I'll agree with you. It would have been better if Yud added in some uncertainty disclaimers.
There simply don't exist arguments with the level of rigor needed to justify a claim such as this one without any accompanying uncertainty:
If we go ahead on this everyone will die, including children who did not choose this and did not do anything wrong.
I think this passage, meanwhile, rather misrepresents the situation to a typical reader:
...When the insider conversation is about the grief of seeing your daughter lose her first tooth, and thinking she’s not going to get a chance to grow up, I believe we are past the point of playing political chess about a s
Would you say the same thing about the negations of that claim? If you saw e.g. various tech companies and politicians talking about how they're going to build AGI and then [something that implies that people will still be alive afterwards] would you call them out and say they need to qualify their claim with uncertainty or else they are being unreasonable?
Re: the insider conversation: Yeah, I guess it depends on what you mean by 'the insider conversation' and whether you think the impression random members of the public will get from these passages brings...
A somewhat reliable source has told me that they don't have the compute infrastructure to support making a more advanced model available to users.
That might also reflect limited engineering efforts to optimize state-of-the-art models for real world usage (think of the performance gains from GPT-3.5 Turbo) as opposed to hitting benchmarks for a paper to be published.
I believe Anthropic is committed to not pushing at the state-of-the-art, so they may not be the most relevant player in discussions of race dynamics.
Yes, although the chat interface was necessary but insufficient. They also needed a capable language model behind it, which OpenAI already had, and Google still lacks months later.
I agree that those are possibilities.
On the other hand, why did news reports[1] suggest that Google was caught flat-footed by ChatGPT and re-oriented to rush Bard to market?
My sense is that Google/DeepMind's lethargy in the area of language models is due to a combination of a few factors:
If a major fraction of all resources at the top 5–10 labs were reallocated to "us[ing] this pause to jointly develop and implement a set of shared safety protocols", that seems like it would be a good thing to me.
However, the letter offers no guidance as to what fraction of resources to dedicate to this joint safety work. Thus, we can expect that DeepMind and others might each devote a couple teams to that effort, but probably not substantially halt progress at their capabilities frontier.
The only player who is effectively being asked to halt progress at its capabilities frontier is OpenAI, and that seems dangerous to me for the reasons I stated above.
Currently, OpenAI has a clear lead over its competitors.[1] This is arguably the safest arrangement as far as race dynamics go, because it gives OpenAI some breathing room in case they ever need to slow down later on for safety reasons, and also because their competitors don't necessarily have a strong reason to think they can easily sprint to catch up.
So far as I can tell, this petition would just be asking OpenAI to burn six months of that lead and let other players catch up. That might create a very dangerous race dynamic, where now you have multip...
DeepMind might be more cautious about what it releases, and/or developing systems whose power is less legible than GPT. I have no real evidence here, just vague intuitions.
Another interesting section:
Silva-Braga: Are we close to the computers coming up with their own ideas for improving themselves?
Hinton: Uhm, yes, we might be.
Silva-Braga: And then it could just go fast?
Hinton: That's an issue, right. We have to think hard about how to control that.
Silva-Braga: Yeah. Can we?
Hinton: We don't know. We haven't been there yet, but we can try.
Silva-Braga: Okay. That seems kind of concerning.
Hinton: Uhm, yes.
Of course the choice of what sort of model we fit to our data can sometimes preordain the conclusion.
Another way to interpret this is there was a very steep update made by the community in early 2022, and since then it’s been relatively flat, or perhaps trending down slowly with a lot of noise (whereas before the update it was trending up slowly).
Seems to me there's too much noise to pinpoint the break at a specific month. There are some predictions made in early 2022 with an even later date than those made in late 2021.
But one pivotal thing around that time might have been the chain-of-thought stuff which started to come to attention then (even though there was some stuff floating around Twitter earlier).
It's a terribly organized and presented proof, but I think it's basically right (although it's skipping over some algebraic details, which is common in proofs). To spell it out:
Fix any and . We then have,
.
Adding to both sides,
.
Therefore, if (by assumption in that line of the proof) and , we'd have,
,
which contradicts our assumption that .
Thanks! This is clearer. (To be pedantic, the distance should have a root around the integral, but it's clear what you mean.)
Perhaps of interest to this community is GPT-4 using a Linux terminal to iteratively problem-solve locating and infiltrating a poorly-secured machine on a local network:
That's because the naive inner product suggested by the risk is non-informative,
Hmm, how is this an inner product? I note that it lacks, among other properties, positive definiteness:
Edit: I guess you mean a distance metric induced by an inner product (similar to the examples later on, where you have distance metrics induced by a norm), not an actual inner product? I'm confused by the use of standard inner product notation if that's the intended meaning. Also, in this case, this ...
I think this is overcomplicating things.
We don't have to solve any deep philosophical problems here finding the one true pointer to "society's values", or figuring out how to analogize society to an individual.
We just have to recognize that the vast majority of us really don't want a single rogue to be able to destroy everything we've all built, and we can act pragmatically to make that less likely.
Well, admittedly “alignment to humanity as a whole” is open to interpretation. But would you rather everyone have their own personal superintelligence that they can brainwash to do whatever they want?
And mine.