Yeah, I think they will probably produce better and more regulation than if politicians were more directly involved, but I’m not super sanguine about bureaucrats in absolute terms.
This was addressed in the post: "To fully flesh out this proposal, you would need concrete operationalizations of the conditions for triggering the pause (in particular the meaning of "agentic") as well as the details of what would happen if it were triggered. The question of how to determine if an AI is an agent has already been discussed at length at LessWrong. Mostly, I don't think these discussions have been very helpful; I think agency is probably a "you know it when you see it" kind of phenomenon. Additionally, even if we do need a more formal operat...
Thanks for posting this, I think these are valuable lessons and I agree it would be worthwhile for someone to do a project looking into successful emergency response practices. One thing this framing also highlights is that, as Quintin Pope discussed in his recent post on alignment optimism, the “security mindset” is not appropriate for the default alignment problem. We are only being optimized against once we have failed to align the AI; until then, we are mostly held to the lower bar of reliability, not security. There is also the problem of malicious ...
I think concrete ideas like this that take inspiration from past regulatory successes are quite good, esp. now that policymakers are discussing the issue.
I agree with aspects of this critique. However, to steelman Leopold, I think he is not just arguing that demand-driven incentives will drive companies to solve alignment due to consumers wanting safe systems, but rather that, over and above ordinary market forces, constraints imposed by governments, media/public advocacy, and perhaps industry-side standards will make it such that it is ~impossible to release a very powerful, unaligned model. I think this points to a substantial underlying disagreement in your models - Leopold thinks that governments and th...
How do any capabilities or motivations arise without being explicitly coded into the algorithm?
I don't think it is correct to conceptualize MLE as a "goal" that may or may not be "myopic." LLMs are simulators, not prediction-correctness-optimizers; we can infer this from the fact that they don't intervene in their environment to make it more predictable. When I worry about LLMs being non-myopic agents, I worry about what happens when they have been subjected to lots of fine-tuning, perhaps via Ajeya Cotra's idea of "HFDT," for a while after pre-training. Thus, while pretraining from human preferences might shift the initial distrib...
Pretraining on human feedback? I think it’s promising, but we have no direct evidence of how it interacts with RL fine-tuning to make LLMs into agents, which is the key question.
Yes, but the question of whether pretrained LLMs have good representations of our values and/or our preferences and the concept of deference/obedience is still quite important for whether they become aligned. If they don’t, then aligning them via fine-tuning after the fact seems quite hard. If they do, it seems pretty plausible to me that e.g. RLHF fine-tuning or something like Anthropic’s constitutional AI finds the solution of “link the values/obedience representations to the output in a way that causes aligned behavior,” because this is simple and attain...
Yeah I agree with both your object level claim (i.e. I lean towards the “alignment is easy” camp) and to a certain extent your psychological assessment, but this is a bad argument. Optimism bias is also well documented in many cases, so to establish that people who think alignment is hard are overly pessimistic, you need to argue more on the object level against the claim or provide highly compelling evidence that such people are systematically irrationally pessimistic on most topics.
Strong upvote. A corollary here is that a really important part of being a “good person” is being able to tell when you’re rationalizing your behavior or otherwise deceiving yourself into thinking you’re doing good. The default is that people are quite bad at this but, as you said, don’t have explicitly bad intentions, which leads to a lot of people who are at some level morally decent acting in very morally bad ways.
In the intervening period, I've updated towards your position, though I still think it is risky to build systems with capabilities that open-ended which are that close to agents in design space.
I agree with something like this, though I think you're too optimistic w/r/t deceptive alignment being highly unlikely if the model understands the base objective before getting a goal. If the model is sufficiently good at deception, there will be few to no differential adversarial examples. Thus, while gradient descent might have a slight preference for pointing the model at the base objective over a misaligned objective induced by the small number of adversarial examples, the vastly larger number of misaligned goals suggests to me that it is ...
My understanding of Shard Theory is that what you said is true, except sometimes the shards "directly" make bids for outputs (particularly when they are more "reflexive," e.g. the "lick lollipop" shard is activated when you see a lollipop), but sometimes they make bids for control of a local optimization module, which then implements the output that scores best according to the various competing shards. You could also imagine shards which do a combination of both behaviors. TurnTrout can correct me if I'm wrong.
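To make the two pathways concrete, here is a purely illustrative toy sketch (all names and structure are mine, not anything TurnTrout or other shard theory folks have written down):

```python
# Toy illustration only; made-up structure, not a real formalization of shard theory.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Shard:
    name: str
    # Reflexive pathway: given the current context, maybe bid directly for an action.
    direct_bid: Callable[[dict], Optional[str]]
    # Planning pathway: score candidate plans produced by a local optimization module.
    score_plan: Callable[[str, dict], float]

def act(shards: list[Shard], context: dict, candidate_plans: list[str]) -> str:
    # 1. Reflexive bids: a strongly activated shard can short-circuit straight to
    #    an action, e.g. a "lick lollipop" shard firing when a lollipop is in view.
    for shard in shards:
        bid = shard.direct_bid(context)
        if bid is not None:
            return bid

    # 2. Otherwise the local optimization module picks the plan that scores best
    #    according to the aggregated bids of the competing shards.
    return max(candidate_plans,
               key=lambda plan: sum(s.score_plan(plan, context) for s in shards))
```

The distinction I'm pointing at is just between a shard that short-circuits straight to an action and shards that merely supply the scoring function for a separate planner; a real implementation in a neural net would of course look nothing like this clean.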
My takeoff speeds are on the somewhat faster end, probably ~a year or two from “we basically don’t have crazy systems” to “AI (or whoever controls AI) controls the world”.
EDIT: After further reflection, I no longer endorse this. I would now put 90% CI from 6 months to 15 years with median around 3.5 years. I still think fast takeoff is plausible but now think pretty slow is also plausible and overall more likely.
I think I'm like >95th percentile on verbal-ness of thoughts. I feel like almost all of my thoughts that aren't about extremely concrete things in front of me or certain abstract systems that are best thought of visually are verbal, and even in those cases I sometimes think verbally. Almost all of the time, at least some words are going through my head, even if it's just random noise or song lyrics or something like that. I struggle to imagine what it would be like to not think this way, as I feel like many propositions can't be eas...
Hmm... I guess I'm skeptical that we can train very specialized "planning" systems? Making superhuman plans of the sort that could counter those of an agentic superintelligence seems like it requires both a very accurate and domain-general model of the world as well as a search algorithm to figure out which plans actually accomplish a given goal given your model of the world. This seems extremely close in design space to a more general agent. While I think we could have narrow systems which outperform the misaligned superintelligence in o...
This makes sense, but it seems to be a fundamental difficulty of the alignment problem itself as opposed to the ability of any particular system to solve it. If the language model is superintelligent and knows everything we know, I would expect it to be able to evaluate its own alignment research as well as, if not better than, we can. The problem is that it can't get any feedback about whether its ideas actually work from empirical reality, given the issues with testing alignment proposals, not that it can't get feedback from another intelligent grader/assessor reasoning in an ~a priori way.
I think this is a very good critique of OpenAI's plan. However, to steelman the plan, I think you could argue that advanced language models will be sufficiently "generally intelligent" that they won't need very specialized feedback in order to produce high-quality alignment research. As e.g. Nate Soares has pointed out repeatedly, the case of humans suggests that in some cases, a system's capabilities can generalize way past the kinds of problems that it was explicitly trained to solve. If we assume that sufficiently powerful language...
I agree with most of these claims. However, I disagree about the level of intelligence required to take over the world, which makes me overall much more scared of AI/doomy than it seems like you are. I think there is at least a 20% chance that a superintelligence with +12 SD capabilities across all relevant domains (esp. planning and social manipulation) could take over the world.
I think human history provides mixed evidence for the ability of such agents to take over the world. While almost every human in history has failed to acc...
You write that even if the mechanistic model is wrong, if it “has some plausible relationship to reality, the predictions that it makes can still be quite accurate.” I think that this is often true, and true in particular in the case at hand (explicit search vs not). However, I think there are many domains where this is false, where there is a large range of mechanistic models which are plausible but make very false predictions. This depends roughly on how sensitive the details of the prediction are to the details of the mechanistic model. In the...
I think the NAH does a lot of work for interpretability of an AI's beliefs about things that aren't values, but I'm pretty skeptical about the "human values" natural abstraction. I think the points made in this post are good, and relatedly, I don't want the AI to be aligned to "human values"; I want it to be aligned to my values. I think there’s a pretty big gap between my values and those of the average human, even after something like CEV, and that this is probably true for other LW/EA types as well. Human values as they exist in nature contain fundamental terms for the in-group, disgust-based values, etc.
Human bureaucracies are mostly misaligned because the actual bureaucratic actors are also misaligned. I think a “bureaucracy” of perfectly aligned humans (like EA but better) would be well aligned. RLHF is obviously not a solution in the limit, but I don’t think it’s extremely implausible that it is outer aligned enough to work, though I am much more enthusiastic about IDA.
Good point, post updated accordingly.
I think making progress on ML is pretty hard. In order for a single AI to self improve quickly enough that it changed timelines, it would have to improve close to as fast as the speed at which all of the humans working on it could improve it. I don't know why you would expect to see such superhuman coding/science capabilities without other kinds of superintelligence.
I think the world modelling improvements from modern science and IQ-raising social advances can be analytically separated from changes in our approach to welfare. As for non-consensual wireheading, I am uncertain as to its moral status, so it seems like we partially just disagree about values. I am also uncertain as to the attitude of Stone Age people towards this - while your argument seems plausible, the fact that early philosophers like the Ancient Greeks were not pure hedonists in the wireheading sense but valued flourishing seems like evidence against it, suggesting that favoring non-consensual wireheading is downstream of modern developments in utilitarianism.
The claim about Stone Age people seems probably false to me - I think if Stone Age people could understand what they were actually doing (not at the level of psychology or morality, but at the purely "physical" level), they would probably do lots of very nice things for their friends and family, in particular give them a lot of resources. However, even if it is true, I don't think the reason we have gotten better is philosophy - I think it's because we're smarter in a more general way. Stone Age people were uneducated and had worse nutrition than us; they were literally just stupid.
Having had a similar experience, I strongly endorse this advice. Actually optimizing for high quality relationships in modern society looks way different than following the social strategies that didn't get you killed in the EEA.
I think this is probably true; I would assign something like a 20% chance of some kind of government action in response to AI aimed at reducing x-risk, and maybe a 5-10% chance that it is effective enough to meaningfully reduce risk. That being said, 5-10% is a lot, particularly if you are extremely doomy. As such, I think it is still a major part of the strategic landscape even if it is unlikely.
Why should we expect that, as the AI gradually automates us away, it will replace us with better versions of ourselves rather than with non-sentient, or at minimum non-aligned, robots who just do its bidding?
I don’t think we have time before AGI comes to deeply change global culture.
This is probably true for some extremely high level of superintelligence, but I expect much stupider systems to kill us if any do; I think human-level-ish AGI is already a serious x-risk, and humans aren’t even close to being intelligent enough to do this.
Why do you expect that the most straightforward plan for an AGI to accumulate resources is so illegible to humans? If the plan is designed to be hidden to humans, then it involves modeling them and trying to deceive them. But if not, then it seems extremely unlikely to look like this, as opposed to the much simpler plan of building a server farm. To put it another way, if you planned using a world model as if humans didn’t exist, you wouldn’t make plans involving causing a civil war in Brazil. Unless you expect the AI to be modeling the world at an atomic level, which seems computationally intractable particularly for a machine with the computational resources of the first AGI.
This seems unlikely to be the case to me. However, even if this is the case and so the AI doesn't need to deceive us, isn't disempowering humans via force still necessary? Like, if the AI sets up a server farm somewhere and starts to deploy nanotech factories, we could, if not yet disempowered, literally nuke it. Perhaps this exact strategy would fail for various reasons, but more broadly, if the AI is optimizing for gaining resources/accomplishing its goals as if humans did not exist, then it seems unlikely to be able to defend against h...
Check out CLR's research: https://longtermrisk.org/research-agenda. They are focused on answering questions like these because they believe that competition between AIs is a big source of s-risk.
It seems to me that it is quite possible that language models develop into really good world modelers before they become consequentialist agents or contain consequentialist subagents. While I would be very concerned with using an agentic AI to control another agentic AI for the reasons you listed, and so am pessimistic about e.g. debate, AI still seems like it could be very useful for solving alignment.
This seems pretty plausible to me, but I suspect that the first AGIs will exhibit a different distribution of skills across cognitive domains than humans and may also be much less agentic. Humans evolved in environments where the ability to form and execute long term plans to accumulate power and achieve dominance over other humans was highly selected for. The environments in which the first AGIs are trained may not have this property. That doesn’t mean they won’t develop it, but they may very well not until they are more strongly and generally superintelligent.
They rely on secrecy to gain relative advantages, but absolutely speaking, openness increases research speed; it increases the amount of technical information available to every actor.
With regard to God specifically, belief in God is somewhat unique because God is supposed to make certain things good in virtue of his existence; the value of the things religious people value is predicated on the existence of God. In contrast, the value of cake to the kid is not predicated on the actual existence of the cake.
I think this is a good point and one reason to favor more CEV-style solutions to alignment, if they are possible, rather than solutions which keep the values of the AI relatively "closer" to our original values.
Or, the other way around, perhaps "values" are defined by being robust to ontology shifts.
This seems wrong to me. I don't think that reductive physicalism is true (i.e. the hard problem really is hard), but if I did, I would probably change my values significantly. Similarly for religious values; religious people seem to think that God has a unique metaphysical status such that his will determines what is right and wrong, and if no being with such a metaphysical status existed, their values would have to change.
How do you know what that is? You don't have the ability to stand outside the mind-world relationship and perceive it, any more than anything else. You have beliefs about the mind-world relationship, but they are all generated by inference in your mind. If there were some hard core of non-inferential knowledge about the ontological nature of reality, you might be able to leverage it to gain more knowledge, but there isn't, because the same objections apply.
I'm not making any claims about knowing what it is. The OP's argument is that our normal dete...
By "process," I don't mean internal process of thought involving an inference from perceptions to beliefs about the world, I mean the actual perceptual and cognitive algorithm as a physical structure in the world. Because of the way the brain actually works in a deterministic universe, it ends up correlated with the external world. Perhaps this is unknowable to us "from the inside," but the OP's argument is not about external world skepticism given direct access only to what we perceive, but rather that given normal hypotheses about how the bra...
We get correspondence to reality through predictive accuracy; we can predict experience well using science because scientific theories are roughly isomorphic to the structures in reality that they are trying to describe.
Yeah this is exactly right imo. Thinking about good epistemics as being about believing what is "justified" or what you have "reasons to believe" is unimportant/useless insofar as it departs from "generated by a process that makes the ensuing map correlate with the territory." In the world where we don't have free will, but our beliefs are produced deterministically by our observations and our internal architecture in a way such that they are correlated with the world, we have all the knowledge that we need.
While this doesn't answer the question exactly, I think important parts of the answer include the fact that AGI could upload itself to other computers, as well as acquire resources (minimally, money) entirely through the internet (e.g. by investing in stocks). A superintelligent system with access to trillions of dollars and with huge numbers of copies of itself on computers throughout the world more obviously has a lot of potentially very destructive actions available to it than one stuck on one computer with no resources.
I think you’re probably right. But even this will make it harder to establish an agency where the bureaucrats/technocrats have a lot of autonomy, and it seems there’s at least a small chance of an extreme ruling which could make it very difficult.