I agree in principle that labs have the responsibility to dispel myths about what they're committed to. OTOH, in defense of the labs I imagine that this can be hard to do while you're in the middle of negotiations with various AISIs about what those commitments should look like.
The argument I think is good (number 2 in my previous comment) doesn't go through reference classes at all. I don't want to make an outside-view argument (eg "things we call optimization often produce misaligned results, therefore SGD is dangerous"). I like the evolution analogy because it makes salient some aspects of AI training that make misalignment more likely. Once those aspects are salient, you can stop thinking about evolution and just think directly about AI.
evolution does not grow minds, it grows hyperparameters for minds.
Imo this is a nitpick that isn't really relevant to the point of the analogy. Evolution is a good example of how selection for X doesn't necessarily lead to a thing that wants ('optimizes for') X; and more broadly it's a good example for how the results of an optimization process can be unexpected.
I want to distinguish two possible takes here:
I'm not saying that GPT-4 is lying to us - that part is just clarifying what I think Matthew's claim is.
Re cauldron: I'm pretty sure MIRI didn't think that. Why would they?
I think the specification problem is still hard and unsolved. It looks like you're using a different definition of 'specification problem' / 'outer alignment' than others, and this is causing confusion.
IMO all these terms are a bit fuzzy / hard to pin down, and so it makes sense that they'd sometimes lead to disagreement. The best way (afaict) to avoid this is to keep the terms grounded in 'what would be useful for avoiding AGI doom?'. To me it looks like on your definition, outer alignment is basically a trivial problem that doesn't help alignment much.
Mor...
Do you have an example of one way that the full alignment problem is easier now that we've seen that GPT-4 can understand & report on human values?
(I'm asking because it's hard for me to tell if your definition of outer alignment is disconnected from the rest of the problem in a way where it's possible for outer alignment to become easier without the rest of the problem becoming easier).
I think it's false in the sense that MIRI never claimed that it would be hard to build an AI with GPT-4 level understanding of human values + GPT-4 level of willingness to answer honestly (as far as I can tell). The reason I think it's false is mostly that I haven't seen a claim like that made anywhere, including in the posts you cite.
I agree lots of the responses elide the part where you emphasize that it's important how GPT-4 doesn't just understand human values, but is also "willing" to answer questions somewhat honestly. TBH I don't understand why that...
I think maybe there's a parenthesis issue here :)
I'm saying "your claim, if I understand correctly, is that MIRI thought AI wouldn't (understand human values and also not lie to us)".
I think we agree - that sounds like it matches what I think Matthew is saying.
You make a claim that's very close to that - your claim, if I understand correctly, is that MIRI thought AI wouldn't understand human values and also not lie to us about it (or otherwise decide to give misleading or unhelpful outputs):
...The key difference between the value identification/specification problem and the problem of getting an AI to understand human values is the transparency and legibility of how the values are represented: if you solve the problem of value identification, that means you have an actual function that can tell you the value of a
My paraphrase of your (Matthew's) position: while I'm not claiming that GPT-4 provides any evidence about inner alignment (i.e. getting an AI to actually care about human values), I claim that it does provide evidence about outer alignment being easier than we thought: we can specify human values via language models, which have a pretty robust understanding of human values and don't systematically deceive us about their judgement. This means people who used to think outer alignment / value specification was hard should change their minds.
(End paraphrase)
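To make the "specify human values via language models" idea concrete, here's a hypothetical sketch of what it could mean operationally: treat the model as a black-box scoring function over textual outcome descriptions. `query_model` is an assumed stand-in for any chat-model API; nothing here is a real library call.

```python
def value(outcome_description: str, query_model) -> float:
    # `query_model` is a hypothetical stand-in: it takes a prompt string
    # and returns the model's text reply.
    prompt = (
        "On a scale from 0 (catastrophic) to 10 (very good), how good is "
        "the following outcome for humanity? Answer with a single number.\n\n"
        f"Outcome: {outcome_description}"
    )
    return float(query_model(prompt))
```

On this framing, the "value identification" artifact is just `value()`: a function mapping outcome descriptions to scores, with the model supplying the judgment.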
I t...
(Newbie guest fund manager here) My impression is there are plans re individuals but they're not very developed or put into practice yet. AFAIK there are currently no plans to fundraise from companies or governments.
IMO a good candidate is anything that is object-level useful for X-risk mitigation. E.g. technical alignment work, AI governance / policy work, biosecurity, etc.
Broadly agree with the takes here.
However, these results seem explainable by the widely-observed tendency of larger models to learn faster and generalize better, given equal optimization steps.
This seems right and I don't think we say anything contradicting it in the paper.
...I also don't see how saying 'different patterns are learned at different speeds' is supposed to have any explanatory power. It doesn't explain why some types of patterns are faster to learn than others, or what determines the relative learnability of memorizing versus generalizing
There are positive feedback loops between prongs:
A three-pronged approach to AGI safety. (This is assuming we couldn't just avoid building AGI or proto-AGIs at all until say ~2100, which would of course be much better).
Prong 1: boxing & capability control (aka ‘careful bootstrapping’)
whether or not this is the safest path, important actors seem likely to act as though it is
It's not clear to me that this is true, and it strikes me as maybe overly cynical. I get the sense that people at OpenAI and other labs are receptive to evidence and argument, and I expect us to get a bunch more evidence about takeoff speeds before it's too late. I expect people's takes on AGI safety plans to evolve a lot, including at OpenAI. Though TBC I'm pretty uncertain about all of this - definitely possible that you're right here.
Whether or not this is the safest path, the fact that OpenAI thinks it’s true and is one of the leading AI labs makes it a path we’re likely to take. Humanity successfully navigating the transition to extremely powerful AI might therefore require successfully navigating a scenario with short timelines and slow, continuous takeoff.
You can't just choose "slow takeoff". Takeoff speeds are mostly a function of the technology, not company choices. If we could just choose to have a slow takeoff, everything would be much easier! Unfortunately, OpenAI can't jus...
There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals
IMO making the field of alignment 10x larger or requiring evals would not solve a big part of the problem, while indefinitely pausing AI development would. I agree it's much harder, but I think it's good to at least try, as long as it doesn't terribly hurt less ambitious efforts (which I think it doesn't).
Thinking about alignment-relevant thresholds in AGI capabilities. A kind of rambly list of relevant thresholds:
Yeah, I don't think the arguments in this post on their own should convince you that P(doom) is high if you're skeptical. There's lots to say here that doesn't fit into the post, eg an object-level argument for why AI alignment is "default-failure" / "disjunctive".
Thanks for link-posting this! I'd find it useful to have the TLDR at the beginning of the post, rather than at the end (that would also make the last paragraph easier to understand). You did link the TLDR at the beginning, but I still managed to miss it on the first read-through, so I think it would be worth it.
Also: consider crossposting to the alignmentforum.
Edit: also, the author is Eliezer Yudkowsky. Would be good to mention that in the intro.
I like that mini-game! Thanks for the reference
...like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data,
It's unclear to me what it would even mean to get a prediction without a "model". Not sure if you meant to imply that, but I'm not claiming that it makes sense to view AI safety as default-failure in absence of a model (ie in absence of details & reasons to think AI risk is default failure).
More generally, suppose that the agent acts in accordance with the following policy in all decision-situations: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ That policy makes the agent immune to all possible money-pumps for Completeness.
Am I missing something or does this agent satisfy Completeness anytime it faces a decision for the second time?
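To make the quoted policy concrete, here's a minimal sketch (option names and the preference relation are assumed for illustration): A and B are incomparable, "A-" is strictly worse than A, and the money-pump that trades the agent from A to B and then to A- fails because the policy blocks A-.

```python
# The agent refuses any option strictly dispreferred to something it already
# turned down, and otherwise picks arbitrarily (here: first allowed option).
STRICT = {("A", "A-")}  # (better, worse) pairs; all other pairs incomparable

class Agent:
    def __init__(self):
        self.turned_down = set()

    def allowed(self, option):
        return not any((x, option) in STRICT for x in self.turned_down)

    def choose(self, options):
        candidates = [o for o in options if self.allowed(o)]
        if not candidates:
            return None  # refuse the trade entirely
        choice = candidates[0]
        self.turned_down.update(o for o in options if o != choice)
        return choice

agent = Agent()
first = agent.choose(["B", "A"])    # picks B, turning down A
second = agent.choose(["A-", "B"])  # A- is blocked: worse than the rejected A
```

Note the literal policy only rules options out; on a repeat encounter with incomparable options it still leaves the tie-break open, which is what the question above is probing.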
...Newtonian gravity states that objects are attracted to each other in proportion to their mass. A webcam video of two apples falling will show two objects, of slightly differing masses, accelerating at the exact same rate in the same direction, and not towards each other. When you don’t know about the earth or the mechanics of the solar system, this observation points against Newtonian gravity. [...] But it requires postulating the existence of an unseen object offscreen that is 25 orders of magnitude more massive than anything it can see, with a center of
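The numbers in the quoted passage can be sanity-checked with a quick back-of-the-envelope calculation (the apple mass and separation are assumed illustrative values):

```python
import math

G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
m_apple = 0.2        # kg (assumed)
m_earth = 5.972e24   # kg
r_apples = 0.5       # m, assumed separation between the two falling apples

# Mutual attraction between the apples vs. their acceleration toward Earth
a_mutual = G * m_apple / r_apples ** 2
print(f"apple-apple attraction: {a_mutual:.1e} m/s^2 (vs g ~ 9.8 m/s^2)")

# How much more massive the unseen "offscreen object" (Earth) is
ratio = m_earth / m_apple
print(f"Earth/apple mass ratio: {ratio:.1e} "
      f"(~{math.log10(ratio):.0f} orders of magnitude)")
```

So the mutual pull between the apples is ~11 orders of magnitude below g, which is why the webcam shows parallel acceleration rather than attraction.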
I would not call 1) an instance of goal misgeneralization. Goal misgeneralization only occurs if the model does badly at the training objective. If you reward an RL agent for making humans happy and it goes on to make humans happy in unintended ways like putting them into heroin cells, the RL agent is doing fine on the training objective. I'd call 1) an instance of misspecification and 2) an instance of misgeneralization.
(AFAICT The Alignment Problem from a DL Perspective uses the term in the same way I do, but I'd have to reread more carefully to make sur...
It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I'm missing some of the problems.
(Analogy: say I'm working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I'm actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).
I agree with what I read as the main direct claim of this post, which is that it is often worth avoiding making very confident-sounding claims, because it makes it likely for people to misinterpret you or derail the conversation towards meta-level discussions about justified confidence.
However, I disagree with the implicit claim that people who confidently predict AI X-risk necessarily have low model uncertainty.
For example, I find it hard to predict when and how AGI is developed, and I expect that many of my ideas and predictions about that will be mistaken. This makes me more pessimistic, rather than less, since it seems pretty hard to get AI alignment right if we can't even predict basic things like "when will this system have situational awareness", etc.
Yes, and this can be framed as a consequence of a more general principle, which is that model uncertainty doesn't save you from pessimistic outcomes unless your prior (which after all is what you fall back t...
To briefly hop in and say something that may be useful: I had a reaction pretty similar to what Eliezer commented, and I don't see continuity or "Things will be weird before getting extremely weird" as a crux. (I don't know why you think he sees it as a crux, and I don't know what he thinks, but I'd guess he doesn't consider it a crux either.)
Thanks for doing this! I think this could be valuable. What's your current plan for developing this further / onboarding collaborators?
Some observations / thoughts from interacting with the QA system for a few minutes:
...Why do people think AI is an existential risk? People think AI is an existential risk because of the possibility of a superintelligent AI system with recursive self-improvement capabilities, which could lead to catastrophic consequences like turning humans i
This seems wrong. Here's an incomplete list of reasons why:
Yeah we're on the same page here, thanks for checking!
For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?
I feel pretty uncertain about all the factors here. One reason I overall still lean towards the 'definitely not' stance is that building a toddler AGI that is alignable in principle is only one of multiple steps that need to go right for us to get a reflectively-stable docile AGI; in particular we still need to solve the prob...
Yeah that seems reasonable! (Personally I'd prefer a single break between sentence 3 and 4)
IMO ~170 words is a decent length for a well-written abstract (well maybe ~150 is better), and the problem is that abstracts are often badly written. Steve Easterbrook has a great guide on writing scientific abstracts; here's his example template which I think flows nicely:
...(1) In widgetology, it’s long been understood that you have to glomp the widgets before you can squiffle them. (2) But there is still no known general method to determine when they’ve been sufficiently glomped. (3) The literature describes several specialist techniques that measure how
Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.
I'm arguing that it's definitely not going to work (I don't have 99% confidence here bc I might be missing something, but IM(current)O the things I list are actual blockers).
First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.
Do you mean we possibly don't need the prerequisites, or we definitely need them but that's possibly fine?
In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.
Curious what your take is on these reasons to think the answer is no (IMO the first one is basically already enough):
That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.
Some thoughts written down before reading the rest of the post (list is unpolished / not well communicated)
The main problems I see:
(Crossposting some of my twitter comments).
I liked this criticism of alignment approaches: it makes a concrete claim that addresses the crux of the matter, and provides supporting evidence! I also disagree with it, and will say some things about why.
I think that instead of thinking in terms of "coherence" vs. "hot mess", it is more fruitful to think about "how much influence is this system exerting on its environment?". Too much influence will kill humans, if directed at an outcome we're not able to choose. (The rest of my comments are all variations on
Maybe Francois Chollet has coherent technical views on alignment that he hasn't published or shared anywhere (the blog post doesn't count, for reasons that are probably obvious if you read it), but it doesn't seem fair to expect Eliezer to know / mention them.
Is there an open-source implementation of causal scrubbing available?
I'm confused about the example you give. In the paragraph, Eliezer is trying to show that you ought to accept the independence axiom, cause you can be Dutch booked if you don't. I'd think if you're updateless, that means you already accept the independence axiom (cause you wouldn't be time-consistent otherwise).
And in that sense it seems reasonable to assume that someone who doesn't already accept the independence axiom is also not updateless.
I agree it's important to be careful about which policies we push for, but I disagree both with the general thrust of this post and the concrete example you give ("restrictions on training data are bad").
Re the concrete point: it seems like the clear first-order consequence of any strong restriction is to slow down AI capabilities. Effects on alignment are more speculative and seem weaker in expectation. For example, it may be bad if it were illegal to collect user data (eg from users of chat-gpt) for fine-tuning, but such data collection is unlikely to fa...
I also think that often "the AI just maximizes reward" is a useful simplifying assumption. That is, we can make an argument of the form "even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed".
(Though of course it's important to spell the argument out)
I agree with your general point here, but I think Ajeya's post actually gets this right, eg
There is some ambiguity about what exactly “maximize reward” means, but once Alex is sufficiently powerful -- and once human knowledge/control has eroded enough -- an uprising or coup eventually seems to be the reward-maximizing move under most interpretations of “reward.”
and
...What if Alex doesn’t generalize to maximizing its reward in the deployment setting? What if it has more complex behaviors or “motives” that aren’t directly and simply derived from
FWIW I believe I wrote that sentence and I now think this is a matter of definition, and that it’s actually reasonable to think of an agent that e.g. reliably solves a maze as an optimizer even if it does not use explicit search internally.
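As a toy illustration of that definitional point (the grid, policy, and maze here are all made up): a fixed lookup table can reliably steer to the maze exit with no internal search at all.

```python
# state -> action table, e.g. distilled from some earlier training process;
# "right" moves one column over, "down" moves one row down (assumed 3x3 grid).
POLICY = {
    (0, 0): "right", (0, 1): "right", (0, 2): "down",
    (1, 2): "down",  (2, 2): "done",  # (2, 2) is the exit
}

def run(start=(0, 0)):
    state, path = start, []
    while POLICY[state] != "done":
        action = POLICY[state]
        path.append(action)
        row, col = state
        state = (row, col + 1) if action == "right" else (row + 1, col)
    return path  # the path taken from start to exit
```

Whether this counts as an "optimizer" then depends on how reliably it reaches the exit, not on whether anything search-like happens inside.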
Yeah, fair point. I do think labs have some nonzero amount of responsibility to be proactive about what others believe about their commitments. I agree it doesn't extend to 'rebut every random rumor'.