As a concrete example, suppose that an AI fails to find an error in a piece of code (which will slightly undermine the security of some important computer system). You are unsure if it represents sabotage or a benign error. Relevant evidence for making that distinction:
One difference is that if you observe a human behaving deceptively you are forced to make educated guesses about their motives and context, but if you catch an AI behaving deceptively you are able to record the exact context in which it acted, examine its behavior in a large number of counterfactual situations, run other AI systems in exactly the same situation, etc.
I think this removes most of the leeway that humans use to maintain plausible deniability, especially in high-stakes situations with thorough investigations. Overall I'm more intuitively ...
Taking your numbers at face value, you'd have 1.5 billion passenger hours afflicted by the ban per life saved, or about 3000 lifetimes' worth of hours.
Or: if people spent every waking minute of their lives under annoying regulatory requirements about as bad as this one with the same tradeoffs, the benefit would be extending the average lifespan from 77.28 years to 77.29 years.
I expect most people would demand more like +10 years of lifespan in return for that level of restriction, not +0.01 years. So the cost benefit is probably off by ~3 orders of magnitude...
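The arithmetic above can be roughly reproduced as follows. This is only a sketch: the 1.5 billion hours and the 77.28-year lifespan come from the comment, while the 16 waking hours per day and the 40 life-years gained per life saved are illustrative assumptions that the comment does not state.

```python
import math

# Figures taken from the comment.
PASSENGER_HOURS_PER_LIFE_SAVED = 1.5e9
AVERAGE_LIFESPAN_YEARS = 77.28

# Assumptions for illustration only (not stated in the comment).
WAKING_HOURS_PER_DAY = 16
LIFE_YEARS_GAINED_PER_LIFE_SAVED = 40

# Waking hours in one average lifetime.
waking_hours_per_lifetime = AVERAGE_LIFESPAN_YEARS * 365.25 * WAKING_HOURS_PER_DAY

# How many whole lifetimes of restriction are spent per life saved.
lifetimes_per_life_saved = PASSENGER_HOURS_PER_LIFE_SAVED / waking_hours_per_lifetime

# Expected lifespan gain for one person living entirely under such rules.
years_gained_per_lifetime = LIFE_YEARS_GAINED_PER_LIFE_SAVED / lifetimes_per_life_saved

# Gap between the +10 years people might demand and what they would get.
orders_of_magnitude_gap = math.log10(10 / years_gained_per_lifetime)

print(f"lifetimes of restriction per life saved: {lifetimes_per_life_saved:.0f}")
print(f"lifespan gain per restricted lifetime:   {years_gained_per_lifetime:.3f} years")
print(f"shortfall vs. a +10 year demand:         ~{orders_of_magnitude_gap:.1f} orders of magnitude")
```

Under these assumptions the result lands near 3000 lifetimes and a gain of roughly 0.01 years, which matches the figures quoted above; different assumptions about waking hours or life-years gained shift the numbers but not the order of magnitude.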
I think the term "existential risk" comes from here, where it is defined as:
Existential risk – One where an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential.
(I think on a plain English reading "existential risk" doesn't have any clear precise meaning. I would intuitively have included e.g. social collapse, but probably wouldn't have included an outcome where humanity can never expand beyond the solar system, but I think Bostrom's definition is also consistent with the vagu...
It seems fairly unlikely that this specific task will be completed soon for a variety of reasons: it sounds like it technically requires training a new LM that removes all data about Zelda games; it involves a fair amount of videogame-specific engineering hassle; and it's far from anything with obvious economic relevance + games are out of fashion (not because they are too hard). I do still think it will be done before 2033.
If we could find a similar task that was less out of the way then I'd probably be willing to bet on it happening much sooner. Presumab...
Do you have any hard things that you are confident LLMs won't do soon? (Short of: "autonomously carry out R&D.") Any tasks you think an LM agent won't be able to achieve?
Beat Ocarina of Time with <100 hours of playing Zelda games during training or deployment (but perhaps training on other games), no reading guides/walkthroughs/playthroughs, no severe bug exploits (those that would cut down the required time by a lot), no reward-shaping/advice specific to this game generated by humans who know non-trivial things about the game (but the agent can shape its own reward). Including LLM coding a program to do it. I'd say probably not by 2033.
I can't tell if you think these problems will remain hard for the model, and if so why.
I think 70% that an LM agent can do the 4x4 grid example by EOY 2024 because it seems pretty easy. I'd update if that was wrong. (And I'd be fine replacing that by held out examples of similar complexity.)
Will you be updating your picture if it can do these tasks by EOY? How much have you updated in the last few years? I feel like 2018 Paul was pretty surprised by how good ChatGPT is now (its Turing test ability is maybe ~85th percentile of my forecasts), and that in 2018 you were at least qualitatively trying to argue in the opposite direction.
I don't see how that's a valid interpretation of the rules. Isn't it checking to find that there is at least one 2x repetition and at least one 3x repetition? Whereas the request was exactly two of each.
I'm glad you have held out problems, and I think it would be great if you had a handful (like 3) rather than just one. (If you have 5-10 it would also be cool to plot the success rate going up over time as ChatGPT improves.)
Here is the result of running your prompt with a generic system prompt (asking for an initial answer + refinement). It fails to meet the corner condition (and perplexingly says "The four corners (top left 'A', top right 'A', bottom left 'A', bottom right 'B') are distinct."). When I point out that the four corners aren't distinct it fix...
Find a sequence of words that is:
- 20 words long
- contains exactly 2 repetitions of the same word twice in a row
- contains exactly 2 repetitions of the same word thrice in a row
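The prompt is ambiguous, so a checker has to commit to one reading. The sketch below assumes the strict reading ("exactly two maximal runs of length 2, exactly two maximal runs of length 3, no longer runs") and also includes the looser "at least one of each" reading for comparison. Both function names are hypothetical and are not taken from any code used in the thread.

```python
from itertools import groupby


def run_lengths(words):
    """Return the lengths of maximal runs of identical consecutive words."""
    return [len(list(group)) for _, group in groupby(words)]


def satisfies_strict(words):
    """Strict reading: 20 words, exactly two 2-runs, exactly two 3-runs."""
    runs = run_lengths(words)
    return (
        len(words) == 20
        and runs.count(2) == 2
        and runs.count(3) == 2
        and all(length <= 3 for length in runs)  # assumption: longer runs are disallowed
    )


def satisfies_loose(words):
    """Loose reading: 20 words, at least one 2-run and at least one 3-run."""
    runs = run_lengths(words)
    return len(words) == 20 and 2 in runs and 3 in runs


strict_example = "a a b c c d d d e f f f g h i j k l m n".split()
loose_example = "a a b b b c d e f g h i j k l m n o p q".split()

print(satisfies_strict(strict_example), satisfies_loose(strict_example))
print(satisfies_strict(loose_example), satisfies_loose(loose_example))
```

The second example passes only the loose check, which is the kind of gap between "at least one" and "exactly two" that makes the choice of reading matter.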
Here is its attempt. I add usual boilerplate about being fine to think before answering. First it gives a valid sequence using letters instead of words. I ask for words instead of letters and then it gives a sequence that is only 18 words long. I ask for 20 words and then it finally gets it.
Here's a second try where I use a disambiguated version of your prompt (without boilerplat...
I wrote a fair amount about alignment from 2014-2020[1] which you can read here. So it's relatively easy to get a sense for what I believed.
Here are some summary notes about my views as reflected in that writing, though I'd encourage you to just judge for yourself[2] by browsing the archives:
I think that's reasonable; this is the one with the discussion and it has a forward link, so it would be better to review them as a unit.
I think the dominance principle used in this post is too strong and relatively easy to deny. I think that the results in "Better impossibility results for unbounded utilities" are actually significantly better.
I clarified my views here because people kept misunderstanding or misquoting them.
The grandparent describes my probability that humans irreversibly lose control of AI systems, which I'm still guessing at 10-20%. I should probably think harder about this at some point and revise it; I have no idea which direction it will move.
I think the tweet you linked is referring to the probability for "humanity irreversibly messes up our future within 10 years of building human-level AI." (It's presented as "probability of AI killing everyone" which is not really right...
If this is what's going on, then I basically can't imagine any context in which I would want someone to read the OP rather than a post showing examples of LM agents achieving goals and saying "it's already the case that LM agents want things, more and more deployments of LMs will be agents, and those agents will become more competent such that it would be increasingly scary if they wanted something at cross-purposes to humans." Is there something I'm missing?
I think your interpretation of Nate is probably wrong, but I'm not sure and happy to drop it.
If you use that definition, I don't understand in what sense LMs don't "want" things---if you prompt them to "take actions to achieve X" then they will do so, and if obstacles appear they will suggest ways around them, and if you connect them to actuators they will frequently achieve X even in the face of obstacles, etc. By your definition isn't that "want" or "desire" like behavior? So what does it mean when Nate says "AI doesn't seem to have all that much "want"- or "desire"-like behavior"?
I'm genuinely unclear what the OP is asserting at that point, and...
I would say that current LLMs, when prompted and RLHF'd appropriately, and especially when also strapped into an AutoGPT-type scaffold/harness, DO want things. I would say that wanting things is a spectrum and that the aforementioned tweaks (appropriate prompting, AutoGPT, etc.) move the system along that spectrum. I would say that future systems will be even further along that spectrum. IDK what Nate meant but on my charitable interpretation he simply meant that they are not very far along the spectrum compared to e.g. humans or prophesied future AGIs.
It'...
If your AI system "wants" things in the sense that "when prompted to get X it proposes good strategies for getting X that adapt to obstacles," then you can control what it wants by giving it a different prompt. Arguments about AI risk rely pretty crucially on your inability to control what the AI wants, and your inability to test it. Saying "If you use an AI to achieve a long-horizon task, then the overall system definitionally wanted to achieve that task" + "If your AI wants something, then it will undermine your tests and safety measures" seems like a sl...
When the post says:
This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an AI as having wants/desires "in the behaviorist sense".
It seems like it's saying that if you prompt an LM with "Could you suggest a way to get X in light of all the obstacles that reality has thrown in my way," and if it does that reasonably well and if you hook it up to actuators, then it definitionally has wants and desires.
Which is a fine definition to pick. But the point is that in thi...
But the point is that in this scenario the LM doesn't want anything in the behaviorist sense, yet is a perfectly adequate tool for solving long-horizon tasks. This is not the form of wanting you need for AI risk arguments.
My attempt at an ITT-response:
Drawing a box around a goal agnostic LM and analyzing the inputs and outputs of that box would not reveal any concerning wanting in principle. In contrast, drawing a box around a combined system—e.g. an agentic scaffold that incrementally asks a strong inner goal agnostic LM to advance the agent's process—cou...
Differences:
...This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an
Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing? [...] And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior? [...] Well, I claim that these are more-or-less the same fact.
It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to ar...
I am confused what your position is, Paul, and how it differs from So8res' position. Your statement of your position at the end (the bit about how systems are likely to end up wanting reward) seems like a stronger version of So8res' position, and not in conflict with it. Is the difference that you think the main dimension of improvement driving the change is general competence, rather than specifically long-horizon-task competence?
It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether, if you instead ask that system "Which action would maximize the expected amount of Y?", it will still be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.
Here's an existing Nate!comment that I find reasonably...
It might be worth making a choice about a single move which is unclear to weak players but where strong players have a consensus.
Mostly I think it would be faster and I think a lot less noisy per minute. I also think it's a bit unrepresentative to be able to use "how well did this advisor's suggestions work out in hindsight?" to learn which advisors are honest and so it's nice to make the dishonest advisors' job easier.
(In practice I think evaluating what worked well in hindsight is going to be very valuable, and is already enough for crazy research ...
I don't think you need to reliably classify a system as safe or not. You need to apply consistent standards that output "unsafe" in >90% of cases where things really are unsafe.
I think I'm probably imagining better implementation than you, probably because (based on context) I'm implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium. I think what I'm describing as "very good RSPs" and imagining cutting risk 10x still requires significantly less political will than a global morato...
I think politically realistic hardware controls could buy significant time, or be used to push other jurisdictions to implement appropriate regulation and allow for international verification if they want access to hardware. This seems increasingly plausible given the United States' apparent willingness to try to control access to hardware (e.g. see here).
Which laxer jurisdictions are poised to capture talent/hardware/etc. right now? It seems like 'The West' (interpreted as Silicon Valley) is close to the laxest jurisdiction on Earth when it comes to tech! (If we interpret 'The West' more broadly, this no longer holds, thankfully.)
If you implemented a unilateral pause on AI training runs in the West, then anyone who wasn't pausing AI would be a much laxer jurisdiction.
Regarding the situation today, I don't believe that any jurisdiction has regulations that meaningfully reduce catastrophic risk, but that the...
I don't think an RSP will be able to address these risks, and I think very few AI policies would address these risks either. An AI pause could address them primarily by significantly slowing human technological development, and if that happened today I'm not even really sure these risks are getting better at an appreciable rate (if the biggest impact is the very slow thinking from a very small group of people who care about them, then I think that's a very small impact). I think that in that regime random political and social consequences of faster or slower te...
I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.
I do not know whether this kind of conditional pause is feasible even given that evidence. That said I think it's much more feasible to get such a pause as a result of good safety standards together with significant evidence of hazardous capabilities and alignment difficulty, and the 10x risk reduction is reflecting the probability that you are able to get that kind of evidence in advance of a catastrophe (but conditioning on a very go...
Relatedly, I thought Managing AI Risks in an Era of Rapid Progress was great, particularly the clear statement that this is an urgent priority and the governance recommendations.
On a first reading I feel like I agree with most everything that was said, including about RSPs and the importance of regulation.
Small caveats: (i) I don't know enough to understand the implications or comment on the recommendation "they should also hold frontier AI developers and owners legally accountable for harms from their models that can be reasonably foreseen and prevented,"...
Unknown unknowns seem like a totally valid basis for concern.
But I don't think you get to move the burden of proof by fiat. If you want action then you need to convince relevant actors they should be concerned about them, and that unknown unknowns can cause catastrophe before a lab will stop. Without further elaboration I don't think "unknown unknowns could cause a catastrophe" is enough to convince governments (or AI developers) to take significant actions.
I think RSPs make this situation better by pushing developers away from vague "Yeah we'll be s...
The specific conversation is much better than nothing - but I do think it ought to be emphasized that solving all the problems we're aware of isn't sufficient for safety. We're training on the test set.[1]
Our confidence levels should reflect that - but I expect overconfidence.
It's plausible that RSPs could be net positive, but I think that given successful coordination [vague and uncertain] beats [significantly more concrete, but overconfident].
My presumption is that without good coordination (a necessary condition being cautious decision-makers), things w...
Here is a short post explaining some of my views on responsible scaling policies, regulation, and pauses. I wrote it last week in response to several people asking me to write something. Hopefully this helps clear up what I believe.
I don’t think I’ve ever hidden my views about the dangers of AI or the advantages of scaling more slowly and carefully. I generally aim to give honest answers to questions and present my views straightforwardly. I often point out that catastrophic risk would be lower if we could coordinate to build AI systems later and slower; I ...
That's fair, I think I misread you.
I guess our biggest differences are (i) I don't think the takeaway depends so strongly on whether AI developers are trying to do the right thing---either way it's up to all of us, and (ii) I think it's already worth talking about ways in which Anthropic's RSP is good or bad or could be better, and so I disagree with "there's probably not much to say at this point."
But I also suspect that people on the more cynical side aren't going to be persuaded by a post like this. If you think that companies are pretending to care about safety but really are just racing to make $$, there's probably not much to say at this point other than, let's see what happens next.
This seems wrong to me. We can say all kinds of things, like:
The role of the Trust is to elect (and potentially replace) board members; its formal power comes entirely from the fact that it will eventually elect a majority of the board seats.
The post mentions a "failsafe" where a supermajority of investors can amend this arrangement, which I think is a reasonable compromise. But I'm not aware of any public information about what that supermajority is, or whether there are other ways the Trust's formal powers could be reduced.
Dylan Matthews reports the members of the board here: Dario, Daniela, Luke Muehlhauser, and Yasmin Razavi. (I think it's also listed plenty of other places.)
We intend to leave this prize open until the end of September. At that point we will distribute prizes (probably just small prizes for useful arguments and algorithms, but no full solution).
I now pretty strongly suspect that the version of problem 1 with logarithmic dependence on is not solvable. We would award a prize for an algorithm running in time which can distinguish matrices with no PSD completion from those with a completion where the ratio of min to max eigenvalue is at least . And of course a lower bound is st...
By process-based RL, I mean: the reward for an action doesn't depend on the consequences of executing that action. Instead it depends on some overseer's evaluation of the action, potentially after reading justification or a debate about it or talking with other AI assistants or whatever. I think this has roughly the same risk profile as imitation learning, while potentially being more competitive.
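The distinction can be made concrete with a toy sketch. Everything here (the `Step` type, the toy environment, the toy overseer, and the reward values) is a hypothetical illustration, not a description of any real training setup.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    action: str
    justification: str


def outcome_based_reward(step: Step, execute: Callable[[str], float]) -> float:
    """Reward depends on what happens when the action is actually executed."""
    return execute(step.action)


def process_based_reward(step: Step, overseer: Callable[[str, str], float]) -> float:
    """Reward depends only on the overseer's evaluation; the action is never executed."""
    return overseer(step.action, step.justification)


def toy_environment(action: str) -> float:
    # Hypothetical environment in which an exploit happens to score well.
    return 1.0 if action == "exploit bug" else 0.2


def toy_overseer(action: str, justification: str) -> float:
    # Hypothetical overseer that rejects anything it recognizes as an exploit.
    return 0.0 if "exploit" in action else 0.8


step = Step(action="exploit bug", justification="it makes the tests pass")
print(outcome_based_reward(step, toy_environment))  # rewarded by consequences
print(process_based_reward(step, toy_overseer))     # judged without executing
```

In the toy case the same action is rewarded under outcome-based training and penalized under process-based training, because the second reward signal never observes the consequences at all.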
I'm generally excited and optimistic about coordination. If you are just saying that AI non-proliferation isn't that much harder than nuclear non-proliferation, t...
I think [process-based RL] has roughly the same risk profile as imitation learning, while potentially being more competitive.
I agree with this in a sense, although I may be quite a bit more harsh about what counts as "executing an action". For example, if reward is based on an overseer talking about the action with a large group of people/AI assistants, then that counts as "executing the action" in the overseer-conversation environment, even if the action looks like it's for some other environment, like a plan to launch a new product in the market. I do ...
It would be safest of all to just not build powerful AI for a very long time. But alas, that seems wildly uncompetitive and so would require some kind of strong global coordination (and would create considerable instability and leave significant value on the table for other worldviews).
It's possible that "human-level AI with CoT" will be competitive enough, but I would guess not.
So to me the obvious approach is to use chain of thought and decomposition to improve performance, and then to distill the result back into the model.
You could try to do distillati...
My guess is that if you hold capability fixed and make a marginal move in the direction of (better LM agents) + (smaller LMs) then you will make the world safer. It straightforwardly decreases the risk of deceptive alignment, makes oversight easier, and decreases the potential advantages of optimizing on outcomes.
Note that Evals has just published a description of some of their work evaluating GPT-4 and Claude. Their publication does not include transcripts, the details of the LM agents they evaluated, or detailed qualitative discussion of the strengths and weaknesses of the agents they evaluated. I believe that eventually Evals should be considerably more liberal about sharing this kind of information; my post is explaining why I believe that.
Yeah, I think sections 2, 3, 4 are probably more important and should maybe have come first in the writeup. (But other people think that 1 dominates.) Overall it's not a very well-constructed post.
At any rate thanks for highlighting this point. For the kinds of interventions I'm discussing (sharing information about LM agent capabilities and limitations) I think there are basically two independent reasons you might be OK with it---either you like sharing capabilities in general, or you like certain kinds of LM agent improvements---and either one is sufficient to carry the day.
Although this is an important discussion I want to emphasize up front that I don't think it's closely related to the argument in the OP. I tried to revise the OP to emphasize that the first section of the article is about LM agent improvements that are relevant to engineering better scaffolding rather than improving our ability to optimize such agents end to end.
I've seen little evidence of this so far, and don't think current LLM performance is even that well-characterized by this. This would be great, but I don't currently think it's true.
If you all...
I do think that right now LMs are by far closest to doing useful work by exploiting human-legible interfaces and decompositions. Chain of thought, simple decompositions, and imitations of human tool use are already important for LM performance. While more complex LM agents add only a small amount of additional value, it seems like extrapolating trends would make them pretty important soon.
Overall I think the world is shaping up extremely far in the direction of "AI systems learn to imitate human cognitive steps and then compose them into impressive p...
Chain of thought, simple decompositions, and imitations of human tool use (along comprehensible interfaces) are already important for LM performance.
I want to separate prompt-engineering from factored cognition. There are various nudges you can use to get LLMs to think in ways that are more productive or well-suited for the task at hand, but this seems quite different to me from truly factored cognition, where you spin up a sub-process that solves a sub-problem, and then propagate that back up to a higher-level process (like Auto-GPT). I don't currently kn...
I changed the section to try to make it a bit more clear that I mean "understanding of LM agents." For the purpose of this post, I am trying to mostly talk about things like understanding the capabilities and limitations of LM agents, and maybe even incidental information about decomposition and prompting that help overcome these limitations. This is controversial because it may allow people to build better agents, but I think this kind of understanding is helpful if people continue to build such agents primarily out of chain of thought and decomposition, while not having much impact on our ability to optimize end-to-end.
This is incorrect, and you're a world class expert in this domain.
What's incorrect? My view that a cheap simulation of arbitrary human experts would be enough to end life as we know it one way or the other?
(In the subsequent text it seems like you are saying that you don't need to match human experts in every domain in order to have a transformative impact, which I agree with. I set the TAI threshold as "economic impact as large as" but believe that this impact will be achieved by systems which are in some respects weaker than human experts and in other respects stronger/faster/cheaper than humans.)
Do you think 30% is too low or too high for July 2033?
Yes. My median is probably 2.5 years to have 10 of the 50 largest US cities where a member of the public can hail a self-driving car (though emphasizing that I don't know anything about the field beyond the public announcements).
Some of these bets had a higher threshold of covering >50% of the commutes within the city, i.e. multiplying fraction of days where it can run due to weather, and fraction of commute endpoints in the service zone. I think Phoenix wouldn't yet count, though a deployment in SF likely will immediately. If you include that requireme...
Yeah, I would say a model that carries out test phishing attacks (when asked to do so) is not misaligned in the relevant sense. I don't think "refuses to carry out phishing attacks when asked" is part of the definition of alignment. (Also note that fine-tuning models to be more willing to carry out phishing isn't part of the evaluations ARC has done so far, the models are just jailbroken in the normal way for the small number of tasks that require it.)
I think relevant forms of "misalignment in the lab" are more like: we created a situation where the model d...
Note that ARC evals haven't done anything I would describe as "try to investigate misalignment in the lab." They've asked AI systems to carry out tasks to understand what they are capable of, e.g. can they make copies of themselves or carry out a targeted phishing attack.
However I also think "create misalignment in the lab" is super important for doing real evaluation of takeover risks. I think the risks are small and the cost-benefit analysis is a slam dunk. I think it's great to actually have the cost-benefit discussion (e.g. by considering concrete ways...
I don't think I disagree with many of the claims in Jan's post; generally I think his high level points are correct.
He lists a lot of things as "reasons for optimism" that I wouldn't consider positive updates (e.g. stuff working that I would strongly expect to work) and doesn't list the analogous reasons for pessimism (e.g. stuff that hasn't worked well yet). Similarly I'm not sure conviction in language models is a good thing but it may depend on your priors.
One potential substantive disagreement with Jan's position is that I'm somewhat more scared ...
1. I think OpenAI is also exploring work on interpretability and on easy-to-hard generalization. I also think that the way Jan is trying to get safety for RRM is fairly different from the argument for correctness of IDA (e.g. it doesn't depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents), even though they both involve decomposing tasks and iteratively training smarter models.
2. I think it's unlikely debate or IDA will scale up indefinitely without major conceptual progress (which...
Discussing the application of heuristic estimators to adversarial training:
Suppose you have a trusted specification C for catastrophic behavior, and you want to train a model that never violates C. For simplicity, I’ll assume that your model M maps observations x to actions y, and C takes an (x, y) pair and determines whether it’s catastrophic. So we want a model for which C(x, M(x)) is very rarely true on the deployment distribution.
You could care about this if you have a trusted weak model which can check for catastrophic behavior given enough...
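To make the setup concrete, here is a minimal sketch of the naive baseline: estimating how often C(x, M(x)) holds by sampling from a stand-in deployment distribution. This is plain Monte Carlo sampling, not a heuristic estimator, and the toy model, specification, and distribution are all hypothetical.

```python
import random


def estimate_catastrophe_rate(model, is_catastrophic, sample_observation,
                              n_samples=10_000, seed=0):
    """Estimate P[C(x, M(x))] by sampling observations x and checking each action."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_samples):
        x = sample_observation(rng)
        if is_catastrophic(x, model(x)):
            failures += 1
    return failures / n_samples


def toy_model(x):
    # Hypothetical M: misbehaves only on a rare slice of inputs.
    return "defect" if x > 0.999 else "comply"


def toy_spec(x, y):
    # Hypothetical C: flags the (x, y) pair as catastrophic when the action is "defect".
    return y == "defect"


def toy_deployment_distribution(rng):
    # Hypothetical deployment distribution: uniform observations in [0, 1).
    return rng.random()


rate = estimate_catastrophe_rate(toy_model, toy_spec, toy_deployment_distribution)
print(f"estimated catastrophe rate: {rate:.4f}")
```

With n samples this approach cannot reliably detect failures much rarer than about 1/n, which is one reason sampling alone is not enough when catastrophes must be extremely rare.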
I think this is a reasonable perception and opinion. We’ve written a little bit about how heuristic estimators might help with ELK (MAD and ELK and finding gliders), but that writing is not particularly clear and doesn’t present a complete picture.
We’ve mostly been focused on finding heuristic estimators, because I am fairly convinced they would be helpful and think that designing them is our key technical risk. But now that we are hiring again I think it’s important for us to explain publicly why they would be valuable, and to generally motivate and situa...
I think US speed limits are so low in part because of an effort to limit demand for oil in 1973; see Wikipedia.
Even setting aside that ignominious origin, I expect that optimal speeds have significantly increased as technology has improved and so speed limits are probably too low because of inertia.