All of paulfchristiano's Comments + Replies

I think US speed limits are so low in part because of an effort to limit demand for oil in 1973; see Wikipedia.

Even setting aside that ignominious origin, I expect that optimal speeds have significantly increased as technology has improved and so speed limits are probably too low because of inertia.

0pku
This assumes speed limits were correctly calibrated at some point. I think the actual cost of road deaths (which are arguably the top single cause of QALY loss even at current historically low rates) is high enough that I suspect it was originally set way too high and is still unreasonably high given the costs.

As a concrete example, suppose that an AI fails to find an error in a piece of code (which will slightly undermine the security of some important computer system). You are unsure if it represents sabotage or a benign error. Relevant evidence for making that distinction:

  • Can the AI identify structurally similar errors in other code? Is the behavior sensitive to evidence about how the code will be used, in a way that doesn't seem relevant to the nature or salience of the error? Is the pattern of sensitivity suggestive of an attempt to disempower humanity?
  • Is t
... (read more)
2Bogdan Ionut Cirstea
And the space of interventions will likely also include using/manipulating model internals, e.g. https://transluce.org/observability-interface, especially since (some kinds of) automated interpretability seem cheap and scalable, e.g. https://transluce.org/neuron-descriptions estimated a cost of < 5 cents / labeled neuron. LM agents have also previously been shown able to do interpretability experiments and propose hypotheses: https://multimodal-interpretability.csail.mit.edu/maia/, and this could likely be integrated with the above. The auto-interp explanations also seem roughly human-level in the references above.  Later edit: maybe also relevant - claim of roughly human-level automated multi-turn red-teaming: https://blog.haizelabs.com/posts/cascade/. Also a demo of integrating mech interp with red-teaming: https://blog.haizelabs.com/posts/steering/. 

One difference is that if you observe a human behaving deceptively you are forced to make educated guesses about their motives and context, but if you catch an AI behaving deceptively you are able to record the exact context in which it acted, examine its behavior in a large number of counterfactual situations, run other AI systems in exactly the same situation, etc.
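As a rough illustration of the kind of analysis this enables (a minimal sketch in Python; `model`, `perturb_context`, and `contains_error` are hypothetical stand-ins, not any real API):

```python
def replay_analysis(model, original_context, perturb_context, contains_error, n=100):
    """Replay a flagged episode under counterfactual contexts.

    Returns the fraction of perturbed contexts in which the model still
    misses the error. A miss rate that is sensitive to how the code will
    be used, rather than to the salience of the error itself, is evidence
    of sabotage over a benign mistake.
    """
    misses = 0
    for seed in range(n):
        context = perturb_context(original_context, seed)  # e.g. vary the stated use of the code
        review = model(context)
        if not contains_error(review):
            misses += 1
    return misses / n
```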

I think this removes most of the leeway that humans use to maintain plausible deniability, especially in high-stakes situations with thorough investigations. Overall I'm more intuitively ... (read more)

7habryka
Hmm, I think we might have differing guesses about how frequently behavior will look suspicious, but I also assign decent probability to the perspective you outline here.  My guess is lots of behavior will be flagged as "maybe doing something deceptive" all the time, and we won't have the resources to analyze every given instance, and even if we can replay the exact situation, examine its behavior in a large number of counterfactual situations, etc., for most of these it will still be really unclear whether the AI was being incompetent, or just got kind of unlucky in how its context shifted it into a more deceptive role-play (but without long-term planning being involved) or was actually doing some more coherent long-term deception plan. I have trouble imagining what kind of evidence one could produce that clearly demonstrates an AI was acting with intentional long-term planning. Maybe a concrete scenario description of someone doing that using something close to present tools on something close to present models would change my mind here.

Taking your numbers at face value, you'd have 1.5 billion passenger hours afflicted by the ban per life saved, or about 3000 lifetimes worth of hours.

Or: if people spent every waking minute of their lives under annoying regulatory requirements about as bad as this one with the same tradeoffs, the benefit would be extending the average lifespan from 77.28 years to 77.29 years.
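Spelling out that arithmetic (a sketch; the 16 waking hours/day figure is my assumption, chosen to roughly reproduce the numbers above):

```python
hours_per_life = 1.5e9  # passenger-hours of restriction per life saved

# One lifetime of waking hours: ~77.28 years * 365 days * 16 waking hours/day.
waking_hours_per_lifetime = 77.28 * 365 * 16  # ~451,000 hours

print(hours_per_life / waking_hours_per_lifetime)  # ~3300, i.e. "about 3000 lifetimes"

# If every waking minute were restricted at this cost-benefit ratio, the payoff
# would be one extra lifetime per ~3300 lifetimes lived:
print(77.28 / (hours_per_life / waking_hours_per_lifetime))  # ~0.02 years, same order as the +0.01 above
```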

I expect most people would demand more like +10 years of lifespan in return for that level of restriction, not +0.01 years. So the cost benefit is probably off by ~3 orders of magnitude... (read more)

I think the term "existential risk" comes from here, where it is defined as: 

Existential risk – One where an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential.

(I think on a plain english reading "existential risk" doesn't have any clear precise meaning. I would intuitively have included e.g. social collapse, but probably wouldn't have included an outcome where humanity can never expand beyond the solar system, but I think Bostrom's definition is also consistent with the vagu... (read more)

It seems fairly unlikely that this specific task will be completed soon for a variety of reasons: it sounds like it technically requires training a new LM that removes all data about Zelda games; it involves a fair amount of videogame-specific engineering hassle; and it's far from anything with obvious economic relevance + games are out of fashion (not because they are too hard). I do still think it will be done before 2033.

If we could find a similar task that was less out of the way then I'd probably be willing to bet on it happening much sooner. Presumab... (read more)

2jessicata
I think it's hard because it requires some planning and puzzle solving in a new, somewhat complex environment. The AI results on Montezuma's Revenge seem pretty unimpressive to me because they're going to a new room, trying random stuff until they make progress, then "remembering" that for future runs. Which means they need quite a lot of training data. For short term RL given lots of feedback, there are already decent results e.g. in starcraft and DOTA. So the difficulty is more figuring out how to automatically scope out narrow RL problems that can be learned without too much training time.

Do you have any hard things that you are confident LLMs won't do soon? (Short of: "autonomously carry out R&D.") Any tasks you think an LM agent won't be able to achieve?

Beat Ocarina of Time with <100 hours of playing Zelda games during training or deployment (but perhaps training on other games), no reading guides/walkthroughs/playthroughs, no severe bug exploits (those that would cut down the required time by a lot), no reward-shaping/advice specific to this game generated by humans who know non-trivial things about the game (but the agent can shape its own reward). Including LLM coding a program to do it. I'd say probably not by 2033.

I can't tell if you think these problems will remain hard for the model, and if so why.

I think 70% that an LM agent can do the 4x4 grid example by EOY 2024 because it seems pretty easy. I'd update if that was wrong. (And I'd be fine replacing that by held out examples of similar complexity.)

Will you be updating your picture if it can do these tasks by EOY? How much have you updated in the last few years? I feel like 2018 Paul was pretty surprised by how good ChatGPT is now (its Turing test ability is maybe ~85th percentile of my forecasts), and that in 2018 you were at least qualitatively trying to argue in the opposite direction.

4jessicata
I think they will remain hard by EOY 2024, as in, of this problem and the 7 held-out ones of similar difficulty, the best LLM will probably not solve 4/8. I think I would update some on how fast LLMs are advancing but these are not inherently very hard problems so I don't think it would be a huge surprise, this was meant to be one of the easiest things they fail at right now. Maybe if that happens I would think things are going 1.6x as fast short term as I would have otherwise thought? I was surprised by GPT3/3.5 but not so much by 4, I think it adds up to on net an update that LLMs are advancing faster than I thought, but I haven't much changed my long-term AGI timelines, because I think that will involve lots of techs not just LLMs, although LLM progress is some update about general tech progress.

I don't see how that's a valid interpretation of the rules. Isn't it checking to find that there is at least one 2x repetition and at least one 3x repetition? Whereas the request was exactly two of each.
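For concreteness, here is the check I have in mind under the "exactly two of each" reading, counting maximal runs of identical adjacent words (a sketch in Python):

```python
from itertools import groupby

def satisfies(words):
    runs = [len(list(g)) for _, g in groupby(words)]  # lengths of maximal runs
    return (len(words) == 20
            and sum(r == 2 for r in runs) == 2   # exactly two double-runs
            and sum(r == 3 for r in runs) == 2)  # exactly two triple-runs

# e.g. this sequence satisfies the constraints:
print(satisfies("a a b c c d e e e f g g g h i j k l m n".split()))  # True
```

The "at least one" reading would replace the two equality checks with `any(...)`, which is a much weaker condition.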

I'm glad you have held out problems, and I think it would be great if you had a handful (like 3) rather than just one. (If you have 5-10 it would also be cool to plot the success rate going up over time as ChatGPT improves.)

Here is the result of running your prompt with a generic system prompt (asking for an initial answer + refinement). It fails to meet the corner condition (and perplexingly says "The four corners (top left 'A', top right 'A', bottom left 'A', bottom right 'B') are distinct."). When I point out that the four corners aren't distinct it fix... (read more)
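(The corner condition itself is trivial to verify mechanically; a sketch, assuming the grid is given as a list of strings, with the example grid purely illustrative:)

```python
def corners_distinct(grid):
    # grid: list of equal-length strings, e.g. ["AABB", "BCCA", "CADB", "DBAC"]
    corners = [grid[0][0], grid[0][-1], grid[-1][0], grid[-1][-1]]
    return len(set(corners)) == 4
```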

3jessicata
I've added 6 more held-out problems for a total of 7. Agree that getting the answer without pointing out problems is the right standard.

Find a sequence of words that is:
  • 20 words long
  • contains exactly 2 repetitions of the same word twice in a row
  • contains exactly 2 repetitions of the same word thrice in a row

Here is its attempt. I add the usual boilerplate about being fine to think before answering. First it gives a valid sequence using letters instead of words. I ask for words instead of letters and then it gives a sequence that is only 18 words long. I ask for 20 words and then it finally gets it.

Here's a second try where I use a disambiguated version of your prompt (without boilerplat... (read more)

6Phib
I don’t know [if I understand] full rules so don’t know if this satisfies, but here: https://chat.openai.com/share/0089e226-fe86-4442-ba07-96c19ac90bd2
2[anonymous]
I tried it.  This run it wrote a python program to solve it correctly, or at least with a valid interpretation of the rules. https://chat.openai.com/share/ee129414-58d5-41af-9a18-fde2b921b45b In other runs it guessed a sequence with tokens. It's almost like the model needs some kind of introspection, where it can learn when a given tool is more or less likely to produce a correct result, and then produce a solution with that strategy every run.   Running the prompt several times over resulted in it guessing the answer, writing a different python program, using placeholder words, and so on.  As a user we want the maximum probability of the correct answer.
3jessicata
Here's the harder problem. I've also held out a third problem without posting it online.

I wrote a fair amount about alignment from 2014-2020[1] which you can read here. So it's relatively easy to get a sense for what I believed.

Here are some summary notes about my views as reflected in that writing, though I'd encourage you to just judge for yourself[2] by browsing the archives:

  • I expected AI systems to be pretty good at predicting what behaviors humans would rate highly, long before they were catastrophically risky. This comes up over and over again in my writing. In particular, I repeatedly stated that it was very unlikely that an
... (read more)
6Matthew Barnett
I agree, your past views do look somewhat better. I painted alignment researchers with a fairly broad brush in my original comment, which admittedly might have been unfair to many people who departed from the standard arguments (alternatively, it gives those researchers a chance to step up and receive credit for having been in the minority who weren't wrong). Partly I portrayed the situation like this because I have the sense that the crucial elements of your worldview that led you to be more optimistic were not disseminated anywhere close to as widely as the opposite views (e.g. "complexity of wishes"-type arguments), at least on LessWrong, which is where I was having most of these discussions. My general impression is that it sounds like you agree with my overall take although you think I might have come off too strong. Perhaps let me know if I'm wrong about that impression.

I think that's reasonable, this is the one with the discussion and it has a forward link, would be better to review them as a unit.

I think the dominance principle used in this post is too strong and relatively easy to deny. I think that the post "Better impossibility results for unbounded utilities" is actually significantly better.

2niplav
I guess in the context of the Review I consider the two posts as one.
4Raemon
This seems useful to be flagged as a review, so it shows up in some review UI later. Mind if I convert it? (You can create reviews by clicking the Review button at the top of the post)

I clarified my views here because people kept misunderstanding or misquoting them.

The grandparent describes my probability that humans irreversibly lose control of AI systems, which I'm still guessing at 10-20%. I should probably think harder about this at some point and revise it, I have no idea which direction it will move.

I think the tweet you linked is referring to the probability for "humanity irreversibly messes up our future within 10 years of building human-level AI." (It's presented as "probability of AI killing everyone" which is not really right... (read more)

If this is what's going on, then I basically can't imagine any context in which I would want someone to read the OP rather than a post showing examples of LM agents achieving goals and saying "it's already the case that LM agents want things, more and more deployments of LMs will be agents, and those agents will become more competent such that it would be increasingly scary if they wanted something at cross-purposes to humans." Is there something I'm missing?

I think your interpretation of Nate is probably wrong, but I'm not sure and happy to drop it.

2Daniel Kokotajlo
FWIW, your proposed pitch "it's already the case that..." is almost exactly the elevator pitch I currently go around giving. So maybe we agree? I'm not here to defend Nate's choice to write this post rather than some other post.  

If you use that definition, I don't understand in what sense LMs don't "want" things---if you prompt them to "take actions to achieve X" then they will do so, and if obstacles appear they will suggest ways around them, and if you connect them to actuators they will frequently achieve X even in the face of obstacles, etc. By your definition isn't that "want" or "desire" like behavior? So what does it mean when Nate says "AI doesn't seem to have all that much "want"- or "desire"-like behavior"?

I'm genuinely unclear what the OP is asserting at that point, and... (read more)

3Jozdien
Two separate thoughts, based on my understanding of what the OP is gesturing at (which may not be what Nate is trying to say, but oh well):

  • Using that definition, LMs do "want" things, but: the extent to which it's useful to talk about an abstraction like "wants" or "desires" depends heavily on how well the system can be modelled that way. For a system that manages to competently and robustly orient itself around complex obstacles, there's a notion of "want" that's very strong. For a system that's slightly weaker than that - well, it's still capable of orienting itself around some obstacles so there's still a notion of "want", but it's correspondingly weaker, and plausibly less useful. And insofar as you're trying to assert something about some particular behaviour of danger arising from strong wants, systems with weaker wants wouldn't update you very much. Unfortunately, this does mean you have to draw arbitrary lines about where strong wants are and making that more precise is probably useful, but doesn't seem to inherently be an argument against it. (To be clear though, I don't buy this line of reasoning to the extent I think Nate does).
  • On the ability to solve long-horizon tasks: I think of it as a proxy measure for how strong your wants are (where the proxy breaks down because diverse contexts and complex obstacles aren't perfectly correlated with long-horizon tasks, but might still be a pretty good one in realistic scenarios). If you have cognition that can robustly handle long-horizon tasks, then one could argue that this can measure by proxy how strongly-held and coherent its objectives are, and correspondingly how capable it is of (say) applying optimization power to break control measures you place on it or breaking the proxies you were training on. More concretely: I expect that one answer to this might anchor to "wants at least as strong as a human's", in which case AI systems 10x-ing the pace of R&D autonomously would definitely suf

I would say that current LLMs, when prompted and RLHF'd appropriately, and especially when also strapped into an AutoGPT-type scaffold/harness, DO want things. I would say that wanting things is a spectrum and that the aforementioned tweaks (appropriate prompting, AutoGPT, etc.) move the system along that spectrum. I would say that future systems will be even further along that spectrum. IDK what Nate meant but on my charitable interpretation he simply meant that they are not very far along the spectrum compared to e.g. humans or prophecied future AGIs.

It'... (read more)

If your AI system "wants" things in the sense that "when prompted to get X it proposes good strategies for getting X that adapt to obstacles," then you can control what it wants by giving it a different prompt. Arguments about AI risk rely pretty crucially on your inability to control what the AI wants, and your inability to test it. Saying "If you use an AI to achieve a long-horizon task, then the overall system definitionally wanted to achieve that task" + "If your AI wants something, then it will undermine your tests and safety measures" seems like a sl... (read more)

6Daniel Kokotajlo
I agree that arguments for AI risk rely pretty crucially on human inability to notice and control what the AI wants. But for conceptual clarity I think we shouldn't hardcode that inability into our definition of 'wants!' Instead I'd say that So8res is right that ability to solve long-horizon tasks is correlated with wanting things / agency, and then say that there's a separate question of how transparent and controllable the wants will be around the time of AGI and beyond. This then leads into a conversation about visible/faithful/authentic CoT, which is what I've been thinking about for the past six months and which is something MIRI started thinking about years ago. (See also my response to Ryan elsewhere in this thread)  

When the post says:

This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an AI as having wants/desires "in the behaviorist sense".

It seems like it's saying that if you prompt an LM with "Could you suggest a way to get X in light of all the obstacles that reality has thrown in my way," and if it does that reasonably well and if you hook it up to actuators, then it definitionally has wants and desires.

Which is a fine definition to pick. But the point is that in thi... (read more)

porby102

But the point is that in this scenario the LM doesn't want anything in the behaviorist sense, yet is a perfectly adequate tool for solving long-horizon tasks. This is not the form of wanting you need for AI risk arguments.

My attempt at an ITT-response:

Drawing a box around a goal agnostic LM and analyzing the inputs and outputs of that box would not reveal any concerning wanting in principle. In contrast, drawing a box around a combined system—e.g. an agentic scaffold that incrementally asks a strong inner goal agnostic LM to advance the agent's process—cou... (read more)

Differences:

  • I don't buy the story about long-horizon competence---I don't think there is a compelling argument, and the underlying intuitions seem like they are faring badly. I'd like to see this view turned into some actual predictions, and if it were I expect I'd disagree.
  • Calling it a "contradiction" or "extreme surprise" to have any capability without "wanting" looks really wrong to me.
  • Nate writes: 

This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an

... (read more)
7Daniel Kokotajlo
Thanks for the response. I'm still confused but maybe that's my fault. FWIW I think my view is pretty similar to Nate's probably, though I came to it mostly independently & I focus on the concept of agents rather than the concept of wanting. (For more on my view, see this sequence.)  I definitely don't endorse "it's extremely surprising for there to be any capabilities without 'wantings'" and I expect Nate doesn't either. What do you think is the sense of "wanting" needed for AI risk arguments? Why is the sense described above not enough?
paulfchristianoΩ369339

Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing? [...] And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior? [...] Well, I claim that these are more-or-less the same fact.

It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to ar... (read more)

2Max H
One reason that current AI systems aren't a big update about this for me is that they're not yet really automating stuff that couldn't in-principle be automated with previously-existing technology. Or at least the kind of automation isn't qualitatively different. Like, there's all sorts of technologies that enable increasing amounts of automation of long-horizon tasks that aren't AI: assembly lines, industrial standardization, control systems, robotics, etc. But what update are we supposed to make from observing language model performance that we shouldn't also make from seeing a control system-based autopilot fly a plane for longer and longer periods in more and more diverse situations?

To me, the fact that LLMs are not want-y (in the way that Nate means), but can still do some fairly impressive stuff is mostly evidence that the (seemingly) impressive stuff is actually kinda easy in some absolute sense. So LLMs have updated me pretty strongly towards human-level+ AGI being relatively easier to achieve, but not much towards current LLMs themselves actually being near human-level in the relevant sense, or even necessarily a direct precursor or path towards it. These updates are mostly due to the fact that the way LLMs are designed and trained (giant gradient descent on regular architectures using general datasets) works at all, rather than from any specific impressive technological feat that they can already be used to accomplish, or how much economic growth they might enable in the future.

So I somewhat disagree about the actual relevance of the answer, but to give my own response to this question: I don't expect an AI system to be able to reliably trade for itself in the way I outline here before it is want-y. If it somehow becomes commonplace to negotiate with an AI in situations where the AI is not just a proxy for its human creator or a human-controlled organization, I predict those AIs will pretty clearly be want-y. They'll want whatever they trade for

I am confused what your position is, Paul, and how it differs from So8res' position. Your statement of your position at the end (the bit about how systems are likely to end up wanting reward) seems like a stronger version of So8res' position, and not in conflict with it. Is the difference that you think the main dimension of improvement driving the change is general competence, rather than specifically long-horizon-task competence?

6Jonas Hallgren
Alright, I will try to visualise what I see as the disagreement here.  It seems to me that Paul is saying that behaviourist abstractions will happen in smaller time periods than long time horizons. (Think of these shards as in the shard theory sense) Nate is saying that the right picture creates stable wants more than the left and Paul is saying that it is time-agnostic and that the relevant metric is how competent the model is.  The crux here is essentially whether longer time horizons are indicative of behaviourist shard formation.  My thought here is that the process in the picture to the right induces more stable wants because a longer time horizon system is more complex, and therefore heuristics is the best decision rule. The complexity is increased in such a way that it is a large enough difference between short-term tasks and long-term tasks. Also, the Redundant Information Hypothesis might give credence to the idea that systems will over time create more stable abstractions?   
dxu*Ω244817

It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether if you ask that system "Which action would maximize the expected amount of Y?" whether it will also be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.

Here's an existing Nate!comment that I find reasonably... (read more)

5Seth Herd
Excellent first point. I can come up with plans for destroying the world without wanting to do it, and other cognitive systems probably can too. I do need to answer that question using a goal-oriented search process. But my goal would be "answer Paul's question", not "destroy the world". Maybe a different type of system could do it with no goal whatsoever, but that's not clear. But I'm puzzled by your statement. Perhaps you mean LLMs/predictive foundation models?

It might be worth making a choice about a single move which is unclear to weak players but where strong players have a consensus. 

Mostly I think it would be faster and I think a lot less noisy per minute. I also think it's a bit unrepresentative to be able to use "how well did this advisor's suggestions work out in hindsight?" to learn which advisors are honest and so it's nice to make the dishonest advisors' job easier.

(In practice I think evaluating what worked well in hindsight is going to be very valuable, and is already enough for crazy research ... (read more)

I don't think you need to reliably classify a system as safe or not.  You need to apply consistent standards that output "unsafe" in >90% of cases where things really are unsafe.
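Spelling out the arithmetic behind that (a sketch, treating missed detection as the only path by which an unsafe system gets deployed):

$$P(\text{catastrophe}) \approx P(\text{unsafe}) \cdot P(\text{classified safe} \mid \text{unsafe}) \le 0.1 \cdot P(\text{unsafe}),$$

so standards that output "unsafe" in >90% of the genuinely unsafe cases already give the ~10x risk reduction discussed below, without ever needing a reliable verdict on any individual system.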

I think I'm probably imagining better implementation than you, probably because (based on context) I'm implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium. I think what I'm describing as "very good RSPs" and imagining cutting risk 10x still requires significantly less political will than a global morato... (read more)

4elifland
This seems to be our biggest crux, as I said interested in analyses of alignment difficulty distribution if any onlookers know. Also, a semantic point but under my current views I'd view cutting ~5% of the risk as a huge deal that's at least an ~80th percentile outcome for the AI risk community if it had a significant counterfactual impact on it, but yes not much compared to 10x. [EDIT: After thinking about this more I've realized that I was to some extent conflating my intuition that it will be hard for the x-risk community to make a large counterfactual impact on x-risk % with the intuition that +3 OOMs of effort doesn't cut more than ~5% of the risk. I haven't thought much about exact numbers but now maybe ~20% seems reasonable to me now] Quick thoughts on the less cruxy stuff: Fair, though I think 90% would be too low and the more you raise the longer you have to maintain the pause. This might coincidentally be close to the 95th percentile I had in mind. Fair, I think I was wrong on that point. (I still think it's likely there would be various other difficulties with enforcing either RSPs or a moratorium for an extended period of time, but I'm open to changing mind) Sorry if I wasn't clear: my worry is that open-source models will get better over time due to new post-training enhancements, not about their capabilities upon release.

I think politically realistic hardware controls could buy significant time, or be used to push other jurisdictions to implement appropriate regulation and allow for international verification if they want access to hardware. This seems increasingly plausible given the United States' apparent willingness to try to control access to hardware (e.g. see here).

Which laxer jurisdictions are poised to capture talent/hardware/etc. right now? It seems like 'The West' (interpreted as Silicon Valley) is close to the laxest jurisdiction on Earth when it comes to tech! (If we interpret 'The West' more broadly, this no longer holds, thankfully.)

If you implemented a unilateral pause on AI training runs in the West, then anyone who wasn't pausing AI would be a much laxer jurisdiction.

Regarding the situation today, I don't believe that any jurisdiction has regulations that meaningfully reduce catastrophic risk, but that the... (read more)

I don't think an RSP will be able to address these risks, and I think very few AI policies would address these risks either. An AI pause could address them primarily by significantly slowing human technological development, and if that happened today I'm not even really sure these risks are getting better at an appreciable rate (if the biggest impact is the very slow thinking from a very small group of people who care about them, then I think that's a very small impact). I think that in that regime random political and social consequences of faster or slower te... (read more)

2Wei Dai
I guess from my perspective, the biggest impact is the possibility that the idea of better preparing for these risks becomes a lot more popular. An analogy with Bitcoin comes to mind, where the idea of cryptography-based distributed money languished for many years, known only to a tiny community, and then was suddenly everywhere. An AI pause would provide more time for something like that to happen. And if the idea of better preparing for these risks was actually a good one (as you seem to think), there's no reason why it couldn't (or was very unlikely to) spread beyond a very small group, do you agree?
paulfchristiano*Ω101410

I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.

I do not know whether this kind of conditional pause is feasible even given that evidence. That said I think it's much more feasible to get such a pause as a result of good safety standards together with significant evidence of hazardous capabilities and alignment difficulty, and the 10x risk reduction is reflecting the probability that you are able to get that kind of evidence in advance of a catastrophe (but conditioning on a very go... (read more)

3elifland
Appreciate this clarification. I'm still confused about the definition of "very good RSPs" and "very good implementation" here. If the evals/mitigations are defined and implemented in some theoretically perfect way by all developers of course it will lead to drastically reduced risk, but "very good" has a lot of ambiguity. I was taking it to mean something like "~95th percentile of the range of RSPs we could realistically hope to achieve before doom", but you may have meant something different. It's still very hard for me to see how under the definition I've laid out we could get to a 10x reduction. Even just priors on how large effect sizes of interventions are feels like it brings it under 10x unless there are more detailed arguments given for 10x, but I'll give some more specific thoughts below. I agree directionally with all of the claims you are making, but (a) I'd guess I have much less confidence than you that even applying very large amounts of effort / accumulated knowledge we will be able to reliably classify a  system as safe or not (especially once it is getting close to and above human-level) and (b) even if we could after several years do this reliably, if you have to do a many-year pause there are various other sources of risk like countries refusing to join / pulling out of the pause and risks from open-source models including continued improvements via fine-tuning/scaffolding/etc. Yeah normal-ish was a bad way to put it. I'm skeptical that 3 marginal OOMs is significantly more than ~5% probability to tip the scales but this is just intuition (if anyone knows of projects on the distribution of alignment difficulty, would be curious). I agree that automating alignment is important and that's where a lot of my hope comes from. [EDIT: After thinking about this more I've realized that I was to some extent conflating my intuition that it will be hard for the x-risk community to make a large counterfactual impact on x-risk % with the intuition that +3
paulfchristianoΩ152611

Relatedly, I thought Managing AI Risks in an Era of Rapid Progress was great, particularly the clear statement that this is an urgent priority and the governance recommendations.

On a first reading I feel like I agree with most everything that was said, including about RSPs and the importance of regulation.

Small caveats: (i) I don't know enough to understand the implications or comment on the recommendation "they should also hold frontier AI developers and owners legally accountable for harms from their models that can be reasonably foreseen and prevented,"... (read more)

paulfchristianoΩ1018-2

Unknown unknowns seem like a totally valid basis for concern.

But I don't think you get to move the burden of proof by fiat. If you want action then you need to convince relevant actors they should be concerned about them, and that unknown unknowns can cause catastrophe before a lab will stop.  Without further elaboration I don't think "unknown unknowns could cause a catastrophe" is enough to convince governments (or AI developers) to take significant actions.

I think RSPs make this situation better by pushing developers away from vague "Yeah we'll be s... (read more)

Joe CollmanΩ3105

The specific conversation is much better than nothing - but I do think it ought to be emphasized that solving all the problems we're aware of isn't sufficient for safety. We're training on the test set.[1]
Our confidence levels should reflect that - but I expect overconfidence.

It's plausible that RSPs could be net positive, but I think that given successful coordination [vague and uncertain] beats [significantly more concrete, but overconfident].
My presumption is that without good coordination (a necessary condition being cautious decision-makers), things w... (read more)

paulfchristianoΩ6113375

Here is a short post explaining some of my views on responsible scaling policies, regulation, and pauses. I wrote it last week in response to several people asking me to write something. Hopefully this helps clear up what I believe.

I don’t think I’ve ever hidden my views about the dangers of AI or the advantages of scaling more slowly and carefully. I generally aim to give honest answers to questions and present my views straightforwardly. I often point out that catastrophic risk would be lower if we could coordinate to build AI systems later and slower; I ... (read more)

That's fair, I think I misread you.

I guess our biggest differences are (i) I don't think the takeaway depends so strongly on whether AI developers are trying to do the right thing---either way it's up to all of us, and (ii) I think it's already worth talking about ways in which Anthropic's RSP is good or bad or could be better, and so I disagree with "there's probably not much to say at this point."

paulfchristianoΩ142216

But I also suspect that people on the more cynical side aren't going to be persuaded by a post like this. If you think that companies are pretending to care about safety but really are just racing to make $$, there's probably not much to say at this point other than, let's see what happens next.

This seems wrong to me. We can say all kinds of things, like:

  • Are these RSPs actually effective if implemented? How could they be better? (Including aspects like: how will this policy be updated in the future? What will happen given disagreements?)
  • Is there external v
... (read more)
9Dave Orr
I agree with all of this. It's what I meant by "it's up to all of us." It will be a signal of how things are going if in a year we still have only vague policies, or if there has been real progress in operationalizing the safety levels, detection, what the right reactions are, etc.

The role of the Trust is to elect (and potentially replace) board members; its formal power comes entirely from the fact that it will eventually elect a majority of the board seats.

The post mentions a "failsafe" where a supermajority of investors can amend this arrangement, which I think is a reasonable compromise. But I'm not aware of any public information about what that supermajority is, or whether there are other ways the Trust's formal powers could be reduced.

Dylan Matthews reports the members of the board here: Dario, Daniela, Luke Muehlhauser, and Yasmin Razavi. (I think it's also listed plenty of other places.)

We intend to leave this prize open until the end of September. At that point we will distribute prizes (probably just small prizes for useful arguments and algorithms, but no full solution).

I now pretty strongly suspect that the version of problem 1 with logarithmic dependence on  is not solvable. We would award a prize for an algorithm running in time  which can distinguish matrices with no PSD completion from those with a completion where the ratio of min to max eigenvalue is at least . And of course a lower bound is st... (read more)
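(For reference, the certificate on the completable side is cheap to verify once you are handed a candidate completion; the prize problem is doing the distinguishing without one. A numpy sketch of the verification, where `eps` is the eigenvalue-ratio threshold from the problem statement:)

```python
import numpy as np

def is_good_completion(A, eps):
    """Check that symmetric matrix A is PSD with min/max eigenvalue ratio >= eps."""
    eigs = np.linalg.eigvalsh(A)  # eigenvalues in ascending order
    return eigs[0] > 0 and eigs[0] / eigs[-1] >= eps
```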

By process-based RL, I mean: the reward for an action doesn't depend on the consequences of executing that action. Instead it depends on some overseer's evaluation of the action, potentially after reading justification or a debate about it or talking with other AI assistants or whatever. I think this has roughly the same risk profile as imitation learning, while potentially being more competitive.
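The distinction, as code (a sketch; `env`, `overseer_score`, and the function names are hypothetical stand-ins):

```python
# Outcome-based RL: reward depends on the consequences of executing the action.
def outcome_reward(env, state, action):
    next_state = env.execute(state, action)
    return env.utility(next_state)

# Process-based RL: reward depends only on an overseer's evaluation of the
# proposed action (perhaps after reading a justification or debate about it),
# assigned before the action is ever executed.
def process_reward(overseer_score, state, action, justification):
    return overseer_score(state, action, justification)
```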

I'm generally excited and optimistic about coordination. If you are just saying that AI non-proliferation isn't that much harder than nuclear non-proliferation, t... (read more)

michaelcohenΩ6130

I think [process-based RL] has roughly the same risk profile as imitation learning, while potentially being more competitive.

I agree with this in a sense, although I may be quite a bit a more harsh about what counts as "executing an action". For example, if reward is based on an overseer talking about the action with a large group of people/AI assistants, then that counts as "executing the action" in the overseer-conversation environment, even if the action looks like it's for some other environment, like a plan to launch a new product in the market. I do ... (read more)

It would be safest of all to just not build powerful AI for a very long time. But alas, that seems wildly uncompetitive and so would require some kind of strong global coordination (and would create considerable instability and leave significant value on the table for other worldviews).

It's possible that "human-level AI with CoT" will be competitive enough, but I would guess not.

So to me the obvious approach is to use chain of thought and decomposition to improve performance, and then to distill the result back into the model.

You could try to do distillati... (read more)

8michaelcohen
What is process-based RL? I think your intuitions about costly international coordination are challenged by a few facts about the world.

1) Advanced RL, like open borders + housing deregulation, guarantees vast economic growth in wealthy countries. Open borders, in a way that seems kinda speculative but intuitively forceful for most people, has the potential to existentially threaten the integrity of a culture, including especially its norms; AI, in a way that seems kinda speculative but intuitively forceful for most people, has the potential to existentially threaten all life. The decisions of wealthy countries are apparently extremely strongly correlated, maybe in part for "we're all human"-type reasons, and maybe in part because legislators and regulators know that they won't get their ear chewed off for doing things like the US does. With immigration law, there is no attempt at coordination; quite the opposite (e.g. Syrian refugees in the EU).

2) The number of nuclear states is stunningly small if one follows the intuition that wildly uncompetitive behavior, which leaves significant value on the table, produces an unstable situation. Not every country needs to sign on eagerly to avoiding some of the scariest forms of AI. The US/EU/China can shape other countries' incentives quite powerfully.

3) People in government do not seem to be very zealous about economic growth. Sorry this isn't a very specific example. But their behavior on issue after issue does not seem very consistent with someone who would see, I don't know, 25% GDP growth from their country's imitation learners, and say, "these international AI agreements are too cautious and are holding us back from even more growth"; it seems much more likely to me that politicians' appetite for risking great power conflict requires much worse economic conditions than that.

In cases 1 and 2, the threat is existential, and countries take big measures accordingly. So I think existing mechanisms

My guess is that if you hold capability fixed and make a marginal move in the direction of (better LM agents) + (smaller LMs) then you will make the world safer. It straightforwardly decreases the risk of deceptive alignment, makes oversight easier, and decreases the potential advantages of optimizing on outcomes.

paulfchristianoΩ102117

Note that Evals has just published a description of some of their work evaluating GPT-4 and Claude. Their publication does not include transcripts, the details of the LM agents they evaluated, or detailed qualitative discussion of the strengths and weaknesses of the agents they evaluated. I believe that eventually Evals should be considerably more liberal about sharing this kind of information; my post is explaining why I believe that.

Yeah, I think sections 2, 3, 4 are probably more important and should maybe have come first in the writeup. (But other people think that 1 dominates.) Overall it's not a very well-constructed post.

At any rate thanks for highlighting this point. For the kinds of interventions I'm discussing (sharing information about LM agent capabilities and limitations) I think there are basically two independent reasons you might be OK with it---either you like sharing capabilities in general, or you like certain kinds of LM agent improvements---and either one is sufficient to carry the day.

Although this is an important discussion I want to emphasize up front that I don't think it's closely related to the argument in the OP. I tried to revise the OP to emphasize that the first section of the article is about LM agent improvements that are relevant to engineering better scaffolding rather than improving our ability to optimize such agents end to end.

I've seen little evidence of this so far, and don't think current LLM performance is even that well-characterized by this. This would be great, but I don't currently think it's true.

If you all... (read more)

paulfchristianoΩ102211

I do think that right now LMs are by far closest to doing useful work by exploiting human-legible interfaces and decompositions. Chain of thought, simple decompositions, and imitations of human tool use are already important for LM performance.  While more complex LM agents add only a small amount of additional value, it seems like extrapolating trends would make them pretty important soon.

Overall I think the world is shaping up extremely far in the direction of "AI systems learn to imitate human cognitive steps and then compose them into impressive p... (read more)

1Bogdan Ionut Cirstea
Some evidence in favor: https://x.com/YangjunR/status/1793681241398788319 (for increasingly wide gap between LM single forward pass ('snap judgment') and CoT), https://xwang.dev/mint-bench/ (for tool use being increasingly useful, with both model scale, and number of tool use turns).
habrykaΩ8238

Chain of thought, simple decompositions, and imitations of human tool use (along comprehensible interfaces) are already important for LM performance.

I want to separate prompt-engineering from factored cognition. There are various nudges you can use to get LLMs to think in ways that are more productive or well-suited for the task at hand, but this seems quite different to me from truly factored cognition, where you spin up a sub-process that solves a sub-problem, and then propagate that back up to a higher-level process (like Auto-GPT). I don't currently kn... (read more)
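(Concretely, the factored-cognition pattern being described looks roughly like this; a sketch, with `llm` standing in for a hypothetical text-completion function:)

```python
def solve(task, llm, depth=0, max_depth=3):
    # Answer directly if the task is simple enough (or we hit the depth limit).
    if depth == max_depth or llm(f"Can you answer directly: {task}? yes/no") == "yes":
        return llm(f"Answer: {task}")
    # Otherwise spin up sub-processes for sub-problems and propagate the
    # results back up to the higher-level process (Auto-GPT-style).
    subtasks = llm(f"List the subtasks for: {task}").splitlines()
    results = [solve(s, llm, depth + 1, max_depth) for s in subtasks]
    return llm(f"Combine these results to answer {task}: {results}")
```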

I changed the section to try to make it a bit more clear that I mean "understanding of LM agents." For the purpose of this post, I am trying to mostly talk about things like understanding the capabilities and limitations of LM agents, and maybe even incidental information about decomposition and prompting that help overcome these limitations. This is controversial because it may allow people to build better agents, but I think this kind of understanding is helpful if people continue to build such agents primarily out of chain of thought and decomposition, while not having much impact on our ability to optimize end-to-end.

This is incorrect, and you're a world class expert in this domain.

What's incorrect? My view that a cheap simulation of arbitrary human experts would be enough to end life as we know it one way or the other?

(In the subsequent text it seems like you are saying that you don't need to match human experts in every domain in order to have a transformative impact, which I agree with. I set the TAI threshold as "economic impact as large as" but believe that this impact will be achieved by systems which are in some respects weaker than human experts and in other respects stronger/faster/cheaper than humans.)

Do you think 30% is too low or too high for July 2033?


Yes. My median is probably 2.5 years to have 10 of the 50 largest US cities where a member of the public can hail a self-driving car (though emphasizing that I don't know anything about the field beyond the public announcements).

Some of these bets had a higher threshold of covering >50% of the commutes within the city, i.e. multiplying fraction of days where it can run due to weather, and fraction of commute endpoints in the service zone. I think Phoenix wouldn't yet count, though a deployment in SF likely will immediately. If you include that requireme... (read more)

Yeah, I would say a model that carries out test phishing attacks (when asked to do so) is not misaligned in the relevant sense. I don't think "refuses to carry out phishing attacks when asked" is part of the definition of alignment. (Also note that fine-tuning models to be more willing to carry out phishing isn't part of the evaluations ARC has done so far, the models are just jailbroken in the normal way for the small number of tasks that require it.)

I think relevant forms of "misalignment in the lab" are more like: we created a situation where the model d... (read more)

Note that ARC evals haven't done anything I would describe as "try to investigate misalignment in the lab." They've asked AI systems to carry out tasks to understand what they are capable of, e.g. can they make copies of themselves or carry out a targeted phishing attack.

However I also think "create misalignment in the lab" is super important for doing real evaluation of takeover risks. I think the risks are small and the cost-benefit analysis is a slam dunk. I think it's great to actually have the cost-benefit discussion (e.g. by considering concrete ways... (read more)

2Daniel Kokotajlo
Nitpick: When you fine-tune a model to try to carry out phishing attacks, how is that not deliberately creating a misaligned model? Because it's doing what you wanted it to do? Well in that case it's impossible to deliberately create a misaligned model, no?

I don't think I disagree with many of the claims in Jan's post, generally I think his high level points are correct.

He lists a lot of things as "reasons for optimism" that I wouldn't consider positive updates (e.g. stuff working that I would strongly expect to work) and doesn't list the analogous reasons for pessimism (e.g. stuff that hasn't worked well yet).  Similarly I'm not sure conviction in language models is a good thing but it may depend on your priors.

One potential substantive disagreement with Jan's position is that I'm somewhat more scared ... (read more)

1. I think OpenAI is also exploring work on interpretability and on easy-to-hard generalization. I also think that the way Jan is trying to get safety for RRM is fairly different from the argument for correctness of IDA (e.g. it doesn't depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents), even though they both involve decomposing tasks and iteratively training smarter models.

2. I think it's unlikely debate or IDA will scale up indefinitely without major conceptual progress (which... (read more)

2Wei Dai
What is the latest thinking/discussion about this? I tried to search LW/AF but haven't found a lot of discussions, especially positive arguments for HCH being good. Do you have any links or docs you can share? How do you think about the general unreliability of human reasoning (for example, the majority of professional decision theorists apparently being two-boxers and favoring CDT, and general overconfidence of almost everyone on all kinds of topics, including morality and meta-ethics and other topics relevant for AI alignment) in relation to HCH? What are your guesses for how future historians would complete the following sentence? Despite human reasoning being apparently very unreliable, HCH was a good approximation target for AI because ... I'm curious if you have an opinion on where the burden of proof lies when it comes to claims like these. I feel like in practice it's up to people like me to offer sufficiently convincing skeptical arguments if we want to stop AI labs from pursuing their plans (since we have little power to do anything else) but morally shouldn't the AI labs have much stronger theoretical foundations for their alignment approaches before e.g. trying to build a human-level alignment researcher in 4 years? (Because if the alignment approach doesn't work, we would either end up with an unaligned AGI or be very close to being able to build AGI but with no way to align it.)
8Wei Dai
Thanks for this helpful explanation. Can you point me to the original claims? While trying to find it myself, I came across https://aligned.substack.com/p/alignment-optimism which seems to be the most up to date explanation of why Jan thinks his approach will work (and which also contains his views on the obfuscated arguments problem and how RRM relates to IDA, so should be a good resource for me to read more carefully). Are you perhaps referring to the section "Evaluation is easier than generation"? Do you have any major disagreements with what's in Jan's post? (It doesn't look like you publicly commented on either Jan's substack or his AIAF link post.)

Discussing the application of heuristic estimators to adversarial training:

Suppose you have a trusted specification C for catastrophic behavior, and you want to train a model that never violates C. For simplicity, I’ll assume that your model M maps observations x to actions y, and C takes an (x, y) pair and determines whether it’s catastrophic. So we want a model for which C(x, M(x)) is very rarely true on the deployment distribution.

You could care about this if you have a trusted weak model which can check for catastrophic behavior given enough... (read more)
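(Stated as code for concreteness; a sketch where `C`, `M`, and `deployment_dist` are hypothetical callables for the objects defined above:)

```python
def catastrophe_rate(M, C, deployment_dist, n=10_000):
    """Monte Carlo estimate of P[C(x, M(x))] on the deployment distribution.

    The point of a heuristic estimator is to bound this probability without
    relying on sampling alone, since sampling can miss rare catastrophic inputs.
    """
    bad = sum(C(x, M(x)) for x in deployment_dist(n))
    return bad / n
```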

I think this is a reasonable perception and opinion. We’ve written a little bit about how heuristic estimators might help with ELK (MAD and ELK and finding gliders), but that writing is not particularly clear and doesn’t present a complete picture.

We’ve mostly been focused on finding heuristic estimators, because I am fairly convinced they would be helpful and think that designing them is our key technical risk. But now that we are hiring again I think it’s important for us to explain publicly why they would be valuable, and to generally motivate and situa... (read more)

