Ten Levels of AI Alignment Difficulty
Image from https://threadreaderapp.com/thread/1666482929772666880.html

Chris Olah recently released a tweet thread describing how the Anthropic team thinks about AI alignment difficulty. On this view, there is a spectrum of possible scenarios ranging from ‘alignment is very easy’ to ‘alignment is impossible’, and we can frame AI alignment research as a process of increasing the probability of beneficial outcomes by progressively addressing harder scenarios. I think this framing is very useful, and here I have expanded on it by providing a more detailed scale of AI alignment difficulty and explaining some considerations that arise from it.

The discourse around AI safety is dominated by detailed conceptions of potential AI systems, their failure modes, and ways to ensure their safety. This article by the DeepMind safety team provides an overview of some of these failure modes. I believe we can understand these various threat models through the lens of ‘alignment difficulty’: sources of AI misalignment can be sorted from easy to very hard to address, and technical AI safety interventions can then be matched to the specific failure scenarios they guard against. Making this uncertainty about difficulty explicit also makes some debates between alignment researchers easier to understand.

An easier scenario could involve AI models generalising and learning goals in ways that fit with common sense. For example, it could be the case that LLMs of any level of complexity are best understood as generative models over potential writers, with Reinforcement Learning from Human Feedback (RLHF) or Constitutional AI (CAI) merely selecting among those potential writers (see the toy sketch below). This is sometimes called ‘alignment by default’.

A harder scenario could look like the one outlined in ‘Deep Deceptiveness’, where systems rapidly and unpredictably generalise in ways that quickly obsolete previous alignment techniques, and also learn deceptive reward-hacking strategies that look superficially identical to aligned behaviour.
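
To make the ‘generative model over potential writers’ picture a bit more concrete, here is a minimal toy sketch of my own (not from Olah’s thread or the DeepMind article): it treats the base model as a crude distribution over writer personas and uses best-of-n selection under a stand-in reward model as a loose analogue of the selection pressure RLHF or CAI applies. All names and scores below are invented placeholders.

```python
# Toy sketch only: a base model as a distribution over "writers", with a
# reward model applying selection pressure over them (best-of-n analogue).
# The persona names and scores are illustrative placeholders, not real systems.
import random

# Hypothetical pool of writer personas the base model could emulate,
# each with a made-up score a reward model might assign to its outputs.
WRITERS = {
    "careful_explainer": 0.9,
    "confident_bluffer": 0.4,
    "sycophant": 0.6,
}

def reward_model(writer: str) -> float:
    """Stand-in for a learned reward model scoring a writer's outputs."""
    return WRITERS[writer]

def select_writer(pool: dict, n_samples: int = 4) -> str:
    """Sample writers from the base distribution and keep the one the
    reward model prefers -- a crude analogue of RLHF/CAI selecting among
    the writers the base model can already simulate."""
    candidates = random.choices(list(pool), k=n_samples)
    return max(candidates, key=reward_model)

if __name__ == "__main__":
    print(select_writer(WRITERS))
```

In this toy picture, training only reweights which writers get picked and does not create new goals, which is roughly what the easier, ‘alignment by default’ scenario hopes is true of real systems.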