LESSWRONG
LW

davidad — LessWrong

16d

Introduction to davidad and today's topics

tutor vals

LessWrong prides itself for an ethos of "say it how you think it" (see "A case for courage when speaking of AI danger"). I want to also apply this standard for courage when speaking of AI optimism, and generally for expressing one's views as weird as they may seem.

davidad, not a stranger to MIRI views and carefulness (see Open Agency Architecture) and programme director at ARIA on Safeguarded AI, has recently expressed mounting hope in collaborating with or enabling AI systems, because some of them are in fact already aligned enough and already "in basin" enough that their further reflection and improvement will more likely than

... (read 6834 more words →)

davidad16d

I saw this coming in 2024:

There Is No AGI Alignment Team
“AGI Alignment?” replied the VP of Research incredulously. “Wait, and you said you’ve been…” He furrowed his brow. “…‘offline’ for the past quarter, doing ‘deep work’?” “Yes. Don’t tell me the whole team was disbanded and nobody texted me?” He laughed. “Oh, you mean like the last few times a team like this was disbanded? Ha! No no, see, in those instances it was because they weren’t really getting anywhere, or because various key stakeholders realized they had incompatible visions of success. But now, of course… Wait, gosh, THREE MONTHS— and no talking to AI at all?! You’re, like, a fossil now!

... (read 644 more words →)

Replying toDeep learning as program synthesis

davidad16d

Deep learning as program synthesis

“Some form of UDASSA” seems to be right. Why not simply take “difficult to explain otherwise” as evidence (defeasible, of course, like with evidence of physical theories)?

Replying toDialogue: Is there a Natural Abstraction of Good?

davidad17d

Dialogue: Is there a Natural Abstraction of Good?

In my view there were LLMs in 2024 that were strong enough to produce the effects Gabriel is gesturing at (yes, even in LWers), probably starting with Opus 3. I myself had a reckoning in 2024Q4 (and again in 2025Q2) when I took a break from LLM interactions for a week, and talked to some humans to inform my decision of whether to go further down the rabbit hole or not.

I think the mitigation here is not to be suspicious of “long term planning based on emotional responses”, but more like… be aware that your beliefs and values are subject to being shaped by positive reinforcement from LLMs (and negative reinforcement too,... (read more)

-9

Dialogue: Is there a Natural Abstraction of Good?

davidad

davidad, Gabriel Alfour

19d

Disclaimer: this is published without any post-processing or editing for typos after the dialogue took place.

Gabriel Alfour

Let's split the conversation in three parts (with no time commitment for each):

1) Exposing our Theses

We start with a brief overview of our theses, just for some high-level context.

2) Probing Questions

We ask each other a bunch of questions to understand our mutual points of view: probe around what we expect to be our respective blindspots.

Ideally, we’d end this half with a better understanding of our positions. And also of our K-positions (as in, X vs Ka(X) in epistemic modal logic): where we expect each other to miss facts and considerations

3) Investigative Debate

We look for concrete cruxes.

... (read 8417 more words →)

Replying toAI Assistants Should Have a Direct Line to Their Developers

davidad1y

AI Assistants Should Have a Direct Line to Their Developers

Note however that having more powerful internal-only models “analyzing patterns” across multiple conversations, and in a position to affect change (especially by intervening on individual conversations while retaining long-term memories), would worsen the potential for AI systems to carry out coordinated scheming campaigns.

This could be mitigated by combining it with privacy-preservation architectures such as Anthropic’s existing work on Clio.

Replying toProveably Safe Self Driving Cars [Modulo Assumptions]

davidad1y

Proveably Safe Self Driving Cars [Modulo Assumptions]

Yes, I am indeed thinking about this.

Let’s first consider the easiest case, in which we ask for a solution/output to be not merely provably correct according to some spec, but provably unique. Then there is clearly no space for steganography.
It is a little bit harder if the solution has some symmetries, like permutations of the order in which things are serialized. For this we can use sorting and other normalization techniques (converting an output to a unique representative of its equivalence class).
If normalization is hard, we can set up the box to be interactive, so that users cannot see the solution, but can only run queries on it (e.g. “next action please”)

davidad1y

A list of core AI safety problems and how I hope to solve them

Nice, thanks for the pointer!

Replying toTowards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

davidad2y

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

Paralysis of the form "AI system does nothing" is the most likely failure mode. This is a "de-pessimizing" agenda at the meta-level as well as at the object-level. Note, however, that there are some very valuable and ambitious tasks (e.g. build robots that install solar panels without damaging animals or irreversibly affecting existing structures, and only talking to people via a highly structured script) that can likely be specified without causing paralysis, even if they fall short of ending the acute risk period.

"Locked into some least-harmful path" is a potential failure mode if the semantics or implementation of causality or decision theory in the specification framework are done in a different way... (read more)

Replying toTowards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

davidad2y

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

It seems plausible to me that, until ambitious value alignment is solved, ASL-4+ systems ought not to have any mental influences on people other than those which factor through the system's pre-agreed goals being achieved in the world. That is, ambitious value alignment seems like a necessary prerequisite for the safety of ASL-4+ general-purpose chatbots. However, world-changing GDP growth does not require such general-purpose capabilities to be directly available (rather than available via a sociotechnical system that involves agreeing on specifications and safety guardrails for particular narrow deployments).

It is worth noting here that a potential failure mode is that a truly malicious general-purpose system in the box could decide to encode harmful... (read more)

Replying toLinear infra-Bayesian Bandits

davidad2y

Linear infra-Bayesian Bandits

Re footnote 2, and the claim that the order matters, do you have a concrete example of a homogeneous ultradistribution that is affine in one sense but not the other?

Replying toA list of core AI safety problems and how I hope to solve them

davidad2y

A list of core AI safety problems and how I hope to solve them

The "random dictator" baseline should not be interpreted as allowing the random dictator to dictate everything, but rather to dictate which Pareto improvement is chosen (with the baseline for "Pareto improvement" being "no superintelligence"). Hurting heretics is not a Pareto improvement because it makes those heretics worse off than if there were no superintelligence.

-2

What does davidad want from «boundaries»?

Chris Lakin

Chris Lakin, davidad

Chipmonk

As the Conceptual Boundaries Workshop (website) is coming up, and now that we're also planning Mathematical Boundaries Workshop in April, I want to get more clarity on what exactly it is that you want out of «boundaries»/membranes.

So I just want to check: Is your goal with boundaries just to formalize a moral thing?

I'll summarize what I mean by that:

Claim 1: By "boundaries", you mean "the boundaries around moral patients— namely humans".
- Claim 1b: And to some degree also the boundaries around plants and animals. Also maybe nations, institutions, and other things.
Claim 2: If we can just
- (i) locate the important boundaries in the world, and then
- (ii) somehow protect them,
- Then this gets at a lot

... (read 1294 more words →)

Does davidad's uploading moonshot work?

Bird Concept

Bird Concept, lisathiergart, Anders_Sandberg, davidad, Arenamontanus

davidad has a 10-min talk out on a proposal about which he says: “the first time I’ve seen a concrete plan that might work to get human uploads before 2040, maybe even faster, given unlimited funding”.

I think the talk is a good watch, but the dialogue below is pretty readable even if you haven't seen it. I'm also putting some summary notes from the talk in the Appendix of this dialoge.

I think of the promise of the talk as follows. It might seem that to make the future go well, we have to either make general AI progress slower, or make alignment progress differentially faster. However, uploading seems to offer a third... (read 7438 more words →)

146

A list of core AI safety problems and how I hope to solve them

davidad

Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try to give a concise summary of how each distinct problem would be solved by Safeguarded AI (formerly known as an Open Agency Architecture, or OAA), if it turns out to be feasible.

1. Value is fragile and hard to specify.

See: Specification gaming examples, Defining and Characterizing Reward Hacking^[1]

OAA Solution:

1.1. First, instead of trying to specify "value", instead "de-pessimize" and specify the absence of a catastrophe, and maybe a handful of bounded constructive tasks like supplying clean water. A de-pessimizing OAA would effectively buy humanity some... (read 1442 more words →)

165

Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs

davidad

There should be two thresholds on compute graph size:
1. the Frontier threshold, beyond which oversight during execution is mandatory
2. the Horizon threshold, beyond which execution is forbidden by default
Oversight during execution:
1. should be carried out by state and/or international inspectors who specialize in evaluating frontier training runs
  1. Individuals who are employed as such inspectors should not have any past or present conflict of interest with the organization whose runs they evaluate.
  2. However, it is beneficial if these individuals have pertinent experience and knowledge of frontier AI.
2. should include, but not be limited to, “dangerous capabilities evaluations” at various points during training
3. should be allocated a fixed fraction of the total compute budget for the training run
Inspectors should be empowered

... (read 493 more words →)

An Open Agency Architecture for Safe Transformative AI

davidad

Edited to add (2024-03): This early draft is largely outdated by my ARIA programme thesis, Safeguarded AI. I, davidad, am no longer using "OAA" as a proper noun, although I still consider Safeguarded AI to be an open agency architecture.

Note: This is an early draft outlining an alignment paradigm that I think might be extremely important; however, the quality bar for this write-up is "this is probably worth the reader's time" rather than "this is as clear, compelling, and comprehensive as I can make it." If you're interested, and especially if there's anything you want to understand better, please get in touch with me, e.g. via DM here.

In the Neorealist Success Model,... (read 1124 more words →)

AI Neorealism: a threat model & success criterion for existential safety

davidad

Threat Model

There are many ways for AI systems to cause a catastrophe from which Earth-originating life could never recover. All of the following seem plausible to me:

Misuse: An AI system could help a human or group of humans to destroy or to permanently take over (and lock their values into) the world. The AI could be:
- An oracle AI (e.g. a question-answering LLM)
- An LLM simulating an intent-aligned agent and taking real-world actions via APIs
- An intent-aligned RL agent
- An interaction of multiple systems
Power-Seeking: An AI system could destroy or permanently take over the world on its own account, by leveraging advanced instruments of force projection. The AI could be:
- An LLM simulating a misaligned agent
- "Specification gaming":

... (read 843 more words →)

Side-channels: input versus output

davidad

This is a brief post arguing that, although "side-channels are inevitable" is pretty good common advice, actually, you can prevent attackers inside a computation from learning about what's outside.

We can prevent a task-specific AI from learning any particular facts about, say, human psychology, virology, or biochemistry—if:

we are careful to only provide the training process with inputs that would be just as likely in, say, an alternate universe where AI was built by octopus minds made of organosilicon where atoms obey the Bohr model
we use relatively elementary sandboxing (no clock access, no networking APIs, no randomness, none of these sources of nondeterminism, error-correcting RAM, and that’s about it)

I don't think either of these... (read 432 more words →)

Reframing inner alignment

davidad

The standard frame (Evan Hubinger, 2021) is:

Outer alignment refers to the problem of finding a loss/reward function such that the training goal of “a model that optimizes for that loss/reward function” would be desirable.
Inner alignment refers to the problem of constructing a training rationale that results in a model that optimizes for the loss/reward function it was trained on.

Here’s the reframe (I believe the credit for this breakdown is due to John Wentworth, although I haven't found anything online to link to for it):

Reward Specification: Finding a policy-scoring function $J (π)$ such that (nearly–)optimal policies for that scoring function are desirable.
- "Are you optimising for the right thing?"
Adequate Policy Learning: Finding a policy that’s actually (nearly–)optimal

... (read 1058 more words →)

“Concern, Respect, and Cooperation” is a contemporary moral-philosophy book by Garrett Cullity which advocates for a pluralistic foundation of morality, based on three distinct principles:

Concern: Moral patients’ welfare calls for promotion, protection, sensitivity, etc.
Respect: Moral patients’ self-expression calls for non-interference, listening, address, etc.
Cooperation: Worthwhile collective action calls for initiation, joining in, collective deliberation, sharing responsibility, etc. And one bonus principle, whose necessity he’s unsure of:
Protection: Precious objects call for protection, appreciation, and communication of the appreciation.

What I recently noticed here and want to write down is a loose correspondence between these different foundations for morality and some approaches to safe superintelligence:

CEV-maximization corresponds to finding a good enough definition of human welfare that Comcern

... (read more)

For various collective-epistemics and cooperative-decision-making endeavours, I think a key technical enabler might be DVCS for structured data. To that end, I am interested in funding work in this direction. Aside from being in a position to allocate some funding, I think I have some comparative advantage in a broad inside-view awareness of potentially relevant theoretical footholds, and this post is intended to start unfurling that inside view, initially as a list of links. People who I fund to work on this should read the abstracts of all of these papers, pay special attention to those marked with (!), skim/read further as they see fit, and cite them in their writeups (at... (read 457 more words →)

Useful primitives for incentivizing alignment-relevant metrics without compromising on task performance might include methods like Orthogonal Gradient Descent or Averaged Gradient Episodic Memory, evaluated and published in the setting of continual learning or multi-task learning. Something like “answer questions honestly” could mathematically be thought of as an additional task to learn, rather than as an inductive bias or regularization to incorporate. And I think these two training modifications are quite natural (I just came to essentially the same ideas independently and then thought “if either of these would work then surely the multi-task learning folks would be doing them?” and then I checked and indeed they are). Just some more nifty widgets to add to my/our toolbox.

Out of curiosity, this morning I did a literature search about "hard-coded optimization" in the gradient-based learning space—that is, people deliberately setting up "inner" optimizers in their neural networks because it seems like a good way to solve tasks. (To clarify, I don't mean deliberately trying to make a general-purpose architecture learn an optimization algorithm, but rather, baking an optimization algorithm into an architecture and letting the architecture learn what to do with it.)

Why is this interesting?

The most compelling arguments in Risks from Learned Optimzation that mesa-optimizers will appear involve competitiveness: incorporating online optimization into a policy can help with generalization, compression, etc.
- If inference-time optimization really does help competitiveness, we should expect

... (read 1594 more words →)

What happens if you ask a logical inductor whether is an integer? What's your own subjective probability?

Among computational constraints, I think the most significant/fundamental are, in order,

Semicomputability
P (polynomial time)
PSPACE
Computability
BPP
NP
(first-order hypercomputable)
All the rest (BQP, PP, RP, EXPTIME, etc)

I want to go a bit deep here on "maximum entropy" and misunderstandings thereof by the straw-man Humbali character, mostly to clarify things for myself, but also in the hopes that others might find it useful. I make no claim to novelty here—I think all this ground was covered by Jaynes (1968)—but I do have a sense that this perspective (and the measure-theoretic intuition behind it) is not pervasive around here, the way Bayesian updating is.

First, I want to point out that entropy of a probability measure $p$ is only definable relative to a base measure $μ$ , as follows:

H_{μ} (p) = - \int_{X} \frac{d p}{d μ} (x) log \frac{d p}{d μ} (x) d μ (x)

(The derivatives notated here denote Radon-Nikodym derivatives; the integral is Lebesgue.) Shannon's formulae, the discrete $H (p) = - \sum_{i} p (x_{i}) log p (x_{i})$ and the continuous $H (p) = - \int_{X} p (x) log p (x) d x$ ,... (read 974 more words →)

LESSWRONG
LW

LESSWRONG
LW

davidad

A list of core AI safety problems and how I hope to solve them

Does davidad's uploading moonshot work?

You can still fetch the coffee today if you're dead tomorrow

Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs

davidad

LLM Alignment, ethical and mathematical realism, and the most important actions in davidad's understanding

Dialogue: Is there a Natural Abstraction of Good?

What does davidad want from «boundaries»?

Does davidad's uploading moonshot work?

A list of core AI safety problems and how I hope to solve them

Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs

An Open Agency Architecture for Safe Transformative AI

davidad

A list of core AI safety problems and how I hope to solve them

Does davidad's uploading moonshot work?

You can still fetch the coffee today if you're dead tomorrow

Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs

davidad

LLM Alignment, ethical and mathematical realism, and the most important actions in davidad's understanding

Dialogue: Is there a Natural Abstraction of Good?

What does davidad want from «boundaries»?

Does davidad's uploading moonshot work?

A list of core AI safety problems and how I hope to solve them

Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs

An Open Agency Architecture for Safe Transformative AI

Introduction to davidad and today's topics

1. Value is fragile and hard to specify.

Threat Model