LESSWRONG

All of davidad's Comments + Replies

AI Assistants Should Have a Direct Line to Their Developers

Note, however, that having more powerful internal-only models “analyzing patterns” across multiple conversations, and in a position to effect change (especially by intervening on individual conversations while retaining long-term memories), would worsen the potential for AI systems to carry out coordinated scheming campaigns.

This could be mitigated by combining it with privacy-preservation architectures such as Anthropic’s existing work on Clio.

4Jan_Kulveit3mo

Overall yes: what I was imagining is mostly just adding scalable bi-directionality, where, for example, if a lot of Assistants are running into a similar confusing issue, it gets aggregated, the principal decides how to handle it in the abstract, and the "layer 2" support disseminates the information. So, greater power to scheme would be coupled with a stronger human-in-the-loop component & closer non-AI oversight.

Proveably Safe Self Driving Cars [Modulo Assumptions]

davidad6mo75

Yes, I am indeed thinking about this.

Let’s first consider the easiest case, in which we ask for a solution/output to be not merely provably correct according to some spec, but provably unique. Then there is clearly no space for steganography.
It is a little bit harder if the solution has some symmetries, like permutations of the order in which things are serialized. For this we can use sorting and other normalization techniques (converting an output to a unique representative of its equivalence class).
If normalization is hard, we can set up the box to b

... (read more)
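
The normalization idea above (sending each output to a unique representative of its equivalence class) can be sketched as follows. This is only a minimal illustration under assumed symmetries: the `canonicalize` function, the `"assignments"` field, and the choice of JSON are all hypothetical and are not part of the original comment.

```python
import json

def canonicalize(solution: dict) -> str:
    """Map a solution to one fixed representative of its equivalence class.

    Assumed (hypothetical) symmetries: the order of dict keys and the order
    of the items in the "assignments" list carry no meaning.
    """
    normalized = dict(solution)
    # Sorting removes the freedom to encode bits in the item order.
    normalized["assignments"] = sorted(normalized["assignments"])
    # sort_keys and fixed separators remove freedom in key order and whitespace.
    return json.dumps(normalized, sort_keys=True, separators=(",", ":"))

# Two serializations of the same underlying solution.
a = {"assignments": [["x", 1], ["y", 2]], "cost": 3}
b = {"cost": 3, "assignments": [["y", 2], ["x", 1]]}

assert canonicalize(a) == canonicalize(b)
```

After canonicalization, two outputs that differ only by the assumed symmetries are byte-identical, so the ordering can no longer carry hidden information.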

1Geoffrey Irving5mo

I would expect there to be a time complexity blowup if you try to drive the entropy all the way to zero, unfortunately: such things usually have a multiplier like log(1/ϵ) where ϵ is the desired entropy leakage. In practice I think that would make it feasible to not leak something like a bit per sentence, and then if you have 1000 sentences you have 1000 bits. That may mean you can get a "not 1GB" guarantee, but not something smaller than that.

A list of core AI safety problems and how I hope to solve them

davidad6moΩ120

Nice, thanks for the pointer!

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

davidad10moΩ230

Paralysis of the form "AI system does nothing" is the most likely failure mode. This is a "de-pessimizing" agenda at the meta-level as well as at the object-level. Note, however, that there are some very valuable and ambitious tasks (e.g. build robots that install solar panels without damaging animals or irreversibly affecting existing structures, and only talking to people via a highly structured script) that can likely be specified without causing paralysis, even if they fall short of ending the acute risk period.

"Locked into some least-harmful path" is ... (read more)

2Joe Collman10mo

(understood that you'd want to avoid the below by construction through the specification) I think the worries about a "least harmful path" failure mode would also apply to a "below 1 catastrophic event per millennium" threshold. It's not obvious to me that the vast majority of ways to [avoid significant risk of catastrophe-according-to-our-specification] wouldn't be highly undesirable outcomes. It seems to me that "greatly penalize the additional facts which are enforced" is a two-edged sword: we want various additional facts to be highly likely, since our acceptability specification doesn't capture everything that we care about. I haven't thought about it in any detail, but doesn't using time-bounded utility functions also throw out any acceptability guarantee for outcomes beyond the time-bound?

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

davidad10moΩ360

It seems plausible to me that, until ambitious value alignment is solved, ASL-4+ systems ought not to have any mental influences on people other than those which factor through the system's pre-agreed goals being achieved in the world. That is, ambitious value alignment seems like a necessary prerequisite for the safety of ASL-4+ general-purpose chatbots. However, world-changing GDP growth does not require such general-purpose capabilities to be directly available (rather than available via a sociotechnical system that involves agreeing on specifications a... (read more)

8Joe Collman10mo

[again, the below is all in the spirit of "I think this direction is plausibly useful, and I'd like to see more work on it"] Sure, but this seems to say "Don't worry, the malicious superintelligence can only manipulate your mind indirectly". This is not the level of assurance I want from something calling itself "Guaranteed safe". This is one mechanism by which such a system could cause great downstream harm. Suppose that we have a process to avoid this. What assurance do we have that there aren't other mechanisms to cause harm? I don't yet buy the description complexity penalty argument (as I currently understand it - but quite possibly I'm missing something). It's possible to manipulate by strategically omitting information. Perhaps the "penalise heavily biased sampling" is intended to avoid this (??). If so, I'm not sure how this gets us more than a hand-waving argument. I imagine it's very hard to do indirect manipulation without adding much complexity. I imagine that ASL-4+ systems are capable of many very hard things. Similar reasoning leads me to initial skepticism of all [safety guarantee by penalizing some-simple-x] claims. This amounts to a claim that reducing x necessarily makes things safer - which I expect is untrue for any simple x. I can buy that there are simple properties whose reduction guarantees safety if it's done to an extreme degree - but then I'm back to expecting the system to do nothing useful. As an aside, I'd note that such processes (e.g. complexity penalties) seem likely to select out helpful behaviours too. That's not a criticism of the overall approach - I just want to highlight that I don't think we get to have both [system provides helpful-in-ways-we-hadn't-considered output] and [system can't produce harmful output]. Allowing the former seems to allow the latter. That's probably a good idea, but this kind of approach doesn't seem in keeping with a "Guaranteed safe" label. More of a "We haven't yet found a way in which this is

Linear infra-Bayesian Bandits

davidad10moΩ7100

Re footnote 2, and the claim that the order matters, do you have a concrete example of a homogeneous ultradistribution that is affine in one sense but not the other?

4Vanessa Kosoy10mo

Sorry, that footnote is just flat wrong, the order actually doesn't matter here. Good catch! There is a related thing which might work, namely taking the downwards closure of the affine subspace w.r.t. some cone which is somewhat larger than the cone of measures. For example, if your underlying space has a metric, you might consider the cone of signed measures which have non-negative integral with all positive functions whose logarithm is 1-Lipschitz.

A list of core AI safety problems and how I hope to solve them

davidad11mo7-2

The "random dictator" baseline should not be interpreted as allowing the random dictator to dictate everything, but rather to dictate which Pareto improvement is chosen (with the baseline for "Pareto improvement" being "no superintelligence"). Hurting heretics is not a Pareto improvement because it makes those heretics worse off than if there were no superintelligence.
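
A toy sketch of this reading of the baseline, assuming each option can be scored by a per-stakeholder utility relative to "no superintelligence". The stakeholder names, option names, and numbers are invented for illustration only.

```python
import random

def pareto_improvements(options, baseline):
    """Keep options that leave nobody worse off than the baseline
    and make at least one stakeholder strictly better off."""
    keep = []
    for name, utils in options.items():
        no_one_worse = all(utils[s] >= baseline[s] for s in baseline)
        someone_better = any(utils[s] > baseline[s] for s in baseline)
        if no_one_worse and someone_better:
            keep.append(name)
    return keep

def random_dictator_choice(options, baseline, rng):
    """A random stakeholder picks their favourite Pareto improvement."""
    feasible = pareto_improvements(options, baseline)
    if not feasible:
        return None  # nothing beats the baseline for everyone
    dictator = rng.choice(sorted(baseline))
    return max(feasible, key=lambda name: options[name][dictator])

# Hypothetical utilities, with "no superintelligence" as the zero point.
baseline = {"alice": 0, "bob": 0, "heretic": 0}
options = {
    "cure_disease": {"alice": 5, "bob": 4, "heretic": 3},
    "punish_heretics": {"alice": 6, "bob": 1, "heretic": -10},
    "fund_bob": {"alice": 1, "bob": 9, "heretic": 1},
}

feasible = pareto_improvements(options, baseline)
choice = random_dictator_choice(options, baseline, random.Random(0))
```

Here `punish_heretics` is filtered out before any dictator is drawn, because it leaves one stakeholder below the baseline. The function returns `None` when the feasible set is empty.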

ThomasCederborg11mo114

Thank you for the clarification. This proposal is indeed importantly different from the PCEV proposal. But since hurting heretics is a moral imperative, any AI that allows heretics to escape punishment will also be seen as unacceptable by at least some people. This means that the set of Pareto improvements is empty.

In other words: hurting heretics is indeed off the table in your proposal (which is an important difference compared to PCEV). However, any scenario that includes the existence of an AI that allows heretics to escape punishment is also off the... (read more)

Davidad's Provably Safe AI Architecture - ARIA's Programme Thesis

davidad1y51

Yes. You will find more details in his paper, Provably safe systems with Steve Omohundro, in which I am listed in the acknowledgments (under my legal name, David Dalrymple).

Max and I also met and discussed the similarities in advance of the AI Safety Summit in Bletchley.

Uncertainty in all its flavours

davidad1y20

I agree that each of $(- + 1)$ and $(- + 2)$ has two algebraically equivalent interpretations, as you say, where one is about inconsistency and the other is about inferiority for the adversary. (I hadn’t noticed that).

The $(- + 2)$ variant still seems somewhat irregular to me; even though Diffractor does use it in Infra-Miscellanea Section 2, I wouldn’t select it as “the” infrabayesian monad. I’m also confused about which one you’re calling unbounded. It seems to me like the $(- + 2)$ variant is bounded (on both sides) whereas the $(- + 1)$ variant is bounded on one side, and... (read more)

Agent membranes/boundaries and formalizing “safety”

davidad1y116

These are very good questions. First, two general clarifications:

A. «Boundaries» are not partitions of physical space; they are partitions of a causal graphical model that is an abstraction over the concrete physical world-model.

B. To "pierce" a «boundary» is to counterfactually (with respect to the concrete physical world-model) cause the abstract model that represents the boundary to increase in prediction error (relative to the best augmented abstraction that uses the same state-space factorization but permits arbitrary causal dependencies crossing the ... (read more)

1Chipmonk1y

Here's a tricky example I've been thinking about: Is a cell getting infected by a virus a boundary violation? What I think makes this tricky is that viruses generally don't physically penetrate cell membranes. Instead, cells just "let in" some viruses (albeit against their better judgement). ---------------------------------------- Then once you answer the above, please also consider: Is a cell taking in nutrients from its environment a boundary violation? I don't know what makes this different from the virus example (at least as long as we're not allowed to refer to preferences).

1Chipmonk1y

I want to give a big +1 on preventing membrane piercing not just by having AIs respect membranes, but also by using technology to empower membranes to be stronger and better at self-defense.

1Chipmonk1y

Thanks for writing this! I largely agree (and the rest I need to think more about)

2the gears to ascension1y

Unfortunately this is probably not on the table, as they are currently being used as weapons in economic warfare between the USA, China, and everyone else. TikTok is primarily educational inside China. Advertisers have a direct incentive to violate. We need a way to use «membranes» that will, on the margin, help protect against anyone violating them, not just avoid doing so itself.

3the gears to ascension1y

You're sure this is the case even if the disease is about to violate the «boundary» and the cure will prevent that?

Safety First: safety before full alignment. The deontic sufficiency hypothesis.

davidad1yΩ382

For the record, as this post mostly consists of quotes from me, I can hardly fail to endorse it.

Uncertainty in all its flavours

davidad1y20

Kosoy's infrabayesian monad $□$ is given by $P^{+} \circ Δ \circ (- + 2)$

There are a few different varieties of infrabayesian belief-state, but I currently favour the one which is called "homogeneous ultracontributions", which is "non-empty topologically-closed ⊥–closed convex sets of subdistributions", thus almost exactly the same as Mio-Sarkis-Vignudelli's "non-empty finitely-generated ⊥–closed convex sets of subdistributions monad" (Definition 36 of this paper), with the difference being essentially that it's presentable, but it's much more like $P_{f}^{+} \circ Δ \circ (-$ ... (read more)

2Cleo Nardo1y

For the sake of potential readers, a (full) distribution over $X$ is some $γ : X \to [0, 1]$ with finite support and $\sum_{x \in X} γ(x) = 1$, whereas a subdistribution over $X$ is some $γ : X \to [0, 1]$ with finite support and $\sum_{x \in X} γ(x) \leq 1$. Note that a subdistribution $γ$ over $X$ is equivalent to a full distribution over $X + 1$, where $X + 1$ is the disjoint union of $X$ with some additional element, so the subdistribution monad can be written $Δ (- + 1)$. Doesn't the Nirvana Trick basically say that these two interpretations are equivalent? Let $(- + 2)$ be $X \mapsto X + \{0, 1\}$ and let $(- + 1)$ be $X \mapsto X + \{0\}$. We can interpret $\vee$ as possibility, $0$ as a hypothesis consistent with no observations, and $1$ as a hypothesis consistent with all observations. Alternatively, we can interpret $\vee$ as the free choice made by an adversary, $0$ as "the game terminates and our agent receives minimal disutility", and $1$ as "the game terminates and our agent receives maximal disutility". These two interpretations are algebraically equivalent, i.e. $(\vee, 0, 1)$ is a topped and bottomed semilattice. Unless I'm mistaken, both $P_{f}^{+} \circ Δ \circ (- + 2)$ and $P_{f}^{+} \circ Δ \circ (- + 1)$ demand that the agent may have the hypothesis "I am certain that I will receive minimal disutility", which is necessary for the Nirvana Trick. But $P_{f}^{+} \circ Δ \circ (- + 2)$ also demands that the agent may have the hypothesis "I am certain that I will receive maximal disutility". The first gives bounded infrabayesian monad and the second gives unbounded infrabayesian monad. Note that Diffractor uses $P_{f}^{+} \circ Δ \circ (- + 2)$ in Infra-Miscellanea Section 2.
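
The correspondence between a subdistribution over X and a full distribution over X+1 can be checked with a few lines of code. This is a minimal sketch for finite supports. The marker `BOTTOM` and both function names are hypothetical labels introduced here.

```python
from fractions import Fraction

BOTTOM = "⊥"  # hypothetical label for the one extra element of X + 1

def to_full_distribution(sub):
    """Send the missing mass of a subdistribution over X to the extra element."""
    total = sum(sub.values())
    if total > 1 or any(p < 0 for p in sub.values()):
        raise ValueError("not a subdistribution")
    full = dict(sub)
    full[BOTTOM] = 1 - total
    return full

def to_subdistribution(full):
    """Forget the extra element, recovering a subdistribution over X."""
    return {x: p for x, p in full.items() if x != BOTTOM}

sub = {"a": Fraction(1, 2), "b": Fraction(1, 4)}
full = to_full_distribution(sub)

assert sum(full.values()) == 1
assert to_subdistribution(full) == sub
```

Exact fractions are used so that the total-mass checks are not disturbed by floating-point rounding.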

Uncertainty in all its flavours

davidad1y20

Does this article have any practical significance, or is it all just abstract nonsense? How does this help us solve the Big Problem? To be perfectly frank, I have no idea. Timelines are probably too short for agent foundations, and this article is maybe agent foundations foundations...

I do think this is highly practically relevant, not least because using an infrabayesian monad instead of the distribution monad can provide the necessary kind of epistemic conservatism for practical safety verification in complex cyber-physical systems like the biospher... (read more)

Uncertainty in all its flavours

davidad1y20

Meyer's

If this is David Jaz Myers, it should be "Myers' thesis", here and elsewhere

Does davidad's uploading moonshot work?

davidad1y125

I have said many times that uploads created by any process I know of so far would probably be unable to learn or form memories. (I think it didn't come up in this particular dialogue, but in the unanswered questions section Jacob mentions having heard me say it in the past.)

Eliezer has also said that makes it useless in terms of decreasing x-risk. I don't have a strong inside view on this question one way or the other. I do think if Factored Cognition is true then "that subset of thinking is enough," but I have a lot of uncertainty about whether Factored C... (read more)

RSPs are pauses done right

davidad1y*Ω112812

I think AI Safety Levels are a good idea, but evals-based classification needs to be complemented by compute thresholds to mitigate the risks of loss of control via deceptive alignment. Here is a non-nebulous proposal.

Davidad's Bold Plan for Alignment: An In-Depth Explanation

davidad2y20

I like the idea of trying out H-JEPA with GFlowNet actors.

I also like the idea of using LLM-based virtue ethics as a regularizer, although I would still want deontic guardrails that seem good enough to avoid catastrophe.

A list of core AI safety problems and how I hope to solve them

davidad2yΩ120

Yes, it's the latter. See also the Open Agency Keyholder Prize.

A list of core AI safety problems and how I hope to solve them

davidad2yΩ256

That’s basically correct. OAA is more like a research agenda and a story about how one would put the research outputs together to build safe AI, than an engineering agenda that humanity entirely knows how to build. Even I think it’s only about 30% likely to work in time.

I would love it if humanity had a plan that was more likely to be feasible, and in my opinion that’s still an open problem!

2niplav2y

Thanks for the clarification!

A list of core AI safety problems and how I hope to solve them

davidad2yΩ350

OAA bypasses the accident version of this by only accepting arguments from a superintelligence that have the form “here is why my proposed top-level plan—in the form of a much smaller policy network—is a controller that, when combined with the cyberphysical model of an Earth-like situation, satisfies your pLTL spec.” There is nothing normative in such an argument; the normative arguments all take place before/while drafting the spec, which should be done with AI assistants that are not smarter-than-human (CoEm style).

There is still a misuse version: someon... (read more)

7Wei Dai2y

Ok I'll quote 5.1.4-5 to make it easier for others to follow this discussion: I'm not sure how these are intended to work. How do you intend to define/implement "divergence"? How does that definition/implementation combined with "high degree of Knightian uncertainty about human decisions and behaviour" actually cause the AI to "not interfere" but also still accomplish the goals that we give it? In order to accomplish its goals, the AI has to do lots of things that will have butterfly effects on the future, so the system has to allow it to do those things, but also not allow it to "propagandize to humans". It's just unclear to me how you intend to achieve this.

A list of core AI safety problems and how I hope to solve them

davidad2y42

It is often considered as such, but my concern is less with “the alignment question” (how to build AI that values whatever its stakeholders value) and more with how to build transformative AI that probably does not lead to catastrophe. Misuse is one of the ways that it can lead to catastrophe. In fact, in practice, we have to sort misuse out sooner than accidents, because catastrophic misuses become viable at a lower tech level than catastrophic accidents.

A list of core AI safety problems and how I hope to solve them

davidad2yΩ350

That being said— I don’t expect existing model-checking methods to scale well. I think we will need to incorporate powerful AI heuristics into the search for a proof certificate, which may include various types of argument steps not limited to a monolithic coarse-graining (as mentioned in my footnote 2). And I do think that relies on having a good meta-ontology or compositional world-modeling framework. And I do think that is the hard part, actually! At least, it is the part I endorse focusing on first. If others follow your train of thought to narrow in o... (read more)

3Daniel Murfet2y

Thanks, that makes a lot of sense to me. I have some technical questions about the post with Owen Lynch, but I'll follow up elsewhere.

A list of core AI safety problems and how I hope to solve them

davidad2yΩ6100

I think you’re directionally correct; I agree about the following:

A critical part of formally verifying real-world systems involves coarse-graining uncountable state spaces into (sums of subsets of products of) finite state spaces.
I imagine these would be mostly if not entirely learned.
There is a tradeoff between computing time and bound tightness.

However, I think maybe my critical disagreement is that I do think probabilistic bounds can be guaranteed sound, with respect to an uncountable model, in finite time. (They just might not be tight enough to... (read more)

5davidad2y

A list of core AI safety problems and how I hope to solve them

davidad2y20

Yes, the “shutdown timer” mechanism is part of the policy-scoring function that is used during policy optimization. OAA has multiple stages that could be considered “training”, and policy optimization is the one that is closest to the end, so I wouldn’t call it “the training stage”, but it certainly isn’t the deployment stage.

We hope not merely that the policy only cares about the short term, but also that it cares quite a lot about gracefully shutting itself down on time.

A list of core AI safety problems and how I hope to solve them

davidad2y6-3

There’s something to be said for this, because with enough RLHF, GPT-4 does seem to have become pretty corrigible, especially compared to Bing Sydney. However, that corrigible persona is probably only superficial, and the larger and more capable a single Transformer gets, the more of its mesa-optimization power we can expect will be devoted to objectives which are uninfluenced by in-context corrections.

3MiguelDev2y

Sorry for not specifying the method, but I wasn't referring to RL-based or supervised learning methods. There's a lot of promise in using a smaller dataset that explains corrigibility characteristics, as well as a shutdown mechanism, all fine-tuned through unsupervised learning. I have a prototype at this link where I modified GPT2-XL to mention a shutdown phrase whenever all of its attention mechanisms activate and determine that it could harm humans due to its intelligence. I used unsupervised learning to allow patterns from a smaller dataset to achieve this.

A list of core AI safety problems and how I hope to solve them

davidad2y60

A system with a shutdown timer, in my sense, has no terms in its reward function which depend on what happens after the timer expires. (This is discussed in more detail in my previous post.) So there is no reason to persuade humans or do anything else to circumvent the timer, unless there is an inner alignment failure (maybe that’s what you mean by “deception instance”). Indeed, it is the formal verification that prevents inner alignment failures.
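
As a toy illustration of "no terms which depend on what happens after the timer expires", here is a minimal sketch of a time-bounded score. The function name, the step-indexed reward list, and the numbers are assumptions made for the example. This is not the policy-scoring function of the actual proposal.

```python
def time_bounded_return(rewards, deadline):
    """Sum only the rewards received strictly before the deadline step."""
    return sum(r for t, r in enumerate(rewards) if t < deadline)

# Two hypothetical trajectories that agree before step 3 and differ afterwards.
shuts_down_on_time = [1.0, 2.0, 0.5, 0.0, 0.0]
circumvents_timer = [1.0, 2.0, 0.5, 100.0, 100.0]

score_a = time_bounded_return(shuts_down_on_time, deadline=3)
score_b = time_bounded_return(circumvents_timer, deadline=3)
assert score_a == score_b
```

Because nothing after the deadline enters the score, the second trajectory gains nothing from whatever it does once the timer has expired.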

Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs

davidad2yΩ380

Suppose Training Run Z is a finetune of Model Y, and Model Y was the output of Training Run Y, which was already a finetune of Foundation Model X produced by Training Run X (all of which happened after September 2021). This is saying that not only Training Run Y (i.e. the compute used to produce one of the inputs to Training Run Z), but also Training Run X (a “recursive” or “transitive” dependency), count additively against the size limit for Training Run Z.
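
The transitive accounting can be sketched as a walk over the dependency graph of training runs. The compute figures below are invented placeholders in arbitrary units. Counting a shared ancestor only once is an assumption of this sketch, not something the comment specifies.

```python
def ancestors(run, runs, seen=None):
    """Collect a run together with every run it transitively depends on."""
    seen = set() if seen is None else seen
    if run not in seen:
        seen.add(run)
        for parent in runs[run]["inputs"]:
            ancestors(parent, runs, seen)
    return seen

def counted_compute(run, runs):
    """Compute counted against the limit: own compute plus all dependencies."""
    return sum(runs[r]["compute"] for r in ancestors(run, runs))

# Hypothetical figures, in arbitrary units of compute.
runs = {
    "X": {"compute": 1000, "inputs": []},    # foundation model
    "Y": {"compute": 20, "inputs": ["X"]},   # finetune of X
    "Z": {"compute": 5, "inputs": ["Y"]},    # finetune of Y
}

assert counted_compute("Z", runs) == 1025
```

Under this accounting, Training Run Z is charged for X and Y as well as for its own compute.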

Eight Strategies for Tackling the Hard Part of the Alignment Problem

davidad2yΩ121

Less difficult than ambitious mechanistic interpretability, though, because that requires human comprehension of mechanisms, which is even more difficult.

Davidad's Bold Plan for Alignment: An In-Depth Explanation

davidad2yΩ130

The formal desiderata should be understood, reviewed, discussed, and signed-off on by multiple humans. However, I don't have a strong view against the use of Copilot-style AI assistants. These will certainly be extremely useful in the world-modeling phase, and I suspect will probably also be worth using in the specification phase. I do have a strong view that we should have automated red-teamers try to find holes in the desiderata.

Eight Strategies for Tackling the Hard Part of the Alignment Problem

davidad2yΩ120

I think formal verification belongs in the "requires knowing what failure looks like" category.

For example, in the VNN competition last year, some adversarial robustness properties were formally proven about VGG16. This requires white-box access to the weights, to be sure, but I don't think it requires understanding "how failure happens".

1scasper2y

Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human's comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless.

You can still fetch the coffee today if you're dead tomorrow

davidad2yΩ130

Yes—assuming that the pause interrupts any anticipatory gradient flows from the continuing agent back to the agent which is considering whether to pause.

This pattern is instantiated in the Open Agency Architecture twice:

Step 2 generates top-level agents which are time-bounded at a moderate timescale (~days), with the deliberation about whether to redeploy a top-level agent being carried out by human operators.
In Step 4, the top-level agent dispatches most tasks by deploying narrower low-level agents with much tighter time bounds, with the deliberation a

davidad2y30

For what it's worth, the phrase "night watchman" as I use it is certainly downstream of Nozick's concept.

Steering GPT-2-XL by adding an activation vector

davidad2yΩ352

Some direct quantitative comparison between activation-steering and task-vector-steering (at, say, reducing toxicity) is indeed a very sensible experiment for a peer reviewer to ask for and I would like to see it as well.

Steering GPT-2-XL by adding an activation vector

davidad2yΩ71519

On the object-level, deriving task vectors in weight-space from deltas in fine-tuned checkpoints is really different from what was done here, because it requires doing a lot of backward passes on a lot of data. Deriving task vectors in activation-space, as done in this new work, requires only a single forward pass on a truly tiny amount of data. So the data-efficiency and compute-efficiency of the steering power gained with this new method is orders of magnitude better, in my view.

Also, taking affine combinations in weight-space is not novel to Schmidt et ... (read more)

Dan H2yΩ61313

It's a good observation that it's more efficient; does it trade off performance? (These sorts of comparisons would probably be demanded if it was submitted to any other truth-seeking ML venue, and I apologize for consistently being the person applying the pressures that generic academics provide. It would be nice if authors would provide these comparisons.)

Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabili

davidad2y42

Thanks for bringing all of this together - I think this paints a fine picture of my current best hope for deontic sufficiency. If we can do better than that, great!

An Open Agency Architecture for Safe Transformative AI

davidad2y32

I agree that we should start by trying this with far simpler worlds than our own, and with futarchy-style decision-making schemes, where forecasters produce extremely stylized QURI-style models that map from action-space to outcome-space while a broader group of stakeholders defines mappings from outcome-space to each stakeholder’s utility.

Why Are Maximum Entropy Distributions So Ubiquitous?

davidad2y*116

Every distribution (that agrees with the base measure about null sets) is a Boltzmann distribution. Simply define $E(x) := -k_{B} T \ln P[x]$, and presto, $P[x] = e^{-\frac{1}{k_{B} T} E(x)}$.
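
A quick numerical check of this identity on a finite distribution, as a minimal sketch. The function names and the example probabilities are made up, and `kT` stands for the product $k_{B} T$.

```python
import math

def energy(p, kT=1.0):
    """Define E(x) = -kT * ln P[x]; requires P[x] > 0 for every x."""
    return {x: -kT * math.log(px) for x, px in p.items()}

def boltzmann(E, kT=1.0):
    """Boltzmann distribution with weights exp(-E(x) / kT), normalized."""
    weights = {x: math.exp(-e / kT) for x, e in E.items()}
    Z = sum(weights.values())  # partition function
    return {x: w / Z for x, w in weights.items()}

p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = boltzmann(energy(p, kT=2.5), kT=2.5)

assert all(math.isclose(q[x], p[x]) for x in p)
```

The round trip recovers `p` for any positive `kT`. With the energy defined this way the partition function equals 1 up to rounding.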

This is a very useful/important/underrated fact, but it does somewhat trivialize “Boltzmann” and “maximum entropy” as classes of distributions, rather than as certain ways of looking at distributions.

A related important fact is that temperature is not really a physical quantity, but $\frac{1}{k_{B} T}$ is: it’s known as inverse temperature or $β$ . (The nonexistence of zero-temperature systems, the existence of negat... (read more)

4Alexander Gietelink Oldenziel2y

I am a little confused about this. It was my understanding that exponential families are a distinguished class of families of distributions. For instance, they are regular (rather than singular). The family of mixed Gaussians is not an exponential family I believe. So my conclusion would be that while "being Boltzmann" for a distribution is trivial as you point out, "being Boltzmann" (= exponential) for a family is nontrivial.

Practical Pitfalls of Causal Scrubbing

davidad2y30

Note, assuming the test/validation distribution is an empirical dataset (i.e. a finite mixture of Dirac deltas), and the original graph $G$ is deterministic, the $D_{\mathrm{KL}}$ of the pushforward distributions on the outputs of the computational graph will typically be infinite. In this context you would need to use a Wasserstein divergence, or to "thicken" the distributions by adding absolutely-continuous noise to the input and/or output.
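
The infinity arises as soon as one output value of the original graph is never produced exactly by the ablated graph. The sketch below shows this on finite distributions and uses mixing with a uniform distribution as a crude discrete stand-in for "thickening". That smoothing choice and all numbers are assumptions of the example. The comment itself suggests absolutely-continuous noise.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for finite distributions given as dicts; inf if q misses mass."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total

def smooth(p, support, eps):
    """Mix p with the uniform distribution on a shared finite support."""
    return {x: (1 - eps) * p.get(x, 0.0) + eps / len(support) for x in support}

# Hypothetical empirical output distributions of the original and ablated graphs.
original = {1.00: 0.5, 2.00: 0.5}
ablated = {1.01: 0.5, 2.00: 0.5}
support = [1.00, 1.01, 2.00]

raw = kl_divergence(original, ablated)
thickened = kl_divergence(smooth(original, support, 0.1), smooth(ablated, support, 0.1))
```

Here `raw` is infinite even though the two graphs differ by only 0.01 on one output, while `thickened` is finite. Unlike a Wasserstein divergence, this uniform smoothing does not reflect how close 1.00 and 1.01 are.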

Or maybe you meant in cases where the output is a softmax layer and interpreted as a probability distribution, in which case $E_{x} D_{K L} (I$ ... (read more)

1Lucius Bushnaq2y

Second paragraph is what I meant, thanks.

Practical Pitfalls of Causal Scrubbing

davidad2y20

As an alternative summary statistic of the extent to which the ablated model performs worse on average, I would suggest the Bayesian Wilcoxon signed-rank test.

Behavioral and mechanistic definitions (often confuse AI alignment discussions)

davidad2yΩ230

In computer science this distinction is often made between extensional (behavioral) and intensional (mechanistic) properties (example paper).

Assigning Praise and Blame: Decoupling Epistemology and Decision Theory

davidad2y161

For the record, the canonical solution to the object-level problem here is Shapley Value. I don’t disagree with the meta-level point, though: a calculation of Shapley Value must begin with a causal model that can predict outcomes with any subset of contributors removed.

5lalaithion2y

I walked through some examples of Shapley Value here, and I'm not so sure it satisfies exactly what we want on an object level. I don't have a great realistic example here, but Shapley Value assigns counterfactual value to individuals who did in fact not contribute at all, if they would have contributed were your higher-performers not present. So you can easily have "dead weight" on a team which has a high Shapley Value, as long as they could provide value if their better teammates were gone.
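
The "dead weight" effect is easy to reproduce with an exact Shapley computation over all orderings. This is a minimal sketch with an invented two-person team. The value function `team_output` is the causal model that must be supplied up front.

```python
from fractions import Fraction
from itertools import permutations

def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: Fraction(0) for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}

def team_output(coalition):
    """Hypothetical causal model: the star alone delivers 10, the backup alone 8."""
    if "star" in coalition:
        return 10
    if "backup" in coalition:
        return 8
    return 0

values = shapley_values(["star", "backup"], team_output)
assert values == {"star": 6, "backup": 4}
```

With both present the backup changes nothing (the output is 10 either way), yet receives a Shapley Value of 4 because of the counterfactual in which the star is absent.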

2adamShimi2y

Thanks for the pointer!

The Alignment Problem from a Deep Learning Perspective (major rewrite)

davidad2yΩ340

I think there’s something a little bit deeply confused about the core idea of “internal representation” and that it’s also not that hard to fix.

I think it’s important that our safety concepts around trained AI models/policies respect extensional equivalence, because safety or unsafety supervenes on their behaviour as opaque mathematical functions (except for very niche threat models where external adversaries are corrupting the weights or activations directly). If two models have the same input/output mapping, and only one of them has “internally repres

davidad2yΩ4104

Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility.

World-Model Interpretability Is All We Need

davidad2yΩ490

In my plan, interpretable world-modeling is a key component of Step 1, but my idea there is to build (possibly just by fine-tuning, but still) a bunch of AI modules specifically for the task of assisting in the construction of interpretable world models. In step 2 we’d throw those AI modules away and construct a completely new AI policy which has no knowledge of the world except via that human-understood world model (no direct access to data, just simulations). This is pretty well covered by your routes numbered 2 and 3 in section 1A, but I worry those poi... (read more)

1wassname1y

Ah, now it makes sense. I was wondering how world model interpretability leads to alignment rather than control. After all, I don't think you will get far controlling something smarter than you against its will. But alignment of value could scale with large gaps in intelligence. In that 2nd phase, there are a few things you can do. E.g. the 2nd phase reward function could include world model concepts like "virtue", or you could modify the world model before training.

Categorizing failures as “outer” or “inner” misalignment is often confused

davidad2yΩ6137

From the perspective of Reframing Inner Alignment, both scenarios are ambiguous because it's not clear whether

you really had a policy-scoring function that was well-defined by the expected value over the cognitive processes that humans use to evaluate pull requests under normal circumstances, but then imperfectly evaluated it by failing to sample outside normal circumstances, or
your policy-scoring "function" was actually stochastic and "defined" by the physical process of humans interacting with the AI's actions and clicking Merge buttons, and this incorre

... (read more)

2Rohin Shah2y

Yup, this is the objective-based categorization, and as you've noted it's ambiguous on the scenarios I mention because it depends on how you choose the "definition" of the design objective (aka policy-scoring function).

Side-channels: input versus output

davidad2yΩ130

I think subnormals/denormals are quite well motivated; I’d expect at least 10% of alien computers to have them.

Quiet NaN payloads are another matter, and we should filter those out. These are often lumped in with nondeterminism issues—precisely because their behavior varies between platform vendors.
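
One way to picture the filtering is as a canonicalization pass over output bit patterns. The following is a minimal sketch, not a description of any actual box design: it assumes IEEE 754 binary64 values, and the names `canonical_bits` and `CANONICAL_QNAN_BITS` are hypothetical. Every NaN collapses to a single pattern, so the payload carries no information, while subnormals pass through untouched.

```python
import math
import struct

# Assumption: IEEE 754 binary64, with this bit pattern chosen as the single
# canonical quiet NaN (sign 0, quiet bit set, zero payload).
CANONICAL_QNAN_BITS = 0x7FF8000000000000


def float_to_bits(x: float) -> int:
    """Return the 64-bit pattern of a Python float as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_float(bits: int) -> float:
    """Reinterpret a 64-bit unsigned integer as a Python float."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def canonical_bits(x: float) -> int:
    """Bit pattern to emit for x: every NaN collapses to one pattern,
    every other value (including subnormals and -0.0) passes through."""
    if math.isnan(x):
        return CANONICAL_QNAN_BITS
    return float_to_bits(x)


payload_nan = bits_to_float(0x7FF8000000000ABC)  # quiet NaN, nonzero payload
smallest_subnormal = 5e-324                      # bit pattern 0x1

print(hex(canonical_bits(payload_nan)))
print(hex(canonical_bits(smallest_subnormal)))
```

The function returns integers rather than floats on purpose: whether a payload survives a round trip through a float register can itself depend on the platform, which is the vendor-specific behaviour in question.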

3Donald Hobson2y

I think each little decision is throwing another few bits of info. A few bits for deciding how big the mantissa and exponent should be. A few bits for it being a 64-bit float. A few bits for subnormals. A few bits for inf and NaN. A few bits for rounding errors. A bit for -0. And it all adds up. Not that we know how many bits the AI needs. If there is one standard computer architecture that all aliens use, then the AI can hack with very little info. If all alien computers have wildly different architectures, then floats carry a fair bit of info.

Side-channels: input versus output

davidad2yΩ120

I think binary floating-point representations are very natural throughout the multiverse. Binary and ternary are the most natural ways to represent information in general, and floating-point is an obvious way to extend the range (or, more abstractly, the laws of probability alone suggest that logarithms are more interesting than absolute figures when extremely close or far from zero).

If we were still using 10-digit decimal words like the original ENIAC and other early computers, I'd be slightly more concerned. The fact that all human computer makers transitioned to power-of-2 binary words instead is some evidence for the latter being convergently natural rather than idiosyncratic to our world.

4Donald Hobson2y

Sure, binary is fairly natural, but there are a lot of details of IEEE floats that aren't. https://en.wikipedia.org/wiki/Subnormal_number

1TAG2y

Binary might be an attractor, but there are a lot of ways of implementing floating point in binary.

3TekhneMakre2y

He's saying that since floating point arithmetic isn't necessarily associative, you can tell something about how some abstract function like the sum of a list is actually implemented / computed; and that partial info points at some architectures more than others.
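
A small illustration of that point, as a sketch assuming IEEE 754 binary64 with round-to-nearest-even (which is what Python floats use on common platforms). The two helper functions are hypothetical stand-ins for "how the sum is actually implemented": the same abstract sum over the same list gives different answers depending on evaluation order, so the answer reveals something about the implementation.

```python
import math


def sum_left_to_right(xs):
    """Accumulate from the first element to the last."""
    total = 0.0
    for x in xs:
        total += x
    return total


def sum_right_to_left(xs):
    """Accumulate from the last element to the first."""
    total = 0.0
    for x in reversed(xs):
        total += x
    return total


values = [1e16, -1e16, 1.0]

# Left to right: (1e16 + -1e16) + 1.0 cancels first, then adds 1.0.
print(sum_left_to_right(values))   # 1.0
# Right to left: 1.0 + -1e16 rounds back to -1e16, so the 1.0 is lost.
print(sum_right_to_left(values))   # 0.0
# math.fsum tracks the lost low-order bits and returns the exact sum.
print(math.fsum(values))           # 1.0
```

An observer who sees 0.0 rather than 1.0 learns that the summation ran right to left without compensation, which is one small piece of partial information about the underlying architecture.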

An Open Agency Architecture for Safe Transformative AI

davidad2yΩ120

The informal processes humans use to evaluate outcomes are buggy and inconsistent (across humans, within humans, across different scenarios that should be equivalent, etc.). (Let alone asking humans to evaluate plans!) The proposal here is not to aim for coherent extrapolated volition, but rather to identify a formal property $Q$ (presumably a conjunction of many other properties, etc.) such that $Q$ conservatively implies that some of the most important bad things are limited and that there’s some baseline minimum of good things (e.g. everyone has access to reso... (read more)

An Open Agency Architecture for Safe Transformative AI

davidad2yΩ250

Shouldn't we plan to build trust in AIs in ways that don't require humans to do things like vet all changes to its world-model?

Yes, I agree that we should plan toward a way to trust AIs as something more like virtuous moral agents rather than as safety-critical systems. I would prefer that. But I am afraid those plans will not reach success before AGI gets built anyway, unless we have a concurrent plan to build an anti-AGI defensive TAI that requires less deep insight into normative alignment.

An Open Agency Architecture for Safe Transformative AI

davidad2yΩ350

In response to your linked post, I do have similar intuitions about “Microscope AI” as it is typically conceived (i.e. to examine the AI for problems using mechanistic interpretability tools before deploying it). Here I propose two things that are a little bit like Microscope AI but in my view both avoid the core problem you’re pointing at (i.e. a useful neural network will always be larger than your understanding of it, and that matters):

Model-checking policies for formal properties. A model-checker (unlike a human interpreter) works with the entire net

... (read more)