All of davidad's Comments + Replies

Note however that having more powerful internal-only models “analyzing patterns” across multiple conversations, and in a position to affect change (especially by intervening on individual conversations while retaining long-term memories), would worsen the potential for AI systems to carry out coordinated scheming campaigns.

This could be mitigated by combining it with privacy-preservation architectures such as Anthropic’s existing work on Clio.

4Jan_Kulveit
Overall yes: what I was imagining is mostly just adding scalable bi-directionality, where, for example, if a lot of Assistants are running into similar confusing issue, it gets aggregated, the principal decides how to handle it in abstract, and the "layer 2" support disseminates the information. So, greater power to scheme would be coupled with stronger human-in-the loop component & closer non-AI oversight.

Yes, I am indeed thinking about this.

  1. Let’s first consider the easiest case, in which we ask for a solution/output to be not merely provably correct according to some spec, but provably unique. Then there is clearly no space for steganography.
  2. It is a little bit harder if the solution has some symmetries, like permutations of the order in which things are serialized. For this we can use sorting and other normalization techniques (converting an output to a unique representative of its equivalence class).
  3. If normalization is hard, we can set up the box to b
... (read more)
1Geoffrey Irving
I would expect there to be a time complexity blowup if you try to drive the entropy all the way to zero, unfortunately: such things usually have a multiplier like log(1/ϵ) where ϵ is the desired entropy leakage. In practice I think that would make it feasible to not leak something like a bit per sentence, and then if you have 1000 sentence you have 1000 bits. That may mean you can get a "not 1GB" guarantee, but not something smaller than that.

Paralysis of the form "AI system does nothing" is the most likely failure mode. This is a "de-pessimizing" agenda at the meta-level as well as at the object-level. Note, however, that there are some very valuable and ambitious tasks (e.g. build robots that install solar panels without damaging animals or irreversibly affecting existing structures, and only talking to people via a highly structured script) that can likely be specified without causing paralysis, even if they fall short of ending the acute risk period.

"Locked into some least-harmful path" is ... (read more)

2Joe Collman
(understood that you'd want to avoid the below by construction through the specification) I think the worries about a "least harmful path" failure mode would also apply to a "below 1 catastrophic event per millennium" threshold. It's not obvious to me that the vast majority of ways to [avoid significant risk of catastrophe-according-to-our-specification] wouldn't be highly undesirable outcomes. It seems to me that "greatly penalize the additional facts which are enforced" is a two-edged sword: we want various additional facts to be highly likely, since our acceptability specification doesn't capture everything that we care about. I haven't thought about it in any detail, but doesn't using time-bounded utility functions also throw out any acceptability guarantee for outcomes beyond the time-bound?

It seems plausible to me that, until ambitious value alignment is solved, ASL-4+ systems ought not to have any mental influences on people other than those which factor through the system's pre-agreed goals being achieved in the world. That is, ambitious value alignment seems like a necessary prerequisite for the safety of ASL-4+ general-purpose chatbots. However, world-changing GDP growth does not require such general-purpose capabilities to be directly available (rather than available via a sociotechnical system that involves agreeing on specifications a... (read more)

8Joe Collman
[again, the below is all in the spirit of "I think this direction is plausibly useful, and I'd like to see more work on it"] Sure, but this seems to say "Don't worry, the malicious superintelligence can only manipulate your mind indirectly". This is not the level of assurance I want from something calling itself "Guaranteed safe". This is one mechanism by which such a system could cause great downstream harm. Suppose that we have a process to avoid this. What assurance do we have that there aren't other mechanisms to cause harm? I don't yet buy the description complexity penalty argument (as I currently understand it - but quite possibly I'm missing something). It's possible to manipulate by strategically omitting information. Perhaps the "penalise heavily biased sampling" is intended to avoid this (??). If so, I'm not sure how this gets us more than a hand-waving argument. I imagine it's very hard to do indirect manipulation without adding much complexity. I imagine that ASL-4+ systems are capable of many very hard things. Similar reasoning leads me to initial skepticism of all [safety guarantee by penalizing some-simple-x] claims. This amounts to a claim that reducing x necessarily makes things safer - which I expect is untrue for any simple x. I can buy that there are simple properties whose reduction guarantees safety if it's done to an extreme degree - but then I'm back to expecting the system to do nothing useful. As an aside, I'd note that such processes (e.g. complexity penalties) seem likely to select out helpful behaviours too. That's not a criticism of the overall approach - I just want to highlight that I don't think we get to have both [system provides helpful-in-ways-we-hadn't-considered output] and [system can't produce harmful output]. Allowing the former seems to allow the latter. That's probably a good idea, but this kind of approach doesn't seem in keeping with a "Guaranteed safe" label. More of a "We haven't yet found a way in which this is

Re footnote 2, and the claim that the order matters, do you have a concrete example of a homogeneous ultradistribution that is affine in one sense but not the other?

4Vanessa Kosoy
Sorry, that footnote is just flat wrong, the order actually doesn't matter here. Good catch! There is a related thing which might work, namely taking the downwards closure of the affine subspace w.r.t. some cone which is somewhat larger than the cone of measures. For example, if your underlying space has a metric, you might consider the cone of signed measures which have non-negative integral with all positive functions whose logarithm is 1-Lipschitz.

The "random dictator" baseline should not be interpreted as allowing the random dictator to dictate everything, but rather to dictate which Pareto improvement is chosen (with the baseline for "Pareto improvement" being "no superintelligence"). Hurting heretics is not a Pareto improvement because it makes those heretics worse off than if there were no superintelligence.

Thank you for the clarification. This proposal is indeed importantly different from the PCEV proposal. But since hurting heretics is a moral imperative, any AI that allows heretics to escape punishment, will also be seen as unacceptable by at least some people. This means that the set of Pareto improvements is empty.

In other words: hurting heretics is indeed off the table in your proposal (which is an important difference compared to PCEV). However, any scenario that includes the existence of an AI, that allow heretics to escape punishment, is also off the... (read more)

Yes. You will find more details in his paper, Provably safe systems with Steve Omohundro, in which I am listed in the acknowledgments (under my legal name, David Dalrymple).

Max and I also met and discussed the similarities in advance of the AI Safety Summit in Bletchley.

I agree that each of and has two algebraically equivalent interpretations, as you say, where one is about inconsistency and the other is about inferiority for the adversary. (I hadn’t noticed that).

The variant still seems somewhat irregular to me; even though Diffractor does use it in Infra-Miscellanea Section 2, I wouldn’t select it as “the” infrabayesian monad. I’m also confused about which one you’re calling unbounded. It seems to me like the variant is bounded (on both sides) whereas the variant is bounded on one side, and... (read more)

These are very good questions. First, two general clarifications:

A. «Boundaries» are not partitions of physical space; they are partitions of a causal graphical model that is an abstraction over the concrete physical world-model.

B. To "pierce" a «boundary» is to counterfactually (with respect to the concrete physical world-model) cause the abstract model that represents the boundary to increase in prediction error (relative to the best augmented abstraction that uses the same state-space factorization but permits arbitrary causal dependencies crossing the ... (read more)

Reply5422111
1Chipmonk
Here's a tricky example I've been thinking about: Is a cell getting infected by a virus a boundary violation? What I think makes this tricky is that viruses generally don't physically penetrate cell membranes. Instead, cells just "let in" some viruses (albeit against their better judgement).  ---------------------------------------- Then once you answer the above, please also consider: Is a cell taking in nutrients from its environment a boundary violation? I don't know what makes this different from the virus example (at least as long as we're not allowed to refer to preferences).
1Chipmonk
I want to give a big +1 on preventing membrane piercing not just by having AIs respect membranes, but also by using technology to empower membranes to be stronger and better at self-defense.
1Chipmonk
Thanks for writing this! I largely agree (and the rest I need to think more about)
2the gears to ascension
Unfortunately this is probably not on the table, as they are currently being used as weapons in economic warfare between the USA, China, and everyone else. tiktok primarily educational inside china. Advertisers have direct incentive to violate. We need a way to use <<membranes>> that will, on the margin, help protect against anyone violating them, not just avoid doing so itself.
3the gears to ascension
You're sure this is the case even if the disease is about to violate the <<boundary>> and the cure will prevent that?

For the record, as this post mostly consists of quotes from me, I can hardly fail to endorse it.

Kosoy's infrabayesian monad  is given by 

There are a few different varieties of infrabayesian belief-state, but I currently favour the one which is called "homogeneous ultracontributions", which is "non-empty topologically-closed ⊥–closed convex sets of subdistributions", thus almost exactly the same as Mio-Sarkis-Vignudelli's "non-empty finitely-generated ⊥–closed convex sets of subdistributions monad" (Definition 36 of this paper), with the difference being essentially that it's presentable, but it's much more like ... (read more)

2Cleo Nardo
For the sake of potential readers, a (full) distribution over X is some γ:X→[0,1] with finite support and ∑x∈Xγ(x)=1, whereas a subdistribution over X is some γ:X→[0,1] with finite support and ∑x∈Xγ(x)≤1. Note that a subdistribution γ over X is equivalent to a full distribution over X+1, where X+1 is the disjoint union of X with some additional element, so the subdistribution monad can be written Δ(−+1). Doesn't the Nirvana Trick basically say that these two interpretations are equivalent? Let (−+2) be X↦X+{0,1} and let (−+1) be X↦X+{0}. We can interpret ∨ as possibility, 0 as a hypothesis consistent with no observations, and 1 as a hypothesis consistent with all observations. Alternatively, we can interpret ∨ as the free choice made by an adversary, 0 as "the game terminates and our agent receives minimal disutility", and 1 as "the game terminates and our agent receives maximal disutility". These two interpretations are algebraically equivalent, i.e. (∨,0,1) is a topped and bottomed semilattice. Unless I'm mistaken, both P+f∘Δ∘(−+2) and P+f∘Δ∘(−+1) demand that the agent may have the hypothesis "I am certain that I will receive minimal disutility", which is necessary for the Nirvana Trick. But P+f∘Δ∘(−+2) also demands that the agent may have the hypothesis "I am certain that I will receive maximal disutility". The first gives bounded infrabayesian monad and the second gives unbounded infrabayesian monad. Note that Diffractor uses P+f∘Δ∘(−+2) in Infra-Miscellanea Section 2.

Does this article have any practical significance, or is it all just abstract nonsense? How does this help us solve the Big Problem? To be perfectly frank, I have no idea. Timelines are probably too short agent foundations, and this article is maybe agent foundations foundations...

I do think this is highly practically relevant, not least of which because using an infrabayesian monad instead of the distribution monad can provide the necessary kind of epistemic conservatism for practical safety verification in complex cyber-physical systems like the biospher... (read more)

Meyer's

If this is David Jaz Myers, it should be "Myers' thesis", here and elsewhere

I have said many times that uploads created by any process I know of so far would probably be unable to learn or form memories. (I think it didn't come up in this particular dialogue, but in the unanswered questions section Jacob mentions having heard me say it in the past.)

Eliezer has also said that makes it useless in terms of decreasing x-risk. I don't have a strong inside view on this question one way or the other. I do think if Factored Cognition is true then "that subset of thinking is enough," but I have a lot of uncertainty about whether Factored C... (read more)

davidad*Ω112812

I think AI Safety Levels are a good idea, but evals-based classification needs to be complemented by compute thresholds to mitigate the risks of loss of control via deceptive alignment. Here is a non-nebulous proposal.

I like the idea of trying out H-JEPA with GFlowNet actors.

I also like the idea of using LLM-based virtue ethics as a regularizer, although I would still want deontic guardrails that seem good enough to avoid catastrophe.

That’s basically correct. OAA is more like a research agenda and a story about how one would put the research outputs together to build safe AI, than an engineering agenda that humanity entirely knows how to build. Even I think it’s only about 30% likely to work in time.

I would love it if humanity had a plan that was more likely to be feasible, and in my opinion that’s still an open problem!

2niplav
Thanks for the clarification!

OAA bypasses the accident version of this by only accepting arguments from a superintelligence that have the form “here is why my proposed top-level plan—in the form of a much smaller policy network—is a controller that, when combined with the cyberphysical model of an Earth-like situation, satisfies your pLTL spec.” There is nothing normative in such an argument; the normative arguments all take place before/while drafting the spec, which should be done with AI assistants that are not smarter-than-human (CoEm style).

There is still a misuse version: someon... (read more)

7Wei Dai
Ok I'll quote 5.1.4-5 to make it easier for others to follow this discussion: I'm not sure how these are intended to work. How do you intend to define/implement "divergence"? How does that definition/implementation combined with "high degree of Knightian uncertainty about human decisions and behaviour" actually cause the AI to "not interfere" but also still accomplish the goals that we give it? In order to accomplish its goals, the AI has to do lots of things that will have butterfly effects on the future, so the system has to allow it to do those things, but also not allow it to "propagandize to humans". It's just unclear to me how you intend to achieve this.

It is often considered as such, but my concern is less with “the alignment question” (how to build AI that values whatever its stakeholders value) and more with how to build transformative AI that probably does not lead to catastrophe. Misuse is one of the ways that it can lead to catastrophe. In fact, in practice, we have to sort misuse out sooner than accidents, because catastrophic misuses become viable at a lower tech level than catastrophic accidents.

That being said— I don’t expect existing model-checking methods to scale well. I think we will need to incorporate powerful AI heuristics into the search for a proof certificate, which may include various types of argument steps not limited to a monolithic coarse-graining (as mentioned in my footnote 2). And I do think that relies on having a good meta-ontology or compositional world-modeling framework. And I do think that is the hard part, actually! At least, it is the part I endorse focusing on first. If others follow your train of thought to narrow in o... (read more)

3Daniel Murfet
Thanks, that makes a lot of sense to me. I have some technical questions about the post with Owen Lynch, but I'll follow up elsewhere.
davidadΩ6100

I think you’re directionally correct; I agree about the following:

  • A critical part of formally verifying real-world systems involves coarse-graining uncountable state spaces into (sums of subsets of products of) finite state spaces.
  • I imagine these would be mostly if not entirely learned.
  • There is a tradeoff between computing time and bound tightness.

However, I think maybe my critical disagreement is that I do think probabilistic bounds can be guaranteed sound, with respect to an uncountable model, in finite time. (They just might not be tight enough to... (read more)

5davidad
That being said— I don’t expect existing model-checking methods to scale well. I think we will need to incorporate powerful AI heuristics into the search for a proof certificate, which may include various types of argument steps not limited to a monolithic coarse-graining (as mentioned in my footnote 2). And I do think that relies on having a good meta-ontology or compositional world-modeling framework. And I do think that is the hard part, actually! At least, it is the part I endorse focusing on first. If others follow your train of thought to narrow in on the conclusion that the compositional world-modeling framework problem, as Owen Lynch and I have laid it out in this post, is potentially “the hard part” of AI safety, that would be wonderful…

Yes, the “shutdown timer” mechanism is part of the policy-scoring function that is used during policy optimization. OAA has multiple stages that could be considered “training”, and policy optimization is the one that is closest to the end, so I wouldn’t call it “the training stage”, but it certainly isn’t the deployment stage.

We hope not merely that the policy only cares about the short term, but also that it cares quite a lot about gracefully shutting itself down on time.

There’s something to be said for this, because with enough RLHF, GPT-4 does seem to have become pretty corrigible, especially compared to Bing Sydney. However, that corrigible persona is probably only superficial, and the larger and more capable a single Transformer gets, the more of its mesa-optimization power we can expect will be devoted to objectives which are uninfluenced by in-context corrections.

3MiguelDev
Sorry for not specifying the method, but I wasn't referring to RL-based or supervised learning methods. There's a lot of promise in using a smaller dataset that explains corrigibility characteristics, as well as a shutdown mechanism, all fine-tuned through unsupervised learning. I have a prototype at this link where I modified GPT2-XL to mention a shutdown phrase whenever all of its attention mechanisms activate and determine that it could harm humans due to its intelligence. I used unsupervised learning to allow patterns from a smaller dataset to achieve this.

A system with a shutdown timer, in my sense, has no terms in its reward function which depend on what happens after the timer expires. (This is discussed in more detail in my previous post.) So there is no reason to persuade humans or do anything else to circumvent the timer, unless there is an inner alignment failure (maybe that’s what you mean by “deception instance”). Indeed, it is the formal verification that prevents inner alignment failures.

Suppose Training Run Z is a finetune of Model Y, and Model Y was the output of Training Run Y, which was already a finetune of Foundation Model X produced by Training Run X (all of which happened after September 2021). This is saying that not only Training Run Y (i.e. the compute used to produce one of the inputs to Training Run Z), but also Training Run X (a “recursive” or “transitive” dependency), count additively against the size limit for Training Run Z.

Less difficult than ambitious mechanistic interpretability, though, because that requires human comprehension of mechanisms, which is even more difficult.

The formal desiderata should be understood, reviewed, discussed, and signed-off on by multiple humans. However, I don't have a strong view against the use of Copilot-style AI assistants. These will certainly be extremely useful in the world-modeling phase, and I suspect will probably also be worth using in the specification phase. I do have a strong view that we should have automated red-teamers try to find holes in the desiderata.

I think formal verification belongs in the "requires knowing what failure looks like" category.

For example, in the VNN competition last year, some adversarial robustness properties were formally proven about VGG16. This requires white-box access to the weights, to be sure, but I don't think it requires understanding "how failure happens".

1scasper
Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human's comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless. 

Yes—assuming that the pause interrupts any anticipatory gradient flows from the continuing agent back to the agent which is considering whether to pause.

This pattern is instantiated in the Open Agency Architecture twice:

  1. Step 2 generates top-level agents which are time-bounded at a moderate timescale (~days), with the deliberation about whether to redeploy a top-level agent being carried out by human operators.
  2. In Step 4, the top-level agent dispatches most tasks by deploying narrower low-level agents with much tighter time bounds, with the deliberation a
... (read more)

For what it's worth, the phrase "night watchman" as I use it is certainly downstream of Nozick's concept.

Some direct quantitative comparison between activation-steering and task-vector-steering (at, say, reducing toxicity) is indeed a very sensible experiment for a peer reviewer to ask for and I would like to see it as well.

davidadΩ71519

On the object-level, deriving task vectors in weight-space from deltas in fine-tuned checkpoints is really different from what was done here, because it requires doing a lot of backward passes on a lot of data. Deriving task vectors in activation-space, as done in this new work, requires only a single forward pass on a truly tiny amount of data. So the data-efficiency and compute-efficiency of the steering power gained with this new method is orders of magnitude better, in my view.

Also, taking affine combinations in weight-space is not novel to Schmidt et ... (read more)

Dan HΩ61313

It's a good observation that it's more efficient; does it trade off performance? (These sorts of comparisons would probably be demanded if it was submitted to any other truth-seeking ML venue, and I apologize for consistently being the person applying the pressures that generic academics provide. It would be nice if authors would provide these comparisons.)

 

Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabili

... (read more)

Thanks for bringing all of this together - I think this paints a fine picture of my current best hope for deontic sufficiency. If we can do better than that, great!

I agree that we should start by trying this with far simpler worlds than our own, and with futarchy-style decision-making schemes, where forecasters produce extremely stylized QURI-style models that map from action-space to outcome-space while a broader group of stakeholders defines mappings from output-space to each stakeholder’s utility.

Every distribution (that agrees with the base measure about null sets) is a Boltzmann distribution. Simply define , and presto, .

This is a very useful/important/underrated fact, but it does somewhat trivialize “Boltzmann” and “maximum entropy” as classes of distributions, rather than as certain ways of looking at distributions.

A related important fact is that temperature is not really a physical quantity, but is: it’s known as inverse temperature or . (The nonexistence of zero-temperature systems, the existence of negat... (read more)

4Alexander Gietelink Oldenziel
I am a little confused about this. It was my understanding that exponential families are distinguished class of families of distributions. For instance, they are regular (rather than singular). The family of mixed Gaussians is not an exponential family I believe. So my conclusion would be that the while "being Boltzmann" for a distribution is trivial as you point out, "being Boltzmann" (= exponential) for a family is nontrivial.

Note, assuming the test/validation distribution is an empirical dataset (i.e. a finite mixture of Dirac deltas), and the original graph is deterministic, the of the pushforward distributions on the outputs of the computational graph will typically be infinite. In this context you would need to use a Wasserstein divergence, or to "thicken" the distributions by adding absolutely-continuous noise to the input and/or output.

Or maybe you meant in cases where the output is a softmax layer and interpreted as a probability distribution, in which case ... (read more)

1Lucius Bushnaq
Second paragraph is what I meant, thanks.

As an alternative summary statistic of the extent to which the ablated model performs worse on average, I would suggest the Bayesian Wilcoxon signed-rank test.

In computer science this distinction is often made between extensional (behavioral) and intensional (mechanistic) properties (example paper).

For the record, the canonical solution to the object-level problem here is Shapley Value. I don’t disagree with the meta-level point, though: a calculation of Shapley Value must begin with a causal model that can predict outcomes with any subset of contributors removed.

5lalaithion
I walked through some examples of Shapley Value here, and I'm not so sure it satisfies exactly what we want on an object level. I don't have a great realistic example here, but Shapley Value assigns counterfactual value to individuals who did in fact not contribute at all, if they would have contributed were your higher-performers not present. So you can easily have "dead weight" on a team which has a high Shapley Value, as long as they could provide value if their better teammates were gone.
2adamShimi
Thanks for the pointer!

I think there’s something a little bit deeply confused about the core idea of “internal representation” and that it’s also not that hard to fix.

  1. I think it’s important that our safety concepts around trained AI models/policies respect extensional equivalence, because safety or unsafety supervenes on their behaviour as opaque mathematical functions (except for very niche threat models where external adversaries are corrupting the weights or activations directly). If two models have the same input/output mapping, and only one of them has “internally repres

... (read more)
davidadΩ4104

Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility.

In my plan, interpretable world-modeling is a key component of Step 1, but my idea there is to build (possibly just by fine-tuning, but still) a bunch of AI modules specifically for the task of assisting in the construction of interpretable world models. In step 2 we’d throw those AI modules away and construct a completely new AI policy which has no knowledge of the world except via that human-understood world model (no direct access to data, just simulations). This is pretty well covered by your routes numbered 2 and 3 in section 1A, but I worry those poi... (read more)

1wassname
Ah, now it makes sense. I was wondering how world model interpretability leads to alignment rather than control. After all, I don't think you will get far controlling something smarter than you against its will. But alignment of value could scale with large gaps in intelligence. When that 2nd phase, there are a few things you can do. E.g the 2nd phase reward function could include world model concepts like "virtue", or you could modify the world model before training.
davidadΩ6137

From the perspective of Reframing Inner Alignment, both scenarios are ambiguous because it's not clear whether

  • you really had a policy-scoring function that was well-defined by the expected value over the cognitive processes that humans use to evaluate pull requests under normal circumstances, but then imperfectly evaluated it by failing to sample outside normal circumstances, or
  • your policy-scoring "function" was actually stochastic and "defined" by the physical process of humans interacting with the AI's actions and clicking Merge buttons, and this incorre
... (read more)
2Rohin Shah
Yup, this is the objective-based categorization, and as you've noted it's ambiguous on the scenarios I mention because it depends on how you choose the "definition" of the design objective (aka policy-scoring function).

I think subnormals/denormals are quite well motivated; I’d expect at least 10% of alien computers to have them.

Quiet NaN payloads are another matter, and we should filter those out. These are often lumped in with nondeterminism issues—precisely because their behavior varies between platform vendors.

3Donald Hobson
I think each little decision is throwing another few bits of info. A few bits for deciding how big the mantisa and exponent should be. A few bits for it being a 64 bit float. A few bits for subnormals. A few bits for inf and Nan.  A few bits for rounding errors. A bit for -0. And it all adds up. Not that we know how many bits the AI needs. If there is one standard computer architecture that all aliens use, then the AI can hack with very little info. If all alien computers have wildly different architectures, then floats carry a fair bit of info. 

I think binary floating-point representations are very natural throughout the multiverse. Binary and ternary are the most natural ways to represent information in general, and floating-point is an obvious way to extend the range (or, more abstractly, the laws of probability alone suggest that logarithms are more interesting than absolute figures when extremely close or far from zero).

If we were still using 10-digit decimal words like the original ENIAC and other early computers, I'd be slightly more concerned. The fact that all human computer makers transitioned to power-of-2 binary words instead is some evidence for the latter being convergently natural rather than idiosyncratic to our world.

4Donald Hobson
Sure, binary is fairly natural, but there are a lot of details of IEEE floats that aren't. https://en.wikipedia.org/wiki/Subnormal_number
1TAG
Binary might be a attractor, but there's a lot of ways of implementing floating point in binary.
3TekhneMakre
He's saying that since floating point arithmetic isn't necessarily associative, you can tell something about how some abstract function like the sum of a list is actually implemented / computed; and that partial info points at some architectures more than others. 

The informal processes humans use to evaluate outcomes are buggy and inconsistent (across humans, within humans, across different scenarios that should be equivalent, etc.). (Let alone asking humans to evaluate plans!) The proposal here is not to aim for coherent extrapolated volition, but rather to identify a formal property (presumably a conjunct of many other properties, etc.) such that conservatively implies that some of the most important bad things are limited and that there’s some baseline minimum of good things (e.g. everyone has access to reso... (read more)

Shouldn't we plan to build trust in AIs in ways that don't require humans to do things like vet all changes to its world-model?

Yes, I agree that we should plan toward a way to trust AIs as something more like virtuous moral agents rather than as safety-critical systems. I would prefer that. But I am afraid those plans will not reach success before AGI gets built anyway, unless we have a concurrent plan to build an anti-AGI defensive TAI that requires less deep insight into normative alignment.

In response to your linked post, I do have similar intuitions about “Microscope AI” as it is typically conceived (i.e. to examine the AI for problems using mechanistic interpretability tools before deploying it). Here I propose two things that are a little bit like Microscope AI but in my view both avoid the core problem you’re pointing at (i.e. a useful neural network will always be larger than your understanding of it, and that matters):

  1. Model-checking policies for formal properties. A model-checker (unlike a human interpreter) works with the entire net
... (read more)
Load More