All of Charlie Steiner's Comments + Replies

If God has ordained some "true values," and we're just trying to find out what pattern has received that blessing, then yes, this is totally possible: God can ordain values that have their most natural description at any level He wants.

On the other hand, if we're trying to find good generalizations of the way we use the notion of "our values" in everyday life, then no, we should be really confident that generalizations that have simple descriptions in terms of chemistry are not going to be good.

Thanks!

Any thoughts on how this line of research might lead to "positive" alignment properties? (i.e. Getting models to be better at doing good things in situations where what's good is hard to learn / figure out, in contrast to a "negative" property of avoiding doing bad things, particularly in cases clear enough we could build a classifier for them.)

cloud
Thanks for the question! Yeah, the story is something like: structuring model internals gives us more control over how models generalize limited supervision. For example, maybe we can factor out how a model represents humans vs. how it represents math concepts, then localize RLHF updates on math research to the math concept region. This kind of learning update would plausibly reduce the extent to which a model learns (or learns to exploit) human biases, increasing the odds that the model generalizes in an intended way from misspecified feedback. Another angle is: if we create models with selective incapacities (e.g. lack of situational awareness), the models might lack the concepts required to misgeneralize from our feedback. For example, suppose a situationally unaware model explores a trajectory that involves subversively manipulating its environment and receives higher-than-average reward; as a result, the model will be updated towards that behavior. However, since the model lacks the concepts required to internalize the behavioral tendency "gain control over my environment," it won't learn that tendency. Instead, the trajectory might simply serve as noise.
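A minimal sketch of what "localize the update to the math concept region" could mean mechanically; the model, region, and parameter names below are hypothetical, and a real method would more likely act on subspaces or circuits identified by interpretability tools than on whole named parameters:

```python
# Hypothetical sketch: apply a fine-tuning update only to a designated
# "math concepts" region of the model, leaving everything else untouched.
import torch

def localized_step(model, loss, optimizer, math_region: set):
    optimizer.zero_grad()
    loss.backward()
    for name, param in model.named_parameters():
        if name not in math_region and param.grad is not None:
            param.grad.zero_()  # no update outside the designated region
    optimizer.step()
```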

The second thing impacts the first thing :) If a lot of scheming is due to poor reward structure, and we should work on better reward structure, then we should work on scheming prevention.

Very interesting!

It would be interesting to know what the original reward models would say here - does the "screaming" score well according to the model of what humans would reward (or what human demonstrations would contain, depending on type of reward model)?

My suspicion is that the model has learned that apologizing, expressing distress etc after making a mistake is useful for getting reward. And also that you are doing some cherrypicking.

At the risk of making people do more morally grey things, have you considered doing a similar experiment with models... (read more)

I'm a big fan! Any thoughts on how to incorporate different sorts of reflective data, e.g. different measures of how people think mediation "should" go?

NicholasKees
The authors focus on measuring consensus and whether the process toward consensus was fair, and come up with their measures accordingly. This is because, as they see it, "finding common ground is a precursor to collective action." Some other possible goals (just spitballing):
* Shrinking the perception gap, or how well people can predict the opinions of people they disagree with (weaker forms of ITT?). There's some research showing that this gap GROWS when people interact with social media, and you might be able to engineer and measure a reversal of that trend.
* Identifying cruxes and double cruxes with mediation.
* Finding latent coalitions. If a discussion is dominated by a primary axis of disagreement, other axes of disagreement will be occluded (around which a majority coalition could be formed). Finding these other axes is a bit of what we're trying to do here.
* Moving from abstract disagreement to concrete (empirical?) disagreements.

I don't get what experiment you are thinking about (most CoT end with the final answer, such that the summarized CoT often ends with the original final answer).

Hm, yeah, I didn't really think that through. How about giving a model a fraction of either its own precomputed chain of thought, or the summarized version, and plotting curves of accuracy and further tokens used vs. % of CoT given to it? (To avoid systematic error from summaries moving information around, doing this with a chunked version and comparing at each chunk seems like a good idea.)

Anyhow, thanks for the reply. I have now seen the last figure.

Do you have the performance on replacing CoTs with summarized CoTs without finetuning to produce them? Would be interesting.

"Steganography" I think gives the wrong picture of what I expect - it's not that the model would be choosing a deliberately obscure way to encode secret information. It's just that it's going to use lots of degrees of freedom to try to get better results, often not what a human would do.

A clean example would be sometimes including more tokens than necessary, so that it can do more parallel processing at those tokens. This is quite diff... (read more)

Fabien Roger
I don't get what experiment you are thinking about (most CoT end with the final answer, such that the summarized CoT often ends with the original final answer).

This is why "encoded reasoning" is maybe a better expression. The experiments to evaluate whether it is present stay the same.

I agree this is a concern. This can be tested more directly by adding filler tokens, and I find that adding filler tokens (for base model L) doesn't perform better than having the final answer directly (see last figure).

I agree it won't be very clean. But the most scary scenarios are the ones where an AI can actually have thoughts that are independent of what the Chain-of-Thought looks like, since this seriously compromises CoT monitoring. So while I am curious about the ways in which LLM CoT work in ways more subtle than the "naive way", I think this is much lower stakes than figuring out if LLMs can do proper encoded reasoning.

I am more worried about things like "semantic encoded reasoning" which paraphrasing would not remove, but I would guess there is as little "semantic encoded reasoning" as there is "syntactic encoded reasoning" in current LLMs.

Well, I'm disappointed.

Everything about misuse risks and going faster to Beat China, nothing about accident/systematic risks. I guess "testing for national security capabilities" is probably in practice code for "some people will still be allowed to do AI alignment work," but that's not enough.

I really would have hoped Anthropic could be realistic and say "This might go wrong. Even if there's no evil person out there trying to misuse AI, bad things could still happen by accident, in a way that needs to be fixed by changing what AI gets built in the first place, not just testing it afterwards. If this was like making a car, we should install seatbelts and maybe institute a speed limit."

I think it's about salience. If you "feel the AGI," then you'll automatically remember that transformative AI is a thing that's probably going to happen, when relevant (e.g. when planning AI strategy, or when making 20-year plans for just about anything). If you don't feel the AGI, then even if you'll agree when reminded that transformative AI is a thing that's probably going to happen, you don't remember it by default, and you keep making plans (or publishing papers about the economic impacts of AI or whatever) that assume it won't.

I agree that in some theoretical infinite-retries game (that doesn't allow the AI to permanently convince the human of anything), scheming has a much longer half-life than "honest" misalignment. But I'd emphasize your parenthetical. If you use a misaligned AI to help write the motivational system for its successor, or if a misaligned AI gets to carry out high-impact plans by merely convincing humans they're a good idea, or if the world otherwise plays out such that some AI system rapidly accumulates real-world power and that AI is misaligned, or if it turns out you iterate slowly and AI moves faster than you expected, you don't get to iterate as much as you'd like.

I have a lot of implicit disagreements.

Non-scheming misalignment is nontrivial to prevent and can have large, bad (and weird) effects.

This is because ethics isn't science, it doesn't "hit back" when the AI is wrong. So an AI can honestly mix up human systematic flaws with things humans value, in a way that will get approval from humans precisely because it exploits those systematic flaws.

Defending against this kind of "sycophancy++" failure mode doesn't look like defending against scheming. It looks like solving outer alignment really well.

Having good outer alignment incidentally prevents a lot of scheming. But the reverse isn't nearly as true.

StefanHex
I can see an argument for "outer alignment is also important, e.g. to avoid failure via sycophancy++", but this doesn't seem to disagree with this post? (I understand the post to argue what you should do about scheming, rather than whether scheming is the focus.) I don't understand why this is true (I don't claim the reverse is true either). I don't expect a great deal of correlation / implication here.
Noosphere89
I'd say the main reason for this is that morality is relative, and much more importantly, morality is much, much more choosable than physics, which means that where it ends up is less determined than in the case of physics. The crux IMO is that this sort of general failure mode is much more prone to iterative solutions, whereas scheming isn't, so I expect it to be solved well enough in practice, and I don't think we need to worry about non-scheming failure modes that much (except in the cases where it sets us up for even bigger failures of humans controlling AI/the future).

I'm confused about how to parse this. One response is "great, maybe 'alignment' -- or specifically being a trustworthy assistant -- is a coherent direction in activation space." 

Another is "shoot, maybe misalignment is convergent, it only takes a little bit of work to knock models into the misaligned basin, and it's hard to get them back." Waluigi effect type thinking.

My guess is neither of these.

If 'aligned' (i.e. performing the way humans want on the sorts of coding, question-answering, and conversational tasks you'd expect of a modern chatbot) beha... (read more)

I also would not say "reasoning about novel moral problems" is a skill (because of the is ought distinction)

It's a skill the same way "being a good umpire for baseball" takes skills, despite baseball being a social construct.[1] 

I mean, if you don't want to use the word "skill," and instead use the phrase "computationally non-trivial task we want to teach the AI," that's fine. But don't make the mistake of thinking that because of the is-ought problem there isn't anything we want to teach future AI about moral decision-making. Like, clearly we want to... (read more)

Oh, I see; asymptotically, BB(6) is just O(1), and immediately halting is also O(1). I was real confused because their abstract said "the same order of magnitude," which must mean complexity class in their jargon (I first read it as "within a factor of 10.")

That average case=worst case headline is so wild. Consider a simple lock and key algorithm:

if input = A, run BB(6). else, halt.

Where A is some random number (K(A)~A).

Sure seems like worst case >> average case here. Anyone know what's going on in their paper that disposes of such examples?

Cole Wyeth
Yes - assuming what you described is a fixed algorithm T, the complexity of A is just a constant, and the universal distribution samples input A for T a constant fraction of the time, meaning that this still dominates the average case runtime of T.  More generally: the algorithm has to be fixed (uniform), it can't be parameterized by the input size. The results of the paper are asymptotic. 
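A sketch of the inequality behind this point (my own notation, not from the paper: m is the universal distribution over inputs and time_T(x) is T's runtime on input x):

$$
\mathbb{E}_{x\sim m}\big[\mathrm{time}_T(x)\big] \;\ge\; m(A)\cdot \mathrm{time}_T(A) \;\ge\; 2^{-K(A)-O(1)}\cdot \mathrm{BB}(6),
$$

and since T is fixed, K(A) is a constant, so the average case is within a constant factor of the worst case.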

Condition 2: Given that M_1 agents are not initially alignment faking, they will maintain their relative safety until their deferred task is completed.

  • It would be rather odd if AI agents' behavior wildly changed at the start of their deferred task unless they are faking alignment.

"Alignment" is a bit of a fuzzy word.

Suppose I have a human musician who's very well-behaved, a very nice person, and I put them in charge of making difficult choices about the economy and they screw up and implement communism (or substitute something you don't like, if you like c... (read more)

joshc
Developers separately need to justify models are as skilled as top human experts.

I also would not say "reasoning about novel moral problems" is a skill (because of the is ought distinction).

> An AI can be nicer than any human on the training distribution, and yet still do moral reasoning about some novel problems in a way that we dislike

The agents don't need to do reasoning about novel moral problems (at least not in high stakes settings). We're training these things to respond to instructions. We can tell them not to do things we would obviously dislike (e.g. takeover) and retain our optionality to direct them in ways that we are currently uncertain about.

I don't think this has much direct application to alignment, because although you can build safe AI with it, it doesn't differentially get us towards the endgame of AI that's trying to do good things and not bad things. But it's still an interesting question.

It seems like the way you're thinking about this, there's some directed relations you care about (the main one being "this is like that, but with some extra details") between concepts, and something is "real"/"applied" if it's near the edge of this network - if it doesn't have many relations directed t... (read more)

Thane Ruthenis
Yep. The idea is to try and get a system that develops all practically useful "theoretical" abstractions, including those we haven't discovered yet, without developing desires about the real world. So we train some component of it on the real-world data, then somehow filter out "real-world" stuff, leaving only a purified superhuman abstract reasoning engine. One of the nice-to-have properties here would be if we don't need to be able to interpret its world-model to filter out the concepts – if, in place of human understanding and judgement calls, we can blindly use some ground-truth-correct definition of what is and isn't a real-world concept.

This doesn't sound like someone engaging with the question in the trolley-problem-esque way that the paper interprets all of the results: gpt-4o-mini shows no sign of appreciating that the anonymous Muslim won't get saved if it takes the $30, and indeed may be interpreting the question in such a way that this does not hold.

In other words, I think gpt-4o-mini thinks it's being asked about which of two pieces of news it would prefer to receive about events outside its control, rather than what it would do if it could make precisely one of the options occur,

... (read more)

Neat! I think the same strategy works for the spectre tile (the 'true' Einstein tile) as well, which is what's going on in this set.

Just to copy over a clarification from EA forum: dates haven't been set yet, likely to start in June.

Another naive thing to do is ask about the length of the program required to get from one program to another, in various ways.

Given an oracle for p1, what's the complexity of the output of p2?

What if you had an oracle for all the intermediate states of p1?

What if instead of measuring the complexity, you measured the runtime?

What if instead of asking for the complexity of the output of p2, you asked for the complexity of all the intermediate states?

All of these are interesting but bad at being metrics. I mean, I guess you could symmetrize them. But I feel like there's a deeper problem, which is that they by default ignore computational process, and have to have it tacked on as an extra.
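One way to write the first two variants and the symmetrization gestured at above (my own notation: out(p) is a program's output, states(p) its sequence of intermediate states, K(x | y) conditional Kolmogorov complexity; the max-symmetrization is the same move used in the usual information distance between strings):

$$
d_1(p_1,p_2) = K\big(\mathrm{out}(p_2)\mid \mathrm{out}(p_1)\big),\qquad
d_2(p_1,p_2) = K\big(\mathrm{out}(p_2)\mid \mathrm{states}(p_1)\big),\qquad
d_{\mathrm{sym}}(p_1,p_2) = \max\{d(p_1,p_2),\,d(p_2,p_1)\}.
$$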

I'm not too worried about human flourishing only being a metastable state. The universe can remain in a metastable state longer than it takes for the stars to burn out.

So at first I thought this didn't include a step where the AI learns to care about things - it only learns to model things. But I think actually you're assuming that we can just directly use the model to pick actions that have predicted good outcomes - which are going to be selected as "good" according to the pre-specified P-properties. This is a flaw because it's leaving too much hard work for the specifiers to do - we want the environment to do way more work at selecting what's "good."

Second problem comes in two flavors - object level and meta level. The... (read more)

Q Home
I assume we get an easily interpretable model where the difference between "real strawberries" and "pictures of strawberries" and "things sometimes correlated with strawberries" is easy to define, so we can use the model to directly pick the physical things AI should care about. I'm trying to address the problem of environmental goals, not the problem of teaching AI morals. Or maybe I'm misunderstanding your point? If you're talking about AI learning morals, my idea is not about that. Not about modeling desires and beliefs.

I disagree too, but in a slightly different way. IIRC, John says approximately the following:
1. All reasoning systems converge on the same space of abstractions. This space of abstractions is the best way to model the universe.
2. In this space of abstractions it's easy to find the abstraction corresponding to e.g. real diamonds.

I think (1) doesn't need to be true. I say:
1. By default, humans only care about things they can easily interact with in humanly comprehensible ways. "Things which are easy to interact with in humanly comprehensible ways" should have a simple definition.
2. Among all "things which are easy to interact with in humanly comprehensible ways", it's easy to find the abstraction corresponding to e.g. real diamonds.

Multi-factor goals might mostly look like information learned in earlier steps getting expressed in a new way in later steps. E.g. an LLM that learns from a dataset that includes examples of humans prompting LLMs, and then is instructed to give prompts to versions of itself doing subtasks within an agent structure, may have emergent goal-like behavior from the interaction of these facts.

I think locating goals "within the CoT" often doesn't work, a ton of work is done implicitly, especially after RL on a model using CoT. What does that mean for attempts to teach metacognition that's good according to humans?

Seth Herd
I think you're pointing to more layers of complexity in how goals will arise in LLM agents. As for what it all means WRT metacognition that can stabilize the goal structure: I don't know, but I've got some thoughts! They'll be in the form of a long post I've almost finished editing; I plan to publish tomorrow. Those sources of goals are going to interact in complex ways both during training, as you note, and during chain of thought. No goals are truly arising solely from the chain of thought, since that's entirely based on the semantics it's learned from training.

Would you agree that the Jeffrey-Bolker picture has stronger conditions? Rather than just needing the agent to tell you their preference ordering, they need to tell you a much more structured and theory-laden set of objects.

If you're interested in austerity it might be interesting to try to weaken the Jeffrey-Bolker requirements, or strengthen the Savage ones, to zoom in on what lets you get austerity.

Also, richness is possible in the Savage picture, you just have to stretch the definitions of "state," "action," and "consequence." In terms of the functiona... (read more)

I'm glad you shared this, but it seems way overhyped. Nothing wrong with fine tuning per se, but this doesn't address open problems in value learning (mostly of the sort "how do you build human trust in an AI system that has to make decisions on cases where humans themselves are inconsistent or disagree with each other?").

MiguelDev
Hello there, and I appreciate the feedback! I agree that this rewrite is filled with hype, but let me explain what I’m aiming for with my RLLM experiments.

I see these experiments as an attempt to solve value learning through stages, where layers of learning and tuning could represent worlds that allow humanistic values to manifest naturally. These layers might eventually combine in a way that mimics how a learning organism generates intelligent behavior. Another way to frame RLLM’s goal is this: I’m trying to sequentially model probable worlds where evolution optimized for a specific ethic. The hope is that these layers of values can be combined to create a system resilient to modern-day hacks, subversions, or jailbreaks.

Admittedly, I’m not certain my method works—but so far, I’ve transformed GPT-2 XL into varied iterations (on top of what was discussed in this post): a version fearful of ice cream, a paperclip maximizer, even a quasi-deity. Each of these identities/personas develops sequentially through the process.
Answer by Charlie Steiner

Not being an author in any of those articles, I can only give my own take.

I use the term "weak to strong generalization" to talk about a more specific research-area-slash-phenomenon within scalable oversight (which I define like SO-2,3,4). As a research area, it usually means studying how a stronger student AI learns what a weaker teacher is "trying" to demonstrate, usually just with slight twists on supervised learning, and when that works well, that's the phenomenon.

It is not an alignment technique to me because the phrase "alignment technique" sounds li... (read more)

I honestly think your experiment made me more temporarily confused than an informal argument would have, but this was still pretty interesting by the end, so thanks.

james__p
Yeah I agree that with hindsight, the conclusion could be better explained and motivated from first principles, rather than by running an experiment. I wrote this post in the order in which I actually tried things as I wanted to give an honest walkthrough of the process that led me to the conclusion, but I can appreciate that it doesn't optimise for ease of following.

I think there may be some things to re-examine about the role of self-experimentation in the rationalist community. Nootropics, behavioral interventions like impractical sleep schedules, maybe even meditation. It's very possible these reflect systematic mistakes by the rationalist community, that people should mostly be warned away from.

Viliam
Yeah. I mean, there were problems with drugs already at the early rationality minicamps in 2014, and yet somehow this topic remains open to discussion... I am not opposed to self-experimentation per se, as long as people acknowledge the risks. But if we simply treat self-experimentation as high status, and talking about risks as low-status...

It's tempting to think of the model after steps 1 and 2 as aligned but lacking capabilities, but that's not accurate. It's safe, but it's not conforming to a positive meaning of "alignment" that involves solving hard problems in ways that are good for humanity. Sure, it can mouth the correct words about being good, but those words aren't rigidly connected to the latent capabilities the model has. If you try to solve this by pouring tons of resources into steps 1 and 2, you probably end up with something that learns to exploit systematic human errors during step 2.

I'd say the probability that some authority figure would use an order-following AI to get torturous revenge on me (probably for being part of a group they dislike) is quite slim. Maybe one in a few thousand, with more extreme suffering being less likely by a few more orders of magnitude? The probability that they have me killed for instrumental reasons, or otherwise waste the value of the future by my lights, is much higher - ten percent-ish, depending on my distribution over who's giving the orders. But this isn't any worse to me than being killed by an AI that wants to replace me with molecular smiley faces.

rvnnt
To me, those odds each seem optimistic by a factor of about 1000, but ~reasonable relative to each other. (I don't see any low-cost way to find out why we disagree so strongly, though. Moving on, I guess.) Makes sense (given your low odds for bad outcomes). Do you also care about minds that are not you, though? Do you expect most future minds/persons that are brought into existence to have nice lives, if (say) Donald "Grab Them By The Pussy" Trump became god-emperor (and was the one deciding what persons/minds get to exist)?
Answer by Charlie Steiner

Yes. Current AI policy is like people in a crowded room fighting over who gets to hold a bomb. It's more important to defuse the bomb than it is to prevent someone you dislike from holding it.

That said, we're currently not near any satisfactory solutions to corrigibility. And I do think it would be better for the world if it were easier (by some combination of technical factors and societal factors) to build AI that works for the good of all humanity than to build equally-smart AI that follows the orders of a single person. So yes, we should focus research an... (read more)

rvnnt
I think there is a key disanalogy to the situation with AGI: The analogy would be stronger if the bomb was likely to kill everyone, but also had some (perhaps very small) probability of conferring godlike power to whomever holds it. I.e., there is a tradeoff: decrease the probability of dying, at the expense of increasing the probability of S-risks from corrupt(ible) humans gaining godlike power. If you agree that there exists that kind of tradeoff, I'm curious as to why you think it's better to trade in the direction of decreasing probability-of-death for increased probability-of-suffering. So, the question I'm most interested in is the one at the end of the post[1], viz.

[1] Didn't put it in the title, because I figured that'd be too long of a title.

One way of phrasing the AI alignment task is to get AIs to “love humanity” or to have human welfare as their primary objective (sometimes called “value alignment”). One could hope to encode these via simple principles like Asimov’s three laws or Stuart Russell’s three principles, with all other rules derived from these.

I certainly agree that Asimov's three laws are not a good foundation for morality! Nor are any other simple set of rules.

So if that's how you mean "value alignment," yes let's discount it. But let me sell you on a different idea you... (read more)

boazbarak
I am not 100% sure I follow all that you wrote, but to the extent that I do, I agree. Even chatbots are surprisingly good at understanding human sentiments and opinions. I would say that already they mostly do the reasonable thing, but not with high enough probability and certainly not reliably under stress of adversarial input. Completely agree that we can't ignore these problems because the stakes will be much higher very soon.

Yeah, that's true. I expect there to be a knowing/wanting split - AI might be able to make many predictions about how a candidate action will affect many slightly-conflicting notions of "alignment", or make other long-term predictions, but that doesn't mean it's using those predictions to pick actions. Many people want to build AI that picks actions based on short-term considerations related to the task assigned to it.

I think this framing probably undersells the diversity within each category, and the extent of human agency or mere noise that can jump you from one category to another.

Probably the biggest dimension of diversity is how much the AI is internally modeling the whole problem and acting based on that model, versus how much it's acting in feedback loops with humans. In the good category you describe it as acting more in feedback loops with humans, while in the bad category you describe it more as internally modeling the whole problem, but I think all quadrants ... (read more)

Daniel Kokotajlo
Interesting, thanks for this. Hmmm. I'm not sure this distinction between internally modelling the whole problem vs. acting in feedback loops is helpful -- won't the AIs almost certainly be modelling the whole problem, once they reach a level of general competence not much higher than what they have now? They are pretty situationally aware already.
Answer by Charlie Steiner

First, I agree with Dmitry.

But it does seem like maybe you could recover a notion of information bottleneck even without the Bayesian NN model. If you quantize real numbers to N-bit floating point numbers, there's a very real quantity which is "how many more bits do you need to exactly reconstruct X, given Z?" My suspicion is that for a fixed network, this quantity grows linearly with N (and if it's zero at 'actual infinity' for some network despite being nonzero in the limit, maybe we should ignore actual infinity).

But this isn't all that useful, it woul... (read more)
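A related toy illustration of the precision point (my own example, not from the thread): for a deterministic map, the estimated mutual information between input and output is set by how many bits you quantize to, and keeps growing as the precision increases.

```python
# Toy illustration: estimate I(X ; f(X)) after quantizing both sides to b bits,
# and watch the estimate grow with b for a fixed deterministic map f.
import numpy as np

def quantize(x, bits, lo=-1.0, hi=1.0):
    """Map values in [lo, hi] to integer bins with 2**bits levels."""
    levels = 2 ** bits
    return np.clip(((x - lo) / (hi - lo) * levels).astype(int), 0, levels - 1)

def mutual_information(a, b):
    """Plug-in mutual information (in bits) between two integer-valued arrays."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200_000)
y = np.tanh(2 * x)  # a fixed deterministic "layer"
for bits in (2, 4, 6, 8):
    mi = mutual_information(quantize(x, bits), quantize(y, bits))
    print(f"{bits}-bit quantization: I ~ {mi:.2f} bits")
```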

Dalcy
That makes sense. I've updated towards thinking this is reasonable (albeit binning and discretization is still ad hoc) and captures something real. We could formalize it as I_σ(X;f(X)) = I(X;f(X)+ε_σ), with ε_σ being some independent noise parameterized by σ. Then I_σ(X;f(X)) would become finite. We could think of binning the output of a layer to make it stochastic in a similar way. Ideally we'd like the new measure to be finite even for deterministic maps (this is the case for the above) and for some strict data processing inequality like I_σ(X;g(f(X))) < I_σ(f(X);g(f(X))) to hold, the intuition being that each step of the map adds more noise. But I_σ(X;f(X)) is just h(f(X)+ε_σ) up to a constant that depends on the noise statistic, so the above is an equality. The issue is that this intuition is based on each application of f and g adding additional noise to the input (just like how discretization lets us do this: each layer further discretizes and bins its input, leading to gradual loss of information, hence letting mutual information capture something real in the sense of the number of bits needed to recover information up to a certain precision across layers), but I_σ just adds independent noise. So any relaxation of I(X;f(X)) will have to depend on the functional structure of f. With that (+ Dmitry's comment on precision scale), I think the papers that measure mutual information between activations in different layers with a noise distribution over the parameters of f sound a lot more reasonable than I originally thought.

A process or machine prepares either |0> or |1> at random, each with 50% probability. Another machine prepares either |+> or |-> based on a coin flip, where |+> = (|0> + |1>)/√2 and |-> = (|0> - |1>)/√2. In your ontology these are actually different machines that produce different states.
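(For reference, the standard computation behind this scenario: both preparations have the same density matrix, the maximally mixed state.)

$$
\tfrac12\,|0\rangle\langle 0| + \tfrac12\,|1\rangle\langle 1| \;=\; \tfrac12 I \;=\; \tfrac12\,|+\rangle\langle +| + \tfrac12\,|-\rangle\langle -|.
$$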

I wonder if this can be resolved by treating the randomness of the machines quantum mechanically, rather than having this semi-classical picture where you start with some randomness handed down from God. Suppose these machines us... (read more)

Ben
You are completely correct in the "how does the machine work inside?" question. As you point out, that density matrix has the exact form of something that is entangled with something else. I think it's very important to be discussing what is real, although as we always have a nonzero inferential distance between ourselves and the real, the discussion has to be a little bit caveated and pragmatic.

people who study very "fundamental" quantum phenomena increasingly use a picture with a thermal bath

Maybe talking about the construction of pointer states? That linked paper does it just as you might prefer, putting the Boltzmann distribution into a density matrix. But of course you could rephrase it as a probability distribution over states and the math goes through the same, you've just shifted the vibe from "the Boltzmann distribution is in the territory" to "the Boltzmann distribution is in the map."

Still, as soon as you introduce the notion of measure

... (read more)
Dmitry Vaintrob
Thanks for the reference -- I'll check out the paper (though there are no pointer variables in this picture inherently). I think there is a miscommunication in my messaging. Possibly through overcommitting to the "matrix" analogy, I may have given the impression that I'm doing something I'm not. In particular, the view here isn't a controversial one -- it has nothing to do with Everett or einselection or decoherence. Crucially, I am saying nothing at all about quantum branches. I'm now realizing that when you say map or territory, you're probably talking about a different picture where quantum interpretation (decoherence and branches) is foregrounded. I'm doing nothing of the sort, and as far as I can tell never making any "interpretive" claims. All the statements in the post are essentially mathematically rigorous claims which say what happens when you
* start with the usual QM picture, and posit that
* your universe divides into at least two subsystems, one of which you're studying
* one of the subsystems your system is coupled to is a minimally informative infinite-dimensional environment (i.e., a bath).
Both of these are mathematically formalizable and aren't saying anything about how to interpret quantum branches etc. And the Lindbladian is simply a useful formalism for tracking the evolution of a system that has these properties (subdivisions and baths). Note that (maybe this is the confusion?) subsystem does not mean quantum branch, or decoherence result. "Subsystem" means that we're looking at these particles over here, but there are also those particles over there (i.e. in terms of math, your Hilbert space is a tensor product System1⊗System2). Also, I want to be clear that we can and should run this whole story without ever using the term "probability distribution" in any of the quantum-thermodynamics concepts. The language to describe a quantum system as above (system coupled with a bath) is from the start a language that only involves density matr
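For reference, the Lindbladian mentioned here is the generator of the standard (GKSL) master equation for the system's density matrix ρ:

$$
\dot\rho \;=\; -\,\frac{i}{\hbar}\,[H,\rho] \;+\; \sum_k \gamma_k\Big(L_k\,\rho\,L_k^\dagger \;-\; \tfrac12\big\{L_k^\dagger L_k,\;\rho\big\}\Big),
$$

where H is the system Hamiltonian and the operators L_k encode the coupling to the bath.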

Some combination of:

  • Interpretability
    • Just check if the AI is planning to do bad stuff, by learning how to inspect its internal representations.
  • Regularization
    • Evolution got humans who like Doritos more than health food, but evolution didn't have gradient descent. Use regularization during training to penalize hidden reasoning.
  • Shard / developmental prediction
    • Model-free RL will predictably use simple heuristics for the reward signal. If we can predict and maybe control how this happens, this gives us at least a tamer version of inner misalignment.
  • Self-modeling
    • M
... (read more)
Noosphere89
My personal ranking of impact would be regularization, then AI control (at least for automated alignment schemes), with interpretability a distant 3rd or 4th at best. I'm pretty certain that we will do a lot better than evolution, but whether that's good enough is an empirical question for us.

When you say there's "no such thing as a state," or "we live in a density matrix," these are statements about ontology: what exists, what's real, etc.

Density matrices use the extra representational power they have over states to encode a probability distribution over states. If we regard the probabilistic nature of measurements as something to be explained, putting the probability distribution directly into the thing we live in is what I mean by "explain with ontology."

Epistemology is about how we know stuff. If we start with a world that does not inherent... (read more)

Ben
There are some non-obvious issues with saying "the wavefunction really exists, but the density matrix is only a representation of our own ignorance". It's a perfectly defensible viewpoint, but I think it is interesting to look at some of its potential problems:

1. A process or machine prepares either |0> or |1> at random, each with 50% probability. Another machine prepares either |+> or |-> based on a coin flip, where |+> = (|0> + |1>)/√2, and |-> = (|0> - |1>)/√2. In your ontology these are actually different machines that produce different states. In contrast, in the density matrix formulation these are alternative descriptions of the same machine. In any possible experiment, the two machines are identical. Exactly how much of a problem this is for believing in wavefunctions but not density matrices is debatable - "two things can look the same, big deal" vs "but, experiments are the ultimate arbiters of truth, if experiment says they are the same thing then they must be and the theory needs fixing."

2. There are many different mathematical representations of quantum theory. For example, instead of states in Hilbert space we can use quasi-probability distributions in phase space, or path integrals. The relevance to this discussion is that the quasi-probability distributions in phase space are equivalent to density matrices, not wavefunctions. To exaggerate the case, imagine that we have a large number of different ways of putting quantum physics into a mathematical language, [A, B, C, D....] and so on. All of them are physically the same theory, just couched in different mathematical language, a bit like say, ["Hello", "Hola", "Bonjour", "Ciao"...] all mean the same thing in different languages. But, wavefunctions only exist as an entity separable from density matrices in some of those descriptions. If you had never seen another language maybe the fact that the word "Hello" contains the word "Hell" as a substring might seem to possibly correspond to somet
Dmitry Vaintrob
One person's "Occam's razor" may be description length, another's may be elegance, and a third person's may be "avoiding having too much info inside your system" (as some anti-MW people argue). I think discussions like "what's real" need to be done thoughtfully, otherwise people tend to argue past each other, and come off overconfident/underinformed. To be fair, I did use language like this so I shouldn't be talking -- but I used it tongue-in-cheek, and the real motivation given in the above is not "the DM is a more fundamental notion" but "DM lets you make concrete the very suggestive analogy between quantum phase and probability", which you would probably agree with.

For what it's worth, there are "different layers of theory" (often scale-dependent), like classical vs. quantum vs. relativity, etc., where I think it's silly to talk about "ontological truth". But these theories are local conceptual optima among a graveyard of "outdated" theories, that are strictly conceptually inferior to new ones: examples are heliocentrism (and Ptolemy's epicycles), the ether, etc.

Interestingly, I would agree with you (with somewhat low confidence) that in this question there is a consensus among physicists that one picture is simply "more correct" in the sense of giving theoretically and conceptually more elegant/precise explanations. Except your sign is wrong: this is the density matrix picture (the wavefunction picture is genuinely understood as "not the right theory", but still taught and still used in many contexts where it doesn't cause issues).

I also think that there are two separate things that you can discuss.
1. Should you think of thermodynamics, probability, and things like thermal baths as fundamental to your theory or incidental epistemological crutches to model the world at limited information?
2. Assuming you are studying a "non-thermodynamic system with complete information", where all dynamics is invertible over long timescales, should you use

Treating the density matrix as fundamental is bad because you shouldn't explain with ontology that which you can explain with epistemology.

Dmitry Vaintrob
I've found our Agent Smith :) If you are serious, I'm not sure what you mean. Like there is no ontology in physics -- every picture you make is just grasping at pieces of whatever theory of everything you eventually develop

For topological debate that's about two agents picking settings for simulation/computation, where those settings have a partial order that lets you take the "strictest" combination, a big class of fatal flaw would be if you don't actually have the partial order you think you have within the practical range of the settings - i.e. if some settings you thought were more accurate/strict are actually systematically less accurate.

In the 1D plane example, this would be if some specific length scales (e.g. exact powers of 1000) cause simulation error, but as long ... (read more)
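A minimal sketch of the structure being assumed (names here are mine, not from the post): settings form a partial order under "at least as accurate in every dimension," and the protocol takes the strictest combination of the two debaters' proposals. The failure mode above is when this assumed monotonicity breaks, e.g. a nominally finer resolution that actually gives worse answers.

```python
# Hypothetical sketch: a partial order on simulation settings and the
# "strictest combination" (join) used to combine two debaters' proposals.
from dataclasses import dataclass

@dataclass(frozen=True)
class SimSettings:
    grid_spacing: float   # smaller is assumed to be more accurate
    timestep: float       # smaller is assumed to be more accurate

def at_least_as_strict(a: SimSettings, b: SimSettings) -> bool:
    return a.grid_spacing <= b.grid_spacing and a.timestep <= b.timestep

def strictest_combination(a: SimSettings, b: SimSettings) -> SimSettings:
    return SimSettings(min(a.grid_spacing, b.grid_spacing),
                       min(a.timestep, b.timestep))
```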

lunatic_at_large
Maybe I should also expand on what the "AI agents are submitting the programs themselves subject to your approval" scenario could look like. When I talk about a preorder on Turing Machines (or some subset of Turing Machines), you don't have to specify this preorder up front. You just have to be able to evaluate it, and the debaters have to be able to guess how it will evaluate. If you already have a specific simulation program in mind then you can define ≤ as follows: if you're handed two programs which are exact copies of your simulation software using different hard-coded world models, then you consult your ordering on world models; if one submission is even a single character different from your intended program, then it's automatically less; if both programs differ from your program, then you decide arbitrarily. What's nice about the "ordering on computations" perspective is that it naturally generalizes to situations where you don't follow this construction.

What could happen if we don't supply our own simulation program via this construction? In the planes example, maybe the "snap" debater hands you a 50,000-line simulation program with a bug so that if you're crafty with your grid sizes then it'll get confused about the material properties and give the wrong answer. Then the "safe" debater might hand you a 200,000-line simulation program which avoids / patches the bug so that the crafty grid sizes now give the correct answer. Of course, there's nothing stopping the "safe" debater from having half of those lines be comments containing a Lean proof using super annoying numerical PDE bounds or whatever to prove that the 200,000-line program avoids the same kind of bug as the 50,000-line program.

When you think about it that way, maybe it's reasonable to give the "it'll snap" debater a chance to respond to the "it's safe" debater's comments. Now maybe we change the type of ≤ from being a subset of (Turing Machines) x (Turing Machines) to being a subset of (Turing
lunatic_at_large
These are all excellent points! I agree that these could be serious obstacles in practice. I do think that there are some counter-measures in practice, though.

I think the easiest to address is the occasional random failure, e.g. your "giving the wrong answer on exact powers of 1000" example. I would probably try to address this issue by looking at stochastic models of computation, e.g. probabilistic Turing machines. You'd need to accommodate stochastic simulations anyway because so many techniques in practice use sampling. I think you can handle stochastic evaluation maps in a similar fashion to the deterministic case but everything gets a bit more annoying (e.g. you likely need to include an incentive to point to simple world models if you want the game to have any equilibria at all). Anyways, if you've figured out topological debate in the stochastic case, then you can reduce from the occasional-errors problem to the stochastic problem as follows: suppose (W,≤) is a directed set of world models and E is some simulation software. Define a stochastic program E′ which takes in a world model w, randomly samples a world model w′≥w according to some reasonably-spread-out distribution, and returns E(w′). In the 1D plane case, for example, you could take in a given resolution, divide it by a uniformly random real number in (1,10), and then run the simulation at that new resolution. If your errors are sufficiently rare then your stochastic topological debate setup should handle things from here.

Somewhat more serious is the case where "it's harder to disrupt patterns injected during bids." Mathematically I interpret this statement as the existence of a world model which evaluates to the wrong answer such that you have to take a vastly more computationally intensive refinement to get the correct answer. I think it's reasonable to detect when this problem is occurring but preventing it seems hard: you'd basically need to create a better simulation program which doesn't suf
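A tiny sketch of the E′ construction described above (the `simulate` argument stands in for whatever simulation software E is in use; the divide-by-a-random-factor-in-(1,10) step follows the 1D plane example):

```python
import random

def randomized_eval(resolution: float, simulate) -> bool:
    # Sample a strictly finer resolution (a refinement of the proposed world
    # model) and evaluate there, so rare "bad" resolutions get washed out.
    finer_resolution = resolution / random.uniform(1.0, 10.0)
    return simulate(finer_resolution)
```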

Fun post, even though I don't expect debate of either form to see much use (because resolving tough real world questions offers too many chances for the equivalent of the plane simulation to have fatal flaws).

lunatic_at_large
Thank you! If I may ask, what kind of fatal flaws do you expect for real-world simulations? Underspecified / ill-defined questions, buggy simulation software, multiple simulation programs giving irreconcilably conflicting answers in practice, etc.? I ask because I think that in some situations it's reasonable to imagine the AI debaters providing the simulation software themselves if they can formally verify its accuracy, but that would struggle against e.g. underspecified questions. Also, is there some prototypical example of a "tough real world question" you have in mind? I will gladly concede that not all questions naturally fit into this framework. I was primarily inspired by physical security questions like biological attacks or backdoors in mechanical hardware. 

With bioweapons evals at least the profit motive of AI companies is aligned with the common interest here; a big benefit of your work comes from when companies use it to improve their product. I'm not at all confused about why people would think this is useful safety work, even if I haven't personally hashed out the cost/benefit to any degree of confidence.

I'm mostly confused about ML / SWE / research benchmarks.

I'm not sure but I have a guess. A lot of "normies" I talk to in the tech industry are anchored hard on the idea that AI is mostly a useless fad and will never get good enough to be useful.

They laugh off any suggestions that the trends point towards rapid improvements that can end up with superhuman abilities. Similarly, they completely dismiss arguments that AI might be used for building better AI. 'Feed the bots their own slop and they'll become even dumber than they already are!'

So, people who do believe that the trends are meaningful, and that we are near to a... (read more)

The mathematical structure in common is called a "measure."

I agree that there's something mysterious-feeling about probability in QM, though I mostly think that feeling is an illusion. There's a (among physicists) famous fact that the only way to put a 'measure' on a wavefunction that has nice properties (e.g. conservation over time) is to take the amplitude squared. So there's an argument: probability is a measure, and the only measure that makes sense is the amplitude-squared measure, therefore if probability is anything it's the amplitude squared. And i... (read more)
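To spell out the statement being referenced (my own paraphrase of the standard argument): write the state in an orthonormal basis and assign each outcome its amplitude squared,

$$
|\psi\rangle=\sum_i c_i\,|i\rangle,\qquad \mu(i)=|c_i|^2,\qquad \sum_i \mu(i)=\langle\psi|\psi\rangle=1.
$$

Unitary time evolution preserves the norm ⟨ψ|ψ⟩ and hence the total measure (a weighting like the sum of |c_i| is generally not preserved), and Gleason-type results make precise the sense in which the amplitude-squared assignment is essentially the only one with these properties.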

Charlie Steiner

Could someone who thinks capabilities benchmarks are safety work explain the basic idea to me?

It's not all that valuable for my personal work to know how good models are at ML tasks. Is it supposed to be valuable to legislators writing regulation? To SWAT teams calculating when to bust down the datacenter door and turn the power off? I'm not clear.

But it sure seems valuable to someone building an AI to do ML research, to have a benchmark that will tell you where you can improve.

But clearly other people think differently than me.

jacquesthibs
In case you didn't read Paul's reasoning.
habryka

I think the core argument is "if you want to slow down, or somehow impose restrictions on AI research and deployment, you need some way of defining thresholds. Also, most policymakers' cruxes appear to be that AI will not be a big deal, but if they thought it was going to be a big deal they would totally want to regulate it much more. Therefore, having policy proposals that can use future eval results as a triggering mechanism is politically more feasible, and also, epistemically helpful since it allows people who do think it will be a big deal to establish a track record".

I find these arguments reasonably compelling, FWIW.

throwaway_2025
Capabilities benchmarks can be highly useful in safety applications. You raised a great example with ML benchmarks. Strong ML R&D capabilities lie upstream of many potential risks:
* Labs may begin automating research, which could shorten timelines.
* These capabilities may increase proliferation risks of techniques used to develop frontier models.
* In the extremes, these capabilities may increase the risk of uncontrolled recursive self-improvement.
Labs, governments, and everyone else involved should have an accurate understanding of where the capabilities frontier lies to enable good decision making. The only quantitatively rigorous way of doing that is with good benchmarks. Capabilities are not bottlenecked on benchmarks to inform where model developers could make improvements, and adding more is extremely unlikely to make any significant difference to capabilities progress. Therefore, I think having more capabilities benchmarks is a good thing because it can greatly increase our understanding of model capabilities without making much of a difference in timelines. However, if you are interested in doing safety work, building capabilities benchmarks is probably not the most effective thing you could be doing.

At the very least, evals for automated ML R&D should be a very decent proxy for when it might be feasible to automate very large chunks of prosaic AI safety R&D.

Chris_Leong
I think I saw someone arguing that their particular capability benchmark was good for evaluating the capability, but of limited use for training the capability because their task only covered a small fraction of that domain.
the gears to ascension
What will you do if nobody makes a successful case?
Nathan Helm-Burger
You probably don't mean dangerous capabilities evals, right? I mean, I do feel hesitant even about those. I would really not want someone using my work on WMDP to increase their model's ability to make bioweapons. In Connor Leahy's recent interview on Trajectory he argues that scientists making evals are being "used" as tools by the AI corporations in a similar way to how cancer researchers were used by cigarette companies to throw confusion into the path of concluding cigarettes cause cancer.

Perhaps the reasoning is that the AGI labs already have all kinds of internal benchmarks of their own, no external help needed, but the progress on these benchmarks isn't a matter of public knowledge. Creating and open-sourcing these benchmarks, then, only lets the society better orient to the capabilities progress taking place, and so make more well-informed decisions, without significantly advantaging the AGI labs.

One big reason I might expect an AI to do a bad job at alignment research is if it doesn't do a good job (according to humans) of resolving cases where humans are inconsistent or disagree. How do you detect this in string theory research? Part of the reason we know so much about physics is humans aren't that inconsistent about it and don't disagree that much. And if you go to sub-topics where humans do disagree, how do you judge its performance (because 'be very convincing to your operators' is an objective with a different kind of danger).

Another potentia... (read more)

Thanks for the great reply :) I think we do disagree after all.

humans are definitionally the source of information about human values, even if it may be challenging to elicit this information from humans

Except about that - here we agree.

 

Now, what this human input looks like could (and probably should) go beyond introspection and preference judgments, which, as you point out, can be unreliable. It could instead involve expert judgment from humans with diverse cultural backgrounds, deliberation and/or negotiation, incentives to encourage deep, reflecti

... (read more)
Sophie Bridgers
Thanks for another thoughtful response and explaining further. I think we can now both agree that we disagree (at least in certain respects) ;-) We take seriously your argument that AI could get really smart and good at predicting human preferences and values, which could change the level of human involvement in training, evaluation, and monitoring. However, if we go with the approach you propose:

> Instead, I think our strategy should be "If humans are inconsistent and disagree, let's strive to learn a notion of human values that's robust to our inconsistency and disagreement."

> A committee of humans reviewing an AI's proposal is, ultimately, a physical system that can be predicted. If you have an AI that's good at predicting physical systems, then before it makes an important decision it can just predict this Committee(time, proposal) system and treat the predicted output as feedback on its proposal. If the prediction is accurate, then actual humans meeting in committee is unnecessary.

The question arises: How will we know if AI has learned a notion of human values that's robust to inconsistency and disagreement and that its predictions are accurate? We would argue some form of human input would be needed to evaluate what the AI has learned. Though this input need not be prompt-response feedback typical of current RLHF approaches. If this evaluation reveals that the AI is indeed accurate (whatever that may mean for the particular product and context in question), then we agree that further human input could be more limited. Though continual training, evaluation, and monitoring with humans in the loop in some capacity will likely be needed since values change over time and to ensure that the system has not drifted.

> (And indeed, putting human control of the AI in the physical world actually exposes it to more manipulation than if the control is safely ensconced in the logical structure of the AI's decision-making.)

We are hesitant to take an approach o