All possible encoding schemes / universal priors differ from each other by at most a finite prefix. You might think this doesn't achieve much, since the length of the prefix can in principle be unbounded; but in practice, the length of the prefix (or rather, the prior itself) is constrained by a system's physical implementation. There are some encoding schemes which neither you nor any other physical entity will ever be able to implement, and so for the purposes of description length minimization these are off the table. And of the encoding schemes that remain on the table, virtually all of them will behave identically with respect to the description lengths they assign to "natural" versus "unnatural" optimization criteria.
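(For concreteness, since I'm leaning on it: the first sentence is just the usual invariance theorem. A minimal statement, where $c_{U_1,U_2}$ is the length of a fixed cross-compiler between the two universal machines and doesn't depend on what's being described:

$$K_{U_1}(x) \le K_{U_2}(x) + c_{U_1,U_2} \quad \text{for all } x.)$$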
It looks to me like the "updatelessness trick" you describe (essentially, behaving as though certain non-local branches of the decision tree are still counterfactually relevant even though they are not — although note that I currently don't see an obvious way to use that to avoid the usual money pump against intransitivity) recovers most of the behavior we'd see under VNM anyway; and so I don't think I understand your confusion re: VNM axioms.
E.g. can you give me a case in which (a) we have an agent that exhibits preferences against whose naive implementat...
I think I might be missing something, because the argument you attribute to Dávid still looks wrong to me. You say:
The entropy of the simulators’ distribution need not be more than the entropy of the (square of the) wave function in any relevant sense. Despite the fact that subjective entropy may be huge, physical entropy is still low (because the simulations happen on a high-amplitude ridge of the wave function, after all).
Doesn't this argument imply that the supermajority of simulations within the simulators' subjective distribution over universe his...
...The AI has a similarly hard time to the simulators figuring out what's a plausible configuration to arise from the big bang. Like the simulators have an entropy N distribution of possible AIs, the AI itself also has an entropy N distribution for that. So its probability that it's in a real Everett branch is not p, but p times 2^-N, as it has only a 2^-N prior probability that the kind of world it observes is the kind of thing that can come up in a real Everett branch. So it's balanced out with the simulation hypothesis, and as long as the simulators are s
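(If I'm reading the quoted argument right, the claimed cancellation is something like this, in my notation rather than Dávid's:

$$\frac{P(\text{sim} \mid w)}{P(\text{real} \mid w)} = \frac{P(\text{sim})\,P(w \mid \text{sim})}{P(\text{real})\,P(w \mid \text{real})},$$

with the same $2^{-N}$ entropy penalty appearing in both likelihoods, since the simulators are claimed to be no better than the AI at guessing which observed worlds $w$ actually arise from a real big bang, so the penalty cancels in the ratio.)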
...These two kinds of “learning” are not synonymous. Adaptive systems “learn” things, but they don’t necessarily “learn about” things; they don’t necessarily have an internal map of the external territory. (Yes, the active inference folks will bullshit about how any adaptive system must have a map of the territory, but their math does not substantively support that interpretation.) The internal heuristics or behaviors “learned” by an adaptive system are not necessarily “about” any particular external thing, and don’t necessarily represent any particular exte
It seems the SOTA for training LLMs has (predictably) pivoted away from pure scaling of compute + data, and towards RL-style learning based on (synthetic?) reasoning traces (mainly CoT, in the case of o1). AFAICT, this basically obviates safety arguments that relied on "imitation" as a key source of good behavior, since now additional optimization pressure is being applied towards correct prediction rather than pure imitation.
Strictly speaking, this seems very unlikely, since we know that e.g. CoT increases the expressive power of Transformers.
Ah, yeah, I can see how I might've been unclear there. I was implicitly taking CoT into account when I talked about the "base distribution" of the model's outputs, as it's essentially ubiquitous across these kinds of scaffolding projects. I agree that if you take a non-recurrent model's O(1) output and equip it with a form of recurrent state that you permit to continue for O(n) iterations, that will produce a qualitatively different di...
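For clarity, here's the kind of scaffolding loop I have in mind; a minimal sketch, where `generate_step` is a hypothetical stand-in for whatever single-forward-pass sampling the scaffold uses:

```python
def run_with_cot(model, prompt: str, n_steps: int) -> str:
    """Equip a non-recurrent model with recurrent state via CoT:
    each O(1) output is fed back in as context for the next step,
    for up to O(n) iterations."""
    context = prompt
    for _ in range(n_steps):
        step = model.generate_step(context)  # one O(1) forward pass (hypothetical API)
        context += "\n" + step               # the "recurrence": output becomes input
        if "FINAL ANSWER" in step:           # stand-in stopping criterion
            break
    return context
```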
And I suspect we probably can, given scaffolds like https://sakana.ai/ai-scientist/ and its likely improvements (especially if done carefully, e.g. integrating something like Redwood's control agenda, etc.). I'd be curious where you'd disagree (since I expect you probably would) - e.g. do you expect the AI scientists to become x-risky before they're (roughly) human-level at safety research, or that they never scale to human-level, etc.?
Jeremy's response looks to me like it mostly addresses the first branch of your disjunction (AI becomes x-risky before reaching...
I'm interested! Also curious as to how this is implemented; are you using retrieval-augmented generation, and if so, with what embeddings?
Epistemic status: exploratory, "shower thought", written as part of a conversation with Claude:
For any given entity (broadly construed here to mean, essentially, any physical system), it is possible to analyze that entity as follows:
...Define the set of possible future trajectories that entity might follow, according to some suitably uninformative ignorance prior on its state and (generalized) environment. Then ask, of that set, whether there exists some simple, obvious, or otherwise notable prior on the set in question, that assigns probabilities to var
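Here's a rough sketch of how that test might be operationalized; this is entirely my own construction, with toy inputs, not a worked-out proposal:

```python
import numpy as np

def concentration(candidate_p: np.ndarray, ignorance_p: np.ndarray) -> float:
    """KL divergence of a candidate prior from the ignorance prior over a
    common set of coarse-grained trajectories; large values mean the
    candidate prior singles out a distinguished subset of futures."""
    mask = candidate_p > 0
    return float(np.sum(candidate_p[mask] *
                        np.log(candidate_p[mask] / ignorance_p[mask])))

# Toy usage: 8 coarse-grained trajectories, uniform ignorance prior,
# candidate prior concentrated on two of them.
ignorance = np.full(8, 1 / 8)
candidate = np.array([0.45, 0.45, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01])
print(concentration(candidate, ignorance))  # > 0: the candidate is "notable"
```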
The rule of thumb test I tend to use to assess proposed definitions of agency (at least from around these parts) is whether they'd class a black hole as an agent. It's not clear to me whether this definition does; I would have said it very likely does based on everything you wrote, except for this one part here:
...A cubic meter of rock has a persistent boundary over time, but no interior states in an informational sense, and therefore is not an agent. To see it has no interior, note that anything that puts information into the surface layer of the rock tr
How is a Bayesian agent supposed to modify priors except by updating on the basis of evidence?
They're not! But humans aren't ideal Bayesians, and it's entirely possible for them to update in a way that does change their priors (encoded by intuitions) moving forward. In particular, the difference between having updated one's intuitive prior, and keeping the intuitive prior around but also keeping track of a different, consciously held posterior, is that the former is vastly less likely to "de-update", because the evidence that went into the update isn't ...
There's also a failure mode of focusing on "which arguments are the best" instead of "what is actually true". I don't understand this failure mode very well, except that I've seen myself and others fall into it. Falling into it looks like focusing a lot on specific arguments, and spending a lot of time working out what was meant by the words, rather than feeling comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs.
My sense is that this is because different people have different intuitive priors, an...
Can we not speak of apparent coherence relative to a particular standpoint? If a given system seems to be behaving in such a way that you personally can't see how to construct a Dutch book against it (a series of interactions with it such that energy/negentropy/resources can be extracted from it and accrue to you), then the system is inexploitable with respect to you, and therefore at least as coherent as you are. The closer to maximal coherence a given system is, the less it will visibly depart from the appearance of coherent behavior, and hence utility f...
I seem to recall hearing a phrase I liked, which appears to concisely summarize the concern as: "There's no canonical way to scale me up."
Does that sound right to you?
Well, if we're following standard ML best practices, we have a train set, a dev set, and a test set. The purpose of the dev set is to check and ensure that things are generalizing properly. If they aren't generalizing properly, we tweak various hyperparameters of the model and retrain until they do generalize properly on the dev set. Then we do a final check on the test set to ensure we didn't overfit the dev set. If you forgot or never learned this stuff, I highly recommend brushing up on it.
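As a concrete sketch of that workflow (illustrative model and data; the point is the three-way split and the tune-on-dev loop, not the specifics):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Toy data; the point is the workflow, not the model.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(1000)

# Three-way split: 60% train, 20% dev, 20% test.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# Tune on the dev set: retrain with different hyperparameters until
# generalization (the dev score) looks right.
best_alpha, best_score = None, -np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    score = r2_score(y_dev, model.predict(X_dev))
    if score > best_score:
        best_alpha, best_score = alpha, score

# Final one-shot check on the test set, to catch overfitting to the dev set.
final = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("test R^2:", r2_score(y_test, final.predict(X_test)))
```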
(Just to be clear: yes, I know what training and test sets ar...
I think it ought to be possible for someone to always be present. [I'm also not sure it would be necessary.]
I think I don't understand what you're imagining here. Are you imagining a human manually overseeing all outputs of something like ChatGPT, or Microsoft Copilot, before those outputs are sent to the end user (or, worse yet, put directly into production)?
[I also think I don't understand why you make the bracketed claim you do, but perhaps hashing that out isn't a conversational priority.]
...As I understand this thought experiment, we're doing next-t
I'm confused about what it means to "remove the human", and why it's so important whether the human is 'removed'.
Because the human isn't going to constantly be present for everything the system does after it's deployed (unless for some reason it's not deployed).
If I can assume that stuff, then it feels like a fairly core task, abundantly stress-tested during training, to read off the genius philosopher's spoken opinions about e.g. moral philosophy from the quantum fields. How else could quantum fields be useful for next-token predictions?
Quantum fie...
I'd assume that when we tell it, "optimize this company, in a way that we would accept, after a ton of deliberation", this could be instead described as, "optimize this company, in a way that we would accept, after a ton of deliberation, where these terms are described using our ontology"
The problem shows up when the system finds itself acting in a regime where the notion of us (humans) "accepting" its optimizations becomes purely counterfactual, because no actual human is available to oversee its actions in that regime. Then the question of "would a hu...
To the extent that I buy the story about imitation-based intelligences inheriting safety properties via imitative training, I correspondingly expect such intelligences not to scale to having powerful, novel, transformative capabilities—not without an amplification step somewhere in the mix that does not rely on imitation of weaker (human) agents.
Since I believe this, that makes it hard for me to concretely visualize the hypothetical of a superintelligent GPT+DPO agent that nevertheless only does what is instructed. I mostly don't expect to be able to get t...
That (on its own, without further postulates) is a fully general argument against improving intelligence.
Well, it's primarily a statement about capabilities. The intended construal is that if a given system's capabilities profile permits it to accomplish some sufficiently transformative task, then that system's capabilities are not limited to only benign such tasks. I think this claim applies to most intelligences that can arise in a physical universe like our own (though necessarily not in all logically possible universes, given NFL theorems): that ...
The methods we already have are not sufficient to create ASI, and also if you extrapolate out the SOTA methods at larger scale, it's genuinely not that dangerous.
I think I like the disjunct “If it’s smart enough to be transformative, it’s smart enough to be dangerous”, where the contrapositive further implies competitive pressures towards creating something dangerous (as opposed to not doing that).
There’s still a rub here—namely, operationalizing “transformative” in such a way as to give the necessary implications (both “transformative -> dangerous” ...
(9) is a values thing, not a beliefs thing per se. (I.e. it's not an epistemic claim.)
(11) is one of those claims that is probabilistic in principle (and which can therefore be updated via evidence), but for which the evidence in practice is so one-sided that arriving at the correct answer is basically usable as a sort of FizzBuzz test for rationality: if you can’t get the right answer on super-easy mode, you’re probably not a good fit.
Something I wrote recently as part of a private conversation, which feels relevant enough to ongoing discussions to be worth posting publicly:
...The way I think about it is something like: a "goal representation" is basically what you get when it's easier to state some compact specification on the outcome state, than it is to state an equivalent set of constraints on the intervening trajectories to that state.
In principle, this doesn't have to equate to "goals" in the intuitive, pretheoretic sense, but in practice my sense is that this happens largely when (a
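One way to render that compactness condition formally (my formalization, not necessarily what the truncated parenthetical was going to say): a goal representation exists when there is a predicate $\varphi$ on the outcome state satisfying

$$K(\varphi) \ll \min_{\psi \,\equiv\, \varphi} K(\psi),$$

where $\psi$ ranges over trajectory-level specifications picking out the same set of behaviors, and $K$ is description length.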
It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether, if you ask that system "Which action would maximize the expected amount of Y?", it will also be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.
Here's an existing Nate!comment that I find reasonably...
I don't see why you can't just ask at each point in time "Which action would maximize the expected value of X". It seems like asking once and asking many times as new things happen in reality don't have particularly different properties.
Paul noted:
...It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether if you ask that system "Which acti
I think I'm not super into the U = V + X framing; that seems to inherently suggest that there exists some component of the true utility V "inside" the proxy U everywhere, and which is merely perturbed by some error term rather than washed out entirely (in the manner I'd expect to see from an actual misspecification). In a lot of the classic Goodhart cases, the source of the divergence between measurement and desideratum isn't regressional, and so V and X aren't independent.
(Consider e.g. two arbitrary functions U' and V', and compute the "error term" X' be...
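The parenthetical point can be checked numerically; a minimal sketch with random (hence independent) U' and V' on a toy domain:

```python
import numpy as np

# Take two arbitrary (here: random, independent) functions U' and V',
# define the "error term" X' = U' - V', and check whether X' is
# independent of V'. It isn't: Cov(X', V') = Cov(U' - V', V') = -Var(V') != 0.
rng = np.random.default_rng(0)
U = rng.standard_normal(100_000)   # values of U' across the domain
V = rng.standard_normal(100_000)   # values of V' across the domain
X = U - V                          # the induced "error term" X'
print(np.corrcoef(X, V)[0, 1])     # ~ -0.71, not 0: X' and V' are dependent
```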
(Which, for instance, seems true about humans, at least in some cases: If humans had the computational capacity, they would lie a lot more and calculate personal advantage a lot more. But since those are both computationally expensive, and can therefore be caught out by other humans, the heuristic / value of "actually care about your friends" is competitive with "always be calculating your personal advantage.")
...I expect this sort of thing to be less common with AI systems that can have much bigger "cranial capacity". But then again, I guess that at what
It sounds like you're arguing that uploading is impossible, and (more generally) have defined the idea of "sufficiently OOD environments" out of existence. That doesn't seem like valid thinking to me.
Notice I replied to that comment you linked and agreed with John; not that any generalized vector dot product model is wrong, but that the specific one in that post is wrong, as it doesn't weight by expected probability (i.e. an incorrect distance function).
Anyway I used that only as a convenient example to illustrate a model which separates degree of misalignment from net impact, my general point does not depend on the details of the model and would still stand for any arbitrarily complex non-linear model.
...The general point being that degree of mi
...No AI we create will be perfectly aligned, so instead all that actually matters is the net utility that AI provides for its creators: something like the dot product between our desired future trajectory and that of the agents. More powerful agents/optimizers will move the world farther faster (longer trajectory vector) which will magnify the net effect of any fixed misalignment (cos angle between the vectors), sure. But that misalignment angle is only relevant/measurable relative to the net effect - and by that measure human brain evolution was an enormou
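For reference, a toy rendering of the quoted vector model (my code; the vectors and angle are stand-ins):

```python
import numpy as np

# Net impact as the projection of the agent's trajectory onto ours. More
# optimization power means a longer agent vector, which scales the effect
# of a fixed misalignment angle without changing the angle itself.
def net_utility(desired: np.ndarray, actual: np.ndarray) -> float:
    return float(np.dot(desired, actual))  # |desired| * |actual| * cos(theta)

desired = np.array([1.0, 0.0])
strong_but_misaligned = 10.0 * np.array([np.cos(0.1), np.sin(0.1)])  # long vector, small angle
print(net_utility(desired, strong_but_misaligned))  # large positive despite nonzero misalignment
```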
It looks a bit to me like your Timestep Dominance Principle forbids the agent from selecting any trajectory which loses utility at a particular timestep in exchange for greater utility at a later timestep, regardless of whether the trajectory in question actually has anything to do with manipulating the shutdown button? After all, conditioning on the shutdown being pressed at any point after the local utility loss but before the expected gain, such a decision would give lower sum-total utility within those conditional trajectories than one which doesn't ma...
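To make the worry concrete, a toy example with made-up numbers (so you can tell me where I'm misreading the principle): let plan $I$ pay $-1$ at $t_1$ and $+3$ at $t_3$, and let plan $N$ pay $0$ at every timestep. Then

$$u(I \mid \text{press at } t_2) = -1 < 0 = u(N \mid \text{press at } t_2),$$

$$u(I \mid \text{press after } t_3) = +2 > 0 = u(N \mid \text{press after } t_3).$$

The question is whether Timestep Dominance lets $I$'s advantage in the late-press conditionals offset its disadvantage in the early-press one, or whether the early-press comparison alone rules $I$ out.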
In your example, DSM permits the agent to end up with either A+ or B. Neither is strictly dominated, and neither has become mandatory for the agent to choose over the other. The agent won't have reason to push probability mass from one towards the other.
But it sounds like the agent's initial choice between A and B is forced, yes? (Otherwise, it wouldn't be the case that the agent is permitted to end up with either A+ or B, but not A.) So the presence of A+ within a particular continuation of the decision tree influences the agent's choice at the initial...
This is a good post! It feels to me like a lot of discussion I've recently encountered seem to be converging on this topic, and so here's something I wrote on Twitter not long ago that feels relevant:
I think most value functions crystallized out of shards of not-entirely-coherent drives will not be friendly to the majority of the drives that went in; in humans, for example, a common outcome of internal conflict resolution is to explicitly subordinate one interest to another.
...I basically don’t think this argument differs very much between humans and ASI
...The main way I'd imagine shutdown-corrigibility failing in AutoGPT (or something like it) is not that a specific internal sim is "trying" to be incorrigible at the top level, but rather that AutoGPT has a bunch of subprocesses optimizing for different subgoals without a high-level picture of what's going on, and some of those subgoals won't play well with shutdown. That's the sort of situation where I could easily imagine that e.g. one of the subprocesses spins up a child system prior to shutdown of the main system, without the rest of the main system cat
This looks to me like a misunderstanding that I tried to explain in section 3.1. Let me know if not, though, ideally with a worked-out example of the form: "here's the decision tree(s), here's what DSM mandates, here's why it's untrammelled according to the OP definition, and here's why it's problematic."
I don't think I grok the DSM formalism enough to speak confidently about what it would mandate, but I think I see a (class of) decision problem where any agent (DSM or otherwise) must either pass up a certain gain, or else engage in "problematic" behavi...
My results above on invulnerability preclude the possibility that the agent can predictably be made better off by its own lights through an alternative sequence of actions. So I don't think that's possible, though I may be misreading you. Could you give an example of a precommitment that the agent would take? In my mind, an example of this would have to show that the agent (not the negotiating subagents) strictly prefers the commitment to what it otherwise would've done according to DSM etc.
On my understanding, the argument isn’t that your DSM agent can...
I'll first flag that the results don't rely on subagents. Creating a group agent out of multiple subagents is possibly an interesting way to create an agent representable as having incomplete preferences, but this isn't the same as creating a single agent whose single preference relation happens not to satisfy completeness.
Flagging here that I don't think the subagent framing is super important and/or necessary for "collusion" to happen. Even if the "outer" agent isn't literally built from subagents, "collusion" can still occur in the sense that it [the...
If we live in an “alignment by default” universe, that means we can get away with being careless, in the sense of putting forth minimal effort to align our AGI, above and beyond the effort put in to get it to work at all.
This would be great if true! But unfortunately, I don’t see how we’re supposed to find out that it’s true, unless we decide to be careless right now, and find out afterwards that we got lucky. And in a world where we were that lucky—lucky enough to not need to deliberately try to get anything right, and get away with it—I mostly think misu...
Can you say more about how a “frame” differs from a “model”, or a “hypothesis”?
(I understand the distinction between those three and “propositions”. It’s less clear to me how they differ from each other. And if they don’t differ, then I’m pretty sure you can just integrate over different “frames” in the usual way to produce a final probability/EV estimate on whatever proposition/decision you’re interested in. But I’m pretty sure you don’t need Garrabrant induction to do that, so I mostly think I don’t understand what you’re talking about.)
I’ll bite even further, and ask for the concept of “recurrence” itself to be dumbed down. What is “recurrence”, why is it important, and in what sense does e.g. a feedforward network hooked up to something like MCTS not qualify as relevantly “recurrent”?
You have my (mostly abstract, fortunately/unfortunately) sympathies for what you went through, and I’m glad for you that you sound to be doing better than you were.
Having said that: my (rough) sense, from reading this post, is that you’ve got a bunch of “stuff” going on, some of it plausibly still unsorted, and that that stuff is mixed together in a way that I feel is unhelpful. For example, the things included at the beginning of the post as “necessary background” don’t feel to me entirely separate from what you later describe occurring; they mostly feel ...
I am pushing back because, if you are St. Petersburg Paradox-pilled like SBF and make public statements that actually you should keep taking double or nothing bets, perhaps you are more likely to make tragic betting decisions, and that's because you're taking certain ideas seriously. If you have galaxy-brained the idea of the St. Petersburg Paradox, it seems like Alameda-style fraud is +EV.
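To spell out why "keep taking double or nothing bets" fails in practice, a quick simulation with stand-in numbers (each bet: 60% to double the bankroll, 40% to lose it all, so every individual bet is +EV):

```python
import random

# Each bet is +EV in isolation (1.2x), so naive EV-maximization says keep
# betting; but the probability of surviving n full-bankroll bets is 0.6^n,
# which goes to zero.
def play(n_bets: int, p_win: float = 0.6) -> float:
    bankroll = 1.0
    for _ in range(n_bets):
        bankroll = 2 * bankroll if random.random() < p_win else 0.0
        if bankroll == 0.0:
            break
    return bankroll

random.seed(0)
trials = [play(20) for _ in range(100_000)]
print(sum(t > 0 for t in trials) / len(trials))  # survival rate, ~0.6^20 ≈ 3.7e-5
print(sum(trials) / len(trials))                 # noisy mean; expectation 1.2^20 ≈ 38
```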
This is conceding a big part of your argument. You’re basically saying, yes, SBF’s decision was -EV according to any normal analysis, but according to a particular ...
If people inevitably sometimes make mistakes when interpreting theories, and theory-driven mistakes are more likely to be catastrophic than the mistakes people make when acting according to "atheoretical" learning from experience and imitation, then unusually theory-driven people are more likely to make catastrophic mistakes. In the absence of a way to prevent people from sometimes making mistakes when interpreting theories, this seems like a pretty strong argument in favor of atheoretical learning from experience and imitation!
This is particularly pertine...
RE: decision theory w.r.t. how "other powerful beings" might respond - I really do think Nate has already argued this, and his arguments continue to seem more compelling to me than the opposition's. Relevant quotes include:
...It’s possible that the paperclipper that kills us will decide to scan human brains and save the scans, just in case it runs into an advanced alien civilization later that wants to trade some paperclips for the scans. And there may well be friendly aliens out there who would agree to this trade, and then give us a little pocket of th
I concretely disagree with (what I see as) your implied premise that the outer (training) task has any direct influence on the inner optimizer's cognition. I think this disagreement (which I internally feel like I've already tried to make a number of times) has been largely ignored so far. As a result, many of the things you wrote seem to me to be answerable by largely the same objection:
...As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/o
Full Solomonoff Induction on a hypercomputer absolutely does not just "learn very similar internal functions/models", it effectively recreates actual human brains.
Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.
...yes? And this is obviously very, very different from how humans represent things internally?
I mean, for ...
Yeah, I'm growing increasingly confident that we're talking about different things. I'm not referring to "masks" in the sense that you mean it.
...I don't know what you mean by "one" or by "inner". I would expect different masks to behave differently, acting as if optimizing different things (though that could be narrowed using RLHF), but they could re-use components between them. So, you could have, for example, a single calculation system that is reused but takes as input a bunch of parameters that have different values for different masks, which (ag
I want to revisit what Rob actually wrote:
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-running whole-brain emulation", then hitting a button to execute the plan would kill all humans, with very high probability.
(emphasis mine)
That sounds a whole lot like it's invoking a simplicity prior to me!
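Spelling that out: "weighted by length, in any extant formal language" is literally

$$P(\text{plan}) \propto 2^{-\ell(\text{plan})},$$

where $\ell(\text{plan})$ is the plan's description length in the chosen language. That is the defining form of a simplicity prior.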
Note I didn't actually reply to that quote. Sure, that's an explicit simplicity prior. However there's a large difference under the hood between using an explicit simplicity prior on plan length vs an implicit simplicity prior on the world and action models which generate plans. The latter is what is more relevant for intrinsic similarity to human thought processes (or not).
LLMs and human brains learn from basically the same data with similar training objectives powered by universal approximations of Bayesian inference, and thus learn very similar internal functions/models.
This argument proves too much. A Solomonoff inductor (AIXI) running on a hypercomputer would also "learn from basically the same data" (sensory data produced by the physical universe) with "similar training objectives" (predict the next bit of sensory information) using "universal approximations of Bayesian inference" (a perfect approximation, in this cas...
Full Solomonoff Induction on a hypercomputer absolutely does not just "learn very similar internal functions/models", it effectively recreates actual human brains.
Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.
...you need to first investigate the actual internal representations of the systems in question, and verify that
E.g. a system capable of correctly answering questions like "given such-and-such chess position, what is the best move for the current player?" must in fact be performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.
Yes, but that sort of question is in my view answered by the "mask", not by something outside the mask.
I don't think this parses for me. The computation performed to answer the question occurs inside the LLM, yes? Whether you classify said computation as coming from "the mask" or no...
Your phrasing here is vague and somewhat convoluted, so I have difficulty telling if what you say is simply misleading, or false. Regardless:
If you have UTM1 and UTM2, there is a constant-length pr...