All of dxu's Comments + Replies

dxu*40

The constant bound isn't that relevant, and not just because of the in-principle unbounded size: it also doesn't constrain the induced probabilities in the second coding scheme much at all. It's an upper bound on the maximum length, so you can still have the weightings in coding scheme B differ in relative length by a ton, leading to wildly different priors

Your phrasing here is vague and somewhat convoluted, so I have difficulty telling if what you say is simply misleading, or false. Regardless:

If you have UTM1 and UTM2, there is a constant-length pr... (read more)
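(For reference, the standard result presumably being invoked here is the invariance theorem: if U1 and U2 are universal Turing machines, there is a constant-length "compiler" for U1 on U2, so the corresponding Kolmogorov complexities and universal priors differ by at most a constant that is independent of the string being coded. A sketch of the statement, with c the length of the emulator:)

```latex
K_{U_2}(x) \;\le\; K_{U_1}(x) + c_{U_1 \to U_2}
\qquad\Longrightarrow\qquad
m_{U_2}(x) \;\ge\; 2^{-c_{U_1 \to U_2}}\, m_{U_1}(x)
```

The dispute below is about how much this buys in practice, since the constant can be large and the bound still permits large relative differences between specific hypotheses.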

1keith_wynroe
I feel like this could branch out into a lot of small disagreements here, but in the interest of keeping it streamlined: I agree with all of this, and wasn't gesturing at anything related to it, so I think we're talking past each other. My point was simply that two UTMs, even with not-very-large prefix encodings, can wind up with extremely different priors; but I don't think that's too relevant to your main point.

I think I disagree with almost all of this. You can fix some gerrymandered extant physical system right now that ends up looking like a garbled world-history optimizer; I doubt that it would take on the order of length ~2^10^80 to specify it. But granting that these systems would in fact have astronomical prefixes, I think this is a ponens/tollens situation: if these systems actually have a huge prefix, that tells me that the encoding schemes of some physically realisable systems are deeply incompatible with mine, not that those systems which are out there right now aren't physically realisable.

I imagine an objection is that these physical systems are not actually world-history optimizers and are actually going to be much more compressible than I'm making them out to be, so your argument goes through. In which case I'm fine with this; this just seems like a differing definition of what counts as two schemes acting "virtually identically" w.r.t. optimization criteria. If your argument is valid but is bounding this similarity to include e.g. random chunks of a rock floating through space, then I'm happy to concede that - it seems quite trivial and not at all worrying from the original perspective of bounding the kinds of optimization criteria an AI might have
dxu*31

All possible encoding schemes / universal priors differ from each other by at most a finite prefix. You might think this doesn't achieve much, since the length of the prefix can be in principle unbounded; but in practice, the length of the prefix (or rather, the prior itself) is constrained by a system's physical implementation. There are some encoding schemes which neither you nor any other physical entity will ever be able to implement, and so for the purposes of description length minimization these are off the table. And of the encoding schemes that remain on the table, virtually all of them will behave identically with respect to the description lengths they assign to "natural" versus "unnatural" optimization criteria.

1keith_wynroe
The constant bound isn't that relevant, and not just because of the in-principle unbounded size: it also doesn't constrain the induced probabilities in the second coding scheme much at all. It's an upper bound on the maximum length, so you can still have the weightings in coding scheme B differ in relative length by a ton, leading to wildly different priors

I have no idea how you're getting to this - not sure if it's claiming a formal result or just a hunch. But I disagree both that there is a neat correspondence between a system being physically realizable and its having a concise implementation as a TM, and that, even granting that point, nearly all or even most of these physically realisable systems will behave identically or even similarly w.r.t. how they assign codes to "natural" optimization criteria
dxu80

It looks to me like the "updatelessness trick" you describe (essentially, behaving as though certain non-local branches of the decision tree are still counterfactually relevant even though they are not — although note that I currently don't see an obvious way to use that to avoid the usual money pump against intransitivity) recovers most of the behavior we'd see under VNM anyway; and so I don't think I understand your confusion re: VNM axioms.

E.g. can you give me a case in which (a) we have an agent that exhibits preferences against whose naive implementat... (read more)

2Jeremy Gillen
Good point. What I meant by "updatelessness removes most of the justification" is the reason given here at the very beginning of "Against Resolute Choice". In order to make a money pump that leads the agent in a circle, the agent has to continue accepting trades around a full preference loop. But if it has decided on the entire plan beforehand, it will just do any plan that involves <1 trip around the preference loop. (Although it's unclear how it would settle on such a plan, maybe just stopping its search after a given time). It won't (I think?) choose any plan that does multiple loops, because they are strictly worse. After choosing this plan though, I think it is representable as VNM rational, as you say. And I'm not sure what to do with this. It does seem important. However, I think Scott's argument here satisfies (a), (b), and (c). I think the independence axiom might be special in this respect, because the money pump for independence is exploiting an update on new information.
dxu*42

I think I might be missing something, because the argument you attribute to Dávid still looks wrong to me. You say:

The entropy of the simulators’ distribution need not be more than the entropy of the (square of the) wave function in any relevant sense. Despite the fact that subjective entropy may be huge, physical entropy is still low (because the simulations happen on a high-amplitude ridge of the wave function, after all).

Doesn't this argument imply that the supermajority of simulations within the simulators' subjective distribution over universe his... (read more)

2So8res
I agree that in real life the entropy argument is an argument in favor of it being actually pretty hard to fool a superintelligence into thinking it might be early in Tegmark III when it's not (even if you yourself are a superintelligence, unless you're doing a huge amount of intercepting its internal sanity checks (which puts significant strain on the trade possibilities and which flirts with being a technical-threat)). And I agree that if you can't fool a superintelligence into thinking it might be early in Tegmark III when it's not, then the purchasing power of simulators drops dramatically, except in cases where they're trolling local aliens. (But the point seems basically moot, as 'troll local aliens' is still an option, and so afaict this does all essentially iron out to "maybe we'll get sold to aliens".)
dxu*64

The AI has a similarly hard time to the simulators figuring out what's a plausible configuration to arise from the big bang. Like the simulators have an entropy N distribution of possible AIs, the AI itself also has an entropy N distribution for that. So its probability that it's in a real Everett branch is not p, but p times 2^-N, as it has only a 2^-N prior probability that the kind of world it observes is the kind of thing that can come up in a real Everett branch. So it's balanced out with the simulation hypothesis, and as long as the simulators are s

... (read more)
4So8res
I basically endorse @dxu here. Fleshing out the argument a bit more: the part where the AI looks around this universe and concludes it's almost certainly either in basement reality or in some simulation (rather than in the void between branches) is doing quite a lot of heavy lifting. You might protest that neither we nor the AI have the power to verify that our branch actually has high amplitude inherited from some very low-entropy state such as the big bang, as a Solomonoff inductor would. What's the justification for inferring from the observation that we seem to have an orderly past, to the conclusion that we do have an orderly past? This is essentially Boltzmann's paradox.

The solution afaik is that the hypothesis "we're a Boltzmann mind somewhere in physics" is much, much more complex than the hypothesis "we're 13Gy down some branch emanating from a very low-entropy state". The void between branches is as large as the space of all configurations. The hypothesis "maybe we're in the void between branches" constrains our observations not-at-all; this hypothesis is missing details about where in the void between branches we are, and with no ridges to walk along we have to specify the contents of the entire Boltzmann volume. But the contents of the Boltzmann volume are just what we set out to explain! This hypothesis has hardly compressed our observations.

By contrast, the hypothesis "we're 13Gy down some ridge emanating from the big bang" is penalized only according to the number of bits it takes to specify a branch index, and the hypothesis "we're inside a simulation inside of some ridge emanating from the big bang" is penalized only according to the number of bits it takes to specify a branch index, plus the bits necessary to single out a simulation. And there's a wibbly step here where it's not entirely clear that the simple hypothesis does predict our observations, but like the Boltzmann hypothesis is basically just a maximum entropy hypothesis and doesn'
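(Schematically, using the quantities So8res describes rather than a precise formalization, the description-length comparison is:)

```latex
\begin{align*}
\ell(H_{\text{branch}})    &\approx \ell(\text{physics}) + \ell(\text{branch index})\\
\ell(H_{\text{sim}})       &\approx \ell(\text{physics}) + \ell(\text{branch index}) + \ell(\text{pointer to the simulation})\\
\ell(H_{\text{Boltzmann}}) &\approx \ell(\text{physics}) + \ell(\text{contents of the Boltzmann volume})
\end{align*}
```

The last term is comparable in size to the observations being explained, so the Boltzmann hypothesis achieves essentially no compression.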
3David Matolcsi
I think this is mistaken. In one case, you need to point out the branch, planet Earth within our Universe, and the time and place of the AI on Earth. In the other case, you need to point out the branch, the planet on which a server is running the simulation, and the time and place of the AI on the simulated Earth. Seems equally long to me. If necessary, we can let physical biological life emerge on the faraway planet and develop AI while we are observing them from space. This should make it clear that Solomonoff doesn't favor the AI being on Earth instead of this random other planet. But I'm pretty certain that the sim being run on a computer doesn't make any difference.
dxu71

These two kinds of “learning” are not synonymous. Adaptive systems “learn” things, but they don’t necessarily “learn about” things; they don’t necessarily have an internal map of the external territory. (Yes, the active inference folks will bullshit about how any adaptive system must have a map of the territory, but their math does not substantively support that interpretation.) The internal heuristics or behaviors “learned” by an adaptive system are not necessarily “about” any particular external thing, and don’t necessarily represent any particular exte

... (read more)
dxu213

It seems the SOTA for training LLMs has (predictably) pivoted away from pure scaling of compute + data, and towards RL-style learning based on (synthetic?) reasoning traces (mainly CoT, in the case of o1). AFAICT, this basically obviates safety arguments that relied on "imitation" as a key source of good behavior, since now additional optimization pressure is being applied towards correct prediction rather than pure imitation.

8Mark Xu
I think "basically obviates" is too strong. imitation of human-legible cognitive strategies + RL seems liable to produce very different systems that would been produced with pure RL. For example, in the first case, RL incentizes the strategies being combine in ways conducive to accuracy (in addition to potentailly incentivizing non-human-legible cognitive strategies), whereas in the second case you don't get any incentive towards productively useing human-legible cogntive strategies.
7Bogdan Ionut Cirstea
Disagree that it's obvious, depends a lot on how efficient (large-scale) RL (post-training) is at very significantly changing model internals, rather than just 'wrapping it around', making the model more reliable, etc. In the past, post-training (including RL) has been really bad at this.
dxu40

Strictly speaking, this seems very unlikely, since we know that e.g. CoT increases the expressive power of Transformers.

Ah, yeah, I can see how I might've been unclear there. I was implicitly taking CoT into account when I talked about the "base distribution" of the model's outputs, as it's essentially ubiquitous across these kinds of scaffolding projects. I agree that if you take a non-recurrent model's O(1) output and equip it with a form of recurrent state that you permit to continue for O(n) iterations, that will produce a qualitatively different di... (read more)

3Bogdan Ionut Cirstea
I suspect we probably have quite differing intuitions about what research processes/workflows tend to look like. In my view, almost all research looks quite a lot (roughly) like iterative improvements on top of existing literature(s) or like literature-based discovery, combining already-existing concepts, often in pretty obvious ways (at least in retrospect). This probably applies even more to ML research, and quite significantly to prosaic safety research too. Even the more innovative kind of research, I think, often tends to look like combining existing concepts, just at a higher level of abstraction, or from more distanced/less-obviously-related fields. Almost zero research is properly de novo (not based on any existing - including multidisciplinary - literatures). (I might be biased though by my own research experience and taste, which draw very heavily on existing literatures.)  If this view is right, then LM agents might soon have an advantage even in the ideation stage, since they can do massive (e.g. semantic) retrieval at scale and much cheaper / faster than humans; + they might already have much longer short-term-memory equivalents (context windows). I suspect this might compensate a lot for them likely being worse at research taste (e.g. I'd suspect they'd still be worse if they could only test a very small number of ideas), especially when there are decent proxy signals and the iteration time is short and they can make a lot of tries cheaply; and I'd argue that a lot of prosaic safety research does seem to fall into this category. Even when it comes to the base models themselves, I'm unsure how much worse they are at this point (though I do think they are worse than the best researchers, at least). I often find Claude-3.5 to be very decent at (though maybe somewhat vaguely) combining a couple of different ideas from 2 or 3 papers, as long as they're all in its context; while being very unlikely to be x-risky, since sub-ASL-3, very unlikely to be schemi
dxu84

And I suspect we probably can, given scaffolds like https://sakana.ai/ai-scientist/ and its likely improvements (especially if done carefully, e.g. integrating something like Redwood's control agenda, etc.). I'd be curious where you'd disagree (since I expect you probably would) - e.g. do you expect the AI scientists become x-risky before they're (roughly) human-level at safety research, or they never scale to human-level, etc.?

Jeremy's response looks to me like it mostly addresses the first branch of your disjunction (AI becomes x-risky before reaching... (read more)

5Bogdan Ionut Cirstea
Strictly speaking, this seems very unlikely, since we know that e.g. CoT increases the expressive power of Transformers. And also intuitively, I expect, for example, that Sakana's agent would be quite a bit worse without access to semantic search for comparing idea novelty; and that it would probably be quite a bit better if it could e.g. retrieve embeddings of full paragraphs from papers, etc.
dxu40

I'm interested! Also curious as to how this is implemented; are you using retrieval-augmented generation, and if so, with what embeddings?

4Ruby
You are added! Claude 3.5 Sonnet is the chat client, and yes, with RAG using OpenAI text-embedding-3-large for embeddings.
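(For readers unfamiliar with the setup Ruby describes, here is a minimal sketch of embedding-based retrieval with text-embedding-3-large. The corpus, helper names, and scoring are illustrative assumptions, not the actual LessWrong implementation; the only detail taken from the comment is the embedding model. The retrieved snippets would then be passed to the chat model as context.)

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with OpenAI's text-embedding-3-large."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

# Hypothetical corpus of post/comment snippets to retrieve over.
corpus = [
    "Coherence arguments and VNM rationality...",
    "Solomonoff induction and the universal prior...",
    "The shutdown problem and corrigibility...",
]
corpus_vecs = embed(corpus)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus snippets most similar to the query (cosine similarity)."""
    q = embed([query])[0]
    sims = corpus_vecs @ q / (np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(q))
    return [corpus[i] for i in np.argsort(-sims)[:k]]

print(retrieve("How do money pumps relate to the VNM axioms?"))
```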
dxu*20

Epistemic status: exploratory, "shower thought", written as part of a conversation with Claude:

For any given entity (broadly construed here to mean, essentially, any physical system), it is possible to analyze that entity as follows:

Define the set of possible future trajectories that entity might follow, according to some suitably uninformative ignorance prior on its state and (generalized) environment. Then ask, of that set, whether there exists some simple, obvious, or otherwise notable prior on the set in question, that assigns probabilities to var

... (read more)
dxu*139

The rule of thumb test I tend to use to assess proposed definitions of agency (at least from around these parts) is whether they'd class a black hole as an agent. It's not clear to me whether this definition does; I would have said it very likely does based on everything you wrote, except for this one part here:

A cubic meter of rock has a persistent boundary over time, but no interior states in an informational sense, and therefore is not an agent. To see it has no interior, note that anything that puts information into the surface layer of the rock tr

... (read more)
5edbs
Ah, yes, this took me a long time to grok. It's subtle and not explained well in most of the literature IMO. Let me take a crack at it.

When you're talking about agents, you're talking about the domain of coupled dynamic systems. This can be modeled as a set of internal states, a set of blanket states divided into active and sensory, and a set of external states (it's worth looking at this diagram to get a visual). When modeling an agent, we model the agent as the combination of all internal states and all blanket states. The active states are how the agent takes action, the sensory states are how the agent gets observations, and the internal states have their own dynamics as a generative model.

But how did we decide which part of this coupled dynamic system was the agent in the first place? Well, we picked one of the halves and said "it's this half". Usually we pick the smaller half (the human) rather than the larger half (the entire rest of the universe) but mathematically there is no distinction. From this lens they are both simply coupled systems. So let's reverse it and model the environment instead. What do we see then? We see a set of states internal to the environment (called "external states" in the diagram)...and a bunch of blanket states. The same blanket states, with the labels switched. The agent's active states are the environment's sensory states, the agent's sensory states are the environment's active states. But those are just labels, the states themselves belong to both the environment and the agent equally.

OK, so what does this have to do with a rock? Well, the very surface of the rock is obviously blanket state. When you lightly press the surface of the rock, you move the atoms in the surface of the rock. But because they are rigidly connected to the next atoms, you move them too. And again. And again. The whole rock acts as a single set of sensory states. When you lightly press the rock, the rock presses back against you, but again not just
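(A toy numerical sketch of the partition edbs describes - internal and external states coupled only through sensory and active blanket states. The specific dynamics are made up for illustration; this is not the free-energy-principle math itself.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Four-way partition: internal (mu), external (eta), sensory (s), active (a).
mu, eta, s, a = (rng.normal(size=4) for _ in range(4))

def step(mu, eta, s, a, dt=0.01):
    d_mu  = -mu  + np.tanh(s)    # internal states see the world only through sensory states
    d_a   = -a   + np.tanh(mu)   # active states are driven by internal states
    d_eta = -eta + np.tanh(a)    # external states are influenced only via active states
    d_s   = -s   + np.tanh(eta)  # sensory states are driven by external states
    return mu + dt * d_mu, eta + dt * d_eta, s + dt * d_s, a + dt * d_a

for _ in range(1000):
    mu, eta, s, a = step(mu, eta, s, a)

# Relabeling: the same blanket (s, a) can be read as the environment's sensory and
# active states respectively; nothing in the math privileges one side as "the agent".
```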
dxu41

How is a Bayesian agent supposed to modify priors except by updating on the basis of evidence?

They're not! But humans aren't ideal Bayesians, and it's entirely possible for them to update in a way that does change their priors (encoded by intuitions) moving forward. In particular, the difference between having updated one's intuitive prior, and keeping the intuitive prior around but also keeping track of a different, consciously held posterior, is that the former is vastly less likely to "de-update", because the evidence that went into the update isn't ... (read more)

dxu2-2

There's also a failure mode of focusing on "which arguments are the best" instead of "what is actually true". I don't understand this failure mode very well, except that I've seen myself and others fall into it. Falling into it looks like focusing a lot on specific arguments, and spending a lot of time working out what was meant by the words, rather than feeling comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs.

My sense is that this is because different people have different intuitive priors, an... (read more)

1[anonymous]
I'm not sure I understand this distinction as-written. How is a Bayesian agent supposed to modify priors except by updating on the basis of evidence?
dxu40

Can we not speak of apparent coherence relative to a particular standpoint? If a given system seems to be behaving in such a way that you personally can't see a way to construct for it a Dutch book, a series of interactions with it such that energy/negentropy/resources can be extracted from it and accrue to you, that makes the system inexploitable with respect to you, and therefore at least as coherent as you are. The closer to maximal coherence a given system is, the less it will visibly depart from the appearance of coherent behavior, and hence utility f... (read more)

dxu20

I seem to recall hearing a phrase I liked, which appears to concisely summarize the concern as: "There's no canonical way to scale me up."

Does that sound right to you?

1[anonymous]
I mentioned it above :)
dxu20

Well, if we're following standard ML best practices, we have a train set, a dev set, and a test set. The purpose of the dev set is to check and ensure that things are generalizing properly. If they aren't generalizing properly, we tweak various hyperparameters of the model and retrain until they do generalize properly on the dev set. Then we do a final check on the test set to ensure we didn't overfit the dev set. If you forgot or never learned this stuff, I highly recommend brushing up on it.

(Just to be clear: yes, I know what training and test sets ar... (read more)

1Ebenezer Dukakis
Indeed. I think the key thing for me is, I expect the model to be strongly incentivized to have a solid translation layer from its internal ontology to e.g. English language, due to being trained on lots of English language data. Due to Occam's Razor, I expect the internal ontology to be biased towards that of an English-language speaker. I'm imagining something like: early in training the model makes use of those lossy approximations because they are a cheap/accessible way to improve its predictive accuracy. Later in training, assuming it's being trained on the sort of gigantic scale that would allow it to hold swaths of the physical universe in its head, it loses those desired lossy abstractions due to catastrophic forgetting. Is that an OK way to operationalize your concern? I'm still not convinced that this problem is a priority. It seems like a problem which will be encountered very late if ever, and will lead to 'random' failures on predicting future/counterfactual data in a way that's fairly obvious.
dxu20

I think it ought to be possible for someone to always be present. [I'm also not sure it would be necessary.]

I think I don't understand what you're imagining here. Are you imagining a human manually overseeing all outputs of something like ChatGPT, or Microsoft Copilot, before those outputs are sent to the end user (or, worse yet, put directly into production)?

[I also think I don't understand why you make the bracketed claim you do, but perhaps hashing that out isn't a conversational priority.]

As I understand this thought experiment, we're doing next-t

... (read more)
1Ebenezer Dukakis
Was using a metaphorical "you". Probably should've said something like "gradient descent will find a way to read the next token out of the QFT-based simulation". I suppose I should've said various documents are IID to be more clear. I would certainly guess they are. Generally speaking, yes. Well, if we're following standard ML best practices, we have a train set, a dev set, and a test set. The purpose of the dev set is to check and ensure that things are generalizing properly. If they aren't generalizing properly, we tweak various hyperparameters of the model and retrain until they do generalize properly on the dev set. Then we do a final check on the test set to ensure we didn't overfit the dev set. If you forgot or never learned this stuff, I highly recommend brushing up on it. In principle we could construct a test set or dev set either before or after the model has been trained. It shouldn't make a difference under normal circumstances. It sounds like maybe you're discussing a scenario where the model has achieved a level of omniscience, and it does fine on data that was available during its training, because it's able to read off of an omniscient world-model. But then it fails on data generated in the future, because the translation method for its omniscient world-model only works on artifacts that were present during training. Basically, the time at which the data was generated could constitute a hidden and unexpected source of distribution shift. Does that summarize the core concern? (To be clear, this sort of acquired-omniscience is liable to sound kooky to many ML researchers. I think it's worth stress-testing alignment proposals under these sort of extreme scenarios, but I'm not sure we should weight them heavily in terms of estimating our probability of success. In this particular scenario, the model's performance would drop on data generated after training, and that would hurt the company's bottom line, and they would have a strong financial incenti
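(For concreteness, a minimal sketch of the standard workflow described above: hyperparameters are tuned against the dev set, and the test set is touched only once at the end. The data and model here are toy stand-ins.)

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy data standing in for the training corpus.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=1000)

# Standard three-way split: train / dev / test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Tune a hyperparameter against the dev set...
best_alpha, best_dev_mse = None, float("inf")
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    dev_mse = mean_squared_error(y_dev, model.predict(X_dev))
    if dev_mse < best_dev_mse:
        best_alpha, best_dev_mse = alpha, dev_mse

# ...and only look at the test set once, at the end, to check for overfitting to dev.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, final_model.predict(X_test)))
```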
dxu40

I'm confused about what it means to "remove the human", and why it's so important whether the human is 'removed'.

Because the human isn't going to constantly be present for everything the system does after it's deployed (unless for some reason it's not deployed).

If I can assume that stuff, then it feels like a fairly core task, abundantly stress-tested during training, to read off the genius philosopher's spoken opinions about e.g. moral philosophy from the quantum fields. How else could quantum fields be useful for next-token predictions?

Quantum fie... (read more)

2Ebenezer Dukakis
I think it ought to be possible for someone to always be present. [I'm also not sure it would be necessary.]

It's not the genius philosopher that's the core task, it's the reading of their opinions out of a QFT-based simulation of them. As I understand this thought experiment, we're doing next-token prediction on e.g. a book written by a philosopher, and in order to predict the next token using QFT, the obvious method is to use QFT to simulate the philosopher. But that's not quite enough -- you also need to read the next token out of that QFT-based simulation if you actually want to predict it. This sort of 'reading tokens out of a QFT simulation' thing would be very common, thus something the system gets good at in order to succeed at next-token prediction.

----------------------------------------

I think perhaps there's more to your thought experiment than just alien abstractions, and it's worth disentangling these assumptions. For one thing, in a standard train/dev/test setup, the model is arguably not really doing prediction, it's doing retrodiction. It's making 'predictions' about things which already happened in the past. The final model is chosen based on what retrodicts the data the best. Also, usually the data is IID rather than sequential -- there's no time component to the data points (unless it's a time-series problem, which it usually isn't). The fact that we're choosing a model which retrodicts well is why the presence/absence of a human is generally assumed to be irrelevant, and emphasizing this factor sounds wacky to my ML engineer ears.

So basically I suspect what you're really trying to claim here, which incidentally I've also seen John allude to elsewhere, is that the standard assumptions of machine learning involving retrodiction and IID data points may break down once your system gets smart enough. This is a possibility worth exploring, I just want to clarify that it seems orthogonal to the issue of alien abstractions. In principle one can i
dxu*8-1

I'd assume that when we tell it, "optimize this company, in a way that we would accept, after a ton of deliberation", this could be instead described as, "optimize this company, in a way that we would accept, after a ton of deliberation, where these terms are described using our ontology"

The problem shows up when the system finds itself acting in a regime where the notion of us (humans) "accepting" its optimizations becomes purely counterfactual, because no actual human is available to oversee its actions in that regime. Then the question of "would a hu... (read more)

6Ebenezer Dukakis
I'm confused about what it means to "remove the human", and why it's so important whether the human is 'removed'. Maybe if I try to nail down more parameters of the hypothetical, that will help with my confusion. For the sake of argument, can I assume...

* That the AI is running computations involving quantum fields because it found that was the most effective way to make e.g. next-token predictions on its training set?
* That the AI is in principle capable of running computations involving quantum fields to represent a genius philosopher?

If I can assume that stuff, then it feels like a fairly core task, abundantly stress-tested during training, to read off the genius philosopher's spoken opinions about e.g. moral philosophy from the quantum fields. How else could quantum fields be useful for next-token predictions?

Another probe: Is alignment supposed to be hard in this hypothetical because the AI can't represent human values in principle? Or is it supposed to be hard because it also has a lot of unsatisfactory representations of human values, and there's no good method for finding a satisfactory needle in the unsatisfactory haystack? Or some other reason?

This sounds a lot like saying "it might fail to generalize". Supposing we make a lot of progress on out-of-distribution generalization, is alignment getting any easier according to you? Wouldn't that imply our systems are getting better at choosing proxies which generalize even when the human isn't 'present'?
dxu*2-2

To the extent that I buy the story about imitation-based intelligences inheriting safety properties via imitative training, I correspondingly expect such intelligences not to scale to having powerful, novel, transformative capabilities—not without an amplification step somewhere in the mix that does not rely on imitation of weaker (human) agents.

Since I believe this, that makes it hard for me to concretely visualize the hypothetical of a superintelligent GPT+DPO agent that nevertheless only does what is instructed. I mostly don't expect to be able to get t... (read more)

2tailcalled
Are we using the word "transformative" in the same way? I imagine that if society got reorganized into e.g. AI minds that hire tons of people to continually learn novel tasks that it can then imitate, that would be considered transformative because it would entirely change people's role in society, like the agricultural revolution did. Whereas right now very few people have jobs that are explicitly about pushing the frontier of knowledge, in the future that might be ~the only job that exists (conditional on GPT+DPO being the future, which again is not a mainline scenario).
dxu20

That (on its own, without further postulates) is a fully general argument against improving intelligence.

Well, it's a primarily a statement about capabilities. The intended construal is that if a given system's capabilities profile permits it to accomplish some sufficiently transformative task, then that system's capabilities are not limited to only benign such tasks. I think this claim applies to most intelligences that can arise in a physical universe like our own (though necessarily not in all logically possible universes, given NFL theorems): that ... (read more)

2tailcalled
What I'm saying is that if GPT+DPO creates imitation-based intelligences that can be dangerous due to being intentionally instructed to do something bad ("hey, please kill that guy" and then it kills him), then that's not particularly concerning from an AI alignment perspective, because it has a similar danger profile to telling humans this. You would still want policy to govern it, similar to how we have policy to govern human-on-human violence, but it's not the kind of x-risk that notkilleveryoneism is about. So basically you can have "GPT+DPO is superintelligent, capable and dangerous" without having "GPT+DPO is an x-risk". That said, I expect GPT+DPO to stagnate and be replaced by something else, and that something else could be an x-risk (and conditional on the negation of natural impact regularization, I strongly expect it would be).
dxu31

The methods we already have are not sufficient to create ASI, and also if you extrapolate out the SOTA methods at larger scale, it's genuinely not that dangerous.

I think I like the disjunct “If it’s smart enough to be transformative, it’s smart enough to be dangerous”, where the contrapositive further implies competitive pressures towards creating something dangerous (as opposed to not doing that).

There’s still a rub here—namely, operationalizing “transformative” in such a way as to give the necessary implications (both “transformative -> dangerous” ... (read more)

2tailcalled
That (on its own, without further postulates) is a fully general argument against improving intelligence. We have to accept some level of danger inherent in existence; the question is what makes AI particularly dangerous. If this special factor isn't present in GPT+DPO, then GPT+DPO is not an AI notkilleveryoneism issue.
dxu42

(9) is a values thing, not a beliefs thing per se. (I.e. it's not an epistemic claim.)

(11) is one of those claims that is probabilistic in principle (and which can be therefore be updated via evidence), but for which the evidence in practice is so one-sided that arriving at the correct answer is basically usable as a sort of FizzBuzz test for rationality: if you can’t get the right answer on super-easy mode, you’re probably not a good fit.

dxu*186

Something I wrote recently as part of a private conversation, which feels relevant enough to ongoing discussions to be worth posting publicly:

The way I think about it is something like: a "goal representation" is basically what you get when it's easier to state some compact specification on the outcome state, than it is to state an equivalent set of constraints on the intervening trajectories to that state.

In principle, this doesn't have to equate to "goals" in the intuitive, pretheoretic sense, but in practice my sense is that this happens largely when (a

... (read more)
dxu*Ω244817

It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether if you ask that system "Which action would maximize the expected amount of Y?" whether it will also be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.

Here's an existing Nate!comment that I find reasonably... (read more)

ryan_greenblatt*Ω1018-2

I don't see why you can't just ask at each point in time "Which action would maximize the expected value of X". It seems like asking once and asking many times as new things happen in reality don't have particularly different properties.

More detailed comment

Paul noted:

It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether if you ask that system "Which acti

... (read more)
dxu114

I think I'm not super into the U = V + X framing; that seems to inherently suggest that there exists some component of the true utility V "inside" the proxy U everywhere, and which is merely perturbed by some error term rather than washed out entirely (in the manner I'd expect to see from an actual misspecification). In a lot of the classic Goodhart cases, the source of the divergence between measurement and desideratum isn't regressional, and so V and X aren't independent.

(Consider e.g. two arbitrary functions U' and V', and compute the "error term" X' be... (read more)
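(To illustrate dxu's parenthetical: for two arbitrary functions U' and V', the residual X' = U' - V' will generally be correlated with V', so writing the proxy as "true utility plus independent error" is a substantive assumption rather than a free rewriting. A toy check with made-up functions:)

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(1000, 5))  # random "world states"

# Two arbitrary, unrelated functions standing in for proxy U' and true utility V'.
U = np.sin(w[:, 0]) + w[:, 1] ** 2
V = np.tanh(w[:, 2]) - 0.5 * w[:, 3]

X = U - V  # the "error term" obtained by forcing the U = V + X decomposition

print(np.corrcoef(V, X)[0, 1])  # strongly nonzero: X is not independent of V
```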

4Thomas Kwa
I think independence is probably the biggest weakness of the post just because it's an extremely strong assumption, but I have reasons why the U = V+X framing is natural here. The error term X has a natural meaning in the case where some additive terms of V are not captured in U (e.g. because they only exist off the training distribution), or some additive terms of U are not in V (e.g. because they're ways to trick the overseer). The example of two arbitrary functions doesn't seem very central because it seems to me that if we train U to approximate V, its correlation in distribution will be due to the presence of features in the data, rather than being coincidental. Maybe the features won't be additive or independent and we should think about those cases though. It still seems possible to prove things if you weaken independence to unbiasedness. Agree that we currently only analyze regressional and perhaps extremal Goodhart; people should be thinking about the other two as well.
dxu90

(Which, for instance, seems true about humans, at least in some cases: If humans had the computational capacity, they would lie a lot more and calculate personal advantage a lot more. But since those are both computationally expensive, and therefore can be caught-out by other humans, the heuristic / value of "actually care about your friends", is competitive with "always be calculating your personal advantage."

I expect this sort of thing to be less common with AI systems that can have much bigger "cranial capacity". But then again, I guess that at what

... (read more)
dxu20

It sounds like you're arguing that uploading is impossible, and (more generally) have defined the idea of "sufficiently OOD environments" out of existence. That doesn't seem like valid thinking to me.

2jacob_cannell
Of course I'm not arguing that uploading is impossible, and obviously there are always hypothetical "sufficiently OOD environments". But from the historical record so far we can only conclude that evolution's alignment of brains was robust enough compared to the environmental distribution shift encountered - so far. Naturally that could all change in the future, given enough time, but piling in such future predictions is clearly out of scope for an argument from historical analogy. These are just extremely different:

* an argument from historical observations
* an argument from future predicted observations

It's like I'm arguing that given that we observed the sequence 0,1,3,7 the pattern is probably 2^N-1, and you're arguing that it isn't because you predict the next term is 31. Regardless, uploads are arguably sufficiently categorically different that it's questionable how they even relate to the evolutionary success of homo sapien brain alignment to genetic fitness (do sims of humans count for genetic fitness? but only if DNA is modeled in some fashion? to what level of approximation? etc.)
dxu20

Notice I replied to that comment you linked and agreed with John - not that any generalized vector dot product model is wrong, but that the specific one in that post is wrong as it doesn't weight by expected probability (i.e. an incorrect distance function).

Anyway I used that only as a convenient example to illustrate a model which separates degree of misalignment from net impact, my general point does not depend on the details of the model and would still stand for any arbitrarily complex non-linear model.

The general point being that degree of mi

... (read more)
2jacob_cannell
Given any practical and reasonably aligned agent, there is always some set of conceivable OOD environments where that agent fails. Who cares? There is a single success criterion: utility in the real world! The success criterion is not "is this design perfectly aligned according to my adversarial pedantic critique".

The sharp left turn argument uses the analogy of brain evolution misaligned to IGF to suggest/argue for doom from misaligned AGI. But brains enormously increased human fitness rather than the predicted decrease, so the argument fails. In worlds where 1. alignment is very difficult, and 2. misalignment leads to doom (low utility), this would naturally translate into a great filter around intelligence - which we do not observe in the historical record. Evolution succeeded at brain alignment on the first try.

I think this entire line of thinking is wrong - you have little idea what environmental changes are plausible and next to no idea of how brains would adapt. When you move the discussion to speculative future technology to support the argument from a historical analogy, you have conceded that the historical analogy does not support your intended conclusion (and indeed it cannot, because homo sapiens is an enormous alignment success).
dxu41

No AI we create will be perfectly aligned, so instead all that actually matters is the net utility that AI provides for its creators: something like the dot product between our desired future trajectory and that of the agents. More powerful agents/optimizers will move the world farther faster (longer trajectory vector) which will magnify the net effect of any fixed misalignment (cos angle between the vectors), sure. But that misalignment angle is only relevant/measurable relative to the net effect - and by that measure human brain evolution was an enormou

... (read more)
8jacob_cannell
Notice I replied to that comment you linked and agreed with John - not that any generalized vector dot product model is wrong, but that the specific one in that post is wrong as it doesn't weight by expected probability (i.e. an incorrect distance function). Anyway I used that only as a convenient example to illustrate a model which separates degree of misalignment from net impact; my general point does not depend on the details of the model and would still stand for any arbitrarily complex non-linear model. The general point being that degree of misalignment is only relevant to the extent it translates into a difference in net utility.

From the perspective of evolutionary fitness, humanity is the penultimate runaway success - AFAIK we are possibly the species with the fastest growth in fitness ever in the history of life. This completely overrides any and all arguments about possible misalignment, because any such misalignment is essentially epsilon in comparison to the fitness gain brains provided.

For AGI, there is a singular correct notion of misalignment which actually matters: how does the creation of AGI - as an action - translate into differential utility, according to the utility function of its creators? If AGI is aligned to humanity about the same as brains are aligned to evolution, then AGI will result in an unimaginable increase in differential utility which vastly exceeds any slight misalignment.

You can speculate all you want about the future and how brains may become misaligned in the future, but that is just speculation. If you actually believe the sharp left turn argument holds water, where is the evidence? As I said earlier, this evidence must take a specific form, as evidence in the historical record:
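(One way to write down the toy model being discussed here, with a as the agent's realized trajectory, d as our desired trajectory, and theta the "misalignment angle":)

```latex
\text{net impact} \;\approx\; \langle a, d \rangle \;=\; \|a\|\,\|d\|\cos\theta,
\qquad
\text{shortfall vs. perfect alignment} \;=\; \|a\|\,\|d\|\,(1 - \cos\theta)
```

A more powerful optimizer (larger norm of a) scales up both terms; the disagreement above is over whether the angle or the net product is the quantity that matters.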
dxu30

It looks a bit to me like your Timestep Dominance Principle forbids the agent from selecting any trajectory which loses utility at a particular timestep in exchange for greater utility at a later timestep, regardless of whether the trajectory in question actually has anything to do with manipulating the shutdown button? After all, conditioning on the shutdown being pressed at any point after the local utility loss but before the expected gain, such a decision would give lower sum-total utility within those conditional trajectories than one which doesn't ma... (read more)

4EJT
That's not quite right. If we're comparing two lotteries, one of which gives lower expected utility than the other conditional on shutdown at some timestep and greater expected utility than the other conditional on shutdown at some other timestep, then neither of these lotteries timestep dominates the other. And then the Timestep Dominance principle doesn't apply, because it's a conditional rather than a biconditional. The Timestep Dominance Principle just says: if X timestep dominates Y, then the agent strictly prefers X to Y. It doesn't say anything about cases where neither X nor Y timestep dominates the other. For all we've said so far, the agent could have any preference relation between such lotteries. That said, your line of questioning is a good one, because there almost certainly are lotteries X and Y such that (1) neither of X and Y timestep dominates the other, and yet (2) we want the agent to strictly prefer X to Y. If that's the case, then we'll want to train the agent to satisfy other principles besides Timestep Dominance. And there's still some figuring out to be done here: what should these other principles be? can we find principles that lead agents to pursue goals competently without these principles causing trouble elsewhere? I don't know but I'm working on it. Can you say a bit more about this? Humans don't reason by Timestep Dominance, but they don't do explicit EUM calculations either and yet EUM-representability is commonly considered a natural form for preferences to take.
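(A toy formalization of the point, assuming the natural reading of "timestep dominance": X dominates Y iff X is at least as good conditional on shutdown at every timestep and strictly better at some timestep. In dxu's scenario of paying a cost early for a larger gain later, neither lottery dominates the other, so the principle is silent. The numbers are made up.)

```python
def timestep_dominates(x, y):
    """x, y: dicts mapping shutdown timestep -> expected utility conditional on
    shutdown at that timestep. True iff x is weakly better at every timestep
    and strictly better at some timestep."""
    assert x.keys() == y.keys()
    at_least_as_good = all(x[t] >= y[t] for t in x)
    strictly_better = any(x[t] > y[t] for t in x)
    return at_least_as_good and strictly_better

# A lottery that gives up utility early in exchange for more utility later,
# versus one that never pays the cost.
X = {1: 4.0, 2: 4.0, 3: 10.0}
Y = {1: 5.0, 2: 5.0, 3: 6.0}

print(timestep_dominates(X, Y))  # False
print(timestep_dominates(Y, X))  # False -> neither dominates; Timestep Dominance is silent
```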
dxu20

In your example, DSM permits the agent to end up with either A+ or B. Neither is strictly dominated, and neither has become mandatory for the agent to choose over the other. The agent won't have reason to push probability mass from one towards the other.

But it sounds like the agent's initial choice between A and B is forced, yes? (Otherwise, it wouldn't be the case that the agent is permitted to end up with either A+ or B, but not A.) So the presence of A+ within a particular continuation of the decision tree influences the agent's choice at the initial... (read more)

dxu62

This is a good post! It feels to me like a lot of discussion I've recently encountered seem to be converging on this topic, and so here's something I wrote on Twitter not long ago that feels relevant:

I think most value functions crystallized out of shards of not-entirely-coherent drives will not be friendly to the majority of the drives that went in; in humans, for example, a common outcome of internal conflict resolution is to explicitly subordinate one interest to another.

I basically don’t think this argument differs very much between humans and ASI

... (read more)
dxu40

The main way I'd imagine shutdown-corrigibility failing in AutoGPT (or something like it) is not that a specific internal sim is "trying" to be incorrigible at the top level, but rather that AutoGPT has a bunch of subprocesses optimizing for different subgoals without a high-level picture of what's going on, and some of those subgoals won't play well with shutdown. That's the sort of situation where I could easily imagine that e.g. one of the subprocesses spins up a child system prior to shutdown of the main system, without the rest of the main system cat

... (read more)
dxuΩ120

This looks to me like a misunderstanding that I tried to explain in section 3.1. Let me know if not, though, ideally with a worked-out example of the form: "here's the decision tree(s), here's what DSM mandates, here's why it's untrammelled according to the OP definition, and here's why it's problematic."

I don't think I grok the DSM formalism enough to speak confidently about what it would mandate, but I think I see a (class of) decision problem where any agent (DSM or otherwise) must either pass up a certain gain, or else engage in "problematic" behavi... (read more)

1SCP
In your example, DSM permits the agent to end up with either A+ or B. Neither is strictly dominated, and neither has become mandatory for the agent to choose over the other. The agent won't have reason to push probability mass from one towards the other. This is reasonable but I think my response to your comment will mainly involve re-stating what I wrote in the post, so maybe it'll be easier to point to the relevant sections: 3.1. for what DSM mandates when the agent has beliefs about its decision tree, 3.2.2 for what DSM mandates when the agent hadn't considered an actualised continuation of its decision tree, and 3.3. for discussion of these results. In particular, the following paragraphs are meant to illustrate what DSM mandates in the least favourable epistemic state that the agent could be in (unawareness with new options appearing):
dxuΩ120

My results above on invulnerability preclude the possibility that the agent can predictably be made better off by its own lights through an alternative sequence of actions. So I don't think that's possible, though I may be misreading you. Could you give an example of a precommitment that the agent would take? In my mind, an example of this would have to show that the agent (not the negotiating subagents) strictly prefers the commitment to what it otherwise would've done according to DSM etc.

On my understanding, the argument isn’t that your DSM agent can... (read more)

1SCP
I don't see how this could be right. Consider the bounding results on trammelling under unawareness (e.g. Proposition 10). They show that there will always be a set of options between which DSM does not require choosing one over the other. Suppose these are X and Y. The agent will always be able to choose either one. They might end up always choosing X, always Y, switching back and forth, whatever. This doesn't look like the outcome of two subagents, one preferring X and the other Y, negotiating to get some portion of the picks.

Forgive me; I'm still not seeing it. For coming up with examples, I think for now it's unhelpful to use the shutdown problem, because the actual proposal from Thornley includes several more requirements. I think it's perfectly fine to construct examples about trammelling and subagents using something like this: A is a set of options with typical member a_i. These are all comparable and ranked according to their subscripts. That is, a_1 is preferred to a_2, and so on. Likewise with set B. And all options in A are incomparable to all options in B.

This looks to me like a misunderstanding that I tried to explain in section 3.1. Let me know if not, though, ideally with a worked-out example of the form: "here's the decision tree(s), here's what DSM mandates, here's why it's untrammelled according to the OP definition, and here's why it's problematic."
dxuΩ120

I'll first flag that the results don't rely on subagents. Creating a group agent out of multiple subagents is possibly an interesting way to create an agent representable as having incomplete preferences, but this isn't the same as creating a single agent whose single preference relation happens not to satisfy completeness.

Flagging here that I don't think the subagent framing is super important and/or necessary for "collusion" to happen. Even if the "outer" agent isn't literally built from subagents, "collusion" can still occur in the sense that it [the... (read more)

1SCP
I disagree; see my reply to John above.
dxu112

If we live in an “alignment by default” universe, that means we can get away with being careless, in the sense of putting forth minimal effort to align our AGI, above and beyond the effort put in to get it to work at all.

This would be great if true! But unfortunately, I don’t see how we’re supposed to find out that it’s true, unless we decide to be careless right now, and find out afterwards that we got lucky. And in a world where we were that lucky—lucky enough to not need to deliberately try to get anything right, and get away with it—I mostly think misu... (read more)

dxu62

Can you say more about how a “frame” differs from a “model”, or a “hypothesis”?

(I understand the distinction between those three and “propositions”. It’s less clear to me how they differ from each other. And if they don’t differ, then I’m pretty sure you can just integrate over different “frames” in the usual way to produce a final probability/EV estimate on whatever proposition/decision you’re interested in. But I’m pretty sure you don’t need Garrabrant induction to do that, so I mostly think I don’t understand what you’re talking about.)

4Richard_Ngo
In the bayesian context, hypotheses are taken as mutually exclusive. Frames aren't mutually exclusive, so you can't have a probability distribution over them. For example, my physics frame and my biology frame could both be mostly true, but have small overlaps where they disagree (e.g. cases where the physics frame computes an answer using one approximation, and the biology frame computes an answer using another approximation). A tentative example: the physics frame endorses the "calories in = calories out" view on weight loss, whereas this may be a bad model of it under most practical circumstances. By contrast, the word "model" doesn't have connotations of being mutually exclusive, you can have many models of many different domains. The main difference between frames and models is that the central examples of models are purely empirical (i.e. they describe how the world works) rather than having normative content. Whereas the frames that many people find most compelling (e.g. environmentalism, rationalism, religion, etc) have a mixture of empirical and normative content, and in fact the normative content is compelling in part due to the empirical content. Very simple example: religions make empirical claims about god existing, and moral claims too, but when you no longer believe that god exists, you typically stop believing in their (religion-specific) moral claims.
dxu52

I’ll bite even further, and ask for the concept of “recurrence” itself to be dumbed down. What is “recurrence”, why is it important, and in what sense does e.g. a feedforward network hooked up to something like MCTS not qualify as relevantly “recurrent”?

1mishka
"Hooked up to something" might make a difference. (To me one important aspect is whether computation is fundamentally limited to a fixed number of steps vs. having a potentially unbounded loop. The autoregressive version is an interesting compromise: it's a fixed number of steps per token, but the answer can unfold in an unbounded fashion. An interesting tid-bit here is that for traditional RNNs it is one loop iteration per an input token, but in autoregressive Transformers it is one loop iteration per an output token.)
dxu2710

You have my (mostly abstract, fortunately/unfortunately) sympathies for what you went through, and I’m glad for you that you sound to be doing better than you were.

Having said that: my (rough) sense, from reading this post, is that you’ve got a bunch of “stuff” going on, some of it plausibly still unsorted, and that that stuff is mixed together in a way that I feel is unhelpful. For example, the things included at the beginning of the post as “necessary background” don’t feel to me entirely separate from what you later describe occurring; they mostly feel ... (read more)

5Raphaël
I had always thought that this impression of mine (that the rationalist memeplex is an attractor for people like that) was simply survivorship bias in people reporting their experience. This impression was quite reinforced by the mental health figures on the SSC surveys once the usual confounders were controlled for.
dxu203

I am pushing back because, if you are St. Petersburg Paradox-pilled like SBF and make public statements that actually you should keep taking double or nothing bets, perhaps you are more likely to make tragic betting decisions, and that's because you're taking certain ideas seriously. If you have galaxy brained the idea of the St. Petersburg Paradox, it seems like Alameda style fraud is +EV.

This is conceding a big part of your argument. You’re basically saying, yes, SBF’s decision was -EV according to any normal analysis, but according to a particular ... (read more)

If people inevitably sometimes make mistakes when interpreting theories, and theory-driven mistakes are more likely to be catastrophic than the mistakes people make when acting according to "atheoretical" learning from experience and imitation, then unusually theory-driven people are more likely to make catastrophic mistakes. In the absence of a way to prevent people from sometimes making mistakes when interpreting theories, this seems like a pretty strong argument in favor of atheoretical learning from experience and imitation!

This is particularly pertine... (read more)

3Noosphere89
Eh, I'm a little concerned in general, because this, without restrictions, could be used to redirect blame away from the theory, even in cases where the implementation of a theory is evidence against the theory. The best example is historical non-capitalist societies, especially communist ones, where communists responding to criticism roughly said that those societies weren't truly communist, and thus communism could still work if implemented truly. This is the clearest example of the phenomenon, but I'm sure there are others.
dxu351

RE: decision theory w.r.t. how "other powerful beings" might respond - I really do think Nate has already argued this, and his arguments continue to seem more compelling to me than the opposition's. Relevant quotes include:

It’s possible that the paperclipper that kills us will decide to scan human brains and save the scans, just in case it runs into an advanced alien civilization later that wants to trade some paperclips for the scans. And there may well be friendly aliens out there who would agree to this trade, and then give us a little pocket of th

... (read more)
dxu40

I concretely disagree with (what I see as) your implied premise that the outer (training) task has any direct influence on the inner optimizer's cognition. I feel like I've already tried to articulate this disagreement a number of times, and it has been largely ignored so far. As a result, many of the things you wrote seem to me to be answerable by largely the same objection:

As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/o

... (read more)
1simon
My other reply addressed what I thought was the core of our disagreement, but not the exact statements you made in your comment, so I'm addressing them here.

Let me be clear that I am NOT saying that any inner optimizer, if it exists, would have a goal that is equal to minimizing the outer loss. What I am saying is that it would have a goal that, in practice, when implemented in a single pass of the LLM, has the effect of minimizing the LLM's overall outer loss with respect to that ONE token. And that it would be very hard for such a goal to cash out, in practice, to wanting long-range real-world effects.

Let me also point out your implicit assumption that there is an 'inner' cognition which is not literally the mask. Here is another claim someone could make: "hey look, this datacenter full of GPUs is carrying out this agentic-looking cognition. And it could easily carry out other, completely different agentic cognition. Therefore, the datacenter must have these capabilities independently from the LLM and must have its own 'inner' cognition." I think that you are making the same philosophical error that this claim would be making.

However, if we didn't understand GPUs we could still imagine that the datacenter does have its own, independent 'inner' cognition, analogous to, as I noted in a previous comment, John Searle in his Chinese room. And if this were the case, it would be reasonable to expect that this inner cognition might only be 'acting' for instrumental reasons and could be waiting for an opportunity to jump out and suddenly do something other than running the LLM. The GPU software is not tightly optimized specifically to run the LLM or an ensemble of LLMs, and could indeed have other complications, and who knows what it could end up doing? Because the LLM does super duper complicated stuff instead of massively parallelized simple stuff, I think it's a bit more reasonable to expect there to be inte
1simon
OK, I think I'm now seeing what you're saying here (edit: see my other reply for additional perspective and for addressing particular statements made in your comment):

In order to predict well in complicated and diverse situations, the model must include general-purpose modelling machinery which generates an internal, temporary model. The next token can then be predicted, perhaps, by simply reading it off this internal model. The internal model is logically separate from any part of the network defined in terms of static trained weights, because it exists only in the form of data within the overall model at inference time and not in the static trained weights. You can then refer to this temporary internal model as the "mask", and the actual machinery that generated it, which may in fact be the entire network, as the "actor".

Now, on considering all of that, I am inclined to agree. This is an extremely plausible picture. Thank you for helping me look at it this way; this is a much cleaner definition of "mask" than I had before.

However, I think that you are then inferring from this an additional claim that I do not think follows: that because the network as a whole exhibits complicated capabilities and agentic behaviour, the network has these capabilities and behaviour independently from the temporary internal model. In fact, the network only has these externally apparent capabilities and agency through the temporary internal model (mask). While this "actor" is indeed not the same as any of the "masks", it doesn't know the answer "itself" to any of the questions. It needs to generate and "wear" the mask to do that.

----------------------------------------

This is not to deny that, in principle, the underlying temporary-model-generating machinery could be agentic in a way that is separate from the likely agency of that temporary internal model. This also is an update for me - I was not understanding that this is what y
dxu40

Full Solomonoff Induction on a hypercomputer absolutely does not just "learn very similar internal functions/models"; it effectively recreates actual human brains.

Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.

...yes? And this is obviously very, very different from how humans represent things internally?

I mean, for ... (read more)

2jacob_cannell
I think we are starting to talk past each other, so let me just summarize my position (and what I'm not arguing):

1.) ANNs and BNNs converge in their internal representations, in part because physics only permits a narrow pareto-efficient solution set, but also because ANNs are literally trained as distillations of BNNs. (More well known/accepted now, but I argued/predicted this well in advance - at least as early as 2015.)

2.) Because of 1.), there is no problem with 'alien thoughts' based on mindspace geometry. That was just never going to be a problem.

3.) Neither 1 nor 2 is sufficient for alignment by default - both points apply rather obviously to humans, who are clearly not aligned by default with other humans or humanity in general.

Earlier you said:

I then pointed out that full SI on a hypercomputer would result in recreating entire worlds with human minds, but that was a bit of a tangent. The more relevant point is more nuanced: AIXI is SI plus some reward function. So all different possible AIXI agents share the exact same world model, yet they have different reward functions, and thus would generate different plans and may well end up killing each other or something. So having exactly the same world model is not sufficient for alignment - I'm not and would never argue that.

But if you train an LLM to distill human thought sequences, those thought sequences can implicitly contain plans, value judgements, or their equivalents. Thus LLMs can naturally align to human values to varying degrees, merely through their training as distillations of human thought. This of course doesn't by itself guarantee alignment, but it is a much more hopeful situation to be in, because you can exert a great deal of control through control of the training data.
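A minimal sketch of the AIXI point above (my own illustration; the toy world model and reward functions are made up): two agents that share exactly the same world model still produce different plans because they score outcomes differently.

```python
# Minimal sketch: identical world model, different reward functions,
# different plans. Everything here is a made-up toy.

def world_model(state, action):
    """Shared, deterministic toy world model: predicts the next state."""
    return state + {"make_paperclips": ("paperclips",),
                    "preserve_humans": ("humans",)}[action]

def plan(reward_fn, state, actions):
    """Pick the action whose predicted outcome the agent values most."""
    return max(actions, key=lambda a: reward_fn(world_model(state, a)))

def reward_clippy(outcome):
    return outcome.count("paperclips")

def reward_human(outcome):
    return outcome.count("humans")

actions = ["make_paperclips", "preserve_humans"]
print(plan(reward_clippy, (), actions))  # -> "make_paperclips"
print(plan(reward_human, (), actions))   # -> "preserve_humans"
```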
dxu20

Yeah, I'm growing increasingly confident that we're talking about different things. I'm not referring to "masks" in the sense that you mean it.

I don't know what you mean by "one" or by "inner". I would expect different masks to behave differently, acting as if optimizing different things (though that could be narrowed using RLHF), but they could re-use components between them. So, you could have, for example, a single calculation system that is reused but takes as input a bunch of parameters that have different values for different masks, which (ag

... (read more)
1simon
In my case your response made me much more confident we do have an underlying disagreement and not merely a clash of definitions. I think the key disagreement is this:

As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/optimized to do is actually answer the questions. If the model in training has an optimizer, giving that optimizer a goal of being capable of answering questions wouldn't actually make it more capable, so that would not be reinforced. A goal of actually answering the questions, on the other hand, would make the optimizer more capable and so would be reinforced. Likewise, the heuristics/"adaptations" that coalesced to form the optimizer would have been oriented towards answering the questions. All this points to mask-level goals and does not provide a reason to believe in non-mask goals, and so a "goal slot" remains more parsimonious than an actor with a different underlying goal.

Regarding the evolutionary analogy: while I'd generally be skeptical about applying evolutionary analogies to LLMs, because they are very different, in this case I think it does apply, just not the way you think. I would analogize evolution -> training and human behaviour/goals -> the mask.

Note, it's entirely possible for a mask to be power-seeking, and we should presumably expect a mask that executes a takeover to be power-seeking. But this power-seeking would come as a mask goal and not as a hidden goal learned by the model for underlying general power-seeking reasons.
dxu1015

I want to revisit what Rob actually wrote:

If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-running whole-brain emulation", then hitting a button to execute the plan would kill all humans, with very high probability.

(emphasis mine)

That sounds a whole lot like it's invoking a simplicity prior to me!
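For what it's worth, here is a toy gloss (mine, not from the post) on why "weighted by length, in any extant formal language" amounts to a simplicity prior; the sampler below is hypothetical and simply assigns each plan probability mass proportional to 2^(-length):

```python
import random

# Toy gloss: "sampling a plan weighted by length" assigns each plan
# probability mass proportional to 2**(-length), i.e. a simplicity prior in
# which a plan 10 characters shorter is ~1000x more likely to be drawn.

def sample_length_weighted(plans, rng=None):
    rng = rng or random.Random(0)
    weights = [2.0 ** -len(p) for p in plans]
    total = sum(weights)
    r = rng.random() * total
    for plan, weight in zip(plans, weights):
        r -= weight
        if r <= 0:
            return plan
    return plans[-1]  # guard against floating-point underflow
```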

Note I didn't actually reply to that quote. Sure, that's an explicit simplicity prior. However, there's a large difference under the hood between using an explicit simplicity prior on plan length vs. an implicit simplicity prior on the world and action models which generate plans. The latter is what is more relevant for intrinsic similarity to human thought processes (or not).

dxu84

LLMs and human brains learn from basically the same data with similar training objectives powered by universal approximations of bayesian inference and thus learn very similar internal functions/models.

This argument proves too much. A Solomonoff inductor (AIXI) running on a hypercomputer would also "learn from basically the same data" (sensory data produced by the physical universe) with "similar training objectives" (predict the next bit of sensory information) using "universal approximations of Bayesian inference" (a perfect approximation, in this cas... (read more)

Full Solomonoff Induction on a hypercomputer absolutely does not just "learn very similar internal functions/models"; it effectively recreates actual human brains.

Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.

you need to first investigate the actual internal representations of the systems in question, and verify that

... (read more)
dxu20

E.g. a system capable of correctly answering questions like "given such-and-such chess position, what is the best move for the current player?" must in fact be performing agentic/search-like computation internally, since there is no other way to correctly answer this question.

Yes, but that sort of question is in my view answered by the "mask", not by something outside the mask.

I don't think this parses for me. The computation performed to answer the question occurs inside the LLM, yes? Whether you classify said computation as coming from "the mask" or no... (read more)
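A gloss on what "search-like" computation means here (my own sketch over a hypothetical game interface; `legal_moves`, `apply`, and `score` are placeholders, and nothing below is a claim about how any particular model implements this internally):

```python
# Sketch of the kind of search that "what is the best move?" generically
# requires, whether it runs as explicit code or implicitly inside a network's
# forward passes. The game-interface functions are hypothetical.

def best_move(position, legal_moves, apply, score, depth=3):
    def negamax(pos, d):
        moves = legal_moves(pos)
        if d == 0 or not moves:
            return score(pos)          # value from the perspective of the player to move
        return max(-negamax(apply(pos, m), d - 1) for m in moves)

    return max(legal_moves(position),
               key=lambda m: -negamax(apply(position, m), depth - 1))
```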

1simon
Yes.

I don't know what you mean by "one" or by "inner". I would expect different masks to behave differently, acting as if optimizing different things (though that could be narrowed using RLHF), but they could re-use components between them. So, you could have, for example, a single calculation system that is reused but takes as input a bunch of parameters that have different values for different masks, which (again, just an example) define the goals, knowledge and capabilities of the mask. I would not consider this case to be "one" inner optimizer, since although most of the machinery is reused, it in practice acts differently and seeks different goals in each case, and I'm more concerned here with classifying things according to how they act/what their effective goals are than with the internal implementation details. What this multi-optimizer (which I would not call "inner") is going to "end up" wanting is whatever set of goals belong to the particular mask that first has both the desire and the capability to take over in some way. It's not going to be some mysterious inner thing.

They aren't? In your example, the mask wanted to play chess, didn't it, and what you call the "inner" optimizer returned a good move, didn't it? I can see two things you might mean about the mask not actually being in control:

1. That there is some underlying goal this optimizer has that is different from satisfying the current mask's goal, and it is only satisfying the mask's goal instrumentally. This I think is very unlikely, for the reasons I put in the original post. It's extra machinery that isn't returning any value in training.

2. That this optimizer might at some times change goals (e.g. when the mask changes). It might well be the case that the same optimizing machinery is utilized by different masks, so the goals change as the mask does. But again, if at each time it is optimizing a goal set by/according to the mask, it's better in my view to see it as part of/controlled by th