clearly the system is a lot less contextual than base models, and it seems like you are predicting a reversal of that trend?
The trend may be bounded, or it may not have gone far by the time AI can invent nanotechnology - it would be great if someone actually measured such things.
And the existence of a trend at all is not predicted by the utility-maximization frame, right?
It is learning helpfulness now, while the best way to hit the specified ‘helpful’ target is to do straightforward things in straightforward ways that directly get you to that target. Those kinds of shenanigans, or other more complex strategies, won’t work.
Best by what metric? And I don't think it was shown that complex strategies won't work - learning to change behaviour from training to deployment is not even that complex.
What makes it rational is that there is an actual underlying hypothesis about how weather works, instead of a vague "LLMs are a lot like human uploads". And weather prediction outputs numbers connected to reality that we actually care about. And there is no alternative credible hypothesis that implies weather prediction won't work.
I don't want to totally dismiss empirical extrapolations, but given the stakes, I would personally prefer for all sides to actually state their model of reality and how they think the evidence changed its plausibility, as formally as possible.
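To be concrete about what "as formally as possible" means here, even the basic Bayesian bookkeeping would do (this is just my phrasing of the ask, nothing deeper):

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

i.e. name the hypothesis H, say what likelihood P(E | H) your model assigns to the observed evidence E, and show the resulting update, instead of gesturing at the evidence informally.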
Yes, except I would object to phrasing this anthropic stuff as "we should expect ourselves to be agents that exist in a universe that abstracts well" instead of "we should value a universe that abstracts well (or other universes that contain many instances of us)" - there are no coherence theorems that force summation over your copies, right? And so it becomes apparent that we can value something else.
Also, even if you consider some memories part of your identity, you can value yourself slightly less after forgetting them, instead of only having a threshold at death.
It doesn't matter whether you call your multiplier "probability" or "value" if it results in your decision not to care about a low-measure branch. The only difference is that probability is supposed to be about knowledge, and the fact that Wallace's argument involves an arbitrary assumption, not only physics, means it's not probability but value - there is no reason to value knowledge of your low-measure instances less.
this makes decision theory and probably consequentialist ethics impossible in your framework
It doesn't? Nothing stops you from making decisions in a wor...
Things like lions and chairs are other examples.
And counted branches.
This is how Wallace defines it (he in turn defines macroscopically indistinguishable in terms of providing the same rewards). It’s his term in the axiomatic system he uses to get decision theory to work. There’s not much to argue about here?
His definition leads to a contradiction with the informal intuition that motivates considering macroscopic indistinguishability in the first place.
...We should care about low-measure instances in proportion to the measure, just as in classical
Because scale doesn't matter - it doesn't matter whether you are implemented on a thick or a narrow computer.
First of all, macroscopic indistinguishability is not a fundamental physical property - branching indifference is an additional assumption, so I don't see how it's any less arbitrary than branch counting.
But more importantly, the branching indifference assumption is not the same as the informal "not caring about macroscopically indistinguishable differences"! As Wallace showed, branching indifference implies the Born rule, which implies you almost shouldn't care about you in a br...
But why would you want to remove this arbitrariness? Your preferences are fine-grained anyway, so why retain classical counting but deny counting in the space of the wavefunction? It's like saying "dividing the world into people and their welfare is arbitrary - let's focus on measuring the mass of a region of space". The point is that you can't remove all decision-theoretic arbitrariness from MWI - "branching indifference" is just an arbitrary ethical constraint that is equivalent to valuing measure for no reason, and without it, fundamental physics that works like MWI does not prevent you from making decisions as if quantum immortality works.
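For concreteness, here is the contrast in my own notation (an act a leads to branches b_i(a) with amplitudes α_i(a), and to N(a) branches under some chosen fine-graining):

$$V_{\text{Born}}(a) = \sum_i |\alpha_i(a)|^2\, U(b_i(a)) \qquad \text{vs.} \qquad V_{\text{count}}(a) = \frac{1}{N(a)} \sum_{i=1}^{N(a)} U(b_i(a))$$

Branching indifference is what privileges the left-hand multiplier over the right-hand one, and that choice is exactly the extra ethical input I'm calling arbitrary.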
...“Decoherence causes the Universe to develop an emergent branching structure. The existence of this branching is a robust (albeit emergent) feature of reality; so is the mod-squared amplitude for any macroscopically described history. But there is no non-arbitrary decomposition of macroscopically-described histories into ‘finest-grained’ histories, and no non-arbitrary way of counting those histories.”
Importantly though, on this approach it is still possible to quantify the combined weight (mod-squared amplitude) of all branches that share a certain mac
Even if we can’t currently prove certain axioms, doesn’t this just reflect our epistemological limitations rather than implying all axioms are equally “true”?
It doesn't, and they are fundamentally equal. The only reality is the physical one - there is no reason to complicate your ontology with platonically existing math. Math is just a collection of useful templates that may help you predict reality, and the fact that it works is always just a physical fact. The best case is that we'll know the true laws of physics and they will work like some subset of math, and then axio...
It sure doesn't seem to generalize in the GPT-4o case. But what's the hypothesis for Sonnet 3.5 refusing in 85% of cases? And CoT improving the score, and o1 doing better in the browser, suggest the problem is the models not understanding consequences, not them not trying to be good. What's the rate of capability generalization to the agent environment? Are we going to conclude that Sonnet just demonstrates reasoning, instead of doing it for real, if it solves only 85% of the tasks it correctly talks about?
Also, what's the generalization rate for avoiding unprompted problematic behaviour? It's much less of a problem if your AI does what you tell it to do - you can just not give it to users, tell it to invent nanotechnology, and win.
GPT-4 is insufficiently capable, even if it were given an agent structure, memory and goal set to match, to pull off a treacherous turn. The whole point of the treacherous turn argument is that the AI will wait until it can win to turn against you, and until then play along.
I don't get why actual ability matters. It's sufficiently capable to pull it off in some simulated environments. Are you claiming that we can't deceive GPT-4, and that it is actually waiting and playing along just because it can't really win?
Here is a way in which it doesn't generalize in observed behavior:
TLDR: There are three new papers which all show the same finding, i.e. the safety guardrails don’t transfer well from chat models to the agents built from them. In other words, models won’t tell you how to do something harmful, but they will do it if given the tools. Attack methods like jailbreaks or refusal-vector ablation do transfer.
Here are the three papers; I am the author of one of them:
Not at all. The problem is that their observations would mostly not be in a classical basis.
I phrased it badly, but what I mean is that there is a simulation of Hilbert space where some regions contain patterns that can be interpreted as observers observing something, and if you count them by similarity, you won't get counts consistent with the Born measure of those patterns. I don't think the basis matters in this model, if you change the basis for the observer, the observations, and the similarity threshold simultaneously? A change of basis would just rotate or scale patterns,...
https://mason.gmu.edu/~rhanson/mangledworlds.html
I mean that if a Turing machine is computing the universe according to the laws of quantum mechanics, observers in such a universe would be distributed uniformly, not by Born probability. So you either need some modification of current physics, such as mangled worlds, or you can postulate that Born probabilities are truly random.
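Here is a toy sketch of what I mean by "distributed uniformly, not by Born probability" (my own construction, not anything from the mangled-worlds page): split a qubit n times with Born weights 0.9/0.1 and compare uniform branch counting against the Born measure over the same branches.

```python
# A toy comparison: split a qubit n times with Born weights p / (1-p), then
# compare how much of the "observer population" sits at each heads-count k
# under (a) uniform branch counting and (b) the Born measure.
from math import comb

n = 20    # number of branching events
p = 0.9   # Born weight of the "heads" outcome at each split

for k in range(n + 1):
    branch_fraction = comb(n, k) / 2**n                  # count branches uniformly
    born_weight = comb(n, k) * p**k * (1 - p)**(n - k)   # weight branches by measure
    if max(branch_fraction, born_weight) > 0.05:
        print(f"k={k:2d}  branch_fraction={branch_fraction:.3f}  born_weight={born_weight:.3f}")

# Uniform counting concentrates near k = n/2; the Born measure concentrates
# near k = p*n. A Turing machine that just enumerates branches gives you the
# former, which is why something extra (mangled worlds, genuine randomness,
# or valuing measure) is needed to recover Born statistics.
```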
Imagining two apples is a different thought from imagining one apple, right?
I mean, is it? Different states of the whole cortex are different. And the cortex can't be in a state of imagining only one apple and, simultaneously, in a state of imagining two apples, obviously. But that's tautological. What are we gaining from thinking about it in these terms? You can say the same thing about the whole brain itself, that it can only have one brain-state at a time.
I guess there is a sense in which other parts of the brain have more varied thoughts relativ...
I still don't get this "only one thing in awareness" thing. There are multiple neurons in the cortex and I can imagine two apples - in what sense can there be only one thing in awareness?
Or equivalently, it corresponds equally well to two different questions about the territory, with two different answers, and there’s just no fact of the matter about which is the real answer.
Obviously the real answer is the model that is more veridical^^. The latter hindsight model is right not about the state of the world at t=0.1, but about what you later thought about the world at t=0.1.
If that’s your hope—then you should already be alarmed at trends
It would be nice for someone to quantify the trends. Otherwise it may just as well be that the trends point to easygoing-enough and aligned-enough future systems.
For some humans, the answer will be yes—they really would do zero things!
Nah, it's impossible for evolution to just randomly stumble upon such a complicated and unnatural mind-design. What are you going to say next, that some people are fine with being controlled?
...Where an entity has never had the option to do a thing, we may not validly in
RLHF does not solve the alignment problem because humans can’t provide good enough feedback fast enough.
Yeah, but the point is that the system learns values before an unrestricted AI vs AI conflict.
...As mentioned in the beginning, I think the intuition goes that neural networks have a personality trait which we call “alignment”, caused by the correspondence between their values and our values. But “their values” only really makes sense after an unrestricted AI vs AI conflict, since without such conflicts, AIs are just gonna propagate energy to whichever
But also, if a completion model predicts text where a very weak hash is followed by its pre-image, it will probably have learned to undo the hash, even though the source generation process never performed that operation (which is potentially much more complicated than the hashing function itself), which means it’s not really a simulator.
I'm saying that this won't work with current systems, at least for a strong hash, because it's hard, and instead of learning to undo it, the model will learn to simulate, because that's easier. And then you can vary the strength of the hash to...
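A minimal sketch of the kind of completion data being discussed; weak_hash and strong_hash here are illustrative stand-ins of my own, not anything from the thread.

```python
# Build "hash -> pre-image" completion strings with hashes of varying strength.
import hashlib
import random

def weak_hash(s):
    # Deliberately weak: keep only the first two hex chars of SHA-256,
    # so inversion (up to collisions) is plausibly learnable.
    return hashlib.sha256(s.encode()).hexdigest()[:2]

def strong_hash(s):
    # Full SHA-256: the case where I expect the model to fall back to
    # simulating the source process instead of learning to undo the hash.
    return hashlib.sha256(s.encode()).hexdigest()

def make_examples(n, hash_fn):
    # Completion-model training strings: hash first, pre-image second, so
    # predicting the continuation requires (an approximation of) inversion.
    examples = []
    for _ in range(n):
        preimage = "".join(random.choices("abcdefgh", k=8))
        examples.append(f"{hash_fn(preimage)} -> {preimage}")
    return examples

print(make_examples(3, weak_hash))
# Sweeping hash_fn from weak_hash toward strong_hash is the "vary the
# strength of the hash" knob mentioned above.
```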
And I don’t think we’ve observed any evidence of that.
What about any time a system generalizes favourably, instead of predicting errors? You can say it's just a failure of prediction, but it's not like these failures are random.
That is the central safety property we currently rely on, and it pushes things to be a bit more simulator-like.
And what is the evidence that this property, rather than, for example, the inherent bias of NNs, is the central one? Why wouldn't a predictor exhibit more malign goal-directedness even for short-term goals?
I can see that this who...
In order to be “UP-like” in a relevant way, this procedure will have to involve running TMs, and the set of TMs that might be run needs to include the same TM that implements our beings and their world.
Why? The procedure just needs to do some reasoning, constrained by the UP and the outer TM. And then the UP-beings can just simulate this fast reasoning without the problems of self-simulation.
Yes, an AI that practically uses the UP may fail to predict whether the UP-beings simulate it in the center of their universe or on the boundary. But the point is that the more correct the AI is in its reasoning, the more control the UP-beings have.
Or you can just not create an AI that thinks about the UP. But that's denying the assumption.
Everyone but Elon himself would say the above is a different scenario from reality. Each of us knows which body our first-person perspective resides in. And that is clearly not the physical human being referred to as Elon Musk. But the actual and imaginary scenarios are not differentiated by any physical difference in the world, as the universe is objectively identical.
They are either differentiated by a physically different location of some part of your experience - like your memory being connected to Elon's sensations, or your thought being executed in o...
For (1) the multiverse needs to be immensely larger than our universe, by a factor of at least 10^(10^6) or so “instances”. The exact double exponent depends upon how closely people have to match before it’s reasonable to consider them to be essentially the same person. Perhaps on the order of millions of data points is enough, maybe more are needed. Evidence for MWI is nowhere near strong enough to justify this level of granularity in the state space, and it doesn’t generalize well to space-time quantization, so this probably isn’t enough.
Why? Even without u...
There is non-zero measure on a branch that starts with you terminally ill and gradually proceeds to you miraculously recovering. So if you consider the normally-recovered you to be you, nothing stops you from considering this low-measure you to also be you.
I have never heard of anyone going to sleep as one of a pair of twins and waking up as the other.
According to MWI everyone wakes up as multiple selves all the time.
conscious in the way that we are conscious
Whether it's the same way is an ethical question, so you can decide however you want.
So there should be some sort of hardware-dependence to obtain subjective experience.
I certainly don't believe in subjective experience without any hardware, but no, there is not much dependence beyond your preferences about hardware.
As for generally accepted conclusions... I think it's generally accepted that some preferences about hardware are useful in epistemic contexts, so you can be persuaded to say "a rock is not conscious" for the same reason you say "a rock is not a calculator".
Given a low prior probability of doom as apparent from the empirical track record of technological progress, I think we should generally be skeptical of purely theoretical arguments for doom, especially if they are vague and make no novel, verifiable predictions prior to doom.
And why is such use of the empirical track record valid? Like, what's the actual hypothesis here? What law of nature says "if technological progress hasn't caused doom yet, it won't cause doom tomorrow"?
...MIRI’s arguments for doom are often difficult to pin down, given the informal n
There is a weaker and maybe shorter version by Chalmers: https://consc.net/papers/panpsychism.pdf. The short version is that there is no way for you to non-accidentally know about the quantization state of your brain without that quantization being part of an easy problem: pretty much by definition, if you can just physically measure it, it's easy and not mysterious.
Panpsychism is correct about the genuineness and subjectivity of experiences, but you can quantize your caring about the other differences between the experiences of a human and a zygote however you want.
If we live in naive MWI, an IBP agent would not care for good reasons, because naive MWI is a “library of babel” where essentially every conceivable thing happens no matter what you do.
Doesn't the frequency of amplitude-patterns change depending on what you do? So an agent can care about that instead of point-states.
In the case of teleportation, I think teleportation-phobic people are mostly making an implicit error of the form “mistakenly modeling situations as though you are a Cartesian Ghost who is observing experiences from outside the universe”, not making a mistake about what their preferences are per se.
Why not both? I can imagine that someone would be persuaded to accept teleportation/uploading if they stopped believing in a physical Cartesian Ghost. But it's possible that if you remind them that continuity of experience, like a table, is just a description of ...
Analogy: When you’re writing in your personal diary, you’re free to define “table” however you want. But in ordinary English-language discourse, if you call all penguins “tables” you’ll just be wrong. And this fact isn’t changed at all by the fact that “table” lacks a perfectly formal physics-level definition.
You're also free to define "I" however you want in your values. You're only wrong if your definitions imply a wrong physical reality. But defining "I" and "experiences" in such a way that you will not experience anything after teleportation is possib...
...If we were just talking about word definitions and nothing else, then sure, define “self” however you want. You have the universe’s permission to define yourself into dying as often or as rarely as you’d like, if word definitions alone are what concerns you.
But this post hasn’t been talking about word definitions. It’s been talking about substantive predictive questions like “What’s the very next thing I’m going to see? The other side of the teleporter? Or nothing at all?”
There should be an actual answer to this, at least to the same degree there’s an ans
In particular, a many-worlder has to discard unobserved results in the same way as a Copenhagenist—it’s just that they interpret doing so as the unobserved results existing in another branch, rather than being snipped off by collapse.
A many-worlder doesn't have to discard unobserved results - you may care about other branches.
The wrong part is mostly in https://arxiv.org/pdf/1405.7577.pdf, but: indexical probabilities of being a copy are value-laden - it seems like the derivation first assumes that branching happens globally and then assumes that you are forbidden to count the different instantiations of yourself that were created by this global process.
Not necessarily - you can treat creating new people differently from people who already exist, and avoid creating bad lives (in the Endurist sense - not enough positive experiences, regardless of suffering) without accepting death for existing people. I, for example, don't get why you would bring more death into the world by creating short-lived people, if you don't like death.