"Thread of subjective experience" was an aside (just one of the mechanisms that explains why we "find ourselves" in a world that behaves according to the Born rule), don't focus too much on it.
The core question is which physical mechanism (everything should be physical, right?) ensures that you will almost never see a string of a billion tails after a billion quantum coin flips, while the universe contains a quantum branch with you looking in astonishment at a string of a billion tails. Why should you expect that it will almost certainly not happen, when...
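To put a number on how extreme that branch is, here is a small sketch (the billion-flip figure is from the comment above; everything else is just arithmetic):

```python
# Born weight of the all-tails branch after n fair quantum flips is 2**-n.
# For n = one billion, even the log10 of that weight is astronomically negative.
import math

n = 1_000_000_000          # a billion flips
log2_weight = -n           # log2 of the branch's Born weight
log10_weight = log2_weight * math.log10(2)
print(f"log10 of branch weight: {log10_weight:.4e}")  # about -3.01e8
```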
I haven't fully understood your stance towards the many minds interpretation. Do you find it unnecessary?
I don’t think either of these Harrys is “preferred”.
And simultaneously you think that the existence of future Harries who observe events with probabilities approaching zero is not a problem, because the current Harry will almost never find himself to be one of those future Harries. I don't understand what that means exactly.
Harries who observe those rare events exist, and they wonder how they found themselves in those unlikely situations. Harries who hadn't found an...
For example: “as quantum amplitude of a piece of the wavefunction goes to zero, the probability that I will ‘find myself’ in that piece also goes to zero”
What I really don't like about this formulation is the extreme vagueness of "I will find myself", which implies that there's some preferred future "I" out of many, defined not only by the observations he receives, but also by being a preferred continuation of subjective experience, as determined by an unknown mechanism.
It can be formalized as the many minds interpretation, incurring additional complexity penalty a...
First, a factual statement that is true to the best of my knowledge: the LLM state that is used to produce the probability distribution for the next token is completely determined by the state of its input buffer (plus a bit of nondeterminism due to parallel processing and the non-associativity of floating-point arithmetic).
That is, an LLM can pass only a single token (around 2 bytes) to its future self. This follows from the above.
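A toy sketch of that factual claim: in autoregressive generation the next-token distribution is a pure function of the input buffer, so the only thing a model "passes to its future self" is the token it just emitted. `next_token_distribution` here is a hypothetical stand-in for a real forward pass:

```python
import random

VOCAB = ["a", "b", "c"]

def next_token_distribution(buffer):
    # Deterministic in the buffer: no hidden state survives between calls.
    h = sum(ord(ch) for tok in buffer for ch in tok)
    weights = [(h + i) % 5 + 1 for i in range(len(VOCAB))]
    total = sum(weights)
    return [w / total for w in weights]

def generate(prompt, n_steps, seed=0):
    rng = random.Random(seed)
    buffer = list(prompt)
    for _ in range(n_steps):
        probs = next_token_distribution(buffer)
        buffer.append(rng.choices(VOCAB, probs)[0])  # ~2 bytes passed forward
    return buffer

print(generate(["a"], 5))
```

Rerunning `generate` with the same prompt and seed reproduces the sequence exactly, because nothing outside the buffer carries over.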
What comes next is a plausible (to me) speculation.
For humans, what's passed to our future self is most likely much more than a single token. ...
Expanding a bit on the topic.
Exhibit A: flip a fair coin and move a suspended robot into a green or red room using a second coin with probabilities (99%, 1%) for heads, and (1%, 99%) for tails.
Exhibit B: flip a fair coin and create 99 copies of the robot in green rooms and 1 copy in a red room for heads, and reverse colors otherwise.
What causes the robot to see red instead of green in exhibit A? Physical processes that brought about a world where the robot sees red.
What causes a robot to see red instead of green in exhibit B? The fact that it sees red, not...
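A quick sanity check on the two exhibits: the marginal frequency of "sees red" comes out the same (50%) in both setups, even though the causal stories differ. This is just a simulation under the numbers given above:

```python
import random

rng = random.Random(1)

def exhibit_a():
    # Fair first coin, then a (99%, 1%) / (1%, 99%) second coin for the room.
    heads = rng.random() < 0.5
    p_green = 0.99 if heads else 0.01
    return "green" if rng.random() < p_green else "red"

def exhibit_b():
    # Fair coin, then 99 copies in green rooms and 1 in red (or reversed).
    heads = rng.random() < 0.5
    n_green = 99 if heads else 1
    return ["green"] * n_green + ["red"] * (100 - n_green)

trials = 20_000
red_a = sum(exhibit_a() == "red" for _ in range(trials)) / trials
rooms = [room for _ in range(trials) for room in exhibit_b()]
red_b = rooms.count("red") / len(rooms)
print(red_a, red_b)  # both close to 0.5
```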
I have a solution that is completely underwhelming, but I can see no flaws in it, besides the complete lack of a definition of which part of the mental state must be preserved to still count as you, and its rejection of MWI (also, I can see no useful insights into why we have what looks like continuous subjective experience).
Do you think the exploited flaw is universal or, at least, common?
Excellent story. But what about the "pull the plug" option? Did ALICE find a way to run itself efficiently on traditional datacenters that aren't packed with backprop and inference accelerators? And would shutting them down have required stronger political will than the government could muster at the time?
Citing https://arxiv.org/abs/cond-mat/9403051: "Furthermore if a quantum system does possess this property (whatever it may be), then we might hope that the inherent uncertainties in quantum mechanics lead to a thermal distribution for the momentum of a single atom, even if we always start with exactly the same initial state, and make the measurement at exactly the same time."
Then the author proceeds to demonstrate that this is indeed the case. I guess it partially answers the question: the quantum state thermalizes and you'll get a classical thermal distribution o...
we are assuming that without random perturbation, you would get 100% accuracy
That is, the question is not about real argon gas, but about a billiard-ball model? That should be stated in the question.
There are creatures in the possible mind space[3] whose intuition works in the opposite way. They are surprised specifically by the sequence of and do not mind the sequence of
That is, creatures who aren't surprised by outcomes of lower Kolmogorov complexity, or who aren't surprised by the fact that the language they use for estimating Kolmogorov complexity has a special compact case for producing "HHTHTTHTTH".
Looks possible, but not probable.
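A rough illustration of the intuition about complexity, using compressed size as a crude proxy for Kolmogorov complexity (with strings much longer than ten flips, since zlib's header overhead swamps the signal at that length):

```python
import random
import zlib

random.seed(0)
uniform = "H" * 1000                                          # all heads
typical = "".join(random.choice("HT") for _ in range(1000))   # fair flips

print(len(zlib.compress(uniform.encode())))  # tiny: highly regular
print(len(zlib.compress(typical.encode())))  # much larger: nearly incompressible
```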
For returns below $2000, I'd use the 50/50 quantum-random strategy just for the fun of dropping Omega's stats.
what happens if we automatically evaluate plans generated by superhuman AIs using current LLMs and then launch plans that our current LLMs look at and say, "this looks good".
The obvious failure mode is that the LLM is not powerful enough to predict the consequences of the plan. The obvious fix is to include a human-relevant description of the consequences. The obvious failure modes: a manipulated description of the consequences, optimizing for LLM jail-breaking. The obvious fix: ...
I won't continue: shallow rebuttals are not that convincing, while deep ones come close to capability research, so I don't expect to find interesting answers.
What if all I can assign is a probability distribution over probabilities? As in the extraterrestrial-life question: all that can be said is that extraterrestrial life is sufficiently rare that we haven't found evidence of it yet. Our observation of our own existence is conditioned on our existence, so it doesn't provide much evidence one way or the other.
Should I sample the distribution to give an answer, or take its mode, or mean, or median? I've chosen a value that is far from both extremes, but I might have done something else, with no clear justification for any of the choices.
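The ambiguity can be made concrete. Given a distribution over probabilities (a Beta(2, 5) here, purely as a hypothetical choice), each summary gives a different "answer":

```python
import random

random.seed(0)
a, b = 2, 5
mean = a / (a + b)                 # 2/7 ~ 0.286
mode = (a - 1) / (a + b - 2)       # 0.2
sample = random.betavariate(a, b)  # a random draw, different each seed
# Median via Monte Carlo, since Beta medians have no simple closed form.
draws = sorted(random.betavariate(a, b) for _ in range(100_001))
median = draws[len(draws) // 2]    # ~ 0.264

print(f"mean={mean:.3f} mode={mode:.3f} median={median:.3f} sample={sample:.3f}")
```

All four are defensible, and none is obviously the right thing to report.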
This means that LLMs can inadvertently learn to replicate these biases in their outputs.
Or the network learns to put more trust in tokens that were already "thought about" during generation.
Suppose when you are about to die [...] Omega shows up
Suppose something pertaining more to the real world: if you think that you are here and now because there will not be significantly more people in the future, then you are more likely to become depressed.
Also, why does Omega use 95% and not 50%, 10%, or 0.000001%?
ETA: Ah, Omega in this case is an embodiment of the litany of Tarski. Still, if there is no catastrophe, we are among the 5% who violate the litany. And that's not even mentioning that the litany comes as close to useless as it can get when we are talking about a belief in an inevitable catastrophe you can do nothing about.
After all, in the AI situation for which the exercise is a metaphor, we don’t know exactly when something might foom; we want elbow room.
Or you can pretend that you are impersonating an AI that is preparing to go foom.
conduct a hostage exchange by meeting in a neutral country, and bring lots of guns and other hostages they intend not to exchange that day
That is, they alter the payoff matrix instead of trying to achieve CC in the prisoner's dilemma. And that may be more efficient than spending time and energy on proofs, source-code verification protocols, and the yet-unknown downsides of being an agent that others can robustly CC with, while being the same kind of agent.
the simpler the utility function the easier time it has guaranteeing the alignment of the improved version
If we are talking about a theoretical AI, where E[U|a] (the expectation of utility given the action a) somehow points to the external world, then sure. If we are talking about a real AI aspiring to become the physical embodiment of the aforementioned theoretical concept (with the said aspiration somehow encoded outside of U, because U is simple), then things get more hairy.
You said it yourself, GPT ""wants"" to predict the correct probability distribution of the next token
No, I said that GPT does predict the next token, while probably not containing anything that can be interpreted as "I want to predict the next token". Just as a bacterium does divide (with possible adaptive mutations), while not containing "be fruitful and multiply" written anywhere inside.
If you instead meant that GPT is "just an algorithm"
No, I certainly didn't mean that. If the extended Church-Turing thesis holds for the macroscopic behavior of our bodies, we c...
I really don't expect "goals" to be explicitly written down in the network. There will very likely not be a thing that says "I want to predict the next token" or "I want to make paperclips" or even a utility function of that. My mental image of goals is that they are put "on top" of the model/mind/agent/person. Whatever they seem to pursue, independently of their explicit reasoning.
I'm sure that I don't understand you. GPT most likely doesn't have "I want to predict the next token" written anywhere, because it doesn't want to predict the next token. There's nothi...
Solving interpretability with an AGI (even with humans-in-the-loop) might not lead to particularly great insights on a general alignment theory or even on how to specifically align a particular AGI
Wouldn't it at least solve corrigibility, by making it possible to detect the formation of undesirable end-goals? I think even GPT-4 can classify a textual interpretation of an end-goal on the basis of its general desirability for humans.
It seems to need another assumption, namely that the AGI has sufficient control of its internal state, and knowledge of the detection netw...
I have low confidence in this, but I guess it (OOD generalization by "liquid" networks) works well in differentiable continuous domains (like low-level motion planning) by exploiting the natural smoothness of the system. So I wouldn't get my hopes up about its universal applicability.
If you have a next-frame video predictor, you can't ask it how a human would feel. You can't ask it anything at all - except "what might be the next frame of thus-and-such video?". Right?
Not exactly. You can extract embeddings from a video predictor (activations of the next-to-last layer may do, or you can use techniques that enhance the semantic information captured in the embeddings), and then use supervised learning to train a simple classifier from an embedding to human feelings on a modest number of video/feelings pairs.
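A minimal sketch of that probe idea. Everything here is hypothetical: `embed` stands in for the penultimate-layer activations of a real video predictor, and the labels are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 64

def embed(video):
    # Stand-in for "activations of the next-to-last layer" of the predictor.
    return rng.normal(size=EMB_DIM)

# A modest labeled set of (video, feeling) pairs.
X = np.stack([embed(v) for v in range(200)])
y = rng.integers(0, 3, size=200)   # e.g. 0=calm, 1=happy, 2=afraid

# Simple linear probe trained by least squares on one-hot targets.
Y = np.eye(3)[y]
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
preds = (X @ W).argmax(axis=1)
print("train accuracy:", (preds == y).mean())
```

With real embeddings the point is that the probe is tiny and cheap to train, so the "you can't ask it anything" objection reduces to a small supervised add-on.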
the issue I still see is - how do you recognize an ai executive that is trying to disguise itself?
It can't disguise itself without researching disguising methods first. The question is whether interpretability tools will be up to the task of catching it.
This will not work for catching an AI executive originating outside the controlled environment (unless it queries the AI scientist). But given that such attempts will come from uncoordinated, relatively computationally underpowered sources, it may be possible to preemptively enumerate the disguising techniques such an AI executive could come up with. If there are undetectable varieties..., well, it's mostly game over.
Thanks. Could we be sure that a bare utility maximizer won't modify itself into a mugging-proof version? I think we can: such a modification drastically decreases expected utility.
It's a bit of a relief that a sizeable portion of possible intelligences can be stopped by playing god to them.
Are there ways to make a utility maximizer impervious to Pascal's mugging?
Humans were created by evolution, but [...]
We know that evolution has no preferences (evolution is not an agent), so we generally don't frame our preferences as an approximation of evolution's. People who believe that they were created with some goal in the mind of their creator do engage in reasoning about what they were truly meant to do.
See also, in the OP: "Problem of Fully Updated Deference: Normative uncertainty doesn't address the core obstacles to corrigibility."
The provided link assumes that any preference can be expressed as a utility function over world-states. If you don't assume that (and you shouldn't, as human preferences can't be expressed that way), you cannot maximize a weighted average of potential utility functions. Some actions are preference-wise irreversible. Take virtue ethics, for example: wiping out your memory doesn't restore your status as a virtuous person, even if the...
As a strong default, STEM-level AGIs will have "goals"—or will at least look from the outside like they do. By this I mean that they'll select outputs that competently steer the world toward particular states.
Clarification: when talking about world-states I mean the world-state minus the state of the agent (we are interested in the external actions of the agent).
For starters, you can have goal-directed behavior without steering the world toward particular states. Novelty seeking, for example, doesn't imply any particular world-state to achieve.
And I think that...
In a realistic setting, agents will be highly incentivized to seek other forms of punishment besides turning the dial. But it's a nice toy hell.
Thanks for clearing my confusion. I've grown rusty on the topic of AIXI.
So going forwards from simple theories and seeing how they bridge to your effective model would probably do the trick
Assuming there's not much fine-tuning to do. Locating our world in the string theory landscape could take quite a few bits, if it's computationally feasible at all.
And remember, we're talking about an ASI here
It hinges on the assumption that an ASI of this type is physically realizable. I can't find it now, but I remember that the preprocessing step, where heuristic generation ...
it seems plausible that you could have GR + QFT and a megabyte of bridging laws plus some other data to specify local conditions and so on.
How can a computationally bound variant of AIXI arrive at QFT? You most likely can't faithfully simulate a non-trivial quantum system on a classical computer within reasonable time limits. Such an AIXI is bound to find some computationally feasible approximation of QFT first (Maxwell's equations with a cutoff at some arbitrary energy to prevent the ultraviolet catastrophe, maybe). And with no access to experiments, it cannot test simpler systems.
I mean, are there reasons to assume that computable AIXI (or its variants) can be realized as a physically feasible device? I can't find papers indicating significant progress toward feasible AIXI approximations.
Assume it has disgusting amounts of compute
Isn't that the same as "assume it can do argmax as fast as needed for this scenario"?
Of all the peoples' lives that exist and have existed, what are the chances I'm living [...here and now]
Is there a more charitable interpretation of this line of thinking than "My soul selected this particular body out of all the available ones"?
You being you as you are is a product of your body developing in circumstances it happened to develop in.
"Hard problem of corrigibility" refers to Problem of fully updated deference - Arbital, which uses a simplification (that human preferences can be described as a utility function) that can be inappropriate for the problem. Human preferences are obviously path-dependent (you don't want to be painfully disassembled and then reconstituted as a perfectly happy person with no memory of the disassembly). Was the appropriateness of this simplification discussed somewhere?
Let's flip a very unfair quantum coin with 1:2^1000000 heads-to-tails odds (preparing such a quantum state would require quite an engineering feat, but it's theoretically possible). You shouldn't expect to see heads if the quantum state is prepared correctly, yet the post-flip universe (in MWI) contains a branch where you see heads. So, by your logic, you should expect to see both heads and tails even if the state is prepared correctly.
What I do not kno...