All of Signer's Comments + Replies

Signer20

Neuron count intuitively seems to be a better proxy for the variety/complexity/richness of positive experience. Then you can have an argument about how you wouldn't want to just increase the intensity of pleasure, since that's just a relative number - the claim is that what matters is that pleasure is interesting. And so you would assign lesser weights to less rich experiences. You can also generalize this argument to negative experiences - maybe you don't want to consider pain to be ten times worse just because someone multiplied some number by 10.

But I would think that the broad

... (read more)
Signer10

Russellian panpsychism doesn't postulate a new force - physics already accepts the causal role of existence: only existing neurons can fire.

And it explains the epistemic link - it's cogito ergo sum - you're always right when you think that the universe exists.

Signer10

And a rock's perception belongs to the rock.

Would anyone describe it as theirs? That access is reflective. It’s pretty difficult to retrieve data in a format you didn’t store it in.

But what if there is no access or self-description or retrieval? You just appear fully formed, stare at a wall for a couple of years and then disappear. Are you saying that describing your experiences makes them retroactively conscious?

1Lorec
I'm saying that the way I apprehend, or reflexively relate to, my past or present experiences, as belonging to "myself", is revealing of reflective access, which itself is suggestive of reflective storage. If a hypothetical being never even silently apprehended an experience as theirs, that hypothetical being doesn't sound conscious. I personally have no memories of being conscious but not being able to syntactically describe my experiences, but as far as I understand infant development that's a phase, and it seems logically possible anyway.
Signer*10

Even if I’m not thinking about myself consciously [ i.e., my self is not reflecting on itself ], I have some very basic perception of the wall as being perceived by me, a perceiver—some perception of the wall as existing in reference to me.

Is it you inspecting your experience, or you making an inference from the "consciousness is self-awareness" theory? Because it doesn't feel reflective to me? I think I just have a perception of a wall without anything being about me. It seems to be implementable by just a forward pass streamed into short-term memory or s... (read more)

1Lorec
My perception of the wall is in reference to me simply in the course of belonging to me, in being clearly my perception of the wall, rather than some other person's. Would anyone describe it as theirs? That access is reflective. It's pretty difficult to retrieve data in a format you didn't store it in.
Signer2119

The thing I don't understand about the claimed connection between self-model and phenomenal consciousness is that I don't see much evidence for the necessity of a self-model for the implementation of conscious perception - when I just stare at a white wall without internal dialog or other thoughts, what part of my experience is not implementable without a self-model?

1Lorec
Even if I'm not thinking about myself consciously [ i.e., my self is not reflecting on itself ], I have some very basic perception of the wall as being perceived by me, a perceiver -- some perception of the wall as existing in reference to me. I have some sense of what the wall means to me, a being-who-is-continuous-with-past-and-future-instances-of-myself-but-not-with-other-things. To generate me, my non-conscious, non-self-having brain has to reflect on itself, in a certain way [ I don't know exactly how ] to create a self. The way I tend to distinguish this discursively from introspective cognition or introspective moods [ the other things that are, confusingly, meant by "reflectivity" ] is "in order for there to be a self, stuff has to reflect on stuff, in that certain unknown way. Whether the self reflects on itself is, in my experience, immaterial for consciousness-in-the-sense-of-subjective-experience".
2TAG
Is it claimed? There's no mention of "phenomenal" in the OP.
Signer*0-3

"Death is fine if AI doesn't have self-preservation goal" or "suffering is bad" are also just human ethical assumptions.

Signer1-1

You are talking about the experience of certainty. I'm asking why you trust it.

I know it's beyond doubt because I am currently experiencing something at this exact moment.

That's a description of a system where your experience directly hijacks your feeling of certainty. You wouldn't say that "I know it's beyond doubt there is a blue sky, because blue light hits my eyes at this exact moment" is a valid justification for absolute certainty. Even if you feel certain about some part of reality, you can contemplate being wrong, right? Why not say "I'm feeling ... (read more)

Signer20

How do you know it's beyond doubt? Why is your experience of a blue sky not guaranteed to be right about the sky, while your experience of certainty about experience is always magically right?

What specifically is beyond doubt, if the seeing-neurons of your brain are in the state of seeing red, but you are thinking and saying that you see blue?

3Isopropylpod
I know it's beyond doubt because I am currently experiencing something at this exact moment. Surely you experience things as well and know exactly what I'm talking about. There are no set of words I could use to explain this any better. 
Signer10

If a doctor asks a patient whether he is in pain, and the patient says yes, the doctor may question whether the patient is honest. But he doesn’t entertain the hypothesis that the patient is honest but mistaken.

Nothing in this situation uses certain self-knowledge of the moment of experience. The patient can't communicate it - communication takes time, so it can be spoofed. More importantly, if the patient's knowledge of pain is wrong in the same sense it can be wrong later (the patient says and thinks that they are not in pain, but they actually are and so have p... (read more)

Signer32

You've seen 15648917, but later you think it was 15643917. You're wrong, because actually the state of your neurons was of (what you usually describe as) seeing 15648917. If in the moment of seeing 15648917 (in the moment when your seeing-neurons are in the state of seeing 15648917) you are thinking that you see 15643917 (meaning your thinking-neurons are in the state of thinking that you see 15643917), then you are wrong in the same way you may be wrong later. It works the same way the knowledge about everything works.

You can define "being in the state ... (read more)

2cubefox
I disagree. Knowing that I'm in pain doesn't require an additional and separate mental state about this pain that could be wrong. My being in pain is already sufficient for my knowledge of pain, so I can't be mistaken about being in pain, or about currently having some other qualia. If a doctor asks a patient whether he is in pain, and the patient says yes, the doctor may question whether the patient is honest. But he doesn't entertain the hypothesis that the patient is honest but mistaken. We don't try to convince people who complain about phantom pains that they are actually not in pain after all. More importantly, the patient himself doesn't try to convince himself that he isn't in pain, because that would be pointless, even though he strongly wishes it to be true. I think it's the opposite: there is no reason to hypothesize that you need a second, additional mental state in order to know that you are in the first mental state. All knowing involves being in a state anyway, even in other cases where you have knowledge about external facts. Knowing that a supermarket is around the corner requires you to believe that a supermarket is around the corner. This belief is a kind of mental state; though since it is about an external fact, it is itself not sufficient for knowledge. Having such a belief about something (like a supermarket) is not sufficient for its truth, but having an experience of something is.
Signer50

it’s the only thing I can know for certain

You can't be certain about any specific quale: you can misremember what you were seeing, so there is an external truth-condition (something like "these neurons did such and such things"), so it is possible in principle to decouple your thoughts of certainty from what actually happened with your experience. So illusionism is at least right that your knowledge of your qualia is imperfect and uncertain.

4Isopropylpod
My memory can be completely false, I agree, but ultimately the 'experience of experiencing something' I'm experiencing at this exact moment IS real beyond any doubt I could possibly have, even if the thing I'm experiencing isn't real (such as a hallucination, or reality itself if there's some sort of solipsism thing going on).
3cubefox
Memories of qualia are uncertain, but current qualia are not.
Signer10

Even if it’s incomplete in that way, it doesn’t have metaphysical implications.

Therefore Mary's incomplete knowledge about consciousness doesn't have metaphysical implications, because it is incomplete in fundamentally the same way.

Mary doesn’t know what colour qualia look like, and therefore has an incomplete understanding of consciousness.

Mary doesn't know how to ride, and therefore has an incomplete understanding of riding. What's the difference?

Both need instantiation for what?

For gaining potential utility from specific knowledge representations, f... (read more)

2TAG
No it isn't. Mary doesn't know what Red looks like. That's not know-how. Things can be incomplete in different ways. Theoretical knowledge isn't about utility. It doesn't, in the sense that the theoretical knowledge gives you the know-how. That's one of your own assumptions.
Signer32

Bikes aren’t appearances, so there is no analogy.

The analogy is that they both need instantiation. That's the thing about appearances that is used in the argument.

Know-how, such as riding skills, is not an appearance, or physical knowledge.

So physicalism is false, because physical knowledge is incomplete without know-how.

Nonetheless, there is a difference.

Sure, they are different physical processes. But what's the relevant epistemological difference? If you agree that Mary is useless, we can discuss whether there are ontological differences.

Ri

... (read more)
2TAG
Both need instantiation for what? That's kind of munchkinning. Even if it's incomplete in that way, it doesn't have metaphysical implications. Mary doesn't know what colour qualia look like, and therefore has an incomplete understanding of consciousness. As stated in all versions of the story. Unhelpful. Riding is doing, not understanding.
Signer10

What it looks like is the representation! A different representation just isn’t a quale. #FF0000 just isn’t a red quale!

But reading a book on riding a bike isn’t knowing how to ride a bike... you get the knowledge from mounting a bike and trying!

The knowledge of representation is the whole thing! Qualia are appearances!

If you want to define things that way, ok. So Mary's Room implies that bikes are as unphysical as qualia.

It bypasses what you are calling representation … you have admitted that.

Mary also doesn't have all representations for all p... (read more)

2TAG
As before, that's the standard definition. Qualia aren't defined as unphysical. Bikes aren't appearances, so there is no analogy. Of course she knows what fire is, she is a super scientist. Know-how, such as riding skills, is not an appearance, or physical knowledge. Nonetheless, there is a difference. Riding bikes? How they work? How they appear? But in most cases, that doesn't matter, for the usual reason. Physicalists sometimes respond to Mary's Room by saying that one cannot expect Mary to actually instantiate Red herself just by looking at a brain scan. It seems obvious to them that a physical description of a brain state won't convey what that state is like, because it doesn't put you into that state. As an argument for physicalism, the strategy is to accept that qualia exist, but argue that they present no unexpected behaviour, or other difficulties for physicalism. That is correct as stated but somewhat misleading: the problem is why it is necessary, in the case of experience, and only in the case of experience, to instantiate it in order to fully understand it. Obviously, it is true that a description of a brain state won't put you into that brain state. But that doesn't show that there is nothing unusual about qualia. The problem is that in no other case does it seem necessary to instantiate a brain state in order to understand something. If another version of Mary were shut up to learn everything about, say, nuclear fusion, the question "would she actually know about nuclear fusion" could only be answered "yes, of course... didn't you just say she knows everything"? The idea that she would have to instantiate a fusion reaction within her own body in order to understand fusion is quite counterintuitive. Similarly, a description of photosynthesis will not make you photosynthesise, and photosynthesising would not be needed for a complete understanding of photosynthesis.
Signer10

If Mary looks at these equations, in her monochrome room, does she go into the brain state that instantiates seeing something red?

No.

Does she somehow find out what red looks like without that?

Yes.

What does that mean? Are you saying Mary already knew what red looks like, and instantiating the brain state adds no new knowledge?

She already knew what red looks like, the knowledge was just in a different representation. Just like with knowing how to ride a bike. "no new", like everything here, depends on definitions. But she definitely undergoes phy... (read more)

2TAG
What it looks like is the representation! A different representation just isn't a quale. #FF0000 just isn't a red quale! But reading a book on riding a bike isn't knowing how to ride a bike... you get the knowledge from mounting a bike and trying! The knowledge of representation is the whole thing! Qualia are appearances! It bypasses what you are calling representation ... you have admitted that. That doesn't mean there isn't a difference between different kinds of knowing. The physics equations representing a brain don't contain qualia then, since they don't exist as a brain.
Signer10

If you don’t think there is an HP, because of Mary’s Room, why do you think there is an HP?

Because of the Zombies Argument. "What part of physical equations says our world is not a zombie-world?" is a valid question. The answer to "What part of physical equations says what red looks like?" is just "the part that describes the brain".

It’s supposed to indicate that there is a hard problem, i.e. that even a super scientist cannot come up with a reductive+predictive theory of qualia.

It doesn't indicate it independently of other assumptions. Mary's situ... (read more)

2TAG
Expand on the "says". If Mary looks at these equations, in her monochrome room, does she go into the brain state that instantiates seeing something red? Does she somehow find out what red looks like without that? Neither? What does that mean? Are you saying Mary already knew what red looks like, and instantiating the brain state adds no new knowledge? Why? Mary can "predict pixels" in some sense that bypasses her knowing what colour qualia look like. Just as a blind person can repeat, without understanding, that tomatoes look red, Mary can state that such and such a brain state would have an RGB value of #FF0000 at such and such a pixel. #FF0000 is a symbol for something unknown to her, just as much as r-e-d. So it's not a prediction of a quale in the relevant sense.
Signer10

First, you can still infer meta-representation from your behavior. Second, why does it matter that you represent aversiveness - what's the difference? Representation of aversiveness and representation of damage are both just some states of neurons that model some other neurons (representation of damage still implies the possibility of modeling neurons, not only the external state, because your neurons are connected to other neurons).

Signer10

I understand that, but I'm still asking why subliminal stimuli are not morally relevant for you? They may still create a disposition to act in an aversive way, so there is still a mechanism in some part of the brain/neural network that causes this behaviour and has access to the stimulus - what's the morally significant difference between a stimulus being in some neurons and being in others, such that you call only one location "awareness"?

1stormykat
There is a mechanism in the brain that has access to / represents the physical damage. There is no mechanism in the brain that has access to / represents the aversive response to the physical damage since there is no meta-representation in first-order systems. Thus not a single part of the nervous system at all represents aversiveness, it can be found nowhere in the system. 
Signer10

Why does it matter that Gilbert infers something from the behavior of his neural network and not from the behavior of his body? Both are just subjective models of reality. Why does it matter whether he knows something about his pain? Why doesn't it count if Gilbert avoids pain, defined as the state of the neural network that causes him to avoid it, even when he doesn't know something about it? Maybe you can model it as Gilbert himself not feeling pain, but why is the neural network not a moral patient?

1stormykat
Sorry I think I may have explained this badly. The point is that the neural network has no actual aversiveness in its model of the world. There's no super meaningful difference here between the neural network and Gilbert - that was never my point. The point is that Gilbert is only sensitive to certain types of input, but he has no awareness of what the input does to him. Gilbert / the neural network only experiences: something happens to my body -> something else happens to my body + I react a certain way; he / the network has no model of / access to why that happens, there is no actual aversiveness in the system at all, only a learnt disposition to react in certain ways in certain contexts. It's like when a human views a subliminal stimulus: that stimulus creates a disposition to act in certain ways, but the person is not aware of their own sensitivity, and thus there is no experience of it, it is 'subconscious' / 'implicit'. Gilbert / the network is the same way, he is sensitive to pain, but is not aware of the pain in the same sort of way. Does this make sense? Perhaps I will edit the post to include this explanation if that would help.
Signer10

The reference classes you should use work as a heuristic because there is some underlying mechanism that makes them work. So you should use reference classes in situations where their underlying mechanism is expected to work.

Maybe the underlying mechanism behind doomsday predictions not working is that people predicting doom don't make their predictions based on valid reasoning. So if someone uses that reference class to doubt AI risk, this should be judged as them claiming that the reasoning of people predicting AI doom is similar to that of people in cults predicting Armageddon.

Signer10

The fact that these physicalists feel it would be in some way necessary to instantiate colour, but not other things, like photosynthesis or fusion, means they subscribe to the idea that there is something epistemically unique about qualia/experience, even if they resist the idea that qualia are metaphysically unique.

No, it means they subscribe to the idea that there is something ethically different about qualia/experience. It's not unique, it's like riding a bike. Humans sometimes call physical interactions, the utility of which is not obtainable by just thi... (read more)

Signer10

Endurist thinking treats reproduction as always acceptable or even virtuous, regardless of circumstances. The potential for suffering rarely factors into this calculation—new life is seen as inherently good.

Not necessarily - you can treat creating new people differently from already existing ones and avoid creating bad (in the Endurist sense - not enough positive experiences, regardless of suffering) lives without accepting death for existing people. I, for example, don't get why you would bring more death into the world by creating low-lifespan people, if you don't like death.

Signer10

clearly the system is a lot less contextual than base models, and it seems like you are predicting a reversal of that trend?

The trend may be bounded, and the trend may not go far by the time AI can invent nanotechnology - it would be great if someone actually measured such things.

And there being a trend at all is not predicted by the utility-maximization frame, right?

Signer235

People are confused about the basics because the basics are insufficiently justified.

Signer10

It is learning helpfulness now, while the best way to hit the specified ‘helpful’ target is to do straightforward things in straightforward ways that directly get you to that target. Doing the kinds of shenanigans or other more complex strategies won’t work.

Best by what metric? And I don't think it was shown that complex strategies won't work - learning to change behaviour from training to deployment is not even that complex.

Signer1-1

But it is important, and this post just isn’t going to get done any other way.

Speaking of streetlighting...

Signer20

What makes it rational is that there is an actual underlying hypothesis about how weather works, instead of a vague "LLMs are a lot like human uploads". And weather prediction outputs numbers connected to reality that we actually care about. And there is no alternative credible hypothesis that implies weather prediction not working.

I don't want to totally dismiss empirical extrapolations, but given the stakes, I would personally prefer for all sides to actually state their model of reality and how they think the evidence changed its plausibility, as formally as possible.

Signer42

There is no such disagreement, you just can't test all inputs. And without knowledge of how the internals work, you may be wrong about extrapolating alignment to future systems.

4Roko
There are plenty of systems where we rationally form beliefs about likely outputs from a system without a full understanding of how it works. Weather prediction is an example.
Answer by Signer56

Yes, except I would object to phrasing this anthropic stuff as "we should expect ourselves to be agents that exist in a universe that abstracts well" instead of "we should value a universe that abstracts well (or other universes that contain many instances of us)" - there are no coherence theorems that force summation of your copies, right? And so it becomes apparent that we can value some other thing.

Also, even if you consider some memories a part of your identity, you can value yourself slightly less after forgetting them, instead of only having a threshold for death.

Signer10

It doesn't matter whether you call your multiplier "probability" or "value" if it results in your decision to not care about a low-measure branch. The only difference is that probability is supposed to be about knowledge, and the fact that Wallace's argument involves an arbitrary assumption, not only physics, means it's not probability, but value - there is no reason to value knowledge of your low-measure instances less.

this makes decision theory and probably consequentialist ethics impossible in your framework

It doesn't? Nothing stops you from making decisions in a wor... (read more)

1Jonah Wilberg
OK 'impossible' is too strong, I should have said 'extremely difficult'. That was my point in footnote 3 of the post. Most people would take the fact that it has implications like needing to "maximize splits of good experiences" (I assume you mean maximise the number of splits) as a reductio ad absurdum, due to the fact that this is massively different from our normal intuitions about what we should do. But some people have tried to take that approach, like in the article I mentioned in the footnote. If you or someone else can come up with a consistent and convincing decision approach that involves branch counting I would genuinely love to see it!
Signer10

Things like lions, and chairs are other examples.

And counted branches.

This is how Wallace defines it (he in turn defines macroscopically indistinguishable in terms of providing the same rewards). It’s his term in the axiomatic system he uses to get decision theory to work. There’s not much to argue about here?

His definition leads to a contradiction with the informal intuition that motivates consideration of macroscopic indistinguishability in the first place.

We should care about low-measure instances in proportion to the measure, just as in classical

... (read more)
1Jonah Wilberg
I'm not at all saying the experiences of a person in a low-weight world are less valuable than those of a person in a high-weight world. Just that when you are considering possible futures in a decision-theoretic framework you need to apply the weights (because weight is equivalent to probability). Wallace's useful achievement in this context is to show that there exists a set of axioms that makes this work, and this includes branch-indifference. This is useful because it makes clear the way in which the branch-counting approach you're suggesting is in conflict with decision theory. So I don't disagree that you can care about the number of your thin instances, but what I'm saying is in that case you need to accept that this makes decision theory and probably consequentialist ethics impossible in your framework.
Signer72

How many notions of consciousness do you think are implementable by a short Python program?

3Canaletto
All of them, you can cook up something AIXI-like in a very few bytes. But it will have to run for a very long time.
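A minimal sketch of the kind of construction being gestured at here (purely illustrative, not from the thread): a Solomonoff-style enumerate-and-weight predictor, the predictive core of AIXI-like definitions. The expression "language", the alphabet, and the length cap are all assumptions chosen to keep it tiny; the point is only that the definition fits in a few lines while the runtime explodes combinatorially.

```python
import itertools

def toy_universal_predictor(history, max_len=3, alphabet="01+h"):
    """Enumerate tiny 'programs' (Python expressions over the history h),
    weight each by 2**-length, keep those whose output extends the observed
    history, and vote on the next character."""
    votes = {"0": 0.0, "1": 0.0}
    for length in range(1, max_len + 1):
        for prog in itertools.product(alphabet, repeat=length):
            src = "".join(prog)
            try:
                out = str(eval(src, {}, {"h": history}))  # run the "program"
            except Exception:
                continue  # most strings are not valid programs
            if out.startswith(history) and len(out) > len(history):
                nxt = out[len(history)]
                if nxt in votes:
                    votes[nxt] += 2.0 ** (-length)  # shorter programs weigh more
    return votes

print(toy_universal_predictor("0101"))  # the "repeat the history" program wins
```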
Signer32

Because scale doesn't matter - it doesn't matter if you are implemented on a thick or narrow computer.

First of all, macroscopic indistinguishability is not a fundamental physical property - branching indifference is an additional assumption, so I don't see how it's not as arbitrary as branch counting.

But more importantly, the branching indifference assumption is not the same as the informal "not caring about macroscopically indistinguishable differences"! As Wallace showed, branching indifference implies the Born rule, which implies you almost shouldn't care about you in a br... (read more)

1Jonah Wilberg
  You're right it's not a fundamental physical property - the overall philosophical framework here is that things can be real - as emergent entities - without being fundamental physical properties. Things like lions, and chairs are other examples. This is how Wallace defines it (he in turn defines macroscopically indistinguishable in terms of providing the same rewards). It's his term in the axiomatic system he uses to get decision theory to work. There's not much to argue about here?  Yes this is true. Not caring about low-measure instances is a very different proposition from not caring about macroscopically indistinguishable differences. We should care about low-measure instances in proportion to the measure, just as in classical decision theory we care about low-probability instances in proportion to the probability.
Signer42

But why would you want to remove this arbitrariness? Your preferences are fine-grained anyway, so why retain classical counting, but deny counting in the space of the wavefunction? It's like saying "dividing the world into people and their welfare is arbitrary - let's focus on measuring the mass of a space region". The point is you can't remove all decision-theoretic arbitrariness from MWI - "branching indifference" is just an arbitrary ethical constraint that is equivalent to valuing measure for no reason, and without it, fundamental physics that works like MWI does not prevent you from making decisions as if quantum immortality works.

1Jonah Wilberg
I don't get why you would say that the preferences are fine-grained, it kinda seems obvious to me that they are not fine-grained. You don't care about whether worlds that are macroscopically indistinguishable are distinguishable at the quantum level, because you are yourself macroscopic. That's why branching indifference is not arbitrary. Quantum immortality is a whole other controversial story.
Signer20

“Decoherence causes the Universe to develop an emergent branching structure. The existence of this branching is a robust (albeit emergent) feature of reality; so is the mod-squared amplitude for any macroscopically described history. But there is no non-arbitrary decomposition of macroscopically-described histories into ‘finest-grained’ histories, and no non-arbitrary way of counting those histories.”

Importantly though, on this approach it is still possible to quantify the combined weight (mod-squared amplitude) of all branches that share a certain mac

... (read more)
1Jonah Wilberg
You're right that you can just take whatever approximation you make at the macroscopic level ('sunny') and convert that into a metric for counting worlds. But the point is that everyone will acknowledge that the counting part is arbitrary from the perspective of fundamental physics - but you can remove the arbitrariness that derives from fine-graining, by focusing on the weight. (That is kind of the whole point of a mathematical measure.)
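A toy numerical illustration of this point (my own construction, with made-up amplitudes and 'sunny'/'rainy' labels, not Wallace's or the commenters'): refining the decomposition changes the branch count but leaves the combined mod-squared weight of the macrostate unchanged.

```python
from math import sqrt

# Four fine-grained branches, each labelled with a macrostate and an amplitude.
branches = [("sunny", 0.6 + 0.0j), ("sunny", 0.2 + 0.1j),
            ("rainy", 0.5 - 0.3j), ("rainy", 0.45j)]

def macro_weight(branches, label):
    """Combined weight of a macrostate: sum of |amplitude|^2 over its branches."""
    return sum(abs(a) ** 2 for lbl, a in branches if lbl == label)

def macro_count(branches, label):
    """Naive branch count: changes whenever the fine-graining changes."""
    return sum(1 for lbl, _ in branches if lbl == label)

# Split one "sunny" branch into two equal sub-branches (amplitude / sqrt(2) each):
# the count goes up, the combined weight stays the same.
refined = [branches[0]] + [("sunny", branches[1][1] / sqrt(2))] * 2 + branches[2:]

print(macro_count(branches, "sunny"), round(macro_weight(branches, "sunny"), 3))  # 2 0.41
print(macro_count(refined, "sunny"), round(macro_weight(refined, "sunny"), 3))    # 3 0.41
```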
Answer by Signer32

Even if we can’t currently prove certain axioms, doesn’t this just reflect our epistemological limitations rather than implying all axioms are equally “true”?

It doesn't, and they are fundamentally equal. The only reality is the physical one - there is no reason to complicate your ontology with platonically existing math. Math is just a collection of useful templates that may help you predict reality, and the fact that it works is always just a physical fact. The best case is that we'll know the true laws of physics and they will work like some subset of math, and then axio... (read more)

1lbThingrb
This is an appealingly parsimonious account of mathematical knowledge, but I feel like it leaves an annoying hole in our understanding of the subject, because it doesn't explain why practicing math as if Platonism were correct is so ridiculously reliable and so much easier and more intuitive than other ways of thinking about math. For example, I have very high credence that no one will ever discover a deduction of 0=1 from the ZFC axioms, and I guess I could just treat that as an empirical hypothesis about what kinds of physical instantiations of ZFC proofs will ever exist. But the early set theorists weren't just randomly sampling the space of all possible axioms and sticking with whatever ones they couldn't find inconsistencies in. They had strong priors about what kinds of theories should be consistent. Their intuitions sometimes turned out to be wrong, as in the case of Russell's paradox, but overall their work has held up remarkably well, after huge amounts of additional investigation by later generations of mathematicians. So where did their intuitions come from? As I said in my answer, I have doubts about Platonism as an explanation, but none of the alternatives I've investigated seem to shed much light on the question.
Signer43

It sure doesn't seem to generalize in the GPT-4o case. But what's the hypothesis for Sonnet 3.5 refusing in 85% of cases? And CoT improving the score and o1 being better in the browser suggest the problem is in models not understanding consequences, not in them not trying to be good. What's the rate of capability generalization to the agent environment? Are we going to conclude that Sonnet just demonstrates reasoning, instead of doing it for real, if it solves only 85% of the tasks it correctly talks about?

Also, what's the rate of generalization of unprompted problematic behaviour avoidance? It's much less of a problem if your AI does what you tell it to do - you can just not give it to users, tell it to invent nanotechnology, and win.

2Simon Lermen
I had finishing this up on my to-do list for a while. I just made a full length post on it. https://www.lesswrong.com/posts/ZoFxTqWRBkyanonyb/current-safety-training-techniques-do-not-fully-transfer-to I think it's fair to say that some smarter models do better at this, however, it's still worrisome that there is a gap. Also attacks continue to transfer.
Signer10

GPT-4 is insufficiently capable, even if it were given an agent structure, memory and goal set to match, to pull off a treacherous turn. The whole point of the treacherous turn argument is that the AI will wait until it can win to turn against you, and until then play along.

I don't get why actual ability matters. It's sufficiently capable to pull it off in some simulated environments. Are you claiming that we can't deceive GPT-4 and it is actually waiting and playing along just because it can't really win?

Signer132

Whack-A-Mole fixes, from RLHF to finetuning, are about teaching the system to not demonstrate problematic behavior, not about fundamentally fixing that behavior.

Based on what? Problematic behavior avoidance does actually generalize in practice, right?

Here is a way in which it doesn't generalize in observed behavior:

Alignment does not transfer well from chat models to agents

TLDR: There are three new papers which all show the same finding, i.e. the safety guardrails don't transfer well from chat models to the agents built from them. In other words, models won't tell you how to do something harmful, but they will do it if given the tools. Attack methods like jailbreaks or refusal-vector ablation do transfer.

Here are the three papers, I am the author of one of them:

https://arxiv.org/abs/24... (read more)

Signer10

Not at all. The problem is that their observations would mostly not be in a classical basis.

I phrased it badly, but what I mean is that there is a simulation of Hilbert space, where some regions contain patterns that can be interpreted as observers observing something, and if you count them by similarity, you won't get counts consistent with the Born measure of these patterns. I don't think the basis matters in this model, if you change the basis for the observer, observations and similarity threshold simultaneously? A change of basis would just rotate or scale patterns,... (read more)

Signer10

https://mason.gmu.edu/~rhanson/mangledworlds.html

I mean that if a Turing machine is computing the universe according to the laws of quantum mechanics, observers in such a universe would be distributed uniformly, not by Born probability. So you either need some modification to current physics, such as mangled worlds, or you can postulate that Born probabilities are truly random.

2TAG
I assume you mean the laws of QM except the collapse postulate. Not at all. The problem is that their observations would mostly not be in a classical basis. Born probability relates to observations, not observers. Or collapse. Mangled worlds is kind of a nothing burger - it's a variation on the idea that interference between superposed states leads to both a classical basis and the Born probabilities, which is an old idea, but without making it any more quantitative. ??
Signer10

Our observations are compatible with a world that is generated by a Turing machine with just a couple thousand bits.

Yes, but this is kinda incompatible with QM without mangled worlds.

2Alexander Gietelink Oldenziel
Oh? What do you mean! I don't know about mangled worlds.
Signer30

Imagining two apples is a different thought from imagining one apple, right?

I mean, is it? Different states of the whole cortex are different. And the cortex can't be in a state of imagining only one apple and, simultaneously, be in a state of imagining two apples, obviously. But it's tautological. What are we gaining from thinking about it in such terms? You can say the same thing about the whole brain itself, that it can only have one brain-state in a moment.

I guess there is a sense in which other parts of the brain have more various thoughts relativ... (read more)

5Steven Byrnes
You say “tautological”, I say “obvious”. You can’t parse a legal document and try to remember your friend’s name at the exact same moment. That’s all I’m saying! This is supposed to be very obvious common sense, not profound. Consider the following fact: FACT: Sometimes, I’m thinking about pencils. Other times, I’m not thinking about pencils. Now imagine that there’s a predictive (a.k.a. self-supervised) learning algorithm which is tasked with predicting upcoming sensory inputs, by building generative models. The above fact is very important! If the predictive learning algorithm does not somehow incorporate that fact into its generative models, then those generative models will be worse at making predictions. For example, if I’m thinking about pencils, then I’m likelier to talk about pencils, and look at pencils, and grab a pencil, etc., compared to if I’m not thinking about pencils. So the predictive learning algorithm is incentivized (by its predictive loss function) to build a generative model that can represent the fact that any given concept might be active in the cortex at a certain time, or might not be. Again, this is all supposed to sound very obvious, not profound. Yes, it’s also useful for the predictive learning algorithm to build generative models that capture other aspects of the brain state, outside the cortex. Thus we wind up with intuitive concepts that represent the possibility that we can be in one mood or another, that we can be experiencing a certain physiological reaction, etc.
Signer40

I still don't get this "only one thing in awareness" thing. There are multiple neurons in the cortex and I can imagine two apples - in what sense can there be only one thing in awareness?

Or equivalently, it corresponds equally well to two different questions about the territory, with two different answers, and there’s just no fact of the matter about which is the real answer.

Obviously the real answer is the model which is more veridical^^. The latter hindsight model is right not about the state of the world at t=0.1, but about what you later thought about the world at t=0.1.

5Steven Byrnes
One thought in awareness! Imagining two apples is a different thought from imagining one apple, right? They’re different generative models, arising in different situations, with different implications, different affordances, etc. Neither is a subset of the other. (I.e., there are things that I might do or infer in the context of one apple, that I would not do or infer in the context of two apples.) I can have a song playing in my head while reading a legal document. That’s because those involve different parts of the cortex. In my terms, I would call that “one thought” involving both a song and a legal document. On the other hand, I can’t have two songs playing in my head simultaneously, nor can I be thinking about two unrelated legal documents simultaneously. Those involve the same parts of the cortex being asked to do two things that conflict. So instead, I’d have to flip back and forth. There are multiple neurons in the cortex, but they’re not interchangeable. Again, I think autoassociative memory / attractor dynamics is a helpful analogy here. If I have a physical instantiation of a Hopfield network, I can’t query 100 of its stored patterns in parallel, right? I have to do it serially. I don’t pretend that I’m offering a concrete theory of exactly what data format a “generative model” is etc., such that song-in-head + legal-contract is a valid thought but legal-contract + unrelated-legal-contract is not a valid thought. …Not only that, but I’m opposed to anyone else offering such a theory either! We shouldn’t invent brain-like AGI until we figure out how to use it safely, and those kinds of gory details would be getting uncomfortably close, without corresponding safety benefits, IMO.
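Since the Hopfield analogy is doing real work here, a minimal sketch may help (my own toy with assumed patterns, not Byrnes's model): many patterns are superimposed in one weight matrix, but a retrieval run settles into a single stored attractor at a time rather than blending several in parallel.

```python
import numpy as np

# Two orthogonal 8-unit patterns, standing in for two stored "thoughts".
patterns = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                     [1,  1, 1,  1, -1, -1, -1, -1]])

# Hebbian storage: superimpose the outer products, zero the diagonal.
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0)

def recall(cue, steps=10):
    """Repeatedly update the whole state until it settles into one attractor."""
    s = cue.astype(float).copy()
    for _ in range(steps):
        s = np.sign(W @ s)
        s[s == 0] = 1.0
    return s

noisy = patterns[0].astype(float).copy()
noisy[0] = -noisy[0]           # corrupt one bit of the first pattern
print(recall(noisy))           # settles on pattern 0 as a whole, not a blend
```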
Signer0-2

If that’s your hope—then you should already be alarmed at trends

It would be nice for someone to quantify the trends. Otherwise it may well be that the trends point to easygoing enough and aligned enough future systems.

For some humans, the answer will be yes—they really would do zero things!

Nah, it's impossible for evolution to just randomly stumble upon such complicated and unnatural mind-design. Next you are going to say what, that some people are fine with being controlled?

Where an entity has never had the option to do a thing, we may not validly in

... (read more)
Signer10

I genuinely think it's a "more dakka" situation - the difficulty of communication is often underestimated, but it is possible to reach a mutual understanding.

Signer10

RLHF does not solve the alignment problem because humans can’t provide good-enough feedback fast-enough.

Yeah, but the point is that the system learns values before an unrestricted AI vs AI conflict.

As mentioned in the beginning, I think the intuition goes that neural networks have a personality trait which we call “alignment”, caused by the correspondence between their values and our values. But “their values” only really makes sense after an unrestricted AI vs AI conflict, since without such conflicts, AIs are just gonna propagate energy to whichever

... (read more)
4tailcalled
But if you just naively take the value that are appropriate outside of a life-and-death conflict and apply them to a life-and-death conflict, you're gonna lose. In that case, RLHF just makes you an irrelevant player, and if you insist on applying it to military/police technology, it's necessary for AI safety to pivot to addressing rogue states or gangsters. Which again makes RLHF really really bad because we shouldn't have to work with rogue states or gangsters to save the world. Don't cripple the good guys. If you propose a particular latent variable that acts in a particular way, that is a lot of complexity, and you need a strong case to justify it as likely. Human-regulation mechanisms could plausibly solve this problem by banning chip fabs. The issue is we use chip fabs for all sorts of things so we don't want to do that unless we are truly desperate. Idk. Big entities have a lot of security vulnerabilities which could be attacked by AIs. But I guess one could argue the surviving big entities are red-teaming themselves hard enough to be immune to these. Perhaps most significant is the interactions between multiple independent big things, since they could be manipulated to harm the big things. Small adversaries currently have a hard time exploiting these security vulnerabilities because intelligence is really expensive, but once intelligence becomes too cheap to meter, that is less of a problem. You could heavily restrict the availability of AI but this would be an invasive possibility that's far off the current trajectory.
Signer-1-2

But also, if you train a completion model on data where a very weak hash is followed by its pre-image, it will probably have learned to undo the hash, even though the source generation process never performed that (potentially much more complicated than the hashing function itself) operation, which means it’s not really a simulator.

I'm saying that this won't work with current systems, at least for a strong hash, because it's hard, and instead of learning to undo it, the model will learn to simulate, because that's easier. And then you can vary the strength of the hash to... (read more)

4habryka
You can't learn to simulate an undo of a hash, or at least I have no idea what you are "simulating" and why that would be "easier". You are certainly not simulating the generation of the hash; going token by token forwards, you don't have access to a pre-image at that point. Of course the reason why sometimes hashes are followed by their pre-image in the training set is because they were generated in the opposite order and then simply pasted in hash->pre-image order.
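For concreteness, a hedged sketch of the kind of training data being discussed (my own toy; the XOR "hash", string length, and formatting are assumptions): the pre-image is generated first, the pair is pasted in hash-first order, and the "hash" is weak enough that inverting it is a learnable function, unlike a cryptographic one.

```python
import random

def weak_hash(s: str) -> str:
    """Deliberately weak 8-bit digest: XOR of the byte values, as two hex chars."""
    acc = 0
    for ch in s.encode():
        acc ^= ch
    return f"{acc:02x}"

def make_example(rng: random.Random) -> str:
    preimage = "".join(rng.choice("abcdef") for _ in range(6))  # generated first
    return f"{weak_hash(preimage)} -> {preimage}"               # pasted hash-first

rng = random.Random(0)
for _ in range(3):
    print(make_example(rng))  # the model only ever sees hash-then-preimage order
```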
Signer-1-2

And I don’t think we’ve observed any evidence of that.

What about any time a system generalizes favourably, instead of predicting errors? You can say it's just a failure of prediction, but it's not like these failures are random.

That is the central safety property we currently rely on and pushes things to be a bit more simulator-like.

And the evidence for this property, instead of, for example, the inherent bias of NNs, being central is what? Why wouldn't a predictor exhibit more malign goal-directedness even for short-term goals?

I can see that this who... (read more)

2habryka
I don't understand, how is "not predicting errors" either a thing we have observed, or something that has anything to do with simulation? Yeah, I really don't know what you are saying here. Like, if you prompt a completion model with badly written text, it will predict badly written text. But also, if you train a completion model on data where a very weak hash is followed by its pre-image, it will probably have learned to undo the hash, even though the source generation process never performed that (potentially much more complicated than the hashing function itself) operation, which means it's not really a simulator.
Signer10

Why wouldn't myopic bias make it more likely to simulate than predict? And doesn't empirical evidence about LLMs support the simulators frame? Like, what observations persuaded you that we are not living in a world where LLMs are simulators?

6habryka
I don't think there is any reason to assume the system is likely to choose "simulation" over "prediction"? And I don't think we've observed any evidence of that.  The thing that is true, which I do think matters, is that if you train your AI system on only doing short single forward-passes, then it is less likely to get good at performing long chains of thought, since you never directly train it to do that (instead hoping that the single-step training generalizes to long chains of thought). That is the central safety property we currently rely on and pushes things to be a bit more simulator-like.