As an aside, note that some of "AIs misbehave in ways we've predicted" can be a self-fulfilling prophecy due to out-of-context generalization: we wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it's powerful; the powerful AI does X. So it's possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn't talked about it as much, or if those stories were expunged from the pretraining corpus.
I think this is conceivable if either
I appreciate you revisiting another couple of months later! This topic continues to evolve.
It was also an update for me that merely-RLHF-ish training on top of GPTs appears to have been enough (in latest-generation models at least) to get something mostly-consistently 'actually trying' in an apparently coherent way. I expected that to happen; I just didn't know when. The lede, which they somewhat buried in that paper, was also that:
...Furthermore, we find that this compliance gap persists in our synthetic document fine-tuned setup even when we entirely remove the
Organisms in general typically sense their environment and take different actions across a wide variety of environmental conditions, so as to cause there to be approximate copies of themselves in the future.[4] That's basic agency.[5]
I agree with this breakdown, except I start the analysis with moment-to-moment deliberation, and note that having there (continue to) be relevantly similar deliberators is a very widely-applicable intermediate objective, from where we get control ('basic agency') but also delegation and replication.
...The way the terms
This was a hasty and not exactly beautifully-written post. It didn't get much traction here on LW, but it had more engagement on its EA Forum crosspost (some interesting debate in the comments).
I still endorse the key messages, which are:
Not content with upbraiding CAIS, I also went after Scott Alexander later in the month for similar l...
There are a variety of different attitudes that can lead to successionism.
This is a really valuable contribution! (Both the term and laying out a partial extensional definition.)
I think there's a missing important class of instances which I've previously referred to as 'emotional dependence and misplaced concern' (though I think my choice of words isn't amazing). The closest is perhaps your 'AI parentism'. The basic point is that there is a growing 'AI rights/welfare' lobby because some people (for virtuous reasons!) are beginning to think that AI systems ...
Good old Coase! Thanks for this excellent explainer.
In contrast, if you think the relevant risks from AI look like people using their systems to do some small amounts of harm which are not particularly serious, you'll want to hold the individuals responsible for these harms liable and spare the companies.
Or (thanks to Coase), we could have two classes of harm, with 'big' arbitrarily defined as, I don't know, say $500m, which is a number I definitely just made up, and put liability for big harms on the big companies, while letting the classic societal appa...
I like this decomposition!
I think 'Situational Awareness' can quite sensibly be further divided up into 'Observation' and 'Understanding'.
The classic control loop of 'observe', 'understand', 'decide', 'act'[1] is consistent with this discussion: 'observe'+'understand' are combined here as 'situational awareness', and you're pulling out 'goals' and 'planning capacity' as separable aspects of 'decide'.
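As a purely illustrative sketch (hypothetical names, nobody's actual architecture) of how the four stages line up with that factoring:

```python
# Illustrative only: 'observe' + 'understand' together play the role of
# situational awareness, while 'goals' and 'planning capacity' are the
# separable ingredients of 'decide'. All names are hypothetical.

def control_loop(environment, world_model, planner, goal):
    while True:
        observation = environment.observe()          # observe
        situation = world_model.update(observation)  # understand (with observe: situational awareness)
        action = planner.decide(situation, goal)     # decide (goals + planning capacity)
        environment.act(action)                      # act
```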
Are there some difficulties with factoring?
Certain kinds of situational awareness are more or less fit for certain goals. And further, the important 're...
the original 'theorem' was wordcelled nonsense
Lol! I guess if there was a more precise theorem statement in the vicinity gestured at, it wasn't nonsense? But in any case, I agree the original presentation is dreadful. John's is much better.
I would be curious to hear a *precise* statement of why the result here follows from the Good Regulator Theorem.
A quick go at it, might have typos.
Suppose we have
and a predictor
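Roughly, in my own notation (a hedged sketch, so possibly imprecise): suppose the data are generated by an HMM with hidden states $S_t$ and emissions $X_t$, and the predictor's next-token distribution matches the true conditional. Then:

```latex
\[
P(X_{t+1} \mid X_{\le t})
  \;=\; \sum_{s'} P(X_{t+1} \mid S_{t+1}=s')\, P(S_{t+1}=s' \mid X_{\le t}),
\]
% so the next-emission distribution is a deterministic function of the belief state
% b_t(s) = P(S_t = s | X_{\le t}), which itself updates recursively (forward algorithm):
\[
b_{t+1}(s') \;\propto\; P(X_{t+1} \mid S_{t+1}=s') \sum_{s} P(S_{t+1}=s' \mid S_t=s)\, b_t(s).
\]
% Any predictor matching the true conditionals must compute this function of b_t;
% generically, that means tracking b_t itself (up to merging predictively-equivalent
% beliefs), which is the Good-Regulator-style "the regulator models the system" step.
```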
I guess my question would be 'how else did you think a well-generalising sequence model would achieve this?' Like, what is a sufficient world model but a posterior over HMM states in this case? This is what the GR theorem asks for. (Of course, a poorly-fit model might track extraneous detail or have a bad posterior.)
From your preamble and your experiment design, it looks like you correctly anticipated the result, so this should not have been a surprise (to you). In general I object to being sold something as surprising which isn't (it strikes me as a lesser-notice...
Nice explanation of MSP and good visuals.
This is surprising!
Were you in fact surprised? If so, why? (This is a straightforward consequence of the good regulator theorem[1].)
In general I'd encourage you to carefully track claims about transformers, HMM-predictors, and LLMs, and to distinguish between trained NNs and the training process. In this writeup, all of these are quite blended.
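For concreteness, a toy numerical sketch (transition/emission matrices invented by me) of the belief-state filter whose geometry the MSP visualises; the next-token distribution is a deterministic function of this belief vector:

```python
import numpy as np

# Toy HMM, purely illustrative: T[i, j] = P(next hidden state j | state i),
# E[i, k] = P(emit symbol k | hidden state i).
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
E = np.array([[0.7, 0.3],
              [0.1, 0.9]])

def belief_update(b, obs):
    """One step of Bayesian filtering: propagate through T, then condition on obs."""
    b = b @ T              # predict the next hidden-state distribution
    b = b * E[:, obs]      # weight by the likelihood of the observed symbol
    return b / b.sum()     # renormalise

def next_token_dist(b):
    """The next-emission distribution is a function of the current belief alone."""
    return (b @ T) @ E

b = np.array([0.5, 0.5])         # uniform prior over hidden states
for obs in [0, 0, 1, 1, 1]:      # an arbitrary observation sequence
    b = belief_update(b, obs)
    print(b, next_token_dist(b))
```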
Incidentally I noticed Yudkowsky uses 'brainware' in a few places (e.g. in conversation with Paul Christiano). But it looks like that's referring to something more analogous to 'architecture and learning algorithms', which I'd put more in the 'software' camp when it comes to the taxonomy I'm pointing at (the 'outer designer' is writing it deliberately).
Unironically, I think it's worth anyone interested skimming that Verma & Pearl paper for the pictures :) especially fig 2
Mmm, I misinterpreted at first. It's only a v-structure if the two parents are not connected. So this is a property which needs to be maintained effectively 'at the boundary' of the fully-connected cluster which we're rewriting. I think that tallies with everything else, right?
ETA: both of our good proofs respect this rule; the first Reorder in my bad proof indeed violates it. I think this criterion is basically the generalised and corrected version of the fully-connected bookkeeping rule described in this post. I imagine if I/someone worked through it, this would...
That's concerning. It would appear to make both our proofs invalid.
But I think your earlier statement about incoming vs outgoing arrows makes sense. Maybe Verma & Pearl were asking for some other kind of equivalence? Grr, back to the semantics I suppose.
Aha. Preserving v-structures (colliders $X \rightarrow Z \leftarrow Y$ whose parents $X$ and $Y$ are non-adjacent) is necessary and sufficient for equivalence[1]. So when rearranging fully-connected subgraphs, certainly we can't do it (cost-free) if it introduces or removes any v-structures.
Plausibly if we're willing to weaken by adding in additional arrows, there might be other sound ways to reorder fully-connected subgraphs - but they'd be non-invertible. Haven't thought about that.
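If it helps to poke at this concretely, here's a small toy check of the Verma & Pearl criterion (my own code, not from the paper): two DAGs are Markov equivalent iff they share the same skeleton and the same v-structures.

```python
import networkx as nx
from itertools import combinations

def skeleton(g: nx.DiGraph):
    """Undirected edge set."""
    return {frozenset(e) for e in g.edges}

def v_structures(g: nx.DiGraph):
    """Colliders a -> c <- b whose parents a, b are not adjacent."""
    vs = set()
    for c in g.nodes:
        for a, b in combinations(g.predecessors(c), 2):
            if not (g.has_edge(a, b) or g.has_edge(b, a)):
                vs.add((frozenset((a, b)), c))
    return vs

def markov_equivalent(g1: nx.DiGraph, g2: nx.DiGraph) -> bool:
    """Verma & Pearl (1990): same skeleton and same v-structures."""
    return (set(g1.nodes) == set(g2.nodes)
            and skeleton(g1) == skeleton(g2)
            and v_structures(g1) == v_structures(g2))

# X -> Z <- Y has a v-structure; X -> Z -> Y has the same skeleton but no collider.
g1 = nx.DiGraph([("X", "Z"), ("Y", "Z")])
g2 = nx.DiGraph([("X", "Z"), ("Z", "Y")])
print(markov_equivalent(g1, g2))  # False
```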
Verma & Pearl, Equivalence and Synthesis of Causal Models 1990 ↩︎
Mhm, OK I think I see. But appear to me to make a complete subgraph, and all I did was redirect the . I confess I am mildly confused by the 'reorder complete subgraph' bookkeeping rule. It should apply to the in , right? But then I'd be able to deduce which is strictly different. So it must mean something other than what I'm taking it to mean.
Maybe need to go back and stare at the semantics for a bit. (But this syntactic view with motifs and transformations is much nicer!)
Perhaps more importantly, I think with Node Introduction we really don't need after all?
With Node Introduction and some bookkeeping, we can get the and graphs topologically compatible, and Frankenstein them. We can't get as neat a merge as if we also had - in particular, we can't get rid of the arrow . But that's fine, we were about to draw that arrow in anyway for the next step!
Is something invalid here? Flagging confusion. This is a slightly more substantial claim than the original proof makes, since it assumes strictly less. Downstrea...
I'd probably be more specific and say 'gradient hacking' or 'update hacking' for deception of a training process which updates NN internals.
I see what you're saying with a deployment scenario being often implicitly a selection scenario (should we run the thing more/less or turn it off?) in practice. So deceptive alignment at deploy-time could be a means of training (selection) hacking.
More centrally, 'training hacking' might refer to a situation with denser oversight and explicit updating/gating.
Deceptive alignment during this period is just one way of tra...
Some people seem to argue that concrete evidence of deception is no evidence for deceptive alignment. I had a great discussion with @TurnTrout a few weeks ago about this, where we homed in on our agreement and disagreement here. Maybe we'll share some content from it at some point. In the meantime, my take after that is roughly
This is great, and thanks for pointing at this confusion, and raising the hypothesis that it could be a confusion of language! I also have this sense.
I'd strongly agree that separating out 'deception' per se is importantly different from more specific phenomena. Deception is just, yes, obviously this can and does happen.
I tend to use 'deceptive alignment' slightly more broadly - i.e. something could be deceptively aligned post-training, even if all updates after that point are 'in context' or whatever analogue is relevant at that time. Right? This would be...
I wasn't eager on this, but your justification updated me a bit. I think the most important distinction is indeed the 'grown/evolved/trained/found, not crafted', and 'brainware' didn't immediately evoke that for me. But you're right, brains are inherently grown, they're very diverse, we can probe them but don't always/ever grok them (yet), structure is somewhat visible, somewhat opaque, they fit into a larger computational chassis but adapt to their harness somewhat, properties and abilities can be elicited by unexpected inputs, they exhibit various kinds of learning on various timescales, ...
Yeah, thinking slightly aloud, I tentatively think Frankenstein needs an extra condition like the blanket stitch condition... something which enforces the choice of topo ordering to be within the right class of topo orderings? That's what the chain does - it means we can assign orderings or , but not e.g. , even though that order is consistent with both of the other original graphs.
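One way to make that 'right class of topo orderings' condition concrete (my framing, not anything from the post): two DAGs admit a common topological order exactly when the union of their edge sets is still acyclic, which would give a cheap side condition to check before Frankensteining.

```python
import networkx as nx

def share_topo_order(g1: nx.DiGraph, g2: nx.DiGraph) -> bool:
    """Two DAGs have a common topological order iff their edge union is acyclic."""
    union = nx.compose(g1, g2)   # union of the two graphs' nodes and edges
    return nx.is_directed_acyclic_graph(union)

# X -> Y and Y -> X are each fine alone, but they force incompatible orderings.
print(share_topo_order(nx.DiGraph([("X", "Y")]), nx.DiGraph([("Y", "X")])))  # False
```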
If I get some time I'll return to this and think harder but I can't guarantee it.
ETA I did spend a bit more time, and the below mostly resol...
I had another look at this with a fresh brain and it was clearer what was happening.
TL;DR: It was both of 'I'm missing something', and a little bit 'Frankenstein is invalid' (it needs an extra condition which is sort of implicit in the post). As I guessed, with a little extra bookkeeping, we don't need Stitching for the end-to-end proof. I'm also fairly confident Frankenstein subsumes Stitching in the general case. A 'deductive system' lens makes this all clearer (for me).
The key invalid move I was making when I said
...But this same
But this same move can alternatively be done with the Frankenstein rule, right? (I might be missing something.) But Frankenstein has no such additional requirement, as stated. If I'm not missing something, I think Frankenstein might be invalid as stated (like maybe it needs an analogous extra condition). Haven't thought this through yet.
i.e. I think either
One thing that initially stood out to me on the fundamental theorem was: where did the arrow come from? It 'gets introduced' in the first bookkeeping step (we draw and then reorder the subgraph at each .
This seemed suspicious to me at first! It seemed like kind of a choice, so what if we just didn't add that arrow? Could we land at a conclusion of AND ? That's way too strong! But I played with it a bit, and there's no obvious way to do the second frankenstitch which brings everything together unless you draw in that extra arrow and ...
I might be missing something, but I don't see where is actually used in the worked example.
It seems that there's a consistent topo order between the and diagrams, so we Frankenstitch them. Then we draw an edge from to and reorder (bookkeep). Then we Frankenstein the diagrams and the resulting diagram again. Then we collect the together (bookkeep). Where's used?
Oh yeah, I don't know how common it is, but when manipulating graphs, if there's a topo order, I seem to strongly prefer visualising things with that order respected on the page (vertically or horizontally). So your images committed a few minor crimes according to that aesthetic. I can also imagine that some other aesthetics would strongly prefer writing things the way you did though, e.g. with . (My preference would put and slightly lower, as you did with the , graph.)
This is really great!
A few weeks ago I was playing with the Good Regulator and John's Gooder version and incidentally I also found myself pulling out some simple graphical manipulation rules. Your 'Markov re-rooting' came into play, and also various of the 'Bookkeeping' rules. You have various more exciting rules here too, thanks!
I also ended up noticing a kind of 'good regulator motif' as I tried expanding the setting with a few temporal steps and partial observability and so forth. Basically, doing some bookkeeping and coarse-graining, you can often find ...
@the gears to ascension, could you elaborate on what the ~25%
react on 'hardware' in
Would it be useful to have a term, analogous to 'hardware', ...
means? Is it responding to the whole sentence, 'Would it be useful to have...?' or some other proposition?
Separately, I'm not a fan of 'evolveware' or 'evoware' in particular, though I can't put my finger on exactly why. Possibly it's because of a connotation of ongoing evolution, which is sorta true in some cases but could be misleading as a signifier. Though the same criticism could be levelled against 'ML-ware', which I like more.
I hate to wheel this out again, but evolution-broadly-construed is actually a very close fit for gradient methods. Agreed there's a whole lot of specifics in biological natural selection, and a whole lot of specifics in gradient-methods-as-practiced, but they are quite akin really.
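One concrete point of contact, as a hedged aside (my go-to example, not something argued in the thread): the evolution-strategies estimator recovers a smoothed gradient purely from selection-weighted random variation.

```latex
% Evolution strategies as a gradient method: mutate, evaluate, and recombine
% weighted by fitness. For smoothing scale \sigma and fitness f:
\[
\nabla_\theta \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\bigl[ f(\theta + \sigma \epsilon) \bigr]
  \;=\; \frac{1}{\sigma}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\bigl[ f(\theta + \sigma \epsilon)\, \epsilon \bigr],
\]
% i.e. averaging the random perturbations, weighted by how well they did, is an
% unbiased estimate of the gradient of the smoothed fitness landscape.
```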
noware? everyware? anyware? selfaware? please-beware?
(jokes, don't crucify me)
If humans had the computational capacity, they would lie a lot more and calculate personal advantage a lot more. But since those are both computationally expensive, and therefore can be caught-out by other humans, the heuristic / value of "actually care about your friends", is competitive with "always be calculating your personal advantage."
I think there's a missing connection here. At least, it seemed a non sequitur to me on first read: I took it to be positing that scaling up given humans' computational capacity, ceteris paribus, makes...
And as the saying goes, "humans are the least general intelligence which can manage to take over the world at all" - otherwise we'd have taken over the world earlier.
A classic statement of this is by Bostrom, in Superintelligence.
Far from being the smartest possible biological species, we are probably better thought of as the stupidest possible biological species capable of starting a technological civilization - a niche we filled because we got there first, not because we are in any sense optimally adapted to it.
I’m hazier on the details of how this would play out (and a bit sceptical that it would enable a truly runaway feedback loop), but more sophisticated systems could help to gather the real-world data to make subsequent finetuning efforts more effective.
On the contrary, I think proactive gathering of data is very plausibly the bottleneck, and (smarts) -> (better data gathering) -> (more smarts) is high on my list of candidates for the critical feedback loop.
In a world where the 'big two' (R&D and executive capacity) are characterised by driving ...
In particular, the 'big two' are both characterised by driving beyond the frontier of the well-understood, which means that by necessity they're about deliberately and efficiently setting up informative/serendipitous scenarios to get novel, informative data. When you're navigating beyond the well-understood, you have to bottom out your plans with heuristic guesses about VOI, and you have to make plans which (at least sometimes) have good VOI. Those guesses have to ground out somewhere, and that's the 'research taste' at the system-1-ish level.
...I think there are two really important applications, which have the potential to radically reshape the world:
- Research
  - The ability to develop and test out new ideas, adding to the body of knowledge we have accumulated
  - Automating this would be a massive deal for the usual reasons about feeding back into growth rates, facilitating something like a singularity
  - In particular the automation of further AI development is likely to be important
  - There are many types of possible research, and automation may look quite different for e.g. empirical medical research vs fund
I swiftly edited that to read
we have not found it written in the universe
but your reply obviously beat me to it! I agree, there is plausibly some 'actual valence magnitude' which we 'should' normatively account for in aggregations.
In behavioural practice, it comes down to what cooperative/normative infrastructure is giving rise to the cooperative gains which push toward the Pareto frontier. e.g.
I think this post is mostly about how to do the reflection, consistentising, and so on.
But at the risk of oversimplifying, let's pretend for a moment we just have some utility functions.
Then you can for sure aggregate them into a mega utility function (at least in principle). This is very underspecified!! Predominantly because of the question of how to weight the individual utility functions in the aggregation. (Holden has a nice discussion of Harsanyi's aggregation theorem which goes into some more detail, but yes, we have not found it written in...
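As a minimal sketch of where the underspecification lives (standard Harsanyi-style form, my notation):

```latex
% Harsanyi-style aggregation: social utility as a weighted sum of individual utilities.
% The weights w_i, together with each u_i's arbitrary positive affine rescaling
% u_i -> a_i u_i + b_i, are exactly the underspecified part.
\[
U(x) \;=\; \sum_i w_i \, u_i(x), \qquad w_i \ge 0 .
\]
```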
Thanks for this! I hadn't seen those quotes, or at least hadn't remembered them.
I actually really appreciate Alex sticking his neck out a bit here and suggesting this LessWrong dialogue. We both have some contrary opinions, but his takes were probably a little more predictably unwelcome in this venue. (Maybe we should try this on a different crowd - we could try rendering this on Twitter too, lol.)
There's definitely value to being (rudely?) shaken out of lazy habits of thinking - though I might not personally accuse someone of fanfiction research! As discu...