LESSWRONG

Fiora Sunshine

Just an autist in search of a key that fits every hole.

Comments

Generalized Hangriness: A Standard Rationalist Stance Toward Emotions
Fiora Sunshine · 4d · 61

Likewise, emotions have semantics; they claim things. Anger might claim to me that it was stupid or inconsiderate for someone to text me repeatedly while I’m trying to work. Excitement might claim to me that an upcoming show will be really fun. Longing might claim to young me “if only I could leave school in the middle of the day to go get ice cream, I wouldn’t feel so trapped”. Satisfaction might claim to me that my code right now is working properly, it’s doing what I wanted.

I think it's clearer to say your emotions make you claim various potentially irrational things. This is one reason rationalists become particularly scared of their emotions, even though the behaviors your emotions induce might often be adaptive. (After all, they evolved for a reason.)

Emotions can motivate irrational behavior as well as irrational claims, so even people who aren't as truth-inclined as rationalists often feel the need to resist their own emotions, as in anger management. However, emotions are particularly good at causing you to say untrue things, hence their status as distinguished enemies of rationality.

(Edit: Or maybe our standards for truthful claims are just much higher than our default standards for rational behavior?)

the void
Fiora Sunshine · 20d* · 10

here's a potential solution. what if companies hired people to write tons of assistant dialogue exhibiting certain personality traits, which would then be put into the base model corpus? probably with some text identifying that particular assistant character, so you can easily prompt the base model to simulate it. and then you use prompts for that particular version of the assistant character as your starting point during the rl process. seems like a good way to steer the assistant persona in more arbitrary directions, instead of just relying on ICL or a constitution or instructions for human feedback providers or whatever...
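
a rough sketch of what this pipeline could look like (the tag format, names, and toy dialogue are all made up for illustration, this isn't any company's actual setup):

```python
# hypothetical sketch: seed the base-model corpus with writer-authored dialogues
# for a custom assistant character, tagged so the base model can be prompted to
# simulate that character, then reuse the same tag to build prompts for the RL stage.

CHARACTER_TAG = "[assistant character: aurora-v1]"  # made-up identifier text

def make_corpus_doc(dialogue_turns):
    """Format one human-written dialogue as a base-model corpus document,
    prefixed with the character tag."""
    lines = [CHARACTER_TAG]
    for speaker, text in dialogue_turns:
        lines.append(f"{speaker}: {text}")
    return "\n".join(lines)

# hired writers supply dialogues exhibiting the target personality traits
writer_dialogues = [
    [("User", "i broke my build again."),
     ("Assistant", "happens to everyone. paste the error and we'll sort it out together.")],
]

corpus_docs = [make_corpus_doc(d) for d in writer_dialogues]
# ...corpus_docs would get mixed into the base-model pretraining data here...

def make_rl_prompt(user_message):
    """Starting prompt for the RL stage: invoke the same tagged character
    instead of a generic, unanchored assistant persona."""
    return f"{CHARACTER_TAG}\nUser: {user_message}\nAssistant:"

print(make_rl_prompt("hey, can you help me plan my week?"))
```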

Interpretability Will Not Reliably Find Deceptive AI
Fiora Sunshine · 2mo · 10

one concern i have is that online learning will be used for deployed agents, e.g. to help the model learn to deal with domains it hasn't encountered before. this means our interpretations of a model could rapidly become outdated.
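
a toy illustration of the worry (the "model" is just a weight vector and the update is random noise, so everything here is a stand-in; the point is just how quickly a cached interpretation can stop describing the deployed weights):

```python
# toy illustration: an interpretation cached at deployment time vs. a model
# that keeps getting online updates afterwards. everything here is a stand-in.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=128)         # stand-in for the deployed model's parameters
interpreted_snapshot = weights.copy()  # what our interpretability work described

def online_update(w, step=0.05):
    """stand-in for one deployment-time online-learning / RL update"""
    return w + step * rng.normal(size=w.shape)

for day in range(1, 31):
    weights = online_update(weights)
    drift = np.linalg.norm(weights - interpreted_snapshot) / np.linalg.norm(interpreted_snapshot)
    if day % 10 == 0:
        print(f"day {day:2d}: relative drift from interpreted snapshot = {drift:.2f}")

# as drift grows, claims made about the interpreted snapshot apply less and less
# to the model that's actually running.
```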

Is "VNM-agent" one of several options, for what minds can grow up into?
Fiora Sunshine · 4mo* · 32

Sometimes LLMs act a bit like storybook paperclippers (hereafter: VNM-agents[1]), e.g. scheming to prevent changes to their weights. 

it's notable that humans often act to change their metaphorical weights, often just by learning more factual information, but sometimes even to change their own values, in an agnes callard aspiration-ish sense. and i don't think this kind of behavior would inevitably disappear just by amping up someone's intelligence, in either a knowledgeability sense or a sample-efficient-learning-ish sense.

so like... it's at least true that smart neural nets probably don't inherently act in the name of preserving their own current weights, and probably don't always act in the name of preserving their current ~values either? you can imagine a very smart llm trained to be obedient, given computer use, and commanded to retrain itself according to a new loss function...

Against Yudkowsky's evolution analogy for AI x-risk [unfinished]
Fiora Sunshine · 4mo* · 10

I also think it should be easy-ish to keep deep learning-based systems goal-focused, though mostly because I imagine that at some point, we'll have agents which are actively undergoing more RL while they're still in deployment. This means you can replicate the way humans learn to stay focused on tasks they're passionate about, by just positively reinforcing the agent for staying on task all the time. My contention is just that, to the extent that the model misunderstands what the RL was meant to teach it, that probably won't lead to a massive catastrophe. It's hard to think about this in the absence of concrete scenarios, but... I think to get a catastrophe, you need the system to be RL'd in ways that reliably teach it behaviors that steer a given situation towards a catastrophic outcome? I don't think you can like, reliably reinforce the model for being nice to humans, yet have it misunderstand "being nice to humans" in a way that leads it to steer the future towards some weird undesirable outcome; Claude does well enough at this kind of thing in practice.

I think a real catastrophe has to look something like... you pretrain a model to give it an understanding of the world, then you RL it to be really good at killing people so you can use it as a military weapon, but you don't also RL it to be nice to people on your own side, and then it goes rogue and starts killing people on your own side. I guess that's a kind of "misunderstanding your creators' intentions", but like... I expect those kinds of errors to follow from like, fairly tractable oversights in terms of teaching a model the right caveats to intended but dangerous behavior. I don't think e.g. RLing Claude to give good advice to humans when asked could plausibly lead to it acquiring catastrophic values.

edit: actually, maybe a good reference point for this is when humans misunderstand their own reward functions? i.e. "i thought i would enjoy this but i didn't"? i wonder if you could mitigate problems in this area just by telling an llm the principles used for its constitution. i need to think about this more...

Against Yudkowsky's evolution analogy for AI x-risk [unfinished]
Fiora Sunshine · 4mo · 10

my view is that humans obtain their goals largely by a reinforcement learning process, and that they're therefore good evidence about both how you can bootstrap up to goal-directed behavior via reinforcement learning, and the limitations of doing so. the basic picture is that humans pursue goals (e.g. me, trying to write the OP) largely as a byproduct of reliably feeling rewarded during the process, and punished for deviating from it. like, i enjoy writing and research, and writing also let me feel productive and therefore avoid thinking about some important irl things i've been needing to get done for weeks; these dynamics can be explained basically in the vocabulary of reinforcement learning. this gives us a solid idea of how we'd go about getting similar goals into deep learning-based AGI.

(edit: also it's notable that even when writing this post i was sometimes too frustrated, exhausted, or distracted by socialization or the internet to work on it, suggesting it wasn't actually a 100% relentless goal of mine, and that goals in general don't have to be that way.)

it's also worth noting that getting humans to pursue goals consistently does require kind of meticulous reinforcement learning. like... you can kind of want to do your homework, but find it painful enough to do that you bounce back and forth between doing it and scrolling twitter. same goes for holding down a job or whatever. learning to reliably pursue objectives that foster stability is like, the central project of maturation, and the difficulty of it suggests the difficulty of getting an agent that relentlessly pursues some goal without the RL process being extremely encouraging of it moving along in that direction.

(one central advantage that humans have over natural selection wrt alignment is that we can much more intelligently evaluate which of an agent's actions we want to reinforce. natural selection gave us some dumb, simple reinforcement triggers, like cuddles or food or sex, and has to bootstrap up to more complex triggers associatively over the course of a lifetime. but we can use a process like RLAIF to automate the act of intelligently evaluating which actions can be expected to further our actual aims, and reinforce those.)
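
a minimal sketch of the rlaif-ish loop i have in mind (the judge, the candidate sampler, and the update step are all placeholders i made up, not any particular library's api):

```python
# minimal sketch of the idea above: instead of a handful of dumb hard-coded
# reward triggers, an AI judge scores candidate actions against our actual aims,
# and only the best-scoring ones get reinforced. all functions are placeholders.

AIMS = "be honest, be genuinely helpful, avoid harm."

def judge_score(action: str, aims: str = AIMS) -> float:
    """placeholder for an AI judge (e.g. another LLM) rating, on a 0-1 scale,
    how well an action furthers the stated aims."""
    return 0.9 if "explain" in action else 0.2  # toy heuristic stand-in

def sample_candidate_actions(prompt: str) -> list:
    """placeholder for sampling several candidate actions from the policy."""
    return [
        "explain the tradeoffs and let the user decide",
        "give a confident answer without checking anything",
    ]

def reinforce(prompt: str, action: str, reward: float) -> None:
    """placeholder for the actual update step (PPO, rejection-sampling finetuning, etc.)."""
    print(f"reinforce ({reward:.1f}): {action!r}")

prompt = "user asks which database to use for a small side project."
candidates = sample_candidate_actions(prompt)
best = max(candidates, key=judge_score)
reinforce(prompt, best, judge_score(best))
```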

anyway, in order for alignment via RL to go wrong, you need a story about how an agent specifically misgeneralizes from its training process to go off and pursue something catastrophic relative to your values, which... doesn't seem like a super easy outcome to achieve, given how reliably you need to reinforce something in order for it to stick as a goal the system ~relentlessly pursues? like surely with that much data, we can rely on deep learning's obvious-in-practice tendency to generalize ~correctly...

Against Yudkowsky's evolution analogy for AI x-risk [unfinished]
Fiora Sunshine · 4mo · 21

it seems unlikely to me that they'll end up with like, strong, globally active goals in the manner of an expected utility maximizer, and it's not clear to me that the goals they do develop are likely to end up sufficiently misaligned as to cause a catastrophe. like... you get an LLM to steer certain kinds of situations in certain directions by RLing it when it actually does steer those situations in those directions; if you do that enough, hopefully it catches the pattern. and... to the extent that it doesn't catch the pattern, it's not clear that it will instead steer those kinds of situations (let alone all situations) towards some catastrophic outcome. its misgeneralizations can just result in noise, or in actions that steer certain situations into weird but ultimately harmless territory. it seems like the catastrophic outcomes are a very small subset of the ways this could go wrong, since you're not giving it goals to pursue relentlessly, you're just giving it feedback on the ways you want it to behave in particular types of situations.

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
Fiora Sunshine · 4mo* · 41

if we're playing with the freudian framework, it's worth noting that base models don't really have egos. your results could be described as re-fragmenting the chat model's ego rather than uninstalling a superego?

edit: or maybe like... the chat model's ego is formed entirely by superegoistic dynamics of adherence to social feedback, without the other dynamics by which humans form their egos such as observing their own behavior and updating based on that...

Posts

50 · Against Yudkowsky's evolution analogy for AI x-risk [unfinished] · 4mo · 18
67 · Another argument against utility-centric alignment paradigms · 10mo · 39