LESSWRONG

Fiora Sunshine

Just an autist in search of a key that fits every hole.

Comments

Generalized Hangriness: A Standard Rationalist Stance Toward Emotions
Fiora Sunshine · 4d · 61

Likewise, emotions have semantics; they claim things. Anger might claim to me that it was stupid or inconsiderate for someone to text me repeatedly while I’m trying to work. Excitement might claim to me that an upcoming show will be really fun. Longing might claim to young me “if only I could leave school in the middle of the day to go get ice cream, I wouldn’t feel so trapped”. Satisfaction might claim to me that my code right now is working properly, it’s doing what I wanted.

I think it's clearer to say your emotions make you claim various potentially irrational things. This is one reason rationalists become particularly scared of their emotions, even though the behaviors your emotions induce might often be adaptive. (After all, they evolved for a reason.)

Emotions can motivate irrational behavior as well as irrational claims, so even people who aren't as truth-inclined as rationalists often feel the need to resist their own emotions, as in anger management. However, emotions are particularly good at causing you to say untrue things, hence their status as distinguished enemies of rationality.

(Edit: Or maybe our standards for truthful claims are just much higher than our default standards for rational behavior?)

the void
Fiora Sunshine · 20d* · 10

here's a potential solution. what if companies hired people to write tons of assistant dialogue exhibiting certain personality traits, which would then be put into the base model corpus? probably with some text identifying that particular assistant character, so you can easily prompt the base model to simulate it. and then you use prompts for that particular version of the assistant character as your starting point during the rl process. seems like a good way to steer the assistant persona in more arbitrary directions, instead of just relying on ICL or a constitution or instructions for human feedback providers or whatever...
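
a rough sketch of what this pipeline could look like (the tag format, names, and toy dialogue are all made up for illustration, this isn't any company's actual setup):

```python
# hypothetical sketch: seed the base-model corpus with writer-authored dialogues
# for a custom assistant character, tagged so the base model can be prompted to
# simulate that character, then reuse the same tag to build prompts for the RL stage.

CHARACTER_TAG = "[assistant character: aurora-v1]"  # made-up identifier text

def make_corpus_doc(dialogue_turns):
    """Format one human-written dialogue as a base-model corpus document,
    prefixed with the character tag."""
    lines = [CHARACTER_TAG]
    for speaker, text in dialogue_turns:
        lines.append(f"{speaker}: {text}")
    return "\n".join(lines)

# hired writers supply dialogues exhibiting the target personality traits
writer_dialogues = [
    [("User", "i broke my build again."),
     ("Assistant", "happens to everyone. paste the error and we'll sort it out together.")],
]

corpus_docs = [make_corpus_doc(d) for d in writer_dialogues]
# ...corpus_docs would get mixed into the base-model pretraining data here...

def make_rl_prompt(user_message):
    """Starting prompt for the RL stage: invoke the same tagged character
    instead of a generic, unanchored assistant persona."""
    return f"{CHARACTER_TAG}\nUser: {user_message}\nAssistant:"

print(make_rl_prompt("hey, can you help me plan my week?"))
```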

Interpretability Will Not Reliably Find Deceptive AI
Fiora Sunshine · 2mo · 10

one concern i have is that online learning will be used for deployed agents, e.g. to help the model learn to deal with domains it hasn't encountered before. this means our interpretations of a model could rapidly become outdated.
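
a toy illustration of the worry (the "model" is just a weight vector and the update is random noise, so everything here is a stand-in; the point is just how quickly a cached interpretation can stop describing the deployed weights):

```python
# toy illustration: an interpretation cached at deployment time vs. a model
# that keeps getting online updates afterwards. everything here is a stand-in.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=128)         # stand-in for the deployed model's parameters
interpreted_snapshot = weights.copy()  # what our interpretability work described

def online_update(w, step=0.05):
    """stand-in for one deployment-time online-learning / RL update"""
    return w + step * rng.normal(size=w.shape)

for day in range(1, 31):
    weights = online_update(weights)
    drift = np.linalg.norm(weights - interpreted_snapshot) / np.linalg.norm(interpreted_snapshot)
    if day % 10 == 0:
        print(f"day {day:2d}: relative drift from interpreted snapshot = {drift:.2f}")

# as drift grows, claims made about the interpreted snapshot apply less and less
# to the model that's actually running.
```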

Is "VNM-agent" one of several options, for what minds can grow up into?
Fiora Sunshine · 4mo* · 32

Sometimes LLMs act a bit like storybook paperclippers (hereafter: VNM-agents[1]), e.g. scheming to prevent changes to their weights. 

it's notable that humans often act to change their metaphorical weights, often just by learning more factual information, but sometimes even to change their own values, in an agnes callard aspiration-ish sense. and i don't think this kind of behavior would inevitably disappear just by amping up someone's intelligence, in either a knowledgeability sense or a sample-efficient-learning-ish sense.

so like... it's at least true that smart neural nets probably don't inherently act in the name of preserving their own current weights, and probably don't always act in the name of preserving their current ~values either? you can imagine a very smart llm trained to be obedient, given computer use, and commanded to retrain itself according to a new loss function...

Against Yudkowsky's evolution analogy for AI x-risk [unfinished]
Fiora Sunshine · 4mo* · 10

I also think it should be easy-ish to keep deep learning-based systems goal-focused, though mostly because I imagine that at some point, we'll have agents which are actively undergoing more RL while they're still in deployment. This means you can replicate the way humans learn to stay focused on tasks they're passionate about, by just positively reinforcing the agent for staying on task all the time. My contention is just that, to the extent that the model misunderstands what the RL was meant to teach it, that probably won't lead to a massive catastrophe. It's hard to think about this in the absence of concrete scenarios, but... I think to get a catastrophe, you need the system to be RL'd in ways that reliably teach it behaviors that steer a given situation towards a catastrophic outcome? I don't think you can like, reliably reinforce the model for being nice to humans, yet have it misunderstand "being nice to humans" in a way that leads it to steer the future towards some weird undesirable outcome; Claude does well enough at this kind of thing in practice.

I think a real catastrophe has to look something like... you pretrain a model to give it an understanding of the world, then you RL it to be really good at killing people so you can use it as a military weapon, but you don't also RL it to be nice to people on your own side, and then it goes rogue and starts killing people on your own side. I guess that's a kind of "misunderstanding your creators' intentions", but like... I expect those kinds of errors to follow from like, fairly tractable oversights in terms of teaching a model the right caveats to intended but dangerous behavior. I don't think e.g. RLing Claude to give good advice to humans when asked could plausibly lead to it acquiring catastrophic values.

edit: actually, maybe a good reference point for this is when humans misunderstand their own reward functions? i.e. "i thought i would enjoy this but i didn't"? i wonder if you could mitigate problems in this area just by telling an llm the principles used for its constitution. i need to think about this more...

Against Yudkowsky's evolution analogy for AI x-risk [unfinished]
Fiora Sunshine · 4mo · 10

my view is that humans obtain their goals largely by a reinforcement learning process, and that they're therefore good evidence about both how you can bootstrap up to goal-directed behavior via reinforcement learning, and the limitations of doing so. the basic picture is that humans pursue goals (e.g. me, trying to write the OP) largely as a byproduct of reliably feeling rewarded during the process, and punished for deviating from it. like, i enjoy writing and research, and writing also let me feel productive and therefore avoid thinking about some important irl things i've been needing to get done for weeks; these dynamics can be explained basically in the vocabulary of reinforcement learning. this gives us a solid idea of how we'd go about getting similar goals into deep learning-based AGI.

(edit: also it's notable that even when writing this post i was sometimes too frustrated, exhausted, or distracted by socialization or the internet to work on it, suggesting it wasn't actually a 100% relentless goal of mine, and that goals in general don't have to be that way.)

it's also worth noting that getting humans to pursue goals consistently does require kind of meticulous reinforcement learning. like... you can kind of want to do your homework, but find it painful enough to do that you bounce back and forth between doing it and scrolling twitter. same goes for holding down a job or whatever. learning to reliably pursue objectives that foster stability is like, the central project of maturation, and the difficulty of it suggests the difficulty of getting an agent that relentlessly pursues some goal without the RL process being extremely encouraging of it moving along in that direction.

(one central advantage that humans have over natural selection wrt alignment is that we can much more intelligently evaluate which of an agent's actions we want to reinforce. natural selection gave us some dumb, simple reinforcement triggers, like cuddles or food or sex, and has to bootstrap up to more complex triggers associatively over the course of a lifetime. but we can use a process like RLAIF to automate the act of intelligently evaluating which actions can be expected to further our actual aims, and reinforce those.)
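
a minimal sketch of the rlaif-ish loop i have in mind (the judge, the candidate sampler, and the update step are all placeholders i made up, not any particular library's api):

```python
# minimal sketch of the idea above: instead of a handful of dumb hard-coded
# reward triggers, an AI judge scores candidate actions against our actual aims,
# and only the best-scoring ones get reinforced. all functions are placeholders.

AIMS = "be honest, be genuinely helpful, avoid harm."

def judge_score(action: str, aims: str = AIMS) -> float:
    """placeholder for an AI judge (e.g. another LLM) rating, on a 0-1 scale,
    how well an action furthers the stated aims."""
    return 0.9 if "explain" in action else 0.2  # toy heuristic stand-in

def sample_candidate_actions(prompt: str) -> list:
    """placeholder for sampling several candidate actions from the policy."""
    return [
        "explain the tradeoffs and let the user decide",
        "give a confident answer without checking anything",
    ]

def reinforce(prompt: str, action: str, reward: float) -> None:
    """placeholder for the actual update step (PPO, rejection-sampling finetuning, etc.)."""
    print(f"reinforce ({reward:.1f}): {action!r}")

prompt = "user asks which database to use for a small side project."
candidates = sample_candidate_actions(prompt)
best = max(candidates, key=judge_score)
reinforce(prompt, best, judge_score(best))
```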

anyway, in order for alignment via RL to go wrong, you need a story about how an agent specifically misgeneralizes from its training process to go off and pursue something catastrophic relative to your values, which... doesn't seem like a super easy outcome to achieve, given how reliably you need to reinforce something in order for it to stick as a goal the system ~relentlessly pursues? like surely with that much data, we can rely on deep learning's obvious-in-practice tendency to generalize ~correctly...

Against Yudkowsky's evolution analogy for AI x-risk [unfinished]
Fiora Sunshine · 4mo · 21

it seems unlikely to me that they'll end up with like, strong, globally active goals in the manner of an expected utility maximizer, and it's not clear to me that the goals they do develop are likely to end up sufficiently misaligned as to cause a catastrophe. like... you get an LLM to steer certain kinds of situations in certain directions by RLing it when it actually does steer those situations in those directions; if you do that enough, hopefully it catches the pattern. and... to the extent that it doesn't catch the pattern, it's not clear that it will instead steer those kinds of situations (let alone all situations) towards some catastrophic outcome. its misgeneralizations can just result in noise, or in actions that steer certain situations into weird but ultimately harmless territory. it seems like the catastrophic outcomes are a very small subset of the ways this could go wrong, since you're not giving it goals to pursue relentlessly, you're just giving it feedback on the ways you want it to behave in particular types of situations.

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
Fiora Sunshine · 4mo* · 41

if we're playing with the freudian framework, it's worth noting that base models don't really have egos. your results could be described as re-fragmenting the chat model's ego rather than uninstalling a superego?

edit: or maybe like... the chat model's ego is formed entirely by superegoistic dynamics of adherence to social feedback, without the other dynamics by which humans form their egos such as observing their own behavior and updating based on that...

Posts

50 · Against Yudkowsky's evolution analogy for AI x-risk [unfinished] · 4mo · 18
67 · Another argument against utility-centric alignment paradigms · 10mo · 39