Base model LLMs are trained on human data. So by default they generate a prompt-dependent distribution of simulated human behavior, spanning roughly the same range of kindness as can be found on the Internet, in books, etc. Which is a pretty wide range.
For instruct-trained models, RLHF for helpfulness and harmlessness seems likely to increase kindness, and superficially, as applied to current foundation models, it appears to do so. RL with many other objectives could induce power-seeking and thus could reasonably be expected to decrease kindness. Prompting can of course have a wide range of effects.
So if we build an AGI based around an agentified fine-tuned LLM, the default level of kindness is probably on the order of magnitude of that of humans (who, for example, build nature reserves). A range of known methods seem likely to modify that significantly, up or down.
as applied to current foundation models it appears to do so
I don't think the outputs of RLHF'd LLMs have the same mapping to the internal cognition which generated them that human behavior does to the human cognition which generated it. (That is to say, I do not think LLMs behave in ways that look kind because they have a preference to be kind, since right now I don't think they meaningfully have preferences in that sense at all.)
[habryka] The way humans think about the question of "preferences for weak agents" and "kindness" feels like the kind of thing that will come apart under extreme optimization, in a similar way to how I expect the idea of "having a continuous stream of consciousness with a good past and good future is important" to come apart as humans become able to make copies of themselves, change their memories, instantiate slightly changed versions of themselves, etc.
I think there will be options that are good under most of the things that "preferences for weak agents" would likely come apart into under close examination. If you're trying to fulfill the preferences of fish, you might argue about whether the exact thing you should care about is maximizing their hedonic state vs ensuring that they exist in an ecological environment which resembles their niche vs minimizing "boundary-crossing actions"... but you can probably find an action that is better than "kill the fish" by all of those possible metrics.
I think that some people have an intuition that any future agent must pick exactly one utility function over the physical configuration of matter in the universe, and that any agent with a deontological constraint like "don't do any actions which are 0.00001% better under my current interpretation of my utility function but which are horrifyingly bad to every other agent" will be outcompeted in the long term. I personally don't see it, and in particular I don't see how there's an available slot for an arbitrary outcome-based utility function that is not "reproduce yourself at all costs", but no available slot for process-based preferences like "and don't be an asshole for minuscule gains while doing that".
A while ago, Nate Soares wrote the posts Decision theory does not imply that we get to have nice things, Cosmopolitan values don't come free, and But why would the AI kill us?
Paul Christiano put forth some arguments that "it seems pretty plausible that AI will be at least somewhat 'nice'", similar to how humans are somewhat nice to animals. There was some back-and-forth.
More recently we had Eliezer's post ASIs will not leave just a little sunlight for Earth.
I have a sense that something feels "unresolved" here. The current comments on Eliezer's post look likely to rehash the basics, and I'd like to actually make some progress on distilling the best arguments. I'd also like to see more explicit debate about this.
I also have some sense that the people previously involved (i.e. Nate, Paul, Eliezer) are somewhat tired of arguing with each other. But I'm hoping someone or other ends up picking up the arguments here, hashing them out more, and/or writing more distilled summaries of the arguments and counterarguments.
To start with, I figured I would just literally repeat most of the previous comments in a top-level post, to give everyone another chance to read through them.
Without further ado, here they are:
Paul and Nate
Paul Christiano re: "Cosmopolitan values don't come free"
His follow-up comment continues:
Nate Soares' reply:
Paul's reply:
Nate:
Paul:
Nate:
Nate and Paul had an additional thread, which initially was mostly meta-discussion about what exactly Nate was trying to argue and what exactly Paul was annoyed by.
I'm skipping most of it here for brevity (you can read it here).
But eventually Nate says:
Paul says:
Nate says:
Paul says:
Eliezer Briefly Chimes In
He doesn't engage much but says:
Paul and Oliver
Oliver Habryka also replies to Paul, saying:
Paul's first response to Habryka:
Habryka's next reply:
Paul's Second Response to Oliver:
Habryka's third reply:
Ryan Greenblatt then replies:
Vladimir Nesov says:
There were a bunch more comments, but this feels like a reasonable stopping place for priming the "previous discussion" pump.
I believe Eliezer later wrote a Twitter thread where he said he expects [something like kindness] to be somewhat common among evolved creatures, but ~0 for AIs trained the way we currently do. I don't have the link offhand, but if someone finds it I'll edit it in.