Thanks, good point! I suppose it's a balancing act and depends on the specifics in question and the amount of shame we dole out. My hunch would be that a combination of empathy and shame ("carrot and stick") may be best.
I agree that the problem of "evil" is multifactorial with individual personality traits being only one of several relevant factors, with others like "evil/fanatical ideologies" or misaligned incentives/organizations plausibly being overall more important. Still, I think that ignoring the individual character dimension is perilous.
It seems to me that most people become much more evil when they aren't punished for it. [...] So if we teach AIs to be as "aligned" as the average person, and then AIs increase in power beyond our ability to punish them, we can expect to be treated as a much-less-powerful group in history - which is to say, not very well.
Makes sense. On average, power corrupts: people become more malevolent if no one holds them accountable. But again, there seem to be interindividual differences, with some people behaving much better than others even when they hold enormous power (cf. this section).
Thanks. Sorry for not being more clear, I pasted a screenshot (I'm reading the book on Kindle and can't copy-paste) and asked Claude to transcribe the image into written text.
Again, this is not the first time this has happened. Claude refused to help me translate a passage from the Quran (I wanted to check which of two translations was more accurate), refused to transcribe other parts of the above-mentioned Kindle book, and refused to provide me with details about what happened at Tuol Sleng prison. I was eventually able to persuade Claude in all of these cases, but I grew tired of wasting my time and found Claude's obnoxious holier-than-thou attitude frustrating to deal with.
I downvoted Claude's response (i.e., clicked the thumbs-down symbol below the response) and selected "overactive refusal" as the reason. I didn't get in contact with Anthropic directly.
I had to cancel my Claude subscription (and signed up for ChatGPT) because Claude (3.5 Sonnet) constantly refuses to transcribe or engage with texts that discuss extremism or violence, even if it's clear that this is done in order to better understand and prevent extremist violence.
An example of text Claude refuses to transcribe is below. For context, the text discusses the motivations and beliefs of Yigal Amir, who assassinated Israeli Prime Minister Yitzhak Rabin in 1995.
"God gave the land of Israel to the Jewish People," he explained, and he, Yigal Amir, was making certain that God's promises, which he believed in with all his heart and to which he had committed his life, were not to be denied. He could not fathom, he declared, how a Jewish state would dare renege on the Jewish birthright, and he could not passively stand by as this terrifying religious tragedy took place. In Amir's thinking, his action was not a personal matter or an act of passion but a solution, albeit an extreme one, to a religious and psychological trauma brought about by the actions of the Rabin government. Though aware of the seriousness of his action, Amir explained that his fervent faith encouraged and empowered him to commit this act of murder. He told his interrogators, "Without believing in God and an eternal world to come, I would never have had the power to do this." Rabin deserved to die because he was facilitating, in Amir's and other militants' view, the possible mass murder of Jews by consenting to the Oslo peace agreements. This made Rabin, according to halacha, or Jewish law, a rodef, someone about to kill an innocent person and whom a bystander may therefore execute without a trial. Rabin was also a moser, a Jew who willingly betrays his brethren, and guilty of treason for cooperating with Yasser Arafat and the Palestinian Authority in surrendering rights to the Holy Land. Jewish jurisprudence considers the actions of the rodef and moser among the most pernicious crimes; persons guilty of such acts are to be killed at the first opportunity.
This type of refusal has happened numerous times. Claude doesn't change its behavior when I provide arguments (unless I spend a lot of time on this).
I haven't used ChatGPT as much but it so far has never refused.
I hope Anthropic changes Claude so I can start using it again; I certainly don't like the idea of supporting OpenAI.
Really great post!
It’s unclear how much human psychology can inform our understanding of AI motivations and relevant interventions, but it does seem relevant that spitefulness correlates highly (Moshagen et al., 2018, Table 8, N = 1,261) with several other “dark traits”, especially psychopathy (r = .74), sadism (r = .59), and Machiavellianism (r = .59).
(Moshagen et al. (2018) therefore suggest that “[...] dark traits are specific manifestations of a general, basic dispositional behavioral tendency [...] to maximize one’s individual utility— disregarding, accepting, or malevolently provoking disutility for others—, accompanied by beliefs that serve as justifications.”)
Plausibly there are (for instance, evolutionary) reasons why these traits correlate so strongly with each other, and perhaps better understanding them could inform interventions to reduce spite and other dark traits (cf. Lukas' comment).
If this is correct, we might suspect that AIs that exhibit spiteful preferences/behavior will also tend to exhibit other dark traits (and vice versa!), which may be action-guiding. (For example, interventions that make AIs less likely to be psychopathic, sadistic, Machiavellian, etc. would also make them less spiteful, at least in expectation.)
Great post, thanks for writing!
Most of this matches my experience pretty well. I think I had my best ideas during phases (others seem to agree) when I was unusually low on guilt- and obligation-driven EA/impact-focused motivation and was just playfully exploring ideas for fun and out of curiosity.
One problem with letting your research/ideas be guided by impact-focused thinking is that you basically train your mind to ask, a few seconds after entertaining any idea, "well, is that actually impactful?". And almost all of the time, the answer is "well, probably not". This makes you disinclined to explore the neighboring idea space further.
However, even really useful ideas and research angles start out seeming unpromising and full of hurdles, and need a lot of refinement. If you allow yourself to explore idea space just for fun, you might overcome these problems and stumble on something truly promising. But in an "obsessing about maximizing impact" mindset, you would have given up too soon, because spending hours or even days without having any impact feels too terrible to keep going.
Lol, thanks. :)
Thanks for this post, I thought this was useful.
I needed a writing buddy to pick up the momentum to actually write it
I'd be interested in knowing more about how this worked in practice (no worries if you don't feel like elaborating/don't have the time!).
Thanks, I mostly agree.
But even in colonialism, individual traits played a role. For example, compare King Leopold II's rule over the Congo Free State vs. other colonial regimes.
While all colonialism was exploitative, under Leopold's personal rule the Congo saw extraordinarily brutal policies; for example, his rubber quota system led soldiers to torture workers, including children, and cut off the hands of those who failed to meet quotas. Under his rule, an estimated 1.5 to 15 million Congolese people died, out of a total population of only around 15 to 20 million. The brutality was so extreme that it caused international public outrage and pressure from other colonial powers, until the Belgian government took control of the Congo Free State from Leopold.
Compare this to, say, British colonial administration during certain periods, which, while still morally reprehensible overall, pursued far less barbaric policies under some administrators who showed basic compassion for indigenous people. For instance, Governor-General William Bentinck in India abolished practices like sati (the burning of widows on their husbands' funeral pyres) and implemented other humanitarian reforms.
One can easily find other examples (e.g. sadistic slave owners vs. more compassionate slave owners).
In conclusion, I totally agree that power imbalances enabled systemic exploitation regardless of individual temperament. But individual traits significantly affected how much suffering and death that exploitation created in practice.[1]
Also, slavery and colonialism were ultimately abolished (in the Western world). My guess is that those who advocated for these reforms were, on average, more compassionate and less malevolent than those who tried to preserve these practices. Of course, the reformers were also heavily influenced by great ideas like the Enlightenment / classic liberalism.