Status: some mix of common wisdom (that bears repeating in our particular context), and another deeper point that I mostly failed to communicate.

Short version

Harmful people often lack explicit malicious intent. It’s worth deploying your social or community defenses against them anyway. I recommend focusing less on intent and more on patterns of harm.

(Credit for my explicit articulation of this idea goes in large part to Aella, and also in part to Oliver Habryka.)

Long version

A few times now, I have been part of a community reeling from apparent bad behavior from one of its own. In the two most dramatic cases, the communities seemed pretty split on the question of whether the actor had ill intent.

A recent and very public case was the one of Sam Bankman-Fried, where many seem interested in the question of Sam's mental state vis-a-vis EA. (I recall seeing this in the responses to Kelsey's interview, but haven't done the virtuous thing of digging up links.)

It seems to me that local theories of Sam's mental state cluster along lines very roughly like (these are phrased somewhat hyperbolically):

  1. Sam was explicitly malicious. He was intentionally using the EA movement for the purpose of status and reputation-laundering, while personally enriching himself. If you could read his mind, you would see him making conscious plans to extract resources from people he thought of as ignorant fools, in terminology that would clearly relinquish all his claims to sympathy from the audience. If there were a camera, he would have turned to it and said "I'm going to exploit these EAs for everything they're worth."
  2. Sam was committed to doing good. He may have been ruthless and exploitative towards various individuals in pursuit of his utilitarian goals, but he did not intentionally set out to commit fraud. He didn't conceptualize his actions as exploitative. He tried to make money while providing risky financial assets to the masses, and foolishly disregarded regulations, and may have committed technical crimes, but he was trying to do good, and to put the resources he earned thereby towards doing even more good.

One hypothesis I have for why people care so much about some distinction like this is that humans have social/mental modes for dealing with people who are explicitly malicious towards them, who are explicitly faking cordiality in attempts to extract some resource. And these are pretty different from their modes of dealing with someone who's merely being reckless or foolish. So they care a lot about the mental state behind the act.

(As an example, various crimes legally require mens rea, lit. “guilty mind”, in order to be criminal. Humans care about this stuff enough to bake it into their legal codes.)

A third theory of Sam’s mental state that I have—that I credit in part to Oliver Habryka—is that reality just doesn’t cleanly classify into either maliciousness or negligence.

On this theory, most people who are in effect trying to exploit resources from your community, won't be explicitly malicious, not even in the privacy of their own minds. (Perhaps because the content of one’s own mind is just not all that private; humans are in fact pretty good at inferring intent from a bunch of subtle signals.) Someone who could be exploiting your community, will often act so as to exploit your community, while internally telling themselves lots of stories where what they're doing is justified and fine.

Those stories might include significant cognitive distortion, delusion, recklessness, and/or negligence, and some perfectly reasonable explanations that just don't quite fit together with the other perfectly reasonable explanations they have in other contexts. They might be aware of some of their flaws, and explicitly acknowledge those flaws as things they have to work on. They might be legitimately internally motivated by good intent, even as they wander down the incentive landscape towards the resources you can provide them. They can sub- or semi-consciously mold their inner workings in ways that avoid tripping your malice-detectors, while still managing to exploit you.

And, well, there’s mild versions of the above paragraph that apply to almost everyone, and I’m not sure how to sharpen it. (Who among us doesn’t subconsciously follow incentives, and live under the influence of some self-serving blind spots?)

But in the cases that dramatically blow up, the warp was strong enough to create a variety of advance warning signs that are obvious in hindsight. But also, yeah, it’s a matter of degree. I don’t think there’s a big qualitative divide that would be stark and apparent if you could listen in on private thoughts.


People do sometimes encounter adversaries who are explicitly malicious towards them. (For a particularly stark example, consider an enemy spy during wartime.) Spies and traitors and turncoats are real phenomena. Sometimes, the person you're interacting with really is treating you as a device that they're trying to extract information or money from; explicit conscious thoughts about this are really what you'd hear if you could read their mind.

I also think that that's not what most of the bad actors in a given community are going to look like. It's easy, and perhaps comfortable, to say "they were just exploiting this community for access to young vulnerable partners" or "they were just exploiting this community for the purpose of reputation laundering" or whatever. But in real life, I bet that if you read their mind, the answer would be far messier, and look much more like they were making various good-faith efforts to live by the values that your community professes.

I think it's important to acknowledge that fact, and build community processes that can deal with bad actors anyway. (Which is a point that I attribute in large part to Aella.)

There's an analogy between the point I'm making here, and the one that Scott Alexander makes in The Media Very Rarely Lies. Occasionally the media will literally fabricate stories, but usually not.

If our model is that there's a clear divide between people who are literally fabricating and people who are "merely" twisting words and bending truths, and that we mostly just have to worry about the former, then we'll miss most of the harm done. (And we’re likely to end up applying a double standard to misleading reporting done by our allies vs. our enemies, since we’re more inclined to ascribe bad intentions to our enemies.)

There's some temptation to claim that the truth-benders have crossed the bright red line into "lying", so that we can deploy the stronger mental defenses that we use against "liars".

But... that's not quite right; they aren't usually crossing that bright red line, and the places where they do cross that line aren’t necessarily the places where they’re misleading people the most. If you tell people to look out for the bright red line then you'll fail to sensitize them to the actual dangers that they're likely to face. The correct response is to start deploying stronger defenses against people who merely bend the truth.

(Despite the fact that lots of people bend the truth sometimes, like when their mom asks them if they’ve stopped dating blue-eyed people yet while implicitly threatening to feel a bunch of emotional pain if they haven’t, and they technically aren’t dating anyone right now (but of course they’d still date blue-eyed people given the opportunity) so they say “yes”. Which still counts as bending the truth! And differs only by a matter of degree! But which does not deserve a strong community response!)

(Though people do sometimes just make shit up, as is a separate harsh lesson.)

I think there's something similar going on with community bad actors. It's tempting to imagine that the local bad actors crossed bright red lines, and somehow hid that fact from everybody along the way; that they were mustache-twirling villains who were intentionally exploiting you while cackling about it in the depths of their mind. If that were true, it would activate a bunch of psychological and social defense mechanisms that communities often try to use to guard against bad actors.

But... historically, I think our bad actors didn't cross those bright red lines in a convenient fashion. And I think we need to be deploying the stronger community defenses anyway.

I don't really know how to do that (without causing a bunch of collateral damage from false positives, while not even necessarily averting false negatives much). But I hereby make a bid for focusing less on whether somebody is intentionally malicious.

I suggest minting a new word, for people who have the effects of malicious behavior, whether it's intentional or not. People who, if you step back and look at them, seem to leave a trail of misery in their wake, or a history of recklessness, or a pattern of negligence.

It's maybe fun to debate about whether they had mens rea, and the courts might care about the mens rea after it all blows up, but from our perspective, the main question is what behaviors they’re likely to engage in, and there turn out to be many really bad behaviors that don’t require malice at all.

I don't have any terminological suggestions that I love. My top idea so far is to repurpose the old word "malefactor" for someone who has a pattern of ill effects, regardless of their intent. (This in contrast with "enemy", which implies explicit ill intent.)

And for lack of a better word, I’ll suggest the word “maleficence” to describe the not-necessarily-malevolent mental state of a malefactor.

I think we should basically treat discussions about whether someone is malicious as recreation (when they do not explicitly have documentation of being a literal spy/traitor/etc., nor identify as an enemy), and I think that maleficence is what matters when deploying community (or personal) defense mechanisms.

61 comments

It's maybe fun to debate about whether they had mens rea, and the courts might care about the mens rea after it all blows up, but from our perspective, the main question is what behaviors they’re likely to engage in, and there turn out to be many really bad behaviors that don’t require malice at all.

I agree this is the main question, but I think it's bad to dismiss the relevance of mens rea entirely. Knowing what's going on with someone when they cause harm is important for knowing how best to respond, both for the specific case at hand and the strategy for preventing more harm from other people going forward.

I used to race bicycles with a guy who did some extremely unsportsmanlike things, of the sort that gave him an advantage relative to others. After a particularly bad incident (he accepted a drink of water from a rider on another team, then threw the bottle, along with half the water, into a ditch), he was severely penalized and nearly kicked off the team, but the guy whose job was to make that decision was so utterly flabbergasted by his behavior that he decided to talk to him first. As far as I can tell, he was very confused about the norms and didn't realize how badly he'd been violating them. He was definitely an asshole, and he was following clear incentives, but it seems his confusion was a load-bearing part of his behavior because he appeared to be genuinely sorry and started acting much more reasonably after.

Separate from the outcome for this guy in particular, I think it was pretty valuable to know that people were making it through most of a season of collegiate cycling without fully understanding the norms. Like, he knew he was being an asshole, but he didn't really get how bad it was, and looking back I think many of us had taken the friendly, cooperative culture for granted and hadn't put enough effort into acculturating new people.

Again, I agree that the first priority is to stop people from causing harm, but I think that reducing long-term harm is aided by understanding what's going on in people's heads when they're doing bad stuff.

I suggest minting a new word, for people who have the effects of malicious behavior, whether it's intentional or not.

Why only malicious behavior? It seems like the relevant idea is more general: oftentimes we care about what outcomes a pattern of behavior looks optimized to achieve in the world, not about the person's conscious subjective verbal narrative. (Separately from whether we think those outcomes are good or bad.)

Previously, I had suggested "algorithmic" intent, as contrasted to "conscious" intent. Claims about algorithmic intent correspond to predictions about how the behavior responds to interventions. Mistakes that don't repeat themselves when corrected are probably "honest mistakes." "Mistakes" that resist correction, that systematically steer the future in a way that benefits the actor, are probably algorithmically intentional.

"Mistakes" that resist correction, that systematically steer the future in a way that benefits the actor, are probably algorithmically intentional.

Is "benefits the actor" load-bearing for you here (as opposed to just "predictably bad for others")? I can think of behaviors that rarely benefit the actor but that the actor seems unlikely to be talked out of (e.g. temper tantrums at the workplace are rarely selfishly positive in professional Western contexts).

Sorry, not load-bearing; I think "steering the future" was the important part of that sentence.

Although in the case of tantrums, I think the game-theoretic logic is pretty clear: if I predictably make a fuss when I don't get my way, then people who don't want me to make a fuss are more likely to let me get my way (to a point). The fact that tantrums don't benefit the actor when they happen, isn't itself enough to show that they're not being used to successfully extort concessions to make them happen less often. If it doesn't work in the modern workplace, it probably worked in the environment of evolutionary adaptedness.

Sometimes also tantrums work in the training distribution of childhood and don't work in the deployment environment of professional work.

I suggest minting a new word, for people who have the effects of malicious behavior, whether it’s intentional or not.

I've long used "destructive" for that.

Harmful, maybe? Not all harms involve destruction (physical or relationships, etc.).

I recently started making a similar distinction in my life and using the word “toxic”

I don't like the word "toxic" because it's kind of essentialist without exposing actual causes/effects/mechanisms/inputs-outputs. I think it's useful sometimes as shorthand between people who have a high degree of agreement on what "toxic" means in a given context, but it's sort of a slippery word.

I also like "problematic" - it could be used as a 'we are not yet quite sure about how bad this is' version of "destructive"

Problematic is already associated with bigotry and I don't think invoking a political frame is helpful for these sorts of situations.

I don't think it does invoke a political frame if you use it right but perhaps I have too much confidence in how I've used the term

"Problematic" does not differentiate between "bad", "harmful" and "difficult". Replacing the carburetor in a Honda Civic with only a spatula and a corkscrew for tools is problematic, but not necessarily harmful or bad.

Maybe "troubling"

I use problematic

Labeling (in particular) catastrophically incompetent people "maleficent" sounds malevolent. While the concern might be valid in theory, this label has connotations that probably don't help with the inherent practical witch hunt and reign of terror risks of the whole concept.

Also, the apparent Chesterton-Schelling fences my intuition is loudly hallucinating at this post say to stop before instituting a habit of using such classification. Immediately-decision-relevant concepts are autonomous superweapons, controversial norms that resist attempts at keeping their boundaries in reasonable/intended places.

My stance is "the more we promote awareness of the psychological landscape around destructive patterns of behavior, the better." This isn't necessarily at odds with what you're saying because "the psychological landscape" is a descriptive thing, whereas your objection to Nate's proposal is that it seeks to be "immediately-decision-relevant," i.e., that it's normative (or comes with direct normative implications). 

So, maybe I'd agree that "maleficent" might be slightly too simplistic of a classification (because we may want to draw action-relevant boundaries in different places depending on the context – e.g., different situations call for different degrees of risk tolerance of false positives vs. false negatives).

That said, I think there's an important message in Nate's post and (if I had to choose one or the other) I'm more concerned about people not internalizing that message than about it potentially feeding ammunition to witch hunts. (After all, someone who internalizes Nate's message will probably become more concerned about the possibility of witch hunts – if only explicitly-badly-intentioned people instigated witch hunts or added fuel to the fires, history would look very different.)

"maleficent" might be slightly too simplistic of a classification

There is an interesting phenomenon around culture wars where a crazy amount of concepts is generated to describe the contested territory with mind-boggling nuance. I have a hunch that this is not just expertise signaling, but actually useful for dissolving the conceptual superweapons in a sea of distinctions. This divests the original contentious immediately-decision-relevant concept of its special role that gives it power, by replacing it with a hundred slightly-decision-relevant distinctions where none of them have significant power.

A disagreement that was disputing a definition about placement of its boundaries becomes a disagreement about decision procedures in terms of many unchanging and uncontroversial definitions that cover all contested territory in detail. After the dispute is over, most of the technical distinctions can once again be discarded.

Could you give some examples? I understand you may not want to talk about culture war topics on lesswrong, so it's fine if you decline, but without examples I unfortunately cannot picture what you're talking about

so it's fine if you decline

The cost of this statement is feeding the frame where it's not necessarily fine.

Humans care about this stuff enough to bake it into their legal codes.

It's mostly Western culture that does this. There's a lot of variation in how much cultures care about bad intentions.

IANAL, but I believe that the doctrine of mens rea is different from what is suggested here, and the difference has application to the larger context.

The mens rea is simply the intention to have done the actus reus, the illegal act. If, for example, a company director puts their signature to a set of false accounts, knowing they are false, then there is mens rea. It will cut no ice in court for them to profess that "my goodness, I didn't know that was illegal!", or "oh, but surely that wasn't really fraud", or "but it was for a vital cause!"

What matters is that they did the thing, intending to do the thing.

I suggest minting a new word, for people who have the effects of malicious behavior

I thought that "toxic" was the usual word these days.

IANAL either but I do know that certain crimes explicitly do hinge on the perpetrator's knowledge that what they did was illegal, not just that they intended to do it. This isn't common but does apply to some areas with complex legislation like tax evasion and campaign finance. As a high-profile example, Trump Jr. was deemed "too dumb to prosecute" for campaign finance violations.

More generally, there are multiple levels of mens rea. Some crimes require no intent to prosecute ("strict liability"). For those that do, they can be categorized into four levels of increasing severity: acting negligently, acting recklessly, acting knowingly, and acting purposefully. This list is not universal though it is representative. Some US states refer to express/implied "malice".

I understand So8res to be saying that we can treat toxic behavior on a strict liability basis without deciding what level of knowledge and intent to assign the offender.

I think "toxic" is more narrow: it hints at indirect, social, and emotional damage, and does not work well as a term in situations that are just pragmatic in nature.

Copied text from a Facebook post that feels related (separating intent from result):

In Duncan-culture, there are more mistakes you're allowed to make, up-front, with something like "no fault."

e.g. the punch bug thing—if you're in a context where lots of people play punch bug, then you're not MORALLY CULPABLE if you slug somebody on the shoulder and then they say "Ouch, I don't like that, do not do that."

(You're morally culpable if you do it again, after their clear boundary, but Duncan-culture has more wiggle room for first-trespasses.)

However, Duncan-culture is MORE strict about something like ...

"I hurt people! But it's okay, I patched the dynamic that led to the hurt. But then I hurt other people! But it's okay, because I isolated and fixed that set of mistakes, too. But then I hurt other people! But it's okay, because I isolated and fixed that set of mistakes, too. But then I hurt other people! But it's okay..."

In Duncan-culture, you can get away with about two rounds of that. On the third screwup, pretty much everybody joins in to say "no. Stop. You are clearly just capable of inventing new mistakes every time. Cease this iterative process."

And if you don't—if you keep going, making a different error with a similar result every time—

In Duncan-culture, the resulting harm on rounds three and beyond is treated as, essentially, deliberate/intentional. Because the result was predictable, and this fact failed to move you.

This is not, as far as I can tell, robustly/reliably true in the broader culture I'm currently a part of.

EDIT: More disambiguation:

We give people protection, socially speaking, when we consider them to have had good intentions, but to have made a mistake with tragic results.

In Duncan-culture, you can't really get that protection three times in a row for three similar results. If you do A and it leads to X, that's just a mistake and we treat you sympathetically/generously. If you then do B and it leads to X, well, plausibly your first patch wasn't good enough, but like, okay, things are hard, your good intentions shine through, fair game. But if you then do C and it leads to X, all future X's resulting from D and E and so on are considered "your fault" in the not-excusable-as-a-mistake way. Good intentions cease to matter after three different Xings; your job now is to do whatever it takes to avoid more X, or to accept full responsibility for all future X, approximately as if you caused X on purpose/decided X was a side effect you felt worth causing.

In Duncan-culture, when people say "no. Stop", what's the thing that they're saying should stop?

In this specific case, I was writing about a colleague who kept hurting people in their attempts to help them with rationality. They kept managing to hurt people in novel and interesting ways, every time they patched the previous failure mode. "No. Stop." would be in reference to "stop fiddling with people's brains in this way."

Similarly, Brent Dill had in fact been doing different damages to each of his romantic partners, but eventually the Berkeley community was like "no, we are horrified, we don't care if you're not making those specific mistakes anymore, we do not trust you to not make new ones." In that case "No. Stop." was in reference to "dating any of the women in our community."

One hypothesis I have for why people care so much about some distinction like this is that humans have social/mental modes for dealing with people who are explicitly malicious towards them, who are explicitly faking cordiality in attempts to extract some resource. And these are pretty different from their modes of dealing with someone who's merely being reckless or foolish. So they care a lot about the mental state behind the act.

[...]

On this theory, most people who are in effect trying to exploit resources from your community, won't be explicitly malicious, not even in the privacy of their own minds. (Perhaps because the content of one’s own mind is just not all that private; humans are in fact pretty good at inferring intent from a bunch of subtle signals.) Someone who could be exploiting your community, will often act so as to exploit your community, while internally telling themselves lots of stories where what they're doing is justified and fine.

I note that while I find both paragraphs individually reasonable [and I find myself nodding along to them], there seems to be a soft contradiction between them that needs explanation.

Namely, why is human (whether genetic or cultural) evolution maladaptive here? "Which humans are bad allies" seems close to the central kind of problem we should expect evolution in a social context to be good at, so I feel like the burden of proof is on whoever is positing a local deviance to explain why the features are off in this case. Some possibilities:

1. "Our" community is different [why?]

2. People in history are in fact object-level wrong about the existence (or at least prevalence) of evil actors. In reality "Almost no one is evil, almost everything is broken." A possible evolutionarily concordant just-so story here is something in the direction of rational irrationality, perhaps humans are better at tribal ostracism etc if they collectively pretend (and/or genuinely believe) other humans who do bad things are genuinely evil and thus worthy of ostracism. 

3.???

Both explanations are possible but I don't know which one is right (or both, or neither); I just want to highlight that there is something left to be explained in your model so far.

There's no contradiction. There are two competing sides of the evolutionary process: one side is racing to understand intentions as well as possible, the other side is racing to obscure its intentions, in this case by not having them consciously.

I think one aspect which softens the discrepancy is that our intuitions here might not be adapted to large-scale societies. If everyone really lives mainly with one's own tribe and has kind of isolated interactions with other tribes and maybe tribe-switching people every now and then (similar to village-life compared to city-life), I could well imagine that "are they truly part of our tribe?" actually manages to filter out a large portion of harmful cases.

Also, regarding 2): If indeed almost no one is evil, almost everyone is broken: there are strong incentives to make sure that the social rules do not rule out your way of exploiting the system. Because of this I would not be surprised if "common knowledge" around these things tends to be warped by the class of people who can make the rules. Another factor is that as a coordination problem, using "never try to harm others" seems like a very fine Schelling point to use as common denominator.

It's possible, but I would previously have assumed sociopathy/intentional maleficence etc. to be less common in the ancestral environment relative to other harmful social situations. My own just-so story would suggest that people's intuitions from a tribal context are maladaptive in underpredicting sociopathy or deliberate deception.

I am not sure we disagree with regards to the prevalence of maleficence. One reason why I would imagine that

"are they truly part of our tribe?" actually manages to filter out a large portion of harmful cases.

works in more tribal contexts would be that cities provide more "ecological" niches (would the term be sociological here?) for this type of behaviour.

intuitions [...] are maladaptive in underpredicting sociopathy or deliberate deception

Interesting. I would mostly think that people today are way more specialized in their "professions" such that for any kind of ability we will come into contact with significantly more skilled people than a typical ancestor of ours would have. If I try to think about examples where people are way too trusting, or way too ready to treat someone as an enemy, I have the impression that for both mistakes examples come to mind quite readily. Due to this, I think I do not agree with "underpredict" as a description and instead tend to a more general "overwhelmed by reality".

Curated. 

In some sense, I knew all this 10 years ago when I first started community-organizing and running into problems with various flavors of deception, manipulation, and people-hurting-each-other. 

But, I definitely struggled to defend my communities against people who didn't quite match my preconception of what "a person I would need to defend against" looked like. My sympathy and empathy for some people made me more hesitant to enforce my boundaries.

I don't know that I'm thrilled with "malefactor" or "maleficence" as words (they seem too similar to "malicious" and I don't think they convey the right set of things), but I very much agree with the distinction being useful.

Interpersonal abuse (eg parental, partner, etc) has a similar issue. People like to talk as if the abuser is twirling their mustache in their abuse-scheme. And while this is occasionally the case, I claim that MOST abuse is perpetrated by people with a certain level of good intent. They may truly love their partner and be the only one who is there for them when they need it, BUT they lack the requisite skills to be in a healthy relationship.

Sadly this is often due to a mental illness, or a history of trauma, or not getting to practice these skills growing up until there was a huge gulf between where they are and where they need to be.

This makes it extra difficult for the victim, because the abuser is sympathetic and seemingly ACTUALLY TRYING. Trying to get advice from the internet may not help when everyone paints your abuser as a scheming villain and you can tell they're not. They're just broken.

I've really appreciated the media that shows a more realistic picture of abusers as people who love you, but are too fucked up to not hurt you. I think more useful advice would acknowledge this harsh reality

I don't have any terminological suggestions that I love

Following on my prior comment, the actual legal terms used for the (oxymoronic) "purposeless and unknowing mens rea" might provide an opening for the legal-social technologies to provide wisdom on operationalizing these ideas: "negligent" at first, and "reckless" when it's reached a tipping point.

When dealing with someone who's doing something bad, and it's not clear whether they're conscious of it or not, one tactic is to tell them about it and see how they respond.  (It is the most obviously prosocial approach.)  Ideally, this will either fix the situation or lead towards establishing that they are, at the very least, reprehensibly negligent, and then you can treat them as malicious.  (In principle, the difference between a malicious person and one who accidentally behaves badly is that, if both of them come to understand that their behavior causes bad results, the latter will stop while the former will keep going.  Applying this to the real world can be messy.)

To take an easy example, if the scenario involves a friend repeatedly doing something that hurts you, then probably you should tell them about it.  If they apologize and try to stop, this is good; if their attempts to stop fail, then you can tell them that too, and take it from there.  If, contrariwise, they insist "this can't actually be hurting you", or deny that it happened, or otherwise reject your feedback, then I'd consider this evidence that they're not such a good friend.

In the case of a non-friend, there is less of a presumption of good faith.  Since the effect of them agreeing with you would mean they have to restrict their behavior or otherwise do stuff they'd rather not, they may be reluctant to agree, and further they might take it as you attempting to grab power or bully them.  Which are things that people sometimes do, and so the details matter: exactly what evidence there is, the relation between them and the person(s) raising the issue, etc.

Suppose the issue involves subjective judgments of how someone behaved in 1:1 contexts.  If one person thought you behaved badly in a situation, and you think differently, maybe you're right.  If, the last N times you were in that type of situation, with N different people, they all thought you behaved badly, then that gets to be strong evidence, as N increases, that your approach is wrong.  (Depending on the issue, it's possible that all N people believe the wrong philosophy—e.g. if the interaction was that they said "Praise Jesus!" and you replied "Sorry, but I'm an atheist".  Though one then asks, why are you getting into all these situations that you can predict will go badly?  Are you doing what you should do to avoid them?)
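To make the "as N increases" point concrete, here is a minimal sketch of the update as a Bayesian calculation. The function name, the prior, and the two likelihoods are all illustrative assumptions, not numbers from anywhere; the sketch also assumes the N judgments are independent, which is exactly what fails in the "all N people believe the wrong philosophy" case.

```python
def posterior_wrong(prior_wrong, n_complaints,
                    p_complain_if_wrong=0.8, p_complain_if_fine=0.2):
    """Posterior probability that your approach is wrong after
    n_complaints independent people all judged that you behaved badly.

    All parameters are hypothetical, chosen only for illustration.
    """
    # Likelihood of seeing n complaints in a row under each hypothesis,
    # assuming each person's judgment is independent of the others.
    like_wrong = p_complain_if_wrong ** n_complaints
    like_fine = p_complain_if_fine ** n_complaints
    # Bayes' rule: weight each likelihood by its prior and normalize.
    numerator = prior_wrong * like_wrong
    return numerator / (numerator + (1.0 - prior_wrong) * like_fine)


# Start out fairly confident you are right (10% chance of being wrong).
for n in (1, 3, 5):
    print(n, round(posterior_wrong(0.1, n), 3))
```

Under these made-up numbers a single complaint leaves you probably right, while several unanimous ones make "my approach is wrong" the dominant hypothesis. If the complaints are correlated (shared ideology, shared gossip), the likelihoods should not simply be multiplied and the update is weaker.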

At a certain point, as the evidence mounts, a responsible person in your position, when confronted with the evidence, should say, "Ok, I still don't agree, but I have to admit there's an X% chance I'm wrong, and if I am wrong and continue like this, then the impact of being wrong is Y; meanwhile, there are certain safeguards, up to and including "stop it completely", which have their own expected values, and at this point safeguards A and B are reasonable and worth doing."  (A truly mature person in certain situations might even say, "I know I'm innocent, but I also know that others have no way of verifying this, and from their perspective there's an X% chance I'm guilty, and I'm in favor of the general policy of responding with these countermeasures to that level of evidence of this crime, and I'm not going to fight them on this.")
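The reasoning in that quoted speech can be sketched as a small expected-cost comparison. Everything here is an assumption for illustration: the function, the value of X (0.3), the value of Y (100 units of harm), and the cost and effectiveness assigned to each option.

```python
def expected_cost(p_wrong, harm_if_wrong, safeguard_cost, harm_reduction):
    """Expected total cost of adopting one safeguard option.

    safeguard_cost is paid whether or not you are actually wrong;
    harm_reduction is the fraction of the harm the safeguard prevents.
    """
    residual_harm = p_wrong * harm_if_wrong * (1.0 - harm_reduction)
    return safeguard_cost + residual_harm


P_WRONG = 0.3          # "there's an X% chance I'm wrong" (hypothetical value)
HARM_IF_WRONG = 100.0  # "the impact of being wrong is Y" (hypothetical units)

# option name -> (cost of the safeguard, fraction of harm it prevents)
options = {
    "continue unchanged": (0.0, 0.0),
    "safeguards A and B": (5.0, 0.8),
    "stop it completely": (20.0, 1.0),
}

costs = {
    name: expected_cost(P_WRONG, HARM_IF_WRONG, cost, reduction)
    for name, (cost, reduction) in options.items()
}
best = min(costs, key=costs.get)
print(best, costs)
```

With these particular numbers the intermediate safeguards come out ahead of both doing nothing and stopping completely, but that conclusion is entirely driven by the assumed inputs; the point is only that a person can accept safeguards without conceding that they are guilty.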

A certain kind of narcissist would completely reject the feedback and say they're being unjustly persecuted, and (assuming our evidence is in fact good) we can condemn them here.  Depending on the situation, some predators would say, "Hmmph, those safeguards prevent me from doing the fun stuff or make it unacceptably risky; I'll agree and then just quietly leave the community".  Some others would pretend to agree and then try to continue misbehaving in whatever way they can.  There's always the possibility of an intelligent psychopath behaving exactly like an innocent person.

(If you want to get advanced about it, you could try having the "confronting" be initially done by some person who looks sane but not powerful, to maximize the likelihood that the "prideful narcissist" would openly reject it while the "reasonable, accidental misbehaver" would accept it; or, if the safeguard you have in mind is highly effective but is a major concession, you might have it be done by people who are officially "in charge" (e.g. with the power to ban people from events) so as to pressure cowardly offenders to agree.)

If you don't have enough evidence to be confident that the guy who rejects the feedback and insists he's correct is in fact wrong... Well, at the very least, by telling him, (a) if he's good but misguided, he should at least be more cautious in the future, and there is a chance you've helped; (b) if he's bad and cowardly, he knows that official eyes are on him and he'll have less benefit of the doubt in the future, which may dissuade him.  (This is conventionally known as a "warning".)  Having the right person tell him in the right way may help with (a) and possibly (b).

There may be circumstances in which you don't want to tell him about the evidence you do have.  (Maybe it would break a confidence; maybe it would teach predator-him how to hide his behavior in the future; maybe predator-he would know who snitched on him and take revenge [though my brain volunteers that this would be an excellent way to expose him, if you can protect the witness].)  There are also plenty in which this isn't a problem.

Overall, this is such a large topic, and appropriate responses depend so much on the details, that I think it would help to be more specific.

[edit: fixed link]

Yes! This is an excellent approach. Rather than focusing only on whether there is malicious intent, keeping in mind the more practical goal of wanting bad behavior to *stop* and seeking to understand how it might play out over time is a much more effective way of resolving the problem. Using direct communication to try and fix the situation or ascertain a history of established negligent or malicious behavior is very powerful.

(As an example, various crimes legally require mens rea, lit. “guilty mind”, in order to be criminal. Humans care about this stuff enough to bake it into their legal codes.)

Even in the law of mental states, intent follows the advice in this post. U.S. law commonly breaks down the 'guilty mind' into at least four categories, which, in the absence of a confession, all basically work by observing the defendant's patterns of behaviour. There may be some more operational ideas in the legal treatment of reckless and negligent behaviour.

  1. acting purposely - the defendant had an underlying conscious object to act
  2. acting knowingly - the defendant is practically certain that the conduct will cause a particular result
  3. acting recklessly - the defendant consciously disregarded a substantial and unjustified risk
  4. acting negligently - the defendant was not aware of the risk, but should have been aware of the risk

This might be a bit outside the scope of this post, but it would probably help if there were a way to positively respond to someone who was earnestly messing up in this manner before they cause a huge fiasco. If there's a legitimate belief that they're trying to do better and act in good faith, then what can be done to actually empower them to change in a positive direction? That's of course if they actually want to change; if they're keeping themselves in a state that causes harm because it benefits them while insisting it's fine, well, to steal a Sith's turn of phrase: airlocked

If there's a legitimate belief that they're trying to do better and act in good faith, then what can be done to actually empower them to change in a positive direction? That's of course if they actually want to change; if they're keeping themselves in a state that causes harm because it benefits them while insisting it's fine, well, to steal a Sith's turn of phrase: airlocked

I agree that it's important to give people constructive feedback to help them change. However, I see some caveats around this (I think I'm expanding on the points in your comment rather than disagreeing with it). Sometimes it's easier said than done. If part of a person's "destructive pattern" is that they react with utter contempt when you give them well-meant and (reasonably-)well-presented feedback, it's understandable if you don't want to put yourself in the crossfire. In that case, you can always try to avoid contact with someone. Then, if others ask you why you're doing this, you can say something that conveys your honest impressions while making clear that you haven't given this other person much of a chance.

Just like it's important to help people change, I think it's also important to seriously consider the hypothesis that some people are so stuck in their destructive patterns that giving constructive feedback is no longer justifiable in terms of social opportunity costs. (E.g., why invest 100s of hours helping someone become slightly less destructive if you can promote social harmony 50x better by putting your energy into pretty much anyone else.) 

Someone might object as follows. "If someone is 'well-intentioned,' isn't there a series of words you* can kindly say to them so that they'll gain insight into their situation and they'll be able to change?" 

I think the answer here is "no" and I think that's one of the saddest things about life. Even if the answer was, "yes, BUT, ...", I think that wouldn't change too much and would still be sad.

*(Edit) Instead of "you can kindly say to them," the objection seems stronger if this said "someone can kindly say to them." Therapists are well-positioned to help people because they start with a clean history. Accepting feedback from someone you have a messy history with (or feel competitive with, or all kinds of other complications) is going to be much more difficult than the ideal scenario.

One data point that seems relevant here is success probabilities for evidence-based treatments of personality disorders. I don't think personality disorders capture everything about "destructive patterns" (for instance, one obvious thing that they miss is "person behaves destructively due to an addiction"), nor do I think that personality disorders perfectly carve reality at its joints (most traits seem to come on a spectrum!). Still, it seems informative that the treatment success for narcissistic personality disorder seems comparatively very low (but not zero!) for people who are diagnosed with it, in addition to it being vastly under-diagnosed since people with pathological narcissism are less likely to seek therapy voluntarily. (Note that this isn't the case for all personality disorders – e.g., I think I read that BPD without narcissism as a comorbidity has something like 80% chance of improvement with evidence-based therapy.) These stats are some indication that there are differences in people's brain wiring or conditioned patterns that are deep enough that they can't easily be changed with lots of well-intentioned and well-informed communication (e.g., trying to change beliefs about oneself and others). 

So, I think it's a trap to assume that being 'well-intentioned' means that a person is always likely to improve with feedback. Even if, from the outside, it looks as though someone would change if only they could let go of a particular mindset or set of beliefs that seems to be the cause behind their "destructive patterns," consider the possibility that this is more of a symptom rather than the cause (and that the underlying cause is really hard to address). 

I know this post was chronologically first, but since I read them out of order my reaction was "wow, this post is sure using some of the notions from the Waluigi Effect mega-post, but for humans instead of chatbots"!  In particular, they're both pointing at the notion that an agent (human or AI chatbot) can be in something like a superposition between good actor and bad actor unlike the naive two-tone picture of morality one often gets from children's books.

After a recent article in the NY Times, I realized that it's a perfect analogy. The smartest people, when motivated by money, get so high that they venture into unsafe territory. They kinda know it's unsafe, but even internally it doesn't feel like crossing the red line.

It's not even about strength of character; when incentives are aligned 99:1 against your biology, you can try to work against it, but you most probably stand no chance.

It takes enormous willpower to quit smoking, precisely because the risks are invisible and so "small". Not only do you have to fight against this irresistible urge, BUT there's also nobody on "your side" except for an intellectual realization, of which you're not even so sure.

In the same vein, being a CEO of a big startup, being able to single-handedly choose direction, and getting used to people around you being less smart, less hard-working, less competitive, you start trusting your own decision-process much more. That's when incentives start to water down through the cracks in the shell. You don't even remember what feels right anymore, the only thing you know is taking bold actions brings you more power, more money, more dukka. And you do those.

Strong upvote. A corollary here is that a really important part of being a “good person” is being good at being able to tell when you’re rationalizing your behavior/otherwise deceiving yourself into thinking you’re doing good. The default is that people are quite bad at this but as you said don’t have explicitly bad intentions, which leads to a lot of people who are at some level morally decent acting in very morally bad ways.

Very excited for there to be definitely no differences between stereotypical malefactors and actual malefactors; no differences between stereotypical maleficence and actual maleficence; very excited for there to be no gameable cultural impressions about what makes a person a probable malefactor

... Not to imply that any gaming that would take place would be intentional, of course.

"This isn’t to say no coordination happens. I expect a little coordination happens openly, through prosocial slogans, just to overcome free rider problems. Remember Trivers’ theory of self-deception – that if something is advantageous to us, we naturally and unconsciously make up explanations for why it’s a good prosocial policy, and then genuinely believe those explanations. If you are rich and want to oppress the poor, you can come up with some philosophy of trickle-down or whatever that makes it sound good. Then you can talk about it with other rich people openly, no secret organizations in smoke-filled rooms necessary, and set up think tanks together. If you’re in the patriarchy, you can push nice-sounding things about gender roles and family values. There is no secret layer beneath the public layer – no smoke-filled room where the rich people get together and say “Let’s push prosocial slogans about rising tides, so that secretly we can dominate everything”. It all happens naturally under the hood, and the Basic Argument isn’t violated."

https://slatestarcodex.com/2019/01/14/too-many-people-dare-call-it-conspiracy/

I agree with this very intensely. I strongly regret unilaterally promoting the CFAR Handbook on various groups on Facebook; I thought that it was critical to minimize the number of AI safety and adjacent people using Facebook and that spreading the CFAR handbook was the best way to do that, and I mistakenly believed that CFAR was bad at marketing their material instead of choosing not to in order to avoid overcomplicating things. I had no way of knowing about the long list of consequences for CFAR for spreading their research in the wrong places, and CFAR had no way of warning me because they had no idea who I was and what I would do in response to their request. Hopefully, this won't make it harder for CFAR to post helpful content to Lesswrong in the future.

There are too many outside-the-box thinkers; the chaos factor is so high that it's like herding cats even when 99% of agents want to be cooperative. There need to be defense mechanisms that take confusion into account so that well-intentioned unilateralists don't get tangled up in systems meant for deliberate, consistently strategic harm-maximizers (who very clearly and unambiguously exist). The only thing I can think of is finding ways to discourage every cooperative person from acting unilaterally in the first place, but I agree with So8res that I can't think of good ways to do that.

I thought that it was critical to minimize the number of AI safety and adjacent people using Facebook and that spreading the CFAR handbook was the best way to do that

Wait, your TOC for spreading the CFAR handbook on Facebook was that doing so would be so annoying that it'd get people to quit Facebook? If true, this is rather surprising to me and I did not predict this.

I read his thesis as

  1. FB use reduces the effectiveness of AI safety researchers and
  2. the techniques in the CFAR handbook can help people resist attention hijacking schemes like FB, therefore
  3. a FB group for EAs is a high leverage place to spread the CFAR handbook

the long list of consequences for CFAR for spreading their research in the wrong places

What are these consequences? Is this “long list” published anywhere?

Here is a fictional, but otherwise practical example: the attempted rape that sets in motion the action of "Thelma and Louise". Here on YouTube. Notice what Harlan says at 0:50: "I'm not gonna hurt you".

How does he experience his intentions at that moment? At the moment after Thelma slaps him and he beats her?

Does it matter?

Yeah, we don't know if the people who sent the Boy Who Had Cried Wolf to guard the sheep were stupid or evil. But we do know they committed murder.

What material policy changes are being advocated for, here? I am having trouble imagining how this won't turn into a witch-hunt.

Harmful people often lack explicit malicious intent.

I was having a discussion with ChatGPT where it also claimed to believe the same thing as this. I asked it to explain why it thinks this. Its reasoning was that well-intentioned people often make mistakes, and that malign actors do not always succeed in their aims. I'll say!

I disagree completely with the idea that well-intentioned people can actually cause any harm, but even if you presume that they could, it isn't clear to me how malign actors being unable to succeed in their aims is enough to balance out the consequences such that more negativity falls on the well-intentioned. Perhaps the unsuccess of malign actors is due to correctly narrowing our focus onto them only?

Also, in my experience, if we followed the advice to focus on effects only, and were well-intentioned about doing so, we'd end up focusing on only the truly malign actors anyway. "Deploying defenses" against honest mistake-making intuitively results in actions that seem a bit cartoonishly, ironically villainous.

A version of this tends to happen with rather unintelligent or incompetent people placed in positions of power over other people, who can unintentionally harm people without having any intention to harm them.

Probably the best example here is the Great Chinese Famine, and the Holodomor to a lesser extent. One of the major problems was that the leadership had set severely unrealistic goals because they didn't know enough, and this, combined with incompetence, caused catastrophes on the scale of millions to tens of millions of lives.

As a former EA, I basically agree with this, and I definitely agree that we should start shifting to a norm that focuses on punishing bad actions, rather than trying to infer their mental state.

On SBF, I think a large part of the issue is that he was working in an industry called cryptocurrency that basically has fraud as the bedrock of it all. There was nothing real about crypto, so the collapse of FTX was basically inevitable.

Even if you accept that all cryptocurrency is valueless, it is possible to operate a crypto-related firm that does what it says it does or one that doesn't.  

For example, if two crypto exchanges accept Bitcoin deposits and say they will keep the Bitcoin in a safe vault for their customers, and then one of them keeps the Bitcoin in the vault while the other takes it to cover its founder's personal expenses/an affiliated firm's losses, I think it is fair to say that the second of these has committed fraud and the first has not, regardless of whether Bitcoin has anything 'real' about it or whether it disappears into a puff of smoke tomorrow.

On SBF, I think a large part of the issue is that he was working in an industry called cryptocurrency that basically has fraud as the bedrock of it all. There was nothing real about crypto, so the collapse of FTX was basically inevitable.

I don't deny that the cryptocurrency "industry" has been a huge magnet for fraud, nor that there are structural reasons for that, but "there was nothing real about crypto" is plainly false. The desire to have currencies that can't easily be controlled, manipulated, or implicitly taxed (seigniorage, inflation) by governments or other centralized organizations and that can be transferred without physical presence is real. So is the desire for self-executing contracts. One might believe those to be harmful abilities that humanity would be better off without, but not that they're just nothing.

More specifically, the issue with crypto is that the benefits are much less than promised, and there's a whole lot of bullshit claims on crypto like it being secure or not manipulatable.

As one example of why cryptocurrencies fail as a currency: a fixed supply and no central entity mean the value of the currency swings wildly, which is a dealbreaker for any currency.

Note, this is just one of the many, fractal problems here with crypto.

Crypto isn't all fraud. There's something real there, but it's built on unsound foundations and tries to sell a fake castle to others.

I definitely agree that we should start shifting to a norm that focuses on punishing bad actions, rather than trying to infer their mental state.

Do you have limitations to this in mind?  Consider the political issue of abortion.  One side thinks the other is murdering babies; the other side thinks the first is violating women's rightful ownership of their own bodies.  Each side thinks the other is doing something monstrous.  If that's all you need to justify punishment, then that seems to mean both sides should fight a civil war.

("National politics?  I was talking about..."  The one example the OP gives is SBF, and other language alludes to sex predators and reputation launderers, and the explicit specifiers in the first few paragraphs are "harmful people" and "bad behavior"; it's such a wide range that it seems hard to declare anything offtopic.)

You've actually mentioned a depressing possibility around morality, and it's roughly that without shared ethical assumptions, conflict is the default, and there's nothing imposing any constraints except social norms, which can break down.

My answer for people in general is: Try to see what others think, but remember that sometimes, bad outcomes will happen to stop worse outcomes, and you should always focus on your own values to decide the answers.
