AI Researchers, Ask Yourself These 6 Questions to Strengthen Your Moral Muscles

Max Tegmark

By Max Tegmark & Meia Chita-Tegmark

Of course you have moral principles – but how often do you use them?

I, Meia, am a professor doing psychology research, and I can tell you that most bad outcomes are caused not by lack of moral principles, but by them not being activated. I, Max, am a professor doing AI research, and I can tell you that your choices as an AI researcher truly matter, because you’re helping build what will become the most powerful technology ever: AI will gain the potential to bring either unprecedented health, prosperity, liberty, dignity and empowerment, or a race to replace our jobs, our relationships, our decision-making, our power and even our species.

Hardly a day goes by without the AI community facing moral decisions, on topics ranging from AI companions to surveillance, hacking and military use. Many top AI companies are fighting lawsuits about everything from data centers to AI safety, and Anthropic is in a prolonged showdown with the Pentagon.

So for all you AI researchers out there, here’s a handy checklist to tone up your moral strength.

1. Do you have red lines?

Is there any action that you find so morally unacceptable that, if the organization you work for takes it, you’ll quit? Or take some other pre-determined costly action, say whistleblowing? Such actions are your moral red lines. For example, Rosa Parks got fined and fired for her civil disobedience against segregation, Vasily Arkhipov was criticized after vetoing a Soviet nuclear strike against the US, and Edward Snowden ended up in exile for mass surveillance whistleblowing. Many AI researchers have left top AI companies that crossed their red lines, including Daniel Kokotajlo, who risked almost $2M in equity by quitting OpenAI without signing a non-disparagement agreement. What are your red lines?

2. Have you written them down and shared them?

Both George Washington and Benjamin Franklin wrote down moral guidelines for themselves, with Franklin grading his own performance weekly. This is a powerful tool for avoiding the boiling frog effect, protecting your red lines against gradual erosion as in the examples at the end of the next section. Sharing them with loved ones or online adds social pressure to stick to them. For each red line, make sure to write down what action you commit to taking if it is crossed. You can click here to list your red lines (we will only share them with your permission).

3. Have you resisted moral disengagement?

To further strengthen your moral muscles and ensure that your red lines don’t move, it’s helpful to know what failure mechanisms to watch out for. Disengaging your muscles makes you weak – and this applies to your moral muscles as well. So let’s look at moral disengagement mechanisms identified by Albert Bandura, one of the most impactful psychologists of all time. This will help you spot them and fight them when your red lines get pressured by your company, your social circle, the temptation of personal gain, or the desire to feel good about yourself.

Displacement and diffusion of responsibility: You’ll feel better if you or others convince you that you’re not really responsible for the harm: the real decision maker is leadership, investors, the market, geopolitics, or history (“this technology is inevitable”). When AI work is distributed across large teams, everyone feels less accountable for the collective outcome. “I’m just a researcher” or “I was just doing my job” are archetypical excuses identified by the influential political theorist Hannah Arendt. The satirical musician Tom Lehrer sums it up in this hilarious song about the rocket scientist who switched allegiance from Nazi Germany to the US: “Once the rockets go up, who cares where they come down – that’s not my department, says Wernher von Braun”. For example, an Anthropic researcher reading about how their Claude AI may have been implicated in killing over 150 Iranian school girls, in one of the worst US-caused civilian bloodbaths since the Vietnam War, may be tempted to tell themselves that they’re blameless because only management is responsible for selling their tools for military targeting.

Word games: Both Bandura and Arendt highlight how subtle word choices can reframe what’s moral. We are all familiar with military euphemisms such as “servicing a target” for bombing, “collateral damage” for civilian casualties and “enhanced interrogation techniques” for torture, but AI jargon is full of analogous word games, often encouraged by financially interested parties.

The most basic game is “euphemistic labeling”: replace morally vivid language with positive or emotionally flattened terminology. Researchers are not “helping build systems that may displace workers, manipulate users, centralize power, or heighten existential risk”; they are doing “capabilities research”, “model improvement” or “benchmark progress”. Training on copyrighted data becomes “freedom to learn”. Unpopular data centers become “AI infrastructure”. Firing or deskilling workers becomes “productivity gains”, and lobbying against accountability becomes “reducing friction”. Please practice using neutral words like “company” instead of “lab” (which sounds cool and innocent) and “AI system” instead of “AI model” (which sounds harmless). Bandura’s point is that euphemism does not merely soften tone; it weakens conscience.

Another word game is blame attribution, where critics become the problem, say “doomers,” “Luddites”, “opportunistic politicians”, “ignorant journalists” or “anti-tech Europeans”. Once opponents are blamed for irrationality or bad faith, the AI researcher feels less obligated to treat criticism as morally serious.

A third word game is soft dehumanization: the unemployed programmer, the individual copyright infringement victim and the chatbot suicide child disappear into categories such as “the labor market”, “creatives” and “edge cases”. The more harms are discussed statistically rather than personally, the less moral pain is triggered.

Selective moral self-exemption: It’s tempting to keep strong moral standards in general, but carve out an exception around the domain from which you benefit most: an AI researcher may be passionately ethical about injustice in the abstract, while suspending those same standards when judging their own employer, AI, salary or stock grant.

Advantageous comparison: It’s tempting to compare yourself only to worse actors: “At least I’m not at the most reckless lab.” “At least I’m not working on autonomous weapons.” “At least I care about alignment.” That lets you feel ethical without asking whether your own conduct is acceptable in absolute terms.

Moral justification: For those acknowledging that they’re causing current harm, it’s tempting to justify it as serving a noble mission, say “helping democracy prevail”, “creating universal abundance” or “making sure that safety has a seat at the table” – without seriously questioning whether those lofty goals are credible, or whether there’s another way to accomplish them with less current harm.

These moral disengagement techniques can be very powerful when combined and escalated: Enron executives gradually escalated from minor financial manipulations, justified as necessary for company survival and diffused through leadership directives, to massive fraud like hiding debt. Bernie Madoff started with small return fudges rationalized as client aid, then displaced blame onto markets and dehumanized victims, leading to a $65 billion fraud through incremental moral disengagement. In the Vietnam War, soldiers obediently followed orders in a “just war”, starting with minor transgressions that escalated to massacres like My Lai through diffused responsibility and victim dehumanization.
The frontier AI researcher’s signature Bandurian mantra is “I’m not a well-paid participant in a harmful race; I am a responsible, realistic, morally serious person helping guide inevitable progress”. But is the race to replace truly inevitable, given polling finding it wildly unpopular, or is it a Bandurian excuse and self-fulfilling prophecy?

4. Do you maintain situational awareness?

Do you actively research whether your red lines are being crossed? This includes investigating the indirect consequences of what your organization does. Hannah Arendt wrote about “the banality of evil”, arguing that the greatest harms are often done not by malice, but by obedient and conscientious technocrats who don’t think about the bigger picture. We talked above about taking known harms and using word games to downplay and reframe them as manageable, transitional or outweighed by upside. But there’s also another powerful moral disengagement technique: staying conveniently ignorant by not putting in the effort to know about the harms you’re contributing to in the first place. Ignorance is a bad excuse if you could have found out by looking into it: German chemist Bruno Tesch was convicted and executed in 1946 for supplying Zyklon B gas to Auschwitz-Birkenau despite claiming he didn’t know what it would be used for.

So please ask obvious questions regularly. For example, which red lines, if any, does your organization have? Is it actively lobbying against AI safety legislation that you support? Have you looked it up in the AI safety index? How are its products used? If you work for Google or OpenAI, have you skimmed any of the lawsuits against your company for alleged chatbot-linked suicide? If you work for Anthropic, how much do you know about that girls’ school strike?

Ironically, thanks to modern LLMs, there’s really no excuse for not knowing about things like these, since they’re just a prompt away. For example, you can try this monthly:

"Please make a list of morally questionable/controversial behavior by [MY COMPANY] in recent years, including a) controversial use of its tools (say for suicide, crime, surveillance or weapons), b) harm allegedly caused by its tools, c) alleged lies or broken promises by the company or its leadership, d) perverse incentives for the company to pursue profit over what truly benefits humanity."

These are the ChatGPT responses we got for Anthropic, Google, OpenAI, Meta and xAI on March 29, 2026.

5. Do you make noise internally?

If you learn about something that’s close to one of your red lines, then ask questions internally to find out more. Although there were historical situations where criticizing one’s organization could get one killed, doing so in an AI company today is unlikely to even get you fired – and why would you want to keep working for a company that can’t handle respectful questions about your red lines? Most even have whistleblowing policies that protect you (see page 99 here in the AI Safety Index).

If what you find out is unacceptable but you’re not ready to quit, then make noise internally: explain why to colleagues and superiors, and push hard for change. Don’t be like one of the engineers who realized that the cold weather could cause catastrophic O-ring failure in the Challenger Space Shuttle, and later regretted not speaking up forcefully. If you’re in the safety team and don’t know people in the lobbying team or those who make launch decisions: make a sincere effort to connect with them and educate them – don’t become a poster child for bystander syndrome.

6. Do you make noise externally?

Taking a public stance that challenges your own organization can help in many ways, from helping it improve voluntarily to catalyzing external forces that pressure it (and its competitors) to improve. This doesn’t mean you need to risk exile like Edward Snowden: there are many recent cases where AI researchers have gotten away with well-argued criticism of their company without any retaliation whatsoever. What consequences would you face if you publicly criticized your organization or revealed harmful or illegal behavior? Most US AI companies have a whistleblower policy (see above); please read yours! In addition, a simple search (just don’t do it with your own company’s LLM… :-) will show you many reputable whistleblower organizations offering help with everything from legal support to financial aid should you get fired or sued.

So having read this, how would you rate your moral muscles? How many moral disengagement techniques did you recognize in yourself, and how strong has your research been on potential harms caused by your company? Please don’t feel disheartened if you scored low despite meaning well. Instead, think of it as going to the gym for the first time, and discovering that you can’t even bench 50 pounds: muscles need to be used to get strong, and this 6-step plan can strengthen your moral muscles in no time – and you’ll start feeling really great looking at yourself in the mirror!

Strong downvoted the post because:

The tone is incredibly off-putting and imo not appropriate for LessWrong.
The advice doesn't actually seem very actionable or good, nor does it seem like this post will actually change how people act.
The Word Games section appears entirely AI generated without being noted as such. This violates LessWrong's LLM use policy.

Mod note: This post contains substantive unmarked AI-written content, which is against our LLM-use policy. Please put all LLM-written content into LLM blocks! This is purely a warning, but if you do it again we might end up limiting your posting privileges.

A frontier lab researcher I worked with used to disagree with colleagues but say nothing. Two conversations later he was holding ground with people he looked up to. A year and a half later he still pushes back. I hunt bounties like this that improve AI alignment, DM me

I don't think this post should be on the front page. As Sodium mentions, it involves a great deal of AI-written content; not coincidentally, that content is also very poorly-written and seems to fail one of the three stated requirements for being on the front page, to wit, "aim to explain, not persuade". It seems like it's getting upvotes mostly because people agree with the claim which is being made (approx. "working on AI is bad and you should try to slow it down if you can"), rather than because it makes a compelling case for that claim. It's very very very hard for me to imagine anyone who works on capabilities research reading this post and being anything but annoyed. I, for one, do not work in capabilities research and still am nothing but annoyed (well, after the first two points, which I agree are good ideas); "you should always use language which prioritises my particular idea of what the bad results of your actions are and implicitly bakes in my worldview" is just an obnoxious thing to say. It's even more obnoxious to treat disagreement with you as an objective flaw that of course someone will need to work on in order to improve as a person.

I don't like reading posts like this especially because I can feel myself becoming biased against the viewpoint being espoused, and I don't actually think that I should believe something less just because some of its proponents make bad arguments. All I can do is hope that this effect doesn't last too long.

Yep, this post shouldn't have been on the frontpage, I moved it to personal blog.

This piece struck me as being similar to a dad book on something like stoicism, where instead of explaining the author uses lots of humorous examples and persuasion to make an idea appealing. There is a difference between "X is, because Y" and "X is, because it appeals to me for X" but both can be masked by "I believe X". A basic lesson, but a red line worth stating out loud from time to time.

I disagree with many of the Tegmarks' re-labellings - in that I think in many cases the original label is more fair, accurate, and appropriate than their proposed new label - but I do not think it is obnoxious at all to say "here are the labels you probably currently use; here are some others you might consider".

"you should always use language which prioritises my particular idea of what the bad results of your actions are and implicitly bakes in my worldview"

I think this is a pretty uncharitable characterisation of what the authors were trying to do. A more fair characterisation would be, "If you agree with our view of the situation, don't get drawn-in to using language that implies a different situation; that language is chosen intentionally to support that outlook, so if you wish to support our outlook you should be intentional with your language too".

Even though I disagree pretty strongly with the actual outlook proposed in many of the examples I think this is essentially a form of intellectual honesty, not "obnoxious" or "annoying", and I could absolutely come up with non-AI-related examples of the same phenomenon I've seen in business (for example employees being referred to as "people" when the company is doing something nice to them but as "resources" when the company is doing something unpleasant to them, "pay rise" used when it would be more accurate to say "pay calculation restructuring that's on net a pay cut for practically everybody", etc. I'm sure you've seen many such examples, too!)

It's explaining a subtle mechanism for manufacturing agreement at a less-than-fully-conscious level and showing how you can carefully use language to maintain the position you'd previously arrived at through rational thought rather than being caught-up by the language and allowing yourself to be unconsciously diverted into the other party's opinion: that's pretty classic LessWrong!

'seems to fail one of the three stated requirements for being on the front page, to wit, "aim to explain, not persuade".'

I don't read it as trying to persuade the reader of AI harms. I read it as being aimed at readers who are already persuaded of AI harms, explaining how to respond to those harms (showing how responses to other research-level harms worked and where they failed, etc.)

Perhaps the article could have been more explicit about saying this (and, as I say, personally I'm absolutely not persuaded of many of the harms being presented) but I don't think an objective reader could really mistake it for an article trying to persuade the reader of harms, and I think it's entirely fair to assume the intended audience (AI capabilities researchers on LessWrong) would have already encountered arguments for and against AI harms and formed a position, without needing the argument re-explained to them from scratch yet again. (Although a few hyperlinked citations would have been nice, of course..!)

More generally, I think it's reasonable to say "Given problem P, here's what I think we should do" and it would be impractical and counterproductive to require every such post on LessWrong to first present a full argument for P from first principles.

"Vasily Arkhipov was criticized after vetoing a Soviet nuclear strike against the US"

My goodness, I think this might be the most beautifully understated claim in all of philosophy!

In addition to the "enduring criticism" angle, I think people in large corporations are under social and economic pressure to follow-along with the corporation's position and are often expected to make moral decisions whilst enduring personal discomforts (lack of sleep, pressure of deadlines, family/life worries, etc.), and Arkhipov also provides an excellent example/inspiration for thinking for oneself whilst under immense pressure (...in Arkhipov's case literally as well as figuratively...) and despite intolerable personal discomforts (in his case sleep-deprivation, oxygen-deprivation, separation from his family, fear of imminent death, etc.)

"Hannah Arendt wrote about “the banality of evil”. arguing that the greatest harms are often done not by malice, but by obedient and conscientious technocrats who don’t think about the bigger picture."

"There are hardly any excesses of the most crazed psychopath that cannot easily be duplicated by a normal kindly family man who just comes in to work every day and has a job to do."

-- Sir Terry Pratchett, "Small Gods"

(Convergent evolution, or had Pratchett read Arendt..?)

It is a very interesting analogy that you use - "moral muscles". I suggest looking at this analogy more deeply.

on the one hand - muscles are a subjective thing that is managed internally.
on the other hand - the current mainstream says that morality is external rules.

this is a contradiction - merging internal and external as one thing. But most of us do not mention this contradiction. why? maybe because, there is no contradiction at our basic understanding? but something is wrong when we define internal as external or vice versa.

more details: in some way environment influences us how we manage our muscles and our morality. In general, we manage them in the way we were trained during childhood. Both are trained. Both can be affected by our environment. And both are internal subjective properties?

If it is true. How it can be? Is moral the internal structurally inevitable thing? That ...

or in other words:
the beginning question should be: "What is morality, and why is it inevitable?"
not: "How do I become more moral?"

[Edit: What follows has turned-out to be a fairly strenuous disagreement with your position. Sorry - I only noticed how strenuous it was after the words/thoughts were done pouring out of my head! I know you're new ..I'm pretty new around these parts myself.. so I thought I'd message to apologise in advance and ask you to not take it personally. Even though I do very much disagree with you I'm nevertheless interested in your opinion and glad to have the discussion!]

I think this is false because the premise "muscles are internal whereas morality is external" is:

1) False, in that morality and moral behaviour can be thought of as "applied conscience", and conscience exists within the mind/brain, which is internal.

2) Irrelevant, in that there doesn't seem to be any practical reason why being internal or external matters; what matters is whether the thing can be trained and strengthened.

You can strengthen muscles -internal- by going to the gym, and you can strengthen a building -external- by adding extra structural supports, digging deeper foundations, etc. Both are important and it doesn't really matter whether strengthening morality is more like strengthening muscles or more like strengthening a building.

Separately, I think it is self-inconsistent: in one place you say "morality is external", and in another place you say "morality is trained since childhood". Since the stuff we train in childhood must be internal (since we don't bring our external childhood environments along with us into adulthood, only our internal selves ....my beloved teddy bear obviously notwithstanding....), I don't think it's possible to simultaneously claim that we train morality in childhood but that morality is external.

(Sure, the training is external: most moral positions we train-on come from other people, we didn't figure them out for ourselves. A gym-goer also didn't invent press-ups and bench-presses and dumb-bells and gymnasia for themself either - but all those external ideas and tools nevertheless improve them internally...)

Despite this, I do think it's entirely legitimate to ask "Is there an objective, universally-true morality that isn't just some guy's opinion?". Philosophers have been debating that for millennia and it'll probably be millennia more before we've made any headway on this question.

But - if we all acted as though there was no objective morality, it was all just some guy's opinion, and literally everything was fundamentally equally acceptable... I'm pretty sure that what remains of civilisation wouldn't last long enough to answer the question either way...

Personally I think that even though it's difficult to know exactly what is and isn't moral in a universally true, non-childhood-dependent, non-external-factor-dependent way, it's nevertheless quite plain to see that an objective universal morality can exist:

Suppose a world full of torture and mutilation and suffering and loneliness and hate and utter, abject misery for every thinking being, all the time, with no respite, for all eternity - a "hell", in other words. I think it takes an unsupported assumption (or a "leap of faith" if you prefer!) to say that such a world would be bad and that we should avoid it and not create it - but it requires the tiniest, most reasonable and most fair possible assumption in all of philosophy.

Once you make that, it's possible to (in theory!) measure the distance between this hell-world and any other world in world-space (not physical space but the theoretical parameter-space describing all possible worlds), and assess how similar or dissimilar other possible worlds are to the hell-world, and upon which axes. From that, I think it's possible to bootstrap an entire objective universal morality.

...in principle! Of course in practice navigating the parameter-space is utterly computationally intractable and the best we can do with our limited tools and intellects is flawed navigation heuristics like "utilitarianism", "virtue ethics", etc. that -much like general relativity- work perfectly in some situations but break-down in others. These heuristics being flawed doesn't mean that there's no objective morality, even if we can't yet figure-out what it says we should do in some situation.

(What about any beings that don't make even the tiniest and most reasonable of assumptions that the hell-world is bad, and would be either supportive of or indifferent to making the real world like that?

On a philosophical level I would say that if they maintain that all worlds are equally acceptable and it literally does not matter what the world is like at all then they don't have moral patienthood - in other words what they want by definition can't matter, since everything is equally acceptable to them.

On a personal level, of course, I would say that such people are ☠️the enemy☠️ and must be opposed with every fibre of our beings if we want the world to ever not be a horrible place for everybody!)

Thank you for your thoughtful and detailed reply - and no need to apologize for the disagreement, I genuinely appreciate the engagement! I want to add that I'm not a native speaker and my words in some cases may sound rude. I'm deeply sorry. I'm trying to be polite. Thank you.

I think there may be a misunderstanding worth clarifying: I never actually claimed that morality is external. I was observing that the mainstream discourse tends to treat morality as external rules - and then pointing out that the "moral muscles" metaphor implicitly contradicts that mainstream framing by treating morality as something internal and subjective. And we accept this internalization naturally.

My point was to notice that people casually switch between these two framings (internal and external) without acknowledging the tension - and that this tension is worth examining more deeply.

So actually, I think we largely agree! Your point that morality is "applied conscience" existing within the mind - shaped by external training but ultimately internal - is precisely the kind of deeper examination I was calling for.

Where I was trying to go with this is your own conclusion: maybe morality is "objective, universally-true" - not just a cultural construct, not just "some guy's opinion", but something built into the nature of conscious beings. Your "hell-world" argument is a beautiful illustration of exactly this.

So my real question was never "is morality internal or external?" but rather: "Why does morality exist at all, and is it inevitable given the nature of conscious experience?"

Which, I think, is where you ended up too.

What we dismiss as subjective opinion may actually be a foundational structure of consciousness itself - not something consciousness possesses, but something consciousness requires in order to appear and be.

Okay, it sounds like we're pretty much in full agreement here. (And for the parts where neither of us know how reality works - we seem to nevertheless agree on what we don't know and why we don't know it!)

For what it's worth I did have a bit of trouble parsing your original message in places, and I clearly misunderstood parts of it (sorry!) - but it absolutely didn't appear rude or impolite at all to me; just as arguing (politely) for something that, it turns-out, you weren't actually arguing for. I'm sure it's my error as much as yours!

My own English is, I'm sure, nothing to write home about ..pun fully intended.. but for what it's worth your writing does seem to include lots of "it's not X, it's Y" and other such sentence structures that lots of English speakers interpret as indications of AI-written text. I don't think it's fair for people to interpret it like this - cf. "I'm [a non-native English Speaker]: I don't write like AI. AI writes like me." - and so long as the author understands and agrees with what is written I don't think there's anything wrong with AI-assisted writing anyway, but I just thought I'd mention it lest you encounter such an accusation!

Thank you for understanding.

Even here one can observe that morality appears even where people try to hide it.