Artem Karpov

Inducing human-like biases in moral reasoning LMs

Meta. This post is less polished than we would ideally prefer. However, we still think publishing it as is is reasonable, to avoid further delays. We are open to answering questions and to feedback in the comments.

TL;DR. This presents an inconclusive attempt to create a proof-of-concept that fMRI data from human brains can help improve moral reasoning in large language models. Code is available at https://github.com/ajmeek/Inducing-human-like-biases-in-moral-reasoning-LLMs.

Introduction

Our initial motivation was to create a proof of concept of applying an alignment research agenda we are particularly interested in, based on neuroconnectionism and brain-LM similarities (and their relevance for alignment): ‘neuroconnectionism as a general research programme centered around ANNs as a computational language for expressing falsifiable theories about brain computation’. Moral reasoning is an interesting application area, both for its relevance to AI alignment and because of the availability of public neuroimaging data, as well as of publicly-available LMs fine-tuned for moral reasoning.

During the last few years, a series of high-profile papers have shown that LMs partially converge towards brain-like solutions and share fundamental computational principles with humans, making them a ‘biologically feasible computational framework for studying the neural basis of language’. For more (recent) evidence of apparent convergence towards human-like representations, see also Scaling laws for language encoding models in fMRI, Large Language Models Converge on Brain-Like Word Representations.

To the best of our awareness though, the potential LM-brain similarities for linguistic inputs rich in morally-relevant content (e.g. moral scenarios) have not been explored previously. Nor has anyone tried to improve LM moral reasoning using moral reasoning neuroimaging datasets (though similar ideas have been explored for LMs more broadly and e.g. for Convolutional Neural Networks

23 · Feb 20, 2024

NEST: Nascent Encoded Steganographic Thoughts

20 · Feb 17

Steganographic Chains of Thought Are Low-Probability but High-Stakes: Evidence and Arguments

20 · Dec 11, 2025

How dangerous is encoded reasoning?

17 · Jun 30, 2025

160

NEST: Nascent Encoded Steganographic Thoughts

Existing claims of steganographic chain-of-thought in LLMs conflate it with dog-whistling, ciphered reasoning, and gaslighting — none of which involve truly hidden writing. I tested frontier models on actual steganographic tasks: encoding secrets via arithmetic coding (too complex for models to follow) and hiding reasoning in acrostics. Models failed at...

Feb 17 · 20

Steganographic Chains of Thought Are Low-Probability but High-Stakes: Evidence and Arguments

Epistemic status: I'm mostly confident about the evidence, having read the literature over recent months. The arguments below are my best guess. There is good evidence that models are capable of steganographic and ciphered communication that evades monitors when prompted, but this can be easily prevented with chain-of-thought (CoT)...

Dec 11, 2025 · 20

The Illegible Chain-of-Thought Menagerie

A Taxonomy of Encoded Reasoning and Messages. Epistemic status: exploratory, aims to consolidate my reasoning. This taxonomy of non-monitorable chain-of-thought (CoT) aims to assist in surveying current literature and to conduct new research. It distinguishes encoded reasoning from other classes without expanding on the latter. Encoded messages deserve their own...

Nov 18, 2025 · 2

artkpv's Shortform

Oct 12, 2025 · 2

How dangerous is encoded reasoning?

Encoded reasoning occurs when a language model (LM) agent hides its true reasoning inside its chain-of-thought (CoT). It is one of the three types of unfaithful reasoning[1] and the most dangerous one, because it undermines CoT monitoring and enables scheming and collusion (undesired coordination).[2] Only a few publications[3][4][5][6] tried to...

Jun 30, 2025 · 17

Philosophical Jailbreaks: Demo of LLM Nihilism

Epistemic Status: Exploratory. It was the end of August, 1991; I was leaning over a windowsill on the 4th floor of my school building in a town in the Far East of the USSR, looking out at the treetops and the roof of the adjacent building, and thinking about opening the window to...

Jun 4, 2025 · 3

The Steganographic Potentials of Language Models

Our preliminary experiments show that current private models can perform prompted steganography with non-trivial messages remaining undetected in most cases (up to 65% accuracy, 24 entropy bits). Open-source models can be trained to encode hidden messages (up to 60%, 4 entropy bits), potentially transmitting larger messages perfectly securely (undetected) without...

May 8, 2025 · 9
