Architectures for Increased Externalisation of Reasoning
TL;DR: We propose a new post-training method that makes LLMs more verbose reasoners by teaching the model to truncate its forward passes early. We expect this technique to improve monitorability by reducing the computation available in hidden layers for easy-to-predict tokens, so that more of the reasoning is externalised into visible output tokens. We're looking for collaborators to help continue this...
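To make the mechanism concrete, below is a minimal PyTorch sketch of one way early truncation could work at inference time. Everything here is an illustrative assumption rather than the post's actual method: the `EarlyExitTransformer` class, the logit-lens-style confidence probe after each block, and the `exit_threshold` parameter are all hypothetical, and the sketch shows only the inference-time exit rule, not the post-training procedure that would teach the model to use it.

```python
import torch
import torch.nn as nn


class EarlyExitTransformer(nn.Module):
    """Toy causal LM that can truncate its forward pass early.

    Hypothetical sketch: after each block, the current hidden state is
    decoded with the shared unembedding (a "logit lens"-style probe); if
    the model is already confident about the next token, the remaining
    blocks are skipped, leaving less hidden computation for that step.
    """

    def __init__(self, vocab_size=256, d_model=64, n_layers=8, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model,
                batch_first=True, norm_first=True,
            )
            for _ in range(n_layers)
        )
        self.norm = nn.LayerNorm(d_model)
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens, exit_threshold=0.9):
        # Causal mask: each position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        h = self.embed(tokens)
        for depth, block in enumerate(self.blocks, start=1):
            h = block(h, src_mask=mask)
            # Probe next-token confidence at this depth (last position only).
            probs = self.unembed(self.norm(h))[:, -1].softmax(dim=-1)
            if probs.max().item() >= exit_threshold:
                # Easy-to-predict token: truncate the remaining layers.
                return self.unembed(self.norm(h)), depth
        return self.unembed(self.norm(h)), len(self.blocks)


if __name__ == "__main__":
    model = EarlyExitTransformer()
    tokens = torch.randint(0, 256, (1, 16))
    logits, layers_used = model(tokens)
    print(f"used {layers_used} of {len(model.blocks)} blocks")
```

One design point in this sketch: confidence is measured with the same unembedding matrix at every depth, so a token the model finds easy exits after few blocks and receives correspondingly little hidden computation. Note also that a freshly initialised model will almost never clear the threshold; early exits would only appear once training concentrates probability mass in earlier layers.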