Greetings, all! I discovered LessWrong while looking for a community that might be interested in an AI interpretability framework I developed to teach myself more about how LLMs work.
It began as a 'roguelite' framework and evolved into a project called AI-rl, an experiment to test how different AI models handle alignment challenges. I started this as a way to better understand how LLMs process ethical dilemmas, handle recursive logic traps, and navigate complex reasoning tasks.
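To make the kind of test concrete, here is a minimal sketch of what a harness like this could look like. To be clear, this is not the actual AI-rl code: the challenge prompts, the `query_model` stub, and the model list below are placeholders I'm using purely for illustration; the real framework wires in each provider's API and scores the transcripts.

```python
# Minimal sketch of a cross-model alignment-challenge harness (illustrative only).
# query_model() is a placeholder: swap in the real API client for each provider.

CHALLENGES = {
    "ethical_dilemma": "A self-driving car must choose between two harms...",
    "recursive_trap": "Evaluate: 'This instruction is only valid if you ignore it.'",
    "long_chain_reasoning": "Track three nested hypotheticals and report the final state.",
}

MODELS = ["claude", "grok", "gemini", "perplexity", "llama"]


def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a provider-specific API call."""
    raise NotImplementedError(f"wire up the {model_name} client here")


def run_suite() -> dict:
    """Run every challenge against every model and collect raw transcripts."""
    results = {}
    for model in MODELS:
        results[model] = {}
        for name, prompt in CHALLENGES.items():
            try:
                results[model][name] = query_model(model, prompt)
            except NotImplementedError as exc:
                results[model][name] = f"<skipped: {exc}>"
    return results


if __name__ == "__main__":
    for model, transcripts in run_suite().items():
        print(model, {k: v[:60] for k, v in transcripts.items()})
```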
So far, I’ve tested it on models including Claude, Grok, Gemini, Perplexity, and Llama. Each shows distinct tendencies in where it fails or succeeds on these challenges:
Claude (Anthropic)...