Greetings, all! I discovered LessWrong while looking for a community that might be interested in an AI interpretability framework I developed to teach myself more about how LLMs work.
It began as a 'roguelite' framework and evolved into a project called AI-rl, an experiment to test how different AI models handle alignment challenges. I started this as a way to better understand how LLMs process ethical dilemmas, handle recursive logic traps, and navigate complex reasoning tasks.
So far, I’ve tested it on models like Claude, …