All of Carl Nehring's Comments + Replies

Greetings, all! I discovered LessWrong while looking for a community that might be interested in an AI interpretability framework I developed to teach myself more about how LLMs work.

It began as a 'roguelite' framework and evolved into a project called AI-rl, an experiment to test how different AI models handle alignment challenges. I started this as a way to better understand how LLMs process ethical dilemmas, handle recursive logic traps, and navigate complex reasoning tasks.

So far, I’ve tested it on models like Claude, ...