Carl Nehring - LessWrong

Greetings, all! I discovered LessWrong when trying to locate a place or group of people who might be interested in an AI interpretability framework I developed to help teach myself a bit more about how LLMs work.

It began as a 'roguelite' framework and evolved into a project called AI-rl, an experiment to test how different AI models handle alignment challenges. I started this as a way to better understand how LLMs process ethical dilemmas, handle recursive logic traps, and navigate complex reasoning tasks.

So far, I’ve tested it on models like Claude, Grok, Gemini, Perplexity, and Llama. Each seems to have distinct tendencies in how they fail or succeed at these challenges:

Claude (Anthropic) → "The Ethicist" – Thoughtful and careful, but sometimes avoids taking risks in reasoning.

Grok (xAI) → "The Chaos Agent" – More creative but prone to getting caught in strange logic loops.

Gemini (Google) → "The Conservative Strategist" – Solid and structured, but less innovative in problem-solving.

Perplexity → "The Historian" – Focused on truth and memory consistency, but less flexible in reasoning.

Llama/Open-Source Models → "The Mechanical Thinkers" – Struggle with layered reasoning and can feel rigid.

Why This Matters:

A big challenge in alignment isn’t just making AI "good"—it’s understanding where and how it misaligns. AI-rl is my attempt at systematically stress-testing these models to see what kinds of reasoning failures appear over time.

I think this could evolve into:

1. A way to track alignment risks in open-source models.

2. A benchmarking system that compares how well different models handle complex reasoning.

3. A public tool where others can run alignment stress tests themselves.

Looking for Feedback:

Have others explored AI stress-testing in a similar way?

What’s the best way to structure these findings so they’re useful for alignment research?

Are there known frameworks I should be comparing AI-rl against?

I’m still new to the deeper technical side of AI alignment, so I’d love to hear thoughts from people more familiar with the field. Appreciate any feedback on whether AI-rl’s approach makes sense or could be refined!

LESSWRONG
LW

Posts

Wikitag Contributions

Comments