I have this mental image that keeps coming back when I do prompt engineering. It's not a formalism, it's more like... the picture I see in my head when I'm working with these systems. I think it's useful, and maybe some of you will find it useful too. The space...
Epistemic status: This post was planned as part of a broader "Holistic Interpretability" post, but that isn't going as fast as I'd like, so I am releasing the foreword to get early feedback on whether I should pursue it. I haven't had a lot of red...
This post is part of a sequence on LLM psychology. TL;DR: We introduce our perspective on a top-down approach for exploring the cognition of LLMs by studying their behavior, which we refer to as LLM psychology. In this post we take the mental stance of treating LLMs as “alien minds,”...
Paper coauthors: Rusheb Shah, Quentin Feuillade--Montixi, Soroush J. Pour, Arush Tagade, Stephen Casper, Javier Rando. Motivation: Our research team was motivated to show that state-of-the-art (SOTA) LLMs like GPT-4 and Claude 2 are not robust to misuse and can't be fully aligned to the desires of their creators, posing...
This post is part of a sequence on LLM Psychology. @Pierre Peigné wrote the details section in argument 3 and the other weird phenomenon. The rest is written in the voice of @Quentin FEUILLADE--MONTIXI. Intro: Before diving into what LLM psychology is, it is crucial to clarify the nature of...
As Large Language Models (LLMs) like ChatGPT evolve, becoming more advanced and intricate, understanding their behavior becomes increasingly hard. Relying solely on traditional interpretability techniques may not be sufficient or fast enough in our journey to understand and align these AI models. When exploring human cognition and...
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. I’d like to thank Rebecca Gorman and Stuart Armstrong for their mentorship and advice on the post. This post describes a zero-shot template prompt that works surprisingly well for any kind of evaluation task. I...