I've tried to summarise my view of LLMs and interpretability below. The statements are probably oversimplified and missing some nuance, but I'm hoping they're broadly correct and let me ask the question at the end.

As I see it, LLMs essentially perform dimensionality reduction: they embed human language into a lower-dimensional space. Interpretability work then performs a further dimensionality reduction on the weights and activations of LLMs that represent this lower-dimensional structure, and extracts human-understandable features from it. Is this broadly true? There's a huge amount of hype around these models and what they're capable of, but my view of them is much simpler: I see them as very effective ML models. I want to understand the gap between my knowledge and the aggregate view of the internet.
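
To make my mental model concrete, here is a toy sketch of the kind of thing I have in mind. It is purely illustrative: the "activations" are random vectors standing in for an LLM's internal representations, and the sparse autoencoder is my assumption about the sort of dimensionality reduction used in recent interpretability work, not anyone's actual pipeline.

```python
# Toy sketch: decompose (fake) LLM activations into a sparse set of feature
# directions with a tiny sparse autoencoder, trained by plain SGD in numpy.
import numpy as np

rng = np.random.default_rng(0)

d_model = 64      # dimension of the (stand-in) LLM activations
n_features = 256  # overcomplete dictionary of candidate features
n_samples = 4096
l1_coeff = 1e-3
lr = 1e-2

# Fake activations: in real interpretability work these would be activation
# vectors collected from forward passes of the model on text.
acts = rng.normal(size=(n_samples, d_model)).astype(np.float32)

# Encoder / decoder weights of the sparse autoencoder.
W_enc = rng.normal(scale=0.1, size=(d_model, n_features)).astype(np.float32)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model)).astype(np.float32)
b_enc = np.zeros(n_features, dtype=np.float32)

for step in range(200):
    # Encode: sparse, non-negative feature activations.
    f = np.maximum(acts @ W_enc + b_enc, 0.0)
    # Decode: reconstruct the original activations from the features.
    recon = f @ W_dec
    err = recon - acts

    # Loss = reconstruction error + L1 sparsity penalty on the features.
    loss = (err ** 2).mean() + l1_coeff * np.abs(f).mean()

    # Manual gradients -- enough for a toy illustration.
    n = acts.shape[0]
    d_recon = 2 * err / (n * d_model)
    d_f = d_recon @ W_dec.T + l1_coeff * np.sign(f) / (n * n_features)
    d_f *= (f > 0)  # ReLU gradient
    W_dec -= lr * (f.T @ d_recon)
    W_enc -= lr * (acts.T @ d_f)
    b_enc -= lr * d_f.sum(axis=0)

    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss:.4f}")

# Each row of W_dec is a candidate feature direction in activation space;
# interpretability work then inspects which inputs make that feature fire.
print("learned feature directions:", W_dec.shape)
```

If that picture is roughly right, then both steps really are "just" learned dimensionality reduction, which is the crux of my question.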

Finally, I'm not trying to throw shade on the development of these models or the interpretability work that's been undertaken; I think it's fantastic work. But the volume of hype, and some of the outlandish statements I've read (granted, not from reputable sources), have led me to question my understanding.

Thanks
