Epistemic Status: Written over the course of a couple days at Inkhaven. Some of the info is old so some newer papers are excluded. TL;DR: People talk about "interpreting planning and goals" in models all the time, but people have different understandings of what exactly "planning" means. I try to...
Epistemic status: experimental results. This is exploratory work examining an alternative approach to the interpretation of language models. Summary: We aim to determine whether the LCM model, which predicts sentence embeddings rather than token embeddings, can predict the outputs of an LLM. We compare four architectures, the most...
I wrote about continuity of consciousness in my cryonics post: Two Theories for Cryopreservation. I already said I’m kind of unsure about it, but now I’ve spent a little more time thinking about it. Could I make a ladder of thought experiments to get me to believe it’s fake? Is...
Epistemic status: experimental results. This is exploratory work examining an alternative approach to the interpretation of language models. Summary: We aim to determine whether the LCM model, which predicts sentence embeddings rather than token embeddings, can predict the outputs of LLMs. Using a dataset of 1 million...
Epistemic status: Old news and well-known, but I find it hard to point at a single post that encapsulates my intuitions on this, so I write them down here. One question that sometimes comes up in interpretability work is: “why do I trust simple linear probes more than complex non-linear...
After writing about my ethics yesterday, there was some discussion about the axioms that I think are most needed to derive the rest of my ethics. “pleasure is good”: nobody argues against this. “suffering is bad”: someone did argue that this is potentially untrue, that there exists “voluntary...
I’ve been thinking on and off about ethics for around a decade, but mostly in an informal way rather than through formal philosophy. I expect that if I spent more time seriously stress-testing my views, I’d find inconsistencies or conclusions I’m less comfortable with than I currently think. Still, this...