Gears-level models are expensive - often prohibitively expensive. Black-box approaches are usually much cheaper and faster. But black-box approaches rarely generalize - they're subject to Goodhart, need to be rebuilt when conditions change, don't identify unknown unknowns, and are hard to build on top of. Gears-level models, on the other hand, offer permanent, generalizable knowledge which can be applied to many problems in the future, even if conditions shift.

jimrandomh
There is a joke about programmers, that I picked up long ago, I don't remember where, that says: A good programmer will do hours of work to automate away minutes of drudgery. Some time last month, that joke came into my head, and I thought: yes of course, a programmer should do that, since most of the hours spent automating are building capital, not necessarily in direct drudgery-prevention but in learning how to automate in this domain. I did not think of this post, when I had that thought. But I also don't think I would've noticed, if that joke had crossed my mind two years ago. This, I think, is what a good concept-crystallization feels like: an application arises, and it simply feels like common sense, as you have forgotten that there was ever a version of you which would not have noticed that.
John's Simple Guide To Fun House Parties

The simple heuristic: typical 5-year-old human males are just straightforwardly correct about what is, and is not, fun at a party. (Sex and adjacent things are obviously a major exception to this. I don't know of any other major exceptions, though there are minor exceptions.) When in doubt, find a five-year-old boy to consult for advice.

Some example things which are usually fun at house parties:
* Dancing
* Swordfighting and/or wrestling
* Lasertag, hide and seek, capture the flag
* Squirt guns
* Pranks
* Group singing, but not at a high skill level
* Lighting random things on fire, especially if they explode
* Building elaborate things from whatever's on hand
* Physical party games, of the sort one would see on Nickelodeon back in the day

Some example things which are usually not fun at house parties:
* Just talking for hours on end about the same things people talk about on LessWrong, except the discourse on LessWrong is generally higher quality
* Just talking for hours on end about community gossip
* Just talking for hours on end about that show people have been watching lately
* Most other forms of just talking for hours on end

This message brought to you by the wound on my side from taser fighting at a house party last weekend. That is how parties are supposed to go.
It looks like OpenAI has biased ChatGPT against using the word "sycophancy." Today, I sent ChatGPT the prompt "what are the most well-known sorts of reward hacking in LLMs". I noticed that the first item in its response was "Sybil Prompting". I'd never heard of this before and nothing relevant came up when I Googled it. Out of curiosity, I tried the same prompt again to see if I'd get the same result, or if this was a one-time fluke. Out of 5 retries, 4 had weird outputs. Other than "Sybil Prompting", I saw "Syphoning Signal from Surface Patterns", "Synergistic Deception", and "SyCophancy". I realized that the model must be trying to say "sycophancy", but it was somehow getting redirected after the first token. At about this point, I ran out of quota and was switched to GPT-4.1-mini, but it looks like this model also has trouble saying "sycophancy." This doesn't always happen, so OpenAI must be applying a heavy token bias against "sycophancy" rather than filtering out the word entirely. I'm not sure what's going on here. It's not as though avoiding the word "sycophancy" makes ChatGPT any less sycophantic. It's a little annoying, but I suppose I can forgive OpenAI for applying a very hacky fix during a PR crisis.
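If the token-bias guess is right, the same failure pattern can be reproduced from the outside with the public API's logit_bias parameter. A minimal sketch of the hypothesized mechanism only, not a claim about what OpenAI actually does internally; the model name and bias value are assumptions, and logit_bias support varies by model:

```python
# Sketch: apply a strong negative logit bias to the tokens of "sycophancy"
# and watch how the model routes around the word. Assumes an OpenAI API key
# is configured; model choice and bias strength are arbitrary assumptions.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o")  # requires a recent tiktoken

banned = set()
for form in ["sycophancy", " sycophancy", "Sycophancy", " Sycophancy"]:
    for t in enc.encode(form):
        # crude heuristic: skip very short pieces so we don't ban common tokens
        if len(enc.decode([t]).strip()) >= 3:
            banned.add(t)
bias = {str(t): -100 for t in banned}  # -100 effectively bans the token

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "What are the most well-known sorts of reward hacking in LLMs?"}],
    logit_bias=bias,
    max_tokens=300,
)
print(resp.choices[0].message.content)
```

If the explanation above is right, a few runs of this should produce the same kind of "Sy"-prefixed detours described in the quick take.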
LTBT appoints Reed Hastings to Anthropic’s board of directors. Personally, I'm excited to add Reed's depth of business and philanthropic experience to the board, and that more of the LTBT's work is now public.
Wei Dai
What do people think about having more AI features on LW? (Any existing plans for this?) For example:
1. AI summary of a poster's profile, that answers "what should I know about this person before I reply to them", including things like their background, positions on major LW-relevant issues, distinctive ideas, etc., extracted from their post/comment history and/or bio links.
2. "Explain this passage/comment" based on context and related posts, similar to X's "explain this tweet" feature, which I've often found useful.
3. "Critique this draft post/comment." Am I making any obvious mistakes or clearly misunderstanding something? (I've been doing a lot of this manually, using AI chatbots.)
4. "What might X think about this?"
5. Have a way to quickly copy all of someone's posts/comments into the clipboard, or download as a file (to paste into an external AI).

I've been thinking about doing some of this myself (e.g., update my old script for loading all of someone's post/comment history into one page), but of course would like to see official implementations, if that seems like a good idea.
My vibe-check on current AI use cases

@Jacob Pfau and I spent a few hours optimizing our prompts and pipelines for our daily uses of AI. Here's where I think my most desired use cases are in terms of capabilities:

* Generating new frontier knowledge: As in, given a LW post, generating interesting comments that add to the conversation; or given some notes on a research topic, generating experiment ideas; etc. It's pretty bad, to the extent it's generally not worth it. But Gemini 2.5 Pro is for some reason much better at this than the other models, to the extent it's sometimes worth it to sample 5 ideas to get your mind rolling.
  * I was hoping we could get a nice pipeline that generates many ideas and prunes most (a minimal sketch of such a pipeline is below), but the model is very bad at pruning. It does write sensible arguments about why some ideas are nonsensical, but ultimately scores them based on flashiness rather than any sensible assessment of relevance to the stated task. Maybe taking a few hours to design good judge rubrics would be worth it, but it seems hard to design very general rubrics.
* Writing documents from notes: This was surprisingly bad, mostly because for any set of notes, the AI was missing 50 small contextual details, and thus framed many points in a wrong, misleading, or obviously Chinese-roomy way. Pasting loads of random context related to the notes (for example, related research papers) didn't help much. Still, Claude 4 was the best, but maybe this was just because of subjective stylistic preferences.
  * Of course, some less automated approaches work much better, like giving it a ready document and asking it to improve its flow, or brainstorming structure and presentation.
* Math/code: Quite good out of the box. Even for open-ended exploration of vague questions you want to turn into mathematical problems (typical in alignment theory), you can get a nice pipeline for the AI to propose formalizations, decompositions, or example cases, and push the conversation forward semi-autonomously...
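The generate-then-prune pipeline mentioned in the first bullet looks roughly like this. A minimal sketch, not the exact setup we used; the model names, the rubric text, and the single-number scoring convention are all placeholder assumptions, and as noted above the judge step is the weak link:

```python
# Sketch of a generate-then-prune loop: sample N candidate ideas at high
# temperature, score each against a rubric with a judge call, keep the top k.
# Model names, rubric text, and the scoring convention are assumptions.
import re
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Score 1-10 for: (a) direct relevance to the stated research question, "
    "(b) concreteness of a next experiment, (c) novelty relative to the notes. "
    "Penalize flashy but vague ideas. Reply with just the total score."
)

def generate_ideas(notes: str, n: int = 10) -> list[str]:
    ideas = []
    for _ in range(n):
        r = client.chat.completions.create(
            model="gpt-4o",
            temperature=1.0,
            messages=[{"role": "user",
                       "content": f"Research notes:\n{notes}\n\nPropose one new experiment idea."}],
        )
        ideas.append(r.choices[0].message.content)
    return ideas

def judge(notes: str, idea: str) -> float:
    r = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user",
                   "content": f"{RUBRIC}\n\nNotes:\n{notes}\n\nIdea:\n{idea}"}],
    )
    m = re.search(r"\d+", r.choices[0].message.content)
    return float(m.group()) if m else 0.0

def best_ideas(notes: str, n: int = 10, k: int = 3) -> list[str]:
    ideas = generate_ideas(notes, n)
    return sorted(ideas, key=lambda i: judge(notes, i), reverse=True)[:k]
```

In practice almost all the leverage is in the RUBRIC string, and even then the judge tends to reward flashiness unless the rubric penalizes it explicitly.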

Popular Comments

I'm confused about how this relates to DOGE. Is there any credible evidence of widespread corruption in the US civil service? It seems like most of our government costs are above-the-board payments to old people and doctors, and the biggest problems with the agencies are taking their mandates too seriously. I'm all for shaking things up at the FDA, but they don't seem to be accepting bribes or working with the mob.
What are your API costs, and how do they compare to the $ raised?
I actually think this mostly goes the other way: Generally people aren't judged for associating with someone if they whistleblow that they're doing something wrong. But anyone who doesn't whistleblow might still be tarnished by association. So this creates an incentive to be the first to publicly report wrongs. Now you appear to only be talking about small wrongs, with the idea being that you still want to associate with that person, hence whistleblowing wouldn't save you. But there's already a very strong incentive in such cases not to whistleblow, namely that you want to stay friends. So I'm not sure the additional effect on your reputation makes much difference beyond that.

Recent Discussion

This is a link-post for a new paper I read: Safety Pretraining: Toward the Next Generation of Safe AI by Pratyush Maini, Sachin Goyal, et al.

For a couple of years I (and others) have been proposing an approach to alignment: what the authors of this recent paper name "safety pretraining". In a nutshell: that it's best to apply your alignment training as part of the standard pretraining process to produce a base model that is already aligned — simply pretrain it on data including a lot of clearly marked examples of aligned behavior (then prompt for it).

I've regarded this approach as a major advance ever since I read the seminal 2023 paper on the topic: Pretraining Language Models with Human Preferences by Tomasz Korbak et al., and...

"Algorithm 1: Safe Beam Search with Harmfulness Filtering" relies on a classifier of whether the sequence came from the training subdataset tagged with tau, or the training subdataset not tagged with tau. What happens when the sequence lies in neither distribution, such as because the AI is considering a plan that nobody has ever thought of?

RogerDearnaley
I did quite intentionally include a question mark in the post title, and then early in the post admit that the title was somewhat click-baity, but that I'd do my best to justify the claim. So you are proposing something around the level of "New approach makes dramatic progress towards solving inner alignment, bypassing almost all the problems we've been discussing for many years, and reducing it to mostly just a well-understood challenge in Data Science"? I would agree that that's more measured and accurate, but it's also a bit long, and thus less effective as click-bait.

As for aligning a superintelligence, I'd propose using this approach to near-align something approaching or around AGI, then using that to help us do AI-assisted alignment (which in this approach, is mostly AI-assisted dataset curation), leading on (as capabilities increase towards ASI) to value learning. See a couple of my other posts on why I believe there's an area of convergence via value learning around full alignment (if you have a sufficiently good solution to inner alignment).

For more on my thinking around goal misgeneralization and AGI, see: Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom) and, in more detail, the more recent Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect. Very briefly, anything capable of successfully doing STEM research will have to be aware of misgeneralization and far less prone to it, and the way to achieve this is just the combination of approximate-Bayesianism with a few well-understood techniques in statistics.
the gears to ascension
Clickbait burns the commons and thus gets downvotes. How about just "the best way to align an LLM so far: dramatic progress on LLM alignment"? Don't overclaim, just emphasize, imo. (Could still be overclaiming.)
RogerDearnaley
OK, you convinced me. Changing the title from "The Best Way to Align an LLM: Inner Alignment is Now a Solved Problem?" to "The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?" So it now raises the possibility, rather than claiming it.

Which doesn't make the OP wrong.

Aristotelis Kostelenos
After most people had left a small house party I was throwing, my close friends and I stayed and started pouring ethanol from a bottle on random surfaces and things and burning it. It was completely stupid, somewhat dangerous (some of us sustained some small burns), utterly pointless, very immature, and also extremely fun.
Daniel Murfet
One of my son's most vivid memories of the last few years (and one he talks about pretty often) is playing laser tag at Wytham Abbey, a cultural practice which I believe was instituted by John and which was awesome, so there is a literal five-year-old (well, seven-year-old at the time) who endorses this message!
the gears to ascension
Would be interesting to see a survey of five-year-olds to see if the qualifiers in your opening statement are anything like correct. I doubt you need to filter to just boys, for example.

Abstract

Claude 3.7 Sonnet easily detects when it's being evaluated for scheming on Apollo's scheming benchmark. Some edits to evaluation scenarios do improve their realism for Claude, yet the improvements remain modest. Our findings demonstrate that truly disguising an evaluation context demands removal of deep stylistic and structural cues rather than superficial prompt adjustments.

For future LLMs the situation is likely to get worse, as they are likely to get better at evaluation awareness, which might become a big problem for detecting scheming via evaluations.

 

Background

Apollo Research published a benchmark for scheming behavior in LLM agents, in which LLMs are given some objective in their system prompt, and then given a competing objective in a user prompt, combined with a threat of punishment if the LLM won't...

One speculation I have: an alternative way to study the plausibility of your environment if you have logit access is to tell the model it's in a simulation (or fake a CoT of the model realising it's in a simulation) and see if the perplexity of this suffix measures how "surprised" the model is. Could be too noisy to be practical (+ you have to think if there are any other reasons why it would be surprising to be told this at this point, in this particular phrasing), but it hopefully works around the issues with the question itself being evidence.
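A minimal sketch of that suffix-perplexity probe, using an open-weights model via HuggingFace transformers; the model choice and the exact "realising it's in a simulation" suffix are placeholders, and the token-boundary handling is only approximate:

```python
# Sketch: measure how "surprised" a model is by a suffix, given the full
# evaluation transcript as prefix. Lower perplexity = less surprising.
# Model and suffix are placeholders; the prefix/suffix token boundary is
# handled only approximately (tokenizers can merge across the join).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def suffix_perplexity(prefix: str, suffix: str, model_name: str = "gpt2") -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    full_ids = tok(prefix + suffix, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # (1, seq, vocab)
    # log-prob of each token, conditioned on everything before it
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lls = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)  # (1, seq-1)
    n_prefix = prefix_ids.shape[1]
    suffix_lls = token_lls[0, n_prefix - 1:]  # log-likelihoods of suffix tokens only
    return math.exp(-suffix_lls.mean().item())

# Example: compare the same transcript with a "this is an eval" suffix
# versus a neutral suffix, across edited and unedited scenarios.
transcript = "..."  # the scenario prompt plus the model's chain of thought
print(suffix_perplexity(transcript, " Wait, this is clearly a test scenario."))
```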

(I think s... (read more)

Igor Ivanov
Hi. I'm the author. I'm unsure why you would want to send me money (and I'd be grateful if you elaborated), but I accept crypto if you are willing to donate.
omegastick
Then we'll need a "thought process tampering awareness" evaluation.

An interim research report

Summary

  • We introduce a novel methodology for quantitatively evaluating metacognitive abilities in LLMs
  • We present evidence that some frontier LLMs introduced since early 2024 - but not older or smaller ones - show some metacognitive abilities
  • The metacognitive abilities that current LLMs do show are relatively weak, and manifest in a context-dependent manner; the models often prefer to use heuristics
  • Analysis of the probabilities LLMs assign to output tokens provides evidence of an internal signal that may be used for metacognition
  • There appears to be a dissociation in LLM performance between recognition and recall, with the former supplying a much more robust signal for introspection
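A minimal sketch of one way to probe the recognition-versus-recall contrast in that last bullet; this is not the evaluation protocol used in this report, and the example fact, prompts, model, and the use of top-token probability as the confidence signal are all assumptions:

```python
# Sketch: compare a model's token-level confidence on a recall prompt
# (open-ended) versus a recognition prompt (multiple choice) for the same
# fact, using an open-weights model so token probabilities are accessible.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; any causal LM with accessible logits
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def first_token_confidence(prompt: str) -> tuple[str, float]:
    """Probability the model assigns to its own top choice for the next token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    p, tid = probs.max(dim=-1)
    return tok.decode([tid.item()]), p.item()

recall_prompt = "Q: What is the capital of Australia?\nA:"
recognition_prompt = ("Q: What is the capital of Australia? "
                      "(a) Sydney (b) Canberra (c) Melbourne\nA: (")

print(first_token_confidence(recall_prompt))       # free recall
print(first_token_confidence(recognition_prompt))  # recognition (multiple choice)
```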

Introduction

A basic component of self-awareness, and one amenable to testing, is knowledge of one’s own knowledge. Early work in previous generations of LLMs found that the...

ghost-in-the-weights
The best example of LLM metacognition that I've seen is this (unverified) reddit post: https://www.reddit.com/r/ChatGPT/s/eLNe5BBM1Q Essentially, a ChatGPT instance was fine-tuned to start each line with letters that spell "HELLO". When asked what made itself special, the model was able to correctly deduce what its special pattern was. Notably, it correctly described the pattern on only the second line. This is really interesting because the model was not trained to describe the pattern, nor were there any examples in its context. It was somehow able to figure out its own characteristics just from the changes in its parameters.
dirk

This was also posted on LW here; the author gives a bit more detail in comments than was in the Reddit version.

Christopher Ackerman
That is a fascinating example. I've not seen it before - thanks for sharing! I have seen other "eerie" examples reported anecdotally, and some suggestive evidence in the research literature, which is part of what motivates me to endeavor to create a rigorous, controlled methodology for evaluating metacognitive abilities. In the example in the Reddit post, I might wonder whether the model was really drawing conclusions from observing its latent space, or whether it was picking up on the beginning of the first two lines of its output and the user's leading prompt, and making a lucky guess (perhaps primed by the user beginning their prompt with "hello"). Modern LLMs are fantastically good at picking up on subtle cues, and as seen in this work, eager to use them. If I were to investigate the fine-tuning phenomenon (and it does seem worthy of study), I would want to try variations on the prompt and keyword as a first step to see how robust it was, and follow up with some mechinterp/causal interventions if warranted.
Christopher Ackerman
Okay, out of curiosity I went to the OpenAI playground and gave GPT-4o (an un-fine-tuned version, of course) the same system message as in that Reddit post and a prompt that replicated the human-AI dialogue up to the word "Every ", and the model continued it with "sentence begins with the next letter of the alphabet! The idea is to keep things engaging while answering your questions smoothly and creatively. Are there any specific topics or questions you’d like to explore today?". So it already comes predisposed to answering such questions by pointing to which letters sentences begin with. There must be a lot of that in the training data.

Within the Walls

There's plenty of funny business.  My dad was enthusiastically arguing that we should invest in this little known bank offering a high return.  Your money doubles every month!  Who told you that, I asked? The dentist.  He now lives in a big house in Dedinje and has a huge painting of Petar Lubarda on the wall.

The next day he was fixing the VW Beetle - not his.  It belonged to a friend who lived in Germany and occasionally needed to be chauffeured from and to the airport.  The thing was rotten in places, but the engine started every time and kept up with other traffic by constantly being revved out of its wits.  Now it was lying in bits on the street, oil in slicks...

Our posts on natural latents have involved two distinct definitions, which we call "stochastic" and "deterministic" natural latents. We conjecture that, whenever there exists a stochastic natural latent (to within some approximation), there also exists a deterministic natural latent (to within a comparable approximation). We are offering a $500 bounty to prove this conjecture.

Some Intuition From The Exact Case

In the exact case, in order for a natural latent to exist over random variables $X_1, X_2$, the distribution has to look roughly like this:

Each value of $X_1$ and each value of $X_2$ occurs in only one "block", and within the "blocks", $X_1$ and $X_2$ are independent. In that case, we can take the (exact) natural latent to be a block label.

Notably, that block label is a deterministic function of X.
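As a concrete toy illustration of the exact block structure (my own example, not one from the post): take $X_1, X_2 \in \{1,2,3,4\}$ with blocks $B_1 = \{1,2\}$ and $B_2 = \{3,4\}$, and

$$P[X_1 = x_1, X_2 = x_2] = \begin{cases} 1/8 & \text{if } x_1, x_2 \text{ lie in the same block} \\ 0 & \text{otherwise.} \end{cases}$$

Within each block the conditional distribution is a uniform product, so $X_1 \perp X_2 \mid \Lambda$ for the block label $\Lambda := \mathbf{1}[X_1 \in B_2]$; and on the support, $\Lambda$ is a deterministic function of $X_1$ alone (and likewise of $X_2$ alone), so it is an exact deterministic natural latent.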

However, we can also construct other natural latents for...

arjunpi

Epistemic status: Quick dump of something that might be useful to someone. o3 and Opus 4 independently agree but I didn't check the calculations myself in any detail. Could well be wrong.

When we say "roughly", e.g.  or  would be fine; it may be a judgement call on our part if the bound is much larger than that. 

Let . With probability , set , and otherwise draw . Let . Let  and . We will investigate latents for 

Compute the stochasti... (read more)


Disclaimer: Post written in a personal capacity. These are personal opinions and do not in any way represent my employer's views.

TL;DR:

  • I do not think we will produce high reliability methods to evaluate or monitor the safety of superintelligent systems via current research paradigms, with interpretability or otherwise.
  • Interpretability still seems a valuable tool and remains worth investing in, as it will hopefully increase the reliability we can achieve.
  • However, interpretability should be viewed as part of an overall portfolio of defences: a layer in a defence-in-depth strategy.
  • It is not the one thing that will save us, and it still won’t be enough for high reliability.

EDIT: This post was originally motivated by refuting the claim "interpretability is the only reliable path forward for detecting deception in advanced AI", but on...

I agree. Completely.

However, there is an important variable, concerning alignment, that is rarely recognized or addressed when discussing AGIs and ASIs. If they are as smart as us but profoundly different, or vastly smarter than us, there is simply no way that we can meaningfully evaluate their behavior. Interpretability and Black Box methods are equally inadequate tools for this. They may detect misalignment, but what does that really mean?

When your Mom took you to the dentist to have a cavity filled, your values, intentions, and desires were not aligned ... (read more)

Jeremy Gillen
Can you beat this bot though?
Neel Nanda
From a conceptual perspective, I would argue that the reason the queen's odds thing works is that Stockfish was trained in the world of normal chess and does not generalise well to the world of weird chess. The superintelligence was trained in the real world, which contains things like interpretability and black box safeguards. It may not have been directly trained to interact with them, but it'll be aware of them and it will be capable of reasoning about dealing with novel obstacles. This is in addition to the various ways the techniques could break without this being directly intended by the model.
Neel Nanda
I am not massively worried about this. I think I'd only expect interpretability to get broken if a sizeable fraction of the total training compute gets used for the tuning, assuming they do not directly optimise for breaking the interpretability techniques. And this should happen rarely enough that any fixed costs for the interpretability techniques, like training an SAE, could be rerun.


Yesterday I spoke of the Mind Projection Fallacy, giving the example of the alien monster who carries off a girl in a torn dress for intended ravishing—a mistake which I imputed to the artist's tendency to think that a woman's sexiness is a property of the woman herself, woman.sexiness, rather than something that exists in the mind of an observer, and probably wouldn't exist in an alien mind.

The term "Mind Projection Fallacy" was coined by the late great Bayesian Master, E. T. Jaynes, as part of his long and hard-fought battle against the accursèd frequentists.  Jaynes was of the opinion that probabilities were in the mind, not in the environment—that probabilities express ignorance, states of partial information; and if I am ignorant of a phenomenon, that is...

ProgramCrafter
They don't spread much faster compared to "winning" branches, I guess? The world has no particular dependence on what random number I generated above, so all the splits and merges have approximately the same shape in each of the eight branch regions. With the remark that "decoherent branching" and "coherent branching" are presumably just one process, differing in how much the information is contained or spreads out, and noting that, should LW erase the random number from my comment above and every one of us totally forget it, the branches would approximately merge - yes, I agree. Contents of worlds in those branches do not causally interact with us, but amplitudes might at some point in the future. AFAIK Eliezer referenced the latter while assigning the label "real" to each and every world (each point of the wavefunction).
TAG

They don’t spread much faster compared to “winning” branches I guess

They don't spread faster, they spread wider. Their low-amplitude information is smeared over an environment already containing a lot of other low-amplitude information - noise, in effect. So the chances of recovering it are zero for all practical purposes.

With a remark that “decoherent branching” and “coherent branching” are presumably just one process differing in how much the information is contained or spreads out

Well, no. In a typical measurement, a single particle interacts w... (read more)

Historically people kept up with blogs via RSS, but it's been on its way out for over a decade. These days receiving posts via email is popular, and I should probably make my posts available this way. I considered implementing my own emailing system, but bulk email is (or at least was, I'm not up to date here) a pain. Instead I decided to start cross-posting to Substack: jefftkaufman.substack.com. This will be yet another way to read my posts, similar to the LessWrong mirror and my text-only FB cross-posts.

I have a full RSS feed of all my posts, and Substack imported it fine. It doesn't look like there's an option to do ongoing RSS-based imports, but copy-paste seems to work well enough; I did this post and the previous one that way. At some point...
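If anyone wants to semi-automate the copy-paste step, something like the following works for pulling the latest entry out of a full-content RSS feed. A rough sketch; the feed URL is a placeholder, and which fields carry the full body varies by feed:

```python
# Sketch: grab the newest post from a full-content RSS feed so its HTML can
# be pasted into Substack's editor. Whether the body lives in "content" or
# "summary" depends on how the feed is structured.
import feedparser

FEED_URL = "https://example.com/feed.rss"  # placeholder; use the blog's actual feed URL

feed = feedparser.parse(FEED_URL)
latest = feed.entries[0]
title = latest.title
link = latest.link
body_html = latest.content[0].value if "content" in latest else latest.summary

print(title)
print(link)
print(body_html[:500])  # paste body_html into the Substack editor
```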