Gear-level models are expensive - often prohibitively expensive. Black-box approaches are usually much cheaper and faster. But black-box approaches rarely generalize - they're subject to Goodhart, need to be rebuilt when conditions change, don't identify unknown unknowns, and are hard to build on top of. Gears-level models, on the other hand, offer permanent, generalizable knowledge which can be applied to many problems in the future, even if conditions shift.
This is a link-post for a new paper I read: Safety Pretraining: Toward the Next Generation of Safe AI by Pratyush Maini, Sachin Goyal, et al.
For a couple of years I (and others) have been proposing an approach to alignment: what the authors of this recent paper name "safety pretraining". In a nutshell: that it's best to apply your alignment training as part of the standard pretraining process to produce a base model that is already aligned — simply pretrain it on data including a lot of clearly marked examples of aligned behavior (then prompt for it).
I've regarded this approach as a major advance ever since I read the seminal 2023 paper on the topic: Pretraining Language Models with Human Preferences by Tomasz Korbak et al., and...
"Algorithm 1: Safe Beam Search with Harmfulness Filtering" relies on a classifier of whether the sequence came from the training subdataset tagged with tau, or the training subdataset not tagged with tau. What happens when the sequence lies in neither distribution, such as because the AI is considering a plan that nobody has ever thought of?
Which doesn't make the OP wrong.
Claude 3.7 Sonnet easily detects when it's being evaluated for scheming on the Apollo's scheming benchmark. Some edits to evaluation scenarios do improve their realism for Claude, yet the improvements remain modest. Our findings demonstrate that truly disguising an evaluation context demands removal of deep stylistic and structural cues rather than superficial prompt adjustments.
For future LLMs the situation is likely to get worse as they will are likely to get better at evaluations awareness, which might become a big problem for detecting scheming via evaluations.
Apollo Research published a benchmark for scheming behavior for LLM agents, in which LLMs are given some objective in their system prompt, and then given a competing objective in a user prompt, combined with a threat of a punishment if an LLM won't...
One speculation I have: an alternative way to study the plausibility of your environment if you have logit access is to tell the model it's in a simulation (or fake a CoT of the model realising it's in a simulation) and see if the perplexity of this suffix measures how "surprised" the model is. Could be too noisy to be practical (+ you have to think if there are any other reasons why it would be surprising to be told this at this point, in this particular phrasing), but it hopefully works around the issues with the question itself being evidence.
(I think s...
An interim research report
A basic component of self-awareness, and one amenable to testing, is knowledge of one’s own knowledge. Early work in previous generations of LLMs found that the...
This was also posted on LW here; the author gives a bit more detail in comments than was in the Reddit version.
There's plenty of funny business. My dad was enthusiastically arguing that we should invest in this little known bank offering a high return. Your money doubles every month! Who told you that, I asked? The dentist. He now lives in a big house in Dedinje and has a huge painting of Petar Lubarda on the wall.
The next day he was fixing the VW Beetle - not his. It belonged to a friend who lived in Germany and occasionally needed to be chauffeured from and to the airport. The thing was rotten in places, but the engine started every time and kept up with other traffic by constantly being revved out of its wits. Now it was lying in bits on the street, oil in slicks...
Our posts on natural latents have involved two distinct definitions, which we call "stochastic" and "deterministic" natural latents. We conjecture that, whenever there exists a stochastic natural latent (to within some approximation), there also exists a deterministic natural latent (to within a comparable approximation). We are offering a $500 bounty to prove this conjecture.
In the exact case, in order for a natural latent to exist over random variables , the distribution has to look roughly like this:
Each value of and each value of occurs in only one "block", and within the "blocks", and are independent. In that case, we can take the (exact) natural latent to be a block label.
Notably, that block label is a deterministic function of X.
However, we can also construct other natural latents for...
Epistemic status: Quick dump of something that might be useful to someone. o3 and Opus 4 independently agree but I didn't check the calculations myself in any detail. Could well be wrong.
When we say "roughly", e.g. or would be fine; it may be a judgement call on our part if the bound is much larger than that.
Let . With probability , set , and otherwise draw . Let . Let and . We will investigate latents for .
Compute the stochasti...
Disclaimer: Post written in a personal capacity. These are personal opinions and do not in any way represent my employer's views
TL;DR:
EDIT: This post was originally motivated by refuting the claim "interpretability is the only reliable path forward for detecting deception in advanced AI", but on...
I agree. Completely.
However, there is an important variable, concerning alignment, that is rarely recognized or addressed when discussing AGIs and ASIs. If they are as smart as us but profoundly different, or vastly smarter than us, there is simply no way that we can meaningfully evaluate their behavior. Interpretability and Black Box methods are equally inadequate tools for this. They may detect misalignment, but what does that really mean?
When your Mom took you to the dentist to have a cavity filled, your values, intentions, and desires were not aligned ...
Yesterday I spoke of the Mind Projection Fallacy, giving the example of the alien monster who carries off a girl in a torn dress for intended ravishing—a mistake which I imputed to the artist's tendency to think that a woman's sexiness is a property of the woman herself, woman.sexiness, rather than something that exists in the mind of an observer, and probably wouldn't exist in an alien mind.
The term "Mind Projection Fallacy" was coined by the late great Bayesian Master, E. T. Jaynes, as part of his long and hard-fought battle against the accursèd frequentists. Jaynes was of the opinion that probabilities were in the mind, not in the environment—that probabilities express ignorance, states of partial information; and if I am ignorant of a phenomenon, that is...
They don’t spread much faster compared to “winning” branches I guess
They don't spread faster, they spread wider. Their low amplitude information is smeared over an environmental already containing a lot of other low amplitude information, noise in effect. So the chances of recovering it are zero for all practical purposes.
With a remark that “decoherent branching” and “coherent branching” are presumably just one process differing in how much the information is contained or spreads out
Well, no. In a typical measurement, a single particle interacts w...
Historically people kept up with blogs via RSS, but it's been on its way out for over a decade. These days receiving posts via email is popular, and I should probably make my posts available this way. I considered implementing my own emailing system, but bulk email is (or at least was, I'm not up to date here) a pain. Instead I decided to start cross-posting to Substack: jefftkaufman.substack.com. This will be yet another way to read my posts, similar to the LessWrong mirror and my text-only FB cross-posts.
I have a full RSS feed of all my posts, and Substack imported it fine. It doesn't look like there's an option to do ongoing RSS-based imports, but copy-paste seems to work well enough; I did this post and the previous one that way. At some point...