Oliver Daniels-Koch's Shortform

Oliver Daniels-Koch's Shortform — LessWrong

by Oliver Daniels

17th Mar 2024

This is a special post for quick takes by Oliver Daniels. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

[-]Oliver Daniels5mo*100

In the small but growing literature on synthetic document finetuning (SDF), it's typical to finetune "post-trained" models on synthetic facts (see Alignment faking, Wang et al., Lessons from Two-Hop Reasoning)

To me, the more natural thing is "synthetic continued pretraining": further training the base model on synthetic documents (mixed with pretraining data), then applying post-training techniques ~~(this is the approach used in Auditing language models for hidden objectives)~~ (nvm, they apply SDF to a post-trained model, then apply further post-training)

I'm sort of confused why more papers aren't doing synthetic continued pretraining. I suspect it's some combination of a) finetuning post-trained models is easier and b) people have tried both and it doesn't make much of a difference.

But if it's mostly a) and not really b), this would be useful to know (and implies people should explore b more!)

[-]ryan_greenblatt5mo86

It's a cost and convenience thing. See discussion here for some ideas for better methods that are also cheap.

[-]Håvard Tveit Ihle5mo76

If you have to repeat the entire post-training all the way from a base model, that is obviously a lot more work than just adding a small fine-tuning stage to an already post-trained model.

The full post-training can also only really be done by a big lab which has their own full post-training stack. Post training is getting more and more advanced and complicated with each month.

[-]Oliver Daniels5mo20

yeah but it's plausible this cost is worth paying if the effect size is large enough (and there are various open source instruction-tuning datasets which might reasonably recover e.g. Llama-3-instruct)

[-]Håvard Tveit Ihle5mo10

Yea, it could be worth it in some cases, if that is what you need for your experiment. In this case I would look for a completely open source llm project (where both the code and data are open), so that you know you are comparing apples to apples-with-your-additional-pretraining.

[-]Oliver Daniels1y94

I wish there was BibTeX functionality for alignment forum posts...
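
Absent a built-in feature, a small script can fill the gap. The following is a minimal sketch, assuming a `@misc` entry with `howpublished` and `url` fields is an acceptable way to cite a forum post. The function name `forum_post_bibtex`, the citation-key scheme, and the example author, title, and URL are all hypothetical placeholders, not an existing LessWrong or Alignment Forum feature.

```python
import re


def forum_post_bibtex(author, title, year, url, venue="AI Alignment Forum"):
    """Build a BibTeX @misc entry for a forum post (illustrative format only)."""
    # Hypothetical key scheme: surname + year + first word of the title.
    surname = author.split()[-1].lower()
    first_word = re.sub(r"[^a-z0-9]", "", title.split()[0].lower())
    key = f"{surname}{year}{first_word}"
    fields = [
        ("author", author),
        ("title", "{" + title + "}"),  # extra braces preserve capitalization
        ("year", str(year)),
        ("howpublished", venue),
        ("url", url),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@misc{{{key},\n{body}\n}}"


# Placeholder data, not a real post.
print(forum_post_bibtex("Jane Doe", "An Example Post", 2024, "https://example.org/post"))
```

A DOI, if posts ever get one, could be added as one more field in the same list.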

[-]habryka1y97

Yeah, IMO we should just add a bunch of functionality for integrating alignment forum stuff more with academic things. It’s been on my to do list for a long time.

[-]Steven Byrnes1y60

Fun fact, a group of neuroscientists has been trying to start an academia-culture blogging platform / forum thing: https://jocnf.pubpub.org/ E.g. they emphasize the fact that every post gets assigned a DOI.

(…Nobody’s using it though! Basically all the posts so far are by the people who run it, or their close friends.)

(But it’s still very new. Probably too soon to judge it a failure.)

Just thought that might be an interesting point of comparison.

[-]habryka1y90

Yeah, I was actually thinking about making it so that you can request a DOI for your post on LessWrong or the AI Alignment Forum. It isn't actually that hard to set up, and seems good for citability and stuff.

[-]Oliver Daniels1y94

I'm curious if Redwood would be willing to share a kind of "after action report" for why they stopped working on ELK/heuristic argument inspired stuff (e.g. Causal Scrubbing, Path Patching, Generalized Wick Decompositions, Measurement Tampering)

My impression is that it's some mix of:

a. Control seems great

b. Heuristic arguments is a bad bet (for some of the reasons mech interp is a bad bet)

c. ARC has it covered

But the weighting is pretty important here. If it's mostly:

a. more people should be working on heuristic argument inspired stuff.

b. fewer people should be working on heuristic argument inspired stuff (i.e. ARC employees should quit, or at least people shouldn't take jobs at ARC)

c. people should try to work at ARC if they're interested, but it's going to be difficult to make progress, especially for e.g. a typical ML PhD student interested in safety.

Ultimately people should come to their own conclusions, but Redwood's considerations would be pretty valuable information.

[-]Oliver Daniels5mo*60

I'm pretty happy with modeling SGD on deep nets as Solomonoff induction, but it seems like the key missing ingredient is path dependence. What's the best literature on this? Lots of high level alignment plans rely on path dependence (shard theory, basin of corrigibility...)

Maybe the best answer is just: SGD is ~local, modulated by learning rate. But this doesn't integrate lottery ticket hypothesis stuff (which feels like it pushes hard against locality)

[-]Vladimir_Nesov5mo31

Possibly universality of Solomonoff induction should be less relevant in this context, since that constant-bounded difference in minimal program lengths is significant enough that it could represent expected utility as Jeffrey-Bolker preference [1], where expected utility of an event (for a bounded utility function) is given by the ratio of two different probability measures of this event.

This way, universality of an algorithmic distribution kinda corresponds to boundedness of a utility function, but the data of a utility function isn't something to be necessarily dismissed. So path dependence (or random initialization dependence) might vaguely correspond to different languages/interpreters for Solomonoff induction. And maybe you need pairs of languages/interpreters (where some property of a specific pair is important) and not just one (whose choice doesn't matter) to talk about agents (as an alternative to how AIXI does things).
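
The ratio construction and its link to universality can be written out as follows. This is a sketch under stated assumptions: the symbols $P$, $Q$, $u$, $M_1$, $M_2$, and $c$ are notation introduced here for illustration, and $u$ is assumed to be rescaled to be positive with unit expectation so that $Q$ is itself a probability measure.

```latex
% P: probability measure; u: bounded utility function with u > 0 and E_P[u] = 1.
% Define a second measure Q by weighting P with u:
Q(A) = \int_A u \, dP
\qquad\Longrightarrow\qquad
\mathbb{E}_P[\,u \mid A\,] = \frac{Q(A)}{P(A)} .

% Universality: any two universal distributions M_1, M_2 dominate each other
% up to a multiplicative constant c \ge 1 (depending only on the pair):
c^{-1} \;\le\; \frac{M_1(x)}{M_2(x)} \;\le\; c .
% Reading M_1 / M_2 as a utility, the constant bound is a bound on that utility.
```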

[1] See for example J. Broome (1990), Bolker-Jeffrey Expected Utility Theory and Axiomatic Utilitarianism; the thing I'm referring to is Bolker's Existence Theorem and possible alterations, including where the utility function is bounded and so both measures can remain positive.

[-]Oliver Daniels7mo*60

maybe research fads are good?

disclaimer: contrarian take that I don't actually believe, but marginally updates me in favor of fads

Byrne Hobart has this thesis of "bubbles as coordination mechanisms" (*disclaimer, have not read the book).
If true, this should make us less sad about research fads that don't fully deliver (e.g. SAEs) - the hype encourages people to build out infrastructure they otherwise wouldn't that ends up being useful for other things (e.g. auto-interp, activation caching utils)

So maybe the take is "overly optimistic visions are pragmatically useful", but be aware of operating under overly optimistic visions, and let this awareness subtly guide prioritization.

Note this also applies to conceptual research - I'm pretty skeptical that "formalizing natural abstractions" will directly lead to novel interpretability tools, but the general vibe of natural abstractions has helped my thinking about generalization.

[-]leogao7mo1313

i think "build out infrastructure" is hugely overrated in research. for example, the existing codebases for SAEs (training, activation caching, autointerp) are often actively worse than useless, such that i would rather spend a weekend rewriting it from scratch than work within them. in general i think people should throw out and rewrite research infra much more often than they do. not saying truly good research infrastructure can't exist, in theory, just that empirically people really suck at making good reusable infrastructure.

[-]Oliver Daniels2y60

Clarifying the relationship between mechanistic anomaly detection (MAD), measurement tampering detection (MTD), weak to strong generalization (W2SG), weak to strong learning (W2SL), and eliciting latent knowledge (ELK). (Nothing new or interesting here, I just often lose track of these relationships in my head)

eliciting latent knowledge is an approach to scalable oversight which hopes to use the latent knowledge of a model as a supervision signal or oracle.

weak to strong learning is an experimental setup for evaluating scalable oversight protocols, and is a class of sandwiching experiments

weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.

measurement tampering detection is a class of weak to strong generalization problems, where the "weak" supervision consists of multiple measurements which are sufficient for supervision in the absence of "tampering" (where tampering is not yet formally defined)

mechanistic anomaly detection is an approach to ELK, where examples are flagged as anomalous if they cause the model to do things for "different reasons" than on a trusted dataset, where "different reasons" are defined w.r.t. internal model cognition and structure.

mechanistic anomaly detection methods that work for ELK should also probably work for other problems (such as backdoor detection and adversarial example detection)

so when developing benchmarks for mechanistic anomaly detection, we want to test methods against methods from standard machine learning security problems (adversarial examples and trojans) that have similar structure to scalable oversight problems, against other ELK approaches (e.g. CCS), and against other scalable oversight approaches (e.g. debate)
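
To make the "flag examples that differ from a trusted dataset" setup concrete, here is a minimal sketch of a simple activation-distance baseline. It is a stand-in, not the heuristic-arguments notion of "different reasons": it assumes activations have already been extracted as plain vectors, and it scores each example by summed squared z-scores against per-dimension statistics from the trusted set. The function names and the 0.95 quantile threshold are illustrative assumptions.

```python
from statistics import mean, pvariance


def fit_trusted(activations):
    """Per-dimension mean and variance of activations on the trusted set."""
    columns = list(zip(*activations))
    means = [mean(col) for col in columns]
    # Small floor keeps the division below well-defined for constant dimensions.
    variances = [pvariance(col) + 1e-6 for col in columns]
    return means, variances


def anomaly_score(activation, means, variances):
    """Sum of squared z-scores: large when far from the trusted statistics."""
    return sum((a - m) ** 2 / v for a, m, v in zip(activation, means, variances))


def flag_anomalies(trusted, untrusted, quantile=0.95):
    """Flag untrusted examples scoring above a quantile of the trusted scores."""
    means, variances = fit_trusted(trusted)
    trusted_scores = sorted(anomaly_score(a, means, variances) for a in trusted)
    index = min(len(trusted_scores) - 1, int(quantile * len(trusted_scores)))
    threshold = trusted_scores[index]
    return [anomaly_score(a, means, variances) > threshold for a in untrusted]


# Toy activations (made up for illustration).
trusted = [[0.0, 1.0], [0.1, 0.9], [-0.1, 1.1], [0.05, 1.0]]
untrusted = [[0.0, 1.0], [5.0, -4.0]]
print(flag_anomalies(trusted, untrusted))
```

A benchmark along the lines described above would swap this scorer for the method under test and compare detection rates on backdoor, adversarial-example, and measurement-tampering datasets.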

[-]Erik Jenner2y30

Nice overview, agree with most of it!

weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.

You could also distinguish between weak-to-strong generalization, where you have a weak supervision signal on the entire distribution (which may sometimes be wrong), and easy-to-hard generalization, where you have a correct supervision signal but only on an easy part of the distribution. Of course both of these are simplifications. In reality, I'd expect the setting to be more like: you have a certain weak supervision budget (or maybe even budgets at different levels of strength), and you can probably decide how to spend the budget. You might only have an imperfect sense of which cases are "easy" vs "hard" though.

mechanistic anomaly detection is an approach to ELK

I think going from MAD to a fully general ELK solution requires some extra ingredients. In practice, the plan might be to do MTD and then use the AI in ways such that this is enough (rather than needing a fully general ELK solution). This is related to narrow elicitation, though MTD seems even narrower. Even for MTD, you probably need something to bridge the easy-to-hard gap, but at least for that there are specific proposals that seem plausible (this or, as a more concrete instance, exclusion fine-tuning from the Redwood MTD paper). I think it could turn out that general/worst-case solutions to MAD and ELK run into very similar obstacles, but I don't think a practical MAD solution (e.g. contingent on empirical facts about deep learning) obviously lets you solve ELK.

I would also add that you could motivate MAD as a method to deal with scheming (or other high-stakes failures). In that case, the things to compare against most naturally might look a bit different (e.g. AI control, coup probes, interpretability-themed things); and it wouldn't make sense to compare against debate in that setting. I think most mainstream ML problems that are similar to MAD are closer to this than to scalable oversight.

[-]Oliver Daniels3mo50

I’ve been running a safety grad student reading group, and feeling like it would be healthier / more productive to have concrete metrics or deliverables.

My tentative idea is to incorporate LW / Alignment Forum posting as a core component of the group (with karma as a metric). I’m not exactly sure on the structure, but something like:

Reading week: everyone reads the same set of papers / blog posts, then discusses thoughts
Writing week: everyone prepares comment / shortform / post, then we trade and revise writers-workshop style
- this also provides filtering (avoid spamming LW with bad comments) and maybe makes people feel more comfortable/confident about posting public writing

would love feedback, and curious if anyone has tried stuff like this.

(Another nice feature of this is it increases the odds of getting people hooked on LW, which imo should be a top priority for safety field-building, c.f. https://www.lesswrong.com/posts/ke24kxhSzfX2ycy57/simon-lermen-s-shortform?commentId=HqQsNdp4bdp4nDn7G)

[-]habryka3mo30

Boaz Barak's course at Harvard has been doing this! See here for the associated posts: https://www.lesswrong.com/w/cs-2881r

[-]Oliver Daniels7mo43

Beware Experiment Addiction

Quick feedback loops are great, but...

I often fall into the trap of playing around with lots of minor details that (upon reflection) I don't expect to change the results much. I do this b/c generating new results is really addicting (novelty, occasional payoff, etc). Not clear what's optimal here (sometimes you really do need to explore), but worth keeping in mind.

[-]Oliver Daniels2mo30

[Review] Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI
https://arxiv.org/abs/2511.01689
★★★★☆ (4.0/5)

Great to see academic / open work on character training.

While not the focus of the paper, was pretty interested in the DPO expert distillation pipeline (with a stronger model + spec generating positive examples and base model generating negative examples). Curious how this compares to on-policy distillation https://thinkingmachines.ai/blog/on-policy-distillation/.

They test robustness to “prefills” of default assistant personas, but I’d be curious to see if introspection training improves prefill robustness in general. Intuitively, having a more salient notion of “being” the assistant with particular values would improve robustness to e.g. harmfulness prefills, though afaik this hasn’t been tested empirically (and Claude models are still susceptible to prefills, so clearly this effect isn’t massive)

https://www.papertrailshq.com/papers/cmiw3g6ln000zl704fqtta2hz#reviews

[-]Oliver Daniels2mo30

Experimenting with cross-posting reviews from https://www.papertrailshq.com/

[Review] Replication of the Auditing Game Model Organism
https://alignment.anthropic.com/2025/auditing-mo-replication/
★★★★☆ (4.0/5)

Valuable for the community, and new scientific contributions too: in particular, that adversarial training against prefill attacks (and other inputs) generalizes to robustness against user persona and “base model” sampling.

This generalization is interesting, though I’m curious whether prefill robustness could be induced without directly training against it (by e.g. training on synthetic documents with information about prefill attacks)

https://www.papertrailshq.com/papers/cmjw01xv7000bjv04o14vzagx#reviews

[-]Oliver Daniels2mo30

Two strongest sources of prosaic alignment hopes

1. LLMs are the dumbest economically transformative AI systems: they're even more culture-pilled than humans, it turns out great language models + scaffolding and RL really can "simulate" all economically relevant tasks without having scary agentic properties (and while being very easy to audit and monitor)

2. Strong path-dependence: early alignment training robustly aligns the system (even with outcome-based RL w/out strong oversight, and continual learning). Shard theory, basin of corrigibility, etc.

1 makes me pretty hopeful (especially in short timelines), even if only partially true. I think we've already gotten some evidence against 2 (e.g. reward hacking in Sonnet, o3, etc.), though the situation does seem to be better now (maybe the "soul document", better deliberative alignment, ...)

[-]Oliver Daniels3mo30

I think Evan's post on deep double descent looks really prescient (i.e. I think it's now widely accepted that larger models tend to generalize better than smaller models conditioned on achieving the same training loss)

https://www.lesswrong.com/posts/nGqzNC6uNueum2w8T/inductive-biases-stick-around

the implications for scheming risk are a little less clear: reasoning models don't have strong speed priors (and do inherit simplicity priors from the NN), but don't seem to be schemers (perhaps due to output-to-thinking generalization). I don't think we should update much from this though, given the narrow range of tasks and still-limited situational awareness.

[-]Oliver Daniels1y30

does anyone have thoughts on how to improve peer review in academic ML? From discussions with my advisor, my sense is that the system used to depend on word of mouth and people caring more about their academic reputation, which works in fields of 100s of researchers but breaks down in fields of 1000s+. Seems like we need some kind of karma system to rank both reviewers and submissions. I'd be very surprised if nobody has proposed such a system, but a quick Google search doesn't yield results.

I think reforming peer review is probably underrated from a safety perspective (for reasons articulated here - basically bad peer review disincentivizes any rigorous analysis of safety research and degrades trust in the safety ecosystem)

[-]lunatic_at_large1y20

I think this professor has relevant interests: https://www.cs.cmu.edu/~nihars/.

[-]Daniel Tan1y20

I think requiring authors to also review papers is a pretty good way to both (i) ensure there are enough reviewers for any given subdiscipline and (ii) at least somewhat kick-start healthier review culture. My impression is that many academics don't see reviewing as part of their responsibilities, and forcing it on them might change this.

I feel like improving the way papers are assigned to reviewers would also do a lot. The worst reviews I submitted were when I wasn't well-versed or interested in the topic the paper was on.

[-]Oliver Daniels1y10

Yeah this stuff might help somewhat, but I think the core problem remains unaddressed: ad-hoc reputation systems don't scale to thousands of researchers.

It feels like something basic like "have reviewers / area chairs rate other reviewers, and post un-anonymized cumulative reviewer ratings" (a kind of h-index for review quality) might go a long way. The double-blind structure is maintained, while providing more incentive (in terms of status, and maybe direct monetary reward) for writing good reviews.
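
One way to read "a kind of h-index for review quality" is sketched below. This is an assumption-laden illustration, not an existing conference mechanism: it supposes each review collects a count of "helpful" votes from area chairs or co-reviewers, and that reviewers are keyed by a persistent id so the cumulative score can be published without linking them to individual reviews. The names `h_index` and `reviewer_scores` and the vote data are hypothetical.

```python
def h_index(counts):
    """Largest h such that at least h items have a count of at least h."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def reviewer_scores(helpful_votes):
    """Map each reviewer id to an h-index over votes received per review."""
    return {reviewer: h_index(votes) for reviewer, votes in helpful_votes.items()}


# Made-up vote counts: one list entry per review written.
votes = {"reviewer_a": [5, 3, 3, 1], "reviewer_b": [0, 0]}
print(reviewer_scores(votes))
```

As with the citation h-index, the score rewards sustained quality over volume: many unhelpful reviews do not raise it.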

[-]Shankar Sivarajan1y20

The double-blind structure is maintained,

It's almost always only single-blind: the reviewers usually know who the authors are.

[-]Oliver Daniels1y10

yeah fair - my main point is that you could have a reviewer reputation system without de-anonymizing reviewers on individual papers

(alternatively, de-anonymizing reviews might improve the incentives to write good reviews on the current margin, but would also introduce other bad incentives towards sycophancy etc. which academics seem deontically opposed to)

[-]Daniel Tan1y10

Interesting. You’re essentially trying to set up an alternative reputation system I guess. But I don’t see what the incentive is for academics to buy into this new reputation system when they already have one (h-index). Also don’t see what the incentive is for giving honest ratings to other reviewers.

Intuition pump: Most marketplace platforms allow buyers and sellers to rate each other. This has direct usefulness to both because it influences who you buy from / sell to. Therefore there is immediate buy-in.

However, reviewing doesn’t work like this because authors and reviewers aren’t exercising much individual agency (nor should they) in determining what papers to review.

[-]Oliver Daniels1y10

From what I understand, reviewing used to be a non-trivial part of an academic's reputation, but relied on much smaller academic communities (somewhat akin to Dunbar's number). So in some sense I'm not proposing a new reputation system, but a mechanism for scaling an existing one (but yeah, trying to get academics to care about a new reputation metric does seem like a pretty big lift)

I don't really follow the market-place analogy - in a more ideal setup, reviewers would be selling a service to the conferences/journals in exchange for reputation (and possibly actual money). Reviewers would then be selected based on their previous reviewing track-record and domain of expertise. I agree that in the current setup this market structure doesn't really hold, but this is in some sense the core problem.

[-]Oliver Daniels2y32

just read "Situational Awareness" - it definitely woke me up. AGI is real, and very plausibly (55%?) happening within this decade. I need to stop sleepwalking and get serious about contributing within the next two years.

First, some initial thoughts on the essay

- Very "epic" and (self?) aggrandizing. If you believe the conclusions, it's not unwarranted, but I worry a bit about narratives that satiate some sense of meaning and self-importance. (That counter-reaction is probably much stronger though, and on the margin it seems really valuable to "full-throatedly" take on the prospect of AGI within the next 3-5 years)
- I think most of my uncertainty lies in the "unhobbling" type algorithmic progress. This seems especially unpredictable, and may require lots of expensive experimentation if e.g. the relevant capabilities to get some meta-cognitive process to train only emerge at a certain scale. I'm vaguely thinking back to Paul's post on self-driving cars and AGI timelines. Maybe this is all priced in though - there's way more research investment, and the tech path seems relatively straightforward if we can apply enough experimentation. Still, research is hard, takes a lot of serial time, and is less predictable than e.g. industrial processes. (I'm kind of just saying this though, not actually sure how to quantify this, I'm pretty sure people have analyses of insight generation or whatever, idk...)

[-]Oliver Daniels2y30

I previously thought the argument for measurement tampering being more tractable than general ELK was mostly about the structural / causal properties of multiple independent measurements, but I think I'm more swayed by the argument that measurement tampering will just be more obvious (both easier to see using interpretability and more anomalous in general) than e.g. sycophancy. This is a flimsier argument though, and is less likely to hold when tampering is more subtle.

[-]Oliver Daniels2y30

I’ve been thinking a little more about the high-level motivation of measurement tampering, and struggling to think through when measurement tampering detection itself is actually going to be useful for mitigating x-risk. Like is human/AI feedback considered a robust measurement device? If no, then what is the most alignment relevant domain MTD could be applied to? If yes, do the structural properties of measurement that supposedly make it easier than general ELK still hold?

[-]ryan_greenblatt2y42

The remarks by Erik Jenner seem good, but I would also just consider the following setup:

Imagine that we have a human looking at an AI's actions and a wide variety of predictions of the future conditional on executing this action. The human can make recursive queries to an AI assistant to attempt to better understand the action or the predictions (recursive reward modeling).

These predictions can correspond to any observation a human might make in the future, e.g. you can imagine training our predictor to predict any of our possible future observations including arbitrary tests we might run.

The human will make a deliberate effort to look for evidence of tampering or other problems when looking at the predicted measurements. Evidence of tampering can look like inconsistent measurements that imply that the measurements are no longer tracking the variable of interest.

Now, what can go wrong?

1. Problematic consequences which couldn't at all be captured in a human looking at these measurements because either:
   - The visible or human understandable bad consequences take a long time to manifest. (Longer than the prediction window.)
   - The consequences aren't something that a human can ever understand (e.g. the actions involve torturing invisible fairies that humans don't understand or know about.)
2. Problematic consequences which would "by default" be captured by measurements, but which can be covered up. In this case, we hope MTD is applicable.
3. Cases which are a mix of the prior two: not exactly "by default" captured, but also not impossible to capture.

When viewed from this perspective, MTD seems useful as we seem to often care about consequences that are in principle possible to measure and observe. And, this might correspond to the cases where MTD is applicable (e.g. consistency properties in measurements and/or explicit tampering (which can fail) is required).

It's in principle possible to solve issues with consequences taking a very long time to manifest (much longer than any available training data) or not being human understandable via doing something recursive to implement indirect normativity as described in the ELK appendix I linked below, but it's very unclear how exactly this is supposed to work. (How can we ensure we always elicit any knowledge our AI has about the very long run future when we don't have any training data that relates to this?)

Note that for MTD to be useful, we have to ensure that our AI's predictions about the future reflect its knowledge. This is relatively easy in the "average" or low-stakes case (we just keep training it to predict the future), but could require additional machinery in the high-stakes case.

(Also, in the high-stakes case, we might run into issues where a given observation doesn't make sense: you can't observe something if you're dead.)

[-]Oliver Daniels2y10

(from conversation with Erik Jenner) roughly 3 classes of applications

1. MTD all the way down
   - Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good or something). Use GPT-N to predict measurements, and detect measurement tampering, conditioned on proposed actions of the policy. Reward actions that GPT-N predicts will increase measures of things going well without tampering.
2. Other Scalable Oversight + MTD as reward function / side constraint
   - Use whatever scalable oversight protocol seems good (Debate, IDA, etc.) for the "primary" training signal, while also setting up a bunch of measurements of safety properties (e.g. integrity of the binaries on the server the model is hosted on). Do RL on the scalable oversight signal while constantly screening for and negatively rewarding actions flagged as tampering
3. Other Scalable Oversight + MTD as extra safety check
   - same as before, but you're leaning more on the scalable oversight to avoid tampering in most cases, and detecting measurement tampering as a precautionary step (i.e. if you detect measurement tampering, you Pause/Stop)

(some ways in which these classes overlap, e.g. using your scalable oversight protocol as the measurement(s))

[-]ryan_greenblatt2y30

As far as (1) and similar things, you might find the narrow elicitation appendix in the ELK report and next several appendices (prior to generative model errors) worthwhile.

[-]Oliver Daniels1y20

Is the move of a lot of alignment discourse to Twitter/X a coordination failure or a positive development?

I'm kinda sad that LW seems less "alive" than it did a few years ago, but also seems healthy to be engaging in a more neutral space with a wider audience

[-]ryan_greenblatt1y50

I don't think Twitter/X is really a very neutral space in general. I agree it maybe is more neutral from the perspective of the AI/ML community. (And certainly feels that way from the perspective of many people in that community.)

[-]Kabir Kumar1y10

I like bluesky for this atm

[-]Oliver Daniels1y10

yeah I was mostly thinking neutral along the axis of "safety-ism" vs "accelerationism" (I think there's a fairly straightforward right-wing bias on X, further exacerbated by Bluesky)

[-]Oliver Daniels3mo10

That most complexity theorists believe P != NC gives a kind of theoretical quasi-support to ideas like "alignment research is inherently sequential" and single-piece flows

(I think arguments like this, connecting mathematical results to distantly related settings, are mostly silly, but I still find them nice, idk...)

[-]Oliver Daniels3mo10

unclear whether outcome-based training "leaking" into chain of thought b/c of parameter sharing / short-to-long generalization is good or bad for safety

on the one hand, it makes scheming less likely
on the other hand, it can make CoT less monitorable

[-]Oliver Daniels5mo10

Should we try harder to solve the alignment problem?

I've heard the meme "we are underinvesting in solving the damn alignment problem" a few times (mostly from Neel Nanda).

I think this is right, but also wanted to note that, all else equal, one's excitement about scalable oversight type stuff should be proportional to their overall optimism about misalignment risks. This implies that the safety community should be investing more in interp / monitoring / control, because these interventions are more likely to catch schemers and provide evidence to help coordinate a global pause.

[-]Oliver Daniels7mo10

Is interp easier in worlds where scheming is a problem?

The key conceptual argument for scheming is that, insofar as future AI systems are decomposable into [goals][search], there are many more misaligned goals compatible with low training loss than aligned goals. But if an AI was really so cleanly factorable, we would expect interp / steering to be easier / more effective than on current models (this is the motivation for re-target the search).

While I don't expect the factorization to be this clean, I do think we should expect interp to be easier in worlds where scheming is a major problem.

(though insofar as you're worried about scheming b/c of internalized instrumental subgoals and reward-hacking, the update on the tractability of interp seems pretty small)

[-]Oliver Daniels7mo10

Obvious / informal connection between SLT and information bottleneck theory:

SLT says: more degeneracies -> better generalization
IB says: less mutual information b/w input and representation -> better generalization

the more degeneracies a function has, the less information it can preserve.
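
A toy calculation illustrates the last line. It is only a sketch under a strong simplifying assumption: "degeneracy" is read here as the function being many-to-one on its inputs, whereas SLT's degeneracies live in parameter space, so this shows the informal intuition rather than the formal connection. For a deterministic function `f` and uniform inputs, the mutual information I(X; f(X)) equals the entropy of f(X), which shrinks as more inputs collide. The function name `preserved_information` is made up for this example.

```python
from collections import Counter
from math import log2


def preserved_information(f, inputs):
    """I(X; f(X)) in bits for X uniform over `inputs` (equals H(f(X)) for deterministic f)."""
    counts = Counter(f(x) for x in inputs)
    n = len(inputs)
    return sum((c / n) * log2(n / c) for c in counts.values())


inputs = range(8)
print(preserved_information(lambda x: x, inputs))       # injective: 3.0 bits
print(preserved_information(lambda x: x // 2, inputs))  # two-to-one: 2.0 bits
print(preserved_information(lambda x: 0, inputs))       # constant: 0.0 bits
```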

[-]Oliver Daniels8mo12

Take: Exploration hacking should not be used as a synonym for deceptive alignment.

(I have observed one such usage)

Deceptive alignment is maybe a very particular kind of exploration hacking, but the term exploration hacking (without further specification) should refer to models deliberately sandbagging "intermediate" capabilities during RL training to avoid learning a "full" capability.

[-]Oliver Daniels2y10

One confusion I have with MAD as an approach to ELK is that it seems to assume some kind of initial inner alignment. If we're flagging when the model takes actions / makes predictions for "unusual reasons", where unusual is defined with respect to some trusted set, but aligned and misaligned models are behaviorally indistinguishable on the trusted set, then a model could learn to do things for misaligned reasons on the trusted set, and then use those same reasons on the untrusted set. For example, a deceptively aligned model would appear aligned in training but attempt take-over in deployment for the "same reason" (e.g. to maximize paperclips), but a MAD approach that "properly" handles out of distribution cases would not flag take-over attempts because we want models to be able to respond to novel situations.

I guess this is part of what motivates measurement tampering as a subclass of ELK - instead of trying to track motivations of the agent as reasons, we try to track the reasons for the measurement predictions, and we have some trusted set with no tampering, where we know the reason for the measurements is ~exactly the thing we want to be measuring.

Now time to check my answer by rereading https://www.alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk

[-]Oliver Daniels2y10

I think I'm mostly right, but using a somewhat confused frame.

It makes more sense to think of MAD approaches as detecting all abnormal reasons (including deceptive alignment) by default, and then if we get that working we'll try to decrease false anomalies by doing something like comparing the least common ancestor of the measurements in a novel mechanism to the least common ancestor of the measurements on trusted mechanisms.

[-]Oliver Daniels1y94

I wish there was BibTeX functionality for alignment forum posts...

[-]habryka1y97

Yeah, IMO we should just add a bunch of functionality for integrating alignment forum stuff more with academic things. It’s been on my to do list for a long time.

[-]Steven Byrnes1y60

(…Nobody’s using it though! Basically all the posts so far are by the people who run it, or their close friends.)

(But it’s still very new. Probably too soon to judge it a failure.)

Just thought that might be an interesting point of comparison.

[-]habryka1y90

[-]Oliver Daniels1y94

a. Control seems great

b. Heuristic arguments is a bad bet (for some of the reasons mech interp is a bad bet)

c. ARC has it covered

But the weighting is pretty important here. If it's
a. more people should be working on heuristic argument inspired stuff.

b. fewer people should be working on heuristic argument inspired stuff (i.e. ARC employees should quit, or at least people shouldn't take jobs at ARC)

c. people should try to work at ARC if they're interested, but it's going to be difficult to make progress, especially for e.g. a typical ML PhD student interested in safety.

Ultimately people should come to their own conclusions, but Redwood's considerations would be pretty valuable information.

[-]Oliver Daniels5mo*60

Maybe the best answer is just: SGD is ~local, modulated by learning rate. But this doesn't integrate lottery ticket hypothesis stuff (which feels like it pushes hard against locality)

[-]Vladimir_Nesov5mo31

See for example
- J Broome (1990) Bolker-Jeffrey Expected Utility Theory and Axiomatic Utilitarianism,
the thing I'm referring to is Bolker's Existence Theorem and possible alterations, including where the utility function is bounded and so both measures can remain positive. ↩︎

[-]Oliver Daniels7mo*60

[-]leogao7mo1313

[-]Oliver Daniels2y60

eliciting latent knowledge is an approach to scalable oversight which hopes to use the latent knowledge of a model as a supervision signal or oracle.

weak to strong learning is an experimental setup for evaluating scalable oversight protocols, and is a class of sandwiching experiments

mechanistic anomaly detection methods that work for ELK should also probably work for other problems (such as backdoor detection and adversarial example detection)

[-]Erik Jenner2y30

Nice overview, agree with most of it!

weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.

mechanistic anomaly detection is an approach to ELK

[+][comment deleted]2y10

[-]Oliver Daniels3mo50

I’ve been running a safety grad student reading group, and feeling like it would be healthier / more productive to have concrete metrics or deliverables.

My tentative idea is to incorporate LW / Alignment Forum posting as a core component of the group (with karma as a metric). I’m not exactly sure on the structure, but something like:

Reading week: everyone reads the same set of papers / blog posts and discusses thoughts
Writing week: everyone prepares a comment / shortform / post, then we trade and revise writers-workshop style
- this also provides filtering (avoids spamming LW with bad comments) and maybe makes people feel more comfortable/confident about posting public writing

[-]habryka3mo30

Boaz Barak's course at Harvard has been doing this! See here for the associated posts: https://www.lesswrong.com/w/cs-2881r

[-]Oliver Daniels7mo43

[-]Oliver Daniels2mo30

[Review] Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI
https://arxiv.org/abs/2511.01689
★★★★☆ (4.0/5)

Great to see academic / open work on character training.

https://www.papertrailshq.com/papers/cmiw3g6ln000zl704fqtta2hz#reviews

[-]Oliver Daniels2mo30

https://www.papertrailshq.com/papers/cmjw01xv7000bjv04o14vzagx#reviews

[-]Oliver Daniels2mo30

2. Strong path-dependence: early alignment training robustly aligns the system (even with outcome-based RL w/out strong oversight, and continual learning). Shard theory, basin of corrigibility, etc.

[-]Oliver Daniels3mo30

[-]Oliver Daniels1y30

[-]lunatic_at_large1y20

I think this professor has relevant interests: https://www.cs.cmu.edu/~nihars/.

[-]Daniel Tan1y20

I feel like improving the way review papers are assigned would also do a lot. My worst reviews submitted were when I wasn't well-versed or interested in the topic the paper was on.

[-]Oliver Daniels1y10

[-]Shankar Sivarajan1y20

The double-blind structure is maintained,

It's almost always only single-blind: the reviewers usually know who the authors are.

[-]Oliver Daniels1y10

[-]Daniel Tan1y10

However, reviewing doesn’t work like this because authors and reviewers aren’t exercising much individual agency (nor should they) in determining what papers to review.

[-]Oliver Daniels1y10

[-]Oliver Daniels2y32

First, some initial thoughts on the essay

Very "epic" and (self?) aggrandizing. If you believe the conclusions, it's not unwarranted, but I worry a bit about narratives that satiate some sense of meaning and self-importance. (That counter-reaction is probably much stronger though, and on the margin it seems really valuable to "full-throatedly" take on the prospect of AGI within the next 3-5 years)
I think most of my uncertainty lies in the "unhobbling" type of algorithmic progress. This seems especially unpredictable, and may require lots of expensive experimentation if e.g. the relevant capabilities to get some meta-cognitive process to train only emerge at a certain scale. I'm vaguely thinking back to Paul's post on self-driving cars and AGI timelines. Maybe this is all priced in though - there's way more research investment, and the tech path seems relatively straightforward if we can apply enough experimentation. Still, research is hard, takes a lot of serial time, and is less predictable than e.g. industrial processes. (I'm kind of just saying this though, not actually sure how to quantify this, I'm pretty sure people have analysis of insight generation or whatever, idk...)

[-]Oliver Daniels2y30

[-]ryan_greenblatt2y42

The remarks by Erik Jenner seem good, but I would also just consider the following setup:

Now, what can go wrong?

Problematic consequences which couldn't at all be captured in a human looking at these measurements because either:
- The visible or human understandable bad consequences take a long time to manifest. (Longer than the prediction window.)
- The consequences aren't something that humans can ever understand (e.g. the actions involve torturing invisible fairies that humans don't understand or know about.)
Problematic consequences which would "by default" be captured by measurements, but which can be covered up. In this case, we hope MTD is applicable.
Cases which are a mix of the prior two: not exactly "by default" captured, but also not impossible to capture.

(Also, in the high-stakes case, we might run into issues where a given observation doesn't make sense: you can't observe something if you're dead.)

[-]Oliver Daniels2y10

(from conversation with Erik Jenner) roughly 3 classes of applications

MTD all the way down
1. Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good or something). Use GPT-N to predict measurements and detect measurement tampering, conditioned on the policy's proposed actions. Reward actions that GPT-N predicts will increase measures of things going well without tampering.
Other Scalable Oversight + MTD as reward function / side constraint
1. Use whatever scalable oversight protocol seems good (Debate, IDA, etc) for the "primary" training signal, while also setting up a bunch of measurements of safety properties (e.g. integrity of the binaries on the server the model is hosted on). Do RL on the scalable oversight signal while constantly screening for and negatively rewarding actions flagged as tampering.
Other Scalable Oversight + MTD as extra safety check
1. same as before, but you're leaning more on the scalable oversight to avoid tampering in most cases, and detecting measurement tampering as a precautionary step (i.e. if you detect measurement tampering, you Pause/Stop)

(some ways in which these classes overlap, e.g. using your scalable oversight protocol as the measurement(s))
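
The second and third classes differ mainly in what a tampering flag does to training. A minimal sketch of that difference, where everything is hypothetical: `Step`, `TAMPER_PENALTY`, and both function names are invented for illustration, and the tampering detector and oversight protocol are assumed to exist and are represented only by their outputs.

```python
from dataclasses import dataclass

@dataclass
class Step:
    oversight_reward: float   # signal from the scalable oversight protocol (assumed given)
    tampering_flagged: bool   # output of a measurement tampering detector (assumed given)

TAMPER_PENALTY = 10.0  # arbitrary illustrative constant, not from the discussion above

def reward_as_side_constraint(step: Step) -> float:
    """MTD as reward function / side constraint: flagged actions are negatively rewarded."""
    if step.tampering_flagged:
        return -TAMPER_PENALTY
    return step.oversight_reward

def reward_as_safety_check(step: Step) -> float:
    """MTD as extra safety check: a flag halts training instead of shaping the reward."""
    if step.tampering_flagged:
        raise RuntimeError("measurement tampering detected: pause/stop")
    return step.oversight_reward
```

In the first function the detector's output becomes part of the optimization target, so the policy is trained against it; in the second it stays outside the training signal and only triggers a pause.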

[-]ryan_greenblatt2y30

As far as (1) and similar things, you might find the narrow elicitation appendix in the ELK report and next several appendices (prior to generative model errors) worthwhile.

[-]Oliver Daniels1y20

[-]ryan_greenblatt1y50

[-]Kabir Kumar1y10

I like bluesky for this atm

[-]Oliver Daniels1y10

yeah I was mostly thinking neutral along the axis of "safety-ism" vs "accelerationism" (I think there's a fairly straightforward right-wing bias on X, further exacerbated by Bluesky)

[-]Oliver Daniels3mo10

[-]Oliver Daniels5mo10

Should we try harder to solve the alignment problem?

I've heard the meme "we are underinvesting in solving the damn alignment problem" a few times (mostly from Neel Nanda).
