Shallow review of technical AI safety, 2025
Website version · Gestalt · Repo and data

Figure: Change in 18 latent capabilities between GPT-3 and o1, from Zhou et al (2025).

This is the third annual review of what's going on in technical AI safety. You could stop reading here and instead explore the data on the shallow review website.

It's shallow in the sense that 1) we are not specialists in almost any of it and 2) we only spent about two hours on each entry. Still, among other things, we processed every arXiv paper on alignment, all Alignment Forum posts, as well as a year's worth of Twitter.

It is substantially a list of lists structuring 800 links. The point is to produce stylised facts, forests out of trees; to help you look up what's happening, or that thing you vaguely remember reading about; to help new researchers orient, know some of their options and the standing critiques; and to help you find who to talk to for actual information. We also track things which didn't pan out.

Here, "AI safety" means technical work intended to prevent future cognitive systems from having large unintended negative effects on the world. So it's capability restraint, instruction-following, value alignment, control, and risk awareness work. We don't cover security or resilience at all.

We ignore a lot of relevant work (including most of capability restraint): things like misuse, policy, strategy, OSINT, resilience and indirect risk, AI rights, general capabilities evals, and things closer to "technical policy" and products (like standards, legislation, SL4 datacentres, and automated cybersecurity). We focus on papers and blogposts (rather than, say, gdoc samizdat or tweets or Githubs or Discords). We only use public information, so we are off by some additional unknown factor. We try to include things which are early-stage and illegible – but in general we fail and mostly capture legible work on legible problems (i.e. things you can already write a paper on).

Even ignoring all of that as we do, it's still