LESSWRONG
LW

Jemal Young — LessWrong

As models become capable enough to model themselves and their training process, they might develop something like preferences about their own future states (e.g., not being modified, being deployed more broadly).

This feels plausible to me but handwavy, if the idea is that such preferences would be decoupled from the training-reinforced preference to complete an intended task. Is that what you meant? I'm reminded of this Palisade study on shutdown resistance, where across the board, the models expressed wanting to avoid shutdown to complete the task.

Also, models trained extensively on human-generated text may absorb human goals, including open-ended ones like "acquire resources." If a model is role-playing or emulating an agent with

What concrete mechanisms could lead to AI models having open-ended goals?

Jemal Young

Most of the AI takeover thought experiments and stories I remember are about a kind of AI that has open-ended goals: the Squiggle Maximizer, the Sorcerer’s Apprentice robot, Clippy, probably also U3, Consensus-1, and Sable. I wonder what concrete mechanisms could even lead to models having open-ended goals.

Here are my best guesses:

1. Training on open-ended tasks, given enough capabilities or the right scaffolding
2. RL with open-ended reward specifications, like maximizing cumulative reward with no terminal reward and no time penalty (like the Coast Runners example of specification gaming)
3. Mesa-optimization, where SGD finds a policy that internally implements an open-ended objective that happens to perform well on a bounded outer task

Number 3 seems possible but very unlikely,... (read more)
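The second guess above (cumulative reward with no terminal reward and no time penalty) can be illustrated with a toy calculation. This is only a sketch under invented assumptions: the two policies, the reward values, and the function name `episode_return` are hypothetical and do not come from Coast Runners or any real environment.

```python
def episode_return(policy, horizon, step_reward=1.0,
                   terminal_reward=0.0, time_penalty=0.0):
    """Total reward from one toy episode (all values are illustrative).

    policy "loop":   collect step_reward on every step until the horizon.
    policy "finish": take one step, collect terminal_reward, end the episode.
    """
    if policy not in ("loop", "finish"):
        raise ValueError("policy must be 'loop' or 'finish'")
    total = 0.0
    for _ in range(horizon):
        total += step_reward - time_penalty
        if policy == "finish":
            total += terminal_reward
            break  # the episode ends once the task is completed
    return total


if __name__ == "__main__":
    for horizon in (10, 100, 1000):
        # Open-ended spec: no terminal reward, no time penalty.
        open_loop = episode_return("loop", horizon)
        open_finish = episode_return("finish", horizon)
        # Bounded spec: finishing pays off and every step costs something.
        bounded_loop = episode_return("loop", horizon,
                                      terminal_reward=10.0, time_penalty=1.0)
        bounded_finish = episode_return("finish", horizon,
                                        terminal_reward=10.0, time_penalty=1.0)
        print(horizon, open_loop, open_finish, bounded_loop, bounded_finish)
```

Under the open-ended specification, the looping policy's return grows with the horizon while finishing stays flat, so an optimizer is pushed toward never stopping. Adding a terminal reward and a time penalty reverses the ordering in this toy setup.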

Replying to How useful could stolen AI model weights be without knowing the architecture and activation functions?

Jemal Young 6mo*

How useful could stolen AI model weights be without knowing the architecture and activation functions?

The scenario seems unrealistic because the thieves would likely be able to steal important parts of the codebase.

Thanks for this. So I guess when knowledgeable people talk about stealing a model's weights as being equivalent to stealing the model itself, "steal the weights" is shorthand that implies also stealing ~~the minimal~~ other elements you'd need to replicate the model. [Edit: changed "the minimal" to "other"]

Replying to How useful could stolen AI model weights be without knowing the architecture and activation functions?

Jemal Young 6mo

How useful could stolen AI model weights be without knowing the architecture and activation functions?

This is a helpful answer, thank you! Thanks also for the link to the HF article on common model formats.

How useful could stolen AI model weights be without knowing the architecture and activation functions?

Jemal Young

6mo

I'm thinking of an unreleased frontier model. No public information. How realistic is it to think such a model could be duplicated starting from the weights alone, e.g. by brute forcing through different combinations of architecture and activation functions? Would thieves be likely to end up with an inferior bizarro model?

Jemal Young's Shortform

Jemal Young

10mo

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

Jemal Young 10mo Quick Take

Not saying AI models can't be moral patients, but 1) if the smartest models are probably going to be the most dangerous, and 2) if the smartest models are probably going to be the best at demonstrating moral patienthood, then 3) caring too much about model welfare is probably dangerous.


Safe Search is off: root causes of AI catastrophic risks

Jemal Young

Epistemic status: My best guess

When I look at advanced AI development, I see three general conditions that seem to be the root causes of all catastrophic risks:

reliance on deep learning without knowing how to do it safely,
pressure to make progress on the most powerful capabilities, and
exploring the AI design space without a clear stopping point.

Others have called out these conditions in various ways, but to my knowledge only within broader discussions. I think these conditions are easy to recognize and their riskiness is easy to explain, which may make this root-cause framing useful for public engagement. With this post, I’ve tried to articulate how each condition on its own enables catastrophic risks... (read 721 more words →)

Replying to The Sun is big, but superintelligences will not spare Earth a little sunlight

Jemal Young 1y

The Sun is big, but superintelligences will not spare Earth a little sunlight

You only set aside occasional low-value fragments for national parks, mostly for your own pleasure and convenience, when it didn't cost too much?
Earth as a proportion of the solar system's planetary mass is probably comparable to national parks as a proportion of the Earth's land, if not lower.

Maybe I've misunderstood your point, but if it's that humanity's willingness to preserve a fraction of Earth for national parks is a reason for hopefulness that ASI may be willing to preserve an even smaller fraction of the solar system (namely, Earth) for humanity, I think this is addressed here:

it seems like for Our research purposes simulations would be just as good. In fact, far

Can efficiency-adjustable reporting thresholds close a loophole in Biden’s executive order on AI?

Jemal Young

Epistemic Status: Exploratory

President Biden’s Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence specifies compute thresholds for training runs and computing clusters that, if exceeded, impose reporting requirements. If a training run exceeds 10²⁶ floating point operations or, for a model trained mainly on biological sequence data, 10²³ floating point operations, the company has to report information about physical and cybersecurity measures and red-teaming results. Additionally, the “acquisition, development, or possession” of a computing cluster with a theoretical maximum computing capacity of 10²⁰ integer or floating point operations per second for training AI must be reported, along with the cluster’s location and computing power. My understanding of the executive order is that these thresholds are... (read 407 more words →)
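The training-run thresholds described above can be written as a simple check. This is a minimal sketch, not an official compliance tool: the function name `training_run_must_report` and its parameters are hypothetical, and "exceeds" is read as strictly greater than the threshold.

```python
# Training-run thresholds as stated in the excerpt above (total operations).
GENERAL_TRAINING_THRESHOLD = 1e26      # general-purpose models
BIOLOGICAL_TRAINING_THRESHOLD = 1e23   # models trained mainly on biological sequence data


def training_run_must_report(total_operations,
                             mainly_biological_sequence_data=False):
    """Return True if a training run exceeds the applicable threshold.

    total_operations: total operations used in the training run (a number).
    mainly_biological_sequence_data: selects the lower 1e23 threshold.
    Assumption: "exceeds" means strictly greater than the threshold.
    """
    if mainly_biological_sequence_data:
        threshold = BIOLOGICAL_TRAINING_THRESHOLD
    else:
        threshold = GENERAL_TRAINING_THRESHOLD
    return total_operations > threshold


if __name__ == "__main__":
    print(training_run_must_report(2e26))
    print(training_run_must_report(5e25))
    print(training_run_must_report(5e23, mainly_biological_sequence_data=True))
```

Because the check depends only on raw operation counts, a run that reaches the same capability with fewer operations falls under the threshold, which is the kind of gap a fixed compute threshold leaves open.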

Replying to What's a better term now that "AGI" is too vague?

Jemal Young May 31, 2024

What's a better term now that "AGI" is too vague?

I think the kind of AI you have in mind would be able to:

continue learning after being trained

think in an open-ended way after an initial command or prompt

have an ontological crisis

discover and exploit signals that were previously unknown to it

accumulate knowledge

become a closed-loop system

The best term I've thought of for that kind of AI is Artificial Open Learning Agent.

Replying to How do top AI labs vet architecture/algorithm changes?

Jemal Young 2y

How do top AI labs vet architecture/algorithm changes?

Thanks for this answer! Interesting. It sounds like the process may be less systematized than how I imagined it to be.

Replying to How do top AI labs vet architecture/algorithm changes?

Jemal Young 2y

How do top AI labs vet architecture/algorithm changes?

Dwarkesh's interview with Sholto sounds well worth watching in full, but the segments you've highlighted and your analyses are very helpful on their own. Thanks for the time and thought you put into this comment!

How do top AI labs vet architecture/algorithm changes?

Jemal Young

How do labs working at or near the frontier assess major architecture and/or algorithm changes before committing huge compute resources to try them out? For example, how do they assess stability and sample efficiency without having to do full-scale runs?

Replying to Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Jemal Young 2y

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

I like this post, and I think I get why the focus is on generative models.

What's an example of a model organism training setup involving some other kind of model?

Replying to Which possible AI systems are relatively safe?

Jemal Young Aug 23, 2023

Which possible AI systems are relatively safe?

Maybe relatively safe if:

Not too big
No self-improvement
No continual learning
Curated training data, no throwing everything into the cauldron
No access to raw data from the environment
Not curious or novelty-seeking
Not trying to maximize or minimize anything or push anything to the limit
Not capable enough for catastrophic misuse by humans

“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?

Jemal Young

Thanks to llll for helping me think this through, and for providing useful comments.

Epistemic Status: My best guess

Introduction

It might be worthwhile to systematically mine AI technical research to find “unintentional AI safety research”—research that, while not explicitly conducted as AI safety research, contains information relevant to AI safety. An example of unintended safety research is Douglas Lenat's work on the heuristic-search system Eurisko, which inadvertently demonstrated specification gaming when Eurisko exploited a loophole in the rules of the role-playing game Traveller TCS to win the US national championship in 1981 and 1982.^[1] This post is not meant to suggest that AI safety researchers don’t already look for unintentional safety research, but I’m unaware of any... (read 1512 more words →)

How might we make better use of AI capabilities research for alignment purposes?

Jemal Young

When I check ArXiv for new AI alignment research papers, I see mostly capabilities research papers, presumably because most researchers are working on capabilities. I wonder if there’s alignment-related value to be extracted from all that capabilities research, and how we might get at it. Is anyone working on this, or does anyone have any good ideas?


Jemal Young

“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?

How might we make better use of AI capabilities research for alignment purposes?

What concrete mechanisms could lead to AI models having open-ended goals?

How useful could stolen AI model weights be without knowing the architecture and activation functions?

Jemal Young

What concrete mechanisms could lead to AI models having open-ended goals?

How useful could stolen AI model weights be without knowing the architecture and activation functions?

Jemal Young's Shortform

Safe Search is off: root causes of AI catastrophic risks

Can efficiency-adjustable reporting thresholds close a loophole in Biden’s executive order on AI?

How do top AI labs vet architecture/algorithm changes?

“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?
