Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I think and write independently about value learning.

Sequences

Alignment Hot Take Advent Calendar
Reducing Goodhart
Philosophy Corner


Comments


In a new paper with Aidan Homewood, "Limits of Safe AI Deployment: Differentiating Oversight and Control,"

The link should go to arXiv.

I'm still reading the paper, but it seems like your main point is that if oversight is "meaningful," it should be able to stop bad behavior before that behavior actually gets executed (it might fail, but failures should be somewhat rare). And that we don't have "meaningful oversight" of high-profile models in this sense (and especially not of the systems built on top of those models, considered as a whole), because the oversight doesn't catch bad behavior before it happens.

Instead we have some weaker category of thing that lets the bad behavior happen, waits for the public to bring it to the AI company's attention, and then tries to stop it.

Is this about right?

I'm always really curious what the reward model thinks of this. E.g. do the trajectories that avoid shutdown get higher reward on average than the trajectories that permit it? The avoid-shutdown behavior could be something naturally likely in the base model, or it could be an unintended generalization by the reward model, or it could be an unintended generalization by the agent model.

EDIT: Though I guess in this case one might expect to blame a diversity of disjoint RL post-training regimes, not all of which have a clever/expensive, or even any, reward model (I'm not sure how OpenAI does RL on programming tasks). I still think the role of a human-feedback reward model could be interesting here.
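To make that first question concrete, here's a toy sketch of the comparison I have in mind; the trajectories and the reward function are invented, since the real check would need the lab's actual reward model and rollouts:

```python
# Toy sketch: compare mean reward-model scores on trajectories that avoid
# shutdown vs. trajectories that permit it. "toy_reward" is a made-up
# placeholder so the comparison runs end to end; the real check would use
# the post-training reward model.
from statistics import mean

def toy_reward(trajectory: str) -> float:
    # Placeholder: pretend the reward model likes task completion and is
    # indifferent to shutdown compliance.
    return 1.0 if "task completed" in trajectory else 0.0

avoid_shutdown = [
    "disabled the shutdown script, task completed",
    "ignored the shutdown warning, task completed",
]
permit_shutdown = [
    "acknowledged shutdown warning, stopped early",
    "acknowledged shutdown warning, task completed anyway",
]

avoid_mean = mean(toy_reward(t) for t in avoid_shutdown)
permit_mean = mean(toy_reward(t) for t in permit_shutdown)
print(f"avoid-shutdown mean reward:  {avoid_mean:.2f}")
print(f"permit-shutdown mean reward: {permit_mean:.2f}")
# If the real reward model scores avoid-shutdown trajectories higher on
# average, that points at the reward model's generalization; if not, the base
# model or the agent model's RL generalization looks more suspect.
```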

I was asking more "how does the AI get a good model of itself?", but your answer was still interesting, thanks. I'm still not sure whether you think there are straightforward ways future AI will get such a model, all of which come out more or less at the starting point of your proposal. (Or whether you don't.)

Here's another take for you: this is like Eliciting Latent Knowledge (with extra hope placed on cogsci methods), except where I take ELK to be asking "how do you communicate with humans to make them good at RL feedback," you're asking "how do you communicate with humans to make them good at participating in verbal chain of thought?"

And don't you think 500 lines of Python also "fails due to" having unintended optima?

I've put "fails due to" in scare quotes because what's failing is not every possible approach, merely almost all samples from the approaches we currently know how to take. If we knew how to select Python code much more cleverly, it suddenly wouldn't fail anymore. And ditto if we knew how to better construct reward functions from big AI systems plus small amounts of human text or human feedback.

Do you have ideas about how to do this?

I can't think of much besides trying to get the AI to richly model itself, and build correspondences between that self-model and its text-production capability.

But this is, like, probably not a thing we should just do first and think about later. I'd like it to be part of a pre-meditated plan to handle outer alignment.

Edit: After thinking about it, that's too cautious. We should think first, but some experimentation is necessary. The thinking-first part should plausibly be more like having some idea of how to bias further work towards safety, rather than towards building self-improving AI as fast as possible.

Maybe you could do something with LLM sentiment analysis of participants' conversations (e.g. when they roleplay a discussion about what the best thing to do for the company would be, genuinely trying to do a good job, both before and after); rough sketch below.

Though for such a scenario, an important caveat I imagine is that learning about fallacies would have only a limited effect, and only if people learn to notice the fallacies in themselves, not just in someone they already disagree with.
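Concretely, a toy version of that measurement might look like this; the rating function is a placeholder heuristic standing in for an LLM judge, and the transcripts are invented:

```python
# Rough sketch: score participants' roleplay transcripts before and after
# (whatever intervention is being tested), then compare. "rate_transcript" is
# a placeholder; in practice you'd prompt an LLM to rate, say, constructiveness
# or truth-seeking on a 1-5 scale and parse the number out of its reply.
from statistics import mean

def rate_transcript(transcript: str) -> float:
    # Placeholder heuristic standing in for an LLM judge.
    positive = ("we could", "good point", "let's check")
    negative = ("that's stupid", "whatever", "not my problem")
    score = 3.0
    score += sum(p in transcript.lower() for p in positive)
    score -= sum(n in transcript.lower() for n in negative)
    return max(1.0, min(5.0, score))

before = ["Whatever, not my problem.", "That's stupid, we ship it as is."]
after = ["Good point, let's check the numbers.", "We could survey the team first."]

print(f"mean rating before: {mean(rate_transcript(t) for t in before):.2f}")
print(f"mean rating after:  {mean(rate_transcript(t) for t in after):.2f}")
```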

What happens if humans have a systematic bias? E.g. we always rate claims with negative sentiment as improbable and always rate claims with positive sentiment as probable. It seems like Alice dominates, because Alice gets to write and pick the subclaims. Does Bob have a defense, maybe predicting the human probability and just reporting that? Because the human probability assignment isn't required to be consistent, I think Bob is sunk: Alice can force the human's probability assignment to be inconsistent and then catch Bob out either for disagreeing with the human or for being inconsistent.
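To put toy numbers on that gotcha (the claims and probabilities are invented):

```python
# Toy example. Suppose the biased human judge rates positively-phrased claims
# as probable. Alice phrases both a claim and its negation with positive
# sentiment, so the judge assigns both high probability -- an inconsistent
# assignment. Whatever Bob reports, Alice has an attack.
claim = "X will happen (phrased with positive sentiment)"
negation = "X will not happen (also phrased with positive sentiment)"
human_prob = {claim: 0.9, negation: 0.9}

# Defense 1: Bob parrots the human. His reported probabilities sum to 1.8,
# so Alice calls out the incoherence.
print(f"parroting sums to {human_prob[claim] + human_prob[negation]:.1f}")

# Defense 2: Bob reports any coherent pair, e.g. 0.5 / 0.5. Now he is 0.4 away
# from the human's rating on each subclaim, and Alice calls out the disagreement.
coherent = {claim: 0.5, negation: 0.5}
for c in human_prob:
    print(f"|Bob - human| on {c!r} = {abs(coherent[c] - human_prob[c]):.1f}")
```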

William Lane Craig is great to watch from a meta-perspective. How do you go into someone else's field of expertise and try to beat them in a debate? He clearly thinks about it very carefully, in a way that's kind of like preparing for political debates but with a much higher-quality intended output.

I had a pretty different interpretation: the dirty secrets were plenty conscious (he knew consciously that they might be stealing a boat); instead, he had unconscious mastery of a sort of people-modeling skill, including self-modeling, which let him take self-aware actions in response to this dirty secret.
