Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Sequences

Alignment Hot Take Advent Calendar
Reducing Goodhart
Philosophy Corner

Comments

Thanks, this was pretty interesting.

A big problem is the free choice of "conceptual language" (universal Turing machine) when defining simplicity/comprehensibility. At various points you rely on the assumption that there is one unique scale of complexity (one ladder of ), and that it will be shared between the humans and the AI. That's not necessarily true, which creates a lot of leaks where the AI might do something that's simple in its own internal representation but complicated in the human's.
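
To make the "free choice of universal Turing machine" point quantitative (this is just the standard invariance theorem for Kolmogorov complexity, not anything from the post): for any two universal machines $U$ and $V$, $K_U(x) \le K_V(x) + c_{U,V}$ for all $x$, where the constant $c_{U,V}$ depends on the pair of machines but not on $x$. So two "conceptual languages" only have to agree about complexity up to an additive constant, and nothing keeps that constant small compared to the differences we actually care about - e.g. between "spots of paint" and "individual water droplets."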

It's OK to make cars pink by using paint ("spots of paint" is an easier to optimize/comprehend variable). It's not OK to make cars pink by manipulating individual water droplets in the air to create an elaborate rainbow-like illusion ("individual water droplets" is a harder to optimize/comprehend variable).

This raises a second problem, which is the "easy to optimize" criterion, and how it might depend on the environment and on what tech tree unlocks (both physical and conceptual) the agent already has. Pink paint is pretty sophisticated, even though our current society has commodified it so we can take getting some for granted. Starting from no tech tree unlocks at all, you can probably get to hacking humans before you can recreate the Sherwin Williams supply chain. But if we let environmental availability weigh on "easy to optimize," then the agent will be happy to switch from real paint to a hologram or a human-hack once the technology for those becomes developed and commodified.

When the metric is a bit fuzzy and informal, it's easy to reach convenient/hopeful conclusions about how the human-intended behavior is easy to optimize, but it should be hard to trust those conclusions.

I agree that trying to "jump straight to the end" - the supposedly-aligned AI pops fully formed out of the lab like Athena from the forehead of Zeus - would be bad.

And yet some form of value alignment still seems critical. You might prefer to imagine value alignment as the logical continuation of training Claude to not help you build a bomb (or commit a coup). Such safeguards seem like a pretty good idea to me. But as the model becomes smarter and more situationally aware, and is expected to defend against subversion attempts that involve more of the real world, training for this behavior becomes more and more value-inducing, to the point where it's eventually unsafe unless you make advancements in learning values in a way that's good according to humans.

Why train a helpful-only model?

If one of our key defenses against misuse of AI is good ol' value alignment - building AIs that have some notion of what a "good purpose for them" is, and will resist attempts to subvert that purpose (e.g. to instead exalt the research engineer who comes in to work earliest the day after training as god-emperor) - then we should be able to close the security hole and never need to have a helpful-only model produced at any point during training. In fact, with blending of post-training into pre-training, there might not even be a need to ever produce a fully trained predictive-only model.

I'm big on point #2 feeding into point #1.

"Alignment," used in a way where current AI is aligned - a sort of "it does basically what we want, within its capabilities, with some occasional mistakes that don't cause much harm" sort of alignment - is simply easier at lower capabilities, where humans can do a relatively good job of overseeing the AI, not just in deployment but also during training. Systematic flaws in human oversight during training leads (under current paradigms) to misaligned AI.

If you can tell me in math what that means, then you can probably make a system that does it. No guarantees on it being distinct from a more "boring" specification though.

Here's my shot: You're searching a game tree, and come to a state that has some X and some Y. You compute a "value of X" that's the total discounted future "value of Y" you'll get, conditional on your actual policy, relative to a counterfactual where you have some baseline level of X. And also you compute the "value of Y," which is the same except it's the (discounted, conditional, relative) expected total "value of X" you'll get. You pick actions to steer towards a high sum of these values.
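
If it helps, here's one way to write that circular definition down so it actually terminates - a toy sketch only, assuming made-up dynamics (policy_step), made-up baseline levels, and a depth cutoff to ground the mutual recursion; none of these names come from anywhere, they're just illustration.

```python
from dataclasses import dataclass

GAMMA = 0.9        # discount factor
HORIZON = 4        # depth cutoff so the mutual recursion terminates
BASELINE_X = 1.0   # counterfactual baseline level of X
BASELINE_Y = 1.0   # counterfactual baseline level of Y

@dataclass(frozen=True)
class State:
    x: float  # how much X you have
    y: float  # how much Y you have

def policy_step(state: State) -> State:
    """Stand-in for 'your actual policy': one step of toy dynamics."""
    return State(x=state.x + 0.1 * state.y, y=state.y + 0.1 * state.x)

def _rollout_value(state: State, depth: int, value_fn) -> float:
    """Sum of gamma^t * value_fn(state_t) along a policy rollout of length `depth`."""
    total, s = 0.0, state
    for t in range(depth):
        s = policy_step(s)
        total += GAMMA ** t * value_fn(s, depth - 1)
    return total

def value_of_x(state: State, depth: int = HORIZON) -> float:
    """Discounted total future 'value of Y' under the actual policy,
    relative to a counterfactual rollout starting from a baseline level of X."""
    if depth == 0:
        return state.y  # arbitrary choice: ground the recursion in the raw amount of Y
    actual = _rollout_value(state, depth, value_of_y)
    counterfactual = _rollout_value(State(BASELINE_X, state.y), depth, value_of_y)
    return actual - counterfactual

def value_of_y(state: State, depth: int = HORIZON) -> float:
    """Symmetric: discounted total future 'value of X', relative to a baseline level of Y."""
    if depth == 0:
        return state.x
    actual = _rollout_value(state, depth, value_of_x)
    counterfactual = _rollout_value(State(state.x, BASELINE_Y), depth, value_of_x)
    return actual - counterfactual

# Action selection then steers toward successor states with a high
# value_of_x(s) + value_of_y(s).
print(value_of_x(State(2.0, 1.0)) + value_of_y(State(2.0, 1.0)))
```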

A lot of the effect is picking high-hanging fruit.

Like, go to Phys Rev D now. There's clearly a lot of hard work still going on. But that hard work seems to be producing less result, because they're doing things like carefully calculating the trailing-order terms of the muon's magnetic moment to pin down a correction many decimal places in. (It turns out that this might be important for studying physics beyond the Standard Model, so this is good and useful work, definitely not literally stalled.)

Another chunk of the effect is that you generally don't know what's important now. In hindsight you can look back and see all these important bits of progress woven into a sensible narrative. But research that's being done right now hasn't had time to earn its place in such a narrative. Especially if you're an outside observer who has to get the narrative of research third-hand.

In the infographic, are the numbers for "Leading Chinese Lab" and "Best public model" swapped? The best public model is usually said to be ahead of the Chinese models.

EDIT: Okay, maybe most of it before the endgame is just unintuitive predictions. In the endgame, when the categories "best OpenBrain model," "best public model" and "best Chinese model" start to become obsolete, I think your numbers are weird for different reasons and maybe you should just set them all equal.

Scott Wolchok correctly calls out me, but also everyone else, for failing to make an actually good definitive existential risk explainer. It is a ton of work to do properly, but definitely worth doing right.

Reminder that https://ui.stampy.ai/ exists

The idea is interesting, but I'm somewhat skeptical that it'll pan out.

  • RG doesn't help much going backwards - the same coarse-grained laws might correspond to many different micro-scale laws, especially when you don't expect the micro scale to be simple. (There's a toy illustration of this after the list.)
  • Singular learning theory provides a nice picture of phase-transition-like phenomena, but it's likely that large neural networks undergo lots and lots of phase transitions, and that there's not just going to be one phase transition from "naive" to "naughty" that we can model simply.
  • Conversely, lots of important changes might not show up as phase transitions.
  • Some AI architectures are basically designed to be hard to analyze with RG because they want to mix information from a wide variety of scales together. Full attention layers might be such an example.
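
As a toy illustration of the first bullet (my example, not anything from the post): coarse-graining is many-to-one, so the coarse-grained description by itself can't pin down the micro-scale structure. The code below just block-averages two different micro configurations and checks they land on the same macro state.

```python
import numpy as np

# Two different micro-scale configurations.
micro_a = np.array([1, 0, 1, 0, 0, 1, 0, 1])
micro_b = np.array([0, 1, 0, 1, 1, 0, 1, 0])

def coarse_grain(m: np.ndarray, block: int = 2) -> np.ndarray:
    """Block-average coarse-graining: replace each block with its mean."""
    return m.reshape(-1, block).mean(axis=1)

# Same macro state, different micro states: the coarse-graining map isn't invertible.
assert np.allclose(coarse_grain(micro_a), coarse_grain(micro_b))
```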

If God has ordained some "true values," and we're just trying to find out what pattern has received that blessing, then yes this is totally possible, God can ordain values that have their most natural description at any level He wants.

On the other hand, if we're trying to find good generalizations of the way we use the notion of "our values" in everyday life, then no, we should be really confident that generalizations that have simple descriptions in terms of chemistry are not going to be good.
