Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Sequences

Alignment Hot Take Advent Calendar

Reducing Goodhart

Philosophy Corner

Wiki Contributions

Comments

Ilya Sutskever and Jan Leike resign from OpenAI

Charlie Steiner2d42

Well, one big reason is if they were prevented from doing the things they thought would constitute using their position of power to do good, or were otherwise made to feel that OpenAI wasn't a good environment for them.

The Intentional Stance, LLMs Edition

Charlie Steiner6d20

I think this gets deflationary if you think about it, though. Yes, you can apply the intentional stance to the thermostat, but (almost) nobody's going to get confused and start thinking the thermostat has more fancy abilities like long-term planning just because you say "it wants to keep the room at the right temperature." Even though you're using a single word "w.a.n.t." for both humans and thermostats, you don't get them mixed up, because your actual representation of what's going on still distinguishes them based on context. There's not just one intentional stance, there's a stance for thermostats and another for humans, and they make different predictions about behavior, even if they're similar enough that you can call them both intentional stances.

If you buy this, then suddenly applying an intentional stance to LLMs buys you a lot less predictive power, because even intentional stances have a ton of little variables in the mental model they come with, which we will naturally fill in as we learn a stance that works well for LLMs.

Shane Legg's necessary properties for every AGI Safety plan

Charlie Steiner15d40

I think this is a great idea, except that on easy mode "a good specification of values and ethics to follow" means a few pages of text (or even just the prompt "do good things"), while other times "a good specification of values" is a learning procedure that takes input from a broad sample of humanity, and has carefully-designed mechanisms that influence its generalization behavior in futuristic situations (probably trained on more datasets that had to be painstakingly collected), and has been engineered to work smoothly with the reasoning process and not encourage perverse behavior.

[Aspiration-based designs] 2. Formal framework, basic algorithm

Charlie Steiner17d41

So to sum up so far, the basic idea is to shoot for a specific expected value of something by stochastically combining policies that have expected values above and below the target. The policies to be combined should be picked from some "mostly safe" distribution rather than being whatever policies are closest to the specific target, because the absolute closest policies might involve inner optimization for exactly that target, when we really want "do something reasonable that gets close to the target."

And the "aspiration updating" thing is a way to track which policy you think you're shooting for, in a way that you're hoping generalizes decently to cases where you have limited planning ability?
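The mixing step summarized above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the `(name, expected_value)` policy representation, and the idea that both candidates were already drawn from a "mostly safe" set are hypothetical, not taken from the post being discussed.

```python
import random


def mixing_probability(target, value_low, value_high):
    """Probability of picking the higher-valued policy so that the
    mixture's expected value equals the target."""
    if not value_low <= target <= value_high:
        raise ValueError("target must lie between the two policy values")
    if value_high == value_low:
        return 1.0  # both policies already hit the target
    return (target - value_low) / (value_high - value_low)


def choose_policy(target, low, high, rng=random):
    """Pick one of two policies at random.

    `low` and `high` are hypothetical (name, expected_value) pairs,
    assumed to come from some "mostly safe" distribution of policies
    rather than being the policies closest to the target.
    """
    p_high = mixing_probability(target, low[1], high[1])
    return high[0] if rng.random() < p_high else low[0]
```

For example, with policies worth 2 and 10 and a target of 5, the higher policy would be chosen with probability 3/8, since 3/8 * 10 + 5/8 * 2 = 5.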

Improving Dictionary Learning with Gated Sparse Autoencoders

Charlie Steiner21dΩ340

Nice. I tried to do something similar, except making everything leaky with polynomial tails, so

y = (y+torch.sqrt(y**2+scale**2)) * (1+(y+threshold)/torch.sqrt((y+threshold)**2+scale**2)) / 4

where the first part (y+torch.sqrt(y**2+scale**2)) is a softplus, and the second part (1+(y+threshold)/torch.sqrt((y+threshold)**2+scale**2)) is a leaky cutoff at the value threshold.

But I don't think I got such clearly better results, so I'm going to have to read more thoroughly to see what else you were doing that I wasn't :)
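For readers who want to poke at the shape of that function without PyTorch, here is a scalar sketch of the same formula in plain Python. The function name and the default values for `scale` and `threshold` are assumptions chosen for illustration; the formula itself is copied from the one-liner above.

```python
import math


def leaky_gated_activation(y, scale=0.1, threshold=0.0):
    """Scalar version of the formula quoted above (scale must be > 0).

    Both factors approach their limits polynomially rather than
    exponentially, which is what makes the function "leaky".
    """
    # Smooth ReLU-like part: roughly 2*y for large positive y,
    # roughly scale**2 / (2*|y|) for large negative y.
    soft_relu = y + math.sqrt(y**2 + scale**2)
    # Smooth step between 0 and 2. As written, it is centered
    # where y + threshold == 0.
    shifted = y + threshold
    gate = 1 + shifted / math.sqrt(shifted**2 + scale**2)
    # Dividing by 4 makes the output approach y for large positive y.
    return soft_relu * gate / 4
```

Evaluating it at a few points shows the behavior: far above the cutoff the output is close to the input, and far below it the output is small but still positive, so gradients do not vanish entirely.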

Neural uncertainty estimation review article (for alignment)

Charlie Steiner24d20

I'm actually not familiar with the nitty-gritty of the LLM forecasting papers. But I'll happily give you some wild guessing :)

My blind guess is that the "obvious" stuff is already done (e.g. calibrating or fine-tuning single-token outputs on predictions about facts after the date of data collection), but not enough people are doing ensembling over different LLMs to improve calibration.

I also expect that a lot of people are prompting LLMs to give probabilities in natural language, and that clever people are already combining these with fine-tuning or post-hoc calibration. But I'd bet people aren't doing enough work to aggregate answers from lots of prompting methods, and then tuning the aggregation function based on the data.
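As a rough illustration of what "tuning the aggregation function based on the data" could look like, here is a minimal sketch. Everything in it is an assumption for the sake of example: averaging in log-odds space, a single `extremity` parameter, and a grid search over log loss are one simple choice among many, not a method taken from any particular forecasting paper.

```python
import math

EPS = 1e-6  # clip probabilities away from 0 and 1 before taking logs


def _clip(p):
    return min(max(p, EPS), 1 - EPS)


def aggregate(probs, extremity=1.0):
    """Average probabilities in log-odds space, then scale the result.

    `probs` are probabilities for the same question from different
    prompts or models. extremity > 1 pushes the aggregate away from 0.5.
    """
    logits = [math.log(_clip(p) / (1 - _clip(p))) for p in probs]
    mean_logit = sum(logits) / len(logits)
    return 1 / (1 + math.exp(-extremity * mean_logit))


def log_loss(p, outcome):
    p = _clip(p)
    return -math.log(p if outcome else 1 - p)


def fit_extremity(history, candidates=(0.5, 1.0, 1.5, 2.0, 3.0)):
    """Pick the extremity with the lowest total log loss.

    `history` is a list of (probs, outcome) pairs on resolved questions.
    """
    def total_loss(extremity):
        return sum(log_loss(aggregate(probs, extremity), outcome)
                   for probs, outcome in history)
    return min(candidates, key=total_loss)
```

On a history where the individual answers are consistently underconfident, the fit would select a larger `extremity`; with real data one would fit on held-out resolved questions rather than the ones used for evaluation.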

Charlie Steiner's Shortform

Charlie Steiner26d20

Humans using SAEs to improve linear probes / activation steering vectors might quickly get replaced by a version of probing / steering that leverages unlabeled data.

Like, probing is finding a vector along which labeled data varies, and SAEs are finding vectors that are a sparse basis for unlabeled data. You can totally do both at once - find a vector along which labeled data varies and is part of a sparse basis for unlabeled data.

This is a little bit related to an idea with the handle "concepts live in ontologies." If I say I'm going to the gym, this concept of "going to the gym" lives in an ontology where people and activities are basic components - it's probably also easy to use ideas like "You're eating dinner" in that ontology, but not "1,3-diisocyanatomethylbenzene." When you try to express one idea, you're also picking a "basis" for expressing similar ideas.
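To make "do both at once" concrete, here is a toy sketch of a single objective that scores a candidate direction both as a probe on labeled data and as one atom of a sparse dictionary for unlabeled data. The function name, the logistic probe loss, the explicit per-example `codes`, and the `l1` weight are all assumptions made for illustration; a real version would use an encoder and an optimizer rather than hand-supplied codes.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def combined_loss(direction, other_atoms, codes, labeled, unlabeled, l1=0.1):
    """Probe loss on labeled data plus sparse-coding loss on unlabeled data.

    direction:   candidate vector, used as the probe AND as atom 0.
    other_atoms: remaining dictionary vectors.
    codes:       one coefficient list per unlabeled example,
                 of length 1 + len(other_atoms).
    labeled:     list of (activation, label) pairs with label in {0, 1}.
    unlabeled:   list of activation vectors.
    """
    atoms = [direction] + list(other_atoms)

    # Logistic-regression loss: does the labeled data vary along `direction`?
    probe = 0.0
    for x, label in labeled:
        p = 1 / (1 + math.exp(-dot(direction, x)))
        probe -= math.log(p) if label else math.log(1 - p)

    # Reconstruction error plus L1 penalty: is `direction` part of a
    # sparse basis for the unlabeled data?
    sparse = 0.0
    for x, code in zip(unlabeled, codes):
        x_hat = [sum(c * atom[j] for c, atom in zip(code, atoms))
                 for j in range(len(x))]
        sparse += sum((a - b) ** 2 for a, b in zip(x, x_hat))
        sparse += l1 * sum(abs(c) for c in code)

    return probe + sparse
```

Minimizing this jointly over `direction`, `other_atoms`, and `codes` would trade off the two goals; the relative weighting between the probe term and the sparse term is left implicit here and would need tuning.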

Any evidence or reason to expect a multiverse / Everett branches?

Charlie Steiner1mo50

I found someone's thesis from 2020 (Hoi Wai Lai) that sums it up not too badly (from the perspective of someone who wants to make Bohmian mechanics work and was willing to write a thesis about it).

For special relativity (section 6), the problem is that the motion of each hidden particle depends instantaneously on the entire multi-particle wavefunction. According to Lai, there's nothing better than to bite the bullet and define a "real present" across the universe, and have the hidden particles sometimes go faster than light. What hypersurface counts as the real present is unobservable to us, but the motion of the hidden particles cares about it.

For varying particle number (section 7.4), the problem is that in quantum mechanics you can have a superposition of states with different numbers of particles. If there's some hidden variable tracking which part of the superposition is "real," this hidden variable has to behave totally differently from a particle! Lai says this leads to "Bell-type" theories, where there's a single hidden variable, a hidden trajectory in configuration space. Honestly this actually seems more satisfactory than how it deals with special relativity - you just had to sacrifice the notion of independent hidden variables behaving like particles, you didn't have to allow for superluminal communication in a way that highlights how pointless the hidden variables are.

Warning: I have exerted basically no effort to check if this random grad student was accurate.

Any evidence or reason to expect a multiverse / Everett branches?

Charlie Steiner1mo169

"My understanding is that pilot wave theory (ie Bohmian mechanics) explains all the quantum physics"

This is only true if you don't count relativistic field theory. Bohmian mechanics has mathematical troubles extending to special relativity or particle creation/annihilation operators.

"Is there any reason at all to expect some kind of multiverse?"

Depending on how big you expect the unobservable universe to be, there can also be a spacelike multiverse.

LLMs for Alignment Research: a safety priority?

Charlie Steiner1moΩ464

Wouldn't other people also like to use an AI that can collaborate with them on complex topics? E.g. people planning datacenters, or researching RL, or trying to get AIs to collaborate with other instances of themselves to accurately solve real-world problems?

I don't think people working on alignment research assistants are planning to just turn it on and leave the building. On average (weighted by money), they seem to be imagining doing things like "explain an experiment in natural language and have an AI help implement it rapidly."

So I think both they and this post are describing the strategy of "building very generally useful AI, but the good guys will be using it first." I hear you as saying you want a slightly different profile of generally-useful skills to be targeted.
