David Johnston

Comments

models will have access to some kind of "neuralese" that allows them to reason in ways we can't observe

Only modest confidence, but while there's an observability gap between neuralese and CoT monitoring, I suspect it's smaller than the gap between reasoning traces that haven't been trained against oversight and reasoning traces that have.

  1. I mean, even if you're mostly pursuing a particular set of final values (which is not what you're advocating here), there are probably strong reasons to make coordination a high priority (which is close to what you're advocating here).

  2. Well, I did say "to the extent permitted by 1" - there's probably conflict here - but I wasn't suggesting CEV as something that makes coordination easy. I'm saying it's a good principle for judging final outcomes between two different paths that have similar levels of coordination. Of course we'd have to estimate the "happiness in hindsight", but this looks tractable to me.

I've thought about it a bit and I have a line of attack for a proof, but there's too much work involved in following it through to an actual proof, so I'm going to leave it here in case it helps anyone.

I'm assuming everything is discrete so I can work with regular Shannon entropy.

Consider the range of the function and defined similarly. Discretize and (chop them up into little balls). Not sure which metric to use, maybe TV.

Define to be the index of the ball into which falls, similar. So if is sufficiently small, then .

By the data processing inequality, conditions 2 and 3 still hold for . Condition 1 should hold with some extra slack depending on the coarseness of the discretization.

It takes a few steps, but I think you might be able to argue that, with high probability, for each , the random variable will be highly concentrated (n.b. I've only worked it through fully in the exact case, and I think it can be translated to the approximate case but I haven't checked). We then invoke the discretization to argue that is bounded. The intuition is that the discretization forces nearby probabilities to coincide, so if is concentrated then it actually has to "collapse" most of its mass onto a few discrete values.

We can then make a similar argument switching the indices to get bounded. Finally, maybe applying conditions 2 and 3 we can get bounded as well, which then gives a bound on .
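To make the discretization step concrete, here's a toy sketch (my own illustration; in particular, treating the objects being chopped up as conditional distribution vectors and using a greedy ball assignment are choices I'm making for the example, not part of the argument above):

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def discretize_into_balls(dists, eps):
    """Greedily assign each distribution to a TV-ball of radius eps.

    dists: array of shape (n, k), each row a distribution over k outcomes.
    Returns an integer ball index per row, which can then be treated as a
    discrete random variable (so regular Shannon entropy applies).
    """
    centers, labels = [], np.empty(len(dists), dtype=int)
    for i, p in enumerate(dists):
        for j, c in enumerate(centers):
            if tv_distance(p, c) <= eps:
                labels[i] = j
                break
        else:
            centers.append(p)
            labels[i] = len(centers) - 1
    return labels

# Toy example: four conditional distributions that collapse into two balls.
conditionals = np.array([[0.80, 0.20],
                         [0.76, 0.24],
                         [0.20, 0.80],
                         [0.24, 0.76]])
print(discretize_into_balls(conditionals, eps=0.05))  # -> [0 0 1 1]
```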

I did try feeding this to Gemini but it wasn't able to produce a proof.

Wait, I thought the first property was just independence, not also identically distributed.

In principle I could have, e.g., two biased coins whose biases are different but deterministically dependent on each other.
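A toy numerical version of what I mean (my own sketch; the particular dependence, bias 1 - p for the second coin, is just one arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent bias p; coin A has bias p, coin B has bias 1 - p: different biases,
# deterministically tied to the same latent.
p = rng.choice([0.2, 0.9], size=n)
a = rng.random(n) < p
b = rng.random(n) < 1 - p

# Conditional on p the flips are independent...
for bias in (0.2, 0.9):
    m = p == bias
    print(f"p={bias}: P(A=1,B=1)={(a[m] & b[m]).mean():.3f}",
          f"vs P(A=1)P(B=1)={a[m].mean() * b[m].mean():.3f}")

# ...but not identically distributed.
print("P(A=1) =", round(a.mean(), 3), " P(B=1) =", round(b.mean(), 3))
```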

I think:

  1. Finding principles for AI "behavioural engineering" that reduce people's desire to engage in risky races (e.g. because they find the principles acceptable) seems highly valuable
  2. To the extent permitted by 1, pursuing something CEV-like ("we're happier with the outcome in hindsight than we would've been with other outcomes") also seems desirable

I sort of see the former as potentially encouraging diversity (because different groups want different things, and are most likely to agree to "everyone gets what they want"), but the latter may in fact suggest convergence (because, perhaps, there are fairly universal answers to "what makes people happy with the benefit of hindsight?").

You stress the importance of having robust feedback procedures, but having overall goals like this can help to judge which procedures are actually doing what we want.

Your natural latents seem quite closely related to the common construction of variables that are IID conditional on a latent - in fact, all of your examples are IID variables (or "bundles" of IID variables) conditional on the latent. Can you give me an interesting example of a natural latent that is not basically the conditionally IID case?

(I was wondering if the extensive literature on the correspondence between de Finetti-type symmetries and conditional IID representations is of any help to your problem. I'm not entirely sure it is, given that it mostly addresses the issue of getting from a symmetry to a conditional independence, whereas you want to get from one conditional independence to another, but it's plausible some of the methods are applicable)
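For reference, the representation that literature is about is the usual de Finetti form: an (infinitely) exchangeable sequence is a mixture of IID sequences,

$$P(X_1 = x_1, \dots, X_n = x_n) = \int \prod_{i=1}^{n} Q(x_i)\, \mu(dQ),$$

i.e. conditional on the latent draw $Q$ the variables are IID.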

If you're in a situation where you can reasonably extrapolate from past rewards to future reward, you can probably extrapolate previously seen "normal behaviour" to normal behaviour in your situation. Reinforcement learning is limited - you can't always extrapolate past reward - but it's not obvious that imitative regularisation is fundamentally more limited.

(normal does not imply safe, of course)

Their empirical result rhymes with adversarial robustness issues - we can train adversaries to maximise ~arbitrary functions subject to a constraint of small perturbation from the ground truth. Here the maximised function is a faulty reward model and the constraint is KL divergence from a base model rather than distance from a ground-truth image.

I wonder if multiscale aggregation could help here too, as it does with image adversarial robustness. We want the KL penalty to ensure that generations look normal at any "scale", whether we read them token by token or read a high-level summary of them. However, I suspect their "weird, low-KL" generations will have weird high-level summaries, whereas more desirable policies would look more normal in summary (though it's not immediately obvious whether this translates to low- and high-probability summaries respectively - one would need to test). I think a KL penalty to the "true base policy" should operate this way automatically, but as the authors note we can't actually implement that.
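A very rough sketch of the kind of two-scale penalty I'm imagining (entirely hypothetical; in particular I'm waving my hands about where the distribution over high-level summaries comes from):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-8):
    """KL(p || q) along the last axis, averaged over leading axes."""
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return (p * np.log(p / q)).sum(-1).mean()

# Toy shapes: batch 2, sequence length 5, vocab 11, 4 summary categories.
policy_tok = softmax(rng.normal(size=(2, 5, 11)))  # token-level distributions
base_tok = softmax(rng.normal(size=(2, 5, 11)))
policy_sum = softmax(rng.normal(size=(2, 4)))      # summary-level distributions
base_sum = softmax(rng.normal(size=(2, 4)))

# Penalise divergence from the base policy at both scales.
penalty = kl(policy_tok, base_tok) + kl(policy_sum, base_sum)
print(penalty)
```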

Is your view closer to:

  • there are two hard steps (instruction following, value alignment), and of the two instruction following is much more pressing
  • instruction following is the only hard step; if you get that, value alignment is almost certain to follow

Mathematical reasoning might be specifically conducive to language invention because our ability to automatically verify reasoning means that we can potentially get lots of training data. The reason I expect the invented language to be “intelligible” is that it is coupled (albeit with some slack) to automatic verification.
