Is Goodhart's Curse Not Really That Bad?
I'm not implying I'm on to anything others haven't thought of by posting this - I'm asking this so people can tell me if I'm wrong.
Goodhart's Curse is often cited to claim that if a superintelligent AI has a utility function which is a noisy approximation of the intended utility function, the expected proxy error will blow up given a large search space for the optimal policy.
But, assuming Gaussian or sub-Gaussian error, the expected regret is actually something like $\sigma\sqrt{2\ln N}$, where $N$ is the size of the raw search space and $\sigma$ is the scale of the proxy error. Even if the search space grows exponentially with intelligence, expected error isn't really blowing up. If smarter agents make more accurate proxies, then error might very plausibly decrease as intelligence grows.
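Here's a quick simulation of the quantity I have in mind (a toy sketch, assuming i.i.d. standard Gaussian proxy error that is independent of the true values; it just illustrates the square-root-of-log scaling, it isn't a proof):

```python
import numpy as np

# Toy check of the scaling: with i.i.d. Gaussian proxy error, the expected amount
# by which the proxy overrates its favourite of N candidates (the term driving the
# regret under these assumptions) grows roughly like sqrt(2 ln N), not like N.
rng = np.random.default_rng(0)

for N in [10, 1_000, 100_000]:
    avg_max_error = np.mean([rng.normal(size=N).max() for _ in range(200)])
    print(f"N={N:>6}  E[max error] ~ {avg_max_error:.2f}   sqrt(2 ln N) = {np.sqrt(2 * np.log(N)):.2f}")
```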
I understand that there are a lot of big assumptions here which might not hold in practice, but this still seems to suggest there are a lot of worlds where Goodhart's Curse doesn't bite that hard.
If this is too compressed to be legible, please let me know and I will make it a full post.
why is your definition of value so arbitrary as to stipulate that biological meat-humans are necessary
In this kind of conversation, it's important to recognize that different people will inevitably want fundamentally different things, in a way that cannot be fully reconciled.
Most people recognize that humans in their current form will not exist forever, but our preferences about what comes next vary a lot. Many people want to minimize change and want a future they can recognize and understand. Many others see that kind of future as both unrealistic and undesirable, and care instead about the manner in which we are replaced, or about preserving certain qualities of humans in our successors.
Unfortunately, it's common to assume that there is some correct thing to value, or that a certain class of values (e.g. evolutionary preferences) is well justified or widely accepted.
This isn't a productive framework.
Perhaps I am overly pessimistic, but I see preferences as so varied on this question that I'd guess there is no possible outcome that is desirable to more than half the population.
This doesn't make complete sense to me, but you are going down a line of thought I recognize.
There are certainly stable utility functions which, while having some drawbacks, don't result in dangerous behavior from superintelligences. Finding a good one doesn't seem all that difficult.
The real nasty challenge is how to build a superintelligence that has the utility function we want it to have. If we could do this, then we could start by choosing an extremely conservative utility function and slowly and cautiously iterate towards a balance of safe and useful.
I've been thinking about a similar thing a lot.
Consider a little superintelligent child who always wants to eat as much candy as possible over the course of the next ten minutes. Assume the child doesn't care at all about anything that happens more than ten minutes from now.
This child won't work very hard at instrumental goals like self-improvement or conquering the world to redirect resources towards candy production, since that would be a waste of time, even though it might maximize candy consumption in the long term.
AI alignment isn't any easier here; the point of this is just to illustrate that instrumental convergence is far from a given.
Note that if computing an optimization step reduces the loss, the training process will reinforce it, even if other layers aren’t doing similar steps, so this is another reason to expect more explicit optimizers.
Basically, self-attention is a function of certain matrices, something like this (for a linear self-attention layer updating token $e_j$, where $E$ is the matrix of token embeddings and $P$, $W_V$, $W_K$, $W_Q$ are learned matrices):
$e_j \leftarrow e_j + P\, W_V E\, (W_K E)^\top W_Q e_j$
Which looks really messy when you put it like this but is pretty natural in context.
If you can get the big messy looking term to approximate a gradient descent step for a given loss function, then you're golden.
In Appendix A.1, they show the matrices that yield this gradient descent step. They are pretty simple, and probably an easy point of attraction to find.
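To show the flavor of that construction, here's a toy numpy sketch (my own simplified illustration, not the actual matrices from the appendix): for in-context linear regression with context tokens $(x_j, y_j)$ and a query $x_q$, a single linear-attention read, with keys $x_j$, values $y_j$, query $x_q$, and no softmax, matches the prediction after one gradient descent step on the least-squares loss starting from $w = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 5, 32, 0.1

X = rng.normal(size=(n, d))       # context inputs x_j
y = X @ rng.normal(size=d)        # context targets y_j
x_q = rng.normal(size=d)          # query input

# One GD step from w = 0 on L(w) = 1/(2n) * sum_j (w·x_j - y_j)^2:
w1 = (eta / n) * y @ X
pred_gd = w1 @ x_q

# One linear-attention read: score_j = x_j·x_q, output = (eta/n) * sum_j score_j * y_j
# (the eta/n factor plays the role of the output projection in the construction).
pred_attn = (eta / n) * y @ (X @ x_q)

print(np.allclose(pred_gd, pred_attn))   # True: same prediction either way
```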
All of this reasoning is pretty vague, and without the experimental evidence it wouldn't be nearly good enough. So there's definitely more to understand here. But given the experimental evidence I think this is the right story about what's going on.
I think you do this post a disservice by presenting it as a failure. It had a wrong conclusion, but its core arguments are still interesting and relevant, and exploring the reasons they are wrong is very useful.
Your model of neural nets predicted the wrong thing; that's super exciting! We can improve the model now.
The fundamental idea about genes having an advantage over weights at internally implementing looping algorithms is apparently wrong though (even though I don't understand how the contrary is possible...)
I've been trying to understand this myself. Here's the understanding I've come to, which is very simplistic. If someone who knows more about transformers than me says I'm wrong I will defer to them.
I used this paper to come to this understanding.
In order to have a mesa-optimizer, lots and lots of layers need to be in on the game of optimization, rather than just one or several key elements which get referenced repeatedly during the optimization process.
But self-attention is, by default, not very far away from being one step in gradient descent. No individual layer needs to learn to do optimization independently from scratch, since a single optimization step is relatively easy to find given the self-attention architecture.
That's why it's not forbiddingly difficult for neural networks to implement internal optimization algorithms. It still could be forbiddingly difficult for most optimization algorithms, ones that aren't easy to find from the basic architecture.
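To gesture at how depth takes over the role of an iteration counter, here's a toy sketch in the same spirit (again a simplified illustration under my own assumptions, not the construction from the paper): each "layer" only does an attention-style read using the current residuals as values, yet stacking K of them reproduces K gradient descent steps on an in-context least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta, K = 5, 32, 0.05, 8

X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
x_q = rng.normal(size=d)

# Explicit optimizer: K steps of gradient descent on the least-squares loss.
w = np.zeros(d)
for _ in range(K):
    w += (eta / n) * X.T @ (y - X @ w)
pred_gd = w @ x_q

# "Stacked layers": no weight vector anywhere, only attention-style reads that
# use the residuals (y_j minus the current prediction) as values.
preds, pred_q = np.zeros(n), 0.0
for _ in range(K):
    resid = y - preds
    preds += (eta / n) * X @ (X.T @ resid)      # context tokens update each other
    pred_q += (eta / n) * x_q @ (X.T @ resid)   # query token reads the context

print(np.allclose(pred_gd, pred_q))   # True: K layers act like K GD steps
```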
Why don’t I do the project myself?
Because I think I’m one of the smartest young supergeniuses, and I’m working on things that I think are even more useful in expectation, and which almost nobody except me can do.
Even if this is by some small chance actually true, it's stupid of you to say it, because from the perspective of your readers, you are almost certainly wrong, and so you undermine your own credibility. I'm sure you were aware some people would think this, and don't care. Have you experimented with trying not to piss people off, to see if it helps you?
As for your actual idea, it's cool and even if it doesn't work out we could learn some important things. Good luck!
Inside of Google Docs you mean? Yeah, for some subset of tasks the 'select > menu click > Refine the selected text' tool has been useful.