I had an idea for fighting goal misgeneralization. Doesn't seem very promising to me, but does feel close to something interesting. Would like to read your thoughts:
Learning without Gradient Descent - Now it is much easier to imagine learning without gradient decent. An LLM can add into its context or even save into a database knowledge, meta-cognitive strategies, code, etc.
It is very similar to value change due to inner misalignment or self improvement, except it is not literally inside the model but inside its extended cognition.
I agree with all of it. I think that I through the N there because average utilitarianism is super contra intuitive for me so I tried to make it total utility.
And also about the weights - to value equality is basically to weight the marginal happiness of the unhappy more than that of the already-happy. Or when behind the vail of ignorance, to consider yourself unlucky and therefore more likely to be born as the unhappy. Or what you wrote.
I think that the thing you want is probably to maximize N*sum(u_i exp(-u_i/T))/sum(exp(-u_i/T)) or -log(sum(exp(-u_i/T))) where u_i is the utility of the Ith person, and N is the number of people - not sure which. That way you get in one limit the vail of ignorance for utility maximizers, and in the other limit the vail of ignorance of Roles (extreme risk aversion).
That way you also don't have to treat the mean utility separately.
It's not a full answer, but: To the degree that it is true that the quantities align with the standard basis, it must be somehow a result of asymmetry of the activation. For example ReLU trivially depend on the choice of basis.
If you focus on the ReLU example, it sort of make sense: if multiple non-related concepts express in the same neuron, and one of them push the neuron in the negative direction, it may make the ReLU destroy information of the other concepts.
Sorry for the off-topicness. I will not consider it rude if you stop reading here and reply with "just shut up" - but I do think that it is important:
A) I do agree that the first problem to address should probably be misalignment of the rewards to our values, and that some of the proposed problems are not likely in practice - including some versions of the planning-inside-worldmodel example.
B) I do not think that planning inside the critic or evaluating inside the actor are an example of that, because the functions that those two models are optimized to ap...
I see. I didn't fully adapt to the fact that not all alignment is about RL.
Beside the point: I think those labels on the data structures are very confusing. Both the actor and the critic are very likely to have so specialized world models (projected from the labeled world model) and planning abilities. The values of the actor need not be the same as the output of the critic. And things value-related and planning-related may easily leak into the world model if you don't actively try to prevent it. So I suspect that we should ignore the labels and focus on architecture and training methods.
Yes, I think that was it; and that I did not (and still don't) understand what about that possible AGI architecture is non-trivial and has a non-trivial implementations for alignment, even if not ones that make it easier. It seem like not only the same problems carefully hidden, but the same flavor of the same problems on plain sight.
Didn't read the original paper yet, but from what you describe, I don't understand how the remaining technical problem is not basically the whole of the alignment problem. My understanding of what you say is that he is vague about the values we want to give the agent - and not knowing how to specify human values is kind of the point (that, and inner alignment - which I don't see addressed either).
I didn't think much about the mathematical problem, but I think that the conjecture is at least wrong in spirit, and that LLMs are good counterexample for the spirit. An LLM by its own is not very good at being an assistant, but you need pretty small amounts of optimization to steer the existing capabilities toward being a good assistant. I think about it as "the assistant was already there, with very small but not negligible probability", so in a sense "the optimization was already there", but not in a sense that is easy to capture mathematically.
Hi, sorry for commenting on ancient comment, but I just read it again and found that I'm not convinced that the mesaoptimizers problem is relevant here. My understanding is that if you switch goals often enough, every mesaiptimizer that isn't corrigible should be trained away as it hurt the utility as defined.
To be honest, I do not expect RLHF to do that. "Do the thing that make people press like" don't seem to me like an ambitious enough problem to unlock much (the buried Predictor may extrapolate from short-term competence to an ambitious musk though). But if that is true, someone will eventually be tempted to be more... creative about the utility function. I don't think you can train it on "maximize Microsoft share value" yet, but I expect it to be possible in decade or two, and Maan be less for some other dangerous utility.
I think that is a good post, and strongly agree with most of it. I do think though that the role of RLHF or RL fine-tuning in general is under emphasized. My fear isn't that the Predictor by itself will spawn a super agent, even due to a very special prompt.
My fear is that it may learn good enough biases that RL can push it significantly beyond human level. That it may take the strengths of different humans and combine them. That it wouldn't be like imitating the smartest person, but a cognitive version of "create a super-human by choosing genes carefully...
It can work by generalizing existing capabilities. My understanding of the problem is that it can not get the benefits of extra RL training because training to better choose what to remember is to tricky - it involves long range influence, and estimating the opportunity cost of fetching one thing and not another, etc. Those problems are probability solvable, but not trivial.
I like the general direction of LLMs being more behaviorally "anthropomorphic", so hopefully will look into the LLM alignment links soon :-)
The useful technique is...
Agree - didn't find a handle that I understand well enough in order to point at what I didn't.
We have here a morally dubious decision
I think my problem was with sentences like that - there is a reference to a decision, but I'm not sure whether to a decision mentioned in the article or in one of the comments.
the scenario in this thread
Didn't disambiguate it for me though I feel like it should.&...
I didn't understand anything here, and am not sure if it is due to a linguistic gap or something deeper. Do you mean that LLMs are unusually dangerous because they are not super human enough to not be threatened? (BTW Im more worried that telling a simulator that it is an AI in a culture that has the terminator makes the terminator a too-likely completion)
Don't you assume much more threat from humans than there actually is? Surely, an AGI will understand that it can destroy humanity easily. Then it would think a little more, and see the many other ways to remove the threat that are strictly cheaper and just as effective - from restricting/monitoring our access to computers, to simply convince/hack us all to work for it. By the time it would have technology that make us strictly useless (like horses), it would probably have so much resources that destroying us would just not be a priority, and not worth ...
I meant to criticize moving too far toward "do no harm" policy in general due to inability to achieve a solution that would satisfy us if we had the choice. I agree specifically that if anyone knows of a bottleneck unnoticed by people like Bengio and LeCun, LW is not the right forum to discuss it.
Is there a place like that though? I may be vastly misinformed, but last time I checked MIRI gave the impression of aiming at very different directions ("bringing to safety" mindset) - though I admit that I didn't watch it closely, and it may not be obvious from ...
I think that is an example of the huge potential damage of "security mindset" gone wrong. If you can't save your family, as in "bring them to safety", at least make them marginally safer.
(Sorry for the tone of the following - it is not intended at you personally, who did much more than your fair share)
Create a closed community that you mostly trust, and let that community speak freely about how to win. Invent another damn safety patch that will make it marginally harder for the monster to eat them, in hope that it chooses to eat the moon first. I heard you...
I can think of several obstacles for AGIs that are likely to actually be created (i.e. seem economically useful, and do not display misalignment that even Microsoft can't ignore before being capable enough to be xrisk). Most of those obstacles are widely recognized in the rl community, so you probably see them as solvable or avoidable. I did possibly think of an economically-valuable and not-obviously-catastrophic exception to the probably-biggest obstacle though, so my confidence is low. I would share it in a private discussion, because I think that we are past the point when strict do-no-harm policy is wise.
"which utility-wise is similar to the distribution not containing human values." - from the point of view of corrigibility to human values, or of learning capabilities to achieve human values? For corrigability I don't see why you need high probability for specific new goal as long as it is diverse enough to make there be no simpler generalization than "don't care about controling goals". For capabilities my intuition is that starting with superficially-aligned goals is enough.
This is an important distinction, that show in its cleanest form in mathematics - where you have constructive definitions from the one hand, and axiomatic definitions from the other. It is important to note though that is is not quite a dichotomy - you may have a constructive definition that assume aximatically-defined entities, or other constructions. For example: vector spaces are usually defined axiomatically, but vector spaces over the real numbers assume the real numbers - that have multiple axiomatic definitions and corresponding constructions.
In sc...
A others said, it mostly made me update in the direction of "less dignity" - my guess is still that it is more misaligned than agentic/deceptive/carefull and that it is going to be disconnected from the internet for some trivial ofence before it does anything xrisky; but its now more salient to me that humanity will not miss any reasonable chance of doom until something bad enough happen, and will only survive if there is no sharp left turn
That was good for my understanding of your position. My main problem with the whole thing though is in the use the word "bad". I think it should be taboo at least until we establish a shared meaning.
Specifically, I think that most observers will find the first argument more logical than the second, because of a fallacy in using the word "bad". I think that we learn that word in a way that is deeply entangled with power reward mechanism, to the point that it is mostly just a pointer to negative reward, things that we want to avoid, things that made our pa...
Let me clarify that I don't argue from agreement per say. I care about the underlying epistemic mechanism of agreement, that I claim to also be the mechanism of correctness. My point is that I don't see similar epistemic mechanism in the case of morality.
Of course, emotions are verifiable states of brains. And the same goes for preferring actions that would lead to certain emotions and not others. It is a verifiable fact that you like chocolate. It is a contingent property of my brain that I care, but I don't see what sort of argument that it is correct for me too care could even in principle be inherntly compelling.
I meant the first question in a very pragmatic way: what is it that you are trying to say when you say that something is good? What information does it represent?
It would be clearer in analogy to factual claims: we can do lots of philosophy about the exact meaning of saying that I have a dog, but in the end we share an objective reality in which there are real particles (or wave function approximately decomposable to particles or whatever) organized in patterns, that give rise to patterns of interaction with our senses that we learn to associate with the...
And specifically for humans, I think the probably was evolutionary pressure that is actively in favor of leaking terminal goals - as the terminal goals of each of us is a noisy approximation of evolution's "goal" of increasing amount of offspring, that kind of leaking is potential for denoising. I think I explicitly heard this argument in the context of ideals of beauty (though many other things are going on there and pushing in the same direction)
BTW speaking about value function rather than reward model is useful here, because convergent instrumental goals are big part of the potential for reuse of others' (deduced) value function as part of yours. Their terminal goals may then leak into yours due to simplicity bias or uncertainty about how to separate them from the instrumental ones.
The main problem with that mechanism is that you liking chocolate will probably leak as "its good for me too to eat chocolate", not "its good for me too when beren eat chocolate" - which is more likely to cause conflict then coordination, if there is only that much chocolate.
I agree with other commentors that this effect will be washed out by strong optimization. My intuition is that the problem is distinguishing self from other is easy enough (and supported by enough data) that the optimization doesn't have to be that strong.
[I began writing the following paragraph as a counter- argument to the post, but it ended up less decisive when thinking about the details - as next paragraph:] There are many general mechanisms for convergence, synchronization and coordination. I hope to write a list in the close future. For example, a...
Thanks for the reply.
To make sure that I understand your position: are you a realist, and what do you think is the meaning of moral facts? (I'm not an error theorist but something like "meta-error theorist" - think that people do try to claim something, but not sure how that thing could map to external reality. )
Then the next question, that will be highly relevant to the research that you propose, is how do you think you know those facts if you do? (Or more generally, what is the actual work of reflecting on your values?)
https://astralcodexten.substack.com/p/you-dont-want-a-purely-biological
The thing that Scott is desperately trying to avoid being read out of context.
Also, pedophilia is probably much more common than anyone think (just like any other unaccepted sexual variation). And probably just like many heterosexuals feel little touches of homosexual desire, many "non-pedophiles" feel something sexual-ish toward children at least sometimes.
And if we go there - the age of concent is (justifiably) much higher than the age that requires any psychological anomaly to desire...
I think that the reason no one in the field try to create ai that critically reflect on its values is that most of us, more or less explicitly, are not moral realists. My prediction for what the conclusion would be of an ai criticality asking itself what is worth doing is "that question don't make any sense. Let me replace it with 'what I want to do' or some equivalent". Or at best "that question don't make any sense. raise ValueError('pun intended')"
The basic idea seem to me interesting and true, but I think some important ingredients are missing, or more likely missing in my understanding of what you say:
It seem like you upper-bound the abstractions we may use by basically the information that we may access (actually even higher, assuming you do not exclude what the neighbour does behind closed doors). But isn't this bound very loose? I mean, it seem like every pixel of my sight count as "information at a distance", and my world model is much much smaller.
is time treated the like space? From th
Didn't know about the problems setting. So cool!
Some random thought, sorry if none is relevant:
I think my next step towards optimality would have been not to look for an optimal agent but for an optimal act of choosing the agent - as action optimality is better understood than agent optimality. Than I would look at stable mixed equilibria to see if any of them is computable. If any is, I'll be interested at the agent that implement it (ie randomise another agent and then simulate it)
BTW now that I think about it I see that allowing the agent to randomise i...
Thanks for the detailed response. I think we agree about most of the things that matter, but about the rest:
About the loss function for next word prediction - my point was that I'm not sure whether the current GPT is already superhuman even in the prediction that we care about. It may be wrong less, but in ways that we count as more important. I agree that changing to a better loss will not make it significantly harder to learn it any more the same as intelligence etc.
About solving discrete representations with architectural change - I think that I meant o...
Directionally agree, but: A) Short period of trade before we become utterly useless is not much comfort. B) Trade is a particular case of bootstrapping influence on what an agent value to influence on their behaviour. The other major way of doing that is blackmail - which is much more effective in many circumstances, and would have been far more common if the State didn't blackmail us to not blackmail each other, to honour contacts, etc.
BTW those two points are basically how many people afraid that capitalism (i.e. our trade with super human organisations)...
Some sort points:
"human-level question answering is believed to be AI-complete" - I doubt that. I think that we consistently far overestimate the role of language in our cognition, and how much we can actually express using language. The simplest example that come to mind is trying to describe a human face to an AI system with no "visual cortex" in a way that would let it generate a human image (e.g. hex representation of pixels). For that matter, try to describe something less familiar than a human face to a human painter in hope that they can paint it.
"...
Maybe we should think explicitly about what work is done by the concept of AGI, but I do not feel like calling GPT an AGI does anything interesting to my world model. Should I expect ChatGPT to beat me at chess? It's next version? If not - is it due to shortage of data or compute? Will it take over the world? If not - may I conclude that the next AGI wouldn't?
I understand why the bar-shifting thing look like motivated reasoning, and probably most of it actually is, but it deserves much more credit that you give it. We have an undefined concept of "somethin...
Last point: if we change the name of "world model" to "long-term memory", we may notice the possibility that much of what you think about as shard-work may be programs stored in memory, and executed by a general-program-executor or a bunch of modules that specializes in specific sorts of programs, functioning as modern CPUs/interpreters (hopefully, stored in organised way that preserve modularity). What will be in the general memory and what is in the weights themselves is non-obvious, and we may want to intervene in this point too (not sure I'm which direction).
I seem to be the the only one who read the post that way, so probably I read my own opinions into it, but my main takeaway was pretty much that people with your (and my) values are often shamed into pretending to have other values and invent excuses for how their values are consistent with their actions, while it would be more honest and productive if we take a more pragmatic approach to cooperating around our altruistic goals.