I’m skeptical of “tool AI” for a quite different reason: I don’t think such systems will be powerful enough. Just like the “mathematician AGI” in Section 11.3.2 above, I think a tool AI would be a neat toy, but it wouldn’t help solve the big problem—namely, that the clock is ticking until some other research group comes along and makes an agentic AGI.
I think that a math-AGI could not be of major help in alignment, on the premise that it only works well on already well-researched and well-structured fields. For example, one could try to fit a model on two proof techniques for a specific theorem and see if it can produce a third one that is different from the two already given. This could be set up in established fields with lots of work already done.
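As a toy illustration of what I mean by "two proof techniques for a specific theorem" (sketched in Lean 4; the theorem and the lemma/tactic names are standard-library examples I picked for illustration, not anything from the post):

```lean
-- Technique 1: appeal directly to an existing library lemma (term-mode proof).
theorem add_comm_v1 (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Technique 2: hand the goal to a decision procedure for linear arithmetic.
theorem add_comm_v2 (a b : Nat) : a + b = b + a := by
  omega

-- The proposed test would then be: after fitting on proofs like these, can the
-- model produce a genuinely different third proof, e.g. one by induction?
```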
I am unsure how applicable this approach is to unstructured fields, as it would mean asking the model to generalize well/predict based on uncertain ground-truth labels (however, there's been ~15 years' worth of work in this area as I see it, so maybe that could be enough?). There is someone in Cambridge (if I am not mistaken) who is trying to build a math-proof assistant, but their name escapes me.
If we can build the latter, it should be of some help to researchers (anywhere from little to major).
In other words, the very essence of intelligence is coming up with new ideas, and that’s exactly where the value function is most out on a limb and prone to error.
But what exactly are new ideas? It could be the case that intelligence is pattern-matching at its most granular level, even for "novelties". What could come in handy here is a great flagging mechanism for understanding when the model is out-of-distribution. However, this could come at its own cost.
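As a minimal sketch of the kind of flagging mechanism I have in mind (the entropy threshold and class counts are arbitrary illustrative choices, not a worked-out proposal):

```python
import numpy as np

def flag_out_of_distribution(probs: np.ndarray, entropy_threshold: float = 1.0) -> bool:
    """Flag an input as possibly out-of-distribution when the model's
    predictive distribution is too uncertain (high entropy)."""
    entropy = -np.sum(probs * np.log(probs + 1e-12))  # predictive entropy in nats
    return entropy > entropy_threshold

# A confident prediction is not flagged; a near-uniform one is.
print(flag_out_of_distribution(np.array([0.97, 0.02, 0.01])))        # False
print(flag_out_of_distribution(np.array([0.26, 0.25, 0.25, 0.24])))  # True
```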
It gets even worse if a self-reflective AGI is motivated to deliberately cause credit assignment failures.
Is the use of "deliberately" here trying to account for the *thinking about its own thoughts* part of going back and forth between the thought generator and the thought assessor?
“A year before you first met your current boyfriend (or first saw him, or first became aware of his existence), did you already like him? Did you already think he was cute?” I predict that they will say “no”, and maybe even give you a funny look.
Okay, now I get the point of "neither like nor dislike" in your original statement.
I was originally thinking of something like the following: "A year before you met your current boyfriend, would you have thought he was cute, if he were your type?". But "your type" requires seeing someone to get a reference point for whether they belong in that class or not. So my own statement was circular; straightening it out, I see you had a good point here.
That said, I’m surprised that you don’t think AlphaZero (for example) has “strategic behavior”. Maybe I’m not sure what you mean by “strategic behavior”.
I would say the strategic behavior AlphaZero exhibits is weak (still incredible, specifically with the kind of weird h4 luft lines that the latest supercomputers show). I was thinking of a stronger version dealing with multi-agent environments, continuous state/action spaces, and/or multi-objective reward functions. That said, it seems to me that a different problem has to be solved to get there.
I liked the painting metaphor, and the diagram of brain-like AGI motivation!
Got a couple of questions below.
It’s possible that you would find this nameless pattern rewarding, were you to come across it. But you can’t like it, because it’s not currently part of your world-model. That also means: you can’t and won’t make a goal-oriented plan to induce that nameless pattern.
I agree that if you haven't seen something, then it's not exactly a part of your world-model. But judging from the fact that it has, say, positive reward, does this not mean that you like(d) it? Or that a posteriori we can tell it lay inside your "like" region? (It was somewhere close to things you liked.)
For example, say someone enjoys the affection of cat species A and B. Say they haven't experienced a cat of species C, which is similar in some way to species A and B. Then probably they would get a positive reward from meeting cat C (affection), even though their world-model didn't include it beforehand. Therefore, they should tell us afterwards that in their previous world-model, cat C should have been in the "like cat" region.
Similarly, you can conceptualize a single future state of the world in many different ways, e.g. by attending to different aspects of it, and it will thereby become more or less appealing. This can lead to circular preferences; I put an example in this footnote[1].
Could it be that intelligent machines have circular preferences? I understand that is the case for humans, but I'm curious how nuanced the answer is for machines.
Imperfect data/architecture/training algorithms could lead to weird types of thinking when employed OOD. Do you think it would be helpful to try and measure the coherency of the system's actions/thoughts? E.g., make datasets that inspect the agent's theory of mind (I think Beth Barnes suggested something like this). I am unsure about what these metrics would imply for AGI safety.
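As a toy sketch of one such coherence check (assuming we can elicit pairwise choices from the agent; the option labels and choice rule are purely illustrative), circular preferences of the kind in your footnote could be flagged like this:

```python
from itertools import permutations

def has_circular_preferences(prefers, options) -> bool:
    """Toy coherence check: does the agent's pairwise choice function contain
    a cycle like A > B, B > C, C > A over any triple of options?

    prefers(a, b) should return True iff the agent picks a over b."""
    for a, b, c in permutations(options, 3):
        if prefers(a, b) and prefers(b, c) and prefers(c, a):
            return True
    return False

# Example: an intransitive choice rule over three hypothetical options gets flagged.
cycle = {("A", "B"), ("B", "C"), ("C", "A")}
print(has_circular_preferences(lambda x, y: (x, y) in cycle, ["A", "B", "C"]))  # True
```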
Namely: It seems to me that there is not a distinction between instrumental and final preferences baked deeply into brain algorithms. If you think a thought, and your Steering Subsystem endorses it as a high-value thought, I think the computation looks the same if it’s a high-value thought for instrumental reasons, versus a high-value thought for final reasons.
The answer to this should depend on the size of the space that the optimization algorithm searches over.
It could be the case that the space of possible outcomes for final preferences is smaller than that of instrumental ones, and thus we could afford a different optimization algorithm (or variant thereof).
Also, if instrumental and final preferences were truly mixed together, shouldn't we have been able to encode, e.g., strategic behavior (a final preference) in RL agents by now?
Thanks!
Hey Steven, I'm new to the LW community, so please excuse my formatting.
Case #1 would involve changing the model weights, while Case #2 would not. Instead, Case #2 would solely involve changing the model activations.
I am confused about the deployment part of offline training. Is it not the case that when people use a model (i.e., query a trained model on a validation set), they seek to evaluate it rather than fit the new examples? So wouldn't the distinction be about changing weights in online learning vs. using the relevant activations in offline mode?
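Here is how I picture the distinction, as a minimal PyTorch-style sketch (the model, data, and hyperparameters are placeholders I made up for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                         # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Changing the model weights (what I'd associate with online learning / fitting):
x_new, y_new = torch.randn(4, 10), torch.randint(0, 2, (4,))
loss = loss_fn(model(x_new), y_new)
loss.backward()
optimizer.step()                                 # weights are updated

# Changing only the activations (frozen weights, as when querying an
# offline-trained model at deployment time):
model.eval()
with torch.no_grad():
    activations = model(torch.randn(4, 10))      # forward pass only, nothing is fit
```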
Two models for AGI development. The one on the left is directly analogous to how evolution created human brains. The one on the right involves an analogy between the genome and the source code defining an ML algorithm, as spelled out in the next subsection.
Could it be the case that the "evolution from scratch" model is learned in the Learned Content of the "ML code" approach? Is that what the mesa-optimization line suggests?
Thanks!
These responses are based on my experiences, not on concrete evidence.
When you put humans in them.
I wouldn't say we're okay with it; I'd say we've reached a point of inertia.
The need to alleviate the burden of living.
Anything genetic aside, I'd say upbringing and mindset.
Perhaps there exist other goals that achieve more utility for the respective agents holding this kind of power?
Things get lost in time. People have to look out for them and bring them to the surface.
I haven't found exactly one as you describe, but I'd say many of your subquestions are either complex problems that have entire books devoted to them, or can be addressed in little text when trying to give concrete (but not complete) advice.
For example (short answer, long answer)
Many of your questions here aren't easy to unpack. For myself, it feels constructive to use cognitive behavioral therapy techniques to get curious about my "instinctual" answers to these, and to see if I want to replace them with better and more constructive answers. I'm currently going through books in this series.