All of Riccardo Volpato's Comments + Replies

Apparently it is keeping around a representation of the token "plasma" with enough resolution to copy it . . . but it only retrieves this representation at the end! (In the rank view, the rank of plasma is quite low until the very end.)

This is surprising to me. The repetition is directly visible in the input: "when people say" is copied verbatim. If you just applied the rule "if input seems to be repeating, keep repeating it," you'd be good. Instead, the model scrambles away the pattern, then recovers it later through some other computational route.
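
For anyone who wants to poke at this themselves, here is a rough sketch of the "rank view" computation as I understand it (the prompt and the use of GPT-2 via Hugging Face transformers are my assumptions, not the post's exact setup): project each layer's hidden state through the final layer norm and the unembedding matrix, then look up the rank of the target token at the last position.

```python
# Rough sketch of the "rank view": apply the logit lens at every layer and
# report the rank of a target token (here " plasma") at the last position.
# Assumes GPT-2 via Hugging Face transformers; the prompt is illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "when people say plasma, when people say"
target_id = tok.encode(" plasma")[0]

with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    for layer, h in enumerate(out.hidden_states):
        # Project the last position's hidden state into vocab space (logit lens).
        logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
        rank = int((logits > logits[target_id]).sum())  # 0 = top prediction
        print(f"layer {layer:2d}: rank of ' plasma' = {rank}")
```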

 

O... (read more)

I've spent more of my time thinking about the technical sub-areas, so I'm focused on situations where innovations there can be useful. I don't mean to say that this is the only place where I think progress is useful.

That seems more than reasonable to me, given the current state of AI development.

Thanks for sharing your reflections on my comment.

While I agree with you that setting the context as Safety narrows down the requirements space for interpretability, I think there is more to it than just excluding the interpretability needs of non-technical users from the picture. The inspections that technicians would want in order to be reassured about the model's safety are probably around its motivations (e.g. is the system goal-directed, is the system producing mesa-optimisers). However, it is still unclear to me how this relates to the other interpretability desiderata you present in the post.
 

Plus, one could... (read more)

[anonymous]
Regarding how interpretability can help with addressing motivation issues, I think Chris Olah's views present situations where interpretability can potentially sidestep some of those issues. One such example is that if we use interpretability to aid in model design, we might have confidence that our system isn't a mesa-optimizer, and we've done this without explicitly asking questions about "what our model desires". I agree that this is far from the whole picture. The scenario you describe is an example where we'd want to make interpretability more accessible to more end-users. There is definitely more work to be done to bridge "normal" human explanations with what we can get from our analysis. I've spent more of my time thinking about the technical sub-areas, so I'm focused on situations where innovations there can be useful. I don't mean to say that this is the only place where I think progress is useful.

Thanks for the in-depth post on the topic. Your last paragraph on Utility is thought-provoking, to say the least. I have seen a lot of work claiming to make models interpretable - and factually doing so as well - about which I felt an itch I could not fully verbalise. I think your point on Utility puts the finger on it: most of these works were technically interpreting the model but not actually useful to the user.

From this, we can also partially explain the current difficulties around "find better ways to formalize what we mean by interpretability". If accepta... (read more)

[anonymous]
I think that the general form of the problem is context-dependent, as you describe. Useful explanations do seem to depend on the model, task, and risks involved. However, from an AI safety perspective, we're probably only considering a restricted set of interpretability approaches, which might make it easier. In the safety context, we can probably be less concerned with interpretability that is useful for laypeople, and focus on interpretability that is useful for the people doing the technical work. To that end, I think that "just" being careful about what the interpretability analysis means can help, like how good statisticians can avoid misuse of statistical testing, even though many practitioners get it wrong. I think it's still an open question, though, what even this sort of "only useful for people who know what they're doing" interpretability analysis would be. Existing approaches still have many issues.

I think we raise children to satisfy our common expected wellbeing (ours + theirs + the overall societal one). Thus, the goal-directedness comes from society as a whole. I think there is a key difference between this system and one where a smarter-than-human AI focuses solely on the well-being of its users, even if it does Coherent Extrapolated Volition, which I think is what you are referring to when you talk about expected well-being (and I agree that if you look only at their CEV-like properties, the two systems are equivalent).

The problem with this line of reasoning is that it assumes that the goal-directedness comes from the smarter member of the decision-maker/bearer-of-consequences duo. With children and animals, we consider their preferences as an input into our decision making, which mainly seeks to satisfy our preferences. We do not raise children solely for the purpose of satisfying their preferences.

This is why Rohin particularly stresses the idea that the danger lies in the source of goal-directedness: if it comes from humans, then we are safer.

Shmi
We raise children to satisfy their expected well being, not their naive preferences (for chocolate and toys), and that seems similar to what a smarter-than-human AI would do to/for us. Which was my point.

Helpful post - thanks for writing it. From a phenomenological perspective, how can we reason well about the truth of these kinds of "principles" (i.e. that the dual-process model in which S2 is better than S1 is less effective at dealing with motivational conflicts than the perspective shift you suggest), which are to some extent non-falsifiable?

This seems true to me (that it happens all the time). I think the article helps by showing that we often fail to recognise that A) and B) can both be true. Also, if we accept that A) and B) are both true and don't create an identity conflict about it, we can probably be more effective in striking a compromise (i.e. giving up one of them, or finding some other way to get A that does not involve B).

My rough mental summary of these intuitions:

  • Generalisation abilities suggest that behaviour is goal-directed because they demonstrate adaptability (and goals are more adaptable/compact ways of defining behaviour than alternatives, like enumeration)
  • Power grabs suggest that behaviour is goal-directed because they reveal instrumentalism
  • Our understanding of intelligence might be limited to human intelligence, which is sometimes goal-directed, so we use this as a proxy for intelligence (adding some, perhaps irrefutable, scepticism about goal-directedness as a model of intelligence)

we currently don't have a formal specification of optimization


This seems to me a significant bottleneck for progress. Has no formal specification of what optimisation is been attempted before? What has been achieved? Is anyone working on this?

evhub

Alex Flint recently wrote up this attempt at defining optimization that I think is pretty good and probably worth taking a look at.

  1. Could internalization and modelling of the base objective happen simultaneously? In some sense, since Darwin discovered evolution, isn't that the state humans are in? I guess this is equivalent to saying that even if the mesa-optimiser has a model of the base optimiser (condition 2 met), it cannot expect the threat of modification to eventually go away (condition 3 not met), since it is still under selection pressure and is experiencing internalization of the base objective. So if humans will ever be able to defeat mortality (can expect the threa

... (read more)

If we model reachability of an objective as simply its length in bits, then distinguishing O-base from every single more reachable O-mesa gets exponentially harder as O-base gets more complex. Thus, for a very complicated O-base, sufficiently incentivizing the base optimizer to find a mesa-optimizer with that O-base is likely to be very difficult, though not impossible
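
A quick sketch of why this gets exponential, under the quoted assumption that reachability is just description length in bits (my own counting, not from the post): every extra bit doubles the number of candidate objectives, so the set of objectives more reachable than O-base grows exponentially in its length.

```latex
% Number of objectives strictly shorter (i.e. more reachable) than O_base,
% writing \ell(O) for description length in bits:
\#\{\, O_{\mathrm{mesa}} : \ell(O_{\mathrm{mesa}}) < \ell(O_{\mathrm{base}}) \,\}
  \;=\; \sum_{k=1}^{\ell(O_{\mathrm{base}})-1} 2^{k}
  \;=\; 2^{\ell(O_{\mathrm{base}})} - 2
```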

What is the intuition that makes you think that, despite being exponentially harder, this would not be impossible?

you can anneal whatever combination of the different losses you are using to eventually become exclusively imitative amplification, exclusively debate, or anything else in between

How necessary is annealing for this? Could you choose other optimisation procedures? Or do you refer to annealing in a more general sense?

evhub
“Annealing” here simply means decaying over time (as in learning rate annealing), in this case decaying the influence of one of the losses to zero.
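
A minimal sketch of what that looks like in training code (the linear schedule and loss names below are my assumptions, not from the post; any schedule that decays monotonically to zero would do):

```python
def combined_loss(loss_amplification, loss_debate, step, total_steps):
    """Blend two objectives, "annealing" the debate term's weight to zero
    so that training ends up optimizing only the amplification loss."""
    w = max(0.0, 1.0 - step / total_steps)  # decays linearly from 1 to 0
    return loss_amplification + w * loss_debate
```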

I will keep track of all questions during our discussion, and if there is anything that makes sense to send over to you, I will, or I will invite the attendees to do so.

I feel like we as a community still haven't really explored the full space of possible prosaic AI alignment approaches

I agree, and I have mixed feelings about the current trend of converging towards roughly equivalent approaches, all containing a flavour of recursive supervision (at least 8 of your 11). On one hand, the fact that many attempts point in a similar direction is a good indication of th

... (read more)

Thanks for the great post. It really provides an awesome overview of the current progress. I will surely come back to this post often and follow pointers as I think about and research things.

Just before I came across this, I was thinking of hosting a discussion about "Current Progress in ML-based AI Safety Proposals" at the next AI Safety Discussion Day (Sunday June 7th).

Having read this, I think that the best thing to do is to host an open-ended discussion about this post. It would be awesome if you can and want to join. More details can be found... (read more)

evhub
Glad you liked the post! Hopefully it'll be helpful for your discussion, though unfortunately the timing doesn't really work out for me to be able to attend. However, I'd be happy to talk to you or any of the other attendees some other time—I can be reached at evanjhub@gmail.com if you or any of the other attendees want to reach out and schedule a time to chat. In terms of open problems, part of my rationale for writing up this post is that I feel like we as a community still haven't really explored the full space of possible prosaic AI alignment approaches. Thus, I feel like one of the most exciting open problems would be developing new approaches that could be added to this list (like this one, for example). Another open problem is improving our understanding of transparency and interpretability—one thing you might notice with all of these approaches is that they all require at least some degree of interpretability to enable inner alignment to work. I'd also be remiss not to mention that if you're interested in concrete ML experiments, I've previously written up a couple of different posts detailing experiments I'd be excited about.

Interesting points. The distinctions you mention could equally apply in distinguishing narrow from ambitious value learning. In fact, I think preference learning is pretty much the same as narrow value learning. Thus, could it be that ambitious value learning researchers are not very interested in preference learning, to a similar extent that they are not interested in narrow value learning?

"How important safety concerns" is certainly right, but the story of science teaches us that taking something from a domain with different concerns to another domain has often proven extremely useful.