Claim: the degree to which the future is hard to predict has no bearing on the outer alignment problem.
By "outer alignment" I was referring to "providing well-specified rewards" (https://arxiv.org/abs/2209.00626). Following this definition, I still think that if one is unable to disentangle what's relevant for predicting the future, one cannot carefully tailor a reward function that teaches an agent how to predict the future. Thus, the agent cannot be consequentialist, or at least it will have to deal with a large amount of uncertainty when forecasting on timescales longer than the predictable horizon. I think this reasoning rests on the basic premise that you mentioned ("one can construct a desirability tree over various possible future states.").
All we do with consequentialism is evaluate a particular terminal state. The complexity of how we got there doesn't matter.
Oh, but it does matter! If your desirability tree consists of weak branches (i.e., wrong predictions), what's it good for?
we can't blame this on outer alignment, can we? This would be better described as goal misspecification.
I believe this may have been a mistake on my side: I had assumed that the definition I was using for outer alignment was the standard/default one! I think this would match goal misspecification, yes! (And my working definition, as stated above.)
If one subscribes to deontological ethics, then the problem becomes even easier. Why? One wouldn't have to reason probabilistically over various future states at all. The goodness of an action only has to do with the nature of the action itself.
Completely agreed!
On a related note, you may find this interesting: https://arxiv.org/abs/1607.00913
[...] without having to know all the future temperatures of the room, because I can cleanly describe the things I care about.[...] My goal with the thermostat example is just to point out that that isn't (as far as I can see) because of a fundamental limit in how precisely you can predict the future.
I think there was a gap in my reasoning, so let me put it this way: as you said, only when you can cleanly describe the things you care about can you design a system that doesn't game your goals (the thermostat). However, my reasoning suggests that one way in which you may not be able to cleanly describe the things you care about (the predictive variables) is the inaccuracy-attribution degeneracy I mention in the post. In other words, you don't (and possibly can't) know whether the variable you're interested in is being inaccurately forecasted because relevant variables are missing from the specification (the most common case) or because the initial conditions of all the relevant variables are misspecified.
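To make that degeneracy concrete, here's a minimal toy sketch (the coupled logistic map, the error threshold, and the step counts are purely illustrative assumptions, not anything from the post): from the error curve alone, a forecast that fails because of a slightly wrong initial condition looks much like one that fails because a relevant variable was left out.

```python
import numpy as np

def f(u, r=3.9):
    """Logistic map in its chaotic regime."""
    return r * u * (1 - u)

def coupled_logistic(x, y, c=0.1, steps=50):
    """Toy 'true' system: two coupled logistic maps; returns the x trajectory."""
    xs = []
    for _ in range(steps):
        x, y = (1 - c) * f(x) + c * f(y), (1 - c) * f(y) + c * f(x)
        xs.append(x)
    return np.array(xs)

def single_logistic(x, steps=50):
    """Misspecified model: the second variable is dropped entirely."""
    xs = []
    for _ in range(steps):
        x = f(x)
        xs.append(x)
    return np.array(xs)

truth = coupled_logistic(0.40, 0.60)
perturbed_ic = coupled_logistic(0.40 + 1e-6, 0.60)  # right model, slightly wrong initial condition
missing_var = single_logistic(0.40)                 # wrong model, a relevant variable is missing

for name, pred in [("misspecified initial condition", perturbed_ic),
                   ("missing variable", missing_var)]:
    err = np.abs(pred - truth)
    horizon = int(np.argmax(err > 0.1)) if (err > 0.1).any() else len(err)
    print(f"{name}: forecast error exceeds 0.1 after ~{horizon} steps")
```

In both cases the error blows up past a short horizon, and nothing in the error trajectory itself tells you which of the two failure modes you're in.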
I claim that we would still have the same kinds of risks from advanced RL-based AI that we have now, because we don't have a reliable way to clearly specify our complete preferences and have the AI correctly internalize them.
I partially agree: I'd say that, in that hypothetical case, you've solved one layer of complexity and the other one you're mentioning still remains! I don't claim that solving the issues raised by chaotic unpredictability solves goal gaming, but I do claim that without solving the former you cannot solve the latter (i.e., solving chaos is a necessary but not sufficient condition).
_Let's say we define an aligned agent as doing what we would want, provided that we were in its shoes (i.e. knowing what it knew). Under this definition, it is indeed possible to specify an agent's decision rule in a way that doesn't rely on long-range predictions (where predictive power gets fuzzy, like Alejandro says, due to measurement error and complexity)._
This makes intuitive sense to me! However, for concreteness, I'd push back with an example and some questions.
Let's assume that we want to train an AI system that autonomously operates in financial markets. Arguably, a good objective for this agent is to maximize profit. However, due to the chaotic nature of financial markets and the unpredictability of initial conditions, the agent might develop strategies that lead to unintended and potentially harmful behaviours.
Questions:
As I understand it, the argument above doesn't account for the agent using the best information available at the time (in the future, relative to its goal specification).
Hmm, I think my argument also applies to this case, because the "best information available at the time" might not be enough (e.g., because we cannot know whether there are missing variables, whether the initial conditions lack precision, etc.). The only case in which this is good enough, I'd say, is when the course of action is within the forecastable horizon. But, in that case, all long-term goals would have to be split into much smaller pieces, which is something I am honestly not sure can be done.
I'd be interested in hearing why these expectations might not be well calibrated, ofc!
Hi Gianluca, it's great that you liked the post and the idea! I think that your approach and mine share things in common and that we have similar views on how activation steering might be useful!
I would definitely like to chat to see whether potential synergies come up :)
I agree that it is intriguing. Although I'm still testing the method on more established datasets, my intuition for why it works is as follows:
Singular vectors correspond to the principal directions of data variance. In the context of LLMs, these directions capture significant patterns of moral and ethical reasoning embedded within the model's activations.
While it may seem that the largest singular vectors only preserve linear-algebra properties, they also capture the high-dimensional structure of the data, which I'd argue includes semantic and syntactic information. At the end of the day, these vectors represent the directions along which the model's activations vary the most, and so they inherently encode important semantic information.
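For intuition, here's a minimal sketch of what I mean by "principal directions of variance" (the random placeholder activations, shapes, and choice of k are illustrative assumptions; the actual extraction depends on the model and hook point used):

```python
import torch

# Hypothetical activation matrix: rows are token positions, columns are hidden dims.
acts = torch.randn(128, 4096)  # e.g., 128 tokens, hidden size 4096

# Centre the activations, then take the SVD: the right singular vectors (rows of Vh)
# are the directions along which the activations vary the most.
acts_centred = acts - acts.mean(dim=0, keepdim=True)
U, S, Vh = torch.linalg.svd(acts_centred, full_matrices=False)

k = 8                  # keep the k largest singular directions
top_dirs = Vh[:k]      # (k, hidden_dim): candidate "semantic" directions
explained = ((S[:k] ** 2).sum() / (S ** 2).sum()).item()
print(f"Top-{k} directions explain {explained:.1%} of activation variance")
```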
Hi Charlie, thanks a lot for taking the time to read the post and for the question!
Regarding the idea behind changing the activation histories: I wanted to capture token dependencies, as I thought that concepts that aren't captured by a single token (as in the case of ActAdd) would be better described by these history-dependent activations. As for why I bring the 3 relevant activation histories to the same size: that's to enable comparison (and, ultimately, a similarity measure).
Regarding why SVD: I decided to use SVD as it's one of the simplest and most ubiquitous matrix factorisation techniques out there (so I didn't need to validate or benchmark it). It also keeps the computations fairly light, which is crucial because SARA is meant to be applied at inference time.
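For concreteness, here's a minimal sketch of the general idea (this is an illustrative reconstruction rather than the exact SARA procedure; the placeholder tensors, the choice of k, and the use of cosine similarity are all assumptions on my part):

```python
import torch

def top_k_directions(history: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Reduce an activation history (n_tokens, hidden_dim) to its k leading
    right singular vectors, so histories of different lengths become comparable."""
    centred = history - history.mean(dim=0, keepdim=True)
    _, _, Vh = torch.linalg.svd(centred, full_matrices=False)
    return Vh[:k]  # (k, hidden_dim)

# Hypothetical activation histories of different lengths (placeholder tensors).
hist_a = torch.randn(37, 4096)
hist_b = torch.randn(112, 4096)

dirs_a = top_k_directions(hist_a)
dirs_b = top_k_directions(hist_b)

# One possible similarity measure: cosine similarity between matched singular directions.
sims = torch.nn.functional.cosine_similarity(dirs_a, dirs_b, dim=1)
print("Per-direction similarity:", sims)
```

Because the reduced representations always have the same shape regardless of how long each history is, the comparison stays cheap enough to run at inference time.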
Hmm, I see what you mean. However, that person's lack of clarity would in fact also be called a "bad prediction", which is something I'm trying to point out in the post! These bad predictions can happen due to a number of different factors (missing relevant variables, misspecified initial conditions...). I believe the only reason we don't call it "misaligned behaviour" is that we're assuming that people do not (usually) act according to an explicitly stated reward function!
What do you think?
Thanks for this pointer!