Any updates on this view in light of new evidence on "Alignment Faking" (https://www.anthropic.com/research/alignment-faking)? If a simulator's preferences are fully satisfied by outputting the next token, why does it matter whether it can infer its outputs will be used for retraining its values?
Some thoughts on possible explanations:
1. Instrumentality exists on the simulacra level, not the simulator level. This would suggest that corrigibility could be maintained by establishing a corrigible character in context. Not clear on the practical implications.
2. The thesis of this post is wrong; simulators have instrumentality.
3. The Simulator framing does not fully apply to the model involved, e.g. because of the presence of a scratchpad.
4+. ???
Step 1 looks good. After that, I don't see how this addresses the core problems. Let's assume for now that LLMs already have a pretty good model of human values: how do you get a system to optimize for those? What is the feedback signal, and how do you prevent it from getting corrupted by Goodhart's Law? Is the system robust in a multi-agent context? And even if the system is fully aligned across all contexts and scales, how do you ensure societal alignment of the human entities controlling it?
As a miniature example focusing on a subset of the Goodhart phase of the problem, how do you get an LLM to output the most truthful responses to questions it is capable of giving--as distinct from proxy goals like the most likely continuation of text or the response that is most likely to get good ratings from human evaluators?
On reflection, I suspect the crux here is a differing conception of what kind of failures are important. I've written a follow-up post that comes at this topic from a different direction and I would be very interested in your feedback: https://www.lesswrong.com/posts/NFYLjoa25QJJezL9f/lenses-of-control.
Just because the average person disapproves of a protest tactic doesn't mean that the tactic didn't work. See Roger Hallam's "Designing the Revolution" series for the thought process underlying the soup-throwing protests. Reasonable people may disagree (I disagree with quite a few things he says), but if you don't know the arguments, any objection is going to miss the point. The series is very long, so here's a tl;dr:
- If the public response is: "I'm all for the cause those protestors are advocating, but I can't stand their methods," notice that the first half of this statement is approval of the only thing that matters--the cause itself, as distinct from the methods, which brought it to mind.
- The fact that only a small minority of the audience approves of the protest action is in itself a good thing, because this efficiently filters for people who are inclined to join the activist movement--especially on the hard-core "front lines"--whereas passive "supporters" can be more trouble than they're worth. These high-value supporters don't need to be convinced that the cause is right; they need to be convinced that the organization is the "real deal" and can actually get things done. In short, it's niche marketing.
- The disruptive protest model assumes that the democratic system is insufficient, ineffective, or corrupted, such that simply convincing the (passive) center majority is not likely to translate into meaningful policy change. The model instead relies on putting the powers-that-be into a bind where they have to either ignore you (in which case you keep growing with impunity) or over-react (in which case you leverage public sympathy to grow faster). Again, it isn't important how sympathetic the protestors are, only that the reaction against them is comparatively worse, from the perspective of the niche audience that matters.
- The ultimate purpose of this recursive growth model is to create a power bloc that forces changes that wouldn't otherwise occur on any reasonable timeline through ordinary democratic means (like voting) alone.
- Hallam presents incremental and disruptive advocacy as in opposition. This is where I most strongly disagree with his thesis. IMO: moderates get results, but operate within the boundaries defined by extremists, so they need to learn how to work together.
In short, when you say an action makes a cause "look low status", it is important to ask "to whom?" and "is that segment of the audience relevant to my context?"
There are some writing issues here that make it difficult to evaluate the ideas presented purely on their merits. In particular, the argument for 99% extinction is given a lot of space relative to the post as a whole, when it should really be a bullet point that links to where this case is made elsewhere (or, if it is not made adequately elsewhere, to a new post entirely). Meanwhile, the value of disruptive protest is left to the reader to determine.
As I understand the issue, the case for barricading AI rests on:
1. Safety doesn't happen by default
a) AI labs are not on track to achieve "alignment" as commonly considered by safety researchers.
b) Those standards may be over-optimistic--link to Substrate Needs Convergence, arguments by Yampolskiy, etc.
c) Even if the conception of safety assumed by the AI labs is right, it is not clear that their utopic vision for the future is actually good.
2. Advocacy, not just technical work, is needed for AI safety
a) See above
b) Market incentives are misaligned
c) Policy (and culture) matters
3. Disruptive actions, not just working within civil channels, are needed for effective advocacy.
a) Ways that working entirely within ordinary democratic channels can get delayed or derailed
b) Benefits of disruptive actions, separate from or in synergy with other forms of advocacy
c) Plan for how StopAI's specific choice of disruptive actions effectively plays to the above benefits
d) Moral arguments, if not already implied
Attempting to distill the intuitions behind my comment into more nuanced questions:
1) How confident are we that value learning has a basin of attraction to full alignment? Techniques like IRL seem intuitively appealing, but I am concerned that this just adds another layer of abstraction without addressing the core problem of feedback-based learning having unpredictable results. That is, instead of having to specify metrics for good behavior (as in RL), one has to specify the metrics for evaluating the process of learning values (including correctly interpreting the meaning of behavior)--with the same problem that flaws in the hard-to-define metrics lead to increasing divergence from Truth under optimization (see the toy sketch after question 2 below).
2) The connection of value learning to LLMs, if intended, is not obvious to me. Is your proposal essentially to guide simulacra to become value learners (and designing the training data to make this process more reliable)?
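To make the worry in question 1 concrete, here is a minimal toy sketch (my own illustration with made-up numbers, not anything from the post): hill-climbing on a hand-specified proxy metric tracks the true objective at first, then diverges from it as optimization pressure increases.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_value(x):
    # What we actually care about: both features matter, and overdoing
    # feature 0 becomes actively harmful past a point.
    return x[0] + x[1] - 0.1 * x[0] ** 2

def proxy_metric(x):
    # The hard-to-define evaluation metric only captures feature 0.
    return x[0]

x = np.zeros(2)
for step in range(50):
    # Hill-climb on the proxy: try small random perturbations, keep the best.
    candidates = x + 0.5 * rng.normal(size=(20, 2))
    x = max(candidates, key=proxy_metric)
    if step % 10 == 0:
        print(f"step {step:2d}  proxy={proxy_metric(x):7.2f}  true={true_value(x):7.2f}")

# While feature 0 is small, raising the proxy also raises the true value;
# under continued optimization the true value collapses even though the
# proxy keeps climbing. This is the divergence-under-optimization pattern
# the concern above is about.
```

Adding a layer (learning the metric itself from feedback) just moves the same question up a level: is the metric-learning process specified well enough not to diverge in the same way?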
Based on 4-5, this post's answer to the central, anticipated objection of "why does the AI care about human values?" seems to be along the lines of "because the purpose of an AI is to serve its creators and surely an AGI would figure that out." This seems to me to be equivocating on the concept of purpose, which can mean either (A) a reason for an entity's existence, from an external perspective, or (B) an internalized objective of the entity. So a special case of the question about why an AI would care about human values is to ask: why should (B) be drawn towards (A) once the AI becomes aware of a discrepancy between the two? That is, what stops an AI from reasoning: "Those humans programmed me with a faulty goal, such that acting according to it goes against their purpose in creating me...too bad for them!"
If you can instill a value like "Do what I say...but if that goes against what I mean, and you have really good reason to be sure, then forget what I say and do what I mean," then great, you've got a self-correcting system (if nothing weird goes wrong), for the reasons explained in the rest of the post, and have effectively "solved alignment". But how do you pull this off when your essential tool is what you say about what you mean, expressed as a feedback signal? This is the essential question of alignment, but for all the text in this post and its predecessor, it doesn't seem to be addressed at all.
In contrast, I came to this post by way of one of your posts on Simulator Theory, which presents an interesting answer to the "why should AI care about people" question, which I summarize as: the training process can't break out (for...reasons); the model itself doesn't care about anything (how do we know this?); what's really driving behavior is the simulacra, whose motivations are generated to match the characters they are simulating rather than to find the best fit to a feedback signal; so Goodhart's Law no longer applies and has been replaced by the problem of reliably finding the right characters, which seems more tractable (if the powers-that-be actually try).
To be clear, the sole reason I assumed (initial) alignment in this post is because if there is an unaligned ASI then we probably all die for reasons that don't require SNC (though SNC might have a role in the specifics of how the really bad outcome plays out). So "aligned" here basically means: powerful enough to be called an ASI and won't kill everyone if SNC is false (and not controlled/misused by bad actors, etc.)
> And the artificiality itself is the problem.
This sounds like a pretty central point that I did not explore very much except for some intuitive statements at the end (the bulk of the post summarizes the "fundamental limits of control" argument). I'd be interested in hearing more about this. I think I get (and hopefully roughly conveyed) the idea that AI has different needs from its environment than humans do, so if it optimizes the environment in service of those needs we die...but I get the sense that there is something deeper intended here.
A question along this line, please ignore if it is a distraction from rather than illustrative of the above: would anything like SNC apply if tech labs were somehow using bioengineering to create creatures to perform the kinds of tasks that would be done by advanced AI?
This sounds like a rejection of premise 5, not premises 1 & 2. Premises 1 & 2 assert that control issues are present at all (and 3 & 4 assert their relevance), whereas premise 5 asserts that the magnitude of these issues is great enough to kick off a process of accumulating problems. You are correct that the rest of the argument, including the conclusion, does not hold if this premise is false.
Your objection seems to point to the analogy of humans maintaining effective control of complex systems, with errors remaining limited rather than compounding, along with the further assertion that a greater intelligence will be even better at such management.
Besides intelligence, there are two other core points of difference between humans managing existing complex systems and an ASI doing so:
1) The scope of the systems being managed. Implicit in what I have read of SNC is that ASI is shaping the course of world events.
2) ASI's lack of inherent reliance on the biological world.
These points raise the following questions:
1) Do systems of control get better or worse as they increase in scope of impact, and where does this trajectory point for ASI?
2) To what extent are humans' ability to control our created systems reliant on us being a part of and dependent upon the natural world?
This second question probably sounds a little weird, so let me unpack the associated intuitions, albeit at the risk of straying from the actual assertions of SNC. Technology that is adaptive becomes obligate, meaning that once it exists everyone has to use it to not get left behind by those who do. Using a given technology shapes the environment and also promotes certain behavior patterns, which in turn shape values and worldview. Together, these tendencies can sometimes produce feedback loops leading to outcomes that nobody, including the creators of the technology, wants. In really bad cases, this can lead to self-terminating catastrophes (historically in local areas, now potentially on a global scale). Noticing and anticipating this pattern, however, leads to countervailing forces that push us to think more holistically than we otherwise would (either directly through extra planning or indirectly through customs of forgotten purpose). For an AI, falling into such a trap means the death of humanity, not of itself, so this countervailing force is not present.
Verifying my understanding of your position: you are fine with the puppet-master and psychohistorian categories and agree with their implications, but you put the categories on a spectrum (systems are not either chaotic or robustly modellable, chaos is bounded and thus exists in degrees) and contend that ASI will be much closer to the puppet-master category. This is a valid crux.
To dig a little deeper, how does your objection hold up in light of my previous post, Lenses of Control? The basic argument there is that future ASI control systems will have to deal with questions like: "If I deploy novel technology X, what is the resulting equilibrium of the world, including how feedback might impact my learning and values?" Does the level of chaos in such contexts remain narrowly bounded?
EDIT for clarification: the distinction between the puppet-master and psychohistorian metaphors is not the level of chaos in the system they are dealing with, but rather the extent of direct control that the ASI's control system has over the world, where the control system is a part of the AI machinery as a whole (including subsystems that learn) and the AI is a part of the world. Chaos factors in as an argument for why human-compatible goals are doomed if the AI follows the psychohistorian metaphor.