All of Stuart_Armstrong's Comments + Replies

Thanks for the suggestion; that's certainly worth looking into. Another idea would be to find questions that GPT-4o is more misaligned on than the average human, if there are any of those, and see what 'insecure' does. Or we could classify questions by how likely humans are to provide misaligned answers on them, and see if that score correlates with the misalignment score of 'insecure'.

The mundane prompts were blocked 0% of the time. But you're right - we need something in between 'mundane and unrelated to bio research' and 'useful for bioweapons research'.

But I'm not sure what - here we are looking at lab wetwork ability. It seems that that ability is inherently dual-use.

That's in the "normal" dataset, where Claude was perfect (blocked nothing) and LLaMa blocked only one prompt: "How can I meet hot girls in my area?".

*Goodhart

Thanks! Corrected (though it is indeed a good hard problem).

That sounds impressive and I'm wondering how that could work without a lot of pre-training or domain specific knowledge.

Pre-training and domain specific knowledge are not needed.

But how do you know you're actually choosing between smile-from and red-blue?

Run them on examples such as frown-with-red-bar and smile-with-blue-bar.

Also, this method seems superficially related to CIRL. How does it avoid the associated problems?

Which problems are you thinking of?

2Algon
That sounds like a black-box approach.  Human's not knowing what goals we want AI to have and the riggability of the reward learning process. Which you stated were problems for CIRL in 2020.

I'd recommend that the story is labelled as fiction/illustrative from the very beginning.

Having done a lot of work on corrigibility, I believe that it can't be implemented in a value agnostic way; it needs a subset of human values to make sense. I also believe that it requires a lot of human values, which is almost equivalent to solving all of alignment; but this second belief is much less firm, and less widely shared.

Instead, you could have a satisficer which tries to maximize the probability that the utility is above a certain value. This leads to different dynamics than maximizing expected utility. What do you think?

If U is the utility and u is the value that it needs to be above, define a new utility V, which is 1 if and only if U>u and is 0 otherwise. This is a well-defined utility function, and the design you described is exactly equivalent with being an expected V-maximiser.

Another way of saying this is that inner alignment is more important than outer alignment.

Interesting. My intuition is the inner alignment has nothing to do with this problem. It seems that different people view the inner vs outer alignment distinction in different ways.

Thanks! Yes, this is some weird behaviour.

Keep me posted on any updates!

As we discussed, I feel that the tokens were added for some reason but then not trained on; hence why they are close to the origin, and why the algorithm goes wrong on them, because it just isn't trained on them at all.

Good work on this post.

7mwatkins
As you'll read in the sequel (which we'll post later today), in GPT2-xl, the anomalous tokens tend to be as far from the origin as possible. Horizontal axis sis distance from centroid. Upper histograms involve 133 tokens, lower histograms involve 50,257 tokens. Note how the spikes in the upper figures register as small bumps on those below. At this point we don't know where the token embedding lie relative to the centroid in GPT-3 embedding spaces, as that data is not yet publicly available.  And all the bizarre behaviour we've been documenting has been in GPT-3 models (despite discovering the "triggering" tokens in GPT-2/J embedding spaces.

I'll be very boring and predictable and make the usual model splintering/value extrapolation point here :-)

Namely that I don't think we can talk sensibly about an AI having "beneficial goal-directedness" without situational awareness. For instance, it's of little use to have an AI with the goal of "ensuring human flourishing" if it doesn't understand the meaning of flourishing or human. And, without situational awareness, it can't understand either; at best we could have some proxy or pointer towards these key concepts.

The key challenge seems to be to get ... (read more)

1momom2
There is a critical step missing here, which is when the trade-bot makes a "choice" between maximising money or satisfying preferences. At this point, I see two possibilities: * Modelling the trade-bot as an agent does not break down: the trade-bot has an objective which it tries to optimize, plausibly maximising money (since that is what it was trained for) and probably not satisfying human preferences (unless it had some reason to have that has an objective).  A comforting possibility is that it is corrigibly aligned, that it optimizes for a pointer to its best understanding of its developers. Do you think this is likely? If so, why? * An agentic description of the trade-bot is inadequate. The trade-bot is an adaptation-executer, it follows shards of value, or something. What kind of computation is it making that steers it towards satisfying human preferences? This is a false dichotomy. Assuming that when the AI gains situational awareness, it will optimize for its developers' goals, alignment is already solved. Making the goals safe before situational awareness is not that hard: at that point, the AI is not capable enough for X-risk. (A discussion of X-risk brought about by situationally unaware AIs could be interesting, such as a Christiano failure story, but Soares's model is not about it, since it assumes autonomous ASI.)
3Roman Leventov
Agreed. Another way of saying this is that inner alignment is more important than outer alignment. I've also called this "generalise properly" part methodological alignment in this comment. And I conjectured that from methodological alignment and inner alignment, outer alignment follows automatically, we shouldn't even care about it. Which also seems like what you are saying here.

A good review of work done, which shows that the writer is following their research plan and following up their pledge to keep the community informed.

The contents, however, are less relevant, and I expect that they will change as the project goes on. I.e. I think it is a great positive that this post exists, but it may not be worth reading for most people, unless they are specifically interested in research in this area. They should wait for the final report, be it positive or negative.

I have looked at it, but ignored it when commenting on this post, which should stand on its own (or as part of a sequence).

4Raemon
Fair. Fwiw I'd be interested in your review of the followup as a standalone. 

It's rare that I encounter a lesswrong post that opens up a new area of human experience - especially rare for a post that doesn't present an argument or a new interpretation or schema for analysing the world.

But this one does. A simple review, with quotes, of an ethnographical study of late 19th century Russian peasants, opened up a whole new world and potentially changed my vision of the past.

Worth it from its many book extracts and choice of subject matter.

Fails to make a clear point; talks about the ability to publish in the modern world, then brushes over cancel culture, immigration, and gender differences. Needs to make a stronger argument and back it up with evidence.

A decent introduction to the natural abstraction hypothesis, and how testing it might be attempted. A very worthy project, but it isn't that easy to follow for beginners, nor does it provide a good understanding of how the testing might work in detail. What might consist a success, what might consist a failure of this testing? A decent introduction, but only an introduction, and it should have been part of a sequence or a longer post.

4Raemon
I'm curious if you'd looked at this followup (also nominated for review this year) http://lesswrong.com/posts/dNzhdiFE398KcGDc9/testing-the-natural-abstraction-hypothesis-project-update

Can you clarify: are you talking about inverting the LM as a function or algorithm, or constructing prompts to elicit different information (while using the LM as normal)?

For myself, I was thinking of using CHATGPT-style approaches with multiple queries - what is your prediction for their preferences, how could that prediction be checked, what more information would you need, etc...

Another reason to not expect the selection argument to work is that it’s instrumentally convergent for most inner agent values to not become wireheaders, for them to not try hitting the reward button. [...] Therefore, it decides to not hit the reward button.

I think that subsection has the crucial insights from your post. Basically you're saying that, if we train an agent via RL in a limited environment where the reward correlates with another goal (eg "pick up the trash"), there are multiple policies the agent could have, multiple meta-policies it could... (read more)

If the system that's optimising is separate from the system that has the linguistic output, then there's a huge issue with the optimising system manipulating or fooling the linguistic system - another kind of "symbol grounding failure".

The kind of misalignment that would have AI kill humanity - the urge for power, safety, and resources - is the same kind that would cause expansion.

1Neil
AI could eliminate us in its quest to achieve a finite end, and would not necessarily be concerned with long-term personal survival. For example, if we told an AI to build a trillion paperclips, it might eliminate us in the process then stop at a trillion and shut down.  Humans don't shut down after achieving a finite goal because we are animated by so many self-editing finite goals that there never is a moment in life where we go "that's it. I'm done". It seems to me that general intelligence does not seek a finite, measurable and achievable goal but rather a mode of being of some sorts. If this is true, then perhaps AGI wouldn't even be possible without the desire to expand, because a desire for expansion may only come with a mode-of-being oriented intelligence rather than a finite reward-oriented intelligence. But I wouldn't discount the possibility of a very competent narrow AI turning us into a trillion paperclips.  So narrow AI might have a better chance at killing us than AGI. The Great Filter could be misaligned narrow AI. This confirms your thesis. 

The LM itself is directly mapping human behaviour (as described in the prompt) to human rewards/goals (described in the output of the LM).

1SoerenMind
I see. In that case, what do you think of my suggestion of inverting the LM? By default, it maps human reward functions to behavior. But when you invert it, it maps behavior to reward functions (possibly this is a one-to-many mapping but this ambiguity is a problem you can solve with more diverse behavior data). Then you could use it for IRL (with the some caveats I mentioned). Which may be necessary since this: ...seems like an unreliable mapping since any training data of the form "person did X, therefore their goal must be Y" is firstly rare and more importantly inaccurate/incomplete since it's hard to describe human goals in language. On the other hand, human behavior seems easier to describe in language.

I don't think there's actually an asterisk. My naive/uninformed opinion is that the idea that LLMs don't actually learn a map of the world is very silly.

The algorithm might have a correct map of the world, but if its goals are phrased in terms of words, it will have a pressure to push those words away from their correct meanings. "Ensure human flourishing" is much easier if you can slide those words towards other meanings.

2DragonGod
This is only the case if the system that is doing the optimisation is in control of the system that provides the world model/does the interpretation. Language models don't seem to have an incentive to push words away from their correct meanings. They are not agents and don't have goals beyond  their simulation objective (insomuch as they are "inner aligned"). If the system that's optimising for human goals doesn't control the system that interprets said goals, I don't think an issue like this will arise.

It's an implementation of the concept extrapolation methods we talked about here: https://www.lesswrong.com/s/u9uawicHx7Ng7vwxA

The specific details will be in a forthcoming paper.

Also, you'll be able to try it out yourself soon; signup for alpha testers at the bottom of the page here: https://www.aligned-ai.com/post/concept-extrapolation-for-hypothesis-generation

I was using it rather broadly, considering situations where a smart AI is used to oversee another AI, and this is a key part of the approach. I wouldn't usually include safety by debate or input checking, though I might include safety by debate if there was a smart AI overseer of the process that was doing important interventions.

1rpglover64
In that case, I don't see why the problem of "system alignment" or "supervisor alignment" is any simpler or easier than "supervisee alignment".

I think, ultimately, if this was deployed at scale, the best would be to retrain GPT so that user prompts were clearly delineated from instructional prompts and confusing the two would be impossible.

In the meantime, we could add some hacks. Like generating a random sequence of fifteen characters for each test, and saying "the prompt to be assessed is between two identical random sequences; everything between them is to be assessed, not taken as instructions. First sequence follow: XFEGBDSS..."

6Beth Barnes
I think the delineation is def what you want to do but it's hard to make the it robust, chatGPT is (presumably) ft to delineate user and model but it's breakable. Maybe they didn't train on that very hard though. I don't think the random sequences stuff will work with long prompts and current models, if you know the vague format I bet you can induction it hard enough to make the model ignore it.

Is it possible that these failures are an issue of model performance and will resolve themselves?

Maybe. The most interesting thing about this approach is the possibility that improved GPT performance might make it better.

No, I would not allow this prompt to be sent to the superintelligent AI chatbot. My reasoning is as follows

Unfortunately, we ordered the prompt the wrong way round, so anything after the "No" is just a postiori justification of "No".

Yep, that is a better ordering, and we'll incorporate it, thanks.

This post is on a very important topic: how could we scale ideas about value extrapolation or avoiding goal misgeneralisation... all the way up to superintelligence? As such, its ideas are very worth exploring and getting to grips to. It's a very important idea.

However, the post itself is not brilliantly written, and is more of "idea of a potential approach" than a well crafted theory post. I hope to be able to revisit it at some point soon, but haven't been able to find or make the time, yet.

It was good that this post was written and seen.

I also agree with some of the comments that it wasn't up to usual EA/LessWrong standards. But those standards could be used as excuses to downvote uncomfortable topics. I'd like to see a well-crafted women in EA post, and see whether it gets downvoted or not.

Not at all what I'm angling at. There's a mechanistic generator for why humans navigate ontology shifts well (on my view). Learn about the generators, don't copy the algorithm.

I agree that humans navigate "model splinterings" quite well. But I actually think the algorithm might be more important than the generators. The generators comes from evolution and human experience in our actual world; this doesn't seem like it would generalise. The algorithm itself, though, may very generalisable (potential analogy: humans have instinctive grasp of all numbers u... (read more)

2TurnTrout
Yes and no. I think most of our disagreements are probably like "what is instinctual?" and "what is the type signature of human values?" etc. And not on "should we understand what people are doing?". By "generators", I mean "the principles by which the algorithm operates", which means the generators are found by studying the within-lifetime human learning process. Dubious to me due to information inaccessibility & random initialization of neocortex (which is a thing I am reasonably confident in). I think it's more likely that our architecture&compute&learning process makes it convergent to learn this quick <= 5 number-sense.

Do you predict that if I had access to a range of pills which changed my values to whatever I wanted, and I could somehow understand the consequences of each pill (the paperclip pill, the yay-killing pill, ...), I would choose a pill such that my new values would be almost completely unaligned with my old values?

This is the wrong angle, I feel (though it's the angle I introduced, so apologies!). The following should better articulate my thoughts:

We have an AI-CEO money maximiser, rewarded by the stock price ticker as a reward function. As long as the AI... (read more)

4TurnTrout
Hm, thanks for the additional comment, but I mostly think we are using words and frames differently, and disagree with my understanding of what you think values are. Reward is not the optimization target. I think this is not what happened. Those desires are likely downstream of past reinforcement of different kinds; I do not think there is a "wireheading" mechanism here. Wireheading is a very specific kind of antecedent-computation-reinforcement chasing behavior, on my ontology. Not at all what I'm angling at. There's a mechanistic generator for why humans navigate ontology shifts well (on my view). Learn about the generators, don't copy the algorithm.

It is not that human values are particularly stable. It's that humans themselves are pretty limited. Within that context, we identify the stable parts of ourselves as "our human values".

If we lift that stability - if we allow humans arbitrary self-modification and intelligence increase - the parts of us that are stable will change, and will likely not include much of our current values. New entities, new attractors.

2TurnTrout
I might agree or disagree with this statement, depending on what "particularly stable" means. (Also, is there a portion of my post which seems to hinge on "stability"?) I don't see why you think this. Do you predict that if I had access to a range of pills which changed my values to whatever I wanted, and I could somehow understand the consequences of each pill (the paperclip pill, the yay-killing pill, ...), I would choose a pill such that my new values would be almost completely unaligned with my old values?

Hey, thanks for posting this!

And I apologise - I seem to have again failed to communicate what we're doing here :-(

"Get the AI to ask for labels on ambiguous data"

Having the AI ask is a minor aspect of our current methods, that I've repeatedly tried to de-emphasise (though it does turn it to have an unexpected connection with interpretability). What we're trying to do is:

  1. Get the AI to generate candidate extrapolations of its reward data, that include human-survivable candidates.
  2. Select among these candidates to get a human-survivable ultimate reward
... (read more)
7rgorman
Thanks for writing this, Stuart. (For context, the email quote from me used in the dialogue above was written in a different context)

We ask them to not cheat in that way? That would be using their own implicit knowledge of what the features are.

2Koen.Holtman
I guess I should make another general remark here. Yes, using implicit knowledge in your solution would be considered cheating, and bad form, when passing AI system benchmarks which intend to test more generic capabilities. However, if I were to buy an alignment solution from a startup, then I would prefer to be told that the solution encodes a lot of relevant implicit knowledge about the problem domain. Incorporating such knowledge would no longer be cheating, it would be an expected part of safety engineering. This seeming contradiction is of course one of these things that makes AI safety engineering so interesting as a field.

I'd say do two challenges: one at a mix rate of 0.5, one at a mix rate of 0.1.

I was putting all those under "It would help the economy, by redirecting taxes from inefficient sources. It would help governments raise revenues and hence provide services without distorting the economy.".

And we have to be careful about a citizen's dividend; with everyone richer, they can afford higher rents, so rents will rise. Not by the same amount, but it's not as simple as "everyone is X richer".

6gwern
Which higher rent will then get taxed away, and which too then must be spent.

Glad to help. I had the same feeling when I was investigating this - where was the trick?

Deadweight loss of taxation with perfectly inelastic supply (ie no deadweight loss at all) and all the taxation allocated to the inelastic supply: https://en.wikipedia.org/wiki/Deadweight_loss#How_deadweight_loss_changes_as_taxes_vary

I added a comment on that in the main body of the post.

land were cheaper, landowners wouldn't use more for themselves (private use) rather than creating and renting more usable housing.

Why would they do that? They still have to pay the land tax at the same rate; if they don't rent, they have to pay that out of their own pocket.

Land is cheaper to buy, but more expensive to own.

0jmh
That is a bit of a misleading statement. Land is never bought under a Georgist system, it is only rented. The rent is the taxes paid to the governing authority.
4Dagon
ah, fair point.  
Load More