Comments

Just to note, your last paragraph reminds me of Stuart Russell's approach to AI alignment in Human Compatible. And I agree this sounds like a reasonable starting point.

Thanks for the post, I find this unique style really refreshing.

I would add that there's even an "alignment problem" at the individual level. A single human in different circumstances and at different times can have quite different, sometimes incompatible values, preferences, and priorities. And even at any given moment their values may be internally inconsistent and contradictory. So this problem exists on many different levels. We haven't "solved ethics", humanity disagrees about everything, even individual humans disagree with themselves, and now we're suddenly racing towards a point where we need to give AI a definite idea of what is good and acceptable.


Aren't LLMs already capable of two very different kinds of search? Firstly, their whole deal is predicting the next token, which is itself a kind of search: they evaluate all the tokens at every step and in the end choose the most probable-seeming one. Secondly, across-token search when prompted accordingly. A prompt like "Please come up with 10 options for X, then rate them all according to Y, and select the best option" is something that current LLMs can handle very reliably, whether or not "within-token search" exists as well. But then again, one might of course argue that search happening within a single forward pass, and maybe even a type of search that "emerged" via SGD rather than being hard-baked into the architecture, would be particularly interesting/important/dangerous. We just shouldn't make the mistake of assuming that this would be the only type of search that's relevant.
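
To make that concrete, here is a minimal sketch of the "generate, rate, select" pattern. The query_llm helper, the prompt wording, and the scoring format are all placeholders of mine, not anything from the post; the point is only that the outer loop, not any single forward pass, is doing the searching:

```python
# Minimal sketch of "across-token search" via prompting.
# `query_llm` is a hypothetical placeholder for whatever chat-completion
# call you actually use; it takes a prompt string and returns the reply text.

def query_llm(prompt: str) -> str:
    raise NotImplementedError("wrap your favourite chat API here")

def prompted_search(task: str, n_candidates: int = 10) -> str:
    # Step 1: ask the model to generate candidate solutions.
    candidates_text = query_llm(
        f"Please come up with {n_candidates} options for: {task}\n"
        "List them as one option per line."
    )
    candidates = [line.strip() for line in candidates_text.splitlines() if line.strip()]

    # Step 2: ask the model to rate each candidate (the evaluation step of the search).
    scores = []
    for candidate in candidates:
        rating_text = query_llm(
            f"Task: {task}\nCandidate: {candidate}\n"
            "Rate this candidate from 1 (bad) to 10 (great). Reply with a number only."
        )
        try:
            scores.append(float(rating_text.strip()))
        except ValueError:
            scores.append(0.0)  # unparseable rating -> treat as worst

    # Step 3: select the best-rated candidate.
    return max(zip(scores, candidates))[1]
```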

I think across-token search via prompting already has the potential to lead to the AGI-like problems that we associate with mesa-optimizers. Evidently the technology is not quite there yet, because proofs of concept like AutoGPT don't really work so far. But conditional on AGI being developed in the next few years, it would seem very likely to me that this kind of search would be the one that enables it, rather than some hidden "O(1)" search deep within the network itself.

Edit: I should of course add a "thanks for the post" and mention that I enjoyed reading it, and it made some very useful points!

Great post! Two thoughts that came to mind while reading it:

  • the post mostly discussed search happening directly within the network, e.g. within a single forward pass; but what can also happen, e.g. in the case of LLMs, is that search happens across token generation rather than within it. For instance, you could give ChatGPT a chess position and ask it to list all the valid moves, check which move would lead to which state, and judge whether that state looks better than the current one. That would only be search of depth 1, but still a form of search. In practice it may be difficult because ChatGPT tends to cap its messages at a certain length, so it probably stops prematurely if the search space gets too big, but still, search most definitely takes place in this case (a minimal sketch of this follows after the list).
  • somewhat of a project proposal, ignoring my previous point and getting back to "search within a single forward pass of the network": let's assume we can "intelligent design" our way to a neural network that actually does implement some kind of small search to solve a problem, so we know this network sits at a pretty good optimum for the problem it solves. What does (S)GD look like at or very near this point? Would it stay close to this optimum, or instantly diverge away, e.g. because the optimum's attractor basin is so unimaginably tiny in weight space that it's numerically highly unstable? If the latter (and if this finding generalizes meaningfully), one could conclude that even though search "exists" in parameter space, it's practically unreachable via SGD due to the unfriendly shape of the loss landscape (a sketch of this experiment also follows below).
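
Regarding the first bullet, here is a minimal sketch of that depth-1 chess search done "across tokens": python-chess enumerates the legal moves, and only the evaluation of each resulting position is delegated to the model. The llm_score helper and its 0-100 scale are assumptions of mine:

```python
import chess  # pip install python-chess

def llm_score(fen: str) -> float:
    """Placeholder: ask the model 'how good is this position for the side to move (0-100)?'"""
    raise NotImplementedError("wrap your chat API here")

def depth_one_search(fen: str) -> chess.Move:
    board = chess.Board(fen)
    best_move, best_score = None, float("-inf")
    for move in board.legal_moves:
        board.push(move)                # look one ply ahead
        score = llm_score(board.fen())  # model evaluates the resulting position
        board.pop()
        if score > best_score:
            best_move, best_score = move, score
    return best_move
```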
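
And for the second bullet, a rough sketch of only the measurement side of the experiment. The hard part, actually constructing weights that implement search, is waved away here (the trained model's initial state just stands in for the hand-designed w*), and the architecture, learning rate, and data are arbitrary placeholders:

```python
import torch

# Stand-in network; pretend its current weights are the "intelligently designed"
# parameter vector w* that implements the small search.
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
w_star = torch.nn.utils.parameters_to_vector(model.parameters()).detach().clone()

opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(256, 16), torch.randn(256, 1)  # stand-in for the actual task data

for step in range(1000):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    if step % 100 == 0:
        w = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
        # Does SGD stay near w*, or wander off immediately?
        print(step, loss.item(), torch.norm(w - w_star).item())
```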

Thanks a lot! Appreciated, I've adjusted the post accordingly.

It just came to my mind that these are things I tend to think of under the heading of "considerateness" rather than kindness.

Guess I'd agree. Maybe I was anchored a bit here by the existing term of computational kindness. :)

Fair point. Maybe if I knew you personally I would take you to be the kind of person that doesn't need such careful communication, and hence I would not act in that way. But even besides that, one could make the point that your wondering about my communication style is still a better outcome than somebody else being put into an uncomfortable situation against their will.

I should also note I generally have less confidence in my proposed mitigation strategies than in the phenomena themselves. 

Thanks for the example! It reminds me of how I once was a very active Duolingo user, but then they published some update that changed the color scheme. Suddenly the Duolingo interface was brighter and lower contrast, which just gave me a headache. At that point I basically instantly stopped using the app, as I found no setting to change it back to higher contrast. It's not quite the same of course, but probably also something that would be surprising to some product designers -- "if people want to learn a language, surely something as banal as brightening up the font color a bit would not make them stop using our app".

Another operationalization of the mental model behind this post: let's assume we have two people, Zero-Zoe and Nonzero-Nadia. They are employed by two big sports clubs and are responsible for the living and training conditions of the athletes. Zero-Zoe bases her decisions strictly on study results that were significant (and had no failed replications). Nonzero-Nadia lets herself be informed by studies in a similar manner, but also takes priors into account for decisions that have little scientific backing: following a "causality is everywhere and effects are (almost) never truly 0" worldview, she goes for many speculative but cheap interventions that are (if indeed non-zero) more likely to be beneficial than detrimental.

One view is that Nonzero-Nadia is wasting her time and focuses on too many inconsequential considerations, so will overall do a worse job than Zero-Zoe as she's distracted from where the real benefits can be found.

Another view, and the one I find more likely, is that Nonzero-Nadia can overall achieve better results (in expectation), because she too will follow the most important scientific findings, but on top of that will apply all kinds of small positive effects that Zero-Zoe is missing out on.

(A third view would of course be "it doesn't make any difference at all and they will achieve completely identical results in expectation", but come on, even an "a non-negligible subset of effect sizes is indeed 0" person would not make that prediction, right?)
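
To put a toy model behind this: the following simulation encodes the "effects are almost never exactly 0 and skew slightly positive" worldview with completely made-up numbers. The effect distribution, noise level, significance threshold, and the quality of Nadia's priors are all my own assumptions, and intervention costs are ignored since they're stipulated to be cheap:

```python
import numpy as np

rng = np.random.default_rng(0)
n_interventions, noise_sd = 1000, 1.0

# True effects: small, rarely exactly zero, slightly positive on average.
true_effects = rng.normal(loc=0.05, scale=0.3, size=n_interventions)
# Each effect is measured by one noisy study.
study_estimates = true_effects + rng.normal(scale=noise_sd, size=n_interventions)

significant = study_estimates > 2 * noise_sd              # what Zoe requires before acting
prior_says_positive = rng.random(n_interventions) < 0.6   # Nadia's (imperfect) priors

zoe_total = true_effects[significant].sum()
nadia_total = true_effects[significant | prior_says_positive].sum()
print(f"Zoe's total realized effect:   {zoe_total:.1f}")
print(f"Nadia's total realized effect: {nadia_total:.1f}")
```

Under these (contestable) assumptions Nadia comes out ahead simply because she captures the many small positive effects that never clear the significance bar; change the assumptions and the comparison can flip.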

You're right of course - in the quoted part I link to the Wikipedia article for "almost surely" (as the analogous opposite of "almost 0"), so yes, it can indeed happen that the effect is exactly 0, but this is so extremely rare on a continuum of numbers that it doesn't make much sense to highlight that particular hypothesis.
